Programming in C++ for ush solutions
Review
Dynamic Set Operators
Insert, Remove, Search, Max, Min, Previous, Next
Binary Search Trees
Queries (Search, Max, Min, Previous, Next)
Modifiers (Insert, Remove)
Inserting nodes with/without duplicates
Traversing a tree (visiting all nodes in a tree)
Pre-order, post-order, in-order
Using trees as expression parsers
Heaps (trees implemented in arrays)
Heapsort; Using a heap to implement a priority queue
1
A Completely New Tree Application
Sometimes it’s not just the data in the node that’s important, but the PATH to the data.
Why does Manhattan have area code 212?
213, 312, and 313 belong to (parts of) Los Angeles, Chicago, and Detroit (respectively)
214 and 412 are Dallas and Pittsburgh (resp.)
All of Alaska has 907
Why?
When area codes were devised, we had rotary dial phones, and ALL area codes had either 0 or 1 as the middle digit. See here
2
Rotary Telephone Dial
No pushbuttons!
It takes longer to dial 907 (Alaska) than it does to dial 212 (Manhattan) See here, starting about 4:10, or here
212 is the fastest 3-digit number (with a 1 or 0 as the middle digit) you can dial
Areas with more people got the “shorter” area codes
The idea was to save more time calling people in big cities than would be lost calling people in rural areas
3
A Related System – Morse code
Used by telegraphers in mid-1800’s before long-distance telephone lines existed
Letters were transmitted using sequences of dots (short pulses) and dashes (long pulses)
Why was “E” assigned the one-dot code?
4
Frequency Of Occurrence (1)
We encode data in 8-bit bytes (using one byte per character), because computers work well with fixed-size characters
What if we could encode data with variable-length “bytes”?
If we assigned shorter bit strings to the characters that occur most often (and longer bit strings to the ones that occur least often), could we save space overall?
Does the "savings" (space saved on the more frequently-occurring characters) outweigh the "cost" (space lost on the less-frequently-occurring ones)?
5
Frequency Of Occurrence (2)
The English alphabet has 26 letters, but 12 of them comprise about 80% of the total usage
Obviously, we have to consider huge bodies of text to make the averages accurate
Specific documents, or text related to a specific area would have different frequency of occurrence patterns (as would different languages, of course)
Documents from the Oxford University website probably have many more “X” and “F” characters than we would normally expect in English prose.
Documents from LCCC would have disproportionately many instances of “C”
6
Huffman Coding
Huffman Coding (David Huffman, MIT Ph.D. student, 1952)
Uses variable-length bit codes to represent characters
Requires that we know (up-front) the relative frequency of occurrence for each character.
Once we build the “Huffman tree”, we can encode and decode the characters / bit strings
A way to implement lossless compression (but doesn’t compress as much as ZIP, RAR, or 7-zip)
Used in JPEG images
7
So How Does It work?
Start with the table of relative frequencies
Let’s simplify things a bit -- rather than encoding arbitrary text that might use the entire alphabet, let’s consider something shorter
“THIS IS AN EXAMPLE OF A HUFFMAN TREE”
This 36-character (total) message consists of 16 different characters:
{T, H, I, S, space, A, N, E, X, M, P, L, O, F, U, R}
Next, we need to count how many times each appears (compute the frequency of occurrence) in our body of text.
8
Compute Frequency Of Occurrence
“THIS IS AN EXAMPLE OF A HUFFMAN TREE”
{T, H, I, S, space, A, N, E, X, M, P, L, O, F, U, R}
| Character | Occurs | Character | Occurs | |
| T | 2 | X | 1 | |
| H | 2 | M | 2 | |
| I | 2 | P | 1 | |
| S | 2 | L | 1 | |
| space | 7 | O | 1 | |
| A | 4 | F | 3 | |
| N | 2 | U | 1 | |
| E | 4 | R | 1 |
9
Sort The Table (Descending)
The remaining question:
Do we save enough bits by representing the space, 'A', and 'E' with something shorter to pay for having to represent 'L', 'O', 'P', 'R', 'U', and 'X' with something longer?
| Character | Occurs | Character | Occurs | |
| space | 7 | S | 2 | |
| A | 4 | T | 2 | |
| E | 4 | L | 1 | |
| F | 3 | O | 1 | |
| H | 2 | P | 1 | |
| I | 2 | R | 1 | |
| M | 2 | U | 1 | |
| N | 2 | X | 1 |
10
Huffman Trees
Huffman proved (mathematically) that, on a single-character basis, this method was optimal (ZIP, RAR, 7-Zip all consider more than one character at a time, and so can compress better)
The Huffman process is based on building a tree from the frequency of occurrence information
This IS a binary tree, but it is NOT a Binary Search Tree – the BST property is not required to hold at any particular point in the tree, and we don’t make any “less than goes left; greater than goes right” decisions at any node.
Bear with me, this will all make sense shortly…
11
Use This Data To Build A Binary Tree
This is a special type of binary tree
The leaves contain letters (symbols) and “weights”
The non-leaf nodes contain only “weights” (no letter)
How do we tell a leaf from an internal node?
struct treenode
{
char symb;
int weight;
treenode *LCH;
treenode *RCH;
}
12
Build The Binary Tree (2)
Make a leaf node for every character
The frequency of occurrence becomes the weight for a leaf
We’ll use '*' (for now) to represent 'space‘
Repeat (make a node) for every character
This gives us an array of nodes. Note: at this point, the nodes are ALL disconnected (no child pointers have been set)
* 7
Weight (frequency)
Character
Left Child
Right Child
13
Make A Leaf Node For Every Character
* 7
A 4
| Character | Occurs | Character | Occurs | |
| space | 7 | S | 2 | |
| A | 4 | T | 2 | |
| E | 4 | L | 1 | |
| F | 3 | O | 1 | |
| H | 2 | P | 1 | |
| I | 2 | R | 1 | |
| M | 2 | U | 1 | |
| N | 2 | X | 1 |
E 4
F 3
H 2
I 2
M 2
N 2
S 2
T 2
L 1
O 1
P 1
R 1
U 1
X 1
Hint: A single node is also a tree which just happens to have no children – a trivial one-node tree
14
Build The Huffman Tree: Overview
Next, we take our leaf nodes, and start building the tree from the bottom-up
We built the BST from the top-down
How? We said that in a binary tree, all accesses start at the root
How can we build a tree starting with a leaf?
Simple – gather two leaves, create them a parent node, and then we have a tree – the parent node is the root of the subtree with the two leaves
Then repeat, over and over and over…
15
Start Building Up The Tree (1)
Take the 2 items with the lowest weight, and make them children of a new internal node. If there are more than 2 items with this lowest weight, pick any 2
The sum of their weights becomes the weight of the internal node (which has no character)
* 7
A 4
E 4
F 3
H 2
I 2
M 2
N 2
S 2
T 2
L 1
O 1
P 1
R 1
U 1
X 1
L 1
O 1
2
16
Start Building Up The Tree (1)
Replace the two leaves (in our list) with the subtree
We started with 16 items (16 characters)
Now we have 15 items: 14 leaf nodes, and one tree containing 2 characters
Repeat!
* 7
A 4
E 4
F 3
H 2
I 2
M 2
N 2
S 2
T 2
L 1
O 1
P 1
R 1
U 1
X 1
L 1
O 1
2
17
Start Building Up The Tree (2)
Take the 2 items with the lowest weight, and make them children of a new internal node. If there are more than 2 items with this lowest weight, pick any 2
The sum of their weights becomes the weight of the internal node (which has no character)
* 7
A 4
E 4
F 3
H 2
I 2
M 2
N 2
S 2
T 2
P 1
R 1
U 1
X 1
P 1
R 1
2
L 1
O 1
2
18
Start Building Up The Tree (3)
Take the 2 items with the lowest weight, and make them children of a new internal node. If there are more than 2 items with this lowest weight, pick any 2
The sum of their weights becomes the weight of the internal node (which has no character)
* 7
A 4
E 4
F 3
H 2
I 2
M 2
N 2
S 2
T 2
U 1
X 1
U 1
X 1
2
L 1
O 1
2
P 1
R 1
2
19
Start Building Up The Tree (4)
After three applications of “group the two lowest-weight items into a tree whose parent has their combined weight”, we have what is shown below
When we group two items together by making them children of a new node, the new node becomes an item that replaces the two single ones – we never consider those leaves again on their own
Repeat!
* 7
A 4
E 4
F 3
H 2
I 2
M 2
N 2
S 2
T 2
L 1
O 1
2
P 1
R 1
2
U 1
X 1
2
20
Continue Building The Tree
Keep going: Group the two lowest-weight items into a tree whose parent node has their combined weight:
* 7
A 4
E 4
F 3
H 2
I 2
M 2
N 2
S 2
T 2
L 1
O 1
2
P 1
R 1
2
U 1
X 1
2
H 2
I 2
4
21
Continue Building The Tree
Keep going: Group the two lowest-weight items into a tree whose parent node has their combined weight:
* 7
A 4
E 4
F 3
M 2
N 2
S 2
T 2
L 1
O 1
2
P 1
R 1
2
U 1
X 1
2
M 2
N 2
4
H 2
I 2
4
22
Continue Building The Tree
Keep going: Group the two lowest-weight items into a tree whose parent node has their combined weight:
* 7
A 4
E 4
F 3
S 2
T 2
L 1
O 1
2
P 1
R 1
2
U 1
X 1
2
S 2
T 2
4
H 2
I 2
4
M 2
N 2
4
23
Continue Building The Tree
Keep going: Group the two lowest-weight items into a tree whose parent node has their combined weight:
* 7
A 4
E 4
F 3
L 1
O 1
2
P 1
R 1
2
U 1
X 1
2
S 2
T 2
4
H 2
I 2
4
M 2
N 2
4
Now what are the two “lowest-weight items”?
24
Continue Building The Tree
Keep going: Group the two lowest-weight items into a tree whose parent node has their combined weight:
* 7
A 4
E 4
F 3
L 1
O 1
2
P 1
R 1
2
U 1
X 1
2
S 2
T 2
4
H 2
I 2
4
M 2
N 2
4
L 1
O 1
2
P 1
R 1
2
4
25
Continue Building The Tree
Keep going: Group the two lowest-weight items into a tree whose parent node has their combined weight:
* 7
A 4
E 4
F 3
U 1
X 1
2
S 2
T 2
4
H 2
I 2
4
M 2
N 2
4
L 1
O 1
2
P 1
R 1
2
4
Now what are the two “lowest-weight items”?
26
Continue Building The Tree
Keep going: Group the two lowest-weight items into a tree whose parent node has their combined weight:
* 7
A 4
E 4
F 3
U 1
X 1
2
S 2
T 2
4
H 2
I 2
4
M 2
N 2
4
L 1
O 1
2
P 1
R 1
2
4
F 3
U 1
X 1
2
5
27
Continue Building The Tree
Keep going: Group the two lowest-weight items into a tree whose parent node has their combined weight:
* 7
A 4
E 4
S 2
T 2
4
H 2
I 2
4
M 2
N 2
4
L 1
O 1
2
P 1
R 1
2
4
F 3
U 1
X 1
2
5
A 4
E 4
8
28
Continue Building The Tree
Keep going: Group the two lowest-weight items into a tree whose parent node has their combined weight:
* 7
S 2
T 2
4
H 2
I 2
4
M 2
N 2
4
L 1
O 1
2
P 1
R 1
2
4
F 3
U 1
X 1
2
5
A 4
E 4
8
H 2
I 2
4
M 2
N 2
4
8
29
Continue Building The Tree
Keep going: Group the two lowest-weight items into a tree whose parent node has their combined weight:
* 7
S 2
T 2
4
L 1
O 1
2
P 1
R 1
2
4
F 3
U 1
X 1
2
5
A 4
E 4
8
H 2
I 2
4
M 2
N 2
4
8
30
Continue Building The Tree
Keep going: Group the two lowest-weight items into a tree whose parent node has their combined weight:
* 7
S 2
T 2
4
L 1
O 1
2
P 1
R 1
2
4
F 3
U 1
X 1
2
5
A 4
E 4
8
H 2
I 2
4
M 2
N 2
4
8
8
31
Continue Building The Tree
Keep going: Group the two lowest-weight items into a tree whose parent node has their combined weight:
* 7
S 2
T 2
4
L 1
O 1
2
P 1
R 1
2
4
F 3
U 1
X 1
2
5
A 4
E 4
8
H 2
I 2
4
M 2
N 2
4
8
8
12
32
Continue Building The Tree
* 7
S 2
T 2
4
L 1
O 1
2
P 1
R 1
2
4
F 3
U 1
X 1
2
5
A 4
E 4
8
H 2
I 2
4
M 2
N 2
4
8
8
12
33
Continue Building The Tree
* 7
S 2
T 2
4
L 1
O 1
2
P 1
R 1
2
4
F 3
U 1
X 1
2
5
A 4
E 4
8
H 2
I 2
4
M 2
N 2
4
8
8
12
16
34
Continue Building The Tree
* 7
S 2
T 2
4
L 1
O 1
2
P 1
R 1
2
4
F 3
U 1
X 1
2
5
A 4
E 4
8
H 2
I 2
4
M 2
N 2
4
8
8
12
16
35
Continue Building The Tree
* 7
S 2
T 2
4
L 1
O 1
2
P 1
R 1
2
4
F 3
U 1
X 1
2
5
A 4
E 4
8
H 2
I 2
4
M 2
N 2
4
8
8
12
16
20
36
Continue Building The Tree
* 7
S 2
T 2
4
L 1
O 1
2
P 1
R 1
2
4
F 3
U 1
X 1
2
5
A 4
E 4
8
H 2
I 2
4
M 2
N 2
4
8
8
12
16
20
36
37
Let’s Refine The Drawing A Bit
We know that only leaves have characters (letters), so we’ll drop the “” indicators
Similarly, we can tell where the pointers go, so we’ll drop the LChild and RChild boxes
Outline the leaves (nodes containing characters) in green
We're not making any content changes to the tree; just in how we draw it
38
Before Our Notational Change
* 7
S 2
T 2
4
L 1
O 1
2
P 1
R 1
2
4
F 3
U 1
X 1
2
5
A 4
E 4
8
H 2
I 2
4
M 2
N 2
4
8
8
12
16
20
36
39
The Finished Tree
* 7
S 2
T 2
4
L 1
O 1
2
P 1
R 1
2
4
F 3
U 1
X 1
2
5
A 4
E 4
8
H 2
I 2
4
M 2
N 2
4
8
8
12
16
20
36
40
What Next?
At this point, we don’t care about weights, but we will continue to show them
The weights are only needed to build the tree; from this point on, they're useless, but since they're already in the nodes, we'll just let them sit there.
Note: the weight at the root should equal the number of characters we started with (this is a convenient way to double-check that the tree was not built incorrectly), but it’s still possible to build the tree wrong with the correct root weight
Number (label) all left branches “0”; all right branches “1”
41
Finished Tree – Numbered Branches
* 7
S 2
T 2
4
L 1
O 1
2
P 1
R 1
2
4
F 3
U 1
X 1
2
5
A 4
E 4
8
H 2
I 2
4
M 2
N 2
4
8
8
12
16
20
36
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
1
The Path to 'A': 000
The Path to 'E': 001
The Path to 'P'?
11110
42
The Original Data, Revisited
What do we notice about the paths we wound up with?
The characters that occur more often have shorter paths (and vice versa)
| Character | Occurs | Path | Character | Occurs | Path | |
| space | 7 | 100 | S | 2 | 1100 | |
| A | 4 | 000 | T | 2 | 1101 | |
| E | 4 | 001 | L | 1 | 11100 | |
| F | 3 | 1010 | O | 1 | 11101 | |
| H | 2 | 0100 | P | 1 | 11110 | |
| I | 2 | 0101 | R | 1 | 11111 | |
| M | 2 | 0110 | U | 1 | 10110 | |
| N | 2 | 0111 | X | 1 | 10111 |
43
The Original Data, Revisited
In Huffman Encoding, the PATHs to the leaf nodes become the bit strings we use to represent the characters stored in the leaf nodes
Hold onto this idea for a minute
| Character | Occurs | Path | Character | Occurs | Path | |
| space | 7 | 100 | S | 2 | 1100 | |
| A | 4 | 000 | T | 2 | 1101 | |
| E | 4 | 001 | L | 1 | 11100 | |
| F | 3 | 1010 | O | 1 | 11101 | |
| H | 2 | 0100 | P | 1 | 11110 | |
| I | 2 | 0101 | R | 1 | 11111 | |
| M | 2 | 0110 | U | 1 | 10110 | |
| N | 2 | 0111 | X | 1 | 10111 |
44
The Original Data, Revisited
The original “alphabet” for this message consists of 16 characters
| Character | Occurs | Character | Occurs | |
| space | 7 | S | 2 | |
| A | 4 | T | 2 | |
| E | 4 | L | 1 | |
| F | 3 | O | 1 | |
| H | 2 | P | 1 | |
| I | 2 | R | 1 | |
| M | 2 | U | 1 | |
| N | 2 | X | 1 |
45
The Original Data, Revisited
We could have used a (fixed-size) four-bit scheme to encode these sixteen characters
“THIS IS AN EXAMPLE OF A HUFFMAN TREE” (36 characters) takes 36*4 = 144 bits
| Character | Encoded | Character | Encoded | |
| space | 0000 | S | 1000 | |
| A | 0001 | T | 1001 | |
| E | 0010 | L | 1010 | |
| F | 0011 | O | 1011 | |
| H | 0100 | P | 1110 | |
| I | 0101 | R | 1101 | |
| M | 0110 | U | 1110 | |
| N | 0111 | X | 1111 |
46
The Original Data, Revisited
How do we encode “THIS IS AN EXAMPLE OF A HUFFMAN TREE” using our tree paths?
Simple substitution:
| Character | Path | Character | Path | |
| space | 100 | S | 1100 | |
| A | 000 | T | 1101 | |
| E | 001 | L | 11100 | |
| F | 1010 | O | 11101 | |
| H | 0100 | P | 11110 | |
| I | 0101 | R | 11111 | |
| M | 0110 | U | 10110 | |
| N | 0111 | X | 10111 |
1101 0100 0101 1100 100 0101 1100 100 000 0111 100 001 10111 000 1001
T H I S * I S * A N * E X A M…
Total space required using fixed 4-bit encoding: 36 * 4 = 144 bits
Total space required using variable-bit-length encoding: 135 bits
135/144 = 0.9375 we got 6.25% compression
47
About Compression Levels
We got only ~6% compression
Hardly seems worth the effort
However: We had a small alphabet, and the character frequency counts did not have widely different values
Our Huffman tree was "somewhat" imbalanced
The more imbalanced a Huffman tree is, the higher the level of compression
If 'space' and 'A' had occurred 100 times each, they would have had two-bit patterns more compression
48
The Original Data, Revisited
Encoding is only half of the process – we have to be able to DECODE as well
| Character | Path | Character | Path | |
| space | 100 | S | 1100 | |
| A | 000 | T | 1101 | |
| E | 001 | L | 11100 | |
| F | 1010 | O | 11101 | |
| H | 0100 | P | 11110 | |
| I | 0101 | R | 11111 | |
| M | 0110 | U | 10110 | |
| N | 0111 | X | 10111 |
What does this say (i.e., how do we decode)?
11110111110101011111010011111110001011100100111010111
We use the tree we built!
49
Using The Tree To Decode
* 7
S 2
T 2
4
L 1
O 1
2
P 1
R 1
2
4
F 3
U 1
X 1
2
5
A 4
E 4
8
H 2
I 2
4
M 2
N 2
4
8
8
12
16
20
36
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
1
Decode: 11110111110101011111010011111110001011100100111010111
50
Using The Tree To Decode
* 7
S 2
T 2
4
L 1
O 1
2
P 1
R 1
2
4
F 3
U 1
X 1
2
5
A 4
E 4
8
H 2
I 2
4
M 2
N 2
4
8
8
12
16
20
36
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
1
Decode: 11110111110101011111010011111110001011100100111010111
51
Using The Tree To Decode
* 7
S 2
T 2
4
L 1
O 1
2
P 1
R 1
2
4
F 3
U 1
X 1
2
5
A 4
E 4
8
H 2
I 2
4
M 2
N 2
4
8
8
12
16
20
36
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
1
Decode: 11110111110101011111010011111110001011100100111010111
52
Using The Tree To Decode
* 7
S 2
T 2
4
L 1
O 1
2
P 1
R 1
2
4
F 3
U 1
X 1
2
5
A 4
E 4
8
H 2
I 2
4
M 2
N 2
4
8
8
12
16
20
36
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
1
Decode: 11110111110101011111010011111110001011100100111010111
53
Using The Tree To Decode
* 7
S 2
T 2
4
L 1
O 1
2
P 1
R 1
2
4
F 3
U 1
X 1
2
5
A 4
E 4
8
H 2
I 2
4
M 2
N 2
4
8
8
12
16
20
36
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
1
Decode: 11110111110101011111010011111110001011100100111010111
54
Using The Tree To Decode
* 7
S 2
T 2
4
L 1
O 1
2
P 1
R 1
2
4
F 3
U 1
X 1
2
5
A 4
E 4
8
H 2
I 2
4
M 2
N 2
4
8
8
12
16
20
36
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
1
Decode: 11110111110101011111010011111110001011100100111010111
55
Using The Tree To Decode
* 7
S 2
T 2
4
L 1
O 1
2
P 1
R 1
2
4
F 3
U 1
X 1
2
5
A 4
E 4
8
H 2
I 2
4
M 2
N 2
4
8
8
12
16
20
36
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
1
Decode: 11110111110101011111010011111110001011100100111010111
When we get to a leaf node, output the character stored in the leaf, and resume decoding at the root
Output: P
56
Using The Tree To Decode
* 7
S 2
T 2
4
L 1
O 1
2
P 1
R 1
2
4
F 3
U 1
X 1
2
5
A 4
E 4
8
H 2
I 2
4
M 2
N 2
4
8
8
12
16
20
36
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
1
Decode: 11110111110101011111010011111110001011100100111010111
Output: P
57
Using The Tree To Decode
* 7
S 2
T 2
4
L 1
O 1
2
P 1
R 1
2
4
F 3
U 1
X 1
2
5
A 4
E 4
8
H 2
I 2
4
M 2
N 2
4
8
8
12
16
20
36
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
1
Decode: 11110111110101011111010011111110001011100100111010111
Output: P
58
Using The Tree To Decode
* 7
S 2
T 2
4
L 1
O 1
2
P 1
R 1
2
4
F 3
U 1
X 1
2
5
A 4
E 4
8
H 2
I 2
4
M 2
N 2
4
8
8
12
16
20
36
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
1
Decode: 11110111110101011111010011111110001011100100111010111
Output: PR
59
Using The Tree To Decode
* 7
S 2
T 2
4
L 1
O 1
2
P 1
R 1
2
4
F 3
U 1
X 1
2
5
A 4
E 4
8
H 2
I 2
4
M 2
N 2
4
8
8
12
16
20
36
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
1
Decode: 11110111110101011111010011111110001011100100111010111
Output: PRI
60
Using The Tree To Decode
* 7
S 2
T 2
4
L 1
O 1
2
P 1
R 1
2
4
F 3
U 1
X 1
2
5
A 4
E 4
8
H 2
I 2
4
M 2
N 2
4
8
8
12
16
20
36
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
1
Decode: 11110111110101011111010011111110001011100100111010111
Output: PRIN
61
Using The Tree To Decode
* 7
S 2
T 2
4
L 1
O 1
2
P 1
R 1
2
4
F 3
U 1
X 1
2
5
A 4
E 4
8
H 2
I 2
4
M 2
N 2
4
8
8
12
16
20
36
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
1
Decode: 11110111110101011111010011111110001011100100111010111
Output: PRINT
62
Decode The Rest Of This String
* 7
S 2
T 2
4
L 1
O 1
2
P 1
R 1
2
4
F 3
U 1
X 1
2
5
A 4
E 4
8
H 2
I 2
4
M 2
N 2
4
8
8
12
16
20
36
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
1
Decode: 11110111110101011111010011111110001011100100111010111
Output: PRINTER IS ON
When we run out of bits in the message we’re decoding, we should be at the leaf corresponding to the last character in the original message.
63
Storing Variable-Length Bit Strings
(Most) computers work in 8-bit bytes
How can we store these variable-length strings in fixed-length bytes?
11110111 11010101 11110100 11111110 00101110 01001110 10111XXX
We add padding bits to fill the void at the end
We need three bits to fill the last byte
What 3 bits do we choose as our padding bits?
Well, there are only 8 possibilities…
64
Storing Variable-Length Bit Strings
If we’re storing this variable-length bit string in 8-bit bytes:
11110111 11010101 11110100 11111110 00101110 01001110 10111XXX
8 Options:
000
001
010
011
100
101
110
111
| Character | Path | Character | Path | |
| space | 100 | S | 1100 | |
| A | 000 | T | 1101 | |
| E | 001 | L | 11100 | |
| F | 1010 | O | 11101 | |
| H | 0100 | P | 11110 | |
| I | 0101 | R | 11111 | |
| M | 0110 | U | 10110 | |
| N | 0111 | X | 10111 |
What three padding bits should we use?
Why can we use NONE of these?
65
Storing Variable-Length Bit Strings
A few observations:
Whatever we choose must not change the meaning of the message as-transmitted
If we need padding bits at all, then we need somewhere between 1 and 7 of them, if we're using 8-bit bytes
We may not need any padding bits at all (if our encoded message happens to be some multiple of 8 bits long – that fits perfectly)
So, we need between 0 and 7 padding bits for an 8-bit byte; between 0 and 3 for our 4-bit bytes
| Character | Path | Character | Path | |
| space | 100 | S | 1100 | |
| A | 000 | T | 1101 | |
| E | 001 | L | 11100 | |
| F | 1010 | O | 11101 | |
| H | 0100 | P | 11110 | |
| I | 0101 | R | 11111 | |
| M | 0110 | U | 10110 | |
| N | 0111 | X | 10111 |
66
Choosing Padding Bits
We have a few options:
Use (up to) 3 space characters (and trim when received)
Add some special end-of-transmission character to our alphabet (may not be possible/practical)
If the longest bit string is longer than the maximum number of padding bits we might need, then we could use that bit string, knowing it would never generate a complete trip to a leaf:
No unwanted character!
Any of the longest (5-bit) strings work!
| Character | Path | Character | Path | |
| space | 100 | S | 1100 | |
| A | 000 | T | 1101 | |
| E | 001 | L | 11100 | |
| F | 1010 | O | 11101 | |
| H | 0100 | P | 11110 | |
| I | 0101 | R | 11111 | |
| M | 0110 | U | 10110 | |
| N | 0111 | X | 10111 |
67
Choosing Padding Bits
* 7
S 2
T 2
4
L 1
O 1
2
P 1
R 1
2
4
F 3
U 1
X 1
2
5
A 4
E 4
8
H 2
I 2
4
M 2
N 2
4
8
8
12
16
20
36
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
1
Decode: 11110111110101011111010011111110001011100100111010111xxx
Output: PRINTER IS ON
All OK:
010
011
101
110
111
Not OK:
000
001
100
* 7
A 4
E 4
4
4
5
4
4
If we’ve chosen our padding bits wisely, we will be somewhere in the middle of the tree (at some internal node) when we run out of bits, so we can just stop without generating some unwanted character
68
Huffman Coding - Summary
If not already known, establish frequency of occurrence counts / ratios
Build the tree iteratively with the “aggregate the two smallest-weight items” algorithm
Traverse the tree to identify the encoding strings, and store them in a table
Encode via table look-up, padding if needed
Decode via tree navigation, outputting characters when leaves reached, continuing at root. Stop when we run out of bits (even if not at a leaf node)
69
? Questions ?
Any Questions?