Programming in C++ for ush solutions

profileJake_9876
HuffmanTrees-Part1.pptx

Review

Dynamic Set Operators

Insert, Remove, Search, Max, Min, Previous, Next

Binary Search Trees

Queries (Search, Max, Min, Previous, Next)

Modifiers (Insert, Remove)

Inserting nodes with/without duplicates

Traversing a tree (visiting all nodes in a tree)

Pre-order, post-order, in-order

Using trees as expression parsers

Heaps (trees implemented in arrays)

Heapsort; Using a heap to implement a priority queue

1

A Completely New Tree Application

Sometimes it’s not just the data in the node that’s important, but the PATH to the data.

Why does Manhattan have area code 212?

213, 312, and 313 belong to (parts of) Los Angeles, Chicago, and Detroit (respectively)

214 and 412 are Dallas and Pittsburgh (resp.)

All of Alaska has 907

Why?

When area codes were devised, we had rotary dial phones, and ALL area codes had either 0 or 1 as the middle digit. See here

2

Rotary Telephone Dial

No pushbuttons!

It takes longer to dial 907 (Alaska) than it does to dial 212 (Manhattan) See here, starting about 4:10, or here

212 is the fastest 3-digit number (with a 1 or 0 as the middle digit) you can dial

Areas with more people got the “shorter” area codes

The idea was to save more time calling people in big cities than would be lost calling people in rural areas

3

A Related System – Morse code

Used by telegraphers in mid-1800’s before long-distance telephone lines existed

Letters were transmitted using sequences of dots (short pulses) and dashes (long pulses)

Why was “E” assigned the one-dot code?

4

Frequency Of Occurrence (1)

We encode data in 8-bit bytes (using one byte per character), because computers work well with fixed-size characters

What if we could encode data with variable-length “bytes”?

If we assigned shorter bit strings to the characters that occur most often (and longer bit strings to the ones that occur least often), could we save space overall?

Does the "savings" (space saved on the more frequently-occurring characters) outweigh the "cost" (space lost on the less-frequently-occurring ones)?

5

Frequency Of Occurrence (2)

The English alphabet has 26 letters, but 12 of them comprise about 80% of the total usage

Obviously, we have to consider huge bodies of text to make the averages accurate

Specific documents, or text related to a specific area would have different frequency of occurrence patterns (as would different languages, of course)

Documents from the Oxford University website probably have many more “X” and “F” characters than we would normally expect in English prose.

Documents from LCCC would have disproportionately many instances of “C”

6

Huffman Coding

Huffman Coding (David Huffman, MIT Ph.D. student, 1952)

Uses variable-length bit codes to represent characters

Requires that we know (up-front) the relative frequency of occurrence for each character.

Once we build the “Huffman tree”, we can encode and decode the characters / bit strings

A way to implement lossless compression (but doesn’t compress as much as ZIP, RAR, or 7-zip)

Used in JPEG images

7

So How Does It work?

Start with the table of relative frequencies

Let’s simplify things a bit -- rather than encoding arbitrary text that might use the entire alphabet, let’s consider something shorter

“THIS IS AN EXAMPLE OF A HUFFMAN TREE”

This 36-character (total) message consists of 16 different characters:

{T, H, I, S, space, A, N, E, X, M, P, L, O, F, U, R}

Next, we need to count how many times each appears (compute the frequency of occurrence) in our body of text.

8

Compute Frequency Of Occurrence

“THIS IS AN EXAMPLE OF A HUFFMAN TREE”

{T, H, I, S, space, A, N, E, X, M, P, L, O, F, U, R}

Character Occurs Character Occurs
T 2 X 1
H 2 M 2
I 2 P 1
S 2 L 1
space 7 O 1
A 4 F 3
N 2 U 1
E 4 R 1

9

Sort The Table (Descending)

The remaining question:

Do we save enough bits by representing the space, 'A', and 'E' with something shorter to pay for having to represent 'L', 'O', 'P', 'R', 'U', and 'X' with something longer?

Character Occurs Character Occurs
space 7 S 2
A 4 T 2
E 4 L 1
F 3 O 1
H 2 P 1
I 2 R 1
M 2 U 1
N 2 X 1

10

Huffman Trees

Huffman proved (mathematically) that, on a single-character basis, this method was optimal (ZIP, RAR, 7-Zip all consider more than one character at a time, and so can compress better)

The Huffman process is based on building a tree from the frequency of occurrence information

This IS a binary tree, but it is NOT a Binary Search Tree – the BST property is not required to hold at any particular point in the tree, and we don’t make any “less than goes left; greater than goes right” decisions at any node.

Bear with me, this will all make sense shortly…

11

Use This Data To Build A Binary Tree

This is a special type of binary tree

The leaves contain letters (symbols) and “weights”

The non-leaf nodes contain only “weights” (no letter)

How do we tell a leaf from an internal node?

struct treenode

{

char symb;

int weight;

treenode *LCH;

treenode *RCH;

}

12

Build The Binary Tree (2)

Make a leaf node for every character

The frequency of occurrence becomes the weight for a leaf

We’ll use '*' (for now) to represent 'space‘

Repeat (make a node) for every character

This gives us an array of nodes. Note: at this point, the nodes are ALL disconnected (no child pointers have been set)

* 7

 

Weight (frequency)

Character

Left Child

Right Child

13

Make A Leaf Node For Every Character

* 7

 

A 4

 

Character Occurs Character Occurs
space 7 S 2
A 4 T 2
E 4 L 1
F 3 O 1
H 2 P 1
I 2 R 1
M 2 U 1
N 2 X 1

E 4

 

F 3

 

H 2

 

I 2

 

M 2

 

N 2

 

S 2

 

T 2

 

L 1

 

O 1

 

P 1

 

R 1

 

U 1

 

X 1

 

Hint: A single node is also a tree which just happens to have no children – a trivial one-node tree

14

Build The Huffman Tree: Overview

Next, we take our leaf nodes, and start building the tree from the bottom-up

We built the BST from the top-down

How? We said that in a binary tree, all accesses start at the root

How can we build a tree starting with a leaf?

Simple – gather two leaves, create them a parent node, and then we have a tree – the parent node is the root of the subtree with the two leaves

Then repeat, over and over and over…

15

Start Building Up The Tree (1)

Take the 2 items with the lowest weight, and make them children of a new internal node. If there are more than 2 items with this lowest weight, pick any 2

The sum of their weights becomes the weight of the internal node (which has no character)

* 7

 

A 4

 

E 4

 

F 3

 

H 2

 

I 2

 

M 2

 

N 2

 

S 2

 

T 2

 

L 1

 

O 1

 

P 1

 

R 1

 

U 1

 

X 1

 

L 1

 

O 1

 

2

16

Start Building Up The Tree (1)

Replace the two leaves (in our list) with the subtree

We started with 16 items (16 characters)

Now we have 15 items: 14 leaf nodes, and one tree containing 2 characters

Repeat!

* 7

 

A 4

 

E 4

 

F 3

 

H 2

 

I 2

 

M 2

 

N 2

 

S 2

 

T 2

 

L 1

 

O 1

 

P 1

 

R 1

 

U 1

 

X 1

 

L 1

 

O 1

 

2

17

Start Building Up The Tree (2)

Take the 2 items with the lowest weight, and make them children of a new internal node. If there are more than 2 items with this lowest weight, pick any 2

The sum of their weights becomes the weight of the internal node (which has no character)

* 7

 

A 4

 

E 4

 

F 3

 

H 2

 

I 2

 

M 2

 

N 2

 

S 2

 

T 2

 

P 1

 

R 1

 

U 1

 

X 1

 

P 1

 

R 1

 

2

L 1

 

O 1

 

2

18

Start Building Up The Tree (3)

Take the 2 items with the lowest weight, and make them children of a new internal node. If there are more than 2 items with this lowest weight, pick any 2

The sum of their weights becomes the weight of the internal node (which has no character)

* 7

 

A 4

 

E 4

 

F 3

 

H 2

 

I 2

 

M 2

 

N 2

 

S 2

 

T 2

 

U 1

 

X 1

 

U 1

 

X 1

 

2

L 1

 

O 1

 

2

P 1

 

R 1

 

2

19

Start Building Up The Tree (4)

After three applications of “group the two lowest-weight items into a tree whose parent has their combined weight”, we have what is shown below

When we group two items together by making them children of a new node, the new node becomes an item that replaces the two single ones – we never consider those leaves again on their own

Repeat!

* 7

 

A 4

 

E 4

 

F 3

 

H 2

 

I 2

 

M 2

 

N 2

 

S 2

 

T 2

 

L 1

 

O 1

 

2

P 1

 

R 1

 

2

U 1

 

X 1

 

2

20

Continue Building The Tree

Keep going: Group the two lowest-weight items into a tree whose parent node has their combined weight:

* 7

 

A 4

 

E 4

 

F 3

 

H 2

 

I 2

 

M 2

 

N 2

 

S 2

 

T 2

 

L 1

 

O 1

 

2

P 1

 

R 1

 

2

U 1

 

X 1

 

2

H 2

 

I 2

 

4

21

Continue Building The Tree

Keep going: Group the two lowest-weight items into a tree whose parent node has their combined weight:

* 7

 

A 4

 

E 4

 

F 3

 

M 2

 

N 2

 

S 2

 

T 2

 

L 1

 

O 1

 

2

P 1

 

R 1

 

2

U 1

 

X 1

 

2

M 2

 

N 2

 

4

H 2

 

I 2

 

4

22

Continue Building The Tree

Keep going: Group the two lowest-weight items into a tree whose parent node has their combined weight:

* 7

 

A 4

 

E 4

 

F 3

 

S 2

 

T 2

 

L 1

 

O 1

 

2

P 1

 

R 1

 

2

U 1

 

X 1

 

2

S 2

 

T 2

 

4

H 2

 

I 2

 

4

M 2

 

N 2

 

4

23

Continue Building The Tree

Keep going: Group the two lowest-weight items into a tree whose parent node has their combined weight:

* 7

 

A 4

 

E 4

 

F 3

 

L 1

 

O 1

 

2

P 1

 

R 1

 

2

U 1

 

X 1

 

2

S 2

 

T 2

 

4

H 2

 

I 2

 

4

M 2

 

N 2

 

4

Now what are the two “lowest-weight items”?

24

Continue Building The Tree

Keep going: Group the two lowest-weight items into a tree whose parent node has their combined weight:

* 7

 

A 4

 

E 4

 

F 3

 

L 1

 

O 1

 

2

P 1

 

R 1

 

2

U 1

 

X 1

 

2

S 2

 

T 2

 

4

H 2

 

I 2

 

4

M 2

 

N 2

 

4

L 1

 

O 1

 

2

P 1

 

R 1

 

2

4

25

Continue Building The Tree

Keep going: Group the two lowest-weight items into a tree whose parent node has their combined weight:

* 7

 

A 4

 

E 4

 

F 3

 

U 1

 

X 1

 

2

S 2

 

T 2

 

4

H 2

 

I 2

 

4

M 2

 

N 2

 

4

L 1

 

O 1

 

2

P 1

 

R 1

 

2

4

Now what are the two “lowest-weight items”?

26

Continue Building The Tree

Keep going: Group the two lowest-weight items into a tree whose parent node has their combined weight:

* 7

 

A 4

 

E 4

 

F 3

 

U 1

 

X 1

 

2

S 2

 

T 2

 

4

H 2

 

I 2

 

4

M 2

 

N 2

 

4

L 1

 

O 1

 

2

P 1

 

R 1

 

2

4

F 3

 

U 1

 

X 1

 

2

5

27

Continue Building The Tree

Keep going: Group the two lowest-weight items into a tree whose parent node has their combined weight:

* 7

 

A 4

 

E 4

 

S 2

 

T 2

 

4

H 2

 

I 2

 

4

M 2

 

N 2

 

4

L 1

 

O 1

 

2

P 1

 

R 1

 

2

4

F 3

 

U 1

 

X 1

 

2

5

A 4

 

E 4

 

8

28

Continue Building The Tree

Keep going: Group the two lowest-weight items into a tree whose parent node has their combined weight:

* 7

 

S 2

 

T 2

 

4

H 2

 

I 2

 

4

M 2

 

N 2

 

4

L 1

 

O 1

 

2

P 1

 

R 1

 

2

4

F 3

 

U 1

 

X 1

 

2

5

A 4

 

E 4

 

8

H 2

 

I 2

 

4

M 2

 

N 2

 

4

8

29

Continue Building The Tree

Keep going: Group the two lowest-weight items into a tree whose parent node has their combined weight:

* 7

 

S 2

 

T 2

 

4

L 1

 

O 1

 

2

P 1

 

R 1

 

2

4

F 3

 

U 1

 

X 1

 

2

5

A 4

 

E 4

 

8

H 2

 

I 2

 

4

M 2

 

N 2

 

4

8

30

Continue Building The Tree

Keep going: Group the two lowest-weight items into a tree whose parent node has their combined weight:

* 7

 

S 2

 

T 2

 

4

L 1

 

O 1

 

2

P 1

 

R 1

 

2

4

F 3

 

U 1

 

X 1

 

2

5

A 4

 

E 4

 

8

H 2

 

I 2

 

4

M 2

 

N 2

 

4

8

8

31

Continue Building The Tree

Keep going: Group the two lowest-weight items into a tree whose parent node has their combined weight:

* 7

 

S 2

 

T 2

 

4

L 1

 

O 1

 

2

P 1

 

R 1

 

2

4

F 3

 

U 1

 

X 1

 

2

5

A 4

 

E 4

 

8

H 2

 

I 2

 

4

M 2

 

N 2

 

4

8

8

12

32

Continue Building The Tree

* 7

 

S 2

 

T 2

 

4

L 1

 

O 1

 

2

P 1

 

R 1

 

2

4

F 3

 

U 1

 

X 1

 

2

5

A 4

 

E 4

 

8

H 2

 

I 2

 

4

M 2

 

N 2

 

4

8

8

12

33

Continue Building The Tree

* 7

 

S 2

 

T 2

 

4

L 1

 

O 1

 

2

P 1

 

R 1

 

2

4

F 3

 

U 1

 

X 1

 

2

5

A 4

 

E 4

 

8

H 2

 

I 2

 

4

M 2

 

N 2

 

4

8

8

12

16

34

Continue Building The Tree

* 7

 

S 2

 

T 2

 

4

L 1

 

O 1

 

2

P 1

 

R 1

 

2

4

F 3

 

U 1

 

X 1

 

2

5

A 4

 

E 4

 

8

H 2

 

I 2

 

4

M 2

 

N 2

 

4

8

8

12

16

35

Continue Building The Tree

* 7

 

S 2

 

T 2

 

4

L 1

 

O 1

 

2

P 1

 

R 1

 

2

4

F 3

 

U 1

 

X 1

 

2

5

A 4

 

E 4

 

8

H 2

 

I 2

 

4

M 2

 

N 2

 

4

8

8

12

16

20

36

Continue Building The Tree

* 7

 

S 2

 

T 2

 

4

L 1

 

O 1

 

2

P 1

 

R 1

 

2

4

F 3

 

U 1

 

X 1

 

2

5

A 4

 

E 4

 

8

H 2

 

I 2

 

4

M 2

 

N 2

 

4

8

8

12

16

20

36

37

Let’s Refine The Drawing A Bit

We know that only leaves have characters (letters), so we’ll drop the “” indicators

Similarly, we can tell where the pointers go, so we’ll drop the LChild and RChild boxes

Outline the leaves (nodes containing characters) in green

We're not making any content changes to the tree; just in how we draw it

38

Before Our Notational Change

* 7

 

S 2

 

T 2

 

4

L 1

 

O 1

 

2

P 1

 

R 1

 

2

4

F 3

 

U 1

 

X 1

 

2

5

A 4

 

E 4

 

8

H 2

 

I 2

 

4

M 2

 

N 2

 

4

8

8

12

16

20

36

39

The Finished Tree

* 7

S 2

T 2

4

L 1

O 1

2

P 1

R 1

2

4

F 3

U 1

X 1

2

5

A 4

E 4

8

H 2

I 2

4

M 2

N 2

4

8

8

12

16

20

36

40

What Next?

At this point, we don’t care about weights, but we will continue to show them

The weights are only needed to build the tree; from this point on, they're useless, but since they're already in the nodes, we'll just let them sit there.

Note: the weight at the root should equal the number of characters we started with (this is a convenient way to double-check that the tree was not built incorrectly), but it’s still possible to build the tree wrong with the correct root weight

Number (label) all left branches “0”; all right branches “1”

41

Finished Tree – Numbered Branches

* 7

S 2

T 2

4

L 1

O 1

2

P 1

R 1

2

4

F 3

U 1

X 1

2

5

A 4

E 4

8

H 2

I 2

4

M 2

N 2

4

8

8

12

16

20

36

0

0

0

0

0

0

0

0

0

0

0

0

0

0

1

1

1

1

1

1

1

1

1

1

1

1

1

1

0

1

The Path to 'A': 000

The Path to 'E': 001

The Path to 'P'?

11110

42

The Original Data, Revisited

What do we notice about the paths we wound up with?

The characters that occur more often have shorter paths (and vice versa)

Character Occurs Path Character Occurs Path
space 7 100 S 2 1100
A 4 000 T 2 1101
E 4 001 L 1 11100
F 3 1010 O 1 11101
H 2 0100 P 1 11110
I 2 0101 R 1 11111
M 2 0110 U 1 10110
N 2 0111 X 1 10111

43

The Original Data, Revisited

In Huffman Encoding, the PATHs to the leaf nodes become the bit strings we use to represent the characters stored in the leaf nodes

Hold onto this idea for a minute

Character Occurs Path Character Occurs Path
space 7 100 S 2 1100
A 4 000 T 2 1101
E 4 001 L 1 11100
F 3 1010 O 1 11101
H 2 0100 P 1 11110
I 2 0101 R 1 11111
M 2 0110 U 1 10110
N 2 0111 X 1 10111

44

The Original Data, Revisited

The original “alphabet” for this message consists of 16 characters

Character Occurs Character Occurs
space 7 S 2
A 4 T 2
E 4 L 1
F 3 O 1
H 2 P 1
I 2 R 1
M 2 U 1
N 2 X 1

45

The Original Data, Revisited

We could have used a (fixed-size) four-bit scheme to encode these sixteen characters

“THIS IS AN EXAMPLE OF A HUFFMAN TREE” (36 characters) takes 36*4 = 144 bits

Character Encoded Character Encoded
space 0000 S 1000
A 0001 T 1001
E 0010 L 1010
F 0011 O 1011
H 0100 P 1110
I 0101 R 1101
M 0110 U 1110
N 0111 X 1111

46

The Original Data, Revisited

How do we encode “THIS IS AN EXAMPLE OF A HUFFMAN TREE” using our tree paths?

Simple substitution:

Character Path Character Path
space 100 S 1100
A 000 T 1101
E 001 L 11100
F 1010 O 11101
H 0100 P 11110
I 0101 R 11111
M 0110 U 10110
N 0111 X 10111

1101 0100 0101 1100 100 0101 1100 100 000 0111 100 001 10111 000 1001

T H I S * I S * A N * E X A M…

Total space required using fixed 4-bit encoding: 36 * 4 = 144 bits

Total space required using variable-bit-length encoding: 135 bits

135/144 = 0.9375  we got 6.25% compression

47

About Compression Levels

We got only ~6% compression

Hardly seems worth the effort

However: We had a small alphabet, and the character frequency counts did not have widely different values

Our Huffman tree was "somewhat" imbalanced

The more imbalanced a Huffman tree is, the higher the level of compression

If 'space' and 'A' had occurred 100 times each, they would have had two-bit patterns  more compression

48

The Original Data, Revisited

Encoding is only half of the process – we have to be able to DECODE as well

Character Path Character Path
space 100 S 1100
A 000 T 1101
E 001 L 11100
F 1010 O 11101
H 0100 P 11110
I 0101 R 11111
M 0110 U 10110
N 0111 X 10111

What does this say (i.e., how do we decode)?

11110111110101011111010011111110001011100100111010111

We use the tree we built!

49

Using The Tree To Decode

* 7

S 2

T 2

4

L 1

O 1

2

P 1

R 1

2

4

F 3

U 1

X 1

2

5

A 4

E 4

8

H 2

I 2

4

M 2

N 2

4

8

8

12

16

20

36

0

0

0

0

0

0

0

0

0

0

0

0

0

0

1

1

1

1

1

1

1

1

1

1

1

1

1

1

0

1

Decode: 11110111110101011111010011111110001011100100111010111

50

Using The Tree To Decode

* 7

S 2

T 2

4

L 1

O 1

2

P 1

R 1

2

4

F 3

U 1

X 1

2

5

A 4

E 4

8

H 2

I 2

4

M 2

N 2

4

8

8

12

16

20

36

0

0

0

0

0

0

0

0

0

0

0

0

0

0

1

1

1

1

1

1

1

1

1

1

1

1

1

1

0

1

Decode: 11110111110101011111010011111110001011100100111010111

51

Using The Tree To Decode

* 7

S 2

T 2

4

L 1

O 1

2

P 1

R 1

2

4

F 3

U 1

X 1

2

5

A 4

E 4

8

H 2

I 2

4

M 2

N 2

4

8

8

12

16

20

36

0

0

0

0

0

0

0

0

0

0

0

0

0

0

1

1

1

1

1

1

1

1

1

1

1

1

1

1

0

1

Decode: 11110111110101011111010011111110001011100100111010111

52

Using The Tree To Decode

* 7

S 2

T 2

4

L 1

O 1

2

P 1

R 1

2

4

F 3

U 1

X 1

2

5

A 4

E 4

8

H 2

I 2

4

M 2

N 2

4

8

8

12

16

20

36

0

0

0

0

0

0

0

0

0

0

0

0

0

0

1

1

1

1

1

1

1

1

1

1

1

1

1

1

0

1

Decode: 11110111110101011111010011111110001011100100111010111

53

Using The Tree To Decode

* 7

S 2

T 2

4

L 1

O 1

2

P 1

R 1

2

4

F 3

U 1

X 1

2

5

A 4

E 4

8

H 2

I 2

4

M 2

N 2

4

8

8

12

16

20

36

0

0

0

0

0

0

0

0

0

0

0

0

0

0

1

1

1

1

1

1

1

1

1

1

1

1

1

1

0

1

Decode: 11110111110101011111010011111110001011100100111010111

54

Using The Tree To Decode

* 7

S 2

T 2

4

L 1

O 1

2

P 1

R 1

2

4

F 3

U 1

X 1

2

5

A 4

E 4

8

H 2

I 2

4

M 2

N 2

4

8

8

12

16

20

36

0

0

0

0

0

0

0

0

0

0

0

0

0

0

1

1

1

1

1

1

1

1

1

1

1

1

1

1

0

1

Decode: 11110111110101011111010011111110001011100100111010111

55

Using The Tree To Decode

* 7

S 2

T 2

4

L 1

O 1

2

P 1

R 1

2

4

F 3

U 1

X 1

2

5

A 4

E 4

8

H 2

I 2

4

M 2

N 2

4

8

8

12

16

20

36

0

0

0

0

0

0

0

0

0

0

0

0

0

0

1

1

1

1

1

1

1

1

1

1

1

1

1

1

0

1

Decode: 11110111110101011111010011111110001011100100111010111

When we get to a leaf node, output the character stored in the leaf, and resume decoding at the root

Output: P

56

Using The Tree To Decode

* 7

S 2

T 2

4

L 1

O 1

2

P 1

R 1

2

4

F 3

U 1

X 1

2

5

A 4

E 4

8

H 2

I 2

4

M 2

N 2

4

8

8

12

16

20

36

0

0

0

0

0

0

0

0

0

0

0

0

0

0

1

1

1

1

1

1

1

1

1

1

1

1

1

1

0

1

Decode: 11110111110101011111010011111110001011100100111010111

Output: P

57

Using The Tree To Decode

* 7

S 2

T 2

4

L 1

O 1

2

P 1

R 1

2

4

F 3

U 1

X 1

2

5

A 4

E 4

8

H 2

I 2

4

M 2

N 2

4

8

8

12

16

20

36

0

0

0

0

0

0

0

0

0

0

0

0

0

0

1

1

1

1

1

1

1

1

1

1

1

1

1

1

0

1

Decode: 11110111110101011111010011111110001011100100111010111

Output: P

58

Using The Tree To Decode

* 7

S 2

T 2

4

L 1

O 1

2

P 1

R 1

2

4

F 3

U 1

X 1

2

5

A 4

E 4

8

H 2

I 2

4

M 2

N 2

4

8

8

12

16

20

36

0

0

0

0

0

0

0

0

0

0

0

0

0

0

1

1

1

1

1

1

1

1

1

1

1

1

1

1

0

1

Decode: 11110111110101011111010011111110001011100100111010111

Output: PR

59

Using The Tree To Decode

* 7

S 2

T 2

4

L 1

O 1

2

P 1

R 1

2

4

F 3

U 1

X 1

2

5

A 4

E 4

8

H 2

I 2

4

M 2

N 2

4

8

8

12

16

20

36

0

0

0

0

0

0

0

0

0

0

0

0

0

0

1

1

1

1

1

1

1

1

1

1

1

1

1

1

0

1

Decode: 11110111110101011111010011111110001011100100111010111

Output: PRI

60

Using The Tree To Decode

* 7

S 2

T 2

4

L 1

O 1

2

P 1

R 1

2

4

F 3

U 1

X 1

2

5

A 4

E 4

8

H 2

I 2

4

M 2

N 2

4

8

8

12

16

20

36

0

0

0

0

0

0

0

0

0

0

0

0

0

0

1

1

1

1

1

1

1

1

1

1

1

1

1

1

0

1

Decode: 11110111110101011111010011111110001011100100111010111

Output: PRIN

61

Using The Tree To Decode

* 7

S 2

T 2

4

L 1

O 1

2

P 1

R 1

2

4

F 3

U 1

X 1

2

5

A 4

E 4

8

H 2

I 2

4

M 2

N 2

4

8

8

12

16

20

36

0

0

0

0

0

0

0

0

0

0

0

0

0

0

1

1

1

1

1

1

1

1

1

1

1

1

1

1

0

1

Decode: 11110111110101011111010011111110001011100100111010111

Output: PRINT

62

Decode The Rest Of This String

* 7

S 2

T 2

4

L 1

O 1

2

P 1

R 1

2

4

F 3

U 1

X 1

2

5

A 4

E 4

8

H 2

I 2

4

M 2

N 2

4

8

8

12

16

20

36

0

0

0

0

0

0

0

0

0

0

0

0

0

0

1

1

1

1

1

1

1

1

1

1

1

1

1

1

0

1

Decode: 11110111110101011111010011111110001011100100111010111

Output: PRINTER IS ON

When we run out of bits in the message we’re decoding, we should be at the leaf corresponding to the last character in the original message.

63

Storing Variable-Length Bit Strings

(Most) computers work in 8-bit bytes

How can we store these variable-length strings in fixed-length bytes?

11110111 11010101 11110100 11111110 00101110 01001110 10111XXX

We add padding bits to fill the void at the end

We need three bits to fill the last byte

What 3 bits do we choose as our padding bits?

Well, there are only 8 possibilities…

64

Storing Variable-Length Bit Strings

If we’re storing this variable-length bit string in 8-bit bytes:

11110111 11010101 11110100 11111110 00101110 01001110 10111XXX

8 Options:

000

001

010

011

100

101

110

111

Character Path Character Path
space 100 S 1100
A 000 T 1101
E 001 L 11100
F 1010 O 11101
H 0100 P 11110
I 0101 R 11111
M 0110 U 10110
N 0111 X 10111

What three padding bits should we use?

Why can we use NONE of these?

65

Storing Variable-Length Bit Strings

A few observations:

Whatever we choose must not change the meaning of the message as-transmitted

If we need padding bits at all, then we need somewhere between 1 and 7 of them, if we're using 8-bit bytes

We may not need any padding bits at all (if our encoded message happens to be some multiple of 8 bits long – that fits perfectly)

So, we need between 0 and 7 padding bits for an 8-bit byte; between 0 and 3 for our 4-bit bytes

Character Path Character Path
space 100 S 1100
A 000 T 1101
E 001 L 11100
F 1010 O 11101
H 0100 P 11110
I 0101 R 11111
M 0110 U 10110
N 0111 X 10111

66

Choosing Padding Bits

We have a few options:

Use (up to) 3 space characters (and trim when received)

Add some special end-of-transmission character to our alphabet (may not be possible/practical)

If the longest bit string is longer than the maximum number of padding bits we might need, then we could use that bit string, knowing it would never generate a complete trip to a leaf:

No unwanted character!

Any of the longest (5-bit) strings work!

Character Path Character Path
space 100 S 1100
A 000 T 1101
E 001 L 11100
F 1010 O 11101
H 0100 P 11110
I 0101 R 11111
M 0110 U 10110
N 0111 X 10111

67

Choosing Padding Bits

* 7

S 2

T 2

4

L 1

O 1

2

P 1

R 1

2

4

F 3

U 1

X 1

2

5

A 4

E 4

8

H 2

I 2

4

M 2

N 2

4

8

8

12

16

20

36

0

0

0

0

0

0

0

0

0

0

0

0

0

0

1

1

1

1

1

1

1

1

1

1

1

1

1

1

0

1

Decode: 11110111110101011111010011111110001011100100111010111xxx

Output: PRINTER IS ON

All OK:

010

011

101

110

111

Not OK:

000

001

100

* 7

A 4

E 4

4

4

5

4

4

If we’ve chosen our padding bits wisely, we will be somewhere in the middle of the tree (at some internal node) when we run out of bits, so we can just stop without generating some unwanted character

68

Huffman Coding - Summary

If not already known, establish frequency of occurrence counts / ratios

Build the tree iteratively with the “aggregate the two smallest-weight items” algorithm

Traverse the tree to identify the encoding strings, and store them in a table

Encode via table look-up, padding if needed

Decode via tree navigation, outputting characters when leaves reached, continuing at root. Stop when we run out of bits (even if not at a leaf node)

69

? Questions ?

Any Questions?