Huffman application

compSci
previous_assignment_info.pdf

1/1/13 10:10 PMCS 211 Homework Seven — Fall 2012

Page 1 of 3http://web.cs.mtholyoke.edu/~candrews/cs211/homework/hw7.html

CS 212 — Homework Seven Huffman Codes

Due: 13 November 2012, 11:59p

Worth: 90 points

The Task

This week you are going to implement a new data structure and the main Huffman tree. Next week we will finish this off with a full application for compressing and decompressing text files.

Part one: PriorityQueue [20 pts]

As I said in class, I redesigned final product of this project, so we need to redo the PriorityQueue and the Heap. This is actually good practice for you because refactoring code to meet changing specifications is a common occurrence in software development. We basically want to have a different level of control, so we are going to change our classes to be closer to the PriorityQueue provided in the Java API. The new interface for PriorityQueue looks like this:

public PriorityQueue(int initialCapacity,Comparator<? super E> comparator) This is the constructor. It should initialize the heap using the initialCapacity. The comparator is the tool that will allow us to change how priority is calculated for various inputs.

public E peek() This returns the next item without removing it. This returns null if the queue is empty.

E remove() This removes the first item from the queue.

void add(E item) This adds item to the queue

boolean isEmpty() Obviously, this says if the queue is empty or not.

int size() This should return the size of the queue.

The main difference is the addition of the Comparator. Comparators are classes that provide the same functionality as compareTo(). You will use this instead of the compareTo method for comparing items in the queue. We can write different comparators to give us high value priority, low value priority, or more complex variations (which is why we need to do this).

Doing this will mean that the wrapper class you wrote for the last assignment is no longer needed, the priority will be supplied in a different way. You may be wondering about the Comparator<? super E>. We are specifying that our Comparator is capable of comparing two objects of type E. The ? says that it can compare two Es, or E is a subclass of the class that the Comparator can handle.

You will, of course, need to adjust the Heap as well to work with this new PriorityQueue. Add the Comparator to the constructor and update siftup and siftdown to use it. Also, write two static, public inner classes on the Heap called MaxHeapComparator and MinHeapComparator. These comparators will provide the normal functionality of min and max by calling the compareTo methods on the stored items (i.e., they will only work for objects that implement Comparable.

Part two: Map [30 pts]

Next, we need a new data structure. We are going to build this in two pieces so that we can reuse the

1/1/13 10:10 PMCS 211 Homework Seven — Fall 2012

Page 2 of 3http://web.cs.mtholyoke.edu/~candrews/cs211/homework/hw7.html

interface later. So, the first this you should build is an interface called Map<K extends Comparable<K>,V>. The interface should look like this:

void put(K key, V value) This should load the value value into the map at the location associated with the key, replacing any preexisting value already associated with the key.

V get(K key) This should fetch the value V associated with the key key, or null, if no match can be found.

boolean containsKey(K key) This will report if there is a value associated with the value key in the map.

V remove(K key) This method should remove the entry associated with key and return the associated value. Return null if the key is not found.

Set<K> keySet() This method should return a Set of all available keys in the map. You may use the java.util.TreeSet class to return these. You can build these on the fly using a traversal, or collect them as new keys are added to the map.

Of course, since this is just the interface, you will need to implement an actual instance so that you can use it. This instance should be called BSTMap<K,V>. Obviously, this will use a binary search tree to actually implement the methods described above. While one perfectly valid implementation choice would be to create a separate binary search tree class and just use it to implement the described methods, it can force you to jump through some hoops to make methods like get and containsKey to work properly. So instead, you can build the binary search tree structure directly into your implementation of BSTMap.

When constructing the underlying BST, you will need to create some form of node structure that will hold both the key and the value. You will, of course, need to adjust the pseudocode that we walked through in class to work directly with the key values. It is important that these nodes are not visible outside of the class. The other major aspect of this design is whether you use an array or links in the nodes to implement the actual tree. I encourage you to use the linked version as we will be building on this class later and will want it to be easy to move nodes around.

Part three: The Huffman tree [40 pts]

Once you have built your map, you will have enough tools to actually build the Huffman tree. The tree should be in a class called HuffmanTree. The interface should look like this:

static HuffmanTree newTreeFromFile(File file) This is a static factory method that creates a new Huffman tree. In essence, this just creates the new object and calls buildFromFile on it. This should throw a FileNotFoundException when appropriate.

private void buildFromFile(File file) This reads in an unencoded file and builds the Huffman tree based on the character frequencies. This should throw a FileNotFoundException when appropriate.

String encode(String text) This method takes in some plain text and returns the encoded version as a String of 1's and 0's.

String decode(String text) This method takes in a String of 1's and 0's and returns a decoded raw text version.

For this assignment I am introducing you to a new Java design pattern called the "factory method". Factory methods are methods (usually static) that return new instances of a class. We typically use them when there is some complex initialization that needs to be performed or we have more than one way in which we want an object to be created. In this case, we are paving the way for future development. The factory method should just create a new HuffmanTree and then call buildFromFile on it.

To actually build the tree, you should start by reading the seed file one character at a time, using the Map to keep track of the frequencies of each character. Once the file has been consumed, you will use your PriorityQueue to to build the Huffman tree.

You will want to create a private inner node class that holds a frequency and a String. Since we want to assemble everything into a binary tree structure (it will be a heap, but we won't use the Heap class for it), also include left and right children. Call this class HTree. Write a constructor that sets all of the fields at

1/1/13 10:10 PMCS 211 Homework Seven — Fall 2012

Page 3 of 3http://web.cs.mtholyoke.edu/~candrews/cs211/homework/hw7.html

once.

For each character you read in, create an HTree wrapper for it and its frequency and put it in the PriorityQueue. The general algorithm for building the Huffman tree goes like this:

Remove the two trees with the lowest frequency from the queue Use these two trees as the left and right children of a new tree with a root node that has the frequency of the two children combined. Return this tree to the queue Repeat until there is only one tree — this is the Huffman tree

To make this work, you will want to write a Comparator that uses the lowest frequency as the highest priority. Unfortunately, this algorithm does not create unique trees. We would like to build a canonical Huffman tree. In order to do that, we will add some more rules. When we have two trees that have the same frequency, we will take the shorter one first. When we have two trees of the same complexity and the same length, we will use alphabetical order to determine priority (e.g., 'a' comes before 'b'). In addition, when we build our subtrees, the shorter tree goes on the left. If the trees are the same length, the first alphabetically goes of the left (note that this is true regardless of frequency).

Once the tree is fully constructed, you should create a second map that stores the final codes for each character. The tree and this map should be the two data structures that you keep around after the construction process. For encoding, you will use the map to lookup each character, and for decoding, you will use the tree to find the decoded characters.

I want you to notice that the two methods, encode and decode are really only for testing. They transform the data into Strings of 1s and 0s, which is certainly not a compressed form for the data.

Turning the assignment in...

Commented source files and tests should be put into a single directory, zipped, tarred or jarred, and submitted online into your course dropbox on ella (I only want source files, I do not want any class files). Please name your files cs211_pid_hw7.suffix, where pid is your Mount Holyoke email account name and suffix is the appropriate suffix for your archive format.

Last modified: Mon Dec 17 17:11:21 EST 2012