Homework Assignment 4
For this homework assignment we will implement and experiment with methods for data compression, using two lossless data compression algorithms: Huffman coding and LZW compression. The goals of this assignment are to:
- Apply core data structures and algorithms from the 15-211 course, such as priority queues, Huffman trees, and tries, to a real-world problem.
- Gain an understanding of major data compression algorithms, including Huffman coding and LZW compression.
- See why good algorithms matter, not only for good results (in compressing data) but also for getting practical performance.
- Gain experience working with more realistic problems that are larger and open-ended.
Theory Questions: (20 points)
In the file theory4.pdf there are some short theory questions. Please answer the questions in this file and turn them in in class on Friday, June 10.
Download the starter code for the assignment (zip file). Once you finish your implementation, FTP the Java files to /afs/andrew.cmu.edu/course/15/211/handin
What to Submit
FTP the following Java files:
Do not submit anything else: no zip files, folders, desktop shortcuts, or class files. Also make sure you do not use packages in your Java files. Failure to follow these instructions will result in a 20-point penalty.
Due to the size of this assignment, it is normal to feel somewhat overwhelmed. Make sure to get an early start and think before you hack! Breaking it up will allow you to treat this homework assignment essentially as two smaller assignments, and also give you the flexibility to work on one part while you're still debugging another part. Don't wait until the last minute to start on all the parts!
Priority Queue (15 points)
Your first task in this assignment is going to be to implement a priority queue.
As far as implementation goes, insert and deleteMin must run in O(log n) time, so you must implement a binary heap. It is also covered in detail in the lecture notes. You may not use the Java implementation of a priority queue, nor may you fake a priority queue using some sort-based method.
Note that the iterator does not have to return the elements in any specific order. Requiring that would make your life painful, and we have no desire to do that. As with the iterator of Java's own PriorityQueue, just return all the elements in the queue in any order. One more thing: you must write your own iterator. For example, you may not simply create a new ArrayList, insert all the PQ items into it, and return the list's iterator. Any attempt to do so will result in the loss of all iterator points.
One last note: the buildHeap operation (see the PQ constructor) must run in O(n).
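As a rough illustration of the operations above, here is a minimal binary min-heap sketch over plain ints. The class and method names are illustrative only, not the assignment's required interface; your PQ will be generic and must also supply an iterator, which this sketch omits. Note the array constructor, which uses bottom-up buildHeap and runs in O(n).

```java
import java.util.Arrays;

// Minimal binary min-heap sketch (illustrative names, not the required interface).
// insert and deleteMin are O(log n); the array constructor is bottom-up buildHeap, O(n).
public class MinHeap {
    private int[] heap;
    private int size;

    public MinHeap(int capacity) {
        heap = new int[capacity];
        size = 0;
    }

    // buildHeap: copy the items, then sift down from the last internal node to the root.
    public MinHeap(int[] items) {
        heap = Arrays.copyOf(items, items.length);
        size = items.length;
        for (int i = size / 2 - 1; i >= 0; i--) siftDown(i);
    }

    public void insert(int x) {
        if (size == heap.length) heap = Arrays.copyOf(heap, size * 2 + 1);
        heap[size] = x;
        int i = size++;
        // Sift up: swap with the parent while it is larger.
        while (i > 0 && heap[(i - 1) / 2] > heap[i]) {
            swap(i, (i - 1) / 2);
            i = (i - 1) / 2;
        }
    }

    public int deleteMin() {
        int min = heap[0];
        heap[0] = heap[--size];  // move the last element to the root
        siftDown(0);
        return min;
    }

    public int size() { return size; }

    // Sift down: swap with the smaller child while the heap order is violated.
    private void siftDown(int i) {
        while (2 * i + 1 < size) {
            int child = 2 * i + 1;
            if (child + 1 < size && heap[child + 1] < heap[child]) child++;
            if (heap[i] <= heap[child]) break;
            swap(i, child);
            i = child;
        }
    }

    private void swap(int a, int b) { int t = heap[a]; heap[a] = heap[b]; heap[b] = t; }
}
```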
The Huffman Algorithm (25 points)
The Huffman part involves implementing plain Huffman compression as discussed in lecture. However, before you can begin coding Huffman proper, you must first create the priority queue. Once that is done, you will implement HuffmanEncode.java and HuffmanDecode.java.
Let us consider the process of compression. First, the compressor reads the input file and builds a table of frequencies for each byte. It passes these frequencies to the HuffmanTree method that takes a Map. Then the compressor calls the writeHeader method, which emits the tree recursively, using writer.writeByte(byte) to write each data item. Next, it writes the input file size. Finally, the compressor reads each byte of the file and writes the corresponding codeword using encode.
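The frequency-counting step might look like the following sketch. The class and method names are made up for illustration; note that each byte is masked with 0xFF so it is treated as an unsigned value 0-255, in keeping with the "bytes and ints, never chars" rule later in this handout.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the frequency-counting step (illustrative names only).
public class FrequencyCounter {
    public static Map<Integer, Integer> countFrequencies(byte[] data) {
        Map<Integer, Integer> freq = new HashMap<>();
        for (byte b : data) {
            int symbol = b & 0xFF;               // unsigned byte value, 0-255
            freq.merge(symbol, 1, Integer::sum); // increment this symbol's count
        }
        return freq;
    }
}
```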
The header is written as follows. You should use a recursive helper method. To write the header, you start by looking at the root. If the node you are looking at is a parent node (i.e. not a leaf), you output the bit PARENT (a static constant) and then call the method recursively on the node's left child, and then on its right child. If the node you are looking at is a leaf node, you output the bit LEAF (another static constant) and then output the byte stored at that node. You should be able to figure out for yourself how to read the header and use it to rebuild the tree. We specify this header format so that every implementation generates a header of the same size.
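The recursive traversal just described can be sketched as follows. The node class and the BitOutput interface here are stand-ins for the assignment's actual classes (whose bit writer the starter code provides); only the pre-order PARENT/LEAF structure is the point.

```java
// Sketch of the recursive header writer. HuffNode and BitOutput are
// illustrative stand-ins for the assignment's own classes.
public class HeaderWriter {
    static final int PARENT = 0;
    static final int LEAF = 1;

    interface BitOutput {
        void writeBit(int bit);
        void writeByte(int b);
    }

    static class HuffNode {
        int symbol;            // meaningful only for leaves
        HuffNode left, right;  // both null for leaves
        boolean isLeaf() { return left == null && right == null; }
    }

    // Pre-order: PARENT bit then both children, or LEAF bit then the byte.
    static void writeHeader(HuffNode node, BitOutput out) {
        if (node.isLeaf()) {
            out.writeBit(LEAF);
            out.writeByte(node.symbol);
        } else {
            out.writeBit(PARENT);
            writeHeader(node.left, out);
            writeHeader(node.right, out);
        }
    }
}
```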
Now let us consider decompression. First, the decompressor reads the header from the file using the HuffmanTree method that takes a BitReader. That method reads in the header, calling the readByte method of the BitReader whenever it encounters a leaf in the file header. Then the decompressor reads in each Huffman codeword and outputs the corresponding symbol.
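Rebuilding the tree is the mirror image of writing it; a sketch under the same illustrative names (BitInput and the node class are assumptions, not the assignment's real API):

```java
// Sketch of rebuilding the Huffman tree from the header
// (mirror of the recursive writer; names are illustrative).
public class HeaderReader {
    static final int LEAF = 1;

    interface BitInput {
        int readBit();
        int readByte();
    }

    static class HuffNode {
        int symbol;
        HuffNode left, right;
    }

    // Read one bit: LEAF means read the stored byte; PARENT means recurse
    // for the left subtree, then the right subtree.
    static HuffNode readHeader(BitInput in) {
        HuffNode node = new HuffNode();
        if (in.readBit() == LEAF) {
            node.symbol = in.readByte();
        } else {
            node.left = readHeader(in);
            node.right = readHeader(in);
        }
        return node;
    }
}
```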
The Lempel-Ziv-Welch Algorithm (25 points)
The second compression algorithm of this assignment involves implementing a more sophisticated method for data compression, based on a version of the Lempel-Ziv-Welch algorithm. The method automatically builds a dictionary of previously seen strings in the text being compressed. Unlike the Huffman algorithm, the dictionary does not have to be transmitted with the compressed text.
The dictionary starts with 256 ASCII entries, one for each possible byte value. Every time a string is not found in the dictionary, a new entry is created for it. Each codeword in the initial dictionary is 16 bits long. As the dictionary grows, the codeword size increases up to the maximum, which is set to MAX_BIT_WIDTH = 31. Throw an IllegalArgumentException when it exceeds this limit.
It's recommended that you implement compression with a trie. Every byte you read in represents taking one step down the trie; when you "fall off" the trie, you output a codeword and add a new entry to the dictionary. You should use a HashMap in the trie nodes to store the children, since a 256-element array would be a big waste of memory. Feel free to reuse the trie you implemented in the previous assignment; however, your trie needs to be able to handle all 256 ASCII characters. We will not test your trie.
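The "walk down, fall off, emit, grow" loop can be sketched like this. Everything here (class names, the List-of-codes output in place of a real bit writer) is an illustrative assumption; note the HashMap children, the 256 seeded single-byte entries, and the final flush of the pending match.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of trie-based LZW compression (illustrative names; a real solution
// would write variable-width codes with a bit writer instead of a List).
public class LzwCompress {
    static class TrieNode {
        int code;                                         // codeword for the string ending here
        Map<Integer, TrieNode> children = new HashMap<>();
        TrieNode(int code) { this.code = code; }
    }

    static List<Integer> compress(byte[] data) {
        TrieNode root = new TrieNode(-1);
        int nextCode = 0;
        for (int b = 0; b < 256; b++)                     // seed with the 256 single bytes
            root.children.put(b, new TrieNode(nextCode++));

        List<Integer> output = new ArrayList<>();
        TrieNode current = root;
        for (byte raw : data) {
            int b = raw & 0xFF;                           // unsigned byte value
            TrieNode next = current.children.get(b);
            if (next != null) {
                current = next;                           // keep walking down the trie
            } else {
                output.add(current.code);                 // "fell off": emit the codeword,
                current.children.put(b, new TrieNode(nextCode++)); // grow the dictionary,
                current = root.children.get(b);           // and restart at the new byte
            }
        }
        if (current != root) output.add(current.code);    // flush the final match
        return output;
    }
}
```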
When decompressing, just use a HashMap to store the dictionary. Note: you must handle the degenerate case of decompression, which arises when a codeword you read is not yet in the dictionary. There is a specific way to deal with this situation.
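For reference, here is one standard way that degenerate case is handled: the unknown code must be the previous string plus its own first byte. The sketch below (illustrative names, a List of codes in place of a real bit reader) is not the assignment's required structure.

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of LZW decompression with a HashMap dictionary, including the
// degenerate case where a codeword is not yet in the dictionary.
public class LzwDecompress {
    static byte[] decompress(List<Integer> codes) {
        Map<Integer, byte[]> dict = new HashMap<>();
        int nextCode = 0;
        for (int b = 0; b < 256; b++)                     // seed with the 256 single bytes
            dict.put(nextCode++, new byte[]{(byte) b});

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] previous = null;
        for (int code : codes) {
            byte[] entry;
            if (dict.containsKey(code)) {
                entry = dict.get(code);
            } else {
                // Degenerate case: the encoder used a code it only just created,
                // so the string must be previous + first byte of previous.
                entry = concat(previous, previous[0]);
            }
            out.write(entry, 0, entry.length);
            if (previous != null)                         // mirror the encoder's dictionary growth
                dict.put(nextCode++, concat(previous, entry[0]));
            previous = entry;
        }
        return out.toByteArray();
    }

    private static byte[] concat(byte[] prefix, byte last) {
        byte[] result = java.util.Arrays.copyOf(prefix, prefix.length + 1);
        result[result.length - 1] = last;
        return result;
    }
}
```

The two-element input [65, 256] below is exactly the degenerate case: code 256 is requested before the decompressor has defined it.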
The keyword char should never appear in this assignment, anywhere, ever. These classes deal with bytes and ints, not chars. With chars, all sorts of encoding-related issues arise; using raw bytes bypasses them.
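One related pitfall worth a tiny example: a Java byte is signed, so convert it to an int by masking with 0xFF to recover the unsigned value 0-255 (the class name here is made up for illustration).

```java
// A Java byte is signed (-128..127); masking with 0xFF yields the
// unsigned value 0..255 as an int, with no chars involved.
public class ByteDemo {
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }
}
```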
Take the priority queue seriously! Test it thoroughly. It's easy to make a mistake when implementing it. Our test suite is very tough on your code, so it's in your own best interest to be tough on your own code when you test it.
Call writer.flush() at the end of both LZW methods! If you don't, you might get corrupted output data.
Writing tests is good! By now you should realize that testing is highly beneficial. Be sure to test major compress/decompress functionality, as well as to test individual parts like the PQ and building the HuffmanTree. The priority queue should be extremely easy to test (just compare it to the Java version). For the other types, you should try to make sure various files round trip (compress/decompress).
This is a good time to point out that good code coverage does not necessarily equal a good grade. A few hand-written test cases, if written correctly, can achieve full code coverage of the compression stage; they are, however, insufficient to thoroughly test your code.
Point Breakdown: 80 points
Here is the point breakdown for the implementation part:
- Huffman algorithm - 30 points
- Priority queue - 15 points
- LZW algorithm - 30 points
- Coding style - 5 points
Victor S. Adamchik, CMU, 2011