Homework Assignment 4
BZip Compression
15-211 Summer 2008
Due Wednesday July 30th 2008 at 11:59:59PM
Zip file with source files, documentation, &c.
Browsable version of the files above
Tests:
You are on your own for this one.
For this homework assignment we will implement and experiment with methods for data compression, using two lossless data compression algorithms: Huffman coding and LZW compression. The primary goal of the assignment is to implement the backend for a program called tez (two-eleven zip) that is able to compress and decompress data files using a variety of compression algorithms.
Objectives
- Apply core data structures and algorithms from the 15-211 course, such as priority queues, Huffman trees, and sorting, to a real-world problem.
- Gain an understanding of (basic versions of) major data compression algorithms, including Huffman coding and LZW compression.
- See why good algorithms matter, not only for good results (in compressing data) but also for getting practical performance.
- Gain experience working with more realistic problems that are larger and open-ended, with black-box specifications and requirements.
Starter Code
Download the starter code for the assignment (zip file). Once you finish your implementation, FTP your Java files to /afs/andrew.cmu.edu/course/15/211/handin
Time Management
Due to the size of this assignment, it is normal to feel somewhat overwhelmed. Make sure to get an early start and think before you hack! Breaking it up will allow you to treat this homework assignment essentially as two smaller assignments, and also give you the flexibility to work on one part while you're still debugging another part. Don't wait until the last minute to start on all the parts!
The Parts
This assignment comes in two major parts, the Huffman and LZW sections.
Huffman
Huffman is the first part of this homework assignment and involves implementing plain Huffman compression as discussed in lecture. However, before you can start on the Huffman code proper, you must first create a priority queue. (Though there are versions of Huffman that do not require priority queues, ours does.) Once this is done, you will build an object-oriented abstraction of Huffman codes. We have created a simple Huffman compressor that uses this abstraction.
LZW
The second part of this assignment involves implementing a more sophisticated method for data compression, based on a version of the LZW algorithm. This is similar in spirit to the components of bzip2 compression, which outperforms gzip and zip, two other extremely popular compression methods. LZW goes beyond per-symbol frequency counting, exploiting repeated patterns in the data to achieve better compression. Details can be found in the lecture notes.
Testing Code
Writing tests is good!
By now you should realize that testing is highly beneficial. Be sure to test the major compress/decompress functionality, and write unit tests for individual parts like the priority queue and the building of the HuffmanTree. The priority queue should be extremely easy to test (just as you did for snake: compare against the Java version). For the other types, you should make sure that various files round-trip correctly.
You will need to generate byte[] arrays and test on those. Two useful methods for getting arrays are String.getBytes("ASCII") and Random.nextBytes. We suggest a few ways of doing this.
- Do some hand cases. You can easily create a few hand test cases to figure out exactly how a compressed file would appear, and test against that.
- We have provided some original and compressed files for you to test against. You should use the "diff" command in unix to test if they are the same.
- Test round trips using Java's Random class. An example is provided for you in HuffmanTest.java. This method fills a byte array with random bytes, compresses and decompresses it, and makes sure the original byte array is equal to the decompressed one. You may extend it to test your LZW compression as well.
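The round-trip idea above can be sketched as follows. This is illustration only: the ByteCodec interface and its compress/decompress signatures are stand-ins for the starter code's actual Compressor API, not the required names.

```java
import java.util.Arrays;
import java.util.Random;

// Round-trip test sketch: ByteCodec is a hypothetical stand-in for the
// assignment's compress/decompress pair, not the real interface name.
public class RoundTrip {
    interface ByteCodec {
        byte[] compress(byte[] input);
        byte[] decompress(byte[] input);
    }

    // Fill a buffer with random bytes, push it through compress then
    // decompress, and check the result matches the original exactly.
    static boolean roundTrips(ByteCodec codec, int size, long seed) {
        byte[] original = new byte[size];
        new Random(seed).nextBytes(original);      // reproducible test data
        byte[] restored = codec.decompress(codec.compress(original));
        return Arrays.equals(original, restored);
    }

    public static void main(String[] args) {
        // An identity codec just demonstrates the harness wiring.
        ByteCodec identity = new ByteCodec() {
            public byte[] compress(byte[] in)   { return in.clone(); }
            public byte[] decompress(byte[] in) { return in.clone(); }
        };
        System.out.println(roundTrips(identity, 4096, 211L)); // prints "true"
    }
}
```

Using a fixed seed keeps failures reproducible, which matters when a random input exposes a bug.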
This is a good time to point out that good code coverage does not necessarily equal good grades. The few hand cases, if written correctly, can achieve full code coverage of the compression stage. They are, however, insufficient to thoroughly test your code. You should also run local tests against the provided files. Put try-catch blocks around all tests that access files, to handle the IOExceptions.
Huffman Code (45 points)
You will implement HuffmanCode.java as a generic Huffman backend. Feel free to use anything you want from java.util and java.lang, but nothing else. You may not use java.util.PriorityQueue; in fact, because the queue you wrote has the same name, Java will not easily let you. You must implement both compression and expansion for Huffman.
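The core tree-building step can be sketched as below. Note this uses java.util.PriorityQueue purely for illustration, which the assignment forbids in your submission: your version must use the queue you wrote. The Node type is hypothetical.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.PriorityQueue;

// Sketch of Huffman tree construction. PriorityQueue is used here only
// to illustrate the algorithm; the assignment requires your own queue.
public class HuffmanSketch {
    static class Node {
        final int freq; final Byte symbol; final Node left, right;
        Node(int f, Byte s, Node l, Node r) { freq = f; symbol = s; left = l; right = r; }
    }

    static Node buildTree(Map<Byte, Integer> freqs) {
        PriorityQueue<Node> pq =
            new PriorityQueue<>(Comparator.comparingInt((Node n) -> n.freq));
        for (Map.Entry<Byte, Integer> e : freqs.entrySet())
            pq.add(new Node(e.getValue(), e.getKey(), null, null)); // one leaf per symbol
        while (pq.size() > 1) {
            Node a = pq.poll(), b = pq.poll();             // two lowest frequencies
            pq.add(new Node(a.freq + b.freq, null, a, b)); // merged internal node
        }
        return pq.poll(); // the root (null if the frequency map was empty)
    }
}
```

If the map holds a single symbol, the loop never runs and the root is itself a leaf, which connects to the empty-code-string convention discussed later.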
Compression
First we consider the process of compression.
The compression sequence starts in a class implementing the Compressor interface. First it reads in the file and builds a frequency count for each data item. It passes these frequencies to the HuffmanCode constructor that takes a Map. The compressor then calls the writeHeader method, which emits the tree recursively, using writer.writeByte(byte) to emit each data item. Finally, the compressor reads each symbol in the file and emits its code word using encode.
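The first step, counting frequencies, might look like this (a minimal sketch, with the byte[] standing in for the file contents):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the first compression step: tallying how often each
// byte value occurs, producing the Map passed to HuffmanCode.
public class FrequencyCount {
    static Map<Byte, Integer> countFrequencies(byte[] data) {
        Map<Byte, Integer> freqs = new HashMap<>();
        for (byte b : data)
            freqs.merge(b, 1, Integer::sum); // start at 1, then increment
        return freqs;
    }
}
```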
Expansion
We now consider expansion, the reverse of compression.
First, expand reads the header from the file using the HuffmanCode constructor that takes a BitReader. The constructor reads in the header, calling the readByte method of the BitReader whenever it sees a leaf in the file header. Then expand reads in each Huffman code and outputs the corresponding symbol.
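Decoding one symbol amounts to walking the tree bit by bit. The sketch below simplifies the bit source to a boolean array with a cursor; the real code would pull bits from the provided BitReader instead, and the Node type is again hypothetical.

```java
// Sketch of decoding a single symbol: follow 0 left and 1 right
// from the root until a leaf is reached, then emit its symbol.
public class DecodeSketch {
    static class Node {
        final Byte symbol; final Node left, right;
        Node(Byte s, Node l, Node r) { symbol = s; left = l; right = r; }
        boolean isLeaf() { return left == null && right == null; }
    }

    // pos is a one-element cursor so the caller can decode symbol
    // after symbol from the same bit stream.
    static byte decodeOne(Node root, boolean[] bits, int[] pos) {
        Node cur = root;
        while (!cur.isLeaf())
            cur = bits[pos[0]++] ? cur.right : cur.left;
        return cur.symbol;
    }
}
```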
The Huffman Compressor
Once you complete HuffmanCode.java, the provided implementation for HuffmanCompressor.java should work. Note the file format we use when compressing/decompressing files. We first encode the Huffman header (the tree, written via writeHeader). Then we encode the number of symbols in the file, which corresponds to the number of times we need to call encode/decode (for HuffmanCompressor, this is just the number of bytes in the file).
One implication of this file format is that when the character set contains just one symbol, we can define its Huffman code to be the empty string (rather than arbitrarily defining it to be 0 or 1). This actually makes things simpler, because we don't have to deal with a special case (at least we shouldn't) when the root is a leaf node. A file of all 'a's also compresses better under this scheme. Please test for this and make sure you get the appropriate output.
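The empty-string convention falls out naturally of a recursive code assignment, as this sketch shows (Node and the method names are illustrative, not prescribed):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of building the code table by walking the tree, appending
// "0" going left and "1" going right. If the root is itself a leaf,
// the recursion bottoms out immediately and its code is "".
public class CodeTable {
    static class Node {
        final Byte symbol; final Node left, right;
        Node(Byte s, Node l, Node r) { symbol = s; left = l; right = r; }
    }

    static Map<Byte, String> buildCodes(Node root) {
        Map<Byte, String> codes = new HashMap<>();
        assign(root, "", codes);
        return codes;
    }

    private static void assign(Node n, String prefix, Map<Byte, String> codes) {
        if (n.left == null && n.right == null) {
            codes.put(n.symbol, prefix);      // a lone root gets the empty string
        } else {
            assign(n.left,  prefix + "0", codes);
            assign(n.right, prefix + "1", codes);
        }
    }
}
```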
HuffmanCodeVisualizer
We provide you with HuffmanCodeVisualizer to visualize the tree you are creating.
LZW (35 points)
LZW is a more sophisticated method of compression that relies on patterns in the file instead of just the frequencies of individual bytes. It can lead to drastically better compression than the Huffman algorithm. When implemented with a trie, it also runs extremely quickly: you only need to read the file once, compressing as you go, and you do not need a header.
In order for us to test that you are encoding and decoding correctly, you need to follow a few LZW conventions:
- You should implement your LZW algorithm with a trie. Although this will not affect testing, it is faster, cleaner, and more efficient. This shouldn't be too hard, considering you already implemented tries in the last assignment. Moreover, you may use java.util.HashMap in your trie nodes (to represent the children), which makes it even simpler. The trie is not a requirement, but it is strongly suggested.
- Your LZW codes must use a fixed-width 16-bit encoding. We promise that we will not require you to correctly compress or decompress any file that would need more codes than a 16-bit encoding allows, which also means you never need to flush the LZW dictionary. That said, your program should detect invalid inputs and throw an exception (see the javadocs). One such condition arises when, while reading codes in decompress, you see a code that isn't in the dictionary. If that code is exactly the next code you were about to insert into the dictionary, everything is fine; if the code is greater than nextCode, you must throw an IllegalArgumentException. This may seem like a royal pain, but failing to do such checks has caused grave security issues in the real world.
- You must encode the initial dictionary in the following way. The dictionary will contain 256 entries, one for each possible byte value. The codes will be 16-bit zero-extended integers, one for each byte. So, for example, you will encode the byte value 01100001 (which corresponds to the letter a) as 0000000001100001. Just create 16-bit integer values from 0 to 255 and map them directly to the unsigned bytes (00000000, 00000001, ..., 11111111) they represent. While this may seem obvious, there are other initial dictionaries that are valid for LZW but will cause you to fail our tests.
- The codes are fixed-width 16-bit integers, so to write them to a file, use writeBits(code, 16). Don't try to use writeInt, or writeByte twice; it will just become a mess.
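A HashMap-based trie node, as suggested above, might look like the sketch below. The field and method names are illustrative, not part of the spec.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of an LZW trie node: each node represents one dictionary
// string, holds its code, and maps each extending byte to a child.
public class TrieNode {
    final int code;                               // dictionary code for this string
    final Map<Byte, TrieNode> children = new HashMap<>();

    TrieNode(int code) { this.code = code; }

    // Follow one byte; returns null when the extended string
    // is not yet in the dictionary (time to emit and insert).
    TrieNode child(byte b) { return children.get(b); }

    void addChild(byte b, int newCode) { children.put(b, new TrieNode(newCode)); }
}
```

During compression you walk down the trie as long as children exist; when child returns null, you emit the current node's code, add the new string, and restart from the root.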
Be sure to call flush at the end of the compression and decompression methods. Otherwise, the output file will be corrupt, among other problems.
Note for decompression: Although the compression uses tries, the decompression does not have to. In fact, using a trie for decompression would be a mess. Just use an array or a hash table to store the decompression dictionary.
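An array-backed decompression dictionary, as suggested in the note above, might be sketched like this; it also folds in the invalid-code check from the LZW conventions. Names are illustrative.

```java
// Sketch of an array-backed LZW decompression dictionary: 16-bit
// codes index directly into a table of byte strings. Codes 0..255
// are preloaded with the single byte each one represents.
public class LzwDecodeDict {
    private final byte[][] table = new byte[1 << 16][]; // one slot per 16-bit code
    private int nextCode = 256;                         // next slot to be filled

    LzwDecodeDict() {
        for (int c = 0; c < 256; c++)
            table[c] = new byte[]{ (byte) c };          // initial single-byte entries
    }

    void add(byte[] entry) { table[nextCode++] = entry; }

    byte[] lookup(int code) {
        if (code > nextCode)                            // corrupt input, per the spec
            throw new IllegalArgumentException("invalid LZW code: " + code);
        return table[code];  // null only in the code == nextCode case,
                             // which the caller handles specially
    }
}
```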
Theory Questions (10 points)
These are worth 10 points of your grade. You can find the questions in theory.txt.
Style Points (10 points)
FAQ
Pay special attention to this section. This is a collection of common questions students had last semester.
- Q: How many elements should my PQ hold before it runs out of memory?
- A: At least 100k items (ints).
- Q: I'm running out of heap space on some of the larger files. Is there any way I can fix this?
- A: First make sure your code is correct :). For some of the larger files (like theorygirl.mp3) you may need to increase the JVM heap size. You can do this by using the JVM arguments -Xms512M -Xmx512M.
- Q: I've written good unit tests, and matched all the staff files exactly. So why am I still failing?
- A: Make sure your compressors are stateless. This is mentioned in a previous section of the docs.