Homework Assignment 4

BZip Compression

15-211 Summer 2008


Due Wednesday July 30th 2008 at 11:59:59PM

Zip file with source files, documentation, &c.
Browsable version of the files above

Tests:
You are on your own for this one.

For this homework assignment we will implement and experiment with methods for data compression, using two lossless data compression algorithms: Huffman coding and LZW compression. The primary goal of the assignment is to implement the backend for a program called tez (two-eleven zip) that is able to compress and decompress data files using a variety of compression algorithms.

Objectives

Starter Code

Download the starter code for the assignment (zip file). Once you finish your implementation, FTP your Java files to /afs/andrew.cmu.edu/course/15/211/handin

Time Management

Due to the size of this assignment, it is normal to feel somewhat overwhelmed. Make sure to get an early start, and think before you hack! Breaking the assignment up lets you treat it essentially as two smaller assignments, and gives you the flexibility to work on one part while you're still debugging another. Don't wait until the last minute to start on all the parts!

The Parts

This assignment comes in two major parts, the Huffman and LZW sections.

Huffman

Huffman is the first part of this homework assignment and involves implementing plain Huffman compression as discussed in lecture. Before you can get to the Huffman code proper, however, you must first build a priority queue. (There are versions of Huffman that do not require a priority queue, but ours does.) Once this is done, you will build an object-oriented abstraction of the concept of Huffman codes. We have provided a simple Huffman compressor that uses this abstraction. A sketch of the core tree-building loop follows.
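To make the shape of the algorithm concrete, here is a minimal sketch of the classic tree-building loop: repeatedly merge the two lightest subtrees until one tree remains. Two hedges: java.util.PriorityQueue appears here purely for illustration (your submission must use your own queue; see the Huffman Code section below), and this Node type is hypothetical, not the starter code's.

    import java.util.Map;
    import java.util.PriorityQueue;

    // Sketch only: java.util.PriorityQueue stands in for the queue you
    // will write yourself, and this Node type is an assumption.
    class Node implements Comparable<Node> {
        final byte symbol;      // meaningful only for leaves
        final int weight;       // total frequency of this subtree
        final Node left, right; // null for leaves

        Node(byte symbol, int weight) {       // leaf
            this(symbol, weight, null, null);
        }

        Node(Node left, Node right) {         // internal node
            this(left.symbol, left.weight + right.weight, left, right);
        }

        private Node(byte symbol, int weight, Node left, Node right) {
            this.symbol = symbol;
            this.weight = weight;
            this.left = left;
            this.right = right;
        }

        public int compareTo(Node other) {
            return this.weight - other.weight; // lighter subtrees first
        }

        // Build the Huffman tree from a map of symbol frequencies.
        static Node buildTree(Map<Byte, Integer> frequencies) {
            PriorityQueue<Node> pq = new PriorityQueue<Node>();
            for (Map.Entry<Byte, Integer> e : frequencies.entrySet()) {
                pq.add(new Node(e.getKey(), e.getValue()));
            }
            while (pq.size() > 1) {
                // merge the two lightest subtrees into one
                pq.add(new Node(pq.remove(), pq.remove()));
            }
            return pq.remove(); // the root of the Huffman tree
        }
    }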

LZW

The second part of this assignment involves implementing a more sophisticated method for data compression, based on a version of the LZW algorithm. This is similar to one of the components of bzip2 compression, which outperforms gzip and zip, two other extremely popular compression methods. It improves on Huffman coding in some interesting ways to achieve better compression. Details can be found in the lecture notes.

Testing Code

Writing tests is good!

By now you should realize that testing is highly beneficial. Be sure to test the major compress/decompress functionality, and write unit tests for individual parts like the priority queue and the construction of the HuffmanTree. The priority queue should be extremely easy to test (just do it like you did for snake, comparing against the Java version). For the other types, you should make sure that various files round trip.

You will need to generate byte[] arrays and test on those. Two useful methods for getting arrays are String.getBytes("ASCII") and Random.nextBytes. We suggest a few ways of doing this:

  1. Do some cases by hand. You can easily create a few hand test cases, work out exactly how the compressed file should appear, and test against that.
  2. We have provided some original and compressed files for you to test against. You should use the unix "diff" command to check whether your output matches.
  3. Test round trips using Java's Random class. An example is provided for you in HuffmanTest.java. It fills a byte array with random bytes, compresses and decompresses it, and makes sure the original byte array is equal to the decompressed byte array. You may extend this to test your LZW compression as well; a sketch of the same pattern appears after this list.
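A round-trip test in that style might look like the following sketch. The compress/decompress signatures here are placeholders (HuffmanTest.java shows the real entry points); only the overall pattern is the point.

    import java.util.Arrays;
    import java.util.Random;

    // Hedged sketch of a random round-trip test, in the style of the
    // example in HuffmanTest.java. compress() and decompress() are
    // placeholders for whatever your Compressor implementation exposes.
    public class RoundTripSketch {
        public static void main(String[] args) {
            Random random = new Random(42); // fixed seed: failures reproduce
            for (int trial = 0; trial < 100; trial++) {
                byte[] original = new byte[1 + random.nextInt(4096)];
                random.nextBytes(original);
                byte[] restored = decompress(compress(original));
                if (!Arrays.equals(original, restored)) {
                    throw new AssertionError("round trip failed on trial " + trial);
                }
            }
            System.out.println("all round trips passed");
        }

        // Placeholders: wire these to your Huffman (or LZW) compressor.
        static byte[] compress(byte[] data)   { return data; }
        static byte[] decompress(byte[] data) { return data; }
    }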

This is a good time to point out that good code coverage does not necessarily equal a good grade. The few hand cases, if written correctly, can achieve full code coverage of the compression stage; they are, however, insufficient to thoroughly test your code. You should also run local tests against the provided files, and put try-catch blocks around all tests that access files so that the IOExceptions are handled. A sketch follows.
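Something of the following shape keeps the IOException handling contained. The file names are placeholders, not the names of the provided files.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.Arrays;

    // Hedged sketch: compare your compressor's output against a provided
    // compressed file, with all file access wrapped in try-catch.
    public class FileCompareSketch {
        static byte[] readAll(String path) throws IOException {
            File file = new File(path);
            byte[] buf = new byte[(int) file.length()];
            FileInputStream in = new FileInputStream(file);
            try {
                int off = 0;
                while (off < buf.length) {
                    int n = in.read(buf, off, buf.length - off);
                    if (n < 0) throw new IOException("unexpected end of " + path);
                    off += n;
                }
            } finally {
                in.close();
            }
            return buf;
        }

        public static void main(String[] args) {
            try {
                byte[] expected = readAll("provided.huff"); // placeholder name
                byte[] actual   = readAll("yours.huff");    // placeholder name
                System.out.println(Arrays.equals(expected, actual)
                        ? "files match" : "files differ");
            } catch (IOException e) {
                e.printStackTrace(); // every file-touching test handles this
            }
        }
    }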

Huffman Code (45 points)

You will implement HuffmanCode.java as a generic Huffman backend. Feel free to use anything you want from java.util and java.lang, but nothing else. You may not use java.util.PriorityQueue; in fact, because the queue you wrote has the same name, Java will not easily let you use it anyway. You must implement both the compression and expansion sides of Huffman.

Compression

First we consider the process of compression.

The compression sequence starts in a class implementing the Compressor interface. First it reads in the file and builds a frequency count for each data item. It passes these frequencies to the HuffmanCode constructor that takes a Map. The compressor then calls the writeHeader method, which emits the tree recursively, using writer.writeByte(byte) to emit each data item. Finally, the compressor reads each symbol in the file and emits its code word using encode. A sketch of the first step appears below.
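The frequency-counting step might look like this minimal sketch. The Map type matches the HuffmanCode(Map) constructor described above; the class and method names are otherwise assumptions.

    import java.util.HashMap;
    import java.util.Map;

    // Hedged sketch of the first step of compression: building the
    // frequency map that gets handed to the HuffmanCode(Map) constructor.
    public class FrequencySketch {
        static Map<Byte, Integer> countFrequencies(byte[] data) {
            Map<Byte, Integer> freq = new HashMap<Byte, Integer>();
            for (byte b : data) {
                Integer count = freq.get(b);
                freq.put(b, count == null ? 1 : count + 1);
            }
            return freq;
        }
    }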

Expansion

We now consider expansion, the reverse of compression.

First, expand reads the header from the file using the HuffmanCode constructor that takes a BitReader. The constructor reads in the header, calling the BitReader's readByte method whenever it sees a leaf in the file header. Then expand reads each Huffman code and outputs the corresponding symbol. A sketch of the header-reading step follows.
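A preorder traversal with a one-bit leaf/internal flag is one common way to serialize such a tree; under that assumption, reading the header back could look like the sketch below. It reuses the hypothetical Node type from the earlier tree-building sketch. The BitReader interface here is a stand-in: readByte is described in the assignment, readBit is an assumption, and the starter code's interface may differ.

    // Hedged sketch of rebuilding the tree from the header, assuming a
    // preorder encoding where one bit flags leaf vs. internal node.
    interface BitReader {
        int readBit();   // assumed: next single bit, 0 or 1
        byte readByte(); // next eight bits as a byte
    }

    class TreeReader {
        static Node readTree(BitReader in) {
            if (in.readBit() == 1) {               // assumption: 1 marks a leaf
                return new Node(in.readByte(), 0); // symbol; weight unused here
            }
            Node left = readTree(in);   // recurse in exactly the order
            Node right = readTree(in);  // writeHeader emitted the subtrees
            return new Node(left, right);
        }
    }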

The Huffman Compressor

Once you complete HuffmanCode.java, the provided implementation of HuffmanCompressor.java should work. Note the file format we use when compressing/decompressing: first the Huffman header (the tree, output via writeHeader), then the number of symbols in the file, which corresponds to the number of times we need to call encode/decode (for HuffmanCompressor this is just the number of bytes in the file).

One implication of this file format is that when the character set contains just one symbol, we can define its Huffman code to be the empty string (rather than arbitrarily defining it to be 0 or 1). This actually makes things simpler, because we don't have to deal with the special case (at least we shouldn't) where the root is a leaf node. A file of all 'a's also compresses better under this technique: its output is just the header and the symbol count, with no code bits at all. Please test for this and make sure you get the appropriate output.

HuffmanCodeVisualizer

We provide you with HuffmanCodeVisualizer to visualize the tree you are creating.

LZW (35 points)

LZW is a more sophisticated method of compression that relies on patterns in the file instead of just the frequencies of individual bytes. It can lead to drastically better compressed sizes than the Huffman algorithm. When implemented with a trie, it also runs extremely quickly: you only need to read the file once, compressing as you go, and you do not need a header.

In order for us to test that you are encoding and decoding correctly, you need to follow a few LZW conventions:

Be sure to call flush at the end of the compression and decompression methods. Otherwise the file output will be corrupt, among other potentially bad things.

Note for decompression: although the compression uses a trie, the decompression does not have to. In fact, using a trie for decompression would be a mess; just use an array or a hash table to store the decompression dictionary. A sketch of this approach follows.
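Here is a hedged sketch of the classic dictionary-based decode, using a plain list as the dictionary. The initial dictionary contents (all 256 single bytes), the code widths, and the exact conventions are assumptions; defer to the starter code and lecture notes where they differ.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of classic LZW decoding. Input is the sequence of dictionary
    // indices the compressor emitted; output is the decoded byte strings.
    public class LzwDecodeSketch {
        // Return prefix with one byte appended.
        static byte[] append(byte[] prefix, byte b) {
            byte[] out = new byte[prefix.length + 1];
            System.arraycopy(prefix, 0, out, 0, prefix.length);
            out[prefix.length] = b;
            return out;
        }

        static List<byte[]> decode(int[] codes) {
            List<byte[]> dict = new ArrayList<byte[]>();
            for (int i = 0; i < 256; i++) {         // assumption: dictionary
                dict.add(new byte[] { (byte) i });  // starts with all single bytes
            }
            List<byte[]> output = new ArrayList<byte[]>();
            byte[] prev = dict.get(codes[0]);       // first code is always known
            output.add(prev);
            for (int i = 1; i < codes.length; i++) {
                byte[] entry;
                if (codes[i] < dict.size()) {
                    entry = dict.get(codes[i]);
                } else {
                    // the "KwKwK" case: the code refers to the dictionary
                    // entry that is being built right now
                    entry = append(prev, prev[0]);
                }
                output.add(entry);
                dict.add(append(prev, entry[0]));   // prev + first byte of entry
                prev = entry;
            }
            return output;
        }
    }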

Theory Questions (10 points)

These are worth 10 points of your grade. You can find the questions in theory.txt.

Style Points (10 points)

FAQ

Pay special attention to this section. It is a collection of common questions students asked last semester.