Homework Assignment 5

BZip Compression

15-211 Summer 2010

Due Saturday, July 30, 2010 at 11:59:59pm

Overview

For this homework assignment we will implement and experiment with methods for data compression, using two lossless data compression algorithms: Huffman coding and Burrows-Wheeler compression.

Objectives

Theory Questions: 15 points

The file theory.pdf contains some short theory questions. Please answer the questions in that file and turn your answers in during class.

Starter Code

Download the starter code for the assignment (zip file). Once you finish your implementation, FTP your Java files to /afs/andrew.cmu.edu/course/15/211-kesden/handin

What to Submit

FTP the following Java files:

Do not submit anything else, such as zip files, folders, desktops, or class files, and make sure you do not use packages in your Java files. Failure to follow these instructions will result in a 20-point penalty.

Time Management

Due to the size of this assignment, it is normal to feel somewhat overwhelmed. Make sure to get an early start and think before you hack! Breaking the assignment into its parts will allow you to treat it essentially as two smaller assignments, and will also give you the flexibility to work on one part while you're still debugging another. Don't wait until the last minute to start on all the parts!

The Parts

This assignment comes in two major parts, the Huffman and Burrows-Wheeler sections.

Huffman

Huffman is the first part of this homework assignment and involves implementing plain Huffman compression as discussed in lecture. Before you can start on the Huffman code proper, however, you must first create a priority queue. (There are versions of Huffman that do not require a priority queue, but ours does.) Once this is done, you will build an object-oriented abstraction of Huffman codes. We have created a simple Huffman compressor that uses this abstraction.

Burrows-Wheeler

The second part of this assignment involves implementing a more sophisticated method for data compression, based on a version of the Burrows-Wheeler algorithm. This is similar to one of the components in bzip2 compression, which outperforms gzip and zip, two other extremely popular compression methods. It extends Huffman coding in some interesting ways to achieve better compression. Details can be found in the lecture notes.

Testing Code

Writing tests is good!

By now you should realize that testing is highly beneficial. Be sure to test the major compress/decompress functionality, and write unit tests for individual parts such as the priority queue and the construction of the HuffmanTree. The priority queue should be extremely easy to test (do it as you did for snake, comparing against the Java implementation). For the other classes, you should make sure that various files survive a round trip.

You will need to generate byte[] arrays and test on those. Two useful methods for obtaining such arrays are String.getBytes("ASCII") and Random.nextBytes. We suggest a few ways of testing, listed below; a small code sketch follows the list.

  1. Do some hand cases. You can easily create a few hand test cases to figure out exactly how a compressed file would appear, and test against that.
  2. We have provided some original and compressed files for you to test against. You should use the "diff" command in unix to test if they are the same.
  3. Test roundtrip using Java's Random class. An example is provided for you in HuffmanTest.java. This method fills a byte array with random bytes, compresses and decompresses it, and makes sure the original byte array is equal to the decompressed byte array. You may extend this to also test your Burrows-Wheeler compression as necessary.
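
As a concrete illustration, here is a minimal sketch of generating test arrays with the two methods mentioned above. The compress/decompress calls are left as comments because their exact signatures depend on your HuffmanEncode and HuffmanDecode classes.

    import java.util.Random;

    public class TestDataSketch {
        public static void main(String[] args) throws Exception {
            // A small hand case: the bytes of a known ASCII string.
            byte[] handCase = "abracadabra".getBytes("ASCII");

            // A larger random case, seeded so that failures are reproducible.
            byte[] randomCase = new byte[4096];
            new Random(211).nextBytes(randomCase);

            // A round trip should reproduce the original bytes exactly, e.g.:
            //   byte[] out = decompress(compress(handCase));
            //   assert java.util.Arrays.equals(handCase, out);
            System.out.println(handCase.length + " " + randomCase.length);
        }
    }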

This is a good time to point out that good code coverage does not necessarily equal good grades. A few hand cases, if written correctly, can achieve full code coverage of the compression stage; they are, however, insufficient to thoroughly test your code. You should also run local tests against the provided files, and put try-catch blocks around any tests that access files so that the IOExceptions are handled.

Priority Queue (15 points)

Your first task in this assignment is going to be to implement a priority queue.

As far as implementation goes, insert and deleteMin must run in O(log n) time, so you must implement a binary heap (covered in detail in the lecture notes). You may not use Java's implementation of PriorityQueue, nor may you fake a priority queue by repeatedly sorting, which does not meet the O(log n) bound.

Note that the iterator does not have to return the elements in any specific order. Requiring that would make your life painful, and we have no desire to do so; as with the iterator of Java's own PriorityQueue, just return all elements in the queue in some order. You must, however, write your own iterator: you may not simply create a new ArrayList, insert all the PQ items into it, and return the list's iterator. Doing so will result in the loss of all the iterator points.

One last note: the buildHeap operation (see the PQ constructor) must run in O(n) time.
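
To give a feel for the O(n) bound, here is a minimal sketch of bottom-up heap construction over a plain int array; your actual class must of course match the interface and element type required by the starter code.

    /** Sketch of bottom-up min-heap construction; O(n) total work. */
    public class BuildHeapSketch {
        private final int[] heap;

        public BuildHeapSketch(int[] items) {
            heap = items.clone();
            // Sift down every non-leaf, starting from the last internal node.
            for (int i = heap.length / 2 - 1; i >= 0; i--) {
                siftDown(i);
            }
        }

        private void siftDown(int i) {
            int n = heap.length;
            while (2 * i + 1 < n) {                    // while i has at least one child
                int child = 2 * i + 1;
                if (child + 1 < n && heap[child + 1] < heap[child]) {
                    child++;                           // pick the smaller child
                }
                if (heap[i] <= heap[child]) break;     // heap property already holds
                int tmp = heap[i]; heap[i] = heap[child]; heap[child] = tmp;
                i = child;
            }
        }
    }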

Huffman Code (25 points)

You will implement HuffmanEncode.java and HuffmanDecode.java as a generic Huffman backend.

First, consider compression. The compressor reads in the file and builds a table of frequencies for each byte (i.e., each character of the input). It passes these frequencies to the HuffmanTree method that takes a Map. The compressor then calls the writeHeader method, which emits the tree recursively, using writer.writeByte(byte) to emit each data item. Finally, the compressor reads each symbol in the file and emits its code word using encode.
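
In outline, the compression side might look roughly like the sketch below. The names HuffmanTree, writeHeader, and encode come from the description above, but BitWriter, the assumption that the Map-taking HuffmanTree entry point is a constructor, and the exact signatures are guesses; follow the starter code's actual API.

    import java.util.HashMap;
    import java.util.Map;

    // Sketch only: BitWriter and the exact signatures are placeholders.
    void compress(byte[] input, BitWriter writer) {
        // 1. Count the frequency of each byte value in the input.
        Map<Byte, Integer> freq = new HashMap<Byte, Integer>();
        for (byte b : input) {
            Integer count = freq.get(b);
            freq.put(b, count == null ? 1 : count + 1);
        }

        // 2. Build the Huffman tree from the frequencies and emit the header,
        //    which writes the tree recursively via writer.writeByte(byte).
        HuffmanTree tree = new HuffmanTree(freq);
        tree.writeHeader(writer);

        // 3. Emit the code word for each symbol in the input.
        for (byte b : input) {
            tree.encode(b, writer);
        }
    }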

Now consider decompression. The decompressor first reads the header from the file using the HuffmanTree method that takes a BitReader; that method reconstructs the tree, calling the BitReader's readByte method whenever it encounters a leaf in the header. The decompressor then reads each Huffman code word and outputs the corresponding symbol.
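
The decompression side mirrors this. Again a rough sketch: the decode method name and the question of how the decompressor knows when to stop (a symbol count, a sentinel, or the end of the bit stream) are assumptions to be replaced by whatever the starter code specifies.

    // Sketch only: BitReader/BitWriter usage and decode() are placeholders.
    void decompress(BitReader reader, BitWriter writer, int symbolCount) {
        // 1. Rebuild the tree from the header; this entry point calls
        //    reader.readByte() whenever it reaches a leaf marker.
        HuffmanTree tree = new HuffmanTree(reader);

        // 2. Read one code word at a time and emit the decoded byte.
        for (int i = 0; i < symbolCount; i++) {
            writer.writeByte(tree.decode(reader));
        }
    }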

The Burrows-Wheeler Transform (20 points)

The purpose of the Burrows-Wheeler transform is to convert a sequence of bytes into another sequence of bytes that has high locality of reference -- that is, one which tends to have long blocks of characters drawn from a small set. (Long runs of repeated characters are a special case.) The transform is also invertible, so that the original sequence can be recovered. The combination of the Burrows-Wheeler transform, followed by the move-to-front transform, followed by Huffman encoding, is called the Burrows-Wheeler data-compression algorithm. The algorithm is efficient, gives quite good compression, and is entirely in the public domain.

We do not expect you to implement O(n)-time algorithms for these transformations (except for extra credit); however, your algorithms should use O(n) space. In Java there are good and bad ways to achieve this. Think before coding, and check the lecture slides for more details on the BWT and some hints on how to implement it.
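
One way to respect the O(n) space bound without materializing every rotation is to sort rotation start indices with a comparator that indexes the input cyclically. The following is a sketch of that idea only, not the required interface, and it leaves out how the starter code records which row of the sorted rotations holds the original string.

    import java.util.Arrays;
    import java.util.Comparator;

    public class BwtSketch {
        /** Returns the last column of the sorted rotation matrix. */
        static byte[] transform(final byte[] input) {
            final int n = input.length;

            // Sort the start index of each rotation rather than the rotations
            // themselves, so only O(n) extra data is ever held in memory.
            Integer[] rot = new Integer[n];
            for (int i = 0; i < n; i++) rot[i] = i;

            Arrays.sort(rot, new Comparator<Integer>() {
                public int compare(Integer a, Integer b) {
                    for (int k = 0; k < n; k++) {
                        int ca = input[(a + k) % n] & 0xFF;  // compare as unsigned bytes
                        int cb = input[(b + k) % n] & 0xFF;
                        if (ca != cb) return ca - cb;
                    }
                    return 0;   // identical rotations
                }
            });

            // The output is the byte just before each rotation's start, i.e.
            // the last column of the sorted rotation matrix. (A real
            // implementation must also record which output row holds the
            // original string, or the transform cannot be inverted.)
            byte[] last = new byte[n];
            for (int i = 0; i < n; i++) {
                last[i] = input[(rot[i] + n - 1) % n];
            }
            return last;
        }
    }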

Move-to-front transform (15 points)

MoveToFrontTransformer.java implements the move-to-front (MTF) transform. The MTF transform is the second step of Burrows-Wheeler compression, after the Burrows-Wheeler transform (BWT). The idea of the MTF transform is to scan the input data and convert each symbol into an integer such that the transformed data is more suitable for Huffman compression. The MTF transform starts with a list of all symbol types in the data; when transforming data byte-by-byte, this symbol list contains all 256 possible byte values. While scanning through the data, do the following for each symbol c in the input (a code sketch follows the list):

  1. find c in the sorted symbol list
  2. output the index i of symbol c in the sorted list as its transformation
  3. move c to the front of the list, shifting the symbols at positions 0 through i-1 to positions 1 through i
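
A minimal sketch of this loop over raw bytes is shown below; the real MoveToFrontTransformer will have its own interface (and may emit bytes rather than ints), so treat this only as an illustration of the three steps.

    /** Sketch of the forward MTF transform over all 256 byte values. */
    static int[] moveToFront(byte[] input) {
        int[] symbols = new int[256];
        for (int i = 0; i < 256; i++) symbols[i] = i;   // sorted symbol list 0..255

        int[] output = new int[input.length];
        for (int pos = 0; pos < input.length; pos++) {
            int c = input[pos] & 0xFF;          // treat the byte as unsigned

            int i = 0;
            while (symbols[i] != c) i++;        // 1. find c in the list
            output[pos] = i;                    // 2. output its index

            // 3. shift symbols 0..i-1 up by one and move c to the front
            System.arraycopy(symbols, 0, symbols, 1, i);
            symbols[0] = c;
        }
        return output;
    }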

The Invert Move-To-Front transform works in almost the same way as the MTF transform, except that the input is now the integer sequence produced by the MTF transform. The Invert MTF transform starts with the same initial symbol list used in MTF. While scanning through the data, do the following for each integer value i in the input (a sketch follows the list):

  1. output the i-th symbol in the sorted symbol list as the inverted transformation
  2. move the i-th symbol to the front of the list, shifting the symbols at positions 0 through i-1 to positions 1 through i
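
Correspondingly, a sketch of the inverse loop (again just an illustration; match the actual class's interface):

    /** Sketch of the inverse MTF transform; undoes the forward sketch above. */
    static byte[] invertMoveToFront(int[] indices) {
        int[] symbols = new int[256];
        for (int i = 0; i < 256; i++) symbols[i] = i;   // same initial sorted list

        byte[] output = new byte[indices.length];
        for (int pos = 0; pos < indices.length; pos++) {
            int i = indices[pos];
            int c = symbols[i];                  // 1. output the i-th symbol
            output[pos] = (byte) c;

            // 2. shift symbols 0..i-1 up by one and move that symbol to the front
            System.arraycopy(symbols, 0, symbols, 1, i);
            symbols[0] = c;
        }
        return output;
    }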

Coding Style: 10 points

Your assignment will be graded first by compiling and testing it. After that, we will read your code to determine whether appropriate data structures and algorithms were used, and to judge its overall style. Programs will be graded both on correctness and on their ability to deal with edge cases, such as empty lists, lists with only one element, and so on. Points will be deducted for poor design decisions and for uncommented or unreadable code, as determined by your TA.

General Advice

Think beforehand about how the trie can precompute the answers to a number of the methods that you are asked to implement. By leveraging the trie more completely, you can reduce the amount of work the other methods have to do.

Also, your trie should not cause you to run out of memory, even without increasing the Java heap size. Think about ways to design your trie nodes so they do not waste space; if you run out of heap, you are doing something wrong.