Lab #2: Dynamic Programming


Due: Wednesday, July 16th, 2008 at 11:59PM


Zip file with source files, documentation, &c.
Browsable version of the files above

Tests:
You guys get to write your own tests this time. Good luck, and be sure to make them hard!

This lab will give you a chance to explore applications of dynamic programming in creating efficient algorithms.

Part 1: Subsequence Counting (40 Points)

For this part, you want to count the number of times that a sequence X appears as a subsequence of a sequence Y.

Part 1.1 (15 Points)

Before coding, you should have a good idea what you will be writing. The code for this problem is extremely short, but only if you know what to write. Work out on paper what your DP solution looks like. Submit a paper version to the TA in class or include a an electronic version with your final solution by the due date.

Part 1.2 (25 Points)

Write methods

public static int countSubsequences(byte[] x, byte[] y)
public static Set<List<Integer>> getAllSubsequences(byte[] x, byte[] y)

Part 2: Edit Distance (50 points)

In this problem we will measure the similarity of two strings by their edit distance, a concept that is widely used in spell checking, speech recognition, plagiarism detection, file revisioning, and computational linguistics. The edit distance measures the number of editing operations it would be necessary to perform to transform the first string into the second. The possible operations are as follows:

Here is an example of transformation from activate to caveat:

D M R D M I M M D
a c t i v   a t e
  c a   v e a t

The edit distance is the minimum number of edit operations, excluding matches. It is 5 in this example.

Part 2.1: Edit Cost (30 points)

Write the method:

public static int getEditCost (String seq1, String seq2)

that accepts two strings and returns the minimal edit cost. The empty string is considered valid input. If you are careful, you don't have to handle this as a special case. Place this method in the class EditDistance.

Part 2.2: Recovering the Alignment (20 points)

The above procedure describes how to compute the edit distance between two strings. We need to recover the optimal alignment itself. The key idea is to retrace the steps of the dynamic programming algorithm backwards. You will write the method:

public static String getTransformation (String seq1, String seq2)

that accepts two strings and returns the actual transformation. The output is a string of characters D, I, R and M that transforms the first string into the second. Place this method inside the EditDistance class as well.

Sometimes it is possible to have multiple transformations of minimal length. Consider the transformation from abba to aba:

M D M M     M M D M
a b b a     a b b a
a   b a     a   b a

In this case, you can return any valid transformation.

Part 2.3: All Transformations (10 points extra credit)

Write a method

public static Set getAllTransformations (String seq1, String seq2)

that accepts two strings and returns a Set of all possible transformations, as described in problem 2.2. Place this method inside the EditDistance class as well.

Style (10 Points)

It should go without saying that your submission must have good style. This is needed so that the TAs can understand your code and give you partial credit. It's also a good habit.

General Advice

As always, start early. This lab requires relatively little code, yet is somewhat conceptually difficult for beginning dynamic programmers. Office hours will be most valuable for you if you come with a worksheet in hand. For problem 1, this would be the handout for part A, for the other problems, it would be a subset of this page. This lab is best done with some thought prior to coding as to what the appropriate helper functions, variables should be.

Grading

Your assignment will be graded first by compiling and testing it. After that, we will read your code to determine whether appropriate data structures and algorithms were used, as well as judge the overall style of your code. Programs will be graded both on correctness as well as their ability to deal with so-called edge cases, like empty lists, lists with only one element, and so on. Points will be deducted for poor design decisions and un-commented, or un-readable code as determined by your TA.