Homework Assignment 4

The document distance problem

Document similarities are measured based on the content overlap between documents. With the large number of text documents in our life, there is a need toautomatically process those documents for information extraction, similarity clustering, and search applications. There exist a vast number of complex algorithms to solve this problem. One of such algorithms is a cosine similarity - a vector based similarity measure. The cosine distance of two documents is defined by the angle between their feature vectors which are, in our case, word frequency vectors. The word frequency distribution of a document is a mapping from words to their frequency count.

where "." denotes the dot-product of the two frequency vectors A and B, and ||A|| denotes the length (or norm) of a vector.


Problem Statement

In this lab you are to compute a distance (an angle) between two given documents or between two strings using the cosine similarity metric. You start with reading in the text and counting frequency of each word. The word frequency distribution for the text D is Java's Map from words to their frequency counts, which we'll denote as freq(D) . We view freq(D) as a vector of non-negative integers in N-dimensional space. For example, reading the string "To be or not to be" results in the following map

{be=2, not=1, or=1, to=2}

These 4 distinct words make a document representation as a 4-dimensional vector {2, 1, 1, 2} in term space.

A word is a sequence of letters [a..zA..Z] that might include digits [0..9] and the undescore character. All delimiters are thrown away and not kept as part of the word. Here are examples of words:


We'll treat all upper-case letters as if they are lower-case, so that "CMU" and "cmu" are the same word.

The Euclidean norm of the frequency vector is defined by

where xk denote frequencies for each word in the text. For the above example, the norm is

The Dot product (or the inner product) of two frequency vectors X and Y is defined by

Here we multiply frequencies xk and yk of the same word in both text documents.

Finally, we define a distance between two documents D1 and D2 by using cosine similarity measurement:

Observe, the distance between two identical documents is 0, and the distance is Pi/2 = 1.57... if two documents have no common words.

Your task is to write a program to compute a distance (an angle) between two given documents using the formula above.

Black-box Requirements

Black-box requirements are requirements that your program must conform to. They are described by so-called acceptance tests - tests that define the criteria by which your clients will determine whether the program meets their needs.

We provide you with five acceptance tests that include expected outputs. Those test files enumerated by Test1.java, ..., Test5.java. The last test case is perhaps the most challenging. Your application must conform to all these tests. You are encouraged to create your own test cases to test various edge cases such as empty string or file, two identical files.


Starter Code

Download the starter code for assignment (zip file).

Once you finish implementation, you must turn in only Distance.java file.


Your assignment will be graded first by compiling and testing it. After that, we will read your code to determine whether appropriate method/classes were used, as well as judge the overall style and modularity of your code. Points will be deducted for poor design decisions and un-commented, or un-readable or un-efficient code.