15-121: Document Distance

Homework Assignment 4
The document distance problem

Document similarities are measured based on the content overlap between documents. With the large number of text documents in our life, there is a need toautomatically process those documents for information extraction, similarity clustering, and search applications. There exist a vast number of complex algorithms to solve this problem. One of such algorithms is a cosine similarity - a vector based similarity measure. The cosine distance of two documents is defined by the angle between their feature vectors which are, in our case, word frequency vectors. The word frequency distribution of a document is a mapping from words to their frequency count.

where "." denotes the dot-product of the two frequency vectors A and B, and ||A|| denotes the length (or norm) of a vector.

Objectives

Gain experience in working with List, Set and Map interfaces.
Gain experience working with more realistic problems that are open-ended, with black- box specifications and requirements.

Problem Statement

In this lab you are to compute a distance (an angle) between two given documents or between two strings using the cosine similarity metric. You start with reading in the text and counting frequency of each word. The word frequency distribution for the text D is Java's Map from words to their frequency counts, which we'll denote as freq(D) . We view freq(D) as a vector of non-negative integers in N-dimensional space. For example, reading the string "To be or not to be" results in the following map

{be=2, not=1, or=1, to=2}

These 4 distinct words make a document representation as a 4-dimensional vector {2, 1, 1, 2} in term space.

A word is a sequence of letters [a..zA..Z] that might include digits [0..9] and the undescore character. All delimiters are thrown away and not kept as part of the word. Here are examples of words:

abcd
abcd12
abc_
a12cd
15111

We'll treat all upper-case letters as if they are lower-case, so that "CMU" and "cmu" are the same word.

The Euclidean norm of the frequency vector is defined by

where x_k denote frequencies for each word in the text. For the above example, the norm is

The Dot product (or the inner product) of two frequency vectors X and Y is defined by

Here we multiply frequencies x_k and y_k of the same word in both text documents.

Finally, we define a distance between two documents D₁ and D₂ by using cosine similarity measurement:

Observe, the distance between two identical documents is 0, and the distance is Pi/2 = 1.57... if two documents have no common words.

Your task is to write a program to compute a distance (an angle) between two given documents using the formula above.

Black-box Requirements

Black-box requirements are requirements that your program must conform to. They are described by so-called acceptance tests - tests that define the criteria by which your clients will determine whether the program meets their needs.

We provide you with five acceptance tests that include expected outputs. Those test files enumerated by Test1.java, ..., Test5.java. The last test case is perhaps the most challenging. Your application must conform to all these tests. You are encouraged to create your own test cases to test various edge cases such as empty string or file, two identical files.

Hints

How to read words from a file? There are a few different ways to identify words, however the simplest one is to use the split() method along with a regular expression processor, namely split("\\W").
Only consider words of length greater than 0! Therefore, check that each string has non-zero length before declaring that it is a word.
How to store words and their multiplicities? Before you insert a word into a map, you have to query the map. The get() method returns either null or the current frequency.
Be clever when you implement the innerProduct() method; try to avoid quadratic runtime implementation.
Be aware that processing large files might lead to an integer overflow, which will cause an incorrect distance between two documents. Your code must be able to handle all given input files.
Start early. Plan your time accordingly.

Starter Code

Download the starter code for assignment (zip file).

Once you finish implementation, you must turn in only Distance.java file.

Grading

Your assignment will be graded first by compiling and testing it. After that, we will read your code to determine whether appropriate method/classes were used, as well as judge the overall style and modularity of your code. Points will be deducted for poor design decisions and un-commented, or un-readable or un-efficient code.

Homework Assignment 4 The document distance problem