20-760 Homework #1

Homework Assignment #1

Web-Based Information Architecture

Handed out: July 16, 2002
Due: 7:00pm, July 25, 2002

The objective of this assignment is to give you some hands-on experience with the construction of information retrieval systems by building a simple inverted index for a small document set.
The Reuters Corpus

The Reuters-21578 text categorization corpus is commonly used by researchers in machine learning and information retrieval as a standard for evaluating the performance of text categorization and retrieval algorithms. It consists of 21,578 articles which appeared on the Reuters newswire during 1987. Each article in the corpus has been hand-annotated with SGML tags identifying its content (e.g. the topic, named entities such as places, organizations, and people). Details about the corpus are available from the README file.
You will construct an inverted index for a small subset (1000 articles) of the Reuters corpus. To simplify the task, the articles have been preprocessed to remove all SGML tags except a boundary tag beginning with " ", as well as to strip out additional header material, punctuation, and special characters.
A sample preprocessed article is shown below in Figure 1. The beginning of each article is indicated by a single line containing a tag of the form . The next line contains the title of the article, followed by an optional byline, and a variable number of lines containing the location, date, and the body of the article. Additional information about how the articles were preprocessed is provided at the end of this handout for your reference.

RED LION INNS FILES PLANS OFFERING PORTLAND Ore Feb 26 Red Lion Inns Limited Partnership said it filed a registration statement with the Securities and Exchange Commission covering a proposed offering of 4,790,000 units of limited partnership interests The company said it expects the offering to be priced at 20 dlrs per unit It said proceeds from the offering along with a 102.5 mln dlr mortgage loan will be used to finance its planned acquisition of 10 Red Lion hotels Reuter

Figure 1. Sample Preprocessed Article From the Reuters Corpus

DATA

You will need a copy of the preprocessed file to do the assignment. download data
NOTE: There are exactly 1000 documents in the collection, and the docids are not continuous.
Constructing the Index (100 points)

Using the preprocessed data, construct an inverted index consisting of: a dictionary entry for each unique term in the document set and an associated posting list for each entry. Each posting list should contain the IDF (inverse document frequency) of the term, the IDs of all documents in which the term appears, the normalized and un-normalized TF (term frequency) and positions it appears (within-document frequency counts).
Because we have a small document collection, it is simplest to store the index in a single file, such as shown in Figure 2. Here, each line in the file represents one dictionary entry (sorted lexicographically), followed by its postings containing the IDF and the pairs of document ID, normalized TF, TF and positions. (Single spaces are used as delimiters between document entries, colons serve as delimiters between a document id, its term frequency and positions, commas are used as delimiters among the positions.) For example, the term "advertisers" occurs twice in document 79, as the 94th and 102nd words. Since it occurs only in one document, its DF equals 1 and the IDF (suppose we have 1000 documents in the collection) is log₂ (1000/1) = 9.966.
In this homework, the IDF is defined as log₂( N / DF), where N is the total number of documents in the collection). And the normalized TF is defined as TF/maxTF , where maxTF is the maximal term frequency in the same document.

adversaries 8.966 32:0.125:1:2455:0.25:1:67 adversely 7.966 54:0.25:1:23 356:0.05:1:103 454:0.0625:1:111 634:0.2:1:57 adverserial 9.966 342:0.4:4:23,54,89,45 advertisers 9.966 79:0.25:2:94,102 advertising 8.38 736:0.125:2:45,77 766:0.33:2:4,67 810:0.04:1:26 advice 8.38 178:0.125:1:44 467:0.5:3:12,45,86 687:0.08:2:25,65

Figure 2. Sorted list implementation of an inverted index
(each line is in the format <term> <idf> <doc id>:<normalized tf>:<tf>:<position 1>,...,<position i>)

In order to construct the index, you will need to:

Read the corpus file
Identify document boundaries
Process each document to extract:

the document id
all the terms in the article (you should lowercase all the terms, and throw away any terms without letters, such as numbers)
the term frequency (both normalized and un-normalized one) and position for each term in the article

Update the index to include the terms and id/term-freq/positions pair for the article just processed, also keep in mind that you should keep the DF information for each term in order to calculate IDF.
Write out the completed index to a file in the format shown in Figure 2. The terms should be sorted in alphabetical order from a to z.

You may assume a "term" is any sequence of consecutive non-whitespace characters (i.e.you do not have to filter out stopwords and do stemming). And you can scan the collection multi-pass to simplify the algorithm (i.e. you create the dictionary at first pass, then build the index at the second pass).

Extra Credit (Optional!)

Stopword Removal and Word Stemming (10 points)
Augment your indexer by including support for stopword removal and word stemming.

The stopword file is available with the current distributiion, and the Porter Stemmers Source Code in C/Java/Perl are also available. Add them into your index construction routines, how much does this reduce the size of the index?

You can check whether your stemmer works correctly by the sampled words and their stems (download).
Requirements & Homework Submission

You are encouraged to use Java to do the assignment, although a language such as Perl, C or C++ is also OK. If you really want to use Perl/C/C++, talk to TA first.
You will do the assignment in groups of 3 people. Each group is expected to hand in one solution (please include the names of all group members in the source code comments). Your solution should consist of three parts: (1) source codes for the indexer (2) a file named reuters.index containing the inverted index itself (3) a written discription of main data structure and algorithm.
Solutions should be submitted electronically before 7:00pm July 25, 2002 as follows:

Compress the index file with gzip/winzip/winrar

Each group send me ONE email about your group members and ONE ANDREW ID (by the end of tomorrow).
Put your result files (zipped index file, source code & description file) under the directory:

/afs/andrew/course/20/760/students/yourandrewid/

Source code & Index file for the extra credit problem should be submitted the same way (in a DIFFERENT file)

If you have any questions or problems accessing/submitting the files, send email to jianzhang@cmu.edu.

OPTIONAL: If you are interested in how we get the data file, the followings are the Preprocessing Rules

Remove any of the material between SGML tags except for material labeled as TEXT, TITLE, BODY, DATELINE, or AUTHOR.
Remove all SGML tags except for the "REUTERS" tag marking the start of a new article; simplify the REUTERS tag to contain just the document ID.
Remove all non-whitespace character sequences beginning with "&" and ending with ";".
Remove any punctuation characters except on abbreviations like "U.S.", within numbers, and apostrophes on contractions.
Split tokens at slashes (but leave numbers intact).
Split tokens at hyphens.

To make sure the mini-corpus contained articles which were relatively interesting to index, two additional preprocessing filters were used: articles containing mostly numbers (> 30%) and any without text in the body were skipped.
The perl script which does this is also availble.