AI: Spam Assignment

Sajid Siddiqi
Sanjeev Koppal
David Choi
I. Project Goal
Develop a program that classifies spam vs. non-spam email.
II. Training Classifier
We develop a list of features that will allow us to distinguish between spam and non-spam. The presence or absence of these features will then enable us to make a prediction of the most probable category to which the email belongs. These features are the words that have the largest information content. For example, words such as “Viagra and “porn occur much more frequently in spam than in non-spam. Thus seeing those words in an email makes it highly probable that the email in question is spam.
Seeing that the spam emails were already separated into a few categories, we decided to retain these categories in our training for one of our approaches. The five categories of emails we build a feature list for are: certainly spam, probably spam, spam that got past the filter, a mega spam list (including all the spam together), and non-spam. The way we come up with this list of features for each category is by keeping track of all the words that appear in a category and counting the number of emails that contain that word. We are not concerned with word frequency in our particular algorithm. After all the occurrences of words have been added up, we divide the total count for that word by the total number of emails in that category. Thus we obtain the probability of a particular word appearing in a particular category. One minus this probability gives the probability that the word does not appear.
Due to an optimized training algorithm, our code is able to train on 100MB of emails in less than five minutes.
Improving Feature List
Next, we prune our list of words to exclude as many words as
we can that do not provide relevant information. This is first done by comparing each word to
another list of words to be excluded in its consideration. This list includes the 300 most common words
in the English language (such as “and, “the, “but, “are, “because, etc). We obtained
this list of the 300 most common words in the English language (source:
http://www.zingman.com/commonWords.html,
“The American Heritage Word Frequency Book." The idea behind doing this was, these words are highly common in
both English spam and non-spam, so they shouldn’t be in the feature list. We
were able to do this because all our code was written from scratch,
we weren’t using any pre-existing tool that would constrain us from doing
modifications that we desired.
Meaningless words and long strings or random text are
excluded by ignoring words that are over 15 characters long, as well as words
that contain non-alphabetic symbols or have three consonants in a row. Its true that such
features could also be used to detect spam, but we wanted to avoid any “rules
or inferences that looks at long strings, and see how well we could do with a
purely Bayesian classifier. These strings are highly non-repetitive in nature,
most occurring just once, so its useless to have them
as features in a Bayesian filter.
III. Testing Filter
Once the classifiers are trained, we simply use the naive Bayes equation below to categorize the test email as spam or non-spam.

Where:
Some modifications were made to take into account the fact that our spam training set was disproportionately larger than our non-spam training set, introducing a sort of bias into the classifier training.
IV.
Results
Comparison of Classification Methods
We tested a set of test emails using 6 different methods. Chris made three categories of email available on the website (almost-certainly-spam, probably-spam, got-past-filter), and we used these in different ways. First we tested using filters trained from each spam category separately.. Another test was done with the filter trained from all of the spam concatenated into one file. The fifth test trained a filter for each category and then returned the most probably category. If the returned category was any of the spam categories, it was classified as spam. Otherwise, it was classified as non-spam. The final test we did was to exclude the 300 most common words from the feature list of the combined spam filter.
Results of various filters
|
|
Almost Certainly
Spam Filter |
Probably Spam
Filter |
Got Past Filter |
Combined Filter |
Multiple Filters |
Combined Filter
with Excluded words |
|
Overall Accuracy |
60.3% |
75.2% |
42.6% |
81.6% |
79.4% |
81.6% |
|
Non-spam Accuracy |
100% |
96.7% |
100% |
91.7% |
76.7% |
90% |
|
Spam Accuracy |
30.9% |
59.3% |
0% |
74.1% |
81.5% |
75.3% |
Confusion Matrix
|
|
Classified as Non-spam
% (count) |
Classified as Spam
% (count) |
|
Really Non-spam |
90%(54) |
10%(6) |
|
Really Spam |
24.7%(20) |
75.3%(60) |
Size of training set:
spam:
14,211
non-spam:
2,202
All the spam was stuff that Chris gave us. All the
non-spam was stuff from Sajid’s email-inbox at CMU.
Number of emails tested:
spam:
80
non-spam:
60
The spam was from stuff that Chris gave us,
that we didn’t train on. The non-spam was stuff from Dave’s email inbox
at CMU.
Number misclassified:
26 emails total (out of 140)
20 spam
6 non-spam
V. Software
The code was written in object-oriented C++ from scratch. Libraries were avoided and all text-handling and list-managing code was written from scratch to achieve the most optimized results. The feature list is built as a sorted list in order to allow binary search.
All of the code can be found here.
VI. Human-Computer
Combination
A human user could improve the performance of the spam filter in a couple ways. First, the user can manually add words to the exclude list. In addition, the user can continue to grow the feature list by letting the program know which emails were misclassified. If a particular user receives a higher percentage of spam, then the user can tweak the priors (the percentage that a received email is spam) so that it is more accurate for him. There are many parameters that the user could tweak to optimize performance and make the filter more or less conservative in its filtering. Finally, the user can add or subtract spam categories to further improve the classification.