Simple Spam Filtering

Version: 2.5 (12/13/2002)
http://yendi.cc.cmu.edu/work/s-spam.html

1.0 Background

Sieve is a mechanism for creating email filters. A Sieve script is stored on the server and applied to all incoming email. For example, a script can automatically file an email message into a given folder.

SpamAssassin examines incoming email messages and, based on a set of rules, generates a score for each one. The larger the score, the greater the chance that the message is spam.

Once the value is above 5, the Andrew MX servers will insert the following header into the message:

    X-Spam-Warning: 83 (CLICK_BELOW,CLICK_HERE_LINK,WEB_BUGS,CTYPE_JUST_HTML)

The numeric value in the header is the SpamAssassin score multiplied by 10 (the 83 in the example above corresponds to a score of 8.3). This makes the value easier for Sieve to handle, since Sieve does not have floating point numbers. The information in parentheses is the list of rules that were matched.
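
As a rough illustration of the header format described above, the following Python sketch splits an X-Spam-Warning value into a score and a list of matched rule names. The function name parse_spam_warning and the regular expression are assumptions made for this example, based only on the sample header shown; the MX servers and the CGI may treat the header differently.

```python
import re

# Assumed layout, inferred from the sample header:
#   "<score times 10> (<RULE>,<RULE>,...)"
_WARNING_RE = re.compile(r"^\s*(\d+)\s*(?:\(([^)]*)\))?\s*$")


def parse_spam_warning(value):
    """Return (score, rules) for an X-Spam-Warning header value.

    Hypothetical helper: the score is the header integer divided by 10,
    and rules is the list of names found inside the parentheses.
    """
    match = _WARNING_RE.match(value)
    if match is None:
        raise ValueError("unrecognized X-Spam-Warning value: %r" % value)
    score = int(match.group(1)) / 10
    rule_text = match.group(2) or ""
    rules = [name.strip() for name in rule_text.split(",") if name.strip()]
    return score, rules
```

Sieve itself only ever needs the integer, which is why the header carries the score pre-multiplied.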

Sieve can use the information that SpamAssassin provides to automatically file messages into an appropriate folder. While a number of users have already written their own scripts, writing such a script is not a trivial task.

This project is to create a simple web interface to allow users to easily file messages tagged as spam into a folder.

2.0 Requirements

Must use WebISO for authentication.
Allow user to turn filing on or off.
The generated sieve script will never delete a message. All the script will do is file messages into a spam folder.
Allow for "white list" of accepted addresses/domains. (always accept, and don't filter into spam folder).
Allow for "black list" of rejected addresses/domains. (always file into the spam folder).
The generated script must function in conjunction with the script created by the vacation CGI. Users should be able to easily combine spam filtering with vacation, and separate them later with options to then do nothing, enable vacation only, or enable spam filtering only.
Detect user-created sieve scripts and allow the user to deactivate the user-created script in favor of the generated sieve script.
User created sieve scripts must not be deleted.
User created sieve scripts do not have to be parsed and understood by the system.
Allow for users to always accept all mail from Computing Services SMTP servers.
Allow viewing of the white list and black list entries.
Allow removal of white list and black list entries.
Allow the generated script to be stored on the server but not activated.
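
To make the "file, never delete" requirement concrete, here is a minimal Python sketch of a generator that emits a Sieve script containing only a fileinto action. It relies on the X-Spam-Warning header being present only on tagged messages, as described in the Background section. The function names, the script layout, and the choice of the exists test are assumptions for illustration, not the script the CGI actually produces; white/black lists and vacation handling are left out.

```python
def sieve_quote(text):
    """Quote a string for a Sieve script (escape backslash and double quote)."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def generate_spam_script(folder="INBOX.spam"):
    """Return the text of a Sieve script that files tagged mail into folder.

    Hypothetical sketch: the only action used is fileinto, so a message
    is never removed by this script.
    """
    lines = [
        'require ["fileinto"];',
        "",
        "# Tagged messages are filed into a folder, never deleted.",
        'if exists "X-Spam-Warning" {',
        "    fileinto %s;" % sieve_quote(folder),
        "}",
    ]
    return "\n".join(lines) + "\n"
```

Calling generate_spam_script() with no argument uses the recommended default folder name, INBOX.spam.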

3.0 Implementation Guidelines

Internal testing available by October 11, 2002.
The URL to the CGI will be off of the MyAndrew "E-mail Options" list and will be combined with Vacation.
We were not able to use the new WebISO feature to get away from the all-powerful user. The problem lies with the ticket that WebISO obtains and/or the lack of a sieve proxy. The way sieveshell currently works with the Cyrus aggregator is that it first gets a sieve/cyrus ticket and connects to the front-end. The front-end then issues a redirect to the back-end server where the sieve script actually exists. sieveshell then gets the sieve/ ticket for the appropriate back-end.
System will recommend a default spam folder: INBOX.spam. The name of the folder can be changed. However, if the folder already exists, make sure the user knows it exists and wants to use it. NOTE: the user MUST create the spam folder externally from the My Andrew filtering pages.
NEW: The reason why users need to create their own folders is that folder creation requires different authorization than the installation of sieve scripts. The current implementation of this CGI runs as an all powerful sieve user. We'd rather not also have an all powerful IMAP user for just this purpose. We'd also run into the same "what ticket to get" problem if we tried using the WebISO feature to get away from the all powerful user.
Whether or not a message came from a Computing Services SMTP server will be detected via a test on the "Received:" headers. While these headers can be forged, it is unlikely that spammers will go to the trouble, since doing so is a significant amount of work.
One possible workaround for the ticket problem was to just get all the tickets, but since that would have been more work, we decided to just go with the admin user approach.
At some point in time, we are likely to also do virus scanning. This may also involve a header and something to be tossed into something like INBOX.Virus. This is something to keep in the back of your mind as you are writing the code.

3.1 White/Black List Issues

White/black list entries are specified as either a specific address in the form userid@domain or just a domain. If the user-specified string begins with @, it is treated as a domain.
Wildcarding is not allowed. Abbreviations (e.g. foo@andrew) are not allowed. If the user enters an asterisk (*) or does not specify a fully qualified domain name (FQDN), an error should be raised.
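
The entry rules above could be checked along the following lines. This is a sketch under stated assumptions, not the CGI's actual validation: the name classify_entry, the "at least one dot" definition of a fully qualified domain, the lowercasing of the domain part, and the acceptance of a bare domain without a leading @ are all choices made for this example.

```python
import re

# One DNS label: letters, digits, and inner hyphens (assumed rule).
_LABEL_RE = re.compile(r"^[a-z0-9](?:[a-z0-9-]*[a-z0-9])?$")


def _is_fqdn(domain):
    """Treat a name as fully qualified if it has two or more valid labels."""
    labels = domain.split(".")
    return len(labels) >= 2 and all(_LABEL_RE.match(label) for label in labels)


def classify_entry(entry):
    """Return ("address", value) or ("domain", value); raise ValueError if invalid."""
    text = entry.strip()
    if "*" in text:
        raise ValueError("wildcards are not allowed: %r" % entry)
    if text.startswith("@"):
        userid, domain = None, text[1:]      # "@example.com" is a domain
    elif "@" in text:
        userid, _, domain = text.partition("@")
    else:
        userid, domain = None, text          # bare domain
    domain = domain.lower()
    if not _is_fqdn(domain):
        raise ValueError("domain is not fully qualified: %r" % entry)
    if userid is None:
        return ("domain", domain)
    if any(ch.isspace() for ch in userid):
        raise ValueError("invalid userid: %r" % entry)
    return ("address", userid + "@" + domain)
```

With this sketch, foo@andrew and *@example.com both raise ValueError, while @Example.COM is normalized to the domain example.com.
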
NEW: Precedence:
- A specific address always takes precedence over a domain. This means that an address in the white list overrides a matching domain in the black list, and an address in the black list overrides a matching domain in the white list.
- If there is an identical item in the whitelist and the blacklist, the whitelist will take precedence. It is likely that the only time this should happen is if the user made a mistake.
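
The precedence rules above can be sketched as an ordered series of lookups. The function name decide, the return values, and the use of exact (not subdomain) matching on lowercased strings are assumptions for illustration; the generated sieve script would have to express the same ordering in its own tests.

```python
def decide(sender, white, black):
    """Return "accept", "spam", or None when no list entry applies.

    white and black are sets of lowercase entries: full addresses
    ("user@example.com") or bare domains ("example.com").
    """
    sender = sender.strip().lower()
    domain = sender.rpartition("@")[2]
    # Addresses are checked before domains, and the white list is
    # checked before the black list at each level, so an identical
    # entry in both lists resolves to "accept".
    if sender in white:
        return "accept"
    if sender in black:
        return "spam"
    if domain in white:
        return "accept"
    if domain in black:
        return "spam"
    return None
```

A result of None means the lists have no opinion, and the message is handled according to its spam tagging alone.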

3.2 Handling of Existing Sieve scripts

In general, if the CGI detects an existing sieve script that neither it nor the vacation CGI generated, the user should be given the option to deactivate that script and enable the generated script.

It may be difficult to determine whether an existing vacation script was generated by the vacation CGI or written by the user. In this case, the recommendation is to just use the existing vacation code to determine whether or not the script in question is a vacation script. If so, then treat it like your generated vacation script. If you want to be paranoid, make sure it only has vacation and no fileinto or any other constructs that aren't put in there by the current vacation script.

An earlier version of the vacation CGI allowed one to edit a sieve script that was not generated by the CGI. However, when the user went to save the script back to the server, the vacation CGI would not allow it to be saved. The new system should not allow the user to edit a script that they can't save back.

4.0 Implementation Roles

Glorious Leader - John Lerchey
Web Page Interface Design - Laura Valentine
Programming Design and Coding - Barbara Jensen

Changelog

2.5 - 12/12/2002 - Clarified the white/black list precedence
                   rule. Described why we needed to create a folder externally
2.4 - 12/10/2002 - Updated formatting
2.3 - 11/07/2002 - Updates from Lerchey
2.2 - 09/04/2002 - Revisions after initial meeting
0.1 - 08/29/2002 - Initial draft