Simple Spam Filtering
Version: 2.5 (12/13/2002)
http://yendi.cc.cmu.edu/work/s-spam.html
1.0 Background
Sieve is a mechanism for creating email filters. A Sieve script is
stored on the server and applied to all incoming email. For example,
a script can automatically file an email message into a given folder.
SpamAssassin examines incoming email messages and, based on a set of
rules, generates a value for each one: the larger the value, the
greater the chance that the email message is spam.
Once the value is above 5, the Andrew MX servers will insert the
following header into the message:
X-Spam-Warning: 83 (CLICK_BELOW,CLICK_HERE_LINK,WEB_BUGS,CTYPE_JUST_HTML)
The numeric value in the header is the SpamAssassin value multiplied
by 10, so the 83 in the example corresponds to a value of 8.3. This
makes it easier for Sieve to deal with, since Sieve does not have
floating point numbers. The information in parentheses is the list of
rules that were matched.
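
As an illustration of the header format described above, the following
sketch splits the header value into the original SpamAssassin value and
the list of matched rules. The helper name parse_spam_warning is
hypothetical and is not part of the CGI; the pattern assumes the value
always has the shape shown in the example.

```python
import re

# Shape assumed from the example header: an integer (the SpamAssassin
# value times 10), optionally followed by a parenthesized rule list.
_WARNING_RE = re.compile(r"^\s*(\d+)\s*(?:\(([^)]*)\))?\s*$")


def parse_spam_warning(value):
    """Return (score, rules) parsed from an X-Spam-Warning header value.

    Hypothetical helper for illustration only.
    """
    match = _WARNING_RE.match(value)
    if match is None:
        raise ValueError("unrecognized X-Spam-Warning value: %r" % value)
    scaled = int(match.group(1))
    rule_text = match.group(2) or ""
    rules = [r.strip() for r in rule_text.split(",") if r.strip()]
    # Undo the multiplication by 10 to recover the SpamAssassin value.
    return scaled / 10, rules


score, rules = parse_spam_warning(
    "83 (CLICK_BELOW,CLICK_HERE_LINK,WEB_BUGS,CTYPE_JUST_HTML)"
)
print(score, rules)
```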
Sieve can use the information that SpamAssassin provides to
automatically file messages into an appropriate folder. While a number
of users have already written their own scripts, writing such a script
is not a trivial task.
This project is to create a simple web interface to allow users to
easily file messages tagged as spam into a folder.
2.0 Requirements
- Must use WebISO for authentication.
- Allow user to turn filing on or off.
- The generated sieve script will never delete a message.
All the script will do is file messages into a spam folder.
- Allow for a "white list" of accepted addresses/domains (always
accept, and don't file into the spam folder).
- Allow for a "black list" of rejected addresses/domains (always
file into the spam folder).
- The generated script must function in conjunction with the script
created by the vacation CGI. Users should be able to easily combine
spam filtering with vacation, and separate them later with options to
then do nothing, enable vacation only, or enable spam filtering only.
- Detect user created sieve scripts and allow the user to deactivate
the user created script in favor of the generated sieve script.
- User created sieve scripts must not be deleted.
- User created sieve scripts do not have to be parsed and understood
by the system.
- Allow for users to always accept all mail from Computing Services
SMTP servers.
- Allow viewing of the white list and black list entries.
- Allow removal of white list and black list entries.
- Allow the generated script to be stored on the server but not
activated.
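
The core filing requirement above (file tagged mail into a folder,
never delete) could be sketched as a small generator like the one
below. This is only a sketch under stated assumptions, not the CGI's
actual output: the function names are hypothetical, the white/black
lists and vacation handling are left out, and it assumes that the mere
presence of the X-Spam-Warning header is enough to treat a message as
spam, since the header is only inserted once the value is above 5.

```python
def sieve_quote(text):
    """Quote a string for Sieve, escaping backslashes and double quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def generate_spam_script(folder="INBOX.spam"):
    """Build a Sieve script that files tagged messages into `folder`.

    Hypothetical helper. The only action emitted is fileinto, so the
    script never deletes a message.
    """
    lines = [
        'require ["fileinto"];',
        "",
        "# File anything the MX servers tagged as spam; never delete.",
        'if exists "X-Spam-Warning" {',
        "    fileinto %s;" % sieve_quote(folder),
        "}",
    ]
    return "\n".join(lines) + "\n"


print(generate_spam_script())
```

Whether the generated script is active is a separate server-side
setting, so the same text can be stored without being activated.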
3.0 Implementation Guidelines
- Internal testing available by October 11, 2002.
- The URL to the CGI will be off of the MyAndrew "E-mail
Options" list and will be combined with Vacation.
We were not able to use the new WebISO feature to allow us to get
away from the all powerful user. The problem comes in with the ticket
that WebISO obtains and/or the lack of a sieve proxy. The way
sieveshell currently works with the Cyrus aggregator is that
it first gets a sieve/cyrus ticket and connects to the
front-end. The front-end then redirects it to the back-end server
where the sieve script actually exists. sieveshell then gets
the sieve/ ticket for the appropriate back-end.
- System will recommend a default spam folder:
INBOX.spam. The name of the folder can be changed. However,
if the folder already exists, make sure the user knows it exists and
wants to use it. NOTE: the user MUST create the spam folder
externally from the My Andrew filtering pages.
NEW: The reason why users
need to create their own folders is that folder creation requires
different authorization than the installation of sieve scripts. The
current implementation of this CGI runs as an all powerful sieve
user. We'd rather not also have an all powerful IMAP user for just
this purpose. We'd also run into the same "what ticket to get"
problem if we tried using the WebISO feature to get away from the all
powerful user.
- The detection of whether or not email is from a Computing
Services SMTP server will be done via a test on the "Received:"
headers. While these headers can be forged, it is unlikely that
spammers will go to the trouble, since it is a significant amount of
work.
One possible workaround for the ticket problem was to just get all
the tickets, but that would have been more work, so we decided to
just go with the admin user approach.
- At some point in time, we are likely to also do virus
scanning. This may also involve a header and something to be tossed
into something like INBOX.Virus. This is something to keep
in the back of your mind as you are writing the code.
3.1 White/Black List Issues
- White/black list entries are specified as either a specific address
in the form userid@domain or just a domain. If the user specified
string begins with @ then it can be considered a domain.
- Wildcarding is not allowed. Abbreviations
(e.g. foo@andrew) are not allowed. If the user tries to enter
an asterisk (*) or does not specify a fully qualified domain name
(FQDN), an error should be raised.
- NEW: Precedence:
- A specific address always takes precedence over a domain. This
means that if there is an address in the white list, it overrides
the domain blacklist. If there is an address in the black list, it
overrides the domain whitelist.
- If there is an identical item in the whitelist and the
blacklist, the whitelist will take precedence. It is likely that the
only time this should happen is if the user made a mistake.
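
The entry rules and precedence rules above can be sketched as follows.
This is a minimal illustration, not the CGI's code: the names
normalize_entry and classify are hypothetical, the FQDN check is a
simple "at least two dot-separated labels" test, and entries are
lowercased before comparison, which is an assumption.

```python
import re

# Assumed FQDN shape: at least two dot-separated labels.
_LABEL = r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"
_FQDN_RE = re.compile(r"^%s(?:\.%s)+$" % (_LABEL, _LABEL))


def normalize_entry(entry):
    """Validate a list entry; return ("address" or "domain", value)."""
    entry = entry.strip().lower()
    if "*" in entry:
        raise ValueError("wildcards are not allowed: %r" % entry)
    # "@domain" and a bare "domain" both yield an empty local part.
    local, _, domain = entry.rpartition("@")
    if not _FQDN_RE.match(domain):
        raise ValueError("not a fully qualified domain: %r" % entry)
    if local:
        return ("address", local + "@" + domain)
    return ("domain", domain)


def classify(sender, whitelist, blacklist):
    """Return "white", "black", or None for a sender address."""
    kind, address = normalize_entry(sender)
    if kind != "address":
        raise ValueError("sender must be a full address: %r" % sender)
    domain = address.rpartition("@")[2]
    white = {normalize_entry(e) for e in whitelist}
    black = {normalize_entry(e) for e in blacklist}
    # Address matches are checked before domain matches, and at each
    # level the white list is checked before the black list.
    for key in (("address", address), ("domain", domain)):
        if key in white:
            return "white"
        if key in black:
            return "black"
    return None


white = ["@example.org", "friend@example.com"]
black = ["@example.com", "pest@example.org"]
print(classify("friend@example.com", white, black))
print(classify("pest@example.org", white, black))
```

Checking the address level first is what lets a white-listed address
override a black-listed domain and the reverse.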
3.2 Handling of Existing Sieve scripts
In general, if the CGI detects an existing sieve script that neither
it nor the vacation CGI generated, then the user
should be given an option to deactivate that script and enable the
generated script.
It may be difficult to determine whether or not an existing
vacation script was generated by the vacation CGI or user written. In
this case, the recommendation is to just use the existing vacation
code to determine whether or not the script in question is a vacation
script. If so, then treat it like your generated vacation script. If
you want to be paranoid, make sure it only has vacation and no
fileinto or any other constructs that aren't put in there by the
current vacation script.
An earlier version of the vacation CGI allowed one to edit a sieve
script that was not generated by the CGI. However, when the user went
to save the script back to the server, the vacation CGI would not
allow it to be saved. The new system should not allow the user to edit
a script that they can't save back.
4.0 Implementation Roles
- Glorious Leader - John Lerchey
- Web Page Interface Design - Laura Valentine
- Programming Design and Coding - Barbara Jensen
Changelog
2.5 - 12/12/2002 - Clarified the white/black list precedence
rule. Described why we needed to create a folder externally
2.4 - 12/10/2002 - Updated formatting
2.3 - 11/07/2002 - Updates from Lerchey
2.2 - 09/04/2002 - Revisions after initial meeting
0.1 - 08/29/2002 - Initial draft