I use layers of filters to keep my inbox relatively clean: I see around 1 spam a day and no viruses at all, out of around 1000 inbound spams and 50 inbound viruses a week. Only one type of offender penetrates the filters: Bayesian-busting spam. The configuration outlined below will run on a 166MHz Pentium* with Windows 98SE or above, and uses two pieces of free non-Microsoft software: SpamPal (a spam filter) and Pegasus Mail (an email client; Outlook Express on steroids).
| layer | function |
|---|---|
| 1 | Bayesian spam filter |
| 2 | attachment filetype filters |
| 3 | X-RBL-Warning filter |
| 4 | personalised subjectline filters |
| 5 | blacklisted arbitrary keyword filters |
| 6 | whitelisted personal address filters |
| 7 | whitelisted family, friend and mailing list filters |
| 8 | blacklisted personal address filters |
The ordering of these filters is very important. Filters are applied in order, from top to bottom. Since over-processing unwanted mail is a waste of resources, the ordering is designed to find such mail rapidly; this removes it from the queue and allows the filters to begin work on the next message. The ordering might be rearranged for various purposes; this ordering is optimised for speed, with a few compromises for functionality. I might rearrange layers 5 and 6 in order to allow my friends to send me emails containing blacklisted keywords, for example.
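The effect of ordering can be illustrated with a minimal first-match-wins sketch. The rule names, the message dictionary and the address `friend@example.org` are all assumptions made for illustration; this is not how Pegasus Mail represents its rules internally.

```python
from typing import Callable

# Each rule is (name, predicate over the message, destination folder).
# All rules and folders here are hypothetical examples.
Rule = tuple[str, Callable[[dict], bool], str]

def route(message: dict, rules: list[Rule], default: str = "inbox") -> str:
    """Return the folder chosen by the first rule that matches."""
    for _name, matches, folder in rules:
        if matches(message):
            return folder  # stop here; later rules never run
    return default

keyword_rule: Rule = ("keywords", lambda m: "viagra" in m["subject"].lower(), "spam")
friend_rule: Rule = ("whitelist", lambda m: m["from"] == "friend@example.org", "friends")

msg = {"from": "friend@example.org", "subject": "Viagra joke"}
print(route(msg, [keyword_rule, friend_rule]))  # keyword layer runs first
print(route(msg, [friend_rule, keyword_rule]))  # whitelist layer runs first
```

With the keyword rule first, the friend's message lands in the spam folder; with the whitelist rule first, it is lifted out before the keyword test ever runs.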
Layer 1: The first layer is a Bayesian spam filter called SpamPal. It also uses other forms of spam detection, such as whitelists, blacklists, and its own set of regular expressions; for each email, it weights each test result and computes a final "spam score". Emails scoring over a certain threshold are considered spam and have the string "** SPAM **" inserted at the start of their subject line. The idea is then for the email client to filter all incoming mail for "** SPAM **" in the subject line, which is exactly what is done (see below).
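The weighting-and-threshold idea can be sketched as follows. The test names, weights and threshold are invented for illustration and are not SpamPal's actual values or internals; only the "** SPAM **" tag comes from the description above.

```python
SPAM_TAG = "** SPAM **"

def spam_score(results: dict[str, bool], weights: dict[str, float]) -> float:
    """Sum the weights of every test that fired."""
    return sum(weights[name] for name, hit in results.items() if hit)

def tag_subject(subject: str, score: float, threshold: float) -> str:
    """Prefix the subject with the spam tag when the score exceeds the threshold."""
    if score > threshold:
        return f"{SPAM_TAG} {subject}"
    return subject

# Hypothetical weights and test results for a single message.
weights = {"bayes": 3.0, "blacklist": 2.5, "regex": 1.5}
results = {"bayes": True, "blacklist": False, "regex": True}

score = spam_score(results, weights)
tagged = tag_subject("Cheap meds", score, threshold=4.0)
print(tagged)

# The mail client then only needs a simple subject test:
print(tagged.startswith(SPAM_TAG))
```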
The rest of the layers are provided by filters within Pegasus Mail, which is the email client software I use. Pegasus has a reputation for quality filtering, and it lives up to that reputation. It can scan incoming mail not only for certain strings in the To, From, Subject, and CC fields, but also for regular expressions in either the message header or the message body.
Layer 2: The attachment filetype filters test the body of the message for the expressions below. This first set catches Windows executables (EXE files):
TVqQAAMAAA* TVoAAAEAAAA* TVoAAD8AAAAE* TVoAAAAAAAAAAAAAUEUAAE* TVpLRVJORU*
Note the trailing *. This means "anything" - it's a wildcard. As there's no * at the start of the line, the first regular expression above means "if a line of the email starts with TVqQAAMAAA, filter it".
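The anchoring behaviour described above can be sketched like this. It mirrors the description (pattern anchored at the start of a line, trailing `*` matching anything) and is an assumption about the semantics, not a reimplementation of Pegasus Mail's expression engine.

```python
def line_matches(pattern: str, line: str) -> bool:
    """Match a pattern with an optional trailing '*' against a single line.

    With no leading '*', the pattern is anchored at the start of the line.
    """
    if pattern.endswith("*"):
        return line.startswith(pattern[:-1])
    return line == pattern

def body_matches(pattern: str, body: str) -> bool:
    """True if any line of the body matches the pattern."""
    return any(line_matches(pattern, line) for line in body.splitlines())

encoded_body = "Hello\nTVqQAAMAAAAEAAAA//8AALgAAAAA\n"
quoted_body = "the string TVqQAAMAAA appears mid-line here\n"

print(body_matches("TVqQAAMAAA*", encoded_body))
print(body_matches("TVqQAAMAAA*", quoted_body))
```

The second message is not filtered because the string does not sit at the start of a line, which is what keeps a discussion of these signatures from tripping the filter.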
These next two expressions catch ZIP files:
UEsDBAoAA* UEsDBBQAA*
This catches BHX files:
YmVnaW4gNj*
These catch GIF, PNG and JPG files respectively (not used by viruses but well-used by spammers):
R0lGODlh* iVBORw0KGgoAAAAN* /9j/4AAQSkZJRgABAQ*
This last expression catches WMF files:
183GmgAA*
These filters work by finding MIME data which matches the above strings. When a virus is sent via email, it is encoded in MIME (base64, in practice). It may change its filename, use a variety of subject lines and message bodies, and/or forge the sender's address; but it cannot forge its own file header, which is faithfully represented in the MIME data of the email containing the virus. It's not even necessary to decode the MIME; the above "MIME signatures" are functionally equivalent to the signatures used by traditional anti-virus scanners. Because they are already in encoded form, however, they can be used by standard email filters (including Windows-based client software such as Pegasus, and also unix-based software such as procmail). This technique is thus ideal for home users and small businesses without expensive, enterprise-grade content scanners.
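As a rough demonstration that a renamed attachment still carries its file header, the sketch below builds a message with Python's standard `email` package and scans the raw text for the first EXE signature. The addresses, subject, filename and the fake 64-byte payload are assumptions for illustration; no real executable is involved.

```python
from email.message import EmailMessage

EXE_SIGNATURE = "TVqQAAMAAA"  # first expression from the list above, '*' dropped

def build_message(payload: bytes, filename: str) -> str:
    """Return the raw text of an email carrying payload as an attachment."""
    msg = EmailMessage()
    msg["From"] = "sender@example.org"   # hypothetical addresses
    msg["To"] = "joe@foo.bar"
    msg["Subject"] = "Invoice"
    msg.set_content("See attachment.")
    # Binary attachments are base64-encoded by the email package.
    msg.add_attachment(payload, maintype="application",
                       subtype="octet-stream", filename=filename)
    return msg.as_string()

def has_signature(raw: str, signature: str) -> bool:
    """True if any line of the raw message starts with the signature."""
    return any(line.startswith(signature) for line in raw.splitlines())

# Start of an EXE header followed by padding; the filename is deliberately misleading.
fake_exe = b"MZ\x90\x00\x03\x00\x00\x00" + bytes(56)
raw = build_message(fake_exe, "holiday-photos.jpg")
print(has_signature(raw, EXE_SIGNATURE))
```

The misleading `.jpg` filename makes no difference, since the test looks only at the encoded content.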
These filters can also be embedded into anti-spam software, which often includes support for regular expressions. Here's how to do it with SpamPal, a free filter for Windows.
Viruses that send themselves as encrypted zipfiles (such as Bagle) are not a problem for the above technique either. Encrypted zipfiles have a standard header just like any other zipfile. So the above zipfile filters catch encrypted zips as well.
Additional MIME filters can be determined simply by emailing yourself a file in a given format (for example, .XLS) and examining the raw MIME data. The first ten bytes have proved sufficient, to date.
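An alternative to emailing yourself a file is to compute the signature directly from the file's leading bytes. The sketch below assumes the attachment is base64-encoded from its first byte; `mime_signature` is a hypothetical helper name. It keeps only the base64 characters that are fully determined by the supplied bytes, since the last character or two also depends on whatever bytes follow.

```python
import base64

def mime_signature(header: bytes) -> str:
    """Return a trailing-wildcard filter expression for a file header."""
    encoded = base64.b64encode(header).decode("ascii")
    # Each base64 character carries 6 bits; keep only characters whose
    # bits all come from the header bytes we were given.
    stable_chars = (len(header) * 8) // 6
    return encoded[:stable_chars] + "*"

print(mime_signature(b"MZ\x90\x00\x03\x00\x00\x00"))  # start of an EXE header
print(mime_signature(b"GIF89a"))                      # GIF header
```

Running this on the first eight bytes of a typical EXE header reproduces the first expression in the EXE list, and `GIF89a` reproduces the GIF expression.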
Extra anti-virus filters are required for viruses that send themselves as text. These are not executable files in the traditional sense, in that they do not contain binary object code; nonetheless, they are dangerous and should be filtered. Currently these three expressions are all that's required:
<HTA:APPLICATION* Content-Disposition: attachment; filename="*.vbs" Content-Disposition: attachment; filename="*.hta"
Test for each of these in the body of the message.
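Two of these expressions have the wildcard in the middle rather than at the end. A minimal sketch using Python's `fnmatch` module shows the matching; note that `fnmatchcase` treats `*` as "anything", matches the whole line, and is case-sensitive, which is an assumption about how the mail client behaves.

```python
from fnmatch import fnmatchcase
from typing import Optional

TEXT_VIRUS_PATTERNS = [
    "<HTA:APPLICATION*",
    'Content-Disposition: attachment; filename="*.vbs"',
    'Content-Disposition: attachment; filename="*.hta"',
]

def first_text_virus_pattern(body: str) -> Optional[str]:
    """Return the first pattern that matches any line of the body, else None."""
    for line in body.splitlines():
        for pattern in TEXT_VIRUS_PATTERNS:
            if fnmatchcase(line.strip(), pattern):
                return pattern
    return None

# Hypothetical message body containing a script attachment header.
body = (
    "Content-Type: application/octet-stream\n"
    'Content-Disposition: attachment; filename="LOVE-LETTER.vbs"\n'
)
print(first_text_virus_pattern(body))
print(first_text_virus_pattern("Just a normal message.\n"))
```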
SpamPal flags a good proportion of these incoming viruses as spam, but I want to keep the viruses separate from the spam (if nothing else, to measure the effectiveness of my filters). I therefore test for SpamPal's "** SPAM **" flag only after filtering the viruses, so my next filter in Pegasus is the SpamPal test. This is really part of the SpamPal layer, but it sits here for technical reasons.
To permit certain addresses to receive JPGs or PNGs, place the filter for those addresses before the JPG or PNG filters.
Layer 3: This filter layer uses the X-RBL-Warning: header line. This line is inserted into the header of the message by various SMTP servers if they forward a message originating from a blacklisted host. The blacklists are manually maintained by unaffiliated individuals all over the world, and are thus subject to error (my own mail is occasionally flagged as spam), so they are not perfect. The blacklist used is usually noted, however, so there is room for a degree of auditing. This filter yields a false positive rate of around 0.1%, suggesting that whatever techniques are being employed to maintain the blacklists, they are very accurate. It only catches about half the spam I receive in total, however. Match on this expression in the header of the message:
X-RBL-Warning:*
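The same header test can be sketched with Python's standard `email` parser. The sample message, including the wording of the warning value, is invented for illustration; real servers word this header in their own way.

```python
from email import message_from_string

def has_rbl_warning(raw: str) -> bool:
    """True if the message header contains an X-RBL-Warning line."""
    msg = message_from_string(raw)
    return msg.get("X-RBL-Warning") is not None

# Hypothetical raw message; the header value is made up.
flagged = (
    "From: someone@example.org\n"
    "To: joe@foo.bar\n"
    "X-RBL-Warning: (hypothetical) sending host is listed by example-blacklist\n"
    "Subject: Hello\n"
    "\n"
    "X-RBL-Warning: this line is body text, not a header\n"
)
clean = "From: someone@example.org\nSubject: Hello\n\nX-RBL-Warning: in body only\n"

print(has_rbl_warning(flagged))
print(has_rbl_warning(clean))
```

Parsing the header separately from the body avoids matching a message that merely quotes the header line in its text.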
Since I started using SpamPal, this filter has become almost redundant, because SpamPal looks at the same blacklists as the above-mentioned SMTP servers. I leave it in place, however, in case something happens to SpamPal or I'm not using it: just in case, basically.
Layer 4: This layer consists of a series of very specific tests, which catch a remarkably large amount of spam. Some spamming software out there must take the first part of my email address and insert it, with a trailing comma, at the start of the subject line. If my email address were joe@foo.bar, I'd receive a spam from this software with the subject starting with "joe, ...". Accordingly, this filter tests for the first part of my two heavily spammed email addresses, with a trailing comma, at the start of the subject line of each email:
lsi, Stuart,
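A minimal sketch of this test, using the joe@foo.bar example address from the text. Deriving the prefix from the address and ignoring case are assumptions; the actual filters above are simply fixed strings.

```python
def subject_prefixes(addresses: list[str]) -> list[str]:
    """Build 'localpart,' prefixes from a list of email addresses."""
    return [address.split("@", 1)[0] + "," for address in addresses]

def is_personalised_spam(subject: str, addresses: list[str]) -> bool:
    """True if the subject starts with the local part of an address plus a comma."""
    lowered = subject.lower()
    return any(lowered.startswith(prefix.lower())
               for prefix in subject_prefixes(addresses))

my_addresses = ["joe@foo.bar"]  # example address from the text
print(is_personalised_spam("joe, lose weight now", my_addresses))
print(is_personalised_spam("Joey's birthday party", my_addresses))
```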
Layer 5: This layer consists of a series of tests for specific keywords in the subject of the message. This is the lowest-tech layer, but it does catch the occasional email, nonetheless. The strings are obvious enough:
medication prescription pharmacy viagra sildenafil vicodin xanax automobile depressant cellulite mortgage
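The keyword layer amounts to a substring test on the subject line. This sketch assumes case-insensitive matching, which may or may not be how a given mail client treats these strings.

```python
from typing import Optional

BLACKLISTED_KEYWORDS = [
    "medication", "prescription", "pharmacy", "viagra", "sildenafil",
    "vicodin", "xanax", "automobile", "depressant", "cellulite", "mortgage",
]

def blacklisted_keyword(subject: str) -> Optional[str]:
    """Return the first blacklisted keyword found in the subject, else None."""
    lowered = subject.lower()
    for keyword in BLACKLISTED_KEYWORDS:
        if keyword in lowered:
            return keyword
    return None

print(blacklisted_keyword("Refinance your MORTGAGE today"))
print(blacklisted_keyword("Lunch on Friday?"))
```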
Layer 6: This layer consists of a series of tests for specific email addresses I use. This layer essentially lifts all email that is actually addressed to me out of my inbox and into various other folders (including my "real" inboxes). It examines the To: and CC: fields. Unfortunately the occasional spam, addressed correctly and failing to match any of the other filters, is also moved to my "real" inboxes. Those are the subject of further research which will be published here when available.
Layer 7: This layer consists of a multitude of tests for specific email addresses in the To: and CC: fields, and specific strings in subject lines. This layer lifts all mail I have previously defined as "OK" out of my inbox and into various other folders. This is where I filter my inbox for focus-virus@securityfocus.com, for example, which enables me to move all messages from the Esteemed List into an Esteemed Folder specifically about viruses. I could also use this layer to filter all email from my family and friends to specific folders.
Layer 8: This last layer consists of a series of tests for specific email addresses I no longer use. Mails to these addresses are almost certainly spam, and I filter them to a spam folder.
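Layers 6 to 8 all test the recipient addresses, in that order. A rough sketch with Python's standard `email` package follows; the folder names and the retired address `old-joe@foo.bar` are hypothetical, while joe@foo.bar and the focus-virus list address come from the text above.

```python
from email import message_from_string
from email.utils import getaddresses

def recipients(raw: str) -> set[str]:
    """Collect the lowercased addresses found in the To: and CC: fields."""
    msg = message_from_string(raw)
    fields = msg.get_all("To", []) + msg.get_all("Cc", [])
    return {address.lower() for _name, address in getaddresses(fields) if address}

def classify(raw: str, current: set[str], lists: set[str], retired: set[str]) -> str:
    """Apply layers 6, 7 and 8 in order and return a folder name."""
    found = recipients(raw)
    if found & current:
        return "real-inbox"   # layer 6: addresses I use
    if found & lists:
        return "lists"        # layer 7: known mailing lists
    if found & retired:
        return "spam"         # layer 8: addresses I no longer use
    return "inbox"            # "mystery" mail stays put

current = {"joe@foo.bar"}
lists = {"focus-virus@securityfocus.com"}
retired = {"old-joe@foo.bar"}  # hypothetical retired address

list_mail = "To: focus-virus@securityfocus.com\nSubject: digest\n\nbody\n"
stale_mail = "To: Old Joe <old-joe@foo.bar>\nSubject: offer\n\nbody\n"
print(classify(list_mail, current, lists, retired))
print(classify(stale_mail, current, lists, retired))
```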
All this filtering has the effect of leaving only "mystery" mail in my inbox. I deal with those here.
I am certain that, if widely deployed, this degree of filtering would make it uneconomical for spammers to operate (given that 1 message a day, max, gets through), and infeasible for mass-mailing viruses to propagate (since nobody is getting infected).
There are various black holes into which emails still fall (if someone wants to email me a ZIP or an EXE, they must encrypt it to one of my public keys, then send it; joining new mailing lists often means some fiddling around with SpamPal's whitelists). But compared to a choked inbox, these problems are minor.
Assuming you set up at least one spam filter, you'll end up with a folder in your email application full of messages that tripped the filter. If the filters are any good, most of these messages will be spam. However, the occasional legitimate email may end up there, for numerous reasons; it's thus a good idea to look through this folder periodically, rather than just deleting all the messages unread. These erroneously filtered messages are known as false positives. Here's the process I use to go through the spam folder, or spambox as I call it:
Note: emails may contain exploits which attempt to crack known vulnerabilities in common email software (e.g. Outlook) and take over your computer. I therefore suggest using an industrial-strength reader, such as the aforementioned Pegasus Mail, to browse the spam folder.
Note: I have created software to do the above and more; however, it is not currently available for download.
* OK, so a P166 with a full set of filters is a bit too slow for comfortable viewing. A 333MHz box is fine...