email filtering
April 12, 2004 (as amended)
this page is one of a series, see end for links

Below is the process I use to avoid spam, email-based viruses, and phish. The first few steps are for everyone, while the last few are more for diehards like myself, who wish to improve the filters, as well as use them.

  1. Reduce the attack surface: shut down unused accounts, and close catch-all mailboxes

  2. Install a primary filter: set up SpamPal or an equivalent, and set your mail reader to move mail with SpamPal's "** SPAM **" tag in the subject line to the spam folder

    SpamPal uses a number of forms of spam detection; for each email, it weights each test result and computes a final "spam score". Emails scoring over a certain threshold are considered spam and have the string "** SPAM **" inserted at the start of their subject line. I recommend you use the Public Blacklists, plus the Regexfilter plugin.

  3. Fix Regexfilter's ruleset

  4. Whitelist as many known-good senders as possible

  5. Periodically wade through the spam folder

  6. [optional] Customise Regexfilter's ruleset (recommended)

    Add filtering of specific filetypes, character sets, keywords and domains as desired. This is where to add rules for anti-virus and phish filtering.

  7. [optional] Create secondary filters: set your mail reader to filter whitelisted mail to other folders (recommended)

    Secondary filtering occurs once the primary filter has processed the message. In the beginning, this was used simply to move mail marked as spam to the spam folder; however, it can also be used to clean your inbox further. The filters can be as fancy as your mail program lets them be. I use secondary filtering as follows:

    1. Whitelisted personal address filters: These test for specific email addresses I use, and lift all mail that is actually addressed to me out of my inbox and into various other folders (including my "real" inboxes). They examine the To: and CC: fields. Unfortunately the occasional spam, addressed correctly and failing to match any of the other filters, is also moved to my "real" inboxes.

    2. Whitelisted family, friend and mailing list filters: These test for specific email addresses in the To: and CC: fields, and specific strings in subject lines. They lift all mail I have previously defined as "OK" out of my inbox and into various other folders. This is where I filter my inbox for focus-virus@securityfocus.com, for example, which enables me to move all messages from the Esteemed List into an Esteemed Folder specifically about viruses.

    3. Blacklisted personal address filters: These test for specific email addresses I no longer use. Mails to these addresses are almost certainly spam, and I filter them to the spam folder.

  8. [optional] Manually collect spam that makes it past the filters, and move it to a special folder - do NOT delete it. When this folder grows large, use it to refine Regexfilter's ruleset.

  9. [optional] In extreme cases, use the Filter of Last Resort to clean your inbox (I stopped using this as I wanted to keep the evaders, so I could use them to refine Regexfilter's ruleset).
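
The weighted scoring described in step 2 can be sketched roughly as follows. This is only an illustration of the idea, not SpamPal's actual code: the test names, weights and threshold are all made up, and SpamPal's real values may differ.

```python
SPAM_TAG = "** SPAM **"


def spam_score(results, weights):
    """Add up the weights of every test that fired for one message."""
    return sum(weights[name] for name, fired in results.items() if fired)


def tag_subject(subject, score, threshold):
    """Prefix the subject with the tag when the score is over the threshold."""
    if score > threshold and not subject.startswith(SPAM_TAG):
        return f"{SPAM_TAG} {subject}"
    return subject


# Hypothetical tests and weights, for illustration only.
weights = {"public_blacklist": 3.0, "regex_rule": 2.5, "odd_charset": 1.0}
results = {"public_blacklist": True, "regex_rule": True, "odd_charset": False}

score = spam_score(results, weights)
print(tag_subject("Cheap watches", score, threshold=4.0))
```

Because the tag is always the same string in the same position, the mail reader only needs a single rule to move tagged mail to the spam folder.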

All this filtering has the effect of leaving only "mystery" mail in my inbox. I am certain that, if widely deployed, this degree of filtering would make it uneconomical for spammers to operate, and infeasible for mass-mailing viruses to propagate. There are still black holes into which some emails fall (if someone wants to email me a ZIP or an EXE, they must encrypt it to one of my public keys, then send it; joining new mailing lists often means some fiddling around with SpamPal's whitelists). But compared to a choked inbox, these problems are minor.
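
The secondary filters from step 7 are normally set up as rules in the mail reader, but their logic can be sketched in a few lines. The addresses (other than the mailing list address mentioned above), the folder names and the order of the checks are assumptions made for the example, and only the address tests are shown, not the subject-string tests.

```python
from email.message import EmailMessage
from email.utils import getaddresses

SPAM_TAG = "** SPAM **"

# Hypothetical addresses and folder names, for illustration only.
CURRENT_ADDRESSES = {"me@example.org": "Real inbox"}
OK_ADDRESSES = {"focus-virus@securityfocus.com": "Viruses"}
RETIRED_ADDRESSES = {"old-me@example.org"}


def recipients(msg):
    """Return the lowercased addresses found in the To: and CC: fields."""
    headers = msg.get_all("To", []) + msg.get_all("Cc", [])
    return {addr.lower() for _name, addr in getaddresses(headers) if addr}


def choose_folder(msg):
    """Pick a destination folder; the order of the checks is an assumption."""
    if str(msg.get("Subject", "")).startswith(SPAM_TAG):
        return "Spam"
    rcpts = recipients(msg)
    for table in (CURRENT_ADDRESSES, OK_ADDRESSES):
        for address, folder in table.items():
            if address in rcpts:
                return folder
    if rcpts & RETIRED_ADDRESSES:
        return "Spam"
    return "Inbox"  # "mystery" mail stays where it is


def make_message(to, subject):
    """Build a small test message with only To: and Subject: set."""
    msg = EmailMessage()
    msg["To"] = to
    msg["Subject"] = subject
    return msg


print(choose_folder(make_message("focus-virus@securityfocus.com", "New worm")))
```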

Notes:

wading through the spam folder

Assuming you set up at least one spam filter, you'll end up with a folder in your email application full of messages that tripped the filter. If the filters are any good, most of these messages will be spam. However, the occasional legitimate email may end up there, for numerous reasons; it's thus a good idea to look through this folder periodically, rather than just deleting all the messages without looking at them. These erroneously filtered messages are known as false positives. Here's the process I use to go through the spam folder:

  1. Sort by size - the biggest messages are sometimes false positives (attachments from associates, etc). Also the smallest are usually blank and can be removed.

  2. Sort by datetime - delete the uppermost and lowermost portions of the list (messages from the past and future are used by spammers to force their messages to these locations)

  3. Sort by subject - delete the uppermost and lowermost portions of the list (spammers use punctuation and other tricks to force their messages to these locations)

  4. Sort by name - delete the uppermost and lowermost portions of the list (spammers use punctuation and other tricks to force their messages to these locations); scroll through the list looking for repeats of the same name (spammers rarely use the same name twice, so repeats here are false positives, or duplicate messages)

  5. Search for known strings (words or phrases in subject lines, or email addresses, that are known to end up in the spam folder)

  6. Add new strings found to the list of strings (above) and/or whitelist them somehow

  7. Depending on the volume of messages remaining, you may wish to scroll through the list for known senders

  8. Delete all the remaining messages
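
Two of the steps above (2 and 4) lend themselves to a little automation. The sketch below flags messages whose Date: header is far in the past or future, and lists sender names that appear more than once. The 30-day and 1-day limits are arbitrary assumptions, as is the decision to treat an unreadable date as suspicious; the function names are invented for the example.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone
from email.utils import parseaddr, parsedate_to_datetime


def date_is_suspicious(date_header, now,
                       max_age=timedelta(days=30),
                       max_skew=timedelta(days=1)):
    """True if the Date: header is too old, too far ahead, or unreadable."""
    try:
        sent = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        return True  # assumption: an unparseable date counts as suspicious
    if sent.tzinfo is None:
        sent = sent.replace(tzinfo=timezone.utc)  # treat naive dates as UTC
    return sent < now - max_age or sent > now + max_skew


def repeated_senders(from_headers):
    """Return the display names (lowercased) that occur more than once."""
    counts = Counter(parseaddr(h)[0].strip().lower() for h in from_headers)
    return {name for name, n in counts.items() if name and n > 1}


now = datetime(2004, 4, 12, tzinfo=timezone.utc)
print(date_is_suspicious("Fri, 01 Jan 1999 00:00:00 +0000", now))
print(repeated_senders([
    "Alice Smith <alice@example.org>",
    "alice smith <a.smith@example.net>",
    "Bob <bob@example.com>",
]))
```

Messages with a repeated sender name are the ones worth a second look before deleting; messages with a suspicious date are the ones that would sort to the top or bottom of the list.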

Problem: If your spam folder ends up with thousands of messages, it's not feasible to go through it manually, BUT there may be mail in there from someone you know.

Solution 1: Use software such as Thunderbird, which has a feature to "ignore this rule for anyone in my contact list", together with SpamPal. This feature prevents Thunderbird from moving a message to the spam folder if it's from someone on your contact list, even if SpamPal has marked it as spam.

Solution 2: Periodically scan the spam folder for mail from anyone on a list of known-good senders, and move any matching messages from the spam folder to the inbox. The scan is done by software (not currently available for download); before scanning occurs, the list of known-good senders is obtained from multiple sources, and built on-the-fly. Using an automated tool like this means false positives matter less, as the tool can find them, even if they are buried amongst thousands of spams.
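
Since that tool is not available, here is a minimal sketch of the same idea, not the tool itself. It assumes folders are plain Python lists of messages and that the known-good list is merged from whatever address sources you have; all function names and addresses are invented for the example.

```python
from email.message import EmailMessage
from email.utils import parseaddr


def build_known_good(*sources):
    """Merge several address lists into one set of lowercased addresses."""
    return {a.strip().lower() for source in sources for a in source if a.strip()}


def rescue(spam_folder, inbox, known_good):
    """Move mail from known-good senders to the inbox; return the count moved."""
    remaining = []
    moved = 0
    for msg in spam_folder:
        sender = parseaddr(str(msg.get("From", "")))[1].lower()
        if sender in known_good:
            inbox.append(msg)
            moved += 1
        else:
            remaining.append(msg)
    spam_folder[:] = remaining  # keep only the messages that were not rescued
    return moved


def message_from(sender):
    """Build a small test message with only From: set."""
    msg = EmailMessage()
    msg["From"] = sender
    return msg


# Hypothetical sources for the known-good list.
address_book = ["alice@example.org"]
sent_mail_recipients = ["Bob@Example.com "]
known_good = build_known_good(address_book, sent_mail_recipients)

spam_folder = [message_from("Alice <alice@example.org>"),
               message_from("win@example.net")]
inbox = []
print(rescue(spam_folder, inbox, known_good), "message(s) rescued")
```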

Note that neither of these solutions can rescue legitimate mail sent by unknown senders. There is no way to tell a false positive from an unknown sender apart from actual spam. If this is a problem for a particular user, that user will need to get a new email address.