the finer points of spam filtering
December 16, 2007 (as amended)
see also: email filtering I, email filtering II and the phish filter

got a spam problem? basic process:

  1. shut down unused accounts
  2. close catch-all mailboxes
  3. set up Spampal or equivalent (the primary filter)
  4. set the mail reader to move mail that Spampal marks as SPAM to the spambox (folder A)
  5. set the mail reader to filter other mail matching certain keywords to the secondary spambox (folder B) - see "secondary filtering", below
  6. whitelist as many known-good senders as possible
  7. manually collect spam that makes it past the filters, move it to the nastybox (folder C) - do NOT delete it
  8. periodically scan folders A and B for mail from whitelisted senders (see "dredging the spambox", below)
  9. when folder C grows large, scan it for regex foolishness, new keywords, and other obvious lameness, then use this information to refine the filtering rules (see "refining spampal's regexes", below). The larger the sample the better; the numbers below come from 2000+ known-good and 2000+ known-bad messages

Notes:

secondary filtering

Secondary filtering occurs once the primary filter has processed the message. Originally, it was used simply to move mail marked as spam to the spambox; however, it can also be used to catch spam that Spampal misses.

The filters can be as fancy as your mail program lets them be. However, a simple list of keywords is often effective.

The best way to identify keywords is to scan the nastybox (folder C). Look for keywords that come up often, in either the sender or subject. Then, create rules for those keywords in the mail program.

Send all mail that matches these rules to a secondary spambox (folder B). Keep folders A and B separate so that additional analysis can be performed later.

Secondary filtering is well suited to anti-phishing: simply enter the names of banks as keywords.
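As a rough illustration of the keyword approach, here is a minimal Python sketch. It is not tied to any particular mail program: the keyword list, the folder names and the function names are all made up for the example, and a real mail reader would express the same thing through its own rule editor.

```python
import re

# Hypothetical keywords; in practice they come from scanning folder C.
KEYWORDS = ["viagra", "lottery", "examplebank"]

def matches_keyword(sender, subject, keywords=KEYWORDS):
    """Return the first keyword found in the sender or subject, else None."""
    haystack = f"{sender} {subject}".lower()
    for kw in keywords:
        # Whole-word match, so a keyword does not fire inside a longer word.
        if re.search(r"\b" + re.escape(kw.lower()) + r"\b", haystack):
            return kw
    return None

def route(sender, subject, marked_spam_by_primary):
    """Pick a destination folder (folder names are illustrative only)."""
    if marked_spam_by_primary:
        return "folder A"   # primary filter already flagged it
    if matches_keyword(sender, subject):
        return "folder B"   # caught by the secondary keyword rules
    return "inbox"
```

Checking the primary filter's verdict first keeps folders A and B separate, which is what makes later analysis of each filter possible.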

dredging the spambox

Note: This section refers to software which is not currently for sale. See wading through the spambox for the manual version.

Problem: If your spambox ends up with thousands of messages, it's not feasible to go through it manually, but there may be mail in there from someone you know.

Solution: Periodically scan the spambox for mail from anyone on a list of known-good senders, and move any matching messages from the spambox to the inbox.

The scan is done by software. Before each scan, the list of known-good senders is built on the fly from multiple sources.
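The software itself is not shown here, but the idea can be sketched in a few lines of Python using only the standard library. This is an assumption-laden illustration, not the actual tool: the function names are invented, messages are assumed to be available as raw text, and matching is done on the From: address only (which can be forged, so rescued mail still deserves a glance).

```python
from email import message_from_string
from email.utils import parseaddr

def build_whitelist(*sources):
    """Merge several iterables of addresses into one normalised set."""
    merged = set()
    for source in sources:
        for addr in source:
            merged.add(addr.strip().lower())
    return merged

def sender_of(raw_message):
    """Return the lower-cased From: address ('' if the header is absent)."""
    msg = message_from_string(raw_message)
    _name, addr = parseaddr(msg.get("From", ""))
    return addr.lower()

def dredge(spambox, whitelist):
    """Split raw messages into (rescued, remaining) by sender address."""
    rescued, remaining = [], []
    for raw in spambox:
        if sender_of(raw) in whitelist:
            rescued.append(raw)     # would be moved back to the inbox
        else:
            remaining.append(raw)   # stays in the spambox
    return rescued, remaining
```

Each source passed to build_whitelist could be, for example, an address book export or the recipients of sent mail; those particular sources are suggestions, not something the original software is known to use.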

refining spampal's regexes

This procedure uses the scores from Spampal's regex filter to home in on badly-working rules. It assumes use of software (which is not currently for sale) to analyse Spampal's X-RegEx: header in inbound mail.

The relative frequency of occurrence of the various header rules is analysed. Sample partial analysis (sample size=5248 mails):

1 X-RegEx: [59.6] FROM_AND_RECEIVED_DO_NOT_MATCH FQDN in From and Received header do not match 1244 (23.70%) 
2 X-RegEx: [155.2] TO_LOCALPART To: repeats local part as a real name 610 (11.62%) 
3 X-RegEx: [10.0] MY_PLING_QUESTION 3 exclamation marks or 3 question marks (especially important or especially stupid) 354 (6.74%) 
4 X-RegEx: [21.8] HTML_FONT_BIG FONT Size +2 and up or 3 and up 338 (6.44%) 

This shows that the most frequently occurring header rule, found in 1244 mails (23.70%), is FROM_AND_RECEIVED_DO_NOT_MATCH; the second most frequent, found in 610 mails (11.62%), is TO_LOCALPART; and so on.
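For readers without the analysis software, a frequency table like the one above can be approximated with a short Python sketch. It assumes the header layout suggested by the sample (X-RegEx: [score] RULE_NAME description, one rule per header line); that layout is inferred from the sample rather than taken from Spampal documentation, and the function names are invented for the example.

```python
import re
from collections import Counter

# Assumed layout, inferred from the sample above:
#   X-RegEx: [score] RULE_NAME free-text description
HEADER_RE = re.compile(r"^X-RegEx:\s*\[(-?\d+(?:\.\d+)?)\]\s+(\S+)", re.IGNORECASE)

def rules_in(raw_headers):
    """Return the set of rule names that fired in one message's headers."""
    names = set()
    for line in raw_headers.splitlines():
        m = HEADER_RE.match(line)
        if m:
            names.add(m.group(2))
    return names

def rule_frequencies(messages):
    """Per rule: (name, number of mails, percent of mails), most common first."""
    counts = Counter()
    for raw in messages:
        # A set is used so a rule counts once per mail, however often it fired.
        counts.update(rules_in(raw))
    total = len(messages)
    return [(name, n, 100.0 * n / total) for name, n in counts.most_common()]
```

Running rule_frequencies over the clearbox and the nastybox separately gives the two analyses that the process below compares.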

Process:

  1. extract a large quantity of KNOWN-GOOD messages to a given location (the clearbox)
  2. extract a large quantity of KNOWN-BAD messages to a given location (the nastybox)
  3. generate X-RegEx header analyses for both groups
  4. find all the messages in the nastybox that have negative Regex scores
  5. check the most common headers in the clearbox - they should be mostly negative
  6. check the top 20 most common headers in the nastybox - the values should be relatively large
  7. scan the remaining headers in the nastybox for obvious lameness: