refining Regexfilter's ruleset
December 16, 2007 (as amended)
this page is one of a series, see end for links

This procedure uses the scores from Regexfilter to home in on badly-working rules. It assumes use of software (which is not currently available for download) to analyse Regexfilter's X-RegEx: header in inbound mail.

It also assumes access to a large quantity (1000+) of spam - the spam that evaded existing filters is the best spam to use for this, so it should be kept, NOT deleted. The larger the sample, the better.

The relative frequency of occurrence of the various header rules is analysed. Sample partial analysis (sample size=5248 mails):

1 X-RegEx: [59.6] FROM_AND_RECEIVED_DO_NOT_MATCH FQDN in From and Received header do not match 1244 (23.70%) 
2 X-RegEx: [155.2] TO_LOCALPART To: repeats local part as a real name 610 (11.62%) 
3 X-RegEx: [10.0] MY_PLING_QUESTION 3 exclamation marks or 3 question marks (especially important or especially stupid) 354 (6.74%)
4 X-RegEx: [21.8] HTML_FONT_BIG FONT Size +2 and up or 3 and up 338 (6.44%) 

This shows that the most frequently occurring header, found in 1244 mails (23.70%), is the FROM_AND_RECEIVED_DO_NOT_MATCH header; the second most frequently occurring header, found in 610 mails (11.62%), is the TO_LOCALPART header; and so on.
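
Since the analysis software is not available for download, a frequency count of this kind could be approximated with a short script. The following is only a minimal sketch in Python, not the original tool. It assumes that each rule hit appears in a message as its own X-RegEx: header in the form "[score] RULE_NAME description", as the sample above suggests. The function names rule_hits and rule_frequencies are hypothetical.

```python
import re
from collections import Counter
from email import message_from_string

# Assumed header layout, inferred from the sample analysis:
#   X-RegEx: [59.6] RULE_NAME free-text description
RULE_RE = re.compile(r"^\[(-?\d+(?:\.\d+)?)\]\s+(\S+)")


def rule_hits(raw_message):
    """Return (rule_name, score) pairs found in one raw message."""
    msg = message_from_string(raw_message)
    hits = []
    for value in msg.get_all("X-RegEx", []):
        m = RULE_RE.match(value.strip())
        if m:
            hits.append((m.group(2), float(m.group(1))))
    return hits


def rule_frequencies(raw_messages):
    """Count how many messages each rule fired in, most common first.

    Returns a list of (rule_name, message_count, percent_of_messages).
    """
    counts = Counter()
    total = 0
    for raw in raw_messages:
        total += 1
        # Use a set so a rule is counted once per message.
        counts.update({name for name, _ in rule_hits(raw)})
    return [(name, n, 100.0 * n / total) for name, n in counts.most_common()]
```

Fed the raw text of every message in a sample, rule_frequencies returns rows that can be printed in the same order as the table above: rule name, number of mails, percentage of the sample.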

Process:

  1. Extract a large quantity of KNOWN-GOOD messages to a given location (the clearbox)

  2. Extract a large quantity of KNOWN-BAD messages to a given location (the nastybox)

  3. Generate X-RegEx header analyses for both groups

  4. Find all the messages in the nastybox that have negative Regex scores

  5. Check the most common headers in the clearbox - they should be mostly negative

  6. Check the top 20 most common headers in the nastybox - the values should be relatively large

  7. Scan the remaining headers in the nastybox for obvious lameness: