refining Regexfilter's ruleset
December 16, 2007 (as amended)
this page is one of a series, see end for links

This procedure uses the scores from Regexfilter to home in on poorly performing rules. It assumes the use of software (not currently available for download) to analyse the X-RegEx: header that Regexfilter adds to inbound mail.

It also assumes access to a large quantity (1000+) of spam. Spam that evaded the existing filters is the most useful for this, so it should be kept, NOT deleted. The larger the sample, the better.

The relative frequency of occurrence of the various header rules is analysed. Sample partial analysis (sample size = 5248 mails):

1 X-RegEx: [59.6] FROM_AND_RECEIVED_DO_NOT_MATCH FQDN in From and Received header do not match 1244 (23.70%) 
2 X-RegEx: [155.2] TO_LOCALPART To: repeats local part as a real name 610 (11.62%) 
3 X-RegEx: [10.0] MY_PLING_QUESTION 3 exclamation marks or 3 question marks (especially important or especially stupid) 354 (6.74%) 
4 X-RegEx: [21.8] HTML_FONT_BIG FONT Size +2 and up or 3 and up 338 (6.44%) 

This shows that the most frequently occurring header, found in 1244 mails (23.70%), is the FROM_AND_RECEIVED_DO_NOT_MATCH header; the second most frequently occurring header, found in 610 mails (11.62%), is the TO_LOCALPART header; and so on.
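Since the analysis software is not available for download, the frequency count described above can be approximated with a short script. The following is only a minimal sketch, not the original tool: it assumes each X-RegEx header value has the layout "[score] RULE_NAME description" (inferred from the sample lines), and the names RULE_RE, rules_in and rule_frequencies are made up for illustration.

```python
import re
from collections import Counter
from email import message_from_string

# Assumed header layout, inferred from the sample output:
#   X-RegEx: [59.6] RULE_NAME free-text description
RULE_RE = re.compile(r"\[(-?\d+(?:\.\d+)?)\]\s+(\S+)")


def rules_in(raw_mail):
    """Return the set of rule names in one mail, or None if it has no X-RegEx header."""
    msg = message_from_string(raw_mail)
    values = msg.get_all("X-RegEx", [])
    if not values:
        return None
    names = set()
    for value in values:
        match = RULE_RE.match(value.strip())
        if match:
            names.add(match.group(2))
    return names


def rule_frequencies(raw_mails):
    """Return (rule, mail count, percentage of mails) tuples, most frequent first.

    Mails without any X-RegEx header are left out of the sample entirely.
    """
    counts = Counter()
    total = 0
    for raw in raw_mails:
        names = rules_in(raw)
        if names is None:
            continue
        total += 1
        counts.update(names)  # each rule counted once per mail
    return [(name, n, 100.0 * n / total) for name, n in counts.most_common()]
```

Feeding rule_frequencies the raw text of every mail in a folder gives a ranked list comparable to the sample above. Each rule is counted once per mail, so the percentages are shares of mails rather than shares of header lines.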

Process:

  1. Extract a large quantity (1000+) of KNOWN-GOOD messages to a given location (the clearbox)

    Note: only mails with an X-RegEx header should be used. Mails without an X-RegEx header should be excluded from the analysis. The mails should be as recent as possible.

  2. Extract a large quantity (1000+) of KNOWN-BAD messages to a given location (the nastybox)

    Note: only mails with an X-RegEx header should be used. Mails without an X-RegEx header should be excluded from the analysis. The mails should be as recent as possible.

  3. Generate X-RegEx header analyses for both groups

  4. Find all the messages in the nastybox that have negative or zero Regex scores

  5. Check the top 20 most common headers in the clearbox - their scores should be mostly negative

  6. Check the top 20 most common headers in the nastybox - their scores should be relatively large

  7. Scan the remaining headers in the nastybox for obvious lameness: