Refining Regexfilter's ruleset
December 16, 2007 (as amended)
This page is one of a series; see the end for links.
This procedure uses the scores from Regexfilter to home in on rules that are working badly.
It assumes use of software (which is not currently available for download) to analyse Regexfilter's X-Regex: header in inbound mail.
It also assumes access to a large quantity (1000+) of spam - the spam that evaded existing filters is the best spam to use for this,
so it should be kept, NOT deleted. The larger the sample, the better.
The relative frequency of occurrence of the various header rules is analysed. Sample partial analysis (sample size = 5248 mails):
1 X-RegEx: [59.6] FROM_AND_RECEIVED_DO_NOT_MATCH FQDN in From and Received header do not match 1244 (23.70%)
2 X-RegEx: [155.2] TO_LOCALPART To: repeats local part as a real name 610 (11.62%)
3 X-RegEx: [10.0] MY_PLING_QUESTION 3 exclamation marks or 3 question marks (especially important or especially stupid) 354 (6.74%)
4 X-RegEx: [21.8] HTML_FONT_BIG FONT Size +2 and up or 3 and up 338 (6.44%)
This shows that the most frequently occurring header, contained in 1244 mails (23.70%), is the FROM_AND_RECEIVED_DO_NOT_MATCH header, and the second most frequently occurring header, contained in 610 mails (11.62%), is the TO_LOCALPART header, etc.
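The analysis software itself is not available, so the following is only a minimal sketch of how such a frequency table could be produced. It assumes the header layout seen in the sample (a score in square brackets, then the rule name, then a free-text description); the names parse_header and rule_frequencies are hypothetical, and each mail is assumed to have been reduced to a list of its X-RegEx header lines.

```python
import re
from collections import Counter

# Assumed layout, inferred from the sample analysis only:
#   X-RegEx: [score] RULE_NAME free-text description
# The "X-RegEx:" prefix is optional so bare header values also parse.
HEADER_RE = re.compile(
    r"^(?:X-RegEx:\s*)?\[(-?\d+(?:\.\d+)?)\]\s+(\S+)", re.IGNORECASE
)


def parse_header(line):
    """Return (rule_name, score) for one X-RegEx header line, or None."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        return None
    return match.group(2), float(match.group(1))


def rule_frequencies(mails):
    """mails: list of mails, each a list of X-RegEx header lines.

    Returns (rule, count, percent) tuples, most frequent rule first.
    """
    counts = Counter()
    for headers in mails:
        # Use a set so a rule is counted once per mail.
        rules = {p[0] for p in map(parse_header, headers) if p}
        counts.update(rules)
    total = len(mails)
    return [(rule, n, 100.0 * n / total) for rule, n in counts.most_common()]
```

Counting each rule once per mail matches the "contained in N mails" wording of the sample; whether the original software counts that way is an assumption.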
Process:
Extract a large quantity (1000+) of KNOWN-GOOD messages to a given location (the clearbox)
Note: only mails with an X-RegEx header should be used. Mails without an X-RegEx header should be excluded from the analysis. The mails should be as recent as possible.
Extract a large quantity (1000+) of KNOWN-BAD messages to a given location (the nastybox)
Note: only mails with an X-RegEx header should be used. Mails without an X-RegEx header should be excluded from the analysis. The mails should be as recent as possible.
Generate X-RegEx header analyses for both groups
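As a rough illustration of the extraction notes above, the sketch below pulls the X-RegEx headers out of raw messages with Python's standard email package and drops any mail that has none. How the clearbox and nastybox are actually stored is not stated here, so the input is simply assumed to be a list of raw message strings; regex_headers and usable_mails are hypothetical names.

```python
from email import message_from_string


def regex_headers(raw_mail):
    """Return the values of all X-RegEx headers in one raw message."""
    msg = message_from_string(raw_mail)
    # Header lookup is case-insensitive; [] is returned if none exist.
    return msg.get_all("X-RegEx", [])


def usable_mails(raw_mails):
    """Keep only mails that carry at least one X-RegEx header.

    Returns one list of header values per retained mail.
    """
    kept = []
    for raw in raw_mails:
        headers = regex_headers(raw)
        if headers:
            kept.append(headers)
    return kept
```

Running usable_mails once over each box gives the two groups that the later steps compare.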
Find all the messages in the nastybox that have negative or zero Regex scores
negative scores reduce the total spam score
negative scores should appear mainly in the clearbox
messages in the nastybox with negative scores may be tricking the filter
examine each such message - is the Regex working as intended?
this process can identify false negatives
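The check for negative or zero scores in the nastybox might be sketched as follows. This assumes that a mail's total Regex score is the sum of its individual rule scores, which is implied but not stated above, and that each mail has already been reduced to a list of (rule_name, score) pairs. The function name and the mail identifiers are hypothetical.

```python
def suspicious_nasty_mails(nastybox):
    """nastybox: dict mapping a mail identifier to its list of
    (rule_name, score) pairs.

    Returns (mail_id, total, negative_rules) for every mail whose
    summed score is zero or below, lowest total first.
    """
    flagged = []
    for mail_id, hits in nastybox.items():
        # Assumption: the total score is the plain sum of rule scores.
        total = sum(score for _, score in hits)
        if total <= 0:
            negative = sorted(rule for rule, score in hits if score < 0)
            flagged.append((mail_id, total, negative))
    return sorted(flagged, key=lambda item: item[1])
```

The list of negative rules per message is there to make the manual examination quicker; deciding whether the regex is working as intended still has to be done by eye.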
Check the top 20 most common headers in the clearbox - they should be mostly negative
the more positive scores closer to the top, and the larger those scores, the more chance of a false positive
if there is a problem with false positives, reduce the scores of the regexes closer to the top of the list - positive values are moved closer to 0, while negative values are moved further from 0
this process can identify false positives
this process can PRODUCE false negatives if the values of the rules are set too low
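The clearbox check can be partly automated. The sketch below ranks rules by how many clearbox mails they appear in and reports any positively scored rule among the top 20, together with its rank. It assumes each rule always carries the same score and that each mail is a list of (rule_name, score) pairs; positive_rules_in_top is a hypothetical name.

```python
from collections import Counter


def positive_rules_in_top(clearbox, top_n=20):
    """clearbox: list of mails, each a list of (rule_name, score) pairs.

    Returns (rank, rule, score, count) for every positively scored
    rule among the top_n most common rules.
    """
    counts = Counter()
    scores = {}
    for hits in clearbox:
        # dict() collapses repeats, so a rule counts once per mail.
        for rule, score in dict(hits).items():
            counts[rule] += 1
            scores[rule] = score  # assumes one fixed score per rule
    # Most common first; ties are broken alphabetically.
    ranked = sorted(counts, key=lambda r: (-counts[r], r))
    return [
        (rank, rule, scores[rule], counts[rule])
        for rank, rule in enumerate(ranked[:top_n], start=1)
        if scores[rule] > 0
    ]
```

A low rank number combined with a large score marks the rules most likely to be causing false positives.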
Check the top 20 most common headers in the nastybox - the values should be relatively large
the larger the values for the most common rules, the more chance of correctly detecting spam
if common rules have low values, this MAY be because non-spam also triggers the rule
check the relative frequencies of suspect rules in both the clearbox and the nastybox
if non-spam does NOT trigger the rule then the value of the rule should be increased
if non-spam OCCASIONALLY triggers the rule, the value of the rule should only be increased a little
this process can identify false negatives
this process can PRODUCE false positives if the values of the rules are set too high
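Comparing a suspect rule across the two boxes could look something like this. Each box is assumed to be a list of sets of rule names, one set per mail. The occasional threshold (1% of clearbox mails by default) is an invented figure to stand in for "OCCASIONALLY", and compare_rule is a hypothetical name; the suggestions it returns are prompts for a human decision, not automatic changes.

```python
def compare_rule(rule, clearbox, nastybox, occasional=1.0):
    """Return (clear_pct, nasty_pct, suggestion) for one rule.

    clearbox, nastybox: lists of sets of rule names, one set per mail.
    occasional: assumed cut-off, in percent of clearbox mails, below
    which non-spam is treated as triggering the rule only occasionally.
    """
    def pct(box):
        if not box:
            return 0.0
        hits = sum(1 for rules in box if rule in rules)
        return 100.0 * hits / len(box)

    clear_pct, nasty_pct = pct(clearbox), pct(nastybox)
    if nasty_pct == 0.0:
        suggestion = "leave unchanged"  # rule never fires on spam
    elif clear_pct == 0.0:
        suggestion = "increase"
    elif clear_pct <= occasional:
        suggestion = "increase a little"
    else:
        suggestion = "leave unchanged"  # non-spam triggers it too often
    return clear_pct, nasty_pct, suggestion
```

For example, a rule found in half of the nastybox and in none of the clearbox would come back with the suggestion "increase".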
Scan the remaining headers in the nastybox for obvious lameness:
the values of any blatant spam rules anywhere in the top 50 should be increased, especially if the rules do not appear in the clearbox
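To shortlist candidates for this last scan, one option is to list the rules in the nastybox top 50 that never appear in the clearbox at all. Whether a rule is "blatant" remains a judgement call, so this sketch only narrows the list. It assumes each box is a list of sets of rule names, and nasty_only_rules is a hypothetical name.

```python
from collections import Counter


def nasty_only_rules(clearbox, nastybox, top_n=50):
    """Return (rank, rule, count) for rules in the nastybox top_n
    that do not occur in any clearbox mail.

    clearbox, nastybox: lists of sets of rule names, one set per mail.
    """
    # Count each rule once per nastybox mail.
    counts = Counter(rule for rules in nastybox for rule in set(rules))
    clear_rules = set().union(*clearbox)
    # Most common first; ties are broken alphabetically.
    ranked = sorted(counts, key=lambda r: (-counts[r], r))[:top_n]
    return [
        (rank, rule, counts[rule])
        for rank, rule in enumerate(ranked, start=1)
        if rule not in clear_rules
    ]
```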