set the mail reader to filter spampal's SPAM to the spambox (folder A)
set the mail reader to filter other mail matching certain keywords to the secondary spambox (folder B) - see "secondary filtering", below
whitelist as many known-good senders as possible
manually collect spam that makes it past the filters and move it to the nastybox (folder C) - do NOT delete it
periodically scan folders A and B for mail from whitelisted senders (see "dredging the spambox", below)
when folder C grows large, scan it for regex foolishness, new keywords, and other obvious lameness, then use this
information to refine the filtering rules (see "refining spampal's regexes", below) - the larger the sample
the better; the numbers below come from 2000+ known-good and 2000+ known-bad messages
Notes:
false positives are NON-spam that is detected as spam
false negatives are spam that is NOT detected as spam
false positives aren't terribly important, as we have an automated tool to retrieve them
(for all known senders, plus a bunch of other addresses as well) (see "dredging the spambox", below)
use a fast connection, so the mail doesn't take long to download
use a fast computer, so the filtering doesn't slow it down much (or move some/all of the filtering to a secondary computer)
secondary filtering
Secondary filtering occurs once the primary filter has processed the message.
In the beginning, this was used simply to move mail marked as spam to the spambox;
however, it can also be used to catch spam that spampal misses.
The filters can be as fancy as your mail program lets them be. However, a simple
list of keywords is often effective.
The best way to identify keywords is to scan the nastybox (folder C). Look for
keywords that come up often, in either the sender or subject. Then, create rules
for those keywords in the mail program.
Send all mail that matches these rules to a secondary spambox (folder B). Keep
folders A and B separate, so that additional analysis can be performed later.
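The keyword rules described in this section can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular mail program implements its rules: the keyword list, the function names, and the folder labels are made up for the example, and it assumes that spampal marks spam with an X-SpamPal: header whose value starts with SPAM (check what your own setup actually adds).

```python
from email.message import EmailMessage

# Hypothetical keywords; in practice, collect them from the nastybox (folder C).
KEYWORDS = ["mortgage", "lottery", "refinance"]


def matches_keyword(sender, subject, keywords=KEYWORDS):
    """Return the first keyword found in the sender or subject, else None."""
    haystack = f"{sender} {subject}".lower()
    for keyword in keywords:
        if keyword.lower() in haystack:
            return keyword
    return None


def choose_folder(msg, keywords=KEYWORDS):
    """Pick a destination: 'A' for spampal-tagged mail, 'B' for keyword hits."""
    # Assumption: spampal adds "X-SpamPal: SPAM ..." to mail it considers spam.
    tag = str(msg.get("X-SpamPal", ""))
    if tag.upper().startswith("SPAM"):
        return "A"
    sender = str(msg.get("From", ""))
    subject = str(msg.get("Subject", ""))
    if matches_keyword(sender, subject, keywords):
        return "B"
    return "inbox"
```

Checking the spampal tag first keeps folder A limited to what the primary filter caught, so folder B shows exactly how much the keyword rules add on their own.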
dredging the spambox
Note: This section refers to software which is not currently for sale. See "wading through the spambox" for the manual version.
Problem: If your spambox ends up with thousands of messages, it's not feasible to go through it manually, BUT there may be mail in there from someone you know.
Solution: Periodically scan the spambox for mail from anyone on a list of known-good senders, and move any matching messages from the spambox to the inbox.
The scan is done by software; before scanning occurs, the list of known-good senders is obtained from multiple sources and built on the fly.
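The software referred to here is not available, but the idea can be sketched with Python's standard mailbox module. This is a rough illustration, not the original tool: it assumes both folders are stored as mbox files, the function names are invented for the example, and it does no file locking, so run it only while the mail program is closed.

```python
import mailbox
from email.utils import parseaddr


def build_whitelist(*sources):
    """Merge several iterables of addresses into one lowercase set."""
    return {
        addr.strip().lower()
        for source in sources
        for addr in source
        if addr.strip()
    }


def dredge(spambox_path, inbox_path, whitelist):
    """Move mail from known-good senders to the inbox; return how many moved."""
    spambox = mailbox.mbox(spambox_path)
    inbox = mailbox.mbox(inbox_path)
    moved = 0
    try:
        # Copy the keys first, because messages are removed while looping.
        for key in list(spambox.keys()):
            msg = spambox[key]
            _, addr = parseaddr(str(msg.get("From", "")))
            if addr.lower() in whitelist:
                inbox.add(msg)
                spambox.remove(key)
                moved += 1
    finally:
        # close() flushes pending changes to disk.
        inbox.close()
        spambox.close()
    return moved
```

The whitelist sources could be, for example, an exported address book and the recipients of your sent mail, passed as `build_whitelist(address_book, sent_recipients)`.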
refining spampal's regexes
This procedure uses the scores from Spampal's regex filter to home in on badly-working rules.
It assumes use of software (which is not currently for sale) to analyse Spampal's X-RegEx: headers in inbound mail.
The relative frequency of occurrence of the various header rules is analysed. Sample partial analysis (sample size=5248 mails):
1 X-RegEx: [59.6] FROM_AND_RECEIVED_DO_NOT_MATCH FQDN in From and Received header do not match 1244 (23.70%)
2 X-RegEx: [155.2] TO_LOCALPART To: repeats local part as a real name 610 (11.62%)
3 X-RegEx: [10.0] MY_PLING_QUESTION 3 Ausrufezeichen oder 3 Fragezeichen (besonders wichtig o. besonders dumm) 354 (6.74%)
4 X-RegEx: [21.8] HTML_FONT_BIG FONT Size +2 and up or 3 and up 338 (6.44%)
This shows that the most frequently occurring header, contained in 1244 mails (23.70%), is the FROM_AND_RECEIVED_DO_NOT_MATCH header; the second most frequently occurring header, contained in 610 mails (11.62%), is the TO_LOCALPART header; and so on.
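A frequency analysis of this kind could be reproduced roughly as follows. This is a sketch, not the original software: it assumes each X-RegEx: header value has the form "[score] RULE_NAME description", as in the sample, and the function names are invented for the example. The percentage is the share of mails that contain the rule, which matches the sample figures (1244 of 5248 mails is 23.70%).

```python
import re
from collections import Counter

# Assumed header value format, taken from the sample:
#   "[59.6] FROM_AND_RECEIVED_DO_NOT_MATCH FQDN in From and Received ..."
HEADER_RE = re.compile(r"^\[\s*(-?\d+(?:\.\d+)?)\]\s+(\S+)\s*(.*)$")


def parse_regex_header(value):
    """Split one header value into (score, rule, description), or None."""
    match = HEADER_RE.match(value.strip())
    if match is None:
        return None
    return float(match.group(1)), match.group(2), match.group(3)


def rule_frequencies(mails):
    """Count how many mails trigger each rule.

    `mails` is an iterable with one list of X-RegEx header values per mail.
    Returns (rule, score, count, percent) tuples, most common rule first.
    """
    counts = Counter()
    scores = {}
    total = 0
    for headers in mails:
        total += 1
        seen = set()  # count each rule at most once per mail
        for value in headers:
            parsed = parse_regex_header(value)
            if parsed is None:
                continue
            score, rule, _description = parsed
            scores[rule] = score
            seen.add(rule)
        counts.update(seen)
    return [
        (rule, scores[rule], count, 100.0 * count / total)
        for rule, count in counts.most_common()
    ]
```

With messages loaded through Python's email package, the per-mail list can be obtained with `msg.get_all("X-RegEx", [])`.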
Process:
extract a large quantity of KNOWN-GOOD messages to a given location (the clearbox)
extract a large quantity of KNOWN-BAD messages to a given location (the nastybox)
generate X-RegEx header analyses for both groups
find all the messages in the nastybox that have negative Regex scores
negative scores reduce the total spam score
negative scores should appear mainly in the clearbox
messages in the nastybox with negative scores may be tricking the filter
examine each such message - is the Regex working as intended?
this process can identify false negatives
check the most common headers in the clearbox - they should be mostly negative
the more positive scores closer to the top, and the larger those scores, the more chance
of a false positive
if there is a problem with false positives, reduce the scores of the regexes closer to the
top of the list - positive values are moved closer to 0, while negative values are moved
further away
this process can identify false positives
this process can PRODUCE false negatives if the values of the rules are set too low
check the top 20 most common headers in the nastybox - the values should be relatively large
the larger the values for the most common rules, the more chance of correctly detecting
spam
if common rules have low values, this MAY be because non-spam also triggers the rule
check the relative frequencies of suspect rules in both the clearbox and the nastybox
if non-spam does NOT trigger the rule then the value of the rule should be increased
if non-spam OCCASIONALLY triggers the rule, the value of the rule should only be
increased a little
this process can identify false negatives
this process can PRODUCE false positives if the values of the rules are set too high
scan the remaining headers in the nastybox for obvious lameness:
the values of any blatant spam rules anywhere in the top 50 should be increased, especially if
those rules are not listed in the clearbox
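Parts of the process above lend themselves to automation. The following sketch covers two of the checks: listing nastybox messages that carry negative scores, and comparing how often a rule fires in the clearbox versus the nastybox. It is an illustration only: it assumes the "[score] RULE_NAME description" header format shown in the sample analysis, all function names are invented, and the 1% cut-off for "occasionally" is an arbitrary placeholder rather than a recommended value.

```python
import re

# Assumed header value format: "[score] RULE_NAME description"
SCORE_RE = re.compile(r"^\[\s*(-?\d+(?:\.\d+)?)\]\s+(\S+)")


def parse(value):
    """Return (rule, score) for one X-RegEx header value, or None."""
    match = SCORE_RE.match(value.strip())
    return (match.group(2), float(match.group(1))) if match else None


def negative_hits(nastybox):
    """List (message index, rule, score) for negative scores in known-bad mail."""
    hits = []
    for index, headers in enumerate(nastybox):
        for value in headers:
            parsed = parse(value)
            if parsed and parsed[1] < 0:
                hits.append((index, parsed[0], parsed[1]))
    return hits


def rule_rates(box):
    """Map each rule to the fraction of messages in the box that trigger it."""
    counts = {}
    for headers in box:
        for rule in {p[0] for p in map(parse, headers) if p}:
            counts[rule] = counts.get(rule, 0) + 1
    total = len(box)
    return {rule: n / total for rule, n in counts.items()} if total else {}


def suggest(clear_rate, occasional=0.01):
    """Rough hint for a rule that is common in the nastybox.

    `occasional` is a hypothetical threshold for "non-spam occasionally
    triggers the rule"; tune it to your own sample.
    """
    if clear_rate == 0:
        return "increase"
    if clear_rate <= occasional:
        return "increase a little"
    return "leave as is"
```

For each suspect rule, look up its rate with `rule_rates(clearbox).get(rule, 0.0)` and pass that to `suggest`. Every message reported by `negative_hits` still needs to be examined by hand, since the point is to see whether the regex is working as intended.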