customising Regexfilter's ruleset
April 12, 2004 (as amended)
this page is one of a series, see end for links

Spampal and Regexfilter can do much more than filter spam - they can be used to catch viruses and phish as well. This page details how to do that, plus some other extra filtering rules that I find useful.

The ordering of the various filters is very important. The filters are applied in order, from top to bottom. Since over-processing unwanted mail is a waste of resources, the ordering should be designed to find such mail rapidly; this removes it from the queue and allows the filters to begin work on the next message. The ordering might be rearranged for various purposes; the ordering below is optimised for speed, with a few compromises for functionality. I place all of the tests below at the top of the file containing the ruleset - this means they run first and eject any mail they find, saving the lower-order rules the trouble of processing it.

The rules below make various assumptions about what should and should not be permitted. You should not use them unless you understand what they will do. The rules should be modified to suit your own purposes before use.

All of the rules below use a leading equals sign and a high spam score - this has the effect of instantly marking any matching mail as spam, with no further testing performed on it. If this does not suit you, change it.

  1. Anti-spoofing filter (part 1): These match if the SENDER uses my own domainname, BUT the mail is NOT sent from my own SMTP server (i.e. it is spoofed). This should be adjusted for your own domainname and SMTP servers (in the example below, my SMTP server has the imaginary IP address 123.456.789.123). In the second rule, don't forget to allow for any servers that you permit to spoof your mail (e.g. your own webserver, emailing you form output).
       @Any-Sender:            {cyberdelix\.net}
      -=Received:       500.0  {123\.456\.789\.123|from www\.cyberdelix\.net}                                                                     [ANTISPOOF_BLOCK_SPOOFED_INTERNAL]
       =Any-Sender:    -500.0  {cyberdelix\.net}                                                                                                  [ANTISPOOF_PASS_FRIENDLY_SMTP_INTERNAL]
    

    The first two rules blacklist my own domainname, except when the mail is sent from one of the IP addresses or hosts listed.

    The third rule is only reached if the mail is NOT spoofed, because the equals sign in the second rule stops all further testing of spoofed mail. Therefore, at this point, any mail sent with my domainname is friendly (non-spam). Mail that matches the third rule is thus given a negative spam score, and no further testing is done on it.

    Do NOT enter "localhost" or similar into the second rule, only enter strings that uniquely identify your own SMTP server (such as its IP or hostname).

    Note: these rules may block mail from legitimate services which spoof your address, such as Paypal or mailing lists. I am still testing this; if so, I imagine one or more additional rules could be added to fix it.

    Note: hyphen and dot both need to be escaped

    Note: the above rule means it's not necessary to whitelist *@yourdomain.com. Indeed, if your whitelist does contain this entry, you should remove it, as it will prevent the anti-spoofing rule from detecting spoofed mails. The anti-spoofing rule automatically allows through any mail sent from an SMTP server listed in the rule, so as long as all senders @yourdomain.com are using one of these SMTP servers, they are effectively whitelisted, just as they were when *@yourdomain.com was whitelisted. There is therefore no danger in removing this entry from the whitelist if the anti-spoofing rule is in use.

    The anti-spoofing rules should go FIRST in the filter file, so that mails from friendly SMTP servers are not filtered for spam (i.e. are allowed to be spammy). Also, putting these rules first means spoofers are immediately detected and ejected.
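
The decision made by the three rules above can be sketched in Python. This is only an illustration of the logic, not part of Regexfilter: the function name classify, its return values and the sample headers in the usage lines are my own assumptions, while the two patterns are copied from the rules.

```python
import re

# Patterns copied from the rules above; 123.456.789.123 is the page's
# imaginary address, so substitute your own server details.
OWN_DOMAIN = re.compile(r"cyberdelix\.net")
TRUSTED_RELAY = re.compile(r"123\.456\.789\.123|from www\.cyberdelix\.net")


def classify(senders, received):
    """Return 'spoofed', 'friendly' or 'untested' for one message.

    senders  -- values of the sender-type headers (From:, Reply-To:, ...)
    received -- the Received: header lines
    """
    uses_own_domain = any(OWN_DOMAIN.search(s) for s in senders)
    via_trusted_relay = any(TRUSTED_RELAY.search(r) for r in received)
    if uses_own_domain and not via_trusted_relay:
        return "spoofed"    # rules 1 and 2: score 500.0, stop testing
    if uses_own_domain:
        return "friendly"   # rule 3: score -500.0, stop testing
    return "untested"       # falls through to the rest of the ruleset


# Hypothetical headers, for illustration only.
print(classify(["me@cyberdelix.net"], ["from mail.example.org ([198.51.100.7])"]))
print(classify(["me@cyberdelix.net"], ["from www.cyberdelix.net by mx"]))
```

The first call is the spoofing case (my domainname, unknown relay); the second is mail relayed by a server named in the second rule.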

  2. Anti-spoofing filter (part 2): These match if the SENDER uses a given domainname, BUT the mail is NOT sent from that domain's SMTP server (i.e. it is spoofed). Currently just for filtering fake Facebook mails:
       @From:                  {@facebookmail\.com}
      -=Received:       500.0  {mx\-out\.facebook\.com}                                                                                           [ANTISPOOF_BLOCK_SPOOFED_EXTERNAL_FACEBOOK]
       =Any-Sender:    -500.0  {@facebookmail\.com}                                                                                               [ANTISPOOF_PASS_FRIENDLY_SMTP_FACEBOOK]
    

    Note: hyphen and dot both need to be escaped, @-sign does not

  3. Anti-virus MIME signature filters: These match the first few bytes of most Windows-based executable files, and consequently filter almost all mass-mailing Windows-based viruses, such as Sobig, MyDoom, and Netsky. Regexfilter rules:

       =Line:           9999   {^TVqQAAMAAA*}                                         [MIMEAV: Win32 executable variant 1]
       =Line:           9999   {^TVoAAAEAAAA*}                                        [MIMEAV: Win32 executable variant 2]
       =Line:           9999   {^TVoAAAAAAAAAAAAAUEUAAE*}                             [MIMEAV: Win32 executable variant 3]
       =Line:           9999   {^TVoAAD8AAAAE*}                                       [MIMEAV: Win32 executable variant 4] 
       =Line:           9999   {^TVpLRVJORU*}                                         [MIMEAV: Win32 executable variant 5] 
       =Line:           9999   {^UEsDBAoAA*}                                          [MIMEAV: Zipfile variant 1]
       @Line:                  {^UEsDBBQAA*}                                          [MIMEAV: Zipfile variant 2]
      -=Body:           9999   {name=.*\.(docx|xlsx)}                                 [MIMEAV: Zipfile variant 2]
       =Line:           9999   {^183GmgAA*}                                           [MIMEAV: WMF file variant 1]
    

    These filters work by finding MIME data which matches the above strings. When a virus is sent via email, it is encoded in MIME. It may change its filename, use a variety of subject lines and message bodies, and/or forge the sender's address; but it cannot forge its own file header, which is faithfully represented in MIME in the email containing the virus. It's not even necessary to decode the MIME; the above "MIME signatures" are functionally equivalent to the signatures used by traditional anti-virus scanners.

    Note that "Zipfile variant 2" has a different syntax, using two rules - this is to allow for DOCX and XLSX files, which are actually ZIP files. The above syntax allows DOCX and XLSX files through, while still blocking all other ZIP files. Exceptions for other filetypes such as PPTX could also be added here (not tested).

    Below is a table of various filetypes and their MIME signatures. Additional signatures can be determined simply by emailing yourself a file in a given format (for example, .XLS) and examining the raw MIME data. Keep a signature short, but not so short that it also matches unrelated data.

    extension           MIME signature          notes
    EXE, COM, SCR, PIF  TVqQAAMAAA              Win32 executable variant 1
    EXE, COM, SCR, PIF  TVoAAAEAAAA             Win32 executable variant 2
    EXE, COM, SCR, PIF  TVoAAAAAAAAAAAAAUEUAAE  Win32 executable variant 3
    EXE, COM, SCR, PIF  TVoAAD8AAAAE            Win32 executable variant 4
    EXE, COM, SCR, PIF  TVpLRVJORU              Win32 executable variant 5
    ZIP                 UEsDBAoAA               ZIP type 1
    ZIP                 UEsDBBQAA               ZIP type 2
    GIF                 R0lGODlh                not used by viruses but well-used by spammers
    PNG                 iVBORw0KGgoAAAAN        not used by viruses but well-used by spammers
    JPG                 /9j/4AAQSkZJRgABAQ      not used by viruses but well-used by spammers
    WMF                 183GmgAA                Windows MetaFile format
    BHX                 YmVnaW4gNj              Mac BinHex format
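
As an alternative to mailing yourself a file, a signature can be worked out directly from a file's leading bytes. The Python sketch below assumes the attachment is base64-encoded and that the file header is the first thing encoded; mime_signature is a hypothetical helper of my own, not something shipped with Regexfilter.

```python
import base64


def mime_signature(header: bytes) -> str:
    """Return the base64 characters fully determined by `header`.

    Each base64 character carries 6 bits, so only the first
    (8 * len(header)) // 6 characters are independent of whatever
    bytes follow the header in the file.
    """
    stable_chars = (8 * len(header)) // 6
    return base64.b64encode(header).decode("ascii")[:stable_chars]


print(mime_signature(b"GIF89a"))              # R0lGODlh
print(mime_signature(b"PK\x03\x04\x14\x00"))  # UEsDBBQA
```

To use it on a real file, read the first few bytes in binary mode and pass them in, then trim the result to whatever length gives a usable rule.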

    Viruses that arrive in encrypted zipfiles (such as Bagle) are not a problem for the above technique. Encrypted zipfiles have a standard header just like any other zipfile, so the above zipfile filters catch encrypted zips as well.

    If the above rules are used, the include of filters_virus.dat (at the top of the filter file) can be removed (or commented out). Several virus-specific tests included in the default Regexfilter file can also be removed (or commented out). These can be found by searching for "[SOBER Sober.P identified]", or by searching for a spam score of 500.0.

  4. Script filters: The first rule identifies hyperlinks to EXE files (Win32 executables). These are not detected by the above MIME filters because they are text: just a link to the executable, rather than the EXE itself. The next two rules do one job, which is to identify incoming HTA and VBS attachments. These files are likely to be malicious (viruses or similar) but, as they are scripts (i.e. text), they are again not detected by the above MIME filters. The final two rules do one job, which is to identify incoming HTML attachments. These files are likely to be redirectors to malicious software (viruses or similar) but, as they are HTML (i.e. text), they are not detected by the above MIME filters either.
       =Body:           9999   {http.*\.exe}                                          [CRIPPLED_FILETYPE_GENERIC link to Win32 executable]
       @Body:                  "Content-Disposition:"
       =Body:           9999   {name=.*\.(hta|vbs)}                                   [CRIPPLED_FILETYPE_GENERIC Win32 scripting]
       @Body:                  "Content-Disposition:"
       =Body:           9999   {name=.*\.htm}                                         [CRIPPLED_FILETYPE_GENERIC HTML document]
    

    Note: the HTML document detection finds files attached in HTML format. It does not look for (or mark as spam) HTML mail. It checks the file attachments, not the mail format.
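
To see what the script filters do and do not catch, the three tests can be tried against sample text in Python. This is a rough sketch under my own assumptions: the body is treated as one string, Python's regex engine stands in for Regexfilter's, and script_filter_hits and the sample strings are hypothetical.

```python
import re

# Patterns copied from the script filter rules above.
LINK_TO_EXE = re.compile(r"http.*\.exe")
SCRIPT_ATTACHMENT = re.compile(r"name=.*\.(hta|vbs)")
HTML_ATTACHMENT = re.compile(r"name=.*\.htm")


def script_filter_hits(body: str) -> list:
    """Return the labels of the script filters that would match `body`."""
    hits = []
    if LINK_TO_EXE.search(body):
        hits.append("link to Win32 executable")
    # The attachment tests are paired with a Content-Disposition: check.
    has_attachment = "Content-Disposition:" in body
    if has_attachment and SCRIPT_ATTACHMENT.search(body):
        hits.append("Win32 scripting")
    if has_attachment and HTML_ATTACHMENT.search(body):
        hits.append("HTML document")
    return hits


attached_html = (
    'Content-Type: text/html; name="invoice.html"\n'
    'Content-Disposition: attachment; filename="invoice.html"\n'
)
html_mail = "Content-Type: text/html\n\n<p>Hello</p>\n"

print(script_filter_hits(attached_html))  # ['HTML document']
print(script_filter_hits(html_mail))      # []
```

The second sample is ordinary HTML mail with no attachment, which is why it produces no hits.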

  5. Blacklisted sender filters: These will match on mail from any of the listed senders. The use of Regexfilter's Any-Sender: meta-header means the mail will match if the strings below are anywhere in From:, Sender:, Reply-To:, X-Sender:, Envelope-From:, or X-Envelope-From:. Users who correspond with any of the following organisations should whitelist them, or remove them from the filters. Regexfilter rules:
       =Any-Sender:     500.0  {Royal Bank of Scotland|bankofscotland\.co\.uk|NatWest|HSBC|lloydstsb\.co|lloyds\.co\.uk|barclays\.co}                            [CRIPPLED_SENDER_PHISHING rule 1]
       =Any-Sender:     500.0  {abbeynational\.co\.uk|(\.|@)abbey\.co|halifax\.co\.uk|alliance\-leicester\.co\.uk|(\.|@)egg\.com|cahoot\.com}                    [CRIPPLED_SENDER_PHISHING rule 2]
       =Any-Sender:     500.0  {CitiBusiness|(\.|@)citi\.com|citibank\.com|equifax\.com|commercebank\.com|bankofamerica|(\.|@)chase\.com|(\.|@)ally\.com}        [CRIPPLED_SENDER_PHISHING rule 3]
       =Any-Sender:     500.0  {wachovia\.com|americanexpress\.com|bankofthewest\.com|capitalone\.com|nationalcity\.com|tdbanknorth\.com|(\.|@)key\.com}         [CRIPPLED_SENDER_PHISHING rule 4]
       =Any-Sender:     500.0  {hmrc\.gov\.uk|adwords\-noreply@google\.com|networksolutions\.com|westernunion\.com|fdic\.gov}                                    [CRIPPLED_SENDER_PHISHING rule 5]
    

    Note: hyphen and dot both need to be escaped, @-sign does not

    Note: these are anti-phishing rules. There are plenty of people who scoff at this approach to anti-phishing, arguing that I'll end up blacklisting the entire internet. While I understand their point that blacklists have limited utility in an unconstrained problem space, I disagree with them, because in the case of phish filtering, the problem space is not unconstrained. It is limited to the most common financial institutions, plus a few ring-ins. This means blacklisting is feasible - it certainly works great for me, and it has done for years, and I spend almost no time maintaining my list.
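
Several of the short names in the rules above carry a (\.|@) prefix. It appears to be there so that a short name only matches as a whole domain label rather than as the tail of a longer name. The Python comparison below illustrates this; the unprefixed pattern and the address info@nutmegg.com are made up for the comparison and are not in my ruleset.

```python
import re

# "(\.|@)" demands a dot or an at-sign immediately before the name.
ANCHORED = re.compile(r"(\.|@)egg\.com")
# The same entry without the prefix, for comparison only.
UNANCHORED = re.compile(r"egg\.com")

# info@nutmegg.com is a made-up address used purely for illustration.
for sender in ("service@egg.com", "alerts@mail.egg.com", "info@nutmegg.com"):
    anchored = bool(ANCHORED.search(sender))
    unanchored = bool(UNANCHORED.search(sender))
    print(f"{sender:22} anchored={anchored} unanchored={unanchored}")
```

Only the unprefixed pattern matches the third address, which is the kind of false positive the prefix avoids.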

  6. Blacklisted keyword filters: These match on specific keywords in the SUBJECT of the message. This is fairly low-tech, but it does catch a chunk of spam, nonetheless. The strings are obvious enough. Sample Regexfilter rule:
       =SUBJECT:        500.0  {viagra|sildenafil|cialis|vicodin|xanax|regalis|valium|anatrim|phentermine|nicotine| pills|depressant}                      [CRIPPLED_SUBJECT_GENERIC specific drugs]
    
  7. Blacklisted keyword filters: These match on specific keywords in the SENDER of the message. This is a clone of the previous rule, but testing the SENDER rather than the subject. Sample Regexfilter rule:
       =Any-Sender:     500.0  {viagra|sildenafil|cialis|vicodin|xanax|regalis|valium|anatrim|phentermine|nicotine| pills|depressant}                      [CRIPPLED_SENDER_GENERIC specific drugs]
    
  8. Blacklisted character set filters: These match on specific character sets used in the mail. I cannot read these symbols, even if they are not spam, so there's no point keeping the mail. Users using a non-Western character set should disable/modify these. These rules are tweaked versions of existing Regexfilter rules.
       =SUBJECT:        500.0  {=\?(big5|gb2312|euc\-kr|ks_c|iso\-2022\-(kr|jp)|koi8\-r|windows\-1251|iso\-8859\-9|windows\-1254)\?} [NON_WESTERN_SUBJECT Non-western character set in subject]
       =CONTENT-TYPE:   500.0  {(big5|gb2312|euc\-kr|ks_c|iso\-2022\-(kr|jp)|koi8\-r|windows\-1251|iso\-8859\-9|windows\-1254)}      [NON_WESTERN_CONTENTTYPE Non-western character set in Content-Type]
    

    Note: hyphen and dot both need to be escaped

    Below is a table of various character sets and corresponding notes. This information is partly from Wikipedia, I think.

    character set  notes
    big5           used in Taiwan, Hong Kong and Macau for Traditional Chinese characters
    gb2312         GB2312 is the registered internet name for a key official character set of the People's Republic of China
    euc-kr         Extended Unix Code (EUC) is a multibyte character encoding system used primarily for Japanese, Korean, and simplified Chinese.
    ks_c           Korean
    iso-2022       Korean/Japanese
    KOI8-R         Russian, which uses the Cyrillic alphabet
    windows-1251   Russian
    iso-8859-9     Turkish
    windows-1254   Turkish
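
Before relying on the Content-Type rule, it can be checked against a few header values in Python. In this sketch the pattern is copied from the rule; the case-insensitive flag, the helper name is_non_western and the sample values are my own assumptions, and Regexfilter's own case handling may differ.

```python
import re

# Pattern copied from the Content-Type rule above.
NON_WESTERN = re.compile(
    r"(big5|gb2312|euc\-kr|ks_c|iso\-2022\-(kr|jp)|koi8\-r|windows\-1251"
    r"|iso\-8859\-9|windows\-1254)",
    re.IGNORECASE,  # assumption: charset names are compared case-insensitively
)


def is_non_western(content_type: str) -> bool:
    """True if the Content-Type value names one of the listed character sets."""
    return NON_WESTERN.search(content_type) is not None


print(is_non_western('text/plain; charset="KOI8-R"'))      # True
print(is_non_western('text/plain; charset="iso-8859-1"'))  # False
```

Near misses such as iso-8859-1 and windows-1252 are not matched, because the pattern spells out the final digit of each listed character set.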

A summary of the syntax used on this page (taken from section 7.1 of the RegExFilter manual):

symbol  meaning
=       on match no more rules are tested for this email
-       negate the result (logical NOT)
@       combine this rule with the next rule (logical AND)
~       decodes RFC-2047 encoded headers and RFC-2045 encoded bodies
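
How the =, - and @ symbols combine can be modelled with a small Python evaluator. This is a sketch of my reading of the table, not RegExFilter's actual implementation: the Rule class, the evaluate function, the way scores are added up and the message layout are all assumptions made for illustration. The example rules are the "Zipfile variant 2" pair from section 3.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    flags: str          # any combination of "=", "-", "@"
    target: str         # which part of the message to test, e.g. "Line"
    pattern: str        # regular expression
    score: float = 0.0


def evaluate(rules, message):
    """Return the total score for `message` (dict: target -> list of strings)."""
    total = 0.0
    chain_ok = True                      # result of the pending '@' chain
    for rule in rules:
        texts = message.get(rule.target, [])
        hit = any(re.search(rule.pattern, text) for text in texts)
        if "-" in rule.flags:
            hit = not hit                # logical NOT
        chain_ok = chain_ok and hit
        if "@" in rule.flags:
            continue                     # logical AND with the next rule
        if chain_ok:
            total += rule.score
            if "=" in rule.flags:
                return total             # no more rules are tested
        chain_ok = True                  # start a fresh chain
    return total


ZIP_RULES = [
    Rule("@", "Line", r"^UEsDBBQAA*"),
    Rule("-=", "Body", r"name=.*\.(docx|xlsx)", 9999),
]

# Hypothetical messages: a plain zipfile and a DOCX, both with a ZIP header.
zip_mail = {"Line": ["UEsDBBQAAAAIAK"], "Body": ['name="photos.zip"']}
docx_mail = {"Line": ["UEsDBBQAAAAIAK"], "Body": ['name="report.docx"']}

print(evaluate(ZIP_RULES, zip_mail))   # 9999.0
print(evaluate(ZIP_RULES, docx_mail))  # 0.0
```

The zipfile scores 9999 because the header matches AND the name is NOT a DOCX or XLSX; the DOCX fails the negated test and passes through unscored.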

Note: do NOT put comments at the end of rules in the RegExFilter filter file; this will cause the rule to stop working. Do this instead:

# comments must be on a separate line from rules
From:                  {@spammer\.com}