fixes for Regexfilter's ruleset
January 23, 2010 (as amended)
this page is one of a series, see end for links

The default ruleset in Regexfilter has a few issues, which are documented below:

  1. Outlook 2003

    When Outlook 2003 sends an HTML mail, the HTML tags are problematic for RegExFilter:

    X-RegEx: [250.0] TEXT_AFTER_HTML_TAG Nach </HTML>Tag kommt noch Text       
    X-RegEx: [200.0] INVALID_HTML Content-Type is HTML without HTML tag       
    

    This is likely to result in most emails from Outlook 2003 users being sent to the spam folder.

    The first regex gets tripped by this:

    </HTML>=0D
    

    The trailing =0D is the quoted-printable encoding of a carriage return (hex 0D), which Outlook 2003 inserts at the end of the line. RegExFilter does not decode it, so the characters "0D" look like text following the closing tag.
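
    As a rough illustration of what the filter is missing, the sketch below decodes the raw bytes with Python's standard quopri module. The variable names are made up for this example, and the second sample is shortened from the multi-line HTML tag shown further down this page; nothing here is part of RegExFilter itself.

```python
import quopri

# Raw bytes as they appear in the undecoded message source.
raw_tail = b"</HTML>=0D"
raw_attr = b'xmlns:o=3D"urn:schemas-micr=\nosoft-com:office:office"'

# =0D decodes to a carriage return byte, =3D decodes to "=", and a
# trailing "=" at the end of a line is a soft line break that is removed.
decoded_tail = quopri.decodestring(raw_tail)
decoded_attr = quopri.decodestring(raw_attr)

print(decoded_tail)
print(decoded_attr)
```

    A filter that decoded the body first would see only a line ending after the closing tag, not the letters "0D".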

    The first regex is:

       ~BODY:           250.0  {\<\/html\>.*[a-z0-9]{1,}}    [TEXT_AFTER_HTML_TAG Nach </HTML>Tag kommt noch Text]
    

    Fix for the first regex:

       ~BODY:           250.0  {(?!\<\/html\>=0D)\<\/html\>.*[a-z0-9]{1,}}    [TEXT_AFTER_HTML_TAG Nach </HTML>Tag kommt noch Text]
    

    This is identical, except for the leading "(?!\<\/html\>=0D)", a negative lookahead which means: EXCEPT if the text at this point is "</html>=0D". That is, it causes the rule to say "no match" if it does actually match, BUT the match starts with "</html>=0D".
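
    The effect of the lookahead can be checked outside the filter. The sketch below runs both patterns through Python's re module, assuming that RegExFilter's regex dialect treats this syntax the same way and that the "~" prefix means case-insensitive matching (the uppercase sample tripping the lowercase rule suggests so, but this is an assumption). The "buy now" string is a made-up example of real text after the closing tag.

```python
import re

# Patterns copied from the two rules above.
# re.IGNORECASE stands in for the "~" prefix (assumption).
ORIGINAL = re.compile(r"\<\/html\>.*[a-z0-9]{1,}", re.IGNORECASE)
FIXED = re.compile(r"(?!\<\/html\>=0D)\<\/html\>.*[a-z0-9]{1,}", re.IGNORECASE)

outlook_tail = "</HTML>=0D"    # what Outlook 2003 sends
spam_tail = "</html>buy now"   # hypothetical text after the closing tag

for name, pattern in (("original", ORIGINAL), ("fixed", FIXED)):
    hits_outlook = pattern.search(outlook_tail) is not None
    hits_spam = pattern.search(spam_tail) is not None
    print(name, "outlook:", hits_outlook, "spam:", hits_spam)
```

    Under these assumptions the original pattern matches both strings, while the fixed pattern matches only the second one.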

    The second regex gets tripped by this:

    <HTML xmlns=3D"http://www.w3.org/TR/REC-html40" xmlns:o=3D"urn:schemas-micr=
    osoft-com:office:office" xmlns:w=3D"urn:schemas-microsoft-com:office:word">
    

    The regex expects the HTML tag to be closed immediately with a ">". Outlook 2003, however, puts extra attributes before the ">", so many that the HTML tag doesn't finish until the end of the second line. The HTML tag spans two lines in the above sample (it may span three or more in other mails), but the regex uses LINE syntax, so it will always fail for multi-line HTML tags.

    The second regex is:

      -~LINE:           200.0  {\<html\>}                    [INVALID_HTML Content-Type is HTML without HTML tag]
    

    Fix for the second regex:

      -~LINE:           200.0  {\<html(\>| xmlns)}                    [INVALID_HTML Content-Type is HTML without HTML tag]
    

    The fixed regex says that the HTML tag must end immediately with ">", OR, it must continue with " xmlns" (as seen in the above sample).
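
    A small sketch of the line-by-line behaviour, using Python's re module. It assumes that the leading "-" negates the rule (it fires when no line matches) and that "~" means case-insensitive; both are inferred from the rule's description, not confirmed. The function name rule_fires is made up for this example.

```python
import re

# Patterns copied from the two rules above.
ORIGINAL = re.compile(r"\<html\>", re.IGNORECASE)
FIXED = re.compile(r"\<html(\>| xmlns)", re.IGNORECASE)

# The Outlook 2003 sample from above, one list entry per line of the mail.
lines = [
    '<HTML xmlns=3D"http://www.w3.org/TR/REC-html40" xmlns:o=3D"urn:schemas-micr=',
    'osoft-com:office:office" xmlns:w=3D"urn:schemas-microsoft-com:office:word">',
]


def rule_fires(pattern, body_lines):
    """Negated LINE rule (assumption): fires when no single line matches."""
    return not any(pattern.search(line) for line in body_lines)


print("original fires:", rule_fires(ORIGINAL, lines))
print("fixed fires:", rule_fires(FIXED, lines))
```

    Because each line is tested on its own, the fix only needs to recognise the start of the tag on the first line; it never has to see the closing ">" on the second.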

    These lines MAY identify when Outlook 2003 is being used:

    X-MimeOLE: Produced By Microsoft MimeOLE V6.00.3790.4073
    <meta content=3D"Microsoft Word 11 (filtered medium)" name=3DGenerator>=0D 
    

    Note: the problem HTML tags may be generated only when Word is used as Outlook's editor (untested).

    Note: Office 11 includes Word 11 and Outlook 11 and is also known as Office 2003 - http://en.wikipedia.org/wiki/Microsoft_Office

  2. Windows Mail

    Windows Mail uses a header line which trips RegExFilter:

    X-RegEx: [250.0] MISSING_OUTLOOK_NAME Message looks like Outlook, but isn't
    

    This is due to the X-Mailer header used by Windows Mail, which is as follows:

    X-Mailer: Microsoft Windows Mail 6.0.6002.18197
    

    The fix is to allow this variant of X-Mailer header too:

      -X-MAILER:                250.0  {Microsoft (CDO|Office Outlook|Outlook|Windows Mail)\b}  [MISSING_OUTLOOK_NAME Message looks like Outlook, but isn't]
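
    The pattern from this rule can be tried against the header value with Python's re module, as sketched below. The helper name and the "Outlookish" value are made up for illustration, and the sketch assumes RegExFilter interprets the alternation and \b the same way Python does.

```python
import re

# Pattern copied from the fixed rule above (no "~", so case-sensitive here).
X_MAILER = re.compile(r"Microsoft (CDO|Office Outlook|Outlook|Windows Mail)\b")


def looks_like_known_mailer(value):
    """True if the X-Mailer value names one of the accepted Microsoft mailers."""
    return X_MAILER.search(value) is not None


print(looks_like_known_mailer("Microsoft Windows Mail 6.0.6002.18197"))
print(looks_like_known_mailer("Microsoft Outlookish 1.0"))  # hypothetical value
```

    The \b at the end requires a word boundary after the product name, so a name that merely begins with "Outlook" is not accepted.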
    

  3. subheaders not recognised

    Note: this is not actually a fix; it explains HOW to work around an apparent shortcoming in Regexfilter. Nobody needs to worry about this bit, except those making a new regex to find a subheader in a multipart message.

    Regexfilter does not seem to treat the headers of the individual parts of a multipart message as headers. That is, in a multipart message, each part begins with its own header lines, like this:

    Content-Type: application/octet-stream;	name="wicked_scr.scr"
    Content-Transfer-Encoding: base64
    Content-Disposition: attachment; filename="wicked_scr.scr"
    

    However, these rules do NOT match:

       Header: 20  "Content-Disposition:" [CRIPPLED_FILETYPE_GENERIC disposition]
       CompleteHeader: 20  "Content-Disposition:" [CRIPPLED_FILETYPE_GENERIC disposition]
       Match: 20  "Content-Disposition:" [CRIPPLED_FILETYPE_GENERIC disposition]
       ~Header: 20  "Content-Disposition:" [CRIPPLED_FILETYPE_GENERIC disposition]
       ~CompleteHeader: 20  "Content-Disposition:" [CRIPPLED_FILETYPE_GENERIC disposition]
       ~Match: 20  "Content-Disposition:" [CRIPPLED_FILETYPE_GENERIC disposition]
    

    These DO work:

       Body: 20  "Content-Disposition:" [CRIPPLED_FILETYPE_GENERIC disposition]
       Line: 20  "Content-Disposition:" [CRIPPLED_FILETYPE_GENERIC disposition]
       ~Body: 20  "Content-Disposition:" [CRIPPLED_FILETYPE_GENERIC disposition]
       ~Line: 20  "Content-Disposition:" [CRIPPLED_FILETYPE_GENERIC disposition]
    

    Therefore, it seems Regexfilter considers these sub-header lines to be part of the body of the message.
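
    One plausible explanation, sketched below in Python, is that the filter splits the message at the first blank line and treats everything after it as body. This is a guess at the behaviour, not a description of RegExFilter's code; the message text, addresses and boundary string are invented for the example.

```python
# A minimal multipart message; the part headers come after the first blank line.
raw = (
    "From: someone@example.com\n"
    'Content-Type: multipart/mixed; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    'Content-Type: application/octet-stream; name="wicked_scr.scr"\n'
    "Content-Transfer-Encoding: base64\n"
    'Content-Disposition: attachment; filename="wicked_scr.scr"\n'
    "\n"
    "AAAA\n"
    "--XYZ--\n"
)

# Split at the first blank line: top-level headers before, body after.
top_headers, body = raw.split("\n\n", 1)

print("in headers:", "Content-Disposition:" in top_headers)
print("in body:", "Content-Disposition:" in body)
```

    With a split like this, a Header rule never sees the part headers, while Body and Line rules do.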

    Problem: the Content-Disposition: line can go over several lines, but as Regexfilter is treating it as a line of body text, it does not read the second line. For example, to detect this line:

    Content-Disposition: attachment; filename="urmama.hta"
    

    This rule works fine:

       Body: 20  {Content-Disposition:.*filename=.*\.hta} [CRIPPLED_FILETYPE_GENERIC disposition]
    

    BUT if the spammer spaces it over two lines, this breaks the rule:

    Content-Disposition: attachment;
     filename="urmama.hta"
    

    To fix, use two rules like this:

       @Body:    "Content-Disposition:"
       Body: 20  {filename=.*\.hta} [CRIPPLED_FILETYPE_GENERIC disposition]
    

    The @ in the first rule says "combine this rule with next" (logical AND). The second rule finds our evil file extensions, BUT it will only fire if the first rule was also true.
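
    The difference between the single rule and the chained pair can be sketched in Python. The sketch assumes that "." does not match a line break in RegExFilter (as in Python's default mode), which is consistent with the behaviour described above; chained_match is a made-up helper that imitates the logical AND of the two rules.

```python
import re

# Single rule from above, and the two chained rules that replace it.
SINGLE_RULE = re.compile(r"Content-Disposition:.*filename=.*\.hta")
CHAINED_RULES = [
    re.compile(r"Content-Disposition:"),
    re.compile(r"filename=.*\.hta"),
]

one_line = 'Content-Disposition: attachment; filename="urmama.hta"'
folded = 'Content-Disposition: attachment;\n filename="urmama.hta"'


def chained_match(body):
    """Logical AND: every rule must match somewhere in the body."""
    return all(rule.search(body) is not None for rule in CHAINED_RULES)


for label, text in (("one line", one_line), ("folded", folded)):
    print(label, "single:", SINGLE_RULE.search(text) is not None,
          "chained:", chained_match(text))
```

    Note the trade-off: the chained pair no longer requires the two strings to be next to each other, so it is slightly looser than the single rule.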

Note: do NOT put comments at the end of rules in the RegExFilter filter file; this causes the rule to stop working. Do this instead:

# comments must be on a separate line from rules
From:                  {@spammer\.com}