Why Bayesian filtering is the most effective anti-spam technology
This white paper describes how Bayesian mathematics can be applied to
the spam problem, resulting in an adaptive, ‘statistical
intelligence’ technique that is much harder for spammers to
circumvent. It also explains why the Bayesian approach is the best way
to tackle spam once and for all, as it overcomes the obstacles faced
by more static technologies such as blacklist checking, databases of
known spam and keyword checking. This is not to say that these
technologies are obsolete, but they cannot be used as effectively as
needed if not combined with a Bayesian filter.
---
Spam is an ever-increasing problem. The volume of spam mail grows
daily: studies in June 2003 showed that over 50% of all email was
spam. Added to this, spammers are becoming more
sophisticated and are constantly managing to outsmart 'static'
methods of fighting spam.
The techniques currently used by anti-spam software are static,
meaning that they are fairly easy to evade by tweaking the message a
little. To do this, spammers simply examine the latest anti-spam
techniques and find ways to dodge them.
To effectively combat spam, an adaptive new technique is needed. This
method must be familiar with spammers' tactics as they change over
time. It must also be able to adapt to the particular organization
that it is protecting from spam.
The answer lies in Bayesian mathematics, which can be applied to the
spam problem, resulting in an adaptive, ‘statistical intelligence’
technique that is much harder for spammers to circumvent. The Bayesian
approach is the best way to tackle spam once and for all, as
it overcomes the obstacles faced by more static technologies such as
blacklist checking, databases of known spam and keyword checking.
This is not to say that these technologies are obsolete, but they
cannot be used as effectively as needed if not combined with a
Bayesian filter.
How the Bayesian spam filter works
Bayesian filtering is based on the principle that most events are
dependent and that the probability of an event occurring in the
future can be inferred from the previous occurrences of that event.
(More information about the mathematical basis of Bayesian filtering
is available at
http://www-ccrma.stanford.edu/~jos/bayes/Bayesian_Parameter_Estimation.html
and http://www.niedermayer.ca/papers/bayesian/bayes.html.)
This same technique can be used to classify spam. If some piece of
text occurs often in spam but not in legitimate mail, then the next
time the same piece of text is encountered in a new email, it would
be reasonable to assume that this email is probably spam.
Creating a tailor-made Bayesian word database
Before mail can be filtered using this method, the user needs to
generate a history for each word or token (such as the $ sign, IP
addresses and domains, and so on). A probability value is then
assigned to each word or token; the probability is based on
calculations that take into account how often that word occurs in
spam as opposed to legitimate mail. This is done by analyzing the
users' outbound mail and known spam: all the words and tokens in both
pools of mail are analyzed to generate the probability that a message
containing a particular word is spam.
For example, if the word "mortgage" occurs in 400 of 3,000 spam mails
and 5 out of 300 legitimate emails, its spam probability is:
(400/3000) / (5/300 + 400/3000) = 0.8889.
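The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not the product's actual code: the function name and the choice to treat a never-seen token as neutral (0.5) are assumptions. Note that the formula compares the two frequencies directly, which amounts to assuming that spam and legitimate mail are equally likely beforehand.

```python
def token_spam_probability(spam_hits, spam_total, ham_hits, ham_total):
    """Spam probability of a token, from the share of mails containing it.

    spam_hits / spam_total: spam mails containing the token / all spam mails
    ham_hits / ham_total:   legitimate mails containing it / all legitimate mails
    """
    spam_freq = spam_hits / spam_total
    ham_freq = ham_hits / ham_total
    if spam_freq + ham_freq == 0:
        # Assumption: a token seen in neither pool carries no evidence.
        return 0.5
    return spam_freq / (spam_freq + ham_freq)


# The "mortgage" example from the text: 400 of 3,000 spam, 5 of 300 legitimate.
p = token_spam_probability(400, 3000, 5, 300)
print(round(p, 4))  # 0.8889
```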
It's important to note that this analysis is performed on the
company's mail, and is therefore tailored to that particular company.
For example, a financial institution that uses the word "mortgage"
many times over would get a lot of false positives if using a general
anti-spam rule set. The Bayesian filter, on the other hand, takes
note of the company's valid outbound mail (and recognizes "mortgage"
as being frequently used in legitimate messages), and therefore has a
much better spam detection rate and a far lower false positive rate.
Once the word probabilities have been calculated, the filter is ready
for use.
Note that the Bayesian filter is not static - it is constantly
updated based on new spam and valid emails; the Bayesian
filter's performance will therefore improve over time and - more
importantly - will adapt to a change in spam tactics and/or a change
in the kind of emails written by users within the organization.
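One way to picture this continuous updating is a word database that keeps running counts and recomputes probabilities on demand. The sketch below is an assumption-laden illustration: the class name WordDatabase, the tokenizer pattern and the sample messages are all hypothetical, and each token is counted once per mail to match the "occurs in N of M mails" calculation used earlier.

```python
import re
from collections import Counter

# Hypothetical token pattern: keeps tokens such as "$", "f-r-e-e",
# "192.168.0.2" and "example.com" in one piece.
TOKEN_RE = re.compile(r"[a-z0-9$]+(?:[-'.][a-z0-9$]+)*")


def tokenize(text):
    """Return the set of distinct lower-cased tokens in a message."""
    return set(TOKEN_RE.findall(text.lower()))


class WordDatabase:
    """Running per-token counts for spam and legitimate mail."""

    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()
        self.spam_total = 0
        self.ham_total = 0

    def train(self, text, is_spam):
        """Add one known-spam or known-legitimate message to the history."""
        tokens = tokenize(text)
        if is_spam:
            self.spam_counts.update(tokens)
            self.spam_total += 1
        else:
            self.ham_counts.update(tokens)
            self.ham_total += 1

    def probability(self, token):
        """Current spam probability of a token (0.5 if never seen)."""
        spam_freq = self.spam_counts[token] / self.spam_total if self.spam_total else 0.0
        ham_freq = self.ham_counts[token] / self.ham_total if self.ham_total else 0.0
        if spam_freq + ham_freq == 0:
            return 0.5
        return spam_freq / (spam_freq + ham_freq)


db = WordDatabase()
db.train("Cheap mortgage, free quote", is_spam=True)
print(db.probability("mortgage"))  # 1.0: so far seen only in spam

# A legitimate outbound mail using the same word shifts the estimate.
db.train("Mortgage application for the client is attached", is_spam=False)
print(db.probability("mortgage"))  # 0.5
```

Because the probabilities are derived from the counts rather than stored, every newly classified message changes the filter's behaviour immediately.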
Finding spam based on the Bayesian filter
When a new mail arrives, it is broken down into words and the most
relevant words - i.e., those that are most significant in identifying
whether the mail is spam or not - are singled out. From these words
the Bayesian filter calculates the probability that the new message
is spam. If the probability is greater than a threshold,
say 0.9, then the message is classified as spam.
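The text does not say how the individual word probabilities are combined. The sketch below assumes the common naive Bayes combination (as popularized by Paul Graham's "A Plan for Spam"): pick the tokens whose probability lies furthest from the neutral 0.5 and combine them as a product ratio. The function names, the limit of 15 tokens, the clamping to 0.01-0.99 and the sample probabilities are all illustrative assumptions; only the 0.9 threshold comes from the text.

```python
import math

SPAM_THRESHOLD = 0.9  # example threshold given in the text


def spam_score(tokens, token_probs, top_n=15, unknown=0.5):
    """Combine per-token spam probabilities into one message score.

    Assumption: naive Bayes combination over the top_n tokens whose
    probability deviates most from the neutral value 0.5.
    """
    # Clamp so that a single token can never force the score to exactly 0 or 1.
    probs = [min(0.99, max(0.01, token_probs.get(t, unknown))) for t in set(tokens)]
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    interesting = probs[:top_n]
    spam_product = math.prod(interesting)
    ham_product = math.prod(1 - p for p in interesting)
    return spam_product / (spam_product + ham_product)


def is_spam(tokens, token_probs):
    return spam_score(tokens, token_probs) > SPAM_THRESHOLD


# Hypothetical per-token probabilities, made up for this example.
token_probs = {
    "free": 0.95, "cash": 0.90, "mortgage": 0.8889,
    "smith": 0.02, "meeting": 0.05, "public": 0.50,
}

print(is_spam(["free", "cash", "mortgage"], token_probs))          # True
print(is_spam(["free", "cash", "smith", "meeting"], token_probs))  # False
```

In the second call, the tokens that usually indicate valid mail outweigh "free" and "cash". A neutral token such as "public" multiplies both products by the same factor, so under this combination rule it has no effect on the score.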
This Bayesian approach to spam is highly effective - a May 2003 BBC
article reported that spam detection rates of over 99.7% can be
achieved with a very low number of false positives.
Why Bayesian filtering is better than keyword checking in detecting spam
1. The Bayesian method takes the whole message into account - It
recognizes keywords that identify spam, but it also recognizes words
that denote valid mail. For example: not every email that contains
the words "free" and "cash" is spam. The advantage of the Bayesian
method is that it considers the most interesting words (as defined by
their deviation from the mean) and comes up with a probability that a
message is spam. The Bayesian method would find the words "cash" and
"free" interesting but it would also recognize the name of the
business contact who sent the message and thus classify the message
as legitimate, for instance; it allows words to "balance" each other
out. In other words, Bayesian filtering is a much more intelligent
approach because it examines all aspects of a message, as opposed to
keyword checking, which may classify a mail as spam on the basis of a
single word.
2. A Bayesian filter is constantly self-adapting - By learning from
new spam and new valid outbound mails, the Bayesian filter evolves
and adapts to new spam techniques. For example, when spammers started
using "f-r-e-e" instead of "free" they succeeded in evading keyword
checking until "f-r-e-e" was also included in the keyword database.
On the other hand, the Bayesian filter automatically notices such
tactics; in fact if the word "f-r-e-e" is found, it is an even better
spam indicator. Another example would be using the word "5ex" instead
of "Sex".
3. The Bayesian technique is sensitive to the user - To be successful
and have their messages delivered, spammers have to send emails that
are not caught by the intended victims' personalized filters. Because
the Bayesian method takes the company's email profile into account,
it detects spam with greater ease: Spammers would need to know the
company's email profile to be able to circumvent it. Because
spam mails have their own vocabulary and character, the Bayesian
filter can catch them out; however, it is not easy for spammers to
change their sales pitch to take an organization's email profile into
account; after all, there are only so many ways to sell Viagra!
4. The Bayesian method is multi-lingual and international - A
Bayesian anti-spam filter, being adaptive, can be used for any
language required. Most keyword lists are available in English
only and are therefore quite useless in non-English-speaking regions.
The Bayesian filter can also take into account certain language
deviations or the diverse usage of certain words in different areas,
even if the same language is spoken. This intelligence enables such a
filter to catch more spam.
5. A Bayesian filter is hard to trick as opposed to a keyword filter -
An advanced spammer who wants to trick a Bayesian filter can either
use fewer 'bad' words (i.e., words that usually indicate spam such as
free, Viagra, etc.), or more words that generally indicate valid mail
(such as a valid contact name, etc.). Doing the latter is impossible
because the spammer would have to know the email profile of each
recipient - and a spammer can never hope to gather this kind of
information from every intended recipient. Using neutral words, for
example the word "public", would not work since these are disregarded
in the final analysis. Breaking up spam words (using "f-r-e-e"
instead of "free") will just increase the chance of the message being
classified as spam, since a legitimate user will rarely write the word
"free" as "f-r-e-e".
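The contrast between a static keyword list and learned token probabilities on an obfuscated word can be sketched as follows. Everything here is hypothetical: the keyword list, the tokenizer pattern and the six training messages are invented purely for illustration.

```python
import re
from collections import Counter

# Hypothetical token pattern that keeps "f-r-e-e" as a single token.
TOKEN_RE = re.compile(r"[a-z0-9$]+(?:[-'.][a-z0-9$]+)*")


def tokenize(text):
    return set(TOKEN_RE.findall(text.lower()))


# A static keyword list that has not yet been updated for "f-r-e-e".
KEYWORDS = {"free"}


def keyword_hit(text):
    return bool(tokenize(text) & KEYWORDS)


# Tiny made-up training pools (illustration only).
spam_pool = ["get f-r-e-e pills today", "f-r-e-e cash offer", "free cash now"]
ham_pool = ["are you free for lunch today", "meeting notes attached",
            "free slot on friday"]

spam_counts = Counter(t for msg in spam_pool for t in tokenize(msg))
ham_counts = Counter(t for msg in ham_pool for t in tokenize(msg))


def token_probability(token):
    """Share-based spam probability of a token, 0.5 if never seen."""
    spam_freq = spam_counts[token] / len(spam_pool)
    ham_freq = ham_counts[token] / len(ham_pool)
    if spam_freq + ham_freq == 0:
        return 0.5
    return spam_freq / (spam_freq + ham_freq)


print(keyword_hit("f-r-e-e cash offer"))   # False: the keyword list misses it
print(token_probability("f-r-e-e"))        # 1.0: seen only in spam
print(token_probability("free") < 0.5)     # True: common in this legitimate mail
```

In this toy data the obfuscated spelling becomes a stronger spam indicator than the plain word, because it never appears in legitimate mail.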
Bayesian filters or updated keyword lists?
Some types of anti-spam software regularly download new keyword
files. While this is of course better than not updating keyword
lists, it is in fact a rather patchy approach that is easily
circumvented. Downloading updates makes evasion a little harder, but
the underlying approach is flawed compared to a Bayesian filter.
The Bayesian filter is a new approach to spam that is likely to
create a revolution in the sphere of anti-spam software - because it
is both intelligent and adaptive. History has shown us that this is
really the only way to deal with the complex and changing problem
that is spam.
About GFI MailEssentials
GFI MailEssentials for Exchange/SMTP is a server-based anti-spam and
email management solution for Microsoft Exchange Server and
Notes/SMTP servers. It virtually eliminates spam from your mail
server through its in-built Bayesian filter and its other key
features:
Block spam at server level - No need to update email clients
Automatic whitelist management - Keep whitelists up-to-date without
extra admin
Blacklists scanning - Stop mail from blacklisted senders and invalid
domains
Email header analysis - Blocks spam based on message field info
Keyword checking - Enables you to refine your anti-spam rules
and more!
GFI MailEssentials also adds key email tools to your mail server:
disclaimers, mail archiving and monitoring, reporting, server-based
auto replies and POP3 downloading.
GFI (www.gfi.com) is a leading provider of Windows-based network
security, content security and messaging software. Key products
include the GFI FAXmaker fax connector for Exchange and fax server
for networks; GFI MailSecurity email content/exploit checking and
anti-virus software; GFI MailEssentials server-based anti-spam
software; GFI LANguard Security Event Log Monitor (S.E.L.M.) that
performs event log based intrusion detection and network-wide event
log management; and GFI LANguard Network Security Scanner (N.S.S.)
that audits network security and allows administrators to remotely
install hotfixes and service packs. Clients include Microsoft,
Telstra, Time Warner Cable, Shell Oil Lubricants, NASA, DHL,
Caterpillar, BMW, the US IRS, and the USAF. GFI has six offices in
the US, UK, Germany, France, Australia and Malta, and has a worldwide
network of distributors. GFI is a Microsoft Gold Certified Partner
and has won the Microsoft Fusion (GEM) Packaged Application Partner
of the Year award.
For more information
Please email [email protected] or contact one of the GFI offices.