Everyone who does email marketing is concerned about deliverability. After all, how useful is even the greatest email if it gets blocked at the server level and never delivered, or if it goes into some overfilled spam box? But understanding deliverability means understanding how messages get filtered, and so today I’m covering some of the basic methods that email filters use to mark spam.
One of the most common email filters is SpamAssassin. It is used by a great many applications, from McAfee filters to Apple Mail. At its most basic level, SpamAssassin is a framework for combining various spam recognition techniques, and it allows integration with all sorts of custom solutions. Its basic framework provides a very good overview of spam identification techniques, including:
- Header tests
- Body phrase tests
- Bayesian filtering
- Automatic address whitelist/blacklist
- Manual address whitelist/blacklist
- Collaborative spam identification databases
- DNS Blocklists
SpamAssassin claims that using just this combination it can catch 99% of spam with a very low rate of false negatives. More importantly, it provides a framework for understanding how most spam is caught, and how an intricate collection of systems works together to make sure that your mailbox is spam-free.
Defined Rules
Each of these tests produces a score, and the scores are combined into an overall spam score. As an e-mail is analyzed, SpamAssassin’s rules are run and each generates a score. These are then tallied, with positive scores adding to the total and negative scores subtracting from it.
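To make the tallying concrete, here is a minimal Python sketch of the idea. It is not SpamAssassin's code or its rule format: the rule names, the scores, the `KNOWN_SENDERS` set, and the threshold of 5.0 are all assumptions chosen for illustration.

```python
from typing import Callable, NamedTuple


class Rule(NamedTuple):
    name: str
    score: float                     # positive = more spam-like, negative = less
    matches: Callable[[dict], bool]  # test run against the message


# Hypothetical address book and rules; real filters ship hundreds of rules.
KNOWN_SENDERS = {"friend@example.com"}

RULES = [
    Rule("SUBJECT_ALL_CAPS", 1.5, lambda m: m["subject"].isupper()),
    Rule("BODY_FREE_MONEY", 2.0, lambda m: "free money" in m["body"].lower()),
    Rule("SENDER_KNOWN", -3.0, lambda m: m["from"].lower() in KNOWN_SENDERS),
]

SPAM_THRESHOLD = 5.0  # assumed cut-off for this example


def spam_score(message: dict) -> tuple:
    """Run every rule and tally the scores of the rules that matched."""
    hits = [rule for rule in RULES if rule.matches(message)]
    return sum(rule.score for rule in hits), [rule.name for rule in hits]


def is_spam(message: dict) -> bool:
    """A message is treated as spam once its total reaches the threshold."""
    total, _ = spam_score(message)
    return total >= SPAM_THRESHOLD
```

A message from an unknown sender with an all-caps subject and the phrase "free money" in the body would hit the first two rules and still fall under the threshold. It takes several rules firing together to push a message over, which is the whole point of combining tests.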
Rules are pre-included (in newer SpamAssassin installations they get pulled from the server when you install) but you can also write your own. The format looks like this:
(from: http://taint.org/x/2007/maawg/maawg-dublin-june2007.pdf)
For instance, an early header test used to catch spam looked at an Outlook Express header. Outlook Express includes DATETIME when generating a Message-ID. Compare that against the correct time stamp from the Received header data and, in spam, there would be discrepancies. The result was a rule that caught 25% of spam.
In SpamAssassin’s specific case (and in many other email filters) the weight of each rule is predefined, but as more email is filtered a genetic algorithm optimizes the scores.
Bayesian Filters
Adding a Bayesian filter into the mix increases the likelihood of catching spam and reducing false positives. Bayesian spam filtering is a statistical technique of e-mail filtering. It makes use of a simple probabilistic classifier based on applying Bayes’ theorem to identify characteristics of spam e-mail.
It works by comparing the probability of words/phrases occurring in normal mail versus spam mail. The filter doesn’t know the probability of any particular word or phrase in advance; instead, it has to be trained manually by indicating whether each new email is or is not spam.
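Here is a rough sketch of that train-then-score loop as a plain naive Bayes classifier with add-one smoothing. The class name, the tokenizer, and the smoothing choice are my own assumptions; production filters, SpamAssassin's included, use more refined ways of combining word probabilities.

```python
import math
import re
from collections import Counter


class NaiveBayesFilter:
    """Toy word-based classifier trained on messages labeled 'spam' or 'ham'."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, label):
        """Record one labeled message (label must be 'spam' or 'ham')."""
        self.message_counts[label] += 1
        self.word_counts[label].update(self.tokenize(text))

    def spam_probability(self, text):
        """Estimate P(spam | words), working in log space to avoid underflow."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            # Smoothed prior: how common this class is among trained messages.
            log_p = math.log((self.message_counts[label] + 1) / (total_messages + 2))
            total_words = sum(self.word_counts[label].values())
            for word in self.tokenize(text):
                # Add-one smoothing so unseen words never give zero probability.
                count = self.word_counts[label][word]
                log_p += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_scores[label] = log_p
        # Clamp the difference so math.exp cannot overflow on long messages.
        diff = max(min(log_scores["ham"] - log_scores["spam"], 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))
```

Before any training the filter returns 0.5 for everything, which mirrors the point above: it has no opinion about any word until someone labels mail for it.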
Initial training is usually refined when the user detects false positives/negatives. In SpamAssassin the command sa-learn, and in Apple Mail the “report as spam” and “not spam” buttons, cause the content of that email to be reevaluated as spam or not spam. In Gmail (and many other online mail services) these buttons work the same way, but they also feed into an overall filter that Google uses to determine spam (which is why it’s so bad to have Gmail users mark your email marketing as spam).
The advantage of Bayesian spam filtering is that it can be trained on a per-user basis. Any individual user’s spam is related to their online activities. For example, a user might have opted in for a newsletter that they no longer want, and now consider spam. This newsletter likely contains words common to many such newsletters. However, a Bayesian spam filter will eventually assign a higher probability to more specific words based on the user’s specific patterns.
Bayesian filters aren’t unbeatable, of course. You can still confuse them with clever design. If a Bayesian filter bases itself on the real practices of email users, a spammer just needs to mimic a “good” email: random snippets of real text hidden so that humans can’t read them, HTML chaff, and so on, as long as it stays random. However, these tricks are easy to pick up with rules, by removing invisible text, etc. What does work is text that matches conversational English, or content that doesn’t get caught by Bayesian analysis. This is why you see so many spam messages using a snippet of regular, unrelated text and then propping an image over it.
Google’s solution (which is used by Gmail) is to perform OCR (Optical Character Recognition) on every mid- to large-size image, analyzing the text inside.
Whitelists/Blacklists
Whitelists take a lot of effort to maintain, but are often shared between email systems (another reason why you don’t want to be marked as spam). An e-mail whitelist determines whether your address is acceptable to receive email from and thus should skip spam testing. In SpamAssassin, whitelist or blacklist membership usually gives -100 or +100 points respectively, assuring that the message gets through or gets stuck.
Some ISPs keep whitelists that they use to filter e-mail coming to their customers. In some cases, companies will pay for a set time period during which they are allowed to e-mail the ISP’s customers, or will “pay per complaint” that the ISP receives from customers (in increments, with 10 costing so much apiece and 20 costing so much more, for instance).
That is why it is crucial to make sure that you don’t end up on a blacklist: it can be incredibly difficult to convince a sysop that your messages (especially promotional ones) are legitimate and should be allowed through. This goes double for ISPs who may charge you for access to their email customers.
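A quick sketch shows why a swing of 100 points settles the matter regardless of what the content tests say. The function name, the threshold of 5.0, and the choice to let the whitelist win when an address is on both lists are assumptions for illustration only.

```python
WHITELIST_SCORE = -100.0
BLACKLIST_SCORE = 100.0
SPAM_THRESHOLD = 5.0  # assumed cut-off for this example


def address_list_adjustment(sender, whitelist, blacklist):
    """Return the score adjustment for a sender based on list membership."""
    sender = sender.strip().lower()
    if sender in whitelist:  # assumption: whitelist wins if listed on both
        return WHITELIST_SCORE
    if sender in blacklist:
        return BLACKLIST_SCORE
    return 0.0


def final_verdict(content_score, sender, whitelist, blacklist):
    """Combine the content score with the list adjustment; True means spam."""
    total = content_score + address_list_adjustment(sender, whitelist, blacklist)
    return total >= SPAM_THRESHOLD
```

With these numbers, a whitelisted sender whose message scores 12.0 on content still lands at -88.0 and is delivered, while a blacklisted sender with a spotless message lands at 100.0 and is stopped.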
Collaborative spam identification databases
There are a lot of collaborative spam identification services and databases, including the likes of DCC and Pyzor. The central premise of these systems is that they compile a database of checksums for email messages which can then be matched against incoming mail, identifying and blocking mass email. Most of these work by stripping headers and just looking at certain elements of body copy (since some spam injects random data into messages to get around such systems). They then convert these into summary information that can be compared quickly against other messages.
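The checksum idea can be sketched in a few lines of Python. To be clear, this is not the algorithm DCC or Pyzor actually uses; the normalization steps (dropping headers, lowercasing, discarding tokens that contain digits, removing punctuation) and the choice of SHA-1 are assumptions that only illustrate the principle.

```python
import hashlib
import re


def normalize_body(raw_message: str) -> str:
    """Strip headers and remove the parts of the body spammers tend to vary."""
    # Headers end at the first blank line; everything after it is the body.
    _, _, body = raw_message.replace("\r\n", "\n").partition("\n\n")
    words = body.lower().split()
    # Tokens containing digits are treated as injected random data and dropped.
    words = [w for w in words if not any(ch.isdigit() for ch in w)]
    text = re.sub(r"[^a-z ]+", "", " ".join(words))
    return re.sub(r" +", " ", text).strip()


def body_checksum(raw_message: str) -> str:
    """Summarize a message as a short digest that is cheap to compare."""
    return hashlib.sha1(normalize_body(raw_message).encode("utf-8")).hexdigest()
```

Two copies of the same mailing with different senders, subjects, and tracking codes reduce to the same digest, so a shared database only has to count how many times that digest has been reported.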
DNS Block Lists
DNS Blocklists, blackholes (DNSBL), etc. are lists of IP addresses collected through a variety of methods, including discovered banks of zombie computers (computers that have had malware installed to run an email bot), addresses caught in a honeypot, and those of ISPs who have been found to be willfully hosting spammers. There are loads of different lists, and they all use a varying set of standards.
To fight the use of new ISPs by spammers, some block lists also look at the URLs of links within an email and block based on those. That’s why you get spam that tells you not to click, but to enter a URL into your address bar.
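For the curious, a DNSBL is normally queried through ordinary DNS: the octets of the sending IP address are reversed, the list's zone name is appended, and the result is looked up as a hostname. The sketch below follows that convention; the zone `dnsbl.example.org` in the usage note is a placeholder rather than a real list, and the function names are my own.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the hostname to look up, e.g. 192.0.2.99 -> 99.2.0.192.<zone>."""
    address = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(address).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the blocklist zone has a record for this IP address."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        # No record found. Note that a DNS outage also ends up here,
        # so this sketch cannot tell "not listed" from "lookup failed".
        return False
    return True
```

Calling `dnsbl_query_name("192.0.2.99", "dnsbl.example.org")` yields `99.2.0.192.dnsbl.example.org`. Each list operator decides what a positive answer means, so check a list's own documentation before acting on its results.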
This last point is key, since it’s something that email marketers don’t often think about: linking to the wrong domain can cause you to earn serious negative points. I touched on this a few months ago when I noted that ow.ly was on the URIBL.
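If you want to audit your own campaigns for this, a rough approach is to pull every linked hostname out of the message body and compare it against the domains you know to be listed. In the sketch below a local Python set stands in for the real lookup, and the pattern, function names, and example domains are all assumptions.

```python
import re
from urllib.parse import urlparse

# Deliberately simple pattern: http(s) links up to whitespace, a quote, or a bracket.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def linked_hosts(body: str) -> set:
    """Collect the hostnames of all links found in a message body."""
    hosts = set()
    for url in URL_PATTERN.findall(body):
        host = urlparse(url).hostname  # already lowercased by urlparse
        if host:
            hosts.add(host.rstrip("."))
    return hosts


def blocked_hosts(body: str, blocklist: set) -> set:
    """Return linked hosts that equal or are subdomains of a listed domain."""
    return {
        host
        for host in linked_hosts(body)
        if any(host == domain or host.endswith("." + domain) for domain in blocklist)
    }
```

Running this over a draft before sending flags link shorteners and third-party tracking domains that would otherwise drag your score down without you noticing.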
In the case of SpamAssassin, a variety of block lists are used, each with a different “score”. These are then weighed against the other factors listed above. Ending up on the wrong one can make it impossible for your email to get through to anyone except recipients who have modified the score of that particular blocklist or who are not using SpamAssassin at all.
Most spam filtering applications rely on a combination of the above to determine spam, and as you can imagine there are a variety of ways that even a legitimate email marketer can get caught up in these systems. More importantly, understanding how these systems work makes sure that you don’t start using bad practices that could get you filtered out of people’s inboxes. However, when it all comes down to it, what these systems are designed to catch is spammers gaming the filtering system, and the best way to avoid that is to send a legitimate, permission-based campaign that people aren’t going to label spam, that ISPs won’t get complaints about, and that won’t end up on blacklists.