How Does Your Computer Recognize Spam Mail?
Spam is a problem for all email users, but it could be a lot worse. Thanks to an 18th-century English mathematician who’d never even heard of Viagra, your daily trickle of laser eye-surgery and organ-enlargement throwaways is prevented from becoming a raging flood.
The Reverend Thomas Bayes died in 1761. Published two years after his death, his important essay on the subject of probability included a mathematical rule now known as Bayes’ theorem. That same theorem now forms the basis of "smart" spam filtration.
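For readers who like to see the rule itself, here is one common way of writing Bayes’ theorem for the spam question: how likely is a message to be spam, given that it contains some word w? The symbol w and the numbers in the worked line are made up purely for illustration; they are not taken from any real filter.

```latex
P(\text{spam} \mid w) =
  \frac{P(w \mid \text{spam})\, P(\text{spam})}
       {P(w \mid \text{spam})\, P(\text{spam}) + P(w \mid \text{ham})\, P(\text{ham})}

% Illustrative numbers only:
% P(w | spam) = 0.40, P(w | ham) = 0.01, P(spam) = P(ham) = 0.5
P(\text{spam} \mid w) =
  \frac{0.40 \times 0.5}{0.40 \times 0.5 + 0.01 \times 0.5}
  = \frac{0.200}{0.205} \approx 0.976
```

In words: if a word turns up in 40% of spam but only 1% of good mail, then seeing it makes a message very likely, though not certain, to be spam.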
Spam evolves. Spammers are always devising more sophisticated ways to get through to your inbox, and ‘mutating spam’ changes in response to server knockbacks. Blocking spam used to be a simple matter of "blacklisting" bad senders and building lists of banned content words, but hard-and-fast filtering rules like these no longer perform well. As that approach no longer works, spam filters have had to evolve too.
Bayesian filters don’t simply build lists of words and email addresses; they build lists of classifiers. Once an email is classified as spam (or not), it becomes a gold mine of further classifiers for the Bayesian algorithm. Patterns of information, whether in images, text content, or source header data, are used by the algorithm as a kind of statistical template to check new incoming mail against.
It’s vital, then, that classifiers are accurate. To improve their accuracy, the filter needs to "learn" when it gets classification right and when it doesn’t. And what better to teach it than the most sophisticated classification device we know of: a human brain? Brains usually know ham when they see it.
Receiving spam is annoying, but having "good" email (sometimes called "ham") classified as spam is worse. Depending on filter settings, it may get moved to another folder that you don’t check often, or may even get deleted. When a filter classifies ham as spam, that’s known as a false positive. Fortunately, it’s easy to tell the algorithm about false positives so that, over time, they become fewer and fewer.
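To make the learning loop concrete, here is a minimal Python sketch of how a filter might keep per-word statistics and correct itself when a false positive is reported. It is an illustration under simple assumptions, not how any particular product works: the class name TokenStats, its methods, and the sample messages are all hypothetical, and real filters also look at headers, images, and much more than single words.

```python
from collections import Counter
import re


class TokenStats:
    """Per-word counts from messages a human has filed as spam or ham."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    @staticmethod
    def tokens(text):
        # Lowercased set of words: each word counts once per message.
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def learn(self, text, label):
        self.counts[label].update(self.tokens(text))
        self.messages[label] += 1

    def forget(self, text, label):
        self.counts[label].subtract(self.tokens(text))
        self.messages[label] -= 1

    def relabel(self, text, old, new):
        # What happens when the user drags a message to the other folder.
        self.forget(text, old)
        self.learn(text, new)

    def spam_probability(self, word):
        # Compare how often the word shows up in spam versus ham.
        # Add-one smoothing keeps never-seen words near 0.5.
        spam_rate = (self.counts["spam"][word] + 1) / (self.messages["spam"] + 2)
        ham_rate = (self.counts["ham"][word] + 1) / (self.messages["ham"] + 2)
        return spam_rate / (spam_rate + ham_rate)


stats = TokenStats()
stats.learn("cheap pills, free offer", "spam")
stats.learn("free laser surgery offer", "spam")
stats.learn("lunch on friday?", "ham")

newsletter = "free tickets for friday's concert"
stats.learn(newsletter, "spam")            # the filter's mistake: a false positive
before = stats.spam_probability("free")

stats.relabel(newsletter, "spam", "ham")   # the user moves it to "ham"
after = stats.spam_probability("free")
```

After the correction, the word "free" counts as weaker evidence of spam than it did before (after is lower than before), which is the sense in which false positives become fewer over time.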
How does this work? Let’s use the popular spam-filtering program SpamAssassin as an example. This program, usually installed on your email server, comes with a Bayesian training tool called sa-learn. To "teach" it, you set up folders in your email client that correspond to "spam" and "ham." To kick-start the process, it’s a good idea to put a bunch of spam and ham into the relevant folders. After that, each time a new spam message is delivered to your inbox you move it to "spam," and each time you pick up a false positive you move it to "ham."
If sa-learn is set up right, it will scan through your "spam" and "ham" folders once per day, and then adjust its classifiers to achieve a better match with what it finds there.
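As a rough idea of what "set up right" can mean, the sketch below builds the two sa-learn calls a daily scheduled job might make, one per training folder. The --spam and --ham options are sa-learn’s own; everything else is an assumption for illustration: the helper names, the folder names, and the mail root are hypothetical, and the right paths depend on how your server stores mail.

```python
import shutil
import subprocess
from pathlib import Path


def sa_learn_commands(mail_root):
    """Build one sa-learn call per training folder (folder names are assumed)."""
    root = Path(mail_root)
    return [
        ["sa-learn", "--spam", str(root / "spam")],
        ["sa-learn", "--ham", str(root / "ham")],
    ]


def train(mail_root):
    """Run the commands, or just return them if sa-learn is not installed."""
    commands = sa_learn_commands(mail_root)
    if shutil.which("sa-learn") is None:
        return commands
    for command in commands:
        subprocess.run(command, check=True)
    return commands
```

A scheduler such as cron could then call train() once a day with the user’s mail directory; that scheduling step is left out here because it varies from server to server.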
The filter is a kind of Bayesian agent. More technically, it’s a "naïve" Bayesian agent: it treats every clue in a message as if it were independent of all the others, because applying Bayes’ theorem in full, with every dependency between clues taken into account, is impractical. The algorithm doesn’t really do anything on its own apart from processing information. But, in combination with a utility function that does something with that information (like assigning a "spam score" out of 10 to each message), it becomes a useful tool. So, a combination of inference and action gives us an agent.
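Here is a small Python sketch of that split between inference and action. The word probabilities are invented for the example, the function names are hypothetical, and the threshold of 5 is an arbitrary choice; the "naïve" part is simply that each word’s evidence is added up as though the words had nothing to do with one another.

```python
import math

# Hypothetical per-word spam probabilities, as a trained filter might hold them.
WORD_SPAM_PROBABILITY = {
    "viagra": 0.99,
    "free": 0.80,
    "offer": 0.75,
    "meeting": 0.10,
    "friday": 0.20,
}


def spam_probability(words, table=WORD_SPAM_PROBABILITY):
    """Naive Bayes combination: treat every word as independent evidence."""
    log_odds = 0.0  # even prior odds: spam and ham assumed equally likely
    for word in words:
        p = table.get(word)
        if p is None:
            continue  # unknown words carry no evidence in this sketch
        log_odds += math.log(p) - math.log(1.0 - p)
    return 1.0 / (1.0 + math.exp(-log_odds))


def spam_score(words):
    """Inference: a score out of 10 for the message."""
    return round(10 * spam_probability(words), 1)


def act(words, threshold=5.0):
    """Utility function: turn the score into an action."""
    return "move to spam" if spam_score(words) > threshold else "deliver"
```

With these made-up numbers, act(["free", "viagra", "offer"]) sends the message to the spam folder, while act(["meeting", "friday"]) delivers it. Working in log-odds rather than multiplying raw probabilities is a common design choice because it avoids tiny numbers underflowing on long messages.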
Spam filtration isn’t so different from water filtration. Imagine pushing a torrent of emails through a series of meshes—each one finer than the previous one—with the "pure ham" we want coming out as the end product. Top-level filters and "block lists" on the servers of Internet service providers (ISPs) are the reservoir grilles trapping branches and big debris. User-controlled filters on ISP mail servers trap leaves, twigs, and trash. Automatic and rule-based filters on end-user email client computers trap grit.
In these terms, our attention is a super-fine mesh that can get rid of even the tiniest particles. But we’d really like to stop the spam before it ever reaches that one. Bayesian filtering is one of the finest ways to do that.
Were this an email, there’s a chance that you wouldn't get to read it. Because the text contains many occurrences of the word "spam," it might get picked out and trashed by some filter somewhere before it ever reaches your inbox. It’s quite a tricky challenge for a Bayesian agent to learn that stories about spam with "spam" in the message subject aren’t necessarily spam.
But if the Bayesian agents that were to process this email had been doing their sa-learning homework, and they aren’t too strict or naïve, then the email would make it through.