by Adam C. Engst ace@tidbits.com
Having to sort through the increasingly repulsive spam
that's rushing into our electronic mailboxes is becoming more unpleasant
than ever. You can reduce the flow, though, with one of three basic
approaches to filtering spam out of your email stream: Boolean filters,
points-based filters, and so-called "Bayesian" statistical
filters.
Put simply, a Boolean filter
looks for a string of text, and if it's found, considers the message spam.
Points-based filters refine that approach, assigning (or removing)
points for each criterion matched by a given message; they decide if
a message is spam or not by how many points that message accumulates.
Statistical (or Bayesian) filters, which were most popularly described
in relation to spam in August of 2002 (and refined last month) by Paul
Graham, use a statistical approach that combines the probabilities that
any given word or phrase (implementations vary) indicates spam to decide
if the message is spam.
http://www.paulgraham.com/spam.htm
http://www.paulgraham.com/better.html
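To make the first two approaches concrete, here is a minimal sketch in Python. The function names, the rule list, the point values, and the threshold are all invented for illustration; no particular product is known to work exactly this way.

```python
def boolean_filter(message, banned_strings):
    """Return True (spam) if any banned string appears in the message."""
    text = message.lower()
    return any(s.lower() in text for s in banned_strings)


def points_filter(message, rules, threshold):
    """Add up the points of every matching rule; spam if the total reaches threshold."""
    text = message.lower()
    score = sum(points for needle, points in rules if needle.lower() in text)
    return score >= threshold, score


# Hypothetical rules: positive points suggest spam, negative points suggest real mail.
BANNED = ["free money"]
RULES = [("free money", 5), ("click here", 3), ("unsubscribe", 1), ("tidbits", -4)]

msg = "Click here for FREE MONEY! Reply to unsubscribe."
print(boolean_filter(msg, BANNED))            # True
print(points_filter(msg, RULES, threshold=5))  # (True, 9)
```

Note how the points-based version lets a "good" criterion (here, a mention of TidBITS) pull a message back under the threshold, which a Boolean filter cannot do.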
Bayesian Filters
The beauty of Bayesian filtering is that it works on the contents of
your email, which is probably rather different from mine and anyone
else's. That's because you must train a Bayesian filter with a sample
of both spam and legitimate messages, and because the Bayesian filter
continually examines new messages, it can adapt to the kind of mail
you receive, both good and bad.
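The training-then-scoring idea can be sketched in a few lines of Python. This is a toy, loosely modeled on the combining formula in Paul Graham's essay, and it is not the algorithm used by SpamSieve or Apple's Mail. The class name, the tokenizer, and the constants (0.4 for unknown words, the 0.01 to 0.99 clamp, the 15 most telling tokens) are assumptions made for the example.

```python
import re
from collections import Counter
from math import prod


def tokens(text):
    """Return the set of distinct lowercase word tokens in a message."""
    return set(re.findall(r"[a-z0-9$'-]+", text.lower()))


class TinyBayesFilter:
    """Toy statistical filter; not SpamSieve's or Mail's actual algorithm."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}  # keyed by is_spam
        self.totals = {True: 0, False: 0}                  # messages trained per class

    def train(self, text, is_spam):
        """Record which tokens appeared in one known spam or good message."""
        self.counts[is_spam].update(tokens(text))
        self.totals[is_spam] += 1

    def token_probability(self, token):
        """Estimate the chance that a message containing this token is spam."""
        spam_rate = self.counts[True][token] / max(self.totals[True], 1)
        good_rate = self.counts[False][token] / max(self.totals[False], 1)
        if spam_rate + good_rate == 0:
            return 0.4  # assumed value for tokens never seen in training
        return min(0.99, max(0.01, spam_rate / (spam_rate + good_rate)))

    def spam_probability(self, text, top_n=15):
        """Combine the most telling per-token probabilities into one score."""
        probs = sorted((self.token_probability(t) for t in tokens(text)),
                       key=lambda p: abs(p - 0.5), reverse=True)[:top_n]
        if not probs:
            return 0.5  # assumed neutral score for an empty message
        spam = prod(probs)
        good = prod(1 - p for p in probs)
        return spam / (spam + good)


f = TinyBayesFilter()
f.train("cheap pills buy now", True)
f.train("buy cheap watches now", True)
f.train("meeting notes for the translation project", False)
f.train("lunch meeting tomorrow", False)
print(f.spam_probability("buy cheap pills") > 0.5)  # True
print(f.spam_probability("lunch meeting") > 0.5)    # False
```

Because the counts come entirely from the messages you feed to train(), two people training the same code on different mail end up with different filters.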
Bayesian filters aren't perfect. Legitimate mail, such as promotional
mailings from companies you've bought from in the past, can look a lot
like spam at first, and it's also hard to identify spam messages with
minimal text accurately. Spam may get through when it's sufficiently
related to your profession; for instance, I get spam advertising translation
services because of the TidBITS translations. It's also possible for
spammers to pollute your corpus of good and bad words by including lots
of good words in a spam message, thus reducing the accuracy of the filter
over time. On the positive side, it's possible that improved algorithms
can address these problems.
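The pollution problem is easy to see with a toy calculation. The sketch below uses a deliberately crude per-word estimate (the function name and the sample messages are made up) to show how a word that once appeared only in good mail drifts toward "spammy" after a few padded spam messages are added to the training set.

```python
def word_spam_probability(word, spam_msgs, good_msgs):
    """Crude estimate that a message containing `word` is spam."""
    spam_rate = sum(word in m.lower().split() for m in spam_msgs) / len(spam_msgs)
    good_rate = sum(word in m.lower().split() for m in good_msgs) / len(good_msgs)
    if spam_rate + good_rate == 0:
        return 0.5  # assumed neutral value for a word never seen
    return spam_rate / (spam_rate + good_rate)


good = ["project meeting agenda", "meeting moved to friday", "notes from the meeting"]
spam = ["cheap pills now", "win money now"]

before = word_spam_probability("meeting", spam, good)

# Spam padded with innocent-looking words gets learned as spam.
spam += ["cheap pills meeting agenda notes",
         "win money meeting friday project",
         "pills meeting project notes"]

after = word_spam_probability("meeting", spam, good)
print(before, round(after, 3))  # 0.0 0.375
```

In this example "meeting" goes from a perfect indicator of good mail to a much weaker one, so it does less to rescue legitimate messages.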
There are two main implementations of statistical Bayesian filtering
for Mac OS X: Apple's Mail and Michael Tsai's SpamSieve, the latter
of which I've been testing with Eudora 5.2 for some months now.
SpamSieve
Along with its implementation of Bayesian filtering, I especially appreciate
the fact that SpamSieve works inside Eudora, and also inside a number
of other email programs, including Entourage, Mailsmith, and PowerMail.
Although it's not available for Mac OS 9, it does also work with Emailer
running in Classic mode. I'm not interested in using Mail, and other
spam utilities (such as Matterform Media's points-based Spamfire utility,
which also has many proponents) work outside of your email program,
forcing you to scan for false positives in a separate interface. SpamSieve
works with any number of accounts and filters mail from any source your
email program supports. Once it has identified messages as spam, it
can mark or move them, and in some of the email programs, your filters
can continue to work on the marked messages.
http://www.matterform.com/
SpamSieve accomplishes this by using the AppleScript capabilities of
these email programs to pass information to and from SpamSieve itself.
The integration is relatively seamless, except in Eudora, the current
version of which has limitations that restrict SpamSieve to filtering
mail that ends up in the In box (not in any other folder). Since the
communication happens via AppleScript, you can edit the included scripts
to customize them further. Even while I'm waiting for the next version
of Eudora to bring SpamSieve's capabilities to messages I filter out
of my In box, I've found it extremely worthwhile.
I initially trained SpamSieve with about 600 spam messages from my disgustingly
large collection of spam and 600 good messages from my In box (yes,
it has been that full, though I've beaten it back down into the 300s).
If you don't have spam around, you could either train SpamSieve as you
receive it (probably with lower accuracy at first) or wait briefly until
you've collected a representative sample. I've also told SpamSieve to
learn from new messages. Since the middle of January, SpamSieve has
filtered over 2,600 messages, about 55 percent of which were spam. In
that time, it has reported 88 percent accuracy, with a false negative
rate of 11 percent and a false positive rate of 1 percent. (An alternative
way I've used to verify SpamSieve's accuracy came up with lower numbers:
80 percent accuracy, with 19 percent false negatives. I'm working
with Michael Tsai to figure out the discrepancy.) Most of the false
positives were solicited commercial email or messages forwarded to me
and a large number of other people, both of which are likely to run
afoul of SpamSieve's filtering until it has been trained to recognize
similar messages. Because SpamSieve filters on the contents of your
particular email stream, your mileage may vary, as it has for other
TidBITS staff members, who have seen somewhat less reliable results.
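For a sense of scale, the reported percentages can be turned into rough message counts. The arithmetic below assumes exactly 2,600 messages (the actual total was somewhat higher) and assumes the three percentages are shares of all filtered messages, which seems likely since they add up to 100.

```python
total = 2600                           # assumed round figure; the article says "over 2,600"
spam = total * 55 // 100               # about 55 percent were spam
good = total - spam

false_negatives = total * 11 // 100    # spam that slipped through
false_positives = total * 1 // 100     # good mail marked as spam
correct = total - false_negatives - false_positives

missed_share = false_negatives / spam  # share of all spam that got through
flagged_share = false_positives / good # share of good mail wrongly flagged

print(spam, good)                        # 1430 1170
print(correct, false_negatives, false_positives)  # 2288 286 26
print(f"{missed_share:.0%} of spam missed, {flagged_share:.1%} of good mail flagged")
```

Under those assumptions, roughly one spam message in five still reached the In box, while only about 2 percent of legitimate mail was misfiled.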
New features in SpamSieve 1.3 include increased resilience to the ways
spammers are now obfuscating common words, the capability to use email
addresses in Apple's Address Book as a whitelist (so mail from people
whose addresses are stored in the Address Book is never considered spam),
editing of SpamSieve's corpus of words, type-to-select in the Corpus
window, and the capability to see statistics from after any given date.
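The whitelist idea amounts to checking the sender before the statistical filter ever runs. Here is a minimal sketch of that ordering; the function name, the sample addresses, and the stand-in classifier are hypothetical, and it does not read Apple's Address Book or reflect how SpamSieve is actually implemented.

```python
from email.utils import parseaddr


def is_spam(from_header, body, whitelist, classifier):
    """Skip the classifier entirely when the sender is on the whitelist."""
    _, address = parseaddr(from_header)
    if address.lower() in whitelist:
        return False
    return classifier(body)


# Hypothetical address list standing in for an address book.
whitelist = {"friend@example.com"}


def suspicious(body):
    """Stand-in classifier that flags one obvious phrase."""
    return "free money" in body.lower()


print(is_spam("A Friend <friend@example.com>", "Free money? Ha.", whitelist, suspicious))  # False
print(is_spam("Stranger <x@example.net>", "FREE MONEY inside", whitelist, suspicious))     # True
```

The trade-off is that a whitelist trusts the From line, so spam forged to appear from a listed address would pass unchecked.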
If you've longed for the Bayesian filtering in Apple's Mail, but weren't
willing to give up your preferred email program for that one capability,
I'd strongly encourage you to take a look at SpamSieve. Michael Tsai
is developing it actively, and has been extremely responsive to comments
and suggestions.
SpamSieve 1.3 is $20 shareware (upgrades from previous versions are
free) and is a 1.5 MB download.
http://www.c-command.com/spamsieve/
Reprinted with permission from TidBITS. TidBITS has offered more than ten years
of thoughtful commentary on Macintosh and Internet topics. For free
email subscriptions and access to the entire TidBITS archive, visit
www.tidbits.com.