Can fuzzy logic beat real logic?

Whatever happened to the idea of Bayesian spam filtering?

I mean, any idiot these days can tell a SPAM email from a real email, so how on earth do these stupid things keep getting through?

I am not talking about covert advertising cleverly disguised as real text; that may take time to catch. But the vast majority of SPAM is blatant.

I don't like insulting my friends by requiring them to fill in a stupid blank in order to send me an email, nor do I wish to install some sucky-ass software which generates its own SPAM by sending out ads to anyone who gets email from me.

I wonder why the Bayesian software hasn't been developed more fully. Might it just put an end to SPAM as we know it?

It might, but on the other hand it might not! This post (which begins with an "intelligent" SPAM email) demonstrates the problems inherent in the Bayesian approach:

[M]arking such emails as spam will increase the probability of false positives in the future. If you receive a lot of these mails, certain rare words will be associated very highly with spam by your filter. Then, when you get an innocent-seeming email from a friend that happens to contain the words “schizophrenic pompous playwright”, that will be enough to get it black-holed.

Amusingly enough, this email would likely be shot down by a filter because of the method it uses to hide the words in the HTML file. It’s only a first attempt, though. Spammers will get better at it. So long as the spammer’s dictionary is big enough, and they regularly rotate their words, this could be an effective technique to weaken Bayesian filtering on two fronts: increasing both its false-negatives and false-positives.

One problem (one that Graham admits to) is that Bayesian classification assumes that the elements of the object being classified (in this case, the words of the email) are independent of each other. This is a troublesome assumption to make about language, in that the filter can’t tell that the list of words does not match the patterns we would normally recognise as the language of a legitimate email.

On the other hand, if we had a more sophisticated linguistic filter, it wouldn’t be hard for spammers to come up with a program that generated random, but grammatically correct sentences.

It’s a war of escalation.

Well, yes, maybe it is. (Another reason I have repeatedly recommended crucifixion....) As soon as I get home I am going to get off my butt and configure Netscape's apparently Bayesian spam filter. Had I not written this post, I don't think I would have even known I had it! Netscape Mail has saved my ass over the years from the numerous viruses that are written to target IE; I just wish the Netscape browser could be made to work better.

What just sticks in my craw is how easy it is for humans to spot SPAM, yet how difficult it is for computers. Fuzzy logic somehow has to supply the answer.
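
Part of the computer's difficulty is visible in what a Bayesian filter actually computes. The sketch below is a minimal, purely illustrative Graham-style scorer in Python: the token counts, the 0.4 default for never-seen words, and the function names are all invented here, not any particular filter's code. All it demonstrates is the "independence" assumption from the quoted passage: each word's spam probability gets combined as though the words had nothing to do with one another.

    # Hypothetical Graham-style Bayesian scoring: a sketch, not any real
    # filter's code. All counts and constants are made up for illustration.
    import math
    import re

    def tokenize(text):
        # Split a message into lowercase word tokens.
        return re.findall(r"[a-z']+", text.lower())

    def word_spam_probability(word, spam_counts, ham_counts, n_spam, n_ham):
        # Estimate P(spam | word) from training counts, clamped away from 0 and 1.
        s = spam_counts.get(word, 0) / max(n_spam, 1)
        h = ham_counts.get(word, 0) / max(n_ham, 1)
        if s + h == 0:
            return 0.4  # never-seen words lean slightly toward "not spam"
        return min(0.99, max(0.01, s / (s + h)))

    def score_message(text, spam_counts, ham_counts, n_spam, n_ham):
        # Combine per-word probabilities as if the words were independent.
        log_odds = 0.0
        for word in set(tokenize(text)):
            p = word_spam_probability(word, spam_counts, ham_counts, n_spam, n_ham)
            log_odds += math.log(p) - math.log(1.0 - p)
        return 1.0 / (1.0 + math.exp(-log_odds))  # overall spam probability

    # Toy counts: "playwright" has turned up mostly in spam, so it now pushes
    # an innocent message's score toward spam, the false-positive worry
    # raised in the quoted passage.
    spam_counts = {"viagra": 40, "free": 30, "playwright": 5}
    ham_counts = {"meeting": 25, "free": 5, "playwright": 1}
    print(score_message("free viagra now", spam_counts, ham_counts, 50, 50))
    print(score_message("the playwright meeting", spam_counts, ham_counts, 50, 50))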

Meanwhile, the spammers have nothing but time and fanatic devotion to their silly games of annoyance.

Along similar lines, I often wonder whether anti-virus companies hire virus writers. (I can't think of a more intriguing conflict of interest than the sort of moonlighting which might be officially unapproved, yet guarantee promotion -- for if there are no new viruses you can't sell software! Money plus conflicts of interest means mutually escalated non-destruction -- if that's not too fuzzy.)

posted by Eric on 04.14.04 at 05:45 PM





Comments

For what it's worth, Bayesian filtering seems to be thriving at the moment; there are a number of programs under active development, with good results now and a lot of interesting research to improve matters. I prefer the SpamAssassin approach of using Bayesian filtering as one component in a larger system. It's much harder to get past all of those different tests, particularly since the obfuscation tactics that fool one check are often red flags on another. This approach also allows things like automatically training the Bayesian classifier on words which appeared in blacklisted spam, which is essential when the same thing starts arriving from a non-blacklisted source.

The bottom line is that with 200+ inbound spams per day filtered through SpamAssassin and a trained Bayesian classifier, I will actually see only one or two messages get through in a bad week.

Chris Adams   ·  April 15, 2004 04:02 AM
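
For readers curious what "one component in a larger system" means in practice, the sketch below shows the general idea in Python. It is emphatically not SpamAssassin's actual rule set or scoring; the rule names, weights, and threshold are invented for illustration. Several cheap checks each add points, the Bayesian verdict adds or subtracts points, and only the combined total decides, so an obfuscation trick that fools one test rarely fools them all.

    # A rough sketch of layered scoring in the SpamAssassin spirit; the rules,
    # weights, and threshold here are invented for illustration only.

    def heuristic_score(body, headers):
        # Each independent check adds points when it fires.
        score = 0.0
        lowered = body.lower()
        if "<font" in lowered and "size=1" in lowered:
            score += 1.5  # tiny hidden HTML text
        if headers.get("Subject", "").isupper():
            score += 1.0  # all-caps subject line
        if "unsubscribe" in lowered and "http://" in lowered:
            score += 0.5  # boilerplate unsubscribe link
        return score

    def classify(body, headers, bayes_probability, threshold=5.0):
        # Fold the Bayesian classifier's verdict in as just one more test.
        score = heuristic_score(body, headers)
        if bayes_probability > 0.9:
            score += 3.5
        elif bayes_probability < 0.1:
            score -= 1.0
        return "spam" if score >= threshold else "ham"

    # Example: hidden tiny text plus a shouting subject plus a confident
    # Bayesian verdict crosses the threshold together.
    print(classify("<font size=1>buy now</font>", {"Subject": "FREE OFFER"}, 0.95))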

I second that. SpamAssassin is excellent. Out of the 300-400 spams I get a week, only one or two will slip through per month. Of course, I also have a number of other anti-spam measures in place. I use mailinator.com for throw-away addresses, configure procmail to automatically pass certain messages before they even hit SpamAssassin, have one long-standing email address that I use entirely as a spam trap, have my own domain, and run an exim (mail server) configuration that is very aggressively anti-spam. Still, the vast bulk of the work is done by SpamAssassin and it does it very well.

mallarme   ·  April 15, 2004 01:16 PM

Popfile works for me. I maybe get 1 or 2 a month seeping through while 150 a day get consigned to e-mail hell.

Bill Peschel   ·  April 16, 2004 12:57 AM

SpamSieve is 99.99% effective on a Macintosh running OS X. I'm averaging just under 250 pieces per day. I'd rate it 100% effective, but no one believes in perfection these days. ;)

BTW, I find some of your comments where you beat up on Glenn Reynolds humorous, others mundane. Most of them fall short of providing useful information, but that's why different views can be useful. I hope to return and find something in that category. I suspect he's not 100% accurate either. :)

Steve   ·  April 16, 2004 06:44 AM

