Spam sorting (was Re: Install assistance needed in Berkeley area)
Tim Freeman
tim@fungible.com
Fri, 2 Jan 2004 13:07:56 -0700
From: Nick Moffitt <nick@zork.net>
> You ought to consider using something else, like a bayesian
>filter. I get 25 megabytes of spam a day, and I only ever have to
>read one or two a month.
I use bogofilter, and I would not get results that good with
bogofilter alone. I'd like to figure out what we're doing differently.
Here are some possibilities I can see:
1. Is there a Bayesian filter noticeably better than bogofilter? I
ran some head-to-head tests a while ago before selecting
bogofilter. I hear spambayes (sp?) is good, but I don't see a
Debian package for it.
2. Do you let it automatically train on the spams and non-spams as it
sorts them, or do you only do manual training? I only do manual
training because I don't want bogofilter to develop
self-reinforcing bad habits of dumping my good emails in the spam pile.
3. Do you include all the emails you can in the training set? My
training set has equal numbers of spams and good emails, and if
there aren't enough spams I train it with fewer good emails.
4. How big is your training set? Mine presently has 1138 spams and
1138 good emails.
5. Do you manually massage the training emails to eliminate
uninteresting training text? I don't. The crm114 author did,
according to one version of the crm114 documentation.
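
The balancing described in item 3 could be sketched roughly as
follows. This is only an illustration, not the script actually used:
the function name, the random down-sampling, and the fixed seed are all
assumptions, and the messages are stand-in strings rather than real
mail files.

```python
import random


def balanced_training_set(spams, hams, seed=0):
    """Return (spam_sample, ham_sample) with equal lengths.

    Whichever class is larger is randomly down-sampled to the size of
    the smaller one, so neither class dominates the training data.
    The seed is a hypothetical default, used only for repeatability.
    """
    n = min(len(spams), len(hams))
    rng = random.Random(seed)
    return rng.sample(spams, n), rng.sample(hams, n)


# Stand-in message identifiers; real use would pass message paths.
spams = [f"spam-{i}" for i in range(1138)]
hams = [f"good-{i}" for i in range(5000)]
spam_sample, ham_sample = balanced_training_set(spams, hams)
```

With 1138 spams and more good emails than that available, both samples
end up with 1138 messages, matching the sizes quoted in item 4.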
Before a message gets to bogofilter, it has to get through
cbl.abuseat.org and then through greylisting (search Google for
"relaydelay"). Greylisting is a big win. It reduces the spam rate by
80%, so I can easily read the subject lines of all of the spams, and it
delays much of the spam long enough that cbl.abuseat.org has a chance
to blacklist it.
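
For readers who haven't met greylisting, here is a minimal sketch of
the general idea, assuming the usual scheme of temporarily rejecting
the first delivery attempt from an unseen (client IP, sender,
recipient) triplet and accepting a retry once a delay has passed. The
class name, the 300-second delay, and the in-memory dictionary are
assumptions made for illustration; this is not relaydelay's actual code
or configuration.

```python
import time


class Greylist:
    """Toy greylist keyed on (client IP, envelope sender, recipient)."""

    def __init__(self, delay_seconds=300, clock=time.time):
        self.delay_seconds = delay_seconds  # assumed value, not from the post
        self.clock = clock                  # injectable so it can be tested
        self.first_seen = {}                # triplet -> time of first attempt

    def check(self, client_ip, sender, recipient):
        """Return "accept" or "tempfail" for one delivery attempt."""
        key = (client_ip, sender.lower(), recipient.lower())
        now = self.clock()
        # Record the first attempt; later attempts reuse that timestamp.
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.delay_seconds:
            return "accept"
        return "tempfail"


# Example with a fake clock and documentation-only addresses.
fake_now = [0.0]
greylist = Greylist(delay_seconds=300, clock=lambda: fake_now[0])
first_try = greylist.check("192.0.2.1", "a@example.com", "b@example.org")
fake_now[0] = 300.0
retry = greylist.check("192.0.2.1", "a@example.com", "b@example.org")
```

A legitimate mail server retries after the temporary failure and gets
through; much spam software never retries, and the spam that does retry
has been held back long enough for a blacklist to catch up.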
I'm not doing SPF filtering yet. I did publish SPF records for
fungible.com, though.
--
Tim Freeman tim@fungible.com
I xeroxed a mirror. Now I have an extra xerox machine. -- Steven Wright