Spam sorting (was Re: Install assistance needed in Berkeley area)

Tim Freeman tim@fungible.com
Fri, 2 Jan 2004 13:07:56 -0700


From: Nick Moffitt <nick@zork.net>
>	You ought to consider using something else, like a bayesian
>filter.  I get 25 megabytes of spam a day, and I only ever have to
>read one or two a month.

I use bogofilter, and I would not get results that good with
bogofilter alone.  I'd like to figure out what we're doing differently.
Here are some alternatives I can see:

1. Is there a Bayesian filter noticeably better than bogofilter?  I
   ran some head-to-head tests a while ago before selecting
   bogofilter.  I hear spambayes (sp?) is good, but I don't see a
   Debian package for it.
2. Do you let it automatically train on the spams and non-spams as it
   sorts them, or do you only do manual training?  I only do manual
   training because I don't want bogofilter to develop
   self-reinforcing bad habits of dumping my good emails in the spam pile.
3. Do you include all the emails you can in the training set?  My
   training set has equal numbers of spams and good emails, and if
   there aren't enough spams I train it with fewer good emails.
4. How big is your training set?  Mine presently has 1138 spams and
   1138 good emails.
5. Do you manually massage the training emails to eliminate
   uninteresting training text?  I don't.  The crm114 author did,
   according to one version of the crm114 documentation.
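
The balancing rule in item 3 can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the function
name is hypothetical, and choosing the messages by random sampling is
an assumption, since the text above only says that the two classes are
kept the same size.

```python
import random


def balanced_training_set(spams, hams, seed=0):
    """Return equal-sized samples of spam and good mail.

    The size is set by the smaller class, so a shortage of spam
    means fewer good emails are used as well.
    Random sampling is an assumption made for this sketch.
    """
    n = min(len(spams), len(hams))
    rng = random.Random(seed)  # fixed seed keeps the selection repeatable
    return rng.sample(spams, n), rng.sample(hams, n)


if __name__ == "__main__":
    spams = ["spam-%d" % i for i in range(3)]
    hams = ["ham-%d" % i for i in range(10)]
    chosen_spam, chosen_ham = balanced_training_set(spams, hams)
    print(len(chosen_spam), len(chosen_ham))
```

Each selected message would then be registered by hand, for example
with bogofilter -s for spam and bogofilter -n for good mail.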

Before a message gets to bogofilter, it has to get through
cbl.abuseat.org and then through greylisting (search Google for
"relaydelay").  Greylisting is a big win.  It reduces the spam rate by
80%, so I can easily read the subject lines of all of the spams, and it
delays much of the spam long enough for cbl.abuseat.org to have a
chance to blacklist it.
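
For readers who have not met greylisting, here is a minimal Python
sketch of the general idea.  It is not how relaydelay is implemented.
The class name, the 300-second delay, and the in-memory dictionary are
all assumptions made for illustration.  The first delivery attempt
from an unseen (client IP, sender, recipient) triplet gets a temporary
failure, and a retry after the delay is accepted.

```python
import time


class Greylist:
    """Toy greylist keyed on (client IP, envelope sender, recipient)."""

    def __init__(self, min_delay=300, clock=time.time):
        self.min_delay = min_delay  # seconds a new triplet must wait (assumed value)
        self.clock = clock          # injectable so the sketch can be tested
        self.first_seen = {}        # triplet -> time of the first attempt

    def check(self, client_ip, sender, recipient):
        key = (client_ip, sender.lower(), recipient.lower())
        now = self.clock()
        # Record the first attempt; later attempts reuse that timestamp.
        first = self.first_seen.setdefault(key, now)
        if now - first < self.min_delay:
            return "tempfail"  # the server would answer with a 4xx reply
        return "accept"        # the sender retried after the delay


if __name__ == "__main__":
    fake_now = [0]
    g = Greylist(min_delay=300, clock=lambda: fake_now[0])
    print(g.check("192.0.2.1", "a@example.org", "b@example.com"))
    fake_now[0] = 600
    print(g.check("192.0.2.1", "a@example.org", "b@example.com"))
```

A real mail server retries after a temporary failure, while much spam
software does not, and the delay also gives a blacklist time to list
the sending host before the retry arrives.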

I'm not doing SPF filtering yet.  I did publish SPF records for
fungible.com, though.

-- 
Tim Freeman                                                  tim@fungible.com
I xeroxed a mirror. Now I have an extra xerox machine. -- Steven Wright