Spam sorting (was Re: Install assistance needed in Berkeley area)

Sean 'Shaleh' Perry shaleh@speakeasy.net
Fri, 2 Jan 2004 16:33:55 -0800


On Friday 02 January 2004 12:07, Tim Freeman wrote:
> From: Nick Moffitt <nick@zork.net>
>
> >	You ought to consider using something else, like a bayesian
> >filter.  I get 25 megabytes of spam a day, and I only ever have to
> >read one or two a month.
>
> I use bogofilter and I would not get results that good with just
> bogofilter.  I'd like to figure out what we're doing different.  Here
> are some alternatives I can see:
>
> 1. Is there a bayesian filter noticeably better than bogofilter?  I
>    ran some head-to-head tests a while ago before selecting
>    bogofilter.  I hear spambayes (sp?) is good, but I don't see a
>    Debian package for it.

I use spambayes, and it is a good filter.  However, it has some serious drawbacks:

1) It is single-user.  I bet this is the biggest reason there is no package 
for it.

2) Training gets slow very quickly.  I am not sure what the culprit is, but I 
doubt that being written in Python has anything to do with it.  It probably 
has to do with how they store the information.


> 2. Do you let it automatically train on the spams and non-spams as it
>    sorts them, or do you only do manual training?  I only do manual
>    training because I don't want bogofilter to develop
>    self-reinforcing bad habits of dumping my good emails in the spam pile.

I trained it on my ham and spam to start with.  Since then I train it now and 
then on anything it gets wrong.  spambayes also supports an 'unsure' bin.  I 
never have to pull ham out of the spam bin, but occasional items end up in 
unsure, mostly newsletters and the like.  In particular, I cannot convince it 
that the EFF's newsletter is ham.
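
For anyone who wants the routine spelled out, here is a rough sketch of 
three-bin sorting with training only on failures.  The cutoff values and the 
names file_message and needs_training are made up for illustration; this is 
not spambayes' actual code or API.

```python
# Hypothetical sketch of three-bin sorting plus train-on-error.
# The cutoffs and function names are illustrative assumptions,
# not taken from spambayes.

HAM_CUTOFF = 0.20   # assumed: scores below this are filed as ham
SPAM_CUTOFF = 0.90  # assumed: scores at or above this are filed as spam


def file_message(score, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Map a spam probability in [0, 1] to 'ham', 'unsure', or 'spam'."""
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"


def needs_training(score, true_label):
    """Train only on failures: anything not filed in its true bin."""
    return file_message(score) != true_label


if __name__ == "__main__":
    # Pairs of (score, what the message really was); the numbers are made up.
    samples = [(0.01, "ham"), (0.55, "ham"), (0.99, "spam"), (0.95, "ham")]
    for score, truth in samples:
        print(score, file_message(score), needs_training(score, truth))
```

Under this sketch a newsletter that keeps scoring in the middle lands in 
unsure every time, so it keeps getting flagged for retraining.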

> 3. Do you include all the emails you can in the training set?  My
>    training set has equal numbers of spams and good emails, and if
>    there aren't enough spams I train it with fewer good emails.

The spambayes people recommend training on twice as much ham as spam.
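
If you want to trim a corpus to that ratio before training, something like 
the following would do it.  This is only a sketch: pick_training_set and its 
parameters are hypothetical names of mine, and the messages can be any list 
of objects (file names, for instance).

```python
# Hypothetical helper: trim a corpus to a target ham:spam ratio.
# Nothing here is part of spambayes or bogofilter.
import random


def pick_training_set(ham, spam, ham_per_spam=2, seed=0):
    """Return (ham_subset, spam_subset) with ham_per_spam hams per spam.

    Uses as many spams as the available ham can support at that ratio.
    """
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    n_spam = min(len(spam), len(ham) // ham_per_spam)
    n_ham = n_spam * ham_per_spam
    return rng.sample(ham, n_ham), rng.sample(spam, n_spam)


if __name__ == "__main__":
    ham = ["ham-%d" % i for i in range(100)]
    spam = ["spam-%d" % i for i in range(10)]
    ham_set, spam_set = pick_training_set(ham, spam)
    print(len(ham_set), len(spam_set))
```

Passing ham_per_spam=1 gives the equal-numbers setup described in the quoted 
question.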