Spam sorting (was Re: Install assistance needed in Berkeley area)
Sean 'Shaleh' Perry
shaleh@speakeasy.net
Fri, 2 Jan 2004 16:33:55 -0800
On Friday 02 January 2004 12:07, Tim Freeman wrote:
> From: Nick Moffitt <nick@zork.net>
>
> > You ought to consider using something else, like a bayesian
> >filter. I get 25 megabytes of spam a day, and I only ever have to
> >read one or two a month.
>
> I use bogofilter and I would not get results that good with just
> bogofilter. I'd like to figure out what we're doing different. Here
> are some alternatives I can see:
>
> 1. Is there a bayesian filter noticeably better than bogofilter? I
> ran some head-to-head tests a while ago before selecting
> bogofilter. I hear spambayes (sp?) is good, but I don't see a
> Debian package for it.
I use spambayes; it is a good filter. However, it has some serious drawbacks:
1) It is single-user. I bet this is the biggest reason there is no package
for it.
2) Training gets slow very quickly. I'm not sure what the culprit is here; I
doubt being written in Python has anything to do with it. It probably has to
do with how they store the information.
> 2. Do you let it automatically train on the spams and non-spams as it
> sorts them, or do you only do manual training? I only do manual
> training because I don't want bogofilter to develop
> self-reinforcing bad habits of dumping my good emails in the spam pile.
I trained it initially with my ham and spam. Since then I train it now and
then on anything it gets wrong. spambayes also supports an 'unsure' bin. I
never pull ham out of the spam bin, but occasional items end up in unsure,
mostly newsletters and the like. In particular, I cannot convince it that the
EFF's newsletter is ham.
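
A rough sketch of the three-bin idea with train-on-error, in Python. The cutoff values and the function names are assumptions made up for illustration; this is not spambayes's actual code or API, only the general shape of sorting by score and retraining on mistakes.

```python
# Illustrative cutoffs: assumptions for this sketch, not taken from the post.
HAM_CUTOFF = 0.20
SPAM_CUTOFF = 0.90


def sort_message(score, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Map a spam probability in [0, 1] to 'ham', 'unsure', or 'spam'."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    # Everything between the two cutoffs is left for a human to look at.
    return "unsure"


def needs_training(score, true_label):
    """Train-on-error: retrain only when the bin disagrees with the human label.

    true_label is 'ham' or 'spam', so an 'unsure' result always counts as a miss.
    """
    return sort_message(score) != true_label
```

With these hypothetical cutoffs, a newsletter that keeps scoring around 0.5 lands in unsure every time, so it gets fed back as ham on each pass.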
> 3. Do you include all the emails you can in the training set? My
> training set has equal numbers of spams and good emails, and if
> there aren't enough spams I train it with fewer good emails.
The spambayes people recommend training on twice as much ham as spam.
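
One way to hold a training set to that 2:1 ratio, as a minimal sketch. The helper name pick_training_set and its parameters are hypothetical, and it assumes the messages are already sitting in two Python lists; it is not part of spambayes or bogofilter.

```python
import random


def pick_training_set(ham, spam, ham_per_spam=2, seed=0):
    """Return (ham_sample, spam_sample) holding ham_per_spam hams per spam.

    Uses as many spams as the available ham can support at that ratio.
    """
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    n_spam = min(len(spam), len(ham) // ham_per_spam)
    n_ham = n_spam * ham_per_spam
    return rng.sample(ham, n_ham), rng.sample(spam, n_spam)
```

If there is not enough ham to go around, this trims the spam side rather than breaking the ratio, which is the mirror image of the equal-numbers approach described in the quoted question.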