Spam sorting (was Re: Install assistance needed in Berkeley area)

Mon, 5 Jan 2004 17:52:25 -0800

Hi, all.

According to Tim Freeman,
> From: Nick Moffitt <nick@zork.net>
> >	You ought to consider using something else, like a bayesian
> >filter.  I get 25 megabytes of spam a day, and I only ever have to
> >read one or two a month.
> 
> I use bogofilter and I would not get results that good with just
> bogofilter.  I'd like to figure out what we're doing different.  Here
> are some alternatives I can see:
> 
> 1. Is there a bayesian filter noticeably better than bogofilter?  I
>    ran some head-to-head tests a while ago before selecting
>    bogofilter.  I hear spambayes (sp?) is good, but I don't see a
>    Debian package for it.

I've used only spamassassin for a long time, with good
results.  Newer versions include Bayesian filtering, and
training is done like this:

  sa-learn --spam [--mbox] file-of-spam
  sa-learn --ham [--mbox] file-of-nonspam

Unfortunately Bayesian filtering has become a bit less
effective lately (more spam gets through) as spammers are
using random misspellings and garbage words to evade it.
Hopefully someone will come up with a fix for that (perhaps
grouping rarely-seen words together rather than ranking them
separately).
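
To make that grouping idea concrete, here is a minimal
Python sketch.  It is only my guess at what such a fix could
look like, not how spamassassin or bogofilter actually work.
The names (RARE, build_vocab, spam_log_odds), the min_count
cutoff and the add-one smoothing are all assumptions made up
for the illustration.

```python
import math
from collections import Counter

RARE = "<rare>"  # hypothetical shared bucket for rarely-seen words


def words(text):
    return set(text.lower().split())


def build_vocab(messages, min_count=2):
    """Words that appear in at least min_count training messages."""
    seen = Counter()
    for text in messages:
        seen.update(words(text))
    return {w for w, n in seen.items() if n >= min_count}


def tokens(text, vocab):
    """Map every out-of-vocabulary word to the single RARE token."""
    return {w if w in vocab else RARE for w in words(text)}


def train(messages, vocab):
    """Per-token count of messages containing that token."""
    counts = Counter()
    for text in messages:
        counts.update(tokens(text, vocab))
    return counts


def spam_log_odds(text, vocab, spam, ham, n_spam, n_ham):
    """Sum of per-token log odds (add-one smoothing); > 0 leans spam."""
    score = 0.0
    for tok in tokens(text, vocab):
        p_spam = (spam[tok] + 1) / (n_spam + 2)
        p_ham = (ham[tok] + 1) / (n_ham + 2)
        score += math.log(p_spam / p_ham)
    return score


# Made-up training data: each spam carries one random garbage word.
spam_msgs = ["cheap pills xqzv now", "cheap pills wkrp offer",
             "pills offer jjqx now"]
ham_msgs = ["lunch meeting tomorrow", "meeting notes attached",
            "lunch tomorrow then"]
vocab = build_vocab(spam_msgs + ham_msgs)
spam = train(spam_msgs, vocab)
ham = train(ham_msgs, vocab)
```

A garbage word never seen before (say "zzzz") lands in the
same bucket as xqzv, wkrp and jjqx, so it is scored with
their pooled statistics instead of being treated as a brand
new word with no history.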

One thing I like about spamassassin is that it uses multiple
tests, so when I upgrade the package I get new and effective
filtering schemes, such as bayesian filtering, without
having to go search for them.  Heck, when I first read about 
bayesian filtering, it turned out that I already had it!
(sa-learn was already on my system; I just had to train it.)

> 2. Do you let it automatically train on the spams and non-spams as it
>    sorts them, or do you only do manual training?  I only do manual
>    training because I don't want bogofilter to develop
>    self-reinforcing bad habits of dumping my good emails in the spam pile.

I only use manual training.  Supposedly it's best to use
approximately equal volumes of spam and non-spam in the
training.  spamassassin's FAQ (I think) recommends training
with 1000+ messages each (spam and non).  Also, it's easy to
reverse a mistake, as using sa-learn --ham reverses the
effect of sa-learn --spam and vice versa.
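
Why relearning can undo a mistake is easier to see in a toy
model.  The Python sketch below is an assumption-laden
illustration, not spamassassin's real database: ToyTrainer,
its fields and the message ids are all invented here.  It
remembers which label each message was last learned as, and
relearning under the other label first subtracts the old
counts.

```python
from collections import Counter


class ToyTrainer:
    """Toy token-count store; not SpamAssassin's real database format."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.learned = {}  # message id -> label it was last learned as

    def learn(self, msg_id, text, label):
        previous = self.learned.get(msg_id)
        if previous == label:
            return  # already learned this way; nothing to do
        toks = set(text.lower().split())
        if previous is not None:
            # Undo the earlier, mistaken training before relearning.
            self.counts[previous].subtract(toks)
            self.counts[previous] += Counter()  # drops entries <= 0
        self.counts[label].update(toks)
        self.learned[msg_id] = label


t = ToyTrainer()
t.learn("m1", "Meeting notes attached", "spam")  # oops, wrong pile
t.learn("m1", "Meeting notes attached", "ham")   # correction
```

After the second call the spam counts are empty again and
the three words are counted once each as ham.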

Well, actually, if you have old email addresses that get
only spam, self-training on that is probably a good idea,
though I understand that a large imbalance between spam and
nonspam can hurt the filter's effectiveness.  I'll probably
be setting up self-training on some spamtrap addresses
myself soon.

> 3. Do you include all the emails you can in the training set?  My
>    training set has equal numbers of spams and good emails, and if
>    there aren't enough spams I train it with fewer good emails.

I train primarily on spam that is missed by spamassassin's
other rules vs mail that is very important to me.

When I get a false-positive, I also examine the spamassassin
results and sometimes adjust the weights of the rules a bit
(I cranked up the realtime-blacklists a bit this week).
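
For anyone unfamiliar with the weights: each rule that
matches adds its score to a total, and the message is tagged
as spam once the total reaches a threshold.  A rough Python
sketch of that arithmetic follows.  The rule names, the
scores and the 5.0 threshold are made up for illustration
and are not taken from my configuration.

```python
def classify(hits, scores, required=5.0):
    """Add up the scores of the rules that hit; compare to the threshold."""
    total = sum(scores.get(rule, 0.0) for rule in hits)
    return total, total >= required


# Hypothetical rule names and weights.
scores = {"BLACKLIST_HIT": 1.5, "BAYES_HIGH": 3.0, "HTML_ONLY": 0.5}
hits = ["BLACKLIST_HIT", "BAYES_HIGH"]

before = classify(hits, scores)   # (4.5, False): slips through

scores["BLACKLIST_HIT"] = 2.5     # crank up the blacklist weight
after = classify(hits, scores)    # (5.5, True): now tagged as spam
```

Lowering the weight of a rule that caused a false-positive
works the same way in the other direction.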

> 4. How big is your training set?  Mine presently has 1138 spams and
>    1138 good emails.

(see above)

> 5. Do you manually massage the training emails to eliminate
>    uninteresting training text?  I don't.  The crm114 author did,
>    according to one version of the crm114 documentation.

?

> Before a message gets to bogofilter, it has to get through
> cbl.abuseat.org and then through greylisting (search Google for
> "relaydelay").  Greylisting is a big win.  It reduces the spam rate
> 80% so I can easily read the subject lines of all of the spams, and it
> delays much of the spam enough so cbl.abuseat.org has a chance to
> blacklist it.
> 
> I'm not doing SPF filtering yet.  I did publish SPF records for
> fungible.com, though.
> 
> -- 
> Tim Freeman                                                  tim@fungible.com
> I xeroxed a mirror. Now I have an extra xerox machine. -- Steven Wright

-- 

-- Tony