Spam sorting (was Re: Install assistance needed in Berkeley area)
Tony Godshall
togo@of.net
Mon, 5 Jan 2004 17:52:25 -0800
Hi, all.
According to Tim Freeman,
> From: Nick Moffitt <nick@zork.net>
> > You ought to consider using something else, like a bayesian
> >filter. I get 25 megabytes of spam a day, and I only ever have to
> >read one or two a month.
>
> I use bogofilter and I would not get results that good with just
> bogofilter. I'd like to figure out what we're doing different. Here
> are some alternatives I can see:
>
> 1. Is there a bayesian filter noticeably better than bogofilter? I
> ran some head-to-head tests a while ago before selecting
> bogofilter. I hear spambayes (sp?) is good, but I don't see a
> Debian package for it.
I've used (only) spamassassin for a long time, with good
results. Newer versions include bayesian filtering, and
training is done like this:
    sa-learn --spam [--mbox] file-of-spam
    sa-learn --ham [--mbox] file-of-nonspam
Unfortunately, bayesian filtering has become a bit less
effective lately (more spam gets through) as spammers are
using random misspellings and garbage words to evade it. But
hopefully someone will come up with a fix for that (perhaps
grouping rarely-seen words together rather than ranking them
separately).
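
To make the grouping idea concrete, here is a rough sketch in Python. It is not how spamassassin or bogofilter work; the function names, the "<rare>" placeholder and the min_count threshold are all made up for illustration. The point is just that every token seen fewer than min_count times in training gets pooled into one shared bucket, so a never-before-seen misspelling is scored as "a rare word" rather than as an unknown.

```python
from collections import Counter

RARE = "<rare>"  # hypothetical placeholder that stands in for all rare tokens


def bucket_rare_tokens(tokens, seen_counts, min_count=3):
    """Replace tokens seen fewer than min_count times in training with RARE."""
    return [t if seen_counts[t] >= min_count else RARE for t in tokens]


def train(messages, min_count=3):
    """Count tokens over the training messages, then count again with rare ones pooled."""
    raw = Counter(t for msg in messages for t in msg.lower().split())
    pooled = Counter()
    for msg in messages:
        pooled.update(bucket_rare_tokens(msg.lower().split(), raw, min_count))
    return raw, pooled
```

With a corpus like "buy viagra now", "buy v1agra now", "buy vaigra now" and min_count=2, the three one-off spellings end up in a single "<rare>" count of 3 instead of three separate counts of 1. A real filter would then compute per-class probabilities from the pooled counts.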
One thing I like about spamassassin is that it uses multiple
tests, so when I upgrade the package I get new and effective
filtering schemes, such as bayesian filtering, without
having to go search for them. Heck, when I first read about
bayesian filtering, it turned out that I already had it!
(sa-learn was already on my system; I just had to train it.)
> 2. Do you let it automatically train on the spams and non-spams as it
> sorts them, or do you only do manual training? I only do manual
> training because I don't want bogofilter to develop
> self-reinforcing bad habits of dumping my good emails in the spam pile.
I only use manual training. Supposedly it's best to use
approximately equal volumes of spam and non-spam in the
training. spamassassin's FAQ (I think) recommends training
with 1000+ messages of each (spam and non-spam). Also, it's
easy to reverse a mistake, as using sa-learn --ham reverses
the effect of sa-learn --spam and vice versa.
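
If you want to script the correction step, a small wrapper might look like the sketch below. The only sa-learn options it uses are the --spam, --ham and --mbox ones shown earlier; the function names and the file name in the usage note are my own inventions, and it assumes sa-learn is on your PATH.

```python
import subprocess


def sa_learn_argv(kind, path, mbox=False):
    """Build the sa-learn command line for learning `path` as spam or ham."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    argv = ["sa-learn", "--" + kind]
    if mbox:
        argv.append("--mbox")
    argv.append(path)
    return argv


def relearn(kind, path, mbox=False):
    """Run sa-learn; learning as the other class reverses the earlier training."""
    subprocess.run(sa_learn_argv(kind, path, mbox), check=True)
```

For example, relearn("ham", "misfiled.mbox", mbox=True) would re-train a (hypothetical) mbox of wrongly learned messages as non-spam.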
Well, actually, if you have old email addresses that get
only spam, self-training on that is probably a good idea,
although I understand that a large imbalance between spam
and nonspam can hurt the filter's effectiveness. I'll
probably be setting up self-training on some spamtrap
addresses myself soon.
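
One simple way to keep the two piles the same size before training, sketched in Python: trim the larger pile down to the size of the smaller one by random sampling. This is only an illustration (balance() is a made-up helper, and the inputs are assumed to be plain lists of messages or file names), not something spamassassin does for you.

```python
import random


def balance(spam, ham, seed=None):
    """Return equally sized random samples of the spam and ham lists."""
    rng = random.Random(seed)
    n = min(len(spam), len(ham))
    # Sampling without replacement keeps every chosen message unique.
    return rng.sample(spam, n), rng.sample(ham, n)
```

Whichever pile is smaller is kept whole (in shuffled order), and the other is cut down to match, so a flood from a spamtrap address doesn't swamp the good mail.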
> 3. Do you include all the emails you can in the training set? My
> training set has equal numbers of spams and good emails, and if
> there aren't enough spams I train it with fewer good emails.
I train primarily on spam that is missed by spamassassin's
other rules vs mail that is very important to me.
When I get a false positive, I also examine the spamassassin
results and sometimes adjust the weights of the rules a bit
(I cranked up the realtime-blacklists a bit this week).
> 4. How big is your training set? Mine presently has 1138 spams and
> 1138 good emails.
(see above)
> 5. Do you manually massage the training emails to eliminate
> uninteresting training text? I don't. The crm114 author did,
> according to one version of the crm114 documentation.
?
> Before a message gets to bogofilter, it has to get through
> cbl.abuseat.org and then through greylisting (search Google for
> "relaydelay"). Greylisting is a big win. It reduces the spam rate
> 80% so I can easily read the subject lines of all of the spams, and it
> delays much of the spam enough so cbl.abuseat.org has a chance to
> blacklist it.
>
> I'm not doing SPF filtering yet. I did publish SPF records for
> fungible.com, though.
>
> --
> Tim Freeman tim@fungible.com
> I xeroxed a mirror. Now I have an extra xerox machine. -- Steven Wright
> --
> bad mailing list
> bad@bad.debian.net
> http://bad.debian.net/cgi-bin/mailman/listinfo/bad
--
-- Tony