Spam sorting (was Re: Install assistance needed in Berkeley area)
Nick Moffitt
nick@zork.net
Fri, 2 Jan 2004 22:52:30 -0800
begin Tim Freeman quotation:
> begin Nick Moffitt quotation:
> > This is the key to bayesian filtering: The more spam you *get*, the
> > less you have to *read*!
>
> So how big is your training corpus by now? I have about 1000 spams
> and 1000 hams.
I don't know. I just let procmail handle the self-reinforcing
training. I trained initially on just a few hundred messages:
sanitized spamassassin catches for spam, and saved mail for non-spam.
(Also, it's kind of aggravating that people use the term "ham" to
refer to the good mails you want to keep.)
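To make "self-reinforcing" concrete, here is a rough sketch in Python of
the general idea: seed the filter with a few hand-sorted messages, then
feed every later message back in under whatever label the filter itself
assigned. The TinyBayes class, its method names, and the 0.5 threshold
are all made up for illustration. This is a toy word-count classifier,
not bogofilter's actual algorithm and not a real procmail setup.

```python
import math
from collections import Counter


class TinyBayes:
    """Toy word-count Bayesian classifier (hypothetical, for illustration)."""

    def __init__(self):
        self.words = {"spam": Counter(), "good": Counter()}
        self.messages = {"spam": 0, "good": 0}

    def train(self, text, label):
        # Count each distinct word once per message.
        self.messages[label] += 1
        self.words[label].update(set(text.lower().split()))

    def spam_probability(self, text):
        # Start from the prior odds, then add one log-likelihood ratio
        # per word, using add-one smoothing so unseen words are harmless.
        log_odds = math.log((self.messages["spam"] + 1) / (self.messages["good"] + 1))
        for word in set(text.lower().split()):
            p_spam = (self.words["spam"][word] + 1) / (self.messages["spam"] + 2)
            p_good = (self.words["good"][word] + 1) / (self.messages["good"] + 2)
            log_odds += math.log(p_spam / p_good)
        # Clamp so math.exp cannot overflow on very long messages.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1 / (1 + math.exp(-log_odds))

    def classify_and_learn(self, text, threshold=0.5):
        # The self-reinforcing step: train on the filter's own verdict.
        label = "spam" if self.spam_probability(text) >= threshold else "good"
        self.train(text, label)
        return label
```

The seed corpus only matters at the start. After that, each call to
classify_and_learn() both sorts the message and updates the counts, so
the word statistics follow whatever the incoming mail looks like. The
obvious risk is that a wrong verdict also gets trained on, which is why
you would still want a way to correct mistakes by hand.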
> Do you ever expire old messages from the training corpus? I don't.
> I'm thinking of it, though, since my older training hams have a
> different average age from my older training spams, and bogofilter
> seems to learn from inessential things like changes in dates and
> consequences of changes in how receiving email is set up. However
> in tests I was never able to get improved sorting by leaving out the
> old skewed emails, probably because I have relatively little
> training data.
I don't understand what you're talking about. It sure doesn't
sound like any self-reinforcing bayesian filter techniques any sane
person would use. How do you let it adapt if you keep a static
training corpus instead of treating that like an initial seed?
> Do you retain all messages that you've trained your spam filter on?
> I retain them just in case I want to change spam filters some day,
> and so I can do experiments to figure out the best threshold to use
> for bogofilter, or the experiment I mentioned in the previous
> paragraph of leaving out old training data to see if it improves
> things.
I logrotate them every day, deleting them after four days. I get
megabytes of the stuff, and haven't had any false positives in
several months.
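The retention policy amounts to something like the following Python
sketch. It is not logrotate configuration, just the keep-or-delete
decision written out. The function name, the file names, and the reading
of "four" as "archives up to four days old are kept" are assumptions.

```python
from datetime import date


def split_by_age(archives, today, keep_days=4):
    """Split archives into (kept, deleted) lists of names.

    archives maps a file name to the date it was rotated. An archive is
    kept while it is at most keep_days old (an assumed interpretation).
    """
    kept, deleted = [], []
    for name, rotated_on in sorted(archives.items()):
        age_in_days = (today - rotated_on).days
        if age_in_days <= keep_days:
            kept.append(name)
        else:
            deleted.append(name)
    return kept, deleted
```

With daily rotation this keeps a window of at most four old archives on
disk, which is enough to dig out a recent false positive without
storing megabytes of spam forever.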
> There are lots of single-user programs, like emacs, that get
> packages. You know this so I must not have understood you properly.
> Which single-user programs do you think won't get packages?
You install emacs once, and many people use it. Bogofilter
can operate from one installation, storing data in ~/.whatever. I
assume he meant that spambayes gets one data dir per installation, so
you can't have multiple users with different spambayes databases on
the same machine without multiple separate installations.
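The difference is easy to show with a short Python sketch of the
one-installation, per-user-data arrangement. The directory name
".toyfilter" and the file name "wordlist.db" are invented here; they
stand in for "~/.whatever" and are not the paths bogofilter or spambayes
really use.

```python
from pathlib import Path


def user_wordlist(home, app_dir=".toyfilter"):
    """Return the per-user database path under the given home directory.

    The program is installed once; only this path differs per user.
    Both app_dir and the file name are hypothetical.
    """
    return Path(home) / app_dir / "wordlist.db"
```

Two users running the same binary get two separate databases, for
example user_wordlist("/home/nick") and user_wordlist("/home/tim"). A
program that hard-codes one data directory per installation cannot do
that without being installed twice.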
> All it takes is one enthusiastic and competent person to make a
> package, so I think it's just bad luck that this person hasn't shown
> up for spambayes.
Are you volunteering?
--
"Forget the damned motor car and build cities for lovers and friends."
-- Lewis Mumford
end