Spam sorting (was Re: Install assistance needed in Berkeley area)

Nick Moffitt nick@zork.net
Fri, 2 Jan 2004 22:52:30 -0800


begin  Tim Freeman  quotation:
> begin  Nick Moffitt  quotation:
> > This is the key to Bayesian filtering: The more spam you *get*, the
> > less you have to *read*!
> 
> So how big is your training corpus by now?  I have about 1000 spams
> and 1000 hams.

	I don't know.  I just let procmail handle the self-reinforcing
training.  I trained initially on just a few hundred messages:
sanitized spamassassin catches for spam, and saved mail for non-spam.
(Also, it's kind of aggravating that people use the term "ham" to
refer to the good mails you want to keep.)

> Do you ever expire old messages from the training corpus?  I don't.
> I'm thinking of it, though, since my older training hams have a
> different average age from my older training spams, and bogofilter
> seems to learn from inessential things like changes in dates and
> consequences of changes in how receiving email is set up.  However,
> in tests I was never able to get improved sorting by leaving out the
> old skewed emails, probably because I have relatively little
> training data.

	I don't understand what you're talking about.  It sure doesn't
sound like any self-reinforcing Bayesian filter technique any sane
person would use.  How do you let it adapt if you keep a static
training corpus instead of treating it as an initial seed?
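
	To make the seed-versus-static distinction concrete, here is a
toy sketch in Python.  It is a bare-bones naive Bayes scorer, not what
bogofilter actually computes, and every name in it (ToyFilter,
classify_and_learn, the "spam"/"good" labels) is made up for
illustration.  The only point is the last method: each verdict is fed
back into the word counts, so the initial training is just a seed.

```python
import math
import re
from collections import Counter


class ToyFilter:
    """Toy naive-Bayes spam filter that retrains on its own verdicts."""

    def __init__(self):
        # Per-label count of messages containing each word.
        self.words = {"spam": Counter(), "good": Counter()}
        # Per-label count of messages seen.
        self.msgs = {"spam": 0, "good": 0}

    @staticmethod
    def tokens(text):
        # Each distinct word counts once per message.
        return set(re.findall(r"[a-z0-9$']+", text.lower()))

    def train(self, text, label):
        self.msgs[label] += 1
        self.words[label].update(self.tokens(text))

    def spam_score(self, text):
        # Log-odds of spam with add-one smoothing; > 0 means "looks like spam".
        score = math.log((self.msgs["spam"] + 1) / (self.msgs["good"] + 1))
        for tok in self.tokens(text):
            p_spam = (self.words["spam"][tok] + 1) / (self.msgs["spam"] + 2)
            p_good = (self.words["good"][tok] + 1) / (self.msgs["good"] + 2)
            score += math.log(p_spam / p_good)
        return score

    def classify_and_learn(self, text):
        # The self-reinforcing step: the verdict goes straight back in.
        label = "spam" if self.spam_score(text) > 0 else "good"
        self.train(text, label)
        return label


# Seed with a tiny corpus, then let the filter keep learning on its own.
f = ToyFilter()
f.train("cheap pills buy now", "spam")
f.train("lunch on friday at noon", "good")
f.classify_and_learn("buy cheap pills today")
```

	With a static corpus you would only ever call train() on the
saved messages; here the word "today" starts to count against later
messages purely because the filter filed one message containing it as
spam.  The obvious risk is that a wrong verdict reinforces itself too,
so misfiled mail still has to be corrected by hand.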

> Do you retain all messages that you've trained your spam filter on?
> I retain them just in case I want to change spam filters some day,
> and so I can do experiments to figure out the best threshold to use
> for bogofilter, or the experiment I mentioned in the previous
> paragraph of leaving out old training data to see if it improves
> things.

	I logrotate them every day, deleting them after four days.  I
get megabytes of the stuff, and haven't had any false positives for
several months.
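
	For anyone without logrotate handy, the retention policy can be
approximated with a simple age-based sweep.  This Python sketch is not
logrotate and does no renaming or compression; it just deletes files
older than a cutoff.  The function name and the four-day default are
assumptions for illustration.

```python
import time
from pathlib import Path

MAX_AGE_DAYS = 4  # retention period; an assumption for illustration


def expire_old_files(directory, max_age_days=MAX_AGE_DAYS, now=None):
    """Delete regular files in `directory` last modified more than
    `max_age_days` ago and return the sorted names that were removed."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for path in Path(directory).iterdir():
        # Skip subdirectories; only plain files are expired.
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

	Run once a day from cron, something like this keeps roughly
the last four days of saved mail around and nothing older.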

> There are lots of single-user programs, like emacs, that get
> packages.  You know this so I must not have understood you properly.
> Which single-user programs do you think won't get packages?

	You install emacs once, and many people use it.  Bogofilter
can operate from one installation, storing data in ~/.whatever.  I
assume he meant that spambayes gets one data dir per installation, so
you can't have multiple users with different spambayes databases on
the same machine without multiple separate installations.
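
	The "one installation, per-user data" layout is easy to sketch.
The Python below is only an illustration of the idea; it is not taken
from bogofilter or spambayes, and the helper name and the "toyfilter"
directory name are hypothetical.

```python
from pathlib import Path


def user_data_dir(app_name, home=None):
    """Return a per-user dot-directory such as ~/.app_name,
    creating it if it does not exist yet."""
    base = Path(home) if home is not None else Path.home()
    data_dir = base / f".{app_name}"
    # Private to the owner, since it will hold mail-derived word lists.
    data_dir.mkdir(mode=0o700, parents=True, exist_ok=True)
    return data_dir
```

	Because the path hangs off each user's home directory, a call
like user_data_dir("toyfilter") gives every account its own database
while the program itself is installed only once.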

> All it takes is one enthusiastic and competent person to make a
> package, so I think it's just bad luck that this person hasn't shown
> up for spambayes.

	Are you volunteering?

-- 
"Forget the damned motor car and build cities for lovers and friends."
	-- Lewis Mumford

end