Spam sorting (was Re: Install assistance needed in Berkeley area)

Tony Godshall togo@of.net
Thu, 8 Jan 2004 11:14:30 -0800


According to Nick Moffitt,
> begin  Tony Godshall  quotation:
> > Unfortunately, Bayesian filtering has become a bit less
> > effective lately (more spam gets through) as spammers are
> > using random misspellings and garbage words to evade it.  But
> > hopefully someone will come up with a fix for that (perhaps
> > grouping rarely-seen words together rather than ranking them
> > separately).
> 
> 	Nope.  You just don't get enough spam.  Bayesian filtering
> eventually flags those nonsense words as spammy words, and you don't
> need to worry.  Remember, and repeat after me:  The more spam you GET,
> the less you have to READ!
> 
> > I only use manual training.  Supposedly it's best to use
> > an approximately equal volume of spam and non-spam in the
> > training.  SpamAssassin's FAQ (I think) recommends training
> > with 1000+ messages each (spam and non-spam).  Also, it's easy
> > to reverse a mistake, since using sa-learn --ham reverses the
> > effect of sa-learn --spam and vice versa.
> 
> 	That's why your Bayesian filter isn't adapting.  You're not
> letting it learn.
> 	
> 	If it makes a mistake, correct it.  But don't keep it in the
> dark all the time!

Sure, more data is always good.  But what about the imbalance
issue?  Are you saying it's not a problem?  And why should I take
your opinion on this seriously if you don't offer a rational
argument against what the SA docs say?
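
For reference, here's roughly how I understand Graham-style token
scoring to work -- a simplified sketch in Python, not SpamAssassin's
actual code; the counts, the tokens, and the 0.4 unknown-token value
are made up for the example:

  # Simplified Graham-style token scoring -- an illustration only.
  spam_count = {"viagra": 200, "xqzwv": 0}  # token hits in the spam corpus
  ham_count  = {"viagra": 1,   "xqzwv": 0}  # token hits in the ham corpus
  n_spam, n_ham = 1000, 1000                # messages in each training corpus

  def spamminess(token):
      """Estimate P(spam | token), normalizing by corpus size."""
      s = spam_count.get(token, 0) / n_spam
      h = ham_count.get(token, 0) / n_ham
      if s + h == 0:
          return 0.4        # never-seen token: treated as roughly neutral
      return s / (s + h)

  print(spamminess("viagra"))   # ~0.995 -- strongly spammy
  print(spamminess("xqzwv"))    # 0.4    -- nonsense word, neutral so far
  spam_count["xqzwv"] = 5       # ...after a few spams containing it are learned
  print(spamminess("xqzwv"))    # 1.0    -- now flagged as spammy

So yes, a never-seen garbage word does eventually turn spammy, but
only after the learner has actually seen it in spam.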
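
And here's one way an unbalanced training set could skew per-word
scores *if* a filter used raw counts without normalizing by corpus
size -- that's an assumption for illustration, not a claim about
what SA actually does, and the numbers and the word are made up:

  # A hammy word ("meeting") looks spammy when the spam corpus is 10x
  # bigger and the counts aren't normalized by corpus size.
  spam_msgs, ham_msgs = 5000, 500    # heavily imbalanced training set
  spam_hits, ham_hits = 1500, 450    # messages containing "meeting"

  p_spam = spam_hits / spam_msgs     # 0.3: fraction of spam with "meeting"
  p_ham  = ham_hits / ham_msgs       # 0.9: fraction of ham with "meeting"

  raw  = spam_hits / (spam_hits + ham_hits)  # ~0.77: looks spammy
  norm = p_spam / (p_spam + p_ham)           # 0.25: actually hammy

  print(round(raw, 2), round(norm, 2))       # 0.77 0.25

That's the sort of skew I had in mind when asking about the
imbalance issue.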