Spam sorting (was Re: Install assistance needed in Berkeley area)
Tony Godshall
togo@of.net
Sun, 11 Jan 2004 12:22:10 -0800
According to Nick Moffitt,
> begin Tony Godshall quotation:
> > sure, more data is always good. but what about the imbalance issue?
> > are you saying it's not a problem? and why should I take your
> > opinion on this seriously if you don't offer a rational argument
> > against what the SA doc says?
>
> SA is making the assumption that you can use an actual strict
> bayesian analysis instead of just writing it off as "naive bayes". I
> keep the balance issue worked out mostly by subscribing to a lot of
> legitimate mailing lists as well as spam lists. So long as you're not
> getting 3 of one kind of mail per day and 300 of the other, you're
> probably fine.
>
> And I'm speaking purely anecdotally, true. I do these things,
> and I get next to no spam and no false positives.
Ah, thanks, Tim! That makes a lot more sense.
Now I just have to google around for or work out the procmail recipe
(unless you want to paste me a sample of yours to let lazy me crib off you).
Say, any quick way to count the number of messages in a mbox file?
Is "grep -c '^From' file" sufficient?
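
A minimal sketch of that count, assuming a well-formed mbox in which every
message starts with a "From " separator line and any body line beginning with
"From " has been escaped as ">From ". The pattern needs the trailing space:
without it, '^From' also matches the "From:" header of each message, so the
count comes out roughly doubled. The name count_mbox is made up for
illustration and is not a standard tool.

```shell
# count_mbox: hypothetical helper, prints the number of messages in an mbox file.
count_mbox() {
    # Match only the mbox separator ("From " with a trailing space),
    # not the "From:" header that appears inside each message.
    grep -c '^From ' "$1"
}
```

Usage would be something like "count_mbox ~/mail/spam". If the mbox was
written by something that does not escape "From " in message bodies, the
number may come out a little high.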
Tony