report back (for the newbies) [was Re: Spam sorting (was...)]
Tony Godshall
togo@of.net
Fri, 30 Jan 2004 19:26:57 -0600
According to Nick Moffitt,
> begin Tony Godshall quotation:
> > sure, more data is always good. but what about the imbalance issue?
> > are you saying it's not a problem? and why should I take your
> > opinion on this seriously if you don't offer a rational argument
> > against what the SA doc says?
>
> SA is making the assumption that you can use an actual strict
> bayesian analysis instead of just writing it off as "naive bayes". I
> keep the balance issue worked out mostly by subscribing to a lot of
> legitimate mailing lists as well as spam lists. So long as you're not
> getting 3 of one kind of mail per day and 300 of the other, you're
> probably fine.
>
> And I'm speaking purely anecdotally, true. I do these things,
> and I get next to no spam and no false positives.
I've implemented auto-learning with good results: much less
spam sneaks through, and false positives are almost
nonexistent (I still check, tho).
To autotrain, I found the fake addresses getting the most
traffic and made a recipe that immediately and automatically
trains SpamAssassin on anything sent to them.
The details:
Added to ~/.spamassassin/user_prefs:
rewrite_subject 1
subject_tag SPAM[_HITS_]
That lets me sort my caught spam by subject (and so by score).
Locating spamtrap addresses:
cat lotsofspam | perl -e '
  # Peel addresses off each To: line, last one first, and print them.
  while(<>)
  {
    while(s{^(To:.*[^a-zA-Z_.])([a-zA-Z_.]+@[a-zA-Z_.]+)}{$1})
    {
      print "$2\n"
    }
  }'|sort|uniq -c|sort -n
Amazingly I'm getting more spam at a fake addr than at a
real one:
...
624 godshall@of.net
750 togo@of.net
1987 pulver@of.net
In my ~/.procmailrc, autotrain on a spamtrap address:
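# w: wait for sa-learn to finish before moving on
:0w
* ^To: .*arcard@of.net
| sa-learn --spam
Or to keep a copy ...
# c: hand sa-learn a copy, so the message carries on to later recipes
:0cw
* ^To: .*arcard@of.net
| sa-learn --spam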
:0:
* ^To: .*pulver@of.net
spamtrap.trained
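Before trusting a condition like the ones above, a rough way to
see what it would catch is to run the same regex over a saved
spam mbox with grep. This is only a sketch: count_matches is a
made-up helper name, not part of procmail or SpamAssassin.
procmail conditions are case-insensitive extended regexes, hence
-Ei. Unlike procmail, grep also looks at message bodies, so
treat the number as an approximation.

```shell
# count_matches REGEX MBOX
# Hypothetical helper: count the lines in MBOX that match a
# procmail-style condition (the part after the "* ").
count_matches() {
    # -E: extended regex, -i: ignore case (procmail's default),
    # -c: print the number of matching lines
    grep -Eic -- "$1" "$2"
}

# Example call (the mbox name is whatever you saved your spam as):
#   count_matches '^To: .*pulver@of\.net' lotsofspam
```

A count near zero for an address you expected to be busy
usually means the regex is off, not that the spam stopped.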
You can also use 'spamassassin -r' in place of sa-learn; it
reports the message to Vipul's Razor (if you have razor installed
properly; read /usr/share/doc/razor/readme.Debian).
You might be tempted to use 'spamassassin -r -l filename' to
train, report, and save in one recipe, but it doesn't seem
to work (I get perl errors).
For more examples on using spamassassin + procmail to sort
mail, see /usr/share/doc/procmailrc.example.
I read my mail with mutt and file anything that got through
with one keystroke because I have this in my .muttrc:
macro index X "s=manspam\r"
Anything I read gets filed too, so I can train on it as nonspam:
mbox-hook Inbox =oldmail
There are mutt macros out there to train and report spam from
inside mutt, but that's just too painful (I use mutt because
it's /fast/). So I just sort mail in mutt and let sa-learn
have at it when I'm done (and I can go onto something else
while it churns).
Once I've read my mail, I train SA. It's important to train
the bayesian filter on the stuff that got through, since
I don't want any similar stuff coming through.
sa-learn --spam --showdots --mbox ~/my/mail/manspam
It's also important to keep training it on the good stuff so it
doesn't get too imbalanced.
sa-learn --ham --showdots --mbox ~/my/mail/oldmail
sa-learn --ham --showdots --mbox ~/my/mail/friends
sa-learn --ham --showdots --mbox ~/my/mail/work
...
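To keep an eye on the balance, it can help to compare how many
messages sit in the spam and ham folders you train from. A
minimal sketch, assuming plain mbox files where each message
starts with a "From " line. count_msgs and balance_report are
illustrative names, not SpamAssassin commands.

```shell
# count_msgs MBOX
# Count messages in an mbox by its "From " separator lines.
# Prints nothing if the file cannot be read.
count_msgs() {
    grep -c '^From ' "$1" 2>/dev/null
}

# balance_report SPAM_MBOX HAM_MBOX...
# Hypothetical helper: print spam and ham totals side by side.
balance_report() {
    spam=$(count_msgs "$1")
    shift
    ham=0
    for f in "$@"; do
        n=$(count_msgs "$f")
        # Missing or unreadable folders count as zero.
        ham=$((ham + ${n:-0}))
    done
    echo "spam=${spam:-0} ham=$ham"
}

# Example call:
#   balance_report ~/my/mail/manspam ~/my/mail/oldmail ~/my/mail/friends
```

This only counts what is in the folders, not what the bayes
database has actually learned, so read it as a rough guide.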
Of course I don't type all this stuff each time: I have a
script ("M") that starts mutt and then trains upon exit.
Hope this helps a newbie or two.
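A wrapper along those lines could look something like the
sketch below. It is not the actual "M" script, just one way to
do it, written as a shell function. The MUTT, SA_LEARN and
MAIL_BASE variables are assumptions added for illustration; the
folder names and sa-learn options are the ones used above.

```shell
# Hypothetical settings; override them to suit your setup.
MUTT=${MUTT:-mutt}
SA_LEARN=${SA_LEARN:-sa-learn}
MAIL_BASE=${MAIL_BASE:-$HOME/my/mail}

# M [mutt arguments...]
# Run mutt, then train SpamAssassin once mutt exits.
M() {
    "$MUTT" "$@"

    # Spam that got through and was filed by hand.
    if [ -f "$MAIL_BASE/manspam" ]; then
        "$SA_LEARN" --spam --showdots --mbox "$MAIL_BASE/manspam"
    fi

    # Good mail, to keep the training from getting lopsided.
    for folder in oldmail friends work; do
        [ -f "$MAIL_BASE/$folder" ] || continue
        "$SA_LEARN" --ham --showdots --mbox "$MAIL_BASE/$folder"
    done
}
```

Folders that do not exist yet are skipped, so the function
can be used before every folder has been created.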
And, of course, no guarantees. If you break it, don't blame
me. Make sure you test your procmail recipes before you put
them in place. And if you can't figure out how to do that,
ehhh, maybe you need to read some Fine Manuals ;-).
Tony