report back (for the newbies) [was Re: Spam sorting (was...)]

Tony Godshall togo@of.net
Fri, 30 Jan 2004 19:26:57 -0600


According to Nick Moffitt,
> begin  Tony Godshall  quotation:
> > sure, more data is always good.  but what about the imbalance issue?
> > are you saying it's not a problem?  and why should I take your
> > opinion on this seriously if you don't offer a rational argument
> > against what the SA doc says?
> 
> 	SA is making the assumption that you can use an actual strict
> bayesian analysis instead of just writing it off as "naive bayes".  I
> keep the balance issue worked out mostly by subscribing to a lot of
> legitimate mailing lists as well as spam lists.  So long as you're not
> getting 3 of one kind of mail per day and 300 of the other, you're
> probably fine.
> 
> 	And I'm speaking purely anecdotally, true.  I do these things,
> and I get next to no spam and no false positives.

I've implemented auto-learning with good results (much less
spam sneaks through, and false positives are almost
nonexistent (I still check, tho)).

To autotrain, I found the fake addresses getting the most
traffic and made a recipe that immediately and automatically
trains spamassassin on anything sent to them.

The details:

Added to ~/.spamassassin/user_prefs:

  rewrite_subject 1
  subject_tag SPAM[_HITS_]

That lets me sort my caughtspam by subject.

Locating spamtrap addresses:

cat lotsofspam | perl -e '
  # Pull every address off each To: line, last one first, and print it.
  while(<>)
  {
    while(s{^(To:.*[^a-zA-Z0-9_.-])([a-zA-Z0-9_.-]+@[a-zA-Z0-9_.-]+)}{$1})
    {
      print "$2\n"
    }
  }'|sort|uniq -c|sort -n

Amazingly I'm getting more spam at a fake addr than at a
real one:

    ...
    624 godshall@of.net
    750 togo@of.net
   1987 pulver@of.net

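In my ~/.procmailrc, autotrain on a spamtrap address:

  # learn it as spam and drop it ("w" waits for sa-learn to finish)
  :0w
  * ^To: .*arcard@of.net
  | sa-learn --spam

Or to keep a copy ...

  # "c" hands sa-learn a copy and lets the message carry on
  :0wc
  * ^To: .*arcard@of.net
  | sa-learn --spam

  :0:
  * ^To: .*pulver@of.net
  spamtrap.trained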

You can also use 'spamassassin -r' in place of sa-learn,
which reports to Vipul's Razor (if you have razor installed
properly; read /usr/share/doc/razor/readme.Debian).

You might be tempted to use 'spamassassin -r -l filename' to
train, report, and save in one recipe, but it doesn't seem
to work (I get perl errors).

For more examples on using spamassassin + procmail to sort
mail, see /usr/share/doc/procmailrc.example.

I read my mail with mutt and file anything that got through
with one keystroke because I have this in my .muttrc:

  macro index X "s=manspam\r"

Also anything I read gets filed too, so I can train it as nonspam:

  mbox-hook Inbox =oldmail

There are mutt macros out there to train and report spam from
inside mutt, but that's just too painful (I use mutt because
it's /fast/).  So I just sort mail in mutt and let sa-learn
have at it when I'm done (and I can go onto something else
while it churns).

Once I've read my mail, I train SA.  It's important to train
the bayesian filter on the stuff that got through, since
I don't want any similar stuff coming through.

  sa-learn --spam --showdots --mbox ~/my/mail/manspam

It's also important to keep training it on the good stuff so it 
doesn't get too imbalanced.

  sa-learn --ham --showdots --mbox ~/my/mail/oldmail
  sa-learn --ham --showdots --mbox ~/my/mail/friends
  sa-learn --ham --showdots --mbox ~/my/mail/work
  ...

Of course I don't type all this stuff each time: I have a
script ("M") that starts mutt and then trains upon exit.
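
The "M" script itself isn't shown, so what follows is only a
rough sketch of how such a wrapper could look, built from the
sa-learn commands above.  The names MAIL_ROOT, SA_LEARN, MUTT,
train_folder and run_and_train are made up for illustration, and
the folder list is just the one from the examples.

```shell
#!/bin/sh
# Sketch of an "M"-style wrapper: run mutt, then train SpamAssassin.
# All names below are illustrative assumptions, not the original script.

MAIL_ROOT="${MAIL_ROOT:-$HOME/my/mail}"   # where the mbox folders live
SA_LEARN="${SA_LEARN:-sa-learn}"          # overridable, e.g. for a dry run
MUTT="${MUTT:-mutt}"

# train_folder --spam|--ham FOLDER
# Feed one mbox to sa-learn; skip folders that are missing or empty.
train_folder() {
    box="$MAIL_ROOT/$2"
    [ -s "$box" ] || return 0
    "$SA_LEARN" "$1" --showdots --mbox "$box"
}

# run_and_train [mutt args...]
# Read and sort mail first, then train on the result once mutt exits.
run_and_train() {
    "$MUTT" "$@"
    train_folder --spam manspam
    for folder in oldmail friends work; do
        train_folder --ham "$folder"
    done
}
```

Ending the file with a call to run_and_train "$@" would make it
behave like the wrapper described.  Setting SA_LEARN=echo prints
the sa-learn command lines instead of running them, which is a
cheap way to check the folder list.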

Hope this helps a newbie or two.

And, of course, no guarantees.  If you break it, don't blame
me.  Make sure you test your procmail recipes before you put
them in place.  And if you can't figure out how to do that,
ehhh, maybe you need to read some Fine Manuals ;-).
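
For the impatient, one common way to try recipes out is to run
procmail by hand against a saved message, with every delivery
pointed at a scratch directory.  The sketch below assumes that
approach; try_recipes and PROCMAIL are hypothetical names, and
the recipe file is assumed to hold only recipes (no MAILDIR or
DEFAULT settings of its own).  Note that a recipe piping to
sa-learn will still train the real database, so swap that
command for something harmless while testing.

```shell
#!/bin/sh
# Dry-run a file of procmail recipes against one saved message inside
# a throwaway directory.  try_recipes and PROCMAIL are made-up names.

PROCMAIL="${PROCMAIL:-procmail}"

# try_recipes RECIPES MESSAGE
# Prints the scratch directory; its log and folders show what happened.
try_recipes() {
    sandbox=$(mktemp -d) || return 1
    # Prepend settings that point all deliveries at the sandbox,
    # then append the recipes under test.
    {
        echo "MAILDIR=$sandbox"
        echo "DEFAULT=$sandbox/inbox"
        echo "LOGFILE=$sandbox/log"
        echo "VERBOSE=yes"
        cat "$1"
    } > "$sandbox/rc"
    # An rcfile named on the command line is used in place of
    # ~/.procmailrc; the message comes in on stdin.
    "$PROCMAIL" "$sandbox/rc" < "$2"
    echo "$sandbox"
}
```

Usage would be something like: try_recipes my-recipes one-spam.msg,
then read the log file in the directory it prints.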

Tony