Bug 4493 - RFE: add pre-tokenize text munge to learner
Status: NEW
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Learner
Version: 3.0.4
Hardware: All All
Importance: P5 enhancement
Target Milestone: Future
Assignee: SpamAssassin Developer Mailing List
Reported: 2005-07-20 11:58 UTC by David Harris
Modified: 2009-03-31 02:52 UTC
Description David Harris 2005-07-20 11:58:25 UTC
I set up SpamAssassin with a site-wide Bayes database. Users report their 
own spam, and once an administrator approves a report, that spam is used to 
train the SpamAssassin Bayes database.

Because users report spam into a global Bayes database, I want the learner 
to ignore my users' e-mail addresses: if one user happens to report a lot of 
spam, Bayes would learn that their address indicates spam. I don't want this.

I have already excluded the To, Cc, and Bcc headers using the 
bayes_ignore_header config option. However, e-mail addresses also show up in 
my Received header, as in the following, and can show up in other places too.

Received: from w3.drh.net ([64.21.76.5])
          (envelope-sender <dharris@drh.net>)
          by secondary.scan1.myactv.net (qmail-ldap-1.03) with SMTP
          for <test.mss2@mail.myactv.net>; 20 Jul 2005 18:12:32 -0000

So I created a patch that applies the regular expression below to any text 
before it is tokenized by Bayes, to wipe out the username:

s/[a-z0-9][a-z0-9_.-]{1,48}\@(myactv\.net|mail\.myactv\.net|mss1\.myactv\.net)/MYACTVREPLACEDUSERNAME\@myactv.net/gi;
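
To illustrate the effect on the Received header above, here is a minimal standalone Python sketch of the same substitution. It is not the patch itself: the helper name `mask_local_users` is hypothetical, and the dots in the domain names are escaped so that they only match literal dots.

```python
import re

# Hypothetical helper (not part of SpamAssassin): replaces the local part of
# any address at the site's own domains with one fixed placeholder username.
LOCAL_USER = re.compile(
    r"[a-z0-9][a-z0-9_.-]{1,48}@"
    r"(?:myactv\.net|mail\.myactv\.net|mss1\.myactv\.net)",
    re.IGNORECASE,
)

def mask_local_users(text: str) -> str:
    """Return text with local recipient addresses replaced by a placeholder."""
    return LOCAL_USER.sub("MYACTVREPLACEDUSERNAME@myactv.net", text)

received = (
    "(envelope-sender <dharris@drh.net>)\n"
    "for <test.mss2@mail.myactv.net>; 20 Jul 2005 18:12:32 -0000"
)
print(mask_local_users(received))
```

The envelope sender at drh.net is left alone because only the listed domains are matched; the recipient address collapses to the placeholder.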

Because I have multiple MX servers, I also used this regular expression to 
solve the problem described at http://wiki.apache.org/spamassassin/BayesBitMe

s/scan\d\.myactv\.net/scan1.myactv.net/g;

A configurable way to rewrite text before tokenization would be appreciated.
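
One shape such a configurable step could take is an ordered list of (pattern, replacement) rules applied to the text before it reaches the tokenizer. The Python sketch below only illustrates that idea; `REWRITE_RULES` and `munge_before_tokenize` are hypothetical names, not SpamAssassin options or functions.

```python
import re

# Hypothetical rule table: (compiled pattern, replacement), applied in order.
REWRITE_RULES = [
    # Collapse scan1, scan2, ... into a single hostname so mail relayed
    # through different MX servers yields the same tokens.
    (re.compile(r"scan\d\.myactv\.net"), "scan1.myactv.net"),
    # Mask the username of any address at the site's own domains.
    (
        re.compile(
            r"[a-z0-9][a-z0-9_.-]{1,48}@"
            r"(?:myactv\.net|mail\.myactv\.net|mss1\.myactv\.net)",
            re.IGNORECASE,
        ),
        "MYACTVREPLACEDUSERNAME@myactv.net",
    ),
]

def munge_before_tokenize(text, rules=REWRITE_RULES):
    """Apply every rewrite rule in order and return the rewritten text."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text

line = "by secondary.scan2.myactv.net for <test.mss2@mail.myactv.net>"
print(munge_before_tokenize(line))
```

Keeping the rules as data rather than code means a site could add or remove rewrites without patching the learner.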

Also note that crm114 (http://crm114.sourceforge.net/) has a feature to do this 
same thing.

Here is my patch to add this feature manually:
http://www.davideous.com/qmail/Mail-SpamAssassin-3.0.4-antietam-bayes-customizations-040719-just-rewrite.patch
Comment 1 Theo Van Dinter 2006-12-31 12:48:11 UTC
It seems like a plugin call in Bayes::tokenize() would solve this.  Then people
could filter out whatever tokens they don't want, or add in new tokens, or whatever.
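
A rough sketch of what such a hook might look like, written in Python purely for illustration. This is not the Mail::SpamAssassin plugin API: `register_token_hook`, `tokenize`, and the whitespace split standing in for the real tokenizer are all assumptions.

```python
from typing import Callable, List

# A hook takes the token list and returns a (possibly modified) token list.
TokenHook = Callable[[List[str]], List[str]]
_token_hooks: List[TokenHook] = []

def register_token_hook(hook: TokenHook) -> None:
    """Register a callback that runs on the tokens of every message."""
    _token_hooks.append(hook)

def tokenize(text: str) -> List[str]:
    """Stand-in tokenizer: split on whitespace, then let each hook edit the list."""
    tokens = text.split()
    for hook in _token_hooks:
        tokens = hook(tokens)
    return tokens

def drop_local_addresses(tokens: List[str]) -> List[str]:
    """Example hook: discard tokens that are addresses at the local domain."""
    return [t for t in tokens if not t.lower().endswith("@myactv.net")]

register_token_hook(drop_local_addresses)
print(tokenize("offer for bob@myactv.net today"))
```

A hook that returns a longer list than it received would cover the "add in new tokens" case in the same way.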
Comment 2 Justin Mason 2007-01-14 07:00:42 UTC
seems unlikely to happen in 3.2.0 without a patch
Comment 3 Justin Mason 2007-02-21 12:05:19 UTC
pushing out to 3.3.0, since I don't think it's a 3.2.0 blocker. shout (or change
the milestone) if you disagree....
Comment 4 Justin Mason 2009-03-31 02:52:21 UTC
no code -> moving to Future