4493 – RFE: add pre-tokenize text munge to learner

Summary: RFE: add pre-tokenize text munge to learner

Status:	NEW

Alias:	None

Product:	Spamassassin
Classification:	Unclassified
Component:	Learner
Version:	3.0.4
Hardware:	All All

Importance:	P5 enhancement
Target Milestone:	Future
Assignee:	SpamAssassin Developer Mailing List

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2005-07-20 11:58 UTC by David Harris
Modified:	2009-03-31 02:52 UTC
CC List:	0 users


Description David Harris 2005-07-20 11:58:25 UTC

I set up SpamAssassin with a site-wide Bayes database. Users report their 
own spam, and once an administrator approves a report, that spam is used to 
train the SpamAssassin Bayes database.

Because I have users reporting spam into a global Bayes database, I want the 
learner to ignore my users' e-mail addresses during learning. Otherwise, if 
one user happens to report lots of spam, Bayes would learn that their address 
means spam. I don't want this.

I have already excluded the To, Cc, and Bcc headers using the bayes_ignore_header 
config option. However, e-mail addresses also show up in my Received header, as in 
the following example, and can show up in other places too.

Received: from w3.drh.net ([64.21.76.5])
          (envelope-sender <dharris@drh.net>)
          by secondary.scan1.myactv.net (qmail-ldap-1.03) with SMTP
          for <test.mss2@mail.myactv.net>; 20 Jul 2005 18:12:32 -0000

So, I created a patch that applies the regular expression below to any text 
before it is tokenized by Bayes, to wipe out the username:

s/[a-z0-9][a-z0-9\_\.-]{1,48}\@(myactv.net|mail.myactv.net|mss1.myactv.net)/MYACTVREPLACEDUSERNAME\@myactv.net/gi;
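
For illustration only, here is a minimal Python sketch of what that substitution does to the Received header quoted above. It is not part of the patch: the names `USER_ADDRESS` and `munge_addresses` are made up for this example, and unlike the original pattern it escapes the dots in the domain names.

```python
import re

# Python rendering of the substitution above. Assumption: the domain dots
# are escaped here, which the original Perl pattern does not do.
USER_ADDRESS = re.compile(
    r"[a-z0-9][a-z0-9_.-]{1,48}@(?:myactv\.net|mail\.myactv\.net|mss1\.myactv\.net)",
    re.IGNORECASE,
)


def munge_addresses(text: str) -> str:
    """Replace local users' addresses with one fixed placeholder address."""
    return USER_ADDRESS.sub("MYACTVREPLACEDUSERNAME@myactv.net", text)


received = (
    "Received: from w3.drh.net ([64.21.76.5])\n"
    "          (envelope-sender <dharris@drh.net>)\n"
    "          by secondary.scan1.myactv.net (qmail-ldap-1.03) with SMTP\n"
    "          for <test.mss2@mail.myactv.net>; 20 Jul 2005 18:12:32 -0000\n"
)
munged = munge_addresses(received)
print(munged)
```

Only the recipient address at the local domain is rewritten; the external envelope sender is left alone, so Bayes still sees it as a token.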

Because I have multiple MX servers, I also used this regular expression to 
solve the problem described here http://wiki.apache.org/spamassassin/BayesBitMe

s/scan\d.myactv.net/scan1.myactv.net/g;

A configurable way to rewrite text before tokenization would be appreciated.
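
As a rough sketch of what "configurable" could mean, the two substitutions above can be expressed as an ordered list of rules applied before tokenization. This is Python for illustration only, not SpamAssassin code; `compile_rules`, `rewrite`, the rule-triple layout, and the sample address are all assumptions made up for this example.

```python
import re
from typing import List, Tuple

Rule = Tuple[re.Pattern, str]


def compile_rules(raw_rules: List[Tuple[str, str, bool]]) -> List[Rule]:
    """Compile (pattern, replacement, ignore_case) triples once, up front."""
    return [
        (re.compile(pattern, re.IGNORECASE if ignore_case else 0), replacement)
        for pattern, replacement, ignore_case in raw_rules
    ]


def rewrite(text: str, rules: List[Rule]) -> str:
    """Apply every rule in order to the text that is about to be tokenized."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


# Hypothetical rule list mirroring the two substitutions in this report
# (domain dots escaped here, unlike the originals).
RULES = compile_rules([
    (r"[a-z0-9][a-z0-9_.-]{1,48}@(?:myactv\.net|mail\.myactv\.net|mss1\.myactv\.net)",
     "MYACTVREPLACEDUSERNAME@myactv.net", True),
    (r"scan\d\.myactv\.net", "scan1.myactv.net", False),
])

# Made-up sample line, not taken from a real message.
sample = "by secondary.scan3.myactv.net for <Jane.Doe@mss1.myactv.net>"
result = rewrite(sample, RULES)
print(result)
```

Keeping the rules as data rather than code is the point of the request: a site could then add or drop substitutions without patching the learner.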

Also note that crm114 (http://crm114.sourceforge.net/) has a feature to do this 
same thing.

Here is my patch to add this feature manually:
http://www.davideous.com/qmail/Mail-SpamAssassin-3.0.4-antietam-bayes-customizations-040719-just-rewrite.patch

Comment 1 Theo Van Dinter 2006-12-31 12:48:11 UTC

It seems like a plugin call in Bayes::tokenize() would solve this.  Then people
could filter out whatever tokens they don't want, or add in new tokens, or whatever.
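
To make the idea concrete, a hook of this kind might look roughly like the Python sketch below. It only illustrates the concept and is not the SpamAssassin plugin API: `tokenize`, `TokenHook`, and both hook functions are hypothetical, and whitespace splitting stands in for the real tokenizer.

```python
from typing import Callable, List

TokenHook = Callable[[List[str]], List[str]]


def tokenize(text: str, hooks: List[TokenHook]) -> List[str]:
    """Split on whitespace (a stand-in tokenizer), then let each hook edit the list."""
    tokens = text.split()
    for hook in hooks:
        tokens = hook(tokens)
    return tokens


def drop_local_addresses(tokens: List[str]) -> List[str]:
    """Hypothetical hook: discard tokens that mention the site's own domain."""
    return [t for t in tokens if "@myactv.net" not in t.lower()]


def add_marker(tokens: List[str]) -> List[str]:
    """Hypothetical hook: append an extra token of the hook's choosing."""
    return tokens + ["HOOK:seen"]


tokens = tokenize(
    "cheap pills for user@myactv.net today",
    [drop_local_addresses, add_marker],
)
print(tokens)
```

A token-level hook differs from the text rewrite requested in the description: it runs after splitting, so it can remove or add whole tokens but cannot change how the text is split.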

Comment 2 Justin Mason 2007-01-14 07:00:42 UTC

seems unlikely to happen in 3.2.0 without a patch

Comment 3 Justin Mason 2007-02-21 12:05:19 UTC

pushing out to 3.3.0, since I don't think it's a 3.2.0 blocker. shout (or change
the milestone) if you disagree....

Comment 4 Justin Mason 2009-03-31 02:52:21 UTC

no code -> moving to Future