SA Bugzilla – Bug 4493
RFE: add pre-tokenize text munge to learner
Last modified: 2009-03-31 02:52:21 UTC
I set up SpamAssassin with a site-wide Bayes database. Users report their own spam, and once an administrator approves a report, that spam is used to train the Bayes database.

Because users report spam into a global Bayes database, I want the learner to ignore my users' e-mail addresses: if one user happens to report a lot of spam, Bayes would learn that their address means spam. I don't want this. I have already excluded the To, Cc, and Bcc headers using the bayes_ignore_header config option, but e-mail addresses also show up in my Received headers, as in the following example, and can show up in other places too.

Received: from w3.drh.net ([64.21.76.5])
    (envelope-sender <dharris@drh.net>)
    by secondary.scan1.myactv.net (qmail-ldap-1.03) with SMTP
    for <test.mss2@mail.myactv.net>; 20 Jul 2005 18:12:32 -0000

So I created a patch that applies the regular expression below to any text before it is tokenized by Bayes, to wipe out the username:

s/[a-z0-9][a-z0-9\_\.-]{1,48}\@(myactv.net|mail.myactv.net|mss1.myactv.net)/MYACTVREPLACEDUSERNAME\@myactv.net/gi;

Because I have multiple MX servers, I also used this regular expression to solve the problem described at http://wiki.apache.org/spamassassin/BayesBitMe

s/scan\d.myactv.net/scan1.myactv.net/g;

A configurable way to rewrite text before tokenization would be appreciated. Also note that crm114 (http://crm114.sourceforge.net/) has a feature that does the same thing.

Here is my patch to add this feature manually:
http://www.davideous.com/qmail/Mail-SpamAssassin-3.0.4-antietam-bayes-customizations-040719-just-rewrite.patch
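For readers who want to try the effect of the two substitutions above outside SpamAssassin, here is a minimal Python sketch of the same rewrite. It is not the patch itself and not SpamAssassin code: the function name `munge` and the constant names are made up for illustration, and the dots in the domain names are escaped here, which the original Perl expressions do not do.

```python
import re

# Hypothetical names for illustration; the domains and the placeholder
# string are taken from the report, everything else is an assumption.
USER_RE = re.compile(
    r"[a-z0-9][a-z0-9_.-]{1,48}@"
    r"(?:myactv\.net|mail\.myactv\.net|mss1\.myactv\.net)",
    re.IGNORECASE,
)
SCAN_HOST_RE = re.compile(r"scan\d\.myactv\.net")


def munge(text: str) -> str:
    """Rewrite text before it is handed to a tokenizer."""
    # Replace any local user's address with one fixed placeholder address,
    # so no individual recipient becomes a spam indicator.
    text = USER_RE.sub("MYACTVREPLACEDUSERNAME@myactv.net", text)
    # Collapse scan1, scan2, ... into a single host name so that mail
    # arriving through different MX servers tokenizes identically.
    return SCAN_HOST_RE.sub("scan1.myactv.net", text)


if __name__ == "__main__":
    header = (
        "by secondary.scan2.myactv.net (qmail-ldap-1.03) with SMTP "
        "for <test.mss2@mail.myactv.net>"
    )
    print(munge(header))
```

Addresses outside the listed domains, such as the envelope sender in the Received header above, are left untouched, so sender information remains available to the learner.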
It seems like a plugin call in Bayes::tokenize() would solve this. Then people could filter out whatever tokens they don't want, or add in new tokens, or whatever.
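The hook idea in the comment above could look roughly like the following Python sketch. This is only an illustration of the design, not SpamAssassin's plugin API: `tokenize`, `TokenFilter`, and `drop_local_addresses` are hypothetical names, and the whitespace tokenizer is a stand-in for whatever Bayes::tokenize() really does.

```python
import re
from typing import Callable, Iterable, List

# A filter receives the token list and returns a (possibly changed) list.
# This type and every function below are hypothetical.
TokenFilter = Callable[[List[str]], List[str]]


def tokenize(text: str, filters: Iterable[TokenFilter] = ()) -> List[str]:
    """Split text on whitespace, then let each filter edit the tokens."""
    tokens = re.findall(r"\S+", text)
    for token_filter in filters:
        # Each filter may drop tokens, add tokens, or rewrite them.
        tokens = token_filter(tokens)
    return tokens


def drop_local_addresses(tokens: List[str]) -> List[str]:
    """Example filter: discard tokens that mention a local mailbox."""
    return [t for t in tokens if "@mail.myactv.net" not in t.lower()]


if __name__ == "__main__":
    print(tokenize("spam for <test.mss2@mail.myactv.net>; now",
                   [drop_local_addresses]))
```

Passing the filters in as a list keeps the tokenizer itself unchanged, which is the property a plugin call would need: sites that register no filter see exactly the same tokens as before.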
seems unlikely to happen in 3.2.0 without a patch
pushing out to 3.3.0, since I don't think it's a 3.2.0 blocker. shout (or change the milestone) if you disagree.
no code -> moving to Future