SA Bugzilla – Bug 4876
CRM114 Plugin for SpamAssassin (comments, please!)
Last modified: 2008-02-12 15:00:26 UTC
Our SA implementation (via MailScanner and Exim) is working very well, dropping some 90% of spam immediately and missing (i.e. not tagging) only some 5% of the remaining "questionable" stuff. Still, the amount of mail that gets tagged (above a score of 4 and below 8) and passed to users is significant.

I wanted to try using CRM114 within SpamAssassin to augment the existing "bayes" learner. I wanted it to discriminate between spam and ham for messages that could not be classified accurately by existing rules and, as such, not waste its resources on things already handled elsewhere. I've written this plug-in to test it out. Right now, I'm just using the basic "classifymail.crm" script that comes with CRM114, with a few modifications as to where to find files.

This is my first attempt at a plug-in and at working with v3 of SpamAssassin, so I apologize if it's not as elegant as it could be.

A few notes about the plugin:

.../SpamAssassin/Plugin/CRM114.pm

* It skips itself unless the current score is within the -5 to 15 range. I believe this will avoid running it for messages that are already obvious. I chose this range on the assumption that the rule weightings would never be more than +/- 10 and thus would never be able to change the final decision on messages outside of that range. I've set the rule priorities to run this rule last.

* I intend to train CRM only with messages that users supply as either false positives or false negatives. This contrasts with the standard learning system, which auto-learns from everything. (I know I can disable auto-learn, but I want CRM to work on a _different_ problem than the existing rules.)

* I still have to figure out how to actually do that training. Training on the original messages would use a different data set than the "rendered" text it's classifying. What I need is a method to have SpamAssassin render a message and dump its output rather than running rules on it.

I'd appreciate any comments people have.
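The score-range reasoning in the first note can be checked with a little arithmetic. The sketch below is not taken from the attached plugin (which is Perl); it is a Python illustration that assumes a spam threshold of 5.0 (SpamAssassin's default required_score) and a CRM114 rule weight of magnitude at most 10. The function names are hypothetical.

```python
from typing import Tuple

# Illustrative only: this is NOT the logic of the attached CRM114.pm.
# Assumptions: spam threshold of 5.0, CRM114 rule weight within +/- 10.


def decisive_range(threshold: float = 5.0,
                   max_weight: float = 10.0) -> Tuple[float, float]:
    """Score interval in which a rule worth at most +/- max_weight
    could still move a message across the spam threshold."""
    return (threshold - max_weight, threshold + max_weight)


def should_run_crm114(current_score: float,
                      threshold: float = 5.0,
                      max_weight: float = 10.0) -> bool:
    """True if the classifier's verdict could still change the outcome."""
    low, high = decisive_range(threshold, max_weight)
    return low <= current_score <= high
```

Under these assumptions the interval works out to -5 through 15, which matches the range quoted in the note. With a different threshold or a different maximum weight the range would shift accordingly.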
I've placed the plugin code in the public domain. The CRM filter file did not have a copyright notice on the original; since it was an example, I suspect it's also public domain but can't say for sure. I am sure, however, that anybody with some CRM knowledge could write a better classifier than what I present here.
Created attachment 3484: CRM114 interface plugin
Created attachment 3485: CRM114 classification filter
Created attachment 3486: CRM114 classification configuration
Created attachment 3487: CRM114 rules scores, flags, etc.
I've been using CRM114 with SpamAssassin for a while now (though in a simpler setup, with CRM114 adding a mail header, which SA then matches against). It is working very well; much better than SA's default internal Bayesian engine. In my experience, it's best to train only on the From and Subject headers and the body. I'm also auto-training on messages that have low (<0) or very high (>12) scores (note that CRM114 will only learn messages that it has misclassified).
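The training policy described in this comment (reduce each message to its From and Subject headers plus the body, auto-train only on very low or very high scores, and learn only when the classifier got it wrong) can be sketched roughly as follows. This is a Python illustration, not the commenter's actual setup: the function names and the `classify` callback are hypothetical, and treating the thresholds as strict inequalities is an assumption. Only the values 0 and 12 come from the comment.

```python
from email import message_from_string
from typing import Callable, Optional


def training_text(raw_message: str) -> str:
    """Keep only the From and Subject headers and the body."""
    msg = message_from_string(raw_message)
    body = msg.get_payload()
    if not isinstance(body, str):
        # Multipart message: join the payloads of its text parts.
        body = "\n".join(
            part.get_payload()
            for part in msg.walk()
            if not part.is_multipart()
            and part.get_content_maintype() == "text"
        )
    return "From: {}\nSubject: {}\n\n{}".format(
        msg.get("From", ""), msg.get("Subject", ""), body)


def auto_train_label(sa_score: float,
                     low: float = 0.0,
                     high: float = 12.0) -> Optional[str]:
    """Label for auto-training, or None if the score is not extreme enough."""
    if sa_score < low:
        return "ham"
    if sa_score > high:
        return "spam"
    return None


def label_to_learn(sa_score: float,
                   raw_message: str,
                   classify: Callable[[str], str]) -> Optional[str]:
    """Train-on-error: return a label only when the classifier
    (hypothetical callback returning "spam" or "ham") disagrees with it."""
    label = auto_train_label(sa_score)
    if label is None:
        return None
    if classify(training_text(raw_message)) == label:
        return None  # already classified correctly, nothing to learn
    return label
```

In a real deployment the `classify` callback would wrap a call to CRM114; how that call is made is left out here because it depends on the local installation.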
Interesting. Your best bet is to post this to the wiki: http://wiki.apache.org/spamassassin/CustomPlugins and also maybe an announcement on the users list. In the future I would like to pluginize Bayes as a whole; that would allow you to write your plugin at a much lower level with more control, and then replace the existing implementation if you believe yours is better.
bug 5293 ("pluginize Bayes") is now about to be applied to 3.3.0 -- so it's now possible to replace our default Bayes implementation with other classifiers entirely, at a low level, as Michael mentioned.
If you are interested in CRM114 you might also try my CRM114 plugin, which I wrote in 2007 without knowing this bugzilla entry. It is available on the Spamassassin-Wiki (http://wiki.apache.org/spamassassin/CustomPlugins) and at http://mschuette.name/wp/crm114-spamassassin-plugin/.