SA Bugzilla – Bug 5559
Extend seek-phrases-in-corpus to automagically spit out a rules file
Last modified: 2007-08-23 05:58:02 UTC
It would be of great help for the SARE folks if "seek-phrases-in-corpus" could be extended to have an option to spit out a normal rules file with a score of 1 instead of the following format:

  1.000 8.633 0.000 /pattern/, /pattern2/, /pattern3/

Thanks,
Alex
seek-phrases-in-log now takes a --rules switch, which generates rules as output: e.g.:

  RATIO SPAM% HAM% DATA
  # 1.000 34.194 0.000
  body SEEK_QRKYNC /! UP 37\.5\% Shandong Zhouyuan Seed and Nursery Co\., Ltd \(SZSN\) \$0\.33 UP 37\.5\% /
  # 1.000 7.097 0.000
  body SEEK_EX2OMO / 0rder All of your favorite RxMeDs 0nline! With fast discreet trackable USPS shipping! N0 Prescription Needed! 0rder Now at - /

(the rule names are based on SHA1 of the pattern, so the same pattern will wind up with the same "name".)

this has been doing really well recently; I've been running it on my low-scoring spam to generate a set of test rules, the SEEK_* rules in my sandbox. here's one, for example:

  http://ruleqa.spamassassin.org/20070714-r556246-n/SEEK_ACG/detail

I generate those with this command:

  /home/jm/ftp/sa/trunk/masses/rule-dev/seek-phrases-in-log \
    --ham ~/ftp/spamassassin/masses/big_w.h \
    --rules --reqpatlength 40 --reqhitrate 5 \
    --spam /tmp/findpats.tmp.8631/w.s | head -20

and then I discard a few of them (generally the ones that have been output previously).

/tmp/findpats.tmp.8631/w.s is the "w.s" file output for the last day's low-scoring spam, big_w.h is the ham mass-check log from a load of 2007 ham.

given that *every single rule* it's generated in the past week has hit *no* ham and a good bit of spam that the other rules miss, this is looking very promising ;)
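For illustration, here is a rough Python sketch of the naming idea described above: deriving a rule name from the SHA1 of the pattern so that the same pattern always gets the same name. This is not the code from seek-phrases-in-log. The function names `seek_rule_name` and `format_rule`, the base32 encoding, and the six-character suffix are all assumptions made for the example; the real script may encode and truncate the hash differently.

```python
import base64
import hashlib


def seek_rule_name(pattern: str, length: int = 6) -> str:
    """Derive a stable rule name from the SHA-1 of the pattern text.

    Assumption: the digest is base32-encoded and truncated to `length`
    characters. The actual script may use a different encoding.
    """
    digest = hashlib.sha1(pattern.encode("utf-8")).digest()
    suffix = base64.b32encode(digest).decode("ascii")[:length]
    return "SEEK_" + suffix


def format_rule(ratio: float, spam_pct: float, ham_pct: float, pattern: str) -> str:
    """Render one pattern as a stats comment line followed by a body rule."""
    name = seek_rule_name(pattern)
    return (
        f"# {ratio:.3f} {spam_pct:.3f} {ham_pct:.3f}\n"
        f"body {name} /{pattern}/"
    )


if __name__ == "__main__":
    # Hypothetical input: one already-escaped pattern with its hit rates.
    print(format_rule(1.0, 7.097, 0.0, " N0 Prescription Needed! "))
```

Because the name depends only on the pattern text, re-running the generator on a later day's spam yields the same name for a pattern that was already output, which makes the previously seen rules easy to spot and discard.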
this is already fixed