Bug 5559 - Extend seek-phrases-in-corpus to automagically spit out a rules file
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Masses
Version: unspecified
Hardware: All other
Importance: P1 enhancement
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-07-14 07:29 UTC by AXB
Modified: 2007-08-23 05:58 UTC



Description AXB 2007-07-14 07:29:24 UTC
It would be of great help for the SARE folks if "seek-phrases-in-corpus" could
be extended to have an option to spit out a normal rules file with a score of 1
instead of the
1.000   8.633   0.000  /pattern/, /pattern2/, /pattern3/ format.

Thanks
Alex
Comment 1 Justin Mason 2007-07-14 11:45:50 UTC
seek-phrases-in-log now takes a --rules switch, which generates rules as output,
e.g.:

 RATIO   SPAM%    HAM%   DATA
#  1.000  34.194   0.000
body SEEK_QRKYNC  /! UP 37\.5\% Shandong Zhouyuan Seed and Nursery Co\., Ltd \(SZSN\) \$0\.33 UP 37\.5\% /
#  1.000   7.097   0.000
body SEEK_EX2OMO  / 0rder All of your favorite RxMeDs 0nline! With fast discreet trackable USPS shipping! N0 Prescription Needed! 0rder Now at - /

(the rule names are based on SHA1 of the pattern, so the same pattern will wind
up with the same "name".)
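
As a rough illustration of hash-based naming, here is a minimal Python sketch. The real script is Perl, and the comment above does not spell out how the SHA1 is turned into a name, so the base32 encoding, the 6-character length, and the helper names rule_name and format_rule are all assumptions made for illustration. It will not reproduce the actual SEEK_* names shown above; it only shows why the same pattern always gets the same name.

```python
import base64
import hashlib


def rule_name(pattern: str, prefix: str = "SEEK_", length: int = 6) -> str:
    """Derive a stable rule name from the SHA1 of the pattern text.

    Assumption: base32 is used only because it yields uppercase letters
    and digits; the real script's encoding is not stated in this bug.
    """
    digest = hashlib.sha1(pattern.encode("utf-8")).digest()
    return prefix + base64.b32encode(digest).decode("ascii")[:length]


def format_rule(pattern: str) -> str:
    """Format a body rule line in the same shape as the output above."""
    return f"body {rule_name(pattern)}  /{pattern}/"


if __name__ == "__main__":
    print(format_rule(r"N0 Prescription Needed!"))
```

Because the name depends only on the pattern, re-running the tool over a later corpus yields the same name for a pattern seen before, which makes previously emitted rules easy to spot.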

this has been doing really well recently; I've been running it on my low-scoring
spam to generate a set of test rules, the SEEK_* rules in my sandbox.  here's
one, for example:

http://ruleqa.spamassassin.org/20070714-r556246-n/SEEK_ACG/detail

I generate those with this command:

/home/jm/ftp/sa/trunk/masses/rule-dev/seek-phrases-in-log \
  --ham ~/ftp/spamassassin/masses/big_w.h \
  --rules --reqpatlength 40 --reqhitrate 5 \
  --spam /tmp/findpats.tmp.8631/w.s | head -20

and then I discard a few of them (generally the ones that have been output
previously).  /tmp/findpats.tmp.8631/w.s is the "w.s" file output for the last
day's low-scoring spam; big_w.h is the ham mass-check log from a load of 2007 ham.
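
The manual discard step could be automated along these lines. This is only a sketch, not part of seek-phrases-in-log: it assumes the --rules output consists of a "#  RATIO  SPAM%  HAM%" comment line followed by a single-line body rule (the wrapped rules in the example above look like a line-wrapping artifact), and the names SeekRule, parse_rules, and drop_seen are hypothetical.

```python
import re
from typing import Iterable, Iterator, List, NamedTuple, Set


class SeekRule(NamedTuple):
    ratio: float
    spam_pct: float
    ham_pct: float
    name: str
    line: str


# "#  1.000  34.194   0.000" -> ratio, spam%, ham%
STATS_RE = re.compile(r"^#\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s*$")
# "body SEEK_XXXXXX  /pattern/" -> rule name
RULE_RE = re.compile(r"^body\s+(\S+)\s+/.*/\s*$")


def parse_rules(lines: Iterable[str]) -> Iterator[SeekRule]:
    """Pair each stats comment with the body rule that follows it."""
    stats = None
    for raw in lines:
        line = raw.rstrip("\n")
        m = STATS_RE.match(line)
        if m:
            stats = tuple(float(x) for x in m.groups())
            continue
        m = RULE_RE.match(line)
        if m and stats is not None:
            yield SeekRule(*stats, name=m.group(1), line=line)
            stats = None


def drop_seen(rules: Iterable[SeekRule], seen_names: Set[str]) -> List[SeekRule]:
    """Discard rules whose names were already emitted on an earlier run."""
    return [r for r in rules if r.name not in seen_names]
```

Since rule names are derived from the pattern, comparing names is enough to detect a rule that an earlier run already produced.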

given that *every single rule* it's generated in the past week has hit *no* ham
and a good bit of spam that the other rules miss, this is looking very promising ;)
Comment 2 Justin Mason 2007-08-23 05:58:02 UTC
this is already fixed