Bug 4156 - mass-check related script to find meta rules for low scoring spam
Status: NEW
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Masses
Version: SVN Trunk (Latest Devel Version)
Hardware: Other other
Importance: P5 enhancement
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords: triage
Depends on:
Blocks:
 
Reported: 2005-02-25 06:11 UTC by Peter Fritz
Modified: 2007-01-14 04:47 UTC



Attachments (description | type | status | submitter / CLA status):
find-meta-rules.pl script | text/plain | None | Peter Fritz [HasCLA]
90_meta_spam.cf - sample meta rules file | text/plain | None | Peter Fritz [HasCLA]
90_meta_spam.cf - sample rules file from bzoetekouw corpus | text/plain | None | Bas Zoetekouw [NoCLA]

Description Peter Fritz 2005-02-25 06:11:54 UTC
Attached is a short script that parses the mass-check ham and spam logs looking
for meta rules that would assist in catching low scoring spam.  These meta rules
can then be scored manually or via perceptron to assist in pushing low scoring
spam over the threshold.  Hopefully it will assist in environments not using
Bayes, or in catching spam that would get a low/neutral Bayes score or only hit
one or two network tests.  It also relates to the recent wiki page, WhyUseRules.
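
For illustration only, here is a rough Python sketch of the kind of log parsing
described above.  It is not the attached Perl script.  It assumes each
mass-check log line looks like "<Y|.> <score> <path> <comma-separated rules>
<extra>", and it treats spam scoring below an assumed threshold of 5.0 as "low
scoring".  The function names, the threshold and the sample rule names are all
made up for the example.

```python
from collections import Counter
from itertools import combinations

def parse_log(lines):
    """Yield (score, rules) per result line; skip blank, comment and short lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        score = float(fields[1])
        rules = frozenset(r for r in fields[3].split(",") if r)
        yield score, rules

def candidate_metas(spam_lines, ham_lines, threshold=5.0, size=2):
    """Count rule combinations seen in low-scoring spam and never in ham."""
    ham_sets = [rules for _, rules in parse_log(ham_lines)]
    counts = Counter()
    for score, rules in parse_log(spam_lines):
        if score >= threshold:
            continue  # already over the (assumed) threshold
        for combo in combinations(sorted(rules), size):
            counts[combo] += 1
    # Keep only combinations that no ham message hits in full.
    return {
        combo: n for combo, n in counts.items()
        if not any(set(combo) <= ham for ham in ham_sets)
    }

# Toy log lines with hypothetical rule names.
spam_log = [
    ". 3 /spam/1 FROM_LOWER,HTML_ONLY,NET_HIT time=1",
    ". 4 /spam/2 FROM_LOWER,HTML_ONLY time=2",
    "Y 9 /spam/3 FROM_LOWER,NET_HIT time=3",
]
ham_log = [
    ". 1 /ham/1 FROM_LOWER time=4",
    ". 0 /ham/2 HTML_ONLY,NET_HIT time=5",
]
metas = candidate_metas(spam_log, ham_log)
for combo, n in sorted(metas.items()):
    print(" && ".join(combo), n)
```

Each surviving combination is a candidate meta rule; the count gives a rough
idea of how many low-scoring spam messages it would touch.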

Note: I'd recommend using mass-check and perceptron to gain tuned scores for
your corpus/environment before adding more meta rules into the mix blindly.

Example usage (assuming a base of spamassassin/masses):

#  Generate logs from mass-check (be sure to make net checks reusable first)
./mass-check --after='21 days' --reuse --progress ham:dir:./ham spam:dir:./spam
#  Parse logs, rewriting new versions (see usage for other options)
./find-meta-rules.pl --rewrite

#  Verify run was sane
head 90_meta_spam.cf ; head 90_meta_ham.cf
wc -l ham.log* spam.log*
#  Make newly generated rules and logs available
cp ham.log ham.log.mass-check ; cp spam.log spam.log.mass-check
cp ham.log.meta ham.log ; cp spam.log.meta spam.log
cp 90_meta_spam.cf 90_meta_ham.cf ../rules

#  Generate freqs and assign scores
./parse-rules-for-masses
make ; ./perceptron
grep "T_META_" freqs perceptron.scores

I realise this is probably not a new idea, noting that hit-frequencies already
includes an option to output overlapping rules.  Part of this implementation is
specifically targeted at determining which rule combinations or signatures
would assist in pushing under-the-radar spam over the line (if it's already
spam, I'm happy for it to get more points too).  It potentially also allows you
to write more rules that may hit ham on their own but, as part of a meta rule,
hit only spam.  E.g. a rule for "From name is all lower case" hits ham and spam,
but when combined with, say, a network check and an HTML-only message, it may
hit only spam.  Hopefully the perceptron would give a low score to the
individual ham-hitting rule, and a higher score to the meta rule.

To do:
- Check performance (caching) against a larger corpus (>10k messages)
- Check rule/score suggestions against a larger corpus
- Handle ham signatures better (i.e. make nice meta rules)
- Perhaps improve naming of rules (i.e. include date to help keep track)
- Provide better documentation and sanity checks if script proves useful

Please let me know if you have feedback on the idea or suggest other
implementations.  So far it's proved useful in my environment.
Comment 1 Peter Fritz 2005-02-25 06:14:38 UTC
Created attachment 2667
find-meta-rules.pl script

perltidy-ised version of find-meta-rules.pl as discussed in previous post.
Comment 2 Peter Fritz 2005-02-25 06:22:32 UTC
Created attachment 2668
90_meta_spam.cf - sample meta rules file

Example frequencies and scores associated with the rules in my small corpus:

freqs: 13.197  14.6441   0.0000    1.000   0.95    0.01  T_META_SPAM_180
freqs:  7.470   8.2895   0.0000    1.000   0.90    0.01  T_META_SPAM_104
freqs:  6.243   6.9279   0.0000    1.000   0.86    0.01  T_META_SPAM_181
freqs:  6.114   6.7845   0.0000    1.000   0.86    0.01  T_META_SPAM_182
freqs:  3.531   3.9178   0.0000    1.000   0.77    0.01  T_META_SPAM_101
freqs:  0.883   0.9795   0.0000    1.000   0.54    0.01  T_META_SPAM_103
freqs:  0.409   0.4539   0.0000    1.000   0.47    0.01  T_META_SPAM_102
freqs:  0.388   0.4300   0.0000    1.000   0.47    0.01  T_META_SPAM_107
freqs:  0.280   0.3106   0.0000    1.000   0.46    0.01  T_META_SPAM_108
freqs:  0.194   0.2150   0.0000    1.000   0.44    0.01  T_META_SPAM_105
freqs:  0.151   0.1672   0.0000    1.000   0.44    0.01  T_META_SPAM_100
freqs:  0.086   0.0956   0.0000    1.000   0.43    0.01  T_META_SPAM_109
freqs:  0.043   0.0478   0.0000    1.000   0.42    0.01  T_META_SPAM_106
perceptron.scores:score T_META_SPAM_100  0.205 # [0.000..2.000]
perceptron.scores:score T_META_SPAM_101  1.510 # [0.000..3.500]
perceptron.scores:score T_META_SPAM_102  1.183 # [0.000..2.100]
perceptron.scores:score T_META_SPAM_103  0.031 # [0.000..2.400]
perceptron.scores:score T_META_SPAM_104  0.132 # [0.000..4.000]
perceptron.scores:score T_META_SPAM_105  1.027 # [0.000..2.000]
perceptron.scores:score T_META_SPAM_106  1.552 # [0.000..1.900]
perceptron.scores:score T_META_SPAM_107  0.999 # [0.000..2.100]
perceptron.scores:score T_META_SPAM_108  1.120 # [0.000..2.100]
perceptron.scores:score T_META_SPAM_109  1.116 # [0.000..1.900]
perceptron.scores:score T_META_SPAM_180  0.917 # [0.000..4.300]
perceptron.scores:score T_META_SPAM_181  0.046 # [0.000..3.900]
perceptron.scores:score T_META_SPAM_182  0.051 # [0.000..3.900]
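
To read the two listings side by side, the grep output can be joined on rule
name.  The Python sketch below is one way to do it.  It assumes the freqs
columns are overall%, spam%, ham%, S/O, rank, score and name, in that order,
and the helper name is made up.  The four sample lines are copied from the
output above.

```python
def join_grep_output(lines):
    """Map rule name -> dict with spam%, ham% and perceptron score."""
    table = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "freqs:" and len(fields) == 8:
            # Assumed column order: overall%, spam%, ham%, S/O, rank, score, name
            entry = table.setdefault(fields[7], {})
            entry["spam_pct"] = float(fields[2])
            entry["ham_pct"] = float(fields[3])
        elif fields[0] == "perceptron.scores:score" and len(fields) >= 3:
            table.setdefault(fields[1], {})["score"] = float(fields[2])
    return table

sample = [
    "freqs: 13.197  14.6441   0.0000    1.000   0.95    0.01  T_META_SPAM_180",
    "freqs:  3.531   3.9178   0.0000    1.000   0.77    0.01  T_META_SPAM_101",
    "perceptron.scores:score T_META_SPAM_101  1.510 # [0.000..3.500]",
    "perceptron.scores:score T_META_SPAM_180  0.917 # [0.000..4.300]",
]
table = join_grep_output(sample)
for name, entry in sorted(table.items()):
    print(name, entry["spam_pct"], entry["ham_pct"], entry["score"])
```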
Comment 3 Bob Menschel 2005-04-28 23:24:15 UTC
Sounds like a really good idea.  
1) Peter, can you submit a CLA to Apache (or let us know if you already have)? 
2) Provisionally setting target milestone to 3.2, though I certainly won't
complain if it's worked on sooner. 

We may also play with this within SARE, with your permission...
Comment 4 Peter Fritz 2005-05-06 04:52:29 UTC
Sorry for the delay, bugzilla slipped under the radar for a bit.

Will pursue submitting a CLA and advise.  In the meantime, you're welcome to
play with it in SARE (anything that can improve spam detection more widely is
fine with me!)

Current thread on the dev mailing list regarding boosting (May 4, 2005) may also
be related.
Comment 5 Justin Mason 2005-06-08 16:24:20 UTC
btw, Peter's CLA is noted as received, so this is applicable -- anyone
interested in trying it out?  Henry, perhaps? ;)
Comment 6 Bas Zoetekouw 2005-06-09 02:26:37 UTC
OK, I ran it against the results of yesterday's daily check (no net) for my own
corpus (15k ham, 85k spam).  The script is quite fast (a few seconds or so on my
Athlon 3000+), and it seems to come up with some really good rules:

freqs: 22.820  26.9383   0.0000    1.000   1.00    0.01  T_META_SPAM_135
freqs: 12.886  15.2117   0.0000    1.000   0.99    0.01  T_META_SPAM_061
freqs: 11.617  13.7127   0.0000    1.000   0.99    0.01  T_META_SPAM_059
freqs:  9.912  11.7004   0.0000    1.000   0.98    0.01  T_META_SPAM_117
freqs:  8.064   9.5186   0.0000    1.000   0.97    0.01  T_META_SPAM_005
freqs: 11.258  13.2882   0.0063    1.000   0.97    0.01  T_META_SPAM_063
freqs:  6.692   7.9001   0.0000    1.000   0.96    0.01  T_META_SPAM_185
freqs:  6.516   7.6918   0.0000    1.000   0.96    0.01  T_META_SPAM_159
freqs:  6.418   7.5757   0.0000    1.000   0.96    0.01  T_META_SPAM_036
freqs:  4.598   5.4280   0.0000    1.000   0.94    0.01  T_META_SPAM_076
freqs:  6.551   7.7305   0.0126    0.998   0.93    0.01  T_META_SPAM_040
freqs:  7.815   9.2215   0.0189    0.998   0.93    0.01  T_META_SPAM_026
freqs:  4.099   4.8372   0.0063    0.999   0.92    0.01  T_META_SPAM_173
freqs:  3.079   3.6342   0.0000    1.000   0.92    0.01  T_META_SPAM_004
freqs:  2.943   3.4737   0.0000    1.000   0.91    0.01  T_META_SPAM_086
freqs:  3.177   3.7491   0.0063    0.998   0.91    0.01  T_META_SPAM_120
freqs:  3.991   4.7086   0.0126    0.997   0.91    0.01  T_META_SPAM_014
freqs:  2.616   3.0879   0.0000    1.000   0.90    0.01  T_META_SPAM_129
freqs:  3.372   3.9768   0.0189    0.995   0.89    0.01  T_META_SPAM_039
freqs:  2.199   2.5962   0.0000    1.000   0.88    0.01  T_META_SPAM_196
freqs:  2.665   3.1436   0.0126    0.996   0.88    0.01  T_META_SPAM_007
freqs:  1.928   2.2763   0.0000    1.000   0.88    0.01  T_META_SPAM_177
freqs:  2.464   2.9058   0.0126    0.996   0.87    0.01  T_META_SPAM_054
freqs:  1.757   2.0738   0.0000    1.000   0.86    0.01  T_META_SPAM_038
freqs:  1.633   1.9281   0.0000    1.000   0.86    0.01  T_META_SPAM_118
freqs:  1.621   1.9133   0.0000    1.000   0.85    0.01  T_META_SPAM_090
freqs:  1.571   1.8541   0.0000    1.000   0.85    0.01  T_META_SPAM_064
freqs:  1.932   2.2786   0.0126    0.994   0.85    0.01  T_META_SPAM_023
freqs:  1.563   1.8450   0.0000    1.000   0.85    0.01  T_META_SPAM_194
freqs:  3.014   3.5522   0.0315    0.991   0.85    0.01  T_META_SPAM_010
freqs:  1.455   1.7175   0.0000    1.000   0.84    0.01  T_META_SPAM_049
freqs:  1.514   1.7858   0.0063    0.996   0.83    0.01  T_META_SPAM_008
freqs:  1.835   2.1625   0.0189    0.991   0.83    0.01  T_META_SPAM_016
freqs:  3.128   3.6843   0.0442    0.988   0.82    0.01  T_META_SPAM_022
freqs:  1.228   1.4500   0.0000    1.000   0.82    0.01  T_META_SPAM_141
freqs:  1.219   1.4387   0.0000    1.000   0.82    0.01  T_META_SPAM_017
freqs:  1.217   1.4364   0.0000    1.000   0.82    0.01  T_META_SPAM_149
freqs:  1.171   1.3817   0.0000    1.000   0.81    0.01  T_META_SPAM_102
freqs:  1.165   1.3749   0.0000    1.000   0.81    0.01  T_META_SPAM_003
freqs:  1.151   1.3590   0.0000    1.000   0.80    0.01  T_META_SPAM_121
freqs:  1.116   1.3169   0.0000    1.000   0.80    0.01  T_META_SPAM_048
freqs:  1.468   1.7300   0.0189    0.989   0.80    0.01  T_META_SPAM_052
freqs:  1.832   2.1568   0.0315    0.986   0.80    0.01  T_META_SPAM_045
freqs:  1.033   1.2190   0.0000    1.000   0.80    0.01  T_META_SPAM_166
freqs:  0.970   1.1450   0.0000    1.000   0.79    0.01  T_META_SPAM_066
freqs:  0.961   1.1348   0.0000    1.000   0.79    0.01  T_META_SPAM_073
freqs:  2.733   3.2165   0.0568    0.983   0.78    0.01  T_META_SPAM_013
freqs:  0.870   1.0266   0.0000    1.000   0.78    0.01  T_META_SPAM_142
freqs:  0.865   1.0209   0.0000    1.000   0.78    0.01  T_META_SPAM_096
freqs:  1.356   1.5957   0.0252    0.984   0.78    0.01  T_META_SPAM_002
freqs:  0.902   1.0631   0.0063    0.994   0.77    0.01  T_META_SPAM_160
freqs:  0.751   0.8866   0.0000    1.000   0.77    0.01  T_META_SPAM_186
freqs:  0.810   0.9549   0.0063    0.993   0.76    0.01  T_META_SPAM_025
freqs:  0.871   1.0255   0.0126    0.988   0.75    0.01  T_META_SPAM_193
freqs:  0.653   0.7705   0.0000    1.000   0.75    0.01  T_META_SPAM_169
freqs:  0.904   1.0642   0.0189    0.983   0.74    0.01  T_META_SPAM_180
freqs:  0.595   0.7023   0.0000    1.000   0.74    0.01  T_META_SPAM_055
freqs:  0.587   0.6931   0.0000    1.000   0.74    0.01  T_META_SPAM_097
freqs:  0.559   0.6601   0.0000    1.000   0.73    0.01  T_META_SPAM_172
freqs:  0.551   0.6499   0.0000    1.000   0.73    0.01  T_META_SPAM_015
freqs:  0.741   0.8718   0.0189    0.979   0.72    0.01  T_META_SPAM_012
freqs:  0.532   0.6283   0.0000    1.000   0.72    0.01  T_META_SPAM_020
freqs:  0.738   0.8673   0.0189    0.979   0.72    0.01  T_META_SPAM_027
freqs:  0.506   0.5975   0.0000    1.000   0.72    0.01  T_META_SPAM_083
freqs:  0.495   0.5839   0.0000    1.000   0.72    0.01  T_META_SPAM_192
freqs:  0.482   0.5691   0.0000    1.000   0.71    0.01  T_META_SPAM_144
freqs:  0.534   0.6294   0.0063    0.990   0.71    0.01  T_META_SPAM_042
freqs:  0.477   0.5634   0.0000    1.000   0.71    0.01  T_META_SPAM_171
freqs:  0.455   0.5372   0.0000    1.000   0.71    0.01  T_META_SPAM_095
freqs:  0.454   0.5361   0.0000    1.000   0.71    0.01  T_META_SPAM_047
freqs:  0.485   0.5714   0.0063    0.989   0.70    0.01  T_META_SPAM_037
freqs:  0.435   0.5133   0.0000    1.000   0.70    0.01  T_META_SPAM_109
freqs:  0.751   0.8809   0.0315    0.965   0.70    0.01  T_META_SPAM_011
freqs:  0.430   0.5076   0.0000    1.000   0.69    0.01  T_META_SPAM_098
freqs:  0.737   0.8639   0.0315    0.965   0.69    0.01  T_META_SPAM_062
freqs:  0.411   0.4849   0.0000    1.000   0.69    0.01  T_META_SPAM_195
freqs:  0.407   0.4803   0.0000    1.000   0.69    0.01  T_META_SPAM_174
freqs:  0.402   0.4746   0.0000    1.000   0.68    0.01  T_META_SPAM_155
freqs:  0.481   0.5657   0.0126    0.978   0.68    0.01  T_META_SPAM_150
freqs:  0.398   0.4701   0.0000    1.000   0.68    0.01  T_META_SPAM_190
freqs:  0.372   0.4393   0.0000    1.000   0.67    0.01  T_META_SPAM_189
freqs:  0.369   0.4359   0.0000    1.000   0.67    0.01  T_META_SPAM_140
freqs:  0.405   0.4769   0.0063    0.987   0.67    0.01  T_META_SPAM_099
freqs:  0.350   0.4132   0.0000    1.000   0.66    0.01  T_META_SPAM_162
freqs:  0.337   0.3972   0.0000    1.000   0.66    0.01  T_META_SPAM_146
freqs:  0.336   0.3961   0.0000    1.000   0.66    0.01  T_META_SPAM_198
freqs:  0.334   0.3938   0.0000    1.000   0.66    0.01  T_META_SPAM_125
freqs:  0.333   0.3927   0.0000    1.000   0.66    0.01  T_META_SPAM_068
freqs:  0.327   0.3858   0.0000    1.000   0.65    0.01  T_META_SPAM_106
freqs:  0.325   0.3836   0.0000    1.000   0.65    0.01  T_META_SPAM_024
freqs:  0.309   0.3642   0.0000    1.000   0.65    0.01  T_META_SPAM_071
freqs:  0.281   0.3312   0.0000    1.000   0.64    0.01  T_META_SPAM_101
freqs:  0.279   0.3289   0.0000    1.000   0.64    0.01  T_META_SPAM_200
freqs:  0.278   0.3278   0.0000    1.000   0.64    0.01  T_META_SPAM_119
freqs:  0.276   0.3255   0.0000    1.000   0.64    0.01  T_META_SPAM_157
freqs:  0.270   0.3187   0.0000    1.000   0.63    0.01  T_META_SPAM_158
freqs:  0.333   0.3904   0.0126    0.969   0.63    0.01  T_META_SPAM_018
freqs:  0.235   0.2777   0.0000    1.000   0.62    0.01  T_META_SPAM_105
freqs:  0.234   0.2766   0.0000    1.000   0.62    0.01  T_META_SPAM_092
freqs:  0.226   0.2663   0.0000    1.000   0.62    0.01  T_META_SPAM_091
freqs:  0.225   0.2652   0.0000    1.000   0.61    0.01  T_META_SPAM_085
freqs:  0.221   0.2606   0.0000    1.000   0.61    0.01  T_META_SPAM_078
freqs:  0.220   0.2595   0.0000    1.000   0.61    0.01  T_META_SPAM_001
freqs:  0.218   0.2572   0.0000    1.000   0.61    0.01  T_META_SPAM_080
freqs:  0.217   0.2561   0.0000    1.000   0.61    0.01  T_META_SPAM_034
freqs:  0.215   0.2538   0.0000    1.000   0.61    0.01  T_META_SPAM_147
freqs:  0.231   0.2720   0.0063    0.977   0.60    0.01  T_META_SPAM_051
freqs:  0.230   0.2709   0.0063    0.977   0.60    0.01  T_META_SPAM_065
freqs:  0.203   0.2402   0.0000    1.000   0.60    0.01  T_META_SPAM_046
freqs:  0.195   0.2299   0.0000    1.000   0.60    0.01  T_META_SPAM_176
freqs:  0.193   0.2276   0.0000    1.000   0.60    0.01  T_META_SPAM_167
freqs:  0.191   0.2254   0.0000    1.000   0.60    0.01  T_META_SPAM_127
freqs:  0.188   0.2219   0.0000    1.000   0.60    0.01  T_META_SPAM_041
freqs:  0.231   0.2709   0.0126    0.956   0.59    0.01  T_META_SPAM_009
freqs:  0.175   0.2060   0.0000    1.000   0.59    0.01  T_META_SPAM_182
freqs:  0.173   0.2037   0.0000    1.000   0.59    0.01  T_META_SPAM_201
freqs:  0.149   0.1764   0.0000    1.000   0.58    0.01  T_META_SPAM_161
freqs:  0.148   0.1741   0.0000    1.000   0.58    0.01  T_META_SPAM_058
freqs:  0.147   0.1730   0.0000    1.000   0.58    0.01  T_META_SPAM_110
freqs:  0.145   0.1707   0.0000    1.000   0.57    0.01  T_META_SPAM_139
freqs:  0.140   0.1650   0.0000    1.000   0.57    0.01  T_META_SPAM_130
freqs:  0.172   0.2015   0.0063    0.970   0.57    0.01  T_META_SPAM_028
freqs:  0.135   0.1593   0.0000    1.000   0.57    0.01  T_META_SPAM_033
freqs:  0.132   0.1559   0.0000    1.000   0.57    0.01  T_META_SPAM_143
freqs:  0.310   0.3597   0.0378    0.905   0.57    0.01  T_META_SPAM_050
freqs:  0.130   0.1537   0.0000    1.000   0.57    0.01  T_META_SPAM_094
freqs:  0.127   0.1502   0.0000    1.000   0.56    0.01  T_META_SPAM_154
freqs:  0.122   0.1445   0.0000    1.000   0.56    0.01  T_META_SPAM_175
freqs:  0.122   0.1445   0.0000    1.000   0.56    0.01  T_META_SPAM_131
freqs:  0.121   0.1434   0.0000    1.000   0.56    0.01  T_META_SPAM_006
freqs:  0.120   0.1411   0.0000    1.000   0.56    0.01  T_META_SPAM_084
freqs:  0.117   0.1377   0.0000    1.000   0.56    0.01  T_META_SPAM_070
freqs:  0.197   0.2288   0.0189    0.924   0.56    0.01  T_META_SPAM_108
freqs:  0.114   0.1343   0.0000    1.000   0.56    0.01  T_META_SPAM_151
freqs:  0.114   0.1343   0.0000    1.000   0.56    0.01  T_META_SPAM_056
freqs:  0.111   0.1309   0.0000    1.000   0.56    0.01  T_META_SPAM_104
freqs:  0.194   0.2254   0.0189    0.923   0.56    0.01  T_META_SPAM_021
freqs:  0.104   0.1229   0.0000    1.000   0.55    0.01  T_META_SPAM_156
freqs:  0.102   0.1206   0.0000    1.000   0.55    0.01  T_META_SPAM_191
freqs:  0.101   0.1195   0.0000    1.000   0.55    0.01  T_META_SPAM_145
freqs:  0.099   0.1172   0.0000    1.000   0.55    0.01  T_META_SPAM_128
freqs:  0.095   0.1127   0.0000    1.000   0.54    0.01  T_META_SPAM_043
freqs:  0.095   0.1127   0.0000    1.000   0.54    0.01  T_META_SPAM_087
freqs:  0.094   0.1115   0.0000    1.000   0.54    0.01  T_META_SPAM_134
freqs:  0.092   0.1081   0.0000    1.000   0.54    0.01  T_META_SPAM_079
freqs:  0.137   0.1593   0.0126    0.927   0.54    0.01  T_META_SPAM_035
freqs:  0.091   0.1070   0.0000    1.000   0.54    0.01  T_META_SPAM_111
freqs:  0.088   0.1036   0.0000    1.000   0.54    0.01  T_META_SPAM_116
freqs:  0.087   0.1024   0.0000    1.000   0.54    0.01  T_META_SPAM_122
freqs:  0.087   0.1024   0.0000    1.000   0.54    0.01  T_META_SPAM_067
freqs:  0.085   0.1002   0.0000    1.000   0.54    0.01  T_META_SPAM_107
freqs:  0.084   0.0990   0.0000    1.000   0.53    0.01  T_META_SPAM_181
freqs:  0.084   0.0990   0.0000    1.000   0.53    0.01  T_META_SPAM_133
freqs:  0.079   0.0933   0.0000    1.000   0.53    0.01  T_META_SPAM_187
freqs:  0.117   0.1354   0.0126    0.915   0.53    0.01  T_META_SPAM_044
freqs:  0.116   0.1343   0.0126    0.914   0.53    0.01  T_META_SPAM_077
freqs:  0.077   0.0911   0.0000    1.000   0.53    0.01  T_META_SPAM_030
freqs:  0.076   0.0899   0.0000    1.000   0.53    0.01  T_META_SPAM_188
freqs:  0.075   0.0888   0.0000    1.000   0.53    0.01  T_META_SPAM_031
freqs:  0.073   0.0865   0.0000    1.000   0.53    0.01  T_META_SPAM_075
freqs:  0.072   0.0854   0.0000    1.000   0.52    0.01  T_META_SPAM_115
freqs:  0.070   0.0831   0.0000    1.000   0.52    0.01  T_META_SPAM_126
freqs:  0.068   0.0808   0.0000    1.000   0.52    0.01  T_META_SPAM_168
freqs:  0.067   0.0785   0.0000    1.000   0.52    0.01  T_META_SPAM_132
freqs:  0.067   0.0785   0.0000    1.000   0.52    0.01  T_META_SPAM_136
freqs:  0.063   0.0740   0.0000    1.000   0.52    0.01  T_META_SPAM_152
freqs:  0.062   0.0728   0.0000    1.000   0.52    0.01  T_META_SPAM_179
freqs:  0.062   0.0728   0.0000    1.000   0.52    0.01  T_META_SPAM_093
freqs:  0.058   0.0683   0.0000    1.000   0.51    0.01  T_META_SPAM_164
freqs:  0.055   0.0649   0.0000    1.000   0.51    0.01  T_META_SPAM_184
freqs:  0.055   0.0649   0.0000    1.000   0.51    0.01  T_META_SPAM_183
freqs:  0.055   0.0649   0.0000    1.000   0.51    0.01  T_META_SPAM_178
freqs:  0.054   0.0637   0.0000    1.000   0.51    0.01  T_META_SPAM_123
freqs:  0.048   0.0569   0.0000    1.000   0.50    0.01  T_META_SPAM_074
freqs:  0.059   0.0683   0.0063    0.915   0.50    0.01  T_META_SPAM_082
freqs:  0.040   0.0478   0.0000    1.000   0.50    0.01  T_META_SPAM_124
freqs:  0.039   0.0455   0.0000    1.000   0.50    0.01  T_META_SPAM_163
freqs:  0.037   0.0433   0.0000    1.000   0.50    0.01  T_META_SPAM_088
freqs:  0.034   0.0398   0.0000    1.000   0.49    0.01  T_META_SPAM_032
freqs:  0.033   0.0387   0.0000    1.000   0.49    0.01  T_META_SPAM_112
freqs:  0.033   0.0387   0.0000    1.000   0.49    0.01  T_META_SPAM_103
freqs:  0.028   0.0330   0.0000    1.000   0.49    0.01  T_META_SPAM_072
freqs:  0.028   0.0330   0.0000    1.000   0.49    0.01  T_META_SPAM_089
freqs:  0.023   0.0273   0.0000    1.000   0.48    0.01  T_META_SPAM_081
freqs:  0.023   0.0273   0.0000    1.000   0.48    0.01  T_META_SPAM_029
freqs:  0.023   0.0273   0.0000    1.000   0.48    0.01  T_META_SPAM_019
freqs:  0.020   0.0239   0.0000    1.000   0.48    0.01  T_META_SPAM_113
freqs:  0.018   0.0216   0.0000    1.000   0.48    0.01  T_META_SPAM_165
freqs:  0.017   0.0205   0.0000    1.000   0.48    0.01  T_META_SPAM_069
freqs:  0.016   0.0193   0.0000    1.000   0.48    0.01  T_META_SPAM_137
freqs:  0.014   0.0171   0.0000    1.000   0.47    0.01  T_META_SPAM_053
freqs:  0.009   0.0102   0.0000    1.000   0.47    0.01  T_META_SPAM_153
freqs:  0.007   0.0080   0.0000    1.000   0.47    0.01  T_META_SPAM_148
freqs:  0.019   0.0216   0.0063    0.774   0.46    0.01  T_META_SPAM_100
freqs:  0.005   0.0057   0.0000    1.000   0.46    0.01  T_META_SPAM_057
freqs:  0.003   0.0034   0.0000    1.000   0.46    0.01  T_META_SPAM_170
freqs:  0.002   0.0023   0.0000    1.000   0.46    0.01  T_META_SPAM_138
freqs:  0.002   0.0023   0.0000    1.000   0.46    0.01  T_META_SPAM_197
freqs:  0.002   0.0023   0.0000    1.000   0.46    0.01  T_META_SPAM_114
freqs:  0.001   0.0011   0.0000    1.000   0.46    0.01  T_META_SPAM_199
freqs:  0.001   0.0011   0.0000    1.000   0.46    0.01  T_META_SPAM_060
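
As a reading aid for the S/O column: the listed values appear consistent with
spam% / (spam% + ham%), assuming the second and third numeric columns are spam%
and ham%.  The Python sketch below checks that assumption against three rows
copied from the table above; the function name is made up.

```python
def spam_over_overall(spam_pct, ham_pct):
    """S/O taken as spam% / (spam% + ham%); None if the rule never hit."""
    total = spam_pct + ham_pct
    return spam_pct / total if total else None

# (spam%, ham%, S/O as listed) for three rules from the table above
rows = {
    "T_META_SPAM_063": (13.2882, 0.0063, 1.000),
    "T_META_SPAM_050": (0.3597, 0.0378, 0.905),
    "T_META_SPAM_100": (0.0216, 0.0063, 0.774),
}
for name, (spam_pct, ham_pct, listed) in rows.items():
    computed = spam_over_overall(spam_pct, ham_pct)
    print(f"{name}: computed {computed:.3f}, listed {listed:.3f}")
```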
Comment 7 Bas Zoetekouw 2005-06-09 06:24:29 UTC
Created attachment 2933
90_meta_spam.cf - sample rules file from bzoetekouw corpus

added the rules list that corresponds to the frequencies posted above
Comment 8 Justin Mason 2005-06-09 09:31:43 UTC
btw I know Henry has added some code to trunk under masses/evolve_metarule which
does something similar with a GA -- this is why I'd like him to take a look at
this.  however I suspect he may be about to embark on a week's holiday tomorrow ;)

one question is, how much do these rules overlap with one another?  SVN trunk's
"hit-frequencies" supports a "-o" switch to compute this.
Comment 9 Peter Fritz 2007-01-14 04:47:52 UTC
See also bug #2427.  Reference was made to Henry's masses code there too.  I
think there is still value in having a meta-rule generator, particularly for
catching low scoring spam (which find-meta-rules.pl seeks to address).  It would
be good to see how this approach performs against current corpus data.