|
SA Bugzilla – Full Text Bug Listing |
Summary: | mass-check related script to find meta rules for low scoring spam | ||
---|---|---|---|
Product: | Spamassassin | Reporter: | Peter Fritz <peter> |
Component: | Masses | Assignee: | SpamAssassin Developer Mailing List <dev> |
Status: | RESOLVED WORKSFORME | ||
Severity: | enhancement | CC: | apache |
Priority: | P5 | Keywords: | triage |
Version: | SVN Trunk (Latest Devel Version) | ||
Target Milestone: | Undefined | ||
Hardware: | Other | ||
OS: | other | ||
Whiteboard: | |||
Attachments: |
find-meta-rules.pl script
90_meta_spam.cf - sample meta rules file
90_meta_spam.cf - sample rules file from bzoetekouw corpus |
Description
Peter Fritz
2005-02-25 06:11:54 UTC
Created attachment 2667 [details]
find-meta-rules.pl script
perltidy-ised version of find-meta-rules.pl as discussed in previous post.
Created attachment 2668 [details]
90_meta_spam.cf - sample meta rules file
Example frequencies and scores associated with the rules in my small corpus:
freqs: 13.197 14.6441 0.0000 1.000 0.95 0.01 T_META_SPAM_180
freqs: 7.470 8.2895 0.0000 1.000 0.90 0.01 T_META_SPAM_104
freqs: 6.243 6.9279 0.0000 1.000 0.86 0.01 T_META_SPAM_181
freqs: 6.114 6.7845 0.0000 1.000 0.86 0.01 T_META_SPAM_182
freqs: 3.531 3.9178 0.0000 1.000 0.77 0.01 T_META_SPAM_101
freqs: 0.883 0.9795 0.0000 1.000 0.54 0.01 T_META_SPAM_103
freqs: 0.409 0.4539 0.0000 1.000 0.47 0.01 T_META_SPAM_102
freqs: 0.388 0.4300 0.0000 1.000 0.47 0.01 T_META_SPAM_107
freqs: 0.280 0.3106 0.0000 1.000 0.46 0.01 T_META_SPAM_108
freqs: 0.194 0.2150 0.0000 1.000 0.44 0.01 T_META_SPAM_105
freqs: 0.151 0.1672 0.0000 1.000 0.44 0.01 T_META_SPAM_100
freqs: 0.086 0.0956 0.0000 1.000 0.43 0.01 T_META_SPAM_109
freqs: 0.043 0.0478 0.0000 1.000 0.42 0.01 T_META_SPAM_106
perceptron.scores:score T_META_SPAM_100 0.205 # [0.000..2.000]
perceptron.scores:score T_META_SPAM_101 1.510 # [0.000..3.500]
perceptron.scores:score T_META_SPAM_102 1.183 # [0.000..2.100]
perceptron.scores:score T_META_SPAM_103 0.031 # [0.000..2.400]
perceptron.scores:score T_META_SPAM_104 0.132 # [0.000..4.000]
perceptron.scores:score T_META_SPAM_105 1.027 # [0.000..2.000]
perceptron.scores:score T_META_SPAM_106 1.552 # [0.000..1.900]
perceptron.scores:score T_META_SPAM_107 0.999 # [0.000..2.100]
perceptron.scores:score T_META_SPAM_108 1.120 # [0.000..2.100]
perceptron.scores:score T_META_SPAM_109 1.116 # [0.000..1.900]
perceptron.scores:score T_META_SPAM_180 0.917 # [0.000..4.300]
perceptron.scores:score T_META_SPAM_181 0.046 # [0.000..3.900]
perceptron.scores:score T_META_SPAM_182 0.051 # [0.000..3.900]
Sounds like a really good idea.
1) Peter, can you submit a CLA to Apache (or let us know if you already have)?
2) Provisionally setting target milestone to 3.2, though I certainly won't complain if it's worked on sooner.
We may also play with this within SARE, with your permission...

Sorry for the delay, bugzilla slipped under the radar for a bit. Will pursue submitting a CLA and advise. In the meantime, you're welcome to play with it in SARE (anything that can improve spam detection more widely is fine with me!)

Current thread on the dev mailing list regarding boosting (May 4, 2005) may also be related.

btw, Peter's CLA is noted as received, so this is applicable -- anyone interested in trying it out? Henry, perhaps? ;)

OK, I ran it against the results of yesterday's daily check (no net) for my own corpus (15k ham, 85k spam). The script is quite fast (a few seconds or so on my Athlon 3000+), and it seems to come up with some really good rules:
freqs: 22.820 26.9383 0.0000 1.000 1.00 0.01 T_META_SPAM_135
freqs: 12.886 15.2117 0.0000 1.000 0.99 0.01 T_META_SPAM_061
freqs: 11.617 13.7127 0.0000 1.000 0.99 0.01 T_META_SPAM_059
freqs: 9.912 11.7004 0.0000 1.000 0.98 0.01 T_META_SPAM_117
freqs: 8.064 9.5186 0.0000 1.000 0.97 0.01 T_META_SPAM_005
freqs: 11.258 13.2882 0.0063 1.000 0.97 0.01 T_META_SPAM_063
freqs: 6.692 7.9001 0.0000 1.000 0.96 0.01 T_META_SPAM_185
freqs: 6.516 7.6918 0.0000 1.000 0.96 0.01 T_META_SPAM_159
freqs: 6.418 7.5757 0.0000 1.000 0.96 0.01 T_META_SPAM_036
freqs: 4.598 5.4280 0.0000 1.000 0.94 0.01 T_META_SPAM_076
freqs: 6.551 7.7305 0.0126 0.998 0.93 0.01 T_META_SPAM_040
freqs: 7.815 9.2215 0.0189 0.998 0.93 0.01 T_META_SPAM_026
freqs: 4.099 4.8372 0.0063 0.999 0.92 0.01 T_META_SPAM_173
freqs: 3.079 3.6342 0.0000 1.000 0.92 0.01 T_META_SPAM_004
freqs: 2.943 3.4737 0.0000 1.000 0.91 0.01 T_META_SPAM_086
freqs: 3.177 3.7491 0.0063 0.998 0.91 0.01 T_META_SPAM_120
freqs: 3.991 4.7086 0.0126 0.997 0.91 0.01 T_META_SPAM_014
freqs: 2.616 3.0879 0.0000 1.000 0.90 0.01 T_META_SPAM_129
freqs: 3.372 3.9768 0.0189 0.995 0.89 0.01 T_META_SPAM_039
freqs: 2.199 2.5962 0.0000 1.000 0.88 0.01 T_META_SPAM_196
freqs: 2.665 3.1436 0.0126 0.996 0.88 0.01 T_META_SPAM_007
freqs: 1.928 2.2763 0.0000 1.000 0.88 0.01 T_META_SPAM_177
freqs: 2.464 2.9058 0.0126 0.996 0.87 0.01 T_META_SPAM_054
freqs: 1.757 2.0738 0.0000 1.000 0.86 0.01 T_META_SPAM_038
freqs: 1.633 1.9281 0.0000 1.000 0.86 0.01 T_META_SPAM_118
freqs: 1.621 1.9133 0.0000 1.000 0.85 0.01 T_META_SPAM_090
freqs: 1.571 1.8541 0.0000 1.000 0.85 0.01 T_META_SPAM_064
freqs: 1.932 2.2786 0.0126 0.994 0.85 0.01 T_META_SPAM_023
freqs: 1.563 1.8450 0.0000 1.000 0.85 0.01 T_META_SPAM_194
freqs: 3.014 3.5522 0.0315 0.991 0.85 0.01 T_META_SPAM_010
freqs: 1.455 1.7175 0.0000 1.000 0.84 0.01 T_META_SPAM_049
freqs: 1.514 1.7858 0.0063 0.996 0.83 0.01 T_META_SPAM_008
freqs: 1.835 2.1625 0.0189 0.991 0.83 0.01 T_META_SPAM_016
freqs: 3.128 3.6843 0.0442 0.988 0.82 0.01 T_META_SPAM_022
freqs: 1.228 1.4500 0.0000 1.000 0.82 0.01 T_META_SPAM_141
freqs: 1.219 1.4387 0.0000 1.000 0.82 0.01 T_META_SPAM_017
freqs: 1.217 1.4364 0.0000 1.000 0.82 0.01 T_META_SPAM_149
freqs: 1.171 1.3817 0.0000 1.000 0.81 0.01 T_META_SPAM_102
freqs: 1.165 1.3749 0.0000 1.000 0.81 0.01 T_META_SPAM_003
freqs: 1.151 1.3590 0.0000 1.000 0.80 0.01 T_META_SPAM_121
freqs: 1.116 1.3169 0.0000 1.000 0.80 0.01 T_META_SPAM_048
freqs: 1.468 1.7300 0.0189 0.989 0.80 0.01 T_META_SPAM_052
freqs: 1.832 2.1568 0.0315 0.986 0.80 0.01 T_META_SPAM_045
freqs: 1.033 1.2190 0.0000 1.000 0.80 0.01 T_META_SPAM_166
freqs: 0.970 1.1450 0.0000 1.000 0.79 0.01 T_META_SPAM_066
freqs: 0.961 1.1348 0.0000 1.000 0.79 0.01 T_META_SPAM_073
freqs: 2.733 3.2165 0.0568 0.983 0.78 0.01 T_META_SPAM_013
freqs: 0.870 1.0266 0.0000 1.000 0.78 0.01 T_META_SPAM_142
freqs: 0.865 1.0209 0.0000 1.000 0.78 0.01 T_META_SPAM_096
freqs: 1.356 1.5957 0.0252 0.984 0.78 0.01 T_META_SPAM_002
freqs: 0.902 1.0631 0.0063 0.994 0.77 0.01 T_META_SPAM_160
freqs: 0.751 0.8866 0.0000 1.000 0.77 0.01 T_META_SPAM_186
freqs: 0.810 0.9549 0.0063 0.993 0.76 0.01 T_META_SPAM_025
freqs: 0.871 1.0255 0.0126 0.988 0.75 0.01 T_META_SPAM_193
freqs: 0.653 0.7705 0.0000 1.000 0.75 0.01 T_META_SPAM_169
freqs: 0.904 1.0642 0.0189 0.983 0.74 0.01 T_META_SPAM_180
freqs: 0.595 0.7023 0.0000 1.000 0.74 0.01 T_META_SPAM_055
freqs: 0.587 0.6931 0.0000 1.000 0.74 0.01 T_META_SPAM_097
freqs: 0.559 0.6601 0.0000 1.000 0.73 0.01 T_META_SPAM_172
freqs: 0.551 0.6499 0.0000 1.000 0.73 0.01 T_META_SPAM_015
freqs: 0.741 0.8718 0.0189 0.979 0.72 0.01 T_META_SPAM_012
freqs: 0.532 0.6283 0.0000 1.000 0.72 0.01 T_META_SPAM_020
freqs: 0.738 0.8673 0.0189 0.979 0.72 0.01 T_META_SPAM_027
freqs: 0.506 0.5975 0.0000 1.000 0.72 0.01 T_META_SPAM_083
freqs: 0.495 0.5839 0.0000 1.000 0.72 0.01 T_META_SPAM_192
freqs: 0.482 0.5691 0.0000 1.000 0.71 0.01 T_META_SPAM_144
freqs: 0.534 0.6294 0.0063 0.990 0.71 0.01 T_META_SPAM_042
freqs: 0.477 0.5634 0.0000 1.000 0.71 0.01 T_META_SPAM_171
freqs: 0.455 0.5372 0.0000 1.000 0.71 0.01 T_META_SPAM_095
freqs: 0.454 0.5361 0.0000 1.000 0.71 0.01 T_META_SPAM_047
freqs: 0.485 0.5714 0.0063 0.989 0.70 0.01 T_META_SPAM_037
freqs: 0.435 0.5133 0.0000 1.000 0.70 0.01 T_META_SPAM_109
freqs: 0.751 0.8809 0.0315 0.965 0.70 0.01 T_META_SPAM_011
freqs: 0.430 0.5076 0.0000 1.000 0.69 0.01 T_META_SPAM_098
freqs: 0.737 0.8639 0.0315 0.965 0.69 0.01 T_META_SPAM_062
freqs: 0.411 0.4849 0.0000 1.000 0.69 0.01 T_META_SPAM_195
freqs: 0.407 0.4803 0.0000 1.000 0.69 0.01 T_META_SPAM_174
freqs: 0.402 0.4746 0.0000 1.000 0.68 0.01 T_META_SPAM_155
freqs: 0.481 0.5657 0.0126 0.978 0.68 0.01 T_META_SPAM_150
freqs: 0.398 0.4701 0.0000 1.000 0.68 0.01 T_META_SPAM_190
freqs: 0.372 0.4393 0.0000 1.000 0.67 0.01 T_META_SPAM_189
freqs: 0.369 0.4359 0.0000 1.000 0.67 0.01 T_META_SPAM_140
freqs: 0.405 0.4769 0.0063 0.987 0.67 0.01 T_META_SPAM_099
freqs: 0.350 0.4132 0.0000 1.000 0.66 0.01 T_META_SPAM_162
freqs: 0.337 0.3972 0.0000 1.000 0.66 0.01 T_META_SPAM_146
freqs: 0.336 0.3961 0.0000 1.000 0.66 0.01 T_META_SPAM_198
freqs: 0.334 0.3938 0.0000 1.000 0.66 0.01 T_META_SPAM_125
freqs: 0.333 0.3927 0.0000 1.000 0.66 0.01 T_META_SPAM_068
freqs: 0.327 0.3858 0.0000 1.000 0.65 0.01 T_META_SPAM_106
freqs: 0.325 0.3836 0.0000 1.000 0.65 0.01 T_META_SPAM_024
freqs: 0.309 0.3642 0.0000 1.000 0.65 0.01 T_META_SPAM_071
freqs: 0.281 0.3312 0.0000 1.000 0.64 0.01 T_META_SPAM_101
freqs: 0.279 0.3289 0.0000 1.000 0.64 0.01 T_META_SPAM_200
freqs: 0.278 0.3278 0.0000 1.000 0.64 0.01 T_META_SPAM_119
freqs: 0.276 0.3255 0.0000 1.000 0.64 0.01 T_META_SPAM_157
freqs: 0.270 0.3187 0.0000 1.000 0.63 0.01 T_META_SPAM_158
freqs: 0.333 0.3904 0.0126 0.969 0.63 0.01 T_META_SPAM_018
freqs: 0.235 0.2777 0.0000 1.000 0.62 0.01 T_META_SPAM_105
freqs: 0.234 0.2766 0.0000 1.000 0.62 0.01 T_META_SPAM_092
freqs: 0.226 0.2663 0.0000 1.000 0.62 0.01 T_META_SPAM_091
freqs: 0.225 0.2652 0.0000 1.000 0.61 0.01 T_META_SPAM_085
freqs: 0.221 0.2606 0.0000 1.000 0.61 0.01 T_META_SPAM_078
freqs: 0.220 0.2595 0.0000 1.000 0.61 0.01 T_META_SPAM_001
freqs: 0.218 0.2572 0.0000 1.000 0.61 0.01 T_META_SPAM_080
freqs: 0.217 0.2561 0.0000 1.000 0.61 0.01 T_META_SPAM_034
freqs: 0.215 0.2538 0.0000 1.000 0.61 0.01 T_META_SPAM_147
freqs: 0.231 0.2720 0.0063 0.977 0.60 0.01 T_META_SPAM_051
freqs: 0.230 0.2709 0.0063 0.977 0.60 0.01 T_META_SPAM_065
freqs: 0.203 0.2402 0.0000 1.000 0.60 0.01 T_META_SPAM_046
freqs: 0.195 0.2299 0.0000 1.000 0.60 0.01 T_META_SPAM_176
freqs: 0.193 0.2276 0.0000 1.000 0.60 0.01 T_META_SPAM_167
freqs: 0.191 0.2254 0.0000 1.000 0.60 0.01 T_META_SPAM_127
freqs: 0.188 0.2219 0.0000 1.000 0.60 0.01 T_META_SPAM_041
freqs: 0.231 0.2709 0.0126 0.956 0.59 0.01 T_META_SPAM_009
freqs: 0.175 0.2060 0.0000 1.000 0.59 0.01 T_META_SPAM_182
freqs: 0.173 0.2037 0.0000 1.000 0.59 0.01 T_META_SPAM_201
freqs: 0.149 0.1764 0.0000 1.000 0.58 0.01 T_META_SPAM_161
freqs: 0.148 0.1741 0.0000 1.000 0.58 0.01 T_META_SPAM_058
freqs: 0.147 0.1730 0.0000 1.000 0.58 0.01 T_META_SPAM_110
freqs: 0.145 0.1707 0.0000 1.000 0.57 0.01 T_META_SPAM_139
freqs: 0.140 0.1650 0.0000 1.000 0.57 0.01 T_META_SPAM_130
freqs: 0.172 0.2015 0.0063 0.970 0.57 0.01 T_META_SPAM_028
freqs: 0.135 0.1593 0.0000 1.000 0.57 0.01 T_META_SPAM_033
freqs: 0.132 0.1559 0.0000 1.000 0.57 0.01 T_META_SPAM_143
freqs: 0.310 0.3597 0.0378 0.905 0.57 0.01 T_META_SPAM_050
freqs: 0.130 0.1537 0.0000 1.000 0.57 0.01 T_META_SPAM_094
freqs: 0.127 0.1502 0.0000 1.000 0.56 0.01 T_META_SPAM_154
freqs: 0.122 0.1445 0.0000 1.000 0.56 0.01 T_META_SPAM_175
freqs: 0.122 0.1445 0.0000 1.000 0.56 0.01 T_META_SPAM_131
freqs: 0.121 0.1434 0.0000 1.000 0.56 0.01 T_META_SPAM_006
freqs: 0.120 0.1411 0.0000 1.000 0.56 0.01 T_META_SPAM_084
freqs: 0.117 0.1377 0.0000 1.000 0.56 0.01 T_META_SPAM_070
freqs: 0.197 0.2288 0.0189 0.924 0.56 0.01 T_META_SPAM_108
freqs: 0.114 0.1343 0.0000 1.000 0.56 0.01 T_META_SPAM_151
freqs: 0.114 0.1343 0.0000 1.000 0.56 0.01 T_META_SPAM_056
freqs: 0.111 0.1309 0.0000 1.000 0.56 0.01 T_META_SPAM_104
freqs: 0.194 0.2254 0.0189 0.923 0.56 0.01 T_META_SPAM_021
freqs: 0.104 0.1229 0.0000 1.000 0.55 0.01 T_META_SPAM_156
freqs: 0.102 0.1206 0.0000 1.000 0.55 0.01 T_META_SPAM_191
freqs: 0.101 0.1195 0.0000 1.000 0.55 0.01 T_META_SPAM_145
freqs: 0.099 0.1172 0.0000 1.000 0.55 0.01 T_META_SPAM_128
freqs: 0.095 0.1127 0.0000 1.000 0.54 0.01 T_META_SPAM_043
freqs: 0.095 0.1127 0.0000 1.000 0.54 0.01 T_META_SPAM_087
freqs: 0.094 0.1115 0.0000 1.000 0.54 0.01 T_META_SPAM_134
freqs: 0.092 0.1081 0.0000 1.000 0.54 0.01 T_META_SPAM_079
freqs: 0.137 0.1593 0.0126 0.927 0.54 0.01 T_META_SPAM_035
freqs: 0.091 0.1070 0.0000 1.000 0.54 0.01 T_META_SPAM_111
freqs: 0.088 0.1036 0.0000 1.000 0.54 0.01 T_META_SPAM_116
freqs: 0.087 0.1024 0.0000 1.000 0.54 0.01 T_META_SPAM_122
freqs: 0.087 0.1024 0.0000 1.000 0.54 0.01 T_META_SPAM_067
freqs: 0.085 0.1002 0.0000 1.000 0.54 0.01 T_META_SPAM_107
freqs: 0.084 0.0990 0.0000 1.000 0.53 0.01 T_META_SPAM_181
freqs: 0.084 0.0990 0.0000 1.000 0.53 0.01 T_META_SPAM_133
freqs: 0.079 0.0933 0.0000 1.000 0.53 0.01 T_META_SPAM_187
freqs: 0.117 0.1354 0.0126 0.915 0.53 0.01 T_META_SPAM_044
freqs: 0.116 0.1343 0.0126 0.914 0.53 0.01 T_META_SPAM_077
freqs: 0.077 0.0911 0.0000 1.000 0.53 0.01 T_META_SPAM_030
freqs: 0.076 0.0899 0.0000 1.000 0.53 0.01 T_META_SPAM_188
freqs: 0.075 0.0888 0.0000 1.000 0.53 0.01 T_META_SPAM_031
freqs: 0.073 0.0865 0.0000 1.000 0.53 0.01 T_META_SPAM_075
freqs: 0.072 0.0854 0.0000 1.000 0.52 0.01 T_META_SPAM_115
freqs: 0.070 0.0831 0.0000 1.000 0.52 0.01 T_META_SPAM_126
freqs: 0.068 0.0808 0.0000 1.000 0.52 0.01 T_META_SPAM_168
freqs: 0.067 0.0785 0.0000 1.000 0.52 0.01 T_META_SPAM_132
freqs: 0.067 0.0785 0.0000 1.000 0.52 0.01 T_META_SPAM_136
freqs: 0.063 0.0740 0.0000 1.000 0.52 0.01 T_META_SPAM_152
freqs: 0.062 0.0728 0.0000 1.000 0.52 0.01 T_META_SPAM_179
freqs: 0.062 0.0728 0.0000 1.000 0.52 0.01 T_META_SPAM_093
freqs: 0.058 0.0683 0.0000 1.000 0.51 0.01 T_META_SPAM_164
freqs: 0.055 0.0649 0.0000 1.000 0.51 0.01 T_META_SPAM_184
freqs: 0.055 0.0649 0.0000 1.000 0.51 0.01 T_META_SPAM_183
freqs: 0.055 0.0649 0.0000 1.000 0.51 0.01 T_META_SPAM_178
freqs: 0.054 0.0637 0.0000 1.000 0.51 0.01 T_META_SPAM_123
freqs: 0.048 0.0569 0.0000 1.000 0.50 0.01 T_META_SPAM_074
freqs: 0.059 0.0683 0.0063 0.915 0.50 0.01 T_META_SPAM_082
freqs: 0.040 0.0478 0.0000 1.000 0.50 0.01 T_META_SPAM_124
freqs: 0.039 0.0455 0.0000 1.000 0.50 0.01 T_META_SPAM_163
freqs: 0.037 0.0433 0.0000 1.000 0.50 0.01 T_META_SPAM_088
freqs: 0.034 0.0398 0.0000 1.000 0.49 0.01 T_META_SPAM_032
freqs: 0.033 0.0387 0.0000 1.000 0.49 0.01 T_META_SPAM_112
freqs: 0.033 0.0387 0.0000 1.000 0.49 0.01 T_META_SPAM_103
freqs: 0.028 0.0330 0.0000 1.000 0.49 0.01 T_META_SPAM_072
freqs: 0.028 0.0330 0.0000 1.000 0.49 0.01 T_META_SPAM_089
freqs: 0.023 0.0273 0.0000 1.000 0.48 0.01 T_META_SPAM_081
freqs: 0.023 0.0273 0.0000 1.000 0.48 0.01 T_META_SPAM_029
freqs: 0.023 0.0273 0.0000 1.000 0.48 0.01 T_META_SPAM_019
freqs: 0.020 0.0239 0.0000 1.000 0.48 0.01 T_META_SPAM_113
freqs: 0.018 0.0216 0.0000 1.000 0.48 0.01 T_META_SPAM_165
freqs: 0.017 0.0205 0.0000 1.000 0.48 0.01 T_META_SPAM_069
freqs: 0.016 0.0193 0.0000 1.000 0.48 0.01 T_META_SPAM_137
freqs: 0.014 0.0171 0.0000 1.000 0.47 0.01 T_META_SPAM_053
freqs: 0.009 0.0102 0.0000 1.000 0.47 0.01 T_META_SPAM_153
freqs: 0.007 0.0080 0.0000 1.000 0.47 0.01 T_META_SPAM_148
freqs: 0.019 0.0216 0.0063 0.774 0.46 0.01 T_META_SPAM_100
freqs: 0.005 0.0057 0.0000 1.000 0.46 0.01 T_META_SPAM_057
freqs: 0.003 0.0034 0.0000 1.000 0.46 0.01 T_META_SPAM_170
freqs: 0.002 0.0023 0.0000 1.000 0.46 0.01 T_META_SPAM_138
freqs: 0.002 0.0023 0.0000 1.000 0.46 0.01 T_META_SPAM_197
freqs: 0.002 0.0023 0.0000 1.000 0.46 0.01 T_META_SPAM_114
freqs: 0.001 0.0011 0.0000 1.000 0.46 0.01 T_META_SPAM_199
freqs: 0.001 0.0011 0.0000 1.000 0.46 0.01 T_META_SPAM_060

Created attachment 2933 [details]
90_meta_spam.cf - sample rules file from bzoetekouw corpus
added the rules list that corresponds to the frequencies posted above
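
For anyone reading the frequency listings in this bug without the mass-check tools to hand, here is a minimal Python sketch of how one "freqs:" line can be split into fields. The thread does not label the columns, so the names used here (overall%, spam%, ham%, S/O, rank, score) are an assumption, and `parse_freqs` and `spam_ratio` are hypothetical helpers, not part of find-meta-rules.pl. In the rows above where the third column is nonzero, the fourth column agrees with spam% / (spam% + ham%) to three decimals, which is what the sketch recomputes.

```python
def parse_freqs(line):
    """Split one 'freqs:' line into named fields.

    The field names are an assumption; the bug thread does not label the columns.
    """
    parts = line.split()
    if len(parts) != 8 or parts[0] != "freqs:":
        raise ValueError(f"unexpected line: {line!r}")
    overall, spam, ham, s_o, rank, score = map(float, parts[1:7])
    return {
        "overall": overall,
        "spam": spam,
        "ham": ham,
        "s_o": s_o,
        "rank": rank,
        "score": score,
        "name": parts[7],
    }


def spam_ratio(spam_pct, ham_pct):
    """Recompute the assumed S/O value as spam% / (spam% + ham%)."""
    total = spam_pct + ham_pct
    return spam_pct / total if total else 0.0


# Three rows copied from the listing above, all with a nonzero ham column.
sample = [
    "freqs: 6.551 7.7305 0.0126 0.998 0.93 0.01 T_META_SPAM_040",
    "freqs: 3.128 3.6843 0.0442 0.988 0.82 0.01 T_META_SPAM_022",
    "freqs: 0.019 0.0216 0.0063 0.774 0.46 0.01 T_META_SPAM_100",
]

for line in sample:
    row = parse_freqs(line)
    recomputed = spam_ratio(row["spam"], row["ham"])
    print(f'{row["name"]}: listed {row["s_o"]:.3f}, recomputed {recomputed:.3f}')
```

Rules with a zero ham column always come out at 1.000 under this reading, so the rows with ham hits are the ones worth checking by hand.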
btw I know Henry has added some code to trunk under masses/evolve_metarule which does something similar with a GA -- this is why I'd like him to take a look at this. however I suspect he may be about to embark on a week's holiday tomorrow ;)

one question is, how much do these rules overlap with one another? SVN trunk's "hit-frequencies" supports a "-o" switch to compute this.

See also bug #2427. Reference was made to Henry's masses code there too. I think there is still value in having a meta-rule generator, particularly for catching low scoring spam (which find-meta-rules.pl seeks to address). It would be good to see how this approach performs against current corpus data.

Closing ancient stale bug. |