SA Bugzilla – Bug 6155
generate new scores for 3.3.0 release
Last modified: 2010-01-05 10:47:51 UTC
Here's a ticket to track this release work item. Do we actually need to do this, though, since we have Daryl's code generating scores weekly from nightly mass-check results?
(In reply to comment #0)
> Do we actually need to do this, though, since we have Daryl's code generating
> scores weekly from nightly mass-check results?

well, we need to fix that, actually. it seems to be broken.
This time around, I think I'll scrap the confusing differentiation between nightly mass-check result submission rsync accounts and "submit" accounts. Anyone object? I'm going to try a test run of the evolver based on nightly mass-check logs, btw.
http://wiki.apache.org/spamassassin/RescoreMassCheck is the procedure, as in previous releases. fwiw, we have 1022294 spams and 271617 hams in our nightly corpora, currently.
Created attachment 4517 [details] Ignore missing support for ADSP in old versions of Mail::DKIM.
(In reply to comment #4)
> Created an attachment (id=4517) [details]
> Ignore missing support for ADSP in old versions of Mail::DKIM.

wrong bug I suspect! ;)
Is there still time to add more nightlies for this rescoring? There is another major Japanese user that is very close to joining. How important is this rescoring? Do nightlies help to rescore the sa-update scores?
ok, I think I've ironed out a couple of issues. Let's see what people think of these sample scores:

http://taint.org/x/2009/gen-set0-2.0-5.0-500-ga_scores
http://taint.org/x/2009/gen-set1-5.0-5.0-500-ga_scores
http://taint.org/x/2009/gen-set2-2.0-5.0-500-ga_scores
http://taint.org/x/2009/gen-set3-5.0-5.0-500-ga_scores

here are the test results against the "test" fold for each scoreset:

gen-set0-2.0-5.0-500-ga/test
Reading scores from "tmprules"...
Reading per-message hit stat logs and scores...
# SUMMARY for threshold 5.0:
# Correctly non-spam:  26453  99.07%
# Correctly spam:      83369  81.53%
# False positives:       249   0.93%
# False negatives:     18882  18.47%
# TCR(l=50): 3.263469  SpamRecall: 81.534%  SpamPrec: 99.702%

gen-set1-5.0-5.0-500-ga/test
Reading scores from "tmprules"...
Reading per-message hit stat logs and scores...
# SUMMARY for threshold 5.0:
# Correctly non-spam:  26646  99.79%
# Correctly spam:     100943  98.72%
# False positives:        56   0.21%
# False negatives:      1308   1.28%
# TCR(l=50): 24.890701  SpamRecall: 98.721%  SpamPrec: 99.945%

gen-set2-2.0-5.0-500-ga/test
Reading scores from "tmprules"...
Reading per-message hit stat logs and scores...
# SUMMARY for threshold 5.0:
# Correctly non-spam:  26485  99.19%
# Correctly spam:      84218  82.36%
# False positives:       217   0.81%
# False negatives:     18033  17.64%
# TCR(l=50): 3.540179  SpamRecall: 82.364%  SpamPrec: 99.743%

gen-set3-5.0-5.0-500-ga/test
Reading scores from "tmprules"...
Reading per-message hit stat logs and scores...
# SUMMARY for threshold 5.0:
# Correctly non-spam:  26662  99.85%
# Correctly spam:     100964  98.74%
# False positives:        40   0.15%
# False negatives:      1287   1.26%
# TCR(l=50): 31.107697  SpamRecall: 98.741%  SpamPrec: 99.960%

Yes, set0 and set2 are terrible. This is pretty much what happened last time, too; our ruleset is pretty crappy nowadays without network rules active. But the net rule results are very good!

However, I think I need to look into the local-rule GA runs if possible.
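For anyone double-checking these summaries: the metrics follow the usual definitions, TCR(l) = Nspam / (l*FP + FN), SpamRecall = TP/Nspam, and SpamPrec = TP/(TP+FP). A quick sanity check against the gen-set1 counts (a throwaway awk snippet, not part of the masses tools):

```shell
# recompute gen-set1's summary metrics from its raw counts
awk 'BEGIN {
  tp = 100943; fn = 1308; fp = 56; lambda = 50   # from the gen-set1 summary
  nspam = tp + fn
  printf "TCR(l=50): %f  SpamRecall: %.3f%%  SpamPrec: %.3f%%\n",
         nspam / (lambda * fp + fn), 100 * tp / nspam, 100 * tp / (tp + fp)
}' > tcr-check.txt
cat tcr-check.txt
```

which reproduces the TCR/recall/precision figures printed in the gen-set1 summary line above.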
Bug 5270 is the 3.2.0 rescoring run, for reference.

Spamhaus will be happy to see a much improved score for RCVD_IN_PBL ;)

gen-set1-5.0-5.0-500-ga_scores:score RCVD_IN_PBL 2.596
gen-set3-5.0-5.0-500-ga_scores:score RCVD_IN_PBL 2.411
Created attachment 4518 [details]
sample new scores, as diff

here are the results of running a GA run for each set. please shout about any and all issues you spot (and there are a few, I think, e.g. the ACCESSDB score leakage, which should probably be ignored by the masses scripts)
http://ruleqa.spamassassin.org/20090817-r804903-n/TVD_SPACE_RATIO/detail
90% FP rate for Japanese

http://ruleqa.spamassassin.org/20090817-r804903-n/PLING_QUERY/detail
52% FP rate for Japanese

http://ruleqa.spamassassin.org/20090817-r804903-n/GAPPY_SUBJECT/detail
44% FP rate for Japanese

All three of these rules do very poorly with Japanese mail, and the total % SPAM is lower than the % FP. Yet the GA scores are rather high, since we don't have a statistically significant amount of Japanese mail in the corpus.

What language are the SPAM hits? Perhaps many are examples of identifying foreign languages instead of determining whether a mail is ham or spam?

Bug #6149 is related to this problem.

I am attempting to convince Japanese, Chinese and Korean users to join the nightly masscheck, but it is very difficult.
(In reply to comment #9)
> http://ruleqa.spamassassin.org/20090817-r804903-n/TVD_SPACE_RATIO/detail
> 90% FP rate for Japanese
> http://ruleqa.spamassassin.org/20090817-r804903-n/PLING_QUERY/detail
> 52% FP rate for Japanese
> http://ruleqa.spamassassin.org/20090817-r804903-n/GAPPY_SUBJECT/detail
> 44% FP rate for Japanese
>
> All three of these rules do very poorly with Japanese mail, and the total %
> SPAM is lower than the % FP. Yet the GA scores are rather high since we don't
> have a statistically significant amount of Japanese mail in the corpus.
>
> What language are the SPAM hits? Perhaps many are examples of identifying
> foreign languages instead of determining if it is ham or spam?
>
> Bug #6149 is related to this problem.

I plan to fix that, alright.

> I am attempting to convince Japanese, Chinese and Korean users to join the
> nightly masscheck, but it is very difficult.

BTW, you could also take copies of their mail samples and add them to your own corpora, in effect acting as a proxy for them. that's easier for them than setting up all the infrastructure. (I thought you were already doing this ;)

You may need to be able to ask them if a mail _really_ is ham, down the line, though, so it needs to remain a two-way arrangement.
> BTW, you could also take copies of their mail samples and add them to your own
> corpora, in effect acting as a proxy for them. that's easier for them than
> setting up all the infrastructure. (I thought you were already doing this ;)

I have 3 English and 3 Japanese users in my corpus at the moment. One additional Japanese user, rio, is starting nightly masscheck, hopefully tonight. He is doing his own masschecks.

> You may need to be able to ask them if a mail _really_ is ham, down the line,
> though, so it needs to remain a two-way arrangement.

I asked them very carefully to avoid mis-classification. This is part of the difficulty of getting more volunteers, aside from the privacy worries.

I look forward to seeing the effect of the fix in Bug #6149 on the next masscheck. I asked one of my users to pick a few dozen real-world sample messages that trigger the three rules in Comment #9 for the test suite.
(In reply to comment #9)
> http://ruleqa.spamassassin.org/20090817-r804903-n/TVD_SPACE_RATIO/detail
> 90% FP rate for Japanese
> http://ruleqa.spamassassin.org/20090817-r804903-n/PLING_QUERY/detail
> 52% FP rate for Japanese
> http://ruleqa.spamassassin.org/20090817-r804903-n/GAPPY_SUBJECT/detail
> 44% FP rate for Japanese

http://ruleqa.spamassassin.org/20090819-r805703-n/TVD_SPACE_RATIO/detail
0% FP rate for that particular Japanese user

http://ruleqa.spamassassin.org/20090819-r805703-n/PLING_QUERY/detail
0% FP rate for that particular Japanese user (Huh? You changed this rule too?)

http://ruleqa.spamassassin.org/20090819-r805703-n/__GAPPY_SUBJECT/detail
44% FP rate for that particular Japanese user
http://ruleqa.spamassassin.org/20090819-r805703-n/GAPPY_SUBJECT/detail
0% FP rate

Oops, wrong one?
(In reply to comment #13)
> http://ruleqa.spamassassin.org/20090819-r805703-n/GAPPY_SUBJECT/detail
> 0% FP rate
>
> Oops, wrong one?

yep, __GAPPY_SUBJECT is likely to have fps, GAPPY_SUBJECT avoids them.
Looks good; looking forward to the next test scores. Some questions...

How important is this rescoring?
Will future nightly masschecks help to rescore the sa-update scores?
Should I bother to continue recruiting more masscheck participants after this rescore?
(In reply to comment #15)
> How important is this rescoring?
> Will future nightly masschecks help to rescore the sa-update scores?

the base ruleset (non-sandbox rules) won't change scores, so this is important. For nightly masschecks, the only scores affected will be those of sandbox rules. So only about 1/2 of the ruleset, I'd reckon.

> Should I bother to continue recruiting more masscheck participants after this
> rescore?

No, I think as long as they provide results for the rescore, that's the most important thing.

Has anyone had inspiration about the reason for the bad set0 results? (I haven't looked yet)
> the base ruleset (non-sandbox rules) won't change scores, so this is important.
> For nightly masschecks, the only scores affected will be those of sandbox
> rules. So only about 1/2 of the ruleset, I'd reckon.

I am curious, do you remember the original reason for this design decision?

Might there be value in making the entire ruleset's scores affected by nightly masschecks?
(In reply to comment #17)
> > the base ruleset (non-sandbox rules) won't change scores, so this is important.
> > For nightly masschecks, the only scores affected will be those of sandbox
> > rules. So only about 1/2 of the ruleset, I'd reckon.
>
> I am curious, do you remember the original reason for this design decision?
>
> Might there be value in making the entire ruleset's scores affected by nightly
> masschecks?

iirc, the risk is that a small set of corpora (e.g. a few people take a week off) could cause the entire ruleset to be skewed incorrectly. This way at least only the most recent (sandbox) rules would be affected, so it's a bit safer.

It's also faster to generate the scores, but this isn't so much of an issue now, as our main machine is quite beefy...

There may have been other reasons, too, but I can't find the mails :(
> iirc, the risk is that a small set of corpora (e.g. a few people take a week
> off) could cause the entire ruleset to be skewed incorrectly. This way at
> least only the most recent (sandbox) rules would be affected, so it's a bit
> safer.
> It's also faster to generate the scores, but this isn't so much of an issue
> now, as our main machine is quite beefy...
> There may have been other reasons, too, but I can't find the mails :(

I feel like we have too little diversity in the type and number of ham contributors. This rescoring would be a big improvement over our scores from two years ago, and we definitely should do it. But after 3.3.0 I would like to learn how I can become more involved, in order to revamp the score update process.

* I'd like to learn how to operate the GA.
* I want to continue recruiting other nightly masscheck participants. I want to recruit contributors of non-English languages and non-technical users.
* I am thinking about writing a toolkit (in RPM and DEB packages) that would make it easier for participants to join masschecks. The current documented process is very unclear and confusing, and I want to clean this up as well.

With more diversity in masscheck participants, perhaps we can do complete rescoring more often than every 2 years.
(In reply to comment #19)
> I feel like we have too little diversity in the type and number of ham
> contributors. This rescoring would be a big improvement from our scores from
> two years ago and we definitely should do it.

yes.

> But after 3.3.0 I would like to learn how I can become more involved in order
> to revamp the score update process.
>
> * I'd like to learn how to operate the GA.
> * I want to continue recruiting other nightly masscheck participants. I want
> to recruit contributors of non-English languages and non-technical users.

Great! As long as they keep the ham out of the spam and vice versa, and we can occasionally get in touch for eyeball-verification of odd-looking FPs, that'll be very useful ;)

> * I am thinking about writing a toolkit (in RPM and DEB packages) that would
> make it easier for participants to join masschecks. The current documented
> process is very unclear and confusing, and I want to clean this up as well.

It certainly is. We've been meaning to improve this for several _years_ now, but it's never been a high enough priority. mass-check is very dev-oriented, and it should be something bundled (and documented) at a similar level to the sa-compile or sa-update scripts. Here's the history of previous attempts, which ran out of steam halfway through:

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=3096
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=2853

BTW please ensure that changes in SA (which there will definitely need to be) are submitted back upstream; IMO this functionality should be part of the core package. ;)

> With more diversity in masscheck participants, perhaps we can do complete
> rescoring more often than 2 years.

Yes.
Let's set a deadline of this Thursday for rule changes. At that point, I'll set an SVN tag for mass-checking with. We'll then give everyone 2 weeks to get results in, and build scores with those.

Bugs that are rule-related against the 3.3.0 target: bug 5380, bug 6119, bug 6156, bug 6183, bug 5937
(In reply to comment #21)
> Let's set a deadline of this Thursday for rule changes. At that point, I'll
> set an SVN tag for mass-checking with. We'll then give everyone 2 weeks to get
> results in, and build scores with those.

Will people not paying attention automatically get the mass-checking SVN tag in their nightly mass check?
(In reply to comment #1)
> (In reply to comment #0)
> > Do we actually need to do this, though, since we have Daryl's code generating
> > scores weekly from nightly mass-check results?
>
> well, we need to fix that, actually. it seems to be broken.

Crap, is this broken? I might need to clear some space on the volume it runs on.
(In reply to comment #22)
> (In reply to comment #21)
> > Let's set a deadline of this Thursday for rule changes. At that point, I'll
> > set an SVN tag for mass-checking with. We'll then give everyone 2 weeks to get
> > results in, and build scores with those.
>
> Will people not paying attention automatically get the mass-checking SVN tag in
> their nightly mass check?

no; they have to sync to a specific tag (or download a tarball iirc).
Daryl, is there a URL to your weekly scores?
(In reply to comment #25)
> Daryl, is there a URL to your weekly scores?

I think that the removal of rulesrc in svn broke it. I will have to investigate what the change was there and how I can get it working again.
(In reply to comment #21)
> Let's set a deadline of this Thursday for rule changes. At that point, I'll
> set an SVN tag for mass-checking with. We'll then give everyone 2 weeks to get
> results in, and build scores with those.

hmm. this is in a bit of trouble due to the broken build for the last few days. But we can hack something up using the previous working active.list file...
At a very minimum, could we have the one-liner in lib/Mail/SpamAssassin/Plugin/HeaderEval.pm applied? It should be perfectly safe.
Gah, I really hate how this Bugzilla shows you the next bug after you submit. I keep posting to the wrong bug.
(In reply to comment #29)
> Gah, I really hate how this Bugzilla shows you the next bug after you submit.
> I keep posting to the wrong bug.

I fully agree, it is terribly annoying. Teleports you to some completely unrelated bug, and requires an additional click to come back.
(In reply to comment #30)
> (In reply to comment #29)
> > Gah, I really hate how this Bugzilla shows you the next bug
> > after you submit. I keep posting to the wrong bug.
>
> I fully agree, it is terribly annoying. Teleports you to some completely
> unrelated bug, and requires an additional click to come back.

You're reporting this bug on the wrong bug. :)
(In reply to comment #29)
> Gah, I really hate how this Bugzilla shows you the next bug after you submit.
> I keep posting to the wrong bug.

It is configurable if you click on the Preferences link near the top of the page: the "After changing a bug" setting. I just set mine to "Show the updated bug" and I'll see if it works when I submit this comment.
thanks for the pointer Sidney! I've updated the default preferences, which may fix it.
and the mass-checks are now ready to go! mail sent to users@ and dev@.
(In reply to comment #34)
> and the mass-checks are now ready to go! mail sent to users@ and dev@.

Mail sent? I don't see it.
(In reply to comment #35)
> (In reply to comment #34)
> > and the mass-checks are now ready to go! mail sent to users@ and dev@.
>
> Mail sent? I don't see it.

I don't see any announcements anywhere. I only saw that you edited the RescoreDetails page. Is that the only hint that people should be doing it?
dammit. broken laptop mail config ate it :( resending
reminder for myself. Things that need to be done to the rules before running the GA:

- ensure JM_SOUGHT* is removed from the logs and ruleset
- bug 6156: remove all refs to RCVD_IN_PSBL in logs where "reuse=no", replacing them with RCVD_IN_PSBL_2WEEKS to more accurately model "near-live" DNSBL lookups
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6156#c59

Here I noticed that RCVD_IN_PSBL was not firing at all in my mcsnapshot masschecks, but working just fine in nightly_mass_check given the same ./mass-check syntax.

http://wiki.apache.org/spamassassin/RescoreDetails

perl Makefile.PL < /dev/null
make

My mass-check box did not have gcc installed, so I wasn't doing the "make" step. After I installed gcc and ran "make", RCVD_IN_PSBL began working in mcsnapshot.

rsync -vrz --delete \
    rsync://rsync.spamassassin.org/tagged_builds/nightly_mass_check .

I'm confused about this, because the nightly_mass_check that I obtain via rsync does not require "make"; RCVD_IN_PSBL works fine there.

Questions...
1) Does mass-check actually need gcc and "make" beforehand?
2) If so, why is nightly_mass_check working without it?
3) Is this a separate bug, in that mass-check succeeds but is silently failing on some rules?
4) Are other people doing rescore masschecks uploading bogus logs due to this silent failure?
it's good practice to use "hit-frequencies" (http://wiki.apache.org/spamassassin/HitFrequencies) to examine your results and see if anything looks broken.
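For the impatient, the same sort of eyeballing can be approximated with a few lines of awk. This is only a toy stand-in for the real hit-frequencies script, run over a made-up log, but it shows the kind of per-rule numbers to look at:

```shell
# made-up mass-check log: class, score, mbox path, comma-separated rule hits
printf 'Y 12 /m/1 RULE_A,RULE_B\nY 8 /m/2 RULE_A\n. 0 /m/3 RULE_B\n' > mc.log

# percentage of messages hitting each rule; a rule stuck at 0.0% in a large
# log is a hint that the run silently failed on it
awk '{ n++; m = split($4, r, ","); for (i = 1; i <= m; i++) hits[r[i]]++ }
     END { for (x in hits) printf "%-8s %5.1f%%\n", x, 100 * hits[x] / n }' \
    mc.log | sort > freqs.txt
cat freqs.txt
```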
I've uploaded my results, but they don't have bayes enabled. Why, again, aren't we reusing bayes results?

I've kicked off another round with bayes enabled (my net-enabled check took 13.4 hours); I'm waiting on timing to see how long it'll take. I may have to set up a SQL server on the cluster to do it in a reasonable amount of time.

In any case, I don't think we have enough message results contributed yet for a good scoreset. We have way less than for 3.2.0, although from a larger number of contributors. Is there any chance we might see results from Theo?

(In reply to comment #15)
> Should I bother to continue recruiting more masscheck participants after this
> rescore?

I would. A larger number of people submitting from *clean* corpora will allow us to provide updated scores more often. As it is now, the scores I'm generating (well, broken right now, but I'll fix it soon) swing quite a bit. I suspect it's due to not enough submitters and not enough messages.

(In reply to comment #17)
> > the base ruleset (non-sandbox rules) won't change scores, so this is important.
> > For nightly masschecks, the only scores affected will be those of sandbox
> > rules. So only about 1/2 of the ruleset, I'd reckon.
>
> I am curious, do you remember the original reason for this design decision?

I felt that we didn't have a large enough nightly/weekly corpus to reliably change all of the scores. I could generate two versions of the scores... with and without locking the base set of scores.

> Might there be value in making the entire ruleset scores affected by nightly
> masschecks?

I think we need a larger nightly/weekly corpus before we do this.

(In reply to comment #18)
> iirc, the risk is that a small set of corpora (e.g. a few people take a week
> off) could cause the entire ruleset to be skewed incorrectly. This way at
> least only the most recent (sandbox) rules would be affected, so it's a bit
> safer.

Even when all of the regular contributors submitted their results, the corpus wasn't that large, so I didn't want to throw away the scores based on the much larger corpus we had for 3.2.0.

> It's also faster to generate the scores, but this isn't so much of an issue
> now, as our main machine is quite beefy...

I can do it either way... cycles weren't an issue.

> There may have been other reasons, too, but I can't find the mails :(

I probably only sent one about the topic. There are some terse comments in the commit messages for that code.

(In reply to comment #25)
> Daryl, is there a URL to your weekly scores?

Still a little broken on my end, but:
http://svn.apache.org/viewvc/spamassassin/trunk/rulesrc/scores/
I've now uploaded results for my 962396 messages with bayes enabled.
I'm not going to be able to work on this until next week, if anyone feels the need to re-run parts of their mass-checks before then...
Justin, would you be able to set up a ruleqa URL sooner? It would be nice to see how we're doing compared to the nightly.
Created attachment 4541 [details]
freqs file on all submitted files for rescore mass-checks

> Justin, would you be able to set up a ruleqa URL sooner?
> Would be nice to see how we're doing compared to the nightly.

To give us something to chew on while we wait for the true runs, here is the freqs.full file that I obtained while following the RescoreMassCheck instructions from the wiki, using all uploaded files (from about 6 hours ago) in the submission directory, including Daryl's.
Created attachment 4542 [details]
resulting 'scores' file from a GA run

...and here is the resulting 'scores' file, obtained on scoreset 3 by running 'garescorer -f 0.003 -e 30000 -t 5.0' (through runGA). Its header is:

# SUMMARY for threshold 5.0:
# Correctly non-spam:  274586  23.732%  (99.965% of non-spam corpus)
# Correctly spam:      877118  75.809%  (99.410% of spam corpus)
# False positives:         97   0.008%  (0.035% of nonspam, 23905 weighted)
# False negatives:       5204   0.450%  (0.590% of spam, 15482 weighted)
# Average score for spam: 26.2  nonspam: -1.6
# Average for false-pos: 7.7  false-neg: 3.0
# TOTAL:              1157005  100.00%

and the matching 'test' file is:

# SUMMARY for threshold 5.0:
# Correctly non-spam:  34321  99.93%
# Correctly spam:     109470  99.40%
# False positives:        23   0.07%
# False negatives:       662   0.60%
# TCR(l=50): 60.779249  SpamRecall: 99.399%  SpamPrec: 99.979%

Perhaps I pushed it too far with '-f 0.003'.
P.S. keep in mind that I've only been playing with the GA for the last two days, after first gaining some experience by running it on my corpus only. Take the results with a large grain of salt.
I recruited an Italian participant for masscheck. He's ready to upload logs for nightly masscheck and rescore masscheck. He sent a request for an rsync account on September 11th, 2009 but did not hear back. I'm uploading logs on his behalf soon.
(In reply to comment #49)
> I recruited an Italian participant for masscheck. He's ready to upload logs
> for nightly masscheck and rescore masscheck. He sent a request for an rsync
> account on September 11th, 2009 but did not hear back. I'm uploading logs on
> his behalf soon.

what was his username? I thought Mark created an acct for him, but could have confused him with someone else...
bernie or Bernardo, not sure which he would have requested as. Are the ::submit and nightly ::corpus accounts the same thing now?
> He sent a request for an rsync account on September 11th, 2009 but did not
> hear back. I'm uploading logs on his behalf soon.
>
> what was his username? I thought Mark created an acct for him, but could
> have confused him with someone else...

I did create rsync accounts for Bernie Innocenti <bernie@codewiz.org> (binnocenti, 2009-09-15) and for Austin Henry (ahenry). Both received my general reply as CC-ed to the private@spamassassin.apache.org ML, plus a private mail with a password. Bernie's MX host 83.149.158.210 accepted and confirmed both messages:

Sep 15 15:59:05 dorothy postfix/smtp[14113]: 328DD1D1C4B: to=<bernie@codewiz.org>,
  relay=mail.codewiz.org[83.149.158.210]:25, delay=4.6, delays=0/0/1.7/2.9,
  dsn=2.0.0, status=sent (250 ok 1253023145 qp 22364)
Sep 15 16:00:04 dorothy postfix/smtp[14113]: 69A7A1D1C68: to=<bernie@codewiz.org>,
  relay=mail.codewiz.org[83.149.158.210]:25, delay=2.6, delays=0/0/0.72/1.9,
  dsn=2.0.0, status=sent (250 ok 1253023203 qp 22602)
Is there a way to individually delete files over rsync? I need to delete the "bernie" log from the ::submit directory. It seems that the rsync --delete option only applies if you are syncing entire directories.
> Is there a way to individually delete files over rsync? I need to delete the
> "bernie" log from the ::submit directory. It seems that the rsync --delete
> option only applies if you are syncing entire directories.

I don't think rsync is able to delete a specific file. Just upload an empty file in its place, then we can delete the leftovers at some time.

> Are the ::submit and nightly ::corpus accounts the same thing now?

Yes, both rsync areas currently point to the same 'secrets file' in rsyncd.conf.
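Concretely, the empty-file workaround might look like this; the log file name is illustrative, and the actual upload (shown commented out) needs the rsync account:

```shell
# create a zero-byte replacement for the unwanted log
: > ham-bayes-net-bernie.log
ls -l ham-bayes-net-bernie.log

# then push it over the old file on the server, e.g.:
#   rsync -v ham-bayes-net-bernie.log rsync://bernie@rsync.spamassassin.org/submit/
```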
(In reply to comment #54)
> > Are the ::submit and nightly ::corpus accounts the same thing now?
>
> Yes, both rsync areas currently point to the same 'secrets file' in
> rsyncd.conf.

However -- they are not the same place. They are separate directories, allowing people to turn their nightlies back on without overwriting the results they've uploaded for the "rescore" mass-check.
Here is the set of rules in 50_scores.cf that I ended up fixing (making immutable) for the GA run (score set 3). Most of these are already documented and labeled as such, but it doesn't hurt to post it here as a double-check.

The score in the comments on the BAYES rules is what a GA run on scoreset 3 gave me (all .log files in 'submit', except for Daryl's spam-bayes-net-dos, of which I only took a random sample of 65000 entries, so as not to overwhelm the remaining data). I manually reduced the BAYES_ scores a bit, as suggested by a comment in 50_scores.cf referring to Bug 4505.

score ALL_TRUSTED -1.360 -1.440 -1.665 -1.800
score ANY_BOUNCE_MESSAGE 0.1
score BAYES_00 -2.8 # -2.935
score BAYES_05 -1.1 # -1.148
score BAYES_20 -0.9 # -2.020
score BAYES_40 -0.5 # -2.172
score BAYES_50 0.2 # 0.326
score BAYES_60 1.5 # 2.555
score BAYES_80 2.0 # 2.133
score BAYES_95 3.2 # 3.995
score BAYES_99 3.8 # 4.495
score BOUNCE_MESSAGE 0.1
score CHALLENGE_RESPONSE 0.1
score CRBOUNCE_MESSAGE 0.1
score DKIM_ADSP_CUSTOM_HIGH 0.001
score DKIM_ADSP_CUSTOM_LOW 0.001
score DKIM_ADSP_CUSTOM_MED 0.001
score DKIM_POLICY_SIGNALL 0
score DKIM_POLICY_SIGNSOME 0
score DKIM_POLICY_TESTING 0
score DKIM_SIGNED 0.1
score DKIM_VALID -0.1
score DKIM_VALID_AU -0.1
score DKIM_VERIFIED 0
score EXTRA_MPART_TYPE 1.0
score GTUBE 1000.000
score NO_HEADERS_MESSAGE 0.001
score NO_RECEIVED -0.001
score NO_RELAYS -0.001
score RDNS_DYNAMIC 0.1
score RDNS_NONE 0.1
score SPF_HELO_PASS -0.001
score SPF_PASS -0.001
score SUBJECT_IN_BLACKLIST 100
score SUBJECT_IN_WHITELIST -100
score UNPARSEABLE_RELAY 0.001
score USER_IN_ALL_SPAM_TO -100.000
score USER_IN_BLACKLIST 100.000
score USER_IN_BLACKLIST_TO 10.000
score USER_IN_DEF_DKIM_WL -7.500
score USER_IN_DEF_SPF_WL -7.500
score USER_IN_DEF_WHITELIST -15.000
score USER_IN_DKIM_WHITELIST -100.000
score USER_IN_MORE_SPAM_TO -20.000
score USER_IN_SPF_WHITELIST -100.000
score USER_IN_WHITELIST -100.000
score USER_IN_WHITELIST_TO -6.000
score VBOUNCE_MESSAGE 0.1

One observation on the DCC scores:
the calculated score for DCC_CHECK depends on whether one is using a licensed DCC server (providing reputation data) or not. There is a significant overlap between DCC_CHECK hits and DCC_REPUT_99_100, so the DCC_CHECK score should be lower when reputation data is not offered by a DCC server.

score DCC_CHECK 1.15 # no reputation data
score DCC_CHECK 0.835 # with reputation data
score DCC_REPUT_00_12 -0.9 # -0.001
score DCC_REPUT_13_19 -0.5 # -0.001
score DCC_REPUT_70_89 1.354
score DCC_REPUT_90_94 0.56
score DCC_REPUT_95_98 1.52
score DCC_REPUT_99_100 2.40

As the majority of installations probably won't be using a commercial DCC server, it would probably be best to zero out the DCC_REPUT_* scores for the GA run (so as to obtain the correct DCC_CHECK score).
When do the bb rescore masschecks begin?
dammit! I totally dropped the ball on that one. :( I'll need to get that set up asap...
(In reply to comment #58)
> dammit! I totally dropped the ball on that one. :( I'll need to get that set
> up asap...

ok, 5 EC2 nodes are now running mass-checks, one for each bb-* corpus; all should be complete by tomorrow morning. yay for elastic scaling ;)
and they're now uploaded. Is that everyone? Do we want to wait for any more?

Mark -- I'm on vacation for 2 weeks starting on Sunday. Can you run the GA? it looks like you've pretty much got it working, as far as I can tell.

I've also copied the current set of logs to ruleqa under the following date: Tue Sep 30 09:00:00 UTC 2009 and rev: 808953. That should show up at:

http://ruleqa.spamassassin.org/?daterev=20090930-r808953-n

mail counts (approx, as these include header comments and too-old messages):

: 60...; wc -l submit/spam-*.log
     2061 submit/spam-bayes-net-ahenry.log
        6 submit/spam-bayes-net-bb-fredt.log
     1418 submit/spam-bayes-net-bb-guenther_fraud.log
     1846 submit/spam-bayes-net-bb-jhardin.log
     2200 submit/spam-bayes-net-bb-kmcgrail.log
     7191 submit/spam-bayes-net-bb-zmi.log
      638 submit/spam-bayes-net-binnocenti.log
    81271 submit/spam-bayes-net-bluestreak.log
   931869 submit/spam-bayes-net-dos.log
       98 submit/spam-bayes-net-hege-fi.log
    36948 submit/spam-bayes-net-hege.log
  1489714 submit/spam-bayes-net-jm.log
    23768 submit/spam-bayes-net-mmartinec.log
     6734 submit/spam-bayes-net-wt-en1.log
        9 submit/spam-bayes-net-wt-en2.log
        6 submit/spam-bayes-net-wt-en3.log
    19166 submit/spam-bayes-net-wt-en4.log
        6 submit/spam-bayes-net-wt-en5.log
        6 submit/spam-bayes-net-wt-en6.log
      126 submit/spam-bayes-net-wt-jp1.log
        6 submit/spam-bayes-net-wt-jp2.log
  2605087 total

: 61...; wc -l submit/ham-*.log
     2657 submit/ham-bayes-net-ahenry.log
      587 submit/ham-bayes-net-bb-fredt.log
        9 submit/ham-bayes-net-bb-guenther_fraud.log
     4307 submit/ham-bayes-net-bb-jhardin.log
        6 submit/ham-bayes-net-bb-kmcgrail.log
        6 submit/ham-bayes-net-bb-zmi.log
    10909 submit/ham-bayes-net-binnocenti.log
    87446 submit/ham-bayes-net-bluestreak.log
    30539 submit/ham-bayes-net-dos.log
   123556 submit/ham-bayes-net-hege-fi.log
    34804 submit/ham-bayes-net-hege.log
   353429 submit/ham-bayes-net-jm.log
    38913 submit/ham-bayes-net-mmartinec.log
     5705 submit/ham-bayes-net-wt-en1.log
     3003 submit/ham-bayes-net-wt-en2.log
     9906 submit/ham-bayes-net-wt-en3.log
        6 submit/ham-bayes-net-wt-en4.log
     5106 submit/ham-bayes-net-wt-en5.log
     2110 submit/ham-bayes-net-wt-en6.log
     1065 submit/ham-bayes-net-wt-jp1.log
     3619 submit/ham-bayes-net-wt-jp2.log
   717688 total

we could probably skip some of the spam.
(In reply to comment #60)
> I've also copied the current set of logs to ruleqa ...
>
> : 60...; wc -l submit/spam-*.log
>      1418 submit/spam-bayes-net-bb-guenther_fraud.log
>      1846 submit/spam-bayes-net-bb-jhardin.log
>      2200 submit/spam-bayes-net-bb-kmcgrail.log
>
> : 61...; wc -l submit/ham-*.log
>         9 submit/ham-bayes-net-bb-guenther_fraud.log
>      4307 submit/ham-bayes-net-bb-jhardin.log
>         6 submit/ham-bayes-net-bb-kmcgrail.log

There should also be jhardin_fraud logs, should there not? I _am_ submitting daily corpora updates for sought_fraud, and those should be included just as guenther's are...
(In reply to comment #60)
>         9 submit/ham-bayes-net-bb-guenther_fraud.log
          ^^^

Please do *not* include my fraud ham corpus. It exclusively contains fake, artificial messages to exclude some German [1] from the fraud spam corpus. No real ham there. My spam corpus of course is fine to include.

[1] Short, broken German paragraphs along the lines of "you may write in German, too", in an otherwise entirely English spam.
(In reply to comment #62)
> (In reply to comment #60)
> > 9 submit/ham-bayes-net-bb-guenther_fraud.log
>
> Please do *not* include my fraud ham corpus. It exclusively contains fake,
> artificial messages to exclude some German [1] from the fraud spam corpus.

Same goes for my fraud ham corpus, except s/German/English/ (primarily free mail adverts and legal disclaimers).
http://ruleqa.spamassassin.org/20090930-r808953-n/RCVD_IN_PSBL/detail It looks like all the ham is visible in the ruleqa, but only 86390 spam?
yep, that's not right :( I've deleted the files, let's see if the backend rebuilds them correctly using all logs this time.
(In reply to comment #60)
> we could probably skip some of the spam.

If you feel that it's detrimental to include that much, sure. I'd start by dropping from your and my corpora; I've got spam up to 60 days old in mine. I'd include everyone else's spam and thin ours out rather than use a straight drop-by-date method. If it's solely a processing-time concern, I'd say it's a non-issue, as the GA doesn't take that long to run. I know the nightly runs (about half as much mail) take around 30 minutes on the ancient machine I've got it running on.
http://ruleqa.spamassassin.org/20090930-r808953-n Was that re-run? The same total number of spam: 86390
(In reply to comment #67) > http://ruleqa.spamassassin.org/20090930-r808953-n > Was that re-run? The same total number of spam: 86390 it took a little time, but it appears to have corrected itself now. I think there's a race condition to do with the way logs are rsynced from spamassassin.zones to spamassassin2.zones. :(
Hey Mark, is the GA run happening while jm is away?
> Hey Mark, is the GA run happening while jm is away? Yes, it is underway just now. I needed to figure out how to set up the mpich2 message-passing environment, but I think I have it working now. I will be asking contributors to check some apparent FP and FN in their logs soon...
> I will be asking contributors to check some apparent FP and FN in their
> logs soon...

The longer you wait, the more of the log IDs will no longer match the mailboxes.

BTW, did you do the things written in Comment #38?

So scoring PSBL might be more complicated than this:

* RCVD_IN_PSBL_2WEEKS was never meant to be published as a run-time rule. It is valuable in measuring PSBL in masschecks.
* It seems that PSBL is not set to allow reuse?
* PSBL as measured in the rescore masscheck was deep parsing, while we subsequently agreed to change it to lastexternal.

What should we do?
> The longer you wait, more of the logs ID's will no longer match the mail boxes.

The messages whose results are submitted to rescoring are supposed to be preserved, at least until the rescoring runs are done.

> BTW, did you do the things written in Comment #38?

Not yet, will do in my next iteration. It takes a couple of hours. The JM_SOUGHT results I kept on purpose for now, wondering what their scores would be. On the next round I can just force them to zero; I believe this is equivalent to removing them from the logs. In the first round I got:

score JM_SOUGHT_FRAUD_1 2.105
score JM_SOUGHT_FRAUD_2 2.318
score JM_SOUGHT_FRAUD_3 3.270

> So scoring PSBL might be more complicated than this.
>
> * RCVD_IN_PSBL_2WEEKS was never meant to be published as a run-time rule. It
>   is valuable in measuring PSBL in masschecks.
> * It seems that PSBL is not set to allow reuse?
> * PSBL as measured in the rescore masscheck was deep parsing, while we
>   subsequently agreed to change it to lastexternal.

I did the translations from Comment #38 now on the RCVD_IN_PSBL* rules; they will go into the next approximation.

> What should we do?

There seem to be some other rules in the works, so I'd say let's just finish up whatever was frozen with a call for rescoring results, publish that as beta-1, then examine what we got, polish it, and do another rescoring run before the final release. It's not too bad to just fix some scores manually; we're doing it also for BAYES, SPF, etc.

==========

Here is now the first homework: the following were reported as false positives in my last completed run.
Please check if these are really ham messages (I already checked my two entries, and they are):

ham-bayes-net-hege.log
  /data/sa/h/3/36f18b49dd8ce2ce70586c67eeb780fd
  /data/sa/h/0/0270ee166042abd0aa94cbdda855400c
  /data/sa/h/9/9eb11730050002add51ecdc6ed25343d
  /data/sa/h/5/5dfa06864bb3021674768e8af372a6c9
  /data/sa/h/4/4214ade1e7e177f0453c5f1cc98c8b42

ham-bayes-net-bluestreak.log
  ../../aaa_ham/2009-07_HAM_721117.0
  ../../aaa_ham/2009-06_HAM_602375.0
  ../../aaa_ham/2009-06_HAM_609153.0
  ../../aaa_ham/2009-06_HAM_623012.0
  ../../aaa_ham/2009-06_HAM_622736.0
  ../../aaa_ham/2009-08_HAM_814010.0

ham-bayes-net-dos.log
  /home/dos/SA-corpus/ham/leah/INBOX-Inbox-2007/1195695047.P9700Q22.dilbert.dostech.net:2,S
  /home/dos/SA-corpus/ham/leah/INBOX-Inbox-2007/1196258008.P18803Q16.dilbert.dostech.net:2,S
  /home/dos/SA-corpus/ham/leah/INBOX-Inbox-2007/1199765108.P20983Q90.dilbert.dostech.net:2,RS

ham-bayes-net-jm.log
  /local/cor/recent/ham/priv.radish.jmason.org.200808310000.mbox.160968
  /local/cor/recent/ham/priv.wall.200809081400.mbox.1677188
  /local/cor/recent/ham/priv.20050914/126599

ham-bayes-net-mmartinec.log
  ham/uYUQM2RmF9I0
  ham/p+KSEyzZTPOw
> Please check if these are really ham messages

and four more from the second run:

  ../../aaa_ham/2009-07_HAM_704334.0
  ../../aaa_ham/2009-08_HAM_810051.0
  /local/cor/recent/ham/priv.20050914/137533
  /home/dos/SA-corpus/ham/dos/Inbox-2008/1221834769.M749008P21562V0000000000000302I00414902_237.cyan.dostech.net,S=26243:2,S

Also, I find the scores on URIBL_(AB|JP|WS)_SURBL to be rather low compared to my experience (e.g. one FP out of 39,000 on URIBL_WS_SURBL in my ham-bayes-net-mmartinec.log), so my guess is that several of the following hits could be false positives on these rules:

grep -c 'URIBL_WS_SURBL' ham-bayes-net-jm.log
178
grep -c 'URIBL_AB_SURBL' ham-bayes-net-jm.log
42
grep -c 'URIBL_JP_SURBL' ham-bayes-net-jm.log
29
grep -c 'URIBL_JP_SURBL' ham-bayes-net-bluestreak.log
28
egrep -c 'URIBL_(AB|JP|WS)_SURBL' ham-bayes-net-hege.log
7
grep -c 'URIBL_WS_SURBL' ham-bayes-net-dos.log
4
/home/dos/SA-corpus/ham/dos/Domains/1195543943.M277151P27837V0000000000000302I00154082_16.cyan.dostech.net\,S\=6338\:2\,S
...is an abuse report that contains an abused domain. I'd rm it from the logs; I have from my corpus.

/home/dos/SA-corpus/ham/leah/INBOX-Inbox-2007/1195695047.P9700Q22.dilbert.dostech.net:2,S
...is ham. A user recommends somebody locally who, I guess, has spammed their domain. I've left this in my corpus.

/home/dos/SA-corpus/ham/dos/infra-list/1204046401.M43776P15497V0000000000000302I0000C20E_0.cyan.dostech.net,S=5621:2,S
...abuse report. I'd rm it from the logs; I have from my corpus.

/home/dos/SA-corpus/ham/dos/infra-list/1253117012.M352778P19949V0000000000000302I008D1494_70.cyan.dostech.net,S=2683:2,
...abuse report. I'd rm it from the logs; I have from my corpus.
Might we consider assigning different confidence weights to ham corpora? For example, my ham corpora are relatively small in number, but I have strong confidence that they are thoroughly cleaned. Furthermore, they are extremely varied in sources and likely to be different from other masscheck participants'. I have also filtered out all discussion mailing lists and automated report sources.

For example, I would assign the following weights to my ham corpora:

wt-en1: x2.5
wt-en2: x2
wt-en3: x1.5
wt-en5: x2
wt-en6: x1
wt-jp1: x2.5
wt-jp2: x1.5

Anyhow, just an idea. Not sure if this is helpful.
I cleaned up my few FPs and some other stuff; new logs sent. Talking about weights, does anyone have an academic answer on how results are affected when some corpora are uniqued (at least mine is) and some are not?
Nevermind about the weights idea.
> I cleaned up my few FPs and some other stuff, new logs sent..

Thanks to Daryl and Henrik. I'm still waiting for the bluestreak logs, but meanwhile am running garescorer on what I have (including the recent updates). Btw, Daryl, you haven't commented on:

  /home/dos/SA-corpus/ham/leah/INBOX-Inbox-2007/1196258008.P18803Q16.dilbert.dostech.net:2,S
  /home/dos/SA-corpus/ham/leah/INBOX-Inbox-2007/1199765108.P20983Q90.dilbert.dostech.net:2,RS
  /home/dos/SA-corpus/ham/dos/Inbox-2008/1221834769.M749008P21562V0000000000000302I00414902_237.cyan.dostech.net,S=26243:2,S

> Talking about weights, does anyone have an academic answer on how results are
> affected when some corpuses are uniqued (atleast mine is) and some are not?

Don't know. I removed exact duplicates on mail body from my corpus, although due to the 'personalized' spam which is becoming prevalent nowadays thanks to the free CPU resources on botnets, there are still plenty of very similar yet different messages left in the corpus. I did some manual removal on these, but it is very impractical to be thorough.

> Might we consider assigning different confidence weights to ham corpa?
>
> For example, my ham corpa are relatively small in number, but I have strong
> confidence that they are thoroughly cleaned. Furthermore they are extremely
> varied in sources and likely to be different from other masscheck participants.
> I have also filtered out all discussion mailing lists and automated report

I do recognize that corpora are quite different in several aspects, although I don't know how one could weight them more fairly and incorporate that into the current procedure. Let me just document here what I'm doing now with a local copy of all submitted logs.
Due to a significant disproportion in the size of spam-bayes-net-dos.log and spam-bayes-net-jm.log compared to the rest, I'm taking a random sample of each of these files, restricted to scoreset 3 and age below 6 months, decimated to 150,000 entries each (I initially used 100,000, but have now bumped it up). There are some spam log entries older than 6 months in other spam logs, but not too many (mostly in the 'hege' collection); as these seem to be mainly hand-selected fraud samples, I'm keeping them regardless of age.

Due to a shortage of ham, I'm keeping it all regardless of age. This mainly affects JM's ham collection, which contains a (smaller) share of older ham; the remaining collections are fairly recent.

There are no scoreset 0 and 2 entries in any of the logs. So for the scoreset 2 and 3 runs I'm using a selection from the logs with 'set=3'. For the scoreset 0 and 1 runs I'm using all entries (set=1 and set=3). This all amounts to the following 'wc -l' counts:

463957 ham-full-set1.log
483402 spam-full-set1.log
293637 ham-full-set3.log
443635 spam-full-set3.log

This seems reasonably fair and balanced to me.
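The decimation step described above can be sketched roughly as below. This is only an illustration, not the actual script used: the 150,000 target is from this comment, while the scoreset/age filtering is omitted here because it depends on the mass-check log format.

```python
import random

def decimate(lines, target=150000, seed=0):
    """Randomly sample down to `target` log entries; keep everything if
    the input is already small enough. The scoreset-3 / under-6-months
    filtering described above would happen before this step."""
    if len(lines) <= target:
        return list(lines)
    # sampling without replacement; a fixed seed keeps runs reproducible
    return random.Random(seed).sample(lines, target)
```

A fixed seed is used so that repeated GA runs see the same sample.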
> There are some spam log entries older than 6 months on other spam logs, but > not too many (mostly on the 'hege' collection), but as it seems these are > mainly hand-selected fraud samples, I'm keeping these regardless of age. Oops, wrong id: s/hege/jhardin/
The following also looks fishy:

grep -c DKIM_ADSP_DISCARD ham*.log
  ham-bayes-net-bb-fredt.log    21
  ham-bayes-net-bb-jhardin.log  22
  ham-bayes-net-bluestreak.log  36
  ham-bayes-net-hege.log        43
  ham-bayes-net-wt-en6.log      35
  ham-bayes-net-mmartinec.log    1
  ham-bayes-net-dos.log         25
  ham-bayes-net-jm.log          65

(the one entry in my collection is due to the author posting through a mailing list, despite the fact that his domain publishes a 'discardable' policy; so, a sender's mistake)
(In reply to comment #80)
> The following also looks fishy:
>
> grep -c DKIM_ADSP_DISCARD ham*.log
> ham-bayes-net-wt-en6.log 35
>
> (the one entry in my collection is due to the author posting
> through a mailing list, despite the fact that his domain publishes
> a 'discardable' policy; so, a sender's mistake)

These are all legitimate-looking PayPal mail delivered to a Yahoo account from mid-2008 through recently. What is DKIM_ADSP_DISCARD supposed to mean?
(In reply to comment #81)
> > The following also looks fishy:
> > grep -c DKIM_ADSP_DISCARD ham*.log
> > ham-bayes-net-wt-en6.log 35
>
> These are all legitimate looking paypal mail delivered to a Yahoo account
> from mid-2008 through recently.

I'm not sure since when PayPal has been signing their mail. They were certainly signing it with DomainKeys signatures in 2006, and with DKIM in 2008. So for very old ham mail from PayPal (or eBay) it is quite possible the signature is missing or somehow broken or unverifiable, but this shouldn't be the case for current mail from these domains.

> What is DKIM_ADSP_DISCARD supposed to mean?

It means two things:

- the message does not have a valid author's domain DKIM or DomainKeys signature (e.g. there is no signature at all, or the signature does not match the mail contents, or it does not match the domain name in the From header field);
- and the domain claims that any mail claiming to be from that domain and failing signature verification should be discarded. This claim is made by publishing a DNS record (RFC 5617), or through the 'adsp_override' configuration directive in SpamAssassin's .cf file.

So, if your mail samples are younger than a year, do have a DKIM-Signature header field, and appear to be genuine, the only explanation for a failed signature verification is that the message got somehow corrupted or transformed on its way to SpamAssassin in such a way that the signature no longer matches the mail contents, or that SA could not fetch the domain's public key, perhaps due to a DNS resolver failure or some firewall trouble.

Depending on where and how SpamAssassin is called from your mail delivery system, and how you collected your samples (e.g. from an MTA, from a mailbox, from some kind of a quarantine), there are different possible reasons for mail corruption. For example, saving a mail message source from some MUA (e.g. kmail) can rewrite/reformat some header fields.
Running a virus scanner in the mail path may add its verdict to the mail body. Fetching mail from a POP3 server, or even from a webmail service, poses its own challenges to mail integrity. In some cases even a 'friendly' MTA thinks it is doing a favour by rewriting some header fields, perhaps in the belief that they would look 'prettier'.

One way to find out is to describe the path the mail takes through your infrastructure (firewall, MTA, virus scanners, mailbox server) before it reaches SpamAssassin, and to carefully examine one or two such mail samples. If you have a choice, you may mail me some samples, preferably as a gzip or tar.gz attachment, to make sure they won't get transformed in transit.
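For readers following along, the two conditions described above combine as in this toy sketch of the RFC 5617 semantics (my illustration, not SpamAssassin's actual implementation):

```python
def adsp_discard(has_valid_author_sig: bool, adsp_policy: str) -> bool:
    """DKIM_ADSP_DISCARD-style verdict: fire only when the author domain
    publishes a 'discardable' ADSP policy AND no valid author-domain
    DKIM/DomainKeys signature is present on the message."""
    return adsp_policy == "discardable" and not has_valid_author_sig
```

This is why a ham message from a 'discardable' domain that gets corrupted in transit (the wt-en6 case) fires the rule: the signature no longer verifies, so the first condition holds.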
Cleaned up my DKIM_ADSP_DISCARD hits (old 2005 ebay mails removed) and some other old stuff, logs sent..
> These are all legitimate looking paypal mail delivered to a Yahoo account from
> mid-2008 through recently.

Thanks, Warren, for your out-of-band mail. Apart from the general comments in my previous posting, there is a real problem with your method of fetching mail for a Yahoo account. You are using FetchYahoo to download these messages from the Yahoo webmail interface. FetchYahoo has to jump through hoops to retrieve a message in a form as close to the original as possible, but there are some real obstacles there. Glancing at its source code, it has to pull attachments separately and splice them back together into a message, necessarily reinventing the MIME boundaries. This alone is enough to render DomainKeys and DKIM signatures invalid. Apart from this, it also converts QP- and base64-encoded messages into UTF-8 binary, which again is a sufficient reason for signature breakage. Moreover, it has to repair some damage to header field folding and empty lines, which are broken either due to bugs in Yahoo's HTML rendering (indicated by comments in the FetchYahoo code), or because details are simply lost in the conversion to HTML and back to mail.

This method of fetching mail is bound to cause trouble. It may quite easily cause some other low-level SpamAssassin rules to misfire or to fail to trigger, not just the signature verification failures.
I guess we have no choice but to drop wt-en6 from the rescore GA. Should I drop it from nightly masscheck as well?
> I guess we have no choice but to drop wt-en6 from the rescore GA.
> Should I drop it from nightly masscheck as well?

I can imagine such a problem could also affect other users, especially those not running SpamAssassin close to their MTA. I guess we can keep the wt-en6 corpus (and similar ones, if identified), but keep in mind that FP hits on DKIM_ADSP_DISCARD (and possibly on some other rules, if identified) should be disregarded. I already removed the DKIM_ADSP_DISCARD hits from my copy of the wt-en6 log.

If it turns out the undesired mail modifications are more common in submitted corpora, we could perhaps re-run the GA on a subset of logs known not to be suffering from the problem, and just take the DKIM_* scores from the results of that run. The release notes could then say that one should lower the DKIM_ADSP_* scores on installations where it is known that mail is not reaching SpamAssassin in its pristine form (as received by the MTA).
(In reply to comment #86)
> The release notes could then say that one should lower the DKIM_ADSP_*
> scores on installations where it is known that mail is not reaching
> SpamAssassin in its pristine form (as received by the MTA).

This case of old ham where the sender subsequently changed their DKIM policy is only an issue for masscheck, not production scanning. Lowering the DKIM scores makes no sense then?
> > The release notes could then say that one should lower the DKIM_ADSP_*
> > scores on installations where it is known that mail is not reaching
> > SpamAssassin in its pristine form (as received by the MTA).
>
> This case or old ham where the sender subsequently changed their DKIM policy
> is only an issue for masscheck, not production scanning.

True for the case of old ham where the sender subsequently changed their DKIM policy, or for the case of expired signatures -- these are only an issue with masscheck. But not for the case of wt-en6, where mail is transformed on its path through webmail: that is an issue both for masschecks and for production runs.

> Lowering the DKIM scores makes no sense then?

If one knows that mail reaching SpamAssassin will have been modified along its mail path, then one must disable rules that target mail forgery and depend on pristine mail, such as the DKIM_ADSP_DISCARD rule. Otherwise the rule would generate FP score points for legitimate mail from domains publishing ADSP (explicitly or through overrides).
Created attachment 4550 [details]
resulting 50_scores.cf from garescorer runs

Ok, here it is at last: the auto-generated 50_scores.cf from garescorer runs on all four sets, with no hand-tweaking of results (yet) ... to give us something to digest and comment on; it can serve as the first approximation. Some values are surprising or plain wrong; I'll comment on some later.

I used the submitted logs (tweaked as per Comment 78), with all the recent updates to them as posted so far in this ticket. I left the BAYES scores fully floating. I fixed at zero the DCC_REPUT_* and JM_SOUGHT_FRAUD_* scores, as was discussed previously (as can be seen at the end of the attached file). Eventually these will need to be set to some manually determined score.
To assess the quality and repeatability of the results, here are the summaries for all four score sets. Each pair consists of a normal run on 90% of the entries and a test run on the remaining 10%. The most interesting figures are the FP and FN percentages, e.g. 0.028% and 0.961% in this clipping:

# False positives:      65  0.011%  (0.028% of nonspam, 10580 weighted)
# False negatives:    3411  0.578%  (0.961% of spam, 12054 weighted)

==========================================
gen-set0-5-5.0-25000-ga   SCORESET 0 (no net, no bayes)

test (10%):
# SUMMARY for threshold 5.0:
# Correctly non-spam:  45335  98.03%
# Correctly spam:      39320  81.61%
# False positives:       913   1.97%
# False negatives:      8860  18.39%
# TCR(l=50): 0.883875  SpamRecall: 81.611%  SpamPrec: 97.731%

scores (90%):
# SUMMARY for threshold 5.0:
# Correctly non-spam: 365397  48.193%  (98.401% of non-spam corpus)
# Correctly spam:     314466  41.476%  (81.286% of spam corpus)
# False positives:      5936   0.783%  (1.599% of nonspam, 173347 weighted)
# False negatives:     72396   9.548%  (18.714% of spam, 226867 weighted)
# Average score for spam: 10.0  nonspam: 1.4
# Average for false-pos: 5.6  false-neg: 3.1
# TOTAL:              758195  100.00%

==========================================
gen-set1-10-5.0-30000-ga   SCORESET 1 (net, no bayes)

test (10%):
# SUMMARY for threshold 5.0:
# Correctly non-spam:  46183  99.86%
# Correctly spam:      46648  96.82%
# False positives:        65   0.14%
# False negatives:      1532   3.18%
# TCR(l=50): 10.075282  SpamRecall: 96.820%  SpamPrec: 99.861%

scores (90%):
# SUMMARY for threshold 5.0:
# Correctly non-spam: 370804  48.906%  (99.858% of non-spam corpus)
# Correctly spam:     374579  49.404%  (96.825% of spam corpus)
# False positives:       529   0.070%  (0.142% of nonspam, 31804 weighted)
# False negatives:     12283   1.620%  (3.175% of spam, 39385 weighted)
# Average score for spam: 17.4  nonspam: 0.4
# Average for false-pos: 5.8  false-neg: 3.2
# TOTAL:              758195  100.00%

==========================================
gen-set2-10-5.0-30000-ga   SCORESET 2 (no net, bayes)

test (10%):
# SUMMARY for threshold 5.0:
# Correctly non-spam:  29308  99.78%
# Correctly spam:      42344  95.69%
# False positives:        64   0.22%
# False negatives:      1907   4.31%
# TCR(l=50): 8.664774  SpamRecall: 95.690%  SpamPrec: 99.849%

scores (90%):
# SUMMARY for threshold 5.0:
# Correctly non-spam: 234375  39.745%  (99.864% of non-spam corpus)
# Correctly spam:     339736  57.612%  (95.700% of spam corpus)
# False positives:       320   0.054%  (0.136% of nonspam, 26164 weighted)
# False negatives:     15265   2.589%  (4.300% of spam, 58794 weighted)
# Average score for spam: 10.4  nonspam: 0.6
# Average for false-pos: 5.4  false-neg: 3.9
# TOTAL:              589696  100.00%

==========================================
gen-set3-20-5.0-20000-ga   SCORESET 3 (net, bayes)

test (10%):
# SUMMARY for threshold 5.0:
# Correctly non-spam:  29342  99.90%
# Correctly spam:      43843  99.08%
# False positives:        30   0.10%
# False negatives:       408   0.92%
# TCR(l=50): 23.192348  SpamRecall: 99.078%  SpamPrec: 99.932%

scores (90%):
# SUMMARY for threshold 5.0:
# Correctly non-spam: 234630  39.788%  (99.972% of non-spam corpus)
# Correctly spam:     351590  59.622%  (99.039% of spam corpus)
# False positives:        65   0.011%  (0.028% of nonspam, 10580 weighted)
# False negatives:      3411   0.578%  (0.961% of spam, 12054 weighted)
# Average score for spam: 18.5  nonspam: -0.1
# Average for false-pos: 5.4  false-neg: 3.5
# TOTAL:              589696  100.00%
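For reference, the summary figures above follow the standard total cost ratio definition, TCR(lambda) = Nspam / (lambda*FP + FN), as used by the SpamAssassin evaluation tooling; a small sketch (the function name is mine):

```python
def summarize(spam_ok, spam_fn, ham_fp, lam=50):
    """Recompute TCR, SpamRecall and SpamPrec from raw counts.

    spam_ok: correctly classified spam, spam_fn: false negatives,
    ham_fp: false positives; lam is the cost weight on an FP.
    """
    nspam = spam_ok + spam_fn
    tcr = nspam / (lam * ham_fp + spam_fn)
    recall = spam_ok / nspam
    precision = spam_ok / (spam_ok + ham_fp)
    return tcr, recall, precision
```

Plugging in the set3 test fold above (43843 correct spam, 408 FN, 30 FP) reproduces TCR(l=50) = 23.19, SpamRecall = 99.08%, SpamPrec = 99.93%.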
As can be seen from above, the scoreset 0 (no net tests, no bayes) is pretty much useless nowadays. The scoresets 1 and 2 come close, i.e. net tests are worth about as much as bayes. Of course the combination of all (set3) is an outstanding winner.
(In reply to comment #89)
> Created an attachment (id=4550) [details]
> resulting 50_scores.cf from garescorer runs
>
> Ok, here it is as last, the auto-generated 50_scores.cf from garescorer runs
> on all four sets, with no hand-tweaking of results (yet) ... to give us
> something to digest and comment on, and can serve as the first approximation.
> Some values are surprising or plain wrong, I'll comment on some later.

Bug #6156 RCVD_IN_PSBL

We should manually adjust this score to somewhere between 2.0 and 2.5, for these reasons:

* Rescore masschecks were done with deep parsing. We have subsequently changed it to lastexternal, which should be much safer. Even with deep parsing it proved to be very good.
* At the time of the rescore masschecks, PSBL's recent whitelist filtering of gmail, yahoo, rr.com and several other major ISPs had not yet timed out legitimate MTAs. Safety should be improved further now.
Bad news. Please remove the binnocenti logs from the rescore masschecks. Working with him, we discovered 50+ additional spam in his ham folders, and there is certainly more. Furthermore, his ham contains lots of automated low-quality sources like Bugzilla, trac, cron and log-monitoring daemons that should probably be removed from ham corpora. It seems the incorrect FPs and bias introduced by this corpus could be large enough to throw off scoring.

Did you also remove wt-en6 after we discovered that copying mail from a Yahoo account corrupts the messages?
(In reply to comment #56)
> Here is a set of rules in 50_scores.cf that I ended up as fixed (immutable)
> for the GA run (score set 3). Most of these are already documented and labeled
> as such, but it doesn't hurt to post it here as a double-check.

I suspect that RCVD_IN_DNSWL_* should be immutable as well; in the generated scores, there are counter-intuitive scores assigned (expected _HI < _MED < _LOW, observed _MED << _HI < _LOW).

https://svn.apache.org/repos/asf/spamassassin/trunk/rules/50_scores.cf has the following outside the "gen:mutable" section:

| score RCVD_IN_DNSWL_LOW 0 -1 0 -1
| score RCVD_IN_DNSWL_MED 0 -4 0 -4
| score RCVD_IN_DNSWL_HI  0 -8 0 -8

The DNSWL stats posted by Warren to the users list seem to indicate that this should be the correct ordering (at least based on safety):

|   SPAM%     HAM%  RANK  RULE
| 0.0016%  4.2489%  0.91  RCVD_IN_DNSWL_HI
| 0.0281%  6.9639%  0.90  RCVD_IN_DNSWL_MED
| 0.1147%  3.9169%  0.81  RCVD_IN_DNSWL_LOW
(In reply to comment #94)
> The DNSWL stats posted by Warren to the users list seem to indicate that this
> should be the correct ordering (at least based on safety):
>
> |   SPAM%     HAM%  RANK  RULE
> | 0.0016%  4.2489%  0.91  RCVD_IN_DNSWL_HI
> | 0.0281%  6.9639%  0.90  RCVD_IN_DNSWL_MED
> | 0.1147%  3.9169%  0.81  RCVD_IN_DNSWL_LOW

Those were yesterday's weekly results, not the rescore masscheck. Weekly results are a smaller sample size and lower confidence. This was the rescore masscheck:

http://ruleqa.spamassassin.org/20090930-r808953-n

  SPAM%      HAM%  RANK  RULE
0.0002%   0.3651%  0.75  RCVD_IN_DNSWL_HI
0.0288%  18.6970%  0.79  RCVD_IN_DNSWL_MED
0.0753%   8.1433%  0.68  RCVD_IN_DNSWL_LOW
Created attachment 4553 [details]
resulting 50_scores.cf from garescorer runs - V2

Here is a 50_scores.cf from my second attempt, after cleaning some logs: I removed the binnocenti and wt-en6 logs as per Comment 93, and removed the DKIM_ADSP_DISCARD hits from ham-bayes-net-bluestreak.log.

I have also limited the log entries to fewer months, following the RescoreMassCheck wiki procedure: -m 6 for spam, and -m 25 for ham (after the 25th month there is a large gap in the data until the next peak, too far in the past). This leaves us with the following number of entries in the merged logs:

score set 1 (no data from score set 3), provides data for set0 and set1:
  360070 ham-full-set1.log
  472682 spam-full-set1.log

score set 3, provides data for set2 and set3:
  210603 ham-full-set3.log
  442709 spam-full-set3.log

For the DCC_ rules, I took the DCC_CHECK value of 1.1 from a preliminary run which had all the DCC_REPUT_* scores fixed at 0; for the next run I fixed DCC_CHECK but left the DCC_REPUT_* scores floating. This should cope with both types of sites: those with a commercial license that do receive reputation results from DCC servers, and those that don't.
gen-set0-5-5.0-10000-ga

test (10%):
# SUMMARY for threshold 5.0:
# Correctly non-spam:  35461  98.50%
# Correctly spam:      38357  81.35%
# False positives:       541   1.50%
# False negatives:      8794  18.65%
# TCR(l=50): 1.315450  SpamRecall: 81.349%  SpamPrec: 98.609%

scores (90%):
# SUMMARY for threshold 5.0:
# Correctly non-spam: 283119  42.494%  (98.304% of non-spam corpus)
# Correctly spam:     306367  45.984%  (80.997% of spam corpus)
# False positives:      4886   0.733%  (1.696% of nonspam, 179777 weighted)
# False negatives:     71879  10.789%  (19.003% of spam, 231331 weighted)
# Average score for spam: 10.4  nonspam: 1.7
# Average for false-pos: 5.6  false-neg: 3.2
# TOTAL:              666251  100.00%

gen-set1-10-5.0-10000-ga

test (10%):
# SUMMARY for threshold 5.0:
# Correctly non-spam:  35942  99.83%
# Correctly spam:      45983  97.52%
# False positives:        60   0.17%
# False negatives:      1168   2.48%
# TCR(l=50): 11.312620  SpamRecall: 97.523%  SpamPrec: 99.870%

scores (90%):
# SUMMARY for threshold 5.0:
# Correctly non-spam: 287639  43.173%  (99.873% of non-spam corpus)
# Correctly spam:     368783  55.352%  (97.498% of spam corpus)
# False positives:       366   0.055%  (0.127% of nonspam, 27040 weighted)
# False negatives:      9463   1.420%  (2.502% of spam, 29645 weighted)
# Average score for spam: 20.3  nonspam: 0.2
# Average for false-pos: 5.6  false-neg: 3.1
# TOTAL:              666251  100.00%

gen-set2-10-5.0-10000-ga

test (10%):
# SUMMARY for threshold 5.0:
# Correctly non-spam:  35949  99.85%
# Correctly spam:      44538  94.46%
# False positives:        53   0.15%
# False negatives:      2613   5.54%
# TCR(l=50): 8.958959  SpamRecall: 94.458%  SpamPrec: 99.881%

scores (90%):
# SUMMARY for threshold 5.0:
# Correctly non-spam: 287557  43.160%  (99.844% of non-spam corpus)
# Correctly spam:     357656  53.682%  (94.556% of spam corpus)
# False positives:       448   0.067%  (0.156% of nonspam, 33456 weighted)
# False negatives:     20590   3.090%  (5.444% of spam, 73371 weighted)
# Average score for spam: 12.3  nonspam: 0.8
# Average for false-pos: 5.7  false-neg: 3.6
# TOTAL:              666251  100.00%

gen-set3-20-5.0-10000-ga

test (10%):
# SUMMARY for threshold 5.0:
# Correctly non-spam:  21173  99.92%
# Correctly spam:      43749  99.08%
# False positives:        17   0.08%
# False negatives:       404   0.92%
# TCR(l=50): 35.209729  SpamRecall: 99.085%  SpamPrec: 99.961%

scores (90%):
# SUMMARY for threshold 5.0:
# Correctly non-spam: 168159  32.186%  (99.976% of non-spam corpus)
# Correctly spam:     350875  67.159%  (99.046% of spam corpus)
# False positives:        40   0.008%  (0.024% of nonspam, 9039 weighted)
# False negatives:      3379   0.647%  (0.954% of spam, 11476 weighted)
# Average score for spam: 19.3  nonspam: -0.8
# Average for false-pos: 5.4  false-neg: 3.4
# TOTAL:              522453  100.00%

===========

In summary, the essential data:

score set 0 (no net, no bayes):
# False positives:   4886   0.733%  (1.696% of nonspam, 179777 weighted)
# False negatives:  71879  10.789%  (19.003% of spam, 231331 weighted)

score set 1 (net, no bayes):
# False positives:    366   0.055%  (0.127% of nonspam, 27040 weighted)
# False negatives:   9463   1.420%  (2.502% of spam, 29645 weighted)

score set 2 (no net, bayes):
# False positives:    448   0.067%  (0.156% of nonspam, 33456 weighted)
# False negatives:  20590   3.090%  (5.444% of spam, 73371 weighted)

score set 3 (net, bayes):
# False positives:     40   0.008%  (0.024% of nonspam, 9039 weighted)
# False negatives:   3379   0.647%  (0.954% of spam, 11476 weighted)
The RCVD_IN_DNSWL_* scores are again unusual:

score RCVD_IN_DNSWL_HI  0 -0.466 0 -0.001
score RCVD_IN_DNSWL_LOW 0 -0.292 0 -0.760
score RCVD_IN_DNSWL_MED 0 -1.703 0 -0.727

probably because of their low frequency, especially for the _HI rule:

OVERALL   SPAM%     HAM%   S/O    RANK  SCORE  NAME
  0.184  0.0007   0.5707   0.001  0.76  -1.00  RCVD_IN_DNSWL_HI
  7.408  0.1096  22.7509   0.005  0.67  -1.00  RCVD_IN_DNSWL_MED
  2.553  0.1816   7.5365   0.024  0.59  -1.00  RCVD_IN_DNSWL_LOW

and the resulting zero ranges (tmp/ranges.data):

0.000 0.000 0 RCVD_IN_DNSWL_HI
0.000 0.000 0 RCVD_IN_DNSWL_MED
0.000 0.000 0 RCVD_IN_DNSWL_LOW

I don't know what a clean solution is, apart from fixing their scores manually.
I just discovered that I was falsely triggering rules like RCVD_IN_SORBS_DUL, RCVD_IN_PBL or RDNS_DYNAMIC on some of my corpus ham, due to a misconfiguration on my server. Mail from my users delivering directly to other users on my server from their home ISP or mobile phone was lacking the "authenticated user" marker within the Received header, causing many hits on these (and unknown other) rules. Roughly 150-170 of my FPs on these three rules should not count against them; nearly all of my RCVD_IN_SORBS_DUL and RCVD_IN_PBL hits should have been AllTrusted instead. Is this enough to throw off the GA scoring?
Btw, I added a "Target Milestone" of 3.3.1, so that triage of 3.3.0 bugs can be more selective, choosing between Future/Undefined/3.3.1.
(In reply to comment #98)
> The RCVD_IN_DNSWL_* scores are again unusual:
>
> score RCVD_IN_DNSWL_HI  0 -0.466 0 -0.001
> score RCVD_IN_DNSWL_LOW 0 -0.292 0 -0.760
> score RCVD_IN_DNSWL_MED 0 -1.703 0 -0.727
>
> probably because of their low frequency, especially the _HI rule:
>
> OVERALL   SPAM%     HAM%   S/O    RANK  SCORE  NAME
>   0.184  0.0007   0.5707   0.001  0.76  -1.00  RCVD_IN_DNSWL_HI
>   7.408  0.1096  22.7509   0.005  0.67  -1.00  RCVD_IN_DNSWL_MED
>   2.553  0.1816   7.5365   0.024  0.59  -1.00  RCVD_IN_DNSWL_LOW
>
> and resulting zero ranges (tmp/ranges.data):
> 0.000 0.000 0 RCVD_IN_DNSWL_HI
> 0.000 0.000 0 RCVD_IN_DNSWL_MED
> 0.000 0.000 0 RCVD_IN_DNSWL_LOW
>
> Don't know what a clean solution is, apart from fixing their scores
> manually.

feel free to fix them; it's hard for the GA to be mostly right about network rules. tbh I'm surprised the ranges were zeroed (for _MED at least).
(In reply to comment #99)
> I just discovered that I was falsely triggering rules like RCVD_IN_SORBS_DUL,
> RCVD_IN_PBL or RDNS_DYNAMIC on some of my corpus ham due to a misconfiguration
> on my server. My users delivering mail directly to other users on my server
> from their home ISP or mobile phone were lacking "authenticated user" within
> the Received header, causing many hits on these and other, unknown rules.
> Roughly ~150-170 of my FPs on these three rules should not count against
> those rules. Nearly all of my RCVD_IN_SORBS_DUL and RCVD_IN_PBL should have
> been AllTrusted instead. Is this enough to throw off the GA scoring?

if you want, feel free to sed the log files to fix this, or just remove the lines entirely, and reupload. 170 FPs for those DUL rules is quite strong imo.
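For anyone facing a similar cleanup, here's a minimal sketch of the sed approach suggested above. The sample log line and its layout are made-up placeholders, not the actual mass-check format -- inspect a line of your own logs before editing anything for real:

```shell
# Hedged sketch: strip misfiring rule names from ham log lines before
# re-uploading. We create a throwaway sample log with an assumed format.
mkdir -p /tmp/masscheck-demo
cd /tmp/masscheck-demo
printf 'Y 12 /corpus/ham/0001 RULE_A,RCVD_IN_PBL,RDNS_DYNAMIC,RULE_B\n' \
  > ham-rescore-wt-en1.log

for f in ham-rescore-wt*.log; do
  # remove just the bogus rule hits, keeping the rest of each line
  sed -i.bak -e 's/RCVD_IN_SORBS_DUL,//g' \
             -e 's/RCVD_IN_PBL,//g' \
             -e 's/RDNS_DYNAMIC,//g' "$f"
  # or instead, drop the affected lines outright:
  # grep -v -e RCVD_IN_SORBS_DUL -e RCVD_IN_PBL "$f" > "$f.clean"
done
cat ham-rescore-wt-en1.log   # -> Y 12 /corpus/ham/0001 RULE_A,RULE_B
```

Note the edge case: a hit at the end of a line has no trailing comma, so a real cleanup would need a second pattern to catch that position as well.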
> if you want, feel free to sed the log files to fix this, or just remove the
> lines entirely, and reupload. 170 FPs for those DUL rules is quite strong imo.

Removed the majority of the offending lines and reuploaded ham-rescore-wt*.log. I also zeroed out *wt-en6.log, because those logs were found to be too corrupted to trust the results.
(In reply to comment #103)
> Removed the majority of the offending lines and reuploaded ham-rescore-wt*.log.
> I also zeroed out *wt-en6.log because they were found to be too corrupted to
> trust the results.

Thanks. It seems you did this in the 'corpus' rsync directory. Please also update the files in the 'submit' directory, using the existing names, otherwise in a few weeks' time we'll all forget which file came from where -- after all, the 'submit' directory is the official source for rescoring runs.
Argh, late to the show, sorry. :-/

From the second GA re-score run, attachment 4553 [details] (aligned for readability):

  score KB_RATWARE_MSGID       4.099 3.315 4.095 1.475

This is awesome! :) Though unrelated, so let me move on to the issue.

  score KB_RATWARE_OUTLOOK_08  1.100 3.232 0.776 0.025
  score KB_RATWARE_OUTLOOK_12  2.734 2.826 1.654 0.041
  score KB_RATWARE_OUTLOOK_16  1.725 3.331 2.532 0.887
  score KB_RATWARE_OUTLOOK_MID 2.259 2.485 3.121 0.001

This is also awesome -- kind of. But frankly, it is also a total mess. These rules are essentially the same, differing only slightly in strictness or fuzziness. They overlap almost *exactly* -- *all* four of them (see ruleqa).

These rules are really redundant; there should be only one instead. FWIW, that one *should* be KB_RATWARE_BOUNDARY, which was added specifically for this purpose. That rule seems to be missing entirely, though. :(

Looking at the scores, I don't think simply adding them up would do.

Also, I'm somewhat unsatisfied with the score-set 3 scores. The FP rate is 0! (Almost -- I'll challenge the ham hits.) For all five rules above. Net tests or not...
> Thanks. Seems you did it in the 'corpus' rsync directory. Please also update
> them in the 'submit' directory using existing names, otherwise in a few weeks'
> time we'll all forget which file came from where - after all, the 'submit'
> directory is the official source for rescoring runs.

Fixed in 'submit'.
(In reply to comment #105)
> score KB_RATWARE_OUTLOOK_08  1.100 3.232 0.776 0.025
> score KB_RATWARE_OUTLOOK_12  2.734 2.826 1.654 0.041
> score KB_RATWARE_OUTLOOK_16  1.725 3.331 2.532 0.887
> score KB_RATWARE_OUTLOOK_MID 2.259 2.485 3.121 0.001
>
> This is also awesome -- kind of. But frankly, it also is a total mess. They
> are essentially the same, just slightly differing in strictness or fuzziness.
> They are almost *exactly* overlapping -- *all* four of them (see ruleqa).
>
> These rules are really redundant, and there should be only one instead. FWIW,
> that *should* be KB_RATWARE_BOUNDARY, which was added specifically for this.
> This rule seems to be missing entirely, though. :(
>
> Looking at the scores, I don't think simply adding them would do.
>
> Also, I'm kind of unsatisfied with the score-set 3 scores. The FP rate is 0!
> (Almost, I'll challenge the ham hits.) For all five rules above. Net tests or
> not...

it looks like they overlap a lot with some other rules. But yes, if they were just one rule, it probably would have gotten a better single score.

I'm not sure if it's too late to fix this or not. :(
(In reply to comment #107)
> it looks like they overlap a lot with some other rules. But yes, if they were
> just 1 rule, it probably would have gotten a better single score.
>
> I'm not sure if it's too late to fix this or not. :(

Frankly, pretty much any one of them could be used, and all the other variants simply dropped for the next re-score run. Keeping all of them is just a waste of cycles.

The important questions are: where is KB_RATWARE_BOUNDARY, which was specifically pushed right before the deadline to supersede these? And of course, why do the scores drop that drastically with score-set 3 if there are *no* FPs? Regardless of the spam already scoring above 5, there is no FP reason to lower the score.
(In reply to comment #108)
> The important questions are, where is KB_RATWARE_BOUNDARY, which was
> specifically pushed right before the deadline to supersede these?

Argh! It is in freqs.full, attachment 4541 [details]. However, it appears we've been using inconsistent rule-sets, with most contributors using one outdated rule-set or the other. :-(

  10.830  14.1437  0.1901  0.987  0.67  0.00  T_KB_RATWARE_BOUNDARY
   0.025   0.0327  0.0000  1.000  0.65  1.00  KB_RATWARE_BOUNDARY
(In reply to comment #109)
> (In reply to comment #108)
> > The important questions are, where is KB_RATWARE_BOUNDARY, which was
> > specifically pushed right before the deadline to supersede these?
>
> Argh! It is in freqs.full, attachment 4541 [details]. However, it appears
> we've been using inconsistent rule-sets, with most contributors using one
> outdated rule-set or the other. :-(
>
>   10.830  14.1437  0.1901  0.987  0.67  0.00  T_KB_RATWARE_BOUNDARY
>    0.025   0.0327  0.0000  1.000  0.65  1.00  KB_RATWARE_BOUNDARY

mysterious:

  : exit=[130] uid=jm Tue Oct 20 10:40:30 GMT 2009; cd /export/home/corpus-rsync/corpus/submit
  : 6...; grep KB_RATWARE_BOUNDARY *.log | grep -v T_KB_RATWARE_BOUNDARY
  : exit=[0 1] uid=jm Tue Oct 20 10:43:41 GMT 2009; cd /export/home/corpus-rsync/corpus/submit

I can't find any non-T_ hits in the submit logs. Mark?
(In reply to comment #110)
> (In reply to comment #109)
> > (In reply to comment #108)
> > > The important questions are, where is KB_RATWARE_BOUNDARY, which was
> > > specifically pushed right before the deadline to supersede these?

anyway.... it doesn't look like that rule is good enough to supersede them:

  10.830  14.1437  0.1901  0.987  0.67  0.00  T_KB_RATWARE_BOUNDARY

vs.

  9.846  12.9126  0.0003  1.000  0.98  1.00  KB_RATWARE_OUTLOOK_08
  9.836  12.8985  0.0003  1.000  0.98  1.00  KB_RATWARE_OUTLOOK_MID
  9.835  12.8976  0.0003  1.000  0.98  1.00  KB_RATWARE_OUTLOOK_16
  9.835  12.8976  0.0003  1.000  0.98  1.00  KB_RATWARE_OUTLOOK_12

that's a much higher FP rate!
> anyway.... it doesn't look like that rule is good enough to supersede them:
> that's a much higher FP rate!

Yes. It's all Warren's fault! ;)

Seriously, the new BOUNDARY rule does indeed have quite a few FPs, all in Warren's corpus, and he kindly provided me with the samples. It appears these are all entirely legitimate, though auto-generated, messages. I wish MS wouldn't re-use their code like that.

  X-Mailer: Microsoft CDO for Windows 2000

Anyway, I agree -- RATWARE_BOUNDARY is bad; I screwed up with too low a range between headers. One of the previous rules needs to be kept. (The massive overlap, along with the introduced FNs, made it drop off the active rules.)

Still wondering why there are different rule names in freqs.
> 9.836  12.8985  0.0003  1.000  0.98  1.00  KB_RATWARE_OUTLOOK_MID

Proposing the MID variant for inclusion, and dropping the other variants. The BOUNDARY one is bad, and the variants have an almost 100% overlap with the MID one. It's also the strictest one. (A funny side-effect of the additional constraint is actually catching a spam or two more... go figure.)

The one ham hit probably is not really ham (no FPs in the nightlies).
(In reply to comment #113)
> > 9.836  12.8985  0.0003  1.000  0.98  1.00  KB_RATWARE_OUTLOOK_MID
>
> Proposing the MID variant for inclusion, and dropping the other variants.

can you list exactly which rules you want zeroed, before Mark reruns the GA accordingly? minimize the work he has to do ;)
Err, sure. :)  The following variants should just be dropped:

  score KB_RATWARE_OUTLOOK_08 0
  score KB_RATWARE_OUTLOOK_12 0
  score KB_RATWARE_OUTLOOK_16 0
  score KB_RATWARE_BOUNDARY   0

Keep KB_RATWARE_OUTLOOK_MID (instead of the above) and KB_RATWARE_MSGID (which is an unrelated rule anyway).
Standing up for RDNS_NONE ...

http://ruleqa.spamassassin.org/week/RDNS_NONE/detail

bb_trec_enron has 98.9497% of its ham matching RDNS_NONE, which is to say that corpus is bogus. Discounting it, RDNS_NONE matches 58.7244% of the total spam corpus and only 1.7463% of the total ham corpus (down from 12.1273%), which makes the rule far more interesting.

Many of the people on the sa-users list have manually scored RDNS_NONE higher than the default 0.1. I score it at 0.9 on my own production servers.

(Not sure if this is the right venue -- or if I'm an approved kibitzer.)
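The corpus-discounting arithmetic above is straightforward to reproduce: subtract the one corpus's hits and message count from the totals before dividing. A sketch with made-up placeholder counts (the real masscheck totals are not given in this thread):

```shell
# Hedged sketch: recompute a rule's ham hit-rate after discounting one
# corpus. All four counts below are hypothetical placeholders.
total_hits=33000 total_ham=272000   # rule hits / messages, whole ham corpus
enron_hits=28000 enron_ham=28300    # rule hits / messages, corpus to discount
awk -v th=$total_hits -v tn=$total_ham -v eh=$enron_hits -v en=$enron_ham \
  'BEGIN { printf "%.4f%%\n", 100 * (th - eh) / (tn - en) }'
# -> 2.0517%
```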
> bb_trec_enron has 98.9497% of its ham match RDNS_NONE, which is to say that
> it's bogus.

Indeed. From the dev list earlier today, that's "a corpus with generated (synthetic) headers [...], only useful for body hits", and it is not included in the re-scoring.

> Many of the people on the sa-users list have
> manually scored RDNS_NONE higher than the default 0.1.

FWIW, nailed to 0.1 as per comment 56.
(In reply to comment #117)
> > bb_trec_enron has 98.9497% of its ham match RDNS_NONE, which is to say
> > that it's bogus.
>
> Indeed. From the dev list earlier today, that's "a corpus with generated
> (synthetic) headers [...], only useful for body hits", and is not included
> in the re-scoring.

Ah, I thought I saw that corpus mentioned somewhere ... I only thought to search this bug, though. I had assumed that if the ruleqa page mentioned it, it was factored in everywhere.

> > Many of the people on the sa-users list have
> > manually scored RDNS_NONE higher than the default 0.1.
>
> FWIW, nailed to 0.1 as per comment 56.

I saw that but did not understand it ... it says "most of these are already documented and labeled as [fixed/immutable]", but it doesn't say where. Is this because the rule triggers when rDNS checks aren't performed by the first trusted relay, and if so, can we work around that somehow (wasn't that bug 5586?)? Or is this a remnant of Justin's checkin r497852 from 2007, which states:

> move 20_dynrdns.cf from sandbox into main ruleset, so RDNS_DYNAMIC
> and RDNS_NONE are core rules; lock their scores to an informational
> 0.1, however, since they still have a high ham hit-rate alone

... despite the current corpus data (unless 1.7% is a high ham hit-rate)?
(In reply to comment #118)
> ... despite the current corpus data (unless 1.7% is a high ham hit-rate)?

http://ruleqa.spamassassin.org/20091017-r826198-n/RDNS_NONE/detail

The most recent weekly run has pretty substantial hits even outside of the synthetic corpus.

Adam, this and your RCVD_IN_APNIC are examples of inherently prejudiced rules. Such a rule might work for the most part, and you might accept the risk of accidental FPs because the score alone won't push a message above the threshold. However, the combined risk of multiple prejudiced rules is too great. Prejudiced rules should be left up to the sysadmin to enable; we should not give any known prejudiced rule a high score in the default ruleset.
(In reply to comment #119)
> (In reply to comment #118)
> > ... despite the current corpus data (unless 1.7% is a high ham hit-rate)?
>
> http://ruleqa.spamassassin.org/20091017-r826198-n/RDNS_NONE/detail
> The most recent weekly run has pretty substantial hits even outside of
> the synthetic corpus.

Your link is just a longer version of mine. It still shows a 1.7% total ham hit-rate. Is that too substantial? Is there detail on what each corpus is (specifically nbebout, since that's the only other corpus that hit 4+% of spam)?

Looking only at ham scoring 4 or higher (including enron, since I can't remove it), RDNS_NONE hit 0.8528% of the total ham corpus. Of the ham scoring JUST 4 (4.0-4.99999), we're looking at 0.5865% that would become FPs assuming a score of 1.1 (increasing the 0.1 by 1) -- and I'm not even proposing my own implementation's 0.9.

> Adam, this [... and] your RCVD_IN_APNIC are examples of inherently
> prejudiced rules. It might work for the most part, and you might accept
> the risk of accidental FPs because the score alone won't push it above
> the threshold. However the combined risks of multiple prejudiced rules
> is too great. Prejudiced rules should be up to the sysadmin if they want
> to enable. We should not highly score any known prejudiced rules in the
> default ruleset.

I agree that RCVD_VIA_APNIC is a prejudiced rule, and my channels have had it rated 0.001 ever since you called me out on it (RCVD_VIA_APNIC accidentally came in when I migrated from internal-only propagation to a published channel). KHOP_NO_FIRST_NAME, my other poorly-considered published test, pre-dates my more thorough testing mechanism (which has limited new rules' entry quite considerably). My rules will get cleaned up even more once I get an svn account to test them here. (Some of them, like the biased RCVD_IN_APNIC and the quasi-biased/unfair KHOP_SC_CIDR8, would either never get pushed up for testing or would get the nopublish flag, depending on the guidelines ... that nobody has yet pointed me to.)

(Side note: I see __RCVD_VIA_APNIC is already in your own sandbox, hitting 86% of all Japanese ham.)

Getting back to this issue: I don't see any problem with prejudice against poorly constructed network infrastructures whose operators can't be bothered to adhere to the SMTP standard (RFC 1912, section 2.1). This is something that any network admin who should legitimately be managing a mail server could fix with a single phone call (please correct me if this sentence is prejudiced in any way). The SMTP standard requires that a server's rDNS match the server's reported name (thus the IP must have rDNS), and most allocated IPs have rDNS anyway (even if it's wrong or ~dynamic, e.g. RDNS_DYNAMIC). There is also a growing number of deployments that block improper FCrDNS at the door (RDNS_NONE is a subset of failing FCrDNS).

SA already has built-in "prejudices" against poorly constructed email clients (e.g. MISSING_HEADERS) and relays (e.g. DATE_IN_FUTURE_48_96), so why not the network? Isn't SPF_FAIL a "prejudiced" test against network configuration?

SA at its core is merely a system of probabilities. Even without bayes, the masscheck mechanism awards its points based on statistical significance. Very few rules are actually free of FPs (or FNs, for negative-scoring rules). My question still stands: what does SA deem statistically significant when it comes to FPs? Why does RDNS_NONE need to be immutable rather than dictated by the masscheck results? What would the automated system score RDNS_NONE if it were allowed to? I'm guessing something between 0.2 and 0.7.
(In reply to comment #120)
> I agree that RCVD_VIA_APNIC is a prejudiced rule, and my channels have had it
> rated 0.001 ever since you called me out on it (RCVD_VIA_APNIC accidentally

OK, glad to hear that you reduced it; I didn't look at your scores after that first time. You should really get a spamassassin account so your rules can be more thoroughly tested against more varied corpora.

> nobody has yet pointed me to.) (Side note: I see __RCVD_VIA_APNIC is already
> in your own sandbox, hitting 86% of all Japanese ham.)

Yes, I'm using it as a softener to exclude from the extremely prejudiced CN_<NUMBER> rules. It just so happens that the majority of CN_<NUMBER> spam comes from !APNIC, and APNIC is prejudiced in exactly the way that makes the CN_<NUMBER> rules less dangerous. Even though those rules have high spam hit-rates and zero FPs across our nightly masscheck corpora, they are still too prejudiced to be safe as default rules.

> SA at its core is merely a system of probabilities. Even without bayes, the
> masscheck mechanism awards its points based on statistical significance.
> Very few rules are actually free of FPs (or FNs, for negative-scoring
> rules). My question still stands: what does SA deem statistically
> significant when it comes to FPs? Why does RDNS_NONE need to be immutable
> rather than dictated by the masscheck results? What would the automated
> system score RDNS_NONE if it were allowed to? I'm guessing something
> between 0.2 and 0.7.

That is an interesting question.
Some bugs in the auto-generated rules from attachment 4553 [details]:

HTML_MESSAGE scores WAY too high. There are others too. Full list as of right now:

  MSECS  SPAM%    HAM%     S/O    RANK  SCORE  NAME
  0       0.1848   4.8675  0.037  0.78  0.00   SPF_HELO_PASS
  0       0.3294   5.5859  0.056  0.74  0.00   SPF_PASS
  0      12.2476   1.2568  0.907  0.58  0.00   RCVD_IN_BL_SPAMCOP_NET
  0      50.4453   3.7391  0.931  0.57  2.30   MIME_HTML_ONLY
  0      49.9300  12.1231  0.805  0.52  0.10   RDNS_NONE
  0       3.8466   1.8427  0.676  0.51  2.30   SUBJ_ALL_CAPS
  0       2.3989   1.3218  0.645  0.50  0.00   UNPARSEABLE_RELAY
  0      83.7769  40.8865  0.672  0.49  0.00   HTML_MESSAGE
  0       3.4477   3.8932  0.470  0.47  2.50   MIME_QP_LONG_LINE
  0      12.2361  15.6252  0.439  0.46  0.00   FREEMAIL_FROM
  0       0.7695   1.2102  0.389  0.41  2.90   TVD_SPACE_RATIO
  0       0.4610   1.2409  0.271  0.35  1.00   EXTRA_MPART_TYPE
  0       0.0271   1.0700  0.025  0.15  1.22   MSGID_MULTIPLE_AT

  score SPF_HELO_PASS -0.001
  score SPF_PASS -0.001
  score RCVD_IN_BL_SPAMCOP_NET 0 1.725 0 1.180 # n=2
  score MIME_HTML_ONLY 1.474 0.737 0.829 0.462
  score RDNS_NONE 0.1
  score SUBJ_ALL_CAPS 0.264 1.568 0.593 1.045
  score UNPARSEABLE_RELAY 0.001
  score HTML_MESSAGE 2.199 0.838 1.473 0.511
  score MIME_QP_LONG_LINE 0.074 0.242 0.116 0.002
  score FREEMAIL_FROM 0.817 1.020 0.401 0.856
  score TVD_SPACE_RATIO 0.001 0.201 0.398 0.001
  score MSGID_MULTIPLE_AT 0.001 0.001 0.598 0.000

To fetch them for yourself (so as to get something more up-to-date, or from a different URL, etc.), here's the code I ran (sorry, I know POSIX shell better than perl, so I dip into both):

  elinks -dump http://ruleqa.spamassassin.org/ |
    perl -ne 'print if /(\s+[\d.]+){2}\s+[1-9][\d.]+(\s+[\d.]+){3}\s+(?!T_)\w|\sMSECS/' |
    tee rules.txt

  for rule in $(perl -ne 'if (/.*\s([A-Z]+\w*_\w*)/) { s//$1/; print; }' < rules.txt); do
    grep "^[^#]* $rule " /tmp/50_scores_newest.cf
  done

That could probably be written better, e.g. looking for ham% > spam% in addition to ham% > 0.9999%, but this is a good first pass.
Obviously, /removing/ fixed scores for things like RDNS_NONE can't possibly be considered until the GA is a little more apt at figuring this sort of thing out.
(In reply to comment #122) sorry, that should be:

  elinks -dump http://ruleqa.spamassassin.org/ |
    perl -ne 'print if /(\s+[\d.]+){2}\s+[1-9][\d.]+(\s+[\d.]+){3}\s+(?!T_)\w|\sMSECS/' |
    tee rules.txt

  for rule in $(perl -ne 'if (/.*\s([A-Z]+\w*_\w*)/) { s//$1/; print; }' < rules.txt); do
    grep "^[^#]* $rule " /tmp/50_scores_newest.cf || echo "score $rule UNKNOWN"
  done

With each of those two stanzas living on just one line. Obviously, ignore the genuine ham rules.
Created attachment 4558 [details]
resulting 50_scores.cf from garescorer runs - V3

Attached is the latest 50_scores.cf file, obtained over a couple of iterations during the last few days. It takes into account the updated results files from the rsync submit area, in particular the updated net-wt* (comments 99, 102, 103) and net-hege* files. The binnocenti* files are still excluded; the rest of the corpora tweaks/decimation are as per my previous run, comment 96.

The RCVD_IN_DNSWL_* scores are hand-tweaked (according to comment 101), otherwise the _MED stands out above the _HI due to its significantly higher hit rate.

KB_RATWARE_OUTLOOK_08, KB_RATWARE_OUTLOOK_12, KB_RATWARE_OUTLOOK_16 and KB_RATWARE_BOUNDARY are now zeroed out according to comment 115.

I tried leaving RDNS_NONE and RDNS_DYNAMIC floating (comments 116, 120, 122), and the obtained scores seem perfectly sensible and useful to me, yet not so high as to punish incompetent admins too hard:

  score RDNS_NONE    0 1.1 0 0.7
  score RDNS_DYNAMIC 0 0.5 0 0.5

so I'm leaving these floating.

According to comment 122 I zeroed out (actually, 0.001'd out) HTML_MESSAGE, MIME_QP_LONG_LINE, FREEMAIL_FROM, TVD_SPACE_RATIO, and MSGID_MULTIPLE_AT.

Some further tweaks: I reduced the BAYES scores somewhat (e.g. from 4.5 to 3.5 for BAYES_99 in scoreset 3) and tamed down BAYES_50, which was standing out from the crowd.

For the DCC_* rules I used the already-described approach: obtain the DCC_CHECK score from a GA run with all DCC_REPUT_* zeroed out, then fix the obtained DCC_CHECK, and let DCC_REPUT_* float for the final run.

NML_ADSP_CUSTOM_MED was obtained from a GA run, but the others (_LOW, _HIGH) were set manually (currently no hits). DKIM_ADSP_ALL, DKIM_ADSP_DISCARD, and DKIM_ADSP_NXDOMAIN are based on GA runs, but hand-tweaked somewhat due to inconsistencies between corpora.
A word about JM_SOUGHT_FRAUD_{1,2,3}: these three rules come out of a GA run with scores between 2 and 3, but are somewhat inconsistent between runs and corpora. As requested in comment 38, their scores were fixed at zero for the final run, but I'd set these manually to 2.2 each for the published 50_scores.cf.

After preparing my manual fixes from a couple of trial runs, I made a final run for each scoreset with these fixed scores, so as to allow the other scores to adjust themselves to the new constraints. So here are the manual fixes:

  score SPF_PASS -0.001
  score SPF_HELO_PASS -0.001
  score BAYES_00 0 0 -1.2 -1.9
  score BAYES_05 0 0 -0.2 -0.5
  score BAYES_20 0 0 -0.001 -0.001
  score BAYES_40 0 0 -0.001 -0.001
  score BAYES_50 0 0 2.0 0.8
  score BAYES_60 0 0 2.5 1.5
  score BAYES_80 0 0 2.7 2.0
  score BAYES_95 0 0 3.2 3.0
  score BAYES_99 0 0 3.8 3.5
  score RCVD_IN_DNSWL_LOW 0 -0.6 0 -1.1
  score RCVD_IN_DNSWL_MED 0 -1.5 0 -1.2
  score RCVD_IN_DNSWL_HI 0 -1.8 0 -1.8
  score HTML_MESSAGE 0.001
  score NO_RELAYS -0.001
  score UNPARSEABLE_RELAY 0.001
  score NO_RECEIVED -0.001
  score NO_HEADERS_MESSAGE 0.001
  score DKIM_ADSP_ALL 0 1.1 0 0.8
  score DKIM_ADSP_DISCARD 0 1.8 0 1.8
  score DKIM_ADSP_NXDOMAIN 0 0.8 0 0.9
  score NML_ADSP_CUSTOM_LOW 0 0.7 0 0.7
  score NML_ADSP_CUSTOM_MED 0 1.2 0 0.9
  score NML_ADSP_CUSTOM_HIGH 0 2.6 0 2.5
  score JM_SOUGHT_FRAUD_1 0
  score JM_SOUGHT_FRAUD_2 0
  score JM_SOUGHT_FRAUD_3 0
  score MIME_QP_LONG_LINE 0.001
  score FREEMAIL_FROM 0.001
  score TVD_SPACE_RATIO 0.001
  score MSGID_MULTIPLE_AT 0.001
  score EXTRA_MPART_TYPE 1.0
  score RDNS_NONE 0 1.1 0 0.7
  score RDNS_DYNAMIC 0 0.5 0 0.5
  score KB_RATWARE_OUTLOOK_08 0
  score KB_RATWARE_OUTLOOK_12 0
  score KB_RATWARE_OUTLOOK_16 0
  score KB_RATWARE_BOUNDARY 0
$ head test scores

=================================
score set 3 (net, bayes) - gen-set3-20-5.0-12200-ga

test (10%):
# SUMMARY for threshold 5.0:
# Correctly non-spam:  21172  99.93%
# Correctly spam:      43597  98.78%
# False positives:        14   0.07%
# False negatives:       537   1.22%
# TCR(l=50): 35.678254  SpamRecall: 98.783%  SpamPrec: 99.968%

scores (90%):
# SUMMARY for threshold 5.0:
# Correctly non-spam: 168143  32.193%  (99.979% of non-spam corpus)
# Correctly spam:     349734  66.961%  (98.763% of spam corpus)
# False positives:        36   0.007%  (0.021% of nonspam, 8360 weighted)
# False negatives:      4382   0.839%  (1.237% of spam, 14401 weighted)
# Average score for spam: 21.1  nonspam: -2.2
# Average for false-pos: 5.5  false-neg: 3.3
# TOTAL: 522295 100.00%

=================================
score set 2 (no net, bayes) - gen-set2-10-5.0-12200-ga

test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  21148  99.82%
# Correctly spam:      41172  93.29%
# False positives:        38   0.18%
# False negatives:      2962   6.71%
# TCR(l=50): 9.077334  SpamRecall: 93.289%  SpamPrec: 99.908%

scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 167953  32.157%  (99.866% of non-spam corpus)
# Correctly spam:     329931  63.169%  (93.170% of spam corpus)
# False positives:       226   0.043%  (0.134% of nonspam, 26882 weighted)
# False negatives:     24185   4.631%  (6.830% of spam, 89229 weighted)
# Average score for spam: 10.8  nonspam: -0.7
# Average for false-pos: 5.6  false-neg: 3.7
# TOTAL: 522295 100.00%

=================================
score set 1 (net, no bayes) - gen-set1-10-5.0-12201-ga

test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  21155  99.85%
# Correctly spam:      43153  97.78%
# False positives:        31   0.15%
# False negatives:       981   2.22%
# TCR(l=50): 17.437377  SpamRecall: 97.777%  SpamPrec: 99.928%

scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 168012  32.168%  (99.901% of non-spam corpus)
# Correctly spam:     346216  66.287%  (97.769% of spam corpus)
# False positives:       167   0.032%  (0.099% of nonspam, 20194 weighted)
# False negatives:      7900   1.513%  (2.231% of spam, 23052 weighted)
# Average score for spam: 19.8  nonspam: -0.5
# Average for false-pos: 5.7  false-neg: 2.9
# TOTAL: 522295 100.00%

=================================
score set 0 (no net, no bayes) - gen-set0-5-5.0-12201-ga

test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  20919  98.74%
# Correctly spam:      34081  77.22%
# False positives:       267   1.26%
# False negatives:     10053  22.78%
# TCR(l=50): 1.885827  SpamRecall: 77.222%  SpamPrec: 99.223%

scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 166261  31.833%  (98.860% of non-spam corpus)
# Correctly spam:     271409  51.965%  (76.644% of spam corpus)
# False positives:      1918   0.367%  (1.140% of nonspam, 126535 weighted)
# False negatives:     82707  15.835%  (23.356% of spam, 235514 weighted)
# Average score for spam: 10.4  nonspam: 0.6
# Average for false-pos: 6.3  false-neg: 2.8
# TOTAL: 522295 100.00%

=================================
In summary:

set 3  # False positives:    36  (0.021% of nonspam)
       # False negatives:  4382  (1.237% of spam)
set 2  # False positives:   226  (0.134% of nonspam)
       # False negatives: 24185  (6.830% of spam)
set 1  # False positives:   167  (0.099% of nonspam)
       # False negatives:  7900  (2.231% of spam)
set 0  # False positives:  1918  (1.140% of nonspam)
       # False negatives: 82707  (23.356% of spam)
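For readers unfamiliar with the metric, the TCR figures in these summaries follow the standard weighted total-cost-ratio formula, TCR(l) = Nspam / (l*FP + FN), where Nspam is the total spam count and l weights each false positive as l false negatives. A quick sanity check against the score set 3 "test" fold (43597 caught + 537 missed = 44134 spam):

```shell
# TCR(l) = Nspam / (l*FP + FN); higher is better.
# Numbers from the score set 3 "test" fold: 14 FPs, 537 FNs, 44134 spam.
awk -v nspam=44134 -v fp=14 -v fn=537 -v l=50 \
  'BEGIN { printf "TCR(l=%d): %.6f\n", l, nspam / (l * fp + fn) }'
# -> TCR(l=50): 35.678254
```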
Created attachment 4559 [details] freqs.full of corpora used for score set 3 and 2 runs
Created attachment 4560 [details] ranges.data on corpora used for score set 3 and 2 runs
(In reply to comment #124)
> Created an attachment (id=4558) [details]
> resulting 50_scores.cf from garescorer runs - V3

Now I am getting really nervous. :-/  From the scores:

  score KB_DATE_CONTAINS_TAB 3.799 3.799 3.315 2.871
  score KB_FAKED_THE_BAT     1.447 2.273 2.452 3.799

The bad thing about this is that onet.pl / onet.eu (a Polish free-mailer, AFAIK) actually munges the header and injects the tab into the Date header on their outgoing SMTP servers. Apparently they do that harm to all outgoing mail, not limited to their web-mailer. It is a very, very stupid thing for them to do, munging MUA-generated headers like that, but they appear to do it nonetheless. :(

That means their customers will really be punished, and using them *and* The Bat! is a killer. FWIW, I once wrote these rules to counter a flood of low-scorers -- but the above scores are scaring me. This is quite bad.
(In reply to comment #124)
> The RCVD_IN_DNSWL_* scores are hand-tweaked (according to Comment 101),
> otherwise the _MED stands out above the _HI due to its significantly higher
> hit rate.
> [..]
> score RCVD_IN_DNSWL_LOW 0 -0.6 0 -1.1
> score RCVD_IN_DNSWL_MED 0 -1.5 0 -1.2
> score RCVD_IN_DNSWL_HI 0 -1.8 0 -1.8

Is there a particular reason why these are so much different from those in https://svn.apache.org/repos/asf/spamassassin/trunk/rules/50_scores.cf:

| score RCVD_IN_DNSWL_LOW 0 -1 0 -1
| score RCVD_IN_DNSWL_MED 0 -4 0 -4
| score RCVD_IN_DNSWL_HI 0 -8 0 -8
> > The RCVD_IN_DNSWL_* scores are hand-tweaked (according to Comment 101),
> > otherwise the _MED stands out above the _HI due to its significantly higher
> > hit rate.
> > score RCVD_IN_DNSWL_LOW 0 -0.6 0 -1.1
> > score RCVD_IN_DNSWL_MED 0 -1.5 0 -1.2
> > score RCVD_IN_DNSWL_HI 0 -1.8 0 -1.8
>
> Is there a particular reason why these are so much different from those in
> https://svn.apache.org/repos/asf/spamassassin/trunk/rules/50_scores.cf:
>
> | score RCVD_IN_DNSWL_LOW 0 -1 0 -1
> | score RCVD_IN_DNSWL_MED 0 -4 0 -4
> | score RCVD_IN_DNSWL_HI 0 -8 0 -8

The -1/-4/-8 were provided manually (I don't know the background on that decision). The RCVD_IN_DNSWL_MED in my GA results was obtained automatically, and the other two were manually adjusted to make some sense relative to _MED.

Btw, the GA results on scoreset 3 from one of my previous runs were:

  RCVD_IN_DNSWL_LOW -2.761
  RCVD_IN_DNSWL_MED -0.999
  RCVD_IN_DNSWL_HI  -0.966
(In reply to comment #130) > The -1/-4/-8 were manually provided (don't know the background on this > decision). Other whitelisting rules (HABEAS_*, RCVD_IN_IADB_*, RCVD_IN_BSP_TRUSTED etc) have the same scores as in the previous 50_scores.cf. I was wondering why the dnswl.org rules have specifically lower scores than in previous versions - and extremely low scores. This is worrying me, as it would indicate we have a quality issue in the dnswl.org data.
> Other whitelisting rules (HABEAS_*, RCVD_IN_IADB_*, RCVD_IN_BSP_TRUSTED etc)
> have the same scores as in the previous 50_scores.cf.

They do not have the same scores; it seems to me they are mostly much lower. Please ignore the comments in 50_scores_newest3.cf, just take into account the uncommented 'score' lines:

  score HABEAS_ACCREDITED_COI 0
  score HABEAS_ACCREDITED_SOI 0 -1.634 0 -0.475
  score RCVD_IN_BSP_TRUSTED 0 -0.001 0 -0.001
  score RCVD_IN_IADB_DK 0 -0.044 0 -0.001
  score RCVD_IN_IADB_DOPTIN 0
  score RCVD_IN_IADB_DOPTIN_GT50 0
  score RCVD_IN_IADB_DOPTIN_LT50 0 -0.001 0 -0.001
  score RCVD_IN_IADB_EDDB 0
  score RCVD_IN_IADB_EPIA 0
  score RCVD_IN_IADB_GOODMAIL 0
  score RCVD_IN_IADB_LISTED 0 -1.144 0 -0.001
  score RCVD_IN_IADB_LOOSE 0
  score RCVD_IN_IADB_MI_CPEAR 0
  score RCVD_IN_IADB_MI_CPR_30 0
  score RCVD_IN_IADB_MI_CPR_MAT 0 -0.079 0 -0.001
  score RCVD_IN_IADB_ML_DOPTIN 0
  score RCVD_IN_IADB_NOCONTROL 0
  score RCVD_IN_IADB_OOO 0
  score RCVD_IN_IADB_OPTIN 0 -3.265 0 -2.791
  score RCVD_IN_IADB_OPTIN_GT50 0 -0.219 0 -1.041
  score RCVD_IN_IADB_OPTIN_LT50 0
  score RCVD_IN_IADB_OPTOUTONLY 0
  score RCVD_IN_IADB_RDNS 0 -0.018 0 -0.001
  score RCVD_IN_IADB_SENDERID 0 -0.001 0 -0.001
  score RCVD_IN_IADB_SPF 0 -0.006 0 -0.042
  score RCVD_IN_IADB_UNVERIFIED_1 0
  score RCVD_IN_IADB_UNVERIFIED_2 0
  score RCVD_IN_IADB_UT_CPEAR 0
  score RCVD_IN_IADB_UT_CPR_30 0
  score RCVD_IN_IADB_UT_CPR_MAT 0 -0.001 0 -0.052
  score RCVD_IN_IADB_VOUCHED 0 -1.718 0 -0.956
  score RCVD_IN_DNSWL_LOW 0 -0.6 0 -1.1
  score RCVD_IN_DNSWL_MED 0 -1.5 0 -1.2
  score RCVD_IN_DNSWL_HI 0 -1.8 0 -1.8

> I was wondering why the dnswl.org rules have specifically lower scores than in
> previous versions - and extremely low scores. This is worrying me, as it would
> indicate we have a quality issue in the dnswl.org data.
These all have pretty low rank:

  $ grep RCVD_IN_DNSWL_ freqs.full
  OVERALL  SPAM%   HAM%     S/O    RANK  SCORE  NAME
  0.184    0.0005   0.5708  0.001  0.76  -1.80  RCVD_IN_DNSWL_HI
  7.410    0.1094  22.7527  0.005  0.67  -1.20  RCVD_IN_DNSWL_MED
  2.551    0.1810   7.5322  0.023  0.59  -1.10  RCVD_IN_DNSWL_LOW

the _HI gets a low automatic score probably because it hits very little mail, so it probably needs manual tweaking. The _MED seems to hit too many spam messages in the submitted logs for rescoring runs, or perhaps it has a high overlap with other similar rules. It is quite possible that some of these hits are still false positives, despite several iterations of cleaning:

  for j in spam*.log; do echo -n $j; grep RCVD_IN_DNSWL_HI $j | \
    wc -l; done | sort -k2nr

  spam-bayes-net-bb-jhardin.log 3
  spam-bayes-net-bb-kmcgrail.log 2
  spam-bayes-net-bb-guenther_fraud.log 1
  spam-bayes-net-hege.log 1

same on _MED:

  spam-bayes-net-bluestreak.log 381
  spam-bayes-net-hege.log 79
  spam-bayes-net-bb-jhardin.log 23
  spam-bayes-net-wt-en1.log 15
  spam-bayes-net-bb-kmcgrail.log 14
  spam-bayes-net-jm-decimated.log 11
  spam-bayes-net-ahenry.log 9
  spam-bayes-net-dos-decimated.log 6
  spam-bayes-net-bb-zmi.log 3
  spam-bayes-net-mmartinec.log 3
  spam-bayes-net-wt-en4.log 2
strange, some of the more trustworthy BLs are very low-scoring:

  RCVD_IN_XBL: 0.404 and 0.722

these have been effectively zeroed, although they are supposed to be immutable:

  RCVD_IN_SSC_TRUSTED_COI is 0      (with a 0.012 S/O; low hit rate though)
  HABEAS_ACCREDITED_COI   is 0      (ditto)
  RCVD_IN_BSP_TRUSTED     is -0.001 (although with a 0.002 S/O)

the HASHCASH rules likewise aren't supposed to be mutable.

it looks like there might be a bit of a problem there -- definitely some rules that are in immutable sections, like the above, have been allowed to be mutable in ranges.data....
(In reply to comment #132)
> $ grep RCVD_IN_DNSWL_ freqs.full
> OVERALL  SPAM%    HAM%   S/O    RANK  SCORE  NAME
>  0.184  0.0005   0.5708  0.001  0.76  -1.80  RCVD_IN_DNSWL_HI
>  7.410  0.1094  22.7527  0.005  0.67  -1.20  RCVD_IN_DNSWL_MED
>  2.551  0.1810   7.5322  0.023  0.59  -1.10  RCVD_IN_DNSWL_LOW
>
> It is quite possible that some of these hits are still false positives,
> despite several iterations of cleaning:
>
> for j in spam*.log; do echo -n $j; grep RCVD_IN_DNSWL_HI $j | \
> wc -l; done | sort -k2nr
>
> spam-bayes-net-bb-jhardin.log 3
>
> same on _MED:
>
> spam-bayes-net-bb-jhardin.log 23

All but one of those are obvious spams, and I've removed the one questionable one from my corpora.

Some of the spam in my corpora is from third parties. I do check it for correct classification before uploading, but I was wondering: how does masscheck determine the correct lastexternal for corpora containing messages from multiple different networks? Or does it assume all of the messages in a given contributor's corpora have the same network boundary? If the latter, I need to remove those third-party messages from my spam corpora...

Might lastexternal confusion in the masschecks be contributing in some way to the odd RCVD_IN_* score generation?
Created attachment 4561 [details]
Checker for rules that match more ham than spam

I've updated my checker to an actual perl script (still uses elinks as I don't feel like learning LWP and then parsing HTML). I've attached the checker, which can be run with custom parameters for a different ruleset, ham threshold, or minimum difference for the ham:spam ratio.

Here's the current output, listing all rules that hit 1+% of the ham corpus or that hit 0.05% more of the ham corpus than of the spam corpus:

H^2/S   HAM%     SPAM%    Score in attachment 4558 [details]  Rule
331.9   0.3319   0.0010   0                          OBSCURED_EMAIL
117.4   4.8566   0.2009   -0.001                     SPF_HELO_PASS
88.52   5.5735   0.3509   -0.001                     SPF_PASS
85.61   0.2226   0.0026   0.000 2.099 0.001 1.212    MISSING_MIME_HB_SEP
76.18   0.7085   0.0093   0.001 0.001 0.699 0.699    TVD_RCVD_SPACE_BRACKET
66.19   0.2780   0.0042   1.145 1.542 1.912 2.400    FUZZY_CPILL
49.98   1.0676   0.0228   0.001                      MSGID_MULTIPLE_AT
31.82   0.1496   0.0047   1.494 1.699 1.591 1.516    X_IP
21.86   0.1465   0.0067   0                          SUBJECT_FUZZY_TION
20.40   15.6218  11.9604  0.001                      FREEMAIL_FROM
20.00*  40.9055  83.6301  0.001                      HTML_MESSAGE
17.10   0.1710   0        1.222 0.001 0.082 0.476    MIME_BOUND_DIGITS_15
12.95   0.0609   0.0047   0                          HTML_IFRAME_SRC
12.52   0.0714   0.0057   0                          FORGED_IMS_TAGS
11.56   0.0659   0.0057   0.001 0.001 0.605 0.378    HTML_NONELEMENT_30_40
10.83   0.1127   0.0104   0.033 0.001 0.365 0.413    WEIRD_PORT
10.18   0.3494   0.0343   2.205 0.174 1.299 1.806    FRT_SOMA2
9.721   0.8934   0.0919   1.499 0.419 0.904 0.798    MIME_BASE64_BLANKS
8.996   0.2474   0.0275   0.987 0.750 0.943 1.318    CTYPE_001C_B
8.918   0.1525   0.0171   0.001 2.499 0.268 0.516    DRUGS_MUSCLE
8.373   0.0829   0.0099   0.003 0.978 0.100 1.515    TVD_FW_GRAPHIC_NAME_LONG
8.016   0.1956   0.0244   0.001 0.020 0.001 1.799    MIME_BASE64_TEXT
6.850   0.0685   0        0                          HTML_NONELEMENT_40_50
5.404   0.5356   0.0991   0 1.200 0 2.514            SPF_HELO_FAIL
4.237   0.1585   0.0374   2.199 2.199 1.246 2.090    WEIRD_QUOTING
4.159   3.8908   3.6392   0.001                      MIME_QP_LONG_LINE
3.483   0.8570   0.2460   1.799 0.572 1.182 1.138    HTML_IMAGE_RATIO_06
3.219   1.2399   0.4775   1.0                        EXTRA_MPART_TYPE
2.913*  12.1047  50.2891  0 1.1 0 0.7                RDNS_NONE
2.839   0.1164   0.0410   0.001 2.185 1.936 0.476    FRT_SOMA
2.751   0.1172   0.0426   0.1                        ANY_BOUNCE_MESSAGE
2.417   0.6787   0.2808   0.539 0.001 0.332 0.488    MIME_HTML_MOSTLY
2.370   0.1010   0.0426   0.1                        BOUNCE_MESSAGE
2.078   0.5534   0.2663   1.899 0.496 0.950 0.445    HTML_IMAGE_RATIO_08
1.899   1.2077   0.7677   0.001                      TVD_SPACE_RATIO
1.726   0.3227   0.1869   0.023 0.887 0.000 0.417    UPPERCASE_50_75
1.517   0.9658   0.6364   2.801 2.080 1.780 3.387    DATE_IN_PAST_96_XX
1.269   0.4224   0.3327   0.000 0.001 0.264 0.001    HTML_FONT_SIZE_LARGE
1.151   0.5492   0.4770   2.260 0.742 1.199 0.640    MPART_ALT_DIFF
0.913*  1.8488   3.7425   1.154 1.677 1.198 1.453    SUBJ_ALL_CAPS
0.703*  1.3317   2.5216   0.001                      UNPARSEABLE_RELAY
0.278*  3.7480   50.4848  2.199 0.955 1.215 0.549    MIME_HTML_ONLY
0.121*  1.2540   12.9472  0 1.322 0 1.237            RCVD_IN_BL_SPAMCOP_NET

(Anything asterisked is included because it matched >1% of the ham corpus but matched a larger percent of the spam corpus; everything else matched a larger percent of the ham corpus than of the spam corpus.)

Mark's fixes solved the immediate issues raised earlier. I decided to order this by the ratio of the percentage of the ham corpus hit to the percentage of the spam corpus hit; that under-emphasized the ham hits, so I then multiplied the ratio by the ham percentage again (unless that percentage was under 1). It's easy enough to browse for non-zero ham% hits. Any rule with a ratio over 1.000 is a problem when scored positively, unless it is exempted for applying to popular spam patterns that the corpus is known to lack. For completeness, this list includes all tests that hit at least 1% of the ham corpus (thus the presence of HTML_MESSAGE, RDNS_NONE, and the four tests with ratios under 1.0).
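The ordering metric described above (ham%/spam%, re-weighted by ham% when that percentage is 1 or more) can be sketched as follows. The function name `ham_emphasis` is hypothetical, and rules with a zero spam% are excluded since their handling isn't spelled out here:

```python
def ham_emphasis(ham_pct, spam_pct):
    """Ordering metric from the comment above: the ham%/spam% ratio,
    multiplied by ham% again when ham% is at least 1, so that rules
    hitting lots of ham sort first.  How the real checker treats rules
    with spam% == 0 isn't described here, so that case is excluded."""
    if spam_pct == 0:
        raise ValueError("zero spam% handling not specified")
    ratio = ham_pct / spam_pct
    return ratio * ham_pct if ham_pct >= 1 else ratio

# Reproduces the H^2/S column for two rows of the table:
assert round(ham_emphasis(0.3319, 0.0010), 1) == 331.9  # OBSCURED_EMAIL
assert round(ham_emphasis(4.8566, 0.2009), 1) == 117.4  # SPF_HELO_PASS
```

Squaring the ham weight like this is what pushes high-volume ham rules such as SPF_HELO_PASS ahead of rare-but-pure rules in the listing.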
(In reply to comment #133) > it looks like there might be a bit of a problem there -- definitely some rules > that are in immutable sections, like the above, have been allowed to be mutable > in ranges.data.... just wondering, Mark, did you do this deliberately? or is it just a bug in the tool that it's ignoring the non-mutable flag for those rules for some reason?
> > it looks like there might be a bit of a problem there -- definitely some
> > rules that are in immutable sections, like the above, have been allowed
> > to be mutable in ranges.data....
>
> just wondering, Mark, did you do this deliberately? or is it just a bug
> in the tool that it's ignoring the non-mutable flag for those rules for
> some reason?

Sort-of deliberately. Initially I followed the idea in wiki RescoreMassCheck section 4.2: 'comment out all "score" lines except for rules that you think the scores are accurate like carefully-vetted net rules, or 0.001 informational rules', which made perfect sense to me, so I did it for 50_scores.cf, except for a couple of rather obvious rules like _WHITELIST and similar, and the ones clearly indicated as 'indicators' only in the surrounding comments, or set to 0.001. Later I nailed a couple more. I followed a principle: when in doubt, leave it floating; it can be fixed later if necessary. It gives some insight into what the GA 'thinks' about certain rules.

I think at least for some rules the GA makes perfect sense, like RDNS_NONE and RDNS_DYNAMIC. For some of them the GA result is close to the manually assigned score, or may indicate a need to reconsider the assigned score. But I agree that more may need re-fixing.
(In reply to comment #134)
> Some of the spam in my corpora is from third parties. I do check it for
> correct classification before uploading, but I was wondering: how does
> masscheck determine the correct lastexternal for corpora containing messages
> from multiple different networks? Or does it assume all of the messages in a
> given contributor's corpora have the same network boundary? If the latter, I
> need to remove those third-party messages from my spam corpora...
>
> Might lastexternal confusion in the masschecks be contributing in some way to
> the odd RCVD_IN_* score generation?

I believe the masschecks leave internal/external/msa_networks at their defaults, unless one cares to configure them correctly for his corpus. And I believe it is more likely than not that some corpora were scanned with unsuitable network settings. I know that configuring them for my mass-check runs gave me a headache (but I did get it right in the end). Which is why I posted the following note on the ML at the time:

From: Mark Martinec <Mark.Martinec+sa@ijs.si>
To: dev@spamassassin.apache.org
Subject: Re: SpamAssassin 3.3.0 mass-checks now starting
Date: Fri, 4 Sep 2009 21:46:59 +0200

Docs don't say where one is supposed to put a local.cf with options which are ignored in masses/spamassassin/user_prefs (like Bayes SQL options, DCC, Pyzor timeouts etc).

I tried to place local.cf into masses/spamassassin/, with horror results (some directives in local.cf were proclaimed invalid, as apparently plugins have not yet been loaded at the time this file is parsed, but only later).

I finally placed it into ../rules/ as mylocal.cf, which finally works as expected, but I wonder if this is the proper solution. Should be documented I guess...
(In reply to comment #137)
> Sort-of deliberately. Initially I followed the idea in wiki RescoreMassCheck
> section 4.2: 'comment out all "score" lines except for rules that you think
> the scores are accurate like carefully-vetted net rules, or 0.001
> informational rules' which made perfect sense to me, so I did it for
> 50_scores.cf, except for a couple of rather obvious rules like _WHITELIST
> and similar, and the ones clearly indicated as 'indicators' only in the
> surrounding comments, or set to 0.001. Later I nailed a couple more. I
> followed a principle: when in doubt, leave it floating, it can be fixed
> later if necessary. It gives some insight into what GA 'thinks' about
> certain rules.

That's true. It's good to hear it's not a bug in the masses scripts, anyway ;)

> I think at least for some rules GA makes perfect sense, like RDNS_NONE
> and RDNS_DYNAMIC.

Yes, I agree, it's actually done a (surprisingly) good job with those.

> For some of them the GA result is close to the manually
> assigned score, or may indicate a need for reconsidering the assigned score.
> But I agree that more may need re-fixing.

cool. In particular, some of the DNSBLs and most of the DNSWLs are good to 'lock down', I feel, as users tend to 'compensate' or correct their scores more frequently than for other rules -- in my opinion. Also, if those are given low scores by the GA, their operators tend to be annoyed, and it's not good to annoy people we're relying on ;)

It also reflects that those rules are slightly different, and hopefully more reliable, than a typical body rule for example -- there's no way to indicate this to the GA yet, so locking the rules down is as good as we can do.
(In reply to comment #138)
> I believe the masschecks leave internal/external/msa_networks at their
> defaults, unless one cares to configure them correctly for his corpus. And
> I believe it is more likely than not that some corpora were scanned
> with unsuitable network settings. I know that configuring them for my
> mass-check runs gave me a headache (but I did get it right in the end).

What should be happening, though, is that we're just underestimating the number of -lastexternal rule hits -- the S/O should still be correct, but the overall number of hits will be lower. Hopefully that will still provide a useful estimate of accuracy.

> Docs don't say where one is supposed to put a local.cf with
> options which are ignored in masses/spamassassin/user_prefs
> (like Bayes SQL options, DCC, Pyzor timeouts etc).
>
> I tried to place local.cf into masses/spamassassin/, with
> horror results (some directives in local.cf proclaimed as
> invalid, as apparently plugins have not yet been loaded
> at the time of parsing this file, but only later).
>
> I finally placed it into ../rules/ as mylocal.cf, which
> finally works as expected, but I wonder if this is the proper
> solution. Should be documented I guess...

yuck. bug 6227.
>> But I agree that more may need re-fixing.
>
> cool. In particular, some of the DNSBLs and most of the DNSWLs are good to
> 'lock down', I feel, as users tend to 'compensate' or correct their scores
> more frequently than for other rules -- in my opinion. Also, if those are
> given low scores by the GA, their operators tend to be annoyed, and it's not
> good to annoy people we're relying on ;)
>
> It also reflects that those rules are slightly different, and hopefully
> more reliable, than a typical body rule for example -- there's no way to
> indicate this to the GA yet, so locking the rules is as good as we can do.

| It is quite possible that some of these hits are still false positives,
| despite several iterations of cleaning

I wonder how much the low score for some ham rules is affected by false positives present in the spam* corpora. Here are some statistics for the more prominent ham rules (i.e. the ones with negative scores). For each rule the table shows the number of hits of this rule in each corpus - both as a percentage of all entries in the file, and as an absolute count.
The entries standing out from the crowd that may need re-checking are labeled with *** :

score ALL_TRUSTED -1.000
   0.046 %      1/2194     spam-bayes-net-bb-kmcgrail
   0.017 %      4/23761    spam-bayes-net-mmartinec
   0.014 %      5/36941    spam-bayes-net-hege
   0.001 %      1/81265    spam-bayes-net-bluestreak
   0.000 %      1/931863   spam-bayes-net-dos

score BAYES_00 0 0 -1.2 -1.9
   5.652 %    104/1840     spam-bayes-net-bb-jhardin ***
   1.805 %    429/23761    spam-bayes-net-mmartinec
   1.606 %     33/2055     spam-bayes-net-ahenry
   0.439 %    357/81265    spam-bayes-net-bluestreak
   0.374 %    138/36941    spam-bayes-net-hege
   0.030 %    445/1489699  spam-bayes-net-jm
   0.017 %    156/931863   spam-bayes-net-dos

score DCC_REPUT_00_12 0 -0.8 0 -0.4
   0.164 %     39/23761    spam-bayes-net-mmartinec

score HABEAS_ACCREDITED_SOI 0 -1.634 0 -0.475
   5.382 %     76/1412     spam-bayes-net-bb-guenther_fraud ***
   0.272 %      5/1840     spam-bayes-net-bb-jhardin
   0.091 %      2/2194     spam-bayes-net-bb-kmcgrail
   0.059 %     14/23761    spam-bayes-net-mmartinec
   0.049 %     18/36941    spam-bayes-net-hege
   0.037 %    558/1489699  spam-bayes-net-jm
   0.030 %      2/6728     spam-bayes-net-wt-en1
   0.018 %     15/81265    spam-bayes-net-bluestreak
   0.000 %      1/931863   spam-bayes-net-dos

score RCVD_IN_DNSWL_HI 0 -1.8 0 -1.8
   0.163 %      3/1840     spam-bayes-net-bb-jhardin ***
   0.091 %      2/2194     spam-bayes-net-bb-kmcgrail
   0.071 %      1/1412     spam-bayes-net-bb-guenther_fraud
   0.003 %      1/36941    spam-bayes-net-hege
   0.000 %      1/1489699  spam-bayes-net-jm

score RCVD_IN_DNSWL_MED 0 -1.5 0 -1.2
   1.250 %     23/1840     spam-bayes-net-bb-jhardin ***
  (1.108 %      7/632      spam-bayes-net-binnocenti.OFF)
   0.638 %     14/2194     spam-bayes-net-bb-kmcgrail
   0.469 %    381/81265    spam-bayes-net-bluestreak
   0.438 %      9/2055     spam-bayes-net-ahenry
   0.223 %     15/6728     spam-bayes-net-wt-en1
   0.214 %     79/36941    spam-bayes-net-hege
   0.046 %    682/1489699  spam-bayes-net-jm
   0.042 %      3/7185     spam-bayes-net-bb-zmi
   0.013 %      3/23761    spam-bayes-net-mmartinec
   0.010 %      2/19160    spam-bayes-net-wt-en4
   0.003 %     29/931863   spam-bayes-net-dos

score RCVD_IN_DNSWL_LOW 0 -0.6 0 -1.1
  16.153 % 240627/1489699  spam-bayes-net-jm ***
  (9.810 %     62/632      spam-bayes-net-binnocenti.OFF)
   1.739 %     32/1840     spam-bayes-net-bb-jhardin
   1.600 %    591/36941    spam-bayes-net-hege
   1.159 %     78/6728     spam-bayes-net-wt-en1
   1.133 %     16/1412     spam-bayes-net-bb-guenther_fraud
   0.925 %     19/2055     spam-bayes-net-ahenry
   0.365 %      8/2194     spam-bayes-net-bb-kmcgrail
   0.107 %     87/81265    spam-bayes-net-bluestreak
   0.097 %      7/7185     spam-bayes-net-bb-zmi
   0.022 %    201/931863   spam-bayes-net-dos
   0.021 %      5/23761    spam-bayes-net-mmartinec
   0.016 %      3/19160    spam-bayes-net-wt-en4

score RCVD_IN_BSP_TRUSTED 0 -0.001 0 -0.001
   5.312 %     75/1412     spam-bayes-net-bb-guenther_fraud ***
   0.030 %      2/6728     spam-bayes-net-wt-en1
   0.029 %      7/23761    spam-bayes-net-mmartinec
   0.029 %    435/1489699  spam-bayes-net-jm
   0.015 %     12/81265    spam-bayes-net-bluestreak
   0.003 %      1/36941    spam-bayes-net-hege
   0.001 %     11/931863   spam-bayes-net-dos

score RCVD_IN_IADB_DK 0 -0.044 0 -0.001
   0.059 %      4/6728     spam-bayes-net-wt-en1
   0.054 %      1/1840     spam-bayes-net-bb-jhardin
   0.033 %     27/81265    spam-bayes-net-bluestreak
   0.004 %      1/23761    spam-bayes-net-mmartinec
   0.001 %     21/1489699  spam-bayes-net-jm

score RCVD_IN_IADB_RDNS 0 -0.018 0 -0.001
   0.342 %     23/6728     spam-bayes-net-wt-en1 ***
   0.054 %      1/1840     spam-bayes-net-bb-jhardin
   0.049 %      1/2055     spam-bayes-net-ahenry
   0.033 %     27/81265    spam-bayes-net-bluestreak
   0.004 %      1/23761    spam-bayes-net-mmartinec
   0.002 %     26/1489699  spam-bayes-net-jm

score RCVD_IN_IADB_OPTIN 0 -3.265 0 -2.791
   0.342 %     23/6728     spam-bayes-net-wt-en1 ***
   0.049 %      1/2055     spam-bayes-net-ahenry
   0.000 %      4/1489699  spam-bayes-net-jm

score RCVD_IN_IADB_OPTIN_GT50 0 -0.219 0 -1.041
   0.054 %      1/1840     spam-bayes-net-bb-jhardin

score RCVD_IN_IADB_DOPTIN 0
   0.000 %      7/1489699  spam-bayes-net-jm

score RCVD_IN_IADB_DOPTIN_LT50 0 -0.001 0 -0.001
   0.026 %     21/81265    spam-bayes-net-bluestreak ***
   0.001 %     15/1489699  spam-bayes-net-jm.log

score RCVD_IN_IADB_DOPTIN_GT50 0
   0.007 %      6/81265    spam-bayes-net-bluestreak
   0.004 %      1/23761    spam-bayes-net-mmartinec

score RCVD_IN_IADB_ML_DOPTIN 0
   0.000 %      2/1489699  spam-bayes-net-jm

score RCVD_IN_IADB_UT_CPR_MAT 0 -0.001 0 -0.052
   0.026 %     21/81265    spam-bayes-net-bluestreak ***
   0.001 %     15/1489699  spam-bayes-net-jm

score RCVD_IN_IADB_MI_CPR_MAT 0 -0.079 0 -0.001
   0.026 %     21/81265    spam-bayes-net-bluestreak ***
   0.001 %     15/1489699  spam-bayes-net-jm

score RCVD_IN_IADB_LISTED 0 -1.144 0 -0.001
   0.342 %     23/6728     spam-bayes-net-wt-en1 ***
   0.054 %      1/1840     spam-bayes-net-bb-jhardin
   0.049 %      1/2055     spam-bayes-net-ahenry
   0.033 %     27/81265    spam-bayes-net-bluestreak
   0.004 %      1/23761    spam-bayes-net-mmartinec
   0.002 %     26/1489699  spam-bayes-net-jm
   0.000 %      1/931863   spam-bayes-net-dos

score RCVD_IN_IADB_SENDERID 0 -0.001 0 -0.001
   0.208 %     14/6728     spam-bayes-net-wt-en1 ***
   0.049 %      1/2055     spam-bayes-net-ahenry
   0.033 %     27/81265    spam-bayes-net-bluestreak
   0.004 %      1/23761    spam-bayes-net-mmartinec
   0.000 %      4/1489699  spam-bayes-net-jm

score RCVD_IN_IADB_SPF 0 -0.006 0 -0.042
   0.342 %     23/6728     spam-bayes-net-wt-en1 ***
   0.054 %      1/1840     spam-bayes-net-bb-jhardin
   0.049 %      1/2055     spam-bayes-net-ahenry
   0.033 %     27/81265    spam-bayes-net-bluestreak
   0.004 %      1/23761    spam-bayes-net-mmartinec
   0.002 %     26/1489699  spam-bayes-net-jm

score RCVD_IN_IADB_VOUCHED 0 -1.718 0 -0.956
   0
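The per-corpus percentages in these tables are simply hits over total entries in each log file; a trivial sketch (the helper name `corpus_pct` is hypothetical) reproduces a couple of the values above:

```python
def corpus_pct(hits, total):
    """Percentage of a corpus file's entries hit by a rule,
    rounded to the three decimal places shown in the tables."""
    return round(100.0 * hits / total, 3)

# BAYES_00 on bb-jhardin, and RCVD_IN_DNSWL_LOW on the jm corpus:
assert corpus_pct(104, 1840) == 5.652
assert corpus_pct(240627, 1489699) == 16.153
```

The absolute counts matter as much as the percentages here: 5.652% of a 1840-message corpus is only 104 messages, so a few misfiled hams can dominate a small contributor's column.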
Seems to me that many / most(?) of the supposed HABEAS_ACCREDITED_SOI false positives are due to freelotto.com mail. I wonder whether such samples rightfully belong in the spam* corpora - I'd say yes. But, as they say, spam is about consent, not content, and people receiving mail from freelotto.com most likely did register once, not realizing what they were dealing with. So there was consent, at least initially. It is also about fraud and advertising, though - so, should one leave such mail samples in the spam corpus or not?
> Seems to me that many / most(?) HABEAS_ACCREDITED_SOI supposedly > false positives are due to freelotto.com mail. Same applies to RCVD_IN_BSP_TRUSTED spam hits.
What is the next step in order to move forward?
Created attachment 4564 [details]
Checker for rules that match more ham than spam

Updated my checker to use S/O (now that I understand that stat). It also supports specifying the DateRev for a specific masscheck run. Since today's run was sparse, here are yesterday's results.

$ ./sa33badrules.pl 20091103-r832343-n

S/O   RANK  HAM%     SPAM%    Score in attachment 4558 [details]  Rule
.008  .12   1.2401   0.0105   0.001                      MSGID_MULTIPLE_AT
.011  .22   0.3066   0.0035   0                          OBSCURED_EMAIL
.012  .25   0.2058   0.0025   0.000 2.099 0.001 1.212    MISSING_MIME_HB_SEP
.014  .17   0.5822   0.0080   0.001 0.001 0.699 0.699    TVD_RCVD_SPACE_BRACKET
.028  .20   0.4339   0.0125   unknown                    TVD_FUZZY_SECTOR
.042  .28   0.1732   0.0075   0                          SUBJECT_FUZZY_TION
.048  .77   4.4862   0.2279   -0.001                     SPF_HELO_PASS
.052  .29   0.1476   0.0080   1.494 1.699 1.591 1.516    X_IP
.055  .22   0.3914   0.0226   2.205 0.174 1.299 1.806    FRT_SOMA2
.062  .74   5.1484   0.3424   -0.001                     SPF_PASS
.077  .25   0.2643   0.0221   0.987 0.750 0.943 1.318    CTYPE_001C_B
.079  .36   0.0640   0.0055   0.001 0.001 0.605 0.378    HTML_NONELEMENT_30_40
.080  .28   0.1742   0.0151   0.001 2.499 0.268 0.516    DRUGS_MUSCLE
.084  .36   0.0660   0.0060   0                          FORGED_IMS_TAGS
.090  .32   0.1114   0.0110   0.033 0.001 0.365 0.413    WEIRD_PORT
.092  .21   0.8712   0.0878   1.499 0.419 0.904 0.798    MIME_BASE64_BLANKS
.102  .37   0.0577   0.0065   0                          HTML_IFRAME_SRC
.123  .34   0.0821   0.0115   0.003 0.978 0.100 1.515    TVD_FW_GRAPHIC_NAME_LONG
.128  .37   0.0614   0.0090   0                          RCVD_BAD_ID
.130  .29   0.1851   0.0276   0.001 0.020 0.001 1.799    MIME_BASE64_TEXT
.178  .28   0.4948   0.1069   0 1.200 0 2.514            SPF_HELO_FAIL
.202  .32   0.1590   0.0402   0.1                        ANY_BOUNCE_MESSAGE
.205  .35   0.0817   0.0211   2.199 1.622 2.199 1.086    LONGWORDS
.213  .34   0.1186   0.0321   0                          BLANK_LINES_80_90
.216  .32   0.1474   0.0407   2.199 2.199 1.246 2.090    WEIRD_QUOTING
.218  .32   0.1445   0.0402   0.1                        BOUNCE_MESSAGE
.223  .30   0.7605   0.2179   1.799 0.572 1.182 1.138    HTML_IMAGE_RATIO_06
.241  .34   1.3973   0.4438   1.0                        EXTRA_MPART_TYPE
.254  .34   0.1222   0.0417   0.001 2.185 1.936 0.476    FRT_SOMA
.283  .33   0.6883   0.2711   0.539 0.001 0.332 0.488    MIME_HTML_MOSTLY
.299  .36   0.0908   0.0387   0.799 0.001 0.711 0.026    TVD_FW_GRAPHIC_NAME_MID
.303  .34   0.4938   0.2143   1.899 0.496 0.950 0.445    HTML_IMAGE_RATIO_08
.367  .40   1.2775   0.7409   0.001                      TVD_SPACE_RATIO
.379  .37   0.3182   0.1943   0.023 0.887 0.000 0.417    UPPERCASE_50_75
.434  .39   0.3261   0.2505   3.099 1.823 1.802 1.998    BAD_ENC_HEADER
.436  .46   15.3798  11.8920  0.001                      FREEMAIL_FROM
.454  .41   0.5503   0.4573   2.260 0.742 1.199 0.640    MPART_ALT_DIFF
.516  .47   3.6581   3.9024   0.001                      MIME_QP_LONG_LINE
.655  .51   1.9537   3.7036   1.154 1.677 1.198 1.453    SUBJ_ALL_CAPS
.665  .49   42.2269  83.7383  0.001                      HTML_MESSAGE
.692  .52   1.1850   2.6580   0.001                      UNPARSEABLE_RELAY
.922  .58   1.1584   13.7423  0 1.322 0 1.237            RCVD_IN_BL_SPAMCOP_NET
.935  .57   3.5421   50.6034  2.199 0.955 1.215 0.549    MIME_HTML_ONLY
.970  .52   1.5729   51.1430  0 1.1 0 0.7                RDNS_NONE

Note: I hacked RDNS_NONE so that it removes the Enron hits.

"Problem" rules this week include X_IP, EXTRA_MPART_TYPE, FRT_SOMA2, and BAD_ENC_HEADER (scored 3.099?!).

Food for thought: while it's good to create workarounds for the problematic outcomes from the genetic algorithm, I think these should be examples with which to troubleshoot the algorithm itself. While this might just be an early sign of over-fitting (which is largely fine as long as we comb through the results with scripts like this), it might also be indicative of a problem in the system's prioritization.
Created attachment 4565 [details]
resulting 50_scores.cf from garescorer runs - V5

A new run; this time I left the URIBL whitelists and similar fixed (at their relatively high manual scores) as they were in the current 50_scores.cf.
Corresponding GA summaries ($ head test scores):

gen-set3-20-5.0-14000-ga-best

==> test <==
# SUMMARY for threshold 5.0:
# Correctly non-spam:  21171  99.93%
# Correctly spam:      43624  98.84%
# False positives:        15  0.07%
# False negatives:       510  1.16%
# TCR(l=50): 35.026984  SpamRecall: 98.844%  SpamPrec: 99.966%

==> scores <==
# SUMMARY for threshold 5.0:
# Correctly non-spam: 168144  32.193%  (99.979% of non-spam corpus)
# Correctly spam:     349846  66.982%  (98.794% of spam corpus)
# False positives:        35  0.007%  (0.021% of nonspam,   8289 weighted)
# False negatives:      4270  0.818%  (1.206% of spam,     13858 weighted)
# Average score for spam:  21.3    nonspam: -3.2
# Average for false-pos:    5.6  false-neg:  3.2
# TOTAL:              522295  100.00%

gen-set2-10-5.0-6500-ga-best

==> test <==
# SUMMARY for threshold 5.0:
# Correctly non-spam:  21149  99.83%
# Correctly spam:      41755  94.61%
# False positives:        37  0.17%
# False negatives:      2379  5.39%
# TCR(l=50): 10.436037  SpamRecall: 94.610%  SpamPrec: 99.911%

==> scores <==
# SUMMARY for threshold 5.0:
# Correctly non-spam: 167927  32.152%  (99.850% of non-spam corpus)
# Correctly spam:     335063  64.152%  (94.620% of spam corpus)
# False positives:       252  0.048%  (0.150% of nonspam,  29229 weighted)
# False negatives:     19053  3.648%  (5.380% of spam,     68835 weighted)
# Average score for spam:  11.1    nonspam: -1.0
# Average for false-pos:    5.5  false-neg:  3.6
# TOTAL:              522295  100.00%

gen-set1-10-5.0-14000-ga-best

==> test <==
# SUMMARY for threshold 5.0:
# Correctly non-spam:  21151  99.83%
# Correctly spam:      43145  97.76%
# False positives:        35  0.17%
# False negatives:       989  2.24%
# TCR(l=50): 16.113180  SpamRecall: 97.759%  SpamPrec: 99.919%

==> scores <==
# SUMMARY for threshold 5.0:
# Correctly non-spam: 168009  32.167%  (99.899% of non-spam corpus)
# Correctly spam:     346230  66.290%  (97.773% of spam corpus)
# False positives:       170  0.033%  (0.101% of nonspam,  20632 weighted)
# False negatives:      7886  1.510%  (2.227% of spam,     22952 weighted)
# Average score for spam:  20.1    nonspam: -1.5
# Average for false-pos:    5.8  false-neg:  2.9
# TOTAL:              522295  100.00%

gen-set0-5-5.0-14000-ga-best

==> test <==
# SUMMARY for threshold 5.0:
# Correctly non-spam:  20925  98.77%
# Correctly spam:      36049  81.68%
# False positives:       261  1.23%
# False negatives:      8085  18.32%
# TCR(l=50): 2.088195  SpamRecall: 81.681%  SpamPrec: 99.281%

==> scores <==
# SUMMARY for threshold 5.0:
# Correctly non-spam: 166235  31.828%  (98.844% of non-spam corpus)
# Correctly spam:     288300  55.199%  (81.414% of spam corpus)
# False positives:      1944  0.372%  (1.156% of nonspam, 128482 weighted)
# False negatives:     65816  12.601%  (18.586% of spam,  202271 weighted)
# Average score for spam:  10.5    nonspam:  0.6
# Average for false-pos:    6.3  false-neg:  3.1
# TOTAL:              522295  100.00%
Created attachment 4566 [details]
GA cost vs. iterations

Here is a somewhat interesting diagram showing how the 'cost' optimized by the GA is minimized over iterations. The data comes from the nohup.out log file, where each GA iteration looks like:

123456789 Pop size, replacement: 50 33
Adapt (t, fneg, fneg_add, fpos, fpos_add): 1250 4776 0 0 0
Adapt (over, cross, repeat): 1 1 4131
Performance: 0.672 iterations/s, iteration no. 10900
# SUMMARY for threshold 5.0:
# Correctly non-spam: 168144 32.193% (99.979% of non-spam corpus)
# Correctly spam: 349845 66.982% (98.794% of spam corpus)
# False positives: 35 0.007% (0.021% of nonspam, 8290 weighted)
# False negatives: 4271 0.818% (1.206% of spam, 13863 weighted)
# Average score for spam: 21.1 nonspam: -3.2
# Average for false-pos: 5.6 false-neg: 3.2
# TOTAL: 522295 100.00%

From the above, the extracted data for this iteration is:
- iteration count: 10900
- FP weighted: 8290
- FN weighted: 13863

So the chart plots the weighted FP and FN costs against the iteration count. Each of the four colours corresponds to one set (set3: net+bayes, set2: nonet+bayes, set1: net+nobayes, set0: nonet+nobayes). The thicker line of each pair is the FP line, the thinner the FN line.

The purpose of the chart is to determine whether the chosen max-iterations limit is sensible: still gaining some benefit without running into overfitting or wasting too much time. One safety valve against overfitting is to check whether the 10% test sample produces results similar to the learning set (90%). The other test I made was to repeat the runs with a limit of about 5000 iterations (instead of 14000) and compare the results - which are indeed similar.
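The extraction step described above can be sketched roughly like this, assuming each iteration block in nohup.out carries exactly the three lines matched below (the regexes and the function name `extract_points` are illustrative, not the actual tooling):

```python
import re

# Regexes keyed to the log excerpt above.
ITER_RE = re.compile(r'iteration no\. (\d+)')
FP_RE = re.compile(r'False positives:.*?(\d+) weighted')
FN_RE = re.compile(r'False negatives:.*?(\d+) weighted')

def extract_points(log_text):
    """Return (iteration, fp_weighted, fn_weighted) tuples, one per
    GA iteration block, for plotting cost against iteration count."""
    iters = [int(m) for m in ITER_RE.findall(log_text)]
    fps = [int(m) for m in FP_RE.findall(log_text)]
    fns = [int(m) for m in FN_RE.findall(log_text)]
    return list(zip(iters, fps, fns))

sample = """Performance: 0.672 iterations/s, iteration no. 10900
# False positives: 35 0.007% (0.021% of nonspam, 8290 weighted)
# False negatives: 4271 0.818% (1.206% of spam, 13863 weighted)"""
assert extract_points(sample) == [(10900, 8290, 13863)]
```

Plotting the resulting triples per scoreset is all the attached diagram does; the interesting signal is where the weighted-FP and weighted-FN curves flatten out.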
Created attachment 4567 [details]
Scaled diagram of the previous one, only sets 3 and 1 shown

Here is the same diagram as above, but scaled so as not to be compressed by the poor results of set 0. Also, only two score sets are shown: 1 and 3, i.e. both sets with network tests, without and with Bayes.
(In reply to comment #146) > Created an attachment (id=4565) [details] > resulting 50_scores.cf from garescorer runs - V5 > > A new run, this time I left the URIBL whitelists and similar fixed > (at their relatively high manual scores) as they were in current 50_scores.cf After a little examination, they look good to me! +1 to check in. RCVD_IN_XBL is still surprisingly low -- I bet there's some additive behaviour overlapping between XBL and PBL, though. RCVD_IN_SBL is _very_ low in set 3 too, bizarre! otherwise I can't see any issues.... btw if you feel like cranking up the max gens, go for it. fwiw, spamassassin2.zones has a very powerful CPU -- if it's taking too long on your own machine, try scping stuff up and running it there.
Please manually adjust the scores of RCVD_IN_PSBL up. At the time of the rescore masscheck PSBL had not yet whitelisted hotmail, yahoo, gmail and a number of major ISPs. As a result, for 5 weeks straight RCVD_IN_PSBL has been almost completely devoid of FPs in our weekly masschecks. I am confident that PSBL performs more safely than measured during the rescore masscheck.

http://ruleqa.spamassassin.org/20090829-r809102-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20090905-r811608-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20090912-r814117-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20090919-r816871-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20090926-r819101-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20091003-r821273-n/RCVD_IN_PSBL/detail
(below this point the FP rate dropped to nearly zero)
http://ruleqa.spamassassin.org/20091010-r823821-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20091017-r826198-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20091024-r829323-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20091031-r831520-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20091107-r833654-n/RCVD_IN_PSBL/detail

You can plainly see steady and sustained improvement in FP safety over these past weeks. RCVD_IN_PSBL in the rescore masscheck was without lastexternal; clearly, with the added limitation of lastexternal, it is safer than measured.
> > A new run, this time I left the URIBL whitelists and similar fixed
> > (at their relatively high manual scores) as they were in current
> > 50_scores.cf

Or to say it better: unlike my previous runs, where I commented out most scores in the existing 50_scores.cf (thus making them mutable, regardless of any <gen:mutable> markup) except for a couple of exceptions, this time I did not comment out scores, and let the <gen:mutable> markup do its job. So this is now more like how the GA was intended to be run.

> After a little examination, they look good to me! +1 to check in.

Thanks. I'm sure we can still do some manual tweaks and improvements, but perhaps we can indeed freeze the rest at the scores automatically assigned in this run.

> btw if you feel like cranking up the max gens, go for it. fwiw,
> spamassassin2.zones has a very powerful CPU -- if it's taking too long
> on your own machine, try scping stuff up and running it there.

My office workstation is quite beefy too, and I hope we won't need to do many further runs, so for now I'd just stick to what I'm familiar with. Btw, my set3 run at 14000 iterations takes 5 hours, similar for set1; the other two are much faster (less than 30 minutes each). I just let it run overnight, so it wouldn't matter if it took half that time. I did some previous runs at 30000 iterations, and a diagram (like the one attached earlier) does not show noticeable improvement beyond about 10000 iterations, or even a small worsening by the end, so the 14000 limit seems reasonable. And GA algorithms are said to be prone to overfitting, so it's probably prudent not to go too far.

> RCVD_IN_XBL is still surprisingly low -- I bet there's some additive
> behaviour overlapping between XBL and PBL, though.
> RCVD_IN_SBL is _very_ low in set 3 too, bizarre!
> otherwise I can't see any issues....

| Please manually adjust the scores of RCVD_IN_PSBL up. At the time of the
| rescore masscheck PSBL had not yet whitelisted hotmail, yahoo, gmail and a
| number of major ISPs. As a result, for 5 weeks straight RCVD_IN_PSBL has
| been almost completely devoid of FPs in our weekly masschecks. I am
| confident that PSBL performs more safely than measured during the rescore
| masscheck.

Ok, I suggest we collect some manual fixes like the ones suggested here (with specific score suggestions), and wrap it up.
Created attachment 4568 [details]
Checker for rules that match more ham than spam

Collected selections from several more runs of my script. I took the last three days' worth of masschecks plus the run last week, hand-picked rules with a high score (~1.0+) but low S/O (~0.250-), and then looked for repeat offenders. This is the list, with each rule's worst S/O of any run:

S/O   RANK  HAM%    SPAM%   Score attachment 4565 [details]  Rule
.002  .14   1.2650  0.0024  0.001 0.001 0.131 0.700    TVD_RCVD_SPACE_BRACKET
.002  .23   0.4472  0.0008  0.000 2.099 0.001 1.711    MISSING_MIME_HB_SEP
.019  .22   0.2529  0.0049  1.482 0.855 2.399 2.399    FUZZY_CPILL
.019  .29   0.2809  0.0056  0.001 1.699 1.498 1.699    X_IP
.046  .22   0.4010  0.0193  2.385 0.345 0.998 2.503    FRT_SOMA2
.077  .25   0.2643  0.0221  0.551 1.026 1.033 1.250    CTYPE_001C_B
.092  .21   0.8712  0.0878  0.699 0.332 0.480 0.800    MIME_BASE64_BLANKS
.095  .31   0.2735  0.0286  2.200 2.199 0.540 2.199    WEIRD_QUOTING
.178  .28   0.4948  0.1069  0 0.973 0 2.385            SPF_HELO_FAIL
.195  .29   0.8975  0.2173  1.799 0.579 0.901 0.882    HTML_IMAGE_RATIO_06
.241  .34   1.4248  0.4529  1.0                        EXTRA_MPART_TYPE

I don't think it wise to release with these scores quite so high. I propose we score them all 0.1 or 0.001 so as not to hold up the release, and bookmark the issue (likely a bug in the GA, probably best registered as its own bugzilla bug) for dealing with later.

Additionally, I've updated my script to do the reverse - seek out negatively scored rules that hit more spam than ham. This doesn't currently find anything beyond SPF_PASS (due to having >=1% spam hits, while it was previously found for having ham>spam), but it does prevent listing SPF_HELO_PASS, and theoretically it will help find poorly-written ham rules in the future.
(In reply to comment #152)
>
> | Please manually adjust the scores of RCVD_IN_PSBL up. At the time of the
> | rescore masscheck PSBL had not yet whitelisted hotmail, yahoo, gmail and a
> | number of major ISP's. As a result, for 5 weeks straight RCVD_IN_PSBL has
> | been almost completely devoid of FP's in our weekly masschecks. I am
> | confident that PSBL performs safer than measured during the rescore masscheck
>
> Ok, I suggest we collect some manual fixes like the ones suggested here
> (with specific score suggestions), and wrap it up.

Let's just go ahead with committing as jm suggested in Comment #153, and
make the manual adjustments after that in separate commits, each with
explanations.

For RCVD_IN_PSBL I suggest 2.7 for both network sets.

Adam Katz in Comment #153 makes a good argument for reducing those rules to
informational. Any comments on that?
(In reply to comment #154)
> (In reply to comment #152)
> >
> > | Please manually adjust the scores of RCVD_IN_PSBL up. At the time of the
> > | rescore masscheck PSBL had not yet whitelisted hotmail, yahoo, gmail and
> > | a number of major ISP's. As a result, for 5 weeks straight RCVD_IN_PSBL
> > | has been almost completely devoid of FP's in our weekly masschecks. I am
> > | confident that PSBL performs safer than measured during the rescore
> > | masscheck
> >
> > Ok, I suggest we collect some manual fixes like the ones suggested here
> > (with specific score suggestions), and wrap it up.
>
> Let's just go ahead with committing as jm suggested in Comment #153 and make
> the manual adjustments after that in separate commits each with explanations.
>
> RCVD_IN_PSBL I suggest 2.7 for both network sets.
>
> Adam Katz in Comment #153 makes a good argument for reducing those rules to
> informational. Any comments on that?

+1 to all ;)
I might have to eat my words. Applying these new scores did not improve my
own statistics.

ORIGINAL SCORES
./fp-fn-statistics -s 3  (wt-* 20091107 weekly logs)
# SUMMARY for threshold 5.0:
# Correctly non-spam:  29677  99.82%
# Correctly spam:      21106  90.42%
# False positives:        54   0.18%
# False negatives:      2235   9.58%
# TCR(l=50): 4.729686  SpamRecall: 90.425%  SpamPrec: 99.745%

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155#c146
GA SCORES
./fp-fn-statistics -s 3  (wt-* 20091107 weekly logs)
# SUMMARY for threshold 5.0:
# Correctly non-spam:  29624  99.64%
# Correctly spam:      21039  90.14%
# False positives:       107   0.36%
# False negatives:      2302   9.86%
# TCR(l=50): 3.050314  SpamRecall: 90.138%  SpamPrec: 99.494%

(In reply to comment #153)
> Created an attachment (id=4568) [details]
> Checker for rules that match more ham than spam
>
> Collected selections from several more runs of my script. I took the last
> three days' worth of masschecks plus the run last week, hand-picked rules
> with a high score (~1.0+) but low S/O (~0.250-), and then looked for repeat
> offenders. This is the list, with each rule's worst S/O of any run:
>
>  S/O  RANK HAM%   SPAM%   Score attachment 4565 [details]   Rule
> .195  .29  0.8975 0.2173  1.799 0.579 0.901 0.882  HTML_IMAGE_RATIO_06

score HTML_IMAGE_RATIO_02 2.199 0.805 1.200 0.437
score HTML_IMAGE_RATIO_04 2.089 0.610 0.607 0.556
score HTML_IMAGE_RATIO_06 1.799 0.579 0.901 0.882
score HTML_IMAGE_RATIO_08 1.410 0.351 0.874 0.021

Is it logical to zero out HTML_IMAGE_RATIO_06 when these others have scores?
It feels like either our corpus sample size was not large and varied enough,
or we are doing something else wrong. These particular rules had scores much
lower from the 3.2.0 GA.

> S/O RANK HAM% SPAM% Score attachment 4565 [details] Rule
> .241 .34 1.4248 0.4529 1.0 EXTRA_MPART_TYPE

I suppose this is the clearest case of a rule we should zero out.
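As a cross-check on the summaries above: TCR(λ) is the total cost ratio Nspam / (λ·FP + FN), and the recall/precision figures follow from the raw counts. A small sketch (helper name is illustrative) reproducing the "ORIGINAL SCORES" line:

```python
# Recompute the summary metrics for the "ORIGINAL SCORES" run above.
# TCR(lambda) = Nspam / (lambda * FP + FN); helper name is illustrative.

def summarize(tp_spam, fn, tn_ham, fp, lam=50):
    n_spam = tp_spam + fn                       # total spam messages
    tcr = n_spam / (lam * fp + fn)              # total cost ratio
    recall = 100.0 * tp_spam / n_spam           # SpamRecall %
    precision = 100.0 * tp_spam / (tp_spam + fp)  # SpamPrec %
    return tcr, recall, precision

tcr, rec, prec = summarize(tp_spam=21106, fn=2235, tn_ham=29677, fp=54)
print(f"TCR(l=50): {tcr:.6f}  SpamRecall: {rec:.3f}%  SpamPrec: {prec:.3f}%")
# -> TCR(l=50): 4.729686  SpamRecall: 90.425%  SpamPrec: 99.745%
```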
TVD_RCVD_SPACE_BRACKET
MISSING_MIME_HB_SEP
FUZZY_CPILL
X_IP  (Bug #5920 appears not fixed as claimed.)
FRT_SOMA2
CTYPE_001C_B
MIME_BASE64_BLANKS
WEIRD_QUOTING
SPF_HELO_FAIL
EXTRA_MPART_TYPE

It appears to be correct to zero out these rules, or at least make them
informational.

spamassassin-3.2.5:
score HTML_IMAGE_RATIO_02 1.518 0.550 0.573 0.383
score HTML_IMAGE_RATIO_04 1.561 0.170 0.863 0.172
score HTML_IMAGE_RATIO_06 0.401 0.001 0.501 0.001
score HTML_IMAGE_RATIO_08 0.203 0.001 0.179 0.001

attachment 4565 [details] (resulting 50_scores.cf from garescorer runs - V5):
score HTML_IMAGE_RATIO_02 2.199 0.805 1.200 0.437
score HTML_IMAGE_RATIO_04 2.089 0.610 0.607 0.556
score HTML_IMAGE_RATIO_06 1.799 0.579 0.901 0.882
score HTML_IMAGE_RATIO_08 1.410 0.351 0.874 0.021

The old scores showed a more linear relationship, with a sharp drop-off
between _04 and _06. Our masscheck results indicate _02 and _04 hit on more
spam than ham, but _06 and _08 are pretty worthless. I think we should zero
out _06 and _08 while reducing the scores of _02 and _04.
(In reply to comment #157)
> spamassassin-3.2.5
> score HTML_IMAGE_RATIO_02 1.518 0.550 0.573 0.383
> score HTML_IMAGE_RATIO_04 1.561 0.170 0.863 0.172
> score HTML_IMAGE_RATIO_06 0.401 0.001 0.501 0.001
> score HTML_IMAGE_RATIO_08 0.203 0.001 0.179 0.001
>
> attachment 4565 [details]
> resulting 50_scores.cf from garescorer runs - V5
> score HTML_IMAGE_RATIO_02 2.199 0.805 1.200 0.437
> score HTML_IMAGE_RATIO_04 2.089 0.610 0.607 0.556
> score HTML_IMAGE_RATIO_06 1.799 0.579 0.901 0.882
> score HTML_IMAGE_RATIO_08 1.410 0.351 0.874 0.021
>
> The old scores showed a more linear relationship, with a sharp drop-off
> between _04 and _06. Our masscheck results indicate _02 and _04 hit on
> more spam than ham, but _06 and _08 are pretty worthless. I think we
> should zero out _06 and _08 while reducing the scores of _02 and _04.

I didn't mention _08 because it wasn't a remarkable enough margin of
HAM > SPAM (my script only reports if HAM% + 0.05 > SPAM%), and my
hand-sampling utilized S/O ratios under .250 while this rule is .320.
Still, it has the problem:

SPAM%  HAM%   S/O   RANK SCORE NAME                DateRev
0.2709 0.5491 0.330 0.34 0.20  HTML_IMAGE_RATIO_08 20091111-r834803-n
0.2717 0.5492 0.331 0.34 0.20  HTML_IMAGE_RATIO_08 20091110-r834389-n
0.2672 0.5493 0.327 0.34 0.20  HTML_IMAGE_RATIO_08 20091109-r833997-n
0.2075 0.4995 0.294 0.34 0.20  HTML_IMAGE_RATIO_08 20091104-r832683-n
0.2548 0.5476 0.318 0.34 0.20  HTML_IMAGE_RATIO_08 20091028-r830464-n

Here are the results from the 20091111-r834803-n set, pruning only rules
scoring under 0.2 (all hits from my last report are present and asterisked):

 S/O  RANK HAM%   SPAM%    Score in attachment 4565 [details]  Rule
.014  .15  0.6328  0.0093  0.001 0.001 0.131 0.700  TVD_RCVD_SPACE_BRACKET*
.015  .24  0.1927  0.0029  0.000 2.099 0.001 1.711  MISSING_MIME_HB_SEP*
.019  .22  0.2528  0.0049  1.482 0.855 2.399 2.399  FUZZY_CPILL*
.043  .29  0.1298  0.0059  0.001 1.699 1.498 1.699  X_IP*
.075  .35  0.0603  0.0049  0.000 0.001 0.308 0.001  HTML_NONELEMENT_30_40
.092  .21  0.8123  0.0825  0.699 0.332 0.480 0.800  MIME_BASE64_BLANKS*
.106  .25  0.2483  0.0293  0.551 1.026 1.033 1.250  CTYPE_001C_B*
.123  .33  0.0837  0.0117  0.001 0.648 0.836 1.293  TVD_FW_GRAPHIC_NAME_LONG
.123  .28  0.1632  0.0229  0.001 2.499 0.392 0.164  DRUGS_MUSCLE(*)
.130  .25  0.3663  0.0547  2.385 0.345 0.998 2.503  FRT_SOMA2*
.155  .29  0.1736  0.0317  0.001 0.001 0.001 1.741  MIME_BASE64_TEXT
.188  .27  0.4622  0.1069  0     0.973 0     2.385  SPF_HELO_FAIL*
.214  .31  0.1449  0.0395  2.200 2.199 0.540 2.199  WEIRD_QUOTING*
.239  .30  0.8321  0.2612  1.799 0.579 0.901 0.882  HTML_IMAGE_RATIO_06*
.254  .34  1.3070  0.4442  1.0                      EXTRA_MPART_TYPE*
.330  .34  0.5491  0.2709  1.410 0.351 0.874 0.021  HTML_IMAGE_RATIO_08
.363  .38  1.0856  0.6194  2.600 2.070 1.233 3.405  DATE_IN_PAST_96_XX
.368  .36  0.3029  0.1767  0.001 0.791 0.001 0.008  UPPERCASE_50_75
.381  .37  0.6473  0.3983  0.354 0.001 0.725 0.428  MIME_HTML_MOSTLY
.660  .51  1.8514  3.5893  0.518 1.625 1.197 1.506  SUBJ_ALL_CAPS
.905  .58  1.0822 10.2987  0     1.246 0     1.347  RCVD_IN_BL_SPAMCOP_NET
.934  .56  3.6172 51.2001  2.199 1.105 1.199 0.723  MIME_HTML_ONLY
.957  .52  2.2200 50.3063  2.399 1.274 1.228 0.793  RDNS_NONE

DRUGS_MUSCLE met all the requirements I set for my last report, but I
removed it because it had almost no hits anyway, and it scored very, very
low except on net+no-bayes, so I was assuming it had some justification
there somehow.
Will we go ahead and check in those scores anyway? That would allow another
beta (soon).

re: HTML_IMAGE_RATIO_* -- it's very common for that kind of "multi-valued"
set of rules to wind up with nonintuitive scoring. This happens from either
low hitrates or from hitting alongside other (better) rules.
(In reply to comment #142)
> Seems to me that many / most(?) HABEAS_ACCREDITED_SOI supposedly
> false positives are due to freelotto.com mail. I wonder whether such
> samples are rightfully in the spam* corpora - I'd say yes, but,
> as they say, spam is about consent, not content, and people receiving
> mail from freelotto.com most likely did register once, not realizing
> what they are dealing with. So there was a consent, at least initially.
> It is also about fraud and advertising, so, should one leave such
> mail samples in the spam corpus or not?

Perhaps we should explicitly exclude known-sketchy senders like
freelotto.com from HABEAS_ACCREDITED_SOI. This would allow us to more easily
monitor for clear violators by not being distracted by the common FP's.
Exclusion in this case only brings the listed sender back to neutral, which
is pretty clearly a good idea. Any objections? Otherwise I'll file a
separate bug for this.
-score RDNS_NONE 0.1
-score RDNS_DYNAMIC 0.1
+# score RDNS_NONE 0 1.1 0 0.7
+# score RDNS_DYNAMIC 0 0.5 0 0.5

These are supposed to be informational rules, according to the comment.
Are they supposed to become commented out? Doesn't a commented-out score
mean the rule defaults to 1 point?
fp-fn-statistics across the entire "rescore" logs.

Set 3 Before
============
# SUMMARY for threshold 5.0:
# Correctly non-spam:  703647  99.90%
# Correctly spam:     2559525  98.28%
# False positives:        719   0.10%
# False negatives:      44795   1.72%
# TCR(l=50): 32.253638  SpamRecall: 98.280%  SpamPrec: 99.972%

Set 3 Raw Rescoring from Comment #146
=====================================
# SUMMARY for threshold 5.0:
# Correctly non-spam:  703520  99.88%
# Correctly spam:     2548134  97.84%
# False positives:        846   0.12%
# False negatives:      56186   2.16%
# TCR(l=50): 26.443555  SpamRecall: 97.843%  SpamPrec: 99.967%

Doesn't look like an improvement.

Set 3 + Rescore + Reductions
============================
# SUMMARY for threshold 5.0:
# Correctly non-spam:  704002  99.95%
# Correctly spam:     2558896  98.26%
# False positives:        364   0.05%
# False negatives:      45424   1.74%
# TCR(l=50): 40.932981  SpamRecall: 98.256%  SpamPrec: 99.986%

Looks like a statistically insignificant improvement over the old scores. I
only hope our corpora were sufficiently varied.

Rules Made Informational
========================
TVD_RCVD_SPACE_BRACKET
MISSING_MIME_HB_SEP
FUZZY_CPILL
X_IP  (Bug #5920 appears not fixed as claimed.)
FRT_SOMA2
CTYPE_001C_B
MIME_BASE64_BLANKS
WEIRD_QUOTING
SPF_HELO_FAIL
HTML_IMAGE_RATIO_06
HTML_IMAGE_RATIO_08

Other Changes
=============
* EXTRA_MPART_TYPE was left at 1.0 because, while it does relatively poorly
  in the weekly masscheck, it did far better in the rescore masscheck.
* I am increasing the scores of PSBL *after* the above fp-fn-statistics run,
  because the old logs do not reflect its current safety level.

I am committing these changes now. I suspect the key to these reductions is
getting rid of the rules that wouldn't have passed our ruleqa auto-promotion
criteria? There might be additional tweaks to make. Please comment here.
http://hudson.zones.apache.org/hudson/job/SpamAssassin-trunk/4344/testReport/

-score MISSING_HB_SEP 2.5
+# score MISSING_HB_SEP 2.5
+score MISSING_HB_SEP 0 # n=0 n=1 n=2

-score X_MESSAGE_INFO 3.499 3.496 3.330 1.597
+score X_MESSAGE_INFO 0 # n=0 n=1 n=2 n=3

It appears that tests are failing after this commit because rules required
by this test were zeroed out. These rules seem to have almost zero hits in
masscheck. What should we do about this?
> It appears that tests here are failing after commit because rules required
> by this test were zeroed out. It seems these rules have almost zero hits in
> masscheck. What should we do about this?

Bug 6155 #163: force nonzero scores on MISSING_HB_SEP and X_MESSAGE_INFO
for the test
  Sending t/missing_hb_separator.t
  Committed revision 881240.

I hope this is the right approach. An alternative would be to introduce a
file similar to t/data/01_test_rules.cf to hold score overrides, but with a
name like 51_test_rules.cf so that it sorts after 50_scores.cf. Btw, is the
01_ in the name intentional, or could the existing file just be renamed to
something like 99_test_rules.cf?
(In reply to comment #161)
> -score RDNS_NONE 0.1
> -score RDNS_DYNAMIC 0.1
> +# score RDNS_NONE 0 1.1 0 0.7
> +# score RDNS_DYNAMIC 0 0.5 0 0.5
> Doesn't commented out mean 1 point?

It would mean 1 point, if there were no other score lines for these two
rules:

score RDNS_DYNAMIC 2.639 0.363 1.663 0.982
score RDNS_NONE    2.399 1.274 1.228 0.793

> These are supposed to be informational rules according to the comment.
> Is this supposed to become commented out?

See comments 116, 120, 124, 137, 139. I left it mutable; I think it still
makes sense - it's kind of a poor man's Botnet plugin.
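For context, a hedged illustration (not the actual 50_scores.cf content) of the defaulting behavior being discussed: a rule with no uncommented `score` line at all defaults to 1.0, so commenting out a score line only changes anything if no other score line for that rule remains in effect.

```
# Hypothetical .cf fragment illustrating score defaulting (EXAMPLE_RULE is
# made up for this sketch; it is not a real SpamAssassin rule):

body   EXAMPLE_RULE   /example pattern/
# score EXAMPLE_RULE 0 1.1 0 0.7     <- commented out: if no other score
#                                       line exists, EXAMPLE_RULE scores 1.0

score  EXAMPLE_RULE 2.399 1.274 1.228 0.793   # an uncommented score line
                                              # elsewhere takes effect instead
```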
(In reply to comment #164)
> [...]
> Bug 6155 #163: force nonzero scores on MISSING_HB_SEP and X_MESSAGE_INFO
> for the test
> Sending t/missing_hb_separator.t
> Committed revision 881240.
>
> I hope this is the right approach. Alternative would be to introduce
> a file similar to t/data/01_test_rules.cf to hold score overrides, but
> with a name like 51_test_rules.cf to be sorted after the 50_scores.cf.
> Btw, is the 01_ in the name intentional, or could the existing file
> just be renamed to something like 99_test_rules.cf ?

X_MESSAGE_INFO can be dropped, but MISSING_HB_SEP should not have been made
mutable; I'd say lock it to 2.5.

btw, it is to be expected that with less mutability the scores become
slightly less optimal for the rescoring corpus; this always happens. If
scores are allowed to wander without locking down the "unsafe" rules, the
GA will overfit to the training data and produce great FP/FN figures, but
scores that are risky for "real world" usage.
(In reply to comment #166)
> [...]
> X_MESSAGE_INFO can be dropped, but MISSING_HB_SEP should not have been made
> mutable; I'd say lock to 2.5.

Locally, I have lowered the MISSING_HB_SEP score to 0.5.

Lots of funky ERP stuff seems to have a talent for FPing on it. It's great
for metas, but it usually pushes scores close to an FP with the usual
suspects and their very ugly HTML formatting. (Sorry, cannot supply
samples.)

I'd say 2.5 is sorta high.

Axb
(In reply to comment #167)
> locally, I have lowered the MISSING_HB_SEP score to 0.5
> [...]
> I'd say 2.5 is sorta high

ok -- I was under the impression it was FP-free. 0.5 works for me in that
case.
spamassassin/trunk/rulesrc/10_force_active.cf

It seems this file needs to be updated after the rescoring. Should all the
rules in 50_scores.cf be listed in 10_force_active.cf? Even the rules that
are zeroed out in 50_scores.cf?
Created attachment 4579 [details]
patch for 10_force_active.cf

Nobody responded to the previous comment, and I didn't know how this file
was generated before. For this patch I took 50_scores.cf and extracted all
rule names that were not commented out. Is this correct?
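The extraction described above amounts to something like this (a hypothetical sketch; the actual patch may have been produced differently):

```python
# Hypothetical sketch: collect rule names from uncommented "score" lines of
# a 50_scores.cf-style file, e.g. to regenerate 10_force_active.cf.
import re

def active_rule_names(lines):
    names = []
    for line in lines:
        # "# score FOO ..." is commented out, so it will not match here
        m = re.match(r'^\s*score\s+(\S+)', line)
        if m:
            names.append(m.group(1))
    return names

cf = """\
score RDNS_DYNAMIC 2.639 0.363 1.663 0.982
# score RDNS_NONE 0 1.1 0 0.7
score FUZZY_CPILL 0.001
"""
print(active_rule_names(cf.splitlines()))
# -> ['RDNS_DYNAMIC', 'FUZZY_CPILL']
```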
>> spamassassin/trunk/rulesrc/10_force_active.cf
>> It seems this file needs to be updated after the rescoring.
>> Should all the rules in 50_scores.cf be listed in 10_force_active.cf?
>> Even the rules that are zeroed out in 50_scores.cf?
>
> Nobody responded to the previous comment.
> I didn't know how this file was generated before.

No idea, sorry. I haven't been around that long.

> I took 50_scores.cf and took all rule names that were not
> commented out for this patch. Is this correct?

Probably.

Btw, the:
  prove xt/10_rule_test_suite.t
is failing for several rules. Can someone more familiar with rules please
check where the reported problems lie?
Warren,

The file was originally used to list all *rules from sandboxes* that had
scores assigned by the GA, so that they didn't get auto-demoted, leaving a
score line but no rule. I don't think its use has changed, but I'm not
completely up-to-date on the re-org of the rules source structure. jm might
have a script to generate the file... although it's been a long time.
Sending rulesrc/10_force_active.cf
Transmitting file data .
Committed revision 884912.

Please review.
Restoring comment originally made by Mark Martinec

(In reply to comment #171)
> Btw, the:
> prove xt/10_rule_test_suite.t
> is failing for several rules. Can someone more familiar with rules
> please check where the reported problems lie?

Actually, it's just two rules failing on multiple tests: FM_FRM_RN_L_BRACK
and TVD_SPACE_RATIO. Luckily their scores are zero or near zero:

score TVD_SPACE_RATIO 0.001
score FM_FRM_RN_L_BRACK 0

| Changed score of FM_FRM_RN_L_BRACK from 0 into 0.001,
| to make xt/10_rule_test_suite.t happy.
| Sending rules/50_scores.cf
| Committed revision 884927.

So that leaves TVD_SPACE_RATIO. Is it something to worry about?
10_force_active.cf is generated at this step in the RescoreMassCheck process
(see https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155#c3):

  6.5. mark evolved-score rules as 'always published'

Sounds like we could be missing a few steps if that got missed...
http://wiki.apache.org/spamassassin/RescoreMassCheck

Mark, did you do these steps?
  6. upload the test logs to zone
  8. make the stats files
  8.1. upload new stats files
> Mark, did you do these steps?
> 6. upload the test logs to zone
> 8. make the stats files
> 8.1. upload new stats files

No, I stopped at step 5, 'generate scores for score sets'; I only attached
the score file for consideration.
Mark, it appears that only you can do those steps?
Mark, please correct me if I am wrong, but it seems only you can complete
the final steps, since we don't know exactly which subset of data you used.
> Mark, please correct me if I am wrong. But it seems only you can complete
> the final steps since we don't know exactly which subset of data you used.

I'm doing it right now. The config.set* files are already checked in, logs
are being transferred, ...
Ok, I think I'm done now (RescoreMassCheck):

5. generate scores for score sets
   svn commit -m "runGA config files used" masses/config.set*
   r886173 | mmartinec | 2009-12-02 16:24:32 +0100 (Wed, 02 Dec 2009) | 1 line
   runGA config files used

   tar cvf rescore-logs.tar gen-set{0,1,2,3}-*

6. upload the test logs to zone (spamassassin.zones.apache.org):
   sudo mkdir /home/corpus-rsync/ARCHIVE/3.3.0
   sudo mv rescore-logs.tar.bz2 \
     /home/corpus-rsync/ARCHIVE/3.3.0/rescore-logs-bug6155.tar.bz2
   ls -l /home/corpus-rsync/ARCHIVE/3.3.0/rescore-logs-bug6155.tar.bz2
   -rw-r--r-- 1 mmartinec other 20380424 Dec 2 18:23
     /home/corpus-rsync/ARCHIVE/3.3.0/rescore-logs-bug6155.tar.bz2

6.5. mark evolved-score rules as 'always published'
   ./force-publish-active-rules ../rules/active.list \
     ../rulesrc/10_force_active.cf
   svn commit -m "force publish of rescored rules" \
     ../rulesrc/10_force_active.cf
   r886212 | mmartinec | 2009-12-02 18:33:57 +0100 (Wed, 02 Dec 2009) | 3 lines
   Bug 6155: generated new rulesrc/10_force_active.cf
   as per step 6.5 in RescoreMassCheck

6.6. fix test failures
   nothing to tweak, all tests pass

7. upload proposed new scores
   done some time ago, some tweaks later:
   r881159 | wtogami | 2009-11-17 06:35:00 +0100 (Tue, 17 Nov 2009) | 2 lines
   Bug #6155 commit raw scores from Comment #146 as documented in #162.

   To view the diffs:
   svn diff -r 881158:886232 rules/50_scores.cf

8. Make the stats files
   cp config.set0 config ; bash ./runGA stats
   cp config.set1 config ; bash ./runGA stats
   cp config.set2 config ; bash ./runGA stats
   cp config.set3 config ; bash ./runGA stats

8.1. upload new stats files
   r886232 | mmartinec | 2009-12-02 19:11:35 +0100 (Wed, 02 Dec 2009) | 2 lines
   rules/STATISTICS-set*.txt

> Attach the new proposed STATISTICS*.txt as a patch to the rescoring bug

Too many differences; just do a:
   svn diff -c886232
6.5. mark evolved-score rules as 'always published'
   cd masses
   ./force-publish-active-rules ../rules/active.list \
     ../rulesrc/10_force_active.cf
   svn commit -m "force publish of rescored rules" \
     ../rulesrc/10_force_active.cf

Doing this seems to remove all the zero-score rules from
10_force_active.cf. Does this make any difference?
Why is active.list (the result of auto-promotion) relevant as input to this
script? It seems like circular logic that makes no sense.

+ SPAMMY_MIME_BDRY_01

force-publish-active-rules added a few lines like this that have no scores
assigned in rules/50_scores.cf. It seems that what I already did, copying
rule names from rules/50_scores.cf into rulesrc/10_force_active.cf, is more
correct?

If so, then it appears we are ready for beta if we can clear up the GPG key
issue in Bug #6223.
(In reply to comment #183)
> Why is active.list (the result of auto-promotion) relevant as input to this
> script? Seems kind of like circular logic that makes no sense.
> [...]
> It seems what I already did by copying rule names from rules/50_scores.cf
> into rulesrc/10_force_active.cf is more correct?

I think you're right. Could you open a side-bug for that issue so we can
fix it post-release? Anyway, this is now fixed.
I have a hunch that FREEMAIL_ENVFROM_END_DIGIT has a bit too high a score
(1.553). Probably there wasn't enough "nicedude90"-style ham in the corpora.
Strangely, FREEMAIL_REPLYTO_END_DIGIT has a lower score; one would think it
would be safer FP-wise.