Bug 6155 - generate new scores for 3.3.0 release
Summary: generate new scores for 3.3.0 release
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Score Generation
Version: unspecified
Hardware: Other
OS: All
Importance: P1 blocker
Target Milestone: 3.3.0
Assignee: SpamAssassin Developer Mailing List
Depends on: 6188
Reported: 2009-07-15 03:41 UTC by Justin Mason
Modified: 2010-01-05 10:47 UTC
CC: 8 users



Attachments (description, type, status, submitter/CLA status):
Ignore missing support for ADSP in old versions of Mail::DKIM. patch None Mark Martinec [HasCLA]
sample new scores, as diff patch None Justin Mason [HasCLA]
freqs file on all submitted files for rescore mass-checks text/plain None Mark Martinec [HasCLA]
resulting 'scores' file from a GA run text/plain None Mark Martinec [HasCLA]
resulting 50_scores.cf from garescorer runs text/plain None Mark Martinec [HasCLA]
resulting 50_scores.cf from garescorer runs - V2 text/plain None Mark Martinec [HasCLA]
resulting 50_scores.cf from garescorer runs - V3 text/plain None Mark Martinec [HasCLA]
freqs.full of corpora used for score set 3 and 2 runs text/plain None Mark Martinec [HasCLA]
ranges.data on corpora used for score set 3 and 2 runs text/plain None Mark Martinec [HasCLA]
Checker for rules that match more ham than spam text/plain None Adam Katz [HasCLA]
Checker for rules that match more ham than spam text/plain None Adam Katz [HasCLA]
resulting 50_scores.cf from garescorer runs - V5 text/plain None Mark Martinec [HasCLA]
GA cost vs. iterations image/png None Mark Martinec [HasCLA]
Scaled diagram of the previous one, only sets 3 and 1 shown image/png None Mark Martinec [HasCLA]
Checker for rules that match more ham than spam text/plain None Adam Katz [HasCLA]
patch for 10_force_active.cf patch None Warren Togami [HasCLA]

Description Justin Mason 2009-07-15 03:41:07 UTC
Here's a ticket to track this release work item.

Do we actually need to do this, though, since we have Daryl's code generating scores weekly from nightly mass-check results?
Comment 1 Justin Mason 2009-07-31 09:56:38 UTC
(In reply to comment #0)
> Do we actually need to do this, though, since we have Daryl's code generating
> scores weekly from nightly mass-check results?

well, we need to fix that, actually. it seems to be broken.
Comment 2 Justin Mason 2009-08-14 13:30:20 UTC
This time around, I think I'll scrap the confusing differentiation between nightly mass-check result submission rsync accounts and "submit" accounts.  Anyone object?

I'm going to try a test run of the evolver based on nightly mass-check logs, btw.
Comment 3 Justin Mason 2009-08-14 13:31:47 UTC
http://wiki.apache.org/spamassassin/RescoreMassCheck is the procedure, as in previous releases.

fwiw, we have 1022294 spams and 271617 hams in our nightly corpora, currently.
Comment 4 Mark Martinec 2009-08-17 13:15:01 UTC
Created attachment 4517 [details]
Ignore missing support for ADSP in old versions of Mail::DKIM.
Comment 5 Justin Mason 2009-08-17 13:27:40 UTC
(In reply to comment #4)
> Created an attachment (id=4517) [details]
> Ignore missing support for ADSP in old versions of Mail::DKIM.

wrong bug I suspect! ;)
Comment 6 Warren Togami 2009-08-17 14:25:33 UTC
Is there still time to add more nightlies for this rescoring?  There is another major Japanese user who is very close to joining.

How important is this rescoring?  Do nightlies help to rescore the sa-update scores?
Comment 7 Justin Mason 2009-08-17 15:21:17 UTC
ok, I think I've ironed out a couple of issues.  Let's see what people think of these sample scores:

http://taint.org/x/2009/gen-set0-2.0-5.0-500-ga_scores
http://taint.org/x/2009/gen-set1-5.0-5.0-500-ga_scores
http://taint.org/x/2009/gen-set2-2.0-5.0-500-ga_scores
http://taint.org/x/2009/gen-set3-5.0-5.0-500-ga_scores


here are the test results against the "test" fold for each scoreset:

gen-set0-2.0-5.0-500-ga/test
Reading scores from "tmprules"...
Reading per-message hit stat logs and scores...

# SUMMARY for threshold 5.0:
# Correctly non-spam:  26453  99.07%
# Correctly spam:      83369  81.53%
# False positives:       249  0.93%
# False negatives:     18882  18.47%
# TCR(l=50): 3.263469  SpamRecall: 81.534%  SpamPrec: 99.702%


gen-set1-5.0-5.0-500-ga/test
Reading scores from "tmprules"...
Reading per-message hit stat logs and scores...

# SUMMARY for threshold 5.0:
# Correctly non-spam:  26646  99.79%
# Correctly spam:     100943  98.72%
# False positives:        56  0.21%
# False negatives:      1308  1.28%
# TCR(l=50): 24.890701  SpamRecall: 98.721%  SpamPrec: 99.945%


gen-set2-2.0-5.0-500-ga/test
Reading scores from "tmprules"...
Reading per-message hit stat logs and scores...

# SUMMARY for threshold 5.0:
# Correctly non-spam:  26485  99.19%
# Correctly spam:      84218  82.36%
# False positives:       217  0.81%
# False negatives:     18033  17.64%
# TCR(l=50): 3.540179  SpamRecall: 82.364%  SpamPrec: 99.743%


gen-set3-5.0-5.0-500-ga/test
Reading scores from "tmprules"...
Reading per-message hit stat logs and scores...

# SUMMARY for threshold 5.0:
# Correctly non-spam:  26662  99.85%
# Correctly spam:     100964  98.74%
# False positives:        40  0.15%
# False negatives:      1287  1.26%
# TCR(l=50): 31.107697  SpamRecall: 98.741%  SpamPrec: 99.960%

Yes, set0 and set2 are terrible.  This is pretty much what happened last time, too; our ruleset is pretty crappy nowadays without network rules active.  But the net rule results are very good!  However I think I need to look into the local rule GA runs if possible.

Bug 5270 is the 3.2.0 rescoring run, for reference.

Spamhaus will be happy to see a much improved score for RCVD_IN_PBL ;)

gen-set1-5.0-5.0-500-ga_scores:score RCVD_IN_PBL                    2.596
gen-set3-5.0-5.0-500-ga_scores:score RCVD_IN_PBL                    2.411
Comment 8 Justin Mason 2009-08-17 16:05:38 UTC
Created attachment 4518 [details]
sample new scores, as diff

here are the results of a GA run for each set.  Please shout about any and all issues you spot (and there are a few, I think, e.g. the ACCESSDB score leakage, which should probably be ignored by the masses scripts)
Comment 9 Warren Togami 2009-08-17 20:46:17 UTC
http://ruleqa.spamassassin.org/20090817-r804903-n/TVD_SPACE_RATIO/detail
90% FP rate for Japanese
http://ruleqa.spamassassin.org/20090817-r804903-n/PLING_QUERY/detail
52% FP rate for Japanese
http://ruleqa.spamassassin.org/20090817-r804903-n/GAPPY_SUBJECT/detail
44% FP rate for Japanese

All three of these rules do very poorly with Japanese mail, and the total % SPAM is lower than the % FP.  Yet the GA scores are rather high since we don't have a statistically significant amount of Japanese mail in the corpus.

What language are the SPAM hits?  Perhaps many of these rules are really identifying foreign languages rather than determining whether a message is ham or spam?

Bug #6149 is related to this problem.

I am attempting to convince Japanese, Chinese and Korean users to join the nightly masscheck, but it is very difficult.
Comment 10 Justin Mason 2009-08-18 01:15:46 UTC
(In reply to comment #9)
> http://ruleqa.spamassassin.org/20090817-r804903-n/TVD_SPACE_RATIO/detail
> 90% FP rate for Japanese
> http://ruleqa.spamassassin.org/20090817-r804903-n/PLING_QUERY/detail
> 52% FP rate for Japanese
> http://ruleqa.spamassassin.org/20090817-r804903-n/GAPPY_SUBJECT/detail
> 44% FP rate for Japanese
> 
> All three of these rules do very poorly with Japanese mail, and the total %
> SPAM is lower than the % FP.  Yet the GA scores are rather high since we don't
> have a statistically significant amount of Japanese mail in the corpus.
> 
> What language are the SPAM hits?  Perhaps many are examples of identifying
> foreign languages instead of determining if it is ham or spam?
> 
> Bug #6149 is related to this problem.

I plan to fix that, alright. 

> I am attempting to convince Japanese, Chinese and Korean users to join the
> nightly masscheck, but it is very difficult.

BTW, you could also take copies of their mail samples and add them to your own corpora, in effect acting as a proxy for them.  that's easier for them than setting up all the infrastructure.  (I thought you were already doing this ;)

You may need to be able to ask them if a mail _really_ is ham, down the line, though, so it needs to remain a two-way arrangement.
Comment 11 Warren Togami 2009-08-18 19:24:47 UTC
> BTW, you could also take copies of their mail samples and add them to your own
> corpora, in effect acting as a proxy for them.  that's easier for them than
> setting up all the infrastructure.  (I thought you were already doing this ;)

I have 3 English and 3 Japanese users in my corpus at the moment.  One additional Japanese user, rio, will hopefully start the nightly masscheck tonight.  He is doing his own masschecks.

> You may need to be able to ask them if a mail _really_ is ham, down the line,
> though, so it needs to remain a two-way arrangement.

I asked them very carefully to avoid mis-classification.  This is part of the difficulty of getting more volunteers, aside from the privacy worries.

I look forward to seeing the effect of the fix in Bug #6149 on the next masscheck.  I asked one of my users to pick a few dozen real-world sample messages that trigger the three rules in Comment #9 for the test suite.
Comment 12 Warren Togami 2009-08-19 07:30:23 UTC
(In reply to comment #9)
> http://ruleqa.spamassassin.org/20090817-r804903-n/TVD_SPACE_RATIO/detail
> 90% FP rate for Japanese
> http://ruleqa.spamassassin.org/20090817-r804903-n/PLING_QUERY/detail
> 52% FP rate for Japanese
> http://ruleqa.spamassassin.org/20090817-r804903-n/GAPPY_SUBJECT/detail
> 44% FP rate for Japanese

http://ruleqa.spamassassin.org/20090819-r805703-n/TVD_SPACE_RATIO/detail
0% FP rate for that particular Japanese user
http://ruleqa.spamassassin.org/20090819-r805703-n/PLING_QUERY/detail
0% FP rate for that particular Japanese user (Huh?  You changed this rule too?)
http://ruleqa.spamassassin.org/20090819-r805703-n/__GAPPY_SUBJECT/detail
44% FP rate for that particular Japanese user
Comment 13 Warren Togami 2009-08-19 08:01:28 UTC
http://ruleqa.spamassassin.org/20090819-r805703-n/GAPPY_SUBJECT/detail
0% FP rate

Oops, wrong one?
Comment 14 Justin Mason 2009-08-19 15:42:09 UTC
(In reply to comment #13)
> http://ruleqa.spamassassin.org/20090819-r805703-n/GAPPY_SUBJECT/detail
> 0% FP rate
> 
> Oops, wrong one?

yep, __GAPPY_SUBJECT is likely to have fps, GAPPY_SUBJECT avoids them.
Comment 15 Warren Togami 2009-08-19 18:52:38 UTC
Looks good, looking forward to the next test scores.


Some questions...

How important is this rescoring?

Will future nightly masschecks help to rescore the sa-update scores?

Should I bother to continue recruiting more masscheck participants after this rescore?
Comment 16 Justin Mason 2009-08-25 13:54:54 UTC
(In reply to comment #15)
> How important is this rescoring?
> Will future nightly masschecks help to rescore the sa-update scores?

scores for the base ruleset (non-sandbox rules) won't change otherwise, so this is important. For nightly masschecks, the only scores affected will be those of sandbox rules.  So only about 1/2 of the ruleset, I'd reckon.

> Should I bother to continue recruiting more masscheck participants after this
> rescore?

No, I think as long as they provide results for the rescore, that's the most important thing.

Has anyone had inspiration about the reason for the bad set0 results? (I haven't looked yet)
Comment 17 Warren Togami 2009-08-26 00:48:59 UTC
> scores for the base ruleset (non-sandbox rules) won't change otherwise, so
> this is important. For nightly masschecks, the only scores affected will be
> those of sandbox rules.  So only about 1/2 of the ruleset, I'd reckon.

I am curious, do you remember the original reason for this design decision?

Might there be value in making the entire ruleset's scores affected by nightly masschecks?
Comment 18 Justin Mason 2009-08-26 01:58:53 UTC
(In reply to comment #17)
> > scores for the base ruleset (non-sandbox rules) won't change otherwise, so
> > this is important. For nightly masschecks, the only scores affected will be
> > those of sandbox rules.  So only about 1/2 of the ruleset, I'd reckon.
> 
> I am curious, do you remember the original reason for this design decision?
> 
> Might there be value in making the entire ruleset's scores affected by nightly
> masschecks?

iirc, the risk is that a small set of corpora (e.g. a few people take a week off) could cause the entire ruleset to be skewed incorrectly.  This way at least only the most recent (sandbox) rules would be affected, so it's a bit safer.

It's also faster to generate the scores, but this isn't so much of an issue now, as our main machine is quite beefy...

There may have been other reasons, too, but I can't find the mails :(
Comment 19 Warren Togami 2009-08-26 18:11:50 UTC
> iirc, the risk is that a small set of corpora (e.g. a few people take a week
> off) could cause the entire ruleset to be skewed incorrectly.  This way at
> least only the most recent (sandbox) rules would be affected, so it's a bit
> safer.

> It's also faster to generate the scores, but this isn't so much of an issue
> now, as our main machine is quite beefy...

> There may have been other reasons, too, but I can't find the mails :(

I feel like we have too little diversity in the type and number of ham contributors.  This rescoring would be a big improvement over our scores from two years ago, and we definitely should do it.

But after 3.3.0 I would like to learn how I can become more involved in order to revamp the score update process.

* I'd like to learn how to operate the GA.
* I want to continue recruiting other nightly masscheck participants.  I want to recruit contributors of non-English languages and non-technical users. 
* I am thinking about writing a toolkit (in RPM and DEB packages) that would make it easier for participants to join masschecks.  The current documented process is very unclear and confusing, and I want to clean this up as well.

With more diversity in masscheck participants, perhaps we can do complete rescoring more often than every 2 years.
Comment 20 Justin Mason 2009-08-27 03:52:29 UTC
(In reply to comment #19)
> I feel like we have too little diversity in the type and number of ham
> contributors.  This rescoring would be a big improvement over our scores from
> two years ago, and we definitely should do it.

yes.

> But after 3.3.0 I would like to learn how I can become more involved in order
> to revamp the score update process.
> 
> * I'd like to learn how to operate the GA.
> * I want to continue recruiting other nightly masscheck participants.  I want
> to recruit contributors of non-English languages and non-technical users. 

Great!  As long as they keep the ham out of the spam and vice versa, and we can occasionally get in touch for eyeball-verification of odd-looking FPs, that'll be very useful ;)

> * I am thinking about writing a toolkit (in RPM and DEB packages) that would
> make it easier for participants to join masschecks.  The current documented
> process is very unclear and confusing, and I want to clean this up as well.

It certainly is.

We've been meaning to improve this for several _years_ now, but it's never been a high enough priority.  mass-check is very dev-oriented, and it should be something bundled (and documented) at a similar level to the sa-compile or sa-update scripts.

Here's the history of previous attempts, which ran out of steam halfway through:

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=3096
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=2853

BTW please ensure that changes in SA (which there will definitely need to be) are submitted back upstream; IMO this functionality should be part of the core package. ;)

> With more diversity in masscheck participants, perhaps we can do complete
> rescoring more often than every 2 years.

Yes.
Comment 21 Justin Mason 2009-08-31 16:05:19 UTC
Let's set a deadline of this Thursday for rule changes.  At that point, I'll set an SVN tag for mass-checking with.  We'll then give everyone 2 weeks to get results in, and build scores with those.

Bugs that are rule-related against the 3.3.0 target:  bug 5380, bug 6119, bug 6156, bug 6183, bug 5937
Comment 22 Warren Togami 2009-08-31 18:30:49 UTC
(In reply to comment #21)
> Let's set a deadline of this Thursday for rule changes.  At that point, I'll
> set an SVN tag for mass-checking with.  We'll then give everyone 2 weeks to get
> results in, and build scores with those.

Will people not paying attention automatically get the mass-checking SVN tag in their nightly mass check?
Comment 23 Daryl C. W. O'Shea 2009-08-31 19:47:53 UTC
(In reply to comment #1)
> (In reply to comment #0)
> > Do we actually need to do this, though, since we have Daryl's code generating
> > scores weekly from nightly mass-check results?
> 
> well, we need to fix that, actually. it seems to be broken.

Crap, is this broken?  I might need to clear some space on the volume it runs on.
Comment 24 Justin Mason 2009-09-01 04:04:22 UTC
(In reply to comment #22)
> (In reply to comment #21)
> > Let's set a deadline of this Thursday for rule changes.  At that point, I'll
> > set an SVN tag for mass-checking with.  We'll then give everyone 2 weeks to get
> > results in, and build scores with those.
> 
> Will people not paying attention automatically get the mass-checking SVN tag in
> their nightly mass check?

no; they have to sync to a specific tag (or download a tarball iirc).
Comment 25 Warren Togami 2009-09-01 05:33:59 UTC
Daryl, is there a URL to your weekly scores?
Comment 26 Daryl C. W. O'Shea 2009-09-01 17:16:53 UTC
(In reply to comment #25)
> Daryl, is there a URL to your weekly scores?

I think that the removal of rulesrc in svn broke it.  I will have to investigate what the change was there and how I can get it working again.
Comment 27 Justin Mason 2009-09-02 13:13:51 UTC
(In reply to comment #21)
> Let's set a deadline of this Thursday for rule changes.  At that point, I'll
> set an SVN tag for mass-checking with.  We'll then give everyone 2 weeks to get
> results in, and build scores with those.

hmm. this is in a bit of trouble due to the broken build for the last few days.  But we can hack something up using the previous working active.list file...
Comment 28 Warren Togami 2009-09-02 13:35:08 UTC
At the very minimum, could we have the one-liner in lib/Mail/SpamAssassin/Plugin/HeaderEval.pm applied?  It should be perfectly safe.
Comment 29 Warren Togami 2009-09-02 13:36:06 UTC
Gah, I really hate how this Bugzilla shows you the next bug after you submit.  I keep posting to the wrong bug.
Comment 30 Mark Martinec 2009-09-03 04:43:21 UTC
(In reply to comment #29)
> Gah, I really hate how this Bugzilla shows you the next bug after you submit. 
> I keep posting to the wrong bug.

I fully agree, it is terribly annoying. Teleports you to some completely unrelated bug, and requires an additional click to come back.
Comment 31 John Hardin 2009-09-03 06:31:32 UTC
(In reply to comment #30)
> (In reply to comment #29)
> > Gah, I really hate how this Bugzilla shows you the next bug
> > after you submit. I keep posting to the wrong bug.
> 
> I fully agree, it is terribly annoying. Teleports you to some completely
> unrelated bug, and requires an additional click to come back.

You're reporting this bug on the wrong bug. :)
Comment 32 Sidney Markowitz 2009-09-03 13:01:38 UTC
(In reply to comment #29)
> Gah, I really hate how this Bugzilla shows you the next bug after you submit. 
> I keep posting to the wrong bug.

It is configurable if you click on the Preferences link near the top of the page: the "After changing a bug" setting.

I just set mine to "Show the updated bug" and I'll see if it works when I submit this comment.
Comment 33 Justin Mason 2009-09-03 13:32:51 UTC
thanks for the pointer Sidney!  I've updated the default preferences, which may fix it.
Comment 34 Justin Mason 2009-09-03 15:01:46 UTC
and the mass-checks are now ready to go! mail sent to users@ and dev@.
Comment 35 Warren Togami 2009-09-03 16:17:08 UTC
(In reply to comment #34)
> and the mass-checks are now ready to go! mail sent to users@ and dev@.

Mail sent?  I don't see it.
Comment 36 Warren Togami 2009-09-04 06:22:36 UTC
(In reply to comment #35)
> (In reply to comment #34)
> > and the mass-checks are now ready to go! mail sent to users@ and dev@.
> 
> Mail sent?  I don't see it.

I don't see any announcements anywhere.  I only saw that you edited the RescoreDetails page.  Is that the only hint that people should be doing it?
Comment 37 Justin Mason 2009-09-04 07:52:01 UTC
dammit. broken laptop mail config ate it :( resending
Comment 38 Justin Mason 2009-09-20 11:03:56 UTC
reminder for myself.  Things that need to be done to the rules before running the GA:

- ensure JM_SOUGHT* is removed from the logs and ruleset

- bug 6156: remove all refs to RCVD_IN_PSBL in logs where "reuse=no", replace with RCVD_IN_PSBL_2WEEKS to more accurately model "near-live" DNSBL lookups
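
A minimal sketch of that second item (hypothetical: it renames unconditionally, whereas a real pass would first check the log's "reuse" field, and it assumes rule names appear as a plain comma-separated field in the mass-check logs):

  # rename RCVD_IN_PSBL hits to RCVD_IN_PSBL_2WEEKS across the submitted
  # logs before the GA run; \b keeps RCVD_IN_PSBL_2WEEKS itself untouched,
  # and .bak copies are kept in case something goes wrong (GNU sed assumed)
  for f in submit/spam-*.log submit/ham-*.log; do
    sed -i.bak 's/\bRCVD_IN_PSBL\b/RCVD_IN_PSBL_2WEEKS/g' "$f"
  done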
Comment 39 Warren Togami 2009-09-22 13:38:15 UTC
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6156#c59
Here I noticed that RCVD_IN_PSBL was not firing at all in my mcsnapshot masschecks, but working just fine in nightly_mass_check given the same ./mass-check syntax.

http://wiki.apache.org/spamassassin/RescoreDetails
perl Makefile.PL < /dev/null
make

My mass-check box did not have gcc installed, so I wasn't doing the "make" step.  After I installed gcc and ran "make", RCVD_IN_PSBL began working in mcsnapshot.

rsync -vrz --delete \
     rsync://rsync.spamassassin.org/tagged_builds/nightly_mass_check .
I'm confused about this, because the nightly_mass_check that I obtain via rsync does not require "make".  RCVD_IN_PSBL works fine there.

Questions...

1) Does mass-check actually need gcc and "make" beforehand?
2) If so, why is nightly_mass_check working without it?
3) Is it a separate bug that mass-check succeeds but silently fails on some rules?
Comment 40 Warren Togami 2009-09-22 13:52:39 UTC
4) Are other people doing rescore masschecks uploading bogus logs due to this silent failure?
Comment 41 Justin Mason 2009-09-22 14:24:06 UTC
it's good practice to use "hit-frequencies" (http://wiki.apache.org/spamassassin/HitFrequencies) to examine your results and see if anything looks broken.
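
For example (a sketch; option letters per the stock masses/hit-frequencies script, log names illustrative):

  # from the masses/ directory: per-rule hit percentages over your logs;
  # eyeball rules that never fire, or that hit ham far more than spam
  ./hit-frequencies -x -p spam-wt-en1.log ham-wt-en1.log | less

  # quick check that a specific rule actually fired somewhere
  grep -c RCVD_IN_PSBL spam-wt-en1.log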
Comment 42 Daryl C. W. O'Shea 2009-09-22 20:12:25 UTC
I've uploaded my results, but they don't have bayes enabled.  Why, again, aren't we reusing bayes results?

I've kicked off another round with bayes enabled (my net-enabled check took 13.4 hours); I'm waiting on timing to see how long this one will take.  I may have to set up a SQL server on the cluster to do it in a reasonable amount of time.

In any case, I don't think we have enough message results contributed yet for a good scoreset.  We have way less than for 3.2.0, although from a larger number of contributors.  Is there any chance we might see results from Theo?

(In reply to comment #15)
> Should I bother to continue recruiting more masscheck participants after this
> rescore?

I would.  A larger number of people submitting from *clean* corpora will allow us to provide updated scores more often.  As it is, the scores I'm generating now (well, broken right now, but I'll fix it soon) swing quite a bit.  I suspect it's due to not enough submitters and not enough messages.


(In reply to comment #17)
> > scores for the base ruleset (non-sandbox rules) won't change otherwise, so
> > this is important. For nightly masschecks, the only scores affected will be
> > those of sandbox rules.  So only about 1/2 of the ruleset, I'd reckon.
> 
> I am curious, do you remember the original reason for this design decision?

I felt that we didn't have a large enough nightly/weekly corpus to reliably change all of the scores.  I could generate two versions of the scores... with and without locking the base set of scores.

> Might there be value in making the entire ruleset's scores affected by nightly
> masschecks?

I think we need a larger nightly/weekly corpus before we do this.

(In reply to comment #18)
> iirc, the risk is that a small set of corpora (e.g. a few people take a week
> off) could cause the entire ruleset to be skewed incorrectly.  This way at
> least only the most recent (sandbox) rules would be affected, so it's a bit
> safer.

Even when all of the regular contributors submitted their results, the corpus wasn't that large, so I didn't want to throw away the scores based on the much larger corpus we had for 3.2.0.

> It's also faster to generate the scores, but this isn't so much of an issue
> now, as our main machine is quite beefy...

I can do it either way... cycles wasn't an issue.

> There may have been other reasons, too, but I can't find the mails :(

I probably only sent one about the topic.  There are some terse comments in the commit messages for that code.

(In reply to comment #25)
> Daryl, is there a URL to your weekly scores?

Still a little broken on my end, but:

http://svn.apache.org/viewvc/spamassassin/trunk/rulesrc/scores/
Comment 43 Daryl C. W. O'Shea 2009-09-23 18:33:54 UTC
I've now uploaded results for my 962396 messages with bayes enabled.
Comment 44 Justin Mason 2009-09-24 04:06:23 UTC
I'm not going to be able to work on this until next week, if anyone feels the need to re-run parts of their mass-checks before then...
Comment 45 Warren Togami 2009-09-24 05:30:43 UTC
Justin, would you be able to set up a ruleqa URL sooner?  Would be nice to see how we're doing compared to the nightly.
Comment 46 Mark Martinec 2009-09-25 17:35:10 UTC
Created attachment 4541 [details]
freqs file on all submitted files for rescore mass-checks

> Justin, would you be able to set up a ruleqa URL sooner?
> Would be nice to see how we're doing compared to the nightly.

To give us something to chew on while we wait for the true runs,
here is the freqs.full file that I obtained while following the
RescoreMassCheck instructions from the Wiki, using all uploaded
files (from about 6 hours ago) in the submission directory,
including Daryl's.
Comment 47 Mark Martinec 2009-09-25 17:46:28 UTC
Created attachment 4542 [details]
resulting 'scores' file from a GA run

...and here is the resulting 'scores' file, obtained on scoreset 3
by running 'garescorer -f 0.003 -e 30000 -t 5.0' (through runGA).

Its header is:

# SUMMARY for threshold 5.0:
# Correctly non-spam: 274586  23.732%  (99.965% of non-spam corpus)
# Correctly spam:     877118  75.809%  (99.410% of spam corpus)
# False positives:        97  0.008%  (0.035% of nonspam,  23905 weighted)
# False negatives:      5204  0.450%  (0.590% of spam,  15482 weighted)
# Average score for spam:  26.2    nonspam: -1.6
# Average for false-pos:   7.7  false-neg: 3.0
# TOTAL:              1157005  100.00%

and the matching 'test' file is:

# SUMMARY for threshold 5.0:
# Correctly non-spam:  34321  99.93%
# Correctly spam:     109470  99.40%
# False positives:        23  0.07%
# False negatives:       662  0.60%
# TCR(l=50): 60.779249  SpamRecall: 99.399%  SpamPrec: 99.979%

Perhaps I pushed it too far with '-f 0.003'.
Comment 48 Mark Martinec 2009-09-25 17:49:22 UTC
P.S. Keep in mind that I've only been playing with the GA for the last two days,
after first gaining some experience by running it on my corpus only.
Take results with a large grain of salt.
Comment 49 Warren Togami 2009-09-25 20:36:34 UTC
I recruited an Italian participant for masscheck.  He's ready to upload logs for nightly masscheck and rescore masscheck.  He sent a request for an rsync account on September 11th, 2009 but did not hear back.  I'm uploading logs on his behalf soon.
Comment 50 Justin Mason 2009-09-28 13:22:49 UTC
(In reply to comment #49)
> I recruited an Italian participant for masscheck.  He's ready to upload logs
> for nightly masscheck and rescore masscheck.  He sent a request for an rsync
> account on September 11th, 2009 but did not hear back.  I'm uploading logs on
> his behalf soon.

what was his username?  I thought Mark created an acct for him, but could have confused him with someone else...
Comment 51 Warren Togami 2009-09-28 13:24:47 UTC
bernie or Bernardo, not sure which username he would have requested.

Are the ::submit and nightly ::corpus accounts the same thing now?
Comment 52 Mark Martinec 2009-09-28 15:45:03 UTC
> He sent a request for an rsync account on September 11th, 2009 but did not
> hear back.  I'm uploading logs on his behalf soon.
>
> what was his username?  I thought Mark created an acct for him, but could
> have confused him with someone else...

I did create rsync accounts for Bernie Innocenti <bernie@codewiz.org>
(binnocenti, 2009-09-15) and for Austin Henry (ahenry). Both received
my general reply as CC-ed to the private@spamassassin.apache.org ML,
plus a private mail with a password.

Bernie's MX host 83.149.158.210 accepted and confirmed both messages:

Sep 15 15:59:05 dorothy postfix/smtp[14113]: 328DD1D1C4B:
 to=<bernie@codewiz.org>, relay=mail.codewiz.org[83.149.158.210]:25,
 delay=4.6, delays=0/0/1.7/2.9, dsn=2.0.0, status=sent
 (250 ok 1253023145 qp 22364)

Sep 15 16:00:04 dorothy postfix/smtp[14113]: 69A7A1D1C68:
 to=<bernie@codewiz.org>, relay=mail.codewiz.org[83.149.158.210]:25,
 delay=2.6, delays=0/0/0.72/1.9, dsn=2.0.0, status=sent\
 (250 ok 1253023203 qp 22602)
Comment 53 Warren Togami 2009-09-28 20:28:17 UTC
Is there a way to individually delete files over rsync?  I need to delete the "bernie" log from the ::submit directory.  It seems that the rsync --delete option only applies if you are syncing entire directories.
Comment 54 Mark Martinec 2009-09-29 06:42:11 UTC
> Is there a way to individually delete files over rsync?  I need to delete the
> "bernie" log from the ::submit directory.  It seems that the rsync --delete
> option only applies if you are syncing entire directories.

I don't think rsync is able to delete a specific file.
Just upload an empty file in its place; then we can delete
the leftovers at some point.
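
Something like this (filename and module path are illustrative only):

  # truncate the unwanted log to zero length locally...
  : > ham-bayes-net-bernie.log
  # ...then upload it over the copy on the server
  rsync -v ham-bayes-net-bernie.log rsync://rsync.spamassassin.org/submit/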

> Are the ::submit and nightly ::corpus accounts the same thing now?

Yes, both rsync areas currently point to the same 'secrets file' in rsyncd.conf.
Comment 55 Justin Mason 2009-09-29 07:08:20 UTC
(In reply to comment #54)
> > Are the ::submit and nightly ::corpus accounts the same thing now?
> 
> Yes, both rsync areas currently point to the same 'secrets file' in
> rsyncd.conf.

However -- they are not the same place.  They are separate directories, allowing people to turn their nightlies back on without overwriting the results they've uploaded for the "rescore" mass-check.
Comment 56 Mark Martinec 2009-09-29 11:20:06 UTC
Here is the set of rules in 50_scores.cf that I ended up treating as fixed
(immutable) for the GA run (score set 3). Most of these are already documented
and labeled as such, but it doesn't hurt to post them here as a double-check.

The score in the comments of the BAYES rules is what a GA run on scoreset 3
gave me (all .log files in 'submit', except for Daryl's spam-bayes-net-dos,
of which I only took a random sample of 65000 entries, so as not to overwhelm
the remaining data). I manually reduced the BAYES_ scores a bit, as suggested
by a comment in 50_scores.cf referring to Bug 4505.

score ALL_TRUSTED -1.360 -1.440 -1.665 -1.800
score ANY_BOUNCE_MESSAGE 0.1
score BAYES_00 -2.8    # -2.935
score BAYES_05 -1.1    # -1.148
score BAYES_20 -0.9    # -2.020
score BAYES_40 -0.5    # -2.172
score BAYES_50  0.2    #  0.326
score BAYES_60  1.5    #  2.555
score BAYES_80  2.0    #  2.133
score BAYES_95  3.2    #  3.995
score BAYES_99  3.8    #  4.495
score BOUNCE_MESSAGE     0.1
score CHALLENGE_RESPONSE 0.1
score CRBOUNCE_MESSAGE   0.1
score DKIM_ADSP_CUSTOM_HIGH 0.001
score DKIM_ADSP_CUSTOM_LOW  0.001
score DKIM_ADSP_CUSTOM_MED  0.001
score DKIM_POLICY_SIGNALL  0
score DKIM_POLICY_SIGNSOME 0
score DKIM_POLICY_TESTING  0
score DKIM_SIGNED    0.1
score DKIM_VALID    -0.1
score DKIM_VALID_AU -0.1
score DKIM_VERIFIED  0
score EXTRA_MPART_TYPE 1.0
score GTUBE 1000.000
score NO_HEADERS_MESSAGE 0.001
score NO_RECEIVED   -0.001
score NO_RELAYS     -0.001
score RDNS_DYNAMIC   0.1
score RDNS_NONE      0.1
score SPF_HELO_PASS -0.001
score SPF_PASS      -0.001
score SUBJECT_IN_BLACKLIST 100
score SUBJECT_IN_WHITELIST -100
score UNPARSEABLE_RELAY 0.001
score USER_IN_ALL_SPAM_TO -100.000
score USER_IN_BLACKLIST 100.000
score USER_IN_BLACKLIST_TO 10.000
score USER_IN_DEF_DKIM_WL -7.500
score USER_IN_DEF_SPF_WL -7.500
score USER_IN_DEF_WHITELIST -15.000
score USER_IN_DKIM_WHITELIST -100.000
score USER_IN_MORE_SPAM_TO -20.000
score USER_IN_SPF_WHITELIST -100.000
score USER_IN_WHITELIST -100.000
score USER_IN_WHITELIST_TO -6.000
score VBOUNCE_MESSAGE 0.1

One observation on the DCC scores: the calculated score for DCC_CHECK
depends on whether one is using a licensed DCC server (providing
reputation data) or not. There is a significant overlap between
DCC_CHECK hits and DCC_REPUT_99_100, so the DCC_CHECK score should
be lower when reputation data is offered by a DCC server.

score DCC_CHECK  1.15  # no reputation data

score DCC_CHECK  0.835 # with reputation data

score DCC_REPUT_00_12  -0.9   # -0.001
score DCC_REPUT_13_19  -0.5   # -0.001
score DCC_REPUT_70_89   1.354
score DCC_REPUT_90_94   0.56
score DCC_REPUT_95_98   1.52
score DCC_REPUT_99_100  2.40

As the majority of installations probably won't be using a commercial
DCC server, it would probably be best to zero out the DCC_REPUT_*
scores for the GA run (so as to obtain a correct DCC_CHECK score).
Comment 57 Warren Togami 2009-09-29 12:41:07 UTC
When do the bb rescore masschecks begin?
Comment 58 Justin Mason 2009-09-29 13:46:06 UTC
dammit! I totally dropped the ball on that one. :(  I'll need to get that set up asap...
Comment 59 Justin Mason 2009-09-29 16:43:41 UTC
(In reply to comment #58)
> dammit! I totally dropped the ball on that one. :(  I'll need to get that set
> up asap...

ok, 5 EC2 nodes are now running mass-checks, one for each bb-* corpus; all should be complete by tomorrow morning. yay for elastic scaling ;)
Comment 60 Justin Mason 2009-09-30 07:53:26 UTC
and they're now uploaded.  Is that everyone?  Do we want to wait for any more?

Mark -- I'm on vacation for 2 weeks starting on Sunday.  Can you run the GA?
it looks like you've pretty much got it working, as far as I can tell.

I've also copied the current set of logs to ruleqa under the following date:
Tue Sep 30 09:00:00 UTC 2009 and rev: 808953.  That should show up at:
http://ruleqa.spamassassin.org/?daterev=20090930-r808953-n

mail counts (approximate, as these include header comments and too-old messages):

: 60...; wc -l submit/spam-*.log
    2061 submit/spam-bayes-net-ahenry.log
       6 submit/spam-bayes-net-bb-fredt.log
    1418 submit/spam-bayes-net-bb-guenther_fraud.log
    1846 submit/spam-bayes-net-bb-jhardin.log
    2200 submit/spam-bayes-net-bb-kmcgrail.log
    7191 submit/spam-bayes-net-bb-zmi.log
     638 submit/spam-bayes-net-binnocenti.log
   81271 submit/spam-bayes-net-bluestreak.log
  931869 submit/spam-bayes-net-dos.log
      98 submit/spam-bayes-net-hege-fi.log
   36948 submit/spam-bayes-net-hege.log
 1489714 submit/spam-bayes-net-jm.log
   23768 submit/spam-bayes-net-mmartinec.log
    6734 submit/spam-bayes-net-wt-en1.log
       9 submit/spam-bayes-net-wt-en2.log
       6 submit/spam-bayes-net-wt-en3.log
   19166 submit/spam-bayes-net-wt-en4.log
       6 submit/spam-bayes-net-wt-en5.log
       6 submit/spam-bayes-net-wt-en6.log
     126 submit/spam-bayes-net-wt-jp1.log
       6 submit/spam-bayes-net-wt-jp2.log
 2605087 total


: 61...; wc -l submit/ham-*.log
    2657 submit/ham-bayes-net-ahenry.log
     587 submit/ham-bayes-net-bb-fredt.log
       9 submit/ham-bayes-net-bb-guenther_fraud.log
    4307 submit/ham-bayes-net-bb-jhardin.log
       6 submit/ham-bayes-net-bb-kmcgrail.log
       6 submit/ham-bayes-net-bb-zmi.log
   10909 submit/ham-bayes-net-binnocenti.log
   87446 submit/ham-bayes-net-bluestreak.log
   30539 submit/ham-bayes-net-dos.log
  123556 submit/ham-bayes-net-hege-fi.log
   34804 submit/ham-bayes-net-hege.log
  353429 submit/ham-bayes-net-jm.log
   38913 submit/ham-bayes-net-mmartinec.log
    5705 submit/ham-bayes-net-wt-en1.log
    3003 submit/ham-bayes-net-wt-en2.log
    9906 submit/ham-bayes-net-wt-en3.log
       6 submit/ham-bayes-net-wt-en4.log
    5106 submit/ham-bayes-net-wt-en5.log
    2110 submit/ham-bayes-net-wt-en6.log
    1065 submit/ham-bayes-net-wt-jp1.log
    3619 submit/ham-bayes-net-wt-jp2.log
  717688 total

we could probably skip some of the spam.
Comment 61 John Hardin 2009-09-30 09:03:42 UTC
(In reply to comment #60)
> 
> I've also copied the current set of logs to ruleqa ...
>
> : 60...; wc -l submit/spam-*.log
>     1418 submit/spam-bayes-net-bb-guenther_fraud.log
>     1846 submit/spam-bayes-net-bb-jhardin.log
>     2200 submit/spam-bayes-net-bb-kmcgrail.log
> 
> : 61...; wc -l submit/ham-*.log
>        9 submit/ham-bayes-net-bb-guenther_fraud.log
>     4307 submit/ham-bayes-net-bb-jhardin.log
>        6 submit/ham-bayes-net-bb-kmcgrail.log

There should also be jhardin_fraud logs, should there not? I _am_ submitting daily corpora updates for sought_fraud, and those should be included just as guenther's are...
Comment 62 Karsten Bräckelmann 2009-09-30 12:11:23 UTC
(In reply to comment #60)
>        9 submit/ham-bayes-net-bb-guenther_fraud.log
                  ^^^
Please do *not* include my fraud ham corpus. It exclusively contains fake, artificial messages to exclude some German [1] from the fraud spam corpus. No real ham there.

My spam corpus of course is fine to include.

[1] Short, broken German paragraphs along the lines of "you may write in German,
    too", in an otherwise entirely English spam.
Comment 63 John Hardin 2009-09-30 14:35:22 UTC
(In reply to comment #62)
> (In reply to comment #60)
> >        9 submit/ham-bayes-net-bb-guenther_fraud.log
>                   ^^^
> Please do *not* include my fraud ham corpus. It exclusively contains fake,
> artificial messages to exclude some German [1] from the fraud spam corpus.

Same goes for my fraud ham corpus, except s/German/English/ (primarily free mail adverts and legal disclaimers).
Comment 64 Warren Togami 2009-09-30 14:44:35 UTC
http://ruleqa.spamassassin.org/20090930-r808953-n/RCVD_IN_PSBL/detail
It looks like all the ham is visible in the ruleqa, but only 86390 spam?
Comment 65 Justin Mason 2009-09-30 15:16:17 UTC
yep, that's not right :(  I've deleted the files, let's see if the backend rebuilds them correctly using all logs this time.
Comment 66 Daryl C. W. O'Shea 2009-09-30 18:53:50 UTC
(In reply to comment #60)
> we could probably skip some of the spam.

If you feel that it's detrimental to include that much, sure.  I'd start with dropping from your and my corpora.  I've got spam up to 60 days old in my corpus.  I'd include everyone else's spam and thin ours out rather than just doing a straight drop-by-date.

If it's solely a processing time concern, I'd say it's a non-issue as the GA doesn't take that long to run.  I know the nightly ones (about half as much mail) take around 30 minutes on the ancient machine I've got it running on.
Comment 67 Warren Togami 2009-09-30 19:04:34 UTC
http://ruleqa.spamassassin.org/20090930-r808953-n
Was that re-run?  The same total number of spam: 86390
Comment 68 Justin Mason 2009-10-01 05:50:15 UTC
(In reply to comment #67)
> http://ruleqa.spamassassin.org/20090930-r808953-n
> Was that re-run?  The same total number of spam: 86390

it took a little time, but it appears to have corrected itself now.   I think there's a race condition to do with the way logs are rsynced from spamassassin.zones to spamassassin2.zones. :(
Comment 69 Warren Togami 2009-10-05 20:00:00 UTC
Hey Mark, is the GA run happening while jm is away?
Comment 70 Mark Martinec 2009-10-06 03:46:36 UTC
> Hey Mark, is the GA run happening while jm is away?

Yes, it is underway just now. I needed to figure out how to set up the
mpich2 message-passing environment, but I think I have it working now.

I will be asking contributors to check some apparent FP and FN in their
logs soon...
Comment 71 Warren Togami 2009-10-06 07:08:46 UTC
> I will be asking contributors to check some apparent FP and FN in their
> logs soon...

The longer you wait, the more of the log IDs will no longer match the mailboxes.

BTW, did you do the things written in Comment #38?

So scoring PSBL might be more complicated than this.

 * RCVD_IN_PSBL_2WEEKS was never meant to be published as a run-time rule.  It is valuable in measuring PSBL in masschecks.
 * It seems that PSBL is not set to allow reuse?
 * PSBL as measured in the rescore masscheck was deep parsing, while we subsequently agreed to change it to lastexternal.

What should we do?
Comment 72 Mark Martinec 2009-10-06 12:33:09 UTC
> The longer you wait, the more of the log IDs will no longer match the mailboxes.

The messages whose results are submitted to rescoring are supposed to be preserved,
at least until the rescoring runs are done.
 
> BTW, did you do the things written in Comment #38?

Not yet, will do in my next iteration. It takes a couple of hours.
The JM_SOUGHT results I kept on purpose for now, wondering what their
scores would be. On the next round I can just force them to zero;
I believe this is equivalent to removing them from the logs.
In the first round I got:
  score JM_SOUGHT_FRAUD_1 2.105
  score JM_SOUGHT_FRAUD_2 2.318
  score JM_SOUGHT_FRAUD_3 3.270

> So scoring PSBL might be more complicated than this.
> 
>  * RCVD_IN_PSBL_2WEEKS was never meant to be published as a run-time rule.  It
> is valuable in measuring PSBL in masschecks.
>  * It seems that PSBL is not set to allow reuse?
>  * PSBL as measured in the rescore masscheck was deep parsing, while we
> subsequently agreed to change it to lastexternal.

I have now done the translations from Comment #38 on the RCVD_IN_PSBL* rules;
they will go into the next approximation.

> What should we do?

There seem to be some other rules in the works, so I'd say let's just finish
up whatever was frozen with the call for rescoring results, publish that as beta-1,
then examine what we got, polish it, and do another rescoring run before the
final release. It's not too bad to just fix some scores manually; we're doing
it also for BAYES, SPF, etc.

==========

Here now is the first piece of homework: the following were reported as false
positives in my last completed attempt. Please check if these are really ham messages
(I already checked my two entries, and they are):

ham-bayes-net-hege.log
  /data/sa/h/3/36f18b49dd8ce2ce70586c67eeb780fd
  /data/sa/h/0/0270ee166042abd0aa94cbdda855400c
  /data/sa/h/9/9eb11730050002add51ecdc6ed25343d
  /data/sa/h/5/5dfa06864bb3021674768e8af372a6c9
  /data/sa/h/4/4214ade1e7e177f0453c5f1cc98c8b42

ham-bayes-net-bluestreak.log
  ../../aaa_ham/2009-07_HAM_721117.0
  ../../aaa_ham/2009-06_HAM_602375.0
  ../../aaa_ham/2009-06_HAM_609153.0
  ../../aaa_ham/2009-06_HAM_623012.0
  ../../aaa_ham/2009-06_HAM_622736.0
  ../../aaa_ham/2009-08_HAM_814010.0

ham-bayes-net-dos.log
  /home/dos/SA-corpus/ham/leah/
    INBOX-Inbox-2007/1195695047.P9700Q22.dilbert.dostech.net:2,S
  /home/dos/SA-corpus/ham/leah/
    INBOX-Inbox-2007/1196258008.P18803Q16.dilbert.dostech.net:2,S
  /home/dos/SA-corpus/ham/leah/
    INBOX-Inbox-2007/1199765108.P20983Q90.dilbert.dostech.net:2,RS

ham-bayes-net-jm.log
  /local/cor/recent/ham/priv.radish.jmason.org.200808310000.mbox.160968
  /local/cor/recent/ham/priv.wall.200809081400.mbox.1677188
  /local/cor/recent/ham/priv.20050914/126599

ham-bayes-net-mmartinec.log
  ham/uYUQM2RmF9I0
  ham/p+KSEyzZTPOw
Comment 73 Mark Martinec 2009-10-06 15:57:29 UTC
> Please check if these are really ham messages

and four more from the second run:

../../aaa_ham/2009-07_HAM_704334.0

../../aaa_ham/2009-08_HAM_810051.0

/local/cor/recent/ham/priv.20050914/137533

/home/dos/SA-corpus/ham/dos/Inbox-2008/
  1221834769.M749008P21562V0000000000000302I00414902_237.cyan.dostech.net,S=26243:2,S


Also, I find the scores on URIBL_(AB|JP|WS)_SURBL to be rather low compared
to my experience (e.g. one FP out of 39,000 on URIBL_WS_SURBL in my
ham-bayes-net-mmartinec.log), so my guess is that several of the
following hits could be false positives on these rules:

grep -c 'URIBL_WS_SURBL' ham-bayes-net-jm.log
178

grep -c 'URIBL_AB_SURBL' ham-bayes-net-jm.log
42

grep -c 'URIBL_JP_SURBL' ham-bayes-net-jm.log
29

grep -c 'URIBL_JP_SURBL' ham-bayes-net-bluestreak.log
28

egrep -c 'URIBL_(AB|JP|WS)_SURBL' ham-bayes-net-hege.log
7

grep -c 'URIBL_WS_SURBL' ham-bayes-net-dos.log
4
Comment 74 Daryl C. W. O'Shea 2009-10-06 18:06:41 UTC
/home/dos/SA-corpus/ham/dos/Domains/1195543943.M277151P27837V0000000000000302I00154082_16.cyan.dostech.net\,S\=6338\:2\,S

...is an abuse report that contains an abused domain.  I'd rm it from the logs.  I have removed it from my corpus.

/home/dos/SA-corpus/ham/leah/INBOX-Inbox-2007/1195695047.P9700Q22.dilbert.dostech.net:2,S

...is ham.  A user recommends somebody locally who, I guess, has spammed their domain.  I've left this in my corpus.

/home/dos/SA-corpus/ham/dos/infra-list/1204046401.M43776P15497V0000000000000302I0000C20E_0.cyan.dostech.net,S=5621:2,S

...abuse report.  I'd rm it from the logs.  I have removed it from my corpus.

/home/dos/SA-corpus/ham/dos/infra-list/1253117012.M352778P19949V0000000000000302I008D1494_70.cyan.dostech.net,S=2683:2,

...abuse report.  I'd rm it from the logs.  I have removed it from my corpus.
Comment 75 Warren Togami 2009-10-06 18:30:19 UTC
Might we consider assigning different confidence weights to ham corpora?

For example, my ham corpora are relatively small in number, but I have strong confidence that they are thoroughly cleaned.  Furthermore, they are extremely varied in source and likely to be different from other masscheck participants'.  I have also filtered out all discussion mailing lists and automated report sources.

For example, I would assign the following weights to my ham corpora:
wt-en1: x2.5
wt-en2: x2
wt-en3: x1.5
wt-en5: x2
wt-en6: x1
wt-jp1: x2.5
wt-jp2: x1.5

Anyhow, just an idea.  Not sure if this is helpful.
Comment 76 Henrik Krohns 2009-10-06 23:20:34 UTC
I cleaned up my few FPs and some other stuff, new logs sent..

Talking about weights, does anyone have an academic answer on how results are affected when some corpora are uniqued (at least mine is) and some are not?
Comment 77 Warren Togami 2009-10-07 07:13:03 UTC
Never mind about the weights idea.
Comment 78 Mark Martinec 2009-10-07 09:56:41 UTC
> I cleaned up my few FPs and some other stuff, new logs sent..

Thanks to Daryl and Henrik. I'm still waiting for the bluestreak logs, but
meanwhile I'm running garescorer on what I have (including the recent updates).

Btw, Daryl, you haven't commented on:

/home/dos/SA-corpus/ham/leah/
  INBOX-Inbox-2007/1196258008.P18803Q16.dilbert.dostech.net:2,S

/home/dos/SA-corpus/ham/leah/
  INBOX-Inbox-2007/1199765108.P20983Q90.dilbert.dostech.net:2,RS

/home/dos/SA-corpus/ham/dos/
  Inbox-2008/1221834769.M749008P21562V0000000000000302I00414902_237.\
  cyan.dostech.net,S=26243:2,S


> Talking about weights, does anyone have an academic answer on how results are
> affected when some corpuses are uniqued (atleast mine is) and some are not?

Don't know. I removed exact duplicates of mail bodies from my corpus, although
due to 'personalized' spam, which is becoming prevalent nowadays thanks to the
free CPU resources on botnets, there are still plenty of very similar yet
different messages left in the corpus. I did some manual removal of these,
but it is very impractical to be thorough.


> Might we consider assigning different confidence weights to ham corpora?
>
> For example, my ham corpora are relatively small in number, but I have strong
> confidence that they are thoroughly cleaned.  Furthermore, they are extremely
> varied in source and likely to be different from other masscheck participants'.
> I have also filtered out all discussion mailing lists and automated report

I do recognize that corpora are quite different in several respects, although
I don't know how one could weight them more fairly and incorporate that into
the current procedure.

Let me just document here what I'm doing now with a local copy of all
submitted logs.

Due to a significant disproportion in the size of spam-bayes-net-dos.log
and spam-bayes-net-jm.log compared to the rest, I'm taking a random sample
of each of these files, restricted to scoreset 3 and age below 6 months,
decimated to 150,000 entries each (I initially used 100,000, but have now
bumped it up).
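
One way to do that kind of decimation (a sketch, assuming GNU coreutils; the scoreset/age filtering is omitted here and the filenames are illustrative):

  # keep the "#" comment headers aside, randomly sample 150000 hit lines,
  # then recombine into a smaller log
  grep '^#' spam-bayes-net-dos.log > header.tmp
  grep -v '^#' spam-bayes-net-dos.log | shuf -n 150000 > sample.tmp
  cat header.tmp sample.tmp > spam-bayes-net-dos.sampled.log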

There are some spam log entries older than 6 months in the other spam logs
too, but not too many (mostly in the 'hege' collection); as these seem to be
mainly hand-selected fraud samples, I'm keeping them regardless of age.

Due to a shortage of ham, I'm keeping it all regardless of age. This mainly
goes for JM's ham collection, which contains some (smaller) share of
older ham; the remaining collections are fairly recent.

There are no scoreset 0 and 2 entries in any of the logs. So for the
scoreset 3 and 2 runs I'm using a selection from the logs with 'set=3'.
For the scoreset 0 and 1 runs I'm using all entries (set=1 and set=3).

This all amounts to the following 'wc -l' counts:

  463957 ham-full-set1.log
  483402 spam-full-set1.log

  293637 ham-full-set3.log
  443635 spam-full-set3.log

This seems reasonably fair and balanced to me.
Comment 79 Mark Martinec 2009-10-07 10:30:25 UTC
> There are some spam log entries older than 6 months in the other spam logs
> too, but not too many (mostly in the 'hege' collection); as these seem to be
> mainly hand-selected fraud samples, I'm keeping them regardless of age.

Oops, wrong id:  s/hege/jhardin/
Comment 80 Mark Martinec 2009-10-07 10:59:07 UTC
The following also looks fishy:

grep -c DKIM_ADSP_DISCARD ham*.log

  ham-bayes-net-bb-fredt.log    21
  ham-bayes-net-bb-jhardin.log  22
  ham-bayes-net-bluestreak.log  36
  ham-bayes-net-hege.log        43
  ham-bayes-net-wt-en6.log      35
  ham-bayes-net-mmartinec.log    1
  ham-bayes-net-dos.log         25
  ham-bayes-net-jm.log          65

(the one entry in my collection is due to the author posting
through a mailing list, despite the fact that his domain publishes
a 'discardable' policy; so, a sender's mistake)
Comment 81 Warren Togami 2009-10-07 12:37:09 UTC
(In reply to comment #80)
> The following also looks fishy:
> 
> grep -c DKIM_ADSP_DISCARD ham*.log
> 
>   ham-bayes-net-wt-en6.log      35
> 
> (the one entry in my collection is due to the author posting
> through a mailing list, despite the fact that his domain publishes
> a 'discardable' policy; so, a sender's mistake)

These are all legitimate looking paypal mail delivered to a Yahoo account from mid-2008 through recently.

What is DKIM_ADSP_DISCARD supposed to mean?
Comment 82 Mark Martinec 2009-10-07 15:56:36 UTC
(In reply to comment #81)
> > The following also looks fishy: 
> > grep -c DKIM_ADSP_DISCARD ham*.log
> >   ham-bayes-net-wt-en6.log      35
> 
> These are all legitimate looking paypal mail delivered to a Yahoo account
> from mid-2008 through recently.

I'm not sure how long paypal has been signing their mail. They were certainly
signing it with DomainKeys signatures in 2006, and with DKIM in 2008.
So for very old ham mail from paypal (or ebay) it is quite possible the
signature is missing or somehow broken or unverifiable, but this shouldn't
be the case for current mail from these domains.
 
> What is DKIM_ADSP_DISCARD supposed to mean?

It means two things:
- that the message does not have a valid author's domain DKIM or DomainKeys
  signature (e.g. there is no signature at all, or that the signature does
  not match the mail contents, or that it does not match the domain name
  in the From header field);
- and that the domain claims that any mail claiming to be from that domain
  and failing signature verification should be discarded. This claim
  is made by publishing a DNS record (RFC 5617), or through the 'adsp_override'
  configuration directive in SpamAssassin's .cf file.
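
For reference, the local form of such a claim is a one-line .cf entry
(a minimal sketch; the domain is just an example):

  # treat mail claiming to be from this domain, but lacking a valid
  # signature, as discardable
  adsp_override paypal.com discardable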

So, if your mail samples are younger than a year, do have a
DKIM-Signature in the header, and appear to be genuine, the only
explanation for a failed signature verification is that the message got
somehow corrupted or transformed on its way to SpamAssassin, in such a way
that the signature no longer matches the mail contents, or that SA could
not fetch the domain's public key, perhaps due to a DNS resolver failure
or some firewall trouble.

Depending on where and how SpamAssassin is called from your mail
delivery system, and how you collected your samples (e.g. from an MTA,
from a mailbox, from some kind of quarantine), there are different
possible reasons for mail corruption. For example, saving a mail message's
source from some MUA (e.g. kmail) can rewrite/reformat some header fields.
Running some virus scanner in the mail path may add its verdict to the
mail body. Fetching it from some POP3 server or even from a webmail service
offers its own challenges to mail integrity. In some cases even a
'friendly' MTA thinks it is doing a favour by rewriting some header
fields, perhaps in the belief that they would look 'prettier'.

One way to find out is to describe the path the mail takes through
your infrastructure (firewall, MTA, virus scanners, mailbox server)
before it reaches SpamAssassin, and to carefully examine one or two
such mail samples. If you are able to, you may mail me some samples,
preferably as a gzip or tar.gz attachment, to make sure they won't get
transformed in transit.
Comment 83 Henrik Krohns 2009-10-08 01:02:43 UTC
Cleaned up my DKIM_ADSP_DISCARD hits (old 2005 ebay mails removed) and some other old stuff, logs sent..
Comment 84 Mark Martinec 2009-10-08 06:50:37 UTC
> These are all legitimate looking paypal mail delivered to a Yahoo account from
> mid-2008 through recently.

Thanks Warren for your out-of-band mail. Apart from some general comments
from my previous posting, there is a real problem regarding your method of
fetching mail for a Yahoo account. You are using FetchYahoo to download
these messages from the Yahoo webmail interface. FetchYahoo has to jump
through hoops to retrieve a message as close to its original form as
possible, but there are some real obstacles there. Glancing at its source
code, it has to pull attachments separately and splice them back together
into a message, necessarily reinventing the MIME boundaries. This is enough
to render DomainKeys and DKIM signatures invalid. Apart from this, it also
converts QP- and base64-encoded messages into UTF-8 binary, which again is
a sufficient reason for signature breakage. Moreover, it has to repair some
damage to header field folding and empty lines, which is caused either by
bugs in Yahoo HTML rendering (indicated by comments in the FetchYahoo code),
or because details are simply lost in the conversion to HTML and back to mail.

This method of fetching mail is bound to cause trouble. It may quite easily
cause some other low-level SpamAssassin rules to misfire or fail to trigger,
not just the signature verifications.
Comment 85 Warren Togami 2009-10-08 10:15:55 UTC
I guess we have no choice but to drop wt-en6 from the rescore GA.

Should I drop it from nightly masscheck as well?
Comment 86 Mark Martinec 2009-10-08 10:37:23 UTC
> I guess we have no choice but to drop wt-en6 from the rescore GA.
> Should I drop it from nightly masscheck as well?

I can imagine such a problem could also affect other users, especially
those not running SpamAssassin close to their MTA. I guess we can keep
the wt-en6 corpus (and similar ones, if identified), but keep in mind that FP
hits on DKIM_ADSP_DISCARD (and possibly on some other rules, if identified)
should be disregarded. I have already removed the "DKIM_ADSP_DISCARD" hits
from my copy of the wt-en6 log.
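
What I did amounts to roughly this (a sketch; it assumes the rule names sit
in a single comma-separated field of each log line):

  # drop the DKIM_ADSP_DISCARD hit from the wt-en6 ham log, whether it
  # appears at the start, middle or end of the rule list (GNU sed assumed)
  sed -i.bak -e 's/\bDKIM_ADSP_DISCARD,//' -e 's/,DKIM_ADSP_DISCARD\b//' \
      ham-bayes-net-wt-en6.log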

If it turns out that undesired mail modifications are more common
in submitted corpora, we could perhaps re-run the GA on a subset
of logs known not to suffer from the problem, and just take
the DKIM_* scores from the results of that run.

The release notes could then say that one should lower the DKIM_ADSP_*
scores on installations where it is known that mail is not reaching
SpamAssassin in its pristine form (as received by the MTA).
Comment 87 Warren Togami 2009-10-08 13:51:31 UTC
(In reply to comment #86)
> The release notes could then say that one should lower the DKIM_ADSP_*
> scores on installations where it is known that mail is not reaching
> SpamAssassin in its pristine form (as received by the MTA).

This case of old ham where the sender subsequently changed their DKIM policy is only an issue for masscheck, not production scanning.  Lowering the DKIM scores makes no sense then?
Comment 88 Mark Martinec 2009-10-09 06:23:06 UTC
> > The release notes could then say that one should lower the DKIM_ADSP_*
> > scores on installations where it is known that mail is not reaching
> > SpamAssassin in its pristine form (as received by the MTA).
> 
> This case, like old ham where the sender subsequently changed their DKIM policy,
> is only an issue for masscheck, not production scanning.

True for the case of old ham where the sender subsequently changed their DKIM policy,
or for the case of expired signatures - these are only an issue with masscheck.

...but not for the case of wt-en6, where mail is transformed by its path through
webmail. That is an issue for masschecks and production runs alike.

> Lowering the DKIM scores makes no sense then?

If one knows that mail reaching SpamAssassin will be modified along its mail
path, then one must disable rules that target mail forgery and depend on
pristine mail, such as the DKIM_ADSP_DISCARD rule. Otherwise the rule will
generate FP score points for legitimate mail from domains publishing ADSP
(explicitly or through overrides).
Comment 89 Mark Martinec 2009-10-09 06:38:09 UTC
Created attachment 4550
resulting 50_scores.cf from garescorer runs

Ok, here it is at last: the auto-generated 50_scores.cf from garescorer runs
on all four sets, with no hand-tweaking of results (yet) ... to give us
something to digest and comment on, and to serve as a first approximation.
Some values are surprising or plain wrong; I'll comment on some later.

I used the submitted logs (tweaked as per Comment 78), with all the recent
updates to them as posted so far in this ticket. I left the BAYES scores
fully floating. I fixed at zero the DCC_REPUT_* scores and JM_SOUGHT_FRAUD_*,
as was discussed previously (as can be seen by the end of the attached file).
Eventually these will need to be set to some manually determined score.
Comment 90 Mark Martinec 2009-10-09 06:49:27 UTC
To assess the quality and repeatability of the results, here are the summaries
for all four score sets; each pair consists of a normal run on 90% of the log
entries and a test run on the remaining 10%.

The most interesting figures are the FP and FN percentages, e.g. the 0.028%
and 0.961% in this clipping:
  # False positives:     65  0.011%  (0.028% of nonspam,  10580 weighted)
  # False negatives:   3411  0.578%  (0.961% of spam,  12054 weighted)


==========================================
gen-set0-5-5.0-25000-ga
SCORESET 0 : (no net, not bayes)

test (10%):
# SUMMARY for threshold 5.0:
# Correctly non-spam:  45335  98.03%
# Correctly spam:      39320  81.61%
# False positives:       913  1.97%
# False negatives:      8860  18.39%
# TCR(l=50): 0.883875  SpamRecall: 81.611%  SpamPrec: 97.731%

scores (90%):
# SUMMARY for threshold 5.0:
# Correctly non-spam: 365397  48.193%  (98.401% of non-spam corpus)
# Correctly spam:     314466  41.476%  (81.286% of spam corpus)
# False positives:      5936  0.783%  (1.599% of nonspam, 173347 weighted)
# False negatives:     72396  9.548%  (18.714% of spam, 226867 weighted)
# Average score for spam:  10.0    nonspam: 1.4
# Average for false-pos:   5.6  false-neg: 3.1
# TOTAL:              758195  100.00%

==========================================
gen-set1-10-5.0-30000-ga
SCORESET 1: (net, no bayes)

test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  46183  99.86%
# Correctly spam:      46648  96.82%
# False positives:        65  0.14%
# False negatives:      1532  3.18%
# TCR(l=50): 10.075282  SpamRecall: 96.820%  SpamPrec: 99.861%

scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 370804  48.906%  (99.858% of non-spam corpus)
# Correctly spam:     374579  49.404%  (96.825% of spam corpus)
# False positives:       529  0.070%  (0.142% of nonspam,  31804 weighted)
# False negatives:     12283  1.620%  (3.175% of spam,  39385 weighted)
# Average score for spam:  17.4    nonspam: 0.4
# Average for false-pos:   5.8  false-neg: 3.2
# TOTAL:              758195  100.00%


==========================================
gen-set2-10-5.0-30000-ga
SCORESET 2: (no net, bayes)

test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  29308  99.78%
# Correctly spam:      42344  95.69%
# False positives:        64  0.22%
# False negatives:      1907  4.31%
# TCR(l=50): 8.664774  SpamRecall: 95.690%  SpamPrec: 99.849%

scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 234375  39.745%  (99.864% of non-spam corpus)
# Correctly spam:     339736  57.612%  (95.700% of spam corpus)
# False positives:       320  0.054%  (0.136% of nonspam,  26164 weighted)
# False negatives:     15265  2.589%  (4.300% of spam,  58794 weighted)
# Average score for spam:  10.4    nonspam: 0.6
# Average for false-pos:   5.4  false-neg: 3.9
# TOTAL:              589696  100.00%


==========================================
gen-set3-20-5.0-20000-ga
SCORESET 3: (net, bayes)

test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  29342  99.90%
# Correctly spam:      43843  99.08%
# False positives:        30  0.10%
# False negatives:       408  0.92%
# TCR(l=50): 23.192348  SpamRecall: 99.078%  SpamPrec: 99.932%

scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 234630  39.788%  (99.972% of non-spam corpus)
# Correctly spam:     351590  59.622%  (99.039% of spam corpus)
# False positives:        65  0.011%  (0.028% of nonspam,  10580 weighted)
# False negatives:      3411  0.578%  (0.961% of spam,  12054 weighted)
# Average score for spam:  18.5    nonspam: -0.1
# Average for false-pos:   5.4  false-neg: 3.5
# TOTAL:              589696  100.00%
Comment 91 Mark Martinec 2009-10-09 06:53:51 UTC
As can be seen from the above, score set 0 (no net tests, no bayes) is pretty
much useless nowadays. Score sets 1 and 2 come out close to each other, i.e.
net tests are worth about as much as bayes. Of course the combination of both
(set 3) is the outstanding winner.
Comment 92 Warren Togami 2009-10-09 20:22:24 UTC
(In reply to comment #89)
> Created an attachment (id=4550)
> resulting 50_scores.cf from garescorer runs
> 
> Ok, here it is at last: the auto-generated 50_scores.cf from garescorer runs
> on all four sets, with no hand-tweaking of results (yet) ... to give us
> something to digest and comment on, and to serve as a first approximation.
> Some values are surprising or plain wrong; I'll comment on some later.

Bug #6156 RCVD_IN_PSBL
We should manually adjust this score to somewhere between 2.0 and 2.5, for these reasons:

* The rescore masschecks were run with deep parsing.  We have subsequently changed it to lastexternal, which should be much safer.  Even with deep parsing it proved to be very good.
* At the time of the rescore masschecks, PSBL's recent whitelist filtering of gmail, yahoo, rr.com and several other major ISPs had not yet timed out legitimate MTAs.  Safety should be further improved now.
Comment 93 Warren Togami 2009-10-11 00:01:01 UTC
Bad news.  Please remove the binnocenti logs from the rescore masschecks.  Working with him we discovered 50+ additional spams in his ham folders, and there are certainly more.  Furthermore, his ham contains lots of automated low-quality sources like Bugzilla, trac, cron and log-monitoring daemons, which should probably be removed from ham corpora.  The incorrect FPs and bias introduced by this corpus may be large enough to throw off the scoring.

Did you also remove wt-en6 after we discovered that copying mail from a Yahoo account corrupts the messages?
Comment 94 Matthias Leisi 2009-10-11 02:19:21 UTC
(In reply to comment #56)
> Here is a set of rules in 50_scores.cf that I ended up as fixed (immutable)
> for the GA run (score set 3). Most of these are already documented and labeled
> as such, but it doesn't hurt to post it here as a double-check.

I suspect that RCVD_IN_DNSWL_* should be immutable as well; in generated scores, there are counter-intuitive scores assigned (expected _HI < _MED < _LOW, observed _MED << _HI < _LOW). 

https://svn.apache.org/repos/asf/spamassassin/trunk/rules/50_scores.cf has the following outside the "gen:mutable" section:

| score RCVD_IN_DNSWL_LOW 0 -1 0 -1
| score RCVD_IN_DNSWL_MED 0 -4 0 -4
| score RCVD_IN_DNSWL_HI 0 -8 0 -8

The DNSWL stats posted by Warren to the users list seem to indicate that this should be the correct ordering (at least based on safety):

| SPAM%   HAM%    RANK RULE
| 0.0016% 4.2489% 0.91 RCVD_IN_DNSWL_HI
| 0.0281% 6.9639% 0.90 RCVD_IN_DNSWL_MED
| 0.1147% 3.9169% 0.81 RCVD_IN_DNSWL_LOW
Comment 95 Warren Togami 2009-10-11 07:03:21 UTC
(In reply to comment #94)
> The DNSWL stats posted by Warren to the users list seem to indicate that this
> should be the correct ordering (at least based on safety):
> 
> | SPAM%   HAM%    RANK RULE
> | 0.0016% 4.2489% 0.91 RCVD_IN_DNSWL_HI
> | 0.0281% 6.9639% 0.90 RCVD_IN_DNSWL_MED
> | 0.1147% 3.9169% 0.81 RCVD_IN_DNSWL_LOW

These were yesterday's weekly results, not the rescore masscheck.  Weekly results have a smaller sample size and lower confidence.

http://ruleqa.spamassassin.org/20090930-r808953-n

SPAM%   HAM%     RANK RULE
0.0002% 0.3651%  0.75 RCVD_IN_DNSWL_HI
0.0288% 18.6970% 0.79 RCVD_IN_DNSWL_MED
0.0753% 8.1433%  0.68 RCVD_IN_DNSWL_LOW

This was the rescore masscheck.
Comment 96 Mark Martinec 2009-10-14 16:21:44 UTC
Created attachment 4553 [details]
resulting 50_scores.cf from garescorer runs - V2

Here is now a 50_scores.cf from my second attempt after cleaning some
logs: removed binnocenti and wt-en6 logs as per Comment 93, removed
DKIM_ADSP_DISCARD hits from ham-bayes-net-bluestreak.log. I have also
limited the log entries to fewer months following the RescoreMassCheck
(wiki): -m 6 for spam, and -m 25 for ham (after the 25th month there is a
large gap in the data until the next peak, too far in the past).

This leaves us with the following number of entries in merged logs:
score set 1 (no data from score set 3), provides data for set0 and set1:
  360070 ham-full-set1.log
  472682 spam-full-set1.log
score set 3, provides data for set2 and set3:
  210603 ham-full-set3.log
  442709 spam-full-set3.log

For DCC_ rules, I took the DCC_CHECK value of 1.1 from a preliminary run
which had all the DCC_REPUT_* scores fixed at 0, then for the next run
I fixed the DCC_CHECK, but left the DCC_REPUT_* scores floating. This
should cope with both types of sites: those with a commercial license
that do receive reputation results from DCC servers, and those that don't.
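
In 50_scores.cf terms, the two passes look roughly like this (a sketch; the
usual four-scoreset layout for net rules is assumed, and only two of the
DCC_REPUT_* rules are shown):

  # pass 1: pin the reputation rules at 0 and let DCC_CHECK float;
  # this run yielded DCC_CHECK = 1.1
  score DCC_REPUT_95_98  0
  score DCC_REPUT_99_100 0
  # ... likewise for the remaining DCC_REPUT_* rules

  # pass 2: pin DCC_CHECK at the pass-1 value, let DCC_REPUT_* float
  score DCC_CHECK 0 1.1 0 1.1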
Comment 97 Mark Martinec 2009-10-14 16:29:29 UTC
gen-set0-5-5.0-10000-ga
test (10%)
# SUMMARY for threshold 5.0:
# Correctly non-spam:  35461  98.50%
# Correctly spam:      38357  81.35%
# False positives:       541  1.50%
# False negatives:      8794  18.65%
# TCR(l=50): 1.315450  SpamRecall: 81.349%  SpamPrec: 98.609%
scores (90%):
# SUMMARY for threshold 5.0:
# Correctly non-spam: 283119  42.494%  (98.304% of non-spam corpus)
# Correctly spam:     306367  45.984%  (80.997% of spam corpus)
# False positives:      4886  0.733%  (1.696% of nonspam, 179777 weighted)
# False negatives:     71879  10.789%  (19.003% of spam, 231331 weighted)
# Average score for spam:  10.4    nonspam: 1.7
# Average for false-pos:   5.6  false-neg: 3.2
# TOTAL:              666251  100.00%

gen-set1-10-5.0-10000-ga
test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  35942  99.83%
# Correctly spam:      45983  97.52%
# False positives:        60  0.17%
# False negatives:      1168  2.48%
# TCR(l=50): 11.312620  SpamRecall: 97.523%  SpamPrec: 99.870%
scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 287639  43.173%  (99.873% of non-spam corpus)
# Correctly spam:     368783  55.352%  (97.498% of spam corpus)
# False positives:       366  0.055%  (0.127% of nonspam,  27040 weighted)
# False negatives:      9463  1.420%  (2.502% of spam,  29645 weighted)
# Average score for spam:  20.3    nonspam: 0.2
# Average for false-pos:   5.6  false-neg: 3.1
# TOTAL:              666251  100.00%

gen-set2-10-5.0-10000-ga
test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  35949  99.85%
# Correctly spam:      44538  94.46%
# False positives:        53  0.15%
# False negatives:      2613  5.54%
# TCR(l=50): 8.958959  SpamRecall: 94.458%  SpamPrec: 99.881%
scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 287557  43.160%  (99.844% of non-spam corpus)
# Correctly spam:     357656  53.682%  (94.556% of spam corpus)
# False positives:       448  0.067%  (0.156% of nonspam,  33456 weighted)
# False negatives:     20590  3.090%  (5.444% of spam,  73371 weighted)
# Average score for spam:  12.3    nonspam: 0.8
# Average for false-pos:   5.7  false-neg: 3.6
# TOTAL:              666251  100.00%

gen-set3-20-5.0-10000-ga
test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  21173  99.92%
# Correctly spam:      43749  99.08%
# False positives:        17  0.08%
# False negatives:       404  0.92%
# TCR(l=50): 35.209729  SpamRecall: 99.085%  SpamPrec: 99.961%
scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 168159  32.186%  (99.976% of non-spam corpus)
# Correctly spam:     350875  67.159%  (99.046% of spam corpus)
# False positives:        40  0.008%  (0.024% of nonspam,   9039 weighted)
# False negatives:      3379  0.647%  (0.954% of spam,  11476 weighted)
# Average score for spam:  19.3    nonspam: -0.8
# Average for false-pos:   5.4  false-neg: 3.4
# TOTAL:              522453  100.00%

===========
In summary, the essential data:

score set 0 (no net, no bayes):
# False positives:      4886  0.733%  (1.696% of nonspam, 179777 weighted)
# False negatives:     71879  10.789%  (19.003% of spam, 231331 weighted)

score set 1 (net, no bayes):
# False positives:       366  0.055%  (0.127% of nonspam,  27040 weighted)
# False negatives:      9463  1.420%  (2.502% of spam,  29645 weighted)

score set 2 (no net, bayes):
# False positives:       448  0.067%  (0.156% of nonspam,  33456 weighted)
# False negatives:     20590  3.090%  (5.444% of spam,  73371 weighted)

score set 3 (net, bayes):
# False positives:        40  0.008%  (0.024% of nonspam,   9039 weighted)
# False negatives:      3379  0.647%  (0.954% of spam,  11476 weighted)
Comment 98 Mark Martinec 2009-10-14 16:48:26 UTC
The RCVD_IN_DNSWL_* scores are again unusual:
  score RCVD_IN_DNSWL_HI  0 -0.466 0 -0.001
  score RCVD_IN_DNSWL_LOW 0 -0.292 0 -0.760
  score RCVD_IN_DNSWL_MED 0 -1.703 0 -0.727

probably because of their low frequency, especially the _HI rule:
OVERALL    SPAM%     HAM%     S/O    RANK   SCORE  NAME
  0.184   0.0007   0.5707    0.001   0.76   -1.00  RCVD_IN_DNSWL_HI
  7.408   0.1096  22.7509    0.005   0.67   -1.00  RCVD_IN_DNSWL_MED
  2.553   0.1816   7.5365    0.024   0.59   -1.00  RCVD_IN_DNSWL_LOW

and resulting zero ranges (tmp/ranges.data):
  0.000 0.000 0 RCVD_IN_DNSWL_HI
  0.000 0.000 0 RCVD_IN_DNSWL_MED
  0.000 0.000 0 RCVD_IN_DNSWL_LOW

Don't know what a clean solution is, apart from fixing their scores
manually.
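
Fixing them manually would mean carrying their score lines outside the
"gen:mutable" section of 50_scores.cf, where the GA leaves them alone --
e.g. the trunk values quoted in Comment 94:

  score RCVD_IN_DNSWL_LOW 0 -1 0 -1
  score RCVD_IN_DNSWL_MED 0 -4 0 -4
  score RCVD_IN_DNSWL_HI  0 -8 0 -8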
Comment 99 Warren Togami 2009-10-14 21:58:58 UTC
I just discovered that I was falsely triggering rules like RCVD_IN_SORBS_DUL, RCVD_IN_PBL or RDNS_DYNAMIC on some of my corpus ham, due to a misconfiguration on my server.  Mail from my users delivered directly to other users on my server from their home ISPs or mobile phones was lacking "authenticated user" within the Received header, causing many hits on these and unknown other rules.  Roughly 150-170 of my FPs on these three rules should not count against those rules.  Nearly all of my RCVD_IN_SORBS_DUL and RCVD_IN_PBL hits should have been AllTrusted instead.  Is this enough to throw off the GA scoring?
Comment 100 Mark Martinec 2009-10-15 11:56:23 UTC
Btw, I added a "Target Milestone" 3.3.1, so that triage of 3.3.0 bugs
can be more selective, choosing among Future/Undefined/3.3.1.
Comment 101 Justin Mason 2009-10-19 07:53:59 UTC
(In reply to comment #98)
> The RCVD_IN_DNSWL_* scores are again unusual:
>   score RCVD_IN_DNSWL_HI  0 -0.466 0 -0.001
>   score RCVD_IN_DNSWL_LOW 0 -0.292 0 -0.760
>   score RCVD_IN_DNSWL_MED 0 -1.703 0 -0.727
> 
> probably because of their low frequency, especially the _HI rule:
> OVERALL    SPAM%     HAM%     S/O    RANK   SCORE  NAME
>   0.184   0.0007   0.5707    0.001   0.76   -1.00  RCVD_IN_DNSWL_HI
>   7.408   0.1096  22.7509    0.005   0.67   -1.00  RCVD_IN_DNSWL_MED
>   2.553   0.1816   7.5365    0.024   0.59   -1.00  RCVD_IN_DNSWL_LOW
> 
> and resulting zero ranges (tmp/ranges.data):
>   0.000 0.000 0 RCVD_IN_DNSWL_HI
>   0.000 0.000 0 RCVD_IN_DNSWL_MED
>   0.000 0.000 0 RCVD_IN_DNSWL_LOW
> 
> Don't know what a clean solution is, apart from fixing their scores
> manually.

feel free to fix them; it's hard for the GA to get network rules right.  tbh I'm surprised the ranges were zeroed (for _MED at least).
Comment 102 Justin Mason 2009-10-19 07:55:57 UTC
(In reply to comment #99)
> I just discovered that I was falsely triggering rules like RCVD_IN_SORBS_DUL,
> RCVD_IN_PBL or RDNS_DYNAMIC on some of my corpus ham, due to a misconfiguration
> on my server.  Mail from my users delivered directly to other users on my
> server from their home ISPs or mobile phones was lacking "authenticated user"
> within the Received header, causing many hits on these and unknown other rules.
> Roughly 150-170 of my FPs on these three rules should not count against those
> rules.  Nearly all of my RCVD_IN_SORBS_DUL and RCVD_IN_PBL hits should have
> been AllTrusted instead.  Is this enough to throw off the GA scoring?

if you want, feel free to sed the log files to fix this, or just remove the lines entirely, and reupload.  170 FPs for those DUL rules is quite strong imo.
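
e.g. something along these lines (a sketch -- it assumes the comma-separated
tests= field of the mass-check log lines, and the rule list is illustrative):

  # strip just the bogus rule hits, keeping the log lines:
  perl -i -pe 's/,(?:RCVD_IN_SORBS_DUL|RCVD_IN_PBL|RDNS_DYNAMIC)\b//g;
               s/\b(?:RCVD_IN_SORBS_DUL|RCVD_IN_PBL|RDNS_DYNAMIC),//g' \
    ham-rescore-wt*.log

  # or drop the affected lines entirely:
  grep -vE 'tests=.*(RCVD_IN_SORBS_DUL|RCVD_IN_PBL|RDNS_DYNAMIC)' \
    ham-rescore-wt1.log > ham-rescore-wt1.log.new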
Comment 103 Warren Togami 2009-10-19 10:31:26 UTC
> if you want, feel free to sed the log files to fix this, or just remove the
> lines entirely, and reupload.  170 FPs for those DUL rules is quite strong imo.

Removed the majority of the offending lines and reuploaded ham-rescore-wt*.log.

I also zeroed out *wt-en6.log because they were found to be too corrupted to trust the results.
Comment 104 Mark Martinec 2009-10-19 11:28:49 UTC
(In reply to comment #103)
> Removed the majority of the offending lines and reuploaded ham-rescore-wt*.log. 
> I also zeroed out *wt-en6.log because they were found to be too corrupted to
> trust the results.

Thanks. Seems you did it in the 'corpus' rsync directory. Please also update
them in the 'submit' directory using the existing names, otherwise in a few
weeks' time we'll all forget which file came from where - after all, the 'submit'
directory is the official source for rescoring runs.
Comment 105 Karsten Bräckelmann 2009-10-19 12:21:56 UTC
Argh, late to the show, sorry. :-/  From the second GA re-score run, attachment 4553 (aligned for readability):

score KB_RATWARE_MSGID       4.099 3.315 4.095 1.475

This is awesome! :)  Though unrelated, so let me move on to the issue.


score KB_RATWARE_OUTLOOK_08  1.100 3.232 0.776 0.025
score KB_RATWARE_OUTLOOK_12  2.734 2.826 1.654 0.041
score KB_RATWARE_OUTLOOK_16  1.725 3.331 2.532 0.887
score KB_RATWARE_OUTLOOK_MID 2.259 2.485 3.121 0.001

This is also awesome -- kind of. But frankly, it also is a total mess. They are essentially the same, just slightly differing in strictness or fuzziness. They are almost *exactly* overlapping -- *all* four of them (see ruleqa).

These rules are really redundant, and there should be only one instead. FWIW, that *should* be KB_RATWARE_BOUNDARY, which was added specifically for this. This rule seems to be missing entirely, though. :(

Looking at the scores, I don't think simply adding them would do.

Also, I'm kind of unsatisfied with the score-set 3 scores. The FP rate is 0! (Almost, I'll challenge the ham hits.) For all five rules above. Net tests or not...
Comment 106 Warren Togami 2009-10-19 12:35:30 UTC
> Thanks. Seems you did it in the 'corpus' rsync directory. Please also update
> them in the 'submit' directory using existing names, otherwise in few weeks
> time we'll all forget which file came from where - after all, the 'submit'
> directory is the official source for rescoring runs.

Fixed in 'submit'.
Comment 107 Justin Mason 2009-10-19 14:26:25 UTC
(In reply to comment #105)
> score KB_RATWARE_OUTLOOK_08  1.100 3.232 0.776 0.025
> score KB_RATWARE_OUTLOOK_12  2.734 2.826 1.654 0.041
> score KB_RATWARE_OUTLOOK_16  1.725 3.331 2.532 0.887
> score KB_RATWARE_OUTLOOK_MID 2.259 2.485 3.121 0.001
> 
> This is also awesome -- kind of. But frankly, it also is a total mess. They are
> essentially the same, just slightly differing in strictness or fuzziness. They
> are almost *exactly* overlapping -- *all* four of them (see ruleqa).
> 
> These rules are really redundant, and there should be only one instead. FWIW,
> that *should* be KB_RATWARE_BOUNDARY, which was added specifically for this.
> This rule seems to be missing entirely, though. :(
> 
> Looking at the scores, I don't think simply adding them would do.
> 
> Also, I'm kind of unsatisfied with the score-set 3 scores. The FP rate is 0!
> (Almost, I'll challenge the ham hits.) For all five rules above. Net tests or
> not...

it looks like they overlap a lot with some other rules.  But yes, if they were just 1 rule, it probably would have gotten a better single score.

I'm not sure if it's too late to fix this or not. :(
Comment 108 Karsten Bräckelmann 2009-10-19 14:49:17 UTC
(In reply to comment #107)
> it looks like they overlap a lot with some other rules.  But yes, if they were
> just 1 rule, it probably would have gotten a better single score.
> 
> I'm not sure if it's too late to fix this or not. :(

Frankly, pretty much any one of them could be kept and all the other variants simply dropped for the next re-score run.  Keeping all of them is just a waste of cycles.

The important questions are, where is KB_RATWARE_BOUNDARY, which was specifically pushed right before the deadline to supersede these?

And of course, why do the scores drop that drastically with score-set 3, if there is *no* FP? Regardless of the spam already scoring above 5, there is no FP reason to lower the score.
Comment 109 Karsten Bräckelmann 2009-10-19 15:37:16 UTC
(In reply to comment #108)
> The important questions are, where is KB_RATWARE_BOUNDARY, which was
> specifically pushed right before the deadline to supersede these?

Argh!  It is in freqs.full, attachment 4541. However, it appears we've been using inconsistent rule-sets, with most contributors using one outdated rule-set or the other. :-(

 10.830  14.1437   0.1901    0.987   0.67    0.00  T_KB_RATWARE_BOUNDARY
  0.025   0.0327   0.0000    1.000   0.65    1.00  KB_RATWARE_BOUNDARY
Comment 110 Justin Mason 2009-10-20 03:46:49 UTC
(In reply to comment #109)
> (In reply to comment #108)
> > The important questions are, where is KB_RATWARE_BOUNDARY, which was
> > specifically pushed right before the deadline to supersede these?
> 
> Argh!  It is in freqs.full, attachment 4541. However, it appears we've been
> using inconsistent rule-sets, with most contributors using one outdated
> rule-set or the other. :-(
> 
>  10.830  14.1437   0.1901    0.987   0.67    0.00  T_KB_RATWARE_BOUNDARY
>   0.025   0.0327   0.0000    1.000   0.65    1.00  KB_RATWARE_BOUNDARY

mysterious:

: exit=[130] uid=jm Tue Oct 20 10:40:30 GMT 2009; cd /export/home/corpus-rsync/corpus/submit
: 6...; grep KB_RATWARE_BOUNDARY *.log | grep -v T_KB_RATWARE_BOUNDARY
: exit=[0 1] uid=jm Tue Oct 20 10:43:41 GMT 2009; cd /export/home/corpus-rsync/corpus/submit

I can't find any non-T_ hits in the submit logs.  Mark?
Comment 111 Justin Mason 2009-10-20 03:48:45 UTC
(In reply to comment #110)
> (In reply to comment #109)
> > (In reply to comment #108)
> > > The important questions are, where is KB_RATWARE_BOUNDARY, which was
> > > specifically pushed right before the deadline to supersede these?

anyway.... it doesn't look like that rule is good enough to supersede them:

 10.830  14.1437   0.1901    0.987   0.67    0.00  T_KB_RATWARE_BOUNDARY

vs

  9.846  12.9126   0.0003    1.000   0.98    1.00  KB_RATWARE_OUTLOOK_08
  9.836  12.8985   0.0003    1.000   0.98    1.00  KB_RATWARE_OUTLOOK_MID
  9.835  12.8976   0.0003    1.000   0.98    1.00  KB_RATWARE_OUTLOOK_16
  9.835  12.8976   0.0003    1.000   0.98    1.00  KB_RATWARE_OUTLOOK_12

that's a much higher FP rate!
Comment 112 Karsten Bräckelmann 2009-10-20 04:15:03 UTC
> anyway.... it doesn't look like that rules is good enough to supersede them:
> that's a much higher FP rate!

Yes. It's all Warren's fault! ;)  Seriously, the new BOUNDARY one does indeed have quite a few FPs, all in Warren's corpus, and he kindly provided me with the samples.  It appears these are all entirely legit (though auto-generated) messages.  I wish MS wouldn't re-use their code like that.
  X-Mailer: Microsoft CDO for Windows 2000

Anyway, I agree -- RATWARE_BOUNDARY is bad; I screwed up with too low a range between headers.  One of the previous rules needs to be kept.  (The massive overlap along with the introduced FNs made it drop off the active rules list.)

Still wondering why there are different rule names in freqs.
Comment 113 Karsten Bräckelmann 2009-10-20 04:43:31 UTC
>   9.836  12.8985   0.0003    1.000   0.98    1.00  KB_RATWARE_OUTLOOK_MID

Proposing the MID variant for inclusion, and dropping the other variants.

The BOUNDARY one is bad, and the variants do have an almost 100% overlap with the MID one. It's also the most strict one. (Funny side-effect of the additional constraint is actually catching a spam or two more... Go figure.)

The ham hit probably is not really ham (no FP in nightlies).
Comment 114 Justin Mason 2009-10-20 08:26:26 UTC
(In reply to comment #113)
> >   9.836  12.8985   0.0003    1.000   0.98    1.00  KB_RATWARE_OUTLOOK_MID
> 
> Proposing the MID variant for inclusion, and dropping the other variants.

can you list exactly which rules you want zeroed, before Mark reruns the GA accordingly?  minimize the work he has to do ;)
Comment 115 Karsten Bräckelmann 2009-10-20 08:46:55 UTC
Err, sure. :)  The following variations should just be dropped.

score KB_RATWARE_OUTLOOK_08  0
score KB_RATWARE_OUTLOOK_12  0
score KB_RATWARE_OUTLOOK_16  0
score KB_RATWARE_BOUNDARY    0

Keep KB_RATWARE_OUTLOOK_MID (instead of the above) and KB_RATWARE_MSGID (which is an unrelated rule anyway).
Comment 116 Adam Katz 2009-10-20 13:08:15 UTC
Standing up for RDNS_NONE ...

http://ruleqa.spamassassin.org/week/RDNS_NONE/detail
bb_trec_enron has 98.9497% of its ham match RDNS_NONE, which is to say that it's bogus.  Discounting that corpus, RDNS_NONE matches 58.7244% of the total spam corpus and 1.7463% of the total ham corpus (down from 12.1273%), which makes it far more interesting.  Many of the people on the sa-users list have manually scored RDNS_NONE higher than the default 0.1.  I score it at 0.9 on my own production servers.

(Not sure if this is the right venue -- or if I'm an approved kibitzer)
Comment 117 Karsten Bräckelmann 2009-10-20 13:17:26 UTC
> bb_trec_enron has 98.9497% of its ham match RDNS_NONE, which is to say that
> it's bogus.

Indeed. From the dev list earlier today, that's "a corpus with generated (synthetic) headers [...], only useful for body hits", and is not included in the re-scoring.

> Many of the people on the sa-users list have
> manually scored RDNS_NONE higher than the default 0.1.

FWIW, nailed to 0.1 as per comment 56.
Comment 118 Adam Katz 2009-10-20 13:38:04 UTC
(In reply to comment #117)
> > bb_trec_enron has 98.9497% of its ham match RDNS_NONE, which is to say
> > that it's bogus.
> 
> Indeed. From the dev list earlier today, that's "a corpus with generated
> (synthetic) headers [...], only useful for body hits", and is not included
> in the re-scoring.

Ah, I thought I saw that corpus mentioned somewhere ... but only thought to search this bug.  I had assumed that if the ruleqa page mentioned it, it was factored in everywhere.

> > Many of the people on the sa-users list have
> > manually scored RDNS_NONE higher than the default 0.1.
> 
> FWIW, nailed to 0.1 as per comment 56.

I saw that but did not understand it ...  It says "most of these are already documented and labeled as [fixed/immutable]" but it doesn't say where.  Is this because it triggers when rDNS checks aren't performed by the first trusted relay, and if so, can we work around that somehow (wasn't that bug 5586?)

Or is this a remnant of Justin's checkin r497852 from 2007 which states:
> move 20_dynrdns.cf from sandbox into main ruleset, so RDNS_DYNAMIC
> and RDNS_NONE are core rules; lock their scores to an informational
> 0.1, however, since they still have a high ham hit-rate alone 

... despite the current corpus data (unless 1.7% is a high ham hit-rate)?
Comment 119 Warren Togami 2009-10-20 13:47:28 UTC
(In reply to comment #118)
> ... despite the current corpus data (unless 1.7% is a high ham hit-rate)?

http://ruleqa.spamassassin.org/20091017-r826198-n/RDNS_NONE/detail
The most recent weekly run has pretty substantial hits even outside of the synthetic corpus.

Adam, this rule, like your RCVD_IN_APNIC, is an example of an inherently prejudiced rule.  It might work for the most part, and you might accept the risk of accidental FPs because the score alone won't push a message above the threshold.  However, the combined risk of multiple prejudiced rules is too great.  Whether to enable prejudiced rules should be up to the sysadmin.  We should not assign high scores to any known prejudiced rules in the default ruleset.
Comment 120 Adam Katz 2009-10-20 16:25:36 UTC
(In reply to comment #119)
> (In reply to comment #118)
>> ... despite the current corpus data (unless 1.7% is a high ham hit-rate)?
> 
> http://ruleqa.spamassassin.org/20091017-r826198-n/RDNS_NONE/detail
> The most recent weekly run has pretty substantial hits even outside of
> the synthetic corpus.

Your link is just a longer version of mine.  It still results in a 1.7% total ham hit-rate.  Is that too substantial?  Is there detail on what each corpus is (specifically nbebout, since that's the only other corpus that hit 4+% of spam)?

Looking only at ham scoring 4 or higher (including enron since I can't remove it), RDNS_NONE hit 0.8528% of the total ham corpus.  Of the ham scoring JUST 4 (4.0-4.99999), we're looking at 0.5865% that would become FPs assuming a score of 1.1 (increasing the 0.1 by 1), and I'm not even proposing my own implementation's 0.9.
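
(To spell out the arithmetic: going from 0.1 to 1.1 adds 1.0 point to every
message hitting RDNS_NONE, so any such ham currently in the 4.0-4.99999 band
would cross the 5.0 threshold and become an FP.)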

> Adam, this rule, like your RCVD_IN_APNIC, is an example of an inherently
> prejudiced rule.  It might work for the most part, and you might accept
> the risk of accidental FPs because the score alone won't push a message
> above the threshold.  However, the combined risk of multiple prejudiced
> rules is too great.  Whether to enable prejudiced rules should be up to
> the sysadmin.  We should not assign high scores to any known prejudiced
> rules in the default ruleset.

I agree that RCVD_VIA_APNIC is a prejudiced rule, and my channels have had it rated 0.001 ever since you called me out on it (RCVD_VIA_APNIC accidentally came in when I migrated from an internal-only propagation to a published channel).  KHOP_NO_FIRST_NAME, my other poorly-considered published test, pre-dates my more thorough testing mechanism (which has limited new rules' entry quite considerably).  My rules will get even more cleaned up once I get an svn account to test them here.  (Some of them, like the biased RCVD_IN_APNIC and quasi-biased/unfair KHOP_SC_CIDR8, would either never get pushed up for testing or would get the nopublish flag, depending on the guidelines ... that nobody has yet pointed me to.)  (Side note: I see __RCVD_VIA_APNIC is already in your own sandbox, hitting 86% of all Japanese ham.)

Getting back to this issue:  I don't see any problem with prejudice against poorly constructed network infrastructures that can't bother to adhere to the SMTP standard (RFC1912 section 2.1).  This is something that any network admin who should legitimately be managing a mail server should be able to fix with a single phone call (please correct me if this sentence is prejudiced in any way).

The SMTP standard requires that a server's rDNS match the server's reported name (thus the IP must have rDNS), and most allocated IPs have rDNS anyway (even if it's wrong or ~dynamic, e.g. RDNS_DYNAMIC).  There is also a growing number of deployments that block improper FCrDNS at the door (RDNS_NONE is a subset of failing FCrDNS).

SA already has built-in "prejudices" against poorly constructed email clients (e.g. MISSING_HEADERS) and relays (e.g. DATE_IN_FUTURE_48_96), so why not the network?  Isn't SPF_FAIL a "prejudiced" test against network configuration?

SA at its core is merely a system of probabilities.  Even without bayes, the masscheck mechanism and its points are awarded based on statistical significance.  Very few rules are actually free of FPs (or FNs for negative rules).  My question still stands:  what does SA deem statistically significant when it comes to FPs?  Why does RDNS_NONE need to be immutable rather than dictated by the masscheck results?  What would the automated system score RDNS_NONE if it were allowed to?  I'm guessing something between 0.2 and 0.7.
Comment 121 Warren Togami 2009-10-20 19:00:36 UTC
(In reply to comment #120)
> I agree that RCVD_VIA_APNIC is a prejudiced rule, and my channels have had it
> rated 0.001 ever since you called me out on it (RCVD_VIA_APNIC accidentally

OK, glad to hear that you reduced it.  I didn't look at your scores after that first time.  You should really get a spamassassin account so your rules can be more thoroughly tested against more varied corpora.

> nobody has yet pointed me to.)  (Side note: I see __RCVD_VIA_APNIC is already
> in your own sandbox, hitting 86% of all Japanese ham.)

Yes, I'm using it as a softener to exclude from the extremely prejudiced CN_<NUMBER> rules.  It just so happens that the majority of CN_<NUMBER> spam comes from !APNIC, and APNIC is prejudiced in exactly the way that makes the CN_<NUMBER> rules less dangerous.  Even though those rules have high spam hit rates and zero FPs across our nightly masscheck corpora, they are still too prejudiced to be safe as default rules.

> SA at its core is merely a system of probabilities.  Even without bayes, the
> masscheck mechanism and its points are awarded based on statistical
> significance.  Very few rules are actually free of FPs (or FNs for negative
> rules).  My question still stands:  what does SA deem statistically significant
> when it comes to FPs?  Why does RDNS_NONE need to be immutable rather than
> dictated by the masscheck results?  What would the automated system score
> RDNS_NONE if it were allowed to?  I'm guessing something between 0.2 and 0.7.

That is an interesting question.
Comment 122 Adam Katz 2009-10-22 13:32:40 UTC
Some bugs in the auto-generated rules from attachment 4553

HTML_MESSAGE scores WAY too high.  There are others too.

Full list as of right now:


   MSECS    SPAM%     HAM%     S/O    RANK   SCORE  NAME
       0   0.1848   4.8675   0.037    0.78    0.00  SPF_HELO_PASS
       0   0.3294   5.5859   0.056    0.74    0.00  SPF_PASS
       0  12.2476   1.2568   0.907    0.58    0.00  RCVD_IN_BL_SPAMCOP_NET
       0  50.4453   3.7391   0.931    0.57    2.30  MIME_HTML_ONLY
       0  49.9300  12.1231   0.805    0.52    0.10  RDNS_NONE
       0   3.8466   1.8427   0.676    0.51    2.30  SUBJ_ALL_CAPS
       0   2.3989   1.3218   0.645    0.50    0.00  UNPARSEABLE_RELAY
       0  83.7769  40.8865   0.672    0.49    0.00  HTML_MESSAGE
       0   3.4477   3.8932   0.470    0.47    2.50  MIME_QP_LONG_LINE
       0  12.2361  15.6252   0.439    0.46    0.00  FREEMAIL_FROM
       0   0.7695   1.2102   0.389    0.41    2.90  TVD_SPACE_RATIO
       0   0.4610   1.2409   0.271    0.35    1.00  EXTRA_MPART_TYPE
       0   0.0271   1.0700   0.025    0.15    1.22  MSGID_MULTIPLE_AT

score SPF_HELO_PASS -0.001
score SPF_PASS -0.001
score RCVD_IN_BL_SPAMCOP_NET 0 1.725 0 1.180 # n=2
score MIME_HTML_ONLY 1.474 0.737 0.829 0.462
score RDNS_NONE             0.1
score SUBJ_ALL_CAPS 0.264 1.568 0.593 1.045
score UNPARSEABLE_RELAY 0.001
score HTML_MESSAGE 2.199 0.838 1.473 0.511
score MIME_QP_LONG_LINE 0.074 0.242 0.116 0.002
score FREEMAIL_FROM 0.817 1.020 0.401 0.856
score TVD_SPACE_RATIO 0.001 0.201 0.398 0.001
score MSGID_MULTIPLE_AT 0.001 0.001 0.598 0.000


To fetch them for yourself (so as to get something more up-to-date or from a different URL, etc), here's the code I ran (sorry, I know posix shell better than perl, so I dip into both):

elinks -dump http://ruleqa.spamassassin.org/ | perl -ne \
  'print if /(\s+[\d.]+){2}\s+[1-9][\d.]+(\s+[\d.]+){3}\s+(?!T_)\w|\sMSECS/' \
  | tee rules.txt

for rule in $(perl -ne 'if (/.*\s([A-Z]+\w*_\w*)/) { s//$1/; print; }' \
  < rules.txt); do grep "^[^#]* $rule " /tmp/50_scores_newest.cf; done


That could probably be written better, e.g. by also looking for ham% > spam% in addition to ham% > 0.9999%, but this is a good first pass.
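
A sketch of that ham% > spam% refinement (untested; it assumes the column
order shown above, MSECS SPAM% HAM% S/O RANK SCORE NAME):

  elinks -dump http://ruleqa.spamassassin.org/ | perl -ane \
    'print if /\sMSECS/
       or (@F == 7 and $F[6] !~ /^T_/ and ($F[2] > $F[1] or $F[2] >= 1));' \
    | tee rules.txt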

Obviously, /removing/ fixed scores for things like RDNS_NONE can't possibly be considered until the GA is a little more apt at figuring this sort of thing out.
Comment 123 Adam Katz 2009-10-22 13:47:40 UTC
(In reply to comment #122)
sorry, that should be:

elinks -dump http://ruleqa.spamassassin.org/ | perl -ne \
  'print if /(\s+[\d.]+){2}\s+[1-9][\d.]+(\s+[\d.]+){3}\s+(?!T_)\w|\sMSECS/' \
  | tee rules.txt

for rule in $(perl -ne 'if (/.*\s([A-Z]+\w*_\w*)/) { s//$1/; print; }' \
  < rules.txt); do grep "^[^#]* $rule " /tmp/50_scores_newest.cf || \
  echo "score $rule UNKNOWN"; done

The trailing backslashes are line continuations, so each stanza is effectively a single command.

Obviously, ignore the genuine ham rules.
Comment 124 Mark Martinec 2009-10-26 07:49:13 UTC
Created attachment 4558
resulting 50_scores.cf from garescorer runs - V3

Attached is the latest 50_scores.cf file, obtained in a couple of iterations
during the last few days. It takes into account the updated results files
from the rsync submit area, in particular the updated net-wt* (Comments 99,
102, 103) and net-hege* files. The binnocenti* logs are still excluded.
The rest of the corpora tweaks/decimation are as per my previous run, Comment 96.

The RCVD_IN_DNSWL_* scores are hand-tweaked (according to Comment 101),
otherwise the _MED stands out above the _HI due to its significantly higher
hit rate.

The KB_RATWARE_OUTLOOK_08, KB_RATWARE_OUTLOOK_12, KB_RATWARE_OUTLOOK_16
and KB_RATWARE_BOUNDARY were now zeroed out according to Comment 115.

I tried leaving RDNS_NONE and RDNS_DYNAMIC floating (Comments 116, 120, 122),
and it seems to me the obtained scores are perfectly sensible and useful,
and still not high enough to punish incompetent admins too hard:
  score RDNS_NONE     0 1.1 0 0.7
  score RDNS_DYNAMIC  0 0.5 0 0.5
so I'm leaving these floating.

According to Comment 122 I zeroed out (actually, 0.001'd out) the
HTML_MESSAGE, MIME_QP_LONG_LINE, FREEMAIL_FROM, TVD_SPACE_RATIO,
and MSGID_MULTIPLE_AT.

Some further tweaks: I reduced the BAYES scores somewhat (e.g. from 4.5
to 3.5 for BAYES_99 in scoreset 3) and tamed down BAYES_50, which was
standing out from the crowd.

For DCC_* rules I used the already described approach: obtain DCC_CHECK score
from a GA run with all DCC_REPUT_* zeroed-out, then fix the obtained DCC_CHECK,
and let DCC_REPUT_* float for the final run.

The NML_ADSP_CUSTOM_MED score was obtained from a GA run, but the others (_LOW, _HIGH)
were set manually (currently no hits). The DKIM_ADSP_ALL, DKIM_ADSP_DISCARD,
and DKIM_ADSP_NXDOMAIN are based on GA runs, but hand-tweaked somewhat due
to inconsistencies between corpora.

A word about JM_SOUGHT_FRAUD_{1,2,3}: these three rules come out of
a GA run with scores between 2 and 3, but are somewhat inconsistent
between runs and corpora. As requested by Comment 38 their scores
were fixed at zero for the final run, but I'd set these manually
to 2.2 each for the published 50_scores.cf.

After preparing my manual fixes from a couple of trial runs, I made a
final run for each scoreset with these fixed scores, so as to allow other
scores to adjust themselves to the new constraints.

So here are the manual fixes:

score SPF_PASS -0.001
score SPF_HELO_PASS -0.001

score BAYES_00  0  0 -1.2   -1.9
score BAYES_05  0  0 -0.2   -0.5
score BAYES_20  0  0 -0.001 -0.001
score BAYES_40  0  0 -0.001 -0.001
score BAYES_50  0  0  2.0    0.8
score BAYES_60  0  0  2.5    1.5
score BAYES_80  0  0  2.7    2.0
score BAYES_95  0  0  3.2    3.0
score BAYES_99  0  0  3.8    3.5

score RCVD_IN_DNSWL_LOW  0 -0.6 0 -1.1
score RCVD_IN_DNSWL_MED  0 -1.5 0 -1.2
score RCVD_IN_DNSWL_HI   0 -1.8 0 -1.8

score HTML_MESSAGE 0.001
score NO_RELAYS -0.001
score UNPARSEABLE_RELAY 0.001
score NO_RECEIVED -0.001
score NO_HEADERS_MESSAGE 0.001

score DKIM_ADSP_ALL        0 1.1 0 0.8
score DKIM_ADSP_DISCARD    0 1.8 0 1.8
score DKIM_ADSP_NXDOMAIN   0 0.8 0 0.9
score NML_ADSP_CUSTOM_LOW  0 0.7 0 0.7
score NML_ADSP_CUSTOM_MED  0 1.2 0 0.9
score NML_ADSP_CUSTOM_HIGH 0 2.6 0 2.5

score JM_SOUGHT_FRAUD_1 0
score JM_SOUGHT_FRAUD_2 0
score JM_SOUGHT_FRAUD_3 0

score MIME_QP_LONG_LINE 0.001
score FREEMAIL_FROM     0.001
score TVD_SPACE_RATIO   0.001
score MSGID_MULTIPLE_AT 0.001
score EXTRA_MPART_TYPE     1.0
score RDNS_NONE     0 1.1 0 0.7
score RDNS_DYNAMIC  0 0.5 0 0.5

score KB_RATWARE_OUTLOOK_08  0
score KB_RATWARE_OUTLOOK_12  0
score KB_RATWARE_OUTLOOK_16  0
score KB_RATWARE_BOUNDARY    0
Comment 125 Mark Martinec 2009-10-26 08:00:59 UTC
$ head test scores

=================================
score set 3 (net, bayes) - gen-set3-20-5.0-12200-ga

test (10%)
# SUMMARY for threshold 5.0:
# Correctly non-spam:  21172  99.93%
# Correctly spam:      43597  98.78%
# False positives:        14  0.07%
# False negatives:       537  1.22%
# TCR(l=50): 35.678254  SpamRecall: 98.783%  SpamPrec: 99.968%

scores (90%):
# SUMMARY for threshold 5.0:
# Correctly non-spam: 168143  32.193%  (99.979% of non-spam corpus)
# Correctly spam:     349734  66.961%  (98.763% of spam corpus)
# False positives:        36  0.007%  (0.021% of nonspam,   8360 weighted)
# False negatives:      4382  0.839%  (1.237% of spam,  14401 weighted)
# Average score for spam:  21.1    nonspam: -2.2
# Average for false-pos:   5.5  false-neg: 3.3
# TOTAL:              522295  100.00%

=================================
score set 2 (no net, bayes) - gen-set2-10-5.0-12200-ga

test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  21148  99.82%
# Correctly spam:      41172  93.29%
# False positives:        38  0.18%
# False negatives:      2962  6.71%
# TCR(l=50): 9.077334  SpamRecall: 93.289%  SpamPrec: 99.908%

scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 167953  32.157%  (99.866% of non-spam corpus)
# Correctly spam:     329931  63.169%  (93.170% of spam corpus)
# False positives:       226  0.043%  (0.134% of nonspam,  26882 weighted)
# False negatives:     24185  4.631%  (6.830% of spam,  89229 weighted)
# Average score for spam:  10.8    nonspam: -0.7
# Average for false-pos:   5.6  false-neg: 3.7
# TOTAL:              522295  100.00%

=================================
score set 1 (net, no bayes) - gen-set1-10-5.0-12201-ga

test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  21155  99.85%
# Correctly spam:      43153  97.78%
# False positives:        31  0.15%
# False negatives:       981  2.22%
# TCR(l=50): 17.437377  SpamRecall: 97.777%  SpamPrec: 99.928%

scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 168012  32.168%  (99.901% of non-spam corpus)
# Correctly spam:     346216  66.287%  (97.769% of spam corpus)
# False positives:       167  0.032%  (0.099% of nonspam,  20194 weighted)
# False negatives:      7900  1.513%  (2.231% of spam,  23052 weighted)
# Average score for spam:  19.8    nonspam: -0.5
# Average for false-pos:   5.7  false-neg: 2.9
# TOTAL:              522295  100.00%

=================================
score set 0 (no net, no bayes) - gen-set0-5-5.0-12201-ga

test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  20919  98.74%
# Correctly spam:      34081  77.22%
# False positives:       267  1.26%
# False negatives:     10053  22.78%
# TCR(l=50): 1.885827  SpamRecall: 77.222%  SpamPrec: 99.223%

scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 166261  31.833%  (98.860% of non-spam corpus)
# Correctly spam:     271409  51.965%  (76.644% of spam corpus)
# False positives:      1918  0.367%  (1.140% of nonspam, 126535 weighted)
# False negatives:     82707  15.835%  (23.356% of spam, 235514 weighted)
# Average score for spam:  10.4    nonspam: 0.6
# Average for false-pos:   6.3  false-neg: 2.8
# TOTAL:              522295  100.00%

=================================




In summary:
set 3
# False positives:        36  (0.021% of nonspam)
# False negatives:      4382  (1.237% of spam)

set 2
# False positives:       226  (0.134% of nonspam)
# False negatives:     24185  (6.830% of spam)

set 1
# False positives:       167  (0.099% of nonspam)
# False negatives:      7900  (2.231% of spam)

set 0
# False positives:      1918  (1.140% of nonspam)
# False negatives:     82707  (23.356% of spam)
Comment 126 Mark Martinec 2009-10-26 08:08:22 UTC
Created attachment 4559 [details]
freqs.full of corpora used for score set 3 and 2 runs
Comment 127 Mark Martinec 2009-10-26 08:09:26 UTC
Created attachment 4560 [details]
ranges.data on corpora used for score set 3 and 2 runs
Comment 128 Karsten Bräckelmann 2009-10-26 09:57:28 UTC
(In reply to comment #124)
> Created an attachment (id=4558)
> resulting 50_scores.cf from garescorer runs - V3

Now I am getting really nervous. :-/  From the scores:

 score KB_DATE_CONTAINS_TAB  3.799 3.799 3.315 2.871
 score KB_FAKED_THE_BAT      1.447 2.273 2.452 3.799

The bad thing about this is that onet.pl / onet.eu (a Polish free-mailer, AFAIK) actually munges the header, injecting the tab into the Date header on their outgoing SMTP servers. Apparently they do that harm to all outgoing mail, not limited to their web-mailer.

It is a very, very stupid thing for them to do, munging MUA-generated headers like that, but still they appear to do it. :(  That means their customers will really be punished, and using them *and* The Bat! is a killer.

FWIW, I once wrote these to counter a flood of low-scorers -- but the above scores are scaring me. This is quite bad.
Comment 129 Matthias Leisi 2009-10-26 10:36:56 UTC
(In reply to comment #124)

> The RCVD_IN_DNSWL_* scores are hand-tweaked (according to Comment 101),
> otherwise the _MED stands out above the _HI due to its significantly higher
> hit rate.
> [..]
>
> score RCVD_IN_DNSWL_LOW  0 -0.6 0 -1.1
> score RCVD_IN_DNSWL_MED  0 -1.5 0 -1.2
> score RCVD_IN_DNSWL_HI   0 -1.8 0 -1.8

Is there a particular reason why these are so much different from those in  https://svn.apache.org/repos/asf/spamassassin/trunk/rules/50_scores.cf:

| score RCVD_IN_DNSWL_LOW 0 -1 0 -1
| score RCVD_IN_DNSWL_MED 0 -4 0 -4
| score RCVD_IN_DNSWL_HI 0 -8 0 -8
Comment 130 Mark Martinec 2009-10-26 11:03:28 UTC
> > The RCVD_IN_DNSWL_* scores are hand-tweaked (according to Comment 101),
> > otherwise the _MED stands out above the _HI due to its significantly higher
> > hit rate.
> > score RCVD_IN_DNSWL_LOW  0 -0.6 0 -1.1
> > score RCVD_IN_DNSWL_MED  0 -1.5 0 -1.2
> > score RCVD_IN_DNSWL_HI   0 -1.8 0 -1.8
> 
> Is there a particular reason why these are so much different from those in 
> https://svn.apache.org/repos/asf/spamassassin/trunk/rules/50_scores.cf:
> 
> | score RCVD_IN_DNSWL_LOW 0 -1 0 -1
> | score RCVD_IN_DNSWL_MED 0 -4 0 -4
> | score RCVD_IN_DNSWL_HI  0 -8 0 -8

The -1/-4/-8 were manually provided (don't know the background on this
decision).

The RCVD_IN_DNSWL_MED in my GA results was obtained automatically, and the
other two were manually adjusted to make some sense compared to _MED.
Btw, the GA results on scoreset 3 from one of my previous runs were:
  RCVD_IN_DNSWL_LOW -2.761
  RCVD_IN_DNSWL_MED -0.999
  RCVD_IN_DNSWL_HI  -0.966
Comment 131 Matthias Leisi 2009-10-26 11:36:22 UTC
(In reply to comment #130)

> The -1/-4/-8 were manually provided (don't know the background on this
> decision).

Other whitelisting rules (HABEAS_*, RCVD_IN_IADB_*, RCVD_IN_BSP_TRUSTED etc) have the same scores as in the previous 50_scores.cf. 

I was wondering why the dnswl.org rules have specifically lower scores than in previous versions - and extremely low scores. This is worrying me, as it would indicate we have a quality issue in the dnswl.org data.
Comment 132 Mark Martinec 2009-10-26 12:26:49 UTC
> Other whitelisting rules (HABEAS_*, RCVD_IN_IADB_*, RCVD_IN_BSP_TRUSTED etc)
> have the same scores as in the previous 50_scores.cf. 

They do not have the same scores; it seems to me they are all mostly
much lower. Please ignore the comments in 50_scores_newest3.cf,
just take into account uncommented 'score' lines:

score HABEAS_ACCREDITED_COI 0
score HABEAS_ACCREDITED_SOI 0 -1.634 0 -0.475

score RCVD_IN_BSP_TRUSTED 0 -0.001 0 -0.001

score RCVD_IN_IADB_DK 0 -0.044 0 -0.001
score RCVD_IN_IADB_DOPTIN 0
score RCVD_IN_IADB_DOPTIN_GT50 0
score RCVD_IN_IADB_DOPTIN_LT50 0 -0.001 0 -0.001
score RCVD_IN_IADB_EDDB 0
score RCVD_IN_IADB_EPIA 0
score RCVD_IN_IADB_GOODMAIL 0
score RCVD_IN_IADB_LISTED 0 -1.144 0 -0.001
score RCVD_IN_IADB_LOOSE 0
score RCVD_IN_IADB_MI_CPEAR 0
score RCVD_IN_IADB_MI_CPR_30 0
score RCVD_IN_IADB_MI_CPR_MAT 0 -0.079 0 -0.001
score RCVD_IN_IADB_ML_DOPTIN 0
score RCVD_IN_IADB_NOCONTROL 0
score RCVD_IN_IADB_OOO 0
score RCVD_IN_IADB_OPTIN 0 -3.265 0 -2.791
score RCVD_IN_IADB_OPTIN_GT50 0 -0.219 0 -1.041
score RCVD_IN_IADB_OPTIN_LT50 0
score RCVD_IN_IADB_OPTOUTONLY 0
score RCVD_IN_IADB_RDNS 0 -0.018 0 -0.001
score RCVD_IN_IADB_SENDERID 0 -0.001 0 -0.001
score RCVD_IN_IADB_SPF 0 -0.006 0 -0.042
score RCVD_IN_IADB_UNVERIFIED_1 0
score RCVD_IN_IADB_UNVERIFIED_2 0
score RCVD_IN_IADB_UT_CPEAR 0
score RCVD_IN_IADB_UT_CPR_30 0
score RCVD_IN_IADB_UT_CPR_MAT 0 -0.001 0 -0.052
score RCVD_IN_IADB_VOUCHED 0 -1.718 0 -0.956

score RCVD_IN_DNSWL_LOW  0 -0.6 0 -1.1
score RCVD_IN_DNSWL_MED  0 -1.5 0 -1.2
score RCVD_IN_DNSWL_HI   0 -1.8 0 -1.8

 
> I was wondering why the dnswl.org rules have specifically lower scores than in
> previous versions - and extremely low scores. This is worrying me, as it would
> indicate we have a quality issue in the dnswl.org data.

These all have pretty low rank:

$ grep RCVD_IN_DNSWL_ freqs.full
OVERALL    SPAM%     HAM%     S/O    RANK   SCORE  NAME
  0.184   0.0005   0.5708    0.001   0.76   -1.80  RCVD_IN_DNSWL_HI
  7.410   0.1094  22.7527    0.005   0.67   -1.20  RCVD_IN_DNSWL_MED
  2.551   0.1810   7.5322    0.023   0.59   -1.10  RCVD_IN_DNSWL_LOW

the _HI gets a low automatic score probably because it hits very little mail,
so it probably needs manual tweaking. The _MED seems to hit too many spam
messages in the submitted logs for rescoring runs, or perhaps it has a high
overlap with other similar rules.

It is quite possible that some of these hits are still false positives,
despite several iterations of cleaning:

for j in spam*.log; do echo -n $j; grep RCVD_IN_DNSWL_HI $j | \
  wc -l; done | sort -k2nr

spam-bayes-net-bb-jhardin.log         3
spam-bayes-net-bb-kmcgrail.log        2
spam-bayes-net-bb-guenther_fraud.log  1
spam-bayes-net-hege.log               1

same on _MED:

spam-bayes-net-bluestreak.log     381
spam-bayes-net-hege.log            79
spam-bayes-net-bb-jhardin.log      23
spam-bayes-net-wt-en1.log          15
spam-bayes-net-bb-kmcgrail.log     14
spam-bayes-net-jm-decimated.log    11
spam-bayes-net-ahenry.log           9
spam-bayes-net-dos-decimated.log    6
spam-bayes-net-bb-zmi.log           3
spam-bayes-net-mmartinec.log        3
spam-bayes-net-wt-en4.log           2
Comment 133 Justin Mason 2009-10-26 13:51:54 UTC
strange, some of the more trustworthy BLs are very low scoring.

RCVD_IN_XBL: 0.404 and 0.722

these have been effectively zeroed, although they are supposed to be immutable:
RCVD_IN_SSC_TRUSTED_COI is 0  (with a 0.012 S/O, low hit rate though)
HABEAS_ACCREDITED_COI is 0    (ditto)
RCVD_IN_BSP_TRUSTED is -0.001  (although with a 0.002 S/O)

the HASHCASH rules likewise aren't supposed to be mutable.

it looks like there might be a bit of a problem there -- definitely some rules that are in immutable sections, like the above, have been allowed to be mutable in ranges.data....
Comment 134 John Hardin 2009-10-26 14:31:20 UTC
(In reply to comment #132)

> $ grep RCVD_IN_DNSWL_ freqs.full
> OVERALL    SPAM%     HAM%     S/O    RANK   SCORE  NAME
>   0.184   0.0005   0.5708    0.001   0.76   -1.80  RCVD_IN_DNSWL_HI
>   7.410   0.1094  22.7527    0.005   0.67   -1.20  RCVD_IN_DNSWL_MED
>   2.551   0.1810   7.5322    0.023   0.59   -1.10  RCVD_IN_DNSWL_LOW
> 
> It is quite possible that some of these hits are still false positives,
> despite several iterations of cleaning:
> 
> for j in spam*.log; do echo -n $j; grep RCVD_IN_DNSWL_HI $j | \
>   wc -l; done | sort -k2nr
> 
> spam-bayes-net-bb-jhardin.log         3
> 
> same on _MED:
> 
> spam-bayes-net-bb-jhardin.log      23

All but one of those are obvious spams, and I've removed the one questionable one from my corpora.

Some of the spam in my corpora is from third parties. I do check it for correct classification before uploading, but I was wondering: how does masscheck determine the correct lastexternal for corpora containing messages from multiple different networks? Or does it assume all of the messages in a given contributor's corpora have the same network boundary? If the latter, I need to remove those third-party messages from my spam corpora...

Might lastexternal confusion in the masschecks be contributing in some way to the odd RCVD_IN_* score generation?
Comment 135 Adam Katz 2009-10-26 16:27:56 UTC
Created attachment 4561
Checker for rules that match more ham than spam

I've updated my checker to an actual perl script (still uses elinks as I don't feel like learning LWP and then parsing HTML).  I've attached the checker, which can be run with custom parameters for a different ruleset, ham threshold, or minimum difference for ham:spam ratio.  Here's the current output, listing all rules that hit 1+% of the ham corpus or that hit 0.05% more of the ham corpus than of the spam corpus.

H^2/S    HAM%    SPAM%    Score in attachment 4558   Rule
331.9    0.3319  0.0010   0                          OBSCURED_EMAIL
117.4    4.8566  0.2009   -0.001                     SPF_HELO_PASS
88.52    5.5735  0.3509   -0.001                     SPF_PASS
85.61    0.2226  0.0026   0.000 2.099 0.001 1.212    MISSING_MIME_HB_SEP
76.18    0.7085  0.0093   0.001 0.001 0.699 0.699    TVD_RCVD_SPACE_BRACKET
66.19    0.2780  0.0042   1.145 1.542 1.912 2.400    FUZZY_CPILL
49.98    1.0676  0.0228   0.001                      MSGID_MULTIPLE_AT
31.82    0.1496  0.0047   1.494 1.699 1.591 1.516    X_IP
21.86    0.1465  0.0067   0                          SUBJECT_FUZZY_TION
20.40   15.6218 11.9604   0.001                      FREEMAIL_FROM
20.00*  40.9055 83.6301   0.001                      HTML_MESSAGE
17.10    0.1710  0        1.222 0.001 0.082 0.476    MIME_BOUND_DIGITS_15
12.95    0.0609  0.0047   0                          HTML_IFRAME_SRC
12.52    0.0714  0.0057   0                          FORGED_IMS_TAGS
11.56    0.0659  0.0057   0.001 0.001 0.605 0.378    HTML_NONELEMENT_30_40
10.83    0.1127  0.0104   0.033 0.001 0.365 0.413    WEIRD_PORT
10.18    0.3494  0.0343   2.205 0.174 1.299 1.806    FRT_SOMA2
9.721    0.8934  0.0919   1.499 0.419 0.904 0.798    MIME_BASE64_BLANKS
8.996    0.2474  0.0275   0.987 0.750 0.943 1.318    CTYPE_001C_B
8.918    0.1525  0.0171   0.001 2.499 0.268 0.516    DRUGS_MUSCLE
8.373    0.0829  0.0099   0.003 0.978 0.100 1.515    TVD_FW_GRAPHIC_NAME_LONG
8.016    0.1956  0.0244   0.001 0.020 0.001 1.799    MIME_BASE64_TEXT
6.850    0.0685  0        0                          HTML_NONELEMENT_40_50
5.404    0.5356  0.0991   0 1.200 0 2.514            SPF_HELO_FAIL
4.237    0.1585  0.0374   2.199 2.199 1.246 2.090    WEIRD_QUOTING
4.159    3.8908  3.6392   0.001                      MIME_QP_LONG_LINE
3.483    0.8570  0.2460   1.799 0.572 1.182 1.138    HTML_IMAGE_RATIO_06
3.219    1.2399  0.4775   1.0                        EXTRA_MPART_TYPE
2.913*  12.1047 50.2891   0 1.1 0 0.7                RDNS_NONE
2.839    0.1164  0.0410   0.001 2.185 1.936 0.476    FRT_SOMA
2.751    0.1172  0.0426   0.1                        ANY_BOUNCE_MESSAGE
2.417    0.6787  0.2808   0.539 0.001 0.332 0.488    MIME_HTML_MOSTLY
2.370    0.1010  0.0426   0.1                        BOUNCE_MESSAGE
2.078    0.5534  0.2663   1.899 0.496 0.950 0.445    HTML_IMAGE_RATIO_08
1.899    1.2077  0.7677   0.001                      TVD_SPACE_RATIO
1.726    0.3227  0.1869   0.023 0.887 0.000 0.417    UPPERCASE_50_75
1.517    0.9658  0.6364   2.801 2.080 1.780 3.387    DATE_IN_PAST_96_XX
1.269    0.4224  0.3327   0.000 0.001 0.264 0.001    HTML_FONT_SIZE_LARGE
1.151    0.5492  0.4770   2.260 0.742 1.199 0.640    MPART_ALT_DIFF
0.913*   1.8488  3.7425   1.154 1.677 1.198 1.453    SUBJ_ALL_CAPS
0.703*   1.3317  2.5216   0.001                      UNPARSEABLE_RELAY
0.278*   3.7480 50.4848   2.199 0.955 1.215 0.549    MIME_HTML_ONLY
0.121*   1.2540 12.9472   0 1.322 0 1.237            RCVD_IN_BL_SPAMCOP_NET

(Anything asterisked is included because it matched >1% of the ham corpus while matching a larger percentage of the spam corpus; everything else matched a larger percentage of the ham corpus than of the spam corpus.)

Mark's fixes solved the immediate issues raised earlier, so I decided to order this by the ratio of ham-corpus hit percentage to spam-corpus hit percentage.  That under-emphasized the heavy ham hitters, so I then multiplied the ratio by the ham percentage again (unless that percentage was under 1).  It's easy enough to browse for non-zero ham% hits.

Any rule with a ratio over 1.000 is a problem when scored positively, unless it is exempt because it applies to popular spam patterns that the corpus is known to lack.  For completeness, this list includes all tests that hit at least 1% of the ham corpus (hence the presence of HTML_MESSAGE, RDNS_NONE, and the four tests with ratios under 1.0).
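
In code, the ordering metric (the H^2/S column) comes down to something like
this -- a sketch only, reverse-engineered from the numbers above rather than
copied from the attached script; the 0.01 divisor floor for a zero spam% is
an assumption:

sub rank_metric {
    my ($ham_pct, $spam_pct) = @_;
    $spam_pct = 0.01 if $spam_pct <= 0;   # assumed floor for a zero spam%
    my $r = $ham_pct / $spam_pct;         # ham:spam hit-percentage ratio
    $r *= $ham_pct if $ham_pct >= 1;      # re-weight heavy ham hitters
    return $r;  # e.g. SPF_HELO_PASS: (4.8566/0.2009)*4.8566 = 117.4
}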
Comment 136 Justin Mason 2009-10-27 07:09:36 UTC
(In reply to comment #133)
> it looks like there might be a bit of a problem there -- definitely some rules
> that are in immutable sections, like the above, have been allowed to be mutable
> in ranges.data....

just wondering, Mark, did you do this deliberately?  or is it just a bug in the tool that it's ignoring the non-mutable flag for those rules for some reason?
Comment 137 Mark Martinec 2009-10-27 14:18:14 UTC
> > it looks like there might be a bit of a problem there -- definitely some
> > rules that are in immutable sections, like the above, have been allowed
> > to be mutable in ranges.data....
> 
> just wondering, Mark, did you do this deliberately?  or is it just a bug
> in the tool that it's ignoring the non-mutable flag for those rules for
> some reason?

Sort-of deliberately. Initially I followed the idea in wiki RescoreMassCheck
section 4.2: 'comment out all "score" lines except for rules that you think
the scores are accurate like carefully-vetted net rules, or 0.001 informational
rules' which made perfect sense to me, so I did it for 50_scores.cf, except
for a couple of rather obvious rules like _WHITELIST and similar, and the ones
clearly indicated as 'indicators' only in the surrounding comments, or set to
0.001. Later I nailed a couple more. I followed a principle: when in doubt,
leave it floating, it can be fixed later if necessary. It gives some insight
into what GA 'thinks' about certain rules.

I think at least for some rules GA makes perfect sense, like RDNS_NONE
and RDNS_DYNAMIC. For some of them the GA result is close to the manually
assigned score, or may indicate a need for reconsidering the assigned score.
But I agree that more may need re-fixing.
Comment 138 Mark Martinec 2009-10-27 14:29:03 UTC
(In reply to comment #134)
> Some of the spam in my corpora is from third parties. I do check it for correct
> classification before uploading, but I was wondering: how does masscheck
> determine the correct lastexternal for corpora containing messages from
> multiple different networks? Or does it assume all of the messages in a given
> contributor's corpora have the same network boundary? If the latter, I need to
> remove those third-party messages from my spam corpora...
> 
> Might lastexternal confusion in the masschecks be contributing in some way to
> the odd RCVD_IN_* score generation?

I believe the mass-checks leave internal/external/msa_networks at their
defaults, unless one takes care to configure them correctly for one's own
corpus. And I believe it is more likely than not that some corpora were
scanned with unsuitable network settings. I know that configuring them
for my own mass-check runs gave me a headache (but I did get it right in
the end). Which is why I posted the following note on the ML at the time:


  From: Mark Martinec <Mark.Martinec+sa@ijs.si>
  To: dev@spamassassin.apache.org
  Subject: Re: SpamAssassin 3.3.0 mass-checks now starting
  Date: Fri, 4 Sep 2009 21:46:59 +0200

  Docs don't say where one is supposed to put a local.cf with
  options which are ignored in masses/spamassassin/user_prefs
  (like Bayes SQL options, DCC, Pyzor timeouts etc).

  I tried to place local.cf into masses/spamassassin/, with
  horrific results (some directives in local.cf were flagged as
  invalid, apparently because plugins had not yet been loaded
  at the time this file was parsed, but only later).

  I finally placed it into ../rules/ as mylocal.cf, which
  works as expected, but I wonder if this is the proper
  solution. Should be documented, I guess...
Comment 139 Justin Mason 2009-10-27 15:00:50 UTC
(In reply to comment #137)
> Sort-of deliberately. Initially I followed the idea in wiki RescoreMassCheck
> section 4.2: 'comment out all "score" lines except for rules that you think
> the scores are accurate like carefully-vetted net rules, or 0.001 informational
> rules' which made perfect sense to me, so I did it for 50_scores.cf, except
> for a couple of rather obvious rules like _WHITELIST and similar, and the ones
> clearly indicated as 'indicators' only in the surrounding comments, or set to
> 0.001. Later I nailed a couple more. I followed a principle: when in doubt,
> leave it floating, it can be fixed later if necessary. It gives some insight
> into what GA 'thinks' about certain rules.

That's true.  It's good to hear it's not a bug in the masses scripts, anyway ;)

> I think at least for some rules GA makes perfect sense, like RDNS_NONE
> and RDNS_DYNAMIC.

Yes, I agree, it's actually done a (surprisingly) good job with those.

> For some of them the GA result is close to the manually
> assigned score, or may indicate a need for reconsidering the assigned score.
> But I agree that more may need re-fixing.

cool.

In particular, some of the DNSBLs and most of the DNSWLs are good to 'lock
down', I feel, as users tend to 'compensate' or correct their scores more
frequently than those of other rules.  Also, if those are given low scores
by the GA, their operators tend to be annoyed, and it's not good to annoy
people we're relying on ;)

It also reflects that those rules are slightly different, and hopefully
more reliable, than a typical body rule, for example -- there's no way to
indicate this to the GA yet, so locking the rules down is the best we can do.
Comment 140 Justin Mason 2009-10-27 15:04:51 UTC
(In reply to comment #138)
> I believe the masschecks leaves internal/external/msa_networks to their
> defaults, unless one cares to configure it correctly for his corpus. And
> I believe that it is more likely than not that some corpora were scanned
> with unsuitable settings of networks. I know that configuring it for my
> mass checks runs it gave me a headache (but I did it right in the end).

What should be happening, though, is that we're just underestimating the
number of -lastexternal rule hits -- the S/O should still be correct, but
the overall number of hits will be lower.  Hopefully that will still
provide a useful estimate of accuracy.


>   Docs don't say where one is supposed to put a local.cf with
>   options which are ignored in masses/spamassassin/user_prefs
>   (like Bayes SQL options, DCC, Pyzor timeouts etc).
> 
>   I tried to place local.cf into masses/spamassassin/, with
>   horrific results (some directives in local.cf were flagged as
>   invalid, apparently because plugins had not yet been loaded
>   at the time this file was parsed, but only later).
> 
>   I finally placed it into ../rules/ as mylocal.cf, which
>   works as expected, but I wonder if this is the proper
>   solution. Should be documented, I guess...

yuck.  bug 6227.
Comment 141 Mark Martinec 2009-10-28 09:02:40 UTC
>> But I agree that more may need re-fixing.
> 
> cool.
> In particular, some of the DNSBLs and most of the DNSWLs are good to 'lock
> down', I feel, as users tend to 'compensate' or correct their scores more
> frequently than those of other rules.  Also, if those are given low scores
> by the GA, their operators tend to be annoyed, and it's not good to annoy
> people we're relying on ;)
> 
> It also reflects that those rules are slightly different, and hopefully
> more reliable, than a typical body rule, for example -- there's no way to
> indicate this to the GA yet, so locking the rules down is the best we can do.

| It is quite possible that some of these hits are still false positives,
| despite several iterations of cleaning

I wonder how much the low scores of some ham rules are affected by false
positives present in the spam* corpora. Here are some statistics for
the more prominent ham rules (i.e. the ones with negative scores).

For each rule, the table shows the number of hits for each corpus - both
as a percentage of all entries in a file, and as an absolute count. The
entries standing out from the crowd, which may need re-checking, are
labeled with *** :

score ALL_TRUSTED -1.000
 0.046 %     1/2194 spam-bayes-net-bb-kmcgrail
 0.017 %    4/23761 spam-bayes-net-mmartinec
 0.014 %    5/36941 spam-bayes-net-hege
 0.001 %    1/81265 spam-bayes-net-bluestreak
 0.000 %   1/931863 spam-bayes-net-dos

score BAYES_00  0 0 -1.2 -1.9
 5.652 %   104/1840 spam-bayes-net-bb-jhardin  ***
 1.805 %  429/23761 spam-bayes-net-mmartinec
 1.606 %    33/2055 spam-bayes-net-ahenry
 0.439 %  357/81265 spam-bayes-net-bluestreak
 0.374 %  138/36941 spam-bayes-net-hege
 0.030 % 445/1489699 spam-bayes-net-jm
 0.017 % 156/931863 spam-bayes-net-dos

score DCC_REPUT_00_12  0 -0.8 0 -0.4
 0.164 %   39/23761 spam-bayes-net-mmartinec

score HABEAS_ACCREDITED_SOI 0 -1.634 0 -0.475
 5.382 %    76/1412 spam-bayes-net-bb-guenther_fraud  ***
 0.272 %     5/1840 spam-bayes-net-bb-jhardin
 0.091 %     2/2194 spam-bayes-net-bb-kmcgrail
 0.059 %   14/23761 spam-bayes-net-mmartinec
 0.049 %   18/36941 spam-bayes-net-hege
 0.037 % 558/1489699 spam-bayes-net-jm
 0.030 %     2/6728 spam-bayes-net-wt-en1
 0.018 %   15/81265 spam-bayes-net-bluestreak
 0.000 %   1/931863 spam-bayes-net-dos

score RCVD_IN_DNSWL_HI  0 -1.8 0 -1.8
 0.163 %     3/1840 spam-bayes-net-bb-jhardin  ***
 0.091 %     2/2194 spam-bayes-net-bb-kmcgrail
 0.071 %     1/1412 spam-bayes-net-bb-guenther_fraud
 0.003 %    1/36941 spam-bayes-net-hege
 0.000 %  1/1489699 spam-bayes-net-jm

score RCVD_IN_DNSWL_MED  0 -1.5 0 -1.2
 1.250 %    23/1840 spam-bayes-net-bb-jhardin  ***
(1.108 %      7/632 spam-bayes-net-binnocenti.OFF)
 0.638 %    14/2194 spam-bayes-net-bb-kmcgrail
 0.469 %  381/81265 spam-bayes-net-bluestreak
 0.438 %     9/2055 spam-bayes-net-ahenry
 0.223 %    15/6728 spam-bayes-net-wt-en1
 0.214 %   79/36941 spam-bayes-net-hege
 0.046 % 682/1489699 spam-bayes-net-jm
 0.042 %     3/7185 spam-bayes-net-bb-zmi
 0.013 %    3/23761 spam-bayes-net-mmartinec
 0.010 %    2/19160 spam-bayes-net-wt-en4
 0.003 %  29/931863 spam-bayes-net-dos

score RCVD_IN_DNSWL_LOW  0 -0.6 0 -1.1
 16.153 % 240627/1489699 spam-bayes-net-jm  ***
(9.810 %     62/632 spam-bayes-net-binnocenti.OFF)
 1.739 %    32/1840 spam-bayes-net-bb-jhardin
 1.600 %  591/36941 spam-bayes-net-hege
 1.159 %    78/6728 spam-bayes-net-wt-en1
 1.133 %    16/1412 spam-bayes-net-bb-guenther_fraud
 0.925 %    19/2055 spam-bayes-net-ahenry
 0.365 %     8/2194 spam-bayes-net-bb-kmcgrail
 0.107 %   87/81265 spam-bayes-net-bluestreak
 0.097 %     7/7185 spam-bayes-net-bb-zmi
 0.022 % 201/931863 spam-bayes-net-dos
 0.021 %    5/23761 spam-bayes-net-mmartinec
 0.016 %    3/19160 spam-bayes-net-wt-en4

score RCVD_IN_BSP_TRUSTED 0 -0.001 0 -0.001
 5.312 %    75/1412 spam-bayes-net-bb-guenther_fraud  ***
 0.030 %     2/6728 spam-bayes-net-wt-en1
 0.029 %    7/23761 spam-bayes-net-mmartinec
 0.029 % 435/1489699 spam-bayes-net-jm
 0.015 %   12/81265 spam-bayes-net-bluestreak
 0.003 %    1/36941 spam-bayes-net-hege
 0.001 %  11/931863 spam-bayes-net-dos

score RCVD_IN_IADB_DK 0 -0.044 0 -0.001
 0.059 %     4/6728 spam-bayes-net-wt-en1
 0.054 %     1/1840 spam-bayes-net-bb-jhardin
 0.033 %   27/81265 spam-bayes-net-bluestreak
 0.004 %    1/23761 spam-bayes-net-mmartinec
 0.001 % 21/1489699 spam-bayes-net-jm

score RCVD_IN_IADB_RDNS 0 -0.018 0 -0.001
 0.342 %    23/6728 spam-bayes-net-wt-en1  ***
 0.054 %     1/1840 spam-bayes-net-bb-jhardin
 0.049 %     1/2055 spam-bayes-net-ahenry
 0.033 %   27/81265 spam-bayes-net-bluestreak
 0.004 %    1/23761 spam-bayes-net-mmartinec
 0.002 % 26/1489699 spam-bayes-net-jm

score RCVD_IN_IADB_OPTIN 0 -3.265 0 -2.791
 0.342 %    23/6728 spam-bayes-net-wt-en1  ***
 0.049 %     1/2055 spam-bayes-net-ahenry
 0.000 %  4/1489699 spam-bayes-net-jm

score RCVD_IN_IADB_OPTIN_GT50 0 -0.219 0 -1.041
 0.054 %     1/1840 spam-bayes-net-bb-jhardin

score RCVD_IN_IADB_DOPTIN 0
 0.000 %  7/1489699 spam-bayes-net-jm

score RCVD_IN_IADB_DOPTIN_LT50 0 -0.001 0 -0.001
 0.026 %   21/81265 spam-bayes-net-bluestreak  ***
 0.001 % 15/1489699 spam-bayes-net-jm

score RCVD_IN_IADB_DOPTIN_GT50 0
 0.007 %    6/81265 spam-bayes-net-bluestreak
 0.004 %    1/23761 spam-bayes-net-mmartinec

score RCVD_IN_IADB_ML_DOPTIN 0
 0.000 %  2/1489699 spam-bayes-net-jm

score RCVD_IN_IADB_UT_CPR_MAT 0 -0.001 0 -0.052
 0.026 %   21/81265 spam-bayes-net-bluestreak  ***
 0.001 % 15/1489699 spam-bayes-net-jm

score RCVD_IN_IADB_MI_CPR_MAT 0 -0.079 0 -0.001
 0.026 %   21/81265 spam-bayes-net-bluestreak  ***
 0.001 % 15/1489699 spam-bayes-net-jm

score RCVD_IN_IADB_LISTED 0 -1.144 0 -0.001
 0.342 %    23/6728 spam-bayes-net-wt-en1  ***
 0.054 %     1/1840 spam-bayes-net-bb-jhardin
 0.049 %     1/2055 spam-bayes-net-ahenry
 0.033 %   27/81265 spam-bayes-net-bluestreak
 0.004 %    1/23761 spam-bayes-net-mmartinec
 0.002 % 26/1489699 spam-bayes-net-jm
 0.000 %   1/931863 spam-bayes-net-dos

score RCVD_IN_IADB_SENDERID 0 -0.001 0 -0.001
 0.208 %    14/6728 spam-bayes-net-wt-en1  ***
 0.049 %     1/2055 spam-bayes-net-ahenry
 0.033 %   27/81265 spam-bayes-net-bluestreak
 0.004 %    1/23761 spam-bayes-net-mmartinec
 0.000 %  4/1489699 spam-bayes-net-jm

score RCVD_IN_IADB_SPF 0 -0.006 0 -0.042
 0.342 %    23/6728 spam-bayes-net-wt-en1  ***
 0.054 %     1/1840 spam-bayes-net-bb-jhardin
 0.049 %     1/2055 spam-bayes-net-ahenry
 0.033 %   27/81265 spam-bayes-net-bluestreak
 0.004 %    1/23761 spam-bayes-net-mmartinec
 0.002 % 26/1489699 spam-bayes-net-jm

score RCVD_IN_IADB_VOUCHED 0 -1.718 0 -0.956
 0
Comment 142 Mark Martinec 2009-10-28 10:23:19 UTC
It seems to me that many (most?) of the supposed HABEAS_ACCREDITED_SOI
false positives are due to freelotto.com mail. I wonder whether such
samples rightfully belong in the spam* corpora - I'd say yes, but, as
they say, spam is about consent, not content, and people receiving mail
from freelotto.com most likely did register once, not realizing what
they were dealing with. So there was consent, at least initially. It is
also about fraud and advertising, though. So, should one leave such mail
samples in the spam corpus or not?
Comment 143 Mark Martinec 2009-10-28 10:41:31 UTC
> Seems to me that many / most(?) HABEAS_ACCREDITED_SOI supposedly
> false positives are due to freelotto.com mail.

Same applies to RCVD_IN_BSP_TRUSTED spam hits.
Comment 144 Warren Togami 2009-10-29 18:33:38 UTC
What is the next step to move this forward?
Comment 145 Adam Katz 2009-11-04 15:52:15 UTC
Created attachment 4564 [details]
Checker for rules that match more ham than spam

Updated my checker to use S/O (now that I understand that stat).  It also supports specifying the DateRev for a specific masscheck run.  Since today's run was sparse, here are yesterday's results.
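
For reference, the S/O column is the spam/overall hit ratio; the numbers
below are consistent with computing it from the two percentage columns,
roughly like this (a sketch, not the script's actual code):

sub s_over_o {
    my ($ham_pct, $spam_pct) = @_;
    my $total = $ham_pct + $spam_pct;           # corpus-normalized hits
    return $total ? $spam_pct / $total : 0.5;   # 0.5 assumed when no hits
}

E.g. SPF_HELO_PASS below: 0.2279 / (0.2279 + 4.4862) = .048, matching the table.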

$ ./sa33badrules.pl 20091103-r832343-n
 S/O RANK HAM%    SPAM%   Score in attachment 4558 [details] Rule
.008 .12  1.2401  0.0105  0.001                    MSGID_MULTIPLE_AT
.011 .22  0.3066  0.0035  0                        OBSCURED_EMAIL
.012 .25  0.2058  0.0025  0.000 2.099 0.001 1.212  MISSING_MIME_HB_SEP
.014 .17  0.5822  0.0080  0.001 0.001 0.699 0.699  TVD_RCVD_SPACE_BRACKET
.028 .20  0.4339  0.0125  unknown                  TVD_FUZZY_SECTOR
.042 .28  0.1732  0.0075  0                        SUBJECT_FUZZY_TION
.048 .77  4.4862  0.2279  -0.001                   SPF_HELO_PASS
.052 .29  0.1476  0.0080  1.494 1.699 1.591 1.516  X_IP
.055 .22  0.3914  0.0226  2.205 0.174 1.299 1.806  FRT_SOMA2
.062 .74  5.1484  0.3424  -0.001                   SPF_PASS
.077 .25  0.2643  0.0221  0.987 0.750 0.943 1.318  CTYPE_001C_B
.079 .36  0.0640  0.0055  0.001 0.001 0.605 0.378  HTML_NONELEMENT_30_40
.080 .28  0.1742  0.0151  0.001 2.499 0.268 0.516  DRUGS_MUSCLE
.084 .36  0.0660  0.0060  0                        FORGED_IMS_TAGS
.090 .32  0.1114  0.0110  0.033 0.001 0.365 0.413  WEIRD_PORT
.092 .21  0.8712  0.0878  1.499 0.419 0.904 0.798  MIME_BASE64_BLANKS
.102 .37  0.0577  0.0065  0                        HTML_IFRAME_SRC
.123 .34  0.0821  0.0115  0.003 0.978 0.100 1.515  TVD_FW_GRAPHIC_NAME_LONG
.128 .37  0.0614  0.0090  0                        RCVD_BAD_ID
.130 .29  0.1851  0.0276  0.001 0.020 0.001 1.799  MIME_BASE64_TEXT
.178 .28  0.4948  0.1069  0 1.200 0 2.514          SPF_HELO_FAIL
.202 .32  0.1590  0.0402  0.1                      ANY_BOUNCE_MESSAGE
.205 .35  0.0817  0.0211  2.199 1.622 2.199 1.086  LONGWORDS
.213 .34  0.1186  0.0321  0                        BLANK_LINES_80_90
.216 .32  0.1474  0.0407  2.199 2.199 1.246 2.090  WEIRD_QUOTING
.218 .32  0.1445  0.0402  0.1                      BOUNCE_MESSAGE
.223 .30  0.7605  0.2179  1.799 0.572 1.182 1.138  HTML_IMAGE_RATIO_06
.241 .34  1.3973  0.4438  1.0                      EXTRA_MPART_TYPE
.254 .34  0.1222  0.0417  0.001 2.185 1.936 0.476  FRT_SOMA
.283 .33  0.6883  0.2711  0.539 0.001 0.332 0.488  MIME_HTML_MOSTLY
.299 .36  0.0908  0.0387  0.799 0.001 0.711 0.026  TVD_FW_GRAPHIC_NAME_MID
.303 .34  0.4938  0.2143  1.899 0.496 0.950 0.445  HTML_IMAGE_RATIO_08
.367 .40  1.2775  0.7409  0.001                    TVD_SPACE_RATIO
.379 .37  0.3182  0.1943  0.023 0.887 0.000 0.417  UPPERCASE_50_75
.434 .39  0.3261  0.2505  3.099 1.823 1.802 1.998  BAD_ENC_HEADER
.436 .46 15.3798 11.8920  0.001                    FREEMAIL_FROM
.454 .41  0.5503  0.4573  2.260 0.742 1.199 0.640  MPART_ALT_DIFF
.516 .47  3.6581  3.9024  0.001                    MIME_QP_LONG_LINE
.655 .51  1.9537  3.7036  1.154 1.677 1.198 1.453  SUBJ_ALL_CAPS
.665 .49 42.2269 83.7383  0.001                    HTML_MESSAGE
.692 .52  1.1850  2.6580  0.001                    UNPARSEABLE_RELAY
.922 .58  1.1584 13.7423  0 1.322 0 1.237          RCVD_IN_BL_SPAMCOP_NET
.935 .57  3.5421 50.6034  2.199 0.955 1.215 0.549  MIME_HTML_ONLY
.970 .52  1.5729 51.1430  0 1.1 0 0.7              RDNS_NONE

Note, I hacked RDNS_NONE so that it removes the Enron hits.

"Problem" rules this week include X_IP, EXTRA_MPART_TYPE, FRT_SOMA2, and BAD_ENC_HEADER (scored 3.099?!).

Food for thought:  while it's good to create workarounds for the problematic outcomes from the genetic algorithm, I think these should also serve as examples with which to troubleshoot the algorithm itself.  While this might just be an early sign of over-fitting (which is largely fine as long as we comb through the results with scripts like this), it might also be indicative of a problem in the system's prioritization.
Comment 146 Mark Martinec 2009-11-06 12:33:41 UTC
Created attachment 4565 [details]
resulting 50_scores.cf from garescorer runs - V5

A new run; this time I left the URIBL whitelists and similar rules fixed
(at their relatively high manual scores), as they were in the current 50_scores.cf.
Comment 147 Mark Martinec 2009-11-06 12:38:36 UTC
Corresponding GA summaries ('$ head test scores' for each run):

gen-set3-20-5.0-14000-ga-best
==> test <==
# SUMMARY for threshold 5.0:
# Correctly non-spam:  21171  99.93%
# Correctly spam:      43624  98.84%
# False positives:        15  0.07%
# False negatives:       510  1.16%
# TCR(l=50): 35.026984  SpamRecall: 98.844%  SpamPrec: 99.966%
==> scores <==
# SUMMARY for threshold 5.0:
# Correctly non-spam: 168144  32.193%  (99.979% of non-spam corpus)
# Correctly spam:     349846  66.982%  (98.794% of spam corpus)
# False positives:        35  0.007%  (0.021% of nonspam,   8289 weighted)
# False negatives:      4270  0.818%  (1.206% of spam,  13858 weighted)
# Average score for spam:  21.3    nonspam: -3.2
# Average for false-pos:   5.6  false-neg: 3.2
# TOTAL:              522295  100.00%


gen-set2-10-5.0-6500-ga-best
==> test <==
# SUMMARY for threshold 5.0:
# Correctly non-spam:  21149  99.83%
# Correctly spam:      41755  94.61%
# False positives:        37  0.17%
# False negatives:      2379  5.39%
# TCR(l=50): 10.436037  SpamRecall: 94.610%  SpamPrec: 99.911%
==> scores <==
# SUMMARY for threshold 5.0:
# Correctly non-spam: 167927  32.152%  (99.850% of non-spam corpus)
# Correctly spam:     335063  64.152%  (94.620% of spam corpus)
# False positives:       252  0.048%  (0.150% of nonspam,  29229 weighted)
# False negatives:     19053  3.648%  (5.380% of spam,  68835 weighted)
# Average score for spam:  11.1    nonspam: -1.0
# Average for false-pos:   5.5  false-neg: 3.6
# TOTAL:              522295  100.00%


gen-set1-10-5.0-14000-ga-best
==> test <==
# SUMMARY for threshold 5.0:
# Correctly non-spam:  21151  99.83%
# Correctly spam:      43145  97.76%
# False positives:        35  0.17%
# False negatives:       989  2.24%
# TCR(l=50): 16.113180  SpamRecall: 97.759%  SpamPrec: 99.919%
==> scores <==
# SUMMARY for threshold 5.0:
# Correctly non-spam: 168009  32.167%  (99.899% of non-spam corpus)
# Correctly spam:     346230  66.290%  (97.773% of spam corpus)
# False positives:       170  0.033%  (0.101% of nonspam,  20632 weighted)
# False negatives:      7886  1.510%  (2.227% of spam,  22952 weighted)
# Average score for spam:  20.1    nonspam: -1.5
# Average for false-pos:   5.8  false-neg: 2.9
# TOTAL:              522295  100.00%


gen-set0-5-5.0-14000-ga-best
==> test <==
# SUMMARY for threshold 5.0:
# Correctly non-spam:  20925  98.77%
# Correctly spam:      36049  81.68%
# False positives:       261  1.23%
# False negatives:      8085  18.32%
# TCR(l=50): 2.088195  SpamRecall: 81.681%  SpamPrec: 99.281%
==> scores <==
# SUMMARY for threshold 5.0:
# Correctly non-spam: 166235  31.828%  (98.844% of non-spam corpus)
# Correctly spam:     288300  55.199%  (81.414% of spam corpus)
# False positives:      1944  0.372%  (1.156% of nonspam, 128482 weighted)
# False negatives:     65816  12.601%  (18.586% of spam, 202271 weighted)
# Average score for spam:  10.5    nonspam: 0.6
# Average for false-pos:   6.3  false-neg: 3.1
# TOTAL:              522295  100.00%
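
Aside: the TCR(l=50) figures above are consistent with nspam / (lambda*FP + FN),
where nspam counts all spam messages (caught plus missed).  A sketch, assuming
that is indeed the formula used:

sub tcr {
    my ($nspam, $fp, $fn, $lambda) = @_;
    $lambda = 50 unless defined $lambda;
    # e.g. the set3 'test' block: (43624+510) / (50*15 + 510) = 35.03
    return $nspam / ($lambda * $fp + $fn);
}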
Comment 148 Mark Martinec 2009-11-06 16:31:11 UTC
Created attachment 4566 [details]
GA cost vs. iterations

Here is a somewhat interesting diagram showing how the 'cost' being
optimized by the GA decreases over the iterations. The data comes from the
nohup.out log file, where each GA iteration looks like:

123456789
Pop size, replacement: 50 33

Adapt (t, fneg, fneg_add, fpos, fpos_add): 1250 4776 0 0 0
Adapt (over, cross, repeat): 1 1 4131
Performance: 0.672 iterations/s, iteration no. 10900

# SUMMARY for threshold 5.0:
# Correctly non-spam: 168144  32.193%  (99.979% of non-spam corpus)
# Correctly spam:     349845  66.982%  (98.794% of spam corpus)
# False positives:        35  0.007%  (0.021% of nonspam,   8290 weighted)
# False negatives:      4271  0.818%  (1.206% of spam,  13863 weighted)
# Average score for spam:  21.1    nonspam: -3.2
# Average for false-pos:   5.6  false-neg: 3.2
# TOTAL:              522295  100.00%

From the above, the extracted data for this iteration is (a parsing sketch follows the list):
- iteration count: 10900
- FP weighted: 8290
- FN weighted: 13863
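
A small script along these lines (a sketch, assuming the log lines look
exactly as quoted above) pulls out one such triple per iteration, ready
for gnuplot:

#!/usr/bin/perl
# sketch: extract (iteration, FP-weighted, FN-weighted) from nohup.out
use strict;
use warnings;
my ($iter, $fp_w);
while (<>) {
    if (/iteration no\. (\d+)/)                 { $iter = $1 }
    elsif (/False positives:.*?(\d+) weighted/) { $fp_w = $1 }
    elsif (/False negatives:.*?(\d+) weighted/) {
        print "$iter $fp_w $1\n" if defined $iter && defined $fp_w;
        ($iter, $fp_w) = (undef, undef);
    }
}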

So the chart plots FP weighted and FN weighted cost against iteration count.
Each of the four colours corresponds to one set (set3: net+bayes,
set2: nonet+bayes, set1: net+nobayes, set0: nonet+nobayes).
The thicker line of each pair is the FP line, the thinner one the FN line.

The purpose of the chart is to determine whether the chosen max-iterations
limit is sensible: still gaining some benefit, without running into
overfitting or wasting too much time.

One safety valve against overfitting is to check whether the 10% test
sample produces results similar to those of the learning set (90%).
The other test I made was to repeat the runs with a limit of about
5000 iterations (instead of 14000) and compare the results - which
are indeed similar.
Comment 149 Mark Martinec 2009-11-06 16:34:46 UTC
Created attachment 4567 [details]
Scaled diagram of the previous one, only sets 3 and 1 shown

Here is the same diagram as above, but scaled so as not to be compressed
by the poor results of set 0. Also, only two score sets are shown, 1 and 3,
i.e. both sets with network tests, without and with bayes.
Comment 150 Justin Mason 2009-11-07 13:33:19 UTC
(In reply to comment #146)
> Created an attachment (id=4565) [details]
> resulting 50_scores.cf from garescorer runs - V5
> 
> A new run, this time I left the URIBL whitelists and similar fixed
> (at their relatively high manual scores) as they were in current 50_scores.cf

After a little examination, they look good to me!  +1 to check in.

RCVD_IN_XBL is still surprisingly low -- I bet there's some additive behaviour overlapping between XBL and PBL, though.  

RCVD_IN_SBL is _very_ low in set 3 too, bizarre!

otherwise I can't see any issues....



btw if you feel like cranking up the max gens, go for it.  fwiw, spamassassin2.zones has a very powerful CPU -- if it's taking too long on your own machine, try scp'ing stuff up and running it there.
Comment 151 Warren Togami 2009-11-07 15:46:54 UTC
Please manually adjust the score of RCVD_IN_PSBL upward.  At the time of the rescore masscheck, PSBL had not yet whitelisted hotmail, yahoo, gmail and a number of major ISPs.  As a result, for 5 weeks straight RCVD_IN_PSBL has been almost completely devoid of FPs in our weekly masschecks.  I am confident that PSBL performs more safely than it measured during the rescore masscheck.

http://ruleqa.spamassassin.org/20090829-r809102-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20090905-r811608-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20090912-r814117-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20090919-r816871-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20090926-r819101-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20091003-r821273-n/RCVD_IN_PSBL/detail
(below this point FP rate dropped to nearly zero)
http://ruleqa.spamassassin.org/20091010-r823821-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20091017-r826198-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20091024-r829323-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20091031-r831520-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20091107-r833654-n/RCVD_IN_PSBL/detail
You can plainly see steady and sustained improvement in FP safety in these past weeks.

RCVD_IN_PSBL in the rescore masscheck ran without lastexternal.  Clearly, with the added restriction to lastexternal, it is safer than measured.
Comment 152 Mark Martinec 2009-11-08 16:36:24 UTC
> > A new run, this time I left the URIBL whitelists and similar fixed
> > (at their relatively high manual scores) as they were in current
> > 50_scores.cf

Or to put it better: unlike my previous runs, where I commented out most
scores in the existing 50_scores.cf (thus making them mutable, regardless
of any <gen:mutable> markup) except for a couple of exceptions, this time
I did not comment out scores, and let the <gen:mutable> markup do its job.
So this is now closer to how the GA was intended to be run.
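
For anyone unfamiliar with the markup: 50_scores.cf brackets the GA-editable
scores in comment markers, roughly like this (an illustration with a made-up
body rule, not an excerpt from the real file):

# <gen:mutable>
score SOME_BODY_RULE 1.2 0.8 1.1 0.9   # the GA may rewrite these values
# </gen:mutable>
score RCVD_IN_DNSWL_HI 0 -1.8 0 -1.8   # outside the markers: left alone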

> After a little examination, they look good to me!  +1 to check in.

Thanks. I'm sure we can still do some manual tweaks and improvements,
but perhaps we can indeed freeze the rest at the automatically assigned
scores from this run.

> btw if you feel like cranking up the max gens, go for it.  fwiw,
> spamassassin2.zones has a very powerful CPU -- if it's taking too long
> on your own machine, try scping stuff up and running it there.

My office workstation is quite beefy too, and I hope we won't need to do
many further runs, so for now I'd just stick to what I'm familiar with.
Btw, my set3 run at 14000 iterations takes 5 hours, and similar for set1;
the other two are much faster (less than 30 minutes each). I just let it
run overnight, so the exact runtime doesn't really matter. I did some
previous runs at 30000 iterations, and a diagram (like the one attached
earlier) shows no noticeable improvement beyond about 10000 iterations, or
even a small worsening by the end, so the 14000 limit seems reasonable.
And GAs are said to be prone to overfitting, so it's probably prudent
not to push too far.



> RCVD_IN_XBL is still surprisingly low -- I bet there's some additive
> behaviour overlapping between XBL and PBL, though.
> RCVD_IN_SBL is _very_ low in set 3 too, bizarre!
> otherwise I can't see any issues....

| Please manually adjust the scores of RCVD_IN_PSBL up.  At the time of the
| rescore masscheck PSBL had not yet whitelisted hotmail, yahoo, gmail and a
| number of major ISP's.  As a result, for 5 weeks straight RCVD_IN_PSBL has
| been almost completely devoid of FP's in our weekly masschecks.  I am
| confident that PSBL performs safer than measured during the rescore masscheck

Ok, I suggest we collect some manual fixes like the ones suggested here
(with specific score suggestions), and wrap it up.
Comment 153 Adam Katz 2009-11-09 15:40:31 UTC
Created attachment 4568 [details]
Checker for rules that match more ham than spam

Collected selections from several more runs of my script.  I took the last three days' worth of masschecks plus last week's run, hand-picked rules with a high score (~1.0 or more) but a low S/O (~0.250 or less), and then looked for repeat offenders.  This is the list, with each rule's worst S/O from any run (the selection criterion is sketched in code after the table):

 S/O RANK HAM%    SPAM%   Score attachment 4565 [details] Rule
.002 .14  1.2650  0.0024  0.001 0.001 0.131 0.700  TVD_RCVD_SPACE_BRACKET
.002 .23  0.4472  0.0008  0.000 2.099 0.001 1.711  MISSING_MIME_HB_SEP
.019 .22  0.2529  0.0049  1.482 0.855 2.399 2.399  FUZZY_CPILL
.019 .29  0.2809  0.0056  0.001 1.699 1.498 1.699  X_IP
.046 .22  0.4010  0.0193  2.385 0.345 0.998 2.503  FRT_SOMA2
.077 .25  0.2643  0.0221  0.551 1.026 1.033 1.250  CTYPE_001C_B
.092 .21  0.8712  0.0878  0.699 0.332 0.480 0.800  MIME_BASE64_BLANKS
.095 .31  0.2735  0.0286  2.200 2.199 0.540 2.199  WEIRD_QUOTING
.178 .28  0.4948  0.1069  0 0.973 0 2.385          SPF_HELO_FAIL
.195 .29  0.8975  0.2173  1.799 0.579 0.901 0.882  HTML_IMAGE_RATIO_06
.241 .34  1.4248  0.4529  1.0                      EXTRA_MPART_TYPE
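
In other words, the selection was roughly this (a sketch with assumed data
structures, not my actual script):

my @runs;  # assumed: one parsed freqs table per masscheck DateRev (loading not shown)
my %times_flagged;
for my $run (@runs) {
    for my $rule (@{ $run->{rules} }) {
        # {score}: the rule's highest score across the four sets (assumed)
        $times_flagged{ $rule->{name} }++
            if $rule->{score} >= 1.0 && $rule->{so} <= 0.250;
    }
}
my @repeat_offenders = grep { $times_flagged{$_} > 1 } keys %times_flagged;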

I don't think it wise to release with these scores quite so high.  I propose we score them all 0.1 or 0.001 so as not to hold up the release, and bookmark the issue (likely a bug in the GA, probably best registered as its own bugzilla bug) to deal with later.


Additionally, I've updated my script to do the reverse: seek out negatively scored rules that hit more spam than ham.  This doesn't currently find anything beyond SPF_PASS (due to it having >=1% spam hits, whereas it was previously found for having ham>spam), but it does prevent listing SPF_HELO_PASS and should theoretically help find poorly-written ham rules in the future.
Comment 154 Warren Togami 2009-11-11 11:38:13 UTC
(In reply to comment #152)
> 
> | Please manually adjust the scores of RCVD_IN_PSBL up.  At the time of the
> | rescore masscheck PSBL had not yet whitelisted hotmail, yahoo, gmail and a
> | number of major ISP's.  As a result, for 5 weeks straight RCVD_IN_PSBL has
> | been almost completely devoid of FP's in our weekly masschecks.  I am
> | confident that PSBL performs safer than measured during the rescore masscheck
> 
> Ok, I suggest we collect some manual fixes like the ones suggested here
> (with specific score suggestions), and wrap it up.

Let's just go ahead with committing as jm suggested in Comment #153, and make the manual adjustments after that in separate commits, each with an explanation.

RCVD_IN_PSBL I suggest 2.7 for both network sets.

Adam Katz in Comment #153 makes a good argument for reducing those rules to informational.  Any comments on that?
Comment 155 Justin Mason 2009-11-11 14:13:16 UTC
(In reply to comment #154)
> (In reply to comment #152)
> > 
> > | Please manually adjust the scores of RCVD_IN_PSBL up.  At the time of the
> > | rescore masscheck PSBL had not yet whitelisted hotmail, yahoo, gmail and a
> > | number of major ISP's.  As a result, for 5 weeks straight RCVD_IN_PSBL has
> > | been almost completely devoid of FP's in our weekly masschecks.  I am
> > | confident that PSBL performs safer than measured during the rescore masscheck
> > 
> > Ok, I suggest we collect some manual fixes like the ones suggested here
> > (with specific score suggestions), and wrap it up.
> 
> Let's just go ahead with committing as jm suggested in Comment #153 and make
> the manual adjustments after that in separate commits each with explanations.
> 
> RCVD_IN_PSBL I suggest 2.7 for both network sets.
> 
> Adam Katz in Comment #153 makes a good argument for reducing those rules to
> informational.  Any comments on that?

+1 to all ;)
Comment 156 Warren Togami 2009-11-11 15:42:49 UTC
I might have to eat my words.  Applying these new scores did not improve my own statistics.

ORIGINAL SCORES
./fp-fn-statistics  -s 3 (wt-* 20091107 weekly logs)

# SUMMARY for threshold 5.0:
# Correctly non-spam:  29677  99.82%
# Correctly spam:      21106  90.42%
# False positives:        54  0.18%
# False negatives:      2235  9.58%
# TCR(l=50): 4.729686  SpamRecall: 90.425%  SpamPrec: 99.745%

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155#c146
GA SCORES
./fp-fn-statistics  -s 3 (wt-* 20091107 weekly logs)

# SUMMARY for threshold 5.0:
# Correctly non-spam:  29624  99.64%
# Correctly spam:      21039  90.14%
# False positives:       107  0.36%
# False negatives:      2302  9.86%
# TCR(l=50): 3.050314  SpamRecall: 90.138%  SpamPrec: 99.494%

(In reply to comment #153)
> Created an attachment (id=4568) [details]
> Checker for rules that match more ham than spam
> 
> Collected selections from several more runs of my script.  I took the last
> three days' worth of masschecks plus the run last week, hand-picked rules with
> a high score (~1.0+) but low S/O (~0.250-), and then looked for repeat
> offenders.  This is the list, with each rule's worst S/O of any run:
> 
>  S/O RANK HAM%    SPAM%   Score attachment 4565 [details] Rule
> .195 .29  0.8975  0.2173  1.799 0.579 0.901 0.882  HTML_IMAGE_RATIO_06

score HTML_IMAGE_RATIO_02 2.199 0.805 1.200 0.437
score HTML_IMAGE_RATIO_04 2.089 0.610 0.607 0.556
score HTML_IMAGE_RATIO_06 1.799 0.579 0.901 0.882
score HTML_IMAGE_RATIO_08 1.410 0.351 0.874 0.021

Is it logical to zero out HTML_IMAGE_RATIO_06 when these others have scores?  It feels like either our corpus sample size was not large and varied enough, or we are doing something else wrong.  These particular rules got much lower scores from the 3.2.0 GA.

>  S/O RANK HAM%    SPAM%   Score attachment 4565 [details] Rule
> .241 .34  1.4248  0.4529  1.0                      EXTRA_MPART_TYPE

I suppose this is the clearest case of a rule we should zero out.
Comment 157 Warren Togami 2009-11-12 10:07:55 UTC
TVD_RCVD_SPACE_BRACKET
MISSING_MIME_HB_SEP
FUZZY_CPILL
X_IP Bug #5920 appears not fixed as claimed.
FRT_SOMA2
CTYPE_001C_B
MIME_BASE64_BLANKS
WEIRD_QUOTING
SPF_HELO_FAIL
EXTRA_MPART_TYPE

It appears to be correct to zero out these rules, or at least make them informational.

spamassassin-3.2.5
score HTML_IMAGE_RATIO_02 1.518 0.550 0.573 0.383
score HTML_IMAGE_RATIO_04 1.561 0.170 0.863 0.172
score HTML_IMAGE_RATIO_06 0.401 0.001 0.501 0.001
score HTML_IMAGE_RATIO_08 0.203 0.001 0.179 0.001

attachment 4565 [details]
resulting 50_scores.cf from garescorer runs - V5
score HTML_IMAGE_RATIO_02 2.199 0.805 1.200 0.437
score HTML_IMAGE_RATIO_04 2.089 0.610 0.607 0.556
score HTML_IMAGE_RATIO_06 1.799 0.579 0.901 0.882
score HTML_IMAGE_RATIO_08 1.410 0.351 0.874 0.021

The old scores showed a more linear relationship, with a sharp drop-off between _04 and _06.  Our masscheck results indicate that _02 and _04 hit more spam than ham, but _06 and _08 are pretty worthless.  I think we should zero out _06 and _08 while reducing the scores of _02 and _04.
Comment 158 Adam Katz 2009-11-12 16:20:15 UTC
(In reply to comment #157)
> spamassassin-3.2.5
> score HTML_IMAGE_RATIO_02 1.518 0.550 0.573 0.383
> score HTML_IMAGE_RATIO_04 1.561 0.170 0.863 0.172
> score HTML_IMAGE_RATIO_06 0.401 0.001 0.501 0.001
> score HTML_IMAGE_RATIO_08 0.203 0.001 0.179 0.001
> 
> attachment 4565 [details]
> resulting 50_scores.cf from garescorer runs - V5
> score HTML_IMAGE_RATIO_02 2.199 0.805 1.200 0.437
> score HTML_IMAGE_RATIO_04 2.089 0.610 0.607 0.556
> score HTML_IMAGE_RATIO_06 1.799 0.579 0.901 0.882
> score HTML_IMAGE_RATIO_08 1.410 0.351 0.874 0.021
> 
> The old scores showed a more linear relationship, with a sharp drop-off
> between _04 and _06.  Our masscheck results indicate _02 and _04 hit on
> more spam than ham, but _06 and _08 are pretty worthless.  I think we
> should zero out _06 and _08 while reducing the scores of _02 and _04.

I didn't mention _08 because its HAM > SPAM margin wasn't remarkable enough (my script only reports when HAM% > SPAM% + 0.05) and my hand-sampling used S/O ratios under .250, while this rule is at .320.  Still, it has the problem:

SPAM%   HAM%    S/O    RANK  SCORE NAME                DateRev
0.2709  0.5491  0.330  0.34  0.20  HTML_IMAGE_RATIO_08 20091111-r834803-n
0.2717  0.5492  0.331  0.34  0.20  HTML_IMAGE_RATIO_08 20091110-r834389-n
0.2672  0.5493  0.327  0.34  0.20  HTML_IMAGE_RATIO_08 20091109-r833997-n
0.2075  0.4995  0.294  0.34  0.20  HTML_IMAGE_RATIO_08 20091104-r832683-n
0.2548  0.5476  0.318  0.34  0.20  HTML_IMAGE_RATIO_08 20091028-r830464-n

Here are the results from the 20091111-r834803-n set, pruning only rules scoring under 0.2 (all hits from my last report are present and asterisked):

 S/O RANK HAM%    SPAM%   Score in attachment 4565 [details] Rule
.014 .15  0.6328  0.0093  0.001 0.001 0.131 0.700  TVD_RCVD_SPACE_BRACKET*
.015 .24  0.1927  0.0029  0.000 2.099 0.001 1.711  MISSING_MIME_HB_SEP*
.019 .22  0.2528  0.0049  1.482 0.855 2.399 2.399  FUZZY_CPILL*
.043 .29  0.1298  0.0059  0.001 1.699 1.498 1.699  X_IP*
.075 .35  0.0603  0.0049  0.000 0.001 0.308 0.001  HTML_NONELEMENT_30_40
.092 .21  0.8123  0.0825  0.699 0.332 0.480 0.800  MIME_BASE64_BLANKS*
.106 .25  0.2483  0.0293  0.551 1.026 1.033 1.250  CTYPE_001C_B*
.123 .33  0.0837  0.0117  0.001 0.648 0.836 1.293  TVD_FW_GRAPHIC_NAME_LONG
.123 .28  0.1632  0.0229  0.001 2.499 0.392 0.164  DRUGS_MUSCLE(*)
.130 .25  0.3663  0.0547  2.385 0.345 0.998 2.503  FRT_SOMA2*
.155 .29  0.1736  0.0317  0.001 0.001 0.001 1.741  MIME_BASE64_TEXT
.188 .27  0.4622  0.1069  0 0.973 0 2.385          SPF_HELO_FAIL*
.214 .31  0.1449  0.0395  2.200 2.199 0.540 2.199  WEIRD_QUOTING*
.239 .30  0.8321  0.2612  1.799 0.579 0.901 0.882  HTML_IMAGE_RATIO_06*
.254 .34  1.3070  0.4442  1.0                      EXTRA_MPART_TYPE*
.330 .34  0.5491  0.2709  1.410 0.351 0.874 0.021  HTML_IMAGE_RATIO_08
.363 .38  1.0856  0.6194  2.600 2.070 1.233 3.405  DATE_IN_PAST_96_XX
.368 .36  0.3029  0.1767  0.001 0.791 0.001 0.008  UPPERCASE_50_75
.381 .37  0.6473  0.3983  0.354 0.001 0.725 0.428  MIME_HTML_MOSTLY
.660 .51  1.8514  3.5893  0.518 1.625 1.197 1.506  SUBJ_ALL_CAPS
.905 .58  1.0822 10.2987  0 1.246 0 1.347          RCVD_IN_BL_SPAMCOP_NET
.934 .56  3.6172 51.2001  2.199 1.105 1.199 0.723  MIME_HTML_ONLY
.957 .52  2.2200 50.3063  2.399 1.274 1.228 0.793  RDNS_NONE

DRUGS_MUSCLE met all the requirements I set for my last report, but I removed it because it had almost no hits anyway, and it scored very low except on net+no-bayes, so I assumed it had some justification there somehow.
Comment 159 Justin Mason 2009-11-16 16:27:51 UTC
will we go ahead and check in those scores, anyway?  that would allow another beta (soon).

re: HTML_IMAGE_RATIO_* -- it's very common for that kind of "multi-valued" set of rules to wind up with nonintuitive scoring.  This happens due to either low hit-rates or hitting alongside other (better) rules.
Comment 160 Warren Togami 2009-11-16 18:28:03 UTC
(In reply to comment #142)
> Seems to me that many / most(?) HABEAS_ACCREDITED_SOI supposedly
> false positives are due to freelotto.com mail. I wonder whether such
> samples are rightfully in the spam* corpora - I'd say yes, but,
> as they say, spam is about consent, not content, and people receiving
> mail from freelotto.com most likely did register once, not realizing
> what they are dealing with. So there was a consent, at least initially.
> It is also about fraud and advertising, so, should one leave such
> mail samples in the spam corpus or not?

Perhaps we should explicitly exclude known sketchy senders like freelotto.com from HABEAS_ACCREDITED_SOI.  This would let us monitor for clear violators more easily, without being distracted by the common FPs.  Exclusion in this case only brings the listed sender back to neutral, which is pretty clearly a good idea.

Any objections?  Otherwise I'll file a separate bug for this.
Comment 161 Warren Togami 2009-11-16 19:27:50 UTC
-score RDNS_NONE             0.1
-score RDNS_DYNAMIC          0.1
+# score RDNS_NONE     0 1.1 0 0.7
+# score RDNS_DYNAMIC  0 0.5 0 0.5

These are supposed to be informational rules, according to the comment.  Are they supposed to end up commented out?  Doesn't a commented-out score mean the rule defaults to 1 point?
Comment 162 Warren Togami 2009-11-16 21:28:44 UTC
fp-fn-statistics across the entire "rescore" logs.

Set 3 Before
===========
# SUMMARY for threshold 5.0:
# Correctly non-spam: 703647  99.90%
# Correctly spam:     2559525  98.28%
# False positives:       719  0.10%
# False negatives:     44795  1.72%
# TCR(l=50): 32.253638  SpamRecall: 98.280%  SpamPrec: 99.972%

Set 3 Raw Rescoring from Comment #146
==================================
# SUMMARY for threshold 5.0:
# Correctly non-spam: 703520  99.88%
# Correctly spam:     2548134  97.84%
# False positives:       846  0.12%
# False negatives:     56186  2.16%
# TCR(l=50): 26.443555  SpamRecall: 97.843%  SpamPrec: 99.967%

Doesn't look like an improvement.

Set 3 + Rescore + Reductions
==========================
# SUMMARY for threshold 5.0:
# Correctly non-spam: 704002  99.95%
# Correctly spam:     2558896  98.26%
# False positives:       364  0.05%
# False negatives:     45424  1.74%
# TCR(l=50): 40.932981  SpamRecall: 98.256%  SpamPrec: 99.986%

Looks like a statistically insignificant improvement over the old scores.  I only hope our corpora were sufficiently varied.

Rules Made Informational
======================
TVD_RCVD_SPACE_BRACKET
MISSING_MIME_HB_SEP
FUZZY_CPILL
X_IP Bug #5920 appears not fixed as claimed.
FRT_SOMA2
CTYPE_001C_B
MIME_BASE64_BLANKS
WEIRD_QUOTING
SPF_HELO_FAIL
HTML_IMAGE_RATIO_06
HTML_IMAGE_RATIO_08

Other Changes
============
* EXTRA_MPART_TYPE was left at 1.0 because, while it does relatively poorly in the weekly masscheck, it did far better in the rescore masscheck.
* I am increasing the scores of PSBL *after* the above fp-fn-statistics run because the old logs do not reflect its current safety level.

I am committing these changes now.  I suspect the key to these reductions is getting rid of the rules that wouldn't have passed our ruleqa auto-promotion criteria?  There might be additional tweaks to make.  Please comment here.
Comment 163 Warren Togami 2009-11-16 22:58:57 UTC
http://hudson.zones.apache.org/hudson/job/SpamAssassin-trunk/4344/testReport/
-score MISSING_HB_SEP 2.5
+# score MISSING_HB_SEP 2.5
+score MISSING_HB_SEP 0 # n=0 n=1 n=2

-score X_MESSAGE_INFO 3.499 3.496 3.330 1.597
+score X_MESSAGE_INFO 0 # n=0 n=1 n=2 n=3

It appears that tests here are failing after commit because rules required by this test were zeroed out.  It seems these rules have almost zero hits in masscheck.  What should we do about this?
Comment 164 Mark Martinec 2009-11-17 03:03:22 UTC
> It appears that tests here are failing after commit because rules required by
> this test were zeroed out.  It seems these rules have almost zero hits in
> masscheck.  What should we do about this?

  Bug 6155 #163: force nonzero scores on MISSING_HB_SEP and X_MESSAGE_INFO
  for the test
  Sending t/missing_hb_separator.t
  Committed revision 881240.

I hope this is the right approach. An alternative would be to introduce
a file similar to t/data/01_test_rules.cf to hold score overrides, but
with a name like 51_test_rules.cf so that it sorts after 50_scores.cf.
Btw, is the 01_ in the name intentional, or could the existing file
just be renamed to something like 99_test_rules.cf?
Comment 165 Mark Martinec 2009-11-17 03:18:15 UTC
(In reply to comment #161)
> -score RDNS_NONE             0.1
> -score RDNS_DYNAMIC          0.1
> +# score RDNS_NONE     0 1.1 0 0.7
> +# score RDNS_DYNAMIC  0 0.5 0 0.5

> Doesn't commented out mean 1 point?

It would mean 1 point, if there were no other score lines for these two rules:
score RDNS_DYNAMIC 2.639 0.363 1.663 0.982
score RDNS_NONE    2.399 1.274 1.228 0.793

> These are supposed to be informational rules according to the comment.
> Is this supposed to become commented out?

See comments 116, 120, 124, 137, 139.
I left it mutable; I think it still makes sense - it's kind of a poor man's
Botnet plugin.
Comment 166 Justin Mason 2009-11-17 07:41:11 UTC
(In reply to comment #164)
> > It appears that tests here are failing after commit because rules required by
> > this test were zeroed out.  It seems these rules have almost zero hits in
> > masscheck.  What should we do about this?
> 
>   Bug 6155 #163: force nonzero scores on MISSING_HB_SEP and X_MESSAGE_INFO
>   for the test
>   Sending t/missing_hb_separator.t
>   Committed revision 881240.
> 
> I hope this is the right approach. Alternative would be to introduce
> a file similar to t/data/01_test_rules.cf to hold score overrides, but
> with a name like 51_test_rules.cf to be sorted after the 50_scores.cf.
> Btw, is the 01_ in the name intentional, or could the existing file
> just be renamed to something like 99_test_rules.cf ?

X_MESSAGE_INFO can be dropped, but MISSING_HB_SEP should not have been made mutable; I'd say lock it to 2.5.

btw it is to be expected that with less mutability the scores become slightly less optimal for the rescoring corpus; this always happens.  If scores are allowed to wander without locking down the "unsafe" rules, the GA will overfit to the training data and produce great FP/FN figures, but scores that are risky for "real world" usage.
Comment 167 AXB 2009-11-17 07:56:17 UTC
(In reply to comment #166)
> (In reply to comment #164)
> > > It appears that tests here are failing after commit because rules required by
> > > this test were zeroed out.  It seems these rules have almost zero hits in
> > > masscheck.  What should we do about this?
> > 
> >   Bug 6155 #163: force nonzero scores on MISSING_HB_SEP and X_MESSAGE_INFO
> >   for the test
> >   Sending t/missing_hb_separator.t
> >   Committed revision 881240.
> > 
> > I hope this is the right approach. Alternative would be to introduce
> > a file similar to t/data/01_test_rules.cf to hold score overrides, but
> > with a name like 51_test_rules.cf to be sorted after the 50_scores.cf.
> > Btw, is the 01_ in the name intentional, or could the existing file
> > just be renamed to something like 99_test_rules.cf ?
> 
> X_MESSAGE_INFO can be dropped, but MISSING_HB_SEP should not have been made
> mutable; I'd say lock to 2.5.
> 
> btw it is to be expected that with less mutability the scores become slightly
> less optimal for the rescoring corpus; this always happens.  If scores are
> allowed to wander without locking down the "unsafe" rules, the GA will overfit
> to the training data and produce great FP/FN figures, but scores that are risky
> for "real world" usage.

Locally, I've lowered the MISSING_HB_SEP score to 0.5.

Lots of funky ERP stuff seems to have a talent for FPing on it.
It's great for metas, but with the usual suspects and their very ugly HTML
formatting it usually pushes scores close to an FP.
(sorry, cannot supply samples)

I'd say 2.5 is sorta high.

Axb
Comment 168 Justin Mason 2009-11-20 15:10:05 UTC
(In reply to comment #167)
> locally, I've have lowered the MISSING_HB_SEP score to 0.5
> 
> lottsa funky ERP stuff seems to have a talent to FP on it.
> its great for metas but usually triggers scores close to FP with the usual
> suspects & their very ugly HTML formatting.
> (sorry, cannot supply samples)
> 
> I'd say 2.5 is sorta high

ok -- I was under the impression it was FP-free.  0.5 works for me in that case.
Comment 169 Warren Togami 2009-11-23 20:08:06 UTC
spamassassin/trunk/rulesrc/10_force_active.cf

It seems this file needs to be updated after the rescoring.  Should all the rules in 50_scores.cf be listed in 10_force_active.cf?

Even the rules that are zeroed out in 50_scores.cf?
Comment 170 Warren Togami 2009-11-25 15:39:17 UTC
Created attachment 4579 [details]
patch for 10_force_active.cf

Nobody responded to the previous comment, and I didn't know how this file was generated before.  For this patch I took 50_scores.cf and extracted all rule names that were not commented out.  Is this correct?
Comment 171 Mark Martinec 2009-11-26 11:16:06 UTC
>> spamassassin/trunk/rulesrc/10_force_active.cf
>> It seems this file needs to be updated after the rescoring.
>> Should all the rules in 50_scores.cf be listed in 10_force_active.cf?
>> Even the rules that are zeroed out in 50_scores.cf?
>
> Nobody responded to the previous comment.
> I didn't know how this file was generated before.

No idea, sorry. I haven't been around that long.

> I took 50_scores.cf and took all rule names that were not
> commented out for this patch.  Is this correct?

Probably.


Btw,
  prove xt/10_rule_test_suite.t
is failing for several rules. Can someone more familiar with the rules
please check where the reported problems lie?
Comment 172 Daryl C. W. O'Shea 2009-11-26 17:24:49 UTC
Warren,

The file was originally used to list all *rules from sandboxes* that had scores assigned by the GA, so that they didn't get auto-demoted, leaving a score line but no rule.

I don't think its use has changed, but I'm not completely up-to-date on the re-org of the rules source structure.

jm might have a script to generate the file... although it's been a long time.
Comment 173 Warren Togami 2009-11-30 13:38:47 UTC
Sending        rulesrc/10_force_active.cf
Transmitting file data .
Committed revision 884912.

Please review.
Comment 174 Mark Thomas 2009-11-30 13:40:07 UTC
Restoring comment originally made by Mark Martinec

(In reply to comment #171)
> Btw, the:
>   prove xt/10_rule_test_suite.t
> is failing for several rules. Can someone more familiar with rules
> please check where the reported problems lie?

Actually it's just two rules failing on multiple tests: FM_FRM_RN_L_BRACK
and TVD_SPACE_RATIO. Luckily their scores are zero or near zero:
  score TVD_SPACE_RATIO 0.001
  score FM_FRM_RN_L_BRACK 0

| Changed score of FM_FRM_RN_L_BRACK from 0 into 0.001,
| to make xt/10_rule_test_suite.t happy.
| Sending rules/50_scores.cf
| Committed revision 884927.

So that leaves the TVD_SPACE_RATIO. Is it something to worry about?
Comment 175 Justin Mason 2009-12-01 05:08:47 UTC
10_force_active.cf is generated at this step in the RescoreMassCheck process (see https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155#c3):

6.5. mark evolved-score rules as 'always published'

sounds like we could be missing a few other steps too, if that one got missed...
Comment 176 Warren Togami 2009-12-01 08:50:38 UTC
http://wiki.apache.org/spamassassin/RescoreMassCheck

Mark, did you do these steps?

6. upload the test logs to zone
8. Make the stats files
8. upload new stats files
Comment 177 Mark Martinec 2009-12-01 09:17:58 UTC
> Mark, did you do these steps?
> 6. upload the test logs to zone
> 8. Make the stats files
> 8. upload new stats files

No, I left off at '5. generate scores for score sets';
I only attached the scores file for consideration.
Comment 178 Warren Togami 2009-12-01 10:28:26 UTC
Mark, it appears that only you can do those steps?
Comment 179 Warren Togami 2009-12-02 07:25:38 UTC
Mark, please correct me if I am wrong, but it seems only you can complete the final steps, since we don't know exactly which subset of data you used.
Comment 180 Mark Martinec 2009-12-02 07:31:01 UTC
> Mark, please correct me if I am wrong.  But it seems only you can complete the
> final steps since we don't know exactly which subset of data you used.

I'm doing it right now. The config.set* files are already checked in,
logs are being transferred, ...
Comment 181 Mark Martinec 2009-12-02 10:48:45 UTC
Ok, I think I'm done now (RescoreMassCheck):

5. generate scores for score sets
svn commit -m "runGA config files used" masses/config.set*
  r886173 | mmartinec | 2009-12-02 16:24:32 +0100 (Wed, 02 Dec 2009) | 1 line
  runGA config files used
tar cvf rescore-logs.tar gen-set{0,1,2,3}-*

6. upload the test logs to zone (spamassassin.zones.apache.org):
sudo mkdir /home/corpus-rsync/ARCHIVE/3.3.0
sudo mv rescore-logs.tar.bz2 \
  /home/corpus-rsync/ARCHIVE/3.3.0/rescore-logs-bug6155.tar.bz2
ls -l /home/corpus-rsync/ARCHIVE/3.3.0/rescore-logs-bug6155.tar.bz2
  -rw-r--r--   1 mmartinec other    20380424 Dec  2 18:23
    /home/corpus-rsync/ARCHIVE/3.3.0/rescore-logs-bug6155.tar.bz2

6.5. mark evolved-score rules as 'always published'
./force-publish-active-rules ../rules/active.list ../rulesrc/10_force_active.cf
svn commit -m "force publish of rescored rules" ../rulesrc/10_force_active.cf
  r886212 | mmartinec | 2009-12-02 18:33:57 +0100 (Wed, 02 Dec 2009) | 3 lines
  Bug 6155: generated new rulesrc/10_force_active.cf
  as per step 6.5 in RescoreMassCheck

6.6. fix test failures
nothing to tweak, all tests pass

7. upload proposed new scores
done some time ago, some tweaks later:
  r881159 | wtogami | 2009-11-17 06:35:00 +0100 (Tue, 17 Nov 2009) | 2 lines
  Bug #6155 commit raw scores from Comment #146 as documented in #162.
To view the diffs: svn diff -r 881158:886232 rules/50_scores.cf

8. Make the stats files
cp config.set0 config ; bash ./runGA stats
cp config.set1 config ; bash ./runGA stats
cp config.set2 config ; bash ./runGA stats
cp config.set3 config ; bash ./runGA stats

8(.1) upload new stats files
  r886232 | mmartinec | 2009-12-02 19:11:35 +0100 (Wed, 02 Dec 2009) | 2 lines
  rules/STATISTICS-set*.txt
> Attach the new proposed STATISTICS*.txt as a patch to the rescoring bug
too many differences, just do a: svn diff -c886232
Comment 182 Warren Togami 2009-12-02 11:04:18 UTC
6.5. mark evolved-score rules as 'always published'

  cd masses
  ./force-publish-active-rules ../rules/active.list ../rulesrc/10_force_active.cf
  svn commit -m "force publish of rescored rules" ../rulesrc/10_force_active.cf

Doing this seems to remove all the zero-score rules from 10_force_active.cf.  Does this make any difference?
Comment 183 Warren Togami 2009-12-02 11:43:16 UTC
Why is active.list (the result of auto-promotion) relevant as input to this
script?  It seems like circular logic that makes no sense.
+ SPAMMY_MIME_BDRY_01

force-publish-active-rules added a few lines like this that have no scores assigned in rules/50_scores.cf.

It seems what I already did by copying rule names from rules/50_scores.cf into rulesrc/10_force_active.cf is more correct?

If so, then it appears we are ready for beta if we can clear up the GPG key issue in Bug #6223.
Comment 184 Justin Mason 2009-12-02 14:28:46 UTC
(In reply to comment #183)
> Why is active.list (the result of auto-promotion) relevant as input to this
> script?  Seems kind of like circular logic that makes no sense.
> 
> + SPAMMY_MIME_BDRY_01
> 
> force-publish-active-rules added a few lines like this that have no scores
> assigned in rules/50_scores.cf.
> 
> It seems what I already did by copying rule names from rules/50_scores.cf into
> rulesrc/10_force_active.cf is more correct?
> 
> If so, then it appears we are ready for beta if we can clear up the GPG key
> issue in Bug #6223.

I think you're right.  Could you open a side-bug for that issue so we can fix it post-release?

anyway, this is now fixed.
Comment 185 Henrik Krohns 2010-01-05 10:47:51 UTC
I have a hunch that FREEMAIL_ENVFROM_END_DIGIT has a bit too high a score (1.553). Probably there wasn't enough "nicedude90" ham in the corpora. Strangely, FREEMAIL_REPLYTO_END_DIGIT has a lower score; one would think it would be the safer one FP-wise...