Bug 4505 - [review] Score generation for SpamAssassin 3.1
Summary: [review] Score generation for SpamAssassin 3.1
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Score Generation
Version: 3.1.0
Hardware: Other
OS: other
Importance: P1 critical
Target Milestone: 3.1.0
Assignee: Justin Mason
URL:
Whiteboard: ready to commit
Keywords:
Depends on:
Blocks:
 
Reported: 2005-07-27 14:36 UTC by Henry Stern
Modified: 2005-08-11 09:06 UTC
CC: 1 user



Attachment Type Modified Status Actions Submitter/CLA Status
freqs for scoreset 3, all logs text/plain None Justin Mason [HasCLA]
Proposed scores for 3.1 patch None Henry Stern [HasCLA]
Proposed scores for 3.1 generated without Bob's data patch None Henry Stern [HasCLA]
freqs for scoreset 3, all logs, all rules text/plain None Justin Mason [HasCLA]
bayes scores patch None Justin Mason [HasCLA]
release-quality patch patch None Justin Mason [HasCLA]
redo of 3062 patch None Justin Mason [HasCLA]
redo of 3065 patch None Justin Mason [HasCLA]
fix for test failures caused by 3066 patch None Justin Mason [HasCLA]
redo of 3066 patch None Justin Mason [HasCLA]
fix for t/rule_names.t patch None Justin Mason [HasCLA]

Description Henry Stern 2005-07-27 14:36:04 UTC
To tune the models this time, I am using a 10% random sample of all of the
corpus submissions.  All of these results were generated using the same
parameters as I used for 3.0, except for set1.

False positives and negatives from the 10% sample to follow...

./model-statistics vm-set0-2.0-4.0-100/validate
False positives: mean=0.0753% std=0.0462
False negatives: mean=20.9334% std=7.3811
TCR (lambda=50): mean=2.7302 std=0.9718

./model-statistics vm-set1-2.0-4.0-100/validate
False positives: mean=0.0713% std=0.0435
False negatives: mean=5.9736% std=2.1137
TCR (lambda=50): mean=9.8396 std=3.6217

./model-statistics vm-set2-2.0-4.625-100/validate
False positives: mean=0.0847% std=0.0364
False negatives: mean=5.6917% std=2.0176
TCR (lambda=50): mean=9.7449 std=3.4877

./model-statistics vm-set3-2.0-5.0-100/validate
False positives: mean=0.0847% std=0.0527
False negatives: mean=2.9959% std=1.0621
TCR (lambda=50): mean=15.7957 std=6.3287
Comment 1 Henry Stern 2005-07-27 15:19:58 UTC
The misses can be found on the rsync server in /corpus/scoregen-3.1/falses/

I wanted to put them on BZ, but the file is too big.
Comment 2 Justin Mason 2005-07-27 18:39:21 UTC
(bumping pri to the appropriate level)

since quite a few of the mass-checkers don't have accounts on that box, I've
also copied the set3 files to these URLs:

http://taint.org/xfer/2005/set3.fn.gz
http://taint.org/xfer/2005/set3.fp.gz

Please download and verify that any mails in the FP set that are coming from
your corpus, are indeed valid ham; and ditto for the FN set being spam.

Btw Henry -- in my case, the breakdown of errors is as follows...

FNS:  (can be moved to spam if you want, or deleted)
/home/jm/Mail/deld.priv/56232
/home/jm/Mail/deld.priv/61238
/home/jm/Mail/sent/587
/home/jm/Mail/sent/736

INVALID, DELETE FROM HAM:   (rule discussion, bounced spam)
/home/jm/Mail/deld.priv/111034
/home/jm/Mail/A3inbox/1

FPS:   (can be moved to ham or deleted)
/home/jm/cor/spam.cor/20041029a/216
/home/jm/cor/spam.cor/20041029a/226
/home/jm/cor/spam.cor/20041029a/246
/home/jm/cor/spam.cor/20041029a/233
/home/jm/cor/spam.cor/20041029a/235

INVALID, DELETE FROM SPAM:   (bounced spam)
/home/jm/Mail/Sapm/1540
/home/jm/Mail/Sapm/1647
Comment 3 Theo Van Dinter 2005-07-27 19:54:47 UTC
Subject: Re:  Score generation for SpamAssassin 3.1

On Wed, Jul 27, 2005 at 06:39:22PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Please download and verify that any mails in the FP set that are coming from
> your corpus, are indeed valid ham; and ditto for the FN set being spam.

Ok, checked over the set3 results.

FPs: all valid ham.
FNs: all valid spam.

In full disclosure, several of the spams could be considered
"questionable", namely HGTV newsletters which also include DIY
newsletters.  I was originally receiving them to a hamtrap, but then I
started receiving things I didn't ask for, and then couldn't unsubscribe,
so they got switched to spam instead.

The rest are a various set of things, mostly stock spams, phishing,
several of those German spams from earlier in the year, national lottery
spams, etc.

Comment 4 Michael Parker 2005-07-27 19:56:01 UTC
I let Henry know, but for the record, I looked through all of mine and they are
all good to go.
Comment 5 Henry Stern 2005-07-28 03:22:59 UTC
I'm not too concerned about a few mis-labeled entries.  All that will happen
from those is that our numbers will look a bit off.  Unless anyone has
objections, I'm going to use the corpus as is and will generate the scores.  The
learning algorithm is stable enough to work around a bit of noise.
Comment 6 Bas Zoetekouw 2005-07-28 03:40:35 UTC
My check of the set3 results gives:

FNS:  (can be moved to spam if you want, or deleted)
/scratch/SA/mails/2005-01.mbox.ham.21322338

FPS:   (can be moved to ham or deleted)
/scratch/SA/mails/personal.2005w08.spam.194510
/scratch/SA/mails/personal.2005w09.spam.780822
/scratch/SA/mails/personal.2005w20.spam.220636
/scratch/SA/mails/personal.2005w21.spam.1340310
/scratch/SA/mails/personal.2005w22.spam.1210785
/scratch/SA/mails/personal.2005w25.spam.886714

INVALID, DELETE FROM SPAM:   (bounces, viruses, etc.)
/scratch/SA/mails/backup.2005.jan-may.spam.101747
/scratch/SA/mails/traps.2005w09.spam.1189670
/scratch/SA/mails/personal.2005w09.spam.704311
/scratch/SA/mails/personal.2005w14.spam.1332942
/scratch/SA/mails/personal.2005w28.spam.161488
Comment 7 Justin Mason 2005-07-28 09:46:44 UTC
well, in terms of generating STATISTICS.txt at least, I would prefer to have the
bad entries fixed; those numbers are published.  it's pretty trivial to fix up
the logs appropriately using "remove-ids-from-mclog", I'll do it if you want.
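For anyone without the tool at hand, the effect of remove-ids-from-mclog can be sketched roughly like this (a minimal Python sketch; the function name and the assumption of one corpus path or message-id per log line are mine, not the actual script):

```python
def remove_ids_from_mclog(log_lines, bad_ids):
    """Drop mass-check log lines that mention a known-misclassified
    message, identified by its corpus path or message-id."""
    return [line for line in log_lines
            if not any(bad in line for bad in bad_ids)]

# e.g. filtering out one of the reported FNs (hypothetical log lines)
logs = [
    "Y 23 /home/jm/Mail/Sapm/1540 RULE_A,RULE_B",
    "Y 5 /home/jm/Mail/sent/999 RULE_C",
]
clean = remove_ids_from_mclog(logs, ["/home/jm/Mail/Sapm/1540"])
```

The real script presumably parses the log format properly; substring matching is just the simplest way to show the idea.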
Comment 8 Henry Stern 2005-07-28 09:57:40 UTC
I'd rather that we didn't clean up the logs this way because:

1) You've only removed errors from 10% of the logs.
2) You haven't removed the errors that both you and SA have made.

I'm running a set of cross-validations on the full set now.  If you really want
to remove only the instances where the human was incorrect and the classifier
was correct and not the instances where both the human and the classifier are
incorrect, I will upload the errors to the rsync server when it's finished.
Comment 9 Justin Mason 2005-07-28 10:12:07 UTC
well, we disagree ;)   I'd appreciate some comments from the rest of the
committers on how they feel about this one.   Here's a chat log between myself
and H talking about it....


(09:49:33) henry: so about fixing up logs
(09:50:19) henry: I'd rather that we didn't because:
1) You've only removed errors from 10% of the logs.
2) You haven't removed the errors that both you and SA has made.
(09:50:25) henry: have made
(09:51:00) jm: please respond via mail on this one, I suspect I'm not the only
one who disagrees ;)
(09:51:18) henry: sure
(09:51:56) jm: imo we need to try and get the logs as clean as poss, even if
we're missing 90% of the FPs/FNs
(09:52:19) henry: we're just gaming the numbers
(09:52:32) jm: even if the perceptron is able to deal with some noise, the logs
are used for other things (STATISTICS.txt) that cannot deal with noise
(09:52:36) henry: the learning algorithm would be useless if it couldn't work
around a few mistakes
(09:52:58) jm: we're not gaming it -- we're using it to build something nearer a
"gold standard" in Cormack temrs
(09:53:13) henry: and what I'm saying is that by correcting errors in only one
direction, STATISTICS.txt will be worse off than it was before
(09:53:24) henry: Cormack uses multiple classifiers to make his "gold standard"
(09:56:27) jm: why are we correcting errors only in 1 dir?
(09:56:31) jm: don't get that
(09:56:54) henry: you're not correcting entries where both you and SA have erred
(09:57:22) henry: so they look like TPs and TNs, but in fact they are FNs and FPs
(09:57:52) jm: ok.   but it's still *better* than the current logs
(09:58:03) henry: I disagree
(09:58:03) jm: in that there are *less* FPs and FNs overall
(09:58:17) jm: even if there are still *some* FPs and FNs
(09:58:19) henry: there are indeed less FPs and FNs overall
(09:58:44) henry: but since we know how many errors we've seen, we can make some
predictions about what's gone on in the other direction
(09:59:49) jm: I disagree that that's useful ;)
(09:59:58) jm: unless you want to fix the STATISTICS generating scripts as well...
(10:01:30) henry: well, here's the thing
(10:01:37) henry: from first look
(10:01:47) henry: it seems that people have about the same amount misclassified
in each direction
(10:01:49) henry: that have been found
(10:02:42) henry: so you could hypothesise that there are plenty that have gone
the other way
(10:03:29) henry: and that they are about the same proportion
(10:03:34) henry: maybe
(10:03:36) henry: I don't know
(10:04:07) henry: all that I can say is that by fixing based solely on the
suspected mistakes of the classifier, we're biasing the results to make things
look better than they are
(10:04:45) henry: and really.. at the end of the day, the numbers reflect how
good the sample set is
Comment 10 Justin Mason 2005-07-28 10:16:10 UTC
so in summary:

- I think we should try to make the logs as clean as possible

- Henry thinks we should keep the logs as they are, and use that to estimate a
misclassification figure instead

(PS: henry also notes that Bayes will have been trained on those instances, too.)
Comment 11 Chris Thielen 2005-07-28 11:06:30 UTC
Here are my misclassifications (I guess whether or not it matters is still up for debate):

Virus Bounce: 
/home/sone/spamassassin_masscheck_3.1-pre/corpora/spam/1082327516.17711_3.ns1:2,S

Misclassified as spam (kinda sorta ham-ish-y I guess):
/home/sone/spamassassin_masscheck_3.1-pre/corpora/spam/
1115735940.M20350P12544V0000000000000304I001D2C12_6.ns1,S=14073:2,S

Misclassified as spam (really ham)
/home/sone/spamassassin_masscheck_3.1-pre/corpora/spam/1106269507.18978_3.ns1:2,S
Comment 12 Rod Begbie 2005-07-28 14:55:29 UTC
My only misclassify:
/Users/rod/spam/Maildir/.spam.2004-12/cur/1103575966.15119_0.blazing.arsecandle.org,S=18955:2,S
is really ham.
Comment 13 Justin Mason 2005-07-28 18:04:15 UTC
Created attachment 3044 [details]
freqs for scoreset 3, all logs

fyi -- here's the freqs data from 3.1.0's mass-check logs, scoreset 3.

I didn't clear up the misclassifications reporting since the perceptron run,
fwiw; this is just using the rsync'd logs.  so far, though, the FP/FNs reported
are tiny compared to the number of mass-checked messages (1483066 spam, 743761
ham).
Comment 14 Justin Mason 2005-07-28 18:15:52 UTC
btw, more hits that look very iffy, from the freqs file:

  0.333   0.0546   0.8887    0.058   0.26   -4.30  RCVD_IN_BSP_TRUSTED
  0.051   0.0130   0.1267    0.093   0.19   -0.10  RCVD_IN_BSP_OTHER
  0.036   0.0053   0.0961    0.053   0.29   -8.00  HABEAS_ACCREDITED_COI

that seems like a *LOT* of Bonded Sender spam hits -- 809 messages hitting
RCVD_IN_BSP_TRUSTED!  could we get those spam hits verified?  (Bob, in
particular, most seem to be coming from your corpus)
Comment 15 Theo Van Dinter 2005-07-28 19:19:17 UTC
Subject: Re:  Score generation for SpamAssassin 3.1

On Thu, Jul 28, 2005 at 06:15:52PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> that seems like a *LOT* of Bonded Sender spam hits -- 809 messages hitting
> RCVD_IN_BSP_TRUSTED!  could we get those spam hits verified?

My hits are all valid, btw.

Comment 16 Henry Stern 2005-07-29 14:53:45 UTC
I did a run with the full 2M corpus.  Here are the results:

vm-set0-2.0-4.0-100
False positives: mean=0.0625% std=0.0263
False negatives: mean=21.8408% std=7.6947
TCR (lambda=50): mean=2.6218 std=0.9242

vm-set1-2.0-4.0-100
False positives: mean=0.0682% std=0.0263
False negatives: mean=6.1945% std=2.1798
TCR (lambda=50): mean=9.5497 std=3.3674

vm-set2-2.0-4.625-100
False positives: mean=0.0846% std=0.0325
False negatives: mean=7.9603% std=2.8295
TCR (lambda=50): mean=7.3340 std=2.5958

vm-set3-2.0-5.0-100
False positives: mean=0.0822% std=0.0318
False negatives: mean=3.0710% std=1.0898
TCR (lambda=50): mean=15.2954 std=5.4556
Comment 17 Justin Mason 2005-07-29 16:03:33 UTC
further info regarding the BSP_TRUSTED hits --

grep BSP_TRUSTED spam.log > o
perl -ne '/ (\/[^\/]+\/[^\/]+\/[^\/]+)/ and print "$1\n"' o | uniq -c
 792 /home/Bob/spamassassin.active
  10 /home/duncf/Maildir
   2 /home/jm/Mail
   1 /home/jm/cor
   4 /home/corpus/mail
   1 /home/corpus/SA

97% of the Bonded Sender hits on spam are from Bob's corpus.   I suspect
something's up with the corpus there... spamtraps?  retired accounts?
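The same per-corpus breakdown can be sketched in Python (assuming, as the one-liner above does, that each log line contains an absolute corpus path; this is an editorial illustration, not part of the masses tools):

```python
import re
from collections import Counter

def corpus_breakdown(log_lines):
    # Tally hits by the first three path components of the corpus
    # path, mirroring the perl | uniq -c pipeline above.
    counts = Counter()
    for line in log_lines:
        m = re.search(r' (/[^/ ]+/[^/ ]+/[^/ ]+)', line)
        if m:
            counts[m.group(1)] += 1
    return counts
```

Unlike `uniq -c`, a Counter doesn't require the input to be grouped by corpus.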


PS: there's an argument that having FPs in the logs is irrelevant.
however, I disagree -- the Perceptron is only *one* thing that uses
the logs.  There are also the following:

  - overall FP/FN% figures for scoresets and thresholds (STATISTICS.txt)
  - rule freqs, for per-rule FP/FN% figures

Given those two, there's good reasons to clean up the logs.
Comment 18 Sidney Markowitz 2005-07-29 16:29:38 UTC
Here's an email Bob sent to sa-dev mailing list that looks like it was meant to
be a comment here. Or if not, I think it should be in the record here and it is
on a public list so I feel free to repost it. However, 259 is a lot less than
792 so there still is a question why Bob has so many Bonded sender FPs.

  ---- rest of this is a quote -----

Hello Henry,

Wednesday, July 27, 2005, 6:39:22 PM, you wrote:


>> jm@jmason.org changed:
>>            What    |Removed                     |Added
>> ----------------------------------------------------------------------------
>>            Severity|normal                      |critical
>>            Priority|P5                          |P1


>> since quite a few of the mass-checkers don't have accounts on that
>> box, I've also copied the set3 files to these URLs: 
>> http://taint.org/xfer/2005/set3.fn.gz
>> http://taint.org/xfer/2005/set3.fp.gz


>> Please download and verify that any mails in the FP set that are
>> coming from your corpus, are indeed valid ham; and ditto for the FN
>> set being spam.


FN:

I spot-checked all FNs with positive scores, and checked every FN with
negative scores.  Corpus is clean, except:

ham: mid=<mailman.3.1119452414.19901.announce@ctyme.com>

discount: Message-ID: <12880891.1119562416154.JavaMail.root@agent1.ientrymail.com>
          Message-ID:
<28195449.1118795862153.JavaMail.root@mailagent0.ientrymail.com>
spam newsletter, but this user probably subscribed to it...

There are 259 emails from/via constantcontact.com which are treated
as spam on my system, have been flagged as spam on my system (scores
as high as 30's and 40's), have been encapsulated on delivery, have
never been flagged by any user as not-spam, but, for the purposes of a
world-wide mass-check, these constantcontact.com emails might be
questionable.

Note: Not all constantcontact.com is treated as spam here -- quite a
few cc.com newsletters are subscribed to and seen as ham by their
subscribers and the system. The ones I find above in the fns file are
all from a set of eight newsletters which have regularly (almost
always) been seen as spam, and no user has ever corrected that
classification.

Henry: To remove these from the log (if you want to), remove
everything where the path is
/home/Bob/spamassassin.active/masses/corpus.spam (or corpus.ham),
since that identifies my corpus contribution, and where the mid ends
in @scheduler. 

FP:  Checked every one.  Corpus is clean, except:

ham: Message-ID: <1118650726.505.53825.m18@yahoogroups.com>
There are two of these listed. One should be removed.

spam: mid=<17EDCF9C.FD9DD30@hotmail.com>


Bob Menschel



Comment 19 Sidney Markowitz 2005-07-29 16:43:07 UTC
Of course I should have said FN not FP in the last comment. And in case it is
not clear to someone reading this: constantcontact.com runs the Bonded Sender
service, which is what the RCVD_IN_BSP_TRUSTED rule looks for.

Bob, what does it mean that you say that you have 259 emails from/via
constantcontact.com that are flagged as spam, but Justin says that the log shows
792 BSP_TRUSTED hits from your spam corpus?
Comment 20 Justin Mason 2005-07-29 16:57:16 UTC
oops, missed that.

however, I don't think Bob was talking about the BSP issue in that mail...

Sidney -- I think you're confusing Constant Contact with Return Path -- Return
Path are now partners in the BSP, http://www.returnpath.net/, but afaik
Constant Contact are a different company.  I don't think that's it (although it
may be some of the hits).
Comment 21 Sidney Markowitz 2005-07-29 17:04:45 UTC
Oh, I got confused by this:

http://www.constantcontact.com/services/bonded-sender-program.jsp

I guess constantcontact provides a way for people to get Bonded Sender status
for $25/month and no risk of losing a bond. I wonder what the business model is
if spammers take advantage of it. That would explain it if they are a source of
most of the BSP_TRUSTED FNs. Except there are still another 533 FNs to explain.
Comment 22 Henry Stern 2005-07-29 17:29:41 UTC
Created attachment 3045 [details]
Proposed scores for 3.1

gen-set0-2.0-4.0-100
# SUMMARY for threshold 5.0:
# Correctly non-spam:  74239  99.92%
# Correctly spam:     113219  76.56%
# False positives:	  60  0.08%
# False negatives:     34655  23.44%
# TCR(l=50): 3.927075  SpamRecall: 76.565%  SpamPrec: 99.947%

gen-set1-2.0-4.0-100
# Correctly non-spam:  74274  99.92%
# Correctly spam:     138015  93.05%
# False positives:	  59  0.08%
# False negatives:     10312  6.95%
# TCR(l=50): 11.184361	SpamRecall: 93.048%  SpamPrec: 99.957%

gen-set2-2.0-4.625-100
# Correctly non-spam:  74747  99.92%
# Correctly spam:     134723  90.61%
# False positives:	  58  0.08%
# False negatives:     13955  9.39%
# TCR(l=50): 8.821003  SpamRecall: 90.614%  SpamPrec: 99.957%

gen-set3-2.0-5.0-100
# Correctly non-spam:  74528  99.92%
# Correctly spam:     143427  96.65%
# False positives:	  59  0.08%
# False negatives:	4975  3.35%
# TCR(l=50): 18.725804	SpamRecall: 96.648%  SpamPrec: 99.959%
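The SpamRecall and SpamPrec figures above follow directly from the four counts in each summary; as a quick sanity check (the helper below is mine, not part of the score-generation tools):

```python
def recall_precision(correct_spam, false_negatives, false_positives):
    # SpamRecall: fraction of all spam that was caught.
    # SpamPrec: fraction of spam-flagged mail that really was spam.
    recall = correct_spam / (correct_spam + false_negatives)
    precision = correct_spam / (correct_spam + false_positives)
    return recall, precision

# gen-set3 figures from the summary above
r, p = recall_precision(143427, 4975, 59)
print(round(r * 100, 3), round(p * 100, 3))  # 96.648 99.959
```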
Comment 23 Justin Mason 2005-07-29 19:41:38 UTC
I hacked together something to make ROC curves... take a look.

current SVN trunk:
http://taint.org/xfer/2005/roc_curves_pre_perceptron.png

with the scores in patch 3045:
http://taint.org/xfer/2005/roc_curves_with_3045.png
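For the record, ROC points of this kind can be produced by sweeping the spam threshold over the scored logs (a minimal sketch under that assumption, not the actual script behind those plots):

```python
def roc_points(spam_scores, ham_scores, thresholds):
    # One (FPR, TPR) point per candidate threshold: the fraction of
    # ham wrongly scoring at/over the threshold, versus the fraction
    # of spam caught at that threshold.
    points = []
    for t in thresholds:
        tpr = sum(s >= t for s in spam_scores) / len(spam_scores)
        fpr = sum(h >= t for h in ham_scores) / len(ham_scores)
        points.append((fpr, tpr))
    return points
```

Plotting the resulting (FPR, TPR) pairs gives curves like the ones linked above.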
Comment 24 Loren Wilton 2005-07-29 20:29:21 UTC
Subject: Re:  Score generation for SpamAssassin 3.1

+score BAYES_50 0 0 0.845 0.001 # n=1
+score BAYES_60 0 0 2.312 0.372 # n=1
+score BAYES_80 0 0 2.775 2.087 # n=1
+score BAYES_95 0 0 3.023 2.063 # n=1
+score BAYES_99 0 0 2.960 1.886 # n=1

I think the score for BAYES_99 should be hand tweaked, regardless of what the score generator said.
This was big grief for most people on 3.0 - 3.0.3, and I'd just as soon not see it take until 3.1.3 to apply the same hack again.

          Loren

Comment 25 Sidney Markowitz 2005-07-30 03:59:58 UTC
Bob, for some reason the email replies you are sending are not ending up in a
comment even though they are Cc'd to bugzilla-daemon. I'm pasting your last one
in here below.

I don't know about the others you list but I don't see how the Motley Fool ones
are spam. The content looks like stock spam, but they are a very widely read
reputable organization that requires registration with email confirmation to
receive a login password before one can subscribe. Each email that I have
received from them contains unsubscription information that states the email
address that I am subscribed under and a link to where I can view and change
all subscription preferences. I have never seen any reference to them not
honoring the preference settings. While the web site is fool.com, the subscribed
email from them does come from foolsub.com addresses.

See, for example, the reference to Motley Fool in
http://www.ironport.com/company/pp_business_week_03-13-2003.html

 -- sidney

>> ------- Additional Comments From jm@jmason.org  2005-07-28 18:15 -------
>> btw, more hits that look very iffy, from the freqs file:


>>   0.333   0.0546   0.8887    0.058   0.26   -4.30  RCVD_IN_BSP_TRUSTED
>>   0.051   0.0130   0.1267    0.093   0.19   -0.10  RCVD_IN_BSP_OTHER
>>   0.036   0.0053   0.0961    0.053   0.29   -8.00  HABEAS_ACCREDITED_COI


>> that seems like a *LOT* of Bonded Sender spam hits -- 809 messages hitting
>> RCVD_IN_BSP_TRUSTED!  could we get those spam hits verified?  (Bob, in
>> particular, most seem to be coming from your corpus)


Summary:

Misclassified ham:  28
Bounce/outscatter of spam:  1
Possibly misclassified ham: 34
Constant Contact questionable: 3099 (ham and spam)
The remainder are IMO spam.

Note: In the following discussions where I say "flagged spam", I mean
fully encapsulated, with full SA report and score presented as the
primary email to the user.


>> Misclassified ham:


From: newsletters@about.com (count: 7)
From: "American Express" <AmericanExpress@email.americanexpress.com>
      count: 10, multiple users fed to sa-learn, primarily because
      instead of being official notifications, statements, alerts,
      etc., the "spam" identified by users were marketing emails,
      "take a look at our special offers", "plan the perfect holiday",
      "upgrade to a card with premium service", etc. Only one of the
      sa-learned "spam" was what I'd consider a ham, though none of
      them are spam.
From: <support@godaddy.com> (count: 2, 1 to each of 2 users)
From: PayPal <paypal@email.paypal.com> (count: 6)
From: Tikkun <Tikkun@democracyinaction.org> (count: 1)
From: HeartCenterOnline <HeartCenterOnline@heartcenteronline-mail.com> (count: 2)



>> Possibly misclassified ham:


From: "CNET Help.com Online Courses  "
<CNET_Networks_Member_Services@newsletter.online.com>
Count: 9
User CR declared it to be spam via sa-learn. Probably old subscription.
Several others not fed to sa-learn, but flagged as spam by our system
(and not corrected by the users via sa-learn).
Willing to consider these ham.

From: "The Home Depot" <HomeDepotCustomerCare@homedepot.com>
Subject: Great Last-Minute Gifts for Dad
Count 4: Various users, flagged as spam by our system, not fed through
sa-learn. Looked like spam during validation. also have nine emails
from same source, 3 with low positive scores, six with negative
scores, also not fed through sa-learn.
Willing to consider these ham.

From: Godiva.com <godiva@godiva.com>
Count 3: User CR declared it to be spam via sa-learn. Might be old
subscription.
Count 1: User SV, flagged as spam by our system, no sa-learn correction.
Note: my unverified corpus also has two more emails from same source,
not flagged as spam (low positive score), not fed to sa-learn.

From: "eBay" <eBay@reply3.ebay.com>  Count: 7
Subject: Preview eBay's Summer Sizzlers & Save Big!
Subject: B-52's Live, BBQ at Great America--register now for eBay Live and save!
Subject: feralcanning, check these amazing eBay deals--all under $10
User CR declared it to be spam via sa-learn. Maybe old subscription,
very likely not the type of email the user wanted from eBay.

From: "Movies Unlimited Video E-Flash" <eflash@moviesunlimitedeflash.com>
Count 3: User SA, system flagged as spam, no sa-learn, look like spam,
but all to single user. Could be ham.

From: "DVD Talk" <newsletter@dvdtalk.com>  count: 2
To: mike@misosoup.com
Subject: DVD Talk: It's Back - The Huge DeepDiscountDVD.com Sale
User MM, system flagged as spam, no sa-learn, look like spam, all to
single user, count 2, many others not flagged as spam (some low
positive, some negative), none through sa-learn. Could be ham.

From: "Planet DVD Now" <sales@planetdvdnow.com>  count: 3
To: ncoronado@prontotax.com
Subject: Planet DVD Now Insider News for Saturday June 18, 2005
User NP, system flagged as spam, no sa-learn, look like spam, all to
single user, count 3, many others not flagged as spam (some low
positive, some negative), none through sa-learn. Could be ham.

From: support@sexsearchcom.com  count: 3
Subject: SexSearch Shown Interest
User JB, flagged spam, no sa-learn. Only user receiving these emails.


>> Constant Contact


Per earlier email, several other Constant Contact "newsletters"
flagged by our system as spam, variety of newsletters, variety of
users, spam classification not corrected by users, including technical
users who regularly and reliably sa-learn their misclassified emails.
Messages fed through sa-learn as spam by users:     17
Messages flagged as spam and not sa-learned as ham: 1586
Messages not flagged as spam:                       1496
IMO, if we discard the 1603 flagged as spam, we should also discard
the 1496 treated as ham.


>> Sure looks like spam:


From: "Entertainment Update" <EntertainmentUpdate@mail85.subscribermail.com>
Subject: New Promotional Partner Opportunities
User CR declared it to be spam via sa-learn. Sure looks to me like spam.

From: The Motley Fool <Fool@foolsubs.com>
Subject: Urgent Stock Buy/Sell Alert...from Motley Fool Stock Advisor
User CR declared it to be spam via sa-learn. Sure looks to me like spam.
Plus another copy flagged as spam by our system, same user, not fed to
sa-learn. Quite a few others, all look like spam.

From: "Entertainment Insider" <EntertainmentInsider@mail85.subscribermail.com>
Subject: New Marketing Opportunities from The b EQUAL Company
Subject: New Promotional Opportunities Available from Nickelodeon
Subject: New Marketing Opportunities from Buena Vista Home Entertainment
User CR declared it to be spam via sa-learn. Sure looks to me like spam.
Count: 5

From: Rabbi Michael Lerner <rabbilerner@tikkun.org>
Subject: Science and Spirit--a work group at the Network of Spiritual
Progressives Founding Conferences
User RI declared it to be spam via sa-learn. Maybe old subscription,
very likely not the type of email the user wanted from this source.

From: "ArcaMax" <ezines@arcamax.com>
Subject: Congratulations - You Won
User NP declared it to be spam via sa-learn. Sure looks to me like spam.
Two copies, same recipient, different message ids
Third email, also user NP, no sa-learn, flagged as spam by our system,
sure looks like spam to me.
Other emails, various users, no sa-learn, flagged as spam by our
system, look like spam to me.

From: South Beach Diet Online <products@southbeachdiet.com>
Subject: why this diet WORKS!
User AM, no sa-learn, flagged as spam by our system.

>> You are receiving this message because you subscribed to or visited
>> a Waterfront Media newsletter or product."

Visited a newsletter or product = looks like spam to me.

From: DGI Line - asi/50910 <promoflash@promotioncorner.com>
Reply-To: promoflash@promotioncorner.com
To: jan@award-source.com
Subject: 2005 Magnetic Football Schedules!  All Pro Teams Available
User JA, no sa-learn, flagged as spam by our system, roving constant
contact, contents look like spam to me.

From: "NewsMax.com" <customerservice@reply.newsmax.com>
Subject: Ken Blackwell and New Republicans: Inside Story
User GI, no sa-learn, flagged as spam by our system, only one email in
corpus, including unclassified. If "newsmax.com" were a real service,
I'd expect repeated emails. Therefore I believe this to be spam.

From: Health Insurance Solutions <HealthInsurance@focalex2.com>
Subject: Health and happiness go hand in hand.
User JC, system flagged as spam, no sa-learn, five separate emails,
all look like spam (including no MID from sender), all to single user,
an insurance agent. Could be ham. But...
From: Medical Insurance <MedicalInsurance@focalex2.com>
Subject: Take care with medical insurance.
From: US Immigration Help <USImmigrationHelp@focalex2.com>
Subject: Make the dream of citizenship a reality.
User JC, system flagged as spam, no sa-learn, multiple emails,
all look like spam (including no MID from sender), all to single user,
an insurance agent. Content very much so aimed at consumer, not agent,
strongly suggesting to me that all email from @focalex2.com is indeed
spam. Then ...
From: Posters And Wall Art <PostersAndWallArt@focalex2.com>
Subject: What your walls want to wear.
Same user (insurance agent), same source, nothing at all to do with
insurance or anything similar to any other email received by this
user. Other spam samples abound in more recent email.

From: "SmartBargains" <SmartBargains@deals.smartbargains.com>
Reply-To: "SmartBargains" <SmartBargains.L9A0NB.226361@deals.smartbargains.com>
To: srose@cencalins.com
Subject: 320TC Sheet Set, Duvet & More Just $29.95
User SC, system flagged as spam, no sa-learn, all look like spam.
User DT, "
Emails do refer to users by a first name which matches first letter of
email address.

>> You are receiving this email because you subscribed to it through
>> SmartBargains.com or one of our partners.


From: AIU Online <aiuonline@aiuonline-update.com>
Subject: Nights. Weekends. We're here when it's convenient for YOU!
Consistent spam, repeated sa-learn as spam, 2 users, plus one
unclassified to third user. Confident this is spam.

From: "International Living" <webeditor@internationalliving.com>
To: jim@cudney.com
Subject: IL Postcards - Tax Breaks in the Cloud Forest
User JC, many emails flagged spam, many emails not flagged, no
sa-learn. May or may not be spam. Certainly looks like scam.

From: "Martin D. Weiss, Ph.D." <alerts@weissinc.com>
Subject: A Personal Invitation from Martin Weiss
User JC, all emails flagged spam, no sa-learn, emails certainly do
look like spam/scam. Sent to only this user.

From: Hersheys Kisses <kisses@prewards.com>
Subject: Complimentary 10 lbs of Hershey's Chocolate
User BQ, clear spam, even in SURBL blacklist.

From: "TopButton" <vip@TopButton.com>
To: nysale@dvorak.org
Subject: TOP BUTTON VIP - Prada Price Cuts: 4-Days Only
User ND, among the most technically oriented and skilled of our users,
email flagged as spam, no sa-learn, only email from this source in the
entire corpus, looks unquestionably spam.

From: eDiets Extra <extra@ediets.com>
Subject: Miami Mediterranean Diet: It's Hot!
Users ST and KG, several emails flagged spam, many emails not flagged,
no sa-learn. May or may not be spam. Certainly looks like spam.

Bob Menschel
Comment 26 Henry Stern 2005-07-30 04:24:59 UTC
As per Justin's request, I did a validation run without Bob's data.  The numbers
come out much better but leave an unanswered question:  Is Bob's data really
noisy or is it really hard?  I'm doing a scoring run now and will post a patch
when it's ready.

I don't care what we do either way.  What do you guys want to do?

vm-set0-2.0-4.0-100-nobob
False positives: mean=0.0767% std=0.0342
False negatives: mean=16.9041% std=5.9576
TCR (lambda=50): mean=3.5471 std=1.2481

vm-set1-2.0-4.0-100-nobob
False positives: mean=0.0595% std=0.0252
False negatives: mean=3.3299% std=1.1745
TCR (lambda=50): mean=16.9662 std=6.0300

vm-set2-2.0-4.625-100-nobob
False positives: mean=0.0686% std=0.0251
False negatives: mean=5.4227% std=1.9189
TCR (lambda=50): mean=11.0551 std=3.9115

vm-set3-2.0-5.0-100-nobob
False positives: mean=0.0575% std=0.0241
False negatives: mean=1.2911% std=0.4657
TCR (lambda=50): mean=31.9635 std=11.8543
Comment 27 Henry Stern 2005-07-30 05:33:59 UTC
Re: comment #24

I absolutely agree with you, Loren.  There's no problem with hand-tuning the
scores afterwards.  What I come up with is not necessarily the right answer,
it's just the best answer that I can come up with given the data at hand.
Comment 28 Henry Stern 2005-07-30 06:08:43 UTC
Created attachment 3046 [details]
Proposed scores for 3.1 generated without Bob's data

gen-set0-2.0-4.0-100-nobob
# Correctly non-spam:  52964  99.94%
# Correctly spam:     100131  81.10%
# False positives:	  34  0.06%
# False negatives:     23335  18.90%
# TCR(l=50): 4.931736  SpamRecall: 81.100%  SpamPrec: 99.966%

gen-set1-2.0-4.0-100-nobob
# Correctly non-spam:  53084  99.95%
# Correctly spam:     118698  96.28%
# False positives:	  28  0.05%
# False negatives:	4592  3.72%
# TCR(l=50): 20.575768	SpamRecall: 96.275%  SpamPrec: 99.976%

gen-set2-2.0-4.625-100-nobob
# Correctly non-spam:  53309  99.92%
# Correctly spam:     116473  93.94%
# False positives:	  41  0.08%
# False negatives:	7508  6.06%
# TCR(l=50): 12.971438	SpamRecall: 93.944%  SpamPrec: 99.965%

gen-set3-2.0-5.0-100-nobob
# Correctly non-spam:  53070  99.96%
# Correctly spam:     121906  98.49%
# False positives:	  21  0.04%
# False negatives:	1872  1.51%
# TCR(l=50): 42.360712	SpamRecall: 98.488%  SpamPrec: 99.983%
Comment 29 Loren Wilton 2005-07-30 06:33:22 UTC
Subject: Re:  Score generation for SpamAssassin 3.1

> TCR (lambda=50): mean=2.6218 std=0.9242

Out of curiosity what is TCR?

Comment 30 Sidney Markowitz 2005-07-30 07:12:53 UTC
Full explanation of TCR (too long for this comment) is in 

http://wiki.apache.org/spamassassin/TotalCostRatio
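
For quick reference, the TCR figures quoted throughout this bug follow the
standard total-cost-ratio formula; a minimal sketch (function and variable
names are mine, not from the SpamAssassin tools):

```python
def tcr(n_spam, false_positives, false_negatives, lam=50):
    """Total Cost Ratio: the cost of using no filter (every spam gets
    through) divided by the cost of using this one, where each false
    positive counts lam times as expensive as a false negative."""
    return n_spam / (lam * false_positives + false_negatives)

# Reproduces the gen-set3-2.0-5.0-100-nobob line from comment 28:
# 121906 correctly-identified spam + 1872 FNs = 123778 spam total, 21 FPs.
print(tcr(123778, 21, 1872))  # -> approximately 42.36
```

TCR > 1 means the filter beats doing no filtering at all; lambda=50 encodes
how much more costly an FP is considered than an FN.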
Comment 31 Bob Menschel 2005-07-30 12:43:59 UTC
> Here's an email Bob sent to sa-dev mailing list that looks like it was meant to
> be a comment here. Or if not, I think it should be in the record here and it is
> on a public list so I feel free to repost it.

Agreed. Actually, this first comment was just back to the list; the
second was to the list, cc'd to bugz, but didn't get to bugz.  I'll try to
post directly to bugz on this subject going forward.

> However, 259 is a lot less than 792 so there still is a question why
> Bob has so many Bonded sender FPs.

My first analysis was on Henry's 10% extract from the log, going
strictly against the FN/FP warning extract from that. So the numbers
were significantly smaller than from my full corpus which Justin
reviewed.

> There are 259 emails from/via constantcontact.com
from that 10% extract
> which are treated as spam on my system, have been flagged as spam on
> my system (scores as high as 30's and 40's), have been encapsulated
> on delivery, have never been flagged by any user as not-spam, but,
> for the purposes of a world-wide mass-check, these
> constantcontact.com emails might be questionable.

> Note: Not all constantcontact.com is treated as spam here -- quite a
> few cc.com newsletters are subscribed to and seen as ham by their
> subscribers and the system. The ones I find above in the fns file are
> all from a set of eight newsletters which have regularly (almost
> always) been seen as spam, and no user has ever corrected that
> classification.

Per my later email, this is out of over 3000 constant contact emails,
split about 50/50 in my corpus. Of the 1500+ that are considered spam
here, half are considered FPs, so apparently the other half are being
flagged correctly regardless of my corpus. No problem there.

Motley Fool: Sidney indicates they're ham; I can't argue with him.
Treated as spam here because a) a user intentionally flagged it as
spam into sa-learn, b) they seem to me to be spam, based on the
contents, c) I'm not familiar with that service myself, and d) I don't
have time to research all of the sources of emails which get flagged
as spam. In my corpus, 22 from this source are flagged as spam (2 via
sa-learn), 26 as ham, 40 as unclassified.

About 80% of my BSP-trusted hits -- spam, ham, and apparently also not
classified -- are through Constant Contact. Given Sidney's discovery and
comment re: constantcontact, I'm fairly convinced that /some/ of the
cc BSP-trusted emails in my corpus are spam. But I can't be absolutely
sure which (even after our discussions here, I'd be willing to put money
down on about a dozen of them that I reviewed yesterday, but only that
dozen or so).

Not all of my cc emails, of course, are BSP-Trusted. Those others also
fall on all sides of the ham/spam/unclassified groupings, and while I
haven't done stats on them, it feels from a quick glance as if the
ratio is about the same.

My corpus comes mostly from an aggressive ISP system, where a) a lot
of spam from known spam sources is dropped before SA, b) there are a
number of additional exim filters which put additional headers into
emails for SA to analyze, c) we have an additional Bayes analysis
system outside SA which gives additional feedback concerning whether
an email is/isn't spam, d) we have additional custom rules that review
the outputs of (b) and (c) in determining the SA score, e) we use most
of the not-high-risk SARE rules, f) we have a large number of
technical users very familiar with spam/anti-spam concerns and very
able to sa-learn their own emails, g) we have a large number of other
(not so technical) users, many of whom use this service specifically
because of its aggressive anti-spam stance, many of whom do actively
sa-learn also, and h) a fair number of users who do no sa-learn.

Because of the aggressive stance, we do have a higher FP ratio than
many other systems. Importantly, we don't have any complaints about
that. Again, we do drop emails before they even get to SA, but those
that get to SA all get delivered to the users, with spam encapsulated.
Some FPs are corrected via sa-learn, as are many FNs.

All FPs and FNs are trapped and entered into my corpus. The number
that I then discard on review afterwards is small -- a handful each
month.

I also trap and enter those emails which are flagged as ham (negative
scores) or spam (scores over 5) by BOTH SA and one of our internal
systems. I review both of these categories, but because of the numbers
I don't manually validate each and every one. I do review the ham more
carefully than the spam.

These practices may be where the discrepancy comes from -- my reliance
on others to manually validate ham/spam via sa-learn, my acceptance
of their determination when I do not have contradicting evidence
myself, and my acceptance with careful but not paranoid review of
automated classification when two or more classification systems
agree.

I'll be reviewing the BSP-other and HABEAS_ACCREDITED_COI spam hits
later today.

Meanwhile, though I have confidence that my corpus is reasonably
accurate, I also have no problem with it being discarded if my
methodology above is insufficient for scoring purposes.

The two questions, one asked by Henry:
> Is Bob's data really noisy or is it really hard?
and one of my own: what is the definition of "spam" as it should be
applied to scoring? Is there any room in there for end user perception
(I didn't ask for this), or must mail be accepted as ham if the user
ever at any time opted in for any mail from the sender, even mail which
does not properly relate to the reason the user wanted the email?

Again, I have no problem with my corpus (or any subset of it) being
discarded. I'm also willing to work on improving my methodologies for
3.2's rescoring run.

Bob Menschel

Comment 32 Bob Menschel 2005-07-30 17:08:26 UTC
> I'll be reviewing the BSP-other and HABEAS_ACCREDITED_COI spam hits
> later today.

> BSP-other misclassified ham: 11
Message-ID: <9992bcc605040810462af9cb11@mail.google.com> -- no idea
  how this obvious ham got into the corpus as spam.
Message-ID:
<bysp635axk0d48bfj1x7kbvjbu35j7.174415332.4053@mta300.email.americanexpress.com>
  Ditto.
Message-ID: <20050419184459.18922.qmail@corpmailer01.prod.mesa1.secureserver.net>
  From notice@godaddy.com, pure advertising/marketing newsletter to a
  user (Godaddy customer) who sa-learned this as spam, apparently
  wanting only domain registration data and not sales fluff. I only have
  the one godaddy.com email in the "sa-learned as spam" corpus, and have 7
  others that were classified as spam, obvious marketing newsletters.
  Well over 90% of all godaddy newsletters are in the ham corpus (or
  unclassified), and none of their functional emails dealing with
  registrations and specific domains are flagged as spam (about 40% of
  all godaddy emails are unclassified, the remainder ham, except for
  these 8).
Message-ID: <PayPal.65mgpzxn8.h0@email.paypal.com>
  From: PayPal <paypal@email.paypal.com>
  Subject: Annual Privacy and Electronic Fund Transfer Rights Notice
  X-Header-CompanyDBUserName: paypal
  Errors-To: paypal@email.paypal.com
  Reply-To: paypal@email.paypal.com
  X-Header-MasterId: 900764
  X-Header-Versions: PayPal.65mgpzxn8.h0@email.paypal.com
  X-Originating-IP: [206.165.246.83]
  X-Sender-Nameserver: ns3.yahoo.com ns4.yahoo.com ns5.yahoo.com ns1.yahoo.com
ns2.yahoo.com em
  X-Spam-Status: Yes, score=106.1 required=5.0 tests=BAYES_00,DCC_CHECK,
        DIGEST_MULTIPLE,HTML_20_30,HTML_MESSAGE,MIME_HTML_ONLY,OPT_IN,
        PYZOR_CHECK,RCVD_IN_BSP_OTHER,SARE_FORGED_PAYPAL,SARE_FORGED_PAYPAL_C,
        SP_HAM_VERY autolearn=no version=3.0.4
  Content looks like it came from PayPal, and I don't see any phishing
  links within, but the received header trail has nothing to do with
  any paypal or ebay system -- the only servers listed in the received
  chain are yahoo.com (starting at milter101.store.sc5.yahoo.com). I'm
  guessing this was sent to an email address within the yahoo store
  system, which auto-forwarded to the owner's address on our system,
  and the Yahoo system *stripped* all evidence that this actually came
  from paypal, causing our phish alarms to go off. 64 identical emails
  came through, most as ham, some unclassified, this was the only one
  flagged as spam.

> BSP-other questionable entries: 4
Message-ID: <25789186.1117674104561.JavaMail.clundberg@scotch>
  From rabbilerner@tikkun.org, associated with democracyinaction.org
  Fed to sa-learn as spam by user RI. Religious/Political newsletter,
  of 7 emails in my corpus, 4 have been sa-learned by this user as
  spam, one to this user is unclassified, one to this user is
  classified as ham (not sa-learned), and one is classified as ham to
  a different user.

> BSP-other definite spam: 1
Message-ID: <6.0.0.22.1.20050610214911.3eca3bd7@paypal.com> --
  guaranteed phish. Internal link to <a
  href="http://www.paypallk.com:680/paypal.php" style="font-family:
  monospace; font-size: 10pt;">Click here to confirm your account</a>

> HABEAS_ACCREDITED_COI misclassified ham: 12
Message-ID: <21139714.1120711719692.JavaMail.truelink@vma03.sbp-prod.truelink.com>
  From: FreeCreditProfile <support@freecreditprofile.com>   count: 12

> HABEAS_ACCREDITED_COI questionable entries: 32
Message-Id: yournewsletterswf20094m05XZ200506090501807044@yournewsletters.net
  southbeachdiet.com email mentioned previously. Count: 1
Message-ID: <29140116.1118865322661.JavaMail.root@mailagent0.ientrymail.com>
  In general, @ientrynetwork.net newsletters are very spammy. One user
  religiously places his newsletters flagged as spam into sa-learn as
  ham, but no others do so. Count: 8
Message-Id: <E1DiegK-0006Wz-GA@pascal.ctyme.com>
  No message id from sender. Count: 22
  From newsletter@tickle-inc.com,
  Subject: Your future, revealed!
  "The Tickle Newsletter is an email service designed with you in mind
  &#0151; it's the only email all about you.  We think you're going to
  love it." Sure sounds like an introduction to spam. Contents look
  very spammy as well.
Message-ID: <PRODWEB052en0bcaQUX00003c75@PRODWEB05.WLElmsford.com>
  FROM: Reservation Rewards Customer Service
<customerservice@reservationrewards.com>
  SUBJECT: As requested, your Membership Kit for Reservation Rewards, please
login today
  X-Spam-Status: Yes, score=12.5 required=5.0 tests=BANG_GUARANTEE,BAYES_00,
        CALL_FREE,CT_ACT_NOW,CT_DO_IT_TODAY,CT_OFFERS_ETC,CT_OFFER_3,
        CT_PERCENT,DNS_FROM_AHBL_RHSBL,FORGED_RCVD_HELO,HABEAS_USER,
        HTML_50_60,HTML_MESSAGE,LINK_PHRASE,MAILTO_LINK,
        MIME_HEADER_CTYPE_ONLY,NO_COST,ORDER_NOW,SARE_BOUNDARY_LC,SAVE_MONEY,
        SAVE_UP_TO,SP_SPAM_VERY,URI_OFFERS autolearn=no version=3.0.4
  User CR; If she signed up, then this membership confirmation was not
  spam. However, this confirmation dated July 3 is followed by a
  billing notice dated July 17, and then confirmation of the user's
  cancellation dated July 17. Cannot tell whether the original was
  spam, but user seems to have no interest in the service.

> HABEAS_ACCREDITED_COI definite spam: 0

Comment 33 Sidney Markowitz 2005-07-30 17:49:05 UTC
Bob,

It's tricky getting a good corpus: There are spammy looking mails from sources
that follow the rules. There are people who are so clueless that they label
something spam rather than unsubscribe. There are people who do the same
not because they are clueless but because, if they don't recognize that
something comes from a subscription or just aren't sure, they know better
than to take a chance on using a spammer's unsubscribe link. And there's
Constant Contact, who may have found a
way around what at first glance appears to be a good defense against spam.

So how do you have a clean corpus when it could contain edge cases that are
classified wrong? What is the "correct" score for such mail? If the only
difference between a piece of spam and a piece of ham is whether the recipient
subscribed to it, how do you call either one an FP or an FN for the purpose of
the rule scoring program? I don't have answers to that.

By the way, if Constant Contact really is doing that, they must be counting on
low numbers of complaints. That link I posted to Ironport's site listed the
Bonded Sender fees as of two years ago. It makes it risky for a single customer
to spam. But I can see how Constant Contact could have a business model based on
getting paid by a mix of spammers and hammers. The Bonded Sender fines are based
on number of complaints per million mails. If you want to nail them, get
aggressive about reporting the confirmed RCVD_IN_BSP_TRUSTED spam. Once the
numbers of complaints reach the threshold where it costs Constant Contact $1000
per spam mail they are going to have to clean up their act if it really is that
sleazy.
Comment 34 Auto-Mass-Checker 2005-07-30 20:30:12 UTC
Subject: Re:  Score generation for SpamAssassin 3.1 


BTW weren't we planning to set the BAYES_ scores non-mutable?
can't quite recall.

--j.

Comment 35 Loren Wilton 2005-07-30 21:48:45 UTC
Subject: Re:  Score generation for SpamAssassin 3.1

> BTW weren't we planning to set the BAYES_ scores non-mutable?
> can't quite recall.

I know there had been talk of it, although I'm too lazy to try to dig up the
thread.

I think, if it isn't too much work, what I'd like to see would be something
like taking the final generated scoreset, normalizing the bayes numbers for
all sets to ascending sequence more or less*, and then locking them and
rerunning the score generation to get updated values for the other rules.

*    From the data I looked at in Henry's posting, I seem to recall that 05
and 99 were obviously out of sequence.  I think 99 is the critical one to
have in sequence.  05 may be correct where it is, even though out of
sequence.  Perhaps a topic for discussion.

        Loren

Comment 36 Bob Menschel 2005-07-31 15:11:30 UTC
I personally would prefer to avoid fixing any Bayes scores so they couldn't
float, but I feel equally strongly that BAYES_99 should score higher than the
others. BAYES_00 is problematic when a Bayes database gets poisoned, but
BAYES_99 generally doesn't have that problem. 

Option 1: Allow all Bayes scores to float, but add code which forces BAYES_99 to
be at least 10% higher than the max score of all other Bayes scores (at least
BAYES_95).

Option 2: Allow all Bayes scores to float, but give BAYES_99 a floor of either
3.5 or 4.0 -- it can float higher if the Perceptron feels it should, but no lower. 
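
A sketch of how the two options might look as a post-processing step over
the generated score set (hypothetical helper, not existing SpamAssassin
code; `scores` maps rule name to perceptron output):

```python
def apply_bayes_99_options(scores, floor=3.5):
    """Return BAYES_99 under Option 1 (at least 10% above the highest
    other BAYES_* score) and Option 2 (a hard floor; the score may
    still float higher, but never lower)."""
    other_max = max(v for k, v in scores.items()
                    if k.startswith("BAYES_") and k != "BAYES_99")
    option1 = max(scores["BAYES_99"], other_max * 1.10)
    option2 = max(scores["BAYES_99"], floor)
    return option1, option2

# With the scoreset 3 freqs quoted in comment 42 (BAYES_80 at 2.09,
# BAYES_99 at 1.89), Option 1 lifts BAYES_99 to about 2.30, Option 2 to 3.5.
```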

In SARE we sometimes run into a family of rules like Bayes, something like
__RULE_1 -- spam sign # 1
__RULE_2 -- spam sign # 2
__RULE_3 -- spam sign # 3
meta RULE_1 -- rule 1 but not 2 or 3
meta RULE_2 -- rule 2 but not 1 or 3
meta RULE_3 -- rule 3 but not 1 or 2
meta RULE_4 -- rules 1 and 2 but not 3
meta RULE_5 -- rules 1 and 3 but not 2
meta RULE_6 -- rules 2 and 3 but not 1
meta RULE_7 -- rules 1, 2, and 3
The meta rules 1-3 are scored based on their solo hits (the hits of their
__feeder rules), using our standard SARE algorithms.
Assuming that meta rules 4-6 hit fewer ham than 1-3, we score them higher than
1-3, even if their total spam hits are lower (because of the increased
requirements). 
Likewise, meta rule 7 will be scored highest of this family, because it's 
"safest" of the seven rules. 

Would it be worth while opening a new bugz entry for a 3.2 enhancement to
implement some kind of "this rule scores better than that rule if its S/O is at
least as good" linkage? 
Comment 37 Bob Menschel 2005-07-31 18:42:41 UTC
SM> It's tricky getting a good corpus: ...

In addition to your reasons, a good corpus for local use (it's spam here, and
always spam here) may not be good for global use (it's not spam to users on that
other system over there). And to expand on your
SM> There are people who [sa-learn as spam] not because they are clueless, but
if they don't recognize that something comes from a subscription or just aren't
sure, ...
There are also sources that confound matters -- a user can sign up with them for
one brand, and receive emails from a corporate parent with a different domain name.

SM> And there's Constant Contact who may have found a way around what at first
glance appears to be a good defense against spam.

SM> ... if Constant Contact really is doing that, they must be counting on
low numbers of complaints. 

Apparently they are, based on the large number of cc.com emails here that
qualify for the BSP rules. 

SM> That link I posted to Ironport's site listed the Bonded Sender fees as of
two years ago. It makes it risky for a single customer to spam. But I can see
how Constant Contact could have a business model based on getting paid by a mix
of spammers and hammers. The Bonded Sender fines are based on number of
complaints per million mails. If you want to nail them, get aggressive about
reporting the confirmed RCVD_IN_BSP_TRUSTED spam. ...

My family gets a lot more ham than spam from cc.com, and so in the past on those
rare occasions when we've gotten cc.com spam I've gone directly to them, with
satisfactory results. Given what I'm seeing now in this corpus, I'll send in the
formal complaints to BSP/Ironport, to increase cc.com's incentive to police
their customers. 

SM> So how do you have a clean corpus when it could contain edge cases that are
classified wrong? ...

Or, IMO more correctly, a valid and representative corpus used for scoring
/should/ have edge cases that may or may not be classified wrong -- there's no
other way for a major ISP, which can't know what its users did or didn't
subscribe to, to manage its spam. It's important to classify them as
accurately as humanly possible, but for SA to be optimally useful it needs to be
able to make judgments about the edge cases as well, and it can only do that if
we take the risk and include them in our corpus. 

SM> What is the "correct" score for such mail? If the only difference between a
piece of spam and a piece of ham is whether the recipient subscribed to it, how
do you call either one an FP or an FN for the purpose of the rule scoring
program? I don't have answers to that.

First pass suggestion:  Aim to get these "edge" emails into the 2.0-4.0 score
range, so that network tests and hopefully Bayes can push them over 5.0 or under
0.0 as appropriate for the user/site. 

Comment 38 Justin Mason 2005-08-02 13:45:57 UTC
Created attachment 3048 [details]
freqs for scoreset 3, all logs, all rules

Daniel noticed that the freqs file I posted was missing SPF_PASS (for some
reason, it's listed as a userconf rule, dunno why).  Here's a copy that includes it.
Comment 39 Justin Mason 2005-08-02 16:26:40 UTC
regarding the Bob's-corpus issue.   I've been pondering this a bit, and I think
we have to leave it out of the rescore run.

Fundamentally, I don't trust the user population involved :(  I think your
users are using "learn as spam" to keep stuff that isn't *strictly* UBE out of
their mail folders; by using those logs, we'd generate score-sets to consider
spam to be "stuff your users don't want" rather than "unsolicited bulk email",
which is what we have to aim towards.

We used to have a spam definition, namely "spam == UBE", up somewhere related
to corpus policy, but I can't find it now.   But in my opinion that still
applies ;)

(to be honest, I'm not sure there's any good way to use someone else's email in
a rescoring run, since I've often wound up saying "yes, I subscribed to that
horrible spammy-looking newsletter that's sending with a misleading HELO
string", even for my own mail.  and you should see Rod's corpus! ;)

--j.
Comment 40 Michael Parker 2005-08-03 18:07:32 UTC
The scores for the upper BAYES rules (i.e. 80, 95 and 99) are too low.  We should
lock in the values based on what we saw in the 3.0 release.

Personally I've been running with this in my local.cf for a long while with no
issues:

score BAYES_80 0 0 4.608 3.087
score BAYES_95 0 0 4.514 3.063
score BAYES_99 0 0 5.070 3.886


Granted the 80/95 set3 scores might be a tad high for general consumption.
Comment 41 Daryl C. W. O'Shea 2005-08-03 18:27:33 UTC
Subject: Re:  Score generation for SpamAssassin 3.1

Same here.  I've been running with 3.0's scoreset 2 scores for both 
scoresets 2 and 3, for BAYES_50-99, with no problems (always using 
scoreset 3).

score BAYES_50 0 0 1.567 1.567
score BAYES_60 0 0 3.515 3.515
score BAYES_80 0 0 3.608 3.608
score BAYES_95 0 0 3.514 3.514
score BAYES_99 0 0 4.070 4.070

Comment 42 Justin Mason 2005-08-03 19:04:36 UTC
anyway, back to the score generation thing, a few items:


1. I'm -1 on using those scores. They look great all-round, *except* for the
Bayes scores:

 56.044  84.1316   0.0375    1.000   0.84    1.89  BAYES_99
  1.716   2.5715   0.0099    0.996   0.83    2.06  BAYES_95
  1.983   2.9654   0.0251    0.992   0.76    2.09  BAYES_80
  1.685   2.5064   0.0463    0.982   0.68    0.37  BAYES_60
 31.996   0.3606  95.0772    0.004   0.60   -2.60  BAYES_00
  4.503   5.9619   1.5927    0.789   0.47    0.00  BAYES_50
  0.311   0.0880   0.7556    0.104   0.36   -0.41  BAYES_05
  0.377   0.1622   0.8048    0.168   0.32   -1.95  BAYES_20
  0.401   0.2655   0.6706    0.284   0.27   -1.10  BAYES_40

(scoreset 3 freqs output.)   note that none of them was permitted above 2
points by the perceptron; those scores have the odd flattening for
BAYES_95/99 we had to fix in 3.0.3 in r165033; and there seems to be
unanimous support on the record for fixing these.

(ok, I'm being a little disingenuous on the last point, as I think someone,
either Daniel or Henry, was ok with letting them float, but they made the
comment on a transitory medium like IRC or IM so it doesn't count. ;)

So I suggest we set them to the static scores and move out of the mutable
section, as done in the attached patch, then get Henry to rerun
the perceptron.   for ease of review, those static scores are:

score BAYES_00 0.0001 0.0001 -2.312 -2.599
score BAYES_05 0.0001 0.0001 -1.110 -1.110
score BAYES_20 0.0001 0.0001 -0.740 -0.740
score BAYES_40 0.0001 0.0001 -0.185 -0.185
score BAYES_50 0.0001 0.0001 0.001 0.001
score BAYES_60 0.0001 0.0001 1.0 1.0
score BAYES_80 0.0001 0.0001 2.0 2.0
score BAYES_95 0.0001 0.0001 3.0 3.0
score BAYES_99 0.0001 0.0001 3.5 3.5

they're a mix of what the perceptron said in that last run, what was used in
3.0.3, and some smoothing (to avoid the FAQs again).


Henry -- any chance you can gzip up the validation set after you run the
perceptron, and put it somewhere?   There's a whole batch of stuff that needs
to be done that depends on it.  Also, we need to get the statistics in.   I've
updated http://wiki.apache.org/spamassassin/RescoreMassCheck with what I think
needs to be done (steps 5 onwards).

Probably not worth doing those until we vote on the patch / figure out
what to do with the BAYES scores, though.
Comment 43 Justin Mason 2005-08-03 19:05:04 UTC
Created attachment 3051 [details]
bayes scores
Comment 44 Loren Wilton 2005-08-03 19:49:05 UTC
Subject: Re:  Score generation for SpamAssassin 3.1

FWIW, the data from scoreset 3 more closely supports using the equation (bayes_group-50)/(50/3.5) to calculate the score.  This is quite close to Justin's values above 50, but departs considerably at lower Bayes values:

Group   Set 3    Norm 3.5   Justin 2   Justin 3
  0     -2.600   -3.500     -2.312     -2.599
  5     -0.410   -3.150     -1.110     -1.110
 20     -1.950   -2.100     -0.740     -0.740
 40     -1.100   -0.700     -0.185     -0.185
 50      0.000    0.000      0.001      0.001
 60      0.370    0.700      1.000      1.000
 80      2.090    2.100      2.000      2.000
 95      2.060    3.150      3.000      3.000
 99      1.890    3.430      3.500      3.500

The "Norm 3.5" column matching the above equation is very close to the Perceptron scores for Bayes_20 to Bayes_80.  The Perceptron score for Bayes_05 is just plain wonky, and of course the scores flatten completely at Bayes_80.

Fitting the bayes_20 to bayes_80 scores with a straight line produces a slightly lower value for the constant (3.5) above: 3.3875.  This of course produces slightly less aggressive scores on the top and bottom ends:

Group   Set 3    Norm 3.3875
  0     -2.600   -3.388
  5     -0.410   -3.049
 20     -1.950   -2.033
 40     -1.100   -0.678
 50      0.000    0.000
 60      0.370    0.678
 80      2.090    2.033
 95      2.060    3.049
 99      1.890    3.320
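
A sketch of the arithmetic behind these columns (plain Python; the fitted
constant here comes out near, though not exactly at, the 3.3875 quoted
above, since the result depends on exactly which buckets go into the fit):

```python
def norm_score(group, k=3.5):
    # Loren's mapping: a straight line through 0 at BAYES_50,
    # reaching +/- k at the 0 and 100 extremes.
    return (group - 50) * k / 50.0

# Zero-intercept least-squares fit over the set 3 mid-range buckets
# (BAYES_20 .. BAYES_80), using the "Set 3" values tabulated above.
groups = [g - 50 for g in (20, 40, 50, 60, 80)]
vals = [-1.950, -1.100, 0.000, 0.370, 2.090]
slope = sum(g * v for g, v in zip(groups, vals)) / sum(g * g for g in groups)
print(round(slope * 50, 4))  # implied constant k; -> 3.3975 with this fit
```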

Comment 45 Justin Mason 2005-08-05 13:44:13 UTC
hellooooo! anyone out there? especially Henry, you're on the critical path here
in a big way. This bug is the 3.1.0 blocker.  Once this is done we can release
3.1.0.  As such it's pretty important! 

IMMEDIATELY REQUIRED:

- Henry: gzip up the validation logs set and put them somewhere.  This
  gets you off the critical path for 3.1.0, at least temporarily, since
  we can try out new bayes scores and figure out if a new perceptron
  will need to be run, or if we can just bump the scores manually and
  use the patch you already posted.   Without the validation set,
  we can't get an accurate idea afaik.

- ALL DEVS: decide correct scores for BAYES*.    this requires comments.
  please comment.

- ALL DEVS: if my patch of proposed BAYES* scores meets with your approval
  (which I'd say it probably won't seeing as everyone has their favourites),
  vote +1.  Otherwise create a patch of your own we can vote on. I think
  DOS' and Loren's suggested scores both look ok.

DOWN THE ROAD A BIT:

- Henry: (possibly) rerun the perceptron if the validation logs set
  indicates that it's required.

- ALL DEVS: once there's a new patch with all scores, vote on it so
  it can be applied.

Comment 46 Daryl C. W. O'Shea 2005-08-06 01:36:49 UTC
I just noticed that the proposed 3.1 BAYES_* scores in scoreset 2 are identical
to the 3.0 ones.

So... manually tweaked scores for 3.0 should work just as good with 3.1.  I'm +1
on the BAYES_50-99 scores I posted in comment 41 (which are the scoreset 2
scores copied to scoreset 3).  I really think BAYES_99 should score at least 4.0.

I'm not exactly sure which of Loren's scores Justin is referring to, but I think
3.5 is too low for BAYES_99.
Comment 47 Loren Wilton 2005-08-06 02:06:20 UTC
Subject: Re:  Score generation for SpamAssassin 3.1

> I'm not exactly sure which of Loren's scores Justin is referring to, but I
think
> 3.5 is too low for BAYES_99.

I'm not sure which set either.  I think that 3.5 *might* be OK with net tests
also.  I think I'd want something closer to 4.0 - 4.5 or even higher without
net tests.  Wasn't it something just shy of 5 in 2.6?

Comment 48 Loren Wilton 2005-08-06 03:30:13 UTC
Another suggested set of bayes values:

Bayes   Set 2    Set 3    Eqn 2    Eqn 3
  0     -2.312   -2.599   -2.500   -2.600
  5     -1.110   -0.413   -1.525   -2.200
 20     -0.740   -1.951   -0.700   -2.000
 40     -0.185   -1.096    0.400   -0.780
 50      0.912    0.001    0.950   -0.100
 60      2.220    0.372    1.800    0.580
 80      2.775    2.087    2.700    1.940
 95      3.237    2.063    3.425    2.960
 99      3.145    1.886    3.645    3.232

The second and third columns are sets 2 and 3 from Henry's data.  The final two 
columns are my proposed values for sets 2 and 3.  These values are not what I 
would really like to see on the high end, but I think are about as high as one 
can somewhat reasonably go based on the data.

Both sets are essentially linear trendlines for sets 2 and 3, with some hand 
corrections to better match what I consider a few important data points.
In particular, bayes_00 for both sets 2 and 3 is close to -2.5.  However the 
trendlines would predict values around -1.7 for set 2 and -3.2 or so for set 
3.  I've moved the bayes_00 point to something that the data will support in 
both cases.  Also both sets show a weakness in bayes_05.  I've pushed the 
bayes_05 trendline values upward for both sets, although not far enough to 
create score inversions.

It should be noted that both original sets indicate a flattening of the bayes 
scores over 80%.  I've left these values as the linear trendline would predict, 
since that seems to be closer to normal human experience.  It must be noted 
though that the data doesn't really support these extrapolations, especially 
for bayes_99.  

Neither bayes_99 score comes close to 4.0.  I tried to play with the data until 
I could get something in that range, but it wouldn't go along with the game.  
It would be possible to tweak the set 2 scores for 95 and 99 upward to aim at 
4.0 without departing too badly from the data.  This wouldn't be possible with 
the set 3 scores.
Comment 49 Justin Mason 2005-08-06 11:02:55 UTC
'So... manually tweaked scores for 3.0 should work just as good with 3.1.  I'm +1
on the BAYES_50-99 scores I posted in comment 41 (which are the scoreset 2
scores copied to scoreset 3).  I really think BAYES_99 should score at least 4.0.'

OK, I'm fine with the comment 41 scores, and I agree BAYES_99 should be >= 4.0.
+1.

care to make a patch?
Comment 50 Justin Mason 2005-08-06 15:56:06 UTC
OK, I got hold of the logs from Henry, and measured some BAYES scores
against the validation set:

base results from comment 28, gen-set3-2.0-5.0-100-nobob:
# Correctly non-spam:  53070  99.96%
# Correctly spam:     121906  98.49%
# False positives:        21  0.04%
# False negatives:      1872  1.51%
# TCR(l=50): 42.360712  SpamRecall: 98.488%  SpamPrec: 99.983%

copying values from set 2 for set 3:
# Correctly non-spam:  53064  99.95%
# Correctly spam:     122453  98.93%
# False positives:        27  0.05%
# False negatives:      1325  1.07%
# TCR(l=50): 46.272150  SpamRecall: 98.930%  SpamPrec: 99.978%

comment 14:
# Correctly non-spam:  53014  99.85%
# Correctly spam:     123093  99.45%
# False positives:        77  0.15%
# False negatives:       685  0.55%
# TCR(l=50): 27.293936  SpamRecall: 99.447%  SpamPrec: 99.937%

comment 42 (the patch in attachment 3051 [details]):
# Correctly non-spam:  53068  99.96%
# Correctly spam:     122509  98.97%
# False positives:        23  0.04%
# False negatives:      1269  1.03%
# TCR(l=50): 51.169078  SpamRecall: 98.975%  SpamPrec: 99.981%

I think 3051 has the best scores: fewer FNs, just 2 more FPs,
sane scores.   I'd suggest we just vote on that patch.

If you want to try other values btw -- the logs are in the zone.  do this:

  cd svncheckout/masses
  rm ham.log spam.log
  ln -s /home/corpus-rsync/corpus/scoregen-3.1/gen-set3-2.0-5.0-100-nobob/NSBASE/ham-test.log ham.log
  ln -s /home/corpus-rsync/corpus/scoregen-3.1/gen-set3-2.0-5.0-100-nobob/SPBASE/spam-test.log spam.log
  vi ../rules/50_scores.cf
  ./fp-fn-statistics --scoreset=3
Comment 51 Duncan Findlay 2005-08-06 22:30:47 UTC
+1 on 3051

It would probably be more valid if we set the bayes score a little higher and
re-ran the perceptron, that way we could get scores over 4 for BAYES_99 without
so many FPs.
Comment 52 Bob Menschel 2005-08-06 22:53:13 UTC
+1 on 3051, and I agree it'd be good to see whether a perceptron run would back
out those two extra FPs (though I'm not overly concerned about just two FPs). 
Comment 53 Duncan Findlay 2005-08-06 22:58:35 UTC
What I meant to say was that we should set the BAYES scores explicitly and make
them immutable, then re-run the perceptron. In that case, I'd rather see
slightly higher bayes scores, closer to those in comment 40 or comment 41
(probably in between). I'd like to see about 4.5 for BAYES_99.
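What Duncan is proposing would look roughly like this in rules/50_scores.cf: score lines placed outside the <gen:mutable> markers are left alone by the score generator, so the Bayes rules get pinned values. The 4.5 for BAYES_99 is his suggestion; the other numbers are purely illustrative (a Bayes rule scores 0 in scoresets 0 and 1, where Bayes is off):

```
# Scores outside the <gen:mutable>...</gen:mutable> region are "immutable":
# the perceptron may not rewrite them.  Four values = scoresets 0-3.
score BAYES_99 0 0 4.5 4.5    # illustrative pinned values, not committed ones
score BAYES_95 0 0 3.5 3.5
# <gen:mutable>
# ... perceptron-generated scores live in here ...
# </gen:mutable>
```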
Comment 54 Justin Mason 2005-08-07 00:30:20 UTC
yeah, I'd like to do another perceptron run with those immutable -- however it
might take too long.  that's up to Henry, really.... in the meantime let's apply
3051.
Comment 55 Henry Stern 2005-08-07 00:43:14 UTC
I don't mind doing another validation and scoring run.  Commit a patch with
whatever you want to svn and let me know.  Make sure that the scores are in an
immutable block.
Comment 56 Justin Mason 2005-08-07 16:27:19 UTC
Henry: 3051 now has 3 +1s, and can be committed.  It moves the BAYES scores into
an immutable block.  so if you want to give this a go, go ahead and patch that
and check it in, then rerun the perceptron; alternatively, I'll check it in
later if you haven't beaten me to it, and you can rerun perceptron after that.
Comment 57 Justin Mason 2005-08-07 18:13:35 UTC
ok, I got that chance; 3051 is now applied.

trunk:
Sending        rules/50_scores.cf
Transmitting file data .
Committed revision 230721.

b3_1_0:
Sending        rules/50_scores.cf
Transmitting file data .
Committed revision 230723.
Comment 58 Justin Mason 2005-08-08 19:33:51 UTC
Created attachment 3062 [details]
release-quality patch

hey, here's a patch that uses the scores from attachment 3046 [details], plus the bayes
scores from attachment 3051 [details], and includes STATISTICS files for all scoresets.

This is release-quality, if we want to go with this; alternatively, we can wait
for a go-around with the locked-down Bayes scores.

IMO: we should release with these.  set 3 is looking fine as-is, and we're
spending a lot of time on this.
Comment 59 Justin Mason 2005-08-09 11:31:29 UTC
hmm, nix that patch.   I've just realised the STATISTICS files don't contain the
freqs.
Comment 60 Henry Stern 2005-08-09 12:10:31 UTC
Changing the Bayes scores didn't have an impact on accuracy with newly-generated
scores.  That doesn't mean that changing the Bayes scores in a previously-generated
set has no impact on accuracy (we know otherwise).

Do you really want me to generate the scores again?  It's a real ballache but
I'll do it.

Samples: vm-set1-2.0-4.0-100-nobob vm-set1-2.0-4.0-100-nobob-ib
False positives:
        Sample 1: mean=0.0554% std=0.0229
        Sample 2: mean=0.0595% std=0.0252
        Statistically significantly different with confidence 99.2161%
        Estimated difference: -0.0041% +/- 0.0117

False negatives:
        Sample 1: mean=3.3473% std=1.1779
        Sample 2: mean=3.3299% std=1.1745
        Not statistically significantly different (alpha=0.9500)
        Estimated difference: 0.0174% +/- 0.1339

TCR (lambda=50):
        Sample 1: mean=17.2267 std=6.1150
        Sample 2: mean=16.9662 std=6.0300
        Not statistically significantly different (alpha=0.9500)
        Estimated difference: 0.2605 +/- 1.0179

Samples: vm-set3-2.0-5.0-100-nobob vm-set3-2.0-5.0-100-nobob-ib
False positives:
        Sample 1: mean=0.0546% std=0.0282
        Sample 2: mean=0.0575% std=0.0241
        Not statistically significantly different (alpha=0.9500)
        Estimated difference: -0.0028% +/- 0.0651

False negatives:
        Sample 1: mean=1.0845% std=0.5179
        Sample 2: mean=1.2911% std=0.4657
        Not statistically significantly different (alpha=0.9500)
        Estimated difference: -0.2066% +/- 0.8138

TCR (lambda=50):
        Sample 1: mean=37.6074 std=15.3585
        Sample 2: mean=31.9635 std=11.8543
        Not statistically significantly different (alpha=0.9500)
        Estimated difference: 5.6439 +/- 23.5426

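The "statistically significantly different" verdicts above come from a two-sample test on the per-fold rates. A minimal pure-Python sketch of the Welch t-statistic such a comparison rests on; the sample values here are made up for illustration, not Henry's actual folds:

```python
import math

def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # unbiased sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

# Hypothetical per-fold false-positive rates for two score runs:
run1 = [0.052, 0.058, 0.049, 0.061, 0.055]
run2 = [0.057, 0.063, 0.054, 0.066, 0.060]
t = welch_t(run1, run2)
print(round(t, 3))  # negative t: run1's mean FP rate is lower than run2's
```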
Comment 61 Justin Mason 2005-08-09 12:12:12 UTC
'Do you really want me to generate the scores again?  It's a real ballache but
I'll do it.'

no, no need.  thanks for checking btw!
Comment 62 Justin Mason 2005-08-09 17:01:12 UTC
Created attachment 3065 [details]
redo of 3062

ok, this one's better, includes the freqs!  Please vote.....
Comment 63 Theo Van Dinter 2005-08-09 17:26:01 UTC
3065 is almost there it seems.

t/meta......................MULTI_FORGED depends on FORGED_AOL_RCVD with 0 score in set 0
MULTI_FORGED depends on FORGED_GW05_RCVD with 0 score in set 0
CONFIRMED_FORGED depends on FORGED_AOL_RCVD with 0 score in set 0
CONFIRMED_FORGED depends on FORGED_GW05_RCVD with 0 score in set 0
MULTI_FORGED depends on FORGED_AOL_RCVD with 0 score in set 1
MULTI_FORGED depends on FORGED_GW05_RCVD with 0 score in set 1
FORGED_THEBAT_HTML depends on MIME_HTML_ONLY with 0 score in set 1
FORGED_IMS_HTML depends on MIME_HTML_ONLY with 0 score in set 1
HTML_MIME_NO_HTML_TAG depends on MIME_HTML_ONLY with 0 score in set 1
DRUGS_MANYKINDS depends on DRUGS_PAIN with 0 score in set 1
OBFUSCATING_COMMENT depends on MIME_HTML_ONLY with 0 score in set 1
FORGED_OUTLOOK_HTML depends on MIME_HTML_ONLY with 0 score in set 1
MULTI_FORGED depends on FORGED_AOL_RCVD with 0 score in set 2
MULTI_FORGED depends on FORGED_GW05_RCVD with 0 score in set 2
CONFIRMED_FORGED depends on FORGED_AOL_RCVD with 0 score in set 2
CONFIRMED_FORGED depends on FORGED_GW05_RCVD with 0 score in set 2
MULTI_FORGED depends on FORGED_AOL_RCVD with 0 score in set 3
MULTI_FORGED depends on FORGED_GW05_RCVD with 0 score in set 3
DRUGS_MANYKINDS depends on DRUGS_PAIN with 0 score in set 3
DRUGS_MANYKINDS depends on DRUGS_MUSCLE with 0 score in set 3


I think there are a couple of things we may want to address in the future as well:  some scores are set
to "0.000" versus "0", a la "score HDR_ORDER_MTSRIX 0 # n=0 n=1 n=2 n=3" instead of "score URI_HEX
0.000".  It'd be nice to round scores where abs(score) < 0.1 to 0 like we used to do.  No point in
running rules when they're basically not going to contribute.  Etc.
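Theo's rounding rule is simple to state; a minimal sketch, where the 0.1 cutoff is the one he proposes and the helper name is invented:

```python
def squash_score(score, cutoff=0.1):
    """Round near-zero scores to exactly 0 so the rule is skipped entirely."""
    return 0 if abs(score) < cutoff else score

print(squash_score(0.043))   # 0
print(squash_score(-0.099))  # 0
print(squash_score(0.1))     # 0.1 (at the cutoff, keep it)
print(squash_score(-3.5))    # -3.5
```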
Comment 64 Michael Parker 2005-08-09 18:47:42 UTC
Subject: Re:  Score generation for SpamAssassin 3.1

+1
Comment 65 Justin Mason 2005-08-09 21:14:59 UTC
ok, working on the meta.t failures and on zeroing scores where -0.1 < score < 0.1.

question: has anyone used 'rewrite-cf-with-new-scores' recently?  can it
successfully rewrite these scores in place?

# URIDNSBL
ifplugin Mail::SpamAssassin::Plugin::URIDNSBL
# <gen:mutable>
score URIBL_AB_SURBL 0 3.306 0 3.812
score URIBL_JP_SURBL 0 3.360 0 4.087
score URIBL_OB_SURBL 0 2.617 0 3.008
score URIBL_PH_SURBL 0 2.240 0 2.800
score URIBL_SBL 0 1.094 0 1.639
score URIBL_SC_SURBL 0 3.600 0 4.498
score URIBL_WS_SURBL 0 1.533 0 2.140
# </gen:mutable>
endif # Mail::SpamAssassin::Plugin::URIDNSBL

what happens for me is that they get shoved into the main <gen:mutable> section,
and lose their "ifplugin" scope.  that's obviously bad news, as it means that
manual hand-editing is required to fix it.

is there a working script that avoids that problem?
Comment 66 Justin Mason 2005-08-09 21:46:53 UTC
Created attachment 3066 [details]
redo of 3065

ok, this one:
- passes t/meta.t
- zeroes rules where -0.1 < score < 0.1
- is otherwise identical.

I haven't redone the STATISTICS files, though. ;)
Comment 67 Justin Mason 2005-08-10 13:39:02 UTC
Created attachment 3068 [details]
fix for test failures caused by 3066

this is an adjunct to 3066; unfortunately, make test produces lots of failures
without this patch.

it's a set of fixes to the test suite, fixing more of the tests to use their
own rules, instead of relying on the distribution-default ruleset; this patch
adds a new test-suite-specific rules file, so the test suite is more
independent of the basic ruleset.
Comment 68 Justin Mason 2005-08-10 17:58:19 UTC
Created attachment 3069 [details]
redo of 3066

well isn't this fun.  it turns out that rule_names.t introduces more
unpredictability in our test suite, and causes *occasional* 'make test'
failures.

FUZZY_VALIUM in rules/25_replace.cf was therefore causing make test failures,
due to its name; this version of the rules patch includes the new scores, the
new stats, and renames that rule to "FUZZY_VLIUM" to avoid this test failure.

the following patch is a fix for t/rule_names.t that removes this
unpredictability.
Comment 69 Justin Mason 2005-08-10 17:59:36 UTC
Created attachment 3070 [details]
fix for t/rule_names.t

I think this helps
Comment 70 Justin Mason 2005-08-10 18:01:00 UTC
ok. these patches all need votes, now: 3069, 3068, 3070.
Comment 71 Duncan Findlay 2005-08-10 19:43:48 UTC
Justin, can you elaborate on why rule_names.t was failing? I don't see why
FUZZY_VALIUM had the problem, but FUZZY_VIOXX or FUZZY_VICODIN does not.

+1 on all 3
Comment 72 Justin Mason 2005-08-10 20:07:36 UTC
'Justin, can you elaborate on why rule_names.t was failing? I don't see why
FUZZY_VALIUM had the problem, but FUZZY_VIOXX or FUZZY_VICODIN does not.'

FUZZY_VALIUM contained "VALIUM" which was firing on DRUGS_ANXIETY
(__DRUGS_ANXIETY_3 to be exact).   I couldn't see exactly why, but it certainly
was firing on that bit of the name ;)

I have no idea why VIOXX/VICODIN aren't firing, although the __DRUGS_FOO_N rules
all seem to have individual subrules for each drug, and some have \b and some
have other start-of-string markers.  rule_names.t is a bit of a combinatorial
lucky dip I think. :(
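The collision is easy to reproduce with a toy pattern: without a leading \b, an obfuscation-tolerant drug regex will happily match its drug name embedded inside another rule's name. Both patterns below are illustrative stand-ins, not the actual __DRUGS_ANXIETY_3 subrule:

```python
import re

# Hypothetical obfuscation-tolerant drug patterns, with and without \b:
no_boundary = re.compile(r"V[_ .-]*A[_ .-]*L[_ .-]*I[_ .-]*U[_ .-]*M", re.I)
with_boundary = re.compile(r"\bV[_ .-]*A[_ .-]*L[_ .-]*I[_ .-]*U[_ .-]*M", re.I)

# rule_names.t feeds rule names through the ruleset; the unanchored
# pattern fires on the rule name itself:
print(bool(no_boundary.search("FUZZY_VALIUM")))    # True
print(bool(with_boundary.search("FUZZY_VALIUM")))  # False ('_' is a word char)
```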
Comment 73 Michael Parker 2005-08-10 22:08:21 UTC
+1
Comment 74 Justin Mason 2005-08-11 17:06:09 UTC
ok! applied, 231543 and 231544.