SA Bugzilla – Bug 4505
[review] Score generation for SpamAssassin 3.1
Last modified: 2005-08-11 09:06:09 UTC
To tune the models this time, I am using a 10% random sample of all of the corpus submissions. All of these results have been generated using the same parameters as I did with 3.0, except for set1. False positives and negatives from the 10% sample to follow...

./model-statistics vm-set0-2.0-4.0-100/validate
False positives: mean=0.0753% std=0.0462
False negatives: mean=20.9334% std=7.3811
TCR (lambda=50): mean=2.7302 std=0.9718

./model-statistics vm-set1-2.0-4.0-100/validate
False positives: mean=0.0713% std=0.0435
False negatives: mean=5.9736% std=2.1137
TCR (lambda=50): mean=9.8396 std=3.6217

./model-statistics vm-set2-2.0-4.625-100/validate
False positives: mean=0.0847% std=0.0364
False negatives: mean=5.6917% std=2.0176
TCR (lambda=50): mean=9.7449 std=3.4877

./model-statistics vm-set3-2.0-5.0-100/validate
False positives: mean=0.0847% std=0.0527
False negatives: mean=2.9959% std=1.0621
TCR (lambda=50): mean=15.7957 std=6.3287
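For anyone reproducing these figures: the mean/std pairs are aggregates of one metric across the cross-validation folds. A minimal sketch of that aggregation (whether the original model-statistics script uses population or sample standard deviation is an assumption; population std is shown here):

```python
import statistics

def fold_stats(fold_values):
    """Aggregate one metric (e.g. FP% per fold) across CV folds.
    Returns (mean, population std); population vs sample std is an
    assumption about the original script, not confirmed."""
    return statistics.mean(fold_values), statistics.pstdev(fold_values)
```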
The misses can be found on the rsync server in /corpus/scoregen-3.1/falses/. I wanted to put them on BZ, but the file is too big.
(bumping pri to the appropriate level)

since quite a few of the mass-checkers don't have accounts on that box, I've also copied the set3 files to these URLs:

http://taint.org/xfer/2005/set3.fn.gz
http://taint.org/xfer/2005/set3.fp.gz

Please download and verify that any mails in the FP set that are coming from your corpus are indeed valid ham, and ditto for the FN set being spam.

Btw Henry -- in my case, the breakdown of errors is as follows...

FNs (can be moved to spam if you want, or deleted):
/home/jm/Mail/deld.priv/56232
/home/jm/Mail/deld.priv/61238
/home/jm/Mail/sent/587
/home/jm/Mail/sent/736

INVALID, DELETE FROM HAM (rule discussion, bounced spam):
/home/jm/Mail/deld.priv/111034
/home/jm/Mail/A3inbox/1

FPs (can be moved to ham or deleted):
/home/jm/cor/spam.cor/20041029a/216
/home/jm/cor/spam.cor/20041029a/226
/home/jm/cor/spam.cor/20041029a/246
/home/jm/cor/spam.cor/20041029a/233
/home/jm/cor/spam.cor/20041029a/235

INVALID, DELETE FROM SPAM (bounced spam):
/home/jm/Mail/Sapm/1540
/home/jm/Mail/Sapm/1647
Subject: Re: Score generation for SpamAssassin 3.1

On Wed, Jul 27, 2005 at 06:39:22PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Please download and verify that any mails in the FP set that are coming from
> your corpus, are indeed valid ham; and ditto for the FN set being spam.

Ok, checked over the set3 results. FPs: all valid ham. FNs: all valid spam.

In full disclosure, several of the spams could be considered "questionable", namely HGTV newsletters which also include DIY newsletters. I was originally receiving them at a hamtrap, but then I started receiving things I didn't ask for, and then couldn't unsubscribe, so they got switched to spam instead. The rest are a varied set of things: mostly stock spams, phishing, several of those German spams from earlier in the year, national lottery spams, etc.
I let Henry know, but for the record, I looked through all of mine and they are all good to go.
I'm not too concerned about a few mis-labeled entries. All that will happen from those is that our numbers will look a bit off. Unless anyone has objections, I'm going to use the corpus as is and will generate the scores. The learning algorithm is stable enough to work around a bit of noise.
My check of the set3 results gives:

FNs (can be moved to spam if you want, or deleted):
/scratch/SA/mails/2005-01.mbox.ham.21322338

FPs (can be moved to ham or deleted):
/scratch/SA/mails/personal.2005w08.spam.194510
/scratch/SA/mails/personal.2005w09.spam.780822
/scratch/SA/mails/personal.2005w20.spam.220636
/scratch/SA/mails/personal.2005w21.spam.1340310
/scratch/SA/mails/personal.2005w22.spam.1210785
/scratch/SA/mails/personal.2005w25.spam.886714

INVALID, DELETE FROM SPAM (bounces, viruses, etc.):
/scratch/SA/mails/backup.2005.jan-may.spam.101747
/scratch/SA/mails/traps.2005w09.spam.1189670
/scratch/SA/mails/personal.2005w09.spam.704311
/scratch/SA/mails/personal.2005w14.spam.1332942
/scratch/SA/mails/personal.2005w28.spam.161488
well, in terms of generating STATISTICS.txt at least, I would prefer to have the bad entries fixed; those numbers are published. it's pretty trivial to fix up the logs appropriately using "remove-ids-from-mclog"; I'll do it if you want.
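For reference, the fix-up is just a filter over the mass-check logs. A rough sketch of the idea behind remove-ids-from-mclog (not the actual script; the log layout assumed here -- class, score, message path, then the rest -- is my assumption):

```python
def strip_bad_entries(log_lines, bad_paths):
    """Drop mass-check log lines whose message path matches a
    known-misclassified entry.  Assumes the path is the third
    whitespace-separated field, as in 'Y 12 /path/to/msg tests=...'
    (an assumption about the log format, not confirmed)."""
    bad = set(bad_paths)
    kept = []
    for line in log_lines:
        fields = line.split()
        if len(fields) >= 3 and fields[2] in bad:
            continue  # known-bad entry: skip it
        kept.append(line)
    return kept
```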
I'd rather that we didn't clean up the logs this way because:

1) You've only removed errors from 10% of the logs.
2) You haven't removed the errors that both you and SA have made.

I'm running a set of cross-validations on the full set now. If you really want to remove only the instances where the human was incorrect and the classifier was correct, and not the instances where both the human and the classifier are incorrect, I will upload the errors to the rsync server when it's finished.
well, we disagree ;) I'd appreciate some comments from the rest of the committers on how they feel about this one. Here's a chat log between myself and H talking about it....

(09:49:33) henry: so about fixing up logs
(09:50:19) henry: I'd rather that we didn't because: 1) You've only removed errors from 10% of the logs. 2) You haven't removed the errors that both you and SA has made.
(09:50:25) henry: have made
(09:51:00) jm: please respond via mail on this one, I suspect I'm not the only one who disagrees ;)
(09:51:18) henry: sure
(09:51:56) jm: imo we need to try and get the logs as clean as poss, even if we're missing 90% of the FPs/FNs
(09:52:19) henry: we're just gaming the numbers
(09:52:32) jm: even if the perceptron is able to deal with some noise, the logs are used for other things (STATISTICS.txt) that cannot deal with noise
(09:52:36) henry: the learning algorithm would be useless if it couldn't work around a few mistakes
(09:52:58) jm: we're not gaming it -- we're using it to build something nearer a "gold standard" in Cormack terms
(09:53:13) henry: and what I'm saying is that by correcting errors in only one direction, STATISTICS.txt will be worse off than it was before
(09:53:24) henry: Cormack uses multiple classifiers to make his "gold standard"
(09:56:27) jm: why are we correcting errors only in 1 dir?
(09:56:31) jm: don't get that
(09:56:54) henry: you're not correcting entries where both you and SA have erred
(09:57:22) henry: so they look like TPs and TNs, but in fact they are FNs and FPs
(09:57:52) jm: ok. but it's still *better* than the current logs
(09:58:03) henry: I disagree
(09:58:03) jm: in that there are *less* FPs and FNs overall
(09:58:17) jm: even if there are still *some* FPs and FNs
(09:58:19) henry: there are indeed less FPs and FNs overall
(09:58:44) henry: but since we know how many errors we've seen, we can make some predictions about what's gone on in the other direction
(09:59:49) jm: I disagree that that's useful ;)
(09:59:58) jm: unless you want to fix the STATISTICS generating scripts as well...
(10:01:30) henry: well, here's the thing
(10:01:37) henry: from first look
(10:01:47) henry: it seems that people have about the same amount misclassified in each direction
(10:01:49) henry: that have been found
(10:02:42) henry: so you could hypothesise that there are plenty that have gone the other way
(10:03:29) henry: and that they are about the same proportion
(10:03:34) henry: maybe
(10:03:36) henry: I don't know
(10:04:07) henry: all that I can say is that by fixing based solely on the suspected mistakes of the classifier, we're biasing the results to make things look better than they are
(10:04:45) henry: and really.. at the end of the day, the numbers reflect how good the sample set is
so in summary:

- I think we should try to make the logs as clean as possible
- Henry thinks we should keep the logs as they are, and use that to estimate a misclassification figure instead

(PS: Henry also notes that Bayes will have been trained on those instances, too.)
Here are my misclassifications (I guess whether or not it matters is still up for debate):

Virus bounce:
/home/sone/spamassassin_masscheck_3.1-pre/corpora/spam/1082327516.17711_3.ns1:2,S

Misclassified as spam (kinda sorta ham-ish-y I guess):
/home/sone/spamassassin_masscheck_3.1-pre/corpora/spam/1115735940.M20350P12544V0000000000000304I001D2C12_6.ns1,S=14073:2,S

Misclassified as spam (really ham):
/home/sone/spamassassin_masscheck_3.1-pre/corpora/spam/1106269507.18978_3.ns1:2,S
My only misclassification:
/Users/rod/spam/Maildir/.spam.2004-12/cur/1103575966.15119_0.blazing.arsecandle.org,S=18955:2,S
is really ham.
Created attachment 3044 [details]
freqs for scoreset 3, all logs

fyi -- here's the freqs data from 3.1.0's mass-check logs, scoreset 3. I didn't clean up the misclassifications reported since the perceptron run, fwiw; this is just using the rsync'd logs. so far, though, the FPs/FNs reported are tiny compared to the number of mass-checked messages (1483066 spam, 743761 ham).
btw, more hits that look very iffy, from the freqs file:

  0.333   0.0546   0.8887    0.058   0.26   -4.30  RCVD_IN_BSP_TRUSTED
  0.051   0.0130   0.1267    0.093   0.19   -0.10  RCVD_IN_BSP_OTHER
  0.036   0.0053   0.0961    0.053   0.29   -8.00  HABEAS_ACCREDITED_COI

that seems like a *LOT* of Bonded Sender spam hits -- 809 messages hitting RCVD_IN_BSP_TRUSTED! could we get those spam hits verified? (Bob, in particular, most seem to be coming from your corpus)
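For anyone double-checking these numbers, the columns are the usual freqs layout -- OVERALL% SPAM% HAM% S/O RANK SCORE NAME (treat that column naming as my reading of the file, not gospel). A throwaway parser showing how the "809 messages" figure falls out of the SPAM% column and the corpus size given earlier:

```python
def parse_freqs_line(line):
    """Split one freqs row into named fields; the percentage columns
    are assumed to be fractions of the whole spam/ham corpora."""
    overall, spam_pct, ham_pct, so, rank, score, name = line.split()
    return {"overall": float(overall), "spam_pct": float(spam_pct),
            "ham_pct": float(ham_pct), "so": float(so),
            "rank": float(rank), "score": float(score), "name": name}

row = parse_freqs_line(
    "0.333 0.0546 0.8887 0.058 0.26 -4.30 RCVD_IN_BSP_TRUSTED")
# 0.0546% of the 1483066 mass-checked spams is ~809 messages,
# consistent with the count quoted in the comment.
hits = row["spam_pct"] / 100 * 1483066
```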
Subject: Re: Score generation for SpamAssassin 3.1

On Thu, Jul 28, 2005 at 06:15:52PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> that seems like a *LOT* of Bonded Sender spam hits -- 809 messages hitting
> RCVD_IN_BSP_TRUSTED! could we get those spam hits verified?

My hits are all valid, btw.
I did a run with the full 2M corpus. Here are the results:

vm-set0-2.0-4.0-100
False positives: mean=0.0625% std=0.0263
False negatives: mean=21.8408% std=7.6947
TCR (lambda=50): mean=2.6218 std=0.9242

vm-set1-2.0-4.0-100
False positives: mean=0.0682% std=0.0263
False negatives: mean=6.1945% std=2.1798
TCR (lambda=50): mean=9.5497 std=3.3674

vm-set2-2.0-4.625-100
False positives: mean=0.0846% std=0.0325
False negatives: mean=7.9603% std=2.8295
TCR (lambda=50): mean=7.3340 std=2.5958

vm-set3-2.0-5.0-100
False positives: mean=0.0822% std=0.0318
False negatives: mean=3.0710% std=1.0898
TCR (lambda=50): mean=15.2954 std=5.4556
further info regarding the BSP_TRUSTED hits --

  grep BSP_TRUSTED spam.log > o
  perl -ne '/ (\/[^\/]+\/[^\/]+\/[^\/]+)/ and print "$1\n"' o | uniq -c

    792 /home/Bob/spamassassin.active
     10 /home/duncf/Maildir
      2 /home/jm/Mail
      1 /home/jm/cor
      4 /home/corpus/mail
      1 /home/corpus/SA

97% of the Bonded Sender hits on spam are from Bob's corpus. I suspect something's up with the corpus there... spamtraps? retired accounts?

PS: there's an argument that having FPs in the logs is irrelevant. however, I disagree -- the perceptron is only *one* thing that uses the logs. There are also the following:

- overall FP/FN% figures for scoresets and thresholds (STATISTICS.txt)
- rule freqs, for per-rule FP/FN% figures

Given those two, there are good reasons to clean up the logs.
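The same tally in Python, for anyone without the one-liner handy (a sketch; like the perl regex above, it assumes the message path is the first absolute path on each spam.log line, and groups on the first three path components):

```python
import re
from collections import Counter

def corpus_tally(log_lines, rule="RCVD_IN_BSP_TRUSTED"):
    """Count hits on `rule` per contributor, keyed on the first
    three components of the message path -- the same grouping as
    the grep/perl/uniq pipeline in the comment above."""
    tally = Counter()
    path_re = re.compile(r' (/[^/\s]+/[^/\s]+/[^/\s]+)')
    for line in log_lines:
        if rule not in line:
            continue  # cheap substring filter, like the grep step
        m = path_re.search(line)
        if m:
            tally[m.group(1)] += 1
    return tally
```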
Here's an email Bob sent to the sa-dev mailing list that looks like it was meant to be a comment here. Or if not, I think it should be in the record here, and it is on a public list so I feel free to repost it. However, 259 is a lot less than 792, so there still is a question why Bob has so many Bonded sender FPs.

---- rest of this is a quote -----

Hello Henry,

Wednesday, July 27, 2005, 6:39:22 PM, you wrote:

>> jm@jmason.org changed:
>>   What      |Removed  |Added
>>   --------------------------
>>   Severity  |normal   |critical
>>   Priority  |P5       |P1

>> since quite a few of the mass-checkers don't have accounts on that
>> box, I've also copied the set3 files to these URLs:
>> http://taint.org/xfer/2005/set3.fn.gz
>> http://taint.org/xfer/2005/set3.fp.gz
>> Please download and verify that any mails in the FP set that are
>> coming from your corpus, are indeed valid ham; and ditto for the FN
>> set being spam.

FN: I spot-checked all FNs with positive scores, and checked every FN with negative scores. Corpus is clean, except:

ham:
mid=<mailman.3.1119452414.19901.announce@ctyme.com>

discount:
Message-ID: <12880891.1119562416154.JavaMail.root@agent1.ientrymail.com>
Message-ID: <28195449.1118795862153.JavaMail.root@mailagent0.ientrymail.com>
spam newsletter, but this user probably subscribed to it...

There are 259 emails from/via constantcontact.com which are treated as spam on my system, have been flagged as spam on my system (scores as high as 30's and 40's), have been encapsulated on delivery, have never been flagged by any user as not-spam, but, for the purposes of a world-wide mass-check, these constantcontact.com emails might be questionable.

Note: Not all constantcontact.com is treated as spam here -- quite a few cc.com newsletters are subscribed to and seen as ham by their subscribers and the system.
The ones I find above in the fns file are all from a set of eight newsletters which have regularly (almost always) been seen as spam, and no user has ever corrected that classification.

Henry: To remove these from the log (if you want to), remove everything where the path is /home/Bob/spamassassin.active/masses/corpus.spam (or corpus.ham), since that identifies my corpus contribution, and where the mid ends in @scheduler.

FP: Checked every one. Corpus is clean, except:

ham:
Message-ID: <1118650726.505.53825.m18@yahoogroups.com>
There are two of these listed. One should be removed.

spam:
mid=<17EDCF9C.FD9DD30@hotmail.com>

Bob Menschel
Of course I should have said FN not FP in the last comment. And in case it is not clear to someone reading this: constantcontact.com runs the Bonded Sender service, which is what the RCVD_IN_BSP_TRUSTED rule looks for. Bob, what does it mean that you say that you have 259 emails from/via constantcontact.com that are flagged as spam, but Justin says that the log shows 792 BSP_TRUSTED hits from your spam corpus?
oops, missed that. however, I don't think Bob was talking about the BSP issue in that mail... Sidney -- I think you're confusing Constant Contact with Return Path -- Return Path are now partners in the BSP, http://www.returnpath.net/, but afaik Constant Contact are a different company. I don't think that's it (although it may be some of the hits).
Oh, I got confused by this: http://www.constantcontact.com/services/bonded-sender-program.jsp I guess constantcontact provides a way for people to get Bonded Sender status for $25/month and no risk of losing a bond. I wonder what the business model is if spammers take advantage of it. That would explain it if they are a source of most of the BSP_TRUSTED FNs. Except there are still another 533 FNs to explain.
Created attachment 3045 [details]
Proposed scores for 3.1

gen-set0-2.0-4.0-100
# SUMMARY for threshold 5.0:
# Correctly non-spam:  74239  99.92%
# Correctly spam:     113219  76.56%
# False positives:        60   0.08%
# False negatives:     34655  23.44%
# TCR(l=50): 3.927075  SpamRecall: 76.565%  SpamPrec: 99.947%

gen-set1-2.0-4.0-100
# Correctly non-spam:  74274  99.92%
# Correctly spam:     138015  93.05%
# False positives:        59   0.08%
# False negatives:     10312   6.95%
# TCR(l=50): 11.184361  SpamRecall: 93.048%  SpamPrec: 99.957%

gen-set2-2.0-4.625-100
# Correctly non-spam:  74747  99.92%
# Correctly spam:     134723  90.61%
# False positives:        58   0.08%
# False negatives:     13955   9.39%
# TCR(l=50): 8.821003  SpamRecall: 90.614%  SpamPrec: 99.957%

gen-set3-2.0-5.0-100
# Correctly non-spam:  74528  99.92%
# Correctly spam:     143427  96.65%
# False positives:        59   0.08%
# False negatives:      4975   3.35%
# TCR(l=50): 18.725804  SpamRecall: 96.648%  SpamPrec: 99.959%
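The SpamRecall / SpamPrec percentages in these summary blocks follow directly from the raw counts; a quick sanity-check sketch (function and variable names are mine):

```python
def recall_precision(correct_spam, false_neg, false_pos):
    """Recompute SpamRecall and SpamPrec from the raw counts of a
    gen-* summary block: recall is the fraction of all spam caught,
    precision the fraction of spam-classified mail that really is spam."""
    recall = correct_spam / (correct_spam + false_neg)
    precision = correct_spam / (correct_spam + false_pos)
    return recall, precision
```

Plugging in the gen-set3 counts above (143427 correct spam, 4975 FNs, 59 FPs) reproduces the quoted 96.648% / 99.959%.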
I hacked together something to make ROC curves... take a look.

current SVN trunk:
http://taint.org/xfer/2005/roc_curves_pre_perceptron.png

with the scores in patch 3045:
http://taint.org/xfer/2005/roc_curves_with_3045.png
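The curve-drawing itself is simple once per-message total scores are extracted from the logs; a minimal sketch of the idea (the data layout is made up, and a real run would sweep thresholds much more finely than this):

```python
def roc_points(ham_scores, spam_scores, thresholds):
    """One (FP rate, TP rate) point per candidate spam threshold;
    a message is called spam when its score meets the threshold."""
    points = []
    for t in thresholds:
        fp_rate = sum(s >= t for s in ham_scores) / len(ham_scores)
        tp_rate = sum(s >= t for s in spam_scores) / len(spam_scores)
        points.append((fp_rate, tp_rate))
    return points
```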
Subject: Re: Score generation for SpamAssassin 3.1

+score BAYES_50 0 0 0.845 0.001 # n=1
+score BAYES_60 0 0 2.312 0.372 # n=1
+score BAYES_80 0 0 2.775 2.087 # n=1
+score BAYES_95 0 0 3.023 2.063 # n=1
+score BAYES_99 0 0 2.960 1.886 # n=1

I think the score for BAYES_99 should be hand-tweaked, regardless of what the score generator said. This was big grief for most people on 3.0 - 3.0.3, and I'd just as soon not see it take until 3.1.3 to apply the same hack again.

Loren
Bob, for some reason the email replies you are sending are not ending up in a comment even though they are Cc'd to bugzilla-daemon. I'm pasting your last one in here below.

I don't know about the others you list, but I don't see how the Motley Fool ones are spam. The content looks like stock spam, but they are a very widely read, reputable organization that requires registration with email confirmation to receive a login password before one can subscribe. Each email that I have received from them contains unsubscription information that states the email address that I am subscribed under and a link to where I can view and change all subscription preferences. I have never seen any reference to them not honoring the preference settings. While the web site is fool.com, the subscribed email from them does come from foolsub.com addresses. See, for example, the reference to Motley Fool in http://www.ironport.com/company/pp_business_week_03-13-2003.html

-- sidney

>> ------- Additional Comments From jm@jmason.org 2005-07-28 18:15 -------
>> btw, more hits that look very iffy, from the freqs file:
>>   0.333   0.0546   0.8887    0.058   0.26   -4.30  RCVD_IN_BSP_TRUSTED
>>   0.051   0.0130   0.1267    0.093   0.19   -0.10  RCVD_IN_BSP_OTHER
>>   0.036   0.0053   0.0961    0.053   0.29   -8.00  HABEAS_ACCREDITED_COI
>> that seems like a *LOT* of Bonded Sender spam hits -- 809 messages hitting
>> RCVD_IN_BSP_TRUSTED! could we get those spam hits verified? (Bob, in
>> particular, most seem to be coming from your corpus)

Summary:
Misclassified ham: 28
Bounce/outscatter of spam: 1
Possibly misclassified ham: 34
Constant Contact questionable: 3099 (ham and spam)
The remainder are IMO spam.

Note: In the following discussions where I say "flagged spam", I mean fully encapsulated, with full SA report and score presented as the primary email to the user.
>> Misclassified ham:

From: newsletters@about.com (count: 7)

From: "American Express" <AmericanExpress@email.americanexpress.com>
Count: 10, multiple users fed to sa-learn, primarily because instead of being official notifications, statements, alerts, etc., the "spam" identified by users were marketing emails: "take a look at our special offers", "plan the perfect holiday", "upgrade to a card with premium service", etc. Only one of the sa-learned "spam" was what I'd consider ham, though none of them are spam.

From: <support@godaddy.com> (count: 2, 1 to each of 2 users)
From: PayPal <paypal@email.paypal.com> (count: 6)
From: Tikkun <Tikkun@democracyinaction.org> (count: 1)
From: HeartCenterOnline <HeartCenterOnline@heartcenteronline-mail.com> (count: 2)

>> Possibly misclassified ham:

From: "CNET Help.com Online Courses " <CNET_Networks_Member_Services@newsletter.online.com>
Count: 9. User CR declared it to be spam via sa-learn. Probably an old subscription. Several others not fed to sa-learn, but flagged as spam by our system (and not corrected by the users via sa-learn). Willing to consider these ham.

From: "The Home Depot" <HomeDepotCustomerCare@homedepot.com>
Subject: Great Last-Minute Gifts for Dad
Count: 4. Various users, flagged as spam by our system, not fed through sa-learn. Looked like spam during validation. Also have nine emails from the same source, 3 with low positive scores, six with negative scores, also not fed through sa-learn. Willing to consider these ham.

From: Godiva.com <godiva@godiva.com>
Count: 3. User CR declared it to be spam via sa-learn. Might be an old subscription.
Count: 1. User SV, flagged as spam by our system, no sa-learn correction.
Note: my unverified corpus also has two more emails from the same source, not flagged as spam (low positive score), not fed to sa-learn.

From: "eBay" <eBay@reply3.ebay.com>
Count: 7
Subject: Preview eBay's Summer Sizzlers & Save Big!
Subject: B-52's Live, BBQ at Great America--register now for eBay Live and save!
Subject: feralcanning, check these amazing eBay deals--all under $10
User CR declared it to be spam via sa-learn. Maybe an old subscription; very likely not the type of email the user wanted from eBay.

From: "Movies Unlimited Video E-Flash" <eflash@moviesunlimitedeflash.com>
Count: 3. User SA, system flagged as spam, no sa-learn, look like spam, but all to a single user. Could be ham.

From: "DVD Talk" <newsletter@dvdtalk.com>
Count: 2
To: mike@misosoup.com
Subject: DVD Talk: It's Back - The Huge DeepDiscountDVD.com Sale
User MM, system flagged as spam, no sa-learn, look like spam, all to a single user (count 2); many others not flagged as spam (some low positive, some negative), none through sa-learn. Could be ham.

From: "Planet DVD Now" <sales@planetdvdnow.com>
Count: 3
To: ncoronado@prontotax.com
Subject: Planet DVD Now Insider News for Saturday June 18, 2005
User NP, system flagged as spam, no sa-learn, look like spam, all to a single user (count 3); many others not flagged as spam (some low positive, some negative), none through sa-learn. Could be ham.

From: support@sexsearchcom.com
Count: 3
Subject: SexSearch Shown Interest
User JB, flagged spam, no sa-learn. Only user receiving these emails.

>> Constant Contact

Per earlier email, several other Constant Contact "newsletters" flagged by our system as spam: a variety of newsletters, a variety of users, spam classification not corrected by users, including technical users who regularly and reliably sa-learn their misclassified emails.

Messages fed through sa-learn as spam by users: 17
Messages flagged as spam and not sa-learned as ham: 1586
Messages not flagged as spam: 1496

IMO, if we discard the 1603 flagged as spam, we should also discard the 1496 treated as ham.

>> Sure looks like spam:

From: "Entertainment Update" <EntertainmentUpdate@mail85.subscribermail.com>
Subject: New Promotional Partner Opportunities
User CR declared it to be spam via sa-learn. Sure looks to me like spam.
From: The Motley Fool <Fool@foolsubs.com>
Subject: Urgent Stock Buy/Sell Alert...from Motley Fool Stock Advisor
User CR declared it to be spam via sa-learn. Sure looks to me like spam. Plus another copy flagged as spam by our system, same user, not fed to sa-learn. Quite a few others, all look like spam.

From: "Entertainment Insider" <EntertainmentInsider@mail85.subscribermail.com>
Subject: New Marketing Opportunities from The b EQUAL Company
Subject: New Promotional Opportunities Available from Nickelodeon
Subject: New Marketing Opportunities from Buena Vista Home Entertainment
User CR declared it to be spam via sa-learn. Sure looks to me like spam. Count: 5

From: Rabbi Michael Lerner <rabbilerner@tikkun.org>
Subject: Science and Spirit--a work group at the Network of Spiritual Progressives Founding Conferences
User RI declared it to be spam via sa-learn. Maybe an old subscription; very likely not the type of email the user wanted from this source.

From: "ArcaMax" <ezines@arcamax.com>
Subject: Congratulations - You Won
User NP declared it to be spam via sa-learn. Sure looks to me like spam. Two copies, same recipient, different message-ids. A third email, also user NP, no sa-learn, flagged as spam by our system, sure looks like spam to me. Other emails, various users, no sa-learn, flagged as spam by our system, look like spam to me.

From: South Beach Diet Online <products@southbeachdiet.com>
Subject: why this diet WORKS!
User AM, no sa-learn, flagged as spam by our system.
>> You are receiving this message because you subscribed to or visited
>> a Waterfront Media newsletter or product."
Visited a newsletter or product = looks like spam to me.

From: DGI Line - asi/50910 <promoflash@promotioncorner.com>
Reply-To: promoflash@promotioncorner.com
To: jan@award-source.com
Subject: 2005 Magnetic Football Schedules! All Pro Teams Available
User JA, no sa-learn, flagged as spam by our system, roving Constant Contact, contents look like spam to me.
From: "NewsMax.com" <customerservice@reply.newsmax.com>
Subject: Ken Blackwell and New Republicans: Inside Story
User GI, no sa-learn, flagged as spam by our system, only one email in corpus, including unclassified. If "newsmax.com" were a real service, I'd expect repeated emails. Therefore I believe this to be spam.

From: Health Insurance Solutions <HealthInsurance@focalex2.com>
Subject: Health and happiness go hand in hand.
User JC, system flagged as spam, no sa-learn, five separate emails, all look like spam (including no MID from sender), all to a single user, an insurance agent. Could be ham. But...

From: Medical Insurance <MedicalInsurance@focalex2.com>
Subject: Take care with medical insurance.
From: US Immigration Help <USImmigrationHelp@focalex2.com>
Subject: Make the dream of citizenship a reality.
User JC, system flagged as spam, no sa-learn, multiple emails, all look like spam (including no MID from sender), all to a single user, an insurance agent. Content very much aimed at the consumer, not the agent, strongly suggesting to me that all email from @focalex2.com is indeed spam. Then...

From: Posters And Wall Art <PostersAndWallArt@focalex2.com>
Subject: What your walls want to wear.
Same user (insurance agent), same source, nothing at all to do with insurance or anything similar to any other email received by this user. Other spam samples abound in more recent email.

From: "SmartBargains" <SmartBargains@deals.smartbargains.com>
Reply-To: "SmartBargains" <SmartBargains.L9A0NB.226361@deals.smartbargains.com>
To: srose@cencalins.com
Subject: 320TC Sheet Set, Duvet & More Just $29.95
User SC, system flagged as spam, no sa-learn, all look like spam.
User DT, "
Emails do refer to users by a first name which matches the first letter of the email address.
>> You are receiving this email because you subscribed to it through
>> SmartBargains.com or one of our partners.

From: AIU Online <aiuonline@aiuonline-update.com>
Subject: Nights. Weekends.
We're here when it's convenient for YOU!
Consistent spam, repeated sa-learn as spam, 2 users, plus one unclassified to a third user. Confident this is spam.

From: "International Living" <webeditor@internationalliving.com>
To: jim@cudney.com
Subject: IL Postcards - Tax Breaks in the Cloud Forest
User JC, many emails flagged spam, many emails not flagged, no sa-learn. May or may not be spam. Certainly looks like a scam.

From: "Martin D. Weiss, Ph.D." <alerts@weissinc.com>
Subject: A Personal Invitation from Martin Weiss
User JC, all emails flagged spam, no sa-learn; the emails certainly do look like spam/scam. Sent to only this user.

From: Hersheys Kisses <kisses@prewards.com>
Subject: Complimentary 10 lbs of Hershey's Chocolate
User BQ, clear spam, even in the SURBL blacklist.

From: "TopButton" <vip@TopButton.com>
To: nysale@dvorak.org
Subject: TOP BUTTON VIP - Prada Price Cuts: 4-Days Only
User ND, among the most technically oriented and skilled of our users, email flagged as spam, no sa-learn, only email from this source in the entire corpus, looks unquestionably spam.

From: eDiets Extra <extra@ediets.com>
Subject: Miami Mediterranean Diet: It's Hot!
Users ST and KG, several emails flagged spam, many emails not flagged, no sa-learn. May or may not be spam. Certainly looks like spam.

Bob Menschel
As per Justin's request, I did a validation run without Bob's data. The numbers come out much better but leave an unanswered question: Is Bob's data really noisy, or is it really hard? I'm doing a scoring run now and will post a patch when it's ready. I don't care what we do either way. What do you guys want to do?

vm-set0-2.0-4.0-100-nobob
False positives: mean=0.0767% std=0.0342
False negatives: mean=16.9041% std=5.9576
TCR (lambda=50): mean=3.5471 std=1.2481

vm-set1-2.0-4.0-100-nobob
False positives: mean=0.0595% std=0.0252
False negatives: mean=3.3299% std=1.1745
TCR (lambda=50): mean=16.9662 std=6.0300

vm-set2-2.0-4.625-100-nobob
False positives: mean=0.0686% std=0.0251
False negatives: mean=5.4227% std=1.9189
TCR (lambda=50): mean=11.0551 std=3.9115

vm-set3-2.0-5.0-100-nobob
False positives: mean=0.0575% std=0.0241
False negatives: mean=1.2911% std=0.4657
TCR (lambda=50): mean=31.9635 std=11.8543
Re: comment #24

I absolutely agree with you, Loren. There's no problem with hand-tuning the scores afterwards. What I come up with is not necessarily the right answer, it's just the best answer that I can come up with given the data at hand.
Created attachment 3046 [details]
Proposed scores for 3.1 generated without Bob's data

gen-set0-2.0-4.0-100-nobob
# Correctly non-spam:  52964  99.94%
# Correctly spam:     100131  81.10%
# False positives:        34   0.06%
# False negatives:     23335  18.90%
# TCR(l=50): 4.931736  SpamRecall: 81.100%  SpamPrec: 99.966%

gen-set1-2.0-4.0-100-nobob
# Correctly non-spam:  53084  99.95%
# Correctly spam:     118698  96.28%
# False positives:        28   0.05%
# False negatives:      4592   3.72%
# TCR(l=50): 20.575768  SpamRecall: 96.275%  SpamPrec: 99.976%

gen-set2-2.0-4.625-100-nobob
# Correctly non-spam:  53309  99.92%
# Correctly spam:     116473  93.94%
# False positives:        41   0.08%
# False negatives:      7508   6.06%
# TCR(l=50): 12.971438  SpamRecall: 93.944%  SpamPrec: 99.965%

gen-set3-2.0-5.0-100-nobob
# Correctly non-spam:  53070  99.96%
# Correctly spam:     121906  98.49%
# False positives:        21   0.04%
# False negatives:      1872   1.51%
# TCR(l=50): 42.360712  SpamRecall: 98.488%  SpamPrec: 99.983%
Subject: Re: Score generation for SpamAssassin 3.1

> TCR (lambda=50): mean=2.6218 std=0.9242

Out of curiosity, what is TCR?
Full explanation of TCR (too long for this comment) is in http://wiki.apache.org/spamassassin/TotalCostRatio
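In short (per the definition on that wiki page, as I read it): TCR is the cost of using no filter at all divided by the cost of using the filter, with each false positive weighted lambda times a false negative. A sketch:

```python
def tcr(total_spam, false_pos, false_neg, lam=50):
    """Total Cost Ratio: baseline cost (every spam delivered) over
    the filter's weighted cost.  TCR > 1 means the filter is a net
    win; higher is better."""
    return total_spam / (lam * false_pos + false_neg)
```

For example, the gen-set3 summary in attachment 3045 (148402 total spam, 59 FPs, 4975 FNs) gives back the quoted TCR(l=50) of 18.7258.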
> Here's an email Bob sent to sa-dev mailing list that looks like it was meant to
> be a comment here. Or if not, I think it should be in the record here and it is
> on a public list so I feel free to repost it.

Agreed. Actually, this first comment was just back to the list; the second was to the list cc bugz, but didn't get to bugz. I'll try to post directly to bugz on this subject going forward.

> However, 259 is a lot less than 792 so there still is a question why
> Bob has so many Bonded sender FPs.

My first analysis was on Henry's 10% extract from the log, going strictly against the FN/FP warning extract from that. So the numbers were significantly smaller than from my full corpus, which Justin reviewed.

> There are 259 emails from/via constantcontact.com from that 10% extract
> which are treated as spam on my system, have been flagged as spam on
> my system (scores as high as 30's and 40's), have been encapsulated
> on delivery, have never been flagged by any user as not-spam, but,
> for the purposes of a world-wide mass-check, these
> constantcontact.com emails might be questionable.
> Note: Not all constantcontact.com is treated as spam here -- quite a
> few cc.com newsletters are subscribed to and seen as ham by their
> subscribers and the system. The ones I find above in the fns file are
> all from a set of eight newsletters which have regularly (almost
> always) been seen as spam, and no user has ever corrected that
> classification.

Per my later email, this is out of over 3000 Constant Contact emails, split about 50/50 in my corpus. Of the 1500+ that are considered spam here, half are considered FPs, so apparently the other half are being flagged correctly regardless of my corpus. No problem there.

Motley Fool: Sidney indicates they're ham; I can't argue with him.
Treated as spam here because a) a user intentionally flagged it as spam into sa-learn, b) they seem to me to be spam, based on the contents, c) I'm not familiar with that service myself, and d) I don't have time to research all of the sources of emails which get flagged as spam. In my corpus, 22 from this source are flagged as spam (2 via sa-learn), 26 as ham, 40 as unclassified.

About 80% of my BSP-Trusted hits -- spam, ham, and apparently also not classified -- are through Constant Contact. Given Sidney's discovery and comment re: constantcontact, I'm fairly convinced that /some/ of the cc BSP-Trusted emails in my corpus are spam. But I can't be absolutely sure which (I'd be willing to put money down on about a dozen of them that I reviewed yesterday, even after our discussions here, but given our discussions here, only that dozen or so).

Not all of my cc emails, of course, are BSP-Trusted. Those others also fall on all sides of the ham/spam/unclassified groupings, and while I haven't done stats on them, it feels from a quick glance as if the ratio is about the same.

My corpus comes mostly from an aggressive ISP system, where:
a) a lot of spam from known spam sources is dropped before SA,
b) there are a number of additional exim filters which put additional headers into emails for SA to analyze,
c) we have an additional Bayes analysis system outside SA which gives additional feedback concerning whether an email is/isn't spam,
d) we have additional custom rules that review the outputs of (b) and (c) in determining the SA score,
e) we use most of the not-high-risk SARE rules,
f) we have a large number of technical users very familiar with spam/anti-spam concerns and very able to sa-learn their own emails,
g) we have a large number of other (not so technical) users, many of whom use this service specifically because of its aggressive anti-spam stance, many of whom do actively sa-learn also, and
h) we have a fair number of users who do no sa-learning at all.
Because of the aggressive stance, we do have a higher FP ratio than many other systems. Importantly, we don't have any complaints about that. Again, we do drop emails before they even get to SA, but those that get to SA all get delivered to the users, with spam encapsulated. Some FPs are corrected via sa-learn, as are many FNs. All FPs and FNs are trapped and entered into my corpus. The number that I then discard on review afterwards is small -- a handful each month.

I also trap and enter those emails which are flagged as ham (negative scores) or spam (scores over 5) by BOTH SA and one of our internal systems. I review both of these categories, but because of the numbers I don't manually validate each and every one. I do review the ham more carefully than the spam.

These practices may be where the discrepancy comes from -- my reliance on others to manually validate ham/spam via sa-learn, my acceptance of their determination when I do not have contradicting evidence myself, and my acceptance, with careful but not paranoid review, of automated classification when two or more classification systems agree.

I'll be reviewing the BSP-other and HABEAS_ACCREDITED_COI spam hits later today. Meanwhile, though I have confidence that my corpus is reasonably accurate, I also have no problem with it being discarded if my methodology above is insufficient for scoring purposes.

The two questions, one asked by Henry:

> Is Bob's data really noisy or is it really hard?

and, what is the definition of "spam" as it should be applied to scoring? Is there any room in there for end user perception (I didn't ask for this), or does it accept mail as ham if the user ever at any time opted in for any mail from the sender, even mail which does not properly relate to the reason the user wanted the email?

Again, I have no problem with my corpus (or any subset of it) being discarded. I'm also willing to work on improving my methodologies for 3.2's rescoring run.

Bob Menschel
> I'll be reviewing the BSP-other and HABEAS_ACCREDITED_COI spam hits
> later today.

> BSP-other misclassified ham: 11

Message-ID: <9992bcc605040810462af9cb11@mail.google.com>
-- no idea how this obvious ham got into the corpus as spam.

Message-ID: <bysp635axk0d48bfj1x7kbvjbu35j7.174415332.4053@mta300.email.americanexpress.com>
Ditto.

Message-ID: <20050419184459.18922.qmail@corpmailer01.prod.mesa1.secureserver.net>
From notice@godaddy.com, pure advertising/marketing newsletter to a %%% user (Godaddy customer) who sa-learned this as spam, apparently wanting only domain registration data and not sales fluff. Only have the one godaddy.com email in the "sa-learned as spam" corpus; have 7 others that were classified as spam, obvious marketing newsletters. Well over 90% of all godaddy newsletters are in the ham corpus (or unclassified), and none of their functional emails dealing with registrations and specific domains are flagged as spam (about 40% of all godaddy emails are unclassified, the remainder ham, except for these 8).
Message-ID: <PayPal.65mgpzxn8.h0@email.paypal.com>
From: PayPal <paypal@email.paypal.com>
Subject: Annual Privacy and Electronic Fund Transfer Rights Notice
X-Header-CompanyDBUserName: paypal
Errors-To: paypal@email.paypal.com
Reply-To: paypal@email.paypal.com
X-Header-MasterId: 900764
X-Header-Versions: PayPal.65mgpzxn8.h0@email.paypal.com
X-Originating-IP: [206.165.246.83]
X-Sender-Nameserver: ns3.yahoo.com ns4.yahoo.com ns5.yahoo.com ns1.yahoo.com ns2.yahoo.com em
X-Spam-Status: Yes, score=106.1 required=5.0 tests=BAYES_00,DCC_CHECK,
	DIGEST_MULTIPLE,HTML_20_30,HTML_MESSAGE,MIME_HTML_ONLY,OPT_IN,
	PYZOR_CHECK,RCVD_IN_BSP_OTHER,SARE_FORGED_PAYPAL,SARE_FORGED_PAYPAL_C,
	SP_HAM_VERY autolearn=no version=3.0.4

Content looks like it came from PayPal, and I don't see any phishing links within, but the received header trail has nothing to do with any paypal or ebay system -- the only servers listed in the received chain are yahoo.com (starting at milter101.store.sc5.yahoo.com). I'm guessing this was sent to an email address within the yahoo store system, which auto-forwarded to the owner's address on our system, and the Yahoo system *stripped* all evidence that this actually came from paypal, causing our phish alarms to go off. 64 identical emails came through, most as ham, some unclassified; this was the only one flagged as spam.

> BSP-other questionable entries: 4

Message-ID: <25789186.1117674104561.JavaMail.clundberg@scotch>
From rabbilerner@tikkun.org, associated with democracyinaction.org. Fed to sa-learn as spam by user RI. Religious/political newsletter; of 7 emails in my corpus, 4 have been sa-learned by this user as spam, one to this user is unclassified, one to this user is classified as ham (not sa-learned), and one is classified as ham to a different user.

> BSP-other definite spam: 1

Message-ID: <6.0.0.22.1.20050610214911.3eca3bd7@paypal.com>
-- guaranteed phish.
Internal link to <a href="http://www.paypallk.com:680/paypal.php" style="font-family: monospace; font-size: 10pt;">Click here to confirm your account</a>

> HABEAS_ACCREDITED_COI misclassified ham: 12

Message-ID: <21139714.1120711719692.JavaMail.truelink@vma03.sbp-prod.truelink.com>
From: FreeCreditProfile <support@freecreditprofile.com>
count: 12

> HABEAS_ACCREDITED_COI questionable entries: 32

Message-Id: yournewsletterswf20094m05XZ200506090501807044@yournewsletters.net
southbeachdiet.com email mentioned previously. Count: 1

Message-ID: <29140116.1118865322661.JavaMail.root@mailagent0.ientrymail.com>
In general, @ientrynetwork.net newsletters are very spammy. One user religiously places his newsletters flagged as spam into sa-learn as ham, but no others do so. Count: 8

Message-Id: <E1DiegK-0006Wz-GA@pascal.ctyme.com>
No message id from sender. Count: 22
From newsletter@tickle-inc.com, Subject: Your future, revealed!
"The Tickle Newsletter is an email service designed with you in mind — it's the only email all about you. We think you're going to love it."
Sure sounds like an introduction to spam. Contents look very spammy as well.

Message-ID: <PRODWEB052en0bcaQUX00003c75@PRODWEB05.WLElmsford.com>
FROM: Reservation Rewards Customer Service <customerservice@reservationrewards.com>
SUBJECT: As requested, your Membership Kit for Reservation Rewards, please login today
X-Spam-Status: Yes, score=12.5 required=5.0 tests=BANG_GUARANTEE,BAYES_00,
	CALL_FREE,CT_ACT_NOW,CT_DO_IT_TODAY,CT_OFFERS_ETC,CT_OFFER_3,
	CT_PERCENT,DNS_FROM_AHBL_RHSBL,FORGED_RCVD_HELO,HABEAS_USER,
	HTML_50_60,HTML_MESSAGE,LINK_PHRASE,MAILTO_LINK,
	MIME_HEADER_CTYPE_ONLY,NO_COST,ORDER_NOW,SARE_BOUNDARY_LC,SAVE_MONEY,
	SAVE_UP_TO,SP_SPAM_VERY,URI_OFFERS autolearn=no version=3.0.4

User CR; if she signed up, then this membership confirmation was not spam. However, this confirmation dated July 3 is followed by a billing notice dated July 17, and then confirmation of the user's cancellation dated July 17.
Cannot tell whether the original was spam, but the user seems to have no interest in the service.

> HABEAS_ACCREDITED_COI definite spam: 0
Bob,

It's tricky getting a good corpus: There are spammy-looking mails from sources that follow the rules. There are people who are so clueless that they label something spam rather than unsubscribe. There are people who do the same not because they are clueless, but because, if they don't recognize that something comes from a subscription or just aren't sure, they know better than to take a chance on using a spammer's unsubscribe link. And there's Constant Contact, who may have found a way around what at first glance appears to be a good defense against spam.

So how do you have a clean corpus when it could contain edge cases that are classified wrong? What is the "correct" score for such mail? If the only difference between a piece of spam and a piece of ham is whether the recipient subscribed to it, how do you call either one an FP or an FN for the purpose of the rule scoring program? I don't have answers to that.

By the way, if Constant Contact really is doing that, they must be counting on low numbers of complaints. That link I posted to Ironport's site listed the Bonded Sender fees as of two years ago. It makes it risky for a single customer to spam. But I can see how Constant Contact could have a business model based on getting paid by a mix of spammers and hammers. The Bonded Sender fines are based on the number of complaints per million mails. If you want to nail them, get aggressive about reporting the confirmed RCVD_IN_BSP_TRUSTED spam. Once the number of complaints reaches the threshold where it costs Constant Contact $1000 per spam mail, they are going to have to clean up their act if it really is that sleazy.
Subject: Re: Score generation for SpamAssassin 3.1

BTW weren't we planning to set the BAYES_ scores non-mutable? can't quite recall.

--j.
> BTW weren't we planning to set the BAYES_ scores non-mutable?
> can't quite recall.

I know there had been talk of it, although I'm too lazy to try to dig up the thread. I think, if it isn't too much work, what I'd like to see would be something like taking the final generated scoreset, normalizing the bayes numbers for all sets to ascending sequence more or less*, and then locking them and rerunning the score generation to get updated values for the other rules.

* From the data I looked at in Henry's posting, I seem to recall that 05 and 99 were obviously out of sequence. I think 99 is the critical one to have in sequence. 05 may be correct where it is, even though out of sequence. Perhaps a topic for discussion.

Loren
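[Editorial note: "normalizing to ascending sequence" could be sketched roughly as below. This is a hypothetical illustration, not any actual SA rescoring tooling; the sample values are the set-3 perceptron outputs quoted elsewhere in this bug, and a real run would lock the results and re-run the perceptron afterwards. A running-max clamp is the simplest fix; isotonic regression would be another option.]

```python
# Sketch: force BAYES_* scores to be non-decreasing in the bucket number,
# by clamping each bucket to the running maximum of the buckets below it.
def normalize_ascending(scores):
    """scores: dict mapping bayes bucket (0, 5, ..., 99) -> score."""
    out = {}
    running_max = float("-inf")
    for bucket in sorted(scores):
        running_max = max(running_max, scores[bucket])
        out[bucket] = running_max
    return out

# Set-3 perceptron values quoted in this thread; 05, 95 and 99 are the
# out-of-sequence buckets being discussed:
set3 = {0: -2.600, 5: -0.410, 20: -1.950, 40: -1.100, 50: 0.000,
        60: 0.370, 80: 2.090, 95: 2.060, 99: 1.890}
fixed = normalize_ascending(set3)
# BAYES_20/BAYES_40 get pulled up to -0.410; BAYES_95/BAYES_99 up to 2.090
```

Note that the clamp only pulls scores up to the preceding maximum; it never pushes BAYES_99 above what the perceptron assigned to any lower bucket, which is why the thread also discusses locking the scores and rerunning.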
I personally would prefer to avoid fixing any Bayes scores so they couldn't float, but I feel equally strongly that BAYES_99 should score higher than the others. BAYES_00 is problematic when a Bayes database gets poisoned, but BAYES_99 generally doesn't have that problem.

Option 1: Allow all Bayes scores to float, but add code which forces BAYES_99 to be at least 10% higher than the max score of all other Bayes scores (at least BAYES_95).

Option 2: Allow all Bayes scores to float, but give BAYES_99 a floor of either 3.5 or 4.0 -- it can float higher if the Perceptron feels it should, but no lower.

In SARE we sometimes run into a family of rules like Bayes, something like

__RULE_1 -- spam sign # 1
__RULE_2 -- spam sign # 2
__RULE_3 -- spam sign # 3
meta RULE_1 -- rule 1 but not 2 or 3
meta RULE_2 -- rule 2 but not 1 or 3
meta RULE_3 -- rule 3 but not 1 or 2
meta RULE_4 -- rules 1 and 2 but not 3
meta RULE_5 -- rules 1 and 3 but not 2
meta RULE_6 -- rules 2 and 3 but not 1
meta RULE_7 -- rules 1, 2, and 3

The meta rules 1-3 are scored based on their solo hits (the hits of their __feeder rules), using our standard SARE algorithms. Assuming that meta rules 4-6 hit fewer ham than 1-3, we score them higher than 1-3, even if their total spam hits are lower (because of the increased requirements). Likewise, meta rule 7 will be scored highest of this family, because it's the "safest" of the seven rules.

Would it be worthwhile opening a new bugz entry for a 3.2 enhancement to implement some kind of "this rule scores better than that rule if its S/O is at least as good" linkage?
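[Editorial note: the key property of the SARE-style meta family above is that exactly one of the seven metas fires for any combination of subrule hits, so the scores never stack. A small illustration of that mapping (hypothetical names matching the sketch above, not real rules):]

```python
# Sketch: which of the seven mutually exclusive SARE-style meta rules fires,
# given hits (True/False) on the three __RULE_N feeder subrules.
def meta_fired(r1, r2, r3):
    combos = {
        (True,  False, False): "RULE_1",
        (False, True,  False): "RULE_2",
        (False, False, True):  "RULE_3",
        (True,  True,  False): "RULE_4",
        (True,  False, True):  "RULE_5",
        (False, True,  True):  "RULE_6",
        (True,  True,  True):  "RULE_7",
    }
    # None when no feeder hit; at most one meta can ever fire.
    return combos.get((r1, r2, r3))
```

Because the combinations are exhaustive and disjoint, each meta can be given its own score (rising with the number of feeders hit) without any double counting -- which is what makes the "rule X scores at least as high as rule Y" ordering meaningful.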
SM> It's tricky getting a good corpus: ...

In addition to your reasons, a good corpus for local use (it's spam here, and always spam here) may not be good for global use (it's not spam to users on that other system over there). And to expand on your

SM> There are people who [sa-learn as spam] not because they are clueless,
SM> but if they don't recognize that something comes from a subscription or
SM> just aren't sure, ...

There are also sources that confound matters -- a user can sign up with them for one brand, and receive emails from a corporate parent with a different domain name.

SM> And there's Constant Contact who may have found a way around what at
SM> first glance appears to be a good defense against spam.
SM> ... if Constant Contact really is doing that, they must be counting on
SM> low numbers of complaints.

Apparently they are, based on the large number of cc.com emails here that qualify for the BSP rules.

SM> That link I posted to Ironport's site listed the Bonded Sender fees as
SM> of two years ago. It makes it risky for a single customer to spam. But
SM> I can see how Constant Contact could have a business model based on
SM> getting paid by a mix of spammers and hammers. The Bonded Sender fines
SM> are based on number of complaints per million mails. If you want to
SM> nail them, get aggressive about reporting the confirmed
SM> RCVD_IN_BSP_TRUSTED spam. ...

My family gets a lot more ham than spam from cc.com, and so in the past, on those rare occasions when we've gotten cc.com spam, I've gone directly to them, with satisfactory results. Given what I'm seeing now in this corpus, I'll send in the formal complaints to BSP/Ironport, to increase cc.com's incentive to police their customers.

SM> So how do you have a clean corpus when it could contain edge cases
SM> that are classified wrong? ...
Or, IMO more correctly, a valid and representative corpus used for scoring /should/ have edge cases that may or may not be classified wrong -- there's no other way for a major ISP, who can't know what their users did or didn't subscribe to, to manage their spam. It's important to classify them as accurately as humanly possible, but for SA to be optimally useful it needs to be able to make judgments about the edge cases as well, and it can only do that if we take the risk and include them in our corpus.

SM> What is the "correct" score for such mail? If the only difference
SM> between a piece of spam and a piece of ham is whether the recipient
SM> subscribed to it, how do you call either one an FP or an FN for the
SM> purpose of the rule scoring program? I don't have answers to that.

First pass suggestion: Aim to get these "edge" emails into the 2.0-4.0 score range, so that network tests and hopefully Bayes can push them over 5.0 or under 0.0 as appropriate for the user/site.
Created attachment 3048 [details]
freqs for scoreset 3, all logs, all rules

Daniel noticed that the freqs file I posted was missing SPF_PASS (for some reason, it's listed as a userconf rule, dunno why). Here's a copy that includes it.
Regarding the Bob's-corpus issue: I've been pondering this a bit, and I think we have to leave it out of the rescore run. Fundamentally, I don't trust the user population involved :(

I think your users are using "learn as spam" to keep stuff that isn't *strictly* UBE out of their mail folders; by using those logs, we'd generate score-sets that consider spam to be "stuff your users don't want" rather than "unsolicited bulk email", which is what we have to aim towards.

We used to have a spam definition, namely "spam == UBE", up somewhere related to corpus policy, but I can't find it now. But in my opinion that still applies ;)

(To be honest, I'm not sure there's any good way to use someone else's email in a rescoring run, since I've often wound up saying "yes, I subscribed to that horrible spammy-looking newsletter that's sending with a misleading HELO string", even for my own mail. And you should see Rod's corpus! ;)

--j.
The scores for the upper BAYES rules (ie 80, 95 and 99) are too low. We should lock in the values based on what we saw in the 3.0 release. Personally I've been running with this in my local.cf for a long while with no issues:

score BAYES_80 0 0 4.608 3.087
score BAYES_95 0 0 4.514 3.063
score BAYES_99 0 0 5.070 3.886

Granted, the 80/95 set3 scores might be a tad high for general consumption.
Same here. I've been running with 3.0's scoreset 2 scores for both scoresets 2 and 3, for BAYES_50-99, with no problems (always using scoreset 3).

score BAYES_50 0 0 1.567 1.567
score BAYES_60 0 0 3.515 3.515
score BAYES_80 0 0 3.608 3.608
score BAYES_95 0 0 3.514 3.514
score BAYES_99 0 0 4.070 4.070
anyway, back to the score generation thing, a few items:

1. I'm -1 on using those scores. They look great all-round, *except* for the Bayes scores:

 56.044  84.1316   0.0375  1.000  0.84   1.89  BAYES_99
  1.716   2.5715   0.0099  0.996  0.83   2.06  BAYES_95
  1.983   2.9654   0.0251  0.992  0.76   2.09  BAYES_80
  1.685   2.5064   0.0463  0.982  0.68   0.37  BAYES_60
 31.996   0.3606  95.0772  0.004  0.60  -2.60  BAYES_00
  4.503   5.9619   1.5927  0.789  0.47   0.00  BAYES_50
  0.311   0.0880   0.7556  0.104  0.36  -0.41  BAYES_05
  0.377   0.1622   0.8048  0.168  0.32  -1.95  BAYES_20
  0.401   0.2655   0.6706  0.284  0.27  -1.10  BAYES_40

(scoreset 3 freqs output.) note that none of them was permitted above 2 points by the perceptron; those scores have the odd flattening for BAYES_95/99 we had to fix in 3.0.3 in r165033; and there seems to be unanimous support on the record for fixing these.

(ok, I'm being a little disingenuous on the last point, as I think someone, either Daniel or Henry, was ok with letting them float, but they made the comment on a transitory medium like IRC or IM so it doesn't count. ;)

So I suggest we set them to the static scores and move out of the mutable section, as done in the attached patch, then get Henry to rerun the perceptron. for ease of review, those static scores are:

score BAYES_00 0.0001 0.0001 -2.312 -2.599
score BAYES_05 0.0001 0.0001 -1.110 -1.110
score BAYES_20 0.0001 0.0001 -0.740 -0.740
score BAYES_40 0.0001 0.0001 -0.185 -0.185
score BAYES_50 0.0001 0.0001 0.001 0.001
score BAYES_60 0.0001 0.0001 1.0 1.0
score BAYES_80 0.0001 0.0001 2.0 2.0
score BAYES_95 0.0001 0.0001 3.0 3.0
score BAYES_99 0.0001 0.0001 3.5 3.5

they're a mix of what the perceptron said in that last run, what was used in 3.0.3, and some smoothing (to avoid the FAQs again).

Henry -- any chance you can gzip up the validation set after you run the perceptron, and put them somewhere? There's a whole batch of stuff that needs to be done that needs those.

also, we need to get the statistics in.
I've updated http://wiki.apache.org/spamassassin/RescoreMassCheck with what I think needs to be done (steps 5 onwards). Probably not worth doing those until we vote on the patch / figure out what to do with the BAYES scores, though.
Created attachment 3051 [details]
bayes scores
FWIW, the data from scoreset 3 more closely supports using the equation (bayes_group-50)/(50/3.5) to calculate the score. This is quite close to Justin's values above 50, but departs considerably at lower Bayes values:

Group   Set 3    Norm 3.5   Justin 2   Justin 3
  0    -2.600    -3.500     -2.312     -2.599
  5    -0.410    -3.150     -1.110     -1.110
 20    -1.950    -2.100     -0.740     -0.740
 40    -1.100    -0.700     -0.185     -0.185
 50     0.000     0.000      0.001      0.001
 60     0.370     0.700      1.000      1.000
 80     2.090     2.100      2.000      2.000
 95     2.060     3.150      3.000      3.000
 99     1.890     3.430      3.500      3.500

The "Norm 3.5" group matching the above equation is very close to the Perceptron scores for Bayes_20 to Bayes_80. The Perceptron score for Bayes_05 is just plain wonky, and of course the scores flatten completely at Bayes_80.

Running a simple linear solution to approximate the bayes_20 to bayes_80 scores with a straight line produces a slightly lower value for the constant (3.5) above: 3.3875. This of course produces slightly less aggressive scores on the top and bottom ends:

Group   Set 3    Norm 3.3875
  0    -2.600    -3.388
  5    -0.410    -3.049
 20    -1.950    -2.033
 40    -1.100    -0.678
 50     0.000     0.000
 60     0.370     0.678
 80     2.090     2.033
 95     2.060     3.049
 99     1.890     3.320
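[Editorial note: the "Norm" columns above are just a straight line through (50, 0) with slope scale/50, so they are easy to check:]

```python
# Sketch: Loren's linear bayes-score equation, score = (bucket-50)/(50/scale).
# scale=3.5 reproduces the "Norm 3.5" column; scale=3.3875 the second table.
def bayes_score(bucket, scale=3.5):
    return (bucket - 50) / (50 / scale)
```

For example, bayes_score(99) gives 3.43 and bayes_score(0) gives -3.5, matching the first table above; with scale=3.3875, bayes_score(99) gives 3.31975, which rounds to the 3.320 shown for group 99.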
hellooooo! anyone out there? especially Henry, you're on the critical path here in a big way.

This bug is the 3.1.0 blocker. Once this is done we can release 3.1.0. As such it's pretty important!

IMMEDIATELY REQUIRED:

- Henry: gzip up the validation logs set and put them somewhere. This gets you off the critical path for 3.1.0, at least temporarily, since we can try out new bayes scores and figure out if a new perceptron will need to be run, or if we can just bump the scores manually and use the patch you already posted. Without the validation set, we can't get an accurate idea afaik.

- ALL DEVS: decide correct scores for BAYES*. this requires comments. please comment.

- ALL DEVS: if my patch of proposed BAYES* scores meets with your approval (which I'd say it probably won't seeing as everyone has their favourites), vote +1. Otherwise create a patch of your own we can vote on. I think DOS' and Loren's suggested scores both look ok.

DOWN THE ROAD A BIT:

- Henry: (possibly) rerun the perceptron if the validation logs set indicates that it's required.

- ALL DEVS: once there's a new patch with all scores, vote on it so it can be applied.
I just noticed that the proposed 3.1 BAYES_* scores in scoreset 2 are identical to the 3.0 ones. So... manually tweaked scores for 3.0 should work just as good with 3.1. I'm +1 on the BAYES_50-99 scores I posted in comment 41 (which are the scoreset 2 scores copied to scoreset 3). I really think BAYES_99 should score at least 4.0. I'm not exactly sure which of Loren's scores Justin is referring to, but I think 3.5 is too low for BAYES_99.
> I'm not exactly sure which of Loren's scores Justin is referring to, but I
> think 3.5 is too low for BAYES_99.

I'm not sure which set either. I think that 3.5 *might* be OK with net tests also. I think I'd want something closer to 4.0 - 4.5 or even higher without net tests. Wasn't it something just shy of 5 in 2.6?
Another suggested set of bayes values:

Bayes   Set 2    Set 3    Eqn 2    Eqn 3
  0    -2.312   -2.599   -2.5     -2.6
  5    -1.11    -0.413   -1.525   -2.2
 20    -0.74    -1.951   -0.7     -2
 40    -0.185   -1.096    0.4     -0.78
 50     0.912    0.001    0.95    -0.1
 60     2.22     0.372    1.8      0.58
 80     2.775    2.087    2.7      1.94
 95     3.237    2.063    3.425    2.96
 99     3.145    1.886    3.645    3.232

The second and third columns are sets 2 and 3 from Henry's data. The final two columns are my proposed values for sets 2 and 3. These values are not what I would really like to see on the high end, but I think are about as high as one can somewhat reasonably go based on the data.

Both sets are essentially linear trendlines for sets 2 and 3, with some hand corrections to better match what I consider a few important data points. In particular, bayes_00 for both sets 2 and 3 are close to -2.5. However the trendlines would predict values around -1.7 for set 2 and -3.2 or so for set 3. I've moved the bayes_00 point to something that the data will support in both cases. Also both sets show a weakness in bayes_05. I've pushed the bayes_05 trendline values upward for both sets, although not far enough to create score inversions.

It should be noted that both original sets indicate a flattening of the bayes scores over 80%. I've left these values as the linear trendline would predict, since that seems to be closer to normal human experience. It must be noted though that the data doesn't really support these extrapolations, especially for bayes_99.

Neither bayes_99 score comes close to 4.0. I tried to play with the data until I could get something in that range, but it wouldn't go along with the game. It would be possible to tweak the set 2 scores for 95 and 99 upward to aim at 4.0 without departing too badly from the data. This wouldn't be possible with the set 3 scores.
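[Editorial note: a "linear trendline" of the kind discussed here can be reproduced with a least-squares fit of score against (bucket - 50), forced through (50, 0) so it matches the (bucket-50)/(50/c) form. A sketch using the set-3 values for buckets 20-80 from the table above; the exact constant depends on which buckets are included and how the line is fitted, which presumably explains why this lands near, rather than exactly on, the 3.3875 quoted earlier in the thread:]

```python
# Sketch: least-squares slope through (50, 0) for set-3 buckets 20..80,
# expressed as the "scale" constant c in score = (bucket - 50) / (50 / c).
points = [(20, -1.951), (40, -1.096), (50, 0.001), (60, 0.372), (80, 2.087)]
num = sum((x - 50) * y for x, y in points)   # sum of (x-50)*score
den = sum((x - 50) ** 2 for x, y in points)  # sum of (x-50)^2
slope = num / den    # score units per bayes percentage point
c = slope * 50       # comes out around 3.4 for these five points
```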
'So... manually tweaked scores for 3.0 should work just as good with 3.1. I'm +1 on the BAYES_50-99 scores I posted in comment 41 (which are the scoreset 2 scores copied to scoreset 3). I really think BAYES_99 should score at least 4.0.' OK, I'm fine with the comment 41 scores, and I agree BAYES_99 should be >= 4.0. +1. care to make a patch?
OK, I got hold of the logs from Henry, and measured some BAYES scores against the validation set:

base results from comment 28, gen-set3-2.0-5.0-100-nobob:

# Correctly non-spam:  53070  99.96%
# Correctly spam:     121906  98.49%
# False positives:        21  0.04%
# False negatives:      1872  1.51%
# TCR(l=50): 42.360712  SpamRecall: 98.488%  SpamPrec: 99.983%

copying values from set 2 for set 3:

# Correctly non-spam:  53064  99.95%
# Correctly spam:     122453  98.93%
# False positives:        27  0.05%
# False negatives:      1325  1.07%
# TCR(l=50): 46.272150  SpamRecall: 98.930%  SpamPrec: 99.978%

comment 14:

# Correctly non-spam:  53014  99.85%
# Correctly spam:     123093  99.45%
# False positives:        77  0.15%
# False negatives:       685  0.55%
# TCR(l=50): 27.293936  SpamRecall: 99.447%  SpamPrec: 99.937%

comment 42 (the patch in attachment 3051 [details]):

# Correctly non-spam:  53068  99.96%
# Correctly spam:     122509  98.97%
# False positives:        23  0.04%
# False negatives:      1269  1.03%
# TCR(l=50): 51.169078  SpamRecall: 98.975%  SpamPrec: 99.981%

I think 3051 has the best scores. fewer FNs, just 2 more FPs, sane scores. I'd suggest we just vote on that patch.

If you want to try other values btw -- the logs are in the zone. do this:

cd svncheckout/masses
rm ham.log spam.log
ln -s /home/corpus-rsync/corpus/scoregen-3.1/gen-set3-2.0-5.0-100-nobob/NSBASE/ham-test.log ham.log
ln -s /home/corpus-rsync/corpus/scoregen-3.1/gen-set3-2.0-5.0-100-nobob/SPBASE/spam-test.log spam.log
vi ../rules/50_scores.cf
./fp-fn-statistics --scoreset=3
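[Editorial note: the summary lines in these reports can be recomputed from the raw counts. TCR (total cost ratio) here is nspam / (lambda*FP + FN); the counts above reproduce the printed TCR, recall, and precision to the digits shown. A sketch (the function name is illustrative, not part of the SA tools):]

```python
# Sketch: recompute TCR / SpamRecall / SpamPrec from the raw counts above.
# lambda=50 means one false positive is treated as costly as 50 false
# negatives.
def summarize(correct_ham, correct_spam, fp, fn, lam=50):
    nspam = correct_spam + fn          # total spam in the validation set
    return {
        "tcr": nspam / (lam * fp + fn),
        "spam_recall": correct_spam / nspam,
        "spam_precision": correct_spam / (correct_spam + fp),
    }

# The attachment-3051 run above:
stats = summarize(correct_ham=53068, correct_spam=122509, fp=23, fn=1269)
```

Plugging in the comment-42 counts gives TCR about 51.169, recall about 98.975%, and precision about 99.981%, matching the report.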
+1 on 3051

It would probably be more valid if we set the bayes score a little higher and re-ran the perceptron, that way we could get scores over 4 for BAYES_99 without so many FPs.
+1 on 3051, and I agree it'd be good to see whether a perceptron run would back out those two extra FPs (though I'm not overly concerned about just two FPs).
What I meant to say was that we should set the BAYES scores explicitly and make them immutable, then re-run the perceptron. In that case, I'd rather see slightly higher bayes scores, closer to those in comment 40 or comment 41 (probably in between). I'd like to see about 4.5 for BAYES_99.
yeah, I'd like to do another perceptron run with those immutable -- however it might take too long. that's up to Henry, really.... in the meantime let's apply 3051.
I don't mind doing another validation and scoring run. Commit a patch with whatever you want to svn and let me know. Make sure that the scores are in an immutable block.
Henry: 3051 now has 3 +1s, and can be committed. It moves the BAYES scores into an immutable block. so if you want to give this a go, go ahead and patch that and check it in, then rerun the perceptron; alternatively, I'll check it in later if you haven't beaten me to it, and you can rerun perceptron after that.
ok, I got that chance; 3051 is now applied.

trunk:
Sending        rules/50_scores.cf
Transmitting file data .
Committed revision 230721.

b3_1_0:
Sending        rules/50_scores.cf
Transmitting file data .
Committed revision 230723.
Created attachment 3062 [details]
release-quality patch

hey, here's a patch that uses the scores from attachment 3046 [details], plus the bayes scores from attachment 3051 [details], and includes STATISTICS files for all scoresets. This is release-quality, if we want to go with this; alternatively, we can wait for a go-around with the locked-down Bayes scores.

IMO: we should release with these. set 3 is looking fine as-is, and we're spending a lot of time on this.
hmm, nix that patch. I've just realised the STATISTICS files don't contain the freqs.
Changing the Bayes scores didn't have an impact on accuracy with newly-generated scores. This doesn't say that changing the scores with what was previously generated does not impact accuracy (we know otherwise). Do you really want me to generate the scores again? It's a real ballache but I'll do it.

Samples: vm-set1-2.0-4.0-100-nobob vm-set1-2.0-4.0-100-nobob-ib

False positives:
  Sample 1: mean=0.0554% std=0.0229
  Sample 2: mean=0.0595% std=0.0252
  Statistically significantly different with confidence 99.2161%
  Estimated difference: -0.0041% +/- 0.0117

False negatives:
  Sample 1: mean=3.3473% std=1.1779
  Sample 2: mean=3.3299% std=1.1745
  Not statistically significantly different (alpha=0.9500)
  Estimated difference: 0.0174% +/- 0.1339

TCR (lambda=50):
  Sample 1: mean=17.2267 std=6.1150
  Sample 2: mean=16.9662 std=6.0300
  Not statistically significantly different (alpha=0.9500)
  Estimated difference: 0.2605 +/- 1.0179

Samples: vm-set3-2.0-5.0-100-nobob vm-set3-2.0-5.0-100-nobob-ib

False positives:
  Sample 1: mean=0.0546% std=0.0282
  Sample 2: mean=0.0575% std=0.0241
  Not statistically significantly different (alpha=0.9500)
  Estimated difference: -0.0028% +/- 0.0651

False negatives:
  Sample 1: mean=1.0845% std=0.5179
  Sample 2: mean=1.2911% std=0.4657
  Not statistically significantly different (alpha=0.9500)
  Estimated difference: -0.2066% +/- 0.8138

TCR (lambda=50):
  Sample 1: mean=37.6074 std=15.3585
  Sample 2: mean=31.9635 std=11.8543
  Not statistically significantly different (alpha=0.9500)
  Estimated difference: 5.6439 +/- 23.5426
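[Editorial note: the exact significance test model-statistics runs isn't shown here; it is likely a paired comparison over matched validation folds, which is far more sensitive than an unpaired test on summary stats. For rough intuition only, a generic unpaired Welch t statistic computed from the means and standard deviations above would look like this; n=100 per run is an assumption taken from the "-100" in the directory names, and the result will NOT reproduce the 99.2% confidence printed above:]

```python
import math

# Sketch: generic Welch two-sample t statistic from summary statistics.
# NOT the model-statistics test; purely illustrative.
def welch_t(mean1, std1, n1, mean2, std2, n2):
    se = math.sqrt(std1 ** 2 / n1 + std2 ** 2 / n2)  # standard error of diff
    return (mean1 - mean2) / se

# FP rates for vm-set1 with vs. without immutable bayes scores (n assumed 100):
t = welch_t(0.0554, 0.0229, 100, 0.0595, 0.0252, 100)  # roughly -1.2
```

That an unpaired statistic this small coexists with a reported 99.2% confidence is itself a hint that the script compares the runs pairwise, sample by sample, where the shared randomness cancels out.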
'Do you really want me to generate the scores again? It's a real ballache but I'll do it.' no, no need. thanks for checking btw!
Created attachment 3065 [details]
redo of 3062

ok, this one's better, includes the freqs! Please vote.....
3065 is almost there it seems.

t/meta......................
MULTI_FORGED depends on FORGED_AOL_RCVD with 0 score in set 0
MULTI_FORGED depends on FORGED_GW05_RCVD with 0 score in set 0
CONFIRMED_FORGED depends on FORGED_AOL_RCVD with 0 score in set 0
CONFIRMED_FORGED depends on FORGED_GW05_RCVD with 0 score in set 0
MULTI_FORGED depends on FORGED_AOL_RCVD with 0 score in set 1
MULTI_FORGED depends on FORGED_GW05_RCVD with 0 score in set 1
FORGED_THEBAT_HTML depends on MIME_HTML_ONLY with 0 score in set 1
FORGED_IMS_HTML depends on MIME_HTML_ONLY with 0 score in set 1
HTML_MIME_NO_HTML_TAG depends on MIME_HTML_ONLY with 0 score in set 1
DRUGS_MANYKINDS depends on DRUGS_PAIN with 0 score in set 1
OBFUSCATING_COMMENT depends on MIME_HTML_ONLY with 0 score in set 1
FORGED_OUTLOOK_HTML depends on MIME_HTML_ONLY with 0 score in set 1
MULTI_FORGED depends on FORGED_AOL_RCVD with 0 score in set 2
MULTI_FORGED depends on FORGED_GW05_RCVD with 0 score in set 2
CONFIRMED_FORGED depends on FORGED_AOL_RCVD with 0 score in set 2
CONFIRMED_FORGED depends on FORGED_GW05_RCVD with 0 score in set 2
MULTI_FORGED depends on FORGED_AOL_RCVD with 0 score in set 3
MULTI_FORGED depends on FORGED_GW05_RCVD with 0 score in set 3
DRUGS_MANYKINDS depends on DRUGS_PAIN with 0 score in set 3
DRUGS_MANYKINDS depends on DRUGS_MUSCLE with 0 score in set 3

I think there are a couple of things we may want to address in the future as well: some scores are set to "0.000" versus "0", ala "score HDR_ORDER_MTSRIX 0 # n=0 n=1 n=2 n=3" instead of "score URI_HEX 0.000". It'd be nice to round scores where abs(score) < 0.1 to 0 like we used to do. No point in running rules when they're basically not going to contribute. Etc.
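[Editorial note: the "round tiny scores to 0" suggestion is a one-liner once the score table is parsed; a minimal illustration (the parsing of 50_scores.cf is elided, and the sample data is hypothetical except for the URI_HEX / URIBL_SBL names that appear in this thread):]

```python
# Sketch: zero out any per-scoreset value with magnitude under 0.1, so the
# rule is effectively disabled instead of contributing noise. Assumes the
# scores were already parsed into {rule_name: [set0, set1, set2, set3]}.
def zero_tiny(scores, threshold=0.1):
    return {rule: [0 if abs(s) < threshold else s for s in vals]
            for rule, vals in scores.items()}

sample = {"URI_HEX":   [0.000, 0.052, 0.000, -0.031],
          "URIBL_SBL": [0, 1.094, 0, 1.639]}
cleaned = zero_tiny(sample)
# URI_HEX collapses to all zeroes; URIBL_SBL is untouched
```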
+1
ok, working on the meta.t failures and the zeroing of scores that are -0.1 < score < 0.1.

question: has anyone used 'rewrite-cf-with-new-scores' recently? can it successfully rewrite these scores in place?

# URIDNSBL
ifplugin Mail::SpamAssassin::Plugin::URIDNSBL
# <gen:mutable>
score URIBL_AB_SURBL 0 3.306 0 3.812
score URIBL_JP_SURBL 0 3.360 0 4.087
score URIBL_OB_SURBL 0 2.617 0 3.008
score URIBL_PH_SURBL 0 2.240 0 2.800
score URIBL_SBL 0 1.094 0 1.639
score URIBL_SC_SURBL 0 3.600 0 4.498
score URIBL_WS_SURBL 0 1.533 0 2.140
# </gen:mutable>
endif # Mail::SpamAssassin::Plugin::URIDNSBL

what happens for me is that they get shoved into the main <gen:mutable> section, and lose their "ifplugin" scope. that's obviously bad news, as it means that manual hand-editing is required to fix it. is there a working script that avoids that problem?
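[Editorial note: not what rewrite-cf-with-new-scores actually does, but the structure-preserving behavior being asked for can be sketched as a line-by-line substitution: only the values on existing "score" lines change, so ifplugin / gen:mutable scoping stays exactly where it was. A hypothetical sketch:]

```python
import re

# Sketch: rewrite "score NAME v v v v" lines in place, leaving every other
# line (ifplugin, <gen:mutable>, comments, endif) untouched and in position.
# new_scores maps rule name -> list of four per-scoreset values.
SCORE_RE = re.compile(r"^(\s*score\s+)(\S+)(\s+).*$")

def rewrite_scores(cf_text, new_scores):
    out = []
    for line in cf_text.splitlines():
        m = SCORE_RE.match(line)
        if m and m.group(2) in new_scores:
            vals = " ".join("%.3f" % v for v in new_scores[m.group(2)])
            line = m.group(1) + m.group(2) + m.group(3) + vals
        out.append(line)
    return "\n".join(out)

cf = """ifplugin Mail::SpamAssassin::Plugin::URIDNSBL
# <gen:mutable>
score URIBL_SBL 0 1.094 0 1.639
# </gen:mutable>
endif
"""
fixed = rewrite_scores(cf, {"URIBL_SBL": [0, 1.2, 0, 1.7]})
# score line updated in place; the ifplugin/endif scope is preserved
```

Rules absent from new_scores pass through unchanged, so a partial rescore doesn't disturb the rest of the file.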
Created attachment 3066 [details]
redo of 3065

ok, this one:
- passes t/meta.t
- zeroes rules where -0.1 < score < 0.1
- is otherwise identical.

I haven't redone the STATISTICS files, though. ;)
Created attachment 3068 [details]
fix for test failures caused by 3066

this is an adjunct to 3066; unfortunately make test produces lots of failures without this patch otherwise. it's a set of fixes to the test suite, fixing more of the tests to use their own rules, instead of relying on the distribution-default ruleset; this patch adds a new test-suite-specific rules file, so the test suite is more independent of the basic ruleset.
Created attachment 3069 [details]
redo of 3066

well isn't this fun. it turns out that rule_names.t introduces more unpredictability in our test suite, and causes *occasional* 'make test' failures. FUZZY_VALIUM in rules/25_replace.cf was therefore causing make test failures, due to its name; this version of the rules patch includes the new scores, the new stats, and renames that rule to "FUZZY_VLIUM" to avoid this test failure. the following patch is a fix for t/rule_names.t that removes this unpredictability.
Created attachment 3070 [details]
fix for t/rule_names.t

I think this helps
ok. these patches all need votes, now: 3069, 3068, 3070.
Justin, can you elaborate on why rule_names.t was failing? I don't see why FUZZY_VALIUM had the problem, but FUZZY_VIOXX or FUZZY_VICODIN does not. +1 on all 3
'Justin, can you elaborate on why rule_names.t was failing? I don't see why FUZZY_VALIUM had the problem, but FUZZY_VIOXX or FUZZY_VICODIN does not.' FUZZY_VALIUM contained "VALIUM" which was firing on DRUGS_ANXIETY (__DRUGS_ANXIETY_3 to be exact). I couldn't see exactly why, but it certainly was firing on that bit of the name ;) I have no idea why VIOXX/VICODIN aren't firing, although the __DRUGS_FOO_N rules all seem to have individual subrules for each drug, and some have \b and some have other start-of-string markers. rule_names.t is a bit of a combinatorial lucky dip I think. :(
+1
ok! applied, 231543 and 231544.