SA Bugzilla – Bug 1589
Some negative scores too low in 2.50 evolved scores
Last modified: 2003-05-11 20:41:35 UTC
A bunch of the 'nice' rules in 2.50 have scores of -5 to -6.6. While they produced good results in the GA runs, I think they're likely to cause false negatives in the real world, and I've seen at least two or three of them (SIGNATURE_LONG_SPARSE, EMAIL_ATTRIBUTION, ?) discussed as the causes of FNs already. These are also going to present a very tempting way for spammers to get messages through.

I think the low limit should have been the opposite of the high limit. For example, BAYES_10 subtracts 6.4 points in set 2, but BAYES_90 only adds 4.1. This seems lopsided given that they (presumably) have the same confidence level for spam or nonspam.

I've listed all of the rules with scores of -5 or lower below. Some rules like HABEAS and BONDEDSENDER are trusted, so I didn't include them. If we end up with trust or confidence ratings for rules, some of the more-difficult-to-forge ones below could have lower limits too.

Since the GA seemed to think it needed these low negatives to get the best results, I think what we'll need in the long term is a greater variety of nice rules, so the negative scores can be divided among them with no single rule making an easy whitelist tool for spammers.

score BUGZILLA_BUG -6.400 -6.300 -2.900 -6.300
score CRON_ENV -6.400 -6.300 -5.701 -5.701
score EMAIL_ATTRIBUTION -6.600 -6.500 -6.500 -6.500
score GROUPS_YAHOO_1 -5.801
score MSGID_GOOD_EXCHANGE -5.801 -5.701 -5.701 -5.701
score PGP_SIGNATURE -6.400 -6.300 -5.701 -5.701
score SIGNATURE_LONG_SPARSE -5.801 -5.801 -3.101 -5.801
score SIGNATURE_LONG_DENSE -6.400 -6.300 -6.300 -6.300
score USER_AGENT_ENTOURAGE 0 0 0 -5.701
score USER_AGENT_PINE -5.801 -5.801 -5.701 -5.701
score USER_AGENT_VM -5.801 -5.701 -5.701 -5.701
score DEBIAN_BTS_BUG -5.701 -5.801 0 -2.900
score USER_AGENT_GNUS_UA -6.400 -6.300 -2.900 -6.300
score USER_AGENT_KMAIL -5.800 -5.801 -6.300 -6.400
score USER_AGENT_MOZILLA_UA -5.801 -5.800 -5.701 -6.300
score USER_AGENT_MUTT -6.400 -6.400 -6.300 -6.300
score USER_AGENT_XIMIAN -6.400 -6.300 -6.300 -6.300
score ORIGINAL_MESSAGE -3.101 -3.101 -6.300 -6.300
score PATCH_UNIFIED_DIFF -6.027 -6.027 -2.900 -6.300
score PGP_SIGNATURE_2 -6.400 -6.300 -6.300 -6.300
score REFERENCES -6.600 -6.600 -6.500 -6.500
score REPLY_WITH_QUOTES -6.600 -6.500 -6.400 -6.500
score BAYES_00 0 0 -6.400 -6.400
score BAYES_01 0 0 -6.600 -6.600
score BAYES_10 0 0 -6.400 -5.801
score BAYES_20 0 0 -5.801 -3.101

Note: In the worst case, these scores are low enough not only to cause a false negative but also to cause Bayes to auto-train the message as ham, so I think it's really important to tone them down.
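To make the auto-training concern concrete, here's a minimal worked example in Perl. The 7.5 content score is invented, the two negative values are the set-2 scores from the list above, and 0.1 is an assumed ham auto-learn threshold, so treat this as a sketch rather than real corpus data:

  #!/usr/bin/perl
  # Illustrative arithmetic only: a strong spam loses to two forged nice rules.
  use strict;
  use warnings;

  my $content_score = 7.5;        # hypothetical spam, well over the 5.0 threshold
  my %forged_nice = (
      EMAIL_ATTRIBUTION => -6.5,  # set-2 score from the list above
      REPLY_WITH_QUOTES => -6.4,  # set-2 score from the list above
  );

  my $total = $content_score;
  $total += $_ for values %forged_nice;

  printf "final score: %.1f\n", $total;                # -5.4: a false negative
  print  "would auto-learn as ham\n" if $total < 0.1;  # assumed threshold

Forging just two nice rules takes a clear spam from 7.5 down to -5.4, which is not only a false negative but also, under the assumed threshold, ham-training material for Bayes.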
Incidentally, I looked at my false negatives since upgrading to 2.50 (only four of them!) and two of them are due to BAYES_00 and BAYES_01 having such low scores. This is despite training on thousands of messages - so for those who have Bayes kick in after auto-training on only 400, I expect this will be even more of a problem. (Of course, none of this is that big a deal, a few FNs is far better than worrying about FPs.)
Subject: Re: [SAdev] New: Some negative scores too low in 2.50 evolved scores

On Mon, Mar 03, 2003 at 02:47:42AM -0800, bugzilla-daemon@hughes-family.org wrote:
> Since the GA seemed to think it needed these low negatives to get the best
> results, I think in the long term what we'll need are a greater variety of nice
> rules, so the negative scores can be divided among them with no single rule
> making an easy whitelist tool for spammers.

Well, I can tell you exactly where these numbers came from; it's pre-evolve... (my comments are the lines starting in the first column):

  # 0.0  = -limit [......] 0 ........ limit
  # 0.25 = -limit ....[... 0 ]....... limit
  # 0.5  = -limit ......[. 0 .]...... limit   (note: tighter)
  # 0.75 = -limit .......[ 0 ...].... limit
  # 1.0  = -limit ........ 0 [......] limit
  my $shrinking_window_lower_base  = 0.00;
  my $shrinking_window_lower_range = 1.00;  # *ratio, added to above
  my $shrinking_window_size_base   = 1.00;
  my $shrinking_window_size_range  = 1.00;  # *ratio, added to above

So the scores by default will get a range of 0-1 (rank 0) to 1-3 (rank 1).

  my $tflags = $rules{$test}->{tflags};
  $tflags ||= '';

  if ( $is_nice{$test} && ( $ranking < .5 ) ) {  # proper nice rule
    $lo *= 2.2 if ( $soratio <= 0.05 && $nonspam > 0.4 );  # let good rules be larger if they want to

This is where the -6.6 values come from.

    $hi = ($soratio == 0)                       ? $lo      :
          ($soratio <= 0.005)                   ? $lo/1.1  :
          ($soratio <= 0.010 && $nonspam > 0.2) ? $lo/2.0  :
          ($soratio <= 0.025 && $nonspam > 1.5) ? $lo/10.0 : 0;

For good S/O ratios and hit percentages, we specify that the upper limit should be less than 0 since we know the rule is good. Otherwise (the S/O ratio is above 0.025), the score can go to 0 if the GA wants it to.

    if ( $soratio >= 0.35 ) {  # auto-disable bad rules
      ($lo,$hi) = (0,0);
    }
  }
  elsif ( !$is_nice{$test} && ( $ranking >= .5 ) ) {  # proper spam rule
    $hi *= 1.5 if ( $soratio >= 0.99 && $spam > 1.0 );  # let good rules be larger if they want to

This is where the 4.5 values come from.

    $lo = ($soratio == 1)                     ? $hi      :
          ($soratio >= 0.995)                 ? $hi/4.0  :
          ($soratio >= 0.990 && $spam > 1.0)  ? $hi/8.0  :
          ($soratio >= 0.900 && $spam > 10.0) ? $hi/24.0 : 0;

Same deal as above, except we're less strict with the ratios here.

    if ( $soratio <= 0.65 ) {  # auto-disable bad rules
      ($lo,$hi) = (0,0);
    }
  }
  else {  # rule that has bad nice setting
    ($lo,$hi) = (0,0);

i.e.: if it's a nice rule but its ranking falls on the spam side (>= .5), or vice versa, auto-disable.

  }
  $mutable = 0 if ( $hi == $lo );

So if there is no range, don't let the GA try to mutate the score.

And if you're interested in the RANK -> score calculation:

  sub shrinking_window_ratio_to_range {
    my $ratio = shift;
    my $is_nice = 0;

    my $adjusted = ($ratio - .5) * 2;  # adj [0,1] to [-1,1]
    if ($adjusted < 0) {
      $is_nice  = 1;
      $adjusted = -$adjusted;
    }

    my $lower = $shrinking_window_lower_base + ($shrinking_window_lower_range * $adjusted);
    my $range = $shrinking_window_size_base  + ($shrinking_window_size_range  * $adjusted);

    my $lo = $lower;
    my $hi = $lower + $range;

    if ($is_nice) {
      my $tmp = $hi;
      $hi = -$lo;
      $lo = -$tmp;
    }

    if ($lo > $hi) {  # ???
      ($lo,$hi) = ($hi,$lo);
    }

    ($lo, $hi);
  }
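To make the -6.6 derivation concrete, here's a standalone re-run of that calculation for a perfectly nice rule (rank 0.0, S/O ratio 0), with the window constants inlined; this is just a sketch of the code quoted above:

  #!/usr/bin/perl
  # Re-derive the -6.6 floor: widest nice window times the 2.2 strong-nice boost.
  use strict;
  use warnings;

  sub shrinking_window_ratio_to_range {
      my $ratio = shift;
      my ($lower_base, $lower_range, $size_base, $size_range) = (0.0, 1.0, 1.0, 1.0);

      my $adjusted = ($ratio - 0.5) * 2;      # map rank [0,1] to [-1,1]
      my $is_nice  = $adjusted < 0;
      $adjusted = -$adjusted if $is_nice;

      my $lo = $lower_base + $lower_range * $adjusted;
      my $hi = $lo + $size_base + $size_range * $adjusted;
      ($lo, $hi) = (-$hi, -$lo) if $is_nice;  # mirror the window for nice rules
      return ($lo, $hi);
  }

  my ($lo, $hi) = shrinking_window_ratio_to_range(0.0);  # perfectly nice rule
  printf "base range: [%.1f, %.1f]\n", $lo, $hi;         # [-3.0, -1.0]
  printf "boosted floor: %.1f\n", $lo * 2.2;             # -6.6

So the -6.6 floors in the reporter's list are exactly the widest nice window (-3) times the 2.2 strong-nice multiplier, just as the corresponding 4.5 ceilings on the spam side are 3 times the 1.5 multiplier.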
FYI: for 2.6, I've lowered the boost for nice rules to be the same as for the non-nice rules, so both positive and negative rules will have an abs(max) of 4.5. However, I'm more interested in seeing forged mails. I'm working on redoing all the compensation rules to be less forgeable (make sure X-Mailer and Message-IDs match the correct format, make sure mailers like Pine aren't sending HTML-only mails, etc.). The work is mostly just things I could come up with offhand, but if I had more input data, I'd have a better time crafting rules. :) Feel free to attach some to this bug.
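As an illustration of the mailer/format cross-check idea, here's a hypothetical rule in SpamAssassin config syntax. The subrule names, the X-Mailer pattern, and the score are all invented for the example; the real 2.6 anti-forgery rules may look quite different:

  # Hypothetical: mail that claims a text-mode MUA but is HTML-only
  header   __XM_CLAIMS_PINE  X-Mailer =~ /\bPine\b/i
  header   __CT_HTML_ONLY    Content-Type =~ /^text\/html/i
  meta     FORGED_MUA_HTML   (__XM_CLAIMS_PINE && __CT_HTML_ONLY)
  describe FORGED_MUA_HTML   Claims a text-only mailer but body is HTML-only
  score    FORGED_MUA_HTML   2.0

The same pattern extends to Message-ID formats: a subrule matching a mailer's known Message-ID shape, combined in a meta rule with the X-Mailer check, is much harder to forge than either header alone.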
Created attachment 771 [details] Sample High Ranking SPAM
Created attachment 772 [details] Sample High Ranking SPAM 2
*** Bug 1647 has been marked as a duplicate of this bug. ***
*** Bug 1686 has been marked as a duplicate of this bug. ***
*** Bug 1693 has been marked as a duplicate of this bug. ***
Created attachment 815 [details] Message with forged USER_AGENT_VM.
quinlan and I have been chatting, and are planning to rerun the GA to lower the nice scores for 2.54, so I'm using this bug as a placeholder. the GA run won't be 100% perfect since the nice scoring will affect autolearning, and therefore bayes stats, but I think the results we currently have are close enough to not matter in this case. I may also add in a "too many mua" rule since that doesn't require any new mass-checks. I'll probably do this up sometime next week.
ok, here are some thoughts about the score changes for the 2.54/2.60 run:

- remove the 2.2x multiplier for strong nice rules
- add a 1.7x multiplier for all 'learn' rules
- add a new tflag (haven't decided on a name yet, but related to confidence) that lets some nice rules (HABEAS_SWE, RCVD_IN_BONDEDSENDER, EVITE, etc.) get a multiplier. That's for rules we know are either very unlikely to be forged or which have legal teeth behind them. Say a multiplier of 2x?

This will let BAYES_* go from -5.1 to 5.1. Nice rules will be limited to -3 to 0 (the upper part depends on the S/O ratio); nice rules with the new tflag will go from -6 to 0 (ditto); spam rules will be limited to 0 to 4.5 (the lower part depends on the S/O ratio). A sketch of how the limits fall out is below.

what do you think?
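A minimal sketch of how those multipliers land on the base -3..0 / 0..3 windows. The tflag name 'confhigh' is a placeholder since none had been picked, and the base windows and multipliers are taken from the proposal above:

  #!/usr/bin/perl
  # Sketch of the proposed 2.54/2.60 score limits per rule class.
  use strict;
  use warnings;

  sub score_limits {
      my ($is_nice, $tflags) = @_;
      my ($lo, $hi) = $is_nice ? (-3, 0) : (0, 3);    # widest base window
      my $mult = $tflags =~ /\blearn\b/    ? 1.7      # BAYES_* and friends
               : $tflags =~ /\bconfhigh\b/ ? 2.0      # hard-to-forge / legal teeth
               : $is_nice                  ? 1.0      # 2.2x strong-nice boost removed
               :                             1.5;     # spam-side boost unchanged
      return ($lo * $mult, $hi * $mult);
  }

  printf "BAYES_* (nice):  [%.1f, %.1f]\n", score_limits(1, 'learn');     # [-5.1, 0.0]
  printf "BAYES_* (spam):  [%.1f, %.1f]\n", score_limits(0, 'learn');     # [0.0, 5.1]
  printf "nice + confhigh: [%.1f, %.1f]\n", score_limits(1, 'confhigh');  # [-6.0, 0.0]
  printf "plain nice:      [%.1f, %.1f]\n", score_limits(1, '');          # [-3.0, 0.0]
  printf "plain spam:      [%.1f, %.1f]\n", score_limits(0, '');          # [0.0, 4.5]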
Although this may be a long-term goal, it might be nice if we modified SpamAssassin to cap the total contribution of negative tests at -5 points (see the sketch below). That way, forging many nice tests would not let spammers go nuts offsetting spam signs. I also feel that BAYES_* should be kept between -4.9 and 4.9.
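A minimal sketch of that cap, assuming the per-rule hits and scores are already in hand; the hit set is illustrative and the -5 floor is the number proposed above:

  #!/usr/bin/perl
  # Sketch: clamp the summed contribution of negative-scoring hits at -5.
  use strict;
  use warnings;
  use List::Util qw(sum0 max);

  my %hits = (                       # rule => score, an illustrative hit set
      SOME_SPAM_SIGN    =>  4.2,
      ANOTHER_SPAM_SIGN =>  3.1,
      EMAIL_ATTRIBUTION => -6.5,     # forged
      REPLY_WITH_QUOTES => -6.4,     # forged
  );

  my $pos = sum0 grep { $_ > 0 } values %hits;
  my $neg = sum0 grep { $_ < 0 } values %hits;
  $neg = max($neg, -5);              # negative rules can never buy more than -5

  printf "score: %.1f (uncapped would be %.1f)\n", $pos + $neg, sum0 values %hits;
  # score: 2.3 (uncapped would be -5.6)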
Created attachment 899 [details] Forged Message-ID and User-Agent hitting two negative rules
*** Bug 1798 has been marked as a duplicate of this bug. ***
*** Bug 1538 has been marked as a duplicate of this bug. ***
Created attachment 912 [details] patch to score-ranges to deal with new multipliers and "confidence" levels
Created attachment 913 [details] patch to rule files to specify "confidence"
Created attachment 914 [details] patch to 50_scores.cf with new GA run given other attached patches
*** Bug 1811 has been marked as a duplicate of this bug. ***
*** Bug 1815 has been marked as a duplicate of this bug. ***
wow, no comments on the scores and such?
*** Bug 1793 has been marked as a duplicate of this bug. ***
Coincidentally, I've just started noticing some problems with this over the past week or so. It seems as though the spammers are starting to wise up to and exploit some of the negative-scoring tests. I've had some complaints about this from some of my users and have confirmed it for myself: I've gotten some spam messages that are not only below the threshold, but actually have negative scores. They seem to be faking out the IN_REP_TO and MSGID_GOOD_EXCHANGE tests most often. For now I've moved the scores for these two tests closer to zero to see what impact that has (IN_REP_TO to -0.5 and MSGID_GOOD_EXCHANGE to -1.0).
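For anyone wanting to try the same mitigation locally, score overrides belong in the site config; the -0.5 and -1.0 values are just the ones chosen above, not a recommendation:

  # /etc/mail/spamassassin/local.cf -- site overrides win over 50_scores.cf
  score IN_REP_TO           -0.5
  score MSGID_GOOD_EXCHANGE -1.0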
*** Bug 1837 has been marked as a duplicate of this bug. ***
Sorry I've been avoiding commenting on this one; I wanted to, but it took some time to collect my thoughts. There are several things that might help here:

1. Would it make sense to do a stand-alone Bayes pass on the headers only? That could replace a lot of the pre-scored (and thus exploitable) nice tests on headers with real-time adaptable Bayesian handling that's unique per user or at most per site. It would also allow a user to score ham very low based on simple things like the presence of certain headers, in a safe and adaptable way.

2. If you're thinking tflag thoughts for constraining the evolved scores, consider weighting them by how easily they can be forged. There are three levels of forgeability that I look for: a) just add a header or simple tag; b) interaction between headers and formats, like X-Mailer + Message-ID signature (harder to forge, but very doable); c) unforgeable, which includes most local Received header tests or, as others have pointed out, those that have legal teeth (perhaps a fourth class for those?).

3. Is there some way that the scores could be updated on the fly on a per-user basis? Perhaps keeping the original score and then applying some scaling factor based on abuse or high accuracy? I dunno, that kind of ends up being idea #1, but since there seems to be a feeling that Bayes isn't for everyone (still not quite sure why), this might be a good middle ground.

Long-term (VERY long-term), I wonder if evolving #3 into a replacement for scoring entirely would make sense. You could easily have a statistics package that uses the GA's scoring as a starting set of weights, and then "learns" based on each new message's "SA-tokens". To do this, you would probably want to tokenize the entire message into all of the tokens from the header (something like "token:headertext:hits" for the word "hits", plus some special annotations like "token:headerseen:References" and "token:toaddr:ajs@ajs.com"), plus some abstract tokens from the body (all of the (raw|)body tests that matched become a single token, including body-Bayes, plus some select tokens that get pulled out, like "token:raw:MIMETYPE:text/html"). A sketch of this tokenization appears below.

As far as performance goes, I don't think this would be substantially slower than what SA does now, but in terms of accuracy it has the same advantages as Razor2 and Bayes in that it's constantly evolving on a per-user basis, and the (spammer-accessible) GA scoring is just a starting point that everyone will Brownian away from pretty quickly into their own dialect of SA-scores.

Mind you, this is all based on a discussion with the very bright author of DSPAM, Jonathan A. Zdziarski <jonathan@networkdweebs.com>, and I cannot fully take credit for the ideas here. We argued back and forth about the merits of a pure-Bayes approach vs. SA's approach, and while we still disagree on some fundamental points, I think we both agree that a Bayesian-style learning system avoids many of the problems introduced by static scoring, but there's still a HUGE benefit to specialized tests. He calls these "tokenized rules", which is certainly a valid term.

If there's a consensus that this is interesting (even if you have grave doubts about how well it would work or how it would compare to current SA scoring), I'd be happy to go off and work on this in isolation and come back with a prototype that we can look at in the hard light of real-world mail.
I'm happy to work on this because it has the potential to address one of my largest concerns: ISPs and medium-to-large businesses may want to remove dozens of expensive tests in order to increase SA performance (and stop buying hardware to support it), but with the current scoring system you then have to manually re-weight scores or run the GA yourself in order to adapt to the lack of input from those tests (which is still a static result and requires you to maintain a large corpus).

With a feedback learning system in place for top-level scores, we could safely offer command-line switches that limit the tests used to a specific list of tclasses, without having to create new score categories for every permutation. (And I agree with the earlier statement by one of the developers, I forget who, that categories are a bit of a hack that doesn't fit in terribly well.) This would have the potential to create those categories on the fly, as long as a given site continued to run with the same switches.

Sorry for my usual verbosity. It's just the way I communicate (especially, for odd reasons I won't go into, in the mornings).
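To make the "SA-tokens" idea from point #3 concrete, here's a minimal Perl sketch of the tokenization step. The %headers hash and @rule_hits list stand in for whatever the real scanner would provide; the token spellings follow the examples above, and nothing here is an actual SA internal:

  #!/usr/bin/perl
  # Sketch only: turn headers plus matched rule names into Bayes-style tokens.
  use strict;
  use warnings;

  my %headers = (
      'References' => '<abc@example.com>',
      'To'         => 'ajs@ajs.com',
      'Subject'    => 'lots of hits here',
  );
  my @rule_hits = qw(RAZOR2_CHECK HTML_MESSAGE);   # body/raw tests that fired

  my @tokens;
  for my $h (sort keys %headers) {
      push @tokens, "token:headerseen:$h";         # presence of the header
      push @tokens, map { "token:headertext:$_" }
                    split /\s+/, $headers{$h};     # each word in the header
  }
  push @tokens, "token:toaddr:$headers{To}";
  push @tokens, map { "token:rule:$_" } @rule_hits;  # abstract body-test tokens

  print "$_\n" for @tokens;
  # These tokens would then feed a per-user Bayesian learner, with the GA
  # scores only seeding the initial weights.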
Subject: Re: [SAdev] Some negative scores too low in 2.50 evolved scores

bugzilla-daemon@hughes-family.org writes:
> 1. Would it make sense to do a stand-alone Bayes pass on the headers
> only? That could replace a lot of the pre-scored (and thus exploitable)
> nice tests on headers with real-time adaptable Bayesian handling that's
> unique per user or at most per site. It would also allow a user to score
> ham very low based on simple things like the presence of certain headers,
> in a safe and adaptable way.

The nice tests on headers are gone. We've essentially already replaced them with Bayes. The question is not about them; it's about whether this idea would make Bayes more accurate or not. Test it and find out.

> 2. If you're thinking tflag thoughts for constraining the evolved scores,
> consider weighting them by how easily they can be forged. There are three
> levels of forgeability that I look for: a) just add a header or simple
> tag; b) interaction between headers and formats, like X-Mailer +
> Message-ID signature (harder to forge, but very doable); c) unforgeable,
> which includes most local Received header tests or, as others have
> pointed out, those that have legal teeth (perhaps a fourth class for
> those?).

I think we just want to drop anything remotely forgeable. The tflags ideas we've discussed were more like temporary workarounds for 2.5x. I don't think we want any forgeable tests in 2.60. If they had a really low score, (a) they wouldn't be very effective anyway and (b) people would still complain a lot -- it's not worth the hassle.

I think #3 is a more involved discussion and should be a separate bug or a mailing list discussion.
Subject: Re: [SAdev] Some negative scores too low in 2.50 evolved scores

1. I don't see how this differs much from what we do now.

2. Levels 1 and 2 will quickly become essentially useless. I've thought about it for a bit, and it really seems like nice tests are pretty much impossible as long as we are so popular :-)

3. I've often thought about using a Bayes-type system for scoring based on rules hit, allowing for real-time changes in scoring, etc. Perhaps this won't be as much of an issue when/if we get the rules hit mixed into the Bayes system. We'll see.
OKAY: attachments 912-914.

How did you do your GA run? Don't we really need three mass-checks to get appropriate results? I suppose using the data from the 2.50 checks is probably okay for our purposes.

Have these been checked into HEAD?
Theo, I am testing out the new scores vs. 2.5x on my corpus, which has been updated since 2.50. I think we might want to consider committing only the scores patch (plus the new anti-forgery rules), but not the score-ranges code or tflags changes.

Duncan, none of these patches are going into HEAD. They're 2.54-only. 2.60 only has a few nice rules, and all are hard to forge.
Subject: Re: [SAdev] Some negative scores too low in 2.50 evolved scores

> Duncan, none of these patches are going into HEAD. They're 2.54-only.
> 2.60 only has a few nice rules and all are hard to forge.

Right. That had slipped my mind when I asked. The scores should be committed to HEAD though, to support all us insane people that like to run 2.60-cvs on real mail :-)
I'm just the messenger.

2.5x branch current:

# SUMMARY for threshold 5.0:
# Correctly non-spam:  17693  59.51%  (99.98% of non-spam corpus)
# Correctly spam:      10554  35.50%  (87.69% of spam corpus)
# False positives:         4   0.01%  (0.02% of nonspam, 156 weighted)
# False negatives:      1482   4.98%  (12.31% of spam, 4368 weighted)
# TCR: 8.013316  SpamRecall: 87.687%  SpamPrec: 99.962%  FP: 0.01%  FN: 4.98%

with patches 913 and 914:

# SUMMARY for threshold 5.0:
# Correctly non-spam:  17692  59.50%  (99.97% of non-spam corpus)
# Correctly spam:      10431  35.08%  (86.67% of spam corpus)
# False positives:         5   0.02%  (0.03% of nonspam, 192 weighted)
# False negatives:      1605   5.40%  (13.33% of spam, 4968 weighted)
# TCR: 7.384049  SpamRecall: 86.665%  SpamPrec: 99.952%  FP: 0.02%  FN: 5.40%
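For readers decoding the summary lines: the figures are consistent with TCR = nspam / (lambda*FP + FN) with false positives weighted at lambda = 5; that weighting is inferred from the numbers, not stated in the output. A quick sanity check:

  #!/usr/bin/perl
  # Verify the TCR lines above; lambda = 5 is inferred from the figures.
  use strict;
  use warnings;

  my $lambda = 5;
  for ([10554, 4, 1482], [10431, 5, 1605]) {    # (caught spam, FPs, FNs)
      my ($caught, $fp, $fn) = @$_;
      my $nspam = $caught + $fn;                # total spam corpus: 12036
      printf "TCR: %.6f\n", $nspam / ($lambda * $fp + $fn);
  }
  # prints 8.013316 and 7.384049, matching the two summaries

So the patched run trades one extra FP and 123 extra FNs for the forgery resistance the new limits buy, which is why the corpus age flagged in the next comment matters.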
Well, I should be fair -- my corpus is 1.5 months old now. I need to update it. Might still be a win.
<quinlan> felicity: with Bayes, went from (for 158 of my hard spam previously missed by SA 2.5x) 13 with negative scores to 0
<felicity> quinlan, so that's good. they're not negative now. :)
<quinlan> old average = 4.47, std = 4.3, new average = 5.54, std = 3.54
<quinlan> and 64 FPs instead of 94
<quinlan> that was bayes w/o net

So, I'd say this:

OKAY: new scores
OKAY: new TOO_MANY_MUA rule
OKAY: bug fixes only for GA code
ISSUE: I don't want new tflags and related changes to GA code to be checked into 2.54, reasoning below:

<quinlan> I think (a) if people are doing their own GA run, there's no reason to constrain it (since abuse by spammers is harder) and (b) less clues for spammers still figuring this out and (c) ewww
ok, I applied the scores and new rule to 2.54. still need to generate the new STATISTICS* files, but evolve is segfaulting on me, so it may be a bit. :(
fixed evolve, generated statistics files, committed to stable.
*** Bug 1870 has been marked as a duplicate of this bug. ***
*** Bug 1898 has been marked as a duplicate of this bug. ***