SA Bugzilla – Bug 1589
Some negative scores too low in 2.50 evolved scores
Last modified: 2003-05-11 20:41:35 UTC
A bunch of the 'nice' rules in 2.50 have scores of -5 to -6.6. While they produced good results in the GA runs, I think they're likely to cause false negatives in the real world, and I've seen at least two or three of them (SIGNATURE_LONG_SPARSE, EMAIL_ATTRIBUTION, ?) discussed as the causes of FNs already. These are also going to present a very tempting way for spammers to get messages through.

I think the low limit should have been the opposite of the high limit. For example, BAYES_10 subtracts 6.4 points in set 2, but BAYES_90 only adds 4.1. This seems lopsided given that they (presumably) have the same confidence level for spam or nonspam.

I've listed all of the rules with scores of -5 or lower below. Some rules like HABEAS and BONDEDSENDER are trusted, so I didn't include them. If we end up with trust or confidence ratings for rules, some of the more-difficult-to-forge ones below could have lower limits too.

Since the GA seemed to think it needed these low negatives to get the best results, I think what we'll need in the long term is a greater variety of nice rules, so the negative scores can be divided among them with no single rule making an easy whitelist tool for spammers.

score BUGZILLA_BUG -6.400 -6.300 -2.900 -6.300
score CRON_ENV -6.400 -6.300 -5.701 -5.701
score EMAIL_ATTRIBUTION -6.600 -6.500 -6.500 -6.500
score GROUPS_YAHOO_1 -5.801
score MSGID_GOOD_EXCHANGE -5.801 -5.701 -5.701 -5.701
score PGP_SIGNATURE -6.400 -6.300 -5.701 -5.701
score SIGNATURE_LONG_SPARSE -5.801 -5.801 -3.101 -5.801
score SIGNATURE_LONG_DENSE -6.400 -6.300 -6.300 -6.300
score USER_AGENT_ENTOURAGE 0 0 0 -5.701
score USER_AGENT_PINE -5.801 -5.801 -5.701 -5.701
score USER_AGENT_VM -5.801 -5.701 -5.701 -5.701
score DEBIAN_BTS_BUG -5.701 -5.801 0 -2.900
score USER_AGENT_GNUS_UA -6.400 -6.300 -2.900 -6.300
score USER_AGENT_KMAIL -5.800 -5.801 -6.300 -6.400
score USER_AGENT_MOZILLA_UA -5.801 -5.800 -5.701 -6.300
score USER_AGENT_MUTT -6.400 -6.400 -6.300 -6.300
score USER_AGENT_XIMIAN -6.400 -6.300 -6.300 -6.300
score ORIGINAL_MESSAGE -3.101 -3.101 -6.300 -6.300
score PATCH_UNIFIED_DIFF -6.027 -6.027 -2.900 -6.300
score PGP_SIGNATURE_2 -6.400 -6.300 -6.300 -6.300
score REFERENCES -6.600 -6.600 -6.500 -6.500
score REPLY_WITH_QUOTES -6.600 -6.500 -6.400 -6.500
score BAYES_00 0 0 -6.400 -6.400
score BAYES_01 0 0 -6.600 -6.600
score BAYES_10 0 0 -6.400 -5.801
score BAYES_20 0 0 -5.801 -3.101

Note: In the worst case, these scores are low enough not only to cause a false negative but also to cause Bayes to auto-train the message as ham, so I think it's really important to tone them down.
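To make the auto-training concern concrete, here's a minimal worked example in Perl. The 7.5 content score is invented, the two negative values are the set-2 scores from the list above, and 0.1 is an assumed ham auto-learn threshold, so treat this as a sketch rather than real corpus data:

  #!/usr/bin/perl
  # Illustrative arithmetic only: a strong spam loses to two forged nice rules.
  use strict;
  use warnings;

  my $content_score = 7.5;        # hypothetical spam, well over the 5.0 threshold
  my %forged_nice = (
      EMAIL_ATTRIBUTION => -6.5,  # set-2 score from the list above
      REPLY_WITH_QUOTES => -6.4,  # set-2 score from the list above
  );

  my $total = $content_score;
  $total += $_ for values %forged_nice;

  printf "final score: %.1f\n", $total;                # -5.4: a false negative
  print  "would auto-learn as ham\n" if $total < 0.1;  # assumed threshold

Forging just two nice rules takes a clear spam from 7.5 down to -5.4, which is not only a false negative but also, under the assumed threshold, ham-training material for Bayes.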
Incidentally, I looked at my false negatives since upgrading to 2.50 (only four of them!) and two of them are due to BAYES_00 and BAYES_01 having such low scores. This is despite training on thousands of messages - so for those who have Bayes kick in after auto-training on only 400, I expect this will be even more of a problem. (Of course, none of this is that big a deal, a few FNs is far better than worrying about FPs.)
Subject: Re: [SAdev] New: Some negative scores too low in 2.50 evolved scores

On Mon, Mar 03, 2003 at 02:47:42AM -0800, bugzilla-daemon@hughes-family.org wrote:
> Since the GA seemed to think it needed these low negatives to get the best
> results, I think in the long term what we'll need are a greater variety of nice
> rules, so the negative scores can be divided among them with no single rule
> making an easy whitelist tool for spammers.

Well, I can tell you exactly where these numbers came from; it's pre-evolve... (my comments are the lines starting in the first column):

  # 0.0  = -limit [......] 0 ........ limit
  # 0.25 = -limit ....[... 0 ]....... limit
  # 0.5  = -limit ......[. 0 .]...... limit   (note: tighter)
  # 0.75 = -limit .......[ 0 ...].... limit
  # 1.0  = -limit ........ 0 [......] limit
  my $shrinking_window_lower_base  = 0.00;
  my $shrinking_window_lower_range = 1.00;  # *ratio, added to above
  my $shrinking_window_size_base   = 1.00;
  my $shrinking_window_size_range  = 1.00;  # *ratio, added to above

So the scores by default will get a range of 0-1 (rank 0) to 1-3 (rank 1).

  my $tflags = $rules{$test}->{tflags};
  $tflags ||= '';

  if ( $is_nice{$test} && ( $ranking < .5 ) ) {  # proper nice rule
    $lo *= 2.2 if ( $soratio <= 0.05 && $nonspam > 0.4 );  # let good rules be larger if they want to

This is where the -6.6 values come from.

    $hi = ($soratio == 0)                       ? $lo      :
          ($soratio <= 0.005)                   ? $lo/1.1  :
          ($soratio <= 0.010 && $nonspam > 0.2) ? $lo/2.0  :
          ($soratio <= 0.025 && $nonspam > 1.5) ? $lo/10.0 : 0;

For good S/O ratios and hit percentages, we specify that the upper limit should be less than 0 since we know the rule is good. Otherwise (the S/O ratio is above 0.025), the score can go to 0 if the GA wants it to.

    if ( $soratio >= 0.35 ) {  # auto-disable bad rules
      ($lo,$hi) = (0,0);
    }
  }
  elsif ( !$is_nice{$test} && ( $ranking >= .5 ) ) {  # proper spam rule
    $hi *= 1.5 if ( $soratio >= 0.99 && $spam > 1.0 );  # let good rules be larger if they want to

This is where the 4.5 values come from.

    $lo = ($soratio == 1)                     ? $hi      :
          ($soratio >= 0.995)                 ? $hi/4.0  :
          ($soratio >= 0.990 && $spam > 1.0)  ? $hi/8.0  :
          ($soratio >= 0.900 && $spam > 10.0) ? $hi/24.0 : 0;

Same deal as above, except we're less strict with the ratios here.

    if ( $soratio <= 0.65 ) {  # auto-disable bad rules
      ($lo,$hi) = (0,0);
    }
  }
  else {  # rule that has bad nice setting
    ($lo,$hi) = (0,0);

i.e.: if it's a nice rule but its ranking falls on the spam side (>= .5), or vice versa, auto-disable.

  }
  $mutable = 0 if ( $hi == $lo );

So if there is no range, don't let the GA try to mutate the score.

And if you're interested in the RANK -> score calculation:

  sub shrinking_window_ratio_to_range {
    my $ratio = shift;
    my $is_nice = 0;

    my $adjusted = ($ratio - .5) * 2;  # adj [0,1] to [-1,1]
    if ($adjusted < 0) {
      $is_nice  = 1;
      $adjusted = -$adjusted;
    }

    my $lower = $shrinking_window_lower_base + ($shrinking_window_lower_range * $adjusted);
    my $range = $shrinking_window_size_base  + ($shrinking_window_size_range  * $adjusted);

    my $lo = $lower;
    my $hi = $lower + $range;

    if ($is_nice) {
      my $tmp = $hi;
      $hi = -$lo;
      $lo = -$tmp;
    }

    if ($lo > $hi) {  # ???
      ($lo,$hi) = ($hi,$lo);
    }

    ($lo, $hi);
  }
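To make the -6.6 derivation concrete, here's a standalone re-run of that calculation for a perfectly nice rule (rank 0.0, S/O ratio 0), with the window constants inlined; this is just a sketch of the code quoted above:

  #!/usr/bin/perl
  # Re-derive the -6.6 floor: widest nice window times the 2.2 strong-nice boost.
  use strict;
  use warnings;

  sub shrinking_window_ratio_to_range {
      my $ratio = shift;
      my ($lower_base, $lower_range, $size_base, $size_range) = (0.0, 1.0, 1.0, 1.0);

      my $adjusted = ($ratio - 0.5) * 2;      # map rank [0,1] to [-1,1]
      my $is_nice  = $adjusted < 0;
      $adjusted = -$adjusted if $is_nice;

      my $lo = $lower_base + $lower_range * $adjusted;
      my $hi = $lo + $size_base + $size_range * $adjusted;
      ($lo, $hi) = (-$hi, -$lo) if $is_nice;  # mirror the window for nice rules
      return ($lo, $hi);
  }

  my ($lo, $hi) = shrinking_window_ratio_to_range(0.0);  # perfectly nice rule
  printf "base range: [%.1f, %.1f]\n", $lo, $hi;         # [-3.0, -1.0]
  printf "boosted floor: %.1f\n", $lo * 2.2;             # -6.6

So the -6.6 floors in the reporter's list are exactly the widest nice window (-3) times the 2.2 strong-nice multiplier, just as the corresponding 4.5 ceilings on the spam side are 3 times the 1.5 multiplier.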
FYI: for 2.6, I've lowered the boost for nice rules to be the same as for the non-nice rules, so both positive and negative rules will have an abs(max) of 4.5. However, I'm more interested in seeing forged mails. I'm working on redoing all the compensation rules to be less forgeable (make sure X-Mailer and Message-IDs match the correct format, make sure mailers like Pine aren't sending HTML-only mails, etc.). The work is mostly just things I could come up with offhand, but if I had more input data, I'd have a better time crafting rules. :) Feel free to attach some to this bug.
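As an illustration of the mailer/format cross-check idea, here's a hypothetical rule in SpamAssassin config syntax. The subrule names, the X-Mailer pattern, and the score are all invented for the example; the real 2.6 anti-forgery rules may look quite different:

  # Hypothetical: mail that claims a text-mode MUA but is HTML-only
  header   __XM_CLAIMS_PINE  X-Mailer =~ /\bPine\b/i
  header   __CT_HTML_ONLY    Content-Type =~ /^text\/html/i
  meta     FORGED_MUA_HTML   (__XM_CLAIMS_PINE && __CT_HTML_ONLY)
  describe FORGED_MUA_HTML   Claims a text-only mailer but body is HTML-only
  score    FORGED_MUA_HTML   2.0

The same pattern extends to Message-ID formats: a subrule matching a mailer's known Message-ID shape, combined in a meta rule with the X-Mailer check, is much harder to forge than either header alone.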
Created attachment 771 [details] Sample High Ranking SPAM
Created attachment 772 [details] Sample High Ranking SPAM 2
*** Bug 1647 has been marked as a duplicate of this bug. ***
*** Bug 1686 has been marked as a duplicate of this bug. ***
*** Bug 1693 has been marked as a duplicate of this bug. ***
Created attachment 815 [details] Message with forged USER_AGENT_VM.
quinlan and I have been chatting, and are planning to rerun the GA to lower the nice scores for 2.54, so I'm using this bug as a placeholder. the GA run won't be 100% perfect since the nice scoring will affect autolearning, and therefore bayes stats, but I think the results we currently have are close enough to not matter in this case. I may also add in a "too many mua" rule since that doesn't require any new mass-checks. I'll probably do this up sometime next week.
ok, here are some thoughts about the score changes for the 2.54/2.60 run:

- remove the 2.2x multiplier for strong nice rules
- add a 1.7x multiplier for all 'learn' rules
- add a new tflag (haven't decided on a name yet, but related to confidence) that lets some nice rules (HABEAS_SWE, RCVD_IN_BONDEDSENDER, EVITE, etc.) get a multiplier. That's for rules we know are either very unlikely to be forged or which have legal teeth behind them. Say a multiplier of 2x?

This will let BAYES_* go from -5.1 to 5.1. Nice rules will be limited to -3 to 0 (the upper part depends on the S/O ratio); nice rules with the new tflag will go from -6 to 0 (ditto); spam rules will be limited to 0 to 4.5 (the lower part depends on the S/O ratio). A sketch of how the limits fall out is below.

what do you think?
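A minimal sketch of how those multipliers land on the base -3..0 / 0..3 windows. The tflag name 'confhigh' is a placeholder since none had been picked, and the base windows and multipliers are taken from the proposal above:

  #!/usr/bin/perl
  # Sketch of the proposed 2.54/2.60 score limits per rule class.
  use strict;
  use warnings;

  sub score_limits {
      my ($is_nice, $tflags) = @_;
      my ($lo, $hi) = $is_nice ? (-3, 0) : (0, 3);    # widest base window
      my $mult = $tflags =~ /\blearn\b/    ? 1.7      # BAYES_* and friends
               : $tflags =~ /\bconfhigh\b/ ? 2.0      # hard-to-forge / legal teeth
               : $is_nice                  ? 1.0      # 2.2x strong-nice boost removed
               :                             1.5;     # spam-side boost unchanged
      return ($lo * $mult, $hi * $mult);
  }

  printf "BAYES_* (nice):  [%.1f, %.1f]\n", score_limits(1, 'learn');     # [-5.1, 0.0]
  printf "BAYES_* (spam):  [%.1f, %.1f]\n", score_limits(0, 'learn');     # [0.0, 5.1]
  printf "nice + confhigh: [%.1f, %.1f]\n", score_limits(1, 'confhigh');  # [-6.0, 0.0]
  printf "plain nice:      [%.1f, %.1f]\n", score_limits(1, '');          # [-3.0, 0.0]
  printf "plain spam:      [%.1f, %.1f]\n", score_limits(0, '');          # [0.0, 4.5]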
Although this may be a long-term goal, it might be nice if we modified SpamAssassin to cap the total contribution of negative tests at -5 points (see the sketch below). That way, forging many nice tests would not let spammers go nuts offsetting spam signs. I also feel that BAYES_* should be kept between -4.9 and 4.9.
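A minimal sketch of that cap, assuming the per-rule hits and scores are already in hand; the hit set is illustrative and the -5 floor is the number proposed above:

  #!/usr/bin/perl
  # Sketch: clamp the summed contribution of negative-scoring hits at -5.
  use strict;
  use warnings;
  use List::Util qw(sum0 max);

  my %hits = (                       # rule => score, an illustrative hit set
      SOME_SPAM_SIGN    =>  4.2,
      ANOTHER_SPAM_SIGN =>  3.1,
      EMAIL_ATTRIBUTION => -6.5,     # forged
      REPLY_WITH_QUOTES => -6.4,     # forged
  );

  my $pos = sum0 grep { $_ > 0 } values %hits;
  my $neg = sum0 grep { $_ < 0 } values %hits;
  $neg = max($neg, -5);              # negative rules can never buy more than -5

  printf "score: %.1f (uncapped would be %.1f)\n", $pos + $neg, sum0 values %hits;
  # score: 2.3 (uncapped would be -5.6)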
Created attachment 899 [details] Forged Message-ID and User-Agent hitting two negative rules
*** Bug 1798 has been marked as a duplicate of this bug. ***
*** Bug 1538 has been marked as a duplicate of this bug. ***
Created attachment 912 [details] patch to score-ranges to deal with new multipliers and "confidence" levels
Created attachment 913 [details] patch to rule files to specify "confidence"
Created attachment 914 [details] patch to 50_scores.cf with new GA run given other attached patches
*** Bug 1811 has been marked as a duplicate of this bug. ***
*** Bug 1815 has been marked as a duplicate of this bug. ***
wow, no comments on the scores and such?
*** Bug 1793 has been marked as a duplicate of this bug. ***
Coincidentally, I've just started noticing some problems with this over the past week or so. It seems as though the spammers are starting to wise up to and exploit some of the negative-scoring tests. I've had some complaints about this from some of my users and have confirmed it for myself: I've gotten some spam messages that are not only below the threshold, but actually have negative scores. They seem to be faking out the IN_REP_TO and MSGID_GOOD_EXCHANGE tests most often. For now I've moved the scores for these two tests closer to zero to see what impact that has (IN_REP_TO to -0.5 and MSGID_GOOD_EXCHANGE to -1.0).
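For anyone wanting to try the same mitigation locally, score overrides belong in the site config; the -0.5 and -1.0 values are just the ones chosen above, not a recommendation:

  # /etc/mail/spamassassin/local.cf -- site overrides win over 50_scores.cf
  score IN_REP_TO           -0.5
  score MSGID_GOOD_EXCHANGE -1.0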
*** Bug 1837 has been marked as a duplicate of this bug. ***
Sorry I've been avoiding commenting on this one; I wanted to, but it took some time to collect my thoughts. There are several things that might help here:

1. Would it make sense to do a stand-alone Bayes pass on the headers only? That could replace a lot of the pre-scored (and thus exploitable) nice tests on headers with real-time adaptable Bayesian handling that's unique per user or at most per site. It would also allow a user to score ham very low based on simple things like the presence of certain headers, in a safe and adaptable way.

2. If you're thinking tflag thoughts for constraining the evolved scores, consider weighting them by how easily they can be forged. There are three levels of forgeability that I look for: a) just add a header or simple tag; b) interaction between headers and formats, like X-Mailer + Message-ID signature (harder to forge, but very doable); c) unforgeable, which includes most local Received header tests or, as others have pointed out, those that have legal teeth (perhaps a fourth class for those?).

3. Is there some way that the scores could be updated on the fly on a per-user basis? Perhaps keeping the original score and then applying some scaling factor based on abuse or high accuracy? I dunno, that kind of ends up being idea #1, but since there seems to be a feeling that Bayes isn't for everyone (still not quite sure why), this might be a good middle ground.

Long-term (VERY long-term), I wonder if evolving #3 into a replacement for scoring entirely would make sense. You could easily have a statistics package that uses the GA's scoring as a starting set of weights, and then "learns" based on each new message's "SA-tokens". To do this, you would probably want to tokenize the entire message into all of the tokens from the header (something like "token:headertext:hits" for the word "hits", plus some special annotations like "token:headerseen:References" and "token:toaddr:ajs@ajs.com"), plus some abstract tokens from the body (all of the (raw|)body tests that matched become a single token, including body-Bayes, plus some select tokens that get pulled out, like "token:raw:MIMETYPE:text/html"). A sketch of this tokenization appears below.

As far as performance goes, I don't think this would be substantially slower than what SA does now, but in terms of accuracy it has the same advantages as Razor2 and Bayes in that it's constantly evolving on a per-user basis, and the (spammer-accessible) GA scoring is just a starting point that everyone will Brownian away from pretty quickly into their own dialect of SA-scores.

Mind you, this is all based on a discussion with the very bright author of DSPAM, Jonathan A. Zdziarski <jonathan@networkdweebs.com>, and I cannot fully take credit for the ideas here. We argued back and forth about the merits of a pure-Bayes approach vs. SA's approach, and while we still disagree on some fundamental points, I think we both agree that a Bayesian-style learning system avoids many of the problems introduced by static scoring, but there's still a HUGE benefit to specialized tests. He calls these "tokenized rules", which is certainly a valid term.

If there's a consensus that this is interesting (even if you have grave doubts about how well it would work or how it would compare to current SA scoring), I'd be happy to go off and work on this in isolation and come back with a prototype that we can look at in the hard light of real-world mail.
I'm happy to work on this because it has the potential to address one of my largest concerns: ISPs and medium-to-large businesses may want to remove dozens of expensive tests in order to increase SA performance (and stop buying hardware to support it), but with the current scoring system you then have to manually re-weight scores or run the GA yourself in order to adapt to the lack of input from those tests (which is still a static result and requires you to maintain a large corpus).

With a feedback learning system in place for top-level scores, we could safely offer command-line switches that limit the tests used to a specific list of tclasses, without having to create new score categories for every permutation. (And I agree with the earlier statement by one of the developers, I forget who, that categories are a bit of a hack that doesn't fit in terribly well.) This would have the potential to create those categories on the fly, as long as a given site continued to run with the same switches.

Sorry for my usual verbosity. It's just the way I communicate (especially, for odd reasons I won't go into, in the mornings).
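To make the "SA-tokens" idea from point #3 concrete, here's a minimal Perl sketch of the tokenization step. The %headers hash and @rule_hits list stand in for whatever the real scanner would provide; the token spellings follow the examples above, and nothing here is an actual SA internal:

  #!/usr/bin/perl
  # Sketch only: turn headers plus matched rule names into Bayes-style tokens.
  use strict;
  use warnings;

  my %headers = (
      'References' => '<abc@example.com>',
      'To'         => 'ajs@ajs.com',
      'Subject'    => 'lots of hits here',
  );
  my @rule_hits = qw(RAZOR2_CHECK HTML_MESSAGE);   # body/raw tests that fired

  my @tokens;
  for my $h (sort keys %headers) {
      push @tokens, "token:headerseen:$h";         # presence of the header
      push @tokens, map { "token:headertext:$_" }
                    split /\s+/, $headers{$h};     # each word in the header
  }
  push @tokens, "token:toaddr:$headers{To}";
  push @tokens, map { "token:rule:$_" } @rule_hits;  # abstract body-test tokens

  print "$_\n" for @tokens;
  # These tokens would then feed a per-user Bayesian learner, with the GA
  # scores only seeding the initial weights.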
Subject: Re: [SAdev] Some negative scores too low in 2.50 evolved scores

bugzilla-daemon@hughes-family.org writes:
> 1. Would it make sense to do a stand-alone Bayes pass on the headers
> only? That could replace a lot of the pre-scored (and thus exploitable)
> nice tests on headers with real-time adaptable Bayesian handling that's
> unique per user or at most per site. It would also allow a user to score
> ham very low based on simple things like the presence of certain headers,
> in a safe and adaptable way.

The nice tests on headers are gone. We've essentially already replaced them with Bayes. The question is not about them; it's about whether this idea would make Bayes more accurate or not. Test it and find out.

> 2. If you're thinking tflag thoughts for constraining the evolved scores,
> consider weighting them by how easily they can be forged. There are three
> levels of forgeability that I look for: a) just add a header or simple
> tag; b) interaction between headers and formats, like X-Mailer +
> Message-ID signature (harder to forge, but very doable); c) unforgeable,
> which includes most local Received header tests or, as others have
> pointed out, those that have legal teeth (perhaps a fourth class for
> those?).

I think we just want to drop anything remotely forgeable. The tflags ideas we've discussed were more like temporary workarounds for 2.5x. I don't think we want any forgeable tests in 2.60. If they had a really low score, (a) they wouldn't be very effective anyway and (b) people would still complain a lot -- it's not worth the hassle.

I think #3 is a more involved discussion and should be a separate bug or a mailing list discussion.
Subject: Re: [SAdev] Some negative scores too low in 2.50 evolved scores

1. I don't see how this differs much from what we do now.

2. Levels 1 and 2 will quickly become essentially useless. I've thought about it for a bit, and it really seems like nice tests are pretty much impossible as long as we are so popular :-)

3. I've often thought about using a Bayes-type system for scoring based on rules hit, allowing for real-time changes in scoring, etc. Perhaps this won't be as much of an issue when/if we get the rules hit mixed into the Bayes system. We'll see.
OKAY: attachments 912-914.

How did you do your GA run? Don't we really need three mass-checks to get appropriate results? I suppose using the data from the 2.50 checks is probably okay for our purposes.

Have these been checked into HEAD?
Theo, I am testing out the new scores vs. 2.5x on my corpus, which has been updated since 2.50. I think we might want to consider committing only the scores patch (plus the new anti-forgery rules), but not the score-ranges code or tflags changes.

Duncan, none of these patches are going into HEAD. They're 2.54-only. 2.60 only has a few nice rules, and all are hard to forge.
Subject: Re: [SAdev] Some negative scores too low in 2.50 evolved scores

> Duncan, none of these patches are going into HEAD. They're 2.54-only.
> 2.60 only has a few nice rules and all are hard to forge.

Right. That had slipped my mind when I asked. The scores should be committed to HEAD though, to support all us insane people that like to run 2.60-cvs on real mail :-)
I'm just the messenger.

2.5x branch current:

# SUMMARY for threshold 5.0:
# Correctly non-spam:  17693  59.51%  (99.98% of non-spam corpus)
# Correctly spam:      10554  35.50%  (87.69% of spam corpus)
# False positives:         4   0.01%  (0.02% of nonspam, 156 weighted)
# False negatives:      1482   4.98%  (12.31% of spam, 4368 weighted)
# TCR: 8.013316  SpamRecall: 87.687%  SpamPrec: 99.962%  FP: 0.01%  FN: 4.98%

with patches 913 and 914:

# SUMMARY for threshold 5.0:
# Correctly non-spam:  17692  59.50%  (99.97% of non-spam corpus)
# Correctly spam:      10431  35.08%  (86.67% of spam corpus)
# False positives:         5   0.02%  (0.03% of nonspam, 192 weighted)
# False negatives:      1605   5.40%  (13.33% of spam, 4968 weighted)
# TCR: 7.384049  SpamRecall: 86.665%  SpamPrec: 99.952%  FP: 0.02%  FN: 5.40%
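For readers decoding the summary lines: the figures are consistent with TCR = nspam / (lambda*FP + FN) with false positives weighted at lambda = 5; that weighting is inferred from the numbers, not stated in the output. A quick sanity check:

  #!/usr/bin/perl
  # Verify the TCR lines above; lambda = 5 is inferred from the figures.
  use strict;
  use warnings;

  my $lambda = 5;
  for ([10554, 4, 1482], [10431, 5, 1605]) {    # (caught spam, FPs, FNs)
      my ($caught, $fp, $fn) = @$_;
      my $nspam = $caught + $fn;                # total spam corpus: 12036
      printf "TCR: %.6f\n", $nspam / ($lambda * $fp + $fn);
  }
  # prints 8.013316 and 7.384049, matching the two summaries

So the patched run trades one extra FP and 123 extra FNs for the forgery resistance the new limits buy, which is why the corpus age flagged in the next comment matters.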
Well, I should be fair -- my corpus is 1.5 months old now. I need to update it. Might still be a win.
<quinlan> felicity: with Bayes, went from (for 158 of my hard spam previously missed by SA 2.5x) 13 with negative scores to 0
<felicity> quinlan, so that's good. they're not negative now. :)
<quinlan> old average = 4.47, std = 4.3, new average = 5.54, std = 3.54
<quinlan> and 64 FPs instead of 94
<quinlan> that was bayes w/o net

So, I'd say this:

OKAY: new scores
OKAY: new TOO_MANY_MUA rule
OKAY: bug fixes only for GA code
ISSUE: I don't want new tflags and related changes to GA code to be checked into 2.54, reasoning below:

<quinlan> I think (a) if people are doing their own GA run, there's no reason to constrain it (since abuse by spammers is harder) and (b) less clues for spammers still figuring this out and (c) ewww
ok, I applied the scores and new rule to 2.54. still need to generate the new STATISTICS* files, but evolve is segfaulting on me, so it may be a bit. :(
fixed evolve, generated statistics files, committed to stable.
*** Bug 1870 has been marked as a duplicate of this bug. ***
*** Bug 1898 has been marked as a duplicate of this bug. ***