Bug 3821 - scores are overoptimized for training set
Summary: scores are overoptimized for training set
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Score Generation
Version: 3.0.0
Hardware: Other
OS: Linux
Importance: P5 major
Target Milestone: 3.1.0
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Duplicates: 4031
Depends on:
Blocks:
 
Reported: 2004-09-25 04:05 UTC by sehh
Modified: 2005-05-01 16:29 UTC
CC List: 1 user



Description sehh 2004-09-25 04:05:17 UTC
I installed SA v3 (I'd been using 2.xx for some years). The first thing I noticed
was that many BAYES_99 emails wouldn't get a high score, while BAYES_95 or so
would get higher scores.

Then I noticed that 50_scores.cf has the following:

score BAYES_50 0 0 1.567 0.001
score BAYES_60 0 0 3.515 0.372
score BAYES_80 0 0 3.608 2.087
score BAYES_95 0 0 3.514 2.063
score BAYES_99 0 0 4.070 1.886

WOW!! that is messed up!

BAYES_80 has higher score than BAYES_95 (3.608,3.514 and 2.087,2.063)

BAYES_95 has higher score than BAYES_99 (2.063,1.886)


From my understanding, BAYES_99 means "more spamminess", right? So why does a
"more spam" email get a lower score than a "less spam" email?
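The arithmetic behind the complaint can be sketched as follows. This is a minimal illustration, not SpamAssassin's actual code; the scores are the last (network+bayes) column from the 50_scores.cf excerpt above, and the default required_score is 5.0.

```python
# Minimal sketch of SpamAssassin-style scoring: a message is marked spam when
# the scores of all rules it hits sum to at least required_score (5.0 by
# default). Scores are the network+bayes column quoted above.

BAYES_SCORES = {
    "BAYES_50": 0.001,
    "BAYES_60": 0.372,
    "BAYES_80": 2.087,
    "BAYES_95": 2.063,
    "BAYES_99": 1.886,
}
REQUIRED_SCORE = 5.0

def classify(hit_rules, scores=BAYES_SCORES, required=REQUIRED_SCORE):
    """Return (total_score, is_spam) for the rules a message triggered."""
    total = sum(scores.get(rule, 0.0) for rule in hit_rules)
    return total, total >= required

# A message hitting only BAYES_99 falls far short of the threshold:
# classify(["BAYES_99"]) -> (1.886, False)
```

This is the reporter's point: with these scores, a BAYES_99 hit alone cannot push a message over the threshold; it needs help from network or body rules.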
Comment 1 Bob Apthorpe 2004-09-25 08:47:25 UTC
This is a known issue, it's not a problem, and it's been covered in the FAQ
(http://wiki.apache.org/spamassassin/HowScoresAreAssigned) and the mailing list
archives repeatedly.
Comment 2 sehh 2004-09-25 12:01:43 UTC
I understand the reason behind it.

Though in practice, it causes a shit load of false negatives.

All that spam which isn't recognized by any network test (RBLs)
will pass as ham because the BAYES_xx score is not high enough!

With SA 2.6x all those emails would only hit the BAYES_99 rule
and would be marked as spam, quite effectively.

Now that I've reset the scores to their SA 2.6x values, I no
longer see spam passing as ham that a single BAYES_99 hit
would previously have caught.

I'd advise against this "dynamic scoring". It's nice in theory only.
Comment 3 Daniel Quinlan 2004-09-25 13:09:07 UTC
Changing summary: the problem is that harder-to-defeat tests should have
higher weights than easy-to-defeat ones.

Harder to defeat:

 - learning tests (local control)
 - user configuration tests (local control)
 - network tests (remote maintenance)
 - nice tests (via design)

I realize Henry is working on this, but perhaps we could get a shorter-term
improvement in place sooner.

Assigning to 3.1.0 for now since this is accuracy-related.  Any fixes should
also be considered for 3.0.x.
Comment 4 Bob Menschel 2004-09-26 17:31:54 UTC
I like your idea concerning "harder to defeat" rules.  I'd also suggest a 
classification of "more likely to be correct", which would include 
- obfuscation rules with ultra high confidence (vi@gra)
- spam headers (X_MESSAGE_INFO)
- known forgeries (FAKE_OUTBLAZE_RCVD)
- broken ratware (subject =~ /%RAND/)

Perhaps such rules can be flagged via tflags or similar mechanism, such that 
the automatic scoring mechanism will apply preferential treatment to them, 
provided that the scoring mass-checks hit no ham at all (or no spam if a 
negative scoring rule). 

Thus, X_MESSAGE_INFO will be given a higher score than a non-preferential rule 
with the same hits, because we're more confident in X_MESSAGE_INFO's accuracy. 
And during the perceptron's scoring adjustments, these rules' scores will be 
adjusted less than others (because we're more confident they will not cause 
false positives). 
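The flagging idea above could look something like this. A hypothetical sketch only: "reliable" is not an existing tflag, and the range numbers are invented.

```python
# Hypothetical sketch of the proposed preferential treatment: a rule carrying
# a "reliable" tflag gets a wider allowed score range before optimization,
# but only when mass-check shows it hit no ham at all (or no spam, for a
# negatively-scored "nice" rule). The tflag name and ranges are invented.

def allowed_range(tflags, ham_hits, spam_hits, nice=False,
                  base=4.5, preferred=6.0):
    """Return the (low, high) score bounds handed to the optimizer."""
    # For a spam rule, ham hits disqualify it; for a nice rule, spam hits do.
    opposing_hits = spam_hits if nice else ham_hits
    width = preferred if ("reliable" in tflags and opposing_hits == 0) else base
    return (-width, 0.0) if nice else (0.0, width)
```

Under this sketch, a flagged rule like X_MESSAGE_INFO with zero ham hits is allowed a higher ceiling than an unflagged rule with identical mass-check numbers.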
Comment 5 Daniel Quinlan 2004-09-26 18:19:46 UTC
Subject: Re:  scores are overoptimized for training set

> I like your idea concerning "harder to defeat" rules.  I'd also suggest a 
> classification of "more likely to be correct", which would include 

I think "more likely to be correct" is more or less already handled by
the perceptron, training with sufficient ham, and the score ranging
code.

Comment 6 Matthias Heiler 2004-09-27 05:22:59 UTC
The bug shows two principal problems with perceptrons:
1.) They are only guaranteed to converge to a local optimum.
2.) In general, they have no protection against overlearning, meaning that they
"learn the training data set by heart", failing to generalize to new cases
(messages not previously trained).

Both might have happened in the Bayes-score example.
(Also, note that the encoding of the output from the Bayes classifier is
unnecessarily hard for the perceptron to learn: one single Bayes score value
(a real number in [0, 1]) would be much easier to learn.)

A real fix for the problem would be not to use perceptrons at all.  Other
machine learning algorithms (Boosting or Support Vector Machines) have much
better regularization properties and they are guaranteed to converge on a global
optimum.

Sure, the perceptron is an improvement over the GA.  But, IMHO, it is still not
the best way to go.
Comment 7 Henry Stern 2004-09-27 07:28:23 UTC

Hi Matthias,

Mike Brzozowski at Stanford's AI lab has been doing experiments using 
support vector machines and logistic regression to classify messages.  
From what I've seen, their results are neither better nor worse than mine 
with the perceptron, so I doubt that changing the learning algorithm 
itself will have any effect.

The problem lies in the fact that this is an adversarial classification 
problem, which adds some constraints to the solution space.  For any 
message M, the adversary must not be able to create a message M' that 
triggers additional rules where P(Spam|M') < P(Spam|M).  In the case of 
margin classifiers like perceptrons and support vector machines, this 
means that there can be no negative weights for rules that the adversary 
can affect.  For any message, M, if the adversary creates a message M' 
that triggers fewer rules, P(Spam|M) should not be greatly larger than 
P(Spam|M').  This means that the weights must be as large as possible 
without causing unnecessary false positives.

The senior SpamAssassin developers wanted a false positive rate of 0.04% 
(1:2500) for each configuration using the default threshold.  To 
accomplish this, we scaled the allowable ranges for the scores and 
trained the perceptron using a cross validation to choose the set of 
parameters that best met our needs.  This is why the scores seem so low.
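As a rough illustration of the constraints Henry describes (assuming a plain perceptron with projected weights, not his actual implementation), both the per-rule score ranges and the "no negative weights for spammer-controllable rules" requirement can be enforced by clipping each weight back into its allowed range after every update:

```python
# Rough sketch (NOT the SpamAssassin perceptron): train scores as a linear
# classifier, projecting each weight back into its allowed (lo, hi) range
# after every update. A lower bound of 0.0 encodes the "no negative weights
# for rules the adversary can affect" constraint from the comment above.

def train(messages, ranges, epochs=100, lr=0.1, threshold=5.0):
    """messages: list of (hit_vector, is_spam); ranges: per-rule (lo, hi)."""
    w = [lo for lo, hi in ranges]              # start at the bottom of each range
    for _ in range(epochs):
        for hits, is_spam in messages:
            score = sum(wi * xi for wi, xi in zip(w, hits))
            # Classic perceptron error: desired label minus predicted label.
            err = (1 if is_spam else 0) - (1 if score >= threshold else 0)
            for i, xi in enumerate(hits):
                if xi:
                    lo, hi = ranges[i]
                    w[i] = min(max(w[i] + lr * err * xi, lo), hi)
    return w
```

With a single always-hitting spam rule and range (0.0, 6.0), the weight climbs only until the threshold is met and never below zero, mirroring "as large as necessary, bounded by the allowed range".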

Even though the scores seem low, they are actually quite good.  For set 
3 (network+bayes), the results of our cross validations suggest that 
SpamAssassin will have a false negative rate of approximately 0.9%.  
This seems about right for my own personal e-mail (I do not contribute 
to the corpus that we use to train the classifier).

In the coming weeks, we will have to pay close attention to how spammers 
are able to defeat SpamAssassin.  If we find that the scores really are 
too low, I can quickly generate new ones with wider ranges for a 3.0.1 
release.  Keep in mind that our top priority is precision, not recall.

Henry

Comment 8 sehh 2004-09-27 08:15:24 UTC
Henry,

In theory, your ideas are good. In practice, though, they are not as effective.
Allow me to explain my point of view.

Your theory is based on the axiom that "a spam email will hit multiple rules".
Thus your genetic algorithm generates scores that are "as large as possible
without causing unnecessary false positives".

Unfortunately, in practice, spam emails may not hit multiple rules. In fact,
many, many times a spam email will hit only a BAYES_xx rule and nothing
else. Contributing factors are:

- the spammer uses a new location to send spam (most network tests are not effective)
- the spammer uses an image to advertise the product (even network tests can't
extract URLs)
- the spammer has a carefully written email without common mistakes
- the spammer uses a non-English language (most body rules are English-specific)

My experience has shown that BAYES_xx is the only thing that stops this
kind of spam from passing through. With the current low BAYES_99 score, all
these emails started passing through our system. Sure, there aren't many, but
this problem did not exist in SA 2.6x, and it makes it look like our upgrade
wasn't that good.

Please don't take me the wrong way: the developers of SA have done a superb job
and I'm not criticising their decisions. I'm just pointing out a specific case
that the algorithm does not take into consideration.

Thank you.
Comment 9 Justin Mason 2004-09-27 14:16:38 UTC
'I like your idea concerning "harder to defeat" rules.  I'd also suggest a 
classification of "more likely to be correct", which would include 
- obfuscation rules with ultra high confidence (vi@gra)
- spam headers (X_MESSAGE_INFO)
- known forgeries (FAKE_OUTBLAZE_RCVD)
- broken ratware (subject =~ /%RAND/)

Perhaps such rules can be flagged via tflags or similar mechanism, such that 
the automatic scoring mechanism will apply preferential treatment to them, 
provided that the scoring mass-checks hit no ham at all (or no spam if a 
negative scoring rule).'

BTW, this is the "rule reliability tflag" idea again; basically, provide a way to
hint that this rule is reliable and that one should not be considered reliable
-- no matter what their hit-rates in mass-checks were. 

I agree it may have good effects as a hint to the perceptron, so it may now be
time to do this.  What d'you think, Henry?
Comment 10 Justin Mason 2004-09-27 14:21:50 UTC
'BTW, this is the "rule reliability tflag" idea again; basically provide a way
to hint that this rule is reliable, and this rule should not be considered
reliable -- no matter what their hit-rates in mass-checks were.'

Oh, I should point out -- the point in particular here is that, often, you can
get rules that hit 20%:0.001% spam:ham for a very high S/O -- they would always
be given a good high range, and the perceptron allowed to score those rules
highly.

However, sometimes a really simple one-word body pattern (e.g. "viagra") may get
1.0%:0.0001% hit-rates.  Given that it's a really simple one-word body pattern,
*we* know that it has a high chance of FP'ing in the field, even if our corpora
don't show it at all -- so a reliability tflag gives us a way to indicate this.

OTOH, at times, we know that another similarly low-frequency rule is very
reliable and won't FP, and so can safely get a high score, but we just don't
have a lot of data that hits it in our corpora.

The current problem is that our scoring code has to be over-paranoid about
ranges for low-frequency rules -- just in case it's the first case and not the
second -- hence restricting them unfairly. 
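For reference, the S/O figure Justin quotes is simply the share of a rule's hits that land on spam; a small sketch with the hit-rates from the comment above:

```python
# S/O ("spam/overall"): the fraction of a rule's total hits that are spam.
# Both example rules have S/O near 1.0, yet one fires on 20% of spam and
# the other on only 1% -- frequency, not S/O, is what separates them, which
# is why hit-rates alone can't tell the optimizer how reliable a rule is.

def s_over_o(spam_hit_pct, ham_hit_pct):
    total = spam_hit_pct + ham_hit_pct
    return spam_hit_pct / total if total else 0.0

frequent = s_over_o(20.0, 0.001)   # high-frequency rule, S/O ~0.99995
rare     = s_over_o(1.0, 0.0001)   # low-frequency rule,  S/O ~0.99990
```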
Comment 11 Henry Stern 2004-09-27 14:51:23 UTC

Hi Dimitris,

The scores were generated from a sample of over 850,000 e-mails submitted 
by multiple users.  The likely reason why BAYES_xx was scored so low is 
that, due to Bayes busting, it is not as effective as it once was.  If 
you feel that the BAYES_xx scores are too low for you, then you should 
increase them.

I'm not going to address your theory about how spam e-mails don't hit 
multiple rules.  I'd suggest running mass-check on your own corpus or 
examining the mass-check logs in the submit directory on the rsync 
server.  If you find that they are that different, it is fairly easy to 
run the score optimizer in order to personalise your scores.

Henry

Comment 12 Henry Stern 2004-09-27 14:58:20 UTC

This sounds like a reasonable approach.  I can't help out with it at the 
moment, though.  My thesis needs to be finished in 7 weeks.

Henry

Comment 13 sehh 2004-09-27 15:08:41 UTC
Henry,

Understood. I will approach this problem based on your suggestion and see how I
can improve the situation. Much appreciated.
Comment 14 Justin Mason 2004-09-27 15:55:30 UTC
whoa, fair enough.  good luck with the thesis! ;)
Comment 15 Bob Menschel 2004-09-30 21:06:40 UTC
I agree this is a better place to discuss the philosophical question: 
> If RFCI was perfect at hitting otherwise-missed spam with no FPs except 
> roaringpenguin.com, and the mail used to score the perceptron had much less 
> than 1 in 2500 mails from roaringpenguin.com, is it correct to let the rule 
> get a very high score?

What do we mean by "very high score"? My practice is that any rule which hits 
ANY ham gets scored no higher than 1/3 of required_hits. Rules that hit lots 
of spam will get that RH/3, but no higher. 

Philosophically, if we had a rule which we knew /should/ hit the occasional 
ham, then I would similarly limit it, even if that theoretical ham was not in 
any testing corpus. 

RH/3 is simply my rule of thumb, because I generally deal with a limited 
corpus of only 100k emails or so. IMO, if tested via corpora with enough 
emails for testing, RH/2 wouldn't be unreasonable. 
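Bob's rule of thumb is easy to state as code. A sketch of his personal practice, not project policy; required_hits defaults to 5.0 in SpamAssassin:

```python
# Sketch of the RH/3 rule of thumb: any rule that hits ANY ham is capped at
# required_hits / 3, regardless of how much spam it also catches. With a
# large enough corpus, the divisor could be relaxed to 2 (RH/2).

def cap_score(proposed, ham_hits, required_hits=5.0, divisor=3):
    """Cap a rule's proposed score if mass-check shows any ham hits."""
    if ham_hits > 0:
        return min(proposed, required_hits / divisor)
    return proposed
```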
Comment 16 Daniel Quinlan 2004-09-30 22:31:07 UTC

> RH/3 is simply my rule of thumb, because I generally deal with a limited 
> corpus of only 100k emails or so. IMO, if tested via corpora with enough 
> emails for testing, RH/2 wouldn't be unreasonable. 

Sure, the perceptron does the same, but much better than humans (which
is why I generally avoid second guessing scores).  Henry is
experimenting with rule accuracy degradation over time and perhaps the
perceptron can handle this even better in the future.

Comment 17 Henry Stern 2004-12-15 09:07:25 UTC
*** Bug 4031 has been marked as a duplicate of this bug. ***
Comment 18 Christian Becker 2004-12-24 07:17:57 UTC
I think the reaction time of the online blocklists plays into this a lot. I
increasingly see spam come in that hovers around 4 points because it's listed
on only a couple of lists. When I scan it again half an hour later, more lists
have it and it scores 6 points and above (not counting bayes/AWL changes).
I just started collecting these -- the most significant jump I noticed was from
2.2 (BAYES_80+SBL) to 8.1 (+spamcop, ws, ob) the next day, with SpamCop catching
on after 3 hours and the URIBLs not after 8 (I slept then =)). Verified using the
checker at rulesemporium.com. This stuff definitely skews the mass-check numbers.

Quoting Daniel from Bug 3947:

> Well, we want to move to using real-time data for network tests and that
> would improve the scores, but it's a non-trivial amount of work (to log
> complete network data in message headers, and then reuse it in
> mass-check).  Patches accepted.  :-)

Could it be an easy fix to derive some discounting factors based on the live
data that we see, and to apply those in the scoring process?
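One way the discounting factor Christian asks about might look -- purely speculative; nothing like this exists in SpamAssassin, and every name and number here is invented:

```python
# Speculative sketch: if live data show a blocklist typically lists a spam
# source only some hours after the mail arrives, scale that rule's observed
# mass-check hit count up by the fraction of hits the listing lag likely
# cost it. All parameter names and values here are hypothetical.

def adjusted_hits(observed_hits, listing_lag_hours, scan_delay_hours):
    """Estimate the hits the rule would have had with instant listing."""
    if listing_lag_hours <= scan_delay_hours:
        return float(observed_hits)            # list was fast enough already
    caught_fraction = scan_delay_hours / listing_lag_hours
    return observed_hits / caught_fraction
```

For example, a blocklist that lags mail arrival by 8 hours while messages are scanned within 4 would have caught roughly half its potential hits, so its observed count is doubled before scoring.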
Comment 19 Daniel Quinlan 2005-05-02 00:29:50 UTC
This should be fixed provided the mass-checks are done with --reuse
as much as possible.  Closing as FIXED.