SA Bugzilla – Bug 3821
scores are overoptimized for training set
Last modified: 2005-05-01 16:29:50 UTC
I installed SA v3 (been using 2.xx for some years now). The first thing I noticed was that many BAYES_99 emails wouldn't get a high score, while BAYES_95 emails would often score higher. Then I noticed that 50_scores.cf has the following:

score BAYES_50 0 0 1.567 0.001
score BAYES_60 0 0 3.515 0.372
score BAYES_80 0 0 3.608 2.087
score BAYES_95 0 0 3.514 2.063
score BAYES_99 0 0 4.070 1.886

WOW!! That is messed up! BAYES_80 has a higher score than BAYES_95 (3.608 vs. 3.514, and 2.087 vs. 2.063), and BAYES_95 has a higher score than BAYES_99 (2.063 vs. 1.886). From my understanding, BAYES_99 means "more spamminess", right? So why does a "more spammy" email get a lower score than a "less spammy" one?
This is a known issue, not a problem; it's been covered in the FAQ (http://wiki.apache.org/spamassassin/HowScoresAreAssigned) and the mailing-list archives repeatedly.
I understand the reason behind it. In practice, though, it causes a huge number of false negatives. All the spam that isn't recognized by any network test (RBLs) passes as ham because the BAYES_xx score is not high enough!!! With SA 2.6x all those emails would hit only the BAYES_99 rule and would be marked as spam, quite effectively. Now that I've reset the scores to their SA 2.6x values, I no longer get spam slipping through that a single BAYES_99 hit should have caught. I'd advise against this 'dynamic scoring'. It's nice in theory only.
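For reference, raising the Bayes scores locally is a one-line-per-rule override in local.cf. A minimal sketch; the values below are illustrative, not the actual 2.6x scores:

```
# local.cf -- override the shipped Bayes scores (example values only)
score BAYES_95 3.5
score BAYES_99 4.5
```

A single value applies to all four score sets; per-set overrides use the four-value form shown in 50_scores.cf.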
Changing summary; the problem is that harder-to-defeat tests should have higher weights than easy-to-defeat ones. Harder to defeat:

- learning tests (local control)
- user configuration tests (local control)
- network tests (remote maintenance)
- nice tests (by design)

I realize Henry is working on this, but perhaps we could get a shorter-term improvement in place sooner. Assigning to 3.1.0 for now since this is accuracy-related. Any fixes should also be considered for 3.0.x.
I like your idea concerning "harder to defeat" rules. I'd also suggest a classification of "more likely to be correct", which would include:

- obfuscation rules with ultra-high confidence (vi@gra)
- spam headers (X_MESSAGE_INFO)
- known forgeries (FAKE_OUTBLAZE_RCVD)
- broken ratware (subject =~ /%RAND/)

Perhaps such rules can be flagged via tflags or a similar mechanism, so that the automatic scoring mechanism applies preferential treatment to them, provided that the scoring mass-checks hit no ham at all (or no spam, for a negative-scoring rule). Thus, X_MESSAGE_INFO would be given a higher score than a non-preferential rule with the same hit counts, because we're more confident in X_MESSAGE_INFO's accuracy. And during the perceptron's scoring adjustments, these rules' scores would be adjusted less than others (because we're more confident they will not cause false positives).
Subject: Re: scores are overoptimized for training set

> I like your idea concerning "harder to defeat" rules. I'd also suggest a
> classification of "more likely to be correct", which would include

I think "more likely to be correct" is more or less already handled by the perceptron, training with sufficient ham, and the score-ranging code.
The bug shows two principal problems with perceptrons:

1.) They are only guaranteed to converge on a local optimum.
2.) In general, they have no protection against overlearning, meaning that they "learn the training data set by heart", failing to generalize to new cases (messages not previously trained on).

Both might have happened in the Bayes-score example. (Also, note that the encoding of the output from the Bayes classifier is unnecessarily hard for the perceptron to learn: a single Bayes-score value (a real number in [0, 1]) would be much easier to learn.)

A real fix for the problem would be not to use perceptrons at all. Other machine learning algorithms (boosting or support vector machines) have much better regularization properties, and they are guaranteed to converge on a global optimum. Sure, the perceptron is an improvement over the GA. But, IMHO, it is still not the best way to go.
Subject: Re: scores are overoptimized for training set

Hi Matthias,

Mike Brzozowski at Stanford's AI lab has been doing experiments using support vector machines and logistic regression to classify messages. From what I've seen, their results are neither better nor worse than mine with the perceptron, so I doubt that changing the learning algorithm itself will have much effect. The problem lies in the fact that this is an adversarial classification problem, which adds some constraints to the solution space.

For any message M, the adversary must not be able to create a message M' that triggers additional rules where P(Spam|M) < P(Spam|M'). In the case of margin classifiers like perceptrons and support vector machines, this means that there can be no negative weights for rules that the adversary can affect.

For any message M, if the adversary creates a message M' that triggers fewer rules, P(Spam|M) should not be greatly larger than P(Spam|M'). This means that the weights must be as large as possible without causing unnecessary false positives.

The senior SpamAssassin developers wanted a false positive rate of 0.04% (1:2500) for each configuration using the default threshold. To accomplish this, we scaled the allowable ranges for the scores and trained the perceptron using cross-validation to choose the set of parameters that best met our needs. This is why the scores seem so low.

Even though the scores seem low, they are actually quite good. For set 3 (network+bayes), the results of our cross-validations suggest that SpamAssassin will have a false negative rate of approximately 0.9%. This seems about right for my own personal e-mail (I do not contribute to the corpus that we use to train the classifier).

In the coming weeks, we will have to pay close attention to how spammers are able to defeat SpamAssassin. If we find that the scores really are too low, I can quickly generate new ones with wider ranges for a 3.0.1 release.
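The "no negative weights for rules the adversary can affect" constraint described above can be sketched in a few lines. This is not SpamAssassin's actual trainer; the rule names, data, and projection step are made up purely to illustrate the idea of clamping adversary-controllable weights to be non-negative after each perceptron update:

```python
# Sketch only: a perceptron whose weights for adversary-controllable rules
# are projected back to >= 0 after each update, so a spammer can never lower
# a message's score by deliberately triggering extra rules.

def train(rows, labels, adversarial, epochs=50, lr=0.1):
    """rows: list of dicts rule -> 0/1; labels: +1 spam, -1 ham."""
    rules = sorted({r for row in rows for r in row})
    w = {r: 0.0 for r in rules}
    b = 0.0
    for _ in range(epochs):
        for row, y in zip(rows, labels):
            score = b + sum(w[r] * row.get(r, 0) for r in rules)
            if y * score <= 0:                    # misclassified: update
                for r in rules:
                    w[r] += lr * y * row.get(r, 0)
                b += lr * y
                for r in adversarial:             # projection step: rules the
                    w[r] = max(w[r], 0.0)         # spammer controls stay >= 0
    return w, b

# toy corpus: RULE_A fires on spam, RULE_B fires on ham
rows = [{"RULE_A": 1}, {"RULE_A": 1}, {"RULE_B": 1}, {"RULE_B": 1}]
labels = [1, 1, -1, -1]
w, b = train(rows, labels, adversarial={"RULE_A"})
assert w["RULE_A"] >= 0.0
```

The same projection trick works for any margin classifier; it shrinks the feasible solution space, which is exactly why the resulting scores can look lower than intuition suggests.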
Keep in mind that our top priority is precision, not recall.

Henry

>------- Additional Comments From heiler@gmx.de 2004-09-27 05:22 -------
>[...]
Henry,

In theory, your ideas are good, though in practice they are not as effective. Allow me to explain my point of view.

Your theory is based on the axiom that "a spam email will hit multiple rules". Thus your genetic algorithm generates scores that are "as large as possible without causing unnecessary false positives".

Unfortunately, in practice, spam emails may not hit multiple rules. In fact, very often a spam email will hit only a BAYES_xx rule and nothing else. Contributing factors are:

- the spammer uses a new location to send spam (most network tests are not effective)
- the spammer uses an image to advertise the product (even network tests can't extract URLs)
- the spammer has a carefully written email without common mistakes
- the spammer uses a non-English language (most body rules are English-specific)

My experience has shown that BAYES_xx is the only thing that stops this kind of spam from passing through. With the current low BAYES_99 score, all these emails started passing through our system. Sure, there aren't many of them, but this problem did not exist in SA 2.6x, and it makes it look like our upgrade wasn't that good.

Please don't take me the wrong way: the developers of SA have done a superb job and I'm not criticising their decisions. I'm just pointing out a specific case that the algorithm does not take into consideration. Thank you.
'I like your idea concerning "harder to defeat" rules. I'd also suggest a classification of "more likely to be correct", which would include - obfuscation rules with ultra high confidence (vi@gra) - spam headers (X_MESSAGE_INFO) - known forgeries (FAKE_OUTBLAZE_RCVD) - broken ratware (subject =~ /%RAND/) Perhaps such rules can be flagged via tflags or similar mechanism, such that the automatic scoring mechanism will apply preferential treatment to them, provided that the scoring mass-checks hit no ham at all (or no spam if a negative scoring rule).'

BTW, this is the "rule reliability tflag" idea again; basically, provide a way to hint that one rule is reliable and that another should not be considered reliable -- no matter what their hit-rates in mass-checks were.

I agree it may have good effects as a hint to the perceptron, so it may now be time to do this. What d'you think, Henry?
'BTW, this is the "rule reliability tflag" idea again; basically, provide a way to hint that one rule is reliable and that another should not be considered reliable -- no matter what their hit-rates in mass-checks were.'

Oh, I should point out -- the point in particular here is that, often, you can get rules that hit 20%:0.001% spam:ham for a very high S/O -- they would always be given a good high range, and the perceptron allowed to range those rules highly. However, sometimes a really simple one-word body pattern (e.g. "viagra") may get 1.0%:0.0001% hit-rates. Given that it's a really simple one-word body pattern, *we* know that it has a high chance of FP'ing in the field, even if our corpora don't hit it at all -- so a reliability tflag gives us a way to indicate this. OTOH, at times, we know that another similarly low-frequency rule is very reliable and won't FP, and so can safely get a high score, but we just don't have a lot of data that hits it in our corpora.

The current problem is that our scoring code has to be over-paranoid about ranges for low-frequency rules -- just in case it's the first case and not the second -- hence restricting them unfairly.
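A hypothetical sketch of what such a hint could look like in rule config. The `reliable` / `unreliable` tflag values below are the proposal under discussion, not existing SpamAssassin syntax, and the rule names and patterns are invented for illustration:

```
# Hypothetical syntax -- these tflag values do not exist in SA as shipped.
body     LOCAL_OBFU_VIAGRA  /v[i1]@gra/i
tflags   LOCAL_OBFU_VIAGRA  unreliable   # simple pattern; likely to FP in the field
                                         # even if our corpora never hit it

header   LOCAL_X_MSG_INFO   X-Message-Info =~ /./
tflags   LOCAL_X_MSG_INFO   reliable     # known ratware header; safe to range high
                                         # despite low corpus frequency
```

The score optimizer would then widen the allowed range for `reliable` rules and narrow (or zero-floor) it for `unreliable` ones, regardless of raw mass-check hit-rates.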
Subject: Re: scores are overoptimized for training set

Hi Dimitris,

The scores were generated from a sample of over 850,000 e-mails submitted by multiple users. The likely reason why BAYES_xx was scored so low is that, due to Bayes busting, it is not as effective as it once was. If you feel that the BAYES_xx scores are too low for you, then you should increase them.

I'm not going to address your theory about how spam e-mails don't hit multiple rules. I'd suggest running mass-check on your own corpus, or examining the mass-check logs in the submit directory on the rsync server. If you find that your results are that different, it is fairly easy to run the score optimizer in order to personalise your scores.

Henry

>------- Additional Comments From sehh@altered.com 2004-09-27 08:15 -------
>[...]
Subject: Re: scores are overoptimized for training set

This sounds like a reasonable approach. I can't help out with it at the moment, though. My thesis needs to be finished in 7 weeks.

Henry

>------- Additional Comments From jm@jmason.org 2004-09-27 14:16 -------
>[...]
Henry, understood. I will approach this problem based on your suggestion and see how I can improve the situation. Much appreciated.
whoa, fair enough. good luck with the thesis! ;)
I agree this is a better place to discuss the philosophical question:

> If RFCI was perfect at hitting otherwise-missed spam with no FPs except
> roaringpenguin.com, and the mail used to score the perceptron had much less
> than 1 in 2500 mails from roaringpenguin.com, is it correct to let the rule
> get a very high score?

What do we mean by "very high score"? My practice is that any rule which hits ANY ham gets scored no higher than 1/3 of required_hits. Those that hit lotsa spam will get that RH/3, but no higher. Philosophically, if we had a rule which we knew /should/ hit the occasional ham, then I would similarly limit it, even if that theoretical ham was not in any testing corpus.

RH/3 is simply my rule of thumb, because I generally deal with a limited corpus of only 100k emails or so. IMO, if tested via corpora with enough emails for testing, RH/2 wouldn't be unreasonable.
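The rule of thumb above (cap any ham-hitting rule at required_hits/3) is simple enough to express directly. A sketch; the function, the default threshold of 5.0, and the divisor are illustrative, not SpamAssassin code:

```python
# Illustrative only: apply the commenter's rule of thumb -- if a rule hit
# any ham at all during mass-check, cap its score at required_hits/divisor.

def cap_score(proposed_score, ham_hits, required_hits=5.0, divisor=3.0):
    if ham_hits > 0:
        return min(proposed_score, required_hits / divisor)
    return proposed_score

print(cap_score(4.2, ham_hits=1))   # capped to required_hits/3
print(cap_score(4.2, ham_hits=0))   # no ham hits: score passes through
```

Loosening `divisor` to 2.0 implements the RH/2 variant suggested for larger corpora.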
Subject: Re: scores are overoptimized for training set

> RH/3 is simply my rule of thumb, because I generally deal with a limited
> corpus of only 100k emails or so. IMO, if tested via corpora with enough
> emails for testing, RH/2 wouldn't be unreasonable.

Sure, the perceptron does the same, but much better than humans (which is why I generally avoid second-guessing scores). Henry is experimenting with rule-accuracy degradation over time, and perhaps the perceptron can handle this even better in the future.
*** Bug 4031 has been marked as a duplicate of this bug. ***
I think the reaction time of the online blocklists plays into this a lot. I increasingly see spam come in that hovers around 4 points because it's listed on only a couple of lists. When I scan it again half an hour later, more lists have it and it scores 6 points and above (not taking bayes/AWL changes into account). I just started collecting these - the most significant jump I noticed was from 2.2 (BAYES_80 + SBL) to 8.1 (+ SpamCop, WS, OB) the next day, with SpamCop catching on after 3 hours, the URIBLs not after 8 (I slept then =)). Verified using the checker at rulesemporium.com. This stuff definitely skews the mass-check numbers.

Quoting Daniel from Bug 3947:

> Well, we want to move to using real-time data for network tests and that
> would improve the scores, but it's a non-trivial amount of work (to log
> complete network data in message headers, and then reuse it in
> mass-check). Patches accepted. :-)

Could it be an easy fix to derive some discounting factors based on the live data we see, and to apply those in the scoring process?
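One possible shape for such a discounting factor, purely hypothetical: down-weight a blocklist rule's contribution by how long the list took to start hitting the spam run. The exponential form and the 3-hour half-life below are invented for illustration, not a proposal that was implemented:

```python
# Hypothetical discounting: a DNSBL hit observed only after the list reacted
# with some lag contributes less to scoring, halving per `half_life` hours.

def discount(hit_score, reaction_lag_hours, half_life=3.0):
    return hit_score * 0.5 ** (reaction_lag_hours / half_life)

print(discount(1.0, 0))   # immediate listing: full weight
print(discount(1.0, 3))   # listed after one half-life: half weight
```

With logged network data in mass-check (the --reuse approach mentioned below in the thread), the lag could be measured per message rather than guessed.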
This should be fixed provided the mass-checks are done with --reuse as much as possible. Closing as FIXED.