SA Bugzilla – Bug 4467
investigate setting BAYES_ scores manually instead of via perceptron
Last modified: 2006-12-04 10:36:47 UTC
It's been a pretty solid FAQ during SpamAssassin 3.0.0's release timeframe that BAYES_99 was scored too low. e.g.:

http://permalink.gmane.org/gmane.mail.spam.spamassassin.general/60217
http://readlist.com/lists/incubator.apache.org/spamassassin-users/0/1500.html

On top of that, the scores for the BAYES_* rules are wholly dependent on external factors that cannot be measured effectively through mass-checks to match all environments. For example, these setups have radically different amounts of accurate training:

- a site-wide autolearning system
- a personalised, extensively hand-trained system with over 10000 mails of each type
- a system that has received the bare minimum "200 of each" training, with a little autolearning on top
- mass-check, with the new sampling method

As a result, I suspect that the Perceptron is going to generate scores that are over-optimized for mass-check only, and under-optimized for the other end-user setups. To avoid this, I suggest that we set the BAYES_* scores manually, by setting them as "userconf" rules.

comments/votes please.
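For illustration, a hand-set BAYES score block could look something like the sketch below. The values are purely illustrative, not a proposal; the four-column form covers SpamAssassin's four score sets, and BAYES_* rules only score in the two Bayes-enabled sets:

```
# hypothetical hand-picked values -- illustrative only, not the shipped scores
# columns: no-bayes/no-net, no-bayes/net, bayes/no-net, bayes/net
score BAYES_00 0 0 -2.5 -2.5
score BAYES_50 0 0  1.0  1.0
score BAYES_95 0 0  3.5  3.5
score BAYES_99 0 0  4.5  4.5
```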
Subject: Re: New: investigate setting BAYES_ scores manually instead of via perceptron

> comments/votes please.

I don't get a vote, but if I did I'd sure be in favor of this!

I seem to recall quite a good deal of discussion back in the 3.0 timeframe on ranges and possible score assignments for the Bayes tests; or at least I think I do. Perhaps there were some useful potential scores in there.

Is the Perceptron smart enough to take fixed scores into account and redistribute the score amongst the other non-fixed rules that hit, or does it just ignore fixed scores? If it takes fixed scores into account, it might be interesting to do several scoring runs with different Bayes scores and see what effect this has on a few of the other more interesting rules, unless this would be a huge pain to attempt.
I'm in full agreement with this idea. And following Loren's comment, it might be worthwhile to:

1) Let the perceptron suggest initial BAYES values.
2) Adjust BAYES_9* rules up towards 5.0 by 25%, 50%, and 75%, and rescore at those levels.

I'd be interested in seeing not only how other rules' scores change, but also the overall FN and FP rates.

Given Justin's four categories of Bayes systems, perhaps it might be worth having two or three Bayes score sets: for Bayes with low confidence (new systems, still feeling their way), Bayes with good confidence, and Bayes with high confidence. I'd have no problem with the "low confidence" scores file being the default, with instructions on how to apply the higher-confidence scores files being included in the INSTALL file.
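The "adjust up towards 5.0 by 25%, 50%, and 75%" step in 2) is just linear interpolation toward the spam threshold. A quick sketch of the arithmetic (the starting scores here are made up purely for illustration, not actual perceptron output):

```python
# Hypothetical perceptron-suggested starting scores (made up for illustration).
suggested = {"BAYES_90": 2.5, "BAYES_95": 3.0, "BAYES_99": 3.5}

def nudge_toward(score, target=5.0, fraction=0.25):
    """Move `score` the given fraction of the remaining distance toward `target`."""
    return round(score + fraction * (target - score), 3)

for fraction in (0.25, 0.50, 0.75):
    adjusted = {rule: nudge_toward(s, fraction=fraction)
                for rule, s in suggested.items()}
    print(f"{int(fraction * 100)}%: {adjusted}")
```

So a suggested BAYES_99 of 3.5 would be rescored at 3.875, 4.25, and 4.625 for the three runs.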
> Is the Perceptron smart enough to take fixed scores into account and
> redistribute the score amongst the other non-fixed rules that hit, or
> does it just ignore fixed scores?

yes, it redistributes.

> If it takes fixed scores into account, it might be interesting to do
> several scoring runs with different Bayes scores and see what effect
> this has on a few of the other more interesting rules, unless this
> would be a huge pain to attempt.

btw, I reread the bug for perceptron runs in 3.1.x; we actually did this on the last perceptron run for 3.1.x, since we replaced some extreme perceptron-generated BAYES scores with saner ones. It made little difference (which was good).
I was thinking the other day, what if we used reuse for BAYES_ rules? This assumes that mass-checkers are running bayes of course.
> I was thinking the other day, what if we used reuse for BAYES_ rules?
> This assumes that mass-checkers are running bayes of course.

I'm not keen on that -- each mass-checker would have differing levels of reliability for their training data. For example, I haven't trained bayes (apart from via autolearning) in 2 years... I wouldn't really want the accuracy of my neglected db to dictate scores for someone who's put in the work to train theirs.

I'm just going to mark this as FIXED for 3.2.0, since the bayes scores in 50_scores.cf *are* marked as immutable anyway since bug 4505.