SA Bugzilla – Bug 3785
Suggestion: Use Bayes classifier to select among score sets
Last modified: 2010-01-27 06:28:41 UTC
(Originally posted by me to the users list; reposting here upon request.)
Rather than dividing the score sets only by whether Bayes is enabled, have multiple score sets and use the Bayes probability to choose which score set to apply. (I.e., there is no direct score for Bayes itself.) A Bayes probability of, say, 0.45 - 0.55 would use the same score set as "without Bayes," on the assumption that in that range Bayes is unable to contribute to the decision.
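A minimal sketch of the selection step described above, in Python rather than SpamAssassin's own Perl. The band edges other than 0.45 and 0.55, the band names, the rule name REMOVE_ME_PHRASE, and all score values are hypothetical and only illustrate the idea; nothing here is an existing SpamAssassin API.

```python
from bisect import bisect_right

# Hypothetical band edges. Only the 0.45 and 0.55 values come from the
# proposal; a real design might use more bands.
BAND_EDGES = [0.45, 0.55]

# One name per band: [0, 0.45), [0.45, 0.55), [0.55, 1.0].
# Treating 0.55 itself as "spammy" is an arbitrary choice made here.
BAND_NAMES = ["bayes_hammy", "no_bayes", "bayes_spammy"]

# Hypothetical per-band scores for an invented rule. Note that there is
# no score for Bayes itself; Bayes only picks the table.
SCORE_SETS = {
    "bayes_hammy": {"REMOVE_ME_PHRASE": 0.1},
    "no_bayes": {"REMOVE_ME_PHRASE": 0.5},
    "bayes_spammy": {"REMOVE_ME_PHRASE": 2.0},
}


def select_score_set(bayes_prob):
    """Return the name of the score set for a Bayes probability in [0, 1]."""
    if not 0.0 <= bayes_prob <= 1.0:
        raise ValueError("Bayes probability must be between 0 and 1")
    return BAND_NAMES[bisect_right(BAND_EDGES, bayes_prob)]


def total_score(bayes_prob, hits):
    """Sum the scores of the rules that hit, using the selected score set."""
    scores = SCORE_SETS[select_score_set(bayes_prob)]
    return sum(scores.get(rule, 0.0) for rule in hits)
```

With these made-up numbers, the same rule hit contributes 0.5 when Bayes is undecided and 2.0 when Bayes already considers the message spammy.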
Alternatively, would this achieve the same thing? If the Bayes probability is within a given distance of 50% (such as your 0.45 to 0.55), fall back to the appropriate "no bayes" score set. If the Bayes probability is outside that range, use the appropriate "with bayes" score set.
(In reply to comment #1)
That wouldn't quite get where I was hoping to go, because the real problem is with low scores for high BAYES_* values. I haven't looked at the BAYES_* scores in 3.1.0 yet, but in 3.0.x I frequently get spams that hit BAYES_99 but pass through because they hit nothing else.
Switching rulesets only when Bayes can't make up its mind would not address that problem. Rather, I'm trying to discover rules that have a different (better) S/O ratio when tested only on "likely" spam, as opposed to when tested on all messages in the input set.
Consider, for example, a rule that looks for "remove me" phrases or URLs. If the full input set includes mailing list messages, the S/O of such a rule may be pretty poor. But if you look only at the subset that Bayesian analysis already deems spammy, the mailing list messages may already have been filtered out, and having a remove-me phrase could become more significant.
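To make the S/O argument concrete, here is a rough Python sketch that measures one rule's S/O over a whole corpus and over only the messages Bayes rates as likely spam. S/O is computed here simply as spam hits divided by all hits; the project's own mass-check tooling may normalise differently, so treat this as an approximation. The Message type, the 0.9 cutoff, the rule name REMOVE_ME, and the toy corpus are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Message:
    is_spam: bool        # hand-classified label
    bayes_prob: float    # Bayes probability for this message
    hits: frozenset      # names of the rules that hit


def s_o_ratio(messages, rule):
    """Fraction of the rule's hits that are on spam, or None if no hits."""
    all_hits = [m for m in messages if rule in m.hits]
    if not all_hits:
        return None
    spam_hits = sum(1 for m in all_hits if m.is_spam)
    return spam_hits / len(all_hits)


def s_o_by_subset(messages, rule, spammy_cutoff=0.9):
    """Return (S/O on all messages, S/O on the Bayes-spammy subset)."""
    likely_spam = [m for m in messages if m.bayes_prob >= spammy_cutoff]
    return s_o_ratio(messages, rule), s_o_ratio(likely_spam, rule)


# Toy corpus: two spams and two mailing list messages all contain a
# "remove me" phrase, but Bayes already separates them.
corpus = [
    Message(True, 0.99, frozenset({"REMOVE_ME"})),
    Message(True, 0.95, frozenset({"REMOVE_ME"})),
    Message(False, 0.05, frozenset({"REMOVE_ME"})),
    Message(False, 0.10, frozenset({"REMOVE_ME"})),
    Message(False, 0.02, frozenset()),
]

overall, within_spammy = s_o_by_subset(corpus, "REMOVE_ME")
```

On this toy data the rule looks useless across the whole corpus (S/O of 0.5) but perfect inside the Bayes-spammy subset (S/O of 1.0), which is the effect the per-band score sets would try to exploit.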
> That wouldn't quite get where I was hoping to go, because the real problem is
> with low scores for high BAYES_* values. I haven't looked at the BAYES_*
> scores in 3.1.0 yet, but in 3.0.x I frequently get spams that hit BAYES_99 but
> pass through because they hit nothing else.

http://spamassassin.apache.org/tests_3_0_x.html

3.0.3 and 3.0.4 have better default scores for the high BAYES values.

BAYES_60 0 0 3.515 1.0
BAYES_80 0 0 3.608 2.0
BAYES_95 0 0 3.514 3.0
BAYES_99 0 0 4.070 3.5
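For reference, a small Python sketch of how to read those four-column score lines. It assumes the usual SpamAssassin column order (score set 0: no Bayes, no network tests; 1: no Bayes, network tests; 2: Bayes, no network tests; 3: Bayes and network tests) and the default required_score of 5.0; a site that has changed its threshold would get different answers. The helper names are hypothetical.

```python
# The score lines quoted in this comment, one rule per line.
SCORE_LINES = """\
BAYES_60 0 0 3.515 1.0
BAYES_80 0 0 3.608 2.0
BAYES_95 0 0 3.514 3.0
BAYES_99 0 0 4.070 3.5
"""

# Assumed spam threshold (SpamAssassin's default required_score).
REQUIRED_SCORE = 5.0


def parse_scores(text):
    """Map each rule name to a tuple of its four per-score-set values."""
    table = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 5:
            continue  # skip blank or malformed lines
        table[parts[0]] = tuple(float(value) for value in parts[1:])
    return table


def passes_alone(rule, score_set, table, required=REQUIRED_SCORE):
    """True if a message hitting only `rule` stays below the threshold."""
    return table[rule][score_set] < required


table = parse_scores(SCORE_LINES)
```

Under those assumptions, `passes_alone("BAYES_99", 3, table)` is true: 3.5 is still below 5.0, so a message that hits BAYES_99 and nothing else would not be tagged on the strength of that one rule.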
it's an interesting idea, but I don't think it'll happen for 3.2.0.
this is definitely worth testing -- I'm not sure it's sufficiently *big* (ie. long timescale) to qualify for a Summer of Code project though...
(In reply to comment #5)
> this is definitely worth testing -- I'm not sure it's sufficiently *big* (ie.
> long timescale) to qualify for a Summer of Code project though...

It might be a decent proof of concept for the (also suggested) pluggable Bayes scoring.
moving most remaining 3.3.0 bugs to 3.3.1 milestone
reassigning, too
Retargeting: Future