SA Bugzilla – Full Text Bug Listing
Summary: | Suggestion: Use Bayes classifier to select among score sets | ||
---|---|---|---|
Product: | Spamassassin | Reporter: | Bart Schaefer <schaefer> |
Component: | Libraries | Assignee: | SpamAssassin Developer Mailing List <dev> |
Status: | NEW --- | ||
Severity: | enhancement | ||
Priority: | P5 | ||
Version: | SVN Trunk (Latest Devel Version) | ||
Target Milestone: | Future | ||
Hardware: | Other | ||
OS: | other | ||
Whiteboard: |
Description
Bart Schaefer
2004-09-17 20:42:24 UTC
Alternately, would this do the same?: if the Bayes probability is within a given range from 50% (such as your 0.45 to 0.55), then drop back to the appropriate "no bayes" score set. If the Bayes probability is outside that range, then use the appropriate "with bayes" score set.

(In reply to comment #1)
That wouldn't quite get where I was hoping to go, because the real problem is with low scores for high BAYES_* values. I haven't looked at the BAYES_* scores in 3.1.0 yet, but in 3.0.x I frequently get spams that hit BAYES_99 but pass through because they hit nothing else. Switching rulesets only when Bayes can't make up its mind would not address that problem; rather, I'm trying to discover rules that have a different (better) S/O ratio when tested only on "likely" spam (as opposed to when tested on all messages in the input set).

Consider, for example, a rule that looks for "remove me" phrases or URLs. If the entire input set contains mailing list messages, the S/O of such a rule may be pretty poor. But if you look only at the subset that is already deemed spammy by Bayesian analysis, mailing list messages might already have been filtered out, and having a remove-me phrase could become more significant.

> That wouldn't quite get where I was hoping to go, because the real problem is
> with low scores for high BAYES_* values. I haven't looked at the BAYES_*
> scores in 3.1.0 yet, but in 3.0.x I frequently get spams that hit BAYES_99 but
> pass through because they hit nothing else.

http://spamassassin.apache.org/tests_3_0_x.html

3.0.3 and 3.0.4 have better default scores for the high BAYES values.

BAYES_60 0 0 3.515 1.0
BAYES_80 0 0 3.608 2.0
BAYES_95 0 0 3.514 3.0
BAYES_99 0 0 4.070 3.5

it's an interesting idea, but I don't think it'll happen for 3.2.0.

this is definitely worth testing -- I'm not sure it's sufficiently *big* (ie. long timescale) to qualify for a Summer of Code project though...
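As a rough illustration of what such a test could measure, here is a minimal Python sketch that compares a rule's S/O ratio over a whole corpus with its S/O ratio over only the messages Bayes already rates as likely spam. Everything in it is an assumption made for illustration: the `Message` type, the `REMOVE_ME` rule name, the 0.8 cutoff, and the toy corpus are hypothetical, and S/O is computed here simply as the share of a rule's hits that land on spam, which may differ from how SpamAssassin's own tooling normalizes it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    is_spam: bool      # ground-truth label from a hand-sorted corpus
    bayes_prob: float  # Bayes probability in [0, 1]
    hits: frozenset    # names of the rules that fired on this message


def so_ratio(messages, rule):
    """Share of the rule's hits that land on spam; None if it never fires."""
    spam_hits = sum(1 for m in messages if rule in m.hits and m.is_spam)
    ham_hits = sum(1 for m in messages if rule in m.hits and not m.is_spam)
    total = spam_hits + ham_hits
    return spam_hits / total if total else None


def so_by_bayes_subset(messages, rule, threshold=0.8):
    """Return (S/O over all messages, S/O over the Bayes-spammy subset)."""
    likely_spam = [m for m in messages if m.bayes_prob >= threshold]
    return so_ratio(messages, rule), so_ratio(likely_spam, rule)


# Hypothetical toy corpus: two mailing-list hams also contain "remove me".
corpus = [
    Message(True, 0.99, frozenset({"REMOVE_ME"})),
    Message(True, 0.95, frozenset({"REMOVE_ME"})),
    Message(False, 0.10, frozenset({"REMOVE_ME"})),
    Message(False, 0.05, frozenset({"REMOVE_ME"})),
    Message(False, 0.20, frozenset()),
    Message(True, 0.90, frozenset()),
]

overall, conditioned = so_by_bayes_subset(corpus, "REMOVE_ME")
print(overall, conditioned)
```

On this made-up corpus the rule looks useless overall but clean within the Bayes-spammy subset, which is the effect the suggestion hopes to find on real mail; whether real rules behave this way is exactly what a mass-check would have to show.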
(In reply to comment #5)
> this is definitely worth testing -- I'm not sure it's sufficiently *big* (ie.
> long timescale) to qualify for a Summer of Code project though...

It might be a decent proof of concept for the (also suggested) pluggable Bayes scoring.

moving most remaining 3.3.0 bugs to 3.3.1 milestone

reassigning, too

Retargeting: Future
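For reference, the simpler fallback proposed in the first reply (use the "no bayes" score set when the Bayes probability sits near 50%, otherwise the "with bayes" set) could be sketched roughly as below. This is not SpamAssassin code: the function names and the `HYPOTHETICAL_RULE` scores are invented for illustration, and the 0.45 to 0.55 band is the example range from the thread. The only thing taken from SpamAssassin itself is the ordering of the four score sets: 0 is no Bayes and no network tests, 1 is network tests only, 2 is Bayes only, and 3 is both.

```python
def pick_score_set(bayes_prob, net_enabled, low=0.45, high=0.55):
    """Return a score set index: 0/1 ignore Bayes, 2/3 use Bayes.

    Falls back to a "no bayes" set when Bayes is undecided, i.e. when
    bayes_prob lies inside [low, high] (bounds are assumed inclusive).
    """
    undecided = low <= bayes_prob <= high
    base = 0 if undecided else 2
    return base + (1 if net_enabled else 0)


def rule_score(scores, bayes_prob, net_enabled):
    """Look up a rule's score in the set chosen for this message."""
    return scores[pick_score_set(bayes_prob, net_enabled)]


# Hypothetical rule with one score per score set (sets 0, 1, 2, 3).
HYPOTHETICAL_RULE = (1.2, 1.0, 0.8, 0.5)

print(rule_score(HYPOTHETICAL_RULE, 0.50, net_enabled=True))   # undecided Bayes
print(rule_score(HYPOTHETICAL_RULE, 0.99, net_enabled=True))   # confident Bayes
```

As the reporter points out in his reply, this only changes behaviour when Bayes cannot make up its mind, so it does not help with messages that hit BAYES_99 and nothing else.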