Bug 3785

Summary: Suggestion: Use Bayes classifier to select among score sets
Product: SpamAssassin
Reporter: Bart Schaefer <schaefer>
Component: Libraries
Assignee: SpamAssassin Developer Mailing List <dev>
Status: NEW
Severity: enhancement
Priority: P5
Version: SVN Trunk (Latest Devel Version)
Target Milestone: Future
Hardware: Other
OS: Other
Whiteboard:

Description Bart Schaefer 2004-09-17 20:42:24 UTC
(Originally posted by me to the users list; reposting here upon request.)

Rather than dividing the score sets into with-Bayes and without-Bayes
variants, have multiple score sets and use the Bayes probability to choose
which score set to apply.  (I.e., there is no direct score for Bayes
itself.)  A Bayes probability of, say, 0.45-0.55 would use the same score
set as "without Bayes," on the assumption that in that range Bayes is
unable to contribute to the decision.
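As a rough illustration of the idea, here is a minimal sketch in Python
(not SpamAssassin code; the rule names, score values, and band boundaries
are all hypothetical):

# Sketch of the proposal: the Bayes probability selects a score set,
# and no direct BAYES_* score is added to the message total.

# Illustrative per-set rule scores; values are made up, not real
# SpamAssassin scores.
SCORE_SETS = {
    "bayes_ham":  {"REMOVE_ME_PHRASE": 0.1, "HTML_ONLY": 0.1},
    "no_bayes":   {"REMOVE_ME_PHRASE": 0.5, "HTML_ONLY": 0.3},
    "bayes_spam": {"REMOVE_ME_PHRASE": 2.0, "HTML_ONLY": 1.0},
}

def select_score_set(bayes_prob: float) -> dict[str, float]:
    """Map a Bayes probability to a score set.  The 0.45-0.55 band
    falls back to the "no Bayes" set, since Bayes is undecided there."""
    if bayes_prob < 0.45:
        return SCORE_SETS["bayes_ham"]
    if bayes_prob <= 0.55:
        return SCORE_SETS["no_bayes"]
    return SCORE_SETS["bayes_spam"]

def total_score(hit_rules: list[str], bayes_prob: float) -> float:
    """Sum the scores of the rules that hit, using the selected set."""
    scores = select_score_set(bayes_prob)
    return sum(scores.get(rule, 0.0) for rule in hit_rules)

# A BAYES_99-ish message uses the "bayes_spam" set, where weak rules
# like a remove-me phrase carry more weight:
print(total_score(["REMOVE_ME_PHRASE"], bayes_prob=0.99))  # 2.0
print(total_score(["REMOVE_ME_PHRASE"], bayes_prob=0.50))  # 0.5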
Comment 1 Bob Menschel 2005-04-07 21:28:55 UTC
Alternatively, would this accomplish the same thing? If the Bayes
probability is within a given range of 50% (such as your 0.45 to 0.55),
drop back to the appropriate "no bayes" score set. If the Bayes
probability is outside that range, use the appropriate "with bayes" score
set.
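A minimal sketch of this two-way variant, reusing the hypothetical names
from the sketch above:

def choose_set(bayes_prob: float, width: float = 0.05) -> str:
    """Two-way variant: fall back to the "no bayes" score set whenever
    the Bayes probability is within `width` of 50%."""
    if abs(bayes_prob - 0.5) <= width:
        return "no_bayes"
    return "with_bayes"

print(choose_set(0.52))  # "no_bayes" -- Bayes is undecided
print(choose_set(0.99))  # "with_bayes"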
Comment 2 Bart Schaefer 2005-09-16 16:46:11 UTC
(In reply to comment #1)

That wouldn't quite get where I was hoping to go, because the real problem is
with low scores for high BAYES_* values.  I haven't looked at the BAYES_*
scores in 3.1.0 yet, but in 3.0.x I frequently get spams that hit BAYES_99 but
pass through because they hit nothing else.  Switching rulesets only when Bayes
can't make up its mind would not address that problem; rather, I'm trying to
discover rules that have a different (better) S/O ratio when tested only on
"likely" spam (as opposed to when tested on all messages in the input set).

Consider, for example, a rule that looks for "remove me" phrases or URLs.  If
the entire input set contains mailing list messages, the S/O of such a rule may
be pretty poor.  But if you look only at the subset that is already deemed
spammy by Bayesian analysis, mailing list messages might already have been
filtered out, and having a remove-me phrase could become more significant.
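The measurement comment 2 proposes could be sketched like this
(hypothetical Python, assuming a corpus of (is_spam, bayes_prob, hit_rules)
records, and treating S/O as the fraction of a rule's hits that land on
spam):

def s_over_o(corpus, rule, min_bayes=None):
    """S/O for `rule`: spam hits / (spam hits + ham hits), optionally
    restricted to messages whose Bayes probability exceeds min_bayes."""
    spam_hits = ham_hits = 0
    for is_spam, bayes_prob, hits in corpus:
        if min_bayes is not None and bayes_prob < min_bayes:
            continue
        if rule in hits:
            if is_spam:
                spam_hits += 1
            else:
                ham_hits += 1
    total = spam_hits + ham_hits
    return spam_hits / total if total else 0.0

# Made-up corpus: a remove-me rule looks mediocre on everything but
# strong on the Bayes-likely-spam subset, as comment 2 describes.
corpus = [
    (False, 0.10, {"REMOVE_ME_PHRASE"}),  # mailing-list ham
    (False, 0.20, {"REMOVE_ME_PHRASE"}),
    (True,  0.95, {"REMOVE_ME_PHRASE"}),
    (True,  0.99, {"REMOVE_ME_PHRASE"}),
]
print(s_over_o(corpus, "REMOVE_ME_PHRASE"))                  # 0.5 overall
print(s_over_o(corpus, "REMOVE_ME_PHRASE", min_bayes=0.55))  # 1.0 on likely spam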
Comment 3 Warren Togami 2005-09-16 17:49:26 UTC
> That wouldn't quite get where I was hoping to go, because the real problem is
> with  low scores for high BAYES_* values.  I haven't looked at the BAYES_*
> scores in 3.1.0 yet, but in 3.0.x I frequently get spams that hit BAYES_99 but
> pass through because they hit nothing else.

http://spamassassin.apache.org/tests_3_0_x.html
3.0.3 and 3.0.4 have better default scores for the high BAYES values.
The four columns below are the standard score sets (no net / no Bayes,
net / no Bayes, no net / Bayes, net / Bayes):

BAYES_60    0  0  3.515  1.0
BAYES_80    0  0  3.608  2.0
BAYES_95    0  0  3.514  3.0
BAYES_99    0  0  4.070  3.5
Comment 4 Justin Mason 2006-12-05 05:55:32 UTC
It's an interesting idea, but I don't think it'll happen for 3.2.0.
Comment 5 Justin Mason 2007-02-16 14:34:05 UTC
This is definitely worth testing -- I'm not sure it's sufficiently *big* (i.e.
long timescale) to qualify for a Summer of Code project, though...
Comment 6 Dallas Engelken 2007-02-16 14:40:57 UTC
(In reply to comment #5)
> This is definitely worth testing -- I'm not sure it's sufficiently *big* (i.e.
> long timescale) to qualify for a Summer of Code project, though...

It might be a decent proof of concept for the (also suggested) pluggable Bayes
scoring.
Comment 7 Justin Mason 2010-01-27 02:20:21 UTC
moving most remaining 3.3.0 bugs to 3.3.1 milestone
Comment 8 Justin Mason 2010-01-27 03:16:15 UTC
reassigning, too
Comment 9 Mark Martinec 2010-01-27 06:28:41 UTC
Retargeting: Future