Bug 3429 - bayes scores
Summary: bayes scores
Status: RESOLVED WORKSFORME
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules (show other bugs)
Version: 2.63
Hardware: All
OS: All
Importance: P5 enhancement
Target Milestone: 3.1.0
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2004-05-26 11:08 UTC by Sergey Shmelev
Modified: 2004-05-27 05:37 UTC (History)
Description Sergey Shmelev 2004-05-26 11:08:52 UTC
Let's look at the scores:

score BAYES_00 0 0 -4.901 -4.900
score BAYES_01 0 0 -0.600 -1.524
score BAYES_10 0 0 -0.734 -0.908
score BAYES_20 0 0 -0.127 -1.428
score BAYES_30 0 0 -0.349 -0.904
score BAYES_40 0 0 -0.001 -0.001
score BAYES_44 0 0 -0.001 -0.001
score BAYES_50 0 0 0.001 0.001
score BAYES_56 0 0 0.001 0.001
score BAYES_60 0 0 1.789 1.592
score BAYES_70 0 0 2.142 2.255
score BAYES_80 0 0 2.442 1.657
score BAYES_90 0 0 2.454 2.101
score BAYES_99 0 0 5.400 5.400

The rules BAYES_30, BAYES_40, BAYES_44, BAYES_50, and BAYES_56 are not effective.

The following logarithmic rules would be more effective:

score BAYES_007  from 0  to exp(-5)
score BAYES_018  from exp(-5) to exp(-4)
score BAYES_049  from exp(-4) to exp(-3)
score BAYES_135  from exp(-3) to exp(-2)
score BAYES_367  from exp(-2) to 1 - exp(-2)
score BAYES_633
score BAYES_865
score BAYES_951
score BAYES_982
score BAYES_993  from 1-exp(-5) to 1
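As a rough illustration (the helper name is my own, not from SpamAssassin), the proposed logarithmically spaced boundaries can be generated programmatically:

```python
import math

def log_boundaries(depth=5):
    """Boundaries for logarithmically spaced Bayes buckets:
    0, exp(-depth), ..., exp(-1), 1 - exp(-1), ..., 1 - exp(-depth), 1."""
    left = [math.exp(-k) for k in range(depth, 0, -1)]       # exp(-5) .. exp(-1)
    right = [1 - math.exp(-k) for k in range(1, depth + 1)]  # 1-exp(-1) .. 1-exp(-5)
    return [0.0] + left + right + [1.0]

bounds = log_boundaries()
for lo, hi in zip(bounds, bounds[1:]):
    print(f"{lo:.6f}-{hi:.6f}")
```

With depth=5 this yields 11 intervals; e.g. exp(-5) ≈ 0.006738, matching the BAYES_007 cutoff above.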
Comment 1 Daniel Quinlan 2004-05-26 13:35:14 UTC
Subject: Re:  New: bayes scores

> The following logarithmic rules would be more effective:
> 
> score BAYES_007  from 0  to exp(-5)
> score BAYES_018  from exp(-5) to exp(-4)
> score BAYES_049  from exp(-4) to exp(-3)
> score BAYES_135  from exp(-3) to exp(-2)
> score BAYES_367  from exp(-2) to 1 - exp(-2)
> score BAYES_633
> score BAYES_865
> score BAYES_951
> score BAYES_982
> score BAYES_993  from 1-exp(-5) to 1

I like the concept.  I pretty much ended up with an experimentally
derived set of ranges in 3.0 that is not too different.  I'm willing to
give yours a look:

current:

body BAYES_00		eval:check_bayes('0.00', '0.01')
body BAYES_05		eval:check_bayes('0.01', '0.05')
body BAYES_10		eval:check_bayes('0.05', '0.20')
body BAYES_25		eval:check_bayes('0.20', '0.40')
body BAYES_50		eval:check_bayes('0.40', '0.60')
body BAYES_75		eval:check_bayes('0.60', '0.80')
body BAYES_90		eval:check_bayes('0.80', '0.95')
body BAYES_95		eval:check_bayes('0.95', '0.99')
body BAYES_99		eval:check_bayes('0.99', '1.00')

 0.000000-0.010000 39.550
 0.010000-0.050000  0.579
 0.050000-0.200000  0.306 <- thin
 0.200000-0.400000  0.318 <- thin
 0.400000-0.600000  4.385
 0.600000-0.800000  1.337
 0.800000-0.950000  1.401
 0.950000-0.990000  1.200
 0.990000-1.000000 50.923

new:

 0.000000-0.006738 39.485
 0.006738-0.018316  0.159 <- thin
 0.018316-0.049787  0.441 <- thin
 0.049787-0.135335  0.258 <- thin
 0.135335-0.367879  0.352 <- thin
 0.367879-0.632121  4.717
 0.632121-0.864665  1.511
 0.864665-0.950213  0.960
 0.950213-0.981684  0.779
 0.981684-0.993262  0.697
 0.993262-1.000000 50.641

I think some of the ranges are too empty.  Let's try:

0 to exp(-8)
exp(-8) to exp(-4)
exp(-4) to exp(-2)
exp(-2) to exp(-1)
exp(-1) to 1-exp(-1)
1-exp(-1) to 1-exp(-2)
1-exp(-2) to 1-exp(-4)
1-exp(-4) to 1-exp(-8)
1-exp(-8) to 1

 0.000000-0.000335 39.046
 0.000335-0.018316 0.598
 0.018316-0.135335 0.699
 0.135335-0.367879 0.352
 0.367879-0.632121 4.717
 0.632121-0.864665 1.511
 0.864665-0.981684 1.739
 0.981684-0.999665 2.450
 0.999665-1.000000 48.888

That's better.  Maybe we could...
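To make "thin" concrete, here is a sketch, with an invented toy corpus, of counting messages per interval for the revised boundaries above; intervals catching few messages give the score optimizer little signal:

```python
import math

# Invented (probability, is_spam) pairs standing in for a scored corpus.
messages = [(0.0001, False), (0.02, False), (0.5, True), (0.5, False),
            (0.97, True), (0.9999, True), (0.9999, True)]

# The revised boundaries: 0, exp(-8), exp(-4), exp(-2), exp(-1), ..., 1.
bounds = [0.0, math.exp(-8), math.exp(-4), math.exp(-2), math.exp(-1),
          1 - math.exp(-1), 1 - math.exp(-2), 1 - math.exp(-4),
          1 - math.exp(-8), 1.0]

def bucket_counts(msgs, bounds):
    """Count how many messages land in each [lo, hi) interval
    (the final interval also includes 1.0 itself)."""
    counts = [0] * (len(bounds) - 1)
    for p, _ in msgs:
        for i, (lo, hi) in enumerate(zip(bounds, bounds[1:])):
            if lo <= p < hi or (hi == 1.0 and p == 1.0):
                counts[i] += 1
                break
    return counts

counts = bucket_counts(messages, bounds)
```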

Comment 2 Sergey Shmelev 2004-05-26 15:03:13 UTC
Let's use a Ratio*Wide criterion.

>new
>
> 0.000000-0.006738 39.485
> 0.006738-0.018316  0.159 <- thin
> 0.018316-0.049787  0.441 <- thin
> 0.049787-0.135335  0.258 <- thin
They are thin (small Ratio), but they can have a strong Wide (popularity, frequency).

> 0.135335-0.367879  0.352 <- thin
> 0.367879-0.632121  4.717
> 0.632121-0.864665  1.511
> 0.864665-0.950213  0.960
> 0.950213-0.981684  0.779
> 0.981684-0.993262  0.697
> 0.993262-1.000000 50.641

>I think some of the ranges are too empty.  Let's try:

>0 to exp(-8)
>exp(-8) to exp(-4)
>exp(-4) to exp(-2)
>exp(-2) to exp(-1)
>exp(-1) to 1-exp(-1)
>1-exp(-1) to 1-exp(-2)
>1-exp(-2) to 1-exp(-4)
>1-exp(-4) to 1-exp(-8)
>1-exp(-8) to 1

> 0.000000-0.000335 39.046
> 0.000335-0.018316 0.598
> 0.018316-0.135335 0.699
> 0.135335-0.367879 0.352
> 0.367879-0.632121 4.717
> 0.632121-0.864665 1.511
> 0.864665-0.981684 1.739
> 0.981684-0.999665 2.450
> 0.999665-1.000000 48.888

>That's better.  Maybe we could...

I don't think that it's better...

We should sum Ratio * Wide over every row and search for the combination
of intervals that maximizes the sum.



Comment 3 Daniel Quinlan 2004-05-26 17:22:45 UTC
Subject: Re:  bayes scores

> They are thin (small Ratio) but they can have strong Wide (popularity,
> frequency)

I have no idea what you mean.

By "thin", I meant that not enough messages fall into the category, so
the score optimizer (GA or the perceptron) would have a hard time
determining the correct score for the rule.

It's better to have more messages (spam plus ham) falling into each
rule bucket.

> We should Sum all Ratio * Wide for every row and search combination
> that maximize the Sum

I don't understand.

Comment 4 Sergey Shmelev 2004-05-26 23:11:08 UTC
First, we should create a mathematical criterion of rule quality and effectiveness.

(I suppose this criterion would accept or reject new rules and remove old rules.)

The first and main criterion is the ham/spam ratio for whitelist rules
(score < 0) and the spam/ham ratio for blacklist rules (score > 0).

The second criterion is "popularity" or "Wide": ham/total_hams for
whitelist rules and spam/total_spams for blacklist rules.

The third criterion is correlation with other rules. For Bayes rules the
correlation is 0; that is good.

For better quality all coefficients must be > 200.

There are rules that have a big Ratio (big scores) but fire seldom.
There are rules that have a small Ratio (small scores) but fire on
almost every message, and there are many rules of this type.

We don't need rules with small scores that fire seldom.

The total criterion I define as the product Ratio * Wide.
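A minimal sketch of this Ratio * Wide product for blacklist rules; the rule names, counts, and the +1 smoothing in the denominator are my own illustrative assumptions:

```python
# Invented per-rule hit counts against a labeled corpus.
rules = {
    "BAYES_99":  {"spam_hits": 5000, "ham_hits": 10},
    "RARE_RULE": {"spam_hits": 12,   "ham_hits": 0},
}
TOTAL_SPAMS = 10_000

def quality(rule):
    # Ratio: spam/ham hit ratio for a blacklist rule (+1 avoids division by zero).
    ratio = rule["spam_hits"] / (rule["ham_hits"] + 1)
    # Wide: fraction of all spam the rule fires on (its "popularity").
    wide = rule["spam_hits"] / TOTAL_SPAMS
    return ratio * wide

scores = {name: quality(r) for name, r in rules.items()}
```

A rule with a big Ratio but tiny Wide (like RARE_RULE here) scores low, matching the argument above.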


 
Comment 5 Sergey Shmelev 2004-05-26 23:46:31 UTC
> so the score optimizer (GA or the perceptron)

Why do you use a genetic algorithm and the perceptron instead of some statistical rule?

For example

Score = log(spams / (20 + hams_detected_as_spam)) for blacklist rules.

(20 is a penalty for low-accuracy data.)
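A direct transcription of this proposal into code (the function name is mine; note that log() requires at least one spam hit):

```python
import math

def blacklist_score(spam_hits, hams_detected_as_spam, penalty=20):
    """Sergey's proposed statistical score for a blacklist rule:
    log(spams / (penalty + hams_detected_as_spam)).
    The penalty term damps scores for rules with little supporting data."""
    return math.log(spam_hits / (penalty + hams_detected_as_spam))
```

For example, 2000 spam hits with no false positives gives log(100) ≈ 4.6, while 80 false positives drops it to log(20) ≈ 3.0.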
Comment 6 Daniel Quinlan 2004-05-27 11:05:29 UTC
Subject: Re:  bayes scores

> First We should create a mathematical criteria of rule quality and
> effectivly.
> 
> (I suppose that this criteria reject/commit new rules and remove old rules)
> 
> The first and main criteria is ham/spam ratio for whitelist rules
> (score<0) and spam/ham ratio for blacklist rules (score>0).

We already have criteria.

We use the S/O ratio (spam/overall = spam/(ham+spam) using a 50/50
weighting of ham to spam so the weighting is constant).  High is good
for spam rules.  Low is good for ham rules.

We also use a RANK number which is a relative ranking system of each
rule compared to every other rule.

We also use the hit rate.  SPAM% for spam rules and HAM% for ham rules.

And also we use overlap (or correlation) of rules to eliminate rules
that overlap with other rules too much.

At the end of the day, however, the only thing that matters is the score
generated by the perceptron.  It does a better job than other simple
measures of setting scores because interactions between rules are too
complicated to represent with simple formulas.
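The 50/50-weighted S/O ratio described above can be sketched like this (a simplified illustration, not the actual mass-check implementation):

```python
def s_over_o(spam_hits, ham_hits, total_spams, total_hams):
    """S/O = spam/(spam+ham), computed on per-corpus hit *rates* so
    ham and spam are weighted 50/50 regardless of corpus sizes."""
    spam_rate = spam_hits / total_spams
    ham_rate = ham_hits / total_hams
    if spam_rate + ham_rate == 0:
        return 0.5  # the rule never fires: no information either way
    return spam_rate / (spam_rate + ham_rate)
```

High S/O is good for spam rules, low is good for ham rules; 0.5 means the rule carries no signal.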

Comment 7 Sergey Shmelev 2004-05-27 11:33:46 UTC
>we already have criteria.

>We use the S/O ratio (spam/overall = spam/(ham+spam) using a 50/50
>weighting of ham to spam so the weighting is constant).  High is good
>for spam rules.  Low is good for ham rules.

>We also use a RANK number which is a relative ranking system of each
>rule compared to every other rule.

Is RANK simply a sort by the other criteria?

>We also use the hit rate.  SPAM% for spam rules and HAM% for ham rules.

Thank you for criteria!

We have 3 coefficients: S/O, HitRate, and Overlap.

What about potential forging?

Let's talk about S/O and HitRate.

What should S/O and HitRate be for a new rule to be accepted?

Can we publish a formula, S/O * HitRate > threshold, for accepting new rules?

Where can users find the S/O and HitRate values for all rules?

On the page http://www.spamassassin.org/tests.html I see only scores.

Can we publish the corpus size, the number of hams/spams, and the S/O
ratio and HitRate for every rule on that page?


Thank you!





Comment 8 Sergey Shmelev 2004-05-27 11:48:54 UTC
For Bayes rules we should select the intervals that maximize the sum of
S/O * HitRate:

BAYES_INTERVAL1  S/O * HitRate = B1
......
BAYES_INTERVALN  S/O * HitRate = BN

Sum = B1 + ... + BN
Comment 9 Sergey Shmelev 2004-05-27 12:23:13 UTC
For Bayes rules we should select the intervals
that maximize the sum of S/O * HitRate:

BAYES_INTERVAL1  S/O * HitRate = B1
BAYES_INTERVAL2  S/O * HitRate = B2
......
BAYES_INTERVALN  S/O * HitRate = BN

Sum = B1 + B2 + ... + BN,  N fixed

I think this sum will be maximal when B1 is about equal to B2, B2 to B3,
and so on through BN.
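One way to read this proposal: brute-force the interval boundaries that maximize the summed per-bucket figure. A toy sketch (the corpus and candidate boundaries are invented; max(so, 1 - so) credits both ham-pure and spam-pure buckets):

```python
from itertools import combinations

# Invented scored corpus: (Bayes probability, is_spam).
corpus = [(0.01, False), (0.02, False), (0.1, False), (0.4, False),
          (0.55, True), (0.8, True), (0.95, True), (0.999, True)]

candidates = [0.1, 0.3, 0.5, 0.7, 0.9]  # candidate inner boundaries

def bucket_sum(bounds):
    """Sum over buckets of (purity) * (hit rate), a Ratio*Wide-style figure."""
    total = 0.0
    for lo, hi in zip(bounds, bounds[1:]):
        hits = [s for p, s in corpus if lo <= p < hi or (hi == 1.0 and p == 1.0)]
        if not hits:
            continue
        so = sum(hits) / len(hits)                 # spam fraction in the bucket
        total += max(so, 1 - so) * len(hits) / len(corpus)
    return total

# Exhaustive search over all 2-inner-boundary splits (N = 3 buckets).
best = max(combinations(candidates, 2),
           key=lambda c: bucket_sum([0.0, *c, 1.0]))
```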

--------

Another idea is to map the Bayes probability directly to a score:

Bayes_Score = Constant1 * log(BAYES_PROBABILITY)       if BAYES_PROBABILITY < 0.5
Bayes_Score = -Constant2 * log(1 - BAYES_PROBABILITY)  if BAYES_PROBABILITY > 0.5

We only need to select Constant1 and Constant2.
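A sketch of this piecewise mapping (the defaults and the behaviour at exactly 0.5 are my choices; the proposal only specifies p < 0.5 and p > 0.5):

```python
import math

def bayes_score(p, constant1=1.0, constant2=1.0):
    """Map a Bayes probability to a score: negative (ham-like) below 0.5,
    positive (spam-like) above.  Requires 0 < p < 1."""
    if p < 0.5:
        return constant1 * math.log(p)        # log of p < 1 is negative
    return -constant2 * math.log(1 - p)       # -log of (1 - p) < 1 is positive
```

With constant1 = constant2 the mapping is symmetric: bayes_score(0.01) ≈ -4.61 and bayes_score(0.99) ≈ +4.61.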


Comment 10 Daniel Quinlan 2004-05-27 13:37:20 UTC
Closing as WORKSFORME. Maybe we'll tweak the ranges, but I don't want to argue
with you about it.