SA Bugzilla – Bug 3429
bayes scores
Last modified: 2004-05-27 05:37:20 UTC
Let's look at the scores:

score BAYES_00 0 0 -4.901 -4.900
score BAYES_01 0 0 -0.600 -1.524
score BAYES_10 0 0 -0.734 -0.908
score BAYES_20 0 0 -0.127 -1.428
score BAYES_30 0 0 -0.349 -0.904
score BAYES_40 0 0 -0.001 -0.001
score BAYES_44 0 0 -0.001 -0.001
score BAYES_50 0 0 0.001 0.001
score BAYES_56 0 0 0.001 0.001
score BAYES_60 0 0 1.789 1.592
score BAYES_70 0 0 2.142 2.255
score BAYES_80 0 0 2.442 1.657
score BAYES_90 0 0 2.454 2.101
score BAYES_99 0 0 5.400 5.400

The rules BAYES_30, BAYES_40, BAYES_44, BAYES_50 and BAYES_56 are not effective. Logarithmic rules such as the following would be more effective:

score BAYES_007 from 0 to exp(-5)
score BAYES_018 from exp(-5) to exp(-4)
score BAYES_049 from exp(-4) to exp(-3)
score BAYES_135 from exp(-3) to exp(-2)
score BAYES_367 from exp(-2) to 1 - exp(-2)
score BAYES_633
score BAYES_865
score BAYES_951
score BAYES_982
score BAYES_993 from 1-exp(-5) to 1
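The proposed edges mirror the exponential points around 0.5; a short sketch (my own illustration, not part of SpamAssassin) computes them directly:

```python
import math

# Low edges are exp(-k) for k = 5..1; high edges mirror them as 1 - exp(-k).
low = [0.0] + [math.exp(-k) for k in range(5, 0, -1)]      # 0, e^-5 .. e^-1
high = [1.0 - math.exp(-k) for k in range(1, 6)] + [1.0]   # 1-e^-1 .. 1-e^-5, 1
edges = low + high

# Print the resulting probability buckets.
for lo, hi in zip(edges, edges[1:]):
    print(f"{lo:.6f} - {hi:.6f}")
```

These are exactly the interval boundaries (0.006738, 0.018316, 0.049787, 0.135335, ...) that appear in the tables later in this bug.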
Subject: Re: New: bayes scores

> More effective will be next logarithmic rules
>
> score BAYES_007 from 0 to exp(-5)
> score BAYES_018 from exp(-5) to exp(-4)
> score BAYES_049 from exp(-4) to exp(-3)
> score BAYES_135 from exp(-3) to exp(-2)
> score BAYES_367 from exp(-2) to 1 - exp(-2)
> score BAYES_633
> score BAYES_865
> score BAYES_951
> score BAYES_982
> score BAYES_993 from 1-exp(-5) to 1

I like the concept. I pretty much ended up with an experimentally derived ranging in 3.0 that is not too different. I'm willing to give yours a look.

current:

body BAYES_00 eval:check_bayes('0.00', '0.01')
body BAYES_05 eval:check_bayes('0.01', '0.05')
body BAYES_10 eval:check_bayes('0.05', '0.20')
body BAYES_25 eval:check_bayes('0.20', '0.40')
body BAYES_50 eval:check_bayes('0.40', '0.60')
body BAYES_75 eval:check_bayes('0.60', '0.80')
body BAYES_90 eval:check_bayes('0.80', '0.95')
body BAYES_95 eval:check_bayes('0.95', '0.99')
body BAYES_99 eval:check_bayes('0.99', '1.00')

0.000000-0.010000 39.550
0.010000-0.050000  0.579
0.050000-0.200000  0.306 <- thin
0.200000-0.400000  0.318 <- thin
0.400000-0.600000  4.385
0.600000-0.800000  1.337
0.800000-0.950000  1.401
0.950000-0.990000  1.200
0.990000-1.000000 50.923

new:

0.000000-0.006738 39.485
0.006738-0.018316  0.159 <- thin
0.018316-0.049787  0.441 <- thin
0.049787-0.135335  0.258 <- thin
0.135335-0.367879  0.352 <- thin
0.367879-0.632121  4.717
0.632121-0.864665  1.511
0.864665-0.950213  0.960
0.950213-0.981684  0.779
0.981684-0.993262  0.697
0.993262-1.000000 50.641

I think some of the ranges are too empty. Let's try:

0 to exp(-8)
exp(-8) to exp(-4)
exp(-4) to exp(-2)
exp(-2) to exp(-1)
exp(-1) to 1-exp(-1)
1-exp(-1) to 1-exp(-2)
1-exp(-2) to 1-exp(-4)
1-exp(-4) to 1-exp(-8)
1-exp(-8) to 1

0.000000-0.000335 39.046
0.000335-0.018316  0.598
0.018316-0.135335  0.699
0.135335-0.367879  0.352
0.367879-0.632121  4.717
0.632121-0.864665  1.511
0.864665-0.981684  1.739
0.981684-0.999665  2.450
0.999665-1.000000 48.888

That's better. Maybe we could...
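The coarser edge set tried above can be checked against the decimal boundaries in the table; a small sketch (my own illustration):

```python
import math

# Edges 0, e^-8, e^-4, e^-2, e^-1, then mirrored as 1 - e^-k, then 1.
ks = [8, 4, 2, 1]
edges = ([0.0]
         + [math.exp(-k) for k in ks]
         + [1.0 - math.exp(-k) for k in reversed(ks)]
         + [1.0])

for lo, hi in zip(edges, edges[1:]):
    print(f"{lo:.6f}-{hi:.6f}")
```

This reproduces the nine buckets in the table (0.000335 is exp(-8), 0.999665 is 1-exp(-8), and so on).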
Let's use a Ratio*Wide criterion.

> new
>
> 0.000000-0.006738 39.485
> 0.006738-0.018316  0.159 <- thin
> 0.018316-0.049787  0.441 <- thin
> 0.049787-0.135335  0.258 <- thin

They are thin (small Ratio), but they can have a strong Wide (popularity, frequency).

> 0.135335-0.367879  0.352 <- thin
> 0.367879-0.632121  4.717
> 0.632121-0.864665  1.511
> 0.864665-0.950213  0.960
> 0.950213-0.981684  0.779
> 0.981684-0.993262  0.697
> 0.993262-1.000000 50.641
>
> I think some of the ranges are too empty. Let's try:
>
> 0 to exp(-8)
> exp(-4) to exp(-2)
> exp(-2) to exp(-1)
> exp(-1) to 1-exp(-1)
> 1-exp(-1) to 1-exp(-2)
> 1-exp(-2) to 1-exp(-4)
> 1-exp(-4) to 1-exp(-8)
> 1-exp(-8) to 1
>
> 0.000000-0.000335 39.046
> 0.000335-0.018316  0.598
> 0.018316-0.135335  0.699
> 0.135335-0.367879  0.352
> 0.367879-0.632121  4.717
> 0.632121-0.864665  1.511
> 0.864665-0.981684  1.739
> 0.981684-0.999665  2.450
> 0.999665-1.000000 48.888
>
> That's better. Maybe we could...

I don't think that it's better. We should sum Ratio * Wide over every row and search for the combination of intervals that maximizes the sum.
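One reading of that proposal, sketched with synthetic data (the Bayes probabilities, the smoothing "+1", and the candidate edges here are all my own assumptions, not from the bug):

```python
import itertools
import math
import random

# Synthetic labelled corpus: per-message Bayes probabilities.
random.seed(1)
spam = [min(0.999, random.betavariate(8, 1)) for _ in range(500)]
ham = [max(0.001, random.betavariate(1, 8)) for _ in range(500)]

def bucket_counts(probs, edges):
    """Count how many probabilities fall into each [lo, hi) bucket."""
    counts = [0] * (len(edges) - 1)
    for p in probs:
        for i, (lo, hi) in enumerate(zip(edges, edges[1:])):
            if lo <= p < hi or (hi == 1.0 and p == 1.0):
                counts[i] += 1
                break
    return counts

def objective(edges):
    """Sum of Ratio * Wide over all buckets, as proposed above."""
    s_counts = bucket_counts(spam, edges)
    h_counts = bucket_counts(ham, edges)
    total = 0.0
    for s, h in zip(s_counts, h_counts):
        ratio = max(s, h) / (min(s, h) + 1)          # spam/ham or ham/spam, +1 smoothing
        wide = (s + h) / (len(spam) + len(ham))      # popularity: fraction of messages hit
        total += ratio * wide
    return total

# Brute-force search over subsets of candidate interior edges.
candidates = [math.exp(-k) for k in (8, 4, 2, 1)]
candidates += [1.0 - c for c in candidates]
best = max(
    (tuple(sorted(sub))
     for r in range(2, len(candidates) + 1)
     for sub in itertools.combinations(candidates, r)),
    key=lambda sub: objective((0.0,) + sub + (1.0,)),
)
print("best interior edges:", [round(e, 6) for e in best])
```

With real corpus data this would pick the interval set maximizing the proposed sum; the exhaustive search is only feasible because the candidate edge list is tiny.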
Subject: Re: bayes scores

> They are thin (small Ratio) but they can have strong Wide (popularity,
> frequency)

I have no idea what you mean. By "thin", I meant that not enough messages fall into the category, so the score optimizer (GA or the perceptron) would have a hard time determining the correct score for the rule. It's better to have more messages (spam plus ham) falling into each rule bucket.

> We should Sum all Ratio * Wide for every row and search combination
> that maximize the Sum

I don't understand.
First we should create a mathematical criterion of rule quality and effectiveness. (I suppose this criterion would accept/reject new rules and remove old ones.)

The first and main criterion is the ham/spam ratio for whitelist rules (score < 0) and the spam/ham ratio for blacklist rules (score > 0).

The second criterion is "popularity" or "wide": ham/total_hams for whitelist rules and spam/total_spams for blacklist rules.

The third criterion is correlation with other rules. For Bayes rules, correlation = 0; that is good.

For good quality, all the ratio coefficients should be greater than 200.

There are rules that have the biggest Ratio (big scores) but fire seldom. There are rules that have a small Ratio (small scores) but fire in almost every message, and there are many rules of this type. We don't need rules with small scores that fire seldom.

The total criterion I define as the product Ratio*Wide.
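A minimal sketch of the two criteria for a blacklist rule (the counts here are hypothetical, purely for illustration):

```python
# Hypothetical corpus and rule hit counts.
total_spams, total_hams = 10000, 10000
spam_hits, ham_hits = 2400, 10   # messages the rule fired on

ratio = spam_hits / max(ham_hits, 1)   # spam/ham ratio: accuracy of the rule
wide = spam_hits / total_spams         # "popularity": hit rate among spam
quality = ratio * wide                 # the proposed combined criterion

print(ratio, wide, round(quality, 3))  # 240.0 0.24 57.6
```

For a whitelist rule the ratio and the hit rate would be computed the other way around (ham/spam and ham/total_hams).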
> so the score optimizer (GA or the perceptron)

Why do you use a genetic algorithm and a perceptron instead of some statistical rule? For example, for blacklist rules:

Score = log(spams / (20 + hams_detected_as_spam))

(20 is a penalty for low-accuracy data.)
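The suggested statistical score is straightforward to compute; a sketch with hypothetical counts (the function name and the example numbers are mine):

```python
import math

def blacklist_score(spams_hit, hams_hit, penalty=20):
    """Proposed statistical score for a blacklist rule.
    `penalty` dampens the score of rules with little ham evidence."""
    return math.log(spams_hit / (penalty + hams_hit))

# Example: a rule hitting 2400 spams and 10 hams.
print(round(blacklist_score(2400, 10), 3))  # log(2400/30) = log(80) ≈ 4.382
```

Note how the penalty keeps a rule with very few hits (e.g. 5 spams, 0 hams) from getting an inflated score: log(5/20) is negative.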
Subject: Re: bayes scores

> First We should create a mathematical criteria of rule quality and
> effectivly.
>
> (I suppose that this criteria reject/commit new rules and remove old rules)
>
> The first and main criteria is ham/spam ratio for whitelist rules
> (score<0) and spam/ham ratio for blacklist rules (score>0).

We already have criteria. We use the S/O ratio (spam/overall = spam/(ham+spam), using a 50/50 weighting of ham to spam so the weighting is constant). High is good for spam rules. Low is good for ham rules.

We also use a RANK number, which is a relative ranking system of each rule compared to every other rule.

We also use the hit rate: SPAM% for spam rules and HAM% for ham rules.

And we also use overlap (or correlation) of rules, to eliminate rules that overlap with other rules too much.

At the end of the day, however, the only thing that matters is the score generated by the perceptron. It does a better job than other simple measures of setting scores because interactions between rules are too complicated to represent with simple formulas.
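My reading of the S/O description, as a sketch (the function and example counts are mine; the 50/50 weighting is done by using per-class hit *fractions* rather than raw counts):

```python
def s_o(spam_hits, ham_hits, total_spam, total_ham):
    """S/O ratio with ham and spam reweighted to equal mass,
    so the corpus ham/spam split doesn't skew the result."""
    s = spam_hits / total_spam   # fraction of spam the rule hits
    h = ham_hits / total_ham     # fraction of ham the rule hits
    return s / (s + h) if s + h else 0.5

# Example: a rule hitting 24% of spam and 0.1% of ham.
print(round(s_o(2400, 10, 10000, 10000), 4))  # 0.9959 -> good spam rule
```

With raw counts instead of fractions, a corpus with far more ham than spam would drag every rule's S/O toward 0; the reweighting keeps it comparable across corpora.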
> We already have criteria.
> We use the S/O ratio (spam/overall = spam/(ham+spam) using a 50/50
> weighting of ham to spam so the weighting is constant). High is good
> for spam rules. Low is good for ham rules.
>
> We also use a RANK number which is a relative ranking system of each
> rule compared to every other rule.

Is RANK a simple sort by the other criteria?

> We also use the hit rate. SPAM% for spam rules and HAM% for ham rules.

Thank you for the criteria! So we have three coefficients: S/O, HitRate and Overlap. What about potential forging?

Let's talk about S/O and HitRate. What should S/O and HitRate be for a new rule to be accepted? Can we publish a formula, S/O * HitRate > something, for accepting new rules?

Where can users find the S/O and HitRates for all rules? At the page http://www.spamassassin.org/tests.html I see only scores. Can we publish the corpus size, the number of hams/spams, and the S/O ratio and HitRate for every rule on this page?

Thank you!
For the Bayes rules we should select intervals that maximize the sum of S/O*HitRate:

BAYES_INTERVAL1: S/O * HitRate = B1
BAYES_INTERVAL2: S/O * HitRate = B2
......
BAYES_INTERVALN: S/O * HitRate = BN

Sum = B1 + B2 + ... + BN, with N fixed.

I think this sum will be at its maximum when B1 ≈ B2 ≈ ... ≈ BN.

--------

Another idea is to map the Bayes probability to a score directly:

Bayes_Score = Constant1 * log(BAYES_PROBABILITY), if BAYES_PROBABILITY < 0.5
Bayes_Score = -Constant2 * log(1 - BAYES_PROBABILITY), if BAYES_PROBABILITY > 0.5

Then we only need to select Constant1 and Constant2.
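The two-constant logarithmic mapping above can be sketched directly (the constants here are arbitrary placeholders; they are the free parameters to be fitted):

```python
import math

def bayes_score(p, c1=1.0, c2=1.0):
    """Map a Bayes probability to a score via the proposed log formulas.
    c1 and c2 are the two constants to be selected."""
    if p < 0.5:
        return c1 * math.log(p)        # negative (whitelist) score
    return -c2 * math.log(1.0 - p)     # positive (blacklist) score

print(round(bayes_score(0.01), 3))  # ≈ -4.605
print(round(bayes_score(0.99), 3))  # ≈ 4.605
```

The mapping is unbounded as p approaches 0 or 1, so in practice the probability would need clamping away from the endpoints before taking the log.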
Closing as WORKSFORME, maybe we'll tweak the ranges, but I don't want to argue with you about it.