Bug 2413 - chi formula error
Summary: chi formula error
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Libraries
Version: 2.60
Hardware: Other All
Importance: P5 normal
Target Milestone: 2.60
Assignee: Daniel Quinlan
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2003-09-06 13:41 UTC by Daniel Quinlan
Modified: 2003-09-07 10:07 UTC



Attachment: proposed fix (patch), status: None, submitter: Daniel Quinlan [HasCLA]

Description Daniel Quinlan 2003-09-06 13:41:31 UTC
From Andrew Flury:

-------------------------------------------------------------------------------

When you supply chi_squared_probs_combine with an even number of probabilities,
the result is screwy:

print chi_squared_probs_combine(0.999), "\n";
print chi_squared_probs_combine(0.999, 0.999), "\n";
print chi_squared_probs_combine(0.999, 0.999, 0.999), "\n";
print chi_squared_probs_combine(0.999, 0.999, 0.999, 0.999), "\n";

output:
0.999
0.80822229895745
0.999999778367055
0.991794986478665

We changed the following lines (I noticed the difference in py-spambayes):
$S = log($S) + $Sexp + LN2;
$H = log($H) + $Hexp + LN2;
to:
$S = log($S) + $Sexp * LN2;
$H = log($H) + $Hexp * LN2;

Seems to fix the results:
0.999
0.999991592578138
0.999999879526618
0.999999998031527

-------------------------------------------------------------------------------
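
For anyone who wants to reproduce the numbers above without a SpamAssassin checkout, here is a minimal standalone sketch of the combining step with the corrected lines in place. It is not the committed patch: the chi2q helper, the 1e-200 rescaling threshold, and the argument handling are assumptions made for illustration, and inputs are assumed to lie strictly between 0 and 1.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(frexp);

use constant LN2 => log(2);

# Upper-tail probability of a chi-squared variable with an even number
# of degrees of freedom $v, evaluated at $x2 (hypothetical helper).
sub chi2q {
    my ($x2, $v) = @_;
    die "v must be even in chi2q(x2, v)\n" if $v & 1;
    my $m    = $x2 / 2.0;
    my $term = exp(-$m);
    my $sum  = $term;
    for my $i (1 .. ($v / 2) - 1) {
        $term *= $m / $i;
        $sum  += $term;
    }
    return $sum < 1.0 ? $sum : 1.0;
}

# Sketch of the combiner; assumes every probability is in (0, 1).
sub chi_squared_probs_combine {
    my @probs = @_;
    return 0.5 unless @probs;

    my ($S, $H)       = (1.0, 1.0);
    my ($Sexp, $Hexp) = (0, 0);

    for my $p (@probs) {
        $S *= 1.0 - $p;
        $H *= $p;
        # Before a running product underflows, split off its base-2
        # exponent and keep only the mantissa.
        if ($S < 1e-200) { my $e; ($S, $e) = frexp($S); $Sexp += $e; }
        if ($H < 1e-200) { my $e; ($H, $e) = frexp($H); $Hexp += $e; }
    }

    # Corrected lines: the saved exponents are powers of two, so they
    # are scaled by ln(2) when converting back to a natural log.
    $S = log($S) + $Sexp * LN2;
    $H = log($H) + $Hexp * LN2;

    my $n = scalar @probs;
    $S = 1.0 - chi2q(-2.0 * $S, 2 * $n);
    $H = 1.0 - chi2q(-2.0 * $H, 2 * $n);
    return (($S - $H) + 1.0) / 2.0;
}

# Same four calls as in the report: 1 to 4 copies of 0.999.
printf "%.15g\n", chi_squared_probs_combine((0.999) x $_) for 1 .. 4;
```

Run as-is, this should land very close to the four corrected figures quoted in the report.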

The SpamBayes code does use * there instead of +.  It looks like an older
version of their code had a + in there.
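
The reason the multiplication is right: frexp splits x into m * 2**e, so log(x) = log(m) + e * log(2). A small check of that identity, using an arbitrary tiny value (1e-250 is just an illustrative choice, not something taken from the SA code):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(frexp);

use constant LN2 => log(2);

my $x = 1e-250;              # arbitrary small value, for illustration only
my ($m, $e) = frexp($x);     # $x == $m * 2**$e with 0.5 <= $m < 1

my $with_mul = log($m) + $e * LN2;   # corrected form
my $with_add = log($m) + $e + LN2;   # form before the fix

printf "log(x)          = %.6f\n", log($x);
printf "log(m) + e*LN2  = %.6f\n", $with_mul;
printf "log(m) + e+LN2  = %.6f\n", $with_add;
```

The first two lines printed should agree; the third should not. Note that the additive form is also off when no rescaling has happened (e is 0), since it still adds a stray log(2), which fits the bad results seen with only a handful of probabilities.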

I tested on my corpus. Here's the shift in S/O (after minus before) for an
autolearning run on the last 6 months of my corpus:

0       BAYES_00
-0.01   BAYES_01
-0.01   BAYES_10
-0.124  BAYES_20
-0.114  BAYES_30
0.007   BAYES_40
-0.002  BAYES_44
-0.002  BAYES_50
0.038   BAYES_56
0.004   BAYES_60
0       BAYES_70
0       BAYES_80
0       BAYES_90
0       BAYES_99

Some results shift from BAYES_99 to BAYES_90, and the scores are not tuned for
this at the moment, so there's a very slight rise in false negatives (0.1%) in
my test, but it seems unwise to leave this error in SA.

before (columns: OVERALL%  SPAM%  HAM%  S/O  RANK  SCORE  NAME):
 53.709   0.0912  97.9502    0.001   1.00   -4.90  BAYES_00
  0.397   0.0798   0.6582    0.108   0.62   -0.60  BAYES_01
  0.216   0.0342   0.3667    0.085   0.67   -0.73  BAYES_10
  0.124   0.0798   0.1598    0.333   0.26   -0.13  BAYES_20
  0.118   0.1368   0.1034    0.569   0.07   -0.35  BAYES_30
  0.072   0.1140   0.0376    0.752   0.01    0.00  BAYES_40
  0.835   1.3675   0.3949    0.776   0.01    0.00  BAYES_44
  3.230   6.8946   0.2069    0.971   0.81    0.00  BAYES_50
  0.448   0.9345   0.0470    0.952   0.76    0.00  BAYES_56
  0.649   1.4017   0.0282    0.980   0.83    1.79  BAYES_60
  0.907   1.9715   0.0282    0.986   0.84    2.14  BAYES_70
  1.133   2.4957   0.0094    0.996   0.87    2.44  BAYES_80
  2.581   5.6980   0.0094    0.998   0.88    2.45  BAYES_90
 35.580  78.7009   0.0000    1.000   0.98    5.40  BAYES_99

after (columns: OVERALL%  SPAM%  HAM%  S/O  RANK  SCORE  NAME):
 53.711   0.0800  97.8375    0.001   1.00   -4.90  BAYES_00
  0.438   0.0800   0.7334    0.098   0.64   -0.60  BAYES_01
  0.165   0.0229   0.2821    0.075   0.69   -0.73  BAYES_10
  0.175   0.0686   0.2633    0.207   0.44   -0.13  BAYES_20
  0.113   0.1028   0.1222    0.457   0.14   -0.35  BAYES_30
  0.093   0.1486   0.0470    0.760   0.01    0.00  BAYES_40
  0.846   1.3827   0.4043    0.774   0.01    0.00  BAYES_44
  3.265   6.9592   0.2256    0.969   0.80    0.00  BAYES_50
  0.377   0.8228   0.0094    0.989   0.85    0.00  BAYES_56
  0.789   1.7141   0.0282    0.984   0.83    1.79  BAYES_60
  0.903   1.9655   0.0282    0.986   0.84    2.14  BAYES_70
  1.124   2.4797   0.0094    0.996   0.87    2.44  BAYES_80
  2.734   6.0450   0.0094    0.998   0.88    2.45  BAYES_90
 35.266  78.1282   0.0000    1.000   0.98    5.40  BAYES_99

average RANK for BAYES_00 to BAYES_30 => improves from 0.524 to 0.582
average RANK for BAYES_60 to BAYES_99 => unchanged at 0.880
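
The averages can be rechecked from the RANK column (fifth column) of the two tables. A short sketch, with the values copied by hand from the tables above; the hash names and the avg_rank helper are made up for this example:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum);

# RANK values copied from the "before" and "after" tables.
my %rank_before = (
    BAYES_00 => 1.00, BAYES_01 => 0.62, BAYES_10 => 0.67,
    BAYES_20 => 0.26, BAYES_30 => 0.07,
    BAYES_60 => 0.83, BAYES_70 => 0.84, BAYES_80 => 0.87,
    BAYES_90 => 0.88, BAYES_99 => 0.98,
);
my %rank_after = (
    BAYES_00 => 1.00, BAYES_01 => 0.64, BAYES_10 => 0.69,
    BAYES_20 => 0.44, BAYES_30 => 0.14,
    BAYES_60 => 0.83, BAYES_70 => 0.84, BAYES_80 => 0.87,
    BAYES_90 => 0.88, BAYES_99 => 0.98,
);

my @low  = qw(BAYES_00 BAYES_01 BAYES_10 BAYES_20 BAYES_30);
my @high = qw(BAYES_60 BAYES_70 BAYES_80 BAYES_90 BAYES_99);

# Mean RANK over the named rules in one table (hypothetical helper).
sub avg_rank {
    my ($table, @rules) = @_;
    return sum(map { $table->{$_} } @rules) / @rules;
}

printf "BAYES_00..30: %.3f -> %.3f\n",
    avg_rank(\%rank_before, @low), avg_rank(\%rank_after, @low);
printf "BAYES_60..99: %.3f -> %.3f\n",
    avg_rank(\%rank_before, @high), avg_rank(\%rank_after, @high);
```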
Comment 1 Daniel Quinlan 2003-09-06 13:42:15 UTC
Created attachment 1319
proposed fix
Comment 2 Theo Van Dinter 2003-09-06 14:12:02 UTC
+1 I like things that produce better results. ;)
Comment 3 Malte S. Stretz 2003-09-06 16:45:22 UTC
+1. hmmm... statistics :-/ 
Comment 4 Daniel Quinlan 2003-09-06 22:48:55 UTC
Justin, any comments before I commit this?
Comment 5 Justin Mason 2003-09-07 01:24:39 UTC
damn, looks totally right.  I think that was Matt's code anyway ;)
I suggest tweaking up the BAYES_90 score to compensate for those
slight differences...
Comment 6 Daniel Quinlan 2003-09-07 18:07:16 UTC
applied and closing