SA Bugzilla – Bug 2413
chi formula error
Last modified: 2003-09-07 10:07:16 UTC
From Andrew Flury:

-------------------------------------------------------------------------------
When you supply chi_squared_probs_combine with an even number of probabilities, the result is screwy:

    print chi_squared_probs_combine(0.999), "\n";
    print chi_squared_probs_combine(0.999, 0.999), "\n";
    print chi_squared_probs_combine(0.999, 0.999, 0.999), "\n";
    print chi_squared_probs_combine(0.999, 0.999, 0.999, 0.999), "\n";

output:

    0.999
    0.80822229895745
    0.999999778367055
    0.991794986478665

We changed the following lines (I noticed the difference in py-spambayes):

    $S = log($S) + $Sexp + LN2;
    $H = log($H) + $Hexp + LN2;

to:

    $S = log($S) + $Sexp * LN2;
    $H = log($H) + $Hexp * LN2;

Seems to fix the results:

    0.999
    0.999991592578138
    0.999999879526618
    0.999999998031527
-------------------------------------------------------------------------------

The SpamBayes code does use * there instead of +. It looks like an older version of their code had a + in there.

I tested on my corpus, here's the shift in S/O for an autolearning run on the last 6 months of my corpus:

     0      BAYES_00
    -0.01   BAYES_01
    -0.01   BAYES_10
    -0.124  BAYES_20
    -0.114  BAYES_30
     0.007  BAYES_40
    -0.002  BAYES_44
    -0.002  BAYES_50
     0.038  BAYES_56
     0.004  BAYES_60
     0      BAYES_70
     0      BAYES_80
     0      BAYES_90
     0      BAYES_99

Some results shift from BAYES_99 to BAYES_90 and we're not tuned for this at the moment, so there's a very slight rise in FNs (0.1%) in my test, but it seems unwise to leave this error in SA.
before:

    OVERALL%    SPAM%     HAM%    S/O   RANK   SCORE  NAME
     53.709    0.0912  97.9502  0.001   1.00   -4.90  BAYES_00
      0.397    0.0798   0.6582  0.108   0.62   -0.60  BAYES_01
      0.216    0.0342   0.3667  0.085   0.67   -0.73  BAYES_10
      0.124    0.0798   0.1598  0.333   0.26   -0.13  BAYES_20
      0.118    0.1368   0.1034  0.569   0.07   -0.35  BAYES_30
      0.072    0.1140   0.0376  0.752   0.01    0.00  BAYES_40
      0.835    1.3675   0.3949  0.776   0.01    0.00  BAYES_44
      3.230    6.8946   0.2069  0.971   0.81    0.00  BAYES_50
      0.448    0.9345   0.0470  0.952   0.76    0.00  BAYES_56
      0.649    1.4017   0.0282  0.980   0.83    1.79  BAYES_60
      0.907    1.9715   0.0282  0.986   0.84    2.14  BAYES_70
      1.133    2.4957   0.0094  0.996   0.87    2.44  BAYES_80
      2.581    5.6980   0.0094  0.998   0.88    2.45  BAYES_90
     35.580   78.7009   0.0000  1.000   0.98    5.40  BAYES_99

after:

    OVERALL%    SPAM%     HAM%    S/O   RANK   SCORE  NAME
     53.711    0.0800  97.8375  0.001   1.00   -4.90  BAYES_00
      0.438    0.0800   0.7334  0.098   0.64   -0.60  BAYES_01
      0.165    0.0229   0.2821  0.075   0.69   -0.73  BAYES_10
      0.175    0.0686   0.2633  0.207   0.44   -0.13  BAYES_20
      0.113    0.1028   0.1222  0.457   0.14   -0.35  BAYES_30
      0.093    0.1486   0.0470  0.760   0.01    0.00  BAYES_40
      0.846    1.3827   0.4043  0.774   0.01    0.00  BAYES_44
      3.265    6.9592   0.2256  0.969   0.80    0.00  BAYES_50
      0.377    0.8228   0.0094  0.989   0.85    0.00  BAYES_56
      0.789    1.7141   0.0282  0.984   0.83    1.79  BAYES_60
      0.903    1.9655   0.0282  0.986   0.84    2.14  BAYES_70
      1.124    2.4797   0.0094  0.996   0.87    2.44  BAYES_80
      2.734    6.0450   0.0094  0.998   0.88    2.45  BAYES_90
     35.266   78.1282   0.0000  1.000   0.98    5.40  BAYES_99

average RANK for BAYES_00 to BAYES_30 => 0.524 improves to 0.582
average RANK for BAYES_60 to BAYES_99 => 0.880, unchanged
Created attachment 1319: proposed fix
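For anyone reading along without the attachment, here is a minimal sketch of why the multiplication is the right operator. It is not the attached patch and not the actual SA source: the names chi_combine_sketch and chi2q, the 1e-200 rescale threshold, and the flat argument list are assumptions made for illustration. The idea is that when a running product is rescaled with frexp, the value is carried as mantissa * 2**exp, so its natural log is log(mantissa) + exp * log(2). Adding LN2 instead just adds a constant log(2), whether or not any rescaling happened.

```perl
use strict;
use warnings;
use POSIX qw(frexp);

use constant LN2 => log(2);

# Upper-tail probability of a chi-squared variable with an even number of
# degrees of freedom $v, evaluated at $x2 (hypothetical helper name).
sub chi2q {
    my ($x2, $v) = @_;
    my $m    = $x2 / 2.0;
    my $term = exp(-$m);
    my $sum  = $term;
    for my $i (1 .. $v / 2 - 1) {
        $term *= $m / $i;
        $sum  += $term;
    }
    return $sum < 1.0 ? $sum : 1.0;
}

# Combine token probabilities; assumes every value is strictly between
# 0 and 1 (a value of exactly 0 or 1 would make log() die here).
sub chi_combine_sketch {
    my @probs = @_;
    return 0.5 unless @probs;

    my ($S, $H)       = (1.0, 1.0);
    my ($Sexp, $Hexp) = (0, 0);

    for my $p (@probs) {
        $S *= 1.0 - $p;
        $H *= $p;
        # Rescale before the product underflows, keeping track of the
        # power of two that was factored out.
        if ($S < 1e-200) {
            my ($mant, $e) = frexp($S);
            $S = $mant;
            $Sexp += $e;
        }
        if ($H < 1e-200) {
            my ($mant, $e) = frexp($H);
            $H = $mant;
            $Hexp += $e;
        }
    }

    # log(mantissa * 2**exp) = log(mantissa) + exp * log(2)
    my $lnS = log($S) + $Sexp * LN2;
    my $lnH = log($H) + $Hexp * LN2;

    my $n = scalar @probs;
    my $s = 1.0 - chi2q(-2.0 * $lnS, 2 * $n);
    my $h = 1.0 - chi2q(-2.0 * $lnH, 2 * $n);
    return (($s - $h) + 1.0) / 2.0;
}

# Same four calls as in the report: one to four probabilities of 0.999.
printf "%.9f\n", chi_combine_sketch((0.999) x $_) for 1 .. 4;
```

With the multiplication in place, the four printed values should agree with the corrected results quoted in the report, to the precision shown.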
+1 I like things that produce better results. ;)
+1. hmmm... statistics :-/
Justin, any comments before I commit this?
damn, looks totally right. I think that was Matt's code anyway ;) I suggest tweaking up the BAYES_90 score to compensate for those slight differences...
applied and closing