SA Bugzilla – Bug 5686
add OSBF/Winnow as an alternative to Bayes
Last modified: 2009-07-23 07:09:27 UTC
now that Bayes is a little more pluginized (bug 5293), here's what I wanted to do: offer an alternative to the default BAYES rules, using a more up-to-date probabilistic classifier algorithm -- namely, Orthogonal Sparse Bigram tokenization combined with the Winnow machine-learning algorithm [1][2]. This combo has been scoring very well in the TREC anti-spam probabilistic-classifier shootout [3], as implemented by osbf-lua [4].

[1]: http://en.wikipedia.org/wiki/Winnow
[2]: http://www.siefkes.net/ie/winnow-spam.pdf
[3]: http://trec.nist.gov
[4]: http://osbf-lua.luaforge.net/

initial results of an implementation (based on that paper) seem very promising so far...
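(for anyone unfamiliar with OSB, here's a minimal sketch of Orthogonal Sparse Bigram feature generation, assuming the CRM114/osbf-lua window size of 5; the "<skip>" marker and the sub name are illustrative only, not necessarily what the plugin will emit:)

  # pair each token with each of the next 4 tokens, recording the gap
  # with "<skip>" markers so the distance within each pair is encoded
  sub osb_features {
    my @toks = split ' ', shift;
    my @feats;
    for my $i (0 .. $#toks) {
      for my $d (1 .. 4) {
        last if $i + $d > $#toks;
        push @feats, join ' ',
          $toks[$i], ('<skip>') x ($d - 1), $toks[$i + $d];
      }
    }
    return @feats;  # e.g. "foo bar", "foo <skip> baz", "foo <skip> <skip> quux"
  }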
Let's have some graphs! Here's a graph of scores from SVN trunk's version of Bayes, measured using 10-fold cross-validation on a corpus of ~2000 recent spam and ~2000 recent ham from my collection (I'm using this corpus to measure results as I develop this):

http://taint.org/x/2007/graph_trunk.png

And here's a graph on the same corpus, classified using osbf-lua:

http://taint.org/x/2007/graph_osbflua.png

You can see several things:

- current trunk's Bayes has a tendency to put a fair bit of spam into the "unsure" middle ground, BAYES_50, where it gets no score.

- osbf-lua is better at separating more of the samples into their correct class, with a more or less clear dividing line around -15. (I'm not sure what their score figure represents.)

This demonstrates that the algorithms used in osbf-lua are pretty effective, in my opinion (and gives us an idea of what osbf can do -- something to aim for with our implementation).

Now for the implementation of Winnow/OSBF, as checked in as r584432, compared to SVN trunk. Here's a score histogram from trunk:

SCORE NUMHIT DETAIL
OVERALL HISTOGRAM (. = ham, # = spam)
0.000 (99.914%) ..........|.......................................................
0.000 ( 0.761%) ######### |
0.040 ( 0.020%) |
0.040 ( 0.028%) |
0.080 ( 0.050%) # |
0.120 ( 0.015%) |
0.120 ( 0.039%) |
0.160 ( 0.005%) |
0.160 ( 0.011%) |
0.200 ( 0.017%) |
0.240 ( 0.005%) |
0.240 ( 0.022%) |
0.280 ( 0.017%) |
0.320 ( 0.011%) |
0.360 ( 0.028%) |
0.400 ( 0.010%) |
0.400 ( 0.017%) |
0.440 ( 0.005%) |
0.440 ( 0.083%) # |
0.480 ( 0.025%) |
0.480 ( 2.122%) ##########|#
0.520 ( 0.231%) ### |
0.560 ( 0.138%) ## |
0.600 ( 0.088%) # |
0.640 ( 0.127%) # |
0.680 ( 0.121%) # |
0.720 ( 0.182%) ## |
0.760 ( 0.193%) ## |
0.800 ( 0.187%) ## |
0.840 ( 0.116%) # |
0.880 ( 0.215%) ## |
0.920 ( 0.375%) #### |
0.960 (94.825%) ##########|#######################################################

(hopefully that pastes ok). the thing we want to see is all "."s at 0.000, all "#"s at 0.960-1.0, no "."s between 0.5 and 1.0 (false positives), and no "#"s between 0.0 and 0.5 (false negatives).

here's the histogram for r584432:

SCORE NUMHIT DETAIL
OVERALL HISTOGRAM (. = ham, # = spam)
0.000 (94.728%) ..........|.......................................................
0.000 ( 0.077%) # |
0.960 ( 5.272%) ..........|...
0.960 (99.923%) ##########|#######################################################

that's very good, except for the 5.272% of false positives :( we need to avoid that, since 5% fps is serious.

the "thresholds" cost figure (in "results/thresholds.static"), which comes up with a single-figure metric based on the score distribution, looks like this:

trunk:

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$222.30
Total ham:spam: 19764:18144
FP: 0 0.000% FN: 168 0.926%
Unsure: 543 1.432% (ham: 8 0.040% spam: 535 2.949%)
TCRs: l=1 25.809 l=5 25.809 l=9 25.809
SUMMARY: 0.30/0.70 fp 0 fn 168 uh 8 us 535 c 222.30

r584432:

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$10434.00
Total ham:spam: 19764:18144
FP: 1042 5.272% FN: 14 0.077%
Unsure: 0 0.000% (ham: 0 0.000% spam: 0 0.000%)
TCRs: l=1 17.182 l=5 3.473 l=9 1.932
SUMMARY: 0.30/0.70 fp 1042 fn 14 uh 0 us 0 c 10434.00

that cost metric penalised the 5% fp rate very heavily.
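(working backwards from those two reports, the cost weights appear to be $10 per FP, $1 per FN, and $0.10 per unsure message; both reported figures check out:

  cost = 10.00*FP + 1.00*FN + 0.10*(unsure ham + unsure spam)
  trunk:   10*0    + 1*168 + 0.1*(8 + 535) = $222.30
  r584432: 10*1042 + 1*14  + 0.1*(0 + 0)   = $10434.00

so one FP is costed like ten FNs, which is why the 5% FP rate dominates.)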
OK, I've now implemented osbf-lua-style OSBF, with EDDC (Exponential Differential Document Count), as r584760. (Note that r584432 described above wasn't OSBF -- it was just OSB ;)

The test took too long. ;) I interrupted it after 5 of the 10 folds; this histogram is about representative:

SCORE NUMHIT DETAIL
OVERALL HISTOGRAM (. = ham, # = spam)
0.000 ( 5.820%) ..........|..........
0.040 ( 6.225%) ..........|..........
0.080 ( 3.998%) ..........|.......
0.120 (20.749%) ..........|..................................
0.160 (33.198%) ..........|.......................................................
0.200 (18.370%) ..........|..............................
0.200 ( 0.055%) # |
0.240 ( 9.565%) ..........|................
0.280 ( 1.721%) ..........|...
0.280 ( 0.331%) ##### |
0.320 ( 0.101%) ... |
0.320 ( 0.110%) ## |
0.360 ( 0.101%) ... |
0.360 ( 0.331%) ##### |
0.400 ( 0.110%) ## |
0.440 ( 0.496%) ####### |
0.480 ( 0.152%) ..... |
0.480 ( 1.929%) ##########|#
0.520 ( 1.103%) ##########|#
0.560 ( 1.213%) ##########|#
0.600 ( 1.323%) ##########|#
0.640 ( 0.717%) ##########|#
0.680 ( 8.434%) ##########|######
0.720 (77.233%) ##########|#######################################################
0.760 ( 6.615%) ##########|#####

Note that the fundamental shape has changed, since OSBF uses a traditional naive Bayesian combiner, instead of the binary Winnow style, or the nearly-binary Robinsonian chi-square combiner. OSBF, however, is what's behind the impressively low number of FPs and FNs here, I think! Here are the numbers:

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$21.00
Total ham:spam: 1976:1814
FP: 0 0.000% FN: 1 0.055%
Unsure: 200 5.277% (ham: 41 2.075% spam: 159 8.765%)
TCRs: l=1 11.338 l=5 11.338 l=9 11.338
SUMMARY: 0.30/0.70 fp 0 fn 1 uh 41 us 159 c 21.00

I think I need to keep working on this...
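(for reference, the "traditional naive Bayesian combiner" referred to here is, modulo clamping details, the old pre-chi-square combination of per-token probabilities -- a minimal sketch, done in log space to avoid underflow on long messages:)

  # combine per-token P(spam) estimates into a single score:
  # prod(p) / (prod(p) + prod(1-p)), giving 0.0 = ham .. 1.0 = spam
  sub naive_bayes_combine {
    my @probs = @_;
    my ($ls, $lh) = (0, 0);
    for my $p (@probs) {
      $p = 0.0001 if $p < 0.0001;   # clamp away from 0/1 (illustrative values)
      $p = 0.9999 if $p > 0.9999;
      $ls += log($p);
      $lh += log(1 - $p);
    }
    return 1 / (1 + exp($lh - $ls));
  }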
here are some score-frequency histograms, measuring alternative values for the K3 constant in the EDDC equation. (these are all after 1 of the 10 folds, to get a quick idea. the 10 folds are self-similar enough that IMO this is safe)...

K3=8

SCORE NUMHIT DETAIL
OVERALL HISTOGRAM (. = ham, # = spam)
0.280 (64.626%) ..........|.......................................................
0.280 ( 0.055%) # |
0.320 ( 0.051%) . |
0.440 (35.324%) ..........|..............................
0.440 ( 0.606%) ####### |
0.480 ( 6.395%) ##########|####
0.520 (92.778%) ##########|#######################################################
0.680 ( 0.165%) ## |

K3=6

SCORE NUMHIT DETAIL
OVERALL HISTOGRAM (. = ham, # = spam)
0.280 (69.332%) ..........|.......................................................
0.280 ( 0.055%) # |
0.400 (21.255%) ..........|.................
0.440 ( 9.413%) ..........|.......
0.440 ( 0.772%) ######### |
0.480 ( 3.528%) ##########|##
0.520 (95.480%) ##########|#######################################################
0.680 ( 0.165%) ## |

K3=4

SCORE NUMHIT DETAIL
OVERALL HISTOGRAM (. = ham, # = spam)
0.240 (38.985%) ..........|.......................................................
0.280 (36.531%) ..........|....................................................
0.280 ( 0.094%) ## |
0.360 ( 4.346%) ..........|......
0.400 (18.974%) ..........|...........................
0.400 ( 0.006%) |
0.440 ( 1.138%) ..........|..
0.440 ( 0.628%) ##########|#
0.480 ( 0.025%) . |
0.480 ( 1.631%) ##########|##
0.520 (46.616%) ##########|##################################################
0.560 (50.871%) ##########|#######################################################
0.640 ( 0.006%) |
0.680 ( 0.033%) # |
0.720 ( 0.116%) ### |
K3=2

SCORE NUMHIT DETAIL
OVERALL HISTOGRAM (. = ham, # = spam)
0.240 (39.524%) ..........|.......................................................
0.280 (36.032%) ..........|..................................................
0.280 ( 0.110%) ## |
0.360 ( 2.227%) ..........|...
0.400 (21.306%) ..........|..............................
0.400 ( 0.055%) # |
0.440 ( 0.911%) ..........|.
0.440 ( 0.827%) ##########|#
0.480 ( 2.205%) ##########|##
0.520 (56.505%) ##########|#######################################################
0.560 (40.132%) ##########|#######################################
0.680 ( 0.055%) # |
0.720 ( 0.110%) ## |

I've also implemented the Bayes chain rule algorithm described in the EDDC paper, in r585450. Here's a histogram using K3=1 and that combiner:

SCORE NUMHIT DETAIL
OVERALL HISTOGRAM (. = ham, # = spam)
0.000 (100.000%) ..........|.......................................................
0.000 ( 1.764%) ##########|#
0.960 (98.236%) ##########|#######################################################

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$32.00
Total ham:spam: 1976:1814
FP: 0 0.000% FN: 32 1.764%
Unsure: 0 0.000% (ham: 0 0.000% spam: 0 0.000%)
TCRs: l=1 56.688 l=5 56.687 l=9 56.688
SUMMARY: 0.30/0.70 fp 0 fn 32 uh 0 us 0 c 32.00

that's pretty cool. 0% FP rate! but the 1.7% FN rate is not great. the tweaks continue...
here's that Bayes chain rule, with K3=8:

SCORE NUMHIT DETAIL
OVERALL HISTOGRAM (. = ham, # = spam)
0.000 (100.000%) ..........|.......................................................
0.000 ( 1.709%) ##########|#
0.040 ( 0.055%) # |
0.120 ( 0.055%) # |
0.920 ( 0.055%) # |
0.960 (98.126%) ##########|#######################################################

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$33.00
Total ham:spam: 1976:1814
FP: 0 0.000% FN: 33 1.819%
Unsure: 0 0.000% (ham: 0 0.000% spam: 0 0.000%)
TCRs: l=1 54.970 l=5 54.970 l=9 54.970
SUMMARY: 0.30/0.70 fp 0 fn 33 uh 0 us 0 c 33.00

so that doesn't really improve it much. I think the method used in comment 2 is performing the best here so far; that's the use of OSBF with traditional naive Bayes combining (at least, the trad combiner we used to use in SA before we switched to Fisher/Robinson chi-square combining).
ok, some more tests.... trying the (crazy) K3=20 with the Bayes chain rule combiner:

SCORE NUMHIT DETAIL
OVERALL HISTOGRAM (. = ham, # = spam)
0.000 (100.000%) ..........|.......................................................
0.000 ( 1.047%) ##########|#
0.200 ( 0.110%) # |
0.320 ( 0.055%) # |
0.680 ( 0.055%) # |
0.800 ( 0.055%) # |
0.840 ( 0.055%) # |
0.880 ( 0.165%) ## |
0.920 ( 0.110%) # |
0.960 (98.346%) ##########|#######################################################

let's try the naive Bayes combiner, K3 = 0.8:

SCORE NUMHIT DETAIL
OVERALL HISTOGRAM (. = ham, # = spam)
0.120 (32.439%) ..........|......................................................
0.160 (32.844%) ..........|.......................................................
0.160 ( 0.055%) # |
0.200 (27.379%) ..........|..............................................
0.240 ( 6.275%) ..........|...........
0.280 ( 0.658%) ..........|.
0.280 ( 0.331%) ###### |
0.320 ( 0.152%) ..... |
0.320 ( 0.221%) #### |
0.360 ( 0.051%) .. |
0.400 ( 0.202%) ....... |
0.400 ( 0.110%) ## |
0.440 ( 0.331%) ###### |
0.480 ( 0.496%) ######### |
0.520 ( 0.331%) ###### |
0.560 ( 1.378%) ##########|#
0.600 (15.160%) ##########|##############
0.640 (57.938%) ##########|#######################################################
0.680 ( 2.426%) ##########|##
0.720 (20.066%) ##########|###################
0.760 ( 1.047%) ##########|#
0.840 ( 0.110%) ## |

So far I think K3=1, with the traditional naive Bayes combiner, is working best for us, since it's so good at avoiding the FPs and FNs that the others leave behind. To compare with the figures from comment 1, here are the results from a full 10-fold cross-validation:

SCORE NUMHIT DETAIL
OVERALL HISTOGRAM (. = ham, # = spam)
0.000 (25.415%) ..........|.......................................................
0.040 ( 9.831%) ..........|.....................
0.080 (22.571%) ..........|.................................................
0.120 (21.716%) ..........|...............................................
0.160 ( 8.435%) ..........|..................
0.200 ( 5.444%) ..........|............
0.200 ( 0.028%) # |
0.240 ( 3.916%) ..........|........
0.240 ( 0.022%) # |
0.280 ( 1.801%) ..........|....
0.280 ( 0.022%) # |
0.320 ( 0.491%) ..........|.
0.320 ( 0.226%) ##### |
0.360 ( 0.116%) ..... |
0.360 ( 0.231%) ###### |
0.400 ( 0.040%) .. |
0.400 ( 0.193%) ##### |
0.440 ( 0.132%) ### |
0.480 ( 0.223%) ..........|
0.480 ( 1.334%) ##########|##
0.520 ( 0.110%) ### |
0.560 ( 0.419%) ##########|#
0.600 ( 0.832%) ##########|#
0.640 ( 1.769%) ##########|##
0.680 ( 8.813%) ##########|###########
0.720 (36.767%) ##########|############################################
0.760 (45.712%) ##########|#######################################################
0.800 ( 3.279%) ##########|####
0.840 ( 0.006%) |
0.880 ( 0.011%) |
0.920 ( 0.022%) # |
0.960 ( 0.072%) ## |

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$206.30
Total ham:spam: 19764:18144
FP: 0 0.000% FN: 9 0.050%
Unsure: 1973 5.205% (ham: 528 2.672% spam: 1445 7.964%)
TCRs: l=1 12.479 l=5 12.479 l=9 12.479
SUMMARY: 0.30/0.70 fp 0 fn 9 uh 528 us 1445 c 206.30
things I now want to test:

1. the effect of smaller/bigger db sizes on the OSBF code (right now I've more or less disabled expiry for these tests, which is unrealistic)

2. the effect of less training data, which is the real issue -- can OSBF do a better job with tiny amounts of training than our existing Bayes impl?

3. different tokenization
(In reply to comment #7)
> 2. the effect of less training data, which is the real issue -- can OSBF do a
> better job with tiny amounts of training, than our existing Bayes impl?

results from the weekend's testing of this. I ran the 10-fold cross-validation driver with "--learnprob 0.1 --randseed 23" -- ie. train on only 10% of the messages -- and got these histograms:

SVN trunk:

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$252.30
Total ham:spam: 19764:18144
FP: 0 0.000% FN: 155 0.854%
Unsure: 973 2.567% (ham: 24 0.121% spam: 949 5.230%)
TCRs: l=1 16.435 l=5 16.435 l=9 16.435
SUMMARY: 0.30/0.70 fp 0 fn 155 uh 24 us 949 c 252.30

SCORE NUMHIT DETAIL
OVERALL HISTOGRAM (. = ham, # = spam)
0.000 (99.676%) ..........|.......................................................
0.000 ( 0.645%) ######## |
0.040 ( 0.040%) |
0.040 ( 0.055%) # |
0.080 ( 0.040%) |
0.080 ( 0.022%) |
0.120 ( 0.030%) |
0.120 ( 0.050%) # |
0.160 ( 0.035%) |
0.160 ( 0.022%) |
0.200 ( 0.040%) |
0.200 ( 0.028%) |
0.240 ( 0.015%) |
0.240 ( 0.033%) |
0.280 ( 0.020%) |
0.280 ( 0.077%) # |
0.320 ( 0.015%) |
0.320 ( 0.061%) # |
0.360 ( 0.015%) |
0.360 ( 0.044%) # |
0.400 ( 0.015%) |
0.400 ( 0.121%) # |
0.440 ( 0.035%) |
0.440 ( 0.198%) ## |
0.480 ( 0.020%) |
0.480 ( 3.919%) ##########|##
0.520 ( 0.314%) #### |
0.560 ( 0.165%) ## |
0.600 ( 0.149%) ## |
0.640 ( 0.077%) # |
0.680 ( 0.215%) ### |
0.720 ( 0.116%) # |
0.760 ( 0.116%) # |
0.800 ( 0.171%) ## |
0.840 ( 0.121%) # |
0.880 ( 0.193%) ## |
0.920 ( 0.336%) #### |
0.960 (92.752%) ##########|#######################################################

OSBF with EDDC:

SCORE NUMHIT DETAIL
OVERALL HISTOGRAM (. = ham, # = spam)
0.000 ( 4.007%) ..........|........
0.040 ( 3.177%) ..........|......
0.080 (18.787%) ..........|....................................
0.120 (28.415%) ..........|.......................................................
0.160 (17.588%) ..........|..................................
0.160 ( 0.006%) |
0.200 (11.369%) ..........|......................
0.200 ( 0.011%) |
0.240 ( 7.357%) ..........|..............
0.240 ( 0.022%) # |
0.280 ( 4.574%) ..........|.........
0.280 ( 0.033%) # |
0.320 ( 3.046%) ..........|......
0.320 ( 0.127%) #### |
0.360 ( 1.184%) ..........|..
0.360 ( 0.303%) ######### |
0.400 ( 0.233%) ......... |
0.400 ( 0.733%) ##########|#
0.440 ( 0.046%) .. |
0.440 ( 0.424%) ##########|#
0.480 ( 0.207%) ........ |
0.480 ( 1.560%) ##########|##
0.520 ( 0.010%) |
0.520 ( 1.036%) ##########|##
0.560 ( 1.565%) ##########|##
0.600 ( 1.984%) ##########|###
0.640 ( 5.958%) ##########|#########
0.680 (20.993%) ##########|###############################
0.720 (36.795%) ##########|#######################################################
0.760 (25.143%) ##########|######################################
0.800 ( 3.213%) ##########|#####
0.840 ( 0.083%) ## |
0.960 ( 0.011%) |

the thresholds report looks like this:

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$583.00
Total ham:spam: 19764:18144
FP: 0 0.000% FN: 7 0.039%
Unsure: 5760 15.195% (ham: 1838 9.300% spam: 3922 21.616%)
TCRs: l=1 4.618 l=5 4.618 l=9 4.618
SUMMARY: 0.30/0.70 fp 0 fn 7 uh 1838 us 3922 c 583.00

but that's unfair, because 0.70 (as you can see from the histogram) is right in the middle of most of the spam.
0.56 would be better:

Threshold optimization for hamcutoff=0.38, spamcutoff=0.56: cost=$234.80
Total ham:spam: 19764:18144
FP: 0 0.000% FN: 55 0.303%
Unsure: 899 2.372% (ham: 182 0.921% spam: 717 3.952%)
TCRs: l=1 23.503 l=5 23.503 l=9 23.503

I guess it's good, but it's not stellar :(
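(for clarity, the test setup in the last two comments is roughly the following -- a sketch only; train() and classify() are stand-ins for the real mass-check driver, not its API:)

  # 10-fold cross-validation with "--learnprob 0.1 --randseed 23":
  # for each fold, train on a random ~10% of the other nine folds'
  # messages, then score the held-out fold. (the db is reset per fold.)
  srand(23);
  my $learnprob = 0.1;
  for my $fold (0 .. 9) {
    for my $msg (grep { $_->{fold} != $fold } @corpus) {
      next unless rand() < $learnprob;
      train($msg->{class}, $msg->{text});
    }
    classify($_->{text}) for grep { $_->{fold} == $fold } @corpus;
  }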
(In reply to comment #1)
> And here's a graph on the same corpus, classified using osbf-lua:
> http://taint.org/x/2007/graph_osbflua.png

since I've been using the bayes score histograms for comparison, it's worth converting that graph into histogram format. here it is:

SCORE NUMHIT DETAIL
OVERALL HISTOGRAM (. = ham, # = spam)
0.120 ( 0.011%) |
0.160 ( 0.011%) |
0.200 ( 1.604%) ##########|##
0.240 (23.380%) ##########|#######################
0.280 (56.564%) ##########|#######################################################
0.320 (17.934%) ##########|#################
0.360 ( 0.025%) |
0.360 ( 0.419%) ######## |
0.400 ( 4.948%) ..........|.....
0.400 ( 0.072%) # |
0.440 (14.263%) ..........|.............
0.440 ( 0.006%) |
0.480 (58.313%) ..........|.......................................................
0.520 (19.955%) ..........|...................
0.560 ( 1.022%) ..........|.
0.600 ( 1.128%) ..........|.
0.640 ( 0.299%) ...... |
0.680 ( 0.020%) |
0.760 ( 0.005%) |
0.800 ( 0.015%) |
0.960 ( 0.005%) |

(the scores are an approximation: (($osbfluascore+0) + 100) / 220.)
(In reply to comment #7)
> 3. different tokenization

so I tried some of this out last night; I took one of the persistent FNs that keeps showing up around the 0.2 mark, and examined the tokens being generated during tokenization. It turned out that some of the OSBF tokenization didn't cope well with some of *our* tokens:

1. The decomposed address tokens, like "UD*jmason.org" for an email addr containing the domain "taint.org", were being split up into two tokens, "UD*" and "jmason.org" -- not useful -- so I fixed that;

2. the "key=value" metadata in the X-Spam-Relays headers was similarly being broken up into "key=", "value". fixed.

this is checked in as r587469. here's a histogram:

SCORE NUMHIT DETAIL
OVERALL HISTOGRAM (. = ham, # = spam)
0.000 (21.949%) ..........|............................................
0.040 (21.620%) ..........|...........................................
0.080 (27.737%) ..........|.......................................................
0.120 (12.351%) ..........|........................
0.160 ( 6.993%) ..........|..............
0.160 ( 0.044%) # |
0.200 ( 4.802%) ..........|..........
0.200 ( 0.006%) |
0.240 ( 2.656%) ..........|.....
0.280 ( 1.169%) ..........|..
0.280 ( 0.055%) # |
0.320 ( 0.400%) ..........|.
0.320 ( 0.215%) ##### |
0.360 ( 0.172%) ....... |
0.360 ( 0.287%) ####### |
0.400 ( 0.056%) .. |
0.400 ( 0.287%) ####### |
0.440 ( 0.083%) ## |
0.480 ( 0.096%) .... |
0.480 ( 1.075%) ##########|#
0.520 ( 0.276%) ####### |
0.560 ( 0.573%) ##########|#
0.600 ( 0.843%) ##########|#
0.640 ( 1.725%) ##########|##
0.680 ( 5.545%) ##########|#######
0.720 (20.387%) ##########|########################
0.760 (46.555%) ##########|#######################################################
0.800 (20.800%) ##########|#########################
0.840 ( 1.141%) ##########|#
0.880 ( 0.017%) |
0.920 ( 0.017%) |
0.960 ( 0.072%) ## |

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$178.60
Total ham:spam: 19764:18144
FP: 0 0.000% FN: 9 0.050%
Unsure: 1696 4.474% (ham: 374 1.892% spam: 1322 7.286%)
TCRs: l=1 13.632 l=5 13.632 l=9 13.632

Threshold optimization for hamcutoff=0.30, spamcutoff=0.54: cost=$130.40
Total ham:spam: 19764:18144
FP: 0 0.000% FN: 11 0.061%
Unsure: 597 1.575% (ham: 220 1.113% spam: 377 2.078%)
TCRs: l=1 46.763 l=5 46.763 l=9 46.763

looking quite a bit better!
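(roughly, the fix amounts to something like this sketch -- the real regexes in r587469 may well differ:)

  # pass SA's synthetic metadata tokens through whole, instead of
  # letting the OSBF-style splitter break them apart
  sub split_token {
    my $w = shift;
    return ($w) if $w =~ /^[A-Z]+\*/   # decomposed-address tokens, "UD*..."
                || $w =~ /=/;          # X-Spam-Relays "key=value" metadata
    return grep { length } split /[^\w.\-]+/, $w;   # otherwise split as before
  }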
more meddling with tokenization. r587841 is an experiment to discard OSBF-style tokenization and just use the simpler SpamAssassin "split on whitespace" tokenization with the OSBF bigram format:

SCORE NUMHIT DETAIL
OVERALL HISTOGRAM (. = ham, # = spam)
0.000 ( 9.173%) ..........|...........
0.040 (21.726%) ..........|.........................
0.040 ( 0.011%) |
0.080 (47.814%) ..........|.......................................................
0.080 ( 0.017%) |
0.120 (15.204%) ..........|.................
0.120 ( 0.017%) |
0.160 ( 3.527%) ..........|....
0.160 ( 0.006%) |
0.200 ( 1.331%) ..........|..
0.200 ( 0.022%) |
0.240 ( 0.653%) ..........|.
0.240 ( 0.143%) ## |
0.280 ( 0.263%) ...... |
0.280 ( 0.397%) ###### |
0.320 ( 0.126%) ... |
0.320 ( 0.171%) ### |
0.360 ( 0.121%) ... |
0.360 ( 0.243%) #### |
0.400 ( 0.040%) . |
0.400 ( 0.303%) ##### |
0.440 ( 0.020%) |
0.440 ( 0.353%) ###### |
0.480 ( 0.496%) ######## |
0.520 ( 0.623%) ##########|
0.560 ( 0.579%) ######### |
0.600 ( 0.882%) ##########|#
0.640 ( 1.295%) ##########|#
0.680 ( 1.554%) ##########|#
0.720 (11.001%) ##########|#########
0.760 (69.604%) ##########|#######################################################
0.800 (11.436%) ##########|#########
0.840 ( 0.777%) ##########|#
0.880 ( 0.011%) |
0.960 ( 0.061%) # |

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$160.00
Total ham:spam: 19764:18144
FP: 0 0.000% FN: 39 0.215%
Unsure: 1210 3.192% (ham: 113 0.572% spam: 1097 6.046%)
TCRs: l=1 15.972 l=5 15.972 l=9 15.972
SUMMARY: 0.30/0.70 fp 0 fn 39 uh 113 us 1097 c 160.00

So I think that basically doesn't work too well. There's a high number of one-off spam FNs scattered around the 0.040-0.440 range, and a ham FP at 0.880, which the more complex OSBF tokenization style avoids.
(In reply to comment #11)
> ham FP at 0.880

correction -- that's not an FP, my mistake.
Do you suppose it would be interesting to try using OSBF-style tokenization in place of the current Bayes tokenization?
(In reply to comment #13)
> Do you suppose it would be interesting to try using OSBF-style tokenization
> in place of the current Bayes tokenization?

I might give it a try, but I'm pretty sure the use of bigrams in OSBF is key.
I've been doing some tokenizer tweaks, but none are really doing great; so one thing that would be handy at this point is just to restate the current "baseline" best results so far, in r585992. The full 10-fold cross-validation's histogram is the last graph in comment 6 -- I'll paste it here:

SCORE NUMHIT DETAIL
OVERALL HISTOGRAM (. = ham, # = spam)
0.000 (25.415%) ..........|.......................................................
0.040 ( 9.831%) ..........|.....................
0.080 (22.571%) ..........|.................................................
0.120 (21.716%) ..........|...............................................
0.160 ( 8.435%) ..........|..................
0.200 ( 5.444%) ..........|............
0.200 ( 0.028%) # |
0.240 ( 3.916%) ..........|........
0.240 ( 0.022%) # |
0.280 ( 1.801%) ..........|....
0.280 ( 0.022%) # |
0.320 ( 0.491%) ..........|.
0.320 ( 0.226%) ##### |
0.360 ( 0.116%) ..... |
0.360 ( 0.231%) ###### |
0.400 ( 0.040%) .. |
0.400 ( 0.193%) ##### |
0.440 ( 0.132%) ### |
0.480 ( 0.223%) ..........|
0.480 ( 1.334%) ##########|##
0.520 ( 0.110%) ### |
0.560 ( 0.419%) ##########|#
0.600 ( 0.832%) ##########|#
0.640 ( 1.769%) ##########|##
0.680 ( 8.813%) ##########|###########
0.720 (36.767%) ##########|############################################
0.760 (45.712%) ##########|#######################################################
0.800 ( 3.279%) ##########|####
0.840 ( 0.006%) |
0.880 ( 0.011%) |
0.920 ( 0.022%) # |
0.960 ( 0.072%) ## |

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$206.30
Total ham:spam: 19764:18144
FP: 0 0.000% FN: 9 0.050%
Unsure: 1973 5.205% (ham: 528 2.672% spam: 1445 7.964%)
TCRs: l=1 12.479 l=5 12.479 l=9 12.479
SUMMARY: 0.30/0.70 fp 0 fn 9 uh 528 us 1445 c 206.30

Conveniently I've noticed that fold 1 is pretty representative of that graph and those numbers --

SCORE NUMHIT DETAIL
OVERALL HISTOGRAM (. = ham, # = spam)
0.000 (27.277%) ..........|.......................................................
0.040 (10.020%) ..........|....................
0.080 (21.356%) ..........|...........................................
0.120 (24.190%) ..........|.................................................
0.160 ( 8.654%) ..........|.................
0.200 ( 5.061%) ..........|..........
0.200 ( 0.055%) # |
0.240 ( 2.379%) ..........|.....
0.280 ( 0.709%) ..........|.
0.280 ( 0.055%) # |
0.320 ( 0.152%) ...... |
0.320 ( 0.386%) ##########|#
0.360 ( 0.051%) .. |
0.360 ( 0.165%) #### |
0.400 ( 0.110%) ### |
0.440 ( 0.662%) ##########|#
0.480 ( 0.152%) ...... |
0.480 ( 0.937%) ##########|#
0.520 ( 0.276%) ####### |
0.560 ( 0.827%) ##########|#
0.600 ( 1.213%) ##########|##
0.640 ( 1.985%) ##########|###
0.680 (11.025%) ##########|###############
0.720 (39.802%) ##########|######################################################
0.760 (40.463%) ##########|#######################################################
0.800 ( 1.985%) ##########|###
0.960 ( 0.055%) # |

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$20.50
Total ham:spam: 1976:1814
FP: 0 0.000% FN: 1 0.055%
Unsure: 195 5.145% (ham: 21 1.063% spam: 174 9.592%)
TCRs: l=1 10.366 l=5 10.366 l=9 10.366
SUMMARY: 0.30/0.70 fp 0 fn 1 uh 21 us 174 c 20.50

This is handy because a single fold takes 1/10th of the time to run. ;) (btw note that you have to scale the "threshold optimization" cost figure 10x to cope with the corpus size differences; I should have normalized it but didn't).

Anyway, I've checked it in as r588315. This is the new baseline for further tests.
ok, fixing a bug -- in current baseline, if multiple token strings are found with different weights, it's ~random which one gets to set the weight. revision 588709 fixes this, by simply using the lowest weight for that token string:

SCORE NUMHIT DETAIL
OVERALL HISTOGRAM (. = ham, # = spam)
0.000 (27.480%) ..........|.......................................................
0.040 (10.324%) ..........|.....................
0.080 (21.356%) ..........|...........................................
0.120 (23.785%) ..........|................................................
0.160 ( 8.553%) ..........|.................
0.200 ( 4.960%) ..........|..........
0.200 ( 0.055%) ## |
0.240 ( 2.480%) ..........|.....
0.280 ( 0.709%) ..........|.
0.280 ( 0.055%) ## |
0.320 ( 0.152%) ...... |
0.320 ( 0.386%) ##########|#
0.360 ( 0.051%) .. |
0.360 ( 0.165%) ##### |
0.400 ( 0.110%) ### |
0.440 ( 0.606%) ##########|#
0.480 ( 0.152%) ...... |
0.480 ( 1.047%) ##########|#
0.520 ( 0.276%) ######## |
0.560 ( 0.827%) ##########|#
0.600 ( 1.323%) ##########|##
0.640 ( 2.040%) ##########|###
0.680 (10.915%) ##########|###############
0.720 (39.746%) ##########|######################################################
0.760 (40.298%) ##########|#######################################################
0.800 ( 2.095%) ##########|###
0.960 ( 0.055%) ## |

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$20.70
Total ham:spam: 1976:1814
FP: 0 0.000% FN: 1 0.055%
Unsure: 197 5.198% (ham: 21 1.063% spam: 176 9.702%)
TCRs: l=1 10.249 l=5 10.249 l=9 10.249
SUMMARY: 0.30/0.70 fp 0 fn 1 uh 21 us 176 c 20.70

not an obvious improvement, but a necessary bugfix :(
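(the intended fix amounts to something like this sketch -- keep the minimum rather than last-one-wins; note comment 18 below, though: the first cut of this change turned out to be busted:)

  # when the same token string shows up with several different weights,
  # keep the lowest instead of whichever one happened to come last
  my %weight;
  for my $t (@found) {               # @found = [string, weight] pairs
    my ($str, $w) = @$t;
    $weight{$str} = $w
      if !exists $weight{$str} || $w < $weight{$str};
  }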
more tests. setting N_SIGNIFICANT_TOKENS to be infinite (ie. using all tokens instead of the N most significant/strong ones) is bad:

SCORE NUMHIT DETAIL
OVERALL HISTOGRAM (. = ham, # = spam)
0.000 ( 0.506%) ..........|.
0.040 (25.658%) ..........|.................................
0.080 (43.067%) ..........|.......................................................
0.120 (22.166%) ..........|............................
0.120 ( 0.055%) # |
0.160 ( 6.275%) ..........|........
0.200 ( 1.569%) ..........|..
0.200 ( 0.055%) # |
0.240 ( 0.607%) ..........|.
0.240 ( 0.717%) ##########|#
0.280 ( 0.051%) . |
0.280 ( 0.276%) #### |
0.320 ( 0.101%) ... |
0.320 ( 0.276%) #### |
0.360 ( 0.276%) #### |
0.400 ( 0.221%) ### |
0.440 ( 0.441%) ####### |
0.480 ( 0.662%) ##########|#
0.520 ( 1.323%) ##########|#
0.560 ( 0.882%) ##########|#
0.600 ( 0.827%) ##########|#
0.640 ( 0.882%) ##########|#
0.680 ( 1.047%) ##########|#
0.720 ( 8.379%) ##########|######
0.760 (70.948%) ##########|#######################################################
0.800 (12.679%) ##########|##########
0.880 ( 0.055%) # |

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$27.60
Total ham:spam: 1976:1814
FP: 0 0.000% FN: 15 0.827%
Unsure: 126 3.325% (ham: 3 0.152% spam: 123 6.781%)
TCRs: l=1 13.145 l=5 13.145 l=9 13.145
SUMMARY: 0.30/0.70 fp 0 fn 15 uh 3 us 123 c 27.60

N_SIGNIFICANT_TOKENS=999 is still on the wrong side of the baseline:

SCORE NUMHIT DETAIL
OVERALL HISTOGRAM (. = ham, # = spam)
0.000 (24.747%) ..........|............................................
0.040 (18.522%) ..........|.................................
0.080 (31.123%) ..........|.......................................................
0.120 (13.057%) ..........|.......................
0.160 ( 5.820%) ..........|..........
0.160 ( 0.055%) # |
0.200 ( 4.251%) ..........|........
0.240 ( 1.822%) ..........|...
0.280 ( 0.405%) ..........|.
0.280 ( 0.110%) ### |
0.320 ( 0.152%) ..... |
0.320 ( 0.331%) ######## |
0.360 ( 0.101%) .... |
0.360 ( 0.110%) ### |
0.400 ( 0.772%) ##########|#
0.440 ( 0.165%) #### |
0.480 ( 0.717%) ##########|#
0.520 ( 0.606%) ##########|#
0.560 ( 0.992%) ##########|#
0.600 ( 1.268%) ##########|##
0.640 ( 1.985%) ##########|##
0.680 ( 7.166%) ##########|########
0.720 (24.862%) ##########|#############################
0.760 (46.472%) ##########|#######################################################
0.800 (13.671%) ##########|################
0.840 ( 0.662%) ##########|#
0.920 ( 0.055%) # |

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$18.80
Total ham:spam: 1976:1814
FP: 0 0.000% FN: 1 0.055%
Unsure: 178 4.697% (ham: 13 0.658% spam: 165 9.096%)
TCRs: l=1 10.928 l=5 10.928 l=9 10.928
SUMMARY: 0.30/0.70 fp 0 fn 1 uh 13 us 165 c 18.80

N_SIGNIFICANT_TOKENS=150, ditto:

SCORE NUMHIT DETAIL
OVERALL HISTOGRAM (. = ham, # = spam)
0.000 (24.747%) ..........|............................................
0.040 (18.522%) ..........|.................................
0.080 (31.123%) ..........|.......................................................
0.120 (13.057%) ..........|.......................
0.160 ( 5.820%) ..........|..........
0.160 ( 0.055%) # |
0.200 ( 4.251%) ..........|........
0.240 ( 1.822%) ..........|...
0.280 ( 0.405%) ..........|.
0.280 ( 0.110%) ### |
0.320 ( 0.152%) ..... |
0.320 ( 0.331%) ######## |
0.360 ( 0.101%) .... |
0.360 ( 0.110%) ### |
0.400 ( 0.772%) ##########|#
0.440 ( 0.165%) #### |
0.480 ( 0.717%) ##########|#
0.520 ( 0.606%) ##########|#
0.560 ( 0.992%) ##########|#
0.600 ( 1.268%) ##########|##
0.640 ( 1.985%) ##########|##
0.680 ( 7.166%) ##########|########
0.720 (24.862%) ##########|#############################
0.760 (46.472%) ##########|#######################################################
0.800 (13.671%) ##########|################
0.840 ( 0.662%) ##########|#
0.920 ( 0.055%) # |

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$18.80
Total ham:spam: 1976:1814
FP: 0 0.000% FN: 1 0.055%
Unsure: 178 4.697% (ham: 13 0.658% spam: 165 9.096%)
TCRs: l=1 10.928 l=5 10.928 l=9 10.928
SUMMARY: 0.30/0.70 fp 0 fn 1 uh 13 us 165 c 18.80

Trying out a new tokenization, where the header and URIs are simply "split on whitespace", but the body still uses the full OSBF tokenization, is pretty bad compared to baseline:

0.000 ( 4.706%) ..........|..........
0.040 (11.285%) ..........|........................
0.080 (11.842%) ..........|.........................
0.120 (25.860%) ..........|.......................................................
0.160 (25.607%) ..........|......................................................
0.200 (11.437%) ..........|........................
0.200 ( 0.055%) # |
0.240 ( 6.174%) ..........|.............
0.280 ( 2.429%) ..........|.....
0.280 ( 0.165%) ### |
0.320 ( 0.506%) ..........|.
0.320 ( 0.276%) ##### |
0.360 ( 0.051%) .. |
0.360 ( 0.221%) #### |
0.400 ( 0.772%) ##########|#
0.440 ( 0.221%) #### |
0.480 ( 0.101%) .... |
0.480 ( 1.433%) ##########|#
0.520 ( 0.606%) ##########|#
0.560 ( 1.433%) ##########|#
0.600 ( 2.150%) ##########|##
0.640 (16.869%) ##########|################
0.680 (58.545%) ##########|#######################################################
0.720 (17.089%) ##########|################
0.760 ( 0.110%) ## |
0.840 ( 0.055%) # |

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$101.10
Total ham:spam: 1976:1814
FP: 0 0.000% FN: 1 0.055%
Unsure: 1001 26.412% (ham: 61 3.087% spam: 940 51.819%)
TCRs: l=1 1.928 l=5 1.928 l=9 1.928
SUMMARY: 0.30/0.70 fp 0 fn 1 uh 61 us 940 c 101.10

split(' ') for just headers is also not an improvement:

SCORE NUMHIT DETAIL
OVERALL HISTOGRAM (. = ham, # = spam)
0.000 (11.184%) ..........|...............
0.040 (34.615%) ..........|..............................................
0.080 (41.346%) ..........|.......................................................
0.120 (10.273%) ..........|..............
0.120 ( 0.055%) # |
0.160 ( 1.569%) ..........|..
0.200 ( 0.709%) ..........|.
0.200 ( 0.055%) # |
0.240 ( 0.304%) ........ |
0.240 ( 0.827%) ##########|#
0.280 ( 0.165%) ### |
0.320 ( 0.221%) ### |
0.360 ( 0.386%) ###### |
0.400 ( 0.551%) ######### |
0.440 ( 0.551%) ######### |
0.480 ( 1.268%) ##########|#
0.520 ( 0.992%) ##########|#
0.560 ( 1.764%) ##########|#
0.600 ( 1.488%) ##########|#
0.640 ( 5.347%) ##########|####
0.680 (70.232%) ##########|#######################################################
0.720 (14.939%) ##########|############
0.760 ( 1.103%) ##########|#
0.840 ( 0.055%) # |

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$102.30
Total ham:spam: 1976:1814
FP: 0 0.000% FN: 17 0.937%
Unsure: 853 22.507% (ham: 0 0.000% spam: 853 47.023%)
TCRs: l=1 2.085 l=5 2.085 l=9 2.085
SUMMARY: 0.30/0.70 fp 0 fn 17 uh 0 us 853 c 102.30

tokenizing just URLs this way is even worse (see that FN creeping closer to 0.0):

SCORE NUMHIT DETAIL
OVERALL HISTOGRAM (. = ham, # = spam)
0.000 (10.374%) ..........|.............
0.040 (34.109%) ..........|...........................................
0.080 (43.168%) ..........|.......................................................
0.080 ( 0.055%) # |
0.120 ( 9.818%) ..........|.............
0.160 ( 1.518%) ..........|..
0.200 ( 0.759%) ..........|.
0.200 ( 0.055%) # |
0.240 ( 0.253%) ...... |
0.240 ( 0.827%) ##########|#
0.280 ( 0.165%) ### |
0.320 ( 0.221%) ### |
0.360 ( 0.386%) ###### |
0.400 ( 0.551%) ######### |
0.440 ( 0.551%) ######### |
0.480 ( 1.323%) ##########|#
0.520 ( 0.937%) ##########|#
0.560 ( 1.985%) ##########|##
0.600 ( 1.213%) ##########|#
0.640 ( 5.788%) ##########|#####
0.680 (69.901%) ##########|#######################################################
0.720 (14.939%) ##########|############
0.760 ( 1.047%) ##########|#
0.840 ( 0.055%) # |

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$102.80
Total ham:spam: 1976:1814
FP: 0 0.000% FN: 17 0.937%
Unsure: 858 22.639% (ham: 0 0.000% spam: 858 47.299%)
TCRs: l=1 2.073 l=5 2.073 l=9 2.073
SUMMARY: 0.30/0.70 fp 0 fn 17 uh 0 us 858 c 102.80

interesting! these were all tweaks I thought might help, but they really don't -- the graphs and figures don't lie. The baseline tokenization just works better in all my testing...
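(for context, the N_SIGNIFICANT_TOKENS selection is roughly the following -- a sketch, assuming "significance" means distance of the token's P(spam) from the neutral 0.5 point, as in SA's existing Bayes:)

  # keep only the N tokens whose P(spam) estimates sit furthest from 0.5
  sub most_significant {
    my ($n, $prob) = @_;             # $prob = { token => P(spam) }
    my @ranked = sort {
      abs($prob->{$b} - 0.5) <=> abs($prob->{$a} - 0.5)
    } keys %$prob;
    $#ranked = $n - 1 if @ranked > $n;   # truncate to the strongest N
    return @ranked;
  }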
(In reply to comment #16)
> ok, fixing a bug -- in current baseline, if multiple token strings are found
> with different weights, it's ~random which one gets to set the weight.
> revision 588709 fixes this, by simply using the lowest weight for that token
> string

damn. this was wrong. Somehow (probably a failed "make"), that run wound up using different code -- so r588709 doesn't actually produce those graphs in reality, and is in fact worse :( I'm rerunning now to establish another fix for that bug, one that still shows an improvement, since the attempt in r588709 doesn't...
(In reply to comment #18)
> I'm rerunning now to establish another fix to that bug, that still displays an
> improvement, since the attempt in r588709 doesn't do that...

this proved really tricky. after a full 10-fold cv run, here's what the original (buggy) code scores, for two sample score thresholds:

SUMMARY: 0.30/0.70 fp 0 fn 9 uh 528 us 1445 c 206.30
SUMMARY: 0.20/0.80 fp 0 fn 0 uh 2378 us 17529 c 1990.70

it took a few days, but I've finally figured out a patch that is both (a) not buggy ;) and (b) has better results:

SUMMARY: 0.30/0.70 fp 0 fn 7 uh 994 us 631 c 169.50
SUMMARY: 0.20/0.80 fp 0 fn 0 uh 3018 us 3295 c 631.30

It includes a small hack -- it scales the scores up by 10%, since EDDC and the naive Bayes combiner seem to skew scores a little lower. results improve with this; it'd probably be better to analyze the EDDC equation and figure out why the scores aren't 10% higher to start with, but hey ;)

This is now the new baseline, checked in as r591167. Here's the score histogram:

SCORE NUMHIT DETAIL
OVERALL HISTOGRAM (. = ham, # = spam)
0.000 (25.086%) ..........|.......................................................
0.040 ( 9.016%) ..........|....................
0.080 (16.146%) ..........|...................................
0.120 (23.593%) ..........|....................................................
0.160 (10.888%) ..........|........................
0.200 ( 5.976%) ..........|.............
0.200 ( 0.011%) |
0.240 ( 4.265%) ..........|.........
0.240 ( 0.028%) # |
0.280 ( 2.970%) ..........|.......
0.280 ( 0.011%) |
0.320 ( 1.295%) ..........|...
0.320 ( 0.039%) # |
0.360 ( 0.390%) ..........|.
0.360 ( 0.220%) ###### |
0.400 ( 0.106%) ..... |
0.400 ( 0.209%) ###### |
0.440 ( 0.040%) .. |
0.440 ( 0.165%) ##### |
0.480 ( 0.121%) ### |
0.520 ( 0.228%) ..........|
0.520 ( 1.361%) ##########|##
0.560 ( 0.072%) ## |
0.600 ( 0.259%) ####### |
0.640 ( 0.612%) ##########|#
0.680 ( 0.970%) ##########|#
0.720 ( 2.750%) ##########|####
0.760 (11.332%) ##########|################
0.800 (38.261%) ##########|#####################################################
0.840 (40.074%) ##########|#######################################################
0.880 ( 3.390%) ##########|#####
0.920 ( 0.011%) |
0.960 ( 0.105%) ### |
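(the scaling hack itself is trivial -- something like this, with the constant purely empirical:)

  # nudge combined scores upward and clamp, to compensate for EDDC +
  # naive Bayes skewing scores low
  $score *= 1.10;
  $score = 1.0 if $score > 1.0;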
I tried adding 5% instead of 10%, in an attempt to remove a slight bias towards FPs; results weren't great:

SUMMARY: 0.30/0.70 fp 0 fn 9 uh 774 us 822 c 168.60
SUMMARY: 0.20/0.80 fp 0 fn 0 uh 2702 us 9772 c 1247.40

so sticking with 10% is better.
I had some off-list discussion with Fidelis about this... he suggests using ROCA% as a better error-rate measurement system:

Fidelis Assis writes:
> Justin Mason wrote:
> > Fidelis Assis writes:
> >> Justin Mason escreveu:
> >>> Fidelis Assis writes:
> >>>> Justin Mason escreveu:
> >>>>> Fidelis Assis writes:
> >> The other day I was in a discussion on the CRM114 list about error-rate
> >> X ROCA% and I made an analogy to archers showing why I think it's
> >> possibly better for spam filters. It might be interesting, at least as a
> >> curiosity :-)
> >>
> >> http://sourceforge.net/mailarchive/forum.php?thread_name=200711271356.lARDujYL031322%40spoo.merl.com&forum_name=crm114-general
> >
> > Ah, that's a very good explanation. You might have convinced me, I think ;)
> > If I get some time soon, I'll try re-examining those results using
> > 1-ROCA%.

he also suggests changing the inputs to the combiner:

> >>>>> from the EDDC equation is used as P(spam) values and fed into our naive
> >>>>> Bayes combiner, producing a value ranging from 0.0 (nonspam) to 0.5
> >>>>> (unsure) to 1.0 (spam).
> >>>> I don't use probabilities directly, but the ratio
> >>>> 0.59*log10(p(ham)/p(spam)). OSBF probabilities are either very close to
> >>>> 1 or to 0.
> >>> hmm, I may try that.

and he tried out osbf-lua on my test corpus:

> The filter learns better if the order of the messages is the original, or
> random, instead of a batch of a class and then a batch of the other. A
> modified script using random order is attached for your tests.

he gets much better results:

'I did the tests removing the X-Spam-* headers and I got 0 FP and 12 FN, but of the 12, 9 are exactly the same message: msg 33 in spam bucket.4 with Subject: "Congress Proposes Olympic Boycott" (is this spam?); another 2 are also the same message: msg 165 in spam bucket.2, with subject: "Notice of account temporary suspension" (paypal phishing). The last one is another paypal phishing, but with the same contents: msg 174 in spam bucket.6. If we don't count the same mistake repeatedly we have 0 FP and 3 FN, which is still very good considering that the filter was trained with only 422 msgs, and it reaches its max accuracy after 2-3k.'

so the code I've got here is a way off osbf-lua's accuracy rates yet...
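(for anyone following along: the ROC area can be computed directly from the two score lists via the rank-sum statistic -- a sketch, not necessarily what TREC's tools do internally:)

  # AUC = probability that a random spam outscores a random ham;
  # the (1-ROCA)%-style figure is then 100 * (1 - AUC).
  # O(n^2), which is fine for a sketch.
  sub roc_auc {
    my ($ham, $spam) = @_;           # refs to score lists, higher = spammier
    my ($wins, $ties) = (0, 0);
    for my $h (@$ham) {
      for my $s (@$spam) {
        if    ($s > $h)  { $wins++ }
        elsif ($s == $h) { $ties++ }
      }
    }
    return ($wins + 0.5 * $ties) / (@$ham * @$spam);
  }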
just a progress note. This has been shelved, as the accuracy rates I get from our implementation aren't good enough:

default "Bayes" from trunk:
SUMMARY: 0.30/0.70 fp 0 fn 168 uh 8 us 535 c 222.30

latest OSBF-in-perl:
SUMMARY: 0.30/0.70 fp 0 fn 7 uh 994 us 631 c 169.50

FNs are a good bit lower, but there are some weird hacks, like having to scale the scores up by 10%, which I'm not comfortable with. It needs more work, and possibly some analysis of what osbf-lua does that the perl code doesn't. I haven't done that analysis, since the osbf-lua license is incompatible, and have been working from papers/published docs instead -- but I don't think the two implementations are quite the same anymore.

Anyway, if anyone wants to take over, the code's all in svn.
moving off the release milestone, not ready.