Bug 5686 - add OSBF/Winnow as an alternative to Bayes
Summary: add OSBF/Winnow as an alternative to Bayes
Status: NEW
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Plugins
Version: SVN Trunk (Latest Devel Version)
Hardware: Other other
Importance: P5 enhancement
Target Milestone: Future
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on: 5293
Blocks:
Reported: 2007-10-14 15:49 UTC by Justin Mason
Modified: 2009-07-23 07:09 UTC
CC List: 0 users




Description Justin Mason 2007-10-14 15:49:22 UTC
Now that Bayes is a little more pluginized (bug 5293), here's what I want to
do: offer an alternative to the default BAYES rules, using a more up-to-date
probabilistic-classifier algorithm.

The plan is to use Orthogonal Sparse Bigram (OSB) tokenization combined with
the Winnow machine-learning algorithm [1][2] (a rough sketch follows the
references below).  This combination has been scoring very well in the TREC
anti-spam probabilistic-classifier shootout [3], as implemented by osbf-lua [4].

[1]: http://en.wikipedia.org/wiki/Winnow
[2]: http://www.siefkes.net/ie/winnow-spam.pdf
[3]: http://trec.nist.gov
[4]: http://osbf-lua.luaforge.net/
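
For anyone not familiar with the jargon, here's roughly what OSB feature
extraction and a Winnow update look like -- a sketch only; the window size,
threshold and promotion/demotion constants below are illustrative guesses,
not what a plugin would necessarily use:

  # Rough sketch of Orthogonal Sparse Bigram (OSB) feature extraction: pair
  # each token with the previous few tokens, recording the gap.  (A window
  # of 5 tokens is what the papers use, I think.)
  sub osb_features {
      my @toks = @_;
      my @feats;
      for my $i (0 .. $#toks) {
          for my $gap (1 .. 4) {
              last if $i - $gap < 0;
              push @feats,
                  join(' ', $toks[$i - $gap], ('<skip>') x ($gap - 1), $toks[$i]);
          }
      }
      return @feats;
  }

  # And the Winnow side: multiplicative weight updates, applied only when
  # the classifier gets a training message wrong (train-on-error).
  my %w;    # feature string => weight
  sub winnow_train {
      my ($feats, $is_spam) = @_;
      my $threshold = scalar @$feats;              # theta = number of active features
      my $sum = 0;
      for my $f (@$feats) {
          $w{$f} = 1.0 unless exists $w{$f};
          $sum += $w{$f};
      }
      my $said_spam = ($sum > $threshold) ? 1 : 0;
      return if $said_spam == $is_spam;            # correct -- no update
      my $factor = $is_spam ? 1.23 : 1 / 1.23;     # promote on FN, demote on FP
      $w{$_} *= $factor for @$feats;
  }
  # e.g. winnow_train([ osb_features(@tokens) ], 1) to train on a spam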

initial results of an implementation (based on that paper) seem very promising
so far...
Comment 1 Justin Mason 2007-10-15 05:51:01 UTC
Let's have some graphs!

Here's a graph of scores from SVN trunk's version of Bayes, measured using
10-fold cross validation on a corpus of ~2000 recent spam and ~2000 recent ham
from my collection (I'm using this corpus to measure results as I develop
this):

    http://taint.org/x/2007/graph_trunk.png

And here's a graph on the same corpus, classified using osbf-lua:

    http://taint.org/x/2007/graph_osbflua.png

You can see several things:

- current trunk's Bayes has a tendency to put a fair bit of spam into the
  "unsure" middle ground, BAYES_50, where it gets no score.

- osbf-lua is better at separating more of the samples into their correct
  class, with a more or less clear dividing line around -15.  (I'm not sure
  what their score figure represents.)

This demonstrates that the algorithms used in osbf-lua are pretty effective, in
my opinion (and gives us an idea of what osbf can do, something to aim for with
our implementation).
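
(In case it's not obvious what I mean by "10-fold cross validation": split the
corpus into 10 buckets, train on 9 of them, score the held-out one, and rotate.
Roughly -- this is a generic illustration, not the real mass-check driver, and
train()/score_and_record() are made-up stand-ins:)

  my @folds;
  my $i = 0;
  push @{ $folds[$i++ % 10] }, $_ for @messages;   # @messages assumed given
  for my $held_out (0 .. 9) {
      my @train = map { @{ $folds[$_] } } grep { $_ != $held_out } 0 .. 9;
      my @test  = @{ $folds[$held_out] };
      train(@train);                               # stand-in
      score_and_record(@test);                     # stand-in
  }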


Now for the implementation of Winnow/OSBF, checked in as r584432,
compared to SVN trunk.  Here's a score histogram from trunk:

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.000 (99.914%) ..........|.......................................................
0.000 ( 0.761%) ######### |
0.040 ( 0.020%)           |
0.040 ( 0.028%)           |
0.080 ( 0.050%) #         |
0.120 ( 0.015%)           |
0.120 ( 0.039%)           |
0.160 ( 0.005%)           |
0.160 ( 0.011%)           |
0.200 ( 0.017%)           |
0.240 ( 0.005%)           |
0.240 ( 0.022%)           |
0.280 ( 0.017%)           |
0.320 ( 0.011%)           |
0.360 ( 0.028%)           |
0.400 ( 0.010%)           |
0.400 ( 0.017%)           |
0.440 ( 0.005%)           |
0.440 ( 0.083%) #         |
0.480 ( 0.025%)           |
0.480 ( 2.122%) ##########|#
0.520 ( 0.231%) ###       |
0.560 ( 0.138%) ##        |
0.600 ( 0.088%) #         |
0.640 ( 0.127%) #         |
0.680 ( 0.121%) #         |
0.720 ( 0.182%) ##        |
0.760 ( 0.193%) ##        |
0.800 ( 0.187%) ##        |
0.840 ( 0.116%) #         |
0.880 ( 0.215%) ##        |
0.920 ( 0.375%) ####      |
0.960 (94.825%) ##########|#######################################################

(Hopefully that pastes OK.)  What we want to see is all "."s (ham) at 0.000,
all "#"s (spam) at 0.960-1.0, no "."s between 0.5 and 1.0 (those would be false
positives), and no "#"s between 0.0 and 0.5 (false negatives).


here's the histogram for r584432:

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.000 (94.728%) ..........|.......................................................
0.000 ( 0.077%) #         |
0.960 ( 5.272%) ..........|...
0.960 (99.923%) ##########|#######################################################

That's very good, except for the 5.272% false-positive rate :(  We need to
avoid that, since 5% FPs is serious.

the "thresholds" cost figure (in "results/thresholds.static"), which comes up
with a single-figure metric based on the score distribution, looks like this:

trunk:
  Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$222.30
  Total ham:spam:   19764:18144
  FP:     0 0.000%    FN:   168 0.926%
  Unsure:   543 1.432%     (ham:     8 0.040%    spam:   535 2.949%)
  TCRs:              l=1 25.809    l=5 25.809    l=9 25.809
  SUMMARY: 0.30/0.70  fp     0 fn   168 uh     8 us   535    c 222.30

r584432:
  Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$10434.00
  Total ham:spam:   19764:18144
  FP:  1042 5.272%    FN:    14 0.077%
  Unsure:     0 0.000%     (ham:     0 0.000%    spam:     0 0.000%)
  TCRs:              l=1 17.182    l=5 3.473    l=9 1.932
  SUMMARY: 0.30/0.70  fp  1042 fn    14 uh     0 us     0    c 10434.00

That cost metric penalised the 5% FP rate very heavily.
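
(For reference, that cost figure appears to charge $10 per FP, $1 per FN and
$0.10 per unsure message -- the summaries above are consistent with those
weights, though I haven't double-checked the script.  Roughly:)

  # Reconstruction of the thresholds cost metric; the 10/1/0.1 weights are
  # inferred from the summary lines above, not read out of the script.
  sub cost {
      my ($fp, $fn, $unsure) = @_;
      return 10.0 * $fp + 1.0 * $fn + 0.1 * $unsure;
  }
  # e.g. cost(1042, 14, 0) == 10434.00 (the r584432 summary) and
  #      cost(0, 168, 8 + 535) == 222.30 (the trunk summary)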

Comment 2 Justin Mason 2007-10-17 15:03:27 UTC
OK, I've now implemented osbf-lua-style OSBF, with EDDC (Exponential
Differential Document Count), as r584760. (Note that r584432 described above
wasn't OSBF -- it was just OSB ;)

The test took too long. ;)  I interrupted it after 5 of the 10 folds;
this histogram is about representative:

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.000 ( 5.820%) ..........|..........
0.040 ( 6.225%) ..........|..........
0.080 ( 3.998%) ..........|.......
0.120 (20.749%) ..........|..................................
0.160 (33.198%) ..........|.......................................................
0.200 (18.370%) ..........|..............................
0.200 ( 0.055%) #         |
0.240 ( 9.565%) ..........|................
0.280 ( 1.721%) ..........|...
0.280 ( 0.331%) #####     |
0.320 ( 0.101%) ...       |
0.320 ( 0.110%) ##        |
0.360 ( 0.101%) ...       |
0.360 ( 0.331%) #####     |
0.400 ( 0.110%) ##        |
0.440 ( 0.496%) #######   |
0.480 ( 0.152%) .....     |
0.480 ( 1.929%) ##########|#
0.520 ( 1.103%) ##########|#
0.560 ( 1.213%) ##########|#
0.600 ( 1.323%) ##########|#
0.640 ( 0.717%) ##########|#
0.680 ( 8.434%) ##########|######
0.720 (77.233%) ##########|#######################################################
0.760 ( 6.615%) ##########|#####


Note that the fundamental shape has changed, since OSBF uses a traditional
naive Bayesian combiner instead of the binary Winnow style or the nearly-binary
Robinsonian chi-square combiner.  I think OSBF is what's behind the impressively
low number of FPs and FNs, though!

Here are the numbers:

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$21.00
Total ham:spam:   1976:1814
FP:     0 0.000%    FN:     1 0.055%
Unsure:   200 5.277%     (ham:    41 2.075%    spam:   159 8.765%)
TCRs:              l=1 11.338    l=5 11.338    l=9 11.338
SUMMARY: 0.30/0.70  fp     0 fn     1 uh    41 us   159    c 21.00

I think I need to keep working on this...
Comment 3 Justin Mason 2007-10-18 07:55:11 UTC
Here are some score-frequency histograms comparing alternative values for the
K3 constant in the EDDC equation.  (These are all after 1 of the 10 folds, to
get a quick idea; the 10 folds are self-similar enough that IMO this is safe.)

K3=8

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.280 (64.626%) ..........|.......................................................
0.280 ( 0.055%) #         |
0.320 ( 0.051%) .         |
0.440 (35.324%) ..........|..............................
0.440 ( 0.606%) #######   |
0.480 ( 6.395%) ##########|####
0.520 (92.778%) ##########|#######################################################
0.680 ( 0.165%) ##        |


K3=6

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.280 (69.332%) ..........|.......................................................
0.280 ( 0.055%) #         |
0.400 (21.255%) ..........|.................
0.440 ( 9.413%) ..........|.......
0.440 ( 0.772%) ######### |
0.480 ( 3.528%) ##########|##
0.520 (95.480%) ##########|#######################################################
0.680 ( 0.165%) ##        |


K3=4

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.240 (38.985%) ..........|.......................................................
0.280 (36.531%) ..........|....................................................
0.280 ( 0.094%) ##        |
0.360 ( 4.346%) ..........|......
0.400 (18.974%) ..........|...........................
0.400 ( 0.006%)           |
0.440 ( 1.138%) ..........|..
0.440 ( 0.628%) ##########|#
0.480 ( 0.025%) .         |
0.480 ( 1.631%) ##########|##
0.520 (46.616%) ##########|##################################################
0.560 (50.871%) ##########|#######################################################
0.640 ( 0.006%)           |
0.680 ( 0.033%) #         |
0.720 ( 0.116%) ###       |
Comment 4 Justin Mason 2007-10-18 13:15:18 UTC
K3=2
SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.240 (39.524%) ..........|.......................................................
0.280 (36.032%) ..........|..................................................
0.280 ( 0.110%) ##        |
0.360 ( 2.227%) ..........|...
0.400 (21.306%) ..........|..............................
0.400 ( 0.055%) #         |
0.440 ( 0.911%) ..........|.
0.440 ( 0.827%) ##########|#
0.480 ( 2.205%) ##########|##
0.520 (56.505%) ##########|#######################################################
0.560 (40.132%) ##########|#######################################
0.680 ( 0.055%) #         |
0.720 ( 0.110%) ##        |


I've also implemented the Bayes chain rule algorithm described
in the EDDC paper, in r585450.  Here's a histogram using K3=1 and
that combiner:

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.000 (100.000%) ..........|.......................................................
0.000 ( 1.764%) ##########|#
0.960 (98.236%) ##########|#######################################################

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$32.00
Total ham:spam:   1976:1814
FP:     0 0.000%    FN:    32 1.764%
Unsure:     0 0.000%     (ham:     0 0.000%    spam:     0 0.000%)
TCRs:              l=1 56.688    l=5 56.687    l=9 56.688
SUMMARY: 0.30/0.70  fp     0 fn    32 uh     0 us     0    c 32.00


That's pretty cool -- a 0% FP rate!  But the 1.7% FN rate is not great.
The tweaks continue...
Comment 5 Justin Mason 2007-10-19 07:09:13 UTC
Here's that Bayes chain rule combiner again, with K3=8:

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.000 (100.000%) ..........|.......................................................
0.000 ( 1.709%) ##########|#
0.040 ( 0.055%) #         |
0.120 ( 0.055%) #         |
0.920 ( 0.055%) #         |
0.960 (98.126%) ##########|#######################################################

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$33.00
Total ham:spam:   1976:1814
FP:     0 0.000%    FN:    33 1.819%
Unsure:     0 0.000%     (ham:     0 0.000%    spam:     0 0.000%)
TCRs:              l=1 54.970    l=5 54.970    l=9 54.970
SUMMARY: 0.30/0.70  fp     0 fn    33 uh     0 us     0    c 33.00


So that doesn't really improve things much.  The method used in comment 2
seems to be performing the best so far: OSBF with traditional naive Bayes
combining (at least, the trad combiner we used in SA before we switched to
Fisher/Robinson chi-square combining).
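
(By the "trad combiner" I mean, roughly, the Graham-style naive Bayes
combination below -- a sketch from memory, not a copy of the old SA code:)

  # 0.0 = ham, 0.5 = unsure, 1.0 = spam; @probs are the per-token P(spam)
  # estimates for the message.
  sub combine_naive_bayes {
      my @probs = @_;
      my ($s, $h) = (1, 1);
      for my $p (@probs) {
          $s *= $p;
          $h *= (1 - $p);
      }
      return $s / ($s + $h);
  }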
Comment 6 Justin Mason 2007-10-20 07:19:52 UTC
ok, some more tests....

trying the (crazy) K3=20 with the Bayes chain rule combiner:
SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.000 (100.000%) ..........|.......................................................
0.000 ( 1.047%) ##########|#
0.200 ( 0.110%) #         |
0.320 ( 0.055%) #         |
0.680 ( 0.055%) #         |
0.800 ( 0.055%) #         |
0.840 ( 0.055%) #         |
0.880 ( 0.165%) ##        |
0.920 ( 0.110%) #         |
0.960 (98.346%) ##########|#######################################################

let's try the naive Bayes combiner, K3 = 0.8:

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.120 (32.439%) ..........|......................................................
0.160 (32.844%) ..........|.......................................................
0.160 ( 0.055%) #         |
0.200 (27.379%) ..........|..............................................
0.240 ( 6.275%) ..........|...........
0.280 ( 0.658%) ..........|.
0.280 ( 0.331%) ######    |
0.320 ( 0.152%) .....     |
0.320 ( 0.221%) ####      |
0.360 ( 0.051%) ..        |
0.400 ( 0.202%) .......   |
0.400 ( 0.110%) ##        |
0.440 ( 0.331%) ######    |
0.480 ( 0.496%) ######### |
0.520 ( 0.331%) ######    |
0.560 ( 1.378%) ##########|#
0.600 (15.160%) ##########|##############
0.640 (57.938%) ##########|#######################################################
0.680 ( 2.426%) ##########|##
0.720 (20.066%) ##########|###################
0.760 ( 1.047%) ##########|#
0.840 ( 0.110%) ##        |


So far, K3=1 with the traditional naive Bayes combiner is working best for us,
since it avoids the FPs and FNs that the other settings leave behind.

To compare with the figures from comment 1, here are the results from a full
10-fold cross-validation:

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.000 (25.415%) ..........|.......................................................
0.040 ( 9.831%) ..........|.....................
0.080 (22.571%) ..........|.................................................
0.120 (21.716%) ..........|...............................................
0.160 ( 8.435%) ..........|..................
0.200 ( 5.444%) ..........|............
0.200 ( 0.028%) #         |
0.240 ( 3.916%) ..........|........
0.240 ( 0.022%) #         |
0.280 ( 1.801%) ..........|....
0.280 ( 0.022%) #         |
0.320 ( 0.491%) ..........|.
0.320 ( 0.226%) #####     |
0.360 ( 0.116%) .....     |
0.360 ( 0.231%) ######    |
0.400 ( 0.040%) ..        |
0.400 ( 0.193%) #####     |
0.440 ( 0.132%) ###       |
0.480 ( 0.223%) ..........|
0.480 ( 1.334%) ##########|##
0.520 ( 0.110%) ###       |
0.560 ( 0.419%) ##########|#
0.600 ( 0.832%) ##########|#
0.640 ( 1.769%) ##########|##
0.680 ( 8.813%) ##########|###########
0.720 (36.767%) ##########|############################################
0.760 (45.712%) ##########|#######################################################
0.800 ( 3.279%) ##########|####
0.840 ( 0.006%)           |
0.880 ( 0.011%)           |
0.920 ( 0.022%) #         |
0.960 ( 0.072%) ##        |

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$206.30
Total ham:spam:   19764:18144
FP:     0 0.000%    FN:     9 0.050%
Unsure:  1973 5.205%     (ham:   528 2.672%    spam:  1445 7.964%)
TCRs:              l=1 12.479    l=5 12.479    l=9 12.479
SUMMARY: 0.30/0.70  fp     0 fn     9 uh   528 us  1445    c 206.30

Comment 7 Justin Mason 2007-10-21 10:41:11 UTC
things I now want to test:

1. the effect of smaller/bigger db sizes on the OSBF code (right now I've more
or less disabled expiry for these tests, which is unrealistic)

2. the effect of less training data, which is the real issue -- can OSBF do a
better job with tiny amounts of training than our existing Bayes impl?

3. different tokenization
Comment 8 Justin Mason 2007-10-22 07:10:43 UTC
(In reply to comment #7)
> 2. the effect of less training data, which is the real issue -- can OSBF do a
> better job with tiny amounts of training than our existing Bayes impl?

Results from the weekend's testing of this: I ran the 10-fold cross-validation
driver with "--learnprob 0.1 --randseed 23" -- i.e. train on only 10% of the
messages -- and got these histograms:

SVN trunk:

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$252.30
Total ham:spam:   19764:18144
FP:     0 0.000%    FN:   155 0.854%
Unsure:   973 2.567%     (ham:    24 0.121%    spam:   949 5.230%)
TCRs:              l=1 16.435    l=5 16.435    l=9 16.435
SUMMARY: 0.30/0.70  fp     0 fn   155 uh    24 us   949    c 252.30

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.000 (99.676%) ..........|.......................................................
0.000 ( 0.645%) ########  |
0.040 ( 0.040%)           |
0.040 ( 0.055%) #         |
0.080 ( 0.040%)           |
0.080 ( 0.022%)           |
0.120 ( 0.030%)           |
0.120 ( 0.050%) #         |
0.160 ( 0.035%)           |
0.160 ( 0.022%)           |
0.200 ( 0.040%)           |
0.200 ( 0.028%)           |
0.240 ( 0.015%)           |
0.240 ( 0.033%)           |
0.280 ( 0.020%)           |
0.280 ( 0.077%) #         |
0.320 ( 0.015%)           |
0.320 ( 0.061%) #         |
0.360 ( 0.015%)           |
0.360 ( 0.044%) #         |
0.400 ( 0.015%)           |
0.400 ( 0.121%) #         |
0.440 ( 0.035%)           |
0.440 ( 0.198%) ##        |
0.480 ( 0.020%)           |
0.480 ( 3.919%) ##########|##
0.520 ( 0.314%) ####      |
0.560 ( 0.165%) ##        |
0.600 ( 0.149%) ##        |
0.640 ( 0.077%) #         |
0.680 ( 0.215%) ###       |
0.720 ( 0.116%) #         |
0.760 ( 0.116%) #         |
0.800 ( 0.171%) ##        |
0.840 ( 0.121%) #         |
0.880 ( 0.193%) ##        |
0.920 ( 0.336%) ####      |
0.960 (92.752%) ##########|#######################################################


OSBF with EDDC:

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.000 ( 4.007%) ..........|........
0.040 ( 3.177%) ..........|......
0.080 (18.787%) ..........|....................................
0.120 (28.415%) ..........|.......................................................
0.160 (17.588%) ..........|..................................
0.160 ( 0.006%)           |
0.200 (11.369%) ..........|......................
0.200 ( 0.011%)           |
0.240 ( 7.357%) ..........|..............
0.240 ( 0.022%) #         |
0.280 ( 4.574%) ..........|.........
0.280 ( 0.033%) #         |
0.320 ( 3.046%) ..........|......
0.320 ( 0.127%) ####      |
0.360 ( 1.184%) ..........|..
0.360 ( 0.303%) ######### |
0.400 ( 0.233%) ......... |
0.400 ( 0.733%) ##########|#
0.440 ( 0.046%) ..        |
0.440 ( 0.424%) ##########|#
0.480 ( 0.207%) ........  |
0.480 ( 1.560%) ##########|##
0.520 ( 0.010%)           |
0.520 ( 1.036%) ##########|##
0.560 ( 1.565%) ##########|##
0.600 ( 1.984%) ##########|###
0.640 ( 5.958%) ##########|#########
0.680 (20.993%) ##########|###############################
0.720 (36.795%) ##########|#######################################################
0.760 (25.143%) ##########|######################################
0.800 ( 3.213%) ##########|#####
0.840 ( 0.083%) ##        |
0.960 ( 0.011%)           |

The thresholds report looks like this:
Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$583.00
Total ham:spam:   19764:18144
FP:     0 0.000%    FN:     7 0.039%
Unsure:  5760 15.195%     (ham:  1838 9.300%    spam:  3922 21.616%)
TCRs:              l=1 4.618    l=5 4.618    l=9 4.618
SUMMARY: 0.30/0.70  fp     0 fn     7 uh  1838 us  3922    c 583.00

But that's unfair, because the 0.70 spam cutoff (as you can see from the
histogram) falls right in the middle of most of the spam.  0.56 would be better:

Threshold optimization for hamcutoff=0.38, spamcutoff=0.56: cost=$234.80
Total ham:spam:   19764:18144
FP:     0 0.000%    FN:    55 0.303%
Unsure:   899 2.372%     (ham:   182 0.921%    spam:   717 3.952%)
TCRs:              l=1 23.503    l=5 23.503    l=9 23.503

I guess it's good, but it's not stellar :(
Comment 9 Justin Mason 2007-10-22 07:43:16 UTC
(In reply to comment #1)
> And here's a graph on the same corpus, classified using osbf-lua:
>     http://taint.org/x/2007/graph_osbflua.png

Since I've been using the Bayes score histograms for comparison, it's worth
converting that graph into histogram format.  Here it is:

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.120 ( 0.011%)           |
0.160 ( 0.011%)           |
0.200 ( 1.604%) ##########|##
0.240 (23.380%) ##########|#######################
0.280 (56.564%) ##########|#######################################################
0.320 (17.934%) ##########|#################
0.360 ( 0.025%)           |
0.360 ( 0.419%) ########  |
0.400 ( 4.948%) ..........|.....
0.400 ( 0.072%) #         |
0.440 (14.263%) ..........|.............
0.440 ( 0.006%)           |
0.480 (58.313%) ..........|.......................................................
0.520 (19.955%) ..........|...................
0.560 ( 1.022%) ..........|.
0.600 ( 1.128%) ..........|.
0.640 ( 0.299%) ......    |
0.680 ( 0.020%)           |
0.760 ( 0.005%)           |
0.800 ( 0.015%)           |
0.960 ( 0.005%)           |

(The scores are an approximation: (($osbfluascore + 0) + 100) / 220.)
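
Or as a tiny helper sub, for reference (the -15 "dividing line" from comment 1
comes out at about 0.39 on this scale):

  # the approximate mapping used to build the histogram above
  sub osbflua_to_unit {
      my ($score) = @_;
      return (($score + 0) + 100) / 220;
  }
  # osbflua_to_unit(-15) is about 0.386, which is roughly where the
  # spam/ham boundary sits in the converted histogram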
Comment 10 Justin Mason 2007-10-23 05:08:44 UTC
(In reply to comment #7)
> 3. different tokenization

So I tried some of this out last night: I took one of the persistent FNs that
keeps showing up around the 0.2 mark, and examined the tokens being generated
during tokenization.  It turned out that the OSBF tokenization didn't cope well
with some of *our* tokens.

1. The decomposed address tokens, like "UD*jmason.org" for an email addr
containing the domain "taint.org", were being split up into two tokens "UD*"
and "jmason.org" -- not useful -- so I fixed that;

2. the "key=value" metadata in the X-Spam-Relays headers was similarly being
broken up into "key=", "value".  fixed.
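
(Roughly the shape of the fix -- a sketch only, not the actual r587469 diff:)

  # Keep SA's synthetic tokens whole rather than letting the OSBF-style
  # splitter break them at '*' or '='.  @raw_tokens and osbf_style_split()
  # are made-up names standing in for the real code.
  my @out;
  for my $tok (@raw_tokens) {
      if ($tok =~ /^[A-Z]+\*\S/ || $tok =~ /^\S+=\S+$/) {
          push @out, $tok;                    # e.g. "UD*jmason.org", "key=value"
      }
      else {
          push @out, osbf_style_split($tok);
      }
  }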

this is checked in as r587469.  here's a histogram:

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.000 (21.949%) ..........|............................................
0.040 (21.620%) ..........|...........................................
0.080 (27.737%) ..........|.......................................................
0.120 (12.351%) ..........|........................
0.160 ( 6.993%) ..........|..............
0.160 ( 0.044%) #         |
0.200 ( 4.802%) ..........|..........
0.200 ( 0.006%)           |
0.240 ( 2.656%) ..........|.....
0.280 ( 1.169%) ..........|..
0.280 ( 0.055%) #         |
0.320 ( 0.400%) ..........|.
0.320 ( 0.215%) #####     |
0.360 ( 0.172%) .......   |
0.360 ( 0.287%) #######   |
0.400 ( 0.056%) ..        |
0.400 ( 0.287%) #######   |
0.440 ( 0.083%) ##        |
0.480 ( 0.096%) ....      |
0.480 ( 1.075%) ##########|#
0.520 ( 0.276%) #######   |
0.560 ( 0.573%) ##########|#
0.600 ( 0.843%) ##########|#
0.640 ( 1.725%) ##########|##
0.680 ( 5.545%) ##########|#######
0.720 (20.387%) ##########|########################
0.760 (46.555%) ##########|#######################################################
0.800 (20.800%) ##########|#########################
0.840 ( 1.141%) ##########|#
0.880 ( 0.017%)           |
0.920 ( 0.017%)           |
0.960 ( 0.072%) ##        |

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$178.60
Total ham:spam:   19764:18144
FP:     0 0.000%    FN:     9 0.050%
Unsure:  1696 4.474%     (ham:   374 1.892%    spam:  1322 7.286%)
TCRs:              l=1 13.632    l=5 13.632    l=9 13.632

Threshold optimization for hamcutoff=0.30, spamcutoff=0.54: cost=$130.40
Total ham:spam:   19764:18144
FP:     0 0.000%    FN:    11 0.061%
Unsure:   597 1.575%     (ham:   220 1.113%    spam:   377 2.078%)
TCRs:              l=1 46.763    l=5 46.763    l=9 46.763


looking quite a bit better!
Comment 11 Justin Mason 2007-10-24 03:04:57 UTC
more meddling with tokenization.  r587841 is an experiment to discard
OSBF-style tokenization and just use the simpler SpamAssassin "split on
whitespace" tokenization with the OSBF bigram format:

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.000 ( 9.173%) ..........|...........
0.040 (21.726%) ..........|.........................
0.040 ( 0.011%)           |
0.080 (47.814%) ..........|.......................................................
0.080 ( 0.017%)           |
0.120 (15.204%) ..........|.................
0.120 ( 0.017%)           |
0.160 ( 3.527%) ..........|....
0.160 ( 0.006%)           |
0.200 ( 1.331%) ..........|..
0.200 ( 0.022%)           |
0.240 ( 0.653%) ..........|.
0.240 ( 0.143%) ##        |
0.280 ( 0.263%) ......    |
0.280 ( 0.397%) ######    |
0.320 ( 0.126%) ...       |
0.320 ( 0.171%) ###       |
0.360 ( 0.121%) ...       |
0.360 ( 0.243%) ####      |
0.400 ( 0.040%) .         |
0.400 ( 0.303%) #####     |
0.440 ( 0.020%)           |
0.440 ( 0.353%) ######    |
0.480 ( 0.496%) ########  |
0.520 ( 0.623%) ##########|
0.560 ( 0.579%) ######### |
0.600 ( 0.882%) ##########|#
0.640 ( 1.295%) ##########|#
0.680 ( 1.554%) ##########|#
0.720 (11.001%) ##########|#########
0.760 (69.604%) ##########|#######################################################
0.800 (11.436%) ##########|#########
0.840 ( 0.777%) ##########|#
0.880 ( 0.011%)           |
0.960 ( 0.061%) #         |

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$160.00
Total ham:spam:   19764:18144
FP:     0 0.000%    FN:    39 0.215%
Unsure:  1210 3.192%     (ham:   113 0.572%    spam:  1097 6.046%)
TCRs:              l=1 15.972    l=5 15.972    l=9 15.972
SUMMARY: 0.30/0.70  fp     0 fn    39 uh   113 us  1097    c 160.00


So I think that basically doesn't work too well.  There's a high number of
one-off spam FNs scattered around the 0.040-0.440 range, and a ham FP at
0.880, which the more complex OSBF tokenization style avoids.
Comment 12 Justin Mason 2007-10-24 03:06:29 UTC
(In reply to comment #11)
> ham FP at 0.880

correction -- that's not an FP, my mistake.
Comment 13 Loren Wilton 2007-10-24 03:24:39 UTC
Do you suppose it would be interesting to try using OSBF-style tokenization in 
place of the current Bayes tokenization?
Comment 14 Justin Mason 2007-10-24 03:41:17 UTC
(In reply to comment #13)
> Do you suppose it would be interesting to try using OSBF-style tokenization in 
> place of the current Bayes tokenization?

I might give it a try, but I'm pretty sure the use of bigrams in OSBF is key.
Comment 15 Justin Mason 2007-10-25 12:11:22 UTC
I've been doing some tokenizer tweaks, but none are really doing great; so one
thing that would be handy at this point is just to restate the current
"baseline" best results so far, in r585992.

The full 10-fold cross-validation's histogram is the last graph in comment 6 --
I'll paste it here:

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.000 (25.415%) ..........|.......................................................
0.040 ( 9.831%) ..........|.....................
0.080 (22.571%) ..........|.................................................
0.120 (21.716%) ..........|...............................................
0.160 ( 8.435%) ..........|..................
0.200 ( 5.444%) ..........|............
0.200 ( 0.028%) #         |
0.240 ( 3.916%) ..........|........
0.240 ( 0.022%) #         |
0.280 ( 1.801%) ..........|....
0.280 ( 0.022%) #         |
0.320 ( 0.491%) ..........|.
0.320 ( 0.226%) #####     |
0.360 ( 0.116%) .....     |
0.360 ( 0.231%) ######    |
0.400 ( 0.040%) ..        |
0.400 ( 0.193%) #####     |
0.440 ( 0.132%) ###       |
0.480 ( 0.223%) ..........|
0.480 ( 1.334%) ##########|##
0.520 ( 0.110%) ###       |
0.560 ( 0.419%) ##########|#
0.600 ( 0.832%) ##########|#
0.640 ( 1.769%) ##########|##
0.680 ( 8.813%) ##########|###########
0.720 (36.767%) ##########|############################################
0.760 (45.712%) ##########|#######################################################
0.800 ( 3.279%) ##########|####
0.840 ( 0.006%)           |
0.880 ( 0.011%)           |
0.920 ( 0.022%) #         |
0.960 ( 0.072%) ##        |

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$206.30
Total ham:spam:   19764:18144
FP:     0 0.000%    FN:     9 0.050%
Unsure:  1973 5.205%     (ham:   528 2.672%    spam:  1445 7.964%)
TCRs:              l=1 12.479    l=5 12.479    l=9 12.479
SUMMARY: 0.30/0.70  fp     0 fn     9 uh   528 us  1445    c 206.30

Conveniently I've noticed that fold 1 is pretty representative of that graph
and those numbers -- 

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.000 (27.277%) ..........|.......................................................
0.040 (10.020%) ..........|....................
0.080 (21.356%) ..........|...........................................
0.120 (24.190%) ..........|.................................................
0.160 ( 8.654%) ..........|.................
0.200 ( 5.061%) ..........|..........
0.200 ( 0.055%) #         |
0.240 ( 2.379%) ..........|.....
0.280 ( 0.709%) ..........|.
0.280 ( 0.055%) #         |
0.320 ( 0.152%) ......    |
0.320 ( 0.386%) ##########|#
0.360 ( 0.051%) ..        |
0.360 ( 0.165%) ####      |
0.400 ( 0.110%) ###       |
0.440 ( 0.662%) ##########|#
0.480 ( 0.152%) ......    |
0.480 ( 0.937%) ##########|#
0.520 ( 0.276%) #######   |
0.560 ( 0.827%) ##########|#
0.600 ( 1.213%) ##########|##
0.640 ( 1.985%) ##########|###
0.680 (11.025%) ##########|###############
0.720 (39.802%) ##########|######################################################
0.760 (40.463%) ##########|#######################################################
0.800 ( 1.985%) ##########|###
0.960 ( 0.055%) #         |

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$20.50
Total ham:spam:   1976:1814
FP:     0 0.000%    FN:     1 0.055%
Unsure:   195 5.145%     (ham:    21 1.063%    spam:   174 9.592%)
TCRs:              l=1 10.366    l=5 10.366    l=9 10.366
SUMMARY: 0.30/0.70  fp     0 fn     1 uh    21 us   174    c 20.50

This is handy because a single fold takes 1/10th of the time to run. ;)

(BTW, note that you have to scale the "threshold optimization" cost figure 10x
to account for the corpus size difference; I should have normalized it but
didn't.)

Anyway, I've checked it in as r588315.  This is the new baseline for further tests.
Comment 16 Justin Mason 2007-10-26 09:59:39 UTC
OK, fixing a bug -- in current baseline, if multiple token strings are found
with different weights, it's ~random which one gets to set the weight.  Revision
588709 fixes this, by simply using the lowest weight for that token string:

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.000 (27.480%) ..........|.......................................................
0.040 (10.324%) ..........|.....................
0.080 (21.356%) ..........|...........................................
0.120 (23.785%) ..........|................................................
0.160 ( 8.553%) ..........|.................
0.200 ( 4.960%) ..........|..........
0.200 ( 0.055%) ##        |
0.240 ( 2.480%) ..........|.....
0.280 ( 0.709%) ..........|.
0.280 ( 0.055%) ##        |
0.320 ( 0.152%) ......    |
0.320 ( 0.386%) ##########|#
0.360 ( 0.051%) ..        |
0.360 ( 0.165%) #####     |
0.400 ( 0.110%) ###       |
0.440 ( 0.606%) ##########|#
0.480 ( 0.152%) ......    |
0.480 ( 1.047%) ##########|#
0.520 ( 0.276%) ########  |
0.560 ( 0.827%) ##########|#
0.600 ( 1.323%) ##########|##
0.640 ( 2.040%) ##########|###
0.680 (10.915%) ##########|###############
0.720 (39.746%) ##########|######################################################
0.760 (40.298%) ##########|#######################################################
0.800 ( 2.095%) ##########|###
0.960 ( 0.055%) ##        |

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$20.70
Total ham:spam:   1976:1814
FP:     0 0.000%    FN:     1 0.055%
Unsure:   197 5.198%     (ham:    21 1.063%    spam:   176 9.702%)
TCRs:              l=1 10.249    l=5 10.249    l=9 10.249
SUMMARY: 0.30/0.70  fp     0 fn     1 uh    21 us   176    c 20.70


not an obvious improvement, but a necessary bugfix :(
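
(For the record, the intended dedup logic is basically this -- a rough sketch
with made-up variable names:)

  # When the same token string shows up more than once with different
  # weights, keep the lowest weight rather than whichever came last.
  my %weight_for;
  for my $t (@scanned_tokens) {               # @scanned_tokens: made-up name
      my ($str, $wt) = ($t->{string}, $t->{weight});
      $weight_for{$str} = $wt
          if !exists $weight_for{$str} || $wt < $weight_for{$str};
  }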
Comment 17 Justin Mason 2007-10-30 11:24:50 UTC
More tests.  Setting N_SIGNIFICANT_TOKENS to be infinite (i.e. using all
tokens instead of the N most significant/strong ones) is bad:

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.000 ( 0.506%) ..........|.
0.040 (25.658%) ..........|.................................
0.080 (43.067%) ..........|.......................................................
0.120 (22.166%) ..........|............................
0.120 ( 0.055%) #         |
0.160 ( 6.275%) ..........|........
0.200 ( 1.569%) ..........|..
0.200 ( 0.055%) #         |
0.240 ( 0.607%) ..........|.
0.240 ( 0.717%) ##########|#
0.280 ( 0.051%) .         |
0.280 ( 0.276%) ####      |
0.320 ( 0.101%) ...       |
0.320 ( 0.276%) ####      |
0.360 ( 0.276%) ####      |
0.400 ( 0.221%) ###       |
0.440 ( 0.441%) #######   |
0.480 ( 0.662%) ##########|#
0.520 ( 1.323%) ##########|#
0.560 ( 0.882%) ##########|#
0.600 ( 0.827%) ##########|#
0.640 ( 0.882%) ##########|#
0.680 ( 1.047%) ##########|#
0.720 ( 8.379%) ##########|######
0.760 (70.948%) ##########|#######################################################
0.800 (12.679%) ##########|##########
0.880 ( 0.055%) #         |

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$27.60
Total ham:spam:   1976:1814
FP:     0 0.000%    FN:    15 0.827%
Unsure:   126 3.325%     (ham:     3 0.152%    spam:   123 6.781%)
TCRs:              l=1 13.145    l=5 13.145    l=9 13.145
SUMMARY: 0.30/0.70  fp     0 fn    15 uh     3 us   123    c 27.60


N_SIGNIFICANT_TOKENS=999 is still on the wrong side of the baseline:

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.000 (24.747%) ..........|............................................
0.040 (18.522%) ..........|.................................
0.080 (31.123%) ..........|.......................................................
0.120 (13.057%) ..........|.......................
0.160 ( 5.820%) ..........|..........
0.160 ( 0.055%) #         |
0.200 ( 4.251%) ..........|........
0.240 ( 1.822%) ..........|...
0.280 ( 0.405%) ..........|.
0.280 ( 0.110%) ###       |
0.320 ( 0.152%) .....     |
0.320 ( 0.331%) ########  |
0.360 ( 0.101%) ....      |
0.360 ( 0.110%) ###       |
0.400 ( 0.772%) ##########|#
0.440 ( 0.165%) ####      |
0.480 ( 0.717%) ##########|#
0.520 ( 0.606%) ##########|#
0.560 ( 0.992%) ##########|#
0.600 ( 1.268%) ##########|##
0.640 ( 1.985%) ##########|##
0.680 ( 7.166%) ##########|########
0.720 (24.862%) ##########|#############################
0.760 (46.472%) ##########|#######################################################
0.800 (13.671%) ##########|################
0.840 ( 0.662%) ##########|#
0.920 ( 0.055%) #         |

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$18.80
Total ham:spam:   1976:1814
FP:     0 0.000%    FN:     1 0.055%
Unsure:   178 4.697%     (ham:    13 0.658%    spam:   165 9.096%)
TCRs:              l=1 10.928    l=5 10.928    l=9 10.928
SUMMARY: 0.30/0.70  fp     0 fn     1 uh    13 us   165    c 18.80


N_SIGNIFICANT_TOKENS=150, ditto:

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.000 (24.747%) ..........|............................................
0.040 (18.522%) ..........|.................................
0.080 (31.123%) ..........|.......................................................
0.120 (13.057%) ..........|.......................
0.160 ( 5.820%) ..........|..........
0.160 ( 0.055%) #         |
0.200 ( 4.251%) ..........|........
0.240 ( 1.822%) ..........|...
0.280 ( 0.405%) ..........|.
0.280 ( 0.110%) ###       |
0.320 ( 0.152%) .....     |
0.320 ( 0.331%) ########  |
0.360 ( 0.101%) ....      |
0.360 ( 0.110%) ###       |
0.400 ( 0.772%) ##########|#
0.440 ( 0.165%) ####      |
0.480 ( 0.717%) ##########|#
0.520 ( 0.606%) ##########|#
0.560 ( 0.992%) ##########|#
0.600 ( 1.268%) ##########|##
0.640 ( 1.985%) ##########|##
0.680 ( 7.166%) ##########|########
0.720 (24.862%) ##########|#############################
0.760 (46.472%) ##########|#######################################################
0.800 (13.671%) ##########|################
0.840 ( 0.662%) ##########|#
0.920 ( 0.055%) #         |

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$18.80
Total ham:spam:   1976:1814
FP:     0 0.000%    FN:     1 0.055%
Unsure:   178 4.697%     (ham:    13 0.658%    spam:   165 9.096%)
TCRs:              l=1 10.928    l=5 10.928    l=9 10.928
SUMMARY: 0.30/0.70  fp     0 fn     1 uh    13 us   165    c 18.80


Trying out a new tokenization, where the headers and URIs are simply "split on
whitespace" but the body still uses the full OSBF tokenization, is pretty bad
compared to baseline:


SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.000 ( 4.706%) ..........|..........
0.040 (11.285%) ..........|........................
0.080 (11.842%) ..........|.........................
0.120 (25.860%) ..........|.......................................................
0.160 (25.607%) ..........|......................................................
0.200 (11.437%) ..........|........................
0.200 ( 0.055%) #         |
0.240 ( 6.174%) ..........|.............
0.280 ( 2.429%) ..........|.....
0.280 ( 0.165%) ###       |
0.320 ( 0.506%) ..........|.
0.320 ( 0.276%) #####     |
0.360 ( 0.051%) ..        |
0.360 ( 0.221%) ####      |
0.400 ( 0.772%) ##########|#
0.440 ( 0.221%) ####      |
0.480 ( 0.101%) ....      |
0.480 ( 1.433%) ##########|#
0.520 ( 0.606%) ##########|#
0.560 ( 1.433%) ##########|#
0.600 ( 2.150%) ##########|##
0.640 (16.869%) ##########|################
0.680 (58.545%) ##########|#######################################################
0.720 (17.089%) ##########|################
0.760 ( 0.110%) ##        |
0.840 ( 0.055%) #         |
Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$101.10
Total ham:spam:   1976:1814
FP:     0 0.000%    FN:     1 0.055%
Unsure:  1001 26.412%     (ham:    61 3.087%    spam:   940 51.819%)
TCRs:              l=1 1.928    l=5 1.928    l=9 1.928
SUMMARY: 0.30/0.70  fp     0 fn     1 uh    61 us   940    c 101.10



split(' ') for just headers is also not an improvement:

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.000 (11.184%) ..........|...............
0.040 (34.615%) ..........|..............................................
0.080 (41.346%) ..........|.......................................................
0.120 (10.273%) ..........|..............
0.120 ( 0.055%) #         |
0.160 ( 1.569%) ..........|..
0.200 ( 0.709%) ..........|.
0.200 ( 0.055%) #         |
0.240 ( 0.304%) ........  |
0.240 ( 0.827%) ##########|#
0.280 ( 0.165%) ###       |
0.320 ( 0.221%) ###       |
0.360 ( 0.386%) ######    |
0.400 ( 0.551%) ######### |
0.440 ( 0.551%) ######### |
0.480 ( 1.268%) ##########|#
0.520 ( 0.992%) ##########|#
0.560 ( 1.764%) ##########|#
0.600 ( 1.488%) ##########|#
0.640 ( 5.347%) ##########|####
0.680 (70.232%) ##########|#######################################################
0.720 (14.939%) ##########|############
0.760 ( 1.103%) ##########|#
0.840 ( 0.055%) #         |
Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$102.30
Total ham:spam:   1976:1814
FP:     0 0.000%    FN:    17 0.937%
Unsure:   853 22.507%     (ham:     0 0.000%    spam:   853 47.023%)
TCRs:              l=1 2.085    l=5 2.085    l=9 2.085
SUMMARY: 0.30/0.70  fp     0 fn    17 uh     0 us   853    c 102.30


Tokenizing just URLs this way is even worse (see that spam FN creeping closer
to 0.0):

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.000 (10.374%) ..........|.............
0.040 (34.109%) ..........|...........................................
0.080 (43.168%) ..........|.......................................................
0.080 ( 0.055%) #         |
0.120 ( 9.818%) ..........|.............
0.160 ( 1.518%) ..........|..
0.200 ( 0.759%) ..........|.
0.200 ( 0.055%) #         |
0.240 ( 0.253%) ......    |
0.240 ( 0.827%) ##########|#
0.280 ( 0.165%) ###       |
0.320 ( 0.221%) ###       |
0.360 ( 0.386%) ######    |
0.400 ( 0.551%) ######### |
0.440 ( 0.551%) ######### |
0.480 ( 1.323%) ##########|#
0.520 ( 0.937%) ##########|#
0.560 ( 1.985%) ##########|##
0.600 ( 1.213%) ##########|#
0.640 ( 5.788%) ##########|#####
0.680 (69.901%) ##########|#######################################################
0.720 (14.939%) ##########|############
0.760 ( 1.047%) ##########|#
0.840 ( 0.055%) #         |
Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$102.80
Total ham:spam:   1976:1814
FP:     0 0.000%    FN:    17 0.937%
Unsure:   858 22.639%     (ham:     0 0.000%    spam:   858 47.299%)
TCRs:              l=1 2.073    l=5 2.073    l=9 2.073
SUMMARY: 0.30/0.70  fp     0 fn    17 uh     0 us   858    c 102.80

interesting!  these were all tweaks I thought might help, but they
really don't -- the graphs and figures don't lie.  The baseline
tokenization just works better in all my testing...

Comment 18 Justin Mason 2007-10-31 10:40:11 UTC
(In reply to comment #16)
> OK, fixing a bug -- in current baseline, if multiple token strings are found
> with different weights, it's ~random which one gets to set the weight.
> Revision 588709 fixes this, by simply using the lowest weight for that token
> string.

Damn, this was wrong.  Somehow (probably a failed "make"), the test wound up
using different code -- so r588709 doesn't actually produce those results, and
is in fact worse :(

I'm rerunning now to establish another fix for that bug that still shows an
improvement, since the attempt in r588709 doesn't...
Comment 19 Justin Mason 2007-11-01 16:22:33 UTC
(In reply to comment #18)
> I'm rerunning now to establish another fix for that bug that still shows an
> improvement, since the attempt in r588709 doesn't...

this proved really tricky.

After a full 10-fold CV run, here's what the original (buggy) code scores, for
two sample threshold pairs:

SUMMARY: 0.30/0.70  fp     0 fn     9 uh   528 us  1445    c 206.30
SUMMARY: 0.20/0.80  fp     0 fn     0 uh  2378 us 17529    c 1990.70


It took a few days, but I've finally figured out a patch that (a) is not
buggy ;) and (b) has better results:

SUMMARY: 0.30/0.70  fp     0 fn     7 uh   994 us   631    c 169.50
SUMMARY: 0.20/0.80  fp     0 fn     0 uh  3018 us  3295    c 631.30

It includes a small hack -- it scales the scores up by 10%, since EDDC and the
naive Bayes combiner seem to skew scores a little low.  Results improve with
this; it'd probably be better to analyze the EDDC equation and figure out why
the scores aren't 10% higher to start with, but hey ;)
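
(In code terms the hack is just something like the following; the clamp at 1.0
is my assumption:)

  $score *= 1.10;                  # EDDC + naive Bayes seem to skew low; bump by 10%
  $score  = 1.0 if $score > 1.0;   # keep it in the 0..1 range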

This is now the new baseline, checked in as r591167.


Here's the score histogram:

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.000 (25.086%) ..........|.......................................................
0.040 ( 9.016%) ..........|....................
0.080 (16.146%) ..........|...................................
0.120 (23.593%) ..........|....................................................
0.160 (10.888%) ..........|........................
0.200 ( 5.976%) ..........|.............
0.200 ( 0.011%)           |
0.240 ( 4.265%) ..........|.........
0.240 ( 0.028%) #         |
0.280 ( 2.970%) ..........|.......
0.280 ( 0.011%)           |
0.320 ( 1.295%) ..........|...
0.320 ( 0.039%) #         |
0.360 ( 0.390%) ..........|.
0.360 ( 0.220%) ######    |
0.400 ( 0.106%) .....     |
0.400 ( 0.209%) ######    |
0.440 ( 0.040%) ..        |
0.440 ( 0.165%) #####     |
0.480 ( 0.121%) ###       |
0.520 ( 0.228%) ..........|
0.520 ( 1.361%) ##########|##
0.560 ( 0.072%) ##        |
0.600 ( 0.259%) #######   |
0.640 ( 0.612%) ##########|#
0.680 ( 0.970%) ##########|#
0.720 ( 2.750%) ##########|####
0.760 (11.332%) ##########|################
0.800 (38.261%) ##########|#####################################################
0.840 (40.074%) ##########|#######################################################
0.880 ( 3.390%) ##########|#####
0.920 ( 0.011%)           |
0.960 ( 0.105%) ###       |
Comment 20 Justin Mason 2007-11-02 07:32:34 UTC
I tried adding 5% instead of 10%, in an attempt to remove a slight bias towards
FPs; results weren't great:

SUMMARY: 0.30/0.70  fp     0 fn     9 uh   774 us   822    c 168.60
SUMMARY: 0.20/0.80  fp     0 fn     0 uh  2702 us  9772    c 1247.40

so sticking with 10% is better.
Comment 21 Justin Mason 2008-01-18 02:26:05 UTC
I had some off-list discussion with Fidelis about this...

he suggests using ROCA% as a better error-rate measurement system:

Fidelis Assis writes:
> Justin Mason wrote:
> > Fidelis Assis writes:
> >> Justin Mason escreveu:
> >>> Fidelis Assis writes:
> >>>> Justin Mason escreveu:
> >>>>> Fidelis Assis writes:
> >> The other day I was in a discussion on the CRM114 list about error-rate
> >> X ROCA% and I made an analogy to archers showing why I think it's
> >> possibly better for spam filters. It might be interesting, at least as a
> >> curiosity :-)
> >>
> >>
> >> http://sourceforge.net/mailarchive/forum.php?thread_name=200711271356.lARDujYL031322%40spoo.merl.com&forum_name=crm114-general
> > 
> > Ah, that's a very good explanation.  You might have convinced me, I think ;)
> > If I get some time soon, I'll try re-examining those results using
> > 1-ROCA%.

also suggests changing the inputs to the combiner:

> >>>>> from the EDDC equation is used as P(spam) values and fed into our naive
> >>>>> Bayes combiner, producing a value ranging from 0.0 (nonspam) to 0.5
> >>>>> (unsure) to 1.0 (spam).
> >>>> I don't use probabilities directly, but the ratio
> >>>> 0.59*log10(p(ham)/p(spam)). OSBF probabilities are either very close to
> >>>> 1 or to 0.
> >>> hmm, I may try that.

and tried out osbf-lua on my test corpus:

> The filter learns better if the order of the messages is the original, or 
> random, instead of a batch of a class and then a batch of the other. A 
> modified script using random order is attached for your tests.

he gets much better results:

'I did the tests removing the X-Spam-* headers and I got 0 FP and 12 FN
but from the 12, 9 are exactly the same message: msg 33 in spam bucket.4
with Subject: "Congress Proposes Olympic Boycott" (is this spam?);
another 2 are also the same message: msg 165 in spam bucket.2, with
subject: "Notice of account temporary suspension" (paypal phishing). The
last one is another paypal phishing, but with the same contents: msg 174
in spam bucket.6.

If we don't count the same mistake repeatedly we have 0 FP and 3 FN,
which is still very good considering that the filter was trained with
only 422 msgs, and it reaches its max accuracy after 2-3k.'

So the code I've got here is still a good way off osbf-lua's accuracy rates...
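
(For reference, if I do try Fidelis's suggestion of feeding the combiner
0.59*log10(p(ham)/p(spam)) instead of raw probabilities, it'd look something
like this -- sketch only:)

  # Per-token value fed to the combiner: a scaled log ratio instead of a
  # raw probability.  The epsilon guard is my addition, to dodge log(0).
  sub token_log_ratio {
      my ($p_ham, $p_spam) = @_;
      my $eps = 1e-9;
      return 0.59 * log(($p_ham + $eps) / ($p_spam + $eps)) / log(10);
  }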
Comment 22 Justin Mason 2009-05-13 01:58:10 UTC
just a progress note.  This has been shelved, as the accuracy rates I get from our implementation aren't good enough:

default "Bayes" from trunk:
  SUMMARY: 0.30/0.70  fp     0 fn   168 uh     8 us   535    c 222.30

latest OSBF-in-perl:
  SUMMARY: 0.30/0.70  fp     0 fn     7 uh   994 us   631    c 169.50


FNs are a good bit lower, but there are some weird hacks, like having to scale
the scores up by 10%, which I'm not comfortable with.

It needs more work, and possibly some analysis of what osbf-lua does that the
Perl code doesn't do.  I haven't done that analysis, since osbf-lua's license is
incompatible; I've been working from papers/published docs instead, but I
suspect the published descriptions and the current osbf-lua code aren't quite
the same anymore.

Anyway, if anyone wants to take over, the code's all in svn.
Comment 23 Justin Mason 2009-07-23 07:09:27 UTC
moving off the release milestone, not ready.