SA Bugzilla – Bug 4331
check Bayes tokenizer tweaks against current spam
Last modified: 2006-10-15 15:51:05 UTC
It's been quite a while since our current selection of Bayes tokenizer tweaks was measured against a real spam corpus using 10-fold cross-validation. Spam has mutated since then, and there's a possibility that we could get better results with different settings. It'd be nice to run these tests before 3.1.0, but it's not essential.
wishful thinking milestone...
testing these
OK, here's some results!

KEY
---

- base: current svn trunk

Firstly, some code tweaks:

- no_inviz_tokens: ADD_INVIZ_TOKENS_I_PREFIX set to 0, so no invisible-text tokens at all
- no_decomposed: inhibiting the decomposition of body tokens, and the mapping of the Message-Id/In-Reply-To, From/To/Cc, and User-Agent/X-Mailer headers -- the tweaks discussed in bug 2129
- casei: IGNORE_TITLE_CASE set to 0; in other words, fully case-insensitive for body text
- no8bits: TOKENIZE_LONG_8BIT_SEQS_AS_TUPLES set to 0; in other words, 8-bit sequences are not decomposed into byte-pairs
- no_mid: IGNORE_MSGID_TOKENS set to 1; in other words, no Message-ID tokens

And some constant tweaks:

- s005: FW_S_CONSTANT = 0.050 instead of default 0.100
- s015: FW_S_CONSTANT = 0.150 instead of default 0.100
- x05: FW_X_CONSTANT = 0.500 instead of default 0.538
- mps02: MIN_PROB_STRENGTH = 0.2 instead of default 0.346
- mps04: MIN_PROB_STRENGTH = 0.4 instead of default 0.346

DB SIZES
--------

: jm 183...; l */results/config/dbs/bayes_toks
-rw------- 1 jm jm 1302528 May 19 14:08 x05/results/config/dbs/bayes_toks
-rw------- 1 jm jm 1302528 May 19 11:34 s015/results/config/dbs/bayes_toks
-rw------- 1 jm jm 1302528 May 19 09:00 s005/results/config/dbs/bayes_toks
-rw------- 1 jm jm 1302528 May 19 06:10 mps04/results/config/dbs/bayes_toks
-rw------- 1 jm jm 1302528 May 19 03:21 mps02/results/config/dbs/bayes_toks
-rw------- 1 jm jm 1298432 May 19 00:30 no_mid/results/config/dbs/bayes_toks
-rw------- 1 jm jm 1306624 May 18 21:04 no8bits/results/config/dbs/bayes_toks
-rw------- 1 jm jm 1306624 May 18 17:18 casei/results/config/dbs/bayes_toks
-rw------- 1 jm jm 1318912 May 18 14:15 no_decomposed/results/config/dbs/bayes_toks
-rw------- 1 jm jm 1302528 May 18 12:14 no_inviz_tokens/results/config/dbs/bayes_toks
-rw------- 1 jm jm 1302528 May 18 03:40 base/results/config/dbs/bayes_toks

Interesting to see that 'no_decomposed' results in a larger database!
I have *no* idea why that is -- I guess the decomposed tokens wind up more interesting normally, and the non-decomposed ones are expired out more quickly when there are decomposed tokens around.

GRAPHS
------

Next, some graphs. These are graphs of the P(spam) curves; ideally you want to see a big spike at the left, made up entirely of ham, a big spike on the right, made up entirely of spam, and both curving down to 0.5, where there's a smaller spike of the "unsures" that we don't want to give a score to at all. Ideally there'd be no ham above 0.5, definitely none at 0.99, and ditto vice versa for spam.

They are all visible at http://taint.org/xfer/2005/bug-4331/ . I'd have made a page on the Wiki, but that doesn't allow attachments. That's helpful!

Each entry below also gives the cost figures for Bayes, based on thresholds of 0.20 and 0.80: "fp" = ham in the [0.8 .. 1.0] range, "fn" = spam in the [0.0 .. 0.2] range, "uh" = unsure ham in the [0.2 .. 0.8] range, "us" = unsure spam in [0.2 .. 0.8].

- g_base_v_no_inviz_tokens.png: as you can see, there's absolutely no difference in the graphs. Hmm. Looks like our use of invisible tokens in Bayes isn't working and can be disabled ;)

    base:            fp 24 fn 5 uh 815 us 2647 c 591.20
    no_inviz_tokens: fp 24 fn 5 uh 815 us 2648 c 591.30

- g_base_v_no_decomposed.png: there's little difference, generally -- except that the FPs (ham in the [0.5 .. 1.0] range) and the FNs (spam in [0.0 .. 0.5]) are higher. Clearly not a good idea to turn off decomposition, then!

    no_decomposed:   fp 27 fn 4 uh 781 us 3097 c 661.80

- g_casei.png: this is very, very close by the graph, but on examination you can see that several hams have been pushed into the solid-spam [0.8 .. 1.0] range. The cost figures below confirm this. Better to stick with base.

    casei:           fp 31 fn 6 uh 801 us 2673 c 663.40

- g_no8bits.png: virtually no difference, except for some more unsureness around the middle. Again, in my opinion, better to stick with base.
    no8bits:         fp 24 fn 5 uh 810 us 2733 c 599.30

- g_no_mid.png: still looks like base is better. We don't gain very much from the Message-ID tokens, but OTOH the database-size increase (0.4% according to the above) is pretty tiny too, so let's just leave it in.

    no_mid:          fp 24 fn 4 uh 816 us 2741 c 599.70

- g_s_constants.png:

    s005 (FW_S_CONSTANT = 0.050 instead of default 0.100): fp 17 fn 4 uh 1046 us 3516 c 630.20
    s015 (FW_S_CONSTANT = 0.150 instead of default 0.100): fp 37 fn 7 uh  705 us 2188 c 666.30

  These are interesting! To remind you -- the S constant is the strength of the learned data; if S is nearer to 0, then learned data is trusted more strongly. The fact that s005 has a very low FP/FN rate compared to the normal results is very attractive. It does increase the "unsure" rate, but in our implementation that's not a big deal -- it just means the message gets a 0 score from BAYES_50. I think exploring low figures for S might be worthwhile.

- x05: FW_X_CONSTANT = 0.500 instead of default 0.538

    x05:             fp 22 fn 7 uh 753 us 2774 c 579.70

  Nothing really too exciting about this one. As expected, FPs go down but FNs go up. I think we might as well stick with the normal setting.

- g_mps.png:

    mps02 (MIN_PROB_STRENGTH = 0.2 instead of default 0.346): fp 33 fn 5 uh 727 us 1913 c 599.00
    mps04 (MIN_PROB_STRENGTH = 0.4 instead of default 0.346): fp 23 fn 4 uh 836 us 2829 c 600.50

  Nothing really too exciting here either. We could possibly go up to requiring 0.4 as the minimum probability strength, since it seems to have the nice effect of lowering FP *and* FN at the expense of a few more BAYES_50s on the uncertain cases. But I think tweaking S would be a better way to do that.

Overall: the code tweaks we have are still working well. This is good, as I was worried that spam had changed enough to make them counterproductive. One exception is the invisible-tokens stuff, which is having no effect at all, and that is probably a bug.
;) I'm going to try a few more values for the S constant, which seems to reduce FPs and FNs while increasing the BAYES_50 cases. In my opinion, reducing FPs and FNs is the more valuable trade-off for us at this stage, since we're not reliant on Bayes alone.
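For anyone following along, here's roughly where these constants plug in: FW_S_CONSTANT and FW_X_CONSTANT are the s and x in Robinson's f(w) estimate, which turns raw per-token training counts into a usable probability, and MIN_PROB_STRENGTH discards tokens whose estimate sits too close to the 0.5 midpoint. A minimal Python sketch of the idea (function names are illustrative, not the actual Bayes code):

```python
def f_w(p_w, n, s=0.100, x=0.538):
    """Robinson's combined estimate for one token.

    p_w -- raw probability that a mail containing this token is spam
    n   -- number of training messages the token appeared in
    s   -- FW_S_CONSTANT: how strongly we cling to the prior x
    x   -- FW_X_CONSTANT: assumed probability for a never-seen token
    """
    return (s * x + n * p_w) / (s + n)

def strong_enough(p_w, n, mps=0.346):
    # MIN_PROB_STRENGTH: tokens whose estimate is within `mps`
    # of 0.5 carry too little signal and are skipped.
    return abs(f_w(p_w, n) - 0.5) >= mps

# A token never seen in training falls back to x:
print(round(f_w(0.9, 0), 3))  # 0.538
# One spammy sighting already pulls it most of the way up:
print(round(f_w(1.0, 1), 3))  # 0.958
```

So a smaller s means a token's learned probability dominates the prior x after fewer sightings -- which is what "learned data is trusted more strongly" means above.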
Created attachment 2885 [details]
graph of S values

OK, here's a graph, and some figures:

    s015:  fp 37 fn  7 uh  705 us 2188 c  666.30
    base:  fp 24 fn  5 uh  815 us 2647 c  591.20
    s005:  fp 17 fn  4 uh 1046 us 3516 c  630.20
    s003:  fp 16 fn  4 uh 1227 us 4227 c  709.40
    s001:  fp 11 fn  3 uh 1637 us 5482 c  824.90
    s0005: fp  7 fn  2 uh 1953 us 6157 c  883.00
    s0001: fp  4 fn  1 uh 2837 us 7571 c 1081.80

I think we could go to s = 0.03 (s003). It does increase the number of spams marked as unsure (in this case, [0.5 .. 0.8]) quite a lot, but IMO it's worth it just to reduce the FPs and FNs, since we can now count on other rules like URIBL to catch a lot of the spams Bayes misses. Votes please!
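To put some numbers on why lower S values behave like this: with Robinson's f(w) = (s*x + n*p(w)) / (s + n) and the default x = 0.538, here's what a token seen just once, always in spam, scores at each s value (an illustrative sketch, not the real Bayes code):

```python
def f_w(p_w, n, s, x=0.538):
    # Robinson's estimate: smaller s -> the learned p_w dominates
    # sooner, even for tokens seen only once or twice.
    return (s * x + n * p_w) / (s + n)

# A token seen once, only in spam (p_w = 1.0):
for s in (0.15, 0.10, 0.05, 0.03, 0.01):
    print(f"s={s:.3f}  f(w)={f_w(1.0, 1, s):.4f}")
```

At s = 0.03 that single sighting already scores about 0.987, versus about 0.958 at the default 0.100 -- rare tokens get pushed towards the extremes faster, matching the trend in the table: lower s, fewer FPs/FNs, more unsures.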
BTW, I should reiterate this, because there seems to be some confusion. What we want from the BAYES_* rules is *not* the same as what we want from the generic SA rescoring process. In the rescoring, we want to minimize FPs and FNs at all SA score thresholds, but at 5.0 in particular. For BAYES, I'm saying that we should aim to minimize FPs and FNs at the *ends* of the scale only -- in other words, we want to minimize hams marked BAYES_80 and up, and spams marked BAYES_20 or less. Minimizing the inaccuracies in the middle of the range is a lower priority than that.
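For the record, every "c" figure quoted above is consistent with weighting a false positive at 10, a false negative at 1, and each unsure at 0.1 -- which encodes exactly that prioritization of the ends of the scale. A one-liner sketch under that assumed weighting:

```python
def bayes_cost(fp, fn, uh, us):
    # Assumed weights: an FP costs 10x an FN; each unsure costs 0.1.
    return 10 * fp + fn + 0.1 * (uh + us)

print(round(bayes_cost(24, 5, 815, 2647), 2))  # 591.2 (matches the 'base' row)
```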
moving off the 3.1.0 milestone -- this is not urgent, as our current bayes setup seems to be working well enough anyway
hmm, forgot about this. r464317 has the s = 0.03 change checked in for 3.2.0.