Bug 4331 - check Bayes tokenizer tweaks against current spam
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Libraries
Version: SVN Trunk (Latest Devel Version)
Hardware: Other other
Importance: P5 minor
Target Milestone: Future
Assignee: Justin Mason
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-05-13 15:35 UTC by Justin Mason
Modified: 2006-10-15 15:51 UTC (History)
0 users



Attachment: graph of S values (image/png), submitted by Justin Mason [HasCLA]

Description Justin Mason 2005-05-13 15:35:32 UTC
It's been quite a while since our current selection of Bayes tokenizer tweaks
was measured against a real spam corpus using 10-fold cross-validation.  Spam
has mutated since then, and there's a possibility that we could get better
results with different settings on these.

It'd be nice to run these tests before 3.1.0, but not essential.
Comment 1 Justin Mason 2005-05-13 15:35:57 UTC
wishful thinking milestone...
Comment 2 Justin Mason 2005-05-19 00:25:21 UTC
testing these
Comment 3 Justin Mason 2005-05-19 19:28:03 UTC
OK, here's some results!


KEY
---

- base: current svn trunk


Firstly, some code tweaks:

- no_inviz_tokens: ADD_INVIZ_TOKENS_I_PREFIX set to 0, so no invisible text
  tokens at all

- no_decomposed: inhibiting the decomposition of body tokens, and the mapping
  of Message-Id/In-Reply-To, From/To/Cc, and User-Agent/X-Mailer headers -- the
  tweaks discussed in bug 2129.

- casei: IGNORE_TITLE_CASE set to 0.  in other words, fully case-insensitive
  for body text

- no8bits: TOKENIZE_LONG_8BIT_SEQS_AS_TUPLES set to 0.  in other words,
  8-bit sequences are not decomposed into byte-pairs.

- no_mid: IGNORE_MSGID_TOKENS set to 1.  in other words, no message-ID
  tokens.
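To make the "decomposition" idea concrete: the tokenizer can emit degenerated
forms of a body token alongside the literal one, so punctuation and case
variants of the same word pool their statistics.  A purely illustrative Python
sketch of the idea -- this is NOT the actual SpamAssassin logic:

```python
import re

def decompose_token(tok):
    """Emit the literal token plus 'degenerated' forms: punctuation
    stripped from the ends, then lowercased.  Variants like 'Viagra!!!'
    and 'viagra' then share token statistics.  (Illustrative only.)"""
    forms = [tok]
    stripped = re.sub(r'[^\w]+$', '', re.sub(r'^[^\w]+', '', tok))
    if stripped and stripped != tok:
        forms.append(stripped)
    lowered = stripped.lower()
    if lowered and lowered not in forms:
        forms.append(lowered)
    return forms

print(decompose_token("Viagra!!!"))  # ['Viagra!!!', 'Viagra', 'viagra']
```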


And some constant tweaks:

- s005: FW_S_CONSTANT = 0.050 instead of default 0.100

- s015: FW_S_CONSTANT = 0.150 instead of default 0.100

- x05: FW_X_CONSTANT = 0.500 instead of default 0.538

- mps02: MIN_PROB_STRENGTH = 0.2 instead of default 0.346

- mps04: MIN_PROB_STRENGTH = 0.4 instead of default 0.346


DB SIZES
--------

: jm 183...; l */results/config/dbs/bayes_toks
-rw-------  1 jm jm 1302528 May 19 14:08 x05/results/config/dbs/bayes_toks
-rw-------  1 jm jm 1302528 May 19 11:34 s015/results/config/dbs/bayes_toks
-rw-------  1 jm jm 1302528 May 19 09:00 s005/results/config/dbs/bayes_toks
-rw-------  1 jm jm 1302528 May 19 06:10 mps04/results/config/dbs/bayes_toks
-rw-------  1 jm jm 1302528 May 19 03:21 mps02/results/config/dbs/bayes_toks
-rw-------  1 jm jm 1298432 May 19 00:30 no_mid/results/config/dbs/bayes_toks
-rw-------  1 jm jm 1306624 May 18 21:04 no8bits/results/config/dbs/bayes_toks
-rw-------  1 jm jm 1306624 May 18 17:18 casei/results/config/dbs/bayes_toks
-rw-------  1 jm jm 1318912 May 18 14:15 no_decomposed/results/config/dbs/bayes_toks
-rw-------  1 jm jm 1302528 May 18 12:14 no_inviz_tokens/results/config/dbs/bayes_toks
-rw-------  1 jm jm 1302528 May 18 03:40 base/results/config/dbs/bayes_toks

interesting to see that 'no_decomposed' results in a larger database!
I have *no* idea why that is -- I guess the decomposed tokens wind up
more interesting normally, and the non-decomp ones are expired out
quicker when there are decomp tokens around.


GRAPHS
------

Next, some graphs.  These are graphs of the P(spam) curves; ideally you want to
see a big spike at the left, made up entirely of ham, a big spike on the right,
made up entirely of spam, and both curving down to 0.5, where there's a smaller
spike of the "unsures" that we don't want to give a score to at all.  Ideally
there'd be no ham > 0.5, definitely none at 0.99, and ditto vice-versa for
spam.

They are all visible at http://taint.org/xfer/2005/bug-4331/ .  I'd have made
a page on the Wiki, but that doesn't allow attachments.  that's helpful!

Also, under each graph name below are the cost figures for Bayes based on
thresholds of 0.20 and 0.80; "fp" = ham in the [0.8 .. 1.0] range, "fn" = spam
in the [0.0 .. 0.2] range, "uh" = unsure ham in the [0.2 .. 0.8] range, and
"us" = unsure spam in the [0.2 .. 0.8] range.


- g_base_v_no_inviz_tokens.png: as you can see, there's absolutely no
  difference in the graphs. hmm. looks like our use of invisible tokens in
  Bayes isn't working and can be disabled ;)

base:             fp    24 fn     5 uh   815 us  2647    c 591.20
no_inviz_tokens:  fp    24 fn     5 uh   815 us  2648    c 591.30

- g_base_v_no_decomposed.png: there's little difference, generally -- except
  that the FPs (ham in the 0.5 .. 1.0 range), and the FNs (spam in 0.0 .. 0.5)
  are higher.  clearly not a good idea to turn off decomposition then!

no_decomposed:    fp    27 fn     4 uh   781 us  3097    c 661.80

- g_casei.png: this is very, very close by the graph, but on examination you
  can see that several hams have been pushed into the solid-spam [0.8, 1.0]
  range.  The cost figures below confirm this.  Better stick with base.

casei:            fp    31 fn     6 uh   801 us  2673    c 663.40

- g_no8bits.png: virtually no difference, except for some more unsureness
  around the middle.  in my opinion it's again better to stick with the base.

no8bits:          fp    24 fn     5 uh   810 us  2733    c 599.30

- g_no_mid.png: still looks like base is better.  we don't gain very
  much with the Message-ID tokens, but OTOH the database size increase
  (0.4% according to above) is pretty tiny, too, so let's just leave
  it in.

no_mid:           fp    24 fn     4 uh   816 us  2741    c 599.70

- g_s_constants.png:

  s005: FW_S_CONSTANT = 0.050 instead of default 0.100
                  fp    17 fn     4 uh  1046 us  3516    c 630.20
  s015: FW_S_CONSTANT = 0.150 instead of default 0.100
                  fp    37 fn     7 uh   705 us  2188    c 666.30

  These are interesting!   To remind you -- the S constant is the strength of
  learned data; if S is nearer to 0, then learned data is trusted more
  strongly.

  The fact that s005 has a very low FP/FN rate compared to the normal
  results is very attractive.  It does increase the "unsure" rate,
  but in our implementation that's not a big deal -- it just means
  that the message gets a 0 score from BAYES_50.

  I think exploring the low figures for S might be worthwhile.
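  To make the role of those two constants concrete, here's a minimal Python
  sketch of Robinson's f(w) adjustment, which (as I understand the Bayes
  code) is where FW_S_CONSTANT (s) and FW_X_CONSTANT (x) come in:

```python
def robinson_fw(p, n, s=0.100, x=0.538):
    """Robinson's degree-of-belief adjustment: blend the raw per-token
    probability p (token seen n times) with the prior x, weighted by
    the strength s.  As s approaches 0, even rarely-seen tokens are
    trusted at close to face value; larger s pulls them toward x.
    Defaults mirror FW_S_CONSTANT / FW_X_CONSTANT."""
    return (s * x + n * p) / (s + n)

# A token seen once with raw p = 0.99:
print(round(robinson_fw(0.99, 1, s=0.100), 3))  # 0.949 -- pulled toward prior
print(round(robinson_fw(0.99, 1, s=0.005), 3))  # 0.988 -- s005: trusted almost fully
```

  This matches the observed behavior: lowering s lets strong learned tokens
  dominate, pushing messages out of the middle of the range.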


- x05: FW_X_CONSTANT = 0.500 instead of default 0.538
                  fp    22 fn     7 uh   753 us  2774    c 579.70

  Nothing really too exciting about this one.  as expected, FPs go
  down but FNs go up.  I think we might as well stick with the
  normal setting.

- g_mps.png:

  mps02: MIN_PROB_STRENGTH = 0.2 instead of default 0.346
                  fp    33 fn     5 uh   727 us  1913    c 599.00
  mps04: MIN_PROB_STRENGTH = 0.4 instead of default 0.346
                  fp    23 fn     4 uh   836 us  2829    c 600.50

  nothing really too exciting here either.  we could possibly go
  up to require 0.4 for a minimum probability strength, since
  it seems to have the nice effect of lowering FP *and* FN at
  the expense of a few more BAYES_50 hits on the uncertain cases.
  But I think tweaking S would be a better way to do that.
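  For clarity, MIN_PROB_STRENGTH drops tokens whose probability sits too
  close to the 0.5 "no information" point before combining.  A small
  illustrative sketch of that selection step (the function and variable
  names here are mine, not from the codebase):

```python
def significant_tokens(token_probs, min_prob_strength=0.346):
    """Keep only tokens far enough from 0.5 to be worth combining.
    Raising the threshold (mps04) demands stronger evidence per token;
    lowering it (mps02) lets weaker tokens vote too."""
    return {tok: p for tok, p in token_probs.items()
            if abs(p - 0.5) >= min_prob_strength}

probs = {"viagra": 0.99, "meeting": 0.02, "the": 0.51}
print(significant_tokens(probs))  # 'the' is dropped as uninformative
```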


Overall: the code tweaks we have are still working well.  This is good, as I
was worried that spam had changed enough to make them counterproductive. One
exception is the invisible-tokens stuff, which is having no effect at all,
and that is probably a bug. ;)

I'm going to try a few more values for the S constant, which seems to reduce
FPs and FNs while increasing the BAYES_50 cases.  in my opinion it'd be more
valuable for us at this stage to reduce FPs and FNs, since we're not reliant on
Bayes alone.

Comment 4 Justin Mason 2005-05-20 14:48:34 UTC
Created attachment 2885 [details]
graph of S values

ok, here's a graph, and some figures:

/s015/ fp    37 fn     7 uh   705 us  2188    c 666.30
/base/ fp    24 fn     5 uh   815 us  2647    c 591.20
/s005/ fp    17 fn     4 uh  1046 us  3516    c 630.20
/s003/ fp    16 fn     4 uh  1227 us  4227    c 709.40
/s001/ fp    11 fn     3 uh  1637 us  5482    c 824.90
/s0005/fp     7 fn     2 uh  1953 us  6157    c 883.00
/s0001/fp     4 fn     1 uh  2837 us  7571    c 1081.80

I think we could go to s=0.03 (s003).  it does increase the number of spams
marked as unsure (in this case [0.5 .. 0.8]) quite a lot, but IMO it's worth it
just to reduce the FPs and FNs, since now we can count on other rules like
URIBL to catch a lot of the spams bayes misses.

votes please!
Comment 5 Justin Mason 2005-05-20 17:48:42 UTC
BTW, I should reiterate, because there seems to be some confusion.  What we want
from the BAYES_ rules is *not* the same as what we want from the generic SA
rescoring process.

In the rescoring, we want to minimize FPs and FNs at all SA score thresholds,
but 5.0 in particular.

In BAYES, I'm saying that we should aim to minimize FPs and FNs at the *ends* of
the scale only -- in other words, we want to minimize hams marked as BAYES_80
and up, and spams marked BAYES_20 or less.   Minimizing the inaccuracies in the
middle of the range is a lower priority than this.
Comment 6 Justin Mason 2005-05-23 23:59:17 UTC
moving off the 3.1.0 milestone -- this is not urgent, as our current bayes setup
seems to be working well enough anyway
Comment 7 Justin Mason 2006-10-15 15:51:05 UTC
hmm, forgot about this.  r464317 has the s = 0.03 change checked in for 3.2.0.