SA Bugzilla – Bug 2129
Bayes tweaks to test
Last modified: 2004-03-23 11:00:41 UTC
When determining the probability of an incoming message, generate several temporary tokens to be used in the calculation, but NOT stored in the Bayes database:

In-Reply-To -> look for same ID except as Message-ID or Message-Id
References  -> look for same ID except as Message-ID or Message-Id

I tried looking for repeated header tokens in the Bayes DB that might work similarly (to the above idea) if the theory works, and here are a few more to try:

Cc   -> look for same ID except in From
Cc   -> look for same ID except in To
To   -> look for same ID except in Cc
To   -> look for same ID except in From
From -> look for same ID except in To
From -> look for same ID except in Cc

A bit more of a stretch:

X-Mailer   -> look for same ID except in User-Agent (looking for OS tokens)
User-Agent -> ditto X-Mailer
X-Mailer   -> ditto Received
Received   -> ditto X-Mailer
User-Agent -> Received
Received   -> User-Agent

Summary of the theory: we look for similarities between the message currently being scanned and past history, taking into account that some learned features may no longer be in the same place where they were originally learned. The Message-ID ones are looking for thread-based history, as are the To/Cc/From ones. The User-Agent/X-Mailer/Received ones are looking for operating system details mostly -- I doubt that will be very effective, but it might not be too hard to test once the other ones are set up.

Note that these temporary tokens should not be inserted into the DB. It's just that we widen the search for past probability values. In addition, for the Message-ID ones, we should go on the basis of a single learned instance -- why? Because we should never see a message-id more than once in the Message-ID field (at least for ham).
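The temporary-token idea above could be sketched roughly as follows (illustrative Python, not the actual Bayes.pm Perl code; the `H*header:value` token format and the `dual_header_lookup_tokens` name are assumptions made for the sketch):

```python
def dual_header_lookup_tokens(headers):
    """Generate temporary tokens for probability lookup only.

    These tokens are never written to the Bayes database; they just
    widen the search for previously learned probability values.
    `headers` maps a header name to a list of its values.
    """
    # Map each header to the header(s) under which the same value
    # may already have been learned.
    remap = {
        "In-Reply-To": ["Message-ID"],
        "References":  ["Message-ID"],
        "Cc":          ["From", "To"],
        "To":          ["Cc", "From"],
        "From":        ["To", "Cc"],
    }
    temp = []
    for hdr, aliases in remap.items():
        for value in headers.get(hdr, []):
            for alias in aliases:
                # The "H*<header>:<value>" format is hypothetical.
                temp.append(f"H*{alias}:{value}")
    return temp
```

For example, a reference to a previously learned Message-ID appearing in the In-Reply-To header would produce a lookup token that hits the already-learned Message-ID entry.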
Testing this now (finally). Anyone got *simple* tokenizer tweaks to try?
> anyone got *simple* tokenizer tweaks to try?

Ignoring invisible text in HTML. Should be a simple change to HTML.pm.
*** Bug 2160 has been marked as a duplicate of this bug. ***
Oh, one other idea from 2160 (the most viable of those ideas, I think)...

adding ALT text from HTML, we don't want to render it in the body since it can easily be abused as a bayes buster, but it still might be useful as bayes fodder in learn/scan, perhaps marked up

I've partially discarded HTML-tag information that is not text (like ALT) because it'll cause problems for newsletters.
Subject: Re: Bayes tweaks to test

> Oh, one other idea from 2160 (the most viable of those ideas, I think)...
>
> adding ALT text from HTML, we don't want to render it in the body since it
> can easily be abused as a bayes buster, but it still might be useful as bayes
> fodder in learn/scan, perhaps marked up
>
> I've partially discarded HTML-tag information that is not text (like ALT)
> because it'll cause problems for newsletters.

I disagree about using ALT text -- given that it's not parsing ALT text at the moment, and that it is a massively easy way to skip bayes-poison, let's just skip it.

Anyway, I was just about to check in!! If you want these checked, I'd suggest quickly adding support to Bayes.pm, protected by a conditional clause on some constant a la "use constant TWEAK_FOO => 0", and I'll run the tests tomorrow.

--j.
Subject: Re: Bayes tweaks to test

OK, here's the results. In order to test this, I used an unbalanced corpus of 39987 ham and 23337 spam.

First pass:

base: current SVN

bug3118: with Henry's fix for bug 3118.

decomp: using "decomposing" tokens: namely, if the token "Foo!" appears, decompose that into "Foo!", "Foo", "foo!" and "foo". In other words, make dup tokens with nonalphanumerics and case stripped.

dhm1: "dual header map" variant 1: Dan's first suggestion above; mapping "In-Reply-To" and "Message-Id" tokens into a shared token, so that a ref to a previously-learned Message-Id in the IRT header will be a hit.

dhm2: similar for From, To and CC headers

dhm3: similar for X-Mailer and User-Agent headers

Then I threw in a couple of retests. Some of our old tokenizer tweaks may be smelling a little off by this stage, so they need a test.

ignmid: ignore Message-Id headers -- just testing this out, as it's a large source of hapaxes.

Results:

base:    0.30/0.70 fp 3 fn 360 uh 193 us 3952 c 804.50
bug3118: 0.30/0.70 fp 2 fn 336 uh 207 us 4080 c 784.70
decomp:  0.30/0.70 fp 1 fn 324 uh 187 us 3981 c 750.80
dhm1:    0.30/0.70 fp 3 fn 344 uh 220 us 3867 c 782.70
dhm2:    0.30/0.70 fp 3 fn 343 uh 224 us 3709 c 766.30
dhm3:    0.30/0.70 fp 4 fn 342 uh 206 us 3886 c 791.20
ignmid:  0.30/0.70 fp 1 fn 383 uh 184 us 4020 c 813.40

(Don't forget -- compare all of these with "base", not with each other. They're all complementary so far.)

Clearly decomp is a *big* win, by far! "ignmid" is not so hot, as there's a lot of missed spam as a result. "bug3118" looks good overall. dhm1 and dhm2 seem good, dhm3 borderline due to the new FP.

Test set 2:

try1: bug3118 + decomp + dhm1 + dhm2 -- ie. best of previous run

try2: bug3118 + decomp + dhm1 + dhm2 + dhm3 -- giving dhm3 a second chance.

hdrs_no_num: try1, with an extra tweak; NO_NUMERIC_IN_HEADERS is turned on. I suspect the decomposed numeric tokens (ie. "8139" -> "N:NNNN"), added to catch patterns, are no longer working well.

no_num: same as hdrs_no_num, but also with no numeric tokens in the message body either.

Results:

hdrs_no_num: 0.30/0.70 fp 1 fn 266 uh 269 us 3804 c 683.30
no_num:      0.30/0.70 fp 1 fn 268 uh 260 us 3854 c 689.40
try1:        0.30/0.70 fp 2 fn 283 uh 238 us 3785 c 705.30
try2:        0.30/0.70 fp 2 fn 277 uh 251 us 3745 c 696.60

This time, try2 is looking good -- quite a bit better than try1. Also, clearly, dropping numeric tokens is now a good idea; both variants of that are a clear improvement.

Test set 3:

combined: try2 + no_num.

combined: 0.30/0.70 fp 2 fn 260 uh 267 us 3826 c 689.30

So that's what's gone in as r9447.

I tried Dan's suggestion of looking up the dual-header-map tokens instead of making dupe copies of them -- unfortunately it didn't work, getting bad numbers, so I dropped that. Combining them into 1 duplicate header gets better accuracy for some reason.

I also updated the Bayes 10fold cross-validation scripts to work again with current SVN, and wrote quite a bit more doco on how to run them. Note that "sa-learn --dump --dbpath" is required for these to work, so anyone who removes that will have to fix them ;)

Next: I'll see if I can figure out a good invisible-text tweak. I may have to add a new rendering API for that, specifically for Bayes.

--j.
OK, results for using "visible text" only:

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$655.10
Total ham:spam: 39987:23337
FP: 2 0.005%  FN: 243 1.041%
Unsure: 3921 6.192% (ham: 304 0.760%  spam: 3617 15.499%)
TCRs: l=1 6.043  l=5 6.030  l=9 6.018
SUMMARY: 0.30/0.70 fp 2 fn 243 uh 304 us 3617 c 655.10

Quite a bit better -- 7% less FNs than the best of before, no increase in FPs. It's checked in now.

One issue: I've duplicated rendering code this way, because I wasn't sure if we want to simply *replace* the current body rendering to remove invis text, or not. Do we? If so, we can just nuke the duplication and fix {html_text} to not include invis segments.

Anyway, that's another bug. This one's now closed...
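The "visible text only" rendering can be approximated like this (a minimal illustrative sketch in Python, not the actual HTML.pm change, which is in Perl; treating identical foreground/background colors as invisible is only the simplest case -- real detection has to handle near-matching colors, zero-size fonts, display:none styles, and so on):

```python
def visible_for_bayes(text_chunks):
    """Keep only text whose foreground differs from the background.

    Each chunk is a (text, fg_color, bg_color) tuple; chunks with
    fg == bg are dropped so they never reach the Bayes tokenizer.
    """
    return " ".join(
        text for text, fg, bg in text_chunks if fg != bg
    )
```

For example, a white-on-white "poison" chunk is dropped while normally colored text is kept.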
Subject: Re: Bayes tweaks to test

> One issue: I've duplicated rendering code this way, because I wasn't sure if we
> want to simply *replace* the current body rendering to remove invis text, or
> not. Do we?
>
> If so, we can just nuke the duplication and fix {html_text} to not include invis
> segments.

As a rule writer, I think vanishing invisible text during rendering would be a good idea; it would simplify obfuscation tests, at least potentially. Of course we still need rule access to the pre-rendered HTML to check for bogus tags and whatnot, but I believe that is a separate consideration.

Loren
Subject: Re: Bayes tweaks to test

> Quite a bit better -- 7% less FNs than the best of before, no increase
> in FPs. It's checked in now.

Fantastic!

> One issue: I've duplicated rendering code this way, because I wasn't
> sure if we want to simply *replace* the current body rendering to
> remove invis text, or not. Do we?

Yes and no. We don't need to render invisible text for most body tests; however, there are some tests that specifically look for random text (such as the unique_words code), very effective tests, so we need to figure out the appropriate way to handle those.

I might try this:

1. run a normal mass-check, local only, no bayes
2. experimentally change the body rendering so that invisible text is not rendered
3. re-run the mass-check
4. freqdiff and see which spam rules are most reduced in SPAM%
5. then we can consider what to do about it

Daniel
Subject: Re: Bayes tweaks to test

On Sat, Mar 13, 2004 at 11:49:50PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:

> 2. experimentally change the body rendering so that invisible text is
> not rendered

I always envisioned us rendering the "invisible" text into metadata and, for rules, putting it all at the end to get rid of the obfuscation benefit while still allowing rules to find it. Then for things like Bayes tokenization, prefix all the invisible text with something like "I*token". That way we'll get the upshot that "I*hamtoken" is considered spam whereas "hamtoken" isn't.
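The "I*token" idea above can be sketched like this (illustrative Python; the "I*" prefix comes from the comment above, while the function name and token-list representation are assumptions):

```python
def bayes_tokens(visible_tokens, invisible_tokens):
    """Tokenize visible and invisible text into separate token spaces.

    Prefixing invisible-text tokens with "I*" means a word learned as
    spammy only inside invisible text ("I*foo") carries no weight for
    the same word appearing as visible text ("foo"), and vice versa.
    """
    return list(visible_tokens) + ["I*" + t for t in invisible_tokens]
```

So "hamtoken" hidden in invisible text is learned as "I*hamtoken" and can be scored as spam without hurting legitimate uses of "hamtoken".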
I like Comment 10, making invisible words metadata. Perhaps the same could be done for low-contrast, tiny-font, and other near-invisible words.
Whatever we do with invisible text other than ignoring it, we have to remember to protect against attacks by spammers, who will know that SpamAssassin will process it while their target customers will not see it. For example, an attack on Bayes performance by feeding it 20,000 random unique four-letter "words" in invisible text.
Spammers have been using random strings and dictionary words to foil spam filters for some time. Some of this text is invisible, some just placed out of sight. Though my Bayes database has many of these random character strings, it still works well. The automatic aging will eventually clear out random strings because they are not repeated very often. (And if they are repeated, say because of improper random number generator seeding, that will help catch spam.) Increasing the number of random strings to 20,000, as mentioned in Comment 12, might sweep useful tokens out of the database (say, if 100 such messages were received per day); however, the size of such messages would be much larger and may not be profitable to send.
No, I've been looking at a performance issue in Bayes. I'm not talking about sweeping useful tokens out of the database, although to the degree that performance is impacted by the number of unique tokens in the database, that could be an issue. What I'm concerned about is the per-message performance if a single message had 20,000 or more unique tokens in it. My email has an average of 262 tokens per message lately. 20,000 random four-character sequences add only 100Kbytes to the length of a message, below the typical 256Kbyte limit on what we will process in SpamAssassin, and would increase the number of tokens to look up in the database by a factor of nearly 100. This is not the same as the "Bayes poisoning" that we have seen so far. I don't think we can ignore the possibility when considering whether to make use of I*tokens.
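The size arithmetic above is easy to check (a quick illustrative calculation; the single separator byte per token is an assumption):

```python
# 20,000 random four-character "words", each followed by one
# separator byte (a space or newline):
tokens = 20_000
bytes_per_token = 4 + 1
payload = tokens * bytes_per_token   # 100,000 bytes, i.e. ~100 Kbytes

# Compared with a typical 262-token message, the number of database
# lookups grows by roughly two orders of magnitude:
lookup_factor = tokens / 262         # ~76x more token lookups
```

This stays under the typical 256 Kbyte processing limit mentioned above while massively inflating per-message lookup cost.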
I have strong reservations about the ignore-invisible-text change:

r9460 | jm | 2004-03-13 23:49:08 -0600 (Sat, 13 Mar 2004) | 1 line

A message could be crafted so that it would be considered not "visible_for_bayes" but still be visible enough when using an email client. That might be done by using a seemingly low-contrast color pair, or by markup that tricks the code into thinking the text is invisible (for example, using complex styles). Such messages would get around one of the most robust spam detection techniques.

As an alternative, tag invisible text, as suggested in Comment 10. I'd be happy to make the changes if others are too busy.
This bug has been marked as Resolved/Fixed after the invisible-text code was checked in and reduced FNs by 7% with no new FPs. If you have any other ideas or concerns, please open a new ticket instead of continuing the discussion here. I apologize for not waiting for an appropriate open ticket on which to post my concerns about I*tokens.
Subject: Re: Bayes tweaks to test

> I like Comment 10, making invisible words metadata. Perhaps the same
> could be done for low-contrast, tiny-font, and other near-invisible
> words.

I was considering this -- but I decided against it based on current use of "invisible text". Nowadays it's predominantly:

1. random words from a dictionary
2. random strings of letters and numbers
3. "travesty" output from Project Gutenberg texts

Learning these tokens with an "I*" prefix will be actively bad for cases 1 and 2, seeing as they'll bloat the db and possibly cause good tokens to be expired due to space pressure. In case 3, it might be marginally useful, but I'm not convinced. IMO, it's better to just ignore them, for bayes at least. (But if someone codes it up and checks it in, I'll test it ;)

--j.
(discussion of invis tokens moved to bug 3173)