SA Bugzilla – Bug 2129
Bayes tweaks to test
Last modified: 2004-03-23 11:00:41 UTC
When determining the probability of an incoming message, generate several temporary tokens to be used in the calculation, but NOT stored in the Bayes database:

In-Reply-To -> look for same ID except as Message-ID or Message-Id
References  -> look for same ID except as Message-ID or Message-Id

I tried looking for repeated header tokens in the Bayes DB that might work similarly (to the above idea) if the theory works, and here are a few more to try:

Cc   -> look for same ID except in From
Cc   -> look for same ID except in To
To   -> look for same ID except in Cc
To   -> look for same ID except in From
From -> look for same ID except in To
From -> look for same ID except in Cc

A bit more of a stretch:

X-Mailer   -> look for same ID except in User-Agent (looking for OS tokens)
User-Agent -> ditto X-Mailer
X-Mailer   -> ditto Received
Received   -> ditto X-Mailer
User-Agent -> Received
Received   -> User-Agent

Summary of the theory: we look for similarities between the message currently being scanned and past history, taking into account that some learned features may no longer be in the same place where they were originally learned. The Message-ID ones are looking for thread-based history, as are the To/Cc/From ones. The User-Agent/X-Mailer/Received ones are looking for operating system details mostly -- I doubt that will be very effective, but it might not be too hard to test once the other ones are set up.

Note that these temporary tokens should not be inserted into the DB. It's just that we widen the search for past probability values. In addition, for the Message-ID ones, we should go on the basis of a single learned instance -- why? Because we should never see a message-id more than once in the Message-ID field (at least for ham).
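The temporary-token idea above could be sketched roughly as follows (illustrative Python, not the actual Bayes.pm Perl code; the `H*header:value` token format and the `dual_header_lookup_tokens` name are assumptions made for the sketch):

```python
def dual_header_lookup_tokens(headers):
    """Generate temporary tokens for probability lookup only.

    These tokens are never written to the Bayes database; they just
    widen the search for previously learned probability values.
    `headers` maps a header name to a list of its values.
    """
    # Map each header to the header(s) under which the same value
    # may already have been learned.
    remap = {
        "In-Reply-To": ["Message-ID"],
        "References":  ["Message-ID"],
        "Cc":          ["From", "To"],
        "To":          ["Cc", "From"],
        "From":        ["To", "Cc"],
    }
    temp = []
    for hdr, aliases in remap.items():
        for value in headers.get(hdr, []):
            for alias in aliases:
                # The "H*<header>:<value>" format is hypothetical.
                temp.append(f"H*{alias}:{value}")
    return temp
```

For example, a reference to a previously learned Message-ID appearing in the In-Reply-To header would produce a lookup token that hits the already-learned Message-ID entry.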
Testing this now (finally). Anyone got *simple* tokenizer tweaks to try?
> anyone got *simple* tokenizer tweaks to try?

Ignoring invisible text in HTML. Should be a simple change to HTML.pm.
*** Bug 2160 has been marked as a duplicate of this bug. ***
Oh, one other idea from 2160 (the most viable of those ideas, I think)...

adding ALT text from HTML, we don't want to render it in the body since it can easily be abused as a bayes buster, but it still might be useful as bayes fodder in learn/scan, perhaps marked up

I've partially discarded HTML-tag information that is not text (like ALT) because it'll cause problems for newsletters.
Subject: Re: Bayes tweaks to test

> Oh, one other idea from 2160 (the most viable of those ideas, I think)...
>
> adding ALT text from HTML, we don't want to render it in the body since it
> can easily be abused as a bayes buster, but it still might be useful as bayes
> fodder in learn/scan, perhaps marked up
>
> I've partially discarded HTML-tag information that is not text (like ALT)
> because it'll cause problems for newsletters.

I disagree about using ALT text -- given that it's not parsing ALT text at the moment, and that it is a massively easy way to skip bayes-poison, let's just skip it.

Anyway, I was just about to check in!! If you want these checked, I'd suggest quickly adding support to Bayes.pm, protected by a conditional clause on some constant a la "use constant TWEAK_FOO => 0", and I'll run the tests tomorrow.

--j.
Subject: Re: Bayes tweaks to test

OK, here's the results. In order to test this, I used an unbalanced corpus of 39987 ham and 23337 spam.

First pass:

base: current SVN

bug3118: with Henry's fix for bug 3118.

decomp: using "decomposing" tokens: namely, if the token "Foo!" appears, decompose that into "Foo!", "Foo", "foo!" and "foo". In other words, make dup tokens with nonalphanumerics and case stripped.

dhm1: "dual header map" variant 1: Dan's first suggestion above; mapping "In-Reply-To" and "Message-Id" tokens into a shared token, so that a ref to a previously-learned Message-Id in the IRT header will be a hit.

dhm2: similar for From, To and CC headers

dhm3: similar for X-Mailer and User-Agent headers

Then I threw in a couple of retests. Some of our old tokenizer tweaks may be smelling a little off by this stage, so they need a test.

ignmid: ignore Message-Id headers -- just testing this out, as it's a large source of hapaxes.

Results:

base:    0.30/0.70 fp 3 fn 360 uh 193 us 3952 c 804.50
bug3118: 0.30/0.70 fp 2 fn 336 uh 207 us 4080 c 784.70
decomp:  0.30/0.70 fp 1 fn 324 uh 187 us 3981 c 750.80
dhm1:    0.30/0.70 fp 3 fn 344 uh 220 us 3867 c 782.70
dhm2:    0.30/0.70 fp 3 fn 343 uh 224 us 3709 c 766.30
dhm3:    0.30/0.70 fp 4 fn 342 uh 206 us 3886 c 791.20
ignmid:  0.30/0.70 fp 1 fn 383 uh 184 us 4020 c 813.40

(Don't forget -- compare all of these with "base", not with each other. They're all complementary so far.)

Clearly decomp is a *big* win, by far! "ignmid" is not so hot, as there's a lot of missed spam as a result. "bug3118" looks good overall. dhm1 and dhm2 seem good, dhm3 borderline due to the new FP.

Test set 2:

try1: bug3118 + decomp + dhm1 + dhm2 -- ie. best of previous run

try2: bug3118 + decomp + dhm1 + dhm2 + dhm3 -- giving dhm3 a second chance.

hdrs_no_num: try1, with an extra tweak; NO_NUMERIC_IN_HEADERS is turned on. I suspect the decomposed numeric tokens (ie. "8139" -> "N:NNNN"), added to catch patterns, are no longer working well.

no_num: same as hdrs_no_num, but also with no numeric tokens in the message body either.

Results:

hdrs_no_num: 0.30/0.70 fp 1 fn 266 uh 269 us 3804 c 683.30
no_num:      0.30/0.70 fp 1 fn 268 uh 260 us 3854 c 689.40
try1:        0.30/0.70 fp 2 fn 283 uh 238 us 3785 c 705.30
try2:        0.30/0.70 fp 2 fn 277 uh 251 us 3745 c 696.60

This time, try2 is looking good -- quite a bit better than try1. Also, clearly, dropping numeric tokens is now a good idea; both variants of that are a clear improvement.

Test set 3:

combined: try2 + no_num.

combined: 0.30/0.70 fp 2 fn 260 uh 267 us 3826 c 689.30

So that's what's gone in as r9447.

I tried Dan's suggestion of looking up the dual-header-map tokens instead of making dupe copies of them -- unfortunately it didn't work, getting bad numbers, so I dropped that. Combining them into 1 duplicate header gets better accuracy for some reason.

I also updated the Bayes 10fold cross-validation scripts to work again with current SVN, and wrote quite a bit more doco on how to run them. Note that "sa-learn --dump --dbpath" is required for these to work, so anyone who removes that will have to fix them ;)

Next: I'll see if I can figure out a good invisible-text tweak. I may have to add a new rendering API for that, specifically for Bayes.

--j.
OK, results for using "visible text" only:

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$655.10
Total ham:spam: 39987:23337
FP: 2 0.005%  FN: 243 1.041%
Unsure: 3921 6.192% (ham: 304 0.760%  spam: 3617 15.499%)
TCRs: l=1 6.043  l=5 6.030  l=9 6.018
SUMMARY: 0.30/0.70 fp 2 fn 243 uh 304 us 3617 c 655.10

Quite a bit better -- 7% less FNs than the best of before, no increase in FPs. It's checked in now.

One issue: I've duplicated rendering code this way, because I wasn't sure if we want to simply *replace* the current body rendering to remove invis text, or not. Do we? If so, we can just nuke the duplication and fix {html_text} to not include invis segments.

Anyway, that's another bug. This one's now closed...
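The "visible text only" rendering can be approximated like this (a minimal illustrative sketch in Python, not the actual HTML.pm change, which is in Perl; treating identical foreground/background colors as invisible is only the simplest case -- real detection has to handle near-matching colors, zero-size fonts, display:none styles, and so on):

```python
def visible_for_bayes(text_chunks):
    """Keep only text whose foreground differs from the background.

    Each chunk is a (text, fg_color, bg_color) tuple; chunks with
    fg == bg are dropped so they never reach the Bayes tokenizer.
    """
    return " ".join(
        text for text, fg, bg in text_chunks if fg != bg
    )
```

For example, a white-on-white "poison" chunk is dropped while normally colored text is kept.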
Subject: Re: Bayes tweaks to test

> One issue: I've duplicated rendering code this way, because I wasn't sure if we
> want to simply *replace* the current body rendering to remove invis text, or
> not. Do we?
>
> If so, we can just nuke the duplication and fix {html_text} to not include invis
> segments.

As a rule writer, I think vanishing invisible text during rendering would be a good idea; it would simplify obfuscation tests, at least potentially. Of course we still need rule access to the pre-rendered HTML to check for bogus tags and whatnot, but I believe that is a separate consideration.

Loren
Subject: Re: Bayes tweaks to test

> Quite a bit better -- 7% less FNs than the best of before, no increase
> in FPs. It's checked in now.

Fantastic!

> One issue: I've duplicated rendering code this way, because I wasn't
> sure if we want to simply *replace* the current body rendering to
> remove invis text, or not. Do we?

Yes and no. We don't need to render invisible text for most body tests; however, there are some tests that specifically look for random text (such as the unique_words code), very effective tests, so we need to figure out the appropriate way to handle those.

I might try this:

1. run a normal mass-check, local only, no bayes
2. experimentally change the body rendering so that invisible text is not rendered
3. re-run the mass-check
4. freqdiff and see which spam rules are most reduced in SPAM%
5. then we can consider what to do about it

Daniel
Subject: Re: Bayes tweaks to test

On Sat, Mar 13, 2004 at 11:49:50PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:

> 2. experimentally change the body rendering so that invisible text is
> not rendered

I always envisioned us rendering the "invisible" text into metadata and, for rules, putting it all at the end to get rid of the obfuscation benefit while still allowing rules to find it. Then for things like Bayes tokenization, prefix all the invisible text with something like "I*token". That way we'll get the upshot that "I*hamtoken" is considered spam whereas "hamtoken" isn't.
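The "I*token" idea above can be sketched like this (illustrative Python; the "I*" prefix comes from the comment above, while the function name and token-list representation are assumptions):

```python
def bayes_tokens(visible_tokens, invisible_tokens):
    """Tokenize visible and invisible text into separate token spaces.

    Prefixing invisible-text tokens with "I*" means a word learned as
    spammy only inside invisible text ("I*foo") carries no weight for
    the same word appearing as visible text ("foo"), and vice versa.
    """
    return list(visible_tokens) + ["I*" + t for t in invisible_tokens]
```

So "hamtoken" hidden in invisible text is learned as "I*hamtoken" and can be scored as spam without hurting legitimate uses of "hamtoken".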
I like Comment 10, making invisible words metadata. Perhaps the same could be done for low-contrast, tiny-font, and other near-invisible words.
Whatever we do with invisible text other than ignoring it, we have to remember to protect against attacks by spammers, who will know that SpamAssassin will process it while their target customers will not see it. For example, an attack on Bayes performance by feeding it 20,000 random unique four-letter "words" in invisible text.
Spammers have been using random strings and dictionary words to foil spam filters for some time. Some of this text is invisible, some just placed out of sight. Though my Bayes database has many of these random character strings, it still works well. The automatic aging will eventually clear out random strings because they are not repeated very often. (And if they are repeated, say because of improper random number generator seeding, that will help catch spam.) Increasing the number of random strings to 20,000, as mentioned in Comment 12, might sweep useful tokens out of the database (say, if 100 such messages were received per day); however, the size of such messages would be much larger and may not be profitable to send.
No, I've been looking at a performance issue in Bayes. I'm not talking about sweeping useful tokens out of the database, although to the degree that performance is impacted by the number of unique tokens in the database, that could be an issue. What I'm concerned about is the per-message performance if a single message had 20,000 or more unique tokens in it. My email has an average of 262 tokens per message lately. 20,000 random four-character sequences add only 100Kbytes to the length of a message, below the typical 256Kbyte limit on what we will process in SpamAssassin, and would increase the number of tokens to look up in the database by a factor of nearly 100. This is not the same as the "Bayes poisoning" that we have seen so far. I don't think we can ignore the possibility when considering whether to make use of I*tokens.
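The size arithmetic above is easy to check (a quick illustrative calculation; the single separator byte per token is an assumption):

```python
# 20,000 random four-character "words", each followed by one
# separator byte (a space or newline):
tokens = 20_000
bytes_per_token = 4 + 1
payload = tokens * bytes_per_token   # 100,000 bytes, i.e. ~100 Kbytes

# Compared with a typical 262-token message, the number of database
# lookups grows by roughly two orders of magnitude:
lookup_factor = tokens / 262         # ~76x more token lookups
```

This stays under the typical 256 Kbyte processing limit mentioned above while massively inflating per-message lookup cost.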
I have strong reservations about the ignore-invisible-text change:

r9460 | jm | 2004-03-13 23:49:08 -0600 (Sat, 13 Mar 2004) | 1 line

A message could be crafted so that it would be considered not "visible_for_bayes" but still be visible enough when using an email client. That might be done by using a seemingly low-contrast color pair, or by markup that tricks the code into thinking the text is invisible (for example, using complex styles). Such messages would get around one of the most robust spam detection techniques.

As an alternative, tag invisible text, as suggested in Comment 10. I'd be happy to make the changes if others are too busy.
This bug has been marked as Resolved/Fixed after the invisible-text code was checked in and reduced FNs by 7% with no new FPs. If you have any other ideas or concerns, please open a new ticket instead of continuing the discussion here. I apologize for not waiting for an appropriate open ticket on which to post my concerns about I*tokens.
Subject: Re: Bayes tweaks to test

> I like Comment 10, making invisible words metadata. Perhaps the same
> could be done for low-contrast, tiny-font, and other near-invisible
> words.

I was considering this -- but I decided against it based on current use of "invisible text". Nowadays it's predominantly:

1. random words from a dictionary
2. random strings of letters and numbers
3. "travesty" output from Project Gutenberg texts

Learning these tokens with an "I*" prefix will be actively bad for cases 1 and 2, seeing as they'll bloat the db and possibly cause good tokens to be expired due to space pressure. In case 3, it might be marginally useful, but I'm not convinced. IMO, it's better to just ignore them, for bayes at least. (But if someone codes it up and checks it in, I'll test it ;)

--j.
(discussion of invis tokens moved to bug 3173)