SA Bugzilla – Bug 2282
RFE: Tokenize reduced visibility text specially or not at all.
Last modified: 2004-03-23 12:43:50 UTC
I recently received spam that looks like it was specifically designed to get around Bayesian classification (BC). The message contained a large number of words that don't usually appear in spam but might appear in normal messages. (I'll attach the spam message.) Those words were rendered invisible by using a white font. SpamAssassin's rules caught the invisibility; however, BC was still done on the invisible words, and the message was scored 0.5 despite containing many previously encountered spammy tokens. Even with the HTML_FONT_INVISIBLE rule firing, the message was classified as ham.

A possible workaround is to provide the tokenizer with two versions of the body text, one for visible text and one for invisible or reduced-visibility (RV) text. One option is to not tokenize RV text; the other is to prefix the tokens with something like "I:kindergarten". The system would have to keep up with subtler ways of hiding text that spammers would develop, but, as with comment obfuscation, the hiding attempts would provide strong evidence of spam even before performing BC.
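The two options in the request (skip RV text, or tag its tokens) can be compared with a minimal sketch. It assumes the HTML parser has already split the body into segments flagged as visible or not; `tokenize_segments`, the segment layout, and the `skip`/`prefix` mode names are hypothetical and are not part of SpamAssassin.

```perl
use strict;
use warnings;

# Hypothetical helper: tokenize text segments, each flagged visible (1) or
# reduced-visibility (0). $mode is 'skip' (drop RV text entirely) or
# 'prefix' (keep RV words but tag them with "I:").
sub tokenize_segments {
    my ($segments, $mode) = @_;
    my @tokens;
    for my $seg (@$segments) {
        my ($text, $visible) = @$seg;
        next if !$visible && $mode eq 'skip';
        for my $word (split ' ', $text) {
            push @tokens, $visible ? $word : "I:$word";
        }
    }
    return @tokens;
}

my @segments = (
    [ 'buy cheap pills',        1 ],
    [ 'kindergarten algorithm', 0 ],   # e.g. white text on a white background
);

print join(' ', tokenize_segments(\@segments, 'skip')),   "\n";
# prints "buy cheap pills"
print join(' ', tokenize_segments(\@segments, 'prefix')), "\n";
# prints "buy cheap pills I:kindergarten I:algorithm"
```

With `prefix`, a hidden "kindergarten" and a visible "kindergarten" become different tokens, so hidden hammy words cannot borrow the ham history of their visible counterparts.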
Created attachment 1201 [details]
Spam designed for Bayesian classification.

The token entries, for example, 0.949-4--H*r:forged, show the probability, the number of times the token has to be received before losing its status (spam or ham), and the token itself.
Subject: Re: [SAdev] New: RFE: Tokenize reduced visibility text specially or not at all.

> A possible workaround is to provide the tokenizer with two versions of
> the body text, one for visible text and one for invisible or
> reduced-visibility (RV) text. One option is to not tokenize RV text,
> the other is to prefix the tokens with something like
> "I:kindergarten".

I think it would be better to just ignore invisible text. Sometimes it has tell-tale messages ("to unsubscribe..."), but most often it's just to hide obfuscating text. Since that stuff is never seen by the user, it can be anything and thus checking it will probably not provide much benefit. The actual visible content of the message is the only thing that can be relied upon.

With this in mind, I went in to the HTML.pm file and came up with the following...

-------------------------------------------------------------------------------
--- orig/HTML.pm Fri Aug 1 14:26:13 2003
+++ HTML.pm Fri Aug 1 15:04:40 2003
@@ -213,16 +213,31 @@
   }
   if ($tag eq "font" && exists $attr->{color}) {
     my $c = lc($attr->{color});
+    $self->{html}{bgcolor} = "#ffffff" unless (exists $self->{html}{bgcolor});
     $self->{html}{font_color_nohash} = 1 if $c =~ /^[0-9a-f]{6}$/;
     $self->{html}{font_color_unsafe} = 1 if ($c =~ /^\#?[0-9a-f]{6}$/ &&
         $c !~ /^\#?(?:00|33|66|80|99|cc|ff){3}$/);
     $self->{html}{font_color_name} = 1 if ($c !~ /^\#?[0-9a-f]{6}$/ &&
         $c !~ /^(?:navy|gray|red|white)$/);
     $c = name_to_rgb($c);
-    $self->{html}{font_invisible} = 1 if (exists $self->{html}{bgcolor} &&
-        substr($c,-6) eq substr($self->{html}{bgcolor},-6));
+    if (substr($c,-6) eq substr($self->{html}{bgcolor},-6)) {
+#     print STDERR "html_tests: self->bgcolor=$self->{html}{bgcolor}; fgcolor=$c\n";
+      $self->{html}{font_invisible} = 1;
+    }
     if ($c =~ /^\#?([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})$/) {
+      my ($r, $g, $b) = ($1, $2, $3);
       my ($h, $s, $v) = rgb_to_hsv(hex($1), hex($2), hex($3));
+      if ($self->{html}{bgcolor} =~ /^\#?([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})$/) {
+#       print STDERR "html_tests: bg(r,g,b)=($r,$g,$b); fg(r,g,b)=($1,$2,$3) -- ";
+        if (abs(hex($r)-hex($1)) < 16 && abs(hex($g)-hex($2)) < 16 && abs(hex($b)-hex($3)) < 16) {
+#         print STDERR "invisible!\n";
+          $self->{html}{font_invisible} = 1;
+          $self->{html}{invisible} = 1;
+        } else {
+#         print STDERR "visible\n";
+          $self->{html}{invisible} = 0;
+        }
+      }
       if (!defined($h)) {
         $self->{html}{font_gray} = 1 unless ($v == 0 || $v == 255);
       }
@@ -366,6 +381,12 @@
   while ($text =~ s/<(\S[^>]*)>//) {
 #   print STDERR "html_text: found unparsed <$1> inside text\n";
     html_tag($self,$1,undef,0);
+  }
+
+  # ignore all invisible text
+  if (exists $self->{html}{invisible} && $self->{html}{invisible}) {
+#   print STDERR "html_text: ignoring invisible text \"$text\"\n";
+    return;
   }

   # record when something non-tag exists between other tags (search of obfuscating tags)
-------------------------------------------------------------------------------

I'm currently testing it on our mailsystem here at work. I'll attach it as a real diff when I'm satisfied that it's working correctly.

The idea is that any difference between fg and bg colors where R, G, & B are all within 16 of 256 points of each other (respectively) is basically invisible to the naked eye. A real message has much more contrast than this and for a spammer to use a difference of 17+ is likely to start becoming visible to the very people they don't want it to be.

One other thing I added was to set the bgcolor to #ffffff if not otherwise defined since that's the expected default.

I'd appreciate any comments people have on this. Thanks!

Brian ( bcwhite@precidia.com )
-------------------------------------------------------------------------------
It seems that anything people have learned prior to puberty takes on the status of an immutable truth (this is something well understood by parents, governments, and religions). Rational explanations of why some previous belief might be incompatible with the behavior of nature, and a careful explanation of the actual behavior of nature are of little avail.
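The per-channel test described above (R, G, and B each within 16 points of the background, with #ffffff assumed when no background is set) can be isolated as a standalone sketch. `parse_rgb` and `is_near_invisible` are hypothetical names, not functions from HTML.pm; named colors such as "white" would first have to be converted to hex, which the patch does with `name_to_rgb`.

```perl
use strict;
use warnings;

# Parse "#rrggbb" or "rrggbb" into (r, g, b); returns an empty list if the
# string is not a six-digit hex color (named colors are not handled here).
sub parse_rgb {
    my ($color) = @_;
    return unless lc($color) =~ /^\#?([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})$/;
    return (hex($1), hex($2), hex($3));
}

# True (1) when every channel of $fg is within $threshold of $bg.
# The default background and the threshold of 16 follow the comment above.
sub is_near_invisible {
    my ($fg, $bg, $threshold) = @_;
    $bg        = '#ffffff' unless defined $bg;
    $threshold = 16        unless defined $threshold;
    my @f = parse_rgb($fg) or return 0;
    my @b = parse_rgb($bg) or return 0;
    for my $i (0 .. 2) {
        return 0 if abs($f[$i] - $b[$i]) >= $threshold;
    }
    return 1;
}

print is_near_invisible('#fefefe', '#ffffff'), "\n";   # prints 1 (difference of 1)
print is_near_invisible('#eeeeee', '#ffffff'), "\n";   # prints 0 (difference of 17)
```

A per-channel threshold is only a rough stand-in for perceived contrast; it is used here because it matches the rule stated in the comment, not because it is known to be the best measure.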
In the version I have, near-invisibility of fonts is already tested for in html_font_invisible. The following patch should be sufficient, except for the fact that there is no way to turn text skipping off. BTW, I'm not proposing this as the enhancement, just something to play with.

*** HTML.pm.~1.95.~ Sat Jun 14 15:42:18 2003
--- HTML.pm Fri Aug 1 15:18:57 2003
***************
*** 536,542 ****
    $self->html_font_invisible($text) if $text =~ /[^ \t\n\r\f\x0b\xa0]/;

    $text =~ s/^\n//s if $self->{html_last_tag} eq "br";
!   push @{$self->{html_text}}, $text;
  }

  sub html_comment {
--- 536,543 ----
    $self->html_font_invisible($text) if $text =~ /[^ \t\n\r\f\x0b\xa0]/;

    $text =~ s/^\n//s if $self->{html_last_tag} eq "br";
!   push @{$self->{html_text}}, $text
!     unless $self->{html}{font_invisible} or $self->{html}{font_near_invisible};
  }

  sub html_comment {
Brian, can you let us know how well your changes worked? In my case I've only gotten one message that skipping would obviously work on, but of course that could be the solitary fat raindrop before the downpour.
Subject: Re: [SAdev] RFE: Tokenize reduced visibility text specially or not at all.

> ------- Additional Comments From koppel@ece.lsu.edu 2003-08-01 13:53 -------
> Brian, can you let us know how well your changes worked. In my case I've
> only gotten one message that skipping would obviously work on but of course
> that could be the solitary fat raindrop before the downpour.

Well, first I had to fix my change. <sigh> It would recognize invisible text all right... It just didn't see when "</font>" made it visible again. So, now I have a stack that grows with each font tag and shrinks with every /font tag. Much better.

Interestingly enough, a message very similar to the one used to report this bug came through my filter not long after I got this done. Here's how it was tagged:

Aug 1 17:41:20 jordan mailscanner[5673]: Message 19iheC-0001kH-00 from 210.219.251.251 (hotmail.com) is spam according to SpamAssassin (score=8.2, required 6, BAYES_70, HTML_BAD_TAGS_0, HTML_EXTERNAL_IMAGE, HTML_IMAGE_ONLY_02, HTML_LINKED_IMAGE, HTML_LINKED_IMAGE_ONLY_02, MIME_HTML_ONLY, MSG_ID_ADDED_BY_MTA_3, RCVD_IN_RFCI, SEMIFORGED_HOTMAIL_RCVD)

So... Even ignoring the invisible words meant a BAYES_70 hit. Note that my local SpamAssassin has my patch for the HTML_LINKED_IMAGE* rules.

I'll have to run the same message both with and without the change, of course, but it's a long weekend here so I'm leaving soon. See ya Tuesday!

Brian ( bcwhite@precidia.com )
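The stack fix described above can be illustrated with a small sketch: each opening font tag pushes a foreground color and each closing tag pops one, so text after "</font>" is judged against the restored color. The event list, `open_font`, `close_font`, and `text_is_invisible` are hypothetical; a real parser would also need to handle unbalanced tags, named colors, table backgrounds, and CSS.

```perl
use strict;
use warnings;

my @fg_stack = ('#000000');   # assumed default foreground (black)
my $bg       = '#ffffff';     # assumed default background (white)

# Push the new color, or inherit the current one if the tag sets none.
sub open_font {
    my ($color) = @_;
    push @fg_stack, (defined $color ? lc($color) : $fg_stack[-1]);
}

# Pop back to the enclosing color, never dropping the default entry.
sub close_font { pop @fg_stack if @fg_stack > 1 }

sub text_is_invisible { return $fg_stack[-1] eq $bg ? 1 : 0 }

# A stand-in for parser callbacks: [ event type, argument ].
my @events = (
    [ 'open',  '#FFFFFF' ],
    [ 'text',  'kindergarten' ],   # white on white: hidden
    [ 'close', undef ],
    [ 'text',  'Buy now' ],        # back to the default color: visible
);

my @visible;
for my $ev (@events) {
    my ($type, $arg) = @$ev;
    if    ($type eq 'open')       { open_font($arg) }
    elsif ($type eq 'close')      { close_font() }
    elsif (!text_is_invisible())  { push @visible, $arg }
}

print "@visible\n";   # prints "Buy now"
```

Without the pop on the closing tag, "Buy now" would also be treated as hidden, which is the failure mode mentioned in the comment.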
Brian, also make sure to keep a stack of background colors, since table elements can change the background color away from the default.
As a reminder, the current CVS code already keeps a stack of fg and bg colors and keeps track of both invisibility and near-invisibility. If you like, I could attach a recent version of HTML.pm with that code.
One thing to keep in mind is that omitting invisible and low-visibility text from SA rules (not including Bayes) is not a good idea because these rules, at least in recent versions, can mostly increase the spam score. If text were incorrectly classified as invisible those rules would not work and there would be no harm in having those rules operate on invisible text.

I think this would be best:

The regular SA rules get the usual text.

The BC gets text with the invisible and low visibility portions removed, but with URI's retained. (With such changes it would be easy to add a high-ratio-of-invisible-text eval rule.)

A potential problem is misclassifying text as invisible and to a lesser extent, having invisible text marked as visible. (CSS makes things complicated, especially if we have to chase down external css files.)

If the developers are interested I'd be glad to make the changes.
Subject: Re: [SAdev] RFE: Tokenize reduced visibility text specially or not at all.

>If the developers are interested I'd be glad to make the changes.

whoa, yeah, definitely! ;) At least I would think so, the changes sound good and we've been meaning to do them for a while. Dan, Theo I presume you have no objections?

BTW -- I would add a caveat; we need to do a ten-fold cross validation to test how it affects classification at the end, before it can be merged. If it *decreases* accuracy, we can't check it in, for obvious reasons. This is std practice for bayes modifications, and is pretty unavoidable. ;)

--j.
> If it *decreases* accuracy, we can't check it in, for obvious reasons.
> This is std practice for bayes modifications, and is pretty unavoidable. ;)

I've only noticed one message that used invisible text to hide hammy tokens. (Out of hundreds.) If that's typical then it might have no significant impact on accuracy, at least until the tactic becomes more widespread.

I'll probably start working on it in a few days. I'll probably have get_decoded_stripped.. return two references, one to the usual text, the other with invisible material removed or maybe prefixed (I:algorithm).
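A rough sketch of the "two references" idea, assuming the renderer already yields text segments flagged as visible or not: one array holds the usual text for the regular rules, the other holds only visible text for the Bayes tokenizer. `split_rendered_text` and `invisible_ratio` are hypothetical names and do not reflect the actual SpamAssassin routine, whose name is truncated in the comment above.

```perl
use strict;
use warnings;

# Returns two array references: all text, and visible text only.
# Each segment is [ $text, $is_visible ].
sub split_rendered_text {
    my ($segments) = @_;
    my (@all, @visible);
    for my $seg (@$segments) {
        my ($text, $is_visible) = @$seg;
        push @all,     $text;
        push @visible, $text if $is_visible;
    }
    return (\@all, \@visible);
}

# Fraction of characters that are hidden; a possible input for the
# high-ratio-of-invisible-text eval rule suggested earlier in the thread.
sub invisible_ratio {
    my ($all, $visible) = @_;
    my ($total, $seen) = (0, 0);
    $total += length($_) for @$all;
    $seen  += length($_) for @$visible;
    return $total ? ($total - $seen) / $total : 0;
}

my ($all, $visible) = split_rendered_text([
    [ 'Buy now',                1 ],
    [ 'kindergarten algorithm', 0 ],
]);

print "rules see: @$all\n";
print "bayes sees: @$visible\n";
printf "hidden fraction: %.2f\n", invisible_ratio($all, $visible);
```

Keeping both views avoids weakening the regular rules while still denying hidden words to the classifier; the cost is the extra memory for a second copy of the text.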
Subject: Re: [SAdev] RFE: Tokenize reduced visibility text specially or not at all.

> Brian, also make sure to keep a stack of background colors, since table elements
> can change the background color away from the default.

Yup. My patch is now 3rd generation. It tracks all font attrs whether they come from font|table|tr|td. So far, it appears to be working well.

As an added bonus, I think this will greatly improve the effectiveness of the "font_invisible" test since now it can actually set that flag only when real text is found and not just when the foreground/background happen to be the same (which occurs quite often, it seems).

Brian ( bcwhite@precidia.com )
-------------------------------------------------------------------------------
The two most plentiful elements in the universe are hydrogen and stupidity.
Subject: Re: [SAdev] RFE: Tokenize reduced visibility text specially or not at all.

> As a reminder, the current CVS code already keeps a stack of fg and bg
> colors and keeps track of both invisibility and near invisibilty. If
> you like I could attach a recent version of HTML.pm with that code.

Arghhh!!! Ummm, well, sure. <sigh> It'll be interesting to see how our approaches differ.

Brian ( bcwhite@precidia.com )
Subject: Re: [SAdev] RFE: Tokenize reduced visibility text specially or not at all.

> One thing to keep in mind is that omitting invisible and
> low-visibility text from SA rules (not including Bayes) is not a good
> idea because these rules, at least in recent versions, can mostly
> increase the spam score. If text were incorrectly classified as
> invisible those rules would not work and there would be no harm in
> having those rules operate on invisible text.

While it's true that the invisible words do seem to increase the Bayes score at the moment, I think that's a short-term thing. Right now, the words are either garbage or random but it's only a matter of time before spammers start to figure out statistically what words are common in ham messages and start including those as the invisible text.

I believe it's better long-term to weigh in only on the part of the message that a user will see. Anything else is open to an end-run by the dedicated spammer. Falsely determining text to be invisible would be a problem, but it should be fairly easy to avoid that.

> The regular SA rules get the usual text.
>
> The BC gets text with the invisible and low visibility portions
> removed, but with URI's retained. (With such changes it would be
> easy to add a high-ratio-of-invisible-text eval rule.)

Perhaps I don't understand, but this seems at odds with your first comment that "omitting invisible and low-visibility text ... is not a good idea".

> A potential problem is misclassifying text as invisible and to a
> lesser extent, having invisible text marked as visible. (CSS makes
> things complicated, especially if we have to chase down external css
> files.)

Well, I think if there's an external style-sheet, then it's probably spam anyway. Does SA have a test for that? Do any mailers create/attach a CSS to a mail message?
Brian ( bcwhite@precidia.com )
Brian, I'd be interested to see your patch (or at least find out if I missed anything), the code that's in 2.60 went through many *many* generations, but I definitely neglected some things (style sheets come to mind).

Comments on other past discussion:

1. tagging invisible text vs. skipping: just do a 10fcv (remembering that skipping text or altering it may affect non-Bayes rules)

2. David's offer to do work: yes, we're interested! Watch out for memory usage and performance (re-running any routines more than we already do).

One other note: when in doubt, simulate Microsoft rendering with Outlook or Outlook Express (same as Internet Explorer).

P.S. Always work off of top-of-tree. ;-)
Subject: Re: [SAdev] RFE: Tokenize reduced visibility text specially or not at all.

> Brian, I'd be interested to see your patch (or at least find out if I missed
> anything), the code that's in 2.60 went through many *many* generations, but I
> definitely neglected some things (style sheets come to mind).

I'll attach it to the bug. It's been running for several weeks now and seems quite effective.

> 1. tagging invisible text vs. skipping: just do a 10fcv (remembering that
> skipping text or altering it may affect non-Bayes rules)

My hunch is that, as far as Bayes is concerned, it'll work out about the same. Invisible text in spam is likely to be semi-random and thus won't correlate too much between messages. Invisible text is likely nonexistent in ham and so won't apply any weighting in that direction. So, I figure that the "I:" tagged tokens will seldom make the top 15 interesting words and thus have about the same effect as just skipping them.

Since providing alternate tagging would mean having to somehow pass this information out-of-band to the text classifiers (Bayes for now, perhaps others in the future), I don't believe that the small improvement that may result over just skipping the text justifies the additional amount of work and likelihood of bugs/errors. Other classification systems may become even more difficult since things like CRM114 use the relative position of the words, too.

> P.S. Always work off of top-of-tree. ;-)

Unfortunately, this is not an option for me. I'm adding this code on our production mail server here at work. While I can quietly justify making changes personally and testing them on the spot, I can't just bring in unreleased changes. I have a machine at home I can do unstable testing with, but it's much easier at work where we get thousands of spam per day instead of just a hundred or so.

As an aside... I goofed a couple weeks back and SA was off-line for a weekend. I got a few inquiries Monday morning about that! On the plus side, everybody now has a good feel for how effective the filter really is. <grin>

Brian ( bcwhite@precidia.com )
-------------------------------------------------------------------------------
There's no healthy way to mess with the line between wrong and right.
Created attachment 1256 [details] Diff for HTML.pm (against SA v2.55) for better detection of invisible text in messages. When invisible text is detected by this patch, it is removed from the text stream so as to avoid having it tested for by other rules, including the Bayesian classifier. I think it's important to have the modified text string affect all "body" rules since many look for a series of tell-tale words and including an invisible word, even tagged, would cause them not to function.
Subject: Re: [SAdev] RFE: Tokenize reduced visibility text specially or not at all.

> 1. tagging invisible text vs. skipping: just do a 10fcv (remembering that
> skipping text or altering it may affect non-Bayes rules)

Hmmm... Here's an interesting bit of invisible text in a piece of spam I just received:

<div align=3D"center">
<FONT color=3D#ffffff size=3D1>Order Confirmation. =
Your order should be shipped by January, via FedEx. Your Federal Express tracking n=
umber is 6-8.</FONT><font size=3D"1"><BR>

Perhaps it's not as semi-random as I originally thought. Whatever we do, we need to keep invisible text from all the "body" rules or find a way to negate the result of any "nice" rules that match invisible text.

Invisible text would also need to be scanned as a separate text string. Keeping them mixed with the visible text would break any rules that are positionally dependent (i.e. many of the body rules and text classifiers like CRM114).

I'm still not sure the potential results justify the amount of effort necessary to implement this.

Brian ( bcwhite@precidia.com )
> Diff for HTML.pm (against SA v2.55) for better detection of invisible text in
> messages.

The patch doesn't apply against SA v2.55. You must have some other patches in there or something else.

$ patch --dry-run < /tmp/1256
patching file HTML.pm
Hunk #1 FAILED at 35.
Hunk #2 succeeded at 31 with fuzz 2 (offset -12 lines).
Hunk #3 succeeded at 169 (offset -12 lines).
Hunk #4 succeeded at 291 (offset -12 lines).
Hunk #5 FAILED at 443.
2 out of 5 hunks FAILED -- saving rejects to file HTML.pm.rej
Subject: Re: [SAdev] RFE: Tokenize reduced visibility text specially or not at all.

> > Diff for HTML.pm (against SA v2.55) for better detection of invisible text in
> > messages.
>
> The patch doesn't apply against SA v2.55. You must have some other patches in
> there or something else.
>
> $ patch --dry-run < /tmp/1256
> patching file HTML.pm
> Hunk #1 FAILED at 35.
> Hunk #2 succeeded at 31 with fuzz 2 (offset -12 lines).
> Hunk #3 succeeded at 169 (offset -12 lines).
> Hunk #4 succeeded at 291 (offset -12 lines).
> Hunk #5 FAILED at 443.
> 2 out of 5 hunks FAILED -- saving rejects to file HTML.pm.rej

There might be. I had another patch for handling bad tags and there is probably some overlap. That's why I'll have to regenerate the patch by hand for the new code.

Brian ( bcwhite@precidia.com )
Created attachment 1293 [details]
Patch to HTML.pm (against SA v2.60-rc2) for removal of invisible text

Here's a refinement of the patch for the latest SA release. I changed the html_font_invisible function to return a flag and use that within the html_text function to remove invisible text.

I know there is some debate as to the best action for invisible text, but I believe simply stripping it is the best choice for these reasons:

- Tagging each invisible word in-line would cause problems with rules that look at more than a single word at a time (including CRM114, should it ever get included).

- There is no method to pass the invisible text out-of-band to just those rules that can make use of it. I think adding such an ability would cost us more development effort than the amount of effort a spammer would have to invest to defeat it.

- The invisible text can be any text a spammer chooses, so trying to act based on the contents of this text is a losing battle. It's better just to act based on the existence of invisible text rather than its contents.

On another note, I liked the method of detecting invisible text; it looks easy to extend and made my patch very simple.

What about other types of invisible text? Some spam has a dozen <BR> lines followed by some text at <font size=1> so, while it's technically visible, it's not really seen.

-- Brian
'What about other types of invisible text? Some spam has a dozen <BR> lines followed by some text at <font size=1> so, while it's technically visible, it's not really seen.'

Perhaps we should think of ways to use different tagging for "faraway" text -- i.e. text that would be "far away" from the top of the message. However, for long ham mails that may not be good. Some empirical 10fcv testing could provide results on this, I think.

I think I agree that stripping invisible text is the best option BTW.
Subject: Re: [SAdev] RFE: Tokenize reduced visibility text specially or not at all.

> 'What about other types of invisible text? Some spam has a dozen <BR> lines
> followed by some text at <font size=1> so, while it's technically visible, it's
> not really seen.'
>
> Perhaps we should think of ways to use different tagging for "faraway" text --
> ie. text that would be "far away" from the top of the message. However for long
> ham mails that may be not good. Some empirical 10fcv testing could provide
> results on this I think.

What is "10fcv"?

How about just detecting large blocks of blank lines?

body TEXT_FARAWAY /\n{5}/

Brian ( bcwhite@precidia.com )
-------------------------------------------------------------------------------
In theory, theory and practice are the same. In practice, they're not.
Subject: Re: [SAdev] RFE: Tokenize reduced visibility text specially or not at all.

>What is "10fcv"?

10-fold cross-validation testing. A very good way to test tweaks to learning systems like Bayes or the SpamAssassin GA:

http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-12.html

>How about just detecting large blocks of blank lines?
>body TEXT_FARAWAY /\n{5}/

That would work -- but doesn't fix the problem that the tokens found after the blank lines would be visible to Bayes, like the "invisible text" trick.

--j.
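For readers unfamiliar with the procedure, a bare-bones sketch of k-fold cross-validation follows: the corpus is split into k folds, and each fold is held out once while the classifier is trained on the rest. `make_folds`, `cross_validate`, and the callback interface are hypothetical; this is not the SpamAssassin test harness, and a real run would shuffle the corpus and keep the spam/ham balance in each fold.

```perl
use strict;
use warnings;

# Deal items into $k folds round-robin; returns a list of array refs.
sub make_folds {
    my ($items, $k) = @_;
    my @folds;
    push @folds, [] for 1 .. $k;
    push @{ $folds[ $_ % $k ] }, $items->[$_] for 0 .. $#$items;
    return @folds;
}

# Each item is [ $text, $label ].
# $train->(\@training_items) returns a model;
# $classify->($model, $item) returns a predicted label.
# Returns overall accuracy across all held-out folds.
sub cross_validate {
    my ($items, $k, $train, $classify) = @_;
    my @folds = make_folds($items, $k);
    my ($correct, $total) = (0, 0);
    for my $i (0 .. $k - 1) {
        my @train_set = map { @{ $folds[$_] } } grep { $_ != $i } 0 .. $k - 1;
        my $model = $train->(\@train_set);
        for my $item (@{ $folds[$i] }) {
            $total++;
            $correct++ if $classify->($model, $item) eq $item->[1];
        }
    }
    return $total ? $correct / $total : 0;
}
```

To compare "skip" against "tag" handling of invisible text, the same folds would be evaluated once per tokenizer variant and the resulting accuracies (or false-positive and false-negative counts) compared.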
Given the spam that I've been receiving in the past few months, the ignore-invisible-text feature would be of little use. It was supposed to ignore text intentionally included to confound the Bayesian classifier. However, in a lot of recent spam I've received, such confounding words do not appear in an invisible region, so the feature would not skip them. Also, the confounding words are re-used, and so many are recognized as spammy the second time around; skipping them would reduce SpamAssassin's effectiveness. I'm attaching a sample of such mail below.
Created attachment 1716 [details]
Spam with Bayesian classifier confounding words.

The spam shows the classification of words. About one out of ten of the confounding words are classified as hammy but a greater number are recognized as spammy since they've appeared in earlier messages of this type.
adding dep on 3173
actually, marking fixed. 3173 has the most current state-of-play for this code. (btw I like that representation of the tokens. we should add that ;)