|
SA Bugzilla – Full Text Bug Listing |
Summary: | Standards violating comment obfuscation | ||
---|---|---|---|
Product: | Spamassassin | Reporter: | Matthew Cline <matt> |
Component: | Regression Tests | Assignee: | Daniel Quinlan <quinlan> |
Status: | RESOLVED FIXED | ||
Severity: | blocker | CC: | donapieppo, jm, lan |
Priority: | P1 | ||
Version: | SVN Trunk (Latest Devel Version) | ||
Target Milestone: | 2.60 | ||
Hardware: | Other | ||
OS: | other | ||
Whiteboard: | |||
Attachments: |
Remove malformed HTML comments in get_decoded_stripped_body_text_array()
Obfuscation example here's one I got proposed patch against current CVS new proposed patch, updates test, INSTALL note |
Description
Matthew Cline
2003-02-25 00:54:27 UTC
Actually, these can be stripped out by adding: $text =~ s/<![^>]*>//g; to PerMsgStatus::get_decoded_stripped_body_text_array(); seems to work fine, and it will also get rid of any HTML declarations that HTML::Parser didn't strip out. I'll be committing the change and attaching the patch. Created attachment 685 [details]
Remove malformed HTML comments in get_decoded_stripped_body_text_array()
I guess I'd like to have a better understanding of why this works in both MUAs and SpamAssassin. I've seen other examples of obfuscated tags with made-up names like <foo> and <kqudbnzuy> which presumably fool some important subset of HTML parsers. I'm trying a rule to see how many unknown tags are parsed by HTML::Parser and compare the ratio of them to real tags. If <!foo> is masquerading as an HTML declaration, then it should catch those as well. Do you have an example mail? Created attachment 686 [details]
Obfuscation example
*** Bug 1973 has been marked as a duplicate of this bug. *** Created attachment 992 [details]
here's one I got
*** Bug 1955 has been marked as a duplicate of this bug. *** I was noticing that I'm using the HTML-Parser that comes with RH 7.3 (3.26), but 3.28 is out now. Some new things: + When 'strict_comment' is off (which it is by default) + treat anything that matches <!...> a comment. btw notable factor of bug 1955 is that the tag style they used -- </ randomword > -- is strictly illegal -- it's nonconformant for all revs of HTML/XML. It's much more illegal than the existing obfuscating comments I've seen before. I think our HTML parser is refusing to parse that as HTML, so returning the tokens; but Outlook must do, hence the bug. perhaps a way to fix it would be to preproc the HTML text before it goes into the parser to strip out such illegal crap inside <...> tags so we're not fooled. oh btw note that in bug 1955 the obfuscation tags are not <!comments> at all. Is it possible to use HTML::Parser to easily find out that there are such illegal tags in the body? Would enough of those in a body be a strong spam sign? Perhaps it would be better to detect these, not just strip them. I'm talking about the <foo> totally illegal ones, not just the <!foo> that Matt tested for at the start of this bug report. Subject: Re: Standards violating comment obfuscation
> I'm talking about the <foo> totally illegal ones, not just the <!foo> that
> Matt tested for at the start of this bug report.
Well, <foo> would match XML documents of non-HTML type, so probably
not a good idea. But strictly-illegal tags like </ foo > would
be a good catch.
I have an idea: What if we will create rule that checks validity of html tags ? For exmaple, it will have a list if good/correct html tags and all other tags will be penalized, something about 0.2-0.5 for each wrong tag. Legimate mail with possible hand-written html and wrong tags will no be penalized too much, but spam mails with obfuscated comments/tags will have a huge penalty. What do you think about this? Subject: Re: Standards violating comment obfuscation
> What if we will create rule that checks validity of html tags ? For
> exmaple, it will have a list if good/correct html tags and all other
> tags will be penaliz ed, something about 0.2-0.5 for each wrong tag.
> Legimate mail with possible hand-written html and wrong tags will no be
> penalized too much, but spam mail s with obfuscated comments/tags will
> have a huge penalty.
As I was saying -- XML :(
I understand it is not possible to make list of 'good' tags for XML, but HTML is a very small part of XML, and for now, has limited set of allowed tags. I don't see any problems with this. btw, I've added a rule to catch these ones I've been going on about. but it'd be nice if we could also fix the HTML parser to decode them Outlook-compatibly. To do this, we'd have to s/// the text before passing it into HTML::Parser. anyone got a big problem with that? it'd be: s/<\/\s[^>]*>//gs; s/<[\/\?\!]\s*>//gs; s/<\?\s[^>]*>//gs; that catches all the ones I can think of ;) Subject: Re: [SAdev] Standards violating comment obfuscation On Mon, May 26, 2003 at 09:07:22PM -0700, bugzilla-daemon@hughes-family.org wrote: > I understand it is not possible to make list of 'good' tags for XML, but HTML is > a very small part of XML, and for now, has limited set of allowed tags. without giving a yea or nay, just fyi: HTML::Tagset is already required by SA (via the requirement for HTML::Parser), so it will have a relatively full list of valid HTML tags and the like. Subject: Re: [SAdev] Standards violating comment obfuscation
> I understand it is not possible to make list of 'good' tags for XML, but
> HTML is a very small part of XML, and for now, has limited set of
> allowed tags.
BTW my point on this, however, is that if we add code to detect "bad"
tags -- ie. tags that aren't in the HTML standards -- we'll hit snippets
of XML being posted.
> HTML::Tagset is already required
Yes, that's what I meant -- Since HTML::Parser can parse HTML there must
already be code with all the legal tags.
As for XML, is there any way that mail that is in a text/html section can have
XML instead of HTML without something that labels it as such? If not, what's
the problem with checking for valid HTML tags?
We will not hit XML inserted into HTML page because XML will have quoted < and > chars and < and > that is why XML tags are not may exists in HTML. By the way, I may assume spamers will start to use this method: VI<b></b>AG<i></i>RA ??? This is CORRECT HTML with legal tags :-) Subject: Re: [SAdev] Standards violating comment obfuscation On Mon, May 26, 2003 at 09:43:17PM -0700, bugzilla-daemon@hughes-family.org wrote: > By the way, I may assume spamers will start to use this method: > > VI<b></b>AG<i></i>RA > > ??? > This is CORRECT HTML with legal tags :-) True. But 1) those won't be a problem for SA because they'll be removed as normal tags. 2) we could easily have a rule looking for empty tags. Ok, another method: <font color="some color">VI</font><font color="the same color">AG</font><font color="the same color">RA</font> My guess is that detecting empty tags will FP all over the place due to the generally poor quality of HTML editors. OK, I've checked in that substitution and added a test script to verify that all these are caught. they are, so closing bug ;) Alexander, if you'd like to suggest the "check for real HTML tags" check in another bug, go for it.... I want to close this bug though, as it's a 2.60 milestone. Oops. Looks like the diff I made in attachment 685 [details] puts in a big hole: you can
now hide the entire body of an HTML message from SA by doing something like:
<HTML><BODY> <! blah blah blah > </BODY></HTML>
hmm, you could be right. don't have time to test it right now though. the fix would be to move the $text =~ s{<(?: ![^>]* # <!invalid comment style> |\s[^>]* # < foo > |\/\s[^>]* # </ foo > |\s* # <> or < > )>}{}gsx; substitution to before the HTML::Parser parsing step. > substitution to before the HTML::Parser parsing step.
Don't we want HTML::Parser to see HTML comments? There are tests done
on comments.
Subject: Re: Standards violating comment obfuscation bugzilla-daemon@hughes-family.org said: > Don't we want HTML::Parser to see HTML comments? There are tests done > on comments. Yes. OK, this is tricky then. How's about we modify the code to *fix* the invalid-HTML text so that they are valid HTML comments, before putting them through HTML::Parser. Then (a) we can avoid this bug, (b) we still ignore them in Bayes etc., and (c) HTML::Parser can still see whatever text may be inside those "tags". To recap BTW: the problem is the use of "tags" like: <!foo> <?foo> </ foo > < foo > those are all invalid HTML, and HTML::Parser ignores 'em. The previous fix stripped them after HTML::Parser, but a spammer could then use <!-- entire text of message here > to hide their message; since HTML::Parser converts < to < etc., by the time the stirpping code ran, the text would look like <!-- entire text of message here > and the above stripping would strip these too (even though a MUA would display them.) The correct fix would be to modify HTML::Parser to deal with these invalid HTML tags, but that would be pretty messy. --j. the new issue is important enough that I'm changing the priority and such. I've modified things so that the stripping is done *before* the parser is called, so that the spammers can't use "<" trickery. The regexp leaves normal comments alone, only getting rid of HTML declarations. $text =~ s{<(?: ! # The start of either an HTML comment or declaration... (?!--)[^>]* # But *not* followed by "--", so it isn't a comment... |\s[^>]* # < foo > |/\s[^>]* # </ foo > |\s* # <> or < > )>}{}gsx; Earlier versions of Perl have the regexp "zero-width lookahead assertions", right? Subject: Re: Standards violating comment obfuscation bugzilla-daemon@hughes-family.org said: > Earlier versions of Perl have the regexp "zero-width lookahead assertions", > right? Yeah, they do I think. We use it in a few rules. --j. Hmmm... I see that HTML.pm contains a sub to deal with declarations: sub html_declaration { my ($self, $text) = @_; if ($text =~ /^<!doctype/i) { my $tag = "!doctype"; $self->{html}{elements}++; $self->{html}{tags}++; $self->{html_inside}{$tag} = 0; } } And the latest change I make strips out declarations. However, if HTML::Parser properly dealt with delcarations, wouldn't it strip out things like "<!foo>" be stripped out? Hmmmm, as far as I can tell, SA doesn't have any rules that deal with "doctype" anymore, so stripping out declarations shouldn't be a problem (for now). Matt, please follow the code review process before making changes to the code. We are in a code freeze for some time now. Nobody even saw your changes to HTML.pm that you checked in six hours ago, let alone gave them a review. This change could have a significant affect on rules, I'd like to make sure it gets some testing and that we agree on the solution before it goes into the tree. -1: this patch (already applied) seems to break a bunch of tests, my ham FPs increased by this count of messages for these tests: 19 HTML_TAG_BALANCE_HEAD 18 HTML_TAG_BALANCE_HTML 9 FORGED_OUTLOOK_TAGS 6 FORGED_IMS_TAGS (and more, these are just the top four) bringing the results for those tests to: OVERALL SPAM HAM S/O SCORE NAME 29733 12036 17697 0.405 0.00 0.00 (all messages) 1030 1021 9 0.994 0.92 1.00 FORGED_OUTLOOK_TAGS 807 763 44 0.962 0.84 0.28 HTML_TAG_BALANCE_HTML 98 92 6 0.958 0.82 1.00 FORGED_IMS_TAGS 206 177 29 0.900 0.68 0.00 HTML_TAG_BALANCE_HEAD All of the FPs are new for FORGED_*_TAGS and most of the HTML_TAG_BALANCE_* FPs are new for me too. On the other side, we lost a fair number of spam hits for some rules too: -10 HTML_LINK_CLICK_CAPS -11 LINES_OF_YELLING_2 -12 HTML_00_10 -12 HTML_80_90 -15 HTML_COMMENT_RATIO -17 LINES_OF_YELLING -34 HTML_FONT_BIG In addition, the patch should at least exempt doctype using negative lookahead something like: <!(?!doctype) with a /i on the end. I'm not sure what the right solution is, maybe it's just a problem with the current replacement code. I tested out some of these obfuscations in Internet Explorer (I am not being paid enough for this... well, I'm not being paid at all for this). I think we only need to fix up things when we've verified that our rendering is different from what most people see in their MUA. I think that's pretty similar to what IE does and probably the same as what Mozilla does 9 times out of 10. IE does not remove "<>" nor does it remove "< >". It also does not seem to remove "< foo >". It does remove "</>", "</ >, and "</ foo>", though. So, here's my new removal. I leave it at the beginning for the same reasons Matt mentioned. # convert <!foo> to <!--foo--> if ($HTML::Parser::VERSION < 3.28) { $text =~ s/<!((?!--|doctype)[^>]*)>/<!--$1-->/gsi; } # remove empty close tags: </>, </ >, </ foo> $text =~ s/<\/(?:\s.*?)?>//gs; Testing now on my corpus to see how it compares. I'll try to finish up my testing and post a patch later tonight. Created attachment 1064 [details]
proposed patch against current CVS
Working on this... ok -- looks good I think. One thing: could you add a "t" script to test it? basically it should test against a mail which contains a load of body test text in different exploited-comment-format styles, and then verifies that each of the body tests in question match. Subject: Re: Standards violating comment obfuscation
> ok -- looks good I think. One thing: could you add a "t" script to
> test it? basically it should test against a mail which contains a
> load of body test text in different exploited-comment-format styles,
> and then verifies that each of the body tests in question match.
I can update t/html_obfu.t to match. It currently has this HTML:
------- start of cut text --------------
Bu</ donkey >lk e-< nope >mail<br>
10<b></b>0% GU<i></i>ARANTEED<br>
ne< >ver rec<>eive an<? blah >other mai<!foo>ling<br>
ma<himum>il w<himum>as s<himum>ent t<himum>o y<himum>ou beca<himum>use<br>
her</nothing>bal vi</nothing>agra<br>
Unso< / argh >licited Com< / argh >mercial El< / argh >ectronic Ma< / argh >il A< / argh >ct<br>
------- end ----------------------------
When I render that HTML in IE 6.0, Outlook Express XP, and Mozilla 1.0,
I see this:
------- start of cut text --------------
Bulk e-< nope >mail
100% GUARANTEED
ne< >ver rec<>eive another mailing
mail was sent to you because
herbal viagra
Unso< / argh >licited Com< / argh >mercial El< / argh >ectronic Ma< / argh >il A< / argh >ct
------- end ----------------------------
With the current patch for this bug, the "body" text matches exactly.
However, the current (pre-patch) 2.60-cvs sees:
------- start of cut text --------------
Bulk e-mail
100% GUARANTEED
never receive another mailing
mail was sent to you because
herbal viagra
Unsolicited Commercial Electronic Mail Act
------- end ----------------------------
I think the current code is actually too good, although we could make a
case for removing those additional fake tags anyway, it's not what
people see and I think it's probably best just to imitate Microsoft.
So, I'm going to adjust the SA test to include some antitests and I'll
also add a test for the empty close tag style.
Daniel
I've just seen a spam that uses a new obfuscation: "<-!foo-->". Even with the
patch in attachment 1064 [details], we don't handle it properly, asuming that IE/Outlook
treats such things as comments.
Created attachment 1069 [details]
new proposed patch, updates test, INSTALL note
None of IE, Mozilla, nor Outlook Express treat "<-!foo-->" as a comment. It shows up in plain text. It's just a dumb spammer. We could add plain old body rules to look for these "non-obfuscations" if delusional spammers start using them often. In addition, I did a mass-check on my corpus with and without the patch. It had this effect on my results (granted this is over 29733 messages, but the effect should be much larger once I update my corpus). spam rule hits: +88 ham rule hits: -35 spam total score: +77.047 ham total score: -10.457 FNs: no change FPs: -3 The patch is ready to be reviewed and applied, I think. Oops, one mistake. There were 4 additional FPs, not 3 fewer FPs. Three of them are due to the binary problem noted in bug 2106 -- a separate problem that has to do with our handling of nested MIME with more than one boundary and the fourth is due to the inaccuracy of our "when to run HTML::Parser" code. More notes in that bug. Anyway, the patch is still good-to-go. I've been working on this too long... only 3 additional FPs. The 4th was just trading one FP for another, so the only problem somewhat exasperated by the new code is the binary attachment in nested MIME documents issue in bug 2106. +1, sounds good seems good to me. +1 applied to CVS |