Bug 1551

Summary:	Standards violating comment obfuscation
Product:	Spamassassin	Reporter:	Matthew Cline <matt>
Component:	Regression Tests	Assignee:	Daniel Quinlan <quinlan>
Status:	RESOLVED FIXED
Severity:	blocker	CC:	donapieppo, jm, lan
Priority:	P1
Version:	SVN Trunk (Latest Devel Version)
Target Milestone:	2.60
Hardware:	Other
OS:	other
Whiteboard:
Attachments:	Remove malformed HTML comments in get_decoded_stripped_body_text_array() Obfuscation example here's one I got proposed patch against current CVS new proposed patch, updates test, INSTALL note

Description Matthew Cline 2003-02-25 00:54:27 UTC

Some recent spam has been using broken comments for obfuscation:

    VIA<!foo>GRA

HTML::Parser doesn't consider this to be an HTML tag, so the caps viagra rule
doen't get triggered, and bayes get VIA and GRA:

    debug: bayes token 'VIA' => 0.95214582163703

I guess that to fully fix this, we'd need to tweak HTML::Parser somehow; if we
can't do it be subclassing, we'd need to make a patch to to the parser code
itself, submit it to the maintainer, and then have a special SA version of
HTML::Parser to use until the next version of HTML::Parser was released.

I've also created a test rule to find these obfuscations; it only got a hit
rate of about 0.26% on my corpus, but the S/O was 1.0, and this tactic will
only become more common if we don't fix this.

===

rawbody  __OBFUSCATING_COMMENT2 /[^\s>]<![^-].{0,64}?[^-]>[^\s<]/
meta     T_OBFUSCATING_COMMENT2 __OBFUSCATING_COMMENT2 && MIME_HTML_ONLY
describe T_OBFUSCATING_COMMENT2 HTML comments which obfuscate text (2)

Comment 1 Matthew Cline 2003-02-25 21:12:48 UTC

Actually, these can be stripped out by adding:

    $text =~ s/<![^>]*>//g;

to PerMsgStatus::get_decoded_stripped_body_text_array(); seems to work fine,
and it will also get rid of any HTML declarations that HTML::Parser didn't
strip out.  I'll be committing the change and attaching the patch.

Comment 2 Matthew Cline 2003-02-25 21:14:32 UTC

Created attachment 685 [details]
Remove malformed HTML comments in get_decoded_stripped_body_text_array()

Comment 3 Daniel Quinlan 2003-02-25 21:51:12 UTC

I guess I'd like to have a better understanding of why this works in both
MUAs and SpamAssassin.  I've seen other examples of obfuscated tags with
made-up names like <foo> and <kqudbnzuy> which presumably fool some important
subset of HTML parsers.

I'm trying a rule to see how many unknown tags are parsed by HTML::Parser
and compare the ratio of them to real tags.  If <!foo> is masquerading as
an HTML declaration, then it should catch those as well.

Do you have an example mail?

Comment 4 Matthew Cline 2003-02-26 00:15:17 UTC

Created attachment 686 [details]
Obfuscation example

Comment 5 Theo Van Dinter 2003-05-26 10:40:06 UTC

*** Bug 1973 has been marked as a duplicate of this bug. ***

Comment 6 Theo Van Dinter 2003-05-26 11:10:33 UTC

Created attachment 992 [details]
here's one I got

Comment 7 Theo Van Dinter 2003-05-26 11:11:41 UTC

*** Bug 1955 has been marked as a duplicate of this bug. ***

Comment 8 Theo Van Dinter 2003-05-26 11:51:04 UTC

I was noticing that I'm using the HTML-Parser that comes with RH 7.3 (3.26), but
3.28 is out now.  Some new things:

+     When 'strict_comment' is off (which it is by default)
+     treat anything that matches <!...> a comment.

Comment 9 Justin Mason 2003-05-26 12:10:08 UTC

btw notable factor of bug 1955 is that the tag style they used -- </ randomword
> -- is strictly illegal -- it's nonconformant for all revs of HTML/XML.  It's
much more illegal than the existing obfuscating comments I've seen before.

I think our HTML parser is refusing to parse that as HTML, so returning the
tokens; but Outlook must do, hence the bug.  perhaps a way to fix it would be to
preproc the HTML text before it goes into the parser to strip out such illegal
crap inside <...> tags so we're not fooled.

Comment 10 Justin Mason 2003-05-26 12:10:45 UTC

oh btw note that in bug 1955 the obfuscation tags are not <!comments> at all.

Comment 11 Sidney Markowitz 2003-05-26 13:20:46 UTC

Is it possible to use HTML::Parser to easily find out that there are such 
illegal tags in the body? Would enough of those in a body be a strong spam 
sign? Perhaps it would be better to detect these, not just strip them.

I'm talking about the <foo> totally illegal ones, not just the <!foo> that 
Matt tested for at the start of this bug report.

Comment 12 Antony Mawer 2003-05-26 20:34:59 UTC

Subject: Re:  Standards violating comment obfuscation 



> I'm talking about the <foo> totally illegal ones, not just the <!foo> that 
> Matt tested for at the start of this bug report.

Well, <foo> would match XML documents of non-HTML type, so probably
not a good idea.  But strictly-illegal tags like </ foo > would
be a good catch.

Comment 13 Alexander Litvinov 2003-05-26 20:45:02 UTC

I have an idea:

What if we will create rule that checks validity of html tags ? For exmaple, it
will have a list if good/correct html tags and all other tags will be penalized,
something about 0.2-0.5 for each wrong tag. Legimate mail with possible
hand-written html and wrong tags will no be penalized too much, but spam mails
with obfuscated comments/tags will have a huge penalty.

What do you think about this?

Comment 14 Antony Mawer 2003-05-26 20:54:37 UTC

Subject: Re:  Standards violating comment obfuscation 


> What if we will create rule that checks validity of html tags ? For
> exmaple, it will have a list if good/correct html tags and all other
> tags will be penaliz ed, something about 0.2-0.5 for each wrong tag.
> Legimate mail with possible hand-written html and wrong tags will no be
> penalized too much, but spam mail s with obfuscated comments/tags will
> have a huge penalty.

As I was saying -- XML :(

Comment 15 Alexander Litvinov 2003-05-26 21:07:21 UTC

I understand it is not possible to make list of 'good' tags for XML, but HTML is
a very small part of XML, and for now, has limited set of allowed tags. 

I don't see any problems with this.

Comment 16 Justin Mason 2003-05-26 21:10:03 UTC

btw, I've added a rule to catch these ones I've been going on about.  but it'd
be nice if we could also fix the HTML parser to decode them Outlook-compatibly.
 To do this, we'd have to s/// the text before passing it into HTML::Parser. 
anyone got a big problem with that?  it'd be:

   s/<\/\s[^>]*>//gs;
   s/<[\/\?\!]\s*>//gs;
   s/<\?\s[^>]*>//gs;

that catches all the ones I can think of ;)

Comment 17 Theo Van Dinter 2003-05-26 21:10:36 UTC

Subject: Re: [SAdev]  Standards violating comment obfuscation

On Mon, May 26, 2003 at 09:07:22PM -0700, bugzilla-daemon@hughes-family.org wrote:
> I understand it is not possible to make list of 'good' tags for XML, but HTML is
> a very small part of XML, and for now, has limited set of allowed tags. 

without giving a yea or nay, just fyi: HTML::Tagset is already required
by SA (via the requirement for HTML::Parser), so it will have a relatively
full list of valid HTML tags and the like.

Comment 18 Antony Mawer 2003-05-26 21:24:57 UTC

Subject: Re: [SAdev]  Standards violating comment obfuscation 


> I understand it is not possible to make list of 'good' tags for XML, but
> HTML is a very small part of XML, and for now, has limited set of
> allowed tags. 

BTW my point on this, however, is that if we add code to detect "bad"
tags -- ie. tags that aren't in the HTML standards -- we'll hit snippets
of XML being posted.

Comment 19 Sidney Markowitz 2003-05-26 21:31:36 UTC

> HTML::Tagset is already required

Yes, that's what I meant -- Since HTML::Parser can parse HTML there must 
already be code with all the legal tags.

As for XML, is there any way that mail that is in a text/html section can have 
XML instead of HTML without something that labels it as such? If not, what's 
the problem with checking for valid HTML tags?

Comment 20 Alexander Litvinov 2003-05-26 21:36:33 UTC

We will not hit XML inserted into HTML page because XML will have quoted < and >
chars and &lt; and &gt; that is why XML tags are not may exists in HTML.

Comment 21 Alexander Litvinov 2003-05-26 21:43:17 UTC

By the way, I may assume spamers will start to use this method:

VI<b></b>AG<i></i>RA

???
This is CORRECT HTML with legal tags :-)

Comment 22 Theo Van Dinter 2003-05-26 21:45:59 UTC

Subject: Re: [SAdev]  Standards violating comment obfuscation

On Mon, May 26, 2003 at 09:43:17PM -0700, bugzilla-daemon@hughes-family.org wrote:
> By the way, I may assume spamers will start to use this method:
> 
> VI<b></b>AG<i></i>RA
> 
> ???
> This is CORRECT HTML with legal tags :-)

True.  But 1) those won't be a problem for SA because they'll be removed
as normal tags.  2) we could easily have a rule looking for empty tags.

Comment 23 Alexander Litvinov 2003-05-26 21:52:37 UTC

Ok, another method:
<font color="some color">VI</font><font color="the same color">AG</font><font
color="the same color">RA</font>

Comment 24 Bob Apthorpe 2003-05-26 22:01:12 UTC

My guess is that detecting empty tags will FP all over the place due to the 
generally poor quality of HTML editors.

Comment 25 Justin Mason 2003-05-27 15:06:16 UTC

OK, I've checked in that substitution and added a test script to verify that all
these are caught.  they are, so closing bug ;)

Alexander, if you'd like to suggest the "check for real HTML tags" check in
another bug, go for it.... I want to close this bug though, as it's a 2.60
milestone.

Comment 26 Matthew Cline 2003-06-17 19:18:09 UTC

Oops.  Looks like the diff I made in attachment 685 [details] puts in a big hole: you can
now hide the entire body of an HTML message from SA by doing something like:

<HTML><BODY> &lt;! blah blah blah &gt; </BODY></HTML>

Comment 27 Justin Mason 2003-06-17 19:31:21 UTC

hmm, you could be right.  don't have time to test it right now though.
the fix would be to move the

    $text =~ s{<(?:
        ![^>]*          # <!invalid comment style>
        |\s[^>]*        # < foo >
        |\/\s[^>]*      # </ foo >
        |\s*            # <> or < >
    )>}{}gsx;

substitution to before the HTML::Parser parsing step.

Comment 28 Daniel Quinlan 2003-06-17 23:51:52 UTC

> substitution to before the HTML::Parser parsing step.

Don't we want HTML::Parser to see HTML comments?  There are tests done
on comments.

Comment 29 Antony Mawer 2003-06-18 09:56:43 UTC

Subject: Re:  Standards violating comment obfuscation 


bugzilla-daemon@hughes-family.org said:

> Don't we want HTML::Parser to see HTML comments?  There are tests done
> on comments.

Yes.  OK, this is tricky then.

How's about we modify the code to *fix* the invalid-HTML text so that they
are valid HTML comments, before putting them through HTML::Parser.  Then
(a) we can avoid this bug, (b) we still ignore them in Bayes etc., and (c)
HTML::Parser can still see whatever text may be inside those "tags".

To recap BTW: the problem is the use of "tags" like:

	<!foo> <?foo> </ foo > < foo >

those are all invalid HTML, and HTML::Parser ignores 'em.  The previous
fix stripped them after HTML::Parser, but a spammer could then use

	&lt;!-- entire text of message here &gt;

to hide their message; since HTML::Parser converts &lt; to < etc., by the
time the stirpping code ran, the text would look like

	<!-- entire text of message here >

and the above stripping would strip these too (even though a MUA would
display them.)

The correct fix would be to modify HTML::Parser to deal with these invalid
HTML tags, but that would be pretty messy.

--j.

Comment 30 Theo Van Dinter 2003-06-18 11:10:53 UTC

the new issue is important enough that I'm changing the priority and such.

Comment 31 Matthew Cline 2003-06-18 20:13:37 UTC

I've modified things so that the stripping is done *before* the parser is 
called, so that the spammers can't use "&lt;" trickery.  The regexp leaves 
normal comments alone, only getting rid of HTML declarations.

    $text =~
      s{<(?:
         !           # The start of either an HTML comment or declaration...
         (?!--)[^>]* # But *not* followed by "--", so it isn't a comment...

         |\s[^>]*    # < foo >
         |/\s[^>]*   # </ foo >
         |\s*        # <> or < >
         )>}{}gsx;

Earlier versions of Perl have the regexp "zero-width lookahead assertions",
right?

Comment 32 Antony Mawer 2003-06-18 20:56:33 UTC

Subject: Re:  Standards violating comment obfuscation 


bugzilla-daemon@hughes-family.org said:

> Earlier versions of Perl have the regexp "zero-width lookahead assertions",
> right?

Yeah, they do I think.  We use it in a few rules.

--j.

Comment 33 Matthew Cline 2003-06-18 23:30:36 UTC

Hmmm...  I see that HTML.pm contains a sub to deal with declarations:

sub html_declaration {
  my ($self, $text) = @_;

  if ($text =~ /^<!doctype/i) {
    my $tag = "!doctype";

    $self->{html}{elements}++;
    $self->{html}{tags}++;
    $self->{html_inside}{$tag} = 0;
  }
}

And the latest change I make strips out declarations.  However, if
HTML::Parser properly dealt with delcarations, wouldn't it strip out things
like "<!foo>" be stripped out?

Hmmmm, as far as I can tell, SA doesn't have any rules that deal with "doctype"
anymore, so stripping out declarations shouldn't be a problem (for now).

Comment 34 Daniel Quinlan 2003-06-19 02:20:19 UTC

Matt, please follow the code review process before making changes to the code.
We are in a code freeze for some time now.  Nobody even saw your changes to
HTML.pm that you checked in six hours ago, let alone gave them a review.

This change could have a significant affect on rules, I'd like to make sure it
gets some testing and that we agree on the solution before it goes into the tree.

Comment 35 Daniel Quinlan 2003-06-19 11:21:20 UTC

-1: this patch (already applied) seems to break a bunch of tests, my ham
FPs increased by this count of messages for these tests:

19      HTML_TAG_BALANCE_HEAD
18      HTML_TAG_BALANCE_HTML
9       FORGED_OUTLOOK_TAGS
6       FORGED_IMS_TAGS
(and more, these are just the top four)

bringing the results for those tests to:

OVERALL     SPAM      HAM     S/O   SCORE  NAME
  29733    12036    17697    0.405   0.00    0.00  (all messages)
   1030     1021        9    0.994   0.92   1.00  FORGED_OUTLOOK_TAGS
    807      763       44    0.962   0.84   0.28  HTML_TAG_BALANCE_HTML
     98       92        6    0.958   0.82   1.00  FORGED_IMS_TAGS
    206      177       29    0.900   0.68   0.00  HTML_TAG_BALANCE_HEAD

All of the FPs are new for FORGED_*_TAGS and most of the HTML_TAG_BALANCE_*
FPs are new for me too.

On the other side, we lost a fair number of spam hits for some rules too:

-10     HTML_LINK_CLICK_CAPS
-11     LINES_OF_YELLING_2
-12     HTML_00_10
-12     HTML_80_90
-15     HTML_COMMENT_RATIO
-17     LINES_OF_YELLING
-34     HTML_FONT_BIG

In addition, the patch should at least exempt doctype using negative
lookahead something like:

  <!(?!doctype)

with a /i on the end.  I'm not sure what the right solution is, maybe it's
just a problem with the current replacement code.

Comment 36 Daniel Quinlan 2003-06-19 18:33:14 UTC

I tested out some of these obfuscations in Internet Explorer (I am not
being paid enough for this... well, I'm not being paid at all for this).

I think we only need to fix up things when we've verified that our rendering
is different from what most people see in their MUA.  I think that's pretty
similar to what IE does and probably the same as what Mozilla does 9 times out
of 10.

IE does not remove "<>" nor does it remove "< >".  It also does not seem to
remove "< foo >".  It does remove "</>", "</ >, and "</ foo>", though.

So, here's my new removal.  I leave it at the beginning for the same reasons
Matt mentioned.

    # convert <!foo> to <!--foo-->
    if ($HTML::Parser::VERSION < 3.28) {
      $text =~ s/<!((?!--|doctype)[^>]*)>/<!--$1-->/gsi;
    }

    # remove empty close tags: </>, </ >, </ foo>
    $text =~ s/<\/(?:\s.*?)?>//gs;

Testing now on my corpus to see how it compares.  I'll try to finish up my
testing and post a patch later tonight.

Comment 37 Daniel Quinlan 2003-06-20 13:02:26 UTC

Created attachment 1064 [details]
proposed patch against current CVS

Comment 38 Daniel Quinlan 2003-06-20 13:21:07 UTC

Working on this...

Comment 39 Justin Mason 2003-06-20 22:13:10 UTC

ok -- looks good I think.   One thing: could you add a "t" script to test it?
basically it should test against a mail which contains a load of body test text
in different exploited-comment-format styles, and then verifies that each of the
body tests in question match.

Comment 40 Daniel Quinlan 2003-06-20 23:28:21 UTC

Subject: Re:  Standards violating comment obfuscation

> ok -- looks good I think.  One thing: could you add a "t" script to
> test it?  basically it should test against a mail which contains a
> load of body test text in different exploited-comment-format styles,
> and then verifies that each of the body tests in question match.

I can update t/html_obfu.t to match.  It currently has this HTML:

------- start of cut text --------------
Bu</ donkey >lk e-< nope >mail<br>
10<b></b>0% GU<i></i>ARANTEED<br>
ne< >ver rec<>eive an<? blah >other mai<!foo>ling<br>
ma<himum>il w<himum>as s<himum>ent t<himum>o y<himum>ou beca<himum>use<br>
her</nothing>bal vi</nothing>agra<br>
Unso< / argh >licited Com< / argh >mercial El< / argh >ectronic Ma< / argh >il A< / argh >ct<br>
------- end ----------------------------

When I render that HTML in IE 6.0, Outlook Express XP, and Mozilla 1.0,
I see this:

------- start of cut text --------------
Bulk e-< nope >mail
100% GUARANTEED
ne< >ver rec<>eive another mailing
mail was sent to you because
herbal viagra
Unso< / argh >licited Com< / argh >mercial El< / argh >ectronic Ma< / argh >il A< / argh >ct
------- end ----------------------------

With the current patch for this bug, the "body" text matches exactly.

However, the current (pre-patch) 2.60-cvs sees:

------- start of cut text --------------
Bulk e-mail
100% GUARANTEED
never receive another mailing
mail was sent to you because
herbal viagra
Unsolicited Commercial Electronic Mail Act
------- end ----------------------------

I think the current code is actually too good, although we could make a
case for removing those additional fake tags anyway, it's not what
people see and I think it's probably best just to imitate Microsoft.
So, I'm going to adjust the SA test to include some antitests and I'll
also add a test for the empty close tag style.

Daniel

Comment 41 Matthew Cline 2003-06-20 23:43:09 UTC

I've just seen a spam that uses a new obfuscation: "<-!foo-->".  Even with the
patch in attachment 1064 [details], we don't handle it properly, asuming that IE/Outlook
treats such things as comments.

Comment 42 Daniel Quinlan 2003-06-20 23:45:10 UTC

Created attachment 1069 [details]
new proposed patch, updates test, INSTALL note

Comment 43 Daniel Quinlan 2003-06-21 00:07:22 UTC

None of IE, Mozilla, nor Outlook Express treat "<-!foo-->" as a comment.  It
shows up in plain text.  It's just a dumb spammer.  We could add plain old
body rules to look for these "non-obfuscations" if delusional spammers start
using them often.

In addition, I did a mass-check on my corpus with and without the patch.
It had this effect on my results (granted this is over 29733 messages,
but the effect should be much larger once I update my corpus).

  spam rule hits: +88
  ham rule hits: -35
  spam total score: +77.047
  ham total score: -10.457
  FNs: no change
  FPs: -3

The patch is ready to be reviewed and applied, I think.

Comment 44 Daniel Quinlan 2003-06-21 00:14:37 UTC

Oops, one mistake.  There were 4 additional FPs, not 3 fewer FPs.  Three of
them are due to the binary problem noted in bug 2106 -- a separate problem
that has to do with our handling of nested MIME with more than one boundary
and the fourth is due to the inaccuracy of our "when to run HTML::Parser" code.
More notes in that bug.

Anyway, the patch is still good-to-go.

Comment 45 Daniel Quinlan 2003-06-21 00:17:16 UTC

I've been working on this too long... only 3 additional FPs.  The 4th was just
trading one FP for another, so the only problem somewhat exasperated by the new
code is the binary attachment in nested MIME documents issue in bug 2106.

Comment 46 Justin Mason 2003-06-21 18:19:04 UTC

+1, sounds good

Comment 47 Theo Van Dinter 2003-06-21 18:59:44 UTC

seems good to me.  +1

Comment 48 Daniel Quinlan 2003-06-22 00:33:44 UTC

applied to CVS