Bug 2170 - Message content after </HTML> tag.
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules
Version: 2.55
Hardware: Other other
Importance: P5 enhancement
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Duplicates: 2422
Depends on:
Blocks:
 
Reported: 2003-07-01 02:56 UTC by Maxime Ritter
Modified: 2004-10-18 08:43 UTC
CC List: 2 users

Attachments:
  Example spam with content after </HTML> tag
  Type: text/plain
  Status: None
  Submitter/CLA Status: Lachlan Cameron-Smith [NoCLA]

Description Maxime Ritter 2003-07-01 02:56:27 UTC
Some spammers continue to write things in their HTML spam AFTER the </HTML> tag.
Even the worst HTML composer won't do that! Usually they write only weird
things, maybe to evade Razor/DCC/Pyzor. So it might be worth looking into
this (maybe by writing a rule?).
Comment 1 Justin Mason 2003-07-25 16:23:04 UTC
it *may* be, but I think a lot of mailing list software would also run into this
(MailMan especially).
Comment 2 Malte S. Stretz 2003-07-25 16:38:11 UTC
I also noticed some spam which had things before the opening <html> (just to 
point this out, dunno how good such a test would be ;-) 
Comment 3 Justin Mason 2003-07-25 17:04:17 UTC
Malte,

good point, I would think that's more likely to be a useful test -- look for
random alphanum-and-whitespace crap before the opening < char.  (since there may
also be <?xml or <!doctype stuff.)
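The check described here can be sketched in a few lines of Python. This is only an illustration of the idea, not a SpamAssassin rule: the helper names `text_before_first_tag` and `has_leading_junk` are hypothetical, and the sketch assumes the raw HTML part is available as a single string.

```python
import re

# Hypothetical helper, not part of SpamAssassin: return any non-whitespace
# text that precedes the first '<' in a raw HTML body part.
def text_before_first_tag(body: str) -> str:
    head, sep, _ = body.partition("<")
    if not sep:
        # No '<' at all, so there is no markup for the text to precede.
        return ""
    return head.strip()

# Junk made only of letters, digits and whitespace, as described above.
ALNUM_JUNK = re.compile(r"[A-Za-z0-9\s]+")

def has_leading_junk(body: str) -> bool:
    junk = text_before_first_tag(body)
    return bool(junk) and ALNUM_JUNK.fullmatch(junk) is not None
```

Because the test stops at the first `<` character rather than at `<html>`, a leading `<?xml ...?>` or `<!doctype ...>` declaration is not counted as junk.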
Comment 4 Brian White 2003-09-08 07:13:14 UTC
Subject: Re: [SAdev]  Message content after </HTML> tag.

> it *may* be, but I think a lot of mailing list software would also run into this
> (MailMan especially).

I wrote a test for this but wasn't able to evaluate it properly because
current versions of SA will sometimes include the MIME boundary as part
of an HTML body.  See bug #2375 for more information.

                                          Brian
                                 ( bcwhite@precidia.com )

-------------------------------------------------------------------------------
Love and pain become one and the same in the eyes of a wounded child. -PBenetar

Comment 5 Fred T 2003-09-30 21:04:23 UTC
*** Bug 2422 has been marked as a duplicate of this bug. ***
Comment 6 Fred T 2003-10-20 06:02:14 UTC
Malte, Justin,
I made a simple test to check before the HTML tag.

rawbody  T_B_BEFORE_HTML  /.{3,25}\<html\>/i
describe T_B_BEFORE_HTML  FVGT - what comes before the opening HTML tag?


This is working well for me!  Give it a try when you get a chance.
Thanks,
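To see what the pattern in T_B_BEFORE_HTML does and does not match, here is a rough Python rendering of the same regular expression. It is a sketch only: it does not model how SpamAssassin splits the raw body into chunks before applying rawbody rules, and the sample strings are made up for illustration.

```python
import re

# Python rendering of the pattern in T_B_BEFORE_HTML. In Perl, \< and \> are
# plain literal angle brackets, so they are written unescaped here.
BEFORE_HTML = re.compile(r".{3,25}<html>", re.IGNORECASE)

# Made-up sample inputs (assumptions, not taken from real mail).
samples = {
    "junk on same line": "xk2 9fj qpz<HTML>",
    "bare tag": "<html>",
    "junk on previous line": "xk2 9fj qpz\n<html>",
    "tag with attributes": 'xk2 9fj qpz<html lang="en">',
    "doctype on same line": "<!DOCTYPE html><html>",
}

results = {name: bool(BEFORE_HTML.search(text)) for name, text in samples.items()}
```

In this sketch the pattern matches only when the preceding characters sit on the same line as a literal `<html>`. Junk on an earlier line and an `<html>` tag with attributes are both missed, while a doctype declaration on the same line as the tag is matched even though it is legitimate.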
Comment 7 Lachlan Cameron-Smith 2004-01-04 20:51:36 UTC
Created attachment 1657
Example spam with content after </HTML> tag

The attached message is typical of about 15 I've received over the Christmas
break  which haven't been tagged... it contains text after the </HTML> tag but
within a MIME boundary. Surely this would only happen in spam? If mailman or
other mailing software tacked on content after a </HTML> tag, it wouldn't be
within the same MIME boundary as the HTML?
Comment 8 David Newcum 2004-01-05 03:22:01 UTC
This probably applies to various tools, but I know that mail2world.com in
particular (a free web-mail service) simply appends its advertisement footer at
the end of HTML attachments, disregarding what came before it.  This usually
results in legitimate email having extra HTML tags and text after the
closing HTML tag.

Below is an example.  Note that it is within the same MIME boundary (as they'd
like readers to see the advertisement without having to open up an attachment).


------=_NextPart_000_9D9C7_01C3C553.44329930
Content-Type: text/html
Content-Transfer-Encoding: 7bit

<HTML>
<BODY>
<P>Hey hoe, mom wanted me to think of christmas presents ideas that people
should get me and when she heard this one she told me to e-mail you because you're
probably the only that get or would want to get info like this. I want some info
on the best places in the world to live or travel. It can be any where, but
keep in mind hobbies like mineral mining, sailing, car cultures, high tech.
areas, and the beautiful tropics. It's just an idea, but it's cheap. I know that
you looked up a lot of places to live before you picked Rockford so you have some
experience in this department. Anyway, it's just an idea, so you don't really
have to do it.</P>
<P>Have anything on your mind that you'd like?</P>
<P>Bekki</P>
</BODY></HTML>
<BR><font face="Arial, Helvetica, sans-serif" size="2"
style="font-size:13.5px">_______________________________________________________________<BR><font
face=
"Arial, Helvetica, sans-serif" size="2" style="font-size:13.5px">Get the FREE
email that has everyone talking at <a href="http://www.mail2world.com"
target="new">http://www.mail2world.com</a></font><br><br>&nbsp;</font>  </font>
------=_NextPart_000_9D9C7_01C3C553.44329930--
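A check for content after the closing tag might look like the Python sketch below. It is an illustration only, not the rule that was eventually added to SpamAssassin: the function name `text_after_html` is hypothetical, tag stripping is done with a crude regex, and the input is assumed to be one decoded text/html part.

```python
import re

# Closing </html> tag, case-insensitive, tolerating whitespace as in "</HTML >".
CLOSE_HTML = re.compile(r"</html\s*>", re.IGNORECASE)
# Crude tag matcher, good enough for a sketch but not a real HTML parser.
TAG = re.compile(r"<[^>]*>")

def text_after_html(part: str) -> str:
    """Return the visible text that follows the last </html>, tags removed."""
    matches = list(CLOSE_HTML.finditer(part))
    if not matches:
        return ""
    tail = part[matches[-1].end():]
    # Replace tags and non-breaking spaces with blanks, then collapse whitespace.
    text = TAG.sub(" ", tail).replace("&nbsp;", " ")
    return " ".join(text.split())
```

Run against a message shaped like the example above, this returns the advertisement footer text, so a check of this kind fires on legitimate mail2world messages as well as on spam.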
Comment 9 Kenneth Porter 2004-01-05 07:12:03 UTC
Note comments at bug #2892. I propose there that Bayes training not consider
material after the </HTML>, but from your example it looks like non-Bayes tests
should continue looking there.

If people insist on using HTML, at least it should be *valid* HTML. Maybe we
need a validator eval that returns a varying score based on how bad the HTML is.
(Does SA have any way for an eval rule to pass a computed weight to a score?)
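As a rough idea of what such a validator could compute, here is a Python sketch that counts a few kinds of HTML problems. It is not a SpamAssassin eval rule and says nothing about whether SA can turn a computed value into a score. The class `BadnessCounter`, the function `html_badness`, and the choice of what counts as a problem are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}

class BadnessCounter(HTMLParser):
    """Hypothetical counter: one point per structural problem found."""

    def __init__(self):
        super().__init__()
        self.open_tags = []       # stack of currently open elements
        self.html_closed = False  # set once </html> has been seen
        self.problems = 0

    def handle_starttag(self, tag, attrs):
        if self.html_closed:
            self.problems += 1    # markup after </html>
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if tag == "html":
            self.html_closed = True
        if tag in self.open_tags:
            # Pop back to the matching opener; anything skipped was unclosed.
            while self.open_tags:
                if self.open_tags.pop() == tag:
                    break
                self.problems += 1
        else:
            self.problems += 1    # closing tag with no matching opener

    def handle_data(self, data):
        if self.html_closed and data.strip():
            self.problems += 1    # visible text after </html>

def html_badness(html: str) -> int:
    counter = BadnessCounter()
    counter.feed(html)
    counter.close()
    # Elements still open at the end also count as problems.
    return counter.problems + len(counter.open_tags)
```

A well-formed document scores zero, and each unclosed element, stray closing tag, or piece of content after `</html>` adds one. How a count like this should map to a rule score is left open, as in the comment above.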
Comment 10 Daniel Quinlan 2004-10-18 11:04:38 UTC
Thanks, we now have a rule for that.

Also, browsers seem to render text after </html>, so we will too.
Comment 11 Loren Wilton 2004-10-18 16:43:30 UTC
I have been using a rule to catch things after </html> for months.  It catches 
quite a lot of spam.  It also catches quite a lot of ham, typically things like 
Travelocity statements and HTML mail from mail lists.

It is a useful test, but can't be scored too high.

BTW, checking for junk before <html> turns out to be LESS useful than checking 
for junk after </html>.  Seems more "legit" things put junk at the front of the 
message.