SA Bugzilla – Bug 2170
Message content after </HTML> tag.
Last modified: 2004-10-18 08:43:30 UTC
Some spammers continue to put content in their HTML spam AFTER the </HTML> tag. Even the worst HTML composer won't do that! Usually they write only weird things, maybe to evade Razor/DCC/Pyzor... So it might be interesting to look into this (maybe by writing a rule?).
it *may* be, but I think a lot of mailing list software would also run into this (MailMan especially).
I also noticed some spam which had things before the opening <html> (just to point this out, dunno how good such a test would be ;-)
Malte, good point, I would think that's more likely to be a useful test -- look for random alphanum-and-whitespace crap before the opening < char. (since there may also be <?xml or <!doctype stuff.)
Subject: Re: [SAdev] Message content after </HTML> tag.

> it *may* be, but I think a lot of mailing list software would also run into this
> (MailMan especially).

I wrote a test for this but wasn't able to evaluate it properly because current versions of SA will sometimes include the MIME boundary as part of an HTML body. See bug #2375 for more information.

Brian
( bcwhite@precidia.com )
-------------------------------------------------------------------------------
Love and pain become one and the same in the eyes of a wounded child. -PBenetar
*** Bug 2422 has been marked as a duplicate of this bug. ***
Malte, Justin,
I made a simple test to check for content before the opening HTML tag.
rawbody T_B_BEFORE_HTML /.{3,25}\<html\>/i
describe T_B_BEFORE_HTML FVGT - what comes before the opening HTML tag?
This is working well for me! Give it a try when you get a chance.
Thanks,
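To make the behaviour of that pattern concrete, here is a minimal Python sketch (not SpamAssassin code). It assumes each input string is a single line and does not model how SpamAssassin actually feeds rawbody text to a rule; the function name has_text_before_html is made up for illustration.

```python
import re

# Python rendering of the pattern /.{3,25}\<html\>/i from the rule above.
# Assumption: each string passed in is one line of the raw message body.
BEFORE_HTML = re.compile(r".{3,25}<html>", re.IGNORECASE)


def has_text_before_html(line: str) -> bool:
    """True if at least 3 characters precede a literal <html> on the same line."""
    return BEFORE_HTML.search(line) is not None


if __name__ == "__main__":
    samples = [
        "qx7 kd92<html>",              # random text before the tag
        "<html>",                      # nothing before the tag
        "ab<html>",                    # only 2 characters before the tag
        "<!DOCTYPE html><html>",       # doctype on the same line as the tag
        'junk text <html lang="en">',  # tag with attributes, not a literal <html>
    ]
    for s in samples:
        print(repr(s), has_text_before_html(s))
```

Two things stand out in this sketch: a doctype sitting on the same line as <html> matches, which is the <?xml / <!doctype concern raised earlier in this bug, and an opening tag that carries attributes is never matched because the pattern requires a literal <html>.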
Created attachment 1657
Example spam with content after </HTML> tag
The attached message is typical of about 15 I've received over the Christmas break which haven't been tagged... it contains text after the </HTML> tag but within a MIME boundary. Surely this would only happen in spam? If mailman or other mailing software tacked on content after a </HTML> tag, it wouldn't be within the same MIME boundary as the HTML?
This probably applies to various tools, but I know that mail2world.com in particular (a free web-mail service) simply appends its advertisement footer to the end of HTML attachments, disregarding what came before it. This usually results in legitimate email having extra HTML tags and text after the closing HTML tag. Below is an example. Note that it is within the same MIME boundary (as they'd like readers to see the advertisement without having to open an attachment).

------=_NextPart_000_9D9C7_01C3C553.44329930
Content-Type: text/html
Content-Transfer-Encoding: 7bit

<HTML>
<BODY>
<P>Hey hoe, mom wanted me to think of christmas presents ideas that people should get me and when she heard this one she told me to e-mail you because you're probably the only that get or would want to get info like this. I want some info on the best places in the world to live or travel. It can be any where, but keep in mind hobbies like mineral mining, sailing, car cultures, high tech. areas, and the beautiful tropics. It's just an idea, but it's cheap. I know that you looked up a lot of places to live before you picked Rockford so you have some experience in this department. Anyway, it's just an idea, so you don't really have to do it.</P>
<P>Have anything on your mind that you'd like?</P>
<P>Bekki</P>
</BODY></HTML>
<BR><font face="Arial, Helvetica, sans-serif" size="2" style="font-size:13.5px">_______________________________________________________________<BR><font face="Arial, Helvetica, sans-serif" size="2" style="font-size:13.5px">Get the FREE email that has everyone talking at <a href="http://www.mail2world.com" target="new">http://www.mail2world.com</a></font><br><br>
</font>
</font>
------=_NextPart_000_9D9C7_01C3C553.44329930--
Note comments at bug #2892. I propose there that Bayes training not consider material after the </HTML>, but from your example it looks like non-Bayes tests should continue looking there. If people insist on using HTML, at least it should be *valid* HTML. Maybe we need a validator eval that returns a varying score based on how bad the HTML is. (Does SA have any way for an eval rule to pass a computed weight to a score?)
Thanks, we now have a rule for that. Also, browsers seem to render text after </html>, so we will too.
I have been using a rule to catch things after </html> for months. It catches quite a lot of spam. It also catches quite a lot of ham, typically things like Travelocity statements and HTML mail from mail lists. It is a useful test, but can't be scored too high. BTW, checking for junk before <html> turns out to be LESS useful than checking for junk after </html>. Seems more "legit" things put junk at the front of the message.
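For anyone who wants to experiment with the "content after </html>" check outside SpamAssassin, a rough Python sketch follows. It is not the rule that was committed and not the rule described in the previous comment; it assumes the caller has already extracted and decoded a single text/html MIME part, and the name text_after_html is hypothetical.

```python
import re

# Closing tag, case-insensitive, allowing whitespace before the ">".
CLOSE_HTML = re.compile(r"</html\s*>", re.IGNORECASE)


def text_after_html(part: str) -> str:
    """Return the stripped text following the last </html> in one decoded
    text/html part, or "" if nothing follows it or no closing tag exists."""
    matches = list(CLOSE_HTML.finditer(part))
    if not matches:
        return ""
    return part[matches[-1].end():].strip()


if __name__ == "__main__":
    clean = "<html><body><p>hi</p></body></html>\n"
    footer = (
        "<HTML><BODY><P>Hi</P></BODY></HTML>\n"
        "<BR><font>Get the FREE email</font>"
    )
    print(repr(text_after_html(clean)))
    print(repr(text_after_html(footer)))
```

A check like this cannot tell a spammer's trailing junk from an appended webmail footer such as the mail2world.com one quoted earlier, which fits the observation that the test hits ham as well as spam and should not be scored too high.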