SA Bugzilla – Bug 2403
New rule suggestion: HTML_EXCESSIVE_ENTITIES -- regular letters reencoded as &#nnn;
Last modified: 2005-02-06 14:40:12 UTC
(Indented lines continued from previous line)

describe EXCESSIVE_HTML_ENTITIES Unnecessary reencoding of normal letters into HTML entities
# 70-89 is 'F'-'Y', 100-119 is 'd'-'w', 32-36 is space,!,",#,$
rawbody __EXCESSIVE_HTML_ENTITIES_3x /\&\#3[2-6];/
rawbody __EXCESSIVE_HTML_ENTITIES_7x /\&\#7[0-9];/
rawbody __EXCESSIVE_HTML_ENTITIES_8x /\&\#8[0-9];/
rawbody __EXCESSIVE_HTML_ENTITIES_10x /\&\#10[0-9];/
rawbody __EXCESSIVE_HTML_ENTITIES_11x /\&\#11[0-9];/
meta EXCESSIVE_HTML_ENTITIES ( __EXCESSIVE_HTML_ENTITIES_3x + __EXCESSIVE_HTML_ENTITIES_7x +
    __EXCESSIVE_HTML_ENTITIES_8x + __EXCESSIVE_HTML_ENTITIES_10x +
    __EXCESSIVE_HTML_ENTITIES_11x ) > 2

Requires a hit on three out of five groups of unnecessarily reencoded letters. Maybe I'm being too nice. ;)

See attached example spam. (Though one wonders why he bothered to obfuscate the HTML and then left a plaintext copy in. Spammers never cease to amaze me.)
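To make the "three out of five groups" logic of the meta rule concrete, here is a minimal Python sketch of the same counting. It is not SpamAssassin code: the function names `groups_hit` and `excessive_html_entities` are made up for illustration, and it scans the whole raw body as one string, which is an assumption about how the rawbody sub-rules would see the message.

```python
import re

# One pattern per sub-rule, copied from the rawbody regexes in the proposal.
GROUPS = {
    "3x": re.compile(r"&#3[2-6];"),
    "7x": re.compile(r"&#7[0-9];"),
    "8x": re.compile(r"&#8[0-9];"),
    "10x": re.compile(r"&#10[0-9];"),
    "11x": re.compile(r"&#11[0-9];"),
}


def groups_hit(raw_body):
    """Return the names of the entity groups that occur at least once."""
    return {name for name, pattern in GROUPS.items() if pattern.search(raw_body)}


def excessive_html_entities(raw_body):
    """Mirror the meta rule: fire when more than two distinct groups hit."""
    return len(groups_hit(raw_body)) > 2


if __name__ == "__main__":
    # "FREE offer" with every letter reencoded as a decimal entity.
    sample = "&#70;&#82;&#69;&#69; &#111;&#102;&#102;&#101;&#114;"
    print(sorted(groups_hit(sample)), excessive_html_entities(sample))
```

Each group counts at most once no matter how many entities it contains, so a single reencoded letter (or many letters from one group) never fires the rule on its own.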
Created attachment 1311: Example spam with unnecessary HTML entities
Come to think of it, if the above tests prove useful, this is probably much better implemented in the HTML->plaintext converter, at near zero cost. But I refuse to get involved in yet another project, so that's for the illustrious Someone Else to do.
Subject: Re: [SAdev] New rule suggestion: HTML_EXCESSIVE_ENTITIES -- regular letters reencoded as &#nnn;

> Example spam with unnecessary HTML entities

I believe this is similar to bug #2211, "New HTML Tag Tests".

Brian ( bcwhite@precidia.com )
-------------------------------------------------------------------------------
Many times the difference between failure and success is doing something nearly right... or doing it exactly right.
Subject: Re: [SAdev] New rule suggestion: HTML_EXCESSIVE_ENTITIES -- regular letters reencoded as &#nnn;

> Example spam with unnecessary HTML entities

Oops, never mind. I misunderstood "entities".

Brian ( bcwhite@precidia.com )
I propose using HTML::Entities ( http://search.cpan.org/dist/HTML-Parser/lib/HTML/Entities.pm ) from HTML-Parser in lib/Mail/SpamAssassin/HTML.pm. Would that be OK?
*** Bug 2475 has been marked as a duplicate of this bug. ***
OK, I looked at the source code.

* PerMsgStatus.pm, at HTML::Parser->new:
    text => [sub { $self->html_text(@_) }, "dtext"]
  The dtext argument already takes care of decoding HTML entities. That is good on the one hand, because rules for suspicious words still match, but at the moment there is no easy way of telling that obfuscation has taken place other than using rawbody. Instead of using dtext, we could decode inside html_text, but would that be better than rawbody matching with separate rules?

* Entity decoding does not apply to HTML tags. For example, URIs are not parsed correctly, so it might be worth running them through HTML::Entities and signalling a hit if conversion has taken place.

* Spammers could start obfuscating "a href=" with entities, thus bypassing URI tests altogether, right? Should we convert all entities in tags and signal a hit if conversion has taken place?
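The "decode the attribute value and signal a hit if conversion took place" idea from the comment above could look roughly like this. It is a Python sketch using the standard `html.unescape`, not the Perl HTML::Entities call the proposal has in mind. The regex, the function name `decode_hrefs`, and the restriction to double-quoted href values are all simplifying assumptions.

```python
import re
from html import unescape

# Simplified: only double-quoted href values; real HTML is messier than this.
HREF_RE = re.compile(r'href\s*=\s*"([^"]*)"', re.IGNORECASE)


def decode_hrefs(raw_html):
    """Return (decoded_url, was_encoded) for each double-quoted href value.

    was_encoded is True when entity decoding changed the value, i.e. the
    raw markup contained at least one character reference.
    """
    results = []
    for raw_value in HREF_RE.findall(raw_html):
        decoded = unescape(raw_value)
        results.append((decoded, decoded != raw_value))
    return results


if __name__ == "__main__":
    print(decode_hrefs('<a href="&#104;ttp://www.foo.com">link</a>'))
    print(decode_hrefs('<a href="http://www.foo.com">link</a>'))
```

One caveat with a flag this simple: `&amp;` is legitimate inside URLs with query strings, so a real rule would probably need to ignore it and count only reencoded letters and digits.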
Subject: Re: [SAdev] New rule suggestion: HTML_EXCESSIVE_ENTITIES -- regular letters reencoded as &#nnn;

>* Entity decoding does not apply to HTML tags - for example URIs are not
>parsed correctly so it might be worth running them through
>HTML::Entities and if conversion has taken place to signal a hit.
>
>* Spammers could start obfuscating "a href=" with entities thus bypassing
>URI tests altogether, right? Should we convert all entities in Tags and
>signal a hit if conversion has taken place?

it would be worth checking MUA behaviour on these -- as far as I know, entities used in those places will *not* be decoded in the renderer and therefore not acted on.

--j.
On the "HTML entities in 'a href's" issue: (Conclusions at the bottom)

Tested variations
-----------------
1: <a href="http://www.foo.com">link</a><br>
2: A <A href="http://www.foo.com">link</a><br>
3:   <a href="http://www.foo.com">link</a><br>
4: h <a href="http://www.foo.com">link</a><br>
5: = <a href="http://www.foo.com">link</a><br>
6: " <a href="http://www.foo.com">link</a><br>
7: h <a href="http://www.foo.com">link</a><br>
8: w <a href="http://www.foo.com">link</a><br>

Opera 7.11, Netscape 4.79, IE 5.00 (Outlook)
--------------------------------------------
1: link
2: A <A href="http://www.foo.com">link
3: link
4: h link
5: = link
6: " link
7: h link
8: w link

1: clickable. works.
6: clickable but doesn't work. becomes relative link to currentpath/"http://www.foo.com"
7: clickable. works.
8: clickable. works.

This is expected behavior. HTML entities inside tag values should be decoded (reference e.g. input boxes).

Checking "&#0;"
---------------
A: &#0; <a href="http://www.foo.com">link</a><br>
B: &#0; <a &#0;href="http://www.foo.com">link</a><br>
C: &#0; <a hr&#0;ef="http://www.foo.com">link</a><br>
D: &#0; <a href="&#0;http://www.foo.com">link</a><br>
E: &#0; <a href="http&#0;://www.foo.com">link</a><br>
F: &#0; <a href="http:&#0;//www.foo.com">link</a><br>
G: &#0; <a href="http:&#0;//www.foo.com">link</a><br>
H: &#0; <a href="http://&#0;www.foo.com">link</a><br>
I: &#0; <a href="http://www.foo.com&#0;">link</a><br>
J: &#0; <a href="http://www.foo.com&#0;/">link</a><br>
K: &#0; <a href&#0;="http://www.foo.com">link</a><br>
L: &#0; <a href=&#0;"http://www.foo.com">link</a><br>
M: &#0; <&#0;a href="http://www.foo.com">link</a><br>

Netscape 4.79
-------------
Prints the "&#0;" literally in text; refuses to understand it.
A: Totally broken
B-C: Not clickable
D-G: Clickable but won't work
H: Clickable. Messes up internal caching fiercely. Displays a "using cached page instead" dialog and then displays ... something; I haven't figured out what exactly yet. It displays a site I previously tried with "&0;" somewhere but I'm not sure which variation.
I-J: Clickable but attempts to resolve ".com&#0;" - can't work
K: Not clickable
L: Clickable but won't work

Opera 6.05
----------
Prints hollow squares in place of the "&#0;"s
A: Totally broken
B-C: Not clickable
D-E: Clickable, but "Address type unknown or unsupported"
F-G: Clickable but won't work
H: Attempts to resolve but won't work. Perhaps would work with a tweaked DNS entry / wildcard? Unknown.
I-J: Clickable but won't/can't resolve
K: Not clickable
L: Clickable but won't work

IE 5.00
-------
Prints the "&#0;" literally in text; refuses to understand it.
A: Totally broken
B-C: Not clickable
D-J: Clickable but won't work. IE errors are unhelpful^Wfriendly
K: Not clickable
L: Clickable but won't work

CONCLUSIONS
-----------
I don't see an immediate problem, but there are a few things that may be worth checking / investigating at some point:
- Are URLs properly HTML decoded (HTML entities converted) before checks?
- What happens if SA decodes a "&#0;"? Does it become a NUL? If so, do any searches terminate prematurely? In body text? URLs?
- The "H" case might be worth investigating further. Perhaps with a DNS protocol sniffer. Some rainy day :)
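The open question about what a decoded NUL character reference turns into can at least be probed outside SpamAssassin. The sketch below uses Python's standard `html.unescape`, which follows the HTML5 rules and maps a reference to code point 0 to U+FFFD rather than to a real NUL byte. Two assumptions to note: the reference is written as `&#0;` here, and the helper name `decode_and_check_nul` is invented. None of this says what Perl's HTML::Parser or HTML::Entities does, which is the behaviour that actually matters for SA and would need a separate check.

```python
from html import unescape


def decode_and_check_nul(raw_text):
    """Decode character references and report whether a NUL char resulted."""
    decoded = unescape(raw_text)
    return decoded, "\x00" in decoded


if __name__ == "__main__":
    # Variation "H" from the list above: the reference sits right after "//".
    decoded, has_nul = decode_and_check_nul("http://&#0;www.foo.com")
    print(repr(decoded), has_nul)
```

If a decoder did produce a literal NUL, any later matching done on C-style strings could stop at that point, which is the premature-termination concern raised in the conclusions.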
This behavior is picking up somewhat. Nowhere near alarming yet, but definitely on the increase.

2003-11-14 -- 2003-12-15: 15 hits out of 4497 (0.3%)
2003-12-15 -- 2004-01-06: 32 hits out of 3667 (0.9%)
2004-01-06 -- 2004-01-22: 33 hits out of 3001 (1.1%)

This is just from my own address, though. (Yeah, 190 spams/day now. Yum.)
2004-02-10 to 03-26: 140 hits out of 9530 (1.5%)
2004-03-26 to 04-28: 22 hits out of 6587 (0.3%)
2004-04-28 to 05-28: 0 hits out of 9000 (0%)
2004-05-28 to 06-28: 10 hits out of 9846 (0.1%)
2004-06-28 to 07-18: 0 hits out of 6158 (0%)

Fad?
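For anyone wanting to recheck the percentages, the arithmetic is just hits divided by total messages per period. A small Python sketch using the figures quoted above (the names `PERIODS` and `hit_rate_percent` are illustrative only):

```python
# (period, hits, total spam received) as quoted in the comment above.
PERIODS = [
    ("2004-02-10 to 03-26", 140, 9530),
    ("2004-03-26 to 04-28", 22, 6587),
    ("2004-04-28 to 05-28", 0, 9000),
    ("2004-05-28 to 06-28", 10, 9846),
    ("2004-06-28 to 07-18", 0, 6158),
]


def hit_rate_percent(hits, total):
    """Hit rate as a percentage, rounded to one decimal place."""
    return round(100.0 * hits / total, 1)


if __name__ == "__main__":
    for period, hits, total in PERIODS:
        print(f"{period}: {hit_rate_percent(hits, total)}%")
```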
re: 'fad?' -- it sounds a lot like one spammer, who's now moved on to other techniques (presumably because this one isn't helping much.)
moving accuracy and some bugs to 3.1.0 milestone
more accuracy and performance bugs going to 3.1.0 milestone
NEEDSMC
# [automatically generated by automc: start]
# DONEMC 15: completed request from comment 15

  0.197   0.0936   0.5988    0.135   0.23    1.00  __EXCESSIVE_HTML_ENTITIES_3x_b2403_c0
  0.071   0.0897   0.0000    1.000   0.57    1.00  __EXCESSIVE_HTML_ENTITIES_7x_b2403_c0
  0.068   0.0860   0.0000    1.000   0.56    1.00  __EXCESSIVE_HTML_ENTITIES_8x_b2403_c0
  0.154   0.1933   0.0010    0.995   0.63    1.00  __EXCESSIVE_HTML_ENTITIES_10x_b2403_c0
  0.157   0.1975   0.0000    1.000   0.63    1.00  __EXCESSIVE_HTML_ENTITIES_11x_b2403_c0
  0.076   0.0960   0.0000    1.000   0.58    0.01  T_MC_EXCESSIVE_HTML_ENTITIES_b2403_c0

above freqs using data from "/home/automc/corpus/html/DETAILS.new" as of Fri Feb 4 15:45:56 2005:

__EXCESSIVE_HTML_ENTITIES_3x_b2403_c0 = __EXCESSIVE_HTML_ENTITIES_3x from bug 2403 comment 0
  full freqs: http://bugzilla.spamassassin.org/ruleqa?rule=__EXCESSIVE_HTML_ENTITIES_3x_b2403_c0&date=20050204
__EXCESSIVE_HTML_ENTITIES_7x_b2403_c0 = __EXCESSIVE_HTML_ENTITIES_7x from bug 2403 comment 0
  full freqs: http://bugzilla.spamassassin.org/ruleqa?rule=__EXCESSIVE_HTML_ENTITIES_7x_b2403_c0&date=20050204
__EXCESSIVE_HTML_ENTITIES_8x_b2403_c0 = __EXCESSIVE_HTML_ENTITIES_8x from bug 2403 comment 0
  full freqs: http://bugzilla.spamassassin.org/ruleqa?rule=__EXCESSIVE_HTML_ENTITIES_8x_b2403_c0&date=20050204
__EXCESSIVE_HTML_ENTITIES_10x_b2403_c0 = __EXCESSIVE_HTML_ENTITIES_10x from bug 2403 comment 0
  full freqs: http://bugzilla.spamassassin.org/ruleqa?rule=__EXCESSIVE_HTML_ENTITIES_10x_b2403_c0&date=20050204
__EXCESSIVE_HTML_ENTITIES_11x_b2403_c0 = __EXCESSIVE_HTML_ENTITIES_11x from bug 2403 comment 0
  full freqs: http://bugzilla.spamassassin.org/ruleqa?rule=__EXCESSIVE_HTML_ENTITIES_11x_b2403_c0&date=20050204
T_MC_EXCESSIVE_HTML_ENTITIES_b2403_c0 = EXCESSIVE_HTML_ENTITIES from bug 2403 comment 0
  full freqs: http://bugzilla.spamassassin.org/ruleqa?rule=T_MC_EXCESSIVE_HTML_ENTITIES_b2403_c0&date=20050204

# ham results used: ham-bzoetekouw.log ham-daf.log ham-jm.log ham-parkerm.log ham-quinlan.log ham-rODbegbie.log ham-theo.log
# spam results used: spam-bzoetekouw.log spam-daf.log spam-jm.log spam-parkerm.log spam-quinlan.log spam-rODbegbie.log spam-theo.log

 479311   381285    98026    0.795   0.00    0.00  (all messages)
100.000  79.5486  20.4514    0.795   0.00    0.00  (all messages as %)
# [automatically generated by automc: end]
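A note on reading the per-rule rows: assuming the second and third columns are the percentage of spam and of ham hit, and the fourth is the S/O ratio (the output above carries no header, so this column reading is an assumption), the fourth column is consistent with spam% / (spam% + ham%). A minimal Python check, with `spam_over_overall` as an illustrative name:

```python
def spam_over_overall(spam_pct, ham_pct):
    """S/O ratio from per-class hit percentages; None when nothing hit."""
    total = spam_pct + ham_pct
    if total == 0:
        return None
    return spam_pct / total


if __name__ == "__main__":
    # (rule suffix, spam%, ham%) taken from the per-rule rows above.
    rows = [("3x", 0.0936, 0.5988), ("10x", 0.1933, 0.0010), ("7x", 0.0897, 0.0000)]
    for name, spam_pct, ham_pct in rows:
        print(name, round(spam_over_overall(spam_pct, ham_pct), 3))
```

Read this way, the 3x sub-rule (space, !, ", #, $) hits far more ham than spam, while the letter groups almost never hit ham.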
freqs from "DETAILS.age" (set 0, by message age):

  0.013   0.0144   0.0000    1.000   0.43    0.01  T_MC_EXCESSIVE_HTML_ENTITIES_b2403_c0:0-1
  0.016   0.0189   0.0000    1.000   0.46    0.01  T_MC_EXCESSIVE_HTML_ENTITIES_b2403_c0:1-3
  0.071   0.0854   0.0000    1.000   0.45    0.01  T_MC_EXCESSIVE_HTML_ENTITIES_b2403_c0:3-6

sorry, I think we have to close this -- rawbody tests are slow, the hit-rate's not great, and the hit-rates are declining (0.0144% of spam in the last month).