Bug 2403 - New rule suggestion: HTML_EXCESSIVE_ENTITIES -- regular letters reencoded as &#nnn;
Status: RESOLVED WONTFIX
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules
Version: unspecified
Hardware: Other other
Importance: P5 enhancement
Target Milestone: 3.1.0
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Duplicates: 2475
Depends on:
Blocks:
 
Reported: 2003-09-03 12:54 UTC by Mikael Olsson
Modified: 2005-02-06 14:40 UTC

Attachment Type Modified Status Actions Submitter/CLA Status
Example spam with unnecessary HTML entities text/plain None Mikael Olsson [HasCLA]

Description Mikael Olsson 2003-09-03 12:54:54 UTC
(Indented lines continued from previous line)

describe EXCESSIVE_HTML_ENTITIES Unnecessary reencoding of normal letters 
  into HTML entities
# 70-89 is 'F'-'Y', 100-119 is 'd'-'w', 32-36 is space,!,",#,$
rawbody __EXCESSIVE_HTML_ENTITIES_3x /\&\#3[2-6];/
rawbody __EXCESSIVE_HTML_ENTITIES_7x /\&\#7[0-9];/
rawbody __EXCESSIVE_HTML_ENTITIES_8x /\&\#8[0-9];/
rawbody __EXCESSIVE_HTML_ENTITIES_10x /\&\#10[0-9];/
rawbody __EXCESSIVE_HTML_ENTITIES_11x /\&\#11[0-9];/
meta EXCESSIVE_HTML_ENTITIES ( __EXCESSIVE_HTML_ENTITIES_3x + 
  __EXCESSIVE_HTML_ENTITIES_7x + __EXCESSIVE_HTML_ENTITIES_8x + 
  __EXCESSIVE_HTML_ENTITIES_10x + __EXCESSIVE_HTML_ENTITIES_11x ) > 2

Requires a hit on three out of five groups of unnecessarily
reencoded letters. Maybe I'm being too nice. ;)
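
For experimenting outside SpamAssassin, the same "three out of five
groups" logic can be sketched in Python. This is only an illustration:
the function names are hypothetical, and it searches the whole body at
once, which is an assumption about how the rawbody sub-rules and the
meta rule combine, not a description of SpamAssassin internals.

```python
import re

# Same five character-code groups as the rawbody sub-rules above.
GROUPS = {
    "3x": re.compile(r"&#3[2-6];"),
    "7x": re.compile(r"&#7[0-9];"),
    "8x": re.compile(r"&#8[0-9];"),
    "10x": re.compile(r"&#10[0-9];"),
    "11x": re.compile(r"&#11[0-9];"),
}


def groups_hit(raw_body: str) -> int:
    """Count how many of the five groups match at least once."""
    return sum(1 for pattern in GROUPS.values() if pattern.search(raw_body))


def excessive_html_entities(raw_body: str) -> bool:
    """Mirror the meta rule: fire when more than two groups match."""
    return groups_hit(raw_body) > 2
```

A body such as "&#72;&#101;&#108;&#108;&#111;" hits the 7x, 10x and 11x
groups and would fire, while a single "&#38;" hits none.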

See attached example spam. (Though one wonders why he bothered
to obfuscate the HTML and then left a plaintext copy in. Spammers
never cease to amaze me.)
Comment 1 Mikael Olsson 2003-09-03 12:55:59 UTC
Created attachment 1311
Example spam with unnecessary HTML entities
Comment 2 Mikael Olsson 2003-09-03 14:14:39 UTC
Come to think of it, if the above tests prove useful, this is probably 
much better implemented in the HTML->plaintext converter, at near
zero cost, but I refuse to get involved in yet another project, so 
that's for the illustrious Someone Else to do.
Comment 3 Brian White 2003-09-03 14:29:15 UTC
Subject: Re: [SAdev]  New rule suggestion: HTML_EXCESSIVE_ENTITIES -- 
 regular letters reencoded as &#nnn;

> Example spam with unnecessary HTML entities

I believe this is similar to bug #2211, "New HTML Tag Tests".

                                          Brian
                                 ( bcwhite@precidia.com )

-------------------------------------------------------------------------------
    Many times the difference between failure and success is doing something
                   nearly right... or doing it exactly right.

Comment 4 Brian White 2003-09-03 14:30:15 UTC
Subject: Re: [SAdev]  New rule suggestion: HTML_EXCESSIVE_ENTITIES -- 
 regular letters reencoded as &#nnn;

> Example spam with unnecessary HTML entities

Oops, never mind.  I misunderstood "entities".

                                          Brian
                                 ( bcwhite@precidia.com )

-------------------------------------------------------------------------------
    Many times the difference between failure and success is doing something
                   nearly right... or doing it exactly right.

Comment 5 Gregor Lawatscheck 2003-10-07 12:42:15 UTC
I propose using 
HTML::Entities ( http://search.cpan.org/dist/HTML-Parser/lib/HTML/Entities.pm )
from HTML-Parser in lib/Mail/SpamAssassin/HTML.pm. Would that be OK?

Comment 6 Malte S. Stretz 2003-10-07 14:56:32 UTC
*** Bug 2475 has been marked as a duplicate of this bug. ***
Comment 7 Gregor Lawatscheck 2003-10-08 17:52:11 UTC
OK, I looked at the source code.

* PerMsgStatus.pm at HTML::Parser->new :
  text => [sub { $self->html_text(@_) }, "dtext"]

  The dtext already takes care of decoding HTML entities, 
  which is good on the one hand, because rules for suspicious 
  words are matched, but there is no easy way of telling that 
  obfuscation has taken place other than using rawbody at the moment.
  Instead of using dtext, we could do the decoding inside html_text, but
  would that be better than rawbody matching with separate rules?

* Entity decoding does not apply to HTML tags. For example, URIs are not
  parsed correctly, so it might be worth running them through
  HTML::Entities and signalling a hit if a conversion has taken place.

* Spammers could start obfuscating "a href=" with entities, thus bypassing
  URI tests altogether, right? Should we convert all entities in tags and
  signal a hit if a conversion has taken place?
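
A rough sketch of the "decode and signal a hit if a conversion has taken
place" idea, written in Python rather than Perl so it can be run on its
own. Everything here is hypothetical: the regex and the decode_hrefs
helper are illustrative and are not taken from SpamAssassin or
HTML::Entities.

```python
import html
import re

# Hypothetical helper, not SpamAssassin code: pull the raw href values out
# of anchor tags with a deliberately simple regex (double-quoted values only).
HREF_RE = re.compile(r'<a\s[^>]*?href="([^"]*)"', re.IGNORECASE)


def decode_hrefs(raw_html: str):
    """Return a (decoded_url, was_changed) pair for each double-quoted href."""
    results = []
    for raw_value in HREF_RE.findall(raw_html):
        decoded = html.unescape(raw_value)
        # was_changed is True whenever decoding altered the raw value.
        results.append((decoded, decoded != raw_value))
    return results
```

Note that an ordinary "&amp;" in a query string also changes when
decoded, so a real test would have to tell that apart from letters
reencoded as &#nnn;.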
Comment 8 Justin Mason 2003-10-08 18:35:19 UTC
Subject: Re: [SAdev]  New rule suggestion: HTML_EXCESSIVE_ENTITIES -- regular letters reencoded as &#nnn; 


>* Entity decoding does not apply to HTML tags - for example URIs are not
>parsed correctly so it might be worth running them through
>HTML::Entities and if conversion has taken place to signal a hit.
>  
>* Spammers could start obfuscating "a href=" with entities thus bypassing
>URI tests altogether, right? Should we convert all entities in Tags and
>signal a hit if conversion has taken place?

it would be worth checking MUA behaviour on these -- as far as I know,
use of entities in those places will *not* be decoded in the renderer
and therefore not acted on.

--j.

Comment 9 Mikael Olsson 2003-10-09 13:11:22 UTC
On the "HTML entities in 'a href's" issue:

(Conclusions at the bottom)

Tested variations
-----------------

  1: <a href="http://www.foo.com">link</a><br>
  2: &#65; <&#65; href="http://www.foo.com">link</a><br>
  3: &#32; <a&#32;href="http://www.foo.com">link</a><br>
  4: &#104; <a &#104ref="http://www.foo.com">link</a><br>
  5: &#61; <a href&#61"http://www.foo.com">link</a><br>
  6: &#34; <a href=&#34http://www.foo.com">link</a><br>
  7: &#104; <a href="&#104ttp://www.foo.com">link</a><br>
  8: &#119; <a href="http://&#119;ww.foo.com">link</a><br>

Opera 7.11, Netscape 4.79, IE 5.00 (Outlook)
--------------------------------------------

  1: link
  2: A <A href="http://www.foo.com">link
  3: link
  4: h link
  5: = link
  6: " link
  7: h link
  8: w link

  1: clickable. works.
  6: clickable but doesn't work.
     becomes relative link to currentpath/"http://www.foo.com"
  7: clickable. works.
  8: clickable. works.

This is expected behavior. HTML entities inside tag values should
be decoded (reference e.g. input boxes).
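
As a point of comparison only, Python's standard html.parser also decodes
character references inside attribute values before handing them to the
caller, which matches what the browsers did for variation 8. The
HrefCollector class and collect_hrefs function are hypothetical names
used for this sketch; nothing here says how SpamAssassin's own parser
behaves.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect href values of <a> tags as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; values arrive already decoded.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def collect_hrefs(markup: str):
    parser = HrefCollector()
    parser.feed(markup)
    parser.close()
    return parser.hrefs
```

Feeding it variations 1 and 8 from the list above yields the same URL,
http://www.foo.com, in both cases.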


Checking "&#0;"
---------------

A: &#0; <a href="http://www.foo.com">link</a><br>
B: &#0; <a &#0;href="http://www.foo.com">link</a><br>
C: &#0; <a hr&#0;ef="http://www.foo.com">link</a><br>
D: &#0; <a href="&#0;http://www.foo.com">link</a><br>
E: &#0; <a href="http&#0;://www.foo.com">link</a><br>
F: &#0; <a href="http:&#0;//www.foo.com">link</a><br>
G: &#0; <a href="http:&#0;//www.foo.com">link</a><br>
H: &#0; <a href="http://&#0;www.foo.com">link</a><br>
I: &#0; <a href="http://www.foo.com&#0;">link</a><br>
J: &#0; <a href="http://www.foo.com&#0;/">link</a><br>
K: &#0; <a href&#0;="http://www.foo.com">link</a><br>
L: &#0; <a href=&#0;"http://www.foo.com">link</a><br>
M: &#0; <&#0;a href="http://www.foo.com">link</a><br>


Netscape 4.79
-------------

  Prints the "&#0;" literally in text; refuses to understand it.

  A: Totally broken
  B-C: Not clickable
  D-G: Clickable but won't work
  H: Clickable. Messes up internal caching fiercely. Displays a 
     "using cached page instead" dialog and then displays ... 
     something; I haven't figured out what exactly yet. It displays
     a site I previously tried with "&#0;" somewhere but I'm not sure
     which variation.
  I-J: Clickable but attempts to resolve ".com&#0;" - can't work
  K: Not clickable
  L: Clickable but won't work

Opera 6.05
----------
  
  Prints hollow squares in place of the "&#0;"s

  A: Totally broken
  B-C: Not clickable
  D-E: Clickable, but "Address type unknown or unsupported"
  F-G: Clickable but won't work
  H: Attempts to resolve but won't work. Perhaps would work with a
     tweaked DNS entry / wildcard? Unknown.
  I-J: Clickable but won't/can't resolve
  K: Not clickable
  L: Clickable but won't work

IE 5.00
-------

  Prints the "&#0;" literally in text; refuses to understand it.

  A: Totally broken
  B-C: Not clickable
  D-J: Clickable but won't work. IE errors are unhelpful^Wfriendly
  K: Not clickable
  L: Clickable but won't work


CONCLUSIONS
-----------

  I don't see an immediate problem, but there are a few things that
  may be worth checking / investigating at some point:

  - Are URLs properly HTML decoded (HTML entities converted) before checks?
  - What happens if SA decodes a "&#0;" ? Does it become a NUL?
    If so, do any searches terminate prematurely? In body text? URLs?
  - The "H" case might be worth investigating further. Perhaps with a 
    DNS protocol sniffer. Some rainy day :)
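
The "&#0;" question can at least be tried against one decoder. The sketch
below uses Python's html.unescape purely as an example of what a decoder
might do; it says nothing about what SA or HTML::Entities produce, and
the decode_and_check helper is a hypothetical name.

```python
import html


def decode_and_check(raw: str):
    """Decode character references and report whether a NUL byte appears."""
    decoded = html.unescape(raw)
    return decoded, "\x00" in decoded


decoded, has_nul = decode_and_check("http://www.foo.com&#0;/")
```

With this particular decoder, "&#0;" is mapped to U+FFFD (the replacement
character) rather than to a NUL, so has_nul is False. Other decoders may
differ, which is why the question is still worth checking for SA itself.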

Comment 10 Mikael Olsson 2004-01-24 12:57:53 UTC
This behavior is picking up somewhat. Nowhere near alarming
yet, but definitely on the increase.

2003-11-14 -- 2003-12-15: 15 hits out of 4497 (0.3%)
2003-12-15 -- 2004-01-06: 32 hits out of 3667 (0.9%)
2004-01-06 -- 2004-01-22: 33 hits out of 3001 (1.1%)

This is just from my own address, though. (Yeah, 190 spams/day now. Yum.)
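
The percentages above are plain hits divided by total, which can be
recomputed with a few lines of Python. The variable and function names
are only for illustration; the counts are copied from the three periods
listed in this comment.

```python
# (period, hits, total messages) as listed above.
PERIODS = [
    ("2003-11-14 -- 2003-12-15", 15, 4497),
    ("2003-12-15 -- 2004-01-06", 32, 3667),
    ("2004-01-06 -- 2004-01-22", 33, 3001),
]


def hit_rate_percent(hits: int, total: int) -> float:
    """Hit rate as a percentage of all messages in the period."""
    return 100.0 * hits / total


for period, hits, total in PERIODS:
    print(f"{period}: {hit_rate_percent(hits, total):.1f}%")
```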
Comment 11 Mikael Olsson 2004-07-18 03:45:18 UTC
2004-02-10 to 03-26: 140 hits out of 9530 (1.5%)
2004-03-26 to 04-28:  22 hits out of 6587 (0.3%)
2004-04-28 to 05-28:   0 hits out of 9000 (0%)
2004-05-28 to 06-28:  10 hits out of 9846 (0.1%)
2004-06-28 to 07-18:   0 hits out of 6158 (0%)

Fad?
Comment 12 Justin Mason 2004-07-18 12:48:19 UTC
re: 'fad?' -- it sounds a lot like one spammer, who's now moved on to other
techniques (presumably because this one isn't helping much.)
Comment 13 Daniel Quinlan 2004-08-27 17:00:24 UTC
moving accuracy and some bugs to 3.1.0 milestone
Comment 14 Daniel Quinlan 2004-08-27 17:19:53 UTC
more accuracy and performance bugs going to 3.1.0 milestone
Comment 15 Justin Mason 2005-01-14 19:59:20 UTC
NEEDSMC
Comment 16 Auto-Mass-Checker 2005-02-04 15:45:59 UTC
# [automatically generated by automc: start]
# DONEMC 15: completed request from comment 15

# OVERALL%  SPAM%    HAM%      S/O   RANK   SCORE  NAME
  0.197   0.0936   0.5988    0.135   0.23    1.00  __EXCESSIVE_HTML_ENTITIES_3x_b2403_c0
  0.071   0.0897   0.0000    1.000   0.57    1.00  __EXCESSIVE_HTML_ENTITIES_7x_b2403_c0
  0.068   0.0860   0.0000    1.000   0.56    1.00  __EXCESSIVE_HTML_ENTITIES_8x_b2403_c0
  0.154   0.1933   0.0010    0.995   0.63    1.00  __EXCESSIVE_HTML_ENTITIES_10x_b2403_c0
  0.157   0.1975   0.0000    1.000   0.63    1.00  __EXCESSIVE_HTML_ENTITIES_11x_b2403_c0
  0.076   0.0960   0.0000    1.000   0.58    0.01  T_MC_EXCESSIVE_HTML_ENTITIES_b2403_c0

above freqs using data from "/home/automc/corpus/html/DETAILS.new" as of Fri Feb  4 15:45:56 2005:

__EXCESSIVE_HTML_ENTITIES_3x_b2403_c0 = __EXCESSIVE_HTML_ENTITIES_3x from bug 2403 comment 0
full freqs: http://bugzilla.spamassassin.org/ruleqa?rule=__EXCESSIVE_HTML_ENTITIES_3x_b2403_c0&date=20050204

__EXCESSIVE_HTML_ENTITIES_7x_b2403_c0 = __EXCESSIVE_HTML_ENTITIES_7x from bug 2403 comment 0
full freqs: http://bugzilla.spamassassin.org/ruleqa?rule=__EXCESSIVE_HTML_ENTITIES_7x_b2403_c0&date=20050204

__EXCESSIVE_HTML_ENTITIES_8x_b2403_c0 = __EXCESSIVE_HTML_ENTITIES_8x from bug 2403 comment 0
full freqs: http://bugzilla.spamassassin.org/ruleqa?rule=__EXCESSIVE_HTML_ENTITIES_8x_b2403_c0&date=20050204

__EXCESSIVE_HTML_ENTITIES_10x_b2403_c0 = __EXCESSIVE_HTML_ENTITIES_10x from bug 2403 comment 0
full freqs: http://bugzilla.spamassassin.org/ruleqa?rule=__EXCESSIVE_HTML_ENTITIES_10x_b2403_c0&date=20050204

__EXCESSIVE_HTML_ENTITIES_11x_b2403_c0 = __EXCESSIVE_HTML_ENTITIES_11x from bug 2403 comment 0
full freqs: http://bugzilla.spamassassin.org/ruleqa?rule=__EXCESSIVE_HTML_ENTITIES_11x_b2403_c0&date=20050204

T_MC_EXCESSIVE_HTML_ENTITIES_b2403_c0 = EXCESSIVE_HTML_ENTITIES from bug 2403 comment 0
full freqs: http://bugzilla.spamassassin.org/ruleqa?rule=T_MC_EXCESSIVE_HTML_ENTITIES_b2403_c0&date=20050204
# ham results used: ham-bzoetekouw.log ham-daf.log ham-jm.log ham-parkerm.log ham-quinlan.log ham-rODbegbie.log ham-theo.log
# spam results used: spam-bzoetekouw.log spam-daf.log spam-jm.log spam-parkerm.log spam-quinlan.log spam-rODbegbie.log spam-theo.log
 479311   381285    98026    0.795   0.00    0.00  (all messages)
100.000  79.5486  20.4514    0.795   0.00    0.00  (all messages as %)
# [automatically generated by automc: end]
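
The first four columns of the freqs table can be cross-checked by hand.
The sketch below assumes they are overall %, spam %, ham % and S/O, with
S/O computed as spam % / (spam % + ham %), and that overall % is the
message-count-weighted mix of the spam and ham percentages. That reading
is an assumption, although it does reproduce the printed numbers; the
function names are hypothetical.

```python
N_SPAM = 381285  # spam messages, from the "(all messages)" line
N_HAM = 98026    # ham messages, from the "(all messages)" line


def spam_over_overall(spam_pct: float, ham_pct: float) -> float:
    """S/O: the share of a rule's hit rate that comes from spam."""
    return spam_pct / (spam_pct + ham_pct)


def overall_pct(spam_pct: float, ham_pct: float) -> float:
    """Overall hit rate, weighting each class by its message count."""
    return (spam_pct * N_SPAM + ham_pct * N_HAM) / (N_SPAM + N_HAM)


# __EXCESSIVE_HTML_ENTITIES_3x: spam 0.0936%, ham 0.5988%
print(round(spam_over_overall(0.0936, 0.5988), 3))
print(round(overall_pct(0.0936, 0.5988), 3))
```

On this reading, the 3x sub-rule (space, !, ", #, $) hits ham far more
often than spam, while the other four groups hit almost no ham.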
Comment 17 Justin Mason 2005-02-06 23:40:12 UTC
freqs from "DETAILS.age" (set 0, by message age):

  0.013   0.0144   0.0000    1.000   0.43    0.01  T_MC_EXCESSIVE_HTML_ENTITIES_b2403_c0:0-1
  0.016   0.0189   0.0000    1.000   0.46    0.01  T_MC_EXCESSIVE_HTML_ENTITIES_b2403_c0:1-3
  0.071   0.0854   0.0000    1.000   0.45    0.01  T_MC_EXCESSIVE_HTML_ENTITIES_b2403_c0:3-6


sorry, I think we have to close this -- rawbody tests are slow, the hit-rate's
not great, and the hit-rates are declining (0.0144% of spam in the last month).