Bug 3139 - RFE: ignore invisible text during rendering
Summary: RFE: ignore invisible text during rendering
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules
Version: unspecified
Hardware: Other other
Importance: P5 enhancement
Target Milestone: 3.2.0
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Duplicates: 4389
Depends on:
Blocks: 3661
Reported: 2004-03-08 04:03 UTC by Loren Wilton
Modified: 2006-04-21 11:21 UTC
CC List: 3 users



Attachment Type Modified Status Actions Submitter/CLA Status
spample using DISPLAY: none text/plain None Justin Mason [HasCLA]
another sample text/plain None Theo Van Dinter [HasCLA]

Description Loren Wilton 2004-03-08 04:03:58 UTC
A fairly typical obfuscation these days is to put letters, symbols, and small 
groups in the middle of a word, using either a 0 or 1 point/pixel font.  The 
rendered result on the screen is the Evil Word with at most a small blip in 
it.  The result of rendering the html down to text, ignoring fonts, is the 
obfuscated Evil Word.

Now granted, in most cases we can catch these either because of the small font, 
or by detecting a match on the obfuscated word anyway.  However, an OPTIONAL 
setting on some test object (let us say for argument, 'body') that would delete 
the stuff in the tiny font would end up rendering the Evil Word itself to 
text, where it can be easily detected, probably without even time-consuming 
obfuscation checks.

Clearly having such an option, and using it, would mean potentially rendering 
the html twice, which is extra overhead.  Thus, this form of the object 
probably shouldn't be rendered unless specifically asked for.  It probably 
would require a very minor extra amount of smarts in the html renderer (to 
recognize small fonts as such).

Such a rendering method might also be interesting as the feeder to Bayes.  
Presumably the results would be less-obfuscated words than the current stuff, 
and might (or might not) result in better hit rates.
Comment 1 Justin Mason 2004-03-16 21:39:59 UTC
yes.  I'm thinking we should consider these "invisible text" segments.

body: should remove these
rawbody: keep them
full: keep them

we could do with a patch for this btw ;)
Comment 2 Daniel Quinlan 2004-08-27 17:19:59 UTC
more accuracy and performance bugs going to 3.1.0 milestone
Comment 3 Justin Mason 2005-03-11 15:14:55 UTC
btw, this needs a sample message.  I can't find one in recent spam ;)
Comment 4 Theo Van Dinter 2005-05-02 12:59:52 UTC
fyi, bug 3661 has discussions about this topic as well.  I'm sort of merging them together here.
Comment 5 Justin Mason 2005-05-09 00:43:58 UTC
do we need to fix this for 3.1.0?  not seeing much missed spam due to this
Comment 6 Loren Wilton 2005-05-09 01:20:58 UTC
Subject: Re:  RFE: ignore invisible text during rendering

Probably not required, I'd mark it 3.2.0 I think.

Comment 7 Theo Van Dinter 2005-05-09 09:54:08 UTC
Subject: Re:  RFE: ignore invisible text during rendering

On Mon, May 09, 2005 at 12:43:58AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> do we need to fix this for 3.1.0?  not seeing much missed spam due to this

I don't think so.  We need to spend more time figuring out how to break up the
rules into visible vs invisible, etc.  3.2 is fine IMO.

Comment 8 Justin Mason 2005-05-19 00:18:29 UTC
moving out -- I don't think this is causing problems
Comment 9 Theo Van Dinter 2005-06-01 14:59:00 UTC
(In reply to comment #8)
> moving out -- I don't think this is causing problems

FYI, I just got a spam that exploits this thoroughly:

<DIV><FONT face=Arial size=4>V<SPAN style="DISPLAY: none">addressed  to a grande of Spain, 
heavily sealed with
the arms of</SPAN>lAGRA  VA<SPAN style="DISPLAY: none">morning.</SPAN>LlUM ClA<SPAN  
style="DISPLAY: none">the
bar with all canvas furled save only  their spiltsails, which,</SPAN>LlS <SPAN style="DISPLAY: 
none">Then
with a sudden audible catch in his breath, he stopped  short.</SPAN>LEVlTRA and many other.</
FONT></DIV>


So with the current rendered text, we see:

Vaddressed to a grande of Spain, heavily sealed with the arms oflAGRA VAmorning.LlUM ClAthe bar 
with all canvas furled save only their spiltsails, which,LlS Then with a sudden audible catch in his 
breath, he stopped short.LEVlTRA and many other.


If we just looked at visible, we get:

VlAGRA VALlUM ClALlS LEVlTRA and many other.
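
The difference between the two renderings can be sketched in a few lines. This is only an illustration in Python using the standard library's html.parser, not SpamAssassin's actual HTML renderer (which is written in Perl); the names VisibleTextExtractor and visible_text are made up for this example, and it only handles the trivial inline "display: none" case seen in this spam.

```python
import re
from html.parser import HTMLParser

# Elements without a closing tag; they are never pushed on the open-element stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link", "wbr"}

# Assumption: only an inline style of "display: none" counts as hidden here.
HIDDEN_STYLE = re.compile(r"display\s*:\s*none", re.IGNORECASE)


class VisibleTextExtractor(HTMLParser):
    """Collects text, skipping anything nested inside a display:none element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []         # (tag, hides_content) for each open element
        self.hidden_depth = 0   # number of open elements that hide their content
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        hides = bool(HIDDEN_STYLE.search(style))
        self.stack.append((tag, hides))
        if hides:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, tolerating sloppy markup.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                for _, hides in self.stack[i:]:
                    if hides:
                        self.hidden_depth -= 1
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.hidden_depth == 0:
            self.parts.append(data)


def visible_text(html):
    """Return the text a reader would see, with whitespace collapsed."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


print(visible_text('V<span style="display:none">junk</span>IAGRA'))  # prints VIAGRA
```

Tiny fonts, colour tricks, stylesheet rules and script/style element content are all ignored by this sketch; a real renderer would need to handle them as well.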
Comment 10 Justin Mason 2005-06-01 15:15:09 UTC
Created attachment 2916 [details]
spample using DISPLAY: none

yeah, I've got a fair few of those too.
btw, they're getting nailed by other rules -- this one has a score of 22.7.
Comment 11 Theo Van Dinter 2005-06-04 20:10:14 UTC
*** Bug 4389 has been marked as a duplicate of this bug. ***
Comment 12 Martin Blapp 2005-06-05 08:23:42 UTC
After seeing more and more hiding techniques, I think we should also
think about skipping the 'invisible' text in the bayes filter.
Often the invisible parts are just 'bayes poisoning' and have no
other purpose:

<font face="verdana" size="3">
via<span style="display:none;">allotting greenware rivulet dreamy blend tungsten
nettlesome todd bogeymen chap facet age geisha iconoclast cassandra sampson
adjudge taxpaying arcana disparage shank leafy crude zounds abraham oshkosh
manitoba discuss aventine steven tyranny assimilable transposable bstj help
opine kaolinite tenant bask transshipped  92581</span>gra and vi<span
style="display:none;">wept banks automotive blame agamemnon chungking create
colossus extinct shown marquis bernard babel awhile hamilton bengali incurred
burlesque suction attendee tempera bacteria bowel retinue lyon bimini  fief
amphibology inclement free wingman archibald beauty cb ike cover depressant
</span>codin are very
cheap today.

So I wonder if this is handled by spamassassin 3.1 already? We should
strip away 'span display none' parts, strip away invisible font
and color parts, and only feed the remaining 'visible' part to the bayes
filter. This way we would 'again' have better bayes results for HTML mails.

What do you think about this idea?

Martin
Comment 13 Loren Wilton 2005-06-05 08:56:15 UTC
Subject: Re:  RFE: ignore invisible text during rendering

In the past this 'bayes poisoning' has actually proven to be a very good
indicator of spam, in general.  The sort of thing most spammers put in their
mail as poisoning simply doesn't match the words commonly in use by the
recipients, and thus nicely classifies the mail as spam.

There has been moderate discussion of eliminating invisible text in various
contexts over the past few months; the idea isn't new and there is an open
bug (maybe this one) on the concept.  Part of the question, once one gets
past the 'should we do it at all?' question, becomes 'where should we do it,
and under what circumstances?'

There are several possibilities:

a) eliminate invisible or near-invisible text entirely. *
b) eliminate it in bayes
c) eliminate it as a rule source
d) make visible and invisible rules, and have the invisible text only
available to special rules
e) make visible rules and full rules, with the invisible text left in
position in the full rules**

* Deciding what is invisible isn't as simple as 0pt or display:none.  Below
a certain point size, or almost any size with the right combination of
foreground and background colors (not necessarily the same) can be
invisible.  This is a human physiology question to an extent.  Spammers can
decide the colors, font faces and sizes by experimentation.  Determining
algorithmically what the spammers have achieved by experimentation may be
nontrivial.

(Of course, one can always start with the trivial cases, since they will
handle most of today's spam.)
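
One possible starting point for the non-trivial cases is a contrast check between foreground and background colors, combined with a minimum point size. The sketch below is in Python and borrows the relative-luminance and contrast-ratio formulas from the WCAG accessibility guidelines, which is a substitution made for this example and not something proposed in this bug. The thresholds MIN_POINT_SIZE and MIN_CONTRAST are made-up values that would need tuning against real spam.

```python
# Hypothetical thresholds, chosen only for illustration.
MIN_POINT_SIZE = 2.0   # anything smaller is treated as unreadable
MIN_CONTRAST = 1.2     # WCAG contrast ratio ranges from 1.0 (identical) to 21.0


def _linear(channel_8bit):
    """Convert one 8-bit sRGB channel to linear light (WCAG formula)."""
    c = channel_8bit / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    r, g, b = (_linear(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg, bg):
    """WCAG contrast ratio between two (r, g, b) tuples."""
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


def parse_hex_color(value):
    """Parse '#rgb' or '#rrggbb' into an (r, g, b) tuple."""
    value = value.strip().lstrip("#")
    if len(value) == 3:
        value = "".join(ch * 2 for ch in value)
    if len(value) != 6:
        raise ValueError("unsupported color: %r" % value)
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


def looks_invisible(fg_hex, bg_hex, point_size):
    """True if the text is too small or too close in color to its background."""
    if point_size < MIN_POINT_SIZE:
        return True
    ratio = contrast_ratio(parse_hex_color(fg_hex), parse_hex_color(bg_hex))
    return ratio < MIN_CONTRAST
```

This catches near-matching colors such as #fefefe on #ffffff, not only exact matches, but it says nothing about font faces or about how size and contrast interact.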

** "full rules" in this context is not referring to the current full rules,
it is more referring to the current body rules that show rendered text, but
with the visible and near-visible or invisible text still left in position.


As a rule writer rather than an SA implementor, I personally favor the
following, at least with current spam techniques:

1)    leave the invisible text in bayes.  I think someone did a test that
indicated that it was a better spam sign if left as currently rendered than
if stripped out into separate tokens from the visible text.  However, this
experiment might be revisited to determine the best hit ratio.

2)    Make the following rule base types.  (This covers some other
complaints I have as a rule writer also) (Some of these already exist):

    a    pristine email message
    b    headers
    c    mime headers
    d    decoded body sections (un-base64, etc)
    e    rendered body sections, keeping invisible text
    f     rendered body sections, deleting invisible text
    g    anchor rules.  These have two parts - the uri and the anchor
    h    uri rules

Obviously types a, b, e, and h already exist in usable form.  Type d exists,
but is largely unusable as it breaks the text by line rather than as a
section.  (For that matter, type e has problems in some cases, as it breaks
by paragraph and other random places.)

So the new types would be a mime header type, the visible-rendered type, and
the anchor-rule type.

Once the rule base types existed, there would be a formidable effort of
taking the existing body rules and determining which ones should be
rendered-body rules and which ones should be rendered-visible rules to get
the best results.  This is a sub-project that SARE members would probably
happily take on to relieve the devs of having to do all of the work.  For
that matter, if rawbody was fixed to return body parts rather than lines, a
lot of existing rules that currently exist as both rawbody/full or full/uri
or the like could be reduced to simple rawbody rules with improved hit
rates.

Comment 14 Martin Blapp 2005-06-05 10:03:53 UTC
How about learning the 'invisible' parts to a second bayes ? I think this would
be highly effective !

Martin
Comment 15 Theo Van Dinter 2005-06-05 10:16:48 UTC
Subject: Re:  RFE: ignore invisible text during rendering

On Sun, Jun 05, 2005 at 10:03:53AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> How about learning the 'invisible' parts to a second bayes ? I think this would
> be highly effective !

We've been treating visible and invisible text separately in bayes for
a while now.
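
The thread does not show how that separation is implemented. One way it could work, sketched here in Python purely as an assumption, is to keep a single token database but tag every token that came from invisible text with a prefix, so that "tungsten" seen on screen and "tungsten" hidden in a display:none span are learned as different tokens. INVISIBLE_PREFIX and tokenize are hypothetical names for this example.

```python
import re

# Hypothetical marker; the actual prefix used by any real tokenizer may differ.
INVISIBLE_PREFIX = "invis:"

WORD_RE = re.compile(r"[A-Za-z0-9']+")


def tokenize(visible_text, invisible_text):
    """Return tokens from both texts, tagging the ones the reader never sees."""
    tokens = [w.lower() for w in WORD_RE.findall(visible_text)]
    tokens += [INVISIBLE_PREFIX + w.lower() for w in WORD_RE.findall(invisible_text)]
    return tokens
```

With this scheme, hidden filler words cannot dilute the scores of the visible words, yet they still contribute their own evidence.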

Comment 16 Theo Van Dinter 2005-06-29 07:54:14 UTC
Created attachment 2966 [details]
another sample
Comment 17 Loren Wilton 2005-06-29 20:03:08 UTC
Subject: Re:  RFE: ignore invisible text during rendering

That's also an excellent example of a test for one of my other rule
suggestions -

It has a text and an html part, and there are lots of uris in the html part
and none in the text part.

I'd love to see test results on a rule that could check for that, but I
suspect it would have to be an eval rule, or maybe a plugin.  And
unfortunately I don't know perl well enough to write one myself.

        Loren
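
The check described above (a message with both a text and an html part, with URIs only in the html part) could be prototyped outside SpamAssassin before writing an eval rule or plugin. The following Python sketch uses the standard email package; the function names, the simple http/https regular expression, and the min_html_uris threshold are all assumptions made for this example.

```python
import re
from email import message_from_string

# Deliberately simple: only http/https URIs, terminated by whitespace or quotes.
URI_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def uri_counts_by_part_type(raw_message):
    """Count URIs separately in the text/plain and text/html parts."""
    msg = message_from_string(raw_message)
    counts = {}
    for part in msg.walk():
        ctype = part.get_content_type()
        if ctype not in ("text/plain", "text/html"):
            continue
        payload = part.get_payload(decode=True) or b""
        try:
            text = payload.decode(part.get_content_charset() or "utf-8",
                                  errors="replace")
        except LookupError:
            # Unknown charset label; fall back to UTF-8.
            text = payload.decode("utf-8", errors="replace")
        counts[ctype] = counts.get(ctype, 0) + len(URI_RE.findall(text))
    return counts


def html_only_uris(raw_message, min_html_uris=3):
    """True if both parts exist, the text part has no URIs, and the html part has many."""
    counts = uri_counts_by_part_type(raw_message)
    if "text/plain" not in counts or "text/html" not in counts:
        return False
    return counts["text/plain"] == 0 and counts["text/html"] >= min_html_uris
```

Running this over a ham and spam corpus would give a rough idea of the hit rates before anyone invests in a Perl implementation.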

Comment 18 Jens Yllman 2005-12-13 08:31:36 UTC
I just want to add to this that the current HTML_TINY_FONT rule catches a few
too many font sizes. But I guess that it is not possible to do it 'correctly' until
the parser really looks at the HTML as a whole. I have some e-mail notifications
that use font-size: 1.2em. And that is caught by the current rule. But 1.2em
could be invisible depending on other style settings.
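
To make the distinction concrete: a size given in absolute units can be judged on its own, while a relative size such as 1.2em cannot be judged without knowing the inherited size. The Python sketch below illustrates that split; it is not how HTML_TINY_FONT is implemented, and the function name, the verdict strings, and the 2pt cutoff are assumptions for this example.

```python
import re

FONT_SIZE_RE = re.compile(
    r"font-size\s*:\s*([0-9]*\.?[0-9]+)\s*(pt|px|em|%)?", re.IGNORECASE
)


def tiny_font_verdict(style, tiny_pt=2.0):
    """Classify an inline style as 'tiny', 'ok', 'relative', or 'unknown'."""
    match = FONT_SIZE_RE.search(style)
    if not match:
        return "unknown"
    value = float(match.group(1))
    unit = (match.group(2) or "").lower()
    if value == 0:
        return "tiny"            # zero is invisible in any unit
    if unit == "pt":
        return "tiny" if value < tiny_pt else "ok"
    if unit == "px":
        # CSS defines 1px as 0.75pt (96px and 72pt per inch).
        return "tiny" if value * 0.75 < tiny_pt else "ok"
    if unit in ("em", "%"):
        return "relative"        # depends on the inherited font size
    return "unknown"
```

A rule built on this would fire only on the "tiny" verdict and leave "relative" sizes for a renderer that tracks inherited styles.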
Comment 19 Justin Mason 2006-04-21 18:21:38 UTC
can we close this?  if there are remaining issues, we should open new bugs, this
one's a mess. ;)