SA Bugzilla – Bug 3139
RFE: ignore invisible text during rendering
Last modified: 2006-04-21 11:21:38 UTC
A fairly typical obfuscation these days is to put letters, symbols, and small groups of characters in the middle of a word, using either a 0 or 1 point/pixel font. The rendered result on the screen is the Evil Word with at most a small blip in it. The result of rendering the HTML down to text, ignoring fonts, is the obfuscated Evil Word.

Now granted, in most cases we can catch these, either because of the small font or by detecting a match on the obfuscated word anyway. However, an optional setting on some test object (let us say, for argument, 'body') that would delete the stuff in the tiny font would end up rendering the Evil Word itself to text, where it can be easily detected, probably without even time-consuming obfuscation checks.

Clearly, having such an option, and using it, would mean potentially rendering the HTML twice, which is extra overhead. Thus, this form of the object probably shouldn't be rendered unless specifically asked for. It would probably require a very minor extra amount of smarts in the HTML renderer (to recognize small fonts as such).

Such a rendering method might also be interesting as the feeder to Bayes. Presumably the results would be less-obfuscated words than the current stuff, and might (or might not) result in better hit rates.
yes. I'm thinking we should consider these "invisible text" segments.

body: should remove these
rawbody: keep them
full: keep them

we could do with a patch for this btw ;)
more accuracy and performance bugs going to 3.1.0 milestone
btw, this needs a sample message. I can't find one in recent spam ;)
fyi, bug 3661 has discussions about this topic as well. I'm sort of merging them together here.
do we need to fix this for 3.1.0? not seeing much missed spam due to this
Subject: Re: RFE: ignore invisible text during rendering

Probably not required, I'd mark in 320 I think.
Subject: Re: RFE: ignore invisible text during rendering

On Mon, May 09, 2005 at 12:43:58AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> do we need to fix this for 3.1.0? not seeing much missed spam due to this

I don't think so. We need to spend more time figuring out how to break up the rules into visible vs invisible, etc. 3.2 is fine IMO.
moving out -- I don't think this is causing problems
(In reply to comment #8)
> moving out -- I don't think this is causing problems

FYI, I just got a spam that exploits this thoroughly:

<DIV><FONT face=Arial size=4>V<SPAN style="DISPLAY: none">addressed to a grande of Spain, heavily sealed with the arms of</SPAN>lAGRA VA<SPAN style="DISPLAY: none">morning.</SPAN>LlUM ClA<SPAN style="DISPLAY: none">the bar with all canvas furled save only their spiltsails, which,</SPAN>LlS <SPAN style="DISPLAY: none">Then with a sudden audible catch in his breath, he stopped short.</SPAN>LEVlTRA and many other.</ FONT></DIV>

So with the current rendered text, we see:

Vaddressed to a grande of Spain, heavily sealed with the arms oflAGRA VAmorning.LlUM ClAthe bar with all canvas furled save only their spiltsails, which,LlS Then with a sudden audible catch in his breath, he stopped short.LEVlTRA and many other.

If we just looked at visible, we get:

VlAGRA VALlUM ClALlS LEVlTRA and many other.
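The gap between the two renderings in the comment above can be reproduced with a small sketch. It is Python rather than SpamAssassin's Perl, uses only the standard-library `html.parser`, and is not a description of how SpamAssassin's renderer works. The names `VisibleTextExtractor`, `split_visible`, and the shortened `SAMPLE` string (with the closing FONT tag normalized) are made up for illustration, and only the trivial inline `display: none` case is handled.

```python
import re
from html.parser import HTMLParser

# Assumption: only an inline style containing "display: none" counts as hidden.
HIDDEN_RE = re.compile(r"display\s*:\s*none", re.IGNORECASE)

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class VisibleTextExtractor(HTMLParser):
    """Hypothetical helper: splits text into visible and hidden pieces."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.visible = []
        self.invisible = []
        self._stack = []  # (tag, hidden) for every currently open element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        parent_hidden = bool(self._stack) and self._stack[-1][1]
        hidden = parent_hidden or bool(HIDDEN_RE.search(style))
        self._stack.append((tag, hidden))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        hidden = bool(self._stack) and self._stack[-1][1]
        (self.invisible if hidden else self.visible).append(data)


def split_visible(html):
    """Return (visible_text, invisible_text) for an HTML fragment."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.visible), "".join(parser.invisible)


# Shortened version of the spam sample quoted in the comment.
SAMPLE = (
    '<DIV><FONT face=Arial size=4>V<SPAN style="DISPLAY: none">addressed to '
    'a grande of Spain</SPAN>lAGRA VA<SPAN style="DISPLAY: none">morning.'
    '</SPAN>LlUM and many other.</FONT></DIV>'
)

if __name__ == "__main__":
    visible, invisible = split_visible(SAMPLE)
    print(visible)
    print(invisible)
```

Keeping the hidden text in a second list, rather than discarding it, leaves it available to rules or Bayes if that turns out to be the better signal.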
Created attachment 2916
spample using DISPLAY: none

yeah, I've got a fair few of those too. btw, they're getting nailed by other rules -- this one has a score of 22.7.
*** Bug 4389 has been marked as a duplicate of this bug. ***
After seeing more and more hiding techniques, I think we should also think about skipping the 'invisible' text in the Bayes filter. Often the invisible parts are just 'Bayes poisoning' and serve no other purpose:

<font face="verdana" size="3"> via<span style="display:none;">allotting greenware rivulet dreamy blend tungsten nettlesome todd bogeymen chap facet age geisha iconoclast cassandra sampson adjudge taxpaying arcana disparage shank leafy crude zounds abraham oshkosh manitoba discuss aventine steven tyranny assimilable transposable bstj help opine kaolinite tenant bask transshipped 92581</span>gra and vi<span style="display:none;">wept banks automotive blame agamemnon chungking create colossus extinct shown marquis bernard babel awhile hamilton bengali incurred burlesque suction attendee tempera bacteria bowel retinue lyon bimini fief amphibology inclement free wingman archibald beauty cb ike cover depressant </span>codin are very cheap today.

So I wonder if this is handled by SpamAssassin 3.1 already? We should strip away 'span display none' parts, strip away invisible font and color parts, and only feed the remaining 'visible' part to the Bayes filter. This way we would 'again' have better Bayes results for HTML mails.

What do you think about this idea?

Martin
Subject: Re: RFE: ignore invisible text during rendering

In the past this 'bayes poisoning' has actually proven to be a very good indicator of spam, in general. The sort of thing most spammers put in their mail as poisoning simply doesn't match the words commonly in use by the recipients, and thus nicely classifies the mail as spam.

There has been moderate discussion of eliminating invisible text in various contexts over the past few months; the idea isn't new and there is an open bug (maybe this one) on the concept.

Part of the question, once one gets past the 'should we do it at all?' question, becomes 'where should we do it, and under what circumstances?' There are several possibilities:

a) eliminate invisible or near-invisible text entirely. *
b) eliminate it in bayes
c) eliminate it as a rule source
d) make visible and invisible rules, and have the invisible text only available to special rules
e) make visible rules and full rules, with the invisible text left in position in the full rules **

* Deciding what is invisible isn't as simple as 0pt or display:none. Text below a certain point size, or at almost any size with the right combination of foreground and background colors (not necessarily the same), can be invisible. This is a human physiology question to an extent. Spammers can decide the colors, font faces and sizes by experimentation. Determining algorithmically what the spammers have achieved by experimentation may be nontrivial. (Of course, one can always start with the trivial cases, since they will handle most of today's spam.)

** "full rules" in this context does not refer to the current full rules; it refers to the current body rules that show rendered text, but with the visible and near-visible or invisible text still left in position.

As a rule writer rather than an SA implementor, I personally favor the following, at least with current spam techniques:

1) Leave the invisible text in bayes. I think someone did a test that indicated that it was a better spam sign if left as currently rendered than if stripped out into separate tokens from the visible text. However, this experiment might be revisited to determine the best hit ratio.

2) Make the following rule base types. (This also covers some other complaints I have as a rule writer. Some of these already exist.)

   a) pristine email message
   b) headers
   c) mime headers
   d) decoded body sections (un-base64, etc)
   e) rendered body sections, keeping invisible text
   f) rendered body sections, deleting invisible text
   g) anchor rules. These have two parts - the uri and the anchor
   h) uri rules

Obviously types a, b, e, and h already exist in usable form. Type d exists, but is largely unusable as it breaks the text by line rather than as a section. (For that matter, type e has problems in some cases, as it breaks by paragraph and other random places.) So the new types would be a mime header type, the visible-rendered type, and the anchor-rule type.

Once the rule base types existed, there would be a formidable effort of taking the existing body rules and determining which ones should be rendered-body rules and which ones should be rendered-visible rules to get the best results. This is a sub-project that SARE members would probably happily take on to relieve the devs of having to do all of the work.

For that matter, if rawbody were fixed to return body parts rather than lines, a lot of existing rules that currently exist as both rawbody/full or full/uri or the like could be reduced to simple rawbody rules with improved hit rates.
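The point in this comment that invisibility depends on both point size and the foreground/background color combination can be made concrete with a rough sketch. It swaps in the WCAG 2.0 contrast-ratio formula as a stand-in for "can a human read this", which nobody in this bug proposed. The function names and both thresholds (`min_pt`, `min_contrast`) are arbitrary assumptions, not tuned values.

```python
def _linear(channel_8bit):
    """Convert one 8-bit sRGB channel to linear light (WCAG 2.0 formula)."""
    c = channel_8bit / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color):
    """Relative luminance of a '#rrggbb' color, from 0.0 (black) to 1.0 (white)."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)


def contrast_ratio(fg, bg):
    """WCAG contrast ratio: 1.0 for identical colors, 21.0 for black on white."""
    l1, l2 = relative_luminance(fg), relative_luminance(bg)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


def looks_invisible(font_pt, fg, bg, min_pt=2.0, min_contrast=1.2):
    """Hypothetical heuristic: tiny text OR nearly matching colors.

    min_pt and min_contrast are illustrative guesses, not measured values.
    """
    return font_pt < min_pt or contrast_ratio(fg, bg) < min_contrast


if __name__ == "__main__":
    print(looks_invisible(1, "#000000", "#ffffff"))   # tiny font
    print(looks_invisible(12, "#fefefe", "#ffffff"))  # near-white on white
    print(looks_invisible(12, "#000000", "#ffffff"))  # ordinary text
```

A check like this covers the "not necessarily the same colors" case that an equality test on foreground and background would miss, but it still ignores font face, images behind text, and anything set outside inline attributes.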
How about learning the 'invisible' parts to a second bayes? I think this would be highly effective!

Martin
Subject: Re: RFE: ignore invisible text during rendering

On Sun, Jun 05, 2005 at 10:03:53AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> How about learning the 'invisible' parts to a second bayes ? I think this would
> be highly effective !

We've been treating visible and invisible text separately in bayes for a while now.
Created attachment 2966
another sample
Subject: Re: RFE: ignore invisible text during rendering

That's also an excellent example of a test for one of my other rule suggestions - it has a text and an html part, and there are lots of uris in the html part and none in the text part. I'd love to see test results on a rule that could check for that, but I suspect it would have to be an eval rule, or maybe a plugin. And unfortunately I don't know Perl well enough to write one myself.

Loren
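The check described in this comment (URIs in the HTML part, none in the text part) can be prototyped outside SpamAssassin to see how it behaves on a corpus. The following is a minimal sketch in Python using the standard-library `email` package; it is not an eval rule or plugin. The names `uri_counts` and `html_only_uris`, the crude `URI_RE` pattern, the `min_html_uris` threshold, and the `RAW` test message are all assumptions made for illustration.

```python
import re
from email import message_from_string

# Crude URI matcher, good enough for a prototype; an assumption, not SA's parser.
URI_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def uri_counts(raw_message):
    """Count URIs per content type for the text/plain and text/html parts."""
    msg = message_from_string(raw_message)
    counts = {}
    for part in msg.walk():
        ctype = part.get_content_type()
        if ctype not in ("text/plain", "text/html"):
            continue
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "utf-8"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:  # unknown charset name in the message
            text = payload.decode("utf-8", errors="replace")
        counts[ctype] = counts.get(ctype, 0) + len(URI_RE.findall(text))
    return counts


def html_only_uris(raw_message, min_html_uris=3):
    """True if a text part exists with no URIs while the HTML part has several.

    min_html_uris is an arbitrary illustrative threshold.
    """
    counts = uri_counts(raw_message)
    return ("text/plain" in counts
            and counts["text/plain"] == 0
            and counts.get("text/html", 0) >= min_html_uris)


# Made-up two-part message used only to exercise the functions.
RAW = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="b"\n'
    "\n"
    "--b\n"
    "Content-Type: text/plain\n"
    "\n"
    "hello there\n"
    "--b\n"
    "Content-Type: text/html\n"
    "\n"
    '<a href="http://a.example/1">x</a> <a href="http://b.example/2">y</a> '
    '<a href="http://c.example/3">z</a>\n'
    "--b--\n"
)

if __name__ == "__main__":
    print(uri_counts(RAW))
    print(html_only_uris(RAW))
```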
I just want to add to this that the current HTML_TINY_FONT rule catches a little too many font sizes. But I guess it is not possible to do it correctly until the parser really looks at the HTML as a whole. I have some e-mail notifications that use font-size: 1.2em, and that is caught by the current rule. But 1.2em could be invisible depending on other style settings.
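Why 1.2em can be either perfectly readable or invisible comes down to em being relative to the enclosing element's font size. A minimal sketch of resolving sizes down a chain of nested elements follows. It does not describe how HTML_TINY_FONT is implemented; the function names, the 12pt starting size, and the 2pt cutoff are assumptions, while the px conversion uses the CSS definition of 1px as 0.75pt.

```python
import re

# Matches values such as "1.2em", "10pt", "16px", "80%".
SIZE_RE = re.compile(r"^\s*([0-9]*\.?[0-9]+)\s*(pt|px|em|%)\s*$", re.IGNORECASE)


def resolve_font_size(value, parent_pt):
    """Resolve one CSS font-size value to points, given the parent's size."""
    match = SIZE_RE.match(value)
    if not match:
        return parent_pt  # unknown syntax: inherit the parent size
    number, unit = float(match.group(1)), match.group(2).lower()
    if unit == "pt":
        return number
    if unit == "px":
        return number * 0.75  # CSS: 1px = 1/96in, 1pt = 1/72in
    if unit == "em":
        return number * parent_pt
    return number / 100.0 * parent_pt  # percentage of the parent size


def effective_size(chain, root_pt=12.0):
    """Apply font-size values from outermost to innermost element.

    root_pt=12.0 is an assumed default starting size.
    """
    size = root_pt
    for value in chain:
        size = resolve_font_size(value, size)
    return size


def is_tiny(chain, min_pt=2.0):
    """Hypothetical cutoff: text under min_pt points counts as tiny."""
    return effective_size(chain) < min_pt


if __name__ == "__main__":
    print(is_tiny(["1.2em"]))         # 1.2em of the default size
    print(is_tiny(["1pt", "1.2em"]))  # 1.2em inside a 1pt element
```

Judging each font-size value in isolation cannot tell these two cases apart, which is the reason the whole nesting has to be tracked.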
can we close this? if there are remaining issues, we should open new bugs, this one's a mess. ;)