Bug 3173 - Bayes can be circumvented by faking invisible or near-invisible text.
Summary: Bayes can be circumvented by faking invisible or near-invisible text.
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Learner
Version: SVN Trunk (Latest Devel Version)
Hardware: Other
OS: Other
Importance: P5 normal
Target Milestone: 3.0.0
Assignee: SpamAssassin Developer Mailing List
Blocks: 2892 2282 2423
Reported: 2004-03-14 12:44 UTC by David Koppelman
Modified: 2004-03-23 12:48 UTC



Attachment: Message faking invisibility. (text/plain), submitted by David Koppelman [HasCLA]

Description David Koppelman 2004-03-14 12:44:03 UTC
The fix associated with Bug 2129 would allow spammers to circumvent
Bayesian classification. (There is also discussion on this in Bug 2282.)

With the following change,
  r9460 | jm | 2004-03-13 23:49:08 -0600 (Sat, 13 Mar 2004) | 1 line

a message could be crafted so that it would be considered not
"visible_for_bayes" but still visible enough when using a typical
email client.  That might be done by using a seemingly low contrast
color pair or by markup that tricks the code into thinking the text is
invisible (for example, using complex styles).

Such messages would get around one of the more robust spam detection
techniques used in SpamAssassin.  
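
A minimal sketch, in Python rather than SpamAssassin's Perl, of the
kind of contrast check such a heuristic implies; the function names,
colors, and threshold here are illustrative, not the r9460 code:

    def contrast(fg, bg):
        """Squared RGB distance between two '#rrggbb' colors."""
        f = [int(fg[i:i + 2], 16) for i in (1, 3, 5)]
        b = [int(bg[i:i + 2], 16) for i in (1, 3, 5)]
        return sum((x - y) ** 2 for x, y in zip(f, b))

    def visible_for_bayes(fg, bg, threshold=10000):
        # Deem text "invisible" when the foreground is close enough to
        # the background -- exactly the kind of guess a spammer can game.
        return contrast(fg, bg) >= threshold

    print(visible_for_bayes("#303030", "#000000"))
    # False: dropped as "invisible", yet dark gray on black is still
    # faintly legible in many mail clients.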

As an alternative, tag invisible text, as suggested in Bug 2129.


With respect to Bug 2129 comment 16,

> This bug has been marked as Resolved Fixed after the invisible text
> code was checked in and reduced FNs by 7% with no new FPs.

my concern is not with existing spam but with future messages designed
to take advantage of r9460.
Comment 1 Sidney Markowitz 2004-03-14 13:05:31 UTC
Ok, now there's an appropriate place to talk about this :-)

If it is possible to craft such a message, then our code is identifying text as
invisible when it is not invisible. That would be a bug in our code, which can
be fixed. The correct approach is to attach such a message to a bug report.

I would not consider any solution acceptable if it allows a spammer to create a
message with 20,000 unique random 4-letter combinations that would be processed
by Bayes and not visible in a mail reader, unless someone comes up with a way
for that not to be a DoS attack on SpamAssassin with Bayes. That doesn't mean do
nothing to fix a problem, but it is a security issue that cannot be ignored.

I don't see introducing a vulnerability in order to fix a problem that has not
been demonstrated. Where is this message that is labeled invisible but isn't and
for which there is no fix in the invisibility detector code? If there is no such
example after some time, I'll be closing this bug as a WONTFIX. Of course if I
do that and an example shows up in the future, I would be happy to see this
reopened.
Comment 2 David Koppelman 2004-03-14 13:53:02 UTC
> If it is possible to craft such a message, then our code is
> identifying text as invisible when it is not invisible. That would
> be a bug in our code, which can be fixed. The correct approach is to
> attach such a message to a bug report.

A lot of time can go by between when such messages appear and when
users download fixed releases of SA.  Why introduce this weakness into
BC without considering the alternatives?

To protect against random word strings slowing down SA, one might limit
the number of tokens processed by BC, or at least processed by BC
within invisible regions.  The former would protect against DoS
attacks that r9460 would not.

Specially marking tokens in invisible regions might also improve
classification accuracy.

The reduction in FNs cited in Bug 2129 is impressive.  I'd like to
take a closer look at what's going on.  For example, it appears that
BC does look at tokens in invisible regions when scoring a message; it
just doesn't learn them.  (If that's true I'd call it a bug.)  Another
thing I'd like to know is how the BC was trained before starting the
testing used to get the results.  I'd appreciate it if anyone could
enlighten me before I take a look.

> I don't see introducing a vulnerability in order to fix a problem
> that has not been demonstrated. Where is this message that is
> labeled invisible but isn't and for which there is no fix in the
> invisibility detector code? If there is no such example after some
> time, I'll be closing this bug as a WONTFIX. 

We'll have to wait until r9460 is released, only then would spammers
try to take advantage of it.  (Or did you want me to come up with one
on my own?)

To summarize, r9460 opens a vulnerability in BC in order to prevent a
DoS attack that could be mounted anyway (by making the random strings
visible).
Comment 3 Justin Mason 2004-03-14 14:10:48 UTC
> The reduction in FNs cited in Bug 2129 is impressive.  I'd like to
> take a closer look at what's going on.  For example, it appears that
> BC does look at tokens in invisible regions for scoring a message, it
> just doesn't learn them.  (If that's true I'd call it a bug.) 

??? shouldn't be the case.

> Another
> thing I'd like to know is how the BC was trained before starting the
> testing used to get the results.  I'd appreciate it if anyone could
> enlighten me before I take a look.

See "masses/bayes-testing/README".   it's a 10-fold cross-validation
run, the std for testing trained classifiers.

David, I'd be happy to test some code using this method on the same
corpus, if you'd care to come up with a patch against current SVN.

(I'm thinking there *may* be a case where this will be useful in
future.)

--j.

Comment 4 Sidney Markowitz 2004-03-14 14:43:16 UTC
We are dealing with two possible attacks. One we know how to do now, loading
spam with invisible text that SpamAssassin will process in a way that overloads
SpamAssassin, effectively crashing it. The second we don't know how to do,
loading spam with visible text that SpamAssassin thinks is invisible to create
false negatives. All I am saying is that we must not create a vulnerability to a
known crash attack just to protect against the unproven possibility of false
negatives.

Setting a limit on the number of tokens that we process would make us vulnerable
to the Bayes poisoning that we have been seeing. So far they have not worked
because we score based on the 15 most significant tokens. If we have to throw
away tokens we can't find the most significant 15.
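
A sketch in Python of the significance ranking being described
(hypothetical data; the real classifier is Perl).  The point is that
the ranking needs every token's probability before it can pick the
top n, so a hard cap on tokens read defeats it:

    def most_significant(token_probs, n=15):
        """Keep the n tokens whose spam probability is furthest from
        the neutral 0.5.  (The comment above says 15; comment 13 below
        corrects SpamAssassin's figure to 150.)"""
        return sorted(token_probs.items(),
                      key=lambda kv: abs(kv[1] - 0.5),
                      reverse=True)[:n]

    probs = {"viagra": 0.99, "meeting": 0.01, "the": 0.50, "xqzvw": 0.51}
    print(most_significant(probs, n=2))
    # [('viagra', 0.99), ('meeting', 0.01)]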

You are correct that a spammer could put 20,000 tokens in visible text. If they
go so far as to have a three line ad for v*ag*a followed by two thousand lines
of gibberish, we would have to do something about that. Perhaps exceeding some
limit of unique tokens in one message could suppress Bayes and trigger another
rule. Yes, that would also work to prevent the DoS if the random words were in
invisible text, but doing so would give spammers a way to turn off Bayes
processing with no visible effect on the spam. That can't be a good thing.

So, yes, given the tradeoffs, I would like you to come up with examples on your
own, so we can pre-empt the efforts of the spammers. If this bug report remains
purely theoretical for too long, I will close it, subject to being reopened if
and when someone can come up with an example. I would not approve of the simple
solution of I*tokens without an included patch for DoS protection.

Ok, I've made my points. I'll shut up now until I see either some code or an
example of the problem.

> Another thing I'd like to know is how the BC was trained
> before starting the testing used to get the results.

Justin talked about how he tested in comments 6 and 7 in bug #2129.  I
interpret that as saying that he used 10-fold cross-validation.  That repeats
training on samples of the corpus and testing on the remainder.

Comment 5 David Koppelman 2004-03-14 15:38:21 UTC
Justin, thanks for the pointer to the BC testing methodology, I'll
have a look.

When I get a chance I'll work up a patch to mark invisible tokens.

I've come up with a simple message that SA thinks is invisible but
Firefox (and presumably other HTML renderers) does not.  I'll attach
it below.  My concern is that any attempt to detect truly invisible
text would be easy to get around, unless we include something close
to a complete HTML/CSS rendering system.
Comment 6 David Koppelman 2004-03-14 15:40:02 UTC
Created attachment 1840 [details]
Message faking invisibility.
Comment 7 Daniel Quinlan 2004-03-14 15:40:13 UTC
If we use a single html_text array and a single body array and set up a
parallel array of properties, then it would be trivial to add an option
to ignore or not ignore invisible text.
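
A sketch of that layout in Python (hypothetical structure; the real
arrays live in the Perl HTML parser):

    # One text array, one parallel properties array: the ignore/keep
    # decision for invisible text reduces to a one-line filter.
    html_text = ["Buy", "now", "xqzvw", "qwfpg"]
    properties = [{"invisible": False}, {"invisible": False},
                  {"invisible": True}, {"invisible": True}]

    def body_words(ignore_invisible=True):
        return [word for word, props in zip(html_text, properties)
                if not (ignore_invisible and props["invisible"])]

    print(body_words(True))    # ['Buy', 'now']
    print(body_words(False))   # ['Buy', 'now', 'xqzvw', 'qwfpg']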

Comment 8 Daniel Quinlan 2004-03-14 15:43:39 UTC
> Message faking invisibility.

Part of the solution is for us to handle CSS properly.  Sadly, it is
necessary.

Comment 9 David Koppelman 2004-03-14 15:44:02 UTC
An option is a great idea!  What's the default setting? :-)
Comment 10 David Koppelman 2004-03-14 15:48:40 UTC
> Part of the solution is for us to handle CSS properly.  Sadly, it is
> necessary.

Add to that a better handling of HTML.  There might be ways of faking
invisibility by assigning fg and bg colors to different blocks and confusing SA
about which one applies.
Comment 11 Daniel Quinlan 2004-03-14 15:58:08 UTC
> Add to that a better handling of HTML.  There might be ways of faking
> invisibility by assigning fg and bg colors to different blocks and
> confusing SA about which one applies.

We *know* already...

Well, you are able to submit code improvements.

For starters, you can look at jgc's spam tricks page.  I know a
half-dozen more, but CSS parsing and some basic CSS handling are the
main things lacking right now.

Comment 12 David Koppelman 2004-03-14 16:17:34 UTC
> We *know* already...

My argument is that, at least for now, we should not omit seemingly invisible
text, because invisibility would be too easy to fake.
Comment 13 Justin Mason 2004-03-14 16:53:58 UTC
>Setting a limit on the number of tokens that we process would make us vulnerable
>to the Bayes poisoning that we have been seeing. So far they have not worked
>because we score based on the 15 most significant tokens. If we have to throw
>away tokens we can't find the most significant 15.

BTW -- it's 150, not 15.

>You are correct that a spammer could put 20,000 tokens in visible text. If they
>go so far as to have a three line ad for v*ag*a followed by two thousand lines
>of gibberish, we would have to do something about that. Perhaps exceeding some
>limit of unique tokens in one message could suppress Bayes and trigger another
>rule. Yes, that would also work to prevent the DoS if the random words were in
>invisible text, but doing so would give spammers a way to turn off Bayes
>processing with no visible effect on the spam. That can't be a good thing.

I suggest reading the "Dobly" paper before thinking up new details along
these lines -- it sounds quite practical to detect this.

>> Another thing I'd like to know is how the BC was trained
>> before starting the testing used to get the results.
>
>Justin talked about how he tested in comments 6 and 7 in bug #2129. I interpret
>that as saying that he used 10fold cross-validation. That repeates training on
>samples of the corpus and testing on the remainder.

yep.   Specifically, training on 1/10th, testing against the other 9/10ths
of the corpus, and repeat over all "folds".  I've added a page to
describe it here:
http://wiki.apache.org/spamassassin/TenFoldCrossValidation
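
A sketch of the procedure in Python; train_fn and test_fn stand in
for the real training and scoring harness:

    import random

    def ten_fold(messages, train_fn, test_fn, k=10):
        """Train on one fold, test on the other k-1, rotate through
        all k folds.  (Shuffles the caller's list in place.)"""
        random.shuffle(messages)
        folds = [messages[i::k] for i in range(k)]
        results = []
        for i in range(k):
            classifier = train_fn(folds[i])                # train on 1/10th
            held_out = [m for j, fold in enumerate(folds)  # test on the
                        if j != i for m in fold]           # other 9/10ths
            results.append(test_fn(classifier, held_out))
        return results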

--j.

Comment 14 Sidney Markowitz 2004-03-14 18:35:06 UTC
> BTW -- it's 150, not 15

Oh, Paul Graham and DSPAM used 15 (later up to 20 to 25). That's an interesting
difference.

> I suggest reading the "Dobly" paper before thinking up new details
> along these lines -- it sounds quite practical to detect this.

My understanding after reading the paper is that Dobly uses the spam/ham counts
of the tokens to determine the "sparseness" of text, so it would have to access
the database entry for every token in the message. That would not help with a
DoS attack at all, since it takes as much I/O to determine that the words are
noise. It would only improve accuracy of the final result, if it works at all.
Wouldn't Dobly only help when the random words contained embedded high
probability ham indicators and Dobly led us to ignore them? How would a spammer
produce high probability ham indicator words targeted to each recipient?
Comment 15 Justin Mason 2004-03-14 19:59:37 UTC
> > BTW -- it's 150, not 15
> 
> Oh, Paul Graham and DSPAM used 15 (later up to 20 to 25). That's an
> interesting difference.

Yeah -- experimentally verified in "bayes tweaks round 1" if I recall
correctly ;)  Spambayes found the same.

Note that PG doesn't do 10-fold cross-validation as far as I know ;)

> > I suggest reading the "Dobly" paper before thinking up new details
> > along these lines -- it sounds quite practical to detect this.
> 
> My understanding after reading the paper is that Dobly uses the spam/ham
> counts of the tokens to determine the "sparseness" of text, so it would
> have to access the database entry for every token in the message. That
> would not help with a DoS attack at all, since it takes as much I/O to
> determine that the words are noise. It would only improve accuracy of
> the final result, if it works at all.

that's very true.  OK, I see your point...

> Wouldn't Dobly only help when the
> random words contained embedded high probability ham indicators and
> Dobly led us to ignore them? How would a spammer produce high
> probability ham indicator words targeted to each recipient?

Yeah -- true.  The issue is that occasionally they can be lucky
and hit one or two 0.001's.  Chi2 combining is much better at
dealing with that.  But I bet there's the occasional spam that
gets through...
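
A Python sketch of chi-squared combining in the Robinson/Fisher
style; with enough strong clues, one lucky 0.001 token barely moves
the result:

    import math

    def chi2q(x2, df):
        """P(chi-squared with even df >= x2)."""
        m = x2 / 2.0
        term = math.exp(-m)
        total = term
        for i in range(1, df // 2):
            term *= m / i
            total += term
        return min(total, 1.0)

    def chi_combine(probs):
        """Combined score (1 + spamminess - hamminess) / 2 in [0, 1]."""
        df = 2 * len(probs)
        spamminess = 1 - chi2q(-2 * sum(math.log(1 - p) for p in probs), df)
        hamminess = 1 - chi2q(-2 * sum(math.log(p) for p in probs), df)
        return (1 + spamminess - hamminess) / 2

    # Ten strong spam clues plus one lucky 0.001 ham clue: still ~0.95.
    print(round(chi_combine([0.99] * 10 + [0.001]), 2))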

--j.

Comment 16 Loren Wilton 2004-03-14 23:25:27 UTC
As I see it from examining spam, there are two kinds (generally) of invisible 
text in the messages.  Type 1 is a whole lot of "words", doubtless intended as 
Bayes poisioning, and generally showing up at the end of the body of an HTML 
message, or sometimes after the body (illegal HTML).  It is also common to see 
them in the text fork where they will generally be ignored by mail readers.

So far as I can tell, this Bayes "poisoning" in the vast majority of cases has
just the opposite effect of what was intended -- it marks the message as 99%
probability spam.  This can often be very useful in pushing a message solidly
into the spam bucket when the available rules were indecisive.  I would hate to
lose this form of Bayes "poisoning", since it is so effective in detecting spam.

The second form of invisible text is individual letters or small sequences of
letters or numbers, typically in a 0 or 1 pt font, inserted in the middle of a
word, typically something like via<small letters here>gra.  The intent here is
clearly random obfuscation of key words to make them hard to match by rules,
and hopefully hard to match by Bayes.

In these cases simply making the invisible letters go away will have the
advantage of making the evil words immediately obvious to both rules and Bayes,
since they will not contain the invisible obfuscation text.

Which means I both do and don't want invisible text to vanish from the text
rendering of the HTML.  A simple and probably effective rule would be to throw
away any invisible text that isn't bounded by a wordbreak on at least one side,
or that doesn't contain whitespace.  Or maybe, more simply, any run of invisible
text of less than, say, 6 characters (see the sketch after this comment).  Any
multi-word run of invisible text should REMAIN in the rendering, since it is
very effective in Bayes for detecting spam.

(Which implies also that having a set of rules of percentage of invisible text 
to non-invisible text could be a good spam detector all by itself.)
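
A Python sketch of the simpler rule above, assuming the renderer can
tag each run of text as visible or invisible; the 6-character cutoff
is the one suggested:

    def drop_obfuscation(runs, min_keep=6):
        """runs: list of (text, is_invisible) pairs in document order.
        Short, whitespace-free invisible runs are deleted so the word
        they split (via<xq>gra) re-joins; long multi-word runs remain,
        since they score spammy under Bayes."""
        out = []
        for text, invisible in runs:
            if invisible and len(text) < min_keep and " " not in text:
                continue
            out.append(text)
        return "".join(out)

    print(drop_obfuscation([("via", False), ("xq", True), ("gra", False)]))
    # -> 'viagra': the obfuscated word is now visible to rules and Bayes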
Comment 17 David Koppelman 2004-03-15 07:08:53 UTC
I commented above (Comment 2) that invisible text is not being ignored
for classification but is being ignored (as intended) for learning.
Actually, I don't think it's being ignored in either case, see Bug
3176.

If I'm not wrong, then we need to work out what the improved performance
Justin observed is actually due to (see Bug 2129 Comment 7).  Perhaps he
tested using a working copy of the invisible text feature while a
non-working version was checked into the repository.
Comment 18 David Koppelman 2004-03-16 06:33:34 UTC
I've come to the conclusion that using special marking for invisible
Bayes tokens, such as "I:poison", is a bad idea.  The imbalance
between "invisible" regions in spam and ham messages can lead to false
positives for legitimate messages with regions deemed invisible.  (In
a recent 1-day HTML run HTML_FONT_INVISIBLE hits on 10.7% of spam and
2.4% of ham.)

I still think that invisible regions should not be ignored.

Right now I'm thinking along the lines of having the following options:

1: Always ignore invisible regions.

2: Never ignore invisible regions.

3: Always learn invisible regions.
   If percent of recognized tokens > some threshold,
     use ordinary scoring;
   if percent of recognized tokens <= some threshold,
     consider only spammy tokens (in case region contains random strings).
     
Option 3 should reduce the impact of Bayes poisoning, just as ignoring
invisible regions does, without enabling spammers to get around BC by
faking invisible text.
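
A Python sketch of option 3's scoring logic; the 50% threshold and
the token database layout are illustrative only:

    def score_probs(tokens, db, threshold=0.5):
        """Option 3: learn everything, but when the share of recognized
        tokens is low (hinting at random-string poison), consider only
        the spam-leaning ones."""
        known = [t for t in tokens if t in db]
        if not tokens or len(known) > threshold * len(tokens):
            return [db[t] for t in known]             # ordinary scoring
        return [db[t] for t in known if db[t] > 0.5]  # spammy tokens only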

If anyone is familiar with work or discussion on considering only
spammy tokens when the number of recognized tokens is small, I'd
appreciate a pointer.

I'll work on a patch to implement this idea and post it probably later
today.
Comment 19 Daniel Quinlan 2004-03-16 11:32:43 UTC
> I've come to the conclusion that using special marking for invisible
> Bayes tokens, such as "I:poison", is a bad idea.  The imbalance
> between "invisible" regions in spam and ham messages can lead to false
> positives for legitimate messages with regions deemed invisible.  (In
> a recent 1-day HTML run HTML_FONT_INVISIBLE hits on 10.7% of spam and
> 2.4% of ham.)

There's zero basis for your conclusion.  Accidentally invisible text in
ham is very likely going to use different words than intentionally
invisible text in spam.  It's fine to speculate, but until you've done a
test or have a reference you can point to, laying down firm conclusions
is a waste of everyone's time.

Comment 20 David Koppelman 2004-03-16 12:05:50 UTC
> There's zero basis for your conclusion.

If I've been pompous or otherwise offensive I apologize.  Please suggest
a less offensive way of phrasing the conclusion of my speculation.

> Accidentally invisible text in ham is very likely going to use
> different words than intentionally invisible text in spam.

I'm not concerned about clearly hammy or spammy words, I'm concerned
about the large number of words that are neutral.  Those words are
considered neutral by BC (as now used) because it sees roughly equal
amounts of ham and spam.  My fear is that for users who get little or
no invisible ham, words that should be scored neutral will be scored
spammy in the few invisible ham regions that do arrive.  Keeping track of the
number of ham and spam messages having invisible regions won't help if
what should be a neutral word has not yet appeared in an invisible ham
region.

> It's fine to speculate, but until you've done a test or have a
> reference you can point to, laying down firm conclusions is a waste of
> everyone's time.

I'm just trying to decide what to do next.  I have some data, I'd be happy
to post it if you (or others) want to help me interpret it.
Comment 21 David Koppelman 2004-03-16 17:06:10 UTC
I'm getting a version ready that will specially mark invisible
tokens and which works along the lines described in Comment 18
(different options, not all at once).
Comment 22 Justin Mason 2004-03-16 21:39:30 UTC
testing the "I*" variant now.  I'd be happy to test another variant given a patch ;)

BTW, another question: should "body" see the visible parts only?  or both vis
and invis?

if the latter, how will that interact with bug 3139 (in that "tiny font"
sections should be considered "invisible text")?
Comment 23 Daniel Quinlan 2004-03-16 23:32:57 UTC
> BTW, another question: should "body" see the visible parts only?  or
> both vis and invis?

Maybe both to make sure we match as much as possible?

Only maybe 10% of spam has invisible text, another 10% has low contrast,
and maybe another 10% uses other random ways to hide text.
 
> if the latter, how will that interact with bug 3139 (in that "tiny
> font" sections should be considered "invisible text")?

I think ignoring tiny fonts would have to be tested.  It seems to be a
smaller percentage of spam, so I think it can wait.

Daniel

Comment 24 David Koppelman 2004-03-17 04:06:36 UTC
Did anyone look at bug 3176, mentioned in comment 17 here?  I don't think
invisible text is being ignored, at least in the trunk code as of last night.
Comment 25 Justin Mason 2004-03-17 09:14:31 UTC
yeah, I went back to first principles and have been verifying that the tokens
are being treated correctly this time around.  Not sure what the situation was
before.
Comment 26 Justin Mason 2004-03-17 18:52:35 UTC
ok -- urgh.  it looks like the prior checkin was partly more accurate due to
ignoring invisible tokens, and partly due to a bug in how the new
token-decomposition code worked!

(the bug was that the decomposition ran before very long tokens were shortened
to "skip" tokens, so it wound up generating long decomposed tokens.   see end of
mail for discussion on *those* results.)

So accurate figures for invisible-text treatment are:

invisnone: SUMMARY: 0.30/0.70  fp     2 fn   270 uh   273 us  3872    c 704.50
invis1:    SUMMARY: 0.30/0.70  fp     2 fn   270 uh   274 us  3861    c 703.50
invis2:    SUMMARY: 0.30/0.70  fp     2 fn   261 uh   284 us  3840    c 693.40

namely,

invisnone: do not add invisible tokens at all
invis1: add with an "I*" prefix to keep in a separate namespace
invis2: add, with no prefix

how's that for unexpected. ;)
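
A Python sketch of the three treatments, assuming a tokenizer that
tags each token visible or invisible:

    def emit_tokens(tagged, mode):
        """tagged: iterable of (token, is_invisible) pairs."""
        for token, invisible in tagged:
            if not invisible:
                yield token
            elif mode == "invis1":
                yield "I*" + token   # separate namespace for invisible tokens
            elif mode == "invis2":
                yield token          # same namespace as visible text
            # mode == "invisnone": invisible tokens are dropped entirely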

Inspecting the Bayes dbs, it looks like everything's working as it should; the
"I*" tokens in invis1 really are the ones found in bayes-poison blocks.  It
really does seem that (at least on this corpus) allowing the invisible,
bayes-pollution tokens to pollute the db actually INCREASES accuracy.  Very odd.

I would still prefer to keep the invisible tokens separate from the real ones --
and given David's point that it's entirely possible to make tokens *look*
invisible enough to fool a filter while remaining visible in a MUA, I'd prefer
not to just throw them out invisnone-style, since that'll be exploited.  So I'd
prefer invis1.  But invis2 is surprising.

Possible reason: as David noted in bug 2282, a lot of recent spam does *not* use
invis blocks, it just leaves the text fully visible.  could be why...

db sizes:

2889847 testset7/invisnone/results/bucket1/bayes_db.dump
2917427 testset7/invis2/results/bucket1/bayes_db.dump
2929940 testset7/invis1/results/bucket1/bayes_db.dump

from 260 spams, 451 hams.

Anyway, I'll check the code into SVN in "invisnone" mode now, since it fixes
a couple of other bugs anyway.


... On a separate issue.  here's some ROUGH figures for skip and non-skip:

skip:   SUMMARY: 0.30/0.70  fp     2 fn   261 uh   284 us  3840    c 693.40
noskip: SUMMARY: 0.30/0.70  fp     2 fn   260 uh   267 us  3825    c 689.20

rough because the implementation was slightly buggy, being the side-effect
of a bug anyway.  ;)  And DB sizes:

skip:   2956604 testset5/checkin/results/bucket1/bayes_db.dump
noskip: 2917427 testset7/invis2/results/bucket1/bayes_db.dump

I think it's best to stick with "skip" tokens to save that 40k or so on-disk,
since it only clears up 1 FN otherwise.  bear in mind that's 40k of skipped data
(long hashbusters, msgids, etc.) from only 260 spams and 451 hams.  thoughts?


Comment 27 Daniel Quinlan 2004-03-17 19:17:28 UTC
I agree that invis1 appears to be the best option.

One note: it seems somewhat wasteful that we're not doing a Huffman-style
encoding of our prefixes.
Comment 28 David Koppelman 2004-03-18 06:28:05 UTC
It looks like my fears about an increase in fp when invisible tokens
were marked (comment 18) were unfounded, at least based on these
results.  

I'd still prefer not marking them (treating them the same as visible
regions), but marking them is not nearly as dangerous (IMO) as
ignoring invisible regions outright.

I've written some code that detects possible poisoning by looking at
the number of new tokens.  Originally it was only to look in invisible
regions but given how rare they are I modified it to look at the whole
message. (It counts the number of new body tokens that are not part of
URLs and compares that to the number of previously seen tokens of any
type.)  There is still some tuning to do, when it's ready I'll file it
under a new bug.
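
A Python sketch of such a detector; the ratio and the URL test are
illustrative placeholders, not the actual patch:

    def looks_poisoned(body_tokens, db, ratio=3.0):
        """Flag a message when never-seen, non-URL body tokens greatly
        outnumber tokens the database already knows."""
        non_url = [t for t in body_tokens
                   if not t.startswith(("http", "www."))]
        new = sum(1 for t in non_url if t not in db)
        seen = sum(1 for t in body_tokens if t in db)
        return new > ratio * max(seen, 1)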
Comment 29 Justin Mason 2004-03-23 21:48:10 UTC
ok, this is now closed.