2282 – RFE: Tokenize reduced visibility text specially or not at all.

Status:	RESOLVED FIXED

Alias:	None

Product:	Spamassassin
Classification:	Unclassified
Component:	Learner
Version:	SVN Trunk (Latest Devel Version)
Hardware:	All All

Importance:	P5 enhancement
Target Milestone:	3.0.0
Assignee:	SpamAssassin Developer Mailing List

Depends on:	3173
Blocks:	3208

Reported:	2003-08-01 10:10 UTC by David Koppelman
Modified:	2004-03-23 12:43 UTC
CC List:	0 users

Attachment	Type	Actions	Submitter/CLA Status
Spam designed for Bayesian classification.	text/plain	None	David Koppelman
Diff for HTML.pm (against SA v2.55) for better detection of invisible text in messages.	patch	None	Brian White
Patch to HTML.pm (against SA v2.60-rc2) for removal of invisible text	patch	None	Brian White
Spam with Baysean classifier confounding words.	text/plain	None	David Koppelman

Description David Koppelman 2003-08-01 10:10:23 UTC

I recently received spam that looks like it was specifically designed
to get around Bayesian classification (BC).  The message contained a large
number of words that don't usually appear in spam but might appear in
normal messages.  (I'll attach the spam message.)  Those words were
rendered invisible by using a white font.  SpamAssassin's rules caught
the invisibility; however, BC was still done on the invisible words and
the message was scored 0.5, despite having many previously encountered
spammy tokens.  Despite the HTML_FONT_INVISIBLE rule, the message was
classified as ham.

A possible workaround is to provide the tokenizer with two versions of
the body text, one for visible text and one for invisible or
reduced-visibility (RV) text.  One option is to not tokenize RV text,
the other is to prefix the tokens with something like
"I:kindergarten".

The system would have to keep up with subtler ways of hiding text that
spammers would develop, but like comment obfuscation, the hiding
attempts would provide strong evidence of spam even before performing
BC.
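
The two options above (skip RV text, or tokenize it with a prefix) can be
sketched as follows.  This is a Python illustration only, not SpamAssassin
code; the function name tokenize, the mode names, and the simple word
splitter are assumptions made for the example.

```python
import re

def tokenize(visible, reduced, mode="prefix"):
    """Return Bayes tokens for a message split into visible and RV text.

    mode="skip"   drops reduced-visibility (RV) text entirely.
    mode="prefix" keeps RV words but marks them, e.g. "I:kindergarten".
    The word splitter is a stand-in, not SpamAssassin's real tokenizer.
    """
    def split(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    tokens = split(visible)
    if mode == "prefix":
        tokens += ["I:" + word for word in split(reduced)]
    elif mode != "skip":
        raise ValueError("mode must be 'skip' or 'prefix'")
    return tokens
```

With the prefix option, a hammy word hidden in white text is learned as a
separate token from the same word in visible text, so it cannot borrow the
visible word's ham history.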

Comment 1 David Koppelman 2003-08-01 10:13:31 UTC

Created attachment 1201
Spam designed for Bayesian classification.

The token entries, for example, 0.949-4--H*r:forged, show the
probability, the number of times the token has to be received before
losing its status (spam or ham), and the token itself.
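
For readers who want to pick such entries apart, here is a small Python
sketch.  It assumes the layout is exactly probability, "-", count, "--",
token, as in the single example above; the regular expression and the name
parse_entry are illustrative and not part of SpamAssassin.

```python
import re

# Assumed layout, inferred from the one example "0.949-4--H*r:forged":
#   <probability>-<count>--<token>
ENTRY = re.compile(r"^(\d+\.\d+)-(\d+)--(.+)$")

def parse_entry(entry):
    """Split a token entry into (probability, count, token)."""
    match = ENTRY.match(entry)
    if match is None:
        raise ValueError("unrecognized token entry: %r" % entry)
    return float(match.group(1)), int(match.group(2)), match.group(3)
```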

Comment 2 Brian White 2003-08-01 12:15:18 UTC

Subject: Re: [SAdev]  New: RFE: Tokenize reduced visibility text 
 specially or not at all.

> A possible workaround is to provide the tokenizer with two versions of
> the body text, one for visible text and one for invisible or
> reduced-visibility (RV) text.  One option is to not tokenize RV text,
> the other is to prefix the tokens with something like
> "I:kindergarten".

I think it would be better to just ignore invisible text.  Sometimes it
has tell-tale messages ("to unsubscribe..."), but most often it's just
to hide obfuscating text.  Since that stuff is never seen by the user,
it can be anything and thus checking it will probably not provide much
benefit.  The actual visible content of the message is the only thing
that can be relied upon.

With this in mind, I went in to the HTML.pm file and came up with the
following...

-------------------------------------------------------------------------------
--- orig/HTML.pm        Fri Aug  1 14:26:13 2003
+++ HTML.pm     Fri Aug  1 15:04:40 2003
@@ -213,16 +213,31 @@
   }
   if ($tag eq "font" && exists $attr->{color}) {
     my $c = lc($attr->{color});
+    $self->{html}{bgcolor} = "#ffffff" unless (exists $self->{html}{bgcolor});
     $self->{html}{font_color_nohash} = 1 if $c =~ /^[0-9a-f]{6}$/;
     $self->{html}{font_color_unsafe} = 1 if ($c =~ /^\#?[0-9a-f]{6}$/ &&
                                     $c !~ /^\#?(?:00|33|66|80|99|cc|ff){3}$/);
     $self->{html}{font_color_name} = 1 if ($c !~ /^\#?[0-9a-f]{6}$/ &&
                                   $c !~ /^(?:navy|gray|red|white)$/);
     $c = name_to_rgb($c);
-    $self->{html}{font_invisible} = 1 if (exists $self->{html}{bgcolor} &&
-                                substr($c,-6) eq substr($self->{html}{bgcolor},-6));
+    if (substr($c,-6) eq substr($self->{html}{bgcolor},-6)) {
+#     print STDERR "html_tests: self->bgcolor=$self->{html}{bgcolor}; fgcolor=$c\n";
+      $self->{html}{font_invisible} = 1;
+    }
     if ($c =~ /^\#?([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})$/) {
+      my ($r, $g, $b) = ($1, $2, $3);
       my ($h, $s, $v) = rgb_to_hsv(hex($1), hex($2), hex($3));
+      if ($self->{html}{bgcolor} =~ /^\#?([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})$/) {
+#       print STDERR "html_tests: bg(r,g,b)=($r,$g,$b); fg(r,g,b)=($1,$2,$3) -- ";
+        if (abs(hex($r)-hex($1)) < 16 && abs(hex($g)-hex($2)) < 16 && abs(hex($b)-hex($3)) < 16) {
+#         print STDERR "invisible!\n";
+          $self->{html}{font_invisible} = 1;
+          $self->{html}{invisible} = 1;
+        } else {
+#         print STDERR "visible\n";
+          $self->{html}{invisible} = 0;
+        }
+      }
       if (!defined($h)) {
        $self->{html}{font_gray} = 1 unless ($v == 0 || $v == 255);
       }
@@ -366,6 +381,12 @@
   while ($text =~ s/<(\S[^>]*)>//) {
 #   print STDERR "html_text: found unparsed <$1> inside text\n";
     html_tag($self,$1,undef,0);
+  }
+
+  # ignore all invisible text
+  if (exists $self->{html}{invisible} && $self->{html}{invisible}) {
+#   print STDERR "html_text: ignoring invisible text \"$text\"\n";
+    return;
   }
 
   # record when something non-tag exists between other tags (search of obfuscating tags)
-------------------------------------------------------------------------------

I'm currently testing it on our mailsystem here at work.  I'll attach it
as a real diff when I'm satisfied that it's working correctly.

The idea is that any pair of fg and bg colors whose R, G, and B values each
differ by less than 16 (on the 0-255 scale) is basically invisible to the
naked eye.  A real message has much more contrast than this, and if a
spammer uses a difference of 16 or more, the text is likely to start
becoming visible to the very people they want to hide it from.

One other thing I added was to set the bgcolor to #ffffff if not otherwise
defined since that's the expected default.
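
The per-channel test and the white default can be restated outside the diff
as a short Python sketch.  The names parse_rgb and is_near_invisible are
made up for this illustration, and named colors such as "white" are not
handled here (the patch relies on name_to_rgb for that).

```python
def parse_rgb(color):
    """Parse 'rrggbb' or '#rrggbb' into an (r, g, b) tuple, else None."""
    digits = color.lower().lstrip("#")
    if len(digits) != 6 or any(ch not in "0123456789abcdef" for ch in digits):
        return None
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))

def is_near_invisible(fg, bg=None, threshold=16):
    """True when every channel of fg is within `threshold` of bg.

    bg defaults to white, matching the assumption in the patch that an
    unset background is "#ffffff".
    """
    fg_rgb = parse_rgb(fg)
    bg_rgb = parse_rgb(bg if bg is not None else "#ffffff")
    if fg_rgb is None or bg_rgb is None:
        return False  # unparsable colors are treated as visible
    return all(abs(f - b) < threshold for f, b in zip(fg_rgb, bg_rgb))
```

For example, is_near_invisible("#f0f0f0") holds against the default white
background (each channel differs by 15), while "#efefef" (a difference of
16) does not.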

I'd appreciate any comments people have on this.  Thanks!

                                          Brian
                                 ( bcwhite@precidia.com )

-------------------------------------------------------------------------------
    It seems that anything people have learned prior to puberty takes on the
  status of an immutable truth (this is something well understood by parents,
    governments, and religions). Rational explanations of why some previous
    belief might be incompatible with the behavior of nature, and a careful
       explanation of the actual behavior of nature are of little avail.

Comment 3 David Koppelman 2003-08-01 13:23:29 UTC

In the version I have, near-invisibility of fonts is already tested for in
html_font_invisible.  The following patch should be sufficient, except
for the fact that there is no way to turn text skipping off.  BTW, I'm
not proposing this as the enhancement, just as something to play with.


*** HTML.pm.~1.95.~	Sat Jun 14 15:42:18 2003
--- HTML.pm	Fri Aug  1 15:18:57 2003
***************
*** 536,542 ****
    $self->html_font_invisible($text) if $text =~ /[^ \t\n\r\f\x0b\xa0]/;
  
    $text =~ s/^\n//s if $self->{html_last_tag} eq "br";
!   push @{$self->{html_text}}, $text;
  }
  
  sub html_comment {
--- 536,543 ----
    $self->html_font_invisible($text) if $text =~ /[^ \t\n\r\f\x0b\xa0]/;
  
    $text =~ s/^\n//s if $self->{html_last_tag} eq "br";
!   push @{$self->{html_text}}, $text
!     unless $self->{html}{font_invisible} or $self->{html}{font_near_invisible};
  }
  
  sub html_comment {

Comment 4 David Koppelman 2003-08-01 13:53:59 UTC

Brian, can you let us know how well your changes worked?  In my case I've
only gotten one message that skipping would obviously work on, but of course
that could be the solitary fat raindrop before the downpour.

Comment 5 Brian White 2003-08-01 14:56:30 UTC

Subject: Re: [SAdev]  RFE: Tokenize reduced visibility text specially 
 or not at all.

> ------- Additional Comments From koppel@ece.lsu.edu  2003-08-01 13:53 -------
> Brian, can you let us know how well your changes worked.  In my case I've
> only gotten one message that skipping would obviously work on but of course
> that could be the solitary fat raindrop before the downpour.

Well, first I had to fix my change.  <sigh>  It would recognize invisible
text all right...  It just didn't see when "</font>" made it visible again.
So, now I have a stack that grows with each font tag and shrinks with every
/font tag.  Much better.
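
A minimal Python sketch of such a stack follows.  It is not the actual
patch: the class name FontStack, the black default text color, and the
choice to ignore an unmatched close tag are all assumptions for the
example.

```python
class FontStack:
    """Track the current font color across nested <font> tags."""

    def __init__(self, bgcolor="#ffffff"):
        self.bgcolor = bgcolor.lower()
        self.colors = ["#000000"]  # assumed default text color

    def open_font(self, color=None):
        # A <font> tag without a color inherits the enclosing color.
        self.colors.append(color.lower() if color else self.colors[-1])

    def close_font(self):
        # Ignore an unmatched </font> instead of emptying the stack.
        if len(self.colors) > 1:
            self.colors.pop()

    def invisible(self):
        # Compare the last six hex digits, as the diff does with substr().
        return self.colors[-1][-6:] == self.bgcolor[-6:]
```

Text handed to the tokenizer while invisible() is true would be skipped;
once the matching close tag pops the stack, later text counts as visible
again.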

Interestingly enough, a message very similar to the one used to report this
bug came through my filter not long after I got this done.  Here's how it
was tagged:

Aug  1 17:41:20 jordan mailscanner[5673]: Message 19iheC-0001kH-00 from
210.219.251.251 (hotmail.com) is spam according to SpamAssassin (score=8.2,
required 6, BAYES_70, HTML_BAD_TAGS_0, HTML_EXTERNAL_IMAGE, HTML_IMAGE_ONLY_02,
HTML_LINKED_IMAGE, HTML_LINKED_IMAGE_ONLY_02, MIME_HTML_ONLY,
MSG_ID_ADDED_BY_MTA_3, RCVD_IN_RFCI, SEMIFORGED_HOTMAIL_RCVD) 


So...  Even ignoring the invisible words meant a BAYES_70 hit.  Note that
my local SpamAssassin has my patch for the HTML_LINKED_IMAGE* rules.


I'll have to run the same message both with and without the change, of
course, but it's a long weekend here so I'm leaving soon.  See ya Tuesday!

                                          Brian
                                 ( bcwhite@precidia.com )

Comment 6 Matthew Cline 2003-08-01 21:34:00 UTC

Brian, also make sure to keep a stack of background colors, since table elements
can change the background color away from the default.

Comment 7 David Koppelman 2003-08-02 05:36:45 UTC

As a reminder, the current CVS code already keeps a stack of fg and bg
colors and keeps track of both invisibility and near invisibility.  If
you like I could attach a recent version of HTML.pm with that code.

Comment 8 David Koppelman 2003-08-04 16:55:05 UTC

One thing to keep in mind is that omitting invisible and
low-visibility text from SA rules (not including Bayes) is not a good
idea, because these rules, at least in recent versions, mostly only
increase the spam score.  If text were incorrectly classified as
invisible, those rules would never see it, whereas there is no harm in
having them operate on text that really is invisible.

I think this would be best:

    The regular SA rules get the usual text.

    The BC gets text with the invisible and low visibility portions
    removed, but with URI's retained.  (With such changes it would be
    easy to add a high-ratio-of-invisible-text eval rule.)

A potential problem is misclassifying text as invisible and to a
lesser extent, having invisible text marked as visible.  (CSS makes
things complicated, especially if we have to chase down external css
files.)
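
The split proposed above could look roughly like the Python sketch below.
It assumes the HTML parser already yields (text, visible) pairs; the
function name split_streams, the URI pattern, and the character-based
ratio are illustrative choices, not existing SpamAssassin behavior.

```python
import re

URI = re.compile(r"https?://\S+")  # simplified URI matcher for the sketch

def split_streams(segments):
    """Build the two text streams from (text, visible) pairs.

    Returns (rule_text, bayes_text, invisible_ratio):
      rule_text       all text, as the regular rules see it today
      bayes_text      visible text plus URIs found in invisible text
      invisible_ratio fraction of characters that were invisible
    """
    rule_parts, bayes_parts = [], []
    hidden_chars = total_chars = 0
    for text, visible in segments:
        rule_parts.append(text)
        total_chars += len(text)
        if visible:
            bayes_parts.append(text)
        else:
            hidden_chars += len(text)
            bayes_parts.extend(URI.findall(text))
    ratio = hidden_chars / total_chars if total_chars else 0.0
    return " ".join(rule_parts), " ".join(bayes_parts), ratio
```

The returned ratio is the kind of value a high-ratio-of-invisible-text
eval rule could test against a threshold.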

If the developers are interested I'd be glad to make the changes.

Comment 9 Justin Mason 2003-08-04 17:26:27 UTC

Subject: Re: [SAdev]  RFE: Tokenize reduced visibility text specially or not at all. 


>If the developers are interested I'd be glad to make the changes.

whoa, yeah, definitely! ;)   At least I would think so, the changes
sound good and we've been meaning to do them for a while.

Dan, Theo I presume you have no objections?

BTW -- I would add a caveat; we need to do a ten-fold cross validation to
test how it affects classification at the end, before it can be merged.
If it *decreases* accuracy, we can't check it in, for obvious reasons.
This is std practice for bayes modifications, and is pretty unavoidable. ;)

--j.

Comment 10 David Koppelman 2003-08-04 18:19:54 UTC

> If it *decreases* accuracy, we can't check it in, for obvious reasons.
> This is std practice for bayes modifications, and is pretty unavoidable. ;)

I've only noticed one message that used invisible text to hide hammy tokens.
(Out of hundreds.)  If that's typical then it might have no significant
impact on accuracy, at least until the tactic becomes more widespread.

I'll probably start working on it in a few days.
Probably have get_decoded_stripped.. return two references, one to the
usual text, the other with invisible material removed or maybe prefixed
(I:algorithm).

Comment 11 Brian White 2003-08-05 04:55:44 UTC

Subject: Re: [SAdev]  RFE: Tokenize reduced visibility text specially 
 or not at all.

> Brian, also make sure to keep a stack of background colors, since table elements
> can change the background color away from the default.

Yup.  My patch is now 3rd generation.  It tracks all font attrs whether
they come from font|table|tr|td.  So far, it appears to be working well.

As an added bonus, I think this will greatly improve the effectiveness
of the "font_invisible" test since now it can actually set that flag
only when real text is found and not just when the foreground/background
happen to be the same (which occurs quite often, it seems).

                                          Brian
                                 ( bcwhite@precidia.com )

-------------------------------------------------------------------------------
  The two most plentiful elements in the universe are hydrogen and stupidity.

Comment 12 Brian White 2003-08-05 04:57:17 UTC

Subject: Re: [SAdev]  RFE: Tokenize reduced visibility text specially 
 or not at all.

> As a reminder, the current CVS code already keeps a stack of fg and bg
> colors and keeps track of both invisibility and near invisibility.  If
> you like I could attach a recent version of HTML.pm with that code.

Arghhh!!!  Ummm, well, sure.  <sigh>  It'll be interesting to see how
our approaches differ.

                                          Brian
                                 ( bcwhite@precidia.com )

Comment 13 Brian White 2003-08-05 05:25:31 UTC

Subject: Re: [SAdev]  RFE: Tokenize reduced visibility text specially 
 or not at all.

> One thing to keep in mind is that omitting invisible and
> low-visibility text from SA rules (not including Bayes) is not a good
> idea because these rules, at least in recent versions, can mostly
> increase the spam score.  If text were incorrectly classified as
> invisible those rules would not work and there would be no harm in
> having those rules operate on invisible text.

While it's true that the invisible words do seem to increase the Bayes
score at the moment, I think that's a short-term thing.  Right now, the
words are either garbage or random but it's only a matter of time before
spammers start to figure out statistically what words are common in ham
messages and start including those as the invisible text.

I believe it's better long-term to weigh in only on the part of the message
that a user will see.  Anything else is open to an end-run by the
dedicated spammer.

Falsely determining text to be invisible would be a problem, but it should
be fairly easy to avoid that.


>     The regular SA rules get the usual text.
> 
>     The BC gets text with the invisible and low visibility portions
>     removed, but with URI's retained.  (With such changes it would be
>     easy to add a high-ratio-of-invisible-text eval rule.)

Perhaps I don't understand, but this seems at odds with your first comment
that "omitting invisible and low-visibility text ... is not a good idea".


> A potential problem is misclassifying text as invisible and to a
> lesser extent, having invisible text marked as visible.  (CSS makes
> things complicated, especially if we have to chase down external css
> files.)

Well, I think if there's an external style-sheet, then it's probably spam
anyway.  Does SA have a test for that?  Do any mailers create/attach a
CSS to a mail message?

                                          Brian
                                 ( bcwhite@precidia.com )

Comment 14 Daniel Quinlan 2003-08-21 23:04:23 UTC

Brian, I'd be interested to see your patch (or at least find out if I missed
anything), the code that's in 2.60 went through many *many* generations, but I
definitely neglected some things (style sheets come to mind).

Comments on other past discussion:

1. tagging invisible text vs. skipping: just do a 10fcv (remembering that
   skipping text or altering it may affect non-Bayes rules)
2. David's offer to do work: yes, we're interested!  Watch out for memory
   usage and performance (re-running any routines more than we already do).

One other note: when in doubt, simulate Microsoft rendering with Outlook or
Outlook Express (same as Internet Explorer).

P.S. Always work off of top-of-tree.  ;-)

Comment 15 Brian White 2003-08-22 06:00:01 UTC

Subject: Re: [SAdev]  RFE: Tokenize reduced visibility text specially 
 or not at all.

> Brian, I'd be interested to see your patch (or at least find out if I missed
> anything), the code that's in 2.60 went through many *many* generations, but I
> definitely neglected some things (style sheets come to mind).

I'll attach it to the bug.  It's been running for several weeks now and
seems quite effective.


> 1. tagging invisible text vs. skipping: just do a 10fcv (remembering that
>    skipping text or altering it may affect non-Bayes rules)

My hunch is that, as far as Bayes is concerned, it'll work out about the
same.  Invisible text in spam is likely to be semi-random and thus won't
correlate too much between messages.  Invisible text is likely nonexistent
in ham and so won't apply any weighting that direction.  So, I figure that
the "I:" tagged tokens will seldom make the top 15 interesting words and
thus have about the same effect as just skipping them.
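
As a rough Python illustration of that argument: if "interesting" means
farthest from a neutral probability of 0.5 (an assumption here; the
selection rule is not spelled out in this thread), a tagged token with a
near-neutral probability never makes the cut.  The function name and the
sample probabilities are invented for the example.

```python
def most_interesting(token_probs, limit=15):
    """Return up to `limit` tokens whose probability is farthest from 0.5.

    token_probs maps token -> spam probability in [0, 1].
    """
    ranked = sorted(token_probs,
                    key=lambda tok: abs(token_probs[tok] - 0.5),
                    reverse=True)
    return ranked[:limit]

# Invented probabilities, for illustration only.
sample = {"viagra": 0.99, "meeting": 0.05, "I:kindergarten": 0.55, "the": 0.50}
top_two = most_interesting(sample, limit=2)  # ["viagra", "meeting"]
```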

Since providing alternate tagging would mean having to somehow pass this
information out-of-band to the text classifiers (Bayes for now, perhaps
others in the future), I don't believe that the small improvement that may
result over just skipping the text justifies the additional amount of
work and likelihood of bugs/errors.

Other classification systems may become even more difficult since things
like CRM114 use the relative position of the words, too.


> P.S. Always work off of top-of-tree.  ;-)

Unfortunately, this is not an option for me.  I'm adding this code on our
production mail server here at work.  While I can quietly justify making
changes personally and testing them on the spot, I can't just bring in
unreleased changes.  I have a machine at home I can do unstable testing
with, but it's much easier at work where we get thousands of spam per
day instead of just a hundred or so.

As an aside...  I goofed a couple weeks back and SA was off-line for a
weekend.  I got a few inquiries Monday morning about that!  On the plus
side, everybody now has a good feel for how effective the filter really
is.  <grin>

                                          Brian
                                 ( bcwhite@precidia.com )

-------------------------------------------------------------------------------
     There's no healthy way to mess with the line between wrong and right.

Comment 16 Brian White 2003-08-22 06:03:09 UTC

Created attachment 1256
Diff for HTML.pm (against SA v2.55) for better detection of invisible text in messages.

When invisible text is detected by this patch, it is removed from the text
stream so as to avoid having it tested for by other rules, including the
Bayesian classifier.

I think it's important to have the modified text string affect all "body" rules
since many look for a series of tell-tale words and including an invisible
word, even tagged, would cause them not to function.

Comment 17 Brian White 2003-08-22 06:57:26 UTC

Subject: Re: [SAdev]  RFE: Tokenize reduced visibility text specially 
 or not at all.

> 1. tagging invisible text vs. skipping: just do a 10fcv (remembering that
>    skipping text or altering it may affect non-Bayes rules)

Hmmm...  Here's an interesting bit of invisible text in a piece of spam I
just received:

<div align=3D"center"> <FONT color=3D#ffffff size=3D1>Order Confirmation. =
Your order 
  should be shipped by January, via FedEx. Your Federal Express tracking n=
umber 
  is 6-8.</FONT><font size=3D"1"><BR>


Perhaps it's not as semi-random as I originally thought.

Whatever we do, we need to keep invisible text from all the "body" rules
or find a way to negate the result of any "nice" rules that match invisible
text.  Invisible text would also need to be scanned as a separate text
string.  Keeping them mixed with the visible text would break any rules
that are positionally dependent (i.e. many of the body rules and text
classifiers like CRM114).

I'm still not sure the potential results justify the amount of effort
necessary to implement this.

                                          Brian
                                 ( bcwhite@precidia.com )

Comment 18 Daniel Quinlan 2003-08-22 15:21:39 UTC

> Diff for HTML.pm (against SA v2.55) for better detection of invisible text in
> messages.

The patch doesn't apply against SA v2.55.  You must have some other patches in
there or something else.

$ patch --dry-run < /tmp/1256
patching file HTML.pm
Hunk #1 FAILED at 35.
Hunk #2 succeeded at 31 with fuzz 2 (offset -12 lines).
Hunk #3 succeeded at 169 (offset -12 lines).
Hunk #4 succeeded at 291 (offset -12 lines).
Hunk #5 FAILED at 443.
2 out of 5 hunks FAILED -- saving rejects to file HTML.pm.rej

Comment 19 Brian White 2003-08-25 09:02:30 UTC

Subject: Re: [SAdev]  RFE: Tokenize reduced visibility text specially 
 or not at all.

> > Diff for HTML.pm (against SA v2.55) for better detection of invisible text in
> > messages.
> 
> The patch doesn't apply against SA v2.55.  You must have some other patches in
> there or something else.
> 
> $ patch --dry-run < /tmp/1256
> patching file HTML.pm
> Hunk #1 FAILED at 35.
> Hunk #2 succeeded at 31 with fuzz 2 (offset -12 lines).
> Hunk #3 succeeded at 169 (offset -12 lines).
> Hunk #4 succeeded at 291 (offset -12 lines).
> Hunk #5 FAILED at 443.
> 2 out of 5 hunks FAILED -- saving rejects to file HTML.pm.rej

There might be.  I had another patch for handling bad tags and there is
probably some overlap.  That's why I'll have to regenerate the patch by
hand for the new code.

                                          Brian
                                 ( bcwhite@precidia.com )

Comment 20 Brian White 2003-08-29 08:15:14 UTC

Created attachment 1293
Patch to HTML.pm (against SA v2.60-rc2) for removal of invisible text

Here's a refinement of the patch for the latest SA release.  I changed the
html_font_invisible function to return a flag and use that within the html_text
function to remove invisible text.

I know there is some debate as to the best action for invisible text, but I
believe simply stripping it is the best choice for these reasons:

 - Tagging each invisible word in-line would cause problems with rules that
look at more than a single word at a time (including CRM114, should it ever get
included).

 - There is no method to pass the invisible text out-of-band to just those
rules that can make use of it.  I think adding such an ability would cost us
more development effort than the amount of effort a spammer would have to
invest to defeat it.

 - The invisible text can be any text a spammer chooses, so trying to act based
on the contents of this text is a losing battle.  It's better just to act based
on the existence of invisible text rather than its contents.


On another note, I liked the method of detecting invisible text; it looks easy
to extend and made my patch very simple.

What about other types of invisible text?  Some spam has a dozen <BR> lines
followed by some text at <font size=1> so, while it's technically visible, it's
not really seen.

-- Brian

Comment 21 Justin Mason 2003-08-29 10:41:01 UTC

'What about other types of invisible text?  Some spam has a dozen <BR> lines
followed by some text at <font size=1> so, while it's technically visible, it's
not really seen.'

Perhaps we should think of ways to use different tagging for "faraway" text --
i.e. text that would be "far away" from the top of the message.  However, for
long ham mails that may not be good.  Some empirical 10fcv testing could
provide results on this, I think.

I think I agree that stripping invisible text is the best option BTW.

Comment 22 Brian White 2003-08-29 11:51:00 UTC

Subject: Re: [SAdev]  RFE: Tokenize reduced visibility text specially 
 or not at all.

> 'What about other types of invisible text?  Some spam has a dozen <BR> lines
> followed by some text at <font size=1> so, while it's technically visible, it's
> not really seen.'
> 
> Perhaps we should think of ways to use different tagging for "faraway" text --
> ie. text that would be "far away" from the top of the message.  However for long
> ham mails that may be not good.  Some empirical 10fcv testing could provide
> results on this I think.

What is "10fcv"?

How about just detecting large blocks of blank lines?

body	TEXT_FARAWAY	/\n{5}/

                                          Brian
                                 ( bcwhite@precidia.com )

-------------------------------------------------------------------------------
    In theory, theory and practice are the same.  In practice, they're not.

Comment 23 Justin Mason 2003-08-29 12:19:57 UTC

Subject: Re: [SAdev]  RFE: Tokenize reduced visibility text specially or not at all. 


>What is "10fcv"?

10-fold cross-validation testing.  A very good way to test tweaks
to learning systems like Bayes or the SpamAssassin GA:

  http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-12.html
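
For orientation, a bare-bones Python sketch of the procedure follows.  It
is not the SpamAssassin test harness; k_fold_splits, cross_validate, and
the train_fn/classify_fn callbacks are hypothetical names, and a real run
would shuffle the corpus before splitting.

```python
def k_fold_splits(n_items, k=10):
    """Yield (train_indexes, test_indexes); item i is tested in fold i % k."""
    if not 2 <= k <= n_items:
        raise ValueError("need 2 <= k <= n_items")
    for fold in range(k):
        test = [i for i in range(n_items) if i % k == fold]
        train = [i for i in range(n_items) if i % k != fold]
        yield train, test

def cross_validate(messages, labels, train_fn, classify_fn, k=10):
    """Train on k-1 folds, test on the held-out fold, return accuracy."""
    correct = 0
    for train, test in k_fold_splits(len(messages), k):
        model = train_fn([messages[i] for i in train],
                         [labels[i] for i in train])
        correct += sum(classify_fn(model, messages[i]) == labels[i]
                       for i in test)
    return correct / len(messages)
```

Running this once with the current tokenizer and once with the modified
one on the same corpus gives two accuracy figures to compare before a
change is merged.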

>How about just detecting large blocks of blank lines?
>body	TEXT_FARAWAY	/\n{5}/

That would work -- but doesn't fix the problem that the tokens found
after the blank lines would be visible to Bayes, like the "invisible
text" trick.

--j.

Comment 24 David Koppelman 2004-01-21 06:35:45 UTC

Given the spam that I've been receiving in the past few months, the
ignore-invisible-text feature would be of little use.  It was supposed
to ignore text intentionally included to confound the Bayesian
classifier.  However, in a lot of recent spam I've received, such
confounding words do not appear in an invisible region, so the feature
would not skip them.  Also, the confounding words are re-used, and so
many are recognized as spammy the second time around; skipping them
would reduce SpamAssassin's effectiveness.

I'm attaching a sample of such mail below.

Comment 25 David Koppelman 2004-01-21 06:39:03 UTC

Created attachment 1716
Spam with Baysean classifier confounding words.

The spam shows the classification of words.  About one out of
ten of the confounding words are classified as hammy but a greater
number are recognized as spammy since they've appeared in earlier
messages of this type.

Comment 26 Justin Mason 2004-03-17 18:52:53 UTC

adding dep on 3173

Comment 27 Justin Mason 2004-03-23 21:43:50 UTC

actually, marking fixed.  3173 has the most current state-of-play for this code.

(btw I like that representation of the tokens.  we should add that ;)