Bug 2892 - Disregard attempts at Bayes poison
Summary: Disregard attempts at Bayes poison
Status: NEW
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules
Version: 2.61
Hardware: Other other
Importance: P5 enhancement
Target Milestone: Future
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on: 3173
Blocks:
Reported: 2004-01-02 22:08 UTC by Kenneth Porter
Modified: 2008-08-13 14:01 UTC



Description Kenneth Porter 2004-01-02 22:08:53 UTC
Looking at some recent spam, I see two mechanisms to introduce non-visible Bayes
poison:

1. Put the poison in an alternative text/plain part on the assumption that most
users will prefer the HTML alternative.

2. Put poison after a </HTML> tag in the HTML part.

I'd like to suggest, for purposes of Bayes learning, ignoring the content of a
text/plain part if the HTML part has significant content, and ignoring anything
after an </HTML> tag.
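
As a rough illustration of the policy proposed above (this is not SpamAssassin code, which is written in Perl), the following Python sketch selects the text that would be handed to the Bayes tokenizer. The function names, the MIN_HTML_CHARS threshold, and the crude tag stripping are all assumptions made for illustration; the sketch also assumes the parts carry no content-transfer-encoding.

```python
import re
from email import message_from_string

# Hypothetical threshold: number of non-whitespace characters in the HTML
# part that counts as "significant content".
MIN_HTML_CHARS = 20


def visible_html(html):
    """Drop everything after the first closing </html> tag, if there is one."""
    match = re.search(r"</html\s*>", html, flags=re.IGNORECASE)
    return html[: match.end()] if match else html


def strip_tags(html):
    """Very crude tag removal; a real renderer would do far more."""
    return re.sub(r"<[^>]+>", " ", html)


def bayes_text(raw_message):
    """Return the text to feed to Bayes learning under the proposed policy."""
    msg = message_from_string(raw_message)
    plain_parts, html_parts = [], []
    for part in msg.walk():
        ctype = part.get_content_type()
        if ctype == "text/plain":
            plain_parts.append(part.get_payload())
        elif ctype == "text/html":
            html_parts.append(strip_tags(visible_html(part.get_payload())))

    html_text = " ".join(html_parts)
    # If the HTML part has significant content, ignore text/plain entirely.
    if len("".join(html_text.split())) >= MIN_HTML_CHARS:
        return html_text
    # Otherwise keep the plain text, plus whatever little HTML text exists.
    return " ".join(plain_parts + ([html_text] if html_text.strip() else []))
```

With a multipart/alternative message whose text/plain part holds only random words and whose HTML part has trailing words after </HTML>, bayes_text() would return only the text inside the HTML document.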
Comment 1 Niels Teglsbo 2004-01-03 10:10:44 UTC
Bug #2878 suggests that SA identify when the plain text and HTML parts of a
multipart/alternative message differ. That would make it futile for spammers
to use the first kind of poisoning.
Comment 2 Kenneth Porter 2004-01-03 22:43:47 UTC
Note that bug #2878 describes the phenomenon (HTML part differs from plain text),
while I'm proposing a policy (ignore words in plain text for Bayesian analysis
and learning when they differ from the HTML). I also note the additional Bayes
poison hidden in the HTML part, which would still be there even if the plain
text part were absent.
Comment 3 Sidney Markowitz 2004-01-03 23:25:13 UTC
In case anyone picks this up I would like to point to an article by Paul Graham
in which he talks about the statistical effects that this kind of poisoning has
on Bayesian spam filters. See the second and third sections with headings "More
Good Tokens" and "Fewer Bad Tokens". The article is at
http://www.paulgraham.com/sofar.html

Based on what he says, it looks like the first step may be to see whether these
"poison" words really do decrease the effectiveness of the Bayesian filter, or
whether they in fact have no effect or maybe even help because the random words
don't look like normal mail.
Comment 4 Lachlan Cameron-Smith 2004-01-08 14:20:50 UTC
I've received over 30 spam messages this year which include "random" words after the
</HTML> tag. Even after feeding each of these through sa-learn as I get them,
each new one is scoring BAYES_00, BAYES_01, BAYES_10 at best, and therefore not
reaching my points threshold. So I'd like to see words after the </HTML> tag
disregarded for Bayes learning.
Comment 5 Justin Mason 2004-03-17 18:53:17 UTC
another bayes-poison-related bug
Comment 6 Justin Mason 2004-03-23 21:47:13 UTC
this just needs 1 more tweak -- in a multipart/alt message with a plain part and
an HTML part, we should consider the tokens from the plain part as "invisible".
Comment 7 Justin Mason 2004-04-28 16:43:58 UTC
BTW, as http://bugzilla.spamassassin.org/show_bug.cgi?id=3173#c26 (bug 3173
comment 26) shows, it looks like disregarding the bayes poison in the first
text/plain part is not urgent.

(PS: Sidney, whatever you do, don't take Paul Graham's advice on this stuff.  ;)
Comment 8 Sidney Markowitz 2004-04-28 19:26:58 UTC
> Sidney, whatever you do, don't take Paul Graham's advice on this stuff.  ;)

Hey, all I _did_ say was that based on what he said we shouldn't jump into
trying to block the poison before we look at what it really does... And that
turned out to be correct :-)



Comment 9 Theo Van Dinter 2004-05-22 09:47:38 UTC
May I propose we hold off on this ticket until 3.1?  It looks like we really need 
to do a bit of testing WRT how this stuff impacts scoring, what sections to 
"ignore" (I wouldn't ignore them, I'd just prepend something like "INVIS*" to 
the tokens), etc.  It's not really a killer for 3.0, so ...
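
A minimal sketch of the prefixing idea in the comment above, written in Python for illustration rather than taken from SpamAssassin: tokens from sections judged invisible are kept but tagged with "INVIS*", so the classifier learns them as distinct tokens instead of discarding them. The tokenize() and tag_tokens() helpers and the simple word regex are hypothetical; only the "INVIS*" prefix comes from the comment.

```python
import re

INVIS_PREFIX = "INVIS*"  # prefix suggested in the comment above


def tokenize(text):
    """Hypothetical tokenizer: lowercase runs of letters, digits, apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tag_tokens(visible_text, invisible_text):
    """Return visible tokens as-is and invisible tokens with the prefix added."""
    tokens = tokenize(visible_text)
    tokens += [INVIS_PREFIX + tok for tok in tokenize(invisible_text)]
    return tokens
```

Under this scheme the word "walrus" seen after an </HTML> tag becomes the token "INVIS*walrus", which is counted separately from "walrus" appearing in visible text.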
Comment 10 Justin Mason 2004-05-25 18:27:06 UTC
+1 on leaving it for 3.1.0.
Comment 11 Justin Mason 2004-05-25 19:35:21 UTC
actually, I'll just do the punt ;)
Comment 12 Daniel Quinlan 2004-08-27 17:19:08 UTC
more accuracy and performance bugs going to 3.1.0 milestone
Comment 13 Daniel Quinlan 2005-04-12 14:46:21 UTC
bumping to 3.2.0
Comment 14 Justin Mason 2006-12-12 12:40:20 UTC
moving RFEs and low-priority stuff to 3.3.0 target
Comment 15 Justin Mason 2008-08-13 14:01:24 UTC
this hasn't really turned out to be a problem. punting a bit further...