SA Bugzilla – Bug 2892
Disregard attempts at Bayes poison
Last modified: 2008-08-13 14:01:24 UTC
Looking at some recent spam, I see two mechanisms to introduce non-visible Bayes poison:
1. Put the poison in an alternative text/plain part on the assumption that most users will prefer the HTML alternative.
2. Put poison after a </HTML> tag in the HTML part.
I'd like to suggest, for purposes of Bayes learning, ignoring the content of a text/plain part if the HTML part has significant content, and ignoring anything after an </HTML> tag.
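The second suggestion (ignore anything after an </HTML> tag) could be sketched roughly as follows. This is an illustration in Python, not SpamAssassin code; the helper name `strip_after_html_close` is hypothetical, and cutting at the first closing tag rather than the last is an assumption.

```python
import re

# Hypothetical helper, not part of SpamAssassin: matches a closing </html>
# tag in any letter case, allowing whitespace before the '>'.
_HTML_CLOSE = re.compile(r"</html\s*>", re.IGNORECASE)


def strip_after_html_close(html: str) -> str:
    """Return the HTML up to and including the first closing </html> tag."""
    match = _HTML_CLOSE.search(html)
    if match is None:
        return html  # no closing tag found, so there is nothing to trim
    return html[:match.end()]
```

Only the returned text would then be handed to the Bayes tokenizer; whatever follows the closing tag is dropped before learning or scoring.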
Bug #2878 suggests that SA identify when the plain text and HTML differ in a multipart/alternative message. That would make it futile for spammers to use the first kind of poisoning.
Note that bug #2878 describes the phenomenon (HTML part differs from plain text), while I'm proposing a policy (ignore words in plain text for Bayesian analysis and learning when they differ from the HTML). I also note the additional Bayes poison hidden in the HTML part, which would still be there even if the plain text part were absent.
In case anyone picks this up, I would like to point to an article by Paul Graham (http://www.paulgraham.com/sofar.html) in which he talks about the statistical effects that this kind of poisoning has on Bayesian spam filters. See the second and third sections, headed "More Good Tokens" and "Fewer Bad Tokens". Based on what he says, it looks like the first step may be to see whether these "poison" words really do decrease the effectiveness of the Bayesian filter, or whether they in fact have no effect or maybe even help because the random words don't look like normal mail.
I've received over 30 spam messages this year which include "random" words after the </HTML> tag. Even after feeding each of these through sa-learn as I get them, each new one is scoring BAYES_00, BAYES_01, or BAYES_10 at best, and therefore not reaching my points threshold. So I'd like to see words after the </HTML> tag disregarded for Bayes learning.
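One possible way to see how extra words could drag a score down is the naive combining formula from Paul Graham's "A Plan for Spam". The sketch below uses that formula only as an illustration; it is not SpamAssassin's combiner, it skips Graham's "most interesting tokens" selection, and every per-token probability in it is made up.

```python
from math import prod


def graham_combined(probs):
    """Combine per-token spam probabilities with Graham's naive Bayes formula."""
    spam = prod(probs)                  # product of P(spam | token)
    ham = prod(1.0 - p for p in probs)  # product of P(ham | token)
    return spam / (spam + ham)


# Hypothetical token probabilities, for illustration only.
spammy_tokens = [0.99, 0.99, 0.95]  # tokens from the visible sales pitch
poison_tokens = [0.01] * 4          # appended words that mostly occur in ham

clean_score = graham_combined(spammy_tokens)
poisoned_score = graham_combined(spammy_tokens + poison_tokens)
```

Under these assumed numbers `clean_score` is close to 1 and `poisoned_score` is close to 0. Whether real "random" words behave like the hammy tokens assumed here is exactly what would need testing.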
another bayes-poison-related bug
this just needs 1 more tweak -- in a multipart/alt message with a plain part and an HTML part, we should consider the tokens from the plain part as "invisible".
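A rough sketch of that tweak, using Python's standard `email` package rather than SpamAssassin's own MIME parser. The function names are hypothetical, and treating every text/plain part as invisible whenever a text/html sibling exists is an assumption about the intended policy.

```python
from email import message_from_string
from email.message import Message


def _decoded_text(part: Message) -> str:
    """Decode a text part to str, falling back to UTF-8 on unknown charsets."""
    raw = part.get_payload(decode=True) or b""
    charset = part.get_content_charset() or "utf-8"
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:
        return raw.decode("utf-8", errors="replace")


def split_alternative_parts(msg: Message):
    """Return (visible, invisible) lists of text from multipart/alternative parts."""
    visible, invisible = [], []
    for part in msg.walk():
        if part.get_content_type() != "multipart/alternative" or not part.is_multipart():
            continue
        children = part.get_payload()
        has_html = any(c.get_content_type() == "text/html" for c in children)
        for child in children:
            ctype = child.get_content_type()
            if ctype == "text/html":
                visible.append(_decoded_text(child))
            elif ctype == "text/plain":
                # Assumed policy: plain text is "invisible" only if HTML is present.
                (invisible if has_html else visible).append(_decoded_text(child))
    return visible, invisible


def split_raw(raw: str):
    """Convenience wrapper: parse a raw message string, then split its parts."""
    return split_alternative_parts(message_from_string(raw))
```

A tokenizer could then treat the two lists differently, for example by skipping the invisible text or marking its tokens.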
BTW, as http://bugzilla.spamassassin.org/show_bug.cgi?id=3173#c26 (bug 3173 comment 26) shows, it looks like disregarding the bayes poison in the first text/plain part is not urgent. (PS: Sidney, whatever you do, don't take Paul Graham's advice on this stuff. ;)
> Sidney, whatever you do, don't take Paul Graham's advice on this stuff. ;)
Hey, all I _did_ say was that based on what he said we shouldn't jump into trying to block the poison before we look at what it really does... And that turned out to be correct :-)
May I propose we hold off on this ticket for 3.1? It looks like we really need to do a bit of testing WRT how this stuff impacts scoring, what sections to "ignore" (I wouldn't ignore them, I'd just prepend something like "INVIS*" to the tokens), etc. It's not really a killer for 3.0, so ...
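The "prepend something like INVIS*" idea might look like the sketch below. The tokenizer is a deliberately simple stand-in (lowercased `\w+` runs), not SpamAssassin's tokenizer, and the function name and `invisible` flag are hypothetical.

```python
import re

INVISIBLE_PREFIX = "INVIS*"  # prefix suggested in the comment above


def tokenize(text: str, invisible: bool = False) -> list:
    """Split text into lowercase word tokens, marking invisible ones with a prefix."""
    words = re.findall(r"\w+", text.lower())
    if invisible:
        # Prefixed tokens are learned separately from their visible twins,
        # so poison words cannot borrow the reputation of normal ham words.
        return [INVISIBLE_PREFIX + w for w in words]
    return words
```

With this approach nothing is thrown away: the invisible tokens still reach the Bayes database, just under different names, so their effect on scoring can be measured rather than assumed.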
+1 on leaving it for 3.1.0.
actually, I'll just do the punt ;)
more accuracy and performance bugs going to 3.1.0 milestone
bumping to 3.2.0
moving RFEs and low-priority stuff to 3.3.0 target
this hasn't really turned out to be a problem. punting a bit further...