Bug 2398 - Detect nonsense in wordings
Summary: Detect nonsense in wordings
Status: RESOLVED DUPLICATE of bug 2528
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules
Version: unspecified
Hardware: All All
Importance: P5 enhancement
Target Milestone: 2.70
Assignee: SpamAssassin Developer Mailing List
URL: http://www.df.lth.se/~triad/krad/mark...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2003-09-01 13:05 UTC by Linus Walleij
Modified: 2003-10-02 06:59 UTC



Description Linus Walleij 2003-09-01 13:05:20 UTC
I have received a lot of spam trying to bypass Bayesian filters by introducing
"nonsense words" such as random keystrokes (fkgjdfkj) in order to bring down the
Bayes score for the entire message. (And they succeed.)

I devised a method to detect nonsense words using Markov chains (provided the
language of the mail is known) and made a reference implementation in Perl,
available at the URL given. Some demos are included, or you can just read my
reasoning on the page.
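
As a rough illustration of the idea (this is not the Perl reference
implementation from the URL, which is not reproduced here), a character-level
Markov chain can score how plausible each letter transition in a word is. The
sketch below is in Python. The training sentence, the add-one smoothing, the
alphabet size of 28 and the threshold of -2.5 are all assumptions picked for
the demo; a real model would be trained on a large corpus in the known language.

```python
import math
from collections import Counter

def train_bigram_model(text):
    """Count letter-to-letter transitions, with ^ and $ marking word boundaries."""
    pair_counts = Counter()   # (a, b) -> number of times b followed a
    first_counts = Counter()  # a -> number of times a was followed by anything
    for word in text.lower().split():
        padded = "^" + word + "$"
        for a, b in zip(padded, padded[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    return pair_counts, first_counts

def avg_log_prob(word, model, alphabet_size=28):
    """Mean log-probability per transition, using add-one smoothing.

    alphabet_size is an assumed constant (26 letters plus the two markers).
    """
    pair_counts, first_counts = model
    padded = "^" + word.lower() + "$"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        p = (pair_counts[(a, b)] + 1) / (first_counts[a] + alphabet_size)
        total += math.log(p)
    return total / (len(padded) - 1)

def is_nonsense(word, model, threshold=-2.5):
    """Flag a word whose average transition score falls below the threshold.

    The threshold is hypothetical and would need tuning on a real corpus.
    """
    return avg_log_prob(word, model) < threshold

# Toy training text, far too small for real use.
TRAINING_TEXT = (
    "the quick brown fox jumps over the lazy dog and then the other dog "
    "went home with the other fox"
)
MODEL = train_bigram_model(TRAINING_TEXT)

for w in ("the", "other", "fkgjdfkj"):
    print(w, round(avg_log_prob(w, MODEL), 2), is_nonsense(w, MODEL))
```

Averaging per transition keeps long words from being penalized just for their
length, which matters if the score is compared against a fixed threshold.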

Do you think it's a good idea to try to write up a real SA rule based on this
idea? The current code is very slow, but I believe it can be sped up significantly.

I need your advice on this: do / do not / stupid, etc.

(I also used this method to decide what language a message is written in, in
case one doesn't know. It works with most sufficiently long pieces of text.)
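
The language-guessing variant can be sketched the same way: train one
transition model per language and pick the language whose model gives the text
the highest likelihood. This Python sketch is only an assumption about how such
a comparison might look. The two sample sentences, the smoothing constant of 32
and the function names are made up for the demo, and with samples this small
it only works on words close to the training text.

```python
import math
from collections import Counter

# Tiny, made-up training samples standing in for real per-language corpora.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then the other dog "
          "went home with the other fox",
    "sv": "den snabba bruna hunden hoppar och sedan gick den andra hunden hem "
          "med den andra katten",
}

def transitions(text):
    """Yield (a, b) letter pairs for every word, with ^ and $ as boundaries."""
    for word in text.lower().split():
        padded = "^" + word + "$"
        yield from zip(padded, padded[1:])

def train(text):
    """Return (pair counts, counts of the first letter of each pair)."""
    pairs = Counter(transitions(text))
    firsts = Counter(a for a, _ in transitions(text))
    return pairs, firsts

def log_likelihood(text, model, smoothing_symbols=32):
    """Sum of smoothed log transition probabilities of text under one model."""
    pairs, firsts = model
    return sum(
        math.log((pairs[(a, b)] + 1) / (firsts[a] + smoothing_symbols))
        for a, b in transitions(text)
    )

MODELS = {lang: train(sample) for lang, sample in SAMPLES.items()}

def guess_language(text):
    """Pick the language whose model makes the text most likely."""
    return max(MODELS, key=lambda lang: log_likelihood(text, MODELS[lang]))

print(guess_language("the dog went home"))
print(guess_language("hunden gick hem"))
```

Because every model scores the same text, the total log-likelihoods can be
compared directly without normalizing by length.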
Comment 1 Sidney Markowitz 2003-09-01 14:20:11 UTC
Did you notice the TextCat package that is already included in SpamAssassin? It
is a slight modification of an open source module for Bayesian classification of
language based on hidden Markov models of n-grams.

I'm not so sure about using this for both detecting what language something is
written in and at the same time detecting nonsense words. It is one thing for an
English-only speaking user to tell SpamAssassin to reject anything written in,
for example, French. It is something else to tell it to reject anything that it
cannot conclusively categorize as being in any language. How do you distinguish
between "nonsense words" and input that TextCat can't conclusively classify?

Also, I've recently seen spam that has a lot of real English words thrown in
where they would not be seen, such as in the text MIME part of a
multipart/alternative HTML message, or hidden in invisible or one-point font.
They appear to be there to fool Bayesian spam classifiers, but they would not be
picked up by something that looks for nonsense words. The only approach that I
see working is the set of rules and filters that cause the Bayesian learner to
ignore text that is not visible to the recipient.

Of course the real test of your idea is to try it out and see how it actually
works on a corpus. But I suggest looking at the existing textcat.pm code to see
if it already contains what you were intending to implement for this.
Comment 2 Brian White 2003-09-02 06:45:17 UTC
Subject: Re: [SAdev]  New: Detect nonsense in wordings

> I have received a lot of spam trying to bypass Bayesian filters by introducing
> "nonsense words" such as random keystrokes (fkgjdfkj) in order to bring down the
> Bayes score for the entire message. (And they succeed.)

Why do they succeed?  The Bayes tests should only take the 15 "most interesting"
words and weight based on that.  Words not seen before get a weight of 0.4,
which makes them not very interesting and thus should not be included in the
final weighting.

I'm basing this on the "Plan for Spam" article and not the actual SA
implementation.  Please let me know if SA behaves differently than this.
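
To make the reasoning concrete, here is a small Python sketch of the selection
rule as described in "A Plan for Spam": keep the 15 tokens whose probability is
farthest from 0.5, give unseen tokens 0.4, and combine the rest. It is not
SpamAssassin's actual Bayes code. The function name, the per-token table and
the token strings are invented for the demo, and counting each token only once
is an assumption.

```python
import math

UNKNOWN_PROB = 0.4      # value given to tokens never seen in training
MAX_INTERESTING = 15    # how many tokens take part in the final score

def spam_probability(tokens, token_probs):
    """Combine the most interesting per-token spam probabilities.

    token_probs maps token -> spam probability strictly between 0 and 1.
    """
    probs = [token_probs.get(t, UNKNOWN_PROB) for t in set(tokens)]
    # "Interesting" means far from the neutral value 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:MAX_INTERESTING]
    spam = math.prod(top)
    ham = math.prod(1 - p for p in top)
    return spam / (spam + ham)

# Hypothetical training result: fifteen strongly spammy tokens.
token_probs = {f"spamword{i}": 0.99 for i in range(15)}
message = list(token_probs)

# The same message padded with 100 never-seen nonsense tokens.
padded = message + [f"zz{i}qx" for i in range(100)]

print(spam_probability(message, token_probs))
print(spam_probability(padded, token_probs))
```

In this setup the padded message scores the same as the unpadded one, because
the nonsense tokens sit at 0.4 and never displace the fifteen strong ones.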

                                          Brian
                                 ( bcwhite@precidia.com )

-------------------------------------------------------------------------------
    Many times the difference between failure and success is doing something
                   nearly right... or doing it exactly right.

Comment 3 Justin Mason 2003-09-02 11:05:54 UTC
Subject: Re: [SAdev]  Detect nonsense in wordings

> Why do they succeed?  The Bayes tests should only take the 15 "most interesting"
> words and weight based on that.  Words not seen before get a weight of 0.4,
> which makes them not very interesting and thus should not be included in the
> final weighting.
> 
> I'm basing this on the "Plan for Spam" article and not the actual SA
> implementation.  Please let me know if SA behaves differently than this.

No, that's about right.   But don't tell the spammers.

--j.

Comment 4 Linus Walleij 2003-10-02 14:59:56 UTC
Bug 2528 asks for the same thing and is already assigned, so let's say this is a duplicate.

*** This bug has been marked as a duplicate of 2528 ***