SA Bugzilla – Bug 2398
Detect nonsense in wordings
Last modified: 2003-10-02 06:59:56 UTC
I have received a lot of spam that tries to bypass Bayesian filters by introducing "nonsense words", such as random keystrokes (fkgjdfkj), in order to bring down the Bayes point score for the entire message. (And they succeed.) I devised a method to detect nonsense words using Markov chains (provided the language of the mail is known) and made a reference implementation in Perl, available at the URL given. Some demos are included, or you can just read my reasoning on the page.

Do you think it's a good idea to try to write a real SA rule based on this idea? The current code is very slow, but I believe it can be improved significantly. I need your advice on this: do / do not / stupid, etc.

(I also used this method to decide what language a message is written in, in case one doesn't know. It works with most sufficiently long pieces of text.)
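To make the idea concrete, here is a minimal sketch of a character-level Markov chain scorer. It is not the reference implementation mentioned above (that one is in Perl and is not reproduced in this report). The function names, the add-one smoothing, the alphabet size of 28 and the threshold of -3.0 are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def train_bigram_model(words):
    """Count character-to-character transitions; ^ and $ mark word boundaries."""
    counts = defaultdict(lambda: defaultdict(int))
    for word in words:
        padded = "^" + word.lower() + "$"
        for a, b in zip(padded, padded[1:]):
            counts[a][b] += 1
    return counts


def avg_log_prob(word, counts, alphabet_size=28, smoothing=1.0):
    """Mean log-probability per transition, using additive smoothing.

    alphabet_size=28 (26 letters plus two boundary symbols) is an assumption.
    """
    padded = "^" + word.lower() + "$"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        row = counts.get(a, {})
        row_total = sum(row.values())
        p = (row.get(b, 0) + smoothing) / (row_total + smoothing * alphabet_size)
        total += math.log(p)
    return total / (len(padded) - 1)


def looks_like_nonsense(word, counts, threshold=-3.0):
    """Flag a word whose transitions are unlikely under the trained model.

    The threshold is a hypothetical value; it would need tuning on a corpus.
    """
    return avg_log_prob(word, counts) < threshold


# Tiny illustrative training set; a real model needs far more text.
training = "the message was received and the filter scored the entire message as spam".split()
model = train_bigram_model(training)
```

With a model trained on enough text in the known language, a word like "fkgjdfkj" should score well below ordinary words, because most of its letter transitions are rare or unseen.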
Did you notice the TextCat package that is already included in SpamAssassin? It is a slight modification of an open source module for Bayesian classification of language based on hidden Markov models of n-grams.

I'm not so sure about using this both for detecting what language something is written in and, at the same time, for detecting nonsense words. It is one thing for an English-only speaking user to tell SpamAssassin to reject anything written in, for example, French. It is something else to tell it to reject anything that it cannot conclusively categorize as being in any language. How do you distinguish between "nonsense words" and input that TextCat can't conclusively classify?

Also, I've recently seen spam that has a lot of real English words thrown in where they would not be seen, such as in the text MIME part of a multipart/alternative HTML message, or hidden in an invisible or one-point font. They appear to be there to fool Bayesian spam classifiers, but they would not be picked up by something that looks for nonsense words. The only approach that I see working is the rules and filters that cause the Bayesian learner to ignore text that is not visible to the recipient.

Of course, the real test of your idea is to try it out and see how it actually works on a corpus. But I suggest looking at the existing textcat.pm code to see if it already contains what you were intending to implement for this.
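The "cannot conclusively classify" problem can be illustrated with a small sketch. This is not the textcat.pm code. It is a rank-order comparison of n-gram profiles in the style of Cavnar and Trenkle, and whether that matches what TextCat does internally is not confirmed by this report. The function names, the profile size and the 1.05 margin are assumptions.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top=300):
    """Map each of the most frequent character n-grams to its frequency rank."""
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(top)]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile, max_penalty):
    """Sum of rank differences; n-grams missing from the language cost max_penalty."""
    distance = 0
    for gram, rank in doc_profile.items():
        if gram in lang_profile:
            distance += abs(rank - lang_profile[gram])
        else:
            distance += max_penalty
    return distance


def classify(text, lang_profiles, margin=1.05, top=300):
    """Return the closest language, or None when the runner-up is too close.

    margin=1.05 is a hypothetical value for the inconclusive case.
    """
    doc = ngram_profile(text, top=top)
    scored = sorted((out_of_place(doc, profile, top), lang)
                    for lang, profile in lang_profiles.items())
    best_score, best_lang = scored[0]
    if len(scored) > 1 and scored[1][0] <= best_score * margin:
        return None  # inconclusive: no single language clearly wins
    return best_lang
```

Note that a None result only says no language stood out. Random keystrokes, very short text and mixed-language text can all end up there, which is why "inconclusive" and "nonsense" are hard to tell apart with this kind of classifier alone.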
Subject: Re: [SAdev] New: Detect nonsense in wordings

> I have received a lot of spam trying to bypass bayesian filters by introducing
> "nonsense words" such as random keystrokes (fkgjdfkj) in order to bring down the
> bayes point score for the entire message. (And they succeed.)

Why do they succeed? The Bayes tests should only take the 15 "most interesting" words and weight based on that. Words not seen before get a weight of 0.4, which makes them not very interesting, and thus they should not be included in the final weighting.

I'm basing this on the "Plan for Spam" article and not the actual SA implementation. Please let me know if SA behaves differently than this.

Brian ( bcwhite@precidia.com )
-------------------------------------------------------------------------------
Many times the difference between failure and success is doing something nearly right... or doing it exactly right.
Subject: Re: [SAdev] Detect nonsense in wordings

> Why do they succeed? The Bayes tests should only take the 15 "most interesting"
> words and weight based on that. Words not seen before get a weight of 0.4
> which makes them not very interesting and thus should not be included in the
> final weighting.
>
> I'm basing this on the "Plan for Spam" article and not the actual SA
> implementation. Please let me know if SA behaves differently than this.

No, that's about right. But don't tell the spammers.

--j.
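The weighting described in the quoted text can be sketched as follows. This follows the scheme from the "Plan for Spam" article (15 most interesting tokens, 0.4 for unseen tokens, naive Bayes combination) and is not the actual SA implementation. The function name and the token names in the example are made up for illustration.

```python
def spam_probability(tokens, token_probs, unknown=0.4, max_interesting=15):
    """Combine per-token spam probabilities, "Plan for Spam" style.

    tokens:      the words found in the message
    token_probs: learned spam probability per known token
    unknown:     probability assigned to tokens never seen before
    """
    probs = [token_probs.get(token, unknown) for token in set(tokens)]
    # "Interesting" means far from the neutral value 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    chosen = probs[:max_interesting]

    spam = 1.0
    ham = 1.0
    for p in chosen:
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham)


# Hypothetical learned probabilities for 15 strongly spammy tokens.
known = {f"spamword{i}": 0.99 for i in range(15)}
message = list(known)
padded = message + [f"xq{i}zk" for i in range(50)]  # random-looking filler

print(spam_probability(message, known) == spam_probability(padded, known))
```

In this sketch the filler tokens sit only 0.1 away from 0.5, so they never displace the 15 strongly scored tokens and the result is unchanged. They can only enter the calculation when a message has fewer than 15 more interesting tokens, and even then each one shifts the result only slightly.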
That one asks for the same thing and is already assigned, so let's say this is a duplicate.

*** This bug has been marked as a duplicate of 2528 ***