Bug 2908 - Use bayes translation to decrease effectiveness of intentional misspellings
Summary: Use bayes translation to decrease effectiveness of intentional misspellings
Status: NEW
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Libraries
Version: 2.61
Hardware: PC Linux
Importance: P5 enhancement
Target Milestone: Future
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2004-01-08 02:40 UTC by Chris Thielen
Modified: 2010-04-21 20:56 UTC
CC List: 0 users

Attachments:
Diff to add some Bayes translations to detect misspelling obfu (patch) - Chris Thielen [HasCLA]
Shell script to (repeatedly) train bayes and mass-check a corpus (text/plain) - Chris Thielen [HasCLA]
Results of shell script testing modifications to bayes (application/octet-stream) - Chris Thielen [HasCLA]

Description Chris Thielen 2004-01-08 02:40:33 UTC
The latest crop of spam I receive contains misspellings of spam-sign words, such
as generic, viagra, paris, hilton.  Some simple examples of permutations I
receive are geenric vvvaigraa ppariis hilllton.  To counteract this, I have
written a simple modification to sub tokenize_line in Bayes.pm.

pseudocode:

(For each non-header token)
  Strip sk: prefix from token if it was added previously
  Remove all non-alpha characters
  Force token to lowercase (I have no idea if this is a good idea)
  Sort the characters in the string (bananas => aaabnns)
  Prepend sk: to string if we stripped it
  Add new token to bayes token list
  Strip any repeated characters (aaabnns => abns)
  Add new token to bayes token list
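
For reference, here is a minimal self-contained Perl version of the transform the pseudocode describes; it is a sketch rather than attachment 1666 itself, and the sub name obfu_tokens is mine:

  # Given one body token, return the extra Bayes tokens to add for it.
  sub obfu_tokens {
      my ($token) = @_;

      # Strip the sk: prefix if an earlier pass added it, and remember that we did.
      my $had_sk = ($token =~ s/^sk://);

      # Force lowercase and remove all non-alpha characters.
      (my $word = lc $token) =~ s/[^a-z]//g;
      return () unless length $word;

      # Sort the characters in the string (bananas => aaabnns).
      my $sorted = join '', sort split //, $word;

      # Strip repeated characters from the sorted form (aaabnns => abns).
      (my $squeezed = $sorted) =~ tr/a-z//s;

      # Re-prepend sk: if we stripped it, and hand back both extra tokens.
      my @extra = ($sorted, $squeezed);
      @extra = map { "sk:$_" } @extra if $had_sk;
      return @extra;
  }

For example, obfu_tokens('vvvaigraa') would return ('aaagirvvv', 'agirv').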

This has the effect that the words translate as such:

generic, viagra, paris, hilton
debug: BAYES TRANSLATE: generic: ceeginr, ceginr
debug: BAYES TRANSLATE: viagra: aagirv, agirv
debug: BAYES TRANSLATE: paris: aiprs, aiprs
debug: BAYES TRANSLATE: hilton: hilnot, hilnot

geenric vvvaigraa ppariis hilllton
debug: BAYES TRANSLATE: geenric: ceeginr, ceginr
debug: BAYES TRANSLATE: vvvaigraa: aaagirvvv, agirv
debug: BAYES TRANSLATE: ppariis: aiipprs, aiprs
debug: BAYES TRANSLATE: hilllton: hilllnot, hilnot

In my Bayes database, agirv, aiprs, and hilnot all score very high; ceginr scores
neutrally.
Comment 1 Chris Thielen 2004-01-08 02:42:35 UTC
Created attachment 1666
Diff to add some Bayes translations to detect misspelling obfu

Bayes.dist.pm is from Debian's spamassassin package 2.61-2
Comment 2 Chris Thielen 2004-01-08 02:50:01 UTC
I forgot to mention:
The changes in the patch I posted (attachment 1666) have not been tested beyond
nuking my bayes DB and sa-learning my entire ham/spam corpus (~4000/2000).  I've
run spamassassin -D rulesrun=255 on a few obfu spams to do initial validation of
the bayes translation and subsequent minimal bayes score analysis.

I have left in the dbg call (which will be VERY excessive in any real environment)
and there are extra (cycle-consuming) variables, etc.  I've made no attempt at any
kind of optimization whatsoever.

Comment 3 Marc Perkel 2004-01-08 06:06:12 UTC
Subject: Re: New: Use bayes translation to decrease effectiveness of intentional misspellings

Here's something I'm doing to catch misspellings.

I have a list of about 100 words that are commonly deliberately misspelled. I
first remove all the words that are correctly spelled based on this list. Then I
translate characters - @ to a, 0 to o, 1 to i, etc. I then remove all punctuation
and space characters. Then I check for the misspelled words again after
spell-correcting them, and if there's a match, it's spam.
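
A rough sketch of that flow in Perl, assuming a tiny stand-in word list (the real list has ~100 words) and only the character translations named above; the fuzzy spell-correction step is reduced here to that translation plus punctuation removal:

  my @watch_words = qw(viagra generic mortgage);   # stand-ins for the ~100-word list

  sub looks_like_obfuscated_spam {
      my ($text) = @_;

      my $watch_re = join '|', map { quotemeta } @watch_words;

      # First drop correctly spelled occurrences of the watched words.
      (my $body = lc $text) =~ s/\b(?:$watch_re)\b//g;

      # Translate look-alike characters: @ -> a, 0 -> o, 1 -> i.
      $body =~ tr/@01/aoi/;

      # Remove punctuation and space characters so split-up words rejoin.
      $body =~ s/[[:punct:]\s]+//g;

      # If a watched word now appears, the message was obfuscating it.
      return $body =~ /$watch_re/ ? 1 : 0;
  }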

Comment 4 Sidney Markowitz 2004-01-08 08:36:25 UTC
In an article by Paul Graham at http://www.paulgraham.com/sofar.html he says,

"Misspellings end up having higher spam probabilities than the words they're
intended to conceal. In my filter the spam probability of "Viagra" is .9848, and
of "V1agra" .9998. [1] For this kind of trick to work, you have to be the first
person to use nearly every misspelling in a spam. The odds of doing that are
low, and if you fail you merely teach the filter all the new misspellings."

I'm not taking his writings as gospel, but it shows that you should be very
careful in how you test this patch to see what effect it has on scores under
different circumstances. There are only so many ways to misspell viagra, and if
spammers use them they should all get very high scores very quickly.
Comment 5 Chris Thielen 2004-01-11 01:13:16 UTC
Re: Paul Graham's misspelling comment

Sidney:
I don't doubt Paul's assertion that misspelled spam words tend to be scored much
higher due to exclusive use in spam.  However, I do think that the amount of
spam required for a database to reach this level of effectiveness is higher than
what is commonly fed for training.

It is often said that a misspelling of v1agra is only useful (to a spammer) one
time, and then Bayes will take care of it.  In my opinion, the problem with this
argument is that there are simply too many possible permutations of viiaarga
(rearranging the six letters of viagra alone yields 6!/2! = 360 distinct strings,
before doubled letters are considered).  My personal experience is that my Bayes
DB has not already seen many of the permutations I get (and a significant amount
of my spam is vaariaga).

Comment 6 Chris Thielen 2004-01-11 01:14:43 UTC
I wrote a script to test the effectiveness of this modification to Bayes
processing.  In short, it appears that the patch I submitted does help catch
obfuscated spellings, but hurts Bayes overall in most of the test cases.

The test script itself (with some modification) might be of use to somebody in
the future to test the effectiveness of other bayes changes, however.  I've
attached it as "testbayes".

The test script sets three Bayes expire thresholds (I was trying to see if they
would affect overall accuracy), but my corpus wasn't large enough to require
expiration except at the 150000 threshold.  For each expire threshold (i.e. Bayes
DB size), three Bayes algorithms are tested: the original 2.61 code, custom
version 1, and custom version 2.  Custom version 1 adds two extra tokens for each
body token: first, the token made by sorting the characters in the word; second,
the token made by then removing all duplicate letters from the sorted form.
Custom version 2 adds only one extra body token: the token made by sorting the
characters in the word and removing all duplicate letters.  For each iteration,
two scenarios are simulated.  In the first, Bayes is trained on all email older
than a cutoff of 45 days in the past, and everything from 45 days ago until now
is mass-checked.  The second scenario is shorter, with a cutoff of 15 days.
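
For clarity, the difference between the two variants in terms of the extra tokens each adds per body token (a sketch with hypothetical sub names; the sk: prefix handling is omitted):

  sub extra_tokens_v1 {
      # Custom version 1: add both the sorted token and the sorted-and-deduped token.
      my $word = lc $_[0];
      $word =~ s/[^a-z]//g;
      return () unless length $word;
      my $sorted = join '', sort split //, $word;
      (my $squeezed = $sorted) =~ tr/a-z//s;
      return ($sorted, $squeezed);
  }

  sub extra_tokens_v2 {
      # Custom version 2: add only the sorted-and-deduped token.
      my ($sorted, $squeezed) = extra_tokens_v1($_[0]);
      return defined $squeezed ? ($squeezed) : ();
  }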

I have also attached all the results of my tests.  The original intent was to
catch obfuscated, misspelled, or rearranged words.  Sampling a couple of emails
of this style indicates that their Bayes score is indeed stronger.  However, it
seems that most test scenarios show a weaker bayes overall, with a higher
concentration of neutral scores.
Comment 7 Chris Thielen 2004-01-11 01:16:36 UTC
Created attachment 1676
Shell script to (repeatedly) train bayes and mass-check a corpus
Comment 8 Chris Thielen 2004-01-11 01:17:11 UTC
Created attachment 1677
Results of shell script testing modifications to bayes
Comment 9 Daniel Quinlan 2004-08-27 17:19:43 UTC
more accuracy and performance bugs going to 3.1.0 milestone
Comment 10 Daniel Quinlan 2005-04-06 12:42:20 UTC
Chris,

We need an individual CLA for you.

  http://www.apache.org/licenses/#clas

Also moving this to 3.2.0, need to make some decisions and this isn't ready
yet...
Comment 11 Chris Thielen 2005-04-07 09:41:29 UTC
CLA dropped in the mail.
Comment 12 Justin Mason 2006-12-12 12:40:22 UTC
moving RFEs and low-priority stuff to 3.3.0 target
Comment 13 Justin Mason 2010-01-27 02:21:17 UTC
moving most remaining 3.3.0 bugs to 3.3.1 milestone
Comment 14 Justin Mason 2010-01-27 03:17:00 UTC
reassigning, too
Comment 15 Justin Mason 2010-03-23 16:34:13 UTC
moving all open 3.3.1 bugs to 3.3.2
Comment 16 Karsten Bräckelmann 2010-03-23 17:43:13 UTC
Moving back off of Security, which got changed by accident during the mass Target Milestone move.