Bug 4170 - Obfuscated text with (arbitrary) edit distances
Summary: Obfuscated text with (arbitrary) edit distances
Status: RESOLVED WONTFIX
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules (Eval Tests) (show other bugs)
Version: unspecified
Hardware: Other other
: P5 enhancement
Target Milestone: Future
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-03-03 21:33 UTC by Scott Crosby
Modified: 2019-09-25 04:46 UTC (History)
1 user (show)



Attachment Type Modified Status Actions Submitter/CLA Status
Edit distance prototype text/plain None Scott Crosby [NoCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Scott Crosby 2005-03-03 21:33:03 UTC
Obfuscated text is something thats fairly prevalent in email. Here's
an edit distance algorithm, and a set of distance metrics thats pretty
good at handling and dealing with many types of obfuscations.

By using arbitrary costs instead of the Levenshtein Distance (all
costs=1), it is much more sensitive to minor changes. The
algorithm is paramatarized by 3 tables that allow seperate costs to be
specified for inserting each type of character, deleting each type of
character, and all pairs of replacing one character with another. (so,
eg, deleting vowels is cheap, inserting [._,] is cheap, and replacing
a 'i' with '1' is cheap.)

Here is a sample output:

lolita   lol....ita      :  0.04;
lolita   l*l*t*          :  0.09;
viagra   v....ia...gra   :  0.07;
viagra   v*agra          :  0.03;
viagra   v*ag.r  a       :  0.06;
lolita   l()l!t.a        :  0.07;
lolita   l(ss)l!t.a      :  1.87;
lolita   l()l!t12.a      :  0.67;
mortgage         mort(age        :  0.91;

From this it is clear it is robust with both the replacement of
characters with '.' and with the insertion of extra *'s. It also handles
inserted spaces.

A sample selftest looking for FP's:

viagra   Agra    :  1.5;
lolita   iota    :  1.55;
lolita   lilt    :  1.3;
lolita   Lola    :  1.5;
lolita   Lolita's        :  0.91;
lolita   loll    :  1.55;
lolita   Lila    :  1.6;

Experience and experimentation is sure to find better distance tables.

One flaw with the algorithm is that the performance on an AMD XP-2500+
is only about 10k matches per second. Some filter to remove tokens is
necessary. I have a few ideas. One idea might be to only apply this to
tokens that aren't in the bayes database. Another option would be to
apply this to 'suspicious tokens' only.
Comment 1 Scott Crosby 2005-03-03 21:34:17 UTC
Created attachment 2679 [details]
Edit distance prototype
Comment 2 Theo Van Dinter 2006-09-10 06:06:21 UTC
No movement -> future
Comment 3 Bill Cole 2019-09-25 04:46:27 UTC
The experience of the past dozen years indicates that as pattern matching gets fuzzier, marginal benefit falls while FPs go up with increasing detection of obfuscation. 

It is not clear that this general idea offers any real improvement over the existing (and mostly newer) narrowly targeted "FUZZY" and "OBFU" rules. Closing as "WONTFIX" instead of "INVALID" only because it is hypothetically possible that someone could someday pick up this idea and turn it into something testable against the real world.