SA Bugzilla – Bug 4170
Obfuscated text with (arbitrary) edit distances
Last modified: 2019-09-25 04:46:27 UTC
Obfuscated text is something thats fairly prevalent in email. Here's an edit distance algorithm, and a set of distance metrics thats pretty good at handling and dealing with many types of obfuscations. By using arbitrary costs instead of the Levenshtein Distance (all costs=1), it is much more sensitive to minor changes. The algorithm is paramatarized by 3 tables that allow seperate costs to be specified for inserting each type of character, deleting each type of character, and all pairs of replacing one character with another. (so, eg, deleting vowels is cheap, inserting [._,] is cheap, and replacing a 'i' with '1' is cheap.) Here is a sample output: lolita lol....ita : 0.04; lolita l*l*t* : 0.09; viagra v....ia...gra : 0.07; viagra v*agra : 0.03; viagra v*ag.r a : 0.06; lolita l()l!t.a : 0.07; lolita l(ss)l!t.a : 1.87; lolita l()l!t12.a : 0.67; mortgage mort(age : 0.91; From this it is clear it is robust with both the replacement of characters with '.' and with the insertion of extra *'s. It also handles inserted spaces. A sample selftest looking for FP's: viagra Agra : 1.5; lolita iota : 1.55; lolita lilt : 1.3; lolita Lola : 1.5; lolita Lolita's : 0.91; lolita loll : 1.55; lolita Lila : 1.6; Experience and experimentation is sure to find better distance tables. One flaw with the algorithm is that the performance on an AMD XP-2500+ is only about 10k matches per second. Some filter to remove tokens is necessary. I have a few ideas. One idea might be to only apply this to tokens that aren't in the bayes database. Another option would be to apply this to 'suspicious tokens' only.
Created attachment 2679 [details] Edit distance prototype
No movement -> future
The experience of the past dozen years indicates that as pattern matching gets fuzzier, marginal benefit falls while FPs go up with increasing detection of obfuscation. It is not clear that this general idea offers any real improvement over the existing (and mostly newer) narrowly targeted "FUZZY" and "OBFU" rules. Closing as "WONTFIX" instead of "INVALID" only because it is hypothetically possible that someone could someday pick up this idea and turn it into something testable against the real world.