SA Bugzilla – Bug 5380
SUBJECT_FUZZY_MEDS triggers on un-obfuscated meds and meds in a word
Last modified: 2009-08-31 16:42:58 UTC
I hope this hasn't already been posted. I did search bugs for "meds" and "fuzzy" and didn't find anything, but I'm a newbie, so forgive me if I missed something.

The rule SUBJECT_FUZZY_MEDS is a little too general: it hits plain, unobfuscated "meds" and also any occurrence of "meds" (and perhaps other combinations; I'm not sure how the replace stuff works) inside a word in the subject line, e.g. lameds, medscheat, premeds.

I suggest replacing it with

  Subject =~ /\b(?!meds)<M><E><D><S>\b/i

which will only trigger on obfuscated meds as a separate word, or possibly losing the (?!meds), since the not-really-a-word "meds" is not too likely to show up in a Subject except in spam.

Cheers, without spamassassin I'd have given up email (well, at least some accounts ;-)
michael b
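For readers who want to see the difference between the current and the proposed pattern, here is a minimal sketch in Python rather than Perl. The character classes are hypothetical, simplified stand-ins for the <M>, <E>, <D> and <S> replace tags; the real definitions in 25_replace.cf are broader, so this only illustrates the word-boundary and lookahead behaviour, not the actual rule.

```python
import re

# Hypothetical, simplified stand-ins for the <M><E><D><S> replace tags.
# The real tag definitions in 25_replace.cf are broader than these.
M, E, D, S = "m", "[e3]", "d", "[s5]"
FUZZY = M + E + D + S

# Current rule: plain substring match, no boundaries, no exclusion.
current = re.compile(FUZZY, re.IGNORECASE)
# Proposed rule: separate word only, and not the literal word "meds".
proposed = re.compile(r"\b(?!meds)" + FUZZY + r"\b", re.IGNORECASE)


def hits(pattern, subject):
    """Return True if the pattern matches anywhere in the subject."""
    return pattern.search(subject) is not None


subjects = ["Review Meds", "lameds", "medscheat", "premeds", "cheap m3ds here"]
for s in subjects:
    print(f"{s!r:20} current={hits(current, s)} proposed={hits(proposed, s)}")
```

In this sketch the current pattern hits all five subjects, while the proposed one hits only the obfuscated "m3ds" standing as its own word.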
will try to fix for 3.3.0
Looks like it will FP on anything with "meds" in the subject line, inside a word, etc. The following small snippet is enough to trigger it (save it to a file; yes, just these lines are enough):

------------begin---
Subject: Someone: Review Meds
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="0-1130195460-1249928373=:51768"

--0-1130195460-1249928373=:51768
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: quoted-printable

-<<<EOF
The rule also matches the Czech/Slovak words "medzi" (inter) and "obmedzit" (to limit). Yes, I'd be glad if we had a way to cut the FPs down.
btw if we can get some samples of what it's _supposed_ to hit, that would help too. (this is a new approach to rule regression testing I'm working on.)
The problem is that the rule as-is will hit on any substring "meds". No word boundaries, no exclusion of the NON-obfuscated "meds".

25_replace.cf:
  header SUBJECT_FUZZY_MEDS Subject =~ /<M><E><D><S>/i

Given comment 3, the proposed limiting in comment 0 seems entirely sensible. No plain non-obfuscated "meds". No half-assed obfuscated one within a longer word.

Maybe using (\b|_) rather than \b, to catch that pathetic this-is-a-word-char non-real-word char that underscore is.
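The underscore idea can be sketched the same way, again in Python instead of Perl and with hypothetical stand-ins for the replace tags. The "z" in the stand-in for <S> is an assumption that would explain the "medzi" hits reported in comment 3; it is not taken from the real tag definition.

```python
import re

# Hypothetical stand-ins for the replace tags; NOT the real 25_replace.cf tags.
# The "z" in the last class is an assumption that would explain the "medzi" report.
FUZZY = "m" + "[e3]" + "d" + "[s5z]"

# Rule as-is: matches the fuzzy sequence anywhere.
substring = re.compile(FUZZY, re.IGNORECASE)
# Comment 0 proposal: separate word only, excluding literal "meds".
word_only = re.compile(r"\b(?!meds)" + FUZZY + r"\b", re.IGNORECASE)
# Variant that also accepts "_" as a separator, since \b treats it as a word char.
with_underscore = re.compile(
    r"(?:\b|_)(?!meds)" + FUZZY + r"(?:\b|_)", re.IGNORECASE
)


def hits(pattern, subject):
    """Return True if the pattern matches anywhere in the subject."""
    return pattern.search(subject) is not None


for s in ["medzi", "obmedzit", "buy_m3ds_now", "buy_meds_now"]:
    print(s, hits(substring, s), hits(word_only, s), hits(with_underscore, s))
```

Under these assumptions the substring form hits all four subjects, the \b-only form hits none of them, and the underscore variant hits only "buy_m3ds_now".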
(In reply to comment #5)
> Given comment 3, the proposed limiting in comment 0 seems entirely sensible. No
> plain non-obfuscated "meds". No half-assed obfuscated one within a longer word.
>
> Maybe using (\b|_) rather than \b, to catch that pathetic this-is-a-word-char
> non-real-word char that underscore is.

+1 to both.

Performance question: which is more efficient?

  (?:\b|_)x(?:\b|_)
  \b_*x_*\b
btw, check the rescoring bug; most of the FUZZY ruleset got zeroed scores.
(In reply to comment #6)
> Performance question: which is more efficient?

They are not equivalent.

  "a_x" =~ /(?:\b|_)x(?:\b|_)/ && "a_x" !~ /\b_*x_*\b/
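The non-equivalence is easy to check. The sketch below uses Python's re module, whose \b also treats "_" as a word character, so it should mirror the Perl behaviour for these ASCII inputs. In "a_x" there is no word boundary between "a" and "_" or between "_" and "x", so the second form has nowhere to start matching.

```python
import re

# The two candidate forms from comment 6, with "x" as the placeholder word.
alternation = re.compile(r"(?:\b|_)x(?:\b|_)")
optional_underscores = re.compile(r"\b_*x_*\b")


def both(subject):
    """Return (alternation matched, optional-underscore form matched)."""
    return (
        alternation.search(subject) is not None,
        optional_underscores.search(subject) is not None,
    )


for s in ["a_x", "_x_", "a x b", "axb"]:
    print(repr(s), both(s))
```

Only "a_x" separates the two forms here: the alternation matches it and the \b_*x_*\b form does not. Both forms match "_x_" and "a x b", and neither matches "axb".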
if we want to change this for 3.3.0, it needs to be in SVN by this Thursday; see bug 6155.
svn commit -m 'bug 5380: fix SUBJECT_FUZZY_MEDS FP on unobfuscated "meds"'
Sending        rules/25_replace.cf
Transmitting file data .
Committed revision 809780.