Bug 5380

Summary:	SUBJECT_FUZZY_MEDS triggers on un-obfuscated meds and meds in a word
Product:	Spamassassin	Reporter:	Michael Bietenholz <mfb6>
Component:	Rules	Assignee:	SpamAssassin Developer Mailing List <dev>
Status:	RESOLVED FIXED
Severity:	normal
Priority:	P2
Version:	3.1.8
Target Milestone:	3.3.0
Hardware:	PC
OS:	Linux
Whiteboard:

Description Michael Bietenholz 2007-03-14 11:53:24 UTC

I hope this hasn't already been posted. I did search bugs for "meds" & "fuzzy"
and didn't find anything, but I'm a newbie, so forgive me if I missed something.

The rule SUBJECT_FUZZY_MEDS is a little too general: it hits plain,
unobfuscated "meds" and also any occurrences of "meds" (and perhaps other
combinations; I'm not sure how the replace stuff works) inside a word in the
subject line, e.g. lameds, medscheat, premeds.

I suggest replacing it with  

 Subject =~ /\b(?!meds)<M><E><D><S>\b/i

which will only trigger on obfuscated meds as a separate word. Another option
is dropping the (?!meds), since the not-really-a-word meds is not too likely to
show up in a Subject except in spam.
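
For illustration, here is a minimal Python sketch comparing the current rule with the proposed one. FUZZY is an assumed, much-simplified stand-in for whatever <M><E><D><S> expands to (the real replace tag definitions in 25_replace.cf are broader and may differ), and the sample subjects are made up. Python's re module is used only because \b and (?!...) behave the same way there as in Perl for these ASCII cases.

```python
import re

# Assumed, simplified stand-in for the expansion of <M><E><D><S>.
# The real replace tag definitions in 25_replace.cf are broader.
FUZZY = r"m[e3]d[s5]"

# Current rule: no word boundaries, no exclusion of plain "meds".
current = re.compile(FUZZY, re.I)

# Proposed rule: whole word only, and not the plain spelling.
proposed = re.compile(r"\b(?!meds)" + FUZZY + r"\b", re.I)

subjects = [
    "refill your meds",     # plain, unobfuscated
    "premeds study group",  # inside a longer word
    "medscheat",            # plain spelling at the start of a word
    "cheap m3ds online",    # obfuscated, separate word
]

for subject in subjects:
    hit_current = bool(current.search(subject))
    hit_proposed = bool(proposed.search(subject))
    print(f"{subject!r}: current={hit_current} proposed={hit_proposed}")
```

Under these assumptions the current pattern hits all four subjects, while the proposed one hits only the obfuscated, stand-alone "m3ds".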

        cheers, without spamassassin I'd have given up email
        (well at least some accounts ;-)

              michael b

Comment 1 Justin Mason 2009-07-23 07:13:14 UTC

will try to fix for 3.3.0

Comment 2 Michael Scheidell 2009-08-12 10:17:45 UTC

looks like it will fp on anything with meds in the subject line, inside a word, etc

the following (small snippet) is enough to trigger this:
(save to a file; yes, just these lines are enough)

------------begin---
Subject: Someone: Review Meds
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="0-1130195460-1249928373=:51768"


--0-1130195460-1249928373=:51768
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: quoted-printable
-<<<EOF

Comment 3 Matus UHLAR - fantomas 2009-08-13 00:32:05 UTC

the rule also matches the Czech/Slovak words "medzi" (inter), "obmedzit" (to limit).
Yes, I'd be glad if we had a way to cut FPs down.

Comment 4 Justin Mason 2009-08-13 04:33:27 UTC

btw if we can get some samples of what it's _supposed_ to hit, that would help too.  (this is a new approach to rule regression testing I'm working on.)

Comment 5 Karsten Bräckelmann 2009-08-18 17:10:49 UTC

The problem is that the rule as-is will hit on any sub-string "meds". No word boundaries, no exclusion of the NON-obfuscated meds.

25_replace.cf:  header SUBJECT_FUZZY_MEDS  Subject =~ /<M><E><D><S>/i

Given comment 3, the proposed limiting in comment 0 seems entirely sensible. No plain non-obfuscated "meds". No half-assed obfuscated one within a longer word.

Maybe using (\b|_) rather than \b, to catch that pathetic this-is-a-word-char non-real-word char that underscore is.
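
A quick check of the underscore point, as a Python sketch. The literal m3ds stands in for one obfuscated spelling and the sample subject is made up; neither comes from the actual rule. Python's re treats _ as a word character just as Perl does, so \b does not fire between _ and a letter.

```python
import re

subject = "buy_m3ds_now"  # made-up sample subject

# Plain word boundaries: "_" counts as a word character, so no boundary
# exists between "_" and "m", or between "s" and "_".
strict = re.compile(r"\bm3ds\b", re.I)

# Accept either a real word boundary or a literal underscore.
relaxed = re.compile(r"(?:\b|_)m3ds(?:\b|_)", re.I)

print(strict.search(subject))   # no match
print(relaxed.search(subject))  # matches "_m3ds_"
```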

Comment 6 John Hardin 2009-08-18 17:25:03 UTC

(In reply to comment #5)

> Given comment 3, the proposed limiting in comment 0 seems entirely sensible. No
> plain non-obfuscated "meds". No half-assed obfuscated one within a longer word.
> 
> Maybe using (\b|_) rather than \b, to catch that pathetic this-is-a-word-char
> non-real-word char that underscore is.

+1 to both.

Performance question: which is more efficient?

(?:\b|_)x(?:\b|_)

\b_*x_*\b

Comment 7 Justin Mason 2009-08-19 05:23:39 UTC

btw, check the rescoring bug; most of the FUZZY ruleset got zeroed scores.

Comment 8 Karsten Bräckelmann 2009-08-20 06:46:51 UTC

(In reply to comment #6)
> Performance question: which is more efficient?

They are not equivalent.

  "a_x" =~ /(?:\b|_)x(?:\b|_)/  &&  "a_x" !~ /\b_*x_*\b/

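The same check can be reproduced as a small Python sketch (the two extra test strings are made up for contrast). The second pattern requires a real word boundary on the outside of the underscores, and there is none between "a" and "_", since both are word characters.

```python
import re

alternation = re.compile(r"(?:\b|_)x(?:\b|_)")
starred = re.compile(r"\b_*x_*\b")

# "a_x" is the case from the comment; the other two are for contrast.
for s in ["a_x", "_x_", "a x b"]:
    print(repr(s), bool(alternation.search(s)), bool(starred.search(s)))
```

Only "a_x" separates the two patterns: the alternation form matches it and the starred form does not, while both match "_x_" and "a x b".
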
Comment 9 Justin Mason 2009-08-31 16:06:50 UTC

if we want to change this for 3.3.0, it needs to be in SVN by this Thursday; see bug 6155.

Comment 10 John Hardin 2009-08-31 16:42:58 UTC

svn commit -m 'bug 5380: fix SUBJECT_FUZZY_MEDS FP on unobfuscated "meds"'
Sending        rules/25_replace.cf
Transmitting file data .
Committed revision 809780.