Bug 5992 - False positives on SUBJECT_FUZZY_TION rule
Summary: False positives on SUBJECT_FUZZY_TION rule
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules
Version: 3.2.5
Hardware: All All
Importance: P5 trivial
Target Milestone: 3.2.6
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-10-01 17:16 UTC by Ned Slider
Modified: 2008-12-31 09:29 UTC

Description Ned Slider 2008-10-01 17:16:46 UTC
The SUBJECT_FUZZY_TION rule generates false positives on any word containing "tition"

For example, some legitimate non-obfuscated examples that trigger this rule:

competition (OK, likely spam term but not obfuscated)
partition
petition
practitioner
repartition
repetition
superstition
Comment 1 Sidney Markowitz 2008-10-01 20:10:34 UTC
It's worse than that. cotton and mutton also match.

The mass-check S/O for the rule is pretty bad. At first I thought we should just delete the rule, but the source of this problem is something else that might need fixing. The sandbox file emailed/00_FVGT_File001.cf redefines all of the replace_tags that are used by the ReplaceTags plugin, overwriting what is defined in rules/25_replace.cf.

00_FVGT_File001.cf has new values for the replace_tags G,I,Q,S,T, and W. The redefinition of I is what is messing up this rule, as it makes tition the same as tiiion, and makes tton the same as tion.
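
To illustrate the effect described above, here is a minimal Python sketch. The tag character classes, the per-letter repetition, and the helper names are all hypothetical simplifications; the real definitions in rules/25_replace.cf and 00_FVGT_File001.cf are not reproduced here. The only thing it assumes from this bug is that the redefined I tag also accepts a plain "t".

```python
import re

# Hypothetical, simplified tag classes (NOT the real replace_tag values).
STANDARD_TAGS = {"T": "[t7+]", "I": "[il1!|]", "O": "[o0]", "N": "[n]"}
# Assumed redefinition in which the I tag also accepts a plain "t".
REDEFINED_TAGS = dict(STANDARD_TAGS, I="[il1!|t]")


def build_fuzzy_tion(tags):
    """Build a fuzzy 'tion' pattern from tag classes.

    Each tag may repeat (an assumption), and the negative lookahead
    skips the plain, unobfuscated spelling "tion".
    """
    body = "".join(f"{tags[name]}+" for name in "TION")
    return re.compile(f"(?!tion){body}", re.IGNORECASE)


def hits(tags, subject):
    """Return True if the fuzzy pattern built from `tags` matches `subject`."""
    return build_fuzzy_tion(tags).search(subject) is not None


if __name__ == "__main__":
    for word in ["petition", "partition", "cotton", "mutton", "t1on"]:
        print(word, hits(STANDARD_TAGS, word), hits(REDEFINED_TAGS, word))
```

With the narrower I class, only the obfuscated "t1on" matches. Once I also accepts "t", the second "t" in "tition" and in "tton" is consumed as an I, so ordinary words start to hit.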

I don't think that 00_FVGT_File001.cf should be doing anything that has a global effect on rules that it is not defining. We should remove the unnecessary replace_tag definitions that are already in the standard rule set, and rename the changed tags to G2,I2,Q2,S2,T2, and W2 so that they are only referenced in the rules defined in that file.

I'm a little hesitant to push this into the rule updates because it will have the effect of reverting all the replace_tag rules in other files to the behavior they had before any 00_FVGT_File001.cf rules were promoted. But it does seem like the right thing to do. Can anyone think of a way to test this without pushing it out to everyone? Would it work to check in a change to 00_FVGT_File001.cf and compare the S/O's of all rules in 25_replace.cf before and after the change, knowing that nobody will see the change until we push an update through the channel?
Comment 2 Sidney Markowitz 2008-10-01 21:01:13 UTC
Here are the last mass check results for the fuzzy rules from 25_replace.cf, preserved here for posterity and to make it easier to compare results after I change it. Since the change should not propagate until the update channel is explicitly pushed out, I'm going to go ahead and check the fix into the sandbox and see what these results look like in the next nightly that uses it.

SPAM%   HAM%   S/O  RANK    NAME
1.5053 0.0046 0.997 0.91 SUBJECT_FUZZY_MEDS 
0.3093 0.0000 1.000 0.81 FUZZY_GUARANTEE 
0.2751 0.0077 0.973 0.80 FUZZY_ERECT 
0.2378 0.0278 0.895 0.77 FUZZY_CPILL 
0.1713 0.0031 0.982 0.76 __SUBJECT_FUZZY_VPILL 
0.1473 0.0000 1.000 0.76 FUZZY_PRICES 
0.0876 0.0000 1.000 0.71 SUBJECT_FUZZY_VPILL 
0.0743 0.0000 1.000 0.69 FUZZY_MEDICATION 
0.0674 0.0031 0.956 0.68 FUZZY_PHARMACY 
0.0662 0.0046 0.935 0.68 FUZZY_XPILL 
0.1210 0.0880 0.579 0.67 FUZZY_VPILL 
0.0493 0.0000 1.000 0.66 SUBJECT_FUZZY_PENIS 
0.0413 0.0000 1.000 0.64 FUZZY_PRESCRIPT 
0.0517 0.0432 0.545 0.63 FUZZY_CREDIT 
0.0209 0.0015 0.931 0.58 FUZZY_OFFERS 
0.1062 0.3614 0.227 0.53 SUBJECT_FUZZY_TION 
0.0110 0.0448 0.197 0.50 FUZZY_AMBIEN 
0.0045 0.0154 0.224 0.49 FUZZY_VLIUM 
0.0006 0.0000 1.000 0.49 FUZZY_MONEY 
0.0003 0.0000 1.000 0.49 FUZZY_OBLIGATION 
0.0003 0.0000 1.000 0.49 FUZZY_MORTGAGE 
0.0001 0.0000 1.000 0.49 FUZZY_SOFTWARE 
0.0001 0.0000 1.000 0.49 FUZZY_MERIDIA 
0.0000 0.0000 0.500 0.49 FUZZY_BILLION 
0.0000 0.0000 0.500 0.49 FUZZY_PHENT 
0.0000 0.0000 0.500 0.49 FUZZY_REMOVE 
0.0000 0.0000 0.500 0.49 FUZZY_THOUSANDS 
0.0000 0.0000 0.500 0.49 FUZZY_AFFORDABLE 
0.0000 0.0000 0.500 0.49 FUZZY_ROLEX 
0.0000 0.0000 0.500 0.49 SUBJECT_FUZZY_CHEAP 
0.0000 0.0000 0.500 0.49 FUZZY_VIOXX 
0.0002 0.0015 0.119 0.49 FUZZY_MILLION 
0.0001 0.0062 0.017 0.48 FUZZY_REFINANCE
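
For readers comparing these numbers later: the S/O column appears to be the spam hit rate divided by the sum of the spam and ham hit rates, with 0.500 reported when a rule hits nothing. The sketch below assumes that formula; the function name is made up for illustration. Because the table's percentages are already rounded, recomputed values for very rare rules can differ in the last digit.

```python
def spam_over_overall(spam_pct, ham_pct):
    """S/O ratio under the assumed formula spam% / (spam% + ham%).

    Returns 0.5 when the rule hit neither spam nor ham, matching the
    0.500 rows in the table above.
    """
    total = spam_pct + ham_pct
    if total == 0:
        return 0.5
    return spam_pct / total


if __name__ == "__main__":
    # Values copied from the table above.
    rows = {
        "SUBJECT_FUZZY_TION": (0.1062, 0.3614),
        "FUZZY_VPILL": (0.1210, 0.0880),
        "FUZZY_BILLION": (0.0000, 0.0000),
    }
    for name, (spam, ham) in rows.items():
        print(f"{spam_over_overall(spam, ham):.3f} {name}")
```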
Comment 3 Sidney Markowitz 2008-10-01 23:46:33 UTC
Committed rules/trunk/sandbox/emailed/00_FVGT_File001.cf as revision 701013

After I see the results in a mass-check I'll push out the update and close this bug
Comment 4 Justin Mason 2008-10-02 01:31:47 UTC
+1 for changing.

Measure by running a mass-check on your spam corpus with the "pre" ruleset, then a mass-check on the same mails with the "post" ruleset.  We can assume that the results will roughly match what'll happen on the bigger multi-user nightly corpora.
Comment 5 Justin Mason 2008-10-02 01:32:34 UTC
(or alternatively just make the change; I think the hitrates are low enough, and FP rates are already high enough, that it's likely it'll improve matters anyway!)
Comment 6 Justin Mason 2008-10-02 01:33:17 UTC
aiming at 3.2.6, since it appears Sidney plans to push a 3.2.x update
Comment 7 Sidney Markowitz 2008-10-02 23:25:31 UTC
The mass check stats look good. The biggest change is in SUBJECT_FUZZY_TION, which went from an S/O of 0.227 to an S/O of 0.670. The other changes were much smaller, with almost all of the few changes for the worse being just a 0.01 difference.
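
A before/after comparison like this can be scripted. The following is only a sketch: the function name and the dictionary layout are assumptions, and the only data filled in is the one pair of S/O values quoted in this comment.

```python
def so_deltas(pre, post):
    """Return {rule: post S/O minus pre S/O} for rules present in both runs."""
    return {name: round(post[name] - pre[name], 3)
            for name in pre if name in post}


if __name__ == "__main__":
    # S/O values quoted in this bug; other rules would be added the same way.
    pre = {"SUBJECT_FUZZY_TION": 0.227}
    post = {"SUBJECT_FUZZY_TION": 0.670}
    # List the biggest absolute changes first.
    for name, delta in sorted(so_deltas(pre, post).items(),
                              key=lambda item: -abs(item[1])):
        print(f"{delta:+.3f} {name}")
```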

This change is only to the rules/trunk/sandbox/emailed/00_FVGT_File001.cf file. Do rules in there that meet the criteria get promoted to the 3.2 update channel automatically when the channel is pushed? Do I just do a push of the channel to see this go into 3.2?
Comment 8 Justin Mason 2008-12-31 09:29:30 UTC
(In reply to comment #7)
> This change is only to the rules/trunk/sandbox/emailed/00_FVGT_File001.cf file.
> Do rules in there that meet the criteria get promoted to the 3.2 update channel
> automatically when the channel is pushed? Do I just do a push of the channel to
> see this go into 3.2?

no.  it was all manual :(  now done, anyway...

svn commit -m "bug 5992: reduce FPs on replace_rules fuzzy-matching rules, backport from trunk" ../b3_2_0_updates/72_active.cf
Sending        b3_2_0_updates/72_active.cf
Transmitting file data .
Committed revision 730415.

and update pushed.