SA Bugzilla – Bug 5992
False positives on SUBJECT_FUZZY_TION rule
Last modified: 2008-12-31 09:29:30 UTC
The SUBJECT_FUZZY_TION rule generates false positives on any word containing "tition". For example, these legitimate, non-obfuscated words trigger the rule:

competition (OK, likely spam term but not obfuscated)
partition
petition
practitioner
repartition
repetition
superstition
It's worse than that: cotton and mutton also match. The mass-check S/O for the rule is pretty bad. At first I thought we should just delete the rule, but the source of this problem is something else that might need fixing.

The sandbox file emailed/00_FVGT_File001.cf redefines all of the replace_tags that are used by the ReplaceTags plugin, overwriting what is defined in rules/25_replace.cf. 00_FVGT_File001.cf has new values for the replace_tags G, I, Q, S, T, and W. The redefinition of I is what is messing up this rule, as it makes tition the same as tiiion, and makes tton the same as tion.

I don't think that 00_FVGT_File001.cf should be doing anything that has a global effect on rules that it is not defining. We should remove the unnecessary replace_tag definitions that are already in the standard rule set, and rename the changed tags to G2, I2, Q2, S2, T2, and W2 so that they are only referenced in the rules defined in that file.

I'm a little hesitant to push this into the rule updates because it will have the effect of reverting all the replace_tag rules in other files to the behavior they had before any 00_FVGT_File001.cf rules were promoted. But it does seem like the right thing to do.

Can anyone think of a way to test this without pushing it out to everyone? Would it work to check in a change to 00_FVGT_File001.cf and compare the S/Os of all rules in 25_replace.cf before and after the change, knowing that nobody will see the change until we push an update through the channel?
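To illustrate the effect described above, here is a minimal Python sketch of how a broadened I tag could make ordinary words look obfuscated. The character classes are hypothetical stand-ins, not the actual definitions from 25_replace.cf or 00_FVGT_File001.cf, and the pattern shape (a "not plain tion" lookahead followed by repeatable T, I, O, N tags) is an assumption about how the rule expands. The names STANDARD_TAGS, OVERRIDDEN_TAGS, build_fuzzy_tion, and hits are made up for this example.

```python
import re

# Hypothetical stand-ins for replace_tag character classes.
# These are NOT the real definitions from either .cf file.
STANDARD_TAGS = {"T": "t", "I": "[i1!|]", "O": "[o0]", "N": "n"}

# Assumed override: a broader I that also accepts the letter "t".
OVERRIDDEN_TAGS = dict(STANDARD_TAGS, I="[i1!|t]")


def build_fuzzy_tion(tags):
    """Build a fuzzy 'tion' regex from a tag table (assumed pattern shape)."""
    # Each tag may repeat; the lookahead rejects the plain spelling "tion".
    body = "".join(f"(?:{tags[name]})+" for name in "TION")
    return re.compile(f"(?!tion){body}", re.IGNORECASE)


def hits(tags, word):
    """Return True if the fuzzy rule built from `tags` fires on `word`."""
    return build_fuzzy_tion(tags).search(word) is not None


if __name__ == "__main__":
    for word in ["competition", "cotton", "mention", "t1on"]:
        print(word, hits(STANDARD_TAGS, word), hits(OVERRIDDEN_TAGS, word))
```

Under these assumptions, the obfuscated "t1on" fires with either tag table, while "competition" and "cotton" fire only once I also accepts "t". That is the kind of global side effect that renaming the changed tags to G2, I2, and so on would avoid.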
Here are the last mass check results for the fuzzy rules from 25_replace.cf, preserved here for posterity and to make it easier to compare results after I change it. Since the change should not propagate until the update channel is explicitly pushed out, I'm going to go ahead and check the fix into the sandbox and see what these results look like in the next nightly that uses it.

 SPAM%    HAM%    S/O   RANK  NAME
1.5053  0.0046  0.997  0.91  SUBJECT_FUZZY_MEDS
0.3093  0.0000  1.000  0.81  FUZZY_GUARANTEE
0.2751  0.0077  0.973  0.80  FUZZY_ERECT
0.2378  0.0278  0.895  0.77  FUZZY_CPILL
0.1713  0.0031  0.982  0.76  __SUBJECT_FUZZY_VPILL
0.1473  0.0000  1.000  0.76  FUZZY_PRICES
0.0876  0.0000  1.000  0.71  SUBJECT_FUZZY_VPILL
0.0743  0.0000  1.000  0.69  FUZZY_MEDICATION
0.0674  0.0031  0.956  0.68  FUZZY_PHARMACY
0.0662  0.0046  0.935  0.68  FUZZY_XPILL
0.1210  0.0880  0.579  0.67  FUZZY_VPILL
0.0493  0.0000  1.000  0.66  SUBJECT_FUZZY_PENIS
0.0413  0.0000  1.000  0.64  FUZZY_PRESCRIPT
0.0517  0.0432  0.545  0.63  FUZZY_CREDIT
0.0209  0.0015  0.931  0.58  FUZZY_OFFERS
0.1062  0.3614  0.227  0.53  SUBJECT_FUZZY_TION
0.0110  0.0448  0.197  0.50  FUZZY_AMBIEN
0.0045  0.0154  0.224  0.49  FUZZY_VLIUM
0.0006  0.0000  1.000  0.49  FUZZY_MONEY
0.0003  0.0000  1.000  0.49  FUZZY_OBLIGATION
0.0003  0.0000  1.000  0.49  FUZZY_MORTGAGE
0.0001  0.0000  1.000  0.49  FUZZY_SOFTWARE
0.0001  0.0000  1.000  0.49  FUZZY_MERIDIA
0.0000  0.0000  0.500  0.49  FUZZY_BILLION
0.0000  0.0000  0.500  0.49  FUZZY_PHENT
0.0000  0.0000  0.500  0.49  FUZZY_REMOVE
0.0000  0.0000  0.500  0.49  FUZZY_THOUSANDS
0.0000  0.0000  0.500  0.49  FUZZY_AFFORDABLE
0.0000  0.0000  0.500  0.49  FUZZY_ROLEX
0.0000  0.0000  0.500  0.49  SUBJECT_FUZZY_CHEAP
0.0000  0.0000  0.500  0.49  FUZZY_VIOXX
0.0002  0.0015  0.119  0.49  FUZZY_MILLION
0.0001  0.0062  0.017  0.48  FUZZY_REFINANCE
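For readers comparing these numbers by hand: the S/O column appears consistent with spam% divided by (spam% + ham%), with 0.500 reported when a rule hits nothing. The sketch below assumes that formula; the function name s_over_o is made up for this example. Because the printed percentages are already rounded, recomputed values can differ slightly from the table for very rare rules.

```python
def s_over_o(spam_pct, ham_pct):
    """Approximate S/O from the SPAM% and HAM% columns (assumed formula)."""
    total = spam_pct + ham_pct
    if total == 0:
        # No hits at all: the table reports 0.500 in this case.
        return 0.5
    return spam_pct / total


if __name__ == "__main__":
    # SUBJECT_FUZZY_TION row from the table: SPAM% 0.1062, HAM% 0.3614
    print(round(s_over_o(0.1062, 0.3614), 3))
```

An S/O well below 0.5, as for SUBJECT_FUZZY_TION here, means the rule hits ham proportionally more often than spam.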
Committed rules/trunk/sandbox/emailed/00_FVGT_File001.cf as revision 701013. After I see the results in a mass-check, I'll push out the update and close this bug.
+1 for changing. Measure by running a mass-check on your spam corpus with the "pre" ruleset, then a mass-check on the same mails with the "post" ruleset. We can assume that the results will roughly match what'll happen on the bigger multi-user nightly corpora.
(or alternatively just make the change; I think the hitrates are low enough, and FP rates are already high enough, that it's likely it'll improve matters anyway!)
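The pre/post comparison suggested above could be scripted roughly as follows. This is only a sketch: it assumes the results are available as text rows in the "SPAM% HAM% S/O RANK NAME" layout used earlier in this bug, and the function names parse_freqs and so_changes, as well as the 0.01 default threshold, are made up for illustration.

```python
def parse_freqs(text):
    """Map rule name -> S/O from 'SPAM% HAM% S/O RANK NAME' rows."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        try:
            so = float(fields[2])
        except ValueError:
            continue  # skip the header row
        result[fields[4]] = so
    return result


def so_changes(pre, post, threshold=0.01):
    """List (name, pre S/O, post S/O, delta) for rules whose S/O moved."""
    rows = []
    for name in sorted(pre.keys() & post.keys()):
        delta = round(post[name] - pre[name], 3)
        if abs(delta) >= threshold:
            rows.append((name, pre[name], post[name], delta))
    # Largest movement first, regardless of direction.
    return sorted(rows, key=lambda row: -abs(row[3]))
```

Feeding it the "pre" and "post" results from the same corpus gives a short list of rules whose S/O actually moved, which is easier to review than two full tables side by side.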
aiming at 3.2.6, since it appears Sidney plans to push a 3.2.x update
The mass check stats look good. The biggest change is in SUBJECT_FUZZY_TION, which went from an S/O of 0.227 to an S/O of 0.670. The other changes were much smaller, and almost all of the few changes for the worse were a difference of just 0.01.

This change is only to the rules/trunk/sandbox/emailed/00_FVGT_File001.cf file. Do rules in there that meet the criteria get promoted to the 3.2 update channel automatically when the channel is pushed? Do I just do a push of the channel to see this go into 3.2?
(In reply to comment #7)
> This change is only to the rules/trunk/sandbox/emailed/00_FVGT_File001.cf file.
> Do rules in there that meet the criteria get promoted to the 3.2 update channel
> automatically when the channel is pushed? Do I just do a push of the channel to
> see this go into 3.2?

no. it was all manual :( now done, anyway...

svn commit -m "bug 5992: reduce FPs on replace_rules fuzzy-matching rules, backport from trunk" ../b3_2_0_updates/72_active.cf
Sending b3_2_0_updates/72_active.cf
Transmitting file data .
Committed revision 730415.

and update pushed.