SA Bugzilla – Bug 1842
Remove CLICK_TO_REMOVE_2 from 20_body_tests
Last modified: 2004-03-17 12:02:44 UTC
Body tests are somewhat expensive, so it would seem to make sense to remove the ones that are not terribly useful. CLICK_TO_REMOVE_2 matched once in my sample set of 7800+ spam+non-spam messages, and that one match *also matched* the following rules: HTML_LINK_CLICK_HERE, MAILTO_TO_REMOVE, MAILTO_TO_SPAM_ADDR, MAILTO_WITH_SUBJ, MAILTO_WITH_SUBJ_REMOVE. Does anyone see a reason to keep doing so many /mailto:/ tests, especially ones like CLICK_TO_REMOVE_2 (which looks for three words in close proximity: "mailto", "click" and "remove") that are so restrictive and redundant, and for the most part superseded by Bayes anyway? If anyone really cares enough about my example, I'll attach it, but it doesn't seem pertinent to me.
Subject: Re: [SAdev] New: Remove CLICK_TO_REMOVE_2 from 20_body_tests
I totally agree. Both spam and nonspam use click to remove. In fact, click to remove is good list practice and certainly should not be punished. I vote (for what it's worth) to eliminate the click to remove rules, or at least most of them, unless there are ways to detect clearly bogus click to remove links.
Subject: Re: [SAdev] New: Remove CLICK_TO_REMOVE_2 from 20_body_tests
> Body tests are somewhat expensive, so it would seem to make sense to
> remove the ones that are not terribly useful. CLICK_TO_REMOVE_2 matched
> once in my sample set of 7800+ spam+non-spam messages, and that one
> match *also matched* the following rules:
>
> HTML_LINK_CLICK_HERE, MAILTO_TO_REMOVE, MAILTO_TO_SPAM_ADDR,
> MAILTO_WITH_SUBJ, MAILTO_WITH_SUBJ_REMOVE

I agree totally BTW. Persuading everyone else is the hard part ;)
Just to amend my comment: I'm not saying that we should remove the click-to-remove matching (it's a nice metric, and the GA can sort out its false-positive risk). What I was saying was that this test is BOTH redundant with many other tests AND so restrictive that it almost never matches in spam or non-spam (though the one example I found WAS spam). IMHO those should be the two criteria by which any rule can be removed.
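For anyone who wants to apply the same two criteria to other rules, here is a rough sketch in Python. It is not part of SpamAssassin: the function name, the `related` set, and both thresholds are hypothetical, and the sample data is illustrative (one message carrying the rule set reported in this bug, padded with made-up messages).

```python
def is_removal_candidate(rule, messages, related, max_hit_rate=0.01, min_overlap=0.9):
    """True if `rule` is both restrictive (rarely hits) and redundant
    (its hits are nearly always shared with another rule in `related`).

    The thresholds are arbitrary placeholders, not project policy.
    """
    hits = [fired for fired in messages if rule in fired]
    if not hits:
        # Never fires in this sample, so overlap cannot be measured.
        return False
    others = set(related) - {rule}
    hit_rate = len(hits) / len(messages)
    # Fraction of this rule's hits where some related rule fired too.
    overlap = sum(1 for fired in hits if fired & others) / len(hits)
    return hit_rate <= max_hit_rate and overlap >= min_overlap


# Illustrative data only: one message with the rule set reported in this bug,
# padded with made-up messages to reach a sample of 200.
reported = {
    "CLICK_TO_REMOVE_2", "HTML_LINK_CLICK_HERE", "MAILTO_TO_REMOVE",
    "MAILTO_TO_SPAM_ADDR", "MAILTO_WITH_SUBJ", "MAILTO_WITH_SUBJ_REMOVE",
}
messages = [reported] + [{"HTML_LINK_CLICK_HERE"}] * 99 + [set()] * 100

print(is_removal_candidate("CLICK_TO_REMOVE_2", messages, reported))
print(is_removal_candidate("HTML_LINK_CLICK_HERE", messages, reported))
```

With this made-up sample, the rare rule whose only hit is shared with related rules is flagged, while the frequently hitting rule is not.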
Aaron,
We have a lot of these rules right now. I think they're overdue for a bit of consolidation and clean-up. Group results (DETAILS.last):

  OVERALL%    SPAM%  NONSPAM%    S/O   RANK  SCORE  NAME
     1.769   1.9621    0.0452  0.977   0.91   1.10  CLICK_TO_REMOVE_1
    11.259  12.4383    0.7528  0.943   0.83   1.10  HTML_LINK_CLICK_CAPS
    11.078  12.2034    1.0539  0.921   0.78   0.50  CLICK_BELOW_CAPS
     0.678   0.7419    0.1054  0.876   0.65   0.80  CLICK_TO_REMOVE_2
    39.710  42.2548   17.0431  0.713   0.41   0.10  HTML_LINK_CLICK_HERE
    36.805  38.4337   22.2975  0.633   0.30   0.23  CLICK_BELOW
     0.052   0.0541    0.0301  0.642   0.26   1.00  EXCUSE_6

Clearly, some of these rules are better than others for both S/O ratio and spam %. And there's also a lot of overlap between the various rules, which is not a good thing. No overlap and fewer rules would be ideal. If the first step is deleting CLICK_TO_REMOVE_2 and EXCUSE_6, sign me up. They get almost no hits not covered by other rules:

  $ egrep EXCUSE_6 spam-quinlan.log spam-jm.log spam-rODbegbie.log spam-theo.log spam-daf.log spam-lan.log|awk '{print $4}'|tr , '\n'|egrep '(CLICK|EXCUSE_6)'|count|tail
  18 HTML_LINK_CLICK_HERE
  21 CLICK_BELOW
  21 __CLICK_BELOW
  35 EXCUSE_6

EXCUSE_6 is not worth keeping...

  $ egrep CLICK_TO_REMOVE_2 spam-quinlan.log spam-jm.log spam-rODbegbie.log spam-theo.log spam-daf.log spam-lan.log|awk '{print $4}'|tr , '\n'|egrep '(CLICK|EXCUSE_6)'|count|tail
  60 CLICK_TO_REMOVE_1
  117 CLICK_BELOW_CAPS
  125 HTML_LINK_CLICK_CAPS
  145 CLICK_BELOW
  256 HTML_LINK_CLICK_HERE
  262 __CLICK_BELOW
  265 CLICK_TO_REMOVE_2

almost all hits covered by HTML_LINK_CLICK_HERE

If there are any words covered by CLICK_TO_REMOVE_2 which aren't in HTML_LINK_CLICK_HERE, then just merge them in for now. Same for EXCUSE_6 and CLICK_BELOW.
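The overlap tallies above can be approximated without the `count` helper, which is not a standard utility and is assumed here to tally identical lines. The Python sketch below follows the same steps as the pipeline, assuming (as the `awk '{print $4}'` step implies) that the fourth whitespace-separated field of each log line holds the comma-separated rule names. The function name and the sample lines are made up for illustration and are not taken from any real mass-check log.

```python
import re
from collections import Counter


def cooccurring_rules(log_lines, target, name_pattern=r"(CLICK|EXCUSE_6)"):
    """Tally rule names that appear on the same log lines as `target`."""
    keep = re.compile(name_pattern)
    tally = Counter()
    for line in log_lines:
        # First egrep: keep only lines that mention the target rule.
        if target not in line:
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        # awk '{print $4}' | tr , '\n': split the rule list into names.
        for rule in fields[3].split(","):
            # Second egrep: keep only the rule names of interest.
            if keep.search(rule):
                tally[rule] += 1
    return tally


# Made-up sample lines in the assumed format.
sample = [
    "Y 12 /corpus/spam/1 CLICK_TO_REMOVE_2,HTML_LINK_CLICK_HERE,MAILTO_TO_REMOVE",
    "Y 9 /corpus/spam/2 CLICK_TO_REMOVE_2,HTML_LINK_CLICK_HERE",
    "Y 7 /corpus/spam/3 CLICK_TO_REMOVE_2,CLICK_BELOW",
    "Y 5 /corpus/spam/4 HTML_LINK_CLICK_HERE",
]
tally = cooccurring_rules(sample, "CLICK_TO_REMOVE_2")

# Print in ascending order of hits, like the tail of the shell output.
for rule, hits in sorted(tally.items(), key=lambda kv: kv[1]):
    print(hits, rule)
```

The target rule's own count is the total number of matching lines, so comparing each other rule's count against it shows how much of the target's hits that rule already covers.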
moving a bunch of bugs to 2.70 milestone
it's gone