Bug 1842 - Remove CLICK_TO_REMOVE_2 from 20_body_tests
Summary: Remove CLICK_TO_REMOVE_2 from 20_body_tests
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules
Version: SVN Trunk (Latest Devel Version)
Hardware: Other other
Importance: P5 enhancement
Target Milestone: 3.0.0
Assignee: SpamAssassin Developer Mailing List
 
Reported: 2003-04-30 09:46 UTC by Aaron Sherman
Modified: 2004-03-17 12:02 UTC



Description Aaron Sherman 2003-04-30 09:46:48 UTC
Body tests are somewhat expensive, so it would seem to make sense to remove the
ones that are not terribly useful. CLICK_TO_REMOVE_2 matched once in my sample
set of 7800+ spam+non-spam messages, and that one match *also matched* the
following rules:

HTML_LINK_CLICK_HERE, MAILTO_TO_REMOVE, MAILTO_TO_SPAM_ADDR, MAILTO_WITH_SUBJ,
MAILTO_WITH_SUBJ_REMOVE

Does anyone see a reason to keep doing so many /mailto:/ tests, especially ones
like CLICK_TO_REMOVE_2 that are so restrictive and redundant and superseded for
the most part by Bayes anyway (it looks for three words in close proximity,
"mailto", "click" and "remove")?

If anyone really cares enough about my example, I'll attach it, but it doesn't
seem pertinent to me.
Comment 1 Marc Perkel 2003-04-30 10:06:19 UTC
Subject: Re: [SAdev]  New: Remove CLICK_TO_REMOVE_2 from 20_body_tests

I totally agree. Both spam and nonspam use click to remove. In fact,
click to remove is good list practice and certainly should not be
punished. I vote (for what it's worth) to eliminate the click to remove
rules, or at least most of them, unless there are ways to detect clearly
bogus click to remove links.


Comment 2 Antony Mawer 2003-04-30 11:23:53 UTC
Subject: Re: [SAdev]  New: Remove CLICK_TO_REMOVE_2 from 20_body_tests 


> Body tests are somewhat expensive, so it would seem to make sense to
> remove the ones that are not terribly useful. CLICK_TO_REMOVE_2 matched
> once in my sample set of 7800+ spam+non-spam messages, and that one
> match *also matched* the following rules:
> 
> HTML_LINK_CLICK_HERE, MAILTO_TO_REMOVE, MAILTO_TO_SPAM_ADDR,
> MAILTO_WITH_SUBJ, MAILTO_WITH_SUBJ_REMOVE

I agree totally BTW.  Persuading everyone else is the hard part ;)

Comment 3 Aaron Sherman 2003-04-30 11:38:07 UTC
Just to amend my comment, I'm not saying that we should remove the
click-to-remove matching (it's a nice metric, and the GA can sort out its
false-positive risk). What I was saying was that this test is BOTH redundant
with many other tests AND so restrictive that it almost never matches in spam or
non-spam (though the one example I found WAS spam).

IMHO those should be the two criteria by which any rule can be removed.
Comment 4 Daniel Quinlan 2003-04-30 18:29:33 UTC
Aaron,

We have a lot of these rules right now.  I think they're overdue for a bit
of consolidation and clean-up.  Group results (DETAILS.last):

OVERALL%    SPAM%  NONSPAM%    S/O  RANK  SCORE  NAME
   1.769   1.9621    0.0452  0.977  0.91   1.10  CLICK_TO_REMOVE_1
  11.259  12.4383    0.7528  0.943  0.83   1.10  HTML_LINK_CLICK_CAPS
  11.078  12.2034    1.0539  0.921  0.78   0.50  CLICK_BELOW_CAPS
   0.678   0.7419    0.1054  0.876  0.65   0.80  CLICK_TO_REMOVE_2
  39.710  42.2548   17.0431  0.713  0.41   0.10  HTML_LINK_CLICK_HERE
  36.805  38.4337   22.2975  0.633  0.30   0.23  CLICK_BELOW
   0.052   0.0541    0.0301  0.642  0.26   1.00  EXCUSE_6

Clearly, some of these rules are better than others for both S/O ratio and
spam %.  And there's also a lot of overlap between the various rules, which
is not a good thing.  No overlap and fewer rules would be ideal.
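
A minimal sketch of how the S/O figures relate to the percentages, assuming the
second and third numeric columns of the results are spam % and non-spam % and
that S/O is spam % divided by their sum. The function name `s_o_ratio` is
hypothetical, and the values recomputed this way can differ from the listed
ones in the last digit because the inputs are already rounded.

```python
def s_o_ratio(spam_pct, nonspam_pct):
    """Return spam% / (spam% + nonspam%), or None if the rule never hit.

    Assumption: this is how the S/O column is derived; it reproduces the
    listed values to within rounding.
    """
    total = spam_pct + nonspam_pct
    if total == 0:
        return None
    return spam_pct / total


# (spam %, non-spam %) pairs copied from the results listed in this comment.
rows = {
    "CLICK_TO_REMOVE_1": (1.9621, 0.0452),
    "CLICK_TO_REMOVE_2": (0.7419, 0.1054),
    "HTML_LINK_CLICK_HERE": (42.2548, 17.0431),
    "EXCUSE_6": (0.0541, 0.0301),
}

for name, (spam_pct, nonspam_pct) in rows.items():
    print(f"{s_o_ratio(spam_pct, nonspam_pct):.3f}  {name}")
```

A value near 1.0 means the rule hits almost only spam; a value near 0.5 means
it hits spam and non-spam about equally and so carries little signal.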

If the first step is deleting CLICK_TO_REMOVE_2 and EXCUSE_6, sign me up.
They get almost no hits not covered by other rules:

$ egrep EXCUSE_6 spam-quinlan.log spam-jm.log spam-rODbegbie.log spam-theo.log \
    spam-daf.log spam-lan.log | awk '{print $4}' | tr , '\n' |
    egrep '(CLICK|EXCUSE_6)' | count | tail
     18 HTML_LINK_CLICK_HERE
     21 CLICK_BELOW
     21 __CLICK_BELOW
     35 EXCUSE_6

EXCUSE_6 is not worth keeping...

$ egrep CLICK_TO_REMOVE_2 spam-quinlan.log spam-jm.log spam-rODbegbie.log \
    spam-theo.log spam-daf.log spam-lan.log | awk '{print $4}' | tr , '\n' |
    egrep '(CLICK|EXCUSE_6)' | count | tail
     60 CLICK_TO_REMOVE_1
    117 CLICK_BELOW_CAPS
    125 HTML_LINK_CLICK_CAPS
    145 CLICK_BELOW
    256 HTML_LINK_CLICK_HERE
    262 __CLICK_BELOW
    265 CLICK_TO_REMOVE_2

almost all hits covered by HTML_LINK_CLICK_HERE

If there are any words covered by CLICK_TO_REMOVE_2 which aren't in
HTML_LINK_CLICK_HERE, then just merge them in for now.

Same for EXCUSE_6 and CLICK_BELOW.
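
The overlap check above can be sketched in Python for anyone without the
`count` helper. This is only an approximation under stated assumptions: the
fourth whitespace-separated field of each log line is taken to be the
comma-separated list of rules hit (as the `awk '{print $4}'` step implies), and
`count` is assumed to tally identical lines and sort ascending by frequency.
The function name `co_hits` and the sample log lines are hypothetical.

```python
import re
from collections import Counter


def co_hits(lines, rule, keep=r"(CLICK|EXCUSE_6)"):
    """Tally related rules that fired on the same messages as `rule`.

    Assumes field 4 of each log line holds the comma-separated rule names.
    Returns (name, count) pairs sorted ascending by count, then by name.
    """
    tally = Counter()
    for line in lines:
        if rule not in line:          # same substring match as egrep
            continue
        fields = line.split()
        if len(fields) < 4:           # no rule list on this line
            continue
        for name in fields[3].split(","):
            if re.search(keep, name):
                tally[name] += 1
    return sorted(tally.items(), key=lambda kv: (kv[1], kv[0]))


# Made-up log lines, only to show the expected input shape.
sample = [
    "Y 12 spam/1 CLICK_TO_REMOVE_2,HTML_LINK_CLICK_HERE,BAYES_99",
    "Y 9 spam/2 CLICK_TO_REMOVE_2,CLICK_BELOW",
    "Y 7 spam/3 HTML_LINK_CLICK_HERE",
]

for name, hits in co_hits(sample, "CLICK_TO_REMOVE_2"):
    print(f"{hits:7d} {name}")
```

If the count for another rule is close to the count for the rule being
examined, nearly every hit is already covered and the rule is a removal
candidate.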
Comment 5 Daniel Quinlan 2003-05-18 21:40:36 UTC
moving a bunch of bugs to 2.70 milestone
Comment 6 Justin Mason 2004-03-17 21:02:44 UTC
it's gone