Bug 691 - Questionable negative rules in 20_compensate.cf
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules
Version: SVN Trunk (Latest Devel Version)
Hardware: Other other
Importance: P2 normal
Target Milestone: ---
Assignee: Daniel Quinlan
 
Reported: 2002-08-12 22:57 UTC by Michael Moncur
Modified: 2002-09-03 22:09 UTC (History)
Description Michael Moncur 2002-08-12 22:57:26 UTC
I had a false negative today that got a -4 score from the SUBJECT_IS_NEWS rule, 
and this prompted me to run a test on the 20_compensate.cf rules. The following 
ones look questionable - especially SUBJECT_IS_NEWS, FROM_NEWS_LIST, and 
EXCHANGE_SERVER. Some of the rest scored OK, but the negative scores might be a 
bit larger than they should be. Does anyone else get similar results?

OVERALL     SPAM  NONSPAM     S/O   SCORE  NAME
  10689     6683     4006    0.63    0.00  (all messages)
    155      126       29    0.72   -0.30  OUTLOOK_FW_MSG
    250      122      128    0.36   -2.00  EXCHANGE_SERVER
    118      114        4    0.94   -2.00  FROM_NEWS_LIST
    118       51       67    0.31   -4.00  SUBJECT_IS_NEWS
    274       47      227    0.11   -3.50  SUBJECT_MONTH_2
    253       28      225    0.07   -3.50  SUBJECT_MONTH
    331       28      303    0.05   -2.00  X_ACCEPT_LANG
     57       19       38    0.23   -2.50  SUBJECT_FREQ
     36       19       17    0.40   -0.70  X_AUTH_WARNING
     22       17        5    0.67   -1.00  AUTO_RESP
     33       12       21    0.26   -1.00  FWD_MSG
     19        9       10    0.35   -1.00  MAILER_DAEMON
     32        8       24    0.17   -1.00  ACCOUNT_CLICK
     21        8       13    0.27   -4.00  SUBJECT_HAS_DATE
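For anyone reading the table: the S/O column on the rule rows looks like the spam share of a rule's hits after normalising for corpus size, i.e. (spam hits / total spam) divided by the sum of that and (nonspam hits / total nonspam). The sketch below is an assumption worked back from the figures above, not code taken from hit-frequencies; the function name s_over_o is made up for illustration.

```python
def s_over_o(spam_hits, nonspam_hits, spam_total, nonspam_total):
    """Spam share of a rule's hits, normalised by the size of each corpus."""
    spam_rate = spam_hits / spam_total
    nonspam_rate = nonspam_hits / nonspam_total
    if spam_rate + nonspam_rate == 0:
        return 0.0  # rule never hit anything
    return spam_rate / (spam_rate + nonspam_rate)


# Corpus sizes from the "(all messages)" row above.
SPAM_TOTAL, NONSPAM_TOTAL = 6683, 4006

# (spam hits, nonspam hits) copied from the table above.
rows = {
    "OUTLOOK_FW_MSG": (126, 29),
    "EXCHANGE_SERVER": (122, 128),
    "FROM_NEWS_LIST": (114, 4),
    "SUBJECT_IS_NEWS": (51, 67),
}

for name, (spam, nonspam) in rows.items():
    value = s_over_o(spam, nonspam, SPAM_TOTAL, NONSPAM_TOTAL)
    print(f"{value:.2f}  {name}")
```

With these inputs the four values come out as 0.72, 0.36, 0.94 and 0.31, matching the S/O column. A rule meant to compensate for false positives wants an S/O near 0, so FROM_NEWS_LIST at 0.94 with a -2.00 score is the clearest problem in this corpus.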
Comment 1 Daniel Quinlan 2002-08-12 23:04:56 UTC
No local tests should ever have a human-set score.  Some of these are
just new rules so the GA hasn't been run over them yet, but some are also
human-set which is non-optimal.

My plan is to relocate all local tests to the GA scoring section of the
rules file.
Comment 2 Justin Mason 2002-08-13 15:42:29 UTC
BTW, I haven't spent nearly enough time fixing these rules;
some are definitely hitting promiscuously on spam as well
as nonspam.  Feel free to fix them if you see cases where
they're mismatching.

If they still give really bad FNs, then we should nuke 'em :(
Comment 3 Rod Begbie 2002-08-14 16:02:00 UTC
OUTLOOK_FW_MSG matches any subject starting "Fw" -- The colon is "0 or 1", and 
the whitespace after is "0 or many".  Outlook forwards always start 
either "Fw: " or "FW: ".

Similarly, FWD_MSG has a "\s*" at the end of it -- Either we want to match a 
whitespace, or we don't care, so either the * or the whole thing should go.
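To illustrate the point about the trailing "\s*": a quantifier that can match zero characters at the end of an unanchored pattern never changes whether the pattern matches. The patterns below are hypothetical stand-ins built from the description above (they are not the rule text from 20_compensate.cf), and Python's re module is used here in place of Perl.

```python
import re

# Hypothetical patterns; only the ending differs.
with_star = re.compile(r"Fwd:?\s*")    # trailing \s* may match nothing
without_star = re.compile(r"Fwd:?")    # same pattern with \s* dropped
needs_space = re.compile(r"Fwd:?\s")   # requires one whitespace character

subjects = [
    "Fwd: Dracula",
    "Fwd:Pure Opt-In for half the price",
    "Fwd",
    "Re: lunch?",
]

for subject in subjects:
    star = bool(with_star.search(subject))
    plain = bool(without_star.search(subject))
    space = bool(needs_space.search(subject))
    print(f"{subject!r}: star={star} plain={plain} space={space}")
```

The first two patterns agree on every subject, while the third rejects "Fwd:Pure Opt-In for half the price" and the bare "Fwd".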

I made the following changes -- In both cases they hit less spam, and the 
Outlook one also hits more nonspam (the Fwd one hits the same number).

header ROD_FWD_MSG                  Subject =~ /\[?Fwd:?\s/
describe ROD_FWD_MSG                Forwarded email
tflags ROD_FWD_MSG                  nice

header ROD_OUTLOOK_FW_MSG           Subject =~ /\[?F[Ww]:\s/
describe ROD_OUTLOOK_FW_MSG         Forwarded email (Outlook style)
tflags ROD_OUTLOOK_FW_MSG           nice

OVERALL     SPAM  NONSPAM     S/O   SCORE  NAME
  12758     3565     9193    0.28    0.00  (all messages)
    182       38      144    0.40   -0.30  OUTLOOK_FW_MSG
    116       19       97    0.34   -1.00  FWD_MSG
    114       17       97    0.31    1.00  ROD_FWD_MSG
    203       17      186    0.19    1.00  ROD_OUTLOOK_FW_MSG

Comment 4 Rod Begbie 2002-08-14 16:29:05 UTC
For the record, here're my scores for 20_compensate as a whole.

Note that EXCHANGE_SERVER did quite reasonably here (hit a handful of spam, but 
shedloads of nonspam).

Still not sure I agree with SUBJECT_IS_NEWS -- It's triggering on things 
like "Subject: [trouble-list] Who should go?".  Doesn't seem that helpful.

I agree that the current scores in CVS are off, and that the GA should be set 
upon the list.


[rod@blazing masses]$ ./hit-frequencies -x | egrep -v "0 +0 +0" | sort -rn +3
OVERALL     SPAM  NONSPAM     S/O   SCORE  NAME
  12758     3565     9193    0.28    0.00  (all messages)
     12        6        6    0.72   -1.00  X_MAILING_LIST
     19        6       13    0.54   -1.00  X_LOOP
    182       38      144    0.40   -0.30  OUTLOOK_FW_MSG
     59       10       49    0.34   -2.00  FORGOTTEN_PASSWORD
    116       19       97    0.34   -1.00  FWD_MSG
    250       39      211    0.32   -4.00  SUBJECT_IS_NEWS
    114       17       97    0.31    1.00  ROD_FWD_MSG
     15        2       13    0.28   -1.00  PRIVACY_STATEMENT
    203       17      186    0.19    1.00  ROD_OUTLOOK_FW_MSG
    830       52      778    0.15   -2.00  EXCHANGE_SERVER
     48        3       45    0.15   -1.00  ACCOUNT_CLICK
     68        3       65    0.11   -2.00  SIGNATURE_SHORT_DENSE
    281       13      268    0.11   -2.50  SUBJECT_FREQ
    428       10      418    0.06   -3.50  SUBJECT_MONTH_2
    134        3      131    0.06   -0.50  SIGNATURE_LONG_SPARSE
    128        3      125    0.06   -1.00  HOTMAIL_FOOTER1
    413        8      405    0.05   -3.50  SUBJECT_MONTH
    195        4      191    0.05   -2.00  FROM_NEWS_LIST
    224        4      220    0.04   -2.00  X_ACCEPT_LANG
     97        1       96    0.03   -1.00  MSN_FOOTER1
    159        2      157    0.03   -0.70  X_AUTH_WARNING
    350        2      348    0.01   -1.00  HOTMAIL_FOOTER2
    187        1      186    0.01   -4.00  SUBJECT_HAS_DATE
      8        0        8    0.00   -1.00  REG_THANKS
    770        1      769    0.00   -3.00  FROM_EGROUPS
    741        1      740    0.00   -1.00  GROUPS_YAHOO_1
    660        1      659    0.00   -1.00  USER_AGENT
    631        0      631    0.00   -1.00  EMAIL_ATTRIBUTION
      6        0        6    0.00   -5.00  EVITE
      3        0        3    0.00   -2.00  CRON_ENV
     28        0       28    0.00   -1.50  SIGNATURE_SHORT_SPARSE
     28        0       28    0.00  -10.00  GENUINE_EBAY_RCVD
     26        0       26    0.00   -5.00  LISTBUILDER
     21        0       21    0.00   -3.13  PGP_SIGNATURE
      2        0        2    0.00   -1.00  APPROVED_BY
   1932        0     1932    0.00   -3.38  IN_REP_TO
     19        0       19    0.00   -1.00  SIGNATURE_LONG_DENSE
     18        0       18    0.00   -1.00  HOTMAIL_FOOTER3
   1252        0     1252    0.00   -0.10  REFERENCES
    105        0      105    0.00   -1.00  HOTMAIL_FOOTER5
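For anyone without the old "sort +3" syntax to hand, the filter-and-sort step of the command above can be roughly sketched in Python: drop rules with no hits, then order by the S/O column, highest first. This is only an approximation of the egrep/sort pipeline; the function names and the NEVER_HIT_RULE row are made up for illustration.

```python
def parse_row(line):
    """Split one hit-frequencies row into typed fields."""
    overall, spam, nonspam, s_o, score, name = line.split(None, 5)
    return int(overall), int(spam), int(nonspam), float(s_o), float(score), name


def filter_and_sort(text):
    """Drop rules that never hit, then sort by S/O, highest first."""
    rows = [parse_row(line) for line in text.splitlines() if line.strip()]
    rows = [row for row in rows if row[0] > 0]
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Three rows from the table above plus one invented zero-hit rule.
sample = """\
     12        6        6    0.72   -1.00  X_MAILING_LIST
    830       52      778    0.15   -2.00  EXCHANGE_SERVER
      0        0        0    0.00   -1.00  NEVER_HIT_RULE
    250       39      211    0.32   -4.00  SUBJECT_IS_NEWS
"""

for row in filter_and_sort(sample):
    print(row[5], row[3])
```

Sorting on S/O puts the rules that lean most towards spam at the top, which is where the questionable negative scores show up.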
Comment 5 Daniel Rogers 2002-08-14 16:51:00 UTC
Subject: Re: [SAdev]  Questionable negative rules in 20_compensate.cf

And my results:

OVERALL     SPAM  NONSPAM     S/O   SCORE  NAME
  11062     7152     3910    0.65    0.00  (all messages)
    524      459       65    0.79   -2.00  FROM_NEWS_LIST
    426       75      351    0.10   -2.50  SUBJECT_FREQ
    222       70      152    0.20   -4.00  SUBJECT_IS_NEWS
     78       69        9    0.81   -2.00  X_ACCEPT_LANG
    493       39      454    0.04   -3.50  SUBJECT_MONTH_2
    636       34      602    0.03   -0.70  X_AUTH_WARNING
    278       31      247    0.06   -0.30  OUTLOOK_FW_MSG
     26       25        1    0.93   -1.00  RESENT_TO
     44       23       21    0.37   -2.00  EXCHANGE_SERVER
    173       17      156    0.06   -1.00  FWD_MSG
     24       16        8    0.52   -1.00  ACCOUNT_CLICK
    464       15      449    0.02   -3.50  SUBJECT_MONTH
     29       12       17    0.28   -0.10  REFERENCES
     10       10        0    1.00   -0.50  SIGNATURE_LONG_SPARSE
     56        4       52    0.04   -4.00  SUBJECT_HAS_DATE
     20        3       17    0.09   -2.00  FORGOTTEN_PASSWORD
      8        3        5    0.25   -1.00  HOTMAIL_FOOTER2
      3        2        1    0.52   -1.00  MAILBITS_EMAIL
      6        2        4    0.21   -1.00  PRIVACY_STATEMENT
      6        1        5    0.10   -1.00  APPROVED_BY
      4        1        3    0.15   -1.00  X_LOOP
     14        1       13    0.04   -1.00  EMAIL_ATTRIBUTION
      4        1        3    0.15   -1.00  HOTMAIL_FOOTER3
      1        1        0    1.00   -1.00  TRACK_NUMBER

Dan.

Comment 6 Rod Begbie 2002-08-15 07:35:01 UTC
A slight tweak to simplify the regexp:

header FWD_MSG                  Subject =~ /Fwd:\s/
describe FWD_MSG                Forwarded email
tflags FWD_MSG                  nice

And since I've shouted at others for not supplying tests, here are some 
regression tests for my FW changes:

test OUTLOOK_FW_MSG ok Subject: FW: White Stripes Tour!
test OUTLOOK_FW_MSG ok Subject: Fw: Thank you yourself
test OUTLOOK_FW_MSG fail Subject: Fwd: Dracula
test OUTLOOK_FW_MSG fail Subject: fw:                                       . qBQyzOWqKYggZT0oDJzp41nd
test OUTLOOK_FW_MSG fail Subject: FW:>Re: Spruce up your life!DGKJWI
test FWD_MSG ok Subject: Fwd: Dracula
test FWD_MSG ok Subject: [landho] Fwd: tell rod
test FWD_MSG fail Subject: Fwd:Pure Opt-In for half the price
test FWD_MSG fail Subject: Re: RE: FWD: search results        .       .       .
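
The test lines above can be checked outside SpamAssassin with a small Python sketch. It assumes the rule is matched against the Subject value only and that Python's re behaves like Perl for these simple patterns; the runner itself is hypothetical and not part of the SpamAssassin test suite. The patterns are the ones from comments 3 and 6.

```python
import re

# Patterns copied from the rules posted in this bug.
RULES = {
    "OUTLOOK_FW_MSG": re.compile(r"\[?F[Ww]:\s"),
    "FWD_MSG": re.compile(r"Fwd:\s"),
}

TESTS = """\
test OUTLOOK_FW_MSG ok Subject: FW: White Stripes Tour!
test OUTLOOK_FW_MSG ok Subject: Fw: Thank you yourself
test OUTLOOK_FW_MSG fail Subject: Fwd: Dracula
test OUTLOOK_FW_MSG fail Subject: fw:          . qBQyzOWqKYggZT0oDJzp41nd
test OUTLOOK_FW_MSG fail Subject: FW:>Re: Spruce up your life!DGKJWI
test FWD_MSG ok Subject: Fwd: Dracula
test FWD_MSG ok Subject: [landho] Fwd: tell rod
test FWD_MSG fail Subject: Fwd:Pure Opt-In for half the price
test FWD_MSG fail Subject: Re: RE: FWD: search results        .       .       .
"""


def run_tests(tests):
    """Return the test lines whose outcome differs from the expectation."""
    failures = []
    for line in tests.splitlines():
        if not line.strip():
            continue
        _, rule, expected, header = line.split(None, 3)
        subject = header.partition(":")[2].strip()  # drop the "Subject:" name
        hit = bool(RULES[rule].search(subject))
        if hit != (expected == "ok"):
            failures.append(line)
    return failures


print(run_tests(TESTS))
```

An empty list means every "ok" line matched and every "fail" line did not. The lowercase "fw:" and uppercase "FWD:" cases only fail because the patterns are case-sensitive.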

Comment 7 Justin Mason 2002-09-04 06:09:14 UTC
closing this bug:

- Rod's suggestion has been applied.

- too many rules were being discussed at once, problem rules should each
  have a bug to themselves, much easier to follow.

- we now have much better FP/FN figures in rules/STATISTICS.txt to work
  from.