Bug 807 - SPAM_PHRASE_XX_XX scores are silly
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules
Version: SVN Trunk (Latest Devel Version)
Hardware: Other other
Importance: P3 normal
Target Milestone: ---
Assignee: SpamAssassin Developer Mailing List
Reported: 2002-09-03 03:38 UTC by Michael Moncur
Modified: 2002-09-18 11:47 UTC
Description Michael Moncur 2002-09-03 03:38:09 UTC
I know there are complex and ineffable reasons for the GA to do this, but from 
a real-life perspective, the SPAM_PHRASE_XX scores are just plain strange.

In particular, SPAM_PHRASE_00_01 matches on over 50% of non-spam in my tests, 
and adds half a point. I'm also quite puzzled by the SPAM_PHRASE_55_XX score, 
although it's been discussed here before.

I really think the lowest three rules should be thrown out or made negative - 
they look like great negative rules on my corpus - and the scores of the rest 
should be looked at closely.

50_scores.cf:score SPAM_PHRASE_00_01              0.552
50_scores.cf:score SPAM_PHRASE_01_02              -0.094
50_scores.cf:score SPAM_PHRASE_02_03              -0.713
50_scores.cf:score SPAM_PHRASE_03_05              0.075
50_scores.cf:score SPAM_PHRASE_05_08              0.737
50_scores.cf:score SPAM_PHRASE_08_13              -0.070
50_scores.cf:score SPAM_PHRASE_13_21              2.969
50_scores.cf:score SPAM_PHRASE_21_34              3.593
50_scores.cf:score SPAM_PHRASE_34_55              2.648
50_scores.cf:score SPAM_PHRASE_55_XX              -4.018

Test results from my corpus...

OVERALL%   SPAM% NONSPAM%     S/O   SCORE  NAME
  13251    10071     3180    0.76    0.00  (all messages)
100.000   76.002   23.998    0.76    0.00  (all messages as %)
  1.094    1.440    0.000    1.00    2.65  SPAM_PHRASE_34_55
  0.023    0.030    0.000    1.00   -4.02  SPAM_PHRASE_55_XX
  7.916   10.406    0.031    1.00    3.59  SPAM_PHRASE_21_34
 16.633   21.249    2.013    0.91    2.97  SPAM_PHRASE_13_21
  9.297   10.863    4.340    0.71    0.74  SPAM_PHRASE_05_08
 19.742   22.470   11.101    0.67   -0.07  SPAM_PHRASE_08_13
  7.652    7.705    7.484    0.51    0.07  SPAM_PHRASE_03_05
  4.158    3.336    6.761    0.33   -0.71  SPAM_PHRASE_02_03
 26.836   18.469   53.333    0.26    0.55  SPAM_PHRASE_00_01
  6.649    4.031   14.937    0.21   -0.09  SPAM_PHRASE_01_02
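
For readers unfamiliar with the S/O column, the values in the table above are consistent with S/O = SPAM% / (SPAM% + NONSPAM%). That formula is inferred from the numbers here, not taken from the tool's source, so treat the following as a minimal sketch; the function name `so_ratio` is hypothetical.

```python
def so_ratio(spam_pct, nonspam_pct):
    """Fraction of a rule's hit rate that comes from spam.

    Assumed formula (inferred from the table, not from the tool's source):
    S/O = SPAM% / (SPAM% + NONSPAM%).
    Returns None when the rule hits nothing at all.
    """
    total = spam_pct + nonspam_pct
    if total == 0:
        return None
    return spam_pct / total


# (rule name, SPAM%, NONSPAM%) copied from the table above
rows = [
    ("SPAM_PHRASE_13_21", 21.249, 2.013),
    ("SPAM_PHRASE_00_01", 18.469, 53.333),
    ("SPAM_PHRASE_01_02", 4.031, 14.937),
]

for name, spam_pct, nonspam_pct in rows:
    print(f"{name}  {so_ratio(spam_pct, nonspam_pct):.2f}")
```

A value near 1.00 means the rule fires almost only on spam, and a value near 0.00 means it fires almost only on non-spam, which is why the 00_01 and 01_02 buckets look like candidates for negative scores.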
Comment 1 Michael Moncur 2002-09-03 04:20:27 UTC
>I really think the lowest three rules should
>be thrown out or made negative

Er, two of them are already negative. Ahem.
Comment 2 Daniel Quinlan 2002-09-03 11:48:59 UTC
The first three rules should just be stamped as "nice" and the GA should
know how to score them.
Comment 3 Michael Moncur 2002-09-04 01:22:49 UTC
Sounds good to me. Has anyone looked at why SPAM_PHRASE_55_XX scored so 
negatively? It matches exactly 3 out of 10,000 spam and zero out of 5000 
nonspam on my corpus.

Also, if the GA knows to score 'nice' rules negatively, shouldn't it know to 
score 'mean' (default) rules positively?
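
The hit count for SPAM_PHRASE_55_XX can be recovered from the percentage table in the description. This is a rough sketch that assumes the corpus totals shown in the "(all messages)" row (10071 spam, 3180 non-spam); the helper name `hits_from_pct` is made up for illustration.

```python
def hits_from_pct(pct, corpus_size):
    """Convert a hit percentage back into an approximate message count."""
    return round(pct / 100 * corpus_size)


# Totals taken from the "(all messages)" row of the table in the description
SPAM_TOTAL = 10071
NONSPAM_TOTAL = 3180

# SPAM_PHRASE_55_XX: 0.030% of spam, 0.000% of non-spam
spam_hits = hits_from_pct(0.030, SPAM_TOTAL)
nonspam_hits = hits_from_pct(0.000, NONSPAM_TOTAL)
print(spam_hits, nonspam_hits)
```

With so few hits, the GA has very little evidence to constrain this rule's score in either direction.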
Comment 4 Daniel Quinlan 2002-09-18 19:47:49 UTC
This has been fixed with the new GA.

score SPAM_PHRASE_08_13              1.320
score SPAM_PHRASE_13_21              1.388
score SPAM_PHRASE_21_34              1.833
score SPAM_PHRASE_34_55              1.883
score SPAM_PHRASE_55_XX              0.540