SA Bugzilla – Bug 807
SPAM_PHRASE_XX_XX scores are silly
Last modified: 2002-09-18 11:47:49 UTC
I know there are complex and ineffable reasons for the GA to do this, but from a real-life perspective, the SPAM_PHRASE_XX scores are just plain strange. In particular, SPAM_PHRASE_00_01 matches on over 50% of non-spam in my tests, and adds half a point. I'm also quite puzzled by the SPAM_PHRASE_55_XX score, although it's been discussed here before.

I really think the lowest three rules should be thrown out or made negative - they look like great negative rules on my corpus - and the scores of the rest should be looked at closely.

50_scores.cf:score SPAM_PHRASE_00_01 0.552
50_scores.cf:score SPAM_PHRASE_01_02 -0.094
50_scores.cf:score SPAM_PHRASE_02_03 -0.713
50_scores.cf:score SPAM_PHRASE_03_05 0.075
50_scores.cf:score SPAM_PHRASE_05_08 0.737
50_scores.cf:score SPAM_PHRASE_08_13 -0.070
50_scores.cf:score SPAM_PHRASE_13_21 2.969
50_scores.cf:score SPAM_PHRASE_21_34 3.593
50_scores.cf:score SPAM_PHRASE_34_55 2.648
50_scores.cf:score SPAM_PHRASE_55_XX -4.018

Test results from my corpus...

OVERALL%     SPAM%  NONSPAM%    S/O   SCORE  NAME
   13251     10071      3180   0.76    0.00  (all messages)
 100.000    76.002    23.998   0.76    0.00  (all messages as %)
   1.094     1.440     0.000   1.00    2.65  SPAM_PHRASE_34_55
   0.023     0.030     0.000   1.00   -4.02  SPAM_PHRASE_55_XX
   7.916    10.406     0.031   1.00    3.59  SPAM_PHRASE_21_34
  16.633    21.249     2.013   0.91    2.97  SPAM_PHRASE_13_21
   9.297    10.863     4.340   0.71    0.74  SPAM_PHRASE_05_08
  19.742    22.470    11.101   0.67   -0.07  SPAM_PHRASE_08_13
   7.652     7.705     7.484   0.51    0.07  SPAM_PHRASE_03_05
   4.158     3.336     6.761   0.33   -0.71  SPAM_PHRASE_02_03
  26.836    18.469    53.333   0.26    0.55  SPAM_PHRASE_00_01
   6.649     4.031    14.937   0.21   -0.09  SPAM_PHRASE_01_02
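To make the complaint concrete, here is a rough Python sketch that recomputes the S/O column from the SPAM% and NONSPAM% figures and lists the rules whose score sign disagrees with it. The formula S/O = SPAM% / (SPAM% + NONSPAM%) is an assumption inferred from the numbers in the results above, not taken from the SpamAssassin source, and the names `ROWS`, `s_o`, `mismatched` and the 0.5 "neutral" threshold are hypothetical, chosen only for illustration.

```python
# Rows copied from the test results above: (rule name, SPAM%, NONSPAM%, score).
ROWS = [
    ("SPAM_PHRASE_00_01", 18.469, 53.333, 0.552),
    ("SPAM_PHRASE_01_02", 4.031, 14.937, -0.094),
    ("SPAM_PHRASE_02_03", 3.336, 6.761, -0.713),
    ("SPAM_PHRASE_03_05", 7.705, 7.484, 0.075),
    ("SPAM_PHRASE_05_08", 10.863, 4.340, 0.737),
    ("SPAM_PHRASE_08_13", 22.470, 11.101, -0.070),
    ("SPAM_PHRASE_13_21", 21.249, 2.013, 2.969),
    ("SPAM_PHRASE_21_34", 10.406, 0.031, 3.593),
    ("SPAM_PHRASE_34_55", 1.440, 0.000, 2.648),
    ("SPAM_PHRASE_55_XX", 0.030, 0.000, -4.018),
]


def s_o(spam_pct, nonspam_pct):
    """Assumed S/O definition: share of a rule's hit rate that comes from spam."""
    return spam_pct / (spam_pct + nonspam_pct)


def mismatched(rows, neutral=0.5):
    """Return rules whose score sign contradicts their S/O ratio.

    A rule hitting mostly spam (S/O above `neutral`) with a negative score,
    or mostly non-spam (S/O below `neutral`) with a positive score, is flagged.
    """
    flagged = []
    for name, spam_pct, nonspam_pct, score in rows:
        ratio = s_o(spam_pct, nonspam_pct)
        if (ratio > neutral and score < 0) or (ratio < neutral and score > 0):
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    for rule in mismatched(ROWS):
        print(rule)
```

Under those assumptions the flagged rules are SPAM_PHRASE_00_01, SPAM_PHRASE_08_13 and SPAM_PHRASE_55_XX. This only checks the sign of each score against the hit rates; it says nothing about how large a score should be.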
> I really think the lowest three rules should
> be thrown out or made negative

Er, two of them are already negative. Ahem.
The first three rules should just be stamped as "nice" and the GA should know how to score them.
Sounds good to me. Has anyone looked at why SPAM_PHRASE_55_XX scored so negatively? It matches exactly 3 out of 10,000 spam and zero out of 5000 nonspam on my corpus. Also, if the GA knows to score 'nice' rules negatively, shouldn't it know to score 'mean' (default) rules positively?
This has been fixed with the new GA.

score SPAM_PHRASE_08_13 1.320
score SPAM_PHRASE_13_21 1.388
score SPAM_PHRASE_21_34 1.833
score SPAM_PHRASE_34_55 1.883
score SPAM_PHRASE_55_XX 0.540