Bug 670 - BALANCE_FOR_LONG: way too many FPs
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules
Version: SVN Trunk (Latest Devel Version)
Hardware: Other other
Importance: P3 minor
Target Milestone: ---
Assignee: SpamAssassin Developer Mailing List
Duplicates: 943 954
Depends on:
Blocks:
 
Reported: 2002-08-08 08:24 UTC by Justin Mason
Modified: 2002-09-20 10:50 UTC (History)

Description Justin Mason 2002-08-08 08:24:46 UTC
   0.05     0.05     0.05    0.52   -1.00  BALANCE_FOR_LONG_40K (fn)
   0.62     0.66     0.49    0.57   -1.00  BALANCE_FOR_LONG_20K (fn)

worst frequencies of all the tests. :(
 
We need a better way to measure "message length" that doesn't
hit spam. Most of the spams *aren't* actually that long,
but with a text and an HTML version included (separate MIME
parts), they add up to be "long" according to the current
test.

The current test just measures byte size AFAICR.
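
To illustrate the double-counting described above, here is a minimal sketch in Python. It is not SpamAssassin's implementation; the function names and the sample message are hypothetical. It compares the summed size of all text parts with the size of the longest single text part of a multipart/alternative message.

```python
from email import message_from_string

# Hypothetical sample: the same short pitch sent as text/plain and text/html.
SAMPLE = "\n".join([
    "MIME-Version: 1.0",
    'Content-Type: multipart/alternative; boundary="b"',
    "",
    "--b",
    "Content-Type: text/plain",
    "",
    "Buy now! " * 100,
    "--b",
    "Content-Type: text/html",
    "",
    "<html><body>" + "Buy now! " * 100 + "</body></html>",
    "--b--",
    "",
])


def text_part_sizes(msg_text):
    """Return the decoded byte size of every text/* part in the message."""
    msg = message_from_string(msg_text)
    sizes = []
    for part in msg.walk():
        # Skip multipart containers; only leaf text parts carry a body.
        if part.get_content_maintype() == "text":
            payload = part.get_payload(decode=True) or b""
            sizes.append(len(payload))
    return sizes


def total_text_bytes(msg_text):
    """Length when every alternative is counted (text and HTML add up)."""
    return sum(text_part_sizes(msg_text))


def longest_text_part(msg_text):
    """Length when only the largest single alternative is counted."""
    return max(text_part_sizes(msg_text), default=0)


print(total_text_bytes(SAMPLE), longest_text_part(SAMPLE))
```

With a measure like longest_text_part, a dual-format message is no "longer" than its largest alternative, so a short spam sent as text plus HTML would not cross a length threshold just because it was sent twice.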
Comment 1 Michael Moncur 2002-08-08 22:05:59 UTC
The current test checks the length of the 'body' part. I know this doesn't 
include HTML tags - does it include multiple MIME parts? If so this explains 
some of the problems with the HTML_X_X tests.
Comment 2 Daniel Quinlan 2002-08-11 01:49:48 UTC
I agree that this test should be removed.  Here are my results:

OVERALL     SPAM  NONSPAM     S/O   SCORE  NAME
   9220     2264     6956    0.25    0.00  (all messages)
    468      286      182    0.83   -1.00  BALANCE_FOR_LONG
    157       94       63    0.82   -1.00  BALANCE_FOR_LONG_20K
     43       14       29    0.60   -1.00  BALANCE_FOR_LONG_40K

It's not even close to being a compensation test and wouldn't make a very
good spam indicator either.  I think the only two good options are:

1. Change the test to be a range test (min,max), set the score to zero, and
   see if the GA does anything interesting with it.  It seems unlikely that
   it could, so that would probably just be a waste of time.

2. Remove it and remove it quickly.

I think whether dual-format messages (HTML and non-HTML) are being counted is
not a significant factor since you see that for both spam and nonspam.
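
For reference, the S/O values in the table above can be reproduced from the hit counts. The sketch below is an assumption inferred from the numbers in this comment, not taken from the SpamAssassin source: it treats S/O for a rule as the spam hit rate divided by the sum of the spam and nonspam hit rates, so that the unequal corpus sizes (2264 spam, 6956 nonspam) do not skew the ratio. The function name is hypothetical.

```python
def s_over_o(spam_hits, nonspam_hits, spam_total, nonspam_total):
    """Ratio of per-class hit rates: 1.0 means spam only, 0.0 means nonspam only."""
    spam_rate = spam_hits / spam_total
    nonspam_rate = nonspam_hits / nonspam_total
    return spam_rate / (spam_rate + nonspam_rate)


SPAM_TOTAL, NONSPAM_TOTAL = 2264, 6956

# (spam hits, nonspam hits) per rule, copied from the table above.
rules = {
    "BALANCE_FOR_LONG": (286, 182),
    "BALANCE_FOR_LONG_20K": (94, 63),
    "BALANCE_FOR_LONG_40K": (14, 29),
}

for name, (spam, nonspam) in rules.items():
    print(f"{s_over_o(spam, nonspam, SPAM_TOTAL, NONSPAM_TOTAL):.2f}  {name}")
```

Under that assumption the three rules come out at 0.83, 0.82 and 0.60, matching the table. A rule meant to compensate with a negative score would need an S/O well below 0.5, and all three sit above it.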
Comment 3 Daniel Quinlan 2002-08-11 02:49:04 UTC
I'm commenting these rules out as a temporary measure; they were not making
testing of the new HTML parsing code any easier.
Comment 4 Daniel Quinlan 2002-09-20 18:48:43 UTC
*** Bug 954 has been marked as a duplicate of this bug. ***
Comment 5 Daniel Quinlan 2002-09-20 18:48:53 UTC
*** Bug 943 has been marked as a duplicate of this bug. ***
Comment 6 Daniel Quinlan 2002-09-20 18:50:12 UTC
These rules were never any good.  They have been removed.  Nobody will
miss them.  There is no basis for giving long messages a lower score.
