304 – -S option is utterly and hopelessly broken

Status:	RESOLVED WONTFIX

Alias:	None

Product:	Spamassassin
Classification:	Unclassified
Component:	spamassassin
Version:	SVN Trunk (Latest Devel Version)
Hardware:	Other other

Importance:	P3 normal
Target Milestone:	3.0.0
Assignee:	SpamAssassin Developer Mailing List

URL:
Whiteboard:
Keywords:

Duplicates (1):	341
Depends on:
Blocks:

Reported:	2002-05-11 10:20 UTC by Doug Morse
Modified:	2004-05-18 13:45 UTC
CC List:	4 users

Description Doug Morse 2002-05-11 10:20:42 UTC

hi,

i was having great trouble getting whitelist_from to work correctly.  then i
discovered what the problem was: i was using the -S option to improve
performance.  doing so, however, also seems to completely disable the
whitelist feature.  this is because, apparently, the whitelist weighting is
only added at the *end* of the scoring process (see example below).

is this really the desired behavior?  in light of this, would it not perhaps
be better to have spamassassin always add the whitelist score *first* in the
scoring process?  if so, it probably would also be wise to have -S stop on a
configurable minimum threshold as well (e.g., -100); otherwise, -S will be of
no benefit to whitelisted messages (i.e., such messages will be let through
but will always undergo full content analysis, which isn't needed since they
are whitelisted).

cheers!
doug


example:

# cat spam_msg.txt | spamassassin -t -L -D -S

SPAM: -------------------- Start SpamAssassin results ----------------------
SPAM: This mail is probably spam.  The original message has been altered
SPAM: so you can recognise or block similar unwanted mail in future.
SPAM: See http://spamassassin.org/tag/ for more details.
SPAM: 
SPAM: Content analysis details:   (5.6 hits, 5 required)
SPAM: Hit! (4.3 points)  Reply-To: is empty
SPAM: Hit! (1.3 points)  Received via SMTPD32 server (SMTPD32-n.n)
SPAM: 
SPAM: -------------------- End of SpamAssassin results ---------------------


the message is flagged as spam.  now, drop the -S option:

SPAM: -------------------- Start SpamAssassin results ----------------------
SPAM: This mail is probably spam.  The original message has been altered
SPAM: so you can recognise or block similar unwanted mail in future.
SPAM: See http://spamassassin.org/tag/ for more details.
SPAM: 
SPAM: Content analysis details:   (-84.1 hits, 5 required)
SPAM: Hit! (4.3 points)  Reply-To: is empty
SPAM: Hit! (1.3 points)  Received via SMTPD32 server (SMTPD32-n.n)
SPAM: Hit! (1.0 point)   From: ends in numbers
SPAM: Hit! (0.5 points)  Subject has an exclamation mark
SPAM: Hit! (1.5 points)  BODY: Asks you to click below
SPAM: Hit! (0.2 points)  BODY: No such thing as a free lunch (1)
SPAM: Hit! (2.3 points)  BODY: List removal information
SPAM: Hit! (1.6 points)  BODY: Mentions Spam Law "UCE-Mail Act"
SPAM: Hit! (1.0 point)   BODY: No such thing as a free lunch (3)
SPAM: Hit! (0.9 points)  BODY: Mentions Spam law "H.R. 3113"
SPAM: Hit! (1.3 points)  URI: Includes a link to a likely spammer email address
SPAM: Hit! (-100.0 points) From: address is in the user's white-list
SPAM: 
SPAM: -------------------- End of SpamAssassin results ---------------------

the message is correctly whitelisted and allowed through.
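
The two reports above can be replayed with a small sketch. It is not SpamAssassin code: the `scan` function and its `stop_high` / `stop_low` parameters are hypothetical names, scores are kept in tenths of a point to avoid rounding noise, and the test order is simply the order of the second report (the -S report is consistent with that order, but that is an assumption).

```python
# Scores in tenths of a point, copied from the full report above.
HITS = [
    ("Reply-To: is empty", 43),
    ("Received via SMTPD32 server (SMTPD32-n.n)", 13),
    ("From: ends in numbers", 10),
    ("Subject has an exclamation mark", 5),
    ("BODY: Asks you to click below", 15),
    ("BODY: No such thing as a free lunch (1)", 2),
    ("BODY: List removal information", 23),
    ('BODY: Mentions Spam Law "UCE-Mail Act"', 16),
    ("BODY: No such thing as a free lunch (3)", 10),
    ('BODY: Mentions Spam law "H.R. 3113"', 9),
    ("URI: Includes a link to a likely spammer email address", 13),
    ("From: address is in the user's white-list", -1000),
]


def scan(hits, stop_high=None, stop_low=None):
    """Add up hits in order; stop early once a threshold is crossed.

    Returns (score in points, number of tests that ran).
    """
    total = 0
    tests_run = 0
    for _name, points in hits:
        total += points
        tests_run += 1
        if stop_high is not None and total >= stop_high:
            break  # already spam: the behaviour -S appears to have
        if stop_low is not None and total <= stop_low:
            break  # already ham: the proposed minimum threshold
    return total / 10, tests_run


# Stop at 5 required points, whitelist test last: flagged as spam.
print(scan(HITS, stop_high=50))                    # (5.6, 2)
# No early stop: the whitelist hit is counted.
print(scan(HITS))                                  # (-84.1, 12)
# Whitelist first plus a -100 floor: one test decides.
whitelist_first = sorted(HITS, key=lambda hit: hit[1])
print(scan(whitelist_first, stop_high=50, stop_low=-1000))  # (-100.0, 1)
```

The last call shows why the two suggestions go together: moving the whitelist test to the front fixes the wrong verdict, and the lower threshold is what lets a whitelisted message skip the remaining content analysis.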

just for the record, i am running spamassassin v2.20 on redhat 7.2.

[EOF]

Comment 1 Duncan Findlay 2002-06-01 21:31:08 UTC

*** Bug 341 has been marked as a duplicate of this bug. ***

Comment 2 Duncan Findlay 2002-06-21 19:48:15 UTC

Unless I can't read code (which is a possibility), the -S flag is
virtually useless.

Instead of running negatively scoring tests, followed by positively
scoring tests (like I thought it did), the following is run (more or less):

rbl tests
-ve head tests
+ve head tests
-ve body tests
+ve body tests
-ve uri tests
+ve uri tests
-ve rawbody tests
+ve rawbody tests
-ve rawbody evals
+ve rawbody evals
-ve full tests
+ve full tests
-ve full evals
+ve full evals
-ve head evals
+ve head evals
more rbl tests?
awl test

This is pretty important! Essentially, -S doesn't work the way it's meant to at all.
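
A rough sketch of why that ordering matters, assuming the early stop fires as soon as the running score reaches the required value. The phase names follow the list above; the rule names, the scores (in tenths of a point) and the placement of the whitelist check among the head evals are assumptions made for illustration.

```python
PHASES = ["head", "body", "uri", "rawbody", "rawbody eval",
          "full", "full eval", "head eval"]

# (hypothetical rule name, phase, score in tenths of a point)
RULES = [
    ("EMPTY_REPLY_TO", "head", 43),
    ("SMTPD32_RECEIVED", "head", 13),
    ("CLICK_BELOW", "body", 15),
    ("IN_WHITELIST", "head eval", -1000),
]


def interleaved(rules):
    """Phase by phase, negative rules before positive ones within a phase."""
    return sorted(rules, key=lambda r: (PHASES.index(r[1]), r[2] >= 0))


def negatives_first(rules):
    """Every negative rule of every phase, then every positive rule."""
    return sorted(rules, key=lambda r: (r[2] >= 0, PHASES.index(r[1])))


def run_until(rules, required=50):
    """Run rules in the given order, stopping once the score is reached."""
    total = 0
    fired = []
    for name, _phase, points in rules:
        total += points
        fired.append(name)
        if total >= required:
            break
    return total / 10, fired


print(run_until(interleaved(RULES)))
# (5.6, ['EMPTY_REPLY_TO', 'SMTPD32_RECEIVED'])
print(run_until(negatives_first(RULES)))
# (-92.9, ['IN_WHITELIST', 'EMPTY_REPLY_TO', 'SMTPD32_RECEIVED', 'CLICK_BELOW'])
```

With the interleaved order the positive head tests cross the threshold before the whitelist check is ever reached; with all negative tests first, the stop can no longer hide a negative score.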

Comment 3 Justin Mason 2002-08-07 07:21:57 UTC

should we leave the -S option broken, and just ignore it in 2.40?

Comment 4 Theo Van Dinter 2002-08-07 07:57:41 UTC

Subject: Re: [SAdev]  -S option is utterly and hopelessly broken

On Wed, Aug 07, 2002 at 07:21:57AM -0700, bugzilla-daemon@hughes-family.org wrote:
> should we leave the -S option broken, and just ignore it in 2.40?

I guess that depends when that freeze/2.40 release is planned for...

To sum up the discussion so far, I believe there are three problems that
need solving:

1) -S cuts the time for spam, but non-spam will have all the rules run
   against them.

2) {white,black}list_* entries just add a score whereas they should
   probably just short-circuit completely.

3) Instead of running all neg. then pos. tests, we run neg/pos head,
   then neg/pos body, etc.

The third one shouldn't be hard to fix, add a few if statements around
the do_* functions, have a "foreach negative, positive" or "foreach both"
in check, and pass a variable saying what we should be testing to each
do_ function.  (do neg, or pos, or both (if -S isn't used))

The second one shouldn't be that bad either, run those tests first
and check test_names_hit when it returns.  If there were any hits,
abort there like -S.  I'm thinking of just checking USER_IN_BLACKLIST,
USER_IN_WHITELIST, USER_IN_BLACKLIST, and USER_IN_ALL_SPAM_TO.  The rest
should probably just stay adding a score.

I don't know how to solve #1, beyond solving #2.  Part of the problem
of checking spaminess is, well, checking spaminess.  We need to check
everything before we can properly report spaminess, so ...

Also: If running -S, should AWL function since the scores will be skewed?
I'd say we should avoid AWL if #2 occurs as well (why add the whitelist
to the AWL if it's explicitly *listed already?)
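
A minimal sketch of fixes 2 and 3 as described above, written in Python rather than SpamAssassin's Perl. Only the three rule names and the `test_names_hit` idea come from this comment; the `check` and `do_tests` signatures, the tuple layout of a test, and the integer scores are hypothetical.

```python
# Rule names from the comment above; everything else is illustrative.
SHORT_CIRCUIT = {"USER_IN_WHITELIST", "USER_IN_BLACKLIST", "USER_IN_ALL_SPAM_TO"}


def check(message, list_tests, other_tests, early_stop, required=5):
    """Return (score, test_names_hit); a test is (name, points, predicate)."""
    score = 0
    test_names_hit = []

    def do_tests(tests, polarity):
        """Run the tests matching polarity ('neg', 'pos' or 'both')."""
        nonlocal score
        for name, points, predicate in tests:
            if polarity == "neg" and points >= 0:
                continue
            if polarity == "pos" and points < 0:
                continue
            if predicate(message):
                score += points
                test_names_hit.append(name)
                # Stopping is only safe once every negative test has run.
                if early_stop and polarity == "pos" and score >= required:
                    return True
        return False

    # Fix 2: explicit list entries run first and short-circuit the rest.
    do_tests(list_tests, "both")
    if early_stop and SHORT_CIRCUIT.intersection(test_names_hit):
        return score, test_names_hit

    # Fix 3: all negative tests, then all positive ones (one pass without -S).
    for polarity in (("neg", "pos") if early_stop else ("both",)):
        if do_tests(other_tests, polarity):
            break
    return score, test_names_hit
```

Problem 1 is untouched by this: a message that never reaches `required` and is not explicitly listed still runs every test.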

Comment 5 Tobias v. Koch 2002-08-07 10:01:12 UTC

Subject: Re: [SAdev]  -S option is utterly and hopelessly broken

BDFO> should we leave the -S option broken, and just ignore it in 2.40?

Just ignore it and put a notice about this in the manpage. This
wouldn't be a real problem - SA won't be as fast as the user expects,
but that's all. In contrast, it would be a problem if we leave it as it
is, because SA could even mis-identify messages as spam then.

Wasn't -S broken ever since it was added?

tobias

Comment 6 Craig Hughes 2002-08-07 10:59:08 UTC

Subject: Re: [SAdev]  -S option is utterly and hopelessly broken

I'm not sure it's as badly broken as people make out -- I think 
it was badly broken in 2.30, possibly 2.31, but I think some 
reasonable amount of fix up has been done on it in the 
2.40CVS -- I'll schedule a bit of time to take a look at it, but 
I think -S is a high priority.  I think Deersoft would even be 
willing to allocate resources to it if required.

Comment 7 Matt Sergeant 2002-08-08 04:47:18 UTC

Subject: Re: [SAdev]  -S option is utterly and hopelessly broken

It's really broken and the idea is wrong anyway (IMHO). I say this even 
though I implemented it. I'm willing to bet we'd get more of a speedup 
if we took that -S stuff *out* of SA.

Let's focus on getting decision trees working - that'll speed things up a 
hell of a lot faster than -S ever would.

Matt.

Comment 8 Craig Hughes 2002-08-11 06:45:01 UTC

Subject: Re: [SAdev] Re:  -S option is utterly and hopelessly broken

I had been thinking along these lines too.  Scott, I'd love to 
see any work you end up doing on this.  Two tricks are 
converting from the config file syntax to a lexer, and then 
realizing that multiple patterns might match the same chunk of 
text, so the lexer has to be a little smarter.  Still shouldn't 
be too complicated though to get a huge speedup.

C

On Saturday, August 10, 2002, at 10:20  PM, Scott A Crosby wrote:

> On Thu,  8 Aug 2002 04:47:18 -0700 (PDT),
> bugzilla-daemon@hughes-family.org writes:
>
>> Let's focus on getting decision trees working - that'll speed things up a
>> hell of a lot faster than -S ever would.
>
> Hell no. Flex-style regexp matching will do regexp's 100x
> faster. Then, we can run *all* the regexp's at the same time at
> megabytes/second.. I'll have a prototype example sometime in the next
> couple of weeks to demo the possibilities. (My script requires some
> more adaption.. I can't promise that it'll work, but I'm pretty sure I
> will fulfill my claims.)
>
> I think that that has the greatest win... And to boot, as we always
> have the full regexp results, we can feed them into more interesting
> things, like ANN's, or perceptron networks, or anything else.
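
The shape of the single-pass idea, and the overlap problem mentioned at the top of this comment, can be sketched with Python's `re` module. This is only an illustration: the rule names and patterns are made up, and `re` is a backtracking engine rather than a flex-style automaton, so the sketch shows the structure and not the claimed speedup.

```python
import re

# Hypothetical rule names and patterns; note that LUNCH overlaps FREE_LUNCH.
RULES = {
    "CLICK_BELOW": r"click\s+below",
    "FREE_LUNCH": r"free\s+lunch",
    "LUNCH": r"lunch",
}

# One alternation with a named group per rule, scanned in a single pass.
COMBINED = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in RULES.items()),
    re.IGNORECASE,
)


def one_pass(text):
    """Names of the rules reported by a single scan of the combined pattern."""
    return {match.lastgroup for match in COMBINED.finditer(text)}


def rule_by_rule(text):
    """Names of the rules that match when each pattern is run separately."""
    return {name for name, pattern in RULES.items()
            if re.search(pattern, text, re.IGNORECASE)}


text = "Click below for a free lunch"
print(sorted(one_pass(text)))      # ['CLICK_BELOW', 'FREE_LUNCH']
print(sorted(rule_by_rule(text)))  # ['CLICK_BELOW', 'FREE_LUNCH', 'LUNCH']
```

The single scan consumes "free lunch" as one token, so the overlapping LUNCH rule is never reported. A combined matcher that should feed full rule results into later stages would have to report every pattern that accepts at a given position, not just the first.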

Comment 9 Justin Mason 2002-08-26 15:24:59 UTC

status on this?  I guess we'll be documenting -S as broken for
2.40, afaics...

Comment 10 Matt Sergeant 2002-08-27 08:17:30 UTC

Subject: Re: [SAdev]  -S option is utterly and hopelessly broken

Awaiting ability to run rules in arbitrary orders (i.e. probably 2.5 or 
3.0).

Comment 11 Justin Mason 2002-08-29 05:00:34 UTC

lowering pri

Comment 12 Daniel Quinlan 2003-09-18 19:50:56 UTC

Closing as WONTFIX.  -S has been gone for a long time.

We'll continue to work on performance, and early stopping of heuristics may
be part of the solution. Hopefully it won't require an option, or it may be
implemented completely differently. Either way, this one is definitely a
WONTFIX.

Comment 13 Duncan Findlay 2004-05-18 21:45:38 UTC

*** Bug 3109 has been marked as a duplicate of this bug. ***