Bug 3109 - RFE: really simple "this is ham" shortcircuiting
Summary: RFE: really simple "this is ham" shortcircuiting
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules (show other bugs)
Version: SVN Trunk (Latest Devel Version)
Hardware: Other other
: P5 enhancement
Target Milestone: 3.2.0
Assignee: SpamAssassin Developer Mailing List
URL:
Keywords:
: 3407 4763 (view as bug list)
Depends on: 4776 4860
Blocks: 2385 3370 4763
  Show dependency tree
 
Reported: 2004-02-28 15:14 UTC by Justin Mason
Modified: 2009-06-10 16:36 UTC (History)
2 users (show)



Attachment Type Modified Status Actions Submitter/CLA Status
shortcircuit patch patch None Dallas Engelken [HasCLA]
shortcircuit patch on 3.1.1 patch None Dallas Engelken [HasCLA]
shortcircuit patch with meta/dns priority support patch None Dallas Engelken [HasCLA]
s/c results gif image/gif None Dallas Engelken [HasCLA]
updated patch against svn patch None Dallas Engelken [HasCLA]
updated patch against svn #2 patch None Dallas Engelken [HasCLA]
automatically set meta rule priorities patch None Justin Mason [HasCLA]
shortcircuit.cf as used in jm's test text/plain None Justin Mason [HasCLA]
rule-priority code, against svn trunk patch None Justin Mason [HasCLA]
current short-circuiting patch, against svn trunk+3570 patch None Justin Mason [HasCLA]
new shortcircuit patch, against trunk patch None Justin Mason [HasCLA]
new new patch against trunk patch None Justin Mason [HasCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Justin Mason 2004-02-28 15:14:17 UTC
There are a number of situations where you can take a look at a message and know
 immediately that it should not be scanned. e.g.

1. message is too big
2. message has travelled entirely via trusted hosts (ALL_TRUSTED rule)
3. user is in all_spam_to
4. bug 2385
5. caller has other ways of determining that the message came from internal hosts

Currently we recommend that this "do not scan at all" short-circuiting be done
in the .procmailrc file before SA is called.  However, that doesn't apply for
products that call SA like Amavisd, MailScanner, et al; so all of those products
have to be aware of this, add their own code to take care of it, etc. And in the
case of bug 2385, that just isn't really an option!

I suggest we add a new rule type explicitly for *immediate* shortcircuiting,
applied before most/all of the time-consuming rules.   This way, new
whitelist/blacklist rules that are guaranteed unforgeable can be applied
immediately and the message can immediately be marked.
Comment 1 Daniel Quinlan 2004-02-28 17:34:59 UTC
> I suggest we add a new rule type explicitly for *immediate* shortcircuiting

I think we should make this a part of a SINGLE rule priority system (covering
both pre and post functionality) and extend scores to allow for short-circuiting:

body FOO /foo/
tpriority -10
score ham

- priority of zero is default
- less than zero means run early
- more than zero means run later
- score of "ham" or "spam" means mark as ham or spam and short circuit, no
  numeric score should be generated since the score number would be inaccurate
Comment 2 Keith Ivey 2004-02-28 17:57:08 UTC
How would this interact with autolearning?  Should there be a way to indicate
whether the message should be learned when short-circuiting?
Comment 3 Justin Mason 2004-02-28 19:02:53 UTC
Subject: Re:  RFE: really simple "this is ham" shortcircuiting 

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


>- priority of zero is default
>- less than zero means run early
>- more than zero means run later

+1 on those

>- score of "ham" or "spam" means mark as ham or spam and short circuit, no
>  numeric score should be generated since the score number would be inaccurate

Suggest generating a score of -9999 or 9999 -- otherwise I can guarantee
it'll confuse tools that act on X-Spam-Status.  Plus it'd make mass-check
results pretty hard to parse too ;)

'all_spam_to' already uses a very-low-score hack, this is similar (but
less hacky).

Also I'd prefer if it wasn't using the "score" command; a new one like
"shortcircuit ham", or even "tflags short-circuit-ham" strikes me as better.

re: autolearning: that's a hard one, since this would be how all_spam_to
would be implemented.  Maybe 2 ways of saying ham, one indicating that the
msg is autolearnable, one indicating not?

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFAQVZXQTcbUG5Y7woRAufbAKCQ1j7aOpQ3ecPHJD/Y4WdeI5X2EgCg1XQe
R6WnLb5voTZe1utLhPajtz0=
=8442
-----END PGP SIGNATURE-----

Comment 4 Duncan Findlay 2004-05-18 21:45:36 UTC
This was implemented (see bug 62) but was horribly broken (see bug 304). It was
eventually decided (and generally agreed) that short-circuiting has a
significant number of problems that we are not prepared to overcome. I'm going
to mark this as a duplicate of bug 304, because it sorta is.

*** This bug has been marked as a duplicate of 304 ***
Comment 5 Justin Mason 2005-07-29 10:27:28 UTC
argh!  this is NOT a dupe of bug 304!

this is an entirely DIFFERENT form of short-circuiting, not based on score, but
based on specific rules that are marked as very trustworthy short-circuit rules
(such as whitelisting).  read the comments. grr.

reopening -- although pushing it away from the 3.1.0 milestone.
Comment 6 Justin Mason 2005-10-04 18:37:30 UTC
btw regarding the "tpriority" idea -- note that this was indeed implemented and
is now in shipping code, as "priority".  all that's needed on top of that is an
implementation of a short-circuit keyword:

  tflags shortcircuit

I've just realised, btw, that we don't need separate "shortcircuit-ham" and
"shortcircuit-spam" keywords -- we can set the ham/spam status using the rule's
score.  Just set the score to -100 or +100, and shortcircuit.
Comment 7 Theo Van Dinter 2005-10-04 19:19:25 UTC
Subject: Re:  RFE: really simple "this is ham" shortcircuiting

On Tue, Oct 04, 2005 at 06:37:30PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> I've just realised, btw, that we don't need separate "shortcircuit-ham" and
> "shortcircuit-spam" keywords -- we can set the ham/spam status using the rule's
> score.  Just set the score to -100 or +100, and shortcircuit.

You can figure out the ham/spam via score, but the relative score
value doesn't matter imo.  I'd base ham/spam on "tflags nice" versus
not personally.  I had previously sent some thoughts to the dev list
about short circuiting, but I'll put them here as well so I can find it
easier next time:

        - set certain rules as SC if hit
                USER_IN_WHITELIST, USER_IN_BLACKLIST (not DEF)
                *BSP*
                HABEAS*
        - allow SC on ham score (ie: < #)
        - allow SC on spam score (ie: > #)
        - should autolearn skip SC msgs?  should we always do autolearn in the
          appropriate direction?
        - AWL should be skipped during SC
        - SC rules should have a negative priority so they run first
        - do *not* do score check per rule, do it either per priority or rule
          type (header, body, etc.)
        - SC will require is_spam SC as score + required_hits will be at odds
        - add SC header macro (get_tag)
        - SC for S/O 1.000 rules?  how about S/O near 1?  BAYES_99, etc.

        Some form of order/priority rearrangement:

        Blacklist               short
        Whitelist               user/admin wants it
        BSP/Habeas              reputable, non-forgable
        Other SC Rules          as early as possible
        Other Local Rules       lightweight
        Bayes                   don't do it unless we have to
        Network                 large latency, try to avoid

Comment 8 Justin Mason 2005-10-04 20:01:40 UTC
btw I was talking to somebody a while back, and they noted that their system
turned our conceptual model of relative speeds on its head:

>       Other Local Rules       lightweight
>       Network                 large latency, try to avoid

in their system, the local rules were heavyweight and avoided, because they
hammered the CPU, whereas the network tests were considered lightweight, because
they left the CPU idle while waiting for results.  so we need to be able to take
both into account IMO.
Comment 9 Theo Van Dinter 2006-01-19 17:54:38 UTC
*** Bug 4763 has been marked as a duplicate of this bug. ***
Comment 10 Vikram J. Gurjar 2006-01-22 13:37:24 UTC
*** Bug 4763 has been marked as a duplicate of this bug. ***
Comment 11 Dallas Engelken 2006-03-22 05:39:22 UTC
i am posting a short-circuit patch due to several personal requests.  i would 
say this is an over-simplified approach, but it does work very well... and it 
was created with very few lines of additional code in PerMsgStatus (37 to be 
exact).  it applies cleanly to current SVN, i have not tried it against 3.1.1 
or anything else.. feel free to do so.  

so basically, this code is just here for now in case you want to save some CPU 
cycles until there is something "better, faster, stronger" (ie pluganized check
() methods with improved priorities on evals, metas, plugins, etc).  below is 
just some basic info showing debug of it running.

-------------------------

normal message that didnt shortcircuit runs through all priorities you have 
defined.

[2203] dbg: check: running tests for priority: -100
[2203] dbg: check: running tests for priority: 0
[2203] dbg: check: running tests for priority: 500
[2203] dbg: check: running tests for priority: 1000

and now a shortcircuited rule (set priority -100 so it runs first)

body TEST /test/i
describe TEST this is just a test rule to show how shortcircuiting works..
score TEST 1.0
priority TEST -100
shortcircuit TEST ham

[2203] dbg: check: running tests for priority: -100
[2203] dbg: shortcircuit: hit on rule TEST, sc type is ham, apply score of -5

As you can see, priority 0, 500, and 100 were skiped. Changing shortcircuit 
method to 'spam' gives

[2216] dbg: check: running tests for priority: -100
[2216] dbg: shortcircuit: hit on rule TEST, sc type is spam, apply score of 15

# echo -e "\n\ntest" | spamc
X-Spam-Tests: TEST
X-Spam-Score: 15.0
X-Spam-Flag: YES

the values -5 and 15 are applied by setting a couple options in local.cf

shortcircuit_spam_score		15
shortcircuit_ham_score		-5

If you do not set these items to your choice, they default to 100 (spam) and -
100 (ham). 

Thats pretty much it.  It works great for hitting body, rawbody and headers 
rules that you know have a 1.0 S/O (spam-o-the-day).   be careful out there!

-------------------------

thx. dallas



Comment 12 Dallas Engelken 2006-03-22 05:40:10 UTC
Created attachment 3426 [details]
shortcircuit patch

implements simple shortcircuiting
Comment 13 Justin Mason 2006-03-22 10:58:57 UTC
sweet!   I want to get this into 3.2.0 immediately for my server ;) thanks Dallas!

+1
Comment 14 Dallas Engelken 2006-03-22 18:43:18 UTC
Created attachment 3427 [details]
shortcircuit patch on 3.1.1

due to popular demand... here is a patch against 3.1.1.  this patch goes one
step further and defines a 60_shortcircuit.cf with some basic (obvious) rules
that just use their default scores... 

# echo -e "To: blah@blah.com\n\n" | spamc
X-Spam-Tests: USER_IN_BLACKLIST_TO
X-Spam-Score: 10.0
X-Spam-Flag: YES

[13826] dbg: check: running tests for priority: -100
[13826] dbg: check: running tests for priority: -95
[13826] dbg: check: running tests for priority: -90
[13826] dbg: shortcircuit: hit on rule USER_IN_BLACKLIST_TO, sc type is
default, apply score of 10

so, you see it runs through whitelist first at priority -100, all trusted on
-95, and then hits a blacklist_to on priority -90.

with this approach there is no cancellation when a whitelist_* and blacklist_*
(maybe by mistake) hit on the same message.  whitelist takes priority.	if you
want it the other way around, set the black test to higher (lower number)
priorities than -100.

have fun!
Comment 15 Dallas Engelken 2006-03-22 18:49:41 UTC
side note: after applying this patch, throw
http://www.rulesemporium.com/rules/70_sare_whitelist.cf in your local config,
your cpu will thank you. 

if anyone has any additions to 70_sare_whitelist.cf, please contribute via the
user list.
http://lists.maddoc.net/mailman/listinfo/sare-users

thanks.
Comment 16 Theo Van Dinter 2006-03-22 19:20:16 UTC
sorry to be a downer, but I'm -1 on applying this patch in its current state.  I think it's a decent first go, 
thanks Dallas, but I'd like to see us come up with more of a design before applying code then having to 
go in and change it all around, etc.


here's a number of thoughts I have about the patch:

first, I really don't like the override score idea.  I don't think there's a real point in doing it, and it's 
going to cause confusion by people who add up the rules and see it's nowhere near the rule score sum.

second, there has to be something in the x-spam-status header to indicate that SC occured.

third, mass-check wrt reuse needs to be modified to take SC into account.  we'll also need to let people 
who do mass-checks know that they really shouldn't use SC since we want to see the hit results for all 
network tests (the current reuse behavior).

fourth, what about autolearn?  it's likely autolearn won't happen by default (too few hits, etc,) but it 
could also potentially learn the wrong way depending on what rules hit.  I'd like to get a consensus on 
what should happen here (IMO, autolearn is skipped for SC).

fifth, what about AWL?  similar to autolearn, except that the score is going to be completely off.  IMO, 
AWL is skipped during SC.

sixth, an easy addition to this patch would be short circuiting on current message score.  sometimes 
you don't care what rules hit, but you want to stop after a certain point.

seventh, and this is tricky, do we want to try moving the priorities around automatically when SC is 
enabled?  I was thinking of having code that sets default priorities, similar to default score, that goes 
through the SC rule listing and bumps the priority so they run first (including meta dependencies) 
unless the rule has a priority specifically set by config.  Then another loop which sets the default to 
priority 0.

eighth, should the SC decision code be in a plugin?
Comment 17 Michael Parker 2006-03-22 19:24:21 UTC
RE: plugin

See Check plugin proposal.
Comment 18 Dallas Engelken 2006-03-22 20:27:11 UTC
hey theo, i have no objection at all to your points.  the problem is, its just 
all talk, and has been for a long time.   this ticket is over 2 years old.  so 
instead of sitting back and watching nothing ever come of it... i figured it 
would be better to have something rather than nothing. 

i realize its not the "ideal" solution, but hopefully some people can find it 
useful, at least until the pluginized check is in place.  and if nothing else, 
maybe it will spur some discussion and keep this ticket moving in the right 
direction... because this is an important feature.

i not sure i agree on the autolearning bit.  users should be able make up their 
mind on autolearning, and set a tflag noautolearn if they need to.  if you let 
them shortcircuit their rules, you should let them make an autolearn vs 
noautolearn decision..   or maybe it defaults to noautolearn on all 
shortcircuited rules, and requires them to optin with autolearn tflag.

one note on the existing patches, i just realized that dns tests are being 
applied unless you explicitly skip it... since that code runs after the do_* 
test, which isnt last() until the next iteration.  so it needs to  be

    if ($needs_dnsbl_harvest_p && !exists $self->{shortcircuit}) {










Comment 19 Vikram J. Gurjar 2006-03-23 03:16:48 UTC
Firstly, let me congratulate Dallas on doing something that has been needed for
a long, long time. It  is needed to make SA more efficient. Compare SA to Dspam
and you can see the difference immediately in terms of speed.

It is a GREAT first run.

As to Theo's points...

1. Override scores...just set them to zero and all is well.

2. X-Spam-Status .. I agree with you. A brief SC at (Test) would be nice.

As to the rest of your points....one simple line in local.cf would take care of
it...of course with a bit of code change from Dallas...w

# Short circuiting behaviour - default 0 set it to 1 to enable.
short_circuit (1|0)  

Again....thanks Dallas.
Comment 20 Dallas Engelken 2006-03-23 04:46:46 UTC
Created attachment 3428 [details]
shortcircuit patch with meta/dns priority support

okay, i found shortcircuit on body, rawbody and header tests only to be very
boring...  other than user-defined white/black lists, there is really not much
else that you can make a good shortcircuit descision on (at least in the
default rules).  so, i started messing around with the harvest priorities and
found that it is quite simple to accomplish prioritized dns and meta rules.  

attached is a updated patch against 3.1.1 that will allow this, and give you
alot more power with shortcircuiting of meta tests.  you just have to make sure
the tests your meta's rely on have higher priorities than the meta tests
themselves.  works great as long as you order your priorities correctly.
i've seen the loads disappear on some dedicated spamd servers tonite :)

when meta rules are created, they are set with a priority of 500 by default,
which is what the constant HARVEST_DNSBL_PRIORITY was previously set that. 
This caused all metas test to not even be evaluated until all the default
priority rules had ran, which was what needed to happen.  However, this
prevented proper prioritization on the meta rules.  By changing
HARVEST_DNSBL_PRIORITY to -500, this allows it to evaluate any meta in a
priority group > -500 prior to the cpu heavy rules.   if you do not define any
metas with < 0 priorities, they just works like they always have, running at
priority 500.

So, now you can do something that is actually worthwile..  like

# must define dns based tests at a priority > -800
# must define meta priorities > -500 and < any dns test priority it relies on.
priority URIBL_BLACK -500
priority URIBL_JP_SURBL -500
priority URIBL_SC_SURBL -500
priority URIBL_OB_SURBL -500

meta  SC_MULTI_RHSBL   ( URIBL_BLACK && (URIBL_JP_SURBL || URIBL_SC_SURBL ||
URIBL_OB_SURBL ))
priority SC_MULTI_RHSBL -100
shortcircuit SC_MULTI_RHSBL spam

[31446] dbg: config: read file /usr/share/spamassassin/60_shortcircuit.cf
[31446] dbg: check: running tests for priority: -1000
[31446] dbg: priority: running high priority (-1000) eval USER_IN_WHITELIST
[31446] dbg: priority: running high priority (-1000) eval USER_IN_ALL_SPAM_TO
[31446] dbg: priority: running high priority (-1000) eval USER_IN_DEF_WHITELIST

[31446] dbg: priority: running high priority (-1000) eval SUBJECT_IN_WHITELIST
[31446] dbg: check: running tests for priority: -950
[31446] dbg: priority: running high priority (-950) eval ALL_TRUSTED
[31446] dbg: check: running tests for priority: -900
[31446] dbg: priority: running high priority (-900) eval USER_IN_BLACKLIST_TO
[31446] dbg: priority: running high priority (-900) eval USER_IN_BLACKLIST
[31446] dbg: priority: running high priority (-900) eval SUBJECT_IN_BLACKLIST
[31446] dbg: check: running tests for priority: -500
[31446] dbg: priority: running high priority (-500) eval URIBL_BLACK
[31446] dbg: priority: running high priority (-500) eval URIBL_JP_SURBL
[31446] dbg: priority: running high priority (-500) eval URIBL_OB_SURBL
[31446] dbg: priority: running high priority (-500) eval URIBL_SC_SURBL
[31446] dbg: check: running tests for priority: -100
[31446] dbg: priority: running high priority (-100) meta SC_URIBL_HASH
[25755] dbg: shortcircuit: hit on rule SC_MULTI_RHSBL, sc type is spam, apply
score of 100

and whitelists/blacklists running at a higher priority like -1000 and -900 will
prevent check_post_dnsbl from ever firing of dns queries because the
HARVEST_DNSBL_PRIORITY does not kick in until the priority group reaches -800.

[22242] dbg: check: running tests for priority: -1000
[22242] dbg: priority: running high priority (-1000) eval USER_IN_WHITELIST
[22242] dbg: shortcircuit: hit on rule USER_IN_WHITELIST, sc type is default,
apply score of -100

so now we are getting somewhere! ;)
Comment 21 Justin Mason 2006-03-23 13:32:23 UTC
OK, some agreement with Theo, and some disagreement ;)

> first, I really don't like the override score idea.  I don't think
> there's a real point in doing it, and it's going to cause confusion by
> people who add up the rules and see it's nowhere near the rule score
> sum.

-1
I think it provides a better, more familiar UI for users.  Compare
(assuming we fix the "something in x-spam-status" issue below):

  X-Spam-Status: Yes, score=1.9 required=5.0 tests=BAYES_50,HTML_IMAGE_ONLY_28,
        HTML_MESSAGE,MIME_HTML_ONLY,SC_TEST shortcircuit=spam
        autolearn=no version=3.2.0-r372567

eh?  "Yes, score=1.9 required=5.0", wtf?
that's less intuitive, and more likely to cause confusion, than

  X-Spam-Status: Yes, score=15.0 required=5.0 tests=BAYES_50,HTML_IMAGE_ONLY_28,
        HTML_MESSAGE,MIME_HTML_ONLY,SC_TEST shortcircuit=spam
        autolearn=no version=3.2.0-r372567

it really makes less sense in my opinion for the mail to be marked as spam
with a low score, than for the tests not to add up. ;)

(Personally, I'd prefer to see a more "extreme" score value -- -100 and
100, for example -- to give another "hint" to users that something
out-of-the-ordinary has occurred.  But I don't think that's a
vote-stopper issue.  for example:)

  X-Spam-Status: Yes, score=100.0 required=5.0 tests=BAYES_50,
        HTML_IMAGE_ONLY_28,HTML_MESSAGE,MIME_HTML_ONLY,SC_TEST
        shortcircuit=spam autolearn=no version=3.2.0-r372567

> second, there has to be something in the x-spam-status header to
> indicate that SC occured.

+1
agreed on that point, I didn't spot that this was missing. ;)

> third, mass-check wrt reuse needs to be modified to take SC into
> account.  we'll also need to let people who do mass-checks know that
> they really shouldn't use SC since we want to see the hit results for
> all network tests (the current reuse behavior).

+1
OK, that is a good point.  if the X-Spam-Status line is fixed to
record "shortcircuit=spam" etc., it'll be easy to detect in mass-check.

> fourth, what about autolearn?  it's likely autolearn won't happen by
> default (too few hits, etc,) but it could also potentially learn the
> wrong way depending on what rules hit.  I'd like to get a consensus on
> what should happen here (IMO, autolearn is skipped for SC).

-1
I don't think this needs to block autolearning necessarily.  It can be
driven by the rules that caused the shortcircuiting.  For example, let's
say USER_IN_WHITELIST is the rule that shortcircuited -- that rule already
mandates "noautolearn", so there's no need for the act of shortcircuiting
to do so, too.

There's a possibility that some extremely reliable shortcircuiting rules
(esp ones that don't rely on user/admin input) can provide good data
for autolearning, so let's not rule it out.

by the way the "autolearn the wrong way" scenario should be impossible
already, see lib/Mail/SpamAssassin/Plugin/AutoLearnThreshold.pm:

  dbg("learn: auto-learn? no: scored as ham but autolearn wanted spam");


> fifth, what about AWL?  similar to autolearn, except that the score is
> going to be completely off.  IMO, AWL is skipped during SC.

+1
ok good point, agreed.

> sixth, an easy addition to this patch would be short circuiting on
> current message score.  sometimes you don't care what rules hit, but you
> want to stop after a certain point.

-1
*no*. This is EXACTLY the approach that has been tried several times
before, with lousy results -- which resulted in the idea that this
approach is more worthwhile.  Let's not go around in circles!

It'll be easy enough to build additions on top of this simple change,
later.  We don't have to do everything *now*.

> seventh, and this is tricky, do we want to try moving the priorities
> around automatically when SC is enabled?  I was thinking of having code
> that sets default priorities, similar to default score, that goes
> through the SC rule listing and bumps the priority so they run first
> (including meta dependencies) unless the rule has a priority
> specifically set by config.  Then another loop which sets the default to
> priority 0.
> 
> eighth, should the SC decision code be in a plugin?

-1
No. There are a lot of issues around pluginizing parts of check(), and
they're being discussed in 2 other bugs simultaneously iirc!  Let's not
confuse matters even more by making it 3!  :(

This patch, in contrast, is pretty simple and small, and can be easily
refactored into a plugin *later* if desired, once the other pluginization
discussions are concluded with an agreement.

I think this is a great patch, adding some functionality that we *really*
*REALLY* need -- I know I, for one, have had to put in some upfront
black/whitelisting at the MTA level on one of my servers due to high load
issues, and this patch would allow me to avoid that kludge.

It does impact on other parts of the SpamAssassin design, but let's not
get dragged into a "what colour to paint the bikeshed" discussion. I
really feel that if we were to insist on pluggable shortcircuiting
algorithms immediately, we'd be heading down that road -- the existing
check()-pluginization discussions certainly give that impression to be
honest.

Again, what I'm saying is, there's no need to get it "perfect" in the
first svn commit.

If you want, I'd suggest we apply this patch to svn trunk (ok, after fixing
some or all of the +1's above ;), and then open some new bugs to deal with
evolutionary follow-on ideas...

--j.
Comment 22 Dallas Engelken 2006-03-23 18:58:08 UTC
Created attachment 3429 [details]
s/c results gif

i implemented uribl+surbl shortcircuiting last night on one box, and left
another unpatched.  here is the load average comparision of them.  it should
speak for itself :)
Comment 23 Dallas Engelken 2006-03-23 22:48:04 UTC
>> fifth, what about AWL?  similar to autolearn, except that the score is
>>  going to be completely off.  IMO, AWL is skipped during SC.

> +1
> ok good point, agreed.

hmm... what am i missing here???

AWL is a header eval, run at prio 1000, so it never gets that deep in the scan
if the email hits a shortcircuited rule.  



Comment 24 Daryl C. W. O'Shea 2006-03-23 22:49:45 UTC
(In reply to comment #21)
> > first, I really don't like the override score idea.  I don't think
> > there's a real point in doing it, and it's going to cause confusion by
> > people who add up the rules and see it's nowhere near the rule score
> > sum.
> 
> -1
> I think it provides a better, more familiar UI for users.  Compare
> (assuming we fix the "something in x-spam-status" issue below):
> 
>   X-Spam-Status: Yes, score=1.9 required=5.0 tests=BAYES_50,HTML_IMAGE_ONLY_28,
>         HTML_MESSAGE,MIME_HTML_ONLY,SC_TEST shortcircuit=spam
>         autolearn=no version=3.2.0-r372567
> 
> eh?  "Yes, score=1.9 required=5.0", wtf?
> that's less intuitive, and more likely to cause confusion, than
> 
>   X-Spam-Status: Yes, score=15.0 required=5.0 tests=BAYES_50,HTML_IMAGE_ONLY_28,
>         HTML_MESSAGE,MIME_HTML_ONLY,SC_TEST shortcircuit=spam
>         autolearn=no version=3.2.0-r372567
> 
> it really makes less sense in my opinion for the mail to be marked as spam
> with a low score, than for the tests not to add up. ;)

While it may make more sense to have a score greater than the required score,
neither make complete sense. :)

I think that either tests that are going to cause short circuit should either
have a score alone that will cause a sensible spam score; or where that may not
be possible (say, decision trees in a future check() plugin) score should be
undefined:

X-Spam-Level: ****************************************** (max # of stars we use)
X-Spam-Status: Yes, score= required= tests=BAYES_50,
         HTML_IMAGE_ONLY_28,HTML_MESSAGE,MIME_HTML_ONLY,SC_TEST
         shortcircuit=spam autolearn=no version=3.2.0-r372567

(or maybe 0.0 for both score and required)


> > fourth, what about autolearn?  it's likely autolearn won't happen by
> > default (too few hits, etc,) but it could also potentially learn the
> > wrong way depending on what rules hit.  I'd like to get a consensus on
> > what should happen here (IMO, autolearn is skipped for SC).
> 
> -1
> I don't think this needs to block autolearning necessarily.  It can be
> driven by the rules that caused the shortcircuiting.  For example, let's
> say USER_IN_WHITELIST is the rule that shortcircuited -- that rule already
> mandates "noautolearn", so there's no need for the act of shortcircuiting
> to do so, too.

Agreed with jm here.


> > seventh, and this is tricky, do we want to try moving the priorities
> > around automatically when SC is enabled?  I was thinking of having code
> > that sets default priorities, similar to default score, that goes
> > through the SC rule listing and bumps the priority so they run first
> > (including meta dependencies) unless the rule has a priority
> > specifically set by config.  Then another loop which sets the default to
> > priority 0.

I'd say yes, but it can wait.


> > eighth, should the SC decision code be in a plugin?
> 
> -1
> No. There are a lot of issues around pluginizing parts of check(), and
> they're being discussed in 2 other bugs simultaneously iirc!  Let's not
> confuse matters even more by making it 3!  :(
> 
> This patch, in contrast, is pretty simple and small, and can be easily
> refactored into a plugin *later* if desired, once the other pluginization
> discussions are concluded with an agreement.

Agreed with jm.
Comment 25 Dallas Engelken 2006-03-24 05:16:09 UTC
Created attachment 3431 [details]
updated patch against svn

attached is an updated patch. is does the following

1) adds the following TEMPLATE TAGS: SC, SCTYPE, and SCRULE.  this will allow
for the add_header necessary for X-Spam-Short-Circuit if needed.
2) updates 10_default_prefs.cf to include shortcircuit=_SCTYPE_ in the
X-Spam-Status header (for masscheck purposes).
3) resets test_names_hit, score to 0, and tag_data before applying the
shortcircuit.  this will prevent multiple rules from showing hit when you have
a shortcircuited meta test.
4) alters dns harvest priority to -800 to allow dns tests to be prioritized
properly.
5) alters meta harvest priority to -500 to allow meta tests to be prioritized. 
Conf/Parser.pm still sets meta priorities to 500 by default.
6) supports a shortcircuit 'off' value so you can override a shortcircuited
default test via your local config.
7) alters report template to include _SCTYPE_ on the Content analysis details
line.
8) adds 60_shortcircuit.cf with some default shortcircuits, and some other
examples that are commented.

so, that takes care of most of the major issues i hope.
Comment 26 Dallas Engelken 2006-03-24 05:24:11 UTC
Created attachment 3432 [details]
updated patch against svn #2

oops, those sample metas in 60_shortcircuit.cf were supposed to be commented
out..

please review.. its late and my eyes are tired ;)
Comment 27 Justin Mason 2006-03-24 14:32:25 UTC
undefined score/required?  I really don't like that one, to be honest. The
X-Spam-Status is mainly a UI but also an API, and I think it's better to
"degrade gracefully" for apps that won't know about short-circuiting, by
providing them *something* relatively sane in the areas they're looking
at, instead of breaking their expectations like that.

I do like the "full X-Spam-Level" suggestion though.
To recap, these are the current suggestions:


option (a), Daryl's suggestion:

X-Spam-Level: ******************************************
X-Spam-Status: Yes, score= required= tests=BAYES_50,
         HTML_IMAGE_ONLY_28,HTML_MESSAGE,MIME_HTML_ONLY,SC_TEST
         shortcircuit=spam autolearn=no version=3.2.0-r372567

option (b), just the "shortcircuit=spam" thing:

X-Spam-Level: *
X-Spam-Status: Yes, score=1.9 required=5.0 tests=BAYES_50,HTML_IMAGE_ONLY_28,
         HTML_MESSAGE,MIME_HTML_ONLY,SC_TEST shortcircuit=spam
         autolearn=no version=3.2.0-r372567

option (c), set score to -5 or 15:

X-Spam-Level: ***************
X-Spam-Status: Yes, score=15.0 required=5.0 tests=BAYES_50,HTML_IMAGE_ONLY_28,
         HTML_MESSAGE,MIME_HTML_ONLY,SC_TEST shortcircuit=spam
         autolearn=no version=3.2.0-r372567

option (d), same as (c) but with a higher value for clarity:

X-Spam-Level: ******************************************
X-Spam-Status: Yes, score=100.0 required=5.0 tests=BAYES_50,HTML_IMAGE_ONLY_28,
         HTML_MESSAGE,MIME_HTML_ONLY,SC_TEST shortcircuit=spam
         autolearn=no version=3.2.0-r372567


me, I like (d), with (c) as a second-best.

--j.
Comment 28 Vikram J. Gurjar 2006-04-01 15:24:22 UTC
Is this going to be a part of Spamassassin or will we have to keep patching it?
Comment 29 Justin Mason 2006-04-01 18:47:43 UTC
I can't apply it until Theo revokes his "thumbs down" -1 vote. ;)
Comment 30 Vikram J. Gurjar 2006-04-01 20:31:28 UTC
>>> I can't apply it until Theo revokes his "thumbs down" -1 vote. ;)

I've been running one mail server with this patch - and the latest svn version
of spamassassin - and it works great.  The only change that I have made is I
have now monitored  spamassassin with monit in case it crashes as the latest svn
could be unstable.  So far - no problems in about a week.

This really should be a part of Spamassassin....anything that makes it more
efficient should be incorporated.  Otherwise it will be like Windows software. 
Each succeeding version becomes more and more bloated and inefficient.

I hope Theo reviews this and makes it a part of Spamassassin.  After all, SA is
one of the more resource hungry hogs on any mail server. 
Comment 31 Loren Wilton 2006-04-02 00:56:14 UTC
Re #27, I'm also in favor of D with C as second choice.  Leaving the score out 
completely is likely to break a whole lot of stuff.
Comment 32 Daryl C. W. O'Shea 2006-04-02 01:12:54 UTC
(In reply to comment #27)
> undefined score/required?  I really don't like that one, to be honest. The
> X-Spam-Status is mainly a UI but also an API, and I think it's better to
> "degrade gracefully" for apps that won't know about short-circuiting, by
> providing them *something* relatively sane in the areas they're looking
> at, instead of breaking their expectations like that.

Even though it's the most accurate, I don't really like it either -- that's me
being used to safety critical systems where we never ever rely on a value that's
supposed to be there being there.

I'd much prefer that whatever causes the shortcicuit to have fired a rule that
will be certain to cause a final score that exceeds the threshold in the
appropriate direction, and we use that final score.


> option (d), same as (c) but with a higher value for clarity:
> 
> X-Spam-Level: ******************************************
> X-Spam-Status: Yes, score=100.0 required=5.0 tests=BAYES_50,HTML_IMAGE_ONLY_28,
>          HTML_MESSAGE,MIME_HTML_ONLY,SC_TEST shortcircuit=spam
>          autolearn=no version=3.2.0-r372567
> 
> 
> me, I like (d), with (c) as a second-best.

I'd settle for (d) but with a (default) score outside the range of typical
scores.  The score should be configurable though, as I believe it is now, so
that people using multiple thresholds for tag-accept/reject can still do that.
Comment 33 Theo Van Dinter 2006-04-03 04:49:51 UTC
Sorry it's been a while for me, I've been busy with job and move-related activities...

I think we can move ahead with applying the patch, but I can definitely see some bits that need to be 
changed.  Some thoughts:

1) There's no point in checking for shortcircuit_type in the sections building the eval in do_head_tests, 
do_body_tests, do_body_uri_tests, etc.  In normal usage, these sections are only ever called once, and 
shortcircuit_type is guaranteed not to be enabled at that point.

2) I'd move the initial shortcircuit_type check in those above sections until after the eval is built but 
before the function is called (assuming user rules are disabled).  This way we'll get the evals built the 
first time as usual, and then later on the tests can be skipped before calling the rules.


there's a couple of other places that could be tweaked, but I'll worry about that later on.


I still think priorities ought to be handled automatically as possible.  I don't want to require people to 
have to change priorities around on their own.  This shouldn't be hard in theory (hopefully I won't look 
back at this and wonder wtf I was talking about... ;))

The patch also does nothing with mass-check to deal with reuse and short-circuit, so we'll want to get 
something in quickly for that.


Also,

>> sixth, an easy addition to this patch would be short circuiting on
>> current message score.  sometimes you don't care what rules hit, but you
>> want to stop after a certain point.
>
>-1
>*no*. This is EXACTLY the approach that has been tried several times
>before, with lousy results -- which resulted in the idea that this
>approach is more worthwhile.  Let's not go around in circles!

yes, except that the problem was performance due to the the constant if checks and the lack of 
prioritization.  those two are basically fixed now, so it should work fine.
Comment 34 Michael Parker 2006-04-03 06:12:41 UTC
While I'm not veoting the patch, I am a little concerned with how it messes
around with priorities.  I realize people are hot on the patch, but I suggest
caution and review time.
Comment 35 Justin Mason 2006-04-04 00:34:24 UTC
So I've applied *most* of the patch to svn trunk ;)

1. There does seem to be an issue with what it does to HARVEST_DNSBL_PRIORITY,
changing it from 500 to -800.  AFAIK, this causes DNSBLs to be harvested pretty
much immediately, instead of letting those network queries percolate in the
background while we do other stuff.

normally we'd do:

    - start all DNSBL lookups (in the background)
    - do head tests, body tests, etc. for several seconds
    - wait several seconds for DNSBL results
    - do meta tests
    - AWL
    - return

with the HARVEST_DNSBL_PRIORITY change it becomes

    - start DNSBL lookups (in the background)
    - wait several seconds for DNSBL results
    - do head tests, body tests, etc. for several seconds
    - start more DNSBL lookups (at another priority level)
    - wait several seconds for those DNSBL results
    - ...

I'm worried that that'll have pretty major knock-on effects on speed. :( So I
haven't applied that change to HARVEST_DNSBL_PRIORITY, and it's still at +500.

(META_TEST_MIN_PRIORITY is indeed now -500, though, so metas of non-network
rules can now be used for shortcircuiting. I couldn't see any bad
side-effects of that.)

But use of net/DNSBL tests in a short-circuit, needs a better way to do this
than causing it to attempt to harvest *all* network lookups early -- I think it
will need a way for the "shortcircuit SC_URIBL_SBL spam" line to cause the meta
dependency chain to be followed up to the network rule, and some logic to
realise that that implies that the URIBL lookups -- and just those -- therefore
need to be waited for before that meta rule can be evaluated correctly.

I think there's more work needed there.

2. I'm not convinced about "shortcircuit default" btw...  I left it in, anyway,
for now.   I would prefer to take it out though, unless there's a good reason
to have both it *and* the "ham/spam" version.

3. "t/dnsbl.t" fails, but it's failing before this patch was applied
anyway. something's up there in trunk :(

4. the "no shortcircuit" SCTYPE token for X-Spam-Status got changed to "no" to
match "autolearn=no".

5. added code to mass-check to ignore messages where X-Spam-Status =~
/shortcircuit=(?:yes|no|default)/.

apart from that -- looks good to me!  checked in as r391177.
Comment 36 Michael Parker 2006-04-04 01:26:49 UTC
While I said I wasn't vetoing the patch, I did say I request time to review,
since there was no time given I now have to veto.

I'm -1 (veto) on this patch.

This is "ONE" implementation of shortcircuit, and from the comments, not even
the best, and happens to stomp on at least one other known implementation.  This
code should be backed out and when we get the Check plugin code in place you can
then put it all in your own plugin to avoid creating problems for other people's
code.
Comment 37 Vikram J. Gurjar 2006-04-04 05:46:11 UTC
Michael : 

I appreciate your concern but how much time do need to review?  And what are
your points of concern so that others who are hot on the patch can address them.

This has brought down the CPU usage on one of my servers considerably....I am
seriously considering putting it  on all my mail servers.

Justin :
>> normally we'd do:

    - start all DNSBL lookups (in the background)
    - do head tests, body tests, etc. for several seconds
    - wait several seconds for DNSBL results
    - do meta tests
    - AWL
    - return

with the HARVEST_DNSBL_PRIORITY change it becomes

    - start DNSBL lookups (in the background)
    - wait several seconds for DNSBL results
    - do head tests, body tests, etc. for several seconds
    - start more DNSBL lookups (at another priority level)
    - wait several seconds for those DNSBL results
    - ...

I'm worried that that'll have pretty major knock-on effects on speed. :( So I
haven't applied that change to HARVEST_DNSBL_PRIORITY, and it's still at +500.

>>

Normally, the dsnbl checks can be done last.... the fastest sequence is as follows

1.  Banned attachments ... if it contains a banned attachment...mark as spam .. SC
2.  Clamav plug in ... if it contains a virus ... mark as spam .. SC
3.  White list ... mark as ham .. SC
4.  Blacklist  ... mark as spam .. SC
5.  Local tests ... mark as spam if need be ... SC only if score exceeds spam
threshold
6.  dnsbl tests ... mark as ham/spam as normal.

I find that 60% of the incoming emails are sorted out by step 4....with very
little CPU usage by SA (except CLAMAV which takes quite a bit at step 2) 

By knocking off the obvious banned attachments/viruses/whitelists and
blacklists, the loads on the network services can be brought down considerably.

Why do a dnsbl test on a virus?  It is a waste of everybody's resources.
Comment 38 Justin Mason 2006-04-04 11:28:42 UTC
Dude, you didn't "request time to review" -- you "suggested" that people should
"exercise caution and review time".  It was not clear that *you* were
the person you wanted to exercise that review time! ;)

Also your first comment was 2 weeks after the patch was posted, saying "I'm not
vetoing the patch but am a little concerned".  So I took that as not indicating
a whole lot of worry.

> This is "ONE" implementation of shortcircuit, and from the comments, not even
> the best, and happens to stomp on at least one other known implementation.

hmm -- where is this other known implementation?  Is it open source, available
for use in Apache SpamAssassin right now?  Is it ready to apply to SVN trunk?
Does it implement techniques we've been agreeing on for two years, as this
does?

If this other code needs to be taken into account, it should be made part of
the open source codebase so it is at least visible, and efforts be made to get
it into SVN, as Dallas is doing here.

It is nice to support proprietary work in Apache SpamAssassin, but not to the
detriment of the open-source product's quality.

> This code should be backed out and when we get the Check plugin code in place
> you can then put it all in your own plugin to avoid creating problems for
> other people's code.

I strongly disagree that we should make this work conditional on the Check
plugin.  Do it the other way around -- when the Check plugin is completed,
*then* rearrange the contents of svn trunk to work with it, not vice versa.

We cannot freeze the entire contents of the PerMsgStatus class pending a Check
plugin, given that the last comment on bug 4776, was *2 months ago*, and
there's still some serious disagreement there.  

Fundamentally it makes no sense to block the rest of the project based on a
task that does not seem to be going forward!

Comment 39 Justin Mason 2006-04-09 19:06:13 UTC
bug 4860 adds an PerMsgStatus::is_rule_complete($rule) API, allowing the s/c
code to determine if a DNSBL lookup has completed yet.
Comment 40 Justin Mason 2006-04-10 00:42:34 UTC
Created attachment 3467 [details]
automatically set meta rule priorities

Here's a patch that takes care of one issue with the current short-circuiting
patch; it automatically traces priorities of meta rules, and ensures that all
of their dependencies have the same or earlier priority.  in other words,
having to work that out by hand is no longer required; just set a rule to
whatever priority you want and the code will take care of it.

This has the happy side-effect that META_TEST_MIN_PRIORITY is no longer
required.

the next step then is to use the code in bug 4860, and call is_rule_complete()
for each rule in metas, to ensure that they are either not DNS rules, or else
are DNS lookups that have completed; and if that function returns 0 for any
rule, to call $self->{async}->complete_lookups(1) until it returns 1. that will
then allow network tests to be used in shortcircuit metas.
Comment 41 Michael Parker 2006-04-10 14:11:51 UTC
> ------- Additional Comments From jm@jmason.org  2006-04-04 11:28 -------
> Dude, you didn't "request time to review" -- you "suggested" that people should
> "exercise caution and review time".  It was not clear that *you* were
> the person you wanted to exercise that review time! ;)

Perhaps I should have been more forceful in the request but its almost like
you're rushing to get the code in before anyone else can review, whats the
hurry?  Why can't we all work together to get a better handle on the optimal
solution?

> Also your first comment was 2 weeks after the patch was posted, saying "I'm not
> vetoing the patch but am a little concerned".  So I took that as not indicating
> a whole lot of worry.

Well, my first comment was to back up Theo's objection and near as I can tell
still outstanding veto right after the patch was posted.  Apparently, a veto
doesn't mean as much these days since clearly you ignored that one and don't
seem to be respecting mine.
 
> It is nice to support proprietary work in Apache SpamAssassin, but not to the
> detriment of the open-source product's quality.

How about hackish code at the detriment?  You've used the "lets not screw folks
using the code" argument more times than I can count.  If you seriously want
other implementations (and believe me I'm working on getting as much closed code
opened up as I can) then screwing them over certainly isn't the best path there.
 You of all people should realize that.

However, since feeding my family has taken precedence and I just don't have time
to offer alternative implementations, I'll remove my veto (since it was being
ignored anyway) but still register a -1 for the code as implemented.

I'll also note that folks who provide results for scoring should not be running
this code, it will seriously hamper our efforts to provide realtime hits in the
reuse code.
Comment 42 Justin Mason 2006-04-10 14:58:11 UTC
This is functionality I've been looking for, for over two years, and it will
solve several scalability problems I'm having on my own MXes -- so I am indeed
keen to get it in.  Sorry if it seems like I'm hurrying too much.

Regarding Theo's veto: he said 'I think we can move ahead with applying the
patch, but I can definitely see some bits that need to be changed', then listed
several points. While he didn't spell it out, that does indicate that his veto
was retracted. Each of those points he raised was covered in the change I
applied in r391177.

Alright, I can see I've been a bit over-eager here.  I'll back out r391177, and
instead open a branch for this.  There's a fair bit of dependent work (bug 4860
and patch 3467), and it's going to need an SVN repository I think.
Comment 43 Justin Mason 2006-04-10 15:06:29 UTC
OK, I've created a branch of "trunk" for this:

https://svn.apache.org/repos/asf/spamassassin/branches/bug-3109-shortcircuiting

ps:
'I'll also note that folks who provide results for scoring should not be running
this code, it will seriously hamper our efforts to provide realtime hits in the
reuse code.'

mass-check will not attempt to reuse hits from shortcircuited messages.
Comment 44 Michael Parker 2006-04-10 15:32:16 UTC
> mass-check will not attempt to reuse hits from shortcircuited messages.
> 

Understood, however if this gets put on one of the trap servers it will
seriously hamper our ability to use realtime results for network tests.
Comment 45 Justin Mason 2006-04-11 13:38:05 UTC
'But use of net/DNSBL tests in a short-circuit, needs a better way to do this
than causing it to attempt to harvest *all* network lookups early -- I think it
will need a way for the "shortcircuit SC_URIBL_SBL spam" line to cause the meta
dependency chain to be followed up to the network rule, and some logic to
realise that that implies that the URIBL lookups -- and just those -- therefore
need to be waited for before that meta rule can be evaluated correctly.'

'seventh, and this is tricky, do we want to try moving the priorities around
automatically when SC is enabled?'

Update: these two are now implemented in the branch.

It works as follows:

- 1. config is read as normal; priorities are set to whatever they are in the
  config file, or 0 by default (including meta rules).

- 2. finish_parsing() calls trace_meta_dependencies(), which recursively
  descends the meta dependency tree, so that each meta rule then has a
  list of all the rules it depends on (directly or indirectly) in
  $conf->{meta_dependencies}.

- 3. finish_parsing() calls fix_priorities(), which then uses that
  meta-dependencies list to ensure that each meta rule will run at the same
  time as, or later than, all of its dependencies.

- 4. do_meta_tests() in PerMsgStatus then, before it runs each meta rule, greps
  out a list of async (ie. network) tests that are dependencies of that rule
  (if any).  It then checks to see if they've completed; if not, it waits until
  it completes (or the async tests time out).
  
  This grepping process takes place once at rule-compile time, not at runtime,
  although the completion test runs every time -- however that's 1 or 2 hash
  lookups, and there are not many meta rules that depend on network rules.


Finally, META_TEST_MIN_PRIORITY is rendered obsolete, so it's gone.

It all appears to be working, although could probably do with some tidying
up and may have a few bugs ;)   In particular I think the rbl_timeout
scaling code probably needs some work, I think it may be cutting off
DNSBL lookups too early now.

Also it needs some speed measurement to see how it now compares with trunk.

Add this file as "rules/62_test.cf", then run "spamassassin -D -t", for a demo:

  meta SC_PRI_DEMO  ((URIBL_BLACK || URIBL_SC_SURBL) && SPF_SOFTFAIL)
  shortcircuit SC_PRI_DEMO  spam
  priority SC_PRI_DEMO      -1000


by the way I am still uncomfortable with "shortcircuit default".  I would
prefer to get rid of it, since I think it's one feature too many, unless
there's a good reason for it to be kept.
Dallas, fancy making a case for it? ;)


Also remaining to do: add tests to the test suite for shortcircuiting;
and try it out on my dogfood server.
Comment 46 Dallas Engelken 2006-04-11 14:57:37 UTC
> Finally, META_TEST_MIN_PRIORITY is rendered obsolete, so it's gone.

very nice!

> by the way I am still uncomfortable with "shortcircuit default".  
> I would prefer to get rid of it, since I think it's one feature 
> too many, unless there's a good reason for it to be kept.
>
> Dallas, fancy making a case for it? ;)

sure, i'll tell you my thoughts behind it.  lets say you have the following
configuration.

identify spam at 5.0
quarantine at 8.0
delete over 15.0

and you have a need to quarantine certain mail, lets say for a past employees. 
Since you know the To: address, and you know it all needs to be quarntined, the
simple solution would be 

header  EX_EMPLOYEES   To =~ /(user1|user2|user3)\@mydomain/
score   EX_EMPLOYEES   8.0
shortcircuit EX_EMPLOYEES default
priority EX_EMPLOYEES -100

Therefore, any email to those people automatically goes to quarantine because
the final score is 8.0.  Without being able to shortcircuit and use the default
score, you are always going to shortcircuit at the +100 (or whatever leve you
set).... which looses flexibility.

but its whatever you feel is right... I just thought people may find it
beneficial not to designate a spam/ham classification and just use whatever
score is assiged to the shortcircuited rule.

d


Comment 47 Loren Wilton 2006-04-12 09:48:32 UTC
Gee, now that you are tracing meta dependencies, you could do a lint warning if 
you find an unresolved dependency.  :-)
Comment 48 Justin Mason 2006-04-12 11:09:47 UTC
'Therefore, any email to those people automatically goes to quarantine because
the final score is 8.0.  Without being able to shortcircuit and use the default
score, you are always going to shortcircuit at the +100 (or whatever level you
set).... which looses flexibility.'

ok, that makes sense.

However I'm still not fond of having two separate "parallel" ways to do the
same thing; "shortcircuit default / score 100" and "shortcircuit spam" do the
same thing two different ways.

how's about getting rid of "shortcircuit ham"/"shortcircuit spam", and just
have either "shortcircuit default" (renamed to "shortcircuit on") or
"shortcircuit off" behaviour as the available options?    Then the score alone
is used to determine whether a rule shortcircuits as ham or as spam -- negative
means ham, 0 or positive spam.

if brevity of the configuration is the aim, we can still support "shortcircuit
ham" / "shortcircuit spam", but just implement them in the config-reading
subroutine as translating to "shortcircuit on / score -100"  or "shortcircuit
on / score 100".

What do you think?
Comment 49 Justin Mason 2006-04-12 12:35:54 UTC
message from Loren by mail (watch out, our bz still doesn't support that):

Loren Wilton writes:
> Someone please remind me why the email score simply isn't the score total up
> to and including the short-circuit rule?  I'm assuming that in general short
> circuit rules are going to run relatively early (else why bother?) so except
> for a short-circuit meta there should be relatively few scores in addition
> to the short circuit.  And for the meta I'd expect most contributory scores
> to go the same way as the short circuit score.
>
> It then becomes dependent on the rule writer to pick a good score for the
> short-circuit rule.  But this is always the case.

Loren -- that's one of the existing options, that's "shortcircuit default".
Comment 50 Dallas Engelken 2006-04-13 14:46:20 UTC
(In reply to comment #48)

> how's about getting rid of "shortcircuit ham"/"shortcircuit spam", and just
> have either "shortcircuit default" (renamed to "shortcircuit on") or
> "shortcircuit off" behaviour as the available options?    Then the score alone
> is used to determine whether a rule shortcircuits as ham or as spam -- 
negative
> means ham, 0 or positive spam.
> if brevity of the configuration is the aim, we can still support "shortcircuit
> ham" / "shortcircuit spam", but just implement them in the config-reading
> subroutine as translating to "shortcircuit on / score -100"  or "shortcircuit
> on / score 100".
> What do you think?

sounds reasonable to me.   on/off is simple, and still gives you the 
flexibility to control the final score.   also it would probably be best to set 
noautolearn tflag by default, and force someone to tflag learn a s/c rule.

if you have on|off|ham|spam, the ham and spam s/c classification could score -
100/+100 and force autolearning.   but i guess that puts us right back where we 
started at ham|spam|default|off really... just changing "default" to "on", and 
making sure "on" has noautolearn tied to it.

i'd rather not have to define a rule with 6 lines of config if we can set some 
sane defaults.

body RULE /text/
describe RULE blah
score RULE x.x
shortcircuit RULE on
tflags RULE noautolearn
priority RULE -100

i would suggest shortcircuit "on" set noautolearn tflag and -100 priority by 
default....  and if you want to adjust the priority higher from there, you have 
to specify it, and if you want to autolearn on that rule, you either have to 
specify the learn tflag, or you have to s/c ham|spam.   i guess if you s/c as 
ham|spam, the noautolearn tflag would have to be removed from the rules config.

d
Comment 51 Justin Mason 2006-04-14 18:33:11 UTC
ok, r394166 implements that.
Comment 52 Justin Mason 2006-04-16 17:17:29 UTC
>> mass-check will not attempt to reuse hits from shortcircuited messages.
>
>Understood, however if this gets put on one of the trap servers it will
>seriously hamper our ability to use realtime results for network tests.

hey, I've been thinking about this.

I agree that it shouldn't be used on a trap server.  However I'm not
mass-checking any of the traps currently; I mass-check my own spam/ham.
I already reject at SMTP connect time using SBL/XBL, which means that, what is
it, 40% of incoming spam is not represented in my spam corpus?   I'll
bet quite a few of the other mass-checkers are doing the same, or even
using other SMTP-time tactics, like greylisting or other DNSBLs.

As a result, if we're worried about our filtering tactics affecting our corpora,
and affecting mass-check results on network tests, IMO that horse has already
bolted, to a substantial degree....



ok, things still to do on this bug:

- a test for shortcircuit;
- I need to dogfood it on my server as well;
- figure out what to do with the branch.
Comment 53 Justin Mason 2006-04-19 19:14:58 UTC
> a test for shortcircuit

now in, as revision 395341.  and dogfooding is now underway using this config:

meta SC_BL_BAYES ((USER_IN_BLACKLIST_TO || USER_IN_BLACKLIST) && BAYES_99)
meta SC_WL_BAYES ((USER_IN_WHITELIST||USER_IN_DEF_WHITELIST||USER_IN_ALL_SPAM_TO
||USER_IN_DEF_SPF_WL||NO_RELAYS) && BAYES_00)
meta SC_BL_URIBL ((URIBL_BLACK||URIBL_AB_SURBL||URIBL_OB_SURBL||URIBL_SC_SURBL||
URIBL_WS_SURBL||URIBL_SBL) && BAYES_99)

shortcircuit SC_BL_BAYES spam
shortcircuit SC_BL_URIBL spam
shortcircuit SC_WL_BAYES ham
Comment 54 Justin Mason 2006-04-20 21:26:03 UTC
well, it hasn't wound up faster -- before:

all: 1980 messages, total time 9672.09999999999, avg 4.88489898989899

after:

sc=spam: 405 messages, total time 1914.6, avg 4.72740740740741
sc=no: 939 messages, total time 7682.30000000001, avg 8.18136315228968
sc=ham: 117 messages, total time 1116.7, avg 9.54444444444444
all: 1461 messages, total time 10713.6, avg 7.33305954825462


shortcircuit-as-spam mails are marginally faster.  shortcircuit-as-ham, or
non-s/c, quite a bit slower.
Comment 55 Justin Mason 2006-04-20 21:26:52 UTC
oh, here's the script I used to parse the logs for that report --

#!/usr/bin/perl

while (<>) {
  /spamd: result: / or next;
  my $sc = 'no';
  /shortcircuit=(\S+)/ and $sc=$1;
  /scantime=([\d\.]+)/ or next;
  my $stime = $1;
  $msgs{$sc}++; $times{$sc}+=$stime;
  $msgs{all}++; $times{all}+=$stime;
}

foreach my $t (reverse sort keys %msgs) {
  my $avg = $times{$t} / ($msgs{$t} || 0.00001);
  print "sc=$t: $msgs{$t} messages, total time $times{$t}, avg $avg\n";
}
Comment 56 Dallas Engelken 2006-04-20 21:51:19 UTC
maybe it depends what you are short-circuiting.  since all your sc tests rely 
on bayes at least, that may be something that is pushing out your scan times?

as you can see here, my spam:ham ratio is pretty close yesterday, yet the time 
my system spent processing it was 2/5th of how long the ham took. 

Spam:      8140    Time Spent Processing Spam:    2.88 hours
Ham:       9073    Time Spent Processing Ham:     7.30 hours

 i dont have any s/c ham rules expect for whitelists, so you see a pretty small 
sample size of sc=ham below.

# cat /var/log/spamassassin/spamd.log | perl test.pl
sc=spam: 4085 messages, total time 2346.59999999995, avg 0.574443084455313
sc=no: 6064 messages, total time 18832.1, avg 3.1055573878628
sc=ham: 134 messages, total time 68, avg 0.507462686567164
sc=all: 10283 messages, total time 21246.7000000002, avg 2.06619663522321

as you can see, my sc=spam and sc=ham avg scantimes are 6x faster than those 
message that were not s/c.   how are your scantimes 4-8 seconds? old hardware?

although i have yet to put your new implementation into production, i'm still 
using the hacked HARVEST_DNSBL_PRIORITY.  

as soon as i get some time to do testing on the new stuff i'll provide some 
more data.
Comment 57 Justin Mason 2006-04-20 22:44:17 UTC
cool -- could you post your rules in the meantime?

are you running bayes on that server?

my server is pretty old hardware.  however I think a good bit is probably due to
waiting for network rules -- all those rules rely on net tests.

if you think about it, there's no way the BAYES rules could cause s/c=spam to
take longer than s/c=no; BAYES runs for both, and would run for the same length
of time.  the bayes rules are synchronous.
Comment 58 Dallas Engelken 2006-05-12 13:27:11 UTC
i'm still testing,  but it appears there is a small problem with a couple
priorities.   looks like setting priorities of -998 and -999 fail because they
are used by

$MISSING_REQUIRED_VALUE     = -998;
$INVALID_VALUE              = -999;

In Conf/Parser.pm, the $ret value set from

      my $ret = &{$cmd->{code}} ($conf, $cmd->{setting}, $value, $line);

is the priority.  which then matches 

      if ($ret && $ret eq $Mail::SpamAssassin::Conf::INVALID_VALUE)

and

      elsif ($ret && $ret eq $Mail::SpamAssassin::Conf::MISSING_REQUIRED_VALUE)

and causes goto failed_line.

# spamassassin --lint -D 2>&1 | grep USER 2>&1
priority USER_IN_BLACKLIST_TO   -999
KEY=priority VALUE=USER_IN_BLACKLIST_TO -999
PARSE_ERROR RET=-999 IV=-999 KEY=priority VALUE=USER_IN_BLACKLIST_TO    -999
[24595] warn: config: SpamAssassin failed to parse line,
"USER_IN_BLACKLIST_TO_-999" is not valid for "priority", skipping: priority
USER_IN_BLACKLIST_TO_-999
priority USER_IN_BLACKLIST      -998
KEY=priority VALUE=USER_IN_BLACKLIST    -998
[24595] warn: config: SpamAssassin failed to parse line, no value provided for
"priority", skipping: priority USER_IN_BLACKLIST_-998
priority USER_IN_WHITELIST     -990
KEY=priority VALUE=USER_IN_WHITELIST     -990
NEXT, RET=-990 KEY=priority VALUE=USER_IN_WHITELIST     -990
priority USER_IN_DEF_WHITELIST -989
KEY=priority VALUE=USER_IN_DEF_WHITELIST -989
NEXT, RET=-989 KEY=priority VALUE=USER_IN_DEF_WHITELIST -989
priority USER_IN_ALL_SPAM_TO   -988
KEY=priority VALUE=USER_IN_ALL_SPAM_TO   -988
NEXT, RET=-988 KEY=priority VALUE=USER_IN_ALL_SPAM_TO   -988
[24595] dbg: shortcircuit: adding USER_IN_WHITELIST default
[24595] dbg: shortcircuit: adding USER_IN_DEF_WHITELIST default
[24595] dbg: shortcircuit: adding USER_IN_ALL_SPAM_TO default
[24595] dbg: shortcircuit: adding USER_IN_BLACKLIST default
[24595] dbg: shortcircuit: adding USER_IN_BLACKLIST_TO default
Comment 59 Justin Mason 2006-05-12 13:38:00 UTC
doh.

I think we can change them; they're not exposed outside the code, and any
plugins will/should be using them by name instead of embedding their values.

$MISSING_REQUIRED_VALUE     = -998;
$INVALID_VALUE              = -999;
Comment 60 Justin Mason 2006-05-29 19:36:07 UTC
Created attachment 3530 [details]
shortcircuit.cf as used in jm's test

I got a chance to do a follow-up on the performance of this code.   I redid my
rules files to not include any network rules at all, which require the "stop
everything and wait for completion" code to kick in. Instead, just using the
"shuffle priorities to ensure all local rules complete in the right order"
algorithm, and just very simple 2 shortcircuit rules, here's what I get:

sc=spam: 161 messages, total time 108.4, avg 0.673291925465839
sc=no: 1365 messages, total time 4170.9, avg 3.0556043956044
sc=ham: 129 messages, total time 485.6, avg 3.76434108527132
all: 1655 messages, total time 4764.9, avg 2.87909365558913

reminder: here's what I had beforehand using svn trunk:

all: 1980 messages, total time 9672.09999999999, avg 4.88489898989899

so, on average, it now completes scans in *on average* across all messages a
whopping 58% of the runtime.   That is a MAJOR speedup!

Most of this is coming from the spams taking only 0.673 seconds to scan on
average.

This is probably the largest speedup in the open source SpamAssassin codebase
in years. Nice ;)
Comment 61 Loren Wilton 2006-05-30 02:40:55 UTC
Interesting.  And that was still requiring bayes, which I bet is a major time 
sink.  It would be interesting to see the same timing run dropping the bayes_00 
and bayes_99 from the metas.  I'm sure the classification results will suffer 
as a result.  But it would be an interesting time comparison.
Comment 62 Dallas Engelken 2006-06-20 19:49:33 UTC
hey justin.. i was just doing some bayes work on bug #4787 and came across 
something that effected my initial shortcircuit work in regards to 
autolearning... i'm not sure if it effects your attachment #3530 [details] since i havent 
yet got to that, but just in case, here is issue i ran into.

i noticed that all email that shortcircuited was failing to autolearn, even if 
autolearn tflag was set specifically fo rthat rule.  in 
Plugin::AutoLearnThreshold->autolearn_disriminator() I added this small block 
of code. 

  my $isspam;

  if ( $scan->{shortcircuit_rule} ) {
     $isspam = 1 if ( $scan->{shortcircuit_type} eq "spam" );
     $isspam = 0 if ( $scan->{shortcircuit_type} eq "ham" );
     dbg("learn: shortcircuit ham/spam does not require score evalution.");
     return $isspam;
  }

just thought i'd share that discover while its on my mind....
Comment 63 Justin Mason 2006-07-09 17:24:23 UTC
Created attachment 3570 [details]
rule-priority code, against svn trunk

In order to reduce the amount of contentious code involved, I've split the work

into two independent chunks:

- 1. the short-circuiting code itself (which has been -1'd)

- 2. the priority-tracing code, which automatically enforces priority ordering
  for meta rules, so that meta rules always run after their dependencies;
  and removal of META_TEST_MIN_PRIORITY special-casing, which then becomes
  superfluous.

The latter code is a dependency of the former, but is independent, and would be

very useful to get into trunk anyway, to get rid if the META_TEST_MIN_PRIORITY
hack.  This patch does that.
Comment 64 Justin Mason 2006-07-09 17:33:23 UTC
Created attachment 3571 [details]
current short-circuiting patch, against svn trunk+3570

In case anyone's running the code, here's the current state of the
short-circuiting patch, against svn trunk with patch 3570 applied.
Comment 65 Justin Mason 2006-07-14 10:58:21 UTC
no comments => lazy consensus, so I've applied patch 3570 (the priority
inference code) to svn trunk.

Sending        MANIFEST
Sending        lib/Mail/SpamAssassin/Conf/Parser.pm
Sending        lib/Mail/SpamAssassin/Conf.pm
Sending        lib/Mail/SpamAssassin/Constants.pm
Sending        lib/Mail/SpamAssassin/PerMsgStatus.pm
Adding         t/priorities.t
Transmitting file data ......
Committed revision 421868.
Comment 66 Michael Parker 2006-07-14 13:21:05 UTC
/me drowns in massive conflicts

Come on man, there is still a veto here.

I'm staying on record that I believe this is the wrong approach.
Comment 67 Michael Parker 2006-07-14 13:29:17 UTC
I'm removing my veto.  Its not worth the time, apply and move forward.  Its too
bad real world priorities must take precendence.
Comment 68 Justin Mason 2006-07-18 22:51:06 UTC
Created attachment 3591 [details]
new shortcircuit patch, against trunk

Here is a new, less-intrusive short-circuiting patch against trunk.

It moves the implementation to a plugin,
Mail::SpamAssassin::Plugin::Shortcircuit, which implements the s/c logic. This
is made possible with a new plugin API and a flag-setting API on the
PerMsgStatus object.

It is still necessary to dot the PerMsgStatus code with calls like:

   return if (exists $self->{shortcircuit_flag});

I can't see a way to do it efficiently otherwise, unfortunately.  Other
than that, though, the logic is all now in a plugin; in particular,
deciding *when* that flag is set all takes place there.
Comment 69 Michael Parker 2006-07-19 00:20:22 UTC
> Here is a new, less-intrusive short-circuiting patch against trunk.
> 
> It moves the implementation to a plugin,

YAY!!!!

> Mail::SpamAssassin::Plugin::Shortcircuit, which implements the s/c logic. This
> is made possible with a new plugin API and a flag-setting API on the
> PerMsgStatus object.
> 

Why not pass all the other available values into handle_hit?

> It is still necessary to dot the PerMsgStatus code with calls like:
> 
>    return if (exists $self->{shortcircuit_flag});

Why not make this into some sort of plugin call?

> 
> I can't see a way to do it efficiently otherwise, unfortunately.  Other
> than that, though, the logic is all now in a plugin; in particular,
> deciding *when* that flag is set all takes place there.
> 

I think that the changes are too specific, and I'd like to make it even more
generic.  It would be easy to give this lots more love with a Check.pm plugin BTW :)

Also, I strongly, in the -1 veto type of disagreement, with having shortcircuit
turned on by default.
Comment 70 Justin Mason 2006-07-19 09:54:42 UTC
> > Here is a new, less-intrusive short-circuiting patch against trunk. It
> > moves the implementation to a plugin,
> 
> YAY!!!!

good to hear ;)

> > Mail::SpamAssassin::Plugin::Shortcircuit, which implements the s/c logic.
> > This is made possible with a new plugin API and a flag-setting API on the
> > PerMsgStatus object.
> 
> Why not pass all the other available values into handle_hit?

I should have mentioned that.  To be honest, I consider those ($area and $desc)
to be legacy -- something that we shouldn't be passing around anyway, and I'd
prefer to get rid of them from that arg list sometime soon if poss.  Given
that, I didn't want to bake them into a new plugin API.

> > It is still necessary to dot the PerMsgStatus code with calls like:
> > 
> >    return if (exists $self->{shortcircuit_flag});
> 
> Why not make this into some sort of plugin call?

I was considering it, but wasn't sure about the performance impact of making
the calls through the plugin infrastructure vs calling exists on a member var.
Let me look into that....


> > I can't see a way to do it efficiently otherwise, unfortunately.  Other
> > than that, though, the logic is all now in a plugin; in particular,
> > deciding *when* that flag is set all takes place there.
> 
> I think that the changes are too specific, and I'd like to make it even more
> generic.  It would be easy to give this lots more love with a Check.pm plugin
> BTW :)

Bring it on, I'm dying to see this plugin ;)

what do you mean by 'the changes are too specific' btw?  this new set of
changes to PerMsgStatus?  suggestions would be welcome, I'm happy to
compromise.

> Also, I strongly, in the -1 veto type of disagreement, with having
> shortcircuit turned on by default.

OK, I can go for that.

Comment 71 Justin Mason 2006-07-19 18:35:11 UTC
'> Why not make this into some sort of plugin call?

I was considering it, but wasn't sure about the performance impact of making
the calls through the plugin infrastructure vs calling exists on a member var.
Let me look into that....'

fwiw, that's not a problem -- I'll switch it to a call_plugin() call in next rev.

exists:

 0.20   0.049  0.049  20250   0.0000 0.0000  Mail::SpamAssassin::PerMsgStatus::
                                             have_shortcircuited

plugin:

 0.39       -  0.081  20250        - 0.0000  Mail::SpamAssassin::PerMsgStatus::
                                             have_shortcircuited

moving from 0.20% to 0.39% of the runtime, while a near-doubling, is still far
down the list of CPU hogs, below even
Mail::SpamAssassin::PerMsgStatus::check_from_in_default_whitelist().  I think
that's pretty safe.
Comment 72 Justin Mason 2006-07-20 12:52:15 UTC
Created attachment 3598 [details]
new new patch against trunk

hey -- here's that next rev. changes:

- Mail::SpamAssassin::PerMsgStatus::have_shortcircuited() is now a
  call_plugin() call, as suggested.


- shortcircuit is off by default, as requested.


- something had been annoying me about the previous patch -- namely the
  creation of a new (duplicate) plugin hook with one new argument,
  "handle_hit", just to avoid having to make changes in the PerMsgStatus
  code. So I bit the bullet, and consolidated the two post-rule-fire
  plugin APIs (hit_rule and handle_hit) into one (the existing API,
  hit_rule).
  
  This required a little more work, since moving the
  call_plugins("hit_rule") calls into the got_hit() method required that
  method to be extended to support the additional data required (the
  "ruletype" string). Using perl-style name=>value pairs for the
  additional optional arguments, and keeping compatibility with the
  existing pre-3.2.0 argument set, this was doable in a
  backward-compatible way.


- Bonus: this allows Plugin/AWL.pm to be fixed to use the public
  PerMsgStatus::got_hit() API, instead of the private _handle_hit() API
  (arguably a bugfix imo).


- Bonus 2: got_eval_hit() and got_pattern_hit() became one-line wrappers
  around got_hit(); refactored out for efficiency and clarity.


- I added POD doc for got_hit(), while I was at it, since it's an API
  callable (and called) by plugins, it needs documentation.


- the 'tflags multiple' code offered an optimisation, while I was at it;
  instead of looking up tflags at scan-time to check if a rule should exit
  or look for multiple hits, it should do that only at eval-build time, in
  the hit_rule_plugin_code() code-generation call.
Comment 73 Justin Mason 2006-07-20 12:55:17 UTC
btw, how are we in vetoes with this patch?  I've got a batch of unrelated
PerMsgStatus optimisations I'd like to apply...
Comment 74 Justin Mason 2006-07-21 19:03:42 UTC
ping.  let me know if you'd prefer this stayed out of trunk.

If there's no comment, I'll apply it in a day or two, since it's pretty
modular now -- and hand-merging around it is getting tricky.
Comment 75 Michael Parker 2006-07-21 19:12:09 UTC
Can you give me a week?  I'm gonna have some time to look at it and the
Check/Rule plugin stuff over the next week.  It will highly conflict with those
changes as well.  I might be able to incorporate much of the code into what I've
already moved out of PerMsgStatus (right now PMS has no do_*_tests).

After a week, lets look at it again and see where we are.

Thanks.
Comment 76 Justin Mason 2006-07-21 20:17:04 UTC
ok -- I'll be happy to do the merging btw.
Comment 77 Justin Mason 2006-08-03 18:14:53 UTC
> Can you give me a week?  I'm gonna have some time to look at it and the
> Check/Rule plugin stuff over the next week.  It will highly conflict with those
> changes as well.  I might be able to incorporate much of the code into what I've
> already moved out of PerMsgStatus (right now PMS has no do_*_tests).
> After a week, lets look at it again and see where we are.

ping -- how's that going?
Comment 78 Michael Parker 2006-08-04 13:14:14 UTC
/me sighs

Time I did not get.  I won't hold up any progress.  I'm still a little
uncomfortable with the implementation but am not in a position to provide a
better answer for the moment.  So, unless you want to hold up on this thing for
a month or so until I can come free from my current project, you might as well
go ahead with things.

I imagine, if things ever happen with Check.pm and the Rule plugins, that things
will change around a bit anyway.
Comment 79 Justin Mason 2006-08-04 14:01:59 UTC
ok, sorry to hear it :(  I'll go ahead and check in.

Let me know if you need a hand with Check.pm and the rule-type plugin stuff in
the meantime, btw.  I'm increasingly thinking that'd be good to get in there.
Comment 80 Justin Mason 2006-08-04 19:53:25 UTC
ok, this is now checked in! r428859.
Comment 81 Theo Van Dinter 2006-12-30 19:51:03 UTC
*** Bug 3407 has been marked as a duplicate of this bug. ***
Comment 82 Amos 2009-06-07 16:36:20 UTC
There seems to be a bug in the shortcircuit implementation in 3.2.5 -- the "spam" and "ham" classifications skip remaining tests, but don't adjust the score (as they should).  

For clarification, as an example:  on my system, following the defaults, the spam threshold is 5, BAYES_99 is set to give a score of 3.5, and BAYES_99 is also set to shortcircuit as spam.  Any mail that triggers BAYES_99 goes straight to the Inbox as ham, because although it shows shortcircuit=spam, it also shows a final score of 3.5, and therefore as not spam.

The shortcircuit ham classification seems to have the same bug (it doesn't adjust the score, it just skips tests).

The shortcircuit "on" classification seems to work fine.
Comment 83 Matt Kettler 2009-06-07 17:18:44 UTC
Well, change the rule score, if that's what you want.

in local.cf

score BAYES_99 3.4 

(or whatever score you want).

As is, you've shortcircuited on a rule with a score below your spam threshold, so you get exactly what you've asked for. 

Shortciruciting does not, and really should not, radically alter the score defined for the rule. That's what score statements are for. Why would "shortciricut spam" imply "score 100", when you can just declare whatever score you want? It seems substantially more flexible this way.


Also, please be aware that BAYES_99 does have a distinctly non-zero false positive rate. (theoretically 1% of its hits are nonspam, but in practice it's more like 0.1%). You WILL get some nonspam messages tagged by this once in a while.

Be sure you're really ready for the implications of what you're doing.
Comment 84 Amos 2009-06-10 15:27:56 UTC
Matt:

     That's all good advice of itself, but the documentation on shortcircuit is explicit:

     Shortcircuit "on" is supposed to have the behavior you describe.

     Shortcircuit "spam" or "ham" is supposed to have the different behavior of also forcing a pre-set (very high or very low) score.
 
     The plugin is not complying with the documentation  -- see, e.g., http://oldschoolpunx.net/phpMan.php/man/Mail::SpamAssassin::Plugin::Shortcircuit -- which I think constitutes a bug in and of itself.

     It also constitutes a bug in another way:  the default spamassassin rules set BAYES_99 to 3.4 (or whatever), shortcircuit=spam, and threshold=5.  So if someone turns shortcircuit on without changing the defaults, then it causes anything that hits BAYES_99 to be treated as HAM without exception -- the opposite of the intended behavior.

     Also, at least on my system, I can tolerate a non-zero false-positive rate, because I'm not deleting spam automatically, just routing it to a folder.  So the default setting, shortcircuit BAYES_99 spam, is actually the behavior I want.

     To sum-up:  All the points you made are good ones, but there is still a real bug here, I think.
Comment 85 Karsten Bräckelmann 2009-06-10 16:36:13 UTC
> Shortcircuit "spam" or "ham" is supposed to have the different behavior of
> also forcing a pre-set (very high or very low) score.

Do you have a custom value for shortcircuit_spam_score?  Anyway...

> To sum-up:  All the points you made are good ones, but there is still a
> real bug here, I think.

Amos, if there is a bug, please file a NEW one. Add all the investigations you found there, again.  *This* bug is a feature request for implementing a shortcircuiting plugin -- which is RESOLVED FIXED.