SA Bugzilla – Bug 3109
RFE: really simple "this is ham" shortcircuiting
Last modified: 2009-06-10 16:36:13 UTC
There are a number of situations where you can take a look at a message and know immediately that it should not be scanned. e.g. 1. message is too big 2. message has travelled entirely via trusted hosts (ALL_TRUSTED rule) 3. user is in all_spam_to 4. bug 2385 5. caller has other ways of determining that the message came from internal hosts Currently we recommend that this "do not scan at all" short-circuiting be done in the .procmailrc file before SA is called. However, that doesn't apply for products that call SA like Amavisd, MailScanner, et al; so all of those products have to be aware of this, add their own code to take care of it, etc. And in the case of bug 2385, that just isn't really an option! I suggest we add a new rule type explicitly for *immediate* shortcircuiting, applied before most/all of the time-consuming rules. This way, new whitelist/blacklist rules that are guaranteed unforgeable can be applied immediately and the message can immediately be marked.
> I suggest we add a new rule type explicitly for *immediate* shortcircuiting I think we should make this a part of a SINGLE rule priority system (covering both pre and post functionality) and extend scores to allow for short-circuiting: body FOO /foo/ tpriority -10 score ham - priority of zero is default - less than zero means run early - more than zero means run later - score of "ham" or "spam" means mark as ham or spam and short circuit, no numeric score should be generated since the score number would be inaccurate
How would this interact with autolearning? Should there be a way to indicate whether the message should be learned when short-circuiting?
Subject: Re: RFE: really simple "this is ham" shortcircuiting -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 >- priority of zero is default >- less than zero means run early >- more than zero means run later +1 on those >- score of "ham" or "spam" means mark as ham or spam and short circuit, no > numeric score should be generated since the score number would be inaccurate Suggest generating a score of -9999 or 9999 -- otherwise I can guarantee it'll confuse tools that act on X-Spam-Status. Plus it'd make mass-check results pretty hard to parse too ;) 'all_spam_to' already uses a very-low-score hack, this is similar (but less hacky). Also I'd prefer if it wasn't using the "score" command; a new one like "shortcircuit ham", or even "tflags short-circuit-ham" strikes me as better. re: autolearning: that's a hard one, since this would be how all_spam_to would be implemented. Maybe 2 ways of saying ham, one indicating that the msg is autolearnable, one indicating not? - --j. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.4 (GNU/Linux) Comment: Exmh CVS iD8DBQFAQVZXQTcbUG5Y7woRAufbAKCQ1j7aOpQ3ecPHJD/Y4WdeI5X2EgCg1XQe R6WnLb5voTZe1utLhPajtz0= =8442 -----END PGP SIGNATURE-----
This was implemented (see bug 62) but was horribly broken (see bug 304). It was eventually decided (and generally agreed) that short-circuiting has a significant number of problems that we are not prepared to overcome. I'm going to mark this as a duplicate of bug 304, because it sorta is. *** This bug has been marked as a duplicate of 304 ***
argh! this is NOT a dupe of bug 304! this is an entirely DIFFERENT form of short-circuiting, not based on score, but based on specific rules that are marked as very trustworthy short-circuit rules (such as whitelisting). read the comments. grr. reopening -- although pushing it away from the 3.1.0 milestone.
btw regarding the "tpriority" idea -- note that this was indeed implemented and is now in shipping code, as "priority". all that's needed on top of that is an implementation of a short-circuit keyword: tflags shortcircuit I've just realised, btw, that we don't need separate "shortcircuit-ham" and "shortcircuit-spam" keywords -- we can set the ham/spam status using the rule's score. Just set the score to -100 or +100, and shortcircuit.
Subject: Re: RFE: really simple "this is ham" shortcircuiting On Tue, Oct 04, 2005 at 06:37:30PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote: > I've just realised, btw, that we don't need separate "shortcircuit-ham" and > "shortcircuit-spam" keywords -- we can set the ham/spam status using the rule's > score. Just set the score to -100 or +100, and shortcircuit. You can figure out the ham/spam via score, but the relative score value doesn't matter imo. I'd base ham/spam on "tflags nice" versus not personally. I had previously sent some thoughts to the dev list about short circuiting, but I'll put them here as well so I can find it easier next time: - set certain rules as SC if hit USER_IN_WHITELIST, USER_IN_BLACKLIST (not DEF) *BSP* HABEAS* - allow SC on ham score (ie: < #) - allow SC on spam score (ie: > #) - should autolearn skip SC msgs? should we always do autolearn in the appropriate direction? - AWL should be skipped during SC - SC rules should have a negative priority so they run first - do *not* do score check per rule, do it either per priority or rule type (header, body, etc.) - SC will require is_spam SC as score + required_hits will be at odds - add SC header macro (get_tag) - SC for S/O 1.000 rules? how about S/O near 1? BAYES_99, etc. Some form of order/priority rearrangement: Blacklist short Whitelist user/admin wants it BSP/Habeas reputable, non-forgable Other SC Rules as early as possible Other Local Rules lightweight Bayes don't do it unless we have to Network large latency, try to avoid
btw I was talking to somebody a while back, and they noted that their system turned our conceptual model of relative speeds on its head: > Other Local Rules lightweight > Network large latency, try to avoid in their system, the local rules were heavyweight and avoided, because they hammered the CPU, whereas the network tests were considered lightweight, because they left the CPU idle while waiting for results. so we need to be able to take both into account IMO.
*** Bug 4763 has been marked as a duplicate of this bug. ***
i am posting a short-circuit patch due to several personal requests. i would say this is an over-simplified approach, but it does work very well... and it was created with very few lines of additional code in PerMsgStatus (37 to be exact). it applies cleanly to current SVN, i have not tried it against 3.1.1 or anything else.. feel free to do so. so basically, this code is just here for now in case you want to save some CPU cycles until there is something "better, faster, stronger" (ie pluganized check () methods with improved priorities on evals, metas, plugins, etc). below is just some basic info showing debug of it running. ------------------------- normal message that didnt shortcircuit runs through all priorities you have defined. [2203] dbg: check: running tests for priority: -100 [2203] dbg: check: running tests for priority: 0 [2203] dbg: check: running tests for priority: 500 [2203] dbg: check: running tests for priority: 1000 and now a shortcircuited rule (set priority -100 so it runs first) body TEST /test/i describe TEST this is just a test rule to show how shortcircuiting works.. score TEST 1.0 priority TEST -100 shortcircuit TEST ham [2203] dbg: check: running tests for priority: -100 [2203] dbg: shortcircuit: hit on rule TEST, sc type is ham, apply score of -5 As you can see, priority 0, 500, and 100 were skiped. Changing shortcircuit method to 'spam' gives [2216] dbg: check: running tests for priority: -100 [2216] dbg: shortcircuit: hit on rule TEST, sc type is spam, apply score of 15 # echo -e "\n\ntest" | spamc X-Spam-Tests: TEST X-Spam-Score: 15.0 X-Spam-Flag: YES the values -5 and 15 are applied by setting a couple options in local.cf shortcircuit_spam_score 15 shortcircuit_ham_score -5 If you do not set these items to your choice, they default to 100 (spam) and - 100 (ham). Thats pretty much it. It works great for hitting body, rawbody and headers rules that you know have a 1.0 S/O (spam-o-the-day). be careful out there! ------------------------- thx. dallas
Created attachment 3426 [details] shortcircuit patch implements simple shortcircuiting
sweet! I want to get this into 3.2.0 immediately for my server ;) thanks Dallas! +1
Created attachment 3427 [details] shortcircuit patch on 3.1.1 due to popular demand... here is a patch against 3.1.1. this patch goes one step further and defines a 60_shortcircuit.cf with some basic (obvious) rules that just use their default scores... # echo -e "To: blah@blah.com\n\n" | spamc X-Spam-Tests: USER_IN_BLACKLIST_TO X-Spam-Score: 10.0 X-Spam-Flag: YES [13826] dbg: check: running tests for priority: -100 [13826] dbg: check: running tests for priority: -95 [13826] dbg: check: running tests for priority: -90 [13826] dbg: shortcircuit: hit on rule USER_IN_BLACKLIST_TO, sc type is default, apply score of 10 so, you see it runs through whitelist first at priority -100, all trusted on -95, and then hits a blacklist_to on priority -90. with this approach there is no cancellation when a whitelist_* and blacklist_* (maybe by mistake) hit on the same message. whitelist takes priority. if you want it the other way around, set the black test to higher (lower number) priorities than -100. have fun!
side note: after applying this patch, throw http://www.rulesemporium.com/rules/70_sare_whitelist.cf in your local config, your cpu will thank you. if anyone has any additions to 70_sare_whitelist.cf, please contribute via the user list. http://lists.maddoc.net/mailman/listinfo/sare-users thanks.
sorry to be a downer, but I'm -1 on applying this patch in its current state. I think it's a decent first go, thanks Dallas, but I'd like to see us come up with more of a design before applying code then having to go in and change it all around, etc. here's a number of thoughts I have about the patch: first, I really don't like the override score idea. I don't think there's a real point in doing it, and it's going to cause confusion by people who add up the rules and see it's nowhere near the rule score sum. second, there has to be something in the x-spam-status header to indicate that SC occured. third, mass-check wrt reuse needs to be modified to take SC into account. we'll also need to let people who do mass-checks know that they really shouldn't use SC since we want to see the hit results for all network tests (the current reuse behavior). fourth, what about autolearn? it's likely autolearn won't happen by default (too few hits, etc,) but it could also potentially learn the wrong way depending on what rules hit. I'd like to get a consensus on what should happen here (IMO, autolearn is skipped for SC). fifth, what about AWL? similar to autolearn, except that the score is going to be completely off. IMO, AWL is skipped during SC. sixth, an easy addition to this patch would be short circuiting on current message score. sometimes you don't care what rules hit, but you want to stop after a certain point. seventh, and this is tricky, do we want to try moving the priorities around automatically when SC is enabled? I was thinking of having code that sets default priorities, similar to default score, that goes through the SC rule listing and bumps the priority so they run first (including meta dependencies) unless the rule has a priority specifically set by config. Then another loop which sets the default to priority 0. eighth, should the SC decision code be in a plugin?
RE: plugin See Check plugin proposal.
hey theo, i have no objection at all to your points. the problem is, its just all talk, and has been for a long time. this ticket is over 2 years old. so instead of sitting back and watching nothing ever come of it... i figured it would be better to have something rather than nothing. i realize its not the "ideal" solution, but hopefully some people can find it useful, at least until the pluginized check is in place. and if nothing else, maybe it will spur some discussion and keep this ticket moving in the right direction... because this is an important feature. i not sure i agree on the autolearning bit. users should be able make up their mind on autolearning, and set a tflag noautolearn if they need to. if you let them shortcircuit their rules, you should let them make an autolearn vs noautolearn decision.. or maybe it defaults to noautolearn on all shortcircuited rules, and requires them to optin with autolearn tflag. one note on the existing patches, i just realized that dns tests are being applied unless you explicitly skip it... since that code runs after the do_* test, which isnt last() until the next iteration. so it needs to be if ($needs_dnsbl_harvest_p && !exists $self->{shortcircuit}) {
Firstly, let me congratulate Dallas on doing something that has been needed for a long, long time. It is needed to make SA more efficient. Compare SA to Dspam and you can see the difference immediately in terms of speed. It is a GREAT first run. As to Theo's points... 1. Override scores...just set them to zero and all is well. 2. X-Spam-Status .. I agree with you. A brief SC at (Test) would be nice. As to the rest of your points....one simple line in local.cf would take care of it...of course with a bit of code change from Dallas...w # Short circuiting behaviour - default 0 set it to 1 to enable. short_circuit (1|0) Again....thanks Dallas.
Created attachment 3428 [details] shortcircuit patch with meta/dns priority support okay, i found shortcircuit on body, rawbody and header tests only to be very boring... other than user-defined white/black lists, there is really not much else that you can make a good shortcircuit descision on (at least in the default rules). so, i started messing around with the harvest priorities and found that it is quite simple to accomplish prioritized dns and meta rules. attached is a updated patch against 3.1.1 that will allow this, and give you alot more power with shortcircuiting of meta tests. you just have to make sure the tests your meta's rely on have higher priorities than the meta tests themselves. works great as long as you order your priorities correctly. i've seen the loads disappear on some dedicated spamd servers tonite :) when meta rules are created, they are set with a priority of 500 by default, which is what the constant HARVEST_DNSBL_PRIORITY was previously set that. This caused all metas test to not even be evaluated until all the default priority rules had ran, which was what needed to happen. However, this prevented proper prioritization on the meta rules. By changing HARVEST_DNSBL_PRIORITY to -500, this allows it to evaluate any meta in a priority group > -500 prior to the cpu heavy rules. if you do not define any metas with < 0 priorities, they just works like they always have, running at priority 500. So, now you can do something that is actually worthwile.. like # must define dns based tests at a priority > -800 # must define meta priorities > -500 and < any dns test priority it relies on. priority URIBL_BLACK -500 priority URIBL_JP_SURBL -500 priority URIBL_SC_SURBL -500 priority URIBL_OB_SURBL -500 meta SC_MULTI_RHSBL ( URIBL_BLACK && (URIBL_JP_SURBL || URIBL_SC_SURBL || URIBL_OB_SURBL )) priority SC_MULTI_RHSBL -100 shortcircuit SC_MULTI_RHSBL spam [31446] dbg: config: read file /usr/share/spamassassin/60_shortcircuit.cf [31446] dbg: check: running tests for priority: -1000 [31446] dbg: priority: running high priority (-1000) eval USER_IN_WHITELIST [31446] dbg: priority: running high priority (-1000) eval USER_IN_ALL_SPAM_TO [31446] dbg: priority: running high priority (-1000) eval USER_IN_DEF_WHITELIST [31446] dbg: priority: running high priority (-1000) eval SUBJECT_IN_WHITELIST [31446] dbg: check: running tests for priority: -950 [31446] dbg: priority: running high priority (-950) eval ALL_TRUSTED [31446] dbg: check: running tests for priority: -900 [31446] dbg: priority: running high priority (-900) eval USER_IN_BLACKLIST_TO [31446] dbg: priority: running high priority (-900) eval USER_IN_BLACKLIST [31446] dbg: priority: running high priority (-900) eval SUBJECT_IN_BLACKLIST [31446] dbg: check: running tests for priority: -500 [31446] dbg: priority: running high priority (-500) eval URIBL_BLACK [31446] dbg: priority: running high priority (-500) eval URIBL_JP_SURBL [31446] dbg: priority: running high priority (-500) eval URIBL_OB_SURBL [31446] dbg: priority: running high priority (-500) eval URIBL_SC_SURBL [31446] dbg: check: running tests for priority: -100 [31446] dbg: priority: running high priority (-100) meta SC_URIBL_HASH [25755] dbg: shortcircuit: hit on rule SC_MULTI_RHSBL, sc type is spam, apply score of 100 and whitelists/blacklists running at a higher priority like -1000 and -900 will prevent check_post_dnsbl from ever firing of dns queries because the HARVEST_DNSBL_PRIORITY does not kick in until the priority group reaches -800. [22242] dbg: check: running tests for priority: -1000 [22242] dbg: priority: running high priority (-1000) eval USER_IN_WHITELIST [22242] dbg: shortcircuit: hit on rule USER_IN_WHITELIST, sc type is default, apply score of -100 so now we are getting somewhere! ;)
OK, some agreement with Theo, and some disagreement ;) > first, I really don't like the override score idea. I don't think > there's a real point in doing it, and it's going to cause confusion by > people who add up the rules and see it's nowhere near the rule score > sum. -1 I think it provides a better, more familiar UI for users. Compare (assuming we fix the "something in x-spam-status" issue below): X-Spam-Status: Yes, score=1.9 required=5.0 tests=BAYES_50,HTML_IMAGE_ONLY_28, HTML_MESSAGE,MIME_HTML_ONLY,SC_TEST shortcircuit=spam autolearn=no version=3.2.0-r372567 eh? "Yes, score=1.9 required=5.0", wtf? that's less intuitive, and more likely to cause confusion, than X-Spam-Status: Yes, score=15.0 required=5.0 tests=BAYES_50,HTML_IMAGE_ONLY_28, HTML_MESSAGE,MIME_HTML_ONLY,SC_TEST shortcircuit=spam autolearn=no version=3.2.0-r372567 it really makes less sense in my opinion for the mail to be marked as spam with a low score, than for the tests not to add up. ;) (Personally, I'd prefer to see a more "extreme" score value -- -100 and 100, for example -- to give another "hint" to users that something out-of-the-ordinary has occurred. But I don't think that's a vote-stopper issue. for example:) X-Spam-Status: Yes, score=100.0 required=5.0 tests=BAYES_50, HTML_IMAGE_ONLY_28,HTML_MESSAGE,MIME_HTML_ONLY,SC_TEST shortcircuit=spam autolearn=no version=3.2.0-r372567 > second, there has to be something in the x-spam-status header to > indicate that SC occured. +1 agreed on that point, I didn't spot that this was missing. ;) > third, mass-check wrt reuse needs to be modified to take SC into > account. we'll also need to let people who do mass-checks know that > they really shouldn't use SC since we want to see the hit results for > all network tests (the current reuse behavior). +1 OK, that is a good point. if the X-Spam-Status line is fixed to record "shortcircuit=spam" etc., it'll be easy to detect in mass-check. > fourth, what about autolearn? it's likely autolearn won't happen by > default (too few hits, etc,) but it could also potentially learn the > wrong way depending on what rules hit. I'd like to get a consensus on > what should happen here (IMO, autolearn is skipped for SC). -1 I don't think this needs to block autolearning necessarily. It can be driven by the rules that caused the shortcircuiting. For example, let's say USER_IN_WHITELIST is the rule that shortcircuited -- that rule already mandates "noautolearn", so there's no need for the act of shortcircuiting to do so, too. There's a possibility that some extremely reliable shortcircuiting rules (esp ones that don't rely on user/admin input) can provide good data for autolearning, so let's not rule it out. by the way the "autolearn the wrong way" scenario should be impossible already, see lib/Mail/SpamAssassin/Plugin/AutoLearnThreshold.pm: dbg("learn: auto-learn? no: scored as ham but autolearn wanted spam"); > fifth, what about AWL? similar to autolearn, except that the score is > going to be completely off. IMO, AWL is skipped during SC. +1 ok good point, agreed. > sixth, an easy addition to this patch would be short circuiting on > current message score. sometimes you don't care what rules hit, but you > want to stop after a certain point. -1 *no*. This is EXACTLY the approach that has been tried several times before, with lousy results -- which resulted in the idea that this approach is more worthwhile. Let's not go around in circles! It'll be easy enough to build additions on top of this simple change, later. We don't have to do everything *now*. > seventh, and this is tricky, do we want to try moving the priorities > around automatically when SC is enabled? I was thinking of having code > that sets default priorities, similar to default score, that goes > through the SC rule listing and bumps the priority so they run first > (including meta dependencies) unless the rule has a priority > specifically set by config. Then another loop which sets the default to > priority 0. > > eighth, should the SC decision code be in a plugin? -1 No. There are a lot of issues around pluginizing parts of check(), and they're being discussed in 2 other bugs simultaneously iirc! Let's not confuse matters even more by making it 3! :( This patch, in contrast, is pretty simple and small, and can be easily refactored into a plugin *later* if desired, once the other pluginization discussions are concluded with an agreement. I think this is a great patch, adding some functionality that we *really* *REALLY* need -- I know I, for one, have had to put in some upfront black/whitelisting at the MTA level on one of my servers due to high load issues, and this patch would allow me to avoid that kludge. It does impact on other parts of the SpamAssassin design, but let's not get dragged into a "what colour to paint the bikeshed" discussion. I really feel that if we were to insist on pluggable shortcircuiting algorithms immediately, we'd be heading down that road -- the existing check()-pluginization discussions certainly give that impression to be honest. Again, what I'm saying is, there's no need to get it "perfect" in the first svn commit. If you want, I'd suggest we apply this patch to svn trunk (ok, after fixing some or all of the +1's above ;), and then open some new bugs to deal with evolutionary follow-on ideas... --j.
Created attachment 3429 [details] s/c results gif i implemented uribl+surbl shortcircuiting last night on one box, and left another unpatched. here is the load average comparision of them. it should speak for itself :)
>> fifth, what about AWL? similar to autolearn, except that the score is >> going to be completely off. IMO, AWL is skipped during SC. > +1 > ok good point, agreed. hmm... what am i missing here??? AWL is a header eval, run at prio 1000, so it never gets that deep in the scan if the email hits a shortcircuited rule.
(In reply to comment #21) > > first, I really don't like the override score idea. I don't think > > there's a real point in doing it, and it's going to cause confusion by > > people who add up the rules and see it's nowhere near the rule score > > sum. > > -1 > I think it provides a better, more familiar UI for users. Compare > (assuming we fix the "something in x-spam-status" issue below): > > X-Spam-Status: Yes, score=1.9 required=5.0 tests=BAYES_50,HTML_IMAGE_ONLY_28, > HTML_MESSAGE,MIME_HTML_ONLY,SC_TEST shortcircuit=spam > autolearn=no version=3.2.0-r372567 > > eh? "Yes, score=1.9 required=5.0", wtf? > that's less intuitive, and more likely to cause confusion, than > > X-Spam-Status: Yes, score=15.0 required=5.0 tests=BAYES_50,HTML_IMAGE_ONLY_28, > HTML_MESSAGE,MIME_HTML_ONLY,SC_TEST shortcircuit=spam > autolearn=no version=3.2.0-r372567 > > it really makes less sense in my opinion for the mail to be marked as spam > with a low score, than for the tests not to add up. ;) While it may make more sense to have a score greater than the required score, neither make complete sense. :) I think that either tests that are going to cause short circuit should either have a score alone that will cause a sensible spam score; or where that may not be possible (say, decision trees in a future check() plugin) score should be undefined: X-Spam-Level: ****************************************** (max # of stars we use) X-Spam-Status: Yes, score= required= tests=BAYES_50, HTML_IMAGE_ONLY_28,HTML_MESSAGE,MIME_HTML_ONLY,SC_TEST shortcircuit=spam autolearn=no version=3.2.0-r372567 (or maybe 0.0 for both score and required) > > fourth, what about autolearn? it's likely autolearn won't happen by > > default (too few hits, etc,) but it could also potentially learn the > > wrong way depending on what rules hit. I'd like to get a consensus on > > what should happen here (IMO, autolearn is skipped for SC). > > -1 > I don't think this needs to block autolearning necessarily. It can be > driven by the rules that caused the shortcircuiting. For example, let's > say USER_IN_WHITELIST is the rule that shortcircuited -- that rule already > mandates "noautolearn", so there's no need for the act of shortcircuiting > to do so, too. Agreed with jm here. > > seventh, and this is tricky, do we want to try moving the priorities > > around automatically when SC is enabled? I was thinking of having code > > that sets default priorities, similar to default score, that goes > > through the SC rule listing and bumps the priority so they run first > > (including meta dependencies) unless the rule has a priority > > specifically set by config. Then another loop which sets the default to > > priority 0. I'd say yes, but it can wait. > > eighth, should the SC decision code be in a plugin? > > -1 > No. There are a lot of issues around pluginizing parts of check(), and > they're being discussed in 2 other bugs simultaneously iirc! Let's not > confuse matters even more by making it 3! :( > > This patch, in contrast, is pretty simple and small, and can be easily > refactored into a plugin *later* if desired, once the other pluginization > discussions are concluded with an agreement. Agreed with jm.
Created attachment 3431 [details] updated patch against svn attached is an updated patch. is does the following 1) adds the following TEMPLATE TAGS: SC, SCTYPE, and SCRULE. this will allow for the add_header necessary for X-Spam-Short-Circuit if needed. 2) updates 10_default_prefs.cf to include shortcircuit=_SCTYPE_ in the X-Spam-Status header (for masscheck purposes). 3) resets test_names_hit, score to 0, and tag_data before applying the shortcircuit. this will prevent multiple rules from showing hit when you have a shortcircuited meta test. 4) alters dns harvest priority to -800 to allow dns tests to be prioritized properly. 5) alters meta harvest priority to -500 to allow meta tests to be prioritized. Conf/Parser.pm still sets meta priorities to 500 by default. 6) supports a shortcircuit 'off' value so you can override a shortcircuited default test via your local config. 7) alters report template to include _SCTYPE_ on the Content analysis details line. 8) adds 60_shortcircuit.cf with some default shortcircuits, and some other examples that are commented. so, that takes care of most of the major issues i hope.
Created attachment 3432 [details] updated patch against svn #2 oops, those sample metas in 60_shortcircuit.cf were supposed to be commented out.. please review.. its late and my eyes are tired ;)
undefined score/required? I really don't like that one, to be honest. The X-Spam-Status is mainly a UI but also an API, and I think it's better to "degrade gracefully" for apps that won't know about short-circuiting, by providing them *something* relatively sane in the areas they're looking at, instead of breaking their expectations like that. I do like the "full X-Spam-Level" suggestion though. To recap, these are the current suggestions: option (a), Daryl's suggestion: X-Spam-Level: ****************************************** X-Spam-Status: Yes, score= required= tests=BAYES_50, HTML_IMAGE_ONLY_28,HTML_MESSAGE,MIME_HTML_ONLY,SC_TEST shortcircuit=spam autolearn=no version=3.2.0-r372567 option (b), just the "shortcircuit=spam" thing: X-Spam-Level: * X-Spam-Status: Yes, score=1.9 required=5.0 tests=BAYES_50,HTML_IMAGE_ONLY_28, HTML_MESSAGE,MIME_HTML_ONLY,SC_TEST shortcircuit=spam autolearn=no version=3.2.0-r372567 option (c), set score to -5 or 15: X-Spam-Level: *************** X-Spam-Status: Yes, score=15.0 required=5.0 tests=BAYES_50,HTML_IMAGE_ONLY_28, HTML_MESSAGE,MIME_HTML_ONLY,SC_TEST shortcircuit=spam autolearn=no version=3.2.0-r372567 option (d), same as (c) but with a higher value for clarity: X-Spam-Level: ****************************************** X-Spam-Status: Yes, score=100.0 required=5.0 tests=BAYES_50,HTML_IMAGE_ONLY_28, HTML_MESSAGE,MIME_HTML_ONLY,SC_TEST shortcircuit=spam autolearn=no version=3.2.0-r372567 me, I like (d), with (c) as a second-best. --j.
Is this going to be a part of Spamassassin or will we have to keep patching it?
I can't apply it until Theo revokes his "thumbs down" -1 vote. ;)
>>> I can't apply it until Theo revokes his "thumbs down" -1 vote. ;) I've been running one mail server with this patch - and the latest svn version of spamassassin - and it works great. The only change that I have made is I have now monitored spamassassin with monit in case it crashes as the latest svn could be unstable. So far - no problems in about a week. This really should be a part of Spamassassin....anything that makes it more efficient should be incorporated. Otherwise it will be like Windows software. Each succeeding version becomes more and more bloated and inefficient. I hope Theo reviews this and makes it a part of Spamassassin. After all, SA is one of the more resource hungry hogs on any mail server.
Re #27, I'm also in favor of D with C as second choice. Leaving the score out completely is likely to break a whole lot of stuff.
(In reply to comment #27) > undefined score/required? I really don't like that one, to be honest. The > X-Spam-Status is mainly a UI but also an API, and I think it's better to > "degrade gracefully" for apps that won't know about short-circuiting, by > providing them *something* relatively sane in the areas they're looking > at, instead of breaking their expectations like that. Even though it's the most accurate, I don't really like it either -- that's me being used to safety critical systems where we never ever rely on a value that's supposed to be there being there. I'd much prefer that whatever causes the shortcicuit to have fired a rule that will be certain to cause a final score that exceeds the threshold in the appropriate direction, and we use that final score. > option (d), same as (c) but with a higher value for clarity: > > X-Spam-Level: ****************************************** > X-Spam-Status: Yes, score=100.0 required=5.0 tests=BAYES_50,HTML_IMAGE_ONLY_28, > HTML_MESSAGE,MIME_HTML_ONLY,SC_TEST shortcircuit=spam > autolearn=no version=3.2.0-r372567 > > > me, I like (d), with (c) as a second-best. I'd settle for (d) but with a (default) score outside the range of typical scores. The score should be configurable though, as I believe it is now, so that people using multiple thresholds for tag-accept/reject can still do that.
Sorry it's been a while for me, I've been busy with job and move-related activities... I think we can move ahead with applying the patch, but I can definitely see some bits that need to be changed. Some thoughts: 1) There's no point in checking for shortcircuit_type in the sections building the eval in do_head_tests, do_body_tests, do_body_uri_tests, etc. In normal usage, these sections are only ever called once, and shortcircuit_type is guaranteed not to be enabled at that point. 2) I'd move the initial shortcircuit_type check in those above sections until after the eval is built but before the function is called (assuming user rules are disabled). This way we'll get the evals built the first time as usual, and then later on the tests can be skipped before calling the rules. there's a couple of other places that could be tweaked, but I'll worry about that later on. I still think priorities ought to be handled automatically as possible. I don't want to require people to have to change priorities around on their own. This shouldn't be hard in theory (hopefully I won't look back at this and wonder wtf I was talking about... ;)) The patch also does nothing with mass-check to deal with reuse and short-circuit, so we'll want to get something in quickly for that. Also, >> sixth, an easy addition to this patch would be short circuiting on >> current message score. sometimes you don't care what rules hit, but you >> want to stop after a certain point. > >-1 >*no*. This is EXACTLY the approach that has been tried several times >before, with lousy results -- which resulted in the idea that this >approach is more worthwhile. Let's not go around in circles! yes, except that the problem was performance due to the the constant if checks and the lack of prioritization. those two are basically fixed now, so it should work fine.
While I'm not veoting the patch, I am a little concerned with how it messes around with priorities. I realize people are hot on the patch, but I suggest caution and review time.
So I've applied *most* of the patch to svn trunk ;) 1. There does seem to be an issue with what it does to HARVEST_DNSBL_PRIORITY, changing it from 500 to -800. AFAIK, this causes DNSBLs to be harvested pretty much immediately, instead of letting those network queries percolate in the background while we do other stuff. normally we'd do: - start all DNSBL lookups (in the background) - do head tests, body tests, etc. for several seconds - wait several seconds for DNSBL results - do meta tests - AWL - return with the HARVEST_DNSBL_PRIORITY change it becomes - start DNSBL lookups (in the background) - wait several seconds for DNSBL results - do head tests, body tests, etc. for several seconds - start more DNSBL lookups (at another priority level) - wait several seconds for those DNSBL results - ... I'm worried that that'll have pretty major knock-on effects on speed. :( So I haven't applied that change to HARVEST_DNSBL_PRIORITY, and it's still at +500. (META_TEST_MIN_PRIORITY is indeed now -500, though, so metas of non-network rules can now be used for shortcircuiting. I couldn't see any bad side-effects of that.) But use of net/DNSBL tests in a short-circuit, needs a better way to do this than causing it to attempt to harvest *all* network lookups early -- I think it will need a way for the "shortcircuit SC_URIBL_SBL spam" line to cause the meta dependency chain to be followed up to the network rule, and some logic to realise that that implies that the URIBL lookups -- and just those -- therefore need to be waited for before that meta rule can be evaluated correctly. I think there's more work needed there. 2. I'm not convinced about "shortcircuit default" btw... I left it in, anyway, for now. I would prefer to take it out though, unless there's a good reason to have both it *and* the "ham/spam" version. 3. "t/dnsbl.t" fails, but it's failing before this patch was applied anyway. something's up there in trunk :( 4. the "no shortcircuit" SCTYPE token for X-Spam-Status got changed to "no" to match "autolearn=no". 5. added code to mass-check to ignore messages where X-Spam-Status =~ /shortcircuit=(?:yes|no|default)/. apart from that -- looks good to me! checked in as r391177.
While I said I wasn't vetoing the patch, I did say I request time to review, since there was no time given I now have to veto. I'm -1 (veto) on this patch. This is "ONE" implementation of shortcircuit, and from the comments, not even the best, and happens to stomp on at least one other known implementation. This code should be backed out and when we get the Check plugin code in place you can then put it all in your own plugin to avoid creating problems for other people's code.
Michael : I appreciate your concern but how much time do need to review? And what are your points of concern so that others who are hot on the patch can address them. This has brought down the CPU usage on one of my servers considerably....I am seriously considering putting it on all my mail servers. Justin : >> normally we'd do: - start all DNSBL lookups (in the background) - do head tests, body tests, etc. for several seconds - wait several seconds for DNSBL results - do meta tests - AWL - return with the HARVEST_DNSBL_PRIORITY change it becomes - start DNSBL lookups (in the background) - wait several seconds for DNSBL results - do head tests, body tests, etc. for several seconds - start more DNSBL lookups (at another priority level) - wait several seconds for those DNSBL results - ... I'm worried that that'll have pretty major knock-on effects on speed. :( So I haven't applied that change to HARVEST_DNSBL_PRIORITY, and it's still at +500. >> Normally, the dsnbl checks can be done last.... the fastest sequence is as follows 1. Banned attachments ... if it contains a banned attachment...mark as spam .. SC 2. Clamav plug in ... if it contains a virus ... mark as spam .. SC 3. White list ... mark as ham .. SC 4. Blacklist ... mark as spam .. SC 5. Local tests ... mark as spam if need be ... SC only if score exceeds spam threshold 6. dnsbl tests ... mark as ham/spam as normal. I find that 60% of the incoming emails are sorted out by step 4....with very little CPU usage by SA (except CLAMAV which takes quite a bit at step 2) By knocking off the obvious banned attachments/viruses/whitelists and blacklists, the loads on the network services can be brought down considerably. Why do a dnsbl test on a virus? It is a waste of everybody's resources.
Dude, you didn't "request time to review" -- you "suggested" that people should "exercise caution and review time". It was not clear that *you* were the person you wanted to exercise that review time! ;) Also your first comment was 2 weeks after the patch was posted, saying "I'm not vetoing the patch but am a little concerned". So I took that as not indicating a whole lot of worry. > This is "ONE" implementation of shortcircuit, and from the comments, not even > the best, and happens to stomp on at least one other known implementation. hmm -- where is this other known implementation? Is it open source, available for use in Apache SpamAssassin right now? Is it ready to apply to SVN trunk? Does it implement techniques we've been agreeing on for two years, as this does? If this other code needs to be taken into account, it should be made part of the open source codebase so it is at least visible, and efforts be made to get it into SVN, as Dallas is doing here. It is nice to support proprietary work in Apache SpamAssassin, but not to the detriment of the open-source product's quality. > This code should be backed out and when we get the Check plugin code in place > you can then put it all in your own plugin to avoid creating problems for > other people's code. I strongly disagree that we should make this work conditional on the Check plugin. Do it the other way around -- when the Check plugin is completed, *then* rearrange the contents of svn trunk to work with it, not vice versa. We cannot freeze the entire contents of the PerMsgStatus class pending a Check plugin, given that the last comment on bug 4776, was *2 months ago*, and there's still some serious disagreement there. Fundamentally it makes no sense to block the rest of the project based on a task that does not seem to be going forward!
bug 4860 adds an PerMsgStatus::is_rule_complete($rule) API, allowing the s/c code to determine if a DNSBL lookup has completed yet.
Created attachment 3467 [details] automatically set meta rule priorities Here's a patch that takes care of one issue with the current short-circuiting patch; it automatically traces priorities of meta rules, and ensures that all of their dependencies have the same or earlier priority. in other words, having to work that out by hand is no longer required; just set a rule to whatever priority you want and the code will take care of it. This has the happy side-effect that META_TEST_MIN_PRIORITY is no longer required. the next step then is to use the code in bug 4860, and call is_rule_complete() for each rule in metas, to ensure that they are either not DNS rules, or else are DNS lookups that have completed; and if that function returns 0 for any rule, to call $self->{async}->complete_lookups(1) until it returns 1. that will then allow network tests to be used in shortcircuit metas.
> ------- Additional Comments From jm@jmason.org 2006-04-04 11:28 ------- > Dude, you didn't "request time to review" -- you "suggested" that people should > "exercise caution and review time". It was not clear that *you* were > the person you wanted to exercise that review time! ;) Perhaps I should have been more forceful in the request but its almost like you're rushing to get the code in before anyone else can review, whats the hurry? Why can't we all work together to get a better handle on the optimal solution? > Also your first comment was 2 weeks after the patch was posted, saying "I'm not > vetoing the patch but am a little concerned". So I took that as not indicating > a whole lot of worry. Well, my first comment was to back up Theo's objection and near as I can tell still outstanding veto right after the patch was posted. Apparently, a veto doesn't mean as much these days since clearly you ignored that one and don't seem to be respecting mine. > It is nice to support proprietary work in Apache SpamAssassin, but not to the > detriment of the open-source product's quality. How about hackish code at the detriment? You've used the "lets not screw folks using the code" argument more times than I can count. If you seriously want other implementations (and believe me I'm working on getting as much closed code opened up as I can) then screwing them over certainly isn't the best path there. You of all people should realize that. However, since feeding my family has taken precedence and I just don't have time to offer alternative implementations, I'll remove my veto (since it was being ignored anyway) but still register a -1 for the code as implemented. I'll also note that folks who provide results for scoring should not be running this code, it will seriously hamper our efforts to provide realtime hits in the reuse code.
This is functionality I've been looking for, for over two years, and it will solve several scalability problems I'm having on my own MXes -- so I am indeed keen to get it in. Sorry if it seems like I'm hurrying too much. Regarding Theo's veto: he said 'I think we can move ahead with applying the patch, but I can definitely see some bits that need to be changed', then listed several points. While he didn't spell it out, that does indicate that his veto was retracted. Each of those points he raised was covered in the change I applied in r391177. Alright, I can see I've been a bit over-eager here. I'll back out r391177, and instead open a branch for this. There's a fair bit of dependent work (bug 4860 and patch 3467), and it's going to need an SVN repository I think.
OK, I've created a branch of "trunk" for this: https://svn.apache.org/repos/asf/spamassassin/branches/bug-3109-shortcircuiting ps: 'I'll also note that folks who provide results for scoring should not be running this code, it will seriously hamper our efforts to provide realtime hits in the reuse code.' mass-check will not attempt to reuse hits from shortcircuited messages.
> mass-check will not attempt to reuse hits from shortcircuited messages. > Understood, however if this gets put on one of the trap servers it will seriously hamper our ability to use realtime results for network tests.
'But use of net/DNSBL tests in a short-circuit, needs a better way to do this than causing it to attempt to harvest *all* network lookups early -- I think it will need a way for the "shortcircuit SC_URIBL_SBL spam" line to cause the meta dependency chain to be followed up to the network rule, and some logic to realise that that implies that the URIBL lookups -- and just those -- therefore need to be waited for before that meta rule can be evaluated correctly.' 'seventh, and this is tricky, do we want to try moving the priorities around automatically when SC is enabled?' Update: these two are now implemented in the branch. It works as follows: - 1. config is read as normal; priorities are set to whatever they are in the config file, or 0 by default (including meta rules). - 2. finish_parsing() calls trace_meta_dependencies(), which recursively descends the meta dependency tree, so that each meta rule then has a list of all the rules it depends on (directly or indirectly) in $conf->{meta_dependencies}. - 3. finish_parsing() calls fix_priorities(), which then uses that meta-dependencies list to ensure that each meta rule will run at the same time as, or later than, all of its dependencies. - 4. do_meta_tests() in PerMsgStatus then, before it runs each meta rule, greps out a list of async (ie. network) tests that are dependencies of that rule (if any). It then checks to see if they've completed; if not, it waits until it completes (or the async tests time out). This grepping process takes place once at rule-compile time, not at runtime, although the completion test runs every time -- however that's 1 or 2 hash lookups, and there are not many meta rules that depend on network rules. Finally, META_TEST_MIN_PRIORITY is rendered obsolete, so it's gone. It all appears to be working, although could probably do with some tidying up and may have a few bugs ;) In particular I think the rbl_timeout scaling code probably needs some work, I think it may be cutting off DNSBL lookups too early now. Also it needs some speed measurement to see how it now compares with trunk. Add this file as "rules/62_test.cf", then run "spamassassin -D -t", for a demo: meta SC_PRI_DEMO ((URIBL_BLACK || URIBL_SC_SURBL) && SPF_SOFTFAIL) shortcircuit SC_PRI_DEMO spam priority SC_PRI_DEMO -1000 by the way I am still uncomfortable with "shortcircuit default". I would prefer to get rid of it, since I think it's one feature too many, unless there's a good reason for it to be kept. Dallas, fancy making a case for it? ;) Also remaining to do: add tests to the test suite for shortcircuiting; and try it out on my dogfood server.
> Finally, META_TEST_MIN_PRIORITY is rendered obsolete, so it's gone. very nice! > by the way I am still uncomfortable with "shortcircuit default". > I would prefer to get rid of it, since I think it's one feature > too many, unless there's a good reason for it to be kept. > > Dallas, fancy making a case for it? ;) sure, i'll tell you my thoughts behind it. lets say you have the following configuration. identify spam at 5.0 quarantine at 8.0 delete over 15.0 and you have a need to quarantine certain mail, lets say for a past employees. Since you know the To: address, and you know it all needs to be quarntined, the simple solution would be header EX_EMPLOYEES To =~ /(user1|user2|user3)\@mydomain/ score EX_EMPLOYEES 8.0 shortcircuit EX_EMPLOYEES default priority EX_EMPLOYEES -100 Therefore, any email to those people automatically goes to quarantine because the final score is 8.0. Without being able to shortcircuit and use the default score, you are always going to shortcircuit at the +100 (or whatever leve you set).... which looses flexibility. but its whatever you feel is right... I just thought people may find it beneficial not to designate a spam/ham classification and just use whatever score is assiged to the shortcircuited rule. d
Gee, now that you are tracing meta dependencies, you could do a lint warning if you find an unresolved dependency. :-)
'Therefore, any email to those people automatically goes to quarantine because the final score is 8.0. Without being able to shortcircuit and use the default score, you are always going to shortcircuit at the +100 (or whatever level you set).... which looses flexibility.' ok, that makes sense. However I'm still not fond of having two separate "parallel" ways to do the same thing; "shortcircuit default / score 100" and "shortcircuit spam" do the same thing two different ways. how's about getting rid of "shortcircuit ham"/"shortcircuit spam", and just have either "shortcircuit default" (renamed to "shortcircuit on") or "shortcircuit off" behaviour as the available options? Then the score alone is used to determine whether a rule shortcircuits as ham or as spam -- negative means ham, 0 or positive spam. if brevity of the configuration is the aim, we can still support "shortcircuit ham" / "shortcircuit spam", but just implement them in the config-reading subroutine as translating to "shortcircuit on / score -100" or "shortcircuit on / score 100". What do you think?
message from Loren by mail (watch out, our bz still doesn't support that): Loren Wilton writes: > Someone please remind me why the email score simply isn't the score total up > to and including the short-circuit rule? I'm assuming that in general short > circuit rules are going to run relatively early (else why bother?) so except > for a short-circuit meta there should be relatively few scores in addition > to the short circuit. And for the meta I'd expect most contributory scores > to go the same way as the short circuit score. > > It then becomes dependent on the rule writer to pick a good score for the > short-circuit rule. But this is always the case. Loren -- that's one of the existing options, that's "shortcircuit default".
(In reply to comment #48) > how's about getting rid of "shortcircuit ham"/"shortcircuit spam", and just > have either "shortcircuit default" (renamed to "shortcircuit on") or > "shortcircuit off" behaviour as the available options? Then the score alone > is used to determine whether a rule shortcircuits as ham or as spam -- negative > means ham, 0 or positive spam. > if brevity of the configuration is the aim, we can still support "shortcircuit > ham" / "shortcircuit spam", but just implement them in the config-reading > subroutine as translating to "shortcircuit on / score -100" or "shortcircuit > on / score 100". > What do you think? sounds reasonable to me. on/off is simple, and still gives you the flexibility to control the final score. also it would probably be best to set noautolearn tflag by default, and force someone to tflag learn a s/c rule. if you have on|off|ham|spam, the ham and spam s/c classification could score - 100/+100 and force autolearning. but i guess that puts us right back where we started at ham|spam|default|off really... just changing "default" to "on", and making sure "on" has noautolearn tied to it. i'd rather not have to define a rule with 6 lines of config if we can set some sane defaults. body RULE /text/ describe RULE blah score RULE x.x shortcircuit RULE on tflags RULE noautolearn priority RULE -100 i would suggest shortcircuit "on" set noautolearn tflag and -100 priority by default.... and if you want to adjust the priority higher from there, you have to specify it, and if you want to autolearn on that rule, you either have to specify the learn tflag, or you have to s/c ham|spam. i guess if you s/c as ham|spam, the noautolearn tflag would have to be removed from the rules config. d
ok, r394166 implements that.
>> mass-check will not attempt to reuse hits from shortcircuited messages. > >Understood, however if this gets put on one of the trap servers it will >seriously hamper our ability to use realtime results for network tests. hey, I've been thinking about this. I agree that it shouldn't be used on a trap server. However I'm not mass-checking any of the traps currently; I mass-check my own spam/ham. I already reject at SMTP connect time using SBL/XBL, which means that, what is it, 40% of incoming spam is not represented in my spam corpus? I'll bet quite a few of the other mass-checkers are doing the same, or even using other SMTP-time tactics, like greylisting or other DNSBLs. As a result, if we're worried about our filtering tactics affecting our corpora, and affecting mass-check results on network tests, IMO that horse has already bolted, to a substantial degree.... ok, things still to do on this bug: - a test for shortcircuit; - I need to dogfood it on my server as well; - figure out what to do with the branch.
> a test for shortcircuit now in, as revision 395341. and dogfooding is now underway using this config: meta SC_BL_BAYES ((USER_IN_BLACKLIST_TO || USER_IN_BLACKLIST) && BAYES_99) meta SC_WL_BAYES ((USER_IN_WHITELIST||USER_IN_DEF_WHITELIST||USER_IN_ALL_SPAM_TO ||USER_IN_DEF_SPF_WL||NO_RELAYS) && BAYES_00) meta SC_BL_URIBL ((URIBL_BLACK||URIBL_AB_SURBL||URIBL_OB_SURBL||URIBL_SC_SURBL|| URIBL_WS_SURBL||URIBL_SBL) && BAYES_99) shortcircuit SC_BL_BAYES spam shortcircuit SC_BL_URIBL spam shortcircuit SC_WL_BAYES ham
well, it hasn't wound up faster -- before: all: 1980 messages, total time 9672.09999999999, avg 4.88489898989899 after: sc=spam: 405 messages, total time 1914.6, avg 4.72740740740741 sc=no: 939 messages, total time 7682.30000000001, avg 8.18136315228968 sc=ham: 117 messages, total time 1116.7, avg 9.54444444444444 all: 1461 messages, total time 10713.6, avg 7.33305954825462 shortcircuit-as-spam mails are marginally faster. shortcircuit-as-ham, or non-s/c, quite a bit slower.
oh, here's the script I used to parse the logs for that report -- #!/usr/bin/perl while (<>) { /spamd: result: / or next; my $sc = 'no'; /shortcircuit=(\S+)/ and $sc=$1; /scantime=([\d\.]+)/ or next; my $stime = $1; $msgs{$sc}++; $times{$sc}+=$stime; $msgs{all}++; $times{all}+=$stime; } foreach my $t (reverse sort keys %msgs) { my $avg = $times{$t} / ($msgs{$t} || 0.00001); print "sc=$t: $msgs{$t} messages, total time $times{$t}, avg $avg\n"; }
maybe it depends what you are short-circuiting. since all your sc tests rely on bayes at least, that may be something that is pushing out your scan times? as you can see here, my spam:ham ratio is pretty close yesterday, yet the time my system spent processing it was 2/5th of how long the ham took. Spam: 8140 Time Spent Processing Spam: 2.88 hours Ham: 9073 Time Spent Processing Ham: 7.30 hours i dont have any s/c ham rules expect for whitelists, so you see a pretty small sample size of sc=ham below. # cat /var/log/spamassassin/spamd.log | perl test.pl sc=spam: 4085 messages, total time 2346.59999999995, avg 0.574443084455313 sc=no: 6064 messages, total time 18832.1, avg 3.1055573878628 sc=ham: 134 messages, total time 68, avg 0.507462686567164 sc=all: 10283 messages, total time 21246.7000000002, avg 2.06619663522321 as you can see, my sc=spam and sc=ham avg scantimes are 6x faster than those message that were not s/c. how are your scantimes 4-8 seconds? old hardware? although i have yet to put your new implementation into production, i'm still using the hacked HARVEST_DNSBL_PRIORITY. as soon as i get some time to do testing on the new stuff i'll provide some more data.
cool -- could you post your rules in the meantime? are you running bayes on that server? my server is pretty old hardware. however I think a good bit is probably due to waiting for network rules -- all those rules rely on net tests. if you think about it, there's no way the BAYES rules could cause s/c=spam to take longer than s/c=no; BAYES runs for both, and would run for the same length of time. the bayes rules are synchronous.
i'm still testing, but it appears there is a small problem with a couple priorities. looks like setting priorities of -998 and -999 fail because they are used by $MISSING_REQUIRED_VALUE = -998; $INVALID_VALUE = -999; In Conf/Parser.pm, the $ret value set from my $ret = &{$cmd->{code}} ($conf, $cmd->{setting}, $value, $line); is the priority. which then matches if ($ret && $ret eq $Mail::SpamAssassin::Conf::INVALID_VALUE) and elsif ($ret && $ret eq $Mail::SpamAssassin::Conf::MISSING_REQUIRED_VALUE) and causes goto failed_line. # spamassassin --lint -D 2>&1 | grep USER 2>&1 priority USER_IN_BLACKLIST_TO -999 KEY=priority VALUE=USER_IN_BLACKLIST_TO -999 PARSE_ERROR RET=-999 IV=-999 KEY=priority VALUE=USER_IN_BLACKLIST_TO -999 [24595] warn: config: SpamAssassin failed to parse line, "USER_IN_BLACKLIST_TO_-999" is not valid for "priority", skipping: priority USER_IN_BLACKLIST_TO_-999 priority USER_IN_BLACKLIST -998 KEY=priority VALUE=USER_IN_BLACKLIST -998 [24595] warn: config: SpamAssassin failed to parse line, no value provided for "priority", skipping: priority USER_IN_BLACKLIST_-998 priority USER_IN_WHITELIST -990 KEY=priority VALUE=USER_IN_WHITELIST -990 NEXT, RET=-990 KEY=priority VALUE=USER_IN_WHITELIST -990 priority USER_IN_DEF_WHITELIST -989 KEY=priority VALUE=USER_IN_DEF_WHITELIST -989 NEXT, RET=-989 KEY=priority VALUE=USER_IN_DEF_WHITELIST -989 priority USER_IN_ALL_SPAM_TO -988 KEY=priority VALUE=USER_IN_ALL_SPAM_TO -988 NEXT, RET=-988 KEY=priority VALUE=USER_IN_ALL_SPAM_TO -988 [24595] dbg: shortcircuit: adding USER_IN_WHITELIST default [24595] dbg: shortcircuit: adding USER_IN_DEF_WHITELIST default [24595] dbg: shortcircuit: adding USER_IN_ALL_SPAM_TO default [24595] dbg: shortcircuit: adding USER_IN_BLACKLIST default [24595] dbg: shortcircuit: adding USER_IN_BLACKLIST_TO default
doh. I think we can change them; they're not exposed outside the code, and any plugins will/should be using them by name instead of embedding their values. $MISSING_REQUIRED_VALUE = -998; $INVALID_VALUE = -999;
Created attachment 3530 [details] shortcircuit.cf as used in jm's test I got a chance to do a follow-up on the performance of this code. I redid my rules files to not include any network rules at all, which require the "stop everything and wait for completion" code to kick in. Instead, just using the "shuffle priorities to ensure all local rules complete in the right order" algorithm, and just very simple 2 shortcircuit rules, here's what I get: sc=spam: 161 messages, total time 108.4, avg 0.673291925465839 sc=no: 1365 messages, total time 4170.9, avg 3.0556043956044 sc=ham: 129 messages, total time 485.6, avg 3.76434108527132 all: 1655 messages, total time 4764.9, avg 2.87909365558913 reminder: here's what I had beforehand using svn trunk: all: 1980 messages, total time 9672.09999999999, avg 4.88489898989899 so, on average, it now completes scans in *on average* across all messages a whopping 58% of the runtime. That is a MAJOR speedup! Most of this is coming from the spams taking only 0.673 seconds to scan on average. This is probably the largest speedup in the open source SpamAssassin codebase in years. Nice ;)
Interesting. And that was still requiring bayes, which I bet is a major time sink. It would be interesting to see the same timing run dropping the bayes_00 and bayes_99 from the metas. I'm sure the classification results will suffer as a result. But it would be an interesting time comparison.
hey justin.. i was just doing some bayes work on bug #4787 and came across something that effected my initial shortcircuit work in regards to autolearning... i'm not sure if it effects your attachment #3530 [details] since i havent yet got to that, but just in case, here is issue i ran into. i noticed that all email that shortcircuited was failing to autolearn, even if autolearn tflag was set specifically fo rthat rule. in Plugin::AutoLearnThreshold->autolearn_disriminator() I added this small block of code. my $isspam; if ( $scan->{shortcircuit_rule} ) { $isspam = 1 if ( $scan->{shortcircuit_type} eq "spam" ); $isspam = 0 if ( $scan->{shortcircuit_type} eq "ham" ); dbg("learn: shortcircuit ham/spam does not require score evalution."); return $isspam; } just thought i'd share that discover while its on my mind....
Created attachment 3570 [details] rule-priority code, against svn trunk In order to reduce the amount of contentious code involved, I've split the work into two independent chunks: - 1. the short-circuiting code itself (which has been -1'd) - 2. the priority-tracing code, which automatically enforces priority ordering for meta rules, so that meta rules always run after their dependencies; and removal of META_TEST_MIN_PRIORITY special-casing, which then becomes superfluous. The latter code is a dependency of the former, but is independent, and would be very useful to get into trunk anyway, to get rid if the META_TEST_MIN_PRIORITY hack. This patch does that.
Created attachment 3571 [details] current short-circuiting patch, against svn trunk+3570 In case anyone's running the code, here's the current state of the short-circuiting patch, against svn trunk with patch 3570 applied.
no comments => lazy consensus, so I've applied patch 3570 (the priority inference code) to svn trunk. Sending MANIFEST Sending lib/Mail/SpamAssassin/Conf/Parser.pm Sending lib/Mail/SpamAssassin/Conf.pm Sending lib/Mail/SpamAssassin/Constants.pm Sending lib/Mail/SpamAssassin/PerMsgStatus.pm Adding t/priorities.t Transmitting file data ...... Committed revision 421868.
/me drowns in massive conflicts Come on man, there is still a veto here. I'm staying on record that I believe this is the wrong approach.
I'm removing my veto. Its not worth the time, apply and move forward. Its too bad real world priorities must take precendence.
Created attachment 3591 [details] new shortcircuit patch, against trunk Here is a new, less-intrusive short-circuiting patch against trunk. It moves the implementation to a plugin, Mail::SpamAssassin::Plugin::Shortcircuit, which implements the s/c logic. This is made possible with a new plugin API and a flag-setting API on the PerMsgStatus object. It is still necessary to dot the PerMsgStatus code with calls like: return if (exists $self->{shortcircuit_flag}); I can't see a way to do it efficiently otherwise, unfortunately. Other than that, though, the logic is all now in a plugin; in particular, deciding *when* that flag is set all takes place there.
> Here is a new, less-intrusive short-circuiting patch against trunk. > > It moves the implementation to a plugin, YAY!!!! > Mail::SpamAssassin::Plugin::Shortcircuit, which implements the s/c logic. This > is made possible with a new plugin API and a flag-setting API on the > PerMsgStatus object. > Why not pass all the other available values into handle_hit? > It is still necessary to dot the PerMsgStatus code with calls like: > > return if (exists $self->{shortcircuit_flag}); Why not make this into some sort of plugin call? > > I can't see a way to do it efficiently otherwise, unfortunately. Other > than that, though, the logic is all now in a plugin; in particular, > deciding *when* that flag is set all takes place there. > I think that the changes are too specific, and I'd like to make it even more generic. It would be easy to give this lots more love with a Check.pm plugin BTW :) Also, I strongly, in the -1 veto type of disagreement, with having shortcircuit turned on by default.
> > Here is a new, less-intrusive short-circuiting patch against trunk. It > > moves the implementation to a plugin, > > YAY!!!! good to hear ;) > > Mail::SpamAssassin::Plugin::Shortcircuit, which implements the s/c logic. > > This is made possible with a new plugin API and a flag-setting API on the > > PerMsgStatus object. > > Why not pass all the other available values into handle_hit? I should have mentioned that. To be honest, I consider those ($area and $desc) to be legacy -- something that we shouldn't be passing around anyway, and I'd prefer to get rid of them from that arg list sometime soon if poss. Given that, I didn't want to bake them into a new plugin API. > > It is still necessary to dot the PerMsgStatus code with calls like: > > > > return if (exists $self->{shortcircuit_flag}); > > Why not make this into some sort of plugin call? I was considering it, but wasn't sure about the performance impact of making the calls through the plugin infrastructure vs calling exists on a member var. Let me look into that.... > > I can't see a way to do it efficiently otherwise, unfortunately. Other > > than that, though, the logic is all now in a plugin; in particular, > > deciding *when* that flag is set all takes place there. > > I think that the changes are too specific, and I'd like to make it even more > generic. It would be easy to give this lots more love with a Check.pm plugin > BTW :) Bring it on, I'm dying to see this plugin ;) what do you mean by 'the changes are too specific' btw? this new set of changes to PerMsgStatus? suggestions would be welcome, I'm happy to compromise. > Also, I strongly, in the -1 veto type of disagreement, with having > shortcircuit turned on by default. OK, I can go for that.
'> Why not make this into some sort of plugin call? I was considering it, but wasn't sure about the performance impact of making the calls through the plugin infrastructure vs calling exists on a member var. Let me look into that....' fwiw, that's not a problem -- I'll switch it to a call_plugin() call in next rev. exists: 0.20 0.049 0.049 20250 0.0000 0.0000 Mail::SpamAssassin::PerMsgStatus:: have_shortcircuited plugin: 0.39 - 0.081 20250 - 0.0000 Mail::SpamAssassin::PerMsgStatus:: have_shortcircuited moving from 0.20% to 0.39% of the runtime, while a near-doubling, is still far down the list of CPU hogs, below even Mail::SpamAssassin::PerMsgStatus::check_from_in_default_whitelist(). I think that's pretty safe.
Created attachment 3598 [details] new new patch against trunk hey -- here's that next rev. changes: - Mail::SpamAssassin::PerMsgStatus::have_shortcircuited() is now a call_plugin() call, as suggested. - shortcircuit is off by default, as requested. - something had been annoying me about the previous patch -- namely the creation of a new (duplicate) plugin hook with one new argument, "handle_hit", just to avoid having to make changes in the PerMsgStatus code. So I bit the bullet, and consolidated the two post-rule-fire plugin APIs (hit_rule and handle_hit) into one (the existing API, hit_rule). This required a little more work, since moving the call_plugins("hit_rule") calls into the got_hit() method required that method to be extended to support the additional data required (the "ruletype" string). Using perl-style name=>value pairs for the additional optional arguments, and keeping compatibility with the existing pre-3.2.0 argument set, this was doable in a backward-compatible way. - Bonus: this allows Plugin/AWL.pm to be fixed to use the public PerMsgStatus::got_hit() API, instead of the private _handle_hit() API (arguably a bugfix imo). - Bonus 2: got_eval_hit() and got_pattern_hit() became one-line wrappers around got_hit(); refactored out for efficiency and clarity. - I added POD doc for got_hit(), while I was at it, since it's an API callable (and called) by plugins, it needs documentation. - the 'tflags multiple' code offered an optimisation, while I was at it; instead of looking up tflags at scan-time to check if a rule should exit or look for multiple hits, it should do that only at eval-build time, in the hit_rule_plugin_code() code-generation call.
btw, how are we in vetoes with this patch? I've got a batch of unrelated PerMsgStatus optimisations I'd like to apply...
ping. let me know if you'd prefer this stayed out of trunk. If there's no comment, I'll apply it in a day or two, since it's pretty modular now -- and hand-merging around it is getting tricky.
Can you give me a week? I'm gonna have some time to look at it and the Check/Rule plugin stuff over the next week. It will highly conflict with those changes as well. I might be able to incorporate much of the code into what I've already moved out of PerMsgStatus (right now PMS has no do_*_tests). After a week, lets look at it again and see where we are. Thanks.
ok -- I'll be happy to do the merging btw.
> Can you give me a week? I'm gonna have some time to look at it and the > Check/Rule plugin stuff over the next week. It will highly conflict with those > changes as well. I might be able to incorporate much of the code into what I've > already moved out of PerMsgStatus (right now PMS has no do_*_tests). > After a week, lets look at it again and see where we are. ping -- how's that going?
/me sighs Time I did not get. I won't hold up any progress. I'm still a little uncomfortable with the implementation but am not in a position to provide a better answer for the moment. So, unless you want to hold up on this thing for a month or so until I can come free from my current project, you might as well go ahead with things. I imagine, if things ever happen with Check.pm and the Rule plugins, that things will change around a bit anyway.
ok, sorry to hear it :( I'll go ahead and check in. Let me know if you need a hand with Check.pm and the rule-type plugin stuff in the meantime, btw. I'm increasingly thinking that'd be good to get in there.
ok, this is now checked in! r428859.
*** Bug 3407 has been marked as a duplicate of this bug. ***
There seems to be a bug in the shortcircuit implementation in 3.2.5 -- the "spam" and "ham" classifications skip remaining tests, but don't adjust the score (as they should). For clarification, as an example: on my system, following the defaults, the spam threshold is 5, BAYES_99 is set to give a score of 3.5, and BAYES_99 is also set to shortcircuit as spam. Any mail that triggers BAYES_99 goes straight to the Inbox as ham, because although it shows shortcircuit=spam, it also shows a final score of 3.5, and therefore as not spam. The shortcircuit ham classification seems to have the same bug (it doesn't adjust the score, it just skips tests). The shortcircuit "on" classification seems to work fine.
Well, change the rule score, if that's what you want. in local.cf score BAYES_99 3.4 (or whatever score you want). As is, you've shortcircuited on a rule with a score below your spam threshold, so you get exactly what you've asked for. Shortciruciting does not, and really should not, radically alter the score defined for the rule. That's what score statements are for. Why would "shortciricut spam" imply "score 100", when you can just declare whatever score you want? It seems substantially more flexible this way. Also, please be aware that BAYES_99 does have a distinctly non-zero false positive rate. (theoretically 1% of its hits are nonspam, but in practice it's more like 0.1%). You WILL get some nonspam messages tagged by this once in a while. Be sure you're really ready for the implications of what you're doing.
Matt: That's all good advice of itself, but the documentation on shortcircuit is explicit: Shortcircuit "on" is supposed to have the behavior you describe. Shortcircuit "spam" or "ham" is supposed to have the different behavior of also forcing a pre-set (very high or very low) score. The plugin is not complying with the documentation -- see, e.g., http://oldschoolpunx.net/phpMan.php/man/Mail::SpamAssassin::Plugin::Shortcircuit -- which I think constitutes a bug in and of itself. It also constitutes a bug in another way: the default spamassassin rules set BAYES_99 to 3.4 (or whatever), shortcircuit=spam, and threshold=5. So if someone turns shortcircuit on without changing the defaults, then it causes anything that hits BAYES_99 to be treated as HAM without exception -- the opposite of the intended behavior. Also, at least on my system, I can tolerate a non-zero false-positive rate, because I'm not deleting spam automatically, just routing it to a folder. So the default setting, shortcircuit BAYES_99 spam, is actually the behavior I want. To sum-up: All the points you made are good ones, but there is still a real bug here, I think.
> Shortcircuit "spam" or "ham" is supposed to have the different behavior of > also forcing a pre-set (very high or very low) score. Do you have a custom value for shortcircuit_spam_score? Anyway... > To sum-up: All the points you made are good ones, but there is still a > real bug here, I think. Amos, if there is a bug, please file a NEW one. Add all the investigations you found there, again. *This* bug is a feature request for implementing a shortcircuiting plugin -- which is RESOLVED FIXED.