SA Bugzilla – Bug 5590
Scantime is very long on certain messages unless "use bytes" hack is used
Last modified: 2015-04-09 00:22:38 UTC
# spamd -V
SpamAssassin Server version 3.2.3
  running on Perl 5.8.8
  with zlib support (Compress::Zlib 2.004)

# uname -a
SunOS ornl71 5.9 Generic_118558-39 sun4u sparc SUNW,UltraAX-i2

Problem has been present since SA v3.2.1. Some messages take a very long time to scan (on the order of several minutes). This time is reduced substantially if "use bytes" is added to Message.pm.

For a test case, I did the following to minimize the effects of network lookups and local rules:
- Added -L and -D to spamd startup
- Turned off Bayes, AWL, razor, pyzor, DCC
- Set dns_available to "no"
- Removed SARE and locally-defined rules
- Scanned the test message with spamc

Results:
With "use bytes" hack: 4 seconds
Without the hack: 23 seconds

Adding in SARE and local rules makes the gap even wider:
With "use bytes" hack: 16 seconds
Without the hack: 122 seconds
Created attachment 4082 [details] Test case
Created attachment 4083 [details] spamd debug log from test case
Confirmed my test box can replicate the results using a crude:

  time spamassassin -t < testcase.eml

However, in mine the difference when using a "stock" 3.2.3 is barely noticeable, going from 9 seconds to 8 seconds. Adding in a good handful of SARE rules (1365 extra rules, counting "score" lines) makes the difference quite significant.

Without "use bytes", with the SARE rules:
real 0m21.626s
user 0m20.873s
sys  0m0.264s

With "use bytes" added to Message.pm, and the SARE rules:
real 0m9.012s
user 0m6.468s
sys  0m0.240s
Side note: Mark Martinec suggested this might be due to UTF-8 encoding in the locale. While my test system did have en_US.UTF-8 as the LANG, resetting /etc/sysconfig/i18n to use "en_US" and rebooting did not change the results in any significant way.
fwiw, I would just change your LANG environment variable.
This problem still exists in v3.2.4.
I just tested with SA v3.3.0rc1, Solaris 10 Sparc, Perl 5.10.1:

spam5# uname -a
SunOS spam5 5.10 Generic_141444-09 sun4u sparc SUNW,Sun-Fire-V240
spam5# spamd -V
SpamAssassin Server version 3.3.0-rc1
  running on Perl 5.10.1
  with zlib support (Compress::Zlib 2.02)

It takes an average of 1.8 times as long to scan a message if I don't use the "use bytes" kludge.

Can anybody tell me why it is taking so long, and if there is another way to speed it up? What am I missing when I use the "use bytes" kludge?
(In reply to comment #7)
> It takes an average of 1.8 times as long to scan a message if I don't use the
> "use bytes" kludge.
>
> Can anybody tell me why it is taking so long, and if there is another way to
> speed it up? What am I missing when I use the "use bytes" kludge?

with "use bytes":

: 97...; LANG=en_US.UTF-8 time spamassassin -L -t < /home/jmason/DL/test.eml
4.79user 0.08system 0:05.32elapsed 91%CPU (0avgtext+0avgdata 0maxresident)k
56inputs+160outputs (7major+9251minor)pagefaults 0swaps

without:

: 104...; LANG=en_US.UTF-8 time spamassassin -L -t < /home/jmason/DL/test.eml
4.80user 0.04system 0:05.09elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+160outputs (0major+9258minor)pagefaults 0swaps

: 104...; perl -V
Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.6.24-23-server, archname=i486-linux-gnu-thread-multi
    uname='linux rothera 2.6.24-23-server #1 smp wed apr 1 22:22:14 utc 2009 i686 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i486-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.10 -Darchlib=/usr/lib/perl/5.10 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.10.0 -Dsitearch=/usr/local/lib/perl/5.10.0 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -Dlibperl=libperl.so.5.10.0 -Dd_dosuid -des'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -g',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include'
    ccversion='', gccversion='4.3.3', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib /usr/lib64
    libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
    perllibs=-ldl -lm -lpthread -lc -lcrypt
    libc=/lib/libc-2.9.so, so=so, useshrplib=true, libperl=libperl.so.5.10.0
    gnulibc_version='2.9'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib'

Characteristics of this binary (from libperl):
  Compile-time options: MULTIPLICITY PERL_DONT_CREATE_GVSV PERL_IMPLICIT_CONTEXT
    PERL_MALLOC_WRAP USE_ITHREADS USE_LARGE_FILES USE_PERLIO USE_REENTRANT_API
  Built under linux
  Compiled at Jun 26 2009 18:23:00
  @INC:
    /etc/perl
    /usr/local/lib/perl/5.10.0
    /usr/local/share/perl/5.10.0
    /usr/lib/perl5
    /usr/share/perl5
    /usr/lib/perl/5.10
    /usr/share/perl/5.10
    /usr/local/lib/site_perl
    .
doh. ran the wrong spamassassin! happens to everyone ;)

without "use bytes":

: 106...; LANG=en_US.UTF-8 time ./spamassassin -L -t < /home/jmason/DL/test.eml
7.70user 0.08system 0:07.96elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+160outputs (0major+13479minor)pagefaults 0swaps

with:

4.05user 0.04system 0:04.29elapsed 95%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+160outputs (0major+13156minor)pagefaults 0swaps

ok, something's up there.
Can someone do a freqdiff between the two runs, especially with mail that has high-bit characters?
Took a 176 KB message that showed slow processing in the log, and ran it with and without 'use bytes' in Message.pm. Network tests disabled, some SARE rules, HitFreqsRuleTiming plugin enabled, perl 5.10.1.

Without 'use bytes' (as in 3.3.0-rc3), top-15 rules by elapsed time:

T __FILL_THIS_FORM_SHORT        3.218  3.218  1
T FILL_THIS_FORM_LONG           3.127  3.127  1
T __FILL_THIS_FORM_FRAUD_PHISH  1.501  1.501  1
T __FILL_THIS_FORM_LOAN         0.841  0.841  1
T SARE_SPEC_ROLEX_NOV5B         0.666  0.666  1
T __LOTSA_MONEY_03              0.467  0.467  1
T __LOTSA_MONEY_01              0.456  0.456  1
T __FRAUD_DBI                   0.431  0.431  1
T SARE_SPEC_ROLEX_NOV5A         0.426  0.426  1
T __LOTSA_MONEY_00              0.385  0.385  1
T __SARE_DEGREES                0.344  0.344  1
T __SARE_FRAUD_MONEY            0.304  0.304  1
T FUZZY_XPILL                   0.250  0.250  1
T __SARE_FRAUD_VIPS             0.239  0.239  1
T SOFTWARE_AD                   0.233  0.233  1

X-Spam-Timing: total 60780 ms - init: 5474 (9.0%), parse: 27 (0.0%),
  extract_message_metadata: 533 (0.9%), get_uri_detail_list: 287 (0.5%),
  tests_pri_-1000: 21 (0.0%), compile_gen: 1201 (2.0%), compile_eval: 47 (0.1%),
  tests_pri_-950: 11 (0.0%), tests_pri_-900: 12 (0.0%), tests_pri_-400: 966 (1.6%),
  check_bayes: 954 (1.6%), tests_pri_0: 52742 (86.8%), tests_pri_500: 334 (0.5%),
  tests_pri_899: 284 (0.5%), check_crm114: 271 (0.4%), tests_pri_1000: 69 (0.1%),
  total_awl: 56 (0.1%), check_awl: 5 (0.0%), update_awl: 13 (0.0%),
  rewrite_mail: 0.00 (0.0%)

==> most notably: tests_pri_0: 52.742 seconds

Now, adding 'use bytes':

T __FILL_THIS_FORM_SHORT        1.310  1.310  1
T FILL_THIS_FORM_LONG           1.260  1.260  1
T __FILL_THIS_FORM_FRAUD_PHISH  0.541  0.541  1
T __FILL_THIS_FORM_LOAN         0.382  0.382  1
T SARE_SPEC_ROLEX_NOV5B         0.160  0.160  1
T __LOTSA_MONEY_03              0.149  0.149  1
T __LOTSA_MONEY_01              0.145  0.145  1
T SARE_SPEC_ROLEX_NOV5A         0.136  0.136  1
T __LOTSA_MONEY_00              0.127  0.127  1
T __FILL_THIS_FORM_PARTIAL_RAW  0.123  0.123  1
T __FRAUD_DBI                   0.108  0.108  1
T __SARE_FRAUD_MONEY            0.096  0.096  1
T __SARE_DEGREES                0.095  0.095  1
T __SARE_FRAUD_TRUST            0.087  0.087  1
T SARE_MONEYTERMS               0.083  0.083  1

X-Spam-Timing: total 26186 ms - init: 5552 (21.2%), parse: 27 (0.1%),
  extract_message_metadata: 506 (1.9%), get_uri_detail_list: 285 (1.1%),
  tests_pri_-1000: 21 (0.1%), compile_gen: 1149 (4.4%), compile_eval: 51 (0.2%),
  tests_pri_-950: 10 (0.0%), tests_pri_-900: 11 (0.0%), tests_pri_-400: 917 (3.5%),
  check_bayes: 906 (3.5%), tests_pri_0: 18131 (69.2%), tests_pri_500: 348 (1.3%),
  tests_pri_899: 285 (1.1%), check_crm114: 265 (1.0%), tests_pri_1000: 76 (0.3%),
  total_awl: 56 (0.2%), check_awl: 6 (0.0%), update_awl: 3 (0.0%),
  rewrite_mail: 0.00 (0.0%)

==> most notably: tests_pri_0: 18.131 seconds

A factor of 2.9 reduction in rule processing time! (52.742/18.131)
I repeated the same test with 'sa-compile'-d rules, this time with HitFreqsRuleTiming, debugging and bayes all disabled. The shown tests_pri_0 time corresponds practically entirely to regexp rules (net tests and bayes being disabled here).

no sa-compile, no 'use bytes':    tests_pri_0: 52217 ms
no sa-compile, with 'use bytes':  tests_pri_0: 17174 ms
sa-compile, no 'use bytes':       tests_pri_0: 25134 ms
sa-compile, with 'use bytes':     tests_pri_0: 11070 ms

The message is just an ordinary newsletter, multipart/alternative (plain text and HTML), and contains 8-bit UTF-8 characters.
The attached Test case takes ~2 seconds with Fedora 12 perl-5.10.0 here, and ~12 seconds with RHEL-5 perl-5.8.8. Could someone please attach an example mail that causes an even larger delay?
Created attachment 4644 [details]
my sample message, gzipped

Attached is my newsletter sample as used in comments #11 and #12. More slow samples are attached to Bug 6238. Granted, the more troublesome rules are some of the SARE rules, but the problem is not limited to them.
I'm all for fixing this in 3.3.0. Why do we want to release a major version knowing that it could be much faster in certain cases? I frequently get messages that take 4+ minutes to scan on my 1 GHz SPARC; "use bytes" seems to bring that down to the 1-minute range. Tested one message and it was 3 vs 10 seconds on a 3 GHz AMD, though... gotta look into it if I have time.
Um, why did we remove "use bytes;" in the first place? I don't remember completely. I think there's a bug on it, or at least a thread about it somewhere... there was quite a bit of discussion when it was done, IIRC.
There is no doubt that this makes it faster, but I don't see any discussion here verifying that it results in correct behavior. I personally am OK with a further delay of the 3.3.0 release for this issue *IF* it can be demonstrated that it results in functional equivalence for perl-5.8.8 and perl-5.10.1. But if that cannot be demonstrated by Sunday, I propose we cut the 3.3.0 release and call for a vote at that time. We already have a high degree of confidence that the current code is good, and this particular issue would not be a regression.
Some history: http://svn.apache.org/viewvc?view=revision&revision=315047 And Bug 4596
Here's a quick run of 10k ham + 10k spam on perl 5.10.0.

$ ./freqdiff -c ham.log.nobytes ham.log.bytes
 518 __HIGHBITS
  64 T_HK_MUCHMONEY
  52 __hk_million
  39 __F_TEEN
  24 __hk_bigmoney
  18 T_LOTS_OF_MONEY
  18 __LOTSA_MONEY_01
   2 __FRT_GOLD
   1 DRUGS_MUSCLE
   1 T_FRT_LOLITA1
   1 __DRUGS_MUSCLE1
   1 __LOTSA_MONEY_00
   1 __OWN_LAND
  -1 FRT_SOMA2
  -2 T_TVD_LONG_WORD5

$ ./freqdiff -c spam.log.nobytes spam.log.bytes
  28 __HIGHBITS
   1 __FB_MA
   1 __OWN_LAND
  -1 T_FRT_EXPERIENCE
  -1 __AFRICAN_STATE
  -1 __IVORY_COAST
  -1 __hk_bigmoney
  -2 T_FRT_ADULT2
  -3 FUZZY_CREDIT
Surprisingly, total scan time was only 10 min vs 13 min in favor of "use bytes".
btw, I think this significant slowdown in 3.3.0 may be a result of the increased use of replace_rules rules, compared to when bug 4596 happened.
(In reply to comment #21)
> btw, I think this significant slowdown in 3.3.0 may be as a result of
> increased use of replace_rules rules, compared to when bug 4596 happened.

Are you suggesting replace_rules itself has an impact on the performance? I doubt that, as replace_rules only comes into play at compile time.

I will, however, grant that replace_rules makes it easier to write very complex (and thus slow) rules...
(In reply to comment #22)
> Are you suggesting replace_rules itself has an impact on the performance? I
> doubt that, as replace_rules only comes into play at compile time.
>
> I will, however, grant that replace_rules makes it easier to write very
> complex (and thus slow) rules...

the latter -- I think some aspect of those complexities is performing very badly without "use bytes", compared to with.

fwiw, I did a mass-check with and without "use bytes" in Message.pm, using perl 5.8.8, against 5000 ham and 5000 spam:

: 36...; wc -l log.trunk/*.log
  5002 log.trunk/ham.log
  5003 log.trunk/spam.log
 10005 total
: 37...; wc -l log.bytes/*.log
  5002 log.bytes/ham.log
  5003 log.bytes/spam.log
 10005 total
: 31...; ./freqdiff -c log.trunk/ham.log log.bytes/ham.log
   4 __HIGHBITS
: 32...; ./freqdiff -c log.trunk/spam.log log.bytes/spam.log
  -2 __TVD_SPACE_RATIO

so that's a barely-noticeable difference of 4 mails (out of 5000 hams) more __HIGHBITS hits, and 2 fewer __TVD_SPACE_RATIO hits in spam. no scoring rules were affected.

speed was not noticeably affected, either; ~50 mins with trunk, 48 with "use bytes". locale was set to LANG=C.
What is the exact patch to test "use bytes"? My corpus has lots of Japanese mail and I could run masscheck. But given 48 minutes vs 50 minutes, is this really worth the risk of a last-minute change before release? We really should decide quickly and cut the 3.3.0.
Note that I've seen differences of between 18 and 28 percent on a mass-check of around 50k spam and ham. I was seeing similar differences in my freqdiff; HIGHBITS was the highest difference, then some other length-based rules. I assume that is a side effect of characters being counted as length 2 vs 1 with "use bytes". For me, I was just going to revert the "use bytes" change in Message.pm, but then I didn't know how it would affect the other changes John made. I also note that with "use bytes" added back to Message.pm, t/html_utf8.t now fails for me on perl 5.8.5 and FreeBSD.
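The length-2-vs-1 effect is easy to demonstrate outside of Perl. Here is a minimal Python sketch (Python is used purely for illustration; SpamAssassin itself is Perl, where "use bytes" switches length and regex semantics to octets) showing how the same text measures differently under character vs. byte semantics, which is how length-sensitive rules such as __HIGHBITS can hit differently between the two runs:

```python
# -*- coding: utf-8 -*-
# Character semantics (Perl without "use bytes") counts code points;
# byte semantics ("use bytes") counts the underlying UTF-8 octets.
text = "naïve café"                    # contains two accented characters

as_chars = len(text)                    # 10 code points
as_bytes = len(text.encode("utf-8"))    # 12 octets: ï and é are 2 bytes each

print(as_chars, as_bytes)               # 10 12
```

Any rule that thresholds on the count of high-bit characters relative to message length will therefore see different ratios depending on which semantics are in force.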
Looking at mass-check runtimes, it's probably worth exploring only after 3.3.0. I'll be enabling "use bytes" on my slow server, though, since I get many bad messages and the effect is much more visible there.
Even if the rule hits are roughly equivalent:

* Is this functionally equivalent for the tokens going in/out of Bayes?
* What about those reasons why this was removed years ago?

Please consider safety for this 3.3.0 release. I think we should aggressively recruit more masscheck participants and push for a good 3.3.1 release within 3-6 months where we can adequately test changes like "use bytes".
Not surprisingly, it affects Bayes, but only as slightly as the rules. Probably tokens containing high-bit characters, etc. It's simple to test with sa-learn and by comparing dumps.
(In reply to comment #24)
> What is the exact patch to test "use bytes"? My corpus has lots of Japanese
> mail and I could run masscheck.

The exact location is not critical. Something like this would do:

--- Message.pm~	2010-01-18 12:55:21.000000000 +0100
+++ Message.pm	2010-01-18 12:55:45.000000000 +0100
@@ -46,4 +46,5 @@
 use warnings;
 use re 'taint';
+use bytes;
 use Mail::SpamAssassin;

> But given 48 minutes vs 50 minutes is this really worthwhile of a risk to do
> last minute before release?

The change has little effect on the run times of the majority of messages, and hence on the average times. It seems that only a few messages are affected, but there the difference can be substantial, as reported earlier (factors like 5.7, 6.7, 1.9 and 2.9 were reported under this bug), which can have a large impact when a bunch of such messages comes in, each taking minutes to process.
(In reply to comment #23)
> so that's a barely-noticeable difference of 4 mails (out of 5000 hams) more
> __HIGHBITS hits, and 2 less __TVD_SPACE_RATIO hits in spam. no scoring rules
> were affected.
>
> locale was set to LANG=C.

btw, I repeated the "use bytes" mass-check with LANG=en_US.UTF-8 set; no differences at all in the freqdiff results.
Does everyone agree that this is isolated to only a small subset of messages, rather than affecting scan speed for all, or a majority of, messages? ("mass-check" is the appropriate tool to use to measure this.)

If this is the case, we should go ahead and cut 3.3.0 as it stands, and leave this as a known issue for a 3.3.1 investigation/fix.
(In reply to comment #28)
> Not surprisingly it affects Bayes, but only as slightly as the rules. Probably
> tokens containing highbits etc.

I would imagine that treating the multi-byte characters as individual bytes might bite us in ways similar to Bug 6183: various control characters, or characters considered non-word, occur as the second byte, screwing up tokenization.
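The tokenization hazard described above can be illustrated outside Perl, too. In UTF-8, the trailing octets of a multi-byte character fall in the 0x80-0xBF range, which are not word characters under byte semantics, so a word containing them gets split. A small Python sketch of the effect (illustrative only; this is not the actual Bayes tokenizer):

```python
import re

word = "résumé"

# Character semantics: \w matches the accented letters, so the word
# tokenizes as a single token.
char_tokens = re.findall(r"\w+", word)

# Byte semantics: scan the raw UTF-8 octets instead. For bytes patterns,
# \w is ASCII-only, so the 0xC3 0xA9 pairs encoding "é" split the word.
byte_tokens = re.findall(rb"\w+", word.encode("utf-8"))

print(char_tokens)   # ['résumé']
print(byte_tokens)   # [b'r', b'sum']
```

The same word thus produces entirely different token sets under the two regimes, which is consistent with the slightly different sa-learn dump counts reported later in this thread.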
(In reply to comment #31)
> Does everyone agree that this is only isolated to small subsets of messages,
> rather than affecting scan speed for all, or a majority of, messages?
>
> If this is the case, we should go ahead and cut 3.3.0 as it stands, and leave
> this as a known issue for a 3.3.1 investigation/fix.

+1 go ahead with the 3.3.0 cut. Target 3.3.1 in several months, which should be possible if we recruit more masscheck participants.
On behalf of my corpus, +1 for releasing 3.3.0 also.
Actually, if you look at Bayes.pm you'll notice that it already has "use bytes", so I'm not sure it will have any effect on Bayes tokenization.
(In reply to comment #35)
> Actually, if you look at Bayes.pm you'll notice that it already has use bytes
> so I'm not sure it will have any effect on bayes tokenization.

If it wasn't clear, I did test it:

sa-learn a folder
sa-learn --dump data > dump1
rm -rf bayes*
sa-learn the same folder
sa-learn --dump data > dump2
diff dump1 dump2

You will see the same tokens with slightly different counts, and a few different tokens. I don't think there is anything in Bayes that would make subsequent runs differ?
(In reply to comment #35)
> Actually, if you look at Bayes.pm you'll notice that it already has use bytes
> so I'm not sure it will have any effect on bayes tokenization.

OK, this is already an existing problem in Bayes then. Will investigate and file a separate bug for 3.3.1.
regarding Bayes tokenization: use of byte-level breaks there is an explicit choice, not a bug.
(In reply to comment #32)
> I would imagine that treating the multi-byte characters as individual bytes
> might bite us in ways similar to Bug 6183.

(In reply to comment #31)
> Does everyone agree that this is only isolated to small subsets of messages,
> rather than affecting scan speed for all, or a majority of, messages?

Yes, I believe this is the case. According to our timing stats charts, switching 'use bytes' on or off is hardly detectable on a plot (showing 5-minute average running times per message). The TIME_LIMIT_EXCEEDED rule (on a 40-second time limit) strikes about 5 times per day on a lightly loaded host (7k msgs/h), so at the expense of possibly passing through a potential spam due to reaching a time limit, it is acceptable as it stands now, at least for our site.

> If this is the case, we should go ahead and cut 3.3.0 as it stands, and
> leave this as a known issue for a 3.3.1 investigation/fix.

Agreed. Putting back the 'use bytes' would conceptually be a step backwards. The advantage of having a message represented as perl characters instead of bytes can be put to good use in several places. It is very unfortunate that Perl takes such a heavy performance hit on them. Until this gets targeted as a DoS hole, it need not be turned off.
let's not move bugs off the 3.3.0 milestone until that version is released. We have a protocol to indicate that a bug is a release blocker (P1) or not a blocker (P2..P5) -- let's stick with that system.
moving most remaining 3.3.0 bugs to 3.3.1 milestone
reassigning, too
moving all open 3.3.1 bugs to 3.3.2
Moving back off of Security, which got changed by accident during the mass Target Milestone move.
Note: "use bytes" makes the t/html_utf8.t test fail:

t/html_utf8.t ..................... 1/2
Not found: QUOTE_YOUR = QUOTE_YOUR at t/html_utf8.t line 27.
# Failed test 1 in t/SATest.pm at line 717
Output can be examined in: log/d.html_obfu/1
t/html_utf8.t ..................... Failed 1/2 subtests
(In reply to comment #39)
> > I would imagine that treating the multi-byte characters as individual bytes
> > might bite us in ways similar to Bug 6183.
> > Does everyone agree that this is only isolated to small subsets of messages,
> > rather than affecting scan speed for all, or a majority of, messages?
> Yes, I believe this is the case. [...]
> Agreed. Putting the the 'use bytes' would conceptually be a step backwards.
> [...] Until this gets targeted as a DoS hole, it need not be turned off.

There is no light decision on this. Setting the target to 'future'...
This remains an issue with 3.4.0:

May  1 19:12:53 edge01 amavis[30846]: (30846-02) TIMING-SA [total 2274 ms, cpu 1962 ms] -
  parse: 2.7 (0.1%), extract_message_metadata: 18 (0.8%), get_uri_detail_list: 17 (0.7%),
  tests_pri_-1000: 2.9 (0.1%), tests_pri_-950: 0.97 (0.0%), tests_pri_-900: 1.25 (0.1%),
  tests_pri_-400: 43 (1.9%), check_bayes: 42 (1.8%), b_tokenize: 20 (0.9%),
  b_tok_get_all: 10 (0.4%), b_comp_prob: 9 (0.4%), b_tok_touch_all: 0.22 (0.0%),
  b_finish: 0.65 (0.0%), tests_pri_0: 2109 (92.7%), check_spf: 0.24 (0.0%),
  check_razor2: 156 (6.8%), check_pyzor: 154 (6.8%), tests_pri_500: 2.5 (0.1%),
  learn: 80 (3.5%), b_learn: 69 (3.0%), b_count_change: 31 (1.3%), get_report: 0.69 (0.0%)

Total SA time was just over 2 seconds, and 92.7% of it (2109 ms) was due to tests_pri_0.

Generally, what I see is:

May  1 19:24:12 edge01 amavis[30886]: (30886-13) TIMING-SA [total 423 ms, cpu 169 ms] -
  parse: 1.53 (0.4%), extract_message_metadata: 2.3 (0.5%), get_uri_detail_list: 3.1 (0.7%),
  tests_pri_-1000: 3.8 (0.9%), tests_pri_-950: 1.48 (0.4%), tests_pri_-900: 1.68 (0.4%),
  tests_pri_-400: 17 (3.9%), check_bayes: 15 (3.6%), b_tokenize: 4.7 (1.1%),
  b_tok_get_all: 4.3 (1.0%), b_comp_prob: 3.7 (0.9%), b_tok_touch_all: 0.11 (0.0%),
  b_finish: 0.73 (0.2%), tests_pri_0: 351 (83.1%), check_spf: 0.63 (0.1%),
  check_razor2: 143 (33.8%), check_pyzor: 144 (34.0%), tests_pri_500: 2.8 (0.7%),
  learn: 33 (7.9%), b_learn: 29 (6.9%), b_count_change: 12 (2.7%), get_report: 0.49 (0.1%)

Even here, tests_pri_0 is the majority of the time (83.1%, 351 ms of 423 ms). I.e., even for "fast" emails, tests_pri_0 is the bottleneck.
Thanks to the amavis-logwatch script, I have some nice timing data for rules processing at my site as well:

=====================================================================================================================
SA Timing Percentiles
                                  % Time   Total (ms)       0%       5%      25%      50%      75%      95%     100%
---------------------------------------------------------------------------------------------------------------------
tests_pri_0 (7650)                84.64%   5165461.00    88.00   233.00   338.00   402.00   534.00  1759.40  24756.00
check_pyzor (7650)                28.59%   1744883.00    31.00   139.00   145.00   151.00   158.00   218.00  19208.00
check_razor2 (7650)               19.22%   1173107.00    24.00    27.00    94.00   132.00   162.00   209.00  14023.00
learn (3985)                       5.91%    360940.90     3.40     7.00    28.00    38.00    60.00   248.80   3674.00
b_learn (3985)                     5.42%    330508.35     1.49     2.20    25.00    34.00    52.00   204.00   3481.00
tests_pri_-400 (7650)              4.66%    284643.00     7.00    12.00    17.00    22.00    34.00    87.00  14435.00
check_bayes (7650)                 4.49%    273827.00     7.00    11.00    16.00    21.00    32.00    84.00  14434.00
extract_message_metadata (7650)    1.74%    106200.12     0.93     1.99     3.90     8.00    14.00    45.00    528.00
b_tokenize (7650)                  1.61%     98333.80     1.85     3.20     4.80     7.00    12.00    33.00    531.00
b_count_change (3605)              1.40%     85147.90     3.20     8.00    11.00    16.00    24.00    47.00   2919.00
get_uri_detail_list (7650)         1.00%     60895.61     0.08     0.64     1.76     2.90     7.00    23.00    509.00
b_tok_get_all (7650)               1.00%     60866.52     1.51     3.00     4.30     6.00     8.00    19.00    699.00
b_comp_prob (7650)                 0.79%     48176.27     1.16     2.20     3.60     4.70     7.00    14.00    123.00
b_finish (7650)                    0.52%     31588.31     0.24     0.46     0.52     0.56     0.63     0.96  14408.00
parse (7650)                       0.42%     25687.28     0.73     1.20     1.79     2.40     3.50     9.00     40.00
tests_pri_-1000 (7650)             0.41%     25285.90     1.95     2.20     2.60     2.80     3.10     6.00     42.00
tests_pri_500 (7650)               0.41%     24944.14     1.66     2.20     2.50     2.90     3.20     6.00     31.00
expire_bayes (2)                   0.40%     24705.00 10441.00 10632.15 11396.75 12352.50 13308.25 14072.85  14264.00
tests_pri_-900 (7650)              0.18%     11028.16     0.93     1.02     1.16     1.21     1.29     1.84     40.00
tests_pri_-950 (7650)              0.16%     10020.39     0.77     0.86     0.98     1.05     1.36     1.73     22.00
get_report (7650)                  0.09%      5729.38     0.25     0.41     0.48     0.56     0.69     1.48     25.00
check_spf (7650)                   0.05%      3037.06     0.18     0.21     0.24     0.27     0.46     0.65     27.00
b_tok_touch_all (7650)             0.02%      1254.65     0.04     0.06     0.09     0.13     0.19     0.25     14.00
---------------------------------------------------------------------------------------------------------------------
Total                                       6103202.00   118.00   293.00   395.00   475.00   667.00  2086.70  25856.00
=====================================================================================================================

We can see that tests_pri_0 eats up an inordinate amount of time comparatively.
what plugins are loaded?
is sa-compile run after sa-update?
what version of re2c is used?
what custom rules are loaded, if any?
(In reply to Benny Pedersen from comment #49)
> what plugins are loaded ?
Whatever the defaults are for 3.4.0

> is sa-compile runned after sa-update ?
No

> what version of re2c is used ?
N/A

> what custom rules are loaded if any ?

header   __FROM_FACEBOOK      Return-Path:addr =~ /no-reply\@facebook\.com/i
meta     __FORGED_SENDER      (!SPF_PASS && !DKIM_VALID_AU)
meta     FORGED_FACEBOOK_FROM (__FROM_FACEBOOK && __FORGED_SENDER)
score    FORGED_FACEBOOK      5 5 5 5

header   DSPAM_SPAM           X-DSPAM-Result =~ /^Spam$/
describe DSPAM_SPAM           DSPAM claims it is spam
score    DSPAM_SPAM           1.5

header   DSPAM_HAM            X-DSPAM-Result =~ /^Innocent$/
describe DSPAM_HAM            DSPAM claims it is ham
score    DSPAM_HAM            -0.5

header   PROOFPOINT_SPAM      X-Proofpoint-Spam-Details =~ /rule=tag_spam/
describe PROOFPOINT_SPAM      Proofpoint fsecure spam
score    PROOFPOINT_SPAM      3

header   PROOFPOINT_HAM       X-Proofpoint-Spam-Details =~ /rule=tag_notspam/
describe PROOFPOINT_HAM       Proofpoint fsecure non-spam
score    PROOFPOINT_HAM       -1.5

header   LAB_NETWORK          Received =~ /\[X.X.X.X\]/
describe LAB_NETWORK          Always allow lab systems
score    LAB_NETWORK          -99

header   LOGGER_REPORT        Received =~ /\[X.X.X.X\]/
describe LOGGER_REPORT        Always allow logger report
score    LOGGER_REPORT        -99

header   FROM_PERFORCE        exists:X-From-Perforce
describe FROM_PERFORCE        Email from Perforce check-ins
score    FROM_PERFORCE        -5

header   BARRACUDA_SPAM       X-Barracuda-Spam-Flag =~ /^Yes,/
describe BARRACUDA_SPAM       Barracuda Spam Firewall claims it is spam
score    BARRACUDA_SPAM       3.5

header   BARRACUDA_HAM        X-Barracuda-Spam-Status =~ /^No,/
describe BARRACUDA_HAM        Barracuda Spam Firewall claims it is not spam
score    BARRACUDA_HAM        -0.5

header   CMAE_SPAM            X-CMAE-Verdict =~ /^spam/
describe CMAE_SPAM            CMAE Spam Firewall claims it is spam
score    CMAE_SPAM            105

header   SPAMGUARD_SPAM       exists:X-Rocket-Spam
describe SPAMGUARD_SPAM       SpamGuard claims it is spam.
score    SPAMGUARD_SPAM       3.5

header   SPAMGUARD_SPAM2      exists:X-Spam-Track
describe SPAMGUARD_SPAM2      SpamGuard claims it is spam.
score    SPAMGUARD_SPAM       3.5
(In reply to Quanah Gibson-Mount from comment #48)
> We can see that test_pri_0 eats up an inordinate amount of time
> comparatively.

Just to be clear, you know that tests_pri_0 is a group of rules, and a large number of rules fall under that priority level?
(In reply to Kevin A. McGrail from comment #51) > (In reply to Quanah Gibson-Mount from comment #48) > > We can see that test_pri_0 eats up an inordinate amount of time > > comparatively. > > Just to be clear, you know that test_pri_0 is a group of rules and a large > number of rules fall under that priority level? Heh, no, that's good to know. ;) So I guess a lot of time spent in it may be expected.
(In reply to Quanah Gibson-Mount from comment #52)
> Heh, no, that's good to know. ;) So I guess a lot of time spent in it may
> be expected.

this looks more like something you could keep an eye on:

expire_bayes (2)  0.40%  24705.00  10441.00  10632.15  11396.75  12352.50  13308.25  14072.85  14264.00

I'm guessing that you have bayes expiry set to auto. If yes, it may be wise to run bayes expiration via cron, during a low-traffic period (Sunday 03:00).
(In reply to AXB from comment #53) > this looks more like something you could keep an eye on: > > expire_bayes (2) 0.40% 24705.00 10441.00 10632.15 > 11396.75 12352.50 13308.25 14072.85 14264.00 > > I'm guessing that you have bayes expire set to auto. > If yes, it may be wise to run bayes expiration via cron, during alow > traffic period (Sunday 03:00) Ok, good info, thanks!
http://www.zimbra.com/forums/administrators/30587-solved-you-aware-you-can-compile-your-sa-rules.html

If still not satisfied, post the output of:

spamassassin -D --lint 2>&1
btw, check that loadplugin lines are NOT in .cf files; move them to .pre files.
Please see Bug 7091 and the further tickets referenced there. Since 3.4.1, text being processed by rules and plugins should not have the utf8 flag enabled, i.e. it is either transcoded to UTF-8 octets (with normalize_charset enabled), or left as the original bytes. There was a bug in HTML decoding which could result in Unicode text (utf8 flag on), which would slow down processing. I hope to get rid of most of the 'use bytes' in modules eventually, as this pragma makes no difference when we can ensure that both text and rules are in octets (utf8 flag off).
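The invariant described here (keep the utf8 flag off everywhere, i.e. always match octets against octets, or always characters against characters) has a close analogy in other languages. A hedged Python sketch of the same discipline, shown only to illustrate the two consistent regimes; the SpamAssassin code itself is Perl:

```python
import re

raw = "Ça coûte 100€".encode("utf-8")   # message body as UTF-8 octets

# Regime 1: byte semantics throughout -- pattern and subject are both
# octet strings (the analog of scanning with the utf8 flag off).
assert re.search(rb"co\xc3\xbbte", raw)

# Regime 2: character semantics throughout -- decode once, then match text.
text = raw.decode("utf-8")
assert re.search(r"co\u00fbte", text)

# Python refuses to mix the two regimes outright; Perl instead silently
# upgrades strings to character semantics, which is roughly where the
# slowdown discussed in this bug hides.
try:
    re.search(r"coûte", raw)
except TypeError as e:
    print("mixed semantics rejected:", e)
```

Either regime is internally consistent; the cost arises only when the two are mixed, which is why removing 'use bytes' is safe once both text and rules are guaranteed to be octets.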
I'm tentatively closing this issue. There has been extensive work in 3.4.1 to stay consistent with byte semantics (e.g. using UTF-8 encodings) in decoding, HTML decoding, Bayes tokenization, etc., so the original concern has either been resolved or avoided by now. In the future the remaining 'use bytes' should probably be removed, as they are unnecessary when we can ensure that both the text and the regular expressions use byte semantics (i.e. have the utf8 flag off), which was not always the case in the past. Alternatively, we could start considering a switch to character semantics throughout. In any case, it's probably best to open a new ticket when the time comes.