Bug 5590 - Scantime is very long on certain messages unless "use bytes" hack is used
Summary: Scantime is very long on certain messages unless "use bytes" hack is used
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Libraries (show other bugs)
Version: 3.2.3
Hardware: All All
: P4 normal
Target Milestone: Future
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-08-10 11:01 UTC by Larry Rosenbaum
Modified: 2015-04-09 00:22 UTC (History)
10 users (show)



Attachment Type Modified Status Actions Submitter/CLA Status
Test case text/plain None Larry Rosenbaum [NoCLA]
spamd debug log from test case text/plain None Larry Rosenbaum [NoCLA]
my sample message, gzipped application/octet-stream None Mark Martinec [HasCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Larry Rosenbaum 2007-08-10 11:01:29 UTC
# spamd -V
SpamAssassin Server version 3.2.3
  running on Perl 5.8.8
  with zlib support (Compress::Zlib 2.004)
# uname -a
SunOS ornl71 5.9 Generic_118558-39 sun4u sparc SUNW,UltraAX-i2
Problem has been present since SA v3.2.1

Some messages take a very long time to scan (on the order of several minutes). 
This time is reduced substantially if "use bytes" is added to Message.pm.  For 
a test case, I did the following to minimize the effects of network lookups 
and local rules:
Added -L and -D to spamd startup
Turned off Bayes, AWL, razor, pyzor, DCC.
Set dns_available to "no".
Removed SARE and locally-defined rules.
Scanned the test message with spamc.
Results
With "use bytes" hack:  4 seconds
Without the hack:      23 seconds
Adding in SARE and local rules makes the gap even wider:
With "use bytes" hack:  16 seconds
Without the hack:      122 seconds
Comment 1 Larry Rosenbaum 2007-08-10 11:02:39 UTC
Created attachment 4082 [details]
Test case
Comment 2 Larry Rosenbaum 2007-08-10 11:04:12 UTC
Created attachment 4083 [details]
spamd debug log from test case
Comment 3 Matt Kettler 2007-08-10 19:06:22 UTC
Confirmed my test box can replicate the results using a crude: time spamassassin
-t <testcase.eml

However, in mine the difference when using a "stock" 3.2.3 is barely noticeable,
going from 9 seconds to 8 seconds. 

Adding in a good handful of SARE rules (1365 extra rules, counting "score"
lines) makes the difference quite significant.

Without "use bytes" and the SARE rules:
real    0m21.626s
user    0m20.873s
sys     0m0.264s

With "use bytes" added to Message.pm, and the SARE rules:
real    0m9.012s
user    0m6.468s
sys     0m0.240s
Comment 4 Matt Kettler 2007-08-11 19:41:28 UTC
Side note: Mark Martinec suggested this might be due to UTF-8 encoding in the
locale. 

While my test system did have en_US.UTF-8 as the LANG, resetting
/etc/sysconfig/i18n to use "en_US" and rebooting did not change the results in
any significant way.
Comment 5 Theo Van Dinter 2007-08-11 20:18:50 UTC
fwiw, I would just change your LANG environment variable.
Comment 6 Larry Rosenbaum 2008-01-15 12:22:27 UTC
This problem still exists in v3.2.4.
Comment 7 Larry Rosenbaum 2009-12-30 10:59:32 UTC
I just tested with SA v3.3.0rc1, Solaris 10 Sparc, Perl 5.10.1:

spam5# uname -a
SunOS spam5 5.10 Generic_141444-09 sun4u sparc SUNW,Sun-Fire-V240
spam5# spamd -V
SpamAssassin Server version 3.3.0-rc1
  running on Perl 5.10.1
  with zlib support (Compress::Zlib 2.02)

It takes an average of 1.8 times as long to scan a message if I don't use the "use bytes" kludge.

Can anybody tell me why it is taking so long, and if there is another way to speed it up?  What am I missing when I use the "use bytes" kludge?
Comment 8 Justin Mason 2010-01-14 06:12:34 UTC
(In reply to comment #7)
> I just tested with SA v3.3.0rc1, Solaris 10 Sparc, Perl 5.10.1:
> 
> spam5# uname -a
> SunOS spam5 5.10 Generic_141444-09 sun4u sparc SUNW,Sun-Fire-V240
> spam5# spamd -V
> SpamAssassin Server version 3.3.0-rc1
>   running on Perl 5.10.1
>   with zlib support (Compress::Zlib 2.02)
> 
> It takes an average of 1.8 times as long to scan a message if I don't use the
> "use bytes" kludge.
> 
> Can anybody tell me why it is taking so long, and if there is another way to
> speed it up?  What am I missing when I use the "use bytes" kludge?

with "use bytes":

: 97...; LANG=en_US.UTF-8 time spamassassin -L -t < /home/jmason/DL/test.eml 
4.79user 0.08system 0:05.32elapsed 91%CPU (0avgtext+0avgdata 0maxresident)k
56inputs+160outputs (7major+9251minor)pagefaults 0swaps

without:

: 104...; LANG=en_US.UTF-8 time spamassassin -L -t < /home/jmason/DL/test.eml 
4.80user 0.04system 0:05.09elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+160outputs (0major+9258minor)pagefaults 0swaps

: 104...; perl -V
Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.6.24-23-server, archname=i486-linux-gnu-thread-multi
    uname='linux rothera 2.6.24-23-server #1 smp wed apr 1 22:22:14 utc 2009 i686 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i486-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.10 -Darchlib=/usr/lib/perl/5.10 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.10.0 -Dsitearch=/usr/local/lib/perl/5.10.0 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -Dlibperl=libperl.so.5.10.0 -Dd_dosuid -des'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -g',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include'
    ccversion='', gccversion='4.3.3', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib /usr/lib64
    libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
    perllibs=-ldl -lm -lpthread -lc -lcrypt
    libc=/lib/libc-2.9.so, so=so, useshrplib=true, libperl=libperl.so.5.10.0
    gnulibc_version='2.9'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib'


Characteristics of this binary (from libperl): 
  Compile-time options: MULTIPLICITY PERL_DONT_CREATE_GVSV
                        PERL_IMPLICIT_CONTEXT PERL_MALLOC_WRAP USE_ITHREADS
                        USE_LARGE_FILES USE_PERLIO USE_REENTRANT_API
  Built under linux
  Compiled at Jun 26 2009 18:23:00
  @INC:
    /etc/perl
    /usr/local/lib/perl/5.10.0
    /usr/local/share/perl/5.10.0
    /usr/lib/perl5
    /usr/share/perl5
    /usr/lib/perl/5.10
    /usr/share/perl/5.10
    /usr/local/lib/site_perl
    .
Comment 9 Justin Mason 2010-01-14 06:15:06 UTC
doh.  ran the wrong spamassassin!  happens to everyone ;)

without "use bytes":

: 106...; LANG=en_US.UTF-8 time ./spamassassin -L -t < /home/jmason/DL/test.eml 
7.70user 0.08system 0:07.96elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+160outputs (0major+13479minor)pagefaults 0swaps

with:

4.05user 0.04system 0:04.29elapsed 95%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+160outputs (0major+13156minor)pagefaults 0swaps

ok, something's up there.
Comment 10 Michael Parker 2010-01-14 07:32:47 UTC
Can someone do a freqdiff between two runs, especially with mail that has highbit characters.
Comment 11 Mark Martinec 2010-01-14 17:16:50 UTC
Took a 176 KB message that showed slow processing in the log, and run it
with and without 'use bytes' in Message.pm. Network tests disabled,
some SARE rules, HitFreqsRuleTiming plugin enabled, perl 5.10.1:

without 'use bytes' (as in 3.3.0-rc3), top-15 rules by elapsed time:

T         __FILL_THIS_FORM_SHORT    3.218    3.218    1
T            FILL_THIS_FORM_LONG    3.127    3.127    1
T   __FILL_THIS_FORM_FRAUD_PHISH    1.501    1.501    1
T          __FILL_THIS_FORM_LOAN    0.841    0.841    1
T          SARE_SPEC_ROLEX_NOV5B    0.666    0.666    1
T               __LOTSA_MONEY_03    0.467    0.467    1
T               __LOTSA_MONEY_01    0.456    0.456    1
T                    __FRAUD_DBI    0.431    0.431    1
T          SARE_SPEC_ROLEX_NOV5A    0.426    0.426    1
T               __LOTSA_MONEY_00    0.385    0.385    1
T                 __SARE_DEGREES    0.344    0.344    1
T             __SARE_FRAUD_MONEY    0.304    0.304    1
T                    FUZZY_XPILL    0.250    0.250    1
T              __SARE_FRAUD_VIPS    0.239    0.239    1
T                    SOFTWARE_AD    0.233    0.233    1

X-Spam-Timing: total 60780 ms - init: 5474 (9.0%), parse: 27 (0.0%),
  extract_message_metadata: 533 (0.9%), get_uri_detail_list: 287 (0.5%),
  tests_pri_-1000: 21 (0.0%), compile_gen: 1201 (2.0%), compile_eval: 47 (0.1%),
  tests_pri_-950: 11 (0.0%), tests_pri_-900: 12 (0.0%), tests_pri_-400: 966
  (1.6%), check_bayes: 954 (1.6%), tests_pri_0: 52742 (86.8%), tests_pri_500:
  334 (0.5%), tests_pri_899: 284 (0.5%), check_crm114: 271 (0.4%),
  tests_pri_1000: 69 (0.1%), total_awl: 56 (0.1%), check_awl: 5 (0.0%),
  update_awl: 13 (0.0%), rewrite_mail: 0.00 (0.0%)

==> most notably: tests_pri_0: 52.742 seconds


Now, adding 'use bytes':

T         __FILL_THIS_FORM_SHORT    1.310    1.310    1
T            FILL_THIS_FORM_LONG    1.260    1.260    1
T   __FILL_THIS_FORM_FRAUD_PHISH    0.541    0.541    1
T          __FILL_THIS_FORM_LOAN    0.382    0.382    1
T          SARE_SPEC_ROLEX_NOV5B    0.160    0.160    1
T               __LOTSA_MONEY_03    0.149    0.149    1
T               __LOTSA_MONEY_01    0.145    0.145    1
T          SARE_SPEC_ROLEX_NOV5A    0.136    0.136    1
T               __LOTSA_MONEY_00    0.127    0.127    1
T   __FILL_THIS_FORM_PARTIAL_RAW    0.123    0.123    1
T                    __FRAUD_DBI    0.108    0.108    1
T             __SARE_FRAUD_MONEY    0.096    0.096    1
T                 __SARE_DEGREES    0.095    0.095    1
T             __SARE_FRAUD_TRUST    0.087    0.087    1
T                SARE_MONEYTERMS    0.083    0.083    1

X-Spam-Timing: total 26186 ms - init: 5552 (21.2%), parse: 27 (0.1%),
  extract_message_metadata: 506 (1.9%), get_uri_detail_list: 285 (1.1%),
  tests_pri_-1000: 21 (0.1%), compile_gen: 1149 (4.4%), compile_eval: 51 (0.2%),
  tests_pri_-950: 10 (0.0%), tests_pri_-900: 11 (0.0%), tests_pri_-400: 917
  (3.5%), check_bayes: 906 (3.5%), tests_pri_0: 18131 (69.2%), tests_pri_500:
  348 (1.3%), tests_pri_899: 285 (1.1%), check_crm114: 265 (1.0%),
  tests_pri_1000: 76 (0.3%), total_awl: 56 (0.2%), check_awl: 6 (0.0%),
  update_awl: 3 (0.0%), rewrite_mail: 0.00 (0.0%)

==> most notably: tests_pri_0: 18.131 seconds


A factor of 2.9 reduction in rules pocessing time!  (52.742/18.131)
Comment 12 Mark Martinec 2010-01-14 17:56:39 UTC
I repeated the same test with 'sa-compile'-d rules, this time with
disabled HitFreqsRuleTiming, disabled debugging and disabled bayes.
The shown tests_pri_0 time corresponds practically entirely to regexp
rules (here with net tests and bayes disabled).

no sa-compile, not   'use bytes': tests_pri_0: 52217 ms
no sa-compile, using 'use bytes': tests_pri_0: 17174 ms

sa-compile,    not   'use bytes': tests_pri_0: 25134 ms
sa-compile,    using 'use bytes': tests_pri_0: 11070 ms


The message is just an ordinary newsletter, multipart/alternative
(plain text and HTML), and contains 8-bit UTF-8 characters.
Comment 13 Warren Togami 2010-01-15 13:29:03 UTC
The attached Test case takes ~2 seconds with Fedora 12 perl-5.10.0 here, and ~12 seconds with RHEL-5 perl-5.8.8.

Could someone please attach an example mail that is an even larger delay?
Comment 14 Mark Martinec 2010-01-15 16:25:14 UTC
Created attachment 4644 [details]
my sample message, gzipped

Here attached is my newsletter sample as used in comment 11 and #12.
More slow samples are attached to Bug 6238.

Granted, the more troublesome rules are some of the SARE rules,
but it is not limited to them.
Comment 15 Henrik Krohns 2010-01-15 22:03:53 UTC
I'm all for fixing this in 3.3.0. Why do we want to release a major version knowing that it could be much faster in certain cases?

I frequently get messages that take 4+ minutes to scan on my 1Ghz SPARC. "use bytes" seem to speed it up to 1 minute range. Tested one message and it was 3 vs 10 seconds on 3Ghz AMD though.. gotta look into it if I have time.
Comment 16 Daryl C. W. O'Shea 2010-01-15 22:11:58 UTC
Um, why did we remove "use bytes;" in the first place?  I don't remember completely accurately.  I think that there's a bug on it or at least a thread about it somewhere... there was a quite a bit of discussion when it was done, IIRC.
Comment 17 Warren Togami 2010-01-15 22:24:09 UTC
There is no doubt that this makes it faster, but I don't see any discussion here verifying that it results in correct behavior.  I personally am OK with further delay of 3.3.0 release for this issue *IF* it can be demonstrated that it results in functional equivalence for perl-5.8.8 and perl-5.10.1.

But if it cannot be demonstrated as such by Sunday, I propose we cut the 3.3.0 release and call for vote at that time.  We already have a high degree of confidence that the current code is good, and this particular issue would not be a regression.
Comment 18 Michael Parker 2010-01-15 23:33:37 UTC
Some history:

http://svn.apache.org/viewvc?view=revision&revision=315047

And Bug 4596
Comment 19 Henrik Krohns 2010-01-16 06:38:51 UTC
Here's a quick run of 10k ham + 10k spam on perl 5.10.0.

$ ./freqdiff -c ham.log.nobytes ham.log.bytes
518     __HIGHBITS
64      T_HK_MUCHMONEY
52      __hk_million
39      __F_TEEN
24      __hk_bigmoney
18      T_LOTS_OF_MONEY
18      __LOTSA_MONEY_01
2       __FRT_GOLD
1       DRUGS_MUSCLE
1       T_FRT_LOLITA1
1       __DRUGS_MUSCLE1
1       __LOTSA_MONEY_00
1       __OWN_LAND
-1      FRT_SOMA2
-2      T_TVD_LONG_WORD5

$ ./freqdiff -c spam.log.nobytes spam.log.bytes
28      __HIGHBITS
1       __FB_MA
1       __OWN_LAND
-1      T_FRT_EXPERIENCE
-1      __AFRICAN_STATE
-1      __IVORY_COAST
-1      __hk_bigmoney
-2      T_FRT_ADULT2
-3      FUZZY_CREDIT
Comment 20 Henrik Krohns 2010-01-16 06:44:50 UTC
Surprisinly total scan time was only 10min vs 13min in favor of use bytes.
Comment 21 Justin Mason 2010-01-16 15:58:59 UTC
btw,  I think this significant slowdown in 3.3.0 may be as a result of increased use of replace_rules rules, compared to when bug 4596 happened.
Comment 22 John Hardin 2010-01-16 18:18:50 UTC
(In reply to comment #21)
> btw,  I think this significant slowdown in 3.3.0 may be as a result of
> increased use of replace_rules rules, compared to when bug 4596 happened.

Are you suggesting replace_rules itself has an impact on the performance? I doubt that, as replace_rules only comes into play at compile time.

I will, however, grant that replace_rules makes it easier to write very complex (and thus slow) rules...
Comment 23 Justin Mason 2010-01-17 16:24:37 UTC
(In reply to comment #22)
> (In reply to comment #21)
> > btw,  I think this significant slowdown in 3.3.0 may be as a result of
> > increased use of replace_rules rules, compared to when bug 4596 happened.
> 
> Are you suggesting replace_rules itself has an impact on the performance? I
> doubt that, as replace_rules only comes into play at compile time.
> 
> I will, however, grant that replace_rules makes it easier to write very complex
> (and thus slow) rules...

the latter -- I think some aspect of those complexities is performing very badly without "use bytes", compared to with.

fwiw, I did a mass-check with and without "use bytes" in Message.pm, using perl 5.8.8, against 5000 ham and 5000 spam:

: 36...; wc -l log.trunk/*.log
    5002 log.trunk/ham.log
    5003 log.trunk/spam.log
   10005 total

: 37...; wc -l log.bytes/*.log
    5002 log.bytes/ham.log
    5003 log.bytes/spam.log
   10005 total

: 31...; ./freqdiff -c log.trunk/ham.log log.bytes/ham.log
4       __HIGHBITS

: 32...; ./freqdiff -c log.trunk/spam.log log.bytes/spam.log
-2      __TVD_SPACE_RATIO


so that's a barely-noticeable difference of 4 mails (out of 5000 hams) more __HIGHBITS hits, and 2 less __TVD_SPACE_RATIO hits in spam.  no scoring rules were affected.

speed was not noticeably affected, either; ~50 mins with trunk, 48 with "use bytes".   

locale was set to LANG=C.
Comment 24 Warren Togami 2010-01-17 16:58:35 UTC
What is the exact patch to test "use bytes"?  My corpus has lots of Japanese mail and I could run masscheck.

But given 48 minutes vs 50 minutes is this really worthwhile of a risk to do last minute before release?

We really should decide quickly and cut the 3.3.0.
Comment 25 Michael Parker 2010-01-17 20:31:03 UTC
Note that I've seen differences between 18 and 28 percent on a mass-check of around 50k spam and ham.

I was seeing similar differences in my freqdiff, HIGHBITS was the highest difference and then some other length based rules.  I assume that is a side effect of characters being considered length 2 vs 1 with use bytes.

For me I was just going to revert the use bytes change in Message.pm but then I didn't know how it would effect the other changes John made.

I also note that with use bytes added back to Message.pm t/html_utf8.t now fails for me on perl 5.8.5 and FreeBSD.
Comment 26 Henrik Krohns 2010-01-17 21:43:03 UTC
Looking at mass runtimes it's probably worth exploring only after 3.3.0. I'll be enabling use bytes on my slow server though, since I get many bad messages and the effect is much more visible there..
Comment 27 Warren Togami 2010-01-17 23:18:18 UTC
Even if the rule hits are roughly equivalent:

* Is this functionally equivalent for the tokens going in/out of Bayes?
* What about those reasons why this was removed years ago?

Please consider safety for this 3.3.0 release.  I think we should aggressively recruit more masscheck participants and push for a good 3.3.1 release within 3-6 months where we can adequately test changes like "use bytes".
Comment 28 Henrik Krohns 2010-01-17 23:29:23 UTC
Not surprisingly it affects Bayes, but only as slightly as the rules. Probably tokens containing highbits etc. It's simple to test with sa-learn and comparing dumps.
Comment 29 Mark Martinec 2010-01-18 04:06:45 UTC
(In reply to comment #24)
> What is the exact patch to test "use bytes"?  My corpus has lots of Japanese
> mail and I could run masscheck.

The exact location is not critical. Something like this would do:

--- Message.pm~ 2010-01-18 12:55:21.000000000 +0100
+++ Message.pm  2010-01-18 12:55:45.000000000 +0100
@@ -46,4 +46,5 @@
 use warnings;
 use re 'taint';
+use bytes;

 use Mail::SpamAssassin;


> But given 48 minutes vs 50 minutes is this really worthwhile of a risk to do
> last minute before release?

The change has a little effect on run times of majority of message, and
hence on the average times. Seems that only few messages are affected,
but there the difference can be substantial, as reported earlier
(factors like 5.7, 6.7, 1.9, 2.9 were reported under this bug, which
can have a large impact when a bunch of such message comes in, each
taking minutes to process).
Comment 30 Justin Mason 2010-01-18 05:57:29 UTC
(In reply to comment #23)
> so that's a barely-noticeable difference of 4 mails (out of 5000 hams) more
> __HIGHBITS hits, and 2 less __TVD_SPACE_RATIO hits in spam.  no scoring rules
> were affected.
> 
> speed was not noticeably affected, either; ~50 mins with trunk, 48 with "use
> bytes".   
> 
> locale was set to LANG=C.

btw I repeated this with the "use bytes" mass-check with LANG=en_US.UTF-8 set, without any differences at all in freqdiff results.
Comment 31 Justin Mason 2010-01-18 08:23:27 UTC
Does everyone agree that this is only isolated to small subsets of messages, rather than affecting scan speed for all, or a majority of, messages?  ("mass-check" is the appropriate tool to use to measure this.)

If this is the case, we should go ahead and cut 3.3.0 as it stands, and leave this as a known issue for a 3.3.1 investigation/fix.
Comment 32 Warren Togami 2010-01-18 08:26:41 UTC
(In reply to comment #28)
> Not surprisingly it affects Bayes, but only as slightly as the rules. Probably
> tokens containing highbits etc. It's simple to test with sa-learn and comparing
> dumps.

I would imagine that treating the multi-byte characters as individual bytes might bite us in ways similar to Bug 6183.  Various control characters or characters considered non-words happen as the second byte, screwing up tokenization.
Comment 33 Warren Togami 2010-01-18 08:43:20 UTC
(In reply to comment #31)
> Does everyone agree that this is only isolated to small subsets of messages,
> rather than affecting scan speed for all, or a majority of, messages? 
> ("mass-check" is the appropriate tool to use to measure this.)
> 
> If this is the case, we should go ahead and cut 3.3.0 as it stands, and leave
> this as a known issue for a 3.3.1 investigation/fix.

+1 go ahead with 3.3.0 cut.  Target 3.3.1 in several months, which should be possible if we recruit more masscheck participants.
Comment 34 Henrik Krohns 2010-01-18 08:44:31 UTC
On behalf of my corpus, +1 for releasing 3.3.0 also.
Comment 35 Michael Parker 2010-01-18 08:46:16 UTC
Actually, if you look at Bayes.pm you'll notice that it already has use bytes so I'm not sure it will have any effect on bayes tokenization.
Comment 36 Henrik Krohns 2010-01-18 08:56:35 UTC
(In reply to comment #35)
> Actually, if you look at Bayes.pm you'll notice that it already has use bytes
> so I'm not sure it will have any effect on bayes tokenization.

If it wasn't clear, I did test it. Sa-learn a folder, sa-learn --dump data > dump1, rm -rf bayes*, sa-learn same folder, sa-learn --dump data > dump2, diff dump1 dump2. You will see same tokens with slightly different counts and few different tokens. I don't think there is anything in Bayes that would make subsequent runs different?
Comment 37 Warren Togami 2010-01-18 09:16:42 UTC
(In reply to comment #35)
> Actually, if you look at Bayes.pm you'll notice that it already has use bytes
> so I'm not sure it will have any effect on bayes tokenization.

OK, this is already an existing problem in Bayes then.  Will investigate and file a separate bug for 3.3.1.
Comment 38 Justin Mason 2010-01-18 09:32:06 UTC
regarding Bayes tokenization: use of byte-level breaks there is an explicit choice, not a bug.
Comment 39 Mark Martinec 2010-01-18 10:06:49 UTC
(In reply to comment #32)
> I would imagine that treating the multi-byte characters as individual bytes
> might bite us in ways similar to Bug 6183.

(In reply to comment #31)
> Does everyone agree that this is only isolated to small subsets of messages,
> rather than affecting scan speed for all, or a majority of, messages? 

Yes, I believe this is the case. According to our timing stats charts,
switching 'use bytes' on or off is hardly detectable on a plot (showing
5-minute average running times per message). The TIME_LIMIT_EXCEEDED rule
(on a 40 second time limit) strikes about 5 times per day on a lightly
loaded host (7k msgs/h), so at the expense of possibly passing through
a potential spam due to reaching a time limit, it is acceptable as it
stands now, at least for our site.

> If this is the case, we should go ahead and cut 3.3.0 as it stands, and
> leave this as a known issue for a 3.3.1 investigation/fix.

Agreed. Putting the the 'use bytes' would conceptually be a step backwards.
Advantage of having a message represented as perl characters instead of bytes
can be put to good use in several corners. It is very unfortunate that Perl
takes such a heavy hit on them. Until this gets targeted as a DoS hole,
it need not be turned off.
Comment 40 Justin Mason 2010-01-20 13:35:23 UTC
let's not move bugs off the 3.3.0 milestone until that version is released.  We have a protocol to indicate that a bug is a release blocker (P1) or not a blocker (P2..P5) -- let's stick with that system.
Comment 41 Justin Mason 2010-01-27 02:20:38 UTC
moving most remaining 3.3.0 bugs to 3.3.1 milestone
Comment 42 Justin Mason 2010-01-27 03:16:25 UTC
reassigning, too
Comment 43 Justin Mason 2010-03-23 16:33:39 UTC
moving all open 3.3.1 bugs to 3.3.2
Comment 44 Karsten Bräckelmann 2010-03-23 17:42:45 UTC
Moving back off of Security, which got changed by accident during the mass Target Milestone move.
Comment 45 Larry Rosenbaum 2010-03-23 19:22:58 UTC
Note: "use bytes" makes the t/html_utf8.t test fail:

t/html_utf8.t ..................... 1/2         Not found: QUOTE_YOUR =  QUOTE_YOUR  at t/html_utf8.t line 27.
# Failed test 1 in t/SATest.pm at line 717
Output can be examined in: log/d.html_obfu/1
t/html_utf8.t ..................... Failed 1/2 subtests
Comment 46 Mark Martinec 2013-03-09 02:17:58 UTC
(In reply to comment #39)
> > I would imagine that treating the multi-byte characters as individual bytes
> > might bite us in ways similar to Bug 6183.
> > Does everyone agree that this is only isolated to small subsets of messages,
> > rather than affecting scan speed for all, or a majority of, messages? 
> Yes, I believe this is the case. [...]
> Agreed. Putting the the 'use bytes' would conceptually be a step backwards.
> Advantage of having a message represented as perl characters instead of bytes
> can be put to good use in several corners. It is very unfortunate that Perl
> takes such a heavy hit on them. Until this gets targeted as a DoS hole,
> it need not be turned off.

There is no light decision on this. Setting the target to 'future'...
Comment 47 Quanah Gibson-Mount 2014-05-02 00:25:21 UTC
This remains an issue with 3.4.0:

May  1 19:12:53 edge01 amavis[30846]: (30846-02) TIMING-SA [total 2274 ms, cpu 1962 ms] - parse: 2.7 (0.1%), extract_message_metadata: 18 (0.8%), get_uri_detail_list: 17 (0.7%), tests_pri_-1000: 2.9 (0.1%), tests_pri_-950: 0.97 (0.0%), tests_pri_-900: 1.25 (0.1%), tests_pri_-400: 43 (1.9%), check_bayes: 42 (1.8%), b_tokenize: 20 (0.9%), b_tok_get_all: 10 (0.4%), b_comp_prob: 9 (0.4%), b_tok_touch_all: 0.22 (0.0%), b_finish: 0.65 (0.0%), tests_pri_0: 2109 (92.7%), check_spf: 0.24 (0.0%), check_razor2: 156 (6.8%), check_pyzor: 154 (6.8%), tests_pri_500: 2.5 (0.1%), learn: 80 (3.5%), b_learn: 69 (3.0%), b_count_change: 31 (1.3%), get_report: 0.69 (0.0%)

Total SA time was just over 2 seconds, and 92.7% (2109ms) was due to test_pri_0

Generally, what I see is:

May  1 19:24:12 edge01 amavis[30886]: (30886-13) TIMING-SA [total 423 ms, cpu 169 ms] - parse: 1.53 (0.4%), extract_message_metadata: 2.3 (0.5%), get_uri_detail_list: 3.1 (0.7%), tests_pri_-1000: 3.8 (0.9%), tests_pri_-950: 1.48 (0.4%), tests_pri_-900: 1.68 (0.4%), tests_pri_-400: 17 (3.9%), check_bayes: 15 (3.6%), b_tokenize: 4.7 (1.1%), b_tok_get_all: 4.3 (1.0%), b_comp_prob: 3.7 (0.9%), b_tok_touch_all: 0.11 (0.0%), b_finish: 0.73 (0.2%), tests_pri_0: 351 (83.1%), check_spf: 0.63 (0.1%), check_razor2: 143 (33.8%), check_pyzor: 144 (34.0%), tests_pri_500: 2.8 (0.7%), learn: 33 (7.9%), b_learn: 29 (6.9%), b_count_change: 12 (2.7%), get_report: 0.49 (0.1%)

Even still, we can see that test_pri_0 is the majority of the time (83%, 351ms/423ms).

I.e., even for "fast" emails, test_pri_0 is the bottleneck.
Comment 48 Quanah Gibson-Mount 2014-05-02 18:19:57 UTC
Thanks to the amavis-logwatch script, I have some nice timing data for rules processing at my site as well:

========================================================================================================================
SA Timing Percentiles         % Time    Total (ms)        0%        5%       25%       50%       75%       95%      100%
------------------------------------------------------------------------------------------------------------------------
tests_pri_0 (7650)            84.64%    5165461.00     88.00    233.00    338.00    402.00    534.00   1759.40  24756.00
check_pyzor (7650)            28.59%    1744883.00     31.00    139.00    145.00    151.00    158.00    218.00  19208.00
check_razor2 (7650)           19.22%    1173107.00     24.00     27.00     94.00    132.00    162.00    209.00  14023.00
learn (3985)                   5.91%     360940.90      3.40      7.00     28.00     38.00     60.00    248.80   3674.00
b_learn (3985)                 5.42%     330508.35      1.49      2.20     25.00     34.00     52.00    204.00   3481.00
tests_pri_-400 (7650)          4.66%     284643.00      7.00     12.00     17.00     22.00     34.00     87.00  14435.00
check_bayes (7650)             4.49%     273827.00      7.00     11.00     16.00     21.00     32.00     84.00  14434.00
extract_message_metadata (7650)   1.74%     106200.12      0.93      1.99      3.90      8.00     14.00     45.00    528.00
b_tokenize (7650)              1.61%      98333.80      1.85      3.20      4.80      7.00     12.00     33.00    531.00
b_count_change (3605)          1.40%      85147.90      3.20      8.00     11.00     16.00     24.00     47.00   2919.00
get_uri_detail_list (7650)     1.00%      60895.61      0.08      0.64      1.76      2.90      7.00     23.00    509.00
b_tok_get_all (7650)           1.00%      60866.52      1.51      3.00      4.30      6.00      8.00     19.00    699.00
b_comp_prob (7650)             0.79%      48176.27      1.16      2.20      3.60      4.70      7.00     14.00    123.00
b_finish (7650)                0.52%      31588.31      0.24      0.46      0.52      0.56      0.63      0.96  14408.00
parse (7650)                   0.42%      25687.28      0.73      1.20      1.79      2.40      3.50      9.00     40.00
tests_pri_-1000 (7650)         0.41%      25285.90      1.95      2.20      2.60      2.80      3.10      6.00     42.00
tests_pri_500 (7650)           0.41%      24944.14      1.66      2.20      2.50      2.90      3.20      6.00     31.00
expire_bayes (2)               0.40%      24705.00  10441.00  10632.15  11396.75  12352.50  13308.25  14072.85  14264.00
tests_pri_-900 (7650)          0.18%      11028.16      0.93      1.02      1.16      1.21      1.29      1.84     40.00
tests_pri_-950 (7650)          0.16%      10020.39      0.77      0.86      0.98      1.05      1.36      1.73     22.00
get_report (7650)              0.09%       5729.38      0.25      0.41      0.48      0.56      0.69      1.48     25.00
check_spf (7650)               0.05%       3037.06      0.18      0.21      0.24      0.27      0.46      0.65     27.00
b_tok_touch_all (7650)         0.02%       1254.65      0.04      0.06      0.09      0.13      0.19      0.25     14.00
------------------------------------------------------------------------------------------------------------------------
Total                                   6103202.00    118.00    293.00    395.00    475.00    667.00   2086.70  25856.00
========================================================================================================================

We can see that test_pri_0 eats up an inordinate amount of time comparatively.
Comment 49 Benny Pedersen 2014-05-02 18:36:15 UTC
what plugins are loaded ?

is sa-compile runned after sa-update ?

what version of re2c is used ?

what custom rules are loaded if any ?
Comment 50 Quanah Gibson-Mount 2014-05-02 18:49:24 UTC
(In reply to Benny Pedersen from comment #49)
> what plugins are loaded ?

Whatever the defaults are for 3.4.0

> is sa-compile runned after sa-update ?

No

> what version of re2c is used ?

N/A

> what custom rules are loaded if any ?

header __FROM_FACEBOOK Return-Path:addr =~ /no-reply\@facebook\.com/i
meta __FORGED_SENDER (!SPF_PASS && !DKIM_VALID_AU)
meta FORGED_FACEBOOK_FROM (__FROM_FACEBOOK && __FORGED_SENDER)
score FORGED_FACEBOOK 5 5 5 5

header DSPAM_SPAM X-DSPAM-Result =~ /^Spam$/
describe DSPAM_SPAM DSPAM claims it is spam
score DSPAM_SPAM 1.5

header DSPAM_HAM X-DSPAM-Result =~ /^Innocent$/
describe DSPAM_HAM DSPAM claims it is ham
score DSPAM_HAM -0.5

header PROOFPOINT_SPAM X-Proofpoint-Spam-Details =~ /rule=tag_spam/
describe PROOFPOINT_SPAM Proofpoint fsecure spam
score PROOFPOINT_SPAM 3

header PROOFPOINT_HAM X-Proofpoint-Spam-Details =~ /rule=tag_notspam/
describe PROOFPOINT_HAM Proofpoint fsecure non-spam
score PROOFPOINT_HAM -1.5

header LAB_NETWORK Received=~ /\[X.X.X.X\]/
describe LAB_NETWORK Always allow lab systems
score LAB_NETWORK -99

header LOGGER_REPORT Received =~ /\[X.X.X.X\]/
describe LOGGER_REPORT Always allow logger report
score LOGGER_REPORT -99

header FROM_PERFORCE exists:X-From-Perforce
describe FROM_PERFORCE Email from Perforce Check-ins
score FROM_PERFORCE -5

header BARRACUDA_SPAM X-Barracuda-Spam-Flag =~ /^Yes,/
describe BARRACUDA_SPAM Barracuda Spam Firewall claims it is spam
score BARRACUDA_SPAM 3.5

header BARRACUDA_HAM X-Barracuda-Spam-Status =~ /^No,/
describe BARRACUDA_HAM Barracuda Spam Firewall claims it is spam
score BARRACUDA_HAM -0.5

header CMAE_SPAM X-CMAE-Verdict =~ /^spam/
describe CMAE_SPAM CMAE Spam Firewall claims it is spam
score CMAE_SPAM 105

header SPAMGUARD_SPAM exists:X-Rocket-Spam
describe SPAMGUARD_SPAM SpamGuard claims it is spam.
score SPAMGUARD_SPAM 3.5

header SPAMGUARD_SPAM2 exists:X-Spam-Track
describe SPAMGUARD_SPAM2 SpamGuard claims it is spam.
score SPAMGUARD_SPAM 3.5
Comment 51 Kevin A. McGrail 2014-05-02 19:05:25 UTC
(In reply to Quanah Gibson-Mount from comment #48)
> We can see that test_pri_0 eats up an inordinate amount of time
> comparatively.

Just to be clear, you know that test_pri_0 is a group of rules and a large number of rules fall under that priority level?
Comment 52 Quanah Gibson-Mount 2014-05-02 19:11:54 UTC
(In reply to Kevin A. McGrail from comment #51)
> (In reply to Quanah Gibson-Mount from comment #48)
> > We can see that test_pri_0 eats up an inordinate amount of time
> > comparatively.
> 
> Just to be clear, you know that test_pri_0 is a group of rules and a large
> number of rules fall under that priority level?

Heh, no, that's good to know. ;)  So I guess a lot of time spent in it may be expected.
Comment 53 AXB 2014-05-02 19:19:08 UTC
(In reply to Quanah Gibson-Mount from comment #52)
> (In reply to Kevin A. McGrail from comment #51)
> > (In reply to Quanah Gibson-Mount from comment #48)
> > > We can see that test_pri_0 eats up an inordinate amount of time
> > > comparatively.
> > 
> > Just to be clear, you know that test_pri_0 is a group of rules and a large
> > number of rules fall under that priority level?
> 
> Heh, no, that's good to know. ;)  So I guess a lot of time spent in it may
> be expected.

this looks more like something you could keep an eye on:

expire_bayes (2)               0.40%      24705.00  10441.00  10632.15 
11396.75  12352.50  13308.25  14072.85  14264.00

I'm guessing that you have bayes expire set to auto.
If yes, it may be wise to run bayes expiration via cron, during  alow traffic period (Sunday 03:00)
Comment 54 Quanah Gibson-Mount 2014-05-02 19:20:46 UTC
(In reply to AXB from comment #53)

> this looks more like something you could keep an eye on:
> 
> expire_bayes (2)               0.40%      24705.00  10441.00  10632.15 
> 11396.75  12352.50  13308.25  14072.85  14264.00
> 
> I'm guessing that you have bayes expire set to auto.
> If yes, it may be wise to run bayes expiration via cron, during  alow
> traffic period (Sunday 03:00)

Ok, good info, thanks!
Comment 55 Benny Pedersen 2014-05-02 19:42:26 UTC
http://www.zimbra.com/forums/administrators/30587-solved-you-aware-you-can-compile-your-sa-rules.html

if still not satisfied, post

spamassassin 2>&1 -D --lint
Comment 56 Benny Pedersen 2014-05-02 19:45:00 UTC
check loadplugin is NOT in cf files btw

move the lines to pre files
Comment 57 Mark Martinec 2015-02-24 01:49:25 UTC
Please see Bug 7091 and further tickets referenced there.

Text being processed by rules and plugins should not have utf8 flag
enabled since 3.4.1, i.e. it is either transcoded to UTF-8 octets (with
normalize-charset enabled), or left as original bytes. There was a bug
in HTML decoding which could result in Unicode text (utf8 flag on) which
would slow down processing.

I hope to get rid of most of 'use bytes' in modules eventually, as this
pragma makes no difference when we can ensure that both text and rules are
in octets (utf8 flag off).
Comment 58 Mark Martinec 2015-04-09 00:22:38 UTC
I'm tentatively closing this issue.

There has been extensive work in 3.4.1 to stay consistent with
byte semantics (e.g. use UTF-8 encodings) in decoding, HTML decoding,
bayes tokenization, etc, so the original concern has either been
resolved or avoided by now.

In the future the remaining 'use bytes' should probably be
removed, as they are unnecessary when we can ensure that
both the text and regular expressions are using byte semantics
(i.e. have utf8 flag off) - which was not always the case in
the past. Alternatively, we can start considering switching to
character semantics throughout. In any case, it's probably
the best to open up a new ticket when time comes.