5691 – Slow rules due to charset normalization not always clearing utf8 flag

Bug 5691 - Slow rules due to charset normalization not always clearing utf8 flag

Summary: Slow rules due to charset normalization not always clearing utf8 flag

Status:	RESOLVED INVALID

Alias:	None

Product:	Spamassassin
Classification:	Unclassified
Component:	Libraries (show other bugs)
Version:	3.2.3
Hardware:	All All

Importance:	P5 normal
Target Milestone:	3.2.4
Assignee:	SpamAssassin Developer Mailing List

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2007-10-18 17:49 UTC by Mark Martinec
Modified:	2009-11-12 15:20 UTC (History)
CC List:	2 users (show)

Attachment	Type	Actions	Submitter/CLA Status
sample mail	text/plain	None	Mark Martinec
gzipped small test perl program, simulating the case	application/octet-stream	None	Mark Martinec
proposed patch	patch	None	Mark Martinec
actually, this one	patch	None	Mark Martinec
same patch but without a "use Encode"	patch	None	Mark Martinec
Show Obsolete (2) Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Mark Martinec 2007-10-18 17:49:59 UTC

Seems like some koi8-r -encoded Russian spam causes certain priority-0
body rules to be really slow. I'm observing this for the last two or
three weeks - every time some message takes one or two minutes to check,
it almost always turns out to be this type of spam.

The slow rules are mostly of the SARE_FRAUD* and SARE_SPEC_ROLEX* family,
but even some baseline rules are not much better: SHORT_TERM_PRICE, FB_GVR,
DRUGS_SLEEP2, AFF_LOTTERY, KAM_LOTTO*, ...

Seems like the trunk code is even slower than 3.2.3(4), don't know why.

Would somebody more familiar with the internals in that area please
try the attached message. Perhaps something can be done about it,
besides ditching these rules.

Here is the timing report from a command-line spamassassin run
(note the tests_pri_0 which takes 44 seconds on this 72 kB message):

[58931] dbg: timing: total 49963 ms - init: 4188 (8.4%), parse: 16 (0.0%), 
extract_message_metadata: 406 (0.8%), get_uri_detail_list: 150 (0.3%),
tests_pri_-1000: 71 (0.1%), tests_pri_-950: 5 (0.0%), tests_pri_-900: 7 (0.0%), 
tests_pri_-400: 54 (0.1%), check_bayes: 47 (0.1%), tests_pri_0: 44788 (89.6%), 
check_spf: 61 (0.1%), check_dkim_signature: 29 (0.1%), poll_dns_idle: 0.05(0%), 
check_razor2: 726 (1.5%), check_pyzor: 0.07 (0.0%), check_dcc: 462 (0.9%),
tests_pri_100: 5 (0.0%), tests_pri_500: 177 (0.4%), tests_pri_1000: 34 (0.1%), 
total_awl: 29 (0.1%), check_awl: 4 (0.0%), update_awl: 2 (0.0%), learn: 80 
(0.2%)

and here is the top of the timing list are reported by HitFreqsRuleTiming:
v1
T         __SARE_FRAUD_BADTHINGS    1.385    1.385    1
T             __SARE_FRAUD_MONEY    1.338    1.338    1
T             __SARE_FRAUD_TRUST    1.199    1.199    1
T               SHORT_TERM_PRICE    1.128    1.128    1
T          __SARE_FRAUD_RELIGION    1.068    1.068    1
T           __SARE_FRAUD_PROPOSE    0.982    0.982    1
T          SARE_SPEC_ROLEX_NOV5A    0.920    0.920    1
T          SARE_SPEC_ROLEX_NOV5B    0.855    0.855    1
T         __SARE_FRAUD_DPTCOMPNY    0.804    0.804    1
T                         FB_GVR    0.769    0.769    1
T      RAZOR2_CF_RANGE_E8_51_100    0.726    0.726    1
T              __SARE_FRAUD_MISC    0.553    0.553    1
T                  __SARE_OEM_2A    0.543    0.543    1
T          __SARE_FRAUD_FUNWORDS    0.543    0.543    1
T              SARE_RMML_Stock17    0.539    0.539    1
T             __SARE_FRAUD_AGREE    0.477    0.477    1
T                      DCC_CHECK    0.462    0.462    1
T          SARE_SPEC_ROLEX_CHEAP    0.444    0.444    1
T                 __DRUGS_SLEEP2    0.424    0.424    1
T                SARE_MONEYTERMS    0.375    0.375    1
T           __SARE_FRAUD_LOTTERY    0.367    0.367    1
T              __SARE_FRAUD_VIPS    0.365    0.365    1
T                  __AFF_LOTTERY    0.355    0.355    1
T                  __SARE_OEM_1A    0.343    0.343    1
T          __SARE_FRAUD_LOTTERY2    0.329    0.329    1
T         __SARE_FRAUD_PAPERWORK    0.329    0.329    1
T                 SARE_BETTERORG    0.329    0.329    1
T                SARE_EN_SET1_A1    0.309    0.309    1
T                __ZMIde_STOCK17    0.302    0.302    1
T                   __KAM_LOTTO5    0.283    0.283    1
T         __SARE_FRAUD_BARRISTER    0.281    0.281    1
T                  __SARE_OEM_2B    0.269    0.269    1
T                   __KAM_LOTTO2    0.252    0.252    1
T               SARE_RMML_Stock5    0.250    0.250    1
T           __SARE_FRAUD_FOREIGN    0.245    0.245    1
T           SARE_SPEC_REPL_OBFU3    0.230    0.230    1
T                    __FRAUD_DBI    0.223    0.223    1
T             __SARE_FRAUD_REPLY    0.221    0.221    1
T                     SARE_CSBIG    0.214    0.214    1
T           SARE_SPEC_REPL_OBFU1    0.212    0.212    1
T                    FB_GET_MEDS    0.207    0.207    1
T          SARE_SPEC_ROLEX_NOV5D    0.199    0.199    1
T                  SARE_ADLTOBFU    0.198    0.198    1
T                    __FRAUD_XJR    0.182    0.182    1
T            __SARE_FRAUD_FAMILY    0.179    0.179    1
T           SARE_SPEC_REPL_OBFU2    0.176    0.176    1
T               __DRUGS_ANXIETY3    0.151    0.151    1
T                   __KAM_LOTTO3    0.151    0.151    1
T                SARE_MLB_Stock1    0.148    0.148    1
T              __DRUGS_ERECTILE1    0.147    0.147    1
T            __VBOUNCE_AVREPORT0    0.146    0.146    1
T             __SARE_FRAUD_INTRO    0.146    0.146    1
T                   __KAM_LOTTO6    0.145    0.145    1
T                   __KAM_LOTTO4    0.145    0.145    1
T             SARE_SPEC_ANTIDOTE    0.145    0.145    1
T               __LOCAL_PP_B_UPD    0.143    0.143    1
T                  SARE_UNSUB30C    0.138    0.138    1
T           __SARE_FRAUD_TINHORN    0.131    0.131    1
T                 FUZZY_MORTGAGE    0.117    0.117    1
T               FUZZY_MEDICATION    0.116    0.116    1
T                SARE_EN_A_INT_1    0.115    0.115    1
T                    FUZZY_MONEY    0.112    0.112    1
T                  FUZZY_MILLION    0.111    0.111    1
T             REMOVE_BEFORE_LINK    0.111    0.111    1

Comment 1 Mark Martinec 2007-10-18 17:51:44 UTC

Created attachment 4167 [details]
sample mail

Comment 2 Mark Martinec 2007-10-19 03:00:50 UTC

> Seems like the trunk code is even slower than 3.2.3(4), don't know why.

Actually it isn't slower, it is the same speed in 3.2.3 and in trunk.
I just happened not to have SARE rules loaded in 3.2.3
(after switching versions back and forth), which it made it
about twice as fast.

Comment 3 Justin Mason 2007-10-19 03:09:05 UTC

trunk (without additional rulesets) seems reasonably fast:

: jm 103...; time ./spamassassin -Lt < /home/jm/DL/1.msg  > o

real    0m6.988s
user    0m4.908s
sys     0m0.240s

I think it's a SARE bug.  do they have a bug tracker?

Comment 4 Mark Martinec 2007-10-19 08:47:23 UTC

Created attachment 4168 [details]
gzipped small test perl program, simulating the case

Depends on what one considers a bug. I'd say it is a SpamAssassin bug.

Attached is a small perl program which defines an array of text chunks
from the very same sample mail in question, and runs them through
two of the most time-consuming rules as reported in the timing report.
The string encoding and its utf8 flag is the same as it really happens
in the real command-line spamassassin run, eval'ed under compiled body tests.

What is amazing is that if utf8 flag in the text(!) is turned off
it runs 30 times faster!

  bytes t = 0.120686 s
  chars t = 3.617508 s

(btw, I gzipped the test program to make sure the 8-bit characters
will remain intact)

Comment 5 Justin Mason 2007-10-19 09:19:38 UTC

(In reply to comment #4)
> Depends on what one considers a bug. I'd say it is a SpamAssassin bug.

how would you propose to fix it in SA?  by re-adding "use bytes"?

interestingly it seems to work a bit better with blead:

: jm 157...; perl -T 0.lis
bytes t = 0.753940 s
chars t = 4.609974 s
: jm 158...; bleadperl -T 0.lis
bytes t = 0.062525 s
chars t = 1.208566 s

Comment 6 Mark Martinec 2007-10-19 10:28:12 UTC

Created attachment 4169 [details]
proposed patch

> how would you propose to fix it in SA?  by re-adding "use bytes"?

No. By using a supported method that actually works,
instead of utf8::downgrade, which according to Encode man page:

  NOTE: this function is experimental and may change
	or be removed without notice.

and which turns out doesn't do its job.  See attached patch.

Comment 7 Mark Martinec 2007-10-19 10:51:24 UTC

Created attachment 4170 [details]
actually, this one

Comment 8 Mark Martinec 2007-10-24 07:13:48 UTC

Created attachment 4173 [details]
same patch but without a "use Encode"

The Encode is already loaded by Conf.pm and need not be
loaded here - avoids having to go through the same decisions
on loading / not loading it, as already handled in Conf
in order to still be able to run code under Perl 5.6

Comment 9 Mark Martinec 2007-10-24 07:31:18 UTC

I wonder if perhaps this bug should be closed and the Bug 4636
reopened, as it doesn't do its job well.

I'm not sure I know what is the true purpose of utf8::downgrade
in r366594 (comment no.12 in Bug 4636). If the intention was to
clear the utf8 flag for speed, it doesn't work this way, as the
utf8::downgrade aborts its operation leaving the utf8 flag on
(with or without throwing a signal, depending on its second arg)
if a string contains a character that does not have a representation
in the current locale charset.

My proposed patch just makes sure that a conversion is made
from characters to utf-8 octets, this making sure the regexps
in rules need not deal with 16-bit characters, which makes them
terribly slow.

If the intention of the original utf8::downgrade call was to really
convert to a locale character set, it needs to be documented and
probably the target charset needs to be configurable. Also not to forget
that utf8::downgrade is unsupported/experimental in Perl and needs
to be replaced with something else anyway.

Reopening Bug 4636 or keeping this topic here?

Comment 10 Justin Mason 2007-10-24 07:45:13 UTC

(In reply to comment #9)
> I wonder if perhaps this bug should be closed and the Bug 4636
> reopened, as it doesn't do its job well.
> 
> I'm not sure I know what is the true purpose of utf8::downgrade
> in r366594 (comment no.12 in Bug 4636). If the intention was to
> clear the utf8 flag for speed, it doesn't work this way, as the
> utf8::downgrade aborts its operation leaving the utf8 flag on
> (with or without throwing a signal, depending on its second arg)
> if a string contains a character that does not have a representation
> in the current locale charset

aha.  that's not good. :(


> My proposed patch just makes sure that a conversion is made
> from characters to utf-8 octets, this making sure the regexps
> in rules need not deal with 16-bit characters, which makes them
> terribly slow.
> 
> If the intention of the original utf8::downgrade call was to really
> convert to a locale character set, it needs to be documented and
> probably the target charset needs to be configurable. Also not to forget
> that utf8::downgrade is unsupported/experimental in Perl and needs
> to be replaced with something else anyway.
> 
> Reopening Bug 4636 or keeping this topic here?

I would suggest 

(a) retitling this bug to be more specific about utf8::downgrade
causing the issue

and (b) and commenting on bug 4636, indicating that there's a
bug in that code, and pointing at this.

That assumes that, if we replace the utf8::downgrade usage, we
have fixed this bug in its entirety -- is that the case?  If not,
then the utf8::downgrade bug should be opened as a new issue.

Comment 11 Mark Martinec 2007-10-24 11:15:06 UTC

> That assumes that, if we replace the utf8::downgrade usage, we
> have fixed this bug in its entirety -- is that the case?  If not,
> then the utf8::downgrade bug should be opened as a new issue.

Yes, it is true - as far as this bug is concerned (i.e. the utf8 flag
remains enabled in certain text as passed to rules when normalize_charset
is on), the proposed patch does fix it.

Comment 12 John Gardiner Myers 2007-10-24 15:01:23 UTC

The proposed patch is incorrect and this bug is INVALID.

The purpose of the utf8::downgrade call is to get the speed benefits of having
the utf8 flag cleared when it is possible to represent the characters without
using the utf8 flag.  In this particular case, it is not possible to represent
the characters without having the utf8 flag set, so the call leaves the utf8
flag set as it is intended to do.

Many rules are not charset-normalization-aware and thus may perform poorly or
incorrectly with charset normalization enabled.  For example, I have seen rules
test for non-ASCII by using [\x80-\xff].  With charset normalization, they need
to instead use [^\x00-\x7f].  Similarly, rules might need to use [0-9] instead
of \d.  Similarly \s and \w might catch more characters than intended.

Comment 13 Loren Wilton 2007-10-24 16:58:23 UTC

I'm not sure that closing this bug was the appropriate action, certainly not as 
INVALID.  The original complaint that some rules are abnormally slow still 
seems to be a quite valid complaint.

It should also be noted that these rules have existed for years, and were NOT 
slow until recently.  That is because previously the module contained a 'use 
bytes' directive, which is no longer present.

Presumably the use bytes was forcing byte encoding rather than utf-8 encoding 
on strings, whether they could be validly represented as bytes or not.  And the 
rules in question (and many, probably most other existing rules) were tested 
under this regieme and found to not FP excessively.

Comment 14 John Gardiner Myers 2007-10-24 17:13:10 UTC

Did normalize_charset get turned on by default recently?  With normalize_charset
disabled, the code in question won't be run and nothing will ever get the utf8
flag set.

With normalize_charset enabled, one has to pay the associated cost and be
careful about the set of rules used.

The old "use bytes" did nothing useful as nothing ever had the utf8 flag set.

Comment 15 Takashi Takizawa 2007-10-25 02:51:31 UTC

The body pattern test doesn't match even if the body pattern rule written in
Japanese, which encoding is UTF-8, is used. 

However, the body pattern test matches when rewriting to use Encode. 


The following attention is written in 'perldoc utf8'
       * $success = utf8::downgrade($string[, FAIL_OK])
           
               snip...

           Note that this function does not handle arbitrary encodings.
           Therefore Encode.pm is recommended for the general purposes.

           Not affected by the encoding pragma.

           NOTE: this function is experimental and may change or be removed
           without notice.

Comment 16 John Gardiner Myers 2007-10-25 12:40:27 UTC

(In reply to comment #15)
> The body pattern test doesn't match even if the body pattern rule written in
> Japanese, which encoding is UTF-8, is used. 

SA rules files are encoded in ISO-8859-1, not UTF-8.  You have to either encode
Japanese characters in pattern tests using \x sequences or develop a new feature
adding support for UTF-8 config files to SA.

> The following attention is written in 'perldoc utf8'
>        * $success = utf8::downgrade($string[, FAIL_OK])
>            
>                snip...
> 
>            Note that this function does not handle arbitrary encodings.
>            Therefore Encode.pm is recommended for the general purposes.
> 
>            Not affected by the encoding pragma.
> 
>            NOTE: this function is experimental and may change or be removed
>            without notice.

The "NOTE" has been there for years and is most likely out of date.  The only
alternatives to using that function are to reimplement the same semantics some
other way or to impose the performance penalty in cases where it is not necessary.

Comment 17 Adam Katz 2009-11-12 15:20:28 UTC

FYI, the UTF-8 parsing inefficiencies were noted and discussed (with no real results) about five months ago, see the open ticket at 
http://rt.perl.org/rt3//Public/Bug/Display.html?id=66852