SA Bugzilla – Bug 5691
Slow rules due to charset normalization not always clearing utf8 flag
Last modified: 2009-11-12 15:20:28 UTC
Seems like some koi8-r-encoded Russian spam causes certain priority-0 body rules to be really slow. I've been observing this for the last two or three weeks: every time some message takes one or two minutes to check, it almost always turns out to be this type of spam. The slow rules are mostly of the SARE_FRAUD* and SARE_SPEC_ROLEX* families, but even some baseline rules are not much better: SHORT_TERM_PRICE, FB_GVR, DRUGS_SLEEP2, AFF_LOTTERY, KAM_LOTTO*, ... Seems like the trunk code is even slower than 3.2.3(4); I don't know why. Would somebody more familiar with the internals in that area please try the attached message? Perhaps something can be done about it, besides ditching these rules.

Here is the timing report from a command-line spamassassin run (note tests_pri_0, which takes 44 seconds on this 72 kB message):

[58931] dbg: timing: total 49963 ms - init: 4188 (8.4%), parse: 16 (0.0%),
  extract_message_metadata: 406 (0.8%), get_uri_detail_list: 150 (0.3%),
  tests_pri_-1000: 71 (0.1%), tests_pri_-950: 5 (0.0%), tests_pri_-900: 7 (0.0%),
  tests_pri_-400: 54 (0.1%), check_bayes: 47 (0.1%), tests_pri_0: 44788 (89.6%),
  check_spf: 61 (0.1%), check_dkim_signature: 29 (0.1%), poll_dns_idle: 0.05 (0.0%),
  check_razor2: 726 (1.5%), check_pyzor: 0.07 (0.0%), check_dcc: 462 (0.9%),
  tests_pri_100: 5 (0.0%), tests_pri_500: 177 (0.4%), tests_pri_1000: 34 (0.1%),
  total_awl: 29 (0.1%), check_awl: 4 (0.0%), update_awl: 2 (0.0%), learn: 80 (0.2%)

And here is the top of the timing list as reported by HitFreqsRuleTiming:

v1
T __SARE_FRAUD_BADTHINGS 1.385 1.385 1
T __SARE_FRAUD_MONEY 1.338 1.338 1
T __SARE_FRAUD_TRUST 1.199 1.199 1
T SHORT_TERM_PRICE 1.128 1.128 1
T __SARE_FRAUD_RELIGION 1.068 1.068 1
T __SARE_FRAUD_PROPOSE 0.982 0.982 1
T SARE_SPEC_ROLEX_NOV5A 0.920 0.920 1
T SARE_SPEC_ROLEX_NOV5B 0.855 0.855 1
T __SARE_FRAUD_DPTCOMPNY 0.804 0.804 1
T FB_GVR 0.769 0.769 1
T RAZOR2_CF_RANGE_E8_51_100 0.726 0.726 1
T __SARE_FRAUD_MISC 0.553 0.553 1
T __SARE_OEM_2A 0.543 0.543 1
T __SARE_FRAUD_FUNWORDS 0.543 0.543 1
T SARE_RMML_Stock17 0.539 0.539 1
T __SARE_FRAUD_AGREE 0.477 0.477 1
T DCC_CHECK 0.462 0.462 1
T SARE_SPEC_ROLEX_CHEAP 0.444 0.444 1
T __DRUGS_SLEEP2 0.424 0.424 1
T SARE_MONEYTERMS 0.375 0.375 1
T __SARE_FRAUD_LOTTERY 0.367 0.367 1
T __SARE_FRAUD_VIPS 0.365 0.365 1
T __AFF_LOTTERY 0.355 0.355 1
T __SARE_OEM_1A 0.343 0.343 1
T __SARE_FRAUD_LOTTERY2 0.329 0.329 1
T __SARE_FRAUD_PAPERWORK 0.329 0.329 1
T SARE_BETTERORG 0.329 0.329 1
T SARE_EN_SET1_A1 0.309 0.309 1
T __ZMIde_STOCK17 0.302 0.302 1
T __KAM_LOTTO5 0.283 0.283 1
T __SARE_FRAUD_BARRISTER 0.281 0.281 1
T __SARE_OEM_2B 0.269 0.269 1
T __KAM_LOTTO2 0.252 0.252 1
T SARE_RMML_Stock5 0.250 0.250 1
T __SARE_FRAUD_FOREIGN 0.245 0.245 1
T SARE_SPEC_REPL_OBFU3 0.230 0.230 1
T __FRAUD_DBI 0.223 0.223 1
T __SARE_FRAUD_REPLY 0.221 0.221 1
T SARE_CSBIG 0.214 0.214 1
T SARE_SPEC_REPL_OBFU1 0.212 0.212 1
T FB_GET_MEDS 0.207 0.207 1
T SARE_SPEC_ROLEX_NOV5D 0.199 0.199 1
T SARE_ADLTOBFU 0.198 0.198 1
T __FRAUD_XJR 0.182 0.182 1
T __SARE_FRAUD_FAMILY 0.179 0.179 1
T SARE_SPEC_REPL_OBFU2 0.176 0.176 1
T __DRUGS_ANXIETY3 0.151 0.151 1
T __KAM_LOTTO3 0.151 0.151 1
T SARE_MLB_Stock1 0.148 0.148 1
T __DRUGS_ERECTILE1 0.147 0.147 1
T __VBOUNCE_AVREPORT0 0.146 0.146 1
T __SARE_FRAUD_INTRO 0.146 0.146 1
T __KAM_LOTTO6 0.145 0.145 1
T __KAM_LOTTO4 0.145 0.145 1
T SARE_SPEC_ANTIDOTE 0.145 0.145 1
T __LOCAL_PP_B_UPD 0.143 0.143 1
T SARE_UNSUB30C 0.138 0.138 1
T __SARE_FRAUD_TINHORN 0.131 0.131 1
T FUZZY_MORTGAGE 0.117 0.117 1
T FUZZY_MEDICATION 0.116 0.116 1
T SARE_EN_A_INT_1 0.115 0.115 1
T FUZZY_MONEY 0.112 0.112 1
T FUZZY_MILLION 0.111 0.111 1
T REMOVE_BEFORE_LINK 0.111 0.111 1
Created attachment 4167 [details] sample mail
> Seems like the trunk code is even slower than 3.2.3(4), don't know why.

Actually it isn't slower; it is the same speed in 3.2.3 and in trunk. I just happened not to have the SARE rules loaded in 3.2.3 (after switching versions back and forth), which made it about twice as fast.
trunk (without additional rulesets) seems reasonably fast:

: jm 103...; time ./spamassassin -Lt < /home/jm/DL/1.msg > o

real    0m6.988s
user    0m4.908s
sys     0m0.240s

I think it's a SARE bug. Do they have a bug tracker?
Created attachment 4168 [details] gzipped small test perl program, simulating the case

Depends on what one considers a bug. I'd say it is a SpamAssassin bug. Attached is a small perl program which defines an array of text chunks from the very same sample mail in question and runs them through two of the most time-consuming rules from the timing report. The string encoding and its utf8 flag are the same as they really are in the real command-line spamassassin run, eval'ed under compiled body tests. What is amazing is that if the utf8 flag on the text(!) is turned off, it runs 30 times faster!

bytes t = 0.120686 s
chars t = 3.617508 s

(btw, I gzipped the test program to make sure the 8-bit characters remain intact)
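The effect can be reproduced with a much smaller standalone script. This is a hypothetical sketch in the spirit of attachment 4168, not the attachment itself: the sample text, the stand-in rule pattern, and the iteration count are all made up. The only thing it preserves is the key variable, whether the utf8 flag is set on the matched string.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);
use Encode ();

# The same octet sequence, once kept as raw bytes (utf8 flag off) and
# once decoded to characters (utf8 flag on).
my $bytes = "\xd0\xd2\xc9\xd7\xc5\xd4 " x 5000;   # koi8-r octets
my $chars = Encode::decode('koi8-r', $bytes);     # character string, utf8 flag set

my $re = qr/(?:lottery|trust|money)/i;            # stand-in for a SARE-style rule

for my $case ([bytes => $bytes], [chars => $chars]) {
    my ($name, $text) = @$case;
    my $t0 = time;
    $text =~ $re for 1 .. 200;
    printf "%s t = %.6f s\n", $name, time() - $t0;
}
```

The absolute timings depend heavily on the perl version; the point is only that the two runs scan identical rule patterns over the same logical text, differing solely in the string's internal representation.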
(In reply to comment #4)
> Depends on what one considers a bug. I'd say it is a SpamAssassin bug.

How would you propose to fix it in SA? By re-adding "use bytes"? Interestingly, it seems to work a bit better with blead:

: jm 157...; perl -T 0.lis
bytes t = 0.753940 s
chars t = 4.609974 s
: jm 158...; bleadperl -T 0.lis
bytes t = 0.062525 s
chars t = 1.208566 s
Created attachment 4169 [details] proposed patch

> how would you propose to fix it in SA? by re-adding "use bytes"?

No. By using a supported method that actually works, instead of utf8::downgrade, about which the utf8 man page says:

  NOTE: this function is experimental and may change or be removed
  without notice.

and which, as it turns out, doesn't do its job. See the attached patch.
Created attachment 4170 [details] actually, this one
Created attachment 4173 [details] same patch but without a "use Encode"

Encode is already loaded by Conf.pm and need not be loaded here. This avoids going through the same load/don't-load decisions already handled in Conf.pm so that the code can still run under Perl 5.6.
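The idea behind the patch can be sketched as follows. This is a hedged illustration only, not the literal diff in attachment 4173; `to_utf8_octets` is an invented name for exposition. The point is that Encode::encode unconditionally produces an octet string with the utf8 flag clear, unlike utf8::downgrade, which gives up on characters above U+00FF.

```perl
use strict;
use warnings;
use Encode ();

# Illustrative helper: re-encode a decoded character string back to
# UTF-8 octets.  encode() always returns an octet string with the utf8
# flag clear, regardless of which characters are present.
sub to_utf8_octets {
    my ($text) = @_;
    return Encode::encode('UTF-8', $text);
}

my $chars  = Encode::decode('koi8-r', "\xd0");   # Cyrillic U+043F, utf8 flag on
my $octets = to_utf8_octets($chars);             # two octets, utf8 flag off

printf "flag: %d, bytes: %d\n",
    utf8::is_utf8($octets) ? 1 : 0, length $octets;
```

Body rules then match against plain octets, avoiding the slow wide-character regexp paths.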
I wonder if perhaps this bug should be closed and Bug 4636 reopened, as that change doesn't do its job well.

I'm not sure I know what the true purpose of utf8::downgrade in r366594 is (comment 12 in Bug 4636). If the intention was to clear the utf8 flag for speed, it doesn't work that way: utf8::downgrade aborts its operation, leaving the utf8 flag on (with or without throwing an exception, depending on its second argument), if the string contains a character that has no representation in the current locale charset.

My proposed patch just makes sure a conversion is made from characters to UTF-8 octets, thus ensuring the regexps in rules need not deal with wide characters, which makes them terribly slow.

If the intention of the original utf8::downgrade call was to really convert to a locale character set, that needs to be documented, and the target charset probably needs to be configurable. Also, let's not forget that utf8::downgrade is unsupported/experimental in Perl and needs to be replaced with something else anyway.

Reopen Bug 4636, or keep this topic here?
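The failure mode described above can be seen directly. This is a small illustrative sketch, not SpamAssassin code; the sample strings are made up:

```perl
use strict;
use warnings;
use Encode ();

my $latin = Encode::decode('ISO-8859-1', "caf\xe9");  # all chars <= U+00FF
my $cyr   = Encode::decode('koi8-r', "\xd0\xd2");     # Cyrillic chars > U+00FF

# Succeeds: every character fits in a single byte, so the flag is cleared.
utf8::downgrade($latin);

# With FAIL_OK true, downgrade returns false instead of dying, and the
# utf8 flag stays set on the string -- exactly the case that leaves
# body text wide and rules slow.
my $ok = utf8::downgrade($cyr, 1);

printf "latin flag: %d, downgrade ok: %d, cyr flag: %d\n",
    utf8::is_utf8($latin) ? 1 : 0, $ok ? 1 : 0, utf8::is_utf8($cyr) ? 1 : 0;
```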
(In reply to comment #9)
> I wonder if perhaps this bug should be closed and the Bug 4636
> reopened, as it doesn't do its job well.
>
> I'm not sure I know what is the true purpose of utf8::downgrade
> in r366594 (comment no.12 in Bug 4636). If the intention was to
> clear the utf8 flag for speed, it doesn't work this way, as the
> utf8::downgrade aborts its operation leaving the utf8 flag on
> (with or without throwing a signal, depending on its second arg)
> if a string contains a character that does not have a representation
> in the current locale charset

Aha, that's not good. :(

> My proposed patch just makes sure that a conversion is made
> from characters to utf-8 octets, this making sure the regexps
> in rules need not deal with 16-bit characters, which makes them
> terribly slow.
>
> If the intention of the original utf8::downgrade call was to really
> convert to a locale character set, it needs to be documented and
> probably the target charset needs to be configurable. Also not to forget
> that utf8::downgrade is unsupported/experimental in Perl and needs
> to be replaced with something else anyway.
>
> Reopening Bug 4636 or keeping this topic here?

I would suggest (a) retitling this bug to be more specific about utf8::downgrade causing the issue, and (b) commenting on bug 4636, indicating that there's a bug in that code and pointing at this one.

That assumes that if we replace the utf8::downgrade usage, we have fixed this bug in its entirety -- is that the case? If not, then the utf8::downgrade bug should be opened as a new issue.
> That assumes that, if we replace the utf8::downgrade usage, we
> have fixed this bug in its entirety -- is that the case? If not,
> then the utf8::downgrade bug should be opened as a new issue.

Yes, that is true. As far as this bug is concerned (i.e. the utf8 flag remains set on certain text passed to rules when normalize_charset is on), the proposed patch does fix it.
The proposed patch is incorrect and this bug is INVALID. The purpose of the utf8::downgrade call is to get the speed benefits of having the utf8 flag cleared when it is possible to represent the characters without using the utf8 flag. In this particular case, it is not possible to represent the characters without having the utf8 flag set, so the call leaves the utf8 flag set as it is intended to do. Many rules are not charset-normalization-aware and thus may perform poorly or incorrectly with charset normalization enabled. For example, I have seen rules test for non-ASCII by using [\x80-\xff]. With charset normalization, they need to instead use [^\x00-\x7f]. Similarly, rules might need to use [0-9] instead of \d. Similarly \s and \w might catch more characters than intended.
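These pitfalls are easy to demonstrate. A small sketch, with made-up sample strings, assuming the rule sees decoded characters as under normalize_charset:

```perl
use strict;
use warnings;
use Encode ();

# With charset normalization, non-ASCII body text reaches rules as
# decoded characters whose ordinals exceed 0xFF, so byte-oriented
# classes no longer see them.
my $text = Encode::decode('koi8-r', "\xd0\xd2\xc9\xd7\xc5\xd4");  # Cyrillic word

print 'byte class [\x80-\xff]:  ', ($text =~ /[\x80-\xff]/  ? "hit" : "miss"), "\n";
print 'char class [^\x00-\x7f]: ', ($text =~ /[^\x00-\x7f]/ ? "hit" : "miss"), "\n";

# Likewise, \d matches any Unicode digit (here FULLWIDTH DIGIT ONE),
# while [0-9] matches only the ASCII digits.
my $fw = "\x{FF11}";
print '\d:    ', ($fw =~ /\d/    ? "hit" : "miss"), "\n";
print '[0-9]: ', ($fw =~ /[0-9]/ ? "hit" : "miss"), "\n";
```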
I'm not sure that closing this bug was the appropriate action, certainly not as INVALID. The original complaint, that some rules are abnormally slow, still seems quite valid.

It should also be noted that these rules have existed for years and were NOT slow until recently. That is because the module previously contained a 'use bytes' directive, which is no longer present. Presumably the 'use bytes' was forcing byte semantics rather than UTF-8 character semantics on strings, whether or not they could be validly represented as bytes. And the rules in question (and many, probably most, other existing rules) were tested under this regime and found not to FP excessively.
Did normalize_charset get turned on by default recently? With normalize_charset disabled, the code in question won't be run and nothing will ever get the utf8 flag set. With normalize_charset enabled, one has to pay the associated cost and be careful about the set of rules used. The old "use bytes" did nothing useful as nothing ever had the utf8 flag set.
The body pattern test doesn't match when the body pattern rule is written in Japanese with UTF-8 encoding. However, the body pattern test does match when the code is rewritten to use Encode. The following note appears in 'perldoc utf8':

* $success = utf8::downgrade($string[, FAIL_OK])

  [snip]

  Note that this function does not handle arbitrary encodings.
  Therefore Encode.pm is recommended for the general purposes.

  Not affected by the encoding pragma.

  NOTE: this function is experimental and may change or be removed
  without notice.
(In reply to comment #15)
> The body pattern test doesn't match even if the body pattern rule written in
> Japanese, which encoding is UTF-8, is used.

SA rules files are encoded in ISO-8859-1, not UTF-8. You have to either encode Japanese characters in pattern tests using \x sequences, or develop a new feature adding support for UTF-8 config files to SA.

> The following attention is written in 'perldoc utf8'
> * $success = utf8::downgrade($string[, FAIL_OK])
>
> snip...
>
> Note that this function does not handle arbitrary encodings.
> Therefore Encode.pm is recommended for the general purposes.
>
> Not affected by the encoding pragma.
>
> NOTE: this function is experimental and may change or be removed
> without notice.

The "NOTE" has been there for years and is most likely out of date. The only alternatives to using that function are to reimplement the same semantics some other way, or to impose the performance penalty in cases where it is not necessary.
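For example, a rule could spell out the UTF-8 octets of Japanese text with \x escapes so the rules file itself stays ISO-8859-1. This is a hypothetical fragment: __JP_MURYOU is an invented rule name, the escapes are the UTF-8 octets of the two kanji meaning "free of charge", and it assumes the body text the rule sees consists of UTF-8 octets (as with the proposed patch), not decoded characters:

```
body __JP_MURYOU  /\xe7\x84\xa1\xe6\x96\x99/
```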
FYI, the UTF-8 parsing inefficiencies were noted and discussed (with no real results) about five months ago, see the open ticket at http://rt.perl.org/rt3//Public/Bug/Display.html?id=66852