SA Bugzilla – Bug 4691
fast_body, fast_rawbody, and tmethod support
Last modified: 2011-09-07 05:29:54 UTC
here we go. a patch to add support for 'tmethod'. similar to 'tflags', tmethod can be set on individual rules to alter the behavior of its regexp testing. the default tmethod is still per-line for every line of content, but this gives you the ability to define things like - 'line X' to only scan a single line number - 'linerange X:Y' to apply a regexp to a range of lines - 'string X:Y' to apply a regexp to a single block of content, x=start position, y=number of bytes. x can be negative to go from the end. - 'stringtrim X:Y' which is same as string but cleans up newlines and whitespace to make writing multiline rawbody rules a breeze. i've also implemented "canned" fast_body and fast_rawbody rules, which basically predefine the tmethod to "stringtrim" for you using 8kb (body) and 32kb (rawbody) settings. by going this route, all the rules priority settings will still apply... i cant see a need for tmethod support on headers, as they are single lines already, and very fast. i also dont see an urgent need for tmethod on full types?? do you? more imformation regarding the implementation details is below. you will see my patch added to this ticket next! have fun testing. ;) # body rules with tmethod definitions ######################################################### body TMETHOD_STRING /test/i tmethod TMETHOD_STRING string 0:16 describe TMETHOD_STRING test found in first 16 bytes of body content score TMETHOD_STRING 0.1 body TMETHOD_STRINGTRIM /test/i tmethod TMETHOD_STRINGTRIM stringtrim -24:24 describe TMETHOD_STRINGTRIM test found in last 24 bytes of body content score TMETHOD_STRINGTRIM 0.1 body TMETHOD_LINE /test/i tmethod TMETHOD_LINE line 1 describe TMETHOD_LINE test found on first line of body content score TMETHOD_LINE 0.1 body TMETHOD_LINERANGE /test/i tmethod TMETHOD_LINERANGE linerange 1:5 describe TMETHOD_LINERANGE test found within first 5 lines of body content score TMETHOD_LINERANGE 0.1 ######################################################### [3015] dbg: rules: TMETHOD_LINERANGE - scanning line 1 [3015] dbg: rules: TMETHOD_LINERANGE - applying regex /test/i to line 1 [3015] dbg: rules: ran body rule TMETHOD_LINERANGE ======> got hit: "test" [3015] dbg: rules: TMETHOD_LINERANGE took 0.006965 seconds, using method= [linerange] args=[1:5] [3015] dbg: rules: TMETHOD_LINE - scanning line 1 [3015] dbg: rules: TMETHOD_LINE - applying regex /test/i to line 1 [3015] dbg: rules: ran body rule TMETHOD_LINE ======> got hit: "test" [3015] dbg: rules: TMETHOD_LINE took 0.00593 seconds, using method=[line] args= [1] [3015] dbg: rules: TMETHOD_STRING - running tmethod string [3015] dbg: rules: ran body rule TMETHOD_STRING ======> got hit: "test" [3015] dbg: rules: TMETHOD_STRING - start position 0, bytes 16, total length 16 [3015] dbg: rules: TMETHOD_STRING took 0.006037 seconds, using method=[string] args=[0:16] [3092] dbg: rules: TMETHOD_STRINGTRIM - running tmethod stringtrim [3092] dbg: rules: TMETHOD_STRINGTRIM - start position -24, bytes 24, total length 24 [3092] dbg: rules: TMETHOD_STRINGTRIM took 0.003509 seconds, using method= [stringtrim] args=[-24:24] ######################################################### DEBUG: string [0:16]: DEBUG: -------------------------- test blah test 1 DEBUG: ------------------------- DEBUG: stringtrim [-24:24]: DEBUG: -------------------------- st 1 2 3 check this out DEBUG: ------------------------- # rawbody rules with tmethod definitions ######################################################### rawbody TMETHOD_STRINGTRIM /test/i tmethod TMETHOD_STRINGTRIM stringtrim -512:512 describe TMETHOD_STRINGTRIM test found in last 512 bytes of body content score TMETHOD_STRINGTRIM 0.1 rawbody TMETHOD_STRING /test/i tmethod TMETHOD_STRING string 0:512 describe TMETHOD_BYTERANGE test found in first 512 bytes of body content score TMETHOD_BYTERANGE 0.1 rawbody TMETHOD_LINE /test/i tmethod TMETHOD_LINE line 1 describe TMETHOD_LINE test found on first line of body content score TMETHOD_LINE 0.1 rawbody TMETHOD_LINERANGE /test/i tmethod TMETHOD_LINERANGE linerange 1:5 describe TMETHOD_LINERANGE test found within first 5 lines of body content score TMETHOD_LINERANGE 0.1 ######################################################### [3559] dbg: rules: TMETHOD_STRING - start position 0, bytes 512, total length 57 [3559] dbg: rules: ran rawbody rule TMETHOD_STRING ======> got hit: "test" [3559] dbg: rules: TMETHOD_STRING took 0.00501 seconds, using method=[string] args=[0:512] [3559] dbg: rules: ran rawbody rule TMETHOD_LINERANGE ======> got hit: "test" [3559] dbg: rules: TMETHOD_LINERANGE took 0.003007 seconds, using method= [linerange] args=[1:5] [3559] dbg: rules: TMETHOD_LINE took 0.000192 seconds, using method=[line] args= [1] [3559] dbg: rules: TMETHOD_STRINGTRIM - start position -512, bytes 512, total length 57 [3559] dbg: rules: ran rawbody rule TMETHOD_STRINGTRIM ======> got hit: "test" [3559] dbg: rules: TMETHOD_STRINGTRIM took 0.004803 seconds, using method= [stringtrim] args=[-512:512] ######################################################### DEBUG: string [0:512]: DEBUG: -------------------------- <html><body><font size=3> test </font></body></html> DEBUG: ------------------------- DEBUG: stringtrim [-512:512]: DEBUG: -------------------------- <html> <body> <font size=3> test </font> </body> </html> DEBUG: ------------------------- # fast_body and fast_rawbody rules with predefined tmethods ######################################################### fast_rawbody TEST_FAST_RAWBODY /test/i describe TEST_FAST_RAWBODY test found score TEST_FAST_RAWBODY 0.1 fast_body TEST_FAST_BODY /test/i describe TEST_FAST_BODY test found score TEST_FAST_BODY 0.1 ######################################################### [5704] dbg: rules: ran body rule TEST_FAST_BODY ======> got hit: "test" [5704] dbg: rules: TEST_FAST_BODY - start position 0, bytes 8192, total length 11 [5704] dbg: rules: TEST_FAST_BODY took 0.005431 seconds, using method= [stringtrim] args=[0:8192] [5704] dbg: rules: ran rawbody rule TEST_FAST_RAWBODY ======> got hit: "test" [5704] dbg: rules: TEST_FAST_RAWBODY - start position 0, bytes 32768, total length 62 [5704] dbg: rules: TEST_FAST_RAWBODY took 0.004835 seconds, using method= [stringtrim] args=[0:32768] ######################################################### DEBUG: stringtrim [0:8192]: DEBUG: -------------------------- test test DEBUG: ------------------------- DEBUG: stringtrim [0:32768]: DEBUG: -------------------------- <html> <body> <font size=3> test </font> </body> </html> DEBUG: ------------------------- #########################################################
Created attachment 3256 [details] fast_body, fast_rawbody, and tmethod implements fast_body, fast_rawbody, and tmethod options.
*cough* Plugin! *cough*
Very interesting ideas. However, this functionality really needs to be a plugin. Changing the body code itself to do this is a bad idea. It makes the code much harder to maintain and there may be other ways to speed up tests (for example, index()-based matching).
i dont see any way to pluginze this and make it work properly. if you pluginize the 'tmethod' property, what good does that do if PMS::do_body_tests () cant read it? what happens to body rule priorities if you have to eval them through a custom plugin?? if I want to create some fast_body rules and want them to be the first thing that runs to increase efficiency, i cant because they are evals which occur after body and rawbody rules. $self->do_body_tests($priority, $decoded); $self->do_body_uri_tests($priority, @uris); $self->do_body_eval_tests($priority, $decoded); $self->do_rawbody_tests($priority, $bodytext); $self->do_rawbody_eval_tests($priority, $bodytext); try to look past the 'fast_body' and 'fast_rawbody' definitions, as those are just quick wrappers around the 'tmethod' code which is the meat and potatoes of the patch ;)
"if I want to create some fast_body rules and want them to be the first thing that runs to increase efficiency" ---- realize i say this in advance of the "yet-to-come-short-circuit" testing... run the fastest rules first that you can short-circuit on to increase efficiency. body WANNA_FSCK /wanna fsck\?/ descript WANNA_FSCK 'wanna fsck' in first 1kb of data tmethod WANNA_FSCK stringtrim 0:1024 tflags WANNA_FSCK shortcircuit score WANNA_FSCK 100 ;)
btw, I swapped a couple of mails with Dallas about this -- worth noting that it does provide a significant speed-up -- at least in the original patch. (However, I'm concerned that this version, where 'sub '.$rulename.'_body_test' now includes a lot of body array-manipulation code, may remove that speed improvement... some more benchmarking/moving code about may be necessary with this one.) I'm in favour of the proposed command syntax, although I'd prefer "range" instead of "tmethod" if possible. Normally I'm the biggest fan of pluginizing. However, I think this functionality is valuable in the core -- unless Daniel or Michael can propose a good, fast-enough plugin API for it. ;)
I suspect a lot of the speed increases would be minimized if this were to be pluginized. Having said that, I think we need to see some performance numbers (DProf) to see what kind of a speed increase we are talking about (and what the performance hit is if we don't use any tmethod rules).
i have yet to do performance testing on this, but i plan to. the performance testing i did previously, was when i implemented a 'tflag fast' method, which basically just took a config value for fast_body_max_bytes and fast_rawbody_max_bytes, did a substr(content,0,<max_bytes>) and performed a single regexp test against that block of content, instead of per line regexp. the speed up was quite good depending on the max_bytes values that were set. even when the max_bytes for body and rawbody were set to 256kb, it was slightly faster to use a single regexp test. i'll just paste those results here.. it was sorta crude testing, but gave me some hope to continue. :) * no tflags fast on body/rawbody rules --------------------------------------------------------------- total scantime: 570.591863 seconds number of files scanned: 740 avg scantime: 0.7707657 seconds --------------------------------------------------------------- And now with 'tflag fast' set on all body and rawbody rules.... * fast_body_max_bytes 262144 * fast_rawbody_max_bytes 262144 --------------------------------------------------------------- total scantime: 552.70254 seconds number of files scanned: 740 avg scantime: 0.7465846 seconds --------------------------------------------------------------- * fast_body_max_bytes 8192 * fast_rawbody_max_bytes 32768 --------------------------------------------------------------- total scantime: 470.123985 seconds number of files scanned: 740 avg scantime: 0.6349819 seconds --------------------------------------------------------------- * fast_body_max_bytes 4096 * fast_rawbody_max_bytes 16384 --------------------------------------------------------------- total scantime: 454.957886 seconds number of files scanned: 740 avg scantime: 0.6145058 seconds --------------------------------------------------------------- as you can see, a 4k body and 16kb rawbody resulted in about a 20% speedup... but this was against the entire ruleset running in tflag fast mode. in reality, thats not something that would ever be done because of loss in hit rates, but i think the point here is that it even proves faster at the 256kb level. although i'm sure there are people that run spamc -s with values larger than 256kb, and performance at some point will get worse over that. so, to justin's point, the increase in size of the function by rule does effect the memory footprint for one... hopefully it doesnt effect the speed. and that footprint will grow more with the addition of custom rules/rule sets. memory usage, stock svn 7658 59.0 1.7 22692 18480 ? Ss 17:39 0:00 /usr/bin/spamd -d memory usage, svn + tmethod patch 7593 0.4 1.8 24556 18956 ? Ss 17:37 0:00 /usr/bin/spamd -d to duncans point, if you "dont use any tmethod rules", then i guess there is no point in adding this to the core. unless the point is to extend the functionality for rule writers. Actually, extending rule writing functionality for SARE is the main reason i'm doing this.. most notably the multiline rawbody rules which are impossible to write efficiently now since full /s testing is just too expensive. dallas
"I'm in favour of the proposed command syntax, although I'd prefer "range" instead of "tmethod" if possible." i chose tmethod (test method) to complement tflags (test flags) because they both alter the way the test actually works. if you use 'range', what range would that refer to?? byte range, line range? do you keep syntax range RULE_NAME <method> <args> or do you drop the methods param... range RULE_NAME <start_pos> <bytes> does the above only apply to block style single regexp tests and skip per-line testing altogether? do you use range RULE_NAME line <start> <stop> range RULE_NAME byte <start_pos> <bytes> how do you solve the problem of taking newline and extra whitespace out of rawbody content.. another param completely? there are many ways one could go here...
The testing of the method should be done outside $evalstr2. The method won't change across different messages, so there's no need to put the either the test or the code for the unused methods inside the function defined by $evalstr2. With a large number of "fast_body" rules, SpamAssassin will be spending time repeatedly reconstructing the same string to be searched against. I thought the reason that body and rawbody were split into arrays was to protect against worst-case behavior of test regexes. By pasting those lines back together, this patch appears to forgo that protection. In my opinion, this patch adds way to many knobs to tweak. I don't find the less than 20% performance improvement that compelling, especially given it hands spammers an easy way to prevent SpamAssassin from firing rules on their messages.
> I thought the reason that body and rawbody were split into arrays was to protect > against worst-case behavior of test regexes. By pasting those lines back > together, this patch appears to forgo that protection. That is the theory. The practice is that spammers have several very trivial patterns that are guaranteed to be impossible to detect in SA, short of writing a separate plugin for each thing you want to test. Essentially, any HTML-level pattern containing one or more line breaks is undetectable. The only current solution is to run 'full' rules over the entire email to test for those patterns. This of course will only work with non-encoded mail bodies. It is perhaps also worth noting that by refusing to allow multiline rawbody, and arbitrarily and unpredictably breaking 'body', you are forcing the use of 'full' rules for things that could be trivially parsed in connected body or rawbody rules. The only conclusion can be that the dev's believe that is is safer and faster to parse the entire email with a regex than it would be to parse a small but useful part of the email with the same regex.
okay.. new patch. it now works properly with 'tflags multiple' no matter if you use 'line' or 'byte', and it changes the config from 'tmethod' to simply 'range' like justin requested. i've also moved most of the code out of the $evalstr2 definition into its own sub called get_range_data() which saves quite a bit on memory consumption. ------------------------------------- body __TEST /test/ range __TEST byte 0:512 tflags __TEST multiple score __TEST 1.0 # echo -e "From: you\n\ntest\ntest\ntest\n" | spamc [26087] dbg: range_data: type=byte args=0:512 for rule __TEST [26087] dbg: rules: running range type byte [26087] dbg: rules: found "test" at pos 5 - hit #1 [26087] dbg: rules: ran body rule __TEST ======> got hit: "test" [26087] dbg: rules: found "test" at pos 10 - hit #2 [26087] dbg: rules: ran body rule __TEST ======> got hit: "test" [26087] dbg: rules: found "test" at pos 15 - hit #3 [26087] dbg: rules: ran body rule __TEST ======> got hit: "test" [26087] dbg: rules: __TEST got 3 hits ------------------------------------- body __TEST /test/ range __TEST byte 0:10 tflags __TEST multiple score __TEST 1.0 # echo -e "From: you\n\ntest\ntest\ntest\n" | spamc [26073] dbg: range_data: type=byte args=0:10 for rule __TEST [26073] dbg: rules: running range type byte [26073] dbg: rules: found "test" at pos 5 - hit #1 [26073] dbg: rules: ran body rule __TEST ======> got hit: "test" [26073] dbg: rules: found "test" at pos 10 - hit #2 [26073] dbg: rules: ran body rule __TEST ======> got hit: "test" [26073] dbg: rules: __TEST got 2 hits ------------------------------------- fast_body __TEST /test/ tflags __TEST multiple score __TEST 1.0 # echo -e "From: you\n\ntest\ntest\ntest\n" | spamc [26096] dbg: range_data: type=bytetrim args=0:8192 for rule __TEST [26096] dbg: rules: running range type bytetrim [26096] dbg: rules: found "test" at pos 5 - hit #1 [26096] dbg: rules: ran body rule __TEST ======> got hit: "test" [26096] dbg: rules: found "test" at pos 10 - hit #2 [26096] dbg: rules: ran body rule __TEST ======> got hit: "test" [26096] dbg: rules: found "test" at pos 15 - hit #3 [26096] dbg: rules: ran body rule __TEST ======> got hit: "test" [26096] dbg: rules: __TEST got 3 hits ------------------------------------- # echo -e "From: you\nContent-type: text/html\n\n<html>\n<body>\n\ntest<br />\ntest<br />\ntest<br />\n</body>\n</html>" | spamc fast_rawbody __TEST /test/ tflags __TEST multiple score __TEST 1.0 [26175] dbg: range_data: type=bytetrim args=0:32768 for rule __TEST [26175] dbg: rules: ran rawbody rule __TEST ======> got hit: "test" [26175] dbg: rules: ran rawbody rule __TEST ======> got hit: "test" [26175] dbg: rules: ran rawbody rule __TEST ======> got hit: "test" -------------------------------------
Created attachment 3273 [details] 'range' patch implements 'range', 'fast_body' and 'fast_rawbody' methods.
well.. i've done my speed test with and without the patch. Average scan times were better with the patch than without... dont know why. memory usage goes up as expected.. starting spamd in both cases as.. # spamd -d -L -s mail ------------------------------------------------------------ Memory & Speed of Stock SVN (3.2.0-r322462) root 31237 54.2 2.4 31448 25236 ? Ss 11:21 0:02 /usr/bin/perl -T -w /usr/bin/spamd -d -D -L -s mail root 31240 0.0 2.4 31448 25228 ? S 11:21 0:00 spamd child root 31241 0.0 2.4 31448 25228 ? S 11:21 0:00 spamd child total scantime: 761.461411 seconds number of files scanned: 987 avg scantime: 0.7708705 seconds ------------------------------------------------------------ Memory & Speed after applying range patch... (3.2.0-r322462) root 890 33.8 2.7 33468 28612 ? Ss 12:01 0:02 /usr/bin/spamd -d -L -s mail root 892 0.0 2.7 33468 28612 ? S 12:01 0:00 spamd child root 893 0.0 2.7 33468 28612 ? S 12:01 0:00 spamd child total scantime: 756.437461 seconds number of files scanned: 987 avg scantime: 0.7657802 seconds ------------------------------------------------------------ other benchmarks welcome... thanks.
followup to comment 14 no rules were applied using any range parameters... the test was just to see what overhead the extra code added.
with this patch running, and testing various SARE rulesets, i'm seeing thousands of Malformed UTF-8 warn messages... [8498] warn: Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xcf) in pattern match (m//) at /usr/share/spamassassin/70_sare_adult.cf, rule SARE_OBFUPUSS, line 1, <GEN10> line 266. when i remove this regex test, they go away... while ($fb =~ '.$pat.'g) { is there a better way to do this?? i see the regex test in 'sub do_full_tests' uses while ($$fullmsgref =~ '.$pat.'g) { yet it never produces UTF-8 warnings when I change those body rules (ie SARE_OBFUPUSS) from type body to type full. is this because of the way the $evalstr is built up in do_full_tests()? any recommendations to avoid these UTF-8 warns? thanks...
What are the chances that you are concatenating a bunch of separate strings, and somehow ending up with the null terminators in the compound string? Sounds like that might produce this kind of warning.
More likely when you truncate $fb, you somehow do it so the last UTF-8 sequence ends mid-byte on some of your test case messages.
i've commented out the substr of $fb, and changed the regexp to match a constant string, and still see the Malformed UTF-8 # $fb = substr($fb,$arg1,$arg2); '.$self->hash_line_for_rule($rulename).' pos $fb = 0; #while ($fb =~ '.$pat.'g) { while(q#blah# =~ '.$pat.'g) { even if i change it to while(q#blah# =~ /this is a test/g) { i'm still seeing malformed utf-8 in debug... so this must be coming from some where else. also, just saw a report on the SARE user list this morning about "hundreds of malformed utf-8 errors on svn r349275".
John's theory sounds very plausible to me. can we get a list of perl versions that this occurs with? UTF-8 errors are very variable based on perl release.
also, is it always 'unexpected non-continuation byte 0x00'? 0x00 would indicate a UTF-8 char chopped off at end of string. full debug logs may be enlightening.
The SARE rule-related warnings are being tracked in bug 3787. They reportedly happen even without this patch.
yes.. it was a premature diagnosis. the utf-8 warns are content specific... and happen even against the if ('.$pat.') code.
ping! just wondering what people think about this patch. I think we got distracted by the utf-8 issue...
I'm about 130% in favor of this. If nothing else it would finally make rawbody a usable type for rule writers and we can get away from writing 'full' rules that will still only catch half the spams.
i've been running it for 2+ months on a uri scraper that i use for uribl.com. and for about 4-5 weeks on ~1000 mtas. since i'm scraping alot of html data, i have to use lots of rawbody rules and it comes in quite handy. Something that I couldnt catch before was this <html> www.myspamuri.com </html> and now i can... rawbody HTML_URI_ONLY m'<html> ?(<body> )?(www\.)?[a-z0-9\-] {5,64}\.(com|net|info|biz) ?(</body>)? ?</html>'i range HTML_URI_ONLY bytetrim 0:256 couple things i'll note from my previous patch. the first dbg() call in get_range_data() causes lots of debug because its called per rule.. so that should be removed or commented out. also negative offsets supplied on a byte range do not work due to this line.. + if ($args && $args =~ m/(\d+)(:(\d+))?/) { should be + if ($args && $args =~ m/(\-?\d+)(:(\d+))?/) { This makes a rule like this start to work. body __FREEBIE_FOOTER /Home.{1,5}Disclaimer.{1,5}Privacy Policy.{1,5}Unsubscribe/i range __FREEBIE_FOOTER byte -256:256 i can make a new patch if necessary, but nothing else has changed. adding range checking to full rule types should probably be added as well. d
I'm concerned that this patch convolutes two different issues. One is wanting to test specific parts of the text. I'm not sure how often this crops up (I've only *really* wanted it once and ended up not needing it), but it could be a useful efficacy addition. The other is the need for a better rawbody test. This seems like a good idea, but I'd prefer to just change how rawbody works. I think that's something we could do regardless. Make a new test type for rawbody that is more useful in general. Anyhow, the one time I wanted to test a specific range, an eval did the job fine. In other words, I still think this needs to be a plugin. If plugins don't allow what is needed, then we should try to address that. I should mention that this bug resembles bug 1050...
my concern is that pluginizing this will break rule priorities since body evals are handled after body rules, rawbody evals are handled after rawbody rules. and if the primary reason for this is efficiency, prioritizing range based rawbody rules higher because they are faster would not be possible. i'm not sure how a pluginzied check() method will alter this functionality, but it would be nice to prioritize rules regardless of their rule type, tflags, meta dependancies, etc etc.
With a few exceptions (meta and DNS) it is currently possible to prioritize rules regardless of rule type. Just assign them the priorities you want.
Dallas, think you could whip up a patch for just adding tflag multiple into current SVN? This feature alone *might* be worth some value to us all..
fred, i believe tflag multiple support is already in svn. =item tflags SYMBOLIC_TEST_NAME [ {net|nice|learn|userconf|noautolearn|multiple} ] =item multiple The test will be evaluated multiple times, for use with meta rules. Only affects header, body, rawbody, uri, and full tests.
Is this patch still feasible? I'd like to resurrect this idea in a slightly different manner: default limits. Something like this (assuming "string 0:0" removes any scope from the query): default_tmethod string 0:512000 body TEST_WHOLE_MSG /\btest\b/ tmethod TEST_WHOLE_MSG string 0:0 body TEST_FROM_END /\bsincerely,/i tmethod TEST_FROM_END string -100:100 With uri, body, rawbody, etc. limited to just the first 500kB (the default max_size for spamc), we should theoretically then be able to increase the default max_size considerably without much overhead. Any rule (such as the two defined above) that wants to re-scopes itself can do so explicitly. This should significantly reduce scan time, especially if a script regularly drills into the timing logs and reduces the scope of the more resource-intensive rules (incrementally until it affects the hit stats).
(In reply to comment #32) > > This should significantly reduce scan time, especially if a script regularly > drills into the timing logs and reduces the scope of the more > resource-intensive rules (incrementally until it affects the hit stats). A 512kB limit will not significantly reduce anything. After all, how many mails are bigger than that? Have you seen my tests? Even a 30kB limit was only 7% faster given enough corpus to test. https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6582