Bug 4691 - fast_body, fast_rawbody, and tmethod support
Summary: fast_body, fast_rawbody, and tmethod support
Status: NEW
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules (show other bugs)
Version: SVN Trunk (Latest Devel Version)
Hardware: All other
: P4 enhancement
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
URL: http://www.engelken.net/download/rang...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-11-17 21:56 UTC by Dallas Engelken
Modified: 2011-09-07 05:29 UTC (History)
2 users (show)



Attachment Type Modified Status Actions Submitter/CLA Status
fast_body, fast_rawbody, and tmethod patch None Dallas Engelken [HasCLA]
'range' patch patch None Dallas Engelken [HasCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Dallas Engelken 2005-11-17 21:56:50 UTC
here we go.  a patch to add support for 'tmethod'.  similar to 'tflags', 
tmethod can be set on individual rules to alter the behavior of its regexp 
testing.  the default tmethod is still per-line for every line of content, but 
this gives you the ability to define things like

 - 'line X' to only scan a single line number
 - 'linerange X:Y' to apply a regexp to a range of lines
 - 'string X:Y' to apply a regexp to a single block of content, x=start 
position, y=number of bytes.  x can be negative to go from the end.
 - 'stringtrim X:Y' which is same as string but cleans up newlines and 
whitespace to make writing multiline rawbody rules a breeze.

i've also implemented "canned" fast_body and fast_rawbody rules, which 
basically predefine the tmethod to "stringtrim" for you using 8kb (body) and 
32kb (rawbody) settings.

by going this route, all the rules priority settings will still apply...   i 
cant see a need for tmethod support on headers, as they are single lines 
already, and very fast.  i also dont see an urgent need for tmethod on full 
types?? do you?

more imformation regarding the implementation details is below.  you will see 
my patch added to this ticket next!  have fun testing.  ;)


# body rules with tmethod definitions
#########################################################

body     TMETHOD_STRING /test/i
tmethod  TMETHOD_STRING string 0:16
describe TMETHOD_STRING test found in first 16 bytes of body content
score    TMETHOD_STRING 0.1

body     TMETHOD_STRINGTRIM /test/i
tmethod  TMETHOD_STRINGTRIM stringtrim -24:24
describe TMETHOD_STRINGTRIM test found in last 24 bytes of body content
score    TMETHOD_STRINGTRIM 0.1

body     TMETHOD_LINE    /test/i
tmethod  TMETHOD_LINE    line 1
describe TMETHOD_LINE    test found on first line of body content
score    TMETHOD_LINE    0.1

body     TMETHOD_LINERANGE  /test/i
tmethod  TMETHOD_LINERANGE  linerange 1:5
describe TMETHOD_LINERANGE  test found within first 5 lines of body content
score    TMETHOD_LINERANGE 0.1

#########################################################

[3015] dbg: rules: TMETHOD_LINERANGE - scanning line 1
[3015] dbg: rules: TMETHOD_LINERANGE - applying regex /test/i to line 1
[3015] dbg: rules: ran body rule TMETHOD_LINERANGE ======> got hit: "test"
[3015] dbg: rules: TMETHOD_LINERANGE took 0.006965 seconds, using method=
[linerange] args=[1:5]

[3015] dbg: rules: TMETHOD_LINE - scanning line 1
[3015] dbg: rules: TMETHOD_LINE - applying regex /test/i to line 1
[3015] dbg: rules: ran body rule TMETHOD_LINE ======> got hit: "test"
[3015] dbg: rules: TMETHOD_LINE took 0.00593 seconds, using method=[line] args=
[1]

[3015] dbg: rules: TMETHOD_STRING - running tmethod string
[3015] dbg: rules: ran body rule TMETHOD_STRING ======> got hit: "test"
[3015] dbg: rules: TMETHOD_STRING - start position 0, bytes 16, total length 16
[3015] dbg: rules: TMETHOD_STRING took 0.006037 seconds, using method=[string] 
args=[0:16]

[3092] dbg: rules: TMETHOD_STRINGTRIM - running tmethod stringtrim
[3092] dbg: rules: TMETHOD_STRINGTRIM - start position -24, bytes 24, total 
length 24
[3092] dbg: rules: TMETHOD_STRINGTRIM took 0.003509 seconds, using method=
[stringtrim] args=[-24:24]

#########################################################

DEBUG: string [0:16]:
DEBUG: --------------------------
test
blah test 1
DEBUG: -------------------------

DEBUG: stringtrim [-24:24]:
DEBUG: --------------------------
st 1 2 3 check this out
DEBUG: -------------------------



# rawbody rules with tmethod definitions
#########################################################

rawbody  TMETHOD_STRINGTRIM /test/i
tmethod  TMETHOD_STRINGTRIM stringtrim -512:512
describe TMETHOD_STRINGTRIM test found in last 512 bytes of body content
score    TMETHOD_STRINGTRIM 0.1

rawbody  TMETHOD_STRING /test/i
tmethod  TMETHOD_STRING string 0:512
describe TMETHOD_BYTERANGE test found in first 512 bytes of body content
score    TMETHOD_BYTERANGE 0.1

rawbody  TMETHOD_LINE    /test/i
tmethod  TMETHOD_LINE    line 1
describe TMETHOD_LINE    test found on first line of body content
score    TMETHOD_LINE    0.1

rawbody  TMETHOD_LINERANGE  /test/i
tmethod  TMETHOD_LINERANGE  linerange 1:5
describe TMETHOD_LINERANGE  test found within first 5 lines of body content
score    TMETHOD_LINERANGE 0.1

#########################################################

[3559] dbg: rules: TMETHOD_STRING - start position 0, bytes 512, total length 57
[3559] dbg: rules: ran rawbody rule TMETHOD_STRING ======> got hit: "test"
[3559] dbg: rules: TMETHOD_STRING took 0.00501 seconds, using method=[string] 
args=[0:512]

[3559] dbg: rules: ran rawbody rule TMETHOD_LINERANGE ======> got hit: "test"
[3559] dbg: rules: TMETHOD_LINERANGE took 0.003007 seconds, using method=
[linerange] args=[1:5]

[3559] dbg: rules: TMETHOD_LINE took 0.000192 seconds, using method=[line] args=
[1]

[3559] dbg: rules: TMETHOD_STRINGTRIM - start position -512, bytes 512, total 
length 57
[3559] dbg: rules: ran rawbody rule TMETHOD_STRINGTRIM ======> got hit: "test"
[3559] dbg: rules: TMETHOD_STRINGTRIM took 0.004803 seconds, using method=
[stringtrim] args=[-512:512]

#########################################################

DEBUG: string [0:512]:
DEBUG: --------------------------
<html><body><font
    size=3>
test
</font></body></html>


DEBUG: -------------------------

DEBUG: stringtrim [-512:512]:
DEBUG: --------------------------
<html> <body> <font size=3> test </font> </body> </html>
DEBUG: -------------------------



# fast_body and fast_rawbody rules with predefined tmethods
#########################################################

fast_rawbody TEST_FAST_RAWBODY /test/i
describe     TEST_FAST_RAWBODY test found
score        TEST_FAST_RAWBODY 0.1

fast_body    TEST_FAST_BODY /test/i
describe     TEST_FAST_BODY test found
score        TEST_FAST_BODY 0.1

#########################################################

[5704] dbg: rules: ran body rule TEST_FAST_BODY ======> got hit: "test"
[5704] dbg: rules: TEST_FAST_BODY - start position 0, bytes 8192, total length 
11
[5704] dbg: rules: TEST_FAST_BODY took 0.005431 seconds, using method=
[stringtrim] args=[0:8192]

[5704] dbg: rules: ran rawbody rule TEST_FAST_RAWBODY ======> got hit: "test"
[5704] dbg: rules: TEST_FAST_RAWBODY - start position 0, bytes 32768, total 
length 62
[5704] dbg: rules: TEST_FAST_RAWBODY took 0.004835 seconds, using method=
[stringtrim] args=[0:32768]

#########################################################

DEBUG: stringtrim [0:8192]:
DEBUG: --------------------------
test test
DEBUG: -------------------------


DEBUG: stringtrim [0:32768]:
DEBUG: --------------------------
<html> <body> <font size=3> test </font> </body> </html>
DEBUG: -------------------------

#########################################################
Comment 1 Dallas Engelken 2005-11-17 21:57:39 UTC
Created attachment 3256 [details]
fast_body, fast_rawbody, and tmethod

implements fast_body, fast_rawbody, and tmethod options.
Comment 2 Michael Parker 2005-11-17 22:19:40 UTC
*cough* Plugin! *cough*
Comment 3 Daniel Quinlan 2005-11-17 22:22:03 UTC
Very interesting ideas.

However, this functionality really needs to be a plugin.

Changing the body code itself to do this is a bad idea.  It makes the code much
harder to maintain and there may be other ways to speed up tests (for example,
index()-based matching).
Comment 4 Dallas Engelken 2005-11-17 22:38:46 UTC
i dont see any way to pluginze this and make it work properly.  if you 
pluginize the 'tmethod' property, what good does that do if PMS::do_body_tests
() cant read it?  

what happens to body rule priorities if you have to eval them through a custom 
plugin??   if I want to create some fast_body rules and want them to be the 
first thing that runs to increase efficiency, i cant because they are evals 
which occur after body and rawbody rules.  

      $self->do_body_tests($priority, $decoded);
      $self->do_body_uri_tests($priority, @uris);
      $self->do_body_eval_tests($priority, $decoded);

      $self->do_rawbody_tests($priority, $bodytext);
      $self->do_rawbody_eval_tests($priority, $bodytext);

try to look past the 'fast_body' and 'fast_rawbody' definitions, as those are 
just quick wrappers around the 'tmethod' code which is the meat and potatoes of 
the patch ;)

Comment 5 Dallas Engelken 2005-11-17 22:44:31 UTC
"if I want to create some fast_body rules and want them to be the 
first thing that runs to increase efficiency"

----

realize i say this in advance of the "yet-to-come-short-circuit" testing...  
run the fastest rules first that you can short-circuit on to increase 
efficiency.

body      WANNA_FSCK  /wanna fsck\?/
descript  WANNA_FSCK  'wanna fsck' in first 1kb of data
tmethod   WANNA_FSCK  stringtrim 0:1024
tflags    WANNA_FSCK  shortcircuit
score     WANNA_FSCK  100

;)

Comment 6 Justin Mason 2005-11-18 00:08:09 UTC
btw, I swapped a couple of mails with Dallas about this -- worth noting that it
does provide a significant speed-up -- at least in the original patch.  

(However, I'm concerned that this version, where 'sub '.$rulename.'_body_test'
now includes a lot of body array-manipulation code, may remove that speed
improvement... some more benchmarking/moving code about may be necessary with
this one.)

I'm in favour of the proposed command syntax, although I'd prefer "range"
instead of "tmethod" if possible.

Normally I'm the biggest fan of pluginizing.  However, I think this
functionality is valuable in the core -- unless Daniel or Michael can propose a
good, fast-enough plugin API for it. ;)
Comment 7 Duncan Findlay 2005-11-18 00:26:08 UTC
I suspect a lot of the speed increases would be minimized if this were to be
pluginized. Having said that, I think we need to see some performance numbers
(DProf) to see what kind of a speed increase we are talking about (and what the
performance hit is if we don't use any tmethod rules).
Comment 8 Dallas Engelken 2005-11-18 01:24:44 UTC
i have yet to do performance testing on this, but i plan to.

the performance testing i did previously, was when i implemented a 'tflag fast' 
method, which basically just took a config value for fast_body_max_bytes and 
fast_rawbody_max_bytes, did a substr(content,0,<max_bytes>) and performed a 
single regexp test against that block of content, instead of per line regexp. 

the speed up was quite good depending on the max_bytes values that were set.  
even when the max_bytes for body and rawbody were set to 256kb, it was slightly 
faster to use a single regexp test.  i'll just paste those results here..  it 
was sorta crude testing, but gave me some hope to continue. :)

* no tflags fast on body/rawbody rules
---------------------------------------------------------------
total scantime: 570.591863 seconds
number of files scanned: 740
avg scantime: 0.7707657 seconds
---------------------------------------------------------------

And now with 'tflag fast' set on all body and rawbody rules....

* fast_body_max_bytes    262144
* fast_rawbody_max_bytes 262144
---------------------------------------------------------------
total scantime: 552.70254 seconds
number of files scanned: 740
avg scantime: 0.7465846 seconds
---------------------------------------------------------------

* fast_body_max_bytes     8192 
* fast_rawbody_max_bytes 32768
---------------------------------------------------------------
total scantime: 470.123985 seconds
number of files scanned: 740
avg scantime: 0.6349819 seconds
---------------------------------------------------------------
 
* fast_body_max_bytes     4096
* fast_rawbody_max_bytes 16384
---------------------------------------------------------------
total scantime: 454.957886 seconds
number of files scanned: 740
avg scantime: 0.6145058 seconds
---------------------------------------------------------------

as you can see, a 4k body and 16kb rawbody resulted in about a 20% speedup... 
but this was against the entire ruleset running in tflag fast mode.  in 
reality, thats not something that would ever be done because of loss in hit 
rates, but i think the point here is that it even proves faster at the 256kb 
level.  although i'm sure there are people that run spamc -s  with values 
larger than 256kb, and performance at some point will get worse over that.

so, to justin's point, the increase in size of the function by rule does effect 
the memory footprint for one... hopefully it doesnt effect the speed.  and that 
footprint will grow more with the addition of custom rules/rule sets.

memory usage, stock svn

7658 59.0  1.7 22692 18480 ?       Ss   17:39   0:00 /usr/bin/spamd -d

memory usage, svn + tmethod patch

7593  0.4  1.8 24556 18956 ?       Ss   17:37   0:00 /usr/bin/spamd -d

to duncans point, if you "dont use any tmethod rules", then i guess there is no 
point in adding this to the core.  unless the point is to extend the 
functionality for rule writers.  Actually, extending rule writing functionality 
for SARE is the main reason i'm doing this.. most notably the multiline rawbody 
rules which are impossible to write efficiently now since full /s testing is 
just too expensive.

dallas
Comment 9 Dallas Engelken 2005-11-18 01:49:49 UTC
"I'm in favour of the proposed command syntax, although I'd prefer "range"
instead of "tmethod" if possible."

i chose tmethod (test method) to complement tflags (test flags) because they 
both alter the way the test actually works.   if you use 'range', what range 
would that refer to??  byte range, line range?  do you keep syntax

range RULE_NAME <method> <args>

or do you drop the methods param...

range RULE_NAME <start_pos> <bytes>

does the above only apply to block style single regexp tests and skip per-line 
testing altogether?  do you use

range RULE_NAME line <start> <stop>
range RULE_NAME byte <start_pos> <bytes>

how do you solve the problem of taking newline and extra whitespace out of 
rawbody content.. another param completely? 

there are many ways one could go here...  

Comment 10 John Gardiner Myers 2005-11-20 23:59:11 UTC
The testing of the method should be done outside $evalstr2.  The method won't
change across different messages, so there's no need to put the either the test
or the code for the unused methods inside the function defined by $evalstr2.

With a large number of "fast_body" rules, SpamAssassin will be spending time
repeatedly reconstructing the same string to be searched against.

I thought the reason that body and rawbody were split into arrays was to protect
against worst-case behavior of test regexes.  By pasting those lines back
together, this patch appears to forgo that protection.

In my opinion, this patch adds way to many knobs to tweak.  I don't find the
less than 20% performance improvement that compelling, especially given it hands
spammers an easy way to prevent SpamAssassin from firing rules on their messages.
Comment 11 Loren Wilton 2005-11-21 07:13:07 UTC
> I thought the reason that body and rawbody were split into arrays was to 
protect
> against worst-case behavior of test regexes.  By pasting those lines back
> together, this patch appears to forgo that protection.

That is the theory.  The practice is that spammers have several very trivial 
patterns that are guaranteed to be impossible to detect in SA, short of writing 
a separate plugin for each thing you want to test.  Essentially, any HTML-level 
pattern containing one or more line breaks is undetectable.

The only current solution is to run 'full' rules over the entire email to test 
for those patterns.  This of course will only work with non-encoded mail bodies.

It is perhaps also worth noting that by refusing to allow multiline rawbody, 
and arbitrarily and unpredictably breaking 'body', you are forcing the use 
of 'full' rules for things that could be trivially parsed in connected body or 
rawbody rules.  The only conclusion can be that the dev's believe that is is 
safer and faster to parse the entire email with a regex than it would be to 
parse a small but useful part of the email with the same regex.
Comment 12 Dallas Engelken 2005-11-25 18:12:07 UTC
okay.. new patch.  

it now works properly with 'tflags multiple' no matter if you use 'line' or
'byte',  and it changes the config from 'tmethod' to simply 'range' like justin
requested.   i've also moved most of the code out of the $evalstr2 definition
into its own sub called get_range_data() which saves quite a bit on memory
consumption.

-------------------------------------

body __TEST /test/
range __TEST byte 0:512
tflags __TEST multiple
score __TEST 1.0

# echo -e "From: you\n\ntest\ntest\ntest\n" | spamc

[26087] dbg: range_data: type=byte args=0:512 for rule __TEST
[26087] dbg: rules: running range type byte
[26087] dbg: rules: found "test" at pos 5 - hit #1
[26087] dbg: rules: ran body rule __TEST ======> got hit: "test"
[26087] dbg: rules: found "test" at pos 10 - hit #2
[26087] dbg: rules: ran body rule __TEST ======> got hit: "test"
[26087] dbg: rules: found "test" at pos 15 - hit #3
[26087] dbg: rules: ran body rule __TEST ======> got hit: "test"
[26087] dbg: rules: __TEST got 3 hits

-------------------------------------

body __TEST /test/
range __TEST byte 0:10
tflags __TEST multiple
score __TEST 1.0

# echo -e "From: you\n\ntest\ntest\ntest\n" | spamc

[26073] dbg: range_data: type=byte args=0:10 for rule __TEST
[26073] dbg: rules: running range type byte
[26073] dbg: rules: found "test" at pos 5 - hit #1
[26073] dbg: rules: ran body rule __TEST ======> got hit: "test"
[26073] dbg: rules: found "test" at pos 10 - hit #2
[26073] dbg: rules: ran body rule __TEST ======> got hit: "test"
[26073] dbg: rules: __TEST got 2 hits

-------------------------------------

fast_body __TEST /test/
tflags __TEST multiple
score __TEST 1.0

# echo -e "From: you\n\ntest\ntest\ntest\n" | spamc

[26096] dbg: range_data: type=bytetrim args=0:8192 for rule __TEST
[26096] dbg: rules: running range type bytetrim
[26096] dbg: rules: found "test" at pos 5 - hit #1
[26096] dbg: rules: ran body rule __TEST ======> got hit: "test"
[26096] dbg: rules: found "test" at pos 10 - hit #2
[26096] dbg: rules: ran body rule __TEST ======> got hit: "test"
[26096] dbg: rules: found "test" at pos 15 - hit #3
[26096] dbg: rules: ran body rule __TEST ======> got hit: "test"
[26096] dbg: rules: __TEST got 3 hits

-------------------------------------

# echo -e "From: you\nContent-type: text/html\n\n<html>\n<body>\n\ntest<br
/>\ntest<br />\ntest<br />\n</body>\n</html>" | spamc

fast_rawbody __TEST /test/
tflags __TEST multiple
score __TEST 1.0

[26175] dbg: range_data: type=bytetrim args=0:32768 for rule __TEST
[26175] dbg: rules: ran rawbody rule __TEST ======> got hit: "test"
[26175] dbg: rules: ran rawbody rule __TEST ======> got hit: "test"
[26175] dbg: rules: ran rawbody rule __TEST ======> got hit: "test"

-------------------------------------


Comment 13 Dallas Engelken 2005-11-25 18:13:06 UTC
Created attachment 3273 [details]
'range' patch

implements 'range', 'fast_body' and 'fast_rawbody' methods.
Comment 14 Dallas Engelken 2005-11-25 19:26:43 UTC
well.. i've done my speed test with and without the patch.  Average scan times
were better with the patch than without... dont know why. 

memory usage goes up as expected..

starting spamd in both cases as..
# spamd -d  -L -s mail

------------------------------------------------------------

Memory & Speed of Stock SVN (3.2.0-r322462)

root     31237 54.2  2.4 31448 25236 ?       Ss   11:21   0:02 /usr/bin/perl -T
-w /usr/bin/spamd -d -D -L -s mail
root     31240  0.0  2.4 31448 25228 ?       S    11:21   0:00 spamd child
root     31241  0.0  2.4 31448 25228 ?       S    11:21   0:00 spamd child

total scantime: 761.461411 seconds
number of files scanned: 987
avg scantime: 0.7708705 seconds

------------------------------------------------------------

Memory & Speed after applying range patch... (3.2.0-r322462)

root       890 33.8  2.7 33468 28612 ?       Ss   12:01   0:02 /usr/bin/spamd -d
-L -s mail
root       892  0.0  2.7 33468 28612 ?       S    12:01   0:00 spamd child
root       893  0.0  2.7 33468 28612 ?       S    12:01   0:00 spamd child

total scantime: 756.437461 seconds
number of files scanned: 987
avg scantime: 0.7657802 seconds

------------------------------------------------------------

other benchmarks welcome... thanks.



Comment 15 Dallas Engelken 2005-11-25 19:29:00 UTC
followup to comment 14

no rules were applied using any range parameters... the test was just to see
what overhead the extra code added.
Comment 16 Dallas Engelken 2005-11-25 23:09:35 UTC
with this patch running, and testing various SARE rulesets, i'm seeing 
thousands of Malformed UTF-8 warn messages...

[8498] warn: Malformed UTF-8 character (unexpected non-continuation byte 0x00, 
immediately after start byte 0xcf) in pattern match (m//) 
at /usr/share/spamassassin/70_sare_adult.cf, rule SARE_OBFUPUSS, line 1, 
<GEN10> line 266.

when i remove this regex test, they go away...
 while ($fb =~ '.$pat.'g) {

is there a better way to do this??  i see the regex test in 'sub do_full_tests' 
uses 
        while ($$fullmsgref =~ '.$pat.'g) {
 
yet it never produces UTF-8 warnings when I change those body rules (ie 
SARE_OBFUPUSS) from type body to type full.
 
is this because of the way the $evalstr is built up in do_full_tests()?
any recommendations to avoid these UTF-8 warns?

thanks...
  
Comment 17 Loren Wilton 2005-11-26 13:21:35 UTC
What are the chances that you are concatenating a bunch of separate strings, 
and somehow ending up with the null terminators in the compound string?  Sounds 
like that might produce this kind of warning.
Comment 18 John Gardiner Myers 2005-11-28 08:27:53 UTC
More likely when you truncate $fb, you somehow do it so the last UTF-8 sequence
ends mid-byte on some of your test case messages.
Comment 19 Dallas Engelken 2005-11-28 15:10:04 UTC
i've commented out the substr of $fb, and changed the regexp to match a 
constant string, and still see the Malformed UTF-8

          #  $fb = substr($fb,$arg1,$arg2);
          '.$self->hash_line_for_rule($rulename).'
          pos $fb = 0;
          #while ($fb =~ '.$pat.'g) {
          while(q#blah# =~ '.$pat.'g) {

even if i change it to 

              while(q#blah# =~ /this is a test/g) {
 
i'm still seeing malformed utf-8 in debug...  so this must be coming from some 
where else.    also,  just saw a report on the SARE user list this morning 
about "hundreds of malformed utf-8 errors on svn r349275". 




Comment 20 Justin Mason 2005-11-28 20:26:13 UTC
John's theory sounds very plausible to me.

can we get a list of perl versions that this occurs with?  UTF-8 errors are very
variable based on perl release.
Comment 21 Justin Mason 2005-11-28 20:35:14 UTC
also, is it always 'unexpected non-continuation byte 0x00'?   0x00 would
indicate a UTF-8 char chopped off at end of string.  full debug logs may be
enlightening.
Comment 22 John Gardiner Myers 2005-11-28 20:42:55 UTC
The SARE rule-related warnings are being tracked in bug 3787.  They reportedly
happen even without this patch.
Comment 23 Dallas Engelken 2005-11-28 20:51:31 UTC
yes.. it was a premature diagnosis.  the utf-8 warns are content specific... 
and happen even against the  if ('.$pat.') code. 

Comment 24 Justin Mason 2006-02-12 01:24:00 UTC
ping!  just wondering what people think about this patch.

I think we got distracted by the utf-8 issue...
Comment 25 Loren Wilton 2006-02-12 01:35:53 UTC
I'm about 130% in favor of this.  If nothing else it would finally make rawbody 
a usable type for rule writers and we can get away from writing 'full' rules 
that will still only catch half the spams.
Comment 26 Dallas Engelken 2006-02-12 18:32:56 UTC
i've been running it for 2+ months on a uri scraper that i use for uribl.com.  
and for about 4-5 weeks on ~1000 mtas.  since i'm scraping alot of html data, i 
have to use lots of rawbody rules and it comes in quite handy.   Something that 
I couldnt catch before was this

<html>
www.myspamuri.com
</html>

and now i can...

rawbody         HTML_URI_ONLY           m'<html> ?(<body> )?(www\.)?[a-z0-9\-]
{5,64}\.(com|net|info|biz) ?(</body>)? ?</html>'i
range           HTML_URI_ONLY           bytetrim 0:256


couple things i'll note from my previous patch.  the first dbg() call in 
get_range_data() causes lots of debug because its called per rule..  so that 
should be removed or commented out.

also negative offsets supplied on a byte range do not work due to this line..

+  if ($args && $args =~ m/(\d+)(:(\d+))?/) {

should be

+  if ($args && $args =~ m/(\-?\d+)(:(\d+))?/) {

This makes a rule like this start to work.

body            __FREEBIE_FOOTER        /Home.{1,5}Disclaimer.{1,5}Privacy 
Policy.{1,5}Unsubscribe/i
range           __FREEBIE_FOOTER        byte -256:256

i can make a new patch if necessary, but nothing else has changed.   adding 
range checking to full rule types should probably be added as well.

d



Comment 27 Daniel Quinlan 2006-02-20 10:42:47 UTC
I'm concerned that this patch convolutes two different issues.

One is wanting to test specific parts of the text.  I'm not sure how often
this crops up (I've only *really* wanted it once and ended up not needing
it), but it could be a useful efficacy addition.

The other is the need for a better rawbody test.  This seems like a good
idea, but I'd prefer to just change how rawbody works.  I think that's
something we could do regardless.  Make a new test type for rawbody that
is more useful in general.

Anyhow, the one time I wanted to test a specific range, an eval did the
job fine.  In other words, I still think this needs to be a plugin.  If
plugins don't allow what is needed, then we should try to address that.

I should mention that this bug resembles bug 1050...
Comment 28 Dallas Engelken 2006-02-20 18:13:29 UTC
my concern is that pluginizing this will break rule priorities since body evals 
are handled after body rules, rawbody evals are handled after rawbody rules.

and if the primary reason for this is efficiency, prioritizing range based 
rawbody rules higher because they are faster would not be possible.

i'm not sure how a pluginzied check() method will alter this functionality, but 
it would be nice to prioritize rules regardless of their rule type, tflags, 
meta dependancies, etc etc. 
Comment 29 John Gardiner Myers 2006-02-20 18:34:53 UTC
With a few exceptions (meta and DNS) it is currently possible to prioritize
rules regardless of rule type.  Just assign them the priorities you want.
Comment 30 Fred T 2006-04-19 18:04:32 UTC
Dallas, think you could whip up a patch for just adding tflag multiple into 
current SVN?  This feature alone *might* be worth some value to us all..
Comment 31 Dallas Engelken 2006-04-19 18:13:37 UTC
fred, i believe tflag multiple support is already in svn.

=item tflags SYMBOLIC_TEST_NAME [ {net|nice|learn|userconf|noautolearn|multiple} ]

=item multiple

The test will be evaluated multiple times, for use with meta rules.
Only affects header, body, rawbody, uri, and full tests.
Comment 32 Adam Katz 2011-09-07 03:46:05 UTC
Is this patch still feasible?

I'd like to resurrect this idea in a slightly different manner:  default limits.  Something like this (assuming "string 0:0" removes any scope from the query):

  default_tmethod  string 0:512000
  
  body    TEST_WHOLE_MSG  /\btest\b/
  tmethod TEST_WHOLE_MSG  string 0:0
  body    TEST_FROM_END   /\bsincerely,/i
  tmethod TEST_FROM_END   string -100:100

With uri, body, rawbody, etc. limited to just the first 500kB (the default max_size for spamc), we should theoretically then be able to increase the default max_size considerably without much overhead.  Any rule (such as the two defined above) that wants to re-scopes itself can do so explicitly.

This should significantly reduce scan time, especially if a script regularly drills into the timing logs and reduces the scope of the more resource-intensive rules (incrementally until it affects the hit stats).
Comment 33 Henrik Krohns 2011-09-07 05:29:54 UTC
(In reply to comment #32)
>
> This should significantly reduce scan time, especially if a script regularly
> drills into the timing logs and reduces the scope of the more
> resource-intensive rules (incrementally until it affects the hit stats).

A 512kB limit will not significantly reduce anything. After all, how many mails are bigger than that?

Have you seen my tests? Even a 30kB limit was only 7% faster given enough corpus to test.

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6582