Bug 416 - t/lang_pl_tests.t fails
Summary: t/lang_pl_tests.t fails
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: spamassassin
Version: SVN Trunk (Latest Devel Version)
Hardware: PC Linux
Importance: P3 minor
Target Milestone: ---
Assignee: SpamAssassin Developer Mailing List
URL: http://www.opoka.org.pl/SADoS.html
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2002-06-10 11:04 UTC by Jakub Wasielewski
Modified: 2002-10-10 02:11 UTC



Description Jakub Wasielewski 2002-06-10 11:04:17 UTC
Get the messages from URL:
http://www.opoka.org.pl/SADoS.html
Pipe it thru spamassassin -L -D -t < INPUTMSG
On my machines SA takes 100% of CPU and never finishes checking this message.
This is a default installation of SA 2.20 and 2.11.
No idea what is going on with this amazing software :(
Comment 1 Theo Van Dinter 2002-06-10 11:22:57 UTC
Subject: Re: [SAdev]  New: SA DoS?

On Mon, Jun 10, 2002 at 11:04:17AM -0700, bugzilla-daemon@hughes-family.org wrote:
> Get the messages from URL:
> http://www.opoka.org.pl/SADoS.html
> Pipe it thru spamassassin -L -D -t < INPUTMSG
> On my machines SA takes 100% of CPU and never finishes checking this message.
> This is a default installation of SA 2.20 and 2.11.
> No idea what is going on with this amazing software :(

On my P200, SA 2.20 (virgin) will run through that test in approx
28 seconds.

Comment 2 Theo Van Dinter 2002-06-10 15:38:43 UTC
Subject: Re:  SA DoS?

On Mon, Jun 10, 2002 at 11:02:20PM +0200, Jakub Wasielewski wrote:
> > After changing my locale around to "pl", I found this behavior and traced
> > it down to these two rules.  They're not doing anything special, so it's
> > probably something related to the backtracking involved with all of the
> > ".*" statements.
> > 
> > lang pl body PL_JEZELI_NIE             /je.*li.*nie .ycz.*sobie/i
> > lang pl body PL_ARTYKUL_USTAWY         /Art.*25.*ust.*2.*pkt.*2/i
> 
> Hmm, these two are very important!!! The first matches a part saying
> "If you do not want to"... receive this anymore. The second is a part usually
> placed in SPAM explaining that SPAM is not illegal under Polish law.
> 
> > I know no Polish, but the problem may be cleared up with a more strict
> > version of pattern matching.  For instance:

Ok, on the way home I tried some stuff out ...  Indeed, if I make the
regexp more strict (but still matching the description text), SA runs
much much much faster...

Regexp                                   Time for SA run  Comments
=======================================  ===============  ======================================
/je.*li.*nie .ycz.*sobie/i               21.12            # original
/je.\S*li.* nie .ycz\S+ sobie/i          5.59

/Art.*25.*ust.*2.*pkt.*2/i               95+              # original, doesn't match text in desc
/Art\S* 25 ust 2 p\S*kt 2/i              5.52

Since I don't know any Polish, I'm not about to go making patches for
all the 25_body_tests_pl.cf rules.  Basically, like with the English
version, all of the regexps should be made stricter.  Replace ".*" with
".{0,30}" if you actually want to match anything, use "\S*" for matching
non-whitespace (ie: other characters in a single word), add in "\b"
at the start of the regexp (I didn't do this above) if it's a word, etc.

The thing that's killed SA multiple times so far is backtracking in
regular expressions, so by removing pieces that will commonly cause
backtracking (".*", ".+" mostly) we can keep the hangs away.  Streamlining
will also cause CPU usage to go down, thereby speeding up the SA runs. :)
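
A minimal sketch of the advice above, using Python's re module as a
stand-in for the Perl regexes (an assumption made so the example is easy to
run; the constructs used behave the same way in both). The sample sentence
and the FAR_APART string are made up for illustration and are not taken
from the message in this bug. It shows that the stricter and the bounded
variants still match the intended phrase, while the bounded variant stops
the pieces from matching across an arbitrarily long gap.

```python
import re

# Illustrative sample text (hypothetical, not from the bug report):
# a Polish opt-out phrase of the kind the rule is meant to catch.
SAMPLE = "Jeśli nie życzysz sobie otrzymywać takich wiadomości, odpisz."

# The original rule, the stricter rewrite from the table above, and a
# variant with every ".*" replaced by a bounded ".{0,30}".
ORIGINAL = re.compile(r"je.*li.*nie .ycz.*sobie", re.IGNORECASE)
STRICT = re.compile(r"je.\S*li.* nie .ycz\S+ sobie", re.IGNORECASE)
BOUNDED = re.compile(r"je.{0,30}li.{0,30}nie .ycz.{0,30}sobie", re.IGNORECASE)

# The same phrase with 200 filler characters wedged into the middle.
FAR_APART = "jeśli " + "x" * 200 + " nie życzysz sobie"


def matches(rx, text):
    """Return True if the compiled pattern matches anywhere in text."""
    return rx.search(text) is not None


for name, rx in (("original", ORIGINAL), ("strict", STRICT), ("bounded", BOUNDED)):
    print(name, matches(rx, SAMPLE), matches(rx, FAR_APART))
```

The bounded form gives the regex engine far fewer positions to retry when a
later piece of the pattern fails, which is where the time goes in the
original rules.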

Comment 3 Theo Van Dinter 2002-06-11 07:25:39 UTC
Subject: Re:  SA DoS?

On Tue, Jun 11, 2002 at 09:26:38AM +0200, Radoslaw Stachowiak wrote:
> 1. does \S* match high-ascii (128..255)? Because there are Polish
> national characters which use these high values.

Yes.  \S matches non-whitespace characters, so anything that isn't
space, tab, LF, or CR.  (I think the definition is actually anything
that doesn't match the isspace() function, but those 4 are good enough.)
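
A quick way to check this, sketched with Python byte patterns as a stand-in
for Perl matching on raw 8-bit strings. The choice of ISO-8859-2 is an
assumption (it is a common 8-bit encoding for Polish; the bug does not say
which encoding the message used), and the sample word is made up.

```python
import re

# Assumption: the text is ISO-8859-2 (Latin-2) encoded.
word = "życzę".encode("iso-8859-2")

# Bytes in the word that fall in the high range 128..255.
high_bytes = [b for b in word if b >= 128]

# With a bytes pattern, \S is "anything that is not ASCII whitespace",
# so the whole word, high bytes included, is matched by \S+.
word_is_non_space = re.fullmatch(rb"\S+", word) is not None

# The same holds for every single byte value from 128 to 255.
all_high_are_non_space = re.fullmatch(rb"\S+", bytes(range(128, 256))) is not None

print(len(high_bytes), word_is_non_space, all_high_are_non_space)
```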

> 2. Can you give me more examples of proper \b use? Because I don't get
> it..

This is from the perlre man page:

       A word boundary (`\b') is a spot between two characters
       that has a `\w' on one side of it and a `\W' on the other
       side of it (in either order), counting the imaginary char-
       acters off the beginning and end of the string as matching
       a `\W'.

So the idea is that /the full moon/ would also match (contrived)
"bathe full moon", and any other string that has that set of characters
in it.  Doing /\bthe full moon\b/ means that the phrase has to start and
end on a word boundary.  The \b's don't consume a character; they match
the spot between a word character and a non-word character (high-ascii
chars count as non-word here), with the beginning and end of the string
counting as non-word too.

Essentially, using \b (at least for English text) will make the matches
faster and more accurate at the same time.
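
The \b behaviour described above can be tried out with a short sketch.
Python's re module is used here as a stand-in for Perl, and the byte-string
case at the end assumes ISO-8859-2 text; both are assumptions for
illustration. The last check shows why the "at least for English text"
caveat matters: in an 8-bit string a high-ascii letter counts as a non-word
character, so \b can fire in the middle of a Polish word.

```python
import re

loose = re.compile(r"the full moon")
anchored = re.compile(r"\bthe full moon\b")


def hit(rx, text):
    """Return True if the compiled pattern matches anywhere in text."""
    return rx.search(text) is not None


# Without \b the phrase is found inside "bathe full moon".
print(hit(loose, "bathe full moon"))
# With \b it is not, but it is still found as a free-standing phrase,
# including when it touches the start and end of the string.
print(hit(anchored, "bathe full moon"))
print(hit(anchored, "under the full moon tonight"))
print(hit(anchored, "the full moon"))

# Caveat for 8-bit text (assumed ISO-8859-2): the leading high byte of
# "życzę" is a non-word character, so \b matches right after it.
polish_bytes = "życzę".encode("iso-8859-2")
boundary_inside_word = re.search(rb"\bycz", polish_bytes) is not None
print(boundary_inside_word)
```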


Hopefully this helps. :)

Comment 4 Justin Mason 2002-08-14 05:01:52 UTC
BTW, I've gone through the pl rules and replaced all .*'s with
.{0,99} to limit them.
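
For illustration only, a bulk replacement of this kind could be sketched as
below. This is not the script that was actually used (the comment does not
say how the change was made); bound_wildcards is a hypothetical helper, and
it is deliberately naive: it only skips a ".*" whose dot is preceded by a
backslash and knows nothing about character classes.

```python
import re


def bound_wildcards(pattern: str, limit: int = 99) -> str:
    """Replace each unescaped '.*' in a regex source string with '.{0,limit}'."""
    bounded = f".{{0,{limit}}}"
    # The lookbehind leaves an escaped dot such as '\.*' alone.
    return re.sub(r"(?<!\\)\.\*", lambda m: bounded, pattern)


# The two rules quoted in Comment 2, as plain strings.
for rule in ("/je.*li.*nie .ycz.*sobie/i", "/Art.*25.*ust.*2.*pkt.*2/i"):
    print(bound_wildcards(rule))
```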
Comment 5 Justin Mason 2002-08-14 16:53:16 UTC
ok, now verified as fixed in CVS. test added to test suite too.
Comment 6 Craig Hughes 2002-08-23 03:19:13 UTC
The test is broken for me in current b2_4_0 CVS:

[craig@belphegore spamassassin]$ make test
PERL_DL_NONLAZY=1 /usr/bin/perl5.8.0 "-MExtUtils::Command::MM" "-e"
"test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
t/db_based_whitelist........ok                                               
t/db_based_whitelist_ips....ok                                               
t/forged_rcvd...............ok                                               
t/lang_pl_tests.............	Not found: didnt_hang_at_least =  Analiza zawartości: 
# Failed test 1 in t/SATest.pm at line 241
t/lang_pl_tests.............FAILED test 1                                    
	Failed 1/1 tests, 0.00% okay
Comment 7 Justin Mason 2002-08-26 14:22:55 UTC
weird, can't reproduce that...
Comment 8 Justin Mason 2002-08-28 15:50:32 UTC
craig, still seeing this?
Comment 9 Jesus Climent 2002-08-31 00:28:48 UTC
Same here. Latest CVS.

t/lang_pl_tests.....    Not found: didnt_hang_at_least =  Analiza zawartości: 
t/lang_pl_tests.....FAILED test 1                                            
        Failed 1/1 tests, 0.00% okay
Comment 10 Justin Mason 2002-10-10 10:11:46 UTC
this is fixed