Bug 23303 - Perfomance problem with matching ^(_?[A-Z0-9]+)*$ pattern
Summary: Perfomance problem with matching ^(_?[A-Z0-9]+)*$ pattern
Status: CLOSED DUPLICATE of bug 9153
Alias: None
Product: Regexp
Classification: Unclassified
Component: Other (show other bugs)
Version: unspecified
Hardware: PC All
: P3 normal (vote)
Target Milestone: ---
Assignee: Jakarta Notifications Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2003-09-21 12:35 UTC by Oleg Sukhodolsky
Modified: 2005-01-11 13:37 UTC (History)
0 users



Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Oleg Sukhodolsky 2003-09-21 12:35:58 UTC
The follows program requires more then 1 minute to check 
if given string matchs a pattern (some users will think that jvm hangs).
BTW if I remove last unterscope in the string the program
works less then 1 ms.

import org.apache.regexp.RESyntaxException;
import org.apache.regexp.RE;

public class RegexpTest {
    public static void main(String[] args) {
        String format = "^(_?[A-Z0-9]+)*$";
        String bad_str = 
            "_NOERROR_AND_LAST_TXN_NOT_OK_AND_LAST_TXN_WAS_";
        try {
            RE regexp = new RE(format);
            long before = System.currentTimeMillis();
            System.err.println(regexp.match(bad_str));
            long after = System.currentTimeMillis();
            System.err.println(after-before);
        }
        catch(RESyntaxException e) {
            e.printStackTrace();
        }
    }
}
Comment 1 Rich Mallory 2003-09-22 15:15:10 UTC
The error email says it takes over a minute to match the pattern against
  "_NOERROR_AND_LAST_TXN_NOT_OK_AND_LAST_TXN_WAS_"
(and fail) but less than a millisecond to match the pattern against 
  "_NOERROR_AND_LAST_TXN_NOT_OK_AND_LAST_TXN_WAS"
(and succeed).  These are the same strings except for the final "_".

The "?" in the pattern gives a large degree of indeterminacy in how the match 
may be made if the greediest match fails.  The equivalent pattern
  "^[A-Z0-9]*(_[A-Z0-9]+)*$"
has much less indeterminacy and succeeds or fails very quickly with the above 
test strings.  I am tempted to say that the problem lies more with the pattern 
than the matcher.  
Comment 2 Vadim Gritsenko 2003-09-23 00:35:52 UTC

*** This bug has been marked as a duplicate of 9153 ***
Comment 3 Oleg Sukhodolsky 2003-09-25 03:46:32 UTC
I would agree that some regexp is better for the string then another.
But I will never agree that ANY regexp should be matched with exponential 
complexity.
Comment 4 Vadim Gritsenko 2003-09-25 12:03:51 UTC
Oleg,

Nobody disagrees with you.
If you see "slow" regexps and you want to improve performance, please send a
patch, it will be considered.