The follows program requires more then 1 minute to check if given string matchs a pattern (some users will think that jvm hangs). BTW if I remove last unterscope in the string the program works less then 1 ms. import org.apache.regexp.RESyntaxException; import org.apache.regexp.RE; public class RegexpTest { public static void main(String[] args) { String format = "^(_?[A-Z0-9]+)*$"; String bad_str = "_NOERROR_AND_LAST_TXN_NOT_OK_AND_LAST_TXN_WAS_"; try { RE regexp = new RE(format); long before = System.currentTimeMillis(); System.err.println(regexp.match(bad_str)); long after = System.currentTimeMillis(); System.err.println(after-before); } catch(RESyntaxException e) { e.printStackTrace(); } } }
The error email says it takes over a minute to match the pattern against "_NOERROR_AND_LAST_TXN_NOT_OK_AND_LAST_TXN_WAS_" (and fail) but less than a millisecond to match the pattern against "_NOERROR_AND_LAST_TXN_NOT_OK_AND_LAST_TXN_WAS" (and succeed). These are the same strings except for the final "_". The "?" in the pattern gives a large degree of indeterminacy in how the match may be made if the greediest match fails. The equivalent pattern "^[A-Z0-9]*(_[A-Z0-9]+)*$" has much less indeterminacy and succeeds or fails very quickly with the above test strings. I am tempted to say that the problem lies more with the pattern than the matcher.
*** This bug has been marked as a duplicate of 9153 ***
I would agree that some regexp is better for the string then another. But I will never agree that ANY regexp should be matched with exponential complexity.
Oleg, Nobody disagrees with you. If you see "slow" regexps and you want to improve performance, please send a patch, it will be considered.