Bug 64309

Summary: Improve repository regular expression performance used for bootstrapping Catalina
Product: Tomcat 9 Reporter: Paul Muriel Biya-Bi <paulmuriel.biyabi>
Component: CatalinaAssignee: Tomcat Developers Mailing List <dev>
Status: RESOLVED FIXED    
Severity: enhancement    
Priority: P2    
Version: 9.0.x   
Target Milestone: -----   
Hardware: PC   
OS: Linux   

Description Paul Muriel Biya-Bi 2020-04-05 20:22:13 UTC
The goal of this enhancement is to improve the regular expression used for searching class loader repositories when bootstrapping Catalina in Tomcat 9

The regular expression currently used is (\".*?\")|(([^,])*)
The above regular expression an alternation.
An alternation is used to match a single regular expression out of several possible regular expressions.
it uses a vertical bar or pipe symbol (|) to separate the different options in a regular expression.
In our case, the different options are (\".*?\") and (([^,])*)

Our enhancement is about the first option, namely, (\".*?\") 

As can be seen in the first option, there is a question mark is in fact a lazy quantifier. Lazy quantifiers makes repeatable symbols such as the asterisk (*) match as few characters as possible. That is fine. But that comes with a cost called, backtracking, implemented by regex-directed engines. There are broadly two types of regular expression engines, namely, regex-directed engines and text-directed engines. The Java regular expression engine is regex-directed, which means it implements backtracking. Backtracking slows down regex-directed engines. Therefore, if it is possible to get rid of backtracking by finding an equivalent regular expression, then the application gains in performance.

Now, looking at the regular (\".*?\") expression, it simply says: “Give me the shortest sequence of characters enclosed in double quotes”. The word shortest comes from the usage of the lazy quantifier (?) as explained above. An equivalent regular expression is (\"[^\"]*\")

The equivalent regular expression replaces the .* with the [^\"] (negated character class). The negated character class does not use backtracking and is therefore by far more efficient.

Let us examine the efficiencies of both regular expressions on a sample string, namely (\"abc\") 

Case 1: With the lazy quantifier.
 In this case, the regex engine does the following:

It compares the first token \" in the regular expression with the first character \" in the string. Both match. The engine now compares the next token .* to the second character a in the string. Again, both match. Because of the lazy quantifier (?), the engine compares the next character b in the string to the to the last token \"  in the regular expression. But both do not match. The engine backtracks to the .* token and notices that b matches that token. The engine again compares the next character c to the last token \" in the regular expression. They do not match. The engine backtracks again to .*  and notices  that c matches that token. The engine now compares the last token \" in the regular expression to the last character \" in the string, and finds that both match. The engine reports a match. For the above string with 3 characters (a, b, and c) between quotes, it took 7 steps for the engine to find a match. For n (n>=1) characters between quotes, it will take the engine 3 + 2(n-1) steps.

Case 2: With the negated character class.
In this case, the regex engine does:

It compares the first token \" in the regular expression with the first character \" in the string. Both match. The engine now compares the next token [^\"] to the second character a in the string. They both match (because the character a is not equal to the quote character). The engine again compares [^\"] to the next character b in the string (because of the repetition symbol *). Again, they match. The engine again compares [^\"] to the next character c in the string (because of the repetition symbol *).  Again, they match. The engine again compares [^\"] to the next character \" in the string (because of the repetition symbol *). Now, they do not match because of the negation (^). However, because at least one match was found for the repetition symbol, the engine compares the last token \" in the regular expression with the last character \" in the string. Both match. The engine reports a match.

For the above string with 3 characters (a, b, and c) between quotes, it took 6 steps for the engine to find a match. For n (n>=1) characters between quotes, it will take the engine n + 3 steps.

Therefore, for case 1, we have  3 + 2(n-1) steps, and for case 2, we have  n + 3 steps. As n grows, 3 + 2(n-1) becomes very bigger than n + 3.

For example, if n = 50:
3 + 2(n-1)  gives 101
n + 3 gives 53
We therefore avoid 48 computations with the negated character sequence.

In conclusion, we gain be using (\"[^\"]*\")|(([^,])*) rather than (\".*?\")|(([^,])*) for searching class loader repositories when bootstrapping Catalina in Tomcat 9.

I will create a pull request for the enhancement if this proposal is approved.
Comment 1 Mark Thomas 2020-04-06 14:46:37 UTC
No objections but I do wonder if the performance difference is noticeable or even measurable given that this happens once on start-up.
Comment 2 Mark Thomas 2020-04-22 09:37:02 UTC
Fixed in:
- master for 10.0.0-M5 onwards
- 9.0.x for 9.0.35 onwards
- 8.5.x for 8.5.55 onwards