Bug 38331

Summary: ArrayIndexOutOfBoundsException under certain conditions
Product: Regexp
Reporter: Josh Rodman <josh_rodman-bgz>
Component: Other
Assignee: Jakarta Notifications Mailing List <notifications>
Status: CLOSED FIXED    
Severity: normal    
Priority: P2    
Version: unspecified   
Target Milestone: ---   
Hardware: PC   
OS: Windows XP   

Description Josh Rodman 2006-01-20 15:35:04 UTC
This code generates an exception when running with jdk1.3.1_17:

RE r123 = new RE("((a|b){1637})");
r123.match("a");

This code works properly:

RE r123 = new RE("((a|b){1638})");
r123.match("a");

This code shows that depending on the number requested, regexp switches between 
working and not working:

boolean lastvalue = true;
for (int i = 1; i < 3650; i += 1) {
    try {
        RE r = new RE("((a|b){" + i + "})");
        r.match("a");
        if (!lastvalue) {
            System.out.println("Switching from NOT to WORKING at " + i + " (" + i + " works) " + lastvalue);
        }
        lastvalue = true;
    } catch (Exception ex) {
        if (lastvalue) {
            System.out.println("Switching from WORKING to NOT at " + i + " (" + i + " doesn't work) " + lastvalue);
        }
        lastvalue = false;
    }
}

If "i" were allowed past 3650, this behavior would switch back and forth a
couple more times before 10000; I have seen it happen above 7000 (this is as
far as I let it test). In RE.java, look under the following signature:

protected int matchNodes(int firstNode, int lastNode, int idxStart)

Look for this line:

next   = node + (short)instruction[node + offsetNext];

Change it to say:

next   = node + (int)instruction[node + offsetNext];

After recompiling and testing with this change, the problem appears to go away,
but I cannot confirm that it doesn't break something else. I'm not sure why
"short" would have been chosen over "int". Maybe there is a hidden reason.
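
For illustration, here is a minimal standalone sketch of how the two casts differ when reading a 16-bit slot. It does not use RE.java itself. The class and method names are made up for this example, and it assumes that offsets may be negative (backward jumps), which would be one possible reason for the "short" cast:

```java
class OffsetCastDemo {
    // Reads a stored slot the way the original line does: as a signed 16-bit value.
    static int asSigned(char stored) {
        return (short) stored;
    }

    // Reads a stored slot the way the proposed change does: as an unsigned 16-bit value.
    static int asUnsigned(char stored) {
        return (int) stored;
    }

    public static void main(String[] args) {
        char backward = (char) -5;   // assumed case: a backward jump of 5 slots
        char tooFar = (char) 40000;  // a forward jump larger than Short.MAX_VALUE

        System.out.println(asSigned(backward));   // -5
        System.out.println(asUnsigned(backward)); // 65531
        System.out.println(asSigned(tooFar));     // -25536
        System.out.println(asUnsigned(tooFar));   // 40000
    }
}
```

Under that assumption, the "int" cast would fix large forward jumps but misread any backward jump, so it may only trade one failure for another.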
Comment 1 Vadim Gritsenko 2007-03-07 16:25:20 UTC
instruction is an array of chars, which means it holds two-byte values. The
offset from one instruction to another takes one char in the array, so it must
be within [Short.MIN_VALUE, Short.MAX_VALUE]. Some programs (like a{8192}) in
the current version compile into code exceeding this size (more than
Short.MAX_VALUE instructions), so their offsets cannot be expressed correctly.

Added check for this condition to RECompiler.
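
A rough sketch of what such a range check could look like follows. This is not the actual RECompiler code; the class name, method name, and exception type are hypothetical and chosen only to show the idea of rejecting an offset that does not fit in a signed 16-bit slot:

```java
class OffsetRangeCheck {
    // Hypothetical helper: encodes an offset into one char slot,
    // refusing values outside [Short.MIN_VALUE, Short.MAX_VALUE].
    static char encodeOffset(int offset) {
        if (offset < Short.MIN_VALUE || offset > Short.MAX_VALUE) {
            throw new IllegalArgumentException(
                "Offset " + offset + " does not fit in a signed 16-bit slot");
        }
        return (char) offset;
    }

    public static void main(String[] args) {
        // An in-range backward offset survives the round trip through char.
        System.out.println((short) encodeOffset(-5));

        // An out-of-range offset is rejected at encoding time
        // instead of being misread later during matching.
        try {
            encodeOffset(40000);
        } catch (IllegalArgumentException ex) {
            System.out.println("Rejected: " + ex.getMessage());
        }
    }
}
```

Failing at compile time of the pattern gives a clear error instead of an ArrayIndexOutOfBoundsException during matching.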