Bug 3730 - Perl5Matcher sometimes confuses the begin/end offsets on similar sub patterns in a regular expression
Status: RESOLVED LATER
Alias: None
Product: ORO
Classification: Unclassified
Component: Main
Version: 2.0.4
Hardware: Other other
Importance: P3 normal
Target Milestone: ---
Assignee: Jakarta Notifications Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2001-09-19 14:26 UTC by James Vinett
Modified: 2004-11-16 19:05 UTC



Description James Vinett 2001-09-19 14:26:27 UTC
Here is the test program:


import com.oroinc.text.regex.*;

public class bug_report
{
    public static void main(String[] args) throws Exception
    {
        // \010 (backspace) marks the start of each GAME and TEAM record.
        String regex  = "\010[(]GAME +GID:([^;]+); +GDATE:([^;]*); +GSTART:([^;]*); +GSITE:([^;]*); +GNEUTRAL:([^;]*); +GSTAT:([^;]*); +GPERIOD:([^;]*);[^\r\n]*[\r\n]+"
                       +"("
                       +"(\010[(]TEAM +TNAME:([^;]*);( +[^:]+:[^;]*;){3} +THOME: *([Yy][Ee][Ss]); +TSCORE:([^;]*); +TSTAT:([^;]*)[^\r\n]*[\r\n]+)"
                       +"|"
                       +"(\010[(]TEAM +TNAME:([^;]*);( +[^:]+:[^;]*;){3} +THOME: *([Nn][Oo]); +TSCORE:([^;]*); +TSTAT:([^;]*)[^\r\n]*[\r\n]+)"
                       +"){2}";

        // The leading \010 on each record was lost when the report was posted;
        // it is restored here because the offsets in the output below only
        // line up when every "(" is preceded by one extra character.
        String input  = "\010(GAME GID:13805; GDATE:11/01/2000; GSTART:19:30; GSITE:Charlotte Coliseum; GNEUTRAL:NO; GSTAT:Final; GPERIOD:4; \n"
                       +"\010(TEAM TNAME:Hornets; TLOCALE:Charlotte; TCONF:Eastern; TDIV:Central; THOME:YES; TSCORE:77; TSTAT:LOST; TID:9;)\n"
                       +"\010(TEAM TNAME:Wizards; TLOCALE:Washington; TCONF:Eastern; TDIV:Atlantic; THOME:NO; TSCORE:95; TSTAT:WON; TID:7;))\n";

        String input2 = "\010(GAME GID:13789; GDATE:10/31/2000; GSTART:19:30; GSITE:TD Waterhouse Centre; GNEUTRAL:NO; GSTAT:Final; GPERIOD:4; \n"
                       +"\010(TEAM TNAME:Magic; TLOCALE:Orlando; TCONF:Eastern; TDIV:Atlantic; THOME:YES; TSCORE:97; TSTAT:WON; TID:5;)\n"
                       +"\010(TEAM TNAME:Wizards; TLOCALE:Washington; TCONF:Eastern; TDIV:Atlantic; THOME:NO; TSCORE:86; TSTAT:LOST; TID:7;))\n";

        Perl5Compiler p5compiler = new Perl5Compiler();
        Perl5Pattern p5pattern = null;
        Perl5Matcher p5matcher = new Perl5Matcher();
        PatternMatcherInput p5input = new PatternMatcherInput(input2);

        try {
            p5pattern = (Perl5Pattern) p5compiler.compile(regex,
                            Perl5Compiler.SINGLELINE_MASK |
                            Perl5Compiler.READ_ONLY_MASK  );
        } catch(MalformedPatternException e) {
            System.out.println("Error:  Bad Perl5 pattern.");
            System.out.println(e.getMessage());
            return; // no pattern to match against
        }

        boolean result = p5matcher.matchesPrefix(p5input, p5pattern);

        if( result )
        {
            MatchResult mr = p5matcher.getMatch();
            int groups     = mr.groups();
            int start      = -1;
            int end        = -1;
            String matchStr = null;
            for( int x = 0; x < groups; x++ )
            {
                start = mr.beginOffset(x);
                end   = mr.endOffset(x);
                //matchStr = mr.group(x);

                //System.out.print("Pos: "+x+"\tStart: "+start+"\tEnd: "+end+"\tMatch: "+matchStr);
                System.out.print("Pos: "+x+"\tStart: "+start+"\tEnd: "+end);

                // A start offset past the end offset is the symptom being reported.
                if( start > end )
                    System.out.println( " -- ERROR" );
                else
                    System.out.println();
            }
        }
        else
        {
            System.out.println("No Match");
        }
        System.out.println("Program terminating");
    }
}


And here is the output (the program matches against input2):

Pos: 0    Start: 0    End: 338
Pos: 1    Start: 11    End: 16
Pos: 2    Start: 24    End: 34
Pos: 3    Start: 43    End: 48
Pos: 4    Start: 56    End: 76
Pos: 5    Start: 87    End: 89
Pos: 6    Start: 97    End: 102
Pos: 7    Start: 112    End: 113
Pos: 8    Start: 224    End: 338
Pos: 9    Start: 224    End: 224
Pos: 10    Start: 237    End: 237
Pos: 11    Start: 280    End: 295
Pos: 12    Start: 302    End: 192 -- ERROR
Pos: 13    Start: 201    End: 203
Pos: 14    Start: 211    End: 214
Pos: 15    Start: 224    End: 338
Pos: 16    Start: 237    End: 244
Pos: 17    Start: 280    End: 295
Pos: 18    Start: 302    End: 304
Pos: 19    Start: 313    End: 315
Pos: 20    Start: 323    End: 327
Program terminating



Notice that Pos 12 and Pos 18 share the same Start value (302), yet Pos 12
ends at 192, before it starts.  In the regex the two groups sit at the same
position in their respective alternatives.  Granted, there are many similar
sub patterns; in fact, the two TEAM alternatives of the pattern are almost
exactly the same except for [Yy][Ee][Ss] and [Nn][Oo]...
Comment 1 James Vinett 2001-09-19 15:07:34 UTC
Same problem with version 2.0.4.
Comment 2 Daniel F. Savarese 2001-09-19 15:36:57 UTC
This behavior is consistent with Perl 5.003_07 and is not a bug.  The
contents of a group are not guaranteed to be the last successful match
when the group is contained within an alternation.  In other words, group
18 is the valid match, while group 12 did not match anything on its last
attempt.  In Perl5MatchResult, groups that failed to match on their last
attempt as part of the NFA are indicated by a start offset greater than
the end offset (this may be a documentation bug, since it may not appear
in the javadocs), and they return null when accessed via group(int).
Subgroups that weren't reached a final time during the NFA execution
(perhaps because an earlier subgroup failed) retain their old values.
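
As a rough illustration of that convention, the sketch below applies the
"start offset greater than end offset" test to the begin/end pairs copied
from the output in the description.  It does not call ORO at all; the class
name, the REPORTED table, and both helper methods are hypothetical and exist
only for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class FailedGroupCheck {
    // {begin, end} for groups 0..20, copied from the reported output.
    static final int[][] REPORTED = {
        {0, 338}, {11, 16}, {24, 34}, {43, 48}, {56, 76}, {87, 89},
        {97, 102}, {112, 113}, {224, 338}, {224, 224}, {237, 237},
        {280, 295}, {302, 192}, {201, 203}, {211, 214}, {224, 338},
        {237, 244}, {280, 295}, {302, 304}, {313, 315}, {323, 327}
    };

    // Hypothetical helper: true when the offsets mark a group that
    // failed on its last match attempt (begin past end).
    static boolean failedLastAttempt(int begin, int end) {
        return begin > end;
    }

    // Collects the indices of all groups flagged by failedLastAttempt.
    static List<Integer> failedGroups(int[][] offsets) {
        List<Integer> failed = new ArrayList<>();
        for (int g = 0; g < offsets.length; g++) {
            if (failedLastAttempt(offsets[g][0], offsets[g][1])) {
                failed.add(g);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        System.out.println("Groups that failed on their last attempt: "
                + failedGroups(REPORTED));
    }
}
```

With the reported offsets, only group 12 is flagged.  A caller working with
a real MatchResult could make the same comparison on beginOffset(x) and
endOffset(x) before trusting the group's contents.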

Later versions of Perl regularized the behavior of subgroups so that they
would always contain the last value matched rather than a potentially
empty value based on a final failed subgroup match attempt.  Perl5Matcher
will implement this behavior as part of the Perl 5.6 compatibility work.
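
For comparison only, here is a minimal sketch of the "last value matched"
behavior using java.util.regex, which is a different engine from ORO and
says nothing about what Perl5Matcher itself does.  The pattern is a
cut-down stand-in for the two TEAM alternatives (one alternative per
iteration of a repeated group), and the class and method names are
hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LastMatchDemo {
    // Returns group 0..n of a full match, or null when there is no match.
    static String[] groupsOf(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] out = new String[m.groupCount() + 1];
        for (int g = 0; g <= m.groupCount(); g++) {
            out[g] = m.group(g);
        }
        return out;
    }

    public static void main(String[] args) {
        // Iteration 1 matches via (a); iteration 2 tries (a), fails,
        // then matches via (b).
        String[] groups = groupsOf("(?:(a)|(b)){2}", "ab");
        for (int g = 0; g < groups.length; g++) {
            System.out.println("group " + g + ": " + groups[g]);
        }
    }
}
```

In this engine group 1 keeps "a" from the first iteration even though its
alternative was tried and failed on the second, which is the kind of result
the regularized behavior is meant to give.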