Bug 2121

Summary: '.' or '-' in bracket expression gives unexpected results
Product: Regexp Reporter: Edwin Martin <edwin>
Component: OtherAssignee: Jakarta Notifications Mailing List <notifications>
Status: CLOSED FIXED    
Severity: normal CC: euge, lmoore
Priority: P3    
Version: unspecified   
Target Milestone: ---   
Hardware: Other   
OS: All   

Description Edwin Martin 2001-06-11 12:26:14 UTC
This problem has been discussed on the regexp mailinglist
(June 8, 2001, Subject "Regexp broken").

I made an JSP page in which I test some RE's.

I have a string "{regexp-1.2}" and I want an RE which
matches everything between the brackets (thus "regexp-1.2").

Let's try some regular expressions:

Input string "{regexp-1.2}" and RE "([a-z0-9]+)" match: "regexp"
Input string "{regexp-1.2}" and RE "([a-z0-9-]+)" match: "{regexp-1.2}"
Input string "{regexp-1.2}" and RE "([a-z0-9.]+)" match: "regexp"
Input string "{regexp-1.2}" and RE "([a-z0-9.-]+)" match: "{regexp"
Input string "{regexp-1.2}" and RE "([a-z0-9\-]+)" match: "regexp-1"
Input string "{regexp-1.2}" and RE "([a-z0-9\.]+)" match: "regexp"
Input string "{regexp-1.2}" and RE "([a-z0-9\.\-]+)" match: "regexp-1"
Input string "{regexp-1.2}" and RE "([a-z0-9.\-]+)" match: "regexp-1"
Input string "{regexp-1.2}" and RE "([a-z0-9\.-]+)" match: "{regexp"

Some strange results and none of them gives "regexp-1.2"!

This is the JSP-code, so you can try it yourself:

<%@ page import="org.apache.regexp.*" %>

<%!
JspWriter JspOut;

public void reTest( String in, String re ) throws java.io.IOException, 
org.apache.regexp.RESyntaxException {
        JspOut.print( "Input string \""+in+"\" and RE \""+re+"\" match: ");
        RE testRe = new RE(re);
        if ( testRe.match( in ) )
                JspOut.print( "\""+testRe.getParen(1)+"\"" );
        else
                JspOut.print( "no match" );
        JspOut.print("<br>");
}
%>

<%
JspOut = out;
String s = "{regexp-1.2}";

reTest( s, "([a-z0-9]+)" );
reTest( s, "([a-z0-9-]+)" );
reTest( s, "([a-z0-9.]+)" );
reTest( s, "([a-z0-9.-]+)" );
reTest( s, "([a-z0-9\\-]+)" );
reTest( s, "([a-z0-9\\.]+)" );
reTest( s, "([a-z0-9\\.\\-]+)" );
reTest( s, "([a-z0-9.\\-]+)" );
reTest( s, "([a-z0-9\\.-]+)" );
%>
Comment 1 Edwin Martin 2001-06-21 14:29:54 UTC
Here's an contribution to general@jakarta.apache.org,
subject "What are we doing in regards to JDK 1.4?".

It contains untested fixes.

At 09:42 21-6-2001 -0700, Jon wrote:
Edwin,

on 6/21/01 7:16 AM, "Edwin Martin" <edwin@bitstorm.nl> wrote:

- > org.apache.regexp 1.2 is pretty much broken. It has some
- > major flaws since 1.0 and they are still not addressed.
- > 
- > See http://nagoya.betaversion.org/bugzilla/buglist.cgi?product=Regexp
- > for a list of bugs (BTW none of them is assigned).
- 
- Sending in bug reports doesn't get the problems fixed. This is a community
- of VOLUNTEERS. You can't just magically put in a bug report and then someone
- is going to jump up and fix it...you have to submit patches or try to nicely
- motivate people to fix it for you.
- 
- <http://jakarta.apache.org/site/understandingopensource.html>
-
- "With the opensource system, if you find any deficiency in the project, the
- onus is on you to redress that deficiency."

I thought submitting bug reports is also an important
way to support Open Source.

Well, I looked at the regexp-code and saw one of the bugs:

RECompiler.java, line 664:

                   // Premature end of range. define up to Character.MAX_VALUE
                    if ((idx + 1) < len && pattern.charAt(++idx) == ']')
                    {
                        simpleChar = Character.MAX_VALUE;
                        break;
                    }

The code makes any minus a range.

The RE "[a-]" becomes "the character a and anything after it".

A minus at the beginning or the end should be just a minus.

The code should be something like this:

                    // Premature end of range. define up to Character.MAX_VALUE
                    if ((idx + 1) < len && pattern.charAt(++idx) == ']')
                    {
                        definingRange = false;
                        break;
                    }

Futhermore, RECompiler.java, line 697:

                if ((idx + 1) >= len || pattern.charAt(idx + 1) != '-')

Should become something like:

                if ((idx + 1) >= len || !(pattern.charAt(idx + 1) == '-' &&
!((idx + 2) <= len && pattern.charAt(idx + 2) == ']')))

Which means: Do not include a char when followed by a minus, but DO include the
char when the minus is followed by a ']'.

The code still does not address the possibility of a charclass which starts with a
minus, like "[-a]" or "[^-a]", but that shouldn't be too difficult to implement.

It isn't really that hard to fix these bugs, I just wonder if there's anybody
responsible for the regexp package.

And by the way, you don't have to shout.

Bye,
Edwin Martin.
Comment 2 Michael McCallum 2001-09-08 15:48:43 UTC
Cheers for the notes Edwin I see about getting them implemented.
I have been too busy with working and moving and no one else is volunteering.

It will get done.
Comment 3 Michael McCallum 2001-09-08 15:50:06 UTC
*** Bug 2122 has been marked as a duplicate of this bug. ***
Comment 4 Michael McCallum 2001-09-08 16:00:59 UTC
*** Bug 810 has been marked as a duplicate of this bug. ***
Comment 5 Vadim Gritsenko 2003-04-25 17:58:22 UTC
This is partially fixed by patch attached to bug #19329. Following REs work
properly:

reTest( s, "([a-z0-9]+)" );
reTest( s, "([a-z0-9\\-]+)" );
reTest( s, "([a-z0-9\\.]+)" );
reTest( s, "([a-z0-9\\.\\-]+)" );
Comment 6 Oleg Sukhodolsky 2003-10-25 19:40:04 UTC
In current version (1.4-dev) only three regexps from list provided in this bug
produce results unexpected by submitter:
([a-z0-9-]+) matches "regexp" instead of "regexp-1"
([a-z0-9.-]+) - "{regexp" instead of "regexp-1.2"
([a-z0-9\\.-]+) - "{regexp" instead of "regexp-1.2"
The cause of this problem is that jakarta-regexp consider [-] as interval from
0 upto Character.MAX_VALUE.  And [a-] as interval from a upto 
Character.MAX_VALUE.

So, we can say that this is a feature and do not fix this.
Or say that this is a bug, but on this way we will have to change behavior of '-
'
when it is first/last character in [].
The advantage of latter approach is that jakarta-regexp will be more compatible
with Perl, gnu-regexp and Sun's regexp implementation (I think ORO too, but I 
don't have it to check).
Comment 7 Vadim Gritsenko 2003-12-20 18:03:02 UTC
I think it is possible to add new MATCH modifier, something like:

Index: src/java/org/apache/regexp/RE.java
===================================================================
RCS file: /home/cvs/jakarta-regexp/src/java/org/apache/regexp/RE.java,v
retrieving revision 1.14
diff -u -r1.14 RE.java
--- src/java/org/apache/regexp/RE.java	6 Sep 2003 01:45:51 -0000	1.14
+++ src/java/org/apache/regexp/RE.java	19 Dec 2003 03:16:44 -0000
@@ -389,6 +389,12 @@
      */
     public static final int MATCH_SINGLELINE      = 0x0004;
 
+    /**
+     * Changes behavior of '-' in character classes where one or both
+     * of the range limits are missing.
+     */
+    public static final int MATCH_POSIX           = 0x0008;
+
     /************************************************
      *                                              *
      * The format of a node in a program is:        *


And change behavior of RECompiler depending on value of this match modifier.
Comment 8 Vadim Gritsenko 2003-12-20 18:03:53 UTC
Oops. Meant not MATCH_POSIX, but something more descriptive, say MATCH_POSIXRANGE.
Comment 9 Oleg Sukhodolsky 2004-02-17 06:42:34 UTC
The problem with MATCH_POSIX flag is that it's not a pure "matching time" flag
as all other ones.  We need to make a decision on how interpret '-' during
"compile time" (when we create program), because program for "[-]" and
"[\u0000-\uFFFF]" will be the same.  I, personally, think that we shouldn't
enhance program structure to catch this particular case IMHO this is overkill.

I'd say that regexp package should just make a decision on how it handles 
incomplete ranges and does so without any flags.
Comment 10 Vadim Gritsenko 2004-02-17 11:27:55 UTC
Well, then, I'm biased towards declaring this a "feature" of regexp - especially
taking into account all those years this behavior was in regexp.
Comment 11 Oleg Sukhodolsky 2004-02-17 18:19:43 UTC
I think we should write some javadoc about this feature and close this bug :)
Comment 12 Vadim Gritsenko 2004-02-20 13:44:32 UTC
Yes! Do you want to wordsmith a javadoc patch? ;-)
Comment 13 Oleg Sukhodolsky 2004-02-25 04:07:18 UTC
I'm not an expert in javadoc writing, but here is my suggestion:

Index: src/java/org/apache/regexp/RE.java
===================================================================
RCS file: /home/cvspublic/jakarta-regexp/src/java/org/apache/regexp/RE.java,v
retrieving revision 1.19
diff -u -r1.19 RE.java
--- src/java/org/apache/regexp/RE.java  17 Feb 2004 13:37:54 -0000      1.19
+++ src/java/org/apache/regexp/RE.java  25 Feb 2004 04:05:22 -0000
@@ -128,6 +128,12 @@
  *    [a-zA-Z]             Character class with ranges
  *    [^abc]               Negated character class
  *
+ * </pre>
+ * <b>Note</b> that incomplete range will be interpreted as &quot;starts
+ * from zero&quot; or &quot;ends with last character&quot;. <br> I.e. [-a] is
+ * the same as [\\u0000-a], [a-] - [a-\\uFFFF] and [-] means
+ * &quot;all characters&quot;
+ * <pre>
  * <br>
  *
  *  <b><font face=times roman>Standard POSIX Character Classes</font></b>
Comment 14 Vadim Gritsenko 2004-02-27 02:16:44 UTC
I'd applied your patch, and made few changes on top of it. Thanks :-)