This problem has been discussed on the regexp mailinglist (June 8, 2001, Subject "Regexp broken"). I made an JSP page in which I test some RE's. I have a string "{regexp-1.2}" and I want an RE which matches everything between the brackets (thus "regexp-1.2"). Let's try some regular expressions: Input string "{regexp-1.2}" and RE "([a-z0-9]+)" match: "regexp" Input string "{regexp-1.2}" and RE "([a-z0-9-]+)" match: "{regexp-1.2}" Input string "{regexp-1.2}" and RE "([a-z0-9.]+)" match: "regexp" Input string "{regexp-1.2}" and RE "([a-z0-9.-]+)" match: "{regexp" Input string "{regexp-1.2}" and RE "([a-z0-9\-]+)" match: "regexp-1" Input string "{regexp-1.2}" and RE "([a-z0-9\.]+)" match: "regexp" Input string "{regexp-1.2}" and RE "([a-z0-9\.\-]+)" match: "regexp-1" Input string "{regexp-1.2}" and RE "([a-z0-9.\-]+)" match: "regexp-1" Input string "{regexp-1.2}" and RE "([a-z0-9\.-]+)" match: "{regexp" Some strange results and none of them gives "regexp-1.2"! This is the JSP-code, so you can try it yourself: <%@ page import="org.apache.regexp.*" %> <%! JspWriter JspOut; public void reTest( String in, String re ) throws java.io.IOException, org.apache.regexp.RESyntaxException { JspOut.print( "Input string \""+in+"\" and RE \""+re+"\" match: "); RE testRe = new RE(re); if ( testRe.match( in ) ) JspOut.print( "\""+testRe.getParen(1)+"\"" ); else JspOut.print( "no match" ); JspOut.print("<br>"); } %> <% JspOut = out; String s = "{regexp-1.2}"; reTest( s, "([a-z0-9]+)" ); reTest( s, "([a-z0-9-]+)" ); reTest( s, "([a-z0-9.]+)" ); reTest( s, "([a-z0-9.-]+)" ); reTest( s, "([a-z0-9\\-]+)" ); reTest( s, "([a-z0-9\\.]+)" ); reTest( s, "([a-z0-9\\.\\-]+)" ); reTest( s, "([a-z0-9.\\-]+)" ); reTest( s, "([a-z0-9\\.-]+)" ); %>
Here's an contribution to general@jakarta.apache.org, subject "What are we doing in regards to JDK 1.4?". It contains untested fixes. At 09:42 21-6-2001 -0700, Jon wrote: Edwin, on 6/21/01 7:16 AM, "Edwin Martin" <edwin@bitstorm.nl> wrote: - > org.apache.regexp 1.2 is pretty much broken. It has some - > major flaws since 1.0 and they are still not addressed. - > - > See http://nagoya.betaversion.org/bugzilla/buglist.cgi?product=Regexp - > for a list of bugs (BTW none of them is assigned). - - Sending in bug reports doesn't get the problems fixed. This is a community - of VOLUNTEERS. You can't just magically put in a bug report and then someone - is going to jump up and fix it...you have to submit patches or try to nicely - motivate people to fix it for you. - - <http://jakarta.apache.org/site/understandingopensource.html> - - "With the opensource system, if you find any deficiency in the project, the - onus is on you to redress that deficiency." I thought submitting bug reports is also an important way to support Open Source. Well, I looked at the regexp-code and saw one of the bugs: RECompiler.java, line 664: // Premature end of range. define up to Character.MAX_VALUE if ((idx + 1) < len && pattern.charAt(++idx) == ']') { simpleChar = Character.MAX_VALUE; break; } The code makes any minus a range. The RE "[a-]" becomes "the character a and anything after it". A minus at the beginning or the end should be just a minus. The code should be something like this: // Premature end of range. define up to Character.MAX_VALUE if ((idx + 1) < len && pattern.charAt(++idx) == ']') { definingRange = false; break; } Futhermore, RECompiler.java, line 697: if ((idx + 1) >= len || pattern.charAt(idx + 1) != '-') Should become something like: if ((idx + 1) >= len || !(pattern.charAt(idx + 1) == '-' && !((idx + 2) <= len && pattern.charAt(idx + 2) == ']'))) Which means: Do not include a char when followed by a minus, but DO include the char when the minus is followed by a ']'. The code still does not address the possibility of a charclass which starts with a minus, like "[-a]" or "[^-a]", but that shouldn't be too difficult to implement. It isn't really that hard to fix these bugs, I just wonder if there's anybody responsible for the regexp package. And by the way, you don't have to shout. Bye, Edwin Martin.
Cheers for the notes Edwin I see about getting them implemented. I have been too busy with working and moving and no one else is volunteering. It will get done.
*** Bug 2122 has been marked as a duplicate of this bug. ***
*** Bug 810 has been marked as a duplicate of this bug. ***
This is partially fixed by patch attached to bug #19329. Following REs work properly: reTest( s, "([a-z0-9]+)" ); reTest( s, "([a-z0-9\\-]+)" ); reTest( s, "([a-z0-9\\.]+)" ); reTest( s, "([a-z0-9\\.\\-]+)" );
In current version (1.4-dev) only three regexps from list provided in this bug produce results unexpected by submitter: ([a-z0-9-]+) matches "regexp" instead of "regexp-1" ([a-z0-9.-]+) - "{regexp" instead of "regexp-1.2" ([a-z0-9\\.-]+) - "{regexp" instead of "regexp-1.2" The cause of this problem is that jakarta-regexp consider [-] as interval from 0 upto Character.MAX_VALUE. And [a-] as interval from a upto Character.MAX_VALUE. So, we can say that this is a feature and do not fix this. Or say that this is a bug, but on this way we will have to change behavior of '- ' when it is first/last character in []. The advantage of latter approach is that jakarta-regexp will be more compatible with Perl, gnu-regexp and Sun's regexp implementation (I think ORO too, but I don't have it to check).
I think it is possible to add new MATCH modifier, something like: Index: src/java/org/apache/regexp/RE.java =================================================================== RCS file: /home/cvs/jakarta-regexp/src/java/org/apache/regexp/RE.java,v retrieving revision 1.14 diff -u -r1.14 RE.java --- src/java/org/apache/regexp/RE.java 6 Sep 2003 01:45:51 -0000 1.14 +++ src/java/org/apache/regexp/RE.java 19 Dec 2003 03:16:44 -0000 @@ -389,6 +389,12 @@ */ public static final int MATCH_SINGLELINE = 0x0004; + /** + * Changes behavior of '-' in character classes where one or both + * of the range limits are missing. + */ + public static final int MATCH_POSIX = 0x0008; + /************************************************ * * * The format of a node in a program is: * And change behavior of RECompiler depending on value of this match modifier.
Oops. Meant not MATCH_POSIX, but something more descriptive, say MATCH_POSIXRANGE.
The problem with MATCH_POSIX flag is that it's not a pure "matching time" flag as all other ones. We need to make a decision on how interpret '-' during "compile time" (when we create program), because program for "[-]" and "[\u0000-\uFFFF]" will be the same. I, personally, think that we shouldn't enhance program structure to catch this particular case IMHO this is overkill. I'd say that regexp package should just make a decision on how it handles incomplete ranges and does so without any flags.
Well, then, I'm biased towards declaring this a "feature" of regexp - especially taking into account all those years this behavior was in regexp.
I think we should write some javadoc about this feature and close this bug :)
Yes! Do you want to wordsmith a javadoc patch? ;-)
I'm not an expert in javadoc writing, but here is my suggestion: Index: src/java/org/apache/regexp/RE.java =================================================================== RCS file: /home/cvspublic/jakarta-regexp/src/java/org/apache/regexp/RE.java,v retrieving revision 1.19 diff -u -r1.19 RE.java --- src/java/org/apache/regexp/RE.java 17 Feb 2004 13:37:54 -0000 1.19 +++ src/java/org/apache/regexp/RE.java 25 Feb 2004 04:05:22 -0000 @@ -128,6 +128,12 @@ * [a-zA-Z] Character class with ranges * [^abc] Negated character class * + * </pre> + * <b>Note</b> that incomplete range will be interpreted as "starts + * from zero" or "ends with last character". <br> I.e. [-a] is + * the same as [\\u0000-a], [a-] - [a-\\uFFFF] and [-] means + * "all characters" + * <pre> * <br> * * <b><font face=times roman>Standard POSIX Character Classes</font></b>
I'd applied your patch, and made few changes on top of it. Thanks :-)