Bug 3303

Summary: Unicode 3.0 character \\uFFFD
Product: Regexp
Reporter: Tasuki Yamamoto <tasuki.y2k>
Component: Other
Assignee: Jakarta Notifications Mailing List <notifications>
Status: CLOSED FIXED    
Severity: minor    
Priority: P3    
Version: unspecified   
Target Milestone: ---   
Hardware: PC   
OS: All   
Attachments: Suggested fix for this bug.

Description Tasuki Yamamoto 2001-08-28 06:17:48 UTC
http://www.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.txt:
>FFFD;REPLACEMENT CHARACTER;So;0;ON;;;;;N;;;;;

For some reason when the above character is in any regex character class it 
causes a RESyntaxException with description 'Bad Character Class'. I attempted 
to use it in the following context:

  private static String XMLescape(String s)
  	throws RESyntaxException
  {
	if (s==null) return s;
	if (s.length() == 0) return s;

	// XML 1.0 standard actually says:
	// Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
	//          [#x10000-#x10FFFF]
	// For some reason this library doesn't like the Unicode character
	// \\uFFFD.
	RE r = new RE("[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFC]");

	return r.subst(s, "");
  }

I'm using the JRE Standard Edition 3.0.

Regards,

Tasuki.
Comment 1 Oleg Sukhodolsky 2003-10-07 03:23:16 UTC
The cause of the problem is that RECompiler uses 0xfffd as the value of its
internal constant ESC_CLASS, so the literal character \\uFFFD cannot be told
apart from that marker.

To fix the problem, the type of the ESC_XXX constants should be changed from
char to int, their values should be larger than the maximum value of a char,
and the return type of the escape() method should be changed to int.
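
The collision described above can be illustrated with a minimal sketch. This is not the RECompiler source: the class name, the oldEscape()/newEscape() helpers and the value chosen for NEW_ESC_CLASS are assumptions made up for illustration. Only the fact that ESC_CLASS was a char with value 0xfffd comes from this report.

```java
public class EscapeSentinelSketch {
    // From the report: the class-escape marker was a char with value 0xfffd.
    static final char OLD_ESC_CLASS = 0xfffd;

    // Assumed replacement value: any int above Character.MAX_VALUE would do.
    static final int NEW_ESC_CLASS = Character.MAX_VALUE + 1;

    // Hypothetical char-returning escape(); 'body' is the text after the backslash.
    static char oldEscape(String body) {
        if (body.charAt(0) == 'u') {
            // Four hex digits give a literal character, which may be 0xfffd itself.
            return (char) Integer.parseInt(body.substring(1, 5), 16);
        }
        if (body.equals("w") || body.equals("d") || body.equals("s")) {
            return OLD_ESC_CLASS; // marker shares the value space with real characters
        }
        return body.charAt(0);
    }

    // Same logic with an int return type, so the marker lies outside the char range.
    static int newEscape(String body) {
        if (body.charAt(0) == 'u') {
            return Integer.parseInt(body.substring(1, 5), 16);
        }
        if (body.equals("w") || body.equals("d") || body.equals("s")) {
            return NEW_ESC_CLASS;
        }
        return body.charAt(0);
    }

    public static void main(String[] args) {
        // The literal U+FFFD is mistaken for the marker: prints true
        System.out.println(oldEscape("uFFFD") == OLD_ESC_CLASS);
        // With an int marker the literal is no longer ambiguous: prints false
        System.out.println(newEscape("uFFFD") == NEW_ESC_CLASS);
        // A real class escape is still recognised: prints true
        System.out.println(newEscape("w") == NEW_ESC_CLASS);
    }
}
```

With a char return type every possible marker value is also a legal input character, so some character will always be misread. Widening to int leaves room for markers that no \\uXXXX escape can produce.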
Comment 2 Oleg Sukhodolsky 2003-10-07 07:34:49 UTC
Created attachment 8476
Suggested fix for this bug.
Comment 3 Vadim Gritsenko 2003-12-20 17:59:23 UTC
Patch applied, thanks
Comment 4 Vadim Gritsenko 2003-12-20 17:59:43 UTC
Closed