Some UTF-8 encoded Japanese documents cause a fatal error. If an element name containing multi-byte UTF-8 characters reaches or crosses a 16 KB boundary in the file, the parser reports:

[Fatal Error] testdata.xml:266:14: Element type "--a substring of the element name in Japanese--" must be followed by either attribute specifications, ">" or "/>".

The following is part of a hex dump of the document. The long element name starts at offset 0x3fe2 and continues past 0x4000.
----
00003fe0  20 3c e6 97 a5 e6 9c ac  e8 aa 9e e3 81 ae e3 81  | <..............|
00003ff0  bf e3 81 ae e3 82 a8 e3  83 ac e3 83 a1 e3 83 b3  |................|
00004000  e3 83 88 e5 90 8d e3 81  a7 e3 82 82 e3 83 80 e3  |................|
00004010  83 a1 e3 81 a7 e3 81 97  e3 82 87 3e e6 97 a5 e6  |...........>....|
00004020  9c ac e8 aa 9e e3 81 ae  e3 81 bf e3 81 ae e3 82  |................|
00004030  a8 e3 83 ac e3 83 a1 e3  83 b3 e3 83 88 e5 90 8d  |................|
00004040  e3 82 82 e3 83 80 e3 83  a1 e3 81 a7 e3 81 97 e3  |................|
00004050  82 87 3c 2f e6 97 a5 e6  9c ac e8 aa 9e e3 81 ae  |..</............|
00004060  e3 81 bf e3 81 ae e3 82  a8 e3 83 ac e3 83 a1 e3  |................|
00004070  83 b3 e3 83 88 e5 90 8d  e3 81 a7 e3 82 82 e3 83  |................|
00004080  80 e3 83 a1 e3 81 a7 e3  81 97 e3 82 87 3e 0a 3c  |.............>.<|
00004090  2f 64 6f 63 3e 0a                                 |/doc>.|
----
The following code generates test data that triggers the problem.
----
import java.io.FileOutputStream;
import java.io.IOException;

public class MakeTestData {
    static final String xmldecl = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
    static final String rootbgn = "<doc>\n";

    // Short element, repeated to push the long element up to the 16 KB boundary.
    // The two-space indentation matters: it matches the offsets in the hex dump.
    static final String elem1 =
        "  <\u30a8\u30ec\u30e1\u30f3\u30c8>"
        + "\u65e5\u672c\u8a9e\u8981\u7d20\u540d\u3060\u3088"
        + "</\u30a8\u30ec\u30e1\u30f3\u30c8>\n";

    // Element whose long Japanese name crosses offset 0x4000.
    static final String elem2 =
        "  <\u65e5\u672c\u8a9e\u306e\u307f\u306e\u30a8\u30ec\u30e1\u30f3\u30c8\u540d\u3067\u3082\u30c0\u30e1\u3067\u3057\u3087>"
        + "\u65e5\u672c\u8a9e\u306e\u307f\u306e\u30a8\u30ec\u30e1\u30f3\u30c8\u540d\u3082\u30c0\u30e1\u3067\u3057\u3087"
        + "</\u65e5\u672c\u8a9e\u306e\u307f\u306e\u30a8\u30ec\u30e1\u30f3\u30c8\u540d\u3067\u3082\u30c0\u30e1\u3067\u3057\u3087>\n";

    static final String rootend = "</doc>\n";
    static final String fname = "testdata.xml";
    static final String fenc = "UTF-8";

    public static void main(String[] args) {
        StringBuilder buf = new StringBuilder();
        buf.append(xmldecl);
        buf.append(rootbgn);
        // 263 short elements put the long element on line 266 of the file.
        for (int i = 0; i < 263; i++) {
            buf.append(elem1);
        }
        buf.append(elem2);
        buf.append(rootend);
        String testdata = buf.toString();

        // Write the document as UTF-8; the stream is closed even on failure.
        try (FileOutputStream fos = new FileOutputStream(fname)) {
            fos.write(testdata.getBytes(fenc));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
----
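The offsets in the hex dump can be cross-checked with a little arithmetic. The sketch below is not part of the original report: the class `BoundaryCheck` and its methods are hypothetical names, and it assumes each element line is indented by two spaces, which is what the dump implies (a space at 0x3fe0 followed by "<" at 0x3fe1). It computes where the long element name starts and how many of its 3-byte characters fit before the 16 KB boundary.

```java
import java.nio.charset.StandardCharsets;

class BoundaryCheck {
    // The 16 KB boundary (0x4000) named in the report.
    static final int BOUNDARY = 16 * 1024;

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Byte offset at which the long element name starts, just after its "  <". */
    static int nameStartOffset(int repeats) {
        String header = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<doc>\n";
        // Assumption: two-space indentation, inferred from the hex dump offsets.
        String elem1 = "  <\u30a8\u30ec\u30e1\u30f3\u30c8>"
                + "\u65e5\u672c\u8a9e\u8981\u7d20\u540d\u3060\u3088"
                + "</\u30a8\u30ec\u30e1\u30f3\u30c8>\n";
        return utf8Length(header) + repeats * utf8Length(elem1) + utf8Length("  <");
    }

    /** Number of 3-byte name characters that fit before the boundary. */
    static int charsBeforeBoundary(int nameStart) {
        return (BOUNDARY - nameStart) / 3;
    }

    public static void main(String[] args) {
        int start = nameStartOffset(263);
        System.out.printf("name starts at 0x%x%n", start);
        System.out.println("characters before 0x4000: " + charsBeforeBoundary(start));
    }
}
```

With 263 repeats this prints a start offset of 0x3fe2 and 10 characters before the boundary. Two spaces, "<" and 10 name characters occupy columns 1 to 13, so the first character past the boundary sits at column 14, which agrees with the reported position 266:14.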
The following patch to src/org/apache/xerces/readers/UTF8Reader.java seems to resolve the problem.
----
1398a1399
> data = fMostRecentData;
----
However, I'm not sure whether it is sufficient, or whether it causes any regression elsewhere.
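One way to compare behavior before and after the patch is to parse the generated file through the standard JAXP SAX interface and report whether it succeeds. This is only a sketch: `ParseCheck` and `parseError` are hypothetical names, it assumes Java 7 or later and that the parser build under test is the one JAXP picks up from the classpath, and whether the failure actually reproduces depends on that parser version.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class ParseCheck {
    /** Returns null if the document parses, otherwise the parser's error message. */
    static String parseError(InputStream in) {
        try {
            // DefaultHandler rethrows fatal errors as SAXParseException.
            SAXParserFactory.newInstance().newSAXParser().parse(in, new DefaultHandler());
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        } catch (IOException | ParserConfigurationException e) {
            // Not a well-formedness problem; surface it as an unchecked failure.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the file name used by the test data generator in the report.
        String file = args.length > 0 ? args[0] : "testdata.xml";
        try (InputStream in = Files.newInputStream(Paths.get(file))) {
            String error = parseError(in);
            System.out.println(error == null ? "parsed OK" : "parse failed: " + error);
        }
    }
}
```

Running it against testdata.xml with the unpatched and the patched build gives a quick regression check for this one case; it does not show that the patch is safe for other inputs.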