Bug 16287 - The parser cannot parse some UTF-8 encoded Japanese documents
Status: NEW
Alias: None
Product: Xerces-J
Classification: Unclassified
Component: Core
Version: 1.4.4
Hardware: All All
Importance: P3 major
Target Milestone: ---
Assignee: Xerces-J Developers Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2003-01-21 05:02 UTC by Takuya Mori
Modified: 2004-11-16 19:05 UTC

Description Takuya Mori 2003-01-21 05:02:31 UTC
Some UTF-8 encoded Japanese documents cause a fatal error.

If an element name containing multi-byte UTF-8 characters reaches or crosses
any 16 KB boundary in the file (a byte offset that is a multiple of 16384),
the parser reports
'[Fatal Error] testdata.xml:266:14: Element type "--a substring of the element
name in Japanese--" must be followed by either attribute specifications, ">" or
"/>".'
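
To check whether a given parser build shows this failure, a small harness along the following lines may help. It is only a sketch: it assumes the parser under test is exposed through JAXP (javax.xml.parsers) and is the implementation picked up from the classpath, and the class name ParseCheck and the message layout are made up here to resemble the error above.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical harness, not part of Xerces or of the original report.
class ParseCheck {
    // Parses the stream with whichever JAXP SAX parser is configured.
    // Returns null if parsing succeeds, otherwise a one-line message
    // formatted like the "[Fatal Error] file:line:column: text" report.
    static String check(InputStream in, String label) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // DefaultHandler rethrows fatal errors as SAXParseException.
            factory.newSAXParser().parse(in, new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            return "[Fatal Error] " + label + ":" + e.getLineNumber() + ":"
                    + e.getColumnNumber() + ": " + e.getMessage();
        } catch (Exception e) {
            // Configuration or I/O problems are not the bug being tested.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        String path = args.length > 0 ? args[0] : "testdata.xml";
        try (InputStream in = new FileInputStream(path)) {
            String result = check(in, path);
            System.out.println(result == null ? "parsed OK" : result);
        }
    }
}
```

Run it with the affected parser jar first on the classpath and the path of the document as the only argument. A parser without the problem should print "parsed OK".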

The following is part of a hex dump of the document.
----
00003fe0  20 3c e6 97 a5 e6 9c ac  e8 aa 9e e3 81 ae e3 81  | <..............|
00003ff0  bf e3 81 ae e3 82 a8 e3  83 ac e3 83 a1 e3 83 b3  |................|
00004000  e3 83 88 e5 90 8d e3 81  a7 e3 82 82 e3 83 80 e3  |................|
00004010  83 a1 e3 81 a7 e3 81 97  e3 82 87 3e e6 97 a5 e6  |...........>....|
00004020  9c ac e8 aa 9e e3 81 ae  e3 81 bf e3 81 ae e3 82  |................|
00004030  a8 e3 83 ac e3 83 a1 e3  83 b3 e3 83 88 e5 90 8d  |................|
00004040  e3 82 82 e3 83 80 e3 83  a1 e3 81 a7 e3 81 97 e3  |................|
00004050  82 87 3c 2f e6 97 a5 e6  9c ac e8 aa 9e e3 81 ae  |..</............|
00004060  e3 81 bf e3 81 ae e3 82  a8 e3 83 ac e3 83 a1 e3  |................|
00004070  83 b3 e3 83 88 e5 90 8d  e3 81 a7 e3 82 82 e3 83  |................|
00004080  80 e3 83 a1 e3 81 a7 e3  81 97 e3 82 87 3e 0a 3c  |.............>.<|
00004090  2f 64 6f 63 3e 0a                                 |/doc>.|
----
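
The dump can be checked mechanically. The sketch below copies the 64 bytes at offsets 0x3fe0 through 0x401f from the dump, locates the '<' and '>' of the start tag, and tests whether the name between them spans offset 0x4000 (16384). The class and method names are invented for this illustration, and the 16384-byte boundary is taken from the description above, not from the parser source.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Xerces: inspects the dump excerpt above.
class DumpBoundaryCheck {
    // Offset of the first byte of the excerpt (left column of the dump).
    static final int BASE = 0x3fe0;
    // Assumed boundary: 16 KB, as stated in the description.
    static final int BOUNDARY = 0x4000;

    // Bytes 0x3fe0 through 0x401f, copied from the dump.
    static final String HEX =
          "203ce697a5e69cace8aa9ee381aee381"
        + "bfe381aee382a8e383ace383a1e383b3"
        + "e38388e5908de381a7e38282e38380e3"
        + "83a1e381a7e38197e382873ee697a5e6";

    static byte[] bytes() {
        byte[] out = new byte[HEX.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(HEX.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    // Absolute file offset of the first occurrence of an ASCII character,
    // or -1 if it does not occur in the excerpt.
    static int offsetOf(char ch) {
        byte[] b = bytes();
        for (int i = 0; i < b.length; i++) {
            if (b[i] == (byte) ch) {
                return BASE + i;
            }
        }
        return -1;
    }

    // The element name between '<' and '>', decoded as UTF-8.
    static String elementName() {
        int start = offsetOf('<') + 1 - BASE;
        int end = offsetOf('>') - BASE;
        return new String(bytes(), start, end - start, StandardCharsets.UTF_8);
    }

    // True if the byte at the boundary is a UTF-8 continuation byte,
    // that is, the boundary falls inside a single character.
    static boolean boundarySplitsCharacter() {
        return (bytes()[BOUNDARY - BASE] & 0xC0) == 0x80;
    }

    public static void main(String[] args) {
        int nameStart = offsetOf('<') + 1;
        int nameEnd = offsetOf('>'); // exclusive
        System.out.printf("name bytes: 0x%x..0x%x%n", nameStart, nameEnd - 1);
        System.out.println("spans boundary: "
                + (nameStart < BOUNDARY && nameEnd > BOUNDARY));
        System.out.println("boundary splits a character: "
                + boundarySplitsCharacter());
    }
}
```

For this excerpt the name occupies 0x3fe2 through 0x401a, so it spans the boundary. The byte at 0x4000 is e3, the lead byte of a three-byte sequence, so the boundary falls between two characters of the name and does not split a character.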

The following code generates test data that triggers the problem.
----
import java.io.FileOutputStream;

public class MakeTestData {
    static final String xmldecl = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";

    static final String rootbgn = "<doc>\n";
    static final String elem1 =
        "  <\u30a8\u30ec\u30e1\u30f3\u30c8>"
        + "\u65e5\u672c\u8a9e\u8981\u7d20\u540d\u3060\u3088"
        + "</\u30a8\u30ec\u30e1\u30f3\u30c8>\n";
    static final String elem2 =
        "  <\u65e5\u672c\u8a9e\u306e\u307f\u306e\u30a8\u30ec\u30e1\u30f3\u30c8\u540d\u3067\u3082\u30c0\u30e1\u3067\u3057\u3087>"
        + "\u65e5\u672c\u8a9e\u306e\u307f\u306e\u30a8\u30ec\u30e1\u30f3\u30c8\u540d\u3082\u30c0\u30e1\u3067\u3057\u3087"
        + "</\u65e5\u672c\u8a9e\u306e\u307f\u306e\u30a8\u30ec\u30e1\u30f3\u30c8\u540d\u3067\u3082\u30c0\u30e1\u3067\u3057\u3087>\n";
    static final String rootend = "</doc>\n";
    static final String fname = "testdata.xml";
    static final String fenc = "UTF-8";

    public static void main(String[] args) {
        StringBuffer buf = new StringBuffer();

        buf.append(xmldecl);
        buf.append(rootbgn);

        for (int i=0; i<263; i++) {
            buf.append(elem1);
        }

        buf.append(elem2);
        buf.append(rootend);

        String testdata = buf.toString();

        try {
            FileOutputStream fos = new FileOutputStream(fname);
            fos.write(testdata.getBytes(fenc));
            
            fos.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
----
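
The loop count of 263 in MakeTestData can be cross-checked with a little arithmetic. The sketch below recomputes, from the same strings, the byte offsets of the long element name and the line it lands on. The class OffsetCheck and its method names are invented for this illustration, and the 16 KB boundary is again taken from the description.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original report: recomputes byte
// offsets for the file written by MakeTestData.
class OffsetCheck {
    static final String XMLDECL = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
    static final String ROOTBGN = "<doc>\n";
    static final String ELEM1 =
          "  <\u30a8\u30ec\u30e1\u30f3\u30c8>"
        + "\u65e5\u672c\u8a9e\u8981\u7d20\u540d\u3060\u3088"
        + "</\u30a8\u30ec\u30e1\u30f3\u30c8>\n";
    // Name of the element written last (elem2 in MakeTestData).
    static final String ELEM2_NAME =
        "\u65e5\u672c\u8a9e\u306e\u307f\u306e\u30a8\u30ec\u30e1\u30f3\u30c8\u540d\u3067\u3082\u30c0\u30e1\u3067\u3057\u3087";
    // Assumed boundary: 16 KB, as stated in the description.
    static final int BOUNDARY = 16 * 1024;

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Byte offset of the first byte of the elem2 name after `repeat`
    // copies of elem1; the elem2 line starts with two spaces and '<'.
    static int nameStart(int repeat) {
        return utf8Length(XMLDECL) + utf8Length(ROOTBGN)
                + repeat * utf8Length(ELEM1) + utf8Length("  <");
    }

    // Byte offset just past the last byte of the elem2 name.
    static int nameEnd(int repeat) {
        return nameStart(repeat) + utf8Length(ELEM2_NAME);
    }

    // True if the name starts before the boundary and reaches or crosses it.
    static boolean reachesBoundary(int repeat) {
        return nameStart(repeat) < BOUNDARY && nameEnd(repeat) >= BOUNDARY;
    }

    // 1-based line of elem2: XML declaration, <doc>, `repeat` lines, elem2.
    static int lineOfElem2(int repeat) {
        return 2 + repeat + 1;
    }

    public static void main(String[] args) {
        for (int repeat = 262; repeat <= 263; repeat++) {
            System.out.printf("repeat=%d name=[%d,%d) line=%d reaches=%b%n",
                    repeat, nameStart(repeat), nameEnd(repeat),
                    lineOfElem2(repeat), reachesBoundary(repeat));
        }
    }
}
```

With 263 repetitions the name occupies bytes 16354 through 16410 (0x3fe2 through 0x401a), which agrees with the hex dump, and elem2 sits on line 266, the line named in the error message. With 262 repetitions the name ends at byte 16348, before the boundary.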
Comment 1 Takuya Mori 2003-01-21 05:12:14 UTC
The following patch to src/org/apache/xerces/readers/UTF8Reader.java
seems to resolve the problem.
----
1398a1399
>                 data = fMostRecentData;
----

However, I'm not sure whether it is sufficient, or whether it causes any
regression.
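
For readers without the source at hand, here is one plausible reading of why a single assignment could matter. This is an assumption and has not been checked against UTF8Reader.java: the sketch is a toy model, not Xerces code, in which a scanning loop keeps a local copy (data) of the current chunk's array and the input is split into fixed-size chunks. If the local copy is not refreshed when the scan moves to the next chunk, the loop keeps reading the old chunk. Every name below except data and the idea of a "most recent data" field is invented for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Toy model only; this is NOT the Xerces implementation.
class StaleBufferDemo {
    // Splits the input into consecutive chunks of at most chunkSize bytes.
    static List<byte[]> split(byte[] input, int chunkSize) {
        List<byte[]> chunks = new ArrayList<>();
        for (int off = 0; off < input.length; off += chunkSize) {
            int len = Math.min(chunkSize, input.length - off);
            byte[] chunk = new byte[len];
            System.arraycopy(input, off, chunk, 0, len);
            chunks.add(chunk);
        }
        return chunks;
    }

    // Collects bytes from offset `start` up to the next '>' and decodes
    // them as UTF-8. Assumes 0 <= start < input.length. When refreshLocal
    // is false, the local array reference is left stale after a chunk
    // switch, which models the suspected defect.
    static String scanName(byte[] input, int chunkSize, int start,
                           boolean refreshLocal) {
        List<byte[]> chunks = split(input, chunkSize);
        int chunk = start / chunkSize;
        int index = start % chunkSize;
        byte[] mostRecentData = chunks.get(chunk); // stands in for a field
        byte[] data = mostRecentData;              // local cached copy
        ByteArrayOutputStream name = new ByteArrayOutputStream();
        while (true) {
            if (index == data.length) {
                if (++chunk == chunks.size()) {
                    break; // ran out of input
                }
                mostRecentData = chunks.get(chunk);
                index = 0;
                if (refreshLocal) {
                    data = mostRecentData; // the kind of line the patch adds
                }
            }
            byte b = data[index++];
            if (b == '>') {
                break;
            }
            name.write(b);
        }
        return new String(name.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] input = "<abcdef>xyz".getBytes(StandardCharsets.UTF_8);
        System.out.println("refreshed: " + scanName(input, 4, 1, true));
        System.out.println("stale:     " + scanName(input, 4, 1, false));
    }
}
```

In the toy model the refreshed scan returns the whole name, while the stale scan returns a name that is correct only up to the chunk boundary. That resembles the "substring of the element name" in the error message, but whether the real code path behaves this way would need to be confirmed in the source.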