16287 – The parser cannot parse some UTF-8 encoded Japanese documents

Summary: The parser cannot parse some UTF-8 encoded Japanese documents

Status:	NEW

Alias:	None

Product:	Xerces-J
Classification:	Unclassified
Component:	Core
Version:	1.4.4
Hardware:	All All

Importance:	P3 major
Target Milestone:	---
Assignee:	Xerces-J Developers Mailing List

URL:
Keywords:

Depends on:
Blocks:

Reported:	2003-01-21 05:02 UTC by Takuya Mori
Modified:	2004-11-16 19:05 UTC
CC List:	1 user

Description Takuya Mori 2003-01-21 05:02:31 UTC

Some UTF-8 encoded Japanese documents cause a fatal error.

If a name containing multi-byte UTF-8 characters reaches or crosses any
16 KB (16384-byte) boundary in its file, the parser reports
'[Fatal Error] testdata.xml:266:14: Element type "--a substring of the element
name in Japanese--" must be followed by either attribute specifications, ">" or
"/>".'

The following is part of a hex dump of the document.
----
00003fe0  20 3c e6 97 a5 e6 9c ac  e8 aa 9e e3 81 ae e3 81  | <..............|
00003ff0  bf e3 81 ae e3 82 a8 e3  83 ac e3 83 a1 e3 83 b3  |................|
00004000  e3 83 88 e5 90 8d e3 81  a7 e3 82 82 e3 83 80 e3  |................|
00004010  83 a1 e3 81 a7 e3 81 97  e3 82 87 3e e6 97 a5 e6  |...........>....|
00004020  9c ac e8 aa 9e e3 81 ae  e3 81 bf e3 81 ae e3 82  |................|
00004030  a8 e3 83 ac e3 83 a1 e3  83 b3 e3 83 88 e5 90 8d  |................|
00004040  e3 82 82 e3 83 80 e3 83  a1 e3 81 a7 e3 81 97 e3  |................|
00004050  82 87 3c 2f e6 97 a5 e6  9c ac e8 aa 9e e3 81 ae  |..</............|
00004060  e3 81 bf e3 81 ae e3 82  a8 e3 83 ac e3 83 a1 e3  |................|
00004070  83 b3 e3 83 88 e5 90 8d  e3 81 a7 e3 82 82 e3 83  |................|
00004080  80 e3 83 a1 e3 81 a7 e3  81 97 e3 82 87 3e 0a 3c  |.............>.<|
00004090  2f 64 6f 63 3e 0a                                 |/doc>.|
----
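
As a cross-check of the dump, the offsets of the failing element name can be recomputed from the strings used by the test data generator in this report. This is only a sketch: the class name BoundaryCheck and its constants and helper methods are made up for illustration, and the strings are copied from the generator.

```java
import java.nio.charset.StandardCharsets;

class BoundaryCheck {
    static final int BOUNDARY = 0x4000; // 16 KB = 16384 bytes

    // Strings copied from the test data generator in this report.
    static final String XMLDECL = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
    static final String ROOTBGN = "<doc>\n";
    static final String ELEM1 =
        "  <\u30a8\u30ec\u30e1\u30f3\u30c8>"
        + "\u65e5\u672c\u8a9e\u8981\u7d20\u540d\u3060\u3088"
        + "</\u30a8\u30ec\u30e1\u30f3\u30c8>\n";
    // Element name of the last element in the generated document.
    static final String NAME =
        "\u65e5\u672c\u8a9e\u306e\u307f\u306e\u30a8\u30ec\u30e1\u30f3\u30c8"
        + "\u540d\u3067\u3082\u30c0\u30e1\u3067\u3057\u3087";

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Byte offset of the first byte of NAME in the generated file. */
    static int nameStart() {
        int prefix = utf8Length(XMLDECL) + utf8Length(ROOTBGN)
                   + 263 * utf8Length(ELEM1);
        return prefix + utf8Length("  <"); // two spaces and '<' precede the name
    }

    /** Byte offset just past the last byte of NAME. */
    static int nameEnd() {
        return nameStart() + utf8Length(NAME);
    }

    public static void main(String[] args) {
        int start = nameStart();
        int end = nameEnd();
        System.out.printf("name occupies bytes 0x%x..0x%x%n", start, end - 1);
        System.out.println("crosses 0x4000: " + (start < BOUNDARY && BOUNDARY < end));
    }
}
```

The computed range is 0x3fe2 to 0x401a, which agrees with the dump: the "<" sits at 0x3fe1 and the closing ">" at 0x401b, so the name spans the 0x4000 boundary.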

The following code generates test data that triggers the problem.
----
import java.io.FileOutputStream;

public class MakeTestData {
    static final String xmldecl = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";

    static final String rootbgn = "<doc>\n";
    static final String elem1 =
        "  <\u30a8\u30ec\u30e1\u30f3\u30c8>"
        + "\u65e5\u672c\u8a9e\u8981\u7d20\u540d\u3060\u3088"
        + "</\u30a8\u30ec\u30e1\u30f3\u30c8>\n";
    static final String elem2 =
        "  <\u65e5\u672c\u8a9e\u306e\u307f\u306e\u30a8\u30ec\u30e1\u30f3\u30c8"
        + "\u540d\u3067\u3082\u30c0\u30e1\u3067\u3057\u3087>"
        + "\u65e5\u672c\u8a9e\u306e\u307f\u306e\u30a8\u30ec\u30e1\u30f3\u30c8"
        + "\u540d\u3082\u30c0\u30e1\u3067\u3057\u3087"
        + "</\u65e5\u672c\u8a9e\u306e\u307f\u306e\u30a8\u30ec\u30e1\u30f3\u30c8"
        + "\u540d\u3067\u3082\u30c0\u30e1\u3067\u3057\u3087>\n";
    static final String rootend = "</doc>\n";
    static final String fname = "testdata.xml";
    static final String fenc = "UTF-8";

    public static void main(String[] args) {
        StringBuffer buf = new StringBuffer();

        buf.append(xmldecl);
        buf.append(rootbgn);

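        // 263 copies of elem1 place the element name of elem2 across
        // byte offset 0x4000 (16 KB) in the encoded file.
        for (int i=0; i<263; i++) {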
            buf.append(elem1);
        }

        buf.append(elem2);
        buf.append(rootend);

        String testdata = buf.toString();

        try {
            FileOutputStream fos = new FileOutputStream(fname);
            fos.write(testdata.getBytes(fenc));
            
            fos.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
----
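
To try the generated file against a parser, a small driver along the following lines could be used. This is a sketch, not part of the original report: the class name ParseTestData and the helper rootName are made up, it goes through the standard JAXP DocumentBuilder API rather than any Xerces-specific class, and it is written for a current JDK. Whether the fatal error appears depends on which parser implementation is on the classpath; the report concerns Xerces-J 1.4.4.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class ParseTestData {
    /** Parses the stream and returns the root element name (hypothetical helper). */
    static String rootName(InputStream in) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(in);
            return doc.getDocumentElement().getNodeName();
        } catch (Exception e) {
            // A fatal parse error surfaces here as a SAXParseException.
            throw new RuntimeException("parse failed: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Defaults to the file name used by the generator in this report.
        String fname = args.length > 0 ? args[0] : "testdata.xml";
        try (InputStream in = new FileInputStream(fname)) {
            System.out.println("Parsed OK, root element: " + rootName(in));
        }
    }
}
```

Run the generator first so that testdata.xml exists, then run this driver with the parser under test on the classpath.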

Comment 1 Takuya Mori 2003-01-21 05:12:14 UTC

The following patch to src/org/apache/xerces/readers/UTF8Reader.java
seems to resolve the problem.
----
1398a1399
>                 data = fMostRecentData;
----

However, I'm not sure whether this change is sufficient, or whether it causes any regression.
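
The patch reassigns a local variable data from fMostRecentData, but the surrounding code of UTF8Reader.java is not shown here. The following is therefore only a toy model of the general kind of failure such a line would address: a scan loop that caches a reference to the current buffer chunk and does not refresh it after advancing to the next chunk. Every name in it (StaleChunkDemo, mostRecentData, nextChunk, scanName, scan) is hypothetical, the chunk size is 8 bytes instead of 16 KB, and none of it is Xerces code.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class StaleChunkDemo {
    static final int CHUNK = 8; // toy chunk size; the report involves 16 KB

    private final byte[][] chunks;
    private int chunkIndex = 0;
    private byte[] mostRecentData; // hypothetical stand-in for a field like fMostRecentData

    StaleChunkDemo(byte[] input) {
        // Split the input into fixed-size chunks (the last one may be shorter).
        int n = Math.max(1, (input.length + CHUNK - 1) / CHUNK);
        chunks = new byte[n][];
        for (int i = 0; i < n; i++) {
            int from = Math.min(i * CHUNK, input.length);
            int to = Math.min(from + CHUNK, input.length);
            chunks[i] = Arrays.copyOfRange(input, from, to);
        }
        mostRecentData = chunks[0];
    }

    /** Advances to the next chunk; returns false at end of input. */
    private boolean nextChunk() {
        if (chunkIndex + 1 >= chunks.length) return false;
        mostRecentData = chunks[++chunkIndex];
        return true;
    }

    /** Collects bytes up to '>' and decodes them; refreshLocal toggles the fix. */
    String scanName(boolean refreshLocal) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] data = mostRecentData; // local copy of the current chunk reference
        int offset = 0;
        while (true) {
            if (offset == data.length) {
                if (!nextChunk()) break;
                offset = 0;
                // Without this line, data keeps pointing at the previous chunk.
                if (refreshLocal) data = mostRecentData;
                continue;
            }
            byte b = data[offset++];
            if (b == '>') break;
            out.write(b);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    static String scan(String text, boolean refreshLocal) {
        return new StaleChunkDemo(text.getBytes(StandardCharsets.UTF_8)).scanName(refreshLocal);
    }

    public static void main(String[] args) {
        String input = "\u30a8\u30ec\u30e1\u30f3\u30c8>"; // 15-byte name plus '>'
        System.out.println("with refresh:    " + scan(input, true));
        System.out.println("without refresh: " + scan(input, false));
    }
}
```

With the refresh, the toy scanner returns the full name; without it, the bytes of the first chunk are read twice and the result is a corrupted name. This matches the shape of the reported symptom (only a substring of the name is recognized), but it does not establish that the real reader fails this way.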