Bug 7065 - Xerces encodes strange characters but can't parse them
Summary: Xerces encodes strange characters but can't parse them
Status: NEW
Alias: None
Product: Xerces-J
Classification: Unclassified
Component: Core
Version: unspecified
Hardware: All All
Importance: P3 normal
Target Milestone: ---
Assignee: Xerces-J Developers Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2002-03-12 22:57 UTC by Eric A. Maginniss
Modified: 2004-11-16 19:05 UTC



Description Eric A. Maginniss 2002-03-12 22:57:17 UTC
This may be a failing of my understanding of XML, but I've always been a strong 
believer that if a framework can generate a document, it should be able to 
parse it as well.  The following code generates an XML document that cannot be 
parsed by xerces.  The code and output follow:

Code:
    package testclassloader;

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;

    import org.apache.xerces.dom.DocumentImpl;
    import org.apache.xerces.parsers.DOMParser;
    import org.apache.xml.serialize.OutputFormat;
    import org.apache.xml.serialize.XMLSerializer;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.Node;
    import org.xml.sax.InputSource;

    public class TestXerces {
        public static void main(String[] args) throws Exception {
            // Byte 28 decodes to U+001C, a control character
            byte[] bytes = { 28 };

            // Create the document
            Document document = new DocumentImpl();
            Element root = document.createElement("TEST");
            Node child = document.createTextNode(new String(bytes));
            root.appendChild(child);
            document.appendChild(root);

            // Serialize document to String
            ByteArrayOutputStream outStream = new ByteArrayOutputStream();
            OutputFormat format = new OutputFormat(document);
            XMLSerializer serial = new XMLSerializer(outStream, format);
            serial.asDOMSerializer();
            serial.serialize(document.getDocumentElement());
            outStream.flush();
            String xml = outStream.toString();

            // Print out text interpretation of xml document
            System.out.println(xml);

            // Reparse text into xml
            ByteArrayInputStream inputStream = new ByteArrayInputStream(xml.getBytes());
            DOMParser parser = new DOMParser();
            InputSource inputSource = new InputSource(inputStream);
            parser.parse(inputSource);
            document = parser.getDocument();
        }
    }

Output:
<?xml version="1.0" encoding="UTF-8"?>
<TEST>&#x1c;</TEST>

[Fatal Error] :2:13: Character reference "&#1c" is an invalid XML character.

org.xml.sax.SAXParseException: Character reference "&#1c" is an invalid XML character.
	at org.apache.xerces.parsers.DOMParser.parse(DOMParser.java:235)
	at testclassloader.TestXerces.main(TestXerces.java:53)
Exception in thread "main" 


This particular test was run with Xerces 2.0.1, but I've had similar results 
with 1.4.4, though the escaped character it emits is different.

While I realize that character 28 is not a valid character under the XML spec, 
I am curious why Xerces will create a text node or serialize a document 
containing an invalid character.

Also, is there any way to properly encode this document or do I need to 
manually escape my node text before encoding?

Thanks for your time and for working on a fantastic open-source project.
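
On the question of escaping manually: one possible workaround, assuming it is acceptable to drop or replace the offending characters, is to filter text against the XML 1.0 Char production before passing it to createTextNode. The sketch below is only an illustration. XmlTextSanitizer, isXml10Char and sanitize are hypothetical names, not part of Xerces.

```java
public final class XmlTextSanitizer {
    private XmlTextSanitizer() {}

    /** True if the code point matches the XML 1.0 Char production. */
    public static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces every code point outside the Char production with the given replacement. */
    public static String sanitize(String text, String replacement) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isXml10Char(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append(replacement);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Same input as the report: a single character with value 28 (U+001C)
        String raw = "a" + (char) 28 + "b";
        System.out.println(sanitize(raw, "?")); // prints a?b
    }
}
```

The result of sanitize could then be handed to document.createTextNode(...) in place of the raw string. Whether silently replacing characters is appropriate depends on the application, since the original data is lost.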
Comment 1 Joe Kesselman 2002-03-13 14:14:06 UTC
The DOM APIs are able to represent documents which cannot be encoded as 
well-formed XML, for reasons largely having to do with performance. 

It's up to the XML serializer to decide whether to write them out as damaged XML 
while generating a warning message somewhere, to attempt to repair them, or to 
refuse to write the document out and report the error. There are risks and 
benefits to all three options, in terms of performance as well as flexibility and 
diagnosability.
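
To make the three options concrete, here is a rough sketch of how a text-escaping routine might apply each one. It does not reflect how the Xerces serializer is actually implemented. InvalidCharPolicyDemo, Policy and writeText are hypothetical names, and using U+FFFD as the repair substitute is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public final class InvalidCharPolicyDemo {
    /** The three serializer behaviours described above (hypothetical names). */
    public enum Policy { WARN, REPAIR, REFUSE }

    /** True if the code point matches the XML 1.0 Char production. */
    public static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Escapes element text, handling invalid characters according to the policy. */
    public static String writeText(String text, Policy policy, List<String> warnings) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isXml10Char(cp)) {
                switch (cp) {
                    case '&': out.append("&amp;"); break;
                    case '<': out.append("&lt;"); break;
                    case '>': out.append("&gt;"); break;
                    default: out.appendCodePoint(cp);
                }
                continue;
            }
            String hex = Integer.toHexString(cp);
            switch (policy) {
                case WARN:
                    // Write damaged XML anyway, but record a warning
                    warnings.add("invalid XML 1.0 character U+" + hex);
                    out.append("&#x").append(hex).append(';');
                    break;
                case REPAIR:
                    // Substitute the Unicode replacement character (an assumption)
                    out.append('\uFFFD');
                    break;
                case REFUSE:
                    throw new IllegalArgumentException("invalid XML 1.0 character U+" + hex);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> warnings = new ArrayList<>();
        String text = String.valueOf((char) 28);
        // WARN reproduces the output from the report
        System.out.println("<TEST>" + writeText(text, Policy.WARN, warnings) + "</TEST>");
        System.out.println(warnings);
    }
}
```

With WARN the output is not well-formed XML 1.0 and a conforming parser will reject it, which matches the behaviour in the report. REPAIR produces a parseable document but loses the original data. REFUSE surfaces the problem at serialization time.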