This may be a failing of my understanding of XML, but I've always been a strong believer that if a framework can generate a document, it should be able to parse it as well. The following code generates an XML document that cannot be parsed by Xerces. The code and output follow.

Code:

    public static void main(String[] args) throws Exception {
        byte[] bytes = { 28 };

        // Create the document
        Document document = new DocumentImpl();
        Element root = document.createElement("TEST");
        Node child = document.createTextNode(new String(bytes));
        root.appendChild(child);
        document.appendChild(root);

        // Serialize document to String
        ByteArrayOutputStream outStream = new ByteArrayOutputStream();
        OutputFormat format = new OutputFormat(document);
        XMLSerializer serial = new XMLSerializer(outStream, format);
        serial.asDOMSerializer();
        serial.serialize(document.getDocumentElement());
        outStream.flush();
        String xml = outStream.toString();

        // Print out text interpretation of xml document
        System.out.println(xml);

        // Reparse text into xml
        ByteArrayInputStream inputStream = new ByteArrayInputStream(xml.getBytes());
        DOMParser parser = new DOMParser();
        InputSource inputSource = new InputSource(inputStream);
        parser.parse(inputSource);
        document = parser.getDocument();
    }

Output:

    <?xml version="1.0" encoding="UTF-8"?>
    <TEST>&#x1c;</TEST>
    [Fatal Error] :2:13: Character reference "&#1c" is an invalid XML character.
    org.xml.sax.SAXParseException: Character reference "&#1c" is an invalid XML character.
        at org.apache.xerces.parsers.DOMParser.parse(DOMParser.java:235)
        at testclassloader.TestXerces.main(TestXerces.java:53)
    Exception in thread "main"

This particular test was run with Xerces 2.0.1, but I've had similar results with 1.4.4, though the escaped character in the output is different. While I realize that character 28 is not a valid character under the XML spec, I am curious why Xerces will generate a text node or serialize a document containing an invalid character.
Also, is there any way to properly encode this document or do I need to manually escape my node text before encoding? Thanks for your time and for working on a fantastic open-source project.
The DOM APIs are able to represent documents that cannot be encoded as well-formed XML, largely for performance reasons. It's up to the XML serializer to decide whether to write them out as damaged XML while generating a warning message somewhere, to attempt to repair them, or to refuse to write the document out and report the error. All three options carry risks and benefits in performance, flexibility, and diagnosability.
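
On the question of escaping the text manually: in XML 1.0, a character such as 28 (0x1C) is not allowed even as a character reference, so escaping alone will not make the document parseable. One option is to filter the text on the application side before it reaches createTextNode. The following is only a minimal sketch of that idea, not anything Xerces provides; the class name XmlCharFilter and its methods are made up for illustration, and it assumes that silently dropping the offending characters is acceptable for your data (encoding the value, for example as Base64, is the alternative if the bytes must survive a round trip).

```java
final class XmlCharFilter {

    private XmlCharFilter() {}

    /** Returns true if the code point matches the XML 1.0 Char production. */
    public static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Hypothetical helper: drops every code point that XML 1.0 cannot carry. */
    public static String stripInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isXml10Char(cp)) {
                out.appendCodePoint(cp);
            }
            // Advance by one or two chars depending on the code point width.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "before" + (char) 28 + "after";
        String clean = stripInvalidXmlChars(raw);
        System.out.println(clean); // prints "beforeafter"
    }
}
```

In the test case above, this would be applied as document.createTextNode(XmlCharFilter.stripInvalidXmlChars(new String(bytes))), which leaves the TEST element with an empty text node that serializes and reparses cleanly.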