Bug 9215 - XML that contains a large number of CDATA sections is parsed incorrectly
Summary: XML that contains a large number of CDATA sections is parsed incorrectly
Status: NEW
Alias: None
Product: Xerces-J
Classification: Unclassified
Component: Other
Version: unspecified
Hardware: All All
Importance: P3 major
Target Milestone: ---
Assignee: Xerces-J Developers Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2002-05-17 22:58 UTC by Matt Havlovick
Modified: 2004-11-16 19:05 UTC

Description Matt Havlovick 2002-05-17 22:58:52 UTC
For my work, I retrieve a large amount of data as an XML String and I use the 
DocumentBuilder to parse a ByteArrayInputStream containing this XML. The XML 
contains many CDATA sections and occasionally, depending upon the data, the 
document tree will have nodes that contain incorrect data. 

I have found that if I put a crimson.jar in front of the xercesImpl.jar in the 
classpath, then the document tree comes out OK, but not if xercesImpl.jar is in 
front of the crimson.jar.
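
Because the result depends on classpath order, it can help to confirm which 
JAXP implementation is actually being picked up. The following is a minimal 
sketch, not part of the original test case; the class name WhichParser and its 
method names are made up for illustration. It only prints the class names of 
the factory and builder that DocumentBuilderFactory.newInstance() selects.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

// Hypothetical helper (name is an assumption): reports which JAXP
// implementation the current classpath resolves to.
class WhichParser {

    /** Class name of the DocumentBuilderFactory that JAXP selected. */
    static String factoryClassName() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    /** Class name of the DocumentBuilder created by that factory. */
    static String builderClassName() {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.getClass().getName();
        } catch (ParserConfigurationException e) {
            // Surface configuration problems instead of hiding them.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("factory: " + factoryClassName());
        System.out.println("builder: " + builderClassName());
    }
}
```

Running this once with each classpath order should show whether the Crimson 
or the Xerces classes are the ones doing the parsing.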

Since we use such a large string of XML data, trying to have you reproduce it 
may be somewhat difficult. I was able to make a small program that does produce 
these incorrect results.

import org.w3c.dom.*;
import javax.xml.parsers.*;
import java.io.*;

class xmltest {
    public static void main(String[] args) {

        // Build <LETTERS> with 101 <LETTER> children; child y holds one
        // CDATA section with the letter (y % 26) repeated y + 1 times.
        StringBuffer xml = new StringBuffer();
        xml.append("<LETTERS>");
        for (int y = 0; y <= 100; y++) {
            xml.append("<LETTER><![CDATA[");
            for (int z = 0; z <= y; z++) xml.append((char) ((y % 26) + 97));
            xml.append("]]></LETTER>");
        }
        xml.append("</LETTERS>");

        byte[] b = xml.toString().getBytes();
        InputStream is = new ByteArrayInputStream(b);
        Document doc = null;
        try {
            DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
            DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
            doc = docBuilder.parse(is);
        } catch (Exception e) {
            // Report parse failures instead of continuing with a null document.
            e.printStackTrace();
            return;
        }

        // Print the value of the first child (the CDATA node) of each <LETTER>.
        NodeList nodelist = doc.getDocumentElement().getChildNodes();
        for (int idx = 0; idx < nodelist.getLength(); idx++) {
            Node node = nodelist.item(idx);
            System.out.println(node.getFirstChild().getNodeValue());
        }
    }
}

At least in my testing, when the loop reaches the 65th item in the nodelist, 
the node value is incorrect. Instead of containing a single repeated letter, 
the node looks like a concatenation of many of the other nodes.
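
Rather than checking the printed output by eye, the comparison can be 
automated. The sketch below is an illustration only and was not part of the 
original report; the class name CdataCheck and its helper methods are 
hypothetical. It builds the same kind of document, parses it with whichever 
parser JAXP selects, and returns the index of the first LETTER element whose 
text differs from what was written, or -1 if every element matches. It 
concatenates all child node values so that a parser that splits the content 
across several nodes is not reported as a failure.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical checker (names are assumptions, not from the report).
class CdataCheck {

    /** Text expected in element y: letter (y % 26) repeated y + 1 times. */
    static String expected(int y) {
        StringBuilder sb = new StringBuilder();
        for (int z = 0; z <= y; z++) sb.append((char) ((y % 26) + 97));
        return sb.toString();
    }

    /** Builds <LETTERS> with count <LETTER> children, each one CDATA section. */
    static String buildXml(int count) {
        StringBuilder xml = new StringBuilder("<LETTERS>");
        for (int y = 0; y < count; y++) {
            xml.append("<LETTER><![CDATA[").append(expected(y)).append("]]></LETTER>");
        }
        return xml.append("</LETTERS>").toString();
    }

    /** Concatenates the values of all child nodes of the given element. */
    static String textOf(Node element) {
        StringBuilder sb = new StringBuilder();
        for (Node c = element.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeValue() != null) sb.append(c.getNodeValue());
        }
        return sb.toString();
    }

    /** Index of the first element with unexpected text, or -1 if all match. */
    static int firstMismatch(int count) {
        try {
            byte[] bytes = buildXml(count).getBytes(StandardCharsets.UTF_8);
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
            NodeList letters = doc.getDocumentElement().getChildNodes();
            if (letters.getLength() != count) {
                throw new IllegalStateException(
                        "expected " + count + " children, got " + letters.getLength());
            }
            for (int i = 0; i < count; i++) {
                if (!expected(i).equals(textOf(letters.item(i)))) return i;
            }
            return -1;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // 101 elements, matching the y <= 100 loop in the test program.
        System.out.println("first mismatch index: " + firstMismatch(101));
    }
}
```

A result other than -1 gives the zero-based index of the first corrupted 
element, which can be compared across the two classpath orders.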

Thanks,

Matt Havlovick
Consolidated Freightways
Comment 1 Joe Kesselman 2002-05-17 23:41:27 UTC
If changing the parser makes the problem go away, this may be a parser bug 
rather than a Xalan bug. Have you tried running your documents through the 
Xerces sample programs, to see whether they're parsing correctly?
Comment 2 Matt Havlovick 2002-05-18 17:12:36 UTC
Yes, it seems to be a parser bug. The xercesImpl.jar file appears to have 
the problem, and because it is packaged with the Xalan download, I thought it 
might go here as a Xalan bug.
Comment 3 Joe Kesselman 2002-05-20 12:50:18 UTC
Nothing wrong with posting it as a Xalan bug as a first guess, but if it's clear 
that it's a Xerces malfunction, posting it there instead is the only way to get 
it fixed.

Transferring to the Xerces project.