Bug 48936

Summary: Writing specific sequence of strings to XSSFSheet results in malformed XML
Product: POI Reporter: Philippe Laflamme <philippe.laflamme>
Component: XSSFAssignee: POI Developers List <dev>
Severity: normal    
Priority: P2    
Version: 3.6-FINAL   
Target Milestone: ---   
Hardware: PC   
OS: Linux   
Attachments: This patch adds a test case. Created against 3_6 tag.
A simplified version of the strings file

Description Philippe Laflamme 2010-03-18 16:49:11 UTC
Created attachment 25147 [details]
This patch adds a test case. Created against 3_6 tag.

A specific sequence of strings results in broken CDATA section in sharedStrings.xml file.

Attached patch adds a test case to 3_6 tag to reproduce the problem.

org.apache.poi.POIXMLException: java.lang.reflect.InvocationTargetException
        at org.apache.poi.xssf.usermodel.XSSFFactory.createDocumentPart(XSSFFactory.java:61)
        at org.apache.poi.POIXMLDocumentPart.read(POIXMLDocumentPart.java:256)
        at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:196)
        at org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:179)
        at org.apache.poi.xssf.TestBrokenCdata.testWritingSomeStringsResultsInMalformedCdata(TestBrokenCdata.java:53)
Caused by: java.lang.reflect.InvocationTargetException
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.poi.xssf.usermodel.XSSFFactory.createDocumentPart(XSSFFactory.java:59)
Caused by: java.io.IOException: error: Unexpected character encountered (lex state 9): '!'
        at org.apache.poi.xssf.model.SharedStringsTable.readFrom(SharedStringsTable.java:119)
        at org.apache.poi.xssf.model.SharedStringsTable.<init>(SharedStringsTable.java:97)
Comment 1 Josh Micich 2010-03-20 01:08:57 UTC
Created attachment 25154 [details]
A simplified version of the strings file

The problem seems to be within piccolo (part of xmlbeans).  The encoding logic is incorrectly identifying the ']]' in one string as ']]>' and thus attempts a mis-aligned escape.  This logic is being performed in a buffer which gets re-used multiple times across the document.  When piccolo reads one character too far, it reads whatever was left in the buffer from the previous use.  If the input data is constructed carefully, the '>' character may be incorrectly detected.

There are many sets of data which can cause this problem though they can be a little hard to find.  Here are a few guidelines:
  - there needs to be enough total data to cause a wrap around in the buffer.  The problem occurs with strings written after the first buffer flush.
  - one of those strings needs to 
     - be longer than 32 chars, 
     - have more than 5 entities that need escaping (e.g. '>')
     - end in ']]'
  - the lengths of all previous strings are chosen to cause a '>' from the previous buffer usage align with the']]'

This bug should probably be forwarded off to the xmlbeans team.  It will probably take some effort to create a test case that is independent of POI.

xmlbeans 2.5.0 displays the same problem (POI currently uses 2.3.0)
Comment 2 Yegor Kozlov 2010-04-25 09:03:35 UTC
Quite an interesting bug. 
The problem is in the way XmlBeans detects and writes CDATA blocks. Luckily, we can control this funny behavior.

I traced the problem down to the internal class TextSaver in XmlBeans: 


The logic for detection CDATA starts at line #1286. The heuristic is quite complex, but it turned out that it can be controlled with two options:


The default value of cdataEntityCountThreshold is 5 and the default value of cdataLengthThreshold is 32. These values perfectly agree with Josh's observations. 

According to the docs, XmlBeans will use CDATA if the following condition is true:
    textLength > cdataLengthThreshold && entityCount > cdataEntityCountThreshold

The combination of XmlOptions.setSaveCDataEntityCountThreshold(0) and XmlOptions.setSaveCDataLengthThreshold(-1) will make every text CDATA. 

The combination of XmlOptions.setSaveCDataEntityCountThreshold(MAXLENGTH) and XmlOptions.setSaveCDataLengthThreshold(-1) will detect CDATA only if the text is longer than MAXLENGTH chars. I used the following values to disable CDATA when saving sharedStrings.xml:

  XmlOptions options = new XmlOptions(DEFAULT_XML_OPTIONS);

I committed the fix in r937792. 

Existing code using POI-3.6 can be fixed as follows:

  XmlOptions options = POIXMLDocumentPart.DEFAULT_XML_OPTIONS;

Add these lines before calling  workbook.write(out)