Bug 48936 - Writing specific sequence of strings to XSSFSheet results in malformed XML
Summary: Writing specific sequence of strings to XSSFSheet results in malformed XML
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: XSSF (show other bugs)
Version: 3.6-FINAL
Hardware: PC Linux
: P2 normal (vote)
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-03-18 16:49 UTC by Philippe Laflamme
Modified: 2010-04-25 09:03 UTC (History)
0 users



Attachments
This patch adds a test case. Created against 3_6 tag. (6.03 KB, patch)
2010-03-18 16:49 UTC, Philippe Laflamme
Details | Diff
A simplified version of the strings file (1.82 KB, text/plain)
2010-03-20 01:08 UTC, Josh Micich
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Philippe Laflamme 2010-03-18 16:49:11 UTC
Created attachment 25147 [details]
This patch adds a test case. Created against 3_6 tag.

A specific sequence of strings results in broken CDATA section in sharedStrings.xml file.

Attached patch adds a test case to 3_6 tag to reproduce the problem.

java.lang.reflect.InvocationTargetException
org.apache.poi.POIXMLException: java.lang.reflect.InvocationTargetException
        at org.apache.poi.xssf.usermodel.XSSFFactory.createDocumentPart(XSSFFactory.java:61)
        at org.apache.poi.POIXMLDocumentPart.read(POIXMLDocumentPart.java:256)
        at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:196)
        at org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:179)
        at org.apache.poi.xssf.TestBrokenCdata.testWritingSomeStringsResultsInMalformedCdata(TestBrokenCdata.java:53)
Caused by: java.lang.reflect.InvocationTargetException
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.poi.xssf.usermodel.XSSFFactory.createDocumentPart(XSSFFactory.java:59)
Caused by: java.io.IOException: error: Unexpected character encountered (lex state 9): '!'
        at org.apache.poi.xssf.model.SharedStringsTable.readFrom(SharedStringsTable.java:119)
        at org.apache.poi.xssf.model.SharedStringsTable.<init>(SharedStringsTable.java:97)
Comment 1 Josh Micich 2010-03-20 01:08:57 UTC
Created attachment 25154 [details]
A simplified version of the strings file

The problem seems to be within piccolo (part of xmlbeans).  The encoding logic is incorrectly identifying the ']]' in one string as ']]>' and thus attempts a mis-aligned escape.  This logic is being performed in a buffer which gets re-used multiple times across the document.  When piccolo reads one character too far, it reads whatever was left in the buffer from the previous use.  If the input data is constructed carefully, the '>' character may be incorrectly detected.

There are many sets of data which can cause this problem though they can be a little hard to find.  Here are a few guidelines:
  - there needs to be enough total data to cause a wrap around in the buffer.  The problem occurs with strings written after the first buffer flush.
  - one of those strings needs to 
     - be longer than 32 chars, 
     - have more than 5 entities that need escaping (e.g. '>')
     - end in ']]'
  - the lengths of all previous strings are chosen to cause a '>' from the previous buffer usage align with the']]'

This bug should probably be forwarded off to the xmlbeans team.  It will probably take some effort to create a test case that is independent of POI.

xmlbeans 2.5.0 displays the same problem (POI currently uses 2.3.0)
Comment 2 Yegor Kozlov 2010-04-25 09:03:35 UTC
Quite an interesting bug. 
The problem is in the way XmlBeans detects and writes CDATA blocks. Luckily, we can control this funny behavior.

I traced the problem down to the internal class TextSaver in XmlBeans: 

http://svn.apache.org/viewvc/xmlbeans/tags/2.3.0/src/store/org/apache/xmlbeans/impl/store/Saver.java?view=markup

The logic for detection CDATA starts at line #1286. The heuristic is quite complex, but it turned out that it can be controlled with two options:

  XmlOptions#setSaveCDataLengthThreshold(int)
  XmlOptions#setSaveCDataEntityCountThreshold(int)

The default value of cdataEntityCountThreshold is 5 and the default value of cdataLengthThreshold is 32. These values perfectly agree with Josh's observations. 

According to the docs, XmlBeans will use CDATA if the following condition is true:
    textLength > cdataLengthThreshold && entityCount > cdataEntityCountThreshold

The combination of XmlOptions.setSaveCDataEntityCountThreshold(0) and XmlOptions.setSaveCDataLengthThreshold(-1) will make every text CDATA. 

The combination of XmlOptions.setSaveCDataEntityCountThreshold(MAXLENGTH) and XmlOptions.setSaveCDataLengthThreshold(-1) will detect CDATA only if the text is longer than MAXLENGTH chars. I used the following values to disable CDATA when saving sharedStrings.xml:

  XmlOptions options = new XmlOptions(DEFAULT_XML_OPTIONS);
  options.setSaveCDataLengthThreshold(1000000); 
  options.setSaveCDataEntityCountThreshold(-1);


I committed the fix in r937792. 

Existing code using POI-3.6 can be fixed as follows:

  XmlOptions options = POIXMLDocumentPart.DEFAULT_XML_OPTIONS;
  options.setSaveCDataLengthThreshold(1000000);
  options.setSaveCDataEntityCountThreshold(-1);

Add these lines before calling  workbook.write(out)

Regards,
Yegor