Bug 52784

Summary: SXSSFWorkbook, invalid xml characters, corrupted XLSX
Product: POI
Reporter: Catalin Z. Alexandru <catalinalexandru.zamfir>
Component: SXSSF
Assignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: normal    
Priority: P2    
Version: 3.8-dev   
Target Milestone: ---   
Hardware: PC   
OS: All   
Attachments: SXSSFWorkbook generated file

Description Catalin Z. Alexandru 2012-02-28 09:52:39 UTC
Created attachment 28395
SXSSFWorkbook generated file

Exporting with SXSSFWorkbook generates a corrupted .xlsx file. I've attached the generated XLSX file. I viewed it with an XML viewer, but could not find the problem.

Generating the same XLSX from the same data with XSSFWorkbook produces a proper .xlsx file.

We use SXSSFWorkbook because of memory constraints. We're now using XSSFWorkbook as a quick fix/workaround, but wish to identify the problem here.
Comment 1 Catalin Z. Alexandru 2012-02-28 10:43:28 UTC
Excel reports: "Replaced Part: /xl/worksheets/sheet1.xml part with XML error.  Illegal xml character. Line 394, column 267.".

Looking in sheet1.xml at line 394, column 267, I see this: "If you&#226;&#25;re looking for a palm-sweating". Column 267 is the ";" in "&#25;". I tried to decode the entity, but it produces a strange, unprintable character.

SXSSFWorkbook should ignore characters that are unknown or invalid in XML. I've tracked this issue down, and it seems that the original source of this message contains the same unprintable characters. They do not show up when the text is rendered, but can easily be spotted in the source of the original document.

As far as I know, characters below ASCII 32 are control characters. Shouldn't these be ignored rather than encoded? As they're not printable, they don't provide any useful value to anybody.

XSSFWorkbook does a proper job ignoring this.
SXSSFWorkbook doesn't.
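
For anyone trying to locate such a character in their own data before it reaches the workbook, here is a minimal diagnostic sketch. It assumes the XML 1.0 definition of a legal character (tab, line feed, carriage return, U+0020 to U+D7FF, U+E000 to U+FFFD, and U+10000 to U+10FFFF). The class and method names are hypothetical and are not part of POI.

```java
// Hypothetical helper, not part of POI: finds characters that XML 1.0 does not allow.
final class XmlCharCheck {

    // True if the code point is allowed by the XML 1.0 "Char" production.
    public static boolean isLegalXmlCodePoint(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the index of the first illegal code point, or -1 if the text is clean.
    public static int firstIllegalIndex(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isLegalXmlCodePoint(cp)) {
                return i;
            }
            i += Character.charCount(cp); // step over a full surrogate pair
        }
        return -1;
    }

    public static void main(String[] args) {
        // Code point 25 (0x19) stands in for the "&#25;" seen in sheet1.xml.
        String text = "If you\u0019re looking for a palm-sweating";
        System.out.println(firstIllegalIndex(text));
    }
}
```

Running such a check on cell values before calling setCellValue would point at the offending position in the source data rather than in the generated sheet1.xml.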
Comment 2 Yegor Kozlov 2012-02-28 14:03:47 UTC
Should be fixed in r1294657

Your diagnosis is correct: writing an ISO control character (< 32) resulted in a corrupted workbook.

I could easily reproduce it with the following simple code:

        Workbook wb = new SXSSFWorkbook();
        Sheet sh = wb.createSheet();
        Cell cell = sh.createRow(0).createCell(0);
        
        cell.setCellValue("\u0000");

XSSF delegates writing XML to XmlBeans and this framework replaces characters below 32 with question marks. I changed SXSSF to do so too.

It appears that there are two more special cases where you can't simply write a char code in XML:

 case 1: low and high Unicode surrogates: DC00-DFFF and D800-DBFF
 case 2: 'not a character' range: FFFE-FFFF

XmlBeans replaces characters from these ranges with question marks, so I fixed SXSSF to be consistent.

Yegor
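
For users stuck on a build without this fix, the replacement rule described above can be approximated in application code before the value is passed to setCellValue. The following is only a sketch of that rule, not the code committed in r1294657. The class and method names are hypothetical, and keeping tab, line feed and carriage return unchanged is an assumption made here because those three are legal in XML.

```java
// Hypothetical pre-processing helper, not part of POI.
final class XmlSafeText {

    // Replaces control characters, surrogates and U+FFFE/U+FFFF with '?'.
    public static String replaceIllegalChars(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            // Assumption: tab, LF and CR are kept because XML allows them.
            boolean control = c < ' ' && c != '\t' && c != '\n' && c != '\r';
            // Every surrogate char is replaced, as the comment above describes,
            // so a valid surrogate pair becomes two question marks.
            boolean surrogate = Character.isSurrogate(c);
            boolean notACharacter = c == '\uFFFE' || c == '\uFFFF';
            out.append(control || surrogate || notACharacter ? '?' : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceIllegalChars("If you\u0019re looking"));
    }
}
```

A call such as cell.setCellValue(XmlSafeText.replaceIllegalChars(value)) would then keep the generated sheet XML well-formed, at the cost of losing any supplementary characters in the text.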
Comment 3 sabdulrazak 2012-04-27 13:33:27 UTC
Yegor,

I am facing the same problem. Where can I download the jar files for this release? Please advise.

regards,
Sheikh
Comment 4 sabdulrazak 2012-04-30 07:05:47 UTC
Hi,

I was able to download version 3.8:

http://www.apache.org/dyn/closer.cgi/poi/release/bin/poi-bin-3.8-20120326.zip

Thanks.

regards,
Sheikh