Bug 52784 - SXSSFWorkbook, invalid xml characters, corrupted XLSX
Alias: None
Product: POI
Classification: Unclassified
Component: SXSSF
Version: 3.8-dev
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Reported: 2012-02-28 09:52 UTC by Catalin Z. Alexandru
Modified: 2012-04-30 07:05 UTC

SXSSFWorkbook generated file (225.37 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
2012-02-28 09:52 UTC, Catalin Z. Alexandru

Description Catalin Z. Alexandru 2012-02-28 09:52:39 UTC
Created attachment 28395
SXSSFWorkbook generated file

Exporting with SXSSFWorkbook generates a corrupted .xlsx file. I've attached the generated XLSX file. I viewed it with an XML viewer but could not find the problem.

Generating the same XLSX from the same data with XSSFWorkbook produces a proper .xlsx file.

We're using SXSSFWorkbook because of memory issues. We're now using XSSFWorkbook as a quick fix/workaround, but we wish to identify the problem here.
Comment 1 Catalin Z. Alexandru 2012-02-28 10:43:28 UTC
Excel reports: "Replaced Part: /xl/worksheets/sheet1.xml part with XML error.  Illegal xml character. Line 394, column 267.".

Looking in sheet1.xml at line 394, column 267, I see this around it: "If youâre looking for a palm-sweating". Column 267 is the ";" in "". I tried to decode the entire entity, but it outputs a weird character.

SXSSFWorkbook should ignore characters that are unknown or invalid in XML. I've tracked this issue down, and it seems that the original source of this message contains the same unprintable characters. They do not show up when the text is displayed, but they can easily be spotted in the source of the original document.

As far as I know, characters below ASCII 32 are control characters. Shouldn't these be ignored rather than encoded? As they're not printable, they don't provide any useful value to anybody.

XSSFWorkbook does a proper job ignoring this.
SXSSFWorkbook doesn't.
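
For anyone trying to find such characters in their own source data before exporting, here is a minimal sketch in plain Java. ControlCharFinder and findControlChars are hypothetical names, not part of POI, and U+0019 in the sample string is only an assumed stand-in, since the actual character cannot be recovered from this page. Tab, LF and CR are skipped on the assumption that XML 1.0 accepts them.

```java
import java.util.ArrayList;
import java.util.List;

class ControlCharFinder {
    // Hypothetical helper (not a POI API): returns the indexes of characters
    // below 0x20, excluding tab, LF and CR.
    static List<Integer> findControlChars(String s) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x20 && c != '\t' && c != '\n' && c != '\r') {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Assumed sample text; the real offending character is unknown.
        String text = "If you\u0019re looking for a palm-sweating";
        for (int i : findControlChars(text)) {
            System.out.printf("index %d: U+%04X%n", i, (int) text.charAt(i));
        }
    }
}
```

Running main on the sample string should report a single hit at index 6.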
Comment 2 Yegor Kozlov 2012-02-28 14:03:47 UTC
Should be fixed in r1294657

Your diagnosis is correct: writing an ISO control character (< 32) resulted in a corrupted workbook.

I could easily reproduce it with the following simple code:

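        Workbook wb = new SXSSFWorkbook();
        Sheet sh = wb.createSheet();
        Cell cell = sh.createRow(0).createCell(0);
        // Assumption: the step that sets the value is missing from this page; any control character (< 32) should do
        cell.setCellValue("a\u0019b");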

XSSF delegates writing XML to XmlBeans, and this framework replaces characters below 32 with question marks. I changed SXSSF to do the same.

It appears that there are two more special cases where you can't simply write a char code in XML:

 case 1: low and high unicode surrogates: DC00-DFFF and D800-DBFF
 case 2: 'not a character' range: FFFE-FFFF

XmlBeans replaces characters from these ranges with question marks, so I fixed SXSSF to be consistent.
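
The replacement rule described above can be sketched roughly as follows. This is an illustration only, not the code committed in r1294657; XmlCharSanitizer, needsReplacement and sanitize are hypothetical names. Keeping tab, LF and CR is an assumption based on XML 1.0 allowing them, and every surrogate code unit is replaced individually, as the comment describes for XmlBeans.

```java
class XmlCharSanitizer {
    // Hypothetical helper (not POI's API): true if the UTF-16 code unit falls
    // into one of the ranges discussed above.
    static boolean needsReplacement(char c) {
        // Assumption: tab, LF and CR are legal in XML 1.0 and are kept.
        boolean control = c < 0x20 && c != '\t' && c != '\n' && c != '\r';
        // Case 1: high (D800-DBFF) and low (DC00-DFFF) surrogates.
        boolean surrogate = Character.isSurrogate(c);
        // Case 2: the 'not a character' range FFFE-FFFF.
        boolean nonCharacter = c == '\uFFFE' || c == '\uFFFF';
        return control || surrogate || nonCharacter;
    }

    // Replaces every offending code unit with a question mark.
    static String sanitize(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            sb.append(needsReplacement(c) ? '?' : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(sanitize("a\u0001b")); // prints "a?b"
    }
}
```

A caller who cannot upgrade could apply something like sanitize(...) to each string before passing it to setCellValue, at the cost of losing the original characters.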

Comment 3 sabdulrazak 2012-04-27 13:33:27 UTC

I am facing the same problem. Where can I download the jar files of this release? Please advise.

Comment 4 sabdulrazak 2012-04-30 07:05:47 UTC

I was able to download version 3.8.