Bug 60289 - UTF decoding in XSSFRichTextString does not work when the code is lowercase
Summary: UTF decoding in XSSFRichTextString does not work when the code is lowercase
Alias: None
Product: POI
Classification: Unclassified
Component: XSSF
Version: 3.15-FINAL
Hardware: PC All
Importance: P2 trivial
Target Milestone: ---
Assignee: POI Developers List
Keywords: PatchAvailable
Depends on:
Reported: 2016-10-21 11:32 UTC by Daniel Gonzalez
Modified: 2016-10-21 16:32 UTC

Microsoft Access Database to reproduce the bug (29.97 KB, application/x-zip-compressed)
2016-10-21 11:32 UTC, Daniel Gonzalez
Patch that fixes the bug (1.40 KB, patch)
2016-10-21 11:35 UTC, Daniel Gonzalez

Description Daniel Gonzalez 2016-10-21 11:32:46 UTC
Created attachment 34397
Microsoft Access Database to reproduce the bug


The class XSSFRichTextString decodes the escaped UTF characters (of the form _xHHHH_) that the OOXML format stores in the XML it generates. This works as expected when the file is generated by Excel, because Excel writes the code with an uppercase hex value, e.g. _x000D_ is translated to \r.

However, I have found that exporting from Microsoft Access 2010 (I haven't tested newer versions) produces these codes with lowercase hex values, e.g. _x000d_. The export is done from a Visual Basic macro using the method DoCmd.TransferSpreadsheet (https://msdn.microsoft.com/en-us/library/office/ff844793.aspx) with the type acSpreadsheetTypeExcel12Xml, which writes OOXML. The lowercase codes are not matched by the regular expression used in XSSFRichTextString and are therefore passed through to the value unmodified.
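
The mismatch can be illustrated with a small standalone sketch. The class name and both patterns below are assumptions written from the description above, not copied from the POI source: an uppercase-only character class finds _x000D_ but not _x000d_, while a class that also lists a-f finds both.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the patterns are assumptions modelled on the
// behaviour described in this report, not the actual POI source.
class EscapeCaseDemo {
    // Assumed shape of a pattern that only accepts uppercase hex digits.
    static final Pattern UPPER_ONLY = Pattern.compile("_x([0-9A-F]{4})_");

    // Variant that also accepts lowercase hex digits.
    static final Pattern ANY_CASE = Pattern.compile("_x([0-9A-Fa-f]{4})_");

    // Returns true if the text contains at least one escape the pattern recognises.
    static boolean containsEscape(Pattern pattern, String text) {
        return pattern.matcher(text).find();
    }
}
```

With these definitions, containsEscape(UPPER_ONLY, "a_x000d_b") is false, which is why the lowercase code written by Access reaches the caller unchanged.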

That's not a problem if the value is stored back to an xlsx file, as Excel understands the code and decodes the character. In our case, however, we copy this value to a CSV file, so the code _x000d_ is written to the text file instead of the expected \r. Similar results can be seen if you output the XSSFRichTextString value to a log.

Attached is a simple Access database with one table that has one rich text field containing the character \r, and a Visual Basic module that exports the contents of the tables to Excel. I have also included the Excel file that results from executing the macro.

Comment 1 Daniel Gonzalez 2016-10-21 11:35:58 UTC
Created attachment 34398
Patch that fixes the bug

We created a trivial patch that changes the regular expression in XSSFRichTextString to accept lowercase hex values. I have attached it in case it is useful.
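
For readers who cannot upgrade yet, the same idea can be applied to the string returned by POI before it is written to a CSV file or a log. The following is a minimal sketch under stated assumptions, not the attached patch: the class EscapeDecoderSketch and its decode method are hypothetical names, and the pattern is an assumption based on the description in this report.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of POI: decodes _xHHHH_ escapes
// regardless of the case of the hex digits.
class EscapeDecoderSketch {
    // Assumed escape shape: "_x", four hex digits (either case), then "_".
    private static final Pattern ESCAPE = Pattern.compile("_x([0-9A-Fa-f]{4})_");

    static String decode(String value) {
        if (value == null || !value.contains("_x")) {
            return value; // nothing that could be an escape
        }
        StringBuilder out = new StringBuilder();
        Matcher m = ESCAPE.matcher(value);
        int last = 0;
        while (m.find()) {
            // Copy the literal text before the escape, then the decoded character.
            out.append(value, last, m.start());
            out.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        // Copy whatever follows the final escape.
        out.append(value, last, value.length());
        return out.toString();
    }
}
```

A caller could pass the cell text through EscapeDecoderSketch.decode before writing it out, so that both _x000D_ and _x000d_ become a carriage return. Text without a well-formed escape is returned unchanged.
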
Comment 2 Dominik Stadler 2016-10-21 16:32:32 UTC
Applied in r1766065, thanks for the patch. This will be included in 3.16-beta1.