Bug 60289

Summary: UTF decoding in XSSFRichTextString does not work when the code is lowercase
Product: POI
Reporter: Daniel Gonzalez <dgonsan>
Component: XSSF
Assignee: POI Developers List <dev>
Status: RESOLVED FIXED
Severity: trivial
Keywords: PatchAvailable
Priority: P2    
Version: 3.15-FINAL   
Target Milestone: ---   
Hardware: PC   
OS: All   
Attachments: Microsoft Access Database to reproduce the bug
Patch that fixes the bug

Description Daniel Gonzalez 2016-10-21 11:32:46 UTC
Created attachment 34397 [details]
Microsoft Access Database to reproduce the bug

Hi,

The class XSSFRichTextString decodes the escaped UTF characters that the OOXML format stores in the XML it generates. This works as expected when the file is generated by Excel, because Excel writes the code with an uppercase hex value, e.g. _x000D_ is translated to \r.

But I have found that when you export from Microsoft Access 2010 (I haven't tested newer versions) through a Visual Basic macro, using the method DoCmd.TransferSpreadsheet (https://msdn.microsoft.com/en-us/library/office/ff844793.aspx) with the type acSpreadsheetTypeExcel12Xml to export to OOXML, it writes the UTF characters with lowercase hex values, e.g. _x000d_. These are not matched by the regular expression used in XSSFRichTextString and are therefore passed through to the value unmodified.
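The mismatch can be illustrated with a minimal sketch. The exact regular expression in XSSFRichTextString is not quoted in this report, so the uppercase-only pattern below is an assumption about its shape, and the class name is hypothetical:

```java
import java.util.regex.Pattern;

public class LowercaseEscapeRepro {
    // Assumed shape of a case-sensitive escape pattern: four UPPERCASE hex digits only.
    static final Pattern UPPER_ONLY = Pattern.compile("_x([0-9A-F]{4})_");

    // Returns true if the text contains an escape that the pattern recognises.
    static boolean hasEscape(String text) {
        return UPPER_ONLY.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(hasEscape("line1_x000D_line2")); // true: Excel-style uppercase escape
        System.out.println(hasEscape("line1_x000d_line2")); // false: Access-style lowercase escape
    }
}
```

With a pattern like this, the lowercase form is never found, so it would stay in the string as literal text.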

That's not a problem if the value is stored back to an xlsx file, as Excel understands it and decodes the character. In our case, however, we copy this value to a CSV file, so the code _x000d_ is written to the text file instead of the expected \r. Similar results can be seen if you output the XSSFRichTextString value to a log.

Attached is a simple Access database with one table that has a single rich text field containing the character \r, and a Visual Basic module that exports the contents of the tables to Excel. I have also included the Excel file that results from executing the macro.

Regards,
Daniel.
Comment 1 Daniel Gonzalez 2016-10-21 11:35:58 UTC
Created attachment 34398 [details]
Patch that fixes the bug

We created a trivial patch that changes the regular expression in XSSFRichTextString to accept lowercase hex values. I have attached it in case it is useful.
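For readers without the attachment, here is a rough sketch of the kind of change described. It is not the attached patch or the POI source: the class name, method name, and pattern are assumptions chosen for illustration, and only the idea of accepting both hex cases comes from this report.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapeDecoder {
    // Hypothetical widened pattern: accepts both uppercase and lowercase hex digits.
    static final Pattern ANY_CASE = Pattern.compile("_x([0-9A-Fa-f]{4})_");

    // Replaces every _xHHHH_ escape with the character it encodes.
    static String decode(String value) {
        if (value == null) {
            return null;
        }
        Matcher m = ANY_CASE.matcher(value);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the text before the escape, then the decoded character.
            out.append(value, last, m.start());
            out.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        // Copy whatever follows the final escape.
        out.append(value, last, value.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // Both spellings should decode to the same carriage-return character.
        System.out.println(decode("a_x000D_b").equals(decode("a_x000d_b")));
    }
}
```

Widening the character class keeps the match strict about length and delimiters, so ordinary text containing underscores is left alone.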
Comment 2 Dominik Stadler 2016-10-21 16:32:32 UTC
Applied in r1766065, thanks for the patch, this will be included in 3.16-beta1.