Bug 54084

Summary: Some Unicode chars (e.g. Chinese chars) are not written correctly in xlsx file.
Product: POI
Reporter: Alexandra Luca <l_alexandra2010>
Component: XSSF
Assignee: POI Developers List <dev>
Status: RESOLVED DUPLICATE
Severity: normal
CC: gauss.gao
Priority: P2    
Version: 3.8-FINAL   
Target Milestone: ---   
Hardware: PC   
OS: All   
Bug Depends on: 59268    
Bug Blocks: 58247, 61246, 61494    
Attachments: the 2 xlsx files
Greek alphabet beyond BMP
Greek alphabet beyond BMP - Manually created xlsx

Description Alexandra Luca 2012-11-01 08:01:59 UTC
Set the value of an SXSSFCell to a string that contains Chinese chars:
cell.setCellValue("
Comment 1 Alexandra Luca 2012-11-01 08:04:32 UTC
The Chinese chars are replaced with ? in the xlsx file:

??????खफआछ??????
Comment 2 Nick Burch 2012-11-01 09:00:18 UTC
Could you please upload a unit test that shows the problem?

Also, are you sure that you're correctly getting the characters into Java without breaking the encoding, and are you sure that the font you're using can correctly render the characters?
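To illustrate the first question: a minimal sketch, not taken from the reporter's code, of how characters can be lost before they ever reach POI. It assumes the string passes through a byte encoding that cannot represent it (US-ASCII here); Java's String.getBytes(Charset) then substitutes ? for every unmappable character. The class and method names are hypothetical, and this is only one possible cause of ? in the output.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingLossDemo {
    // Encodes the text with the given charset and decodes it again.
    // Characters the charset cannot represent come back as '?'.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String chinese = "\u4E2D\u6587"; // two Chinese characters

        // UTF-8 can represent every Unicode character, so nothing is lost.
        System.out.println(roundTrip(chinese, StandardCharsets.UTF_8).equals(chinese));

        // US-ASCII cannot, so each character is replaced with '?'.
        System.out.println(roundTrip(chinese, StandardCharsets.US_ASCII));
    }
}
```

If a check like this loses data with the charset used by the import or database layer, the string was already damaged before setCellValue was called.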
Comment 3 Alexandra Luca 2012-11-01 09:21:24 UTC
Created attachment 29537 [details]
the 2 xlsx files

Here are 2 xlsx files.
The first file (TestUnicode.xlsx) is used to load data into the database.

The data is inserted correctly into the database and is then displayed correctly in the UI.
We are trying to export the same data to another xlsx file, but the chars are not encoded correctly. Both files use the same font (Calibri 11).
Comment 4 Yegor Kozlov 2012-11-01 11:57:48 UTC
I can't reproduce the problem with the latest build from trunk. Can you please upload a unit test that demonstrates the problem?

I see that Unicode characters are garbled in the corrupted file, but as of POI-3.9 we don't write raw Unicode: every character above ASCII is written in the &#charCode; form, which means that the problem is most certainly fixed in trunk.
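For readers unfamiliar with the &#charCode; form, here is a minimal sketch of that kind of escaping. It is not POI's actual code; the class and method names are hypothetical, and it deliberately ignores XML special characters such as & and <. It walks the string by code point, an assumption made here so that a surrogate pair becomes a single reference.

```java
public class NumericCharRefDemo {
    // Copies ASCII characters unchanged and writes every other code point
    // as a decimal numeric character reference (&#N;).
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.appendCodePoint(cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // A BMP character: one char, one reference.
        System.out.println(escapeNonAscii("A\u4E2D"));       // prints A&#20013;
        // A supplementary character: two chars, still one reference.
        System.out.println(escapeNonAscii("\uD835\uDF4B"));  // prints &#120651;
    }
}
```

Output in this form is pure ASCII, so it is unaffected by the byte encoding chosen when the XML is written.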

Links to download nightly builds are on http://poi.apache.org/

Yegor

Comment 5 sumedh 2013-05-03 09:42:39 UTC
I also found that characters stored as surrogate pairs (supplementary characters in UTF-16) are not written correctly.

E.g. the string "\uD835\uDF4B" is the UTF-16 surrogate pair (two 16-bit code units, 4 bytes) for Unicode U+1D74B, "mathematical bold italic small phi". It gets converted to ? when it's exported to Excel.
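A short sketch that inspects this string with the standard Java API only, to confirm the pair really is one code point and to show the UTF-8 bytes that should end up in the XML inside the .xlsx. The class and helper names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class SurrogatePairDemo {
    // Returns the UTF-8 encoding of the text as upper-case hex digits.
    static String utf8Hex(String text) {
        StringBuilder hex = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            hex.append(String.format("%02X", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        String phi = "\uD835\uDF4B";

        // Two UTF-16 code units (Java chars) ...
        System.out.println(phi.length());                            // prints 2
        // ... but a single Unicode code point.
        System.out.println(phi.codePointCount(0, phi.length()));     // prints 1
        System.out.println(Integer.toHexString(phi.codePointAt(0))); // prints 1d74b
        // In UTF-8 the code point takes four bytes.
        System.out.println(utf8Hex(phi));                            // prints F09D9D8B
    }
}
```

Code that processes the string one char at a time sees two unpaired surrogates instead of one character, which is a typical way for such characters to end up as ?.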
Comment 6 Nick Burch 2013-05-03 09:49:41 UTC
If you write that character in Excel, how does Excel encode it to the file? (Might be worth checking both the raw xml inside the .xlsx, and how POI sees it)
Comment 7 sumedh 2013-05-03 10:00:42 UTC
Created attachment 30251 [details]
Greek alphabet beyond BMP

Please find attached the UTF-16 (little-endian) file with Greek characters from beyond the Basic Multilingual Plane.
Comment 8 sumedh 2013-05-03 10:07:48 UTC
Created attachment 30252 [details]
Greek alphabet beyond BMP - Manually created xlsx

Please find attached a manually created Excel file containing these characters. MS Excel correctly writes the values to the shared string table. SXSSF writes ???? (inline) for them.
Comment 9 Dominik Stadler 2013-06-30 22:51:59 UTC
I worked on reproducing the reported problems with Greek characters. This seems to happen when loading shared strings from the XLSX file. The XML file is encoded correctly (UTF-8 codes e.g. from http://www.fileformat.info/info/unicode/char/1d74a/index.htm), and the characters appear correctly in OpenOffice and when opening the file in a text editor.

The initial loading of the Workbook using XSSF also works, and the cell contains the expected data. However, after writing the data out and reading it back in, it no longer matches.

As far as I can see, the shared strings are read incorrectly, which then breaks writing the data back out.

I could debug the code as far as the point where xmlbeans handles the string, and there it seems to be fine. As soon as SstDocumentImpl takes over, it seems to become corrupted, but debugging there is currently not possible for me because the .class files are stripped... :(

For now I have added a testcase called testBug54084Unicode() to the special test class TestUnfixedBugs.java, which verifies the problem. No fix is available yet...
Comment 10 stanescu florentina 2014-09-12 12:10:34 UTC
What is the status of this defect? Is somebody still working to fix this defect?
Comment 11 Yaniv Kunda 2014-09-14 15:35:28 UTC
I've tried to debug it using POI's TestUnfixedBugs, but the loss happens deep inside XMLBeans.
It is probably due to https://issues.apache.org/jira/browse/XMLBEANS-332
Comment 12 Dominik Stadler 2015-03-23 21:05:58 UTC
The testcase shows that the problem is not specific to SXSSF; it also happens with plain XSSF.
Comment 13 Dominik Stadler 2017-05-27 21:00:03 UTC
*** Bug 61029 has been marked as a duplicate of this bug. ***
Comment 14 Dominik Stadler 2017-09-21 16:13:24 UTC
I have verified that using the newer version of XMLBeans that is discussed in Bug 59268 also fixes this issue, so this is a duplicate of that bug.

*** This bug has been marked as a duplicate of bug 59268 ***