Bug 54084 - Some Unicode chars (e.g. Chinese chars) are not written correctly in xlsx file.
Status: RESOLVED DUPLICATE of bug 59268
Alias: None
Product: POI
Classification: Unclassified
Component: XSSF
Version: 3.8-FINAL
Hardware: PC All
Importance: P2 normal with 1 vote
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Duplicates: 61029
Depends on: 59268
Blocks: 58247 61246 61494
Reported: 2012-11-01 08:01 UTC by Alexandra Luca
Modified: 2017-09-21 16:13 UTC (History)



Attachments
the 2 xlsx files (9.22 KB, application/x-zip-compressed)
2012-11-01 09:21 UTC, Alexandra Luca
Greek alphabet beyond BMP (198 bytes, text/plain)
2013-05-03 10:00 UTC, sumedh
Greek alphabet beyond BMP - Manually created xlsx (8.16 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
2013-05-03 10:07 UTC, sumedh

Description Alexandra Luca 2012-11-01 08:01:59 UTC
Set the value of an SXSSFCell to a string that contains Chinese chars:
cell.setCellValue("
Comment 1 Alexandra Luca 2012-11-01 08:04:32 UTC
The Chinese chars are replaced with ? in the xlsx file:

??????खफआछ??????
Comment 2 Nick Burch 2012-11-01 09:00:18 UTC
Could you please upload a unit test that shows the problem?

Also, are you sure that you're correctly getting the characters into Java without breaking the encoding, and are you sure that the font you're using can correctly render the characters?
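
One way to rule out the encoding question raised above is a charset round trip done outside of POI. The following is only a minimal sketch: the class and method names are hypothetical, and it does not reproduce the reporter's database or export code. It shows that a charset which cannot represent the characters turns them into ?, while UTF-8 preserves them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of POI or of the reporter's application.
final class EncodingRoundTripDemo {

    // Encode the text to bytes with the given charset and decode it again.
    // String.getBytes(Charset) replaces unmappable characters with the
    // charset's replacement byte, which is '?' for ISO-8859-1.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String sample = "\u4E2D\u6587"; // two Chinese characters

        // UTF-8 can represent every Unicode character, so nothing is lost.
        System.out.println(roundTrip(sample, StandardCharsets.UTF_8).equals(sample));

        // ISO-8859-1 cannot represent Chinese characters, so they become '?'.
        System.out.println(roundTrip(sample, StandardCharsets.ISO_8859_1));
    }
}
```

If the strings survive a UTF-8 round trip before they are handed to setCellValue, the input side is probably not where the characters are lost.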
Comment 3 Alexandra Luca 2012-11-01 09:21:24 UTC
Created attachment 29537 [details]
the 2 xlsx files

Here are 2 xlsx files.
The first file (TestUnicode.xlsx) is used to load the data into the database.

The data is inserted correctly into the database and is then displayed correctly in the UI.
When we try to export the same data to another xlsx file, the chars are not encoded correctly. Both files use the same font (Calibri 11).
Comment 4 Yegor Kozlov 2012-11-01 11:57:48 UTC
I can't reproduce the problem with the latest build from trunk. Can you please upload a unit test that demonstrates the problem?

I can see that the Unicode characters in the corrupted file are garbled. However, as of POI-3.9 we no longer write raw Unicode: every character above ASCII is written in the &#charCode; form, which means that the problem is most certainly fixed in trunk.
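
To illustrate what the &#charCode; form looks like, here is a minimal sketch of such an escaping step. The helper name escapeNonAscii is hypothetical and this is not POI's actual code; it also assumes that escaping is done per code point, so that a character outside the BMP becomes one reference rather than two.

```java
// Hypothetical demo class, not POI's implementation.
final class NumericCharRefDemo {

    // Replace every code point above ASCII with a decimal numeric
    // character reference. Iterating by code point keeps surrogate
    // pairs together as a single reference.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp > 0x7F) {
                out.append("&#").append(cp).append(';');
            } else {
                out.append((char) cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 'A', one Chinese character, and one character beyond the BMP.
        System.out.println(escapeNonAscii("A\u4E2D\uD835\uDF4B"));
        // prints A&#20013;&#120651;
    }
}
```

The sketch deliberately leaves XML special characters such as & and < alone; a real writer would have to escape those as well.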

Links to download nightly builds are on http://poi.apache.org/

Yegor

(In reply to comment #3)
> Created attachment 29537 [details]
> the 2 xlsx files
> 
> Here are 2 xlsx files.
> The first file (TestUnicode.xlsx) is used to load the data from it to the
> database.
> 
> The data is inserted correctly in database, and then is displayed correctly on
> the UI.
> The same data we are trying to export to another xlsx file, but the chars
> are not encoded correctly. Both files have the same font (Calibri 11).
Comment 5 sumedh 2013-05-03 09:42:39 UTC
I also found that surrogate pair characters (supplementary characters in UTF-16) are not written correctly.

For example, the character "\uD835\uDF4B" (the 4-byte surrogate pair encoding, in big-endian UTF-16, of Unicode U+1D74B, "mathematical bold italic small phi") gets converted to ? when it is exported to Excel.
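
The relationship between the surrogate pair and the code point can be checked with plain Java, independent of POI. This is only a small sketch with hypothetical class and method names; it uses the standard String.codePointAt and Character.toCodePoint calls.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, independent of POI.
final class SurrogatePairDemo {

    // Return the code point that starts at index 0 of the string.
    // For a valid surrogate pair this combines both UTF-16 code units.
    static int toCodePoint(String s) {
        return s.codePointAt(0);
    }

    // Summarize the code point and its size in UTF-16 and UTF-8.
    static String describe(String s) {
        int cp = s.codePointAt(0);
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return String.format("U+%X, %d UTF-16 code units, %d UTF-8 bytes",
                cp, s.length(), utf8.length);
    }

    public static void main(String[] args) {
        String phi = "\uD835\uDF4B";

        // The same value computed from the two halves explicitly.
        int combined = Character.toCodePoint(phi.charAt(0), phi.charAt(1));
        System.out.println(combined == toCodePoint(phi));

        System.out.println(describe(phi));
        // prints U+1D74B, 2 UTF-16 code units, 4 UTF-8 bytes
    }
}
```

Any code that processes such a string one char at a time sees two separate surrogates, which is one common way for a supplementary character to end up as ? characters.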
Comment 6 Nick Burch 2013-05-03 09:49:41 UTC
If you write that character in Excel, how does Excel encode it to the file? (Might be worth checking both the raw xml inside the .xlsx, and how POI sees it)
Comment 7 sumedh 2013-05-03 10:00:42 UTC
Created attachment 30251 [details]
Greek alphabet beyond BMP

Please find attached the UTF-16 (little endian) file with Greek characters from beyond the Basic Multilingual Plane.
Comment 8 sumedh 2013-05-03 10:07:48 UTC
Created attachment 30252 [details]
Greek alphabet beyond BMP - Manually created xlsx

Please find attached a manually created Excel file for these characters. MS Excel correctly writes the values to the shared string table. SXSSF writes ???? (inline) for them.
Comment 9 Dominik Stadler 2013-06-30 22:51:59 UTC
I worked on reproducing the reported problems with Greek characters. The corruption seems to happen when loading shared strings from the XLSX file. The XML file itself is encoded correctly (UTF-8 codes, e.g. from http://www.fileformat.info/info/unicode/char/1d74a/index.htm), and the characters appear correctly in OpenOffice and when opening the file in a text editor.

The initial loading of the Workbook using XSSF also works and the cell contains the expected data. However, after writing the data out and reading it back in, it no longer matches.

As far as I can see, the shared strings are read incorrectly, which breaks writing the data back out.

I could debug the code as far as the point where xmlbeans handles the string, where it still seems to be fine. As soon as SstDocumentImpl takes over, it seems to become corrupted. Debugging there is currently not possible for me because the .class files are stripped... :(

For now I have added a test case called testBug54084Unicode() to the special test class TestUnfixedBugs.java, which verifies the problem. No fix is available yet.
Comment 10 stanescu florentina 2014-09-12 12:10:34 UTC
What is the status of this defect? Is somebody still working to fix this defect?
Comment 11 Yaniv Kunda 2014-09-14 15:35:28 UTC
I've tried to debug it using POI's TestUnfixedBugs, but the loss is happening deep inside XMLBeans.
Probably due to https://issues.apache.org/jira/browse/XMLBEANS-332
Comment 12 Dominik Stadler 2015-03-23 21:05:58 UTC
The testcase shows that it is not related to SXSSF, it also happens for plain XSSF.
Comment 13 Dominik Stadler 2017-05-27 21:00:03 UTC
*** Bug 61029 has been marked as a duplicate of this bug. ***
Comment 14 Dominik Stadler 2017-09-21 16:13:24 UTC
I have verified that using the newer version of XMLBeans that is discussed in Bug 59268 also fixes this issue, so this is a duplicate of that bug.

*** This bug has been marked as a duplicate of bug 59268 ***