Bug 49599 - Comment.setAuthor does not encode multi-byte characters (Chinese) well
Summary: Comment.setAuthor does not encode multi-byte characters (Chinese) well
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: HSSF (show other bugs)
Version: 3.7-dev
Hardware: PC Windows XP
: P2 normal (vote)
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-07-16 04:19 UTC by LiuYan 刘研
Modified: 2010-07-16 09:58 UTC (History)
0 users



Attachments
Test class to show this issue (24.41 KB, application/octet-stream)
2010-07-16 04:19 UTC, LiuYan 刘研
Details
Patch test for unicode on setAuthor() (515 bytes, patch)
2010-07-16 09:22 UTC, André-John Mas
Details | Diff
Patch for defaulting to multi-byte (635 bytes, patch)
2010-07-16 09:22 UTC, André-John Mas
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description LiuYan 刘研 2010-07-16 04:19:00 UTC
Created attachment 25765 [details]
Test class to show this issue

When I setAuthor() for Comment with some Chinese characters, these Chinese characters will transformed to ? when open the file in Microsoft Excel.

Please see the attachment.

My environment: 
- JDK 6 Update 21 
- POI 3.7 Beta 1 
- Windows XP Professional (Simplified Chinese) SP3 
- Microsoft Office Excel 2003 (11.8324.8324) SP3 

for more details, please visit this thread:
http://old.nabble.com/charset-encoding-of-Comment.setAuthor%28%29-ts29115649.html
Comment 1 André-John Mas 2010-07-16 07:54:44 UTC
Just to sum up the thread:

The serialize() method in org.apache.poi.hssf.record.NoteRecord is not calling the StringUtil.putUnicodeLE() method, because the field_5_hasMultibyte instance variable is false, even when the author field contains double-byte characters. In fact other than when a file is read field_5_hasMultibyte is never set to true.

Two possible solutions:
 - add logic to work out if we have non-latin characters, since the issue is not just affecting double-byte characters
 - set the field_5_hasMultibyte variable to be true and always write out unicode characters, unless there is a usage scenario this could break.

I tested on MacOS X 10.6.4 and used Excel 2008 to see the result. Changing the variable to true resulted in Chinese text to appear correctly for the author.

BTW We should probably be extending the unit tests for ensuring non-latin characters are getting stored properly.
Comment 2 André-John Mas 2010-07-16 09:22:09 UTC
Created attachment 25768 [details]
Patch test for unicode on setAuthor()

Added patch that tests for unicode on setAuthor()
Comment 3 André-John Mas 2010-07-16 09:22:54 UTC
Created attachment 25769 [details]
Patch for defaulting to multi-byte

Patch for defaulting to multi-byte
Comment 4 Nick Burch 2010-07-16 09:44:03 UTC
Thanks for investigating this

The usual way in most records is to update the multibyte flag when updating the string

I'll make this change, and write a unit test for it shortly
Comment 5 Nick Burch 2010-07-16 09:58:45 UTC
Fixed in r964800, along with a unit test. Thanks for your investigations and patch!