Bug 49599

Summary: Comment.setAuthor does not encode multi-byte characters (Chinese) well
Product: POI Reporter: LiuYan 刘研 <lovetide>
Component: HSSFAssignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: normal    
Priority: P2    
Version: 3.7-dev   
Target Milestone: ---   
Hardware: PC   
OS: Windows XP   
Attachments: Test class to show this issue
Patch test for unicode on setAuthor()
Patch for defaulting to multi-byte

Description LiuYan 刘研 2010-07-16 04:19:00 UTC
Created attachment 25765 [details]
Test class to show this issue

When I setAuthor() for Comment with some Chinese characters, these Chinese characters will transformed to ? when open the file in Microsoft Excel.

Please see the attachment.

My environment: 
- JDK 6 Update 21 
- POI 3.7 Beta 1 
- Windows XP Professional (Simplified Chinese) SP3 
- Microsoft Office Excel 2003 (11.8324.8324) SP3 

for more details, please visit this thread:
http://old.nabble.com/charset-encoding-of-Comment.setAuthor%28%29-ts29115649.html
Comment 1 André-John Mas 2010-07-16 07:54:44 UTC
Just to sum up the thread:

The serialize() method in org.apache.poi.hssf.record.NoteRecord is not calling the StringUtil.putUnicodeLE() method, because the field_5_hasMultibyte instance variable is false, even when the author field contains double-byte characters. In fact other than when a file is read field_5_hasMultibyte is never set to true.

Two possible solutions:
 - add logic to work out if we have non-latin characters, since the issue is not just affecting double-byte characters
 - set the field_5_hasMultibyte variable to be true and always write out unicode characters, unless there is a usage scenario this could break.

I tested on MacOS X 10.6.4 and used Excel 2008 to see the result. Changing the variable to true resulted in Chinese text to appear correctly for the author.

BTW We should probably be extending the unit tests for ensuring non-latin characters are getting stored properly.
Comment 2 André-John Mas 2010-07-16 09:22:09 UTC
Created attachment 25768 [details]
Patch test for unicode on setAuthor()

Added patch that tests for unicode on setAuthor()
Comment 3 André-John Mas 2010-07-16 09:22:54 UTC
Created attachment 25769 [details]
Patch for defaulting to multi-byte

Patch for defaulting to multi-byte
Comment 4 Nick Burch 2010-07-16 09:44:03 UTC
Thanks for investigating this

The usual way in most records is to update the multibyte flag when updating the string

I'll make this change, and write a unit test for it shortly
Comment 5 Nick Burch 2010-07-16 09:58:45 UTC
Fixed in r964800, along with a unit test. Thanks for your investigations and patch!