Bug 49441

Summary: Wrong CharSet
Product: POI
Reporter: Dmitry <aristar>
Component: HSMF
Assignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: normal    
Priority: P2    
Version: 3.6-FINAL   
Target Milestone: ---   
Hardware: All   
OS: All   
Attachments: Message with cyrillic fields

Description Dmitry 2010-06-15 04:02:34 UTC
If an encoding other than Cp1252 is used, StringChunk returns an incorrect value. To solve this issue, we added a setCharset method to MAPIMessage, Chunks and StringChunk.
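To illustrate the symptom described above, here is a minimal sketch in plain Java that does not use POI at all. It assumes a string chunk holds the Cyrillic word "Привет" as raw Cp1251 bytes, then decodes those bytes once with the hardcoded Cp1252 assumption and once with the correct charset. The class name CharsetMismatchDemo, the decode helper and the sample bytes are made up for illustration and are not part of HSMF.

```java
import java.nio.charset.Charset;

class CharsetMismatchDemo {
    // Hypothetical raw chunk content: the Cyrillic word "Privet" encoded as Cp1251.
    static final byte[] SAMPLE = {
        (byte) 0xCF, (byte) 0xF0, (byte) 0xE8, (byte) 0xE2, (byte) 0xE5, (byte) 0xF2
    };

    // The intended text, written with Unicode escapes so the source file stays ASCII.
    static final String EXPECTED = "\u041f\u0440\u0438\u0432\u0435\u0442";

    // Decode raw 8-bit chunk bytes with the named charset.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Hardcoded Cp1252 turns the bytes into accented Latin letters.
        System.out.println(decode(SAMPLE, "windows-1252").equals(EXPECTED)); // prints false
        // Decoding with the charset the message was written in recovers the text.
        System.out.println(decode(SAMPLE, "windows-1251").equals(EXPECTED)); // prints true
    }
}
```

The bytes themselves are fine in both cases; only the charset chosen at decode time differs, which is why a per-message charset setting is enough to fix the output.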
Comment 1 Nick Burch 2010-06-15 09:48:13 UTC
Have you tried with a recent nightly build? HSMF has undergone a lot of changes since 3.6.

For String chunks which aren't stored as Unicode, we assume they're CP1252, based on all the files we've seen. Outlook should generally store them as one of those two. If you have found a file that differs, please do upload it, and also please help us track down where in the file that charset is stored!
Comment 2 Dmitry 2010-06-16 01:21:18 UTC
Created attachment 25594
Message with cyrillic fields
Comment 3 Dmitry 2010-06-16 01:22:18 UTC
I haven't tried any version other than 3.6, but in the head revision (from SVN) StringChunk still hardcodes Cp1252. (Cp1251 is the Cyrillic code page.)
Comment 4 Nick Burch 2010-06-29 09:30:33 UTC
Unfortunately, I can't seem to spot anything in the file which indicates the encoding.

If you open the file on a different machine which has a different system language set, does it look correct or do you get the wrong characters showing up?
Comment 5 Dmitry 2010-07-06 03:57:19 UTC
I tried on a system with an English localization. Unless the Language for Non-Unicode Programs is set to Russian, the message opens with the wrong characters.
Comment 6 Nick Burch 2010-08-03 12:07:45 UTC
Fixed in r981947.

There is now a guess7BitEncoding() method on MAPIMessage, which looks in the headers to guess the encoding, then calls the new set-encoding method on the string chunks.
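The general idea of guessing a charset from message headers can be sketched in plain Java as follows. This is not the code from r981947, and the actual heuristics in MAPIMessage may differ. It assumes the headers text contains a MIME-style charset= parameter and falls back to Cp1252 otherwise. The class HeaderCharsetGuess and the method guessCharset are hypothetical names used only for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeaderCharsetGuess {
    // Matches charset=foo or charset="foo" anywhere in the headers, case-insensitively.
    private static final Pattern CHARSET_PARAM =
        Pattern.compile("charset\\s*=\\s*\"?([\\w.:-]+)\"?", Pattern.CASE_INSENSITIVE);

    // Default used when the headers give no usable hint.
    static final Charset FALLBACK = Charset.forName("windows-1252");

    // Return the first charset named in the headers that this JVM supports.
    static Charset guessCharset(String headers) {
        if (headers == null) {
            return FALLBACK;
        }
        Matcher m = CHARSET_PARAM.matcher(headers);
        while (m.find()) {
            String name = m.group(1);
            try {
                if (Charset.isSupported(name)) {
                    return Charset.forName(name);
                }
            } catch (IllegalCharsetNameException e) {
                // Syntactically invalid name: ignore it and keep looking.
            }
        }
        return FALLBACK;
    }

    public static void main(String[] args) {
        String headers = "Content-Type: text/plain; charset=\"windows-1251\"\r\n";
        System.out.println(guessCharset(headers).name());
    }
}
```

The guessed charset would then be handed to whatever decodes the 8-bit string chunks, so that non-Unicode strings are no longer forced through Cp1252.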