Bug 49441 - Wrong CharSet
Summary: Wrong CharSet
Alias: None
Product: POI
Classification: Unclassified
Component: HSMF
Version: 3.6-FINAL
Hardware: All All
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Reported: 2010-06-15 04:02 UTC by Dmitry
Modified: 2010-08-03 12:07 UTC
0 users

Message with cyrillic fields (25.00 KB, application/octet-stream)
2010-06-16 01:21 UTC, Dmitry

Description Dmitry 2010-06-15 04:02:34 UTC
If a message uses an encoding other than Cp1252, StringChunk returns an incorrect value. To solve this issue we added a setCharset method to MAPIMessage, Chunks and StringChunk.
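The problem and the proposed fix can be illustrated with a minimal, self-contained sketch. It is not POI code: the class SevenBitStringChunk and its methods are hypothetical stand-ins for a string chunk that keeps its raw bytes and decodes them with a settable charset, defaulting to Cp1252 as described above.

```java
import java.nio.charset.Charset;

// Hypothetical stand-in for a non-Unicode string chunk; not the POI class.
public class SevenBitStringChunk {
    private final byte[] rawValue;
    // Default mirrors the hardcoded Cp1252 assumption described in this report.
    private Charset charset = Charset.forName("windows-1252");

    public SevenBitStringChunk(byte[] rawValue) {
        this.rawValue = rawValue.clone();
    }

    // Lets the caller override the charset used to decode the raw bytes.
    public void setCharset(String charsetName) {
        this.charset = Charset.forName(charsetName);
    }

    // Decodes the stored bytes with whichever charset is currently set.
    public String getValue() {
        return new String(rawValue, charset);
    }

    public static void main(String[] args) {
        // A Russian greeting, written with escapes so the source encoding does not matter.
        String original = "\u041F\u0440\u0438\u0432\u0435\u0442";
        byte[] stored = original.getBytes(Charset.forName("windows-1251"));

        SevenBitStringChunk chunk = new SevenBitStringChunk(stored);
        // Decoded as Cp1252: the Cyrillic bytes come out as accented Latin letters.
        System.out.println(original.equals(chunk.getValue()));

        chunk.setCharset("windows-1251");
        // Decoded as Cp1251: the original text is recovered.
        System.out.println(original.equals(chunk.getValue()));
    }
}
```

Keeping the raw bytes and decoding on demand means the charset can be corrected after the file has been parsed, which is what a setter on the chunk needs.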
Comment 1 Nick Burch 2010-06-15 09:48:13 UTC
Have you tried with a recent nightly build? HSMF has undergone a lot of changes since 3.6

For String chunks which aren't stored as Unicode, we assume they're CP1252 based on all the files we've seen - Outlook should generally store them as one of those two. If you have found a file that differs, please do upload it, and also please help us track down where in the file that charset is stored!
Comment 2 Dmitry 2010-06-16 01:21:18 UTC
Created attachment 25594
Message with cyrillic fields
Comment 3 Dmitry 2010-06-16 01:22:18 UTC
I haven't tried versions other than 3.6, but in the head revision (from SVN) StringChunk has Cp1252 hardcoded. (The attached message needs 1251 - Cyrillic.)
Comment 4 Nick Burch 2010-06-29 09:30:33 UTC
Unfortunately I can't seem to spot anything in the file which indicates the encoding.

If you open the file on a different machine which has a different system language set, does it look correct or do you get the wrong characters showing up?
Comment 5 Dmitry 2010-07-06 03:57:19 UTC
I tried on a system with an English localization. Unless the Language for Non-Unicode Programs is set to Russian, the message opens with the wrong characters.
Comment 6 Nick Burch 2010-08-03 12:07:45 UTC
Fixed in r981947.

There is now a guess7BitEncoding() method on MAPIMessage, which looks in the headers to guess the encoding, then calls the new set encoding method on the string chunks.
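As a rough illustration of guessing an encoding from message headers, the sketch below scans header lines for a Content-Type charset parameter and falls back to a default when none is usable. It is an assumption-laden example, not the code from r981947: the class HeaderCharsetGuess, the method guessCharset and the regular expression are all hypothetical, and folded (multi-line) headers are not handled.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not the POI implementation.
public class HeaderCharsetGuess {
    // Matches e.g.: Content-Type: text/plain; charset="windows-1251"
    private static final Pattern CHARSET_PARAM = Pattern.compile(
            "Content-Type:.*?charset=[\"']?([^;'\"\\s]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the first supported charset named in a Content-Type header,
    // or the fallback if no header names a usable charset.
    public static String guessCharset(String[] headers, String fallback) {
        for (String header : headers) {
            Matcher m = CHARSET_PARAM.matcher(header);
            if (!m.find()) {
                continue;
            }
            String name = m.group(1);
            try {
                if (Charset.isSupported(name)) {
                    return name;
                }
            } catch (IllegalArgumentException e) {
                // Syntactically illegal charset name: ignore and keep looking.
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        String[] headers = {
            "Subject: test",
            "Content-Type: text/plain; charset=\"windows-1251\""
        };
        System.out.println(guessCharset(headers, "windows-1252"));
    }
}
```

The guessed name could then be passed to whatever setter the string chunks expose, so that text not stored as Unicode is decoded with the message's own charset instead of the Cp1252 default.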