Bug 63462 - Problems with MAPIMessage.guess7BitEncoding/MAPIMessage.getHtmlBody
Status: NEW
Alias: None
Product: POI
Classification: Unclassified
Component: HSMF
Version: unspecified
Hardware: All All
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
Reported: 2019-05-23 15:36 UTC by Dominik Hölzl
Modified: 2019-05-23 15:36 UTC (History)
Attachments
Example MSG files with different code pages (3.11 KB, application/x-zip-compressed)
2019-05-23 15:36 UTC, Dominik Hölzl
Description Dominik Hölzl 2019-05-23 15:36:10 UTC
Created attachment 36597
Example MSG files with different code pages

Some e-mails are decoded with the wrong encoding when the subject, text body, or HTML body is read and MAPIMessage.guess7BitEncoding is used.

Example: the e-mail defines PR_INTERNET_CPID -> UTF-8, PR_MESSAGE_LOCALE_ID -> 1031, PR_MESSAGE_CODEPAGE -> undefined, and has no headers.

* Outlook wants PR_SUBJECT to be CP1252, because PR_INTERNET_CPID applies only to PR_BODY and PR_BODY_HTML. It is currently read as UTF-8, because guess7BitEncoding sets that encoding.
* Outlook wants binary PR_BODY_HTML to be UTF-8. It would currently be read as CP1252, because getHtmlBody ignores the code page when the property is binary.
* Outlook wants ASCII PR_BODY_HTML to be UTF-8. This is currently correct.
* Outlook wants PR_BODY to be CP1252, for an unknown reason. It would currently be read as UTF-8, because guess7BitEncoding sets that encoding.
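
To make the effect of a wrong code page concrete, here is a minimal sketch in plain Java (no POI involved). The class name EncodingMismatchDemo and the method transcode are made up for illustration, and the sketch assumes the JRE provides the windows-1252 charset. It encodes a string with one charset and decodes the same bytes with another, which is what happens when a property is read with the wrong encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of POI.
final class EncodingMismatchDemo {
    // Assumption: windows-1252 (CP1252) is available in the running JRE.
    static final Charset CP1252 = Charset.forName("windows-1252");

    /** Encodes text with one charset, then decodes the raw bytes with another. */
    static String transcode(String text, Charset writtenAs, Charset readAs) {
        byte[] raw = text.getBytes(writtenAs);
        return new String(raw, readAs);
    }

    /** True when reading with readAs returns the text that was written. */
    static boolean survives(String text, Charset writtenAs, Charset readAs) {
        return text.equals(transcode(text, writtenAs, readAs));
    }
}
```

For a German word with an umlaut and a sharp s, survives(word, StandardCharsets.UTF_8, EncodingMismatchDemo.CP1252) is false: each two-byte UTF-8 sequence turns into two CP1252 characters. Pure ASCII text survives either way, which is why the problem only shows up with non-ASCII content.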

According to the docs, PR_INTERNET_CPID indicates the code page only for PR_BODY and PR_BODY_HTML:

https://docs.microsoft.com/en-us/office/client-developer/outlook/mapi/pidtaginternetcodepage-canonical-property

In my tests Outlook never looks at the charset information inside the HTML; it relies only on PR_INTERNET_CPID.
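
A rough sketch of that behaviour for a binary HTML body, in plain Java: the bytes are decoded with the code page taken from PR_INTERNET_CPID, and any charset declared inside the markup is ignored. HtmlBodyDecoder, charsetForCodePage and decodeHtml are hypothetical names, not POI API, and the code page table is deliberately tiny (65001 is the Windows code page identifier for UTF-8).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of POI.
final class HtmlBodyDecoder {
    /** Maps a few Windows code page identifiers to Java charsets (illustrative subset). */
    static Charset charsetForCodePage(int codePage) {
        switch (codePage) {
            case 65001: return StandardCharsets.UTF_8;
            case 20127: return StandardCharsets.US_ASCII;
            case 28591: return StandardCharsets.ISO_8859_1;
            // Assumption: windows-1252 is available in the running JRE.
            case 1252:  return Charset.forName("windows-1252");
            default:
                throw new IllegalArgumentException("Unmapped code page: " + codePage);
        }
    }

    /**
     * Decodes a binary HTML body using only the PR_INTERNET_CPID value.
     * A charset declared in a meta tag inside the HTML is not consulted.
     */
    static String decodeHtml(byte[] rawHtml, int internetCpid) {
        return new String(rawHtml, charsetForCodePage(internetCpid));
    }
}
```

With this sketch, UTF-8 bytes whose markup claims iso-8859-1 in a meta tag still decode correctly when internetCpid is 65001.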

If PR_MESSAGE_CODEPAGE is undefined and no headers are present, the default ANSI code page for the locale given by PR_MESSAGE_LOCALE_ID may be the only hint for the correct code page, because PR_INTERNET_CPID applies only to the text/HTML body.
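
The fallback order described here could look roughly like the following. This is only a sketch of the idea, not the code from the patch: StringCodePageFallback and resolveCodePage are hypothetical names, the locale table is a small illustrative subset of Windows locale IDs and their default ANSI code pages, and -1 is an arbitrary marker for "no hint available".

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of POI and not taken from the patch.
final class StringCodePageFallback {
    // Illustrative subset: Windows locale ID -> default ANSI code page.
    private static final Map<Integer, Integer> ANSI_CODE_PAGE_BY_LCID = new HashMap<>();
    static {
        ANSI_CODE_PAGE_BY_LCID.put(1031, 1252); // de-DE
        ANSI_CODE_PAGE_BY_LCID.put(1033, 1252); // en-US
        ANSI_CODE_PAGE_BY_LCID.put(1045, 1250); // pl-PL
        ANSI_CODE_PAGE_BY_LCID.put(1049, 1251); // ru-RU
        ANSI_CODE_PAGE_BY_LCID.put(1032, 1253); // el-GR
        ANSI_CODE_PAGE_BY_LCID.put(1055, 1254); // tr-TR
    }

    /**
     * Picks the code page for string properties such as PR_SUBJECT.
     * Prefers PR_MESSAGE_CODEPAGE; otherwise falls back to the ANSI code page
     * of PR_MESSAGE_LOCALE_ID. Pass null for a property that is undefined.
     * Returns -1 when neither property gives a usable hint.
     */
    static int resolveCodePage(Integer messageCodePage, Integer messageLocaleId) {
        if (messageCodePage != null) {
            return messageCodePage;
        }
        if (messageLocaleId != null) {
            Integer ansi = ANSI_CODE_PAGE_BY_LCID.get(messageLocaleId);
            if (ansi != null) {
                return ansi;
            }
        }
        return -1;
    }
}
```

For the example message above (PR_MESSAGE_CODEPAGE undefined, PR_MESSAGE_LOCALE_ID 1031), resolveCodePage(null, 1031) yields 1252, which matches what Outlook expects for PR_SUBJECT.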

Suggestion:

https://github.com/apache/poi/pull/149

(With this patch, all existing unit tests pass without modification.)

Attachments:
MSG files whose text body and HTML body should be decoded correctly.
Outlook displays them as expected.

Regards,
Dominik