Summary: HSMF fails to extract dates from certain Outlook 2007 messages (.msg)
Component: HSMF
Assignee: POI Developers List <dev>
Description joewiz 2012-08-27 21:44:52 UTC
Created attachment 29287
Two sample Outlook 2007 messages, one whose date POI HSMF fails to recognize, and one whose date POI HSMF successfully recognizes

Overview
--------
POI HSMF fails to recognize the dates in some of my Outlook 2007 (.msg) files.

Steps to Reproduce
------------------
I have two reproducible tests, and I've attached a zip file containing test .msg files that illustrate my results. With the first file (test-message-poi-succeeds.msg), POI HSMF succeeds. With the second file (test-message-poi-fails.msg), POI HSMF fails. Here are the two tests:

1) The first test is to run the .msg files through Tika 1.2 (which uses POI HSMF to parse Outlook files), using the following command:

    java -jar tika-app-1.2.jar -m <filename>

Tika finds the date in the first message and returns these headers, 9 of which contain dates, the key field being "date: 2012-06-22T18:32:54Z":

    Author: Ashley, Carl E (PACE)
    Content-Length: 35840
    Content-Type: application/vnd.ms-outlook
    Creation-Date: 2012-06-22T18:32:54Z
    Last-Modified: 2012-06-22T18:32:54Z
    Last-Save-Date: 2012-06-22T18:32:54Z
    Message-Bcc:
    Message-Cc: PA History Mailbox
    Message-From: Ashley, Carl E (PACE)
    Message-Recipient-Address: firstname.lastname@example.org
    Message-To: 'email@example.com'
    creator: Ashley, Carl E (PACE)
    date: 2012-06-22T18:32:54Z
    dc:creator: Ashley, Carl E (PACE)
    dc:description: HAC Annual Report
    dc:title: HAC Annual Report
    dcterms:created: 2012-06-22T18:32:54Z
    dcterms:modified: 2012-06-22T18:32:54Z
    meta:author: Ashley, Carl E (PACE)
    meta:creation-date: 2012-06-22T18:32:54Z
    meta:save-date: 2012-06-22T18:32:54Z
    modified: 2012-06-22T18:32:54Z
    resourceName: test-message-poi-succeeds.msg
    subject: HAC Annual Report
    title: HAC Annual Report

Tika fails on the second message, returning no date fields:

    Author: PA History Mailbox
    Content-Length: 29696
    Content-Type: application/vnd.ms-outlook
    Message-Bcc:
    Message-Cc:
    Message-From: PA History Mailbox
    Message-Recipient-Address: firstname.lastname@example.org
    Message-To: Garrett, Amy C (PACE)
    creator: PA History Mailbox
    dc:creator: PA History Mailbox
    dc:description: Draft to La Tetra
    dc:title: Draft to La Tetra
    meta:author: PA History Mailbox
    resourceName: test-message-poi-fails.msg
    subject: Draft to La Tetra
    title: Draft to La Tetra

2) The second test is to run POI HSMFDump directly on each file:

    java -classpath poi-3.8-20120326.jar:poi-scratchpad-3.8-20120326.jar org.apache.poi.hsmf.dev.HSMFDump <filename>

When I run this command on the first file, it returns ALL of the fields above (including the 'date' field) in the following area:

    125 - TransportMessageHeaders - Unicode String

When I run this command on the second file, it returns NO 'date' fields, and no such '125' field.

I would appreciate anyone's assistance with this issue.

My System
---------
I am using poi-bin-3.8-20120326, tika-app-1.2, and Mac OS X 10.8.1 with Java 1.6.0_33.
Comment 1 Nick Burch 2012-08-28 06:25:42 UTC
There have been some discussions about this on the mailing list. We currently support decoding variable-length properties, but not fixed-length ones, which are stored differently. Some code was added quite recently towards supporting fixed-length properties, but more work is needed on it.
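For context, here is a minimal sketch of what decoding one fixed-length date property might involve. It is not POI's implementation. It assumes the layout described in the MS-OXMSG specification, where each fixed-length entry in the properties stream is 16 bytes (a 4-byte property tag, 4 bytes of flags, and an 8-byte value, all little-endian), and that a PT_SYSTIME value (type 0x0040) is a count of 100-nanosecond intervals since 1601-01-01 UTC. The class name FixedLengthDateSketch and its methods are hypothetical and exist only for illustration; 0x0039 is the property ID of the client submit time.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.Instant;

// Hypothetical helper, not part of POI: decodes a single 16-byte
// fixed-length property entry that holds a PT_SYSTIME value.
class FixedLengthDateSketch {
    static final int PT_SYSTIME = 0x0040;             // property type for dates
    static final int CLIENT_SUBMIT_TIME_ID = 0x0039;  // client submit time property ID
    // 100 ns ticks between 1601-01-01 and 1970-01-01 (both UTC)
    static final long TICKS_1601_TO_1970 = 116444736000000000L;
    static final long TICKS_PER_SECOND = 10_000_000L;

    // Converts a Windows FILETIME tick count into an Instant.
    static Instant fromFileTime(long ticks) {
        long unixTicks = ticks - TICKS_1601_TO_1970;
        long seconds = Math.floorDiv(unixTicks, TICKS_PER_SECOND);
        long nanos = Math.floorMod(unixTicks, TICKS_PER_SECOND) * 100L;
        return Instant.ofEpochSecond(seconds, nanos);
    }

    // Reads tag, flags and value from one entry (assumed layout, see above).
    static Instant decodeSysTime(byte[] entry) {
        if (entry.length != 16) {
            throw new IllegalArgumentException("expected a 16-byte entry");
        }
        ByteBuffer buf = ByteBuffer.wrap(entry).order(ByteOrder.LITTLE_ENDIAN);
        int tag = buf.getInt();
        int type = tag & 0xFFFF;   // low 16 bits of the tag: property type
        buf.getInt();              // flags, ignored in this sketch
        long ticks = buf.getLong();
        if (type != PT_SYSTIME) {
            throw new IllegalArgumentException("not a PT_SYSTIME property");
        }
        return fromFileTime(ticks);
    }

    // Builds a sample entry so the sketch can run without a .msg file.
    static byte[] sampleEntry(int propertyId, int type, long ticks) {
        return ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN)
                .putInt((propertyId << 16) | type)
                .putInt(0)
                .putLong(ticks)
                .array();
    }

    public static void main(String[] args) {
        byte[] entry = sampleEntry(CLIENT_SUBMIT_TIME_ID, PT_SYSTIME, 129848635740000000L);
        System.out.println(decodeSysTime(entry));
    }
}
```

The sample tick count in main is made up for the demonstration and is not read from the attached messages. A real implementation would also need to locate the entry inside the properties stream and skip the stream header, which this sketch leaves out.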
Comment 2 joewiz 2012-08-28 12:07:14 UTC
The thread Nick referred to is http://markmail.org/message/cm5fkzosjlwelz2c. Also, for additional context, here is the earlier thread on the Tika list: http://markmail.org/message/vtweepcegqwjuxb4.
Comment 3 Claudius Teodorescu 2012-09-22 19:36:13 UTC
Created attachment 29407
First version of patch for bug 53784
Comment 4 Claudius Teodorescu 2012-09-22 19:39:01 UTC
Attached is a first version of a patch for this issue. With it, the date is correctly extracted from the messages mentioned. I also wrote a unit test for those two messages. The flags for the client submit time property are not extracted, as that would require modifying another class. If they are needed now, I can modify that class.
Comment 5 Nick Burch 2012-10-14 14:52:54 UTC
The suggested fix will need some work, as it's a little too specific to the one property type. Thanks for the start though!
Comment 6 Nick Burch 2012-10-15 10:46:57 UTC
I've committed a partial fix for this in r1398241. I've had to disable part of your test though, as we're now getting a date, but not quite the correct one. Really, we need to finish the property value decoding, add some unit tests for that part, and then revisit the disabled tests to verify what really should be happening. Once that's done, we can likely do a bit of refactoring, and then expose more of the properties on MAPIMessage!