Bug 53784 - HSMF fails to extract dates from certain Outlook 2007 messages (.msg)
Status: NEW
Alias: None
Product: POI
Classification: Unclassified
Component: HSMF
Version: 3.8-FINAL
Hardware: Macintosh All
Importance: P2 normal with 1 vote
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Reported: 2012-08-27 21:44 UTC by joewiz
Modified: 2015-08-12 15:15 UTC
CC List: 1 user

Two sample Outlook 2007 messages, one whose date POI HSMF fails to recognize, and one whose date POI HSMF successfully recognizes (18.46 KB, application/zip)
2012-08-27 21:44 UTC, joewiz
First version of patch for bug 53784 (10.34 KB, application/zip)
2012-09-22 19:36 UTC, Claudius Teodorescu

Description joewiz 2012-08-27 21:44:52 UTC
Created attachment 29287
Two sample Outlook 2007 messages, one whose date POI HSMF fails to recognize, and one whose date POI HSMF successfully recognizes

POI HSMF fails to recognize the dates in some of my Outlook 2007 (.msg) files.  

Steps to Reproduce
I have two reproducible tests, and I've attached a zip file containing test .msg files that illustrate my results.  When I use the first file (test-message-poi-succeeds.msg), POI HSMF succeeds.  When I use the second file (test-message-poi-fails.msg), POI HSMF fails.  Here are the two tests: 

1) The first test is to run the .msg files through Tika 1.2 (which uses POI HSMF to parse Outlook files), using the following command:

  java -jar tika-app-1.2.jar -m <filename>

Tika finds the date in the first message, returning the following headers, nine of which contain dates (the key field is "date: 2012-06-22T18:32:54Z"):

Author: Ashley, Carl E (PACE)
Content-Length: 35840
Content-Type: application/vnd.ms-outlook
Creation-Date: 2012-06-22T18:32:54Z
Last-Modified: 2012-06-22T18:32:54Z
Last-Save-Date: 2012-06-22T18:32:54Z
Message-Cc: PA History Mailbox
Message-From: Ashley, Carl E (PACE)
Message-Recipient-Address: saftergood@fas.org
Message-To: 'saftergood@fas.org'
creator: Ashley, Carl E (PACE)
date: 2012-06-22T18:32:54Z
dc:creator: Ashley, Carl E (PACE)
dc:description: HAC Annual Report
dc:title: HAC Annual Report
dcterms:created: 2012-06-22T18:32:54Z
dcterms:modified: 2012-06-22T18:32:54Z
meta:author: Ashley, Carl E (PACE)
meta:creation-date: 2012-06-22T18:32:54Z
meta:save-date: 2012-06-22T18:32:54Z
modified: 2012-06-22T18:32:54Z
resourceName: test-message-poi-succeeds.msg
subject: HAC Annual Report
title: HAC Annual Report

Tika fails on the second message, returning no date fields:

Author: PA History Mailbox
Content-Length: 29696
Content-Type: application/vnd.ms-outlook
Message-From: PA History Mailbox
Message-Recipient-Address: garrettac@state.gov
Message-To: Garrett, Amy C (PACE)
creator: PA History Mailbox
dc:creator: PA History Mailbox
dc:description: Draft to La Tetra
dc:title: Draft to La Tetra
meta:author: PA History Mailbox
resourceName: test-message-poi-fails.msg
subject: Draft to La Tetra
title: Draft to La Tetra

2) The second test is to run POI HSMFDump directly on each file:

  java -classpath poi-3.8-20120326.jar:poi-scratchpad-3.8-20120326.jar org.apache.poi.hsmf.dev.HSMFDump <filename>

When I run this command on the first file, it returns ALL of the fields above (including the 'date' field) in the following area:

  125 - TransportMessageHeaders - Unicode String

When I run this command on the second file, it returns NO 'date' fields, and no such '125' field.
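
Since the dates in the first file show up inside the TransportMessageHeaders text, one way to picture the difference between the two files is a lookup of the "Date:" header in that text. The following is only a minimal sketch using the Java standard library; `TransportHeaderDate` and `findDate` are hypothetical names (not POI or Tika API), the sample header value in `main` is made up for illustration and is not copied from the attached message, and folded (multi-line) headers are not handled.

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

/** Hypothetical helper, not part of POI: finds the Date header in raw transport headers. */
final class TransportHeaderDate {

    /** Returns the parsed Date header, or empty if there is none or it cannot be parsed. */
    static Optional<Instant> findDate(String headers) {
        if (headers == null) {
            return Optional.empty(); // no chunk 125 at all, as in the failing file
        }
        for (String line : headers.split("\r?\n")) {
            // Header names are case-insensitive, so compare ignoring case.
            if (line.regionMatches(true, 0, "Date:", 0, 5)) {
                String value = line.substring(5).trim();
                try {
                    return Optional.of(
                        OffsetDateTime.parse(value, DateTimeFormatter.RFC_1123_DATE_TIME).toInstant());
                } catch (DateTimeParseException e) {
                    return Optional.empty();
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Illustrative header block only; the real headers are in the attached .msg file.
        String sample = "Received: from example\r\nDate: Fri, 22 Jun 2012 14:32:54 -0400\r\nSubject: x";
        System.out.println(findDate(sample));
        System.out.println(findDate(null));
    }
}
```

A message with no transport headers would yield an empty result here, so its date would have to come from a MAPI property instead.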

I would appreciate anyone's assistance with this issue.

My System
I am using poi-bin-3.8-20120326, tika-app-1.2, and Mac OS X 10.8.1 with Java 1.6.0_33.
Comment 1 Nick Burch 2012-08-28 06:25:42 UTC
There have been some discussions about this on the mailing list. We currently support decoding variable-length properties, but not fixed-length ones, which are stored differently. A little code was added quite recently towards supporting fixed-length properties, but more work is needed on it.
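
To illustrate the distinction: going by the MS-OXMSG layout, fixed-length properties are not stored in chunks of their own but inline in the properties stream, as 16-byte little-endian entries (a 4-byte tag whose low 16 bits are the type and whose high 16 bits are the id, then 4 bytes of flags, then 8 bytes of value). The following is a minimal sketch of reading one such entry under that assumption. `FixedPropertyEntry` is a hypothetical name and not a POI class, and the stream header that precedes the entries is not handled.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical holder for one 16-byte properties-stream entry; not part of POI. */
final class FixedPropertyEntry {
    final int type;      // low 16 bits of the tag, e.g. 0x0040 (PT_SYSTIME)
    final int id;        // high 16 bits of the tag, e.g. 0x0039 (client submit time)
    final int flags;     // property flags
    final long rawValue; // 8 raw value bytes, interpreted according to the type

    private FixedPropertyEntry(int type, int id, int flags, long rawValue) {
        this.type = type;
        this.id = id;
        this.flags = flags;
        this.rawValue = rawValue;
    }

    /** Reads one entry from the buffer's current position and advances it by 16 bytes. */
    static FixedPropertyEntry read(ByteBuffer buf) {
        if (buf.remaining() < 16) {
            throw new IllegalArgumentException("need 16 bytes, have " + buf.remaining());
        }
        buf.order(ByteOrder.LITTLE_ENDIAN);
        int type = buf.getShort() & 0xFFFF;
        int id = buf.getShort() & 0xFFFF;
        int flags = buf.getInt();
        long raw = buf.getLong();
        return new FixedPropertyEntry(type, id, flags, raw);
    }

    public static void main(String[] args) {
        // Illustrative bytes only: tag 0x00390040, flags 0x06, value 1.
        byte[] entry = {0x40, 0x00, 0x39, 0x00, 0x06, 0, 0, 0, 0x01, 0, 0, 0, 0, 0, 0, 0};
        FixedPropertyEntry e = read(ByteBuffer.wrap(entry));
        System.out.println(String.format("type=0x%04X id=0x%04X flags=0x%X raw=%d",
            e.type, e.id, e.flags, e.rawValue));
    }
}
```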
Comment 2 joewiz 2012-08-28 12:07:14 UTC
The thread Nick referred to is http://markmail.org/message/cm5fkzosjlwelz2c.  Also, for additional context, here is the earlier thread on the Tika list: http://markmail.org/message/vtweepcegqwjuxb4.
Comment 3 Claudius Teodorescu 2012-09-22 19:36:13 UTC
Created attachment 29407
First version of patch for bug 53784
Comment 4 Claudius Teodorescu 2012-09-22 19:39:01 UTC
Attached is a first version of a patch for this issue. With it, the date is correctly extracted for the messages mentioned. I also wrote a unit test for those two messages.

The flags for the client submit time property are not extracted, as that would require modifying another class. If they are needed now, I can modify that class.
Comment 5 Nick Burch 2012-10-14 14:52:54 UTC
The suggested fix will need some work, as it's a little too specific to the one property type. Thanks for the start though!
Comment 6 Nick Burch 2012-10-15 10:46:57 UTC
I've committed a partial fix for this in r1398241.

I've had to disable part of your test though, as we're now getting a date, but not quite the correct one.

Really, we need to finish the property value decoding, add some unit tests for that part, and then re-visit the disabled tests to verify what really should be happening.

Once that's done, we can likely do a bit of refactoring, and then expose more of the properties on MAPIMessage!
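
As a rough picture of what decoding fixed-length property values by type could look like, here is a minimal sketch using only the Java standard library. It assumes the usual MAPI type codes (PT_LONG 0x0003, PT_I8 0x0014, PT_SYSTIME 0x0040) and that a PT_SYSTIME value is a Windows FILETIME, i.e. a count of 100-nanosecond intervals since 1601-01-01 UTC. `FixedValueDecoder` is a hypothetical name; this is not the code committed in r1398241 and makes no claim about why the disabled test returns a slightly wrong date.

```java
import java.time.Instant;

/** Hypothetical decoder for the 8-byte value of a fixed-length property; not part of POI. */
final class FixedValueDecoder {
    static final int PT_LONG = 0x0003;    // 32-bit signed integer
    static final int PT_I8 = 0x0014;      // 64-bit signed integer
    static final int PT_SYSTIME = 0x0040; // FILETIME: 100 ns ticks since 1601-01-01 UTC

    // Seconds between 1601-01-01T00:00:00Z and 1970-01-01T00:00:00Z.
    private static final long EPOCH_DIFF_SECONDS = 11_644_473_600L;
    private static final long TICKS_PER_SECOND = 10_000_000L;

    /** Decodes the raw little-endian 8-byte value according to the property type. */
    static Object decode(int type, long raw) {
        switch (type) {
            case PT_LONG:
                return (int) raw; // only the low 4 bytes carry the value
            case PT_I8:
                return raw;
            case PT_SYSTIME:
                return filetimeToInstant(raw);
            default:
                throw new IllegalArgumentException("unsupported type 0x" + Integer.toHexString(type));
        }
    }

    static Instant filetimeToInstant(long filetime) {
        long seconds = Math.floorDiv(filetime, TICKS_PER_SECOND) - EPOCH_DIFF_SECONDS;
        long nanos = Math.floorMod(filetime, TICKS_PER_SECOND) * 100L;
        return Instant.ofEpochSecond(seconds, nanos);
    }

    public static void main(String[] args) {
        // 116444736000000000 ticks is exactly the Unix epoch.
        System.out.println(decode(PT_SYSTIME, 116_444_736_000_000_000L));
    }
}
```

Unit tests for this part can compare decoded values against known instants, such as the 2012-06-22T18:32:54Z date reported for the first sample message.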