Bug 53784 - HSMF fails to extract dates from certain Outlook 2007 messages (.msg)
Status: NEW
Alias: None
Product: POI
Classification: Unclassified
Component: HSMF
Version: 3.8-FINAL
Hardware: Macintosh All
Importance: P2 normal with 1 vote
Target Milestone: ---
Assignee: POI Developers List
Reported: 2012-08-27 21:44 UTC by joewiz
Modified: 2015-08-12 15:15 UTC

Attachments
Two sample Outlook 2007 messages, one whose date POI HSMF fails to recognize, and one whose date POI HSMF successfully recognizes (18.46 KB, application/zip)
2012-08-27 21:44 UTC, joewiz
First version of patch for bug 53784 (10.34 KB, application/zip)
2012-09-22 19:36 UTC, Claudius Teodorescu

Description joewiz 2012-08-27 21:44:52 UTC
Created attachment 29287
Two sample Outlook 2007 messages, one whose date POI HSMF fails to recognize, and one whose date POI HSMF successfully recognizes

Overview
--------
POI HSMF fails to recognize the dates in some of my Outlook 2007 (.msg) files.  

Steps to Reproduce
------------------
I have two reproducible tests, and I've attached a zip file containing test .msg files that illustrate my results.  When I use the first file (test-message-poi-succeeds.msg), POI HSMF succeeds.  When I use the second file (test-message-poi-fails.msg), POI HSMF fails.  Here are the two tests: 

1) The first test is to run the .msg files through Tika 1.2 (which uses POI HSMF to parse Outlook files), using the following command:

  java -jar tika-app-1.2.jar -m <filename>

Tika finds the date in the first message and returns the headers below. Nine of them contain dates; the key field is "date: 2012-06-22T18:32:54Z":

Author: Ashley, Carl E (PACE)
Content-Length: 35840
Content-Type: application/vnd.ms-outlook
Creation-Date: 2012-06-22T18:32:54Z
Last-Modified: 2012-06-22T18:32:54Z
Last-Save-Date: 2012-06-22T18:32:54Z
Message-Bcc: 
Message-Cc: PA History Mailbox
Message-From: Ashley, Carl E (PACE)
Message-Recipient-Address: saftergood@fas.org
Message-To: 'saftergood@fas.org'
creator: Ashley, Carl E (PACE)
date: 2012-06-22T18:32:54Z
dc:creator: Ashley, Carl E (PACE)
dc:description: HAC Annual Report
dc:title: HAC Annual Report
dcterms:created: 2012-06-22T18:32:54Z
dcterms:modified: 2012-06-22T18:32:54Z
meta:author: Ashley, Carl E (PACE)
meta:creation-date: 2012-06-22T18:32:54Z
meta:save-date: 2012-06-22T18:32:54Z
modified: 2012-06-22T18:32:54Z
resourceName: test-message-poi-succeeds.msg
subject: HAC Annual Report
title: HAC Annual Report

Tika fails on the second message, returning no date fields:

Author: PA History Mailbox
Content-Length: 29696
Content-Type: application/vnd.ms-outlook
Message-Bcc: 
Message-Cc: 
Message-From: PA History Mailbox
Message-Recipient-Address: garrettac@state.gov
Message-To: Garrett, Amy C (PACE)
creator: PA History Mailbox
dc:creator: PA History Mailbox
dc:description: Draft to La Tetra
dc:title: Draft to La Tetra
meta:author: PA History Mailbox
resourceName: test-message-poi-fails.msg
subject: Draft to La Tetra
title: Draft to La Tetra
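
One way to compare the two listings above is to reduce each to the fields whose values are timestamps. The following is a minimal sketch using only the Java standard library. The class and method names are hypothetical, and it assumes every line has the "key: value" shape shown in the Tika output above, with ISO 8601 UTC values for dates.

```java
import java.time.Instant;
import java.time.format.DateTimeParseException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Tika or POI.
final class TikaDateFields {

    /** Returns, in input order, the fields whose value parses as an ISO 8601 instant. */
    static Map<String, Instant> dateFields(List<String> lines) {
        Map<String, Instant> found = new LinkedHashMap<>();
        for (String line : lines) {
            // Keys such as "dc:creator" contain a colon, so split on ": " instead.
            int sep = line.indexOf(": ");
            if (sep < 0) {
                continue;
            }
            String key = line.substring(0, sep);
            String value = line.substring(sep + 2).trim();
            try {
                found.put(key, Instant.parse(value));
            } catch (DateTimeParseException e) {
                // Value is not a timestamp (or is empty); skip this field.
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // A few lines copied from the first listing above.
        List<String> sample = List.of(
                "Author: Ashley, Carl E (PACE)",
                "Message-Bcc: ",
                "date: 2012-06-22T18:32:54Z",
                "dcterms:created: 2012-06-22T18:32:54Z");
        System.out.println(dateFields(sample).keySet());
    }
}
```

Run against the second listing, the same filter returns an empty map, which is the symptom reported here.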


2) The second test is to run POI HSMFDump directly on each file:

  java -classpath poi-3.8-20120326.jar:poi-scratchpad-3.8-20120326.jar org.apache.poi.hsmf.dev.HSMFDump <filename>

When I run this command on the first file, it returns ALL of the fields above (including the 'date' field) in the following area:

  125 - TransportMessageHeaders - Unicode String

When I run this command on the second file, it returns NO 'date' fields, and no such '125' field.
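
For illustration, here is a rough sketch of how a date could be recovered from the text of a TransportMessageHeaders value when that field is present. It uses only the Java standard library and is not POI's actual code. The class name, method name, and sample header text are made up, and it assumes the headers contain an RFC 1123 style "Date:" line, as internet mail headers usually do.

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

// Hypothetical helper, not part of POI.
final class TransportHeaderDate {

    /** Finds the first "Date:" header line and parses it; empty if absent or unparseable. */
    static Optional<Instant> dateFromHeaders(String headers) {
        for (String line : headers.split("\r?\n")) {
            // Header names are case-insensitive.
            if (line.regionMatches(true, 0, "Date:", 0, 5)) {
                String value = line.substring(5).trim();
                try {
                    return Optional.of(
                            OffsetDateTime.parse(value, DateTimeFormatter.RFC_1123_DATE_TIME)
                                    .toInstant());
                } catch (DateTimeParseException e) {
                    return Optional.empty();
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up header text; the real attachment's headers may differ.
        String headers = "Received: from mail.example.org\r\n"
                + "Date: Fri, 22 Jun 2012 14:32:54 -0400\r\n"
                + "Subject: HAC Annual Report\r\n";
        System.out.println(dateFromHeaders(headers));
    }
}
```

A message with no '125' field gives such a fallback nothing to parse, so its date would have to come from another property.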

I would appreciate anyone's assistance with this issue.

My System
---------
I am using poi-bin-3.8-20120326, tika-app-1.2, and Mac OS X 10.8.1 with Java 1.6.0_33.
Comment 1 Nick Burch 2012-08-28 06:25:42 UTC
There have been some discussions about this on the mailing list. We currently support decoding variable-length properties, but not fixed-length ones, which are stored differently. Some code was added quite recently towards supporting fixed-length properties, but more work is needed on it.
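
To make the fixed-length case concrete, here is a minimal sketch of decoding one date-valued entry using only the Java standard library. It is not the POI implementation. It assumes the layout I understand the MSG format to use for fixed-length entries in the properties stream: 16 bytes, little-endian, made up of a 2-byte type, a 2-byte property id, 4 bytes of flags, and an 8-byte value. For PT_SYSTIME (0x0040) that value is a Windows FILETIME, a count of 100 ns ticks since 1601-01-01 UTC. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.Instant;

// Hypothetical helper, not part of POI.
final class FixedLengthProperty {
    static final int PT_SYSTIME = 0x0040;
    // Seconds between 1601-01-01 and 1970-01-01 (both UTC).
    static final long EPOCH_DIFF_SECONDS = 11_644_473_600L;

    /** Converts a FILETIME tick count (100 ns units since 1601) to an Instant. */
    static Instant filetimeToInstant(long filetime) {
        long seconds = Math.floorDiv(filetime, 10_000_000L) - EPOCH_DIFF_SECONDS;
        long nanos = Math.floorMod(filetime, 10_000_000L) * 100L;
        return Instant.ofEpochSecond(seconds, nanos);
    }

    /** Decodes one 16-byte entry, assuming the layout described above. */
    static Instant readSystime(byte[] entry) {
        if (entry.length != 16) {
            throw new IllegalArgumentException("expected a 16-byte entry");
        }
        ByteBuffer buf = ByteBuffer.wrap(entry).order(ByteOrder.LITTLE_ENDIAN);
        int type = buf.getShort() & 0xFFFF;
        int id = buf.getShort() & 0xFFFF;   // e.g. 0x0039 for the client submit time
        int flags = buf.getInt();           // read only to advance; unused in this sketch
        long value = buf.getLong();
        if (type != PT_SYSTIME) {
            throw new IllegalArgumentException(
                    "property 0x" + Integer.toHexString(id) + " is not PT_SYSTIME");
        }
        return filetimeToInstant(value);
    }

    public static void main(String[] args) {
        // Build an entry for the date Tika reports for the first sample message.
        long ticks = (Instant.parse("2012-06-22T18:32:54Z").getEpochSecond()
                + EPOCH_DIFF_SECONDS) * 10_000_000L;
        byte[] entry = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN)
                .putShort((short) PT_SYSTIME).putShort((short) 0x0039)
                .putInt(0).putLong(ticks).array();
        System.out.println(readSystime(entry));
    }
}
```

Variable-length properties, by contrast, have their data in separate streams, which is the case HSMF already decodes.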
Comment 2 joewiz 2012-08-28 12:07:14 UTC
The thread Nick referred to is http://markmail.org/message/cm5fkzosjlwelz2c.  Also, for additional context, here is the earlier thread on the Tika list: http://markmail.org/message/vtweepcegqwjuxb4.
Comment 3 Claudius Teodorescu 2012-09-22 19:36:13 UTC
Created attachment 29407
First version of patch for bug 53784
Comment 4 Claudius Teodorescu 2012-09-22 19:39:01 UTC
Attached is a first version of a patch for this issue. With it, the date is correctly extracted for the messages mentioned. I also wrote a unit test for those two messages.

The flags for the client submit time property are not extracted, as that would require modifying another class. If they are needed now, I can modify that class.
Comment 5 Nick Burch 2012-10-14 14:52:54 UTC
The fix suggested will need some work, as it's a little too specific for the one property type. Thanks for the start though!
Comment 6 Nick Burch 2012-10-15 10:46:57 UTC
I've committed a partial fix for this in r1398241.

I've had to disable part of your test though, as we're now getting a date, but not quite the correct one.

Really, we need to finish the property value decoding, add some unit tests for that part, and then re-visit the disabled tests to verify what really should be happening.

Once that's done, we can likely do a bit of refactoring, and then expose more of the properties on MAPIMessage!