Created attachment 29287
Two sample Outlook 2007 messages, one whose date POI HSMF fails to recognize, and one whose date POI HSMF successfully recognizes

Overview
--------

POI HSMF fails to recognize the dates in some of my Outlook 2007 (.msg) files.

Steps to Reproduce
------------------

I have two reproducible tests, and I've attached a zip file containing test .msg files that illustrate my results. When I use the first file (test-message-poi-succeeds.msg), POI HSMF succeeds. When I use the second file (test-message-poi-fails.msg), POI HSMF fails. Here are the two tests:

1) The first test is to run the .msg files through Tika 1.2 (which uses POI HSMF to parse Outlook files), using the following command:

    java -jar tika-app-1.2.jar -m <filename>

Tika succeeds in finding the date on the first message, returning these headers, 9 of which contain dates; the key field is "date: 2012-06-22T18:32:54Z":

    Author: Ashley, Carl E (PACE)
    Content-Length: 35840
    Content-Type: application/vnd.ms-outlook
    Creation-Date: 2012-06-22T18:32:54Z
    Last-Modified: 2012-06-22T18:32:54Z
    Last-Save-Date: 2012-06-22T18:32:54Z
    Message-Bcc:
    Message-Cc: PA History Mailbox
    Message-From: Ashley, Carl E (PACE)
    Message-Recipient-Address: saftergood@fas.org
    Message-To: 'saftergood@fas.org'
    creator: Ashley, Carl E (PACE)
    date: 2012-06-22T18:32:54Z
    dc:creator: Ashley, Carl E (PACE)
    dc:description: HAC Annual Report
    dc:title: HAC Annual Report
    dcterms:created: 2012-06-22T18:32:54Z
    dcterms:modified: 2012-06-22T18:32:54Z
    meta:author: Ashley, Carl E (PACE)
    meta:creation-date: 2012-06-22T18:32:54Z
    meta:save-date: 2012-06-22T18:32:54Z
    modified: 2012-06-22T18:32:54Z
    resourceName: test-message-poi-succeeds.msg
    subject: HAC Annual Report
    title: HAC Annual Report

Tika fails on the second message, returning no date fields:

    Author: PA History Mailbox
    Content-Length: 29696
    Content-Type: application/vnd.ms-outlook
    Message-Bcc:
    Message-Cc:
    Message-From: PA History Mailbox
    Message-Recipient-Address: garrettac@state.gov
    Message-To: Garrett, Amy C (PACE)
    creator: PA History Mailbox
    dc:creator: PA History Mailbox
    dc:description: Draft to La Tetra
    dc:title: Draft to La Tetra
    meta:author: PA History Mailbox
    resourceName: test-message-poi-fails.msg
    subject: Draft to La Tetra
    title: Draft to La Tetra

2) The second test is to run POI HSMFDump directly on each file:

    java -classpath poi-3.8-20120326.jar:poi-scratchpad-3.8-20120326.jar org.apache.poi.hsmf.dev.HSMFDump <filename>

When I run this command on the first file, it returns ALL of the fields above (including the 'date' field) in the following area:

    125 - TransportMessageHeaders - Unicode String

When I run this command on the second file, it returns NO 'date' fields, and no such '125' field.

I would appreciate anyone's assistance with this issue.

My System
---------

I am using poi-bin-3.8-20120326, tika-app-1.2, and Mac OS X 10.8.1 with Java 1.6.0_33.
There have been some discussions about this on the mailing list. We currently support decoding variable-length properties, but not fixed-length ones, which are stored differently. A little code was added quite recently towards supporting fixed-length ones, but more work is needed on it.
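For illustration, here is a minimal sketch of what decoding a fixed-length time property involves. It assumes the layout described in the MS-OXMSG specification: each entry in the property stream is 16 bytes (a 4-byte tag holding the type and id, 4 bytes of flags, and an 8-byte value), and a time value (type 0x0040) is a little-endian count of 100-nanosecond intervals since 1601-01-01 UTC. The class and method names are hypothetical and are not part of POI's API. The sketch also uses java.time, which needs Java 8 or later.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.Instant;

// Hypothetical helper, not POI code: decodes one fixed-length time entry.
final class FixedLengthTimeSketch {
    // Seconds between 1601-01-01 (FILETIME epoch) and 1970-01-01 (Unix epoch).
    private static final long EPOCH_DIFF_SECONDS = 11_644_473_600L;
    // Property type code for a time value (PtypTime / PT_SYSTIME).
    private static final int TYPE_TIME = 0x0040;

    private FixedLengthTimeSketch() {}

    // Converts a count of 100 ns ticks since 1601-01-01 UTC to an Instant.
    static Instant filetimeToInstant(long filetime) {
        long seconds = filetime / 10_000_000L - EPOCH_DIFF_SECONDS;
        long nanos = (filetime % 10_000_000L) * 100L;
        return Instant.ofEpochSecond(seconds, nanos);
    }

    // Reads one 16-byte entry: type (2), id (2), flags (4), value (8),
    // all little-endian. Assumes the caller has already skipped the
    // stream header and sliced out a single entry.
    static Instant readTimeEntry(byte[] entry) {
        if (entry.length < 16) {
            throw new IllegalArgumentException("entry must be at least 16 bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(entry).order(ByteOrder.LITTLE_ENDIAN);
        int type = buf.getShort() & 0xFFFF;
        buf.getShort();           // property id, e.g. 0x0039 for client submit time
        buf.getInt();             // flags, ignored in this sketch
        long value = buf.getLong();
        if (type != TYPE_TIME) {
            throw new IllegalArgumentException("not a time property: type 0x"
                    + Integer.toHexString(type));
        }
        return filetimeToInstant(value);
    }
}
```

With this, `FixedLengthTimeSketch.filetimeToInstant(116444736000000000L)` yields the Unix epoch, which is a convenient sanity check. A general fix would dispatch on the type code rather than handle the time type alone.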
The thread Nick referred to is http://markmail.org/message/cm5fkzosjlwelz2c. Also, for additional context, here is the earlier thread on the Tika list: http://markmail.org/message/vtweepcegqwjuxb4.
Created attachment 29407
First version of patch for bug 53784
Attached is a first version of a patch for this issue. With it, the date is correctly extracted for the two messages mentioned. I also wrote a unit test for those two messages. The flags for the client submit time property are not extracted, as that would require modifying another class. If they are needed now, I can modify that class.
The fix suggested will need some work, as it's a little too specific for the one property type. Thanks for the start though!
I've committed a partial fix for this in r1398241. I've had to disable part of your test though, as we're now getting a date, but not quite the correct one.

Really, we need to finish the property value decoding, add some unit tests for that part, and then revisit the disabled tests to verify what really should be happening. Once that's done, we can likely do a bit of refactoring, and then expose more of the properties on MAPIMessage!