Summary: [PATCH] Invalid chunk name Olk10SideProps_0001 (Parsing MSG files - Outlook 2002 drag and dropped)
Component: HSMF
Assignee: POI Developers List <dev>
Output of POIFSLister on problem file
PATCH for issue
Patch for the prior fix
Example of the bug that the patch fixed
Description Jeremy 2011-09-22 18:01:11 UTC
I'm getting this error on a bunch of Outlook MSG files I'm trying to ingest. Due to the sensitive nature of the task, I can't post an example here, though I may be able to recreate one in the next few days and attach it. After some research, it appears that the Olk10SideProps_0001 stream was only written by Outlook 2002 for messages dragged and dropped to disk. This stream may contain the message ID and store ID. It is not documented in MS-OXMSG. See further explanation here: http://social.msdn.microsoft.com/Forums/en-US/os_exchangeprotocols/thread/1f2848a4-3b6a-4f8f-85dd-55e6b12fdec6 If possible, please add a fix that ignores this stream and continues processing the MSG file, provided that can be done in a valid way. I'll see if I can get anything to work on my end. Stack trace from Tika (1.0-SNAPSHOT) calling POI 3.8-b4:
Caused by: java.lang.IllegalArgumentException: Invalid chunk name Olk10SideProps_0001
    at org.apache.poi.hsmf.parsers.POIFSChunkParser.process(POIFSChunkParser.java:125)
    at org.apache.poi.hsmf.parsers.POIFSChunkParser.processChunks(POIFSChunkParser.java:98)
    at org.apache.poi.hsmf.parsers.POIFSChunkParser.parse(POIFSChunkParser.java:85)
    at org.apache.poi.hsmf.MAPIMessage.<init>(MAPIMessage.java:127)
    at org.apache.tika.parser.microsoft.OutlookExtractor.<init>(OutlookExtractor.java:57)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:217)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
Comment 1 Nick Burch 2011-09-22 18:09:13 UTC
If you could produce a sample file, that'd be very helpful (either by getting the problematic version of Outlook running, or just attacking the file with a hex editor to XXX out all the strings). In the meantime, any chance you could run org.apache.poi.poifs.dev.POIFSLister against the file and post the output? We can then check there are no other unexpected chunks in the file.
Comment 2 Jeremy 2011-09-22 18:39:19 UTC
Can you please tell me how to run POIFSLister? I found it mentioned on the website, but with no details on how to invoke it from the command line. I'm also going to try some hacks to the current build to see if I can get a workaround that can be supplied as a patch. This issue seems to be specific to the version of Outlook and the method used to generate these files. Any modification and re-saving of them fixes the issue. I'm not sure how well the hex editor would work, as a good percentage of these files also have image attachments that contain sensitive information. Unless you have any tips for using the editor to also yank out binary attachments without corrupting the original msg?
Comment 3 Nick Burch 2011-09-22 21:24:40 UTC
To run POIFSLister, just do something like: java -classpath poi-3.8-beta5-20110922.jar org.apache.poi.poifs.dev.POIFSLister problem.msg
Comment 4 Jeremy 2011-09-22 22:29:04 UTC
I've attached the result of the lister run on the file. The problem seems to be the final chunk at the end of the file. I'm going to try and test with the chunk parser just returning when a chunk with this name is encountered. (In reply to comment #3) > To run POIFSLister, just do something like: > java -classpath poi-3.8-beta5-20110922.jar org.apache.poi.poifs.dev.POIFSLister > problem.msg
Comment 5 Jeremy 2011-09-22 22:30:48 UTC
Created attachment 27573 [details] Output of POIFSLister on problem file
Comment 6 Jeremy 2011-09-22 23:03:12 UTC
Created attachment 27574 [details] PATCH for issue SVN diff for a patch to resolve the issue. Essentially, this chunk was only added to .MSG files generated by Outlook 2002 via drag and drop. This wrapper chunk doesn't seem to contain any data that affects the extraction of text from an email and its attachments. The patch has a hard-coded check to see if the entryName matches this rare value and, if so, returns so that processing continues without an exception.
Comment 7 Nick Burch 2011-09-23 16:25:51 UTC
This should be fixed in r1174868. I went for a slightly different approach to your patch, which should cope with there being more than one Olk10SideProps entry, in case we find that. If you could create a test file we could use in unit testing, that'd be great.
Comment 8 Jeremy 2011-09-23 18:15:08 UTC
Thanks in advance. I figured you might want to modify the patch, given your better knowledge of the inner workings of these documents. I'm not so sure I'll be able to submit a test file for unit testing, though. All the files I have contain sensitive info and attachments, and getting a version of Outlook 2002 up and running may not happen. If I get the chance, I'll try to follow through. Thanks again for your help. (In reply to comment #7) > This should be fixed in r1174868. I went for a slightly different approach to > your patch, which should cope with there being more than one Olk10SideProps > entry in case we find that > If you could create a test file we could use in unit testing, that'd be great
Comment 9 Nick Burch 2011-09-23 19:44:52 UTC
Let's leave this bug open until we get a test file then, whenever that may be!
Comment 10 Jeremy 2011-09-27 19:39:47 UTC
Still looking into the possibility of creating a test file, but I'm not sure if it will happen. I did detect a bug in the fix and have submitted a new patch. Either the .equals() needs to change to .startsWith(), or the "Olk10SideProps" needs to have the underscore added to the end: "Olk10SideProps_". Thanks again for your attention to the matter. Regards, Jeremy (In reply to comment #9) > Let's leave this bug open until we get a test file then, whenever that may be!
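The difference between the two comparisons mentioned above can be sketched in plain Java. This is only an illustration and not the actual POIFSChunkParser code: the class `SidePropsFilter` and its methods are hypothetical names, and how the real parser derives the string it compares is not shown in this thread. Only the entry name `Olk10SideProps_0001` and the `"Olk10SideProps_"` prefix come from the report.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SidePropsFilter {
    // Prefix taken from the bug report; the numeric suffix (e.g. 0001) may vary.
    static final String SIDE_PROPS_PREFIX = "Olk10SideProps_";

    // Exact comparison against the bare name: never true for "Olk10SideProps_0001".
    static boolean matchesExactly(String entryName) {
        return entryName.equals("Olk10SideProps");
    }

    // Prefix comparison: true for any entry whose name starts with the prefix.
    static boolean isSideProps(String entryName) {
        return entryName.startsWith(SIDE_PROPS_PREFIX);
    }

    // Returns the entry names that should still be handed to chunk parsing.
    static List<String> entriesToParse(List<String> entryNames) {
        List<String> kept = new ArrayList<>();
        for (String name : entryNames) {
            if (!isSideProps(name)) {
                kept.add(name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("__substg1.0_1000001E", "Olk10SideProps_0001");
        System.out.println("exact match:  " + matchesExactly("Olk10SideProps_0001"));
        System.out.println("prefix match: " + isSideProps("Olk10SideProps_0001"));
        System.out.println("to parse:     " + entriesToParse(names));
    }
}
```

Matching on the prefix rather than the full name also covers the case raised in comment 7, where a file might contain more than one such entry.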
Comment 11 Jeremy 2011-09-27 19:42:22 UTC
Created attachment 27614 [details] Patch for the prior fix Patches the fix by adding an underscore to the string being compared.
Comment 12 Jeremy 2011-10-03 14:15:10 UTC
Please take a look at my next comment, I noticed an issue with the fix submitted in r1174868. There is an underscore that needs to be taken into account when doing the comparison. I submitted an updated patch for the issue. Thanks in advance. (In reply to comment #9) > Let's leave this bug open until we get a test file then, whenever that may be!
Comment 13 Nick Burch 2011-10-05 22:14:52 UTC
Hopefully fixed in r1179462. Life would be much easier with a test file to write a unit test against... :)
Comment 14 Jeremy 2011-10-06 02:22:10 UTC
Thanks, that fixed it!!! Hopefully sometime this week or next I may be able to track down a sample to send in. Problem is I'm dealing with over 10K message files, of which 3K-6K may have the issue but a good percentage have attachments. If I get lucky and am able to track down a trouble file without an attachment, I'll look into hex editing out the sensitive items and submit it as a test. (In reply to comment #13) > Hopefully fixed in r1179462. > Life would be much easier with a test file to write a unit test against... :)
Comment 15 Jeremy 2011-10-17 18:45:12 UTC
Well, I tracked down a few sample files I was going to try to include; however, I've hit a roadblock trying to hex-edit out the sensitive information. I'm able to remove all sensitive header information in the hex editor, and I can see a raw text version of the message body, which I've hexed out. However, it appears that the message body text is also stored in some binary format, which I'm not able to hex-edit out without corrupting the document. Unless you have an idea for the best way to proceed with modifying these documents, getting a sample may have to wait until I can track down a version of Outlook 2002 that produces messages with this issue. (In reply to comment #14) > Thanks, that fixed it!!! > Hopefully sometime this week or next I may be able to track down a sample to > send in. Problem is I'm dealing with over 10K message files, of which 3K-6K > may have the issue but a good percentage have attachments. If I get lucky and > am able to track down a trouble file without an attachment, I'll look into hex > editing out the sensitive items and submit it as a test. > (In reply to comment #13) > > Hopefully fixed in r1179462. > > Life would be much easier with a test file to write a unit test against... :)
Comment 16 Nick Burch 2011-10-17 18:51:11 UTC
I'd be happy to have a go at editing the file, to see if I can manage to remove all the identifiable bits. Well, as long as it takes me under about 10 minutes... :)
Comment 17 Jeremy 2011-10-19 15:20:02 UTC
Unfortunately I'm still reluctant to release the document as per our contract with the source. If you could help point me in the right direction, that would be greatly appreciated. I've got all the metadata and raw text already hex-edited out. But it appears the message text is also stored in another block in some encoded manner which is then decoded and appears when the message is opened. It would be great if you could help direct me towards the specs for determining the encoding method used and any utilities that may be useful for re-encoding my newly hex-edited version of the content. I'm assuming and hoping that the re-encoded version will be of the same size and length and won't corrupt the original document? It appears the encoded version of the text is in a block labeled as: _._.s.u.b.s.t.g.1...0._.1.0.0.0.0.0.1.E Thanks again for the help. (In reply to comment #16) > I'd be happy to have a go at editing the file, to see if I can manage to remove > all the identifiable bits. Well, as long as it takes me under about 10 > minutes... :)
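For reference, MSG stream names of the form `__substg1.0_XXXXYYYY` carry the property ID in the first four hex digits and the property type in the last four, so a name ending in `1000001E` would be property 0x1000 (the message body) with type 0x001E (an 8-bit string). The sketch below decodes such a name in plain Java. The class `SubstgName` and its methods are hypothetical names for illustration, not POI API, and `hexEditorView` assumes the hex editor displays the name as UTF-16LE bytes with each NUL byte shown as a dot.

```java
import java.nio.charset.StandardCharsets;

public class SubstgName {
    static final String PREFIX = "__substg1.0_";

    final int propertyId;   // first four hex digits after the prefix
    final int propertyType; // next four hex digits

    SubstgName(int propertyId, int propertyType) {
        this.propertyId = propertyId;
        this.propertyType = propertyType;
    }

    // Parses a stream name such as "__substg1.0_1000001E".
    static SubstgName parse(String name) {
        if (!name.startsWith(PREFIX) || name.length() < PREFIX.length() + 8) {
            throw new IllegalArgumentException("Not a substg stream name: " + name);
        }
        String hex = name.substring(PREFIX.length(), PREFIX.length() + 8);
        int id = Integer.parseInt(hex.substring(0, 4), 16);
        int type = Integer.parseInt(hex.substring(4, 8), 16);
        return new SubstgName(id, type);
    }

    // Assumed hex-editor rendering: UTF-16LE bytes, NUL bytes drawn as '.'.
    static String hexEditorView(String name) {
        StringBuilder sb = new StringBuilder();
        for (byte b : name.getBytes(StandardCharsets.UTF_16LE)) {
            sb.append(b == 0 ? '.' : (char) b);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        SubstgName n = parse("__substg1.0_1000001E");
        System.out.printf("id=0x%04X type=0x%04X%n", n.propertyId, n.propertyType);
        System.out.println(hexEditorView("__substg1.0_1000001E"));
    }
}
```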
Comment 18 Nick Burch 2011-10-19 15:22:56 UTC
One option is to just drop the chunks you can't properly clean. That may result in a file that Outlook is a bit funny about, but should help us with testing! The way to do that would be to open the file in POIFS, iterate through the entries, and either copy each one to a new POIFS instance or skip it. See org.apache.poi.poifs.filesystem.EntryUtils for a guide to doing this.
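The copy-or-skip idea can be sketched without POI on the classpath. In the plain-Java model below, a `Map` from entry name to bytes stands in for a POIFS directory; the class `EntryCopySketch` and the method `copyExcept` are hypothetical, and a real implementation would go through the POIFS and EntryUtils classes named above instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

public class EntryCopySketch {
    // Copies every entry into a new container unless the predicate says to skip it.
    // The Map is only a stand-in for a POIFS directory (assumption for illustration).
    static Map<String, byte[]> copyExcept(Map<String, byte[]> source, Predicate<String> skip) {
        Map<String, byte[]> target = new LinkedHashMap<>();
        for (Map.Entry<String, byte[]> entry : source.entrySet()) {
            if (skip.test(entry.getKey())) {
                continue; // drop chunks that cannot be cleaned
            }
            target.put(entry.getKey(), entry.getValue().clone());
        }
        return target;
    }

    public static void main(String[] args) {
        Map<String, byte[]> source = new LinkedHashMap<>();
        source.put("__properties_version1.0", new byte[] {1});
        source.put("__substg1.0_1000001E", new byte[] {2, 3});
        source.put("Olk10SideProps_0001", new byte[] {4});

        // Skip the body chunk, keep everything else.
        Map<String, byte[]> cleaned = copyExcept(source, name -> name.startsWith("__substg1.0_1000"));
        System.out.println(cleaned.keySet());
    }
}
```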
Comment 19 Jeremy 2011-10-20 20:30:28 UTC
Well, that's certainly helped me make some headway today when I wasn't busy with other things. I've got a utility up and working that uses EntryUtils to copy blocks or skip them. The problem I'm having now is figuring out which entry holds the binary data. I removed all of the largest entries, but both the raw text and the binary-encoded version remained in the then-corrupted msg file. LOL. I'm guessing that perhaps one of the entries contains information about where in the file the encoded portion of the message text resides? Any hints for pinpointing which chunk relates to the encoded message body, either via POI utility classes or methods, or through the hex editor? I suppose if worst comes to worst, I can try dropping them one at a time and see which one works, as long as it's not a combination of dropped chunks that I need. LOL (In reply to comment #18) > One option is to just drop the chunks you can't properly clean. That may result > in a file that outlook is a bit funny about, but should help us with testing! > The way to do that would be to open the file in POIFS, iterate through the > entries, and either copy to a new POIFS instance or skip. See > org.apache.poi.poifs.filesystem.EntryUtils for a guide to doing this
Comment 20 Jeremy 2011-10-21 14:30:11 UTC
Voilà!!!!! I had a thought on my drive home yesterday and was able to get it to work this morning. An example will be attached shortly. Just for the purpose of getting the test file, I had a great idea for a hack, and it appears to have worked. Essentially, I created a new test message in Outlook 2010. Then, using the utility you directed me towards, I stripped the 2002 Olk10SideProps chunk from the original document, copied the 2010 document using EntryUtils, then appended the 2002 chunk to the copied 2010 document before writing!! Much simpler than all the hex editing and chunk tracking that was on my plate. Thanks again, Nick, for all your help and your patience!! Jeremy
Comment 21 Jeremy 2011-10-21 14:32:11 UTC
Created attachment 27834 [details] Example of the bug that the patch fixed Finally was able to get a sample document together!!