Bug 52400 - [PATCH] TNEF parsing unstable
Summary: [PATCH] TNEF parsing unstable
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: HMEF
Version: unspecified
Hardware: Other Linux
Importance: P2 critical with 2 votes
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-12-30 12:39 UTC by Rob Tulloh
Modified: 2014-06-25 09:41 UTC



Attachments
Simple TNEF attachment (4.44 KB, application/octet-stream)
2013-11-01 22:47 UTC, Jeff Evans
TNEF message with attachments (313.87 KB, application/octet-stream)
2013-11-01 22:48 UTC, Jeff Evans
[PATCH] ignore trailing newlines of winmail.dat (282.02 KB, application/zip)
2013-11-02 21:41 UTC, Andreas Beeker

Description Rob Tulloh 2011-12-30 12:39:35 UTC
> We are seeing problems in Solr with tika throwing exceptions. Sometimes we see OOM like this:
> {noformat}
> SEVERE: java.lang.OutOfMemoryError: Java heap space
>        at org.apache.poi.hmef.attribute.TNEFAttribute.<init>(TNEFAttribute.java:50)
>        at org.apache.poi.hmef.attribute.TNEFAttribute.create(TNEFAttribute.java:76)
>        at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:74)
>        at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
>        at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
>        at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
>        at org.apache.poi.hmef.HMEFMessage.<init>(HMEFMessage.java:63)
>        at org.apache.tika.parser.microsoft.TNEFParser.parse(TNEFParser.java:79)
>        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
>        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:195)
>        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
>        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>        at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:244)
>        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1478)
>        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:353)
>        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:248)
>        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
>        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
>        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
>        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
> {noformat}
> Other times, we see errors like this one:
> {noformat}
> Caused by: org.apache.poi.util.LittleEndian$BufferUnderrunException: buffer underrun
>        at org.apache.poi.util.LittleEndian.readUShort(LittleEndian.java:302)
>        at org.apache.poi.hmef.attribute.TNEFAttribute.<init>(TNEFAttribute.java:53)
>        at org.apache.poi.hmef.attribute.TNEFAttribute.create(TNEFAttribute.java:76)
>        at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:74)
>        at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
>        at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
>        at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
>        at org.apache.poi.hmef.HMEFMessage.<init>(HMEFMessage.java:63)
>        at org.apache.tika.parser.microsoft.TNEFParser.parse(TNEFParser.java:79)
>        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>        ... 26 more
> {noformat}

We are currently evaluating Solr and Tika using valid sample data. The Solr and Tika projects have both indicated that this problem should be reported and fixed in POI.
Comment 1 Rob Tulloh 2011-12-30 12:40:48 UTC
See also the original ticket for information.

https://issues.apache.org/jira/browse/TIKA-835
Comment 2 Nick Burch 2011-12-31 03:50:39 UTC
As you're not able to share the file, any chance you could debug through the creation of the attributes? It'd be useful to know which attribute this is in the list (is it the 1st one, 2nd one, etc.), whether the lengths + types + IDs of the preceding attributes look sensible, and what the apparent length + ID + type of the failing attribute are.
Comment 3 Rob Tulloh 2012-01-01 14:23:09 UTC
ID,Type,Len
36870,8,4
36871,6,8
2573,1152,512
1,0,977272843 (boom!)

{noformat}
Step completed: "thread=main", org.apache.poi.hmef.attribute.TNEFAttribute.create(), line=65 bci=10
65          if(id == TNEFProperty.ID_MAPIPROPERTIES.id ||

main[1] print id
 id = 1
main[1] print type
 type = 0
main[1] next
>
{noformat}

And then a bit later

{noformat}
Step completed: "thread=main", org.apache.poi.hmef.attribute.TNEFAttribute.<init>(), line=49 bci=15
49          property = TNEFProperty.getBest(id, type);

main[1] print length
 length = 977272843
{noformat}
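
The attribute walk requested in comment 2 can also be done outside a debugger. The following is a minimal sketch, not POI code: the class and method names are hypothetical, and it assumes the usual TNEF layout of a 4-byte signature and 2-byte key, followed by attributes made of a 1-byte level, 2-byte ID, 2-byte type, 4-byte length, the data, and a 2-byte checksum, all little-endian. It prints each header and stops at the first length that cannot fit in the remaining bytes, without allocating a buffer of that size.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of POI.
final class TnefAttributeDump {

    /** One attribute header as read from the stream. */
    static final class Header {
        final int num, level, id, type;
        final long length;
        final boolean plausible; // false if data + checksum would run past the end

        Header(int num, int level, int id, int type, long length, boolean plausible) {
            this.num = num; this.level = level; this.id = id;
            this.type = type; this.length = length; this.plausible = plausible;
        }

        @Override public String toString() {
            return String.format("#%d level=%d id=%d (0x%04X) type=%d len=%d%s",
                    num, level, id, id, type, length, plausible ? "" : "  <-- does not fit");
        }
    }

    static List<Header> dump(byte[] tnef) {
        ByteBuffer buf = ByteBuffer.wrap(tnef).order(ByteOrder.LITTLE_ENDIAN);
        List<Header> out = new ArrayList<>();
        if (buf.remaining() < 6) return out;
        buf.position(6); // skip the 4-byte signature and the 2-byte key
        int num = 0;
        // level(1) + id(2) + type(2) + length(4) = 9 header bytes
        while (buf.remaining() >= 9) {
            int level = buf.get() & 0xFF;
            int id = buf.getShort() & 0xFFFF;
            int type = buf.getShort() & 0xFFFF;
            long length = buf.getInt() & 0xFFFFFFFFL;
            // the data is followed by a 2-byte checksum
            boolean plausible = length + 2 <= buf.remaining();
            out.add(new Header(++num, level, id, type, length, plausible));
            if (!plausible) break; // everything after this point is unreliable
            buf.position(buf.position() + (int) length + 2);
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        for (Header h : dump(Files.readAllBytes(Paths.get(args[0])))) {
            System.out.println(h);
        }
    }
}
```

One observation on the listing above, offered as a hint rather than a diagnosis: 2573 is 0x0A0D, which is what the byte pair 0D 0A (CR LF) looks like when read as a little-endian unsigned short.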
Comment 4 Rob Tulloh 2012-01-01 15:04:12 UTC
I don't see any use or special handling of TYPE_TRIPLES in POI.

public static final int TYPE_TRIPLES = 0x0000;

Perhaps these need to be handled differently/specially? I don't have MSDN access so I cannot research this. Perhaps someone can give me a hand?
Comment 5 Jan Høydahl 2012-04-17 09:52:55 UTC
Apache committers may apply for free MSDN membership :)

I don't know this format at all, but here are two pointers that may be of help:
http://msdn.microsoft.com/en-us/library/cc815562.aspx
http://www.gnu-darwin.org/www001/ports-1.5a-CURRENT/mail/pop3vscan/work/pop3vscan-0.4/ripmime/tnef/tnef.h.html
Comment 6 Fred Stoki 2013-03-15 11:48:09 UTC
I am also seeing the same exception. I can not provide a sample either. Sorry all.   

Using poi 3.8

Having to retype this ( may fat finger something ) 

Caused by: org.apache.poi.util.LittleEndian$BufferUnderrunException: buffer underrun
at org.apache.poi.util.LittleEndian.readShort(LittleEndian.java:786)
at org.apache.poi.hmef.attribute.TNEFAttribute.<init>(TNEFAttribute.java:62)
at org.apache.poi.hmef.attribute.TNEFAttribute.create(TNEFAttribute.java:76)
at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:74)
at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
at org.apache.poi.hmef.HMEFMessage.<init>(HMEFMessage.java:63)
Comment 7 Nick Burch 2013-03-15 12:23:39 UTC
Without a sample file that can be shared, it's going to be up to one of the people with a confidential file to investigate and work out the fix...
Comment 8 Fred Stoki 2013-03-15 14:01:10 UTC
I understand 

Here is another one I am seeing

Caused by: org.apache.poi.util.LittleEndian$BufferUnderrunException: buffer underrun
at org.apache.poi.util.LittleEndian.readInt(LittleEndian.java:723)
at org.apache.poi.hmef.attribute.TNEFAttribute.<init>(TNEFAttribute.java:47)
at org.apache.poi.hmef.attribute.TNEFAttribute.create(TNEFAttribute.java:76)
at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:74)
at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
at org.apache.poi.hmef.HMEFMessage.<init>(HMEFMessage.java:63)
Comment 9 Jeff Evans 2013-11-01 22:47:36 UTC
Created attachment 30994
Simple TNEF attachment

A TNEF encoded message with the body "This is the message body."
Comment 10 Jeff Evans 2013-11-01 22:48:22 UTC
Created attachment 30995
TNEF message with attachments

HTML body reads: "This is the body.  There are also two attachments.  The first is a 189KB PDF document with file named scion_tc_2007_maintenanceguide.pdf.  The second is a 119KB PNG image named Duke_Wave.png."

There are two embedded attachments in accordance with the body text.
Comment 11 Jeff Evans 2013-11-01 22:48:38 UTC
I can provide some sample data files that cause this error.

The first (winmail-simple.dat) is a simple message with no attachments, and an HTML body which has the text "This is the message body."

The other (winmail-with-attachments.dat) has a body, and the text in the body describes the two attachments: "This is the body.  There are also two attachments.  The first is a 189KB PDF document with file named scion_tc_2007_maintenanceguide.pdf.  The second is a 119KB PNG image named Duke_Wave.png."

Both of these .dat files can be opened using the OS X program "TNEF's Enough" and possibly others.  They can also be read via the Aspose Java Mail API.  They were generated through the Microsoft Office 365 Live website.

I am pasting my stack trace as follows (same for both files).  I am using jdk 1.7.0_25 on Mac OS X 10.9, and the version of POI I am using is 3.10-beta1.  But I have also tried versions 3.10-beta2, 3.9, and 3.8, all with the same result.

Please let me know if this helps, or if any additional info or test files would be useful.  

org.apache.poi.util.LittleEndian$BufferUnderrunException: buffer underrun
	at org.apache.poi.util.LittleEndian.readUShort(LittleEndian.java:804)
	at org.apache.poi.hmef.attribute.TNEFAttribute.create(TNEFAttribute.java:61)
	at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:74)
	at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
	at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
	at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
	at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
	at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
	at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
	at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
	at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
	at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
	at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
	at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
	at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
	at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
	at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
	at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
	at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
	at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
	at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
	at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
	at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
	at org.apache.poi.hmef.HMEFMessage.<init>(HMEFMessage.java:63)
	at com.gaggle.message.testing.tnef.TnefMessageTests.testPoi(TnefMessageTests.java:90)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
	at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
	at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
Comment 12 Nick Burch 2013-11-02 17:33:06 UTC
Thanks for the sample file Jeff!

If you have time, any chance you could step through the TNEF code in a debugger, and report back? Specifically, I'm interested to know if the sizes and IDs that the POI code reads look sensible? And in keeping with the POIFS chunk? Secondly, when processing the data, does it all look garbage, or is it fine for a bit then garbage then breaks?
Comment 13 Andreas Beeker 2013-11-02 17:36:56 UTC
In the sample file without the attachment, it first starts with a little-endian TNEF version attribute and then switches to a big-endian TNEF version entry, which seems quite strange. I'm just adapting the TNEFAttribute class for the malformed input; not sure if this makes sense, though.
Comment 14 Andreas Beeker 2013-11-02 19:08:07 UTC
sorry for the last confusing comment ... there is/was an error in my changes
Comment 15 Andreas Beeker 2013-11-02 21:41:23 UTC
Created attachment 31000
[PATCH] ignore trailing newlines of winmail.dat

The sample winmail files contained trailing newlines ... hopefully this is also the case with the other original findings
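
For readers without the patch at hand, the idea can be sketched as follows. This is an illustration under assumptions, not the attached patch or the committed POI change: the class and method names are hypothetical, level values 0x01 (message) and 0x02 (attachment) are taken from the TNEF format, and the attribute layout is assumed to be 2-byte ID, 2-byte type, 4-byte little-endian length, data, 2-byte checksum. CR and LF bytes found where a level byte is expected are skipped instead of being treated as the start of an attribute.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch for illustration; not the POI implementation.
final class TrailingNewlineSketch {
    static final int LEVEL_MESSAGE = 0x01;
    static final int LEVEL_ATTACHMENT = 0x02;
    static final int END_OF_STREAM = -1;

    /** Counts attributes, ignoring CR/LF bytes that appear where a level byte is expected. */
    static int countAttributes(InputStream in) throws IOException {
        skipExactly(in, 6); // 4-byte signature + 2-byte key
        int count = 0;
        while (true) {
            int level = in.read();
            switch (level) {
                case END_OF_STREAM:
                    return count;
                case '\r':
                case '\n':
                    continue; // stray newline, not a TNEF level: read the next byte
                case LEVEL_MESSAGE:
                case LEVEL_ATTACHMENT:
                    skipAttribute(in);
                    count++;
                    break;
                default:
                    throw new IllegalStateException("Unhandled level " + level);
            }
        }
    }

    /** Skips id(2) + type(2), reads the little-endian length(4), then skips data + checksum(2). */
    private static void skipAttribute(InputStream in) throws IOException {
        skipExactly(in, 4);
        long length = 0;
        for (int i = 0; i < 4; i++) {
            int b = in.read();
            if (b < 0) throw new EOFException("truncated attribute header");
            length |= (long) b << (8 * i);
        }
        skipExactly(in, length + 2);
    }

    /** Skips n bytes without buffering them, failing if the stream ends early. */
    private static void skipExactly(InputStream in, long n) throws IOException {
        while (n > 0) {
            long skipped = in.skip(n);
            if (skipped <= 0) {
                if (in.read() < 0) throw new EOFException("truncated stream");
                skipped = 1;
            }
            n -= skipped;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = {
            0x78, (byte) 0x9F, 0x3E, 0x22, 0x00, 0x00,              // signature + key
            0x01, 0x06, (byte) 0x90, 0x08, 0x00, 0x04, 0, 0, 0,     // level, id, type, length
            0x00, 0x00, 0x01, 0x00, 0x01, 0x00,                     // data + checksum
            0x0D, 0x0A                                              // trailing CR LF
        };
        System.out.println(countAttributes(new ByteArrayInputStream(sample)));
    }
}
```

Skipping rather than buffering the data also means a bogus length ends in an EOFException instead of a large allocation.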
Comment 16 Dominik Stadler 2013-11-03 12:44:28 UTC
Thanks for the patch, I have applied the changes as SVN rev r1538353 with a few minor modifications and some more tests.
Comment 17 Jeff Evans 2013-11-04 13:02:40 UTC
(In reply to Nick Burch from comment #12)
> Thanks for the sample file Jeff!
> 
> If you have time, any chance you could step through the TNEF code in a
> debugger, and report back? Specifically, I'm interested to know if the sizes
> and IDs that the POI code reads look sensible? And in keeping with the POIFS
> chunk? Secondly, when processing the data, does it all look garbage, or is
> it fine for a bit then garbage then breaks?

Sure, I can do that.  Can you give me a bit more detail as to what I'd be looking for?  I'm fairly new to dealing with TNEF, and my only tools so far are using other applications/libs to read the total files.  I haven't dealt much with the low level parsing.
Comment 18 Monish Gandhi 2014-06-25 09:18:34 UTC
Is this fix not rolled out in 3.10 release?
I still see the following stack trace; I don't have a message to share.

org.apache.poi.util.LittleEndian$BufferUnderrunException: buffer underrun
        at org.apache.poi.util.LittleEndian.readInt(LittleEndian.java:723)
        at org.apache.poi.hmef.HMEFMessage.<init>(HMEFMessage.java:51)
Comment 19 Nick Burch 2014-06-25 09:41:08 UTC
(In reply to Monish from comment #18)
> Is this fix not rolled out in 3.10 release?

The fix is in 3.10, and all the files attached to this bug now parse without error.

If you have another file which triggers a similar problem, please open a new bug and upload a sample file which shows it.