> We are seeing problems in Solr with Tika throwing exceptions. Sometimes we see OOM like this:
> {noformat}
> SEVERE: java.lang.OutOfMemoryError: Java heap space
> at org.apache.poi.hmef.attribute.TNEFAttribute.<init>(TNEFAttribute.java:50)
> at org.apache.poi.hmef.attribute.TNEFAttribute.create(TNEFAttribute.java:76)
> at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:74)
> at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
> at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
> at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
> at org.apache.poi.hmef.HMEFMessage.<init>(HMEFMessage.java:63)
> at org.apache.tika.parser.microsoft.TNEFParser.parse(TNEFParser.java:79)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
> at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:195)
> at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
> at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:244)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1478)
> at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:353)
> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:248)
> at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
> at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
> at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
> at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
> at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
> at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
> at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
> {noformat}
> Other times, we see errors like this one:
> {noformat}
> Caused by: org.apache.poi.util.LittleEndian$BufferUnderrunException: buffer underrun
> at org.apache.poi.util.LittleEndian.readUShort(LittleEndian.java:302)
> at org.apache.poi.hmef.attribute.TNEFAttribute.<init>(TNEFAttribute.java:53)
> at org.apache.poi.hmef.attribute.TNEFAttribute.create(TNEFAttribute.java:76)
> at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:74)
> at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
> at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
> at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
> at org.apache.poi.hmef.HMEFMessage.<init>(HMEFMessage.java:63)
> at org.apache.tika.parser.microsoft.TNEFParser.parse(TNEFParser.java:79)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> ... 26 more
> {noformat}
We are currently evaluating Solr and Tika using valid sample data. The Solr and Tika projects have both indicated this problem should be reported and fixed in POI.
See also the original ticket for information. https://issues.apache.org/jira/browse/TIKA-835
As you're not able to share the file, any chance you could debug through the creation of the attributes? It'd be useful to know which attribute this is in the list (the 1st, the 2nd, etc.), whether the lengths, types and IDs of the preceding attributes look sensible, and what the apparent length, ID and type of the failing attribute are.
Num,ID,Type,Len
36870,8,4
36871,6,8
2573,1152,512
1,0,977272843 (boom!)
{noformat}
Step completed: "thread=main", org.apache.poi.hmef.attribute.TNEFAttribute.create(), line=65 bci=10
65 if(id == TNEFProperty.ID_MAPIPROPERTIES.id ||
main[1] print id
id = 1
main[1] print type
type = 0
main[1] next
>
{noformat}
And then a bit later
{noformat}
Step completed: "thread=main", org.apache.poi.hmef.attribute.TNEFAttribute.<init>(), line=49 bci=15
49 property = TNEFProperty.getBest(id, type);
main[1] print length
length = 977272843
{noformat}
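The length printed above (977272843 bytes) is far larger than any plausible winmail.dat, which would be consistent with the OutOfMemoryError seen when the attribute data is allocated. As a minimal sketch of the kind of sanity check that could reject such a value before allocating, assuming the whole file is already held in a byte array: `AttributeLengthCheck` and `readAttributeData` are hypothetical names used for illustration only and are not part of POI.

```java
import java.util.Arrays;

final class AttributeLengthCheck {
    // Hypothetical helper, not part of POI: returns the attribute payload only if
    // the declared length fits inside the bytes that are actually available.
    static byte[] readAttributeData(byte[] buffer, int offset, int declaredLength) {
        if (offset < 0 || offset > buffer.length) {
            throw new IllegalStateException("Offset " + offset + " is outside the buffer");
        }
        int remaining = buffer.length - offset;
        if (declaredLength < 0 || declaredLength > remaining) {
            throw new IllegalStateException(
                "Implausible attribute length " + declaredLength
                + " with only " + remaining + " bytes remaining");
        }
        return Arrays.copyOfRange(buffer, offset, offset + declaredLength);
    }

    public static void main(String[] args) {
        byte[] buffer = new byte[16];
        // A length that fits is copied out as usual.
        System.out.println(readAttributeData(buffer, 4, 8).length);
        // The length seen in the debugger is rejected instead of being allocated.
        try {
            readAttributeData(buffer, 4, 977272843);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With a check like this, a corrupt or misaligned length would surface as a clear parse error rather than as heap exhaustion.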
I don't see any use or special handling of TYPE_TRIPLES in POI.

public static final int TYPE_TRIPLES = 0x0000;

Perhaps these need to be handled differently or specially? I don't have MSDN access so I cannot research this. Perhaps someone can give me a hand?
Apache committers may apply for free MSDN membership :) I don't know this format at all, but here are two pointers that may be of help: http://msdn.microsoft.com/en-us/library/cc815562.aspx http://www.gnu-darwin.org/www001/ports-1.5a-CURRENT/mail/pop3vscan/work/pop3vscan-0.4/ripmime/tnef/tnef.h.html
I am also seeing the same exception. I cannot provide a sample either. Sorry all. Using POI 3.8. I am having to retype this (may fat-finger something):

Caused by: org.apache.poi.util.LittleEndian$BufferUnderrunException: buffer underrun
    at org.apache.poi.util.LittleEndian.readShort(LittleEndian.java:786)
    at org.apache.poi.hmef.attribute.TNEFAttribute.<init>(TNEFAttribute.java:62)
    at org.apache.poi.hmef.attribute.TNEFAttribute.create(TNEFAttribute.java:76)
    at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:74)
    at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
    at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
    at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
    at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
    at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
    at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
    at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
    at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
    at org.apache.poi.hmef.HMEFMessage.<init>(HMEFMessage.java:63)
Without a sample file that can be shared, it's going to be up to one of the people with a confidential file to investigate and work out the fix...
I understand. Here is another one I am seeing:

Caused by: org.apache.poi.util.LittleEndian$BufferUnderrunException: buffer underrun
    at org.apache.poi.util.LittleEndian.readInt(LittleEndian.java:723)
    at org.apache.poi.hmef.attribute.TNEFAttribute.<init>(TNEFAttribute.java:47)
    at org.apache.poi.hmef.attribute.TNEFAttribute.create(TNEFAttribute.java:76)
    at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:74)
    at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
    at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
    at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
    at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
    at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
    at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
    at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
    at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
    at org.apache.poi.hmef.HMEFMessage.<init>(HMEFMessage.java:63)
Created attachment 30994 [details] Simple TNEF attachment A TNEF encoded message with the body "This is the message body."
Created attachment 30995 [details] TNEF message with attachments HTML body reads: "This is the body. There are also two attachments. The first is a 189KB PDF document with file named scion_tc_2007_maintenanceguide.pdf. The second is a 119KB PNG image named Duke_Wave.png." There are two embedded attachments in accordance with the body text.
I can provide some sample data files that cause this error. The first (winmail-simple.dat) is a simple message with no attachments, and an HTML body which has the text "This is the message body." The other (winmail-with-attachments.dat) has a body, and the text in the body describes the two attachments: "This is the body. There are also two attachments. The first is a 189KB PDF document with file named scion_tc_2007_maintenanceguide.pdf. The second is a 119KB PNG image named Duke_Wave.png." Both of these .dat files can be opened using the OS X program "TNEF's Enough" and possibly others. They can also be read via the Aspose Java Mail API. They were generated through the Microsoft Office 365 Live website. I am pasting my stack trace as follows (same for both files). I am using jdk 1.7.0_25 on Mac OS X 10.9, and the version of POI I am using is 3.10-beta1. But I have also tried versions 3.10-beta2, 3.9, and 3.8, all with the same result. Please let me know if this helps, or if any additional info or test files would be useful. 
org.apache.poi.util.LittleEndian$BufferUnderrunException: buffer underrun at org.apache.poi.util.LittleEndian.readUShort(LittleEndian.java:804) at org.apache.poi.hmef.attribute.TNEFAttribute.create(TNEFAttribute.java:61) at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:74) at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98) at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98) at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98) at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98) at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98) at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98) at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98) at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98) at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98) at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98) at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98) at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98) at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98) at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98) at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98) at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98) at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98) at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98) at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98) at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98) at org.apache.poi.hmef.HMEFMessage.<init>(HMEFMessage.java:63) at com.gaggle.message.testing.tnef.TnefMessageTests.testPoi(TnefMessageTests.java:90) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) 
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222) at org.junit.runners.ParentRunner.run(ParentRunner.java:300) at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50) at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
Thanks for the sample file Jeff! If you have time, any chance you could step through the TNEF code in a debugger, and report back? Specifically, I'm interested to know if the sizes and IDs that the POI code reads look sensible? And in keeping with the POIFS chunk? Secondly, when processing the data, does it all look garbage, or is it fine for a bit then garbage then breaks?
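For anyone who would rather not step through a debugger, the same information could be gathered with a small standalone dump. The sketch below assumes the usual TNEF layout (a 4-byte signature and a 2-byte key, then for each attribute a 1-byte level, 2-byte ID, 2-byte type, 4-byte length, the data, and a 2-byte checksum, all little-endian). `TnefHeaderDump` is a hypothetical name, it does not use POI, and it does not verify the signature or checksums.

```java
import java.util.ArrayList;
import java.util.List;

final class TnefHeaderDump {
    // Reads an unsigned 16-bit little-endian value.
    static int u16(byte[] b, int off) {
        return (b[off] & 0xFF) | ((b[off + 1] & 0xFF) << 8);
    }

    // Reads a signed 32-bit little-endian value.
    static int s32(byte[] b, int off) {
        return u16(b, off) | (u16(b, off + 2) << 16);
    }

    // Returns one "Num,Level,ID,Type,Len" row per attribute header, and stops
    // with a note as soon as a declared length no longer fits in the file.
    static List<String> dump(byte[] tnef) {
        List<String> rows = new ArrayList<>();
        int pos = 6; // skip the 4-byte signature and the 2-byte key
        int num = 1;
        while (pos + 9 <= tnef.length) {
            int level = tnef[pos] & 0xFF;
            int id = u16(tnef, pos + 1);
            int type = u16(tnef, pos + 3);
            int length = s32(tnef, pos + 5);
            rows.add(num + "," + level + "," + id + "," + type + "," + length);
            long next = (long) pos + 9 + length + 2; // header + data + checksum
            if (length < 0 || next > tnef.length) {
                rows.add("stopped: attribute " + num + " does not fit in the remaining bytes");
                break;
            }
            pos = (int) next;
            num++;
        }
        return rows;
    }

    public static void main(String[] args) throws java.io.IOException {
        byte[] data = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(args[0]));
        for (String row : dump(data)) {
            System.out.println(row);
        }
    }
}
```

If the rows look sensible for a while and then turn into very large lengths or unknown IDs, that points at the offset where parsing goes out of step with the file.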
In the sample file without the attachment, it first starts with a little-endian tnfversion attribute and then switches to a big-endian tnfversion entry, which seems quite strange. I'm just adapting the TNEFAttribute class for the malformed input; not sure if this makes sense though.
sorry for the last confusing comment ... there is/was an error in my changes
Created attachment 31000 [details] [PATCH] ignore trailing newlines of winmail.dat The sample winmail files contained trailing newlines ... hopefully this is also the case with the other original findings
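To illustrate the idea (this is a sketch, not the attached patch): when the parser reads the byte that should be the next attribute's level, a carriage return or line feed at that position could be treated the same as end of input rather than as an error. The class and method names here are hypothetical. The values 0x01 and 0x02 are the TNEF message and attachment levels, and -1 is what `InputStream.read()` returns at end of stream.

```java
final class TnefLevelCheck {
    static final int LEVEL_MESSAGE = 0x01;
    static final int LEVEL_ATTACHMENT = 0x02;
    static final int END_OF_STREAM = -1;

    // Returns true if the byte starts another attribute, false if the input has
    // ended (including trailing newlines), and fails on anything else.
    static boolean hasNextAttribute(int levelByte) {
        switch (levelByte) {
            case LEVEL_MESSAGE:
            case LEVEL_ATTACHMENT:
                return true;
            case '\r':
            case '\n':
            case END_OF_STREAM:
                return false;
            default:
                throw new IllegalStateException("Unhandled level " + levelByte);
        }
    }

    public static void main(String[] args) {
        System.out.println(hasNextAttribute(0x01)); // another attribute follows
        System.out.println(hasNextAttribute('\n')); // trailing newline ends parsing
    }
}
```

Checking only at an attribute boundary matters here: stripping CR/LF bytes from the end of the file up front could remove bytes that belong to the final checksum.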
Thanks for the patch, I have applied the changes as SVN rev r1538353 with a few minor modifications and some more tests.
(In reply to Nick Burch from comment #12)
> Thanks for the sample file Jeff!
>
> If you have time, any chance you could step through the TNEF code in a
> debugger, and report back? Specifically, I'm interested to know if the sizes
> and IDs that the POI code reads look sensible? And in keeping with the POIFS
> chunk? Secondly, when processing the data, does it all look garbage, or is
> it fine for a bit then garbage then breaks?

Sure, I can do that. Can you give me a bit more detail as to what I'd be looking for? I'm fairly new to dealing with TNEF, and my only tools so far are using other applications/libs to read the total files. I haven't dealt much with the low level parsing.
Is this fix not rolled out in 3.10 release? I still see the following stack trace, and I don't have a message to share:

org.apache.poi.util.LittleEndian$BufferUnderrunException: buffer underrun
    at org.apache.poi.util.LittleEndian.readInt(LittleEndian.java:723)
    at org.apache.poi.hmef.HMEFMessage.<init>(HMEFMessage.java:51)
(In reply to Monish from comment #18)
> Is this fix not rolled out in 3.10 release?

The fix is in 3.10, and all the files attached to this bug now parse without error.

If you have another file which triggers a similar problem, please open a new bug, and upload a sample file which shows it.