The embedded OLE objects in this document (http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx) are extracted without a problem in 3.10-beta2. However, I'm getting the following stacktrace with 3.10-FINAL:

Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
    at java.lang.String.checkBounds(String.java:371)
    at java.lang.String.<init>(String.java:415)
    at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
    at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)
    at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:91)
    at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:63)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE(AbstractOOXMLExtractor.java:250)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:115)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)

I haven't had a chance to confirm, but given the release dates and the modifications to the header parsing, r1531623 ("Bugzilla 55578 - Support embedding OLE1.0 packages in HSSF", http://svn.apache.org/viewvc/poi/trunk/src/java/org/apache/poi/poifs/filesystem/Ole10Native.java?annotate=1531623) may be the cause.
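To illustrate the kind of failure the stacktrace suggests (a large negative value reaching the String constructor as a length), here is a minimal standalone sketch. It is not POI's code: the class `LengthPrefixSketch`, the method `readLengthPrefixed`, and the sample bytes are all hypothetical, chosen only so that the four bytes at offset 7 decode to the same negative number as in the exception message. It assumes a 4-byte little-endian length field followed by single-byte characters, and shows how validating the length before building the string turns the failure into a descriptive error.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class LengthPrefixSketch {

    /**
     * Reads a string preceded by a 4-byte little-endian length field.
     * Hypothetical helper for illustration; not part of POI.
     */
    public static String readLengthPrefixed(byte[] data, int offset) {
        if (offset < 0 || data.length - offset < 4) {
            throw new IllegalArgumentException("No room for a length field at offset " + offset);
        }
        int length = ByteBuffer.wrap(data, offset, 4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt();
        int start = offset + 4;
        // Validate before constructing the String: if the reader is positioned
        // on the wrong bytes, the "length" can be negative or far too large.
        if (length < 0 || length > data.length - start) {
            throw new IllegalArgumentException(
                    "Implausible length " + length + " at offset " + offset);
        }
        return new String(data, start, length, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Made-up sample: a valid record ("abc") followed by four bytes that
        // decode to a negative int when misread as a length field.
        byte[] data = {3, 0, 0, 0, 'a', 'b', 'c', (byte) 0xD2, 0x62, 0x00, (byte) 0xEA};

        System.out.println(readLengthPrefixed(data, 0));
        try {
            readLengthPrefixed(data, 7);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Whether the regression in 3.10-FINAL is actually an offset problem of this kind is unconfirmed; the sketch only shows why a misaligned header read would produce this exception.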
Do you know what the license is on that test file? i.e. whether we can add it as a test document or not?
Also, any chance you could retest with a recent nightly build of POI? The Ole10Native line number in your stacktrace doesn't match trunk, so it is possible the problem has already been fixed.
Unfortunately, I don't believe that we can add it under the Apache License 2.0. According to Simson Garfinkel, the creator of govdocs1, the documents may be hosted and redistributed. He said so in personal communication, and this site (http://digitalcorpora.org/corpora/files) states: "For these reasons, we have created and released a corpus of 1 million documents that are freely available for research and may be (to the best of our knowledge) freely redistributed." I think the ASF's requirements are more stringent (release of copyright, etc.).
Note to self: check trunk, then submit issue; check trunk, then submit issue.

Trunk seems to work. The following code fails in poi-3_10_FINAL (with the appropriate stacktrace), but works in poi-3_10_BETA2 and works in trunk.

    public void testOleNativeGovdocs1() throws IOException, Ole10NativeException {
        for (int i = 3; i <= 5; i++) {
            String fName = "oleObject" + i + "-govdocs1-268620.bin";
            InputStream is = dataSamples.openResourceAsStream(fName);
            POIFSFileSystem fs = new POIFSFileSystem(is);
            Ole10Native ole = Ole10Native.createFromEmbeddedOleObject(fs);
            is.close();
        }
    }