Bug 56792 - Regression in Ole10Native.createFromEmbeddedOleObject leading to IOOBE since 3.10-beta2
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: POIFS
Version: 3.10-FINAL
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
Reported: 2014-07-30 11:07 UTC by Tim Allison
Modified: 2014-07-30 13:01 UTC
Description Tim Allison 2014-07-30 11:07:20 UTC
The embedded OLE objects in this document (http://digitalcorpora.org/corp/nps/files/govdocs1/268/268620.pptx) are extracted without a problem in 3.10-beta2. However, I'm getting the following stacktrace with 3.10-FINAL:


Stacktrace:
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -369073454
	at java.lang.String.checkBounds(String.java:371)
	at java.lang.String.<init>(String.java:415)
	at org.apache.poi.util.StringUtil.getFromCompressedUnicode(StringUtil.java:114)
	at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:163)
	at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:91)
	at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:63)
	at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedOLE(AbstractOOXMLExtractor.java:250)
	at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
	at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:115)
	at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
	at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)


I haven't had a chance to confirm, but given the release dates and the modifications to the header parsing, r1531623 ("Bugzilla 55578 - Support embedding OLE1.0 packages in HSSF", http://svn.apache.org/viewvc/poi/trunk/src/java/org/apache/poi/poifs/filesystem/Ole10Native.java?annotate=1531623) may be the cause.
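
For illustration only, here is a minimal sketch of how a misread length field can produce the exception in the stack trace above. It is not POI's code: the class and both helper names are hypothetical, and it assumes the length is read as a signed 32-bit little-endian integer. The four leading bytes were chosen so that they decode to the value shown in the trace; they are not taken from the actual file.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class NegativeLengthDemo {

    // Hypothetical helper: read a signed 32-bit little-endian int at `offset`.
    static int readIntLE(byte[] data, int offset) {
        return ByteBuffer.wrap(data, offset, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    // Hypothetical helper: decode `len` bytes as ISO-8859-1 without validating `len`.
    static String decodeUnchecked(byte[] data, int offset, int len) {
        return new String(data, offset, len, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Illustrative bytes only: D2 62 00 EA is 0xEA0062D2 in little-endian order.
        byte[] data = {(byte) 0xD2, 0x62, 0x00, (byte) 0xEA, 'a', 'b', 'c'};

        int len = readIntLE(data, 0);
        System.out.println("length field = " + len); // -369073454

        try {
            // A negative length is rejected by the String constructor's bounds check.
            decodeUnchecked(data, 4, len);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("decode failed: " + e);
        }
    }
}
```

If the header parser starts reading a length from the wrong offset, the four bytes it picks up can have the high bit set, which gives a large negative count like the one reported here.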
Comment 1 Nick Burch 2014-07-30 11:52:53 UTC
Do you know what the license is on that test file? i.e. whether we can add it as a test document or not?
Comment 2 Nick Burch 2014-07-30 11:56:15 UTC
Also, any chance you could retest with a recent nightly build of POI? The Ole10Native line in your stack trace doesn't match trunk, so it is possible the problem has already been fixed.
Comment 3 Tim Allison 2014-07-30 11:57:45 UTC
Unfortunately, I don't believe that we can add it under the Apache License 2.0. According to Simson Garfinkel, the creator of govdocs1, the documents are hostable and redistributable. He has said so in personal communication, and this site (http://digitalcorpora.org/corpora/files) states:
"For these reasons, we have created and released a corpus of 1 million documents that are freely available for research and may be (to the best of our knowledge) freely redistributed."

I think the ASF's requirements are more stringent (release of copyright, etc.).
Comment 4 Tim Allison 2014-07-30 13:01:39 UTC
Note to self: check trunk, then submit issue; check trunk, then submit issue.  

Trunk seems to work.

The following code fails in poi-3_10_FINAL (with the appropriate stack trace), but works in both poi-3_10_BETA2 and trunk.

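    // Parses each embedded OLE sample taken from govdocs1 document 268620.
    public void testOleNativeGovdocs1() throws IOException, Ole10NativeException {
        for (int i = 3; i <= 5; i++) {
            String fName = "oleObject" + i + "-govdocs1-268620.bin";
            InputStream is = dataSamples.openResourceAsStream(fName);
            try {
                POIFSFileSystem fs = new POIFSFileSystem(is);
                // This call is where 3.10-FINAL throws.
                Ole10Native ole = Ole10Native.createFromEmbeddedOleObject(fs);
                // Assumes a JUnit TestCase, which provides assertNotNull.
                assertNotNull(ole);
            } finally {
                // Close the stream even if parsing fails.
                is.close();
            }
        }
    }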