Bug 59058 - OOM when parsing docx after OPCPackage.open with File but not with InputStream (TIKA-1866)
Status: RESOLVED DUPLICATE of bug 57031
Alias: None
Product: POI
Classification: Unclassified
Component: OPC
Version: unspecified
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-02-24 02:41 UTC by Tim Allison
Modified: 2016-02-24 18:21 UTC

Attachments
Triggering file submitted by Shawn Johnson on TIKA-1866 (205.33 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2016-02-24 02:48 UTC, Tim Allison

Description Tim Allison 2016-02-24 02:41:18 UTC
Shawn Johnson recently posted a smallish docx file on TIKA-1866 that causes an OOM.

WARNING: trying to parse this file in Intellij caused a system crash and required a hard reboot on Windows.

I can reproduce this in pure POI with the following:

        OPCPackage pkg = OPCPackage.open(path);
        System.out.println("before creating extractor");
        POIXMLTextExtractor poiExtractor = ExtractorFactory.createExtractor(pkg);
        System.out.println("finished creating extractor");

The OOM happens during createExtractor, and I never hit the second println.

However, there is no OOM with:

        OPCPackage pkg = OPCPackage.open(Files.newInputStream(path));
        System.out.println("before creating extractor");
        POIXMLTextExtractor poiExtractor = ExtractorFactory.createExtractor(pkg);
        System.out.println("finished creating extractor");

Any idea what might cause the different treatment?

Java 1.8.0_72 on Windows.
Comment 1 Tim Allison 2016-02-24 02:48:14 UTC
Created attachment 33585
Triggering file submitted by Shawn Johnson on TIKA-1866

When I re-save this file, it no longer triggers the problem.
Comment 2 Nick Burch 2016-02-24 12:09:31 UTC
Other than all the contents of the zip having the date "1980-01-01 00:00", I can't see anything immediately wrong.

Loading the sample file as an OPCPackage from an InputStream on my machine seems to use ~13 MB of memory (from a couple of tests). Loading it from a File uses ~3 MB.

So, nothing obvious springs to mind. Would someone be able to dig in and find out where the memory is going, and how it differs between the two cases?
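
For anyone digging in, a rough way to compare the two cases is to sample used heap before and after each load. The following is only a sketch using the JDK: the class name HeapDelta and its methods are hypothetical, the byte array is a stand-in for the real File-based or InputStream-based open, and System.gc() is just a hint, so the numbers are approximate. A heap dump or profiler would show where the memory actually goes.

```java
import java.util.function.Supplier;

class HeapDelta {
    // Returns the currently used heap in bytes after a best-effort GC hint.
    static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        System.gc(); // only a hint; the JVM may ignore it
        return rt.totalMemory() - rt.freeMemory();
    }

    // Runs the loader, keeps its result reachable, and returns the change in used heap.
    static <T> long measure(Supplier<T> loader) {
        long before = usedHeap();
        T result = loader.get();
        long after = usedHeap();
        // Use the result after the second sample so it cannot be collected early.
        if (result == null) {
            System.out.println("loader returned null");
        }
        return after - before;
    }

    public static void main(String[] args) {
        // Stand-in workload (assumption): swap in the package-opening call under test.
        long delta = measure(() -> new byte[8 * 1024 * 1024]);
        System.out.printf("approx. heap delta: %.1f MB%n", delta / (1024.0 * 1024.0));
    }
}
```

Running it once per loading strategy, each in a fresh JVM, keeps the two measurements from influencing each other.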
Comment 3 Tim Allison 2016-02-24 12:32:49 UTC
Interesting.  Thank you for taking a look!  You're not seeing the OOM, then...  What version of Java?  Windows...right?

In addition to Windows with:

java version "1.8.0_72"
Java(TM) SE Runtime Environment (build 1.8.0_72-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.72-b15, mixed mode)


I'm also getting the OOM with Tika (at least) in RHEL with:

java version "1.7.0_75"
OpenJDK Runtime Environment (rhel-2.5.4.0.el6_6-x86_64 u75-b13)
OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)
Comment 4 Nick Burch 2016-02-24 12:36:10 UTC
I'm not seeing the OOM on opening the package, only on extraction. Opening the package isn't taking much memory. JVM is java version "1.7.0_95" on Ubuntu
Comment 5 Tim Allison 2016-02-24 13:02:11 UTC
Of course, sorry.  Thank you.

This looks very similar to 57031.  However, with the 57031.docx file, I'm getting an OOM while parsing whether I open the package with a File or with an InputStream.
Comment 6 Tim Allison 2016-02-24 13:31:38 UTC
I reused Dominik's test on 57031.

If we use xerces instead of piccolo, we don't appear to have a problem with parsing either 57031 or 59058.

This leads to an OOM for both files:
        ZipFile zf = new ZipFile(path0.toAbsolutePath().toString());
        ZipEntry entry = zf.getEntry("word/document.xml");
        DocumentDocument document = DocumentDocument.Factory.parse(zf.getInputStream(entry));
        assertNotNull(document);

This works for both files:
        ZipFile zf = new ZipFile(path1.toAbsolutePath().toString());
        ZipEntry entry = zf.getEntry("word/document.xml");
        XMLInputFactory xmlif = XMLInputFactory.newInstance();
        XMLStreamReader reader = xmlif.createXMLStreamReader(zf.getInputStream(entry));
        DocumentDocument document = DocumentDocument.Factory.parse(reader);
        assertNotNull(document);
        zf.close();
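
The working variant above hands the zip entry to a StAX XMLStreamReader rather than passing the raw InputStream to the XMLBeans factory. To try the StAX half in isolation, without POI or XMLBeans on the classpath, something like the following might do. It is a sketch under assumptions: the class StaxZipEntryDemo and its helpers are hypothetical, the zip is built in memory as a stand-in for a .docx, and it reads through ZipInputStream instead of ZipFile so that no temporary file is needed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxZipEntryDemo {
    // Builds a tiny in-memory zip holding one XML entry (stand-in for a .docx).
    static byte[] buildZip(String entryName, String xml) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            zos.putNextEntry(new ZipEntry(entryName));
            zos.write(xml.getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        return bytes.toByteArray();
    }

    // Streams the named entry through a StAX reader and counts start elements.
    // Returns -1 if the entry is not present in the zip.
    static int countElements(byte[] zip, String entryName)
            throws IOException, XMLStreamException {
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zip))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (!entry.getName().equals(entryName)) {
                    continue;
                }
                XMLStreamReader reader =
                        XMLInputFactory.newInstance().createXMLStreamReader(zis);
                int count = 0;
                try {
                    while (reader.hasNext()) {
                        if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                            count++;
                        }
                    }
                } finally {
                    reader.close(); // releases the reader, not the underlying stream
                }
                return count;
            }
        }
        return -1;
    }

    public static void main(String[] args) throws Exception {
        byte[] zip = buildZip("word/document.xml", "<doc><p>a</p><p>b</p></doc>");
        System.out.println(countElements(zip, "word/document.xml"));
    }
}
```

Pointing countElements at the bytes of a real file is a quick way to check that the default StAX reader walks word/document.xml without exhausting the heap.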
Comment 7 Tim Allison 2016-02-24 13:49:06 UTC
In XWPFDocument's onDocumentRead(), if we change:

            DocumentDocument doc = DocumentDocument.Factory.parse(getPackagePart().getInputStream(), DEFAULT_XML_OPTIONS);


to:

            XMLInputFactory xmlif = XMLInputFactory.newInstance();
            XMLStreamReader reader = xmlif.createXMLStreamReader(getPackagePart().getInputStream());

            DocumentDocument doc = DocumentDocument.Factory.parse(reader, DEFAULT_XML_OPTIONS);

We can parse both files.

This change is on the periphery of my competence.  Any problems with this?  How can we require xerces via .newInstance()/newFactory()?
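
As a starting point for that question, it may help to see which StAX implementation newInstance() resolves to on a given classpath. The sketch below uses only the JDK; the class name WhichStaxFactory is hypothetical. The lookup consults the javax.xml.stream.XMLInputFactory system property, so setting that property to a factory class name is one way to pin an implementation. Whether a particular Xerces distribution provides such a factory class is not established here and would need checking.

```java
import javax.xml.stream.XMLInputFactory;

class WhichStaxFactory {
    // Returns the class name of the StAX factory the default lookup resolves to.
    static String implementationName() {
        return XMLInputFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        // If set, this system property takes part in the factory lookup.
        System.out.println("javax.xml.stream.XMLInputFactory = "
                + System.getProperty("javax.xml.stream.XMLInputFactory"));
        System.out.println("resolved factory = " + implementationName());
    }
}
```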
Comment 8 Tim Allison 2016-02-24 18:21:20 UTC

*** This bug has been marked as a duplicate of bug 57031 ***