|Summary:||OOM when parsing docx after OPCPackage.open with File but not with InputStream (TIKA-1866)|
|Product:||POI||Reporter:||Tim Allison <tallison>|
|Component:||OPC||Assignee:||POI Developers List <dev>|
|Attachments:||Triggering file submitted by Shawn Johnson on TIKA-1866|
Description Tim Allison 2016-02-24 02:41:18 UTC
Shawn Johnson recently posted a smallish docx file on TIKA-1866 that causes an OOM. WARNING: trying to parse this file in IntelliJ caused a system crash and required a hard reboot on Windows.

I can reproduce this in pure POI with the following:

    OPCPackage pkg = OPCPackage.open(path);
    System.out.println("before creating extractor");
    POIXMLTextExtractor poiExtractor = ExtractorFactory.createExtractor(pkg);
    System.out.println("finished creating extractor");

The OOM happens during createExtractor, and I never hit the second println.

However, there is no OOM with:

    OPCPackage pkg = OPCPackage.open(Files.newInputStream(path));
    System.out.println("before creating extractor");
    POIXMLTextExtractor poiExtractor = ExtractorFactory.createExtractor(pkg);
    System.out.println("finished creating extractor");

Any idea what might cause the different treatment? Java 1.8.0_72 on Windows.
Comment 1 Tim Allison 2016-02-24 02:48:14 UTC
Created attachment 33585: Triggering file submitted by Shawn Johnson on TIKA-1866

When I re-save this file, it no longer triggers the problem.
Comment 2 Nick Burch 2016-02-24 12:09:31 UTC
Other than all the contents of the zip having the date "1980-01-01 00:00", I can't see anything immediately wrong.

Loading the sample file as an OPCPackage from an InputStream on my machine seems to use ~13 MB of memory (from a couple of tests). Loading it from a File uses ~3 MB.

So, nothing obvious springs to mind. Would someone be able to dig in and find out where the memory is going, and how it differs between the two cases?
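For anyone wanting to take a similar rough measurement, here is a minimal sketch using only the JDK. The class and method names (HeapDelta, measure) are hypothetical, and the byte array in main is a stand-in workload; in a real test it would be replaced by the OPCPackage.open(...) call under comparison. System.gc() is only a hint to the JVM, so the numbers are approximate at best.

```java
import java.util.function.Supplier;

public class HeapDelta {
    /** Returns the used heap in bytes after requesting a GC (a hint, not a guarantee). */
    static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        System.gc();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Runs the loader, keeps its result reachable, and returns the approximate heap growth in bytes. */
    static <T> long measure(Supplier<T> loader) {
        long before = usedHeap();
        T loaded = loader.get();
        long after = usedHeap();
        // Use the result after the second reading so it is still reachable when we measure.
        if (loaded == null) {
            throw new IllegalStateException("loader returned null");
        }
        return after - before;
    }

    public static void main(String[] args) {
        // Stand-in workload (assumption): replace with the package-opening call being compared.
        long delta = measure(() -> new byte[8 * 1024 * 1024]);
        System.out.println("approx. heap growth: " + (delta / (1024 * 1024)) + " MB");
    }
}
```

To see where the memory actually goes at the point of failure, running the JVM with -XX:+HeapDumpOnOutOfMemoryError and opening the resulting dump in a heap analyzer is probably more informative than a before/after delta.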
Comment 3 Tim Allison 2016-02-24 12:32:49 UTC
Interesting. Thank you for taking a look! You're not seeing the OOM, then... What version of Java? Windows...right?

In addition to Windows with:

    java version "1.8.0_72"
    Java(TM) SE Runtime Environment (build 1.8.0_72-b15)
    Java HotSpot(TM) 64-Bit Server VM (build 25.72-b15, mixed mode)

I'm also getting the OOM with Tika (at least) in RHEL with:

    java version "1.7.0_75"
    OpenJDK Runtime Environment (rhel-188.8.131.52.el6_6-x86_64 u75-b13)
    OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)
Comment 4 Nick Burch 2016-02-24 12:36:10 UTC
I'm not seeing the OOM on opening the package, only on extraction. Opening the package isn't taking much memory. The JVM is java version "1.7.0_95" on Ubuntu.
Comment 5 Tim Allison 2016-02-24 13:02:11 UTC
Of course, sorry. Thank you. This looks very similar to 57031. However, with the 57031.docx file, I'm getting an OOM while parsing whether I open the package with a File or with an InputStream.
Comment 6 Tim Allison 2016-02-24 13:31:38 UTC
I reused Dominik's test on 57031. If we use Xerces instead of Piccolo, we don't appear to have a problem with parsing either 57031 or 59058.

This leads to an OOM for both files:

    ZipFile zf = new ZipFile(path0.toAbsolutePath().toString());
    ZipEntry entry = zf.getEntry("word/document.xml");
    DocumentDocument document = DocumentDocument.Factory.parse(zf.getInputStream(entry));
    assertNotNull(document);

This works for both files:

    ZipFile zf = new ZipFile(path1.toAbsolutePath().toString());
    ZipEntry entry = zf.getEntry("word/document.xml");
    XMLInputFactory xmlif = XMLInputFactory.newInstance();
    XMLStreamReader reader = xmlif.createXMLStreamReader(zf.getInputStream(entry));
    DocumentDocument document = DocumentDocument.Factory.parse(reader);
    assertNotNull(document);
    zf.close();
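To check the StAX half of the working case in isolation, without XMLBeans on the classpath, something like the following JDK-only sketch may help. It streams a single zip entry through whatever XMLInputFactory the JVM resolves and counts start elements, so it says nothing about how DocumentDocument.Factory.parse behaves. The class name StaxEntryProbe and the method countElements are hypothetical.

```java
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxEntryProbe {
    /** Streams one zip entry through StAX and returns the number of start elements seen. */
    static long countElements(String zipPath, String entryName) throws Exception {
        try (ZipFile zf = new ZipFile(zipPath)) {
            ZipEntry entry = zf.getEntry(entryName);
            if (entry == null) {
                throw new IllegalArgumentException("no such entry: " + entryName);
            }
            XMLInputFactory xmlif = XMLInputFactory.newInstance();
            try (InputStream in = zf.getInputStream(entry)) {
                XMLStreamReader reader = xmlif.createXMLStreamReader(in);
                long count = 0;
                // Pull events one at a time; nothing is held beyond the current event.
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        count++;
                    }
                }
                reader.close();
                return count;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("usage: StaxEntryProbe <file.docx>");
            return;
        }
        System.out.println(countElements(args[0], "word/document.xml"));
    }
}
```

If this completes quickly on both attachments with a small heap, that would be consistent with the problem lying in the non-StAX parse path rather than in the zip handling.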
Comment 7 Tim Allison 2016-02-24 13:49:06 UTC
In XWPFDocument's onDocumentRead(), if we change:

    DocumentDocument doc = DocumentDocument.Factory.parse(getPackagePart().getInputStream(), DEFAULT_XML_OPTIONS);

to:

    XMLInputFactory xmlif = XMLInputFactory.newInstance();
    XMLStreamReader reader = xmlif.createXMLStreamReader(getPackagePart().getInputStream());
    DocumentDocument doc = DocumentDocument.Factory.parse(reader, DEFAULT_XML_OPTIONS);

We can parse both files.

This change is on the periphery of my competence. Any problems with this? How can we require xerces via .newInstance()/newFactory()?
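On the last question, a first step might be to confirm which implementation newInstance() actually hands back in a given environment. As far as I understand the JAXP lookup, it consults the javax.xml.stream.XMLInputFactory system property, then configuration files and service providers on the classpath, and only then falls back to the JDK's built-in implementation, so the answer can differ between a bare POI test and a Tika deployment. The sketch below only reports what was resolved; the class name StaxFactoryCheck is hypothetical.

```java
import javax.xml.stream.XMLInputFactory;

public class StaxFactoryCheck {
    /** Returns the class name of the factory that the standard lookup resolves to. */
    static String implementationName() {
        return XMLInputFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        // The system property, if set, takes part in the lookup; null means it is not set.
        System.out.println("javax.xml.stream.XMLInputFactory = "
                + System.getProperty("javax.xml.stream.XMLInputFactory"));
        System.out.println("resolved implementation = " + implementationName());
    }
}
```

Java 9 and later also offer XMLInputFactory.newDefaultFactory(), which returns the JDK's built-in implementation regardless of what is on the classpath, but that is not available on the Java 7 and 8 runtimes mentioned in this report.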