Bug 61251 - Out of memory when opening the DOCX file
Summary: Out of memory when opening the DOCX file
Status: RESOLVED WORKSFORME
Alias: None
Product: POI
Classification: Unclassified
Component: XWPF
Version: unspecified
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
Reported: 2017-07-05 14:16 UTC by bsevryukov
Modified: 2017-07-05 20:56 UTC (History)

Attachments
XWPF OOM bug (210.97 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2017-07-05 14:16 UTC, bsevryukov

Description bsevryukov 2017-07-05 14:16:52 UTC
Created attachment 35097
XWPF OOM bug

Hi guys

I get an OutOfMemoryError (OOM) when opening one particular docx file.

POI Versions I tried:
3.15
3.16
3.17-beta1

The code is simple:

        InputStream in = new FileInputStream(new File(path));
        XWPFDocument document = new XWPFDocument(in);

Exception details:

java.lang.OutOfMemoryError: GC overhead limit exceeded
	at org.apache.xerces.dom.DeferredDocumentImpl.getNodeObject(Unknown Source)
	at org.apache.xerces.dom.DeferredElementNSImpl.synchronizeData(Unknown Source)
	at org.apache.xerces.dom.ElementNSImpl.getNamespaceURI(Unknown Source)
	at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1420)
	at org.apache.xmlbeans.impl.store.Locale.loadNodeChildren(Locale.java:1403)
	at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1445)
	at org.apache.xmlbeans.impl.store.Locale.loadNodeChildren(Locale.java:1403)
	at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1445)
	at org.apache.xmlbeans.impl.store.Locale.loadNodeChildren(Locale.java:1403)
	at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1445)
	at org.apache.xmlbeans.impl.store.Locale.loadNodeChildren(Locale.java:1403)
	at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1445)
	at org.apache.xmlbeans.impl.store.Locale.loadNodeChildren(Locale.java:1403)
	at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1445)
	at org.apache.xmlbeans.impl.store.Locale.loadNodeChildren(Locale.java:1403)
	at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1445)
	at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1385)
	at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1370)
	at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:370)
	at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:144)
	at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown Source)
	at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:152)
	at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
	at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:119)
Comment 1 Tim Allison 2017-07-05 14:29:24 UTC
Thank you for opening this issue.

Do you want to modify the document, or are you only interested in text/metadata extraction?  If extraction only, I added a SAX parser in Apache Tika, which is far more efficient than our DOM parser.
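
For the extraction-only case, the SAX idea can be illustrated with nothing but the JDK. This is a rough sketch, not Tika's actual parser: it assumes the main text lives in the word/document.xml part, collects only w:t runs, and ignores headers, footnotes, tables and everything else. The class name DocxTextSketch and the method extractText are made up for the illustration.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of POI or Tika.
class DocxTextSketch {
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Streams word/document.xml through SAX; returns w:t text, one line per w:p. */
    static String extractText(InputStream docx) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);

        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inText; // true while inside a w:t element

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                inText = W_NS.equals(uri) && "t".equals(local);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                inText = false;
                if (W_NS.equals(uri) && "p".equals(local)) {
                    text.append('\n'); // paragraph boundary
                }
            }
        };

        // A docx is a zip archive; scan entries until the main document part.
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry e; (e = zip.getNextEntry()) != null; ) {
                if ("word/document.xml".equals(e.getName())) {
                    factory.newSAXParser().parse(zip, handler);
                    break;
                }
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java DocxTextSketch <file.docx>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.print(extractText(in));
        }
    }
}
```

Because the handler only keeps the text it appends, memory use grows with the amount of extracted text rather than with the size of an in-memory XML tree. It is read-only, so it does not help if the document has to be modified.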
Comment 2 PJ Fanning 2017-07-05 14:31:31 UTC
I think POI uses more memory if you use the InputStream constructors.
Could you try creating an OPCPackage based on the File?
https://poi.apache.org/apidocs/org/apache/poi/openxml4j/opc/OPCPackage.html
And then create the XWPFDocument based on the OPCPackage?
Comment 3 Tim Allison 2017-07-05 14:37:32 UTC
Yes, agreed, PJ. With Tika and our usual DOM parser, which opens the OPCPackage from the file, I can parse this document without trouble even with -Xmx128m. I can reproduce the OOM with -Xmx64m.
Comment 4 bsevryukov 2017-07-05 14:44:44 UTC
Still getting OOM with

final OPCPackage in = OPCPackage.open(new File(path));
XWPFDocument document = new XWPFDocument(in);

Am I doing it right?
Comment 5 PJ Fanning 2017-07-05 14:50:28 UTC
@bsevryukov your code is correct.
As Tim highlights, it seems that you need to increase your Xmx setting.
The OPCPackage approach uses less memory, but XWPFDocument does not stream the document: it loads the whole document into memory, so the larger the docx, the more memory XWPFDocument needs.
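
To see which heap limit the JVM is actually running with before raising -Xmx, a small JDK-only check along these lines may help. It is only a sketch: the class HeapCheckSketch and its methods are invented for the example, and the 128 MiB default is just the figure from Comment 3, not a recommended value for any particular document. Runtime.maxMemory() can report slightly less than the -Xmx value on some JVMs, so treat the comparison as approximate.

```java
// Hypothetical helper for illustration; not part of POI.
class HeapCheckSketch {

    /** Maximum heap the JVM will try to use, in MiB, as reported by Runtime. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    /** True when the reported maximum heap is at least requiredMiB. */
    static boolean heapAtLeast(long requiredMiB) {
        return maxHeapMiB() >= requiredMiB;
    }

    public static void main(String[] args) {
        // Assumed threshold: 128 MiB, taken from the -Xmx128m run in Comment 3.
        long required = args.length > 0 ? Long.parseLong(args[0]) : 128L;
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
        if (!heapAtLeast(required)) {
            System.out.println("Heap is below " + required
                    + " MiB; consider starting the JVM with a larger -Xmx value.");
        }
    }
}
```

Running it with the same JVM options as the failing application (for example `java -Xmx64m HeapCheckSketch`) shows whether the limit in effect is the one you expect.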
Comment 6 bsevryukov 2017-07-05 14:55:12 UTC
Thanks PJ. Increasing Xmx size helped. 

Thank you guys for the fast response.
Comment 7 Dominik Stadler 2017-07-05 20:56:28 UTC
Resolved based on the latest comment.