Created attachment 35097: XWPF OOM bug

Hi guys,

I get an OOM when opening one particular docx file.

POI versions I tried: 3.15, 3.16, 3.17-beta1

The code is simple:

    InputStream in = new FileInputStream(new File(path));
    XWPFDocument document = new XWPFDocument(in);

Exception details:

    java.lang.OutOfMemoryError: GC overhead limit exceeded
        at org.apache.xerces.dom.DeferredDocumentImpl.getNodeObject(Unknown Source)
        at org.apache.xerces.dom.DeferredElementNSImpl.synchronizeData(Unknown Source)
        at org.apache.xerces.dom.ElementNSImpl.getNamespaceURI(Unknown Source)
        at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1420)
        at org.apache.xmlbeans.impl.store.Locale.loadNodeChildren(Locale.java:1403)
        at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1445)
        at org.apache.xmlbeans.impl.store.Locale.loadNodeChildren(Locale.java:1403)
        at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1445)
        at org.apache.xmlbeans.impl.store.Locale.loadNodeChildren(Locale.java:1403)
        at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1445)
        at org.apache.xmlbeans.impl.store.Locale.loadNodeChildren(Locale.java:1403)
        at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1445)
        at org.apache.xmlbeans.impl.store.Locale.loadNodeChildren(Locale.java:1403)
        at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1445)
        at org.apache.xmlbeans.impl.store.Locale.loadNodeChildren(Locale.java:1403)
        at org.apache.xmlbeans.impl.store.Locale.loadNode(Locale.java:1445)
        at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1385)
        at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1370)
        at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:370)
        at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:144)
        at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown Source)
        at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:152)
        at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
        at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:119)

Thank you for opening this issue. Do you want to modify the document, or are you only interested in text/metadata extraction? If extraction only, I added a SAX parser in Apache Tika, which is far more efficient than our DOM parser.
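To illustrate why a SAX approach needs less memory for extraction only, here is a minimal sketch that uses JDK classes alone. It is not Tika's parser and not a POI API: the names DocxTextSketch and extractText are made up for this example, and it assumes the main body is stored at word/document.xml and that only the text runs (w:t) and paragraph ends (w:p) matter. The handler keeps only the extracted text, never a tree of the whole document.

```java
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of POI or Tika.
class DocxTextSketch {
    // WordprocessingML main namespace (the "w:" prefix in document.xml).
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Streams the docx (a zip archive) and collects the text of w:t elements.
    static String extractText(InputStream docx) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);

        StringBuilder text = new StringBuilder();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                // Assumption: the main document part uses the conventional name.
                if (!"word/document.xml".equals(e.getName())) {
                    continue;
                }
                factory.newSAXParser().parse(new InputSource(zip), new DefaultHandler() {
                    private boolean inText;

                    @Override
                    public void startElement(String uri, String local, String qName,
                                             Attributes atts) {
                        if (W_NS.equals(uri) && "t".equals(local)) {
                            inText = true;
                        }
                    }

                    @Override
                    public void characters(char[] ch, int start, int length) {
                        if (inText) {
                            text.append(ch, start, length);
                        }
                    }

                    @Override
                    public void endElement(String uri, String local, String qName) {
                        if (!W_NS.equals(uri)) {
                            return;
                        }
                        if ("t".equals(local)) {
                            inText = false;
                        } else if ("p".equals(local)) {
                            text.append('\n'); // one line per paragraph
                        }
                    }
                });
                break; // the parser may close the stream, so stop after this entry
            }
        }
        return text.toString();
    }
}
```

A caller would pass a FileInputStream for the docx and print the returned string. This only reads text; modifying the document still requires the in-memory XWPFDocument model.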
I think POI uses more memory if you use the InputStream constructors. Could you try creating an OPCPackage based on the File? https://poi.apache.org/apidocs/org/apache/poi/openxml4j/opc/OPCPackage.html And then create the XWPFDocument based on the OPCPackage?
Yes, agreed, PJ. I'm not having any trouble parsing this with Tika and our usual DOM parser even with -Xmx128m, and we use the OPCPackage from the file. I am able to replicate the OOM with -Xmx64m.
Still getting OOM with

    final OPCPackage in = OPCPackage.open(new File(path));
    XWPFDocument document = new XWPFDocument(in);

Am I doing it right?
@bsevryukov your code is correct. As Tim highlights, it seems that you need to increase your Xmx setting. The approach using OPCPackage will use less memory but XWPFDocument is not based on streaming the document - so the larger the docx, the more memory XWPFDocument needs.
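Before raising -Xmx it can help to confirm which heap limit the JVM is actually running with. The following is a small sketch using only JDK calls; the class and method names (HeapCheckSketch, maxHeapMiB, describe) are invented for illustration. Keep in mind that the docx size on disk is a compressed size, so the in-memory model will generally need considerably more than that.

```java
import java.io.File;

// Hypothetical helper for illustration; not part of POI.
class HeapCheckSketch {
    // Maximum heap the JVM will attempt to use, in MiB (reflects -Xmx if set).
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    // One-line summary to log before opening the document.
    static String describe(File docx) {
        long fileKiB = docx.length() / 1024L; // 0 if the file does not exist
        return String.format("max heap: %d MiB, file size: %d KiB",
                maxHeapMiB(), fileKiB);
    }
}
```

Logging HeapCheckSketch.describe(new File(path)) right before constructing the XWPFDocument shows whether a changed setting such as -Xmx128m (the value that worked in the test above) has taken effect.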
Thanks PJ. Increasing Xmx size helped. Thank you guys for the fast response.
Fixed based on latest comment.