Bug 57463 - OutOfMemoryError while extracting text from DOCX files
Summary: OutOfMemoryError while extracting text from DOCX files
Status: NEEDINFO
Alias: None
Product: POI
Classification: Unclassified
Component: XWPF
Version: 3.11-FINAL
Hardware: PC All
Importance: P2 blocker
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-01-19 12:57 UTC by emergency.shower
Modified: 2015-02-08 14:44 UTC
Description emergency.shower 2015-01-19 12:57:12 UTC
Tika/POI text extraction for Lucene indexing quite often crashes server processes due to excessive memory requirements.

For example, the document https://www.eba.europa.eu/documents/10180/359626/Annex+XIV_Data+point+definition_COREP.docx is smaller than 10 MB but requires about 3.5 GB of main memory for text extraction.

Analysis of the heap dump shows that a very large number of XMLBeans objects are held in memory.

Class Name                                     |   Objects | Shallow Heap |  Retained Heap
-------------------------------------------------------------------------------------------
org.apache.xmlbeans.impl.store.Xobj$ElementXobj| 2.763.489 |  265.294.944 | >= 511.530.832
org.apache.xmlbeans.impl.store.Xobj$AttrXobj   | 2.797.953 |  246.219.864 | >= 246.233.144
-------------------------------------------------------------------------------------------


The stack extracted from the heap dump was:


"QuartzScheduler_Worker-3" daemon prio=5 tid=24 RUNNABLE
    at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseCdataLiteral(PiccoloLexer.java:3027)
    at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseQuotedTagValue(PiccoloLexer.java:2936)
    at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseAttributesNS(PiccoloLexer.java:1754)
    at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseOpenTagNS(PiccoloLexer.java:1521)
    at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseTagNS(PiccoloLexer.java:1362)
    at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXMLNS(PiccoloLexer.java:1293)
    at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXML(PiccoloLexer.java:1261)
    at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yylex(PiccoloLexer.java:4808)
       Local Variable: int[]#4831
       Local Variable: int[]#4833
       Local Variable: byte[]#1722
       Local Variable: org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer#1
    at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yylex(Piccolo.java:1290)
    at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yyparse(Piccolo.java:1400)
    at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:714)
       Local Variable: org.apache.xmlbeans.impl.piccolo.xml.Piccolo#1
    at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3454)
       Local Variable: org.xml.sax.InputSource#1
       Local Variable: org.apache.xmlbeans.impl.store.Locale$PiccoloSaxLoader#1
    at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1276)
    at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1263)
       Local Variable: org.apache.xmlbeans.impl.store.Locale#3
    at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
       Local Variable: org.apache.xmlbeans.impl.schema.SchemaTypeLoaderImpl#1
       Local Variable: org.apache.xmlbeans.impl.schema.SchemaTypeImpl#89
    at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(<unknown string>)
       Local Variable: java.util.zip.ZipFile$1#1
    at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:134)
    at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:159)
       Local Variable: java.util.HashMap#24338
       Local Variable: org.apache.poi.xwpf.usermodel.XWPFFactory#1
    at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:116)
       Local Variable: org.apache.poi.xwpf.usermodel.XWPFDocument#1
    at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:53)
       Local Variable: org.apache.poi.xwpf.extractor.XWPFWordExtractor#1
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:180)
       Local Variable: org.apache.poi.xwpf.usermodel.XWPFRelation[]#1
       Local Variable: org.apache.poi.xwpf.usermodel.XWPFRelation#4
       Local Variable: org.apache.poi.openxml4j.opc.PackageRelationshipCollection#1
       Local Variable: org.apache.poi.openxml4j.opc.ZipPackagePart#1
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:87)
       Local Variable: org.apache.poi.openxml4j.opc.ZipPackage#1
       Local Variable: java.util.Locale#1
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
       Local Variable: org.apache.tika.sax.TaggedContentHandler#1
       Local Variable: org.apache.tika.io.TemporaryResources#1
       Local Variable: org.apache.tika.parser.microsoft.ooxml.OOXMLParser#2
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
       Local Variable: org.apache.tika.sax.TaggedContentHandler#2
       Local Variable: org.apache.tika.parser.DefaultParser#2
       Local Variable: org.apache.tika.io.TemporaryResources#2
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
       Local Variable: org.apache.tika.io.TikaInputStream#1
       Local Variable: org.apache.tika.sax.SecureContentHandler#1
       Local Variable: org.apache.tika.parser.AutoDetectParser#2
       Local Variable: org.apache.tika.sax.BodyContentHandler#1
       Local Variable: org.apache.tika.io.TemporaryResources#3
       Local Variable: org.apache.tika.mime.MediaType#1153
    at org.apache.tika.Tika.parseToString(Tika.java:380)
       Local Variable: org.apache.tika.parser.ParseContext#1
       Local Variable: org.apache.tika.metadata.Metadata#1
       Local Variable: java.io.FileInputStream#3
       Local Variable: org.apache.tika.Tika#2
       Local Variable: org.apache.tika.sax.WriteOutContentHandler#1
    at ...
Comment 1 Nick Burch 2015-01-19 13:21:06 UTC
For XSSF, we have a low-level, SAX-plus-helper based way to extract text. It takes more work to code against, but it uses little memory.

So far, we haven't had any volunteers to work on one for XWPF / .docx. Because the basic structure of a .docx file is more flexible than that of an .xlsx file, I suspect it'll be a bit more work to do, but it shouldn't be impossible. Please head over to the dev list if you're interested in working on this!
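
As a rough illustration of the streaming idea for .docx, here is a minimal sketch that uses only the JDK's zip and StAX classes. It is not a POI or Tika API: the class name DocxTextSketch and its methods are hypothetical. It assumes the main body lives in word/document.xml and that visible text sits in w:t elements, so it ignores headers, footers, footnotes, comments and anything else stored in other parts.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper for illustration only; not part of POI or Tika. */
final class DocxTextSketch {
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Streams word/document.xml and returns the text found in w:t elements. */
    static String extractText(InputStream docx) throws IOException, XMLStreamException {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry entry; (entry = zip.getNextEntry()) != null; ) {
                if ("word/document.xml".equals(entry.getName())) {
                    return readBody(zip);
                }
            }
        }
        throw new IOException("word/document.xml not found");
    }

    private static String readBody(InputStream xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Untrusted input: disable DTDs and external entities.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        XMLStreamReader reader = factory.createXMLStreamReader(xml);

        StringBuilder out = new StringBuilder();
        boolean inText = false; // true while inside a w:t element
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT
                    && W_NS.equals(reader.getNamespaceURI())) {
                String name = reader.getLocalName();
                if (name.equals("t")) inText = true;
                else if (name.equals("tab")) out.append('\t');
                else if (name.equals("br")) out.append('\n');
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && W_NS.equals(reader.getNamespaceURI())) {
                String name = reader.getLocalName();
                if (name.equals("t")) inText = false;
                else if (name.equals("p")) out.append('\n'); // one line per paragraph
            } else if (inText && (event == XMLStreamConstants.CHARACTERS
                    || event == XMLStreamConstants.CDATA)) {
                out.append(reader.getText());
            }
        }
        reader.close();
        return out.toString();
    }
}
```

Only the current parser event and the output text are held in memory, so no XMLBeans object tree is built. For very large documents the StringBuilder could be swapped for a Writer so the text is never held in full either.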

Otherwise, I wonder if it might be possible to lazy-load some parts of files like that one, to help keep the memory footprint down. Are you able to profile it to work out which XML elements are taking the most space? (We'll need to know which part they come from, e.g. word/styles.xml, and which XML element within that part, e.g. w:rPr.)
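
One cheap way to get a first answer without a full heap profile is to count elements per part with a streaming parser. The sketch below assumes that element counts are a usable proxy for the number of XMLBeans objects created. It measures occurrences, not bytes, and does not tally attributes. The class name DocxElementCountSketch is hypothetical and only JDK classes are used.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper for illustration only; not part of POI or Tika. */
final class DocxElementCountSketch {

    /** Returns part name -> (prefixed element name -> number of occurrences). */
    static Map<String, Map<String, Long>> countElements(InputStream docx)
            throws IOException, XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Untrusted input: disable DTDs and external entities.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);

        Map<String, Map<String, Long>> byPart = new TreeMap<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry entry; (entry = zip.getNextEntry()) != null; ) {
                String part = entry.getName();
                if (entry.isDirectory() || !part.endsWith(".xml")) {
                    continue; // skip media, relationships and other non-XML parts
                }
                // Guard against a parser closing the shared zip stream early.
                InputStream guarded = new FilterInputStream(zip) {
                    @Override public void close() { }
                };
                Map<String, Long> counts = new TreeMap<>();
                XMLStreamReader reader = factory.createXMLStreamReader(guarded);
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        String prefix = reader.getPrefix();
                        String name = (prefix == null || prefix.isEmpty())
                                ? reader.getLocalName()
                                : prefix + ":" + reader.getLocalName();
                        counts.merge(name, 1L, Long::sum);
                    }
                }
                reader.close();
                byPart.put(part, counts);
            }
        }
        return byPart;
    }
}
```

Running this against the problem document and sorting each inner map by count should show which part and which element dominate. Attribute counts could be added in the same loop via reader.getAttributeCount() if the AttrXobj numbers need explaining as well.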