Bug 57031 - Out of Memory when extracting text from attached files
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: XWPF
Version: 3.7-FINAL
Hardware: PC All
Importance: P2 critical
Target Milestone: ---
Assignee: POI Developers List
Keywords: PatchAvailable
Duplicates: 58963 59058
Reported: 2014-09-28 13:41 UTC by Li Guoyu
Modified: 2016-04-01 21:19 UTC

Attachments
docx (54.58 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2014-09-28 13:41 UTC, Li Guoyu
switch out piccolo parser for xerces (7.24 KB, patch)
2016-02-24 20:22 UTC, Tim Allison
Workaround piccolo invocations (16.66 KB, text/x-diff)
2016-03-04 00:11 UTC, Andreas Beeker

Description Li Guoyu 2014-09-28 13:41:33 UTC
Created attachment 32066
docx

I'm getting an OutOfMemoryError (OOM) when trying to extract text from the attached file.


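package test;

import java.io.FileInputStream;

import org.apache.poi.POITextExtractor;
import org.apache.poi.extractor.ExtractorFactory;

public class POITest
{
  public static void main( String[] args ) throws Exception
  {
    String filePath = "/Users/lguoyu/Downloads/HW13_SA.docx";

    // try-with-resources closes the stream even when extraction fails
    try ( FileInputStream inputStream = new FileInputStream( filePath ) )
    {
      POITextExtractor pTextExtract = ExtractorFactory.createExtractor( inputStream );
      String text = pTextExtract.getText();

      System.out.println( text );
    }
    catch ( Throwable e )
    {
      // Throwable rather than Exception: OutOfMemoryError is an Error
      e.printStackTrace();
    }
  }
}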
Comment 1 Li Guoyu 2014-09-28 13:45:05 UTC
It seems an infinite loop is causing the OOM.

Stack trace:


java.lang.OutOfMemoryError: Java heap space
	at org.apache.xmlbeans.impl.store.CharUtil.allocate(CharUtil.java:397)
	at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:506)
	at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:419)
	at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:489)
	at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.text(Cur.java:2927)
	at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.stripText(Cur.java:3130)
	at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.text(Cur.java:3143)
	at org.apache.xmlbeans.impl.store.Locale$SaxHandler.characters(Locale.java:3291)
	at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.reportCdata(Piccolo.java:992)
	at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXMLNS(PiccoloLexer.java:1290)
	at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXML(PiccoloLexer.java:1261)
	at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yylex(PiccoloLexer.java:4812)
	at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yylex(Piccolo.java:1290)
	at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yyparse(Piccolo.java:1400)
	at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:714)
	at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3479)
	at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1277)
	at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1264)
	at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
	at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown Source)
	at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:136)
	at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:166)
	at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:118)
	at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:59)
	at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:181)
	at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:153)
	at test.POITest.main(POITest.java:19)
Comment 2 Nick Burch 2014-09-28 16:56:00 UTC
See the FAQ - http://poi.apache.org/faq.html#faq-N10109
Comment 3 Li Guoyu 2014-09-29 00:59:30 UTC
Hi Nick,

Thanks for the quick response.

I don't think this was caused by my environment. I ran the code on my MBP (2.3 GHz i7, 16 GB RAM, SSD) with JDK 7. I tried setting the max heap size to 10 GB and still got the OOM.

I guess the document structure is causing the issue: it disappeared after I removed one character from the document.

Can you please run the code with the attached document?

Thanks,
Guoyu
Comment 4 Dominik Stadler 2014-09-29 05:29:07 UTC
Reopen to verify
Comment 5 Dominik Stadler 2014-10-12 19:59:00 UTC
I can reproduce this. The following steps suffice to cause the OOM; it seems that reading the 900k file fails in the Piccolo XML parser that is used inside XMLBeans:

        String filePath = "test-data/document/57031.docx";
        
        ZipFile zf = new ZipFile(filePath);
        ZipEntry entry = zf.getEntry("word/document.xml");
        
        DocumentDocument document = DocumentDocument.Factory.parse(zf.getInputStream(entry));
        assertNotNull(document);
        
        zf.close();
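
To see which package part is the large one before feeding it to a parser, the entry sizes can be listed with the JDK alone. This is only a rough sketch: `OoxmlPartSizes` and `partSizes` are hypothetical names, not part of POI, and the sketch does not reproduce the OOM itself.

```java
import java.io.IOException;
import java.util.Enumeration;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper (not part of POI): reports the uncompressed size of each zip entry.
final class OoxmlPartSizes {

    /** Maps each entry name to its uncompressed size in bytes (-1 if the size is unknown). */
    static Map<String, Long> partSizes(String zipPath) throws IOException {
        Map<String, Long> sizes = new LinkedHashMap<>();
        try (ZipFile zf = new ZipFile(zipPath)) {
            Enumeration<? extends ZipEntry> entries = zf.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                sizes.put(entry.getName(), entry.getSize());
            }
        }
        return sizes;
    }
}
```

Calling `OoxmlPartSizes.partSizes("test-data/document/57031.docx")` should then show the size of `word/document.xml` next to the other parts.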
Comment 6 Volker Kleinschmidt 2015-10-01 19:42:44 UTC
Any progress on this? OOMs are a critical problem, and it's been a year since this has been verified, without further progress being made.
Comment 7 Dominik Stadler 2015-10-02 09:33:15 UTC
Please note that nobody is paid to work on POI, only volunteers who look at things in their free time. 

The best way to help a bug report see progress is to provide more information if available or supply patches together with unit-tests.

See e.g. http://poi.apache.org/guidelines.html#SubmittingPatches for more information about providing patches.
Comment 8 Tim Allison 2016-02-24 18:21:20 UTC
*** Bug 59058 has been marked as a duplicate of this bug. ***
Comment 9 Tim Allison 2016-02-24 18:25:13 UTC
Over on 59058, I found that if we use xerces instead of piccolo, this problem goes away.

Would switching parsers in XWPFDocument open up too great a can of worms?
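
For illustration only, parsing with the JAXP parser bundled in the JDK (a Xerces derivative in Oracle/OpenJDK builds) might look roughly like the sketch below. `JaxpParseSketch` is a hypothetical name; this is not the attached patch, and how the resulting DOM would be handed to XMLBeans is left out.

```java
import java.io.InputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical sketch: parse an XML part with the JDK's JAXP DocumentBuilder instead of Piccolo.
final class JaxpParseSketch {

    static Document parse(InputStream in) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        // OOXML parts rely on namespaces, so the DOM must be namespace aware.
        dbf.setNamespaceAware(true);
        // Ask the implementation to apply its built-in processing limits.
        dbf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        DocumentBuilder db = dbf.newDocumentBuilder();
        return db.parse(in);
    }
}
```

The stream passed in would be something like `zf.getInputStream(entry)` for `word/document.xml`.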
Comment 10 Tim Allison 2016-02-24 20:22:52 UTC
Created attachment 33591
switch out piccolo parser for xerces

I tried to mimic our SAXHelper.  Any feedback?  Is this a reasonable approach?
Comment 11 Dominik Stadler 2016-02-24 21:17:06 UTC
As we are on Java 6+, we can expect an XML parser to be present as part of the Java platform itself. Some versions of IBM Java 6 did not manage this correctly, but we direct users of that JDK to upgrade to IBM Java 7 anyway, as we already use the bundled XML parser elsewhere.

Changing the XML parser sounds like a change that can have quite a few side effects, so I would postpone it until at least after the 3.14 release if possible. That said, I don't know much about the use of the Piccolo parser in POI, so it might be a smaller change than I think.

However, I think we should at least test it more broadly (performance, memory consumption, ...) to verify that we do not introduce a significant regression here.

And BTW, we already have a class DocumentHelper which does very similar things to StaxHelper. Wouldn't it make sense to combine those two?
Comment 12 Tim Allison 2016-02-25 01:48:07 UTC
Thank you, Dominik.  

Makes sense to wait. Will do. 

I'm also leery of changing the xml parser without serious testing.

I just finished downloading and adding lots of doc[xm] files with your CommonCrawlDocumentDownload code.  Will run regression testing on that corpus in addition to the few we had in our regular govdocx1+othercommoncrawl corpus.
Comment 13 Tim Allison 2016-03-02 19:45:29 UTC
Thanks to Dominik's commoncrawl download tool, I found a pptx that shows similar symptoms.  I've posted the file on TIKA-1866.  

When we fix this for docx, we should also fix it for pptx and xlsx...with serious regression testing after the "fix". :)
Comment 14 Andreas Beeker 2016-03-02 23:12:58 UTC
(In reply to Tim Allison from comment #13)
> When we fix this for docx, we should also fix it for pptx and xlsx...with
> serious regression testing after the "fix". :)

I've tried to debug the piccolo parser, but it hangs somewhere for ages while parsing slide6 of the pptx.
I'm now trying to remove the piccolo parser invocations, but I'm also in favor of postponing the potential fix until after POI 3.14 is out.
Comment 15 Tim Allison 2016-03-02 23:46:07 UTC
(In reply to Andreas Beeker from comment #14)
Y, absolutely nothing until 3.14.  Thank you for investigating!

> (In reply to Tim Allison from comment #13)
> > When we fix this for docx, we should also fix it for pptx and xlsx...with
> > serious regression testing after the "fix". :)
> 
> I've tried to debug the piccolo parser, but it hangs somewhere for ages
> while parsing slide6 of the pptx.
> I'm now trying to remove the piccolo parser invocations, but I'm also in
> favor of postponing the potential fix until after POI 3.14 is out.
Comment 16 Andreas Beeker 2016-03-04 00:11:18 UTC
I've removed the piccolo parser classes from the xmlbeans jar and modified the failing classes, so this is a POI-wide patch.
There was also a problem with EvilUnclosedBRFixingInputStream.
I've also added an error handler to work around those system out messages.

It would be interesting to know how much slower JAXP is than piccolo, and whether this is still an issue with newer Java versions.

Another question is whether we should change the Ant build to permanently remove the piccolo classes.
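
As a rough illustration of the error handler idea (not the code in the attached patch): the JDK's default DocumentBuilder prints fatal errors to System.err when no handler is installed, and a handler that simply rethrows keeps the console quiet. `QuietDocumentBuilder` is a hypothetical name.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical sketch: a DocumentBuilder whose error handler rethrows instead of printing.
final class QuietDocumentBuilder {

    static DocumentBuilder newBuilder() throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        DocumentBuilder db = dbf.newDocumentBuilder();
        db.setErrorHandler(new ErrorHandler() {
            @Override
            public void warning(SAXParseException e) {
                // Warnings are dropped in this sketch; a real handler might log them.
            }

            @Override
            public void error(SAXParseException e) throws SAXException {
                throw e;
            }

            @Override
            public void fatalError(SAXParseException e) throws SAXException {
                throw e;
            }
        });
        return db;
    }
}
```

Callers still get the SAXParseException from `parse(...)`; only the console output goes away.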
Comment 17 Andreas Beeker 2016-03-04 00:11:22 UTC
Created attachment 33621
Workaround piccolo invocations
Comment 18 Andreas Beeker 2016-03-09 01:34:48 UTC
Patched via r1734182 and r1734184.

I'm also concerned about performance. It would be best to check the common crawl statistics. Depending on the difference, we might need to test other approaches instead of DocumentHelper -> DocumentBuilder, e.g. Tim's approach with the XMLStreamReader.

I'm closing this for now. Feel free to reopen it if the statistics are bad.
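
For readers unfamiliar with the streaming alternative mentioned above, here is a minimal sketch of reading a WordprocessingML part with the JDK's XMLStreamReader. It only collects the text of `w:t` elements and is not Tim's patch or POI's implementation; `StaxTextSketch` and `extractText` are hypothetical names.

```java
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch: stream through document.xml and collect the text of w:t elements.
final class StaxTextSketch {
    private static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    static String extractText(InputStream in) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // OOXML parts do not need DTDs, so switch DTD support off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        XMLStreamReader reader = factory.createXMLStreamReader(in);
        StringBuilder text = new StringBuilder();
        boolean inText = false;
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    if (isW(reader, "t")) {
                        inText = true;
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    if (isW(reader, "t")) {
                        inText = false;
                    } else if (isW(reader, "p")) {
                        text.append('\n'); // one line per paragraph
                    }
                } else if (inText && (event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA)) {
                    text.append(reader.getText());
                }
            }
        } finally {
            reader.close();
        }
        return text.toString();
    }

    // True if the current start/end element is <w:localName> in the main WordprocessingML namespace.
    private static boolean isW(XMLStreamReader reader, String localName) {
        return W_NS.equals(reader.getNamespaceURI()) && localName.equals(reader.getLocalName());
    }
}
```

Because nothing is buffered beyond the current event and the output string, memory use stays proportional to the extracted text rather than to a full object model of the part.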
Comment 19 Mircea 2016-04-01 21:19:03 UTC
*** Bug 58963 has been marked as a duplicate of this bug. ***