Bug 50993 - OutOfMemory Exception on Large docx
Status: RESOLVED WONTFIX
Alias: None
Product: POI
Classification: Unclassified
Component: XWPF
Version: 3.7-FINAL
Hardware: PC All
Importance: P2 blocker
Target Milestone: ---
Assignee: POI Developers List
Reported: 2011-03-29 13:37 UTC by Peter Nordquist
Modified: 2014-09-01 11:04 UTC

Description Peter Nordquist 2011-03-29 13:37:25 UTC
Loading the attached document with org.apache.poi.xwpf.usermodel.XWPFDocument(InputStream is) causes an OutOfMemoryError with 4GB of heap space; with 8GB it loads successfully.  We use POI in an application server, where multiple concurrent users will trigger this problem more frequently.  I realize it is a ~23k page document, and it seems to take some time to load into any editor.

Example Stack Trace:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.attr(Cur.java:3039)
	at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.attr(Cur.java:3060)
	at org.apache.xmlbeans.impl.store.Locale$SaxHandler.startElement(Locale.java:3254)
	at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.reportStartTag(Piccolo.java:1082)
	at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseAttributesNS(PiccoloLexer.java:1802)
	at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseOpenTagNS(PiccoloLexer.java:1521)
	at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseTagNS(PiccoloLexer.java:1362)
	at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXMLNS(PiccoloLexer.java:1293)
	at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXML(PiccoloLexer.java:1261)
	at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yylex(PiccoloLexer.java:4808)
	at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yylex(Piccolo.java:1290)
	at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yyparse(Piccolo.java:1400)
	at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:714)
	at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3439)
	at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1270)
	at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1257)
	at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
	at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown Source)
	at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:135)
	at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
	at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:114)
	at LoadBigDoc.main(LoadBigDoc.java:12)


Example Code:
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class LoadBigDoc {

  public static void main(final String[] args) throws Exception {
    final InputStream is = new BufferedInputStream(new FileInputStream(new File("ANA3.blast.docx")));
    new XWPFDocument(is);
  }
}

This code assumes it is run with the correct classpath and the file is in the current working directory.

Platforms:
Windows 7 64-bit
Mac OS X 10.6.7 64-bit
RHEL 5.5 64-bit

All running Sun JDK 1.6.0_20 64-bit
Comment 1 Peter Nordquist 2011-03-29 13:51:14 UTC
Since I can't attach the file here, it is available at https://fx.pnl.gov/Files.aspx?EmailID=8fbf4bdb-5af1-46c0-90dc-2c0980e173e9.  The file will only be available for 10 days, but I can send another link if you still need it.
Comment 2 David Fisher 2011-03-29 14:21:11 UTC
With the usermodel, POI has to load the whole document into memory as Java objects.

As I write this, Microsoft Word is still sequencing the pages. It has been going for minutes and I am only on page 600. I have a fast Mac with 8GB.

You are going to need to rethink your architecture. Please explain the use case on the user list and you should get some help. There are some less memory-intensive techniques.

One question is why you are producing what looks like fixed character width formatted program output as Word XML.

Regards,
Dave
Comment 3 Ryan LaMothe 2011-03-29 14:41:17 UTC
@Dave

The following points in your response are meaningless and I will address them one at a time:

1) "need to rethink your architecture"

The chosen architecture is not the reason for this bug report.

2) "Please explain the use case"

Third party documents are being analyzed.  Part of the analysis is to extract text and images.  The original documents cannot be altered.

3) "One question is why are you producing what looks like fixed character width
formatted program output into Word XML"

The document contents and format are not the reason for this bug report.

4) "There are some less memory intensive techniques."

Please supply the information.
Comment 4 David Fisher 2011-03-29 16:15:10 UTC
@Ryan -

This is a huge file. The 13MB docx expands into a 33MB word/document.xml.

In the standard case, POI turns each bit of XML in those 33MB into a Java object, and all of them must be in memory at once. This is easily in the 4GB to 8GB range. We are not going to fix the standard method. We would consider patches that might help.
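The expansion described here can be checked before a document is handed to the usermodel. The following is only a sketch, not part of POI: the class name DocxSizeCheck, its methods, and the maxBytes threshold are hypothetical, and it assumes the main part is stored at the conventional path word/document.xml. It uses only java.util.zip to read the uncompressed size recorded in the archive.

```java
import java.io.File;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper (not a POI class): inspects a .docx before loading it.
public class DocxSizeCheck {

  // Assumption: the main document part lives at the conventional path.
  private static final String MAIN_PART = "word/document.xml";

  /**
   * Returns the uncompressed size in bytes of word/document.xml as recorded
   * in the zip archive, or -1 if the entry is missing or its size is unknown.
   */
  public static long mainPartSize(final File docx) throws IOException {
    final ZipFile zip = new ZipFile(docx);
    try {
      final ZipEntry entry = zip.getEntry(MAIN_PART);
      return entry == null ? -1L : entry.getSize();
    } finally {
      zip.close();
    }
  }

  /**
   * True only if the main part exists and is no larger than maxBytes.
   * The threshold is the caller's choice; no particular value is implied.
   */
  public static boolean safeToLoad(final File docx, final long maxBytes)
      throws IOException {
    final long size = mainPartSize(docx);
    return size >= 0 && size <= maxBytes;
  }
}
```

A caller could use safeToLoad to route oversized files to a different code path instead of constructing an XWPFDocument. The recorded size comes from the archive itself, so it should be treated as a hint rather than a guarantee.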

As far as the architecture is concerned, I would never want to load anything so large into a web server like Apache Tomcat.

Asking what the use case is allows an intelligent discussion about which techniques are available. The correct place for that is the POI User list, not a Bugzilla entry.

That is how POI works.

We have no idea for what purpose you are loading this data into your web server. Do you intend to find results? Are you analyzing them? So, let's have a dialog, but on the user list.

If you want to discuss different algorithms that might solve the problem then please try the POI Developer list.

If you search Bugzilla and the lists for OutOfMemory, I think you will find my response consistent. I'm sorry if it was short this time.
Comment 5 Peter Nordquist 2011-03-29 16:29:45 UTC
Sorry for not including this earlier, but we are using POI in JBoss Application Server 5.1.0.GA in a Stateless EJB for document extraction in a pipeline.  The final consumer is not a web application but another Java EE service.  It's clear that you want me to post this on the Users list, so I will do that.  If you have any time, is there a streaming/event/SAX parser for this, like org.apache.poi.xssf.eventusermodel for Excel?
Comment 6 David Fisher 2011-03-29 18:03:03 UTC
This is common. Have a look at http://poi.apache.org/text-extraction.html and see if the WordExtractor helps.

You'll get better help on this on the user list. There is more visibility. We hope for more developers, but we have lots of users.

Sorry if I was short, but I am a busy project manager.
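
For the streaming question raised in Comment 5, one possible low-memory approach is to skip the usermodel and read word/document.xml directly with a pull parser. The sketch below is an illustration under stated assumptions, not a POI API: the class StreamingDocxText and its methods are hypothetical, it assumes the main part is at the conventional path word/document.xml, it uses only the JDK's java.util.zip and StAX, and it extracts text runs only (no images, headers, or footnotes).

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper (not a POI class): streams text out of a .docx.
public class StreamingDocxText {

  // WordprocessingML main namespace (transitional OOXML).
  private static final String W_NS =
      "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

  /** Writes the text of every w:t element to out, one line per paragraph. */
  public static void extractText(final InputStream docx, final Appendable out)
      throws IOException, XMLStreamException {
    final ZipInputStream zip = new ZipInputStream(docx);
    try {
      ZipEntry entry;
      while ((entry = zip.getNextEntry()) != null) {
        // Assumption: the main part is at the conventional path.
        if ("word/document.xml".equals(entry.getName())) {
          copyText(zip, out);
          return;
        }
      }
      throw new IOException("word/document.xml not found");
    } finally {
      zip.close();
    }
  }

  private static void copyText(final InputStream xml, final Appendable out)
      throws IOException, XMLStreamException {
    final XMLInputFactory factory = XMLInputFactory.newInstance();
    // Third-party documents: do not process DTDs or external entities.
    factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
    factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
    final XMLStreamReader reader = factory.createXMLStreamReader(xml);
    boolean inText = false; // true while inside a w:t element
    while (reader.hasNext()) {
      switch (reader.next()) {
        case XMLStreamConstants.START_ELEMENT:
          if (isW(reader, "t")) {
            inText = true;
          }
          break;
        case XMLStreamConstants.CHARACTERS:
        case XMLStreamConstants.CDATA:
          if (inText) {
            out.append(reader.getText());
          }
          break;
        case XMLStreamConstants.END_ELEMENT:
          if (isW(reader, "t")) {
            inText = false;
          } else if (isW(reader, "p")) {
            out.append('\n'); // end of paragraph
          }
          break;
        default:
          break;
      }
    }
    reader.close();
  }

  private static boolean isW(final XMLStreamReader reader, final String local) {
    return W_NS.equals(reader.getNamespaceURI()) && local.equals(reader.getLocalName());
  }
}
```

Because only the current parser event is held at any time, memory use should stay roughly constant regardless of document length, provided out is a Writer or similar sink rather than an in-memory buffer. A stricter implementation would locate the main part through the package relationships instead of assuming its path.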