Bug 54790 - Word Document loading strategy is memory hungry and causes OutOfMemoryError
Status: NEW
Alias: None
Product: POI
Classification: Unclassified
Component: HWPF
Version: 3.8-FINAL
Hardware: PC Windows XP
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
Reported: 2013-04-02 18:41 UTC by Dmitry
Modified: 2019-06-18 07:16 UTC



Description Dmitry 2013-04-02 18:41:47 UTC
In my case I have a 70MB Word document, which yields 50MB of plain text (after "Save As..." to a text file). When this document is loaded using POI, the following error occurs:

Caused by: java.lang.OutOfMemoryError: Java heap space
	at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:133)
	at java.lang.StringCoding.decode(StringCoding.java:173)
	at java.lang.String.<init>(String.java:443)
	at java.lang.String.<init>(String.java:515)
	at org.apache.poi.hwpf.model.TextPiece.buildInitSB(TextPiece.java:89)
	at org.apache.poi.hwpf.model.TextPiece.<init>(TextPiece.java:66)
	at org.apache.poi.hwpf.model.TextPieceTable.<init>(TextPieceTable.java:111)
	at org.apache.poi.hwpf.model.ComplexFileTable.<init>(ComplexFileTable.java:70)
	at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:267)
	at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:186)

By my observations, the 70MB document explodes to approximately 900MB of heap.

Analysis:

As far as I can see, the TextPieceTable class creates thousands of TextPiece objects (and thus thousands of StringBuilder objects with small char[] buffers). After that, HWPFDocument proceeds as follows:

- it collects all text pieces again in line 275:
  _text = _tpt.getText();
- if preserveTextTable=false, then a new ComplexFileTable object is created holding one TextPieceTable, which holds one SinglentonTextPiece, in lines 314-318

Perhaps this can be improved further. In particular, when preserveTextTable=false, TextPieceTable should not make a copy of the relevant part of documentStream:

System.arraycopy( documentStream, start, buf, 0, textSizeBytes );

and should instead use a lightweight version of TextPiece without its own buffer. Later, when all text pieces need to be collected, the text can be read directly from documentStream.
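
A minimal sketch of the proposed lightweight piece, not actual POI code: LazyTextPiece, its fields, and its methods are hypothetical names. It assumes that 8-bit pieces are encoded as Cp1252 and 16-bit pieces as UTF-16LE, and that the whole documentStream is already held in memory as one byte[]. The piece stores only an offset, a length, and the unicode flag, and decodes on demand:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical lightweight text piece: a view over the shared document
 * stream instead of a private copy. Not part of the POI API.
 */
final class LazyTextPiece {
    // Assumption: non-unicode pieces are Cp1252, unicode pieces are UTF-16LE.
    private static final Charset EIGHT_BIT = Charset.forName("windows-1252");

    private final byte[] documentStream; // shared with all other pieces, never copied
    private final int start;             // byte offset of this piece in the stream
    private final int byteLength;        // length of this piece in bytes
    private final boolean unicode;       // true: two bytes per character

    LazyTextPiece(byte[] documentStream, int start, int byteLength, boolean unicode) {
        if (start < 0 || byteLength < 0 || byteLength > documentStream.length - start) {
            throw new IllegalArgumentException("piece lies outside the document stream");
        }
        if (unicode && byteLength % 2 != 0) {
            throw new IllegalArgumentException("unicode piece needs an even byte length");
        }
        this.documentStream = documentStream;
        this.start = start;
        this.byteLength = byteLength;
        this.unicode = unicode;
    }

    /** Number of characters in this piece, computed without decoding. */
    int charLength() {
        return unicode ? byteLength / 2 : byteLength;
    }

    /** Decodes the piece straight from the shared stream when it is needed. */
    String decode() {
        Charset charset = unicode ? StandardCharsets.UTF_16LE : EIGHT_BIT;
        return new String(documentStream, start, byteLength, charset);
    }
}
```

With this layout each piece costs a few fields rather than a byte[] copy plus a StringBuilder; the price is that every call to decode() repeats the decoding work.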
Comment 1 Sergey Vladimirov 2013-04-02 20:20:39 UTC
Dmitry,

How much memory does your JVM have? Is it the standard (JVM-default) 64/128 MB setting, or is it some kind of mobile system?

Sometimes loading the whole file into memory is the only way to process it. For example, you can't even break text into paragraphs without checking TextPiece content. And using TextPiece just as a lightweight proxy to DocumentStream is going to be very inefficient (due to the required character encoding-decoding process).

Also, disabling preserveTextTable means the whole text is reconstructed into a single buffer (StringBuilder). And in most cases there is no single pointer into the document stream; it is a reconstruction of a pretty complex structure using data from ComplexFileTable. Perhaps it is possible to use a "lightweight" "TextPieceProxy" when "preserveTextTable=true" if we only need to read text. But from my point of view, it is not a nice way.
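
For illustration only, the single-buffer reconstruction described above might look like the following sketch. PieceRef and SingleBufferText are hypothetical names, not POI classes. The sketch assumes the piece list is already in document order, that 8-bit pieces are Cp1252 and 16-bit pieces UTF-16LE, and that the total character count fits in an int. Pieces may sit anywhere in the stream, which is why each one carries its own offset:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Hypothetical descriptor of one text piece inside the document stream. */
final class PieceRef {
    final int start;       // byte offset in the document stream
    final int byteLength;  // piece length in bytes
    final boolean unicode; // true: UTF-16LE, false: 8-bit (assumed Cp1252)

    PieceRef(int start, int byteLength, boolean unicode) {
        this.start = start;
        this.byteLength = byteLength;
        this.unicode = unicode;
    }
}

final class SingleBufferText {
    private static final Charset EIGHT_BIT = Charset.forName("windows-1252");

    /** Rebuilds the whole text into one StringBuilder sized up front. */
    static StringBuilder collect(byte[] documentStream, List<PieceRef> pieces) {
        // First pass: count characters so the builder never has to grow.
        int totalChars = 0;
        for (PieceRef p : pieces) {
            totalChars += p.unicode ? p.byteLength / 2 : p.byteLength;
        }
        StringBuilder text = new StringBuilder(totalChars);

        // Second pass: decode each piece from the shared stream and append it.
        for (PieceRef p : pieces) {
            Charset charset = p.unicode ? StandardCharsets.UTF_16LE : EIGHT_BIT;
            ByteBuffer bytes = ByteBuffer.wrap(documentStream, p.start, p.byteLength);
            CharBuffer chars = charset.decode(bytes); // temporary, one piece at a time
            text.append(chars);
        }
        return text;
    }
}
```

Only one piece-sized temporary buffer is live at a time here, so the peak is roughly the stream plus the final text; whether that is achievable inside HWPF depends on what else needs the per-piece data.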
Comment 2 Dmitry 2013-04-06 15:21:42 UTC
To be more precise:
- Opening fails with -Xmx800m
- Opening succeeds with -Xmx900m

Expected:
- Opening succeeds with -Xmx300m

To repeat: the DOC file size is 70MB. I could potentially trim it down or upload it as is to a file share.

> And using TextPiece just as a lightweight proxy to DocumentStream is going to be very inefficient (due to the required character encoding-decoding process).

Deferred encoding-decoding is not a problem: the only flag is "unicode=true|false". The problem is that DocumentStream is cut into millions of tiny char buffers.

> Also, disabling preserveTextTable means the whole text is reconstructed into a single buffer (StringBuilder).

The OOM happens before the whole text is reconstructed. I could accept 3x memory consumption, that is 70MB -> 210MB of heap, but 10x is too much. And yes, "preserveTextTable" is disabled by default as far as I can see, unless it is enabled by a system property.
Comment 3 Dominik Stadler 2016-07-24 10:37:48 UTC
In order to take a look, it would help to know what content the document contains. Any chance of providing the sample document as an attachment here?
Comment 4 Dmitry 2016-08-03 08:59:04 UTC
The document can be downloaded from here:
https://www.dropbox.com/s/h837yv1jrnjp9zq/poi_54790_test.7z?dl=1