In my case I have a 70MB Word document, which yields about 50MB of plain text (after "Save as..."). When this document is loaded using POI, the following error occurs:

Caused by: java.lang.OutOfMemoryError: Java heap space
    at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:133)
    at java.lang.StringCoding.decode(StringCoding.java:173)
    at java.lang.String.<init>(String.java:443)
    at java.lang.String.<init>(String.java:515)
    at org.apache.poi.hwpf.model.TextPiece.buildInitSB(TextPiece.java:89)
    at org.apache.poi.hwpf.model.TextPiece.<init>(TextPiece.java:66)
    at org.apache.poi.hwpf.model.TextPieceTable.<init>(TextPieceTable.java:111)
    at org.apache.poi.hwpf.model.ComplexFileTable.<init>(ComplexFileTable.java:70)
    at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:267)
    at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:186)

By my observations, the 70MB document explodes to approximately 900MB of heap.

Analysis: as far as I can see, the TextPieceTable class creates thousands of TextPiece objects (and thus thousands of StringBuilder objects with small char[] buffers). Later, the HWPFDocument strategy is the following:
- it collects all text pieces again in line 275: _text = _tpt.getText();
- if preserveTextTable=false, a new ComplexFileTable object is created holding one TextPieceTable, which in turn holds one SinglentonTextPiece (lines 314-318)

Perhaps this can be further improved. In particular, when preserveTextTable=false, TextPieceTable should not make a copy of the documentStream part:

System.arraycopy( documentStream, start, buf, 0, textSizeBytes );

and should instead use a lightweight version of TextPiece without a buffer. Later, when all text pieces need to be collected, they can be taken directly from documentStream.
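To illustrate the proposal above, here is a minimal sketch (not POI API; the class name and fields are hypothetical) of a "lightweight TextPiece" that records only an offset and length into the shared document stream instead of arraycopy-ing bytes into a per-piece buffer, and decodes characters only when text is actually requested. For simplicity the sketch uses ISO-8859-1 as a stand-in for the 8-bit codepage; real HWPF non-unicode pieces use cp1252.

```java
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical sketch (not POI API): a text piece that keeps only an
 * offset/length into the shared document stream, avoiding the per-piece
 * System.arraycopy and char[] buffer. Decoding happens on demand.
 */
final class LazyTextPiece {
    private final byte[] documentStream; // shared across all pieces, never copied
    private final int start;
    private final int lengthInBytes;
    private final boolean unicode;       // true: UTF-16LE; false: 8-bit codepage

    LazyTextPiece(byte[] documentStream, int start, int lengthInBytes, boolean unicode) {
        this.documentStream = documentStream;
        this.start = start;
        this.lengthInBytes = lengthInBytes;
        this.unicode = unicode;
    }

    /** Decode only when the caller actually needs characters. */
    String text() {
        return new String(documentStream, start, lengthInBytes,
                unicode ? StandardCharsets.UTF_16LE : StandardCharsets.ISO_8859_1);
    }
}

public class LazyTextPieceDemo {
    public static void main(String[] args) {
        byte[] stream = "Hello, world".getBytes(StandardCharsets.ISO_8859_1);
        LazyTextPiece piece = new LazyTextPiece(stream, 7, 5, false);
        System.out.println(piece.text()); // prints "world"
    }
}
```

With pieces like this, collecting the full text once (for the preserveTextTable=false case) can decode straight from documentStream into a single buffer, instead of first materializing thousands of small intermediate buffers.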
Dmitry, how much memory does your JVM have? Is it the standard (JVM-default) 64/128 MB setting, or is it some kind of mobile system? Sometimes loading the whole file into memory is the only way to process it. For example, you can't even break the text into paragraphs without checking TextPiece content. And using TextPiece just as a lightweight proxy to documentStream is going to be very inefficient (due to the required character encoding/decoding). Also, disabling preserveTextTable means the whole text is reconstructed into a single buffer (StringBuilder). And in most cases there is no single pointer into the document stream; it is a reconstruction of a pretty complex structure using data from the ComplexFileTable. Perhaps it is possible to use a lightweight "TextPieceProxy" when preserveTextTable=true, if we only need to read the text. But from my point of view, that is not a nice way.
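One way to reconcile the decoding-cost objection with lazy pieces would be a proxy that decodes on first access and then caches the result, so repeated reads pay the encoding/decoding cost only once. A minimal sketch, with a hypothetical class name ("TextPieceProxy" is borrowed from the comment above, not from POI), again using ISO-8859-1 as a stand-in for cp1252:

```java
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical "TextPieceProxy" sketch (not POI API): decodes from the
 * shared document stream only on first access and caches the String,
 * so repeated reads do not repeat the character decoding.
 */
final class TextPieceProxy {
    private final byte[] documentStream;
    private final int start;
    private final int lengthInBytes;
    private final boolean unicode; // true: UTF-16LE; false: 8-bit codepage
    private String cached;         // decoded lazily, then reused

    TextPieceProxy(byte[] documentStream, int start, int lengthInBytes, boolean unicode) {
        this.documentStream = documentStream;
        this.start = start;
        this.lengthInBytes = lengthInBytes;
        this.unicode = unicode;
    }

    String text() {
        if (cached == null) {
            cached = new String(documentStream, start, lengthInBytes,
                    unicode ? StandardCharsets.UTF_16LE : StandardCharsets.ISO_8859_1);
        }
        return cached;
    }
}
```

This keeps the heap cost proportional to the text actually read, at the price of retaining a reference to the full documentStream for the life of the pieces.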
To be more precise:
- opening fails with -Xmx800m
- opening succeeds with -Xmx900m

Expected:
- opening succeeds with -Xmx300m

I repeat: the DOC file size is 70MB. I can potentially cut it down, or upload it as-is to a file share.

> And using TextPiece just as a lightweight proxy to documentStream is going to be very inefficient (due to the required character encoding/decoding).

Deferred encoding/decoding is not a problem: the only flag needed is unicode=true|false. The problem is that documentStream is cut into thousands of tiny char buffers.

> Also, disabling preserveTextTable means the whole text is reconstructed into a single buffer (StringBuilder).

The OOM happens before the whole text is reconstructed. I would accept x3 memory consumption, i.e. 70MB -> 210MB of heap. But x10 is too much. And yes, preserveTextTable is disabled by default as far as I can see, unless it is enabled via a system property.
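Two pure-Java observations that illustrate where the blow-up factor beyond x2 comes from (no POI dependency; the 50MB figure is taken from the report above): Java chars are UTF-16, so 8-bit text doubles once decoded, and a StringBuilder grows by roughly doubling its internal buffer, so each of the thousands of small buffers typically carries unused slack capacity on top of per-object overhead.

```java
public class BufferOverheadDemo {
    public static void main(String[] args) {
        // Java chars are UTF-16: 2 bytes each, so 50 MB of 8-bit text
        // already needs ~100 MB of char data once decoded, before any
        // per-object or slack-capacity overhead is counted.
        long textBytes = 50L * 1024 * 1024;
        long charBytes = textBytes * 2; // one 2-byte char per 8-bit byte
        System.out.println("decoded chars alone: " + (charBytes / (1024 * 1024)) + " MB");

        // StringBuilder grows by roughly doubling, so a buffer rarely
        // ends up exactly full; the slack is wasted heap, multiplied by
        // the number of TextPiece-sized buffers.
        StringBuilder sb = new StringBuilder(); // default capacity 16
        for (int i = 0; i < 17; i++) {
            sb.append('x');
        }
        System.out.println("length=" + sb.length() + " capacity=" + sb.capacity());
    }
}
```

Neither effect alone explains x10, but combined with thousands of small objects (headers, char[] slack, StringBuilder wrappers) plus the final single-buffer reconstruction holding everything at once, the multiplier grows quickly.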
In order to take a look, it would be interesting to know what content there is in the document. Any chance of providing the sample document as an attachment here?
The document can be downloaded from here: https://www.dropbox.com/s/h837yv1jrnjp9zq/poi_54790_test.7z?dl=1