Bug 45570

Summary: Reduce memory overhead of BitField references
Product: POI Reporter: Trejkaz (pen name) <trejkaz>
Component: HSSFAssignee: POI Developers List <dev>
Severity: enhancement    
Priority: P2    
Version: unspecified   
Target Milestone: ---   
Hardware: PC   
OS: Windows Vista   
Attachments: proposed patch

Description Trejkaz (pen name) 2008-08-05 17:54:37 UTC
Created attachment 22391 [details]
proposed patch

I had a 20MB Excel workbook (doesn't sound very big to me...) which took up around 600MB RAM in HSSF.

Attached patch reduces this to around 500MB for this particular workbook by not holding references to BitField instances from instances which occur a large number of times (I used jhat/jmap and found all the BitField objects with a large number of references from the same type of object.)

There are further BitField references which I did *not* change to static; this is only the ones I noticed for one particular file.  In some cases making things static can be a bad idea as it increases the memory usage even when there is no workbook in memory.

Incidentally, the major memory hog is actually the formula Ptg arrays (they're taking up 150MB!) but I couldn't figure out a good way to trim them down.  How could these structures possibly take up 25 times the size of the original file?  Can't the parsing of formulas be done on demand instead of storing all the structures in memory?
Comment 1 Josh Micich 2008-08-06 00:11:38 UTC
Converted all instance BitFields to static. Applied in svn r683132

However, I suspect the problem with Ptg arrays is more important.
Every Ptg (subclass) instance consumes at least 12 bytes (on a 32 bit JVM) due to the Ptg superclass.  Most Ptg subclasses have data fields which probably also take extra space due to JVM word alignment.  I would estimate that the parsed Ptg arrays consume about 3 times as much space as the encoded byte arrays.  Such an expansion may not be so bad in itself.  It certainly wouldn't account for the 25x expansion you have observed.

It should be noted that POI expands ShareFormula, MulBlank and MulRK records upon reading.  Try writing the spreadsheet back out with POI.  Any growth is usually due to these expansions.  You can run BiffView on both spreadsheets to confirm the differences.  One of the things on the POI to-do list is to have these expanded records re-compacted upon writing out.

Since SharedFormula records contain Ptg arrays, this might explain the massive memory use you are observing.  Making POI *not* expand SharedFormula records might be a complex change.

BTW - I think 20MB *is* a big spreadsheet (unless it contains mainly pictures). I am closing this bug off for the moment.  If you are prepared to work on a patch, please re-open and we can try to improve things further.
Comment 2 Trejkaz (pen name) 2008-08-06 00:19:50 UTC
If I get to work on more of this stuff I'll raise separate patches for the next one(s).

Regarding the cell records, I suspect the explosion of cell runs is one of the problems I'm running into.  At the very least, each blank cell takes 12 bytes just for its fields (then add on the reference to the class itself and whatever other cruft the JVM puts in.)

If there were some way of rewriting it to retain the MulBlankRecord and co., that may help a little.  But there is a bigger problem at the HSSFCell level as each one has so many fields (worse still, for a blank cell it barely uses any of them!  But that's until someone tries to change the value. :-()

I can get another 15-20% saving if I hack in some code to throw away blank cells during HSSFRow.createCellFromRecord but obviously that loses the formatting.  I think it might be possible to make some kind of copy on write proxy which shares the same actual HSSFCell instance for the blank cell but creates a real one when you call one of the setters, but it looks like it will be a lot of work even for 20%.