Bug 37637 - Memory and Perfomace-Optimizations
Summary: Memory and Perfomace-Optimizations
Alias: None
Product: POI
Classification: Unclassified
Component: HSSF (show other bugs)
Version: 3.0-dev
Hardware: Other other
: P2 normal (vote)
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Reported: 2005-11-25 13:40 UTC by Oliver Henning
Modified: 2006-01-02 23:27 UTC (History)
0 users

Patch-file (41.86 KB, patch)
2005-11-25 13:41 UTC, Oliver Henning
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Oliver Henning 2005-11-25 13:40:31 UTC
There is a typo. Original code has the following semantic:
	this.filename = filename
	this.field_12_zero5 = field_12_zero5 //resp.

which is (in these methods) equal to:
	this.filename = this.filename
	this.field_12_zero5 = this.field_12_zero5 //resp.
This makes no sense

1. The often called method 'addValueRecord' creates a lot of garbage
   int[] Objects. They are needed just for debug-Tracing. I surrounded
   it with a standard 'if(Logger.isDebugEnabled...)'
2. The columns.getIterator() is expensive, as it always generates a
   Garbage-Object (the Iterator itself). A small refactoring to an
   Array-Loop with index speeds up POI.

see description for 'org/apache/poi/hssf/model/Sheet.java', point 2.

All the BitField-Helper-Objects can be static, as they are invariant.
The idea to reduce garbage with the BitFieldFactory.getInstance(...)
is good, but there also be an Garbage-Object: An java.lang.Integer
to lookup the cache-Map.

I changed the 'backing collection' from TreeMap to a 2-dimensional
Array of CellValueRecordInterface.

The TreeMap is pretty expensive, as every entry needs a TreeMap.Entry
Object. We produce much less Garbage-Objects with this implementation.

Eclipse means that there should be an explicit call of 'toString'
for arrays. Was not really a bug.

The idea of this change is to not have redundant Values in the
HSSFCell-Object an the corresponding CellValueRecord. Especially
the following Attributes in HSSFCell have been removed:
 cellNum 	(=column-Number)
 cellStyle	(=Style-Index)
 cellValue	(=value if Cell is Numeric)
 booleanValue	(=dito if Cell is Boolean)
 errorValue	(=dito if....)
 rot		(=row Number)
Access to these values is now delegated to the underlying

All together we gain speed and save heap. An Example where 50*1000
Cells (25'000 Strings + 25'000 Numerics) were generated and then written
via a byte[] to a file:

Before: 1000ms, over 600'000 Objects (mostly Garbage), 22MB Heap-usage
After :  370ms,      210'000 Objects (70'000 Garbage), 12MB Heap-usage

I hope that you can use some of these optimizations.
Oliver Henning
aloba ag
Comment 1 Oliver Henning 2005-11-25 13:41:38 UTC
Created attachment 17042 [details]
Comment 2 Jason Height 2005-12-15 11:33:00 UTC

Wow this all looks really good, and something that i have been wanting to look
at getting arround to.

One idea that i had also was that the HSSFRow, HSSFCell objects are kept but for
most applications we basically create a row, some cells and then move onto the
next row and repeat. All of the HSSFRow & HSSFCell objects from the previous row
could therefore be garbage collected if they could be re-created from the
low-level objects as required. Ok this results in alot of garbage, but in a
server environment this would lower the overall memory consumption with a bit of
processor gc increase. If there was a switch to determine whether the user model
objects could be created on-demand then that woudl be grand.

Maybe someone will look at this before i have the chance. hint hint!

In the mean time ill look at commiting the patch.

Comment 3 Oliver Henning 2005-12-16 10:12:57 UTC
Salü Jason

Good Idea, but quite a big thing to implement, hm?

But there is still potential in the existing model:
1) HSSFRichTextString in HSSFCell is still a 'duplicate' of the low-level-record.
   This Class is new (compared to poi-2.5.1-final-20040804) and pretty
   heavy-weight compared to the 'good old string' in POI-2.5.1. Perhaps there
   could be both models and the user decides whether he can use the one (fast
   string) or the other (rich text)

2) The method HSSFRow.createCell(short column, int type) is deprecated (but I
   don't know why). You should always use the 'simpler' method
   HSSFRow.createCell(short column) which constructs a 'blank-record'. This
   record normally becomes garbage with the next call to cell.setCellValue(...).
   It is much more efficient if you say directly at 'createCell' which
   type of cell you need. So why not remove the '@deprecated'?
3) It could be possible to remove the attribute "int cellType" from HSSFCell.
   This is redundant, as you can identify the type also with
   "record instanceof XXX"
And finally, for short: we like POI ;-)

Greetings, Oli
Comment 4 Jason Height 2006-01-03 08:27:12 UTC
Patch applied to SVN. Needed modifications so that the unit tests would pass.

Oliver looking forward to more Performance patches ;-)