The methods RowRecordsAggregate.getStartRowNumberForBlock / getEndRowNumberForBlock iterate over the rows for every block in the sheet. As the number of rows grows, serialization time increases enormously. The following code improvement decreases the serialization time of a workbook with 1 sheet, 25 columns and 30000 rows from 7.5 seconds down to 1.6 seconds in my test environment.

private List _rowRecordsList = null;

/** Returns the physical row number of the first row in a block */
private int getStartRowNumberForBlock(int block) {
    // Given that we basically iterate through the rows in order,
    // TODO - For a performance improvement, it would be better to return an
    // instance of an iterator and use that instance throughout, rather than
    // recreating one and having to move it to the right position.
    if (_rowRecordsList == null) {
        // build up block-based list of row records
        _rowRecordsList = new ArrayList(_rowRecords.values());
    }
    int startIndex = block * DBCellRecord.BLOCK_SIZE;
    RowRecord row = null;
    if (startIndex < _rowRecordsList.size()) {
        row = (RowRecord) _rowRecordsList.get(startIndex);
    }
    /*
    Iterator rowIter = _rowRecords.values().iterator();
    // Position the iterator at the start of the block
    for (int i = 0; i <= startIndex; i++) {
        row = (RowRecord) rowIter.next();
    }
    */
    if (row == null) {
        throw new RuntimeException("Did not find start row for block " + block);
    }
    return row.getRowNumber();
}

/** Returns the physical row number of the end row in a block */
private int getEndRowNumberForBlock(int block) {
    if (_rowRecordsList == null) {
        // build up block-based list of row records
        _rowRecordsList = new ArrayList(_rowRecords.values());
    }
    int endIndex = ((block + 1) * DBCellRecord.BLOCK_SIZE) - 1;
    if (endIndex >= _rowRecords.size()) {
        endIndex = _rowRecords.size() - 1;
    }
    RowRecord row = null;
    if (endIndex < _rowRecordsList.size()) {
        row = (RowRecord) _rowRecordsList.get(endIndex);
    }
    /*
    Iterator rowIter = _rowRecords.values().iterator();
    for (int i = 0; i <= endIndex; i++) {
        row = (RowRecord) rowIter.next();
    }
    */
    if (row == null) {
        throw new RuntimeException("Did not find end row for block " + block);
    }
    return row.getRowNumber();
}
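To make the speedup concrete, here is a minimal, self-contained sketch of the core idea: the old code re-walked a fresh iterator from the start of the row map for every block (O(N) per block, O(N^2) overall), while the improved code caches the map's values in a random-access list once and then indexes into it in O(1). The class, field, and value types below are illustrative stand-ins, not POI's actual RowRecord/DBCellRecord classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class BlockLookupDemo {
    static final int BLOCK_SIZE = 32;   // stand-in for DBCellRecord.BLOCK_SIZE

    // Old approach: O(N) per call, walks the iterator from the start for every block
    static int startRowViaIterator(SortedMap<Integer, Integer> rows, int block) {
        int startIndex = block * BLOCK_SIZE;
        int i = 0;
        for (int rowNumber : rows.values()) {
            if (i++ == startIndex) {
                return rowNumber;
            }
        }
        throw new RuntimeException("Did not find start row for block " + block);
    }

    // Improved approach: O(1) per call once the values are cached in a list
    static int startRowViaCache(List<Integer> cache, int block) {
        int startIndex = block * BLOCK_SIZE;
        if (startIndex >= cache.size()) {
            throw new RuntimeException("Did not find start row for block " + block);
        }
        return cache.get(startIndex);
    }

    public static void main(String[] args) {
        SortedMap<Integer, Integer> rows = new TreeMap<>();
        for (int r = 0; r < 30000; r++) {
            rows.put(r, r);   // row number keyed by row index
        }
        List<Integer> cache = new ArrayList<>(rows.values());
        int blocks = rows.size() / BLOCK_SIZE;
        for (int b = 0; b < blocks; b++) {
            if (startRowViaIterator(rows, b) != startRowViaCache(cache, b)) {
                throw new AssertionError("mismatch at block " + b);
            }
        }
        System.out.println("blocks checked: " + blocks);
    }
}
```

Both lookups return the same row for every block; only the per-call cost differs, which is where the 7.5 s to 1.6 s improvement comes from.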
Created attachment 26362 [details] A patch to speed up serialization

I have also had this very problem with the POI trunk. I used Alexander's approach and it produces a great speedup. It is still worse than O(N), but it is a lot better than the original version (3.5 seconds down to 0.8 seconds). Serializing an Excel workbook ought to be O(N), where N is the total number of rows, though I'll admit I know nothing about the Excel file format. I have attached a patch against POI 3.7 that implements Alexander's approach. In addition to his version, I discard the cached values list when the map is modified.
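One way to implement the "discard the values list when the map is modified" idea might look like the sketch below. This is not the actual patch code; the class and method names are illustrative, and a plain String stands in for POI's RowRecord. The point is simply that every mutation of the backing map nulls out the cache, so indexed reads can never see stale data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of the invalidation idea: the cached values list is discarded
// whenever the underlying row map changes, so it can never go stale.
public class CachedRowList {
    private final SortedMap<Integer, String> _rowRecords = new TreeMap<>();
    private List<String> _rowRecordsList = null;  // lazily built cache

    public void insertRow(int rowNumber, String record) {
        _rowRecords.put(rowNumber, record);
        _rowRecordsList = null;  // discard cache on modification
    }

    public void removeRow(int rowNumber) {
        _rowRecords.remove(rowNumber);
        _rowRecordsList = null;  // discard cache on modification
    }

    public String getRecordAt(int index) {
        if (_rowRecordsList == null) {
            // rebuild the cache on first read after a modification
            _rowRecordsList = new ArrayList<>(_rowRecords.values());
        }
        return _rowRecordsList.get(index);
    }

    public static void main(String[] args) {
        CachedRowList rows = new CachedRowList();
        rows.insertRow(0, "a");
        rows.insertRow(1, "b");
        System.out.println(rows.getRecordAt(1));  // prints b
        rows.removeRow(0);                        // invalidates the cache
        System.out.println(rows.getRecordAt(0));  // prints b (rebuilt list)
    }
}
```

The rebuild costs O(N), but it only happens on the first read after a modification; during serialization, when the map is not changing, every block lookup stays O(1).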
Can you upload a file that we can use for a performance benchmark? I understand the intention to speed up serialization, but without a benchmark it is hard to tell how much we will get from this patch. Regards, Yegor
I'll get the test case we used when I get to work tomorrow. But really, any kind of spreadsheet with many rows (>30000) would probably illustrate this problem. The rate of rows serialized per second drops rather sharply (like 1/n^2) as the number of rows increases. Our worst case, as our application stands today, is 10 sheets with about 40000 rows each. Excel might not be the best format for this, but our clients want to use Excel. We use POI for the other files we produce, and it would be great if we could produce all files the same way.
Applied in r1043517 Regards, Yegor
I really appreciate it. Now we won't have to maintain this patch in our own tree. Thanks!