Bug 47405

Summary: RowRecordsAggregate.getStartRowNumberForBlock / getEndRowNumberForBlock not performing with high row count
Product: POI Reporter: Schulze Alexander <alex2>
Component: HSSFAssignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: major CC: perle
Priority: P2 Keywords: PatchAvailable
Version: 3.2-FINAL   
Target Milestone: ---   
Hardware: PC   
OS: Windows XP   
Attachments: A patch to speed up serialization

Description Schulze Alexander 2009-06-23 00:33:40 UTC
The methods RowRecordsAggregate.getStartRowNumberForBlock / getEndRowNumberForBlock iterate over the rows for every block in the sheet.
An increasing row number increases the time for serialization enormously.
The following code improvement decreases the time for serialization of a workbook with 1 sheet, 25 columns and 30000 rows from 7.5 seconds down to 1.6 seconds on my test environment.

    private List _rowRecordsList = null;
    
    /** Returns the physical row number of the first row in a block*/
    private int getStartRowNumberForBlock(int block) {
      //Given that we basically iterate through the rows in order,
      // TODO - For a performance improvement, it would be better to return an instance of
      //an iterator and use that instance throughout, rather than recreating one and
      //having to move it to the right position.
        
      if (_rowRecordsList==null) {
          // build up block-based list of row records
          _rowRecordsList = new ArrayList(_rowRecords.values());
      }
      int startIndex = block * DBCellRecord.BLOCK_SIZE;
      RowRecord row = null;
      if (startIndex < _rowRecordsList.size()) {
          row = (RowRecord) _rowRecordsList.get(startIndex);
      }
      /*
      Iterator rowIter = _rowRecords.values().iterator();
      //Position the iterator at the start of the block
      for (int i=0; i<=startIndex;i++) {
        row = (RowRecord)rowIter.next();
      }
      */
      if (row == null) {
          throw new RuntimeException("Did not find start row for block " + block);
      }

      
      return row.getRowNumber();
    }

    /** Returns the physical row number of the end row in a block*/
    private int getEndRowNumberForBlock(int block) {
        if (_rowRecordsList==null) {
            // build up block-based list of row records
            _rowRecordsList = new ArrayList(_rowRecords.values());
        }

      int endIndex = ((block + 1)*DBCellRecord.BLOCK_SIZE)-1;
      if (endIndex >= _rowRecords.size())
        endIndex = _rowRecords.size()-1;

      RowRecord row = null;
      if (endIndex < _rowRecordsList.size()) {
          row = (RowRecord) _rowRecordsList.get(endIndex);
      }
      /*
      Iterator rowIter = _rowRecords.values().iterator();
      for (int i=0; i<=endIndex;i++) {
        row = (RowRecord)rowIter.next();
      }
      */
      if (row == null) {
          throw new RuntimeException("Did not find start row for block " + block);
      }
      return row.getRowNumber();
Comment 1 Per Lewau 2010-12-01 08:04:01 UTC
Created attachment 26362 [details]
A patch to speed up serialization

I have also had this very problem with the Poi trunk. I used Alexander's approach and that produces a great speedup. It is still worse than O(N) but it's a lot better than the original version (3.5 to .8 seconds). Serializing Excel ought to be O(N) where N is the total number of rows, but I'll admit I know nothing about the Excel file format.

I have attached a patch against Poi 3.7 that implements Alexander's approach. In addition to his version I discard the values list when the map is modified.
Comment 2 Yegor Kozlov 2010-12-01 08:57:53 UTC
Can you upload a file that we can use for a performance benchmark? I understand the intention to speed up serialization, but without a benchmark it is hard to tell how much we will get from this patch. 

Regards,
Yegor
Comment 3 Per Lewau 2010-12-01 14:34:44 UTC
I'll get the test case we used when I get to work tomorrow. But really, any kind of speadsheet with many row (>30000) would probably illustrate this problem. Rows/s being serialized drops rather sharply (like 1/n^2) as the number of row increases. Our worst case, as our application is today, is 10 sheets with about 40000 rows each.

Excel might not be the best format for this, but our clients want to use Excel. We use POI for other files we produce and it would be great if we could produce all files the same way.
Comment 4 Yegor Kozlov 2010-12-08 11:53:54 UTC
Applied in r1043517

Regards,
Yegor
Comment 5 Per Lewau 2010-12-13 15:01:04 UTC
I really appreciate it. Now we won't have to maintain this patch in our own tree. Thanks!