Bug 58885 - Performance regression on addition of merged regions
Summary: Performance regression on addition of merged regions
Alias: None
Product: POI
Classification: Unclassified
Component: XSSF (show other bugs)
Version: unspecified
Hardware: PC Linux
: P2 normal (vote)
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Blocks: 59212
  Show dependency tree
Reported: 2016-01-18 14:43 UTC by David North
Modified: 2016-03-22 08:47 UTC (History)
0 users


Note You need to log in before you can comment on or make changes to this bug.
Description David North 2016-01-18 14:43:47 UTC
We regressed on performance after fixing bug 58443: adding each merged region is now linear in the number of merged regions which already exist, which showed up as quite a sharp performance regression in one of our apps.

See bug 58443 comment 3.

We either need to speed this up or allow bypassing the validation for applications which are confident they're doing it right.
Comment 1 Javen O'Neal 2016-01-18 19:42:51 UTC
Would Sheet.addMergedRegion(CellRangeAddress region, boolean checkForOverlapping) (not implemented) solve this problem?
Comment 2 Javen O'Neal 2016-01-19 05:23:00 UTC
Adding a merged region *safely* (without creating a potentially corrupted workbook) is inherently a O(n) operation, as it need to check all merged regions in the sheet for potential overlap.

A merged region could be added unsafely (potential workbook corruption), but this should not be the default behavior.

If adding multiple merged regions, it still requires O(n) time for each merged region addition. The checking could be deferred until a later time before saving the workbook, but wouldn't save any CPU cycles unless merged regions were added and removed between opening and saving the workbook.

> which showed up as quite a sharp performance regression in one of our apps

Could I get a rough number of merged regions in a sheet so I could profile this? I don't think there's much we can do to make the code faster, so we'll probably need to provide a method that bypasses validation. To make it crystal clear that using the bypassed-validation option can produce a corrupt workbook, I was thinking of calling it public int Sheet.addMergedRegionUnsafe(CellRangeAddress region), as the signature from comment 1 does not stress the consequences of checkForOverlapping=False.
Comment 3 David North 2016-01-19 09:52:50 UTC
Thanks Javen.

I agree blindly adding merged regions without the guards is an inherently unsafe operation, but in our case we're authoring the workbook from scratch in one run, so I think it's reasonable to give us API to say "we're confident we're adding non-overlapping regions".

In terms of API, we could perhaps call the alternative method addMergedRegionUnsafe(CellRangeAddress region) or use an enum as an alternative to the boolean.

The number of merged regions on the sheet in question is 23,002 - which I'll admit is quite a lot!
Comment 4 Javen O'Neal 2016-01-31 04:17:01 UTC
Fixed in r1727776. Use XSSFSheet.addMergedRegionsUnsafe(CellRangeAddress) if you want to add a merged region without incurring the cost of checking for potential intersecting regions. XSSFSheet.validateMergedRegions can be called after the fact if you want to defer the validation.

Updated changelog in r1727779.