Bug 57893

Summary: [PATCH] XSSFSheet.getMergedRegion(int) takes O(n^2) time
Product: POI Reporter: Chris Boyle <cmb-apache>
Component: XSSFAssignee: POI Developers List <dev>
Severity: normal Keywords: PatchAvailable
Priority: P2    
Version: unspecified   
Target Milestone: ---   
Hardware: PC   
OS: Linux   
Bug Depends on:    
Bug Blocks: 58350, 58651    
Attachments: Sample code demonstrating slow performance
Sample input with 50000 merged regions
Sample input with 50000 merged regions

Description Chris Boyle 2015-05-05 16:23:34 UTC
Created attachment 32715 [details]
Sample code demonstrating slow performance

The attached Java compares the time taken to loop over sheet.getMergedRegion(i) for all i, versus the same operation by direct access to the deprecated CTMergeCells.getMergeCellArray().

The sample input I will attach, many-merges.xlsx, has 50k merged regions and I get 8000ms vs 80ms. The real-world input I found this with has 250k merged regions and takes ~300 seconds vs ~300ms.

The reason seems to be:

-> CTMergeCells.getMergeCellArray(int)
-> Xobj.find_element_user(QName, int)

which does:

    for ( Xobj x = _firstChild ; x != null ; x = x._nextSibling )
      if (x.isElem() && x._name.equals( name ) && --i < 0)
        return x.getUser();

So for each mergeCell you access you must compare the element name of every previous mergeCell, whereas the deprecated CTMergeCells.getMergeCellArray() uses find_all_element_users(), avoiding O(n^2).

Would you be interested in a patch? The only thing is, it seems the change should be in CTMergeCells in ooxml-schemas, which seems to be generated(?) and I'm not sure what generates it, or why getMergeCellArray() (the only fast method) is deprecated. getMergeCellList() exists but only hands out a wrapper around the slow getMergeCellArray(int).
Comment 1 Chris Boyle 2015-05-05 16:24:03 UTC
Created attachment 32716 [details]
Sample input with 50000 merged regions
Comment 2 Nick Burch 2015-05-05 16:29:20 UTC
We tried asking the XMLBeans project about the deprecation of the Array getter, but never really got an answer... In other places where speed matters, we've used the array getter with an @ignore for the deprecation. If that would help here, we'd love a patch for it!
Comment 3 Chris Boyle 2015-05-08 09:51:30 UTC
Created attachment 32723 [details]
Sample input with 50000 merged regions

Previous sample input was smaller than stated (999 merges); here's one with the intended size (50000 merges).
Comment 4 Chris Boyle 2015-05-08 10:09:07 UTC
Created attachment 32724 [details]

Here's a patch adding Sheet.getMergedRegions(), implemented with the deprecated array getter as discussed.
Comment 5 Chris Boyle 2015-05-08 10:10:07 UTC
(This is a git patch, including 1 binary file for the test.)
Comment 6 David North 2015-07-13 13:00:43 UTC
Patch applied. Fixed in r1690661
Comment 7 Javen O'Neal 2015-08-13 03:50:53 UTC
If there are no merged regions on a sheet, HSSF returns an empty List while XSSF raises an IllegalStateException. I would assume that both would behave the same: return an empty list and let the calling code check for empty list. No merged regions doesn't seem exceptional, and asking a user to check if numMergedRegions before all getMergedRegions seems unnecessary. My 2 cents.
Comment 8 David North 2015-10-05 10:38:15 UTC
Note: comment 7 was fixed under bug 58350