Bug 57893 - [PATCH] XSSFSheet.getMergedRegion(int) takes O(n^2) time
Summary: [PATCH] XSSFSheet.getMergedRegion(int) takes O(n^2) time
Alias: None
Product: POI
Classification: Unclassified
Component: XSSF (show other bugs)
Version: unspecified
Hardware: PC Linux
: P2 normal (vote)
Target Milestone: ---
Assignee: POI Developers List
Keywords: PatchAvailable
Depends on:
Blocks: 58350 58651
  Show dependency tree
Reported: 2015-05-05 16:23 UTC by Chris Boyle
Modified: 2015-11-25 11:35 UTC (History)
0 users

Sample code demonstrating slow performance (1.45 KB, text/x-java)
2015-05-05 16:23 UTC, Chris Boyle
Sample input with 50000 merged regions (11.13 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
2015-05-05 16:24 UTC, Chris Boyle
Sample input with 50000 merged regions (829.02 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
2015-05-08 09:51 UTC, Chris Boyle
Patch (588.68 KB, patch)
2015-05-08 10:09 UTC, Chris Boyle
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Chris Boyle 2015-05-05 16:23:34 UTC
Created attachment 32715 [details]
Sample code demonstrating slow performance

The attached Java compares the time taken to loop over sheet.getMergedRegion(i) for all i, versus the same operation by direct access to the deprecated CTMergeCells.getMergeCellArray().

The sample input I will attach, many-merges.xlsx, has 50k merged regions and I get 8000ms vs 80ms. The real-world input I found this with has 250k merged regions and takes ~300 seconds vs ~300ms.

The reason seems to be:

-> CTMergeCells.getMergeCellArray(int)
-> Xobj.find_element_user(QName, int)

which does:

    for ( Xobj x = _firstChild ; x != null ; x = x._nextSibling )
      if (x.isElem() && x._name.equals( name ) && --i < 0)
        return x.getUser();

So for each mergeCell you access you must compare the element name of every previous mergeCell, whereas the deprecated CTMergeCells.getMergeCellArray() uses find_all_element_users(), avoiding O(n^2).

Would you be interested in a patch? The only thing is, it seems the change should be in CTMergeCells in ooxml-schemas, which seems to be generated(?) and I'm not sure what generates it, or why getMergeCellArray() (the only fast method) is deprecated. getMergeCellList() exists but only hands out a wrapper around the slow getMergeCellArray(int).
Comment 1 Chris Boyle 2015-05-05 16:24:03 UTC
Created attachment 32716 [details]
Sample input with 50000 merged regions
Comment 2 Nick Burch 2015-05-05 16:29:20 UTC
We tried asking the XMLBeans project about the deprecation of the Array getter, but never really got an answer... In other places where speed matters, we've used the array getter with an @ignore for the deprecation. If that would help here, we'd love a patch for it!
Comment 3 Chris Boyle 2015-05-08 09:51:30 UTC
Created attachment 32723 [details]
Sample input with 50000 merged regions

Previous sample input was smaller than stated (999 merges); here's one with the intended size (50000 merges).
Comment 4 Chris Boyle 2015-05-08 10:09:07 UTC
Created attachment 32724 [details]

Here's a patch adding Sheet.getMergedRegions(), implemented with the deprecated array getter as discussed.
Comment 5 Chris Boyle 2015-05-08 10:10:07 UTC
(This is a git patch, including 1 binary file for the test.)
Comment 6 David North 2015-07-13 13:00:43 UTC
Patch applied. Fixed in r1690661
Comment 7 Javen O'Neal 2015-08-13 03:50:53 UTC
If there are no merged regions on a sheet, HSSF returns an empty List while XSSF raises an IllegalStateException. I would assume that both would behave the same: return an empty list and let the calling code check for empty list. No merged regions doesn't seem exceptional, and asking a user to check if numMergedRegions before all getMergedRegions seems unnecessary. My 2 cents.
Comment 8 David North 2015-10-05 10:38:15 UTC
Note: comment 7 was fixed under bug 58350