Bug 60397 - very slow Cell Merge for SXSSFWorkbook
Summary: very slow Cell Merge for SXSSFWorkbook
Status: NEW
Alias: None
Product: POI
Classification: Unclassified
Component: SXSSF (show other bugs)
Version: 3.15-FINAL
Hardware: Macintosh All
: P2 normal with 5 votes (vote)
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Reported: 2016-11-21 01:01 UTC by Marc
Modified: 2018-11-28 20:16 UTC (History)
1 user (show)

Add a method that doesn't calculate merge cell array (5.30 KB, patch)
2018-11-28 20:16 UTC, Nail Samatov
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Marc 2016-11-21 01:01:01 UTC
When merging the cells for a SXSSFWorkbook the rendering speed is drastically reduced. This can be demonstrated by adapting the sample code to merge each of the cells.

        SXSSFWorkbook wb = new SXSSFWorkbook(100); 
        Sheet sh = wb.createSheet();
        for(int rownum = 0; rownum < 10000; rownum = rownum + 2){
            Row row1 = sh.createRow(rownum);
            Row row2 = sh.createRow(rownum + 1);
            for(int cellnum = 0; cellnum < 10; cellnum++){
                Cell cell1 = row1.createCell(cellnum);
                String address = new CellReference(cell1).formatAsString();

                Cell cell2 = row2.createCell(cellnum);

                sh.addMergedRegion(new CellRangeAddress(
                        rownum + 1, 
Comment 1 Javen O'Neal 2016-11-21 17:36:57 UTC
Because merged regions cannot overlap without producing a corrupt document, POI may be checking the list of merged regions on a sheet for potential intersections before adding a merged region. This gives O(N) behavior for adding one region instead of the expected O(1).

For XSSF and SXSSF, we added addMergedRegionUnsafe [1] that skips these checks for speed, but may produce a corrupt document when opened in Excel. Have you tried addMergedRegionUnsafe?

[1] https://poi.apache.org/apidocs/org/apache/poi/xssf/streaming/SXSSFSheet.html#addMergedRegionUnsafe(org.apache.poi.ss.util.CellRangeAddress)
Comment 2 Marc 2016-11-21 23:48:58 UTC
Hi Javen,

Thanks for the info - the unsafe method is definitely faster (and I'll update all our code to use it, as there are tests in place to ensure the spreadsheets are rendered correctly - so corruption shouldn't be an issue). However, it's still a lot slower than not using any merged cells. I've run the following benchmark tests to demonstrate:

2000 Rows

- No merging -> 1 second
- Unsafe merging -> 2 seconds
- Safe merging -> 33 seconds

10,000 Rows

- No merging -> 2 seconds
- Unsafe merging -> 39 seconds
- Safe merging - > hadn't completed after 5 mins, so gave up :-)


Comment 3 Javen O'Neal 2016-11-22 02:01:38 UTC
Here's the implementation [1]. Looks fairly benign, though there could be performance problems in method calls.

Can you test getNumberOfCells and formatAsString to see if either of these are the cause of the 39 second performance?
for (CellRangeAddress region : regionsToAdd) {

[1] https://svn.apache.org/viewvc/poi/trunk/src/ooxml/java/org/apache/poi/xssf/usermodel/XSSFSheet.java?revision=1768589&view=markup#l347
Comment 4 Javen O'Neal 2017-01-05 09:53:46 UTC
It would probably be faster for us to read the merged regions into a java List and discard the CTMergedRegions, operate on the Java List, and then recreate the CTMergedRegions XML nodes when writing the workbook. Not sure whether this would improve the speed here, but I'm guessing worksheet.getMergeCells() and ctMergeCells.addNewMergeCell() are expensive calls since it's creating and linking XML nodes in a DOM. This work could be deferred to when the workbook is written out.
Comment 5 Stephen Webster 2017-05-09 16:54:01 UTC
I have been watching this bug, and wanted to add some information that I have gathered since I didn't see any progress.  Here is my testcase.

for(int i=0; i < 20000; i++) {
   for(int j=0; j < 30; j+=3) {
      CellRangeAddress range = new CellRangeAddress(i,i,j,j+2);

The results get quite a bit worse as the number of rows/merged region number increases.

1000  rows -> 1.2 seconds
10000 rows -> 100 seconds
20000 rows -> Over 5 minutes.

I dropped the above code and profiled it and turns out the time spent is in the return statement of addMergedRegion(), ctMergeCells.sizeOfMergeCellArray();

I am not using the return value of this method, after creating a locally modified version of the XSSFSheet class, I created a method without the return statement and the testcase finishes in less then 1 second instead of over 5 minutes for 20,000 rows.

Here is the stack from profiler which was 90% of time spent
Stack Trace
org.apache.xmlbeans.impl.store.Locale.count(Xobj, QName, QNameSet)
org.apache.poi.xssf.usermodel.XSSFSheet.addMergedRegion(CellRangeAddress, boolean)
Comment 6 Greg Woolsey 2017-05-09 17:09:29 UTC
That makes sense.  All XmlBeans methods are ridiculously expensive.  Javen's suggestion of using temporary POJOs to avoid multiple bean operations seems like a good one, as well as a new method with a void return for cases that don't need it, like yours.

Many other places we've elected to track data outside of the bean structures for performance reasons, especially in things like the *Evaluator classes.  This sounds like another candidate.
Comment 7 Nail Samatov 2018-11-28 20:16:56 UTC
Created attachment 36284 [details]
Add a method that doesn't calculate merge cell array

I reproduced the same issue as Stephen Webster and found that adding a new method that doesn't calculate a merge cell array solves the issue for me. It would help a lot if you add such method.

I attached a patch that should solve the issue, could you apply it if it's ok?