Bug 60397 - very slow Cell Merge for SXSSFWorkbook
Summary: very slow Cell Merge for SXSSFWorkbook
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: SXSSF (show other bugs)
Version: 3.15-FINAL
Hardware: Macintosh All
: P2 normal with 7 votes (vote)
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-11-21 01:01 UTC by Marc
Modified: 2020-12-11 17:51 UTC (History)
2 users (show)



Attachments
Add a method that doesn't calculate merge cell array (5.30 KB, patch)
2018-11-28 20:16 UTC, Nail Samatov
Details | Diff
Patch to fix performance when adding merged regions (4.44 KB, patch)
2020-12-11 15:53 UTC, Alex Herve
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Marc 2016-11-21 01:01:01 UTC
When merging the cells for a SXSSFWorkbook the rendering speed is drastically reduced. This can be demonstrated by adapting the sample code to merge each of the cells.

        SXSSFWorkbook wb = new SXSSFWorkbook(100); 
        Sheet sh = wb.createSheet();
        for(int rownum = 0; rownum < 10000; rownum = rownum + 2){
            Row row1 = sh.createRow(rownum);
            Row row2 = sh.createRow(rownum + 1);
            for(int cellnum = 0; cellnum < 10; cellnum++){
                Cell cell1 = row1.createCell(cellnum);
                String address = new CellReference(cell1).formatAsString();
                cell1.setCellValue(address);

                Cell cell2 = row2.createCell(cellnum);
                cell2.setCellValue("");

                sh.addMergedRegion(new CellRangeAddress(
                        rownum, 
                        rownum + 1, 
                        cellnum,
                        cellnum  
                ));
            }
        }
Comment 1 Javen O'Neal 2016-11-21 17:36:57 UTC
Because merged regions cannot overlap without producing a corrupt document, POI may be checking the list of merged regions on a sheet for potential intersections before adding a merged region. This gives O(N) behavior for adding one region instead of the expected O(1).

For XSSF and SXSSF, we added addMergedRegionUnsafe [1] that skips these checks for speed, but may produce a corrupt document when opened in Excel. Have you tried addMergedRegionUnsafe?

[1] https://poi.apache.org/apidocs/org/apache/poi/xssf/streaming/SXSSFSheet.html#addMergedRegionUnsafe(org.apache.poi.ss.util.CellRangeAddress)
Comment 2 Marc 2016-11-21 23:48:58 UTC
Hi Javen,

Thanks for the info - the unsafe method is definitely faster (and I'll update all our code to use it, as there are tests in place to ensure the spreadsheets are rendered correctly - so corruption shouldn't be an issue). However, it's still a lot slower than not using any merged cells. I've run the following benchmark tests to demonstrate:

2000 Rows

- No merging -> 1 second
- Unsafe merging -> 2 seconds
- Safe merging -> 33 seconds

10,000 Rows

- No merging -> 2 seconds
- Unsafe merging -> 39 seconds
- Safe merging - > hadn't completed after 5 mins, so gave up :-)

Cheers,

Marc
Comment 3 Javen O'Neal 2016-11-22 02:01:38 UTC
Here's the implementation [1]. Looks fairly benign, though there could be performance problems in method calls.

Can you test getNumberOfCells and formatAsString to see if either of these are the cause of the 39 second performance?
for (CellRangeAddress region : regionsToAdd) {
    region.getNumberOfCells();
    region.formatAsString();
}

[1] https://svn.apache.org/viewvc/poi/trunk/src/ooxml/java/org/apache/poi/xssf/usermodel/XSSFSheet.java?revision=1768589&view=markup#l347
Comment 4 Javen O'Neal 2017-01-05 09:53:46 UTC
It would probably be faster for us to read the merged regions into a java List and discard the CTMergedRegions, operate on the Java List, and then recreate the CTMergedRegions XML nodes when writing the workbook. Not sure whether this would improve the speed here, but I'm guessing worksheet.getMergeCells() and ctMergeCells.addNewMergeCell() are expensive calls since it's creating and linking XML nodes in a DOM. This work could be deferred to when the workbook is written out.
Comment 5 Stephen Webster 2017-05-09 16:54:01 UTC
I have been watching this bug, and wanted to add some information that I have gathered since I didn't see any progress.  Here is my testcase.

for(int i=0; i < 20000; i++) {
   for(int j=0; j < 30; j+=3) {
      CellRangeAddress range = new CellRangeAddress(i,i,j,j+2);
      sheet.addMergedRegionUnsafe(range);
   }
}

The results get quite a bit worse as the number of rows/merged region number increases.

1000  rows -> 1.2 seconds
10000 rows -> 100 seconds
20000 rows -> Over 5 minutes.

I dropped the above code and profiled it and turns out the time spent is in the return statement of addMergedRegion(), ctMergeCells.sizeOfMergeCellArray();

I am not using the return value of this method, after creating a locally modified version of the XSSFSheet class, I created a method without the return statement and the testcase finishes in less then 1 second instead of over 5 minutes for 20,000 rows.

Here is the stack from profiler which was 90% of time spent
Stack Trace
org.apache.xmlbeans.impl.store.Locale.count(Xobj, QName, QNameSet)
org.apache.xmlbeans.impl.store.Xobj.count_elements(QName)	
org.openxmlformats.schemas.spreadsheetml.x2006.main.impl.CTMergeCellsImpl.sizeOfMergeCellArray()
org.apache.poi.xssf.usermodel.XSSFSheet.addMergedRegion(CellRangeAddress, boolean)
org.apache.poi.xssf.usermodel.XSSFSheet.addMergedRegionUnsafe(CellRangeAddress)
Comment 6 Greg Woolsey 2017-05-09 17:09:29 UTC
That makes sense.  All XmlBeans methods are ridiculously expensive.  Javen's suggestion of using temporary POJOs to avoid multiple bean operations seems like a good one, as well as a new method with a void return for cases that don't need it, like yours.

Many other places we've elected to track data outside of the bean structures for performance reasons, especially in things like the *Evaluator classes.  This sounds like another candidate.
Comment 7 Nail Samatov 2018-11-28 20:16:56 UTC
Created attachment 36284 [details]
Add a method that doesn't calculate merge cell array

I reproduced the same issue as Stephen Webster and found that adding a new method that doesn't calculate a merge cell array solves the issue for me. It would help a lot if you add such method.

I attached a patch that should solve the issue, could you apply it if it's ok?
Comment 8 Roberto Navarro 2019-06-19 04:41:48 UTC
I tested an alternative solution, keeping row number in SXSSFRow and index in SXSSFCell boost performance avoiding iterations on TreeSets.
Comment 9 Alex Herve 2020-12-10 09:32:27 UTC
Is there any possibility of adding the patch, after adjusting for changes since it was created?

This issue is causing creating excel files in my tool to take much longer than needed, with almost 60% of the run time spent in the 
org.openxmlformats.schemas.spreadsheetml.x2006.main.impl.CTMergeCellsImpl.sizeOfMergeCellArray()
method.
Comment 10 PJ Fanning 2020-12-10 19:59:27 UTC
The patch file is out of date. If anyone feels like adding a new patch, we will consider it. My preference would be not add extra API methods like the existing patch does. The existing `public int addMergedRegionUnsafe(CellRangeAddress region)` in SXSSFSheet could be be changed to return a dummy value, eg 0 if calculating the int is expensive.

As to Roberto Navarro's comment above, we have added a change today that does that. https://github.com/apache/poi/pull/206
Comment 11 Alex Herve 2020-12-11 13:25:54 UTC
That sounds good to me.

I will work on the patch then. I think we could maybe improve that solution by returning the getCount() result instead of a dummy value, as that should be equivalent unless I am misunderstanding the code.

Maybe even switching to just directly using the count property instead of calling the sizeOfMergeCellArray() method if it is so expensive. Then all merge methods would see an improvement, not only the unsafe ones.
Comment 12 Alex Herve 2020-12-11 15:53:08 UTC
Created attachment 37601 [details]
Patch to fix performance when adding merged regions

Here is a patch that uses the getCount() method on the ctMergeCells instead of the expensive sizeOfMergeCellArray().

It passes all tests and I added a new test just to check if the counts are valid both after adding the first region and then a second one (following the structure of the test from bug 63371).

For my use case in my tool, using the SNAPSHOT with the fix decreases runtime by over 50% for creating a large excel file with many merged cells.
Comment 13 PJ Fanning 2020-12-11 17:51:42 UTC
Thanks - added r1884329