Bug 55280 - Poor (Blocking) Performance of Merged Regions
Summary: Poor (Blocking) Performance of Merged Regions
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: XSSF (show other bugs)
Version: 3.9-FINAL
Hardware: PC All
: P2 normal (vote)
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-07-19 10:25 UTC by proplant
Modified: 2014-09-17 06:50 UTC (History)
0 users



Attachments
55280-2 (3.10 KB, patch)
2014-09-03 10:25 UTC, Yaniv Kunda
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description proplant 2013-07-19 10:25:35 UTC
The problem can be reproduced by the following test case:

            Workbook w = new XSSFWorkbook();
            Sheet s = w.createSheet();
            for (int row = 0; row < 5000; ++row)
                s.addMergedRegion(new CellRangeAddress(row, row, 0, 3));

            s.shiftRows(0, 4999, 1);        // takes 15 minutes

shiftRows() takes 15 minutes because it shifts all merged regions, which is done be removing and re-adding merged regions one by one. Especially the implicit calls to Sheet.removeMergedRegion(idx) are slow.
Thus working with a lot of merged regions is limited because a (maybe implicit) shift of the merged regions will block the application.

An approach to solve the problem may be to implement (and use internally) a method which removes multiple merged regions at once, e.g.

Sheet.removeMergedRegions(int[] ids)

This can be implemented much more efficient than calling removeMergedRegion(id) for each single id.

Regards

Olaf
Comment 1 Nick Burch 2013-07-19 11:19:32 UTC
If you are able to work up a patch, we'd be happy to review it

Also, does this same problem affect HSSF too? If not, it might be worth looking at the HSSF logic to see if we can crib from that. If it does, we'll want to try to fix both if possible.
Comment 2 proplant 2013-07-19 12:34:32 UTC
We migrated from HSSF to XSSF and did not notice the problem with HSSF.

I tried the following code to do a faster remove from outside POI. It is copied and changed from XSSFSheet.removeMergedRegion(int):

----------------------
    public static void removeMergedRegions(XSSFSheet sheet, HashSet<Integer> indices) {
        if (indices.size() == 0)
            return;

        CTWorksheet worksheet = getCTWorksheet(sheet);     // via reflection
        CTMergeCells ctMergeCells = worksheet.getMergeCells();

        CTMergeCell[] mergeCellsArray = new CTMergeCell[ctMergeCells.sizeOfMergeCellArray() - indices.size()];
        int d = 0;
        for (int i = 0 ; i < ctMergeCells.sizeOfMergeCellArray() ; i++) {
            if (! indices.contains(i)) {
                mergeCellsArray[d++] = ctMergeCells.getMergeCellArray(i);
            }
        }
        if(mergeCellsArray.length > 0){
            ctMergeCells.setMergeCellArray(mergeCellsArray);
        } else{
            worksheet.unsetMergeCells();
        }
    }
--------------------
It worked so far, but it doesn't fix the internal usages of removeMergedRegion. At this point we decided to use a workaround (different excel output).
Comment 3 Nick Burch 2013-07-19 12:37:20 UTC
(In reply to proplant from comment #2)
> We migrated from HSSF to XSSF and did not notice the problem with HSSF.

OK, in that case it is most likely related to how we drive xmlbeans.


> It worked so far, but it doesn't fix the internal usages of
> removeMergedRegion. At this point we decided to use a workaround (different
> excel output).

If you do have a chance to work this up into a full patch, that'd be great!
Comment 4 proplant 2013-07-19 13:48:10 UTC
(In reply to Nick Burch from comment #3)
> (In reply to proplant from comment #2)
> If you do have a chance to work this up into a full patch, that'd be great!

I dont know. Which steps do I have to take?
Until now I only downloaded the source.zip.
Comment 5 Nick Burch 2013-07-22 21:11:04 UTC
http://poi.apache.org/guidelines.html has some information. Your best bet though would probably be to ask on the dev list. We have added several new committers recently, who have gone through the process of learning how to submit patches, so there are plenty of people about who can offer you advice on how to do it!
Comment 6 Dominik Stadler 2014-08-31 20:26:05 UTC
Using dynaTrace to analyze where the time is actually spent here, I saw the following top time consumers:

Method	Exec Sum	Breakdown	Class
count(Xobj, QName, QNameSet)	7.84s	CPU: 78 %, Sync: 0 %, Wait: 0 %, Suspension: 0 %, I/O: 22 %	org.apache.xmlbeans.impl.store.Locale
remove_element(QName, int)	6.84s	CPU: 78 %, Sync: 0 %, Wait: 0 %, Suspension: 0 %, I/O: 22 %	org.apache.xmlbeans.impl.store.Xobj
get_locale()	5.37s	CPU: 78 %, Sync: 0 %, Wait: 0 %, Suspension: 0 %, I/O: 22 %	org.apache.xmlbeans.impl.store.Xobj
find_element_user(QName, int)	3.95s	CPU: 78 %, Sync: 0 %, Wait: 0 %, Suspension: 0 %, I/O: 22 %	org.apache.xmlbeans.impl.store.Xobj

So on top initially is not remove(), but actually count(), i.e. one of the size-methods.

Drilling down into the details showed that the calls to count() are mainly done because XSSFSheet.removeMergedRegion() calls sizeOfMergeCellArray() many times but this call is quite costly in XMLBeans!

Removing the calls cut the time nearly in half from 29 to 15 seconds for 1500 merged regions.

I commited this small change as r1621631, this made the most time-consuming method now actually remove_element() followed by getMergeCellArray().

By introducing a bulk-remove method XSSFSheet.removeMergedRegions(Set<Integer>) as outlined below, the time needed to shift is reduced to around 800ms for the shifting with 5000 merged regions, which I think should suffice for almost all  use cases.

The second set of changes is committed as r1621633.
Comment 7 Yaniv Kunda 2014-09-03 10:25:23 UTC
Created attachment 31960 [details]
55280-2

Further enhancement of performance by:
- replacing calls to xmlbeans getMergeCellArray(i) methods with direct array access
- replacing manual array copy with System.arraycopy