Bug 66418 - Performance/Scalability issue in XSSFSheet.groupRow
Summary: Performance/Scalability issue in XSSFSheet.groupRow
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: XSSF (show other bugs)
Version: 5.2.2-FINAL
Hardware: All All
: P2 normal (vote)
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-01-11 21:19 UTC by dan.rabe
Modified: 2023-02-10 13:02 UTC (History)
0 users



Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description dan.rabe 2023-01-11 21:19:17 UTC
We've noticed a scalability issue when there are a large number of grouped rows. We call XSSFSheet.groupRow() for each group. That implementation sets the outline level on the row, but then it also calls setSheetFormatPrOutlineLevelRow(), which looks at EVERY XSSFRow / CTRow to get its outline level. So if I have, say, 100,000 rows, and I call groupRow() 10,000 times, it will be reading a CTRow outline level 1 billion times. This can be quite slow.

I can think of a couple of different approaches to fixing this; would appreciate some input and maybe I can try to provide a patch.

Option 1: Provide another groupRow method that does NOT calculate the sheet-level outline level, for example:

public void groupRow(int fromRow, int toRow, boolean autoCalculateSheetOutlineLevel)

(If autoCalculateSheetOutlineLevel is true, it will behave as it does today; if false, it will skip execution of setSheetFormatPrOutlineLevelRow()).

Then in XSSFSheet, make this method public:
private void setSheetFormatPrOutlineLevelRow()
or expose another public method that wraps it.

Option 2:
Create a new method on XSSFSheet that takes a list of from/to row indexes, so that it can set a lot of outline levels, and then call setSheetFormatPrOutlineLevelRow() just once at the end. Maybe something like

public void groupRows(List<Interval> rowGroups)

where Interval would be a small class to encapsulate the "from" and "to" row indexes.

Thoughts?
Comment 1 PJ Fanning 2023-01-11 21:47:59 UTC
Thanks Dan for raising the issue. It would be great if you or your colleagues could come up with a patch because basically, POI is no longer very active volunteer wise.

My thoughts are that this issue might be better solved without API changes. That the small rewrites of the under the hood functionality might solve the problem. Changing the APIs brings more headaches - HSSF/XSSF/SXSSF code APIs ideally should match - more test coverage needed, etc.

I looked at SXSSFSheet and with groupRow, it just calls pr.setOutlineLevelRow if the new groupRow call means the value needs to be increased. XSSFSheet could probably follow same strategy. No need to traverse all the rows to see what the max is.

ungroupRow calls can be left to go and do a full traverse (because you could be ungrouping the row(s) that caused the current max to be set.
Comment 2 dan.rabe 2023-01-12 15:46:14 UTC
Brilliant, I like your approach much better! I've played with it a bit and it seems to be working well. I need to jump through some hoops before I can submit a patch, but when I'm ready, would you prefer that I create a new "[PATCH]" bug, or attach the patch to this bug?
Comment 3 PJ Fanning 2023-01-12 19:53:04 UTC
attach a patch file to this issue or create a Github PR.
Comment 4 PJ Fanning 2023-01-28 20:38:02 UTC
I added r1907065

It relies on existing test coverage - which isn't great but there is some coverage.
Comment 5 PJ Fanning 2023-02-10 13:02:03 UTC
Closing, as done. Please test and reopen if there are problems.