Bug 56705 - [PATCH] XSSFWorkbook.getSheet(String) is slow with many sheets
Summary: [PATCH] XSSFWorkbook.getSheet(String) is slow with many sheets
Alias: None
Product: POI
Classification: Unclassified
Component: XSSF (show other bugs)
Version: unspecified
Hardware: PC Linux
: P2 normal (vote)
Target Milestone: ---
Assignee: POI Developers List
Keywords: PatchAvailable
Depends on:
Reported: 2014-07-08 16:16 UTC by Chris Boyle
Modified: 2015-07-22 08:49 UTC (History)
0 users

patch to add a map from name to sheet (3.25 KB, patch)
2014-07-08 16:16 UTC, Chris Boyle
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Chris Boyle 2014-07-08 16:16:41 UTC
Created attachment 31802 [details]
patch to add a map from name to sheet

XSSFWorkbook.getSheet(String) showed up as a hot-spot in profiling, as the project I'm on calls it very often to resolve cross-sheet formula references.

Currently it simply walks the list of sheets looking for one with a matching name. The attached patch adds a map from name to sheet, which reduces the time our operation spends in this method by about 80%.
Comment 1 Nick Burch 2014-07-14 19:02:19 UTC
Looking at the patch, we already have a list of XSSFSheet objects, and this adds a map too by name. I can't help but wonder if we couldn't have a special kind of sorted map that provides both the old "give me the sheets" of the list plus the new "lookup by name"

It's possible that my approach would be more complex or slower though, even if it would mean only one object to update instead of two

Comment 2 Chris Boyle 2014-08-26 10:47:13 UTC
Having just looked at this again: the map is a LinkedHashMap and most of the required operations are possible using only the map with no worse efficiency than before (walk the map's values() which is no worse than walking the previous List).

Unfortunately it all falls down at setSheetOrder(sheetname, pos). With LinkedHashMap it would have to remove and reinsert every sheet between its desired position and the end. You would need a kind of LinkedHashMap with the ability to reorder its internal list. I don't know of one and I'm not sure adding one and maintaining it would be an improvement over the patch as it stands.
Comment 3 Dominik Stadler 2015-05-19 19:14:34 UTC
I'm not sure if this is a big problem in almost all other use cases of POI as usually there are only a few sheets and getSheet() is not called that often as well. 

So by optimizing this one specific case we would actually add a bit more to some other methods and also make the code more complex and harder to maintain.

How many sheets do you have in those workbooks? And how often do you actually call it? I wouldn't optimize this inside POI if it is simply called much too often in your case.

Would it maybe be better to solve your specific case here with a small cache in your code which remembers these sheet-name to index mapping outside of POI core code and then uses getSheetAt() to have quick access?
Comment 4 David North 2015-07-22 08:49:24 UTC
Further review at our end suggests this optimization is no longer necessary, so the simplest thing to do is WONTFIX this one for now.

I had pondered a TreeSet or some other ordered set implementation as a single point of truth for the sheets in a workbook, but it can wait for concrete requirements.