Bug 61862 - Improve TreeMap implementation to cache first and last key
Summary: Improve TreeMap implementation to cache first and last key
Alias: None
Product: POI
Classification: Unclassified
Component: SS Common (show other bugs)
Version: unspecified
Hardware: PC All
: P2 enhancement (vote)
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Reported: 2017-12-06 11:17 UTC by Javen O'Neal
Modified: 2017-12-07 00:00 UTC (History)
0 users

an example SortedMap subclass that caches the first key and last key, written by Javen. No unit test. (3.00 KB, text/plain)
2017-12-06 11:17 UTC, Javen O'Neal

Note You need to log in before you can comment on or make changes to this bug.
Description Javen O'Neal 2017-12-06 11:17:48 UTC
Created attachment 35589 [details]
an example SortedMap subclass that caches the first key and last key, written by Javen. No unit test.

POI has several TreeMaps to maintain the sorted order of cells in a row or rows in a sheet.

Getting the last and first keys on those maps can be expensive (O(log N)) if called frequently.

Let's investigate if any of these maps would benefit from a Map implementation that cached the first and last key, making those keys available in O(1).

Most common example:
Comment 1 Javen O'Neal 2017-12-06 11:23:45 UTC
I have not tested this new caching TreeMap to see if it's faster or slower than a regular TreeMap. TreeMap's O(log(N)) performance is probably pretty good.
The first or last key of a TreeMap containing 1,000,000 rows would traverse through ~20 nodes (2^20=1,000,000).

The constant overhead of keeping the cache up to date for every modification just to make firstKey/lastKey faster for getFirstRowNum() and getLastRowNum() faster, which are probably called no more than once per sheet anyways... This might not actually be an improvement for XSSFSheet.
Comment 2 Greg Woolsey 2017-12-06 23:46:01 UTC
How about caching the first and last keys and a "dirty" flag or clearing the cache values on structural changes?  Then the first/last would only need to be calculated on first access after changes.  I can see these being referenced many times when evaluating formulas, for example, but we wouldn't want them to be continuously updated as rows are being created.  Lazy init, eager clearing/dirtying would perhaps still offer a formula performance gain without adding much if any overhead when creating documents.

The only cases where it wouldn't add much value, and still wouldn't add much overhead beyond what TreeMap already does, would be cases where data changes and formula evaluations/data scanning are done repeatedly or in some iterative/interleaved pattern.  In my experience these are not typical use cases.
Comment 3 Greg Woolsey 2017-12-07 00:00:12 UTC
All said and done, however, the Map interface has so many possible mutation vectors, including submaps and iterators, it likely isn't worth it.  Caching in the evaluation structures is I think sufficient.  There we already have a required manual method for clearing cached values when data changes.  As you mentioned, most use cases where users iterate over things like rows, it's done once per sheet, and the cost is negligible.