Bug 58525 - [PATCH] Optimization of adding/modifying named cells
Summary: [PATCH] Optimization of adding/modifying named cells
Status: NEW
Alias: None
Product: POI
Classification: Unclassified
Component: HSSF (show other bugs)
Version: 3.13-dev
Hardware: PC All
: P2 enhancement (vote)
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords: PatchAvailable
Depends on:
Blocks:
 
Reported: 2015-10-23 09:37 UTC by Richard Hart
Modified: 2019-01-06 13:00 UTC (History)
0 users



Attachments
patch file created with Git. (19.00 KB, patch)
2015-10-23 09:37 UTC, Richard Hart
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Richard Hart 2015-10-23 09:37:53 UTC
Created attachment 33196 [details]
patch file created with Git.

If a large number of named cells are used, a great deal of time is spent determining if the name is already used in a given sheet due to looking up the name by looping through an array of names.

This optimization adds a Map<Integer sheetNumber, Map<String name, NameRecord>> wherever the array of names exists. The NameRecord may be an HSSFName depending on which model is being used.

The duplicate test in HSSFName.setNameName() uses the map to find a duplicate if one exists rather than looping through the names array.

As far as I know this does not break any existing functionality. All unit tests executed in the ant build are successful although none of the unit tests have been modified to use the new getName(name, sheetNumber) method. 

This change was prompted by a project that builds a large excel workbook of 58 sheets with about 3000 named cells per sheet. The time required to produce the workbook was 4:47 before the modification and 3:14 after, about a 30% improvement.

The change is based on poi-3.13. To my knowledge this is a release version.

My local project is named poi-3.13-pwc. The included patch references that project name.
Comment 1 Javen O'Neal 2015-10-23 11:25:07 UTC
It looks like you're maintaining two data structures that contain essentially the same content, a List and a HashMap. In the interest of keeping POI's memory footprint down, would it be possible to use just one data structure? It might even be faster, since write operations wouldn't need to modify 2 data structures.

> public LinkTable(int numberOfSheets, WorkbookRecordList workbookRecordList) {
>   _workbookRecordList = workbookRecordList;
>   _definedNames = new ArrayList<NameRecord>();
>   _nameRecordMap = new HashMap<Integer, Map<String, NameRecord>>();
>   ...
>   while(true) {
>     if (nextClass == NameRecord.class) {
>       NameRecord nr = (NameRecord)rs.getNext();
>       _definedNames.add(nr);
>       addToNamesMap(nr);
>     }
>   }

> public void removeBuiltinRecord(byte name, int sheetIndex) {
>  NameRecord record = getSpecificBuiltinRecord(name, sheetIndex);
>   if (record != null) {
>    _definedNames.remove(record);
>    removeFromNameRecordMap(record);
>   }
> }
Comment 2 Richard Hart 2015-10-23 11:40:14 UTC
(In reply to Javen ONeal from comment #1)
> It looks like you're maintaining two data structures that contain
> essentially the same content, a List and a HashMap. In the interest of
> keeping POI's memory footprint down, would it be possible to use just one
> data structure? It might even be faster, since write operations wouldn't
> need to modify 2 data structures.
> 
> > public LinkTable(int numberOfSheets, WorkbookRecordList workbookRecordList) {
> >   _workbookRecordList = workbookRecordList;
> >   _definedNames = new ArrayList<NameRecord>();
> >   _nameRecordMap = new HashMap<Integer, Map<String, NameRecord>>();
> >   ...
> >   while(true) {
> >     if (nextClass == NameRecord.class) {
> >       NameRecord nr = (NameRecord)rs.getNext();
> >       _definedNames.add(nr);
> >       addToNamesMap(nr);
> >     }
> >   }
> 
> > public void removeBuiltinRecord(byte name, int sheetIndex) {
> >  NameRecord record = getSpecificBuiltinRecord(name, sheetIndex);
> >   if (record != null) {
> >    _definedNames.remove(record);
> >    removeFromNameRecordMap(record);
> >   }
> > }

Indeed. I first tried to replace the names array with a map but as I got deeper, more and more would have to be changed. The issue is so many methods use the getName(index) method which is not only fast but cannot be done on a map. I did not want to break any existing functionality so I added the map.
Comment 3 Richard Hart 2015-10-23 11:54:48 UTC
If I am not mistaken, the names array also contains names that are empty ("") such as when the cell is first created. Giving the cell a name is not required. The map can have only one empty name. So there is a difference between the two stores. The map contains only names whose length > 0.
Comment 4 Javen O'Neal 2015-10-23 12:05:02 UTC
There might be some off-the-shelf solutions available for this, though the underlying implementation of MultiKeyMap would determine if this saves any memory over maintaining a List and a Map in parallel--at least a hybrid data structure would relieve the developers of touching twice as many data structures and twice as many functions.
https://commons.apache.org/proper/commons-collections/javadocs/api-3.2.1/org/apache/commons/collections/map/MultiKeyMap.html

Alternatively, if the performance hit is small enough, some type of SortedMap/TreeMap/LinkedHashMap might work here:
getName(int index) -> map.values().toArray()[index]

If those suggestions don't work, you could roll your own hybrid data structure to maintain two underlying data structures in parallel. Better encapsulation.

> (In reply to Richard Hart from comment #3)
> If I am not mistaken, the names array also contains names that are empty
> ("") such as when the cell is first created. Giving the cell a name is not
> required. The map can have only one empty name. So there is a difference
> between the two stores. The map contains only names whose length > 0.

Hmm.. another wrinkle in the design. Let me think about this problem a bit more. It's starting to sound like a custom-built hybrid data structure will be needed here. Maybe that custom data structure is LinkTable.java.