Bug 58740

Summary:	[PATCH] Fix O(n^2) behavior when generating XLSX files with large number of styles
Product:	POI	Reporter:	Archie Cobbs <archie.cobbs>
Component:	XSSF	Assignee:	POI Developers List <dev>
Status:	RESOLVED CLOSED
Severity:	enhancement
Priority:	P2
Version:	3.13-FINAL
Target Milestone:	---
Hardware:	PC
OS:	All
Attachments:	Patch that fixes the behavior described Rebased attachment 33352 to r1721943. patch showing mixed based indices MappedList data structure that combines a List and a TreeSetValuedHashMap to improve list reverse lookup speed a non-working MapUniqueList class that mimicks an associative SetUniqueList Version of original patch using Guava's BiHashMap class

Description Archie Cobbs 2015-12-16 04:08:13 UTC

Created attachment 33352 [details]
Patch that fixes the behavior described

When generating XSLX files for a spreadsheet that contains a large number of cell styles, there is extreme slowness caused by the use of ArrayList.indexOf() which is O(n) time, leading to overall O(n^2) behavior.

This is easily fixed by keeping a reverse mapping from list entry to list index. Since the list of styles is only appended to, this is easy.

The attached patch reduced the time to generate my XSLX output from several minutes to under 10 seconds.

Comment 1 Javen O'Neal 2015-12-16 08:57:25 UTC

Out of curiosity, how many styles are you creating that cause the reported several minutes/10 seconds times? Are any of the styles identical? I ask because most workbooks I create use under 100 styles. Unless your workbook is a styles demo that shows every permutation of font, data format, border, background and foreground color, it seems difficult to get a style count high enough in most applications where there wouldn't be duplicates.

I've rolled my own code to manage styles so I can avoid creating millions of styles. Usually I want to change the data format of an existing cell without affecting other cells that use the same style. Rather than cloning the cell's style (that's how you get millions of styles), I temporarily change the cell style's data format, search the style table if there's another style that matches, and if so change the cell's style reference to the match, otherwise I clone the style. Finally, I revert my dataformat change to the original style. Because I apply style changes to thousands of cells, I have some extra data structures to make the style lookup process faster than linearly searching the style table.

I mention this because it may solve your problem if you don't really need 1 million styles.

POI could benefit from a way to consolidate duplicate styles, either when the workbook is written to disk, or through an explicit call.

Glancing at your patch, it looks like your change adds some data structures. The consequence is:
1) higher memory consumption
2) extra processing power updating multiple data structures
3) potential for the data structures to get out of sync, especially considering projects that subclass POI.

My recommendation is use a single data structure that is a container for cell styles, that combines the features that you need that will give fast by-index and by-style lookup. If such a data structure isn't available off-the-shelf, you may want to write your own. Most trivially, this is just a class that contains an ArrayList and a HashMap for inverted array lookups, and any call to the hybrid data structure would update both underlying data structures. Encapsulating the complexity is the key to solving my 3rd concern, and makes solving #2 and #1 easier to solve down the road.

Alternatively, if you can read the styles from the array into a hashmap, clear the array, and only maintain a hashmap throughout the life of the style table, and then recreate the array when you need to save the style table, you've saved yourself the memory and performance overhead, and also avoided the potential for out-of-sync data structures so long as you clearly mark the array as unmaintained (maybe clear it or set it to null, or don't make it an instance variable, plus comments).

Comment 2 Archie Cobbs 2015-12-16 14:15:18 UTC

(In reply to Javen O'Neal from comment #1)
> Out of curiosity, how many styles are you creating that cause the reported
> several minutes/10 seconds times?

I have one column where the background color is set according to the
data value as a visual aid. Lots of values lead to lots of colors and
therefore styles. Even limiting to couple thousand colors leads to this
problem. 

> 1) higher memory consumption

Yes but insignificantly. A small constant number of bytes per style.

> 2) extra processing power updating multiple data structures

Huh?!? You're missing the whole point.

If you have non-trivial number of styles, there will be MUCH less
processing power utilized because we eliminate the stupid O(n^2) behavior.

For a trivial number of styles, the overhead is minimal - linear time in
the number of styles, which by definition, is small.

> 3) potential for the data structures to get out of sync, especially
> considering projects that subclass POI.

Wrong - the fields are all private. It's not possible for a subclass
to get things out of sync.

> My recommendation is use a single data structure that is a container for
> cell styles, that combines the features that you need that will give fast
> by-index and by-style lookup. If such a data structure isn't available
> off-the-shelf, you may want to write your own. Most trivially, this is just
> a class that contains an ArrayList and a HashMap for inverted array lookups,

That's a possible refinement and I considered it. My goal was to address
the immediate problem.

You are letting the perfect be the enemy of the good. I've provided a
reasonable patch to address what is truly stupid behavior. My recommendation
is to use this patch to "put out the fire" so to speak, and we can refine it
later at a lower priority.

Comment 3 Nick Burch 2015-12-16 16:00:07 UTC

In terms of the style optimisation, we do have a style optimiser for HSSF, but nothing for XSSF. Some of the logic could be re-used, but a fair bit would need writing

Comment 4 Javen O'Neal 2015-12-16 16:46:47 UTC

(In reply to Archie Cobbs from comment #2)
> I have one column where the background color is set according to the
> data value as a visual aid. Lots of values lead to lots of colors and
> therefore styles. Even limiting to couple thousand colors leads to this
> problem. 

Okay. Sounds like you really do need that many styles if the style is dependent on the value. This isn't by any chance related to Excel's Conditional Formatting Color Scales [1]? This would still probably require as many styles as there are values, but would be easier to work with.


I'll take a look at your patch in context with the code later this week to see if we can get away with maintaining just a single data structure.

Thank you for the patch!

[1] https://support.office.com/en-us/article/Highlight-patterns-and-trends-with-conditional-formatting-eea152f5-2a7d-4c1a-a2da-c5f893adb621

Comment 5 Archie Cobbs 2015-12-16 19:21:03 UTC

(In reply to Javen O'Neal from comment #4)
> Okay. Sounds like you really do need that many styles if the style is
> dependent on the value. This isn't by any chance related to Excel's
> Conditional Formatting Color Scales [1]?

No - I'm not familiar with that. Sounds like a neater solution though.

Does OOXML and/or the POI API have support for that?

> I'll take a look at your patch in context with the code later this week to
> see if we can get away with maintaining just a single data structure.

Great, thanks!

Comment 6 Nick Burch 2015-12-16 21:17:19 UTC

(In reply to Archie Cobbs from comment #5)
> Does OOXML and/or the POI API have support for that?

Yes! Take a look at the colourScales method of the ConditionalFormats example program to get started, then the related unit tests if you need more

(Note that if you write a colour scale into a .xls or .xlsx file, it will only render properly if you open in a new enough copy of Excel. Older versions will skip over the records/xml without rendering anything)

Comment 7 Javen O'Neal 2015-12-28 16:31:48 UTC

Created attachment 33382 [details]
Rebased attachment 33352 [details] to r1721943.

I committed the trivial changes from attachment 33352 [details]. I've rebased the patch to r1721943.

This patch should use a Bidirectional Map or other data structure so that the StylesTable doesn't become too difficult to read. I've posted a question to the dev mailing list for feedback.

This test also fails the current unit test suite. Looks to be some issues with ArrayIndexOutOfBounds.

Comment 8 Javen O'Neal 2016-09-12 06:26:39 UTC

Created attachment 34235 [details]
patch showing mixed based indices

Some of the methods return list.size() before adding an item and other methods return list.size() after adding an item. The effect is whether a 0-based list index or 1-based list index is returned. Without better javadocs or comments saying which should be which, I'm inclined to believe that there are some bugs hiding in the attached code.

Comment 9 Javen O'Neal 2016-09-12 14:58:40 UTC

Created attachment 34239 [details]
MappedList data structure that combines a List and a TreeSetValuedHashMap to improve list reverse lookup speed

I wrote a general-purpose class using commons-collections4.1 (this patch was generated off the commons collections trunk, not POI). The indexOf and contains methods were significantly faster, but insertion and removal were pitifully slow (it's slow to shift all the elements in an array by 1 for an ArrayList implementation, but this class is even slower because it has to shift the values of a map--from my trials, it's faster to rebuild the entire map when an insertion or deletion happens that isn't at the end of the list.

Some sample timing:
Each list was initialized with 10,000 elements and 10,000 operations of each type were performed in succession. Timing is relative and seems to be influenced by JIT. Repeatability is around +/-10%, so I have rounded to 2 significant figures. Low numbers are better.
>                        add; toArray; iterator; insert; get; indexOf; contains; remove
>             ArrayList = 2;   1200;       68;    100;    9;    310;      220;      81;
>            LinkedList = 3;   1800;     1500;    350;  420;   1000;     1000;     400;
> NodeCachingLinkedList = 4;   2600;     1900;    250;  350;    860;      830;     370;
>              TreeList = 4;   1300;     3400;     68;   10;   1500;     1500;      55;
>            MappedList = 52;  1600;     1900;  20000;    0;     64;        6;   21000;
> Elapsed time: 1m8s

Comment 10 Javen O'Neal 2016-09-12 17:11:10 UTC

Created attachment 34241 [details]
a non-working MapUniqueList class that mimicks an associative SetUniqueList

If we can remove the scenario where duplicate entries may exist in the lists then we could write a MapUniqueList (see org.apache.commons.collections4.list.SetUniqueList), which may be more performant.

Looking at the usage of these lists, we only use List.add(E), List.indexOf(E), and List.size(), so we may not run into the performance bottlenecks when elements in the array are shifted (and this makes sense--we wouldn't want to change the style index as there are other data structures that only reference the index, not the object itself.

Comment 11 Archie Cobbs 2016-09-19 21:25:57 UTC

Created attachment 34275 [details]
Version of original patch using Guava's BiHashMap class

Just for what it's worth, I'm attaching a patch against 3.14 that uses Guava's BiHashMap, which is a bidirectional Map.

This makes the logic much simpler.

Comment 12 Javen O'Neal 2016-09-20 00:29:36 UTC

Apache Commons collections4 is available to us, but a BiDiMap does not work because values (and therefore keys in the inverted map) may not be unique.

Comment 13 Archie Cobbs 2018-02-10 21:02:49 UTC

It looks like this issue may be addressed in poi-3.17. My tests show better performance.

Can you confirm? Thanks.

Comment 14 PJ Fanning 2021-10-08 18:37:53 UTC

Closing old issue - seems likely to have been resolved