Bug 58896 - autoSizeColumn() is extremely slow if worksheet contains merged cells
Summary: autoSizeColumn() is extremely slow if worksheet contains merged cells
Status: NEW
Alias: None
Product: POI
Classification: Unclassified
Component: SS Common (show other bugs)
Version: unspecified
Hardware: PC Windows NT
: P2 enhancement with 6 votes (vote)
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks: 52834 60417
  Show dependency tree
 
Reported: 2016-01-20 04:12 UTC by Gili
Modified: 2019-06-03 15:28 UTC (History)
0 users



Attachments
Input file for testcase (87.30 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
2016-03-06 15:10 UTC, Gili
Details
Testcase (764 bytes, text/plain)
2016-03-06 15:12 UTC, Gili
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Gili 2016-01-20 04:12:38 UTC
So, it goes without saying that a lot of people have complained about the performance of autoSizeColumn(). I followed all the online advise about only invoking this method once per column once the worksheet is fully populated, but it is still extremely slow (30 seconds in POI, 1 second using the Excel GUI).

The worksheet I am formatting has 139 columns and 160 rows.

Is there an existing task for optimizing this function? Has anyone taken a profiler to it and investigated if there are is any low-hanging fruit?
Comment 1 Javen O'Neal 2016-01-20 05:11:44 UTC
I've looked through the code 6 months back because I had the same observation.

My conclusion was that nothing stood out as wastefully inefficient. The contents of every cell (formulas evaluated as necessary) in a column was rendered in RichText, and a minimum bounding box around the RichText was calculated. This was done exactly once per non-blank cell in the column.

To make this faster, while auto-sizing a column, we could cache cells with the same value and formatting to avoid re-calculating the RichText box, but this comes at the cost of memory and a slight performance hit when the majority of cell values in a column are unique. Alternatively, the workbook could cache the column widths when cell values are modified, which would result in a net increase in memory consumption and time, and changing any formula would invalidate the calculated column widths. Excel could get away with the latter approach since the marginal cost to auto-sizing one cell is negligible when using a GUI. However, I don't know if this is what Excel does.

Feel free to step through the autosize code yourself to see if you can find something I couldn't. More eyes are better.
Comment 2 Dominik Stadler 2016-02-12 07:43:01 UTC
Analysis showed no apparent general suitable way to increase performance significantly without incurring higher memory requirements (which we would like to avoid).

As your case seems odd with a very high runtime with only few rows/columns (in other places we do autosizing on tens of thousands of rows without much delay), so it would be good if you can do some profiling to find which parts of the code are causing this.

Alternatively please attach both a sample file and sample code to reproduce the problem so somebody else can look at your specific code.
Comment 3 Javen O'Neal 2016-02-18 03:40:52 UTC
To test out the times, I wrote a unit test (r1730997) that creates a file with the following contents:
160 rows, 139 columns, with the content of each cell as "Cell[r=159, c=138]" using the workbook default style. The time required to calculate the best fit width is a function of the number of characters and number of styles/fonts. Likely due to the JVM's JIT compiler, the time spent calculating the best fit width for each column decreased over time. With the simplicity of this test, this is obviously a best case scenario for auto-sizing (hence the reason I didn't get 30 seconds). I have personally seen the 30 second turtle pace described by Gili with more complicated spreadsheets (numbers, number formatting, styles)

                              [1]    [2]    [3]
                             HSSF   XSSF   SXSSF
            Populate sheet:  21ms  344ms    9ms
     Autosize first column:  36ms    4ms   10ms
Autosize first ten columns: 217ms   39ms   84ms
      Autosize all columns: 776ms  447ms  674ms
 Total (populate+autosize): 797ms  791ms  683ms

There isn't a significant difference in the amount of time required to auto-size columns for each spreadsheet implementation.

[1] http://pasted.co/af5f61a5
[2] http://pasted.co/5e0cd6700
[3] http://pasted.co/41a51c4e
Comment 4 Gili 2016-03-06 15:10:44 UTC
Created attachment 33635 [details]
Input file for testcase
Comment 5 Gili 2016-03-06 15:12:40 UTC
Created attachment 33636 [details]
Testcase
Comment 6 Gili 2016-03-06 15:14:05 UTC
I've attached a testcase for your consideration. On my machine, auto-sizing columns takes 2 minutes, 12 seconds, 489 milliseconds.

Please let me know if you can reproduce the problem on your end.
Comment 7 Gili 2016-03-21 17:59:58 UTC
Javen and Dominik,

Have you tried running the testcase I attached? I'm wondering if it revealed anything interesting...
Comment 8 Javen O'Neal 2016-03-22 05:29:40 UTC
Your input file has a few things beyond my unit test from comment 3:
1) Rich Text Formatting (bold, color)
2) Number formatting ($128,000)
3) Merged cells (Description)

How much do each of these contribute to the 132.5 seconds? Could you time your auto-size code from comment 5 using an Excel file with:
A) no formatting
B) 1+2+3 formatting
C) 1, 2, and 3 formatting, separately, if A and B are different

As a side-note: if you are using POI to auto-size a file similar to attachment 33635 [details] and you know the workbook structure, you could either manually determine ideal column widths and use POI to explicitly set the widths, or have POI calculate the best-fit width of the first group of columns, then apply those widths to all of the columns. This obviously doesn't work when the content is less structured or there are longer strings.

Andi suggested that some people could get away with an approximate solution:
> cell width = length(cell value expressed as a string) * width of a typical character in the cell's font.
If we wanted to make this work for mixed-font (RichText) cells, perform the same computation on each run of characters that have the same font size and font name. A cheaper version is to assume all fonts have the same character widths.
> cell width = sum of [(font size)*(average character width at 1pt font size)*(number of characters in a same-size font run)] for each font run.
Comment 9 Gili 2016-03-28 13:12:15 UTC
Javen,

You hit the nail on the head. Cell formatting has no effect on autoSizeColumn() but merged cells do.

It doesn't matter whether I invoke autoSizeColumn() with useMergedCells set to true or false, the performance problem occurs either way. However, if I un-merge all cells in the worksheet before invoking autoSizeColumn() the performance problem disappears.

How do we proceed from here?
Comment 10 Javen O'Neal 2016-06-17 06:28:51 UTC
I wonder how many times the best-fit width of a cell is being calculated in a merged cell. If more than once per cell, can we cache those calculations or rewrite the code so unnecessary work isn't performed?
Are we checking if the cell is blank before proceeding to calculate the best-fit width?
Does useMergedCells=False/True make a difference on speed?
Comment 11 Javen O'Neal 2016-06-17 06:30:29 UTC
Probably the same bug as bug 52834
Comment 12 Axel Howind 2016-08-12 06:38:50 UTC
I remember observing very bad performance in handling merged cells starting in POI 3.14(?). IIRC it was not only with autoSizeColumn(), but whenever trying to query merged regions of a sheet. It could even be that it was just getNbOfMergedRegions() in XSSF that was so horribly slow. I coded around this issue.

Maybe I will dig into this again when I find the time.
Comment 13 Manuel Koller 2019-05-31 09:01:15 UTC
I was tracking down a slow Excel export and I came across this ticket. There seems to be at least one optimisation that could help speed up things.

Profiling one very slow export, I noticed that about 60% of the time was spent calling sheet.getMergedRegions() from SheetUtil.getCellWidth(Cell cell, int defaultCharWidth, DataFormatter formatter, boolean useMergedCells).

The getCellWidth method is called for each cell in the sheet. Presumably the return value of sheet.getMergedRegions() doesn't change during an autoSizeColumn() call. If the return value from getMergedRegions() is either passed in or otherwise cached, then this would probably speed things up considerably.

The passing in approach is easy but would probably clutter the interface too much. But caching the sheet.getMergedRegions() within sheet instances seems more complicated as presumably each implementation would have to handle that separately. As the calls are likely to be made for each column in the sheet, the solution should cover that case, too.
Comment 14 Vladimir 2019-06-03 15:28:42 UTC
I also have a problem with autoSizeColumn with small number of columns and data.
I have an application which generates XLSX files in different threads. 
I have changed the POI version from 3.17 to 4.1.0 and the application became to hang on the autoSizeColumn(sheet, true) line. 
When I had removed this line or had changed to sheet.setColumnWidth(i, SMALL_COLUMN_WIDTH) the problem was resolved.

There aren't excpetions in the log file!


-----Versions-----

1) OS: Windows 10

2) Java
java version "11.0.3" 2019-04-16 LTS
Java(TM) SE Runtime Environment 18.9 (build 11.0.3+12-LTS)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.3+12-LTS, mixed mode)

3)
POI version: 4.1.0