Bug 51171 - [PATCH] Performance opening large XLS-Files (InternalWorkbook.getStyleRecord)
Summary: [PATCH] Performance opening large XLS-Files (InternalWorkbook.getStyleRecord)
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: HSSF (show other bugs)
Version: 3.8-dev
Hardware: PC All
: P2 normal (vote)
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-05-09 06:02 UTC by Marcel May
Modified: 2011-05-26 06:49 UTC (History)
1 user (show)



Attachments
Patch against SVN trunk (2.73 KB, patch)
2011-05-09 06:02 UTC, Marcel May
Details | Diff
CPU Hotspots, before to the patch. Notice the 55m inv count on WorkbookRecordList (345.83 KB, image/png)
2011-05-09 06:06 UTC, Marcel May
Details
Including the patch, fixing the previous top hotspots (these were all related, up to the SharedValueManger thing). (360.16 KB, image/png)
2011-05-09 06:09 UTC, Marcel May
Details
Profiled hotspots (opening of XLS file using current trunk v1127506) (660.11 KB, image/tiff)
2011-05-25 14:26 UTC, Marcel May
Details
Profiled cpu calltree (opening of XLS file using current trunk v1127506) (425.80 KB, image/png)
2011-05-25 14:28 UTC, Marcel May
Details
Improves SharedValueManager.findFormulaGroup (6.00 KB, patch)
2011-05-26 06:42 UTC, Marcel May
Details | Diff
Shows profiled hotspots after second patch for SharedValueManager.findFormulaGroup (369.40 KB, image/png)
2011-05-26 06:47 UTC, Marcel May
Details
Shows profiled calltree after second patch for SharedValueManager.findFormulaGroup (189.06 KB, image/png)
2011-05-26 06:49 UTC, Marcel May
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Marcel May 2011-05-09 06:02:00 UTC
Created attachment 26970 [details]
Patch against SVN trunk

Opening ~5M XLS-Files takes about half a second and high CPU load.
By a quick profiling I discovered a 55-million-times loop for getting the user style name for my XLS-files.

This patch against trunk fixes the number of loops significantly, by fixing the 'iterative search' causing the high loop count.

I'll attach the profiling details in a moment.
As a summary, for my test cases I measured an average speedup =  old_time / new_time = ~ 1.3 .

Would be great if the patch could be included in the next 3.8 release.

Cheers,
Marcel
Comment 1 Marcel May 2011-05-09 06:06:44 UTC
Created attachment 26971 [details]
CPU Hotspots, before to the patch. Notice the 55m inv count on WorkbookRecordList
Comment 2 Marcel May 2011-05-09 06:09:29 UTC
Created attachment 26972 [details]
Including the patch, fixing the previous top hotspots (these were all related, up to the SharedValueManger thing).
Comment 3 Marcel May 2011-05-09 06:18:32 UTC
The profiling screenshots attached show opening a single XLS file once:

new HSSFWorkbook(new FileInputStream("...."));

You can see that the top hotspots (first 7 are related) got fixed when you compare before/after the patch. Looping 55 million times does not occur anymore.

Another micro bench mark including warmup phase followed by timed loops shows similiar results (speedup of ~1.3, or only ~75% of original time on average).
Comment 4 Yegor Kozlov 2011-05-15 18:54:35 UTC
Fixed in r1103502, but in a diffrent way.

The real problem was that HSSFCell.setCellStyle was called for every cell when constructing a workbook. This method is expensive and designed for assigning styles to individual cell and applying it to workbook scope causes performance issues. It appears that we don't need to call setCellStyle in the HSSFCell constructior at all, this line remained from all times (POI-3.5 or earlier) and does not make any sense in POI-3.8. So, I removed it.  

This fix should boost performance of opening .xls files even greater than ~1.3. 

Yegor
Comment 5 Marcel May 2011-05-25 14:22:29 UTC
(In reply to comment #4)
> Fixed in r1103502, but in a diffrent way.
> 
> The real problem was that HSSFCell.setCellStyle was called for every cell when
> constructing a workbook. This method is expensive and designed for assigning
> styles to individual cell and applying it to workbook scope causes performance
> issues. It appears that we don't need to call setCellStyle in the HSSFCell
> constructior at all, this line remained from all times (POI-3.5 or earlier) and
> does not make any sense in POI-3.8. So, I removed it.  
> 
> This fix should boost performance of opening .xls files even greater than ~1.3. 
> 
> Yegor

Thanks, Yegor - your fix was even better :-)

I profiled again against latest version 1127506 and noticed another hotspot in the SharedValueManager class. Here's an 8 million times iteration doing 'findFormularGroup'.
Comment 6 Marcel May 2011-05-25 14:26:48 UTC
Created attachment 27064 [details]
Profiled hotspots (opening of XLS file using current trunk v1127506)
Comment 7 Marcel May 2011-05-25 14:28:48 UTC
Created attachment 27065 [details]
Profiled cpu calltree (opening of XLS file using current trunk v1127506)
Comment 8 Marcel May 2011-05-26 06:42:34 UTC
Created attachment 27067 [details]
Improves SharedValueManager.findFormulaGroup
Comment 9 Marcel May 2011-05-26 06:47:56 UTC
Created attachment 27068 [details]
Shows profiled hotspots after second patch for SharedValueManager.findFormulaGroup
Comment 10 Marcel May 2011-05-26 06:49:41 UTC
Created attachment 27069 [details]
Shows profiled calltree after second patch for SharedValueManager.findFormulaGroup