Bug 51171

Summary: [PATCH] Performance opening large XLS-Files (InternalWorkbook.getStyleRecord)
Product: POI Reporter: Marcel May <marcel.may.de>
Component: HSSFAssignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: normal CC: marcel.may.de
Priority: P2    
Version: 3.8-dev   
Target Milestone: ---   
Hardware: PC   
OS: All   
Attachments: Patch against SVN trunk
CPU Hotspots, before to the patch. Notice the 55m inv count on WorkbookRecordList
Including the patch, fixing the previous top hotspots (these were all related, up to the SharedValueManger thing).
Profiled hotspots (opening of XLS file using current trunk v1127506)
Profiled cpu calltree (opening of XLS file using current trunk v1127506)
Improves SharedValueManager.findFormulaGroup
Shows profiled hotspots after second patch for SharedValueManager.findFormulaGroup
Shows profiled calltree after second patch for SharedValueManager.findFormulaGroup

Description Marcel May 2011-05-09 06:02:00 UTC
Created attachment 26970 [details]
Patch against SVN trunk

Opening ~5M XLS-Files takes about half a second and high CPU load.
By a quick profiling I discovered a 55-million-times loop for getting the user style name for my XLS-files.

This patch against trunk fixes the number of loops significantly, by fixing the 'iterative search' causing the high loop count.

I'll attach the profiling details in a moment.
As a summary, for my test cases I measured an average speedup =  old_time / new_time = ~ 1.3 .

Would be great if the patch could be included in the next 3.8 release.

Cheers,
Marcel
Comment 1 Marcel May 2011-05-09 06:06:44 UTC
Created attachment 26971 [details]
CPU Hotspots, before to the patch. Notice the 55m inv count on WorkbookRecordList
Comment 2 Marcel May 2011-05-09 06:09:29 UTC
Created attachment 26972 [details]
Including the patch, fixing the previous top hotspots (these were all related, up to the SharedValueManger thing).
Comment 3 Marcel May 2011-05-09 06:18:32 UTC
The profiling screenshots attached show opening a single XLS file once:

new HSSFWorkbook(new FileInputStream("...."));

You can see that the top hotspots (first 7 are related) got fixed when you compare before/after the patch. Looping 55 million times does not occur anymore.

Another micro bench mark including warmup phase followed by timed loops shows similiar results (speedup of ~1.3, or only ~75% of original time on average).
Comment 4 Yegor Kozlov 2011-05-15 18:54:35 UTC
Fixed in r1103502, but in a diffrent way.

The real problem was that HSSFCell.setCellStyle was called for every cell when constructing a workbook. This method is expensive and designed for assigning styles to individual cell and applying it to workbook scope causes performance issues. It appears that we don't need to call setCellStyle in the HSSFCell constructior at all, this line remained from all times (POI-3.5 or earlier) and does not make any sense in POI-3.8. So, I removed it.  

This fix should boost performance of opening .xls files even greater than ~1.3. 

Yegor
Comment 5 Marcel May 2011-05-25 14:22:29 UTC
(In reply to comment #4)
> Fixed in r1103502, but in a diffrent way.
> 
> The real problem was that HSSFCell.setCellStyle was called for every cell when
> constructing a workbook. This method is expensive and designed for assigning
> styles to individual cell and applying it to workbook scope causes performance
> issues. It appears that we don't need to call setCellStyle in the HSSFCell
> constructior at all, this line remained from all times (POI-3.5 or earlier) and
> does not make any sense in POI-3.8. So, I removed it.  
> 
> This fix should boost performance of opening .xls files even greater than ~1.3. 
> 
> Yegor

Thanks, Yegor - your fix was even better :-)

I profiled again against latest version 1127506 and noticed another hotspot in the SharedValueManager class. Here's an 8 million times iteration doing 'findFormularGroup'.
Comment 6 Marcel May 2011-05-25 14:26:48 UTC
Created attachment 27064 [details]
Profiled hotspots (opening of XLS file using current trunk v1127506)
Comment 7 Marcel May 2011-05-25 14:28:48 UTC
Created attachment 27065 [details]
Profiled cpu calltree (opening of XLS file using current trunk v1127506)
Comment 8 Marcel May 2011-05-26 06:42:34 UTC
Created attachment 27067 [details]
Improves SharedValueManager.findFormulaGroup
Comment 9 Marcel May 2011-05-26 06:47:56 UTC
Created attachment 27068 [details]
Shows profiled hotspots after second patch for SharedValueManager.findFormulaGroup
Comment 10 Marcel May 2011-05-26 06:49:41 UTC
Created attachment 27069 [details]
Shows profiled calltree after second patch for SharedValueManager.findFormulaGroup