Bug 55612

Summary: Performance improvement (~11% up to ~92%) by adding a cache to HSSFCellStyle.getDataFormatString()
Product: POI Reporter: Luca Della Toffola <luca.dellatoffola>
Component: HSSFAssignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: enhancement    
Priority: P2    
Version: 3.9-FINAL   
Target Milestone: ---   
Hardware: PC   
OS: Linux   
Attachments: HSSFCellStyle patch
HSSFCellStyle patch (static lock)
HSSFCellStyle patch (thread local)

Description Luca Della Toffola 2013-09-30 09:34:59 UTC
Created attachment 30895 [details]
HSSFCellStyle patch

Similarly to BUG-55611 we found an easy way to improve POI's performance. 
The idea is to avoid re-executing HSSFCellStyle.getDataFormatString() if the various parameters/object-state are not changing.
This can be done by adding a single-static-entry cache and check if the parameters/object-state did change from the previous call, and in case invalidate the static cache. While the code in the patch might not be the most elegant my main purpose is to expose this performance problem I have found with an easy fix as solution.
For example, when running Poi 3.9 on a small document (~40 KB) and on a larger document (~13.5 MB), the patch reduces the running time
giving a speedup of ~11% in the first case and ~92% in the second case.
Comment 1 Yegor Kozlov 2013-10-19 14:45:05 UTC
Can you please explain the logic in details?

Why do you not compare lastFormat and format ? You cache lastFormat  but use it only in a "not null" clause and not in equals. 


Why do you need to cache lastFormatSummary and collection of workbook's format records? 

Yegor
Comment 2 Luca Della Toffola 2013-10-19 17:00:57 UTC
(In reply to Yegor Kozlov from comment #1)
> Can you please explain the logic in details?
> 
> Why do you not compare lastFormat and format ? You cache lastFormat  but use
> it only in a "not null" clause and not in equals. 

The main reason to keep all that state is to be on the safe side for the cache invalidation. Since the cache is static I wanted to store the state of the "this" object of the previous call. Obviously not all this state is used directly in the computation of getDataFormatString(),
so there is margin to simplify the patch. 

Regarding the check "_format" is initially initialised to null so I wanted to keep that case in the invalidation condition.
If both are null then we are fine the two object fields are equal, otherwise if the format is not null we need to compare the fields to see if the previous "this" object changed. To have a quick-and-dirty check I packed the fields of ExtendedFormatRecord into an array.
 
> Why do you need to cache lastFormatSummary and collection of workbook's
> format records? 

For lastFormatSummary because part of it (one field) is used as parameter for the call "format.getFormat(...)" in HSSFCellStyle.getDataFormatString(InternalWorkbook), the remaining array elements for the reasons explained above. 

For the collection of workbook's format records I wanted to keep them because they are used in the constructor of HSSFDataFormat created in HSSFCellStyle.getDataFormatString(InternalWorkbook) so I consider them as implicit "parameters" of the method getDataFormatString().

I'm preparing a new patch that stores only the collection of workbook's format records and the value returned by the method HSSFCellStyle.getDataFormat() for the invalidation.
Results of the performance tests are coming soon.
Comment 3 Luca Della Toffola 2013-10-22 12:56:24 UTC
Created attachment 30953 [details]
HSSFCellStyle patch (static lock)
Comment 4 Luca Della Toffola 2013-10-22 12:57:29 UTC
Created attachment 30954 [details]
HSSFCellStyle patch (thread local)
Comment 5 Luca Della Toffola 2013-10-22 13:01:09 UTC
I attached two new patches. One used a static lock to synchronize the cache, the other a thread-local variables. Both now only uses a sub-set of the fields of the object, so we have simpler invalidation mechanism. 

In the same initial scenario (single-threaded) program both result in a similar performance improvement (thread-local version works slightly better) as the original patch. As in the other bug report I don't a have a testing scenario with multiple threads accessing the cache, in that case intuitively we will expect a performance degradation.

(In reply to Yegor Kozlov from comment #1)
> Can you please explain the logic in details?
> 
> Why do you not compare lastFormat and format ? You cache lastFormat  but use
> it only in a "not null" clause and not in equals. 
> 
> 
> Why do you need to cache lastFormatSummary and collection of workbook's
> format records? 
> 
> Yegor
Comment 6 Yegor Kozlov 2013-10-25 18:50:46 UTC
patched applied in r1535810

I realized I was wrong when I wrote about thread unsafety of this patch. We really needed synchronization in Bugzilla 55611 because DateUtil.isADateFormat(int, String) is static and the state of its cache must be synchronized. 

HSSFCellStyle.getDataFormatString() is an instance method and  we can assume that it is accessed from one thread ta a time. So I removed the synchronization lock and committed. 

Sorry if my comment caused confusion and extra work on your side. 

Regards,
Yegor
Comment 7 Luca Della Toffola 2013-10-27 12:09:44 UTC
No problem at all, I'm happy that I could contribute.

PS: Few days back in Bugzilla 55611 I made another comment/patch after your commit. I don't know if you had a look at it already. 

(In reply to Yegor Kozlov from comment #6)
> patched applied in r1535810
> 
> I realized I was wrong when I wrote about thread unsafety of this patch. We
> really needed synchronization in Bugzilla 55611 because
> DateUtil.isADateFormat(int, String) is static and the state of its cache
> must be synchronized. 
> 
> HSSFCellStyle.getDataFormatString() is an instance method and  we can assume
> that it is accessed from one thread ta a time. So I removed the
> synchronization lock and committed. 
> 
> Sorry if my comment caused confusion and extra work on your side. 
> 
> Regards,
> Yegor
Comment 8 Luca Della Toffola 2013-10-27 13:49:16 UTC
(In reply to Yegor Kozlov from comment #6)
> I realized I was wrong when I wrote about thread unsafety of this patch. We
> really needed synchronization in Bugzilla 55611 because
> DateUtil.isADateFormat(int, String) is static and the state of its cache
> must be synchronized. 
> 

I think you were actually right in the beginning. Even if HSSFCellStyle.getDataFormatString() is an instance method the cache is static so we need a static lock to synchronize various threads accessing/updating the cache.
Comment 9 Bernhard Seebass 2014-04-29 15:58:41 UTC
I can confirm that synchronization is a must. I just experienced an application showing random behaviour due to multi threaded usage of HSSFCellStyle.getDataFormatString()

However, as the current issue is already closed, I didn't find it at first and filed a new one (BUG-56453).