Bug 61665

Summary: XSSF is much slower than HSSF
Product: POI
Reporter: johns <poi.bugzla>
Component: XSSF
Assignee: POI Developers List <dev>
Severity: enhancement    
Priority: P2    
Version: 3.17-FINAL   
Target Milestone: ---   
Hardware: PC   
OS: All   

Description johns 2017-10-25 10:13:00 UTC
When writing a large number of cells (10,000 to 30,000), XSSF is much slower than HSSF, by roughly 5x to 10x. In general it should not be slower, or at least not by that much.

It can be reproduced with the Apache POI test class:

According to another message on Stack Overflow:
the problem may not be directly in POI, but in synchronized calls in xmlbeans and poi-ooxml-schemas.

Please also take a look at these messages:


Comment 1 Travis Burtrum 2017-10-25 13:39:53 UTC
So I ran into this too and settled on hacking HSSF to just calculate more cells/rows, though it can't read or write these:


There has also been the recent change to disable synchronization in XmlBeans to hopefully avoid this, but I tested it, and it changed nothing.
Comment 2 Javen O'Neal 2017-10-25 18:31:45 UTC
XML as a serialization and deserialization format will always be slower than an optimized binary format. Apache POI's internal model for an xlsx file maintains XML beans, updating them as needed and writing them out as is. The benefit of this strategy is that features that POI doesn't understand or implement are kept, unmodified. Had we converted the information in the XML beans to POJOs and discarded the XML beans immediately after reading the workbook, it's likely information would have been lost.

We are investigating replacing XMLBeans with a different XML library (constrained by ASL 2.0 license compatibility) that may be more performant and memory efficient, and this may provide some improvements in speed. This is an extremely large task that requires modifying nearly every XSSF class and OOXML class. Any help would be greatly appreciated.

On a smaller scale, if after profiling the code you find a section that can be improved, please submit your profiling results and a patch that doesn't break backwards compatibility.
Comment 3 Dominik Stadler 2017-12-28 10:55:23 UTC
I did some analysis using Dynatrace AppMon and could not find any immediate items that we can improve here. The top-consumers are all deep down somewhere in XmlBeans, therefore I don't think we can do much here outside of the larger XmlBeans replacement work.

getT() 	6.37s 	CPU: 36 %, Sync: 0 %, Wait: 0 %, Suspension: 0 %, I/O: 64 % 	org.openxmlformats.schemas.spreadsheetml.x2006.main.impl.CTCellImpl 	Openxmlformats

hasTextEnsureOccupancy() 	6.37s 	CPU: 36 %, Sync: 0 %, Wait: 0 %, Suspension: 0 %, I/O: 64 % 	org.apache.xmlbeans.impl.store.Xobj 	XMLBeans

embedCurs() 	6.37s 	CPU: 36 %, Sync: 0 %, Wait: 0 %, Suspension: 0 %, I/O: 64 % 	org.apache.xmlbeans.impl.store.Locale 	XMLBeans

I have updated the FAQ entry slightly to adjust the expected timings via r1819415.

One thing to note is that class loading initially took a considerable amount of time. I therefore added a way to do a "warmup" run to SSPerformanceTest so that the actual code is measured rather than class loading; see r1819417.
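
To illustrate the warmup idea in general terms, here is a minimal sketch using only the JDK. It is not the code from r1819417 or from SSPerformanceTest; the class name WarmupTiming, its methods, and the placeholder workload are all hypothetical. The workload is run a few times untimed, so that class loading and initial JIT compilation are paid for up front, and only the later runs are measured.

```java
public class WarmupTiming {

    /** Runs the task once and returns the elapsed wall-clock time in nanoseconds. */
    public static long timeOnce(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return System.nanoTime() - start;
    }

    /**
     * Runs the task warmupRuns times without timing it, then runs it
     * measuredRuns more times and returns the fastest measured run in nanoseconds.
     */
    public static long timeAfterWarmup(Runnable task, int warmupRuns, int measuredRuns) {
        if (measuredRuns < 1) {
            throw new IllegalArgumentException("measuredRuns must be at least 1");
        }
        // Warmup: results are discarded; this absorbs class loading and early JIT work.
        for (int i = 0; i < warmupRuns; i++) {
            task.run();
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < measuredRuns; i++) {
            best = Math.min(best, timeOnce(task));
        }
        return best;
    }

    public static void main(String[] args) {
        // Placeholder workload (an assumption for this sketch): in a real
        // comparison this would create a workbook and write 10000-30000 cells.
        Runnable work = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 30_000; i++) {
                sb.append("<c r=\"").append(i).append("\"/>");
            }
            if (sb.length() == 0) {
                throw new IllegalStateException("workload produced no output");
            }
        };

        long cold = timeOnce(work);              // includes first-use overhead
        long warm = timeAfterWarmup(work, 3, 5); // measured after 3 warmup runs
        System.out.printf("cold: %d us, warm: %d us%n", cold / 1_000, warm / 1_000);
    }
}
```

Swapping the placeholder Runnable for one that writes cells through HSSF and one that writes them through XSSF would give a like-for-like comparison in which neither side is penalized for first-use class loading.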