Bug 61213 - Replace SXSSFWorkbook copyStreamAndInjectWorksheet with StAX equivalent
Alias: None
Product: POI
Classification: Unclassified
Component: SXSSF
Version: unspecified
Hardware: PC Mac OS X 10.1
Importance: P2 enhancement
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Blocks: 60707
Reported: 2017-06-24 09:28 UTC by PJ Fanning
Modified: 2017-07-13 07:16 UTC

use Stax to parse the worksheet data (3.61 KB, patch)
2017-06-24 10:08 UTC, PJ Fanning
reload the patch.tar.gz (3.61 KB, application/octet-stream)
2017-06-24 10:09 UTC, PJ Fanning

Description PJ Fanning 2017-06-24 09:28:19 UTC
I have been looking at a replacement. I will attach the code shortly. I would like to get it reviewed before merging it.
Comment 1 PJ Fanning 2017-06-24 10:08:21 UTC
Created attachment 35073
use Stax to parse the worksheet data

I can merge this if it is OK.
Comment 2 PJ Fanning 2017-06-24 10:09:50 UTC
Created attachment 35074
reload the patch.tar.gz
Comment 3 Dominik Stadler 2017-06-26 13:18:10 UTC
I took a quick look. We currently just copy the XML stream as text; with this change we would parse the XML and then write it out again via XML serialization. Do you have an idea of how much impact that has for very large files?

SXSSF is used specifically for handling huge files (customers seem to have documents with more than 4 GB uncompressed size and multiple millions of rows), so we need to check that this additional parsing and serializing is not slower for such large files.
Comment 4 PJ Fanning 2017-06-26 15:03:50 UTC
Thanks Dominik - I would expect some performance impact but I think it is more robust for the code not to make assumptions about file encodings etc. I also think the StAX code is easier to understand.
StAX parsers are very fast but it is worth evaluating the impact to see if it is excessive.
Since SXSSFWorkbook is for writing large files, I think the best performance test would be for me to write a test case that adds a large number of rows and to compare the times for the existing code and my proposed change.
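
A minimal sketch of the kind of comparison discussed here, using only JDK classes rather than POI. It assumes a synthetic, worksheet-like XML string; the CopyVsStax class and its buildSheetXml helper are hypothetical names and are not part of POI or the attached patch. The sketch contrasts copying the stream as text with parsing it via StAX and re-serializing it. Timings from a single run like this are only indicative.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;

class CopyVsStax {

    // A task that may throw, so both copy variants can be timed the same way.
    interface Task { void run() throws Exception; }

    // Builds a synthetic worksheet-like document (assumed shape, not real POI output).
    static String buildSheetXml(int rows) {
        StringBuilder sb = new StringBuilder("<sheetData>");
        for (int r = 1; r <= rows; r++) {
            sb.append("<row r=\"").append(r).append("\"><c r=\"A").append(r)
              .append("\"><v>").append(r).append("</v></c></row>");
        }
        return sb.append("</sheetData>").toString();
    }

    // Approach 1: copy the characters through without interpreting them.
    static void copyAsText(Reader in, Writer out) throws IOException {
        char[] buf = new char[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
    }

    // Approach 2: parse the XML with StAX and serialize every event again.
    static void copyWithStax(Reader in, Writer out) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        try {
            writer.add(reader); // forwards all remaining events from reader to writer
        } finally {
            writer.close();
            reader.close();
        }
    }

    // Returns the elapsed wall-clock time of one run, in nanoseconds.
    static long timeNanos(Task task) throws Exception {
        long start = System.nanoTime();
        task.run();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws Exception {
        String xml = buildSheetXml(200_000);
        StringWriter textOut = new StringWriter();
        StringWriter staxOut = new StringWriter();
        long textNs = timeNanos(() -> copyAsText(new StringReader(xml), textOut));
        long staxNs = timeNanos(() -> copyWithStax(new StringReader(xml), staxOut));
        System.out.printf("text copy: %d ms, StAX round trip: %d ms%n",
                textNs / 1_000_000, staxNs / 1_000_000);
    }
}
```

The StAX output need not be byte-identical to the input (a serializer may, for example, add an XML declaration), so a fair comparison should check that both outputs carry the same rows as well as how long each takes.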
Comment 5 Dominik Stadler 2017-06-27 07:26:53 UTC
You can take a look at the FAQ at http://poi.apache.org/faq.html#faq-N10165, it points to a sample which we used for comparing raw performance of HSSF/XSSF/SXSSF in the past.
Comment 6 PJ Fanning 2017-06-29 19:23:07 UTC
I did some initial testing and the StAX-based code is significantly slower. I will spend a little more time to see if the performance can be improved.
Test project: https://github.com/pjfanning/poi-sxssf-stax. It is not very scientific, but the test takes 3 seconds with SXSSFWorkbook and 25 seconds with the StAX equivalent.
Comment 7 PJ Fanning 2017-07-13 07:16:28 UTC
This approach is much slower.