Bug 62328 - Add support for blind cloning elements from XSSF that are not directly supported
Summary: Add support for blind cloning elements from XSSF that are not directly supported
Status: NEEDINFO
Alias: None
Product: POI
Classification: Unclassified
Component: SXSSF
Version: 3.16-FINAL
Hardware: PC
OS: All
Importance: P2 enhancement
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-04-26 08:57 UTC by krzysieq
Modified: 2018-06-28 15:49 UTC
CC List: 0 users



Description krzysieq 2018-04-26 08:57:34 UTC
The use case I've come across is the following: we have a template xlsx file that contains several sheets. Most of them contain only two rows: the header and a first row of dummy values that gives the format and/or style to be used throughout the rest of the sheet. This is because the template is created by hand by our business users. Some other sheets contain formulae calculated from the data that we generate into the first group of sheets. There are also sheets with graphs that plot the aggregated data in some way.

When we begin generating real data into the workbook, we start by removing the dummy-values row in every sheet we feed with data and putting real data in its place. This works with the traditional API, but not with streaming, because the streaming API cannot rewrite rows that have already been written to disk. So a solution/workaround/hack to circumvent this limitation was to copy the template we open into a newly created SXSSFWorkbook cell by cell. And this pretty much works, except for some minor (irrelevant) styling differences and, most importantly, the graphs.

The support for graphs is very limited in the streaming API for obvious reasons, but would it be possible to add functionality whereby the streaming API could simply clone (without exposing the possibility to manipulate) elements of a worksheet that it doesn't understand? I've done it for cells that contain formulae: formula evaluation is not directly supported in the API, but marking the workbook as requiring formula re-evaluation at opening time does the trick. We just store the formulae in the target workbook's cells corresponding to the template and we're done.
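For illustration, a minimal sketch of that formula-copying trick, assuming POI 3.16 (getCellTypeEnum() was replaced by an enum-returning getCellType() in 4.x); copyCell() and deferEvaluation() are made-up helper names, not the reporter's actual code, and cell styles would still need separate handling:

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Workbook;

public class TemplateCopyUtil {

    // Copy one template cell into the corresponding cell of the streaming
    // workbook. Formula cells get only the formula string; evaluation is
    // deferred to Excel.
    static void copyCell(Cell src, Cell dst) {
        switch (src.getCellTypeEnum()) {       // getCellType() in POI 4.x
            case FORMULA: dst.setCellFormula(src.getCellFormula()); break;
            case NUMERIC: dst.setCellValue(src.getNumericCellValue()); break;
            case STRING:  dst.setCellValue(src.getStringCellValue()); break;
            case BOOLEAN: dst.setCellValue(src.getBooleanCellValue()); break;
            default:      break;               // BLANK/ERROR left empty here
        }
    }

    // Mark the target workbook so Excel re-evaluates every formula the next
    // time the file is opened, since SXSSF itself never evaluates formulae.
    static void deferEvaluation(Workbook target) {
        target.setForceFormulaRecalculation(true);
    }
}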

Alternatively, if opening a template using the streaming API allowed modifying as many rows as are configured for random access, regardless of whether they've been written to disk or not, that would also do the trick, and in a much simpler fashion for us: we wouldn't need to copy the workbook template cell by cell, but could do the same thing we've been doing with the non-streaming API and just rewrite the dummy data.
Comment 1 Greg Woolsey 2018-04-26 17:53:54 UTC
I'm curious why you need the streaming API. Are you adding a million rows to a thousand sheets? If you open the template using the XSSF API, you can just update the first row of data with real values, then insert all the additional rows, set their style references to match the corresponding cells from the first row, and write it out to a new destination, along the lines of the sketch below.
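A minimal sketch of that workflow, assuming the dummy values sit in the second row of the first sheet and that every column of that row holds a cell; fillFromTemplate(), rowsOfData and dataAt() are made-up names, not POI API:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class TemplateFiller {

    static void fillFromTemplate(File template, File result, int rowsOfData) throws Exception {
        try (FileInputStream in = new FileInputStream(template);
             XSSFWorkbook wb = new XSSFWorkbook(in)) {
            Sheet sheet = wb.getSheetAt(0);
            Row styleRow = sheet.getRow(1);      // the hand-made dummy row under the header
            for (int r = 0; r < rowsOfData; r++) {
                Row row = (r == 0) ? styleRow : sheet.createRow(r + 1);
                for (int c = 0; c < styleRow.getLastCellNum(); c++) {
                    Cell cell = (r == 0) ? row.getCell(c) : row.createCell(c);
                    if (r > 0) {
                        // same workbook, so the dummy row's CellStyle can be shared directly
                        cell.setCellStyle(styleRow.getCell(c).getCellStyle());
                    }
                    cell.setCellValue(dataAt(r, c));   // dataAt(): stand-in for the real data source
                }
            }
            try (FileOutputStream out = new FileOutputStream(result)) {
                wb.write(out);                   // the template file itself stays untouched
            }
        }
    }

    static double dataAt(int r, int c) { return 0; } // placeholder
}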

I've done that for years, with up to a million total rows of data. Do you need to process a large amount in parallel, or much more data than that? Those are the only cases I can think of where the memory savings of the streaming API would be particularly important.

The XSSF API is very good at leaving well enough alone for elements it doesn't know about or has poor support for.  With this method no object copying is needed, as you read the template directly but write the modified result somewhere else.
Comment 2 krzysieq 2018-04-27 12:33:45 UTC
The amounts of data we're processing don't go into millions of rows, but there's enough to blow out an 8G heap with out-of-memory errors, with the heap dump showing that the overwhelming majority of objects are of two types:

org.apache.xmlbeans.impl.store.Xobj$ElementXobj and
org.apache.xmlbeans.impl.store.Xobj$AttrXobj,
as well as TreeMap entries.

We increased the heap size for production, but obviously that will only carry us until the next out-of-memory error, since data volumes keep increasing at a slow but steady pace. And the servers don't have infinite amounts of RAM, as we all know, even if they're virtual.

I don't think there's anything particularly complex about the spreadsheets we're generating either, as described in the OP. None of the sheets has more than 30 columns, and row-wise the counts *maybe* reach the order of hundreds of thousands of rows today, but only in the busiest sheet; the majority hover below 10% of that count.

Having said that, I have noticed a dramatic improvement in both memory footprint and processing time with my proof-of-concept that uses the streaming API to generate these files. In my mind this is the right way to progress, but I need a way to overcome the limitations of the API, so any suggestions would be more than welcome.
Comment 3 krzysieq 2018-04-27 12:40:28 UTC
Just out of curiosity though: the idea behind the streaming API is to keep a fixed number of rows in memory for random access, but get rid of anything that has already been flushed to disk, right? So where's the harm in allowing access to the first few rows of a template we've just read in, provided the random-access window is sufficiently big? It wouldn't be overwriting the same file, because we'd be saving to a new one, and the streaming API creates its own temp files anyway.
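For reference, that window is fixed when the SXSSFWorkbook is constructed; a small sketch (the sizes and file name are arbitrary examples):

import org.apache.poi.xssf.streaming.SXSSFWorkbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class WindowConfigDemo {

    static void showWindowConfig() throws Exception {
        // Keep the last 500 rows of each sheet in memory; anything older is
        // flushed to a temp file and can no longer be read or rewritten.
        // Rows that already exist in the wrapped template are likewise
        // treated as written, which is the limitation discussed here.
        SXSSFWorkbook windowed =
                new SXSSFWorkbook(new XSSFWorkbook("template.xlsx"), 500);

        // A window size of -1 disables auto-flushing: all rows not flushed
        // explicitly stay available for random access, at the cost of most
        // of the memory savings.
        SXSSFWorkbook unbounded = new SXSSFWorkbook(new XSSFWorkbook(), -1);

        windowed.dispose();    // delete the temp files when done
        unbounded.dispose();
    }
}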
Comment 4 Greg Woolsey 2018-04-27 16:22:56 UTC
Interesting.  I've found significant differences in memory and performance when switching between reading XLSX from a stream and from a temp file: temp-file access is much faster and leaner. That may be worth trying for you. The reason is that file access can use ZipFile and touch only the zip contents your code needs, leaving the rest on disk until you write everything out, at which point POI doesn't have to parse the previously unreferenced/unread bits, just copy them over, which is what you want.
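The two open paths look like this; a small sketch ("big.xlsx" is a placeholder, and the readOnly flag in the second call avoids saving changes back into the source file on close):

import java.io.File;
import java.io.FileInputStream;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

public class OpenModeDemo {

    static void compareOpenModes() throws Exception {
        // Stream-based open: the whole zip has to be buffered and unpacked up front.
        try (Workbook viaStream = WorkbookFactory.create(new FileInputStream("big.xlsx"))) {
            System.out.println(viaStream.getNumberOfSheets());
        }

        // File-based open: backed by ZipFile, so zip entries are read on
        // demand and unreferenced parts stay on disk until written out.
        try (Workbook viaFile = WorkbookFactory.create(new File("big.xlsx"), null, true)) {
            System.out.println(viaFile.getNumberOfSheets());
        }
    }
}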

Beyond that, if you want to work up a patch with tests, we'd be glad to take a look.