Bug 65613 - DeferredSXSSFWorkbook still using temp files
Summary: DeferredSXSSFWorkbook still using temp files
Status: NEW
Alias: None
Product: POI
Classification: Unclassified
Component: SXSSF (show other bugs)
Version: 5.0.0-FINAL
Hardware: PC All
: P2 major with 1 vote (vote)
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-10-04 15:11 UTC by jasmin.suess
Modified: 2021-10-22 14:26 UTC (History)
1 user (show)



Attachments
JUnitTest for DeferredSXSSFWorkbook usage (3.02 KB, text/plain)
2021-10-22 13:21 UTC, jasmin.suess
Details

Note You need to log in before you can comment on or make changes to this bug.
Description jasmin.suess 2021-10-04 15:11:15 UTC
Documentation of Version 5.0.0 (2021-01-20) claims:

"new experimental DeferredSXSSFWorkbook which avoids temp files by lazily generating rows (see DeferredGeneration in poi-examples)"

However, writing a DeferredSXSSFWorkbook still causes temporary files to be written to disk, as it still uses 
org.apache.poi.xssf.streaming.SXSSFWorkbook.write(OutputStream)

which does:

    //Save the template
    File tmplFile = TempFile.createTempFile("poi-sxssf-template", ".xlsx");


Is this a bug or intended?
Comment 1 PJ Fanning 2021-10-04 15:18:14 UTC
It was not the intention to 100% avoid temp files but to create fewer temp files.
Also note that the code is still regarded as experimental.

If you don't want any temp files, then maybe you can submit a patch. I suspect that any change would need to be configurable because the change could lead to more memory consumption - which other users may not want.
Comment 2 Patric Rufflar 2021-10-04 15:59:54 UTC
We are facing the same issue.

In some environments, e.g. for application server scenarios, any usage of the file system might be prohibited. In these cases, POI / SXSSF can't be used.

A tempfile-free variant would really be nice and an important improvement to the project.

BTW why is the current deferred implementation marked as "experimental"?
Are there any lacking features or known bugs? Or is it just not tested well enough to consider it stable?

Thanks!
Comment 3 PJ Fanning 2021-10-04 16:38:52 UTC
the deferred sxssf code is not widely used and is not well tested - it is not core code and was submitted with minimal test coverage
Comment 4 PJ Fanning 2021-10-04 17:01:47 UTC
One alternative is https://github.com/dhatim/fastexcel which may support excel generation without the use of temp files.
Comment 5 Patric Rufflar 2021-10-04 19:11:00 UTC
Yes, fastexcel completely avoids temporary files.
Unfortunately, fastexcel also lacks important features (which POI offers).

Is there a roadmap / release target for having this feature production-ready?
BTW, what was the reason implementing deferred workbooks?
Comment 6 PJ Fanning 2021-10-04 19:41:41 UTC
POI has no roadmap - it is a mature project with a small volunteer base - essentially, if you want a feature, you write it yourself - other developers might take an interest and review your ideas and/or code
Comment 7 PJ Fanning 2021-10-04 20:41:08 UTC
I've added r1893896 - try it out if you like - I'm in 2 minds as to whether to leave this code in for the release - I'm not sure it is a good approach and I don't have much time to do much more with it.

Up until now, users have been happy enough to live with the fact the SXSSF code relies on using temp files.

ZipInputStreamZipEntrySource has a limitation in that it cannot handle zip entries that are more than 4Gb uncompressed - unless you enable a feature ZipInputStreamZipEntrySource#setThresholdBytesForTempFiles (which is off by default) and as you can infer this relies on temp files if enabled.
Comment 8 Patric Rufflar 2021-10-04 22:20:52 UTC
Thanks, I'll have a look!
Comment 9 jasmin.suess 2021-10-21 15:07:26 UTC
Thanks for the added feature.

I ran into another problem trying to implement the DeferredSXSSFWorkbook. Currently my workflow looks like this:

1. Create DeferredSXSSFWorkbook.
2. Create Sheet.
3. Create Row.
4. Create Cell.

But the file turns out empty. This is because createRow() of SXSSFSheet tries to flush the row to the writer here:

(SXSSFSheet.java Line 1895)

            if (_writer != null) _writer.writeRow(rowIndex, row);
            _rows.remove(firstRowNum);

But the writer for the DeferredSXSSFWorkbook only gets created during the writing process in writeAvoidingTempFiles() so _writer is null and the row gets removed even if it was not written yet.

Is there a way to create a DeferredSXSSFWorkbookWorkbook following the steps above (without using a RowGenerator)?

Could you please share an example or JUnitTest if there is one?
Comment 10 PJ Fanning 2021-10-21 16:05:32 UTC
Can I flip the problem - can you provide a reproducible test case?

There are test cases for the existing code - all our source code is publicly accessible so I'll let you look for it yourself. As I said above, the DeferredSXSSFWorkbook was contributed (ie not written by the Apache POI team) and the test coverage is not great but there is some.
Comment 11 PJ Fanning 2021-10-21 16:07:27 UTC
Can I ask why you are using DeferredSXSSFWorkbook and not the more widely used SXSSFWorkbook? SXSSFWorkbook works in a way more similar to how you seem to want the code to work - ie create a workbook, then a sheet, then a row and then a cell.
Comment 12 jasmin.suess 2021-10-22 13:21:33 UTC
Created attachment 38075 [details]
JUnitTest for DeferredSXSSFWorkbook usage

I can not use the SXSSFWSheets, since they create a SheetDataWriter in their constructor and the SheetDataWriter always produces a temp file (see SheetDataWriter.java line 72), which is a showstopper in my case. 
The DeferredSXSSFWorkbook on the other hand uses a StreamingSheetWriter, which does not produce temp files.

So i would either need a SXSSFSheet that uses a StreamingSheetWriter or a DeferredSXSSFWorkbook, which lets me flush rows before the writing process.

I have searched through your examples and tests but all of them use the RowGenerator, which i cannot use.
I have added a JUnitTest to illustrate my problem.
Comment 13 PJ Fanning 2021-10-22 13:28:16 UTC
DeferredSXSSFWorkbook is for row generators and row generators alone.

Apache POI is a volunteer project and if you can find a volunteer to write you a new version of SXSSFWorkbook/DeferredSXSSFWorkbook to suit your needs, then best of luck - POI team will try to help them if they run into issues.
Comment 14 PJ Fanning 2021-10-22 13:31:53 UTC
https://github.com/dhatim/fastexcel is still work trying - it may be able to generate a xlsx without temp files.
Comment 15 Patric Rufflar 2021-10-22 13:47:28 UTC
@PJ Fanning: Thanks for your support!

As mentioned above, fastexcel unfortunately lacks a couple of POI features which we need (e.g. https://github.com/rap2hpoutre/fast-excel/issues/96).

Therefore we would love to have a 100% tempfile-free workbook creation within POI with all (or most) features available and without the need for changing POI APIs (e.g. to the "row generator" API).

Can you give hints for a proper implementation of such a feature and do you see any showstoppers when considering the POI architecture?

Thanks!
Comment 16 PJ Fanning 2021-10-22 13:59:48 UTC
temp files appear all over POI code - I have no interest in rewriting POI for your use case - you will need to find someone if you are not willing to get involved in writing code yourselves
Comment 17 Patric Rufflar 2021-10-22 14:26:51 UTC
Sorry for the misunderstanding, I don't ask you to implement this, I am asking you for any hints for considering an implementation by ourselves.

I don't know anything about POIs internal architecture (yet), for which things tempfiles are currently used and whether they can be replaced by in-memory data structures (for size-limited data) or by just streaming the data out (for the actual, possibly unbound data to prevent any OOM).
Maybe it is impossible without refactoring major parts of POI at all - then this would be a nice warning for us :)