Bug 53493

Summary: delete xml's temporary files when write workbook
Product: POI Reporter: Massimo Cavalleri <submax>
Component: SXSSFAssignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: normal CC: dpletcher
Priority: P2    
Version: 3.8-FINAL   
Target Milestone: ---   
Hardware: All   
OS: All   
Bug Depends on: 53780    
Bug Blocks:    
Attachments: patch for delete xml's temporary files when write workbook
Command line application to test SXSSF temp file removal in finalize()

Description Massimo Cavalleri 2012-07-01 15:19:55 UTC
Created attachment 29018 [details]
patch for delete xml's temporary files when write workbook

the xml's temporary files are not deleted when call write method of SXSSFWorkbook class.

Example:
/var/tmp/poi-sxssf-sheet-xml127328075585779261.gz
or when compress is active:
/var/tmp/poi-sxssf-sheet127328075585779261

In the SXSSF code there is not a part for this purpose.

There is only _fd.deleteOnExit() in SheetDataWriter class, but is not enough if the program that use POI not exit from jvm and these files remain on file system.

For confirm this, in the method write in SXSSFWorkbook class, there is already tmplFile.delete() for another tmp file.

In conclusion in write method missing the delete on xml's temporary files.

I believe the correct solution is to have SXSSF tidy up when it's done with files, rather than requiring users to do that via explicit temp directory controls.

I attach a patch for fix this issue. This is very rudimental patch (I don't know the structure of SXSSF classes), this is an example,  but it works very well.
Comment 1 Alex Geller 2012-07-03 08:37:12 UTC
Reading the patch I conclude that you can save a SXSSF workbook only once. That would not be correct. The earliest point in time that the a data file can be deleted it when it is no longer referenced. Hence I am OK with that part of the patch that deletes the files in Sheet.finalize(). I think that the part that explicitly calls Sheet.finalize() from the Workbook.write() should be removed.
Comment 2 Massimo Cavalleri 2012-07-04 06:55:37 UTC
(In reply to comment #1)
> Reading the patch I conclude that you can save a SXSSF workbook only once.
> That would not be correct. The earliest point in time that the a data file
> can be deleted it when it is no longer referenced. Hence I am OK with that
> part of the patch that deletes the files in Sheet.finalize(). I think that
> the part that explicitly calls Sheet.finalize() from the Workbook.write()
> should be removed.

I try to write a same workbook two time but this do not permit action and return Exception:

Exception in thread "main" java.io.IOException: Stream closed
	at java.io.BufferedWriter.ensureOpen(BufferedWriter.java:98)
	at java.io.BufferedWriter.flushBuffer(BufferedWriter.java:108)
	at java.io.BufferedWriter.flush(BufferedWriter.java:235)
	at org.apache.poi.xssf.streaming.SheetDataWriter.close(SheetDataWriter.java:78)
	at org.apache.poi.xssf.streaming.SXSSFSheet.getWorksheetXMLInputStream(SXSSFSheet.java:71)
	at org.apache.poi.xssf.streaming.SXSSFWorkbook.injectData(SXSSFWorkbook.java:295)
	at org.apache.poi.xssf.streaming.SXSSFWorkbook.write(SXSSFWorkbook.java:767)
	at TestXlsx.main(TestXlsx.java:37)

For this reason I have insert the delete in write method. 
I understand that object's workbook can to be writed only one time.
Infact finalize() of SheetDataWriter already exist and this is private.
Comment 3 Alex Geller 2012-07-04 10:29:11 UTC
Yes, you are right but that is another issue (bug). There is nothing in the documentation that says that you can call (S)XSSFSheet.write() only once. 
To fix it, the statements flushRows(0) and  _writer.close() in SXSSFSheet.getWorksheetXMLInputStream() should be done only if the sheet has not already been flushed to the disk so that the function should look something like this:
public InputStream getWorksheetXMLInputStream() throws IOException
{
// flush all remaining data and close the temp file writer if not already done
    if(isModifiable())
    {
        flushRows(0);
        _writer.close();
    }
    return _writer.getWorksheetXMLInputStream();
} 
/** Is this sheet modifiable. Sheets created before a workbook is saved become immutable after saving*/
public boolean isModifiable()
{
    return !_writer.isClosed();
}
The method isModifiable() also documents another detail in the constraints in terms of "order of calls" that the streaming nature of SXSSF imposes over XSSF. Perhaps we should consider introducing an unchecked "StreamingConstraintException" when one violates the calling rules (e.g. if you try to add data to a closed sheet).
However, I think that the general rule should be to avoid constraints when we have the choice between a constrained and a non constrained implementation of a public API method.
In the case of SXSSFWorkbook.write() I see no reason why the number of calls should be restricted to calling it only once. I even suspect that that saving a workbook, adding sheets and saving again will work with the suggested fix. 
On the other hand, if you run into more difficult issues then it might not be worth it and we should document that you may call the function SXSSFWorkbook.write() only once.
Regarding your correct observation that the current code already contains code to delete the temporary data file of a sheet, my question is, why that isn't sufficient for your problem? If you make sure that your application does not reference the SXSSFWorkbook after saving it then the temporary files should get deleted some tome after that when the garbage collector collects the sheets. If this doesn't happen than that is a bug. Apparently the general advice is not to use finalize() for temp file deletion but the arguments I have heard so far are not convincing. I will read up the relevant chapter in Effective Java and if it turns out to be an issue then I agree that we will have to remove the files explicitly.
Comment 4 Alex Geller 2012-07-05 09:11:59 UTC
Created attachment 29032 [details]
Command line application to test SXSSF temp file removal in finalize()
Comment 5 Alex Geller 2012-07-05 09:12:28 UTC
I hope it is OK to post this much prose for a bug but I think I am done now. The main problem with temp file removal in finalize() is that you can't predict when it is done. I thought that that wouldn't be an issue since we don't really care when they are removed as long as they are removed timely (e.g. the next few minutes). However it is possible that this is not the case. If you comment the line that runs gc explicitly from the attached test program then the workbook does not get garbage collected on my JVM (garbage collection is implementation dependent so it may not reproduce on other JVMs). However, for long running programs that allocate memory regularly one can probably assume that the scheme works.
I suggest to leave the code as is but to introduce an additional function SXSSFWorkbook.dispose() that allows to free the resources explicitly.
Comment 6 Massimo Cavalleri 2012-07-06 11:00:08 UTC
1) What is bug that don't permit multiple writes of the same workbook object?
2) This bug occurs only with SXSSF ?
Comment 7 Alex Geller 2012-07-06 14:41:44 UTC
(In reply to comment #6)
> 1) What is bug that don't permit multiple writes of the same workbook object?
I posted a new bug 53515 "SXSSFWorkbook.write() fails when called more than once" (https://issues.apache.org/bugzilla/show_bug.cgi?id=53515) 
> 2) This bug occurs only with SXSSF ?
Yes, the test application provided in bug 53515 shows that it works perfectly for XSSFWorkbook. Besides, if that weren't the case then it should be documented.
Comment 8 Yegor Kozlov 2012-09-14 14:04:05 UTC
As of r1384784, POI provides a method to explicitly dispose of the temp files associated with the workbook.

Call SXSSFWorkbook.dispose() in the end and you will be good. 

Yegor