Bug 51961

Summary: [Patch] SXSSF : Compress the temporary xml file
Product: POI Reporter: Santosh <santosh.kharolkar>
Component: SXSSFAssignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: normal    
Priority: P2    
Version: unspecified   
Target Milestone: ---   
Hardware: All   
OS: All   
Attachments: First cut implementation for data store writer class
Add another constructor to be able to set the compression flag during loading of existing workbook

Description Santosh 2011-10-05 14:35:13 UTC
The SXSSF temp  xml file is grows to a very large size. e.g. for a 20 MB 
csv data the size of the temp xml file become few GB large. In order to 
reduce this I have created a path and tested it. 

Request for the review and application of the same to the. 
I have changed the extension of the temp file from xml to gz, which is 
another cleanliness factor



>From 3e95b1880a04657ec93e5626720b1d40b5487ee4 Mon Sep 17 00:00:00 2001
From: Santosh <santosh.kharolkar@tcs.com>
Date: Tue, 4 Oct 2011 12:11:23 +0530
Subject: [PATCH] Compress the temporary xml file

---
 .../org/apache/poi/xssf/streaming/SXSSFSheet.java  |    9 ++++++---
 1 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/src/ooxml/java/org/apache/poi/xssf/streaming/SXSSFSheet.java 
b/src/ooxml/java/org/apache/poi/xssf/streaming/SXSSFSheet.java
index 8c1b998..567c9e8 100644
--- a/src/ooxml/java/org/apache/poi/xssf/streaming/SXSSFSheet.java
+++ b/src/ooxml/java/org/apache/poi/xssf/streaming/SXSSFSheet.java
@@ -22,6 +22,8 @@ import java.util.Iterator;
 import java.util.TreeMap;
 import java.util.Map;
 
+import java.util.zip.GZIPOutputStream;
+import java.util.zip.GZIPInputStream;
 import org.apache.poi.ss.usermodel.*;
 import org.apache.poi.ss.util.CellReference;
 
@@ -1284,9 +1286,9 @@ public class SXSSFSheet implements Sheet, Cloneable
 
         public SheetDataWriter() throws IOException 
         {
-            _fd = File.createTempFile("poi-sxxsf-sheet", ".xml");
+            _fd = File.createTempFile("poi-sxxsf-sheet-xml", ".gz");
             _fd.deleteOnExit();
-            _out = new BufferedWriter(new FileWriter(_fd));
+            _out = new OutputStreamWriter(new GZIPOutputStream (new 
FileOutputStream(_fd)));
         }
         public int getNumberOfFlushedRows()
         {
@@ -1308,7 +1310,8 @@ public class SXSSFSheet implements Sheet, Cloneable
         {
             _out.flush();
             _out.close();
-            return new FileInputStream(_fd);
+
+            return new GZIPInputStream ( new FileInputStream(_fd));
         }
 
         /**
-- 
1.7.6.msysgit.0
Comment 1 Santosh 2011-11-09 11:54:05 UTC
Update the file name

>From 3e95b1880a04657ec93e5626720b1d40b5487ee4 Mon Sep 17 00:00:00 2001
From: Santosh <santosh.kharolkar@tcs.com>
Date: Tue, 4 Oct 2011 12:11:23 +0530
Subject: [PATCH] Compress the temporary xml file

---
 .../org/apache/poi/xssf/streaming/SXSSFSheet.java  |    9 ++++++---
 1 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/src/ooxml/java/org/apache/poi/xssf/streaming/SXSSFSheet.java 
b/src/ooxml/java/org/apache/poi/xssf/streaming/SXSSFSheet.java
index 8c1b998..567c9e8 100644
--- a/src/ooxml/java/org/apache/poi/xssf/streaming/SXSSFSheet.java
+++ b/src/ooxml/java/org/apache/poi/xssf/streaming/SXSSFSheet.java
@@ -22,6 +22,8 @@ import java.util.Iterator;
 import java.util.TreeMap;
 import java.util.Map;

+import java.util.zip.GZIPOutputStream;
+import java.util.zip.GZIPInputStream;
 import org.apache.poi.ss.usermodel.*;
 import org.apache.poi.ss.util.CellReference;

@@ -1284,9 +1286,9 @@ public class SXSSFSheet implements Sheet, Cloneable

         public SheetDataWriter() throws IOException 
         {
-            _fd = File.createTempFile("poi-sxxsf-sheet", ".xml");
+            _fd = File.createTempFile("poi-sxssf-sheet-xml", ".gz");
             _fd.deleteOnExit();
-            _out = new BufferedWriter(new FileWriter(_fd));
+            _out = new OutputStreamWriter(new GZIPOutputStream (new 
FileOutputStream(_fd)));
         }
         public int getNumberOfFlushedRows()
         {
@@ -1308,7 +1310,8 @@ public class SXSSFSheet implements Sheet, Cloneable
         {
             _out.flush();
             _out.close();
-            return new FileInputStream(_fd);
+
+            return new GZIPInputStream ( new FileInputStream(_fd));
         }

         /**
-- 
1.7.6.msysgit.0
Comment 2 Santosh 2011-12-05 10:11:38 UTC
Request you to pls add this patch for next beta5 release. It saves me to not keep patching the releases.
Comment 3 Yegor Kozlov 2011-12-05 10:39:40 UTC
It was discussed on the mailing list that compression of the temp files should be optional. I'd rather not apply your patch in its current form.

Yegor
Comment 4 Santosh 2011-12-05 11:27:41 UTC
If you drop a hint on how to add the option for it, I will rework the patch. Should it be a flag in the sxssfworkbook/sheet class? What is your preferred way to set options?
Comment 5 Yegor Kozlov 2011-12-05 11:42:17 UTC
IMO the best would be to define a method to pass a custom instance of SheetDataWriter :

SXSSFWorkbook#setSheetDataWriter (SheetDataWriter  writer) 

In your case you would pass a GZSheetDataWriter that supports compression of the temp files. 

Does it make sense to you? 

Yegor

(In reply to comment #4)
> If you drop a hint on how to add the option for it, I will rework the patch.
> Should it be a flag in the sxssfworkbook/sheet class? What is your preferred
> way to set options?
Comment 6 Santosh 2011-12-06 11:46:20 UTC
Created attachment 28031 [details]
First cut implementation for data store writer class
Comment 7 Santosh 2011-12-06 11:47:04 UTC
I have the first cut implementation ready. But setting the writer in the workbook class is difficult. Since we need multiple instances of this writer in constructor of workbook. (one per sheet)

Can I suggest the polocy to be set in the Sheet class as static or some other way.

I am attaching the first cut implementation to this
Comment 8 Yegor Kozlov 2011-12-09 11:21:17 UTC
Committed in r1212330

Santosh,

When I suggested passing a subclass of SheetDataWriter I didn't realize the complications. Since each sheet uses its own instance of SheetDataWriter you pass the class name instead of an instance and this is not good in my opinion. 

I re-worked the patch and used your original idea: pass a flag to SXSSFWorkbook.
In current implementation the gzip compression is turned on as follows:

        SXSSFWorkbook wb = new SXSSFWorkbook();
        wb.setCompressTempFiles(true); // temp files will be gzipped

I think it is much more user-friendly than passing a class name in the constructor. Sorry that my previous suggestion mislead you. 

In any case, thank you for the patch. You've written 90% of it and I just put it in a slightly different shape. 


Regards,
Yegor
Comment 9 Santosh 2011-12-09 11:50:57 UTC
Thank you., 

Though we will need another way to pass this flag as well in next patch I will work that out. In the use case where the wb is created from pre-existing wb. We can track that separately. Another easy solution would be to make the flag a class level static flag. Which can solve all the cases required.
Comment 10 Santosh 2011-12-12 10:53:52 UTC
Created attachment 28066 [details]
Add another constructor to be able to set the compression flag during loading of existing workbook

Add another constructor to be able to set the compression flag during loading of existing workbook
Comment 11 Santosh 2011-12-16 05:53:13 UTC
A small patch attached to complete the action checked in by yegor. 

Add another constructor to be able to set the compression flag during loading
of existing workbook
Comment 12 Yegor Kozlov 2011-12-16 10:12:08 UTC
Applied in r1215080

Yegor

(In reply to comment #11)
> A small patch attached to complete the action checked in by yegor. 
> 
> Add another constructor to be able to set the compression flag during loading
> of existing workbook