Bug 61031 - XSSFWorkbook and SXSSFWorkbook produce different output
Summary: XSSFWorkbook and SXSSFWorkbook produce different output
Status: NEW
Alias: None
Product: POI
Classification: Unclassified
Component: SXSSF (show other bugs)
Version: 3.15-FINAL
Hardware: Macintosh All
: P2 enhancement (vote)
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-04-24 07:45 UTC by Tim Jones
Modified: 2017-06-16 20:14 UTC (History)
0 users



Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Tim Jones 2017-04-24 07:45:29 UTC
For the same data and code, the output produced is different (although it seems to be functionally identical).

One difference is that the enclosed XML contains newlines for each row when produced by SXSSF, but no newlines when produced by XSSF. There are also other differences.

For larger files, this results in very different file sizes - the file produced by XSSF for 5800 rows of our data is 64% of the size of the file produces by SXSSF. This is significant - especially as SXSSF is the method preferred for larger files.

In the example provided below, the difference is that SXSSF produces a file that's 6 bytes smaller (the reverse of the difference we observed on larger date).

Assuming an OutputStream output:


            Workbook wb = new SXSSFWorkbook();
            Sheet sh = wb.createSheet("Sheet");
            Row r1 = sh.createRow(1);
            r1.createCell(1).setCellValue("One");
            r1.createCell(2).setCellValue("Two");
            Row r2 = sh.createRow(2);
            r2.createCell(1).setCellValue("One");
            r2.createCell(2).setCellValue("Two");
            wb.write(output);  // 3318 bytes
            wb.close();

            


            SXSSFWorkbook wb = new SXSSFWorkbook();
            Sheet sh = wb.createSheet("Sheet");
            Row r1 = sh.createRow(1);
            r1.createCell(1).setCellValue("One");
            r1.createCell(2).setCellValue("Two");
            Row r2 = sh.createRow(2);
            r2.createCell(1).setCellValue("One");
            r2.createCell(2).setCellValue("Two");
            wb.write(output); // 3312 bytes
            wb.dispose();
            wb.close();


Observed: The amount of data written to output are different - 3318 for XSSFWorkbook, and 3312 for SXSSFWorkbook.

Expected: Identical files produced.

(As an aside: I suspect some of the difference is due to SXSSF not using shared strings, although I can't see how to do that on a new workbook. Regardless, the whitespace changes still definitely exist, and it feels likely that there are other differences)
Comment 1 Tim Jones 2017-04-24 07:51:52 UTC
I tried the following to enable shared strings:

            SXSSFWorkbook wb = new SXSSFWorkbook(
                     new XSSFWorkbook(),1000,true,true);
            Sheet sh = wb.createSheet("Sheet");
            Row r1 = sh.createRow(1);
            r1.createCell(1).setCellValue("One");
            r1.createCell(2).setCellValue("Two");
            Row r2 = sh.createRow(2);
            r2.createCell(1).setCellValue("One");
            r2.createCell(2).setCellValue("Two");
            wb.write(output);
            wb.dispose();
            wb.close();

It resulted in a file of 3321 bytes (a new size). There's definitely more going on here.
Comment 2 Javen O'Neal 2017-04-24 08:08:16 UTC
POI inlines strings for SXSSF so that it doesn't have to maintain a shared strings table. This will make the output file larger.

I'm not sure how SXSSF handles cell styles, but wouldn't be surprised if it also inlined those to eliminate the need to maintain a style table in memory.

There have been a couple discussions of adding an optional shared strings table for SXSSF (this would allow RTF strings).

We could probably strip newline characters from the XML output, but this would be a trivial savings in file size. After zip compression, it would be negligible. File size could be improved more easily by adjusting the zip file compression settings. The trade off there would be compression and expanding time.
Comment 3 Javen O'Neal 2017-04-24 08:10:23 UTC
What file sizes are you measuring? The compressed zip or the raw XML?
Comment 4 Javen O'Neal 2017-04-24 08:15:46 UTC
One thing that would be different between the file contents would be the last modified date, which is saved in the XML. We may also save rId's in any order, so long as the reference numbers are used correctly. If we stores these in an unsorted HashMap before serializing, we could make no guarantee of producing binary-identical files, though the information would be the same.
Comment 5 Tim Jones 2017-04-26 06:22:29 UTC
I'm measuring the resulting compressed document, yes.

My main surprise was that the output was different (starting with the whitespace, and finishing with things like inline styles and string tables). 

Since this produces quite different file sizes for larger files, it could be added to the documentation, even if identical binary output is not practical. 

Users of the XSSF and SXSSF classes would want to be aware that there's an additional tradeoff.

The ability to tune the level of compression would be a nice-to-have.