Bug 51160 - [PATCH] Provide ability to produce very large xlsx files using the regular high level API with the default heap space of 64MB
Summary: [PATCH] Provide ability to produce very large xlsx files using the regular hi...
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: XSSF (show other bugs)
Version: unspecified
Hardware: PC All
: P2 enhancement (vote)
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-05-06 13:26 UTC by Alex Geller
Modified: 2011-05-19 12:26 UTC (History)
0 users



Attachments
Patch to apply from the root of the poi-3.6-20091214 source tree (104.74 KB, application/octet-stream)
2011-05-06 13:26 UTC, Alex Geller
Details
Streaming patch using composition (116.09 KB, application/octet-stream)
2011-05-16 09:38 UTC, Alex Geller
Details
Test case for differences in image output for the formats HSSF, XSSF and SXSSF (2.10 KB, text/x-java-source)
2011-05-16 09:44 UTC, Alex Geller
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Alex Geller 2011-05-06 13:26:14 UTC
Created attachment 26964 [details]
Patch to apply from the root of the poi-3.6-20091214 source tree

In response to a discussion with Nick Burch in the user forum (see http://apache-poi.1045710.n5.nabble.com/HSSF-and-XSSF-memory-usage-some-numbers-td4312784.html) I am submitting an example implementation to provide support for very large xlsx files via the regular high level API without requiring more memory that is provided by the default heap space.

This text assumes knowledge of the thread mentioned above so that the problem and its proposed solution are not repeated again here.

General remarks
-------------------------
1) The code is far from being ready for general availability but I am submitting it to see if there is a chance that it may become part of the general source tree. If yes, then I will spend more time on it and do what I can to make it feature complete.
2) I made the patch on the 3.6 version because that is what we currently are using. That of course is nonsense and I should have done it on the 3.8 beta2 but for having a first look, I thought that it wouldn't make a difference. If for whatever reason that is inconvenient to you, please tell me and I will apply the patch to the 3.8 version and make the necessary changes if any.

Description of the patch
----------------------------------
The patch adds a demo application PoiTest (This is the application mentioned in the thread that was used to do the benchmarks) in patched/src/examples/src/org/apache/poi/ss/examples.
This console program takes some command line arguments and produces sheets in the formats xls. and xlsx using HSSF, XSSF and SXSSF. The SXSSF classes are new (The 'S' in the name stands for "streaming)" and they are the main subject of this proposal.
The program creates very simple workbooks (only one sheet, no styles, no images, no merged regions, only numeric data) but the implementation handles the mentioned items and string values too. I expect that many of your unit test will fail but us this is already sufficient to create the most sophisticated output we have.
The patch includes a shell script at the root "RunPoiTest.sh" which contains an example invocation which creates a 300,000 x 33 sheet using SXSSF.
The patch introduces 4 source files in the "ooxml/java/org/apache/poi/xssf/usermodel" package.
1) SXSSFWorkbook extends XSSFWorkbook.

This class overloads write() to do the "BigGridDemo" patching stuff (Had to remove the "final" on the method in POIXMLDocument for that).
I tricked the class into creating SXSSFSheets (instead of XSSFSheets) in calls to createSheet() and cloneSheet() by means of a virtual function XSSFWorkbook.getWorksheetRelation(). It seems to work but it is done without understanding your instantiation code at all.

2) SXSSFSheet extends XSSFSheet

This class overloads all methods that deal with row creation and management. Instead of producing XSSFRow instances in calls to createRow() it returns SXSSFRow instances.
I decided to make SXSSFRow implement the "Row" interface from scratch rather then making it a subclass of XSSFRow. This causes the following problem:
Since XSSFSeet specializes some (or all?) functions from the interface Sheet that return or take a parameter of type "Row" to "XSSFRow" I could not overload those methods because SXSSFRow is not a subclass of XSSFRow as in the following example: 
The signature of Sheet.createRow() in ss.usermodel.Sheet is "public Row createRow(...)"
In xssf.usermodel.XSSFSheet is is "public XSSFRow createRow(..)"
Now in xssf.usermodel.SXSSFSheet I want to overload the function and write "public Row createRow(...)" or perhaps even "public SXSSRow createRow()". Neither is possible because you can only specialize but not generalize on overloading a method in a sub class.

I solved the problem the wrong way. What I did was to remove the specialization in the signatures of the function in XSSFSheet so that the signatures are identical with the interface definitions.  As a result of this I had to add casts in many places in classes in the xssf package. So far, this is fine because a XSSFSheet may safely assume that "Row XSSFSheet.getRow(int index)" will in reality return a SXSSFRow.

By the time I had worked my self through the unit test replacing XSSFRow row=sheet.createRow() with "Row row=sheet.createRow()" and started on the examples I realized that this had not been the smartest idea. If this were version 1.0 of POI then that would have been fine but asking everybody to change from "XSSFRow" to "Row" is perhaps not such a good idea. 
Nevertheless I did it that way for the moment it and we can't solve it any other way, we can always resort to composition (implement SXSSFWorkbook and SXSSFSheet form scratch) which is what I did on my local version here. Incidentally, this also solves the problem that "final void write()" cannot be overloaded.

The class uses an inner class "SheetDataWriter" which is inspired by the "SpreadsheetWriter" class from the "BigGridDemo". Its functionality has been reduced to the capability to serialize the "<sheetData>" containing the rows only via calls to SheetDataWriter.writeRow(int rowIndex,Row row). 
Unlike the "BigGridDemo" which produces the entire file, this class produces a document fragment file which is then injected into the "sheet" file produced by XSSFWorkbook.write(). If you find this confusing then look at the function SXSSFWorkbook.write(). 

3) SXSSFRow implements Row
This is a lightweight implementation of a Row with array storage for the cells. 
The serialization is done by its owning SXSSFSheet.
The function row.createCell() returns SXSSFCell instances described below.

3) SXSSFCell implements Cell
This is a lightweight implementation of a Cell with an array storage for the cells. It can be made a lot lighter but this naive implementation is sufficiently lean to provide a 5000 row cache in the default heap space without any problems at all. I haven't tested the actual numbers but guessing from the other naive models I played with earlier I would expect a cell to take per average something between 70 to 100 byte which is close to what HSSF consumes.

I added "//TODO:" wherever there I was aware that I didn't implement something or when I was unsure about something.

OK, I probably forgot some items but I can't think of anything else for the moment.
Comment 1 Yegor Kozlov 2011-05-09 12:39:02 UTC
Thanks for the great patch! I'm reviewing it and will give my feedback soon. 

When I wrote BugGridDemo, I had in a generic streaming api in mind, but your implementation goes far beyond that. Very cool, I should say. 

Some comments on a quick review:

 1. Please remove 'Property of Four Js*' and other copyright statements. By submitting a patch to Apache POI you agree to the Apache Licence Terms.
 2. the SXSSF classes should live in a separate package under xssf. I propose to re-package them to org.apache.poi.xssf.streaming.*
 3. I'm in two minds if subclasscing from XSSF was a good idea, most of XSSF classes are final by design and not meant for extention. Also, many operations in the streaming mode are not allowed and should throw IllegalStateException. It would be easier control if SXSSF classes were composites. 
Never mind for now, I will say my final word after review.

 4. The PoiTest application should be renamed to something more descriptive, e.g. SSPerformanceTest
 
Please upload a patch against trunk. Your patch applies OK to 3.6, but I had problems applying it to trunk and had to resolve the rejections.

Regards,
Yegor
Comment 2 Alex Geller 2011-05-09 14:00:50 UTC
Hi Yegor,
thanks for looking into this so quickly. I will make a version for trunk with the requested changes as soon as you have decided for either composition or subclassing. 
I discovered the first problem with the implementation. I found that images were not correctly sized and went after the issue. It seems that you are using "twoCellAnchor" for images and since the SXSSFSheet has no rows you are computing "rowOff" based on the default row height. I went hunting for the place where this is computed but I couldn't find it (Can you help?). The solution could be to update any Anchor that spans a row when Row.setHeight() is called. Another solution could be to use "oneCellAnchor" whenever an image is anchored using ClientAnchor.DONT_MOVE_AND_RESIZE. Are there other issues like this?
I added code to quote character data in the the spreadheet writer (this should also be done in the BigGridDemo) because otherwise it will create invalid documents when you write string values containing characters that are meaningful in XML like the character '<'. Do you have a utilty class for this?
Is this the peferred method to communicate or am I misusing bugzilla?
Thanks,
Alex
Comment 3 Yegor Kozlov 2011-05-14 09:49:23 UTC
Hi Alex,

Bugzilla is the right place to discuss patches. This way we have a public record of the discussion and can easily track history.

I'm leaning towards copmposition. The advantage is that with this approach we can stay untangled from XSSF and don't need to change signatures of XSSFSheet#getRow and XSSFSheet#createRow. My concern is compatibility with existing client code (including products I wrote for my company). There can be legacy code assuming that  XSSFSheet#createRow returns XSSFRow, not Row and your changes would broke it. 

So, please refactor your patch to use composition instead of subclassing. I would like to check it in and include in the next beta. 

At first glance,  SXSSF handles images correctly. I ran org.apache.poi.xssf.usermodel.examples.WorkingWithPictures from POI examples, only modified it to construct SXSSFWorkbook instead of XSSFWorkbook and the output looks good. Do you call XSSFPicture#resize() after you add it to a drawing? This is the place where the image is resized relative to its top-left corner.

Regards,
Yegor
Comment 4 Alex Geller 2011-05-16 09:38:58 UTC
Created attachment 27006 [details]
Streaming patch using composition
Comment 5 Alex Geller 2011-05-16 09:40:20 UTC
Hi Yegor,
find attached a batch based on 3.8-beta2-20110408 using composition. As requested the files are now in the new "org.apache.poi.xssf.streaming" package and PoiTest was renamed to SSPerformanceTest. The copyrights are removed and I added our company name to the @author if that is OK. Regarding the image bug find attached a small program that shows the differences in image output for the formats HSSF, XSSF and SXSSF. Note the relative path to the image "./test-data/spreadsheet/logoKarmokar4.png" in the program that has to be adapted if you run the program from anywhere else but the root of the source tree. The sample is prepared to be put as a test case using HSSFTestDataSamples.getTestDataFileContent() instead (The code is commented).
Thanks,
Alex
Comment 6 Alex Geller 2011-05-16 09:44:43 UTC
Created attachment 27007 [details]
Test case for differences in image output for the formats HSSF, XSSF and SXSSF

Running the program will produce three files. The issues are:
- In the XSSFOuput.xlsx the image is not visible
- In both XSSF and SXSSF the first row appears larger than in HSSF (supposed to be 2 inch high)
- In SXSSF the image appears vertically stretched for the reasons I describe earlier.
Comment 7 Yegor Kozlov 2011-05-17 11:09:34 UTC
Applied in r1104120. 

I modified SSPerformanceTest to produce diffrent types of cells - numeric, string, etc. This utility is a handy indicator that all three APIs generate the same output and it is readable. Most feature works OK with SXSSF, I found only two issues so far: 

 - booleans are shown as 1/0 instead of TRUE/FALSE
 - formulas are not yet supported. 

See my TODO's in the code. 

One strange thing is that SXSSF is slower than XSSF, I expect these APIs to be at least on par. 

Here are my benchmarks using SSPerformanceTest. Second column is max heap size, third is the size of grid, the last column is elapsed time.

HSSF   -Xmx1024  10000x100  5 sec
XSSF   -Xmx1024  10000x100  38 sec
SXSSF  default   10000x100  302 sec   // almost 10x slower than XSSF!

There is something to work on.

Alex, would you like to continue working on this feature and submit more patches? The code is great, but to be complete it needs documentation and tests. 

Regards,
Yegor
Comment 8 Yegor Kozlov 2011-05-17 11:11:15 UTC
It is a bug in XSSFPicture on line 250:

instead of 

double ch = getRowHeightInPixels(row2 + 1);

it should be

double ch = getRowHeightInPixels(row2);

The fix is coming soon.

Yegor

(In reply to comment #6)
> Created attachment 27007 [details]
> Test case for differences in image output for the formats HSSF, XSSF and SXSSF
> 
> Running the program will produce three files. The issues are:
> - In the XSSFOuput.xlsx the image is not visible
> - In both XSSF and SXSSF the first row appears larger than in HSSF (supposed to
> be 2 inch high)
> - In SXSSF the image appears vertically stretched for the reasons I describe
> earlier.
Comment 9 Yegor Kozlov 2011-05-18 10:40:22 UTC
I fixed the issue with the image not visible in XSSFOuput.xlsx. As I said, there was a bug in XSSFPicture on line 253. The fix committed in r1312341

The image in SXSSF is still vertically  stretched if you set a custom row height. It is not a bug, rather a feature of the current implemenetation. 

See what happens:

When you create a  Drawing you delegate the call to the underlying XSSFSheet#createDrawingPatriarch(). Then you create a picture, resize rows and call Picture#resize(). Excel anchores shapes to cells and rows and Picture#resize() attempts to calculate how much columns and rows the image occupies. It asks the parent (XSSFSheet, not SXSSFSheet!) to get the height of the first row, but XSSFSheet knows nothing about it and returns the default row height.

To be clear, the problem is in XSSFPicture#getRowHeightInPixels(int rowIndex) :

    private float getRowHeightInPixels(int rowIndex){
        XSSFSheet sheet = (XSSFSheet)getDrawing().getParent();

        XSSFRow row = sheet.getRow(rowIndex);
        float height = row != null ?  row.getHeightInPoints() : sheet.getDefaultRowHeightInPoints();
        return height*PIXEL_DPI/POINT_DPI;
    }


It operates with XSSF* objects while in a SXSSF context it should operate with SXSSFSheet and SXSSFRow.

I'm not yet sure what is the best way to fix it. One approach is to create SXSSFDrawing and SXSSFPicture. Any ideas? 

Regards,
Yegor

(In reply to comment #6)
> Created attachment 27007 [details]
> Test case for differences in image output for the formats HSSF, XSSF and SXSSF
> 
> Running the program will produce three files. The issues are:
> - In the XSSFOuput.xlsx the image is not visible
> - In both XSSF and SXSSF the first row appears larger than in HSSF (supposed to
> be 2 inch high)
> - In SXSSF the image appears vertically stretched for the reasons I describe
> earlier.
Comment 10 Yegor Kozlov 2011-05-18 11:14:02 UTC
Alex,

I committed initial support for SXSSF tests in r1124177. There are four classes based on the abstract spreadsheet tests that are used for HSSF and XSSF testing. Failing tests are disabled. Ideally, all tests that make sense in SXSSF should pass. Please feel free to add / modify them any way you like.

I see that some tests fail because you don't validate input arguments, for example, sheet.createRow(-1) throws IllegalArgumentException in HSSF and XSSF, but not in SXSSF. I hope those are easy to fix.

Regards,
Yegor
Comment 11 Yegor Kozlov 2011-05-19 12:14:38 UTC
I did a deeper review of SXSSSF and made a lot of small fixes, mostly to have the commented tests pass. I'm happy with the result. There are 9 disabled tests of 57, but these involve operations not supported in SXSSF:

1. evalution of formulas. 
2. cloning of sheets
3. shifting rows

Other than that the API works fine and fully conforms to the POI spreadsheet interfaces.

 Below is a summary if my changes:

 SXSSFWorkbook
   - cloneSheet does not make sense in SXSSF throws NotImplemented

 SXSSFSheet
   - added support for blank, boolean, error and formula cells 
   - fixed getFirstRowNum() and getLastRowNum()  to return 0 if there are no rows. The old code threw NoSuchElement exception.
   - removeRow(Row row) throws exception if the argument row belongs to a different sheet

 SXSSFRow
   - getCell(int cellnum) validates the input argument and throws IllegalArgumentException if it is < 0.
   - getCell(int cellnum) respects MissingCellPolicy and returns either null or blank depending on the workbook settings.
   - getLastCellNum() returns -1 if there are no cells.
   - Constructed FilledCellIterator points to the first cell (was to cellIndex=0 which isn't always the case)

 SXXSFCell
   - misc fixes related to conversion of cell types. For example, if you create a string cell and then change its type to boolean then then string value must be converted and if it was "TRUE" or 1 then the boolean value=true. 
   - getCellStyle() always returns not-null. Default cell style has zero index and can be obtained as workbook.getCellStyleAt(0)
   - implememented toString()
   

Committed in r1124698. Shout if I've done something wrong.
   

Regards,
Yegor
Comment 12 Yegor Kozlov 2011-05-19 12:26:50 UTC
We can to a point where we can resolve this ticket. The initial version of SXXSF is in trunk and bugs will be in new tickets.

I opened Bug 51233 for the stretched image issue.

Yegor