Bug 60707 - [PATCH] Reading very large excel files using StAX made easier - StreamingWorkbook
Summary: [PATCH] Reading very large excel files using StAX made easier - StreamingWork...
Status: NEEDINFO
Alias: None
Product: POI
Classification: Unclassified
Component: SXSSF (show other bugs)
Version: unspecified
Hardware: PC All
: P2 enhancement (vote)
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords: PatchAvailable
Depends on: 61213
Blocks:
  Show dependency tree
 
Reported: 2017-02-08 04:41 UTC by Renjith R
Modified: 2017-11-22 02:21 UTC (History)
0 users



Attachments
Patch that contains the new classes anf files introduced for enabling this functionality. (42.20 KB, patch)
2017-02-08 04:41 UTC, Renjith R
Details | Diff
Patch which implements the interfaces Cell, Row, Sheet & Workbook (108.04 KB, patch)
2017-02-15 05:21 UTC, Renjith R
Details | Diff
Version 3: Patch with some of the review comments implemented (145.50 KB, patch)
2017-04-23 15:54 UTC, Renjith R
Details | Diff
Version 4: Added @NotImplemented. Added feature getColumnIndex (152.39 KB, patch)
2017-04-24 04:25 UTC, Renjith R
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Renjith R 2017-02-08 04:41:43 UTC
Created attachment 34731 [details]
Patch that contains the new classes anf files introduced for enabling this functionality.

This API helps the user to 'read' very large excel files and return the data in String format. It represents the excel file as an instance of ‘org.apache.poi.xssf.streaming.reader.StreamedWorkbook’. This workbook can be sub divided into StreamedSheets> StreamedRows > StreamedCells. StreamedCells are the basic building block. StreamedCell represent the excel cell and holds the String representation of the excel cell value.  In order to reduce the memory usage StreamedCell is restricted to store only the String value & Cell Number. Since the string value is exactly same as seen in excel file, user can format it to whatever type he requires.

Apart from that, it uses the pull parser(StAX api) for streaming, so that the user has more control over the parsing. User can read N rows of data, process it and then read the next N blocks as so on..

Patch attached. Please take a look and let me know your comments.

thanks,
Renjith
Comment 1 Javen O'Neal 2017-02-08 05:27:26 UTC
This looks like a good start and a complement to SXSSFWorbook, which is a write-only streaming API.

The unit test for StreamingWorkbook is a nice touch, too.

In order to better integrate these classes within POI, they would need to implement the Workbook, Sheet, Row, and Cell interfaces so that someone can write generic Common SS code and switch between HSSF, XSSF, write-SXSSF and sax-read-SXSSF.

It's fine to stub out most methods for now as either "will be supported in the future" or "won't be supported due to memory footprint".
Comment 2 Renjith R 2017-02-15 05:21:32 UTC
Created attachment 34757 [details]
Patch which implements the interfaces Cell, Row, Sheet & Workbook

I have implemented the Cell, Row, Sheet & Workbook interfaces for easy integration with existing functionality.
Comment 3 Renjith R 2017-02-15 05:24:34 UTC
I have implemented Cell, Row, Sheet & Workbook interfaces for easy integration with existing functionality. Methods which needs more research to implement are mentioned as 'Will be supported in future'. Others are also provided with proper comments.
Comment 4 Renjith R 2017-02-22 13:28:25 UTC
Gentle reminder. Did someone got a chance to look into?
Comment 5 Renjith R 2017-03-08 15:43:18 UTC
I'll continue to add the functionalities and post the patches here. I will be glad if someone got a chance to review my work. Let me know if you want me to look at anything specific or please guide me if I am moving in a wrong direction.
Comment 6 Dominik Stadler 2017-03-19 08:45:51 UTC
I did a look, here some initial comments:
* The initial implementation looks already like a good start, thanks for putting in the work
* Maybe we should throw "UnsupportedException" in the methods that are not supported, this way the user immediately knows even if she does not look at the JavaDoc?
* As it is complex new functionality, we might first add it to the "scratchpad" project/source-folders so we can let it stabilise some more until we declare it as "production ready" by moving it into the ooxml-part.
* Please remove the "// TODO Auto-generated method stub" comments with a comment that explains why the method is empty or with an exception as stated above or simply remove it
* Please try to format the code consistently according to our coding-guidelines, see http://poi.apache.org/guidelines.html#CodeStyle, your IDE will usually allow to define it and reformat a whole file in one go.
* Is there a way to not duplicate the date-formats in StreamedSheetEventHandler? We already handle them in various places, e.g. DateFormatConverter
* On testing I would love to see some test that kind of "trashes" the implementation, i.e. take all spreadsheets that we have under test-data and run the normal XSSFWorkbook/HSSFWorkbook and compare results to your implementation as far as possible. This way we ensure that your code handles all the special cases that can arise.
Comment 7 Renjith R 2017-03-20 05:00:01 UTC
Thanks a lot for the Inputs, Dominik. Let me work out on these points and I'll get back to you.
Comment 8 Renjith R 2017-04-23 15:54:57 UTC
Created attachment 34943 [details]
Version 3: Patch with some of the review comments implemented

Hi.. 
I have tried to implement most of the comments. I am still working on reusing the existing methods for reading the date from excel. Since it is time-consuming, I am posting rest of the changes here. Kindly take and look and let me know your queries, suggestions etc..
Comment 9 Javen O'Neal 2017-04-23 19:02:13 UTC
Could you also add a @NotImplemented annotation to methods where you throw Unsupported operation exception? This will make it clear to the Javadocs readers that the method isn't implemented yet.

https://poi.apache.org/apidocs/org/apache/poi/util/NotImplemented.html
Comment 10 Renjith R 2017-04-24 04:25:56 UTC
Created attachment 34945 [details]
Version 4: Added @NotImplemented. Added feature getColumnIndex

Thanks a lot for the suggestion.
Added @NotImplemented.
Comment 11 Renjith R 2017-05-16 10:59:48 UTC
I was analyzing 'DateFormatConverter' to see if I can reuse any logic from it. But, as per my understanding, this class is more helpful while writing data to excel. eg convert(Locale.JAPANESE, "dd MMMM, yyyy"), will generate a data format string which can be used for applying a style to Cell. I'll keep on searching. Please let me know if anyone has any clue on this.
Comment 12 Renjith R 2017-05-16 11:01:56 UTC
I was analyzing 'DateFormatConverter' to see if I can reuse any logic from it. But, as per my understanding, this class is more helpful while writing data to excel. eg convert(Locale.JAPANESE, "dd MMMM, yyyy"), will generate a data format string which can be used for applying a style to Cell. I'll keep on searching. Please let me know if anyone has any clue on this.
Comment 13 Renjith R 2017-06-24 13:27:35 UTC
Did someone get a chance to review the code? 
I'd be glad to see some comments.
Comment 14 PJ Fanning 2017-06-24 13:46:56 UTC
I added a StaxHelper class today because I think it is a good idea for us to apply default configuration to the factories. One benefit is to protect against XML Entity Expansion attacks.
Would you be able to uptake this?
Comment 15 Renjith R 2017-06-26 07:33:03 UTC
Sure, I'll look into it. 
Hope (https://bz.apache.org/bugzilla/show_bug.cgi?id=61213) is the one you are talking about.
Comment 16 PJ Fanning 2017-06-26 15:06:46 UTC
Renjith - the StaxHelper is already merged to the svn trunk (src/java). The rest of the change for using StAX parser for issue 61213 is still under discussion.
Comment 17 PJ Fanning 2017-07-08 18:54:55 UTC
Renjith - could you attach SpreadSheetSample04022017.xlsx to the issue?
Comment 18 Renjith R 2017-07-09 03:33:45 UTC
You can take it from git. 

https://github.com/ranju4u6/poi/blob/Startup/test-data/spreadsheet/SpreadSheetSample04022017.xlsx


Sorry, I did not get a chance to look into your code due to tight work schedule. I have committed my changes to the following branch.

https://github.com/ranju4u6/poi/tree/Startup

If you find it interesting, please take it up. It may take me a couple of months or more to get into a normal schedule. I'll catch up with you then.
Comment 19 PJ Fanning 2017-07-13 12:37:12 UTC
I created a new branch and Pull Request for the work to date.
I took Renjith's patch and did a little tidy up.
https://github.com/apache/poi/pull/64

I think we need a fair amount of extra coverage (using different workbooks) and to decide whether enough of the core Workbook API is implemented to proceed.
Comment 20 Renjith R 2017-09-26 16:27:54 UTC
Hi PJ Fanning, I believe you have done some code review changes and some tidy ups.Thanks for that. So am I good to take the latest code from your repository(https://github.com/pjfanning/poi.git) with Branch: "streaming-workbook"?. I would like to resume from where I stopped a couple of months back.
Comment 21 PJ Fanning 2017-09-26 19:42:08 UTC
Renjith - feel free to fork my branch.
Comment 22 PJ Fanning 2017-11-21 22:33:24 UTC
One alternative worth trying is https://github.com/monitorjbl/excel-streaming-reader which is built on top of Apache POI.
Comment 23 Renjith R 2017-11-22 02:21:27 UTC
Cool! Looks promising.