Bug 61832 - Unable to create an Excel file of 500,000 rows and 150 columns using an SXSSF workbook
Summary: Unable to create an Excel file of 500,000 rows and 150 columns using an SXSSF workbook
Status: RESOLVED DUPLICATE of bug 57342
Alias: None
Product: POI
Classification: Unclassified
Component: SXSSF
Version: unspecified
Hardware: PC All
Importance: P2 major
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-11-29 09:57 UTC by Thamodharan
Modified: 2018-12-04 20:36 UTC



Attachments
"GenerateExcelFile.txt" contains the java code. (1.08 KB, text/plain)
2017-11-29 09:57 UTC, Thamodharan
Details
Excel 2013 - Alert Message - While opening such large file (19.95 KB, image/jpeg)
2017-11-29 10:33 UTC, Thamodharan
Details
ExcelErrorAfterRepair (107.74 KB, image/jpeg)
2017-11-30 14:23 UTC, Thamodharan
Details

Description Thamodharan 2017-11-29 09:57:58 UTC
Created attachment 35565 [details]
"GenerateExcelFile.txt" contains the java code.

I am trying to generate an Excel file with an SXSSF workbook. The file size may reach 200-300 MB, with a row count of 500,000 (i.e. 0.5 million) and roughly 150 columns. I get a content error when opening such a large file.
The version I am using is Apache POI 3.9.
I also tried 3.15; the same issue exists.
 
The system has enough RAM and disk space for this requirement.
 
The performance of 3.9 seems better than the other versions. Is there any way to rule this issue out in that version?
 
If anyone has an idea about the above scenario, kindly help me to proceed further.

FYI, I haven't applied any styles or formats to the cells, just inserted data of 25 characters per cell. I haven't violated or crossed any Excel specification limits.

Please find the sample code in the attached file.

The alert message which I receive on opening the Excel file is:

"We found a problem with some content in 'filename'. Do you want us to try to recover as much as we can? If you trust the source of this workbook"
Comment 1 Thamodharan 2017-11-29 10:33:06 UTC
Created attachment 35566 [details]
Excel 2013 - Alert Message - While opening such large file
Comment 2 Yegor Kozlov 2017-11-30 12:37:34 UTC
(In reply to Thamodharan from comment #1)
> Created attachment 35566 [details]
> Excel 2013 - Alert Message - While opening such large file

Is it a 32 or 64-bit Excel? I wonder if it matters.

What happens if you click Yes and agree to recover as much as possible?

Also, try to disable automatic re-calculation of formulas in Options / Formulas / Calculation Options. By default, Excel recalculates all the formulas in a worksheet on open. You'd better turn it off for such large files.
Comment 3 Thamodharan 2017-11-30 14:19:53 UTC
(In reply to Yegor Kozlov from comment #2)
> (In reply to Thamodharan from comment #1)
> > Created attachment 35566 [details]
> > Excel 2013 - Alert Message - While opening such large file
> 
> Is it a 32 or 64-bit Excel ? I wonder if it matters.
> 
> What happens if you click Yes and agree to recover as much as possible ? 
> 
> Also, try to disable automatic re-calculation of formulas in Options /
> Formulas / Calculation Options. By default, Excel recalculates all the
> formulas in a worksheet on open. You'd better turn it off for such large
> files.

Hi Yegor Kozlov,

FYI,
The attached code above works fine if the number of rows is reduced to 0.3 million (i.e. 300,000) with 150 columns. I can open that file without any issue or error. But the same is not true for 0.4 or 0.5 million rows.


Responses to your questions:

 1)  It's 32-bit Excel 2013.

 2)  When I click Yes to recover, Excel stops responding for a few minutes, and later I am able to retrieve all the data without any loss, but I get a repair error report in an alert box. Please find the new attachments related to it.

     - Once I save the repaired file, I don't get the content issue pop-up when I open it again later.

 3)  We only insert strings; no numeric values or formulas are applied to the workbook. I still tried your suggestion, though.

     - Before generating the workbook, I turned off automatic formula calculation in a local Excel file and saved it; checking again by reopening, it remained turned off. But once the code executes successfully, the setting is enabled again automatically. The same content issue persists for large files.

Kindly help if there is any further solution.
Comment 4 Thamodharan 2017-11-30 14:23:20 UTC
Created attachment 35571 [details]
ExcelErrorAfterRepair
Comment 5 Yegor Kozlov 2017-11-30 14:59:12 UTC
Can you please do one more test: generate a 500K rows x 150 columns spreadsheet containing only numbers and see if it opens? Due to its streaming nature, SXSSF stores strings differently than Excel does: Excel puts strings in a shared cache while SXSSF stores strings inline. Number cells should be fully compatible. Let's see.
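
For reference, the two storage forms look roughly like this in the worksheet XML (simplified; the cell reference and value are illustrative):

  <!-- inline string, as SXSSF writes by default -->
  <c r="A1" t="inlineStr"><is><t>some cell text</t></is></c>

  <!-- shared string: the cell only holds an index into xl/sharedStrings.xml -->
  <c r="A1" t="s"><v>0</v></c>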
Comment 6 Thamodharan 2017-11-30 16:26:42 UTC
(In reply to Yegor Kozlov from comment #5)
> Can you please do one more test: generate a 500K rows x 150 columns
> spreadsheet containing only numbers and see if it opens? Due to its
> streaming nature SXSSF stores strings differently than Excel does: Excel
> puts strings in a shared cache while SXXSF stores strings inline. Number
> cells should be fully compatible . Lets see.

Hi Yegor Kozlov,

I tried as you suggested, and it worked fine for 500K rows x 150 columns when inserting numbers. The maximum int value which we can give is 2147483647:
cell.setCellValue(2147483647);
With that, it worked fine.

So I extended the test by increasing the row and column counts to 900K rows x 250 columns, with the same cell value 2147483647.
The workbook was generated successfully. The file size reached 504 MB.

But the same content issue occurs while opening it.

Still a mystery.
Comment 7 Greg Woolsey 2017-12-04 20:52:59 UTC
I ran the attached code, and examined the generated workbook.  All XML validates successfully with xmllint.  The worksheet file, however, expands to > 5GB due to all the inlineStr attributes for all the cells.

My suspicion is that Excel doesn't like that many inline cell string values.  I verified that the change between the original and "repaired" versions was exactly turning those inlineStr attributes into a single shared string value referenced by all the cells.
Comment 8 Thamodharan 2017-12-06 10:30:01 UTC
Hi Greg Woolsey,

SXSSF uses inline strings by default; it can be changed to shared strings, for which a patch is available. Since the shared strings table takes heap memory, it's difficult to opt for it due to resource constraints. Is there any way to get this done with SXSSF inline strings themselves?

Regards
Thamodharan B
Comment 9 Yegor Kozlov 2017-12-06 12:46:29 UTC
In theory SXSSF can use a fixed-size cache of shared strings, e.g. we can allow SXSSF to put 10K distinct strings in the SST. If the cache is full then the string goes inline. The 10K limit can be configurable.
This way, SXSSF-generated workbooks with a small number of distinct values will be fully compatible with XSSF.
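
Purely as an illustration of that idea (this is not existing POI API, just a sketch of the proposed behaviour; the class and method names are made up):

  import java.util.HashMap;
  import java.util.Map;

  /** First N distinct strings go into the SST; anything beyond that is written inline. */
  class BoundedStringCache {
      private final int limit;                        // e.g. 10000, configurable
      private final Map<String, Integer> sst = new HashMap<>();

      BoundedStringCache(int limit) { this.limit = limit; }

      /** Returns the SST index for the string, or -1 if it must be written as inlineStr. */
      int indexOf(String value) {
          Integer idx = sst.get(value);
          if (idx != null) return idx;
          if (sst.size() >= limit) return -1;         // cache full: fall back to inline
          idx = sst.size();
          sst.put(value, idx);
          return idx;
      }
  }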
Comment 10 Thamodharan 2017-12-07 11:31:27 UTC
Hi Yegor Kozlov,

I hope you mean to set this 10K limit as shown below:

SXSSFWorkbook wb = new SXSSFWorkbook(10000);

Though I set this, I face the content issue when opening the file.
I tried extending it to 30K; the same issue persists.

If my attempt is not what you suggested, kindly correct me.

I have limited heap space, but enough disk space. Even if I extend the memory and store strings in the SST, I can't run workbooks in parallel; I can generate only one at a time, even if I increase the memory to two or three times what it is now.

My system RAM is 8 GB, and I have allotted a maximum of 1 GB to JBoss.

Kindly suggest your view on it.

Regards
Thamodharan B
Comment 11 Yegor Kozlov 2017-12-08 13:57:56 UTC
Try constructing the SXSSFWorkbook in a different way:

  wb = new SXSSFWorkbook(
    null, /* template workbook. pass null to start with a blank workbook */
    1000, /* size of the sliding window*/
    true, /* whether to compress temp files.*/
    true  /* whether to use shared string table*/
  );



Note that it may increase the memory footprint, especially if the number of distinct strings is large.
Comment 12 Thamodharan 2017-12-12 10:21:54 UTC
(In reply to Yegor Kozlov from comment #11)
> try to construct SXSSFWorkbook in a different way:
> 
>   wb = new SXSSFWorkbook(
>     null, /* template workbook. pass null to start with a blank workbook */
>     1000, /* size of the sliding window*/
>     true, /* whether to compress temp files.*/
>     true  /* whether to use shared string table*/
>   );
> 
> 
> 
> not that it may increase the memory footprint, especially if the number of
> distinct strings is large .


Hi Yegor Kozlov,

As you stated, it worked fine when the same string is written in all the cells (500K rows x 150 columns). It is also fine with the same string written in 750K rows x 150 columns.

By digging into this further, I came to understand how the shared string table concept works.

But when a large number of distinct strings need to be inserted, I face a java heap space out-of-memory error, since the shared strings table keeps all distinct strings in heap memory.

Help me to improve it further.
Comment 13 Greg Woolsey 2017-12-12 16:37:51 UTC
At some point, you just have to allocate more heap; there is no way around it. It sounds like you might be there, needing to allow for a very large shared strings table. Some tasks just take a large amount of RAM, which is cheap these days, including the ability to "rent" time and space on any number of cloud provider platforms for tasks that don't warrant purchasing your own additional hardware.

The only other thing I can think of would be to write a patch for POI yourself that adds an option, for streaming with shared strings, to store the hash table on disk somehow. That would save RAM but be incredibly slow. You're better off just giving the VM 64GB of RAM or something.
Comment 14 Thamodharan 2017-12-18 15:18:10 UTC
Even if I adopt a 64GB system and give 30GB of heap space to JBoss, I can't produce two parallel workbooks of such a large size at a time.

If I adopt writing a patch, speed is a concern, as you said.

Is there any possibility of overcoming this issue in a forthcoming version?

Anyhow, I will give writing a patch a try. Can you please guide me or provide any reference for writing a patch that stores the SST on disk, as you suggested?
Comment 15 mewalig 2017-12-18 20:07:56 UTC
I do not agree with this at all: "Some tasks just take a large amount of RAM, which is cheap these days, including the ability to "rent" time and space on any number of cloud provider platforms [...]".

This is the kind of reasoning that leads to bloatware and all-around bad products and bad libraries. Here is an approach that might work for improving performance well within the OP's requirements. I'd be interested to hear any reasons this would not work.

1. the shared string table uses a lookup that only tracks a hash and an index
2. worksheets and the shared string table are output as streams to temporary files (optionally compressed prior to writing, which often saves time given the speed of CPUs and the sluggishness of write operations)
3. after all the pieces from #2 are created, the final xlsx is assembled

Assuming a 256-bit hash, 1 million rows, 150 columns, and 100% distinct string values (obviously unlikely), step 1 will take a maximum of (256 + 32) / 8 = 36 bytes per entry, times 150 * 1,000,000 = 150 million cells, i.e. about 5.4 GB of memory. This could be further streamlined by not hashing string values of under 32 bytes and/or using a 128-bit hash, and could be further supplemented by supporting hash collisions. Of course there are libraries that will take care of this, so it shouldn't be much work to incorporate. Something performance-oriented like sqlite3 comes to mind as a possibility.

Step 2 will take minimal fixed memory because it is streaming the data to disk.

Step 3 can release all memory before starting, and then stream all the pieces, in serial, into the final output (and implicitly through a zip streamer), which can be either streamed to the calling process or to disk and thereby take up minimal fixed memory.
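
A rough sketch of step 1 under those assumptions (SHA-256 via java.security.MessageDigest; collision handling and tighter byte-level packing are omitted for brevity, so the actual per-entry cost of this sketch is higher than the 36-byte estimate):

  import java.nio.charset.StandardCharsets;
  import java.security.MessageDigest;
  import java.security.NoSuchAlgorithmException;
  import java.util.Base64;
  import java.util.HashMap;
  import java.util.Map;

  /** Keeps only a hash of each distinct string plus its index, never the string itself. */
  class HashedStringIndex {
      private final Map<String, Integer> byHash = new HashMap<>(); // key = Base64 of SHA-256
      private final MessageDigest sha256;

      HashedStringIndex() throws NoSuchAlgorithmException {
          sha256 = MessageDigest.getInstance("SHA-256");
      }

      /** Returns a stable shared-string index for the value, adding it if new. */
      int indexOf(String value) {
          byte[] digest = sha256.digest(value.getBytes(StandardCharsets.UTF_8));
          String key = Base64.getEncoder().encodeToString(digest);
          return byHash.computeIfAbsent(key, k -> byHash.size());
      }
  }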
Comment 16 PJ Fanning 2017-12-18 23:45:46 UTC
I have done the initial work on a prototype version of SharedStringsTable that can be used in SXSSF to reduce the memory footprint.
The Pull Request is for discussion, as opposed to suggesting this work is anywhere close to being merged yet.
https://github.com/apache/poi/pull/85 

The general idea is to put the shared string data in an H2 MVTable backed by a temp file. This is not done yet.
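
A minimal sketch of that general idea using H2's MVStore (the class and map names here are illustrative, not the actual pull-request code):

  import org.h2.mvstore.MVMap;
  import org.h2.mvstore.MVStore;

  /** Shared-string lookups backed by a temp file instead of the Java heap. */
  class DiskBackedStrings implements AutoCloseable {
      private final MVStore store;
      private final MVMap<String, Integer> index;

      DiskBackedStrings(String tempFile) {
          store = new MVStore.Builder().fileName(tempFile).compress().open();
          index = store.openMap("shared-strings");
      }

      /** Returns a stable shared-string index for the value, adding it if new. */
      int indexOf(String value) {
          Integer idx = index.get(value);
          if (idx == null) {
              idx = index.size();
              index.put(value, idx);
          }
          return idx;
      }

      @Override
      public void close() {
          store.close();
      }
  }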
Comment 17 Thamodharan 2017-12-27 13:27:35 UTC
Hi mewalig,

Does the 1 mm you mentioned refer to 0.1 million or 1 million rows?

Can you please explain the calculation [(256+32)/8*150*1mm = 5.4GB of memory] in detail, so that it will be easy for me to calculate as per my requirement?
Comment 18 Thamodharan 2018-01-11 13:00:32 UTC
Hi mewalig,

Kindly help us by explaining the calculation you stated in your last reply.
Comment 19 Thamodharan 2018-01-11 13:17:11 UTC
My machine specs,

JBoss server
RAM: 32 GB
Heap size allocated to JBoss: 28 GB
Hard disk: 1 TB

I can generate an Excel file up to a limit of 0.3 million rows x 200 columns with shared strings enabled. Beyond that row count, I face a corrupt-file issue when opening the generated Excel file. The reason for this error, as stated by Greg Woolsey, is that the data is stored as inline strings on disk, and Excel cannot tolerate such a large number of inline strings, so it repairs the file, turning the inline strings into a single shared string.

As suggested by Greg Woolsey, it is better to enable shared strings, which keeps the unique data in heap space instead of on disk. This may let the application generate a single Excel file on a machine with the above spec, but generating 2 or more Excel files at the same time will lead to running out of memory.

I guess writing a patch to store the data on disk rather than in the heap will again lead to inline strings, and the same error may occur.

Can anyone from the Apache dev team or other followers suggest a wise idea to rule this error out?

Awaiting a good response,

Thamodharan B
Comment 20 PJ Fanning 2018-01-11 14:41:02 UTC
There is a Pull Request that uses temp files for constructing the shared strings table.
https://github.com/apache/poi/pull/85
Comment 21 Thamodharan 2018-01-16 15:28:05 UTC
Hi Fanning,

I went through the pull request on GitHub; it is really great. Is there a chance we can expect the complete code in an upcoming version of Apache POI?
Comment 22 Marion 2018-05-14 21:20:59 UTC
This is still an issue. Can anyone from the Apache team confirm this issue will be resolved with the new release?
Comment 23 PJ Fanning 2018-05-15 09:17:38 UTC
There have been no changes made to POI for this.
1. If you use inline strings, the issue appears to be that Excel itself baulks at the large dataset (e.g. Comment 7 above).
2. Comment 16 above highlights a workaround where shared strings are used - this workaround would require a manual build of my POI branch, or waiting until POI 4 is released so that I can release my extension.
Comment 24 Thamodharan 2018-09-05 15:37:23 UTC
@Fanning,

In the link below,

https://poi.apache.org/changes.html

under History of Changes
Version 4.0.0-SNAPSHOT (2018-08-??)

the change for 61832 / GitHub-85 is not included.

As you replied to Marion, can we expect this in the upcoming release?
Comment 25 PJ Fanning 2018-09-05 20:28:50 UTC
https://github.com/pjfanning/poi-shared-strings is an extension to POI that may help - it will be released as soon as POI 4.0.0 is released (it relies on changes in POI 4.0.0).
Comment 26 PJ Fanning 2018-09-06 15:03:16 UTC
POI 4.0.0 and poi-shared-strings 1.0.0 are now released.
Comment 27 krzysztof.rzymkowski 2018-12-04 20:36:36 UTC

*** This bug has been marked as a duplicate of bug 57342 ***