Bug 57342 - Writing very large file via SXSSF leads to corrupt file
Summary: Writing very large file via SXSSF leads to corrupt file
Status: NEW
Alias: None
Product: POI
Classification: Unclassified
Component: SXSSF (show other bugs)
Version: 3.9-FINAL
Hardware: PC All
: P1 blocker with 4 votes (vote)
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
: 54523 61832 (view as bug list)
Depends on: 62872 54523
Blocks:
  Show dependency tree
 
Reported: 2014-12-11 19:55 UTC by RatiKanata Pal
Modified: 2019-06-12 19:19 UTC (History)
8 users (show)



Attachments
Attached! source (9.42 KB, text/plain)
2014-12-11 19:55 UTC, RatiKanata Pal
Details
Error Shown from Excel (108.79 KB, image/jpeg)
2015-01-08 07:17 UTC, RatiKanata Pal
Details
Jars Used (237.85 KB, image/jpeg)
2015-01-08 07:20 UTC, RatiKanata Pal
Details

Note You need to log in before you can comment on or make changes to this bug.
Description RatiKanata Pal 2014-12-11 19:55:11 UTC
Created attachment 32284 [details]
Attached! source

Hi,
After 7 lakh records file are getting corroupted,when i am writing 1 million recors with 60 columns its working fine but when i am writing 7.5 lakh records with 100 columns file are getting corrupted.

I have attached snipnet of code . Kindly suggest i if doing something wrong

package com.rati.Dec;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.xssf.streaming.SXSSFSheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

public class MillionTest1 {

	public static void main(String[] args) {
		// TODO Auto-generated method stub
		SXSSFWorkbook workbook = new SXSSFWorkbook(100);
		workbook.setCompressTempFiles(true);
		SXSSFSheet sheet = null;
		Row row = null;
		Cell cell = null;
		sheet = (SXSSFSheet) workbook.createSheet("Test");
		FileOutputStream fileOut = null;
		try {
			System.out.println("Started!");
			for (int i = 0; i < 777384; i++) {
				row = sheet.createRow(i);
				for (int j = 0; j < 100; j++) {
					cell = row.createCell(j);
					cell.setCellValue("Rati : row : "+i +" Col: "+ j);
				}
			}

			fileOut = new FileOutputStream(
					new File("Z:\\output\\Version\\Version11.xlsx"));
			workbook.write(fileOut);

		} catch (Exception exception) {
			exception.printStackTrace();
		} finally {
			workbook.dispose();
			try {
				fileOut.flush();
				fileOut.close();
			} catch (IOException e) {
				// TODO Auto-generated catch block
				e.printStackTrace();
			}
			System.out.println("Completed!");
		}
	}

}
Comment 1 Nick Burch 2014-12-11 20:21:16 UTC
How is the file getting corrupt?

How do the files differ between 3.9 and 3.10?

Did you try with 3.11 beta 3?

Did you try generating a simple file with the same number of rows+columns using SSPerformanceTest (see http://poi.apache.org/faq.html#faq-N10109)? Did that file work fine when generated at the same row+column size?
Comment 2 RatiKanata Pal 2014-12-11 20:30:48 UTC
sir.i have tried all version and compatible dependency jars.still file is getting corrupted.

u can also try running this or attached code!

:(
Comment 3 RatiKanata Pal 2014-12-14 10:12:05 UTC
(In reply to Nick Burch from comment #1)
> How is the file getting corrupt?
> 
> How do the files differ between 3.9 and 3.10?
> 
> Did you try with 3.11 beta 3?
> 
> Did you try generating a simple file with the same number of rows+columns
> using SSPerformanceTest (see http://poi.apache.org/faq.html#faq-N10109)? Did
> that file work fine when generated at the same row+column size?

No Sir,
i have tried using SSPerformanceTest 

args = new String[4];
args[0] = "SXSSF";
args[1] = "1000000";->Rows
args[2] = "100";--->Cols
args[3] = "1";


but still file is getting corrupt.
Comment 4 maggie1 2014-12-18 02:17:02 UTC
We are also using 3.9 and experiencing the same corruption. Ran the performance test with the same results as reporting user.
Comment 5 Dominik Stadler 2015-01-04 23:35:04 UTC
I think this the problem may happen around the 2G/4G size border, so it may be related to the trouble in some JDK versions with handling zip-files larger than 2G/4G, can you state which exact Java version you are using?
Comment 6 Dominik Stadler 2015-01-04 23:38:43 UTC
I tried to look up details of support of large zip files in Java, but the bug-database is not very definite about what is supported.

However https://en.wikipedia.org/wiki/Zip_(file_format)#ZIP64 states that only Java 7 fully supports files larger than 4G, can you try running your test with the latest patchlevel of this version to verify this?
Comment 7 Dominik Stadler 2015-01-05 07:58:36 UTC
I tried with java version "1.6.0_26" and could not reproduce the problem there.
Comment 8 RatiKanata Pal 2015-01-08 07:17:08 UTC
Created attachment 32355 [details]
Error Shown from Excel

Error and compiler version
Comment 9 RatiKanata Pal 2015-01-08 07:20:06 UTC
Created attachment 32356 [details]
Jars Used

HI,
Run the test using Jdk 1.8 and latest poi jars.

here the code I used 


package com.rati.Dec;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.xssf.streaming.SXSSFSheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

public class MillionTest1 {

	@SuppressWarnings("resource")
	public static void main(String[] args) {
		// TODO Auto-generated method stub
		SXSSFWorkbook workbook = new SXSSFWorkbook(100);
		workbook.setCompressTempFiles(true);
		SXSSFSheet sheet = null;
		Row row = null;
		Cell cell = null;
		sheet = (SXSSFSheet) workbook.createSheet("Test");
		FileOutputStream fileOut = null;
		try {
			System.out.println("Started!");
			for (int i = 0; i < 777384; i++) {
				row = sheet.createRow(i);
				for (int j = 0; j < 100; j++) {
					cell = row.createCell(j);
					cell.setCellValue("Rati : row : "+i +" Col: "+ j);
				}
			}

			fileOut = new FileOutputStream(
					new File("Z:\\Version11.xlsx"));
			workbook.write(fileOut);

		} catch (Exception exception) {
			exception.printStackTrace();
		} finally {
			workbook.dispose();
			try {
				fileOut.flush();
				fileOut.close();
			} catch (IOException e) {
				// TODO Auto-generated catch block
				e.printStackTrace();
			}
			System.out.println("Completed!");
		}
	}

}


Same error Is popping up.

I have also attached the screen shots in the attachment section.

Please check once.
Comment 10 RatiKanata Pal 2015-01-08 08:02:34 UTC
It also been tested multiple system but issue remains the same.
Comment 11 Nick Burch 2015-02-05 17:45:57 UTC
I've just run your test program on this jvm:

java version "1.6.0_33"
OpenJDK Runtime Environment (IcedTea6 1.13.5) (6b33-1.13.5-1ubuntu0.12.04)

The generated file (268mb!) is processed just fine with Apache Tika

The generated file takes an absolute epic age to load in Excel 
then Excel gives a warning about unreadable content, after repairing it reports:
"System error: - 2134163449. Line 437452, column 5899". However, it does still show 777384 rows, all with 100 columns

The only thing I can suggest is finding the exact point when the file files to load (eg 777384 fails, 777383 works), then unzip both files and compare the XML to see how they differ. (We now have a tool to help with this - OOXMLPrettyPrint)
Comment 12 matthewtckr 2015-02-25 22:05:46 UTC
Using Rati's code with Java 7u76, the file works with a row count of 561564, but breaks at 561565 and higher.
Comment 13 RatiKanata Pal 2015-02-26 02:20:02 UTC
Matt i think this is the range can you provide the file size and column count.
Comment 14 matthewtckr 2015-02-27 16:59:40 UTC
Hi Rati,

Both files are reported as 197,511 KB (202,250,805 and 202,251,162 bytes), and have the 100 columns as in the sample code.

7-Zip reports the uncompressed size of the worksheet as 4,294,971,804 bytes and 4,294,979,492 bytes, which is just over the 4GB size for the broken file.

I'm also uploading the files that were created, before being opened/repaired by Excel.  The files will be available at https://www.dropbox.com/sh/v322ryfnw7zk6ra/AADlEnu9t-25OCKNquJvdDyQa?dl=0
Comment 15 Dominik Stadler 2015-03-23 21:43:40 UTC
I think bug 54523 reports pretty much the same thing.
Comment 16 Aldrin Baroi 2016-04-21 00:50:27 UTC
I am using version 3.14
Comment 17 Aldrin Baroi 2016-09-22 00:38:52 UTC
Here is additional details:

Excel Version     :  Excel 2010
Apache POI Version:  3.14 & 3.15
Java Version      :  JDK 8 (1.8.0_101)
OS Version        :  Windows 7 Enterprise 64-bit


Excel sheet information:
Number of columns   :  208
Number of rows      :  390,000 or more
Average column width:  15 characters
Max column width    :  40 characters 

Problem starts when the number of rows exceed 410,000 or so... 

Error message displayed in Excel:

  "Excel found unreadable content in [file name].  Do you want to recover the contents of this workbook? If you trust the source of this workbook, click Yes"

This ONLY HAPPENS when the number of rows exceeds 410,000 rows or so.
Comment 18 Ramesh 2016-10-10 14:39:11 UTC
Dear All,

Any workaround to this issue? We are also facing the same issue with similar approach. Appreciate your help on this.

Thanks
Ramesh
Comment 19 Javen O'Neal 2016-10-10 14:59:45 UTC
Patches are greatly appreciated. Appreciate your help on this.
Comment 20 PJ Fanning 2016-10-12 21:40:50 UTC
I ran the test case in https://bz.apache.org/bugzilla/show_bug.cgi?id=54523 using latest poi code and Excel 2016 (Mac) gives the error about corrupt data.
I was able to unzip the generated xlsx with p7zip. The default unzip on Macs can't handle zips containing such large files.
The generated worksheet is 4Gb but the XML looks fine.
Comment 21 Dominik Stadler 2016-11-01 20:03:14 UTC
The comment at https://bz.apache.org/bugzilla/show_bug.cgi?id=54523#c6 indicates that a workaround is to unzip the file with 7zip and then zipping it again lead to a file that was usable in Excel.
Comment 22 Dominik Stadler 2016-11-01 20:04:28 UTC
*** Bug 54523 has been marked as a duplicate of this bug. ***
Comment 23 krzysztof.rzymkowski 2018-12-04 20:35:59 UTC
I did an in depth analysis of this issue. Turns out the problem is not with the OOXML data generated by POI. The problem has to do with the ZIP format. Specifically with ZIP64 extension. That's why it's all OK up until sheet1.xml reaches over 4GB (uncompressed). 
I have all the details written up in a blog post: https://rzymek.github.io/post/excel-zip64/
Short story: Excel will want to repair the file if uncompressed size of a zip entry exceeds 4GB and ZIP's Local File Header (LFH) does not specify zip spec version 4.5

A minimal fix would go like this:
1. Switch commons-compress ZipArchiveOutputStream to Zip64Mode.Always (Apache POI uses commons-compress not java.util.zip)
2. Modify commons-compress to put 0 in 32bit size fields if size is not known. That is in LFH creation in streaming mode.  Currently in Zip64Mode.Always and streaming zip creation commons-compress stores FF FF FF FF in 32bit size field and 00 00 00 00 00 00 00 00 in 64bit in LFH. Excel expects 00 00 00 00 in the 32bit size field only in LFH.
Comment 24 krzysztof.rzymkowski 2018-12-04 20:36:36 UTC
*** Bug 61832 has been marked as a duplicate of this bug. ***
Comment 25 PJ Fanning 2018-12-04 21:52:51 UTC
Krzysztof - thanks for the analysis, could you report the issue to commons-compress?
I'm going to add a setZip64Mode method on SXSSFWorkbook. https://svn.apache.org/viewvc?view=revision&revision=1848179
Comment 26 krzysztof.rzymkowski 2018-12-09 22:53:33 UTC
Issue submitted to commons-compress at https://issues.apache.org/jira/browse/COMPRESS-474
Comment 27 Dominik Stadler 2018-12-21 07:30:29 UTC
Method to set 64-bit mode was added via r1848179, unfortunately it cannot be set always currently because LibreOffice cannot open such files, see bug 62872 for details.
Comment 28 krzysztof.rzymkowski 2019-02-23 18:01:14 UTC
I have implemented a drop in replacement for ZipOutputStream based of my previous analysis in https://rzymek.github.io/post/excel-zip64/.

It's available at https://github.com/rzymek/opczip and from maven central.

It foremost supports Excel compatible zip64 archives.  
Version 1.0.2 also creates LibreOffice compatible xlsx files. 
LibreCalc still will not open zip64 archives, but is OK with zip version 4.5 in Local File Header. Just requires max 3.0 in Central Directory. This is also OK with Excel, which requires 4.5 in LHF but ignores version in CEN. This combination allows streaming creation without prior knowledge whether 4GB limit will be exceed or not.

opczip is currently used by https://github.com/dhatim/fastexcel
Comment 29 PJ Fanning 2019-02-24 09:30:29 UTC
Thanks Krzysztof. With Apache projects, we try to limit the amount of non-Apache dependencies. Would it be feasible for you to submit the implementation of opczip to us and we can have a mode in SXSSF that allows users to choose their preferred zip implementation.
Comment 30 krzysztof.rzymkowski 2019-03-06 15:58:56 UTC
@PJ_Fanning Sure, but I'm not fully clear on what you mean by "submit to us". Do you mean submit to Apache POI source repo, Apache Incubator or via https://projects.apache.org/create.html? Could you point me to what the next steps would be.

Here's a least invasive scenario:
Allow to tap into SXSSFWorkbook.injectData() [1], to provide own implementation of  ZipArchiveOutputStream. For example via introducing protected createZipOutputStream() method.
I could provide an OpcOutputStream wrapper extending ZipArchiveOutputStream. Just like I did with ZipOutputStream [2]. 
Then to create SXSSFWorkbook with >4GB sheet support, you would:

SXSSFWorkbook wb = new SXSSFWorkbook() {
  protected ZipArchiveOutputStream createZipOutputStream(OutputStream out) {
     return new OpcZipArchiveOutputStream(out);
  }
}

[1] https://github.com/apache/poi/blob/cdbda8e615610b960844e58b89f51cf304ce00c6/src/ooxml/java/org/apache/poi/xssf/streaming/SXSSFWorkbook.java#L388
[2] https://github.com/rzymek/opczip/blob/master/src/main/java/com/github/rzymek/opczip/OpcZipOutputStream.java
Comment 31 PJ Fanning 2019-03-07 00:00:46 UTC
Krzysztof, I'm asking for permission for us to copy your code and include it in Apache POI - to avoid us having a jar dependency.
Comment 32 krzysztof.rzymkowski 2019-03-07 10:17:32 UTC
(In reply to PJ Fanning from comment #31)
> Krzysztof, I'm asking for permission for us to copy your code and include it
> in Apache POI - to avoid us having a jar dependency.

Sure! I'd just like have my name in the file's header.

You only need OpcOutputStream.java and Zip64Impl.java. They both don't have any dependencies.

https://github.com/rzymek/opczip/tree/master/src/main/java/com/github/rzymek/opczip
Comment 33 krzysztof.rzymkowski 2019-06-12 19:19:51 UTC
Submitted https://github.com/apache/poi/pull/154