Bug 62872

Summary: Writing large files with 800k rows gives java.io.IOException: This archive contains unclosed entries.
Product: POI Reporter: andreas
Component: SXSSFAssignee: POI Developers List <dev>
Status: NEW ---    
Severity: regression CC: andreas
Priority: P2    
Version: 4.0.0-FINAL   
Target Milestone: ---   
Hardware: PC   
OS: Linux   
Bug Depends on:    
Bug Blocks: 57342    
Attachments: Sampel for java.io.IOException: This archive contains unclosed entries
Test Case
modified test

Description andreas 2018-11-01 00:17:41 UTC
Created attachment 36225 [details]
Sampel for java.io.IOException: This archive contains unclosed entries

The new 4.0 seems to have a problem when writing large XLSX files (e.g. >100k rows but less than 1 Mill. rows) using SXSSFWorkbook:

java.io.IOException: This archive contains unclosed entries.
        at org.apache.commons.compress.archivers.zip.ZipArchiveOutputStream.finish(ZipArchiveOutputStream.java:467)
        at org.apache.poi.xssf.streaming.SXSSFWorkbook.injectData(SXSSFWorkbook.java:406)
        at org.apache.poi.xssf.streaming.SXSSFWorkbook.write(SXSSFWorkbook.java:936)

Please note, small files (e.g. less than 100k rows seem to work fine) and also everything works fine with 3.18 (same code, same data).

The attached sample reproduces the error.
Comment 1 andreas 2018-11-01 03:09:11 UTC
Created attachment 36226 [details]
Test Case
Comment 2 Dominik Stadler 2018-11-01 08:09:24 UTC
Based on discussion on user-list, this worked in 3.17.
Comment 3 Dominik Stadler 2018-11-01 10:26:22 UTC
Reproducible with 600_000 rows, not reproducible with 400_000 rows. 

Temporary file is aprox. 200MB, so not a 2GB issue

When the exception is thrown in ZipArchiveOutputStream, the variables have the following contents:

this.finished = false
this.entry = {ZipArchiveOutputStream$CurrentEntry@2501} 
 entry = {ZipArchiveEntry@2523} "xl/worksheets/sheet1.xml"
  method = 8
  size = 4497373312
  internalAttributes = 0
  versionRequired = 0
  versionMadeBy = 0
  platform = 0
  rawFlag = 0
  externalAttributes = 0
  alignment = 0
  extraFields = null
  unparseableExtra = null
  name = "xl/worksheets/sheet1.xml"
  rawName = null
  gpb = {GeneralPurposeBit@2525} 
  localHeaderOffset = -1
  dataOffset = -1
  isStreamContiguous = false
  nameSource = {ZipArchiveEntry$NameSource@2526} "NAME"
  commentSource = {ZipArchiveEntry$CommentSource@2527} "COMMENT"
  ZipEntry.name = "xl/worksheets/sheet1.xml"
  xdostime = 276176132385
  mtime = null
  atime = null
  ctime = null
  crc = 2326399640
  ZipEntry.size = -1
  csize = 228388641
  ZipEntry.method = -1
  flag = 0
  extra = null
  comment = null
 localDataStart = 2340
 dataStart = 2380
 bytesRead = 4497373312
 causedUseOfZip64 = false
 hasWritten = false
Comment 4 Andreas Beeker 2018-11-02 20:02:23 UTC
My guess is, this is a shaded exception in SXSSFWorkbook.injectData.
If you add some checkpoint variables in, it will enter the "finally" for "zos.closeArchiveEntry()", but actually doesn't finish it.
Comment 5 Andreas Beeker 2018-11-02 20:41:40 UTC
It is a shaded exception complaining about the 4gb limit.
When using "zos.setUseZip64(Zip64Mode.Always)" the test runs through successfully, but the result can't be opened in Libre Office

I'm now checking if the xml differs on 3.17 vs 4.0.0 or if it's caused by the 64bit zip stream.
Comment 6 Andreas Beeker 2018-11-02 23:04:43 UTC
Patched via r1845629

The change now implicitly sets the stream entry to 64bit based on the given filesize, the others stream entries stay in normal mode (... I guess ...)

I've compared the sheet content in 3.17 vs trunk, but there wasn't any differences.

As this still produces files which can't be opened in Libre/MS Office, I leave this issue open.
Comment 7 Andreas Beeker 2018-11-11 20:33:59 UTC
Created attachment 36258 [details]
modified test

I forgot to mention, that also with POI 3.17 I couldn't produce a file which could be opened by Libre Office ... I haven't tried with MS Excel. This also due the double "i++" in the test, which creates more than the 1048576 logical rows with a ROW_COUNT of 600000.

Please run your fixed test again, with POI 3.17 vs. the current trunk.

For comparison of the unzipped xmls, you might want to use the same test data.
I've attached my modified version which inserts reproducible timestamps.
Comment 8 andreas 2018-11-30 07:32:41 UTC
LibreOffice does not seem to support ZIP64.

I have created 2 small files with the same content, but only one written as ZIP64.

Gnumeric was able to open both files, LibreOffice fails with the ZIP64.
I have refreshed LibreOffice bug 82984 accordingly (https://bugs.documentfoundation.org/show_bug.cgi?id=82984#c10).

Setting Zip64 to Always for archives exceeding 4GB and/or 65536 entries would be the correct solution. When doing so, a warning related to the use of LibreOffice would be great.
Comment 9 andreas 2018-11-30 07:34:05 UTC
Created attachment 36286 [details]
Comment 10 andreas 2018-11-30 07:34:38 UTC
Created attachment 36287 [details]
Comment 11 PJ Fanning 2018-12-04 22:04:10 UTC
I added a method to SXSSFWorkbook so that you can set the Zip64Mode - https://svn.apache.org/viewvc?view=revision&revision=1848179