Bug 59841 - ZipInputStreamZipEntrySource should use temp files instead of ByteArrayOutputStream
Summary: ZipInputStreamZipEntrySource should use temp files instead of ByteArrayOutput...
Alias: None
Product: POI
Classification: Unclassified
Component: POI Overall (show other bugs)
Version: 3.15-dev
Hardware: All All
: P2 normal (vote)
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Blocks: 59893
  Show dependency tree
Reported: 2016-07-11 14:24 UTC by PJ Fanning
Modified: 2016-07-23 12:12 UTC (History)
1 user (show)

new OPCPackage.open(ZipEntrySource) method / custom encrypted temp zip (9.33 KB, patch)
2016-07-16 21:59 UTC, Andreas Beeker
Details | Diff
new OPCPackage.open(ZipEntrySource) method / custom encrypted temp zip (10.95 KB, patch)
2016-07-16 23:53 UTC, Andreas Beeker
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description PJ Fanning 2016-07-11 14:24:08 UTC
I have a use case where I have a worksheet with a large amount of data.
I'm using XSSFEventBasedExcelExtractor to try to extract the text from the workbook and am hoping to use as little memory as possible.
Most of the code base seems pretty memory efficient but the putting the worksheet data into a ByteArrayOutputStream in the ZipInputStreamZipEntrySource.FakeZipEntry causes high memory consumption.
POI has temp file support via DefaultTempFileCreationStrategy.
Would it be possible to consider using temp files in this code base, even it was just for data beyond a certain size?
Comment 1 Nick Burch 2016-07-11 14:40:58 UTC
To minimise memory use, open the XLSX file using a real File object, not an InputStream, see http://poi.apache.org/spreadsheet/quick-guide.html#FileInputStream . That should avoid all in-memory buffering of the xml parts at the OPC / Zip level
Comment 2 PJ Fanning 2016-07-11 18:10:25 UTC
Thanks Nick for the quick response. In my case, the xlsx file is password protected and I'm using POIFSFileSystem to decrypt it. I don't want to store the unencrypted xlsx on the file system.
Comment 3 PJ Fanning 2016-07-11 19:16:06 UTC
I have created https://github.com/apache/poi/pull/34 - if this approach is acceptable, I can add extra test coverage.
Comment 4 PJ Fanning 2016-07-13 20:57:14 UTC
Using temp files for the fake zip entries makes a big difference in memory usage in the use case I have. The https://github.com/apache/poi/pull/34 change leaves the default behaviour as it is but allows user to provide customised overrides to the behaviour.
Would it be possible to consider reopening this issue and I will proceed to extend the test coverage?
I think it is a legitimate concern not wanting to put unencrypted xlsx files onto disk in order to read them efficiently.
Comment 5 Nick Burch 2016-07-13 22:13:32 UTC
(In reply to PJ Fanning from comment #4)
> I think it is a legitimate concern not wanting to put unencrypted xlsx files
> onto disk in order to read them efficiently.

I'm not sure how your patch helps with this - all it does is only put parts of the unencrypted data on disk at a time, but the unencrypted data is still on disk in the temp files!
Comment 6 PJ Fanning 2016-07-13 22:20:01 UTC
I have a class that encrypts the temp files (not part of patch). With the patch I can create my own custom FakeZipEntryStrategy class that uses this temp file util under the hood.
Comment 7 Nick Burch 2016-07-13 22:23:50 UTC
Ah, that makes more sense now :)

Re-opening so that someone can review the patch, and apply if appropriate / give further comments if not!
Comment 8 Andreas Beeker 2016-07-13 22:37:13 UTC
Sorry - I don't understand your FakeEntry approach ... and even Nick beat me on this, I'd like to add my two cents ...

So lets summarize:
- you have a hugh encrypted .xlsx
- and want to use eventmodel for text extraction
- the extraction should be on a file and not a stream
- it's not a problem to a have temporary file, but it should be encrypted

My/further insights:
- the shared string table is loaded in full into the memory, so this might be a caveat
- you need to provide a custom ZipFile implementation with standard AES or custom encryption support which is used by POI - maybe "Apache Commons Compress" can be used for that ... and of course we need to adapt the OPCPackage handling for that

So currently it looks like the following approach for me:
- read the ole2 container
- copy the encrypted stream into a encrypted zip with a session key
- provide the zip to an adapted ZIP-/OPCPackage
- use XSSFReader with that OPCPackage

The key is, that we need to change the ZIPPackage to support a custom ZipFile!
So writing the ZipFile is not an issue and can be handled without changing POI, but reading is the interesting part.
Comment 9 Andreas Beeker 2016-07-13 23:15:29 UTC
So we could either
- add a OPCPackage.open(ZipFile) method, which would skip the functionality of ZipSecureFile
- or pass a FilterInputStream-wrapper/decorator down to ThresholdInputStream which is used decrypt/deobfuscate the raw zip bytes - of course the encrypting part seems to be not trivial then, if not a simple bytewise XOR is used ...

(again this all assumes, we are working on a random accessible file and not on a stream)
Comment 10 PJ Fanning 2016-07-13 23:19:43 UTC
Thanks Andreas for your analysis. Your summary is exactly what I am looking to achieve. I understand the Shared String Table could be another memory issue but I'm wondering if at least the FakeZipEntry memory consumption could be addressed by my patch set? I've added a basic test case to the pull request. My custom strategy class would be similar to the one in the test case except that it has its own built-in AES encryption.
This approach is based on the pre-existing TempFile creation strategy class in POI.
Of course, the solutions you are suggesting are also well worth considering.
Comment 11 Javen O'Neal 2016-07-14 08:44:36 UTC
(In reply to Andreas Beeker from comment #8)
> - the extraction should be on a file and not a stream

And another requirement, since there usually tends to be a trade off between memory consumption and execution speed:
If using a file is slower, we must keep the stream variant so users  can choose based on what best meets their needs. Having a method that operates on an OutputStream and letting the user pass in a the implementation: BAOS or a FOS, and an InputStream with the choice of implementation including BAOS, FIS, ZipInputStream, EncryptedZipInputStream, EncryptedZipFileInputStream, etc).
Comment 12 Andreas Beeker 2016-07-15 19:54:46 UTC
(In reply to PJ Fanning from comment #10)
> This approach is based on the pre-existing TempFile creation strategy class
> in POI.

Although I understand, that a temp file needs to be created, this can be done outside of POI and therefore user code can do whatever temp creation they want to do - so this is not my main scope as mentioned above, i.e. "writing the ZipFile is not an issue".

I'll now try to do a test implementation with commons-compress which doesn't extend java.util.zip.ZipFile, therefore I need a interface to a delegate which forwards the zip operations to the underlying zip implementation. This interface will be referenced via OPCPackage.open(interface).
Comment 13 Andreas Beeker 2016-07-16 21:59:09 UTC
Created attachment 34046 [details]
new OPCPackage.open(ZipEntrySource) method / custom encrypted temp zip

As described above, the OPCPackage needed to be extended to provide the functionality of providing a custom encrypted temp file, which can be streamed to POI

At first I've started to use packages like zip4j, winzipaes or commons-compress (which doesn't provide encryption support), but as the resulting temp doesn't need to be outside of the user process, I've downgraded it to a simple custom enc-/decryption routine.

Also as noted above, I haven't put any effort into the the temp file generation, as this is simply a user code issue - so given your described use case, I don't see a need to push the logic into ZipInputStreamZipEntrySource & Co. and rather supply an example in the unit tests for further reference, i.e. I'm reluctant to make things more complicated ...

Please validate that approach and if it's ok I'll commit it.
Comment 14 PJ Fanning 2016-07-16 23:00:29 UTC
Thanks Andreas. I tried out your patch code on 2 xlsx files and got the same NullPointerException for both.

	at org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable.getEntryAt(ReadOnlySharedStringsTable.java:184)
	at org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.endElement(XSSFSheetXMLHandler.java:353)
	at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:609)

I'm using the OPCPackage in new XSSFEventBasedExcelExtractor.
I can try to put together a test case to demonstrate the issue I'm seeing.
Comment 15 Andreas Beeker 2016-07-16 23:12:47 UTC
I've added a call to XSSFEventBasedExcelExtractor and also receive that error ...
I see what I can do ...
Comment 16 PJ Fanning 2016-07-16 23:42:50 UTC
ReadOnlySharedStringsTable readFrom(InputStream is) does a check on is.available and for the CipherInputStream, this seems to be zero.
I suspect that we don't need the is.available check.
Comment 17 Andreas Beeker 2016-07-16 23:53:18 UTC
Created attachment 34047 [details]
new OPCPackage.open(ZipEntrySource) method / custom encrypted temp zip

There's an error in handling InputStreams via available() in reading shared strings tables, i.e. assuming EOF on available() == 0 is simply wrong.

I've fixed it for this use case, but on a quick search, this wrong assumption is also on various other places.

So check if it works for you and I open a new ticket for fixing the other invocations.
Comment 18 PJ Fanning 2016-07-17 00:00:02 UTC
I spotted that problem with the available check too and removing it fixes my test case.
Comment 19 Andreas Beeker 2016-07-17 00:15:25 UTC
Thank you for testing.
patch applied via r1753003