Bug 60567 - XSSFReader caused OutOfMemoryError when reading a large excel file in HDFS as inputStream
Summary: XSSFReader caused OutOfMemoryError when reading a large excel file in HDFS as...
Status: RESOLVED INVALID
Alias: None
Product: POI
Classification: Unclassified
Component: XSSF
Version: 3.15-FINAL
Hardware: PC All
Importance: P2 enhancement
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-01-10 00:20 UTC by Dejian Tu
Modified: 2017-05-17 09:11 UTC
CC List: 2 users
Description Dejian Tu 2017-01-10 00:20:06 UTC
My project uses the POI library to read Excel files stored in HDFS. The API I used is as below:
==================
// inputStream is generated from an HDFS path, because OPCPackage
// cannot open an HDFS path directly.
XSSFReader xssfReader = new XSSFReader(OPCPackage.open(inputStream));
==================

The Excel file has around 1,000,000 rows of simple data (columns like name, id, address, etc.), and the file size is around 140 MB. When I run my project, the process consumes about 3.25 GB of memory, which is much larger than the Excel file itself.

AFAIK, reading from a String path or a File uses much less memory than reading from an InputStream with XSSFReader. But in my case, because the Excel file is in HDFS, we cannot pass the HDFS path to XSSFReader directly.
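A common workaround for this situation is to spool the remote stream to a local temporary file first, so the package can be opened from a File. The following is only a sketch using the JDK; the class and method names are illustrative, the 8 KB buffer size is arbitrary, and the POI calls are shown as comments since they require POI on the classpath:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

public class StreamToFile {

    // Copy the (e.g. HDFS) InputStream to a local temp file so that
    // OPCPackage.open(File) can be used instead of OPCPackage.open(InputStream).
    static File spoolToTempFile(InputStream in) throws Exception {
        File tmp = File.createTempFile("poi-spool-", ".xlsx");
        tmp.deleteOnExit();
        try (OutputStream out = new FileOutputStream(tmp)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } finally {
            in.close();
        }
        return tmp;
    }

    // Usage with POI on the classpath (not compiled here):
    //   File local = spoolToTempFile(hdfsInputStream);
    //   XSSFReader xssfReader = new XSSFReader(OPCPackage.open(local));
}
```

This trades local disk space for heap: opening from a File lets POI access the zip archive randomly rather than buffering the whole stream in memory.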

Could you please help fix the issue of XSSFReader using much more memory when reading from an InputStream?

Thank you.
Comment 1 Javen O'Neal 2017-01-19 08:30:43 UTC
1,000,000 rows is massive. That's nearly the maximum number of rows allowed per the file format specification.

140 MB is a massive file size. Keep in mind that an .xlsx file is a zip archive of XML files, and I would expect 90-95% compression for these files. Unzip it on your hard drive to see how much disk space the expanded contents consume; it should be in the neighborhood of 1-3 GB.
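Since an .xlsx is just a zip archive, the expanded size can also be checked without unzipping to disk. A minimal sketch using only `java.util.zip` (the file path argument is a placeholder for your workbook):

```java
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class XlsxSize {

    // Sum the declared uncompressed sizes of all entries in a zip/.xlsx archive.
    static long uncompressedSize(String path) throws Exception {
        long total = 0;
        try (ZipFile zip = new ZipFile(path)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                total += entries.nextElement().getSize();
            }
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        // e.g. java XlsxSize workbook.xlsx
        System.out.println("Uncompressed bytes: " + uncompressedSize(args[0]));
    }
}
```

Comparing this total to the 140 MB on-disk size gives the actual compression ratio for your particular workbook.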

You're also opening the file via an input stream, which has some memory overhead.

Therefore, 3.25 GB of memory consumption is reasonable in this case, considering input-stream overhead, memory alignment, garbage collection, temporary files for unzipping, references to files in the unzipped directory structure, and the XML trees built for the minimum set of parts XSSFReader needs.

If you have any suggestions and could contribute a patch towards lowering XSSFReader's memory footprint, we'd greatly appreciate the help.
Comment 2 Javen O'Neal 2017-01-19 08:37:53 UTC
(In reply to Dejian Tu from comment #0)
> AFAIK, reading from a String path or File uses much less memory than reading
> from inputStream for XSSFReader. But for my case, because the excel file is
> in HDFS file system, we could not pass the HDFS path to XSSFReader directly.

This sounds like a point for discussion on the mailing list (and perhaps Stack Overflow or other community to get suggestions on how to write a program that can deal with data stored on a distributed file system), and is not a bug without convincing evidence and a patch.
Comment 3 chenchanghan 2017-05-17 09:11:16 UTC
I encountered the same problem: 700,000 rows, file size 139 MB.
I found that OPCPackage.open(inputStream) leads to an OutOfMemoryError, but OPCPackage.open(filePath) runs very well.