Bug 61371 - 20^20 hidden cells in a .xlsx file causes heap memory full exception
Summary: 20^20 hidden cells in a .xlsx file causes heap memory full exception
Status: RESOLVED WORKSFORME
Alias: None
Product: POI
Classification: Unclassified
Component: XSSF (show other bugs)
Version: unspecified
Hardware: PC Mac OS X 10.1
: P2 major (vote)
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-08-02 11:08 UTC by Lakshminathan
Modified: 2018-01-01 08:50 UTC (History)
0 users



Attachments
File which causes heap memory out (8.57 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
2017-08-02 11:08 UTC, Lakshminathan
Details
Information about the bug (169.73 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2017-08-03 20:25 UTC, Lakshminathan
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Lakshminathan 2017-08-02 11:08:05 UTC
Created attachment 35196 [details]
File which causes heap memory out

An .xlsx file has only one cell filled but all other cells are hidden. When tried to read the file using " workbook = new org.apache.poi.xssf.usermodel.XSSFWorkbook(in);" this throws an Heap memory full exception.
Comment 1 Nick Burch 2017-08-02 11:24:45 UTC
How big a heap did you give to Apache POI? Most JVM default heaps are way way too small these days
Comment 2 Lakshminathan 2017-08-02 11:33:09 UTC
The file size as you can see from the attachment is just 9Kb . It has only one cell with data. But since there are 2^20 hidden cells Apache POI allocates more heap space causing heap space full exception. I think this can be handled only by Apache POI because file size is the only information we get even before reading the file.
Comment 3 PJ Fanning 2017-08-02 18:42:40 UTC
The XSSF model loads all the data in the xlsx file, even if it is hidden, The XSSF cell interface will let you know which cells are hidden and which are not.
In theory, the XSSF workbook code where we load the xlsx data could have an option to ignore hidden cells or hidden sheets. This would be an enhancement and I'm not sure how useful this feature is.
You could also try the XSSFReader which let's you read the xlsx file in a streaming way.
Comment 4 Greg Woolsey 2017-08-02 18:59:39 UTC
To clarify:

1. even empty hidden cells are still cells, and as such require object allocations when read using the XSSFWorkbook API.  no space is allocated if they are truly missing from the workbook - not even defined in the OOXML file.

2. OOXML files are ZIP packages containing mostly XML files.  Cells defined but empty compress extremely well, so a 9k file can actually unzip to hundreds of megabytes.  Unzip your sample file and see just how big it really is.  This will give you a good rough start for estimating the heap space required to read it using the XSSFWorkbook API.  I'd say double the unzipped size as a good first guess.

If the result turns out to be too big for the available heap, try using the streaming API as mentioned by PJ Fanning in comment 3.
Comment 5 Javen O'Neal 2017-08-03 09:09:29 UTC
The attached workbook, attachment 35196 [details], is actually quite small uncompressed and doesn't describe many cells.

xl/worksheets/sheet1.xml:
<cols>
  <col min="1" max="1" width="9.140625" customWidth="1"/>
  <col min="2" max="16384" width="9.140625" hidden="1"/>
</cols>
<sheetData>
  <row r="1" spans="1:1" x14ac:dyDescent="0.25">
    <c r="A1" t="s">
      <v>0</v>
    </c>
  </row>
</sheetData>

sheet2 and 3 are even smaller.

The entire file expands to 28 KB on disk, and I didn't see anything suspicious in sharedStrings or elsewhere that would cause this workbook to misbehave.

Lakshminathan, please verify whether you're able to read https://svn.apache.org/repos/asf/poi/trunk/test-data/spreadsheet/SampleSS.xlsx in your application.

Which JVM are you using and what is your max heap size?
What version of POI are you using?
Is there anything else in your application that could be consuming a substantial part of the heap?
Comment 6 Lakshminathan 2017-08-03 19:41:32 UTC
@Team, Thank you for the response. Please find below the requested information

https://docs.google.com/document/d/1Lf1FEeQDillIrT4fTz-u9nKiUp7s9_60RwTrEApZ4jk/edit?usp=sharing
Comment 7 Javen O'Neal 2017-08-03 20:14:47 UTC
(In reply to Javen O'Neal from comment #5)
> Lakshminathan, please verify whether you're able to read
> https://svn.apache.org/repos/asf/poi/trunk/test-data/spreadsheet/SampleSS.
> xlsx in your application.
> 
> Which JVM are you using and what is your max heap size?
> What version of POI are you using?
> Is there anything else in your application that could be consuming a
> substantial part of the heap?
Comment 8 Javen O'Neal 2017-08-03 20:17:44 UTC
Please add supporting information as a comment or an attachment. External dependencies may break or be inaccessible to some users.
Comment 9 Lakshminathan 2017-08-03 20:25:34 UTC
Created attachment 35199 [details]
Information about the bug

> Lakshminathan, please verify whether you're able to read
> https://svn.apache.org/repos/asf/poi/trunk/test-data/spreadsheet/SampleSS.
> xlsx in your application. - yes,its working
> 
> Which JVM are you using and what is your max heap size? details attached
> What version of POI are you using?3.16
> Is there anything else in your application that could be consuming a
> substantial part of the heap? No

Further details attached.
Comment 10 Javen O'Neal 2017-08-04 08:31:52 UTC
attachment 35199 [details] references a different file that was uploaded to Google Sheets which is 2.75 MB instead of 8 KB in size.
Unzipping the 2.75 MB file expands to 27.9 MB (90% compression ratio is typical for workbooks that do not contain embedded objects).
xl/worksheets/sheet1.xml is 27.9 MB (leaving a few KB for the rest of the extracted files).

Inspecting the XML, I see that the workbook defines 1 cell at A1 and 1 million rows:
<sheetFormatPr defaultColWidth="0" defaultRowHeight="15" zeroHeight="1"/>
<cols>
  <col min="1" max="1" width="9.140625" customWidth="1"/>
  <col min="2" max="16384" width="9.140625" hidden="1"/>
</cols>
<sheetData>
  <row r="1" spans="1:1">
    <c r="A1" t="s"><v>0</v></c>
  </row>
  <row r="2" spans="1:1" hidden="1"/>
  <row r="3" spans="1:1" hidden="1"/>
  ...
  <row r="16" spans="1:1" hidden="1"/>
  <row r="17" hidden="1"/>
  ...
  <row r="1048556" hidden="1"/>
  <row r="1048557" hidden="1"/>
</sheetData>

POI uses a TreeMap<int rowNumber, XSSFRow row> to store these rows for fast random and sequential access, at the cost of some memory. Oracle Java 8 docs state that it implements TreeMap using a self-balancing pointer-based red-black tree. This should be fine.

To test that your JVM can handle 1 million items in a TreeMap, do something like the following:
private static final Random rand = new Random();

private Object createFakeRow(int nbytes) {
    byte[] row = new byte[nbytes];
    rand.nextBytes(row);
    return row;
}

SortedMap<Integer, Object> rows = new TreeMap<>();
for (int r=1; r<=1048557; r++) {
    // create an object that consumes 10 KB of RAM in place of a real XSSFRow
    rows.put(i, createFakeRow(10*1024));
}

If that works without issue, then the next focus is on what XMLBeans is doing as it's reading Sheet1.xml.
You could look at how POI unzips the file into memory and reads each XML file into an XML DOM using XMLBeans and how much extra memory is consumed by the CT classes.
Before we blame XMLBeans, we'd have to fairly compare it with JAXB.
  
Here's a Google Drive link to this file since the file exceeds the file size limit of bugzilla: https://drive.google.com/file/d/0B2v9cndcBwIWeURpNXVrSjJYbDg/view

Based on your reported 9 GB of RAM, the amortized size of each XSSFRow would be roughly 10 KB.

To make sure this really isn't a bug in POI's handling of hidden rows or columns, we'd need to test for OOM on a workbook with the same number of rows defined but all of them visible.

I think you have enough ideas here to try to figure out where the problem is, which is needed before a potential fix can be written.
Comment 11 Dominik Stadler 2018-01-01 08:50:30 UTC
There was no indication of an actual "bug" in Apache POI, memory usage for a file with 1 million rows is expected to be considerable, so unless there is more evidence of too much memory usage, we are resolving this issue, please reopen with more information if you still think there is something wrong in Apache POI itself.