Bug 52069 - Heap out of memory errors for large xlsx files - even when using PipedReader to read file
Alias: None
Product: POI
Classification: Unclassified
Component: XSSF
Version: unspecified
Hardware: PC Windows XP
Importance: P2 major
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Reported: 2011-10-21 14:54 UTC by Meghana
Modified: 2011-10-21 15:02 UTC
CC List: 0 users

Dodgy xlsx file (993.54 KB, application/octet-stream)
2011-10-21 14:56 UTC, Meghana

Description Meghana 2011-10-21 14:54:36 UTC
While parsing an xlsx file of about 4 MB with Apache Tika 0.9, I ran into a heap out-of-memory error. I am using PipedReader and PipedWriter to access the file content, so I do not believe the heap size setting is the real problem: the same code has been running fine with much larger files.
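
For reference, a minimal sketch of the kind of piped setup described above, using only java.io. The class and method names (PipedExtractDemo, openPipedReader, drain) are hypothetical and are not Tika API; a fixed list of text chunks stands in for the parser. The sketch only illustrates that the pipe bounds memory for the extracted text handed to the consumer. Anything the parser builds internally while producing that text would still live on the heap.

```java
import java.io.IOException;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;

public class PipedExtractDemo {

    // Starts a producer thread that writes each chunk into a pipe and
    // returns the reading end. The chunk list is an assumption standing
    // in for whatever text a real parser would emit.
    static Reader openPipedReader(List<String> extractedChunks) {
        try {
            PipedWriter writer = new PipedWriter();
            // 1024-char pipe buffer: the producer blocks when it is full.
            PipedReader reader = new PipedReader(writer, 1024);
            Thread producer = new Thread(() -> {
                // Closing the writer signals end-of-stream to the reader.
                try (PipedWriter w = writer) {
                    for (String chunk : extractedChunks) {
                        w.write(chunk);
                    }
                } catch (IOException e) {
                    // Reader went away early; nothing more to write.
                }
            });
            producer.setDaemon(true);
            producer.start();
            return reader;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the pipe to the end. Collecting everything into one String is
    // for the demo only; a real consumer would process each buffer as it arrives.
    static String drain(Reader reader) {
        StringBuilder out = new StringBuilder();
        char[] buffer = new char[256];
        try (Reader r = reader) {
            int n;
            while ((n = r.read(buffer)) != -1) {
                out.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Reader reader = openPipedReader(Arrays.asList("row 1\n", "row 2\n"));
        System.out.print(drain(reader));
    }
}
```

In this arrangement only the pipe buffer sits between producer and consumer, so heap growth seen while parsing would have to come from the producer side.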

Looking at memory consumption in a profiler, I found that instances of two classes, org.apache.xmlbeans.impl.store.Xobj$AttrXobj and Xobj$ElementXobj, seem to grow exponentially with file size. For the file mentioned above, there were more than 1,600,000 objects of type Xobj$AttrXobj.

I am attaching the xlsx file which caused this error. 

Note: this error also occurs for .docx files.
Comment 1 Meghana 2011-10-21 14:56:39 UTC
Created attachment 27835
Dodgy xlsx file
Comment 2 Nick Burch 2011-10-21 15:02:56 UTC
Please retry with Tika 0.10; this should be fixed there.