Bug 61295 - Vector.read -- Java heap space on corrupt file
Summary: Vector.read -- Java heap space on corrupt file
Alias: None
Product: POI
Classification: Unclassified
Component: HPSF
Version: 3.16-FINAL
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Reported: 2017-07-13 19:31 UTC by Tim Allison
Modified: 2017-07-25 01:39 UTC

triggering file (60.50 KB, application/x-ole-storage)
2017-07-13 19:31 UTC, Tim Allison

Description Tim Allison 2017-07-13 19:31:51 UTC
Created attachment 35128
triggering file

I started experimenting with randomly corrupting files based on feedback from Luis Filipe Nassif [1].  The attached file triggers this:

java.lang.OutOfMemoryError: Java heap space

	at org.apache.poi.hpsf.Vector.read(Vector.java:43)
	at org.apache.poi.hpsf.TypedPropertyValue.readValue(TypedPropertyValue.java:219)
	at org.apache.poi.hpsf.VariantSupport.read(VariantSupport.java:174)
	at org.apache.poi.hpsf.Property.<init>(Property.java:179)
	at org.apache.poi.hpsf.MutableProperty.<init>(MutableProperty.java:53)
	at org.apache.poi.hpsf.Section.<init>(Section.java:237)
	at org.apache.poi.hpsf.MutableSection.<init>(MutableSection.java:41)
	at org.apache.poi.hpsf.PropertySet.init(PropertySet.java:494)
	at org.apache.poi.hpsf.PropertySet.<init>(PropertySet.java:196)
	at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:83)
	at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:74)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:155)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:131)

[1] https://issues.apache.org/jira/browse/TIKA-2428?focusedCommentId=16086045&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16086045
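The likely failure mode behind this trace is a sketch worth spelling out: if a reader allocates an array sized by an untrusted length field taken straight from the file, a corrupt length (here roughly 1.3 billion) triggers the OutOfMemoryError at allocation time, before a single element is read. The following is a hypothetical illustration of that pattern, not the actual POI `Vector.read` code; it uses big-endian `DataInputStream` for self-containment, whereas the real property-set format is little-endian.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class OomSketch {
    // Hypothetical illustration (not the actual POI code): sizing an array
    // from an untrusted 32-bit length means a corrupt file can demand an
    // enormous allocation before any element is read.
    static long[] readNaive(DataInputStream in) throws IOException {
        int claimed = in.readInt();           // attacker-controlled length
        long[] values = new long[claimed];    // OOM happens here on a corrupt file
        for (int i = 0; i < claimed; i++) {
            values[i] = in.readLong();
        }
        return values;
    }

    public static void main(String[] args) throws IOException {
        // A tiny well-formed "file": a length of 1, then one 8-byte value (42).
        byte[] data = {0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 42};
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        System.out.println(readNaive(in)[0]);
    }
}
```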
Comment 1 Tim Allison 2017-07-25 01:27:12 UTC
The actual Vector size that is causing an OOM in Tika is 1,358,954,497 on one triggering file.  We could arbitrarily set a max_value << Integer.MAX_VALUE, or we could use a list and then convert that to an array.  If we do the latter, and there is a corrupt size value, the LittleEndianInputStream will throw an exception when asked to read beyond what is available in the stream.

I somewhat prefer the second option.  Commit on way...

Happy to go with the first or open to other options...
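The second option described above can be sketched roughly as follows. This is an illustrative standalone version, not the actual commit: the method names are hypothetical, and plain `DataInputStream` stands in for POI's `LittleEndianInputStream`. The key property is that a corrupt length causes an `EOFException` once the stream runs dry, rather than an up-front OutOfMemoryError.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class VectorReadSketch {
    // Hypothetical sketch of the list-then-array approach: accumulate
    // elements as they are successfully read instead of pre-allocating
    // an array of the claimed (possibly corrupt) size.
    static int[] readVector(DataInputStream in, long claimedLength) throws IOException {
        List<Integer> values = new ArrayList<>();
        for (long i = 0; i < claimedLength; i++) {
            // If claimedLength overstates the available data, readInt()
            // throws EOFException here instead of the JVM dying on a
            // giant allocation.
            values.add(in.readInt());
        }
        int[] result = new int[values.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = values.get(i);
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[8]; // room for exactly two ints

        DataInputStream ok = new DataInputStream(new ByteArrayInputStream(data));
        System.out.println(readVector(ok, 2).length);

        DataInputStream bad = new DataInputStream(new ByteArrayInputStream(data));
        try {
            readVector(bad, 1_358_954_497L); // the corrupt length from this report
        } catch (IOException e) {
            System.out.println("rejected: " + e.getClass().getSimpleName());
        }
    }
}
```

The trade-off versus a hard cap is that no arbitrary maximum has to be chosen; the stream's own bounds do the validation.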
Comment 2 Tim Allison 2017-07-25 01:39:58 UTC

I didn't add a test file because I didn't think the test was worth 65 KB.

I can look for a shorter triggering file if necessary.