Created attachment 35128 [details]
triggering file

I started experimenting with randomly corrupting files based on feedback from Luis Filipe Nassif [1]. The attached file triggers this:

java.lang.OutOfMemoryError: Java heap space
    at org.apache.poi.hpsf.Vector.read(Vector.java:43)
    at org.apache.poi.hpsf.TypedPropertyValue.readValue(TypedPropertyValue.java:219)
    at org.apache.poi.hpsf.VariantSupport.read(VariantSupport.java:174)
    at org.apache.poi.hpsf.Property.<init>(Property.java:179)
    at org.apache.poi.hpsf.MutableProperty.<init>(MutableProperty.java:53)
    at org.apache.poi.hpsf.Section.<init>(Section.java:237)
    at org.apache.poi.hpsf.MutableSection.<init>(MutableSection.java:41)
    at org.apache.poi.hpsf.PropertySet.init(PropertySet.java:494)
    at org.apache.poi.hpsf.PropertySet.<init>(PropertySet.java:196)
    at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:83)
    at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:74)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:155)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:131)

[1] https://issues.apache.org/jira/browse/TIKA-2428?focusedCommentId=16086045&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16086045
The actual Vector size that causes the OOM in Tika is 1,358,954,497 on one triggering file. We could arbitrarily cap the size at some maximum well below Integer.MAX_VALUE, or we could read the elements into a list and then convert that to an array. With the latter approach, if the size value is corrupt, the LittleEndianInputStream will throw an exception as soon as it is asked to read beyond what is available in the stream, rather than allocating the full array up front. I somewhat prefer the second option; a sketch of both is below. Commit on the way... Happy to go with the first, or open to other options...
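To make the trade-off concrete, here is a minimal, self-contained sketch of both approaches using plain java.io; the class and method names are mine for illustration, and POI's actual Vector.read operates on its LittleEndian input types rather than DataInputStream.

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class UntrustedLengthDemo {

    // Unsafe: a corrupt length drives a huge up-front allocation and can
    // OOM before a single element is read.
    static int[] readUnsafe(DataInputStream in) throws IOException {
        int length = in.readInt();
        int[] values = new int[length]; // OOM here if length is e.g. 1,358,954,497
        for (int i = 0; i < length; i++) {
            values[i] = in.readInt();
        }
        return values;
    }

    // Safer: grow a list element by element; a lying length hits
    // end-of-stream (EOFException) long before memory is exhausted.
    static int[] readSafe(DataInputStream in) throws IOException {
        int length = in.readInt();
        List<Integer> values = new ArrayList<>();
        for (int i = 0; i < length; i++) {
            values.add(in.readInt()); // throws EOFException on truncated data
        }
        int[] result = new int[values.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = values.get(i);
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // A "file" claiming 1,000,000 elements (0x000F4240) but containing only two.
        byte[] corrupt = {0x00, 0x0F, 0x42, 0x40, 0, 0, 0, 1, 0, 0, 0, 2};
        try {
            readSafe(new DataInputStream(new ByteArrayInputStream(corrupt)));
        } catch (IOException e) {
            System.out.println("Rejected corrupt length: " + e);
        }
    }
}

The list-based version bounds allocation by the data actually present in the stream, so no arbitrary cap needs to be chosen; the cost is the extra copy from list to array for well-formed files.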
Fixed in r1802879.

I didn't add a test file because I didn't think the test was worth 65KB. I can look for a smaller triggering file if necessary.