|Summary:||OutOfMemoryError parsing a word file|
|Product:||POI||Reporter:||Jerome Lacoste <jerome.lacoste>|
|Component:||HPSF||Assignee:||POI Developers List <dev>|
|Attachments:||An anonymised Doc file reproducing the problem|
Description Jerome Lacoste 2011-12-20 12:08:45 UTC
Created attachment 28090: An anonymised Doc file reproducing the problem

Calling Parser#parseToString on the attached file produces an OOME. This is because Tika doesn't validate the size it tries to allocate. Had it been C code, this could have been a buffer overflow...

Not sure whether the file is corrupted or not; it opens fine in Word on both Mac and Windows. Saving the file in one of these editors causes the problem to disappear, so we've manually edited the content of the file to anonymise it while keeping it as close as possible to the original. We're able to create similar problems by flipping bits in files.

java.lang.OutOfMemoryError: Java heap space
    at org.apache.poi.hpsf.Section.<init>(Section.java:207)
    at org.apache.poi.hpsf.PropertySet.init(PropertySet.java:451)
    at org.apache.poi.hpsf.PropertySet.<init>(PropertySet.java:246)
    at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:73)
    at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:64)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:177)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.Tika.parseToString(Tika.java:380)
    at org.apache.tika.Tika.parseToString(Tika.java:414)
    at no.finntech.tika.harderner.TikaIndexerHardenerTest.parseContent(TikaIndexerHardenerTest.java:100)
    at no.finntech.tika.harderner.TikaIndexerHardenerTest.indexContent(TikaIndexerHardenerTest.java:91)
    at no.finntech.tika.harderner.TikaIndexerHardenerTest.originalFileIndexesProperly2(TikaIndexerHardenerTest.java:34)
Comment 1 Nick Burch 2011-12-20 13:24:36 UTC
Can you confirm if this issue still occurs with POI 3.8 beta 5 (just released) or not?
Comment 2 Jerome Lacoste 2011-12-20 13:34:44 UTC
Yes, it still fails with 3.8-beta5.
Comment 3 Nick Burch 2011-12-20 13:37:54 UTC
Note that this bug doesn't look to be Word-specific: the exception is coming from the common HPSF properties code, rather than from HWPF.
Comment 4 Jerome Lacoste 2011-12-20 15:01:12 UTC
I filed it under HPFS, since that was the module with the closest name to HPSF... Picking the right module was a bit confusing!
Comment 5 Nick Burch 2011-12-20 23:35:08 UTC
Bah, looks like there was a typo in the component name (dating back quite a number of years...), which should now be fixed. In general, each of the components has help text describing the subject area and package it covers, which should help with identifying the right one.
Comment 6 Nick Burch 2011-12-23 03:24:35 UTC
The issue is that we're reading a value that should contain the number of properties in the section, then trying to create an array to hold that many properties (before reading into them, so it couldn't be a buffer overflow even in C!).

What we're not doing is sanity checking the number of properties, so if the file has been corrupted and that value is very large, we trust it at that point and try to allocate a big array. (Later on we'd throw a different exception on discovering that the value was corrupt and specified more properties than there's data for.)

We could probably do some checks on the size, and also move the array initialisation to after the first pass.

Are you able to check the Microsoft documentation to see what the limit on the number of properties in a section is? (That'd be an easy sanity check to do first.)
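As a rough illustration of the "move the array initialisation to after the first pass" idea, here is a minimal sketch. It is not POI's actual code: the class name, the method name and the assumed layout (a 4-byte little-endian count followed by 8-byte id/offset entries) are assumptions made for the example. The point is only that the list grows as entries are really read, so a bogus count fails fast instead of triggering a huge allocation.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of POI.
public class LazyPropertyListSketch {

    // Assumed on-disk size of one (id, offset) entry: two 4-byte integers.
    static final int ENTRY_SIZE = 8;

    /**
     * Reads the declared property count at 'offset', then reads entries one
     * by one. Nothing is sized from the declared count, so a corrupt count
     * can never cause a large up-front allocation.
     */
    static List<long[]> readEntries(byte[] data, int offset) {
        if (offset < 0 || data.length - offset < 4) {
            throw new IllegalArgumentException("No room for a property count");
        }
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        buf.position(offset);

        // Treat the 4-byte count as unsigned.
        long declared = buf.getInt() & 0xFFFFFFFFL;

        List<long[]> entries = new ArrayList<>();
        for (long i = 0; i < declared; i++) {
            if (buf.remaining() < ENTRY_SIZE) {
                throw new IllegalArgumentException(
                    "Section declares " + declared
                    + " properties but the data ends after " + i);
            }
            long id = buf.getInt() & 0xFFFFFFFFL;
            long propertyOffset = buf.getInt() & 0xFFFFFFFFL;
            entries.add(new long[] {id, propertyOffset});
        }
        return entries;
    }
}
```

With this shape, a corrupted count produces an ordinary exception that the caller can handle, rather than an OutOfMemoryError.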
Comment 7 Jerome Lacoste 2011-12-23 09:01:27 UTC
> What we're not doing is sanity checking the number of properties
> so if the file has been corrupted

Just a question: are we sure the file is corrupted? Word opens it properly on both Windows and Mac. Also, the place where the code tries to read the property count contains some text: "Hewlett-Packard".

> Are you able to check the Microsoft Documentation to see what the limit on the
> number of properties in a section is? (That'd be an easy sanity check to do
> first)

http://msdn.microsoft.com/en-us/library/dd949336%28v=office.12%29.aspx

I wasn't able to find a maximum number of properties.

From the .Doc structure format:
http://msdn.microsoft.com/en-us/library/cc313153%28v=office.12%29.aspx

Example of a section:
http://msdn.microsoft.com/en-us/library/dd907622%28v=office.12%29.aspx

Property storage:
http://msdn.microsoft.com/en-us/library/dd949336%28v=office.12%29.aspx

But we may be able to use a different limit. We know the document/buffer length, so there can surely be at most (buffer length) / (min property length) properties.
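To make the proposed limit concrete, here is a small sketch of the arithmetic. The class and method names are hypothetical, and the minimum property length of 8 bytes is an assumption (one 4-byte id plus one 4-byte offset per property); whatever the true minimum is, the same division applies.

```java
// Hypothetical helper, not part of POI or Tika.
public class PropertyCountBound {

    // Assumption: every property needs at least an 8-byte id/offset entry.
    static final int MIN_PROPERTY_LENGTH = 8;

    /** Upper bound on properties that can fit after 'sectionOffset'. */
    static long maxProperties(long bufferLength, long sectionOffset) {
        long remaining = bufferLength - sectionOffset;
        return remaining <= 0 ? 0 : remaining / MIN_PROPERTY_LENGTH;
    }

    /** True if the declared count could possibly fit in the buffer. */
    static boolean isPlausible(long declaredCount, long bufferLength,
                               long sectionOffset) {
        return declaredCount >= 0
            && declaredCount <= maxProperties(bufferLength, sectionOffset);
    }
}
```

For example, with a 4096-byte buffer and a section starting at byte 96, at most (4096 - 96) / 8 = 500 properties could fit, so a declared count in the millions would be rejected before any array is allocated.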