Bug 58656

Summary: ArrayIndexOutOfBounds when parsing ms word document
Product: POI
Reporter: Panagiotis Bailis <pmpailis>
Component: POIFS
Assignee: POI Developers List <dev>
Status: NEW ---    
Severity: enhancement    
Priority: P2    
Version: 3.13-FINAL   
Target Milestone: ---   
Hardware: PC   
OS: Linux   
Attachments: MS word file that raises an AIOOBE (poi v3.13-FINAL)

Description Panagiotis Bailis 2015-11-26 15:02:45 UTC
Created attachment 33300
MS word file that raises an AIOOBE (poi v3.13-FINAL)


We are trying to parse a number of MS Word documents using Tika v1.11 (POI v3.13-FINAL). However, an AIOOBE is raised when parsing the attached document. Even if the file is corrupted, shouldn't we get a more specific exception than an "Unexpected RuntimeException"?

Could you please have a look at this?


Cause: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@6ebda34c
  Cause: 128
Caused by: java.lang.ArrayIndexOutOfBoundsException: 128
        at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:224)
        at org.apache.poi.util.ShortField.readFromBytes(ShortField.java:166)
        at org.apache.poi.util.ShortField.<init>(ShortField.java:91)
        at org.apache.poi.poifs.property.Property.<init>(Property.java:165)
        at org.apache.poi.poifs.property.DirectoryProperty.<init>(DirectoryProperty.java:69)
        at org.apache.poi.poifs.property.PropertyFactory.convertToProperties(PropertyFactory.java:79)
        at org.apache.poi.poifs.property.NPropertyTable.buildProperties(NPropertyTable.java:110)
        at org.apache.poi.poifs.property.NPropertyTable.<init>(NPropertyTable.java:66)
        at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.readCoreContents(NPOIFSFileSystem.java:416)
        at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:228)
        at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:164)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:164)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)

Comment 1 Dominik Stadler 2015-12-06 23:16:41 UTC
From a quick look, it seems that the parsing code tries to read a huge property-string while the actual byte array holding the data is much smaller, so the document really does appear to be incorrectly formatted. It also does not look good when opened in LibreOffice: lots of gibberish and no readable content as far as I can see.

So for now we can only try to improve the error message here with an additional check.
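
As a rough illustration of what such an additional check could look like (a sketch only, not the actual POI change): the class `BoundsCheckedReader` and its `getShort` method below are hypothetical names, and the choice of `IOException` is an assumption. The idea is to validate the offset against the buffer length before the little-endian read, so that a truncated buffer produces a descriptive error instead of a bare ArrayIndexOutOfBoundsException.

```java
import java.io.IOException;

/** Hypothetical helper, not part of POI: a bounds-checked little-endian short read. */
final class BoundsCheckedReader {
    private BoundsCheckedReader() {}

    /**
     * Reads a 16-bit little-endian value, failing with a descriptive
     * IOException (an assumed choice) when the buffer is too small.
     */
    static short getShort(byte[] data, int offset) throws IOException {
        // A 16-bit value needs two bytes: data[offset] and data[offset + 1].
        if (offset < 0 || offset > data.length - 2) {
            throw new IOException("Corrupt document: cannot read 2 bytes at offset "
                    + offset + " from a buffer of " + data.length + " bytes");
        }
        int low = data[offset] & 0xFF;       // least significant byte comes first
        int high = data[offset + 1] & 0xFF;  // most significant byte comes second
        return (short) ((high << 8) | low);
    }

    public static void main(String[] args) {
        // Same shape as the reported failure: index 128 into a 128-byte buffer.
        byte[] truncated = new byte[128];
        try {
            getShort(truncated, 128);
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With a check like this, a caller would see a message naming the offset and the buffer size rather than just "128". Whether the real fix should throw an IOException or some other exception type is left open here.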