Bug 58656 - ArrayIndexOutOfBounds when parsing ms word document
Summary: ArrayIndexOutOfBounds when parsing ms word document
Status: NEW
Alias: None
Product: POI
Classification: Unclassified
Component: POIFS
Version: 3.13-FINAL
Hardware: PC Linux
Importance: P2 enhancement with 3 votes
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Reported: 2015-11-26 15:02 UTC by Panagiotis Bailis
Modified: 2015-12-06 23:16 UTC
CC List: 0 users

MS word file that raises an AIOOBE (poi v3.13-FINAL) (447.00 KB, application/msword)
2015-11-26 15:02 UTC, Panagiotis Bailis

Description Panagiotis Bailis 2015-11-26 15:02:45 UTC
Created attachment 33300
MS word file that raises an AIOOBE (poi v3.13-FINAL)


We are trying to parse a number of MS Word documents using Tika v1.11 (POI v3.13-FINAL). However, an ArrayIndexOutOfBoundsException (AIOOBE) is raised when parsing the attached document. Even if the file is corrupted, shouldn't we get a more specific exception than an "Unexpected RuntimeException"?

Could you please have a look at this?


Cause: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@6ebda34c
  Cause: 128
Caused by: java.lang.ArrayIndexOutOfBoundsException: 128
        at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:224)
        at org.apache.poi.util.ShortField.readFromBytes(ShortField.java:166)
        at org.apache.poi.util.ShortField.<init>(ShortField.java:91)
        at org.apache.poi.poifs.property.Property.<init>(Property.java:165)
        at org.apache.poi.poifs.property.DirectoryProperty.<init>(DirectoryProperty.java:69)
        at org.apache.poi.poifs.property.PropertyFactory.convertToProperties(PropertyFactory.java:79)
        at org.apache.poi.poifs.property.NPropertyTable.buildProperties(NPropertyTable.java:110)
        at org.apache.poi.poifs.property.NPropertyTable.<init>(NPropertyTable.java:66)
        at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.readCoreContents(NPOIFSFileSystem.java:416)
        at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:228)
        at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:164)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:164)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
Comment 1 Dominik Stadler 2015-12-06 23:16:41 UTC
From a quick look, the code appears to try to read a huge property string while the actual byte array holding the data is much smaller, so it looks like the document really is incorrectly formatted. It also does not look good when opened in LibreOffice: lots of gibberish and no readable content as far as I can see.

So for now we can only try to improve the error message here with an additional check.
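
As a rough illustration of what such an additional check might look like, here is a minimal, self-contained sketch. It is not POI code: the class name CheckedLittleEndian, the method getShortChecked, and the choice of IOException are assumptions made for this example only. The idea is simply to validate the offset before the two-byte little-endian read, so that a truncated buffer produces a descriptive error instead of a bare ArrayIndexOutOfBoundsException.

```java
import java.io.IOException;

// Hypothetical helper for illustration only; not part of Apache POI.
public class CheckedLittleEndian {

    /**
     * Reads a little-endian 16-bit value from data at offset,
     * checking first that both bytes actually lie inside the array.
     */
    public static short getShortChecked(byte[] data, int offset) throws IOException {
        int length = (data == null) ? 0 : data.length;
        if (data == null || offset < 0 || offset > length - 2) {
            // Report the problem in terms the caller can act on.
            throw new IOException("Corrupt property data: cannot read 2 bytes at offset "
                    + offset + " from a buffer of " + length + " bytes");
        }
        int b0 = data[offset] & 0xFF;     // low byte comes first
        int b1 = data[offset + 1] & 0xFF; // high byte comes second
        return (short) ((b1 << 8) | b0);
    }

    public static void main(String[] args) {
        byte[] buf = new byte[128];
        buf[0] = 0x34;
        buf[1] = 0x12;
        try {
            // Valid read: bytes 0x34, 0x12 form the value 0x1234.
            System.out.println(Integer.toHexString(getShortChecked(buf, 0)));
            // Invalid read: offset 128 is past the end of a 128-byte buffer.
            getShortChecked(buf, 128);
        } catch (IOException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

With a check along these lines, a caller such as Tika could surface the failure as a corrupt-file error rather than an unexpected runtime exception. Where exactly the check belongs in the POIFS property-reading code is left open here.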