Created attachment 32674 [details]
failing document

Trying to parse this document via Tika. It appears to be a pretty vanilla Word 97 era .doc. It opens correctly in Word for Mac 2011. It's attached.

The document is already publicly posted and I grant any rights I have in the document to ASF; I should note that it's part of a publicly-posted dump of emails sent to/from former Florida Gov. Jeb Bush, so I don't hold copyright over it.

This is the POI version of https://issues.apache.org/jira/browse/TIKA-1608

Stacktrace looks like this:

$ java -jar /tika-app/target/tika-app-1.9-SNAPSHOT.jar --text 1534-attachment.doc
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@69af0db6
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:180)
	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:477)
	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:134)
Caused by: java.lang.ArrayIndexOutOfBoundsException
	at java.lang.System.arraycopy(Native Method)
	at org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl(PAPFormattedDiskPage.java:171)
	at org.apache.poi.hwpf.model.PAPFormattedDiskPage.<init>(PAPFormattedDiskPage.java:101)
	at org.apache.poi.hwpf.model.OldPAPBinTable.<init>(OldPAPBinTable.java:49)
	at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:109)
	at org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:532)
	at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:84)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:201)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:172)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
	... 5 more
POI detected this as a Word 95 or older file, requiring HWPFOldDocument to read the file. The file claims it is a Microsoft Word 6.0 Document, which is the file format of Word 6.0, released in 1993. [1]

[1] https://en.wikipedia.org/wiki/Microsoft_Word#Release_history

I got the same error as you in the latest version of POI, 3.16 trunk. I added this failing unit test in r1761873.

> java.lang.ArrayIndexOutOfBoundsException
> 	at java.lang.System.arraycopy(Native Method)
> 	at org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl(PAPFormattedDiskPage.java:171)
> 	at org.apache.poi.hwpf.model.PAPFormattedDiskPage.<init>(PAPFormattedDiskPage.java:101)
> 	at org.apache.poi.hwpf.model.OldPAPBinTable.<init>(OldPAPBinTable.java:49)
> 	at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:107)
> 	at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:45)
> 	at org.apache.poi.hwpf.usermodel.TestBugs.test57843(TestBugs.java:911)
There is a failing test for this at org.apache.poi.hwpf.usermodel.TestBugs.test57603SevenRowTable which was added via r1761873
Fixed, accidentally, in r1876641 while refactoring the System.arraycopy calls.
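
The thread does not say why the refactoring made the exception go away. One possibility, stated here as an assumption and not as a description of what r1876641 actually changed, is that a strict System.arraycopy call was replaced by a more lenient copy such as Arrays.copyOfRange. The sketch below uses only the JDK; the class ArrayCopyBounds, the methods strictCopy and lenientCopy, and the four-byte "page" are hypothetical illustrations, not POI code. It shows how a length field that claims more bytes than remain in the buffer fails with one call and is zero-padded by the other.

```java
import java.util.Arrays;

public class ArrayCopyBounds {

    // Strict copy: System.arraycopy rejects any range that runs past the end of src.
    static byte[] strictCopy(byte[] src, int offset, int length) {
        byte[] dest = new byte[length];
        System.arraycopy(src, offset, dest, 0, length);
        return dest;
    }

    // Lenient copy: Arrays.copyOfRange accepts an end index beyond src.length
    // and fills the missing tail with zeros.
    static byte[] lenientCopy(byte[] src, int offset, int length) {
        return Arrays.copyOfRange(src, offset, offset + length);
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a disk page whose length field is too large.
        byte[] page = {10, 20, 30, 40};
        int offset = 2;
        int declaredLength = 4; // claims 4 bytes, but only 2 remain after the offset

        try {
            strictCopy(page, offset, declaredLength);
            System.out.println("strict copy succeeded");
        } catch (IndexOutOfBoundsException e) {
            System.out.println("strict copy failed: " + e.getClass().getSimpleName());
        }

        // prints "lenient copy: [30, 40, 0, 0]"
        System.out.println("lenient copy: "
                + Arrays.toString(lenientCopy(page, offset, declaredLength)));
    }
}
```

Zero-padding lets parsing continue on a malformed file, but it also hides the inconsistency, so the resulting paragraph properties may not be trustworthy. Whether that trade-off applies to this bug is not confirmed anywhere in the thread.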