Created attachment 26027 [details]
Word 95 document with a section

Processing a Word 6/Word 95 document with sections causes an ArrayIndexOutOfBoundsException.

Tika (Revision: 997224, 2010-09-14) with the POI 3.7-beta2 dependency gives rise to the following on the attached document:

Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@1e7c5cb
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:165)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:146)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:197)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:71)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 22
    at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:46)
    at org.apache.poi.hwpf.sprm.SprmOperation.<init>(SprmOperation.java:54)
    at org.apache.poi.hwpf.sprm.SprmIterator.next(SprmIterator.java:45)
    at org.apache.poi.hwpf.sprm.SectionSprmUncompressor.uncompressSEP(SectionSprmUncompressor.java:36)
    at org.apache.poi.hwpf.model.SEPX.<init>(SEPX.java:33)
    at org.apache.poi.hwpf.model.OldSectionTable.<init>(OldSectionTable.java:61)
    at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:103)
    at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:42)
    at org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:150)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:51)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:187)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:163)
That turned out to be slightly trickier than expected, as there were issues with both the Sprm decoding and the byte/character translation on the old section table.

Fixed in r998131. The fix also seems to have improved some problem Word 97 files, so it's not all bad!
Hi Nick,

Thanks for your fix. It fixes several documents, but some documents still produce ArrayIndexOutOfBoundsExceptions. I have attached the files that cause the exceptions to be thrown. Unfortunately my knowledge of old Word docs is limited, otherwise I could have helped.

Stacktraces:

Processing: Case1.doc
java.lang.ArrayIndexOutOfBoundsException: 240
    at org.apache.poi.hwpf.sprm.SprmOperation.getOperand(SprmOperation.java:94)
    at org.apache.poi.hwpf.sprm.SectionSprmUncompressor.unCompressSEPOperation(SectionSprmUncompressor.java:57)
    at org.apache.poi.hwpf.sprm.SectionSprmUncompressor.uncompressSEP(SectionSprmUncompressor.java:37)
    at org.apache.poi.hwpf.model.SEPX.<init>(SEPX.java:33)
    at org.apache.poi.hwpf.model.OldSectionTable.<init>(OldSectionTable.java:61)
    at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:103)
    at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:42)
    at com.ravn.test.poi.OldMSDocTester.parse(OldMSDocTester.java:27)
    at com.ravn.test.Tester.main(Tester.java:29)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:115)

Processing: Case2.doc
java.lang.ArrayIndexOutOfBoundsException: 244

Processing: Case3.doc
java.lang.ArrayIndexOutOfBoundsException: 32

Processing: Case4.doc
java.lang.ArrayIndexOutOfBoundsException: 26

Processing: Case5.doc
java.lang.ArrayIndexOutOfBoundsException: 238

Processing: Case6.doc
java.lang.ArrayIndexOutOfBoundsException: 247
Created attachment 26046 [details] Documents that throw an ArrayIndexOutOfBoundsException
I've added a slightly icky fix: a couple of spare 0 bytes are appended to the end of the array, so that we should always be able to decode the SEPX without error, even if we can't always make full sense of the contents...

I can now process all 6 of your files without error.
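For readers who want to see why padding helps, here is a minimal sketch of the idea, not the actual POI change. The class `PaddedSprmBuffer` and its methods are hypothetical names made up for illustration. It assumes the failure mode is a two-byte little-endian read that starts on the last byte of the buffer.

```java
import java.util.Arrays;

// Hypothetical illustration only; these names are not part of POI.
final class PaddedSprmBuffer {

    // Reads an unsigned 16-bit little-endian value, low byte first.
    // Throws ArrayIndexOutOfBoundsException if offset + 1 is past the end.
    static int getUShort(byte[] data, int offset) {
        int low = data[offset] & 0xFF;
        int high = data[offset + 1] & 0xFF;
        return (high << 8) | low;
    }

    // Returns a copy of data with `spare` extra zero bytes at the end.
    // Arrays.copyOf fills the added positions with zeros.
    static byte[] padWithZeros(byte[] data, int spare) {
        return Arrays.copyOf(data, data.length + spare);
    }
}
```

With a three-byte buffer, `getUShort(raw, 2)` runs off the end, while the same read on `padWithZeros(raw, 2)` succeeds and treats the missing high byte as 0. The decoded value may not be meaningful, which matches the caveat above about not always making sense of the contents.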
The last fix broke another Word95 file:

java.lang.ArrayIndexOutOfBoundsException: 34
    at org.apache.poi.hwpf.sprm.SprmOperation.getOperand(SprmOperation.java:94)
    at org.apache.poi.hwpf.sprm.SectionSprmUncompressor.unCompressSEPOperation(SectionSprmUncompressor.java:57)
    at org.apache.poi.hwpf.sprm.SectionSprmUncompressor.uncompressSEP(SectionSprmUncompressor.java:37)
    at org.apache.poi.hwpf.model.SEPX.<init>(SEPX.java:33)
    at org.apache.poi.hwpf.model.OldSectionTable.<init>(OldSectionTable.java:66)
    at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:103)
    at org.apache.poi.hwpf.extractor.Word6Extractor.<init>(Word6Extractor.java:58)
    at org.apache.poi.hwpf.extractor.Word6Extractor.<init>(Word6Extractor.java:55)
    at org.apache.poi.hwpf.extractor.Word6Extractor.<init>(Word6Extractor.java:47)
    at org.apache.poi.hwpf.extractor.TestWordExtractor.testWord95err(TestWordExtractor.java:279)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at com.intellij.junit3.JUnit3IdeaTestRunner.doRun(JUnit3IdeaTestRunner.java:108)
    at com.intellij.junit3.JUnit3IdeaTestRunner.startRunnerWithArgs(JUnit3IdeaTestRunner.java:42)
    at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:192)
    at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:64)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:115)
Created attachment 26083 [details] word95 doc
Workaround for this bug implemented in trunk. Section properties are no longer parsed immediately on loading. Text is extracted (but encoding is not, sorry).

A "real" fix should include a new Word95 SPRM parser (which is different from the Word97-or-later SPRM parser).
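The deferred-parsing idea can be sketched roughly as follows. This is only an illustration of lazy initialization under assumed names, not the code committed to trunk: `LazyProperties` is a hypothetical class, and the `Supplier` stands in for whatever routine decodes the section properties.

```java
import java.util.function.Supplier;

// Hypothetical illustration only; not the class used in POI.
// Holds a parse routine and runs it on first access instead of at load time.
final class LazyProperties<T> {
    private final Supplier<T> parser;
    private T value;
    private boolean parsed;

    // Storing the parser does no parsing, so construction cannot fail
    // on malformed section data.
    LazyProperties(Supplier<T> parser) {
        this.parser = parser;
    }

    // Parses on the first call and caches the result. If the parser throws,
    // nothing is cached and the exception reaches only this caller.
    T get() {
        if (!parsed) {
            value = parser.get();
            parsed = true;
        }
        return value;
    }

    boolean isParsed() {
        return parsed;
    }
}
```

With this shape, a document whose section properties cannot be decoded still loads, and a caller that only wants the text never triggers the failing parse. The sketch is not thread-safe; concurrent callers would need synchronization.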