Bug 49933 - Word 6/95 documents with sections cause ArrayIndexOutOfBoundsException
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: HWPF
Version: 3.7-dev
Hardware: PC Linux
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-09-15 08:33 UTC by Adam
Modified: 2011-07-09 15:38 UTC

Attachments
Word 95 document with a section (6.50 KB, application/msword)
2010-09-15 08:33 UTC, Adam
Documents that throw an ArrayIndexOutOfBoundsException (58.19 KB, application/x-gzip)
2010-09-17 15:49 UTC, ssmeets
word95 doc (32.50 KB, application/msword)
2010-09-27 09:13 UTC, Maxim Valyanskiy

Description Adam 2010-09-15 08:33:44 UTC
Created attachment 26027
Word 95 document with a section

Processing a Word 6/Word 95 document with sections causes an ArrayIndexOutOfBoundsException. Running Tika (Revision: 997224, 2010-09-14) with the POI 3.7-beta2 dependency on the attached document produces:

Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@1e7c5cb
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:165)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:146)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:197)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:71)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 22
        at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:46)
        at org.apache.poi.hwpf.sprm.SprmOperation.<init>(SprmOperation.java:54)
        at org.apache.poi.hwpf.sprm.SprmIterator.next(SprmIterator.java:45)
        at org.apache.poi.hwpf.sprm.SectionSprmUncompressor.uncompressSEP(SectionSprmUncompressor.java:36)
        at org.apache.poi.hwpf.model.SEPX.<init>(SEPX.java:33)
        at org.apache.poi.hwpf.model.OldSectionTable.<init>(OldSectionTable.java:61)
        at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:103)
        at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:42)
        at org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:150)
        at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:51)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:187)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:163)
Comment 1 Nick Burch 2010-09-17 09:47:23 UTC
That turned out to be slightly trickier than expected, as there were issues with both the Sprm decoding and the byte/character translation on the old section table.

Fixed in r998131. The fix also seems to have improved some problem Word 97 files too, so it's not all bad!
Comment 2 ssmeets 2010-09-17 15:48:11 UTC
Hi Nick,

Thanks for your fix. It fixes several documents, but some documents still produce ArrayIndexOutOfBoundsExceptions. I have attached the files that cause the exceptions to be thrown. Unfortunately my knowledge of old Word docs is limited, otherwise I could have helped.

Stacktraces:
Processing: Case1.doc
java.lang.ArrayIndexOutOfBoundsException: 240
	at org.apache.poi.hwpf.sprm.SprmOperation.getOperand(SprmOperation.java:94)
	at org.apache.poi.hwpf.sprm.SectionSprmUncompressor.unCompressSEPOperation(SectionSprmUncompressor.java:57)
	at org.apache.poi.hwpf.sprm.SectionSprmUncompressor.uncompressSEP(SectionSprmUncompressor.java:37)
	at org.apache.poi.hwpf.model.SEPX.<init>(SEPX.java:33)
	at org.apache.poi.hwpf.model.OldSectionTable.<init>(OldSectionTable.java:61)
	at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:103)
	at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:42)
	at com.ravn.test.poi.OldMSDocTester.parse(OldMSDocTester.java:27)
	at com.ravn.test.Tester.main(Tester.java:29)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:115)
Processing: Case2.doc
java.lang.ArrayIndexOutOfBoundsException: 244
Processing: Case3.doc
java.lang.ArrayIndexOutOfBoundsException: 32
Processing: Case4.doc
java.lang.ArrayIndexOutOfBoundsException: 26
Processing: Case5.doc
java.lang.ArrayIndexOutOfBoundsException: 238
Processing: Case6.doc
java.lang.ArrayIndexOutOfBoundsException: 247
Comment 3 ssmeets 2010-09-17 15:49:48 UTC
Created attachment 26046
Documents that throw an ArrayIndexOutOfBoundsException
Comment 4 Nick Burch 2010-09-19 06:00:25 UTC
I've added a slightly icky fix of adding a couple of spare 0 bytes on the end of the array, so that we should always be able to decode the SEPX without error, even if not always making sense of the contents fully...

I can now process all 6 of your files without error
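
For illustration only, the padding idea described above can be sketched as follows. This is not the code committed to POI: the class and helper names (`SepxPaddingSketch`, `padWithZeros`) are hypothetical, and `getShort` is a local stand-in for a two-byte little-endian read, not POI's `LittleEndian` class. It assumes the failure comes from a two-byte read that starts on the last byte of the buffer.

```java
import java.util.Arrays;

// Hypothetical sketch; none of these names are taken from POI.
final class SepxPaddingSketch {

    // Local stand-in for a two-byte little-endian read.
    // Throws ArrayIndexOutOfBoundsException if offset + 1 is past the end.
    static short getShort(byte[] data, int offset) {
        int b0 = data[offset] & 0xFF;
        int b1 = data[offset + 1] & 0xFF;
        return (short) ((b1 << 8) | b0);
    }

    // Copies buf and appends `spare` zero bytes, so a read that starts
    // near the end of the original data no longer runs off the array.
    static byte[] padWithZeros(byte[] buf, int spare) {
        return Arrays.copyOf(buf, buf.length + spare);
    }

    public static void main(String[] args) {
        byte[] sepx = {0x01, 0x02, 0x03}; // odd length: last short is truncated

        try {
            getShort(sepx, 2);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("unpadded read failed");
        }

        byte[] padded = padWithZeros(sepx, 2);
        // The missing high byte is now read as 0.
        System.out.println(getShort(padded, 2));
    }
}
```

The trade-off matches the comment above: decoding no longer throws, but values read from the padding are zeros and may not reflect what the document intended.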
Comment 5 Maxim Valyanskiy 2010-09-27 09:12:25 UTC
The last fix broke another Word 95 file:

java.lang.ArrayIndexOutOfBoundsException: 34
	at org.apache.poi.hwpf.sprm.SprmOperation.getOperand(SprmOperation.java:94)
	at org.apache.poi.hwpf.sprm.SectionSprmUncompressor.unCompressSEPOperation(SectionSprmUncompressor.java:57)
	at org.apache.poi.hwpf.sprm.SectionSprmUncompressor.uncompressSEP(SectionSprmUncompressor.java:37)
	at org.apache.poi.hwpf.model.SEPX.<init>(SEPX.java:33)
	at org.apache.poi.hwpf.model.OldSectionTable.<init>(OldSectionTable.java:66)
	at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:103)
	at org.apache.poi.hwpf.extractor.Word6Extractor.<init>(Word6Extractor.java:58)
	at org.apache.poi.hwpf.extractor.Word6Extractor.<init>(Word6Extractor.java:55)
	at org.apache.poi.hwpf.extractor.Word6Extractor.<init>(Word6Extractor.java:47)
	at org.apache.poi.hwpf.extractor.TestWordExtractor.testWord95err(TestWordExtractor.java:279)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at com.intellij.junit3.JUnit3IdeaTestRunner.doRun(JUnit3IdeaTestRunner.java:108)
	at com.intellij.junit3.JUnit3IdeaTestRunner.startRunnerWithArgs(JUnit3IdeaTestRunner.java:42)
	at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:192)
	at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:64)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:115)
Comment 6 Maxim Valyanskiy 2010-09-27 09:13:27 UTC
Created attachment 26083
word95 doc
Comment 7 Sergey Vladimirov 2011-07-09 15:38:33 UTC
A workaround for this bug has been implemented in trunk: section properties are no longer parsed immediately on loading. Text is extracted (but encoding is not, sorry).
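
As a rough illustration of deferring the parse (not the actual trunk change), the sketch below stores the raw section bytes at load time and decodes them only on first access. All names here (`LazySectionSketch`, `strictPairs`, `getProperties`) are hypothetical, and the toy decoder merely stands in for SPRM decoding.

```java
import java.util.function.Function;

// Hypothetical sketch; none of these names are taken from POI.
final class LazySectionSketch {
    private final byte[] rawSepx;                  // undecoded section bytes
    private final Function<byte[], int[]> decoder; // stand-in for SPRM decoding
    private int[] properties;                      // null until first successful decode

    LazySectionSketch(byte[] rawSepx, Function<byte[], int[]> decoder) {
        // Loading only stores the bytes; nothing is decoded here.
        this.rawSepx = rawSepx.clone();
        this.decoder = decoder;
    }

    // Decodes on first access and caches the result.
    int[] getProperties() {
        if (properties == null) {
            properties = decoder.apply(rawSepx);
        }
        return properties;
    }

    // Toy decoder: reads little-endian 16-bit values and throws
    // ArrayIndexOutOfBoundsException on a truncated trailing pair.
    static int[] strictPairs(byte[] data) {
        int[] out = new int[(data.length + 1) / 2];
        for (int i = 0, j = 0; i < data.length; i += 2, j++) {
            out[j] = (data[i] & 0xFF) | ((data[i + 1] & 0xFF) << 8);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] truncated = {0x01, 0x00, 0x02};
        LazySectionSketch section =
                new LazySectionSketch(truncated, LazySectionSketch::strictPairs);
        System.out.println("loaded without decoding");
        try {
            section.getProperties();
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("decoding failed only on access");
        }
    }
}
```

With this shape, a caller that only needs the text never triggers the decoder, so a malformed section no longer prevents the document from loading.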

A "real" fix should include a new Word95 SPRM parser (which is different from the Word97-or-later SPRM parser).