49933 – Word 6/95 documents with sections cause ArrayIndexOutOfBoundsException

Status:	RESOLVED FIXED

Alias:	None

Product:	POI
Classification:	Unclassified
Component:	HWPF
Version:	3.7-dev
Hardware:	PC Linux

Importance:	P2 normal
Target Milestone:	---
Assignee:	POI Developers List

Reported:	2010-09-15 08:33 UTC by Adam
Modified:	2011-07-09 15:38 UTC
CC List:	1 user

Attachments
Word 95 document with a section (6.50 KB, application/msword) 2010-09-15 08:33 UTC, Adam
Documents that throw an ArrayIndexOutOfBoundsException (58.19 KB, application/x-gzip) 2010-09-17 15:49 UTC, ssmeets
word95 doc (32.50 KB, application/msword) 2010-09-27 09:13 UTC, Maxim Valyanskiy

Description Adam 2010-09-15 08:33:44 UTC

Created attachment 26027
Word 95 document with a section

Processing a Word 6/Word 95 document that contains sections causes an ArrayIndexOutOfBoundsException. Running Tika (Revision: 997224, 2010-09-14) with the POI 3.7-beta2 dependency on the attached document produces:

Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@1e7c5cb
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:165)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:146)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:197)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:71)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 22
        at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:46)
        at org.apache.poi.hwpf.sprm.SprmOperation.<init>(SprmOperation.java:54)
        at org.apache.poi.hwpf.sprm.SprmIterator.next(SprmIterator.java:45)
        at org.apache.poi.hwpf.sprm.SectionSprmUncompressor.uncompressSEP(SectionSprmUncompressor.java:36)
        at org.apache.poi.hwpf.model.SEPX.<init>(SEPX.java:33)
        at org.apache.poi.hwpf.model.OldSectionTable.<init>(OldSectionTable.java:61)
        at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:103)
        at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:42)
        at org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:150)
        at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:51)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:187)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:163)

Comment 1 Nick Burch 2010-09-17 09:47:23 UTC

That turned out to be slightly trickier than expected, as there were issues with both the Sprm decoding and the byte/character translation on the old section table.

Fixed in r998131. The fix also seems to have improved some problem Word 97 files too, so it's not all bad!

Comment 2 ssmeets 2010-09-17 15:48:11 UTC

Hi Nick,

Thanks for your fix. It fixes several documents, but some documents still produce ArrayIndexOutOfBoundsExceptions. I've attached the files that cause the exceptions. Unfortunately my knowledge of old Word docs is limited, otherwise I could have helped.

Stacktraces:
Processing: Case1.doc
java.lang.ArrayIndexOutOfBoundsException: 240
	at org.apache.poi.hwpf.sprm.SprmOperation.getOperand(SprmOperation.java:94)
	at org.apache.poi.hwpf.sprm.SectionSprmUncompressor.unCompressSEPOperation(SectionSprmUncompressor.java:57)
	at org.apache.poi.hwpf.sprm.SectionSprmUncompressor.uncompressSEP(SectionSprmUncompressor.java:37)
	at org.apache.poi.hwpf.model.SEPX.<init>(SEPX.java:33)
	at org.apache.poi.hwpf.model.OldSectionTable.<init>(OldSectionTable.java:61)
	at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:103)
	at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:42)
	at com.ravn.test.poi.OldMSDocTester.parse(OldMSDocTester.java:27)
	at com.ravn.test.Tester.main(Tester.java:29)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:115)
Processing: Case2.doc
java.lang.ArrayIndexOutOfBoundsException: 244
Processing: Case3.doc
java.lang.ArrayIndexOutOfBoundsException: 32
Processing: Case4.doc
java.lang.ArrayIndexOutOfBoundsException: 26
Processing: Case5.doc
java.lang.ArrayIndexOutOfBoundsException: 238
Processing: Case6.doc
java.lang.ArrayIndexOutOfBoundsException: 247

Comment 3 ssmeets 2010-09-17 15:49:48 UTC

Created attachment 26046
Documents that throw an ArrayIndexOutOfBoundsException

Comment 4 Nick Burch 2010-09-19 06:00:25 UTC

I've added a slightly icky fix: a couple of spare 0 bytes are appended to the end of the array, so that we should always be able to decode the SEPX without error, even if we can't always make full sense of the contents.

I can now process all 6 of your files without error.
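
For readers following along, here is a minimal sketch of the padding idea described above. It is not the actual POI change: the class name `SepxPadding` and both helper methods are hypothetical, and the sample bytes are made up. It only illustrates why a couple of trailing zero bytes stop a little-endian 16-bit read from running off the end of a short buffer.

```java
import java.util.Arrays;

// Hypothetical illustration only; none of these names come from POI.
public class SepxPadding {

    // Returns a copy of the raw bytes with `spare` zero bytes appended.
    // Arrays.copyOf fills the extra positions with 0.
    public static byte[] padWithZeros(byte[] raw, int spare) {
        return Arrays.copyOf(raw, raw.length + spare);
    }

    // Reads a little-endian 16-bit value starting at `offset`.
    // Throws ArrayIndexOutOfBoundsException if offset + 1 is past the end.
    public static short getShortLE(byte[] data, int offset) {
        int lo = data[offset] & 0xFF;
        int hi = data[offset + 1] & 0xFF;
        return (short) ((hi << 8) | lo);
    }

    public static void main(String[] args) {
        byte[] truncated = {0x12, 0x34, 0x56}; // made-up, odd-length buffer

        try {
            getShortLE(truncated, 2); // needs index 3, which does not exist
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("unpadded read failed: " + e.getMessage());
        }

        // With two spare zero bytes the same read succeeds; the missing
        // high byte is simply treated as 0.
        byte[] padded = padWithZeros(truncated, 2);
        System.out.println("padded read: " + getShortLE(padded, 2));
    }
}
```

The trade-off is the one noted in the comment: the read no longer throws, but a value decoded partly from padding may not be meaningful.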

Comment 5 Maxim Valyanskiy 2010-09-27 09:12:25 UTC

The last fix broke another Word 95 file:

java.lang.ArrayIndexOutOfBoundsException: 34
	at org.apache.poi.hwpf.sprm.SprmOperation.getOperand(SprmOperation.java:94)
	at org.apache.poi.hwpf.sprm.SectionSprmUncompressor.unCompressSEPOperation(SectionSprmUncompressor.java:57)
	at org.apache.poi.hwpf.sprm.SectionSprmUncompressor.uncompressSEP(SectionSprmUncompressor.java:37)
	at org.apache.poi.hwpf.model.SEPX.<init>(SEPX.java:33)
	at org.apache.poi.hwpf.model.OldSectionTable.<init>(OldSectionTable.java:66)
	at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:103)
	at org.apache.poi.hwpf.extractor.Word6Extractor.<init>(Word6Extractor.java:58)
	at org.apache.poi.hwpf.extractor.Word6Extractor.<init>(Word6Extractor.java:55)
	at org.apache.poi.hwpf.extractor.Word6Extractor.<init>(Word6Extractor.java:47)
	at org.apache.poi.hwpf.extractor.TestWordExtractor.testWord95err(TestWordExtractor.java:279)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at com.intellij.junit3.JUnit3IdeaTestRunner.doRun(JUnit3IdeaTestRunner.java:108)
	at com.intellij.junit3.JUnit3IdeaTestRunner.startRunnerWithArgs(JUnit3IdeaTestRunner.java:42)
	at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:192)
	at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:64)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:115)

Comment 6 Maxim Valyanskiy 2010-09-27 09:13:27 UTC

Created attachment 26083
word95 doc

Comment 7 Sergey Vladimirov 2011-07-09 15:38:33 UTC

A workaround for this bug is implemented in trunk: section properties are no longer parsed immediately on loading. Text is extracted (but encoding is not, sorry).
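
The deferred-parsing idea can be sketched roughly as follows. This is an assumption-laden illustration, not the code committed to trunk: `LazySection` and its parser callback are hypothetical names. The point is only that constructing the holder never runs the parser, so a malformed section cannot break document loading or text extraction.

```java
import java.util.function.Function;

// Hypothetical illustration only; not a POI class.
public class LazySection<T> {
    private final byte[] raw;
    private final Function<byte[], T> parser;
    private T parsed;
    private boolean done;

    // Stores the raw bytes and the parser; nothing is parsed here.
    public LazySection(byte[] raw, Function<byte[], T> parser) {
        this.raw = raw.clone();
        this.parser = parser;
    }

    // Raw bytes stay available even if they never parse cleanly.
    public byte[] getRaw() {
        return raw.clone();
    }

    // Parses on first access and caches the result. Any exception from
    // the parser surfaces here, not in the constructor.
    public synchronized T getProperties() {
        if (!done) {
            parsed = parser.apply(raw);
            done = true;
        }
        return parsed;
    }

    public static void main(String[] args) {
        // A stand-in parser that fails the way the stack traces above do.
        LazySection<Integer> section = new LazySection<>(
                new byte[] {1, 2},
                bytes -> { throw new ArrayIndexOutOfBoundsException(34); });
        System.out.println("loaded without parsing; raw length = "
                + section.getRaw().length);
    }
}
```

Callers that only need the text never touch `getProperties()`, so they never see the failure.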

A "real" fix should include a new Word 95 SPRM parser (which is different from the Word 97-or-later SPRM parser).
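
As background on why the two formats need different parsers, the sketch below decodes a Word 97-style SPRM opcode. It assumes the layout in the published [MS-DOC] specification as I understand it: a 16-bit opcode whose top three bits (`spra`) give the operand size. Word 6/95 SPRMs are not encoded this way, so reading their bytes with this layout can yield operand sizes that run past the end of the buffer. The class `Sprm97Opcode` is a hypothetical name, not POI's `SprmOperation`.

```java
// Hypothetical illustration of the Word 97+ SPRM opcode layout,
// assuming the bit fields described in [MS-DOC]; not POI code.
public class Sprm97Opcode {

    // Operand size in bytes indexed by spra (0..7); -1 means the size
    // is variable and stored in the operand itself.
    private static final int[] OPERAND_SIZE = {1, 1, 2, 4, 2, 2, -1, 3};

    // Bits 0-8: property identifier within its group.
    public static int ispmd(int opcode) {
        return opcode & 0x01FF;
    }

    // Bits 10-12: property group (4 is the section group).
    public static int sgc(int opcode) {
        return (opcode >> 10) & 0x07;
    }

    // Bits 13-15: operand size code.
    public static int spra(int opcode) {
        return (opcode >> 13) & 0x07;
    }

    public static int operandSize(int opcode) {
        return OPERAND_SIZE[spra(opcode)];
    }

    public static void main(String[] args) {
        int sprmSBkc = 0x3009; // section break type, a section-group SPRM
        System.out.println("sgc=" + sgc(sprmSBkc)
                + " operandSize=" + operandSize(sprmSBkc));
    }
}
```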