Bug 51944 - [PATCH] PAPFormattedDiskPage.getPAPX - IndexOutOfBounds
Summary: [PATCH] PAPFormattedDiskPage.getPAPX - IndexOutOfBounds
Alias: None
Product: POI
Classification: Unclassified
Component: HWPF
Version: 3.8-dev
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Reported: 2011-10-03 23:27 UTC by Jeremy
Modified: 2011-10-30 00:06 UTC

Patch of two files (1.59 KB, application/octet-stream)
2011-10-03 23:27 UTC, Jeremy
Test File with Range Issue (27.00 KB, application/msword)
2011-10-25 20:28 UTC, Jeremy

Description Jeremy 2011-10-03 23:27:29 UTC
Created attachment 27681
Patch of two files

A handful of Word 97-2003 documents (possibly even earlier versions) produce an IndexOutOfBoundsException originating in OldPAPBinTable.<init>. These documents may also include additional encodings from the Hong Kong region. (I am unable to include a sample because the documents are sensitive.)

Essentially, the parser thinks there are more elements in the array than are actually present. The included patch prevents the error by adding a public method to PAPFormattedDiskPage that returns the actual size of the _papxList property. The pfkp.size() value currently used in OldPAPBinTable.<init> does not always seem to be accurate.
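
The guard described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the attached patch and not the real POI classes: PapxPageSketch, actualSize() and collect() are hypothetical names, and strings stand in for PAPX entries. The idea is only that the loop is bounded by the number of entries actually parsed rather than by the declared count alone.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a formatted disk page; NOT the real POI class.
class PapxPageSketch {
    // Entry count as declared by the page data (may overstate the real count).
    private final int declaredCount;
    // Entries that were actually parsed from the page.
    private final List<String> papxList;

    PapxPageSketch(int declaredCount, List<String> parsed) {
        this.declaredCount = declaredCount;
        this.papxList = new ArrayList<>(parsed);
    }

    // Declared size, analogous to the size() call the report says is unreliable.
    int size() {
        return declaredCount;
    }

    // Hypothetical accessor: the number of entries really held in the list.
    int actualSize() {
        return papxList.size();
    }

    // Unchecked access; throws IndexOutOfBoundsException past the end of the list.
    String getPapx(int index) {
        return papxList.get(index);
    }

    // Collects entries, bounding the loop by the smaller of the two counts.
    static List<String> collect(PapxPageSketch page) {
        List<String> out = new ArrayList<>();
        int limit = Math.min(page.size(), page.actualSize());
        for (int i = 0; i < limit; i++) {
            out.add(page.getPapx(i));
        }
        return out;
    }
}
```

With a page that declares 37 entries but holds only 36, collect() returns the 36 available entries, whereas calling getPapx(36) directly fails in the same way as the stack trace below.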

Stack Trace (Daily Build)
Caused by: java.lang.IndexOutOfBoundsException: Index: 36, Size: 36
	at java.util.ArrayList.RangeCheck(ArrayList.java:547)
	at java.util.ArrayList.get(ArrayList.java:322)
	at org.apache.poi.hwpf.model.PAPFormattedDiskPage.getPAPX(PAPFormattedDiskPage.java:145)
	at org.apache.poi.hwpf.model.OldPAPBinTable.<init>(OldPAPBinTable.java:58)
	at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:108)
	at org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:410)
	at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:69)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:200)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	... 60 more

The supplied patch allows these documents to be extracted; however, there do appear to be additional underlying issues at play. (I've only tested in more depth on 2 of the 12 failing documents.) One document ended up truncated mid-file and had random newlines and carriage returns inserted. The other had some paragraphs repeated and extra newlines added, although all of the document's text was extracted.

Because of this, I'm not sure whether you'd want to go forward with the patch, though I'd think getting as much text out as possible would be preferable to no text at all.

Thanks in advance,

Comment 1 Jeremy 2011-10-25 20:28:46 UTC
Created attachment 27847
Test File with Range Issue

Attached is a test file that demonstrates the issue being seen. This file actually seems to have its text extracted properly. I'll try to hex edit out the other, longer file that gets truncated.

Comment 2 Sergey Vladimirov 2011-10-30 00:06:03 UTC
Jeremy, thanks for the patch, but this kind of bug is already fixed for [new]PAPBinTable, so I just copied the solution from it.

This bug is fixed both for OldPAPBinTable and OldCHPBinTable.

Thanks for the report!
Comment 3 Sergey Vladimirov 2011-10-30 00:06:48 UTC
Fixed in revision 1195079