Bug 51944

Summary: [PATCH] PAPFormattedDiskPage.getPAPX - IndexOutOfBounds
Product: POI
Reporter: Jeremy <rpi_alum>
Component: HWPF
Assignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: normal    
Priority: P2    
Version: 3.8-dev   
Target Milestone: ---   
Hardware: PC   
OS: All   
Attachments: Patch of two files
Test File with Range Issue

Description Jeremy 2011-10-03 23:27:29 UTC
Created attachment 27681
Patch of two files

A handful of Word 97-2003 documents (possibly even earlier formats) produce an IndexOutOfBoundsException originating in OldPAPBinTable.<init>. These documents may also include additional encodings from the Hong Kong region. (I am unable to include a sample due to the sensitive nature of the documents.)

Essentially, the parser thinks there are more elements in the array than are actually present.  The included patch prevents the error by adding a public method to PAPFormattedDiskPage that returns the actual size of the _papxList property.  The pfkp.size() value originally used in OldPAPBinTable.<init> does not always seem to be accurate.
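
To illustrate the kind of guard described above, here is a minimal sketch. It is not the attached patch and not POI code: the Page class, its parsedSize() method, and the collect() helper are hypothetical stand-ins, and it assumes the header count can exceed the number of entries actually parsed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PapxBoundsSketch {

    /** Hypothetical stand-in for a formatted disk page; NOT the POI class. */
    static final class Page {
        private final int declaredCount;     // count read from the page header
        private final List<String> entries;  // entries that were actually parsed

        Page(int declaredCount, List<String> entries) {
            this.declaredCount = declaredCount;
            this.entries = new ArrayList<>(entries);
        }

        int size() { return declaredCount; }         // may overstate the real count
        int parsedSize() { return entries.size(); }  // actual number of entries
        String get(int index) { return entries.get(index); }
    }

    /** Copies entries without reading past what was actually parsed. */
    static List<String> collect(Page page) {
        // Trust neither count alone: stop at whichever is smaller.
        int limit = Math.min(page.size(), page.parsedSize());
        List<String> out = new ArrayList<>(limit);
        for (int i = 0; i < limit; i++) {
            out.add(page.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        // The header claims 3 entries, but only 2 were parsed.
        Page page = new Page(3, Arrays.asList("papx0", "papx1"));
        System.out.println(collect(page));
    }
}
```

Looping to page.size() alone in this sketch would throw an IndexOutOfBoundsException on the third read, which is the same failure shape as in the stack trace below.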

Stack Trace (Daily Build)
----------------------------------------------------------------
Caused by: java.lang.IndexOutOfBoundsException: Index: 36, Size: 36
	at java.util.ArrayList.RangeCheck(ArrayList.java:547)
	at java.util.ArrayList.get(ArrayList.java:322)
	at org.apache.poi.hwpf.model.PAPFormattedDiskPage.getPAPX(PAPFormattedDiskPage.java:145)
	at org.apache.poi.hwpf.model.OldPAPBinTable.<init>(OldPAPBinTable.java:58)
	at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:108)
	at org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:410)
	at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:69)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:200)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	... 60 more


The supplied patch allows these documents to be extracted; however, there do appear to be additional underlying issues at play.  (I've only tested in more depth on 2 of the 12 failing documents.)  One document ended up truncated mid-file and had random newlines and carriage returns inserted.  The other appeared to have some paragraphs repeated and extra newlines added, although all of the document's text was extracted.

Due to this, I'm not sure whether you'd want to go forward with the patch, though I'd think getting as much text out as possible would be preferable to getting no text at all.

Thanks in advance,

Jeremy
Comment 1 Jeremy 2011-10-25 20:28:46 UTC
Created attachment 27847
Test File with Range Issue

Attached is a test file that will demonstrate the issue being seen.  This file actually seems to have its text extracted properly.  I'll try to hex edit out the other longer file that gets truncated.
Comment 2 Sergey Vladimirov 2011-10-30 00:06:03 UTC
Jeremy, thanks for the patch, but this kind of bug was already fixed for [new]PAPBinTable, so I just copied the solution from it.

This bug is now fixed for both OldPAPBinTable and OldCHPBinTable.

Thanks for the report!
Comment 3 Sergey Vladimirov 2011-10-30 00:06:48 UTC
Fixed in revision 1195079