Bug 53379 - IndexOutOfBoundsException on MS word 2007 doc
Summary: IndexOutOfBoundsException on MS word 2007 doc
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: HDF
Version: unspecified
Hardware: Macintosh All
Importance: P2 major
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-06-07 10:37 UTC by Tim Barrett
Modified: 2015-03-22 20:42 UTC
CC List: 0 users



Attachments
offending word document (242.50 KB, application/msword)
2012-06-07 10:37 UTC, Tim Barrett

Description Tim Barrett 2012-06-07 10:37:50 UTC
Created attachment 28900
offending word document

Error (stack trace below) when parsing a Word document in the 'old' .doc format. When the same document is saved in .docx format, the error no longer occurs.
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@4c5cc942
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:133)
	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:400)
	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:101)
Caused by: java.lang.IndexOutOfBoundsException: Index: 151, Size: 79
	at java.util.ArrayList.RangeCheck(ArrayList.java:547)
	at java.util.ArrayList.get(ArrayList.java:322)
	at org.apache.poi.hwpf.model.ListTables.getOverride(ListTables.java:196)
	at org.apache.poi.hwpf.usermodel.Paragraph.newParagraph(Paragraph.java:108)
	at org.apache.poi.hwpf.usermodel.Range.getParagraph(Range.java:890)
	at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:96)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:185)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	... 5 more
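
The root cause in the trace is an unchecked `ArrayList.get` with index 151 on a list of 79 entries, reached through `ListTables.getOverride`. The following is a minimal, self-contained sketch of that failure mode and of one defensive alternative. It does not use POI and is not the fix that POI applied; the class `OverrideLookupSketch`, the method `getOrNull`, and the placeholder list contents are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: not POI code and not POI's actual fix.
public class OverrideLookupSketch {

    // Bounds-checked lookup: returns null instead of throwing when the
    // index does not refer to an existing element.
    static <T> T getOrNull(List<T> items, int index) {
        if (index < 0 || index >= items.size()) {
            return null;
        }
        return items.get(index);
    }

    public static void main(String[] args) {
        // Stand-in for a table with 79 entries, as in "Size: 79".
        List<String> overrides = new ArrayList<>();
        for (int i = 0; i < 79; i++) {
            overrides.add("override-" + i);
        }

        // Unchecked access with the index from the trace ("Index: 151").
        try {
            overrides.get(151);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("unchecked lookup failed: " + e.getMessage());
        }

        // Checked access lets the caller decide how to handle a bad index.
        System.out.println("checked lookup: " + getOrNull(overrides, 151));
    }
}
```

Whether a missing entry should be skipped, logged, or reported as a corrupt document is a design decision for the caller; the sketch only shows that the bad index can be detected before the list is accessed.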
Comment 1 Dominik Stadler 2015-03-22 20:42:40 UTC
I verified that text and properties can be extracted successfully from this document with current POI (3.12-beta1), so I am resolving this as fixed.