Bug 60374 - Extracting text from some older Word documents fails with ArrayIndexOutOfBoundsException due to unicode/non-unicode mismatch
Status: RESOLVED DUPLICATE of bug 50955
Alias: None
Product: POI
Classification: Unclassified
Component: HWPF
Version: 3.16-dev
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
Reported: 2016-11-15 08:58 UTC by Dominik Stadler
Modified: 2017-06-09 12:52 UTC



Attachments
Sample file (22.50 KB, application/msword)
2016-11-15 08:58 UTC, Dominik Stadler

Description Dominik Stadler 2016-11-15 08:58:14 UTC
Created attachment 34447
Sample file

The regression testing at http://people.apache.org/~centic/poi_regression/reportsAll/ shows the following for some files.

It seems the text pieces in these files are stored as non-Unicode, but the class PieceDescriptor sets unicode = true. If I manually set unicode = false there, text extraction works for these documents as well.
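
To illustrate why such a mismatch can end in an out-of-bounds error, here is a minimal, self-contained sketch. It is not the POI implementation: the class PieceEncodingSketch and its helpers byteLength and readPiece are hypothetical, and the sample bytes and the ISO-8859-1 charset are assumptions chosen for the demonstration. The idea is that a piece flagged as Unicode is assumed to occupy two bytes per character, so a piece that is really stored as one byte per character makes the reader ask for twice as many bytes as the stream holds.

```java
import java.nio.charset.StandardCharsets;

public class PieceEncodingSketch {

    // Hypothetical helper: bytes occupied by a piece of charCount characters.
    // Assumption: 2 bytes per character when flagged Unicode, otherwise 1.
    public static int byteLength(int charCount, boolean unicode) {
        return unicode ? charCount * 2 : charCount;
    }

    // Hypothetical helper: copy one piece out of the stream and decode it.
    public static String readPiece(byte[] stream, int start, int charCount, boolean unicode) {
        int length = byteLength(charCount, unicode);
        byte[] buf = new byte[length];
        // System.arraycopy rejects the copy with an IndexOutOfBoundsException
        // when start + length runs past the end of the stream.
        System.arraycopy(stream, start, buf, 0, length);
        return new String(buf, unicode ? StandardCharsets.UTF_16LE : StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // 11 characters stored as 11 single-byte values (the non-Unicode case)
        byte[] stream = "Hello, HWPF".getBytes(StandardCharsets.ISO_8859_1);

        // Correct flag: the piece decodes normally
        System.out.println(readPiece(stream, 0, 11, false));

        // Wrong flag: 22 bytes are requested from an 11-byte stream
        try {
            readPiece(stream, 0, 11, true);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("unicode = true overruns the stream");
        }
    }
}
```

In this sketch the failure depends only on the flag, not on the text itself, which is consistent with the observation that forcing unicode = false makes extraction succeed.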


    public void testException() throws IOException, OpenXML4JException, XmlException {
        // try-with-resources closes the extractor even if getText() throws
        try (POITextExtractor extractor = ExtractorFactory.createExtractor(
                POIDataSamples.getDocumentInstance().openResourceAsStream("cn.orthodox.www_divenbog_APRIL_30-APRIL.DOC"))) {
            // Check that text extraction completes without error
            System.out.println(extractor.getText());
        }
    }



java.lang.IllegalArgumentException: Error creating Scratchpad Extractor
	at o.a.p.extractor.OLE2ExtractorFactory.createExtractor(OLE2ExtractorFactory.java:197)
	at o.a.p.extractor.OLE2ExtractorFactory.createExtractor(OLE2ExtractorFactory.java:119)
	at o.a.p.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:276)
	at o.a.p.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:129)
	at o.a.p.stress.AbstractFileHandler.handleExtractingInternal(AbstractFileHandler.java:81)
	at o.a.p.stress.AbstractFileHandler.handleExtracting(AbstractFileHandler.java:60)
	at org.dstadler.commoncrawl.FileHandlingRunnable.run(FileHandlingRunnable.java:62)

Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.GeneratedMethodAccessor4560.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at o.a.p.extractor.OLE2ExtractorFactory.createExtractor(OLE2ExtractorFactory.java:192)
	... 12 more
Caused by: java.lang.ArrayIndexOutOfBoundsException
	at o.a.p.hwpf.model.TextPieceTable.<init>(TextPieceTable.java:109)
	at o.a.p.hwpf.model.ComplexFileTable.<init>(ComplexFileTable.java:70)
	at o.a.p.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:68)
	at o.a.p.hwpf.extractor.Word6Extractor.<init>(Word6Extractor.java:74)
	at o.a.p.extractor.OLE2ScratchpadExtractorFactory.createExtractor(OLE2ScratchpadExtractorFactory.java:62)
	... 16 more
Comment 1 Dominik Stadler 2017-06-09 12:52:24 UTC
This was likely fixed via bug 50955; the file now works fine. Added a unit test via r1798200.

*** This bug has been marked as a duplicate of bug 50955 ***