Bug 60374

Summary: Extracting text from some older Word documents fails with ArrayIndexOutOfBoundsException due to unicode/non-unicode mismatch
Product: POI
Reporter: Dominik Stadler <dominik.stadler>
Component: HWPF
Assignee: POI Developers List <dev>
Status: RESOLVED DUPLICATE    
Severity: normal    
Priority: P2    
Version: 3.16-dev   
Target Milestone: ---   
Hardware: PC   
OS: All   
Attachments: Sample file

Description Dominik Stadler 2016-11-15 08:58:14 UTC
Created attachment 34447
Sample file

The regression testing at http://people.apache.org/~centic/poi_regression/reportsAll/ shows the following exception for some files.

It seems the text pieces in these files are stored as non-Unicode, but the class PieceDescriptor sets unicode = true. If I manually set unicode = false there, text extraction works for these documents as well.
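
To illustrate why a wrong unicode flag can end in an out-of-bounds error, here is a minimal, self-contained sketch. It is not POI's actual code: the class PieceDecodeSketch and its methods byteLength and decode are hypothetical, and ISO-8859-1 is only a stand-in for whatever 8-bit code page the document really uses. The assumption is that a piece of N characters occupies N bytes when stored as 8-bit text and 2*N bytes when stored as UTF-16LE, so a reader that wrongly assumes Unicode asks for twice as many bytes as the stream holds.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; names do not correspond to POI classes.
class PieceDecodeSketch {

    // Bytes occupied by a piece of charCount characters under each encoding.
    public static int byteLength(int charCount, boolean unicode) {
        return unicode ? charCount * 2 : charCount;
    }

    // Copies the piece's bytes out of the stream and decodes them.
    public static String decode(byte[] stream, int offset, int charCount, boolean unicode) {
        int end = offset + byteLength(charCount, unicode);
        if (end > stream.length) {
            // A reader without this check would run past the end of the buffer.
            throw new ArrayIndexOutOfBoundsException(
                    "piece needs bytes up to " + end + ", stream has " + stream.length);
        }
        byte[] raw = Arrays.copyOfRange(stream, offset, end);
        // ISO-8859-1 is an assumed stand-in for the document's 8-bit code page.
        Charset charset = unicode ? StandardCharsets.UTF_16LE : StandardCharsets.ISO_8859_1;
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Five characters stored as five bytes (non-Unicode).
        byte[] stream = "Hello".getBytes(StandardCharsets.ISO_8859_1);

        // Correct flag: the piece decodes cleanly.
        System.out.println(decode(stream, 0, 5, false));

        // Wrong flag: ten bytes are requested from a five-byte stream.
        try {
            decode(stream, 0, 5, true);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("unicode = true fails: " + e.getMessage());
        }
    }
}
```

With the flag set to false the five bytes decode as expected; with the flag set to true the same piece requests ten bytes and the bounds check fails, which mirrors the symptom described above.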


    public void testException() throws IOException, OpenXML4JException, XmlException {
        // Open the sample document; try-with-resources closes stream and extractor
        try (InputStream stream = POIDataSamples.getDocumentInstance()
                .openResourceAsStream("cn.orthodox.www_divenbog_APRIL_30-APRIL.DOC");
             POITextExtractor extractor = ExtractorFactory.createExtractor(stream)) {
            // Check it gives text without error
            System.out.println(extractor.getText());
        }
    }



java.lang.IllegalArgumentException: Error creating Scratchpad Extractor
	at o.a.p.extractor.OLE2ExtractorFactory.createExtractor(OLE2ExtractorFactory.java:197)
	at o.a.p.extractor.OLE2ExtractorFactory.createExtractor(OLE2ExtractorFactory.java:119)
	at o.a.p.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:276)
	at o.a.p.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:129)
	at o.a.p.stress.AbstractFileHandler.handleExtractingInternal(AbstractFileHandler.java:81)
	at o.a.p.stress.AbstractFileHandler.handleExtracting(AbstractFileHandler.java:60)
	at org.dstadler.commoncrawl.FileHandlingRunnable.run(FileHandlingRunnable.java:62)

Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.GeneratedMethodAccessor4560.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at o.a.p.extractor.OLE2ExtractorFactory.createExtractor(OLE2ExtractorFactory.java:192)
	... 12 more
Caused by: java.lang.ArrayIndexOutOfBoundsException
	at o.a.p.hwpf.model.TextPieceTable.<init>(TextPieceTable.java:109)
	at o.a.p.hwpf.model.ComplexFileTable.<init>(ComplexFileTable.java:70)
	at o.a.p.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:68)
	at o.a.p.hwpf.extractor.Word6Extractor.<init>(Word6Extractor.java:74)
	at o.a.p.extractor.OLE2ScratchpadExtractorFactory.createExtractor(OLE2ScratchpadExtractorFactory.java:62)
	... 16 more

Comment 1 Dominik Stadler 2017-06-09 12:52:24 UTC
This was likely fixed via bug 50955; the file now works fine. Added a unit test via r1798200.

*** This bug has been marked as a duplicate of bug 50955 ***