Created attachment 34447
Sample file

The regression testing at http://people.apache.org/~centic/poi_regression/reportsAll/ shows the following exception for some files.

It seems the text pieces in these files are stored as non-Unicode, but the class PieceDescriptor sets unicode = true. If I manually set unicode = false there, text extraction works for these documents as well.

public void testException() throws IOException, OpenXML4JException, XmlException {
    final POITextExtractor extractor = ExtractorFactory.createExtractor(
            POIDataSamples.getDocumentInstance().openResourceAsStream("cn.orthodox.www_divenbog_APRIL_30-APRIL.DOC"));

    // Check it gives text without error
    System.out.println(extractor.getText());
    extractor.close();
}

java.lang.IllegalArgumentException: Error creating Scratchpad Extractor
    at o.a.p.extractor.OLE2ExtractorFactory.createExtractor(OLE2ExtractorFactory.java:197)
    at o.a.p.extractor.OLE2ExtractorFactory.createExtractor(OLE2ExtractorFactory.java:119)
    at o.a.p.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:276)
    at o.a.p.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:129)
    at o.a.p.stress.AbstractFileHandler.handleExtractingInternal(AbstractFileHandler.java:81)
    at o.a.p.stress.AbstractFileHandler.handleExtracting(AbstractFileHandler.java:60)
    at org.dstadler.commoncrawl.FileHandlingRunnable.run(FileHandlingRunnable.java:62)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.GeneratedMethodAccessor4560.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at o.a.p.extractor.OLE2ExtractorFactory.createExtractor(OLE2ExtractorFactory.java:192)
    ... 12 more
Caused by: java.lang.ArrayIndexOutOfBoundsException
    at o.a.p.hwpf.model.TextPieceTable.<init>(TextPieceTable.java:109)
    at o.a.p.hwpf.model.ComplexFileTable.<init>(ComplexFileTable.java:70)
    at o.a.p.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:68)
    at o.a.p.hwpf.extractor.Word6Extractor.<init>(Word6Extractor.java:74)
    at o.a.p.extractor.OLE2ScratchpadExtractorFactory.createExtractor(OLE2ScratchpadExtractorFactory.java:62)
    ... 16 more
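To illustrate why a wrong unicode flag could plausibly end in an ArrayIndexOutOfBoundsException, here is a minimal standalone sketch. It does not use POI's code: the class PieceDecodingSketch and the helpers byteLength and decodePiece are hypothetical, and ISO-8859-1 only stands in for whatever single-byte codepage the document really uses. The assumption is that a Unicode piece occupies two bytes per character and a non-Unicode piece one byte per character, so reading a one-byte-per-character piece as Unicode asks for twice as many bytes as the stream holds.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; none of these names exist in POI.
class PieceDecodingSketch {

    // Number of bytes a piece occupies, assuming 2 bytes per character
    // for Unicode pieces and 1 byte per character otherwise.
    static int byteLength(int charCount, boolean unicode) {
        return unicode ? charCount * 2 : charCount;
    }

    // Copies the piece's bytes out of the stream and decodes them.
    // System.arraycopy throws an IndexOutOfBoundsException when
    // start + length runs past the end of the stream.
    static String decodePiece(byte[] stream, int start, int charCount,
                              boolean unicode, Charset codepage) {
        int length = byteLength(charCount, unicode);
        byte[] buf = new byte[length];
        System.arraycopy(stream, start, buf, 0, length);
        return new String(buf, unicode ? StandardCharsets.UTF_16LE : codepage);
    }

    public static void main(String[] args) {
        // Stand-in codepage (assumption); five characters stored as five bytes.
        Charset codepage = StandardCharsets.ISO_8859_1;
        byte[] stream = "Hello".getBytes(codepage);

        // Correct flag: reads exactly the five bytes that exist.
        System.out.println(decodePiece(stream, 0, 5, false, codepage));

        // Wrong flag: asks for ten bytes from a five-byte stream.
        try {
            decodePiece(stream, 0, 5, true, codepage);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("unicode = true overruns the stream");
        }
    }
}
```

Whether TextPieceTable fails in exactly this way for the attached file is not confirmed here; the sketch only shows that the observed exception is consistent with the flag being wrong.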
This was likely fixed via bug 50955; the file now works fine. I added a unit test in r1798200.

*** This bug has been marked as a duplicate of bug 50955 ***