1. When converting a bunch of Microsoft Word documents using the command

java -jar tika-app-1.1-SNAPSHOT.jar -v -t

I'm getting the following exception. Ditto with the Tika 1.1 release candidate.

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@5d3ac0
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:130)
	at org.apache.tika.cli.TikaCLI$TikaServer$1.run(TikaCLI.java:735)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 487
	at org.apache.poi.hwpf.sprm.SprmOperation.initSize(SprmOperation.java:174)
	at org.apache.poi.hwpf.sprm.SprmOperation.<init>(SprmOperation.java:80)
	at org.apache.poi.hwpf.sprm.SprmIterator.next(SprmIterator.java:48)
	at org.apache.poi.hwpf.sprm.ParagraphSprmUncompressor.uncompressPAP(ParagraphSprmUncompressor.java:67)
	at org.apache.poi.hwpf.usermodel.Paragraph.newParagraph(Paragraph.java:103)
	at org.apache.poi.hwpf.usermodel.Range.getTable(Range.java:943)
	at org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:146)
	at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:97)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:185)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	... 4 more

A user, Nick Burch, has advised me to raise this as a POI bug.

2. Here's the output of the BFF Validator tool:

<BFFValidation path="failing.doc" datetime="03/08/12 07:14:27" result="FAILED">
  <ParseStack>
    <Type builtinType="Docfile" docName="MS-DOC" sectionTitle="File Structure" msdnLink="http://msdn.microsoft.com/en-us/library/4eaddc8f-4abd-43bb-8fd4-aef9c6121737">
      <Info>Built-in type "Docfile": The root storage object of an OLE compound file. For more information, see http://msdn.microsoft.com/en-us/library/dd942138.aspx.</Info>
    </Type>
    <Type builtinType="Stream" docName="MS-DOC" sectionTitle="File Structure" msdnLink="http://msdn.microsoft.com/en-us/library/4eaddc8f-4abd-43bb-8fd4-aef9c6121737" streamName="WordDocument" streamOffset="0" hexStreamOffset="0x0">
      <Info>Built-in type "Stream": Any stream object for OLE compound files. The entire file contents for other files.</Info>
    </Type>
    <Type docName="MS-DOC" sectionTitle="Fib" sectionNumber="2.5.1" msdnLink="http://msdn.microsoft.com/en-us/library/9AEAA2E7-4A45-468E-AB13-3F6193EB9394" streamName="WordDocument" streamOffset="0" hexStreamOffset="0x0"/>
    <Type docName="MS-DOC" sectionTitle="FibBase" sectionNumber="2.5.2" msdnLink="http://msdn.microsoft.com/en-us/library/26FB6C06-4E5C-4778-AB4E-EDBF26A545BB" streamName="WordDocument" streamOffset="0" hexStreamOffset="0x0"/>
    <Type builtinType="USHORT" streamName="WordDocument" bitfield="True" bitOffsetWithinStruct="84" hexBitOffsetWithinStruct="0x54" bitCount="4" streamOffsetOfStruct="0" hexStreamOffsetOfStruct="0x0" streamOffset="10" hexStreamOffset="0xa" childId="10" hexChildId="0xa">
      <Info>Built-in type "USHORT": Unsigned 2-byte integer.</Info>
    </Type>
  </ParseStack>
  <LastData><![CDATA[
EC A5 01 01 4D 20 09 04 00 00 08 12 BF 00 00 00 ....M...........
00 00 00 30 00 00 00 00 00 08 00 00 66 EF 00 00 ...0........f...
]]></LastData>
</BFFValidation>

--------------------------------------------

I would greatly appreciate a timely fix, as I have 2000+ documents that POI/Tika are failing on. I cannot proceed any further.
Can you upload a failing document? Yegor
Unfortunately, this is a classified document.
(In reply to comment #2)
> Unfortunately, this is a classified document.

Do you know the origin of these failing docs? Were they created by MS Word, by OpenOffice, or by something else?

Without a sample file we can't do much.

Yegor
> Do you know the origin of these failing
> docs? Were they created by MS Word or
> by OpenOffice or by what ?

They were created by a post-2003 and pre-2007 version of MS Word.

> Without a sample file we can't do much.

The very name of the document is 'Business Intelligence', so you can imagine my difficulty. The other failing documents are sensitive as well.

I thought I would be able to remove the sensitive parts of this document and then upload it for the Tika/POI developers. But merely re-saving the document in Word 2007 (i.e., without any new edits whatsoever) makes the problem mostly go away. I say 'mostly' because, while Tika/POI are then able to extract the text, they also append text like this to the output:

_-1388201556/ole-[42, 4D, 0E, 0A, 00, 00, 00, 00]
_-1388203796/ole-[42, 4D, 2E, 0A, 00, 00, 00, 00]
_-1388843352/ole-[42, 4D, 2E, 0A, 00, 00, 00, 00]
_-1388845272/ole-[42, 4D, BA, 09, 00, 00, 00, 00]
_-1388297360/ole-[42, 4D, BA, 09, 00, 00, 00, 00]
_-1388297680/ole-[42, 4D, D6, 09, 00, 00, 00, 00]
_-1388296720/ole-[42, 4D, BA, 09, 00, 00, 00, 00]
_-1388203476/ole-[42, 4D, 66, 09, 00, 00, 00, 00]
_-1382869532/ole-[42, 4D, 36, 0C, 00, 00, 00, 00]
_-1388200596/ole-[42, 4D, 2E, 0A, 00, 00, 00, 00]
_-1388200916/ole-[42, 4D, BA, 09, 00, 00, 00, 00]
_-1383036196/ole-[42, 4D, 12, 09, 00, 00, 00, 00]
_-1382867932/ole-[42, 4D, 86, 0A, 00, 00, 00, 00]
_-1382868252/ole-[42, 4D, 2E, 0A, 00, 00, 00, 00]
_-1380808936/ole-[42, 4D, 2E, 0A, 00, 00, 00, 00]

Being a developer myself, I am fully aware of how hard it can be to fix (certain) bugs without appropriate test input. I will watch out for newer releases.
Created attachment 28554
The same problem with MS PowerPoint files

Hi *,

I have the same problem with tika-app-1.1.jar and MS PowerPoint files. In the zip archive you can find 2 PPT files. The file Tika.ppt is the "old" file, which cannot be converted and fails with this error message:

System.ApplicationException : Extraction of text from the file 'Tika.ppt' failed.
  ----> org.apache.tika.exception.TikaException : Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@2a784f5
  ----> java.lang.ArrayIndexOutOfBoundsException :
at TikaOnDotNet.TextExtractor.Extract(String filePath) in d:\Work\tikaondotnet.git\trunk\TikaOnDotnet\TextExtractor.cs:line 63
at TikaOnDotNet.tikadriver_examples.should_extract_from_ppt() in d:\Work\tikaondotnet.git\trunk\TikaOnDotnet\tikadriver_examples.cs:line 104
--TikaException
at org.apache.tika.parser.CompositeParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
at org.apache.tika.parser.CompositeParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
at org.apache.tika.parser.AutoDetectParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
at TikaOnDotNet.TextExtractor.Extract(String filePath) in d:\Work\tikaondotnet.git\trunk\TikaOnDotnet\TextExtractor.cs:line 55
--ArrayIndexOutOfBoundsException
at IKVM.Runtime.ByteCodeHelper.arraycopy_primitive_1(Array src, Int32 srcStart, Array dest, Int32 destStart, Int32 len)
at org.apache.poi.util.LittleEndian.getByteArray(Byte[] data, Int32 offset, Int32 size)
at org.apache.poi.hpsf.UnicodeString..ctor(Byte[] , Int32 )
at org.apache.poi.hpsf.TypedPropertyValue.readValue(Byte[] , Int32 )
at org.apache.poi.hpsf.Vector.read(Byte[] , Int32 )
at org.apache.poi.hpsf.TypedPropertyValue.readValue(Byte[] , Int32 )
at org.apache.poi.hpsf.VariantSupport.read(Byte[] src, Int32 offset, Int32 length, Int64 type, Int32 codepage)
at org.apache.poi.hpsf.Property..ctor(Int64 id, Byte[] src, Int64 offset, Int32 length, Int32 codepage)
at org.apache.poi.hpsf.Section..ctor(Byte[] src, Int32 offset)
at org.apache.poi.hpsf.PropertySet.init(Byte[] , Int32 , Int32 )
at org.apache.poi.hpsf.PropertySet..ctor(InputStream stream)
at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(DirectoryNode , String )
at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(DirectoryNode )
at org.apache.tika.parser.microsoft.OfficeParser.parse(DirectoryNode root, ParseContext context, Metadata metadata, XHTMLContentHandler xhtml)
at org.apache.tika.parser.microsoft.OfficeParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
at org.apache.tika.parser.CompositeParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)

The second file, Tika_new.ppt, is the same file after being saved with MS PowerPoint 2010 (File -> Save as...); it can be converted without any problems.

With tika-app-0.9.jar the file Tika.ppt can be converted too ==> is the error in the new version, tika-app-1.1.jar?

Thank you
Sepp
Since the problem lies with the original file (i.e., its structure is broken), I'm closing this bug as WONTFIX. However, a workaround will be added in trunk to skip the problematic SPRMs. I cannot guarantee that the file will be processed correctly after such errors, but it is worth a try.
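
For readers wondering what tolerating a broken record looks like in general, here is a minimal sketch. It is not the actual POI change and does not use the real SPRM layout: the class LenientRecordReader and its record format (1-byte opcode, 1-byte payload length, then the payload) are hypothetical, chosen only to illustrate the bounds check. When a declared length would run past the end of the buffer, the reader stops and keeps what it has already read, instead of letting an ArrayIndexOutOfBoundsException escape.

```java
import java.util.ArrayList;
import java.util.List;

public class LenientRecordReader {

    /**
     * Reads records from a buffer using a hypothetical layout:
     * byte 0 = opcode, byte 1 = unsigned payload length, then the payload.
     * A record whose declared length overruns the buffer ends the scan.
     */
    public static List<byte[]> readRecords(byte[] buf) {
        List<byte[]> payloads = new ArrayList<>();
        int offset = 0;
        // Need at least the 2-byte header to read a record.
        while (offset + 2 <= buf.length) {
            int size = buf[offset + 1] & 0xFF; // treat the length byte as unsigned
            int end = offset + 2 + size;
            if (end > buf.length) {
                // Declared size points outside the buffer: the structure is
                // broken, so stop here rather than index past the array.
                break;
            }
            byte[] payload = new byte[size];
            System.arraycopy(buf, offset + 2, payload, 0, size);
            payloads.add(payload);
            offset = end;
        }
        return payloads;
    }
}
```

With a corrupt length there is no reliable way to locate the next record, which is why this sketch stops rather than resynchronises; everything read before the bad record is still returned, so the output may be incomplete but the parse does not abort.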