I am currently using Apache Tika 0.9 (plus its dependent JAR files) and Apache POI 3.7 for text extraction. I get exceptions when parsing some Microsoft Office documents. I have attached a zip file containing the documents; please check them with the JAR versions mentioned above.

I get the following exception when I upload '1_1.doc' from the attached zip file:

Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@d8e54c
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.parser.ParsingReader$ParsingTask.run(ParsingReader.java:232)
    ... 1 more
Caused by: java.lang.NullPointerException
    at org.apache.poi.hwpf.sprm.CharacterSprmUncompressor.uncompressCHP(CharacterSprmUncompressor.java:39)
    at org.apache.poi.hwpf.model.CHPX.getCharacterProperties(CHPX.java:61)
    at org.apache.poi.hwpf.usermodel.CharacterRun.<init>(CharacterRun.java:98)
    at org.apache.poi.hwpf.usermodel.Range.getCharacterRun(Range.java:797)
    at org.apache.poi.hwpf.model.PicturesTable.getAllPictures(PicturesTable.java:191)
    at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:430)
    at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:420)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:75)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:182)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    ... 4 more

I get the following exception when I upload 'Book1.xlsb' or 'MSPPT2007.thmx' from the attached zip file:

Caused by: org.apache.tika.exception.TikaException: TIKA-418: RuntimeException while getting content for thmx and xps file types
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:67)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.parser.ParsingReader$ParsingTask.run(ParsingReader.java:232)
    ... 1 more
Caused by: java.lang.IllegalArgumentException: No supported documents found in the OOXML package (found application/vnd.ms-excel.sheet.binary.macroEnabled.main)
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:191)
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:152)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:65)
    ... 6 more

I get the following exception when I upload 'MSPPT2007.xps' from the attached zip file:

Caused by: org.apache.tika.exception.TikaException: Error creating OOXML extractor
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:90)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:67)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.parser.ParsingReader$ParsingTask.run(ParsingReader.java:232)
    ... 1 more
Caused by: java.lang.IllegalArgumentException: Invalid OOXML Package received - expected 1 core document, found 0
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:161)
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:152)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:65)
    ... 6 more

Please try to resolve this issue.

Thanks & Regards
Yatin Baraiya
*** Bug 51920 has been marked as a duplicate of this bug. ***
Please try the latest Tika 0.10, which depends on POI 3.8-beta4. Also, you forgot to attach the problematic file.

Yegor
I could reproduce the error with Apache Tika 0.10 and the attached document (Word-Crash067.doc). Same thing with current trunk (Revision 1180244). As soon as I delete the style 'Tabellenzeilenaufschrift klein' (see screenshot.png and resulting Word-Crash067-ok.doc) or save the document as .docx (Word-Crash067.docx), everything works as expected. I tested this with Microsoft Office Word 2003 (11.8328.8341) SP3.
Created attachment 27744 [details] Word-Document to throw exception 'CharacterSprmUncompressor.java:48'
Created attachment 27745 [details] Word-Document without the problematic style
Created attachment 27746 [details] Screenshot of the problematic style
Created attachment 27747 [details] Same Word-Document as Word-Crash067.doc but saved as .docx
I hopefully provided the necessary information some days ago. Setting the status back to new...
@Yegor: Are there any updates on this issue?
The provided sample files do work now, so the bug seems to have been fixed some time ago already.