Created attachment 27638 [details] Problem getting in the attached zip file's documents.

I am currently using Apache Tika 0.9 (plus Tika 0.9's dependent JAR files) and the Apache POI 3.7 JAR for text extraction. I get exceptions when processing some Microsoft Office documents. I have attached a zip file containing the documents; please check them with the JAR files mentioned above.

I get the following exception when uploading the '1_1.doc' document from the attached zip file:

Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@d8e54c
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.parser.ParsingReader$ParsingTask.run(ParsingReader.java:232)
    ... 1 more
Caused by: java.lang.NullPointerException
    at org.apache.poi.hwpf.sprm.CharacterSprmUncompressor.uncompressCHP(CharacterSprmUncompressor.java:39)
    at org.apache.poi.hwpf.model.CHPX.getCharacterProperties(CHPX.java:61)
    at org.apache.poi.hwpf.usermodel.CharacterRun.<init>(CharacterRun.java:98)
    at org.apache.poi.hwpf.usermodel.Range.getCharacterRun(Range.java:797)
    at org.apache.poi.hwpf.model.PicturesTable.getAllPictures(PicturesTable.java:191)
    at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:430)
    at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:420)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:75)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:182)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    ... 4 more

I get the following exception when uploading the 'Book1.xlsb' and 'MSPPT2007.thmx' documents from the attached zip file:

Caused by: org.apache.tika.exception.TikaException: TIKA-418: RuntimeException while getting content for thmx and xps file types
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:67)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.parser.ParsingReader$ParsingTask.run(ParsingReader.java:232)
    ... 1 more
Caused by: java.lang.IllegalArgumentException: No supported documents found in the OOXML package (found application/vnd.ms-excel.sheet.binary.macroEnabled.main)
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:191)
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:152)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:65)
    ... 6 more

I get the following exception when uploading the 'MSPPT2007.xps' document from the attached zip file:

Caused by: org.apache.tika.exception.TikaException: Error creating OOXML extractor
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:90)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:67)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.parser.ParsingReader$ParsingTask.run(ParsingReader.java:232)
    ... 1 more
Caused by: java.lang.IllegalArgumentException: Invalid OOXML Package received - expected 1 core document, found 0
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:161)
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:152)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:65)
    ... 6 more

Please try to resolve this issue.

Thanks & Regards,
Yatin Baraiya
Created attachment 27639 [details] exception with poi 3.7

I can't add the zip file here, which is why I am adding each document separately. Please note that you will find only the individual documents, not the zip file.
Created attachment 27640 [details] exception with poi 3.7 jar
Created attachment 27641 [details] exception with poi 3.7 jar
1) The NPE in CharacterSprmUncompressor.uncompressCHP is fixed in current POI. Tika 0.10 will be released soon; try its release candidate: https://people.apache.org/~mattmann/apache-tika-0.10/rc1/
2) The xlsb and thmx formats are completely unsupported (as far as I know).
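For the xlsb/thmx failures, the POI message includes the content type it found in the package. A minimal sketch, using only the JDK, of how one might list the content types a zip-based package declares in its [Content_Types].xml before handing the file to Tika. The class and method names (OoxmlContentTypes, declaredContentTypes) are hypothetical, the regex scan is a simplification of real XML parsing, and this is not how POI itself performs the check:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of Tika or POI.
public class OoxmlContentTypes {

    // Simplification: matches double-quoted ContentType attributes only.
    private static final Pattern CONTENT_TYPE =
            Pattern.compile("ContentType=\"([^\"]+)\"");

    /**
     * Returns every ContentType value declared in the package's
     * [Content_Types].xml entry, or an empty list if the entry is missing.
     */
    public static List<String> declaredContentTypes(InputStream in) throws IOException {
        List<String> types = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"[Content_Types].xml".equals(entry.getName())) {
                    continue;
                }
                // Read the whole entry into memory; it is normally small.
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                int n;
                while ((n = zip.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                String xml = new String(buf.toByteArray(), StandardCharsets.UTF_8);
                Matcher m = CONTENT_TYPE.matcher(xml);
                while (m.find()) {
                    types.add(m.group(1));
                }
                break;
            }
        }
        return types;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java OoxmlContentTypes <file>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            for (String type : declaredContentTypes(in)) {
                System.out.println(type);
            }
        }
    }
}
```

Running it against a file such as Book1.xlsb prints the declared types, which can then be compared with the type named in the exception message to decide whether to skip the file.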
OK, first of all, thank you for providing the solution to this bug. Can you tell me the release date of the Tika 0.10 JARs? I am looking for tika-core-0.10.jar, tika-parsers-0.10.jar, and poi-3.8.jar. The link you gave is a source download link.

Thanks,
Yatin Baraiya
What is the solution for the document "MSPPT2007.xps" with Tika 0.9 and POI 3.7?
(In reply to comment #5)
> can you provide me the information regarding the tika 0.10 jar release date
> because i am looking for tika-core-0.10.jar,tika-parsers-0.10.jar,poi-3.8.jar.

Tika 0.10 is available from the Tika download page: http://tika.apache.org/download.html
Hi, thanks for providing the download link. However, that link is for the tika-app-0.10.jar file, and I am looking for tika-core-0.10.jar, tika-parsers-0.10.jar, poi-3.8.jar, and all their dependent JAR files. Or does tika-app-0.10.jar include all the JARs I am looking for?

Regards,
Yatin Baraiya
The Tika users list is the right place for a question like that.