Bug 51920

Summary: Get exception in text extraction with poi 3.7 jar
Product: POI
Reporter: yatin <baraiya.yatin>
Component: HDF
Assignee: POI Developers List <dev>
Status: RESOLVED DUPLICATE    
Severity: major    
Priority: P2    
Version: 3.7-FINAL   
Target Milestone: ---   
Hardware: PC   
OS: Windows XP   

Description yatin 2011-09-30 05:46:11 UTC
I am currently using Apache Tika 0.9 (plus the JAR files Tika 0.9 depends on) and the Apache POI 3.7 JAR for text extraction.

I get exceptions when processing some Microsoft Office documents. I have attached a ZIP file containing the documents. Please check them with the JAR files mentioned above.

I get the following exception when I upload the '1_1.doc' document from the attached ZIP file:

Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@d8e54c
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.parser.ParsingReader$ParsingTask.run(ParsingReader.java:232)
    ... 1 more
Caused by: java.lang.NullPointerException
    at org.apache.poi.hwpf.sprm.CharacterSprmUncompressor.uncompressCHP(CharacterSprmUncompressor.java:39)
    at org.apache.poi.hwpf.model.CHPX.getCharacterProperties(CHPX.java:61)
    at org.apache.poi.hwpf.usermodel.CharacterRun.<init>(CharacterRun.java:98)
    at org.apache.poi.hwpf.usermodel.Range.getCharacterRun(Range.java:797)
    at org.apache.poi.hwpf.model.PicturesTable.getAllPictures(PicturesTable.java:191)
    at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:430)
    at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:420)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:75)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:182)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    ... 4 more
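
In a trace like this one, the outer TikaException only wraps the real failure; the innermost "Caused by" (here the NullPointerException raised inside POI) is the part worth reporting. A minimal sketch, using only the JDK, of walking the cause chain to reach it. The class name RootCause and the simulated exceptions are hypothetical stand-ins for illustration and are not taken from Tika or POI:

```java
public class RootCause {
    // Follow getCause() links until the innermost exception is reached.
    public static Throwable find(Throwable t) {
        Throwable current = t;
        // Stop on a null cause; also guard against a self-referencing cause.
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Hypothetical chain imitating a wrapper exception around a NullPointerException.
        Exception wrapped = new RuntimeException(
                "Unexpected RuntimeException from parser",
                new NullPointerException("simulated failure"));
        Throwable root = find(wrapped);
        System.out.println(root.getClass().getName() + ": " + root.getMessage());
    }
}
```

Logging the result of such a helper next to the document name makes it easier to tell which failures come from the underlying library and which from the calling code.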



I get the following exception when I upload the 'Book1.xlsb' and 'MSPPT2007.thmx' documents from the attached ZIP file:

Caused by: org.apache.tika.exception.TikaException: TIKA-418: RuntimeException while getting content for thmx and xps file types
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:67)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.parser.ParsingReader$ParsingTask.run(ParsingReader.java:232)
    ... 1 more
Caused by: java.lang.IllegalArgumentException: No supported documents found in the OOXML package (found application/vnd.ms-excel.sheet.binary.macroEnabled.main)
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:191)
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:152)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:65)
    ... 6 more
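
The IllegalArgumentException above names the content type that was found in the package. An OOXML file is a ZIP package whose [Content_Types].xml part declares such types, so one way to see what a file contains before handing it to an extractor is to list them. A minimal sketch using only the JDK. The names ContentTypeLister, overrideTypes, and samplePackage are hypothetical, the sample package is synthetic (it is not the attached Book1.xlsb), and the regular expression is a shortcut that assumes simple markup rather than a full XML parse:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ContentTypeLister {
    // Matches the ContentType attribute of each <Override> element.
    private static final Pattern OVERRIDE =
            Pattern.compile("<Override\\b[^>]*\\bContentType=\"([^\"]*)\"");

    // Returns the content types declared by <Override> elements in [Content_Types].xml.
    public static List<String> overrideTypes(InputStream packageStream) {
        List<String> types = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(packageStream)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"[Content_Types].xml".equals(entry.getName())) {
                    continue;
                }
                // Read the whole entry into memory; this part is small in practice.
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                int n;
                while ((n = zip.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                String xml = new String(buf.toByteArray(), StandardCharsets.UTF_8);
                Matcher m = OVERRIDE.matcher(xml);
                while (m.find()) {
                    types.add(m.group(1));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return types;
    }

    // Builds a synthetic in-memory package holding only a [Content_Types].xml part.
    public static byte[] samplePackage() {
        String xml = "<Types xmlns=\"http://schemas.openxmlformats.org/package/2006/content-types\">"
                + "<Override PartName=\"/xl/workbook.bin\" "
                + "ContentType=\"application/vnd.ms-excel.sheet.binary.macroEnabled.main\"/>"
                + "</Types>";
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry("[Content_Types].xml"));
            zip.write(xml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        for (String type : overrideTypes(new ByteArrayInputStream(samplePackage()))) {
            System.out.println(type);
        }
    }
}
```

A caller could use such a list to skip, or route elsewhere, packages whose declared types the extractor is known not to handle, instead of letting the exception propagate.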

I get the following exception when I upload the 'MSPPT2007.xps' document from the attached ZIP file:

Caused by: org.apache.tika.exception.TikaException: Error creating OOXML extractor
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:90)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:67)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.parser.ParsingReader$ParsingTask.run(ParsingReader.java:232)
    ... 1 more
Caused by: java.lang.IllegalArgumentException: Invalid OOXML Package received - expected 1 core document, found 0
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:161)
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:152)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:65)
    ... 6 more


Please try to resolve this issue.

Thanks & Regards
Yatin Baraiya

Comment 1 Yegor Kozlov 2011-10-04 12:27:47 UTC

*** This bug has been marked as a duplicate of bug 51921 ***