Bug 51920 - Get exception in text extraction with poi 3.7 jar
Summary: Get exception in text extraction with poi 3.7 jar
Status: RESOLVED DUPLICATE of bug 51921
Alias: None
Product: POI
Classification: Unclassified
Component: HDF
Version: 3.7-FINAL
Hardware: PC Windows XP
Importance: P2 major
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-09-30 05:46 UTC by yatin
Modified: 2011-10-04 12:27 UTC

Description yatin 2011-09-30 05:46:11 UTC
Currently I am using Apache Tika 0.9 (plus Tika 0.9's dependent jar files) and the Apache POI 3.7 jar for text extraction.

I get exceptions when parsing some Microsoft Office documents. I have attached a zip file containing the documents. Please check them with the jar files mentioned above.

I get the following exception when we upload the '1_1.doc' document from the attached zip file.

Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@d8e54c
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.parser.ParsingReader$ParsingTask.run(ParsingReader.java:232)
    ... 1 more
Caused by: java.lang.NullPointerException
    at org.apache.poi.hwpf.sprm.CharacterSprmUncompressor.uncompressCHP(CharacterSprmUncompressor.java:39)
    at org.apache.poi.hwpf.model.CHPX.getCharacterProperties(CHPX.java:61)
    at org.apache.poi.hwpf.usermodel.CharacterRun.<init>(CharacterRun.java:98)
    at org.apache.poi.hwpf.usermodel.Range.getCharacterRun(Range.java:797)
    at org.apache.poi.hwpf.model.PicturesTable.getAllPictures(PicturesTable.java:191)
    at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:430)
    at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:420)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:75)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:182)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    ... 4 more



I get the following exception when we upload the 'Book1.xlsb' and 'MSPPT2007.thmx' documents from the attached zip file.

Caused by: org.apache.tika.exception.TikaException: TIKA-418: RuntimeException while getting content for thmx and xps file types
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:67)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.parser.ParsingReader$ParsingTask.run(ParsingReader.java:232)
    ... 1 more
Caused by: java.lang.IllegalArgumentException: No supported documents found in the OOXML package (found application/vnd.ms-excel.sheet.binary.macroEnabled.main)
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:191)
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:152)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:65)
    ... 6 more

I get the following exception when we upload the 'MSPPT2007.xps' document from the attached zip file.

Caused by: org.apache.tika.exception.TikaException: Error creating OOXML extractor
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:90)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:67)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.parser.ParsingReader$ParsingTask.run(ParsingReader.java:232)
    ... 1 more
Caused by: java.lang.IllegalArgumentException: Invalid OOXML Package received - expected 1 core document, found 0
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:161)
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:152)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:65)
    ... 6 more
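
In all three traces the Tika exception only wraps the real failure raised inside POI (a NullPointerException for the .doc file, an IllegalArgumentException for the others). As a rough sketch for telling these cases apart in calling code, the cause chain can be walked down to its innermost entry. The class name RootCause and its method of() are hypothetical helpers and are not part of Tika or POI; the example uses only the standard library and a stand-in RuntimeException in place of TikaException.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

/** Hypothetical helper (not part of Tika or POI): finds the innermost cause. */
final class RootCause {
    private RootCause() {}

    /** Returns the innermost cause of t, or t itself if it has no cause. */
    static Throwable of(Throwable t) {
        // Identity-based set guards against a (rare) cyclic cause chain.
        Set<Throwable> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        Throwable current = t;
        while (current.getCause() != null && seen.add(current)) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-in for the first trace: a wrapper exception around an NPE.
        Throwable wrapped = new RuntimeException(
                "Unexpected RuntimeException", new NullPointerException());
        Throwable root = of(wrapped);
        System.out.println(root.getClass().getName()); // java.lang.NullPointerException
    }
}
```

Logging the root cause's class and message next to the file name would make it easier to see which documents fail inside POI and for what reason.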


Please try to resolve this issue.

Thanks & Regards
Yatin Baraiya
Comment 1 Yegor Kozlov 2011-10-04 12:27:47 UTC

*** This bug has been marked as a duplicate of bug 51921 ***