Bug 51921 - Get exception in text extraction with poi 3.7 jar
Summary: Get exception in text extraction with poi 3.7 jar
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: HDF
Version: 3.7-FINAL
Hardware: PC Windows XP
Importance: P2 major
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Duplicates: 51920
Depends on:
Blocks:
 
Reported: 2011-09-30 05:48 UTC by yatin
Modified: 2015-03-22 19:44 UTC



Attachments
Word-Document to throw exception 'CharacterSprmUncompressor.java:48' (55.50 KB, application/msword)
2011-10-09 13:53 UTC, Michael
Word-Document without the problematic style (55.00 KB, application/msword)
2011-10-09 13:54 UTC, Michael
Screenshot of the problematic style (68.94 KB, image/png)
2011-10-09 13:54 UTC, Michael
Same Word-Document as Word-Crash067.doc but saved as .docx (19.54 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2011-10-09 13:55 UTC, Michael

Description yatin 2011-09-30 05:48:33 UTC
I am currently using Apache Tika 0.9 (plus the jar files Tika 0.9 depends on) and the Apache POI 3.7 jar for text extraction.

I get exceptions with some Microsoft Office documents. I have attached a zip file containing the documents. Please check them with the jar files mentioned above.

I get the following exception when I upload the '1_1.doc' document from the attached zip file.

Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@d8e54c
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.parser.ParsingReader$ParsingTask.run(ParsingReader.java:232)
    ... 1 more
Caused by: java.lang.NullPointerException
    at org.apache.poi.hwpf.sprm.CharacterSprmUncompressor.uncompressCHP(CharacterSprmUncompressor.java:39)
    at org.apache.poi.hwpf.model.CHPX.getCharacterProperties(CHPX.java:61)
    at org.apache.poi.hwpf.usermodel.CharacterRun.<init>(CharacterRun.java:98)
    at org.apache.poi.hwpf.usermodel.Range.getCharacterRun(Range.java:797)
    at org.apache.poi.hwpf.model.PicturesTable.getAllPictures(PicturesTable.java:191)
    at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:430)
    at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:420)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:75)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:182)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    ... 4 more
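
In the trace above, the TikaException only wraps the real failure, a NullPointerException raised inside POI's HWPF code. A minimal sketch of how calling code could walk such a "Caused by" chain to find the innermost exception follows. It uses only the JDK; `CauseChain` is a hypothetical helper name, not part of Tika or POI.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Tika or POI) for inspecting wrapped exceptions. */
final class CauseChain {
    private CauseChain() {}

    /** Returns the throwable followed by each nested cause, outermost first. */
    static List<Throwable> of(Throwable t) {
        List<Throwable> chain = new ArrayList<>();
        // Stop at the end of the chain, or if a cause repeats (defensive cycle guard).
        while (t != null && !chain.contains(t)) {
            chain.add(t);
            t = t.getCause();
        }
        return chain;
    }

    /** Returns the innermost cause, or null if the argument is null. */
    static Throwable root(Throwable t) {
        List<Throwable> chain = of(t);
        return chain.isEmpty() ? null : chain.get(chain.size() - 1);
    }
}
```

Applied to the exception reported here, `CauseChain.root(e)` would be expected to return the NullPointerException rather than the TikaException wrapper, which makes it easier to tell POI failures apart from Tika ones in logs.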



I get the following exception when I upload the 'Book1.xlsb' and 'MSPPT2007.thmx' documents from the attached zip file.

Caused by: org.apache.tika.exception.TikaException: TIKA-418: RuntimeException while getting content for thmx and xps file types
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:67)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.parser.ParsingReader$ParsingTask.run(ParsingReader.java:232)
    ... 1 more
Caused by: java.lang.IllegalArgumentException: No supported documents found in the OOXML package (found application/vnd.ms-excel.sheet.binary.macroEnabled.main)
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:191)
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:152)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:65)
    ... 6 more

I get the following exception when I upload the 'MSPPT2007.xps' document from the attached zip file.

Caused by: org.apache.tika.exception.TikaException: Error creating OOXML extractor
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:90)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:67)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.parser.ParsingReader$ParsingTask.run(ParsingReader.java:232)
    ... 1 more
Caused by: java.lang.IllegalArgumentException: Invalid OOXML Package received - expected 1 core document, found 0
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:161)
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:152)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:65)
    ... 6 more
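
The two IllegalArgumentExceptions above come from files that are ZIP-based packages but do not contain a document type POI 3.7 knows how to extract. One way to see what such a package declares is to list the Override entries in its `[Content_Types].xml` part. The following is a rough sketch using only the JDK, assuming the file is a ZIP package with that part present; `OpcContentTypes` is a hypothetical name, and this is not how POI itself performs the check.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical diagnostic helper (not part of Tika or POI). */
final class OpcContentTypes {
    private OpcContentTypes() {}

    /** Parses a [Content_Types].xml stream and returns "partName -> contentType" lines. */
    static List<String> parse(InputStream contentTypesXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(contentTypesXml);
            // "*" matches Override elements in any namespace.
            NodeList nodes = doc.getElementsByTagNameNS("*", "Override");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                result.add(e.getAttribute("PartName") + " -> " + e.getAttribute("ContentType"));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read [Content_Types].xml", e);
        }
    }

    /** Opens a ZIP-based package and lists its declared overrides (empty if the part is missing). */
    static List<String> fromPackage(String path) {
        try (ZipFile zip = new ZipFile(path)) {
            ZipEntry entry = zip.getEntry("[Content_Types].xml");
            if (entry == null) {
                return new ArrayList<>();
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return parse(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For a file such as 'Book1.xlsb', a call like `OpcContentTypes.fromPackage("Book1.xlsb")` would be expected to include the `application/vnd.ms-excel.sheet.binary.macroEnabled.main` type named in the exception message, which helps confirm that the format is unsupported rather than the file being corrupt.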


Please try to resolve this issue.

Thanks & Regards
Yatin Baraiya
Comment 1 Yegor Kozlov 2011-10-04 12:27:47 UTC
*** Bug 51920 has been marked as a duplicate of this bug. ***
Comment 2 Yegor Kozlov 2011-10-04 12:32:29 UTC
Please try the latest Tika 0.10, which depends on POI 3.8-beta4. Also, you forgot to attach the problematic file.

Yegor
Comment 3 Michael 2011-10-09 13:50:35 UTC
I could reproduce the error with Apache Tika 0.10 and the attached document (Word-Crash067.doc). Same thing with current trunk (Revision 1180244).

As soon as I delete the style 'Tabellenzeilenaufschrift klein' (see screenshot.png and resulting Word-Crash067-ok.doc) or save the document as .docx (Word-Crash067.docx), everything works as expected. I tested this with Microsoft Office Word 2003 (11.8328.8341) SP3.
Comment 4 Michael 2011-10-09 13:53:13 UTC
Created attachment 27744
Word-Document to throw exception 'CharacterSprmUncompressor.java:48'
Comment 5 Michael 2011-10-09 13:54:13 UTC
Created attachment 27745
Word-Document without the problematic style
Comment 6 Michael 2011-10-09 13:54:40 UTC
Created attachment 27746
Screenshot of the problematic style
Comment 7 Michael 2011-10-09 13:55:27 UTC
Created attachment 27747
Same Word-Document as Word-Crash067.doc but saved as .docx
Comment 8 Michael 2011-10-28 06:24:02 UTC
I hopefully provided the necessary information some days ago. Setting the status back to NEW...
Comment 9 Nico 2011-12-16 14:38:30 UTC
@Yegor: Are there any updates on this issue?
Comment 10 Dominik Stadler 2015-03-22 19:44:59 UTC
The provided sample files do work now, so the bug seems to have been fixed some time ago already.