Bug 51922 - Get exception in text extraction with poi 3.7 jar
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: HDF
Version: 3.7-FINAL
Hardware: PC Windows XP
Importance: P2 critical
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-09-30 05:49 UTC by yatin
Modified: 2011-10-06 13:08 UTC (History)



Attachments
Problem getting in the attached zip file's documents. (51.00 KB, application/msword)
2011-09-30 05:49 UTC, yatin
exception with poi 3.7 (8.58 KB, application/vnd.ms-excel.sheet.binary.macroEnabled.12)
2011-09-30 05:54 UTC, yatin
exception with poi 3.7 jar (41.49 KB, application/vnd.ms-officetheme)
2011-09-30 05:54 UTC, yatin
exception with poi 3.7 jar (73.67 KB, application/vnd.ms-xpsdocument)
2011-09-30 05:54 UTC, yatin

Description yatin 2011-09-30 05:49:50 UTC
Created attachment 27638
Problem getting in the attached zip file's documents.

I am currently using Apache Tika 0.9 (plus the jar files Tika 0.9 depends on) and the Apache POI 3.7 jar for text extraction.

I get exceptions with some Microsoft Office documents. I have attached the documents as a zip file. Please check them with the jar files mentioned above.

I get the following exception when I upload the '1_1.doc' document from the attached zip file:

Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@d8e54c
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.parser.ParsingReader$ParsingTask.run(ParsingReader.java:232)
    ... 1 more
Caused by: java.lang.NullPointerException
    at org.apache.poi.hwpf.sprm.CharacterSprmUncompressor.uncompressCHP(CharacterSprmUncompressor.java:39)
    at org.apache.poi.hwpf.model.CHPX.getCharacterProperties(CHPX.java:61)
    at org.apache.poi.hwpf.usermodel.CharacterRun.<init>(CharacterRun.java:98)
    at org.apache.poi.hwpf.usermodel.Range.getCharacterRun(Range.java:797)
    at org.apache.poi.hwpf.model.PicturesTable.getAllPictures(PicturesTable.java:191)
    at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:430)
    at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:420)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:75)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:182)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    ... 4 more
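
In the trace above, the POI NullPointerException is wrapped inside a TikaException, so a caller that logs only the outer exception may miss the actual failure. The following is a minimal sketch, using only the JDK, of how calling code might walk the cause chain to reach the innermost exception. The class name RootCause and the stand-in exceptions in main are hypothetical and are not part of Tika or POI.

```java
public class RootCause {
    /** Follows getCause() links until the innermost throwable is reached. */
    public static Throwable rootCause(Throwable t) {
        Throwable current = t;
        // Stop when there is no further cause (or the cause refers back to itself).
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the wrapped exception shown in the trace.
        Exception wrapped = new RuntimeException("Unexpected RuntimeException",
                new NullPointerException());
        System.out.println(rootCause(wrapped).getClass().getName());
    }
}
```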



I get the following exception when I upload the 'Book1.xlsb' and 'MSPPT2007.thmx' documents from the attached zip file:

Caused by: org.apache.tika.exception.TikaException: TIKA-418: RuntimeException while getting content for thmx and xps file types
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:67)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.parser.ParsingReader$ParsingTask.run(ParsingReader.java:232)
    ... 1 more
Caused by: java.lang.IllegalArgumentException: No supported documents found in the OOXML package (found application/vnd.ms-excel.sheet.binary.macroEnabled.main)
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:191)
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:152)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:65)
    ... 6 more
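
The IllegalArgumentException above names the content type that POI found in the package. OOXML files are zip packages that declare the content types of their parts in an entry named [Content_Types].xml, so those declarations can be inspected with the JDK alone. The sketch below lists the ContentType attribute of each Override element. It does not reproduce how POI selects an extractor, and the class name ContentTypeLister is hypothetical.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class ContentTypeLister {
    /** Returns the content types declared by Override elements in the package. */
    public static List<String> overrideContentTypes(String path) throws Exception {
        List<String> types = new ArrayList<>();
        try (ZipFile zip = new ZipFile(path)) {
            ZipEntry entry = zip.getEntry("[Content_Types].xml");
            if (entry == null) {
                return types; // not an OOXML-style package
            }
            try (InputStream in = zip.getInputStream(entry)) {
                Document doc = DocumentBuilderFactory.newInstance()
                        .newDocumentBuilder().parse(in);
                NodeList overrides = doc.getElementsByTagName("Override");
                for (int i = 0; i < overrides.getLength(); i++) {
                    Element override = (Element) overrides.item(i);
                    types.add(override.getAttribute("ContentType"));
                }
            }
        }
        return types;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java ContentTypeLister <path-to-document>
        for (String type : overrideContentTypes(args[0])) {
            System.out.println(type);
        }
    }
}
```

Running this against a file such as 'Book1.xlsb' would show which content types the package declares, which can then be compared with the type named in the exception message.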

I get the following exception when I upload the 'MSPPT2007.xps' document from the attached zip file:

Caused by: org.apache.tika.exception.TikaException: Error creating OOXML extractor
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:90)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:67)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.parser.ParsingReader$ParsingTask.run(ParsingReader.java:232)
    ... 1 more
Caused by: java.lang.IllegalArgumentException: Invalid OOXML Package received - expected 1 core document, found 0
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:161)
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:152)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:65)
    ... 6 more


Please try to resolve this issue.

Thanks & Regards
Yatin Baraiya
Comment 1 yatin 2011-09-30 05:54:03 UTC
Created attachment 27639
exception with poi 3.7

I can't attach the zip file here, so I am adding each document separately.

Please note that you will find only the individual documents, not a zip file.
Comment 2 yatin 2011-09-30 05:54:31 UTC
Created attachment 27640
exception with poi 3.7 jar
Comment 3 yatin 2011-09-30 05:54:44 UTC
Created attachment 27641
exception with poi 3.7 jar
Comment 4 Maxim Valyanskiy 2011-09-30 07:15:15 UTC
1) The NPE in CharacterSprmUncompressor.uncompressCHP is fixed in current POI. Tika 0.10 will be released soon; try its release candidate: https://people.apache.org/~mattmann/apache-tika-0.10/rc1/

2) xlsb and thmx formats are completely unsupported (as far as I know)
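
Until such formats are supported, calling code could skip them before handing the file to the parser. The following is a minimal sketch of an extension check, assuming the unsupported list is just the two formats named in point 2 for the versions discussed here; the class name UnsupportedFormatFilter is hypothetical and the list may need adjusting for other POI or Tika versions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class UnsupportedFormatFilter {
    // Assumption: only the extensions named in this comment are listed.
    private static final Set<String> UNSUPPORTED =
            new HashSet<>(Arrays.asList("xlsb", "thmx"));

    /** Returns true if the file name ends in an extension listed as unsupported. */
    public static boolean isUnsupported(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return false; // no extension to check
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return UNSUPPORTED.contains(ext);
    }

    public static void main(String[] args) {
        for (String name : new String[] {"1_1.doc", "Book1.xlsb", "MSPPT2007.thmx"}) {
            System.out.println(name + " unsupported: " + isUnsupported(name));
        }
    }
}
```

An extension check is only a heuristic, since a file can be renamed; it is meant as a cheap guard in front of the parser, not as format detection.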
Comment 5 yatin 2011-10-04 06:31:27 UTC
OK,

first of all, thank you for providing the solution to this bug.

Can you provide me with information about the Tika 0.10 jar release date? I am looking for tika-core-0.10.jar, tika-parsers-0.10.jar and poi-3.8.jar.

The link you gave is a source download link.

Thanks
Yatin Baraiya
Comment 6 yatin 2011-10-04 06:40:33 UTC
What is the solution for the document "MSPPT2007.xps" with Tika 0.9 and POI 3.7?
Comment 7 Nick Burch 2011-10-04 09:25:12 UTC
(In reply to comment #5)
> can you provide me the information regarding the tika 0.10 jar release date
> because i am looking for tika-core-0.10.jar,tika-parsers-0.10.jar,poi-3.8.jar.

Tika 0.10 is available from the Tika download page: http://tika.apache.org/download.html
Comment 8 yatin 2011-10-06 12:37:59 UTC
Hi,

thanks for providing the download link.

However, that link is for the tika-app-0.10.jar file, and I am looking for tika-core-0.10.jar, tika-parsers-0.10.jar, poi-3.8.jar and all of their dependent jar files.

Or does tika-app-0.10.jar include all the jars I am looking for?

Regards
Yatin Baraiya
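
One way to check locally whether a given jar bundles a particular class is to look for the class's entry inside the jar. The following is a minimal sketch using only the JDK; the class name JarClassCheck is hypothetical, and the sketch makes no claim about what any particular Tika jar contains.

```java
import java.io.IOException;
import java.util.jar.JarFile;

public class JarClassCheck {
    /** Returns true if the jar at jarPath has an entry for the given class name. */
    public static boolean containsClass(String jarPath, String className) throws IOException {
        // A class a.b.C is stored in a jar as the entry a/b/C.class.
        String entryName = className.replace('.', '/') + ".class";
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getEntry(entryName) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java JarClassCheck <path-to-jar> <fully.qualified.ClassName>
        System.out.println(containsClass(args[0], args[1]));
    }
}
```

For example, it could be run with a class name taken from the stack traces in this report, such as org.apache.poi.extractor.ExtractorFactory or org.apache.tika.parser.AutoDetectParser.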
Comment 9 Nick Burch 2011-10-06 13:08:44 UTC
The Tika users list is the right place for a question like that