Bug 52991 - Unexpected end of ZLIB input stream on embedded OLE extraction from PPT
Summary: Unexpected end of ZLIB input stream on embedded OLE extraction from PPT
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: HSLF
Version: 3.8-dev
Hardware: PC Linux
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
Reported: 2012-03-27 08:27 UTC by Maxim Valyanskiy
Modified: 2014-12-29 22:32 UTC

Description Maxim Valyanskiy 2012-03-27 08:27:31 UTC
Caused by: org.apache.tika.io.TaggedIOException: Unexpected end of ZLIB input stream
	at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
	at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
	at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
	at java.io.FilterInputStream.read(FilterInputStream.java:107)
	at org.apache.tika.io.IOUtils.copyLarge(IOUtils.java:933)
	at org.apache.tika.io.IOUtils.copy(IOUtils.java:907)
	at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:536)
	at org.apache.tika.io.TikaInputStream.getFileChannel(TikaInputStream.java:564)
	at org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:335)
	at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:152)
	at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:113)
	at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
	at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
	at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:68)
	at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources(HSLFExtractor.java:210)
	at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:122)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:188)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	... 5 more
Caused by: java.io.EOFException: Unexpected end of ZLIB input stream
	at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
	at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
	at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
	... 26 more
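For readers unfamiliar with this exception: the innermost frame, InflaterInputStream.fill, throws it when the underlying stream ends before the deflate data is complete. The following is a minimal sketch that reproduces the same message with plain JDK classes by cutting a zlib stream in half. The class and method names (TruncatedZlibDemo, deflate, inflateAll) are made up for illustration, and this is not the POI or Tika code path from the trace above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical demo class; not part of POI or Tika.
public class TruncatedZlibDemo {

    /** Compresses the given bytes into a complete zlib stream. */
    static byte[] deflate(byte[] raw) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(buf)) {
            out.write(raw);
        }
        return buf.toByteArray();
    }

    /**
     * Reads a zlib stream to the end.
     * Returns null on success, or the EOFException message if the
     * compressed data stops before the stream is complete.
     */
    static String inflateAll(byte[] compressed) throws IOException {
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            while (in.read(chunk) != -1) {
                // Discard the decompressed bytes; only the outcome matters here.
            }
            return null;
        } catch (EOFException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            text.append("embedded OLE object data ").append(i).append('\n');
        }
        byte[] full = deflate(text.toString().getBytes("UTF-8"));
        // Simulate a reader that is handed fewer compressed bytes than were written.
        byte[] truncated = Arrays.copyOf(full, full.length / 2);

        System.out.println("complete stream:  " + inflateAll(full));
        System.out.println("truncated stream: " + inflateAll(truncated));
    }
}
```

If the sketch behaves as expected, only the truncated input reports an error, which suggests looking at how many compressed bytes the caller is given rather than at the inflater itself.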
Comment 1 Maxim Valyanskiy 2012-03-27 08:32:57 UTC
Fixed in r1305778.
Comment 2 EM 2012-05-13 10:16:17 UTC
Verified with the current trunk, revision 1337825: not fixed yet.

The source is a PPT; the error is exactly the same:
Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@bd928a
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:248)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:126)
	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:395)
	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:97)
Caused by: org.apache.tika.io.TaggedIOException: Unexpected end of ZLIB input stream
	at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
	at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
	at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
	at java.io.FilterInputStream.read(FilterInputStream.java:90)
	at org.apache.tika.io.IOUtils.copyLarge(IOUtils.java:933)
	at org.apache.tika.io.IOUtils.copy(IOUtils.java:907)
	at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:536)
	at org.apache.tika.io.TikaInputStream.getFileChannel(TikaInputStream.java:564)
	at org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:335)
	at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:152)
	at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:113)
	at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
	at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
	at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:68)
	at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources(HSLFExtractor.java:236)
	at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:117)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:188)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	... 5 more
Caused by: java.io.EOFException: Unexpected end of ZLIB input stream
	at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:223)
	at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:141)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
	at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
	... 26 more


---------
Debian Squeeze with Tika built from source (also tried 1.0 and 1.1)
Comment 3 EM 2012-05-13 10:24:30 UTC
I researched this a bit online: several people have run into this error because of "broken archives". I verified that I can unzip the PPT without issues using unzip.

> unzip <file.ppt>
Archive:  <file.ppt> 
Warning[<file.ppt>]: 1865926 extra bytes at beginning or within zipfile
  (attempting to process anyway)
  inflating: [Content_Types].xml     
  inflating: _rels/.rels             
  inflating: drs/picturexml.xml      
  inflating: drs/_rels/picturexml.xml.rels  
  inflating: drs/downrev.xml         
 extracting: drs/media/image1.png    

I am not sure whether the warning is one of those issues or is related; I just wanted to provide the information I have. Basically, this happens with a lot of PPTs here. We are using Tika + Solr to index attachments.
Comment 4 Maxim Valyanskiy 2012-05-14 07:56:06 UTC
Are you sure that your Tika is built with the latest version of POI? The stack trace looks like it was produced by a build without my fix.
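One way to answer that question at runtime is to ask the JVM which jar a POI class was actually loaded from. The sketch below is only an illustration: ClassOrigin and describe are hypothetical names, the default class name org.apache.poi.poifs.filesystem.POIFSFileSystem is just a convenient POI class to probe, and the version is shown only if the jar manifest happens to record one.

```java
import java.security.CodeSource;

// Hypothetical helper; not part of POI or Tika.
public class ClassOrigin {

    /**
     * Describes where a class was loaded from and which implementation
     * version its jar manifest records (null if none is recorded).
     */
    static String describe(String className) {
        try {
            // Load without initializing, so static initializers are not run.
            Class<?> c = Class.forName(className, false, ClassOrigin.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            String location = (src == null || src.getLocation() == null)
                    ? "(bootstrap or unknown)"
                    : src.getLocation().toString();
            Package p = c.getPackage();
            String version = (p == null) ? null : p.getImplementationVersion();
            return className + " <- " + location + ", version=" + version;
        } catch (ClassNotFoundException e) {
            return className + " not found on classpath";
        }
    }

    public static void main(String[] args) {
        // Assumption: any POI class works here; pass another name as the first argument.
        String name = args.length > 0
                ? args[0]
                : "org.apache.poi.poifs.filesystem.POIFSFileSystem";
        System.out.println(describe(name));
    }
}
```

Run it with the same classpath as the Tika build under test; if the reported jar is a released POI version rather than a locally built one, the fix is not in use.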
Comment 5 EM 2012-05-14 09:00:06 UTC
I used:
svn checkout https://svn.apache.org/repos/asf/tika/trunk tika.trunk

Then "mvn install". I am not sure about POI: is that an extra library? Does Maven not fetch it properly, or is it not included in the source?

Should I build it again and provide you the logs on a pastebin?
Comment 6 Maxim Valyanskiy 2012-05-14 09:21:09 UTC
This is the Bugzilla of the POI project, not Tika :-)

Tika uses POI as a dependency in the tika-parsers module. The bug was fixed in an unreleased version of POI, so you need to build your own version (or wait for the next release).

If you want to build Tika with POI, then:

1) Build POI

2) Install POI artifacts to your local Maven repository

3) Update POI version in tika-parsers/pom.xml

4) Build Tika
Comment 7 EM 2012-05-14 09:23:04 UTC
Oh, I am sorry! I got totally confused here because of the rather "general" bug-tracking system.

I will do what you suggested, thank you a lot!
Comment 8 Andreas Beeker 2014-12-29 22:32:11 UTC
I assume this is fixed. If you reopen it, please attach a test file.