Bug 62886

Summary: Regression extracting text from corrupted docx files
Product: POI
Reporter: Luis Filipe Nassif <lfcnassif>
Component: OPC
Assignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: regression    
Priority: P2    
Version: 4.0.0-FINAL   
Target Milestone: ---   
Hardware: PC   
OS: All   
Attachments: Example file

Description Luis Filipe Nassif 2018-11-05 14:55:07 UTC
Created attachment 36245
Example file

While testing Tika-1.19.1, POI throws the following exception on some corrupt docx files. MS Word complains about these files but repairs them, and POI-3.17 handled them without problems. See TIKA-2765 for more info. Stack trace below:

org.apache.poi.openxml4j.exceptions.InvalidOperationException: Could not open the specified zip entry source stream
at org.apache.poi.openxml4j.opc.ZipPackage.openZipEntrySourceStream(ZipPackage.java:214)
at org.apache.poi.openxml4j.opc.ZipPackage.openZipEntrySourceStream(ZipPackage.java:196)
at org.apache.poi.openxml4j.opc.ZipPackage.openZipEntrySourceStream(ZipPackage.java:170)
at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:151)
at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:123)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:234)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:81)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
... 43 more
Caused by: java.io.EOFException
at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.readFully(ZipArchiveInputStream.java:803)
at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.readFully(ZipArchiveInputStream.java:795)
at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.skipRemainderOfArchive(ZipArchiveInputStream.java:1014)
at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.getNextZipEntry(ZipArchiveInputStream.java:257)
at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.getNextEntry(ZipArchiveThresholdInputStream.java:139)
at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:47)
at org.apache.poi.openxml4j.opc.ZipPackage.openZipEntrySourceStream(ZipPackage.java:212)
... 51 more
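
The trace above ends in an EOFException raised from skipRemainderOfArchive while getNextZipEntry was looking for the next entry, which suggests (but does not prove) that the archive is cut off somewhere after its last readable entry. As a rough illustration of tolerant reading under that assumption, the sketch below builds a small ZIP in memory, drops its end-of-central-directory record, and keeps whatever entries can still be streamed. It uses the JDK's java.util.zip.ZipInputStream rather than POI or commons-compress, so it is not the fix from r1849252 and says nothing about how POI behaves. The class TolerantZipRead and its methods are hypothetical names made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of POI, Tika, or commons-compress.
final class TolerantZipRead {

    // Size of the ZIP end-of-central-directory record when it carries no comment.
    private static final int END_RECORD_SIZE = 22;

    // Builds a two-entry ZIP in memory, then cuts off the end-of-central-directory
    // record to imitate a file with a damaged tail (an assumption about the corruption).
    static byte[] buildTruncatedZip() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            for (String name : new String[] {"a.txt", "b.txt"}) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write(name.getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] complete = buffer.toByteArray();
        return Arrays.copyOf(complete, complete.length - END_RECORD_SIZE);
    }

    // Streams the entries and returns the names read before any I/O error,
    // instead of failing the whole archive.
    static List<String> readEntryNames(byte[] zipBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            // Damaged data: report it and fall through with the partial result.
            System.err.println("Stopped early: " + e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(readEntryNames(buildTruncatedZip()));
    }
}
```

The point of the sketch is only the shape of the handling: entries that were already read successfully are kept when the tail of the archive turns out to be unreadable.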
Comment 1 Andreas Beeker 2018-12-18 23:56:43 UTC
fixed via r1849252
Comment 2 Luis Filipe Nassif 2018-12-19 01:52:44 UTC
Thank you, Andreas!
Comment 3 Luis Filipe Nassif 2018-12-19 01:54:21 UTC
Thank you, Andreas!