Bug 62886 - Regression extracting text from corrupted docx files
Summary: Regression extracting text from corrupted docx files
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: OPC
Version: 4.0.0-FINAL
Hardware: PC All
Importance: P2 regression
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-11-05 14:55 UTC by Luis Filipe Nassif
Modified: 2018-12-19 01:54 UTC

Attachments
Example file (19.32 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2018-11-05 14:55 UTC, Luis Filipe Nassif

Description Luis Filipe Nassif 2018-11-05 14:55:07 UTC
Created attachment 36245
Example file

While testing Tika-1.19.1, POI throws the following exception on some corrupt docx files that POI-3.17 previously handled without problems (MS Word complains about these files but repairs them). See TIKA-2765 for more information. Stack trace below:

org.apache.poi.openxml4j.exceptions.InvalidOperationException: Could not open the specified zip entry source stream
at org.apache.poi.openxml4j.opc.ZipPackage.openZipEntrySourceStream(ZipPackage.java:214)
at org.apache.poi.openxml4j.opc.ZipPackage.openZipEntrySourceStream(ZipPackage.java:196)
at org.apache.poi.openxml4j.opc.ZipPackage.openZipEntrySourceStream(ZipPackage.java:170)
at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:151)
at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:123)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:234)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:81)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
... 43 more
Caused by: java.io.EOFException
at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.readFully(ZipArchiveInputStream.java:803)
at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.readFully(ZipArchiveInputStream.java:795)
at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.skipRemainderOfArchive(ZipArchiveInputStream.java:1014)
at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.getNextZipEntry(ZipArchiveInputStream.java:257)
at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.getNextEntry(ZipArchiveThresholdInputStream.java:139)
at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:47)
at org.apache.poi.openxml4j.opc.ZipPackage.openZipEntrySourceStream(ZipPackage.java:212)
... 51 more
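
For illustration only: the trace shows the stream-based reader hitting end-of-file while advancing to the next ZIP entry. The sketch below is not the POI fix and does not use POI or commons-compress; it uses the JDK's java.util.zip instead, and the class TolerantZipReader and its methods are hypothetical names. It shows one way a stream reader could keep the entries it read completely before an EOFException rather than rejecting the whole archive. Note that java.util.zip does not fail at the same point as the trace above, so the sample archive is cut off inside the second entry's data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper for illustration; not part of POI, Tika, or commons-compress. */
final class TolerantZipReader {

    /** Entries that were read completely before the stream ended or failed. */
    final Map<String, byte[]> entries = new LinkedHashMap<>();

    /** True if reading stopped early on an I/O error such as EOFException. */
    boolean truncated;

    /** Reads entries until the archive ends or an I/O error occurs. Closes the given stream. */
    static TolerantZipReader read(InputStream in) {
        TolerantZipReader result = new TolerantZipReader();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            byte[] buffer = new byte[4096];
            while ((entry = zip.getNextEntry()) != null) {
                ByteArrayOutputStream data = new ByteArrayOutputStream();
                int n;
                while ((n = zip.read(buffer)) != -1) {
                    data.write(buffer, 0, n);
                }
                // Only reached when the entry was read to its end.
                result.entries.put(entry.getName(), data.toByteArray());
            }
        } catch (IOException e) {
            // EOFException is an IOException: keep what was read so far.
            result.truncated = true;
        }
        return result;
    }

    /** Builds a two-entry archive in memory, optionally cut off inside the second entry. */
    static byte[] sampleArchive(boolean truncate) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ZipOutputStream zip = new ZipOutputStream(bytes);
        zip.putNextEntry(new ZipEntry("a.txt"));
        zip.write("hello".getBytes(StandardCharsets.UTF_8));
        zip.closeEntry();
        int endOfFirstEntry = bytes.size();

        byte[] noise = new byte[1000]; // incompressible payload
        new Random(42).nextBytes(noise);
        zip.putNextEntry(new ZipEntry("b.bin"));
        zip.write(noise);
        zip.closeEntry();
        zip.close();

        byte[] all = bytes.toByteArray();
        // Keep the first entry plus 100 bytes of the second entry's header and data.
        return truncate ? Arrays.copyOf(all, endOfFirstEntry + 100) : all;
    }

    public static void main(String[] args) throws IOException {
        TolerantZipReader r = read(new ByteArrayInputStream(sampleArchive(true)));
        System.out.println("entries read: " + r.entries.keySet());
        System.out.println("truncated: " + r.truncated);
    }
}
```

Whether partial recovery like this is appropriate depends on the caller; a text extractor may prefer partial content, while a tool that rewrites the package should probably still fail.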
Comment 1 Andreas Beeker 2018-12-18 23:56:43 UTC
fixed via r1849252
Comment 2 Luis Filipe Nassif 2018-12-19 01:52:44 UTC
Thank you, Andreas!
Comment 3 Luis Filipe Nassif 2018-12-19 01:54:21 UTC
Thank you, Andreas!