Bug 64473 - OPCPackage.open(fileName, PackageAccess.READ) does not open valid xlsx file
Status: NEW
Alias: None
Product: POI
Classification: Unclassified
Component: OPC
Version: 4.1.2-FINAL
Hardware: PC All
Importance: P2 blocker
Target Milestone: ---
Assignee: POI Developers List
Reported: 2020-05-27 09:55 UTC by Eugene
Modified: 2020-05-27 13:04 UTC

Attachments
corrupted file (10.80 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
2020-05-27 09:55 UTC, Eugene

Description Eugene 2020-05-27 09:55:39 UTC
Created attachment 37268
corrupted file

Contents of the unpacked xlsx file and Apache POI

The initial problem is that POI cannot open the xlsx file (OPCPackage.open(fileName, PackageAccess.READ)), although Excel opens it.

A closer look at POI showed that the problem lies in the contents of the xlsx file.
If you unzip the xlsx file, the xl folder contains, among all the other files, two entries that cause the problem:

xl/metadata
xl/metadata.xml

When calling the POI method OPCPackage.open(fileName, PackageAccess.READ), this leads to the error:

org.apache.poi.openxml4j.exceptions.InvalidFormatException: You can't add a part with a part name derived from another part ! [M1.11]

which is raised in the PackagePartCollection.put method because the two file names are treated as the same.
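
One way to check a file for this kind of clash without POI is to list the ZIP entries and look for names that extend another name by "/" or ".". The sketch below uses only java.util.zip. The class name ZipNameClashes and the "/"-or-"." heuristic are assumptions made for illustration, based on the two entries listed above; they are not taken from POI's code.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of POI.
class ZipNameClashes {

    /** Returns the sorted names of all non-directory entries in the archive. */
    static List<String> entryNames(String zipPath) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(zipPath)) {
            for (ZipEntry e : Collections.list(zip.entries())) {
                if (!e.isDirectory()) {
                    names.add(e.getName());
                }
            }
        }
        Collections.sort(names);
        return names;
    }

    /** Reports pairs where one name is another name followed by "/" or ".". */
    static List<String> findClashes(List<String> names) {
        List<String> clashes = new ArrayList<>();
        for (String shorter : names) {
            for (String longer : names) {
                if (longer.length() > shorter.length() && longer.startsWith(shorter)) {
                    char next = longer.charAt(shorter.length());
                    if (next == '/' || next == '.') {
                        clashes.add(shorter + " <-> " + longer);
                    }
                }
            }
        }
        return clashes;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the xlsx file to inspect.
        for (String clash : findClashes(entryNames(args[0]))) {
            System.out.println(clash);
        }
    }
}
```

For a file containing both xl/metadata and xl/metadata.xml, this heuristic would report that pair. It may also flag harmless pairs, so treat the output as a hint rather than a verdict.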

If I copy the contents of the entire xlsx file into a newly created xlsx file and save it, the xl/metadata entry is no longer there and POI opens the new file without problems.
But my task is not just to fix the file; I need to find out how this problem could arise.

The xlsx looks slightly incorrect, but I can still open it in Excel. Is there any way to open it with POI?
Does anyone have an idea how xl/metadata could end up in the contents of the xlsx?
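
As a stopgap for opening such a file, the archive could be copied while leaving out the extension-less entry, and the copy passed to OPCPackage.open. This is only a sketch under assumptions: the class and method names are hypothetical, it needs Java 9+ for InputStream.transferTo, and it does not check whether anything else in the package refers to the removed entry. It also does not explain where the entry came from.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of POI.
class ZipEntryFilter {

    /** Copies every entry of src into dst except the one named skipName. */
    static void copyWithout(Path src, Path dst, String skipName) throws IOException {
        try (ZipInputStream in = new ZipInputStream(Files.newInputStream(src));
             ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(dst))) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                if (entry.getName().equals(skipName)) {
                    continue; // getNextEntry() skips the rest of this entry
                }
                // Use a fresh ZipEntry so sizes and CRC are recomputed on write.
                out.putNextEntry(new ZipEntry(entry.getName()));
                in.transferTo(out); // reads up to the end of the current entry
                out.closeEntry();
            }
        }
    }
}
```

A call such as ZipEntryFilter.copyWithout(Path.of("in.xlsx"), Path.of("out.xlsx"), "xl/metadata") would write the filtered copy; the file names here are placeholders.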
Comment 1 Eugene 2020-05-27 12:43:13 UTC
Also note what the documentation says. I found only the draft version, but I think the differences from the final version are small.

https://www.ecma-international.org/activities/Office%20Open%20XML%20Formats/Draft%20ECMA-376%203rd%20edition,%20March%202011/Office%20Open%20XML%20Part%202%20-%20Open%20Packaging%20Conventions.pdf

Item 9.1.1.4 Part Naming

A package implementer shall neither create nor recognize a part with a part name derived from another part name by appending segments to it. [M1.11] [Example: If a package contains a part named “/segment1/segment2/.../segmentn”, then other parts in that package shall not have names such as: “/segment1”, “segment1/segment2”, or “/segment1/segment2/.../segmentn-1”. end example]

But also look at this item:

9.1.1 Part Names
Each part has a name. Part names refer to parts within a package. [Example: The part name “/hello/world/doc.xml” contains three segments: “hello”, “world”, and “doc.xml”. The first two segments in the sample represent levels in the logical hierarchy and serve to organize the parts of the package, whereas the third contains actual content. Note that segments are not explicitly represented as folders in the package model, and no directory of folders exists in the package model. end example]

In this example the segment “doc.xml” is the file name together with its extension, whereas in POI, in the method PackagePart put(final PackagePartName partName, final PackagePart part) of the class PackagePartCollection, the comparison is made only on file names, without considering their extensions, which is possibly a mistake.
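
To make the difference concrete, here is a minimal sketch of the two readings. The class PartNameCheck and both methods are hypothetical. The first method follows the segment definition quoted from 9.1.1 (segments are separated by "/" only). The second imitates the behaviour described in this comment and is not a copy of POI's implementation.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not POI code.
class PartNameCheck {

    /** Splits a part name such as "/xl/metadata.xml" into ["xl", "metadata.xml"]. */
    static List<String> segments(String partName) {
        String trimmed = partName.startsWith("/") ? partName.substring(1) : partName;
        return Arrays.asList(trimmed.split("/"));
    }

    /** True if candidate is base with one or more whole segments appended. */
    static boolean isDerivedBySegments(String candidate, String base) {
        List<String> c = segments(candidate);
        List<String> b = segments(base);
        return c.size() > b.size() && c.subList(0, b.size()).equals(b);
    }

    /** Looser check that also treats "." as a boundary, as this comment describes. */
    static boolean isDerivedIgnoringExtension(String candidate, String base) {
        return candidate.startsWith(base + "/") || candidate.startsWith(base + ".");
    }
}
```

Under the segment-based check, "/xl/metadata.xml" is not derived from "/xl/metadata", because "metadata.xml" and "metadata" are different segments, while "/xl/metadata/item1.xml" would be. The looser check flags both.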
Comment 2 PJ Fanning 2020-05-27 13:04:04 UTC
It's possible we'll change POI code but the next release could be weeks away.

It's worth investigating where your xlsx file came from to find out why its contents are not standard.