Bug 64473 - [PATCH] OPCPackage.open(fileName, PackageAccess.READ) does not open valid xlsx file
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: OPC
Version: 4.1.2-FINAL
Hardware: PC All
Importance: P2 blocker with 1 vote
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on: 61942
Blocks:
Reported: 2020-05-27 09:55 UTC by Eugene
Modified: 2021-07-21 07:50 UTC (History)

Attachments
corrupted file (10.80 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
2020-05-27 09:55 UTC, Eugene
Zip file with files to reproduce the bug (23.68 KB, application/x-zip-compressed)
2021-07-01 18:42 UTC, Nail Samatov
[PATCH] for fixing the issue (10.40 KB, application/x-gzip)
2021-07-20 14:48 UTC, Yury

Description Eugene 2020-05-27 09:55:39 UTC
Created attachment 37268 [details]
corrupted file

Contents of the unpacked xlsx file and apache poi

The initial problem is that the xlsx file cannot be opened through POI (OPCPackage.open(fileName, PackageAccess.READ)), although it opens in Excel.

Debugging POI showed that the problem lies in the contents of the xlsx file.
If you unzip the xlsx file, the xl folder contains, in addition to all the other files, the two entries that cause the problem:

xl/metadata
xl/metadata.xml

Calling the POI method OPCPackage.open(fileName, PackageAccess.READ) on this file leads to an error:

org.apache.poi.openxml4j.exceptions.InvalidFormatException: You can't add a part with a part name derived from another part ! [M1.11]

It occurs in the PackagePartCollection.put method, because the two entries have the same file name apart from the extension.

If I copy the contents of the entire xlsx file into a newly created xlsx file and save it, the xl/metadata entry is no longer there and the new file opens through POI without problems.
But my task is not just to fix the file; I need to figure out why this problem could arise.

It looks like a slightly incorrect xlsx, but I can still open it in Excel. Is there any way to open it through POI?
Does anyone have an idea how xl/metadata could end up in the contents of the xlsx?
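
To check whether a given xlsx file contains such a pair of entries without unzipping it by hand, the ZIP entry names can be listed with the JDK alone. This is a minimal sketch, not part of POI: the class EntryNameClash and its methods are hypothetical names, and it only looks for the specific pattern described above (one entry name equal to another plus a dot suffix).

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical diagnostic helper (not part of POI).
class EntryNameClash {

    /** Returns "shorter -> longer" pairs where the longer name is the shorter one plus ".suffix". */
    static List<String> findDotClashes(List<String> names) {
        List<String> clashes = new ArrayList<>();
        for (String shorter : names) {
            for (String longer : names) {
                if (longer.startsWith(shorter + ".")) {
                    clashes.add(shorter + " -> " + longer);
                }
            }
        }
        return clashes;
    }

    /** Reads all entry names of a ZIP container such as an xlsx file. */
    static List<String> entryNames(String path) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(path)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java EntryNameClash.java <file.xlsx>");
            return;
        }
        for (String clash : findDotClashes(entryNames(args[0]))) {
            System.out.println(clash);
        }
    }
}
```

For the attached file this should report the xl/metadata and xl/metadata.xml pair, assuming the entries are stored under exactly those names.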
Comment 1 Eugene 2020-05-27 12:43:13 UTC
Also pay attention to the specification. I found only the draft version, but I don't think it differs much from the final one.

https://www.ecma-international.org/activities/Office%20Open%20XML%20Formats/Draft%20ECMA-376%203rd%20edition,%20March%202011/Office%20Open%20XML%20Part%202%20-%20Open%20Packaging%20Conventions.pdf

Item 9.1.1.4 Part Naming


A package implementer shall neither create nor recognize a part with a part name derived from another part name by appending segments to it. [M1.11] [Example: If a package contains a part named “/segment1/segment2/.../segmentn”, then other parts in that package shall not have names such as: “/segment1”, “segment1/segment2”, or “/segment1/segment2/.../segmentn-1”. end example]

But also look at the item:

9.1.1 Part Names
Each part has a name. Part names refer to parts within a package. [Example: The part name “/hello/world/doc.xml” contains three segments: “hello”, “world”, and “doc.xml”. The first two segments in the sample represent levels in the logical hierarchy and serve to organize the parts of the package, whereas the third contains actual content. Note that segments are not explicitly represented as folders in the package model, and no directory of folders exists in the package model. end example]

In this example the last segment, “doc.xml”, is treated as a name together with its extension. In POI, however, the method PackagePart put(final PackagePartName partName, final PackagePart part) in the class PackagePartCollection compares only the file names without their extensions, which is possibly a mistake.
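
Read literally, rule M1.11 is about whole segments, so the check can be sketched by splitting part names on slashes only. The following is a minimal illustration of that reading, not POI code: the class PartNameDerivation and its methods are hypothetical, part names are assumed to start with "/", and case-insensitive comparison is left out for brevity.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of a segment-based M1.11 check (not POI code).
class PartNameDerivation {

    /** Splits "/hello/world/doc.xml" into ["hello", "world", "doc.xml"]. */
    static List<String> segments(String partName) {
        // Drop the leading slash, then split on the remaining slashes only.
        return Arrays.asList(partName.substring(1).split("/"));
    }

    /** True if candidate consists of all segments of base plus at least one appended segment. */
    static boolean isDerivedFrom(String candidate, String base) {
        List<String> c = segments(candidate);
        List<String> b = segments(base);
        return c.size() > b.size() && c.subList(0, b.size()).equals(b);
    }

    public static void main(String[] args) {
        // Same number of segments, different last segment: not derived.
        System.out.println(isDerivedFrom("/xl/metadata.xml", "/xl/metadata"));      // false
        // A segment was appended to "/xl/metadata": derived, so M1.11 applies.
        System.out.println(isDerivedFrom("/xl/metadata/part.xml", "/xl/metadata")); // true
    }
}
```

Under this reading, “/xl/metadata” and “/xl/metadata.xml” are two independent part names, because “metadata.xml” is a single segment.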
Comment 2 PJ Fanning 2020-05-27 13:04:04 UTC
It's possible we'll change POI code but the next release could be weeks away.

It's worth investigating where your xlsx file came from to find out why its contents are not standard.
Comment 3 Nail Samatov 2021-07-01 18:42:55 UTC
Created attachment 37929 [details]
Zip file with files to reproduce the bug

We have the same issue.
I tried to find steps that produce such files, which Apache POI can't read.

Pre-requisites:
Excel from MS Office 365
files 1.xlsx and 2.xlsx (you can find them in the attached zip file).
1.xlsx contains "xl/metadata" and 2.xlsx contains "xl/metadata.xml"

Steps:
1. Open 1.xlsx in Excel
2. Open 2.xlsx in Excel
3. Right click on the worksheet tab and select Move or Copy.
4. Select the 1.xlsx option in the To Book drop-down list.
5. Press OK.
6. Save 1.xlsx.

After saving, 1.xlsx contains both xl/metadata and xl/metadata.xml.

You can find the result of the steps above in the folder "result-of-merge" in the same attached zip file. This file can't be read by POI but can be opened in Excel.
Comment 4 Yury 2021-07-20 12:07:23 UTC
The issue appeared with the change for ticket https://bz.apache.org/bugzilla/show_bug.cgi?id=61942 in revision 1819708.

I think the dot symbol in the regexp is unnecessary in this line:

"(?=["+PackagingURIHelper.FORWARD_SLASH_STRING+".])";
                                                ^
                                                this

See https://svn.apache.org/viewvc/poi/trunk/poi-ooxml/src/main/java/org/apache/poi/openxml4j/opc/PackagePartCollection.java?revision=1819708&view=markup#l64
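
The effect of the dot in the character class can be reproduced with the JDK alone. The sketch below is a simplified assumption about how the delimiter is used (split the name, then test each cumulative prefix against the names already registered); the class PrefixCheckSketch and the method clashes are hypothetical and are not the POI source. It assumes Java 8 or later, where String.split does not emit an empty leading element for a zero-width match at the start.

```java
import java.util.HashSet;
import java.util.Set;

// Simplified, hypothetical model of a prefix check driven by a split regex (not the POI source).
class PrefixCheckSketch {

    /** True if any cumulative prefix of name, cut where delimiterRegex matches, is already registered. */
    static boolean clashes(Set<String> registered, String name, String delimiterRegex) {
        StringBuilder prefix = new StringBuilder();
        for (String piece : name.split(delimiterRegex)) {
            prefix.append(piece);
            if (registered.contains(prefix.toString())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> registered = new HashSet<>();
        registered.add("/xl/metadata");

        // With the dot: pieces are "/xl", "/metadata", ".xml", so the prefix "/xl/metadata" is hit.
        System.out.println(clashes(registered, "/xl/metadata.xml", "(?=[/.])")); // true
        // Without the dot: pieces are "/xl", "/metadata.xml", so no registered prefix is hit.
        System.out.println(clashes(registered, "/xl/metadata.xml", "(?=[/])"));  // false
    }
}
```

In this model the clash only shows up when the name without the extension is registered first, and a name with appended segments such as "/xl/metadata/part.xml" is still rejected by the slash-only delimiter.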
Comment 6 Yury 2021-07-20 14:48:34 UTC
Created attachment 37964 [details]
[PATCH] for fixing the issue

Created with the following command:
ant -f patch.xml
Comment 7 PJ Fanning 2021-07-20 17:01:20 UTC
Thanks Yury - merged with r1891692