Bug 59747 - xlsx file does not conform to bit patterns used by common file type detection software
Status: RESOLVED FIXED
Alias: None
Product: POI
Classification: Unclassified
Component: XSSF
Version: 3.14-FINAL
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
 
Reported: 2016-06-23 13:01 UTC by Dominik Mähl
Modified: 2017-09-22 21:21 UTC
Description Dominik Mähl 2016-06-23 13:01:41 UTC
Hi,

I'm creating this bug due to a problem we've encountered with POI generated xlsx files.

Apparently the order of zip entries in xlsx files matters to tools that determine the file type by matching a byte pattern. See, for example, Apache Tika (without the deeper OOXML support library) and the Linux file command.

POI files are fine according to the OOXML spec and open without problems in Excel, but tools that rely on a fixed pattern do not recognize them.

Here is the output of unzip -l on a POI-generated xlsx file:

Archive:  poi.xlsx
  Length     Date   Time    Name
 --------    ----   ----    ----
      591  02.06.16 12:40   _rels/.rels
     1063  02.06.16 12:40   [Content_Types].xml
      183  02.06.16 12:40   docProps/app.xml
      437  02.06.16 12:40   docProps/core.xml
      137  02.06.16 12:40   xl/sharedStrings.xml
      818  02.06.16 12:40   xl/styles.xml
      349  02.06.16 12:40   xl/workbook.xml
      569  02.06.16 12:40   xl/_rels/workbook.xml.rels
      670  02.06.16 12:40   xl/worksheets/sheet1.xml
 --------                   -------
     4817                   9 files

And for a native file:

Archive:  excel.xlsx
  Length     Date   Time    Name
 --------    ----   ----    ----
     1032  01.01.80 00:00   [Content_Types].xml
      588  01.01.80 00:00   _rels/.rels
      557  01.01.80 00:00   xl/_rels/workbook.xml.rels
      906  01.01.80 00:00   xl/workbook.xml
     1542  01.01.80 00:00   xl/styles.xml
     6790  01.01.80 00:00   xl/theme/theme1.xml
     1306  01.01.80 00:00   xl/worksheets/sheet1.xml
      593  01.01.80 00:00   docProps/core.xml
      816  01.01.80 00:00   docProps/app.xml
 --------                   -------
    14130                   9 files

Judging by the Linux file command and Tika, these tools seem to expect [Content_Types].xml as the first entry, skip the second, and look for "xl/" in the third entry.
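The order-based heuristic described above can be sketched in Java as follows. This is only an illustration of the rule as stated in this report, not the actual magic definition used by file or Tika. The class and method names are hypothetical, and the sample archives contain placeholder bytes rather than real OOXML parts.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of POI, Tika, or file(1).
final class EntryOrderCheck {

    /** Entry names in the order they are physically stored in the archive. */
    static List<String> entryOrder(byte[] zipBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                names.add(e.getName());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return names;
    }

    /** The rule from this report: entry 1 is [Content_Types].xml, entry 3 starts with "xl/". */
    static boolean matchesOrderPattern(List<String> names) {
        return names.size() >= 3
                && names.get(0).equals("[Content_Types].xml")
                && names.get(2).startsWith("xl/");
    }

    /** Builds an in-memory zip whose entries hold one placeholder byte each. */
    static byte[] zipOf(String... names) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write('x');
                zip.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // First three entries as listed for poi.xlsx and excel.xlsx above.
        byte[] poi = zipOf("_rels/.rels", "[Content_Types].xml", "docProps/app.xml");
        byte[] excel = zipOf("[Content_Types].xml", "_rels/.rels", "xl/_rels/workbook.xml.rels");
        System.out.println(matchesOrderPattern(entryOrder(poi)));   // prints false
        System.out.println(matchesOrderPattern(entryOrder(excel))); // prints true
    }
}
```

With the POI entry order the first entry is _rels/.rels, so the check fails even though the archive holds the same kinds of parts.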

Would it be possible to fix the order of the entries?

We've written a simple post-processing tool that rewrites the zip file, but we would be happy to have this in POI proper.
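A post-processing step of this kind could look roughly like the sketch below. It is not the reporter's tool, which is not shown in this report. The class name, the ranking rule, and the in-memory approach are all assumptions made for illustration. It needs Java 9 or later for readAllBytes, and it drops per-entry metadata such as timestamps.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical post-processor, not part of POI.
final class ZipReorder {

    /** Assumed ordering: content types, package rels, xl/ parts, then everything else. */
    static int rank(String name) {
        if (name.equals("[Content_Types].xml")) return 0;
        if (name.equals("_rels/.rels")) return 1;
        if (name.startsWith("xl/")) return 2;
        return 3;
    }

    /** Reads all entries into memory, keeping their stored order. */
    static Map<String, byte[]> readEntries(byte[] zipBytes) {
        Map<String, byte[]> parts = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                parts.put(e.getName(), zip.readAllBytes()); // reads the current entry only
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return parts;
    }

    /** Writes the given parts as a new zip in exactly the given order. */
    static byte[] writeEntries(List<String> order, Map<String, byte[]> parts) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (String name : order) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write(parts.get(name));
                zip.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return out.toByteArray();
    }

    /** Rewrites the archive with entries sorted by rank; ties keep their original order. */
    static byte[] reorder(byte[] zipBytes) {
        Map<String, byte[]> parts = readEntries(zipBytes);
        List<String> order = new ArrayList<>(parts.keySet());
        order.sort(Comparator.comparingInt(ZipReorder::rank)); // List.sort is stable
        return writeEntries(order, parts);
    }
}
```

Because the sort is stable, parts within the same rank keep the relative order in which POI wrote them. Only the position of the groups changes.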

Thanks, and please contact me if I can help.
Comment 1 Nick Burch 2016-06-23 13:27:50 UTC
Apart from a handful of formats (e.g. those which require a mimetypes file that's uncompressed as the first entry in the zip), reliably detecting container formats can only be done by opening up the container itself.

Apache Tika ships with a special detector for zip-based container formats for this very reason!

(Tika also, on trunk, correctly detects POI-generated OOXML files as OOXML from mime magic only)
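For contrast with order-based matching, here is a minimal sketch of detection that opens the container, as suggested above. It is not Tika's detector. The class name is hypothetical, the substring match on the workbook content type is a simplification of real XML parsing, and it only covers plain xlsx (macro-enabled and template variants use other content types).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical detector, not Tika's implementation.
final class ContainerAwareDetector {
    static final String CONTENT_TYPES = "[Content_Types].xml";
    static final String WORKBOOK_TYPE =
            "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml";

    /** True if [Content_Types].xml exists anywhere in the zip and declares a workbook part. */
    static boolean isXlsx(byte[] zipBytes) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if (CONTENT_TYPES.equals(e.getName())) {
                    // Simplification: substring search instead of parsing the XML.
                    String xml = new String(zip.readAllBytes(), StandardCharsets.UTF_8);
                    return xml.contains(WORKBOOK_TYPE);
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return false;
    }

    /** Builds an in-memory zip from alternating name/content arguments (demo helper). */
    static byte[] zipOf(String... nameContentPairs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (int i = 0; i + 1 < nameContentPairs.length; i += 2) {
                zip.putNextEntry(new ZipEntry(nameContentPairs[i]));
                zip.write(nameContentPairs[i + 1].getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return out.toByteArray();
    }
}
```

A check like this gives the same answer whether [Content_Types].xml is the first entry or the second, which is why the entry order stops mattering once the detector reads the container.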
Comment 2 Mark Murphy 2016-06-23 13:46:24 UTC
Seems to me, those tools that rely on a specific file order within an archive have a design flaw, that is, they rely on a specific file order within the archive. Apparently Tika does not have that issue, but anything that does will have an issue if Excel ever changes the order in which it writes files to the xlsx archive. It apparently doesn't care what the order is, so there is no guarantee the order will remain the same in future versions of the product.
Comment 3 Dominik Mähl 2016-06-24 06:15:54 UTC
I agree with both of you. But I'm also convinced that Excel will be (and is) seen as the reference implementation for OOXML. I can give you the name of at least one commercial content filtering product which ships with the mentioned bit patterns.

Also, the change for Tika was committed just yesterday :-)

(https://github.com/apache/tika/commit/52ea9ba7c2e3c99e7a2d4fb38875caa996438857)

To be clear: I know that this approach is flawed, but it seems to me that it is standard practice, and maybe it is easier to "fix" this in POI than in every tool out there.

If someone could point me to how to do it, I would happily create a patch or pull request or whatever. It's just that, looking at the POI code, I could not find an easy way to do it.
Comment 4 Javen O'Neal 2016-07-09 09:31:35 UTC
Here's a start:
$ grep --recursive --files-with-matches --exclude-dir=".svn" -E "CONTENT_TYPES_PART_NAME|Content_Types|_rels|\.rels|RELATIONSHIP_PART" --include=*.java src/ooxml/java/org/apache/poi/openxml4j/opc

src/ooxml/java/org/apache/poi/openxml4j/opc/PackageRelationship.java
src/ooxml/java/org/apache/poi/openxml4j/opc/PackagePartName.java
src/ooxml/java/org/apache/poi/openxml4j/opc/OPCPackage.java
src/ooxml/java/org/apache/poi/openxml4j/opc/ZipPackage.java
src/ooxml/java/org/apache/poi/openxml4j/opc/internal/ContentTypeManager.java
src/ooxml/java/org/apache/poi/openxml4j/opc/internal/ZipHelper.java
src/ooxml/java/org/apache/poi/openxml4j/opc/internal/ZipContentTypeManager.java
src/ooxml/java/org/apache/poi/openxml4j/opc/PackagingURIHelper.java

I took a quick glance, and ZipPackage#getPartsImpl and the TreeMap partList looked potentially relevant, but I couldn't figure out whether this is where the order is being set. Also, it's possible that the content manager needs to be created before the rels, which may make it difficult to simply rearrange the code to get the _rels directory to be created first. It seems more logical to me for files in higher directories to be created before files in lower directories.
Comment 5 Dominik Stadler 2017-09-22 21:21:44 UTC
A fix for this was actually quite easy: it only required exchanging the order in which the two files are written in ZipPackage.saveImpl().

I have done this in r1809357. If it causes issues we may need to revert this, though!