Summary: | xlsx file does not conform to bit patterns used by common file type detection software | ||
---|---|---|---|
Product: | POI | Reporter: | Dominik Mähl <dominik.maehl> |
Component: | XSSF | Assignee: | POI Developers List <dev> |
Status: | RESOLVED FIXED | ||
Severity: | normal | ||
Priority: | P2 | ||
Version: | 3.14-FINAL | ||
Target Milestone: | --- | ||
Hardware: | PC | ||
OS: | All |
Description
Dominik Mähl
2016-06-23 13:01:41 UTC
Apart from a handful of formats (e.g. those which require a mimetypes file that's uncompressed as the first entry in the zip), reliably detecting container formats can only be done by opening up the container itself. Apache Tika ships with a special detector for zip-based container formats for this very reason! (Tika also, on trunk, correctly detects POI-generated OOXML files as OOXML from mime magic only.)

Seems to me that tools which rely on a specific file order within an archive have a design flaw. Apparently Tika does not have that issue, but anything that does will break if Excel ever changes the order in which it writes files to the xlsx archive. Excel apparently doesn't care what the order is, so there is no guarantee the order will remain the same in future versions of the product.

I agree with both of you. But I'm also convinced that Excel will be (and is) seen as the reference implementation for OOXML. I can give you the name of at least one commercial content filtering product which ships with the mentioned bit patterns. Also, the change for Tika was committed just yesterday :-) (https://github.com/apache/tika/commit/52ea9ba7c2e3c99e7a2d4fb38875caa996438857)

To be clear: I know that this approach is flawed, but it seems to me that it is standard practice, and maybe it is easier to "fix" in POI than in every tool out there. If someone would point me to how to do it, I would happily create a patch or pull request or whatever. It's just that by looking at the POI code I could not find an easy way to do it.
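To make the entry-order discussion concrete, here is a minimal sketch (plain java.util.zip, not POI code) that lists the entries of an archive in the physical order in which they are stored. The class name ZipEntryOrder and the method listEntryNames are made up for this illustration. Running it against a POI-generated xlsx and an Excel-generated one should show where the two orders differ.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper for illustration only; not part of POI or Tika.
class ZipEntryOrder {

    /**
     * Returns the entry names in the order they physically appear in the
     * archive. ZipInputStream walks the local file headers sequentially,
     * which is the same order a detector scanning from byte 0 would see.
     */
    static List<String> listEntryNames(Path zipFile) throws IOException {
        List<String> names = new ArrayList<>();
        try (InputStream in = Files.newInputStream(zipFile);
             ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ZipEntryOrder some-workbook.xlsx
        for (String name : listEntryNames(Paths.get(args[0]))) {
            System.out.println(name);
        }
    }
}
```

Note that java.util.zip.ZipFile reads the central directory instead, so it is not guaranteed to reflect what a signature-based detector sees at the start of the file.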
Here's a start:

$ grep --recursive --files-with-matches --exclude-dir=".svn" -E "CONTENT_TYPES_PART_NAME|Content_Types|_rels|\.rels|RELATIONSHIP_PART" --include=*.java src/ooxml/java/org/apache/poi/openxml4j/opc
src/ooxml/java/org/apache/poi/openxml4j/opc/PackageRelationship.java
src/ooxml/java/org/apache/poi/openxml4j/opc/PackagePartName.java
src/ooxml/java/org/apache/poi/openxml4j/opc/OPCPackage.java
src/ooxml/java/org/apache/poi/openxml4j/opc/ZipPackage.java
src/ooxml/java/org/apache/poi/openxml4j/opc/internal/ContentTypeManager.java
src/ooxml/java/org/apache/poi/openxml4j/opc/internal/ZipHelper.java
src/ooxml/java/org/apache/poi/openxml4j/opc/internal/ZipContentTypeManager.java
src/ooxml/java/org/apache/poi/openxml4j/opc/PackagingURIHelper.java

I took a quick look, and ZipPackage#getPartsImpl and the TreeMap partList looked potentially relevant, but I couldn't figure out whether this is where the order is being set. Also, it's possible that the content type manager needs to be created before the rels, which may make it difficult to simply rearrange the code to get the _rels directory to be created first. It seems more logical to me for files in higher directories to be created before files in lower directories.
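As a rough illustration of what "setting the order" means at the zip level, the sketch below writes entries with plain java.util.zip.ZipOutputStream. This is not how POI's ZipPackage is implemented; the class OrderedZipWriter, its methods, and the particular entry order in main are assumptions made up for the example. The point is only that the physical order equals the order of putNextEntry calls, so an insertion-ordered map (rather than a sorted TreeMap) gives the caller control over which entry lands at byte 0.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper for illustration only; not POI's ZipPackage.
class OrderedZipWriter {

    /** Writes the parts in the iteration order of the given map. */
    static byte[] write(Map<String, byte[]> parts) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            for (Map.Entry<String, byte[]> part : parts.entrySet()) {
                zip.putNextEntry(new ZipEntry(part.getKey()));
                zip.write(part.getValue());
                zip.closeEntry();
            }
        }
        return buffer.toByteArray();
    }

    /**
     * Reads the name in the first local file header. In the zip format the
     * name length is a little-endian 16-bit value at offset 26 and the name
     * itself starts at offset 30.
     */
    static String firstEntryName(byte[] archive) {
        int nameLength = (archive[26] & 0xFF) | ((archive[27] & 0xFF) << 8);
        return new String(archive, 30, nameLength, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // LinkedHashMap keeps insertion order; the order chosen here is
        // an assumption for the demo, not a statement of what Excel writes.
        Map<String, byte[]> parts = new LinkedHashMap<>();
        parts.put("[Content_Types].xml", "<Types/>".getBytes(StandardCharsets.UTF_8));
        parts.put("_rels/.rels", "<Relationships/>".getBytes(StandardCharsets.UTF_8));
        parts.put("xl/workbook.xml", "<workbook/>".getBytes(StandardCharsets.UTF_8));

        byte[] archive = write(parts);
        System.out.println(firstEntryName(archive));
    }
}
```

The fixed offsets read by firstEntryName are what a byte-pattern detector keys on, which is why the first entry written matters more than the rest of the order.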