Bug 52540

Summary: [PATCH] Need to support docx files with multiple core properties
Product: POI
Reporter: Gregg Morris <gregg.morris>
Component: XWPF
Assignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: normal    
Priority: P2    
Version: 3.8-dev   
Target Milestone: ---   
Hardware: Macintosh   
OS: All   
Attachments: Small document containing more than one core properties element.
Patch file generated by ant

Description Gregg Morris 2012-01-27 05:28:25 UTC
Created attachment 28215
Small document containing more than one core properties element.

In order to comply with rule M4.1 ("A format consumer shall consider
more than one core properties relationship for a package to be an
error"), POI throws an exception when you attempt to open a Word docx
file that violates this rule. Unfortunately, Word 2008 and 2011 for
Macintosh create, save, and open files with multiple core
properties. I do not have easy access to Windows versions of Word, so
I don't know what happens when this file is read into Word and saved
out again. This is using the latest 3.8beta5 version of POI. I have
attached a small Word docx file ("base.docx") that demonstrates the
problem.

I have created a patch that relaxes this compliance. Instead of
throwing an exception, I take the first core properties encountered in
the file and silently ignore any subsequent core properties in the file.
I completely understand if this patch is not accepted. I've modified
the code to explicitly violate a clear rule in the standard. But I
need to support my users, which means I need to support these
non-conforming files.
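
For readers who want to see the relaxed behaviour in isolation, here is a minimal sketch of a "first one wins" selection. It is not the attached patch and does not use POI's classes: the Relationship class and the pickCoreProperties method are hypothetical names made up for illustration. Only the relationship type URI is the one OPC defines for core properties.

```java
import java.util.List;

// Illustrative only: a stand-in for the relaxed rule described above,
// not the code from the attached patch.
final class FirstCorePropertiesSketch {

    // Relationship type that OPC assigns to the core properties part.
    public static final String CORE_PROPERTIES_TYPE =
        "http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties";

    // Hypothetical minimal model of a package relationship.
    public static final class Relationship {
        public final String type;
        public final String target;

        public Relationship(String type, String target) {
            this.type = type;
            this.target = target;
        }
    }

    // Returns the first core properties relationship in document order and
    // ignores any later ones instead of raising an M4.1 violation.
    // Returns null when the package has no core properties relationship.
    public static Relationship pickCoreProperties(List<Relationship> relationships) {
        Relationship chosen = null;
        for (Relationship rel : relationships) {
            if (!CORE_PROPERTIES_TYPE.equals(rel.type)) {
                continue; // unrelated relationship
            }
            if (chosen == null) {
                chosen = rel; // keep the first match
            }
            // Later matches are silently skipped; a strict consumer would throw here.
        }
        return chosen;
    }
}
```

A strict reader would replace the "silently skipped" branch with an exception, which is the behaviour the report asks to relax.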
Comment 1 Gregg Morris 2012-01-27 05:32:19 UTC
Created attachment 28216
Patch file generated by ant

This is a patch file generated by ant.
There are changes to two files, src/ooxml/java/org/apache/poi/openxml4j/opc/OPCPackage.java and src/ooxml/testcases/org/apache/poi/openxml4j/opc/compliance/TestOPCComplianceCoreProperties.java
Comment 2 Nick Burch 2012-01-27 11:50:02 UTC
It'd be good to confirm that the first core properties part is the one to use, and not, say, the last one.

For one of these problem files, would it be possible for you to unzip the .docx file (it's a zip of XML files), and manually edit the XML for the 1st and 2nd core properties? Make some changes so that you can identify which one each value is coming from, then zip it up and load that file in Word. Using Word 2008, 2011 and 2007, check which set of properties Word actually shows.

Assuming Mac and Windows see the same one, once we know whether the first or the last is used, we can then apply your patch (or tweak it and apply it), and unit test as appropriate.
Comment 3 Gregg Morris 2012-01-27 19:11:29 UTC
I tested both Word 2008 and 2011 for Mac (the only ones I have access to). In the docProps directory of the unzipped docx file, there is "core.xml" and "core1.xml". In both versions of Word, the "core.xml" is the "current" one. If I make changes via the File -> Properties dialog, they appear in the "core.xml" file.

Interestingly, the "core1.xml" file contains what looks like valid data, but it's "old". It appears to be from an earlier version of the document, maybe? I sure wish I knew how to tell my authors to avoid creating it!
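
As a rough way to spot affected files without unzipping them by hand, the sketch below lists zip entries whose names look like the "core.xml" and "core1.xml" parts mentioned above. It uses only java.util.zip, not POI, and the class and method names are made up for illustration. It also assumes the extra parts follow the docProps/core*.xml naming seen in this one sample; a robust check would read the package relationships instead of guessing from file names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Illustrative helper: finds entries named like docProps/core*.xml in a docx.
final class CorePartLister {

    // Reads the docx as a plain zip stream and returns matching entry names
    // in the order they appear in the archive.
    public static List<String> listCoreParts(InputStream docx) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                // Naming pattern is an assumption based on the sample file.
                if (name.startsWith("docProps/core") && name.endsWith(".xml")) {
                    names.add(name);
                }
            }
        }
        return names;
    }
}
```

A result with more than one name suggests the document carries more than one core properties part and is worth a closer look.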
Comment 4 David Fisher 2012-01-27 20:31:22 UTC
Are these the original releases of Word 2008 and 2011, without any updates applied? If you are using templates, were these created with the original release of Word 2011? If so, we did have some problems with PPTX when 2011 first came out; Microsoft acknowledged the bug and made an update release one month later.
Comment 5 Nick Burch 2012-01-30 13:00:48 UTC
Fixed in r1237631, thanks. (I've tweaked some of the comments, logs and exceptions to make it clearer exactly what we're doing and why, and beefed up the tests a bit too)