Summary: [PATCH] Need to support docx files with multiple core properties
Product: POI
Component: XWPF
Reporter: Gregg Morris <gregg.morris>
Assignee: POI Developers List <dev>
Description Gregg Morris 2012-01-27 05:28:25 UTC
Created attachment 28215: Small document containing more than one core properties element.
In order to comply with rule M4.1 ("A format consumer shall consider more than one core properties relationship for a package to be an error"), POI throws an exception when you attempt to open a Word docx file that violates this rule. Unfortunately, Word 2008 and 2011 for Macintosh create, save, and open files with multiple core properties. I do not have easy access to Windows versions of Word, so I don't know what happens when this file is read into Word and saved out again.
This is using the latest 3.8beta5 version of POI. I have attached a small Word docx file ("base.docx") that demonstrates the problem.
I have created a patch that relaxes this compliance check. Instead of throwing an exception, I take the first core properties encountered in the file and silently ignore any subsequent core properties in the file.
I completely understand if this patch is not accepted. I've modified the code to explicitly violate a clear rule in the standard. But I need to support my users, which means I need to support these non-conforming files.
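To make the behavioural change in the description concrete, here is a minimal sketch in plain Java. It is not the attached patch and does not use POI's API: the class `CorePropertiesSelection`, both method names, and the idea of passing the core properties relationship targets as a list of strings are assumptions made purely for illustration.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical illustration only; this is not code from POI or from the patch. */
final class CorePropertiesSelection {

    /**
     * Strict reading of rule M4.1: more than one core properties
     * relationship is treated as an error.
     */
    static Optional<String> selectStrict(List<String> coreTargets) {
        if (coreTargets.size() > 1) {
            throw new IllegalStateException(
                "M4.1: expected at most one core properties relationship, found "
                    + coreTargets.size());
        }
        return coreTargets.stream().findFirst();
    }

    /**
     * Relaxed behaviour as described in this report: keep the first
     * core properties target and silently ignore any later ones.
     */
    static Optional<String> selectLenient(List<String> coreTargets) {
        return coreTargets.stream().findFirst();
    }
}
```

With two targets in the list, `selectStrict` throws and `selectLenient` returns the first one. Whether the first entry is the right one to keep is a separate question that the sketch does not answer.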
Comment 1 Gregg Morris 2012-01-27 05:32:19 UTC
Created attachment 28216: Patch file generated by ant.
This is a patch file generated by ant. There are changes to two files: src/ooxml/java/org/apache/poi/openxml4j/opc/OPCPackage.java and src/ooxml/testcases/org/apache/poi/openxml4j/opc/compliance/TestOPCComplianceCoreProperties.java.
Comment 2 Nick Burch 2012-01-27 11:50:02 UTC
It'd be good to confirm that the first core properties part is the one to use, and not, say, the last one. For one of these problem files, would it be possible for you to unzip the .docx file (it's a zip of XML files) and manually edit the XML for the 1st and 2nd core properties? Make some changes so that you can identify which one a value is coming from, then zip it up and load that file in Word. Using Word 2008, 2011 and 2007, check which set of properties is actually seen by Word. Assuming Mac and Windows see the same one, once we know whether the first or the last is used, we can tweak your patch if needed, apply it, and unit test as appropriate.
Comment 3 Gregg Morris 2012-01-27 19:11:29 UTC
I tested both Word 2008 and 2011 for Mac (the only ones I have access to). In the docProps directory of the unzipped docx file, there are two files, "core.xml" and "core1.xml". In both versions of Word, "core.xml" is the "current" one. If I make changes via the File -> Properties dialog, they appear in the "core.xml" file. Interestingly, the "core1.xml" file contains what looks like valid data, but it's "old". It appears to be from an earlier version of the document, maybe? I sure wish I knew how to tell my authors to avoid creating it!
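For anyone who wants to check a file for the extra part without unzipping it by hand, the following is a small sketch that lists the entries under docProps/ in a .docx. It uses only java.util.zip, not POI; the class name `DocPropsLister` and the method `listDocProps` are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper; relies only on java.util.zip, not on POI. */
final class DocPropsLister {

    /** Returns the names of all zip entries under docProps/, in archive order. */
    static List<String> listDocProps(InputStream docx) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            // Walk the archive entry by entry and keep only docProps/ parts.
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().startsWith("docProps/")) {
                    names.add(entry.getName());
                }
            }
        }
        return names;
    }
}
```

Calling `DocPropsLister.listDocProps(new java.io.FileInputStream("base.docx"))` on a file like the one described here should include both docProps/core.xml and docProps/core1.xml in the result, assuming the archive is laid out as reported above. Note that this only lists file names; it does not read the package relationships, so it cannot tell which part each relationship points at.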
Comment 4 David Fisher 2012-01-27 20:31:22 UTC
Are these the original releases of these Word versions, rather than updated ones? If you are using templates, were these created from the original release of Word 2011? If so, we did have some problems with PPTX when 2011 first came out; Microsoft acknowledged the bug and made an update release one month later.