Bug 56814 - Replace use of Dom4J with JAXP
Summary: Replace use of Dom4J with JAXP
Alias: None
Product: POI
Classification: Unclassified
Component: OPC (show other bugs)
Version: 3.11-dev
Hardware: PC All
: P2 normal (vote)
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Reported: 2014-08-04 21:09 UTC by Nick Burch
Modified: 2014-08-13 06:54 UTC (History)
0 users

Required changes to replace dom4j with JAXP (70.38 KB, patch)
2014-08-10 00:39 UTC, Yaniv Kunda
Details | Diff
Required changes to replace dom4j with JAXP - without ThreadLocal (70.17 KB, patch)
2014-08-10 13:20 UTC, Yaniv Kunda
Details | Diff
Required changes to replace dom4j with JAXP - without ThreadLocal, including a test (74.57 KB, patch)
2014-08-10 14:38 UTC, Yaniv Kunda
Details | Diff
56814.patch (89.77 KB, patch)
2014-08-10 21:23 UTC, Uwe Schindler (ASF)
Details | Diff
56814.patch (69.86 KB, patch)
2014-08-11 08:21 UTC, Uwe Schindler (ASF)
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Nick Burch 2014-08-04 21:09:18 UTC
When the OPC code was first added, it only required Java 1.4, so needed external XML jars in the form of stax-api and dom4j. Now that we're on Java 1.6, we've already been able to drop the stax-api jar. It may also be possible to drop the dom4j one too, with some minor package name changes. If so, that'd save us one dependency, and let us solve some generics warnings (dom4j 1.6.x is Java 1.4 compatible so no generics)

Depending on exactly how major a change this turns out to be, it might be possible to do in 3.11, or might be better left for a 3.12 change
Comment 1 Yaniv Kunda 2014-08-10 00:39:43 UTC
Created attachment 31890 [details]
Required changes to replace dom4j with JAXP
Comment 2 Uwe Schindler (ASF) 2014-08-10 10:06:14 UTC
+1 !

Why do we need the ThreadLocal for the DocumentBuilder? Can we not configure the DocumentBuilderFactory and return a *new* DocumentBuilder each time? Thread locals have some side effects and should be avoided if not really needed.
Comment 3 Yaniv Kunda 2014-08-10 10:34:38 UTC
It's a precaution as neither DocumentBuilderFactory nor DocumentBuilder are guaranteed to be thread-safe, plus it duals as a cache.
But I agree that this needs to be more carefully examined, and perhaps starting with a static DocumentBuilderFactory spawning new DocumentBuilder objects on each call is the way to go.
Performance wise, I do believe we have more pressing issues than creation of DocumentBuilders and their factories...
Comment 4 Yaniv Kunda 2014-08-10 13:20:54 UTC
Created attachment 31893 [details]
Required changes to replace dom4j with JAXP - without ThreadLocal
Comment 5 Yaniv Kunda 2014-08-10 13:25:00 UTC
Eventually I did some testing and as long as it is not changed from different threads, a DocumentBuilderFactory can produce new DocumentBuilder objects from different threads without a problem.
Regarding DocumentBuilder, for the purpose of parsing I created a new one for each invocation of SAXHelper.readSAXDocument, but for DocumentHelper.newDocument I used a static DocumentBuilder, as it does change its state, and only used for that purpose.
Comment 6 Uwe Schindler (ASF) 2014-08-10 13:29:44 UTC

Looks fine. The problem with DocumentBuilderFactory is more that it is not thread safe if you change properties. But if you have a correctly conbfigured factory and just call newDocumentBuilder() it should be thread safe.

BTW: The warnings about missing thread safety were removed from Javadocs after Java 1.5. So I assume that newDocumentBuilder are thread safe. If you canr, wrap the newDocumentBuilder with a static synchronized block, that would be most safe: "static synchronized getDocumentBuilder()" (in both SAXHelper and the other factory class). Since Java 1.6 the synchronized does not matter at all if there is no real concurrency involved (optimized away by Hotspot).
Comment 7 Yaniv Kunda 2014-08-10 14:38:20 UTC
Created attachment 31894 [details]
Required changes to replace dom4j with JAXP - without ThreadLocal, including a test

Added a test to check for concurrency issues - at least for SAXHelper, if I change the implementation to return a static DocumentBuilder, it fails with org.xml.sax.SAXException: FWK005 parse may not be called while parsing.

I could not manage to fail DocumentHelper.createDocument() :-)
Comment 8 Uwe Schindler (ASF) 2014-08-10 14:43:41 UTC
I would still make it synchronized, then we are safe!
Comment 9 Uwe Schindler (ASF) 2014-08-10 18:07:23 UTC
I think we should put this into 3.11, too. I will apply the patch locally and do my tests with it. If all goes fine, I would be very happy to get rid of dom4j. This makes using of POI complicated, because it may conflict with other uses of dom4j. Also, dom4j is more or less inactive, because it has no features that are not available in the official W3C DOM APIs. The only nice thing with dom4j is the more "Java-like" API, but this is no longer true, because it lacks generics support.
Comment 10 Uwe Schindler (ASF) 2014-08-10 21:23:02 UTC
Created attachment 31895 [details]

I reviewed the patch and made some changes:

- In StreamHelper, I set OutputKeys in the identity transformer, to prettyprint (as done by dom4j) and also ste standalone attribute (as required by openxml).
- In StreamHelper I prevented the transformer from closing the outputstream (using a CloseShieldOutputStream like in TIKA, unfortunately it was not available to POI). The problem: some transformers close the outputstream (from StreamResult), which would break ZIPOutputStream.
- I removed stax.Namespace from the properties parser, useless there, as namespace prefixes are not used while parsing
- I moved the methods to add xmlns attributes from POI to openxml, because this would have added a wrong relationship. It is now in DocumentHelper
- I fixed a incorrect namespaceprefix in the coreProperties. openxml document should use "cp" prefix. 
- I removed the thread safety test again, because I made the methods that use the factories synchronized. This has no overhead in Java 6, if not used from many threads.
Comment 11 Uwe Schindler (ASF) 2014-08-11 08:21:35 UTC
Created attachment 31898 [details]

New patch, because the previous one had Line-feed issues (unfortunately POI does not have svn:eol-style:native on all text files.
Comment 12 Yaniv Kunda 2014-08-11 11:16:06 UTC
I intentionally left the coreProperties prefix in PackagePropertiesMarshaller empty because I've noticed it is used as the default namespace when writing - I'm not sure if that matters.
Comment 13 Uwe Schindler (ASF) 2014-08-11 11:38:40 UTC
I checked this. This was an error. In official openxml documents as created e.g. by Word 2010, the coreproperties prefix is marked as "cp:". There is no default namespace.
Comment 14 Uwe Schindler (ASF) 2014-08-11 11:41:42 UTC
Is there anything against committing this for now? Nick already mentioned, that we should get this in BETA2, too.
To me the patch looks fine, we can improve later. Locally, I just changed some *-imports to be explicit, otherwise its the same as attached patch and ready to commit.
Comment 15 Yaniv Kunda 2014-08-11 16:45:33 UTC
+1 on the synchronization - my test was good for verifying the JRE running it was safe, not for every user's JRE.
Comment 16 Uwe Schindler (ASF) 2014-08-12 07:21:30 UTC
I committed the current state.