Bug 63240

Summary: Use of synchronized in DocumentHelper.newDocumentBuilder
Product: POI Reporter: Niklas Mollenhauer <nikeee>
Component: POI OverallAssignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: normal CC: uwe
Priority: P2    
Version: unspecified   
Target Milestone: ---   
Hardware: PC   
OS: Linux   

Description Niklas Mollenhauer 2019-03-06 17:24:34 UTC
We are procesing XLSF files in parallel and faced a performance bottleneck. Investigating the issue, it turns out that all threads wait on the synchronized 
DocumentHelper.newDocumentBuilder().

It was introduced in this commit:
https://github.com/apache/poi/commit/eabb6a924be24abb879372d0bc967e0d316b2cf8?diff=split

Related Bug:
https://bz.apache.org/bugzilla/show_bug.cgi?id=56814#c10

It seems that the synchronized was added because it does not have any impact on single-threaded applications, as it will get optimized away. When working multi-threaded, all threads are waiting for the mehtod to finish.

This is the related code section:
https://github.com/apache/poi/blame/trunk/src/ooxml/java/org/apache/poi/ooxml/util/DocumentHelper.java#L89-L98

The newDocumentBuilder does not seem to mutate any global state as it only creates a local variable which is returned afterwards.

Is there any special reason for this synchronized? Can this be removed? Are there workarounds we can use with the current version of POI?

This is the thread dump/call stack:
> "thread-2" - Thread t@845
>    java.lang.Thread.State: BLOCKED
> 	at org.apache.poi.util.DocumentHelper.newDocumentBuilder(DocumentHelper.java:87)
> 	- waiting to lock <246a71d6> (a java.lang.Class) owned by "thead-1" t@806
> 	at org.apache.poi.util.DocumentHelper.readDocument(DocumentHelper.java:140)
> 	at org.apache.poi.openxml4j.opc.PackageRelationshipCollection.parseRelationshipsPart(PackageRelationshipCollection.java:308)
> 	at org.apache.poi.openxml4j.opc.PackageRelationshipCollection.<init>(PackageRelationshipCollection.java:167)
> 	at org.apache.poi.openxml4j.opc.PackageRelationshipCollection.<init>(PackageRelationshipCollection.java:135)
> 	at org.apache.poi.openxml4j.opc.PackagePart.loadRelationships(PackagePart.java:565)
> 	at org.apache.poi.openxml4j.opc.PackagePart.<init>(PackagePart.java:111)
> 	at org.apache.poi.openxml4j.opc.PackagePart.<init>(PackagePart.java:82)
> 	at org.apache.poi.openxml4j.opc.PackagePart.<init>(PackagePart.java:127)
> 	at org.apache.poi.openxml4j.opc.ZipPackagePart.<init>(ZipPackagePart.java:79)
> 	at org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:335)
> 	at org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:756)
> 	at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:327)
> 	at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:184)
> 	at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:149)
Comment 1 PJ Fanning 2019-03-06 18:07:10 UTC
The DocumentBuilderFactory is package private so you only access it if you create a class in org.apache.poi.ooxml.util package.

You can then create your own version of the newDocumentBuilder method.

I think we can change the behaviour in POI but it could be weeks before our next release and this change would need to be load tested.
Comment 2 Niklas Mollenhauer 2019-03-06 18:23:51 UTC
Well, we are only using this indirectly, because it is used in a call of WorkbookFactory.create (as seen in the provided call stack). I don't think we can tell POI to use a different class.

The current workaround is caching the workbooks. However, this is not a suitable long-term solution, due to other constraints. Having this issue fixed in the next release in some weeks would suffice.
Comment 3 PJ Fanning 2019-03-06 18:37:18 UTC
updated using https://svn.apache.org/viewvc?view=revision&revision=1854935