We are procesing XLSF files in parallel and faced a performance bottleneck. Investigating the issue, it turns out that all threads wait on the synchronized DocumentHelper.newDocumentBuilder(). It was introduced in this commit: https://github.com/apache/poi/commit/eabb6a924be24abb879372d0bc967e0d316b2cf8?diff=split Related Bug: https://bz.apache.org/bugzilla/show_bug.cgi?id=56814#c10 It seems that the synchronized was added because it does not have any impact on single-threaded applications, as it will get optimized away. When working multi-threaded, all threads are waiting for the mehtod to finish. This is the related code section: https://github.com/apache/poi/blame/trunk/src/ooxml/java/org/apache/poi/ooxml/util/DocumentHelper.java#L89-L98 The newDocumentBuilder does not seem to mutate any global state as it only creates a local variable which is returned afterwards. Is there any special reason for this synchronized? Can this be removed? Are there workarounds we can use with the current version of POI? This is the thread dump/call stack: > "thread-2" - Thread t@845 > java.lang.Thread.State: BLOCKED > at org.apache.poi.util.DocumentHelper.newDocumentBuilder(DocumentHelper.java:87) > - waiting to lock <246a71d6> (a java.lang.Class) owned by "thead-1" t@806 > at org.apache.poi.util.DocumentHelper.readDocument(DocumentHelper.java:140) > at org.apache.poi.openxml4j.opc.PackageRelationshipCollection.parseRelationshipsPart(PackageRelationshipCollection.java:308) > at org.apache.poi.openxml4j.opc.PackageRelationshipCollection.<init>(PackageRelationshipCollection.java:167) > at org.apache.poi.openxml4j.opc.PackageRelationshipCollection.<init>(PackageRelationshipCollection.java:135) > at org.apache.poi.openxml4j.opc.PackagePart.loadRelationships(PackagePart.java:565) > at org.apache.poi.openxml4j.opc.PackagePart.<init>(PackagePart.java:111) > at org.apache.poi.openxml4j.opc.PackagePart.<init>(PackagePart.java:82) > at org.apache.poi.openxml4j.opc.PackagePart.<init>(PackagePart.java:127) > at org.apache.poi.openxml4j.opc.ZipPackagePart.<init>(ZipPackagePart.java:79) > at org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:335) > at org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:756) > at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:327) > at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:184) > at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:149)
The DocumentBuilderFactory is package private so you only access it if you create a class in org.apache.poi.ooxml.util package. You can then create your own version of the newDocumentBuilder method. I think we can change the behaviour in POI but it could be weeks before our next release and this change would need to be load tested.
Well, we are only using this indirectly, because it is used in a call of WorkbookFactory.create (as seen in the provided call stack). I don't think we can tell POI to use a different class. The current workaround is caching the workbooks. However, this is not a suitable long-term solution, due to other constraints. Having this issue fixed in the next release in some weeks would suffice.
updated using https://svn.apache.org/viewvc?view=revision&revision=1854935