Bug 56648

Summary: ContainerBase.addChild blocks all user requests while single webapp is being deployed
Product: Tomcat 6
Reporter: Volker Kleinschmidt <vkleinschmidt>
Component: Catalina
Assignee: Tomcat Developers Mailing List <dev>
Severity: major    
Priority: P2    
Version: 6.0.35   
Target Milestone: default   
Hardware: All   
OS: All   

Description Volker Kleinschmidt 2014-06-19 21:24:58 UTC
While deploying a complex webapp that took a very long time to initialize (using Spring, lots of filesystem overhead, don't ask), we found that all user requests got blocked in ApplicationContext.getContext() while trying to determine the current web application via ContainerBase.findChild(). findChild() synchronizes on the HashMap ContainerBase.children, which was locked by the addChild() call that was doing the webapp deployment:

	- locked <0x000000055070f778> (a java.util.HashMap)
	at org.apache.catalina.core.ContainerBase.access$000(ContainerBase.java:124)
	at org.apache.catalina.core.ContainerBase$PrivilegedAddChild.run(ContainerBase.java:146)
	at java.security.AccessController.doPrivileged(Native Method)
	at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:777)
	at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:601)
	at blackboard.tomcat.servletcontainer.TomcatContainerAdapter.registerWebApp(TomcatContainerAdapter.java:200)

I found that addChild() synchronizes on this collection for so long because it wants to know whether the child container deployed successfully before adding it to the collection, and it checks at the beginning whether the child is already in the collection before trying to deploy it. Both of those seem like valid reasons; however, it is clearly unacceptable to lock the collection (and thus block all user requests!) while deploying a new webapp, which can take any amount of time, including hanging on OS/filesystem locks.

Here's what all those other threads looked like:
<thread details here> waiting for monitor entry [0x0000000058b89000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.apache.catalina.core.ContainerBase.findChild(ContainerBase.java:855)
	- waiting to lock <0x000000055070f778> (a java.util.HashMap)
	at org.apache.catalina.core.ApplicationContext.getContext(ApplicationContext.java:211)
	at sun.reflect.GeneratedMethodAccessor524.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.catalina.core.ApplicationContextFacade$1.run(ApplicationContextFacade.java:456)
	at java.security.AccessController.doPrivileged(Native Method)
	at org.apache.catalina.core.ApplicationContextFacade.executeMethod(ApplicationContextFacade.java:454)
	at org.apache.catalina.core.ApplicationContextFacade.invokeMethod(ApplicationContextFacade.java:402)
	at org.apache.catalina.core.ApplicationContextFacade.doPrivileged(ApplicationContextFacade.java:374)
	at org.apache.catalina.core.ApplicationContextFacade.getContext(ApplicationContextFacade.java:122)
...<varying application code here>...

I see two possible approaches to address this:

A) Use a thread-safe ConcurrentHashMap instead of synchronizing on a plain old HashMap. This means we'd need to be sure that the picture of which child contexts are available at a given time doesn't always have to be consistent among all threads. I cannot judge that.
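
For illustration only, option A could look roughly like the sketch below. ConcurrentChildRegistry, its method signatures and the Runnable standing in for the child's startup are hypothetical names made up for this example; none of this is Tomcat code. It assumes it is acceptable for lookups to simply not see a child until its deployment has finished.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-in for ContainerBase's child bookkeeping (not Tomcat code).
class ConcurrentChildRegistry {
    private final ConcurrentMap<String, Object> children =
            new ConcurrentHashMap<String, Object>();

    // Lookup takes no monitor, so request threads never wait on a deployment.
    Object findChild(String name) {
        return name == null ? null : children.get(name);
    }

    // 'start' stands in for the (possibly very slow) child startup.
    void addChild(String name, Object child, Runnable start) {
        if (children.containsKey(name)) {
            throw new IllegalArgumentException("Child name '" + name + "' is not unique");
        }
        start.run(); // slow part runs without holding any lock
        // Publish atomically; a concurrent deployment of the same name loses here.
        if (children.putIfAbsent(name, child) != null) {
            throw new IllegalArgumentException("Child name '" + name + "' is not unique");
        }
    }
}
```

Note that in this sketch two threads could both start a child with the same name before one of them is rejected by putIfAbsent(), which is exactly the kind of consistency question raised above.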

B) Fix addChild to use a flag to mark the child as initializing, and check the flag at the beginning, after verifying that it's not yet in the collection. If it's not there yet, and the flag isn't set yet, set the flag, then try to deploy the webapp. Once that's successful, add it to the collection, then unset the flag. Here we'd need to synchronize on the collection only very briefly - for the check at the start and for the addition at the end. The rest of the code would just need to synchronize on the flag, or on the child object itself, not on the collection, which it is not interacting with in any way while deploying the webapp.
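
A minimal sketch of option B, under the same caveats: FlaggedChildRegistry, the "initializing" set used as the flag and the Runnable standing in for the child's startup are hypothetical, chosen only to show where the short synchronized sections would go. It is not Tomcat code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for ContainerBase's child bookkeeping (not Tomcat code).
class FlaggedChildRegistry {
    private final Map<String, Object> children = new HashMap<String, Object>();
    // Names currently being deployed; plays the role of the "initializing" flag.
    private final Set<String> initializing = new HashSet<String>();

    Object findChild(String name) {
        synchronized (children) { // held only for the map lookup
            return children.get(name);
        }
    }

    // 'start' stands in for the (possibly very slow) child startup.
    void addChild(String name, Object child, Runnable start) {
        synchronized (children) { // brief: uniqueness check and flag set
            if (children.containsKey(name) || !initializing.add(name)) {
                throw new IllegalArgumentException("Child name '" + name + "' is not unique");
            }
        }
        try {
            start.run(); // slow part runs without the collection lock
            synchronized (children) { // brief: publish the started child
                children.put(name, child);
            }
        } finally {
            synchronized (children) { // clear the flag on success or failure
                initializing.remove(name);
            }
        }
    }
}
```

With this shape, findChild() only ever waits for a map lookup or a single put, and a failed startup leaves neither an entry nor a stale flag behind.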
Comment 1 Volker Kleinschmidt 2014-06-20 19:14:49 UTC
Never mind, grepcode shows that this issue was actually already fixed back in Tomcat 7.0.5.

I just haven't seen any reports of it anywhere.
Comment 2 Mark Thomas 2014-07-07 21:45:56 UTC
The 7.0.x fix to be reviewed for backport is http://svn.apache.org/viewvc?view=revision&revision=1036918
Comment 3 Mark Thomas 2014-07-09 11:47:44 UTC
Port of fix proposed for 6.0.x.
Comment 4 Mark Thomas 2014-07-29 09:13:18 UTC
This has been fixed in 6.0.x for 6.0.42 onwards.
Comment 5 Volker Kleinschmidt 2014-08-03 02:58:28 UTC
Thanks a bunch!