Bug 54086 - ConcurrentModificationException in NioReceiver on shutdown
Summary: ConcurrentModificationException in NioReceiver on shutdown
Status: RESOLVED FIXED
Alias: None
Product: Tomcat 7
Classification: Unclassified
Component: Cluster (show other bugs)
Version: 7.0.41
Hardware: Sun Solaris
: P2 normal (vote)
Target Milestone: ---
Assignee: Tomcat Developers Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-11-01 13:32 UTC by Casey Lucas
Modified: 2013-07-01 13:09 UTC (History)
0 users



Attachments
fix for bug (3.40 KB, patch)
2013-06-26 20:07 UTC, Casey Lucas
Details | Diff
fix for bug (3.30 KB, patch)
2013-06-26 20:17 UTC, Casey Lucas
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Casey Lucas 2012-11-01 13:32:28 UTC
Solaris 10 x86, jdk 1.7.

We use tomcat clustering for session replication with 4 nodes and sometimes 8 nodes.  We get a ConcurrentModificationException occasionally on shutdown.  I have been unable to reliably reproduce the exception. In the log, I see "Unable to close cluster receiver selector." with the exception below:

java.util.ConcurrentModificationException 
    java.util.HashMap$HashIterator.nextEntry(HashMap.java:894) 
    java.util.HashMap$KeyIterator.next(HashMap.java:928) 
    java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1067) 
    org.​apache.​catalina.​tribes.​transport.​nio.​NioReceiver.​closeSelector(​NioReceiver.​java:382) 
    org.​apache.​catalina.​tribes.​transport.​nio.​NioReceiver.​stopListening(​NioReceiver.​java:365) 
    org.apache.catalina.tribes.transport.nio.NioReceiver.stop(NioReceiver.java:86) 
    org.​apache.​catalina.​tribes.​group.​ChannelCoordinator.​internalStop(​ChannelCoordinator.java:203) 
    org.​apache.​catalina.​tribes.​group.​ChannelCoordinator.​stop(​ChannelCoordinator.​java:115) 
    org.​apache.​catalina.​tribes.​group.​ChannelInterceptorBase.​stop(​ChannelInterceptorBase.java:178) 
    org.​apache.​catalina.​tribes.​group.​ChannelInterceptorBase.​stop(​ChannelInterceptorBase.java:178) 
    org.​apache.​catalina.​tribes.​group.​ChannelInterceptorBase.​stop(​ChannelInterceptorBase.java:178) 
    org.​apache.​catalina.​tribes.​group.​interceptors.​MessageDispatchInterceptor.​stop(​MessageDispatchInterceptor.java:172)
    org.​apache.​catalina.​tribes.​group.​ChannelInterceptorBase.​stop(​ChannelInterceptorBase.java:178) 
    org.​apache.​catalina.​tribes.​group.​ChannelInterceptorBase.​stop(​ChannelInterceptorBase.java:178) 
    org.apache.catalina.tribes.group.GroupChannel.stop(GroupChannel.java:438) 
    org.​apache.​catalina.​ha.​tcp.​SimpleTcpCluster.​stopInternal(​SimpleTcpCluster.​java:​744) 
    org.apache.catalina.util.LifecycleBase.stop(LifecycleBase.java:232) 
    org.apache.catalina.core.ContainerBase.stopInternal(ContainerBase.java:1199) 
    org.apache.catalina.util.LifecycleBase.stop(LifecycleBase.java:232) 
    org.apache.catalina.core.StandardService.stopInternal(StandardService.java:502) 
    org.apache.catalina.util.LifecycleBase.stop(LifecycleBase.java:232) 
    org.apache.catalina.core.StandardServer.stopInternal(StandardServer.java:753) 
    org.apache.catalina.util.LifecycleBase.stop(LifecycleBase.java:232) 
    org.apache.catalina.startup.Catalina.stop(Catalina.java:751) 
    org.apache.catalina.startup.Catalina.start(Catalina.java:713) 
    sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
    sun.​reflect.​DelegatingMethodAccessorImpl.​invoke(​DelegatingMethodAccessorImpl.​java:43) 
    java.lang.reflect.Method.invoke(Method.java:601) 
    org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:322) 
    org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:451)

I looked at the code briefly and noticed the use of a SelectionKey Iterator.  I have not dug deep enough to find any issue in the NioReceiver code, but I did find this potentially relevant text in the Selector javadocs (http://docs.oracle.com/javase/7/docs/api/java/nio/channels/Selector.html):

"A selector's key and selected-key sets are not, in general, safe for use by multiple concurrent threads. If such a thread might modify one of these sets directly then access should be controlled by synchronizing on the set itself. The iterators returned by these sets' iterator methods are fail-fast: If the set is modified after the iterator is created, in any way except by invoking the iterator's own remove method, then a ConcurrentModificationException will be thrown."
Comment 1 Mark Thomas 2012-11-05 00:25:32 UTC
Thanks for the report. I did see a code path that would result in two threads both trying to close the Selector. This, and some other threading issues, have been fixed in trunk and 7.0.x and will be included in 7.0.33 onwards.

I can't be 100% certain that the problem I found is the one you were seeing but it looks very much like it.
Comment 2 Casey Lucas 2013-06-26 20:03:09 UTC
I verified that this problem still exists in 7.0.41.  I tracked down and fixed the problem.

Problem:
During shutdown, NioReceiver.close() is called which ultimately calls NioReceiver.closeSelector(). selector.keys() is called and an iteration is performed - closing the existing channels and canceling the keys.  Per the Selector javadocs, the set cannot be modified during iteration by another thread or else a ConcurrentModificationException will be thrown.  The set is in fact being modified by another thread (NioReceiver thread) when existing connections are closed (see listen() and canceledKey(SelectionKey) methods).

The problem is difficult to reproduce because of timing. I was able to reliably reproduce the problem but I had to add a sleep in the closeSelector() method and or set breakpoints during testing.  Ultimately, I couldn't come up with a test case that didn't need a sleep in closeSelector() so I don't have a test case for the attached patch.

The attached patch loops (up to 3 times) catching ConcurrentModificationException during close and sleeps for a brief period (100ms * attempt number) allowing the NioReceiver thread to finish it's loop. After the sleep, it attempts to close channels again.
Comment 3 Casey Lucas 2013-06-26 20:07:37 UTC
Created attachment 30489 [details]
fix for bug
Comment 4 Casey Lucas 2013-06-26 20:17:54 UTC
Created attachment 30490 [details]
fix for bug

i inadvertently left in a printStackTrace on the original patch. use this one instead.
Comment 5 Mark Thomas 2013-07-01 10:36:35 UTC
(In reply to Casey Lucas from comment #2)
> Problem:
> During shutdown, NioReceiver.close()

There is no such method. I'm going to assume you mean stop().
Comment 6 Mark Thomas 2013-07-01 11:11:18 UTC
Thanks for the analysis and patch.

I went with a different approach as I generally prefer to tackle the root cause rather than handle the symptoms of a problem.

Providing that the receiver thread stops in a timely manner, the patch prevents multiple threads accessing the key set.
Comment 7 Casey Lucas 2013-07-01 13:09:45 UTC
(In reply to Mark Thomas from comment #5)
> (In reply to Casey Lucas from comment #2)
> > Problem:
> > During shutdown, NioReceiver.close()
> 
> There is no such method. I'm going to assume you mean stop().

Yes, stop.  Thanks for the fix.