Bug 64007

Summary: a deadlock with Poller run and cancelledKey
Product: Tomcat 9 Reporter: echo <langren_qing>
Component: CatalinaAssignee: Tomcat Developers Mailing List <dev>
Status: RESOLVED FIXED    
Severity: normal    
Priority: P2    
Version: 9.0.26   
Target Milestone: -----   
Hardware: PC   
OS: Linux   

Description echo 2019-12-17 07:16:44 UTC
when i press test my application with spring boot 2.1.9.RELEASE and tomcat 9.0.26 i meet a deadlock. It's maybe a bug with tomcat. when i use tomcat 8.5.49 instead the bug is not appear. The follow is the stack i dump from my application.

Found one Java-level deadlock:
=============================
"http-nio-8080-exec-455":
  waiting to lock monitor 0x00007f59c40475e8 (object 0x00000006c0b4fe28, a java.util.HashSet),
  which is held by "http-nio-8080-ClientPoller"
"http-nio-8080-ClientPoller":
  waiting to lock monitor 0x00007f59680af4d8 (object 0x00000006de810f10, a java.lang.Object),
  which is held by "http-nio-8080-exec-215"
"http-nio-8080-exec-215":
  waiting to lock monitor 0x00007f59c4057a98 (object 0x00000006c0b27028, a java.util.HashSet),
  which is held by "http-nio-8080-BlockPoller"
"http-nio-8080-BlockPoller":
  waiting to lock monitor 0x00007f5998946158 (object 0x00000006de8588c0, a java.lang.Object),
  which is held by "http-nio-8080-exec-378"
"http-nio-8080-exec-378":
  waiting to lock monitor 0x00007f59c40475e8 (object 0x00000006c0b4fe28, a java.util.HashSet),
  which is held by "http-nio-8080-ClientPoller"

Java stack information for the threads listed above:
===================================================
"http-nio-8080-exec-455":
    at java.nio.channels.spi.AbstractSelector.cancel(AbstractSelector.java:91)
    - waiting to lock <0x00000006c0b4fe28> (a java.util.HashSet)
    at java.nio.channels.spi.AbstractSelectionKey.cancel(AbstractSelectionKey.java:73)
    - locked <0x00000006de870a68> (a sun.nio.ch.SelectionKeyImpl)
    at java.nio.channels.spi.AbstractSelectableChannel.implCloseChannel(AbstractSelectableChannel.java:240)
    - locked <0x00000006de870a90> (a java.lang.Object)
    at java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:115)
    - locked <0x00000006de870a28> (a java.lang.Object)
    at org.apache.tomcat.util.net.NioChannel.close(NioChannel.java:104)
    at org.apache.tomcat.util.net.NioChannel.close(NioChannel.java:116)
    at org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.doClose(NioEndpoint.java:1179)
    - locked <0x00000006c3ca9be8> (a org.apache.tomcat.util.net.NioChannel)
    at org.apache.tomcat.util.net.SocketWrapperBase.close(SocketWrapperBase.java:394)
    at org.apache.tomcat.util.net.NioEndpoint$Poller.cancelledKey(NioEndpoint.java:667)
    at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1592)
    at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
    - locked <0x00000006de825cc0> (a org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    at java.lang.Thread.run(Thread.java:748)
"http-nio-8080-ClientPoller":
    at java.nio.channels.spi.AbstractSelectableChannel.removeKey(AbstractSelectableChannel.java:130)
    - waiting to lock <0x00000006de810f10> (a java.lang.Object)
    at java.nio.channels.spi.AbstractSelector.deregister(AbstractSelector.java:188)
    at sun.nio.ch.EPollSelectorImpl.implDereg(EPollSelectorImpl.java:191)
    at sun.nio.ch.SelectorImpl.processDeregisterQueue(SelectorImpl.java:149)
    - locked <0x00000006c0b4fe28> (a java.util.HashSet)
    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:97)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
    - locked <0x00000006c0b50068> (a sun.nio.ch.Util$3)
    - locked <0x00000006c0b50058> (a java.util.Collections$UnmodifiableSet)
    - locked <0x00000006c0b4fdd0> (a sun.nio.ch.EPollSelectorImpl)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
    at org.apache.tomcat.util.net.NioEndpoint$Poller.run(NioEndpoint.java:708)
    at java.lang.Thread.run(Thread.java:748)
"http-nio-8080-exec-215":
    at java.nio.channels.spi.AbstractSelector.cancel(AbstractSelector.java:91)
    - waiting to lock <0x00000006c0b27028> (a java.util.HashSet)
    at java.nio.channels.spi.AbstractSelectionKey.cancel(AbstractSelectionKey.java:73)
    - locked <0x00000006de810c70> (a sun.nio.ch.SelectionKeyImpl)
    at java.nio.channels.spi.AbstractSelectableChannel.implCloseChannel(AbstractSelectableChannel.java:240)
    - locked <0x00000006de810f10> (a java.lang.Object)
    at java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:115)
    - locked <0x00000006de810c30> (a java.lang.Object)
    at org.apache.tomcat.util.net.NioChannel.close(NioChannel.java:104)
    at org.apache.tomcat.util.net.NioChannel.close(NioChannel.java:116)
    at org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.doClose(NioEndpoint.java:1179)
    - locked <0x00000006c2d7b8a0> (a org.apache.tomcat.util.net.NioChannel)
    at org.apache.tomcat.util.net.SocketWrapperBase.close(SocketWrapperBase.java:394)
    at org.apache.tomcat.util.net.NioEndpoint$Poller.cancelledKey(NioEndpoint.java:667)
    at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1592)
    at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
    - locked <0x00000006de810c98> (a org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    at java.lang.Thread.run(Thread.java:748)
"http-nio-8080-BlockPoller":
    at java.nio.channels.spi.AbstractSelectableChannel.removeKey(AbstractSelectableChannel.java:130)
    - waiting to lock <0x00000006de8588c0> (a java.lang.Object)
    at java.nio.channels.spi.AbstractSelector.deregister(AbstractSelector.java:188)
    at sun.nio.ch.EPollSelectorImpl.implDereg(EPollSelectorImpl.java:191)
    at sun.nio.ch.SelectorImpl.processDeregisterQueue(SelectorImpl.java:149)
    - locked <0x00000006c0b27028> (a java.util.HashSet)
    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:90)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
    - locked <0x00000006c0b27108> (a sun.nio.ch.Util$3)
    - locked <0x00000006c0b270f8> (a java.util.Collections$UnmodifiableSet)
    - locked <0x00000006c0b26fc0> (a sun.nio.ch.EPollSelectorImpl)
    at sun.nio.ch.SelectorImpl.selectNow(SelectorImpl.java:105)
    at org.apache.tomcat.util.net.NioBlockingSelector$BlockPoller.run(NioBlockingSelector.java:310)
"http-nio-8080-exec-378":
    at java.nio.channels.spi.AbstractSelector.cancel(AbstractSelector.java:91)
    - waiting to lock <0x00000006c0b4fe28> (a java.util.HashSet)
    at java.nio.channels.spi.AbstractSelectionKey.cancel(AbstractSelectionKey.java:73)
    - locked <0x00000006de858648> (a sun.nio.ch.SelectionKeyImpl)
    at java.nio.channels.spi.AbstractSelectableChannel.implCloseChannel(AbstractSelectableChannel.java:240)
    - locked <0x00000006de8588c0> (a java.lang.Object)
    at java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:115)
    - locked <0x00000006de8585e0> (a java.lang.Object)
    at org.apache.tomcat.util.net.NioChannel.close(NioChannel.java:104)
    at org.apache.tomcat.util.net.NioChannel.close(NioChannel.java:116)
    at org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.doClose(NioEndpoint.java:1179)
    - locked <0x00000006c3360628> (a org.apache.tomcat.util.net.NioChannel)
    at org.apache.tomcat.util.net.SocketWrapperBase.close(SocketWrapperBase.java:394)
    at org.apache.tomcat.util.net.NioEndpoint$Poller.cancelledKey(NioEndpoint.java:667)
    at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1592)
    at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
    - locked <0x00000006de858670> (a org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    at java.lang.Thread.run(Thread.java:748)

Found 1 deadlock.
Comment 1 Remy Maucherat 2019-12-17 10:34:43 UTC
NIO deadlocks internally in your trace while trying to close multiple independent channels.

Some of the structures that might help the deadlock, in particular the BlockPoller, are no longer used in trunk.
Comment 2 echo 2019-12-17 13:34:23 UTC
but, in my opinion. just the NioEndpoint$Poller.cancelledKey and NioEndpoint$Poller.run may cause the deadlock.  Poller.run finally call AbstractSelector.cancel and Poller.cancelledKey finally call AbstractSelector.deregister. The two method all need the conditions keyLock and cancelledKeys.
Comment 3 Remy Maucherat 2019-12-17 14:25:56 UTC
Ok, so then cancelledKey should cancel the key first, and this would resolve the "problem" as the channel close locks everything before cancelling that key itself. This is what was done like that in 8.5 but this didn't look important.

Can you test it or give me your opinion on it ?
https://github.com/apache/tomcat/commit/9b1a8b67bffe462fc745b19e15ed59c37e2e1dcf
Comment 4 echo 2019-12-18 08:21:28 UTC
I will test first. If there is progress i will recommend agin
Comment 5 Remy Maucherat 2019-12-18 08:35:51 UTC
Ok. If you run again into a deadlock, please add the updated deadlock traces as an attachment to the BZ (otherwise it's harder to read).
Comment 6 Remy Maucherat 2019-12-18 14:11:17 UTC
Looking at the most recent JDK sources, the root issue likely has been fixed by these commits which remove all the unwanted internal NIO syncing:
https://github.com/openjdk/jdk/commit/33b921f25d8b5e6a50003daa6db31176437d29c0
https://github.com/openjdk/jdk/commit/38bdacafbcf3c96ac40e30098c0990f88a9edf8e

I'm ok with having a workaround fix, as it seems the first commit for this NIO bug will be in 11, and the second one in 14.
Not 100% sure if the first one alone is enough to avoid the deadlock, but it might well be. In the trace, if http-nio-8080-exec-215 does not hold the lock in the AbstractSelectableChannel, then http-nio-8080-ClientPoller will be able to get the sync on it.
Comment 7 echo 2019-12-19 01:02:50 UTC
yes. thank you for your patient answer.I don't have a correct way to test. Just pressure test and it's hard to reproduce the bug fixed or not.
Comment 8 Remy Maucherat 2019-12-19 09:11:31 UTC
The issue would have been introduced in 9.0.20, and it's not surprising it's hard to reproduce.

It's interesting that with the new Java dev model they actually do this sort of refactorings, cleanups and improvements on NIO, which is a very old piece of code.
Comment 9 echo 2019-12-19 11:48:27 UTC
with your fixed code i package a new jar and then replace tomcat-embed-core.jar in 9.0.26.and then i pressure my application 3 hours and all is fine. But with the tomcat 9.0.26 press 1 hour the deadlock appear.
Comment 10 Remy Maucherat 2019-12-19 14:21:21 UTC
Ok, so let's say the regression is fixed. The fix will be in Tomcat 9.0.31+, thanks for the help resolving the issue.