when i press test my application with spring boot 2.1.9.RELEASE and tomcat 9.0.26 i meet a deadlock. It's maybe a bug with tomcat. when i use tomcat 8.5.49 instead the bug is not appear. The follow is the stack i dump from my application. Found one Java-level deadlock: ============================= "http-nio-8080-exec-455": waiting to lock monitor 0x00007f59c40475e8 (object 0x00000006c0b4fe28, a java.util.HashSet), which is held by "http-nio-8080-ClientPoller" "http-nio-8080-ClientPoller": waiting to lock monitor 0x00007f59680af4d8 (object 0x00000006de810f10, a java.lang.Object), which is held by "http-nio-8080-exec-215" "http-nio-8080-exec-215": waiting to lock monitor 0x00007f59c4057a98 (object 0x00000006c0b27028, a java.util.HashSet), which is held by "http-nio-8080-BlockPoller" "http-nio-8080-BlockPoller": waiting to lock monitor 0x00007f5998946158 (object 0x00000006de8588c0, a java.lang.Object), which is held by "http-nio-8080-exec-378" "http-nio-8080-exec-378": waiting to lock monitor 0x00007f59c40475e8 (object 0x00000006c0b4fe28, a java.util.HashSet), which is held by "http-nio-8080-ClientPoller" Java stack information for the threads listed above: =================================================== "http-nio-8080-exec-455": at java.nio.channels.spi.AbstractSelector.cancel(AbstractSelector.java:91) - waiting to lock <0x00000006c0b4fe28> (a java.util.HashSet) at java.nio.channels.spi.AbstractSelectionKey.cancel(AbstractSelectionKey.java:73) - locked <0x00000006de870a68> (a sun.nio.ch.SelectionKeyImpl) at java.nio.channels.spi.AbstractSelectableChannel.implCloseChannel(AbstractSelectableChannel.java:240) - locked <0x00000006de870a90> (a java.lang.Object) at java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:115) - locked <0x00000006de870a28> (a java.lang.Object) at org.apache.tomcat.util.net.NioChannel.close(NioChannel.java:104) at org.apache.tomcat.util.net.NioChannel.close(NioChannel.java:116) at org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.doClose(NioEndpoint.java:1179) - locked <0x00000006c3ca9be8> (a org.apache.tomcat.util.net.NioChannel) at org.apache.tomcat.util.net.SocketWrapperBase.close(SocketWrapperBase.java:394) at org.apache.tomcat.util.net.NioEndpoint$Poller.cancelledKey(NioEndpoint.java:667) at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1592) at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49) - locked <0x00000006de825cc0> (a org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) at java.lang.Thread.run(Thread.java:748) "http-nio-8080-ClientPoller": at java.nio.channels.spi.AbstractSelectableChannel.removeKey(AbstractSelectableChannel.java:130) - waiting to lock <0x00000006de810f10> (a java.lang.Object) at java.nio.channels.spi.AbstractSelector.deregister(AbstractSelector.java:188) at sun.nio.ch.EPollSelectorImpl.implDereg(EPollSelectorImpl.java:191) at sun.nio.ch.SelectorImpl.processDeregisterQueue(SelectorImpl.java:149) - locked <0x00000006c0b4fe28> (a java.util.HashSet) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:97) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86) - locked <0x00000006c0b50068> (a sun.nio.ch.Util$3) - locked <0x00000006c0b50058> (a java.util.Collections$UnmodifiableSet) - locked <0x00000006c0b4fdd0> (a sun.nio.ch.EPollSelectorImpl) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97) at org.apache.tomcat.util.net.NioEndpoint$Poller.run(NioEndpoint.java:708) at java.lang.Thread.run(Thread.java:748) "http-nio-8080-exec-215": at java.nio.channels.spi.AbstractSelector.cancel(AbstractSelector.java:91) - waiting to lock <0x00000006c0b27028> (a java.util.HashSet) at java.nio.channels.spi.AbstractSelectionKey.cancel(AbstractSelectionKey.java:73) - locked <0x00000006de810c70> (a sun.nio.ch.SelectionKeyImpl) at java.nio.channels.spi.AbstractSelectableChannel.implCloseChannel(AbstractSelectableChannel.java:240) - locked <0x00000006de810f10> (a java.lang.Object) at java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:115) - locked <0x00000006de810c30> (a java.lang.Object) at org.apache.tomcat.util.net.NioChannel.close(NioChannel.java:104) at org.apache.tomcat.util.net.NioChannel.close(NioChannel.java:116) at org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.doClose(NioEndpoint.java:1179) - locked <0x00000006c2d7b8a0> (a org.apache.tomcat.util.net.NioChannel) at org.apache.tomcat.util.net.SocketWrapperBase.close(SocketWrapperBase.java:394) at org.apache.tomcat.util.net.NioEndpoint$Poller.cancelledKey(NioEndpoint.java:667) at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1592) at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49) - locked <0x00000006de810c98> (a org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) at java.lang.Thread.run(Thread.java:748) "http-nio-8080-BlockPoller": at java.nio.channels.spi.AbstractSelectableChannel.removeKey(AbstractSelectableChannel.java:130) - waiting to lock <0x00000006de8588c0> (a java.lang.Object) at java.nio.channels.spi.AbstractSelector.deregister(AbstractSelector.java:188) at sun.nio.ch.EPollSelectorImpl.implDereg(EPollSelectorImpl.java:191) at sun.nio.ch.SelectorImpl.processDeregisterQueue(SelectorImpl.java:149) - locked <0x00000006c0b27028> (a java.util.HashSet) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:90) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86) - locked <0x00000006c0b27108> (a sun.nio.ch.Util$3) - locked <0x00000006c0b270f8> (a java.util.Collections$UnmodifiableSet) - locked <0x00000006c0b26fc0> (a sun.nio.ch.EPollSelectorImpl) at sun.nio.ch.SelectorImpl.selectNow(SelectorImpl.java:105) at org.apache.tomcat.util.net.NioBlockingSelector$BlockPoller.run(NioBlockingSelector.java:310) "http-nio-8080-exec-378": at java.nio.channels.spi.AbstractSelector.cancel(AbstractSelector.java:91) - waiting to lock <0x00000006c0b4fe28> (a java.util.HashSet) at java.nio.channels.spi.AbstractSelectionKey.cancel(AbstractSelectionKey.java:73) - locked <0x00000006de858648> (a sun.nio.ch.SelectionKeyImpl) at java.nio.channels.spi.AbstractSelectableChannel.implCloseChannel(AbstractSelectableChannel.java:240) - locked <0x00000006de8588c0> (a java.lang.Object) at java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:115) - locked <0x00000006de8585e0> (a java.lang.Object) at org.apache.tomcat.util.net.NioChannel.close(NioChannel.java:104) at org.apache.tomcat.util.net.NioChannel.close(NioChannel.java:116) at org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.doClose(NioEndpoint.java:1179) - locked <0x00000006c3360628> (a org.apache.tomcat.util.net.NioChannel) at org.apache.tomcat.util.net.SocketWrapperBase.close(SocketWrapperBase.java:394) at org.apache.tomcat.util.net.NioEndpoint$Poller.cancelledKey(NioEndpoint.java:667) at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1592) at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49) - locked <0x00000006de858670> (a org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) at java.lang.Thread.run(Thread.java:748) Found 1 deadlock.
NIO deadlocks internally in your trace while trying to close multiple independent channels. Some of the structures that might help the deadlock, in particular the BlockPoller, are no longer used in trunk.
but, in my opinion. just the NioEndpoint$Poller.cancelledKey and NioEndpoint$Poller.run may cause the deadlock. Poller.run finally call AbstractSelector.cancel and Poller.cancelledKey finally call AbstractSelector.deregister. The two method all need the conditions keyLock and cancelledKeys.
Ok, so then cancelledKey should cancel the key first, and this would resolve the "problem" as the channel close locks everything before cancelling that key itself. This is what was done like that in 8.5 but this didn't look important. Can you test it or give me your opinion on it ? https://github.com/apache/tomcat/commit/9b1a8b67bffe462fc745b19e15ed59c37e2e1dcf
I will test first. If there is progress i will recommend agin
Ok. If you run again into a deadlock, please add the updated deadlock traces as an attachment to the BZ (otherwise it's harder to read).
Looking at the most recent JDK sources, the root issue likely has been fixed by these commits which remove all the unwanted internal NIO syncing: https://github.com/openjdk/jdk/commit/33b921f25d8b5e6a50003daa6db31176437d29c0 https://github.com/openjdk/jdk/commit/38bdacafbcf3c96ac40e30098c0990f88a9edf8e I'm ok with having a workaround fix, as it seems the first commit for this NIO bug will be in 11, and the second one in 14. Not 100% sure if the first one alone is enough to avoid the deadlock, but it might well be. In the trace, if http-nio-8080-exec-215 does not hold the lock in the AbstractSelectableChannel, then http-nio-8080-ClientPoller will be able to get the sync on it.
yes. thank you for your patient answer.I don't have a correct way to test. Just pressure test and it's hard to reproduce the bug fixed or not.
The issue would have been introduced in 9.0.20, and it's not surprising it's hard to reproduce. It's interesting that with the new Java dev model they actually do this sort of refactorings, cleanups and improvements on NIO, which is a very old piece of code.
with your fixed code i package a new jar and then replace tomcat-embed-core.jar in 9.0.26.and then i pressure my application 3 hours and all is fine. But with the tomcat 9.0.26 press 1 hour the deadlock appear.
Ok, so let's say the regression is fixed. The fix will be in Tomcat 9.0.31+, thanks for the help resolving the issue.