Bug 66574

Summary: Deadlock in websocket code
Product: Tomcat 9
Component: WebSocket
Version: 9.0.74
Hardware: PC
OS: Linux
Status: RESOLVED FIXED
Severity: regression
Priority: P2
Target Milestone: -----
Reporter: Boris Petrov <boris_petrov>
Assignee: Tomcat Developers Mailing List <dev>
Attachments: Thread dump

Description Boris Petrov 2023-04-20 06:47:19 UTC
I believe this is a regression from 9.0.73 - that is, I never encountered this issue with 9.0.73 (which of course doesn't prove it wasn't there, but with 9.0.74 it happens very often).

I think it's closely related to (or the same as) these:

https://bz.apache.org/bugzilla/show_bug.cgi?id=63191
https://bz.apache.org/bugzilla/show_bug.cgi?id=66531

I opened the first issue a few years ago. Around Tomcat 9.0.71 it stopped appearing and all was fine; 9.0.72 and 9.0.73 seemed fine as well. But 9.0.74 seems to have introduced it again (or something new). Here are the relevant stack traces:

-----------------

"thread name" #615 [2030] daemon prio=5 os_prio=0 cpu=667.09ms elapsed=57804.86s tid=0x00007fc189c87000 nid=2030 waiting on condition  [0x00007fc0bb57c000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@19.0.2/Native Method)
        - parking to wait for  <0x0000000743e004f8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(java.base@19.0.2/LockSupport.java:221)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@19.0.2/AbstractQueuedSynchronizer.java:715)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@19.0.2/AbstractQueuedSynchronizer.java:938)
        at java.util.concurrent.locks.ReentrantLock$Sync.lock(java.base@19.0.2/ReentrantLock.java:153)
        at java.util.concurrent.locks.ReentrantLock.lock(java.base@19.0.2/ReentrantLock.java:322)
        at org.apache.tomcat.websocket.WsSession.doClose(WsSession.java:654)
        at org.apache.tomcat.websocket.WsSession.doClose(WsSession.java:636)
        at org.apache.tomcat.websocket.WsSession.close(WsSession.java:624)
        at org.cometd.server.websocket.javax.WebSocketEndPoint$Delegate.close(WebSocketEndPoint.java:119)
        at org.cometd.server.websocket.common.AbstractWebSocketEndPoint.lambda$onMessage$0(AbstractWebSocketEndPoint.java:70)
        at org.cometd.server.websocket.common.AbstractWebSocketEndPoint$$Lambda$1929/0x00000007c10e7078.accept(Unknown Source)
        at org.cometd.bayeux.Promise$2.fail(Promise.java:108)
        at org.cometd.server.websocket.common.AbstractWebSocketEndPoint$$Lambda$1953/0x00000007c111c440.accept(Unknown Source)
        at org.cometd.bayeux.Promise$2.fail(Promise.java:108)
        at org.cometd.server.websocket.common.AbstractWebSocketEndPoint$Entry.fail(AbstractWebSocketEndPoint.java:645)
        at org.cometd.server.websocket.common.AbstractWebSocketEndPoint$Flusher.lambda$onCompleteFailure$0(AbstractWebSocketEndPoint.java:612)
        at org.cometd.server.websocket.common.AbstractWebSocketEndPoint$Flusher$$Lambda$2642/0x00000007c13f9800.accept(Unknown Source)
        at java.util.ArrayList.forEach(java.base@19.0.2/ArrayList.java:1511)
        at org.cometd.server.websocket.common.AbstractWebSocketEndPoint$Flusher.onCompleteFailure(AbstractWebSocketEndPoint.java:612)
        at org.eclipse.jetty.util.IteratingCallback.failed(IteratingCallback.java:417)
        at org.cometd.server.websocket.javax.WebSocketEndPoint$Delegate.lambda$send$0(WebSocketEndPoint.java:106)
        at org.cometd.server.websocket.javax.WebSocketEndPoint$Delegate$$Lambda$1954/0x00000007c111d438.onResult(Unknown Source)
        at org.apache.tomcat.websocket.WsRemoteEndpointImplBase$TextMessageSendHandler.onResult(WsRemoteEndpointImplBase.java:846)
        at org.apache.tomcat.websocket.WsRemoteEndpointImplBase.endMessage(WsRemoteEndpointImplBase.java:442)
        at org.apache.tomcat.websocket.WsRemoteEndpointImplBase$EndMessageHandler.onResult(WsRemoteEndpointImplBase.java:563)
        at org.apache.tomcat.websocket.server.WsRemoteEndpointImplServer$OnResultRunnable.run(WsRemoteEndpointImplServer.java:395)
        at org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1191)
        at org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659)
        at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
        at java.lang.Thread.run(java.base@19.0.2/Thread.java:1589)

   Locked ownable synchronizers:
        - <0x00000006cf2712d8> (a org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker)

-----------------

"other thread name" #713 [2702] daemon prio=5 os_prio=0 cpu=523.94ms elapsed=53545.41s tid=0x00007fc187e92000 nid=2702 waiting on condition  [0x00007fc0c107c000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@19.0.2/Native Method)
        - parking to wait for  <0x0000000743e008a8> (a java.util.concurrent.CompletableFuture$Signaller)
        at java.util.concurrent.locks.LockSupport.park(java.base@19.0.2/LockSupport.java:221)
        at java.util.concurrent.CompletableFuture$Signaller.block(java.base@19.0.2/CompletableFuture.java:1864)
        at java.util.concurrent.ForkJoinPool.unmanagedBlock(java.base@19.0.2/ForkJoinPool.java:3745)
        at java.util.concurrent.ForkJoinPool.managedBlock(java.base@19.0.2/ForkJoinPool.java:3690)
        at java.util.concurrent.CompletableFuture.waitingGet(java.base@19.0.2/CompletableFuture.java:1898)
        at java.util.concurrent.CompletableFuture.get(java.base@19.0.2/CompletableFuture.java:2072)
        at org.cometd.server.websocket.javax.WebSocketEndPoint.onMessage(WebSocketEndPoint.java:60)
        at org.cometd.server.websocket.javax.WebSocketEndPoint.onMessage(WebSocketEndPoint.java:35)
        at org.apache.tomcat.websocket.WsFrameBase.sendMessageText(WsFrameBase.java:390)
        at org.apache.tomcat.websocket.server.WsFrameServer.sendMessageText(WsFrameServer.java:129)
        at org.apache.tomcat.websocket.WsFrameBase.processDataText(WsFrameBase.java:484)
        at org.apache.tomcat.websocket.WsFrameBase.processData(WsFrameBase.java:284)
        at org.apache.tomcat.websocket.WsFrameBase.processInputBuffer(WsFrameBase.java:130)
        at org.apache.tomcat.websocket.server.WsFrameServer.onDataAvailable(WsFrameServer.java:84)
        at org.apache.tomcat.websocket.server.WsFrameServer.doOnDataAvailable(WsFrameServer.java:183)
        at org.apache.tomcat.websocket.server.WsFrameServer.notifyDataAvailable(WsFrameServer.java:163)
        at org.apache.tomcat.websocket.server.WsHttpUpgradeHandler.upgradeDispatch(WsHttpUpgradeHandler.java:152)
        at org.apache.coyote.http11.upgrade.UpgradeProcessorInternal.dispatch(UpgradeProcessorInternal.java:60)
        at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:57)
        at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:926)
        at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1791)
        at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:52)
        at org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1191)
        at org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659)
        at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
        at java.lang.Thread.run(java.base@19.0.2/Thread.java:1589)

   Locked ownable synchronizers:
        - <0x000000070362a330> (a org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker)
        - <0x0000000743e004f8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

-----------------

Not sure how to help more, but I'm reverting to 9.0.73 for now. Let me know what I can do.
Comment 1 Boris Petrov 2023-04-20 16:00:40 UTC
I want to mention something I've noticed, which may not always hold but has been the case so far: whenever this deadlock happens, there is only ONE such pair of threads with the stack traces I posted. Previously, when I had similar problems (in the other issue I linked), there were many, many blocked threads; now there seems to be only one pair. Again, I'm not certain of this, but that's how it looks after a few hours of testing. Hopefully that helps with debugging.
Comment 2 Christopher Schultz 2023-04-20 20:19:37 UTC
I don't see a deadlock here.

Thread 615 is waiting for 0x0000000743e004f8 and owns a lock on 0x00000006cf2712d8.
Thread 713 is waiting for 0x0000000743e008a8 and owns locks for 0x000000070362a330 and 0x0000000743e004f8.

Thus, thread 713 is waiting on Thread 615, but Thread 615 isn't waiting on Thread 713.

Unless there are more threads participating here, this is not a deadlock.
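
For illustration, a minimal generic sketch (not from this report) of asking the JVM itself whether it sees a deadlock over monitors/ownable synchronizers. Note that a thread parked in CompletableFuture.get() is waiting for a value, not trying to acquire a lock, so a cycle that goes through such a wait is not reported by this detector either:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockCheck {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Looks for cycles over monitors and ownable synchronizers only.
        long[] ids = threads.findDeadlockedThreads();
        if (ids == null) {
            System.out.println("No JVM-detectable deadlock");
        } else {
            for (ThreadInfo info : threads.getThreadInfo(ids, Integer.MAX_VALUE)) {
                System.out.println(info);
            }
        }
    }
}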
Comment 3 Boris Petrov 2023-04-21 07:28:35 UTC
I used the word "deadlock" a bit loosely, yes, sorry. It's not exactly a deadlock between these two threads, but rather that both threads are blocked and not making progress. Probably for the same reason as the two linked issues - `sendText` doesn't call the callback.

Is there really nothing in the changelog from 9.0.73 to 9.0.74 that could possibly explain this change? As I said, it *never* happened on 73; on 74 it happens often.

I've also attached a full thread dump. You can probably ignore the JRuby stuff - that's what I run my tests with. Yesterday, when I also saw the issue in production, there was no JRuby there. If a thread dump without JRuby is needed, I could try to provide one.
Comment 4 Boris Petrov 2023-04-21 07:29:44 UTC
Created attachment 38543 [details]
Thread dump
Comment 5 Remy Maucherat 2023-04-21 09:48:16 UTC
Well, it seems like it is triggered by your code.
"thread" #824 decides to wait in your code, blocking inside the main Servlet request processing (so it is still holding the socket wrapper lock).
"thread" #1 is simply doing Tomcat shutdown, closing the websocket sessions in the process.
Comment 6 Mark Thomas 2023-04-25 09:09:29 UTC
This is related to one of the fixes for bug 66508. Specifically this change:
https://github.com/apache/tomcat/commit/dccc2644ce701e88b152563473a350ec33a29a81

In short, the above change replaced the internal session state lock with the SocketWrapper lock. This removed the possibility of a deadlock caused by container and non-container threads obtaining the locks in different orders. 
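
For context, a generic sketch (not Tomcat's actual code) of the kind of lock-ordering deadlock that change was meant to eliminate: two threads acquiring the same two locks in opposite orders. The lock names are illustrative stand-ins for the old session state lock and the SocketWrapper lock:

import java.util.concurrent.locks.ReentrantLock;

public class LockOrderingDeadlock {

    private final ReentrantLock stateLock = new ReentrantLock();
    private final ReentrantLock socketWrapperLock = new ReentrantLock();

    // Container-style thread: socket wrapper lock first, then state lock.
    void containerThread() {
        socketWrapperLock.lock();
        try {
            stateLock.lock();      // blocks if an application thread holds stateLock
            try {
                // ... process frame ...
            } finally {
                stateLock.unlock();
            }
        } finally {
            socketWrapperLock.unlock();
        }
    }

    // Application-style thread: state lock first, then socket wrapper lock.
    void applicationThread() {
        stateLock.lock();
        try {
            socketWrapperLock.lock();  // blocks if the container thread holds socketWrapperLock
            try {
                // ... close session ...
            } finally {
                socketWrapperLock.unlock();
            }
        } finally {
            stateLock.unlock();
        }
    }
}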

Thread #824 is simulating blocking by calling Future.get() immediately after initiating the async action that created the Future. This thread is holding the SocketWrapper lock because it is a container thread responding to the receipt of a WebSocket message.

Thread #844 is processing the async action initiated by thread #824. It is trying to close the WebSocket session. To do this it needs the SocketWrapper lock which is held by #824.

Hence threads #824 and #844 are deadlocked.
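
For illustration, a minimal sketch of the "simulated blocking" pattern described above, assuming a plain javax.websocket endpoint. The reporter's real code goes through CometD, and the class and path names here are made up:

import java.util.concurrent.Future;

import javax.websocket.OnMessage;
import javax.websocket.Session;
import javax.websocket.server.ServerEndpoint;

// Hypothetical endpoint, for illustration only.
@ServerEndpoint("/example")
public class SimulatedBlockingEndpoint {

    @OnMessage
    public void onMessage(Session session, String message) throws Exception {
        // This handler runs on a container thread that holds the
        // SocketWrapper lock while the message is being processed.
        Future<Void> sendResult = session.getAsyncRemote().sendText(message);

        // "Simulated blocking": wait for the async action on the same
        // container thread. If completing or failing that action on another
        // thread requires the SocketWrapper lock (e.g. to close the
        // session), neither thread can make progress.
        sendResult.get();
    }
}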

I have observed that simulating blocking often triggers deadlocks with Tomcat. This appears to be another instance of that, but one where the locks participating in the deadlock are split across Tomcat and the application.

I'm going to spend some time between now and the next round of releases looking at the way Tomcat uses locks - particularly in WebSocket - with a view to seeing if there is a different approach that is less susceptible to deadlock issues.
Comment 7 Mark Thomas 2023-04-25 16:56:12 UTC
I have a patch that addresses this issue by refactoring the session state so it uses an AtomicReference (with a couple of extra states) rather than a lock.
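
To sketch the general idea (a simplified model, not the actual patch - the real state machine has more states and transitions, and the names below are illustrative):

import java.util.concurrent.atomic.AtomicReference;

// Simplified sketch of a lock-free session state: transitions happen via
// compareAndSet, so no thread ever blocks waiting for another thread's lock.
public class SessionStateSketch {

    private enum State { OPEN, OUTPUT_CLOSING, OUTPUT_CLOSED, CLOSED }

    private final AtomicReference<State> state = new AtomicReference<>(State.OPEN);

    // Returns true only for the single caller that wins the race to start
    // closing; everyone else sees that the state has already moved on.
    boolean startClosing() {
        return state.compareAndSet(State.OPEN, State.OUTPUT_CLOSING);
    }

    void outputClosed() {
        state.compareAndSet(State.OUTPUT_CLOSING, State.OUTPUT_CLOSED);
    }

    void closed() {
        state.set(State.CLOSED);
    }

    boolean isClosed() {
        return state.get() == State.CLOSED;
    }
}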

Initial testing is positive but I am currently running a fuller set of tests including the Autobahn test suite. Assuming that all goes well, are you able to test a build of Tomcat with the changes included if I post a link to the dev build here? I assume you'd want a Tomcat 9.0.x build to test.
Comment 8 Boris Petrov 2023-04-25 19:24:54 UTC
Great, thanks a lot for the quick fix! I'll try to test with whatever you give me, yes. 9.0.x, correct. Thanks!
Comment 9 Mark Thomas 2023-04-25 20:53:29 UTC
Great. A 9.0.75-dev build that includes the proposed fix for this deadlock is available at:
https://people.apache.org/~markt/dev/v9.0.75-dev/

Usual caveats apply. This is not an official release. It is only intended to test the fix for this issue. Use at your own risk.
Comment 10 Boris Petrov 2023-04-26 09:19:08 UTC
I tested for some time and all seems fine. 9.0.74 blocked often, but with this build it didn't happen, so it looks good.
Comment 11 Mark Thomas 2023-04-26 09:21:55 UTC
Thanks for the testing and confirmation.

Fixed in:
- 11.0.x for 11.0.0-M6 onwards
- 10.1.x for 10.1.9 onwards
-  9.0.x for  9.0.75 onwards
-  8.5.x for  8.5.89 onwards