Bug 66508 - After a GC pause, Tomcat HTTP threads block acquiring a semaphore while processing WebSocket connection closure.
Status: RESOLVED FIXED
Alias: None
Product: Tomcat 8
Classification: Unclassified
Component: WebSocket
Version: 8.5.72
Hardware: All
OS: All
Importance: P2 normal
Target Milestone: ----
Assignee: Tomcat Developers Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-03-04 01:58 UTC by Venkata Manda
Modified: 2023-06-12 08:40 UTC



Description Venkata Manda 2023-03-04 01:58:54 UTC
For this issue to happen, three events need to occur:

Event 1. Tomcat is going through a GC pause.
Event 2. The WebSocket connection is closed at the client (not yet processed by Tomcat, which is still paused).
Event 3. Tomcat tries to write a message on the same connection, but the write has not started because Tomcat is still paused.

Event 2 happens on an HTTP thread (http-nio2-8082-exec-7 in the stack trace below) and Event 3 happens on a background thread (pool-12-thread-1).

If Event 3 happens before Event 2, the semaphore acquired by the Event 3 thread is not released and the Event 2 thread blocks for the duration of org.apache.tomcat.websocket.BLOCKING_SEND_TIMEOUT (default 20 seconds). If I am not mistaken, the Event 3 thread never releases the semaphore.
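
To illustrate the stall with plain JDK primitives (a sketch only, not Tomcat code; the names mirror Tomcat's): the permit taken for the in-flight write is never returned, so the close-processing thread waits out the full timeout.

    import java.util.concurrent.Semaphore;
    import java.util.concurrent.TimeUnit;

    public class SemaphoreStallSketch {
        public static void main(String[] args) throws InterruptedException {
            Semaphore messagePartInProgress = new Semaphore(1);

            // Event 3: the write takes the only permit and never returns it.
            messagePartInProgress.acquire();

            // Event 2: close processing tries to send the close frame and
            // parks for the full BLOCKING_SEND_TIMEOUT (20 s by default).
            long start = System.nanoTime();
            boolean acquired = messagePartInProgress.tryAcquire(20, TimeUnit.SECONDS);
            long waitedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);

            System.out.println("acquired=" + acquired + " after " + waitedMs + " ms");
            // Prints: acquired=false after ~20000 ms
        }
    }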

If there are 5000 connections on a Tomcat instance that are all hitting this issue, every HTTP thread in the pool gets stuck for a long duration and stops processing incoming HTTP requests: each HTTP thread tries to process a socket close and waits 20 seconds to acquire the semaphore. With a thread pool of 50 and 5000 connections, that is (5000 / 50) * 20 seconds = 100 * 20 seconds = 2000 seconds, i.e. about 33.3 minutes.


In our environment, we are using CometD (https://cometd.org/) for transport-independent client-to-server and server-to-client communication via the Bayeux protocol. The protocol sends heartbeat messages every 30 seconds to detect the availability of both parties. If the server does not respond within 30 seconds plus a 10-second configurable grace period, the client considers the server unavailable and reconnects, opening a new connection and closing the old one. During a GC pause at the server, the client does not receive a heartbeat within that interval and closes the connection. When the server processes this socket close, it writes a response back to the client (I am not sure why that is needed, or, if it is needed, why it tries to acquire the semaphore with a maximum wait of org.apache.tomcat.websocket.BLOCKING_SEND_TIMEOUT).

Please review the stack traces below. The main issue is that the http-nio2-8082-exec-7 thread tries to acquire a semaphore for 20 seconds while pool-12-thread-1, which is trying to write a message, never releases it. My debugging shows that pool-12-thread-1 in our environment performs a non-blocking write that, for some reason, never completes and never releases the semaphore.

Let me know if you need any further information.

Event 2 Stack Trace:

"http-nio2-8082-exec-7" #112 daemon prio=5 os_prio=31 tid=0x00007fc9c07e9000 nid=0x1f403 waiting on condition [0x0000700008eb7000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x0000000748319390> (a java.util.concurrent.Semaphore$NonfairSync)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
        at java.util.concurrent.Semaphore.tryAcquire(Semaphore.java:409)
        at org.apache.tomcat.websocket.WsRemoteEndpointImplBase.sendMessageBlock(WsRemoteEndpointImplBase.java:299)
        at org.apache.tomcat.websocket.WsRemoteEndpointImplBase.sendMessageBlock(WsRemoteEndpointImplBase.java:262)
        at org.apache.tomcat.websocket.WsSession.sendCloseMessage(WsSession.java:816)
        at org.apache.tomcat.websocket.WsSession.onClose(WsSession.java:739)
        - locked <0x000000074831a820> (a java.lang.Object)
        at org.apache.tomcat.websocket.WsFrameBase.processDataControl(WsFrameBase.java:367)
        at org.apache.tomcat.websocket.WsFrameBase.processData(WsFrameBase.java:296)
        at org.apache.tomcat.websocket.WsFrameBase.processInputBuffer(WsFrameBase.java:133)
        at org.apache.tomcat.websocket.server.WsFrameServer.onDataAvailable(WsFrameServer.java:85)
        at org.apache.tomcat.websocket.server.WsFrameServer.doOnDataAvailable(WsFrameServer.java:183)
        at org.apache.tomcat.websocket.server.WsFrameServer.notifyDataAvailable(WsFrameServer.java:162)
        at org.apache.tomcat.websocket.server.WsHttpUpgradeHandler.upgradeDispatch(WsHttpUpgradeHandler.java:157)
        at org.apache.coyote.http11.upgrade.UpgradeProcessorInternal.dispatch(UpgradeProcessorInternal.java:60)
        at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:59)
        at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:831)
        at org.apache.tomcat.util.net.Nio2Endpoint$SocketProcessor.doRun(Nio2Endpoint.java:1615)
        at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
        - locked <0x0000000748310ca8> (a org.apache.tomcat.util.net.Nio2Endpoint$Nio2SocketWrapper)
        at org.apache.tomcat.util.net.AbstractEndpoint.processSocket(AbstractEndpoint.java:1111)
        at org.apache.tomcat.util.net.Nio2Endpoint$Nio2SocketWrapper$2.completed(Nio2Endpoint.java:564)
        at org.apache.tomcat.util.net.Nio2Endpoint$Nio2SocketWrapper$2.completed(Nio2Endpoint.java:542)
        at sun.nio.ch.Invoker.invokeUnchecked(Invoker.java:126)
        at sun.nio.ch.Invoker$2.run(Invoker.java:218)
        at sun.nio.ch.AsynchronousChannelGroupImpl$1.run(AsynchronousChannelGroupImpl.java:112)
        at org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1191)
        at org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659)
        at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
        at java.lang.Thread.run(Thread.java:748)






Event 3 stack trace (I modified the Tomcat code to print it); this happens on pool-12-thread-1:

java.lang.Thread.getStackTrace(Thread.java:1559)
org.apache.tomcat.websocket.server.WsRemoteEndpointImplServer.onWritePossible(WsRemoteEndpointImplServer.java:175)
org.apache.tomcat.websocket.server.WsRemoteEndpointImplServer.doWrite(WsRemoteEndpointImplServer.java:81)
org.apache.tomcat.websocket.WsRemoteEndpointImplBase.writeMessagePart(WsRemoteEndpointImplBase.java:545)
org.apache.tomcat.websocket.WsRemoteEndpointImplBase.startMessage(WsRemoteEndpointImplBase.java:411)
org.apache.tomcat.websocket.WsRemoteEndpointImplBase$TextMessageSendHandler.write(WsRemoteEndpointImplBase.java:881)
org.apache.tomcat.websocket.WsRemoteEndpointImplBase.sendStringByCompletion(WsRemoteEndpointImplBase.java:216)
org.apache.tomcat.websocket.WsRemoteEndpointAsync.sendText(WsRemoteEndpointAsync.java:47)
org.cometd.websocket.server.WebSocketTransport.send(WebSocketTransport.java:118)
org.cometd.websocket.server.WebSocketTransport.send(WebSocketTransport.java:52)
org.cometd.websocket.server.common.AbstractWebSocketTransport$AbstractWebSocketScheduler.send(AbstractWebSocketTransport.java:196)
org.cometd.websocket.server.common.AbstractWebSocketTransport$AbstractWebSocketScheduler$Flusher.process(AbstractWebSocketTransport.java:627)
org.eclipse.jetty.util.IteratingCallback.processing(IteratingCallback.java:241)
org.eclipse.jetty.util.IteratingCallback.iterate(IteratingCallback.java:223)
org.cometd.websocket.server.common.AbstractWebSocketTransport$AbstractWebSocketScheduler.send(AbstractWebSocketTransport.java:420)
org.cometd.websocket.server.common.AbstractWebSocketTransport$AbstractWebSocketScheduler.send(AbstractWebSocketTransport.java:522)
org.cometd.websocket.server.common.AbstractWebSocketTransport$AbstractWebSocketScheduler.schedule(AbstractWebSocketTransport.java:500)
org.cometd.websocket.server.WebSocketTransport$WebSocketScheduler$1.schedule(WebSocketTransport.java:154)
org.cometd.websocket.server.common.AbstractWebSocketTransport$AbstractWebSocketScheduler$MetaConnectReplyTask.run(AbstractWebSocketTransport.java:548)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266)
java.util.concurrent.FutureTask.run(FutureTask.java)





Steps to Reproduce:
-------------------
Tomcat version: 8.5.72
Java version: 1.8
(If I am not mistaken, this issue also occurs on later Java and Tomcat versions.)

1. Create a WebSocket Endpoint.
2. Connect a client to the server and have the server send messages at a rate of 1 message per 5 seconds; increase the rate if the issue does not reproduce.
3. Create a GC pause, or if that is not possible, run the server under an IDE debugger and pause all threads at a breakpoint.
4. Close the client connection.
5. Resume execution in the debugger.
6. Take a thread dump.

To understand the problem at scale, increase the number of connections to 1000 or 5000 and perform the above steps.
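
For steps 1 and 2, a minimal endpoint along these lines should work (a sketch assuming a JSR-356 javax.websocket container such as Tomcat's; the path and message content are illustrative):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import javax.websocket.OnClose;
    import javax.websocket.OnOpen;
    import javax.websocket.Session;
    import javax.websocket.server.ServerEndpoint;

    @ServerEndpoint("/repro")
    public class ReproEndpoint {

        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        @OnOpen
        public void onOpen(Session session) {
            // Send one text message every 5 seconds from a background thread,
            // mirroring pool-12-thread-1 in the stack traces above.
            scheduler.scheduleAtFixedRate(
                    () -> session.getAsyncRemote().sendText("ping"),
                    0, 5, TimeUnit.SECONDS);
        }

        @OnClose
        public void onClose(Session session) {
            scheduler.shutdownNow();
        }
    }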




I made a change at the following line: if opCode is OPCODE_CLOSE, instead of waiting to acquire the semaphore with a timeout, just do a tryAcquire().
This fixes the issue, but I am not sure what the implications of this change are. Ideally we need to find out why the pool-12-thread-1 flow (listed above) does not release the semaphore when the socket is closed at the client, or why it does not at least fail the write and release the semaphore.

https://github.com/apache/tomcat/blob/8.5.x/java/org/apache/tomcat/websocket/WsRemoteEndpointImplBase.java#L299


            if (opCode == Constants.OPCODE_CLOSE) {
                // Proposed change: when sending a close frame, either take the
                // permit immediately or fail fast instead of parking for
                // BLOCKING_SEND_TIMEOUT.
                if (!messagePartInProgress.tryAcquire()) {
                    String msg = sm.getString("wsRemoteEndpoint.acquireTimeout");
                    wsSession.doClose(new CloseReason(CloseCodes.GOING_AWAY, msg),
                            new CloseReason(CloseCodes.CLOSED_ABNORMALLY, msg), true);
                    throw new SocketTimeoutException(msg);
                }
            } else if (!messagePartInProgress.tryAcquire(timeout, TimeUnit.MILLISECONDS)) {
                // Existing behaviour for all other frames: timed acquire.
                String msg = sm.getString("wsRemoteEndpoint.acquireTimeout");
                wsSession.doClose(new CloseReason(CloseCodes.GOING_AWAY, msg),
                        new CloseReason(CloseCodes.CLOSED_ABNORMALLY, msg), true);
                throw new SocketTimeoutException(msg);
            }



Let me know if you have any questions.

Thank you.
Comment 1 Venkata Manda 2023-03-04 02:54:25 UTC
In my issue description above, I mentioned that the test environment's Java version is 1.8. The issue is also reproducible with Java 11.

Based on my reading and analysis of the issue, any time the client closes a socket and a server thread starts sending a message on that socket before the close is processed, this issue can be triggered.

With a GC pause, the issue reproduces consistently.

Thanks.
Comment 2 Mark Thomas 2023-03-07 13:28:15 UTC
Corrected the version field to match the text of the report.
Comment 3 Mark Thomas 2023-03-08 16:43:16 UTC
Thanks for the steps to reproduce.

I'll note for the record that you also need to be using the NIO2 connector.

I can see what the root cause is. The write in "Event 3" fails and isn't cleaned up properly, which then blocks the close message. I am currently exploring options for fixing this.
Comment 4 Mark Thomas 2023-03-08 18:24:08 UTC
Fixed in:
- 11.0.x for 11.0.0-M5 onwards
- 10.1.x for 10.1.8 onwards
-  9.0.x for  9.0.74 onwards
-  8.5.x for  8.5.88 onwards

If you'd like to test a dev build before the next release, let me know.
Comment 5 Venkata Manda 2023-03-08 19:53:52 UTC
(In reply to Mark Thomas from comment #3)
> Thanks for the steps to reproduce.
> 
> I'll note for the record that you also need to be using the NIO2 connector.
> 
> I can see what the root cause is. The write in "Event 3" fails and isn't
> cleaned up properly which then blocks the close message. I am currently
> exploring options for fixing this.

Thank you for working on the issue.
I am using the NIO2 connector.
Comment 6 Venkata Manda 2023-03-08 19:57:28 UTC
(In reply to Mark Thomas from comment #4)
> Fixed in:
> - 11.0.x for 11.0.0-M5 onwards
> - 10.1.x for 10.1.8 onwards
> -  9.0.x for  9.0.74 onwards
> -  8.5.x for  8.5.88 onwards
> 
> If you'd like to test a dev build before the next release, let me know.

Yes. Let me work on setting this up in my environment and will post you my results here.

Thanks a lot for fixing the issue.
Comment 7 Mark Thomas 2023-03-09 08:38:20 UTC
If it helps, dev build with the fix is available from:

https://people.apache.org/~markt/dev/v9.0.74-dev/

Note:
- this is not an official release
- use it at your own risk
Comment 8 Venkata Manda 2023-03-16 05:44:49 UTC
I tested with the changes made and the issue is still happening.

The Event 3 thread, which is trying to write on a socket the client asked to close because the server did not respond (Event 1), is failing due to a broken pipe (IOException). As it is using a non-blocking write, the exception does not surface on the Event 3 thread but on an NIO2 thread (example thread name: http-nio2-8082-exec-2).


java.io.IOException: Broken pipe
	at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
	at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
	at sun.nio.ch.IOUtil.write(IOUtil.java:65)
	at sun.nio.ch.UnixAsynchronousSocketChannelImpl.implWrite(UnixAsynchronousSocketChannelImpl.java:694)
	at sun.nio.ch.AsynchronousSocketChannelImpl.write(AsynchronousSocketChannelImpl.java:383)
	at sun.nio.ch.AsynchronousSocketChannelImpl.write(AsynchronousSocketChannelImpl.java:400)
	at org.apache.tomcat.util.net.Nio2Channel.write(Nio2Channel.java:183)
	at org.apache.tomcat.util.net.Nio2Endpoint$Nio2SocketWrapper.flushNonBlockingInternal(Nio2Endpoint.java:1285)
	at org.apache.tomcat.util.net.Nio2Endpoint$Nio2SocketWrapper.writeNonBlockingInternal(Nio2Endpoint.java:1187)
	at org.apache.tomcat.util.net.Nio2Endpoint$Nio2SocketWrapper.writeNonBlocking(Nio2Endpoint.java:1150)
	at org.apache.tomcat.util.net.SocketWrapperBase.write(SocketWrapperBase.java:486)
	at org.apache.tomcat.websocket.server.WsRemoteEndpointImplServer.onWritePossible(WsRemoteEndpointImplServer.java:148)
	at org.apache.tomcat.websocket.server.WsRemoteEndpointImplServer.doWrite(WsRemoteEndpointImplServer.java:83)
	at org.apache.tomcat.websocket.WsRemoteEndpointImplBase.writeMessagePart(WsRemoteEndpointImplBase.java:556)
	at org.apache.tomcat.websocket.WsRemoteEndpointImplBase.startMessage(WsRemoteEndpointImplBase.java:422)
	at org.apache.tomcat.websocket.WsRemoteEndpointImplBase$TextMessageSendHandler.write(WsRemoteEndpointImplBase.java:892)
	at org.apache.tomcat.websocket.WsRemoteEndpointImplBase.sendStringByCompletion(WsRemoteEndpointImplBase.java:216)
	at org.apache.tomcat.websocket.WsRemoteEndpointAsync.sendText(WsRemoteEndpointAsync.java:47)
	at org.cometd.websocket.server.WebSocketTransport.send(WebSocketTransport.java:118)
	at org.cometd.websocket.server.WebSocketTransport.send(WebSocketTransport.java:52)
	at org.cometd.websocket.server.common.AbstractWebSocketTransport$AbstractWebSocketScheduler.send(AbstractWebSocketTransport.java:196)
	at org.cometd.websocket.server.common.AbstractWebSocketTransport$AbstractWebSocketScheduler$Flusher.process(AbstractWebSocketTransport.java:627)
	at org.eclipse.jetty.util.IteratingCallback.processing(IteratingCallback.java:241)
	at org.eclipse.jetty.util.IteratingCallback.iterate(IteratingCallback.java:223)
	at org.cometd.websocket.server.common.AbstractWebSocketTransport$AbstractWebSocketScheduler.send(AbstractWebSocketTransport.java:420)
	at org.cometd.websocket.server.common.AbstractWebSocketTransport$AbstractWebSocketScheduler.send(AbstractWebSocketTransport.java:522)
	at org.cometd.websocket.server.common.AbstractWebSocketTransport$AbstractWebSocketScheduler.schedule(AbstractWebSocketTransport.java:500)
	at org.cometd.websocket.server.WebSocketTransport$WebSocketScheduler$1.schedule(WebSocketTransport.java:154)
	at org.cometd.websocket.server.common.AbstractWebSocketTransport$AbstractWebSocketScheduler$MetaConnectReplyTask.run(AbstractWebSocketTransport.java:548)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266)
	at java.util.concurrent.FutureTask.run(FutureTask.java)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)






When this exception is thrown, the writeCompletionHandler.failed method in Nio2Endpoint is called (https://github.com/apache/tomcat/blob/8.5.x/java/org/apache/tomcat/util/net/Nio2Endpoint.java#L707). It releases the writePending lock and tries to send SocketEvent.ERROR.

endpoint.processSocket(Nio2SocketWrapper.this, SocketEvent.ERROR, true);


My guess is that sending the above event is expected to lead to the release of the messagePartInProgress semaphore that the Event 1 thread (the thread trying to consume the close message sent by the client) is waiting to acquire.

https://github.com/apache/tomcat/blob/8.5.x/java/org/apache/tomcat/websocket/WsRemoteEndpointImplBase.java#L292


In my observation, the call below returns true, so the close method is not called.

https://github.com/apache/tomcat/blob/main/java/org/apache/tomcat/util/net/Nio2Endpoint.java#L702

The above call spins off another thread to fire a socket event.

https://github.com/apache/tomcat/blob/8.5.x/java/org/apache/tomcat/util/net/SocketProcessorBase.java#L46

In my observation, after 20 seconds (the same time the Event 1 thread spends waiting to acquire the semaphore), the thread enters the synchronized block in SocketProcessorBase.run, finds socketWrapper.isClosed() is true, and returns without running the doRun method.


Because the Event 3 thread writes asynchronously, even when it gets an IOException (broken pipe) the messagePartInProgress semaphore is not released: the thread that acquired the permit and the thread that executes the write are different. Even when the writer encounters a broken pipe, it does not close the socketWrapper.

In my observation, the Event 3 thread hits the line below; as the write never completes, no handler is invoked to release the messagePartInProgress semaphore.

https://github.com/apache/tomcat/blob/8.5.x/java/org/apache/tomcat/websocket/server/WsRemoteEndpointImplServer.java#L204
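
To illustrate that ownership split with the plain JDK NIO.2 API (a self-contained sketch; class and variable names are illustrative, not Tomcat code): the initiating thread takes the permit, but only the completion handler, which runs on a channel-group thread, can return it, so a failure path that skips the release leaks the permit for good.

    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.AsynchronousServerSocketChannel;
    import java.nio.channels.AsynchronousSocketChannel;
    import java.nio.channels.CompletionHandler;
    import java.util.concurrent.Semaphore;

    public class AsyncWriteOwnership {
        public static void main(String[] args) throws Exception {
            // Loopback listener so the example is self-contained.
            AsynchronousServerSocketChannel server = AsynchronousServerSocketChannel.open()
                    .bind(new InetSocketAddress("localhost", 0));
            AsynchronousSocketChannel client = AsynchronousSocketChannel.open();
            client.connect(server.getLocalAddress()).get();

            Semaphore inProgress = new Semaphore(1);
            inProgress.acquire(); // the initiating thread takes the permit...

            client.write(ByteBuffer.wrap("msg".getBytes()), null,
                    new CompletionHandler<Integer, Void>() {
                        @Override
                        public void completed(Integer bytes, Void att) {
                            // ...but only this handler thread can return it.
                            inProgress.release();
                        }
                        @Override
                        public void failed(Throwable exc, Void att) {
                            // If this release is skipped (or this path never
                            // runs), the permit leaks and every later acquire
                            // stalls for its full timeout.
                            inProgress.release();
                        }
                    });
            // The initiating thread returns here without releasing anything.
            Thread.sleep(500); // give the handler thread time to run
        }
    }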

Please let me know if any details are needed.


Thank you
Comment 9 Venkata Manda 2023-03-16 05:57:07 UTC
When a non-blocking write fails, writeCompletionHandler.failed inside Nio2Endpoint is called, but it is unable to trigger a code path that clears the handler on WsRemoteEndpointImplServer.
Comment 10 Remy Maucherat 2023-03-16 09:11:00 UTC
There's some bad analysis in the comment: processSocket is not supposed to return false (it does so only when the executor refuses to run the task); the close() is simply there to handle this unlikely (and random) scenario.
Comment 11 Remy Maucherat 2023-03-16 10:27:18 UTC
Can you add your connector configuration to the BZ?
Comment 12 Mark Thomas 2023-03-16 11:54:17 UTC
OK. I can recreate this.

The sequence is:
- Thread 1: Executor or similar calls Session.getAsyncRemote().sendText()
- Thread 1: Acquires messagePartInProgress semaphore
- Thread 1: Gets as far as the call to OperationState.start()

- Client closes the connection. This triggers:
  - The poller to fire a SocketEvent.ERROR
  - The in-process write to fail

- Thread 2: Starts to process SocketEvent.ERROR
- Thread 2: is holding the SocketWrapper lock
- Thread 2: Gets as far as sending the close message but has to wait for the messagePartInProgress semaphore

- Thread 3: Starts to process the failure of the in-process write
- Thread 3: failed() method called for write completion handler
- Thread 3: SocketEvent.ERROR task sent to the executor

- Thread 4: Executor starts to process the SocketEvent.ERROR task sent by Thread 3
- Thread 4: Blocked waiting to acquire the SocketWrapper lock


On one level, the issue is that the processing of SocketEvent.ERROR expects any in-progress (and failed) message to have released the messagePartInProgress semaphore. But that doesn't happen.

Looking more broadly, one could question if there is any benefit at all in attempting to send a WebSocket close message after any form of IOException rather than just closing the TCP connection.

Note, the previous fix addressed a similar issue where Thread 1 was sending a message in response to a received message.

I intend to try and fix this in the SocketEvent.ERROR handling. The higher-level discussion is probably better suited to the dev@ list. If I can't fix this in the SocketEvent.ERROR handling, we may need to return to that higher-level discussion.
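
A minimal sketch of the monitor/semaphore interplay in that sequence, using plain JDK primitives (thread names and timings are illustrative, not Tomcat code): the thread sending the close message holds the socket-wrapper monitor while waiting on the semaphore, so the error-handling task cannot enter that monitor until the wait has timed out.

    import java.util.concurrent.Semaphore;
    import java.util.concurrent.TimeUnit;

    public class MonitorSemaphoreInterplay {
        // Stand-ins for the SocketWrapper monitor and the message semaphore.
        private static final Object socketWrapperLock = new Object();
        private static final Semaphore messagePartInProgress = new Semaphore(1);

        public static void main(String[] args) throws Exception {
            // Thread 1: in-flight write that never completes or releases.
            messagePartInProgress.acquire();

            Thread closeSender = new Thread(() -> {
                synchronized (socketWrapperLock) { // Thread 2: holds the monitor
                    try {
                        // Waits the full timeout; nothing releases the semaphore.
                        boolean ok = messagePartInProgress.tryAcquire(20, TimeUnit.SECONDS);
                        System.out.println("close-sender acquired semaphore: " + ok);
                    } catch (InterruptedException ignored) {
                    }
                }
            }, "close-sender");

            Thread errorTask = new Thread(() -> {
                synchronized (socketWrapperLock) { // Thread 4: parked on the monitor
                    // Any error cleanup here runs only after close-sender
                    // has already timed out.
                    System.out.println("error task entered the monitor");
                }
            }, "error-task");

            closeSender.start();
            Thread.sleep(100); // let close-sender take the monitor first
            errorTask.start();
        }
    }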
Comment 13 Mark Thomas 2023-03-16 17:32:31 UTC
I fixed the other code path that could trigger this issue and have uploaded 8.5.x and 9.0.x builds for testing:

https://people.apache.org/~markt/dev/
Comment 14 Venkata Manda 2023-03-17 05:14:42 UTC
Hi Mark Thomas,

I am still able to reproduce the issue with the latest changes.
My understanding is that the recent changes help only when the writeCompletionHandler failed method in Nio2Endpoint fires SocketEvent.ERROR.

                    if (!endpoint.processSocket(Nio2SocketWrapper.this, SocketEvent.ERROR, true))
                        close();


This event is fired, but it is not processed because the executing thread cannot enter the synchronized block on socketWrapper. (https://github.com/apache/tomcat/blob/8.5.x/java/org/apache/tomcat/util/net/SocketProcessorBase.java#L40)

    public final void run() {
        synchronized (socketWrapper) {


The socketWrapper is held by the Event 1 thread (http-10 in the logs below). Please find below the timings of each thread in play, to give you a better picture of the issue.



2023-03-16 21:25:19 (353) pool-12-thread-1 startMessage - Acquired messagePartInProgress semaphore lock : Session ID : 0


2023-03-16 21:25:19 (360) http-10 sendMessageBlock before acquiring messagePartInProgress tryAcquire - Session ID : 0


2023-03-16 21:25:19 (369) http-11 writeCompletionHandler.failed called : java.io.IOException: Broken Pipe


2023-03-16 21:25:39 (361) http-10 sendMessageBlock could not acquire messagePartInProgress.tryAcquire. Closing socket. - SessionID : 0

java.net.SocketTimeoutException: The current message was not fully sent within the specified timeout


2023-03-16 21:25:39 (362) Inside the synchronized block of SocketProcessorBase.run method, Thread Name: http-nio2-8082-exec-15

2023-03-16 21:25:39 (362) Inside the synchronized block of SocketProcessorBase.run -- socketWrapper.isClosed: true. Not running doRun method. Thread Name : http-nio2-8082-exec-15





If you notice, there are three or four threads in play.
At the start, there are two threads.

pool-12-thread-1 (a server-side thread) is trying to write a message on Session ID 0.

The http-10 thread received a message from the client with close code 1000 and a reason.
Our client is JavaScript and uses the API below to send the close message with code 1000 and a reason:
webSocket.close(code, reason);

To process the received message, the http-10 thread should have acquired the lock on socketWrapper. Because it could not acquire the messagePartInProgress semaphore (already held by pool-12-thread-1), it waits 20 seconds (the configured timeout) and then throws a socket timeout exception. In the logs above you can see a 20-second gap from the log printed before the call to
!messagePartInProgress.tryAcquire(timeout, TimeUnit.MILLISECONDS)

(https://github.com/apache/tomcat/blob/8.5.x/java/org/apache/tomcat/websocket/WsRemoteEndpointImplBase.java#L292)


pool-12-thread-1's write fails and the writeCompletionHandler.failed method is called; in my test it is called by thread http-11.

https://github.com/apache/tomcat/blob/8.5.x/java/org/apache/tomcat/util/net/Nio2Endpoint.java#L707

The http-11 thread is trying to fire SocketEvent.ERROR, but the event is not processed because http-nio2-8082-exec-15 cannot enter the synchronized block while the socketWrapper is held by the http-10 thread.


Let me know if you have any questions.

I am thinking that adding the code below before https://github.com/apache/tomcat/blob/8.5.x/java/org/apache/tomcat/websocket/WsRemoteEndpointImplBase.java#L292 may address the issue. If the opcode is close, doing a plain tryAcquire instead of a tryAcquire with a timeout may be unproblematic.

            if (opCode == Constants.OPCODE_CLOSE) {
                if (!messagePartInProgress.tryAcquire()) {
                    String msg = sm.getString("wsRemoteEndpoint.acquireTimeout");
                    wsSession.doClose(new CloseReason(CloseCodes.GOING_AWAY, msg),
                            new CloseReason(CloseCodes.CLOSED_ABNORMALLY, msg), true);
                    throw new SocketTimeoutException(msg);
                }
            }


Thanks.
Comment 15 Mark Thomas 2023-03-17 10:54:10 UTC
Please provide full thread stacks for each of the four threads.

Also, Rémy's request for the full Connector configuration is still outstanding.
Comment 16 Mark Thomas 2023-03-17 17:42:09 UTC
Short term, any combination of NIO/NIO2 and useAsyncIO that is NOT NIO2 + useAsyncIO="false" should work. Performance differences between all of those combinations should be minimal.
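
For example, switching to the plain NIO connector avoids the affected combination (a sketch only; the port and attribute values are carried over from a typical configuration for illustration):

    <Connector port="8082"
               protocol="org.apache.coyote.http11.Http11NioProtocol"
               maxThreads="50"
               connectionTimeout="20000"
               disableUploadTimeout="true" />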

Medium term, this issue is one symptom of a deeper problem. It is going to take a little longer to explore the options for fixing it.
Comment 17 Venkata Manda 2023-03-23 22:42:38 UTC
Excuse me for the delayed response. Please find below our connector configuration:

protocol=org.apache.coyote.http11.Http11Nio2Protocol
tomcat.connector.all.enableLookups=false
tomcat.connector.all.URIEncoding=ISO-8859-1
tomcat.connector.all.connectionTimeout=20000
tomcat.connector.all.disableUploadTimeout=true
tomcat.connector.main.redirectPort=8443
tomcat.connector.main.maxThreads=50
tomcat.connector.main.minSpareThreads=50
tomcat.connector.main.acceptCount=500
tomcat.connector.main.maxConnections=10000
tomcat.connector.main.relaxedQueryChars="<>[\]^{|}
tomcat.connector.main.relaxedPathChars="<>[\\]^{|}
tomcat.connector.secure.port=8443
tomcat.connector.secure.scheme=https
tomcat.connector.secure.secure=true
tomcat.connector.secure.SSLEnabled=true
tomcat.connector.secure.clientAuth=want
tomcat.connector.secure.sslProtocol=TLS
tomcat.connector.secure.algorithm=X509
org.apache.catalina.connector.CoyoteAdapter.ALLOW_BACKSLASH=true
org.apache.catalina.connector.RECYCLE_FACADES=true



I am still working on sharing stack traces that match the source code of the publicly available Tomcat version. I added a lot of log statements to get a better idea of the issue, so the line numbers I see in my internal version may not match the public version.


Thanks.
Comment 18 Mark Thomas 2023-03-28 03:25:18 UTC
Thanks for the Connector settings. The stack traces will definitely be useful as it looks as if you are using useAsyncIO="true". If correct, I definitely want to see how this is being triggered.

As an alternative, I'm happy to look at the stack traces for the modified Tomcat source if you can also provide the modified source.
Comment 19 Mark Thomas 2023-04-06 16:17:33 UTC
I've applied a further fix for this.

I've also updated the dev build at:
https://people.apache.org/~markt/dev/v9.0.74-dev/

Same caveats apply:
- this is not an official release
- use it at your own risk
Comment 20 Mark Thomas 2023-05-02 11:19:20 UTC
Given the lack of response, I am assuming that this is now fixed.
Comment 21 Venkata Manda 2023-06-09 17:59:19 UTC
Excuse me for the delayed response.

I tested the fix by simulating a GC pause (pausing the JVM while running in debug mode) and it worked.

My team is working on running this at a higher load with actual user loads to confirm that we are not seeing any other side effects/issues. I was waiting for those results before confirming the fix is working without any issues.

Recently I learned about https://bz.apache.org/bugzilla/show_bug.cgi?id=66574, whose reporter stated that it seems to be a regression introduced in 9.0.74.

In the past, after upgrading CometD from 3 to 5 on Tomcat 9.0.63, we noticed all the http-nio2 threads blocked without making progress. We raised the issue with CometD (https://github.com/cometd/cometd/issues/1328) and Tomcat (https://bz.apache.org/bugzilla/show_bug.cgi?id=66531).

We fixed it on our side by not using CompletableFuture in the org.cometd.server.websocket.javax.WebSocketEndPoint.onMessage method; this was done to unblock ourselves.

We are not sure whether the 9.0.75 change is needed, as we no longer use CompletableFuture in WebSocketEndPoint. Any guidance on this is highly appreciated.

Once again, sorry for the delayed response. We were waiting to run this at higher load with different scenarios to make sure there are no other issues.


Thank you once again for all the help.



-----------------------------------------
Stack trace for https://bz.apache.org/bugzilla/show_bug.cgi?id=66531

"http-nio2-8082-exec-1" #82 daemon prio=5 os_prio=31 cpu=460.02ms elapsed=2578.35s tid=0x00007facbee28800 nid=0xde03 waiting on condition  [0x000070001173d000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@11.0.17/Native Method)
	- parking to wait for  <0x00000007b5100478> (a java.util.concurrent.CompletableFuture$Signaller)
	at java.util.concurrent.locks.LockSupport.park(java.base@11.0.17/LockSupport.java:194)
	at java.util.concurrent.CompletableFuture$Signaller.block(java.base@11.0.17/CompletableFuture.java:1796)
	at java.util.concurrent.ForkJoinPool.managedBlock(java.base@11.0.17/ForkJoinPool.java:3128)
	at java.util.concurrent.CompletableFuture.waitingGet(java.base@11.0.17/CompletableFuture.java:1823)
	at java.util.concurrent.CompletableFuture.get(java.base@11.0.17/CompletableFuture.java:1998)
	at org.cometd.server.websocket.javax.WebSocketEndPoint.onMessage(WebSocketEndPoint.java:61)
	at org.cometd.server.websocket.javax.WebSocketEndPoint.onMessage(WebSocketEndPoint.java:35)
	at org.apache.tomcat.websocket.WsFrameBase.sendMessageText(WsFrameBase.java:415)
	at org.apache.tomcat.websocket.server.WsFrameServer.sendMessageText(WsFrameServer.java:129)
	at org.apache.tomcat.websocket.WsFrameBase.processDataText(WsFrameBase.java:515)
	at org.apache.tomcat.websocket.WsFrameBase.processData(WsFrameBase.java:301)
	at org.apache.tomcat.websocket.WsFrameBase.processInputBuffer(WsFrameBase.java:133)
	at org.apache.tomcat.websocket.server.WsFrameServer.onDataAvailable(WsFrameServer.java:85)
	at org.apache.tomcat.websocket.server.WsFrameServer.doOnDataAvailable(WsFrameServer.java:183)
	at org.apache.tomcat.websocket.server.WsFrameServer.notifyDataAvailable(WsFrameServer.java:162)
	at org.apache.tomcat.websocket.server.WsHttpUpgradeHandler.upgradeDispatch(WsHttpUpgradeHandler.java:157)
	at org.apache.coyote.http11.upgrade.UpgradeProcessorInternal.dispatch(UpgradeProcessorInternal.java:60)
	at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:59)
	at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:890)
	at org.apache.tomcat.util.net.Nio2Endpoint$SocketProcessor.doRun(Nio2Endpoint.java:1701)
	at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
	- locked <0x00000007b51005d8> (a org.apache.tomcat.util.net.Nio2Endpoint$Nio2SocketWrapper)
	at org.apache.tomcat.util.net.AbstractEndpoint.processSocket(AbstractEndpoint.java:1190)
	at org.apache.tomcat.util.net.Nio2Endpoint$Nio2SocketWrapper$2.completed(Nio2Endpoint.java:633)
	at org.apache.tomcat.util.net.Nio2Endpoint$Nio2SocketWrapper$2.completed(Nio2Endpoint.java:611)
	at sun.nio.ch.Invoker.invokeUnchecked(java.base@11.0.17/Invoker.java:127)
	at sun.nio.ch.Invoker$2.run(java.base@11.0.17/Invoker.java:219)
	at sun.nio.ch.AsynchronousChannelGroupImpl$1.run(java.base@11.0.17/AsynchronousChannelGroupImpl.java:112)
	at org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1191)
	at org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659)
	at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
	at java.lang.Thread.run(java.base@11.0.17/Thread.java:829)
Comment 22 Mark Thomas 2023-06-12 08:40:52 UTC
Thanks for confirming the fix.

Generally, if applications simulate blocking using non-blocking I/O, that will be problematic.

If there is a way to reduce the locking in Tomcat (and hence reduce the chances of simulated blocking causing issues) we will always look at it as reducing blocking is generally a good thing but we won't normally treat deadlocks caused by simulated blocking as bugs.
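
A minimal sketch of the pattern being warned against, assuming a JSR-356 (javax.websocket) container (the method names here are illustrative): blocking the container thread on a CompletableFuture tied to an asynchronous send, as in the stack trace above, versus staying in the completion callback.

    import java.util.concurrent.CompletableFuture;
    import javax.websocket.Session;

    public class SimulatedBlocking {
        // Problematic: the WebSocket container thread parks until the async
        // send completes, as in the CompletableFuture stack trace above.
        static void onMessageBlocking(Session session, String msg) throws Exception {
            CompletableFuture<Void> done = new CompletableFuture<>();
            session.getAsyncRemote().sendText(msg, result -> {
                if (result.isOK()) {
                    done.complete(null);
                } else {
                    done.completeExceptionally(result.getException());
                }
            });
            done.get(); // simulated blocking on the container thread
        }

        // Preferable: stay asynchronous and react in the completion callback.
        static void onMessageAsync(Session session, String msg) {
            session.getAsyncRemote().sendText(msg, result -> {
                if (!result.isOK()) {
                    // Handle the failure without blocking the container thread.
                    result.getException().printStackTrace();
                }
            });
        }
    }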