Bug 58624 - Websocket send blocks indefinitely in FutureToSendHandler
Websocket send blocks indefinitely in FutureToSendHandler
Status: RESOLVED FIXED
Product: Tomcat 8
Classification: Unclassified
Component: WebSocket
8.0.24
PC All
: P2 normal (vote)
: ----
Assigned To: Tomcat Developers Mailing List
:
Depends on:
Blocks:
  Show dependency tree
 
Reported: 2015-11-18 12:16 UTC by Barry Coughlan
Modified: 2016-02-23 13:44 UTC (History)
0 users



Attachments
Test to check that session.close() does not call OnError (2.96 KB, text/plain)
2015-12-02 16:50 UTC, Barry Coughlan
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Barry Coughlan 2015-11-18 12:16:29 UTC
- Tomcat 8.0.24
- APR connector disabled, using the NIO connector
- SSL is on
- Encoder is of type Encoder.Text<Object>

We found this during load tests, it only happened to one message out of millions but there is nothing special about that message.

We have a class that wraps javax.websocket.RemoteEndpoint.Basic to ensure that it's thread-safe - so messages are never sent from multiple threads on one socket.

The FutureToSendHandler gets stuck waiting for the call to onResult. Looking at the code, logically sendStringByCompletion(msg, completion) (WsRemoteEndpoint.java:573) must not call the onResult for some code path. I am trying to figure out where but no luck so far.

This is the first time we have seen this problem over many load tests so unfortunately we can't try to replicate on the latest at the moment.

"defaultEventExecutorGroup-3-18" - Thread t@61
   java.lang.Thread.State: WAITING
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <2fcb7fa> (a java.util.concurrent.CountDownLatch$Sync)
	at java.util.concurrent.locks.LockSupport.park(Unknown Source)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(Unknown Source)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(Unknown Source)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(Unknown Source)
	at java.util.concurrent.CountDownLatch.await(Unknown Source)
	at org.apache.tomcat.websocket.FutureToSendHandler.get(FutureToSendHandler.java:76)
	at org.apache.tomcat.websocket.FutureToSendHandler.get(FutureToSendHandler.java:31)
	at org.apache.tomcat.websocket.WsRemoteEndpointImplBase.sendObject(WsRemoteEndpointImplBase.java:521)
	at org.apache.tomcat.websocket.WsRemoteEndpointBasic.sendObject(WsRemoteEndpointBasic.java:74)
        ...
Comment 1 Barry Coughlan 2015-11-18 15:41:37 UTC
I mistakenly thought that packets sent with getBasicRemote() were sent on the same thread.

To add, all of the threads for sending messages are in the state of waiting for tasks - none of them are blocked:

"WebSocketServer-localhost-/MyApp-1453" - Thread t@13892
   java.lang.Thread.State: TIMED_WAITING
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <785b947d> (a java.util.concurrent.SynchronousQueue$TransferStack)
	at java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source)
	at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(Unknown Source)
	at java.util.concurrent.SynchronousQueue$TransferStack.transfer(Unknown Source)
	at java.util.concurrent.SynchronousQueue.poll(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Unknown Source)

   Locked ownable synchronizers:
	- None

Some other things that might be relevant from looking at the code:
- We don't call the sendPing() or sendText() methods, only sendObject().
- We don't use partial message decoding/encoding.
- Batching is not enabled.
- Looking at a heap dump, the only reference to the FutureSendHandler from the stuck thread is in the ConcurrentHashMap in org.apache.tomcat.websocket.WsSession.futures
Comment 2 Barry Coughlan 2015-11-18 18:12:27 UTC
Sorry for all the noise.

If I understand correctly, the flow of sending messages goes like this:

WsRemoteEndpointImplBase.startMessage -> messaged dispatched to WsServerContainer.executor -> calls WsRemoteEndpointImplBase.endMessage on completion (via EndMessageHandler).

I'm wondering if there could be an exception in the executor callback. For example, there are some exceptions thrown in writeMessagePart (called by endMessage). If an exception was thrown there it would be thrown all the way up to the executor and not logged. Am I correct in saying that exceptions in executor tasks would not produce exceptions in the Tomcat logs? ThreadPoolExecutor doesn't log them to the console by (http://stackoverflow.com/q/2248131/281469)


My current working theory for what's happening is as follows: 
- Server starts sending a message
- The client closes the connection before the executor finishes the message, which causes WsRemoteEndpointImplBase.closed to be set to true.
- Server finishes sending the message, calls endMessage, which calls writeMessagePart which throws an IllegalStateException because closed=true.
- Because of the exception the future never completes and the exception doesn't get logged.

At first I thought the synchronized (messagePartLock) block would protect against this. However it can still occur:
- The executor task finishes and endMessage() reaches the end of the synchronized block (but before calling writeMessagePart).
- The client close enters and exits the synchronized block in startMessage(). Calls writeMessagePart(), sets closed=true.
- The executor task calls writeMessagePart().

If this is a race condition it is a very rare one - it occurred once in sending ~5 million messages during a soak test.
Comment 3 Christopher Schultz 2015-11-18 20:34:47 UTC
(In reply to Barry Coughlan from comment #2)
> My current working theory for what's happening is as follows: 
> - Server starts sending a message
> - The client closes the connection before the executor finishes the message,
> which causes WsRemoteEndpointImplBase.closed to be set to true.
> - Server finishes sending the message, calls endMessage, which calls
> writeMessagePart which throws an IllegalStateException because closed=true.
> - Because of the exception the future never completes and the exception
> doesn't get logged.

I'm no expert, but what about this:

The Future won't complete, but it should probably throw ExecutionException if you call one of the "get" methods. You can unwrap that and see the exception that caused the operation to fail.
Comment 4 Barry Coughlan 2015-11-19 00:25:49 UTC
(In reply to Christopher Schultz from comment #3)
> (In reply to Barry Coughlan from comment #2)
> > My current working theory for what's happening is as follows: 
> > - Server starts sending a message
> > - The client closes the connection before the executor finishes the message,
> > which causes WsRemoteEndpointImplBase.closed to be set to true.
> > - Server finishes sending the message, calls endMessage, which calls
> > writeMessagePart which throws an IllegalStateException because closed=true.
> > - Because of the exception the future never completes and the exception
> > doesn't get logged.
> 
> I'm no expert, but what about this:
> 
> The Future won't complete, but it should probably throw ExecutionException
> if you call one of the "get" methods. You can unwrap that and see the
> exception that caused the operation to fail.

The getBasicRemote().sendObject() calls the Future.get() on our behalf (see the stack trace in the first post). The .get() never returns because the onResult of the FutureToSendHandler never gets called.
Comment 5 Barry Coughlan 2015-11-19 10:47:56 UTC
The race condition I was talking about was incorrect, as the if (!close) in endMessage() takes care of it. Also the exceptions are logged because the executor is called with execute() instead of submit().

So I'm back to square one :)
Comment 6 Mark Thomas 2015-11-19 14:52:56 UTC
Thanks for all the work on this.

It is worth updating to the latest 8.0.x in case one of the I/O or concurrency updates has improved this.

I'm switching this to NEEDINFO. Ideally we need a test case but that is not essential. If you believe you have found a code path that could trigger this let us know and we'll take a look. As long as we can confirm the code path is theoretically possible, we'll be happy to fix it.
Comment 7 Barry Coughlan 2015-11-22 04:12:19 UTC
(In reply to Mark Thomas from comment #6)
> Thanks for all the work on this.
> 
> It is worth updating to the latest 8.0.x in case one of the I/O or
> concurrency updates has improved this.
> 
> I'm switching this to NEEDINFO. Ideally we need a test case but that is not
> essential. If you believe you have found a code path that could trigger this
> let us know and we'll take a look. As long as we can confirm the code path
> is theoretically possible, we'll be happy to fix it.

Hi Mark,

I managed to replicate this on 8.0.24 and 8.0.28. I don't think I could automate a test to replicate this, but it can be shown with a simple example by setting breakpoints.

============================================================

Sample server endpoint:

------------------------

package tomcatbug58624;

import javax.websocket.EncodeException;
import javax.websocket.Encoder;
import javax.websocket.EndpointConfig;

public class WsEncoder implements Encoder.Text<Object> {

    @Override
    public void destroy() {
    }

    @Override
    public void init(EndpointConfig endpointConfig) {
    }

    @Override
    public String encode(Object object) throws EncodeException {
        return (String) object;
    }

}

------------------------

package tomcatbug58624;

import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import javax.websocket.CloseReason;
import javax.websocket.EncodeException;
import javax.websocket.OnClose;
import javax.websocket.OnError;
import javax.websocket.OnMessage;
import javax.websocket.OnOpen;
import javax.websocket.Session;
import javax.websocket.server.ServerEndpoint;

@ServerEndpoint(value = "/bug58624", encoders = WsEncoder.class)
public class WsEndpoint {

    public static class SendWsMessage implements Runnable {
        private Session session;

        public SendWsMessage(Session session) {
            this.session = session;
        }

        @Override
        public void run() {
            try {
                session.getBasicRemote().sendObject("test");
            } catch (IOException | EncodeException e) {
                e.printStackTrace();
            }
        }

    }

    private static final ExecutorService ex = Executors.newFixedThreadPool(1);

    @OnOpen
    public void onOpen(Session session) throws IOException {
        ex.submit(new SendWsMessage(session));
    }

    @OnMessage
    public void onMessage(String message) {
        System.out.println("OnMessage: " + message);
    }

    @OnError
    public void onError(Throwable t) {
        System.err.println("OnError:");
        t.printStackTrace();
    }

    @OnClose
    public void onClose(Session session, CloseReason cr) {
        System.out.println("Closed " + cr);
    }
}

-----------------------------

Set breakpoint on WsEndpoint$SendWsMessage:29 (The .sendObject() line)
Set breakpoint on WsRemoteEndpointImplServer:76 (start of doWrite())

- Open websocket
  - Thread 1 breaks on sendObject()
- Close websocket from client side (I just did Ctrl+C with Tyrus CLI)
   - Thread 2 breaks on doWrite()
- Continue Thread 1 so that the message gets queued to send
- Continue Thread 2

sendObject() waits indefinitely for Future.get(). The reason is that doWrite is not set to true in endMessage() because closed = true.

This would happen any time a message is queued to be sent while a close is in progress.

Iā€™m not familiar enough with the codebase to submit a patch, but it looks to me like if closed==true in endMessage(), it should call sendResult with an exception for any messages still in the queue.
Comment 8 Mark Thomas 2015-11-24 11:19:39 UTC
Thanks for the analysis. I'll take a look.
Comment 9 Mark Thomas 2015-11-24 22:20:31 UTC
Thanks for the great analysis and test case. I've fixed this in 9.0.x for 9.0.0.M2. There has been some refactoring of the relevant code so a back-port may take a little longer than usual.
Comment 10 Barry Coughlan 2015-11-25 00:00:51 UTC
(In reply to Mark Thomas from comment #9)
> Thanks for the great analysis and test case. I've fixed this in 9.0.x for
> 9.0.0.M2. There has been some refactoring of the relevant code so a
> back-port may take a little longer than usual.

Thanks a lot for the speedy fix Mark.

For future readers, we currently work around this in Tomcat 8 by replicating the behaviour of getBasicRemote().sendObject() but with a timeout on the future, so that at least our application can recover: 

Future<Void> f = session.getAsyncRemote().sendObject(...)        
try {
    f.get(10, TimeUnit.SECONDS);
} catch (InterruptedException e) {
    throw new IOException(e);
} catch (ExecutionException e) {
    Throwable cause = e.getCause();
    if (cause instanceof IOException) {
        throw (IOException) cause;
    } else if (cause instanceof EncodeException) {
        throw (EncodeException) cause;
    } else {
        throw new IOException(e);
    }
} catch (TimeoutException e) {
    throw new IOException(e);
}
Comment 11 Mark Thomas 2015-11-25 11:45:30 UTC
I've back-ported the fix to 8.0.x for 8.0.30 onwards and to 7.0.x for 7.0.66 onwards.
Comment 12 Barry Coughlan 2015-12-02 16:50:59 UTC
Created attachment 33320 [details]
Test to check that session.close() does not call OnError
Comment 13 Barry Coughlan 2015-12-02 16:53:17 UTC
Hi Mark,

I was writing a test case for bug57489 and came across a problem with session.close() that I think is related to this fix. Since session.close() sets state = State.CLOSING the block in registerFuture() fails. See test case attached.



02-Dec-2015 16:49:33.351 FINE [main] org.apache.tomcat.websocket.WsSession.sendCloseMessage Failed to send close message for session [0] to remote endpoint
 java.io.IOException: java.util.concurrent.ExecutionException: java.io.IOException: Unable to write the complete message as the WebSocket connection has been closed
	at org.apache.tomcat.websocket.WsRemoteEndpointImplBase.startMessageBlock(WsRemoteEndpointImplBase.java:282)
	at org.apache.tomcat.websocket.WsSession.sendCloseMessage(WsSession.java:584)
	at org.apache.tomcat.websocket.WsSession.doClose(WsSession.java:488)
	at org.apache.tomcat.websocket.WsSession.close(WsSession.java:455)
	at org.apache.tomcat.websocket.WsSession.close(WsSession.java:449)
	at org.apache.tomcat.websocket.server.TestCleanClose.testOnErrorNotCalledWhenClosingConnection(TestCleanClose.java:46)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:483)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
	at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
	at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
	at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:675)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Unable to write the complete message as the WebSocket connection has been closed
	at org.apache.tomcat.websocket.FutureToSendHandler.get(FutureToSendHandler.java:102)
	at org.apache.tomcat.websocket.WsRemoteEndpointImplBase.startMessageBlock(WsRemoteEndpointImplBase.java:275)
	... 34 more
Caused by: java.io.IOException: Unable to write the complete message as the WebSocket connection has been closed
	at org.apache.tomcat.websocket.WsSession.registerFuture(WsSession.java:658)
	at org.apache.tomcat.websocket.FutureToSendHandler.get(FutureToSendHandler.java:92)
	... 35 more
Comment 14 Barry Coughlan 2015-12-09 09:53:18 UTC
Should I change this to reopened just so this isn't forgotten about?
Comment 15 Sebastian Herold 2016-01-05 21:17:32 UTC
+1 for reopening the case or creating a new bug, if this (java.io.IOException: Unable to write the complete message as the WebSocket connection has been closed) is not related to the existing bug.

Spring also recreated the error in one of their tests and had to ignore it until Tomcat is fixed: https://jira.spring.io/browse/INT-3909
Comment 16 Mark Thomas 2016-02-23 13:44:53 UTC
This new issue is being tracked under bug 59014