Bug 61183 - Deadlock occurs while sending to a closing session
Summary: Deadlock occurs while sending to a closing session
Status: RESOLVED FIXED
Alias: None
Product: Tomcat 8
Classification: Unclassified
Component: WebSocket
Version: 8.0.33
Hardware: PC All
Importance: P2 regression
Target Milestone: ----
Assignee: Tomcat Developers Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-06-13 16:25 UTC by Lorcan O'Toole
Modified: 2017-06-21 09:46 UTC
CC List: 0 users



Attachments
Description in txt (6.50 KB, text/plain)
2017-06-13 16:33 UTC, Lorcan O'Toole

Description Lorcan O'Toole 2017-06-13 16:25:12 UTC
Hi,
We have recently run into a deadlock at a customer site while using WebSockets.
One of the locks is in our code, while the other is Tomcat's "stateLock" defined in class WsSession.

I have marked this issue as a regression because one of the two locations where Tomcat takes the stateLock in 8.0.33 did not exist in 8.0.29. Here's a quick look at the difference before examining the deadlocked call stacks:

In 8.0.33 there's an expansion of the stateLock usage:
    protected void registerFuture(FutureToSendHandler f2sh) {
        boolean fail = false;
        synchronized (stateLock) {
            // leaving out the details...
        }
    }

While in 8.0.29 the entire method looks like:
    protected void registerFuture(FutureToSendHandler f2sh) {
        futures.put(f2sh, f2sh);
    }

I can understand why the stateLock usage has been expanded, but I think there is a problem with it. Previously we would only enter the stateLock on Tomcat -> Application calls. With the new registerFuture() we can now also enter the stateLock on Application -> Tomcat calls.
So a recent change seems to have introduced a key ingredient for deadlocks: if my application has its own lock, there is the potential for the two locks to be acquired in reverse order by two simultaneous threads.
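
For illustration, here is a minimal, hypothetical sketch of the inverted lock ordering (the class and lock names are made up; in our case the two locks are our synchronized WsOutbound and Tomcat's stateLock):

    // Hypothetical demo, not actual Tomcat or application code. Thread A mimics
    // Tomcat's close path, thread B mimics an application send; each takes one
    // lock and then blocks waiting for the other.
    public class LockOrderDemo {

        private final Object appLock = new Object();   // stands in for our synchronized WsOutbound
        private final Object stateLock = new Object(); // stands in for WsSession's stateLock

        // Tomcat -> Application direction: close under stateLock, then call
        // onClose(), which needs the application lock.
        void closePath() {
            synchronized (stateLock) {
                synchronized (appLock) {
                    // fireEndpointOnClose() -> application tidy-up
                }
            }
        }

        // Application -> Tomcat direction: send under the application lock,
        // then registerFuture() needs stateLock.
        void sendPath() {
            synchronized (appLock) {
                synchronized (stateLock) {
                    // registerFuture() -> wait for the send result
                }
            }
        }

        public static void main(String[] args) {
            LockOrderDemo demo = new LockOrderDemo();
            new Thread(demo::closePath).start();
            new Thread(demo::sendPath).start();
            // With the right interleaving, the two threads deadlock here.
        }
    }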

We have experienced a real-world example of this:

In the 1st thread Tomcat is closing the session after an abrupt disconnect from the client. The close action bubbles up to the Application onClose handler:

"http-nio-8445-exec-76" daemon prio=5 tid=249 BLOCKED
                at com.myc.common.domain.websocketutil.WsOutbound.setSocketStatus(WsOutbound.java:152) --> Waits on the Application local lock
                   Local Variable: com.myc.common.domain.websocketutil.WsOutbound$SocketStatus#2
                at com.myc.customer.web.socket.WebSocketEndpoint.onClose(WebSocketEndpoint.java:224)  --> Application onClose handler called which has to do some tidy up
                   Local Variable: javax.websocket.CloseReason#2
                   Local Variable: com.myc.customer.web.socket.WebSocketEndpoint#4
                   Local Variable: java.lang.Object#9184
                at sun.reflect.GeneratedMethodAccessor250.invoke(<unknown string>)
                at sun.reflect.DelegatingMethodAccessorImpl.invoke(<unknown string>)
                at java.lang.reflect.Method.invoke(<unknown string>)
                at org.apache.tomcat.websocket.pojo.PojoEndpointBase.onClose(PojoEndpointBase.java:107)
                   Local Variable: org.apache.tomcat.websocket.pojo.PojoEndpointServer#6
                at org.apache.tomcat.websocket.WsSession.fireEndpointOnClose(WsSession.java:541)
                   Local Variable: org.apache.catalina.loader.WebappClassLoader#2
                   Local Variable: org.apache.catalina.core.DefaultInstanceManager#3
                at org.apache.tomcat.websocket.WsSession.doClose(WsSession.java:490)               --> Acquires the "stateLock" on Tomcat's call to close the session
                at org.apache.tomcat.websocket.server.WsHttpUpgradeHandler.onError(WsHttpUpgradeHandler.java:150)
                   Local Variable: org.apache.tomcat.websocket.server.WsHttpUpgradeHandler#6
                at org.apache.tomcat.websocket.server.WsHttpUpgradeHandler.access$300(WsHttpUpgradeHandler.java:48)

				
In the 2nd thread the Application is simply sending a normal message to Tomcat (unaware the session is closing simultaneously):

"defaultEventExecutorGroup-3-1" prio=5 tid=94 BLOCKED
                at org.apache.tomcat.websocket.WsSession.registerFuture(WsSession.java:650)           --> Waits on the Tomcat "stateLock"
                   Local Variable: org.apache.tomcat.websocket.WsSession#6
                at org.apache.tomcat.websocket.FutureToSendHandler.get(FutureToSendHandler.java:104)
                   Local Variable: org.apache.tomcat.websocket.FutureToSendHandler#247
                   Local Variable: java.util.concurrent.TimeUnit$4#1
                at org.apache.tomcat.websocket.FutureToSendHandler.get(FutureToSendHandler.java:31)
                at com.myc.common.domain.websocketutil.WsOutbound.await(WsOutbound.java:255)
                at com.myc.common.domain.websocketutil.WsOutbound.sendObj(WsOutbound.java:234)
                at com.myc.common.domain.websocketutil.WsOutbound.send(WsOutbound.java:215)         --> Acquires the Application local lock while sending (to guarantee message order when sending multiple messages)
                   Local Variable: com.myc.customer.model.json.notification.NewMessageNotification#3

Our interface to Tomcat (WsOutbound.java) takes the application lock using synchronized methods, as follows:

    public synchronized void setSocketStatus(SocketStatus status) {
        // Does some state change stuff. Can be invoked from onClose()
    }

    public synchronized boolean send(Object obj) {
        // Sends one or more messages to Tomcat
    }

So we had a simple locking strategy. Now I have to come up with a message queuing strategy so that I can send multiple messages, in order, without holding an Application lock.
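
For what it's worth, the workaround I am considering looks roughly like this (a sketch only; OrderedSender is a hypothetical helper of ours, not a Tomcat class). Messages are handed to a single-threaded executor, which preserves ordering while ensuring no application lock is held on the thread that calls into the Tomcat session:

    // Rough sketch of the queuing workaround. A single-threaded executor keeps
    // messages in order; the caller's own locks are never held while Tomcat's
    // send path (and therefore its stateLock) is entered.
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import javax.websocket.Session;

    public class OrderedSender {

        private final Session session;
        private final ExecutorService sendQueue = Executors.newSingleThreadExecutor();

        public OrderedSender(Session session) {
            this.session = session;
        }

        // Callers may hold their own locks here: we only enqueue, and never
        // block on Tomcat's stateLock from the caller's thread.
        public void send(String message) {
            sendQueue.submit(() -> {
                try {
                    session.getBasicRemote().sendText(message);
                } catch (Exception e) {
                    // Session is probably closing; drop or log the message.
                }
            });
        }

        public void shutdown() {
            sendQueue.shutdown();
        }
    }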

To my mind the root problem is in the existing Tomcat code for doClose(). In other deadlock situations I've been involved with, the onus has been on the thread that invokes a callback/listener to release any locks it holds first:

    private void doClose(CloseReason closeReasonMessage,
            CloseReason closeReasonLocal) {
        // Double-checked locking. OK because state is volatile
        if (state != State.OPEN) {
            return;
        }

        synchronized (stateLock) {
            if (state != State.OPEN) {
                return;
            }

            if (log.isDebugEnabled()) {
                log.debug(sm.getString("wsSession.doClose", id));
            }
            try {
                wsRemoteEndpoint.setBatchingAllowed(false);
            } catch (IOException e) {
                log.warn(sm.getString("wsSession.flushFailOnClose"), e);
											<-- Exit stateLock first?
                fireEndpointOnError(e);
            }

            state = State.CLOSING;

            sendCloseMessage(closeReasonMessage);
											<-- Exit stateLock first?
            fireEndpointOnClose(closeReasonLocal);
											<-- Reenter stateLock?
            state = State.CLOSED;
        }
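
Roughly the pattern I have in mind, purely as an illustration (not a proposed patch; it reuses the field and method names from the excerpt above, and keeping the state transitions consistent while the lock is released is presumably the hard part):

    // Illustrative only: drop stateLock before invoking the endpoint callback,
    // then re-acquire it to finish the state transition.
    private void doCloseSketch(CloseReason closeReasonMessage, CloseReason closeReasonLocal) {
        synchronized (stateLock) {
            if (state != State.OPEN) {
                return;
            }
            state = State.CLOSING;
            sendCloseMessage(closeReasonMessage);
        }

        // The callback runs outside stateLock, so an application lock taken in
        // onClose() can no longer invert the lock order against registerFuture().
        fireEndpointOnClose(closeReasonLocal);

        synchronized (stateLock) {
            state = State.CLOSED;
        }
    }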
		
I took a quick look at the latest 8.0.x (8.0.44) code and it appears the same issue can happen.
I'd appreciate any comments on my analysis, or suggestions for improving our application code.
Comment 1 Lorcan O'Toole 2017-06-13 16:33:50 UTC
Created attachment 35050 [details]
Description in txt

The call stacks and comments might be easier to read in the attached .txt file.
Comment 2 Mark Thomas 2017-06-18 20:24:37 UTC
The analysis looks to be spot on. The tricky part is likely to be fixing this without re-introducing the bugs those locks were put in place to fix.
Comment 3 Mark Thomas 2017-06-19 12:34:27 UTC
I spent a lot of time trying to avoid calling the callbacks with the lock held. In the end, I opted to refactor registerFuture().

Fixed in:
- trunk for 9.0.0.M22 onwards
- 8.5.x for 8.5.16 onwards
- 8.0.x for 8.0.45 onwards
- 7.0.x for 7.0.79 onwards
Comment 4 Lorcan O'Toole 2017-06-21 09:32:22 UTC
Thanks Mark,
I assume the refactoring of registerFuture involved avoiding stateLock. I would imagine this problem should be reproducible with the right test harness. 

Have you any view on when these patches will be made available?
regards,
Lorcan
Comment 5 Mark Thomas 2017-06-21 09:46:45 UTC
I'm building the 9.0.0.M22 release now. 8.5.x will follow shortly. Releases take 4-5 days, mainly because of the minimum 72 hour voting period and allowing 24 hours for the mirrors to sync.