Bug 56518 - NIO async servlet limit latch leak
Summary: NIO async servlet limit latch leak
Status: RESOLVED FIXED
Alias: None
Product: Tomcat 7
Classification: Unclassified
Component: Connectors (show other bugs)
Version: 7.0.53
Hardware: PC Linux
: P2 normal (vote)
Target Milestone: ---
Assignee: Tomcat Developers Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-05-13 10:47 UTC by hanyong
Modified: 2014-06-16 10:08 UTC (History)
1 user (show)



Attachments
the sample webapp to reproduce the bug (4.37 KB, application/octet-stream)
2014-05-13 10:47 UTC, hanyong
Details
source code of the sample webapp (1.85 KB, application/x-tar-gz)
2014-05-13 10:52 UTC, hanyong
Details
sample webapp to reproduce the bug in tomcat 8.0.8 (4.45 KB, application/octet-stream)
2014-06-05 10:41 UTC, hanyong
Details
source code of the sample webapp (1.87 KB, application/x-tar-gz)
2014-06-05 10:43 UTC, hanyong
Details
fix bug 56518 based on https://svn.apache.org/repos/asf/tomcat/trunk@1595293 (2.53 KB, patch)
2014-06-09 04:43 UTC, hanyong
Details | Diff
fix bug 56518 on TOMCAT_7_0_54 (1.22 KB, patch)
2014-06-10 09:23 UTC, hanyong
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description hanyong 2014-05-13 10:47:48 UTC
Created attachment 31613 [details]
the sample webapp to reproduce the bug

we have encouter this bug in a real product webapp.

I have tested this in linux x86, oracle jdk jdk1.7.0_55, tomcat 7.0.53 and tomcat 8.0.5.

CONFIG:
we change HTTP Connector to NIO in "server.xml", 
e.g. protocol="org.apache.coyote.http11.Http11NioProtocol"

WEBAPP LOGIC:
the simplified situation:
1. call "req.startAsync()" to start async serlvet, then execute the async logic in our user thread.
2. sometimes the user thread be interrupted (by some timeout logic of our code).
3. some user code call "resp.flushBuffer()" to send response to client

PROBLEM:
in the situation descibed above, the "LimitLatch.countDown()" is not called.
when the connections limit latch count up to max ( default "10000" ),
tomcat DO not accept any connection, all incoming client hangs.

REPRODUCER:

in a clean tomcat-7.0.53 installation:
1. change the default "server.xml" Connector config.
(1) change protocol="org.apache.coyote.http11.Http11NioProtocol"
(2) Optional, add maxConnections="100" to reproduce the bug faster.
2. copy the sample webapp in the attachment to "webapps/ROOT.war"
3. start tomcat.
4. make plenty request to "/async.html", 

for (( i = 0; i < 15000; ++i )) ; do echo $i; curl localhost:8080/async.html; done

each request is likely cause a limit latch leak.
when the requests reaches maxConnections (100 as we set above) or some more,
the client ( curl ) hangs.

TECHNIC-DETAILS:
after some debug, wo found these:

1. when the thread was interrupted, when the user code call "resp.flushBuffer()",
the NioChannel was Closed by jdk NIO code, and a ClosedByInterruptException is thrown.

2. when the channel closed, the SelectionKey was removed by Poller thread, 
stack trace:
Daemon Thread [http-nio-8080-ClientPoller-0] (Suspended)	
	owns: Object  (id=3346)	
	owns: HashSet<E>  (id=3354)	
	owns: EPollSelectorImpl  (id=82)	
	owns: Collections$UnmodifiableSet<E>  (id=3355)	
	owns: Util$2  (id=3356)	
	SocketChannelImpl(AbstractSelectableChannel).removeKey(SelectionKey) line: 114	
	EPollSelectorImpl(AbstractSelector).deregister(AbstractSelectionKey) line: 168	
	EPollSelectorImpl.implDereg(SelectionKeyImpl) line: 162	
	EPollSelectorImpl(SelectorImpl).processDeregisterQueue() line: 131	
	EPollSelectorImpl.doSelect(long) line: 69	
	EPollSelectorImpl(SelectorImpl).lockAndDoSelect(long) line: 69	
	EPollSelectorImpl(SelectorImpl).select(long) line: 80	
	NioEndpoint$Poller.run() line: 1163	
	Thread.run() line: 662	

3. when we call "ctx.complete()", it run to "org.apache.tomcat.util.net.NioEndpoint.processSocket(NioChannel, SocketStatus, boolean)", code is below:

    public boolean processSocket(NioChannel socket, SocketStatus status, boolean dispatch) {
        try {
            KeyAttachment attachment = (KeyAttachment)socket.getAttachment(false);
            if (attachment == null) {
                return false;
            }

since the SelectionKey was removed, the "attachment" returns null.
the logic is break, "AbstractEndpoint.countDownConnection()" is not called, a limit latch leak happens.

WORK-AROUND:
some work-around:
1. switch to the stable BIO connector.
2. avoid call "resp.flushBuffer()" in the user thread.
Comment 1 hanyong 2014-05-13 10:52:29 UTC
Created attachment 31614 [details]
source code of the sample webapp
Comment 2 Mark Thomas 2014-05-13 22:11:10 UTC
Thanks for the sample web application. I am able to reproduce this with NIO but not with NIO2 or APR/native so it appears that this issue is specific to NIO.

I have started some refactoring that should make this simpler to fix (in 8.0.x at least) but it doesn't look like there is an easy fix for this.
Comment 3 Mark Thomas 2014-05-13 23:30:19 UTC
As luck would have it, the next stage of clean-up / refactoring appears to have fixed this issue. The fix is applied to 8.0.x and the unit tests pass on OSX. I'm just waiting for the results for Linux and Windows before looking at back-porting the fix to 7.0.x.
Comment 4 Mark Thomas 2014-05-14 13:59:03 UTC
The fix works on 8.0.x but back-porting the changes to 7.0.x doesn't appear to address the problem on that platform.

For reasons I don't yet fully understand, 8.0.x appears not to throw the ClosedByInterruptException.
Comment 5 hanyong 2014-06-05 10:41:05 UTC
Created attachment 31690 [details]
sample webapp to reproduce the bug in tomcat 8.0.8
Comment 6 hanyong 2014-06-05 10:43:43 UTC
Created attachment 31691 [details]
source code of the sample webapp
Comment 7 hanyong 2014-06-05 10:57:09 UTC
we are try to back-porting the fix in tomcat 8 to tomcat 7.
After some debugging, It seems the bug does not fixed in tomcat 8, but make it more difficult to reproduce.
The root cause is that the SelectionKey is removed before the count down be executed. 

I fixed the sample webapp to set the sleep time a little more than the selectorTimeout, then I can reproduce the bug in tomcat 8.0.8.

I'm use ubuntu 12.04 32bit, oracle jdk 1.7.0_55, tomcat 8.0.8.

In a tomcat 8.0.8 clean install,
(1) edit "server.xml", in the Connector add maxConnections="10" to reproduce this bug faster.
(2) remove origin ROOT in webapps, 'rm -rf conf/ROOT', deploy the sample webapp to "webapps/ROOT.war".
(3) start tomcat.
(4) make plenty request to "/async.html", 

for (( i = 0; i < 15000; ++i )) ; do echo $i; curl localhost:8080/async.html; done

wait a minute, then the curl client hangs, the connection limit will count up to the max.
Comment 8 Mark Thomas 2014-06-05 15:33:17 UTC
(In reply to hanyong from comment #7)
> It seems the bug does not fixed in tomcat 8, but make
> it more difficult to reproduce.

I wondered if that might be the case. I've spent quite a bit of time looking at this and don't see a simple solution at this point.
Comment 9 hanyong 2014-06-09 04:43:14 UTC
Created attachment 31698 [details]
fix bug 56518 based on https://svn.apache.org/repos/asf/tomcat/trunk@1595293

Here is my Quick and Dirty patch to fix this at tomcat 8.0.8 , hope it help for a regular solution.
I also started a PR on github: https://github.com/apache/tomcat/pull/10

In NioEndpoint, save and retrieve SelectionKey in KeyAttachment.
When the socket is closed by the interrupt,
the associated SelectionKey may be removed by the poller thread before async complete.
Then socket.keyFor() returns null, the connection count would not be reduced.
Save the key in the attachment could fix this.
Comment 10 hanyong 2014-06-09 10:33:58 UTC
Sorry, the patch DOES NOT work well.

I found another problem.

The aysnc servlet timeout is triggered by the poller thead also, since the SelectionKey was deregistered when the socket be closed, the asyncTimeout will never triggered.

In this situation, If the user forgot to call "AsyncContext.complete()", the limit leak happens always.
Comment 11 Mark Thomas 2014-06-09 17:05:19 UTC
One way around that issue would be to move the Async timeout processing to a separate thread - like it is for the other endpoints.
Comment 12 hanyong 2014-06-10 09:23:17 UTC
Created attachment 31703 [details]
fix bug 56518 on TOMCAT_7_0_54

Clear thread interrupted status before write to NIO socket. i.e. Never interrupt NIO ops.

It's very rough, but It's the simplest way and works.
Comment 13 Mark Thomas 2014-06-10 13:50:49 UTC
Thanks for all your work on this. It is much appreciated.

Given that it is the application that is triggering the interrupt, shouldn't it be the application's responsibility to clear that interrupt before continuing and flushing the response buffer?

I'm currently thinking of adding a test for the current Thread being interrupted to the NIO write methods along the lines of your proposed patch. I want to test the performance of this first. I'm also leaning towards logging a warning if the thread is interrupted on the basis that the app that generated the interrupt should have cleared it.
Comment 14 Mark Thomas 2014-06-11 10:11:30 UTC
I've done some testing and it looks like the call to Thread.currentThread().isInterrupted() is nice and quick. I'm going to go ahead and make some changes here.

I've been thinking about this some more and I am leaning towards closing the connection if Tomcat detects an interrupt since that is what would happen if the thread was allowed to continue. This adds weight to the view that the application should clear the interrupt.
Comment 15 Konstantin Kolinko 2014-06-11 10:42:58 UTC
(In reply to Mark Thomas from comment #14)
> 
> I am leaning towards closing the
> connection if Tomcat detects an interrupt since that is what would happen if
> the thread was allowed to continue.

+1.

I have not followed this discussion, but I can add that starting with 8.0.9/7.0.55 an interruption may also be performed by o.a.c.valves.StuckThreadDetectionValve, if it is configured to do so. (r1595331 bug 56526).
Comment 16 Mark Thomas 2014-06-12 15:12:44 UTC
I have implemented this for 8.0.x and it will be included in 8.0.9 onwards.

I am now looking at back-porting the fix to 7.0.x.
Comment 17 Mark Thomas 2014-06-16 10:08:55 UTC
This has been fixed in 7.0.x for 7.0.55 onwards.