Created attachment 31613 [details] the sample webapp to reproduce the bug we have encouter this bug in a real product webapp. I have tested this in linux x86, oracle jdk jdk1.7.0_55, tomcat 7.0.53 and tomcat 8.0.5. CONFIG: we change HTTP Connector to NIO in "server.xml", e.g. protocol="org.apache.coyote.http11.Http11NioProtocol" WEBAPP LOGIC: the simplified situation: 1. call "req.startAsync()" to start async serlvet, then execute the async logic in our user thread. 2. sometimes the user thread be interrupted (by some timeout logic of our code). 3. some user code call "resp.flushBuffer()" to send response to client PROBLEM: in the situation descibed above, the "LimitLatch.countDown()" is not called. when the connections limit latch count up to max ( default "10000" ), tomcat DO not accept any connection, all incoming client hangs. REPRODUCER: in a clean tomcat-7.0.53 installation: 1. change the default "server.xml" Connector config. (1) change protocol="org.apache.coyote.http11.Http11NioProtocol" (2) Optional, add maxConnections="100" to reproduce the bug faster. 2. copy the sample webapp in the attachment to "webapps/ROOT.war" 3. start tomcat. 4. make plenty request to "/async.html", for (( i = 0; i < 15000; ++i )) ; do echo $i; curl localhost:8080/async.html; done each request is likely cause a limit latch leak. when the requests reaches maxConnections (100 as we set above) or some more, the client ( curl ) hangs. TECHNIC-DETAILS: after some debug, wo found these: 1. when the thread was interrupted, when the user code call "resp.flushBuffer()", the NioChannel was Closed by jdk NIO code, and a ClosedByInterruptException is thrown. 2. when the channel closed, the SelectionKey was removed by Poller thread, stack trace: Daemon Thread [http-nio-8080-ClientPoller-0] (Suspended) owns: Object (id=3346) owns: HashSet<E> (id=3354) owns: EPollSelectorImpl (id=82) owns: Collections$UnmodifiableSet<E> (id=3355) owns: Util$2 (id=3356) SocketChannelImpl(AbstractSelectableChannel).removeKey(SelectionKey) line: 114 EPollSelectorImpl(AbstractSelector).deregister(AbstractSelectionKey) line: 168 EPollSelectorImpl.implDereg(SelectionKeyImpl) line: 162 EPollSelectorImpl(SelectorImpl).processDeregisterQueue() line: 131 EPollSelectorImpl.doSelect(long) line: 69 EPollSelectorImpl(SelectorImpl).lockAndDoSelect(long) line: 69 EPollSelectorImpl(SelectorImpl).select(long) line: 80 NioEndpoint$Poller.run() line: 1163 Thread.run() line: 662 3. when we call "ctx.complete()", it run to "org.apache.tomcat.util.net.NioEndpoint.processSocket(NioChannel, SocketStatus, boolean)", code is below: public boolean processSocket(NioChannel socket, SocketStatus status, boolean dispatch) { try { KeyAttachment attachment = (KeyAttachment)socket.getAttachment(false); if (attachment == null) { return false; } since the SelectionKey was removed, the "attachment" returns null. the logic is break, "AbstractEndpoint.countDownConnection()" is not called, a limit latch leak happens. WORK-AROUND: some work-around: 1. switch to the stable BIO connector. 2. avoid call "resp.flushBuffer()" in the user thread.
Created attachment 31614 [details] source code of the sample webapp
Thanks for the sample web application. I am able to reproduce this with NIO but not with NIO2 or APR/native so it appears that this issue is specific to NIO. I have started some refactoring that should make this simpler to fix (in 8.0.x at least) but it doesn't look like there is an easy fix for this.
As luck would have it, the next stage of clean-up / refactoring appears to have fixed this issue. The fix is applied to 8.0.x and the unit tests pass on OSX. I'm just waiting for the results for Linux and Windows before looking at back-porting the fix to 7.0.x.
The fix works on 8.0.x but back-porting the changes to 7.0.x doesn't appear to address the problem on that platform. For reasons I don't yet fully understand, 8.0.x appears not to throw the ClosedByInterruptException.
Created attachment 31690 [details] sample webapp to reproduce the bug in tomcat 8.0.8
Created attachment 31691 [details] source code of the sample webapp
we are try to back-porting the fix in tomcat 8 to tomcat 7. After some debugging, It seems the bug does not fixed in tomcat 8, but make it more difficult to reproduce. The root cause is that the SelectionKey is removed before the count down be executed. I fixed the sample webapp to set the sleep time a little more than the selectorTimeout, then I can reproduce the bug in tomcat 8.0.8. I'm use ubuntu 12.04 32bit, oracle jdk 1.7.0_55, tomcat 8.0.8. In a tomcat 8.0.8 clean install, (1) edit "server.xml", in the Connector add maxConnections="10" to reproduce this bug faster. (2) remove origin ROOT in webapps, 'rm -rf conf/ROOT', deploy the sample webapp to "webapps/ROOT.war". (3) start tomcat. (4) make plenty request to "/async.html", for (( i = 0; i < 15000; ++i )) ; do echo $i; curl localhost:8080/async.html; done wait a minute, then the curl client hangs, the connection limit will count up to the max.
(In reply to hanyong from comment #7) > It seems the bug does not fixed in tomcat 8, but make > it more difficult to reproduce. I wondered if that might be the case. I've spent quite a bit of time looking at this and don't see a simple solution at this point.
Created attachment 31698 [details] fix bug 56518 based on https://svn.apache.org/repos/asf/tomcat/trunk@1595293 Here is my Quick and Dirty patch to fix this at tomcat 8.0.8 , hope it help for a regular solution. I also started a PR on github: https://github.com/apache/tomcat/pull/10 In NioEndpoint, save and retrieve SelectionKey in KeyAttachment. When the socket is closed by the interrupt, the associated SelectionKey may be removed by the poller thread before async complete. Then socket.keyFor() returns null, the connection count would not be reduced. Save the key in the attachment could fix this.
Sorry, the patch DOES NOT work well. I found another problem. The aysnc servlet timeout is triggered by the poller thead also, since the SelectionKey was deregistered when the socket be closed, the asyncTimeout will never triggered. In this situation, If the user forgot to call "AsyncContext.complete()", the limit leak happens always.
One way around that issue would be to move the Async timeout processing to a separate thread - like it is for the other endpoints.
Created attachment 31703 [details] fix bug 56518 on TOMCAT_7_0_54 Clear thread interrupted status before write to NIO socket. i.e. Never interrupt NIO ops. It's very rough, but It's the simplest way and works.
Thanks for all your work on this. It is much appreciated. Given that it is the application that is triggering the interrupt, shouldn't it be the application's responsibility to clear that interrupt before continuing and flushing the response buffer? I'm currently thinking of adding a test for the current Thread being interrupted to the NIO write methods along the lines of your proposed patch. I want to test the performance of this first. I'm also leaning towards logging a warning if the thread is interrupted on the basis that the app that generated the interrupt should have cleared it.
I've done some testing and it looks like the call to Thread.currentThread().isInterrupted() is nice and quick. I'm going to go ahead and make some changes here. I've been thinking about this some more and I am leaning towards closing the connection if Tomcat detects an interrupt since that is what would happen if the thread was allowed to continue. This adds weight to the view that the application should clear the interrupt.
(In reply to Mark Thomas from comment #14) > > I am leaning towards closing the > connection if Tomcat detects an interrupt since that is what would happen if > the thread was allowed to continue. +1. I have not followed this discussion, but I can add that starting with 8.0.9/7.0.55 an interruption may also be performed by o.a.c.valves.StuckThreadDetectionValve, if it is configured to do so. (r1595331 bug 56526).
I have implemented this for 8.0.x and it will be included in 8.0.9 onwards. I am now looking at back-porting the fix to 7.0.x.
This has been fixed in 7.0.x for 7.0.55 onwards.