server.xml Connector configuration:

<Connector port="8080" maxThreads="200" minSpareThreads="100"
           protocol="org.apache.coyote.http11.Http11Nio2Protocol"
           connectionTimeout="20000" keepAliveTimeout="15000"
           maxKeepAliveRequests="-1" redirectPort="8443" acceptCount="400"/>

After running for about 10 days, the Tomcat server stops responding to web requests on port 8080. Requests are sent with the header "Connection: Keep-Alive" and responses include "Keep-Alive: timeout=15". The tcpdump data shows that, once the problem starts, the server never sends a FIN to the client again.

TCP stream 34 in the capture is a normal stream; the response is OK:

3604 2021-07-12 01:25:52.369772 10.61.0.15 10.9.5.23 TCP 66 45978 → 8080 [ACK] Seq=1958 Ack=214372 Win=155136 Len=0 TSval=3599507486 TSecr=359044615
3606 2021-07-12 01:26:07.369723 10.9.5.23 10.61.0.15 TCP 66 8080 → 45978 [FIN, ACK] Seq=214372 Ack=1958 Win=33792 Len=0 TSval=359059615 TSecr=3599507486
3608 2021-07-12 01:26:07.409277 10.61.0.15 10.9.5.23 TCP 66 45978 → 8080 [ACK] Seq=1958 Ack=214373 Win=164864 Len=0 TSval=3599522526 TSecr=359059615
3609 2021-07-12 01:26:14.970215 10.61.0.15 10.9.5.23 TCP 66 45978 → 8080 [FIN, ACK] Seq=1958 Ack=214373 Win=164864 Len=0 TSval=3599530086 TSecr=359059615
3611 2021-07-12 01:26:14.970268 10.9.5.23 10.61.0.15 TCP 66 8080 → 45978 [ACK] Seq=214373 Ack=1959 Win=33792 Len=0 TSval=359067215 TSecr=3599530086

From stream 35 onwards the server no longer responds properly. Stream 35:

3613 2021-07-12 01:26:14.971184 10.61.0.15 10.9.5.23 TCP 74 46052 → 8080 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=3599530087 TSecr=0 WS=512
3614 2021-07-12 01:26:14.971219 10.9.5.23 10.61.0.15 TCP 74 8080 → 46052 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=359067216 TSecr=3599530087 WS=512
3615 2021-07-12 01:26:14.971440 10.61.0.15 10.9.5.23 TCP 66 46052 → 8080 [ACK] Seq=1 Ack=1 Win=29696 Len=0 TSval=3599530088 TSecr=359067216
3616 2021-07-12 01:26:14.971564 10.61.0.15 10.9.5.23 HTTP 590 POST /mx_traffic_searcher/getTrafficViewInfo?trace_id=112256964&device_no=bc7536fd80d000000 HTTP/1.1
3617 2021-07-12 01:26:14.971572 10.9.5.23 10.61.0.15 TCP 66 8080 → 46052 [ACK] Seq=1 Ack=525 Win=30208 Len=0 TSval=359067217 TSecr=3599530088
3618 2021-07-12 01:26:15.706276 10.9.5.23 10.61.0.15 HTTP 270 HTTP/1.1 200
3619 2021-07-12 01:26:15.706508 10.61.0.15 10.9.5.23 TCP 66 46052 → 8080 [ACK] Seq=525 Ack=205 Win=30720 Len=0 TSval=3599530823 TSecr=359067952
3620 2021-07-12 01:26:16.371719 10.61.0.15 10.9.5.23 HTTP 607 POST /mx_traffic_searcher/getTrafficRouteInfo?trace_id=112258403&device_no=bc7536fd80d000000 HTTP/1.1
3621 2021-07-12 01:26:16.371767 10.9.5.23 10.61.0.15 TCP 66 8080 → 46052 [ACK] Seq=205 Ack=1066 Win=31232 Len=0 TSval=359068617 TSecr=3599531488
3622 2021-07-12 01:26:16.374063 10.9.5.23 10.61.0.15 HTTP 265 HTTP/1.1 200
3623 2021-07-12 01:26:16.374244 10.61.0.15 10.9.5.23 TCP 66 46052 → 8080 [ACK] Seq=1066 Ack=404 Win=31744 Len=0 TSval=3599531490 TSecr=359068619
3624 2021-07-12 01:26:31.123460 10.61.0.15 10.9.5.23 HTTP 606 POST /mx_traffic_searcher/getTrafficRouteInfo?trace_id=30989437&device_no=bc7536fd398000000 HTTP/1.1
3630 2021-07-12 01:26:31.163272 10.9.5.23 10.61.0.15 TCP 66 8080 → 46052 [ACK] Seq=404 Ack=1606 Win=32256 Len=0 TSval=359083409 TSecr=3599546240
3631 2021-07-12 01:26:31.771455 10.9.5.23 10.61.0.15 TCP 66 8080 → 46052 [RST, ACK] Seq=404 Ack=1606 Win=32256 Len=0 TSval=359084017 TSecr=3599546240

Stream 36:

3625 2021-07-12 01:26:31.157749 10.61.0.15 10.9.5.23 TCP 74 46080 → 8080 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=3599546274 TSecr=0 WS=512
3626 2021-07-12 01:26:31.157787 10.9.5.23 10.61.0.15 TCP 74 8080 → 46080 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=359083403 TSecr=3599546274 WS=512
3627 2021-07-12 01:26:31.157979 10.61.0.15 10.9.5.23 TCP 66 46080 → 8080 [ACK] Seq=1 Ack=1 Win=29696 Len=0 TSval=3599546274 TSecr=359083403
3628 2021-07-12 01:26:31.158111 10.61.0.15 10.9.5.23 HTTP 768 POST /mx_traffic_searcher/getTrafficViewInfo?trace_id=30989435&device_no=bc7536fd398000000 HTTP/1.1
3629 2021-07-12 01:26:31.158118 10.9.5.23 10.61.0.15 TCP 66 8080 → 46080 [ACK] Seq=1 Ack=703 Win=30720 Len=0 TSval=359083403 TSecr=3599546274
3637 2021-07-12 01:26:41.161537 10.61.0.15 10.9.5.23 TCP 66 46080 → 8080 [FIN, ACK] Seq=703 Ack=1 Win=29696 Len=0 TSval=3599556278 TSecr=359083403
3638 2021-07-12 01:26:41.201292 10.9.5.23 10.61.0.15 TCP 66 8080 → 46080 [ACK] Seq=1 Ack=704 Win=30720 Len=0 TSval=359093447 TSecr=3599556278

Streams 37 and onwards look similar to stream 36: the server ACKs the request and the client's FIN, but never sends an HTTP response or its own FIN.

The jstack output is:

"http-nio2-8080-AsyncTimeout" #210 daemon prio=5 os_prio=0 tid=0x00007fb188933000 nid=0x1ad waiting on condition [0x00007f8e842dc000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.coyote.AbstractProtocol$AsyncTimeout.run(AbstractProtocol.java:1193)
        at java.lang.Thread.run(Thread.java:745)

"http-nio2-8080-Acceptor-0" #209 daemon prio=5 os_prio=0 tid=0x00007fb188932000 nid=0x1ac waiting on condition [0x00007f8e843dd000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x00007f941cfffff8> (a java.util.concurrent.CountDownLatch$Sync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
        at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
        at sun.nio.ch.PendingFuture.get(PendingFuture.java:180)
        at org.apache.tomcat.util.net.Nio2Endpoint$Acceptor.run(Nio2Endpoint.java:393)
        at java.lang.Thread.run(Thread.java:745)

"http-nio2-8080-exec-960" #16165 daemon prio=5 os_prio=0 tid=0x00007f8e1005d800 nid=0x4a8e waiting on condition [0x00007f8dfa485000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x00007f92bd0541b0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
        at org.apache.tomcat.util.threads.TaskQueue.take(TaskQueue.java:108)
        at org.apache.tomcat.util.threads.TaskQueue.take(TaskQueue.java:33)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
        at java.lang.Thread.run(Thread.java:745)

Other ports on the same host still receive data and respond.
The netstat output shows:

tcp    0   0 10.9.5.23:8080   10.61.0.15:40540   SYN_RECV
tcp    0   0 10.9.5.23:8080   10.61.0.15:40442   SYN_RECV
tcp    0   0 10.9.5.23:8080   10.61.0.15:40570   SYN_RECV
tcp    0   0 10.9.5.23:8080   10.61.0.15:40590   SYN_RECV
tcp    0   0 0.0.0.0:22       0.0.0.0:*          LISTEN
tcp    0   0 0.0.0.0:38941    0.0.0.0:*          LISTEN
tcp    0   0 0.0.0.0:18080    0.0.0.0:*          LISTEN
tcp    0   0 127.0.0.1:8005   0.0.0.0:*          LISTEN
tcp  205   0 10.9.5.23:8080   10.9.4.93:48620    CLOSE_WAIT
tcp  515   0 10.9.5.23:8080   10.61.0.15:46760   CLOSE_WAIT
tcp  515   0 10.9.5.23:8080   10.61.0.15:46418   CLOSE_WAIT
tcp    0   0 10.9.5.23:44338  10.61.2.234:6379   ESTABLISHED
tcp    0   0 10.9.5.23:38178  10.253.94.17:11800 ESTABLISHED
It looks like Tomcat is reaching the maxConnections limit. Given the low usage of the system, it looks like some connection closures are being missed. If this analysis is correct, finding the root cause is going to be a lot easier if we can recreate the issue.

The default value for maxConnections is 10000. Can you try reducing it (the lower the better) and then try reproducing the issue? That should trigger the issue faster and may provide some clues as to which connection closures are being missed.

What is the nature of the web application where this is observed? Does it use the Servlet async API? Does it use the Servlet non-blocking API? Does it use WebSocket?

Try switching to the NIO connector. The performance of NIO and NIO2 is very similar in most circumstances.

Testing with the latest 8.5.x release would also be helpful.
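For illustration only, a lowered limit could look like the sketch below; it reuses the Connector attributes from the report, and the exact maxConnections value is just an example for reproducing the problem faster, not a production recommendation:

<!-- Sketch: same Connector as reported, with an explicit, deliberately low
     maxConnections so a leak of connection slots would surface within hours
     rather than days. -->
<Connector port="8080" maxThreads="200" minSpareThreads="100"
           protocol="org.apache.coyote.http11.Http11Nio2Protocol"
           connectionTimeout="20000" keepAliveTimeout="15000"
           maxKeepAliveRequests="-1" redirectPort="8443" acceptCount="400"
           maxConnections="100"/>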
(In reply to Mark Thomas from comment #1)
> It looks like Tomcat is reaching the maxConnections limit. Given the low
> usage of the system, it looks like some connection closures are being
> missed. If this analysis is correct, finding the root cause is going to
> be a lot easier if we can recreate the issue.
>
> The default value for maxConnections is 10000. Can you try reducing it, the
> lower the better, and then try reproducing the issue? That should trigger
> the issue faster and may provide some clues as to which connection closures
> are being missed.
>
> What is the nature of the web application where this is observed? Does it
> use the Servlet async API? Does it use the Servlet non-blocking API? Does it
> use WebSocket?
>
> Try switching to the NIO connector. The performance of NIO and NIO2 is very
> similar in most circumstances.
>
> Testing with the latest 8.5.x release would also be helpful.

First of all, THANK YOU for your help!!

I failed to recreate the issue using JMeter and hping3, even when using hping3 to mock a flood attack.

When the issue happens I count the connections with 'netstat -nat|grep 8080|wc -l'; the result is 108, and the server load is very light (3-5 req/second), so I don't think the maxConnections limit is the reason.

By the way, the client library is HttpClient, with its timeout parameter set to 10 s. I suspect the combination of the client and server timeout parameters causes the issue.
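To make the client side concrete, the timeout described above would be configured roughly as in the sketch below, assuming Apache HttpClient 4.x; the builder calls shown and the connect-timeout value are assumptions, since the report only states a 10 s timeout.

import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class ClientTimeoutSketch {
    public static CloseableHttpClient build() {
        // Assumed reconstruction of the client settings: a 10 s timeout,
        // shorter than the server's keepAliveTimeout of 15 s, so the client
        // may give up on a request before Tomcat closes the connection.
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(10000) // assumption; report only says "timeout param set to 10s"
                .setSocketTimeout(10000)  // read timeout matching the reported 10 s
                .build();
        return HttpClients.custom()
                .setDefaultRequestConfig(config)
                .build();
    }
}

The relevant point is the relationship between this client-side 10 s timeout and the server's keepAliveTimeout="15000" / connectionTimeout="20000" settings, not the exact client API used.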
Created attachment 37962 [details]
Wireshark screenshot, normal and abnormal streams

The attached picture is a Wireshark screenshot showing a normal stream and an abnormal stream.
The thread dump you provided shows that Tomcat is blocking new connections because the maxConnections limit has been reached. This means that Tomcat isn't tracking completed connections correctly, but finding out where this is going wrong is non-trivial.

If the average number of concurrent requests is low, you should be able to reduce maxConnections significantly. If reaching the maxConnections limit is the problem, it should trigger a lot faster than ~10 days with a lower limit. I'd suggest setting it to 100; the issue should then trigger within a few hours. Confirming that the issue is related to reaching maxConnections is the first step towards a fix.

Please answer the following questions (example Connector settings for questions 2 and 3 are sketched below):

1. What is the impact of reducing maxConnections to 100?
2. What is the impact of switching from NIO2 to NIO (with maxConnections="100")?
3. What is the impact of using NIO2 with maxConnections="-1"?
4. What is the nature of the web application where this is observed? Does it use the Servlet async API? Does it use the Servlet non-blocking API? Does it use WebSocket?
5. Does this issue still occur with NIO2, maxConnections="100" and the latest 8.5.x release?

By all means provide us with additional information you think might be helpful, but if you don't also provide answers to the questions we ask, we are unlikely to be able to help.
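For illustration, the Connector variants referred to in questions 2 and 3 might look like the following sketches; all other attributes are assumed to stay as in the reported configuration, so treat these as examples rather than exact required settings.

<!-- Question 2: plain NIO instead of NIO2, keeping the low limit -->
<Connector port="8080" maxThreads="200" minSpareThreads="100"
           protocol="org.apache.coyote.http11.Http11NioProtocol"
           connectionTimeout="20000" keepAliveTimeout="15000"
           maxKeepAliveRequests="-1" redirectPort="8443" acceptCount="400"
           maxConnections="100"/>

<!-- Question 3: NIO2 with the connection limit disabled -->
<Connector port="8080" maxThreads="200" minSpareThreads="100"
           protocol="org.apache.coyote.http11.Http11Nio2Protocol"
           connectionTimeout="20000" keepAliveTimeout="15000"
           maxKeepAliveRequests="-1" redirectPort="8443" acceptCount="400"
           maxConnections="-1"/>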
It has been over a month with no response to the questions in comment #4. Without answers to those questions, this issue will get resolved as WONTFIX.
Several further months have passed with no further information provided. I am therefore resolving this as WORKSFORME. If, and only if, you are able to provide a test case that demonstrates this issue, feel free to re-open this issue and provide the test case.