Bug 65455 - After running for about 10 days, Tomcat 8.5.63 stops responding to web requests
Summary: After running for about 10 days, Tomcat 8.5.63 stops responding to web requests
Status: NEEDINFO
Alias: None
Product: Tomcat 8
Classification: Unclassified
Component: Connectors
Version: 8.5.63
Hardware: PC All
Importance: P2 normal
Target Milestone: ----
Assignee: Tomcat Developers Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-07-19 04:03 UTC by 13998197706
Modified: 2021-08-24 10:29 UTC



Attachments
wireshark's screen shot, stream normal and abnormal (180.59 KB, image/png)
2021-07-19 09:03 UTC, 13998197706

Description 13998197706 2021-07-19 04:03:31 UTC
server.xml config:
<Connector port="8080" maxThreads="200" minSpareThreads="100" protocol="org.apache.coyote.http11.Http11Nio2Protocol" connectionTimeout="20000" keepAliveTimeout="15000" maxKeepAliveRequests="-1" redirectPort="8443" acceptCount="400"/>

The Tomcat server runs for about 10 days and then stops responding to web requests.
Requests carry the header Connection: Keep-Alive.
Responses carry the header Keep-Alive: timeout=15.

From the tcpdump data we can see that the server no longer sends a FIN to the client.

Stream 34 in the tcpdump capture is a normal stream; the response is OK.
3604	2021-07-12 01:25:52.369772	10.61.0.15	10.9.5.23	TCP	66	45978 → 8080 [ACK] Seq=1958 Ack=214372 Win=155136 Len=0 TSval=3599507486 TSecr=359044615
3606	2021-07-12 01:26:07.369723	10.9.5.23	10.61.0.15	TCP	66	8080 → 45978 [FIN, ACK] Seq=214372 Ack=1958 Win=33792 Len=0 TSval=359059615 TSecr=3599507486
3608	2021-07-12 01:26:07.409277	10.61.0.15	10.9.5.23	TCP	66	45978 → 8080 [ACK] Seq=1958 Ack=214373 Win=164864 Len=0 TSval=3599522526 TSecr=359059615
3609	2021-07-12 01:26:14.970215	10.61.0.15	10.9.5.23	TCP	66	45978 → 8080 [FIN, ACK] Seq=1958 Ack=214373 Win=164864 Len=0 TSval=3599530086 TSecr=359059615
3611	2021-07-12 01:26:14.970268	10.9.5.23	10.61.0.15	TCP	66	8080 → 45978 [ACK] Seq=214373 Ack=1959 Win=33792 Len=0 TSval=359067215 TSecr=3599530086

From stream 35 onward, the server does not respond.
Stream 35:
3613	2021-07-12 01:26:14.971184	10.61.0.15	10.9.5.23	TCP	74	46052 → 8080 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=3599530087 TSecr=0 WS=512
3614	2021-07-12 01:26:14.971219	10.9.5.23	10.61.0.15	TCP	74	8080 → 46052 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=359067216 TSecr=3599530087 WS=512
3615	2021-07-12 01:26:14.971440	10.61.0.15	10.9.5.23	TCP	66	46052 → 8080 [ACK] Seq=1 Ack=1 Win=29696 Len=0 TSval=3599530088 TSecr=359067216
3616	2021-07-12 01:26:14.971564	10.61.0.15	10.9.5.23	HTTP	590	POST /mx_traffic_searcher/getTrafficViewInfo?trace_id=112256964&device_no=bc7536fd80d000000 HTTP/1.1 
3617	2021-07-12 01:26:14.971572	10.9.5.23	10.61.0.15	TCP	66	8080 → 46052 [ACK] Seq=1 Ack=525 Win=30208 Len=0 TSval=359067217 TSecr=3599530088
3618	2021-07-12 01:26:15.706276	10.9.5.23	10.61.0.15	HTTP	270	HTTP/1.1 200  
3619	2021-07-12 01:26:15.706508	10.61.0.15	10.9.5.23	TCP	66	46052 → 8080 [ACK] Seq=525 Ack=205 Win=30720 Len=0 TSval=3599530823 TSecr=359067952
3620	2021-07-12 01:26:16.371719	10.61.0.15	10.9.5.23	HTTP	607	POST /mx_traffic_searcher/getTrafficRouteInfo?trace_id=112258403&device_no=bc7536fd80d000000 HTTP/1.1 
3621	2021-07-12 01:26:16.371767	10.9.5.23	10.61.0.15	TCP	66	8080 → 46052 [ACK] Seq=205 Ack=1066 Win=31232 Len=0 TSval=359068617 TSecr=3599531488
3622	2021-07-12 01:26:16.374063	10.9.5.23	10.61.0.15	HTTP	265	HTTP/1.1 200  
3623	2021-07-12 01:26:16.374244	10.61.0.15	10.9.5.23	TCP	66	46052 → 8080 [ACK] Seq=1066 Ack=404 Win=31744 Len=0 TSval=3599531490 TSecr=359068619
3624	2021-07-12 01:26:31.123460	10.61.0.15	10.9.5.23	HTTP	606	POST /mx_traffic_searcher/getTrafficRouteInfo?trace_id=30989437&device_no=bc7536fd398000000 HTTP/1.1 
3630	2021-07-12 01:26:31.163272	10.9.5.23	10.61.0.15	TCP	66	8080 → 46052 [ACK] Seq=404 Ack=1606 Win=32256 Len=0 TSval=359083409 TSecr=3599546240
3631	2021-07-12 01:26:31.771455	10.9.5.23	10.61.0.15	TCP	66	8080 → 46052 [RST, ACK] Seq=404 Ack=1606 Win=32256 Len=0 TSval=359084017 TSecr=3599546240

Stream 36:
3625	2021-07-12 01:26:31.157749	10.61.0.15	10.9.5.23	TCP	74	46080 → 8080 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=3599546274 TSecr=0 WS=512
3626	2021-07-12 01:26:31.157787	10.9.5.23	10.61.0.15	TCP	74	8080 → 46080 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=359083403 TSecr=3599546274 WS=512
3627	2021-07-12 01:26:31.157979	10.61.0.15	10.9.5.23	TCP	66	46080 → 8080 [ACK] Seq=1 Ack=1 Win=29696 Len=0 TSval=3599546274 TSecr=359083403
3628	2021-07-12 01:26:31.158111	10.61.0.15	10.9.5.23	HTTP	768	POST /mx_traffic_searcher/getTrafficViewInfo?trace_id=30989435&device_no=bc7536fd398000000 HTTP/1.1 
3629	2021-07-12 01:26:31.158118	10.9.5.23	10.61.0.15	TCP	66	8080 → 46080 [ACK] Seq=1 Ack=703 Win=30720 Len=0 TSval=359083403 TSecr=3599546274
3637	2021-07-12 01:26:41.161537	10.61.0.15	10.9.5.23	TCP	66	46080 → 8080 [FIN, ACK] Seq=703 Ack=1 Win=29696 Len=0 TSval=3599556278 TSecr=359083403
3638	2021-07-12 01:26:41.201292	10.9.5.23	10.61.0.15	TCP	66	8080 → 46080 [ACK] Seq=1 Ack=704 Win=30720 Len=0 TSval=359093447 TSecr=3599556278

Streams 37 onward are similar to stream 36.

jstack result is:
"http-nio2-8080-AsyncTimeout" #210 daemon prio=5 os_prio=0 tid=0x00007fb188933000 nid=0x1ad waiting on condition [0x00007f8e842dc000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at org.apache.coyote.AbstractProtocol$AsyncTimeout.run(AbstractProtocol.java:1193)
	at java.lang.Thread.run(Thread.java:745)

"http-nio2-8080-Acceptor-0" #209 daemon prio=5 os_prio=0 tid=0x00007fb188932000 nid=0x1ac waiting on condition [0x00007f8e843dd000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00007f941cfffff8> (a java.util.concurrent.CountDownLatch$Sync)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
	at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
	at sun.nio.ch.PendingFuture.get(PendingFuture.java:180)
	at org.apache.tomcat.util.net.Nio2Endpoint$Acceptor.run(Nio2Endpoint.java:393)
	at java.lang.Thread.run(Thread.java:745)

"http-nio2-8080-exec-960" #16165 daemon prio=5 os_prio=0 tid=0x00007f8e1005d800 nid=0x4a8e waiting on condition [0x00007f8dfa485000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00007f92bd0541b0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
	at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
	at org.apache.tomcat.util.threads.TaskQueue.take(TaskQueue.java:108)
	at org.apache.tomcat.util.threads.TaskQueue.take(TaskQueue.java:33)
	at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
	at java.lang.Thread.run(Thread.java:745)

Other ports can still receive data and respond.

From netstat we can see:
tcp        0      0 10.9.5.23:8080          10.61.0.15:40540        SYN_RECV   
tcp        0      0 10.9.5.23:8080          10.61.0.15:40442        SYN_RECV   
tcp        0      0 10.9.5.23:8080          10.61.0.15:40570        SYN_RECV   
tcp        0      0 10.9.5.23:8080          10.61.0.15:40590        SYN_RECV   
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:38941           0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:18080           0.0.0.0:*               LISTEN     
tcp        0      0 127.0.0.1:8005          0.0.0.0:*               LISTEN     
tcp      205      0 10.9.5.23:8080          10.9.4.93:48620         CLOSE_WAIT 
tcp      515      0 10.9.5.23:8080          10.61.0.15:46760        CLOSE_WAIT 
tcp      515      0 10.9.5.23:8080          10.61.0.15:46418        CLOSE_WAIT 
tcp        0      0 10.9.5.23:44338         10.61.2.234:6379        ESTABLISHED
tcp        0      0 10.9.5.23:38178         10.253.94.17:11800      ESTABLISHED
Comment 1 Mark Thomas 2021-07-19 07:45:12 UTC
It looks like Tomcat is reaching the maxConnections limit. Given the low usage of the system, it looks like some connection closures are being missed. If this analysis is correct, finding the root cause is going to be a lot easier if we can recreate the issue.

The default value for maxConnections is 10000. Can you try reducing it (the lower the better) and then try to reproduce the issue? That should trigger the issue faster and may provide some clues as to which connection closures are being missed.

What is the nature of the web application where this is observed? Does it use the Servlet async API? Does it use the Servlet non-blocking API? Does it use WebSocket?

Try switching to the NIO connector. The performance of NIO and NIO2 is very similar in most circumstances.

Testing with the latest 8.5.x release would also be helpful.
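For reference, the two suggested changes (switch to NIO, lower maxConnections) combined into one hedged server.xml sketch; all other attribute values are copied unchanged from the reporter's configuration, and maxConnections="100" is only an illustrative test value, not a production recommendation:

```xml
<!-- Diagnostic sketch: NIO instead of NIO2, with a deliberately low
     maxConnections so the suspected limit is reached within hours
     instead of ~10 days. -->
<Connector port="8080" maxThreads="200" minSpareThreads="100"
           protocol="org.apache.coyote.http11.Http11NioProtocol"
           connectionTimeout="20000" keepAliveTimeout="15000"
           maxKeepAliveRequests="-1" redirectPort="8443"
           acceptCount="400" maxConnections="100"/>
```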
Comment 2 13998197706 2021-07-19 09:01:25 UTC
(In reply to Mark Thomas from comment #1)

First of all, THANK YOU for your help!!

I failed to recreate the issue using JMeter and hping3, even when using hping3 to mock a flood attack.
I used the command 'netstat -nat|grep 8080|wc -l' to count connections when the issue happened: it was 108, and the server load is very light (3-5 req/second), so I don't think the cause is the maxConnections limit.
By the way, the client library is httpClient, with its timeout parameter set to 10 s.
I think the client's and server's timeout parameters cause the issue.
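As an illustration of the client-side setting described above: the reporter names the client library only as "httpClient", so this sketch uses the JDK's HttpURLConnection instead; the class name ClientTimeoutSketch and the URL are assumptions, not the reporter's actual code. The point it shows is the 10 s client timeout sitting below the server's keepAliveTimeout of 15 s, so the client abandons a slow response (or an idle keep-alive connection) before the server would close it from its side.

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class ClientTimeoutSketch {
    // Opens a connection with the 10 s timeouts described above.
    // No network I/O happens here: openConnection() only creates the
    // connection object, so the timeouts can be inspected offline.
    public static HttpURLConnection open(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(10_000); // 10 s to establish the TCP connection
        conn.setReadTimeout(10_000);    // 10 s to wait for response bytes
        return conn;
    }

    public static void main(String[] args) throws Exception {
        HttpURLConnection conn = open("http://10.9.5.23:8080/");
        System.out.println(conn.getReadTimeout()); // prints 10000
    }
}
```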
Comment 3 13998197706 2021-07-19 09:03:22 UTC
Created attachment 37962 [details]
wireshark's screen shot, stream normal and abnormal

The attached picture is a Wireshark screenshot showing a normal and an abnormal stream.
Comment 4 Mark Thomas 2021-07-19 13:52:03 UTC
The thread dump you provided shows that Tomcat is blocking new connections because the maxConnections limit has been reached. This means that Tomcat isn't tracking completed connections correctly, but finding out where this is going wrong is non-trivial.

If the average number of concurrent requests is low, you should be able to reduce maxConnections significantly. If reaching the maxConnections limit is the problem, it should trigger a lot faster than ~10 days with a lower limit. I'd suggest setting it to 100. It should then trigger in a few hours.

Confirming that the issue is related to reaching maxConnections is the first step towards a fix.

Please answer the following questions:

1. What is the impact of reducing maxConnections to 100?

2. What is the impact of switching from NIO2 to NIO (with maxConnections="100") ?

3. What is the impact of using NIO2 with maxConnections="-1" ?

4. What is the nature of the web application where this is observed? Does it use the Servlet async API? Does it use the Servlet non-blocking API? Does it use WebSocket?

5. Does this issue still occur with NIO2, maxConnections="100" and the latest 8.5.x release?

By all means provide us with additional information you think might be helpful, but if you don't also provide answers to the questions we ask, we are unlikely to be able to help.
Comment 5 Mark Thomas 2021-08-24 10:29:53 UTC
It has been over a month with no response to the questions in comment #4. Without answers to those questions, this issue will get resolved as WONTFIX.