Created attachment 35508 [details]
An HTTP client which is able to send data with a controlled packet size and packet interval.

When using Tomcat in NIO mode as a server to receive files, and the files are transferred over an unstable network (there is a noticeable interval between neighbouring TCP packets), the traffic regularly hangs for a long while with a zero window reported by the server.

Following are the steps to reproduce the problem.

1. Configure the server TCP receive window size.

$ cat /proc/sys/net/ipv4/tcp_rmem
1024    2048    8192

2. Put the following JSP into the ROOT project (at $CATALINA_HOME/webapps/ROOT/consume.jsp). IOUtils is a class from commons-io.

<%@ page import="org.apache.commons.io.IOUtils" %><%
    final ServletInputStream inputStream = request.getInputStream();
    byte[] content = IOUtils.toByteArray(inputStream);
%>ok

3. Start Tomcat in NIO mode with the following settings:

Server version:        Apache Tomcat/8.0.46
Server built:          Aug 10 2017 10:10:31 UTC
Server number:         8.0.46.0
OS Name:               Linux
OS Version:            3.10.0-229.el7.x86_64
Architecture:          amd64
Java Home:             /usr/jdk1.8.0_121/jre
JVM Version:           1.8.0_121-b13
JVM Vendor:            Oracle Corporation
Command line argument: -Xms256m
Command line argument: -Xmx256m
Command line argument: -Xmn128m
Command line argument: -Xss1m

4. Capture the TCP traffic.

tcpdump -i any -w tcp.cap port 8080

5. Start a client, sending data over HTTP in chunks with the following settings (a minimal sketch of such a client is shown after the capture below):

chunk size:                  1024 bytes
time to rest between chunks: 50 ms
chunk count per request:     100
parallel requests:           100
total requests:              10000

6. Then we get a lot of zero window packets, which last several seconds.

No.     stream  Time             Source      Destination  ack   win    Length  Info
469817  3201    15:09:04.175309  172.16.1.4  172.16.1.5   1     29696  57      [TCP segment of a reassembled PDU]
469904  3201    15:09:04.214945  172.16.1.5  172.16.1.4   4238  1024   54      8080→58750 [ACK] Seq=1 Ack=4238 Win=1024 Len=0
470091  3201    15:09:04.437137  172.16.1.4  172.16.1.5   1     29696  1078    [TCP Window Full] [TCP segment of a reassembled PDU]
470092  3201    15:09:04.437142  172.16.1.5  172.16.1.4   5262  0      54      [TCP ZeroWindow] 8080→58750 [ACK] Seq=1 Ack=5262 Win=0 Len=0
470334  3201    15:09:04.657120  172.16.1.4  172.16.1.5   1     29696  54      [TCP Keep-Alive] 58750→8080 [ACK] Seq=5261 Ack=1 Win=29696 Len=0
470335  3201    15:09:04.657123  172.16.1.5  172.16.1.4   5262  0      54      [TCP ZeroWindow] 8080→58750 [ACK] Seq=1 Ack=5262 Win=0 Len=0
470620  3201    15:09:05.098135  172.16.1.4  172.16.1.5   1     29696  54      [TCP Keep-Alive] 58750→8080 [ACK] Seq=5261 Ack=1 Win=29696 Len=0
470621  3201    15:09:05.098141  172.16.1.5  172.16.1.4   5262  0      54      [TCP ZeroWindow] 8080→58750 [ACK] Seq=1 Ack=5262 Win=0 Len=0
471017  3201    15:09:05.979136  172.16.1.4  172.16.1.5   1     29696  54      [TCP Keep-Alive] 58750→8080 [ACK] Seq=5261 Ack=1 Win=29696 Len=0
471018  3201    15:09:05.979140  172.16.1.5  172.16.1.4   5262  0      54      [TCP ZeroWindow] 8080→58750 [ACK] Seq=1 Ack=5262 Win=0 Len=0
471619  3201    15:09:07.743148  172.16.1.4  172.16.1.5   1     29696  54      [TCP Keep-Alive] 58750→8080 [ACK] Seq=5261 Ack=1 Win=29696 Len=0
471620  3201    15:09:07.743151  172.16.1.5  172.16.1.4   5262  0      54      [TCP ZeroWindow] 8080→58750 [ACK] Seq=1 Ack=5262 Win=0 Len=0
475765  3201    15:09:08.545625  172.16.1.5  172.16.1.4   5262  3072   54      [TCP Window Update] 8080→58750 [ACK] Seq=1 Ack=5262 Win=3072 Len=0
475781  3201    15:09:08.545815  172.16.1.4  172.16.1.5   1     29696  490     [TCP segment of a reassembled PDU]
475782  3201    15:09:08.545821  172.16.1.5  172.16.1.4   5698  3072   54      8080→58750 [ACK] Seq=1 Ack=5698 Win=3072 Len=0
475784  3201    15:09:08.545825  172.16.1.4  172.16.1.5   1     29696  1514    [TCP segment of a reassembled PDU]
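Since the attached client is not reproduced here, the following is only a minimal sketch of a client with the same knobs (chunk size, rest between chunks, chunk count per request). The class name, target URL and the use of HttpURLConnection are my own placeholders, not taken from the attachment; parallel requests and the total request count are omitted for brevity.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch only: sends the request body in fixed-size chunks with a pause
// between them (1024-byte chunks, 50 ms rest, 100 chunks per request).
public class SlowChunkedClient {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://172.16.1.5:8080/consume.jsp"); // placeholder host/path
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        conn.setChunkedStreamingMode(1024);      // force chunked transfer encoding
        byte[] chunk = new byte[1024];           // 1 KB payload per chunk
        try (OutputStream out = conn.getOutputStream()) {
            for (int i = 0; i < 100; i++) {      // 100 chunks per request
                out.write(chunk);
                out.flush();
                Thread.sleep(50);                // 50 ms rest between chunks
            }
        }
        System.out.println("Response: " + conn.getResponseCode());
        conn.disconnect();
    }
}

Running many of these in parallel corresponds to the "parallel requests" setting above.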
After a lot of study, I found that this is caused by a bug in the BlockPoller which slows the polling cycle down dramatically. The BlockPoller keeps polling Runnable events from its queue and running them until the queue is empty. However, running a RunnableAdd event may wake a Worker thread, and that worker may then add yet another event to the queue. As a result the queue is hard to drain, the events() method is slow to return, and the latency before the next selection grows.

// events() method of the BlockPoller thread
public boolean events() {
    boolean result = false;
    Runnable r = null;
    result = (events.size() > 0);
    while ( (r = events.poll()) != null ) {
        // running the "r" event may cause more events to be added to the
        // queue; the queue is then hard to shrink and the loop hard to finish
        r.run();
        result = true;
    }
    return result;
}

// the lines marked e1 and e2 in RunnableAdd below will wake the worker thread
public void run() {
    SelectionKey sk = ch.keyFor(selector);
    try {
        if (sk == null) {
            sk = ch.register(selector, ops, key);
            ref.key = sk;
        } else if (!sk.isValid()) {
            cancel(sk, key, ops);                    // <-- e1
        } else {
            sk.interestOps(sk.interestOps() | ops);
        }
    } catch (CancelledKeyException cx) {
        cancel(sk, key, ops);                        // <-- e2
    } catch (ClosedChannelException cx) {
        cancel(null, key, ops);
    }
}

// when the worker is woken at line L3, it starts the next loop iteration
// and may add another event at line L2
public int read(ByteBuffer buf, NioChannel socket, long readTimeout) throws IOException {
    SelectionKey key = socket.getIOChannel().keyFor(socket.getPoller().getSelector());
    if (key == null) throw new IOException("Key no longer registered");
    KeyReference reference = keyReferenceStack.pop();
    if (reference == null) {
        reference = new KeyReference();
    }
    NioSocketWrapper att = (NioSocketWrapper) key.attachment();
    int read = 0;
    boolean timedout = false;
    int keycount = 1;                            // assume we can read
    long time = System.currentTimeMillis();      // start the timeout timer
    try {
L1:     while (!timedout) {
            if (keycount > 0) {                  // only read if we were registered for a read
                read = socket.read(buf);
                if (read != 0) {
                    break;
                }
            }
            try {
                if (att.getReadLatch() == null || att.getReadLatch().getCount() == 0) att.startReadLatch(1);
L2:             poller.add(att, SelectionKey.OP_READ, reference);
                if (readTimeout < 0) {
                    att.awaitReadLatch(Long.MAX_VALUE, TimeUnit.MILLISECONDS);
                } else {
L3:                 att.awaitReadLatch(readTimeout, TimeUnit.MILLISECONDS);
                }
            } catch (InterruptedException ignore) {
                // Ignore
            }
            if (att.getReadLatch() != null && att.getReadLatch().getCount() > 0) {
                // interrupted, but no notification from the poller yet
                keycount = 0;
            } else {
                // latch countdown has happened
                keycount = 1;
                att.resetReadLatch();
            }
            if (readTimeout >= 0 && (keycount == 0))
                timedout = (System.currentTimeMillis() - time) >= readTimeout;
        } // while
        if (timedout) throw new SocketTimeoutException();
    } finally {
        poller.remove(att, SelectionKey.OP_READ);
        if (timedout && reference.key != null) {
            poller.cancelKey(reference.key);
        }
        reference.key = null;
        keyReferenceStack.push(reference);
    }
    return read;
}

In fact, the following sequence of events makes exactly this happen:

 1. [Worker]: adds an event and waits on the readLatch
 2. [BlockPoller]: registers the operation interest on the corresponding SelectionKey
 3. [BlockPoller]: performs a selection
 4. [BlockPoller]: wakes the Worker based on the selection result
 5. [Worker]: wakes up and reads data successfully
 6. [Worker]: adds a remove event (e1) to the queue and returns from the read method
 7. [Worker]: the caller invokes the read method again
 8. [Worker]: reads data and gets nothing
 9. [Worker]: adds an event (e2) via the poller.add(att, ops, ref) method
10. [Worker]: waits on the readLatch
11. [BlockPoller]: enters the next loop and executes the events() method, polling events from the queue and running them
12. [BlockPoller]: gets e1 and runs it; the SelectionKey turns invalid (and stays invalid until the next selection)
13. [BlockPoller]: gets e2 and runs it; since the SelectionKey is invalid, it wakes the worker thread and discards the event
14. [BlockPoller]: polls other events from the queue and runs them
15. [Worker]: (before the queue is empty) is woken up and reads again, but gets nothing
16. [Worker]: adds another event (e3) via the poller.add(att, ops, ref) method
17. [Worker]: waits on the readLatch again
18. [BlockPoller]: gets e3 and runs it, finds the key invalid again, wakes the worker thread and discards the event
19. [Worker]: is woken up again and adds another event again...

Steps 11 to 17 repeat again and again (steps 18 and 19 are already a replay of them) as long as the event added in step 16 reaches the queue before the queue has been drained, because the BlockPoller always tries to empty the queue. The BlockPoller and the worker then enter a strange loop: the BlockPoller keeps trying to empty the queue while the worker keeps adding more events to it.

My solution to prevent this problem is to stop the BlockPoller from trying to fully empty the queue before moving on to the selection step. Instead, it should only handle the events that are already in the queue when it starts processing them; events added later are not handled during the current BlockPoller loop but are deferred to the next one. The selection operation removes all invalid SelectionKeys, so by the time of the next loop the problem cannot happen again, since the remaining SelectionKeys are all fresh and valid.
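For illustration, a minimal sketch of the bounded drain described above, taking the queue size once and polling at most that many events. This is only a sketch of the idea, not necessarily identical to the code in the pull request:

// Sketch: drain only the events that were already queued when this call
// started; anything added while we run is handled in the next BlockPoller
// loop, right after a fresh selection has cleared the invalid keys.
public boolean events() {
    boolean result = false;
    Runnable r = null;
    int size = events.size();            // snapshot the size once
    for (int i = 0; i < size && (r = events.poll()) != null; i++) {
        r.run();
        result = true;
    }
    return result;
}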
See https://github.com/apache/tomcat/pull/81 for a proposed patch.
Thanks for the detailed report. To have such a complicated issue explained so clearly - and with a patch! - is extremely helpful. I can see how the problem happens and I can also see how your proposed patch addresses it. I'm not concerned about delaying new events to the next loop. The events queue is a custom implementation, so I want to look at whether that makes an alternative fix possible, but I haven't found anything so far (and probably won't).
Yes, the queue is a custom implementation, but it still behaves like an ordinary non-blocking queue. I'm afraid the problem is not caused by how the queue works, but by how we use it.

Perhaps a more fundamental cause of this problem is that cancelled keys are still cached. A cancelled key is invalid and cannot be used again, so it should not be cached. If invalid keys were not cached, or if they could recover from the invalid state, we would not have to wake the Worker thread while handling events, and the problem could be avoided. However, how a SelectionKey behaves is determined by the JDK and is out of our control. I did not find any way other than performing a selection that clears the cached invalid SelectionKeys. So I think the most effective action is to prevent the BlockPoller from spending too much time in the loop that polls and handles events, and to let the next selection clear the invalid SelectionKeys as soon as possible.
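For illustration, a small standalone sketch (class name and channel type are just placeholders) of the JDK behaviour I mean: once a key is cancelled it stays invalid, and the channel cannot be re-registered with the same selector until a selection operation has flushed the cancelled-key set.

import java.net.InetSocketAddress;
import java.nio.channels.*;

public class CancelledKeyDemo {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        ServerSocketChannel ch = ServerSocketChannel.open();
        ch.bind(new InetSocketAddress(0));
        ch.configureBlocking(false);

        SelectionKey key = ch.register(selector, SelectionKey.OP_ACCEPT);
        key.cancel();
        System.out.println("valid after cancel: " + key.isValid());   // false

        try {
            // the channel still holds the cancelled key for this selector
            ch.register(selector, SelectionKey.OP_ACCEPT);
        } catch (CancelledKeyException e) {
            System.out.println("re-register before a selection fails");
        }

        selector.selectNow();                  // selection flushes the cancelled-key set
        SelectionKey fresh = ch.register(selector, SelectionKey.OP_ACCEPT);
        System.out.println("valid after select + re-register: " + fresh.isValid()); // true

        ch.close();
        selector.close();
    }
}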
My point was not that the queue was the root cause. My point was that because the queue implementation is a custom one it *might* offer a cleaner way to implement the polling of the queue rather than getting the current size and then requesting exactly that many items from the queue.
There didn't seem to be any benefit to implementing the limit in the queue rather than getting the size and looping, so I went with the patch as proposed.

Fixed in:
- trunk for 9.0.2 onwards
- 8.5.x for 8.5.24 onwards
- 8.0.x for 8.0.48 onwards
- 7.0.x for 7.0.83 onwards

Thanks again for the analysis and the patch.