Bug 61736

Summary: NIO BlockPoller may stay at events() method for too long time
Product: Tomcat 8 Reporter: ZILONG SONG <szlon>
Component: ConnectorsAssignee: Tomcat Developers Mailing List <dev>
Severity: major CC: szlon
Priority: P2    
Version: 8.0.47   
Target Milestone: ----   
Hardware: PC   
OS: All   
Attachments: An Http Client which is able to send data in controlled packet size and packet interval.

Description ZILONG SONG 2017-11-08 12:21:28 UTC
Created attachment 35508 [details]
An Http Client which is able to send data in controlled packet size and packet interval.

When using tomcat in NIO mode as a server to receive files, and if the files are transferred via a non-stable network (there is an obvious interval between neighbor TCP packets), then the traffic always hang for a great while with zero window reported from server. Following are the steps to reproduce the problem.

1. Configure the server tcp receive window size.
$ cat /proc/sys/net/ipv4/tcp_rmem
1024	2048	8192

2. Put the following JSP into the ROOT project. (in the path $CATALINA_HOME/webapps/ROOT/consume.jsp)
<%@ page import="org.apache.commons.io.IOUtils" %><%
    final ServletInputStream inputStream = request.getInputStream();
    byte[] content = IOUtils.toByteArray(inputStream);
The IOUtils is a class from commons-io

3. Start tomcat in NIO mode with the following settings
Server version:        Apache Tomcat/8.0.46
Server built:          Aug 10 2017 10:10:31 UTC
Server number:
OS Name:               Linux
OS Version:            3.10.0-229.el7.x86_64
Architecture:          amd64
Java Home:             /usr/jdk1.8.0_121/jre
JVM Version:           1.8.0_121-b13
JVM Vendor:            Oracle Corporation
Command line argument: -Xms256m
Command line argument: -Xmx256m
Command line argument: -Xmn128m
Command line argument: -Xss1m

4. Capture the TCP traffics.
tcpdump -i any -w tcp.cap port 8080

5. Start a client. sending data with http in chunks with the following settings:
chunk size                   1024 byte
time to rest between chunks  50 ms
chunk count per request      100
parallel requests            100
total requests               10000

6. Then we will got a lot of zero window packets, which lasts several seconds.
No.     stream     Time               Source              Destination         ack        win        Length Info
 469817 3201       15:09:04.175309          1          29696      57     [TCP segment of a reassembled PDU]
 469904 3201       15:09:04.214945          4238       1024       54     8080→58750 [ACK] Seq=1 Ack=4238 Win=1024 Len=0
 470091 3201       15:09:04.437137          1          29696      1078   [TCP Window Full] [TCP segment of a reassembled PDU]
 470092 3201       15:09:04.437142          5262       0          54     [TCP ZeroWindow] 8080→58750 [ACK] Seq=1 Ack=5262 Win=0 Len=0
 470334 3201       15:09:04.657120          1          29696      54     [TCP Keep-Alive] 58750→8080 [ACK] Seq=5261 Ack=1 Win=29696 Len=0
 470335 3201       15:09:04.657123          5262       0          54     [TCP ZeroWindow] 8080→58750 [ACK] Seq=1 Ack=5262 Win=0 Len=0
 470620 3201       15:09:05.098135          1          29696      54     [TCP Keep-Alive] 58750→8080 [ACK] Seq=5261 Ack=1 Win=29696 Len=0
 470621 3201       15:09:05.098141          5262       0          54     [TCP ZeroWindow] 8080→58750 [ACK] Seq=1 Ack=5262 Win=0 Len=0
 471017 3201       15:09:05.979136          1          29696      54     [TCP Keep-Alive] 58750→8080 [ACK] Seq=5261 Ack=1 Win=29696 Len=0
 471018 3201       15:09:05.979140          5262       0          54     [TCP ZeroWindow] 8080→58750 [ACK] Seq=1 Ack=5262 Win=0 Len=0
 471619 3201       15:09:07.743148          1          29696      54     [TCP Keep-Alive] 58750→8080 [ACK] Seq=5261 Ack=1 Win=29696 Len=0
 471620 3201       15:09:07.743151          5262       0          54     [TCP ZeroWindow] 8080→58750 [ACK] Seq=1 Ack=5262 Win=0 Len=0
 475765 3201       15:09:08.545625          5262       3072       54     [TCP Window Update] 8080→58750 [ACK] Seq=1 Ack=5262 Win=3072 Len=0
 475781 3201       15:09:08.545815          1          29696      490    [TCP segment of a reassembled PDU]
 475782 3201       15:09:08.545821          5698       3072       54     8080→58750 [ACK] Seq=1 Ack=5698 Win=3072 Len=0
 475784 3201       15:09:08.545825          1          29696      1514   [TCP segment of a reassembled PDU]

After a lot of study, I found this is because of a BUG in the BlockPoller, which will slow down the speed of polling cycle a lot.

The BlockPoller will always try to poll the Runnable event and run it until the events queue is empty. But the RunnableAdd event may wake the Worker thread, and the worker thread may add another more event to the queue. Thus the queue will be hard to shrink, and the events method will be hard to return and will affect the latency the next selection to be performed.

// events() method of the BlockPoller thread
public boolean events() {
    boolean result = false;
    Runnable r = null;
    result = (events.size() > 0);
    while ( (r = events.poll()) != null ) {
        // running the "r" event may cause more events to 
        // be added to the queue. then the queue will be hard 
        // to shrink, and the loop may be hard to finish
        result = true;
    return result;

// the following e1 and e2 line in the RunnableAdd will wake the worker thread
public void run() {
    SelectionKey sk = ch.keyFor(selector);
    try {
        if (sk == null) {
            sk = ch.register(selector, ops, key);
            ref.key = sk;
        } else if (!sk.isValid()) {
            cancel(sk, key, ops); // <-- e1
        } else {
            sk.interestOps(sk.interestOps() | ops);
    } catch (CancelledKeyException cx) {
        cancel(sk, key, ops); // <-- e2
    } catch (ClosedChannelException cx) {
        cancel(null, key, ops);

// when the worker is waken at line L3, it will start the next loop
// and may add another event at line L2
public int read(ByteBuffer buf, NioChannel socket, long readTimeout) throws IOException {
    SelectionKey key = socket.getIOChannel().keyFor(socket.getPoller().getSelector());
    if ( key == null ) throw new IOException("Key no longer registered");
    KeyReference reference = keyReferenceStack.pop();
    if (reference == null) {
        reference = new KeyReference();
    NioSocketWrapper att = (NioSocketWrapper) key.attachment();
    int read = 0;
    boolean timedout = false;
    int keycount = 1; 
    long time = System.currentTimeMillis(); 
    try {
L1:     while(!timedout) {
            if (keycount > 0) {
                read = socket.read(buf);
                if (read != 0) {
            try {
                if ( att.getReadLatch()==null || att.getReadLatch().getCount()==0) att.startReadLatch(1);
L2:             poller.add(att,SelectionKey.OP_READ, reference);
                if (readTimeout < 0) {
                    att.awaitReadLatch(Long.MAX_VALUE, TimeUnit.MILLISECONDS);
                } else {
L3:                 att.awaitReadLatch(readTimeout, TimeUnit.MILLISECONDS);
            } catch (InterruptedException ignore) {
                // Ignore
            if ( att.getReadLatch()!=null && att.getReadLatch().getCount()> 0) {
                keycount = 0;
            }else {
                keycount = 1;
            if (readTimeout >= 0 && (keycount == 0))
                timedout = (System.currentTimeMillis() - time) >= readTimeout;
        } //while
        if (timedout)
            throw new SocketTimeoutException();
    } finally {
        if (timedout && reference.key!=null) {
        reference.key = null;
    return read;

In fact there is such an occasion where the previous case will happen:

 1. [Worker]: added an event, and wait on the readLatch
 2. [BlockPoller]: registered operation interest on corresponding SelectionKey
 3. [BlockPoller]: perform selection
 4. [BlockPoller]: wake the Worker on the selection result
 5. [Worker]: wake up, read data success
 6. [Worker]: add remove event (e1) to the queue, and return from the read method
 7. [Worker]: Outside call the read method again
 8. [Worker]: read data and got nothing
 9. [Worker]: add an event (e2) via the poller.add(att, ops, ref) method
10. [Worker]: wait on the readLatch
11. [BlockPoller]: enter next loop, and execute the events() method, polling events from the events queue and run them.
12. [BlockPoller]: got e1 and run it, the SelectionKey turn invalid (before next selection, the SelectionKey is always invalid)
13. [BlockPoller]: got e2 and run it, since the SelectionKey is invalid, it will wake the worker thread, and discard the event.
14. [BlockPoller]: poll other events from the queue, and run them
15. [Worker]: (before queue empty) waken up, and read again, but got nothing
16. [Worker]: add another event (e3) via the poller.add(att, ops, ref) method
17. [Worker]: wait on the readLatch again
18. [BlockPoller]: got e3 and run it, found it invalid again, wake the worker thread, and discard the event.
19. [Worker]: being waken up again and add event again...

steps between 11 to 17 will repeat again and again (step 18 and later is a replay of the steps), if each time the event add in step 16 add the event to the queue before the queue is consumed to empty, because the BlockPoller will always try to empty the queue. 

Then the BlockPoller and the worker will enter a strange loop, where the BlockPoller will try to empty the queue, while the worker will always add some more events to the queue.

My solution to prevent this problem, is to prevent the BlockPoller from trying to empty the queue before move to the selection step, instead, make it only handle events in the queue when it starts to handle them, further events added later will not be handled during this BlockPoller loop, but instead delay to the next loop. The selection operation will remove all invalid SelectionKeys, so in the time of the next loop, the problem will not happen again, since the SelectionKeys are all fresh and valid.
Comment 1 Mark Thomas 2017-11-08 14:59:39 UTC
See https://github.com/apache/tomcat/pull/81 for a proposed patch.
Comment 2 Mark Thomas 2017-11-08 15:48:47 UTC
Thanks for the detailed report. To have such a complicated issue explained so clearly - and with a patch! - is extremely helpful.

I can see how the problem happens and I can also see how your proposed patch addresses it. I'm not concerned about delaying new events to the next loop.

The events queue is a custom implementation so I want to look to see if that makes an alternative fix possible but I haven't found anything so far (and probably won't).
Comment 3 ZILONG SONG 2017-11-12 08:44:25 UTC
Yes, the queue is custom implemented, but it still works following the rule of a non-blocking queue. I'm afraid that the problem is not caused by how the queue works, but how we use a queue. 

Perhaps a more basic reason for this problem is that, the cancelled keys are still cached. A cancelled key is invalid and cannot be used again, so it should not be cached. If invalid keys are not cached, or they can recover from the invalid state, we may not wake the Worker thread up when handling events, so it's possible to avoid this problem. 

However how the SelectionKey will behave is controlled in the JDK, out of our control. I didn't find anyway other than performing a selection that can clear the invalid SelectionKey caches. So I think maybe the most effective action to take is to prevent the BlockPoller to cost too much time in the loop polling and handling events, and let the next selection to clear the invalid SelectionKeys as soon as possible.
Comment 4 Mark Thomas 2017-11-16 12:34:17 UTC
My point was not that the queue was the root cause. My point was that because the queue implementation is a custom one it *might* offer a cleaner way to implement the polling of the queue rather than getting the current size and then requesting exactly that many items from the queue.
Comment 5 Mark Thomas 2017-11-16 13:36:58 UTC
There didn't seem to be any benefit to implementing the limit in the queue rather than getting the size and looping so I went with the patch as proposed.

Fixed in:
- trunk for 9.0.2 onwards
- 8.5.x for 8.5.24 onwards
- 8.0.x for 8.0.48 onwardse
- 7.0.x for 7.0.83 onwards

Thanks again for the analysis and the patch.