Bug 57943

Summary: NioEndpoint, one poller thread died
Product: Tomcat 6
Reporter: sunqi <sunqi800>
Component: Connectors
Assignee: Tomcat Developers Mailing List <dev>
Status: RESOLVED FIXED
Severity: critical
CC: ian.luo
Priority: P2
Version: 6.0.44
Target Milestone: default
Hardware: PC
OS: Linux
Attachments: analysis for bug57943

Description sunqi 2015-05-22 06:21:27 UTC
When I restart Tomcat, it soon becomes unable to process requests, and I found this error in the log:

Exception in thread "http-nio-7001-ClientPoller-1" java.util.ConcurrentModificationException
        at java.util.HashMap$HashIterator.nextEntry(HashMap.java:793)
        at java.util.HashMap$KeyIterator.next(HashMap.java:828)
        at java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1010)
        at org.apache.tomcat.util.net.NioEndpoint$Poller.timeout(NioEndpoint.java:1421)
        at org.apache.tomcat.util.net.NioEndpoint$Poller.run(NioEndpoint.java:1215)
        at java.lang.Thread.run(Thread.java:662)

The Tomcat poller thread does not catch this exception, so the thread dies.
Connections are still accepted, but there is no poller thread left to process them.
Comment 1 Mark Thomas 2015-05-22 13:29:47 UTC
This is the third report we have received about this issue.

The other two are:
http://markmail.org/message/xgxblpk4v4ykgi5y
http://markmail.org/message/eypk42i6gdpztkpy

I have looked through the NIO Poller code again and I still can't see how this can be happening. I've also dug into the JRE code and I can't see any issues there either (although I'll note this has been reported on a variety of JVMs and I haven't checked the specific ones mentioned).

I think the best thing to do at this point is catch the error, log a warning and maybe ask folks to report it if they can repeat it.
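
Roughly, the work-around boils down to making sure an unexpected error thrown during timeout processing can never kill the Poller thread. Below is a minimal, self-contained sketch of that pattern only; `processTimeouts()` is a stand-in and this is not the actual committed patch:

```java
public class ResilientPollerLoopSketch {

    // Stand-in for the Poller's per-iteration timeout handling.
    private static void processTimeouts() {
        throw new java.util.ConcurrentModificationException("simulated");
    }

    public static void main(String[] args) throws InterruptedException {
        Thread poller = new Thread(() -> {
            for (int i = 0; i < 3; i++) {
                try {
                    processTimeouts();
                } catch (Throwable t) {
                    // Log a warning and keep the poller thread alive instead of letting it die.
                    System.err.println("WARNING: error during timeout processing, please report: " + t);
                }
            }
        }, "ClientPoller-sketch");
        poller.start();
        poller.join();
        System.out.println("Poller thread survived all iterations");
    }
}
```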
Comment 2 Mark Thomas 2015-05-22 14:29:41 UTC
I have added the work-around to trunk (for 9.0.x), 8.0.x (for 8.0.24 onwards) and 7.0.x (for 7.0.63 onwards).

I suspect a JVM bug so I am resolving this as FIXED. If we discover it is a Tomcat bug we can always re-open.
Comment 3 sunqi 2015-07-02 01:28:26 UTC
(In reply to Mark Thomas from comment #2)
> I have added the work-around to trunk (for 9.0.x), 8.0.x (for 8.0.24
> onwards) and 7.0.x (for 7.0.63 onwards).
> 
> I suspect a JVM bug so I am resolving this as FIXED. If we discover it is a
> Tomcat bug we can always re-open.

We have found the reason why the poller thread died; it is related to bug https://bz.apache.org/bugzilla/show_bug.cgi?id=57340

We use an async servlet, and many requests hit the async timeout while, at the same time, the async servlet calls complete or dispatch.


As a result, the nioChannels cache contains two references to the same object, two requests end up using the same NioChannel, and the channel's poller is overwritten by the other request.

We also found another place that puts the same NioChannel and KeyAttachment into the caches:

```java
} else if (handshake == -1 ) {
    if (key != null) {
        socket.getPoller().cancelledKey(key, SocketStatus.DISCONNECT);
    }
    if (running && !paused) {
        nioChannels.push(socket);
    }
    socket = null;
    if (running && !paused) {
        keyCache.push(ka);
    }
    ka = null;
}
```
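
To make the cache pollution concrete, here is a minimal, self-contained sketch (plain Java, not Tomcat code; the queue merely stands in for the nioChannels cache): once the same object has been offered twice, two consumers each poll it and both believe they own it exclusively.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class DuplicateOfferSketch {
    public static void main(String[] args) {
        // Stand-in for the nioChannels cache: a queue of reusable objects.
        Queue<Object> cache = new ConcurrentLinkedQueue<>();
        Object channel = new Object();

        // Two code paths (e.g. the timeout path and the handshake == -1 path)
        // both return the same object to the cache.
        cache.offer(channel);
        cache.offer(channel);

        // Two poller threads later each take "their own" object from the cache.
        Object takenByPoller0 = cache.poll();
        Object takenByPoller1 = cache.poll();
        System.out.println(takenByPoller0 == takenByPoller1); // true: both now share one object
    }
}
```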
Comment 4 Ian Luo 2015-07-02 15:45:19 UTC
Created attachment 32880 [details]
analysis for bug57943

it's in markdown format
Comment 5 Christopher Schultz 2015-07-02 16:07:20 UTC
Comment on attachment 32880 [details]
analysis for bug57943

Ian's comment, not as an attachment, and therefore much more readable, is:

I am the colleague of Sun Qi who reported [bug57943](https://bz.apache.org/bugzilla/show_bug.cgi?id=57943) one month ago. I believe we managed to find out why ConcurrentModificationException happens in our scenario.

Here's the scenario: our application uses an async servlet to write the response back to the client. The operation we run in the async servlet, which executes on another thread, is quite time-consuming. When the incoming traffic is heavy, which usually happens when a Tomcat instance reboots, we hit this issue very easily.

When the ClientPoller thread runs the timeout method (in NioEndpoint) and executes the following code:

```java
} else if (!ka.isAsync() || ka.getTimeout() > 0) {
    // Async requests with a timeout of 0 or less never timeout
    long delta = now - ka.getLastAccess();
    long timeout = (ka.getTimeout()==-1)?((long)socketProperties.getSoTimeout()):(ka.getTimeout());
    boolean isTimedout = delta > timeout;
    if (isTimedout) {
        // Prevent subsequent timeouts if the timeout event takes a while to process
        ka.access(Long.MAX_VALUE);
        processSocket(ka.getChannel(), SocketStatus.TIMEOUT, true, 6);
    }
}
```

Eventually it calls `processSocket(ka.getChannel())` and puts the NioChannel instance back into the `nioChannels` cache for later re-use:

```java
if (ka!=null) ka.setComet(false);
socket.getPoller().cancelledKey(key, SocketStatus.ERROR, false);
if (running && !paused) {
    nioChannels.offer(socket);
}
```

Unfortunately, in our scenario, when QPS is greater than 1000 req/sec, the async servlet threads have a very good chance of starting async-complete and calling `processSocket` (in Http11NioProcessor) at the same time:

```java
} else if (actionCode == ActionCode.ASYNC_COMPLETE) {
    if (asyncStateMachine.asyncComplete()) {
        ((NioEndpoint)endpoint).processSocket(socketWrapper.getSocket(), SocketStatus.OPEN_READ, true, 10);
    }
}
```

This leads to cache pollution in `nioChannels`, since there is a chance of offering the same object into the cache multiple times. When the Acceptor thread polls an object from the cache and hands it over to the ClientPollers, the same object can be passed to different ClientPoller threads, which leads to a very serious problem.
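
The resulting failure matches the original stack trace: a HashMap-backed collection is iterated by one thread while another thread, working on the same wrongly shared channel, structurally modifies it. A minimal, standalone illustration of that fail-fast behaviour (plain collections, not Tomcat or NIO classes; in the real case the modification comes from a second thread):

```java
import java.util.HashMap;
import java.util.Map;

public class FailFastIterationSketch {
    public static void main(String[] args) {
        Map<Integer, String> registered = new HashMap<>();
        registered.put(1, "key-1");
        registered.put(2, "key-2");

        // Structurally modifying the map while an iterator over it is open
        // makes the next iterator step throw ConcurrentModificationException.
        // In the Tomcat case the modification comes from another thread
        // operating on the same, wrongly shared channel while the Poller
        // iterates the selector's key set.
        for (Integer k : registered.keySet()) {
            registered.remove(k);
        }
    }
}
```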

You may notice that the code I show here is from Tomcat 7.0.54 and point out that a similar issue has already been addressed, but that is not the whole story. A similar issue was indeed reported as [Bug 57340](https://bz.apache.org/bugzilla/show_bug.cgi?id=57340), and its fix has already been checked into a later release.

But that fix is not complete: in our scenario there is still a chance to offer a duplicated object to the cache in the `else if` clause, which the fix for Bug 57340 does not cover:

```java
} else if (handshake == -1 ) {
    if (key != null) {
        socket.getPoller().cancelledKey(key, SocketStatus.DISCONNECT, false);
    }
    nioChannels.offer(socket);
    socket = null;
    if ( ka!=null ) keyCache.offer(ka);
    ka = null;
}
```

This is easy to understand. When handshake equals -1, one possibility is that the `key` passed into the doRun method is `null`, and when `key` is null it means some other thread has already finished processing the same socket:

```java
public void run() {
    SelectionKey key = socket.getIOChannel().keyFor(
        socket.getPoller().getSelector());
```
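
For reference, the java.nio behaviour being relied on here can be seen in isolation: `keyFor` returns the registration key only while the channel is still registered with the selector, and returns `null` once the key has been cancelled and the selector has deregistered it. A small, self-contained sketch (assumed setup, not Tomcat code):

```java
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

public class KeyForSketch {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        SocketChannel channel = SocketChannel.open();
        channel.configureBlocking(false);

        SelectionKey key = channel.register(selector, SelectionKey.OP_READ);
        System.out.println(channel.keyFor(selector) == key); // true: still registered

        key.cancel();
        selector.selectNow(); // deregistration completes during the next selection operation
        System.out.println(channel.keyFor(selector));        // null: the registration is gone

        channel.close();
        selector.close();
    }
}
```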

We propose the change below in order to address this issue thoroughly. In this case, I think we can simply drop the object instead of offering it back, since there seems to be no other ideal way to avoid using a lock.

```java
} else if (handshake == -1 ) {
    if (key != null) {
        if (socket.getPoller().cancelledKey(key, SocketStatus.DISCONNECT, false) != null) {
            nioChannels.offer(socket);
            if (ka != null) keyCache.offer(ka);
        }
    }
    socket = null;
    ka = null;
}
```
Comment 6 Mark Thomas 2015-07-02 21:13:16 UTC
Many thanks for the detailed analysis.

I have applied a patch that should fix this to trunk (9.0.x) and back-ported it to 8.0.x (for 8.0.25 onwards) and 7.0.x (7.0.64 onwards). I have also proposed it for 6.0.x.
Comment 7 Konstantin Kolinko 2015-10-25 18:36:02 UTC
Fixed in Tomcat 6 in r1710473 and will be in 6.0.45 onwards.