Bug 61831 - NIO2 connector becomes intermittently unresponsive after some period of time
Summary: NIO2 connector becomes intermittently unresponsive after some period of time
Status: RESOLVED INVALID
Alias: None
Product: Tomcat 8
Classification: Unclassified
Component: Connectors
Version: 8.0.47
Hardware: All
OS: Linux
Importance: P2 normal
Target Milestone: ----
Assignee: Tomcat Developers Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-11-29 01:58 UTC by Oleg
Modified: 2017-12-11 12:37 UTC



Attachments
jstack thread dump (527.82 KB, text/plain)
2017-11-29 01:58 UTC, Oleg

Description Oleg 2017-11-29 01:58:29 UTC
Created attachment 35564 [details]
jstack thread dump

We are observing a scenario where the NIO2 connector on Tomcat becomes unresponsive after some period of time, while an NIO connector running on the same host continues to process the same requests and serve traffic. Only a server restart helps in this case.
The issue is intermittent. In our current infrastructure we have a few nodes behind a load balancer, and it happens from time to time (roughly once per week) on each node, so it does not appear to be node- or hardware-specific in our case.
Below is our server.xml:

    <Executor name="tomcatThreadPool" namePrefix="catalina-exec-" maxThreads="800" minSpareThreads="100"/>


      <Connector executor="tomcatServiceThreadPool"
                 port="8080"
                 protocol="org.apache.coyote.http11.Http11Nio2Protocol"
                 connectionTimeout="1000"
                 enableLookups="false"
                 acceptorThreadCount="1"
                 processorCache="800"
                 socket.tcpNoDelay="true"
                 socket.soKeepAlive="true"
                 socket.soLingerOn="false"
                 compression="256"
                 compressableMimeType="text/html,text/xml,text/plain,application/x-protobuf,application/json,application/javascript"
                 URIEncoding="UTF-8" />

      <!-- The load balancer terminates SSL connections and
           then forwards them to the following connector as
           normal HTTP (non-secure) requests
       -->
      <Connector executor="tomcatServiceThreadPool"
                 port="8443"
                 protocol="org.apache.coyote.http11.Http11NioProtocol"
                 connectionTimeout="1000"
                 enableLookups="false"
                 connectionLinger="-1"
                 acceptorThreadCount="20"
                 processorCache="800"
                 socket.tcpNoDelay="true"
                 socket.soKeepAlive="true"
                 socket.soLingerOn="false"
                 compression="256"
                 compressableMimeType="text/html,text/xml,text/plain,application/x-protobuf,application/json,application/javascript"
                 URIEncoding="UTF-8" />


      <!-- Define an AJP 1.3 Connector on port 8009 -->
      <Connector port="8009" protocol="AJP/1.3" redirectPort="8443" />

Also below is an example of the behavior we observe:

curl -verbose 'http://localhost:8080/rs?id=nio2issue'
* About to connect() to localhost port 8080 (#0)
*   Trying 127.0.0.1... connected
* Connected to localhost (127.0.0.1) port 8080 (#0)
> GET /rs?id=nio2issue HTTP/1.1
> User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.15.3 zlib/1.2.3 libidn/1.18 libssh2/1.4.2
> Host: localhost:8080
> Accept: */*
> Referer: rbose
>
* Closing connection #0
* Failure when receiving data from the peer
curl: (56) Failure when receiving data from the peer

at the same time:

curl -i 'http://localhost:8443/rs?id=nio2issue'
HTTP/1.1 302 Found

Also, no unusual errors are logged to catalina.out at the time of the incident. A thread dump from the server is attached.
We observed the same behavior on Tomcat 8.0.18 and upgraded to the latest version in the same release line, 8.0.47, but it didn't help.

Please let me know what else might be helpful. We are keeping one of the servers in this state for now so we can gather more data, since the issue is intermittent and we were not able to reproduce it with a simple load test.

Regards,
Oleg.
Comment 1 Remy Maucherat 2017-11-29 06:58:21 UTC
The thread dump looks perfect: acceptor thread blocking on the accept, all threads idle and ready to execute something. Please investigate on the user list to get at least some idea on how to reproduce it.
If possible, try to avoid using a custom executor, it makes things more complex and the benefit is usually not obvious.
Comment 2 Oleg 2017-11-29 19:25:56 UTC
Hi, 

I realize the thread dump might look fine, and this is the most confusing part: even a simple curl command from the same host receives no response from this connector, and it starts working again after a Tomcat restart. At the same time, Tomcat overall looks healthy and the other connector works fine. Since this happens from time to time on different servers, it doesn't look like an OS or hardware issue, but rather something specific to Tomcat's NIO2 connector.
When we make a request to the NIO2 connector in this bad state, no thread activity is triggered in Tomcat. Looking at the Tomcat source code, it appears that a CountDownLatch is simply never counted down and the service just hangs waiting on it, but the root cause is still not clear.
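For illustration only (a hypothetical minimal sketch, not Tomcat's actual code): if a processing path blocks on a CountDownLatch whose countDown() is supposed to come from an I/O completion callback, and that callback is never invoked, the waiting thread simply parks with no log output and no visible activity, which would match the symptoms above.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LatchHangDemo {
    public static void main(String[] args) throws InterruptedException {
        // Hypothetical latch that an I/O completion callback is expected to
        // count down. Here the callback never fires, simulating the hang.
        CountDownLatch ioComplete = new CountDownLatch(1);

        // A timed await is used so this demo terminates; a plain await()
        // would park the thread forever, exactly as a hung request would.
        boolean completed = ioComplete.await(500, TimeUnit.MILLISECONDS);
        System.out.println("completed=" + completed); // prints completed=false
    }
}
```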

So I'm curious what additional information we can provide to help investigate this issue together with the Tomcat dev team.


Also, I'm not sure about your remark about a custom executor: we don't use a custom one, we just configure the standard one that ships with Tomcat.

Regards,
Oleg.
Comment 3 Piotr 2017-12-11 11:52:38 UTC
We think we have tracked it down to a Java bug in the asynchronous server socket implementation.

Please see the following bug report, which seems to exhibit a similar issue:
https://bugs.openjdk.java.net/browse/JDK-8172750
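For context, Tomcat's NIO2 connector accepts connections through the JDK's AsynchronousServerSocketChannel, roughly along the lines of the simplified sketch below (this is illustrative, not Tomcat's actual code). If the JDK-level completion handler is never invoked, which is the failure mode described in JDK-8172750, the connector shows no thread activity for new connections and logs nothing, matching the symptoms reported above.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.channels.AsynchronousServerSocketChannel;
import java.nio.channels.AsynchronousSocketChannel;
import java.nio.channels.CompletionHandler;
import java.util.concurrent.CountDownLatch;

public class AsyncAcceptDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch accepted = new CountDownLatch(1);
        AsynchronousServerSocketChannel server =
                AsynchronousServerSocketChannel.open()
                        .bind(new InetSocketAddress("127.0.0.1", 0));

        // Accept asynchronously: the JVM invokes this handler when a
        // connection arrives. If the underlying implementation loses the
        // event, the handler never runs and the server appears hung.
        server.accept(null, new CompletionHandler<AsynchronousSocketChannel, Void>() {
            @Override
            public void completed(AsynchronousSocketChannel ch, Void att) {
                accepted.countDown();
                try { ch.close(); } catch (IOException ignored) { }
            }

            @Override
            public void failed(Throwable exc, Void att) {
                exc.printStackTrace();
            }
        });

        // Connect a client to trigger the completion handler.
        int port = ((InetSocketAddress) server.getLocalAddress()).getPort();
        try (Socket client = new Socket("127.0.0.1", port)) {
            accepted.await();
        }
        server.close();
        System.out.println("accepted");
    }
}
```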
Comment 4 Remy Maucherat 2017-12-11 12:37:03 UTC
Ok, maybe. Let us know if you find some elements demonstrating an issue in Tomcat.