Bug 58970

Summary: http NIO connector crash after update from 8.0.27 to 8.0.30
Product: Tomcat 8
Reporter: slash
Component: Connectors
Assignee: Tomcat Developers Mailing List <dev>
Status: RESOLVED WORKSFORME    
Severity: normal
CC: julien, reda.housnialaoui
Priority: P2    
Version: 8.0.30   
Target Milestone: ----   
Hardware: PC   
OS: Linux   
Attachments: Graph of network connection status during the crash of the connector
Thread dump of a tomcat 8.0.30 with http connector frozen

Description slash 2016-02-04 17:18:07 UTC
Created attachment 33531
Graph of network connection status during the crash of the connector

==============================
Environment:
Debian 8
Tomcat 8.0.30
Java Oracle JDK 1.8.0_72
Using connector NIO, current connector configuration:
    <Connector port="8001" protocol="org.apache.coyote.http11.Http11NioProtocol"
        connectionTimeout="20000"
        acceptorThreadCount="4"
        maxThreads="200"
        maxConnections="1000"
        maxKeepAliveRequests="5000" />
Hardware: several different servers, each with Intel Xeon CPUs totalling 16 cores (32 threads); around 30GB of memory per Tomcat instance, using G1GC.
==============================
What is happening:
Before the update, with Tomcat version 8.0.27, we didn't have any issue with the NIO connector. It was working fine, and so was websocket.
Since the update, the connector just "crashes" after several hours of work: no requests are processed any more (websocket or http), and trying to access any application from http://ip:8001/ just hangs. Looking at the state of the network sockets, it is clearly not working (graph attached).

The http/NIO connector is used almost exclusively for websocket connections (the only connections that are not websocket come from our internal connector checker).

There is also an AJP/APR connector that keeps working fine during that time, even when the NIO/http connector crashes.

I don't see anything in catalina.out or in the system log.

I know this is difficult to debug with so little information. I only see this issue in production, when there is a large number of connections, never in test.

Tomcat is behind an Apache httpd 2.4 proxy. Relevant configuration:
JkMount /APPNAME* server_tomcat1
ProxyPass /APPNAME/realtime/ ws://server.example.net:8001/APPNAME/realtime/
ProxyPassReverse /APPNAME/realtime/ ws://server.example.net:8001/APPNAME/realtime/
Comment 1 Mark Thomas 2016-02-05 09:36:40 UTC
Thread dump when the problem occurs and logs leading up to the problem please.

Best guess at this point is that the Poller thread stopped, but without more information that is nothing more than a wild guess.
Comment 2 slash 2016-02-05 10:30:41 UTC
I know it's difficult to debug like this. Unfortunately, I had to roll production back to 8.0.27 for now to restore our websocket services.

I'll see what I can do to give you relevant logs/thread dump.
Comment 3 Réda Housni Alaoui 2016-04-06 13:43:41 UTC
Created attachment 33732
Thread dump of a tomcat 8.0.30 with http connector frozen

Hello, 

Please find the requested thread dump attached.
It was taken from a Tomcat 8.0.30 with a frozen http NIO connector.

Regards
Comment 4 Remy Maucherat 2016-04-06 16:32:56 UTC
The dump looks slightly weird (lots of APR AJP threads, which seem more active to me than the NIO connector). However, the NIO connector is indeed stuck at its maximum connection count; the connections have probably been leaked due to the Atmosphere use, which may or may not be doing bad things.

maxConnections is 10000 and often does not make sense (I disabled it by default for the NIO2 connector).

So I'll switch it back to NEEDINFO since there's no proof this is valid (or that it is the same issue that was originally reported, although I'd say it's likely).
Comment 5 Réda Housni Alaoui 2016-04-07 07:49:45 UTC
I am sorry, I wasn't clear enough.
Slash and I work at the same company, so I can assure you that the uploaded thread dump is about this issue.

We have a lot of traffic on AJP and less on http NIO because all non-websocket traffic goes through httpd mod_jk and then the AJP connector.
Since mod_jk can't deal with websocket connections, the http NIO connector is there only to handle websocket traffic.

Here is what we do to systematically reproduce the issue:
- From a Node.js application, we try to establish 20 000 Atmosphere connections using the websocket transport to the app running in Tomcat 8.0.30
- Once we hit the maximum number of connections, we wait about 1 minute
- Then we forcibly kill the Node application and relaunch it to establish 20 000 new Atmosphere connections
- If the http connector is still alive, we repeat the whole operation

It takes about 3 attempts to crash the http connector.
In the end, the Node app is completely stopped and there are no more connections to the Tomcat http NIO connector, yet the connector is totally frozen.

From what I have seen, comparing the thread dump of a healthy Tomcat with the thread dump of a Tomcat with a frozen connector, all http NIO acceptor threads are in PARKING status when the connector is frozen.
Comment 6 Réda Housni Alaoui 2016-04-07 07:58:42 UTC
I don't know if you can see this in the thread dump, but we are using the JSR356 websocket implementation.
Comment 7 Mark Thomas 2016-04-12 21:25:49 UTC
The problem is with the current connection count tracking. There are code paths where this isn't being decremented when a connection closes in error. I'm currently looking for a reliable way to track the open connection count.
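
For readers following along, here is a minimal sketch of the kind of leak described in the comment above. It is not Tomcat's actual code: the class, the method names and the `MAX_CONNECTIONS` value are hypothetical, and a plain `java.util.concurrent.Semaphore` stands in for the connector's connection limit. It only illustrates how an error path that skips the decrement eventually exhausts the limit, while releasing in a `finally` block does not.

```java
import java.util.concurrent.Semaphore;

public class ConnectionCountLeakSketch {
    // Hypothetical stand-in for the connector's maxConnections limit.
    static final int MAX_CONNECTIONS = 3;

    // Leaky variant: the slot is given back only on the normal close path.
    static void handleLeaky(Semaphore slots, boolean failsWithError) {
        if (!slots.tryAcquire()) {
            return; // limit reached; a real acceptor thread would block here
        }
        if (failsWithError) {
            return; // error path forgets to release the slot
        }
        slots.release();
    }

    // Safe variant: the slot is released on every path via finally.
    static void handleSafe(Semaphore slots, boolean failsWithError) {
        if (!slots.tryAcquire()) {
            return;
        }
        try {
            if (failsWithError) {
                throw new IllegalStateException("simulated connection error");
            }
        } catch (IllegalStateException e) {
            // the connection closed in error; nothing else to do in this sketch
        } finally {
            slots.release();
        }
    }

    // Runs the given number of failing connections and reports the free slots left.
    static int freeSlotsAfter(boolean leaky, int failedConnections) {
        Semaphore slots = new Semaphore(MAX_CONNECTIONS);
        for (int i = 0; i < failedConnections; i++) {
            if (leaky) {
                handleLeaky(slots, true);
            } else {
                handleSafe(slots, true);
            }
        }
        return slots.availablePermits();
    }

    public static void main(String[] args) {
        System.out.println("leaky free slots: " + freeSlotsAfter(true, 3));
        System.out.println("safe free slots:  " + freeSlotsAfter(false, 3));
    }
}
```

In the leaky variant, once the number of failed connections reaches the limit, no slot is ever freed again, which would match the symptom of acceptor threads parked forever even though no client is connected.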
Comment 8 Mark Thomas 2016-04-13 19:14:04 UTC
I (think I) found the root cause. This has been fixed in:
- 9.0.x for 9.0.0.M5
- 8.5.x for 8.5.1
- 8.0.x for 8.0.34
- 7.0.x for 7.0.70
Comment 9 Réda Housni Alaoui 2016-04-15 12:37:47 UTC
Thank you for the fix.
When can we expect the 8.0.34 release?
Would it be wise to use the current 8.0.34 snapshot in production?
Comment 10 Remy Maucherat 2016-04-15 12:43:37 UTC
Simply set maxConnections to unlimited (-1) in your configuration and you're done.
Comment 11 Réda Housni Alaoui 2016-11-09 09:37:37 UTC
Hello,

We still have the issue on tomcat 8.0.37 and 8.0.38 with the same configuration.
New jstack attached.
Comment 12 Réda Housni Alaoui 2016-11-09 10:10:06 UTC
The dump is too big to be attached.

Here is a link to download it: 
http://s000.tinyupload.com/index.php?file_id=00903516386387493654
Comment 13 Réda Housni Alaoui 2016-11-09 10:11:57 UTC
The dump comes from a tomcat 8.0.38 with crashed http connector.
Comment 14 Mark Thomas 2016-11-09 11:40:39 UTC
Do the same reproduction steps still create the issue?

Can you provide a (simple as possible) web application and client we can use to recreate this problem?
Comment 15 Remy Maucherat 2016-11-09 12:22:54 UTC
I still don't understand whether this is caused by maxConnections or not. Can the unlimited setting be tried and/or the connection count be monitored?

Usually unplugging a network cable is the worst test, since the other server may never actually notice that the network connection is dead. However, the server connectionTimeout should work, but it doesn't necessarily apply in all cases (websockets, etc., and that is precisely the scenario here).
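
One possible way to monitor the connection count is over JMX. The sketch below rests on assumptions that should be checked against the running instance (for example with jconsole): that the connector MBeans live in the `Catalina` domain with `type=ThreadPool`, and that they expose a numeric `connectionCount` attribute. The class and method names are made up for illustration.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.util.Map;
import java.util.TreeMap;
import javax.management.AttributeNotFoundException;
import javax.management.JMException;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

public class ConnectorConnectionCount {
    // Returns MBean name -> connectionCount for every matching connector MBean.
    static Map<String, Long> readConnectionCounts(MBeanServerConnection server) {
        Map<String, Long> counts = new TreeMap<>();
        try {
            // Assumption: connector endpoints are registered under this domain and type.
            ObjectName pattern = new ObjectName("Catalina:type=ThreadPool,*");
            for (ObjectName name : server.queryNames(pattern, null)) {
                try {
                    // Assumption: the endpoint MBean exposes "connectionCount".
                    Object value = server.getAttribute(name, "connectionCount");
                    counts.put(name.toString(), ((Number) value).longValue());
                } catch (AttributeNotFoundException e) {
                    // Matching MBean without this attribute; skip it.
                }
            }
        } catch (JMException | IOException e) {
            throw new IllegalStateException("Could not read connector MBeans", e);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The platform MBean server only sees Tomcat's MBeans when this code
        // runs inside the Tomcat JVM; outside of it the result is simply empty.
        readConnectionCounts(ManagementFactory.getPlatformMBeanServer())
            .forEach((name, count) -> System.out.println(name + " connectionCount=" + count));
    }
}
```

To poll a remote instance, the same method could be given an `MBeanServerConnection` obtained through `JMXConnectorFactory`, provided remote JMX is enabled. A count that stays at the limit after all clients have disconnected would point at leaked connections rather than real load.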
Comment 16 Mark Thomas 2017-04-04 15:38:59 UTC
No further response from the OP, no information on how to reproduce this and no similar reports from other users.

If you believe you are experiencing this issue or a similar one, please open a new issue with the steps to reproduce it on a clean install of the latest 7.0.x, 8.0.x, 8.5.x or 9.0.x release.