Bug 60035 - java.io.IOException: Too many open files
Summary: java.io.IOException: Too many open files
Status: RESOLVED FIXED
Alias: None
Product: Tomcat 8
Classification: Unclassified
Component: Catalina
Version: 8.5.4
Hardware: Other Linux
Importance: P2 normal
Target Milestone: ----
Assignee: Tomcat Developers Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-08-24 01:30 UTC by joe.stepowski
Modified: 2017-08-04 20:49 UTC
CC List: 2 users



Description joe.stepowski 2016-08-24 01:30:44 UTC
Upgraded from Spring Boot v1.3.6 to v1.4.0, which in turn upgraded the embedded Tomcat container from 8.0.36 to 8.5.4.

We are now seeing a bunch of the following exceptions after the app has been running for a number of hours:

2016-08-24 00:00:03.129 ERROR 15154 --- [https-jsse-nio-172.30.72.54-30020-Acceptor-0] [] org.apache.tomcat.util.net.NioEndpoint   : Socket accept failed

java.io.IOException: Too many open files
        at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
        at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
        at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
        at org.apache.tomcat.util.net.NioEndpoint$Acceptor.run(NioEndpoint.java:457)
        at java.lang.Thread.run(Thread.java:745)

Running lsof for the Spring Boot pid shows thousands of entries like the following:

java    9694 xxx 1251u  sock                0,6       0t0 6636870 can't identify protocol
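
For reference (this snippet is not part of the original report), one way to watch the leak from inside the JVM instead of running lsof repeatedly is to poll the platform MXBean's file descriptor counters. This is only an illustrative sketch; it assumes a HotSpot/OpenJDK JVM on Unix, where the OperatingSystemMXBean can be cast to com.sun.management.UnixOperatingSystemMXBean:

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Illustrative diagnostic sketch, not part of the affected application.
public class FdWatcher {
    public static void main(String[] args) throws InterruptedException {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (!(os instanceof com.sun.management.UnixOperatingSystemMXBean)) {
            System.err.println("Open FD count not available on this JVM/OS");
            return;
        }
        com.sun.management.UnixOperatingSystemMXBean unixOs =
                (com.sun.management.UnixOperatingSystemMXBean) os;
        // Print open/max descriptor counts every 10 seconds; an open count
        // that climbs steadily and never drops back points at a leak.
        while (true) {
            System.out.printf("open fds: %d / max: %d%n",
                    unixOs.getOpenFileDescriptorCount(),
                    unixOs.getMaxFileDescriptorCount());
            Thread.sleep(10_000);
        }
    }
}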

Downgrading to embedded Tomcat 8.0.36 (whilst remaining on Spring Boot 1.4.0) fixes the issue.
Comment 1 Mark Thomas 2016-08-24 15:15:44 UTC
I'm starting to look at this. I've already found and fixed an unrelated file descriptor leak, and I can see a couple more places where fixes might be required. However, I haven't yet seen anything that looks like the leak described in this report.

Any information on how to reproduce it would be very helpful. For example, does it require HTTPS, WebSocket, Servlet 3.0 Async or Servlet 3.1 non-blocking? A test case that generates the leak would be even better.
Comment 2 joe.stepowski 2016-08-25 05:44:47 UTC
Hi Mark

I have been able to pinpoint that the issue started with 8.5.3.  8.5.2 seems to be ok.  

I can see from the 8.5.3 changelog at least one change around sockets: "Fix a cause of multiple attempts to close the same socket"

I downloaded, built and deployed the sample Spring Boot app from one of the tutorials.  Couldn't reproduce the issue, even with 8.5.4, although the app is pretty "dumb":
https://spring.io/guides/gs/spring-boot/
https://github.com/spring-guides/gs-spring-boot/archive/master.zip 

Users connect to our app over HTTPS, and our app connects to LDAP over HTTP and to other web services over HTTPS. No WebSocket, Servlet 3.0 Async or Servlet 3.1 non-blocking.

Cheers
Joe
Comment 3 Mark Thomas 2016-08-25 19:37:18 UTC
Thanks for tracking down which version this appeared in. I'll take a look at the diff between the two tags. I agree that the change you highlight is a good place to start.
Comment 4 Mark Thomas 2016-08-26 07:48:05 UTC
I've reviewed the changes and I don't see anything problematic yet. There is what looks like a single fd leak on first use that I want to track down but that is unlikely to be related to this issue.

Given that you were able to test 8.5.2 and 8.5.3, it sounds like you have a way of reproducing this. Can you share it?
Comment 5 Mark Thomas 2016-08-26 13:18:50 UTC
I think I have been able to reproduce this:

- Run Tomcat on machine 1
- Run abs test on machine 2
- Disable network on machine 1

Often, but not always, this leaves sockets open. I've searched through the commits to figure out when this was introduced and have narrowed it down to this commit:
http://svn.apache.org/viewvc?view=revision&revision=1746551

Next up is figuring out why and fixing it.
Comment 6 Mark Thomas 2016-08-26 16:53:17 UTC
Good news. I've figured out the root cause and there will be a fix in the next release.

Indirectly, r1746551 was the issue.

The problem occurs for TLS connections when the connection is dropped after the socket has been accepted but before the handshake is complete. The socket ends up in a loop:
- timeout -> ERROR event
- process ERROR (this is the new bit from r1746551)
- try to finish handshake
- need more data from client
- register with poller for READ
- wait for timeout
- timeout ...

... and around you go.
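
To make the cycle concrete, here is a tiny, self-contained simulation (not Tomcat source; the class and state names are invented purely for illustration) of why the accepted socket never reaches a close and its file descriptor leaks:

// Illustrative only - a simplified model of the loop described above.
public class HandshakeErrorLoopDemo {

    enum State { TIMED_OUT, PROCESSING_ERROR, WAITING_FOR_READ }

    public static void main(String[] args) {
        boolean socketClosed = false;
        State state = State.TIMED_OUT;

        // Bounded here for the demo; in the real bug the cycle repeats until
        // the process runs out of file descriptors.
        for (int i = 0; i < 6 && !socketClosed; i++) {
            switch (state) {
                case TIMED_OUT:
                    // The handshake timeout raises an ERROR event for the socket.
                    System.out.println("timeout -> ERROR event");
                    state = State.PROCESSING_ERROR;
                    break;
                case PROCESSING_ERROR:
                    // Post-r1746551 the ERROR event is processed, which tries to
                    // finish the TLS handshake instead of closing the socket. The
                    // client is already gone, so more data is "needed" and the
                    // socket is re-registered with the poller for READ.
                    System.out.println("process ERROR -> need more data -> register for READ");
                    state = State.WAITING_FOR_READ;
                    break;
                case WAITING_FOR_READ:
                    // No data ever arrives, so the read times out again.
                    System.out.println("wait for timeout");
                    state = State.TIMED_OUT;
                    break;
            }
        }
        // socketClosed is never set: the descriptor stays open indefinitely.
        System.out.println("socket closed? " + socketClosed);
    }
}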
Comment 7 Mark Thomas 2016-08-26 18:21:46 UTC
This has been fixed in the following branches:
- 9.0.x for 9.0.0.M10 onwards
- 8.5.x for 8.5.5 onwards

8.0.x and earlier were not affected.
Comment 8 Marcus Hughes 2016-09-08 01:34:06 UTC
Hi Mark, apologies that Joe went quiet - he left the company. Thanks for looking into this; it brought down all the servers we deployed the upgrade onto, so it was pretty serious for us. What you have identified sounds like a very likely candidate.

We'll update our tomcat jars and let you know how we go. Thanks again.
Comment 9 Marcus Hughes 2016-09-13 05:13:02 UTC
All indications are that this fixed our issues. Thanks again.
Comment 10 Mark Thomas 2016-09-13 08:14:00 UTC
Thanks for confirming the fix.
Comment 11 Mark Thomas 2017-01-05 20:37:36 UTC
*** Bug 60555 has been marked as a duplicate of this bug. ***
Comment 12 Vaibhav Bhandari 2017-08-04 20:47:37 UTC
(In reply to Mark Thomas from comment #7)
> This has been fixed in the following branches:
> - 9.0.x for 9.0.0.M10 onwards
> - 8.5.x for 8.5.5 onwards
> 
> 8.0.x and earlier were not affected.

As mentioned in https://bz.apache.org/bugzilla/show_bug.cgi?id=60555#c4, this bug affects the 7.0 line as well. Can you please check and confirm, based on @hiki's comments in bug 60555?