Upgraded from Spring Boot v1.3.6 to v.1.4.0, which in turn upgraded the embedded tomcat container from 8.0.36 to 8.5.4. We are now seeing a bunch of the following exceptions after the app has been running for a number of hours: 2016-08-24 00:00:03.129 ERROR 15154 --- [https-jsse-nio-172.30.72.54-30020-Acceptor-0] [] org.apache.tomcat.util.net.NioEndpoint : Socket accept failed java.io.IOException: Too many open files at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method) at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422) at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250) at org.apache.tomcat.util.net.NioEndpoint$Acceptor.run(NioEndpoint.java:457) at java.lang.Thread.run(Thread.java:745) Running lsof for the Spring Boot pid shows thousands of entries like the following: java 9694 xxx 1251u sock 0,6 0t0 6636870 can't identify protocol Downgrading to embedded tomcat 8.0.36 (whilst remaining on Spring Boot 1.4.0) fixes the issue.
I'm starting to look at this. I've already found and fixed an unrelated file descriptor leak and I can see a couple more places where fixes might be required. However, I haven't - yet - seen anything that looks like the leak described in this report. Any information on how to reproduce it would be very helpful. For example, does it require HTTPS, WebSocket, Servlet 3.0 Async or Servlet 3.1 non-blocking? A test case that generates the leak would be even better.
Hi Mark I have been able to pinpoint that the issue started with 8.5.3. 8.5.2 seems to be ok. I can see from the 8.5.3 changlog at least one change around sockets: "Fix a cause of multiple attempts to close the same socket" I downloaded, built and deployed the sample Spring Boot app from one of the tutorials. Couldn't reproduce the issue, even with 8.5.4, although the app is pretty "dumb": https://spring.io/guides/gs/spring-boot/ https://github.com/spring-guides/gs-spring-boot/archive/master.zip Users connect to our app over https, and our app connects to ldap over http and other webservices over https. No WebSocket, Servlet 3.0 Async or Servlet 3.1 non-blocking. Cheers Joe
Thanks for tracking down which version this appeared in. I'll take a look at the diff between the two tags. I agree that the change you highlight is a good place to start.
I've reviewed the changes and I don't see anything problematic yet. There is what looks like a single fd leak on first use that I want to track down but that is unlikely to be related to this issue. Given that you were able to test 8.5.2 and 8.5.3 that suggests you have a way of reproducing this. Can you share that?
I think I have been able to reproduce this: - Run Tomcat on machine 1 - Run abs test on machine 2 - Disable network on machine 1 Often, but not always, this leaves sockets open. I've searched through the commits to figure out when was introduced and have narrowed it down to this commit: http://svn.apache.org/viewvc?view=revision&revision=1746551 Next up is figuring out why and fixing it.
Good news. I've figured out the root cause and there will be a fix in the next release. Indirectly, r1746551 was the issue. The problem occurs for TLS connections when the connection is dropped after the socket has been accepted but before the handshake is complete. The socket ended up in a loop: - timeout -> ERROR event - process ERROR (this is the new bit from r1746551) - try to finish handshake - need more data from client - register with poller for READ - wait for timeout - timeout ... ... and around you go.
This has been fixed in the following branches: - 9.0.x for 9.0.0.M10 onwards - 8.5.x for 8.5.5 onwards 8.0.x and earlier were not affected.
Hi Mark, Apologies Joe went quiet - but he left the company. Thanks for looking into this, it kind of brought down all the servers we deployed the upgrade onto so was pretty serious for us. What you have identified sounds a very likely candidate. We'll update our tomcat jars and let you know how we go. Thanks again.
All indications are that this fixed our issues. Thanks again.
Thanks for confirming the fix.
*** Bug 60555 has been marked as a duplicate of this bug. ***
(In reply to Mark Thomas from comment #7) > This has been fixed in the following branches: > - 9.0.x for 9.0.0.M10 onwards > - 8.5.x for 8.5.5 onwards > > 8.0.x and earlier were not affected. As mentioned in https://bz.apache.org/bugzilla/show_bug.cgi?id=60555#c4 this bug effects the 7.0 line as well. Can you please check and confirm based on @hiki's comments in 60555.