63802 – epoll spin detection is missing

Summary: epoll spin detection is missing

Status:	RESOLVED WONTFIX

Alias:	None

Product:	Tomcat 8
Classification:	Unclassified
Component:	Catalina
Version:	8.5.42
Hardware:	PC Mac OS X 10.1

Importance:	P2 critical
Target Milestone:	----
Assignee:	Tomcat Developers Mailing List

URL:
Keywords:

Depends on:
Blocks:

Reported:	2019-10-04 08:23 UTC by Emmanuel L
Modified:	2020-12-02 19:32 UTC
CC List:	0 users

Description Emmanuel L 2019-10-04 08:23:29 UTC

There is a long-lived bug, either in the JDK or in the Linux epoll implementation itself, that can make the select() call return immediately with 0 SelectionKeys to process. If you then call select() again right away, you end up in a busy loop with 100% CPU usage.

A workaround has been implemented in Apache MINA, in Netty, in Grizzly, but I don't see such a workaround implemented in Tomcat.

The idea is to stop calling select() again once it has returned 0 for a few iterations in a row. At that point, a new Selector is created, all the channels registered with the old selector are registered with the new one, and the old selector is discarded.
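To make the idea concrete, here is a minimal Java sketch of that workaround. It is not taken from MINA, Netty or Grizzly, and it is not Tomcat code: the class name SelectorSpinGuard, the method names and the SPIN_THRESHOLD value are all hypothetical, chosen only for illustration. It assumes the selector is only touched by the poller thread.

```java
import java.io.IOException;
import java.nio.channels.SelectableChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.util.ArrayList;

/** Hypothetical helper: counts premature select() returns and rebuilds the selector. */
final class SelectorSpinGuard {
    /** Arbitrary placeholder (assumption); a real value would need tuning. */
    static final int SPIN_THRESHOLD = 10;

    private Selector selector;
    private int prematureReturns;

    SelectorSpinGuard(Selector selector) {
        this.selector = selector;
    }

    Selector selector() {
        return selector;
    }

    /**
     * Call after every select(timeout). Returns true once select() has come
     * back with 0 keys, earlier than its timeout, SPIN_THRESHOLD times in a row.
     * Note: a return caused by Selector.wakeup() also looks premature here;
     * a real implementation would have to exclude those.
     */
    boolean shouldRebuild(int readyKeys, long elapsedNanos, long timeoutNanos) {
        if (readyKeys > 0 || elapsedNanos >= timeoutNanos) {
            prematureReturns = 0; // normal behaviour, reset the streak
            return false;
        }
        if (++prematureReturns < SPIN_THRESHOLD) {
            return false;
        }
        prematureReturns = 0;
        return true;
    }

    /** Opens a new selector, moves every valid registration to it, closes the old one. */
    Selector rebuild() throws IOException {
        Selector old = selector;
        Selector fresh = Selector.open();
        // Copy the key set first so that cancelling keys cannot disturb the iteration.
        for (SelectionKey key : new ArrayList<>(old.keys())) {
            if (!key.isValid()) {
                continue;
            }
            SelectableChannel channel = key.channel();
            int ops = key.interestOps();
            Object attachment = key.attachment();
            key.cancel();
            channel.register(fresh, ops, attachment);
        }
        old.close();
        selector = fresh;
        return fresh;
    }
}
```

In a poller loop, one would time each selector().select(timeout) call with System.nanoTime(), pass the result to shouldRebuild(), and call rebuild() when it returns true, then keep looping on the new selector.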

You can have a look at the Grizzly code, line 501:

https://github.com/javaee/grizzly/blob/master/modules/grizzly/src/main/java/org/glassfish/grizzly/nio/SelectorRunner.java

Or Apache MINA, line 609 and following:

https://github.com/apache/mina/blob/2.1.X/mina-core/src/main/java/org/apache/mina/core/polling/AbstractPollingIoProcessor.java

Or Netty, line 849 and following:

https://github.com/netty/netty/blob/4.1/transport/src/main/java/io/netty/channel/nio/NioEventLoop.java


This workaround is critical for those three projects to work properly on Linux. The problem does not exist on Windows or Mac OS X, which is why Grizzly added a flag to turn it on or off.

FTR, I'm currently being hit by such random 100% CPU peaks on a project I'm working on. A thread dump shows that the thread consuming the CPU is the one running the infinite select() loop:

at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at org.apache.tomcat.util.net.NioEndpoint$Poller.run(NioEndpoint.java:825)
at java.lang.Thread.run(Thread.java:748)

Comment 1 Remy Maucherat 2019-10-04 08:51:30 UTC

I would recommend investigating and discussing this first on the user list.

Comment 2 Emmanuel L 2019-10-04 14:34:54 UTC

Discussion started on the users mailing list (I would have assumed it belonged on the dev mailing list, but I followed your advice).

Thanks!

Comment 3 Remy Maucherat 2019-11-04 11:04:04 UTC

Following the discussion on the mailing list, and given I could find only one mention of a possible issue overall ( https://github.com/netty/netty/issues/327 ), I will not add a workaround for now.
I did not get feedback on the NIO2 resilience to this possible problem.
Leaving this open for further research.

Comment 4 Mark Thomas 2020-11-30 14:49:33 UTC

The associated JRE bug is https://bugs.openjdk.java.net/browse/JDK-8238279

I have confirmed that the reproducer provided with that bug (https://github.com/cedric780/EPollArrayWrapper-bug) still triggers with the latest Java 8 from AdoptOpenJDK.

I think this is enough evidence to implement a work-around in Tomcat.

Comment 5 Remy Maucherat 2020-11-30 15:16:34 UTC

Ok, so there's a reproducer for this now. It's supposedly fixed in Java 11. Personally, given the ugliness of the workaround, the rarity of the issue and the fact that there's a fix, I would rather not do anything.

Comment 6 Mark Thomas 2020-11-30 20:50:38 UTC

I ran 10 tests with Java 11 and didn't see the issue. The developer of the reproducer also confirmed the issue is fixed in Java 11.

I'm happy to implement a work-around but I'd be equally happy with closing this as WONTFIX and pointing folks that are experiencing this issue to Java 11 and/or the Java 8 bug.

Given your preference for WONTFIX, are there any objections to taking that approach?

Comment 7 Mark Thomas 2020-12-02 19:32:32 UTC

Resolving as WONTFIX as per previous comments.