Bug 65776 - "Duplicate accept detected" error from a subsequent request with the same local port
Summary: "Duplicate accept detected" error from a subsequent request with the same loc...
Status: RESOLVED FIXED
Alias: None
Product: Tomcat 9
Classification: Unclassified
Component: Connectors (show other bugs)
Version: 9.0.56
Hardware: All All
: P2 normal (vote)
Target Milestone: -----
Assignee: Tomcat Developers Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-12-31 09:25 UTC by Johnny Lim
Modified: 2022-02-10 17:17 UTC (History)
1 user (show)



Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Johnny Lim 2021-12-31 09:25:36 UTC
"Duplicate accept detected" error might happen if a client reuses its local port somehow.

This is a synthetic test to demonstrate it: https://github.com/izeye/spring-boot-throwaway-branches/blob/tomcat-duplicate-accept-detected/src/test/java/com/izeye/throwaway/DuplicateAcceptDetectedTests.java

This is happening in my production environment that is running on CentOS in a Kubernetes cluster. Although I didn't get to the bottom of the reason why the same local port is being used yet, it seems that it might prevent from accepting valid requests from a client.
Comment 1 Mark Thomas 2021-12-31 12:38:13 UTC
This scenario was considered when designing the protection for the OS bug. The solution considered was to add a timing check for the re-use as in the OS bug case the re-use is near enough instant. However, we didn't add the check as we could not see a scenario where:
- a client connected to Tomcat
- no other clients connected
- the same client reconnected using the same local port
and we wanted to avoid the performance overhead of the check.

It appears that there is something about your production environment where the above sequence is happening. Are you sure the client is genuinely re-using the local port rather than the server hitting the OS bug? We believe the bug affects multiple Linux distributions.
Comment 2 Johnny Lim 2021-12-31 13:29:38 UTC
Thanks for the quick feedback!

I just assumed that it's a Ubuntu-specific bug as it's reported against the Ubuntu issue tracker. I haven't had time looking into it closely yet, so it was just one of possible guesses based on the assumption.

I'll let you know if anything meaningful is identified.
Comment 3 Mark Thomas 2021-12-31 14:11:48 UTC
Moving to NEEDINFO pending further updates.
Comment 4 Mark Thomas 2022-01-02 13:10:27 UTC
The simplest thing to do is to run the pure C test case provided in the Java bug report:

https://bugs.openjdk.java.net/browse/JDK-8263243?focusedCommentId=14410275&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14410275

If you see the error then raise a bug with CentOS. I'd recommend including a link to at least the Java bug report for background.
Comment 5 Sunwoo 2022-01-14 03:39:38 UTC
I'm sorry, I'm not a goot at English.
Please carefully read it. :)

-----
I found a situation where a normal connection was attempted with the same client IP/Port in the following situations.

In under load, in the case of the kernel configuration as follows, if the randomly found position is located in 32768-49999 with high probability, there is a possibility that 50000 will be allocated continuously.

- ip_local_port_range = 32768 - 60000
- ip_local_reserved_ports = 30000-49999

# find client port - kernel simple pseudocode
``` 
  port = random(in ip_local_port_range)
  while port++ < max(ip_local_port_range)
    if (port in ip_local_reserved_ports) continue
    if (port is used ports) continue
  	return port 
  done
  return not found port
```


- kernel 3.10
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/net/ipv4/inet_connection_sock.c?h=linux-3.10.y#n104

```
smallest_rover = rover = net_random() % remaining + low;
```

- kernel 4.19
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/net/ipv4/inet_connection_sock.c?h=linux-4.19.y#n182
```
offset = prandom_u32() % remaining;
```


In this setting, the problem occurs because the range of ip_local_reserved_ports is too wide, and although the appropriate port according to rfc6056 was not allocated, it does not mean that this connection itself is invalid.

I experienced this problem in the k8s readiness/liveness probe request, and eventually the service became UNREADY and the service became unavailable.

The need to bypass bugs in Ubuntu is understandable, but the current code is risky and has side effects.

Therefore, it appears that the change needs to be rolled back, offered an option, or better hedged. 

https://github.com/apache/tomcat/commit/d03cfcf3b0d6639acb2884f1bbea5f2f29b95d91

I hope for a positive review.
Comment 6 Mark Thomas 2022-01-14 19:58:19 UTC
That should only be an issue if:
- there are no other connections to the server between liveness checks
- the liveness checks are >= time_wait seconds apart

Increasing the frequency of the liveness checks should be a valid workaround in the rare cases this is an issue.

Meanwhile, I'll look at adding a "time since last accept" check to the test. When the error occurs it is almost instant so something like less than a second should work.
Comment 7 Mark Thomas 2022-01-14 22:23:06 UTC
Checking the time since the last accept adds a significant overhead. With a simply JMeter test with keep-alive disabled, throughput dropped by 75% once I added the timing check. That level of overhead isn't acceptable.

If you are seeing what you believe to be a false positive warning with a liveness check, reducing the time between checks should fix the issue.
Comment 8 Brooke T. Hedrick 2022-01-24 20:13:25 UTC
We are seeing this error on Windows Server 2019

Jan 21, 2022 10:30:21 AM org.apache.tomcat.util.net.AprEndpoint setSocketOptions
SEVERE: Error allocating socket processor
java.io.IOException: Duplicate accept detected. This is a known OS bug. Please consider reporting that you are affected: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1924298
 at org.apache.tomcat.util.net.AprEndpoint.setSocketOptions(AprEndpoint.java:811)
 at org.apache.tomcat.util.net.AprEndpoint.setSocketOptions(AprEndpoint.java:83)
 at org.apache.tomcat.util.net.Acceptor.run(Acceptor.java:149)
 at java.base/java.lang.Thread.run(Thread.java:829)

Adoptium JDK jdk-11.0.13+8 64bit
Apache Tomcat 9.0.56
tomcat-native-1.2.31
openssl-1.1.1l
apr-1.7.0
apr-util-1.6.1

2022-01-21 09:58:59 Apache Commons Daemon procrun stdout initialized.
Jan 21, 2022 9:59:01 AM org.apache.catalina.core.AprLifecycleListener lifecycleEvent
INFO: Loaded Apache Tomcat Native library [1.2.31] using APR version [1.7.0].
Jan 21, 2022 9:59:01 AM org.apache.catalina.core.AprLifecycleListener lifecycleEvent
INFO: APR capabilities: IPv6 [true], sendfile [true], accept filters [false], random [true], UDS [true].
Jan 21, 2022 9:59:01 AM org.apache.catalina.core.AprLifecycleListener lifecycleEvent
INFO: APR/OpenSSL configuration: useAprConnector [false], useOpenSSL [true]
Jan 21, 2022 9:59:01 AM org.apache.catalina.core.AprLifecycleListener initializeSSL
INFO: OpenSSL successfully initialized [OpenSSL 1.1.1l 24 Aug 2021]
Jan 21, 2022 9:59:01 AM org.apache.coyote.AbstractProtocol init
INFO: Initializing ProtocolHandler ["http-apr-0.0.0.0-xxxxx"]
Jan 21, 2022 9:59:01 AM org.apache.coyote.AbstractProtocol init
INFO: Initializing ProtocolHandler ["https-openssl-apr-0.0.0.0-xxxxx"]
Jan 21, 2022 9:59:01 AM org.apache.tomcat.util.net.openssl.OpenSSLUtil getKeyManagers
Jan 21, 2022 9:59:01 AM org.apache.catalina.startup.Catalina load
INFO: Server initialization in [667] milliseconds
Jan 21, 2022 9:59:01 AM org.apache.catalina.core.StandardService startInternal
INFO: Starting service [Catalina]
Jan 21, 2022 9:59:01 AM org.apache.catalina.core.StandardEngine startInternal
INFO: Starting Servlet engine: [Apache Tomcat/9.0.56]

AND We are seeing the error on Ubuntu 20.04.3 LTS

Jan 21, 2022 8:55:02 AM org.apache.tomcat.util.net.AprEndpoint setSocketOptions
SEVERE: Error allocating socket processor
java.io.IOException: Duplicate accept detected. This is a known OS bug. Please consider reporting that you are affected: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1924298
        at org.apache.tomcat.util.net.AprEndpoint.setSocketOptions(AprEndpoint.java:811)
        at org.apache.tomcat.util.net.AprEndpoint.setSocketOptions(AprEndpoint.java:83)
        at org.apache.tomcat.util.net.Acceptor.run(Acceptor.java:149)
        at java.base/java.lang.Thread.run(Thread.java:829)

Jan 21, 2022 7:08:11 AM org.apache.catalina.core.AprLifecycleListener lifecycleEvent
INFO: Loaded Apache Tomcat Native library [1.2.31] using APR version [1.7.0].
Jan 21, 2022 7:08:11 AM org.apache.catalina.core.AprLifecycleListener lifecycleEvent
INFO: APR capabilities: IPv6 [true], sendfile [true], accept filters [false], random [true], UDS [true].
Jan 21, 2022 7:08:11 AM org.apache.catalina.core.AprLifecycleListener lifecycleEvent
INFO: APR/OpenSSL configuration: useAprConnector [false], useOpenSSL [true]
Jan 21, 2022 7:08:11 AM org.apache.catalina.core.AprLifecycleListener initializeSSL
INFO: OpenSSL successfully initialized [OpenSSL 1.1.1l 24 Aug 2021]
Jan 21, 2022 7:08:11 AM org.apache.coyote.AbstractProtocol init
INFO: Initializing ProtocolHandler ["http-apr-0.0.0.0-xxxxx"]
Jan 21, 2022 7:08:11 AM org.apache.coyote.AbstractProtocol init
INFO: Initializing ProtocolHandler ["https-openssl-apr-0.0.0.0-xxxxx"]
Jan 21, 2022 7:08:11 AM org.apache.catalina.startup.Catalina load
INFO: Server initialization in [463] milliseconds
Jan 21, 2022 7:08:11 AM org.apache.catalina.core.StandardService startInternal
INFO: Starting service [Catalina]
Jan 21, 2022 7:08:11 AM org.apache.catalina.core.StandardEngine startInternal
INFO: Starting Servlet engine: [Apache Tomcat/9.0.56]

We do NOT see this issue logged, by Tomcat, on the same Windows or Linux O/S versions with

DIFF AdoptOpenJDK 11.0.10+9 64bit
DIFF Apache Tomcat 9.0.43
DIFF tomcat-native-1.2.26
DIFF openssl-1.1.1i
SAME apr-1.7.0
SAME apr-util-1.6.1

Reported here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1924298/comments/8
Comment 9 Christopher Schultz 2022-01-25 15:10:52 UTC
(In reply to Brooke T. Hedrick from comment #8)
> We are seeing this error on Windows Server 2019

Uh oh. This is supposed to be a Linux kernel bug.

> Apache Tomcat 9.0.56

> We do NOT see this issue logged, by Tomcat, on the same Windows or Linux O/S
> versions with
> 
> DIFF Apache Tomcat 9.0.43

This check was added in Tomcat 9.0.56. You may want to downgrade to Tomcat 9.0.55  in your Windows environments which, theoretically, should not be susceptible to the bug we are trying to work around.

Are you able to compile+run the C-language test driver which is mentioned above? I'm not sure it will even compile on Windows but it would be interesting to see what happens on a Windows system.
Comment 10 Brooke T. Hedrick 2022-01-25 23:33:56 UTC
Hey Christopher Schultz,

RE>
Are you able to compile+run the C-language test driver which is mentioned above? I'm not sure it will even compile on Windows but it would be interesting to see what happens on a Windows system.



We spent about 90 minutes with VS 2017 and didn't have it running yet.  Issues with libraries not being the same and the "equivalents" not having the same functions.  We don't have strong c/c++ skills and haven't tried to move code between Linux and Windows before.

Would compiling/running this under WSL be a valid test?
Comment 11 Brooke T. Hedrick 2022-01-26 00:00:20 UTC
Not sure if this is useful, but running the test with WSL and Ubuntu 20.04.3 LTS

2 "console windows" running the wrk at the same time.

This is how the server failed
Listening on port 5555
pthread_create: Cannot allocate memory
Comment 12 Mark Thomas 2022-02-07 17:31:12 UTC
There are multiple issues being report here so i will try and address each in turn.

The original issue reported by Johhny Lim
- The log message is correct - you've hit the OS bug
- The issue appears to affect mutliple (most?, all?) Linux distributions
This should be considered resolved as "working as designed".


The issue reported by 2022-01-14 03:39:38 GMT
- Can be avoided by configuration see comment #6
- Modifying the existing check is not practical see comment #7
This should be considered resolved as "Won't fix"


The issue on Windows reported by Brooke T. Hedrick
- The memory allocation error running with WSL (comment #11) indicates a different issue
- I have ported the original test case to Windows and I have not yet been able to recreate the error

I have a few more tests I want to run. I want to try and tweak the test case and the Windows network config to try and increase the load on accept().

I do have a couple of questions regarding the Windows system where this issue was observed:
- Bare metal or VM?
- If VM, which hypervisor?
- What sort of load (in terms of new connections per second) was the system under at the time?

Moving status to NEEDINFO to reflect the status of this one remaining unresolved issue.
Comment 13 Brooke T. Hedrick 2022-02-07 17:50:04 UTC
@Mark
RE: https://bz.apache.org/bugzilla/show_bug.cgi?id=65776#c12


- Bare metal or VM? VM
- If VM, which hypervisor? VMWare ESX 7.x 
- What sort of load (in terms of new connections per second) was the system under at the time?  Very Little.  IIRC, a single request could cause it.
Comment 14 Mark Thomas 2022-02-08 14:09:49 UTC
Thanks for the additional info.

I wasn't able to reproduce the issue on Windows. Combined with your information regarding low load, it looks like you are seeing a false positive.

The good news is that means we aren't seeing the same issue on Windows. The bad news is we need to try and find a way to avoid - or at least reduce - the false positive rate.
Comment 15 Mark Thomas 2022-02-10 14:57:13 UTC
The large drop in throughput I reported in comment #7 has been bothering me so I wanted to look at this again.

Testing this is tricky as you can quickly run out of ephemeral ports. Once that happens, test results become unstable. I think this may have happened when I tested this previously. I definitely saw ephemeral port exhaustion when I tried to re-test this.

By reducing the test to a single client thread and temporarily reducing the period sockets spend in TIME_WAIT I was able to get the test results to stabilise. I was then able to test the following changes:
- skip the check completely on Windows
- add a check for 'duplicates must occur within 1s" to avoid false positives from things like heartbeats on idle systems

The results were very promising. The performance difference was (probably) marginally lower but I'd need to repeat the test quite a few times (and brush up on my stats) to provide a definitive answer.

My engineering estimate view of the results is that the performance impact is negligible so I'll be apply the change shortly.
Comment 16 Mark Thomas 2022-02-10 17:17:38 UTC
Fixed in:
- 10.1.x for 10.1.0-M11 onwards
- 10.0.x for 10.0.17 onwards
- 9.0.x for 9.0.59 onwards
- 8.5.x for 8.5.76 onwards