Bug 65776 - "Duplicate accept detected" error from a subsequent request with the same local port
Summary: "Duplicate accept detected" error from a subsequent request with the same loc...
Status: NEW
Alias: None
Product: Tomcat 9
Classification: Unclassified
Component: Connectors
Version: 9.0.56
Hardware: All
OS: All
Importance: P2 normal
Target Milestone: -----
Assignee: Tomcat Developers Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-12-31 09:25 UTC by Johnny Lim
Modified: 2022-01-14 22:23 UTC
CC List: 0 users



Attachments

Description Johnny Lim 2021-12-31 09:25:36 UTC
"Duplicate accept detected" error might happen if a client reuses its local port somehow.

This is a synthetic test to demonstrate it: https://github.com/izeye/spring-boot-throwaway-branches/blob/tomcat-duplicate-accept-detected/src/test/java/com/izeye/throwaway/DuplicateAcceptDetectedTests.java

This is happening in my production environment, which runs on CentOS in a Kubernetes cluster. Although I haven't yet gotten to the bottom of why the same local port is being reused, it seems this check might prevent Tomcat from accepting valid requests from a client.
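
For reference, here is a rough Java sketch of the idea behind the synthetic test linked above (not the actual test): bind the client socket to a fixed local port and send two requests, so the second connection arrives from the same client IP/port. The host, port numbers and request line are placeholder assumptions.

```
import java.net.InetSocketAddress;
import java.net.Socket;

public class SameLocalPortClient {

    public static void main(String[] args) throws Exception {
        InetSocketAddress server = new InetSocketAddress("localhost", 8080);
        int fixedLocalPort = 50000;

        for (int i = 0; i < 2; i++) {
            try (Socket socket = new Socket()) {
                // Allow re-binding the same local port for the second request.
                socket.setReuseAddress(true);
                socket.bind(new InetSocketAddress(fixedLocalPort));
                socket.connect(server);
                socket.getOutputStream().write(
                        ("GET / HTTP/1.1\r\nHost: localhost\r\n"
                                + "Connection: close\r\n\r\n").getBytes());
                // Drain the response; the server closes the connection.
                socket.getInputStream().readAllBytes();
            }
            // The second iteration connects again from the same client IP/port,
            // which is what trips the "Duplicate accept detected" check.
        }
    }
}
```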
Comment 1 Mark Thomas 2021-12-31 12:38:13 UTC
This scenario was considered when designing the protection for the OS bug. The solution considered was to add a timing check for the re-use, since in the OS bug case the re-use is near enough instant. However, we didn't add the check because we could not see a scenario where:
- a client connected to Tomcat
- no other clients connected
- the same client reconnected using the same local port
and we wanted to avoid the performance overhead of the check.

It appears that there is something about your production environment where the above sequence is happening. Are you sure the client is genuinely re-using the local port rather than the server hitting the OS bug? We believe the bug affects multiple Linux distributions.
Comment 2 Johnny Lim 2021-12-31 13:29:38 UTC
Thanks for the quick feedback!

I just assumed it was a Ubuntu-specific bug since it was reported against the Ubuntu issue tracker. I haven't had time to look into it closely yet, so that was just one possible guess based on that assumption.

I'll let you know if anything meaningful is identified.
Comment 3 Mark Thomas 2021-12-31 14:11:48 UTC
Moving to NEEDINFO pending further updates.
Comment 4 Mark Thomas 2022-01-02 13:10:27 UTC
The simplest thing to do is to run the pure C test case provided in the Java bug report:

https://bugs.openjdk.java.net/browse/JDK-8263243?focusedCommentId=14410275&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14410275

If you see the error then raise a bug with CentOS. I'd recommend including a link to at least the Java bug report for background.
Comment 5 Sunwoo 2022-01-14 03:39:38 UTC
I'm sorry, I'm not good at English.
Please read it carefully. :)

-----
I found a situation where a normal connection is attempted from the same client IP/port.

Under load, with the kernel configuration below, if the randomly chosen starting port falls in 32768-49999 (which happens with high probability), port 50000 is likely to be allocated over and over.

- ip_local_port_range = 32768 - 60000
- ip_local_reserved_ports = 30000-49999

# find client port - simplified kernel pseudocode
```
port = random(in ip_local_port_range)
while port <= max(ip_local_port_range):
    if port not in ip_local_reserved_ports and port not in used_ports:
        return port
    port = port + 1
return not found
```


- kernel 3.10
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/net/ipv4/inet_connection_sock.c?h=linux-3.10.y#n104

```
smallest_rover = rover = net_random() % remaining + low;
```

- kernel 4.19
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/net/ipv4/inet_connection_sock.c?h=linux-4.19.y#n182
```
offset = prandom_u32() % remaining;
```


With this configuration, the problem occurs because the range of ip_local_reserved_ports is too wide. Even though the port was not chosen with the randomization recommended by RFC 6056, that does not mean the connection itself is invalid.
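
To make this concrete, here is a small Java simulation of the simplified pseudocode above using these sysctl values (an illustration only, not kernel code):

```
import java.util.concurrent.ThreadLocalRandom;

public class PortSelectionSimulation {

    // Values taken from the sysctls quoted above.
    static final int RANGE_LOW = 32768;
    static final int RANGE_HIGH = 60000;     // ip_local_port_range
    static final int RESERVED_LOW = 30000;
    static final int RESERVED_HIGH = 49999;  // ip_local_reserved_ports

    // Mirrors the simplified pseudocode: start at a random port and walk
    // forward, skipping reserved ports (all ports are assumed otherwise free).
    static int pickPort() {
        int port = ThreadLocalRandom.current().nextInt(RANGE_LOW, RANGE_HIGH + 1);
        while (port <= RANGE_HIGH) {
            if (port < RESERVED_LOW || port > RESERVED_HIGH) {
                return port;
            }
            port++;
        }
        return -1; // no port found
    }

    public static void main(String[] args) {
        int hits = 0;
        for (int i = 0; i < 100_000; i++) {
            if (pickPort() == 50000) {
                hits++;
            }
        }
        // Every random start in 32768-50000 ends up at 50000, so well over
        // half of all selections collapse onto that single port.
        System.out.printf("50000 selected %d out of 100000 times%n", hits);
    }
}
```

Because every random starting point below 50000 walks forward onto 50000, a client on such a host is very likely to reuse that local port as soon as earlier connections have released it.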

I experienced this problem with the k8s readiness/liveness probe requests; eventually the service was marked UNREADY and became unavailable.

The need to work around the Ubuntu bug is understandable, but the current code is risky and has side effects.

Therefore, it appears that the change needs to be rolled back, made optional, or better guarded.

https://github.com/apache/tomcat/commit/d03cfcf3b0d6639acb2884f1bbea5f2f29b95d91

I hope for a positive review.
Comment 6 Mark Thomas 2022-01-14 19:58:19 UTC
That should only be an issue if:
- there are no other connections to the server between liveness checks
- the liveness checks are >= time_wait seconds apart

Increasing the frequency of the liveness checks should be a valid workaround in the rare cases this is an issue.

Meanwhile, I'll look at adding a "time since last accept" check to the test. When the error occurs it is almost instant so something like less than a second should work.
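
A minimal sketch of such a check (the class and method names and the one-second threshold are assumptions for illustration, not actual Tomcat code):

```
import java.net.InetSocketAddress;
import java.util.concurrent.TimeUnit;

class DuplicateAcceptGuard {

    private InetSocketAddress previousRemote;
    private long previousAcceptNanos;

    // Flag a duplicate only when the same remote IP/port is accepted again
    // almost immediately, which is the pattern seen with the OS bug.
    synchronized boolean isSuspectDuplicate(InetSocketAddress remote) {
        long now = System.nanoTime();
        boolean duplicate = remote.equals(previousRemote)
                && now - previousAcceptNanos < TimeUnit.SECONDS.toNanos(1);
        previousRemote = remote;
        previousAcceptNanos = now;
        return duplicate;
    }
}
```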
Comment 7 Mark Thomas 2022-01-14 22:23:06 UTC
Checking the time since the last accept adds a significant overhead. With a simple JMeter test with keep-alive disabled, throughput dropped by 75% once I added the timing check. That level of overhead isn't acceptable.

If you are seeing what you believe to be a false positive warning with a liveness check, reducing the time between checks should fix the issue.