Bug 66615 - httpd kills keepalive connections when idle workers available
Summary: httpd kills keepalive connections when idle workers available
Status: NEW
Alias: None
Product: Apache httpd-2
Classification: Unclassified
Component: mpm_event
Version: 2.4.37
Hardware: PC Linux
Importance: P2 regression
Target Milestone: ---
Assignee: Apache HTTPD Bugs Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-05-26 10:47 UTC by mkempski
Modified: 2023-05-30 10:47 UTC

Description mkempski 2023-05-26 10:47:18 UTC
I have two identical VMs - 16 GB RAM, 16 vCPUs. One is a fresh CentOS 7 install, the other a fresh Rocky 8 install. I installed httpd (on CentOS 7 it's version 2.4.6 and on Rocky 8 it's 2.4.37), configured both to point to the same static default html file, and enabled mpm_event on CentOS 7 (mpm_event is the default on Rocky 8). Then I added the following options to the default config on both servers:
```
<IfModule mpm_event_module>
ThreadsPerChild 25
StartServers 3
ServerLimit 120
MinSpareThreads 75
MaxSpareThreads 3000
MaxRequestWorkers 3000
MaxConnectionsPerChild 0
</IfModule>
```
After this was done I performed ab tests with keepalive from a different CentOS 7 VM on the same local network. On CentOS 7 I am able to complete 1 million requests at 1000 concurrent connections with little to no errors; however, with version 2.4.37 on Rocky 8 I get a lot of failed requests due to length and exceptions. The served content is static, so I am assuming this is because keepalive connections are closed by the server.
This problem occurs only when using keepalive: there are no errors when running ab without the -k option, although throughput is lower. I can replicate this issue on the newest httpd built from source (2.4.57).
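For reference, the keepalive run was essentially the following ab invocation (a sketch; the target host is a placeholder for my test server):
```
ab -k -c 1000 -n 1000000 http://<rocky8-server>/
```
The non-keepalive runs were the same command without -k.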

Here are example logs with trace1 enabled for mpm_event on Apache 2.4.37 during the test:
```
[Tue May 23 08:21:24.206092 2023] [mpm_event:trace1] [pid 2123:tid 140575713961728] event.c(1583): Idle workers: 22
[Tue May 23 08:21:24.206300 2023] [mpm_event:debug] [pid 2291:tid 140575713961728] event.c(1580): Too many open connections (72), not accepting new conns in this process
[Tue May 23 08:21:24.206303 2023] [mpm_event:trace1] [pid 2291:tid 140575713961728] event.c(1583): Idle workers: 23
[Tue May 23 08:21:24.214594 2023] [mpm_event:debug] [pid 2402:tid 140575713961728] event.c(386): AH00457: Accepting new connections again: 71 active conns (1 lingering/0 clogged/0 suspended), 24 idle workers
[Tue May 23 08:21:24.214651 2023] [mpm_event:debug] [pid 2402:tid 140575713961728] event.c(1580): Too many open connections (71), not accepting new conns in this process
[Tue May 23 08:21:24.214657 2023] [mpm_event:trace1] [pid 2402:tid 140575713961728] event.c(1583): Idle workers: 17
[Tue May 23 08:21:24.224628 2023] [mpm_event:debug] [pid 2402:tid 140575713961728] event.c(386): AH00457: Accepting new connections again: 71 active conns (1 lingering/0 clogged/0 suspended), 24 idle workers
[Tue May 23 08:21:24.224677 2023] [mpm_event:debug] [pid 2402:tid 140575713961728] event.c(1580): Too many open connections (71), not accepting new conns in this process
[Tue May 23 08:21:24.224681 2023] [mpm_event:trace1] [pid 2402:tid 140575713961728] event.c(1583): Idle workers: 18
[Tue May 23 08:21:24.224986 2023] [mpm_event:debug] [pid 2402:tid 140575713961728] event.c(386): AH00457: Accepting new connections again: 70 active conns (2 lingering/0 clogged/0 suspended), 23 idle workers
[Tue May 23 08:21:24.225018 2023] [mpm_event:debug] [pid 2402:tid 140575713961728] event.c(1580): Too many open connections (70), not accepting new conns in this process
[Tue May 23 08:21:24.225024 2023] [mpm_event:trace1] [pid 2402:tid 140575713961728] event.c(1583): Idle workers: 19
[Tue May 23 08:21:24.227927 2023] [mpm_event:debug] [pid 2121:tid 140575713961728] event.c(386): AH00457: Accepting new connections again: 73 active conns (4 lingering/0 clogged/0 suspended), 25 idle workers
[Tue May 23 08:21:24.227978 2023] [mpm_event:debug] [pid 2121:tid 140575713961728] event.c(1580): Too many open connections (73), not accepting new conns in this process
[Tue May 23 08:21:24.227982 2023] [mpm_event:trace1] [pid 2121:tid 140575713961728] event.c(1583): Idle workers: 21
[Tue May 23 08:21:24.233929 2023] [mpm_event:debug] [pid 2402:tid 140575713961728] event.c(386): AH00457: Accepting new connections again: 70 active conns (2 lingering/0 clogged/0 suspended), 24 idle workers
[Tue May 23 08:21:24.233981 2023] [mpm_event:debug] [pid 2402:tid 140575713961728] event.c(1580): Too many open connections (70), not accepting new conns in this process
[Tue May 23 08:21:24.233987 2023] [mpm_event:trace1] [pid 2402:tid 140575713961728] event.c(1583): Idle workers: 21
[Tue May 23 08:21:24.234230 2023] [mpm_event:debug] [pid 2402:tid 140575713961728] event.c(386): AH00457: Accepting new connections again: 72 active conns (2 lingering/0 clogged/0 suspended), 24 idle workers
[Tue May 23 08:21:24.234247 2023] [mpm_event:debug] [pid 2402:tid 140575713961728] event.c(1580): Too many open connections (72), not accepting new conns in this process
[Tue May 23 08:21:24.234250 2023] [mpm_event:trace1] [pid 2402:tid 140575713961728] event.c(1583): Idle workers: 22
[Tue May 23 08:21:24.234601 2023] [mpm_event:debug] [pid 2402:tid 140575713961728] event.c(386): AH00457: Accepting new connections again: 70 active conns (0 lingering/0 clogged/0 suspended), 24 idle workers
[Tue May 23 08:21:24.234618 2023] [mpm_event:debug] [pid 2402:tid 140575713961728] event.c(1580): Too many open connections (70), not accepting new conns in this process
```

I can see two problems during my tests:

1. httpd does not add enough server processes while the test is running. It kills keepalive connections and logs "all workers busy or dying", but adds only up to 25 workers with the config mentioned above.

2. httpd seems to not register that it has free workers even when I set StartServers to 120. There are thousands of idle threads, yet it still logs "all workers busy or dying" and kills keepalive connections. This can be worked around by setting ThreadsPerChild and ThreadLimit much higher and lowering StartServers/ServerLimit accordingly. For example, with the following settings I can easily process over 1500 concurrent connections without errors or keepalive killing:
```
<IfModule mpm_event_module>
ThreadsPerChild 200
ThreadLimit 200
StartServers 10
ServerLimit 15
MinSpareThreads 75
MaxSpareThreads 3000
MaxRequestWorkers 3000
MaxConnectionsPerChild 0
</IfModule>
```
If I am understanding this correctly, it works like this: each server process (StartServers/ServerLimit) has workers (ThreadsPerChild) and gets connections from its listener. When ThreadsPerChild is low and the connection rate is high, the listener frequently overfills the workers, which in turn causes the process to kill its keepalive connections to free some workers up. When ThreadsPerChild is set higher, we can reach a much higher number of concurrent connections before hitting this problem. Please correct me if I am wrong.
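If I read the event MPM documentation correctly, each child process also stops accepting new connections once it has about ThreadsPerChild + (AsyncRequestWorkerFactor * number of idle workers) connections open. With ThreadsPerChild 25 and the default AsyncRequestWorkerFactor of 2, a child with ~24 idle workers would hit that ceiling at roughly 25 + 2 * 24 = 73 connections, which matches the "Too many open connections (72/73)" messages in the trace above.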
Comment 1 mkempski 2023-05-26 10:50:10 UTC
Thread in the mailing list with additional information:
https://lists.apache.org/thread/156tzy45crd0kqsr988rn4hvdcvm6skc
Comment 2 Eric Covener 2023-05-26 11:30:05 UTC
> httpd seems to not register that it has free workers even when I set StartServers to 120

StartServers only controls how many processes are created at startup. Every second, MinSpareThreads/MaxSpareThreads are checked to determine how many processes need to be stopped or started.

The problem on your system seems to be that additional processes aren't being spun up.
Comment 3 Ruediger Pluem 2023-05-30 06:31:46 UTC
Did you have a look at https://httpd.apache.org/docs/2.4/mod/event.html#asyncrequestworkerfactor ?
Comment 4 mkempski 2023-05-30 10:47:57 UTC
(In reply to Eric Covener from comment #2)
> > httpd seems to not register that it has free workers even when I set StartServers to 120
> 
> StartServers only controls how many processes are created at startup. Every
> second, MinSpareThreads/MaxSpareThreads are checked to determine how many
> processes need to be stopped or started.
> 
> The problem on your system seems to be that additional processes aren't
> being spun up.

Yes, it does not spawn enough processes. Even if I force it to start with the maximum values, the problem persists. Only raising ThreadsPerChild to a much higher number and starting with enough server processes to handle the peak has an effect. What is the reason for this behavior?

(In reply to Ruediger Pluem from comment #3)
> Did you have a look at
> https://httpd.apache.org/docs/2.4/mod/event.html#asyncrequestworkerfactor ?

Yes, changing this setting to values higher or lower than the default does not resolve the problem.