Bug 65769 - Child processes fail to start
Summary: Child processes fail to start
Status: RESOLVED FIXED
Alias: None
Product: Apache httpd-2
Classification: Unclassified
Component: Core
Version: 2.4.52
Hardware: PC Linux
Importance: P2 major with 9 votes
Target Milestone: ---
Assignee: Apache HTTPD Bugs Mailing List
URL:
Keywords: FixedInTrunk
Duplicates: 65803
Depends on:
Blocks:
 
Reported: 2021-12-26 22:59 UTC by asf
Modified: 2022-06-01 11:34 UTC
CC List: 5 users



Attachments
error log 27122021 (5.11 KB, text/plain)
2021-12-27 17:40 UTC, ldeliso
Use _exit() in clean_child_exit() (385 bytes, patch)
2021-12-27 19:39 UTC, Yann Ylavic
apache installation details (1.65 KB, text/plain)
2021-12-28 10:09 UTC, ldeliso
server_main_loop to decrement active_daemons for exited children with quiescing == 1 too (2.41 KB, patch)
2021-12-28 11:18 UTC, Yann Ylavic
Same without debug traces (626 bytes, patch)
2021-12-28 11:24 UTC, Yann Ylavic
error_log_28122021 (23.46 KB, text/plain)
2021-12-28 12:22 UTC, ldeliso
Active daemons accounting fix plus idle worker waking up the listener when listeners disabled (4.02 KB, patch)
2021-12-28 17:23 UTC, Yann Ylavic

Description asf 2021-12-26 22:59:45 UTC
Starting with Apache 2.4.52, the child processes for Apache fail to start after a period of time or after a certain number of requests.

It's hard to isolate exactly how to replicate the issue.

But after a period of time, Apache will stop accepting new connections until the parent Apache process is eventually restarted.  Requests for new web pages will just hang.

Seems to be affecting mostly servers using the event MPM.

Might be related to bug 65592 - https://bz.apache.org/bugzilla/show_bug.cgi?id=65592 - in that a similar issue was introduced in Apache 2.4.49.

In this particular case reverting back to Apache 2.4.51 resolves the issue.

Nothing is reported in the Apache error_log when this hanging event occurs.

Web hosting service providers using DirectAdmin and cPanel are reporting this issue.
Comment 1 ldeliso 2021-12-27 11:36:24 UTC
I experienced this issue as well (in all cases I was using the event MPM): apache stops accepting any connection because no listener thread is running. No events are reported in the error log and downgrading to apache 2.4.51 "fixes" this issue.


I'm able to reproduce this with this config (test case provided from bug 65592):

=====================
<IfModule mpm_event_module>
    StartServers             1
    MinSpareThreads          1
    MaxSpareThreads          1
    ThreadsPerChild          1
    MaxRequestWorkers        1
    MaxConnectionsPerChild   1
</IfModule>
=====================

I'll try to provide a trace later.
Comment 2 Yann Ylavic 2021-12-27 15:24:11 UTC
Please add "LogLevel mpm_event:trace6" in the global configuration (after the other LogLevel, if any) and attach the error_log file here.

I tried to reproduce with ab and the default configuration (also with the all-1s configuration above) but couldn't. httpd stops serving requests temporarily when its MaxRequestWorkers is saturated, but it always recovers as connections terminate.

What are your actual MPM event settings (ServerLimit, MinSpareThreads, MaxSpareThreads, ThreadsPerChild, MaxRequestWorkers, MaxConnectionsPerChild, ...)?
Comment 3 Yann Ylavic 2021-12-27 15:37:14 UTC
(In reply to Yann Ylavic from comment #2)
> Please add "LogLevel mpm_event:trace6" in the global configuration

"LogLevel mpm_event:trace5" should be enough and much less verbose.
Comment 4 ldeliso 2021-12-27 17:40:16 UTC
Created attachment 38143 [details]
error log 27122021

Below you can find the logs after increasing the debug level. The configuration is the one I previously reported, although the issue tends to happen more readily with a lower value for "MaxConnectionsPerChild". I'm not using ab here, just sending 3-4 requests with curl after starting up httpd.

What I also noticed is this piece of the log repeating in a loop (until I stopped httpd):

+++++++++++++++++++++++++++++++++++++++++++++
[Mon Dec 27 18:09:26.412311 2021] [mpm_event:trace1] [pid 23746:tid 47989522396224] event.c(2952): server is at active daemons limit, spawning of 1 children cancelled: 1/1 active, rate 32

[Mon Dec 27 18:09:26.412349 2021] [mpm_event:info] [pid 23746:tid 47989522396224] AH00486: server seems busy, (you may need to increase StartServers, ThreadsPerChild or Min/MaxSpareThreads), spawning 0 children, there are around 0 idle threads, 1 active children, and 0 children that are shutting down
+++++++++++++++++++++++++++++++++++++++++++++



In another case (leading to the same situation) no events were logged in a loop, but after logging "Too many open connections (0) ..." no connections were served anymore:

+++++++++++++++++++++++++++++++++++++++++++++
[Mon Dec 27 17:54:05.842996 2021] [mpm_event:trace1] [pid 23211:tid 47140081620736] event.c(1217): Stopping process due to MaxConnectionsPerChild
[Mon Dec 27 17:54:05.843074 2021] [mpm_event:trace1] [pid 23211:tid 47139822519360] event.c(2683): graceful termination, workers joined, exiting
[Mon Dec 27 17:54:06.894748 2021] [mpm_event:trace5] [pid 23164:tid 47139822519360] event.c(2972): Spawning new child: slot 0 active / total daemons: 0/0
[Mon Dec 27 17:54:06.900350 2021] [mpm_event:debug] [pid 23218:tid 47140077418240] event.c(2367): AH02471: start_threads: Using epoll (wakeable)
[Mon Dec 27 17:54:06.900545 2021] [mpm_event:debug] [pid 23218:tid 47140081620736] event.c(1864): Too many open connections (0), not accepting new conns in this process
[Mon Dec 27 17:54:06.900563 2021] [mpm_event:trace1] [pid 23218:tid 47140081620736] event.c(1867): Idle workers: 0
+++++++++++++++++++++++++++++++++++++++++++++


In both cases I couldn't find any running listening thread to serve the requests.
Comment 5 Eric Covener 2021-12-27 18:31:50 UTC
> In both cases I couldn't find any running listening thread to serve the requests.

I wonder if you can find processes that appear to hang during termination (like in an "atexit" kind of handler)? E.g. for the uploaded log, did you see 23794 in ps output, and if so, what does pstack or gdb say about what it's doing?

>  [Mon Dec 27 18:05:08.139951 2021] [mpm_event:trace1] [pid 23794:tid 47989522396224] event.c(2683): graceful termination, workers joined, exiting


Can you share apachectl -M output too, in case there is some pattern or module that stands out?
Comment 6 Yann Ylavic 2021-12-27 19:39:11 UTC
Created attachment 38144 [details]
Use _exit() in clean_child_exit()

If the issue is of the atexit() kind, this patch should help too, does it?
Comment 7 ldeliso 2021-12-28 10:09:09 UTC
Created attachment 38145 [details]
apache installation details

@Eric: when the child process hangs, it's immediately terminated, so no luck in tracing it. Sometimes a new child process is spawned (yet no threads are there to accept requests) and it's stuck in this state (though this has been observed on working child processes too, when the issue does not arise):

++++++++++++++++++++++++++++++++++++++++++++++
#0  0x00007fca258239e4 in read () from /lib64/libpthread.so.0
#1  0x0000000000460f84 in ap_mpm_podx_check (pod=<optimized out>) at mpm_unix.c:546
#2  0x000000000047338f in child_main (child_num_arg=child_num_arg@entry=1, child_bucket=child_bucket@entry=0) at event.c:2647
#3  0x000000000047373a in make_child (s=0x15f9e48, slot=slot@entry=1, bucket=bucket@entry=0) at event.c:2754
#4  0x000000000047420a in perform_idle_server_maintenance (child_bucket=0) at event.c:2973
#5  server_main_loop (remaining_children_to_start=0) at event.c:3103
#6  event_run (_pconf=<optimized out>, plog=<optimized out>, s=<optimized out>) at event.c:3180
#7  0x000000000043774e in ap_run_mpm (pconf=pconf@entry=0x15ce3c8, plog=0x15fb608, s=0x15f9e48) at mpm_common.c:95
#8  0x000000000042fd0e in main (argc=<optimized out>, argv=<optimized out>) at main.c:819
++++++++++++++++++++++++++++++++++++++++++++++

This server runs RHEL 8 (but also tried on CentOS 7 / Ubuntu 21.10) and a very basic apache compiled from source. I have attached the details.

I did some tests with the worker MPM and this issue is not present. All works as expected. So this issue is only present with the event MPM.


@Yann: I applied your patch but unfortunately it does not fix this issue.
Comment 8 Yann Ylavic 2021-12-28 11:18:30 UTC
Created attachment 38146 [details]
server_main_loop to decrement active_daemons for exited children with quiescing == 1 too

Could you please try this new patch?

Besides the debug traces which should show what's happening, the change is mainly that server_main_loop will now count *all* exited children (that have not been caught by perform_idle_server_maintenance) as inactive.
Comment 9 Yann Ylavic 2021-12-28 11:24:51 UTC
Created attachment 38147 [details]
Same without debug traces
Comment 10 ldeliso 2021-12-28 12:22:59 UTC
Created attachment 38148 [details]
error_log_28122021

I applied the patch and can still reproduce the issue. The server ceased to accept requests just after this event (though as you can see in the attached logs a new child process is spawned and remains in execution):


[Tue Dec 28 12:55:24.703419 2021] [mpm_event:debug] [pid 2687016:tid 139917791557376] event.c(1864): Too many open connections (0), not accepting new conns in this process

Curiously enough: if I run pstack or attach gdb to this PID (in this log it's 2687016), it logs the following and then accepts requests again:

+++++++++++++++++++++++++++++++++++++++++++++++
[Tue Dec 28 13:14:21.609592 2021] [mpm_event:debug] [pid 2687016:tid 139921957689088] event.c(496): AH00457: Accepting new connections again: 0 active conns (0 lingering/0 clogged/0 suspended), 1 idle workers
[Tue Dec 28 13:14:21.609822 2021] [mpm_event:trace1] [pid 2687016:tid 139921957689088] event.c(1217): Stopping process due to MaxConnectionsPerChild
[Tue Dec 28 13:14:21.609829 2021] [mpm_event:debug] [pid 2687016:tid 139921957689088] event.c(571): wake up listener
[Tue Dec 28 13:14:21.610740 2021] [mpm_event:trace1] [pid 2687016:tid 139922101746560] event.c(2679): graceful termination received, joining workers
[Tue Dec 28 13:14:21.610756 2021] [mpm_event:trace1] [pid 2687016:tid 139922101746560] event.c(2683): graceful termination, workers joined, exiting
+++++++++++++++++++++++++++++++++++++++++++++++

Then, after the next request, it ceases to accept new requests unless I do one of the following things again:

- apachectl graceful
OR
- apachectl stop / apachectl start
OR
- run pstack or gdb on the running child process reported in the logs for the event "Too many open connections"
Comment 11 Yann Ylavic 2021-12-28 17:23:26 UTC
Created attachment 38149 [details]
Active daemons accounting fix plus idle worker waking up the listener when listeners disabled

It seems that no worker is idle yet when the listener handles its first connection, and nothing recovers after the listening socket is disabled.

Could you please try this patch? It's the same as attachment 38146 [details] plus the above case hopefully addressed.
Comment 12 asf 2021-12-28 18:04:10 UTC
There were a lot of changes to server/mpm/event/event.c from Apache 2.4.51 to Apache 2.4.52 - are you sure none of those changes is causing this issue?

All setups and configurations seem to work fine under Apache 2.4.51.

But upgrading to Apache 2.4.52 - and no configuration changes - causes this issue to appear.  That's got to mean something.
Comment 13 Yann Ylavic 2021-12-28 18:52:54 UTC
(In reply to asf from comment #12)
> There were a lot of changes to server/mpm/event/event.c from Apache 2.4.51
> to Apache 2.4.52 - are you sure none of those changes is causing this issue?

Sure, it's one of those changes, namely the active_daemons accounting, which moved from a global variable reset on restart to a variable retained/reused across restarts.
So the difference is that on reload httpd will now accurately account for the number of active vs dying children, whereas in 2.4.51 and earlier it could have created more children than allowed, causing several "scoreboard full" kinds of issues.

Actually, the issues we are trying to fix now were already there in previous versions but masked by the incorrect accounting of active_daemons. I think the same thing can happen in 2.4.51, it's just less likely.
Comment 14 ldeliso 2021-12-28 19:27:50 UTC
Yann: Yes, I confirm that the last attached patch fixes the issue.
Comment 15 Yann Ylavic 2021-12-29 13:18:54 UTC
Thanks for testing, now in r1896505.
Comment 16 Yann Ylavic 2022-01-16 12:37:20 UTC
*** Bug 65803 has been marked as a duplicate of this bug. ***
Comment 17 Jörg Meltzer 2022-02-08 09:29:47 UTC
Hi,
this is affecting us in production.
Do we know already when there is a next release?
Comment 18 Stefan Eissing 2022-02-08 12:33:43 UTC
It seems highly likely that we do a release this month.
Comment 19 Szymon Żogała 2022-03-01 14:18:59 UTC
Hi, is there any news about the next release?
Comment 20 Stefan Eissing 2022-03-03 11:12:58 UTC
I plan to make a release candidate on Monday, March 7th. Provided that all goes well, it should come out on Thursday, March 10th.
Comment 21 Luke-Jr 2022-03-07 06:28:51 UTC
FWIW, this has been happening to me also every few days.

Looking at the changes in r1896505, I suspect it may not fix it, because if I attach during the problem condition:

(gdb) print listensocks_disabled
$2 = 0
Comment 22 Yann Ylavic 2022-03-07 11:46:49 UTC
(In reply to Luke-Jr from comment #21)
> 
> Looking at the changes in r1896505, I suspect it may not fix it, because if
> I attach during the problem condition:
> 
> (gdb) print listensocks_disabled
> $2 = 0

What does:
(gdb) thread apply all bt full
say?
Comment 23 Yann Ylavic 2022-03-07 11:47:35 UTC
Backported to 2.4.x (r1897149), will be in the next release.
Comment 24 Luke-Jr 2022-03-07 19:07:45 UTC
(In reply to Yann Ylavic from comment #22)
> (In reply to Luke-Jr from comment #21)
> > 
> > Looking at the changes in r1896505, I suspect it may not fix it, because if
> > I attach during the problem condition:
> > 
> > (gdb) print listensocks_disabled
> > $2 = 0
> 
> What does:
> (gdb) thread apply all bt full
> say?

My website was down, so I had to restart it. But from memory, strace showed it in a poll loop that included none of the listening sockets (identified using lsof). A simple backtrace was inside an apr function, possibly the one used for the polling.
Comment 25 Teodor Milkov 2022-03-16 16:39:57 UTC
Anybody still having this issue?

Yesterday I upgraded from 2.4.51 to 2.4.53 and started experiencing similar symptoms. Once in a while httpd just ceases to accept new connections. On a fleet of about 1000 web servers it occurred about 8 times in 24 hours, so it's quite rare and I don't know how to reproduce it. I'll try to strace it next time.
Comment 26 Yann Ylavic 2022-03-16 20:36:08 UTC
If it reproduces could you please:

$ gdb /path/to/httpd -p <pid>
(gdb) set logging on
(gdb) set logging file /tmp/httpd-backtrace.log
(gdb) thread apply all bt full

and attach "httpd-backtrace.log" here?
(<pid> is the process id of the httpd child in this non-responsive state)
Comment 27 Teodor Milkov 2022-03-21 13:11:58 UTC
(In reply to Yann Ylavic from comment #26)
> If it reproduces could you please:
> ...

We were in a hurry to get something working, so we reverted to 2.4.51 + mod_h2 2.0.2. We may try 2.4.53 again on a limited set of servers soon and be ready to collect the debug information.
Comment 28 Ruediger Pluem 2022-03-22 07:08:19 UTC
(In reply to Yann Ylavic from comment #26)
> If it reproduces could you please:
> 
> $ gdb /path/to/httpd -p <pid>
> (gdb) set logging on
> (gdb) set logging file /tmp/httpd-backtrace.log
> (gdb) thread apply all bt full
> 
> and attach "httpd-backtrace.log" here?
> (<pid> is the process id of the httpd child in this non-responsive state)

As it seems hard to reproduce I guess it would make sense to "save" such a process via

gcore <pid>

in a core dump first. This would enable us to ask for further debugging steps later if the analysis of the stacktraces creates demand for it.
Comment 29 Teodor Milkov 2022-03-31 16:41:11 UTC
I finally managed to run a debug version and save a dump with gcore.

Here's how process list looks usually on this server:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     50551  0.0  0.9 119520 57908 ?        Ss   Mar30   0:23 /apache/bin/httpd -k start
apache   55583  0.0  0.4 118344 26592 ?        S    12:15   0:00  \_ /apache/bin/httpd -k start
apache   55584  0.0  0.4 119520 26948 ?        S    12:15   0:00  \_ /apache/bin/httpd -k start
apache   55585  5.2  0.8 9374960 49764 ?       Sl   12:15   0:00  \_ /apache/bin/httpd -k start
apache   56594  1.3  0.6 9331520 42000 ?       Sl   12:15   0:00  \_ /apache/bin/httpd -k start

And here's how it looked when stuck:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     13105  0.0  0.9 119384 57524 ?        Ss   03:08   0:02 /apache/bin/httpd -k start
apache   28298  0.0  0.4 118216 27392 ?        S    06:00   0:00  \_ /apache/bin/httpd -k start
apache   28299  0.0  0.4 119376 26624 ?        S    06:00   0:00  \_ /apache/bin/httpd -k start

So, there are: 1. the master (root) process, 2. two "thin" processes, which probably take care of CGI (we use suexec), and 3. two "fat" processes, which are the mpm_event workers.

Here's bt of the root process. Please let me know if I should do something else:

gdb ./httpd-2.4.53-debug core.13105

(gdb) bt full
#0  0x000072a77d0b7a27 in __GI___select (nfds=0, readfds=0x0, writefds=0x0, exceptfds=0x0, timeout=0x7b4d48201f70) at ../sysdeps/unix/sysv/linux/select.c:41
        resultvar = 18446744073709551102
        sc_ret = <optimized out>
#1  0x000072a77d1d3535 in apr_sleep () from /usr/lib/x86_64-linux-gnu/libapr-1.so.0
No symbol table info available.
#2  0x00001e99ee736b41 in ap_wait_or_timeout (status=0x7b4d48202024, exitcode=0x7b4d48202020, ret=0x7b4d48202000, p=0x72a77d8c4028, s=0x72a77cd6cb48) at mpm_common.c:205
        rv = 70006
#3  0x00001e99ee81d8d9 in server_main_loop (remaining_children_to_start=0) at event.c:3026
        num_buckets = 1
        child_slot = 0
        exitwhy = APR_PROC_SIGNAL
        status = 11
        processed_status = 0
        pid = {pid = -1, in = 0x1e99ee769e1e <ap_log_command_line+327>, out = 0x1e99ee82b6dd, err = 0x72a778d487e8}
        i = 1
#4  0x00001e99ee81df29 in event_run (_pconf=0x72a77d8c4028, plog=0x72a77cd5f028, s=0x72a77cd6cb48) at event.c:3204
        num_buckets = 1
        remaining_children_to_start = 0
        i = 0
#5  0x00001e99ee735f4d in ap_run_mpm (pconf=0x72a77d8c4028, plog=0x72a77cd5f028, s=0x72a77cd6cb48) at mpm_common.c:95
        pHook = 0x72a77a9082d0
        n = 0
        rv = -1
#6  0x00001e99ee72c0cd in main (argc=3, argv=0x7b4d48202308) at main.c:841
        c = 0 '\000'
        showcompile = 0
        showdirectives = 0
        confname = 0x1e99ee822d43 "conf/httpd.conf"
        def_server_root = 0x1e99ee822d53 "/apache"
        temp_error_log = 0x0
        error = 0x0
        process = 0x72a77d8c6118
        pconf = 0x72a77d8c4028
        plog = 0x72a77cd5f028
        ptemp = 0x72a77cd57028
        pcommands = 0x72a77cd79028
        opt = 0x72a77cd79118
        rv = 0
        mod = 0x1e99ee86d9e8 <ap_prelinked_modules+328>
        opt_arg = 0x7b4d48202170 "\250\002\215}\247r"
        signal_server = 0x1e99ee77446e <ap_signal_server>
        rc = 0

Looking at this backtrace I noticed it mentions signal 11 and looking in the error_log there's indeed this:

[Wed Mar 30 06:04:37 2022] [notice] [pid 13105] mpm_unix.c(436): AH00052: child pid 28302 exit signal Segmentation fault (11)

Now looking across error logs of all our servers it's not unusual to see this segfault on a daily basis. Apparently, 2.4.51 handles the child crashing gracefully, while 2.4.53 gets stuck.

It's another question why the segfault happens, and I'm going to investigate that too, but IMO it'd be great if we could get back to the graceful behaviour from before 2.4.53.
Comment 30 Luke-Jr 2022-03-31 17:44:48 UTC
Can confirm my error_log also has a segfault at least a few times most days.

Should we reopen this bug, or create a new one?
Comment 31 Ruediger Pluem 2022-04-01 08:29:23 UTC
I think we need:

1. thread apply all bt full from the main process.
2. thread apply all bt full from one of the "fat" processes.
3. thread apply all bt full from one of the crashed processes.
Comment 32 Teodor Milkov 2022-04-01 15:52:44 UTC
1. I believe the main process is not multi-threaded. Just in case I've run "thread apply all bt full" on the core file I have from the master process and it looks the same as "bt full".

2/3: The crashed process is actually the fat process. I don't have a core file, because core files are disabled on my systems for security reasons. I'll see if I can enable core files and collect one, but this may take time...
Comment 33 Teodor Milkov 2022-04-08 12:56:51 UTC
The crashing child turned out to be caused by a bad mod_security rule (at least in one of the cases). It looks like the crash is unrelated, but the newer Apache is not recovering from the situation.
Comment 34 Ruediger Pluem 2022-04-13 13:49:10 UTC
(In reply to Teodor Milkov from comment #33)
> Crashing child turned out to be caused by bad mod_security rule (at least in
> one of the cases). It looks the crash is unrelated, but newer apache is not
> recovering from the situation.

How do these crashes happen? Directly after the child starts, or only when a certain request is processed? We might be facing a situation here where one child after another crashes due to this. Do you have an error log? It should mention the crashed processes.
Comment 35 Teodor Milkov 2022-04-15 08:49:28 UTC
(In reply to Ruediger Pluem from comment #34)
> (In reply to Teodor Milkov from comment #33)
> > Crashing child turned out to be caused by bad mod_security rule (at least in
> > one of the cases). It looks the crash is unrelated, but newer apache is not
> > recovering from the situation.
> 
> How do these crashes happen? Directly after the child starts or only when a
> certain request was processed? We might face a situation here where one
> child after another crashed due to this. Do you have an error log? These
> should mention the crashed processes.

The child crashes only when a certain request is made. Crashing processes are indeed mentioned in the error_log -- sometimes these are spread out, at other times they come in quick succession or even in the same second:

[Fri Apr 01 18:46:42 2022] [notice] [pid 22993] mpm_unix.c(436): AH00052: child pid 13864 exit signal Segmentation fault (11)
[Fri Apr 01 20:39:15 2022] [notice] [pid 22993] mpm_unix.c(436): AH00052: child pid 58297 exit signal Segmentation fault (11)
[Fri Apr 01 20:39:15 2022] [notice] [pid 22993] mpm_unix.c(436): AH00052: child pid 58299 exit signal Segmentation fault (11)
[Fri Apr 01 20:39:18 2022] [notice] [pid 22993] mpm_unix.c(436): AH00052: child pid 35244 exit signal Segmentation fault (11)
[Fri Apr 01 20:45:01 2022] [notice] [pid 22993] mpm_unix.c(436): AH00052: child pid 37311 exit signal Segmentation fault (11)
[Fri Apr 01 21:21:00 2022] [notice] [pid 22993] mpm_unix.c(436): AH00052: child pid 29877 exit signal Segmentation fault (11)
[Fri Apr 01 23:58:43 2022] [notice] [pid 22993] mpm_unix.c(436): AH00052: child pid 12423 exit signal Segmentation fault (11)
[Fri Apr 01 23:58:44 2022] [notice] [pid 22993] mpm_unix.c(436): AH00052: child pid 12421 exit signal Segmentation fault (11)
[Sat Apr 02 00:13:14 2022] [notice] [pid 22993] mpm_unix.c(436): AH00052: child pid 29774 exit signal Segmentation fault (11)
[Sat Apr 02 02:33:56 2022] [notice] [pid 22993] mpm_unix.c(436): AH00052: child pid 30997 exit signal Segmentation fault (11)
Comment 36 Ruediger Pluem 2022-04-15 11:01:31 UTC
Thanks. There can be some delay (1 or 2 seconds) in restarting crashed processes, but the server should not remain without processes "forever". As far as I understand, the crashed processes never get replaced and you end up with just the parent process after some time?
Comment 37 Teodor Milkov 2022-04-16 14:14:24 UTC
Exactly, yes, that's what we are seeing -- just the parent process and no workers for (at least) several minutes, until the alarms ring and the web server is manually restarted.
Comment 38 Ruediger Pluem 2022-04-19 13:21:14 UTC
Please use the core dump you captured of such a process and issue the following gdb commands:

print *retained
print *(retained->idle_spawn_rate)
print *(retained->mpm)
Comment 39 Ruediger Pluem 2022-04-19 13:36:50 UTC
Further gdb commands:

set $i=0
while ($i<server_limit)
print ap_scoreboard_image->parent[$i++]
end

set $i=0
while($i<server_limit)
set $j=0
while($j<threads_per_child)
print ap_scoreboard_image->servers[$i][$j++]
end
set $i=$i+1
end
Comment 40 Teodor Milkov 2022-05-05 19:17:36 UTC
I've manually added the patches from https://www.mail-archive.com/dev@httpd.apache.org/msg76497.html and more specifically:


    https://svn.apache.org/viewvc?view=revision&revision=1899858
    https://svn.apache.org/viewvc?view=revision&revision=1899865
    https://svn.apache.org/viewvc?view=revision&revision=1899884

And now 2.4.53 is no longer getting stuck (on a limited set of servers). I'm going to try on more servers during the weekend, but so far it looks like the problem is gone.
Comment 41 Teodor Milkov 2022-05-13 09:49:45 UTC
(In reply to Teodor Milkov from comment #40)
> I've manually added the patches from
> https://www.mail-archive.com/dev@httpd.apache.org/msg76497.html

I confirm that with this patch httpd has been running stable on all of my servers for more than 5 days now.
Comment 42 Yann Ylavic 2022-06-01 11:34:01 UTC
(In reply to Teodor Milkov from comment #41)
> (In reply to Teodor Milkov from comment #40)
> > I've manually added the patches from
> > https://www.mail-archive.com/dev@httpd.apache.org/msg76497.html
> 
> I confirm, that with this patch httpd is running stable on all of my servers
> for more than 5 days now.

Backported to 2.4 (r1901234), will be in the next release.