Starting with Apache 2.4.52, the Apache child processes fail to start after a period of time or after a certain number of requests. It is hard to isolate exactly how to replicate the issue, but after a while Apache stops accepting new connections until the parent Apache process is eventually restarted; requests for new web pages simply hang. It seems to affect mostly servers using the event MPM. This might be related to bug 65592 - https://bz.apache.org/bugzilla/show_bug.cgi?id=65592 - in that a similar issue was introduced in Apache 2.4.49. In this particular case, reverting to Apache 2.4.51 resolves the issue. Nothing is reported in the Apache error_log when this hang occurs. Web hosting providers using DirectAdmin and cPanel are reporting this issue.
I experienced this issue as well (in all cases I was using the event MPM): apache stops accepting any connection because no listener thread is running. No events are reported in the error log, and downgrading to apache 2.4.51 "fixes" this issue. I'm able to reproduce this with this config (test case provided from bug 65592):

=====================
<IfModule mpm_event_module>
    StartServers            1
    MinSpareThreads         1
    MaxSpareThreads         1
    ThreadsPerChild         1
    MaxRequestWorkers       1
    MaxConnectionsPerChild  1
</IfModule>
=====================

I will try to provide a trace later.
Please add "LogLevel mpm_event:trace6" in the global configuration (after any other LogLevel directives) and attach the error_log file here. I tried to reproduce with ab and the default configuration (also with the all-1s configuration above) but couldn't: httpd stops serving requests temporarily when its MaxRequestWorkers is saturated, but it always recovers as connections terminate. What are your actual event MPM settings (ServerLimit, MinSpareThreads, MaxSpareThreads, ThreadsPerChild, MaxRequestWorkers, MaxConnectionsPerChild, ...)?
(In reply to Yann Ylavic from comment #2)
> Please add "LogLevel mpm_event:trace6" in the global configuration

"LogLevel mpm_event:trace5" should be enough and much less verbose.
Created attachment 38143 [details]
error log 27122021

Below you can find the logs after increasing the debug level. The configuration is the one I previously reported, although it tends to happen with a lower value for "MaxConnectionsPerChild". I'm not using ab here, just sending 3-4 requests with curl after starting up httpd.

What I also noticed is this piece of log repeating in a loop (until I stopped httpd):

+++++++++++++++++++++++++++++++++++++++++++++
[Mon Dec 27 18:09:26.412311 2021] [mpm_event:trace1] [pid 23746:tid 47989522396224] event.c(2952): server is at active daemons limit, spawning of 1 children cancelled: 1/1 active, rate 32
[Mon Dec 27 18:09:26.412349 2021] [mpm_event:info] [pid 23746:tid 47989522396224] AH00486: server seems busy, (you may need to increase StartServers, ThreadsPerChild or Min/MaxSpareThreads), spawning 0 children, there are around 0 idle threads, 1 active children, and 0 children that are shutting down
+++++++++++++++++++++++++++++++++++++++++++++

In another case (leading to the same situation) no events were logged in a loop, but after logging "Too many open connections (0) ..."
no connections were served anymore:

+++++++++++++++++++++++++++++++++++++++++++++
[Mon Dec 27 17:54:05.842996 2021] [mpm_event:trace1] [pid 23211:tid 47140081620736] event.c(1217): Stopping process due to MaxConnectionsPerChild
[Mon Dec 27 17:54:05.843074 2021] [mpm_event:trace1] [pid 23211:tid 47139822519360] event.c(2683): graceful termination, workers joined, exiting
[Mon Dec 27 17:54:06.894748 2021] [mpm_event:trace5] [pid 23164:tid 47139822519360] event.c(2972): Spawning new child: slot 0 active / total daemons: 0/0
[Mon Dec 27 17:54:06.900350 2021] [mpm_event:debug] [pid 23218:tid 47140077418240] event.c(2367): AH02471: start_threads: Using epoll (wakeable)
[Mon Dec 27 17:54:06.900545 2021] [mpm_event:debug] [pid 23218:tid 47140081620736] event.c(1864): Too many open connections (0), not accepting new conns in this process
[Mon Dec 27 17:54:06.900563 2021] [mpm_event:trace1] [pid 23218:tid 47140081620736] event.c(1867): Idle workers: 0
+++++++++++++++++++++++++++++++++++++++++++++

In both cases I couldn't find any running listening thread to serve the requests.
> In both cases I couldn't find any running listening thread to serve the requests.

I wonder if you can find processes that appear to hang during termination (like an "atexit" kind of handler)? E.g. for the uploaded log, did you see 23794 in ps output, and if so what does pstack or gdb say about what it's doing?

> [Mon Dec 27 18:05:08.139951 2021] [mpm_event:trace1] [pid 23794:tid 47989522396224] event.c(2683): graceful termination, workers joined, exiting

Can you share apachectl -M output too, in case there is some pattern or module that stands out?
Created attachment 38144 [details]
Use _exit() in clean_child_exit()

If the issue is of the atexit() kind, this patch should help too. Does it?
Created attachment 38145 [details]
apache installation details

@Eric: when the child process hangs, it's immediately terminated, so no luck in tracing it. Sometimes a new child process is spawned (yet no threads are there to accept requests) and it's stuck in this state (though this has also been observed on working child processes when the issue does not arise):

++++++++++++++++++++++++++++++++++++++++++++++
#0  0x00007fca258239e4 in read () from /lib64/libpthread.so.0
#1  0x0000000000460f84 in ap_mpm_podx_check (pod=<optimized out>) at mpm_unix.c:546
#2  0x000000000047338f in child_main (child_num_arg=child_num_arg@entry=1, child_bucket=child_bucket@entry=0) at event.c:2647
#3  0x000000000047373a in make_child (s=0x15f9e48, slot=slot@entry=1, bucket=bucket@entry=0) at event.c:2754
#4  0x000000000047420a in perform_idle_server_maintenance (child_bucket=0) at event.c:2973
#5  server_main_loop (remaining_children_to_start=0) at event.c:3103
#6  event_run (_pconf=<optimized out>, plog=<optimized out>, s=<optimized out>) at event.c:3180
#7  0x000000000043774e in ap_run_mpm (pconf=pconf@entry=0x15ce3c8, plog=0x15fb608, s=0x15f9e48) at mpm_common.c:95
#8  0x000000000042fd0e in main (argc=<optimized out>, argv=<optimized out>) at main.c:819
++++++++++++++++++++++++++++++++++++++++++++++

This server runs RHEL 8 (but I also tried CentOS 7 / Ubuntu 21.10) and a very basic apache compiled from source; I have attached the details.

I did some tests with the worker MPM and this issue is not present there. All works as expected, so this issue only occurs with the event MPM.

@Yann: I applied your patch but unfortunately it does not fix this issue.
Created attachment 38146 [details]
server_main_loop to decrement active_daemons for exited children with quiescing == 1 too

Could you please try this new patch? Besides the debug traces, which should show what's happening, the change is mainly that server_main_loop will now count *all* exited children (that have not been caught by perform_idle_server_maintenance) as inactive.
Created attachment 38147 [details]
Same without debug traces
Created attachment 38148 [details]
error_log_28122021

I applied the patch and can still reproduce the issue. The server ceased to accept requests just after this event (though, as you can see in the attached logs, a new child process is spawned and remains in execution):

[Tue Dec 28 12:55:24.703419 2021] [mpm_event:debug] [pid 2687016:tid 139917791557376] event.c(1864): Too many open connections (0), not accepting new conns in this process

Curiously enough: if I run pstack or attach gdb to this PID (in this log it's 2687016), it logs the following and then accepts requests again:

+++++++++++++++++++++++++++++++++++++++++++++++
[Tue Dec 28 13:14:21.609592 2021] [mpm_event:debug] [pid 2687016:tid 139921957689088] event.c(496): AH00457: Accepting new connections again: 0 active conns (0 lingering/0 clogged/0 suspended), 1 idle workers
[Tue Dec 28 13:14:21.609822 2021] [mpm_event:trace1] [pid 2687016:tid 139921957689088] event.c(1217): Stopping process due to MaxConnectionsPerChild
[Tue Dec 28 13:14:21.609829 2021] [mpm_event:debug] [pid 2687016:tid 139921957689088] event.c(571): wake up listener
[Tue Dec 28 13:14:21.610740 2021] [mpm_event:trace1] [pid 2687016:tid 139922101746560] event.c(2679): graceful termination received, joining workers
[Tue Dec 28 13:14:21.610756 2021] [mpm_event:trace1] [pid 2687016:tid 139922101746560] event.c(2683): graceful termination, workers joined, exiting
+++++++++++++++++++++++++++++++++++++++++++++++

Then, after the next request, it ceases to accept new requests unless I do one of the following things again:
- apachectl graceful, OR
- apachectl stop / apachectl start, OR
- run pstack or gdb on the running child process reported in the logs for the "Too many open connections" event
Created attachment 38149 [details]
Active daemons accounting fix plus idle worker waking up the listener when listeners disabled

It seems that no worker is idle yet when the listener handles its first connection, and nothing recovers after the listening socket is disabled. Could you please try this patch? It's the same as attachment 38146 [details] plus the above case hopefully addressed.
There were a lot of changes to server/mpm/event/event.c from Apache 2.4.51 to Apache 2.4.52 - are you sure none of those changes is causing this issue? All setups and configurations work fine under Apache 2.4.51, but upgrading to Apache 2.4.52 - with no configuration changes - causes this issue to appear. That has to mean something.
(In reply to asf from comment #12)
> There were a lot of changes to server/mpm/event/event.c from Apache 2.4.51
> to Apache 2.4.52 - are you sure none of those changes is causing this issue?

Sure, it's one of those changes, namely the active_daemons accounting, which moved from a global variable reset on restart to a variable retained/reused across restarts. The difference is that on reload httpd now accurately accounts for the number of active vs dying children, whereas in 2.4.51 and earlier it could have created more children than allowed, causing several "scoreboard full" kinds of issues. Actually, the issues we are trying to fix now were already there in previous versions but masked by the incorrect accounting of active_daemons. I think the same thing can happen in 2.4.51, it's just less likely.
Yann: Yes, I confirm that the last attached patch fixes the issue.
Thanks for testing, now in r1896505.
*** Bug 65803 has been marked as a duplicate of this bug. ***
Hi, this is affecting us in production. Do we already know when the next release will be?
It seems highly likely that we will do a release this month.
Hi, is there any news about the next release?
I plan to make a release candidate on Monday, March 7th. Given that all goes well, that should come out on Thursday, March 10th.
FWIW, this has been happening to me also every few days. Looking at the changes in r1896505, I suspect it may not fix it, because if I attach during the problem condition:

(gdb) print listensocks_disabled
$2 = 0
(In reply to Luke-Jr from comment #21)
> Looking at the changes in r1896505, I suspect it may not fix it, because if
> I attach during the problem condition:
>
> (gdb) print listensocks_disabled
> $2 = 0

What does:
(gdb) thread apply all bt full
say?
Backported to 2.4.x (r1897149), will be in the next release.
(In reply to Yann Ylavic from comment #22)
> What does:
> (gdb) thread apply all bt full
> say?

My website was down, so I had to restart it. But from memory, strace showed it in a poll loop with none of the listening sockets (identified using lsof). A simple backtrace was inside an apr function, possibly the one used for the polling.
Is anybody still having this issue? Yesterday I upgraded from 2.4.51 to 2.4.53 and started experiencing similar symptoms. Once in a while httpd just ceases to accept new connections. On a fleet of about 1000 web servers it occurred about 8 times in 24 hours, so it's quite rare and I don't know how to reproduce it. I'll try to strace it next time.
If it reproduces, could you please run:

$ gdb /path/to/httpd -p <pid>
(gdb) set logging on
(gdb) set logging file /tmp/httpd-backtrace.log
(gdb) thread apply all bt full

and attach "httpd-backtrace.log" here? (<pid> is the process id of the httpd child in this non-responsive state.)
(In reply to Yann Ylavic from comment #26)
> If it reproduces could you please:
> ...

We were in a hurry to get something working, so we reverted to 2.4.51 + mod_h2 2.0.2. We may try 2.4.53 again on a limited set of servers soon and be ready to collect the debug information.
(In reply to Yann Ylavic from comment #26)
> If it reproduces could you please:
>
> $ gdb /path/to/httpd -p <pid>
> (gdb) set logging on
> (gdb) set logging file /tmp/httpd-backtrace.log
> (gdb) thread apply all bt full
>
> and attach "httpd-backtrace.log" here?
> (<pid> is the process id of the httpd child in this non-responsive state)

As it seems hard to reproduce, I guess it would make sense to first "save" such a process in a core dump via gcore <pid>. This would enable us to ask for further debugging steps later if the analysis of the stack traces creates demand for it.
I finally managed to run a debug version and save a dump with gcore.

Here's how the process list usually looks on this server:

USER       PID %CPU %MEM     VSZ   RSS TTY STAT START TIME COMMAND
root     50551  0.0  0.9  119520 57908 ?   Ss   Mar30 0:23 /apache/bin/httpd -k start
apache   55583  0.0  0.4  118344 26592 ?   S    12:15 0:00  \_ /apache/bin/httpd -k start
apache   55584  0.0  0.4  119520 26948 ?   S    12:15 0:00  \_ /apache/bin/httpd -k start
apache   55585  5.2  0.8 9374960 49764 ?   Sl   12:15 0:00  \_ /apache/bin/httpd -k start
apache   56594  1.3  0.6 9331520 42000 ?   Sl   12:15 0:00  \_ /apache/bin/httpd -k start

And here's how it looked when stuck:

USER       PID %CPU %MEM     VSZ   RSS TTY STAT START TIME COMMAND
root     13105  0.0  0.9  119384 57524 ?   Ss   03:08 0:02 /apache/bin/httpd -k start
apache   28298  0.0  0.4  118216 27392 ?   S    06:00 0:00  \_ /apache/bin/httpd -k start
apache   28299  0.0  0.4  119376 26624 ?   S    06:00 0:00  \_ /apache/bin/httpd -k start

So, there are:
1. the master (root) process,
2. two "thin" processes, which probably take care of the cgi (we use suexec), and
3. two "fat" processes, which are the mpm_event workers.

Here's the bt of the root process. Please let me know if I should do something else:

gdb ./httpd-2.4.53-debug core.13105
(gdb) bt full
#0  0x000072a77d0b7a27 in __GI___select (nfds=0, readfds=0x0, writefds=0x0, exceptfds=0x0, timeout=0x7b4d48201f70) at ../sysdeps/unix/sysv/linux/select.c:41
        resultvar = 18446744073709551102
        sc_ret = <optimized out>
#1  0x000072a77d1d3535 in apr_sleep () from /usr/lib/x86_64-linux-gnu/libapr-1.so.0
No symbol table info available.
#2  0x00001e99ee736b41 in ap_wait_or_timeout (status=0x7b4d48202024, exitcode=0x7b4d48202020, ret=0x7b4d48202000, p=0x72a77d8c4028, s=0x72a77cd6cb48) at mpm_common.c:205
        rv = 70006
#3  0x00001e99ee81d8d9 in server_main_loop (remaining_children_to_start=0) at event.c:3026
        num_buckets = 1
        child_slot = 0
        exitwhy = APR_PROC_SIGNAL
        status = 11
        processed_status = 0
        pid = {pid = -1, in = 0x1e99ee769e1e <ap_log_command_line+327>, out = 0x1e99ee82b6dd, err = 0x72a778d487e8}
        i = 1
#4  0x00001e99ee81df29 in event_run (_pconf=0x72a77d8c4028, plog=0x72a77cd5f028, s=0x72a77cd6cb48) at event.c:3204
        num_buckets = 1
        remaining_children_to_start = 0
        i = 0
#5  0x00001e99ee735f4d in ap_run_mpm (pconf=0x72a77d8c4028, plog=0x72a77cd5f028, s=0x72a77cd6cb48) at mpm_common.c:95
        pHook = 0x72a77a9082d0
        n = 0
        rv = -1
#6  0x00001e99ee72c0cd in main (argc=3, argv=0x7b4d48202308) at main.c:841
        c = 0 '\000'
        showcompile = 0
        showdirectives = 0
        confname = 0x1e99ee822d43 "conf/httpd.conf"
        def_server_root = 0x1e99ee822d53 "/apache"
        temp_error_log = 0x0
        error = 0x0
        process = 0x72a77d8c6118
        pconf = 0x72a77d8c4028
        plog = 0x72a77cd5f028
        ptemp = 0x72a77cd57028
        pcommands = 0x72a77cd79028
        opt = 0x72a77cd79118
        rv = 0
        mod = 0x1e99ee86d9e8 <ap_prelinked_modules+328>
        opt_arg = 0x7b4d48202170 "\250\002\215}\247r"
        signal_server = 0x1e99ee77446e <ap_signal_server>
        rc = 0

Looking at this backtrace I noticed it mentions signal 11, and looking in the error_log there's indeed this:

[Wed Mar 30 06:04:37 2022] [notice] [pid 13105] mpm_unix.c(436): AH00052: child pid 28302 exit signal Segmentation fault (11)

Now, looking across the error logs of all our servers, it's not unusual to see this segfault on a daily basis. Apparently 2.4.51 handles the child crashing gracefully, while 2.4.53 gets stuck. Why the segfault happens is another question, and I'm going to investigate that too, but IMO it'd be great if we could get back to the graceful behaviour from before 2.4.53.
Can confirm my error_log also has a segfault at least a few times most days. Should we reopen this bug, or create a new one?
I think we need:
1. thread apply all bt full from the main process.
2. thread apply all bt full from one of the "fat" processes.
3. thread apply all bt full from one of the crashed processes.
1. I believe the main process is not multi-threaded. Just in case, I've run "thread apply all bt full" on the core file I have from the master process and it looks the same as "bt full".
2/3. The crashed process is actually the fat process. I don't have a core file, because core files are disabled on my systems for security reasons. I'll see if I can enable core files and collect one, but this may take time...
The crashing child turned out to be caused by a bad mod_security rule (at least in one of the cases). It looks like the crash is unrelated, but the newer apache is not recovering from the situation.
(In reply to Teodor Milkov from comment #33)
> Crashing child turned out to be caused by bad mod_security rule (at least in
> one of the cases). It looks the crash is unrelated, but newer apache is not
> recovering from the situation.

How do these crashes happen? Directly after the child starts, or only when a certain request was processed? We might be facing a situation here where one child after another crashed due to this. Do you have an error log? It should mention the crashed processes.
(In reply to Ruediger Pluem from comment #34)
> How do these crashes happen? Directly after the child starts or only when a
> certain request was processed? We might face a situation here where one
> child after another crashed due to this. Do you have an error log? These
> should mention the crashed processes.

A child crashes only when a certain request is made. Crashing processes are indeed mentioned in the error_log -- sometimes these are spread out, and at other times in quick succession or within the same second:

[Fri Apr 01 18:46:42 2022] [notice] [pid 22993] mpm_unix.c(436): AH00052: child pid 13864 exit signal Segmentation fault (11)
[Fri Apr 01 20:39:15 2022] [notice] [pid 22993] mpm_unix.c(436): AH00052: child pid 58297 exit signal Segmentation fault (11)
[Fri Apr 01 20:39:15 2022] [notice] [pid 22993] mpm_unix.c(436): AH00052: child pid 58299 exit signal Segmentation fault (11)
[Fri Apr 01 20:39:18 2022] [notice] [pid 22993] mpm_unix.c(436): AH00052: child pid 35244 exit signal Segmentation fault (11)
[Fri Apr 01 20:45:01 2022] [notice] [pid 22993] mpm_unix.c(436): AH00052: child pid 37311 exit signal Segmentation fault (11)
[Fri Apr 01 21:21:00 2022] [notice] [pid 22993] mpm_unix.c(436): AH00052: child pid 29877 exit signal Segmentation fault (11)
[Fri Apr 01 23:58:43 2022] [notice] [pid 22993] mpm_unix.c(436): AH00052: child pid 12423 exit signal Segmentation fault (11)
[Fri Apr 01 23:58:44 2022] [notice] [pid 22993] mpm_unix.c(436): AH00052: child pid 12421 exit signal Segmentation fault (11)
[Sat Apr 02 00:13:14 2022] [notice] [pid 22993] mpm_unix.c(436): AH00052: child pid 29774 exit signal Segmentation fault (11)
[Sat Apr 02 02:33:56 2022] [notice] [pid 22993] mpm_unix.c(436): AH00052: child pid 30997 exit signal Segmentation fault (11)
Thanks. There could be some delay (1 or 2 secs) in restarting crashed processes, but it should not remain without processes "forever". As far as I understand the crashed processes never get replaced and you end up with just the parent process after some time?
Exactly, yes, that's what we are seeing -- just the parent process and no workers for (at least) several minutes, until the alarms ring and the web server is manually restarted.
Please use the core dump you captured of such a process and issue the following gdb commands:

print *retained
print *(retained->idle_spawn_rate)
print *(retained->mpm)
Further gdb commands:

set $i=0
while ($i < server_limit)
  print ap_scoreboard_image->parent[$i++]
end

set $i=0
while ($i < server_limit)
  set $j=0
  while ($j < threads_per_child)
    print ap_scoreboard_image->servers[$i][$j++]
  end
  set $i=$i+1
end
I've manually added the patches from https://www.mail-archive.com/dev@httpd.apache.org/msg76497.html, more specifically:

https://svn.apache.org/viewvc?view=revision&revision=1899858
https://svn.apache.org/viewvc?view=revision&revision=1899865
https://svn.apache.org/viewvc?view=revision&revision=1899884

And now 2.4.53 is no longer getting stuck (on a limited set of servers). I'm going to try more servers during the weekend, but so far it looks like the problem is gone.
(In reply to Teodor Milkov from comment #40)
> I've manually added the patches from
> https://www.mail-archive.com/dev@httpd.apache.org/msg76497.html

I confirm that with this patch httpd has been running stable on all of my servers for more than 5 days now.
(In reply to Teodor Milkov from comment #41)
> (In reply to Teodor Milkov from comment #40)
> > I've manually added the patches from
> > https://www.mail-archive.com/dev@httpd.apache.org/msg76497.html
>
> I confirm that with this patch httpd has been running stable on all of my
> servers for more than 5 days now.

Backported to 2.4 (r1901234), will be in the next release.