Updated httpd to 2.4.53 from 2.4.51. After several time reloads, child processes disappeared one by one with following: [Mon Apr 11 14:19:49.422912 2022] [mpm_event:debug] [pid 20759:tid 140355756119808] event.c(576): wake up listener All child processes disappeared and only parent process survive, following is logged every second: [Mon Apr 11 14:29:25.195671 2022] [mpm_event:info] [pid 28925:tid 140356948518720] AH00486: server seems busy, (you may need to increase StartServers, ThreadsPerChild or Min/MaxSpareThreads), spawning 0 children, there are around 64 idle threads, 14 active children, and 14 children that are shutting down [Mon Apr 11 14:29:26.196818 2022] [mpm_event:info] [pid 28925:tid 140356948518720] AH00486: server seems busy, (you may need to increase StartServers, ThreadsPerChild or Min/MaxSpareThreads), spawning 0 children, there are around 64 idle threads, 14 active children, and 14 children that are shutting down [Mon Apr 11 14:29:27.197926 2022] [mpm_event:info] [pid 28925:tid 140356948518720] AH00486: server seems busy, (you may need to increase StartServers, ThreadsPerChild or Min/MaxSpareThreads), spawning 0 children, there are around 64 idle threads, 14 active children, and 14 children that are shutting down Reloading or stopping/starting httpd recovers from this situation. This can be reproduce not only at production environment but also at the test environment with no web accesses to httpd. Can anyone please investigate and fix this. ///////////////////////////commands(RHEL) systemctl stop httpd systemctl start httpd /usr/bin/systemctl reload httpd.service /usr/bin/systemctl reload httpd.service /usr/bin/systemctl reload httpd.service /usr/bin/systemctl reload httpd.service /usr/bin/systemctl reload httpd.service (wait about 30 minutes) ///////////////////////////httpd-mpm.conf <IfModule mpm_event_module> ServerLimit 10 StartServers 4 MinSpareThreads 75 MaxSpareThreads 250 ThreadsPerChild 64 MaxRequestWorkers 640 MaxConnectionsPerChild 300 </IfModule>
It would be useful to run the reproducer with "LogLevel mpm_event:trace6" and attach the error log here.
(In reply to Yann Ylavic from comment #1) > It would be useful to run the reproducer with "LogLevel mpm_event:trace6" > and attach the error log here. OK. I'm trying to reproduce now. Please wait.
Created attachment 38243 [details] error log
I reproduced with "LogLevel mpm_event:trace6" and uploaded error log. Can you please investigae. ///////////////////////////// # systemctl start httpd # /usr/bin/systemctl reload httpd.service # /usr/bin/systemctl reload httpd.service # /usr/bin/systemctl reload httpd.service # /usr/bin/systemctl reload httpd.service # /usr/bin/systemctl reload httpd.service # systemctl status httpd ● httpd.service - The Apache HTTP Server Loaded: loaded (/etc/systemd/system/httpd.service; enabled; vendor preset: disabled) Active: active (running) since 月 2022-04-11 19:03:14 JST; 37s ago Process: 15865 ExecStop=/opt/app/httpd/httpd-2/bin/httpd $OPTIONS -k graceful-stop (code=exited, status=0/SUCCESS) Process: 17327 ExecReload=/bin/kill -HUP $MAINPID (code=exited, status=0/SUCCESS) Process: 17325 ExecReload=/opt/app/httpd/httpd-2/bin/httpd -t (code=exited, status=0/SUCCESS) Process: 15929 ExecStart=/opt/app/httpd/httpd-2/bin/httpd $OPTIONS -k start (code=exited, status=0/SUCCESS) Main PID: 15930 (httpd) CGroup: /system.slice/httpd.service tq15930 /opt/app/httpd/httpd-2/bin/httpd -k start tq17329 /opt/app/httpd/httpd-2/bin/httpd -k start tq17331 /opt/app/httpd/httpd-2/bin/httpd -k start mq17336 /opt/app/httpd/httpd-2/bin/httpd -k start 33 minutes later... # systemctl status httpd ● httpd.service - The Apache HTTP Server Loaded: loaded (/etc/systemd/system/httpd.service; enabled; vendor preset: disabled) Active: active (running) since 月 2022-04-11 19:03:14 JST; 33min ago Process: 15865 ExecStop=/opt/app/httpd/httpd-2/bin/httpd $OPTIONS -k graceful-stop (code=exited, status=0/SUCCESS) Process: 17327 ExecReload=/bin/kill -HUP $MAINPID (code=exited, status=0/SUCCESS) Process: 17325 ExecReload=/opt/app/httpd/httpd-2/bin/httpd -t (code=exited, status=0/SUCCESS) Process: 15929 ExecStart=/opt/app/httpd/httpd-2/bin/httpd $OPTIONS -k start (code=exited, status=0/SUCCESS) Main PID: 15930 (httpd) CGroup: /system.slice/httpd.service tq15930 /opt/app/httpd/httpd-2/bin/httpd -k start mq17329 /opt/app/httpd/httpd-2/bin/httpd -k start /////////////////////////////
Created attachment 38244 [details] Fix active/total daemons accounting on non-graceful restart Could you please try this patch?
(In reply to Yann Ylavic from comment #5) > Created attachment 38244 [details] > Fix active/total daemons accounting on non-graceful restart > > Could you please try this patch? Thank you for your quick response. What was wrong with 2.4.53? I will try and report the result of the trial.
(In reply to Shun F from comment #6) > What was wrong with 2.4.53? The issue was with the non-accounting of children processes stopped by reloads (ungraceful restarts). 2.4.53 only accounts for processes started and stopped during runtime maintenance (namely perform_idle_server_maintenance() which honors Min/MaxSpareThreads and restarts children exited after MaxRequestsPerChild), while reloads also kill processes outside this mechanism thus outside the accounting (note that graceful restarts don't use this mechanism so are not affected by this issue). At some point, with bad counters, httpd thinks it can't create new children even though none is actually running (besides the main process). The fix consists in updating the counters for started/stopped children in a common function called by both runtime maintenance and reloads.
Created attachment 38245 [details] error log with "LogLevel mpm_event:trace6" after patching
The patch trial completed. I tried to reproduce twice and the phenomenon didn't reproduce. The fix will be included in the next httpd release (2.4.54)? /////////////////////////////error log uploaded (as error_log_after_patching0412) systemctl start httpd /usr/bin/systemctl reload httpd.service /usr/bin/systemctl reload httpd.service /usr/bin/systemctl reload httpd.service /usr/bin/systemctl reload httpd.service /usr/bin/systemctl reload httpd.service systemctl status httpd ● httpd.service - The Apache HTTP Server Loaded: loaded (/etc/systemd/system/httpd.service; enabled; vendor preset: disabled) Active: active (running) since 火 2022-04-12 18:00:36 JST; 58s ago Process: 21597 ExecStop=/opt/app/httpd/httpd-2/bin/httpd $OPTIONS -k graceful-stop (code=exited, status=0/SUCCESS) Process: 23190 ExecReload=/bin/kill -HUP $MAINPID (code=exited, status=0/SUCCESS) Process: 23188 ExecReload=/opt/app/httpd/httpd-2/bin/httpd -t (code=exited, status=0/SUCCESS) Process: 21713 ExecStart=/opt/app/httpd/httpd-2/bin/httpd $OPTIONS -k start (code=exited, status=0/SUCCESS) Main PID: 21714 (httpd) CGroup: /system.slice/httpd.service tq21714 /opt/app/httpd/httpd-2/bin/httpd -k start tq23192 /opt/app/httpd/httpd-2/bin/httpd -k start tq23193 /opt/app/httpd/httpd-2/bin/httpd -k start mq23248 /opt/app/httpd/httpd-2/bin/httpd -k start About an hour later... 2 child processes still exists. systemctl status httpd ● httpd.service - The Apache HTTP Server Loaded: loaded (/etc/systemd/system/httpd.service; enabled; vendor preset: disabled) Active: active (running) since 火 2022-04-12 18:00:36 JST; 1h 2min ago Process: 21597 ExecStop=/opt/app/httpd/httpd-2/bin/httpd $OPTIONS -k graceful-stop (code=exited, status=0/SUCCESS) Process: 23190 ExecReload=/bin/kill -HUP $MAINPID (code=exited, status=0/SUCCESS) Process: 23188 ExecReload=/opt/app/httpd/httpd-2/bin/httpd -t (code=exited, status=0/SUCCESS) Process: 21713 ExecStart=/opt/app/httpd/httpd-2/bin/httpd $OPTIONS -k start (code=exited, status=0/SUCCESS) Main PID: 21714 (httpd) CGroup: /system.slice/httpd.service tq21714 /opt/app/httpd/httpd-2/bin/httpd -k start tq23192 /opt/app/httpd/httpd-2/bin/httpd -k start mq26723 /opt/app/httpd/httpd-2/bin/httpd -k start ///////////////////////////// Thank you
Thanks for testing, checked in r1899777. The change will be proposed for backport to 2.4, let's how it goes with others' reviews..
(In reply to Yann Ylavic from comment #7) > (In reply to Shun F from comment #6) > > What was wrong with 2.4.53? > > The issue was with the non-accounting of children processes stopped by > reloads (ungraceful restarts). 2.4.53 only accounts for processes started > and stopped during runtime maintenance (namely > perform_idle_server_maintenance() which honors Min/MaxSpareThreads and > restarts children exited after MaxRequestsPerChild), while reloads also kill > processes outside this mechanism thus outside the accounting (note that > graceful restarts don't use this mechanism so are not affected by this > issue). > At some point, with bad counters, httpd thinks it can't create new children > even though none is actually running (besides the main process). > > The fix consists in updating the counters for started/stopped children in a > common function called by both runtime maintenance and reloads. Thank you for telling me. I got the reason.
(In reply to Yann Ylavic from comment #10) > Thanks for testing, checked in r1899777. > > The change will be proposed for backport to 2.4, let's how it goes with > others' reviews.. Thanks so much. I got it. I hope your fix will be included in the next release.
If MaxConnectionsPerChild is set to a non-zero value, The phenomenon seems to reproduce. I'm testing now. I will report the results later.
There is a fix (follow up) to r1899777, namely r1899812. A full patch for 2.4 including all the changes is available there: https://patch-diff.githubusercontent.com/raw/apache/httpd/pull/313.diff
(In reply to Yann Ylavic from comment #14) > There is a fix (follow up) to r1899777, namely r1899812. > > A full patch for 2.4 including all the changes is available there: > https://patch-diff.githubusercontent.com/raw/apache/httpd/pull/313.diff Is r1899777 an incomplete fix? I'd better test the r1899812, right?
(In reply to Shun F from comment #13) > If MaxConnectionsPerChild is set to a non-zero value, The phenomenon seems > to reproduce. > > I'm testing now. > I will report the results later. If MaxConnectionsPerChild is 0, the phenomenon doesn't reproduce under unpatched 2.4.53. The reproducibility conditions are: [1]Using httpd 2.4.53 (2.4.52 is probably the same) [2]Using event MPM [3]Set a non-zero value to MaxConnectionsPerChild [4]Reloaded regularly, such as daily (e.g. by /etc/logrotate.d/httpd)
> Is r1899777 an incomplete fix? > I'd better test the r1899812, right? Yes, r1899777 alone is not enough, you can test r1899777+r1899812 (both patches) which is the same as 313.diff.
(In reply to Yann Ylavic from comment #17) > > Is r1899777 an incomplete fix? > > I'd better test the r1899812, right? > > Yes, r1899777 alone is not enough, you can test r1899777+r1899812 (both > patches) which is the same as 313.diff. I've applied 313.diff to original 2.4.53 and the patch trial completed. I've tried to reproduce twice and the phenomenon didn't reproduce.
How does it goes with others' reviews? The fix will be backport to 2.4?
Backported to 2.4 (r1901199), will be in the next release.
(In reply to Yann Ylavic from comment #20) > Backported to 2.4 (r1901199), will be in the next release. Good news! Thanks a million.
Mentioned in : https://operations.osmfoundation.org/2022/07/24/standard-tile-layer.html