Bug 66004

Summary: Child processes disappears one by one with no web access.
Product: Apache httpd-2 Reporter: Shun F <uni.roy>
Component: mpm_eventAssignee: Apache HTTPD Bugs Mailing List <bugs>
Status: RESOLVED FIXED    
Severity: normal Keywords: FixedInTrunk
Priority: P2    
Version: 2.4.53   
Target Milestone: ---   
Hardware: PC   
OS: Linux   
Attachments: error log
Fix active/total daemons accounting on non-graceful restart
error log with "LogLevel mpm_event:trace6" after patching

Description Shun F 2022-04-11 09:49:06 UTC
Updated httpd to 2.4.53 from 2.4.51.

After several time reloads, child processes disappeared one by one with following:

[Mon Apr 11 14:19:49.422912 2022] [mpm_event:debug] [pid 20759:tid 140355756119808] event.c(576): wake up listener


All child processes disappeared and only parent process survive, following is logged every second:

[Mon Apr 11 14:29:25.195671 2022] [mpm_event:info] [pid 28925:tid 140356948518720] AH00486: server seems busy, (you may need to increase StartServers, ThreadsPerChild or Min/MaxSpareThreads), spawning 0 children, there are around 64 idle threads, 14 active children, and 14 children that are shutting down
[Mon Apr 11 14:29:26.196818 2022] [mpm_event:info] [pid 28925:tid 140356948518720] AH00486: server seems busy, (you may need to increase StartServers, ThreadsPerChild or Min/MaxSpareThreads), spawning 0 children, there are around 64 idle threads, 14 active children, and 14 children that are shutting down
[Mon Apr 11 14:29:27.197926 2022] [mpm_event:info] [pid 28925:tid 140356948518720] AH00486: server seems busy, (you may need to increase StartServers, ThreadsPerChild or Min/MaxSpareThreads), spawning 0 children, there are around 64 idle threads, 14 active children, and 14 children that are shutting down

Reloading or stopping/starting httpd recovers from this situation.

This can be reproduce not only at production environment but also at the test environment with no web accesses to httpd.

Can anyone please investigate and fix this.

///////////////////////////commands(RHEL)
systemctl stop httpd
systemctl start httpd
/usr/bin/systemctl reload httpd.service
/usr/bin/systemctl reload httpd.service
/usr/bin/systemctl reload httpd.service
/usr/bin/systemctl reload httpd.service
/usr/bin/systemctl reload httpd.service
(wait about 30 minutes)

///////////////////////////httpd-mpm.conf
<IfModule mpm_event_module>
    ServerLimit             10
    StartServers               4
    MinSpareThreads           75
    MaxSpareThreads          250
    ThreadsPerChild           64
    MaxRequestWorkers        640
    MaxConnectionsPerChild   300
</IfModule>
Comment 1 Yann Ylavic 2022-04-11 09:58:55 UTC
It would be useful to run the reproducer with "LogLevel mpm_event:trace6" and attach the error log here.
Comment 2 Shun F 2022-04-11 10:01:57 UTC
(In reply to Yann Ylavic from comment #1)
> It would be useful to run the reproducer with "LogLevel mpm_event:trace6"
> and attach the error log here.

OK.
I'm trying to reproduce now. Please wait.
Comment 3 Shun F 2022-04-11 10:44:03 UTC
Created attachment 38243 [details]
error log
Comment 4 Shun F 2022-04-11 10:44:58 UTC
I reproduced with "LogLevel mpm_event:trace6" and uploaded error log.
Can you please investigae.

/////////////////////////////
# systemctl start httpd
# /usr/bin/systemctl reload httpd.service
# /usr/bin/systemctl reload httpd.service
# /usr/bin/systemctl reload httpd.service
# /usr/bin/systemctl reload httpd.service
# /usr/bin/systemctl reload httpd.service
# systemctl status httpd

● httpd.service - The Apache HTTP Server
   Loaded: loaded (/etc/systemd/system/httpd.service; enabled; vendor preset: disabled)
   Active: active (running) since 月 2022-04-11 19:03:14 JST; 37s ago
  Process: 15865 ExecStop=/opt/app/httpd/httpd-2/bin/httpd $OPTIONS -k graceful-stop (code=exited, status=0/SUCCESS)
  Process: 17327 ExecReload=/bin/kill -HUP $MAINPID (code=exited, status=0/SUCCESS)
  Process: 17325 ExecReload=/opt/app/httpd/httpd-2/bin/httpd -t (code=exited, status=0/SUCCESS)
  Process: 15929 ExecStart=/opt/app/httpd/httpd-2/bin/httpd $OPTIONS -k start (code=exited, status=0/SUCCESS)
 Main PID: 15930 (httpd)
   CGroup: /system.slice/httpd.service
           tq15930 /opt/app/httpd/httpd-2/bin/httpd -k start
           tq17329 /opt/app/httpd/httpd-2/bin/httpd -k start
           tq17331 /opt/app/httpd/httpd-2/bin/httpd -k start
           mq17336 /opt/app/httpd/httpd-2/bin/httpd -k start


33 minutes later...


# systemctl status httpd

● httpd.service - The Apache HTTP Server
   Loaded: loaded (/etc/systemd/system/httpd.service; enabled; vendor preset: disabled)
   Active: active (running) since 月 2022-04-11 19:03:14 JST; 33min ago
  Process: 15865 ExecStop=/opt/app/httpd/httpd-2/bin/httpd $OPTIONS -k graceful-stop (code=exited, status=0/SUCCESS)
  Process: 17327 ExecReload=/bin/kill -HUP $MAINPID (code=exited, status=0/SUCCESS)
  Process: 17325 ExecReload=/opt/app/httpd/httpd-2/bin/httpd -t (code=exited, status=0/SUCCESS)
  Process: 15929 ExecStart=/opt/app/httpd/httpd-2/bin/httpd $OPTIONS -k start (code=exited, status=0/SUCCESS)
 Main PID: 15930 (httpd)
   CGroup: /system.slice/httpd.service
           tq15930 /opt/app/httpd/httpd-2/bin/httpd -k start
           mq17329 /opt/app/httpd/httpd-2/bin/httpd -k start
/////////////////////////////
Comment 5 Yann Ylavic 2022-04-11 13:58:31 UTC
Created attachment 38244 [details]
Fix active/total daemons accounting on non-graceful restart

Could you please try this patch?
Comment 6 Shun F 2022-04-12 04:29:50 UTC
(In reply to Yann Ylavic from comment #5)
> Created attachment 38244 [details]
> Fix active/total daemons accounting on non-graceful restart
> 
> Could you please try this patch?

Thank you for your quick response.
What was wrong with 2.4.53?

I will try and report the result of the trial.
Comment 7 Yann Ylavic 2022-04-12 09:55:42 UTC
(In reply to Shun F from comment #6)
> What was wrong with 2.4.53?

The issue was with the non-accounting of children processes stopped by reloads (ungraceful restarts). 2.4.53 only accounts for processes started and stopped during runtime maintenance (namely perform_idle_server_maintenance() which honors Min/MaxSpareThreads and restarts children exited after MaxRequestsPerChild), while reloads also kill processes outside this mechanism thus outside the accounting (note that graceful restarts don't use this mechanism so are not affected by this issue).
At some point, with bad counters, httpd thinks it can't create new children even though none is actually running (besides the main process).

The fix consists in updating the counters for started/stopped children in a common function called by both runtime maintenance and reloads.
Comment 8 Shun F 2022-04-12 10:27:18 UTC
Created attachment 38245 [details]
error log with "LogLevel mpm_event:trace6" after patching
Comment 9 Shun F 2022-04-12 10:27:58 UTC
The patch trial completed.
I tried to reproduce twice and the phenomenon didn't reproduce.

The fix will be included in the next httpd release (2.4.54)?

/////////////////////////////error log uploaded (as error_log_after_patching0412)
systemctl start httpd

/usr/bin/systemctl reload httpd.service
/usr/bin/systemctl reload httpd.service
/usr/bin/systemctl reload httpd.service
/usr/bin/systemctl reload httpd.service
/usr/bin/systemctl reload httpd.service

systemctl status httpd

● httpd.service - The Apache HTTP Server
   Loaded: loaded (/etc/systemd/system/httpd.service; enabled; vendor preset: disabled)
   Active: active (running) since 火 2022-04-12 18:00:36 JST; 58s ago
  Process: 21597 ExecStop=/opt/app/httpd/httpd-2/bin/httpd $OPTIONS -k graceful-stop (code=exited, status=0/SUCCESS)
  Process: 23190 ExecReload=/bin/kill -HUP $MAINPID (code=exited, status=0/SUCCESS)
  Process: 23188 ExecReload=/opt/app/httpd/httpd-2/bin/httpd -t (code=exited, status=0/SUCCESS)
  Process: 21713 ExecStart=/opt/app/httpd/httpd-2/bin/httpd $OPTIONS -k start (code=exited, status=0/SUCCESS)
 Main PID: 21714 (httpd)
   CGroup: /system.slice/httpd.service
           tq21714 /opt/app/httpd/httpd-2/bin/httpd -k start
           tq23192 /opt/app/httpd/httpd-2/bin/httpd -k start
           tq23193 /opt/app/httpd/httpd-2/bin/httpd -k start
           mq23248 /opt/app/httpd/httpd-2/bin/httpd -k start

About an hour later... 2 child processes still exists.

systemctl status httpd

● httpd.service - The Apache HTTP Server
   Loaded: loaded (/etc/systemd/system/httpd.service; enabled; vendor preset: disabled)
   Active: active (running) since 火 2022-04-12 18:00:36 JST; 1h 2min ago
  Process: 21597 ExecStop=/opt/app/httpd/httpd-2/bin/httpd $OPTIONS -k graceful-stop (code=exited, status=0/SUCCESS)
  Process: 23190 ExecReload=/bin/kill -HUP $MAINPID (code=exited, status=0/SUCCESS)
  Process: 23188 ExecReload=/opt/app/httpd/httpd-2/bin/httpd -t (code=exited, status=0/SUCCESS)
  Process: 21713 ExecStart=/opt/app/httpd/httpd-2/bin/httpd $OPTIONS -k start (code=exited, status=0/SUCCESS)
 Main PID: 21714 (httpd)
   CGroup: /system.slice/httpd.service
           tq21714 /opt/app/httpd/httpd-2/bin/httpd -k start
           tq23192 /opt/app/httpd/httpd-2/bin/httpd -k start
           mq26723 /opt/app/httpd/httpd-2/bin/httpd -k start
/////////////////////////////

Thank you
Comment 10 Yann Ylavic 2022-04-12 13:09:23 UTC
Thanks for testing, checked in r1899777.

The change will be proposed for backport to 2.4, let's how it goes with others' reviews..
Comment 11 Shun F 2022-04-13 08:54:51 UTC
(In reply to Yann Ylavic from comment #7)
> (In reply to Shun F from comment #6)
> > What was wrong with 2.4.53?
> 
> The issue was with the non-accounting of children processes stopped by
> reloads (ungraceful restarts). 2.4.53 only accounts for processes started
> and stopped during runtime maintenance (namely
> perform_idle_server_maintenance() which honors Min/MaxSpareThreads and
> restarts children exited after MaxRequestsPerChild), while reloads also kill
> processes outside this mechanism thus outside the accounting (note that
> graceful restarts don't use this mechanism so are not affected by this
> issue).
> At some point, with bad counters, httpd thinks it can't create new children
> even though none is actually running (besides the main process).
> 
> The fix consists in updating the counters for started/stopped children in a
> common function called by both runtime maintenance and reloads.

Thank you for telling me.
I got the reason.
Comment 12 Shun F 2022-04-13 08:58:03 UTC
(In reply to Yann Ylavic from comment #10)
> Thanks for testing, checked in r1899777.
> 
> The change will be proposed for backport to 2.4, let's how it goes with
> others' reviews..

Thanks so much. I got it.
I hope your fix will be included in the next release.
Comment 13 Shun F 2022-04-14 13:15:15 UTC
If MaxConnectionsPerChild is set to a non-zero value, The phenomenon seems to reproduce.

I'm testing now.
I will report the results later.
Comment 14 Yann Ylavic 2022-04-14 13:34:10 UTC
There is a fix (follow up) to r1899777, namely r1899812.

A full patch for 2.4 including all the changes is available there: https://patch-diff.githubusercontent.com/raw/apache/httpd/pull/313.diff
Comment 15 Shun F 2022-04-15 01:53:50 UTC
(In reply to Yann Ylavic from comment #14)
> There is a fix (follow up) to r1899777, namely r1899812.
> 
> A full patch for 2.4 including all the changes is available there:
> https://patch-diff.githubusercontent.com/raw/apache/httpd/pull/313.diff

Is r1899777 an incomplete fix?
I'd better test the r1899812, right?
Comment 16 Shun F 2022-04-15 04:09:55 UTC
(In reply to Shun F from comment #13)
> If MaxConnectionsPerChild is set to a non-zero value, The phenomenon seems
> to reproduce.
> 
> I'm testing now.
> I will report the results later.


If MaxConnectionsPerChild is 0, the phenomenon doesn't reproduce under unpatched 2.4.53.

The reproducibility conditions are:

 [1]Using httpd 2.4.53 (2.4.52 is probably the same)
 [2]Using event MPM
 [3]Set a non-zero value to MaxConnectionsPerChild
 [4]Reloaded regularly, such as daily (e.g. by /etc/logrotate.d/httpd)
Comment 17 Yann Ylavic 2022-04-15 08:24:48 UTC
> Is r1899777 an incomplete fix?
> I'd better test the r1899812, right?

Yes, r1899777 alone is not enough, you can test r1899777+r1899812 (both patches) which is the same as 313.diff.
Comment 18 Shun F 2022-04-18 04:57:24 UTC
(In reply to Yann Ylavic from comment #17)
> > Is r1899777 an incomplete fix?
> > I'd better test the r1899812, right?
> 
> Yes, r1899777 alone is not enough, you can test r1899777+r1899812 (both
> patches) which is the same as 313.diff.

I've applied 313.diff to original 2.4.53 and the patch trial completed.
I've tried to reproduce twice and the phenomenon didn't reproduce.
Comment 19 Shun F 2022-05-18 01:18:25 UTC
How does it goes with others' reviews?
The fix will be backport to 2.4?
Comment 20 Yann Ylavic 2022-05-24 09:11:17 UTC
Backported to 2.4 (r1901199), will be in the next release.
Comment 21 Shun F 2022-05-24 10:30:31 UTC
(In reply to Yann Ylavic from comment #20)
> Backported to 2.4 (r1901199), will be in the next release.

Good news!
Thanks a million.