Bug 48769

Summary: [PATCH] Processes in the busy list should not be killed during graceful restarts
Product: Apache httpd-2 Reporter: John Lightsey <jd>
Component: mod_fcgidAssignee: Apache HTTPD Bugs Mailing List <bugs>
Status: REOPENED ---    
Severity: normal CC: apache-bugs.a2, JBlond, rob-apachebugs, stefan
Priority: P2 Keywords: PatchAvailable
Version: 2.4-HEAD   
Target Milestone: ---   
Hardware: PC   
OS: Linux   
Attachments: Graceful restart patch
Separated graceful restart patch
Re-added #define FCGID_PM_MAX_SHUTDOWN_WAIT_ATTEMPTS to fcgid_pm_main.h
Remove "is mutex filename NULL" check

Description John Lightsey 2010-02-18 18:29:09 UTC
Created attachment 25019 [details]
Graceful restart patch

During graceful restarts mod_fcgid sends SIGTERM to its child processes, waits one second, sends SIGKILL to them, then blocks until they are all reaped.  This is a particularly bad method since the only safe time for a fastcgi worker process to shut down is between requests.  For graceful restarts to be truely graceful with mod_fcgid, all fastcgi requests need to complete in under 1 second.

Additionally, we've seen some servers where it takes several minutes for mod_fcgid to reap all of its processes during graceful restarts.  This is likely due the the fact that mod_fcgid is also trying to reap processes in the free list which it didn't actually spawn.

An better method would be for mod_fcgid to not send SIGKILL and wait on the processes in the busy list during graceful restarts.  The attached patch attempts to go this route by keeping the shared memory, mutexes and pipes the mod_fcgid process manager uses to communicate with bridged apache workers alive across restarts.

This also changes the shutdown code so that it does not signal and attempt to reap processes IDs that are in the free list.

I also changed the virtualhost part of the process tables to be a static string instead of a pointer.  The results of the pointer comparisons were often incorrect, resulting in the creation of multiple fastcgi classes for the same exact virtualhost and location.
Comment 1 Jeff Trawick 2010-02-18 18:50:47 UTC
Hi John,  

It would be helpful if you can open separate bugs and provide separate patches for issues that can be solved separately from the primary bug you reported -- SIGKILL-ing active processes during graceful restart.

At a glance, issues that appear to be addressable separately appear to be

a) broken vhost pointer comparisons (is anything special required to trigger the problem?)
b) trying to reap non-existent processes in the free list

(maybe there is something else as well?)

Thanks a bunch!
Comment 2 John Lightsey 2010-02-18 19:05:45 UTC
No problem.  I'll split it up.
Comment 3 Jeff Trawick 2010-04-29 16:52:27 UTC
BTW, a different fix has been committed (r939472) for the broken vhost pointer comparison issue.
Comment 4 Michał Grzędzicki 2012-02-03 20:43:38 UTC
Created attachment 28262 [details]
Separated graceful restart patch

This is against current mod_fcgid trunk, only parts related to graceful restart
from John Lightsey's patch are included.
Comment 5 Stefan Priebe 2015-03-19 19:19:32 UTC
Great, why is this one not upstream? Nobody wants killed processes on reload.
Comment 6 apache-bugs.a2 2016-03-04 21:10:21 UTC
*** Bug 59112 has been marked as a duplicate of this bug. ***
Comment 7 apache-bugs.a2 2016-03-04 21:14:17 UTC
Yes, after 6+ years this bug is still at large. Not only that graceful restarts don't work with fastcgi php processes, fcgid actualy kills apache processes (like static object downloads that were not using php at all) with a graceful restart (only if at least 2 fcgid handers were running at the time of graceful restart) - major bug.

I've tested https://bz.apache.org/bugzilla/attachment.cgi?id=28262 patch and it seems to fix that.
Comment 8 Damien Norris 2016-07-03 16:25:26 UTC
Created attachment 34001 [details]
Re-added #define FCGID_PM_MAX_SHUTDOWN_WAIT_ATTEMPTS to fcgid_pm_main.h
Comment 9 Damien Norris 2016-07-03 16:26:07 UTC
Since Debian's migration to systemd, this bug has become much more prominent.  Logrotate jobs that run apachectl graceful several times in succession frequently get hung up, crashing the webserver in the middle of the night, requiring manual intervention (kilall -9 /usr/sbin/apache2 seems to do it, but only a SIGKILL can get rid of the stuck ones).

The separated restart patch from comment #4 (attachment 28262 [details]) did not work right away because it's missing the "#define FCGID_PM_MAX_SHUTDOWN_WAIT_ATTEMPTS 30" from the original patch.  I've attached one with that re-added that should apply cleanly and build.

I am testing in the Debian libapache2-mod-fcgid package source to see if the problem is resolved.. unfortunately I haven't had much luck reproducing the problem except when it happens on production machines.  So I will load the patched mod_fcgid up there and see if my problem goes from happening several times a week to not at all, and report back.
Comment 10 Robert Mathews 2018-05-02 00:03:30 UTC
I have been using the attachment 34001 [details] patch in comment #8 successfully on around a dozen web servers that use "apache2ctl graceful" reloads at least daily.

It fixes the problem described in the bug: Apache can now reload without interrupting in-flight FCGI requests, and it no longer tries to kill processes in the free list. This is a huge improvement.

That said, I discovered a slight problem with how the patch works: on each graceful reload, it "leaks" a shared memory segment and two semaphore arrays, according to "ipcs". This is apparently because it doesn't clean up the resources the previous generation created with fcgid_mutex_create(), etc., even after the graceful reload is finished.

If you reload a large number of times without a restart, this can cause problems. This can be demonstrated by running this when using the patch:

while `sleep 5`; do echo -n "Reloading: "; ipcs | wc; systemctl reload apache2; done

The "ipcs" output will increase by 3 lines for each reload, and if you let this run for long enough, you will eventually run out of resources, resulting in something like this:

 Reloading:     186    1019   11366
 Reloading:     189    1036   11556
 Reloading:     192    1053   11746
 Reloading:     195    1070   11936
 Reloading:      18      72     776
 Job for apache2.service failed. See 'systemctl status apache2.service' and 'journalctl -xn' for details.

The error log shows "[core:emerg] [pid 5753] (28)No space left on device: AH00023: Couldn't create the mpm-accept mutex".

A "systemctl restart apache2" frees up the resources, but it would be better if this did not happen.
Comment 11 Robert Mathews 2018-05-22 18:51:08 UTC
Created attachment 35944 [details]
Remove "is mutex filename NULL" check

I found the cause of the leaked semaphores/shared memory in my comment #10.

When using mod_socache_shmcb, "fcgid_mutex_create(&g_sharelock, &g_sharelock_name" will legitimately return NULL for g_sharelock_name. Because of that, it's incorrect to check whether "g_sharelock_name == NULL" on a graceful reload and then avoid reclaiming the previously created mutex.

I've attached a slightly modified version of the patch that avoids the erroneous filename check in two places, and now there are no semaphore/shared memory leaks.

I've been using this version of the patch with no trouble for a couple of weeks on about 20 servers that have received hundreds of graceful restarts and tens of millions of mod_fcgid requests. It completely fixes the original problem as far as I can tell, and should be included in mod_fcgid.
Comment 12 William A. Rowe Jr. 2018-11-07 21:08:32 UTC
Please help us to refine our list of open and current defects; this is a mass update of old and inactive Bugzilla reports which reflect user error, already resolved defects, and still-existing defects in httpd.

As repeatedly announced, the Apache HTTP Server Project has discontinued all development and patch review of the 2.2.x series of releases. The final release 2.2.34 was published in July 2017, and no further evaluation of bug reports or security risks will be considered or published for 2.2.x releases. All reports older than 2.4.x have been updated to status RESOLVED/LATER; no further action is expected unless the report still applies to a current version of httpd.

If your report represented a question or confusion about how to use an httpd feature, an unexpected server behavior, problems building or installing httpd, or working with an external component (a third party module, browser etc.) we ask you to start by bringing your question to the User Support and Discussion mailing list, see [https://httpd.apache.org/lists.html#http-users] for details. Include a link to this Bugzilla report for completeness with your question.

If your report was clearly a defect in httpd or a feature request, we ask that you retest using a modern httpd release (2.4.33 or later) released in the past year. If it can be reproduced, please reopen this bug and change the Version field above to the httpd version you have reconfirmed with.

Your help in identifying defects or enhancements still applicable to the current httpd server software release is greatly appreciated.
Comment 13 William A. Rowe Jr. 2018-11-08 16:24:56 UTC
mod_fcgid is independent of httpd, my apologies this was caught up in the
mass-update. No behavior change is expected between 2.2 and 2.4.

I'm not aware of how much interest there is in a subsequent release but
can confirm the patch does apply, and on a first read-through it does look