|Summary:||restart + piped logs stalls httpd for 24 minutes (Solaris)|
|Product:||Apache httpd-1.3||Reporter:||Rainer Jung <rainer.jung>|
|Component:||core||Assignee:||Apache HTTPD Bugs Mailing List <bugs>|
Description Rainer Jung 2006-03-15 20:31:41 UTC
When using piped logs "apachectl restart" stalls apache on Solaris for 1431 seconds. This report is quite lengthy, because I tried to fully analyze and I will explain, why that happens e.g. on Solaris, but not on Linux. Tested on Solaris/Sparc 8, 9 and 10 with Apache 1.3.27 and 1.3.34. Also tested with our own build and with the standard Solaris 10 build. It doesn't matter, which piped logger is being used. Problem found with rotatelogs, but /bin/cat shows the same problem. From the code analysis the bug must be quite old. 1) Piped Logs and children Piped logs for apache 1.3 spawn there children by fork/exec to /bin/sh -c "piped_logger". Under Solaris this results in two processes per pipe. The first one is a child of the main httpd process and is /bin/sh, the second one is a child of /bin/sh and is a piped_logger process. Under Linux there will be only one child per piped_logger, the /bin/sh-process does not exist. I assume, /bin/sh has a very different implementation for handling the "-c" flag for Solaris and for Linux. 2) Signal masks of the logging children All logging children have SIGHUP ignored. This is true for the two process case (Solaris) and the one process case (Linux). I assum this comes from the fact, that the three methods in http_log.c which spawn logging child all set SIGHUP to SIG_IGN before forking. All logging children have SIGUSR1 set to default. This again is true for Solaris and Linux. Now concerning SIGTERM: Under Solaris the /bin/sh child of httpd has a signal handler for SIGTERM. psig shows: TERM caught 0x16c64 0 The second looging child (the piped_logger itself) has SIGTERM set to default. Under Linux the only logging child, the piped_logger has SIGTERM set to default. 3) What happens when restart is done apachectl sends a SIGHUP to the httpd master process. That one itself sends SIGHUP to the full process group. On Solaris and on Linux all logging children ignore SIGHUP, so no effect on the logging children. The httpd processes which are children of the master terminate during SIGHUP. Now in reclaim_child_processes(0) near the end there is a loop which calls waitpid for all direct logging children. The call returns with 0 and then the maintenance method is caled, which sends a TERM signal to the direct logging child. Under Solaris this means, that a TERM signal is being send to the /bin/sh children. These are temselves inside waitid, which gets interrupted and starts again. The signal is not passed along to their child (the piped_loger itself). Under Linux the logging child is the piped_logger itself. It receives the SIGTERM and terminates. Since the /bin/sh (and the piped_logger) did not terminate for Solaris, reclaim_child_processes(0) takes another pass through the tries loop and increases the wait time before callng another maintenance/sending another kill in each pass by a factor of 4. We get waits for ... 13479: timeout: 1.048576000 sec 13479: timeout: 4.194304000 sec 13479: timeout: 16.777216000 sec 13479: timeout: 67.108864000 sec 13479: timeout: 268.435456000 sec 13479: timeout: 1073.741824000 sec The last line is for the 9th iteration. Then the waittime gets too big and as a result, the remaining 3 iterations wait with a timeout of 0 (minor bug). So after about 1431 seconds httpd gives up. It then closes several file descriptors and at that moment the piped_loggers terminate and as a result the /bin/sh also. 4) Why is graceful working? graceful uses SIGUSR1 which is not masked by any of the logging children. So when sending the signal to the process group, all logging children terminate. httpd never has to use the TERM signal like in restart. 5) Why is apache2 working apache2 does not use /bin/sh to spawn the logging children, so even under Solaris the piped logger is a direct child of httpd, so receives SIGTERM and terminates. 6) What can be done? I've no especially good solution. Possible are: - not increasing the timeout that often/to such huge values - sending SIGTERM not only to the children, but instead to the process group - closing the file descriptors earlier, so that the piped loggers have a chance to terminate without the signal I hope you made it to the end. Any Solutions or ideas for discussion?
Comment 1 Ruediger Pluem 2006-03-16 12:01:55 UTC
Setting back to HEAD. I did not notice that the product had been set to "Apache httpd-1.3".
Comment 2 Rainer Jung 2006-03-17 09:29:55 UTC
I found a workaround, which is sufficient for us: I changed the SHELL_PATH in httpd.h from "/bin/sh" to "/usr/xpg4/bin/sh". That's a XPG4 compliant shell which exists on Solaris since Solaris 2.5, so about 10 years. The shell is similar to ksh (see "man ksh") and does *not* fork when called with "-c". As a consequence Solaris then behaves like Linux, only one process is being forked from the httpd for each piped logger and the signal handling is OK. Unfortunately I could not figure out how to pass a 'CFLAGS=-DSHELL_PATH="/usr/xpg4/bin/sh"' successfully to configure, because when I don't escape the '"', they get stripped of, and when I escape them, then mod_ssl configure complains :( I finally just hacked httpd.h before running make. It would be nice to at least document the behaviour: - restart in combination with piped logs is broken on any platform, where "/bin/sh -c" forks (e.g. Solaris) - if the platform has another shell, that does not fork, like /usr/xpg4/bin/sh or /bin/ksh, then defining SHELL_PATH to this shell will help.
Comment 3 Malte S. Stretz 2011-03-21 10:41:15 UTC
Apache HTTP Server 1.3.x is not supported anymore and no bugs will be fixed in the old codebase (cf. <http://mail-archives.apache.org/mod_mbox/httpd-announce/201002.mbox/%3C20100203000334.GA19021@infiltrator.stdlib.net%3E>). Since this bug seems to affect only 1.3.x, I'm closing it as WONTFIX. If this bug still affects you in a recent version (version 2.2.x or the upcoming version 2.4), please open a new bug. Thank you for reporting the bug.