Bug 38989

Summary:	restart + piped logs stalls httpd for 24 minutes (Solaris)
Product:	Apache httpd-1.3	Reporter:	Rainer Jung <rainer.jung>
Component:	core	Assignee:	Apache HTTPD Bugs Mailing List <bugs>
Status:	RESOLVED WONTFIX
Severity:	normal
Priority:	P2
Version:	HEAD
Target Milestone:	---
Hardware:	Sun
OS:	Solaris

Description Rainer Jung 2006-03-15 20:31:41 UTC

When using piped logs "apachectl restart" stalls apache on Solaris for 1431 seconds.

This report is quite lengthy, because I tried to fully analyze and I will
explain, why that happens e.g. on Solaris, but not on Linux.

Tested on Solaris/Sparc 8, 9 and 10 with Apache 1.3.27 and 1.3.34. Also tested
with our own build and with the standard Solaris 10 build. It doesn't matter,
which piped logger is being used. Problem found with rotatelogs, but /bin/cat
shows the same problem.

From the code analysis the bug must be quite old.

1) Piped Logs and children

Piped logs for apache 1.3 spawn there children by fork/exec to /bin/sh -c
"piped_logger". Under Solaris this results in two processes per pipe. The first
one is a child of the main httpd process and is /bin/sh, the second one is a
child of /bin/sh and is a piped_logger process.

Under Linux there will be only one child per piped_logger, the /bin/sh-process
does not exist. I assume, /bin/sh has a very different implementation for
handling the "-c" flag for Solaris and for Linux.

2) Signal masks of the logging children

All logging children have SIGHUP ignored. This is true for the two process case
(Solaris) and the one process case (Linux). I assum this comes from the fact,
that the three methods in http_log.c which spawn logging child all set SIGHUP to
SIG_IGN before forking.

All logging children have SIGUSR1 set to default. This again is true for Solaris
and Linux.

Now concerning SIGTERM: Under Solaris the /bin/sh child of httpd has a signal
handler for SIGTERM. psig shows:

TERM    caught  0x16c64         0

The second looging child (the piped_logger itself) has SIGTERM set to default.

Under Linux the only logging child, the piped_logger has SIGTERM set to default.

3) What happens when restart is done

apachectl sends a SIGHUP to the httpd master process. That one itself sends
SIGHUP to the full process group. On Solaris and on Linux all logging children
ignore SIGHUP, so no effect on the logging children. The httpd processes which
are children of the master terminate during SIGHUP.

Now in reclaim_child_processes(0) near the end there is a loop which calls
waitpid for all direct logging children. The call returns with 0 and then the
maintenance method is caled, which sends a TERM signal to the direct logging child.

Under Solaris this means, that a TERM signal is being send to the /bin/sh
children. These are temselves inside waitid, which gets interrupted and starts
again. The signal is not passed along to their child (the piped_loger itself).

Under Linux the logging child is the piped_logger itself. It receives the
SIGTERM and terminates.

Since the /bin/sh (and the piped_logger) did not terminate for Solaris,
reclaim_child_processes(0) takes another pass through the tries loop and
increases the wait time before callng another maintenance/sending another kill
in each pass by a factor of 4. We get waits for

...
13479:          timeout: 1.048576000 sec
13479:          timeout: 4.194304000 sec
13479:          timeout: 16.777216000 sec
13479:          timeout: 67.108864000 sec
13479:          timeout: 268.435456000 sec
13479:          timeout: 1073.741824000 sec

The last line is for the 9th iteration. Then the waittime gets too big and as a
result, the remaining 3 iterations wait with a timeout of 0 (minor bug).

So after about 1431 seconds httpd gives up. It then closes several file
descriptors and at that moment the piped_loggers terminate and as a result the
/bin/sh also.

4) Why is graceful working?

graceful uses SIGUSR1 which is not masked by any of the logging children. So
when sending the signal to the process group, all logging children terminate.
httpd never has to use the TERM signal like in restart.

5) Why is apache2 working

apache2 does not use /bin/sh to spawn the logging children, so even under
Solaris the piped logger is a direct child of httpd, so receives SIGTERM and
terminates.

6) What can be done?

I've no especially good solution. Possible are:

- not increasing the timeout that often/to such huge values
- sending SIGTERM not only to the children, but instead to the process group
- closing the file descriptors earlier, so that the piped loggers have a chance
to terminate without the signal

I hope you made it to the end. Any Solutions or ideas for discussion?

Comment 1 Ruediger Pluem 2006-03-16 12:01:55 UTC

Setting back to HEAD. I did not notice that the product had been set to "Apache
httpd-1.3".

Comment 2 Rainer Jung 2006-03-17 09:29:55 UTC

I found a workaround, which is sufficient for us: I changed the SHELL_PATH in
httpd.h from "/bin/sh" to "/usr/xpg4/bin/sh". That's a XPG4 compliant shell
which exists on Solaris since Solaris 2.5, so about 10 years.

The shell is similar to ksh (see "man ksh") and does *not* fork when called with
"-c".

As a consequence Solaris then behaves like Linux, only one process is being
forked from the httpd for each piped logger and the signal handling is OK.

Unfortunately I could not figure out how to pass a
'CFLAGS=-DSHELL_PATH="/usr/xpg4/bin/sh"' successfully to configure, because when
I don't escape the '"', they get stripped of, and when I escape them, then
mod_ssl configure complains :(

I finally just hacked httpd.h before running make.

It would be nice to at least document the behaviour:

- restart in combination with piped logs is broken on any platform,
  where "/bin/sh -c" forks (e.g. Solaris)
- if the platform has another shell, that does not fork, like
  /usr/xpg4/bin/sh or /bin/ksh, then defining SHELL_PATH to
  this shell will help.

Comment 3 Malte S. Stretz 2011-03-21 10:41:15 UTC

Apache HTTP Server 1.3.x is not supported anymore and no bugs will be fixed in the old codebase (cf. <http://mail-archives.apache.org/mod_mbox/httpd-announce/201002.mbox/%3C20100203000334.GA19021@infiltrator.stdlib.net%3E>). Since this bug seems to affect only 1.3.x, I'm closing it as WONTFIX.

If this bug still affects you in a recent version (version 2.2.x or the upcoming version 2.4), please open a new bug.

Thank you for reporting the bug.