Bug 23911 - CGI processes left defunct/zombie under 2.0.54
Status: REOPENED
Alias: None
Product: Apache httpd-2
Classification: Unclassified
Component: Core
Version: 2.2.13
Hardware: All All
Importance: P3 critical
Target Milestone: ---
Assignee: Apache HTTPD Bugs Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2003-10-18 17:33 UTC by David Cook
Modified: 2011-09-14 17:11 UTC

Description David Cook 2003-10-18 17:33:17 UTC
With the latest version of Apache (2.0.47) we are seeing occasional defunct/zombie CGIs.  The 
CGIs can't be killed (they won't die with kill -9 pid) but do die if we kill the parent httpd process.

This seems to me to be identical to bug report 21737 --- but with the 2.0.47 version (21737 was 
for the 1... version).  Since their fix was to alloc.c, and I don't see an alloc.c to compare to, I'm 
seeking help on fixing this problem.

The zombies are not taking up cpu cycles, of course... but do tend to deplete the process count 
pool. We've counted as high as 60 zombies in one situation.  Last night there were 8.
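
A tally of defunct processes per parent can be taken from ps output. The following is only a minimal sketch: the function name zombies_by_parent is made up for illustration, and the usage line assumes a ps that supports the "s" (state) output column, as Linux procps and Solaris ps do.

```shell
# Reads "PID PPID STATE" lines on stdin and prints "PPID COUNT" for every
# parent that has at least one child whose state starts with Z (zombie).
zombies_by_parent() {
    awk '$3 ~ /^Z/ { n[$2]++ } END { for (p in n) print p, n[p] }' | sort -n
}

# Hypothetical usage (assumes ps accepts these -o columns):
#   ps -e -o pid= -o ppid= -o s= | zombies_by_parent
```

Each output line names one parent PID and the number of zombies it has not yet reaped, which shows whether the zombies cluster under one or two httpd children.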
Comment 1 Jeff Trawick 2003-10-19 23:49:31 UTC
suexec or not?

mod_cgi or mod_cgid?

probably doesn't matter, but which MPM?
Comment 2 David Cook 2003-10-20 00:25:10 UTC
configure:12836: checking whether to enable mod_suexec
configure:12888: result: no


config.log:MPM_LIB='server/mpm/prefork/libprefork.la'
config.log:MPM_NAME='prefork'
config.log:MPM_SUBDIR_NAME='prefork'
config.log:#define APACHE_MPM_DIR "server/mpm/prefork"


./httpd -l
Compiled in modules:
  core.c
  mod_access.c
  mod_auth.c
  mod_include.c
  mod_log_config.c
  mod_env.c
  mod_setenvif.c
  mod_ssl.c
  prefork.c
  http_core.c
  mod_mime.c
  mod_status.c
  mod_autoindex.c
  mod_asis.c
  mod_cgi.c
  mod_negotiation.c
  mod_dir.c
  mod_imap.c
  mod_actions.c
  mod_userdir.c
  mod_alias.c
  mod_rewrite.c
  mod_so.c
Comment 3 Jeff Trawick 2003-10-23 19:20:30 UTC
Does this happen even for simple CGIs such as printenv (in cgi-bin dir of
default install), or only for setuid binaries, or what?

Also, can you get a truss of a CGI request, including both the web server child
handling the request and the CGI itself?

Start the server like this:

# truss -o outfile -f ./httpd -DONE_PROCESS

and run a couple of CGI requests, then use ps to see whether or not the zombie
problem occurs, then interrupt truss+httpd.  If this run exhibited the zombie
problem, send in the truss.  If not, you may need to start the server normally,
run truss against one of the children (truss -o outfile -f -p PID) and keep
doing CGI requests until the truss-ed process handles it and we can see the trace.
Comment 4 David Cook 2003-10-24 17:58:42 UTC
It is not specific to any cgi.  It is difficult for us to reproduce this because we can't predict when it 
will happen, and these are public/commercial servers that we don't have the luxury of 
playing with.

Is there something I can do once I get zombies?  The zombies usually belong to one or two 
parents.  If there is information that I can get from that parent for you that would be useful, let me 
know (just killed 27 zombies in fact).
Comment 5 Jeff Trawick 2003-11-07 18:27:51 UTC
I don't know what the next step is, unfortunately.

I've been testing 2.0.47 with default config (prefork, mod_cgi, no suexec) this
afternoon and using printenv as the example cgi.  No long-term zombies. 
printenv goes through zombie state temporarily but Apache cleans it up very soon
after.

I'm curious about how you can tell it isn't specific to some cgi.  All I see
from ps for zombies is

 trawick  6872 29703  0                   0:00 <defunct>

Is it possible that the zombie represents a child process that the CGI script
created, and not the CGI script itself?

Apache parent
  -> Apache child process
        -> CGI script
             -> some command invoked by the CGI

Maybe there is some infrequent condition where the Apache child process
terminates the CGI script before it has reaped status from the command it runs,
and then the Apache child process becomes the parent of the command invoked by
the CGI.  Since the Apache child process doesn't call waitpid() to collect
status from arbitrary processes, then the zombie never gets cleaned up.

Apache will terminate the CGI script with SIGTERM (and later SIGKILL) if the CGI
script keeps running for a while after the client connection drops.

>Is there something I can do once I get zombies?

nothing easy that I know of...
Comment 6 Jeff Trawick 2003-11-07 22:24:22 UTC
I wasn't able to recreate any zombies in this scenario

Apache parent
  -> Apache child process
        -> CGI script
             -> some command invoked by the CGI

when the CGI script exited without reaping status from its child.  (just the way
Unix works I guess)
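
The behaviour behind that result is that a process orphaned by its parent's exit is adopted by init (PID 1, or on Linux possibly a designated subreaper), not by its grandparent, and its adopter reaps it when it exits. A small sketch to observe this outside Apache follows; the function name orphan_ppid is hypothetical, and it assumes a ps that accepts -o ppid= and -p.

```shell
# Print the parent PID of a process orphaned by an intermediate shell.
# The intermediate sh backgrounds a sleep, prints its PID, and exits
# without waiting for it.
orphan_ppid() {
    orphan=$(sh -c 'sleep 3 >/dev/null 2>&1 & echo $!')
    # The command substitution has waited for the intermediate sh,
    # so the sleep is an orphan by the time ps runs.
    ps -o ppid= -p "$orphan" | tr -d ' '
}

# Hypothetical usage:
#   echo "this shell: $$"; echo "orphan's parent: $(orphan_ppid)"
```

The PID printed for the orphan's parent is that of the adopting process rather than the shell that started the chain.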

If you set MaxRequestsPerChild relatively low, won't that take care of zombies?

Another VERY stray thought is to write a simple module that calls waitpid(-1,,)
to try to reap status from any stray child process remaining for any reason. 
Since this is prefork, it shouldn't interfere with any other requests.
Comment 7 Joe Orton 2005-09-07 17:41:30 UTC
Is this reproducible in 2.0.54?
Comment 8 Thomas Martinsen 2005-12-17 03:02:24 UTC
I run Debian stable with apache2 2.0.54. I can confirm that this version leaves
defunct processes every now and then. It does this every few days; the defunct
processes lock up all the Apache processes, and the server becomes
unresponsive and has to be restarted.
Comment 9 Nick Kew 2007-10-07 17:07:33 UTC
Does this affect 2.2.x?
Comment 10 Nick Kew 2010-07-20 09:19:59 UTC
Nearly three years in NEEDINFO, closing old 2.0 report.  If it's not fixed in 2.0.latest, it won't get fixed in 2.0.any.
Comment 11 Alexandre Ferrieux 2011-09-14 17:11:25 UTC
Still there in 2.2.13-1fc11.

I have isolated it: it can be simply reproduced with a cgi containing

    sleep 9999 >/dev/null &

and fixed by redirecting the stderr of the child:

    sleep 9999 >/dev/null 2> /dev/null &

(the stdout redir is needed anyway for the HTTP request to complete)

So it boils down to: CGI exits with the stderr dup'ed over to a lingering child.
I assume this is linked to Apache's capture of each CGI's stderr (for error_log), which does not expect that stream's lifecycle to be decoupled from the CGI process's.
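
The underlying pipe behaviour can be seen without Apache: a reader gets EOF on a pipe only once every process holding the write end has closed it. The sketch below is not Apache's code. cat stands in for whatever reads the CGI's stderr pipe, and the function name stderr_eof_delay is made up for illustration. If the server reads stderr to EOF before calling waitpid() on the CGI (an assumption, in line with the guess above), the exited CGI would stay defunct for as long as the background child lives.

```shell
# Print how many whole seconds a pipe reader waits for EOF on the stderr
# of a command run via sh -c. Only stderr is connected to the pipe;
# stdout is sent to /dev/null.
stderr_eof_delay() {
    start=$(date +%s)
    sh -c "$1" 2>&1 >/dev/null | cat >/dev/null
    end=$(date +%s)
    echo $((end - start))
}

# Hypothetical usage:
#   stderr_eof_delay 'sleep 3 >/dev/null &'               # about 3 seconds
#   stderr_eof_delay 'sleep 3 >/dev/null 2>/dev/null &'   # about 0 seconds
```

In the first call the background sleep inherits the write end of the stderr pipe, so the reader blocks until the sleep exits even though the sh that started it is long gone. In the second call nothing keeps the pipe open once sh exits.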

Apologies if this is not the proper place to reopen. Spank me in that case :)