21737 – cgi process defunct

Summary: cgi process defunct

Status:	CLOSED FIXED

Alias:	None

Product:	Apache httpd-1.3
Classification:	Unclassified
Component:	core
Version:	1.3.28
Hardware:	Other Linux

Importance:	P3 critical
Target Milestone:	---
Assignee:	Apache HTTPD Bugs Mailing List

URL:
Keywords:

Duplicates (3):	21739 21746 21926
Depends on:
Blocks:

Reported:	2003-07-20 06:13 UTC by rcs
Modified:	2004-11-16 19:05 UTC
CC List:	11 users

Attachments
reverse what happened in 1.3.28 (2.14 KB, patch) 2003-07-20 18:07 UTC, J. Nick Koston
instead of reverse ... fix (642 bytes, patch) 2003-07-20 18:35 UTC, J. Nick Koston
A program to test errno after signal to zombie process. (345 bytes, text/plain) 2003-09-04 00:52 UTC, Eric Seidel

Description rcs 2003-07-20 06:13:00 UTC

CGI processes are not being cleaned up and remain in a zombie state until the
parent process is terminated. This has been observed on more than 5 servers.

Comment 1 J. Nick Koston 2003-07-20 18:07:18 UTC

Created attachment 7406
reverse what happened in 1.3.28

Comment 2 J. Nick Koston 2003-07-20 18:08:05 UTC

I've been having the same problem.  Please try that patch I attached.

Comment 3 J. Nick Koston 2003-07-20 18:19:55 UTC

The problem is that when running under suexec you cannot send the process
SIGTERM, since Apache and the script are running with different uids.  Apache
1.3.28 believes the process is dead because the kill failed and sets
p_kill_how=kill_never.  It really should check for ESRCH.
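
To illustrate the distinction, here is a minimal sketch in plain POSIX C. It is not the code from alloc.c; try_terminate(), probe_reaped_child() and the enum are hypothetical names made up for this example. The idea is that only ESRCH should be read as "the process is gone", while any other failure (EPERM in the suexec case) says nothing about whether the child is still alive.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical names for illustration; none of this is taken from alloc.c. */
enum term_result {
    TERM_SENT,     /* signal delivered; the child still has to be reaped */
    TERM_GONE,     /* ESRCH: no such process */
    TERM_UNKNOWN   /* EPERM (e.g. different uid under suexec) or another error */
};

enum term_result try_terminate(pid_t pid)
{
    assert(pid > 0);    /* 0 and negative pids do not address a single process */
    if (kill(pid, SIGTERM) == 0)
        return TERM_SENT;
    return (errno == ESRCH) ? TERM_GONE : TERM_UNKNOWN;
}

/* Fork a child that exits at once, reap it, then signal the stale pid. */
enum term_result probe_reaped_child(void)
{
    pid_t pid = fork();
    if (pid == 0)
        _exit(0);
    if (pid < 0)
        return TERM_UNKNOWN;
    while (waitpid(pid, NULL, 0) < 0 && errno == EINTR)
        ;               /* retry if interrupted by a signal */
    return try_terminate(pid);
}
```

A caller would give up on the child only for TERM_GONE and keep it on the cleanup list for the other two results.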

Comment 4 J. Nick Koston 2003-07-20 18:35:01 UTC

Created attachment 7407
instead of reverse ... fix

Comment 5 J. Nick Koston 2003-07-20 18:36:56 UTC

The second patch implements a fix instead of reverting things back to the way
1.3.27 handled things.

Comment 6 J. Nick Koston 2003-07-20 18:48:10 UTC

This probably should be critical as it causes servers to crash.

Comment 7 J. Nick Koston 2003-07-20 18:50:01 UTC

*** Bug 21739 has been marked as a duplicate of this bug. ***

Comment 8 rcs 2003-07-20 18:52:14 UTC

I don't think the patch works (either of them). Still got new defuncts.
However I tested on one server only.

Comment 9 rcs 2003-07-20 18:58:00 UTC

Sorry about the severity change; I'm new to Bugzilla.

(Although I haven't experienced any crashes on 7 servers, some with multiple
instances of Apache.)

Comment 10 J. Nick Koston 2003-07-20 20:01:07 UTC

Can you get it to reliably create a defunct process by doing the following?

> a) Put this script in your cgi-bin
> --cut here--
> #!/usr/bin/php
> <?php phpinfo(); ?>
> --cut here--
> 
> b) Go to it in your web browser.  Click reload over and over again (this
> will eventually cause a SIGPIPE).
> 
> c) Watch the zombies build up

Comment 11 rcs 2003-07-20 20:51:54 UTC

The problem is fixed; I forgot I had another Apache on that server. Sorry for
the extra work I caused.

Comment 12 J. Nick Koston 2003-07-20 21:13:34 UTC

Just to confirm, the patched apache is working correctly?

Comment 13 J. Nick Koston 2003-07-20 21:14:41 UTC

*** Bug 21746 has been marked as a duplicate of this bug. ***

Comment 14 rcs 2003-07-20 21:24:07 UTC

I'm not sure now. I manually patched (with the second patch) a few servers and
they look fine; one cPanel server (with the first patch) still produces
defuncts. Not sure yet if it's me or the patch.

Comment 15 rcs 2003-07-20 21:27:45 UTC

A second cPanel server also produces defuncts.

Comment 16 rcs 2003-07-20 21:30:22 UTC

The second patch also produces defuncts. So, to sum up, the patches don't work.

Comment 17 rcs 2003-07-20 21:40:20 UTC

I can create a defunct process with the script above about 80% of the time.
The defunct process is gone very quickly now.

Comment 18 J. Nick Koston 2003-07-20 22:06:54 UTC

You should probably still get defunct processes (the same thing happens in
1.3.27 and below), but they shouldn't exist for more than 5 or so seconds,
because there is still a wait between the time SIGPIPE is received and the
timeout.  This should be OK though.

Comment 19 John 2003-07-20 22:17:49 UTC

> You should probably still get defunct processes (the same thing happens in
> 1.3.27 and below), but they shouldn't exist for more than 5 or so seconds,
> because there is still a wait between the time SIGPIPE is received and the
> timeout.

Oh no!
The defunct processes aren't cleaned up automatically!

Comment 20 J. Nick Koston 2003-07-21 00:27:48 UTC

John:  do you still see them lingering after using the patch?

Comment 21 Cliff Woolley 2003-07-21 02:25:16 UTC

Just so you know, we (the dev team) have seen this bug report and are looking into it.  Thanks 
for the detailed investigation!

Comment 22 147099.vserver.de 2003-07-21 10:56:51 UTC

The patch does not help!

Comment 23 Tim Greer 2003-07-22 06:32:51 UTC

This appears to be the modified PHP-suexec patch you have that is broken.  Note 
this is likely only a problem for users that use Cpanel and the Apache version 
with the newly updated alloc.c patch--looking at it, it is a source of more 
problems.  You are looking in the wrong place.  The php-suexec patch that I've 
modified myself and use with Apache 1.3.28 does not suffer from these problems 
at all.  I believe that is the cause of your problems and your alloc.c patch 
causes more (look at the modifications to see why!)  Maybe I'm wrong, but this 
is how it appears looking at the patch and the alloc.c patch as well seems to 
cause yet more issues.

Comment 24 Tim Greer 2003-07-22 06:35:35 UTC

I.e., it's ignoring -USR1 hup's and so on, which is also conflicting with your 
control panel on new account set ups and so on, which is causing more users to 
submit reports about this version of Apache having this bug, which beyond the 
PHP-SuEXEC patch and the messed up alloc.c patch, doesn't exist (*from what I've 
seen*--I may be wrong, but it _is_ a source of additional problems).

Comment 25 Tim Greer 2003-07-22 06:58:44 UTC

While I have actually been able to confirm this on .28 with the PHP-suexec patch 
under specific circumstances--I am not able to reproduce the problem for .28 nor 
.27 without that patch being implemented.

Has anyone experienced this issue that is not using the PHP suexec patch from 
http://www.localhost.nl/patches/ (or a similar source), be it a modified version 
of this patch or not?  I.e., the problem exists on installs without this?  I've 
not been able to reproduce it without this patch being implemented.

Either way, the alloc.c patch is not the solution, at least not a complete 
solution and opens up other problems, with some tests.  More information about 
that later, if it's needed.

Comment 26 cheewai 2003-07-23 01:56:42 UTC

Not sure why the comments I added are not in.

I can confirm the problem as well. My setup is stock Apache 1.3.28 (no php-suexec
patch) with modssl 2.8.15-1.3.28 and PHP 4.3.2. A simple test script that only
displays "hello world" will stay as a defunct process owned by the suEXEC user.
The defunct process only went away after a restart of the Apache processes.

Comment 27 Tim Greer 2003-07-23 02:06:02 UTC

Odd, I've not been able to recreate this on non "PHP for CGI w/ SuEXEC" patched 
systems, but since you experience it as well, I personally assume it relates 
to CGI and SuEXEC.  Has anyone confirmed this on non-SuEXEC enabled installs?

Comment 28 J. Nick Koston 2003-07-23 05:37:03 UTC

I'm not sure why the first patch would cause USR1 not to work.  The patch just
reverts parts of alloc.c to 1.3.27.

Comment 29 Tim Greer 2003-07-23 06:09:12 UTC

I was mistaken.  This was not related.  The patch did cause other problems 
though.  I will attempt to recreate the problem in various ways and report it in 
the very near future.  However, since this doesn't seem to be the overall 
solution and more of a quick fix, I suppose there's no need.

Comment 30 J. Nick Koston 2003-07-23 06:23:42 UTC

Which patch are you using?  The first one or the second one?

I've reviewed the second one:

http://nagoya.apache.org/bugzilla/showattachment.cgi?attach_id=7407

and I can't see how it could cause a problem.   I'm not so sure about the first
one though.

Comment 31 Christian Noack 2003-07-23 07:21:18 UTC

I confirm the bug for Apache 1.3.28/mod_ssl-2.8.15-1.3.28/php-4.3.1 running on
OpenBSD 3.2. The fix (7407) seems to be working here.

Comment 32 cheewai 2003-07-23 10:14:21 UTC

I applied patch 7407 and it seems to fix the problem. Now the suEXEC process
no longer stays in a zombie state.

Comment 33 Mads Toftum 2003-07-28 10:33:28 UTC

*** Bug 21926 has been marked as a duplicate of this bug. ***

Comment 34 tchesmeli 2003-07-28 10:54:35 UTC

With the second patch: 
http://nagoya.apache.org/bugzilla/showattachment.cgi?attach_id=7407 
it seems to be OK. 
I get some defunct processes, but they are killed within a few seconds. Looks good, thanks.

Comment 35 EvE 2003-07-28 13:06:42 UTC

We had severe zombie problems as well after upgrading to 1.3.28 (on Solaris 
2.8). Compiling with the second (7407) patch solved the zombie problem for us 
as well.

Comment 36 Tim Greer 2003-07-28 15:25:47 UTC

This second 'patch' is just reverting back to the alloc.c file for .27 instead 
of .28.  Has anyone noted any impact due to this?  This does seem to help, but 
not completely remove the issue.  Also, bypassing the function(s) in .28--does 
this matter?  It seems to be a logic error.

Comment 37 J. Nick Koston 2003-07-29 05:01:47 UTC

The first patch is the one that revents, the second one is the one that should
fix the problem with keeping the current logic.

Comment 38 J. Nick Koston 2003-07-29 05:02:38 UTC

s/revents/reverts (gee 1am)

Comment 39 Tim Greer 2003-07-29 05:45:08 UTC

I meant first, not second. :-)

Comment 40 EvE 2003-07-31 09:42:36 UTC

What will happen with this patch in the future? Will this be part of 1.3.29 or 
will it become an official patch? We like to know that before we start 
upgrading all our other apache 1.3.27 instances.

Comment 41 Tim Greer 2003-07-31 15:50:13 UTC

Though I am not the person that would authorize anything as official or have 
any control over what Apache does, and I don't personally assume this would be 
implemented in .29 or be official (what do I know), I still recommend you 
upgrade to .28 and implement this patch if you find you have to (or create some 
solution yourself that maybe you feel more comfortable with otherwise).  You 
don't want to stick with .27 at this point anyway.

Comment 42 JR 2003-08-01 20:14:15 UTC

For me, patch 7407 works like a dream. I have yet to see any zombie processes 
with my PHP script or with the "phpinfo()" script mentioned above. Before, I 
could reproduce the zombies quite easily - about 50% of the time with my script 
and 100% of the time with the "phpinfo()" script.

BTW, I'm not using any "PHP-suexec" patch.

Comment 43 Tim Greer 2003-08-01 22:48:33 UTC

Hi Jordan,

I'm curious, do you have suexec enabled?  If so, do you have this problem 
without it enabled?

Comment 44 Tim Greer 2003-08-01 22:51:34 UTC

Sorry if that sounded obvious to ask, given the history of the reports. :-)

Comment 45 JR 2003-08-01 23:14:25 UTC

Yes, I use suexec.

I just tested the phpinfo() script on a virtual site that does NOT use suexec, 
and I get no zombies while holding down the browser's Refresh key (F5 in 
Internet Explorer).

When the same script is run on a suexec site, I get about 10 zombies per second 
(great potential for DoS?). Patch 7407 fixes this for me.

Comment 46 Tim Greer 2003-08-01 23:20:40 UTC

How about on a build without suexec compiled in as an option, rather than a 
site without it enabled on a build that has it?  I'm just testing out a few 
theories that likely have little to do with your problem, but I'd like to see if 
anyone sees this on a non-suexec build... and if so, how long they take to die 
off.  If you don't mind, anyway.  Thanks.

Comment 47 JR 2003-08-01 23:30:33 UTC

I tested the phpinfo() script on a 1.3.28 server that doesn't have suexec 
compiled in, and got no zombies.

Comment 48 Ruud van Melick 2003-08-02 17:39:00 UTC

Ralf S. Engelschall posted a different patch on apache-httpd-dev:
http://marc.theaimsgroup.com/?l=apache-httpd-dev&m=105952652425849&w=2

Is that patch better than 7407?

Comment 49 J. Nick Koston 2003-08-02 17:55:37 UTC

Either one should work just as well

Comment 50 Tim Greer 2003-08-02 20:17:05 UTC

Yes, it would be a better overall patch, though, as he stated himself, the 
original one works, but could be improved--which it seems it was.  I'd recommend 
using this other patch by Ralf overall.

Comment 51 Erin Fortenberry 2003-08-14 19:57:28 UTC

The FreeBSD port apache13-modssl (Apache 1.3.28 + ModSSL 2.8.15) has been 
updated to include the 7407 patch. My server with heavy CGI (Perl and PHP) was 
creating zombies left and right, about 5000/hour.

I like the 7407 patch. It made me happy again.

Comment 52 Ari Pollak 2003-09-02 21:17:37 UTC

This patch does not work for me. I'm using suEXEC, mod_ssl compiled as a DSO,
mod_frontpage_mirfak, mod_gzip, mod_pointer, and mod_throttle, all compiled as
DSO modules.

Comment 53 Ari Pollak 2003-09-02 21:20:17 UTC

I should note that the patch given on the mailing list works fine.

Comment 54 Jeff Trawick 2003-09-03 13:50:45 UTC

This is the patch that actually got committed (yesterday):

http://cvs.apache.org/viewcvs.cgi/apache-1.3/src/main/alloc.c.diff?r1=1.145&r2=1.146

It is slightly different than the patches that were posted to the mailing
list, but it addresses all known concerns.

Comment 55 Eric Seidel 2003-09-03 22:12:53 UTC

I applied the patch for alloc.c as is in the CVS tree to an APACHE_1_3_27 source.  It failed to
correct the problem for me under Mac OS X 10.3... investigating further...

Comment 56 Eric Seidel 2003-09-04 00:51:14 UTC

I have confirmed, this condition still exists on Mac OS X 10.3 with this patch.  I patched the 1.3.28 
sources (I erred in my comment above).  And was still able to reproduce this.

I did extensive further testing and found that both Mac OS X and FreeBSD violate the POSIX 
specification for kill() and return ESRCH when sending a signal to a zombie process.

This violation introduces a race condition with this patched code, as a process could finish (become 
a zombie) after the NEED_WAITPID "waitpid" cleanup, but before the ap_os_kill() call, and thus return 
ESRCH, be marked as kill_never, and then never be cleaned up.

Although it is my hope that Mac OS X 10.3 final will have fixed this error, Apache is still left with 
an interoperability problem on Mac OS X 10.2 and likely FreeBSD (as they share this same 
violation).

I have attached my program "main.c" which tests for this phenomenon.
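
For readers without access to the attachment, a check of this kind can be sketched as follows. This is not the attached main.c; errno_after_signal_to_zombie() is a hypothetical name, and the sketch assumes a system that provides waitid() with WNOWAIT, which is used here to wait for the child to exit without reaping it, so that it is guaranteed to be a zombie when kill() is called.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 0 if kill() on a zombie child succeeds, otherwise the errno it set
 * (ESRCH on the systems described above). Returns -1 if the setup fails.
 */
int errno_after_signal_to_zombie(void)
{
    siginfo_t info;
    int result;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(0);                       /* child: exit immediately */

    /* Wait until the child has exited, but leave it unreaped (a zombie). */
    while (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT) < 0) {
        if (errno != EINTR)
            return -1;
    }
    assert(info.si_pid == pid);         /* the exited child is the one we forked */

    result = (kill(pid, SIGTERM) == 0) ? 0 : errno;

    /* Reap the zombie so the check itself does not leak a process. */
    while (waitpid(pid, NULL, 0) < 0 && errno == EINTR)
        ;
    return result;
}
```

A return value of 0 means the signal to the zombie was accepted; a return value of ESRCH is the behaviour that opens the race described above.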

Comment 57 Eric Seidel 2003-09-04 00:52:21 UTC

Created attachment 8055
A program to test errno after signal to zombie process.

Comment 58 Eric Seidel 2003-09-04 15:51:42 UTC

Thinking about this a little more... I think there are two options here:

1.  Come up with a solution that does not depend on this behaviour of kill...  (moving the waitpid to after 
the kill call should be sufficient?)

2.  Add a configure.in rule to check whether kill is POSIX compliant and conditionally deal with its 
compliance or non-compliance accordingly.  There may be other systems (besides OS X and 
FreeBSD) which have this ESRCH behaviour when sending to zombies...

Comment 59 Eric Seidel 2003-09-04 19:14:44 UTC

Here is a "hack" which fixes the problem on Mac OS X (likely FreeBSD as well).

-           if (ap_os_kill(p->pid, SIGTERM) == -1) {
+           if ( (ap_os_kill(p->pid, SIGTERM) == -1) && (errno == ESRCH) ) {
+               // in case ESRCH means "zombie".
+                waitpid(p->pid, (int *) 0, 0);

Comment 60 Eric Seidel 2003-09-05 02:59:35 UTC

Unless I'm mistaken, can't kill return EPERM for setuid processes?  Wouldn't that also leak?

Comment 61 Jeff Trawick 2003-09-05 11:18:48 UTC

Another data point: I received an e-mail from a Tru64 user indicating that the
patch as committed failed there too, and that Ralf's patch worked fine.

Comment 62 Jim Jagielski 2003-09-05 12:44:25 UTC

Because of the bogus way some OSes report errors from kill(), I've changed us to simply kill 
and wait.
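
As a rough illustration of the "kill and wait" idea (not the committed alloc.c change, which may well differ), the sketch below ignores whatever kill() reports and always calls waitpid(), so neither EPERM nor a spurious ESRCH can leave a zombie behind. terminate_and_reap() and demo_no_zombie_left() are hypothetical names, and the blocking wait is a simplification.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: signal the child, then reap it whatever kill() said. */
int terminate_and_reap(pid_t pid)
{
    assert(pid > 0);
    (void)kill(pid, SIGTERM);   /* result deliberately ignored */

    /* Blocks until the child exits; a real server would bound this wait. */
    while (waitpid(pid, NULL, 0) < 0) {
        if (errno != EINTR)
            return -1;          /* e.g. ECHILD: not our child, or already reaped */
    }
    return 0;
}

/* Returns 1 if, after terminate_and_reap(), no child is left to wait for. */
int demo_no_zombie_left(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        signal(SIGTERM, SIG_DFL);   /* make sure SIGTERM terminates the child */
        for (;;)
            pause();
    }
    if (terminate_and_reap(pid) != 0)
        return 0;
    /* A second wait must fail with ECHILD if the child was really reaped. */
    return waitpid(pid, NULL, WNOHANG) == -1 && errno == ECHILD;
}
```

The trade-off in this sketch is that waitpid() blocks for as long as the child keeps running, for example if it ignores SIGTERM, so production code would need a timeout or an escalation to SIGKILL.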