CGI processes are not being cleaned up and end up in a zombie state until the parent process is terminated. This has been observed on more than 5 servers.
Created attachment 7406 [details] reverse what happened in 1.3.28
I've been having the same problem. Please try that patch I attached.
The problem is that when running under suexec you cannot send the process SIGTERM, since Apache and the script are running with different uids. Apache 1.3.28 believes the process is dead because the kill failed, and sets p_kill_how=kill_never. It really should look for ESRCH specifically.
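To illustrate the distinction described above, here is a minimal sketch of telling "process is gone" apart from "process exists but we may not signal it" after a failed kill(). The names `proc_state`, `classify_kill` and `signal_and_classify` are hypothetical and are not taken from Apache's alloc.c; this is not the attached patch.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical names for illustration; not part of Apache. */
enum proc_state {
    PROC_SIGNALLED,      /* kill() succeeded */
    PROC_GONE,           /* ESRCH: no such process */
    PROC_NOT_PERMITTED,  /* EPERM: process exists, but we may not signal it */
    PROC_UNKNOWN         /* any other error */
};

/* Interpret the return value and errno of a kill() call. */
enum proc_state classify_kill(int rc, int err)
{
    if (rc == 0)
        return PROC_SIGNALLED;
    if (err == ESRCH)
        return PROC_GONE;
    if (err == EPERM)
        return PROC_NOT_PERMITTED;
    return PROC_UNKNOWN;
}

/* Send sig to pid and report what the result says about the process. */
enum proc_state signal_and_classify(pid_t pid, int sig)
{
    int rc = kill(pid, sig);
    return classify_kill(rc, rc == -1 ? errno : 0);
}
```

Under suexec the expected failure would be EPERM, so a caller using something like this would keep the child on its list for a later waitpid() instead of marking it as never to be killed.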
Created attachment 7407 [details] instead of reverse ... fix
The second patch implements a fix instead of reverting things back to the way 1.3.27 handled things.
This probably should be critical as it causes servers to crash.
*** Bug 21739 has been marked as a duplicate of this bug. ***
I don't think the patch works (either of them). Still got new defuncts. However I tested on one server only.
Sorry about the severity change, I'm new to Bugzilla. (Although I haven't experienced any crashes on 7 servers, some with multiple instances of Apache.)
Can you get it to reliably create a defunct process by doing the following?
> a) Put this script in your cgi-bin
> --cut here--
> #!/usr/bin/php
> <?php phpinfo(); ?>
> --cut here--
>
> b) Go to it in your web browser. Click reload over and over again (this
> will eventually cause a SIGPIPE).
>
> c) Watch the zombies build up
The problem is fixed. I forgot I had another Apache on that server. Sorry for the extra work I caused.
Just to confirm, the patched apache is working correctly?
*** Bug 21746 has been marked as a duplicate of this bug. ***
I'm not sure now. I manually patched a few servers (with the second patch) and they look fine; one cPanel server (with the first patch) still gets defuncts. Not sure yet if it's me or the patch.
The second cPanel server gets defuncts too.
The second patch also gets defuncts. So to sum up, the patches don't work.
I can create a defunct process with the script above and get it about 80% of the time. The defunct process is gone very quickly now.
You should probably still get defunct processes (the same thing happens in 1.3.27 and below), but they shouldn't exist for more than 5 or so seconds, because there is still a wait time between the time SIGPIPE is received and the timeout. This should be OK though.
> You should probably still get defunct processes (the same thing happens in
> 1.3.27 and below), but they shouldn't exist for more than 5 or so seconds, because
> there is still a wait time between the time SIGPIPE is received and the timeout.
Oh no! The defunct processes won't be cleaned up automatically!
John: do you still see them lingering after using the patch?
Just so you know, we (the dev team) have seen this bug report and are looking into it. Thanks for the detailed investigation!
The patch does not help!
This appears to be the modified PHP-suexec patch you have that is broken. Note this is likely only a problem for users that use Cpanel and the Apache version with the newly updated alloc.c patch--looking at it, it is a source of more problems. You are looking in the wrong place. The php-suexec patch that I've modified myself and use with Apache 1.3.28 does not suffer from these problems at all. I believe that is the cause of your problems and your alloc.c patch causes more (look at the modifications to see why!) Maybe I'm wrong, but this is how it appears looking at the patch and the alloc.c patch as well seems to cause yet more issues.
I.e., it's ignoring -USR1 hup's and so on, which is also conflicting with your control panel on new account set ups and so on, which is causing more users to submit reports about this version of Apache having this bug, which beyond the PHP-SuEXEC patch and the messed up alloc.c patch, doesn't exist (*from what I've seen*--I may be wrong, but it _is_ a source of additional problems).
While I have actually been able to confirm this on .28 with the PHP-suexec patch under specific circumstances--I am not able to reproduce the problem for .28 nor .27 without that patch being implemented. Has anyone experienced this issue that is not using the PHP suexec patch from http://www.localhost.nl/patches/ (or a similar source), be it a modified version of this patch or not? I.e., the problem exists on installs without this? I've not been able to reproduce it without this patch being implemented. Either way, the alloc.c patch is not the solution, at least not a complete solution and opens up other problems, with some tests. More information about that later, if it's needed.
Not sure why the comments I added are not in. I can confirm the problem as well. My setup is stock Apache 1.3.28 (no php-suexec patch) with mod_ssl 2.8.15-1.3.28 and PHP 4.3.2. A simple test script that only displays "hello world" will stay as a defunct process owned by the suExec user. The defunct process only went away after a restart of the Apache processes.
Odd, I've not been able to recreate this on non "PHP for CGI w/ SuEXEC" patched systems, but since you experience it as well, I personally assume it relates to CGI and SuEXEC. Has anyone confirmed this on non-SuEXEC enabled installs?
I'm not sure why the first patch would cause USR1 not to work. The patch just reverts parts of alloc.c to 1.3.27.
I was mistaken. This was not related. The patch did cause other problems though. I will attempt to recreate the problem in various ways and report it in the very near future. However, since this doesn't seem to be the overall solution and more of a quick fix, I suppose there's no need.
Which patch are you using? The first one or the second one. I've reviewed the second one: http://nagoya.apache.org/bugzilla/showattachment.cgi?attach_id=7407 and I can't see how it could cause a problem. I'm not so sure about the first one though.
I confirm the bug for Apache 1.3.28/mod_ssl-2.8.15-1.3.28/php-4.3.1 running on OpenBSD 3.2. The fix (7407) seems to be working here.
I applied the patch 7407 and it seems to fix the problem. Now the suEXEC process no longer stayed in zombie state.
*** Bug 21926 has been marked as a duplicate of this bug. ***
With the second patch: http://nagoya.apache.org/bugzilla/showattachment.cgi?attach_id=7407 it seems to be OK. I got some defunct processes but they are killed in a few seconds. Looks good, thanks.
We had severe zombie problems as well after upgrading to 1.3.28 (on Solaris 2.8). Compiling with the second (7407) patch solved the zombie problem for us as well.
This second 'patch' is just reverting back to the alloc.c file for .27 instead of .28. Has anyone noted any impact due to this? This does seem to help, but not completely remove the issue. Also, bypassing the function(s) in .28--does this matter? It seems to be a logic error.
The first patch is the one that revents, the second one is the one that should fix the problem with keeping the current logic.
s/revents/reverts (gee 1am)
I meant first, not second. :-)
What will happen with this patch in the future? Will this be part of 1.3.29 or will it become an official patch? We would like to know that before we start upgrading all our other Apache 1.3.27 instances.
Though I am not the person that would authorize anything as official or have any control over what Apache does, and I don't personally assume this would be implemented in .29 or be official (what do I know), I still recommend you upgrade to .28 and implement this patch if you find you have to (or create some solution yourself that maybe you feel more comfortable with otherwise). You don't want to stick with .27 at this point anyway.
For me, patch 7407 works like a dream. I have yet to see any zombie processes with my PHP script or with the "phpinfo()" script mentioned above. Before, I could reproduce the zombies quite easily - about 50% of the time with my script and 100% of the time with the "phpinfo()" script. BTW, I'm not using any "PHP-suexec" patch.
Hi Jordan, I'm curious, do you have suexec enabled? If so, do you have this problem without it enabled?
Sorry if that sounded obvious to ask, given the history of the reports. :-)
Yes, I use suexec. I just tested the phpinfo() script on a virtual site that does NOT use suexec, and I get no zombies while holding down the browser's Refresh key (F5 in Internet Explorer). When the same script is run on a suexec site, I get about 10 zombies per second (great potential for DoS?). Patch 7407 fixes this for me.
How about on a build without suexec compiled in as an option, rather than a site without it enabled on a build that has it? I'm just testing out a few theories that likely have little to do with your problem, but I'd like to see if anyone sees this on a non-suexec build... and if so, how long they take to die off. If you don't mind anyway. Thanks.
I tested the phpinfo() script on a 1.3.28 server that doesn't have suexec compiled in, and got no zombies.
Ralf S. Engelschall posted a different patch on apache-httpd-dev: http://marc.theaimsgroup.com/?l=apache-httpd-dev&m=105952652425849&w=2 Is that patch better than 7407?
Either one should work just as well
Yes, it would be a better overall patch, though, as he stated himself, the original one works, but could be improved--which it seems it was. I'd recommend using this other patch by Ralf overall.
The FreeBSD port apache13-modssl (Apache 1.3.28 + ModSSL 2.8.15) has been updated to include the 7407 patch. My server with heavy CGI (Perl and PHP) was creating zombies left and right, about 5000/hour. I like the 7407 patch. It made me happy again.
This patch does not work for me. I'm using suEXEC, mod_ssl compiled as a DSO, mod_frontpage_mirfak, mod_gzip, mod_pointer, and mod_throttle, all compiled as DSO modules.
I should note that the patch given on the mailing list works fine.
This is the patch that actually got committed (yesterday): http://cvs.apache.org/viewcvs.cgi/apache-1.3/src/main/alloc.c.diff?r1=1.145&r2=1.146 It is slightly different than the patches that were posted to the mailing list, but it addresses all known concerns.
I applied the patch for alloc.c as it is in the CVS tree to an APACHE_1_3_27 source. It failed to correct the problem for me under Mac OS X 10.3... investigating further...
I have confirmed that this condition still exists on Mac OS X 10.3 with this patch. I patched the 1.3.28 sources (I erred in my comment above) and was still able to reproduce it. I did extensive further testing and found that both Mac OS X and FreeBSD violate the POSIX specification for kill() by returning ESRCH when a signal is sent to a zombie process. This violation introduces a race condition in the patched code: a process could finish (become a zombie) after the NEED_WAITPID "waitpid" cleanup but before the ap_os_kill() call, so that the call returns ESRCH, the process is marked as kill_never, and it is never cleaned up. Although it is my hope that Mac OS X 10.3 final will have fixed this error, Apache is still left with an interoperability problem on Mac OS X 10.2 and likely FreeBSD (as they share this same violation). I have attached my program "main.c", which tests for this phenomenon.
Created attachment 8055 [details] A program to test errno after signal to zombie process.
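For readers without access to the attachment, the following is an independent sketch of the kind of check described above; it is not the attached main.c. It assumes waitid() with WNOWAIT is available, and the function name `errno_from_kill_on_zombie` is made up for illustration. The child is left as a zombie on purpose before kill() is called.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that exits at once, wait until it is a zombie (without
 * reaping it), then signal it. Returns 0 if kill() succeeded, the errno
 * value if kill() failed, or -1 if the setup itself failed.
 */
int errno_from_kill_on_zombie(void)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0)
        _exit(0);                       /* child: exit immediately */

    /* WNOWAIT reports the exit but leaves the child as a zombie. */
    siginfo_t info;
    if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT) == -1) {
        (void)waitpid(pid, NULL, 0);    /* best-effort cleanup */
        return -1;
    }

    int result = 0;
    if (kill(pid, SIGTERM) == -1)
        result = errno;                 /* ESRCH here is the behaviour reported above */

    /* Reap the zombie so the check itself does not leak a process. */
    (void)waitpid(pid, NULL, 0);
    return result;
}
```

Going by the comment above, a result of 0 would be the POSIX-conforming outcome, and ESRCH the outcome reported for Mac OS X and FreeBSD at the time.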
Thinking about this a little more... I think there are two options here:
1. Come up with a solution that does not depend on this behaviour of kill() (moving the waitpid to after the kill call should be sufficient?)
2. Add a configure.in rule to check for POSIX compliance of kill() and conditionally deal with its compliance or non-compliance accordingly.
There may be other systems (besides OS X and FreeBSD) which have this ESRCH behaviour when sending to zombies...
Here is a "hack" which fixes the problem on Mac OS X (likely FreeBSD as well).
- if (ap_os_kill(p->pid, SIGTERM) == -1) {
+ if ( (ap_os_kill(p->pid, SIGTERM) == -1) && (errno == ESRCH) ) {
+ // in case ESRCH means "zombie".
+ waitpid(p->pid, (int *) 0, 0);
Unless I'm mistaken, can't kill return EPERM for setuid processes? Wouldn't that also leak?
Another data point: I received an e-mail from a Tru64 user indicating that the patch as committed failed there too, and that Ralf's patch worked fine.
Because of the bogusness of how some OSs handle errors from KILL, I've changed us to simply kill and wait.
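As a rough illustration of the "kill and wait" idea, here is a minimal standalone sketch. It is not the code committed to alloc.c; the helper name `terminate_and_reap`, the 10 ms polling interval and the escalation to SIGKILL are assumptions made for the example. The point is that the return value of kill() is ignored, and only waitpid() decides whether the child is finished.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: send SIGTERM, then wait for the child no matter
 * what kill() returned. Escalates to SIGKILL after roughly grace_ms.
 * Returns 0 once the child has been reaped (status filled in), or -1 if
 * waitpid() fails, e.g. because pid is not our child.
 */
int terminate_and_reap(pid_t pid, int grace_ms, int *status)
{
    const struct timespec tick = { 0, 10 * 1000 * 1000 };  /* 10 ms */
    int waited_ms = 0;

    (void)kill(pid, SIGTERM);           /* result deliberately ignored */

    while (waited_ms <= grace_ms) {
        pid_t r = waitpid(pid, status, WNOHANG);
        if (r == pid)
            return 0;                   /* child reaped */
        if (r == -1 && errno != EINTR)
            return -1;
        (void)nanosleep(&tick, NULL);   /* child still running: poll again */
        waited_ms += 10;
    }

    (void)kill(pid, SIGKILL);           /* grace period over */
    for (;;) {
        pid_t r = waitpid(pid, status, 0);
        if (r == pid)
            return 0;
        if (r == -1 && errno != EINTR)
            return -1;
    }
}
```

Because nothing here depends on errno from kill(), neither the EPERM case under suexec nor the ESRCH-for-zombies behaviour discussed earlier can cause a child to be skipped.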