SA Bugzilla – Bug 3743
Spamd not cleaning up defunct processes
Last modified: 2004-09-03 12:12:20 UTC
Version: SpamAssassin-3.0.0-RC2 Linux <host> 2.4.21-15.0.3.ELsmp #1 SMP Tue Jun 29 18:04:47 EDT 2004 i686 i686 i386 GNU/Linux Perl: This is perl, v5.8.0 built for i386-linux-thread-multi We have noticed that spamd is not cleaning up child processes that are 'defunct' and is considering these processes to be available; therefore over time this is leading to the number of available active processes dropping until eventually there are no 'child' processes left to handle requests: 520 9297 1 0 17:27 ? 00:00:00 /net/sa/bin/spamd -u spamassassin -x -d --max-children 7 520 8145 9297 2 17:57 ? 00:01:01 [spamd <defunct>] 520 668 9297 4 18:32 ? 00:00:17 spamd child 520 918 9297 3 18:32 ? 00:00:15 spamd child 520 1325 9297 3 18:33 ? 00:00:11 spamd child 520 2160 9297 3 18:34 ? 00:00:09 spamd child 520 2325 9297 3 18:35 ? 00:00:09 spamd child 520 2508 9297 3 18:35 ? 00:00:08 spamd child You can see that the zombie process was never 'detected' by the spamd parent (no mention if process 8145 sending a SIGCHLD): Sep 2 17:27:07 larry spamd[9297]: server started on port 783/tcp (running version 3.0.0-rc2) Sep 2 17:27:07 larry spamd[9297]: server successfully spawned child process, pid 9306 Sep 2 17:27:07 larry spamd[9297]: server successfully spawned child process, pid 9307 Sep 2 17:27:07 larry spamd[9297]: server successfully spawned child process, pid 9308 Sep 2 17:27:07 larry spamd[9297]: server successfully spawned child process, pid 9309 Sep 2 17:27:07 larry spamd[9297]: server successfully spawned child process, pid 9310 Sep 2 17:27:07 larry spamd[9297]: server successfully spawned child process, pid 9311 Sep 2 17:27:07 larry spamd[9297]: server successfully spawned child process, pid 9312 Sep 2 17:57:37 larry spamd[9297]: server hit by SIGCHLD, pid 9306 Sep 2 17:57:37 larry spamd[9297]: server successfully spawned child process, pid 7933 Sep 2 17:57:39 larry spamd[9297]: server hit by SIGCHLD, pid 9310 Sep 2 17:57:39 larry spamd[9297]: server successfully spawned child process, pid 7979 Sep 2 17:57:39 larry spamd[9297]: server hit by SIGCHLD, pid 9311 Sep 2 17:57:39 larry spamd[9297]: server successfully spawned child process, pid 7981 Sep 2 17:57:43 larry spamd[9297]: server hit by SIGCHLD, pid 9308 Sep 2 17:57:43 larry spamd[9297]: server successfully spawned child process, pid 8067 Sep 2 17:57:46 larry spamd[9297]: server hit by SIGCHLD, pid 9309 Sep 2 17:57:46 larry spamd[9297]: server successfully spawned child process, pid 8127 Sep 2 17:57:48 larry spamd[9297]: server hit by SIGCHLD, pid 9307 Sep 2 17:57:48 larry spamd[9297]: server successfully spawned child process, pid 8145 Sep 2 18:32:33 larry spamd[9297]: server hit by SIGCHLD, pid 9312 Sep 2 18:32:33 larry spamd[9297]: server successfully spawned child process, pid 668 Sep 2 18:32:56 larry spamd[9297]: server hit by SIGCHLD, pid 7979 Sep 2 18:32:56 larry spamd[9297]: server successfully spawned child process, pid 918 Sep 2 18:33:37 larry spamd[9297]: server hit by SIGCHLD, pid 7981 Sep 2 18:33:37 larry spamd[9297]: server successfully spawned child process, pid 1325 Sep 2 18:34:51 larry spamd[9297]: server hit by SIGCHLD, pid 7933 Sep 2 18:34:51 larry spamd[9297]: server successfully spawned child process, pid 2160 Sep 2 18:35:00 larry spamd[9297]: server hit by SIGCHLD, pid 8127 Sep 2 18:35:00 larry spamd[9297]: server successfully spawned child process, pid 2325 Sep 2 18:35:11 larry spamd[9297]: server hit by SIGCHLD, pid 8067 Sep 2 18:35:11 larry spamd[9297]: server successfully spawned child process, pid 2508 Could there be some kind of polling code to verify whether the child processes are in a zombie state and act accordingly. Thanks, Shane.
have you seen more than 1 defunct process? on some platforms spamd will occasionally have 1 (and only 1) defunct process listed, but that has virtually no effect on performance, memory, limits etc. so we don't consider it a bug.
Hi Justin, The process list as of this morning: 520 9297 1 0 Sep02 ? 00:00:01 /net/sa/bin/spamd -u spamassassin -x -d --max-children 7 520 13104 9297 3 08:15 ? 00:01:04 [spamd <defunct>] 520 19352 9297 6 08:28 ? 00:01:24 [spamd <defunct>] 520 22015 9297 8 08:32 ? 00:01:25 [spamd <defunct>] 520 22640 9297 8 08:33 ? 00:01:21 spamd child 520 23037 9297 8 08:34 ? 00:01:20 spamd child 520 29491 9297 4 08:45 ? 00:00:12 spamd child 520 32075 9297 2 08:49 ? 00:00:00 spamd child So to answer your question, yes over time we get more and more defunct processes until eventually there are no working 'spamd child' processes left.
hrm. it seems like for some reason either 1) the OS isn't sending a SIGCHLD when the child exits, so the parent doesn't know to wait for it and spawn a new child, or 2) the parent is somehow not paying attention/able to receive to signals/was hit by SIGSTOP/etc. In the logs posted, I can see 8145 being spawned: Sep 2 17:57:48 larry spamd[9297]: server hit by SIGCHLD, pid 9307 Sep 2 17:57:48 larry spamd[9297]: server successfully spawned child process, pid 8145 but there is no record of a SIGCHLD being received when it exited. > Could there be some kind of polling code to verify whether the child processes are in a zombie state and act accordingly. Not in the current spamd. The parent has no interaction with the children beyond spawning them and cleaning up when a SIGCHLD is received. That's all the parent does actually, it sits in a sleep() until a signal comes along to wake it up.
Hi Theo, We suspect the problem has to do with a SIGCHLD being received whilst the spamd process is already in the wait state cleaning up another SIGCHLD signal; hence the second SIGCHLD is missed. We have made the following changes to use the waitpid call instead which will reap all children that need a cleanup. We are currently testing the change and will report back if it fixes the problem. --- spamd 2004-09-03 10:14:27.000000000 +1000 +++ spamd.old 2004-08-30 13:48:02.000000000 +1000 @@ -1745,11 +1745,9 @@ sub child_handler { my ($sig) = @_; $SIG{CHLD} = \&child_handler; # reset as necessary - logmsg("server hit by SIG$sig"); - while((my $pid = waitpid(-1, &WNOHANG)) > 0) { # reap the child - delete $children{$pid}; # remove the child out of the pool - logmsg(" cleaned up child pid $pid"); - } + my $pid = wait; # reap the child + delete $children{$pid}; # remove the child out of the pool + logmsg("server hit by SIG$sig, pid $pid"); } sub restart_handler { @@ -1759,7 +1757,7 @@ $SIG{CHLD} = 'DEFAULT'; # we're going to kill our children foreach (keys %children) { kill 'INT' => $_; - my $pid = waitpid($_, 0); + my $pid = wait; logmsg("child $pid killed successfully"); } %children = ();
Subject: Re: Spamd not cleaning up defunct processes On Thu, Sep 02, 2004 at 05:53:04PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote: > We suspect the problem has to do with a SIGCHLD being received whilst the spamd > process is already in the wait state cleaning up another SIGCHLD signal; hence > the second SIGCHLD is missed. We have made the following changes to use the > waitpid call instead which will reap all children that need a cleanup. We are > currently testing the change and will report back if it fixes the problem. Hrm, ok. As an FYI, child_handler (and the rest of the pre-fork stuff) was essentially pulled right out of the perlipc doc. Looking at the code, it does seem like we ought to wait until the end to reset the handler too... > sub child_handler { > my ($sig) = @_; > $SIG{CHLD} = \&child_handler; # reset as necessary yeah there... according to the perlipc page, there's a SysV issue which makes it have to go after the wait, which we apparently don't do (why didn't I do that?). :( > - logmsg("server hit by SIG$sig"); > - while((my $pid = waitpid(-1, &WNOHANG)) > 0) { # reap the child > - delete $children{$pid}; # remove the child out of the > pool > - logmsg(" cleaned up child pid $pid"); > - } > + my $pid = wait; # reap the child > + delete $children{$pid}; # remove the child out of the pool > + logmsg("server hit by SIG$sig, pid $pid"); Errr... I liked the waitpid() version because it can do NOHANG, and because the first handler call could catch more children if they exited around the same time. wait() causes the process to hang for a child, which makes me think of bad things depending on how the handler is called. I'd say the first thing to try is just moving the handler reset to the bottom of the function, and see if that helps. If not, progress forward. :)
er, theo, I think the waitpid() version is what shane's suggesting. current svn has: sub child_handler { my ($sig) = @_; $SIG{CHLD} = \&child_handler; # reset as necessary my $pid = wait; # reap the child delete $children{$pid}; # remove the child out of the pool logmsg("server hit by SIG$sig, pid $pid"); } sub restart_handler { my ($sig) = @_; logmsg("server hit by SIG$sig, restarting"); $SIG{CHLD} = 'DEFAULT'; # we're going to kill our children foreach (keys %children) { kill 'INT' => $_; my $pid = wait; logmsg("child $pid killed successfully"); } %children = (); ie. wait(). so shane's fix looks good -- I think.... /me hates the UNIX SIGCHLD portability-nightmare
Just as an FYI, we can see the changes in action in the following log extract; the spamd parent detected just a single SIGCHLD but the waitpid call cleaned up 4 processes in total -- therefore without the change we would have been left with 3 defunct processes (in theory): Sep 3 10:58:58 larry spamd[9980]: server hit by SIGCHLD Sep 3 10:58:58 larry spamd[9980]: cleaned up child pid 9991 Sep 3 10:58:58 larry spamd[9980]: cleaned up child pid 9993 Sep 3 10:58:58 larry spamd[9980]: cleaned up child pid 9994 Sep 3 10:58:58 larry spamd[9980]: cleaned up child pid 9995 Sep 3 10:58:58 larry spamd[9980]: server successfully spawned child process, pid 23856 Sep 3 10:58:58 larry spamd[9980]: server successfully spawned child process, pid 23857 Sep 3 10:58:58 larry spamd[9980]: server successfully spawned child process, pid 23858 Sep 3 10:58:58 larry spamd[9980]: server successfully spawned child process, pid 23859 Sep 3 10:58:59 larry spamd[9980]: server hit by SIGCHLD Sep 3 10:58:59 larry spamd[9980]: cleaned up child pid 9992 Sep 3 10:58:59 larry spamd[9980]: server successfully spawned child process, pid 23870 Sep 3 11:17:04 larry spamd[9980]: server hit by SIGCHLD Sep 3 11:17:04 larry spamd[9980]: cleaned up child pid 23859 Sep 3 11:17:04 larry spamd[9980]: server successfully spawned child process, pid 5024 Sep 3 11:17:05 larry spamd[9980]: server hit by SIGCHLD Sep 3 11:17:05 larry spamd[9980]: cleaned up child pid 23857 Sep 3 11:17:05 larry spamd[9980]: server successfully spawned child process, pid 5041 Sep 3 11:17:21 larry spamd[9980]: server hit by SIGCHLD Sep 3 11:17:21 larry spamd[9980]: cleaned up child pid 23856
Subject: Re: Spamd not cleaning up defunct processes On Thu, Sep 02, 2004 at 06:27:13PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote: > er, theo, I think the waitpid() version is what shane's suggesting. current svn > has: really? oh, the patch was reverse. well, then yeah, waitpid++. :)
3.0.0 milestone this must cause a new -rc release if we do this
I'm +1 on the suggested fix, fwiw. however I've scribbled on that code in my svn so I can't upload a patch as a proper attachment. ;) do we need a CLA for this? I'm thinking "no".
> I'm +1 on the suggested fix, fwiw. however I've scribbled on that code in my > svn so I can't upload a patch as a proper attachment. ;) > do we need a CLA for this? I'm thinking "no". I think I'm +1 on it as well. I'll see about putting up a slightly different version against trunk (the patch from bug 3625 modified the child handler), and the patch is reversed anyway.
Created attachment 2306 [details] suggested patch basically the same as the one suggested, but slightly different. ;) 1) not a reverse patch 2) against trunk, which is now slightly different than the rc2 code
Created attachment 2307 [details] new version same as before, but I apparently forgot to wrap the first logmsg() call so this one does that. :)
+1
committed. r43314