Bug 3743 - Spamd not cleaning up defunct processes
Summary: Spamd not cleaning up defunct processes
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: spamc/spamd (show other bugs)
Version: unspecified
Hardware: PC Linux
: P3 major
Target Milestone: 3.0.0
Assignee: Theo Van Dinter
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2004-09-02 01:43 UTC by Shane Hjorth
Modified: 2004-09-03 12:12 UTC (History)
1 user (show)



Attachment Type Modified Status Actions Submitter/CLA Status
suggested patch patch None Theo Van Dinter [HasCLA]
new version patch None Theo Van Dinter [HasCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Shane Hjorth 2004-09-02 01:43:31 UTC
Version: SpamAssassin-3.0.0-RC2
Linux <host> 2.4.21-15.0.3.ELsmp #1 SMP Tue Jun 29 18:04:47 EDT 2004 i686 i686
i386 GNU/Linux
Perl: This is perl, v5.8.0 built for i386-linux-thread-multi

We have noticed that spamd is not cleaning up child processes that are 'defunct'
and is considering these processes to be available; therefore over time this is
leading to the number of available active processes dropping until eventually
there are no 'child' processes left to handle requests:

520       9297     1  0 17:27 ?        00:00:00 /net/sa/bin/spamd -u
spamassassin -x -d --max-children 7
520       8145  9297  2 17:57 ?        00:01:01 [spamd <defunct>]
520        668  9297  4 18:32 ?        00:00:17 spamd child
520        918  9297  3 18:32 ?        00:00:15 spamd child
520       1325  9297  3 18:33 ?        00:00:11 spamd child
520       2160  9297  3 18:34 ?        00:00:09 spamd child
520       2325  9297  3 18:35 ?        00:00:09 spamd child
520       2508  9297  3 18:35 ?        00:00:08 spamd child

You can see that the zombie process was never 'detected' by the spamd parent (no
mention if process 8145 sending a SIGCHLD):

Sep  2 17:27:07 larry spamd[9297]: server started on port 783/tcp (running
version 3.0.0-rc2)
Sep  2 17:27:07 larry spamd[9297]: server successfully spawned child process,
pid 9306
Sep  2 17:27:07 larry spamd[9297]: server successfully spawned child process,
pid 9307
Sep  2 17:27:07 larry spamd[9297]: server successfully spawned child process,
pid 9308
Sep  2 17:27:07 larry spamd[9297]: server successfully spawned child process,
pid 9309
Sep  2 17:27:07 larry spamd[9297]: server successfully spawned child process,
pid 9310
Sep  2 17:27:07 larry spamd[9297]: server successfully spawned child process,
pid 9311
Sep  2 17:27:07 larry spamd[9297]: server successfully spawned child process,
pid 9312
Sep  2 17:57:37 larry spamd[9297]: server hit by SIGCHLD, pid 9306
Sep  2 17:57:37 larry spamd[9297]: server successfully spawned child process,
pid 7933
Sep  2 17:57:39 larry spamd[9297]: server hit by SIGCHLD, pid 9310
Sep  2 17:57:39 larry spamd[9297]: server successfully spawned child process,
pid 7979
Sep  2 17:57:39 larry spamd[9297]: server hit by SIGCHLD, pid 9311
Sep  2 17:57:39 larry spamd[9297]: server successfully spawned child process,
pid 7981
Sep  2 17:57:43 larry spamd[9297]: server hit by SIGCHLD, pid 9308
Sep  2 17:57:43 larry spamd[9297]: server successfully spawned child process,
pid 8067
Sep  2 17:57:46 larry spamd[9297]: server hit by SIGCHLD, pid 9309
Sep  2 17:57:46 larry spamd[9297]: server successfully spawned child process,
pid 8127
Sep  2 17:57:48 larry spamd[9297]: server hit by SIGCHLD, pid 9307
Sep  2 17:57:48 larry spamd[9297]: server successfully spawned child process,
pid 8145
Sep  2 18:32:33 larry spamd[9297]: server hit by SIGCHLD, pid 9312
Sep  2 18:32:33 larry spamd[9297]: server successfully spawned child process,
pid 668
Sep  2 18:32:56 larry spamd[9297]: server hit by SIGCHLD, pid 7979
Sep  2 18:32:56 larry spamd[9297]: server successfully spawned child process,
pid 918
Sep  2 18:33:37 larry spamd[9297]: server hit by SIGCHLD, pid 7981
Sep  2 18:33:37 larry spamd[9297]: server successfully spawned child process,
pid 1325
Sep  2 18:34:51 larry spamd[9297]: server hit by SIGCHLD, pid 7933
Sep  2 18:34:51 larry spamd[9297]: server successfully spawned child process,
pid 2160
Sep  2 18:35:00 larry spamd[9297]: server hit by SIGCHLD, pid 8127
Sep  2 18:35:00 larry spamd[9297]: server successfully spawned child process,
pid 2325
Sep  2 18:35:11 larry spamd[9297]: server hit by SIGCHLD, pid 8067
Sep  2 18:35:11 larry spamd[9297]: server successfully spawned child process,
pid 2508

Could there be some kind of polling code to verify whether the child processes
are in a zombie state and act accordingly.

Thanks,

Shane.
Comment 1 Justin Mason 2004-09-02 12:42:13 UTC
have you seen more than 1 defunct process?

on some platforms spamd will occasionally have 1 (and only 1) defunct process
listed, but that has virtually no effect on performance, memory, limits etc. so
we don't consider it a bug.
Comment 2 Shane Hjorth 2004-09-02 15:53:10 UTC
Hi Justin,

The process list as of this morning:

520       9297     1  0 Sep02 ?        00:00:01 /net/sa/bin/spamd -u
spamassassin -x -d --max-children 7
520      13104  9297  3 08:15 ?        00:01:04 [spamd <defunct>]
520      19352  9297  6 08:28 ?        00:01:24 [spamd <defunct>]
520      22015  9297  8 08:32 ?        00:01:25 [spamd <defunct>]
520      22640  9297  8 08:33 ?        00:01:21 spamd child
520      23037  9297  8 08:34 ?        00:01:20 spamd child
520      29491  9297  4 08:45 ?        00:00:12 spamd child
520      32075  9297  2 08:49 ?        00:00:00 spamd child

So to answer your question, yes over time we get more and more defunct processes
until eventually there are no working 'spamd child' processes left.
Comment 3 Theo Van Dinter 2004-09-02 16:10:25 UTC
hrm.  it seems like for some reason either 1) the OS isn't sending a SIGCHLD
when the child exits, so the parent doesn't know to wait for it and spawn a new
child, or 2) the parent is somehow not paying attention/able to receive to
signals/was hit by SIGSTOP/etc.

In the logs posted, I can see 8145 being spawned:

Sep  2 17:57:48 larry spamd[9297]: server hit by SIGCHLD, pid 9307
Sep  2 17:57:48 larry spamd[9297]: server successfully spawned child process,
pid 8145

but there is no record of a SIGCHLD being received when it exited.

> Could there be some kind of polling code to verify whether the child processes
are in a zombie state and act accordingly.

Not in the current spamd.  The parent has no interaction with the children
beyond spawning them and cleaning up when a SIGCHLD is received.  That's all the
parent does actually, it sits in a sleep() until a signal comes along to wake it up.
Comment 4 Shane Hjorth 2004-09-02 17:53:03 UTC
Hi Theo,

We suspect the problem has to do with a SIGCHLD being received whilst the spamd
process is already in the wait state cleaning up another SIGCHLD signal; hence
the second SIGCHLD is missed. We have made the following changes to use the
waitpid call instead which will reap all children that need a cleanup. We are
currently testing the change and will report back if it fixes the problem.

--- spamd       2004-09-03 10:14:27.000000000 +1000
+++ spamd.old   2004-08-30 13:48:02.000000000 +1000
@@ -1745,11 +1745,9 @@
 sub child_handler {
   my ($sig) = @_;
   $SIG{CHLD} = \&child_handler;    # reset as necessary
-  logmsg("server hit by SIG$sig");
-  while((my $pid = waitpid(-1, &WNOHANG)) > 0) {  # reap the child
-    delete $children{$pid};                       # remove the child out of the
pool
-    logmsg("  cleaned up child pid $pid");
-  }
+  my $pid = wait;                  # reap the child
+  delete $children{$pid};          # remove the child out of the pool
+  logmsg("server hit by SIG$sig, pid $pid");
 }

 sub restart_handler {
@@ -1759,7 +1757,7 @@
   $SIG{CHLD} = 'DEFAULT';    # we're going to kill our children
   foreach (keys %children) {
     kill 'INT' => $_;
-    my $pid = waitpid($_, 0);
+    my $pid = wait;
     logmsg("child $pid killed successfully");
   }
   %children = ();
Comment 5 Theo Van Dinter 2004-09-02 18:16:01 UTC
Subject: Re:  Spamd not cleaning up defunct processes

On Thu, Sep 02, 2004 at 05:53:04PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> We suspect the problem has to do with a SIGCHLD being received whilst the spamd
> process is already in the wait state cleaning up another SIGCHLD signal; hence
> the second SIGCHLD is missed. We have made the following changes to use the
> waitpid call instead which will reap all children that need a cleanup. We are
> currently testing the change and will report back if it fixes the problem.

Hrm, ok.  As an FYI, child_handler (and the rest of the pre-fork stuff)
was essentially pulled right out of the perlipc doc.  Looking at the
code, it does seem like we ought to wait until the end to reset the
handler too...

>  sub child_handler {
>    my ($sig) = @_;
>    $SIG{CHLD} = \&child_handler;    # reset as necessary

yeah there...  according to the perlipc page, there's a SysV issue
which makes it have to go after the wait, which we apparently don't do
(why didn't I do that?). :(

> -  logmsg("server hit by SIG$sig");
> -  while((my $pid = waitpid(-1, &WNOHANG)) > 0) {  # reap the child
> -    delete $children{$pid};                       # remove the child out of the
> pool
> -    logmsg("  cleaned up child pid $pid");
> -  }
> +  my $pid = wait;                  # reap the child
> +  delete $children{$pid};          # remove the child out of the pool
> +  logmsg("server hit by SIG$sig, pid $pid");

Errr...  I liked the waitpid() version because it can do NOHANG, and because
the first handler call could catch more children if they exited around the
same time.  wait() causes the process to hang for a child, which makes me
think of bad things depending on how the handler is called.

I'd say the first thing to try is just moving the handler reset to the bottom
of the function, and see if that helps.  If not, progress forward. :)

Comment 6 Justin Mason 2004-09-02 18:27:12 UTC
er, theo, I think the waitpid() version is what shane's suggesting.  current svn
has:

sub child_handler {
  my ($sig) = @_;
  $SIG{CHLD} = \&child_handler;    # reset as necessary
  my $pid = wait;                  # reap the child
  delete $children{$pid};          # remove the child out of the pool
  logmsg("server hit by SIG$sig, pid $pid");
}

sub restart_handler {
  my ($sig) = @_;
  logmsg("server hit by SIG$sig, restarting");

  $SIG{CHLD} = 'DEFAULT';    # we're going to kill our children
  foreach (keys %children) {
    kill 'INT' => $_;
    my $pid = wait;
    logmsg("child $pid killed successfully");
  }
  %children = ();


ie. wait().   so shane's fix looks good -- I think....

/me hates the UNIX SIGCHLD portability-nightmare
Comment 7 Shane Hjorth 2004-09-02 18:43:53 UTC
Just as an FYI, we can see the changes in action in the following log extract;
the spamd parent detected just a single SIGCHLD but the waitpid call cleaned up
4 processes in total -- therefore without the change we would have been left
with 3 defunct processes (in theory):

Sep  3 10:58:58 larry spamd[9980]: server hit by SIGCHLD
Sep  3 10:58:58 larry spamd[9980]:   cleaned up child pid 9991
Sep  3 10:58:58 larry spamd[9980]:   cleaned up child pid 9993
Sep  3 10:58:58 larry spamd[9980]:   cleaned up child pid 9994
Sep  3 10:58:58 larry spamd[9980]:   cleaned up child pid 9995
Sep  3 10:58:58 larry spamd[9980]: server successfully spawned child process,
pid 23856
Sep  3 10:58:58 larry spamd[9980]: server successfully spawned child process,
pid 23857
Sep  3 10:58:58 larry spamd[9980]: server successfully spawned child process,
pid 23858
Sep  3 10:58:58 larry spamd[9980]: server successfully spawned child process,
pid 23859
Sep  3 10:58:59 larry spamd[9980]: server hit by SIGCHLD
Sep  3 10:58:59 larry spamd[9980]:   cleaned up child pid 9992
Sep  3 10:58:59 larry spamd[9980]: server successfully spawned child process,
pid 23870
Sep  3 11:17:04 larry spamd[9980]: server hit by SIGCHLD
Sep  3 11:17:04 larry spamd[9980]:   cleaned up child pid 23859
Sep  3 11:17:04 larry spamd[9980]: server successfully spawned child process,
pid 5024
Sep  3 11:17:05 larry spamd[9980]: server hit by SIGCHLD
Sep  3 11:17:05 larry spamd[9980]:   cleaned up child pid 23857
Sep  3 11:17:05 larry spamd[9980]: server successfully spawned child process,
pid 5041
Sep  3 11:17:21 larry spamd[9980]: server hit by SIGCHLD
Sep  3 11:17:21 larry spamd[9980]:   cleaned up child pid 23856
Comment 8 Theo Van Dinter 2004-09-02 19:05:34 UTC
Subject: Re:  Spamd not cleaning up defunct processes

On Thu, Sep 02, 2004 at 06:27:13PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> er, theo, I think the waitpid() version is what shane's suggesting.  current svn
> has:

really?  oh, the patch was reverse.  well, then yeah, waitpid++. :)

Comment 9 Daniel Quinlan 2004-09-02 23:13:55 UTC
3.0.0 milestone

this must cause a new -rc release if we do this
Comment 10 Justin Mason 2004-09-03 19:17:39 UTC
I'm +1 on the suggested fix, fwiw.  however I've scribbled on that code in my
svn so I can't upload a patch as a proper attachment. ;)

do we need a CLA for this?  I'm thinking "no".
Comment 11 Theo Van Dinter 2004-09-03 19:46:14 UTC
> I'm +1 on the suggested fix, fwiw.  however I've scribbled on that code in my
> svn so I can't upload a patch as a proper attachment. ;)
> do we need a CLA for this?  I'm thinking "no".

I think I'm +1 on it as well.  I'll see about putting up a slightly different
version against trunk (the patch from bug 3625 modified the child handler), and
the patch is reversed anyway.
Comment 12 Theo Van Dinter 2004-09-03 20:01:28 UTC
Created attachment 2306 [details]
suggested patch

basically the same as the one suggested, but slightly different. ;)

1) not a reverse patch
2) against trunk, which is now slightly different than the rc2 code
Comment 13 Theo Van Dinter 2004-09-03 20:05:39 UTC
Created attachment 2307 [details]
new version

same as before, but I apparently forgot to wrap the first logmsg() call so this
one does that. :)
Comment 14 Justin Mason 2004-09-03 20:06:50 UTC
+1
Comment 15 Michael Parker 2004-09-03 20:07:47 UTC
+1
Comment 16 Theo Van Dinter 2004-09-03 20:12:20 UTC
committed.  r43314