Bug 5422 - [review] spamd error: 'prefork: ordered child N to accept, but they reported state '1', killing rogue'
Summary: [review] spamd error: 'prefork: ordered child N to accept, but they reported ...
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: spamc/spamd (show other bugs)
Version: SVN Trunk (Latest Devel Version)
Hardware: Other other
: P5 minor
Target Milestone: 3.2.1
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard: go
Keywords:
Depends on:
Blocks:
 
Reported: 2007-04-18 09:03 UTC by Justin Mason
Modified: 2008-06-26 02:58 UTC (History)
2 users (show)



Attachment Type Modified Status Actions Submitter/CLA Status
new patch against 3.1.8, *including* patch rev 3 from bug 5313 patch None Justin Mason [HasCLA]
version *not including* patch rev 3 from bug 5313 patch None Justin Mason [HasCLA]
SpamdForkScaling.pm text/plain None Justin Mason [HasCLA]
that change, as patch against 3.1.x HEAD with bug 5313 fix applied patch None Justin Mason [HasCLA]
that change, as patch against svn trunk HEAD patch None Justin Mason [HasCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Justin Mason 2007-04-18 09:03:12 UTC
this error is currently worked around (see bug 5313), but it would be a lot
better to fix it properly.  bug 5313 comment 63 is particularly useful,
containing a link to a "truss" log.
Comment 1 Justin Mason 2007-04-18 09:06:16 UTC
Created attachment 3908 [details]
new patch against 3.1.8, *including* patch rev 3 from bug 5313

This patch is a new patch which adds more debugging instrumentation, based on a
look at that truss log; Michael, could you try applying this, wait for the
error to reoccur, then upload the syslog entries from around that time?
Comment 2 Justin Mason 2007-04-18 09:10:12 UTC
Created attachment 3909 [details]
version *not including* patch rev 3 from bug 5313

this is an alternative version of the patch above, but which does not include
the bug 5313 changes; in other words if you already have "patch rev 3" from
that bug applied, you can apply this patch incrementally.
Comment 3 Michael Naef 2007-04-19 04:55:08 UTC
(In reply to comment #2)

I applied the patch, thank you very much. It's now running fine for two hours
with no rougue killed children yet. I'll post traces and logs as soon as we get
this error again.

Michael
Comment 4 Justin Mason 2007-04-19 05:02:38 UTC
thanks Michael -- much obliged!
Comment 5 Michael Naef 2007-04-20 00:52:21 UTC
Hi Justin.

All our spamds crashed short after the patch applied. The last 10000 syslog
lines from mxin001 are available on ftp://ftp.linu.gs/tmp/spamd.out.3909-1.gz
Comment 6 Justin Mason 2007-04-20 04:44:57 UTC
hmm, that's a new one -- the patch shouldn't cause the spamds to crash ;)  It
appears the parent spamd exited silently. odd.

Could you keep an eye out for the 'ordered child N to accept, but they reported
state '1', killing' message? it doesn't appear in that log...
Comment 7 Michael Naef 2007-04-20 04:54:51 UTC
(In reply to comment #6)
> hmm, that's a new one -- the patch shouldn't cause the spamds to crash ;)  It
> appears the parent spamd exited silently. odd.

Yes I saw; It just adds some more debugging output. Strange. It happened again
just after I restarted spamd. We had to revert back to the old patch as it kept
crashing. I'affraid, but our customers were'nt too happy about this :-(
Comment 8 Justin Mason 2007-04-20 05:05:10 UTC
so you're saying it works fine *without* the patch in attachment 3909 [details], but as
soon as you add that, it breaks?  that's really odd.

could you try applying just these diff chunks from that patch?

@@ -315,6 +334,7 @@
   {
     # just read one line.  if there's more lines, we'll get them
     # when we re-enter the can_read() select call above...
+warn "JMD bug5313 reading child status report";
     if ($self->read_one_message_from_child_socket($fh) == PFSTATE_IDLE)
     {
       dbg("prefork: child reports idle");
@@ -396,11 +416,13 @@
   chomp $line;
   if ($line =~ s/^I//) {
     my $pid = unpack("N1", $line);
+warn "JMD bug5313 read_one_message_from_child_socket $pid=I";
     $self->set_child_state ($pid, PFSTATE_IDLE);
     return PFSTATE_IDLE;
   }
   elsif ($line =~ s/^B//) {
     my $pid = unpack("N1", $line);
+warn "JMD bug5313 read_one_message_from_child_socket $pid=B";
     $self->set_child_state ($pid, PFSTATE_BUSY);
     return PFSTATE_BUSY;
   }
@@ -470,6 +492,7 @@
 
   while (1) {
     my $state = $self->read_one_message_from_child_socket($sock);
+warn "JMD bug5313 wait_for_child_to_accept $kid => $state";
 
     if ($state == PFSTATE_BUSY) {
       return 1;     # 1 == success


they're pretty trivial, just adding logs, no new logic.  hopefully that's more
likely to stay up.

once again, thanks for helping out here -- this has been tricky to track down,
but I think we're getting there ;)
Comment 9 Michael Naef 2007-04-20 05:30:24 UTC
(In reply to comment #8)
> so you're saying it works fine *without* the patch in attachment 3909 [details] [edit],
> but as soon as you add that, it breaks?  that's really odd.

Yes, that's the situation.

> could you try applying just these diff chunks from that patch?

The patch is applied. I'll be able to monitor it the next 1.5 hours. (I'm away
during the weekend so the next report is likely to be on monday.)
Comment 10 Michael Naef 2007-04-24 05:19:10 UTC
Hi Justin

I captured some rogue killed childern again today:

ftp://ftp.linu.gs/tmp/spamd.mxin001.out.2007-04-24.gz
ftp://ftp.linu.gs/tmp/spamd.mxin001.truss.out.2007-04-24.gz

The patch from comment #8 was applied.

Michael
Comment 11 Justin Mason 2007-04-24 06:52:08 UTC
ok, I have a theory.   I think this is what's happening: at some point, the
{kids} array is not coherent.  for example, this order of two
immediately-contiguous lines in the log demonstrates it:

Apr 24 13:56:51 mxin001 spamd[44308]: JMD bug5313
read_one_message_from_child_socket 69792=I at
/usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 400.
Apr 24 13:56:51 mxin001 spamd[44308]: prefork: child states:
BBKBBBBBBBBBBBBBBBBBBBBBBBBB

After the first line, $self->set_child_state ($pid, PFSTATE_IDLE); is called,
which sets the {kids} entry for pid 69792 to PFSTATE_IDLE unless the pid has no
entry (which only happens for servers which have exited).  However, there's no
"I" in the "child states" line!

This later results in a kid notifying the parent that its state is "B"
(PFSTATE_BUSY), the parent notes this, but the notification is "lost" somehow
-- hence the parent attempts to assign a job to the supposedly PFSTATE_IDLE
task, causing the error.

I think the reason it's becoming incoherent is due to an intermittent race
condition between the main thread and the SIGCHLD signal handler.  The latter
performs write ops on the {kids} hash -- it deletes entries from the hash.
Perhaps when this happens at bad times, it results in other entries getting
"lost" somehow, and therefore causing the incoherence.

I'll upload a new version of SpamdForkScaling.pm (in entirety, not a patch, too
many patches = getting messy ;).   This version moves all deletions from the
{kids} hash into the mainline, and adds (yet more) debugging info.
Comment 12 Justin Mason 2007-04-24 06:53:47 UTC
Created attachment 3920 [details]
SpamdForkScaling.pm

hi Michael -- could you try this?

(BTW I think my explanation may also explain why patch 3908/3909 caused the
spamd to crash.)
Comment 13 Michael Naef 2007-04-24 08:03:51 UTC
This sounds reasonable. I applied the patch/replaced the file with the new
version. I'll report again... Thank you very much, Justin!
Comment 14 Michael Naef 2007-04-26 05:50:53 UTC
No rogue killed children yet. However the amount of mails checked has severly
(by 2/3) decreased due to new means against spam. The patch seems to work nicely :-)
Comment 15 Justin Mason 2007-04-26 05:59:22 UTC
good news.  I won't apply it to svn trunk just yet though -- could you keep an
eye out for a little longer?
Comment 16 Michael Naef 2007-04-26 06:04:43 UTC
Yes, of course. I'll report again.
Comment 17 Michael Naef 2007-04-30 07:19:13 UTC
It still seems to work without any problems.
Comment 18 Justin Mason 2007-04-30 07:31:15 UTC
excellent -- that sounds a lot like a fix.  I'm not certain if we should apply
it to trunk before 3.2.0 release yet, though, since we may be putting out 3.2.0
in a couple of days; however, I think it'll be in 3.2.1.
Comment 19 Justin Mason 2007-04-30 07:31:50 UTC
marking deferable (since we have a functioning workaround in the meantime).
Comment 20 Michael Naef 2007-04-30 08:30:57 UTC
(In reply to comment #18)
> I'm not certain if we should apply it to trunk before 3.2.0 release yet, though,
> since we may be putting out 3.2.0 in a couple of days; however, I think it'll be
> in 3.2.1.

Too bad. IOW we have to continue with the patched version... Is there a risk
incorporating it in 3.2.0?

Comment 21 Justin Mason 2007-04-30 09:05:24 UTC
Created attachment 3930 [details]
that change, as patch against 3.1.x HEAD with bug 5313 fix applied

this is the change in attachment 3920 [details]'s version of SpamdForkScaling.pm.  It's a
patch against 3.1.x HEAD, with the bug 5313 fix applied.
Comment 22 Justin Mason 2007-04-30 09:11:21 UTC
Created attachment 3931 [details]
that change, as patch against svn trunk HEAD

and here it is against 3.2.0.

'Too bad. IOW we have to continue with the patched version... Is there a risk
incorporating it in 3.2.0?'

yes -- we're very late in the 3.2.0 release process; at this stage we have
to be *very* conservative about code changes between rc3 and the (possible)
release tarball, since rc3 is the one that has received minimal bug reports...
Comment 23 Justin Mason 2007-05-03 04:57:17 UTC
please review for 3.2.1
Comment 24 Justin Mason 2007-05-03 05:03:59 UTC
applied, to SVN trunk:

: jm 127...; svn commit -m "bug 5422: deleting hash entries from the SIGCHLD signal
handler is unsafe, causes corruption of the data structure, and results in
'prefork: ordered child N to accept, but they reported state '1', killing rogue'
errors.  fix" lib/Mail/SpamAssassin/SpamdForkScaling.pm
Sending        lib/Mail/SpamAssassin/SpamdForkScaling.pm
Transmitting file data .
Committed revision 534809.
Comment 25 Sidney Markowitz 2007-06-03 02:33:36 UTC
+1
Comment 26 Daryl C. W. O'Shea 2007-06-04 21:05:32 UTC
+1 for 3.2
Comment 27 Sidney Markowitz 2007-06-04 21:30:09 UTC
Justin, I hope you don't mind my going ahead and checking this in for you whilst
you're off partying on the Queen's Birthday :-)

Committed revision 544370.