SA Bugzilla – Bug 5422
[review] spamd error: 'prefork: ordered child N to accept, but they reported state '1', killing rogue'
Last modified: 2008-06-26 02:58:18 UTC
this error is currently worked around (see bug 5313), but it would be a lot better to fix it properly. bug 5313 comment 63 is particularly useful, containing a link to a "truss" log.
Created attachment 3908 [details] new patch against 3.1.8, *including* patch rev 3 from bug 5313 This patch is a new patch which adds more debugging instrumentation, based on a look at that truss log; Michael, could you try applying this, wait for the error to reoccur, then upload the syslog entries from around that time?
Created attachment 3909 [details] version *not including* patch rev 3 from bug 5313 this is an alternative version of the patch above, but which does not include the bug 5313 changes; in other words if you already have "patch rev 3" from that bug applied, you can apply this patch incrementally.
(In reply to comment #2) I applied the patch, thank you very much. It's now running fine for two hours with no rougue killed children yet. I'll post traces and logs as soon as we get this error again. Michael
thanks Michael -- much obliged!
Hi Justin. All our spamds crashed short after the patch applied. The last 10000 syslog lines from mxin001 are available on ftp://ftp.linu.gs/tmp/spamd.out.3909-1.gz
hmm, that's a new one -- the patch shouldn't cause the spamds to crash ;) It appears the parent spamd exited silently. odd. Could you keep an eye out for the 'ordered child N to accept, but they reported state '1', killing' message? it doesn't appear in that log...
(In reply to comment #6) > hmm, that's a new one -- the patch shouldn't cause the spamds to crash ;) It > appears the parent spamd exited silently. odd. Yes I saw; It just adds some more debugging output. Strange. It happened again just after I restarted spamd. We had to revert back to the old patch as it kept crashing. I'affraid, but our customers were'nt too happy about this :-(
so you're saying it works fine *without* the patch in attachment 3909 [details], but as soon as you add that, it breaks? that's really odd. could you try applying just these diff chunks from that patch? @@ -315,6 +334,7 @@ { # just read one line. if there's more lines, we'll get them # when we re-enter the can_read() select call above... +warn "JMD bug5313 reading child status report"; if ($self->read_one_message_from_child_socket($fh) == PFSTATE_IDLE) { dbg("prefork: child reports idle"); @@ -396,11 +416,13 @@ chomp $line; if ($line =~ s/^I//) { my $pid = unpack("N1", $line); +warn "JMD bug5313 read_one_message_from_child_socket $pid=I"; $self->set_child_state ($pid, PFSTATE_IDLE); return PFSTATE_IDLE; } elsif ($line =~ s/^B//) { my $pid = unpack("N1", $line); +warn "JMD bug5313 read_one_message_from_child_socket $pid=B"; $self->set_child_state ($pid, PFSTATE_BUSY); return PFSTATE_BUSY; } @@ -470,6 +492,7 @@ while (1) { my $state = $self->read_one_message_from_child_socket($sock); +warn "JMD bug5313 wait_for_child_to_accept $kid => $state"; if ($state == PFSTATE_BUSY) { return 1; # 1 == success they're pretty trivial, just adding logs, no new logic. hopefully that's more likely to stay up. once again, thanks for helping out here -- this has been tricky to track down, but I think we're getting there ;)
(In reply to comment #8) > so you're saying it works fine *without* the patch in attachment 3909 [details] [edit], > but as soon as you add that, it breaks? that's really odd. Yes, that's the situation. > could you try applying just these diff chunks from that patch? The patch is applied. I'll be able to monitor it the next 1.5 hours. (I'm away during the weekend so the next report is likely to be on monday.)
Hi Justin I captured some rogue killed childern again today: ftp://ftp.linu.gs/tmp/spamd.mxin001.out.2007-04-24.gz ftp://ftp.linu.gs/tmp/spamd.mxin001.truss.out.2007-04-24.gz The patch from comment #8 was applied. Michael
ok, I have a theory. I think this is what's happening: at some point, the {kids} array is not coherent. for example, this order of two immediately-contiguous lines in the log demonstrates it: Apr 24 13:56:51 mxin001 spamd[44308]: JMD bug5313 read_one_message_from_child_socket 69792=I at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm line 400. Apr 24 13:56:51 mxin001 spamd[44308]: prefork: child states: BBKBBBBBBBBBBBBBBBBBBBBBBBBB After the first line, $self->set_child_state ($pid, PFSTATE_IDLE); is called, which sets the {kids} entry for pid 69792 to PFSTATE_IDLE unless the pid has no entry (which only happens for servers which have exited). However, there's no "I" in the "child states" line! This later results in a kid notifying the parent that its state is "B" (PFSTATE_BUSY), the parent notes this, but the notification is "lost" somehow -- hence the parent attempts to assign a job to the supposedly PFSTATE_IDLE task, causing the error. I think the reason it's becoming incoherent is due to an intermittent race condition between the main thread and the SIGCHLD signal handler. The latter performs write ops on the {kids} hash -- it deletes entries from the hash. Perhaps when this happens at bad times, it results in other entries getting "lost" somehow, and therefore causing the incoherence. I'll upload a new version of SpamdForkScaling.pm (in entirety, not a patch, too many patches = getting messy ;). This version moves all deletions from the {kids} hash into the mainline, and adds (yet more) debugging info.
Created attachment 3920 [details] SpamdForkScaling.pm hi Michael -- could you try this? (BTW I think my explanation may also explain why patch 3908/3909 caused the spamd to crash.)
This sounds reasonable. I applied the patch/replaced the file with the new version. I'll report again... Thank you very much, Justin!
No rogue killed children yet. However the amount of mails checked has severly (by 2/3) decreased due to new means against spam. The patch seems to work nicely :-)
good news. I won't apply it to svn trunk just yet though -- could you keep an eye out for a little longer?
Yes, of course. I'll report again.
It still seems to work without any problems.
excellent -- that sounds a lot like a fix. I'm not certain if we should apply it to trunk before 3.2.0 release yet, though, since we may be putting out 3.2.0 in a couple of days; however, I think it'll be in 3.2.1.
marking deferable (since we have a functioning workaround in the meantime).
(In reply to comment #18) > I'm not certain if we should apply it to trunk before 3.2.0 release yet, though, > since we may be putting out 3.2.0 in a couple of days; however, I think it'll be > in 3.2.1. Too bad. IOW we have to continue with the patched version... Is there a risk incorporating it in 3.2.0?
Created attachment 3930 [details] that change, as patch against 3.1.x HEAD with bug 5313 fix applied this is the change in attachment 3920 [details]'s version of SpamdForkScaling.pm. It's a patch against 3.1.x HEAD, with the bug 5313 fix applied.
Created attachment 3931 [details] that change, as patch against svn trunk HEAD and here it is against 3.2.0. 'Too bad. IOW we have to continue with the patched version... Is there a risk incorporating it in 3.2.0?' yes -- we're very late in the 3.2.0 release process; at this stage we have to be *very* conservative about code changes between rc3 and the (possible) release tarball, since rc3 is the one that has received minimal bug reports...
please review for 3.2.1
applied, to SVN trunk: : jm 127...; svn commit -m "bug 5422: deleting hash entries from the SIGCHLD signal handler is unsafe, causes corruption of the data structure, and results in 'prefork: ordered child N to accept, but they reported state '1', killing rogue' errors. fix" lib/Mail/SpamAssassin/SpamdForkScaling.pm Sending lib/Mail/SpamAssassin/SpamdForkScaling.pm Transmitting file data . Committed revision 534809.
+1
+1 for 3.2
Justin, I hope you don't mind my going ahead and checking this in for you whilst you're off partying on the Queen's Birthday :-) Committed revision 544370.