Bug 6006

Summary: prefork: oops! no idle kids in need_to_del_server?
Product: Spamassassin Reporter: Per Jessen <per>
Component: spamc/spamdAssignee: SpamAssassin Developer Mailing List <dev>
Status: NEW ---    
Severity: normal    
Priority: P5    
Version: 3.2.5   
Target Milestone: Undefined   
Hardware: PC   
OS: Linux   
Whiteboard:

Description Per Jessen 2008-10-27 23:52:24 UTC
Following a SIGHUP of spamd, spamd decided to just stop:

Oct 25 01:20:38 louiswu7 spamd[3122]: prefork: child states: IBBII
Oct 25 01:20:38 louiswu7 spamd[3122]: spamd: handled cleanup of child pid 21529 due to SIGCHLD
Oct 25 01:20:39 louiswu7 spamd[21548]: spamd reload - 1 updated files found
[SIGHUP]
Oct 25 01:20:39 louiswu7 spamd[3122]: spamd: child 18309 killed successfully
Oct 25 01:20:39 louiswu7 spamd[3122]: spamd: child 21466 killed successfully
Oct 25 01:20:39 louiswu7 spamd[3122]: spamd: child 21469 killed successfully
Oct 25 01:20:40 louiswu7 spamd[3122]: spamd: child 20900 killed successfully
Oct 25 01:20:39 louiswu7 spamd[3122]: prefork: child states: IBII
Oct 25 01:20:40 louiswu7 spamd[3122]: prefork: child states: IBII
Oct 25 01:20:40 louiswu7 spamd[3122]: prefork: oops! no idle kids in need_to_del_server? at /usr/lib/perl5/site_perl/5.10.0/Mail/SpamAssassin/SpamdForkScaling.pm line 846.

After that, spamd exited.  I reload spamd like this a few times a day on several machines, so far I have not seen another case.
Comment 1 Per Jessen 2008-10-30 01:39:25 UTC
Same ting happened again just last night:

Oct 29 19:21:35 louiswu7 spamd[10115]: spamd reload - 1 updated files found
Oct 29 19:21:35 louiswu7 spamd[28250]: spamd: child 9982 killed successfully
Oct 29 19:21:36 louiswu7 spamd[28250]: spamd: child 9507 killed successfully
Oct 29 19:21:36 louiswu7 spamd[28250]: spamd: child 10104 killed successfully
Oct 29 19:21:36 louiswu7 spamd[28250]: spamd: child 10020 killed successfully
Oct 29 19:21:36 louiswu7 spamd[28250]: spamd: child 8415 killed successfully
Oct 29 19:21:35 louiswu7 spamd[28250]: prefork: child states: BIIBI
Oct 29 19:21:35 louiswu7 spamd[28250]: prefork: child states: BIIBI
Oct 29 19:21:37 louiswu7 spamd[28250]: prefork: oops! no idle kids in need_to_del_server? at /usr/lib/perl5/site_perl/5.10.0/Mail/SpamAssassin/SpamdForkScaling.pm line 846.

Comment 2 Justin Mason 2008-11-03 02:17:35 UTC
what OS and perl version are you running?  can you run spamd with -D and post the resulting logs?
Comment 3 Justin Mason 2008-11-03 02:17:54 UTC
er, _attach_ the resulting logs (as an attachment). ;)
Comment 4 Per Jessen 2008-11-03 02:32:31 UTC
I'm running openSUSE 11.0 32-bit on AMD64 and ia32, perl is 5.10.0.  

About running with with debug - the logs would be enormous, but I'll look into it.  For the moment, I've amended Mail/SpamAssassin/SpamdForkScaling.pm to print the pid and state of the children in need_to_del_server(). 
Comment 5 Justin Mason 2008-11-03 02:38:58 UTC
(In reply to comment #4)
> About running with with debug - the logs would be enormous, but I'll look into
> it.  For the moment, I've amended Mail/SpamAssassin/SpamdForkScaling.pm to
> print the pid and state of the children in need_to_del_server(). 

yeah.  an option might be to just put up the gzipped log somewhere for me to get, instead of attaching it here.  they'll be pretty big, but not gigabytes big ;)
Comment 6 Per Jessen 2008-11-03 03:06:24 UTC
Okay, have restarted spamd with -D - looks like it'll produce about 100K log per minute.  The config is reloaded at least four times a day, usually more. I guess I could up the number of reloads without upsetting anything.
Comment 7 Per Jessen 2008-11-07 02:00:41 UTC
Running spamd with -D continually is slowing down processing a lot - this might be ok for one server, but the problem has just now occured on another. I can't run with -D everywhere, it would make everything grind to a halt. 

Isn't there some specific info you can look for?  I can modify the code, just let me know.  Also, did the debug output I made available off-line not help at all?
Comment 8 Justin Mason 2008-11-07 02:10:35 UTC
Per, feel free to turn off -D.  unfortunately I haven't had a chance to sit down with that debug info just yet :(
Comment 9 Justin Mason 2008-11-07 08:58:10 UTC
I've had a look at that, and I have no idea what's going on. :(  Can you monitor STDERR and see if some error is output there, but not to syslog, next time it happens?
Comment 10 Per Jessen 2008-11-27 00:06:35 UTC
Update - I've been logging output to file, but only on a couple of newer SMP machines (I thought this might be some sort of concurrency issue). I've had 4 incidents since 1 November, but all on single-CPU machines. I guess I could start logging on those too, but I'll probably have to set up some log rotation. 
Comment 11 Per Jessen 2008-11-27 00:32:10 UTC
Okay, I did manage to log stdout+stder during one of those events, but I don't see any additional info in the log. Just the same errormsgs:

[3132] info: prefork: child states: BIIBBB
[3132] info: prefork: child states: BIIIBB
[3132] info: spamd: handled cleanup of child pid 25409 due to SIGCHLD
[3132] info: prefork: child states: BIIIB
[3132] info: spamd: handled cleanup of child pid 25410 due to SIGCHLD
[3132] info: prefork: child states: BIIB
[3132] info: prefork: child states: BIII
[3132] info: spamd: child 24218 killed successfully
[3132] info: spamd: child 22482 killed successfully
[3132] info: spamd: child 25411 killed successfully
[3132] info: spamd: child 25408 killed successfully
[3132] info: prefork: child states: BIII
[3132] error: prefork: oops! no idle kids in need_to_del_server? at /usr/lib/perl5/site_perl/5.10.0/Mail/SpamAssassin/SpamdForkScaling.pm line 846.

Comment 12 Per Jessen 2008-12-05 00:05:24 UTC
Update - I still see the probleme every now and then, but I've stopped logging stdout+stderr, it just produced way too much data.  Let me know if there's anything else I can do to help diagnose this.
Comment 13 Per Jessen 2008-12-09 10:14:24 UTC
I've got a growing suspicion that this is a race condition somewhere or other. I've added some output to the error msg, and it looks as if the number of idle kids is not being decremented even if they have gone idle and died:

Dec  9 17:06:45 louiswu6 spamd[17923]: prefork: child states: BBBIII
Dec  9 17:06:45 louiswu6 spamd[17923]: prefork: child states: BBBIII
Dec  9 17:06:47 louiswu6 spamd[17923]: prefork: oops! no idle kids in need_to_del_server? num_idle=3 states=444444 at /usr/lib/perl5/site_perl/5.10.0/Mail/SpamAssassin/SpamdForkScaling.pm line 849.

I've added the "num_idle=3 states=444444", the meaning should be obvious. 
Comment 14 Justin Mason 2008-12-09 12:36:59 UTC
(In reply to comment #13)
> I've got a growing suspicion that this is a race condition somewhere or other.
> I've added some output to the error msg, and it looks as if the number of idle
> kids is not being decremented even if they have gone idle and died:
> 
> Dec  9 17:06:45 louiswu6 spamd[17923]: prefork: child states: BBBIII
> Dec  9 17:06:45 louiswu6 spamd[17923]: prefork: child states: BBBIII
> Dec  9 17:06:47 louiswu6 spamd[17923]: prefork: oops! no idle kids in
> need_to_del_server? num_idle=3 states=444444 at
> /usr/lib/perl5/site_perl/5.10.0/Mail/SpamAssassin/SpamdForkScaling.pm line 849.
> 
> I've added the "num_idle=3 states=444444", the meaning should be obvious. 

hi Per --

I've simply run out of bandwidth to help with this right now :(  but you're on the right track, I think.  there's a possibility that some code run from a signal handler (search for $SIG{CHLD} usage in spamd) is updating a data structure which then isn't being read correctly from the "main thread".  we've had a bug like that before....
Comment 15 Per Jessen 2008-12-10 00:32:15 UTC
Not to worry, Justin - I've got the automatic restart working fine now, and I'm  thinking of maybe just removing the assertion for the moment. 
Comment 16 AXB 2012-07-19 14:10:03 UTC
FTR:

I had the same with SA 3.4 trunk - also as Per's "Following a SIGHUP of spamd"

Due to phat fingers I was running the SIGHUP job twice, consecutively!!!!! 
This drove spamd nuts and crashed.