|
SA Bugzilla – Full Text Bug Listing |
Summary: | prefork: oops! no idle kids in need_to_del_server? | ||
---|---|---|---|
Product: | Spamassassin | Reporter: | Per Jessen <per> |
Component: | spamc/spamd | Assignee: | SpamAssassin Developer Mailing List <dev> |
Status: | NEW --- | ||
Severity: | normal | ||
Priority: | P5 | ||
Version: | 3.2.5 | ||
Target Milestone: | Undefined | ||
Hardware: | PC | ||
OS: | Linux | ||
Whiteboard: |
Description
Per Jessen
2008-10-27 23:52:24 UTC
Same ting happened again just last night: Oct 29 19:21:35 louiswu7 spamd[10115]: spamd reload - 1 updated files found Oct 29 19:21:35 louiswu7 spamd[28250]: spamd: child 9982 killed successfully Oct 29 19:21:36 louiswu7 spamd[28250]: spamd: child 9507 killed successfully Oct 29 19:21:36 louiswu7 spamd[28250]: spamd: child 10104 killed successfully Oct 29 19:21:36 louiswu7 spamd[28250]: spamd: child 10020 killed successfully Oct 29 19:21:36 louiswu7 spamd[28250]: spamd: child 8415 killed successfully Oct 29 19:21:35 louiswu7 spamd[28250]: prefork: child states: BIIBI Oct 29 19:21:35 louiswu7 spamd[28250]: prefork: child states: BIIBI Oct 29 19:21:37 louiswu7 spamd[28250]: prefork: oops! no idle kids in need_to_del_server? at /usr/lib/perl5/site_perl/5.10.0/Mail/SpamAssassin/SpamdForkScaling.pm line 846. what OS and perl version are you running? can you run spamd with -D and post the resulting logs? er, _attach_ the resulting logs (as an attachment). ;) I'm running openSUSE 11.0 32-bit on AMD64 and ia32, perl is 5.10.0. About running with with debug - the logs would be enormous, but I'll look into it. For the moment, I've amended Mail/SpamAssassin/SpamdForkScaling.pm to print the pid and state of the children in need_to_del_server(). (In reply to comment #4) > About running with with debug - the logs would be enormous, but I'll look into > it. For the moment, I've amended Mail/SpamAssassin/SpamdForkScaling.pm to > print the pid and state of the children in need_to_del_server(). yeah. an option might be to just put up the gzipped log somewhere for me to get, instead of attaching it here. they'll be pretty big, but not gigabytes big ;) Okay, have restarted spamd with -D - looks like it'll produce about 100K log per minute. The config is reloaded at least four times a day, usually more. I guess I could up the number of reloads without upsetting anything. Running spamd with -D continually is slowing down processing a lot - this might be ok for one server, but the problem has just now occured on another. I can't run with -D everywhere, it would make everything grind to a halt. Isn't there some specific info you can look for? I can modify the code, just let me know. Also, did the debug output I made available off-line not help at all? Per, feel free to turn off -D. unfortunately I haven't had a chance to sit down with that debug info just yet :( I've had a look at that, and I have no idea what's going on. :( Can you monitor STDERR and see if some error is output there, but not to syslog, next time it happens? Update - I've been logging output to file, but only on a couple of newer SMP machines (I thought this might be some sort of concurrency issue). I've had 4 incidents since 1 November, but all on single-CPU machines. I guess I could start logging on those too, but I'll probably have to set up some log rotation. Okay, I did manage to log stdout+stder during one of those events, but I don't see any additional info in the log. Just the same errormsgs: [3132] info: prefork: child states: BIIBBB [3132] info: prefork: child states: BIIIBB [3132] info: spamd: handled cleanup of child pid 25409 due to SIGCHLD [3132] info: prefork: child states: BIIIB [3132] info: spamd: handled cleanup of child pid 25410 due to SIGCHLD [3132] info: prefork: child states: BIIB [3132] info: prefork: child states: BIII [3132] info: spamd: child 24218 killed successfully [3132] info: spamd: child 22482 killed successfully [3132] info: spamd: child 25411 killed successfully [3132] info: spamd: child 25408 killed successfully [3132] info: prefork: child states: BIII [3132] error: prefork: oops! no idle kids in need_to_del_server? at /usr/lib/perl5/site_perl/5.10.0/Mail/SpamAssassin/SpamdForkScaling.pm line 846. Update - I still see the probleme every now and then, but I've stopped logging stdout+stderr, it just produced way too much data. Let me know if there's anything else I can do to help diagnose this. I've got a growing suspicion that this is a race condition somewhere or other. I've added some output to the error msg, and it looks as if the number of idle kids is not being decremented even if they have gone idle and died: Dec 9 17:06:45 louiswu6 spamd[17923]: prefork: child states: BBBIII Dec 9 17:06:45 louiswu6 spamd[17923]: prefork: child states: BBBIII Dec 9 17:06:47 louiswu6 spamd[17923]: prefork: oops! no idle kids in need_to_del_server? num_idle=3 states=444444 at /usr/lib/perl5/site_perl/5.10.0/Mail/SpamAssassin/SpamdForkScaling.pm line 849. I've added the "num_idle=3 states=444444", the meaning should be obvious. (In reply to comment #13) > I've got a growing suspicion that this is a race condition somewhere or other. > I've added some output to the error msg, and it looks as if the number of idle > kids is not being decremented even if they have gone idle and died: > > Dec 9 17:06:45 louiswu6 spamd[17923]: prefork: child states: BBBIII > Dec 9 17:06:45 louiswu6 spamd[17923]: prefork: child states: BBBIII > Dec 9 17:06:47 louiswu6 spamd[17923]: prefork: oops! no idle kids in > need_to_del_server? num_idle=3 states=444444 at > /usr/lib/perl5/site_perl/5.10.0/Mail/SpamAssassin/SpamdForkScaling.pm line 849. > > I've added the "num_idle=3 states=444444", the meaning should be obvious. hi Per -- I've simply run out of bandwidth to help with this right now :( but you're on the right track, I think. there's a possibility that some code run from a signal handler (search for $SIG{CHLD} usage in spamd) is updating a data structure which then isn't being read correctly from the "main thread". we've had a bug like that before.... Not to worry, Justin - I've got the automatic restart working fine now, and I'm thinking of maybe just removing the assertion for the moment. FTR: I had the same with SA 3.4 trunk - also as Per's "Following a SIGHUP of spamd" Due to phat fingers I was running the SIGHUP job twice, consecutively!!!!! This drove spamd nuts and crashed. |