SA Bugzilla – Bug 4310
spamd stops responding to new connections
Last modified: 2005-06-07 10:56:50 UTC
Today I upgraded from r162091 to r168644. After a relatively short amount of time, I found that spamd was no longer functioning -- spamc would connect to 783 (which connected), but spamd would just sit there and not actually respond. There is nothing in the log about any errors. [root@eclectic log]$ telnet localhost 783 Trying 127.0.0.1... Connected to localhost.localdomain (127.0.0.1). Escape character is '^]'. PING [hang] [root@eclectic log]$ ps -ef | grep spamd root 15457 1 0 16:23 ? 00:00:01 /usr/bin/spamd -d -c -H -r /var/run/spamd.pid -m 8 root 20399 15457 1 16:52 ? 00:01:02 spamd child root 21831 15457 0 16:59 ? 00:00:14 spamd child root 23135 23093 0 17:52 pts/5 00:00:00 grep spamd [root@eclectic log]$ strace -p 15457 Process 15457 attached - interrupt to quit select(16, [6], NULL, [6], {1, 949000}) = 0 (Timeout) alarm(0) = 2 rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0 rt_sigprocmask(SIG_BLOCK, [ALRM], ~[KILL STOP RTMIN RT_1], 8) = 0 rt_sigaction(SIGALRM, {0x3011a8b110, [], 0x4000000}, {0x3011a8ed00, [], 0x4000000}, 8) = 0 rt_sigprocmask(SIG_SETMASK, ~[KILL STOP RTMIN RT_1], NULL, 8) = 0 rt_sigaction(SIGALRM, {0x3011a8ed00, [], 0x4000000}, NULL, 8) = 0 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 alarm(4) = 0 select(16, [6], NULL, [6], {2, 0}) = 0 (Timeout) alarm(0) = 2 rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0 rt_sigprocmask(SIG_BLOCK, [ALRM], ~[KILL STOP RTMIN RT_1], 8) = 0 rt_sigaction(SIGALRM, {0x3011a8b110, [], 0x4000000}, {0x3011a8ed00, [], 0x4000000}, 8) = 0 rt_sigprocmask(SIG_SETMASK, ~[KILL STOP RTMIN RT_1], NULL, 8) = 0 rt_sigaction(SIGALRM, {0x3011a8ed00, [], 0x4000000}, NULL, 8) = 0 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 alarm(4) = 0 select(16, [6], NULL, [6], {2, 0}) = 0 (Timeout) alarm(0) = 2
ick -- that looks like Daniel's hang bug (which was fixed). What *is* in the log? can you rerun with debugs on?
also, what's the platform? worth noting I'm running it live on my linux 2.4.25 server here without any hangs noted...
Subject: Re: spamd stops responding to new connections On Fri, May 06, 2005 at 03:14:02PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote: > also, what's the platform? worth noting I'm running it live on my linux 2.4.25 > server here without any hangs noted... Linux eclectic.kluge.net 2.6.10-1.741_FC3smp #1 SMP Thu Jan 13 16:58:29 EST 2005 x86_64 x86_64 x86_64 GNU/Linux
Subject: Re: spamd stops responding to new connections On Fri, May 06, 2005 at 03:12:23PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote: > ick -- that looks like Daniel's hang bug (which was fixed). What *is* in the > log? can you rerun with debugs on? Hrm. I'll see what I can do about running with debug, maybe only run the new version for myself or something. I see we stopped logging child states: May 6 15:54:03 eclectic spamd[28841]: prefork: child states: II May 6 15:54:05 eclectic spamd[28841]: prefork: child states: II May 6 17:53:42 eclectic spamd[23241]: prefork: child states: BI May 6 17:53:43 eclectic spamd[23241]: prefork: child states: IB May 6 17:53:44 eclectic spamd[23241]: prefork: child states: IB I issued the upgrade command at 15:53:54. First noticed a problem around 16:23:30 when I restarted spamd. Saw the probem again around 17:53 when I downgraded (did a sudo shell for that, so I don't quite know the time). May 6 15:54:05 eclectic spamd[28841]: server killed by SIGTERM, shutting down May 6 15:54:08 eclectic spamassassin[14014]: spamd: server started on port 783/tcp (running version 3.1.0-r164435) May 6 15:54:08 eclectic spamassassin[14014]: spamd: server successfully spawned child process, pid 14018 May 6 15:54:08 eclectic spamassassin[14014]: spamd: server successfully spawned child process, pid 14019 May 6 15:54:08 eclectic spamassassin[14014]: spamd: server pid: 14014 then it goes on and processes messages for a little while, but stops pretty quickly: May 6 15:59:32 eclectic spamassassin[14018]: spamd: setuid to timex succeeded May 6 15:59:32 eclectic spamassassin[14018]: spamd: processing message <20050506122845.CFFA62FD21@eclectic.kluge.net> for timex:507 May 6 15:59:34 eclectic spamassassin[14018]: spamd: clean message (0.1/5.0) for timex:507 in 2.2 seconds, 8400 bytes. May 6 15:59:34 eclectic spamassassin[14018]: spamd: result: . 0 - BAYES_00,INVESTMENT_ADVICE,MSGID_FROM_MTA_ID,NO_REAL_NAME scantime=2.2,size=8400,user=timex,uid=507,required_score=5.0,rhost=localhost,raddr=127.0.0.1,rport=47360,mid=<20050506122845.CFFA62FD21@eclectic.kluge.net>,bayes=1.66533453693773e-16,autolearn=no May 6 16:23:57 eclectic spamassassin[14014]: spamd: server killed by SIGTERM, shutting down May 6 16:23:57 eclectic spamc[15444]: connect(AF_INET) to spamd at 127.0.0.1 failed, retrying (#1 of 3): Connection refused May 6 16:23:57 eclectic spamc[15447]: connect(AF_INET) to spamd at 127.0.0.1 failed, retrying (#1 of 3): Connection refused May 6 16:23:58 eclectic spamassassin[15457]: spamd: server started on port 783/tcp (running version 3.1.0-r164435) May 6 16:23:58 eclectic spamassassin[15457]: spamd: server successfully spawned child process, pid 15461 May 6 16:23:58 eclectic spamassassin[15457]: spamd: server successfully spawned child process, pid 15462 May 6 16:23:58 eclectic spamassassin[15457]: spamd: server pid: 15457 So apparently it died for ~25 minutes before I noticed it. 14018 did 113 msgs, 14019 did 60. Did the restart, ran more messages for a while (~40m) until it died and I downgraded: May 6 17:04:16 eclectic spamassassin[20399]: spamd: clean message (-1.6/5.0) for timex:507 in 2.7 seconds, 11596 bytes. May 6 17:04:16 eclectic spamassassin[20399]: spamd: result: . -1 - BAYES_00,INVESTMENT_ADVICE scantime=2.7,size=11596,user=timex,uid=507,required_score=5.0,rhost=localhost,raddr=127.0.0.1,rport=49902,mid=<200505061316.j46DGPYW002799@dune.met.ca>,bayes=2.22044604925031e-16,autolearn=ham May 6 17:53:38 eclectic spamassassin[15457]: spamd: server killed by SIGTERM, shutting down May 6 17:53:39 eclectic spamc[23190]: connect(AF_INET) to spamd at 127.0.0.1 failed, retrying (#1 of 3): Connection refused May 6 17:53:39 eclectic spamc[23193]: connect(AF_INET) to spamd at 127.0.0.1 failed, retrying (#1 of 3): Connection refused May 6 17:53:39 eclectic spamd[23201]: spamd starting May 6 17:53:40 eclectic spamd[23203]: server started on port 783/tcp (running version 3.1.0-r162060) May 6 17:53:40 eclectic spamd[23203]: server killed by SIGTERM, shutting down May 6 17:53:41 eclectic spamd[23239]: spamd starting May 6 17:53:42 eclectic spamd[23241]: server started on port 783/tcp (running version 3.1.0-r162060) May 6 17:53:42 eclectic spamd[23241]: server successfully spawned child process, pid 23245 May 6 17:53:42 eclectic spamd[23241]: server successfully spawned child process, pid 23246 This last run did spawn children and such unlike the first run: 199 15461 200 15462 3 15470 1 16696 200 17343 1 17925 200 18183 1 19058 200 19341 177 20399 4 20426 1 20554 41 21831
'I see we stopped logging child states' yes, that's annoying. I think Daniel changed it to be an info(), which isn't logged. that seems pretty crappy, since I *want* to see informational stuff!! Daniel, any reason why info() isn't output in the syslog? IMO it should be on by default, if it really is informational stuff, otherwise it should just be subsumed into dbg().
Subject: Re: spamd stops responding to new connections On Fri, May 06, 2005 at 03:45:46PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote: > Hrm. I'll see what I can do about running with debug, maybe only run the new > version for myself or something. I think this and bug 4330 are related. fwiw.
'I think this and bug 4330 are related. fwiw.' iirc, r168644 still used Storable instead of the current config-clone method, so seems unlikely.
Subject: Re: spamd stops responding to new connections On Sat, May 14, 2005 at 01:15:48AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote: > 'I think this and bug 4330 are related. fwiw.' > > iirc, r168644 still used Storable instead of the current config-clone method, so > seems unlikely. ah, good point. hrm, well, the daemon has been running fine for >16 hours now on r170124. the only difference, excluding code, is that right now 1) I'm logging to a file instead of syslog, and 2) I'm running in debug mode. When I get a chance to watch it, I'll shift it to normal mode and see if the hang problem returns.
(In reply to comment #8) > ah, good point. hrm, well, the daemon has been running fine for >16 hours > now on r170124. the only difference, excluding code, is that right now 1) > I'm logging to a file instead of syslog, and 2) I'm running in debug mode. > When I get a chance to watch it, I'll shift it to normal mode and see if the > hang problem returns. Well, it ran all weekend on 170124 in debug to file mode. I reverted it to normal usage this morning ~11am and it ran fine for the past 3 hours or so. I just upgraded to r170422 and will continue to watch it. I'm not sure what changed in the code, but it's going a lot better now than it was previously.
'I'm not sure what changed in the code, but it's going a lot better now than it was previously.' getting rid of Storable? /me ducks ;)
Well, r170442 has been running for several days now without issue, so ... :)
unfortunately, this has come back. spamd has been running since may 18th without a problem on r170109, but then this evening spamd stopped responding. I did a small amount of debugging (lsof + strace) before killing the children. The parent respawned the kids and processing happened from there. I then noticed that I now have 3 children running, 2 are stuck reading from a "socket". I don't know what it's connected to, but I did also notice that we may have a socket leak: spamd 1473 root 4u unix 0x0000010042b29040 102191263 socket spamd 1473 root 5u IPv4 102191264 TCP localhost:783 (LISTEN) spamd 1473 root 6u unix 0x0000010029fde0c0 108488215 socket spamd 1473 root 7u unix 0x000001007d9f0940 108488217 socket spamd 1473 root 9u unix 0x00000100193f3040 108488328 socket [root@eclectic /tmp]$ strace -p 1473 Process 1473 attached - interrupt to quit read(9, <unfinished ...> Process 1473 detached [root@eclectic /tmp]$ lsof | grep 0x0000010042b29040 spamd 1467 root 4u unix 0x0000010042b29040 102191263 socket spamd 1468 root 4u unix 0x0000010042b29040 102191263 socket spamd 1473 root 4u unix 0x0000010042b29040 102191263 socket spamd 26796 root 4u unix 0x0000010042b29040 102191263 socket [root@eclectic /tmp]$ lsof | grep 0x0000010029fde0c0 spamd 1468 root 6u unix 0x0000010029fde0c0 108488215 socket spamd 1473 root 6u unix 0x0000010029fde0c0 108488215 socket spamd 26796 root 6u unix 0x0000010029fde0c0 108488215 socket [root@eclectic /tmp]$ lsof | grep 0x000001007d9f0940 spamd 1473 root 7u unix 0x000001007d9f0940 108488217 socket spamd 26796 root 7u unix 0x000001007d9f0940 108488217 socket [root@eclectic /tmp]$ lsof | grep 0x00000100193f3040 spamd 1473 root 9u unix 0x00000100193f3040 108488328 socket I have script output from 2 lsof/strace checks if anyone's interested.
yep, please attach them!
Created attachment 2904 [details] first debug
Created attachment 2905 [details] after restarting
btw I presume there were no warnings in the logs?
Subject: Re: spamd stops responding to new connections On Mon, May 30, 2005 at 01:25:02PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote: > btw I presume there were no warnings in the logs? No, the log file is really not helpful: May 27 14:17:54 eclectic spamassassin[10641]: spamd: connection from localhost [127.0.0.1] at port 45278 May 27 14:17:54 eclectic spamassassin[10641]: spamd: setuid to felicity succeededMay 27 14:17:54 eclectic spamassassin[10641]: spamd: processing message <KLHKBHKNMEHPIAODLMMFAFIGGIAB.akersrf@scala.net> for felicity:501May 27 14:17:55 eclectic spamassassin[10641]: spamd: identified spam (11.7/5.0) for felicity:501 in 1.1 seconds, 11572 bytes. May 27 14:17:55 eclectic spamassassin[10641]: spamd: result: Y 11 - BAYES_99,DCC_CHECK,DIGEST_MULTIPLE,INVESTMENT_ADVICE,MIME_BASE64_TEXT,NOT_ADVISOR,RATWARE_NAME_ID,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E4_51_100,RAZOR2_CHECK,SPF_HELO_FAIL scantime=1.1,size=11572,user=felicity,uid=501,required_score=5.0,rhost=localhost,raddr=127.0.0.1,rport=45278,mid=<KLHKBHKNMEHPIAODLMMFAFIGGIAB.akersrf@scala.net>,bayes=0.999999253953297,autolearn=disabled Then nothing for 40 minutes when I restarted the procs. I think I forgot to upload the debug stuff I did from that restart... Basically the same info as before. BTW: Since there was more DNS work done, I upgraded to 178811.
Created attachment 2910 [details] another debug output shows strace and lsof again. also that a telnet to 783 gets connected but then gets no response to input.
'I then noticed that I now have 3 children running, 2 are stuck reading from a "socket". I don't know what it's connected to, but I did also notice that we may have a socket leak: spamd 1473 root 4u unix 0x0000010042b29040 102191263 socket spamd 1473 root 5u IPv4 102191264 TCP localhost:783 (LISTEN) spamd 1473 root 6u unix 0x0000010029fde0c0 108488215 socket spamd 1473 root 7u unix 0x000001007d9f0940 108488217 socket spamd 1473 root 9u unix 0x00000100193f3040 108488328 socket' those, btw, aren't really sockets -- I can only surmise that that's how it's implementing pipe() fd pairs on that platform.
meh. my mistake; we do explicitly use socketpair() internally. so it's expected that they'd be UNIX-domain socks.
right. r179348 has some additional code added that may help; firstly, it adds a timeout so that a child will die after 300 secs of no orders from the master, and the master itself will log and error and die if it gets out of sync in a similar way. Also, it adds more debugging and logging. Finally, this deals with the case of partial writes -- we should not be using syswrite/sysread without dealing with that, and I'll bet that's the problem here! please give that rev a try.
Subject: Re: spamd stops responding to new connections On Tue, May 31, 2005 at 09:09:01PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote: > please give that rev a try. done, upgraded to 179349
fwiw, Daniel had another bug which I don't think was bugzilla'd, but it was related to spamd hangs with the new prefork code; so I'll comment here since I've just fixed it, I think. The symptom is that if you strace the master spamd, it's in a loop of killing a child process repeatedly. There was a race condition, in that if the lowest idle child died or exited while still in the I state, the master spamd would still attempt to assign it tasks and get stuck in an infinite loop trying to write the 'accept' order to the dead child. This is now avoided by recomputing the lowest idle child in the signal handler. r185098 has that fix.
marking this FIXED until someone spots this bug again (or hopefully not)
Subject: Re: spamd stops responding to new connections On Tue, Jun 07, 2005 at 06:56:51PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote: > marking this FIXED until someone spots this bug again (or hopefully not) So far r179349 has been working fine for me. /me keeps fingers crossed