SA Bugzilla – Bug 4258
spamd locks up after varying load
Last modified: 2005-05-04 15:11:57 UTC
When running spamd in a varying load environment ranging from very loaded to idle, it will eventually stop responding to requests.
*** Bug 4257 has been marked as a duplicate of this bug. ***
Created attachment 2785 [details] log of spamd running At the end of this, spamd won't even respond to a PING or allow you to connect to the port. It seems like spamd believes there are idle clients when there are none.
is it still running? can you get a ptrace/strace of both procs?
Created attachment 2790 [details] strace strace of parent process as zipped text file strace of last child returned this: read(8, <unfinished ...> --- SIGINT (Interrupt) --- --- SIGINT (Interrupt) --- --- SIGINT (Interrupt) --- (ended when I control-C'ed the parent) spamd is always dead when the last ": I" status is printed ps auxww: root 66218 1.2 0.8 26988 25700 p2 S+ 2:26PM 0:05.72 perl -T ./spamd -A 172.16.105.1 -i -C ../rules --max-children=5 --min-children=1 --max-spare=1 root 66217 1.1 0.0 1064 644 p2 S+ 2:26PM 0:05.20 strace -o /tmp/strace.out perl -T ./spamd -A 172.16.105.1 -i -C ../rules --max-children=5 --min-children=1 --max-spare=1 root 66443 0.1 0.8 27728 26400 p2 I+ 2:27PM 0:00.38 spamd child (perl)
wierd -- not seeing any problem here :( check it out... let's start halfway through. cast of characters: fd 5 = accept fd fd 6 = control channel to child 66451 fd 7 = " " 66443 write(2, "prefork: child states: IB\n", 26) = 26 one child is processing, the other one's idle select(16, [5 6 7], NULL, [5 6 7], {2, 0}) = 1 (in [6]) read(6, "I66451\n", 8192) = 7 child 66451 says "I'm idle" write(2, "prefork: child states: II\n", 26) = 26 prefork master says, "ok, we don't need two kids, we're idle" kill(66451, SIGINT <unfinished ...> --- SIGCHLD (Child exited) --- --- SIGCHLD (Child exited) --- <... kill resumed> ) = 0 sigreturn(0x9ebff7a0) = 0 select(8, [4], NULL, [4], {0, 0}) = 0 (Timeout) write(4, "<22>spamd[66218]: server hit by "..., 41) = 41 wait4(-1, [WIFSIGNALED(s) && WTERMSIG(s) == SIGINT], WNOHANG, NULL) = 66451 select(8, [4], NULL, [4], {0, 0}) = 0 (Timeout) write(4, "<22>spamd[66218]: handled cleanu"..., 54) = 54 wait4(-1, 0x9ebff7dc, WNOHANG, NULL) = 0 select(16, [5 6 7], NULL, [5 6 7], {2, 0}) = 1 (in [6]) read(6, "", 8192) = 0 close(-1) = -1 EBADF (Bad file descriptor) close(6) = 0 write(2, "prefork: child states: I\n", 25) = 25 and there's 1 idle child. select(16, [5 7], NULL, [5 7], {2, 0}) = 1 (in [5]) write(7, "A\n", 2) = 2 activity on the accept socket -- tells the child on fd 7 to accept... read(7, "B66443\n", 8192) = 7 child says "I'm busy" select(16, [5 7], NULL, [5 7], {2, 0}) = 1 (in [7]) read(7, "I66443\n", 8192) = 7 child finishes processing and says "I'm idle". write(2, "prefork: child states: I\n", 25) = 25 select(16, [5 7], NULL, [5 7], {2, 0}) = 0 (Timeout) select(16, [5 7], NULL, [5 7], {2, 0}) = 0 (Timeout) select(16, [5 7], NULL, [5 7], {2, 0}) = 0 (Timeout) select(16, [5 7], NULL, [5 7], {2, 0}) = 0 (Timeout) select(16, [5 7], NULL, [5 7], {2, 0}) = 0 (Timeout) select(16, [5 7], NULL, [5 7], {2, 0}) = 0 (Timeout) select(16, [5 7], NULL, [5 7], {2, 0}) = 0 (Timeout) select(16, [5 7], NULL, [5 7], {2, 0}) = 0 (Timeout) select(16, [5 7], NULL, [5 7], {2, 0}) = 0 (Timeout) and repeat until eventually killed. So in other words, the master spamd is not getting hung -- it really is selecting on the accept fd. But there's no activity arriving on that fd, which is obviously incorrect?! right now, I'm stumped as to how a select on the accept() fd could *fail* to notice activity, if there is activity there, and that select() call looks 100% fine to me (unless BSD has changed its semantics somewhere since BSD 4.3...) Could you retry again, using strace -ft (follow forks, add time)? maybe there's something in the kid's straces that'll help.
OK, after some investigation offline, I've figured this out -- <jmason> I think we have a perl IO issue <jmason> 5664 perl 1115186325.449041 CALL read(0x8,0x8c77000,0x2000) <jmason> 5664 perl 1115186325.449053 GIO fd 8 read 12 bytes <jmason> "B5682 <jmason> I5682 <jmason> " <jmason> note the two lines in the one read <jmason> it totally misses the second one... <jmason> the problem is this <jmason> 1. parent calls select(), which says "channel N is active" <jmason> 2. it reads from channel N <jmason> 3. perl calls read(2) and gets those 2 lines <jmason> 4. perl returns one of those 2 lines to our code <jmason> 5. parent later calls select() again, and gets no activity -- although there is a line that wasn't processed sitting in the stdio buffer <jmason> we call $socket->getline() <quinlan> you're not mixing sysread/syswrite/etc. with read/write, etc. ? <jmason> nope. but perhaps select() doesn't mix with readline() <jmason> at least not on freebsd ;) Moving to a new protocol that uses fixed-size buffers instead of variable-length "lines" fixes the bug -- since we can then use the (unbuffered) sysread() instead of (buffered) getline(). r168094.
sooo.... any luck with this fix?
Subject: Re: spamd locks up after varying load > sooo.... any luck with this fix? Yes! It fixed the lock-up. I considered it closed once you resolved the bug. I would have reopened if it didn't. ;-)