Bug 4258 - spamd locks up after varying load
Summary: spamd locks up after varying load
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: spamc/spamd (show other bugs)
Version: SVN Trunk (Latest Devel Version)
Hardware: PC FreeBSD
: P3 critical
Target Milestone: 3.1.0
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
: 4257 (view as bug list)
Depends on:
Blocks:
 
Reported: 2005-04-12 18:37 UTC by Daniel Quinlan
Modified: 2005-05-04 15:11 UTC (History)
1 user (show)



Attachment Type Modified Status Actions Submitter/CLA Status
log of spamd running text/plain None Daniel Quinlan [HasCLA]
strace application/octet-stream None Daniel Quinlan [HasCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Daniel Quinlan 2005-04-12 18:37:45 UTC
When running spamd in a varying load environment ranging from very
loaded to idle, it will eventually stop responding to requests.
Comment 1 Daniel Quinlan 2005-04-12 18:38:07 UTC
*** Bug 4257 has been marked as a duplicate of this bug. ***
Comment 2 Daniel Quinlan 2005-04-12 18:41:03 UTC
Created attachment 2785 [details]
log of spamd running

At the end of this, spamd won't even respond to a PING or allow you to
connect to the port.

It seems like spamd believes there are idle clients when there are none.
Comment 3 Justin Mason 2005-04-12 19:15:10 UTC
is it still running?  can you get a ptrace/strace of both procs?
Comment 4 Daniel Quinlan 2005-04-13 14:37:24 UTC
Created attachment 2790 [details]
strace

strace of parent process as zipped text file

strace of last child returned this:

read(8,  <unfinished ...>
--- SIGINT (Interrupt) ---
--- SIGINT (Interrupt) ---
--- SIGINT (Interrupt) ---

(ended when I control-C'ed the parent)

spamd is always dead when the last ": I" status is printed

ps auxww:

root	66218  1.2  0.8 26988 25700  p2  S+    2:26PM	0:05.72 perl -T ./spamd
-A 172.16.105.1 -i -C ../rules --max-children=5 --min-children=1 --max-spare=1
root	66217  1.1  0.0  1064  644  p2	S+    2:26PM   0:05.20 strace -o
/tmp/strace.out perl -T ./spamd -A 172.16.105.1 -i -C ../rules --max-children=5
--min-children=1 --max-spare=1
root	66443  0.1  0.8 27728 26400  p2  I+    2:27PM	0:00.38 spamd child
(perl)
Comment 5 Justin Mason 2005-04-14 18:56:18 UTC
wierd -- not seeing any problem here :(  check it out...
let's start halfway through.

cast of characters:

    fd 5 = accept fd
    fd 6 = control channel to child 66451
    fd 7 = "                      " 66443


write(2, "prefork: child states: IB\n", 26) = 26

        one child is processing, the other one's idle

select(16, [5 6 7], NULL, [5 6 7], {2, 0}) = 1 (in [6])
read(6, "I66451\n", 8192)               = 7

        child 66451 says "I'm idle"

write(2, "prefork: child states: II\n", 26) = 26

        prefork master says, "ok, we don't need two kids, we're idle"

kill(66451, SIGINT <unfinished ...>
--- SIGCHLD (Child exited) ---
--- SIGCHLD (Child exited) ---
<... kill resumed> )                    = 0
sigreturn(0x9ebff7a0)                   = 0
select(8, [4], NULL, [4], {0, 0})       = 0 (Timeout)
write(4, "<22>spamd[66218]: server hit by "..., 41) = 41
wait4(-1, [WIFSIGNALED(s) && WTERMSIG(s) == SIGINT], WNOHANG, NULL) = 66451
select(8, [4], NULL, [4], {0, 0})       = 0 (Timeout)
write(4, "<22>spamd[66218]: handled cleanu"..., 54) = 54
wait4(-1, 0x9ebff7dc, WNOHANG, NULL)    = 0
select(16, [5 6 7], NULL, [5 6 7], {2, 0}) = 1 (in [6])
read(6, "", 8192)                       = 0
close(-1)                               = -1 EBADF (Bad file descriptor)
close(6)                                = 0
write(2, "prefork: child states: I\n", 25) = 25

        and there's 1 idle child.

select(16, [5 7], NULL, [5 7], {2, 0})  = 1 (in [5])
write(7, "A\n", 2)                      = 2

        activity on the accept socket -- tells the child on fd 7 to accept...

read(7, "B66443\n", 8192)               = 7

        child says "I'm busy"

select(16, [5 7], NULL, [5 7], {2, 0})  = 1 (in [7])
read(7, "I66443\n", 8192)               = 7

        child finishes processing and says "I'm idle".

write(2, "prefork: child states: I\n", 25) = 25
select(16, [5 7], NULL, [5 7], {2, 0})  = 0 (Timeout)
select(16, [5 7], NULL, [5 7], {2, 0})  = 0 (Timeout)
select(16, [5 7], NULL, [5 7], {2, 0})  = 0 (Timeout)
select(16, [5 7], NULL, [5 7], {2, 0})  = 0 (Timeout)
select(16, [5 7], NULL, [5 7], {2, 0})  = 0 (Timeout)
select(16, [5 7], NULL, [5 7], {2, 0})  = 0 (Timeout)
select(16, [5 7], NULL, [5 7], {2, 0})  = 0 (Timeout)
select(16, [5 7], NULL, [5 7], {2, 0})  = 0 (Timeout)
select(16, [5 7], NULL, [5 7], {2, 0})  = 0 (Timeout)

        and repeat until eventually killed.


So in other words, the master spamd is not getting hung -- it really is
selecting on the accept fd.  But there's no activity arriving on that fd, which
is obviously incorrect?!

right now, I'm stumped as to how a select on the accept() fd could *fail*
to notice activity, if there is activity there, and that select() call looks
100% fine to me (unless BSD has changed its semantics somewhere since BSD
4.3...)

Could you retry again, using strace -ft (follow forks, add time)? maybe there's
something in the kid's straces that'll help.

Comment 6 Justin Mason 2005-05-04 00:00:26 UTC
OK, after some investigation offline, I've figured this out --

<jmason> I think we have a perl IO issue
<jmason>   5664 perl     1115186325.449041 CALL  read(0x8,0x8c77000,0x2000)
<jmason>   5664 perl     1115186325.449053 GIO   fd 8 read 12 bytes
<jmason>        "B5682
<jmason>         I5682
<jmason>        "
<jmason> note the two lines in the one read
<jmason> it totally misses the second one...
<jmason> the problem is this
<jmason> 1. parent calls select(), which says "channel N is active"
<jmason> 2. it reads from channel N
<jmason> 3. perl calls read(2) and gets those 2 lines
<jmason> 4. perl returns one of those 2 lines to our code
<jmason> 5. parent later calls select() again, and gets no activity -- although
there is a line that wasn't processed sitting in the stdio buffer

<jmason> we call $socket->getline()
<quinlan> you're not mixing sysread/syswrite/etc. with read/write, etc. ?
<jmason> nope.  but perhaps select() doesn't mix with readline()
<jmason> at least not on freebsd ;)

Moving to a new protocol that uses fixed-size buffers instead of variable-length
"lines" fixes the bug -- since we can then use the (unbuffered) sysread()
instead of (buffered) getline().

r168094.
Comment 7 Justin Mason 2005-05-04 23:11:57 UTC
sooo.... any luck with this fix?
Comment 8 Daniel Quinlan 2005-05-04 23:23:22 UTC
Subject: Re:  spamd locks up after varying load

> sooo.... any luck with this fix?

Yes!  It fixed the lock-up.  I considered it closed once you resolved
the bug.  I would have reopened if it didn't.  ;-)