Bug 4310 - spamd stops responding to new connections
Summary: spamd stops responding to new connections
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: spamc/spamd (show other bugs)
Version: SVN Trunk (Latest Devel Version)
Hardware: Other other
: P1 blocker
Target Milestone: 3.1.0
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-05-06 15:00 UTC by Theo Van Dinter
Modified: 2005-06-07 10:56 UTC (History)
1 user (show)



Attachment Type Modified Status Actions Submitter/CLA Status
first debug text/plain None Theo Van Dinter [HasCLA]
after restarting text/plain None Theo Van Dinter [HasCLA]
another debug output text/plain None Theo Van Dinter [HasCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Theo Van Dinter 2005-05-06 15:00:53 UTC
Today I upgraded from r162091 to r168644.  After a relatively short amount of time, I found that 
spamd was no longer functioning -- spamc would connect to 783 (which connected), but spamd would 
just sit there and not actually respond.  There is nothing in the log about any errors.

[root@eclectic log]$ telnet localhost 783
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
PING
[hang]


[root@eclectic log]$ ps -ef | grep spamd
root     15457     1  0 16:23 ?        00:00:01 /usr/bin/spamd -d -c -H -r /var/run/spamd.pid -m 8
root     20399 15457  1 16:52 ?        00:01:02 spamd child
root     21831 15457  0 16:59 ?        00:00:14 spamd child
root     23135 23093  0 17:52 pts/5    00:00:00 grep spamd
[root@eclectic log]$ strace -p 15457
Process 15457 attached - interrupt to quit
select(16, [6], NULL, [6], {1, 949000}) = 0 (Timeout)
alarm(0)                                = 2
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_BLOCK, [ALRM], ~[KILL STOP RTMIN RT_1], 8) = 0
rt_sigaction(SIGALRM, {0x3011a8b110, [], 0x4000000}, {0x3011a8ed00, [], 0x4000000}, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[KILL STOP RTMIN RT_1], NULL, 8) = 0
rt_sigaction(SIGALRM, {0x3011a8ed00, [], 0x4000000}, NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
alarm(4)                                = 0
select(16, [6], NULL, [6], {2, 0})      = 0 (Timeout)
alarm(0)                                = 2
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_BLOCK, [ALRM], ~[KILL STOP RTMIN RT_1], 8) = 0
rt_sigaction(SIGALRM, {0x3011a8b110, [], 0x4000000}, {0x3011a8ed00, [], 0x4000000}, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[KILL STOP RTMIN RT_1], NULL, 8) = 0
rt_sigaction(SIGALRM, {0x3011a8ed00, [], 0x4000000}, NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
alarm(4)                                = 0
select(16, [6], NULL, [6], {2, 0})      = 0 (Timeout)
alarm(0)                                = 2
Comment 1 Justin Mason 2005-05-06 15:12:23 UTC
ick -- that looks like Daniel's hang bug (which was fixed).  What *is* in the
log?  can you rerun with debugs on?
Comment 2 Justin Mason 2005-05-06 15:14:02 UTC
also, what's the platform?  worth noting I'm running it live on my linux 2.4.25
server here without any hangs noted...
Comment 3 Theo Van Dinter 2005-05-06 15:15:17 UTC
Subject: Re:  spamd stops responding to new connections

On Fri, May 06, 2005 at 03:14:02PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> also, what's the platform?  worth noting I'm running it live on my linux 2.4.25
> server here without any hangs noted...

Linux eclectic.kluge.net 2.6.10-1.741_FC3smp #1 SMP Thu Jan 13 16:58:29 EST 2005 x86_64 x86_64 x86_64 GNU/Linux

Comment 4 Theo Van Dinter 2005-05-06 15:45:46 UTC
Subject: Re:  spamd stops responding to new connections

On Fri, May 06, 2005 at 03:12:23PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> ick -- that looks like Daniel's hang bug (which was fixed).  What *is* in the
> log?  can you rerun with debugs on?

Hrm.  I'll see what I can do about running with debug, maybe only run the new
version for myself or something.


I see we stopped logging child states:

May  6 15:54:03 eclectic spamd[28841]: prefork: child states: II 
May  6 15:54:05 eclectic spamd[28841]: prefork: child states: II 
May  6 17:53:42 eclectic spamd[23241]: prefork: child states: BI 
May  6 17:53:43 eclectic spamd[23241]: prefork: child states: IB 
May  6 17:53:44 eclectic spamd[23241]: prefork: child states: IB 

I issued the upgrade command at 15:53:54.  First noticed a problem around
16:23:30 when I restarted spamd.  Saw the probem again around 17:53 when
I downgraded (did a sudo shell for that, so I don't quite know the time).

May  6 15:54:05 eclectic spamd[28841]: server killed by SIGTERM, shutting down
May  6 15:54:08 eclectic spamassassin[14014]: spamd: server started on port 783/tcp (running version 3.1.0-r164435)
May  6 15:54:08 eclectic spamassassin[14014]: spamd: server successfully spawned child process, pid 14018
May  6 15:54:08 eclectic spamassassin[14014]: spamd: server successfully spawned child process, pid 14019
May  6 15:54:08 eclectic spamassassin[14014]: spamd: server pid: 14014

then it goes on and processes messages for a little while, but stops pretty
quickly:

May  6 15:59:32 eclectic spamassassin[14018]: spamd: setuid to timex succeeded
May  6 15:59:32 eclectic spamassassin[14018]: spamd: processing message <20050506122845.CFFA62FD21@eclectic.kluge.net> for timex:507
May  6 15:59:34 eclectic spamassassin[14018]: spamd: clean message (0.1/5.0) for timex:507 in 2.2 seconds, 8400 bytes.
May  6 15:59:34 eclectic spamassassin[14018]: spamd: result: .  0 - BAYES_00,INVESTMENT_ADVICE,MSGID_FROM_MTA_ID,NO_REAL_NAME scantime=2.2,size=8400,user=timex,uid=507,required_score=5.0,rhost=localhost,raddr=127.0.0.1,rport=47360,mid=<20050506122845.CFFA62FD21@eclectic.kluge.net>,bayes=1.66533453693773e-16,autolearn=no
May  6 16:23:57 eclectic spamassassin[14014]: spamd: server killed by SIGTERM, shutting down
May  6 16:23:57 eclectic spamc[15444]: connect(AF_INET) to spamd at 127.0.0.1 failed, retrying (#1 of 3): Connection refused
May  6 16:23:57 eclectic spamc[15447]: connect(AF_INET) to spamd at 127.0.0.1 failed, retrying (#1 of 3): Connection refused
May  6 16:23:58 eclectic spamassassin[15457]: spamd: server started on port 783/tcp (running version 3.1.0-r164435)
May  6 16:23:58 eclectic spamassassin[15457]: spamd: server successfully spawned child process, pid 15461
May  6 16:23:58 eclectic spamassassin[15457]: spamd: server successfully spawned child process, pid 15462
May  6 16:23:58 eclectic spamassassin[15457]: spamd: server pid: 15457

So apparently it died for ~25 minutes before I noticed it.  14018 did
113 msgs, 14019 did 60.  Did the restart, ran more messages for a while
(~40m) until it died and I downgraded:

May  6 17:04:16 eclectic spamassassin[20399]: spamd: clean message (-1.6/5.0) for timex:507 in 2.7 seconds, 11596 bytes.
May  6 17:04:16 eclectic spamassassin[20399]: spamd: result: . -1 - BAYES_00,INVESTMENT_ADVICE scantime=2.7,size=11596,user=timex,uid=507,required_score=5.0,rhost=localhost,raddr=127.0.0.1,rport=49902,mid=<200505061316.j46DGPYW002799@dune.met.ca>,bayes=2.22044604925031e-16,autolearn=ham
May  6 17:53:38 eclectic spamassassin[15457]: spamd: server killed by SIGTERM, shutting down
May  6 17:53:39 eclectic spamc[23190]: connect(AF_INET) to spamd at 127.0.0.1 failed, retrying (#1 of 3): Connection refused
May  6 17:53:39 eclectic spamc[23193]: connect(AF_INET) to spamd at 127.0.0.1 failed, retrying (#1 of 3): Connection refused
May  6 17:53:39 eclectic spamd[23201]: spamd starting 
May  6 17:53:40 eclectic spamd[23203]: server started on port 783/tcp (running version 3.1.0-r162060) May  6 17:53:40 eclectic spamd[23203]: server killed by SIGTERM, shutting down
May  6 17:53:41 eclectic spamd[23239]: spamd starting 
May  6 17:53:42 eclectic spamd[23241]: server started on port 783/tcp (running version 3.1.0-r162060)
May  6 17:53:42 eclectic spamd[23241]: server successfully spawned child process, pid 23245
May  6 17:53:42 eclectic spamd[23241]: server successfully spawned child process, pid 23246


This last run did spawn children and such unlike the first run:

    199 15461
    200 15462
      3 15470
      1 16696
    200 17343
      1 17925
    200 18183
      1 19058
    200 19341
    177 20399
      4 20426
      1 20554
     41 21831

Comment 5 Justin Mason 2005-05-06 15:56:48 UTC
'I see we stopped logging child states'

yes, that's annoying.  I think Daniel changed it to be an info(), which isn't
logged.  that seems pretty crappy, since I *want* to see informational stuff!!

Daniel, any reason why info() isn't output in the syslog?  IMO it should be on
by default, if it really is informational stuff, otherwise it should just be
subsumed into dbg().
Comment 6 Theo Van Dinter 2005-05-13 21:18:56 UTC
Subject: Re:  spamd stops responding to new connections

On Fri, May 06, 2005 at 03:45:46PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Hrm.  I'll see what I can do about running with debug, maybe only run the new
> version for myself or something.

I think this and bug 4330 are related.  fwiw.

Comment 7 Justin Mason 2005-05-14 01:15:48 UTC
'I think this and bug 4330 are related.  fwiw.'

iirc, r168644 still used Storable instead of the current config-clone method, so
seems unlikely.
Comment 8 Theo Van Dinter 2005-05-14 12:10:42 UTC
Subject: Re:  spamd stops responding to new connections

On Sat, May 14, 2005 at 01:15:48AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> 'I think this and bug 4330 are related.  fwiw.'
> 
> iirc, r168644 still used Storable instead of the current config-clone method, so
> seems unlikely.

ah, good point.  hrm, well, the daemon has been running fine for >16 hours
now on r170124.  the only difference, excluding code, is that right now 1)
I'm logging to a file instead of syslog, and 2) I'm running in debug mode.
When I get a chance to watch it, I'll shift it to normal mode and see if the
hang problem returns.

Comment 9 Theo Van Dinter 2005-05-16 11:05:39 UTC
(In reply to comment #8)
> ah, good point.  hrm, well, the daemon has been running fine for >16 hours
> now on r170124.  the only difference, excluding code, is that right now 1)
> I'm logging to a file instead of syslog, and 2) I'm running in debug mode.
> When I get a chance to watch it, I'll shift it to normal mode and see if the
> hang problem returns.

Well, it ran all weekend on 170124 in debug to file mode.  I reverted it to normal usage this morning 
~11am and it ran fine for the past 3 hours or so.  I just upgraded to r170422 and will continue to watch it.  
I'm not sure what changed in the code, but it's going a lot better now than it was previously.
Comment 10 Justin Mason 2005-05-16 21:17:34 UTC
'I'm not sure what changed in the code, but it's going a lot better now than it
was previously.'

getting rid of Storable?  
/me ducks ;)
Comment 11 Theo Van Dinter 2005-05-19 16:05:12 UTC
Well, r170442 has been running for several days now without issue, so ...  :)
Comment 12 Theo Van Dinter 2005-05-24 17:45:30 UTC
unfortunately, this has come back.  spamd has been running since may 18th without a problem on 
r170109, but then this evening spamd stopped responding.  I did a small amount of debugging (lsof + 
strace) before killing the children.  The parent respawned the kids and processing happened from 
there.

I then noticed that I now have 3 children running, 2 are stuck reading from a "socket".  I don't know 
what it's connected to, but I did also notice that we may have a socket leak:

spamd   1473 root    4u  unix 0x0000010042b29040          102191263 socket
spamd   1473 root    5u  IPv4          102191264                TCP localhost:783 (LISTEN)
spamd   1473 root    6u  unix 0x0000010029fde0c0          108488215 socket
spamd   1473 root    7u  unix 0x000001007d9f0940          108488217 socket
spamd   1473 root    9u  unix 0x00000100193f3040          108488328 socket
[root@eclectic /tmp]$ strace -p 1473
Process 1473 attached - interrupt to quit
read(9,  <unfinished ...>
Process 1473 detached
[root@eclectic /tmp]$ lsof | grep 0x0000010042b29040
spamd      1467     root    4u     unix 0x0000010042b29040            102191263 socket
spamd      1468     root    4u     unix 0x0000010042b29040            102191263 socket
spamd      1473     root    4u     unix 0x0000010042b29040            102191263 socket
spamd     26796     root    4u     unix 0x0000010042b29040            102191263 socket
[root@eclectic /tmp]$ lsof | grep 0x0000010029fde0c0
spamd      1468     root    6u     unix 0x0000010029fde0c0            108488215 socket
spamd      1473     root    6u     unix 0x0000010029fde0c0            108488215 socket
spamd     26796     root    6u     unix 0x0000010029fde0c0            108488215 socket
[root@eclectic /tmp]$ lsof | grep 0x000001007d9f0940
spamd      1473     root    7u     unix 0x000001007d9f0940            108488217 socket
spamd     26796     root    7u     unix 0x000001007d9f0940            108488217 socket
[root@eclectic /tmp]$ lsof | grep 0x00000100193f3040
spamd      1473     root    9u     unix 0x00000100193f3040            108488328 socket


I have script output from 2 lsof/strace checks if anyone's interested.
Comment 13 Justin Mason 2005-05-24 18:29:46 UTC
yep, please attach them!
Comment 14 Theo Van Dinter 2005-05-26 19:40:35 UTC
Created attachment 2904 [details]
first debug
Comment 15 Theo Van Dinter 2005-05-26 19:41:04 UTC
Created attachment 2905 [details]
after restarting
Comment 16 Justin Mason 2005-05-30 13:25:02 UTC
btw I presume there were no warnings in the logs?
Comment 17 Theo Van Dinter 2005-05-30 15:45:06 UTC
Subject: Re:  spamd stops responding to new connections

On Mon, May 30, 2005 at 01:25:02PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> btw I presume there were no warnings in the logs?

No, the log file is really not helpful:

May 27 14:17:54 eclectic spamassassin[10641]: spamd: connection from localhost
[127.0.0.1] at port 45278
May 27 14:17:54 eclectic spamassassin[10641]: spamd: setuid to felicity
succeededMay 27 14:17:54 eclectic spamassassin[10641]: spamd: processing
message <KLHKBHKNMEHPIAODLMMFAFIGGIAB.akersrf@scala.net> for felicity:501May
27 14:17:55 eclectic spamassassin[10641]: spamd: identified spam (11.7/5.0)
for felicity:501 in 1.1 seconds, 11572 bytes.
May 27 14:17:55 eclectic spamassassin[10641]: spamd: result: Y 11 -
BAYES_99,DCC_CHECK,DIGEST_MULTIPLE,INVESTMENT_ADVICE,MIME_BASE64_TEXT,NOT_ADVISOR,RATWARE_NAME_ID,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E4_51_100,RAZOR2_CHECK,SPF_HELO_FAIL
scantime=1.1,size=11572,user=felicity,uid=501,required_score=5.0,rhost=localhost,raddr=127.0.0.1,rport=45278,mid=<KLHKBHKNMEHPIAODLMMFAFIGGIAB.akersrf@scala.net>,bayes=0.999999253953297,autolearn=disabled

Then nothing for 40 minutes when I restarted the procs.

I think I forgot to upload the debug stuff I did from that restart...
Basically the same info as before.

BTW: Since there was more DNS work done, I upgraded to 178811.

Comment 18 Theo Van Dinter 2005-05-30 15:47:14 UTC
Created attachment 2910 [details]
another debug output

shows strace and lsof again.  also that a telnet to 783 gets connected but then
gets no response to input.
Comment 19 Justin Mason 2005-05-31 19:13:49 UTC
'I then noticed that I now have 3 children running, 2 are stuck reading from a
"socket".  I don't know 
what it's connected to, but I did also notice that we may have a socket leak:

spamd   1473 root    4u  unix 0x0000010042b29040          102191263 socket
spamd   1473 root    5u  IPv4          102191264                TCP
localhost:783 (LISTEN)
spamd   1473 root    6u  unix 0x0000010029fde0c0          108488215 socket
spamd   1473 root    7u  unix 0x000001007d9f0940          108488217 socket
spamd   1473 root    9u  unix 0x00000100193f3040          108488328 socket'

those, btw, aren't really sockets -- I can only surmise that that's how it's
implementing pipe() fd pairs on that platform.
Comment 20 Justin Mason 2005-05-31 19:16:54 UTC
meh.  my mistake; we do explicitly use socketpair() internally.  so it's
expected that they'd be UNIX-domain socks.
Comment 21 Justin Mason 2005-05-31 21:09:01 UTC
right.  r179348 has some additional code added that may help; firstly, it adds a
timeout so that a child will die after 300 secs of no orders from the master,
and the master itself will log and error and die if it gets out of sync in a
similar way.  Also, it adds more debugging and logging.

Finally, this deals with the case of partial writes -- we should not be using
syswrite/sysread without dealing with that, and I'll bet that's the problem here!

please give that rev a try.
Comment 22 Theo Van Dinter 2005-05-31 22:11:24 UTC
Subject: Re:  spamd stops responding to new connections

On Tue, May 31, 2005 at 09:09:01PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> please give that rev a try.

done, upgraded to 179349

Comment 23 Justin Mason 2005-06-06 16:48:35 UTC
fwiw, Daniel had another bug which I don't think was bugzilla'd, but it was
related to spamd hangs with the new prefork code; so I'll comment here since
I've just fixed it, I think.

The symptom is that if you strace the master spamd, it's in a loop of killing a
child process repeatedly.

There was a race condition, in that if the lowest idle child died or exited
while still in the I state, the master spamd would still attempt to assign it
tasks and get stuck in an infinite loop trying to write the 'accept' order to
the dead child.  This is now avoided by recomputing the lowest idle child in the
signal handler.

r185098 has that fix.
Comment 24 Justin Mason 2005-06-07 18:56:50 UTC
marking this FIXED until someone spots this bug again (or hopefully not)
Comment 25 Theo Van Dinter 2005-06-07 19:05:47 UTC
Subject: Re:  spamd stops responding to new connections

On Tue, Jun 07, 2005 at 06:56:51PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> marking this FIXED until someone spots this bug again (or hopefully not)

So far r179349 has been working fine for me.  /me keeps fingers crossed