SA Bugzilla – Bug 3983
[review] adopt Apache preforking algorithm
Last modified: 2005-03-15 10:15:17 UTC
from discussions with a few people at ApacheCon, I think I have an idea of what may have caused trouble swap-wise in SpamAssassin use for a lot of people.

By default, SpamAssassin 3.0.0's spamd uses preforking, and starts 5 servers. So we have:

  10 MB shared, plus
  20 MB unshared x 5 = 110 MB used by all processes

In addition, something that's become clear is that Linux (and possibly other OSes) will schedule each of the preforked servers to service an incoming accept() call *in turn*, even if some of those are swapped out.

Now, let's say you're running that on a server with 90 MB of free RAM. At least one or two of those servers will have to be swapped out to fit them all in. Once a sufficient number of requests arrive -- even under low load -- that server will be re-swapped in, and another swapped out, every 5 requests... in other words, creating swap load.

What I'm proposing to do is to adopt the Apache preforking algorithm. This will mean:

- instead of the requests being distributed across all servers equally, most requests are delivered to the first idle server (sorted by PID). In other words, a small number of the running servers will handle most of the requests under normal load. This will reduce swapping, since the inactive servers stay inactive, and therefore swapped out, until they're needed. I think this is the main fix that'll improve response time and cut swapping.

- the number of servers running will no longer simply be == max-children; instead, it'll scale from 1 to max-children based on how busy the server is, using a min-idle-children and max-idle-children pair of settings. In other words, under low load, server children are killed off; under high load, new servers are forked in advance to deal with more simultaneous requests. This is implemented, and *seems* to work well, although it could do with testing under variable loads.

- the master process serialises all accept() calls.
  The parent spamd entirely controls when, and if, a child accepts a new connection, and which child does so. This is required to implement "lowest-PID idle child accepts", and -- surprisingly -- has very little effect on throughput in my testing. In a simple speed test using about 100 messages, I get these figures:

    3.0: 10.819 8.441 9.419 8.865 9.704 = 47.248/5 = 9.45
    new: 9.851 9.277 9.759              = 28.887/3 = 9.63

  that's 9.45 messages/sec with the 3.0 code, 9.63 m/s with the new preforking algorithm. However, I think there's actually a small slowdown compared to non-preforking, based on this:

    rrn: 10.240 9.393 9.339 10.033 = 39.005/4 = 9.75

  the methodology used to generate these, btw, is just: start spamd, and do "for mail in [100-message corpus] ; do spamc < $mail ; done" -- in other words, send 1 message through each time, in sequence.

  note the *lower* variability in the m/s figures using the new code. Even though my machine has 1 GB of RAM, the re-use of the same spamd child processes seems to increase the predictability of scan times, even without swapping!

- a bonus: we'll have a back-channel from the spamd children to the spamd parent to send messages and report stats. This should be pretty damn useful ;)

So, I'll upload the current state of the patch -- I hacked it up over the last few days and finished it, more or less, last night. I'd really appreciate some people trying it out...

regarding target: 3.1.0 at least. But the way I did it is pretty nonintrusive -- in my opinion it could even fit into 3.0.x ;)
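[Editorial note] The two mechanisms described in the comment above -- scaling the child count between min-idle and max-idle bounds, and handing each connection to the lowest-PID idle child -- can be sketched abstractly. This is an illustrative Python sketch of the policy only, not spamd's actual Perl code; all names here (`pick_child`, `adjust`, the state letters) are hypothetical:

```python
# Sketch of the Apache-style preforking policy described above.
# State letters and threshold names are hypothetical, not spamd's API.
IDLE, BUSY = "I", "B"

def pick_child(children):
    """Hand the connection to the lowest-PID idle child (None if all busy)."""
    idle = [pid for pid, state in children.items() if state == IDLE]
    return min(idle) if idle else None

def adjust(children, min_idle, max_idle, max_children):
    """Decide whether to fork new idle children or kill surplus idle ones."""
    idle = sum(1 for s in children.values() if s == IDLE)
    if idle < min_idle and len(children) < max_children:
        return ("fork", min_idle - idle)    # busy: fork ahead of demand
    if idle > max_idle:
        return ("kill", idle - max_idle)    # quiet: trim idle children
    return ("steady", 0)

# Example: two idle children and one busy, with min_idle=1, max_idle=2, max=5
children = {101: IDLE, 102: BUSY, 103: IDLE}
print(pick_child(children))             # -> 101 (lowest idle PID)
print(adjust(children, 1, 2, 5))        # -> ('steady', 0)
```

Because `pick_child` always prefers the lowest PID, the same few children stay hot while the rest remain idle (and swapped out) -- which is exactly the swap-reduction argument made above.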
Created attachment 2527 [details] implementation here's the impl. I'll check this into trunk shortly unless anyone gives me a -1. ;)
OK, here's some good test results. I came up with a little test that uses a variable set of loads (multiple messages in parallel in a burst, followed by messages one after the other, followed by messages in serial separated by a "sleep 1"). This is to simulate an environment with variable message loads, instead of the usual "massive burst" or "strictly serial" test loads we have in our test suite. Results are excellent:

new prefork code:

  real 1m50.385s  user 0m0.297s  sys 0m0.472s
  real 1m52.728s  user 0m0.298s  sys 0m0.456s
  real 1m48.972s  user 0m0.287s  sys 0m0.460s

old prefork, using --round-robin switch:

  real 1m48.504s  user 0m0.293s  sys 0m0.477s
  real 1m48.844s  user 0m0.323s  sys 0m0.527s
  real 1m49.016s  user 0m0.317s  sys 0m0.519s

that's pretty painless -- almost unnoticeable! I'm checking in the code to 3.1.0, and moving this ticket to 3.0.2 in case we want to apply there.

here's the test scripts, if you want to try it out...

"scr":

  #!/bin/sh

  scan_all_bg () {
    for f in t/data/spam/* ; do
      if [ -f $f ] ; then true ; else continue ; fi
      ./spamc/spamc -p 48373 < $f > o &
    done
  }

  scan_all_fg () {
    for f in t/data/spam/* ; do
      if [ -f $f ] ; then true ; else continue ; fi
      ./spamc/spamc -p 48373 < $f > o
    done
  }

  scan_all_sleep_1_sec () {
    for f in t/data/spam/* ; do
      if [ -f $f ] ; then true ; else continue ; fi
      ./spamc/spamc -p 48373 < $f > o
      sleep 1
    done
  }

  scan_all_bg
  scan_all_sleep_1_sec
  scan_all_bg
  scan_all_bg
  scan_all_bg
  scan_all_bg
  scan_all_bg
  scan_all_fg
  scan_all_sleep_1_sec
  scan_all_bg
  scan_all_bg
  scan_all_bg
  scan_all_fg
  scan_all_bg
  scan_all_bg
  scan_all_fg
  scan_all_sleep_1_sec
  wait

"runtest":

  #!/bin/sh

  perl -T -Ilib ./spamd/spamd.raw \
      -x -s stderr -p 48373 -L -m5 > log1 2>&1 &
  sdpid=$!
  sleep 5
  time sh scr
  time sh scr
  time sh scr
  kill -15 $sdpid ; sleep 1

  perl -T -Ilib ./spamd/spamd.raw --round-robin \
      -x -s stderr -p 48373 -L -m5 > log1 2>&1 &
  sdpid=$!
  sleep 5
  time sh scr
  time sh scr
  time sh scr
  kill -15 $sdpid ; sleep 1

checked in as r106054.
It'd be nice to have in 3.0.2, as this is critical for some people; but the patch is really big, and not widely tested. :-( If we got a bunch of people to run it, that would help. +0
May I suggest that a pre or rc release is made, 3.0.2-pre1, with the Apache preforking model? Having something like that may encourage wider testing, and if this could go into Debian Experimental, it may help too... :-)
Subject: Re: [review] adopt Apache preforking algorithm I'm not in favor of putting this into 3.0.2. This is such a large change, it should go into 3.1, go through pre and rc releases, etc. Sticking to a 3-4 month release cycle for 3.1 -- that would be my strategy for getting this out.
Subject: Re: [review] adopt Apache preforking algorithm

> May I suggest that a pre or rc release is made, 3.0.2-pre1 with the
> Apache preforking model? Having something like that may encourage wider
> testing, and if this could go into Debian Experimental too, it may help
> too... :-)

I'd definitely like to see more people trying out the patch first anyway before we go rolling tarballs ;)

--j.
Subject: Re: [review] adopt Apache preforking algorithm On Sun, Nov 21, 2004 at 12:01:10PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote: > I'm not in favor of putting this into 3.0.2. This is such a large > change, it should go into 3.1, go through pre and rc releases, etc. Agreed, I didn't plan on having this feature in 3.0.
alright that's 2 -1's on getting this into 3.0.x. it'll take several compelling user reports of memory problems fixed to get it in, if at all, given that ;)
Andy Jezierski noted (via mail):

The patch seems to be working pretty well on my system; I don't notice any slowdowns, and paging is definitely better. I'm running on a FreeBSD 5.2.1 system with 192 MB of RAM, using the defaults for the new switches, so two spamd child processes start up, and it seems my system runs with just those two active most of the time. I've seen it running with 5 processes a few times, which is what I have my limit set to, but most of the time it's using 2 or 3.

Here's a couple of top displays; the first is before the patch, the second is after.

last pid: 64134;  load averages: 1.02, 1.38, 1.30   up 45+20:26:52  09:34:27
28 processes:  1 running, 27 sleeping
CPU states: 74.3% user, 0.0% nice, 13.2% system, 0.4% interrupt, 12.1% idle
Mem: 116M Active, 9728K Inact, 47M Wired, 6664K Cache, 28M Buf, 572K Free
Swap: 512M Total, 48M Used, 464M Free, 9% Inuse, 316K In, 416K Out

  PID USERNAME   PRI NICE   SIZE    RES STATE    TIME   WCPU    CPU COMMAND
64073 spamd      113    0 53552K 38820K select   0:21 13.64% 12.40% perl
64011 spamd        4    0 56244K 37632K accept   0:51 15.61% 11.57% perl
63916 spamd      107    0 59760K 36248K select   1:28 15.15% 11.23% perl
63995 spamd        4    0 57212K 39360K accept   0:58 10.62%  9.91% perl
63996 spamd        4    0 56248K 37392K accept   0:46  4.13%  3.76% perl
  389 root        96    0  1608K   268K select   4:58  0.00%  0.00% ntpd
  260 root        96    0  1312K   248K select   3:47  0.00%  0.00% syslogd
  414 root        96    0  3512K   476K select   3:25  0.00%  0.00% sendmail
28604 spamd       20    0 46188K     0K pause    1:11  0.00%  0.00% <perl>
  431 root         8    0  1336K   120K nanslp   0:40  0.00%  0.00% cron
  359 root        96    0  1236K    84K select   0:18  0.00%  0.00% usbd
  408 root        96    0  3488K     0K select   0:10  0.00%  0.00% <sshd>
  417 smmsp       20    0  3404K     0K pause    0:08  0.00%  0.00% <sendmail>
64116 root        96    0  2212K  1076K RUN      0:00  0.00%  0.00% top
64094 root         4    0  6240K     0K sbwait   0:00  0.00%  0.00% <sshd>
64112 root        20    0  2312K     0K pause    0:00  0.00%  0.00% <csh>
sidewinder#

last pid: 74409;  load averages: 1.25, 0.85, 0.70   up 46+20:09:26  09:17:01
26 processes:  2 running, 24 sleeping
CPU states: 82.9% user, 0.0% nice, 14.8% system, 1.6% interrupt, 0.8% idle
Mem: 73M Active, 22M Inact, 47M Wired, 11M Cache, 28M Buf, 27M Free
Swap: 512M Total, 13M Used, 499M Free, 2% Inuse

  PID USERNAME   PRI NICE   SIZE    RES STATE    TIME   WCPU    CPU COMMAND
74322 spamd      127    0 56592K 46180K select   1:02 39.79% 39.79% perl
74370 spamd       -8    0 55752K 45212K piperd   0:38 35.92% 35.64% perl
  389 root        96    0  1608K   396K select   5:04  0.00%  0.00% ntpd
  260 root        96    0  1312K   280K select   3:52  0.00%  0.00% syslogd
  414 root        96    0  3512K   868K select   3:29  0.00%  0.00% sendmail
  431 root         8    0  1336K   192K nanslp   0:40  0.00%  0.00% cron
66431 spamd       96    0 46432K 28368K select   0:28  0.00%  0.00% perl
  359 root        96    0  1236K    84K select   0:18  0.00%  0.00% usbd
  408 root        97    0  3488K   488K select   0:10  0.00%  0.00% sshd
  417 smmsp       20    0  3404K   424K pause    0:09  0.00%  0.00% sendmail
66044 ajezierski  96    0  6240K  1052K select   0:01  0.00%  0.00% sshd
66047 root        20    0  2316K  1088K pause    0:00  0.00%  0.00% csh
66042 root         4    0  6240K   924K sbwait   0:00  0.00%  0.00% sshd
sidewinder#

Great job! Thanks, Andy
Subject: Re: [review] adopt Apache preforking algorithm I asked Dean Gaudet, one of the Apache httpd hackers, about this a ways back but I don't think I ever forwarded this (he gave his okay to forward). ------- start of cut text -------------- From: dean gaudet <dean@arctic.org> apache keels over if you don't set MaxClients appropriately for your RAM... it's one of the first few things i mention on the apache tuning page. you really shouldn't attempt to defend against dum setups with some complicated accept() protocol... keep it simple. reacting to load rather than launching a fixed number of children is an improvement though (or more accurately... a bug fix -- since pre-3.0 behaved like this) -- then you can default to launching one child, and that will handle 99% of j. random user's needs. i was honestly happy with post-forking SA... there are already 4 other post-forked daemons in my mail delivery setup (qmail+clamav). it's the most trivial to get post-forking correct in a platform independent manner... and it's trivial to have post-fork react to load changes. but then i'm running SA with one request per child, because i prefer to exchange the perf loss for the extra paranoia of having my children drop root privs entirely early... (and there's that config leak bug at least in the 3.0.1 my install is based on). damn does SA really have 20MB unshared? suck. so yeah sorry, no sage advice here... i'll be disabling most of this myself, security/paranoia is more important. ------- end ----------------------------
the MaxClients thing won't be a problem for us, I don't think. By default that remains at 5, and users would have scaled that by now if they ran into low-RAM situations with 3.0.x. I also added a warning about that. (MaxClients == --max-children in SA-land.)

I think the patch matches up with what Dean notes; it'll generally run with 1 kid, 2 if there's mail active, and scale up to 5 if things get really busy, by default.

this patch works a *lot* better, in many ways, btw -- check out what it looks like on my server after running for nearly a day:

29880 root      17  15 22248 5244 4276 S N  0.0  1.0  0:03 spamd
11505 root      17  15 65336  60M 7180 S N  0.0 12.0  0:44 spamd
28823 root      17  15 29872  25M 7204 S N  0.0  5.1  0:03 spamd

i.e. the master proc has only 5 MB paged in while still having all the code and modules preloaded and precompiled, which is very nice; the subprocs have scaled their RAM usage based on the size of messages coming in; there are only two subprocs; and one of those (the big one) is the one handling all the messages, with the second having handled only a small amount of overflow, based on the CPU times. That all looks very good.

(note however that I don't think the reported SHR values are correct on that kernel, as mentioned in prev bugs, so ignore those.)
moving to 3.1 milestone
Subject: Re: [review] adopt Apache preforking algorithm No problems with the prefork patch on 3.0.1 after a few days on my test system. It only has a load of about 1000 messages a day though.
For what this is worth, we are running this patch on SpamAssassin 3.0.1 on Solaris 8 with perl 5.8.5. It is working ok and does reduce memory usage and swap except when we have enough load to require the maximum number of spamd processes. Even then, the faster processing before and after the peak load seems to reduce the impact of the peak load.
fwiw, it's been rock-solid for my server for several weeks now (although again that's a pretty low-load system). Given that this works well, has several positive reports, and is in 3.1.0, and we're agreed it won't be backported to 3.0.x, I'll close this bug. (people who may wish to apply the patch to 3.0.x can still get the patch from here and apply it, of course.) if we ever decide to backport to 3.0.x officially, we can reopen.
marking FIXED
Subject: Re: [review] adopt Apache preforking algorithm

On Tue, Dec 07, 2004 at 12:48:43PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> fwiw, it's been rock-solid for my server for several weeks now (although again
> that's a pretty low-load system).
>
> Given that this works well, has several positive reports, and is in 3.1.0, and
> we're agreed it won't be backported to 3.0.x, I'll close this bug. (people who
> may wish to apply the patch to 3.0.x can still get the patch from here and apply
> it, of course.)

Actually it barfed on my system just now... I didn't have time to debug too much because of the flood of unscanned mails, but:

root 11997     1  1 Dec05 ?  00:45:54 /usr/bin/spamd -d -c -H -r /var/run/spamd.pid -m 8
root 14055 11997  0 13:53 ?  00:00:30 spamd child
root 22425 11997  0 15:56 ?  00:00:04 spamd child
root 25947 11997  0 16:45 ?  00:00:00 [spamd <defunct>]

telnetting to 783 and giving a PING got no response. spamc was timing out connections and stuff was going through unscanned.

running: spamassassin-3.1.0-svn109845
Subject: Re: [review] adopt Apache preforking algorithm

On Tue, Dec 07, 2004 at 02:37:52PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Actually it barfed on my system just now... I didn't have time to debug too
> much because of the flood of unscanned mails, but:
>
> root 11997     1  1 Dec05 ?  00:45:54 /usr/bin/spamd -d -c -H -r /var/run/spamd.pid -m 8
> root 14055 11997  0 13:53 ?  00:00:30 spamd child
> root 22425 11997  0 15:56 ?  00:00:04 spamd child
> root 25947 11997  0 16:45 ?  00:00:00 [spamd <defunct>]
>
> telnetting to 783 and giving a PING got no response. spamc was timing out
> connections and stuff was going through unscanned.
>
> running: spamassassin-3.1.0-svn109845

Someone on IRC mentioned a similar experience last week. I asked them to note it in the bug, but I guess they never did.

Michael
damn, a strace would have been handy :(
Subject: Re: [review] adopt Apache preforking algorithm On Tue, Dec 07, 2004 at 04:52:16PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote: > damn, a strace would have been handy :( Yeah, sorry. I was going to do some debugging, but mail was kind of flooding in at that moment, and there were 17 other things going on at work at the same time, so I had to just restart and figured I'd come back to it later. A rough guess is that the parent was trying to send the connection to the defunct child, and the child obviously didn't do anything about it.
Subject: Re: [review] adopt Apache preforking algorithm

On Tue, Dec 07, 2004 at 08:20:31PM -0500, Theo Van Dinter wrote:
> Yeah, sorry. I was going to do some debugging, but mail was kind of flooding
> in at that moment, and there were 17 other things going on at work at the same
> time, so I had to just restart and figured I'd come back to it later.
>
> A rough guess is that the parent was trying to send the connection to the
> defunct child, and the child obviously didn't do anything about it.

Just happened again.

root 28832     1  6 Dec07 ?  00:28:02 /usr/bin/spamd -d -c -H -r /var/run/spamd.pid -m 8
root 13863 28832  0 Dec07 ?  00:01:32 spamd child
root 15337 28832  0 Dec07 ?  00:00:16 spamd child
root 27295 28832  0 00:28 ?  00:00:01 [spamd <defunct>]

the children were sitting in a read() wait. the parent was wedged in a loop:

[...]
futex(0x40190780, FUTEX_WAIT, 2, NULL) = -1 ENOSYS (Function not implemented)
futex(0x40190780, FUTEX_WAIT, 2, NULL) = -1 ENOSYS (Function not implemented)
futex(0x40190780, FUTEX_WAIT, 2, NULL) = -1 ENOSYS (Function not implemented)
futex(0x40190780, FUTEX_WAIT, 2, NULL) = -1 ENOSYS (Function not implemented)
futex(0x40190780, FUTEX_WAIT, 2, NULL) = -1 ENOSYS (Function not implemented)
futex(0x40190780, FUTEX_WAIT, 2, NULL) = -1 ENOSYS (Function not implemented)
[...]

Summary of my perl5 (revision 5.0 version 6 subversion 1) configuration:
  Platform:
    osname=linux, osvers=2.4.21-1.1931.2.393.entsmp, archname=i386-linux
    uname='linux bugs.devel.redhat.com 2.4.21-1.1931.2.393.entsmp #1 smp thu aug 14 14:47:21 edt 2003 i686 unknown '
    config_args='-des -Doptimize=-O2 -march=i386 -mcpu=i686 -Dcc=gcc -Dcf_by=Red Hat, Inc. -Dcccdlflags=-fPIC -Dinstallprefix=/usr -Dprefix=/usr -Darchname=i386-linux -Dvendorprefix=/usr -Dsiteprefix=/usr -Uusethreads -Uuseithreads -Uuselargefiles -Dd_dosuid -Dd_semctl_semun -Di_db -Di_ndbm -Di_gdbm -Di_shadow -Di_syslog -Dman3ext=3pm -Dinc_version_list=5.6.0/i386-linux 5.6.0'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=undef d_sfio=undef uselargefiles=undef usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
  Compiler:
    cc='gcc', ccflags ='-fno-strict-aliasing -I/usr/local/include',
    optimize='-O2 -march=i386 -mcpu=i686',
    cppflags='-fno-strict-aliasing -I/usr/local/include'
    ccversion='', gccversion='2.96 20000731 (Red Hat Linux 7.3 2.96-113)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=4
    alignbytes=4, usemymalloc=n, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -ldl -lm -lc -lcrypt -lutil
    perllibs=-lnsl -ldl -lm -lc -lcrypt -lutil
    libc=/lib/libc-2.2.5.so, so=so, useshrplib=false, libperl=libperl.a
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib'
futex(0x40190780, FUTEX_WAIT, 2, NULL) = -1 ENOSYS

"It is typically used to implement the contended case of a lock in shared memory, as described in futex(4)."

watching the futex at 0x40190780, waiting indefinitely (NULL) for it to change from value 2.

what the hell is doing *that*?!
Subject: Re: [review] adopt Apache preforking algorithm On Tue, 7 Dec 2004, bugzilla-daemon@bugzilla.spamassassin.org mused: > what the hell is doing *that*?! futexes are mostly used by the NPTL threading layer at the moment, so I guess it is a threading bug. (Probably in glibc.) If gdb and perl are compiled with debugging, try attaching to the looping process with GDB and getting a backtrace (unlikely, I know; glibc in particular is almost always shipped stripped).
Subject: Re: [review] adopt Apache preforking algorithm

On Wed, Dec 08, 2004 at 01:00:43AM -0500, Theo Van Dinter wrote:
> Just happened again.

yep. Nothing in the logs about it...

Dec  9 04:13:32 eclectic spamd[2950]: result: Y 36 - BAYES_99,BIZ_TLD,DRUGS_ANXIETY,DRUGS_ANXIETY_EREC,DRUGS_ERECTILE,DRUGS_MANYKINDS,DRUGS_MUSCLE,DRUGS_SLEEP,DRUGS_SLEEP_EREC,HELO_DYNAMIC_IPADDR2,HTML_BACKHAIR_8,HTML_FONT_TINY,HTML_MESSAGE,LONGWORDS,RAZOR2_CF_RANGE_51_100,RAZOR2_CHECK,RCVD_IN_BL_SPAMCOP_NET,RCVD_IN_DSBL,RCVD_IN_NJABL_DUL,RCVD_IN_SORBS_DUL,RCVD_IN_XBL,URI_AFFILIATE,X_MESSAGE_INFO scantime=4.7,size=5754,mid=<8d76r4pf372nis$ull80hjg286hc0$aw443f1@SR480>,bayes=1,autolearn=spam
Dec  9 11:26:40 eclectic spamd[29136]: server killed by SIGTERM, shutting down

Interestingly, the parent will take a SIGTERM, killing the futex() loop. There were a few notes when spamd was restarting:

Dec  9 11:26:45 eclectic spamd[3399]: Use of uninitialized value in index at /usr/lib/perl5/site_perl/5.6.1/Mail/SpamAssassin/SpamdForkScaling.pm line 281, <GEN127> line 24.
Dec  9 11:26:45 eclectic spamd[3399]: Use of uninitialized value in concatenation (.) or string at /usr/lib/perl5/site_perl/5.6.1/Mail/SpamAssassin/SpamdForkScaling.pm line 285, <GEN127> line 24.
Dec  9 11:26:45 eclectic spamd[2950]: Use of uninitialized value in index at /usr/lib/perl5/site_perl/5.6.1/Mail/SpamAssassin/SpamdForkScaling.pm line 281, <GEN124> line 189.
Dec  9 11:26:45 eclectic spamd[2950]: Use of uninitialized value in concatenation (.) or string at /usr/lib/perl5/site_perl/5.6.1/Mail/SpamAssassin/SpamdForkScaling.pm line 285, <GEN124> line 189.
Dec  9 11:26:45 eclectic spamd[10223]: spamd starting
those warnings, unfortunately, aren't helpful; they're just "the parent went away" -type warnings. anyway, r111430 fixes that, but the hang bug is still around...
I was too quick in reporting no problems. I am running spamd with the command

  /usr/bin/spamd -d -c -m 8

and when first started, I saw the expected 2 to 8 children depending on the load. Today I noticed that there were never more than 2 children even though there were as many as 7 spamc processes running. In the log file I found:

  Dec 9 13:51:03 bluegill.adi.com spamd[13605]: prefork: server reached --max-clients setting, consider raising it

repeated many times (different dates, of course). When I restarted spamd, the correct behavior returned.
Created attachment 2560 [details] debugging version of code Tom, Theo, this patch may be worthwhile; it applies on top of the previous one. basically it (a) adds more debugging and paranoid error-checking, and (b) promotes the "active child scoreboard" debug message to a warning so Tom's situation can be checked without turning on debugging.
Subject: Re: [review] adopt Apache preforking algorithm On Thu, Dec 09, 2004 at 02:23:47PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote: > Tom, Theo, this patch may be worthwhile; it applies on top of the previous one. > basically it (a) adds more debugging and paranoid error-checking, and (b) > promotes the "active child scoreboard" debug message to a warning so Tom's > situation can be checked without turning on debugging. Just to throw some fun into my mix, I completely rocked the perl world on my system this evening. For the longest time I've been running perl 5.6.1 as came with RedHat 7.3, on my RHEL 3 machine. I upgraded the packages tonight (and the bazillion perl modules I installed before) to the 5.8.0 version that comes with RHEL 3. So I'm going to watch spamd and see if it still has issues. If it does, I'll throw the patch on and see what happens.
For what this is worth, we are parsing a little over 250k messages per day with a spamd/spamc setup. Spamd runs on two boxes, with an SQL backend for Bayes. After upgrading to 3.0.1 we were seeing huge delays processing mail. This patch definitely fixed the issue for us, and we have seen no problems since, even at our scale of install.
I am having a problem after applying the patch. "make test" fails as follows:

t/spamc_B................... Not found: spamflag = X-Spam-Flag: YES
# Failed test 8 in t/SATest.pm at line 530
t/spamc_B...................FAILED test 8
        Failed 1/8 tests, 87.50% okay
t/spamc_c...................# Failed test 1 in t/spamc_c.t at line 15
t/spamc_c...................FAILED test 1
        Failed 1/2 tests, 50.00% okay
t/spamc_c_stdout_closed.....# Failed test 1 in t/spamc_c_stdout_closed.t at line 31
t/spamc_c_stdout_closed.....FAILED test 1
        Failed 1/2 tests, 50.00% okay
t/spamc_E...................# Failed test 1 in t/spamc_E.t at line 15
t/spamc_E...................FAILED test 1
        Failed 1/2 tests, 50.00% okay

I cut the test run off at this point, as some of the tests were taking approx. 30 minutes to fail.
For whatever this is worth. I did some digging through the logs. The drop in the maximum number of spamd children does not happen all at once. After a few hours the maximum number of children was down to 6 from the original 8. A few hours later it was down to 5. It went down to 3 in about an hour. It bottomed out at 2 as before. At the time I checked the log, there were: 308 lines containing "server successfully spawned child process" 306 lines containing "server hit by SIGCHLD" 306 lines containing "handled cleanup of child" Nothing funny there.
Subject: Re: [review] adopt Apache preforking algorithm On Tue, Dec 07, 2004 at 11:04:11PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote: > watching the futex at 0x40190780, waiting indefinitely (NULL) for it to change > from value 2. > what the hell is doing *that*?! Since upgrading perl on the box, I haven't had any problems with spamd. So apparently the issue was the old 5.6.1 install.
I edited SpamdForkScaling.pm to make the change from the last part of the patch from Comment #27. After running awhile, I found the following in the syslog when there were 2 spamd children shown by ps:

Dec 13 14:27:01 bluegill.adi.com spamd[6896]: prefork: child states: KKII at /opt/perl/lib/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 331, <GEN228> line 3.

A short time later, when the system was more active, there were 6 children and:

Dec 13 14:38:07 bluegill.adi.com spamd[6896]: prefork: child states: KKBBBBBB at /opt/perl/lib/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 331, <GEN252> line 4.
Subject: Re: [review] adopt Apache preforking algorithm On Sun, 12 Dec 2004, bugzilla-daemon@bugzilla.spamassassin.org stated: > Since upgrading perl on the box, I haven't had any problems with spamd. So > apparently the issue was the old 5.6.1 install. It might have been the old 5.6.1 Perl threading library, perhaps: that does some rather risky things which NPTL might well break. (This is pure conjecture, though.)
Am I correct in assuming that the child states mean something like: I = idle, B = busy, K = killed but waiting for something before being forgotten?

We went from 5 'K's to 6 'K's (running with '-m 8') yesterday evening, and now, 14 hours later, we still have 6 'K's. I would like to run with your whole debugging patch, but as it breaks 'make test' I am afraid to try it in production.

Following are 3 lines from syslog; the first one is the last one with 5 'K's. I just noticed that some lines do not have the part with GEN...

Dec 13 19:29:33 bluegill.adi.com spamd[6896]: prefork: child states: KKIIKKKI at /opt/perl/lib/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 331, <GEN738> line 161.
Dec 13 19:29:33 bluegill.adi.com spamd[6896]: prefork: child states: KKIIKKKK at /opt/perl/lib/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 331.
Dec 14 09:27:32 bluegill.adi.com spamd[6896]: prefork: child states: KKKIKBKK at /opt/perl/lib/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 331, <GEN852> line 271.
Ah! I think that there may be a race condition. I changed the line

  dbg ("prefork: adjust: decreasing, too many idle children ($num_idle > $self->{max_idle}), killed $pid");

in SpamdForkScaling.pm from dbg to warn, and found the following sequence in syslog when one of the children entered state 'K':

Dec 14 15:10:14 bluegill.adi.com spamd[1459]: prefork: child states: III at /opt/perl/lib/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 331, <GEN6> line 349.
Dec 14 15:10:14 bluegill.adi.com spamd[1459]: server hit by SIGCHLD
Dec 14 15:10:14 bluegill.adi.com spamd[1459]: handled cleanup of child pid 13252
Dec 14 15:10:14 bluegill.adi.com spamd[1459]: prefork: adjust: decreasing, too many idle children (3 > 2), killed 13252 at /opt/perl/lib/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 380, <GEN6> line 349.
Dec 14 15:10:14 bluegill.adi.com spamd[1459]: prefork: child states: IIK at /opt/perl/lib/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 331.

Note the order of the lines. Most of the time, the line with 'prefork: adjust: decreasing, too many idle children' happens before the lines with 'server hit by SIGCHLD' and 'handled cleanup of child'. This time it came after those lines, and we lost a child. I expect that this is a race condition.
Possible fix. The following change fixes my problem. I don't know if this is the best way to fix this or not. Definitely a race condition.

--- SpamdForkScaling.pm.orig	Tue Dec 14 18:28:10 2004
+++ SpamdForkScaling.pm	Tue Dec 14 18:33:01 2004
@@ -375,8 +375,8 @@
     die "oops! no idle kids in need_to_del_server?";
   }
 
-  kill 'INT' => $pid;
   $self->set_child_state ($pid, PFSTATE_KILLED);
+  kill 'INT' => $pid;
 
   dbg ("prefork: adjust: decreasing, too many idle children ($num_idle > $self->{max_idle}), killed $pid");
 }
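[Editorial note] The reordering above works because the parent's SIGCHLD cleanup can run at any moment after the signal is sent; if the scoreboard write lands after that cleanup, it resurrects the dead child's entry as a permanent 'K'. A minimal Python simulation of the two orderings (all names here are mine, not SpamdForkScaling.pm's API), modeling the worst case where the cleanup runs immediately after the kill:

```python
# Simulate the parent's child-state scoreboard and an asynchronous SIGCHLD
# cleanup. The real cleanup runs in a signal handler; here we model the
# worst-case timing where it fires between the parent's two statements.

def sigchld_cleanup(scoreboard, pid):
    # "handled cleanup of child pid N": forget the reaped child entirely
    scoreboard.pop(pid, None)

def kill_child_buggy(scoreboard, pid, cleanup):
    # original order: signal first, then record the state
    cleanup(scoreboard, pid)     # child dies; SIGCHLD cleanup runs first
    scoreboard[pid] = "K"        # late write resurrects the dead entry

def kill_child_fixed(scoreboard, pid, cleanup):
    # fixed order: record the state first, then signal
    scoreboard[pid] = "K"
    cleanup(scoreboard, pid)     # cleanup removes the entry for good

board = {101: "I", 102: "I", 103: "I"}
kill_child_buggy(board, 103, sigchld_cleanup)
print(board)   # {101: 'I', 102: 'I', 103: 'K'} -- 'K' entry stuck forever

board = {101: "I", 102: "I", 103: "I"}
kill_child_fixed(board, 103, sigchld_cleanup)
print(board)   # {101: 'I', 102: 'I'} -- child fully forgotten
```

With the fixed ordering, even if the SIGCHLD handler fires instantly, it runs after the state was recorded, so the scoreboard never accumulates phantom 'K' slots.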
looks very reasonable. thanks for looking into it; I had spotted that in logs, but time to investigate has been short :( so that fixes it for you?
oh btw -- yes, a child in "K" state should disappear pretty quickly, and should not stick around for more than a couple of seconds in general.
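[Editorial note] For readers following the scoreboard lines in this bug, here is a tiny illustrative sketch (hypothetical names, not spamd's code) of how per-child states map to a logged string like "KKII" -- I = idle, B = busy, K = killed but not yet reaped. Children are ordered here by PID for determinism; the actual ordering in spamd may differ:

```python
# Hypothetical renderer matching the "prefork: child states: KKII" log
# lines: one letter per child. State names and ordering are assumptions.
STATE_LETTERS = {"idle": "I", "busy": "B", "killed": "K"}

def render_scoreboard(children):
    """children: dict mapping pid -> state name; returns e.g. 'KKII'."""
    return "".join(STATE_LETTERS[children[pid]] for pid in sorted(children))

kids = {13250: "killed", 13251: "killed", 13252: "idle", 13253: "idle"}
print(render_scoreboard(kids))   # KKII
```

A healthy scoreboard should show 'K' letters only transiently; a 'K' that persists for hours, as reported above, means the killed child was never reaped off the board.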
> so that fixes it for you? Yes, that fixes this problem.
Created attachment 2565 [details] new additions on top of patch 2527

this replaces patch 2560, and adds:

(a) the spamd-child-states warning, which has proved its usefulness ;)
(b) Tom's fix for the race condition
(c) a fix so that spamd_hup.t passes

it's already checked into trunk as r112032.
Is this patch expected to work on top of 3.0.1? The patch applies cleanly, but I am getting the same errors on make test as I got in Comment #30. I tried commenting out parts of the patch to see what caused the problem, and it is the setup of the alarm that breaks things. Has Mail::SpamAssassin::Util::trap_sigalrm_fully changed in 3.1? If so, then my errors are probably not a problem in 3.1. I am running perl 5.8.5 on Solaris 8.
unfortunately, patch 2565 won't apply cleanly on 3.0.0; the earlier bits will, but not that, because it has a dependency on 3828....
i have this running on many systems, and it's working great. however, i have one that is a problem, and it's really the only oddball out of the bunch: it's a dual hyper-threaded Xeon box, as opposed to all the others being uniprocessor P4s. here are the details...

Spamd stops accepting connections after processing several thousand messages.

- new apache forking model
- running SMP kernel, dual HT Xeons

using the following settings...

  max_children 15
  max_per_child 25
  max_spare 5
  min_children 5
  min_spare 2

when spamd stops accepting connections, an strace shows a hang on connect():

# echo | strace spamc -t 5
..
..
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
connect(3, {sin_family=AF_INET, sin_port=htons(783), sin_addr=inet_addr("127.0.0.1")}, 16 <unfinished ...>

tcpdump shows the SYN going to spamd, but spamd never ACKs back. as soon as i svc -d and svc -u on /service/spamd, it is available and working again. if i try to HUP it with a svc -t, that doesn't help.

last few debug lines related to preforking...
@4000000041df61e201060fe4 [10653] debug: prefork: child 21075: entering state 1
@4000000041df61e201077f14 [10653] debug: prefork: child reports idle
@4000000041df61e201084264 [10653] debug: prefork: child 21298: entering state 1
@4000000041df61e201099254 [10653] debug: prefork: child reports idle
@4000000041df61e2010b8e24 [10653] debug: prefork: child states: IIBIBIIBIIIBBBI
@4000000041df61e2010d9d7c [10653] debug: prefork: child 21298: entering state 3
@4000000041df61e2010f2034 [10653] debug: prefork: adjust: decreasing, too many idle children (9 > 5), killed 21298
@4000000041df61e2017c841c [10653] debug: prefork: child 20945: entering state 1
@4000000041df61e2017ef134 [10653] debug: prefork: child reports idle
@4000000041df61e2017ff304 [10653] debug: prefork: child closed connection
@4000000041df61e20185752c [10653] debug: prefork: child states: IIBIBIIIIIIBBB
@4000000041df61e201873e34 [10653] debug: prefork: child 21085: entering state 3
@4000000041df61e201890f0c [10653] debug: prefork: adjust: decreasing, too many idle children (9 > 5), killed 21085
@4000000041df61e201f895d4 [10653] debug: prefork: child closed connection
@4000000041df61e201fb06d4 [10653] debug: prefork: child states: IIBIBIIIIIBBB
@4000000041df61e201fc3b6c [10653] debug: prefork: child 21081: entering state 3
@4000000041df61e201fd838c [10653] debug: prefork: adjust: decreasing, too many idle children (8 > 5), killed 21081
@4000000041df61e202cb5adc [10653] debug: prefork: child closed connection
@4000000041df61e20b7403b4 [10653] debug: prefork: child states: IIBIBIIIIKBBB
@4000000041df61e20b741b24 [10653] debug: prefork: child 21075: entering state 3
@4000000041df61e20b7422f4 [10653] debug: prefork: adjust: decreasing, too many idle children (7 > 5), killed 21075
@4000000041df61e20b743a64 [10653] debug: prefork: ordered 18886 to accept
@4000000041df61e2197aaea4 [10653] debug: prefork: child 18886: entering state 2
@4000000041df61e2197b6e0c [10653] debug: prefork: child closed connection
@4000000041df61e2197d044c [10653] debug: prefork: child 21087: entering state 1
@4000000041df61e2197e3ccc [10653] debug: prefork: child reports idle
@4000000041df61e2197f4a54 [10653] debug: prefork: child 21095: entering state 1
@4000000041df61e2198082d4 [10653] debug: prefork: child reports idle
@4000000041df61e219815d94 [10653] debug: prefork: child 21096: entering state 1
@4000000041df61e21982a99c [10653] debug: prefork: child reports idle
@4000000041df61e2198499b4 [10653] debug: prefork: child states: BIBIBIIIIII
@4000000041df61e21986531c [10653] debug: prefork: child 21096: entering state 3
@4000000041df61e21988089c [10653] debug: prefork: adjust: decreasing, too many idle children (8 > 5), killed 21096

last debug line..

@4000000041df61e30180e14c [18886] debug: log: [18886] child is processing message 24 of 25

another time it happened, the last debug line was

@4000000041df028027c4ee5c [4250] debug: log: [4250] child is processing message 3 of 25

so it seems to do the same thing every time it hangs. is anyone else running this without issues on SMP enabled kernels?
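[editorial note: for readers decoding the "child states" strings above: each letter is one child, with I apparently meaning idle, B busy, and K a child on its way out (the "K" state discussed earlier in this bug). The "adjust: decreasing, too many idle children (9 > 5)" lines follow from the reporter's max_spare=5 setting. A minimal model of that counting logic, sketched in Python for illustration only (the real code is Perl, in spamd's prefork scaling module, and these names are assumptions):]

```python
# Illustrative model of the idle-children adjustment logged above --
# NOT the actual Perl code in SpamdForkScaling.pm.

def adjust(states: str, min_spare: int, max_spare: int) -> str:
    """Decide how the Apache-style scaling algorithm would adjust,
    given a child-state string ('I' = idle, 'B' = busy, as in the
    spamd debug output) and the min-spare/max-spare settings."""
    idle = states.count("I")
    if idle > max_spare:
        return f"decreasing, too many idle children ({idle} > {max_spare})"
    if idle < min_spare:
        return f"increasing, not enough idle children ({idle} < {min_spare})"
    return "no change"

# The first state string logged above, with the reporter's settings
# (min_spare 2, max_spare 5):
print(adjust("IIBIBIIBIIIBBBI", min_spare=2, max_spare=5))
# -> decreasing, too many idle children (9 > 5)   (matches the log)
```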
Subject: Re: [review] adopt Apache preforking algorithm On Fri, Jan 07, 2005 at 09:54:09PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote: > is anyone else running this without issues on SMP enabled kernels? it's working fine on my box so far, but it's not under a high load.
okay.. box #2, dual p3's, experiencing the same problem. when spamd stops taking CHECKs, the spamc processes just continue to build and spamd never logs any incoming connections.
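[editorial note: one way to read the "syn but no ack" symptom reported above: since the master serialises accept() calls, if it stops ordering children to accept, the kernel still completes TCP handshakes on its own until the listen backlog fills; after that, new SYNs go unanswered and spamc hangs in connect(). A minimal sketch of the first half of that, in Python for illustration (not spamd code):]

```python
import socket

# The kernel completes TCP handshakes on behalf of a listening socket:
# a client can fully connect even if the server process never calls
# accept(). Only once the listen backlog fills do new SYNs go
# unanswered -- the "syn going to spamd, but spamd never acks back"
# symptom described above.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))        # ephemeral port on loopback
server.listen(1)                     # deliberately tiny backlog
port = server.getsockname()[1]

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.settimeout(5)
client.connect(("127.0.0.1", port))  # succeeds: handshake done in-kernel
connected = True
print("handshake completed with no accept() call")

client.close()
server.close()
```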
hi Dallas -- strace -f output would be handy, if at all possible. you'd have to catch it *before* it hangs, of course... ;)
i'll see what i can accomplish. anyways, i'm seeing this on about 6-7 different systems now, all SMP based, so that definitely is the common factor between them. they are all redhat 7.3, perl 5.6.1, running fedora-legacy smp kernels. all the other systems are installed exactly the same but run on uniprocessor and do not exhibit this behavior. couple examples today. i have a monitor script running every minute to ping spamd and restart it if it does not pong back in 30 seconds.

example #1
----------------------
Spamd PING Failed at 2005-01-19 10:20:13

2005-01-19 10:18:56.558340500 [470] debug: log: [470] clean message (0.9/5.0) for tilleyl@wmbac.com:200 in 8.9 seconds, 33516 bytes.
2005-01-19 10:18:56.558347500 [470] debug: log: [470] result: . 0 - HTML_40_50,HTML_MESSAGE,HTML_NONELEMENT_70_80 scantime=8.9,size=33516,mid=<003901c4fe3a$1f51a860$8201a8c0@migcorp.net>,autolearn=disabled
2005-01-19 10:18:56.625120500 [470] debug: log: [470] child is processing message 9 of 25
2005-01-19 10:18:56.625665500 [26564] debug: prefork: child 470: entering state 1
2005-01-19 10:18:56.625797500 [26564] debug: prefork: child reports idle
2005-01-19 10:18:56.625921500 [26564] debug: prefork: child states: IIIIII
2005-01-19 10:18:56.626081500 [26564] debug: prefork: child 1773: entering state 3
2005-01-19 10:18:56.626307500 [26564] debug: prefork: adjust: decreasing, too many idle children (6 > 5), killed 1773
2005-01-19 10:18:56.632265500 [26564] debug: log: [26564] server hit by SIGCHLD
2005-01-19 10:18:56.632397500 [26564] debug: log: [26564] handled cleanup of child pid 1773
2005-01-19 10:20:11.266476500 [26564] debug: log: [26564] server killed by SIGTERM, shutting down
2005-01-19 10:20:15.206533500 [2263] debug: logger: enabled with:
2005-01-19 10:20:15.206539500 debug: logger: facility: stderr
2005-01-19 10:20:15.206540500 debug: logger: socket: file
2005-01-19 10:20:15.206542500 debug: logger: output: stderr

example #2
-----------------------
Spamd PING Failed at 2005-01-19 10:25:13

2005-01-19 11:19:50.648468500 [24111] debug: check: subtests=__CT,__CTYPE_HAS_BOUNDARY,__HAS_MSGID,__HAS_SUBJECT,__HTML_LENGTH_1024_1536,__MIME_HTML,__MIME_QP,__MIME_VERSION,__SANE_MSGID,__TAG_EXISTS_BODY,__TAG_EXISTS_HEAD,__TAG_EXISTS_HTML,__TAG_EXISTS_META
2005-01-19 11:19:50.658612500 [24111] debug: log: [24111] clean message (4.0/5.0) for langner@anesis.com:200 in 11.1 seconds, 3668 bytes.
2005-01-19 11:19:50.658931500 [24111] debug: log: [24111] result: . 3 - HTML_MESSAGE,HTML_TAG_EXIST_TBODY,MIME_HTML_MOSTLY,URIBL_OB_SURBL,URIBL_WS_SURBL scantime=11.1,size=3668,mid=<3n21c8$51bvqr@ausc60ps301.us.dell.com>,autolearn=disabled
2005-01-19 11:19:50.721801500 [24111] debug: log: [24111] child is processing message 6 of 25
2005-01-19 11:19:50.722339500 [24085] debug: prefork: child 24111: entering state 1
2005-01-19 11:19:50.722468500 [24085] debug: prefork: child reports idle
2005-01-19 11:19:50.722590500 [24085] debug: prefork: child states: IIIIII
2005-01-19 11:19:50.722717500 [24085] debug: prefork: child 24576: entering state 3
2005-01-19 11:19:50.722904500 [24085] debug: prefork: adjust: decreasing, too many idle children (6 > 5), killed 24576
2005-01-19 11:19:50.728925500 [24085] debug: log: [24085] server hit by SIGCHLD
2005-01-19 11:19:50.729046500 [24085] debug: log: [24085] handled cleanup of child pid 24576
2005-01-19 11:20:11.776712500 [24085] debug: log: [24085] server killed by SIGTERM, shutting down
2005-01-19 11:20:19.172192500 [24790] debug: logger: enabled with:
2005-01-19 11:20:19.172198500 debug: logger: facility: stderr
2005-01-19 11:20:19.172200500 debug: logger: socket: file
2005-01-19 11:20:19.172201500 debug: logger: output: stderr
and they're all using Storable 2.13?
ya, as soon as we updated to Storable 2.13 and the new 3.0.2 spamassassin, it started happening on those smp based boxes. i have a full trace running on a couple boxes now, so i should be able to provide more info soon.
at 10:05 am, the spamd log was showing...

@4000000041f27a770c42fc14 debug: daemon: LocalAddr: 127.0.0.1
@4000000041f27a770c42fffc debug: daemon: LocalPort: 783
@4000000041f27a770c4303e4 debug: daemon: Proto: 6
@4000000041f27a770c4307cc debug: daemon: ReuseAddr: 1
@4000000041f27a770c430bb4 debug: daemon: Type: 1
@4000000041f27a770c4f19a4 Could not create INET socket on 127.0.0.1:783: Address already in use (IO::Socket::INET: Address already in use)
@4000000041f27a7808782cbc [2080] debug: logger: enabled with:
@4000000041f27a7808783874 debug: logger: facility: stderr
@4000000041f27a7808783c5c debug: logger: socket: file
@4000000041f27a7808784044 debug: logger: output: stderr
@4000000041f27a780878d0cc [2080] debug: logger: facility stderr disabled: already debugging to stderr.
@4000000041f27a780887d874 [2080] debug: daemon: creating INET socket:
@4000000041f27a780887e044 debug: daemon: Listen: 128
@4000000041f27a780887e044 debug: daemon: LocalAddr: 127.0.0.1
@4000000041f27a780887e814 debug: daemon: LocalPort: 783
@4000000041f27a780887e814 debug: daemon: Proto: 6
@4000000041f27a780887ebfc debug: daemon: ReuseAddr: 1
@4000000041f27a780887efe4 debug: daemon: Type: 1
@4000000041f27a7808988dcc Could not create INET socket on 127.0.0.1:783: Address already in use (IO::Socket::INET: Address already in use)

[root@email spamd]# ps auxwww | grep spamd
root      1287  0.1  1.2 28360 26480 ?  T  08:40  0:07 /usr/bin/perl -T -w /usr/bin/spamd --max-children=10 --max-conn-per-child=10 --min-children=5 --min-spare=3 --max-spare=5 --syslog-socket=none -q -D -u doublecheck
200      23708  0.0  0.0     0     0 ?  Z  09:39  0:00 [spamd <defunct>]
200      24005  0.0  0.0     0     0 ?  Z  09:39  0:00 [spamd <defunct>]
200      24170  0.0  0.0     0     0 ?  Z  09:39  0:00 [spamd <defunct>]
200      24393  0.0  0.0     0     0 ?  Z  09:40  0:00 [spamd <defunct>]
200      26206  0.0  0.0     0     0 ?  Z  09:55  0:00 [spamd <defunct>]
200      26510 13.6  0.0  1360   316 ?  S  09:56  1:37 /usr/local/bin/multilog t s2048000 !/usr/local/doublecheck/bin/logappend4 /var/log/spamd
root     28579  0.0  0.4 10120  8260 ?  T  10:00  0:00 /usr/bin/perl -T -w /usr/bin/spamd --max-children=10 --max-conn-per-child=10 --min-children=5 --min-spare=3 --max-spare=5 --syslog-socket=none -q -D -u doublecheck
root     28606  0.0  0.0  1344   296 ?  S  10:00  0:00 supervise spamd
root      2387  4.0  0.1  3844  2548 ?  R  10:08  0:00 /usr/bin/perl -T -w /usr/bin/spamd --max-children=10 --max-conn-per-child=10 --min-children=5 --min-spare=3 --max-spare=5 --syslog-socket=none -q -D -u doublecheck
root      2402  0.0  0.0  1772   636 pts/1  S  10:08  0:00 grep spamd

[root@email spamd]# strace -p 28579
--- SIGTERM (Terminated) ---

[root@email spamd]# ps auxwww | grep spamd
root      1287  0.1  1.2 28360 26480 ?  T  08:40  0:07 /usr/bin/perl -T -w /usr/bin/spamd --max-children=10 --max-conn-per-child=10 --min-children=5 --min-spare=3 --max-spare=5 --syslog-socket=none -q -D -u doublecheck

[root@email spamd]# strace -p 1287
--- SIGTERM (Terminated) ---

2005-01-22 09:40:33.941653500 is the last time an email was processed. however, spamd answered a pong 5 minutes later..

2005-01-22 09:45:00.515695500 [pid 23542] read(9, "PING SPAMC/1.0\n", 4096) = 15
2005-01-22 09:45:00.515781500 [pid 23542] alarm(0) = 30
2005-01-22 09:45:00.515862500 [pid 23542] write(9, "SPAMD/1.2 0 PONG\r\n", 18) = 18

this box does quite a bit of mail.. so no CHECKs in a 5 minute period is impossible. there is an email every 3-5 seconds. i'm attaching the straced debug log.
Created attachment 2619 [details] straced spamd straced spamd processes during a hang
also, i've switched back to --round-robin, we'll see if the hangs go away.
running --round-robin has removed the hangs... spamd was hanging up every 4 to 8 hours; now there hasn't been one since i changed it saturday morning.
Running with Solaris 8, these patches cause the error "prefork: select timeout failed! recovering" to be written to the syslog continuously, and cause spamd to not process any messages. I'd love to get these patches working, and am afraid that the final released version of these could cause problems for us. Any direction?
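[editorial note: a continuous "select timeout failed! recovering" in a forking daemon is consistent with select() being interrupted by signals (e.g. SIGCHLD) and returning EINTR, though the actual cause on Solaris 8 may differ. The conventional fix is to restart the interrupted select rather than treat it as an error; a hedged sketch in Python for illustration (the real spamd code is Perl):]

```python
import errno
import select

def select_retry(rlist, timeout):
    """select() that restarts after being interrupted by a signal
    (EINTR), instead of reporting a spurious failure."""
    while True:
        try:
            return select.select(rlist, [], [], timeout)
        except InterruptedError:        # EINTR: a signal (e.g. SIGCHLD) arrived
            continue                    # restart, don't log "timeout failed!"
        except OSError as e:
            if e.errno == errno.EINTR:  # older Pythons raise plain OSError
                continue
            raise
```

(Modern Python restarts interrupted select() calls automatically per PEP 475; the explicit loop mirrors what C or Perl code must do by hand.)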
Could the patch submitted in Bug 4189 be the fix for the problems seen in comment 44 through comment 55?
yes, bug 4189 is a bug, and the patch in that bug should be applied on top of these. However, I don't think it's the bug Dallas is reporting; that bug manifests itself as lots of "K" state children, and Dallas' logs don't show that.

BTW Dallas, I note something very weird in those logs you posted:

---------------------------------------------------------------------------
2005-01-22 09:50:00.782509500 [pid 1287] write(11, "A\n", 2) = 2
2005-01-22 09:50:00.782578500 [pid 23708] <... read resumed> "A\n", 4096) = 2
2005-01-22 09:50:00.782652500 [pid 1287] write(2, "[1287] debug: prefork: ordered 2"..., 47[1287] debug: prefork: ordered 23708 to accept
2005-01-22 09:50:00.782695500 ) = 47
2005-01-22 09:50:00.782747500 [pid 1287] read(11, <unfinished ...>
2005-01-22 09:50:00.782941500 [pid 23708] rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
2005-01-22 09:50:00.783028500 [pid 23708] accept(5, {sin_family=AF_INET, sin_port=htons(40829), sin_addr=inet_addr("127.0.0.1")}, [16]) = 11
2005-01-22 09:55:00.629245500 [pid 23708] fcntl64(11, F_GETFL) = 0x2 (flags O_RDWR)
2005-01-22 09:55:00.629333500 [pid 23708] fstat64(11, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
2005-01-22 09:55:00.629470500 [pid 23708] mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40203000
---------------------------------------------------------------------------

note the 5-minute wait between the accept() being called and returning. that's the hang, in wait_for_child_to_accept() in SpamdForkScaling.pm, where it calls

my $state = $self->read_one_line_from_child_socket($sock);

essentially, the child process is hung in the accept() system call. (This is not supposed to happen, naturally. ;) The spamd will sit there waiting for the child to report its status, and the child is just hung waiting for accept() to return...

So the question is, why is accept() hanging? Is it:

1. because there's no connection there to accept, according to the child, and it's just waiting for the next conn request?

2. because something in the kernel is holding onto a lock that eventually times out after 5 minutes exactly?

The timestamps (09:50:00.783028500 and 09:55:00.629245500) are almost exactly 5 minutes apart. That doesn't seem to be accidental, and would lead me to suspect that #1 is very unlikely. This doesn't seem to be a particularly quiet server, where the socket would see no activity for a whole 5 minutes... right? If that's the case, I think #2 seems likely.

Anyway, I'll try adding some timeout code to the preforking code, to take care of this. But it looks like some kind of kernel lock timeout to me... accept() should not be sitting on the socket for that long, assuming there are other connections arriving during those 5 mins.

--j.
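[editorial note: the timeout proposed above — so the master's wait_for_child_to_accept() can't block forever on a child that is itself stuck in accept() — can be sketched roughly as follows. Python is used for illustration only; the real code is Perl in SpamdForkScaling.pm, and the function and parameter names here are assumptions, not the actual API:]

```python
import select
import socket

def read_one_line_with_timeout(sock: socket.socket, timeout: float):
    """Read one newline-terminated status line from a child's socket,
    giving up after `timeout` seconds instead of blocking forever.

    Returns the line without the newline, or None on timeout / EOF --
    the caller can then treat the child as wedged and restart it.
    """
    buf = b""
    while True:
        ready, _, _ = select.select([sock], [], [], timeout)
        if not ready:
            return None        # child never reported: don't hang the master
        chunk = sock.recv(1)
        if not chunk:
            return None        # child closed its end of the socket
        if chunk == b"\n":
            return buf.decode("ascii", "replace")
        buf += chunk

# usage sketch, with a socketpair standing in for the parent/child pipe:
parent, child = socket.socketpair()
child.sendall(b"I\n")          # child reports idle, as in the debug logs
state = read_one_line_with_timeout(parent, timeout=5.0)
print(state)  # "I"
```

With this in place, a child hung in accept() would show up as a None return after the timeout, rather than as a master blocked indefinitely in read_one_line_from_child_socket().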