Bug 3983 - [review] adopt Apache preforking algorithm
Summary: [review] adopt Apache preforking algorithm
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: spamc/spamd (show other bugs)
Version: 3.0.1
Hardware: Other other
: P5 major
Target Milestone: 3.1.0
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on: 4189
Blocks: 4002
  Show dependency tree
 
Reported: 2004-11-18 18:16 UTC by Justin Mason
Modified: 2005-03-15 10:15 UTC (History)
0 users



Attachment Type Modified Status Actions Submitter/CLA Status
implementation patch None Justin Mason [HasCLA]
debugging version of code patch None Justin Mason [HasCLA]
new additions on top of patch 2527 patch None Justin Mason [HasCLA]
straced spamd application/x-gzip None Dallas Engelken [HasCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Justin Mason 2004-11-18 18:16:53 UTC
from discussions with a few people at ApacheCon, I think I have
an idea of what may have caused trouble swap-wise in SpamAssassin
use for a lot of people.

By default, SpamAssassin 3.0.0's spamd uses preforking, and starts
5 servers.  So we have:

    10 MB shared per process
    20 MB unshared x 5
    = 110 MB used by all processes

In addition, something that's become clear, is that linux (and possibly other
OSes) will schedule each of the preforked servers to service an incoming
accept() call, *in turn*.  Even if some of those are swapped out.

Now, let's say you're running that on a server with 90 MB of free RAM.  At
least one or two of those servers will have to be swapped out to fit them all
in.  Once a sufficient number of requests arrive -- even under low load -- that
server will be re-swapped in, and another swapped out every 5 requests... in
other words, creating swap load.

What I'm proposing to do is to adopt the Apache preforking algorithm.
This will mean:

- instead of the requests being distributed across all servers equally,
  most requests are delivered to the first idle server (sorted by PID).
  So, in other words a small number of the running servers will handle
  most of the requests under normal load.  This will reduce swapping,
  since the inactive servers stay inactive, and therefore swapped out
  until they're needed.

  I think this is the main fix that'll improve response time and cut
  swapping.

- the number of servers running will no longer be simply == max-children;
  instead, it'll scale from (1 .. max-children) based on how busy the
  server is, using a min-idle-children and max-idle-children pair
  of settings.  in other words, under low load, server children
  are killed off; under high load, new servers are forked in advance
  to deal with more simultaneous requests.

  This is implemented, and *seems* to work well, although it could do
  with testing under variable loads.

- the master process serialises all accept() calls.  The parent spamd
  entirely controls when, and if, a child accepts a new connection, and
  which child does so.

  This is required to implement "lowest-PID idle child accepts", and --
  surprisingly -- has very little effect on throughput in my testing;
  in a simple speed test using about 100 messages, I get these figures:

  3.0: 10.819 8.441 9.419 8.865 9.704 = 47.248/5 =  9.45
  new: 9.851 9.277 9.759              = 28.887/3 =  9.63

  that's 9.45 messages/sec with the 3.0 code, 9.63 m/s with the new
  preforking algorithm.  however, I think there's actually a small
  slowdown in fact, compared to non-preforking, based on this:

  rrn: 10.240 9.393 9.339 10.033      = 39.005/3 =  9.75

  the methodology used to generate these, btw, are just: start spamd, and do
  "for mail in [100-message corpus] ; do spamc < $mail ; done", in other words
  send 1 message through, each time, in sequence.

  note the *lower* variability in the m/s figures using the new code. Even
  though my machine has 1Gb of RAM, the re-use of the same spamd child
  processes seems to increase the predictability of scan times, even without
  swapping!

- a bonus: we'll have a back-channel from the spamd children to the spamd
  parent to send messages and report stats.   this should be pretty
  damn useful ;)


So, I'll upload the current statke of the patch -- I hacked it up over the
last few days and finished it, more or less, last night.  I'd really appreciate
some people trying it out...

regarding target: 3.1.0 at least.  But the way I did it is pretty nonintrusive
-- in my opinion it could even fit into 3.0.x ;)
Comment 1 Justin Mason 2004-11-18 19:35:30 UTC
Created attachment 2527 [details]
implementation

here's the impl.   I'll check this into trunk shortly unless anyone gives me a
-1.  ;)
Comment 2 Justin Mason 2004-11-20 17:58:24 UTC
OK, here's some good test results.   I came up with a little test that uses a
variable set of loads (multiple messages in parallel in a burst, followed by
messages one after the other, followed by messages in serial separated by a
"sleep 1".  This is to simulate an environment with variable message loads,
instead of the usual "massive burst" or "strictly serial" test loads we have in
our test suite.  Results are excellent:

new prefork code:

  real    1m50.385s user    0m0.297s sys     0m0.472s
  real    1m52.728s user    0m0.298s sys     0m0.456s
  real    1m48.972s user    0m0.287s sys     0m0.460s

old prefork, using --round-robin switch:

  real    1m48.504s user    0m0.293s sys     0m0.477s 
  real    1m48.844s user    0m0.323s sys     0m0.527s 
  real    1m49.016s user    0m0.317s sys     0m0.519s

: jm 1455...; sh runtest


that's pretty painless -- almost unnoticeable!   I'm checking in the code to
3.1.0, and moving this ticket to 3.0.2 in case we want to apply there.

here's the test scripts, if you want to try it out...

"scr":

#!/bin/sh
  scan_all_bg () {
    for f in t/data/spam/* ; do
      if [ -f $f ] ; then true ; else continue ; fi
      ./spamc/spamc -p 48373 < $f > o &
    done
  }
  scan_all_fg () {
    for f in t/data/spam/* ; do
      if [ -f $f ] ; then true ; else continue ; fi
      ./spamc/spamc -p 48373 < $f > o
    done
  }
  scan_all_sleep_1_sec () {
    for f in t/data/spam/* ; do
      if [ -f $f ] ; then true ; else continue ; fi
      ./spamc/spamc -p 48373 < $f > o
      sleep 1
    done
  }
  scan_all_bg
  scan_all_sleep_1_sec
  scan_all_bg
  scan_all_bg
  scan_all_bg
  scan_all_bg
  scan_all_bg
  scan_all_fg
  scan_all_sleep_1_sec
  scan_all_bg
  scan_all_bg
  scan_all_bg
  scan_all_fg
  scan_all_bg
  scan_all_bg
  scan_all_fg
  scan_all_sleep_1_sec
  wait 


"runtest":

#!/bin/sh

  perl -T -Ilib ./spamd/spamd.raw \
                -x -s stderr -p 48373 -L -m5 > log1 2>&1 &
  sdpid=$!
  sleep 5
  time sh scr
  time sh scr
  time sh scr
  kill -15 $sdpid ; sleep 1

  perl -T -Ilib ./spamd/spamd.raw --round-robin \
                -x -s stderr -p 48373 -L -m5 > log1 2>&1 &
  sdpid=$!
  sleep 5
  time sh scr
  time sh scr
  time sh scr
  kill -15 $sdpid ; sleep 1


checked in as r106054.
Comment 3 Duncan Findlay 2004-11-20 23:56:11 UTC
It'd be nice to have in 3.0.2, as this is critical for some people; but the
patch is really big, and not widely tested. :-(

If we got a bunch of people to run it, that would help.

+0
Comment 4 Kjetil Kjernsmo 2004-11-21 10:02:52 UTC
May I suggest that a pre or rc release is made, 3.0.2-pre1 with the Apache preforking model? 
Having something like that may encourage wider testing, and if this could go into Debian 
Experimental too, it may help too... :-) 
Comment 5 Daniel Quinlan 2004-11-21 12:01:09 UTC
Subject: Re:  [review] adopt Apache preforking algorithm

I'm not in favor of putting this into 3.0.2.  This is such a large
change, it should go into 3.1, go through pre and rc releases, etc.

Sticking to a 3-4 month release cycle for 3.1 -- that would be my
strategy for getting this out.

Comment 6 Justin Mason 2004-11-21 13:14:09 UTC
Subject: Re:  [review] adopt Apache preforking algorithm 

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


> May I suggest that a pre or rc release is made, 3.0.2-pre1 with the
> Apache preforking model? Having something like that may encourage wider
> testing, and if this could go into Debian Experimental too, it may help
> too... :-) 

I'd definitely like to see more people trying out the patch first
anyway before we go rolling tarballs ;)

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD4DBQFBoQT1MJF5cimLx9ARAoNHAJ9yw8o8/VbvazybYXAkt66GvTWtcgCYg7sQ
qPzTlArSdOUQ3F6DpZmE7w==
=jb+Y
-----END PGP SIGNATURE-----

Comment 7 Theo Van Dinter 2004-11-21 14:00:59 UTC
Subject: Re:  [review] adopt Apache preforking algorithm

On Sun, Nov 21, 2004 at 12:01:10PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> I'm not in favor of putting this into 3.0.2.  This is such a large
> change, it should go into 3.1, go through pre and rc releases, etc.

Agreed, I didn't plan on having this feature in 3.0.

Comment 8 Justin Mason 2004-11-21 14:03:21 UTC
alright that's 2 -1's on getting this into 3.0.x.

it'll take several compelling user reports of memory problems fixed to get it
in, if at all, given that ;)
Comment 9 Justin Mason 2004-11-23 12:51:33 UTC
Andy Jezierski noted (via mail):



The patch seems to be working pretty well on my system, don't notice any 
slowdowns and paging is definitely better. I'm running on a FreeBSD 5.2.1 
system with 192MB of ram, using the defaults for the new switches, so two 
spamd child processes start up and it seems my system runs with just those 
two active most of the time.  I've seen it running with 5 processes a few 
times, which is what I have my limit set to, but most of the time it's 
using 2 or 3.  Here's a couple of top displays, the first is before the 
patch, the second is after.

last pid: 64134;  load averages:  1.02,  1.38,  1.30   up 45+20:26:52 
09:34:27
28 processes:  1 running, 27 sleeping
CPU states: 74.3% user,  0.0% nice, 13.2% system,  0.4% interrupt, 12.1% 
idle
Mem: 116M Active, 9728K Inact, 47M Wired, 6664K Cache, 28M Buf, 572K Free
Swap: 512M Total, 48M Used, 464M Free, 9% Inuse, 316K In, 416K Out

  PID USERNAME   PRI NICE   SIZE    RES STATE    TIME   WCPU    CPU 
COMMAND
64073 spamd      113    0 53552K 38820K select   0:21 13.64% 12.40% perl
64011 spamd        4    0 56244K 37632K accept   0:51 15.61% 11.57% perl
63916 spamd      107    0 59760K 36248K select   1:28 15.15% 11.23% perl
63995 spamd        4    0 57212K 39360K accept   0:58 10.62%  9.91% perl
63996 spamd        4    0 56248K 37392K accept   0:46  4.13%  3.76% perl
  389 root        96    0  1608K   268K select   4:58  0.00%  0.00% ntpd
  260 root        96    0  1312K   248K select   3:47  0.00%  0.00% 
syslogd
  414 root        96    0  3512K   476K select   3:25  0.00%  0.00% 
sendmail
28604 spamd       20    0 46188K     0K pause    1:11  0.00%  0.00% <perl>
  431 root         8    0  1336K   120K nanslp   0:40  0.00%  0.00% cron
  359 root        96    0  1236K    84K select   0:18  0.00%  0.00% usbd
  408 root        96    0  3488K     0K select   0:10  0.00%  0.00% <sshd>
  417 smmsp       20    0  3404K     0K pause    0:08  0.00%  0.00% 
<sendmail>
64116 root        96    0  2212K  1076K RUN      0:00  0.00%  0.00% top
64094 root         4    0  6240K     0K sbwait   0:00  0.00%  0.00% <sshd>
64112 root        20    0  2312K     0K pause    0:00  0.00%  0.00% <csh>
sidewinder#


last pid: 74409;  load averages:  1.25,  0.85,  0.70   up 46+20:09:26 
09:17:01
26 processes:  2 running, 24 sleeping
CPU states: 82.9% user,  0.0% nice, 14.8% system,  1.6% interrupt,  0.8% 
idle
Mem: 73M Active, 22M Inact, 47M Wired, 11M Cache, 28M Buf, 27M Free
Swap: 512M Total, 13M Used, 499M Free, 2% Inuse

  PID USERNAME   PRI NICE   SIZE    RES STATE    TIME   WCPU    CPU 
COMMAND
74322 spamd      127    0 56592K 46180K select   1:02 39.79% 39.79% perl
74370 spamd       -8    0 55752K 45212K piperd   0:38 35.92% 35.64% perl
  389 root        96    0  1608K   396K select   5:04  0.00%  0.00% ntpd
  260 root        96    0  1312K   280K select   3:52  0.00%  0.00% 
syslogd
  414 root        96    0  3512K   868K select   3:29  0.00%  0.00% 
sendmail
  431 root         8    0  1336K   192K nanslp   0:40  0.00%  0.00% cron
66431 spamd       96    0 46432K 28368K select   0:28  0.00%  0.00% perl
  359 root        96    0  1236K    84K select   0:18  0.00%  0.00% usbd
  408 root        97    0  3488K   488K select   0:10  0.00%  0.00% sshd
  417 smmsp       20    0  3404K   424K pause    0:09  0.00%  0.00% 
sendmail
66044 ajezierski  96    0  6240K  1052K select   0:01  0.00%  0.00% sshd
66047 root        20    0  2316K  1088K pause    0:00  0.00%  0.00% csh
66042 root         4    0  6240K   924K sbwait   0:00  0.00%  0.00% sshd
sidewinder# 

Great Job!

Thanks
Andy


Comment 10 Daniel Quinlan 2004-11-23 13:30:51 UTC
Subject: Re:  [review] adopt Apache preforking algorithm

I asked Dean Gaudet, one of the Apache httpd hackers, about this a ways
back but I don't think I ever forwarded this (he gave his okay to
forward).

------- start of cut text --------------
From: dean gaudet <dean@arctic.org>

apache keels over if you don't set MaxClients appropriately for your 
RAM... it's one of the first few things i mention on the apache tuning 
page.  you really shouldn't attempt to defend against dum setups with some 
complicated accept() protocol... keep it simple.

reacting to load rather than launching a fixed number of children is an 
improvement though (or more accurately... a bug fix -- since pre-3.0 
behaved like this) -- then you can default to launching one child, and 
that will handle 99% of j. random user's needs.

i was honestly happy with post-forking SA... there are already 4 other 
post-forked daemons in my mail delivery setup (qmail+clamav).  it's the 
most trivial to get post-forking correct in a platform independent 
manner... and it's trivial to have post-fork react to load changes.

but then i'm running SA with one request per child, because i prefer to 
exchange the perf loss for the extra paranoia of having my children drop 
root privs entirely early... (and there's that config leak bug at least in 
the 3.0.1 my install is based on).

damn does SA really have 20MB unshared?  suck.

so yeah sorry, no sage advice here... i'll be disabling most of this 
myself, security/paranoia is more important.
------- end ----------------------------

Comment 11 Justin Mason 2004-11-23 13:46:54 UTC
the MaxClients thing won't be a problem for us I don't think.  by default that
remains at 5, and users would have scaled that by now if they ran into low-RAM
situations with 3.0.x.  I also added a warning about that. (MaxClients ==
--max-children in SAland)

I think the patch matches up with what Dean notes; it'll generally run with 1
kid, 2 if there's mail active, and scales up to 5 if things get really busy, by
default.  this patch works a *lot* better, in many ways, btw -- check out what
it looks like on my server after running for nearly a day:

29880 root      17  15 22248 5244  4276 S N   0.0  1.0   0:03 spamd
11505 root      17  15 65336  60M  7180 S N   0.0 12.0   0:44 spamd
28823 root      17  15 29872  25M  7204 S N   0.0  5.1   0:03 spamd

ie. the master proc has only 5MB paged in while still having all the code and
modules preloaded and precompiled, that's very nice; the subprocs have scaled
their RAM usage based on size of messages coming in; there's only two subprocs;
and one of those (the big one) is the one handling all the messages, with the
second having handled only a small amount of overflow, based on the CPU times. 
that all looks very good.

(note however I don't think the reported SHR values are correct on that kernel
as mentioned in prev bugs, so ignore those.)
Comment 12 Daniel Quinlan 2004-11-23 20:31:52 UTC
moving to 3.1 milestone
Comment 13 Daryl C. W. O'Shea 2004-11-27 17:51:24 UTC
Subject: Re:  [review] adopt Apache preforking algorithm

No problems with the prefork patch on 3.0.1 after a few days on my test 
system.  It only has a load of about 1000 messages a day though.

Comment 14 Tom Schulz 2004-12-07 11:56:43 UTC
For what this is worth, we are running this patch on SpamAssassin 3.0.1 on
Solaris 8 with perl 5.8.5.  It is working ok and does reduce memory usage
and swap except when we have enough load to require the maximum number of
spamd processes.  Even then, the faster processing before and after the peak
load seems to reduce the impact of the peak load.
Comment 15 Justin Mason 2004-12-07 12:48:42 UTC
fwiw, it's been rock-solid for my server for several weeks now (although again
that's a pretty low-load system).

Given that this works well, has several positive reports, and is in 3.1.0, and
we're agreed it won't be backported to 3.0.x, I'll close this bug.  (people who
may wish to apply the patch to 3.0.x can still get the patch from here and apply
it, of course.)

if we ever decide to backport to 3.0.x officially, we can reopen.
Comment 16 Justin Mason 2004-12-07 12:49:11 UTC
marking FIXED
Comment 17 Theo Van Dinter 2004-12-07 14:37:48 UTC
Subject: Re:  [review] adopt Apache preforking algorithm

On Tue, Dec 07, 2004 at 12:48:43PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> fwiw, it's been rock-solid for my server for several weeks now (although again
> that's a pretty low-load system).
> 
> Given that this works well, has several positive reports, and is in 3.1.0, and
> we're agreed it won't be backported to 3.0.x, I'll close this bug.  (people who
> may wish to apply the patch to 3.0.x can still get the patch from here and apply
> it, of course.)

Actually it barfed on my system just now...  I didn't have time to debug too
much because of the flood of unscanned mails, but:

root     11997     1  1 Dec05 ?        00:45:54 /usr/bin/spamd -d -c -H -r /var/run/spamd.pid -m 8
root     14055 11997  0 13:53 ?        00:00:30 spamd child
root     22425 11997  0 15:56 ?        00:00:04 spamd child
root     25947 11997  0 16:45 ?        00:00:00 [spamd <defunct>]

telnetting to 783 and giving a PING got no response.  spamc was timing out
connections and stuff was going through unscanned.

running: spamassassin-3.1.0-svn109845

Comment 18 Michael Parker 2004-12-07 16:42:29 UTC
Subject: Re:  [review] adopt Apache preforking algorithm

On Tue, Dec 07, 2004 at 02:37:52PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> 
> Actually it barfed on my system just now...  I didn't have time to debug too
> much because of the flood of unscanned mails, but:
> 
> root     11997     1  1 Dec05 ?        00:45:54 /usr/bin/spamd -d -c -H -r /var/run/spamd.pid -m 8
> root     14055 11997  0 13:53 ?        00:00:30 spamd child
> root     22425 11997  0 15:56 ?        00:00:04 spamd child
> root     25947 11997  0 16:45 ?        00:00:00 [spamd <defunct>]
> 
> telnetting to 783 and giving a PING got no response.  spamc was timing out
> connections and stuff was going through unscanned.
> 
> running: spamassassin-3.1.0-svn109845
> 

Someone on IRC mentioned a similar experience last week.  I asked them
to note it in bug, but I guess they never did.

Michael
Comment 19 Justin Mason 2004-12-07 16:52:15 UTC
damn, a strace would have been handy :(
Comment 20 Theo Van Dinter 2004-12-07 17:20:33 UTC
Subject: Re:  [review] adopt Apache preforking algorithm

On Tue, Dec 07, 2004 at 04:52:16PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> damn, a strace would have been handy :(

Yeah, sorry.  I was going to do some debugging, but mail was kind of flooding
in at that moment, and there were 17 other things going on at work at the same
time, so I had to just restart and figured I'd come back to it later.

A rough guess is that the parent was trying to send the connection to the
defunct child, and the child obviously didn't do anything about it.

Comment 21 Theo Van Dinter 2004-12-07 22:00:48 UTC
Subject: Re:  [review] adopt Apache preforking algorithm

On Tue, Dec 07, 2004 at 08:20:31PM -0500, Theo Van Dinter wrote:
> Yeah, sorry.  I was going to do some debugging, but mail was kind of flooding
> in at that moment, and there were 17 other things going on at work at the same
> time, so I had to just restart and figured I'd come back to it later.
> 
> A rough guess is that the parent was trying to send the connection to the
> defunct child, and the child obviously didn't do anything about it.

Just happened again.

root     28832     1  6 Dec07 ?        00:28:02 /usr/bin/spamd -d -c -H -r /var/run/spamd.pid -m 8
root     13863 28832  0 Dec07 ?        00:01:32 spamd child
root     15337 28832  0 Dec07 ?        00:00:16 spamd child
root     27295 28832  0 00:28 ?        00:00:01 [spamd <defunct>]

the children were sitting in a read() wait.
the parent was wedged in a loop:

[...]
futex(0x40190780, FUTEX_WAIT, 2, NULL)  = -1 ENOSYS (Function not implemented)
futex(0x40190780, FUTEX_WAIT, 2, NULL)  = -1 ENOSYS (Function not implemented)
futex(0x40190780, FUTEX_WAIT, 2, NULL)  = -1 ENOSYS (Function not implemented)
futex(0x40190780, FUTEX_WAIT, 2, NULL)  = -1 ENOSYS (Function not implemented)
futex(0x40190780, FUTEX_WAIT, 2, NULL)  = -1 ENOSYS (Function not implemented)
futex(0x40190780, FUTEX_WAIT, 2, NULL)  = -1 ENOSYS (Function not implemented)
[...]

Summary of my perl5 (revision 5.0 version 6 subversion 1) configuration:
  Platform:
    osname=linux, osvers=2.4.21-1.1931.2.393.entsmp, archname=i386-linux
    uname='linux bugs.devel.redhat.com 2.4.21-1.1931.2.393.entsmp #1 smp thu
aug 14 14:47:21 edt 2003 i686 unknown '
    config_args='-des -Doptimize=-O2 -march=i386 -mcpu=i686 -Dcc=gcc
-Dcf_by=Red Hat, Inc. -Dcccdlflags=-fPIC -Dinstallprefix=/usr -Dprefix=/usr
-Darchname=i386-linux -Dvendorprefix=/usr -Dsiteprefix=/usr -Uusethreads
-Uuseithreads -Uuselargefiles -Dd_dosuid -Dd_semctl_semun -Di_db -Di_ndbm
-Di_gdbm -Di_shadow -Di_syslog -Dman3ext=3pm
-Dinc_version_list=5.6.0/i386-linux 5.6.0'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef
usemultiplicity=undef
    useperlio=undef d_sfio=undef uselargefiles=undef usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
  Compiler:
    cc='gcc', ccflags ='-fno-strict-aliasing -I/usr/local/include',
    optimize='-O2 -march=i386 -mcpu=i686',
    cppflags='-fno-strict-aliasing -I/usr/local/include'
    ccversion='', gccversion='2.96 20000731 (Red Hat Linux 7.3 2.96-113)',
gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t',
lseeksize=4
    alignbytes=4, usemymalloc=n, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -ldl -lm -lc -lcrypt -lutil
    perllibs=-lnsl -ldl -lm -lc -lcrypt -lutil
    libc=/lib/libc-2.2.5.so, so=so, useshrplib=false, libperl=libperl.a
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib'

Comment 22 Justin Mason 2004-12-07 23:04:10 UTC
futex(0x40190780, FUTEX_WAIT, 2, NULL)  = -1 ENOSYS

'It is typically used to implement the contended case
       of a lock in shared memory, as described in futex(4).'

watching the futex at 0x40190780, waiting indefinitely (NULL) for it to change
from value 2.

what the hell is doing *that*?!
Comment 23 Nick Alcock 2004-12-09 06:01:24 UTC
Subject: Re:  [review] adopt Apache preforking algorithm

On Tue, 7 Dec 2004, bugzilla-daemon@bugzilla.spamassassin.org mused:
> what the hell is doing *that*?!

futexes are mostly used by the NPTL threading layer at the moment, so I
guess it is a threading bug. (Probably in glibc.)

If gdb and perl are compiled with debugging, try attaching to the
looping process with GDB and getting a backtrace (unlikely, I know;
glibc in particular is almost always shipped stripped).

Comment 24 Theo Van Dinter 2004-12-09 08:31:39 UTC
Subject: Re:  [review] adopt Apache preforking algorithm

On Wed, Dec 08, 2004 at 01:00:43AM -0500, Theo Van Dinter wrote:
> Just happened again.

yep.

Nothing in the logs about it...

Dec  9 04:13:32 eclectic spamd[2950]: result: Y 36 -
BAYES_99,BIZ_TLD,DRUGS_ANXIETY,DRUGS_ANXIETY_EREC,DRUGS_ERECTILE,DRUGS_MANYKINDS,DRUGS_MUSCLE,DRUGS_SLEEP,DRUGS_SLEEP_EREC,HELO_DYNAMIC_IPADDR2,HTML_BACKHAIR_8,HTML_FONT_TINY,HTML_MESSAGE,LONGWORDS,RAZOR2_CF_RANGE_51_100,RAZOR2_CHECK,RCVD_IN_BL_SPAMCOP_NET,RCVD_IN_DSBL,RCVD_IN_NJABL_DUL,RCVD_IN_SORBS_DUL,RCVD_IN_XBL,URI_AFFILIATE,X_MESSAGE_INFO
scantime=4.7,size=5754,mid=<8d76r4pf372nis$ull80hjg286hc0$aw443f1@SR480>,bayes=1,autolearn=spam 
Dec  9 11:26:40 eclectic spamd[29136]: server killed by SIGTERM, shutting down 

Interestingly, the parent will take a SIGTERM killing the futex() loop.  There
were a few notes when spamd was restarting:

Dec  9 11:26:45 eclectic spamd[3399]: Use of uninitialized value in index at
/usr/lib/perl5/site_perl/5.6.1/Mail/SpamAssassin/SpamdForkScaling.pm line 281,
<GEN127> line 24. 
Dec  9 11:26:45 eclectic spamd[3399]: Use of uninitialized value in
concatenation (.) or string at
/usr/lib/perl5/site_perl/5.6.1/Mail/SpamAssassin/SpamdForkScaling.pm line 285,
<GEN127> line 24. 
Dec  9 11:26:45 eclectic spamd[2950]: Use of uninitialized value in index at
/usr/lib/perl5/site_perl/5.6.1/Mail/SpamAssassin/SpamdForkScaling.pm line 281,
<GEN124> line 189. 
Dec  9 11:26:45 eclectic spamd[2950]: Use of uninitialized value in
concatenation (.) or string at
/usr/lib/perl5/site_perl/5.6.1/Mail/SpamAssassin/SpamdForkScaling.pm line 285,
<GEN124> line 189. 
Dec  9 11:26:45 eclectic spamd[10223]: spamd starting 

Comment 25 Justin Mason 2004-12-09 12:42:04 UTC
those warnings, unfortunately, aren't helpful; they're just "the parent went
away" -type warnings.   anyway, r111430 fixes that, but the hang bug is still
around...
Comment 26 Tom Schulz 2004-12-09 12:57:39 UTC
I was too quick in reporting no problems.  I am running spamd with the command
/usr/bin/spamd -d -c -m 8
and when first started I saw the expected 2 to 8 children depending on the load.
Today I noticed that there were never more than 2 children even though there were
as many as 7 spamc processes running. In the log file I found:

Dec  9 13:51:03 bluegill.adi.com spamd[13605]: prefork: server reached
--max-clients setting, consider raising it

repeated many times (different dates of course).  When I restarted spamd, the
correct behavior returned.
Comment 27 Justin Mason 2004-12-09 14:23:41 UTC
Created attachment 2560 [details]
debugging version of code

Tom, Theo, this patch may be worthwhile; it applies on top of the previous one.
 basically it (a) adds more debugging and paranoid error-checking, and (b)
promotes the "active child scoreboard" debug message to a warning so Tom's
situation can be checked without turning on debugging.
Comment 28 Theo Van Dinter 2004-12-09 17:55:07 UTC
Subject: Re:  [review] adopt Apache preforking algorithm

On Thu, Dec 09, 2004 at 02:23:47PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Tom, Theo, this patch may be worthwhile; it applies on top of the previous one.
>  basically it (a) adds more debugging and paranoid error-checking, and (b)
> promotes the "active child scoreboard" debug message to a warning so Tom's
> situation can be checked without turning on debugging.

Just to throw some fun into my mix, I completely rocked the perl world on my
system this evening.  For the longest time I've been running perl 5.6.1 as
came with RedHat 7.3, on my RHEL 3 machine.  I upgraded the packages tonight
(and the bazillion perl modules I installed before) to the 5.8.0 version that
comes with RHEL 3.

So I'm going to watch spamd and see if it still has issues.  If it does, I'll
throw the patch on and see what happens.

Comment 29 Scott Wolf 2004-12-09 18:48:00 UTC
We are parsing a little over 250k messages per day with a spamd/spamc setup. 
Spamd runs on two boxes with a sql backend for bayes. After upgrading to 3.0.1 
we were seeing huge delays processing mail. This patch definately fixed the 
issue for us, and we have seen no issues, even with our scale of install.
Comment 30 Tom Schulz 2004-12-09 20:05:21 UTC
I am having a problem after applying the patch.  Make test fails as follows

t/spamc_B...................    Not found: spamflag = X-Spam-Flag: YES
# Failed test 8 in t/SATest.pm at line 530
t/spamc_B...................FAILED test 8                                    
        Failed 1/8 tests, 87.50% okay
t/spamc_c...................# Failed test 1 in t/spamc_c.t at line 15
t/spamc_c...................FAILED test 1                                    
        Failed 1/2 tests, 50.00% okay
t/spamc_c_stdout_closed.....# Failed test 1 in t/spamc_c_stdout_closed.t at
line 31
t/spamc_c_stdout_closed.....FAILED test 1                                    
        Failed 1/2 tests, 50.00% okay
t/spamc_E...................# Failed test 1 in t/spamc_E.t at line 15
t/spamc_E...................FAILED test 1                                    
        Failed 1/2 tests, 50.00% okay

I cut test test off at this point as some of the tests were takeing aprox.
30 minutes to fail.
Comment 31 Tom Schulz 2004-12-10 14:03:45 UTC
For whatever this is worth.
I did some digging through the logs.  The drop in the maximum number of spamd
children does not happen all at once.  After a few hours the maximum number
of children was down to 6 from the original 8.  A few hours later it was down
to 5.  It went down to 3 in about an hour.  It bottomed out at 2 as before.
At the time I checked the log, there were:

308 lines containing "server successfully spawned child process"
306 lines containing "server hit by SIGCHLD"
306 lines containing "handled cleanup of child"

Nothing funny there.
Comment 32 Theo Van Dinter 2004-12-12 21:15:05 UTC
Subject: Re:  [review] adopt Apache preforking algorithm

On Tue, Dec 07, 2004 at 11:04:11PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> watching the futex at 0x40190780, waiting indefinitely (NULL) for it to change
> from value 2.
> what the hell is doing *that*?!

Since upgrading perl on the box, I haven't had any problems with spamd.  So
apparently the issue was the old 5.6.1 install.

Comment 33 Tom Schulz 2004-12-13 11:46:37 UTC
I edited SpamdForkScaling.pm to make the change from the last part of the
patch from Comment #27.  After running awhile I found the following in the
syslog when there were 2 spand children shown by ps

Dec 13 14:27:01 bluegill.adi.com spamd[6896]: prefork: child states: KKII at
/opt/perl/lib/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 331,
<GEN228> line 3.

A short time later when the system was more active, there were 6 children and

Dec 13 14:38:07 bluegill.adi.com spamd[6896]: prefork: child states: KKBBBBBB
at /opt/perl/lib/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line
331, <GEN252> line 4.
Comment 34 Nick Alcock 2004-12-13 13:46:25 UTC
Subject: Re:  [review] adopt Apache preforking algorithm

On Sun, 12 Dec 2004, bugzilla-daemon@bugzilla.spamassassin.org stated:
> Since upgrading perl on the box, I haven't had any problems with spamd.  So
> apparently the issue was the old 5.6.1 install.

It might have been the old 5.6.1 Perl threading library, perhaps: that
does some rather risky things which NPTL might well break. (This is pure
conjecture, though.)

Comment 35 Tom Schulz 2004-12-14 07:06:57 UTC
Am I correct in assuming that the child states mean something like: I = idle,
B = busy, K = killed but waiting for something before being forgotten?  We went
from 5 'K's to 6 'K's (running with '-m 8') yesterday evening and now 14 hours
later we still have 6 'K's.  I would like to run with your whole debugging
patch, but as it breaks 'make test' I am afraid to try it in production.
Following are 3 lines from syslog, the first one is the last one with 5 'K's.
I just noticed that some lines do not have the part with GEN...

Dec 13 19:29:33 bluegill.adi.com spamd[6896]: prefork: child states: KKIIKKKI
at /opt/perl/lib/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line
331, <GEN738> line 161.
Dec 13 19:29:33 bluegill.adi.com spamd[6896]: prefork: child states: KKIIKKKK
at /opt/perl/lib/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line
331.
Dec 14 09:27:32 bluegill.adi.com spamd[6896]: prefork: child states: KKKIKBKK
at /opt/perl/lib/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line
331, <GEN852> line 271.
Comment 36 Tom Schulz 2004-12-14 14:20:23 UTC
Ah!  I think that there may be a race condition.  I changed the line

dbg ("prefork: adjust: decreasing, too many idle children ($num_idle >
$self->{max_idle}), killed $pid");

in SpamdForkScaling.pm from dbg to warn and found the following sequence in
syslog when one of the children entered state 'K'.

Dec 14 15:10:14 bluegill.adi.com spamd[1459]: prefork: child states: III at
/opt/perl/lib/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 331,
<GEN6> line 349.
Dec 14 15:10:14 bluegill.adi.com spamd[1459]: server hit by SIGCHLD
Dec 14 15:10:14 bluegill.adi.com spamd[1459]: handled cleanup of child pid 13252
Dec 14 15:10:14 bluegill.adi.com spamd[1459]: prefork: adjust: decreasing, too
many idle children (3 > 2), killed 13252 at /opt/perl/lib/site_perl/5.8.5/
Mail/SpamAssassin/SpamdForkScaling.pm line 380, <GEN6> line 349.
Dec 14 15:10:14 bluegill.adi.com spamd[1459]: prefork: child states: IIK at
/opt/perl/lib/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 331.

Note the order of the lines.  Most of the time, the line with 'prefork: adjust:
decreasing, too many idle children' happens before the lines with 'server hit
by SIGCHLD' and 'handled cleanup of child'.  This time it came after those
lines and we lost a child.  I expect that this is a race condition.
Comment 37 Tom Schulz 2004-12-14 19:18:49 UTC
Possible fix.  The following change fixes my problem.  I don't know if this
is the best way to fix this or not.  Definately a race condition.

--- SpamdForkScaling.pm.orig    Tue Dec 14 18:28:10 2004
+++ SpamdForkScaling.pm Tue Dec 14 18:33:01 2004
@@ -375,8 +375,8 @@
     die "oops! no idle kids in need_to_del_server?";
   }
 
-  kill 'INT' => $pid;
   $self->set_child_state ($pid, PFSTATE_KILLED);
+  kill 'INT' => $pid;
   dbg ("prefork: adjust: decreasing, too many idle children ($num_idle > 
$self->{max_idle}), killed $pid");
 }
 
Comment 38 Justin Mason 2004-12-14 19:25:25 UTC
looks very reasonable.  thanks for looking into it; I had spotted that in logs,
but time to investigate has been short :(

so that fixes it for you?
Comment 39 Justin Mason 2004-12-14 19:26:34 UTC
oh btw -- yes, a child in "K" state should disappear pretty quickly, and should
not stick around for more than a couple of seconds in general.
Comment 40 Tom Schulz 2004-12-14 20:14:10 UTC
> so that fixes it for you?

Yes, that fixes this problem.
Comment 41 Justin Mason 2004-12-15 14:12:18 UTC
Created attachment 2565 [details]
new additions on top of patch 2527

this replaces patch 2560, and adds:

  (a) the spamd-child-states warning, which has proved its usefulness ;)
  (b) Tom's fix for the race condition
  (c) a fix so that spamd_hup.t passes

it's already checked into trunk as r112032.
Comment 42 Tom Schulz 2004-12-15 18:09:09 UTC
Is this patch expected to work on top of 3.0.1?  The patch applies cleanly,
but I am getting the same errors on make test as I got in Comment #30.
I tried commenting out parts of the patch to see what caused the problem,
and it is the setup of the alarm that breaks things.  Has
Mail::SpamAssassin::Util::trap_sigalrm_fully changed in 3.1?  If so, then
my errors are probably not a problem in 3.1.  I am running perl 5.8.5 on
Solaris 8.
Comment 43 Justin Mason 2004-12-15 18:19:41 UTC
unfortunately, patch 2565 won't apply cleanly on 3.0.0; the earlier bits will,
but not that, because it has a dependency on 3828....
Comment 44 Dallas Engelken 2005-01-07 21:54:08 UTC
i have this running on many systems and its working great.  however, i have one
that is a problem, and it really the only oddball out of the bunch.  its a dual
hyperthreaded zeon box as opposed to all the others being uniprocessor P4's.

here are the details...

Spamd stops accepting connections after processing several thousand messages.

- new apache forking model
- running smp kernel, dual ht zeons
  
using the following settings...

max_children    15
max_per_child   25
max_spare       5
min_children    5
min_spare       2

when spamd stops accepting connections, an strace shows a hang on connect()..

# echo | strace spamc -t 5
..
..
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
connect(3, {sin_family=AF_INET, sin_port=htons(783),
sin_addr=inet_addr("127.0.0.1")}}, 16 <unfinished ...>


tcpdump shows the syn going to spamd, but spamd never acks back.

as soon as i svc -d and svc -u  on /service/spamd it is available and working
again.   if i try an hup it with a svc -t, that doesnt help.

last few debug lines related to preforking..

@4000000041df61e201060fe4 [10653] debug: prefork: child 21075: entering state 1
@4000000041df61e201077f14 [10653] debug: prefork: child reports idle
@4000000041df61e201084264 [10653] debug: prefork: child 21298: entering state 1
@4000000041df61e201099254 [10653] debug: prefork: child reports idle
@4000000041df61e2010b8e24 [10653] debug: prefork: child states: IIBIBIIBIIIBBBI
@4000000041df61e2010d9d7c [10653] debug: prefork: child 21298: entering state 3
@4000000041df61e2010f2034 [10653] debug: prefork: adjust: decreasing, too many
idle children (9 > 5), killed 21298
@4000000041df61e2017c841c [10653] debug: prefork: child 20945: entering state 1
@4000000041df61e2017ef134 [10653] debug: prefork: child reports idle
@4000000041df61e2017ff304 [10653] debug: prefork: child closed connection
@4000000041df61e20185752c [10653] debug: prefork: child states: IIBIBIIIIIIBBB
@4000000041df61e201873e34 [10653] debug: prefork: child 21085: entering state 3
@4000000041df61e201890f0c [10653] debug: prefork: adjust: decreasing, too many
idle children (9 > 5), killed 21085
@4000000041df61e201f895d4 [10653] debug: prefork: child closed connection
@4000000041df61e201fb06d4 [10653] debug: prefork: child states: IIBIBIIIIIBBB
@4000000041df61e201fc3b6c [10653] debug: prefork: child 21081: entering state 3
@4000000041df61e201fd838c [10653] debug: prefork: adjust: decreasing, too many
idle children (8 > 5), killed 21081
@4000000041df61e202cb5adc [10653] debug: prefork: child closed connection
@4000000041df61e20b7403b4 [10653] debug: prefork: child states: IIBIBIIIIKBBB
@4000000041df61e20b741b24 [10653] debug: prefork: child 21075: entering state 3
@4000000041df61e20b7422f4 [10653] debug: prefork: adjust: decreasing, too many
idle children (7 > 5), killed 21075
@4000000041df61e20b743a64 [10653] debug: prefork: ordered 18886 to accept
@4000000041df61e2197aaea4 [10653] debug: prefork: child 18886: entering state 2
@4000000041df61e2197b6e0c [10653] debug: prefork: child closed connection
@4000000041df61e2197d044c [10653] debug: prefork: child 21087: entering state 1
@4000000041df61e2197e3ccc [10653] debug: prefork: child reports idle
@4000000041df61e2197f4a54 [10653] debug: prefork: child 21095: entering state 1
@4000000041df61e2198082d4 [10653] debug: prefork: child reports idle
@4000000041df61e219815d94 [10653] debug: prefork: child 21096: entering state 1
@4000000041df61e21982a99c [10653] debug: prefork: child reports idle
@4000000041df61e2198499b4 [10653] debug: prefork: child states: BIBIBIIIIII
@4000000041df61e21986531c [10653] debug: prefork: child 21096: entering state 3
@4000000041df61e21988089c [10653] debug: prefork: adjust: decreasing, too many
idle children (8 > 5), killed 21096

last debug line..

@4000000041df61e30180e14c [18886] debug: log: [18886] child is processing
message 24 of 25

another time it happend, the last debug line was 

@4000000041df028027c4ee5c [4250] debug: log: [4250] child is processing message
3 of 25

so it does the same thing every time it hangs it seems.  

is anyone else running this without issues on SMP enabled kernels?
Comment 45 Theo Van Dinter 2005-01-07 21:58:18 UTC
Subject: Re:  [review] adopt Apache preforking algorithm

On Fri, Jan 07, 2005 at 09:54:09PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> is anyone else running this without issues on SMP enabled kernels?

it's working fine on my box so far, but it's not under a high load.

Comment 46 Dallas Engelken 2005-01-11 10:26:17 UTC
okay.. box #2, dual p3's, experiencing the same problem.  when spamd stops
taking CHECKs, the spamc processes just continue to build and spamd never logs
any incoming connections.
Comment 47 Justin Mason 2005-01-11 10:41:08 UTC
hi Dallas -- strace -f output would be handy, if at all possible.  you'd have to
catch it *before* it hangs, of course... ;)
Comment 48 Dallas Engelken 2005-01-19 09:08:07 UTC
i'll see what i can accomplish.  

anyways, i'm seeing this on about 6-7 different systems now all SMP based, so
that definitely is the common factor between them.   they are all redhat 7.3,
perl 5.6.1, running fedora-legacy smp kernels.

All the other systems are installed exactly the same but run on uniprocessor and
do not exhibit this behavior.  couple examples today.   i have a monitor script
running every minute to ping spamd and restart it if it does not pong back in 30
seconds.

example #1
----------------------

Spamd PING Failed at 2005-01-19 10:20:13

2005-01-19 10:18:56.558340500 [470] debug: log: [470] clean message (0.9/5.0)
for tilleyl@wmbac.com:200 in 8.9 seconds, 33516 bytes.
2005-01-19 10:18:56.558347500 [470] debug: log: [470] result: .  0 -
HTML_40_50,HTML_MESSAGE,HTML_NONELEMENT_70_80
scantime=8.9,size=33516,mid=<003901c4fe3a$1f51a860$8201a8c0@migcorp.net>,autolearn=disabled
2005-01-19 10:18:56.625120500 [470] debug: log: [470] child is processing
message 9 of 25
2005-01-19 10:18:56.625665500 [26564] debug: prefork: child 470: entering state 1
2005-01-19 10:18:56.625797500 [26564] debug: prefork: child reports idle
2005-01-19 10:18:56.625921500 [26564] debug: prefork: child states: IIIIII
2005-01-19 10:18:56.626081500 [26564] debug: prefork: child 1773: entering state 3
2005-01-19 10:18:56.626307500 [26564] debug: prefork: adjust: decreasing, too
many idle children (6 > 5), killed 1773
2005-01-19 10:18:56.632265500 [26564] debug: log: [26564] server hit by SIGCHLD
2005-01-19 10:18:56.632397500 [26564] debug: log: [26564] handled cleanup of
child pid 1773
2005-01-19 10:20:11.266476500 [26564] debug: log: [26564] server killed by
SIGTERM, shutting down
2005-01-19 10:20:15.206533500 [2263] debug: logger: enabled with:
2005-01-19 10:20:15.206539500 debug: logger:    facility: stderr
2005-01-19 10:20:15.206540500 debug: logger:    socket:   file
2005-01-19 10:20:15.206542500 debug: logger:    output:   stderr




example #2
-----------------------

Spamd PING Failed at 2005-01-19 10:25:13

2005-01-19 11:19:50.648468500 [24111] debug: check:
subtests=__CT,__CTYPE_HAS_BOUNDARY,__HAS_MSGID,__HAS_SUBJECT,__HTML_LENGTH_1024_1536,__MIME_HTML,__MIME_QP,__MIME_VERSION,__SANE_MSGID,__TAG_EXISTS_BODY,__TAG_EXISTS_HEAD,__TAG_EXISTS_HTML,__TAG_EXISTS_META
2005-01-19 11:19:50.658612500 [24111] debug: log: [24111] clean message
(4.0/5.0) for langner@anesis.com:200 in 11.1 seconds, 3668 bytes.
2005-01-19 11:19:50.658931500 [24111] debug: log: [24111] result: .  3 -
HTML_MESSAGE,HTML_TAG_EXIST_TBODY,MIME_HTML_MOSTLY,URIBL_OB_SURBL,URIBL_WS_SURBL
scantime=11.1,size=3668,mid=<3n21c8$51bvqr@ausc60ps301.us.dell.com>,autolearn=disabled
2005-01-19 11:19:50.721801500 [24111] debug: log: [24111] child is processing
message 6 of 25
2005-01-19 11:19:50.722339500 [24085] debug: prefork: child 24111: entering state 1
2005-01-19 11:19:50.722468500 [24085] debug: prefork: child reports idle
2005-01-19 11:19:50.722590500 [24085] debug: prefork: child states: IIIIII
2005-01-19 11:19:50.722717500 [24085] debug: prefork: child 24576: entering state 3
2005-01-19 11:19:50.722904500 [24085] debug: prefork: adjust: decreasing, too
many idle children (6 > 5), killed 24576
2005-01-19 11:19:50.728925500 [24085] debug: log: [24085] server hit by SIGCHLD
2005-01-19 11:19:50.729046500 [24085] debug: log: [24085] handled cleanup of
child pid 24576
2005-01-19 11:20:11.776712500 [24085] debug: log: [24085] server killed by
SIGTERM, shutting down
2005-01-19 11:20:19.172192500 [24790] debug: logger: enabled with:
2005-01-19 11:20:19.172198500 debug: logger:    facility: stderr
2005-01-19 11:20:19.172200500 debug: logger:    socket:   file
2005-01-19 11:20:19.172201500 debug: logger:    output:   stderr
Comment 49 Justin Mason 2005-01-19 11:01:26 UTC
and they're all using Storable 2.13?
Comment 50 Dallas Engelken 2005-01-19 11:56:25 UTC
ya, as soon as we updated to storable 2.13 and the new 3.0.2 spamassassin it
started happening on those smp based boxes.

i have a full trace running on a couple boxes now so i should be able to provide
more info soon.
Comment 51 Dallas Engelken 2005-01-22 08:48:44 UTC
at 10:05 am, the spamd log was showing...


@4000000041f27a770c42fc14 debug: daemon:        LocalAddr: 127.0.0.1
@4000000041f27a770c42fffc debug: daemon:        LocalPort: 783
@4000000041f27a770c4303e4 debug: daemon:        Proto: 6
@4000000041f27a770c4307cc debug: daemon:        ReuseAddr: 1
@4000000041f27a770c430bb4 debug: daemon:        Type: 1
@4000000041f27a770c4f19a4 Could not create INET socket on 127.0.0.1:783: Address
already in use (IO::Socket::INET: Address already in use)
@4000000041f27a7808782cbc [2080] debug: logger: enabled with:
@4000000041f27a7808783874 debug: logger:        facility: stderr
@4000000041f27a7808783c5c debug: logger:        socket:   file
@4000000041f27a7808784044 debug: logger:        output:   stderr
@4000000041f27a780878d0cc [2080] debug: logger: facility stderr disabled:
already debugging to stderr.
@4000000041f27a780887d874 [2080] debug: daemon: creating INET socket:
@4000000041f27a780887e044 debug: daemon:        Listen: 128
@4000000041f27a780887e044 debug: daemon:        LocalAddr: 127.0.0.1
@4000000041f27a780887e814 debug: daemon:        LocalPort: 783
@4000000041f27a780887e814 debug: daemon:        Proto: 6
@4000000041f27a780887ebfc debug: daemon:        ReuseAddr: 1
@4000000041f27a780887efe4 debug: daemon:        Type: 1
@4000000041f27a7808988dcc Could not create INET socket on 127.0.0.1:783: Address
already in use (IO::Socket::INET: Address already in use)


[root@email spamd]# ps auxwww | grep spamd
root      1287  0.1  1.2 28360 26480 ?       T    08:40   0:07 /usr/bin/perl -T
-w /usr/bin/spamd --max-children=10 --max-conn-per-child=10 --min-children=5
--min-spare=3 --max-spare=5 --syslog-socket=none -q -D -u doublecheck
200      23708  0.0  0.0     0    0 ?        Z    09:39   0:00 [spamd <defunct>]
200      24005  0.0  0.0     0    0 ?        Z    09:39   0:00 [spamd <defunct>]
200      24170  0.0  0.0     0    0 ?        Z    09:39   0:00 [spamd <defunct>]
200      24393  0.0  0.0     0    0 ?        Z    09:40   0:00 [spamd <defunct>]
200      26206  0.0  0.0     0    0 ?        Z    09:55   0:00 [spamd <defunct>]
200      26510 13.6  0.0  1360  316 ?        S    09:56   1:37
/usr/local/bin/multilog t s2048000 !/usr/local/doublecheck/bin/logappend4
/var/log/spamd
root     28579  0.0  0.4 10120 8260 ?        T    10:00   0:00 /usr/bin/perl -T
-w /usr/bin/spamd --max-children=10 --max-conn-per-child=10 --min-children=5
--min-spare=3 --max-spare=5 --syslog-socket=none -q -D -u doublecheck
root     28606  0.0  0.0  1344  296 ?        S    10:00   0:00 supervise spamd
root      2387  4.0  0.1  3844 2548 ?        R    10:08   0:00 /usr/bin/perl -T
-w /usr/bin/spamd --max-children=10 --max-conn-per-child=10 --min-children=5
--min-spare=3 --max-spare=5 --syslog-socket=none -q -D -u doublecheck
root      2402  0.0  0.0  1772  636 pts/1    S    10:08   0:00 grep spamd


[root@email spamd]# strace -p 28579
--- SIGTERM (Terminated) ---

[root@email spamd]# ps auxwww | grep spamd
root      1287  0.1  1.2 28360 26480 ?       T    08:40   0:07 /usr/bin/perl -T
-w /usr/bin/spamd --max-children=10 --max-conn-per-child=10 --min-children=5
--min-spare=3 --max-spare=5 --syslog-socket=none -q -D -u doublecheck


[root@email spamd]# strace -p 1287
--- SIGTERM (Terminated) ---


2005-01-22 09:40:33.941653500 is the last time an email was processed.

however, spamd answered a pong 5 minutes later..

2005-01-22 09:45:00.515695500 [pid 23542] read(9, "PING SPAMC/1.0\n", 4096) = 15
2005-01-22 09:45:00.515781500 [pid 23542] alarm(0)                    = 30
2005-01-22 09:45:00.515862500 [pid 23542] write(9, "SPAMD/1.2 0 PONG\r\n", 18) = 18


this box does a quite a bit of mail.. so no CHECK's in a 5 minute period is
impossible.  there is an email every 3-5 seconds.

i'm attaching the straced debug log.
Comment 52 Dallas Engelken 2005-01-22 08:51:02 UTC
Created attachment 2619 [details]
straced spamd

straced spamd processes during a hang
Comment 53 Dallas Engelken 2005-01-22 08:54:09 UTC
also, i've switched back to --round-robin, we'll see if the hangs go away.
Comment 54 Dallas Engelken 2005-01-24 07:24:36 UTC
running --round-robing has removed the hangs...  spamd was hanging up every 4 to
8 hours, now there hasnt been one since i changed it saturday morning.

Comment 55 Leonard Gray 2005-02-16 09:51:48 UTC
Runnning with Solaris 8, these patches cause the error:

prefork: select timeout failed! recovering

to be displayed to the syslog continuously, and cause spamd to not process any
messages.

I'd love to get these patches working, and am afraid that the final released
version of these could cause problems for us.  Any direction?
Comment 56 Tom Schulz 2005-03-15 10:24:05 UTC
Could the patch submitted in Bug 4189 be the fix for the problems seen in
comment 44 through comment 55?
Comment 57 Justin Mason 2005-03-15 19:15:17 UTC
yes, bug 4189 is a bug, and the patch in that bug should be applied on top of
these.   However, I don't think it's the bug Dallas is reporting; this bug
manifests itself as lots of "K" state children, and Dallas' logs don't show that.

BTW Dallas, I note something very wierd in those logs you posted:

---------------------------------------------------------------------------
2005-01-22 09:50:00.782509500 [pid  1287] write(11, "A\n", 2)         = 2
2005-01-22 09:50:00.782578500 [pid 23708] <... read resumed> "A\n", 4096) = 2
2005-01-22 09:50:00.782652500 [pid  1287] write(2, "[1287] debug: prefork:
ordered 2"..., 47[1287] debug: prefork: ordered 23708 to accept
2005-01-22 09:50:00.782695500 ) = 47
2005-01-22 09:50:00.782747500 [pid  1287] read(11,  <unfinished ...>
2005-01-22 09:50:00.782941500 [pid 23708] rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
2005-01-22 09:50:00.783028500 [pid 23708] accept(5, {sin_family=AF_INET,
sin_port=htons(40829), sin_addr=inet_addr("127.0.0.1")}}, [16]) = 11
2005-01-22 09:55:00.629245500 [pid 23708] fcntl64(11, F_GETFL)        = 0x2
(flags O_RDWR)
2005-01-22 09:55:00.629333500 [pid 23708] fstat64(11, {st_mode=S_IFSOCK|0777,
st_size=0, ...}) = 0
2005-01-22 09:55:00.629470500 [pid 23708] mmap2(NULL, 4096,
PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40203000
---------------------------------------------------------------------------

note the 5-minute wait time between the accept() calling and returning. that's
the hang, in wait_for_child_to_accept() in SpamdForkScaling.pm, where it
calls

    my $state = $self->read_one_line_from_child_socket($sock);

essentially, the child process is hung in the accept() system call. (This is
not supposed to happen, naturally. ;)   The spamd will sit there waiting for
the child to report its status, and the child is just hung waiting for accept()
to return...

So the question is, why is accept() hanging?  Is it:

    - 1. because there's no connection there to accept, according to the child,
      and it's just waiting for the next conn request?
    - 2. because something in the kernel is holding onto a lock and eventually
      time out after 5 minutes exactly?

The timestamps (09:50:00.783028500 and 09:55:00.629245500) are almost exactly 5
minutes apart.  That doesn't seem to be accidental, and would lead me to
suspect that #1 is very unlikely.   This doesn't seem to be a particularly
quiet server, where the socket would see no activity for a whole 5 minutes...
right?

If that's the case, I think #2 seems likely.

Anyway, I'll try adding some timeout code to the preforking code, to take
care of this.  But it looks like some kind of kernel lock timeout to me...
accept() should not be sitting on the socket for that long assuming there's
other connections arriving during those 5 mins.

--j.