Bug 4594 - [review] spamd dies unexpectedly: prefork: ordered child to accept, but child reported state '1'
[review] spamd dies unexpectedly: prefork: ordered child to accept, but child...
Status: RESOLVED FIXED
Product: Spamassassin
Classification: Unclassified
Component: spamc/spamd
3.1.0
Other FreeBSD
: P4 minor
: 3.1.8
Assigned To: SpamAssassin Developer Mailing List
go ahead
:
: 4618 (view as bug list)
Depends on:
Blocks:
  Show dependency tree
 
Reported: 2005-09-20 00:50 UTC by Helge Oldach
Modified: 2006-12-11 11:53 UTC (History)
13 users (show)



Attachment Type Modified Status Actions Submitter/CLA Status
ktrace from spamd4.oct.nac.net (last 1000 lines) application/octet-stream None Ryan Pavely [NoCLA]
ktrace from merak.origin-it.net (last 1000 lines) text/plain None Helge Oldach [NoCLA]
attempted patch patch None Justin Mason [HasCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Helge Oldach 2005-09-20 00:50:46 UTC
Upgraded from 3.0.4 to 3.1.0. No further configuration change.

etc/mail/spamassassin/local.cf contains just "report_safe 0".

Starting spamd as
/usr/local/bin/spamd -c -d -r /var/run/spamd/spamd.pid -u spamd -H /home/spamd -
x -i merak.origin-it.net -A 194.8.96.226,194.8.96.234,194.8.96.250 -m 15

Syslog before crash:
Sep 19 21:28:57 merak spamd[52026]: prefork: child states: IBBI
Sep 19 21:28:58 merak spamd[52026]: prefork: child states: BIBI
Sep 19 21:29:03 merak spamd[52026]: prefork: child states: IIII
Sep 19 21:29:03 merak spamd[52026]: spamd: handled cleanup of child pid 60800 
due to SIGCHLD
Sep 19 21:29:03 merak spamd[52026]: prefork: child states: III
Sep 19 21:29:03 merak spamd[52026]: spamd: handled cleanup of child pid 60797 
due to SIGCHLD
Sep 19 21:29:03 merak spamd[52026]: prefork: child states: II
Sep 19 21:29:03 merak spamd[52026]: prefork: child states: II
Sep 19 21:29:40 merak spamd[52026]: prefork: child states: II
Sep 19 21:29:44 merak spamd[52026]: prefork: child states: II
Sep 19 21:29:48 merak spamd[52026]: prefork: child states: II
Sep 19 21:29:54 merak spamd[52026]: prefork: child states: BB
Sep 19 21:29:54 merak spamd[52026]: spamd: server successfully spawned child 
process, pid 60801
Sep 19 21:29:54 merak spamd[52026]: prefork: child states: BBI
Sep 19 21:29:54 merak spamd[52026]: prefork: child states: III
Sep 19 21:29:54 merak spamd[52026]: spamd: handled cleanup of child pid 60801 
due to SIGCHLD
Sep 19 21:29:54 merak spamd[52026]: prefork: child states: II
Sep 19 21:30:01 merak spamd[52026]: prefork: ordered child to accept, but child 
reported state '1' at /usr/local/lib/perl5/site_perl/5.8.7/Mail/SpamAssassin/
SpamdForkScaling.pm line 390.
Comment 1 Ryan Pavely 2005-09-21 08:01:43 UTC
I also upgraded from 3.0.4 to 3.1.0
 Spamd is dieing every few hours.
 I tested limiting the children, and tweaking those settings however death still
occurs.
Sep 21 04:36:43 spamd6 spamd[22770]: prefork: child states: BBBBBBBBI
Sep 21 04:36:43 spamd6 spamd[22770]: prefork: ordered child to accept, but child
reported state '1' at
/usr/local/lib/perl5/site_perl/5.8.7/Mail/SpamAssassin/SpamdForkScaling.pm line
390. 
Sep 21 07:18:08 spamd6 spamd[58292]: prefork: ordered child to accept, but child
reported state '1' at
/usr/local/lib/perl5/site_perl/5.8.7/Mail/SpamAssassin/SpamdForkScaling.pm line
390. 
Comment 2 Simon Talbot 2005-09-23 04:19:51 UTC
I have recreated the exact same problem on 5 different mailservers. We were 
upgrading from 3.0.3 to 3.1.0 on Linux 2.4.22. spamd dies every 4-12 hours.

Sep 21 04:05:56 jsmx3 spamd[969]: prefork: child states: III
Sep 21 04:05:56 jsmx3 spamd[969]: spamd: handled cleanup of child pid 12623 due 
to SIGCHLD
Sep 21 04:05:56 jsmx3 spamc[12816]: connect(AF_INET) to spamd at 127.0.0.1 
failed, retrying (#1 of 3): Connection refused
Sep 21 04:05:57 jsmx3 spamc[12816]: connect(AF_INET) to spamd at 127.0.0.1 
failed, retrying (#2 of 3): Connection refused
Sep 21 04:05:58 jsmx3 spamc[12816]: connect(AF_INET) to spamd at 127.0.0.1 
failed, retrying (#3 of 3): Connection refused
Sep 21 04:05:59 jsmx3 spamc[12816]: connection attempt to spamd aborted after 3 
retries

Comment 3 Justin Mason 2005-09-23 10:46:54 UTC
1. Simon, are you sure it's the same bug?  the 'prefork: ordered child to
accept, but child reported state '1' at
/usr/local/lib/perl5/site_perl/5.8.7/Mail/SpamAssassin/SpamdForkScaling.pm line
390.' line does not appear in that syslog output you posted.

If that error does not appear, it's not the same bug.

2. Helge, Ryan, can you provide ktrace output for the spamd processes?  we'll
need dumps from the master spamd process at least.

3.  What exactly happens after that message appears in the syslog?  in other
words what does "death" mean?
Comment 4 Helge Oldach 2005-09-23 11:17:15 UTC
ad 2) will do.

ad 3) "Death" means that spamd is gone. The process just terminates. Note that 
said line 390 is a "die" statement, and that's just what happens. Of course 
there is nothing in the syslog after the spamd process has terminated.
Comment 5 Ryan Pavely 2005-09-23 11:40:50 UTC
I will gather a ktrace for ya.  Just like Helge states, the spamd process
terminates.  With debugging on you still see the other children process(s)
finish up. With debugging off all you see is  'line 390.' (end of log)


Sep 21 09:29:37 spamd4 spamd[58422]: prefork: ordered child to accept, but child
reported state '1' at
/usr/local/lib/perl5/site_perl/5.8.7/Mail/SpamAssassin/SpamdForkScaling.pm line
390. 
Sep 21 09:29:37 spamd4 spamd[58554]: spamd: connection from mx5.oct.nac.net
[209.123.233.205] at port 64776 
Sep 21 09:29:37 spamd4 spamd[58554]: spamd: processing message (...):0 
Sep 21 09:29:37 spamd4 spamd[58801]: spamd: identified spam (28.5/5.0) for 
(...):0 in 5.5 seconds, 1653 bytes. 
Sep 21 09:29:37 spamd4 spamd[58801]: spamd: result: Y 28 -
BAYES_99,DATE_IN_FUTURE_03_06,DCC_CHECK,DIGEST_MULTIPLE,HELO_DYNAMIC_IPADDR2,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E8_51_100,RAZOR2_CHECK,RCVD_IN_BL_SPAMCOP_NET,RCVD_IN_DSBL,RCVD_IN_XBL,URIBL_JP_SURBL,URIBL_SBL
scantime=5.5,size=1653,user=(...),uid=0,required_score=5.0,rhost=mx5.oct.nac.net,raddr=209.123.233.205,rport=61787,mid=(...),bayes=1,autolearn=spam

Sep 21 09:29:38 spamd4 spamd[58775]: spamd: identified spam (32.4/4.8) for
(...):0 in 5.4 seconds, 1231 bytes. 
Sep 21 09:29:38 spamd4 spamd[58775]: spamd: result: Y 32 -
BAYES_99,DCC_CHECK,DNS_FROM_RFC_ABUSE,DNS_FROM_RFC_POST,REMOVE_BEFORE_LINK,REPLICA_WATCH,UNPARSEABLE_RELAY,URIBL_AB_SURBL,URIBL_JP_SURBL,URIBL_OB_SURBL,URIBL_SBL,URIBL_SC_SURBL,URIBL_WS_SURBL
scantime=5.4,size=1231,user=(...),uid=0,required_score=4.8,rhost=mx5.oct.nac.net,raddr=209.123.233.205,rport=56482,mid=(...),bayes=1,autolearn=spam

Sep 21 09:29:38 spamd4 spamd[58793]: spamd: identified spam (22.6/6.0) for
(...):0 in 6.6 seconds, 3290 bytes. 
Sep 21 09:29:38 spamd4 spamd[58793]: spamd: result: Y 22 -
BAYES_50,HTML_40_50,HTML_MESSAGE,MIME_HTML_MOSTLY,RCVD_DOUBLE_IP_LOOSE,RCVD_IN_BL_SPAMCOP_NET,RCVD_IN_NJABL_DUL,RCVD_IN_NJABL_PROXY,RCVD_IN_SORBS_DUL,RCVD_IN_XBL,UNPARSEABLE_RELAY,URIBL_JP_SURBL,URIBL_OB_SURBL,URIBL_SBL,URIBL_WS_SURBL
scantime=6.6,size=3290,user=(...),uid=0,required_score=6.0,rhost=mx2.oct.nac.net,raddr=209.123.233.202,rport=4633,mid=(...),bayes=0.499973650187907,autolearn=spam

Sep 21 09:29:39 spamd4 spamd[58803]: spamd: identified spam (15.8/4.0) for
(...):0 in 5.3 seconds, 2180 bytes. 
Sep 21 09:29:39 spamd4 spamd[58803]: spamd: result: Y 15 -
BAYES_99,FUZZY_REFINANCE,HTML_MESSAGE,MIME_BOUND_NEXTPART,RCVD_IN_SORBS_DUL,URIBL_JP_SURBL,URIBL_SBL,URIBL_WS_SURBL
scantime=5.3,size=2180,user=(...),uid=0,required_score=4.0,rhost=mx3.oct.nac.net,raddr=209.123.233.203,rport=4585,mid=(...),bayes=0.999999748904047,autolearn=no

Sep 21 09:29:39 spamd4 spamd[58783]: spamd: identified spam (9.7/4.0) for
(...)m:0 in 6.0 seconds, 3629 bytes. 
Sep 21 09:29:39 spamd4 spamd[58783]: spamd: result: Y  9 -
BAYES_50,BODY_ENHANCEMENT2,INFO_TLD,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E8_51_100,RAZOR2_CHECK,URIBL_OB_SURBL,URIBL_WS_SURBL
scantime=6.0,size=3629,user=(...),uid=0,required_score=4.0,rhost=mx5.oct.nac.net,raddr=209.123.233.205,rport=50425,mid=(...),bayes=0.499999389778673,autolearn=no

Sep 21 09:29:42 spamd4 spamd[58554]: spamd: identified spam (45.9/6.0) for
(...):0 in 4.8 seconds, 1401 bytes. 
Sep 21 09:29:42 spamd4 spamd[58554]: spamd: result: Y 45 -
BAYES_99,DCC_CHECK,DIGEST_MULTIPLE,DK_POLICY_SIGNSOME,DK_POLICY_TESTING,DNS_FROM_RFC_ABUSE,DNS_FROM_RFC_POST,HELO_DYNAMIC_IPADDR2,INFO_TLD,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E4_51_100,RAZOR2_CF_RANGE_E8_51_100,RAZOR2_CHECK,RCVD_IN_BL_SPAMCOP_NET,RCVD_IN_XBL,SUBJECT_FUZZY_MEDS,SUBJ_BUY,URIBL_AB_SURBL,URIBL_JP_SURBL,URIBL_OB_SURBL,URIBL_SBL,URIBL_SC_SURBL,URIBL_WS_SURBL
scantime=4.8,size=1401,user=(...),uid=0,required_score=6.0,rhost=mx5.oct.nac.net,raddr=209.123.233.205,rport=64776,mid=(...),bayes=1,autolearn=spam

Sep 21 09:29:43 spamd4 spamd[58794]: spamd: identified spam (33.5/4.0) for
(...):0 in 10.3 seconds, 1247 bytes. 
Sep 21 09:29:43 spamd4 spamd[58794]: spamd: result: Y 33 -
BAYES_99,DCC_CHECK,RCVD_IN_WHOIS_INVALID,REMOVE_BEFORE_LINK,REPLICA_WATCH,SUBJ_YOUR_OWN,UNPARSEABLE_RELAY,URIBL_AB_SURBL,URIBL_JP_SURBL,URIBL_OB_SURBL,URIBL_SBL,URIBL_SC_SURBL,URIBL_WS_SURBL
scantime=10.3,size=1247,user=(...),uid=0,required_score=4.0,rhost=mx5.oct.nac.net,raddr=209.123.233.205,rport=62609,mid=(...),bayes=1,autolearn=spam

Sep 21 10:55:43 spamd4 spamd[62536]: logger: removing stderr method 
Sep 21 10:55:50 spamd4 spamd[62539]: spamd: server started on port 783/tcp
(running version 3.1.0) 
Sep 21 10:55:50 spamd4 spamd[62539]: spamd: server pid: 62539 

Just pre-fork info

Sep 21 09:29:32 spamd4 spamd[58422]: prefork: child states: BBBBBBBBBBB 
Sep 21 09:29:32 spamd4 spamd[58422]: prefork: child states: BBBBBBBBBBBI 
Sep 21 09:29:32 spamd4 spamd[58422]: prefork: child states: BBBBBBBBBIBI 
Sep 21 09:29:32 spamd4 spamd[58422]: prefork: child states: BBBBBBBIBIBI 
Sep 21 09:29:32 spamd4 spamd[58422]: prefork: child states: BBBBBBBIBIBK 
Sep 21 09:29:33 spamd4 spamd[58422]: prefork: child states: BBBIBBBBBBB 
Sep 21 09:29:33 spamd4 spamd[58422]: prefork: child states: BBBBBBBBBBB 
Sep 21 09:29:33 spamd4 spamd[58422]: prefork: child states: BBBBBBBBBBBB 
Sep 21 09:29:33 spamd4 spamd[58422]: prefork: child states: BBBBBBBBBBBBI 
Sep 21 09:29:34 spamd4 spamd[58422]: prefork: child states: BBBBBBBBIBBBI 
Sep 21 09:29:37 spamd4 spamd[58422]: prefork: ordered child to accept, but child
reported state '1' at
/usr/local/lib/perl5/site_perl/5.8.7/Mail/SpamAssassin/SpamdForkScaling.pm line
390. 
Sep 21 10:55:50 spamd4 spamd[62539]: prefork: child states: BI 
Sep 21 10:55:55 spamd4 spamd[62539]: prefork: child states: BB 
Sep 21 10:55:55 spamd4 spamd[62539]: prefork: child states: BB 
Sep 21 10:55:55 spamd4 spamd[62539]: prefork: child states: BBB 
Sep 21 10:55:55 spamd4 spamd[62539]: prefork: child states: BBBB 
Sep 21 10:55:56 spamd4 spamd[62539]: prefork: child states: BBBBB 
Sep 21 10:55:56 spamd4 spamd[62539]: prefork: child states: BBBBBB 
Sep 21 10:55:56 spamd4 spamd[62539]: prefork: child states: BBBBBBB 
Sep 21 10:55:56 spamd4 spamd[62539]: prefork: child states: BBBBBBBB 
Sep 21 10:55:56 spamd4 spamd[62539]: prefork: child states: BBBBBBBBB 
Comment 6 Ryan Pavely 2005-09-23 17:30:54 UTC
Created attachment 3146 [details]
ktrace from spamd4.oct.nac.net (last 1000 lines)
Comment 7 Helge Oldach 2005-09-26 13:26:12 UTC
Created attachment 3148 [details]
ktrace from merak.origin-it.net (last 1000 lines)

Original ktrace.out is 98 MB
Comment 8 Dale Blount 2005-09-28 06:29:38 UTC
Adding a me too.  This is happening on Linux 2.6 and not FreeBSD for me though.

2.63 - 3.0.4 worked fine for me on the same server.

"perl --version" output.

This is perl, v5.8.7 built for i686-linux-thread-multi

Let me know if you need anything else.
Comment 9 Justin Mason 2005-09-28 11:58:45 UTC
is there any way we can get a strace/ktrace of *both* the child and the parent,
preferably either as two separate files with timestamps, or as one file with the
lines interleaved?

both traces so far have provided only half the picture, unfortunately; I'm still
not sure what's happening.
Comment 10 Justin Mason 2005-09-28 11:59:21 UTC
oh btw I presume everyone knows that --round-robin is a workaround to avoid this
issue?
Comment 11 Dale Blount 2005-09-28 12:08:19 UTC
Jason, can you provide the command I need to run to get the proper strace for
you?  Unfortunately mine hasn't crashed in about 24 hours so I don't know if
it'l help.  Should lowering/raising my children make it happen more often?
Comment 12 Justin Mason 2005-09-28 15:39:20 UTC
if you strace -fp ${pid} -o log, where ${pid} is the PID of the master process,
then kill the kids so that it restarts new ones, that should get the strace
covering all processes.

however, I don't know of a way to cause this to happen sooner :(
Comment 13 Helge Oldach 2005-09-28 21:14:22 UTC
What would be the equivalent for ktrace? It appears that ktrace -d does *not* 
catch the new children.
Comment 14 Dale Blount 2005-09-30 08:46:41 UTC
warning: strace seems to die when the log file hits 2gb and takes spamd with it.

2gb lasts me a couple hours and with spamd dying less often than that, it's hard
to catch with strace.
Comment 15 Justin Mason 2005-10-10 11:35:26 UTC
*** Bug 4618 has been marked as a duplicate of this bug. ***
Comment 16 Dale Blount 2005-10-10 11:39:59 UTC
anyone else think it's odd that this is reproducable on FreeBSD, Linux, and
Solaris at least?  Is everyone else running Perl 5.8.x also?
Comment 17 Tom Schulz 2005-10-10 14:18:52 UTC
Using Perl 5.8.5 and Solaris 8 on an Ultra 10 with no problems.
Comment 18 Sidney Markowitz 2005-10-10 16:22:54 UTC
> Is everyone else running Perl 5.8.x also?

No, other reports of this were uing 5.6.1

I assume that you have no problem when you run --round-robin as a spamd option?
Comment 19 Dale Blount 2005-10-10 18:43:48 UTC
yes, --round-robin works as a work around for me.
Comment 20 Ryan Pavely 2005-10-11 04:35:48 UTC
I'm running Perl 5.8.7 on FreeBsd 4.8 (2 machines), 5.3 (4 machines).
Comment 21 Randy Jones 2005-11-04 18:38:19 UTC
This is a me too with this same issue. I am seeing this on Solaris 8 and
Solaris 10 with spamassassin 3.1.0, perl 5.8.5. I did not have this issue
with previous spamassassin releases - 3.0.4, 3.0.3, 3.0.2, 3.0.1, and 3.0.0.
I have just now restarted spamd with --round-robin and will see how this goes.
Some more on my environment:
Main "mail servers" (2) Sun Ultra 80's w/4x450, Solaris 8. These 2 handle
around 20k messages a day between the 2 of them. The rest of my setup
are mainly Solaris 10 computers - Ultra 30's, Ultra 60's, Blade 1000's,
and Blade 2500's that handle only the email for the user that "owns"
the workstation. I have had 1 or 2 (total) instances of spamd dieing on
these personal workstations. I have have 3 or 4 instances of spamd dieing
on the 2 main mail servers.
Comment 22 Tom Schulz 2005-11-04 19:03:23 UTC
It would seem to not require a high load condition to happen.  This is very
strange as we have never had this problem on an Ultra 10, Solaris 8 and
perl 5.8.5.  What compiler did you use to build your perl?  Did you use
the default optimization?  I used Sun's Forte Developer 7 C 5.4 and I
think that I accepted the default optimization.  Could this problem happen
only for perl compiled with gcc?
Comment 23 Tom Schulz 2005-11-07 19:25:32 UTC
I got to wondering if I might have something in my configuration that would
prevent this problem, and perhaps I do.  I have the following in my
configuration:

bayes_auto_expire 0
bayes_journal_max_size 0
bayes_learn_to_journal 1

To take care of the housekeeping, I have a cron job every 10 minutes that
does a 'sa-learn --sync' and a cron job that runs once a day that does a
'sa-learn --force-expire'.  That way, a child never has to do any bayes
maintenance.  The cron jobs run scripts that have logic to make sure that
sa-learn is not run if another is still active.
Perhaps this bug is triggered by a child doing maintenance.
Comment 24 Luke Reeves 2005-11-07 19:28:47 UTC
(In reply to comment #23)
Not in my case Tom.  I actually have all the Bayes features disabled and the
error still happened on my installation.

Comment 25 Randy Jones 2005-11-07 20:23:01 UTC
Apologies for commenting back so late however right after I rec'd the
comment notification email our T1 line to the internet went down. I am
replying to comment #22 from Tom Schulz...

The perl in use is from sunfreeware.com. I belive that they use Sun's Forte
compiler. The spamassassin (and other required) perl modules were "compiled"
on Solaris 8 using gcc 3.4.2.
Comment 26 Justin Mason 2006-02-10 20:21:28 UTC
There are some major bugfixes for the apache-style preforking subsystem in bug
4696.   If you are seeing this issue, please try applying the patch in that bug,
and see if it helps.
Comment 27 Helge Oldach 2006-02-15 20:00:06 UTC
Did apply the patch from bug 4696, but spamd still dies every couple of hours at 
random intervals.
Comment 28 Daryl C. W. O'Shea 2006-02-15 20:48:21 UTC
The latest patches right (id 3366 or 3368)?

Has the logging output changed?
Comment 29 Helge Oldach 2006-02-15 23:47:10 UTC
Yes, latest patches "aggregated patch, against the 3.1.0 release tarball, take 
2" dated 2006.02.11 02:08.

No, I still get: Feb 15 23:02:07 merak spamd[99234]: prefork: ordered child to 
accept, but child reported state '1' at /usr/local/lib/perl5/site_perl/5.8.7/
Mail/SpamAssassin/SpamdForkScaling.pm line 433.

However, the "tcp timeout" messages have disappeared.
Comment 30 Sidney Markowitz 2006-03-09 20:08:24 UTC
Triage time. This may be a severity critical bug, but does anyone think it
should block the 3.1.1 release now that bug 4696 has been closed? I would like
to retarget this to 3.1.2 unless anyone objects or has a fix.

To clarify the state of the problem: It seems that Helge still sees it with the
latest patches from bug 4696, is that still true? The other people who have
reported problems, Randy, Luke, Tom, and Ryan, do you still see it with the
latest patches, or is it now fixed for you? Is there anyone for whome the
--round-robin workaround not work?

Especially if --round-robin is a workaround that works for everyone, I want to
retarget this to 3.1.2. I'll do that if no objections surface soon
Comment 31 Daryl C. W. O'Shea 2006-03-09 20:18:29 UTC
Definitely not a blocker.
Comment 32 Scott Russell 2006-03-09 20:28:35 UTC
Does the latest round of patches address spamd not releasing the port before
existing when using --roundrobin? When killing the spamd pid port 783 remains in
use until each of the spamd children die because spamd exits immedietly. This
causes trouble when restarting spamd for rules_du_jour.

I only ask because without --roundrobin this same problem doesn't exist. 
Comment 33 Daryl C. W. O'Shea 2006-03-09 20:43:16 UTC
That's an issue with the included redhat-rc-script.sh init script which was
fixed (by using a PID file) about a month or so ago.
Comment 34 Scott Russell 2006-03-09 20:52:43 UTC
(In reply to comment #33)
> That's an issue with the included redhat-rc-script.sh init script which was
> fixed (by using a PID file) about a month or so ago.

A little help on this? Where's the bug report with the solution or patched init
script?
Comment 35 Daryl C. W. O'Shea 2006-03-09 21:01:43 UTC
bug 4655

The patched init script would be in svn since it's been fixed.
Comment 36 Justin Mason 2006-04-22 13:23:37 UTC
"critical" is used for release blockers; pretty sure this isn't one of those
Comment 37 Stephan Menzel 2006-07-05 08:12:30 UTC
got the same problem here on Debian Linux perl 6.8.4
Comment 38 Daryl C. W. O'Shea 2006-07-12 00:57:51 UTC
(In reply to comment #37)
> got the same problem here on Debian Linux perl 6.8.4

Please attach a copy of your spamd log, complete with time stamps, around the
time it happens.

Also, can you do anything to reproduce this?

Any observations about when it happens?  Does it only happen under heavy load? 
How heavy?
Comment 39 Stephan Menzel 2006-07-12 06:40:51 UTC
(In reply to comment #38)
> Please attach a copy of your spamd log, complete with time stamps, around the
> time it happens.

Sorry, I didn't read the bug thoroughly enough. --round-robin fixed the issue. 

By the way, we get significantly better performance with --round-robin too. With
roughly the same load on the spamd servers, we can feed in a lot more from the
MX before they start refusing connections due to high load.

> Also, can you do anything to reproduce this?

It were pretty much exactly the same symptoms as described above. also the log
entry was the same.

> Any observations about when it happens?  Does it only happen under heavy load? 
> How heavy?

Well, having the servers run at a load of about 4-6 you can pretty much wait for
it to happen. On 15 servers you wouldn't wait much longer than 15 mins for one
to crash. Chances are, within 4 hours 50% of them are gone.

With --round-robin it didn't happen once :-)

Thanks and greetings...

Stephan
Comment 40 Dale Blount 2006-07-12 14:08:35 UTC
On 3 pretty loaded servers during prime spam crunching time, you could
anticipate this to happen within an hour on one of them.  After the first one
goes, the second two are always close behind since they get more requests.

--round-robin fixes it for me also (spamd uptimes typical of amount of time
between SA releases).
Comment 41 Justin Mason 2006-07-14 11:31:33 UTC
lowering pri; --round-robin is a functioning workaround, sounds like
Comment 42 Justin Mason 2006-07-17 12:35:51 UTC
Joe Schaefer reports this in the ASF machine hermes' syslog, and spamd crashing
"twice this week":

>   Jul 13 15:34:48 hermes spamd[28964]: prefork: ordered child to accept,
>   but child reported state '1' at
>   /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm
>   line 440.
Comment 43 Daryl C. W. O'Shea 2006-07-18 07:00:37 UTC
Can someone provide me with the following (email me a link to download):

  * their spamd parameters
  * the complete (since spamd startup) spamd contents of a maillog demoing this
    (I'd like to see back to the initial launch of spamd or all of the children
     being killed off due to --max-conn-per-child once)


If you can provide the same info for more than one occurrence that'd be great.
Comment 44 Sebastian Porombka 2006-08-04 08:53:00 UTC
Using Debian with Perl v5.8.4 built for i386-linux-thread-multi. 
Same problem here on 3 machines.

OPTIONS="--create-prefs --max-children 15 --helper-home-dir"
Comment 45 Justin Mason 2006-08-04 09:30:41 UTC
how frequently?  can you provide logs?
Comment 46 Damiano Verzulli 2006-10-07 00:29:19 UTC
(In reply to comment #45)
> how frequently?  can you provide logs?

Same problem here: after upgrading from 3.0.2 to 3.1.5 (on Sep 22 2006) we had 
following two "die":
-----------------------
Sep 30 04:47:17 noc2 spamd[15802]: prefork: ordered child to accept, but child 
reported state '1' 
at /usr/lib/perl5/site_perl/5.8.5/Mail/SpamAssassin/SpamdForkScaling.pm line 
441.
[...]
Oct  4 00:32:14 noc spamd[21247]: prefork: ordered child to accept, but child 
reported state '1' at /usr/lib/perl5/site_perl/5.8
.5/Mail/SpamAssassin/SpamdForkScaling.pm line 441.
------------------------
Since the last restart (on Oct  6 12:13:32), SA is still running.

3.1.5 is running with "perl, v5.8.5 built for i386-linux-thread-multi" under 
linux "2.6.8.1-12mdksmp"

With previous version (3.0.2) SA run fine with _no_ problem, since Mar 18 2005.

Now, I'm trying the "--round-robin" workaround.

Comment 47 Larry Rosenman 2006-10-09 21:54:29 UTC
I've got an occurance of this today. Upgraded to 3.1.6 (Via FreeBSD ports 
around 09:20 am (CDT)), and spamd apparently died around 12:54:42 with this 
error. 

Maillog including both these timestamps is at:

http://lists.acesthetics.com/~ler/maillog.spamd.gz

-rw-r--r--  1 ler  ler  2551573 Oct  9 23:52 maillog.spamd.gz


Let me know if you need anything else.
Comment 48 Larry Rosenman 2006-10-11 11:57:52 UTC
I had another occurance today.  I have the log since midnight, and can supply 
the logs since the last one if you want. 


Seeing that nobody's downloaded the previous log, I'll hold off till I hear 
from someone. 

Comment 49 Justin Mason 2006-10-11 14:18:25 UTC
damn, busted ;)

Larry, if you can get a full log with --debug, that would be really useful...
Comment 50 Justin Mason 2006-10-11 14:20:17 UTC
Created attachment 3713 [details]
attempted patch

here's a patch to SpamdForkScaling.pm that may help.  Basically, it'll
catch the failure condition (I think!) and recover from it... I haven't
been able to repro it, but based on simulation, this seems to recover
quite well.

if you try it out and spamd successfully recovers from the error condition
(it'll still be logged), please comment here.
Comment 51 Larry Rosenman 2006-10-11 14:22:20 UTC
I really would hate to do that on this server as it's a client server. 

This server is rather busy.  What could we supply in place of that?  (It's 
been a 2 occurance thing since the upgrade to 3.1.6). 

It's been on a random time scale as well, which sux(tm). 

What else can I provide on a post-mortem basis?

Comment 52 Daryl C. W. O'Shea 2006-10-11 14:23:11 UTC
I'm interested in another log, with debug enabled if you can.

BTW, what version did you upgrade from?
Comment 53 Larry Rosenman 2006-10-11 14:28:16 UTC
I have, however, applied the patch, and will check over the next few days to 
see whether it handles it gracefully.

Thanks, JM.....

Daryl: I upgraded from 3.1.4 on this box. 
Comment 54 Daryl C. W. O'Shea 2006-10-11 14:31:33 UTC
Weird, you should have seen this happen with 3.1.4 too.  Was anything else upgraded?
Comment 55 Larry Rosenman 2006-10-11 14:33:38 UTC
(In reply to comment #54)
> Weird, you should have seen this happen with 3.1.4 too.  Was anything else 
upgraded?

Nope.  Just adding load to it. 

What's also strange is, I have a number of other servers of similar ilk, and 
hadn't seen this insect till my first report on the list a couple of days ago. 

Comment 56 Larry Rosenman 2006-10-11 15:25:00 UTC
Log from midnight (US/Central) till just after I did the restart:

http://lists.acesthetics.com/~ler/maillog.spamd.20061011.gz

-rw-r--r--  1 ler  ler  6027293 Oct 11 13:54 maillog.spamd.20061011.gz

this is PRE-Patch. 

It also is WITHOUT --debug. 

Let me know what else I can do. 

Comment 57 Larry Rosenman 2006-10-17 07:23:32 UTC
As an update, since applying the patch, I've not had an incident of this.  
I've also not seen the log message, so I don't know if something else changed 
(there have been no system changes), or what. 

Just an update. 

Comment 58 Tom Schulz 2006-10-17 08:53:22 UTC
> I've also not seen the log message...

You probably already know this, but just in case you missed it, the log
message has been slightly changed.  Are you looking for the changed message?
Comment 59 Larry Rosenman 2006-10-17 08:54:55 UTC
I've been looking for "prefork: order", and that seems to still be the text. 

Comment 60 Jim Keller 2006-10-26 13:58:50 UTC
Just wanted to add that I also saw this bug on one of my mailservers the other
day. FreeBSD 6.0, SpamAssassin 3.1.0_3. I Tracked it down to this same error
message in the log. 

Any official word on the proposed patch?
Comment 61 Larry Rosenman 2006-10-26 14:49:45 UTC
I still have NOT seen the message since applying the patch.  So far spamd has 
stayed up just fine.  So, I'm not sure what to say here :( 

Comment 62 Larry Rosenman 2006-10-30 11:14:08 UTC
looks like the patch works: 
/var/log/maillog:Oct 30 02:37:44 lists spamd[15552]: prefork: ordered child 
7268 to accept, but they reported state '1', killing rogue 
at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm 
line 457.
/var/log/maillog:Oct 30 02:42:54 lists spamd[15552]: prefork: ordered child 
8306 to accept, but they reported state '1', killing rogue 
at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm 
line 457.
/var/log/maillog:Oct 30 02:43:01 lists spamd[15552]: prefork: ordered child 
8319 to accept, but they reported state '1', killing rogue 
at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm 
line 457.
/var/log/maillog:Oct 30 06:31:57 lists spamd[15552]: prefork: ordered child 
65448 to accept, but they reported state '1', killing rogue 
at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm 
line 457.
/var/log/maillog:Oct 30 07:33:39 lists spamd[15552]: prefork: ordered child 
68949 to accept, but they reported state '1', killing rogue 
at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm 
line 457.
/var/log/maillog:Oct 30 07:33:54 lists spamd[15552]: prefork: ordered child 
69086 to accept, but they reported state '1', killing rogue 
at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm 
line 457.
/var/log/maillog:Oct 30 07:39:08 lists spamd[15552]: prefork: ordered child 
69246 to accept, but they reported state '1', killing rogue 
at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm 
line 457.
/var/log/maillog:Oct 30 08:46:38 lists spamd[15552]: prefork: ordered child 
7350 to accept, but they reported state '1', killing rogue 
at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm 
line 457.
/var/log/maillog:Oct 30 08:51:48 lists spamd[15552]: prefork: ordered child 
25608 to accept, but they reported state '1', killing rogue 
at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm 
line 457.
/var/log/maillog:Oct 30 08:51:56 lists spamd[15552]: prefork: ordered child 
25656 to accept, but they reported state '1', killing rogue 
at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm 
line 457.
/var/log/maillog:Oct 30 09:43:18 lists spamd[15552]: prefork: ordered child 
33718 to accept, but they reported state '1', killing rogue 
at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm 
line 457.
/var/log/maillog:Oct 30 10:16:11 lists spamd[15552]: prefork: ordered child 
4479 to accept, but they reported state '1', killing rogue 
at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm 
line 457.
/var/log/maillog:Oct 30 11:32:56 lists spamd[15552]: prefork: ordered child 
5382 to accept, but they reported state '1', killing rogue 
at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm 
line 457.
/var/log/maillog:Oct 30 11:38:05 lists spamd[15552]: prefork: ordered child 
10114 to accept, but they reported state '1', killing rogue 
at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm 
line 457.
/var/log/maillog:Oct 30 11:38:11 lists spamd[15552]: prefork: ordered child 
10159 to accept, but they reported state '1', killing rogue 
at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm 
line 457.
/var/log/maillog:Oct 30 12:17:26 lists spamd[15552]: prefork: ordered child 
39870 to accept, but they reported state '1', killing rogue 
at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm 
line 457.
/var/log/maillog:Oct 30 12:22:19 lists spamd[15552]: prefork: ordered child 
48921 to accept, but they reported state '1', killing rogue 
at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm 
line 457.
/var/log/maillog:Oct 30 12:22:29 lists spamd[15552]: prefork: ordered child 
51038 to accept, but they reported state '1', killing rogue 
at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm 
line 457.
/var/log/maillog:Oct 30 12:22:35 lists spamd[15552]: prefork: ordered child 
51144 to accept, but they reported state '1', killing rogue 
at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm 
line 457.
/var/log/maillog:Oct 30 13:00:45 lists spamd[15552]: prefork: ordered child 
53163 to accept, but they reported state '1', killing rogue 
at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm 
line 457.
/var/log/maillog:Oct 30 13:05:56 lists spamd[15552]: prefork: ordered child 
75100 to accept, but they reported state '1', killing rogue 
at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm 
line 457.
/var/log/maillog:Oct 30 13:10:59 lists spamd[15552]: prefork: ordered child 
83117 to accept, but they reported state '1', killing rogue 
at /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/SpamdForkScaling.pm 
line 457.
^C$
 

Comment 63 Justin Mason 2006-11-02 13:29:59 UTC
that looks good alright.
Comment 64 Larry Rosenman 2006-11-07 06:47:10 UTC
FYI, I sent a PR in to the FreeBSD folks, and it's been committed with the fix 
as of last night (US/Central). 

Comment 65 Justin Mason 2006-11-09 13:34:35 UTC
thanks Larry; setting to [review] status
Comment 66 Helge Oldach 2006-11-24 02:02:44 UTC
I suspect that bug 5149 is caused by this fix. This one started appearing just 
after this commit. Please crosscheck.
Comment 67 Daryl C. W. O'Shea 2006-12-08 10:15:01 UTC
+1 on 3713

(In reply to comment #66)
> I suspect that bug 5149 is caused by this fix. This one started appearing just 
> after this commit. Please crosscheck.

I doubt it, 3.1.4 (the version reported with the problem in bug 5149) was
released 2.5 months before the patch was supplied.

Comment 68 Theo Van Dinter 2006-12-08 11:51:54 UTC
sure.  +1
Comment 69 Daryl C. W. O'Shea 2006-12-11 11:53:40 UTC
[dos@FC5-VPC 3.1]$ svn ci -m "bug 4594: add workaround for the rare 'ordered
child to accept, but they reported state '1'' bug"
Sending        lib/Mail/SpamAssassin/SpamdForkScaling.pm
Transmitting file data .
Committed revision 485847.