Bug 5731 - DB_File locking hangs on Ubuntu feisty / gutsy Linux systems in spamd
Summary: DB_File locking hangs on Ubuntu feisty / gutsy Linux systems in spamd
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Libraries (show other bugs)
Version: SVN Trunk (Latest Devel Version)
Hardware: Other other
: P5 normal
Target Milestone: 3.3.0
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-11-27 02:54 UTC by Justin Mason
Modified: 2008-02-21 04:17 UTC (History)
0 users



Attachment Type Modified Status Actions Submitter/CLA Status

Note You need to log in before you can comment on or make changes to this bug.
Description Justin Mason 2007-11-27 02:54:26 UTC
(opening a bug for this very annoying issue.)

Recently, the buildbot slaves running on Ubuntu platforms (t-feisty-561 and
jm-feisty) have been failing the 'make test' run, roughly once every couple
of days.


All these buildbot failures recently share these things in common:

- they're happening on Ubuntu Feisty or Gutsy systems (jm-feisty is running
Gutsy nowadays, despite the name)

- they are spamd tests

- they involve a spamd not giving its "server pid: \d+" line, which
  indicates that it's ready to receive connections, in time (within
  something like 3 minutes)

- upon examination, the spamd log looks like this:


[6910] dbg: rules: __MO_OL_25340 merged duplicates: __MO_OL_4EEDB
__MO_OL_7533E
[6910] dbg: conf: finish parsing
[6910] dbg: plugin:
Mail::SpamAssassin::Plugin::ReplaceTags=HASH(0x89db6e4) implements
'finish_parsing_end', priority 0
[6910] dbg: replacetags: replacing tags
[6910] dbg: replacetags: done replacing tags
[6910] dbg: bayes: tie-ing to DB file R/O ./log/user_state/bayes_toks
[6910] dbg: bayes: tie-ing to DB file R/O ./log/user_state/bayes_seen
[eof]


  or


[13512] dbg: rules: running head tests; score so far=9.205
[13512] dbg: rules: compiled head tests
[13512] dbg: rules: running head_eval tests; score so far=9.205
[13512] dbg: locker: safe_lock: created
./log/user_state/auto-whitelist.lock.vmsa.13512
[13512] dbg: locker: safe_lock: trying to get lock on
./log/user_state/auto-whitelist with 0 retries
[13512] dbg: locker: safe_lock: link to ./log/user_state/auto-whitelist.lock:
link ok
[13512] dbg: auto-whitelist: tie-ing to DB file of type DB_File R/W in
./log/user_state/auto-whitelist


  in other words, it seems to be hanging during the compile_now()
  operation, while trying to access a Bayes DB, or an AWL DB.  

  two examples:
  http://buildbot.spamassassin.org:8010/t-feisty-561/builds/270/step-test/0
  http://buildbot.spamassassin.org:8010/t-feisty-561/builds/268/step-test/0


- it's *very* intermittent.

I initially thought it was a vmware-related issue, since t-feisty-561 is running
under vmware, but jm-feisty is a non-emulated Pentium M laptop.

I've tried to repro it under my new Dual Core Centrino Pro system here in work,
but after leaving the spamd test suite running in a loop overnight, there
have been no test failures :(  it may be load-related, since that system is
much beefier than the others.
Comment 1 Justin Mason 2008-01-10 15:08:39 UTC
: jm 441...; ldd /usr/lib/perl/5.8.8/auto/DB_File/DB_File.so
        linux-gate.so.1 =>  (0xffffe000)
        libdb-4.4.so => /usr/lib/libdb-4.4.so (0xb7e0b000)
        libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0xb7cc1000)
        libpthread.so.0 => /lib/tls/i686/cmov/libpthread.so.0 (0xb7ca8000)
        /lib/ld-linux.so.2 (0x80000000)

so it may be libdb 4.4.

I've added a timeout to the AWL tie() call -- let's see if that helps any.
(I think we have reliably seen a hang about once every 3 days or so on
http://buildbot.spamassassin.org/ ....
Comment 2 Justin Mason 2008-01-11 02:23:59 UTC
for the record, the currently-installed libdb4.4 package on "t-feisty-561" is:

Package: libdb4.4
Status: install ok installed
Priority: standard
Section: libs
Installed-Size: 1068
Maintainer: Ubuntu Core Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Architecture: i386
Source: db4.4
Version: 4.4.20-8.1ubuntu3
Depends: libc6 (>= 2.6-1)
Conflicts: libdb2 (<< 2:2.7.7-3)
Description: Berkeley v4.4 Database Libraries [runtime]
 This is the runtime package for programs that use the v4.4 Berkeley
 database library.
Original-Maintainer: Debian Berkeley DB Maintainers
<pkg-db-devel@lists.alioth.debian.org>

I have JUST noticed a change coming through the apt updates channel today:

http://launchpadlibrarian.net/10283478/db4.4_4.4.20-8.1ubuntu4.dsc.diff

diff -u db4.4-4.4.20/debian/changelog db4.4-4.4.20/debian/changelog
--- db4.4-4.4.20/debian/changelog
+++ db4.4-4.4.20/debian/changelog
@@ -1,3 +1,10 @@
+db4.4 (4.4.20-8.1ubuntu3.1) gutsy-proposed; urgency=low
+
+  * Revert "enable NPTL across the board for linux" from 8.1ubuntu2,
+    which broke postgrey and subversion (LP: #153996)
+
+ -- dAniel hAhler <ubuntu@thequod.de>  Mon, 05 Nov 2007 23:52:47 +0100
+
 db4.4 (4.4.20-8.1ubuntu3) gutsy; urgency=low
 
   * Trigger rebuild for hppa

it'll be interesting to see if the bug persists after this is applied. This hang
stinks of NPTL, imo.
Comment 3 Justin Mason 2008-01-11 02:38:42 UTC
(In reply to comment #2)
> for the record, the currently-installed libdb4.4 package on "t-feisty-561" is:

oops, wrong machine.  it's actually

Package: libdb4.4
Status: install ok installed
Priority: standard
Section: libs
Installed-Size: 1064
Maintainer: Ubuntu Core Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Architecture: i386
Source: db4.4
Version: 4.4.20-8ubuntu2
Depends: libc6 (>= 2.5-0ubuntu1)
Conflicts: libdb2 (<< 2:2.7.7-3)
Description: Berkeley v4.4 Database Libraries [runtime]
 This is the runtime package for programs that use the v4.4 Berkeley
 database library.
Original-Maintainer: Debian Berkeley DB Maintainers
<pkg-db-devel@lists.alioth.debian.org>

so it's *not* the NPTL change, since that arrived in 4.4.20-8.1ubuntu2.
Comment 4 Justin Mason 2008-01-11 02:39:10 UTC
here's the changelog if anyone's interested:
http://changelogs.ubuntu.com/changelogs/pool/main/d/db4.4/db4.4_4.4.20-8.1ubuntu3/changelog
Comment 5 Justin Mason 2008-01-18 01:35:20 UTC
I added that timeout code to the Bayes tie() calls as well; it's
in the current rev, r613114, and has been for a few days.  However, even
with that, last night a buildbot test run failed with this bug again:


http://buildbot.spamassassin.org:8010/t-feisty-561/builds/749/step-test/0

[28227] dbg: rules: running uri tests; score so far=6.909
[28227] dbg: rules: compiled uri tests
[28227] dbg: rules: running rawbody tests; score so far=6.909
[28227] dbg: rules: compiled rawbody tests
[28227] dbg: rules: running full tests; score so far=6.909
[28227] dbg: rules: compiled full tests
[28227] dbg: rules: running meta tests; score so far=6.909
[28227] info: rules: meta test T_TVD_PH_BODY_META_ALL has dependency
'TVD_PH_BODY_ACCOUNTS_PRE' with a zero score
[28227] dbg: rules: meta test T_FAKE_REPLY_SURE_A has undefined dependency
'FAKE_REPLY_A'
[28227] dbg: rules: meta test T_FAKE_REPLY_SURE_A has undefined dependency
'FAKE_REPLY_A2'
[28227] dbg: rules: meta test T_FAKE_REPLY_SURE_B has undefined dependency
'FAKE_REPLY_A'
[28227] dbg: rules: meta test T_FAKE_REPLY_SURE_B has undefined dependency
'FAKE_REPLY_A2'
[28227] dbg: rules: meta test DIGEST_MULTIPLE has undefined dependency 'DCC_CHECK'
[28227] dbg: rules: compiled meta tests
[28227] dbg: check: running tests for priority: 1000
[28227] dbg: rules: running head tests; score so far=9.225
[28227] dbg: rules: compiled head tests
[28227] dbg: rules: running head_eval tests; score so far=9.225
[28227] dbg: locker: safe_lock: created
./log/user_state/auto-whitelist.lock.vmsa.28227
[28227] dbg: locker: safe_lock: trying to get lock on
./log/user_state/auto-whitelist with 0 retries
[28227] dbg: locker: safe_lock: link to ./log/user_state/auto-whitelist.lock:
link ok
[28227] dbg: auto-whitelist: tie-ing to DB file of type DB_File R/W in
./log/user_state/auto-whitelist
[28227] dbg: generic: starting tie timeout at Thu Jan 17 20:48:25 2008

Maybe you need to kill a running spamd process?
started at 1200631702, gave up at 1200631831

# Failed test 1 in t/spamc_optL.t at line 20
	Not found: learned spam = Message successfully un/learned

Now, the tie timeout is for 30 seconds.  The tie() started at Thu Jan 17
20:48:25 2008, and the test harness gave up at Fri Jan 18 20:50:31 2008.  
That's a diff of over 2 minutes. So it appears that that timeout code won't
work, since a simple SIGALRM doesn't interrupt the hang. :(


PS: interestingly I haven't seen a single instance of this that *wasn't*
in a spamd-based test.  updating summary.

Comment 6 Justin Mason 2008-01-22 06:42:32 UTC
I'm an idiot.  I just logged into vmsa, ran "ps auxww" and saw this:


buildbot   735  0.0  1.2  22976 18388 ?        S    Jan19   0:01
/usr/local/perl561/bin/perl -T -w ../spamd/spamd.raw -D -x -s stderr -C
log/test_rules_copy --siteconfigpath log/localrules.tmp -p 33400 -A 127.0.0.1 -L
-s log/d.spamc_x_E_R/spamd.err.12.timestamped
buildbot   856  0.0  0.6  35276  9460 ?        S    Jan13   0:02
/usr/local/perl561/bin/perl -T -w ../spamd/spamd.raw -D -x -s stderr -C
log/test_rules_copy --siteconfigpath log/localrules.tmp -p 33620 -A 127.0.0.1 -L
-s log/d.spamd_maxsize/spamd.err.1.timestamped
buildbot   861  0.0  0.0      0     0 ?        Z    Jan13   0:00 [perl] <defunct>
buildbot  1087  0.0  0.1  34020  2208 ?        S     2007   0:02
/usr/local/perl561/bin/perl -T -w ../spamd/spamd.raw -D -x -s stderr -C
log/test_rules_copy --siteconfigpath log/localrules.tmp -p 33851 -A 127.0.0.1 -L
--allow-tell -s log/d.spamc_optL/spamd.err.1.timestamped
buildbot  1092  0.0  0.0      0     0 ?        Z     2007   0:00 [perl] <defunct>
buildbot  1131  0.0  0.2  31840  4136 ?        S    Jan13   0:01
/usr/local/perl561/bin/perl -T -w ../spamd/spamd.raw -D -x -s stderr -C
log/test_rules_copy --siteconfigpath log/localrules.tmp -p 33895 -A 127.0.0.1 -L
-s log/d.spamd_maxsize/spamd.err.1.timestamped
buildbot  2290  0.0  1.4  31636 21420 ?        S    Jan15   0:01
/usr/local/perl561/bin/perl -T -w ../spamd/spamd.raw -D -x -s stderr -C
log/test_rules_copy --siteconfigpath log/localrules.tmp -p 35054 -A 127.0.0.1 -L
-s log/d.spamd_maxsize/spamd.err.1.timestamped
buildbot  3001  0.0  1.2  36792 18296 ?        S    Jan08   0:02
/usr/local/perl561/bin/perl -T -w ../spamd/spamd.raw -D -x -s stderr -C
log/test_rules_copy --siteconfigpath log/localrules.tmp -p 35765 -A 127.0.0.1 -L
-s log/d.spamd_maxsize/spamd.err.1.timestamped
buildbot  3006  0.0  0.0      0     0 ?        Z    Jan08   0:00 [perl] <defunct>
buildbot  3588  0.0  0.1  35844  2188 ?        S    Jan05   0:02
/usr/local/perl561/bin/perl -T -w ../spamd/spamd.raw -D -x -s stderr -C
log/test_rules_copy --siteconfigpath log/localrules.tmp -p 36352 -A 127.0.0.1 -L
-s log/d.spamd_maxsize/spamd.err.1.timestamped
[etc. plenty more]

These are, of course, the hung processes!  Unfortunately I'd killed them
all by the time I realized this. :(

next time it happens, I'll try to get a strace of the proc.
Comment 7 Justin Mason 2008-01-24 03:38:26 UTC
http://buildbot.spamassassin.org:8010/t-feisty-561/builds/794/step-test_3/0 --
today's failure:

{{{
t/spamc_x_E_R...................spamd start failed: log: [10108] dbg: logger:
adding facilities: all
[10108] dbg: logger: logging level is DBG
[10108] dbg: logger: successfully opened file
log/d.spamc_x_E_R/spamd.err.10.timestamped
[10108] dbg: logger: successfully added file method
[10108] dbg: spamd: will perform setuids? 0
[10108] dbg: spamd: creating INET socket:
....
[10108] dbg: bayes: tie-ing to DB file R/O ./log/user_state/bayes_toks
[10108] dbg: generic: starting tie timeout at Wed Jan 23 18:19:52 2008


Maybe you need to kill a running spamd process?
started at 1201141190, gave up at 1201141319
}}}

on vmsa with ps alxww | grep 10108:

0   104 10108     1  18   0  23280 18532 futex_ S    ?          0:01
/usr/local/perl561/bin/perl -T -w ../spamd/spamd.raw -D -x -s stderr -C
log/test_rules_copy --siteconfigpath log/localrules.tmp -p 42796 -A 127.0.0.1 -L
-s log/d.spamc_x_E_R/spamd.err.10.timestamped

sudo lsof -p 10108
COMMAND   PID     USER   FD   TYPE   DEVICE    SIZE     NODE NAME
perl    10108 buildbot  cwd    DIR      8,1       0  4704398
/home/buildbot/slaves/t-feisty-561/t-feisty-561/build/Mail-SpamAssassin-3.3.0/t
(deleted)
perl    10108 buildbot  rtd    DIR      8,1    4096        2 /
perl    10108 buildbot  txt    REG      8,1  851837  4490291
/usr/local/perl561/bin/perl
perl    10108 buildbot  mem    REG      0,0                0 [heap] (stat: No
such file or directory)
perl    10108 buildbot  mem    REG      8,1  109268  2621455 /lib/ld-2.5.so
perl    10108 buildbot  mem    REG      8,1     391  4474898
/usr/lib/locale/en_US.utf8/LC_IDENTIFICATION
perl    10108 buildbot  mem    REG      8,1      23  4474897
/usr/lib/locale/en_US.utf8/LC_MEASUREMENT
perl    10108 buildbot  mem    REG      8,1      59  4474896
/usr/lib/locale/en_US.utf8/LC_TELEPHONE
perl    10108 buildbot  mem    REG      8,1     155  4474895
/usr/lib/locale/en_US.utf8/LC_ADDRESS
perl    10108 buildbot  mem    REG      8,1   79596  2621599
/lib/tls/i686/cmov/libnsl-2.5.so
perl    10108 buildbot  mem    REG      8,1    9684  2621596
/lib/tls/i686/cmov/libdl-2.5.so
perl    10108 buildbot  mem    REG      8,1  153424  2621597
/lib/tls/i686/cmov/libm-2.5.so
perl    10108 buildbot  mem    REG      8,1 1307104  2621592
/lib/tls/i686/cmov/libc-2.5.so
perl    10108 buildbot  mem    REG      8,1   21908  2621594
/lib/tls/i686/cmov/libcrypt-2.5.so
perl    10108 buildbot  mem    REG      8,1    9696  2621623
/lib/tls/i686/cmov/libutil-2.5.so
perl    10108 buildbot  mem    REG      8,1   25460  4442126
/usr/lib/gconv/gconv-modules.cache
perl    10108 buildbot  mem    REG      8,1      77  4474894
/usr/lib/locale/en_US.utf8/LC_NAME
perl    10108 buildbot  mem    REG      8,1      34  4474893
/usr/lib/locale/en_US.utf8/LC_PAPER
perl    10108 buildbot  mem    REG      8,1      52  4474892
/usr/lib/locale/en_US.utf8/LC_MESSAGES/SYS_LC_MESSAGES
perl    10108 buildbot  mem    REG      8,1     286  4474890
/usr/lib/locale/en_US.utf8/LC_MONETARY
perl    10108 buildbot  mem    REG      8,1  880094  4474889
/usr/lib/locale/en_US.utf8/LC_COLLATE
perl    10108 buildbot  mem    REG      8,1    2451  4474888
/usr/lib/locale/en_US.utf8/LC_TIME
perl    10108 buildbot  mem    REG      8,1      54  4474887
/usr/lib/locale/en_US.utf8/LC_NUMERIC
perl    10108 buildbot  mem    REG      8,1  238336  4474886
/usr/lib/locale/en_US.utf8/LC_CTYPE
perl    10108 buildbot  mem    REG      8,1   18814  4689812
/usr/local/perl561/lib/5.6.2/i686-linux/auto/Socket/Socket.so
perl    10108 buildbot  mem    REG      8,1   18829  4689782
/usr/local/perl561/lib/5.6.2/i686-linux/auto/IO/IO.so
perl    10108 buildbot  mem    REG      8,1    8196  4689829
/usr/local/perl561/lib/5.6.2/i686-linux/auto/Sys/Hostname/Hostname.so
perl    10108 buildbot  mem    REG      8,1   12442  4689818
/usr/local/perl561/lib/5.6.2/i686-linux/auto/Fcntl/Fcntl.so
perl    10108 buildbot  mem    REG      8,1   96595  4689847
/usr/local/perl561/lib/5.6.2/i686-linux/auto/POSIX/POSIX.so
perl    10108 buildbot  mem    REG      8,1   13485  4688800
/usr/local/perl561/lib/site_perl/5.6.1/i686-linux/auto/MIME/Base64/Base64.so
perl    10108 buildbot  mem    REG      8,1   17766  4690004
/usr/local/perl561/lib/5.6.2/i686-linux/auto/File/Glob/Glob.so
perl    10108 buildbot  mem    REG      8,1   41963  4688780
/usr/local/perl561/lib/site_perl/5.6.1/i686-linux/auto/HTML/Parser/Parser.so
perl    10108 buildbot  mem    REG      8,1    9644  4688190
/usr/local/perl561/lib/site_perl/5.6.2/i686-linux/auto/Net/DNS/DNS.so
perl    10108 buildbot  mem    REG      8,1   38416  2621602
/lib/tls/i686/cmov/libnss_files-2.5.so
perl    10108 buildbot  mem    REG      8,1   30436  2621600
/lib/tls/i686/cmov/libnss_compat-2.5.so
perl    10108 buildbot  mem    REG      8,1   34352  2621605
/lib/tls/i686/cmov/libnss_nis-2.5.so
perl    10108 buildbot  mem    REG      8,1   19858  4688825
/usr/local/perl561/lib/site_perl/5.6.1/i686-linux/auto/Digest/SHA1/SHA1.so
perl    10108 buildbot  mem    REG      8,1   47375  4688185
/usr/local/perl561/lib/5.6.2/i686-linux/auto/DB_File/DB_File.so
perl    10108 buildbot  mem    REG      8,1 1038208  4442593 /usr/lib/libdb-4.4.so
perl    10108 buildbot    0r   CHR      1,3             7521 /dev/null
perl    10108 buildbot    1w   REG      8,1       0  4718666
/home/buildbot/slaves/t-feisty-561/t-feisty-561/build/Mail-SpamAssassin-3.3.0/t/log/d.spamc_x_E_R/spamd.out.10
(deleted)
perl    10108 buildbot    2w   REG      8,1   17273  4718667
/home/buildbot/slaves/t-feisty-561/t-feisty-561/build/Mail-SpamAssassin-3.3.0/t/log/d.spamc_x_E_R/spamd.err.10
(deleted)
perl    10108 buildbot    3r   REG      8,1  102556  4492020
/home/buildbot/slaves/t-feisty-561/t-feisty-561/build/Mail-SpamAssassin-3.3.0/spamd/spamd.raw
(deleted)
perl    10108 buildbot    4w   REG      8,1   21273  4718668
/home/buildbot/slaves/t-feisty-561/t-feisty-561/build/Mail-SpamAssassin-3.3.0/t/log/d.spamc_x_E_R/spamd.err.10.timestamped
(deleted)
perl    10108 buildbot    5u  IPv4 64500864              TCP localhost:42796
(LISTEN)
perl    10108 buildbot    6r  FIFO      0,6         64500869 pipe

sudo strace -fp 10108
Process 10108 attached - interrupt to quit
futex(0x401a1120, FUTEX_WAIT, 2, NULL
<CTRL_C>

sudo gdb -p 10108
(gdb) info threads
(gdb) thread
[Current thread is 0 (process 10108)]
(gdb) bt full
#0  0xfbffe410 in __kernel_vsyscall ()
No symbol table info available.
#1  0x4013f05e in ?? () from /lib/tls/i686/cmov/libc.so.6
No symbol table info available.
#2  0x400ce8fe in ?? () from /lib/tls/i686/cmov/libc.so.6
No symbol table info available.
Backtrace stopped: previous frame identical to this frame (corrupt stack?)


So all that is not really much use :(  It's hung blocked on a futex(), and the
backtrace is useless.

Comment 8 Justin Mason 2008-01-31 04:10:38 UTC
: jm 258...; svn commit -m "bug 5731: debug instrumentation to capture strace
logs for the DB_File failures on vmsa, based on the presence (hack hack) of a
file in /home/jm/capture_spamd_straces" t/SATest.pm
Sending        t/SATest.pm
Transmitting file data .
Committed revision 617113.

let's see if this captures anything useful....
Comment 9 Justin Mason 2008-02-04 07:50:46 UTC
(In reply to comment #8)
> let's see if this captures anything useful....

wouldn't you know it -- the bug doesn't seem to be manifesting at all when spamd
is being straced. typical :(
Comment 10 Justin Mason 2008-02-04 14:56:24 UTC
got one!  I think...

http://buildbot.spamassassin.org:8010/t-feisty-561/builds/880/step-test/0
contains this:

t/spamd_kill_restart............# Failed test 41 in t/spamd_kill_restart.t at
line 43 fail #5
#  t/spamd_kill_restart.t line 43 is:   ok (!-e $pid_file);
spamd start failed: log: [15535] dbg: logger: adding facilities: all
[15535] dbg: logger: logging level is DBG
[15535] dbg: logger: successfully opened file
log/d.spamd_kill_restart/spamd.err.42.timestamped
[15535] dbg: logger: successfully added file method
[15535] dbg: spamd: will perform setuids? 0
[15535] dbg: spamd: creating INET socket:
[15535] dbg: spamd: Listen: 128
[15535] dbg: spamd: LocalAddr: 127.0.0.1
[15535] dbg: spamd: LocalPort: 48236
[15535] dbg: spamd: Proto: 6
[15535] dbg: spamd: ReuseAddr: 1
[15535] dbg: spamd: Type: 1
[15535] warn: server socket setup failed, retry 1: spamd: could not create INET
socket on 127.0.0.1:48236: Address already in use

http://taint.org/x/2008/bug5731.d.spamd_kill_restart.tgz contains the full
logs from the spamd_kill_restart.t test case, with strace logs.

Here's an examination of the key log, spamd.strace.33.

        First off, this is the last part of signal-handling
        before the error.  note that SIGINT is unblocked:

15514 1202157063.466831 rt_sigprocmask(SIG_UNBLOCK, [INT CHLD], NULL, 8) = 0
15514 1202157063.468826 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
15524 1202157063.493003 time( <unfinished ...>
15514 1202157063.493065 rt_sigaction(SIGALRM, NULL, {SIG_DFL}, 8) = 0
15514 1202157063.493222 rt_sigaction(SIGALRM, {SIG_DFL}, {SIG_DFL}, 8) = 0
15514 1202157063.493367 rt_sigaction(SIGALRM, {0x8099360, [], SA_RESTART}, {SIG_DFL}
, 8) = 0
15514 1202157063.493749 alarm(64)       = 0

        various unimportant spamd stuff omitted here.

15514 1202157063.499815 read(7, "I\0\0<\244\n", 6) = 6
15514 1202157063.500038 time(NULL)      = 1202157063
15514 1202157063.500127 write(4, "Mon Feb  4 12:31:03 2008 [15514]"..., 77) = 77
15514 1202157063.500284 write(2, "[15514] dbg: prefork: child 1552"..., 52) = 52
15514 1202157063.503575 time(NULL)      = 1202157063
15514 1202157064.696936 write(4, "Mon Feb  4 12:31:03 2008 [15514]"..., 74) = 74
15514 1202157064.697259 write(2, "[15514] dbg: prefork: new lowest"..., 49) = 49
15514 1202157064.697604 time(NULL)      = 1202157064
15514 1202157064.697699 write(4, "Mon Feb  4 12:31:04 2008 [15514]"..., 66) = 66
15514 1202157064.697891 write(2, "[15514] dbg: prefork: child repo"..., 41) = 41

        at this point, the test script kills the
        spamd using SIGINT:

15514 1202157064.698111 --- SIGINT (Interrupt) @ 0 (0) ---

        we should see perl calling into its $SIG{INT} handler.
        instead we see this:

15514 1202157064.698255 mmap2(NULL, 2097152, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MA
P_NORESERVE, -1, 0) = 0x408e8000
15514 1202157064.698376 munmap(0x408e8000, 98304) = 0
15514 1202157064.698466 munmap(0x40a00000, 950272) = 0
15514 1202157064.698596 mprotect(0x40900000, 135168, PROT_READ|PROT_WRITE) = 0
15514 1202157064.698743 futex(0x401a1120, FUTEX_WAIT, 2, NULL <unfinished ...>

        this futex() call sleeps.  Nothing happens until eventually
        a child's select() call hits its timeout:

15524 1202157363.464875 <... select resumed> ) = 0 (Timeout)

        the kids exit due to their timeouts:

15523 1202157364.652244 exit_group(0)   = ?

        and the parent proc is interrupted for the SIGCHLD:

15514 1202157364.675992 <... futex resumed> ) = -1 EINTR (Interrupted system call)
15514 1202157364.676062 --- SIGCHLD (Child exited) @ 0 (0) ---

        but then goes back into the futex():

15514 1202157364.676664 futex(0x401a1120, FUTEX_WAIT, 2, NULL <unfinished ...>

        and never exits.


Comparing the time when the SIGINT arrives to other test cases that worked ok
(e.g. spamd.strace.1), I see no difference; the SIGINT handler is unblocked,
and it should be able to handle the signal.

This looks like a perl signal-handling bug.

for reference, here's what should have happened at the SIGINT:

15473 1202157028.628858 --- SIGINT (Interrupt) @ 0 (0) ---
15473 1202157028.629083 time(NULL)      = 1202157028
15473 1202157028.629191 write(4, "Mon Feb  4 12:30:28 2008 [15473]"..., 85) = 85
15473 1202157028.629362 write(2, "[15473] info: spamd: server kill"..., 60) = 60
15473 1202157028.629710 close(5)        = 0
15473 1202157028.629789 munmap(0x402ff000, 4096) = 0
15473 1202157028.629875 close(5)        = -1 EBADF (Bad file descriptor)
15473 1202157028.629947 munmap(0x402fe000, 4096) = 0
15473 1202157028.630033
unlink("/home/buildbot/slaves/t-feisty-561/t-feisty-561/build/t/log/spamd.pid") = 0

etc.
futex() isn't called in the file *at all*.  It ONLY appears in
the failing strace....

At this point I think I should switch to testing a later perl
build; fiddling with perl 5.6.1 signal-handling is a bit of a waste
of time, and I noted that the Ubuntu issue we're trying to track down
should occur with 5.8.x too.  Ideally it should be repro'able with
the gutsy package of perl.

Also worth trying: tracing with ltrace instead of strace, to get
a better idea of wtf that futex() call is up to.

Comment 11 Justin Mason 2008-02-06 07:29:54 UTC
(In reply to comment #10)
> Also worth trying: tracing with ltrace instead of strace, to get
> a better idea of wtf that futex() call is up to.

unfortunately ltrace is far too heavyweight; this doesn't work.  reverted as of
r619021.
Comment 12 Justin Mason 2008-02-11 13:21:23 UTC
(In reply to comment #10)
> At this point I think I should switch to testing a later perl
> build; fiddling with perl 5.6.1 signal-handling is a bit of a waste
> of time, and I noted that the Ubuntu issue we're trying to track down
> should occur with 5.8.x too.  Ideally it should be repro'able with
> the gutsy package of perl.

so builds 905 (Thu 07 Feb 2008 13:23:41) to 941 (Mon 11 Feb 2008 21:06:03) have
all passed without a failure.
http://buildbot.spamassassin.org/t-feisty-561/builds/905
http://buildbot.spamassassin.org/t-feisty-561/builds/941

this looks like it was basically a bug in perl 5.6.1, which has been fixed in
5.8.x... however, I noted when opening this that the bug occurred on the
"jm-feisty" buildbot, which was running with /usr/bin/perl on a laptop here.

Comment 13 Justin Mason 2008-02-11 13:23:00 UTC
i'll try turning off stracing again.
Comment 14 Justin Mason 2008-02-21 04:17:53 UTC
(In reply to comment #13)
> i'll try turning off stracing again.

ok, since then, there's been no failures.  This appears to be fixed (probably in
the Ubuntu perl 5.8 or libc packages, I think).  yay!