SA Bugzilla – Bug 5731
DB_File locking hangs on Ubuntu feisty / gutsy Linux systems in spamd
Last modified: 2008-02-21 04:17:53 UTC
(opening a bug for this very annoying issue.) Recently, the buildbot slaves running on Ubuntu platforms (t-feisty-561 and jm-feisty) have been failing the 'make test' run, roughly once every couple of days. All these buildbot failures recently share these things in common: - they're happening on Ubuntu Feisty or Gutsy systems (jm-feisty is running Gutsy nowadays, despite the name) - they are spamd tests - they involve a spamd not giving its "server pid: \d+" line, which indicates that it's ready to receive connections, in time (within something like 3 minutes) - upon examination, the spamd log looks like this: [6910] dbg: rules: __MO_OL_25340 merged duplicates: __MO_OL_4EEDB __MO_OL_7533E [6910] dbg: conf: finish parsing [6910] dbg: plugin: Mail::SpamAssassin::Plugin::ReplaceTags=HASH(0x89db6e4) implements 'finish_parsing_end', priority 0 [6910] dbg: replacetags: replacing tags [6910] dbg: replacetags: done replacing tags [6910] dbg: bayes: tie-ing to DB file R/O ./log/user_state/bayes_toks [6910] dbg: bayes: tie-ing to DB file R/O ./log/user_state/bayes_seen [eof] or [13512] dbg: rules: running head tests; score so far=9.205 [13512] dbg: rules: compiled head tests [13512] dbg: rules: running head_eval tests; score so far=9.205 [13512] dbg: locker: safe_lock: created ./log/user_state/auto-whitelist.lock.vmsa.13512 [13512] dbg: locker: safe_lock: trying to get lock on ./log/user_state/auto-whitelist with 0 retries [13512] dbg: locker: safe_lock: link to ./log/user_state/auto-whitelist.lock: link ok [13512] dbg: auto-whitelist: tie-ing to DB file of type DB_File R/W in ./log/user_state/auto-whitelist in other words, it seems to be hanging during the compile_now() operation, while trying to access a Bayes DB, or an AWL DB. two examples: http://buildbot.spamassassin.org:8010/t-feisty-561/builds/270/step-test/0 http://buildbot.spamassassin.org:8010/t-feisty-561/builds/268/step-test/0 - it's *very* intermittent. I initially thought it was a vmware-related issue, since t-feisty-561 is running under vmware, but jm-feisty is a non-emulated Pentium M laptop. I've tried to repro it under my new Dual Core Centrino Pro system here in work, but after leaving the spamd test suite running in a loop overnight, there have been no test failures :( it may be load-related, since that system is much beefier than the others.
: jm 441...; ldd /usr/lib/perl/5.8.8/auto/DB_File/DB_File.so linux-gate.so.1 => (0xffffe000) libdb-4.4.so => /usr/lib/libdb-4.4.so (0xb7e0b000) libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0xb7cc1000) libpthread.so.0 => /lib/tls/i686/cmov/libpthread.so.0 (0xb7ca8000) /lib/ld-linux.so.2 (0x80000000) so it may be libdb 4.4. I've added a timeout to the AWL tie() call -- let's see if that helps any. (I think we have reliably seen a hang about once every 3 days or so on http://buildbot.spamassassin.org/ ....
for the record, the currently-installed libdb4.4 package on "t-feisty-561" is: Package: libdb4.4 Status: install ok installed Priority: standard Section: libs Installed-Size: 1068 Maintainer: Ubuntu Core Developers <ubuntu-devel-discuss@lists.ubuntu.com> Architecture: i386 Source: db4.4 Version: 4.4.20-8.1ubuntu3 Depends: libc6 (>= 2.6-1) Conflicts: libdb2 (<< 2:2.7.7-3) Description: Berkeley v4.4 Database Libraries [runtime] This is the runtime package for programs that use the v4.4 Berkeley database library. Original-Maintainer: Debian Berkeley DB Maintainers <pkg-db-devel@lists.alioth.debian.org> I have JUST noticed a change coming through the apt updates channel today: http://launchpadlibrarian.net/10283478/db4.4_4.4.20-8.1ubuntu4.dsc.diff diff -u db4.4-4.4.20/debian/changelog db4.4-4.4.20/debian/changelog --- db4.4-4.4.20/debian/changelog +++ db4.4-4.4.20/debian/changelog @@ -1,3 +1,10 @@ +db4.4 (4.4.20-8.1ubuntu3.1) gutsy-proposed; urgency=low + + * Revert "enable NPTL across the board for linux" from 8.1ubuntu2, + which broke postgrey and subversion (LP: #153996) + + -- dAniel hAhler <ubuntu@thequod.de> Mon, 05 Nov 2007 23:52:47 +0100 + db4.4 (4.4.20-8.1ubuntu3) gutsy; urgency=low * Trigger rebuild for hppa it'll be interesting to see if the bug persists after this is applied. This hang stinks of NPTL, imo.
(In reply to comment #2) > for the record, the currently-installed libdb4.4 package on "t-feisty-561" is: oops, wrong machine. it's actually Package: libdb4.4 Status: install ok installed Priority: standard Section: libs Installed-Size: 1064 Maintainer: Ubuntu Core Developers <ubuntu-devel-discuss@lists.ubuntu.com> Architecture: i386 Source: db4.4 Version: 4.4.20-8ubuntu2 Depends: libc6 (>= 2.5-0ubuntu1) Conflicts: libdb2 (<< 2:2.7.7-3) Description: Berkeley v4.4 Database Libraries [runtime] This is the runtime package for programs that use the v4.4 Berkeley database library. Original-Maintainer: Debian Berkeley DB Maintainers <pkg-db-devel@lists.alioth.debian.org> so it's *not* the NPTL change, since that arrived in 4.4.20-8.1ubuntu2.
here's the changelog if anyone's interested: http://changelogs.ubuntu.com/changelogs/pool/main/d/db4.4/db4.4_4.4.20-8.1ubuntu3/changelog
I added that timeout code to the Bayes tie() calls as well; it's in the current rev, r613114, and has been for a few days. However, even with that, last night a buildbot test run failed with this bug again: http://buildbot.spamassassin.org:8010/t-feisty-561/builds/749/step-test/0 [28227] dbg: rules: running uri tests; score so far=6.909 [28227] dbg: rules: compiled uri tests [28227] dbg: rules: running rawbody tests; score so far=6.909 [28227] dbg: rules: compiled rawbody tests [28227] dbg: rules: running full tests; score so far=6.909 [28227] dbg: rules: compiled full tests [28227] dbg: rules: running meta tests; score so far=6.909 [28227] info: rules: meta test T_TVD_PH_BODY_META_ALL has dependency 'TVD_PH_BODY_ACCOUNTS_PRE' with a zero score [28227] dbg: rules: meta test T_FAKE_REPLY_SURE_A has undefined dependency 'FAKE_REPLY_A' [28227] dbg: rules: meta test T_FAKE_REPLY_SURE_A has undefined dependency 'FAKE_REPLY_A2' [28227] dbg: rules: meta test T_FAKE_REPLY_SURE_B has undefined dependency 'FAKE_REPLY_A' [28227] dbg: rules: meta test T_FAKE_REPLY_SURE_B has undefined dependency 'FAKE_REPLY_A2' [28227] dbg: rules: meta test DIGEST_MULTIPLE has undefined dependency 'DCC_CHECK' [28227] dbg: rules: compiled meta tests [28227] dbg: check: running tests for priority: 1000 [28227] dbg: rules: running head tests; score so far=9.225 [28227] dbg: rules: compiled head tests [28227] dbg: rules: running head_eval tests; score so far=9.225 [28227] dbg: locker: safe_lock: created ./log/user_state/auto-whitelist.lock.vmsa.28227 [28227] dbg: locker: safe_lock: trying to get lock on ./log/user_state/auto-whitelist with 0 retries [28227] dbg: locker: safe_lock: link to ./log/user_state/auto-whitelist.lock: link ok [28227] dbg: auto-whitelist: tie-ing to DB file of type DB_File R/W in ./log/user_state/auto-whitelist [28227] dbg: generic: starting tie timeout at Thu Jan 17 20:48:25 2008 Maybe you need to kill a running spamd process? started at 1200631702, gave up at 1200631831 # Failed test 1 in t/spamc_optL.t at line 20 Not found: learned spam = Message successfully un/learned Now, the tie timeout is for 30 seconds. The tie() started at Thu Jan 17 20:48:25 2008, and the test harness gave up at Fri Jan 18 20:50:31 2008. That's a diff of over 2 minutes. So it appears that that timeout code won't work, since a simple SIGALRM doesn't interrupt the hang. :( PS: interestingly I haven't seen a single instance of this that *wasn't* in a spamd-based test. updating summary.
I'm an idiot. I just logged into vmsa, ran "ps auxww" and saw this: buildbot 735 0.0 1.2 22976 18388 ? S Jan19 0:01 /usr/local/perl561/bin/perl -T -w ../spamd/spamd.raw -D -x -s stderr -C log/test_rules_copy --siteconfigpath log/localrules.tmp -p 33400 -A 127.0.0.1 -L -s log/d.spamc_x_E_R/spamd.err.12.timestamped buildbot 856 0.0 0.6 35276 9460 ? S Jan13 0:02 /usr/local/perl561/bin/perl -T -w ../spamd/spamd.raw -D -x -s stderr -C log/test_rules_copy --siteconfigpath log/localrules.tmp -p 33620 -A 127.0.0.1 -L -s log/d.spamd_maxsize/spamd.err.1.timestamped buildbot 861 0.0 0.0 0 0 ? Z Jan13 0:00 [perl] <defunct> buildbot 1087 0.0 0.1 34020 2208 ? S 2007 0:02 /usr/local/perl561/bin/perl -T -w ../spamd/spamd.raw -D -x -s stderr -C log/test_rules_copy --siteconfigpath log/localrules.tmp -p 33851 -A 127.0.0.1 -L --allow-tell -s log/d.spamc_optL/spamd.err.1.timestamped buildbot 1092 0.0 0.0 0 0 ? Z 2007 0:00 [perl] <defunct> buildbot 1131 0.0 0.2 31840 4136 ? S Jan13 0:01 /usr/local/perl561/bin/perl -T -w ../spamd/spamd.raw -D -x -s stderr -C log/test_rules_copy --siteconfigpath log/localrules.tmp -p 33895 -A 127.0.0.1 -L -s log/d.spamd_maxsize/spamd.err.1.timestamped buildbot 2290 0.0 1.4 31636 21420 ? S Jan15 0:01 /usr/local/perl561/bin/perl -T -w ../spamd/spamd.raw -D -x -s stderr -C log/test_rules_copy --siteconfigpath log/localrules.tmp -p 35054 -A 127.0.0.1 -L -s log/d.spamd_maxsize/spamd.err.1.timestamped buildbot 3001 0.0 1.2 36792 18296 ? S Jan08 0:02 /usr/local/perl561/bin/perl -T -w ../spamd/spamd.raw -D -x -s stderr -C log/test_rules_copy --siteconfigpath log/localrules.tmp -p 35765 -A 127.0.0.1 -L -s log/d.spamd_maxsize/spamd.err.1.timestamped buildbot 3006 0.0 0.0 0 0 ? Z Jan08 0:00 [perl] <defunct> buildbot 3588 0.0 0.1 35844 2188 ? S Jan05 0:02 /usr/local/perl561/bin/perl -T -w ../spamd/spamd.raw -D -x -s stderr -C log/test_rules_copy --siteconfigpath log/localrules.tmp -p 36352 -A 127.0.0.1 -L -s log/d.spamd_maxsize/spamd.err.1.timestamped [etc. plenty more] These are, of course, the hung processes! Unfortunately I'd killed them all by the time I realized this. :( next time it happens, I'll try to get a strace of the proc.
http://buildbot.spamassassin.org:8010/t-feisty-561/builds/794/step-test_3/0 -- today's failure: {{{ t/spamc_x_E_R...................spamd start failed: log: [10108] dbg: logger: adding facilities: all [10108] dbg: logger: logging level is DBG [10108] dbg: logger: successfully opened file log/d.spamc_x_E_R/spamd.err.10.timestamped [10108] dbg: logger: successfully added file method [10108] dbg: spamd: will perform setuids? 0 [10108] dbg: spamd: creating INET socket: .... [10108] dbg: bayes: tie-ing to DB file R/O ./log/user_state/bayes_toks [10108] dbg: generic: starting tie timeout at Wed Jan 23 18:19:52 2008 Maybe you need to kill a running spamd process? started at 1201141190, gave up at 1201141319 }}} on vmsa with ps alxww | grep 10108: 0 104 10108 1 18 0 23280 18532 futex_ S ? 0:01 /usr/local/perl561/bin/perl -T -w ../spamd/spamd.raw -D -x -s stderr -C log/test_rules_copy --siteconfigpath log/localrules.tmp -p 42796 -A 127.0.0.1 -L -s log/d.spamc_x_E_R/spamd.err.10.timestamped sudo lsof -p 10108 COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME perl 10108 buildbot cwd DIR 8,1 0 4704398 /home/buildbot/slaves/t-feisty-561/t-feisty-561/build/Mail-SpamAssassin-3.3.0/t (deleted) perl 10108 buildbot rtd DIR 8,1 4096 2 / perl 10108 buildbot txt REG 8,1 851837 4490291 /usr/local/perl561/bin/perl perl 10108 buildbot mem REG 0,0 0 [heap] (stat: No such file or directory) perl 10108 buildbot mem REG 8,1 109268 2621455 /lib/ld-2.5.so perl 10108 buildbot mem REG 8,1 391 4474898 /usr/lib/locale/en_US.utf8/LC_IDENTIFICATION perl 10108 buildbot mem REG 8,1 23 4474897 /usr/lib/locale/en_US.utf8/LC_MEASUREMENT perl 10108 buildbot mem REG 8,1 59 4474896 /usr/lib/locale/en_US.utf8/LC_TELEPHONE perl 10108 buildbot mem REG 8,1 155 4474895 /usr/lib/locale/en_US.utf8/LC_ADDRESS perl 10108 buildbot mem REG 8,1 79596 2621599 /lib/tls/i686/cmov/libnsl-2.5.so perl 10108 buildbot mem REG 8,1 9684 2621596 /lib/tls/i686/cmov/libdl-2.5.so perl 10108 buildbot mem REG 8,1 153424 2621597 /lib/tls/i686/cmov/libm-2.5.so perl 10108 buildbot mem REG 8,1 1307104 2621592 /lib/tls/i686/cmov/libc-2.5.so perl 10108 buildbot mem REG 8,1 21908 2621594 /lib/tls/i686/cmov/libcrypt-2.5.so perl 10108 buildbot mem REG 8,1 9696 2621623 /lib/tls/i686/cmov/libutil-2.5.so perl 10108 buildbot mem REG 8,1 25460 4442126 /usr/lib/gconv/gconv-modules.cache perl 10108 buildbot mem REG 8,1 77 4474894 /usr/lib/locale/en_US.utf8/LC_NAME perl 10108 buildbot mem REG 8,1 34 4474893 /usr/lib/locale/en_US.utf8/LC_PAPER perl 10108 buildbot mem REG 8,1 52 4474892 /usr/lib/locale/en_US.utf8/LC_MESSAGES/SYS_LC_MESSAGES perl 10108 buildbot mem REG 8,1 286 4474890 /usr/lib/locale/en_US.utf8/LC_MONETARY perl 10108 buildbot mem REG 8,1 880094 4474889 /usr/lib/locale/en_US.utf8/LC_COLLATE perl 10108 buildbot mem REG 8,1 2451 4474888 /usr/lib/locale/en_US.utf8/LC_TIME perl 10108 buildbot mem REG 8,1 54 4474887 /usr/lib/locale/en_US.utf8/LC_NUMERIC perl 10108 buildbot mem REG 8,1 238336 4474886 /usr/lib/locale/en_US.utf8/LC_CTYPE perl 10108 buildbot mem REG 8,1 18814 4689812 /usr/local/perl561/lib/5.6.2/i686-linux/auto/Socket/Socket.so perl 10108 buildbot mem REG 8,1 18829 4689782 /usr/local/perl561/lib/5.6.2/i686-linux/auto/IO/IO.so perl 10108 buildbot mem REG 8,1 8196 4689829 /usr/local/perl561/lib/5.6.2/i686-linux/auto/Sys/Hostname/Hostname.so perl 10108 buildbot mem REG 8,1 12442 4689818 /usr/local/perl561/lib/5.6.2/i686-linux/auto/Fcntl/Fcntl.so perl 10108 buildbot mem REG 8,1 96595 4689847 /usr/local/perl561/lib/5.6.2/i686-linux/auto/POSIX/POSIX.so perl 10108 buildbot mem REG 8,1 13485 4688800 /usr/local/perl561/lib/site_perl/5.6.1/i686-linux/auto/MIME/Base64/Base64.so perl 10108 buildbot mem REG 8,1 17766 4690004 /usr/local/perl561/lib/5.6.2/i686-linux/auto/File/Glob/Glob.so perl 10108 buildbot mem REG 8,1 41963 4688780 /usr/local/perl561/lib/site_perl/5.6.1/i686-linux/auto/HTML/Parser/Parser.so perl 10108 buildbot mem REG 8,1 9644 4688190 /usr/local/perl561/lib/site_perl/5.6.2/i686-linux/auto/Net/DNS/DNS.so perl 10108 buildbot mem REG 8,1 38416 2621602 /lib/tls/i686/cmov/libnss_files-2.5.so perl 10108 buildbot mem REG 8,1 30436 2621600 /lib/tls/i686/cmov/libnss_compat-2.5.so perl 10108 buildbot mem REG 8,1 34352 2621605 /lib/tls/i686/cmov/libnss_nis-2.5.so perl 10108 buildbot mem REG 8,1 19858 4688825 /usr/local/perl561/lib/site_perl/5.6.1/i686-linux/auto/Digest/SHA1/SHA1.so perl 10108 buildbot mem REG 8,1 47375 4688185 /usr/local/perl561/lib/5.6.2/i686-linux/auto/DB_File/DB_File.so perl 10108 buildbot mem REG 8,1 1038208 4442593 /usr/lib/libdb-4.4.so perl 10108 buildbot 0r CHR 1,3 7521 /dev/null perl 10108 buildbot 1w REG 8,1 0 4718666 /home/buildbot/slaves/t-feisty-561/t-feisty-561/build/Mail-SpamAssassin-3.3.0/t/log/d.spamc_x_E_R/spamd.out.10 (deleted) perl 10108 buildbot 2w REG 8,1 17273 4718667 /home/buildbot/slaves/t-feisty-561/t-feisty-561/build/Mail-SpamAssassin-3.3.0/t/log/d.spamc_x_E_R/spamd.err.10 (deleted) perl 10108 buildbot 3r REG 8,1 102556 4492020 /home/buildbot/slaves/t-feisty-561/t-feisty-561/build/Mail-SpamAssassin-3.3.0/spamd/spamd.raw (deleted) perl 10108 buildbot 4w REG 8,1 21273 4718668 /home/buildbot/slaves/t-feisty-561/t-feisty-561/build/Mail-SpamAssassin-3.3.0/t/log/d.spamc_x_E_R/spamd.err.10.timestamped (deleted) perl 10108 buildbot 5u IPv4 64500864 TCP localhost:42796 (LISTEN) perl 10108 buildbot 6r FIFO 0,6 64500869 pipe sudo strace -fp 10108 Process 10108 attached - interrupt to quit futex(0x401a1120, FUTEX_WAIT, 2, NULL <CTRL_C> sudo gdb -p 10108 (gdb) info threads (gdb) thread [Current thread is 0 (process 10108)] (gdb) bt full #0 0xfbffe410 in __kernel_vsyscall () No symbol table info available. #1 0x4013f05e in ?? () from /lib/tls/i686/cmov/libc.so.6 No symbol table info available. #2 0x400ce8fe in ?? () from /lib/tls/i686/cmov/libc.so.6 No symbol table info available. Backtrace stopped: previous frame identical to this frame (corrupt stack?) So all that is not really much use :( It's hung blocked on a futex(), and the backtrace is useless.
: jm 258...; svn commit -m "bug 5731: debug instrumentation to capture strace logs for the DB_File failures on vmsa, based on the presence (hack hack) of a file in /home/jm/capture_spamd_straces" t/SATest.pm Sending t/SATest.pm Transmitting file data . Committed revision 617113. let's see if this captures anything useful....
(In reply to comment #8) > let's see if this captures anything useful.... wouldn't you know it -- the bug doesn't seem to be manifesting at all when spamd is being straced. typical :(
got one! I think... http://buildbot.spamassassin.org:8010/t-feisty-561/builds/880/step-test/0 contains this: t/spamd_kill_restart............# Failed test 41 in t/spamd_kill_restart.t at line 43 fail #5 # t/spamd_kill_restart.t line 43 is: ok (!-e $pid_file); spamd start failed: log: [15535] dbg: logger: adding facilities: all [15535] dbg: logger: logging level is DBG [15535] dbg: logger: successfully opened file log/d.spamd_kill_restart/spamd.err.42.timestamped [15535] dbg: logger: successfully added file method [15535] dbg: spamd: will perform setuids? 0 [15535] dbg: spamd: creating INET socket: [15535] dbg: spamd: Listen: 128 [15535] dbg: spamd: LocalAddr: 127.0.0.1 [15535] dbg: spamd: LocalPort: 48236 [15535] dbg: spamd: Proto: 6 [15535] dbg: spamd: ReuseAddr: 1 [15535] dbg: spamd: Type: 1 [15535] warn: server socket setup failed, retry 1: spamd: could not create INET socket on 127.0.0.1:48236: Address already in use http://taint.org/x/2008/bug5731.d.spamd_kill_restart.tgz contains the full logs from the spamd_kill_restart.t test case, with strace logs. Here's an examination of the key log, spamd.strace.33. First off, this is the last part of signal-handling before the error. note that SIGINT is unblocked: 15514 1202157063.466831 rt_sigprocmask(SIG_UNBLOCK, [INT CHLD], NULL, 8) = 0 15514 1202157063.468826 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 15524 1202157063.493003 time( <unfinished ...> 15514 1202157063.493065 rt_sigaction(SIGALRM, NULL, {SIG_DFL}, 8) = 0 15514 1202157063.493222 rt_sigaction(SIGALRM, {SIG_DFL}, {SIG_DFL}, 8) = 0 15514 1202157063.493367 rt_sigaction(SIGALRM, {0x8099360, [], SA_RESTART}, {SIG_DFL} , 8) = 0 15514 1202157063.493749 alarm(64) = 0 various unimportant spamd stuff omitted here. 15514 1202157063.499815 read(7, "I\0\0<\244\n", 6) = 6 15514 1202157063.500038 time(NULL) = 1202157063 15514 1202157063.500127 write(4, "Mon Feb 4 12:31:03 2008 [15514]"..., 77) = 77 15514 1202157063.500284 write(2, "[15514] dbg: prefork: child 1552"..., 52) = 52 15514 1202157063.503575 time(NULL) = 1202157063 15514 1202157064.696936 write(4, "Mon Feb 4 12:31:03 2008 [15514]"..., 74) = 74 15514 1202157064.697259 write(2, "[15514] dbg: prefork: new lowest"..., 49) = 49 15514 1202157064.697604 time(NULL) = 1202157064 15514 1202157064.697699 write(4, "Mon Feb 4 12:31:04 2008 [15514]"..., 66) = 66 15514 1202157064.697891 write(2, "[15514] dbg: prefork: child repo"..., 41) = 41 at this point, the test script kills the spamd using SIGINT: 15514 1202157064.698111 --- SIGINT (Interrupt) @ 0 (0) --- we should see perl calling into its $SIG{INT} handler. instead we see this: 15514 1202157064.698255 mmap2(NULL, 2097152, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MA P_NORESERVE, -1, 0) = 0x408e8000 15514 1202157064.698376 munmap(0x408e8000, 98304) = 0 15514 1202157064.698466 munmap(0x40a00000, 950272) = 0 15514 1202157064.698596 mprotect(0x40900000, 135168, PROT_READ|PROT_WRITE) = 0 15514 1202157064.698743 futex(0x401a1120, FUTEX_WAIT, 2, NULL <unfinished ...> this futex() call sleeps. Nothing happens until eventually a child's select() call hits its timeout: 15524 1202157363.464875 <... select resumed> ) = 0 (Timeout) the kids exit due to their timeouts: 15523 1202157364.652244 exit_group(0) = ? and the parent proc is interrupted for the SIGCHLD: 15514 1202157364.675992 <... futex resumed> ) = -1 EINTR (Interrupted system call) 15514 1202157364.676062 --- SIGCHLD (Child exited) @ 0 (0) --- but then goes back into the futex(): 15514 1202157364.676664 futex(0x401a1120, FUTEX_WAIT, 2, NULL <unfinished ...> and never exits. Comparing the time when the SIGINT arrives to other test cases that worked ok (e.g. spamd.strace.1), I see no difference; the SIGINT handler is unblocked, and it should be able to handle the signal. This looks like a perl signal-handling bug. for reference, here's what should have happened at the SIGINT: 15473 1202157028.628858 --- SIGINT (Interrupt) @ 0 (0) --- 15473 1202157028.629083 time(NULL) = 1202157028 15473 1202157028.629191 write(4, "Mon Feb 4 12:30:28 2008 [15473]"..., 85) = 85 15473 1202157028.629362 write(2, "[15473] info: spamd: server kill"..., 60) = 60 15473 1202157028.629710 close(5) = 0 15473 1202157028.629789 munmap(0x402ff000, 4096) = 0 15473 1202157028.629875 close(5) = -1 EBADF (Bad file descriptor) 15473 1202157028.629947 munmap(0x402fe000, 4096) = 0 15473 1202157028.630033 unlink("/home/buildbot/slaves/t-feisty-561/t-feisty-561/build/t/log/spamd.pid") = 0 etc. futex() isn't called in the file *at all*. It ONLY appears in the failing strace.... At this point I think I should switch to testing a later perl build; fiddling with perl 5.6.1 signal-handling is a bit of a waste of time, and I noted that the Ubuntu issue we're trying to track down should occur with 5.8.x too. Ideally it should be repro'able with the gutsy package of perl. Also worth trying: tracing with ltrace instead of strace, to get a better idea of wtf that futex() call is up to.
(In reply to comment #10) > Also worth trying: tracing with ltrace instead of strace, to get > a better idea of wtf that futex() call is up to. unfortunately ltrace is far too heavyweight; this doesn't work. reverted as of r619021.
(In reply to comment #10) > At this point I think I should switch to testing a later perl > build; fiddling with perl 5.6.1 signal-handling is a bit of a waste > of time, and I noted that the Ubuntu issue we're trying to track down > should occur with 5.8.x too. Ideally it should be repro'able with > the gutsy package of perl. so builds 905 (Thu 07 Feb 2008 13:23:41) to 941 (Mon 11 Feb 2008 21:06:03) have all passed without a failure. http://buildbot.spamassassin.org/t-feisty-561/builds/905 http://buildbot.spamassassin.org/t-feisty-561/builds/941 this looks like it was basically a bug in perl 5.6.1, which has been fixed in 5.8.x... however, I noted when opening this that the bug occurred on the "jm-feisty" buildbot, which was running with /usr/bin/perl on a laptop here.
i'll try turning off stracing again.
(In reply to comment #13) > i'll try turning off stracing again. ok, since then, there's been no failures. This appears to be fixed (probably in the Ubuntu perl 5.8 or libc packages, I think). yay!