SA Bugzilla – Bug 4650
child timeout processing eventually causes spamd to hang
Last modified: 2008-04-21 02:03:36 UTC
This problem is seen with the Debian version 3.1.0a-1. I run SpamAssassin on a very slow machine (HP B180), which means that operations take much longer than on a modern ix86 system. The symptom I see is that spamd eventually fails to fork. When I run ps, there are 5 spamd processes (from the --max-children 5 argument), all owned by my users, all idle, but none returning to the spamd prefork pool. For each of them there is a message in the log saying: spamd[26351]: bayes: expire_old_tokens: child processing timeout at /usr/sbin/spamd line 1088. (However, there are more of these lines than the five children.) It looks as though there is some error path after the timeout where the child fails to return to the prefork pool. On an unmodified spamd installation, this makes spamd unusable after about a day of processing email. I've worked around the problem on my system by adding the option --timeout-child 3600, and spamd has now been running properly for the last week.
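For anyone trying to confirm the same symptom, the timeout messages can be compared against the stuck children with a small script. This is only a sketch: the default log path and the function names are assumptions, and the pattern matches the message text quoted in the report.

```shell
#!/bin/sh
# Sketch: summarise spamd child timeouts from a syslog-style log file.
# The default log path is an assumption; pass the real one as $1.

# Print the number of "child processing timeout" messages in the log.
count_timeouts() {
    grep -c 'child processing timeout' "${1:-/var/log/mail.log}"
}

# Print the distinct spamd PIDs that logged a timeout, one per line.
timeout_pids() {
    sed -n 's/.*spamd\[\([0-9][0-9]*\)]:.*child processing timeout.*/\1/p' \
        "${1:-/var/log/mail.log}" | sort -u
}
```

Running timeout_pids on the mail log and comparing the result with the spamd PIDs shown by ps should show whether the idle children are the ones that timed out, and count_timeouts shows how many timeouts were logged in total.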
see bug 3828 for my comments regarding child timeouts and bayes expiry. http://issues.apache.org/SpamAssassin/show_bug.cgi?id=3828#c38 and theo's response on c39 http://issues.apache.org/SpamAssassin/show_bug.cgi?id=3828#c39 i guess the secondary helper call never made it into 3.1.. d
also, to avoid relying on auto-expiry by spamd, you can disable auto-expiry (bayes_auto_expire 0 in local.cf), and run sa-learn --force-expire (as the user you run spamd as) via cron on a semi-regular basis. this will mitigate the issue until a better solution is in place for auto-expiry via spamd.
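A cron-driven expiry along these lines might look like the following sketch. It assumes auto-expiry has already been turned off with bayes_auto_expire 0 in local.cf. The function name, the SA_LEARN override, the lock path, the script path and the schedule are all illustrative, not part of SpamAssassin. The lock is there because on a slow machine one expiry run could still be going when the next is due.

```shell
#!/bin/sh
# Sketch of a cron-driven Bayes expiry job. Run it as the user spamd runs as.
# Assumes "bayes_auto_expire 0" is set in local.cf.
# SA_LEARN and the lock path are illustrative names, not SpamAssassin settings.

run_expiry() {
    lockdir=${1:-${TMPDIR:-/tmp}/sa-expire-$(id -un).lock}
    # mkdir is atomic, so it doubles as a simple lock: if the directory
    # already exists, assume a previous run is still going and skip this one.
    if ! mkdir "$lockdir" 2>/dev/null; then
        echo "expiry lock $lockdir exists, skipping" >&2
        return 1
    fi
    "${SA_LEARN:-sa-learn}" --force-expire
    status=$?
    rmdir "$lockdir"
    return "$status"
}

# When installed as a script, end the file with a call to run_expiry.
# Example crontab entry (hypothetical path and schedule):
#   30 3 * * *  /usr/local/bin/sa-expire.sh
```

How often to run it depends on mail volume and on how long a forced expiry takes on the machine in question.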
okay.. i hate to keep writing follow-ups, but i keep thinking of things. ;) have you converted to sql-driven bayes?? my token expiration runs usually take less than 10 seconds now that i have bayes in sql. this may be a more elegant solution to the problem.
# time sa-learn --force-expire
real 0m3.894s
user 0m1.220s
sys 0m0.060s
Actually, I'm perfectly happy with the solution I'm currently employing. The reason for reporting isn't the time it takes (I have CPU time in abundance on the network gateway); it's the bug that the child processes don't return to the prefork pool, causing spamd to hang eventually. That one needs looking at before it trips someone else up.
if you can ever capture -D debug logs, or strace logs, for this situation, that would be very helpful.
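One possible way to collect the strace side of this is to attach to the stuck children once they have gone idle. The sketch below is an assumption about how that might be done, not a procedure from this bug: the function name, the STRACE override and the output prefix are illustrative. It relies on strace accepting several -p options and on -ff with -o writing one file per traced process. For the -D logs, spamd would need to be restarted with -D added to its options.

```shell
#!/bin/sh
# Sketch: attach strace to one or more stuck spamd children.
# Usually needs root, or the same user as the traced processes.
# STRACE is an illustrative override, mainly useful for dry runs.

trace_pids() {
    if [ "$#" -lt 2 ]; then
        echo "usage: trace_pids OUTPUT_PREFIX PID..." >&2
        return 2
    fi
    out=$1
    shift
    args=""
    for pid in "$@"; do
        args="$args -p $pid"
    done
    # -ff with -o writes one file per traced process (OUTPUT_PREFIX.PID);
    # -tt adds timestamps. $args is unquoted on purpose so it splits
    # into separate options.
    ${STRACE:-strace} -ff -tt -o "$out" $args
}
```

For example, trace_pids /tmp/spamd-trace followed by the PIDs of the idle children from ps would attach to each of them until interrupted with Ctrl-C.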
moving RFEs and low-priority stuff to 3.3.0 target
this is a very old bug, and hasn't been touched in several years; it's probably not an issue with current versions. If this is not the case, feel free to reopen.