SA Bugzilla – Bug 7363
Bayes and AWL leave DB locked on timeout
Last modified: 2023-05-07 16:21:30 UTC
Created attachment 5410 [details] patch for bayes timeout handling I have some high volume mail servers running with Amavis/Spamassassin. A few times a day spamassassin blocks delivery of all mails for 10 minutes before it continues. I debugged the problem and it seems to be a race condition when the check() timeout occures while the Bayes database is locked. Then the lock is removed and stays for 10 minutes till it times out and some process breaks it. The attached patch adds timeout handling to the Bayes plugin which takes care of unlocking the database. It also adds a timeout to the AWL plugin. Which also blocks from time to time, but less often. It also modifies the UnixNFS locker to not try to obtain the same lock twice and log related warnings. --- When the race condition occurs the thruput on the relay declines. There is a lock on the bayes database: [root@mail .spamassassin]# cat bayes.lock mail.acme.de.35578 But the same process pid is also trying to obtain the lock (twice!): [root@mail.spamassassin]# cat bayes.lock.mail.acme.de.35578 mail.acme.de.35578 When the process is traced one could see the attempts to get the lock: [root@mail.spamassassin]# strace -e link -p 35578 Process 35578 attached link("/var/spool/amavisd/.spamassassin/bayes.lock.mail.acme.de.35578", "/var/spool/amavisd/.spamassassin/bayes.lock") = -1 EEXIST (File exists) link("/var/spool/amavisd/.spamassassin/bayes.lock.mail.acme.de.35578", "/var/spool/amavisd/.spamassassin/bayes.lock") = -1 EEXIST (File exists) ... All other processes also try to get the lock: [root@mail .spamassassin]# ls -1 bayes.lock* bayes.lock bayes.lock.mail.amce.de.35578 bayes.lock.mail.amce.de.43428 bayes.lock.mail.amce.de.44619 bayes.lock.mail.amce.de.47186 bayes.lock.mail.amce.de.47434 bayes.lock.mail.amce.de.47437 bayes.lock.mail.amce.de.47438 ... --- Attached stacktrace1.txt shows a trace at the point where it tries to get the lock a second time. Attached stacktrace2.txt shows a trace at the point where the first lock was created.
Created attachment 5411 [details] stacktrace when the 2nd lock blocks
Created attachment 5412 [details] stacktrace when the first lock is created
just a comment: you should consider using the Redis backend For hi traffic sites it can't be beat.
Yes, and of course the AWL Redis Plugin I wrote ;-) But its more a midterm to longterm option in this case. But the bug/fix is in the Plugin/Bayes.pm part and not within a BayesStore. So it's not specific to the default DBM backend. It may also affect other backends. I guess the Redis backend is not affected by this because it just relys on the redis server for locking. But it may leave connection handles open when they should be closed or something like that. How this bug affects a backend depends on what the backend does in the tie/untie_ functions. Backends may rely on untie_ beeing called for cleanup which may not happen. I think this should be really fixed in SA.
Markus, can you sign an ICLA so we can consider your patch? It looks like a good idea. https://www.apache.org/licenses/icla.pdf
I see this issue very regularly on a low-traffic mailserver that occasionally gets a number of mails delivered at once. For now the `lock_method flock` works around it, but it seems a shame the default locking is broken.
Dunno if this is still an issue, probably not, keeping it open just in case. But no time to investigate for 4.0.0, especially if "lock_method flock" fixes it anyway (should be default IMO, does anyone actually use NFS these days?).
FYI: yes, some of us do use SA with Bayes autolearning atop NFS (NFSv4, in my case). (Of course I've been using lock_method flock for so many years that I can't remember when I added it.)
Running with SpamAssassin 4.0.0 on FreeBSD 12.4 (out of FreeBSD Ports, installed 12.03.2023) using bayes and AWL with SQL (MariaDB) backend. It took my quite a while until I figured out what is wrong, as even sa-compile and spamassassin --lint did hang. Here the output from MariaDB 'show processlist;', in the "Time" column you see that this update query did hang for 16+ hours, even many hours after I had stopped all running spamd (sorry about long lines and possible line wrap): root@localhost [(none)]> show processlist; +--------+----------+-----------------+--------------+---------+-------+-----------------+------------------------------------------------------------------------------------------------------+----------+ | Id | User | Host | db | Command | Time | State | Info | Progress | +--------+----------+-----------------+--------------+---------+-------+-----------------+------------------------------------------------------------------------------------------------------+----------+ | 886392 | spamd | localhost | spamassassin | Query | 59013 | init for update | UPDATE bayes_token SET ham_count = ham_count + '1', at | 0.000 | | 886397 | spamd | localhost | spamassassin | Query | 58920 | Statistics | SELECT SUBSTR(token || ' ', 1, 5), spam_count, ham_count, atime FROM bayes_ | 0.000 | | 892689 | spamd | localhost | spamassassin | Query | 24528 | Statistics | SELECT SUBSTR(token || ' ', 1, 5), spam_count, ham_count, atime FROM bayes_ | 0.000 | | 892694 | spamd | localhost | spamassassin | Query | 24488 | Statistics | SELECT SUBSTR(token || ' ', 1, 5), spam_count, ham_count, atime FROM bayes_ | 0.000 | | 893918 | spamd | localhost | spamassassin | Query | 19201 | Opening tables | SELECT SUBSTR(token || ' ', 1, 5), spam_count, ham_count, atime FROM bayes_ | 0.000 | | 893931 | spamd | localhost | spamassassin | Query | 19151 | Opening tables | SELECT SUBSTR(token || ' ', 1, 5), spam_count, ham_count, atime FROM bayes_ | 0.000 | | 894258 | spamd | localhost | spamassassin | Query | 18462 | Opening tables | SELECT SUBSTR(token || ' ', 1, 5), spam_count, ham_count, atime FROM bayes_ | 0.000 | | 894652 | spamd | localhost | spamassassin | Query | 16878 | Opening tables | SELECT SUBSTR(token || ' ', 1, 5), spam_count, ham_count, atime FROM bayes_ | 0.000 | | 894811 | spamd | localhost | spamassassin | Query | 16175 | Opening tables | SELECT SUBSTR(token || ' ', 1, 5), spam_count, ham_count, atime FROM bayes_ | 0.000 | | 895086 | spamd | localhost | spamassassin | Query | 15386 | Opening tables | SELECT SUBSTR(token || ' ', 1, 5), spam_count, ham_count, atime FROM bayes_ | 0.000 | | 895204 | spamd | localhost | spamassassin | Query | 14768 | Opening tables | SELECT SUBSTR(token || ' ', 1, 5), spam_count, ham_count, atime FROM bayes_ | 0.000 | | 895222 | spamd | localhost | spamassassin | Query | 14584 | Opening tables | SELECT SUBSTR(token || ' ', 1, 5), spam_count, ham_count, atime FROM bayes_ | 0.000 | | 895231 | spamd | localhost | spamassassin | Query | 14449 | Opening tables | SELECT SUBSTR(token || ' ', 1, 5), spam_count, ham_count, atime FROM bayes_ | 0.000 | | 895344 | spamd | localhost | spamassassin | Query | 14192 | Opening tables | SELECT SUBSTR(token || ' ', 1, 5), spam_count, ham_count, atime FROM bayes_ | 0.000 | | 895355 | spamd | localhost | spamassassin | Query | 14118 | Opening tables | SELECT SUBSTR(token || ' ', 1, 5), spam_count, ham_count, atime FROM bayes_ | 0.000 | | 895356 | spamd | localhost | spamassassin | Query | 14090 | Opening tables | SELECT SUBSTR(token || ' ', 1, 5), spam_count, ham_count, atime FROM bayes_ | 0.000 | | 895377 | spamd | localhost | spamassassin | Query | 14015 | Opening tables | SELECT SUBSTR(token || ' ', 1, 5), spam_count, ham_count, atime FROM bayes_ | 0.000 | | 895383 | spamd | localhost | spamassassin | Query | 13899 | Opening tables | SELECT SUBSTR(token || ' ', 1, 5), spam_count, ham_count, atime FROM bayes_ | 0.000 | | 895613 | spamd | localhost | spamassassin | Query | 13007 | Opening tables | SELECT SUBSTR(token || ' ', 1, 5), spam_count, ham_count, atime FROM bayes_ | 0.000 | | 895682 | spamd | localhost | spamassassin | Query | 12723 | Opening tables | SELECT SUBSTR(token || ' ', 1, 5), spam_count, ham_count, atime FROM bayes_ | 0.000 | | 895697 | spamd | localhost | spamassassin | Query | 12587 | Opening tables | SELECT SUBSTR(token || ' ', 1, 5), spam_count, ham_count, atime FROM bayes_ | 0.000 | | 895876 | spamd | localhost | spamassassin | Query | 11706 | Opening tables | SELECT SUBSTR(token || ' ', 1, 5), spam_count, ham_count, atime FROM bayes_ | 0.000 | | 899586 | root | localhost | NULL | Query | 0 | starting | show processlist | 0.000 | +--------+----------+-----------------+--------------+---------+-------+-----------------+------------------------------------------------------------------------------------------------------+----------+ 28 rows in set (0.067 sec) root@localhost [(none)]>