Bug 7363 - Bayes and AWL leave DB locked on timeout
Summary: Bayes and AWL leave DB locked on timeout
Status: NEW
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: spamassassin (show other bugs)
Version: 3.4 SVN branch
Hardware: All All
: P2 major
Target Milestone: Future
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-10-26 11:26 UTC by Markus Benning
Modified: 2023-05-07 16:21 UTC (History)
7 users (show)



Attachment Type Modified Status Actions Submitter/CLA Status
patch for bayes timeout handling patch None Markus Benning [NoCLA]
stacktrace when the 2nd lock blocks text/plain None Markus Benning [NoCLA]
stacktrace when the first lock is created text/plain None Markus Benning [NoCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Markus Benning 2016-10-26 11:26:28 UTC
Created attachment 5410 [details]
patch for bayes timeout handling

I have some high volume mail servers running with Amavis/Spamassassin.
A few times a day spamassassin blocks delivery of all mails for 10 minutes before
it continues.
I debugged the problem and it seems to be a race condition when the check() timeout
occures while the Bayes database is locked.
Then the lock is removed and stays for 10 minutes till it times out
and some process breaks it.

The attached patch adds timeout handling to the Bayes plugin which takes
care of unlocking the database.
It also adds a timeout to the AWL plugin. Which also blocks from time to time,
but less often.
It also modifies the UnixNFS locker to not try to obtain the same lock
twice and log related warnings.

---

When the race condition occurs the thruput on the relay declines.

There is a lock on the bayes database:

[root@mail .spamassassin]# cat bayes.lock
mail.acme.de.35578

But the same process pid is also trying to obtain the lock (twice!):

[root@mail.spamassassin]# cat bayes.lock.mail.acme.de.35578
mail.acme.de.35578

When the process is traced one could see the attempts to get the lock:

[root@mail.spamassassin]# strace -e link -p 35578
Process 35578 attached
link("/var/spool/amavisd/.spamassassin/bayes.lock.mail.acme.de.35578", "/var/spool/amavisd/.spamassassin/bayes.lock") = -1 EEXIST (File exists)
link("/var/spool/amavisd/.spamassassin/bayes.lock.mail.acme.de.35578", "/var/spool/amavisd/.spamassassin/bayes.lock") = -1 EEXIST (File exists)
...

All other processes also try to get the lock:

[root@mail .spamassassin]# ls -1 bayes.lock*
bayes.lock
bayes.lock.mail.amce.de.35578
bayes.lock.mail.amce.de.43428
bayes.lock.mail.amce.de.44619
bayes.lock.mail.amce.de.47186
bayes.lock.mail.amce.de.47434
bayes.lock.mail.amce.de.47437
bayes.lock.mail.amce.de.47438
...

---

Attached stacktrace1.txt shows a trace at the point where it tries to get the lock a second time.
Attached stacktrace2.txt shows a trace at the point where the first lock was created.
Comment 1 Markus Benning 2016-10-26 11:27:13 UTC
Created attachment 5411 [details]
stacktrace when the 2nd lock blocks
Comment 2 Markus Benning 2016-10-26 11:27:33 UTC
Created attachment 5412 [details]
stacktrace when the first lock is created
Comment 3 AXB 2016-10-26 17:57:23 UTC
just a comment: you should consider using the Redis backend
For hi traffic sites it can't be beat.
Comment 4 Markus Benning 2016-10-27 12:27:15 UTC
Yes, and of course the AWL Redis Plugin I wrote ;-)
But its more a midterm to longterm option in this case.

But the bug/fix is in the Plugin/Bayes.pm part and not within a BayesStore.
So it's not specific to the default DBM backend.
It may also affect other backends.

I guess the Redis backend is not affected by this because it just relys on the redis server for locking.
But it may leave connection handles open when they should be closed or something like that.
How this bug affects a backend depends on what the backend does in the tie/untie_ functions.
Backends may rely on untie_ beeing called for cleanup which may not happen.

I think this should be really fixed in SA.
Comment 5 Kevin A. McGrail 2018-08-26 21:28:01 UTC
Markus, can you sign an ICLA so we can consider your patch?  It looks like a good idea.  https://www.apache.org/licenses/icla.pdf
Comment 6 Matt Corallo 2021-05-28 19:29:11 UTC
I see this issue very regularly on a low-traffic mailserver that occasionally gets a number of mails delivered at once. For now the `lock_method flock` works around it, but it seems a shame the default locking is broken.
Comment 7 Henrik Krohns 2022-04-11 13:13:17 UTC
Dunno if this is still an issue, probably not, keeping it open just in case. But no time to investigate for 4.0.0, especially if "lock_method flock" fixes it anyway (should be default IMO, does anyone actually use NFS these days?).
Comment 8 Nick Alcock 2022-05-07 13:05:15 UTC
FYI: yes, some of us do use SA with Bayes autolearning atop NFS (NFSv4, in my case). (Of course I've been using lock_method flock for so many years that I can't remember when I added it.)
Comment 9 Fabian Wenk 2023-05-07 16:21:30 UTC
Running with SpamAssassin 4.0.0 on FreeBSD 12.4 (out of FreeBSD Ports, installed 12.03.2023) using bayes and AWL with SQL (MariaDB) backend.

It took my quite a while until I figured out what is wrong, as even sa-compile and spamassassin --lint did hang.

Here the output from MariaDB 'show processlist;', in the "Time" column you see that this update query did hang for 16+ hours, even many hours after I had stopped all running spamd (sorry about long lines and possible line wrap):

root@localhost [(none)]> show processlist;
+--------+----------+-----------------+--------------+---------+-------+-----------------+------------------------------------------------------------------------------------------------------+----------+
| Id     | User     | Host            | db           | Command | Time  | State           | Info                                                                                                 | Progress |
+--------+----------+-----------------+--------------+---------+-------+-----------------+------------------------------------------------------------------------------------------------------+----------+
| 886392 | spamd    | localhost       | spamassassin | Query   | 59013 | init for update | UPDATE bayes_token
                     SET ham_count = ham_count + '1',
                         at |    0.000 |
| 886397 | spamd    | localhost       | spamassassin | Query   | 58920 | Statistics      | SELECT SUBSTR(token || '     ', 1, 5), spam_count, ham_count, atime
                     FROM bayes_ |    0.000 |
| 892689 | spamd    | localhost       | spamassassin | Query   | 24528 | Statistics      | SELECT SUBSTR(token || '     ', 1, 5), spam_count, ham_count, atime
                     FROM bayes_ |    0.000 |
| 892694 | spamd    | localhost       | spamassassin | Query   | 24488 | Statistics      | SELECT SUBSTR(token || '     ', 1, 5), spam_count, ham_count, atime
                     FROM bayes_ |    0.000 |
| 893918 | spamd    | localhost       | spamassassin | Query   | 19201 | Opening tables  | SELECT SUBSTR(token || '     ', 1, 5), spam_count, ham_count, atime
                     FROM bayes_ |    0.000 |
| 893931 | spamd    | localhost       | spamassassin | Query   | 19151 | Opening tables  | SELECT SUBSTR(token || '     ', 1, 5), spam_count, ham_count, atime
                     FROM bayes_ |    0.000 |
| 894258 | spamd    | localhost       | spamassassin | Query   | 18462 | Opening tables  | SELECT SUBSTR(token || '     ', 1, 5), spam_count, ham_count, atime
                     FROM bayes_ |    0.000 |
| 894652 | spamd    | localhost       | spamassassin | Query   | 16878 | Opening tables  | SELECT SUBSTR(token || '     ', 1, 5), spam_count, ham_count, atime
                     FROM bayes_ |    0.000 |
| 894811 | spamd    | localhost       | spamassassin | Query   | 16175 | Opening tables  | SELECT SUBSTR(token || '     ', 1, 5), spam_count, ham_count, atime
                     FROM bayes_ |    0.000 |
| 895086 | spamd    | localhost       | spamassassin | Query   | 15386 | Opening tables  | SELECT SUBSTR(token || '     ', 1, 5), spam_count, ham_count, atime
                     FROM bayes_ |    0.000 |
| 895204 | spamd    | localhost       | spamassassin | Query   | 14768 | Opening tables  | SELECT SUBSTR(token || '     ', 1, 5), spam_count, ham_count, atime
                     FROM bayes_ |    0.000 |
| 895222 | spamd    | localhost       | spamassassin | Query   | 14584 | Opening tables  | SELECT SUBSTR(token || '     ', 1, 5), spam_count, ham_count, atime
                     FROM bayes_ |    0.000 |
| 895231 | spamd    | localhost       | spamassassin | Query   | 14449 | Opening tables  | SELECT SUBSTR(token || '     ', 1, 5), spam_count, ham_count, atime
                     FROM bayes_ |    0.000 |
| 895344 | spamd    | localhost       | spamassassin | Query   | 14192 | Opening tables  | SELECT SUBSTR(token || '     ', 1, 5), spam_count, ham_count, atime
                     FROM bayes_ |    0.000 |
| 895355 | spamd    | localhost       | spamassassin | Query   | 14118 | Opening tables  | SELECT SUBSTR(token || '     ', 1, 5), spam_count, ham_count, atime
                     FROM bayes_ |    0.000 |
| 895356 | spamd    | localhost       | spamassassin | Query   | 14090 | Opening tables  | SELECT SUBSTR(token || '     ', 1, 5), spam_count, ham_count, atime
                     FROM bayes_ |    0.000 |
| 895377 | spamd    | localhost       | spamassassin | Query   | 14015 | Opening tables  | SELECT SUBSTR(token || '     ', 1, 5), spam_count, ham_count, atime
                     FROM bayes_ |    0.000 |
| 895383 | spamd    | localhost       | spamassassin | Query   | 13899 | Opening tables  | SELECT SUBSTR(token || '     ', 1, 5), spam_count, ham_count, atime
                     FROM bayes_ |    0.000 |
| 895613 | spamd    | localhost       | spamassassin | Query   | 13007 | Opening tables  | SELECT SUBSTR(token || '     ', 1, 5), spam_count, ham_count, atime
                     FROM bayes_ |    0.000 |
| 895682 | spamd    | localhost       | spamassassin | Query   | 12723 | Opening tables  | SELECT SUBSTR(token || '     ', 1, 5), spam_count, ham_count, atime
                     FROM bayes_ |    0.000 |
| 895697 | spamd    | localhost       | spamassassin | Query   | 12587 | Opening tables  | SELECT SUBSTR(token || '     ', 1, 5), spam_count, ham_count, atime
                     FROM bayes_ |    0.000 |
| 895876 | spamd    | localhost       | spamassassin | Query   | 11706 | Opening tables  | SELECT SUBSTR(token || '     ', 1, 5), spam_count, ham_count, atime
                     FROM bayes_ |    0.000 |
| 899586 | root     | localhost       | NULL         | Query   |     0 | starting        | show processlist                                                                                     |    0.000 |
+--------+----------+-----------------+--------------+---------+-------+-----------------+------------------------------------------------------------------------------------------------------+----------+
28 rows in set (0.067 sec)

root@localhost [(none)]>