Bug 7363 - Bayes and AWL leave DB locked on timeout
Summary: Bayes and AWL leave DB locked on timeout
Status: NEW
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: spamassassin (show other bugs)
Version: 3.4 SVN branch
Hardware: PC Linux
: P2 major
Target Milestone: 4.0.0
Assignee: SpamAssassin Developer Mailing List
Depends on:
Reported: 2016-10-26 11:26 UTC by Markus Benning
Modified: 2019-06-15 20:42 UTC (History)
3 users (show)

Attachment Type Modified Status Actions Submitter/CLA Status
patch for bayes timeout handling patch None Markus Benning [NoCLA]
stacktrace when the 2nd lock blocks text/plain None Markus Benning [NoCLA]
stacktrace when the first lock is created text/plain None Markus Benning [NoCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Markus Benning 2016-10-26 11:26:28 UTC
Created attachment 5410 [details]
patch for bayes timeout handling

I have some high volume mail servers running with Amavis/Spamassassin.
A few times a day spamassassin blocks delivery of all mails for 10 minutes before
it continues.
I debugged the problem and it seems to be a race condition when the check() timeout
occures while the Bayes database is locked.
Then the lock is removed and stays for 10 minutes till it times out
and some process breaks it.

The attached patch adds timeout handling to the Bayes plugin which takes
care of unlocking the database.
It also adds a timeout to the AWL plugin. Which also blocks from time to time,
but less often.
It also modifies the UnixNFS locker to not try to obtain the same lock
twice and log related warnings.


When the race condition occurs the thruput on the relay declines.

There is a lock on the bayes database:

[root@mail .spamassassin]# cat bayes.lock

But the same process pid is also trying to obtain the lock (twice!):

[root@mail.spamassassin]# cat bayes.lock.mail.acme.de.35578

When the process is traced one could see the attempts to get the lock:

[root@mail.spamassassin]# strace -e link -p 35578
Process 35578 attached
link("/var/spool/amavisd/.spamassassin/bayes.lock.mail.acme.de.35578", "/var/spool/amavisd/.spamassassin/bayes.lock") = -1 EEXIST (File exists)
link("/var/spool/amavisd/.spamassassin/bayes.lock.mail.acme.de.35578", "/var/spool/amavisd/.spamassassin/bayes.lock") = -1 EEXIST (File exists)

All other processes also try to get the lock:

[root@mail .spamassassin]# ls -1 bayes.lock*


Attached stacktrace1.txt shows a trace at the point where it tries to get the lock a second time.
Attached stacktrace2.txt shows a trace at the point where the first lock was created.
Comment 1 Markus Benning 2016-10-26 11:27:13 UTC
Created attachment 5411 [details]
stacktrace when the 2nd lock blocks
Comment 2 Markus Benning 2016-10-26 11:27:33 UTC
Created attachment 5412 [details]
stacktrace when the first lock is created
Comment 3 AXB 2016-10-26 17:57:23 UTC
just a comment: you should consider using the Redis backend
For hi traffic sites it can't be beat.
Comment 4 Markus Benning 2016-10-27 12:27:15 UTC
Yes, and of course the AWL Redis Plugin I wrote ;-)
But its more a midterm to longterm option in this case.

But the bug/fix is in the Plugin/Bayes.pm part and not within a BayesStore.
So it's not specific to the default DBM backend.
It may also affect other backends.

I guess the Redis backend is not affected by this because it just relys on the redis server for locking.
But it may leave connection handles open when they should be closed or something like that.
How this bug affects a backend depends on what the backend does in the tie/untie_ functions.
Backends may rely on untie_ beeing called for cleanup which may not happen.

I think this should be really fixed in SA.
Comment 5 Kevin A. McGrail 2018-08-26 21:28:01 UTC
Markus, can you sign an ICLA so we can consider your patch?  It looks like a good idea.  https://www.apache.org/licenses/icla.pdf