Bug 7014 - spamassassin SQL deadlock using bayes
Summary: spamassassin SQL deadlock using bayes
Status: REOPENED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: spamassassin
Version: 3.3.2
Hardware: All
OS: All
Importance: P2 major
Target Milestone: Future
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Duplicates: 5999 (view as bug list)
Depends on:
Blocks:
 
Reported: 2014-02-19 17:40 UTC by peter gervai
Modified: 2022-04-15 08:27 UTC
CC: 5 users



Attachments:
/usr/share/perl5/Mail/SpamAssassin/BayesStore/PgSQL.pm diff fixing suboptimal SQL (application/x-download), submitted by peter gervai [NoCLA]

Description peter gervai 2014-02-19 17:40:47 UTC
Created attachment 5187 [details]
/usr/share/perl5/Mail/SpamAssassin/BayesStore/PgSQL.pm diff fixing suboptimal SQL

spamassassin bayes is almost unusable on a high-traffic site with postgres, bayes, and multiple instances, since it deadlocks within seconds, and breaking the deadlocks considerably slows down both the db server and sa. You see things like:

2014-02-19 18:21:56 CET ERROR:  deadlock detected
2014-02-19 18:21:56 CET DETAIL:  Process 63882 waits for ShareLock on transaction 858300; blocked by process 63883.
        Process 63883 waits for ShareLock on transaction 858419; blocked by process 63965.
        Process 63965 waits for ShareLock on transaction 858365; blocked by process 63931.
        Process 63931 waits for ShareLock on transaction 858390; blocked by process 63942.
        Process 63942 waits for ShareLock on transaction 858316; blocked by process 63899.
        Process 63899 waits for ShareLock on transaction 858301; blocked by process 63882.
        Process 63882: select put_tokens(1, E'{"\\\\000\\\\127\\\\346\\\\327\\\\314","\\\\000\\\\132\\\\372\\\\370\\\\325","\\\\000\\\\135\\\\013\\\\022\\\\111","\\\\000\\\\201\\\\
        Process 63883: select put_tokens(1, E'{"\\\\001\\\\271\\\\217\\\\220\\\\362","\\\\001\\\\275\\\\163\\\\036\\\\167","\\\\002\\\\046\\\\236\\\\242\\\\071","\\\\002\\\\100\\\\
        Process 63965: UPDATE bayes_token SET atime = $1 WHERE id = $2 AND token IN ($3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22,$23,$24,$25,$26,$27,$2
        Process 63931: UPDATE bayes_token SET atime = $1 WHERE id = $2 AND token IN ($3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22,$23,$24,$25,$26,$27,$2
        Process 63942: UPDATE bayes_token SET atime = $1 WHERE id = $2 AND token IN ($3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22,$23,$24,$25,$26,$27,$2
        Process 63899: UPDATE bayes_token SET atime = $1 WHERE id = $2 AND token IN ($3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22,$23,$24,$25,$26,$27,$2
2014-02-19 18:21:56 CET HINT:  See server log for query details.
2014-02-19 18:21:56 CET CONTEXT:  SQL statement "UPDATE bayes_token
               SET spam_count = greatest_int(spam_count + inspam_count, 0),
                   ham_count = greatest_int(ham_count + inham_count, 0),
                   atime = greatest_int(atime, inatime)
             WHERE id = inuserid 
               AND token = _token"
        PL/pgSQL function "put_tokens" line 14 at SQL statement
2014-02-19 18:21:56 CET STATEMENT:  select put_tokens(1, E'{"\\\\000\\\\127\\\\346\\\\327\\\\314","\\\\000\\\\132\\\\372\\\\370\\\\325","\\\\000\\\\135\\\\013\\\\022\\\\111","\\\\0

As this blog post patiently explained back in 2010, the SQL code in SA could use lots of love and tenderness, since it is written in a pretty SQL-unfriendly way: http://linuxblog.kieser.net/2010/06/spamassassin-spamd-deadlocks-runnning.html

Obviously he's also completely right about the fix: kill the INs. He also provides a patch, which I am attaching (against 3.3.2; that probably doesn't matter, since this code seems to have been untouched for ages), and it completely fixes the issue, making both the db server and sa roughly 10 times faster.
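
To sketch the idea behind "kill the INs" (an illustration of the approach as I read it, not the attached patch itself; placeholders and values are invented):

    -- before: one statement takes row locks on many tokens at once, in whatever
    -- order the executor visits them, so two clients touching overlapping token
    -- sets can acquire the locks in opposite orders and deadlock:
    UPDATE bayes_token SET atime = ? WHERE id = ? AND token IN (?, ?, ?);

    -- after: one short statement per token, driven from the caller in a fixed
    -- (e.g. sorted) token order, so locks are taken briefly and consistently:
    UPDATE bayes_token SET atime = ? WHERE id = ? AND token = ?;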

I hope you'll apply this, because it makes a tremendous difference. (I would even have moved the commit out of the loop, but then it would block again.)

File is in /usr/share/perl5/Mail/SpamAssassin/BayesStore/
Comment 1 Quanah Gibson-Mount 2014-02-19 17:48:10 UTC
I would strongly advise you to test the patch against SA 3.4.0.
Comment 2 Henrik Krohns 2014-02-19 19:22:17 UTC
I suggest the reporter read through the discussion in the bug report that the referenced blogger opened years ago.

Closing as a duplicate. I suggest continuing in the old bug if this still seems to be a problem and not some local configuration issue, missing index, or the like.

*** This bug has been marked as a duplicate of bug 6444 ***
Comment 3 peter gervai 2014-04-10 07:00:11 UTC
Okay, I have commented in bug 6444, but let me reopen this and share my reasoning.

Bug 6444 is about tok_touch_all; my problem is on a much wider scale.

I am new to the SQL modules. I wanted to use them because the data must be shared between several pretty busy servers; think 200+ parallel connections on the same table.

I see deadlocks every second. I have been fiddling with this for the last few days, trying various db settings (like changing the deadlock detection timeout) and trying to move functionality into and out of given transactions, and while I'm no SQL wizard, I can guess at some of the problems.

If I'm right, nobody really thought about locking: automatic locking, explicit locking, or transactions. For example, I was completely unable to expire bayes tokens due to immediate deadlocks. As it turned out, most of the functions hold extremely long transactions which touch plenty of tuples in bayes_token (holding share locks and row exclusive locks), then try to fiddle with bayes_vars (requiring the same locks, usually for the same userid as other processes) while still holding the share locks, deadlocking with token updates (which require share and row exclusive locks as well).
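
To make the cross-table pattern concrete, a minimal two-session sketch (values invented; table and column names as in the SA bayes SQL schema):

    -- session A: BEGIN;
    --            UPDATE bayes_token SET atime = 1 WHERE id = 1 AND token = 'x';
    -- session B: BEGIN;
    --            UPDATE bayes_vars SET token_count = token_count + 1 WHERE id = 1;
    -- session A: UPDATE bayes_vars SET token_count = token_count + 1 WHERE id = 1;  -- blocks on B
    -- session B: UPDATE bayes_token SET atime = 2 WHERE id = 1 AND token = 'x';     -- blocks on A
    -- PostgreSQL's deadlock detector then aborts one of the two transactions.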

I was able to expire once I changed the code to release the locks held on bayes_token in token_expiration after the first update (which takes locks across most of the recent part of the bayes_token table) by doing a commit.
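
Conceptually the change amounts to something like this (a sketch of the idea only, not the actual Perl in BayesStore/PgSQL.pm; column names assumed from the bayes schema):

    BEGIN;
    UPDATE bayes_token SET atime = ... WHERE id = ... AND atime < ...;  -- locks many recent rows
    COMMIT;  -- added: release those row locks before touching bayes_vars
    BEGIN;
    UPDATE bayes_vars SET last_expire = ... WHERE id = ...;
    COMMIT;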

Probably most of the code does the same: some parts hold wide locks on tokens and wait for vars, others hold locks on vars and wait for tokens, still others hold row locks on some tokens and then wait for row locks on others, all in parallel, each keeping a long transaction open.

For example, there's the put_tokens SQL statement, which deadlocked every second because it held the token locks while waiting for the vars lock; I resolved it by changing it to separate updates that release their respective locks. That possibly hurts performance, but I'm under too much time pressure to test; volunteers welcome.

I'm not yet ready to figure out the best solution; surely it involves commits (lock release) where that's acceptable, and explicit locking (prelocking) otherwise, most probably via SELECT FOR UPDATE statements. As I mentioned in bug 6444, this will take me some time, since I'm not that familiar with the code.
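
For instance, the prelocking variant could look roughly like the following (a hypothetical sketch, not code from the module; the point is that every client takes its row locks up front, in the same deterministic order):

    BEGIN;
    SELECT token FROM bayes_token
     WHERE id = ? AND token IN (?, ?, ?)
     ORDER BY token   -- deterministic order: every client locks rows in the same sequence
       FOR UPDATE;    -- acquire all the row locks before modifying anything
    UPDATE bayes_token SET atime = ? WHERE id = ? AND token IN (?, ?, ?);
    COMMIT;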

I'm doing this against the v3.4.0-1 debian package.
Comment 4 AXB 2014-04-10 07:05:16 UTC
(In reply to peter gervai from comment #3)

> I am new to the SQL modules. I wanted to use them because the data must be
> shared between several pretty busy servers; think 200+ parallel
> connections on the same table.

sidenote: the Redis Bayes backend would give you far fewer problems than SQL and a huge performance advantage. (site-wide bayes only)
Comment 5 peter gervai 2014-04-16 12:27:06 UTC
(In reply to AXB from comment #4)

> sidenote: the Redis Bayes backend would give you far fewer problems than SQL
> and a huge performance advantage. (site-wide bayes only)

Indeed, redis is interesting despite its several shortcomings, including a serious FD leak (Bug 7034) I had to play with this morning. It seems to be fixed now, so it doesn't die every few hours.

I still have to use SQL for AWL, and I'll probably have to ditch redis when I move over to per-user configs (unless the author prefixes the keys with the userid and solves the problem; don't ask me why he hasn't).

This doesn't quite solve the original problem, though: for massively parallel updating code, ignoring locking is not a workable solution in the long run. For now, redis saves me from SELECT FOR UPDATE'ing all the code to row-lock everything beforehand and release it afterwards, which is what ought to have been done.
Comment 6 Henrik Krohns 2014-04-16 12:37:39 UTC
(In reply to peter gervai from comment #5)
> and I'll probably have to ditch redis when I move
> over to per-user configs (unless the author prefixes the keys with the
> userid and solves the problem; don't ask me why he hasn't).

Probably because you do not want n^2 users each using 1GB of memory for tokens?
Comment 7 peter gervai 2014-04-29 17:21:39 UTC
(In reply to Henrik Krohns from comment #6)
> (In reply to peter gervai from comment #5)
> > and I'll probably have to ditch redis when I move
> > over to per-user configs (unless the author prefixes the keys with the
> > userid and solves the problem; don't ask me why he hasn't).
> 
> Probably because you do not want n^2 users each using 1GB of memory for
> tokens?

Well, even if you have n^2 users, you probably don't have much choice apart from storing the data you need.

I'm not familiar with redis, so this is only educated guessing: if you expect to have more tokens than users (which seems a safe assumption), I guess the way to go could be:

HMSET w:token1 u:bob:s 55432 u:bob:h 23
HMSET w:token1 u:joe:s 22 u:joe:h 8171091
HMSET w:token1 u:henrik:s 86373 u:henrik:h 832

But I guess this will _not_ help fix the psql locking problems, only avoid them. :-(
Comment 8 Henrik Krohns 2022-04-14 12:09:41 UTC
Looking back at Bug 6444, it does seem a bit silly that a potentially deadlock-solving function was canned over a mere "20 ms" performance penalty. Surely having deadlocks is much worse. We should test whether "tok_touch_all_6444" works these days.
Comment 9 Henrik Krohns 2022-04-15 08:23:50 UTC
*** Bug 5999 has been marked as a duplicate of this bug. ***
Comment 10 Henrik Krohns 2022-04-15 08:27:47 UTC
All the errors are hidden since they are logged with dbg(), sigh.. I tried looking into things a bit, but the SQL stuff is just a mess of a mess. Deadlocks here and there. Not going to waste my time; just use a global Redis and be done with it.

Postponing, since probably no one will rework these for 4.0.