Bug 7587

Summary: TxRep: blocks on its own locks, intolerably slow as a result
Product: Spamassassin Reporter: Nick Alcock <nix>
Component: PluginsAssignee: SpamAssassin Developer Mailing List <dev>
Status: REOPENED ---    
Severity: normal CC: D.J, giovanni, kmcgrail, nix
Priority: P2    
Version: 3.4 SVN branch   
Target Milestone: Undefined   
Hardware: PC   
OS: Linux   
Attachments: Do not leak storages or hold them open when not checking
txrep leak partial fix

Description Nick Alcock 2018-06-07 13:32:50 UTC
Created attachment 5572 [details]
Do not leak storages or hold them open when not checking

It appears to be hardwired into txrep that it holds the reputation database open 
indefinitely when spamd is running ($self->{txKeepStoreTied} is set to 1 when not in learning mode, and with that set, we never call finish() on the storage module). Thus, any attempt to use TxRep with local storage and a spamd with more than once instance running at once will lead to huge delays as we block on the tx-reputation.mutex.

But this is not all. Even with only one spamd running, every check_reputation() will lead to an attempt to reopen the storage, because open_storages() is called for every address checked without seeing whether the storage is already open, and unconditionally opens it again, overwriting the default_storage etc and leaking it. This leads to the remarkable sight of the thing blocking on a lock it itself holds.

Patching out the only instance where txKeepStoreTied is set to 1 and arranging to not leak locks leads to a much more desirable state, where multiple spamds can run in parallel without mutual blocking on this lock unless they both attempt to update reputation simultaneously. This is still hardly ideal -- we should probably hold the storage open if it is a SQL storage, and we have dead code hanging around relating to the txKeepStoreTied variable -- but at least we now don't leak them like confetti.

Checking times for me go down from ~186s (due to multiple 30s waits for lock expiry) to about 4s with this patch applied.

(A >180s check time might be considered a denial-of-service problem, only I'm not sure it counts as one if the triggering method is as easy as "send an email, any email". With bugs this big outstanding in its default configuration, is this plugin actually release-quality?)
Comment 1 Nick Alcock 2018-06-07 13:36:20 UTC
btw I submitted a CLA, many years ago. Looks like bz has lost track of it :/
Comment 2 Kevin A. McGrail 2018-06-15 04:20:22 UTC
Nick, my apologies.  There is no automated update for bugzilla when a CLA is filed.  I have confirmed with the secretary your CLA is on file and updated your bugzilla account to reflect this.

Thanks for the patch and taking the time to mention the CLA issue.  -KAM
Comment 3 Nick Alcock 2018-06-16 16:31:27 UTC
Hm, some large-scale sa-learn runs show that it is leaking tied filehandless like confetti too, because it ties to the storage whenever open_storages() is called, but unties only if one of the code paths through check_senders_reputation()->check_reputations()->check_reputation() happens to be invoked.

It is possible that the right fix is to change

  if (defined $storage) {
    $self->{checker} = $self->{default_storage};

in check_reputation to

  if (!defined $storage) {
    $self->{checker} = $self->{default_storage};

since it appears that without that we are currently only untying if we are using different storages per-user and per-system, but even this is error-prone and will leak if we take the wrong path down through these functions. It's easier just to cut the Gordian knot and untie from the storage whether or not $self->{checker} is set.

However, even if you fix this, it still leaks tx-reputation fhs like confetti and I have no idea why: ties and unties are precisely balanced, but the untie is not closing the fh.

If anyone can see what's wrong, I'd be happy to know... but right now I'm tempted to can this whole mess and go back to the AWL. At least the AWL works.
Comment 4 Giovanni Bechis 2018-06-30 07:23:35 UTC
Created attachment 5579 [details]
txrep leak partial fix

I think I can commit this one.
It's not a complete fix for the file descriptor leak but at least is better than
what we have in-tree.
Comment 5 Giovanni Bechis 2018-08-11 18:25:02 UTC
I committed a fix (maybe partial) for this in commit #1837876 in both trunk and 3.4.
Comment 6 Kevin A. McGrail 2018-08-22 00:56:14 UTC
Giovanni, do you think this is resolved or more patches needed?
Comment 7 Giovanni Bechis 2018-08-22 09:11:27 UTC
From my tests this has been fixed with commit #1837876.
Comment 8 Nick Alcock 2018-09-18 19:20:48 UTC
This is still present, I'm afraid. We don't seem to be actively blocking any more, but we are still leaking filehandles like a sieve. I filter three emails through it, and what do I see?

total 0
lr-x------ 1 root root 64 Sep 18 20:16 0 -> /dev/null
l-wx------ 1 root root 64 Sep 18 20:16 1 -> /dev/null
l-wx------ 1 root root 64 Sep 18 20:16 2 -> /dev/null
lr-x------ 1 root root 64 Sep 18 20:16 3 -> /pkg/perl/5.26.2-14/usr/bin/spamd
lrwx------ 1 root root 64 Sep 18 20:16 4 -> 'socket:[78493696]'
lrwx------ 1 root root 64 Sep 18 20:16 5 -> 'socket:[78516226]'
lrwx------ 1 root root 64 Sep 18 20:16 6 -> 'socket:[78516228]'
lrwx------ 1 root root 64 Sep 18 20:16 7 -> 'socket:[78518296]'
lrwx------ 1 root root 64 Sep 18 20:16 8 -> 'socket:[78506775]'
lrwx------ 1 root root 64 Sep 18 20:16 9 -> /home/nix/.spamassassin/tx-reputation.mutex
lrwx------ 1 root root 64 Sep 18 20:16 10 -> /home/nix/.spamassassin/tx-reputation
lrwx------ 1 root root 64 Sep 18 20:16 11 -> /home/nix/.spamassassin/tx-reputation
lrwx------ 1 root root 64 Sep 18 20:16 12 -> /home/nix/.spamassassin/tx-reputation

The mutex is no longer being leaked, so it's not blocking any more, but a tied fh to the actual *database* is still leaked, so we'll soon enough run out of fhs.
Comment 9 Kevin A. McGrail 2018-09-18 20:09:46 UTC
Are we talking about a file descriptor leak or a blocking issue?
Comment 10 Nick Alcock 2018-09-18 22:15:13 UTC
We used to see an fd leak and blocking (because one of the things being leaked was the fh to the mutex).

We're just seeing an fd leak now (one fh per message scanned), but despite the distinct symptoms it is really the same bug, I think.