Bug 1665 - frequently reported sync_journal problem with 2.51
Summary: frequently reported sync_journal problem with 2.51
Status: RESOLVED DUPLICATE of bug 1664
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Libraries (show other bugs)
Version: 2.51
Hardware: Other other
: P1 major
Target Milestone: 2.60
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2003-03-20 16:17 UTC by Daniel Quinlan
Modified: 2003-03-20 09:03 UTC (History)
1 user (show)



Attachment Type Modified Status Actions Submitter/CLA Status
suggested patch, apply after patch for bug 1629 patch None Theo Van Dinter [HasCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Daniel Quinlan 2003-03-20 16:17:27 UTC
If this turns out to be a critical bug, we should fix it, retag, and release
2.52 (since 2.51 has already been released, it will be less confusing).

From: "Robert Covell" <rcovell@rolet.com>
Subject: [SAtalk] 2.50 to 2.51 Problem
To: <spamassassin-talk@lists.sourceforge.net>
Date: Thu, 20 Mar 2003 16:13:26 -0600

When I upgraded to 2.51 I got this output:

Cannot open bayes_path /usr/local/bayes_dbs/bayes R/W: File exists bayes:
Detected problem syncing journal, trying to rename
/usr/local/bayes_dbs/bayes_journal.old to /usr/local/bayes_dbs/bayes_journal

Mar 20 15:50:54 server6 mimedefang-multiplexor: Slave 2 stderr: Cannot open
bayes_path /usr/local/bayes_dbs/bayes R/W: File exists bayes: Detected
problem syncing journal, trying to rename
/usr/local/bayes_dbs/bayes_journal.old to /usr/local/bayes_dbs/bayes_journal
bayes: rename failed at
/usr/lib/perl5/site_perl/Mail/SpamAssassin/BayesStore.pm line 610.


When I switch back to 2.50 everything is fine...

Any ideas?

-Bob

P.S. Using MD 2.30 before and after upgrade

-------------------------------------------------------------------------------

From: Simon Byrnand <simon@igrin.co.nz>
Subject: Re: [SAtalk] 2.50 to 2.51 Problem
To: <spamassassin-talk@lists.sourceforge.net>
Date: Fri, 21 Mar 2003 11:47:01 +1200

Uh oh. *BIG* problems with 2.51 here too. I'm seeing exactly the same error 
messages as you, but I'm using spamc from procmail.

2.51 just about brought the server to its knees within minutes, before I 
killed it of and reverted to 2.50.

Ideas anyone ?

Regards,
Simon

-------------------------------------------------------------------------------

From: "Stewart, John" <johns@artesyncp.com>
Subject: RE: [SAtalk] SpamAssassin 2.51 released - sa-learn bug?
To: SpamAssassin-talk@lists.sourceforge.net
Date: Thu, 20 Mar 2003 16:00:33 -0600

I just installed, and got this on an sa-learn:

bash-2.05$ sa-learn --showdots --rebuild
Cannot open bayes_path /var/amavis/bayes R/W: File exists

bayes: Detected problem syncing journal, trying to rename
/var/amavis/bayes_journal.old to /var/amavis/bayes_journal
synced Bayes databases from journal in 29 seconds: 0 entries
Cannot open bayes_path /var/amavis/bayes R/W: File exists


In /etc/mail/spamassassin/local.cf, I have

use_bayes 1
bayes_path /var/amavis/bayes

Everything under /var/amavis is owned by user amavis, which was who I was
running the sa-learn as. I had no problems under 2.50.

[ second message ]

I should also mention that I am running SA with amavisd-new-20020630. The mail
was bogging up, amavisd was not responding properly, and so I went back to SA
2.51; the mail queue cleared up and all is apparently well.

-------------------------------------------------------------------------------I
should also mention that I am running SA with amavisd-new-20020630. The mail was
bogging up, amavisd was not responding properly, and so I went back to SA 2.51;
the mail queue cleared up and all is apparently well.
Comment 1 Justin Mason 2003-03-20 17:20:33 UTC
Theo reckons this is 'cos 2.51 adds code to do an expiry if the tokens are
wierd, as they would be due to a 2.50 bug.

What happens is this:

proc A runs sync_journal(), which renames the journal to journal.old, locks the
DB, and starts updating the tokens...

B is just doing a scan, and touches some tokens, creating a new journal file.

C tries to do a sync_journal(), renames that new journal to journal.old, tries
to lock the DB, and fails.  It outputs the "file exists" error.

Basically, it (and the huge load) is being caused by N processes trying to
sync journals and expire simultaneously.  See the IRC log below.

Note that this is not critical -- it just creates huge load, but resolves itself
anyway once the expiry eventually finishes.  But we should fix ASAP.  Theo's
working on a patch now.




<felicity> bayes will force a rebuild at the first scan if 2.50 had expire errors.
<felicity> the goal being to fix the db quickly.
<jm> and so multiple processes doing scans will cause contention errors when it
tries to expire?
<jm> and they will *all* try to expire at once?
<felicity> could be.
<felicity> all these people use a single central db.
<jm> I think you're right BTW
<felicity> the logic isn't different between 2.50 and 2.51 in terms of multiple
processes wanting to do the expire.
<jm> looking at the reports, they all seem to share those signs, and the most
recent msg noted that it went away afte
r about 5 minutes
<felicity> it's just that in 2.51 we sort of force the expire, so everyone sees it.
<felicity> it'll happen again when enough new messages go through.
<jm> yep, that definitely sounds like it
<felicity> so I'm not really sure what to do about it.
<jm> I think your suggestion- tie first and do not rename unless tie succeeds --
will fix it
<felicity> that'll fix some of the problem, but we'll still end up with multiple
processes trying to expire at the sa
me time.
<felicity> we need some way for the first process to go "ok, I'm expiring now"
<felicity> then all other processes don't both until that flag goes away.
<felicity> I was thinking maybe adding a new token "expire start" set to current
msgcount.
<jm> if 5 procs are trying it, proc A will get the lock, proc B,C,D,E will wait
for it, proc A will rename out the jo
urnal, complete, and unlink it; then procs B,C,D,E will tie writable in turn,
open the journal, find no entries, and 
close again. note that expire_old_tokens_trapped() does *not* iterate through
the db unless expiry_due() still sez an
 expiry is due; and since proc A will have updated that token it won't be
<felicity> hmmm.
<felicity> well, the problem is that the sync() occurs because expiry_due
returns true.
<felicity> so proc A-E hit expiry_due and all return true.
<felicity> so all 5 call sync().
<felicity> so all 5 call sync_journal().
<felicity> then all 5 call expire_old_tokens().
<jm> yes, but then expire_old_tokens() calls _trapped() which calls expiry_due()
again
<jm> which returns 0 for B,C,D,E since A has updated the tokens
<felicity> ok.
<felicity> so how about this:
<felicity> hmmm.
<felicity> I was going to say, have sync_journal() check journal filesize, and
return if 0.
<jm> sure, sounds OK to start with anyway
<felicity> yeah, but that doesn't solve the problem.
<jm> no; hence "to start with" ;)
<felicity> ;)
<jm> so, do that, and also do the thing where it defers the journal->journal.old
rename until after it's got a lock o
n the db
<jm> that should fix it
<felicity> yeah...  
<felicity> I was just trying to figure out if there was a way we can avoid A-E
trying to sync() all at the same time.
<felicity> I know essentially only one of sync_journal() and expire_old_tokens()
will run.
<felicity> but it seems like the first one should lock the process, and the
other 4 would just skip.
<jm> I dunno, without introducing a new lockfile
<jm> and I'm not sure that'd be a good idea
<felicity> yeah, I'd like to avoid yet another lockfile.
<felicity> as I started saying above, I was thinking about a "expire started"
token in teh db.
<felicity> eh, but we'd need an atomic "check for token and set if not exists"
operation.
<jm> urgh
<jm> I really don't think it's necessary; we just use a different warning if the
lock fails during sync() or expiry, 
noting that it could be because some other process is doing it.  then it's just
a UI issue ;)
<felicity> heh.
<felicity> I wonder if we should put in my code from bug 1629
<felicity> it uses some memory, but seriously reduces time for a sync_journal().
<jm> yes, I would support that
<jm> ah, I thought that had gone in
<felicity> it's in 2.6, but I wasn't quite ready to put it in 2.51.
<felicity> but I've been using the code on my install and it works fine.
<felicity> so, I thought I asked above, but ...  since sync_journal() was your
code -- the eval cal is to trap the ti
e_db_writable() call i assume?
<felicity> or the various tok_sync_* calls?
<felicity> must be the tie.
<jm> the tie
<felicity> yeah.  the tok_* stuff doesn't do anything that needs the eval.
<jm> we have to ensure it'll untie even if it dies
<jm> so it should wrap everything until the untie call, anyway
<jm> I should really have refactored out into sync_journal_trapped(), like for
expiry etc. ;)
<felicity> ok, so we just need to shove the open and everything else into the eval.
<jm> and the rename
<felicity> oh, if there's an error and we want to rename back to the original,
we need to check for existence of the 
old name first.  it's a race condition, but rename() overwrites the destination,
so we should check even though it's 
not 100% perfect.
<jm> rename() overwrites?
<jm> are you sure?
<felicity> yeah.  according to the docs.
<felicity> I thought it would fail if the dest exists, but it overwrites.
<felicity>        rename OLDNAME,NEWNAME
<felicity>                Changes the name of a file; an existing file NEW-
<felicity>                NAME will be clobbered.  Returns true for success,
<felicity>                false otherwise.
<jm> bugger, didn't know that!
<felicity> yeah.  tell me about it.
<felicity> so do you want to do up the sync_journal() suggested patch, or should
I go hacking?
<jm> you do it -- sounds like you're well on the way there ;)
<felicity> I'll send in my current patch for 1629.
<jm> I'll post this log on the bug
<felicity> k. :)

Comment 2 Justin Mason 2003-03-20 17:28:15 UTC
*** Bug 1664 has been marked as a duplicate of this bug. ***
Comment 3 Theo Van Dinter 2003-03-20 18:02:58 UTC
Created attachment 794 [details]
suggested patch, apply after patch for bug 1629
Comment 4 Justin Mason 2003-03-20 18:03:15 UTC
patch now attached to bug 1664

*** This bug has been marked as a duplicate of 1664 ***