SA Bugzilla – Bug 1665
frequently reported sync_journal problem with 2.51
Last modified: 2003-03-20 09:03:15 UTC
If this turns out to be a critical bug, we should fix it, retag, and release
2.52 (since 2.51 has already been released, that will be less confusing).

From: "Robert Covell" <rcovell@rolet.com>
Subject: [SAtalk] 2.50 to 2.51 Problem
To: <spamassassin-talk@lists.sourceforge.net>
Date: Thu, 20 Mar 2003 16:13:26 -0600

When I upgraded to 2.51 I got this output:

  Cannot open bayes_path /usr/local/bayes_dbs/bayes R/W: File exists
  bayes: Detected problem syncing journal, trying to rename
    /usr/local/bayes_dbs/bayes_journal.old to /usr/local/bayes_dbs/bayes_journal
  Mar 20 15:50:54 server6 mimedefang-multiplexor: Slave 2 stderr:
    Cannot open bayes_path /usr/local/bayes_dbs/bayes R/W: File exists
  bayes: Detected problem syncing journal, trying to rename
    /usr/local/bayes_dbs/bayes_journal.old to /usr/local/bayes_dbs/bayes_journal
  bayes: rename failed at /usr/lib/perl5/site_perl/Mail/SpamAssassin/BayesStore.pm line 610.

When I switch back to 2.50 everything is fine... Any ideas?

-Bob

P.S. Using MD 2.30 before and after upgrade

-------------------------------------------------------------------------------

From: Simon Byrnand <simon@igrin.co.nz>
Subject: Re: [SAtalk] 2.50 to 2.51 Problem
To: <spamassassin-talk@lists.sourceforge.net>
Date: Fri, 21 Mar 2003 11:47:01 +1200

Uh oh. *BIG* problems with 2.51 here too. I'm seeing exactly the same error
messages as you, but I'm using spamc from procmail. 2.51 just about brought
the server to its knees within minutes, before I killed it off and reverted
to 2.50.

Ideas anyone?

Regards,
Simon

-------------------------------------------------------------------------------

From: "Stewart, John" <johns@artesyncp.com>
Subject: RE: [SAtalk] SpamAssassin 2.51 released - sa-learn bug?
To: SpamAssassin-talk@lists.sourceforge.net
Date: Thu, 20 Mar 2003 16:00:33 -0600

I just installed, and got this on an sa-learn:

  bash-2.05$ sa-learn --showdots --rebuild
  Cannot open bayes_path /var/amavis/bayes R/W: File exists
  bayes: Detected problem syncing journal, trying to rename
    /var/amavis/bayes_journal.old to /var/amavis/bayes_journal
  synced Bayes databases from journal in 29 seconds: 0 entries
  Cannot open bayes_path /var/amavis/bayes R/W: File exists

In /etc/mail/spamassassin/local.cf, I have

  use_bayes 1
  bayes_path /var/amavis/bayes

Everything under /var/amavis is owned by user amavis, which was who I was
running the sa-learn as. I had no problems under 2.50.

[ second message ]

I should also mention that I am running SA with amavisd-new-20020630. The
mail was bogging up, amavisd was not responding properly, and so I went back
to SA 2.50; the mail queue cleared up and all is apparently well.

-------------------------------------------------------------------------------
Theo reckons this is 'cos 2.51 adds code to do an expiry if the tokens are
weird, as they would be due to a 2.50 bug. What happens is this: proc A runs
sync_journal(), which renames the journal to journal.old, locks the DB, and
starts updating the tokens. Proc B is just doing a scan, and touches some
tokens, creating a new journal file. Proc C tries to do a sync_journal(),
renames that new journal to journal.old, tries to lock the DB, and fails. It
outputs the "file exists" error.

Basically, it (and the huge load) is being caused by N processes trying to
sync journals and expire simultaneously. See the IRC log below. Note that
this is not critical -- it just creates huge load, and resolves itself anyway
once the expiry eventually finishes. But we should fix it ASAP. Theo's
working on a patch now.

<felicity> bayes will force a rebuild at the first scan if 2.50 had expire errors.
<felicity> the goal being to fix the db quickly.
<jm> and so multiple processes doing scans will cause contention errors when it tries to expire?
<jm> and they will *all* try to expire at once?
<felicity> could be.
<felicity> all these people use a single central db.
<jm> I think you're right BTW
<felicity> the logic isn't different between 2.50 and 2.51 in terms of multiple processes wanting to do the expire.
<jm> looking at the reports, they all seem to share those signs, and the most recent msg noted that it went away after about 5 minutes
<felicity> it's just that in 2.51 we sort of force the expire, so everyone sees it.
<felicity> it'll happen again when enough new messages go through.
<jm> yep, that definitely sounds like it
<felicity> so I'm not really sure what to do about it.
<jm> I think your suggestion -- tie first and do not rename unless tie succeeds -- will fix it
<felicity> that'll fix some of the problem, but we'll still end up with multiple processes trying to expire at the same time.
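The "tie first and do not rename unless tie succeeds" fix above can be
sketched as follows. This is an illustrative Python mock-up, not the real
Perl in BayesStore.pm; the lock-file path, the `.old` suffix handling, and
the flock-based locking are assumptions made for the sketch. The point is
the ordering: acquire the DB lock first, and only then rename the journal,
so a process that loses the lock race never steals another process's
freshly written journal.

```python
import fcntl
import os

def sync_journal(db_lock_path, journal_path):
    """Sketch of the proposed fix: lock before renaming the journal.

    In 2.51 the rename happened *before* the lock attempt, so a loser of
    the lock race had already renamed away a journal it could not process.
    """
    lock_fh = open(db_lock_path, "w")
    try:
        fcntl.flock(lock_fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        # Another process is already syncing; leave the journal alone.
        lock_fh.close()
        return False
    try:
        # felicity's filesize check: an empty/absent journal means there
        # is nothing to replay, so skip without touching anything.
        if not os.path.exists(journal_path) or os.path.getsize(journal_path) == 0:
            return False
        old_path = journal_path + ".old"
        os.rename(journal_path, old_path)  # safe now: we hold the lock
        # ... replay each journal entry into the DB here ...
        os.unlink(old_path)
        return True
    finally:
        fcntl.flock(lock_fh, fcntl.LOCK_UN)
        lock_fh.close()
```

A process that fails the non-blocking lock simply returns and retries on a
later scan, instead of emitting the "file exists" / failed-rename errors
seen in the reports above.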
<felicity> we need some way for the first process to go "ok, I'm expiring now"
<felicity> then all other processes don't bother until that flag goes away.
<felicity> I was thinking maybe adding a new token "expire start" set to current msgcount.
<jm> if 5 procs are trying it, proc A will get the lock, proc B,C,D,E will wait for it, proc A will rename out the journal, complete, and unlink it; then procs B,C,D,E will tie writable in turn, open the journal, find no entries, and close again. note that expire_old_tokens_trapped() does *not* iterate through the db unless expiry_due() still sez an expiry is due; and since proc A will have updated that token it won't be
<felicity> hmmm.
<felicity> well, the problem is that the sync() occurs because expiry_due returns true.
<felicity> so proc A-E hit expiry_due and all return true.
<felicity> so all 5 call sync().
<felicity> so all 5 call sync_journal().
<felicity> then all 5 call expire_old_tokens().
<jm> yes, but then expire_old_tokens() calls _trapped() which calls expiry_due() again
<jm> which returns 0 for B,C,D,E since A has updated the tokens
<felicity> ok.
<felicity> so how about this:
<felicity> hmmm.
<felicity> I was going to say, have sync_journal() check journal filesize, and return if 0.
<jm> sure, sounds OK to start with anyway
<felicity> yeah, but that doesn't solve the problem.
<jm> no; hence "to start with" ;)
<felicity> ;)
<jm> so, do that, and also do the thing where it defers the journal->journal.old rename until after it's got a lock on the db
<jm> that should fix it
<felicity> yeah...
<felicity> I was just trying to figure out if there was a way we can avoid A-E trying to sync() all at the same time.
<felicity> I know essentially only one of sync_journal() and expire_old_tokens() will run.
<felicity> but it seems like the first one should lock the process, and the other 4 would just skip.
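jm's point in the log above -- that expiry_due() is re-checked after the
lock is acquired, so only proc A actually iterates the db -- is a classic
check / lock / re-check pattern. A toy sketch, using Python threads to stand
in for the separate scanner processes (the class and attribute names are
invented for illustration and bear no relation to the real BayesStore code):

```python
import threading

class BayesDB:
    """Toy stand-in showing why procs B..E skip the expensive expiry."""

    def __init__(self):
        self.lock = threading.Lock()   # plays the role of tie_db_writable()
        self.needs_expiry = True       # plays the role of expiry_due()
        self.expiries_run = 0

    def expiry_due(self):
        return self.needs_expiry

    def expire_old_tokens(self):
        if not self.expiry_due():      # cheap, unlocked first check
            return
        with self.lock:                # B..E block here while A works
            if not self.expiry_due():  # re-check: A already did the work
                return
            self.expiries_run += 1     # ... iterate the db, expire tokens ...
            self.needs_expiry = False  # update the bookkeeping token

db = BayesDB()
threads = [threading.Thread(target=db.expire_old_tokens) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# All five attempted the expiry, but only one performed it.
```

The re-check under the lock is what makes the extra "expire started" token
felicity floats unnecessary: the existing bookkeeping token already serves
as that flag, provided it is only consulted after the lock is held.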
<jm> I dunno, without introducing a new lockfile
<jm> and I'm not sure that'd be a good idea
<felicity> yeah, I'd like to avoid yet another lockfile.
<felicity> as I started saying above, I was thinking about an "expire started" token in the db.
<felicity> eh, but we'd need an atomic "check for token and set if not exists" operation.
<jm> urgh
<jm> I really don't think it's necessary; we just use a different warning if the lock fails during sync() or expiry, noting that it could be because some other process is doing it. then it's just a UI issue ;)
<felicity> heh.
<felicity> I wonder if we should put in my code from bug 1629
<felicity> it uses some memory, but seriously reduces time for a sync_journal().
<jm> yes, I would support that
<jm> ah, I thought that had gone in
<felicity> it's in 2.6, but I wasn't quite ready to put it in 2.51.
<felicity> but I've been using the code on my install and it works fine.
<felicity> so, I thought I asked above, but ... since sync_journal() was your code -- the eval call is to trap the tie_db_writable() call I assume?
<felicity> or the various tok_sync_* calls?
<felicity> must be the tie.
<jm> the tie
<felicity> yeah. the tok_* stuff doesn't do anything that needs the eval.
<jm> we have to ensure it'll untie even if it dies
<jm> so it should wrap everything until the untie call, anyway
<jm> I should really have refactored out into sync_journal_trapped(), like for expiry etc. ;)
<felicity> ok, so we just need to shove the open and everything else into the eval.
<jm> and the rename
<felicity> oh, if there's an error and we want to rename back to the original, we need to check for existence of the old name first. it's a race condition, but rename() overwrites the destination, so we should check even though it's not 100% perfect.
<jm> rename() overwrites?
<jm> are you sure?
<felicity> yeah. according to the docs.
<felicity> I thought it would fail if the dest exists, but it overwrites.
<felicity> rename OLDNAME,NEWNAME
<felicity>     Changes the name of a file; an existing file NEWNAME
<felicity>     will be clobbered. Returns true for success, false
<felicity>     otherwise.
<jm> bugger, didn't know that!
<felicity> yeah. tell me about it.
<felicity> so do you want to do up the sync_journal() suggested patch, or should I go hacking?
<jm> you do it -- sounds like you're well on the way there ;)
<felicity> I'll send in my current patch for 1629.
<jm> I'll post this log on the bug
<felicity> k. :)
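For reference, the clobbering behaviour felicity quotes from the Perl docs
is standard POSIX rename(2) semantics, which Perl's rename() inherits. A
quick Python demonstration of the same semantics (os.replace() behaves like
Perl's rename on every platform; the file names are made up for the demo):

```python
import os
import tempfile

d = tempfile.mkdtemp()
journal = os.path.join(d, "bayes_journal")
old = os.path.join(d, "bayes_journal.old")
with open(journal, "w") as f:
    f.write("new entries\n")
with open(old, "w") as f:
    f.write("old entries\n")

# Like Perl's rename(), os.replace() silently clobbers an existing
# destination. This is why the error-recovery path that renames
# journal.old back must first check whether a journal already exists,
# as discussed in the log above -- otherwise it destroys data.
os.replace(journal, old)

with open(old) as f:
    survivor = f.read()
# survivor is "new entries\n"; the previous contents of the .old file
# are gone, with no error reported.
```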
*** Bug 1664 has been marked as a duplicate of this bug. ***
Created attachment 794: suggested patch, apply after patch for bug 1629
patch now attached to bug 1664

*** This bug has been marked as a duplicate of 1664 ***