SA Bugzilla – Bug 1345
Autolearner can't write to db
Last modified: 2003-01-13 04:37:35 UTC
I've been getting these errors when running "spamassassin -r":

    Cannot open bayes_path /usr/home/matt/.spamassassin/bayes R/W: File exists
    gdbm store returned -1, errno 17, key "**NSPAM" at
    lib/Mail/SpamAssassin/BayesStore.pm line 585.
Subject: Re: [SAdev] New: Autolearner can't write to db

On Sun, Jan 05, 2003 at 03:58:43PM -0800, bugzilla-daemon@hughes-family.org wrote:
> Cannot open bayes_path /usr/home/matt/.spamassassin/bayes R/W: File exists
> gdbm store returned -1, errno 17, key "**NSPAM" at
> lib/Mail/SpamAssassin/BayesStore.pm line 585.

I get these during the nightly run as well:

    Cannot open bayes_path /home/felicity/SA/spamassassin-corpora/masses/spamassassin/bayes R/W: File exists
    Cannot open bayes_path /home/felicity/SA/spamassassin-corpora/masses/spamassassin/bayes R/W: File exists
    Cannot open bayes_path /home/felicity/SA/spamassassin-corpora/masses/spamassassin/bayes R/W: File exists
    Cannot open bayes_path /home/felicity/SA/spamassassin-corpora/masses/spamassassin/bayes R/W: File exists
Subject: Re: [SAdev] New: Autolearner can't write to db

Have you been testing with other Perl versions? I've seen similar problems from using different versions of DB_File.
It was all using the same perl version here.
I was kind of hoping that the new "untie early" would solve this for me, but this morning (with mass-check -j 4) I got 107 of these messages... I watched the directory while the run was happening, and you could see processes getting backed up waiting for the lock to free up and then they'd sit there until they timed out. :(
Subject: Re: [SAdev] Autolearner can't write to db

> I was kind of hoping that the new "untie early" would solve this for me,
> but this morning (with mass-check -j 4) I got 107 of these messages...
> I watched the directory while the run was happening, and you could see
> processes getting backed up waiting for the lock to free up and then
> they'd sit there until they timed out. :(

ach, synch stuff is a horrible problem to fix. :(

I've got 5 of them from my -j 2 local+bayes mass-check at the moment, but no more than that, after 4 hours. And it doesn't look like they're hung, just that there was too much contention at several stages.

Any ideas on reproducing more easily? Running -j 19 with debugs on and getting a logfile?

--j.
I had one in a 9 hour run with -j3 (after two hours or so).
Justin writes:
> ach, synch stuff is a horrible problem to fix. :(

Yes, we're using a contention technique to solve an access problem. Unfortunately, the load on the Bayes DB is high, so it doesn't work as well as we might want.

One technique that might help improve the situation without much additional complexity would be to add a small random component to the delay. If you wait one second every time, one second might be exactly the interval between usages from one or more of the other threads. I would suggest changing the wait from 1 second to 0.5-1.5 seconds if select() is supported. See the select docs for how to sleep for sub-second times. It helps if the average delay time is equal to (or about the same as) the average time a thread will keep the db locked.

The other solution would be to have some sort of queue. This could be done via the filesystem. If you can't get the lock, then you do this:

1. create a file named "resource_name.$time.$pid"
2. readdir
3. wait a small period of time
4. readdir and see if you are next in line
5. repeat 3-4 until you get the lock

That might perform a very small amount worse than our current solution, because only one thread (instead of any) can pick up the lock after it has been freed, but no thread would wait very long.

Or, we could use some sort of lock manager, but I don't like the added complication. I'd like to see whether a more random delay can experimentally reduce the number of write failures.
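The randomized-delay idea above can be sketched as follows (shown in Python for illustration; `try_lock`, `acquire`, and the parameter names are hypothetical, not SpamAssassin's actual API):

```python
import random
import select

def try_lock(acquire, retries=30, base=0.5, spread=1.0):
    """Retry an acquire() callback with a jittered delay.

    A fixed 1-second wait can fall into lockstep with the other
    workers' lock/unlock cycle; waiting a random base..base+spread
    seconds (e.g. 0.5-1.5 s) breaks that synchronization.
    """
    for _ in range(retries):
        if acquire():
            return True
        # select() with no file descriptors is the classic portable
        # way to sleep for a fractional number of seconds
        select.select([], [], [], base + random.random() * spread)
    return False
```

In a real run, `acquire` would be the attempt to open the GDBM file with the lock held; here it is just any callable that returns true once the lock is free.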
> That might perform a very small amount worse than our current solution
> because only 1 thread (instead of any) can pick up the lock after it has
> been freed, but no threads would wait very long.

Actually, after more thought, I realized it could perform significantly worse if there are more than two threads. I think it's either the queueing lock manager or the random delay if we want to keep it simple.
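For reference, the filesystem ticket queue described earlier might look like this (a Python sketch; `queue_lock`, the ticket-file layout, and the timeouts are all hypothetical illustrations of the five steps, not code from the bug):

```python
import os
import time

def queue_lock(lockdir, resource, timeout=30.0, poll=0.2):
    """Wait in a filesystem ticket queue for `resource`.

    Each waiter drops a "resource.$time.$pid" ticket file and polls
    the directory; whoever holds the oldest ticket is next in line.
    Returns the ticket path on success (unlink it to release), or
    None on timeout.
    """
    os.makedirs(lockdir, exist_ok=True)
    # zero-padded timestamp so tickets sort oldest-first lexically
    name = "%s.%017.6f.%d" % (resource, time.time(), os.getpid())
    ticket = os.path.join(lockdir, name)
    open(ticket, "w").close()                        # step 1: create ticket
    deadline = time.time() + timeout
    while time.time() < deadline:
        waiting = sorted(n for n in os.listdir(lockdir)   # steps 2 and 4
                         if n.startswith(resource + "."))
        if waiting and waiting[0] == name:           # we are next in line
            return ticket
        time.sleep(poll)                             # step 3: wait a bit
    os.unlink(ticket)                                # gave up; leave the queue
    return None
```

As the comment notes, this hands the freed lock to exactly one fixed waiter, so a slow ticket-holder can stall everyone behind it.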
Why not just have a local spamd-bayes and spamd-awl sub-process spawned by the main spamd and connect to them locally from the helper spamd (pipe or local tcp socket or other ipc), that way only a single process is doing the bayes db access, same for awl. Seems like that would be a good performance boost for any setup/takedown heavy test.
Nathan: The current discussion is of mass-check in parallel, but I can see some extensions to spamd...

> Why not just have a local spamd-bayes and spamd-awl sub-process spawned by the
> main spamd and connect to them locally from the helper spamd (pipe or local tcp
> socket or other ipc), that way only a single process is doing the bayes db
> access, same for awl.
> Seems like that would be a good performance boost for any setup/takedown heavy
> test.

I like the idea long-term for spamd for DNS-related stuff - it would improve DNS performance if it were in a separate daemon with caching. I'm not sure about implementing it for 2.50, especially portably (e.g., socketpair isn't very portable).

(This idea may also be of interest for site-level bayes/awl databases and for site-level Received IP address caching, incidentally.)

-Allen
If you made it a fully external daemon, as opposed to spawned from spamd, that would probably work. Looks like bayes only has a few configs; you could just have

    bayes_use_server
    bayes_server [localhost:[x]]
    bayes_server [/path/to/unix/socket]

Have bayesd use the same spamassassin config. You'd just need to come up with a protocol like spamd's for passing it the message. The server would also need to be select/poll based instead of forked, but since it would be simpler and only executing one task, that shouldn't be too difficult. The bayes_server/use_server flags would apply to mass-check and regular SA invocations as well as spamd.

--------

Alternatively, you might consider using the BerkeleyDB module if it's available, which I believe supports the full locking capabilities of db.
Incidentally, even with per-user bayes/awl files, it occurs to me that if using spamd and one user got a bunch of mail at the same time, that user would potentially have contention problems for the bayes and awl databases.

> If you made it a fully external daemon, as opposed to spawned from spamd, that
> would probably work.

Yes. ArchiveIterator.pm currently has a spawning scheme.

> Looks like bayes only has a few configs, you could just have
> bayes_use_server
> bayes_server [localhost:[x]]
> bayes_server [/path/to/unix/socket]
> Have bayesd use the same spamassassin config. You'd just need to come up with
> a protocol like spamd for passing it the message.

The message, or just what Bayes.pm passes to BayesStore.pm? If the latter is adequately encapsulated (ditto for the AWL code and Dns.pm - I've done my best on the latter), then you could pass it the same thing that the storage/lookup code currently gets, and get it back.

> The server would also need to be select/poll based instead of forked,

Note systems not capable of select/poll, although that could be taken care of via the encapsulation I mention above - if you've got a daemon, use it; otherwise, do it internally.

> The bayes_server/use_server flags would apply to mass-check and regular SA
> invocations as well as spamd.

For mass-check -j 2+, yes. Internally seems preferable otherwise.

> Alternatively, you might consider using the BerkeleyDB module if it's
> available, which I believe supports the full locking capabilities of db.

If it's available - again, portability.

-Allen
Subject: Re: [SAdev] Autolearner can't write to db

> Or, we could use some sort of lock manager, but I don't like the added
> complication. I'd like to see if a more random delay can experimentally
> reduce the number of write failures.

May be an easier way BTW. I think the read-only-db code still locks it to increment a counter or similar; this is unnecessary. Instead we could just append a "." to a bayes_counter file (atomic writes in append mode don't need to be locked), and use the *size* of that file as the counter value.

I'll be offline until tomorrow, but I'll take a look then, assuming there's no more stuff to go in boxes etc. ;)

--j.
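The append-mode counter could be sketched like this (Python for illustration; `bump_counter` is a hypothetical name, and a later comment in this bug points out that O_APPEND writes are not atomic over NFS):

```python
import os

def bump_counter(path):
    """Increment a lock-free counter by appending one byte.

    On a local filesystem, writes to a file opened with O_APPEND are
    atomic, so concurrent processes can all append without locking;
    the counter's value is simply the file's size. (Not NFS-safe.)
    """
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, b".")
    finally:
        os.close(fd)
    return os.path.getsize(path)
```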
Subject: Re: [SAdev] Autolearner can't write to db

> Why not just have a local spamd-bayes and spamd-awl sub-process
> spawned by the main spamd and connect to them locally from the helper
> spamd (pipe or local tcp socket or other ipc), that way only a single
> process is doing the bayes db access, same for awl.
>
> Seems like that would be a good performance boost for any
> setup/takedown heavy test.

Maybe later. Much later. I'd want to see a huge performance boost and a problem to begin with (a problem not easily solved by other means).
Subject: Re: [SAdev] Autolearner can't write to db

jm@jmason.org writes:
> May be an easier way BTW. I think the read-only-db code still locks
> it to increment a counter or similar; this is unnecessary. Instead
> we could just append a "." to a bayes_counter file (atomic writes in
> append mode don't need to be locked), and use the *size* of that
> file as the counter value.

I don't think we have much contention around reading; I think it's mostly problems waiting for writes to finish. I'm working on the locking code tonight.

Unfortunately, O_APPEND mode writes over NFS are not guaranteed to be atomic. See this from the open(2) manual:

    O_APPEND
        The file is opened in append mode. Before each write, the file
        pointer is positioned at the end of the file, as if with lseek.
        O_APPEND may lead to corrupted files on NFS file systems if
        more than one process appends data to a file at once. This is
        because NFS does not support appending to a file, so the client
        kernel has to simulate it, which can't be done without a race
        condition.

I don't know about Windows.

Dan
I have a patch with results; I'll attach the patch in a moment. I'd like one of the other developers to review the patch since we are late in the cycle. The patch cleans up the locking code and how it is used (adding an unlock subroutine), and adds a random factor to the delay (but keeps the timeout the same).

My test was to run "mass-check -j 20 --head=2000", alternating between a run of the old code and the new code while my machine was otherwise idle. Here are the results:

    CODE  WARNINGS  WALL,USER,SYSTEM
    old     34      653.75,402.77,14.95
    old     48      664.10,403.29,15.00
    old     72      705.05,405.55,15.12
    new      6      626.43,404.52,15.57
    new     12      570.26,402.67,15.44
    new     12      577.66,402.54,15.53

So, the average number of warnings went from 51 to 10, a 5-fold reduction, and the average wall time went from 674 to 591 seconds. And note that this is with -j 20 on a uniprocessor. :-)
Created attachment 574 [details] proposed fix
Note that my proposed fix uses select() to get sub-second delays. There are no portability notes in the Perl select documentation that say it can't be used this way (and this is the only way the Perl documentation suggests for getting sub-second delays). I also tested select() with ActiveState Perl 5.8 on my WinXP system, and sub-second delays worked fine there.
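The select()-as-sleep trick works the same way in Python, where the four-argument call is also available; the Perl idiom is `select(undef, undef, undef, $seconds)`. A minimal demonstration:

```python
import select
import time

# select() with empty file-descriptor sets simply blocks until the
# timeout expires, giving a portable sub-second sleep without any
# extra modules (the Time::HiRes equivalent in the Perl world)
start = time.monotonic()
select.select([], [], [], 0.25)
elapsed = time.monotonic() - start
print("slept %.2f s" % elapsed)
```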
crap, mid-air collision! Dan, your changes look good, *and* you've bothered to test it thoroughly. ;) Feel free to overwrite my stuff with that, I like it better.
Subject: Re: [SAdev] Autolearner can't write to db

jm@jmason.org writes:
> crap, mid-air collision! Dan, your changes look good, *and* you've
> bothered to test it thoroughly. ;) Feel free to overwrite my stuff
> with that, I like it better.

Thanks, I checked it in.

We could also increase LOCK_MAX_RETRIES (like you did), but it's probably good enough now. Another approach would be to only increase LOCK_MAX_RETRIES from 30 to 60 if we're running in a forked situation, and pass the number of forks as an optional parameter to the main SA new().

Also, I saw that Time::HiRes has a sleep() function that also supports sub-second sleeps and it *might* be more portable, but we don't require it right now, and I have no way to know how select() fails on a theoretical system with a non-working select anyway, so I didn't add a wrapper function. Just be careful to always sleep() for 1 second or more if using the default sleep(), since sleep sometimes rounds down.
Ok, we're going to leave this alone for right now, since it has only a slight effect on mass-check runs. If there are problems in the "live" testing, we'll try randomizing the backoff attempts to lower lock contention. Lowering to P4.
Subject: Re: [SAdev] Autolearner can't write to db

I took a look at Time::HiRes. It does support floating-point seconds, but IMO the select() semantics will be more portable -- and you've already noted they work on ActivePerl on Windows, which is the #1 portability "issue" we might have.

In addition, importing Time::HiRes has caused probs before, since Time::HiRes::sleep() overrides CORE::sleep() and causes probs on some perl versions. I'd stick with the select() one.

--j.
Closing as done; open a new bug if it happens again for any reasonable -j setting.