Bug 1345 - Autolearner can't write to db
Summary: Autolearner can't write to db
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Learner (show other bugs)
Version: SVN Trunk (Latest Devel Version)
Hardware: Other other
: P4 normal
Target Milestone: 2.50
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on: 1353
Blocks:
  Show dependency tree
 
Reported: 2003-01-05 15:58 UTC by Matthew Cline
Modified: 2003-01-13 04:37 UTC (History)
0 users



Attachment Type Modified Status Actions Submitter/CLA Status
proposed fix patch None Daniel Quinlan [HasCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Matthew Cline 2003-01-05 15:58:43 UTC
I've been getting these errors when running "spamassassin -r":

Cannot open bayes_path /usr/home/matt/.spamassassin/bayes R/W: File exists
gdbm store returned -1, errno 17, key "**NSPAM" at
lib/Mail/SpamAssassin/BayesStore.pm line 585.
Comment 1 Theo Van Dinter 2003-01-05 16:06:41 UTC
Subject: Re: [SAdev]  New: Auotlearner can't write to db

On Sun, Jan 05, 2003 at 03:58:43PM -0800, bugzilla-daemon@hughes-family.org wrote:
> Cannot open bayes_path /usr/home/matt/.spamassassin/bayes R/W: File exists
> gdbm store returned -1, errno 17, key "**NSPAM" at
> lib/Mail/SpamAssassin/BayesStore.pm line 585.

I get these during the nightly run as well:

Cannot open bayes_path /home/felicity/SA/spamassassin-corpora/masses/spamassassin/bayes R/W: File exists
Cannot open bayes_path /home/felicity/SA/spamassassin-corpora/masses/spamassassin/bayes R/W: File exists
Cannot open bayes_path /home/felicity/SA/spamassassin-corpora/masses/spamassassin/bayes R/W: File exists
Cannot open bayes_path /home/felicity/SA/spamassassin-corpora/masses/spamassassin/bayes R/W: File exists

Comment 2 Justin Mason 2003-01-06 05:09:55 UTC
Subject: Re: [SAdev]  New: Auotlearner can't write to db 


Have you been  testing with other Perl versions?  I've seen similar
problems from using different versions of DB_File.

Comment 3 Theo Van Dinter 2003-01-08 06:14:12 UTC
It was all using the same perl version here.
Comment 4 Theo Van Dinter 2003-01-09 06:57:47 UTC
I was kind of hoping that the new "untie early" would solve this for me, but 
this morning (with mass-check -j 4) I got 107 of these messages...  I watched 
the directory while the run was happening, and you could see processes getting 
backed up waiting for the lock to free up and then they'd sit there until they 
timed out.  :(
Comment 5 Justin Mason 2003-01-09 11:03:24 UTC
Subject: Re: [SAdev]  Auotlearner can't write to db 


> I was kind of hoping that the new "untie early" would solve this for me,
> but this morning (with mass-check -j 4) I got 107 of these messages...
> I watched the directory while the run was happening, and you could see
> processes gettin g backed up waiting for the lock to free up and then
> they'd sit there until the y timed out.  :(

ach, synch stuff is a horrible problem to fix. :(

I've got 5 of them from my -j 2 local+bayes mass-check at the moment, but
no more than that, after 4 hours.  And it doesn't look like they're hung,
just that there was too much contention at several stages.

Any ideas on reproducing more easily?  Running -j 19 with debugs on and
getting a logfile?

--j.

Comment 6 Malte S. Stretz 2003-01-09 13:43:13 UTC
I had one in a 9 hour run with -j3 (after two hours or so). 
Comment 7 Daniel Quinlan 2003-01-11 02:11:16 UTC
Justin writes:

> ach, synch stuff is a horrible problem to fix. :(

Yes, we're using a contention technique to solve an access problem.
Unfortunately, the load on the Bayes DB is high so it doesn't work as well
as we might want.  One technique that might help improve the situation without
much additional complexity would be to add a small random component to the
delay.  If you wait one second every time, one second might just be the exact
amount of time between usages from one or more of the other threads.  I
would suggest changing the wait from 1 second to 0.5-1.5 seconds if select()
is supported.  See the select docs for how to sleep for sub-second times.
It helps if the average delay time is equal to (or about the same as) the
average time a thread will keep the db locked up.

The other solution would be to have some sort of queue.  This could be done
via the filesystem.  If you can't get the lock, then you do this:

1. create a file named "resource_name.$time.$pid"
2. readdir
3. wait a small period of time
4. readdir and see if you are next in line
5. repeat 3-4 until you get the lock

That might perform a very small amount worse than our current solution
because only 1 thread (instead of any) can pick up the lock after it has
been freed, but no threads would wait very long.

Or, we could use some sort of lock manager, but I don't like the added
complication.  I'd like to see if a more random delay can experimentally
reduce the number of write failures.

Comment 8 Daniel Quinlan 2003-01-11 02:16:59 UTC
> That might perform a very small amount worse than our current solution
> because only 1 thread (instead of any) can pick up the lock after it has
> been freed, but no threads would wait very long.

Actually, after more thought, I realized it could perform a significantly
worse if there are more than two threads.  I think it's either the queueing
lock manager or the random delay if we want to keep it simple.
Comment 9 Nathan Neulinger 2003-01-11 06:11:13 UTC
Why not just have a local spamd-bayes and spamd-awl sub-process spawned by the
main spamd and connect to them locally from the helper spamd (pipe or local tcp
socket or other ipc), that way only a single process is doing the bayes db
access, same for awl. 

Seems like that would be a good performance boost for any setup/takedown heavy
test. 
Comment 10 Allen Smith 2003-01-11 08:21:54 UTC
Nathan:

The current discussion is of mass-check in parallel, but I can see some
extensions to spamd...

>Why not just have a local spamd-bayes and spamd-awl sub-process spawned by the
>main spamd and connect to them locally from the helper spamd (pipe or local tcp
>socket or other ipc), that way only a single process is doing the bayes db
>access, same for awl. 

>Seems like that would be a good performance boost for any setup/takedown heavy
>test.

I like the idea long-term for spamd for DNS-related stuff - it would improve
DNS performance if it were in a seperate daemon with caching. I'm not sure
about implementing it for 2.50, especially portably (e.g., socketpair isn't
very portable). (This idea may also be of interest for site-level bayes/awl
databases and for site-level Received IP address caching, incidentally.)

     -Allen
Comment 11 Nathan Neulinger 2003-01-11 08:48:19 UTC
If you made it a fully external daemon, as opposed to spawned from spamd, that
would probably work. 

Looks like bayes only has a few configs, you could just have a 

bayes_use_server
bayes_server [localhost:[x]]
bayes_server [/path/to/unix/socket]

Have bayesd use the same spamassassin config. You'd just need to come up with a
protocol like spamd for passing it the message. The server would also need to be
select/poll based instead of forked, but since it would be more simplistic, and
only executing one task, that shouldn't be too difficult. 

The bayes_server/use_server flags would apply to mass-check and regular SA
invocations as well as spamd. 


--------

Alternatively, you might consider using the BerkeleyDB module if it's available,
which I believe supports the full locking capabilities of db. 

Comment 12 Allen Smith 2003-01-11 08:59:28 UTC
Incidentally, even with per-user bayes/awl files, it occurs to me that if using
spamd and one user got a bunch of mail at the same time, that user would
potentially have contention problems for the bayes and awl databases.

>If you made it a fully external daemon, as opposed to spawned from spamd, that
>would probably work. 

Yes. ArchiveIterator.pm currently has a spawning scheme.

>Looks like bayes only has a few configs, you could just have a 

>bayes_use_server
>bayes_server [localhost:[x]]
>bayes_server [/path/to/unix/socket]

>Have bayesd use the same spamassassin config. You'd just need to come up with
>a protocol like spamd for passing it the message.

The message, or just what Bayes.pm passes to BayesStore.pm? If the latter
is adequately encapsulated (ditto for the AWL code and Dns.pm - I've done
my best on the latter), then you could pass it the same thing that the
storage/lookup code currently gets, and get it back.

>The server would also need to be select/poll based instead of forked,

Note systems not capable of select/poll, although that could be taken care of
via the encapsulation I mention above - if you've got a daemon, use it;
otherwise, do it internally.

>The bayes_server/use_server flags would apply to mass-check and regular SA
>invocations as well as spamd.

For mass-check -j 2+, yes. Internally seems preferable otherwise.

>Alternatively, you might consider using the BerkeleyDB module if it's
>available, which I believe supports the full locking capabilities of db.

If it's available - again, portability.

     -Allen
Comment 13 Justin Mason 2003-01-11 11:32:14 UTC
Subject: Re: [SAdev]  Auotlearner can't write to db 


> Or, we could use some sort of lock manager, but I don't like the added
> complication.  I'd like to see if a more random delay can experimentally
> reduce the number of write failures.

May be an easier way BTW.  I think the read-only-db code still locks
it to increment a counter or similar; this is unnecessary.   Instead
we could just append a "." to a bayes_counter file (atomic writes in
append mode don't need to be locked), and use the *size* of that
file as the counter value.

I'll be offline until tomorrow, but I'll take a look then assuming
there's no more stuff to go in boxes etc. ;)

--j.

Comment 14 Daniel Quinlan 2003-01-11 19:14:34 UTC
Subject: Re: [SAdev]  Auotlearner can't write to db

> Why not just have a local spamd-bayes and spamd-awl sub-process
> spawned by the main spamd and connect to them locally from the helper
> spamd (pipe or local tcp socket or other ipc), that way only a single
> process is doing the bayes db access, same for awl.
> 
> Seems like that would be a good performance boost for any
> setup/takedown heavy test.

Maybe later.  Much later.  I'd want to see a huge performance boost and
a problem to begin with (a problem not easily solved with other means).

Comment 15 Daniel Quinlan 2003-01-11 22:59:10 UTC
Subject: Re: [SAdev]  Autolearner can't write to db

jm@jmason.org writes:

> May be an easier way BTW.  I think the read-only-db code still locks
> it to increment a counter or similar; this is unnecessary.   Instead
> we could just append a "." to a bayes_counter file (atomic writes in
> append mode don't need to be locked), and use the *size* of that
> file as the counter value.

I don't think we have much contention around reading.  I think it's
mostly problems waiting for writes to finish.  I'm working on the
locking code tonight.

Unfortunately, O_APPEND mode writes over NFS are not guaranteed to be
atomic.  See this from the open(2) manual:

       O_APPEND
              The file is opened  in  append  mode.  Before  each
              write, the file pointer is positioned at the end of
              the file, as if with lseek.  O_APPEND may  lead  to
              corrupted  files  on  NFS file systems if more than
              one process appends data to a file at  once.   This
              is  because  NFS  does  not  support appending to a
              file, so the client  kernel  has  to  simulate  it,
              which can't be done without a race condition.

I don't know about Windows.

Dan

Comment 16 Daniel Quinlan 2003-01-12 06:55:45 UTC
I have a patch with results.  I'll attach the patch in a moment.  I'd like
one of the other developers to review the patch since it we are late in the
cycle.  The patch cleans up the locking code, how it is used (adding an
unlock subroutine, and adds a random factor to the delay (but keeps the
timeout the same).

My test was to run "mass-check -j 20 --head=2000" alternating between a
run of the old code and the new code while my machine was otherwise idle.
Here are the results:

CODE  WARNINGS   WALL,USER,SYSTEM
old         34   653.75,402.77,14.95
old         48   664.10,403.29,15.00
old         72   705.05,405.55,15.12

new          6   626.43,404.52,15.57
new         12   570.26,402.67,15.44
new         12   577.66,402.54,15.53

So, the average number of warnings went from 51 to 10, a 5-fold reduction
and the average wall time also went from 674 to 591 seconds.  And note
that this is with -j 20 on a uniprocessor.  :-)
Comment 17 Daniel Quinlan 2003-01-12 06:56:49 UTC
Created attachment 574 [details]
proposed fix
Comment 18 Daniel Quinlan 2003-01-12 06:59:20 UTC
Note that my proposed fix uses select to get subsecond-delays.  There are
no portability notes in the Perl select documentation that say it can't be
used this way (and this is the only suggested way to get subsecond delays
in the Perl documentation) and I also tested select() with ActiveState (Perl 5.8)
on my WinXP system and subsecond delays worked fine there.
Comment 19 Justin Mason 2003-01-12 06:59:57 UTC
crap, mid-air collision!  Dan, your changes look good, *and* you've
bothered to test it thoroughly. ;)  Feel free
to overwrite my stuff with that, I like it better.
Comment 20 Daniel Quinlan 2003-01-12 07:50:32 UTC
Subject: Re: [SAdev]  Autolearner can't write to db

jm@jmason.org writes:

> crap, mid-air collision!  Dan, your changes look good, *and* you've
> bothered to test it thoroughly. ;) Feel free to overwrite my stuff
> with that, I like it better.

Thanks, I checked it in.  We could also increase LOCK_MAX_RETRIES (like
you did), but it's probably good enough now.  Another approach would be
to only increase LOCK_MAX_RETRIES from 30 to 60 if we're running in a
forked situation and passthe number of forks as an optional parameter to
the main SA new().

Also, I saw that Time::HiRes has a sleep() function that also supports
sub-second sleep and it *might* be more portable, but we don't require
it right now and I have no way to know how select() fails on a
theoretical system with a non-working select anyway, so I didn't add a
wrapper function.  Just be careful to always sleep() for 1 second or
more if using the default sleep() since sleep sometimes rounds down.

Comment 21 Theo Van Dinter 2003-01-12 08:20:27 UTC
Ok, we're going to leave this alone for right now since it only slightly has 
an effect on mass-check runs.  If in the "live" testing there are problems, 
we'll try out randomizing the backoff attempts to lower lock contention.  
Lowering to P4.
Comment 22 Justin Mason 2003-01-12 09:52:52 UTC
Subject: Re: [SAdev]  Autolearner can't write to db 


I took a look at Time::HiRes.  It does support floating-point seconds, but
IMO the select() semantics will be more protable -- and you've already
noted they work on ActivePerl on windows, which is the #1 portability
"issue" we might have.  In addition, importing Time::HiRes has caused
probs before, since Time::HiRes::sleep() overrides CORE::sleep() and
causes probs on some perl versions.

I'd stick with the select() one.

--j.

Comment 23 Daniel Quinlan 2003-01-13 13:37:35 UTC
closing as done, open a new bug if it happens again for any reasonable -j
setting