SA Bugzilla – Bug 6046
Add BayesStore backend using direct calls to BerkeleyDB
Last modified: 2010-01-27 04:05:10 UTC
This is intended to be a high-speed, high-concurrency local BayesStore implementation, and I think it succeeds in both respects. In benchmarking using an updated (but largely unchanged in procedure) masses/bayes-testing/benchmark setup, against a corpus of 6000 messages each of ham and spam, it beats all of the other back-ends I tested in all operations. The smallest gains are against the DB_File back-end, where it generally shows a margin of 10%-20%, while it scores upwards of 3x the speed of the SQL back-ends.

I have not yet run this code in production; I intend to start that this week. I would certainly appreciate more eyes on the code.
aiming at 3.3.0. 3.2.6 is unlikely to see any new features ;)
Created attachment 4427 [details] The patch adding BayesStore/BDB.pm I've still not quite got this in production use yet, but on benchmarks it thrashes all of the other options pretty soundly. I'll try to add those figures after a quick re-run.
> I've still not quite got this in production use yet, but on benchmarks it
> thrashes all of the other options pretty soundly. I'll try to add those
> figures after a quick re-run.

What does benchmarking say about auto-purging the database? In my experience, lengthy auto-expiration runs were the single most pressing reason to abandon bdb and switch our Bayes to SQL.
(In reply to comment #3)
> What does benchmarking say on auto-purging the database?
> In my experience, lengthy auto expiration runs was the single
> and most pressing reason to abandon bdb and switch our Bayes
> to SQL.

Well, we don't have to do the dance of move-and-rebuild that DBM does, in part because we are able to create a secondary index on atime that makes it very easy to estimate whether we would expire too many tokens in a given run. So I would expect it to be as efficient as the SQL back-ends in that respect.

Still, it's a good question that would not be answered out of the box by the benchmark code, so I've changed the local.cf files in the benchmark directory to remove bayes_auto_expire 0, and I'll re-run and see what the results look like.

I'm curious, which SQL back-end are you using? And if it's MySQL, do you have any performance tuning tips? I've been a little startled to find that PgSQL is actually outperforming MySQL on my benchmarks (given their reputations), but then I know how to tune PgSQL well; I'm more ignorant about MySQL.
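The secondary-index-on-atime approach described above turns expiry estimation into a cheap index scan rather than a full walk of the token database. A minimal conceptual sketch follows, as a Python toy model (the real code is Perl using BerkeleyDB, where the index is maintained automatically via associate(); the class and method names here are purely illustrative, not the module's API):

```python
import bisect

class TokenStore:
    """Toy model of a token database with a secondary index on atime.

    BerkeleyDB maintains a secondary index automatically; here we keep a
    sorted list of atimes by hand just to show the idea.
    """
    def __init__(self):
        self.tokens = {}   # primary: token -> atime
        self.atimes = []   # secondary "index": atimes kept sorted

    def put(self, token, atime):
        old = self.tokens.get(token)
        if old is not None:
            # keep the index consistent with the primary on updates
            self.atimes.pop(bisect.bisect_left(self.atimes, old))
        self.tokens[token] = atime
        bisect.insort(self.atimes, atime)

    def count_older_than(self, cutoff):
        """Estimate how many tokens an expiry run with this atime cutoff
        would remove.  With a real atime index this is a short cursor scan
        from the start of the index -- no move-and-rebuild dance like the
        DBM backend's, and no need to touch every token."""
        return bisect.bisect_left(self.atimes, cutoff)

store = TokenStore()
for tok, atime in [("t1", 100), ("t2", 200), ("t3", 300), ("t4", 400)]:
    store.put(tok, atime)

# Expiring everything with atime < 250 would remove 2 of the 4 tokens,
# so a run with a too-aggressive cutoff can be detected cheaply and skipped.
print(store.count_older_than(250))  # prints 2
```

The point of the estimate is exactly the one made above: before committing to an expiry pass, the backend can check whether a given cutoff would remove too many tokens, and adjust or skip the run.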
> Well we don't have to do the dance of move-and-rebuild that DBM does, in part
> because we are able to create a secondary index on atime that makes it very
> easy to estimate whether we would expire too many tokens in a given run. So I
> would expect it to be as efficient as the SQL back-ends in that respect.

Good to hear.

> Still, it's a good question that would not be answered out of the box by the
> benchmark code, so I've changed the local.cf files in the benchmark directory
> to remove bayes_auto_expire 0, and I'll re-run and see what the results look
> like.

Thanks!

> I'm curious, which SQL back-end are you using?

Mail::SpamAssassin::BayesStore::MySQL

> And if it's MySQL, do you have any performance tuning tips?

Initially not. Later I added (/etc/my.cnf):

  [mysqld]
  bind=127.0.0.1
  key_buffer_size=60M
  innodb_buffer_pool_size=384M
  innodb_log_buffer_size=6M
  innodb_flush_log_at_trx_commit=0
  max_connections=60

based on some tips from:
http://www.mysqlperformanceblog.com/2006/09/29/what-to-tune-in-mysql-server-after-installation/

This is with MySQL 5.1.24 InnoDB (current size 5.5 GB), on FreeBSD, SA 3.3; currently bayes_token has 1M records, bayes_seen has 26M records (I know, I need to ditch bayes_seen and start it from scratch). Initially I used MyISAM and Mail::SpamAssassin::BayesStore::SQL, which would get me into trouble every now and then, requiring a REPAIR TABLE. Now with InnoDB and the dedicated BayesStore::MySQL it has never gotten me into trouble in two years.

> I've been a little startled to find that PgSQL is
> actually outperforming MySQL on my benchmarks (given their reps)
> but then I know how to tune PgSQL well; I'm more ignorant about MySQL.

I was running Bayes for a while on PostgreSQL 8.2 using Mail::SpamAssassin::BayesStore::PgSQL, but the SELECT ... IN (...) with a large set of tokens in the IN-set was quite slow, much slower than with MySQL. I'm still using PostgreSQL for everything else except Bayes, i.e. for AWL, and for the amavisd-new SQL logging / pen pals database, which is quite large and outperforms MySQL, especially on purging.
So, running masses/bayes-testing/benchmark (with auto-expire enabled) across each back-end on the same corpus on an otherwise idle machine produces the following results:

  Benchmarking BDB         real 14m19.809s  user 10m27.823s  sys 0m42.495s
  Benchmarking PostgreSQL  real 44m23.183s  user  8m15.167s  sys 0m30.978s
  Benchmarking MySQL       real 44m18.552s  user 11m02.257s  sys 1m06.612s
  Benchmarking SDBM        real 25m14.481s  user  9m56.833s  sys 1m01.748s
  Benchmarking DB_File     real 41m06.284s  user 12m33.203s  sys 0m57.692s

I won't pretend that MySQL is the best-tuned MySQL install ever, though I have done some tuning. I may try tweaking that a bit based on Mark Martinec's suggestions and re-run. Anyway, that's based on the patch I've included here. I would appreciate any comments or suggestions anyone has about the code.
wow. those are great numbers ;)
Just as an aside, is there any way to make it so the bayes table can be shared when using BDB? I'm guessing not, but figured it doesn't hurt to ask. ;)
> Just as an aside, is there any way to make it so the bayes table can be shared
> when using BDB? I'm guessing not, but figured it doesn't hurt to ask. ;)

Technically, I believe it is possible. My DB4.6 reference material discusses an RPC-based client-server mechanism. That is, however, so far beyond my expertise that I wouldn't really be sure where to begin.
(In reply to comment #7)
> wow. those are great numbers ;)

Yeah, they're so great that I have to admit I wonder if I've got a bug somewhere. Specifically, bdb and db_file are very close right up until the long single-threaded slogs through large mailboxes at the end of the benchmark, at which point bdb pulls out a 5x-6x speed differential. That seems to me wildly improbable.

I think the code is pretty straightforward; I'd love to have other eyes looking at it.
> Specifically, bdb and db_file are very close right up until the long
> single-threaded slogs through large mailboxes at the end of the benchmark, at
> which point bdb pulls out a 5x-6x speed differential. That seems to me wildly
> improbable.

Oh, wait, I forgot: this was with auto_expire enabled, so it's not surprising that it takes longer. With auto_expire disabled, bdb is a little faster... but only a little.
I intended to try out the module, but was sidetracked by a perl5.8.9 problem, which crashes the interpreter on FreeBSD when it tries to compile the code autogenerated from rules as soon as some module with threading like BerkeleyDB is brought in:

  perl -MBerkeleyDB body_0.pm
  Bus error: 10 (core dumped)

I filed a perl bug report #63054:
http://rt.perl.org/rt3/Public/Bug/Display.html?id=63054
  Perl_peep recursion exceeds stack in 5.8.9 eval-ing a long program

but I'm not yet sure how to proceed. Looks like I need to downgrade to 5.8.8 or upgrade to 5.10.0, which is a bit of a pain when most Perl modules are installed from FreeBSD ports.

A workaround could be if SA produced smaller code sections; the 83600 lines of code for body rules at prio=0 is a lot (using some SARE and ZDF rules), and perl5.8.9 can survive about two thirds of that.
(In reply to comment #12)
> I intended to try out the module, but was sidetracked by a perl5.8.9
> problem, which crashes the interpreter on FreeBSD when it tries to
> compile the code autogenerated from rules as soon as some module
> with threading like BerkeleyDB is brought in:

:(

> A workaround could be if SA produced smaller code sections,
> the 83600 lines of code for body rules at prio=0 is a lot
> (using some SARE and ZDF rules), perl5.8.9 can survive about
> two thirds of that.

yeah, we can do that I think. can you open a separate SA bug for this?
> > A workaround could be if SA produced smaller code sections,
> > the 83600 lines of code for body rules at prio=0 is a lot
>
> yeah, we can do that I think. can you open a separate SA bug for this?

Ok, opened Bug 6060.
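The "smaller code sections" workaround discussed in this sub-thread amounts to compiling the autogenerated rule code as many small functions instead of one enormous one, so the compiler's recursion depth stays bounded. A hedged sketch of that idea in Python (the real fix is in SpamAssassin's Perl rule compiler, tracked as Bug 6060; compile_in_chunks and the statement list below are made up purely for illustration):

```python
def compile_in_chunks(statements, chunk_size, env):
    """Compile `statements` (source lines) as several small functions
    instead of one enormous body, returning a callable that runs them
    all in order.  Each small unit keeps per-compile work bounded."""
    funcs = []
    for i in range(0, len(statements), chunk_size):
        body = "\n".join("    " + s for s in statements[i:i + chunk_size])
        src = "def _chunk():\n" + (body or "    pass") + "\n"
        ns = dict(env)          # shared objects (e.g. the hit list) stay visible
        exec(src, ns)           # compile just this small chunk
        funcs.append(ns["_chunk"])

    def run_all():
        for f in funcs:
            f()
    return run_all

# Stand-in for autogenerated rule-check code: each "statement" records a hit.
hits = []
statements = ["hits.append(%d)" % n for n in range(10)]
run = compile_in_chunks(statements, chunk_size=3, env={"hits": hits})
run()
print(hits == list(range(10)))  # prints True
```

The observable behaviour is identical to compiling one big function; only the unit of compilation shrinks, which is what lets an interpreter with a limited compile-time stack survive very large generated rule sets.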
Now back to the original topic. The module looks alright, but needs testing. I have a few minor coding/style comments, but those can be addressed later or while importing it.

Michael, is the attached code the latest that you have, or have you made any changes since? Do we have a licensing agreement from Michael? I'd suggest we go on with importing the module -- as it is an optional backend module, what is important is that we have an intention of adopting it and that it compiles cleanly; any minor bugs can be fixed along the way.
(In reply to comment #15)
> Michael, is the attached code the latest that you have, or have
> you done any changes later? Do we have a licensing agreement
> from Michael? I'd suggest we go on with importing the module -
> - as it is an optional backend module what is important is that
> we have an intention of adopting it and that it compiles cleanly,
> any minor bugs can be fixed along the way.

yep, the CLA is sorted; http://people.apache.org/~jim/committers.html . so it's ready to go!

Mark, if you want to import it into SVN, go right ahead... I'm +1
> Mark, if you want to import it into SVN, go right ahead... I'm +1

Imported a new Bayes backend M::S::BayesStore::BDB, along with its benchmarks and tests (from Bug 6046 attachment); author: Michael Alan Dorman.

  Sending  MANIFEST
  Adding   lib/Mail/SpamAssassin/BayesStore/BDB.pm
  Sending  masses/bayes-testing/benchmark/README
  Adding   masses/bayes-testing/benchmark/helper/bdb
  Adding   masses/bayes-testing/benchmark/helper/bdb/cleardb
  Adding   masses/bayes-testing/benchmark/helper/bdb/dbsize
  Adding   masses/bayes-testing/benchmark/helper/bdb/setup
  Adding   masses/bayes-testing/benchmark/tests/bdb
  Adding   masses/bayes-testing/benchmark/tests/bdb/site
  Adding   masses/bayes-testing/benchmark/tests/bdb/site/init.pre
  Adding   masses/bayes-testing/benchmark/tests/bdb/site/local.cf
  Adding   masses/bayes-testing/benchmark/tests/bdb/user_prefs
  Adding   t/bayesbdb.t
  Transmitting file data ..........
  Committed revision 742522.

Quench the warning:

  "BerkeleyDB::db_version" used only once: possible typo at t/bayesbdb.t

Prevent bayesbdb.t from aborting:

  Can't use an undefined value as an ARRAY reference at ../blib/lib/Mail/SpamAssassin/BayesStore/BDB.pm line 683

  Sending  lib/Mail/SpamAssassin/BayesStore/BDB.pm
  Sending  t/bayesbdb.t
  Transmitting file data ..
  Committed revision 742535.
Some rather cosmetic changes to BDB.pm:
- wrap long lines;
- remove some redundant parentheses;
- TAB -> SP in POD sections (doesn't come out nicely otherwise);
- uncomment most dbg calls for the time being; change debug area id 'BDB:' -> 'bayes:' for consistency with other backends;
- change the 'assert'-like idiom to 'assert_condition or die' (previously a mix of 'negative_condition and die' and 'positive_cond or die' was used);
- test the status of BerkeleyDB methods for zero as per docs, instead of testing for boolean;
- avoid the idiom 'my $v=expr or die/dbg' by separating assignment and test ('my $x=0 or die' yields 'Found = in conditional', while 'my($x)=0 or die' never dies).

  Sending  lib/Mail/SpamAssassin/BayesStore/BDB.pm
  Transmitting file data .
  Committed revision 742680.
- BDB.pm: no wonder no tokens were ever found; sought in 'seen' instead of in the 'tokens' database;
- BDB.pm: fix 'sa-learn --dump' failing:
    plugin: eval failed: bayes: dump_db_toks: not implemented
    plugin: eval failed: Can't locate object method "compute_prob_for_token" via package "Mail::SpamAssassin::Plugin::Bayes" at BDB.pm line 615
- BDB.pm: add a new subroutine _mget, and move a loop through search tokens from tok_get_all into _mget, avoiding one level of procedure call for a few hundred tokens;
- BDB.pm: comment out debug calls to Data::Dumper;
- added a timing report to Plugin::Bayes::learner_expire_old_training.

  Sending  lib/Mail/SpamAssassin/BayesStore/BDB.pm
  Sending  lib/Mail/SpamAssassin/Plugin/Bayes.pm
  Transmitting file data ..
  Committed revision 743007.
It seems to be running just fine for a while, but then a database seems to get corrupted:

  Feb 10 19:34:42 xxx amavis[36895]: (36895-30) _WARN: rules: failed to run BAYES_05 test, skipping: (Couldn't put record: Secondary index corrupt: not consistent with primary at /usr/local/lib/perl5/site_perl/5.10.0/Mail/SpamAssassin/BayesStore/BDB.pm line 948)
  Feb 10 19:34:42 xxx amavis[36827]: (36827-30) _WARN: plugin: eval failed: bayes: (in learn) Couldn't put record: Secondary index corrupt: not consistent with primary at /usr/local/lib/perl5/site_perl/5.10.0/Mail/SpamAssassin/BayesStore/BDB.pm line 816.

The same happened yesterday, which made me scratch the bdb database and start anew, but it's happened again. I'm using db46-4.6.21.3 and BerkeleyDB-0.36 from FreeBSD ports. The same pair of db46 and BerkeleyDB is working well for busy databases maintained by amavisd (nanny/process status, SNMP-like statistics counters, cache and a cache-queue).

Could it be some concurrency/locking problem between multiple child processes, each doing its Bayes/BDB thing independently (just like in spamd)?
Now that the perl 5.8.9/5.10.0 stack overflow when Berkeley DB is in use has been resolved (Bug 6060) by a workaround, I could play again with this BDB bayesstore backend.

M::SA::BayesStore::BDB: keep databases persistently open in an attempt to reduce the likelihood of database corruption or permanent lockout, and likely improve performance; do a db close in a DESTROY method to ensure a clean rundown; remove DB_LOG_AUTOREMOVE as it is not compatible with libdb 4.7 (bdb API change).

  Sending  lib/Mail/SpamAssassin/BayesStore/BDB.pm
  Committed revision 760962.

I'm still having trouble with this module. It can happen that I end up with a database permanently locked (BerkeleyDB::Env->new hangs indefinitely), or DB_SECONDARY_BAD is returned by db_put while expiring tokens, and tests 17 and 24 in t/bayesbdb.t are failing.
is this still occasionally flaky? we could put it into 3.3.0alpha as a disabled-by-default optional "alpha-quality" plugin....
> is this still occasionally flaky? we could put it into 3.3.0alpha as a
> disabled-by-default optional "alpha-quality" plugin....

I think it doesn't work at all at the moment, although possibly there isn't much missing to make it work. It would be very nice to get it to at least alpha quality for 3.3.0. Regardless, I'd vote for including it in the 3.3.0alpha, but disabled and marked as not-quite-ready.
Considering there is no progress or interest in this backend, do we drop it from 3.3 code? (The high-end sites should be using SQL anyway.)
(In reply to comment #21)
> and tests 17 and 24 in t/bayesbdb.t are failing.

These failing tests I have fixed:

  r881316 | mmartinec | 2009-11-17 16:00:34 +0100 (Tue, 17 Nov 2009) | 2 lines
  Fixed bug in BayesStore::BDB::tok_get, fixed corresponding test

The rest about alpha quality still applies.
I tried running it for a couple of hours, and it seems to work: it did learn tokens starting with an empty database, and it did start providing good results once it reached the 200-message limit, but after a couple of hours it became slower and slower (using libdb47), so I had to turn it off.

As it is, it's good enough for 3.3.0 (provided some warning text is written into the release notes). I'm moving it off to the 3.3.1 target; perhaps someone will be willing to improve it by then.
Ok, this ended up in 3.3.0, with a warning in release notes that it is an alpha. Closing it now. For any particular problems please open separate tickets.