Bug 6046 - Add BayesStore backend using direct calls to BerkeleyDB
Summary: Add BayesStore backend using direct calls to BerkeleyDB
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Libraries (show other bugs)
Version: SVN Trunk (Latest Devel Version)
Hardware: All All
: P5 enhancement
Target Milestone: 3.3.0
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-01-19 06:50 UTC by Michael Alan Dorman
Modified: 2010-01-27 04:05 UTC (History)
0 users



Attachment Type Modified Status Actions Submitter/CLA Status
The patch adding BayesStore/BDB.pm patch None Michael Alan Dorman [HasCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Michael Alan Dorman 2009-01-19 06:50:33 UTC
This is intended to be a high-speed, high-concurrency local BayesStore implementation.  I think it succeeds in all of those respects.

In benchmarking using an updated (but largely unchanged in procedure) masses/bayes-testing/benchmark setup, against a corpus of 6000 messages each of ham and spam, it beats all of the other back-ends I tested in all operations.

The smallest gains are against the DB_File back-end, where it generally shows a margin of 10%-20%, while it scores upwards of 3x the speed of the SQL back-ends.

I have not yet run this code in production.  I intend to start that this week.  I would certainly appreciate more eyes on the code.
Comment 1 Justin Mason 2009-01-19 06:57:41 UTC
aiming at 3.3.0.  3.2.6 is unlikely to see any new features ;)
Comment 2 Michael Alan Dorman 2009-01-22 05:52:16 UTC
Created attachment 4427 [details]
The patch adding BayesStore/BDB.pm

I've still not quite got this in production use yet, but on benchmarks it thrashes all of the other options pretty soundly.  I'll try to add those figures after a quick re-run.
Comment 3 Mark Martinec 2009-01-22 06:12:19 UTC
> I've still not quite got this in production use yet, but on benchmarks it
> thrashes all of the other options pretty soundly.  I'll try to add those
> figures after a quick re-run.

What does benchmarking say on auto-purging the database?
In my experience, lengthy auto expiration runs was the single
and most pressing reason to abandon bdb and switch our Bayes
to SQL.
Comment 4 Michael Alan Dorman 2009-01-22 06:31:56 UTC
(In reply to comment #3)
> What does benchmarking say on auto-purging the database?
> In my experience, lengthy auto expiration runs was the single
> and most pressing reason to abandon bdb and switch our Bayes
> to SQL.

Well we don't have to do the dance of move-and-rebuild that DBM does, in part because we are able to create a secondary index on atime that makes it very easy to estimate whether we would expire too many tokens in a given run.  So I would expect it to be as efficient as the SQL back-ends in that respect.

Still, it's a good question that would not be answered out of the box by the benchmark code, so I've changed the local.cf files in the benchmark directory to remove bayes_auto_expire 0, and I'll re-run and see what the results look like.

I'm curious, which SQL back-end are you using?  And if it's MySQL, do you have any performance tuning tips?  I've been a little startled to find that PgSQL is actually outperforming MySQL on my benchmarks (given their reps) but then I know how to tune PgSQL well; I'm more ignorant about MySQL.
Comment 5 Mark Martinec 2009-01-22 07:31:13 UTC
> Well we don't have to do the dance of move-and-rebuild that DBM does, in part
> because we are able to create a secondary index on atime that makes it very
> easy to estimate whether we would expire too many tokens in a given run.  So I
> would expect it to be as efficient as the SQL back-ends in that respect.

Good to hear.

> Still, it's a good question that would not be answered out of the box by the
> benchmark code, so I've changed the local.cf files in the benchmark directory
> to remove bayes_auto_expire 0, and I'll re-run and see what the results look
> like.

Thanks!
 
> I'm curious, which SQL back-end are you using?

Mail::SpamAssassin::BayesStore::MySQL

> And if it's MySQL, do you have any performance tuning tips?

Initially not. Later I added (/etc/my.cnf) :

[mysqld]
bind=127.0.0.1
key_buffer_size=60M
innodb_buffer_pool_size=384M
innodb_log_buffer_size=6M
innodb_flush_log_at_trx_commit=0
max_connections=60

based on some tips from:
http://www.mysqlperformanceblog.com/2006/09/29/what-to-tune-in-mysql-server-after-installation/

This is with MySQL 5.1.24 InnoDB (current size 5.5 GB), on FreeBSD,
SA 3.3; currently bayes_token has 1M records, bayes_seen has 26M records
(I know, I need to ditch bayes_seen and start it from scratch).

Initially I used MyISAM and Mail::SpamAssassin::BayesStore::SQL,
which would get me in trouble every now and then, requiring a
REPAIR TABLE. Now with InnoDB and the dedicated BayesStore::MySQL
it never again got me into trouble in two years.

> I've been a little startled to find that PgSQL is
> actually outperforming MySQL on my benchmarks (given their reps)
> but then I know how to tune PgSQL well; I'm more ignorant about MySQL.

I was running Bayes for a while on PostgreSQL 8.2 using
Mail::SpamAssassin::BayesStore::PgSQL, but the SELECT ... IN (...)
with a large set of tokens in the IN-set was quite slow,
much slower than with MySQL.

I'm still using PostgreSQL for everything else except Bayes,
i.e. for AWL, and for amavisd-new SQL logging / pen pals database,
which is quite large and outperforms MySQL, especially on purging.
Comment 6 Michael Alan Dorman 2009-01-22 09:39:41 UTC
So, running masses/bayes-testing/benchmark (with auto-expire enabled) across each back-end on the same corpus on an otherwise idle machine produces the following results:

Benchmarking BDB

real    14m19.809s
user    10m27.823s
sys     0m42.495s

Benchmarking PostgreSQL

real    44m23.183s
user    8m15.167s
sys     0m30.978s

Benchmarking MySQL

real    44m18.552s
user    11m2.257s
sys     1m6.612s

Benchmarking SDBM

real    25m14.481s
user    9m56.833s
sys     1m1.748s

Benchmarking DB_File

real    41m6.284s
user    12m33.203s
sys     0m57.692s

I won't pretend that MySQL is the best-tuned MySQL install ever, though I have done some.  I may make try tweaking that a bit based on Mark Martinec's suggestions and re-run.

Anyway, that's based on the patch I've included here.  I would appreciate any comments or suggestions anyone has about the code.
Comment 7 Justin Mason 2009-01-23 01:59:22 UTC
wow.  those are great numbers ;)
Comment 8 Quanah Gibson-Mount 2009-01-23 09:47:35 UTC
Just as an aside, is there any way to make it so the bayes table can be shared when using BDB?  I'm guessing not, but figured it doesn't hurt to ask. ;)
Comment 9 Michael Alan Dorman 2009-01-23 10:32:34 UTC
> Just as an aside, is there any way to make it so the bayes table can be shared
> when using BDB?  I'm guessing not, but figured it doesn't hurt to ask. ;)

Technically, I believe it is possible.  My DB4.6 reference material discusses the an RPC-based client-server mechanism.

That is, however, so far beyond my expertise, I wouldn't really be sure where to begin.
Comment 10 Michael Alan Dorman 2009-01-23 10:42:09 UTC
(In reply to comment #7)
> wow.  those are great numbers ;)

Yeah, they're so great that I have to admit I wonder if I've got a bug somewhere.

Specifically, bdb and db_file are very close right up until the long single-threaded slogs through large mailboxes at the end of the benchmark, at which point bdb pulls out a 5x-6x speed differential.  That seems to me wildly improbable.

I think the code is pretty straightforward; I'd love to have other eyes looking at it.
Comment 11 Michael Alan Dorman 2009-01-23 11:18:45 UTC
> Specifically, bdb and db_file are very close right up until the long
> single-threaded slogs through large mailboxes at the end of the benchmark, at
> which point bdb pulls out a 5x-6x speed differential.  That seems to me wildly
> improbable.

Oh, wait, I forgot, this was with auto_expire enabled, so it's not surprising that it takes longer.  with auto_expire disabled, bdb is a little faster...but only a little.
Comment 12 Mark Martinec 2009-02-06 04:14:47 UTC
I intended to try out the module, but was sidetracked by a perl5.8.9
problem, which crashes the interpreter on FreeBSD when it tries to
compile the code autogenerated from rules as soon as some module
with threading like BerkeleyDB is brought in:

perl -MBerkeleyDB body_0.pm
Bus error: 10 (core dumped)

I filed a perl bug report #63054:

  http://rt.perl.org/rt3/Public/Bug/Display.html?id=63054
  Perl_peep recursion exceeds stack in 5.8.9 eval-ing a long program

but I'm not yet sure how to proceed. Looks like I need to
downgrade to 5.8.8 or upgrade to 5.10.0, which is a bit of a
pain when most Perl modules are installed from FreeBSD ports.

A workaround could be if SA produced smaller code sections,
the 83600 lines of code for body rules at prio=0 is a lot
(using some SARE and ZDF rules), perl5.8.9 can survive about
two thirds of that.
Comment 13 Justin Mason 2009-02-06 04:34:14 UTC
(In reply to comment #12)
> I intended to try out the module, but was sidetracked by a perl5.8.9
> problem, which crashes the interpreter on FreeBSD when it tries to
> compile the code autogenerated from rules as soon as some module
> with threading like BerkeleyDB is brought in:

:(

> A workaround could be if SA produced smaller code sections,
> the 83600 lines of code for body rules at prio=0 is a lot
> (using some SARE and ZDF rules), perl5.8.9 can survive about
> two thirds of that.

yeah, we can do that I think.  can you open a separate SA bug for this?
Comment 14 Mark Martinec 2009-02-06 05:22:28 UTC
> > A workaround could be if SA produced smaller code sections,
> > the 83600 lines of code for body rules at prio=0 is a lot
> 
> yeah, we can do that I think.  can you open a separate SA bug for this?

Ok, opened Bug 6060.
Comment 15 Mark Martinec 2009-02-06 05:32:10 UTC
Now back to the original topic.

The module looks alright, but needs testing. I have a few minor
coding/style comments, but that can be addressed later or during
importing it.

Michael, is the attached code the latest that you have, or have
you done any changes later? Do we have a licensing agreement
from Michael? I'd suggest we go on with importing the module -
- as it is an optional backend module what is important is that
we have an intention of adopting it and that it compiles cleanly,
any minor bugs can be fixed along the way.
Comment 16 Justin Mason 2009-02-06 06:18:27 UTC
(In reply to comment #15)
> Michael, is the attached code the latest that you have, or have
> you done any changes later? Do we have a licensing agreement
> from Michael? I'd suggest we go on with importing the module -
> - as it is an optional backend module what is important is that
> we have an intention of adopting it and that it compiles cleanly,
> any minor bugs can be fixed along the way.

yep, the CLA is sorted; http://people.apache.org/~jim/committers.html .
so it's ready to go!

Mark, if you want to import it into SVN, go right ahead... I'm +1
Comment 17 Mark Martinec 2009-02-09 06:42:37 UTC
> Mark, if you want to import it into SVN, go right ahead... I'm +1

Imported a new Bayes backend M::S::BayesStore::BDB, along
with its benchmarks and tests (from Bug 6046 attachment);
author: Michael Alan Dorman
Sending        MANIFEST
Adding         lib/Mail/SpamAssassin/BayesStore/BDB.pm
Sending        masses/bayes-testing/benchmark/README
Adding         masses/bayes-testing/benchmark/helper/bdb
Adding         masses/bayes-testing/benchmark/helper/bdb/cleardb
Adding         masses/bayes-testing/benchmark/helper/bdb/dbsize
Adding         masses/bayes-testing/benchmark/helper/bdb/setup
Adding         masses/bayes-testing/benchmark/tests/bdb
Adding         masses/bayes-testing/benchmark/tests/bdb/site
Adding         masses/bayes-testing/benchmark/tests/bdb/site/init.pre
Adding         masses/bayes-testing/benchmark/tests/bdb/site/local.cf
Adding         masses/bayes-testing/benchmark/tests/bdb/user_prefs
Adding         t/bayesbdb.t
Transmitting file data ..........
Committed revision 742522.


Quench the warning:
  BerkeleyDB::db_version" used only once: possible typo at t/bayesbdb.t
Prevent bayesbdb.t from aborting:
  Can't use an undefined value as an ARRAY reference at
  ../blib/lib/Mail/SpamAssassin/BayesStore/BDB.pm line 683
Sending        lib/Mail/SpamAssassin/BayesStore/BDB.pm
Sending        t/bayesbdb.t
Transmitting file data ..
Committed revision 742535.
Comment 18 Mark Martinec 2009-02-09 11:10:46 UTC
some rather cosmetic changes to BDB.pm: wrap long lines; remove some
redundant parenthesis; TAB -> SP in POD sections (doesn't come out
nice); uncomment most dbg calls for the time being, change debug
area id 'BDB:' -> 'bayes:' for consistency with other backends;
changed 'assert'-like idiom to: 'assert_condition or die' (previously
a mix of 'negative_condition and die' and 'positive_cond or die'
was used); test status of BerkeleyDB methods for zero as per docs,
instead of testing for boolean; avoid idiom 'my $v=expr or die/dbg'
separating assignment and a test ('my $x=0 or die' yields:
'Found = in conditional', while 'my($x)=0 or die' never dies)

Sending        lib/Mail/SpamAssassin/BayesStore/BDB.pm
Transmitting file data .
Committed revision 742680.
Comment 19 Mark Martinec 2009-02-10 08:14:32 UTC
- BDB.pm: no wonder no tokens were ever found, sought in 'seen'
instead of in a 'tokens' database;

- BDB.pm: fix 'sa-learn --dump' failing:
plugin: eval failed: bayes: dump_db_toks: not implemented
plugin: eval failed: Can't locate object method "compute_prob_for_token"
via package "Mail::SpamAssassin::Plugin::Bayes" at BDB.pm line 615

- BDB.pm: add new subroutine _mget, and move a loop through
search tokens from tok_get_all into _mget, avoiding one level
of procedure call for few-hundred tokens

- BDB.pm: comment-out debug calls to Data::Dumper

- added a timing report to Plugin::Bayes::learner_expire_old_training

Sending        lib/Mail/SpamAssassin/BayesStore/BDB.pm
Sending        lib/Mail/SpamAssassin/Plugin/Bayes.pm
Transmitting file data ..
Committed revision 743007.
Comment 20 Mark Martinec 2009-02-10 10:49:00 UTC
It seems to be running just fine for a while, but then a database seems
to get corrupted:

Feb 10 19:34:42 xxx amavis[36895]: (36895-30) _WARN:
  rules: failed to run BAYES_05 test, skipping:
  (Couldn't put record:
  Secondary index corrupt: not consistent with primary
  at /usr/local/lib/perl5/site_perl/5.10.0/Mail/SpamAssassin/BayesStore/BDB.pm
  line 948)

Feb 10 19:34:42 xxx amavis[36827]: (36827-30) _WARN:
  plugin: eval failed: bayes: (in learn)
  Couldn't put record:
  Secondary index corrupt: not consistent with primary
  at /usr/local/lib/perl5/site_perl/5.10.0/Mail/SpamAssassin/BayesStore/BDB.pm
  line 816.

Same happened yesterday, which made me scratch the bdb database and start
anew, but it's happened again. I'm using db46-4.6.21.3 and BerkeleyDB-0.36
from FreeBSD ports. The same pair of db46 and BerkeleyDB is working well
for busy databases maintained by amavisd (nanny/process status, SNMP-like
statistics counters, cache and a cache-queue).

Could it be some concurrency/locking problem between multiple child processes,
each doing its Bayes/BDB thing independently (just like in spamd)?
Comment 21 Mark Martinec 2009-04-01 10:15:28 UTC
Now that the perl 5.8.9/5.10.0 stack overflow when Berkeley DB is in use
is resolved (Bug 6060) by a workaround, I could play again with this
BDB bayesstore backend.

  M::SA::BayesStore::BDB : keep databases persistently open in an
  attempt to reduce likelihood of database corruption or permanent
  lockout, and likely to improve performance; do a db close in a
  DESTROY method to ensure a clean rundown; remove DB_LOG_AUTOREMOVE
  as it is not compatible with libdb 4.7 (bdb API change).
Sending lib/Mail/SpamAssassin/BayesStore/BDB.pm
Committed revision 760962.

I'm still having trouble with this module. It can happen that I end up with
a database permanently locked (BerkeleyDB::Env->new hangs indefinitely),
or DB_SECONDARY_BAD is returned by db_put while expiring tokens,
and tests 17 and 24 in t/bayesbdb.t are failing. 
Comment 22 Justin Mason 2009-06-29 04:35:25 UTC
is this still occasionally flaky?  we could put it into 3.3.0alpha as a disabled-by-default optional "alpha-quality" plugin....
Comment 23 Mark Martinec 2009-06-29 06:45:16 UTC
> is this still occasionally flaky?  we could put it into 3.3.0alpha as a
> disabled-by-default optional "alpha-quality" plugin....

I think it doesn't work at all at the moment, although possibly there
isn't much that is missing from making it work. It would be very nice
to get it to at least alpha quality for 3.3.0.

Regardless, I'd vote for including it in the 3.3.0alpha but
disabled and marked as not-quite-ready.
Comment 24 Mark Martinec 2009-10-15 11:06:44 UTC
Considering there is no progress or interest in this backend, do we
drop it from 3.3 code? (The high-end sites should be using SQL anyway.)
Comment 25 Mark Martinec 2009-11-19 11:21:07 UTC
(In reply to comment #21)
> and tests 17 and 24 in t/bayesbdb.t are failing.

These failing tests I have fixed:

  r881316 | mmartinec | 2009-11-17 16:00:34 +0100 (Tue, 17 Nov 2009) | 2 lines
  Fixed bug in BayesStore::BDB::tok_get, fixed corresponding test

The rest about alpha-quality still applies.
Comment 26 Mark Martinec 2009-11-25 10:25:36 UTC
I tried it running for a couple of hours, seems to work: it did learn tokens
starting with an empty database, it did start providing good results once
it reached the 200 messages limit, but after a couple of hours it became
slower and slower (using libdb47), so I had to turn it off.

As it is, it's good enough for 3.3.0 (provided some warning text is written
in release notes). I'm moving it off to 3.3.1 target, perhaps someone would
be willing to improve it by then.
Comment 27 Mark Martinec 2010-01-27 04:05:10 UTC
Ok, this ended up in 3.3.0, with a warning in release notes that it
is an alpha. Closing it now. For any particular problems please
open separate tickets.