Bug 3951 - Possible tok_get_all optimization
Summary: Possible tok_get_all optimization
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Learner
Version: SVN Trunk (Latest Devel Version)
Hardware: Other other
Importance: P5 enhancement
Target Milestone: 3.1.0
Assignee: Michael Parker
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2004-11-04 11:29 UTC by Michael Parker
Modified: 2005-03-11 04:58 UTC
CC List: 2 users



Description Michael Parker 2004-11-04 11:29:14 UTC
Based on comments from Bug 3225 it may be possible to optimize tok_get_all a
good bit.

The basic idea: instead of fetching all of the tokens from the database with
fixed-size "bunch" queries, use dynamically sized queries.  In theory this
works out fine because the atime updates invalidate the query cache anyway, so
the fixed queries gain little from caching.  We still need to limit the number
of tokens fetched per statement because of SQL statement-length limits, but
that limit should be tunable.
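The dynamic-query approach described above could look something like the
following sketch.  This is purely illustrative: it uses Python with sqlite3
rather than SpamAssassin's actual Perl/DBI code, and the table and column
names (bayes_token, spam_count, ham_count, atime) are assumptions, not the
real schema.  The tunable limit caps how many tokens go into a single
statement, per the SQL statement-length concern.

```python
import sqlite3

def tok_get_all(db, user_id, tokens, limit=100):
    """Fetch (token, spam_count, ham_count, atime) rows for the given tokens.

    Builds a placeholder list sized to the actual number of tokens requested,
    chunked at a tunable limit, instead of issuing fixed-size bunch queries.
    """
    rows = []
    for i in range(0, len(tokens), limit):
        chunk = tokens[i:i + limit]
        placeholders = ",".join("?" * len(chunk))
        sql = ("SELECT token, spam_count, ham_count, atime "
               "FROM bayes_token "
               "WHERE id = ? AND token IN (%s)" % placeholders)
        rows.extend(db.execute(sql, [user_id, *chunk]).fetchall())
    return rows
```

With limit=100, a 250-token message would need three statements instead of
the fixed-bunch sequence, and no statement carries unused placeholders.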

FWIW, I tried this (with a limit of 100) using my Bayes Benchmark (which scans
8000 ham and spam msgs with spamc/spamd and spamassassin) and found only a very
small improvement (4% at most), nothing like the 57% the original comment
states.  Using a limit of 200 actually causes a slowdown in some cases.

I think we need to do some analysis and determine the optimal combination here.
Comment 1 Daryl C. W. O'Shea 2004-11-04 19:22:30 UTC
The 57% time decrease I saw was for the tok_get_all sub itself, not an overall
decrease.

I'm assuming that the bayes routines run simultaneously with other tests, which
is why you don't see the same decrease overall.

Still though, reducing the number of queries against the SQL server is A Good
Thing, IMO, since there's no point incurring more disk I/O, CPU time and
possibly network bandwidth/queries than necessary.

In very limited tests, I found that reducing the number of queries could make
or break the feasibility of hosting the bayes SQL server on a machine other
than the one running spamd, a feature that's excellent for any installation
requiring more than one spamd server.
Comment 2 Michael Parker 2004-11-04 20:10:11 UTC
Subject: Re:  Possible tok_get_all optimization

Don't get me wrong, I'm all for lowering the number of queries and
agree that it will help.  My benchmark accounts for 6+ million queries
as it is now, averaging around 3000 queries per second on the exact
same dataset running the exact same rules returning repeatable
results.

What does that all mean?  It means that I plan to change this code in
some form.  Maybe it's a loop that generates dynamic SQL, which seemed
to provide some speedup.  Perhaps it's tuning the bunch sizes.  Maybe
it's making use of a temporary table; I just tried that one and it
seemed to help equally well.

Michael
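The temporary-table variant Michael mentions could be sketched as below.
Again this is a hypothetical illustration in Python/sqlite3, not the actual
implementation, and the schema names are assumptions: the wanted tokens are
loaded into a temp table once, then fetched with a single join instead of
many IN-list queries.

```python
import sqlite3

def tok_get_all_temp(db, user_id, tokens):
    """Fetch token rows via a temporary table joined against bayes_token."""
    # Stage the wanted tokens in a session-local temporary table.
    db.execute("CREATE TEMP TABLE IF NOT EXISTS wanted (token TEXT PRIMARY KEY)")
    db.execute("DELETE FROM wanted")
    db.executemany("INSERT OR IGNORE INTO wanted VALUES (?)",
                   [(t,) for t in tokens])
    # One join replaces the whole series of IN-list statements.
    sql = ("SELECT b.token, b.spam_count, b.ham_count, b.atime "
           "FROM bayes_token b JOIN wanted w ON b.token = w.token "
           "WHERE b.id = ?")
    return db.execute(sql, (user_id,)).fetchall()
```

The trade-off is the per-message cost of populating and clearing the temp
table, which is why it may only pay off for some table sizes.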
Comment 3 Daryl C. W. O'Shea 2004-11-04 20:33:56 UTC
Regarding bunch size optimization, I don't think it would ever turn out to be
the best solution, given the varied nature of different people's email paths
(and of the messages themselves).  Finding bunch sizes that are good for
everyone and stay optimal throughout a release's lifespan would be, I'd
imagine, next to impossible, not to mention extra work to maintain.

Fixed bunch sizes would never be optimized better than a dynamic size anyway,
since neither method gets any caching benefit.

Out of curiosity though: you mentioned before, in Bug 3225, that the reason
bunches were used had very little to do with caching.  Looking through the list
archives (including bugzilla), the bunches were introduced by Sidney, who
described them as a way to avoid blowing away the cache, and when you later
implemented them you never mentioned any other reason.  What was the reason for
bunches, if not the cache issue?

Anyway, the temporary table sounds like it may be useful for large tables that
host many individual bayes databases.  I don't foresee it being any faster than
not using one for global databases, though.  Perhaps comparing the number of
tokens the current user has in their database (bayes_vars.token_count WHERE ID =
x) to the total number of tokens in the database (SUM(bayes_vars.token_count),
or something better/persistently stored in the DB) would determine whether a
temporary table is worth the overhead trade-off.
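The decision rule Daryl suggests could be sketched as a simple ratio check.
This is a hypothetical heuristic, not anything in SpamAssassin; the threshold
value is an arbitrary placeholder that real analysis would have to determine.

```python
def use_temp_table(user_token_count, total_token_count, threshold=0.05):
    """Decide whether a temporary table is likely worth its overhead.

    The idea from the comment above: a temp table should only pay off when
    this user's tokens are a small fraction of a large shared bayes_token
    table; for a global database (one user owning most tokens) it won't be.
    """
    if total_token_count == 0:
        return False
    return (user_token_count / total_token_count) < threshold
```

For example, a user with 10,000 tokens in a 10-million-token shared table
would take the temp-table path, while a global database would not.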
Comment 4 Michael Parker 2004-11-24 11:56:47 UTC
Setting for 3.1 and taking.
Comment 5 Michael Parker 2005-03-11 13:58:43 UTC
Committed revision 157164.