SA Bugzilla – Bug 3951
Possible tok_get_all optimization
Last modified: 2005-03-11 04:58:43 UTC
Based on comments from Bug 3225 it may be possible to optimize tok_get_all a good bit. The basic idea is, instead of using fixed-size bunches of queries to fetch all of the tokens from the database, use dynamic queries. In theory this works out ok because, due to atime updates, the query cache is pretty much invalid anyway. We still need to limit the number of tokens fetched per statement due to SQL statement limits, but it should be tunable. FWIW, I tried this (with a limit of 100) using my Bayes Benchmark (which scans 8000 ham and spam msgs with spamc/spamd and spamassassin) and found that it gives only a very small speedup (4% at most), nothing like what the original comment states (57%). Using a limit of 200 causes a slowdown in some cases. I think we need to do some analysis and determine the optimal combination here.
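To make the batching idea concrete, here is a minimal sketch in Python with sqlite3 rather than the actual Perl BayesStore::SQL code; the table layout, the tok_get_all_dynamic name, and the default limit of 100 are illustrative, with `limit` playing the role of the tunable per-statement cap discussed above:

```python
import sqlite3

def tok_get_all_dynamic(conn, userid, tokens, limit=100):
    """Fetch token rows using dynamically built IN (...) queries.

    Instead of fixed bunch-sized statements, build one statement per
    batch of up to `limit` tokens; `limit` keeps each statement under
    the SQL placeholder/statement-size limits and is meant to be tuned
    (the bug discusses 100 vs 200).
    """
    rows = []
    for i in range(0, len(tokens), limit):
        batch = tokens[i:i + limit]
        placeholders = ",".join("?" * len(batch))
        sql = ("SELECT token, spam_count, ham_count, atime "
               "FROM bayes_token "
               "WHERE id = ? AND token IN (%s)" % placeholders)
        rows.extend(conn.execute(sql, [userid, *batch]).fetchall())
    return rows
```

Fewer, larger statements mean fewer round trips to the server, which is the whole point of the optimization; the trade-off is that each statement is unique, so there is no prepared-statement reuse.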
The 57% time decrease I saw was for the tok_get_all sub itself, not an overall decrease. I'm assuming that the bayes routines run simultaneously with other tests, which is why you don't see the same decrease overall. Still, reducing the number of queries against the SQL server is A Good Thing, IMO, since there's no point incurring more disk I/O, CPU time, and possibly network bandwidth/queries than necessary. In very limited tests, I found that the reduction of queries could make or break the feasibility of using an SQL server for bayes that is NOT on the machine running spamd -- a feature which is excellent for any installation that requires more than one spamd server.
Subject: Re: Possible tok_get_all optimization
Don't get me wrong, I'm all for lowering the number of queries, and I agree that it will help. My benchmark accounts for 6+ million queries as it is now, averaging around 3000 queries per second on the exact same dataset, running the exact same rules, and returning repeatable results. What does that all mean? It means that I plan to change this code in some form. Maybe it's a loop that generates dynamic SQL; that seemed to provide some speedup. Perhaps it's tuning the bunch sizes. Maybe it's making use of a temporary table; I just tried this one and it seemed to help equally well.
Michael
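For comparison, here is a rough sketch of the temporary-table variant mentioned above, again in Python/sqlite3 with an illustrative schema (the real change would live in the Perl SQL store, and the `wanted` table name is hypothetical):

```python
import sqlite3

def tok_get_all_temp_table(conn, userid, tokens):
    """Fetch token rows by loading the wanted tokens into a temporary
    table and doing a single join, instead of issuing many IN (...)
    probes against bayes_token."""
    conn.execute("CREATE TEMP TABLE wanted (token TEXT PRIMARY KEY)")
    conn.executemany("INSERT INTO wanted (token) VALUES (?)",
                     [(t,) for t in tokens])
    rows = conn.execute(
        "SELECT b.token, b.spam_count, b.ham_count, b.atime "
        "FROM bayes_token b JOIN wanted w ON b.token = w.token "
        "WHERE b.id = ?", (userid,)).fetchall()
    conn.execute("DROP TABLE wanted")
    return rows
```

The join pays a fixed setup cost (creating and populating the temp table) in exchange for a single SELECT, so whether it wins over dynamic IN (...) batches depends on how many tokens are being fetched per message.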
Regarding bunch size optimization, I don't think it would ever turn out to be the best solution, due to the varied nature of different people's email paths (and the varying nature of the messages themselves). Finding bunch sizes that are good for everyone and consistent throughout a release's lifespan would be, I'd imagine, next to impossible, not to mention extra work to maintain. Fixed bunch sizes would never be better optimized than a dynamic size anyway, since there's no caching benefit for either method.

Out of curiosity though, you mentioned before, in Bug 3225, that the reason bunches were used had very little to do with caching. Looking through the list archives (including bugzilla), the bunches were introduced by Sidney, in what he described as a way to avoid blowing away the cache. Later, when you implemented them, you never mentioned any other reasons. What was the reason for bunches, if not the cache issue?

Anyway, the temporary table sounds like it may be useful for large tables that host many individual bayes databases. I don't foresee it being any faster than not using one, though, for global databases. Perhaps comparing the number of tokens the current user has in their database (bayes_vars.token_count WHERE ID = x) to the total number of tokens in the database (SUM(bayes_vars.token_count), or something better that is persistently stored in the DB) would determine whether using a temporary table is worth the overhead trade-off.
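The token-count comparison suggested above might look something like the following sketch; the function name and the 10% cutoff are purely hypothetical illustrations, not measured values, and only the bayes_vars.token_count column comes from the actual schema:

```python
import sqlite3

def should_use_temp_table(conn, userid, threshold=0.10):
    """Hypothetical heuristic: use the temp-table join only when this
    user's share of all tokens in the shared table is large enough
    that a join is likely to beat repeated IN (...) probes.

    `threshold` is an illustrative cutoff that would need tuning."""
    user = conn.execute(
        "SELECT token_count FROM bayes_vars WHERE id = ?",
        (userid,)).fetchone()[0]
    total = conn.execute(
        "SELECT SUM(token_count) FROM bayes_vars").fetchone()[0] or 0
    return total > 0 and (user / total) >= threshold
```

On a global database one user owns essentially all the tokens, so this check would always pick the plain query path, matching the expectation above that the temp table only pays off for large multi-user tables.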
Setting for 3.1 and taking.
Committed revision 157164.