SA Bugzilla – Bug 3894
expiry algorithm needs major overhaul
Last modified: 2007-11-29 17:45:30 UTC
There's still the problem from 2.6x that an expiry may not work even though all the tokens in the db are perfectly valid, none negative and none in the future. This happens when the atime values of the tokens span a vast time range, f.i. a year. Such a database may be built by setting bayes_max_db_size to a high value so that the db doesn't expire for a long time. When the threshold is finally reached, the expire algorithm isn't able to calculate an expiry atime delta and quits. I think this is kind of a "secret time bomb" for many sysadmins. I hit this problem on nearly all of our machines running SA: we have big, nicely working databases, but I cannot expire them.

Problem 1 is that the normal expire calculation takes the old atime delta and the number of removed tokens into account. (I didn't look the exact method up in the documentation this time; I remember it from earlier this year.) In the end the ratio figure is too high and SA goes into "fishy" mode.

Here Problem 2 comes into play: it stops after 9 iterations at an atime delta of 22118400. I didn't look in the code, but it seems to double the atime delta with each trial run, so the 10th iteration would be out of scope. That's why it stops there. (BTW: there seems to have been a code change from 2.6x to 3.0: with 2.6x the iterations are shown one by one as each one happens, with 3.0 nothing is shown until all have been done. This means that there can be a long time - minutes - of no activity shown.)

Here is a typical example.
n8:/home/spamd/bayes # sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0      22270          0  non-token data: nspam
0.000          0       7278          0  non-token data: nham
0.000          0     795902          0  non-token data: ntokens
0.000          0 1052059392          0  non-token data: oldest atime
0.000          0 1097528241          0  non-token data: newest atime
0.000          0 1097528594          0  non-token data: last journal sync atime
0.000          0 1097528806          0  non-token data: last expiry atime
0.000          0   29754654          0  non-token data: last expire atime delta
0.000          0         36          0  non-token data: last expire reduction count
debug: bayes: expiry check keep size, 0.75 * max: 750000
debug: bayes: token count: 795902, final goal reduction size: 45902
debug: bayes: First pass? Current: 1097528594, Last: 1096983920, atime: 29754654, count: 36, newdelta: 23335, ratio: 1275.05555555556, period: 43200
debug: bayes: Can't use estimation method for expiry, something fishy, calculating optimal atime delta (first pass)
debug: bayes: expiry max exponent: 9
debug: bayes: atime      token reduction
debug: bayes: ========   ===============
debug: bayes: 43200      791701
debug: bayes: 86400      788611
debug: bayes: 172800     785159
debug: bayes: 345600     776707
debug: bayes: 691200     761556
debug: bayes: 1382400    740407
debug: bayes: 2764800    698622
debug: bayes: 5529600    656777
debug: bayes: 11059200   647817
debug: bayes: 22118400   548366
debug: bayes: couldn't find a good delta atime, need more token difference, skipping expire.
debug: Syncing complete.

I can actually force an expiry of the db above by reducing the max_db_size to something like 100 or 200.000. When I did this the iteration succeeded on the last try and slashed the db down to 162.000 tokens. And after this you hit the same problem again, because it still cannot calculate an atime delta. I had to wait some days until several thousand tokens were added before an expiry would work again. And then it fails again, and so on.
I explained this also with more examples on the users mailing list in the thread "Still "fishy" problems with bayes expiry in SA 3.0", starting with Message-Id: <VA.0000199c.01b63b2d@virtual-access.org>.

My proposal to fix this problem is to use a completely different expiry method. First, the db needs to be sorted before an expiry. If the db is sorted by atime you can just start at the top of the file and remove until you reach the reduction goal, and you are done.

Possible additional local.cf options:

- bayes_removal_percent: % of the whole token count to remove (f.i. 1, 5 or 10)
- bayes_expiry_min_db_size: a minimum number of tokens we want to keep

Example: db of 750.000 tokens, removal %: 1%, minimum: 500.000. When the expiry runs (either started by autoexpire when hitting the max_db_size, or by --force-expire, which then does NOT take the max_db_size into account) it calculates 1% of 750.000 = 7.500, sorts the db by atime, starting with the oldest atime. Then it walks through the db, removes the first 7.500 tokens, and stops. It doesn't really matter that it probably removes some tokens which have the same atime and keeps others (the atime of the last removed token). Each time expiry runs it calculates and removes again, until the minimum of 500.000 is reached. As long as the reduction would shrink the db under the minimum, expiry won't happen. The minimum avoids accidentally slashing the db if your removal % is higher than your growth % over the same time.

This method also avoids the massive timeout problems when autoexpiry hits the "fishy" path and tries to find an atime delta. This can take so long that SA times out and retries on each subsequent spam check until the expiry has succeeded, if ever.
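The proposed sort-based expiry could be sketched like this (illustrative Python, not SA's actual Perl code; `expire_by_sorting` and the in-memory dict are hypothetical stand-ins for the real token store, and the two parameters correspond to the proposed bayes_removal_percent and bayes_expiry_min_db_size options):

```python
def expire_by_sorting(tokens, removal_percent, min_db_size):
    """tokens: dict mapping token -> atime (last-access timestamp).
    Removes the oldest removal_percent% of tokens, unless that would
    shrink the db below min_db_size."""
    to_remove = int(len(tokens) * removal_percent / 100)
    # Guard: never shrink the db below the configured minimum.
    if len(tokens) - to_remove < min_db_size:
        return
    # Sort by atime so the oldest-accessed tokens come first, then drop them.
    oldest_first = sorted(tokens, key=tokens.get)
    for tok in oldest_first[:to_remove]:
        del tokens[tok]

db = {f"tok{i}": 1_000_000 + i for i in range(1000)}
expire_by_sorting(db, 10, 500)
print(len(db))  # 900: the 100 oldest-atime tokens are gone
```

The point of the sketch is that once the data is ordered by atime, the whole expiry is a single slice, with no iterative delta search at all.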
Subject: Re: New: expiry algorithm needs major overhaul

On Mon, Oct 11, 2004 at 03:08:01PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:

> Problem 1 is that the normal expire calculation takes the old atime delta and
> the number of removed tokens into account. (I didn't look the exact method up
> in the documentation this time, I remember it from earlier this year.) In the
> end the ratio figure is too high and SA goes in the "fishy" mode.

Well, it tries to estimate the atime delta for this run from the last run. If the two runs are dissimilar, it won't try estimating and needs to do "phase 1" and figure out an appropriate delta.

> Here comes Problem 2 into play: it stops after 9 iterations at an atime delta
> of 22118400. I didn't look in the code, but it seems to double the atime delta

Right. It starts at 12 hours, then multiplies by 2^n where n is 0..9, or 12h to 256 days.

> with each try run, so the 10th iteration would be out of scope. That's why it
> stops there. (BTW: it seems there has been a code change from 2.6x to 3.0:
> with 2.6x the iterations are shown one by one right when each one happens,
> with 3.0 nothing is shown until all have been done. This means that there can
> be a long time - minutes - of no activity shown.)

Errr, I'm fairly positive that didn't change. There's no way to know any of the numbers until after all the tokens have been gone through. So it can't possibly have shown any until it was done, where it would then show them all.

> My proposal to fix this problem is to use a completely different expiry
> method. First, the db needs to get sorted before an expiry. If the db is
> sorted by atime you can just start at the top of the file and remove until you
> reach the reduction goal. And you are done.

Ah... Therein lies the issue. You can't sort the DB. You get the data in a "random" order (actually based on the hash value of the DB key). You have to go through every token, period.
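The doubling schedule described here can be reconstructed from the two settings mentioned later in the thread (bayes_expiry_period = 43200 and bayes_expiry_max_exponent = 9); the resulting deltas match the atime column in the debug output above:

```python
period = 43200       # bayes_expiry_period: 12 hours in seconds
max_exponent = 9     # bayes_expiry_max_exponent
deltas = [period * 2 ** n for n in range(max_exponent + 1)]
print(deltas[0], deltas[-1])        # 43200 22118400
print(deltas[-1] // 86400, "days")  # 256 days
```

So the 10th candidate delta is 22118400 seconds (256 days), exactly where the reporter's expiry run gives up.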
Note: You could have yet another file (DB or flat file or ...) which keeps a correlation of atime and token in sorted order, but since you're updating token atimes so much it's likely to get out of sync pretty quickly, use more resources, make the code much more complex, and ...

BTW: the atime is actually part of the packed binary value of the DB's key=>value pair, so even getting the atime out is non-trivial.

To do what you are suggesting, you'd have to change the first pass to keep the actual breakdown of delta/token count values, i.e.: 1s 12 tokens, 2s 2 tokens, 3s 5 tokens, etc. Then at the end you can order the deltas and go from there, but that's a LOT of memory usage. The current version essentially does this, but only has the 10 (0..9) exponential pools instead. I actually made this modular enough so you can change the max exponent and the starting number in the configuration, although it still multiplies by 2 each time. The configuration options currently aren't changeable via the configuration file, but it's set up to be able to do so, even via plugin... ;) bayes_expiry_max_exponent (9) and bayes_expiry_period (43200), btw.

Anyway, if you want fast expiry, use SQL. It's mucho faster at expiry because it doesn't have to do all the work the DBM version needs to do (namely copying lots of data around, going through each token multiple times, etc). It's also more likely to be able to do what you want, since you could just add "order by atime" (or whatever) to the SQL to get the tokens in order. I'm actually not sure if SQL expiry works the same as DBM expiry. It probably does but... I didn't write that code. ;)

> Possible additional local.cf options:
> - bayes_removal_percent: % of the whole token count to remove (f.i. 1, 5 or 10)

Not a useful value really -- should it randomly remove tokens?

> - bayes_expiry_min_db_size: a minimum figure of tokens we want to keep
> Example: db of 750.000, removal %: 1%, minimum: 500.000.
When the expiry runs, the keep size is already the larger of 100k or 75% of max_db_size.

Another thing to note about expiry is that it's best effort. Most (all?) other Bayes systems, at least at last check, don't do expiry at all, and you just do learn-on-error. The DB doesn't grow much, and things are generally ok. We generally encourage more learning, and allow expiry when it's easy to do so.
Thanks for the answer. So, if I understand correctly, the problem is inherently based in the dbm format we use, because it doesn't allow sorting. But the problem still remains: many bayes dbs are bound to fail expiry. Shouldn't this be fixed somehow? F.i. using other parameter values or allowing an adjustment would help immensely.

> I actually made this modular enough so you
> can change the max exponent and the starting number in the configuration,
> although it still multiplies by 2 each time. The configuration options
> currently aren't changable via configuration file, but it's set to
> be able to do so, even via plugin... ;) bayes_expiry_max_exponent (9)
> and bayes_expiry_period (43200), btw.

Well, changing these parameters doesn't help much. F.i. doubling from 256 to 512 days is simply too big, and if you decrease the expiry_period amount you don't gain any advantage. What needs to be done is to avoid "phase 1". F.i. if I know that I want to expire 1.000 tokens I can adjust all parameters so that the calculation shows a good ratio and then goes on. If I can do that with a calculator, SA should be able to do it as well.

> Anyway, if you want fast expiry, use SQL.

Thanks for the hint. I may try it, but not in the near future. My emphasis is not so much on speed but on expiry itself. At the moment I have half a dozen bayes dbs I can't expire, and when I force them to expire I have to slash them radically, and after this, expiry fails again. It's a never-ending story. So in my eyes there's something wrong with the current expiry. I should be able to always expire a db unless it's hopelessly corrupted, and these aren't.

> not a useful value really -- should it randomly remove tokens?

Well, this proposal was based on sorting. If the db isn't sorted it would expire randomly, of course. That was not my goal.

> The DB doesn't grow much, and things are generally ok.

It doesn't grow much?
I see that at about 20 MB it seems to stay exactly at that size for quite a while although tokens get added. Is that what you mean? Is that because of the dbm format? Which means there's a lot of bloat in it?
IMO, we should find a way to expire a db if it's too big, regardless of the *correctness* of the expiry algorithm; random dropping of a percentage of the tokens may be the only course.
>> The DB doesn't grow much, and things are generally ok.
>
> It doesn't grow much?

I think he meant _other_ systems don't grow much and don't need expiry, i.e.:

> Most (all?), at least at last check, other Bayes systems
> don't do expiry at all, and you just do learn on error.
> The DB doesn't grow much, and things are generally ok.
> We generally encourage more learning, [...]

I've seen research indicating that learning everything and periodically resetting the database, which is roughly equivalent to expiry, has better results over time than a learn-on-error system. The latter doesn't keep up with long-term overall changes in the spam and ham coming in. Expiry is good and we should do whatever we can to make it practical.
Subject: Re: expiry algorithm needs major overhaul

On Thu, Oct 14, 2004 at 02:52:12PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:

> IMO, we should find a way to expire a db if it's too big, regardless of the
> *correctness* of the expiry algorithm; random dropping of a percentage of the
> tokens may be the only course.

I'm really opposed to a fully random dropping of tokens; space is less important than results, IMO. I think there are two easy options to help deal with the issue though:

- More buckets. Having more buckets gives more decision points, so the likelihood of being able to do an expire is higher. The exponent idea comes from older tokens being less and less useful over time. We generically used 12h and the powers of 2 to keep the code small and fast. We could change the # of buckets and do the exponent thing manually in some form: perhaps keep linear buckets up to a point, then space them further out, going back to a year or something.

- Targeted random expiry. If all else fails, find the bucket with the highest number of tokens, and "randomly" (aka: in the order they come out of the DB) remove the appropriate number of tokens.

I don't know if the "random" expiry should only be a last-resort effort or not (i.e. only if --force-expire is used, or the previous expire count is 0, meaning the expire didn't happen through the normal bucket method the last time through).
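The "targeted random expiry" fallback could look something like this (a hypothetical sketch in Python; it assumes tokens have already been counted into per-delta buckets during the normal first pass, and `targeted_random_expiry` is an illustrative name, not existing SA code):

```python
def targeted_random_expiry(buckets, goal):
    """buckets: dict mapping bucket delta -> list of tokens in that bucket.
    Removes up to `goal` tokens from the fullest bucket, in whatever order
    they happen to be stored ("random" from the DB's point of view)."""
    fullest = max(buckets, key=lambda d: len(buckets[d]))
    removed = buckets[fullest][:goal]
    buckets[fullest] = buckets[fullest][goal:]
    return removed

# Toy example: the 86400s bucket is fullest, so tokens come out of it first.
buckets = {43200: ["a", "b"], 86400: ["c", "d", "e", "f"], 172800: ["g"]}
print(targeted_random_expiry(buckets, 3))  # ['c', 'd', 'e']
```

Because it only fires when the normal bucket method cannot find a usable delta, it guarantees some reduction without discarding tokens uniformly at random across all ages.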
Subject: Re: expiry algorithm needs major overhaul

On Thu, Oct 14, 2004 at 02:38:22PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:

> Thanks for the answer, so if I understand correctly the problem is inherently
> based in the dbm format we use because that doesn't allow for sorting. But the

Well, it's only a problem depending on how you try to use the information. It would be a lot easier if the data were sorted, though, IMO, yes.

> Well, changing these parameters doesn't help much. F.i. doubling from 256 to
> 512 days is simply too big and if you decrease the expiry_period amount you

It depends what your situation is. You can tweak the numbers so that you start smaller and go back further, so 600 and 20 go from 5 minutes back to 3640 days (in days):

0 0 0 0 0 0 0 0 1 3 7 14 28 56 113 227 455 910 1820 3640

> don't gain any advantage. What needs to be done is avoid "phase 1". F.i. if I

Yeah, that's what the estimation thing tries to do -- phase 1 is expensive, and phase 2 has to happen anyway, so if we can guess what values we should use this time based on what happened last time, do so.

> know that I want to expire 1.000 tokens I can adjust all parameters, so that
> the calculation will show a good ratio and then would go on. If I can do that
> with a calculator SA should be able to do that as well.

You could do that with a calculator and a lot of scribbling, sure. If people didn't mind having huge memory usage for expiry, SA could just store all the token values in memory via phase 1, then spend a lot of CPU time figuring out the best possible atime value. But people aren't willing to do that.

> bayes dbs I can't expire and when I go and force them to expire I have to
> slash them radically and after this expiry fails again. It's a never-ending
> story. So, in my eyes there's something wrong with the current expiry. I
> should be able to always expire a db unless it's hopelessly corrupted and
> these aren't.
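The "600 20" example can be checked directly: taking a period of 600 seconds and the 20 doubling steps n = 0..19 shown above, and flooring each delta to whole days, reproduces the quoted table exactly:

```python
period = 600        # 5-minute starting period instead of the default 43200
steps = 20          # buckets n = 0..19
days = [period * 2 ** n // 86400 for n in range(steps)]
print(days)
# [0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 7, 14, 28, 56, 113, 227, 455, 910, 1820, 3640]
```

Note the tradeoff visible in the output: the first eight buckets all floor to 0 days (they cover 5 minutes to about 21 hours), which is where the extra resolution for recently-seen tokens comes from.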
Something's up with how you're doing learning then, IMO. Expiry assumes a fairly consistent amount of learning -- you'll probably get the same number of mails a day, and you'll probably learn the same number each day; based on that, the system should work.

> It doesn't grow much? I see that at about 20 MB it seems to stay exactly at
> that size for quite a while although tokens get added. Is that what you mean?

Yeah, it doesn't grow much, unless you're learning a ton of tokens -- therein the problem is you're learning too much. Regular usage doesn't add a large # of tokens, and the number of new tokens per learn goes down over time as the # of tokens in the DB increases. 20 MB is actually not that large, IMO.

> Is that because of the dbm format? Which means there's a lot of bloat in it?

I don't know the details, but I would guess BDB preallocates space based on the number of entries, since, as you said, the file size doesn't change each time you learn new tokens. This is part of the reason we have to create a new DB file during expire: BDB doesn't free up disk space as you delete tokens, so you have to create a new DB and move the data you want over to it.
Subject: Re: expiry algorithm needs major overhaul

On Thu, Oct 14, 2004 at 03:15:10PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:

> I think he meant _other_ systems don't grow much and don't need expiry, i.e.,

That too. Most other folks are doing learn-on-error, so in theory the amount of learning necessary decreases over time, down to some non-zero asymptotic level.

> I've seen research indicating that learn everything and periodically resetting
> the database, which is roughly equivalent to expiry, has better results over
> time than a learn on error system. The latter doesn't keep up with long term
> overall changes in the spam and ham coming in. Expiry is good and we should do
> whatever we can to make it practical.

I've seen research indicating that learn-on-error gives better results, and research indicating LOE and constant learning give equal results. Our model has been that more/constant learning is a good thing, and that results are more important than disk usage; hence expiry is currently a best-effort operation. If we want to shift the model so that resource usage is a more important factor, that's fine, but we need to consciously agree to that. Channeling quinlan for a minute: we would also need to do testing to make sure that the results aren't adversely affected by a higher amount of expiry.

This is one of the reasons, btw, I was thinking about making expiry a plugin. I don't think a single general algorithm is going to work for everybody, and I don't think we want to write so much complexity into the code to have it figure out which method may work best in the current situation, etc. So a plugin would make it easy to have the current model, LRU, etc.
an expiry plugin -- good idea!

I think that's a great idea. I was about to say, some people *do* want strict size-limiting of bayes dbs, for example large sites where users have strict quotas. OTOH, some people want non-size-limited bayes dbs, so the db can grow to possibly double its current size if expiring it may have a bad effect on accuracy. I'm leaning towards thinking the former is the more typical case nowadays.
I think a plugin is a good idea, too.

You mentioned somewhere that learning may be too much/too fast. The db I quote with a dump in my bug report learns about 2.000 tokens a day. Does this sound like too much or too little? Maybe a thousand messages per day go through the system, not more. I have seen that the increase on other systems with more mail traffic is significantly higher, but the expiry problem is the same.

What about an expiry algorithm which assumes that all tokens are equally spread over the time range in the db? F.i. with a db of 800.000 and a max size of 1.000.000 the reduction target is 750.000. That makes a reduction count of 50.000 tokens, which is 6.25% of all tokens. Now we take the delta between oldest and newest atime. In my quoted dump:

0.000          0 1052059392          0  non-token data: oldest atime
0.000          0 1097528241          0  non-token data: newest atime

delta = 45468849. We take 100 - 6.25 = 93.75% of that = 42627045. This is our new atime delta, so we expire everything older than 1097528241 - 42627045 = 1054901196.

If we reach the reduction goal of 50.000 + 10% during the pass (which means we are going to expire more tokens than we want) we either stop or go on removing, depending on an option. If we don't reach the reduction goal by expiring the 6.25% oldest tokens, we do another pass based on the token count we still need. F.i. if we expired 40.000 tokens we still want to remove 10.000, which is 20% of the original goal. So let's try with a new delta goal of 6.25 * 20% = 1.25%. We calculate the new delta between the newest atime and the "new" oldest atime and then expire the 1.25% oldest tokens. We repeat this until we have reached 90% of the original reduction goal. The second pass may not even be necessary.

So, this new method would be based solely on the access age of the tokens and would try to expire x% of the oldest tokens based strictly on the time range.
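The arithmetic of this proposal can be worked through with the atime values from the quoted dump (a sketch of the proposed calculation only, not existing SA code; the variable names are illustrative):

```python
# Atimes taken from the sa-learn --dump magic output quoted above.
oldest, newest = 1052059392, 1097528241
ntokens, max_size = 800_000, 1_000_000

goal = ntokens - int(0.75 * max_size)  # reduction count: tokens above the keep size
frac = goal / ntokens                  # fraction of all tokens to expire (6.25%)
span = newest - oldest                 # total atime range covered by the db
new_delta = int(span * (1 - frac))     # keep the newest 93.75% of the range
cutoff = newest - new_delta            # expire every token with atime < cutoff
print(goal, new_delta, cutoff)         # 50000 42627045 1054901196
```

This reproduces the numbers in the worked example above in one pass of arithmetic, with no iterative delta search; the second corrective pass would only run if the tokens turn out not to be evenly spread over the range.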
The current method hopes to reuse old values for a new run, or tries to find a reduction count that resembles the reduction goal by jumping through the db in ever-doubling steps. This is bound to fail, because the steps become so big toward the end that the likelihood of finding a fitting count decreases. So the chance of a successful "pass 1" run is slim.
Any progress on this? The expiry algorithm *still* needs a major overhaul as of 3.2.x. It clearly performs poorly on high-volume sites.