SA Bugzilla – Bug 2819
Bayes expiration issue after DB format upgrade
Last modified: 2004-02-15 08:41:17 UTC
Hi,

When I upgraded from SA 2.55 to SA 2.60, the format of my Bayes DB was upgraded automagically at the first expiry run. The previous format recorded message numbers for tokens, where the new one records token atimes instead. When the DB got converted, all the tokens were given the atime of the moment the conversion was performed. My Bayes DB currently contains 186,337 of these old tokens, all with an atime of 1065955165 (Sun Oct 12 12:39:25 2003).

The problem is that my DB is now growing bigger and bigger, going far beyond the limit set by "bayes_expiry_max_db_size": if the expiry run were to expire all 186,337 "1065955165" tokens at once, the DB would become too small, so in the end the expiry run doesn't expire anything.

The problem is twofold. First, my DB is growing much larger than expected and configured, and I believe it won't actually expire anything before the DB is twice its configured size. Second, in this abnormal condition, the system performs an "expire for nothing" run every 12 hours, which just wastes system resources analyzing a huge DB and then doing nothing.

Cheers.
And I can't force expiration even by greatly reducing the target size of my DB. Example: the current status of my DB is:

[michel@totor mail.ut]$ ./ckdb-stat
bayes db version..............: 2
nspam.........................: 2677
nham..........................: 11409
ntokens.......................: 335434
oldest atime..................: 1970-01-01 01:00:00
newest atime..................: 2003-12-08 18:39:24
last journal sync atime.......: 2003-12-08 18:39:32
last expiry atime.............: 2003-12-08 18:40:04
last expire atime delta.......: 86400
last expire reduction count...: 1978

If I first set bayes_expiry_max_db_size to something like 120,000 and then run:

sa-learn --showdots --force-expire -D

I get:

[...]
debug: bayes: found bayes db version 2
debug: bayes: expiry check keep size, 75% of max: 90000
debug: bayes: expiry keep size too small, resetting to 100,000 tokens
debug: bayes: token count: 335434, final goal reduction size: 235434
debug: bayes: First pass? Current: 1070905172, Last: 1070905106, atime: 86400, count: 1978, newdelta: 725, ratio: 119.026289180991
debug: bayes: something fishy, calculating atime (first pass)
debug: bayes: couldn't find a good delta atime, need more token difference, skipping expire.
debug: Syncing complete.
debug: bayes: 4700 untie-ing
debug: bayes: 4700 untie-ing db_toks
debug: bayes: 4700 untie-ing db_seen
debug: bayes: files locked, now unlocking lock
debug: unlock: 4700 unlink /var/qmail/.spamassassin/bayes.lock
synced Bayes databases from journal in 0 seconds: 53 unique entries (53 total entries)

So what's up, Doc?
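For reference, the keep-size arithmetic in those debug lines can be sketched as follows. This is my reconstruction from the -D output above, not the actual SpamAssassin Perl code; the function and constant names are illustrative.

```python
# Reconstruction of the keep-size arithmetic shown in the -D output.
# The 100,000-token floor is the value the "resetting to 100,000
# tokens" debug line reports.
MIN_KEEP = 100_000

def expiry_goal(ntokens, max_db_size):
    keep = int(max_db_size * 0.75)        # "expiry check keep size, 75% of max"
    if keep < MIN_KEEP:                   # "expiry keep size too small"
        keep = MIN_KEEP
    return keep, max(0, ntokens - keep)   # "final goal reduction size"

# The run above: 335434 tokens, bayes_expiry_max_db_size 120000
print(expiry_goal(335_434, 120_000))  # (100000, 235434)
```

This matches the debug output: 75% of 120,000 is 90,000, which is below the floor, so the keep size becomes 100,000 and the reduction goal 235,434.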
Well, this is the expected behavior. When the conversion occurs, the atime is set to the current time; from there, atimes are updated when messages are scanned or learned.

> The problem is twofold. First, my DB is growing much larger than expected
> and configured, and I believe it won't actually expire anything before the
> DB is twice its configured size.

Well, it'll expire when there's a large enough difference between token atimes in the DB. That doesn't translate into a size easily. For instance, you could learn a billion new tokens with the same atime and it wouldn't help you at all.

> Second, in this abnormal condition, the system performs an "expire for
> nothing" run every 12 hours, which just wastes system resources analyzing
> a huge DB and then doing nothing.

Well, how is SA supposed to know it's an "expire for nothing" unless it tries to do the expire?

In your example, you have 335434 tokens, and want to remove 235434 of them to leave 100k tokens (75% of your max_db_size is 90k, which is too small). Expiry attempted to use your last expire to estimate the current one, but the last expire and the current expire are too different for an accurate estimate, so it does a first pass to calculate the right atime to use. After going through, it determined that no atime value is good to use, i.e. even 12 hours (the minimum atime value) would expire too many tokens; therefore no expiry is possible.

What you want is to scan/learn around 100k tokens. Expiry occurs when an atime delta will expire at least 1000 and up to the wanted number of tokens.
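The estimate-then-first-pass step described here can be sketched from the "First pass?" debug line in the earlier output (a reconstruction, not the actual Perl implementation; the function name and the use of truncation are my assumptions):

```python
# Hypothetical reconstruction of the estimate behind the debug line
# "atime: 86400, count: 1978, newdelta: 725, ratio: 119.026...".
def estimate_new_delta(last_delta, last_count, goal):
    # How much bigger is this run's reduction goal than what the
    # previous run's delta actually removed?
    ratio = goal / last_count          # 235434 / 1978 ~ 119.03
    # Scale the previous delta down accordingly (truncated).
    return int(last_delta / ratio), ratio

delta, ratio = estimate_new_delta(86_400, 1_978, 235_434)
print(delta)  # 725, matching "newdelta: 725" above
```

A ratio this far from 1 means the last expire is a poor predictor of this one, which is what triggers the "something fishy" fallback to an exact first pass.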
Theo Van Dinter wrote:
> Well, it'll expire when there's a large enough difference between token
> atimes in the DB. That doesn't translate into a size easily. For instance,
> you could learn a billion new tokens with the same atime and it wouldn't
> help you at all.
[...]
> What you want is to scan/learn around 100k tokens.

Well, my DB currently holds more than 335,000 tokens, about 186,000 of which are "old" converted ones with an atime of October 12, BUT about 150,000 tokens are new ones that have been learned in the past 2 months.

With 150,000 new tokens, I expected that if I set the number of tokens to keep to something below 150,000, expiry would actually expire the old tokens. Which didn't happen, even when lowering the number of tokens to keep down to 100,000...
Subject: Re: [SAdev] Bayes expiration issue after DB format upgrade

On Mon, Dec 08, 2003 at 11:33:44AM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Well, my DB currently holds more than 335,000 tokens, about 186,000 of which
> are "old" converted ones with an atime of October 12, BUT about 150,000
> tokens are new ones that have been learned in the past 2 months.

Are the messages new from the last 2 months, or did you learn old messages? Can you do a --dump and attach the results to the ticket?
Theo Van Dinter wrote:
> Are the messages new from the last 2 months, or did you learn old messages?

All new messages, received since my DB was upgraded 2 months ago. This makes about 150,000 new tokens with all different atimes, all more recent than the 186,000 old "converted" ones.

> Can you do a --dump and attach the results to the ticket?

My Bayes DB is more than 11 MB in size. A dump would be still larger (and contains personal data from legitimate messages...)
Subject: Re: [SAdev] Bayes expiration issue after DB format upgrade

On Tue, Dec 09, 2003 at 04:54:04AM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> > Can you do a --dump and attach the results to the ticket?
>
> My Bayes DB is more than 11 MB in size. A dump would be still larger (and
> contains personal data from legitimate messages...)

Fair enough. How about just the magic tokens and the atimes from all the tokens? Something like:

sa-learn --dump magic > foo
sa-learn --dump data | awk '{print $4}' >> foo

Then send up foo? I really just want to see the distribution of atimes; the tokens themselves aren't useful to me.
Created attachment 1616 [details] atimes in my Bayes DB

Bzip2-compressed file with contents as requested by felicity.
Subject: Re: [SAdev] Bayes expiration issue after DB format upgrade

On Tue, Dec 09, 2003 at 08:37:14AM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Bzip2-compressed file with contents as requested by felicity.

I really ought to add something like this to the debug output. It's useful for debugging, and it's interesting to see. Anyway, here's your atime breakdown.

"atime" is the expiry atime delta, i.e. delete anything whose atime is that many seconds ago or older (12h, 24h, 2d, 4d, 8d, 16d, 32d, 64d, 128d, 256d). "tokens in range" is the number of tokens between the current atime and the previous one (i.e. 0-43200, 43201-86400, 86401-172800, etc.). "total kept" is the number of tokens that would be kept given the atime value.

atime      tokens in range   total kept
43200                 4990         4990
86400                 3919         8909
172800                4135        13044
345600                5669        18713
691200               17091        35804
1382400              23219        59023
2764800              39142        98165
5529600             239876       338041
11059200                 0       338041
22118400                 0       338041

FYI: there are 8 tokens with atime 0, so they'll be deleted by all of the atime deltas; hence the difference between the max total kept and the number of tokens in the DB.

As we can see, at 32 days (2764800) we're close to being able to expire, assuming expiry was going to keep the minimum of 100k tokens. (Note: expiry actually deals with tokens to remove, not tokens kept, but it's the same thing with an addition operation.) So keep learning/scanning tokens, and expiry will eventually happen by itself, as designed. :)

The slightly more technical/algorithmic answer is that expiry finds it can expire at 5529600 to get the 8 tokens at atime 0, but since that's fewer than 1000 tokens, there's really no point in doing the expire. So it looks at 2764800, and stops because too many tokens would be removed. (As the atime delta goes down, the number of tokens to expire goes up, so there's no point in going below the first value that expires too many tokens...)
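To make the algorithmic answer concrete, here is a small sketch of that first-pass delta selection (my reconstruction, not the SpamAssassin code), using the counts from the table above:

```python
# Sketch of the first-pass selection: scan the doubling deltas from
# largest to smallest, accept the first one that would expire at least
# 1000 tokens, and give up as soon as a delta would expire more than
# the reduction goal. "Total kept" figures are from the table above;
# NTOKENS adds the 8 atime-0 tokens that every delta removes.
TOTAL_KEPT = {43200: 4990, 86400: 8909, 172800: 13044, 345600: 18713,
              691200: 35804, 1382400: 59023, 2764800: 98165,
              5529600: 338041, 11059200: 338041, 22118400: 338041}
NTOKENS = 338_041 + 8

def pick_delta(reduction_goal):
    for delta in sorted(TOTAL_KEPT, reverse=True):
        removed = NTOKENS - TOTAL_KEPT[delta]
        if removed > reduction_goal:   # would expire too many: give up
            return None
        if removed >= 1000:            # big enough to be worth an expire
            return delta
    return None

# Keeping the 100k minimum: 2764800 would remove 239884 > 238049, so
# no delta works, exactly as the debug output reported.
print(pick_delta(NTOKENS - 100_000))  # None
# With a slightly larger reduction goal, 32 days becomes usable.
print(pick_delta(250_000))  # 2764800
```

The second call illustrates the "keep learning and it will eventually expire" advice: once the goal outgrows the 32-day bucket, a delta is found.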
explanation in last message.
Thanks for the explanations, Theo.

Anyway, there's something I feel is not-so-good in the current expiry mechanism: the fact that a Bayes DB configured with a bayes_expiry_max_db_size of (let's say) 150,000 tokens can grow much, much bigger without expiry purging it, if too many tokens have too close an atime, such as in a DB conversion scenario. I think that if somebody wants to give a size limit to his DB, he will probably be unhappy to find himself with a DB 3 or more times bigger, maybe for disk space reasons, maybe for performance reasons.

I believe the expiry process should not let the DB grow much bigger than the configured max size, and that a safeguard should be implemented to prevent this from happening. Maybe by expiring a number of the older tokens, if not all of them (based upon the number of times they've been encountered, for example?).

I also wonder whether the interval growing by powers of 2 may result in intervals so long, for big DBs, that it might have unwanted effects...
reopening because it'd be good to hash out the expiry issues.
Subject: Re: [SAdev] Bayes expiration issue after DB format upgrade

On Tue, Dec 09, 2003 at 03:25:11PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Anyway, there's something I feel is not-so-good in the current expiry
> mechanism: the fact that a Bayes DB configured with a
> bayes_expiry_max_db_size of (let's say) 150,000 tokens can grow much, much
> bigger without expiry purging it, if too many tokens have too close an
> atime, such as in a DB conversion scenario.

Well, this is really a problem of incorrect expectations. Most Bayes systems have no expiry mechanism at all. We feel that doing more than "learn on error" is what people should be doing (à la the "likelihood of ham vs spam" instead of just the binary "ham or spam"), so we have the expiry method to help allow for more training without enormous growth. That said, the expiry process is "best effort", as stated in the documentation.

> I believe the expiry process should not let the DB grow much bigger than the
> configured max size, and that a safeguard should be implemented to prevent
> this from happening. Maybe by expiring a number of the older tokens, if not
> all of them (based upon the number of times they've been encountered, for
> example?).
>
> I also wonder whether the interval growing by powers of 2 may result in
> intervals so long, for big DBs, that it might have unwanted effects...

Ah, the issue is that a "last ditch" attempt is actually not usually needed in standard usage. In my POV (and as the expiry algorithm is coded) there are 3 types of usage out there:

1) People who receive very little mail, and consequently have very little growth in their DB. At expiry time, they will end up having their tokens spread over a relatively long timeline (say 1 month to 1 year). In this case, the exponential expiry times help quickly determine which tokens to remove.
Since you want to expire all of the tokens with a given atime at the same time (you can't say that token X is more important than token Y), sorting by atime and dropping the oldest $X tokens isn't valid. Since expiry will not occur very often, these people can take the hit of a longer-running, but more accurate, expiry (one that includes the first pass).

2) People who receive a moderate amount of mail, and consequently have moderate growth in their DB. At expiry time, they'll have tokens spread over a shorter timeline (say 1 day to 1 month). This is what I consider "normal" usage. Exponential expiry times are still helpful here, but to make this case faster we assume the amount of learning is constant over the timeline and estimate the age of tokens to expire from the last expiry run. This ends up being less accurate, as estimations are by definition, but much, much faster. If the code detects that the estimation may be too inaccurate (aka "something looks fishy"), it will do a "first pass" to generate an accurate expiry atime value.

3) People who receive a large amount of mail, and consequently have large growth in their DB. At expiry time, they'll have tokens spread over a very short timeline (say 1 to 24 hours). Mostly a "ditto" of #2, except here the problem becomes the accuracy of the database, not the accuracy of the expire. Bayes works well with non-common tokens (i.e. "VIAGRA" vs "the"), because they end up being polar in their ham vs spam probability. Consequently, those tokens are likely not to be seen as often, and therefore short expiry times will remove them, making the Bayes calculations less accurate overall. While we could make an algorithm to go through and remove tokens based on some form of "commonality" and atime ratio, it would make the expire run fairly CPU/memory intensive, and these are exactly the people who don't want that due to the volume of mail they're going through.
The algorithm from #2 works OK here: assume a standard inflow volume and estimate based on the last expire, keeping the expire time short, but the cost is a less accurate DB overall. To these people I typically say: learn fewer tokens, become a #2 person, reap the benefits.

Wow, this is long... Anyway, the point of all this is that in the "common" usage described above, "last ditch" isn't going to help anyone. The issue you raised is one of a "non-common" usage: the conversion from an old DB. Which definitely happens, and essentially kicks people from a #1 to a #2, or a #2 to a #3. In your case, you seem to be somewhere between #1 and #2 normally, based on the atime breakdown previously posted. Unfortunately the #1 people get the worst-case scenario, since they'd normally have to wait at least 1 month to learn enough tokens to do an expire run, but have enough tokens in the DB to try expiry at every opportunity (12 hours by default). The more mail they receive, the faster the problem clears itself.

The problem becomes what to do about these folks. "Delete your DB and start over" doesn't really work, since it'll take a long time to get the DB functional again. Saying "just remove the converted tokens" doesn't work, since they could potentially be the whole DB. I guess "instead of a single atime, put in random atimes based on polarization" would solve the expiry issue (more polarized tokens get a more recent atime, less polarized tokens get an older atime)... but 1) I don't actually know what this would do to the accuracy of the DB, and 2) it's wrong to guess based on polarization, since the more common tokens will just get learned again quickly and you end up with the same sort of issue (although I think this may be better than the current method).

Another possibility is that during the upgrade, we fake an atime based on the old msgcount atime value. But there are several things I don't like.
1) The old atime values rolled over when msgcount hit 65535, so in the worst case the old DB rolled over to all-0 atimes, and then the conversion gives us the same situation we have now. 2) We know the range is 0-65535 for the old atime, but how do we map that to an epoch atime? There'd have to be some multiplier, and a calculated base atime of, say, 6 months ago.

Hrm... perhaps I've convinced myself to do this last one? It has the same worst case, but it should have a better common case. Then again, for 2.70, how many people are going to be converting from 2.5x, where this will be useful? Perhaps this is a "we should have done this in hindsight" thing?
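For what it's worth, the mapping mused about in 2) could look something like this. This is purely a sketch of the proposal, not shipped code; the 6-month span and the linear multiplier are exactly the assumptions under discussion.

```python
# Hypothetical conversion: map the old 16-bit msgcount pseudo-atime
# (0..65535) linearly onto a real epoch range, ending "now" and
# starting an assumed 6 months (182 days) earlier.
def fake_atime(old_value, now, span=182 * 86400):
    base = now - span                     # calculated base atime
    return base + old_value * span // 65535

now = 1_070_905_172                       # epoch from the debug run above
print(fake_atime(65535, now) == now)      # newest old token maps to now
print(fake_atime(0, now) == now - 182 * 86400)  # oldest maps to the base
```

The worst case is unchanged (an all-0 old DB still converts to a single shared atime, the base), but the common case spreads converted tokens across the span so expiry can find a workable delta.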
*** Bug 2918 has been marked as a duplicate of this bug. ***
OK, I'm re-closing this one since I think it's essentially an "in hindsight we should have done this" issue now... :|