Bug 3894 - expiry algorithm needs major overhaul
Summary: expiry algorithm needs major overhaul
Status: NEW
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: spamassassin
Version: 3.0.0
Hardware: Other Linux
Importance: P5 normal
Target Milestone: Future
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2004-10-11 15:08 UTC by Kai Sch
Modified: 2007-11-29 17:45 UTC
CC List: 0 users



Description Kai Sch 2004-10-11 15:08:00 UTC
There's still the problem from 2.6x that an expiry may not work even though
all the tokens in the db are perfectly valid, none of them negative or in the
future. This happens when the atime values of the tokens span a vast time
range, for instance a year. Such a database can be built by setting
bayes_max_db_size to a high value so that the db doesn't expire for a long
time. When the threshold is finally reached, the expire algorithm isn't able
to calculate an expiry atime delta and gives up.
I think this is something of a "secret time bomb" for many sysadmins. I hit
this problem on nearly all of our machines running SA. We have big, nicely
working databases, but I cannot expire them.
Problem 1 is that the normal expire calculation takes the old atime delta and
the number of previously removed tokens into account. (I didn't look the
exact method up in the documentation this time; I remember it from earlier
this year.) In the end the ratio figure is too high and SA goes into its
"fishy" mode.
Here Problem 2 comes into play: it stops after 9 iterations at an atime delta
of 22118400. I didn't look into the code, but it seems to double the atime
delta with each try, so the 10th iteration would be beyond the limit. That's
why it stops there. (BTW: there seems to have been a code change from 2.6x to
3.0: with 2.6x the iterations are shown one by one as they happen, with 3.0
nothing is shown until all of them have been done. This means there can be a
long stretch - minutes - with no visible activity.)

Here is a typical example.

n8:/home/spamd/bayes # sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0      22270          0  non-token data: nspam
0.000          0       7278          0  non-token data: nham
0.000          0     795902          0  non-token data: ntokens
0.000          0 1052059392          0  non-token data: oldest atime
0.000          0 1097528241          0  non-token data: newest atime
0.000          0 1097528594          0  non-token data: last journal sync atime
0.000          0 1097528806          0  non-token data: last expiry atime
0.000          0   29754654          0  non-token data: last expire atime delta
0.000          0         36          0  non-token data: last expire reduction count

debug: bayes: expiry check keep size, 0.75 * max: 750000
debug: bayes: token count: 795902, final goal reduction size: 45902
debug: bayes: First pass?  Current: 1097528594, Last: 1096983920, atime: 29754654, count: 36, newdelta: 23335, ratio: 1275.05555555556, period: 43200
debug: bayes: Can't use estimation method for expiry, something fishy, calculating optimal atime delta (first pass)
debug: bayes: expiry max exponent: 9
debug: bayes: atime     token reduction
debug: bayes: ========  ===============
debug: bayes: 43200     791701
debug: bayes: 86400     788611
debug: bayes: 172800    785159
debug: bayes: 345600    776707
debug: bayes: 691200    761556
debug: bayes: 1382400   740407
debug: bayes: 2764800   698622
debug: bayes: 5529600   656777
debug: bayes: 11059200  647817
debug: bayes: 22118400  548366
debug: bayes: couldn't find a good delta atime, need more token difference, skipping expire.
debug: Syncing complete.
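
To make the "First pass?" line above concrete, here is a rough Python sketch
of the estimation step. The formulas are inferred from the numbers in the
debug output, not taken from the SpamAssassin source, so treat it purely as
an illustration:

# Rough reconstruction of the estimation check, inferred from the debug
# output above (goal 45902, last count 36, last atime delta 29754654).
def estimate_newdelta(goal, last_count, last_atime_delta):
    if last_count == 0:
        return 0, float("inf")
    # how far apart are this run's goal and the last run's actual reduction?
    ratio = goal / last_count if goal > last_count else last_count / goal
    # scale the old delta by the same proportion
    newdelta = int(last_atime_delta * last_count / goal)
    return newdelta, ratio

newdelta, ratio = estimate_newdelta(45902, 36, 29754654)
print(newdelta, ratio)   # 23335 1275.055... -- the ratio is enormous, so SA
                         # distrusts the estimate ("fishy") and does a first pass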

I can actually force an expiry of the db above by reducing the max_db_size to
something like 100,000 or 200,000. When I did this, the iteration succeeded
on the last try and slashed the db down to 162,000 tokens. And after this you
hit the same problem again, because it still cannot calculate an atime delta.
I had to wait some days until several thousand tokens had been added before
an expiry would work again. And then it fails again, and so on.
I also explained this, with more examples, on the users mailing list in the
thread "Still "fishy" problems with bayes expiry in SA 3.0", starting with
Message-Id: <VA.0000199c.01b63b2d@virtual-access.org>.

My proposal to fix this problem is to use a completely different expiry
method. First, the db needs to be sorted before an expiry. If the db is
sorted by atime you can just start at the top of the file and remove tokens
until you reach the reduction goal. And you are done.
Possible additional local.cf options:
- bayes_removal_percent: percentage of the whole token count to remove (for
  instance 1, 5 or 10)
- bayes_expiry_min_db_size: a minimum number of tokens we want to keep
Example: a db of 750,000 tokens, removal percentage 1%, minimum 500,000. When
the expiry runs (either started by autoexpire on hitting max_db_size, or by
--force-expire, which then does NOT take max_db_size into account), it
calculates 1% of 750,000 = 7,500, sorts the db by atime starting with the
oldest, then walks through the db and removes the first 7,500 tokens. Then it
stops. (A minimal sketch of this follows below.)
It doesn't really matter that it will probably remove some tokens that have
the same atime as others it keeps (the atime of the last removed token). Each
time expiry runs it recalculates and removes again, until the minimum of
500,000 is reached. If the reduction would shrink the db below the minimum,
the expiry won't happen.
The minimum avoids accidentally slashing the db if your removal percentage is
higher than your growth percentage over the same period.
This method also avoids the massive timeout problems when autoexpiry hits
the "fishy" path and tries to find an atime delta. That can take so long that
SA times out and retries on each subsequent spam check until the expiry has
succeeded, if ever.
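A minimal Python sketch of the proposed method, assuming the token store
could be read as token/atime pairs; bayes_removal_percent and
bayes_expiry_min_db_size are the hypothetical options proposed above, not
existing settings:

# Sketch only: assumes the token store can be read as a dict of token -> atime.
def proposed_expire(tokens, removal_percent=1.0, min_db_size=500_000):
    n = len(tokens)
    to_remove = int(n * removal_percent / 100)
    if n - to_remove < min_db_size:
        return 0                                  # would shrink below the minimum
    # sort by atime, oldest first, and drop the first `to_remove` tokens
    for tok in sorted(tokens, key=tokens.get)[:to_remove]:
        del tokens[tok]
    return to_remove

# Example from above: 750,000 tokens, 1% removal, 500,000 minimum
# -> the 7,500 oldest tokens are removed.
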
Comment 1 Theo Van Dinter 2004-10-11 15:30:59 UTC
Subject: Re:  New: expiry algorithm needs major overhaul

On Mon, Oct 11, 2004 at 03:08:01PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Problem 1 is that the normal expire calculation takes the old atime delta and 
> the number of removed tokens into account. (I didn't look the exact method up 
> in the documentation this time, I remember it from earlier this year.) In the 
> end the ratio figure is too high and SA goes in the "fishy" mode. 

Well, it tries to estimate the atime delta for this run from the last run.  If
the two runs are dissimilar, it won't try estimating and needs to do "phase 1"
and figure out an appropriate delta.

> Here comes Problem 2 into play: it stops after 9 iterations at an atime delta 
> of 22118400. I didn't look in the code, but it seems to double the atime delta 

Right.  It starts at 12 hours, then multiplies by 2^n where n is 0..9, i.e.
12 hours up to 256 days.
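
As a quick check (Python, just to reproduce the numbers), the ten candidate
deltas are:

deltas = [43200 * 2**n for n in range(10)]   # 12h * 2^n, n = 0..9
print(deltas)   # 43200, 86400, ..., 22118400 (256 days) -- the atime column
                # of the debug table quoted in the bug description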

> with each try run, so the 10th iteration would be out of scope. That's why it 
> stops there. (BTW: it seems there has been a code change from 2.6x to 3.0: 
> with 2.6x the iterations are shown one by one right when each one happens, 
> with 3.0 nothing is shown until all have been done. This means that there can 
> be a long time - minutes - of no activity shown.)

Errr, I'm fairly positive that didn't change.  There's no way to know any of
the numbers until after all the tokens have been gone through, so it can't
possibly have shown any until it was done, at which point it shows them all.

> My proposal to fix this problem is to use a completely different expiry 
> method. First, the db needs to get sorted before an expiry. If the db is 
> sorted by atime you can just start at the top of the file and remove until you 
> reach the reduction goal. And you are done. 

Ah...  Therein lies the issue.  You can't sort the DB.  You get the data
in a "random" manner (actually based on the hash value of the DB key).
You have to go through every token, period.

Note: You could have yet another file (DB or flat file or...) which
keeps a correlation of atime and token in sorted order, but since you're
updating token atimes so much it's likely to get out of sync pretty
quickly, and use more resources, and make the code much more complex,
and ...

BTW: the atime is actually part of the packed binary value piece of the
DB's key=>value pair, so even getting the atime out is non-trivial.
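
Just to illustrate the point (the field layout here is made up, not SA's real
format), unpacking an atime from a packed value looks something like this:

import struct

# hypothetical layout: spam count, ham count, atime packed as three
# unsigned 32-bit integers; SpamAssassin's real format differs
packed = struct.pack("!LLL", 12, 3, 1097528241)
spam, ham, atime = struct.unpack("!LLL", packed)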

To do what you are suggesting, you'd have to change the first pass
to keep the actual breakdown of delta/token count values.  i.e.: 1s
12 tokens, 2s 2 tokens, 3s 5 tokens, etc.  Then at the end, you can
order the deltas and go from there, but that's a LOT of memory usage.
The current version essentially does this, but only has the 10 (0..9)
exponential pools instead.  I actually made this modular enough that you
can change the max exponent and the starting number in the configuration,
although it still multiplies by 2 each time.  The configuration options
currently aren't changeable via the configuration file, but it's set up to
be able to do so, even via plugin... ;) bayes_expiry_max_exponent (9)
and bayes_expiry_period (43200), btw.
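
For illustration, here is a rough Python sketch of that first pass: one walk
over every token, tallying how many would be removed at each of the ten
candidate deltas. The names are made up; this is not SpamAssassin's actual
code:

import time

def first_pass_tally(atimes, period=43200, max_exponent=9, now=None):
    # atimes: iterable of token access times (seconds since the epoch)
    now = now or int(time.time())
    deltas = [period * 2**n for n in range(max_exponent + 1)]
    removed = dict.fromkeys(deltas, 0)
    for atime in atimes:                 # one pass over the whole DB
        age = now - atime
        for delta in deltas:
            if age >= delta:             # token would be expired at this cutoff
                removed[delta] += 1
    # afterwards, pick the delta whose reduction count best matches the goal;
    # in the run quoted above even the 256-day delta would remove 548366
    # tokens against a goal of 45902, so no delta fit and the expire was skipped
    return removed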

Anyway, if you want fast expiry, use SQL.  It's mucho faster at expiry
because it doesn't have to do all the work the DBM version needs to do
(namely, copying lots of data around, going through each token multiple
times, etc).  It's also more likely to be able to do what you want since
you could just add in "order by atime" (or whatever) to the SQL to get
the tokens in order.  I'm actually not sure if SQL expiry works the same
as DBM expiry.  It probably does but...  I didn't write that code. ;)

> Possible additional local.cf options:
> - bayes_removal_percent: % of the whole token count to remove (f.i. 1, 5 or 10)

not a useful value really -- should it randomly remove tokens?

> - bayes_expiry_min_db_size: a minimum figure of tokens we want to keep
> Example: db of 750.000, removal %: 1%, minimum: 500.000. When the expiry runs 

it's already the larger of 100k or 75% of max_db_size.



Another thing to note about expiry is that it's best effort.  Most (all?)
other Bayes systems, at least at last check, don't do expiry at all, and you
just do learn on error.  The DB doesn't grow much, and things are generally
ok.  We generally encourage more learning, and allow expiry when it's easy to
do so.

Comment 2 Kai Sch 2004-10-14 14:38:21 UTC
Thanks for the answer. So if I understand correctly, the problem is inherent
in the dbm format we use, because it doesn't allow sorting. But the problem
still remains: many bayes dbs are bound to fail expiry. Shouldn't this be
fixed somehow? For instance, using other parameter values or allowing them to
be adjusted would help immensely.

> I actually made this modular enough so you
> can change the max exponent and the starting number in the configuration,
> although it still multiplies by 2 each time.  The configuration options
> currently aren't changable via configuration file, but it's set to
> be able to do so, even via plugin... ;) bayes_expiry_max_exponent (9)
> and bayes_expiry_period (43200), btw.

Well, changing these parameters doesn't help much. For instance, doubling
from 256 to 512 days is simply too big a step, and if you decrease the
expiry_period you don't gain any advantage. What needs to be done is to avoid
"phase 1". For instance, if I know that I want to expire 1,000 tokens I can
adjust all the parameters so that the calculation shows a good ratio and then
proceeds. If I can do that with a calculator, SA should be able to do it as
well.

> Anyway, if you want fast expiry, use SQL.

Thanks for the hint. I may try it, but not in the near future. My emphasis is
not so much on speed as on expiry itself. At the moment I have half a dozen
bayes dbs I can't expire, and when I force them to expire I have to slash
them radically, after which expiry fails again. It's a never-ending story.
So, in my eyes there's something wrong with the current expiry. I should
always be able to expire a db unless it's hopelessly corrupted, and these
aren't.

> not a useful value really -- should it randomly remove tokens?

Well, this proposal was based on sorting. If the db isn't sorted it would 
expire randomly, of course. That was not my goal.

> The DB doesn't grow much, and things are generally ok.

It doesn't grow much? I see that at about 20 MB it seems to stay at exactly
that size for quite a while even though tokens get added. Is that what you
mean? Is that because of the dbm format? Does that mean there's a lot of
bloat in it?
Comment 3 Justin Mason 2004-10-14 14:52:11 UTC
IMO, we should find a way to expire a db if it's too big, regardless of the
*correctness* of the expiry algorithm; random dropping of a percentage of the
tokens may be the only course.
Comment 4 Sidney Markowitz 2004-10-14 15:15:09 UTC
>> The DB doesn't grow much, and things are generally ok.
>
>It doesn't grow much?

I think he meant _other_ systems don't grow much and don't need expiry, i.e.,

> Most (all?), at least at last check, other Bayes systems
> don't do expiry at all, and you just do learn on error.
> The DB doesn't grow much, and things are generally ok.
> We generally encourage more learning, [...]

I've seen research indicating that learning everything and periodically
resetting the database, which is roughly equivalent to expiry, gives better
results over time than a learn-on-error system. The latter doesn't keep up
with long-term overall changes in the spam and ham coming in. Expiry is good
and we should do whatever we can to make it practical.
Comment 5 Theo Van Dinter 2004-10-15 06:47:19 UTC
Subject: Re:  expiry algorithm needs major overhaul

On Thu, Oct 14, 2004 at 02:52:12PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> IMO, we should find a way to expire a db if it's too big, regardless of the
> *correctness* of the expiry algorithm; random dropping of a percentage of the
> tokens may be the only course.

I'm really opposed to a fully random dropping of tokens; space is less
important than results, imo.

I think there are two easy options to help deal with the issue though:

- more buckets.  having more buckets gives more decision points, so the
  likelihood of being able to do an expire is higher.  (a rough sketch of
  such a bucket layout follows below.)

  the exponent idea comes from older tokens being less and less useful
  over time.  we generically used 12h and the powers of 2 to keep the
  code small and fast.  we could change the # of buckets and do the
  exponent thing manually in some form.  perhaps keep linear buckets up
  to a point, then space them further out, going back to a year or so.

- targeted random expiry.  if all else fails, find the bucket with the
  highest number of tokens, and "randomly" (aka: in the order they come
  out of the DB) remove the appropriate number of tokens.

I don't know whether the "random" expiry should only be a last-resort effort
(i.e.: only if --force-expire is used, or the previous expire count is 0,
meaning the expire didn't happen through the normal bucket method the last
time through).
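
For the bucket idea, here is an illustrative Python sketch of a layout with
linear buckets out to two weeks and then doubling out to roughly a year. The
specific numbers are made up, not a proposal for actual defaults:

DAY = 86400

def hybrid_buckets(linear_days=14, max_age=365 * DAY):
    buckets = [n * DAY for n in range(1, linear_days + 1)]   # 1d, 2d, ..., 14d
    delta = buckets[-1]
    while delta * 2 <= max_age:
        delta *= 2
        buckets.append(delta)                                # 28d, 56d, 112d, 224d
    return buckets

print([b // DAY for b in hybrid_buckets()])
# [1, 2, ..., 14, 28, 56, 112, 224] -- many more decision points near the
# recent end, where most of the tokens usually are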

Comment 6 Theo Van Dinter 2004-10-15 09:32:28 UTC
Subject: Re:  expiry algorithm needs major overhaul

On Thu, Oct 14, 2004 at 02:38:22PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Thanks for the answer, so if I understand correctly the problem is inherently 
> based in the dbm format we use because that doesn't allow for sorting. But the 

Well, it's only a problem depending on how you try to use the information.  It
would be a lot easier if the data were sorted though, imo, yes.

> Well, changing these parameters doesn't help much. F.i. doubling from 256 to 
> 512 days is simply too big and if you decrease the expiry_period amount you 

It depends on what your situation is.  You can tweak the numbers so that you
start smaller and go back further; for example, a period of 600 with 20
exponents goes from 10 minutes back to 3640 days (in days):
0 0 0 0 0 0 0 0 1 3 7 14 28 56 113 227 455 910 1820 3640
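
For reference, that list can be reproduced with a one-liner (Python,
illustrative):

print([600 * 2**n // 86400 for n in range(20)])
# [0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 7, 14, 28, 56, 113, 227, 455, 910, 1820, 3640]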

> don't gain any advantage. What needs to be done is avoid "phase 1". F.i. if I 

Yeah, that's what the estimation thing tries to do -- phase 1 is expensive,
and phase 2 has to happen anyway, so if we can guess what values we should use
this time based on what happened the last time, do so.

> know that I want to expire 1.000 tokens I can adjust all parameters, so that 
> the calculation will show a good ratio and then would go on. If I can do that 
> with a calculator SA should be able to do that as well.

you could do that with a calculator and a lot of scribbling, sure.  if people
didn't mind having a huge memory usage for expiry, SA could just store all the
token values in memory via phase 1, then spend a lot of cpu time figuring out
the best possible atime value.  but people aren't willing to do that.

> bayes dbs I can't expire and when I go and force them to expire I have to 
> slash them radically and after this expiry fails again. It's a never-ending 
> story. So, in my eyes there's something wrong with the current expiry. I 
> should be able to always expire a db unless it's hopelessly corrupted and 
> these aren't.

Something's up with how you're doing learning then, imo.  Expiry assumes a
fairly consistent amount of learning -- you'll probably get the same number of
mails a day, and you'll probably learn the same number each day; based on
that, the system should work.

> It doesn't grow much? I see that at about 20 MB it seems to stay exactly at 
> that size for quite a while although tokens get added. Is that what you mean? 

Yeah, it doesn't grow much unless you're learning a ton of tokens -- in which
case the problem is that you're learning too much.

Regular usage doesn't add a large # of tokens, and the number of new tokens
per learn goes down over time as the # of tokens in the DB increases.  20MB is
actually not that large, IMO.

> Is that because of the dbm format? Which means there's a lot of bloat in it?

I don't know the details, but I would guess BDB preallocates space based on
the number of entries, since, as you said, the file size doesn't change each
time you learn new tokens.

This is part of the reason we have to create a new DB file during expire.  BDB
doesn't free up disk space as you delete tokens, so you have to create a new
DB and move the data you want over to it.
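
Purely as an illustration of that rebuild-and-swap pattern (not
SpamAssassin's actual code, which packs atimes into binary values and does
much more bookkeeping), in Python with the generic dbm module:

import dbm

def rebuild_db(src, dst, keep):
    # copy only the entries for which keep(key, value) is true into a fresh DB
    with dbm.open(src, "r") as old, dbm.open(dst, "n") as new:
        for key in old.keys():
            if keep(key, old[key]):
                new[key] = old[key]
    # moving dst into place of src afterwards reclaims the dead space that
    # deleting entries in place would never give back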

Comment 7 Theo Van Dinter 2004-10-15 09:38:27 UTC
Subject: Re:  expiry algorithm needs major overhaul

On Thu, Oct 14, 2004 at 03:15:10PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> I think he meant _other_ systems don't grow much and don't need expiry, i.e.,

that too.  most other folks are doing learn on error, so in theory the amount
of learning necessary decreases over time down to some non-zero asymptotic
type level.

> I've seen research indicating that learn everything and periodically resetting
> the database, which is roughly equivalent to expiry, has better results over
> time than a learn on error system. The latter doesn't keep up with long term
> overall changes in the spam and ham coming in. Expiry is good and we should do
> whatever we can to make it practical.

I've seen research indicating that learn on error gives better results, and
research indicating LOE and constant learning give equal results.  Our model
has been that more/constant learning is a good thing, and that results are
more important than disk usage, hence expiry is currently a best effort
operation.

If we want to shift the model such that the resource usage is a more
important factor, that's fine, but we need to consciously agree to that.
Channeling quinlan for a minute, we would also need to do testing to make sure
that the results aren't adversely affected by a higher amount of expiry.

This is one of the reasons, btw, I was thinking about making expiry a plugin.
I don't think a single general algorithm is going to work for everybody, and I
don't think we want to write so much complexity into the code to have it
figure out which method may work best in the current situation, etc.  So a
plugin would make it easy to have the current model, LRU, etc.

Comment 8 Justin Mason 2004-10-15 09:59:19 UTC
An expiry plugin -- I think that's a great idea.

I was about to say, some people *do* want strict size-limiting of bayes dbs,
for example large sites where users have strict quotas.  OTOH, some people
want non-size-limited bayes dbs, so the db can grow to possibly double its
current size if expiring it might have a bad effect on accuracy.

I'm leaning towards thinking the former is the more typical case, nowadays.
Comment 9 Kai Sch 2004-10-15 11:08:07 UTC
I think a plugin is a good idea, too.
You mentioned somewhere that learning may be too much/too fast. The db I
quoted with a dump in my bug report learns about 2,000 tokens a day. Does
that sound like too much or too little? Maybe a thousand messages per day go
through the system, not more. I have seen that the increase on other systems
with more mail traffic is significantly higher, but the expiry problem is the
same.

What about an expiry algorithm which assumes that all tokens are spread
evenly over the time range in the db? For instance, with a db of 800,000
tokens and a max size of 1,000,000, the target size is 750,000. That makes a
reduction count of 50,000 tokens, which is 6.25% of all tokens. Now we take
the delta between the oldest and newest atime. In my quoted dump:
0.000          0 1052059392          0  non-token data: oldest atime
0.000          0 1097528241          0  non-token data: newest atime
delta = 45468849. We take 100 - 6.25 = 93.75% of that = 42627045.
This is our new atime delta, so we expire everything older than 1097528241 -
42627045 = 1054901196.
If we reach the reduction goal of 50,000 + 10% during the pass (which would
mean we are about to expire more tokens than we want), we either stop or keep
removing, depending on an option.
If we don't reach the reduction goal by expiring the 6.25% oldest tokens, we
do another pass based on the token count we still need. For instance, if we
expired 40,000 tokens we still want to remove 10,000, which is 20% of the
original goal. So we try again with a new delta goal of 6.25% * 20% = 1.25%.
We calculate the new delta between the newest atime and the "new" oldest
atime and then expire the 1.25% oldest tokens. We repeat this until we have
reached 90% of the original reduction goal. The second pass may not even be
necessary. (A sketch of the cutoff calculation follows below.)
So, this new method would be based solely on the access age of the tokens and
would try to expire the x% oldest tokens based strictly on the time range.
The current method hopes to reuse old values for a new run, or tries to find
a reduction count that resembles the reduction goal by jumping through the db
in ever-doubling steps. This is bound to fail, because the steps get so big
towards the end that the chance of finding a fitting count decreases. So the
chance of a successful "pass 1" run is slim.
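A small Python sketch of the cutoff calculation described above, using the
figures from the quoted dump (illustrative only):

def age_cutoff(oldest, newest, ntokens, reduction_goal):
    # assume tokens are spread evenly over [oldest, newest]
    remove_fraction = reduction_goal / ntokens        # 50,000 / 800,000 = 6.25%
    keep_delta = int((newest - oldest) * (1 - remove_fraction))
    return newest - keep_delta                        # expire everything older

print(age_cutoff(1052059392, 1097528241, 800_000, 50_000))   # 1054901196
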
Comment 10 Matt Kettler 2007-11-29 17:45:30 UTC
Any progress on this?

The expiry algorithm *still* needs a major overhaul as of 3.2.x. It clearly
performs poorly on high-volume sites.