Bug 2819 - Bayes expiration issue after DB format upgrade
Summary: Bayes expiration issue after DB format upgrade
Status: RESOLVED WORKSFORME
Alias: None
Product: SpamAssassin
Classification: Unclassified
Component: Learner
Version: 2.60
Hardware: PC Linux
Importance: P5 minor
Target Milestone: 3.0.0
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Duplicates: 2918
Depends on:
Blocks:
 
Reported: 2003-12-08 09:28 UTC by Michel Bouissou
Modified: 2004-02-15 08:41 UTC (History)
1 user



Attachment: atimes in my Bayes DB (application/x-bzip2), submitted by Michel Bouissou [HasCLA]

Description Michel Bouissou 2003-12-08 09:28:56 UTC
Hi, 
 
When I upgraded from SA 2.55 to SA 2.60, the format of my bayesian DB was 
upgraded automagically at the first expiry run. 
In the previous format, the bayes DB recorded each token's message number, 
whereas it now records token atimes instead. 
When the DB was converted, every token was given the atime of the moment 
the conversion was performed. 
 
Currently my bayes DB contains 186,337 of these old tokens that all have an 
atime of 1065955165 (Sun Oct 12 12:39:25 2003) 
 
The problem is that my DB is now growing bigger and bigger, going far 
beyond the limit set by "bayes_expiry_max_db_size", because if the expiry run 
were to expire all 186,337 of the "1065955165" tokens, the DB would then become 
too small, so in the end the expiry run doesn't expire anything. 
 
The problem is twofold: first, my DB is growing much larger than 
expected and configured, and I believe it won't actually "expire" anything 
before the DB is twice its configured size. 
Second, in this abnormal condition, the system performs an "expire for nothing" 
run every 12 hours, which just wastes system resources analyzing a huge 
DB and then doing nothing. 
 
Cheers.
Comment 1 Michel Bouissou 2003-12-08 09:45:16 UTC
And I can't force expiration even by greatly reducing the target size of my DB. 
Example. The current status of my DB is: 
 
[michel@totor mail.ut]$ ./ckdb-stat 
bayes db version..............: 2 
nspam.........................: 2677 
nham..........................: 11409 
ntokens.......................: 335434 
oldest atime..................: 1970-01-01 01:00:00 
newest atime..................: 2003-12-08 18:39:24 
last journal sync atime.......: 2003-12-08 18:39:32 
last expiry atime.............: 2003-12-08 18:40:04 
last expire atime delta.......: 86400 
last expire reduction count...: 1978 
 
If I first set bayes_expiry_max_db_size to something like 120,000, 
and then run: 
 
sa-learn --showdots --force-expire -D 
 
I get: 
[...] 
debug: bayes: found bayes db version 2 
debug: bayes: expiry check keep size, 75% of max: 90000 
debug: bayes: expiry keep size too small, resetting to 100,000 tokens 
debug: bayes: token count: 335434, final goal reduction size: 235434 
debug: bayes: First pass?  Current: 1070905172, Last: 1070905106, atime: 
86400, count: 1978, newdelta: 725, ratio: 119.026289180991 
debug: bayes: something fishy, calculating atime (first pass) 
debug: bayes: couldn't find a good delta atime, need more token difference, 
skipping expire. 
debug: Syncing complete. 
debug: bayes: 4700 untie-ing 
debug: bayes: 4700 untie-ing db_toks 
debug: bayes: 4700 untie-ing db_seen 
debug: bayes: files locked, now unlocking lock 
debug: unlock: 4700 unlink /var/qmail/.spamassassin/bayes.lock 
synced Bayes databases from journal in 0 seconds: 53 unique entries (53 total 
entries) 
 
So what's up, Doc ? 
 
Comment 2 Theo Van Dinter 2003-12-08 11:19:11 UTC
well, this is the expected behavior.  when the conversion occurs, the atime is set to the current 
time.  from there, atimes are updated when messages are scanned or learned.

> The problem with this is double: First my DB is growing much larger than 
> expected and configured, and I believe it won't actually "expire" anything 
> before the DB is twice its configured size. 

well, it'll expire when there's a large enough difference between token atimes in the db.  that 
doesn't equate into a size easily.  for instance, you could learn a billion new tokens with the same 
atime and it wouldn't help you at all.

> 2nd, in this abnormal condition, the system perform an "expire for nothing" 
> run every 12 hours, which justs wastes system resources for analyzing a huge 
> DB, and then doing nothing. 

Well, how is SA supposed to know it's an "expire for nothing" unless it tries to do the expire?

In your example, you have 335434 tokens, and want to remove 235434 of them to leave 100k 
tokens (75% of your max_db size is 90k, which is too small).  Expiry attempted to use your last 
expire to estimate the current one, but the last expire and the current expire are too different for 
an accurate estimate, so it does a first pass to calculate the right atime to use.  After going 
through, it determined that no atime value is good to use.  ie: 12 hours (minimum atime value) 
would expire too many tokens.  therefore no expiry is possible.

what you want is to scan/learn around 100k tokens.  expiry occurs when some 
atime delta will expire at least 1000 tokens and at most the wanted # of tokens.
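The keep-size arithmetic described above can be sketched roughly as follows. This is a hypothetical Python illustration, not the actual SpamAssassin Perl code; the function name and structure are invented, but the constants (75% of max, the 100,000-token floor) come from the debug output quoted in comment 1.

```python
# Hypothetical sketch of the expiry keep-size logic (not the real
# SpamAssassin source): keep 75% of bayes_expiry_max_db_size, with a
# floor of 100,000 tokens; everything beyond that is the reduction goal.

def expiry_goal(ntokens, max_db_size):
    keep = max_db_size * 75 // 100        # "expiry check keep size, 75% of max"
    if keep < 100_000:                    # "expiry keep size too small, resetting to 100,000"
        keep = 100_000
    goal_reduction = ntokens - keep       # tokens we would like to remove
    return keep, goal_reduction

keep, goal = expiry_goal(335434, 120000)
print(keep, goal)   # 100000 235434 -- matches the debug output above
```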
Comment 3 Michel Bouissou 2003-12-08 11:33:43 UTC
Theo Van Dinter wrote: 
> well, it'll expire when there's a large enough difference between token 
> atimes in the db.  that doesn't equate into a size easily.  for instance, 
> you could learn a billion new tokens with the same atime and it wouldn't 
> help you at all. 
[...] 
> what you want is to scan/learn around 100k tokens. 
 
Well, my DB currently holds more than 335,000 tokens, about 186,000 of which 
are "old" converted ones with an atime of October 12, BUT about 150,000 
tokens are new ones that have been learnt in the past 2 months. 
 
Having 150,000 new tokens, I expected that if I set the number of tokens to 
keep as something below 150,000, expiry would actually expire old tokens. 
 
Which didn't happen, even when lowering the number of tokens to keep down to 
100,000... 
Comment 4 Theo Van Dinter 2003-12-08 11:48:30 UTC
Subject: Re: [SAdev]  Bayes expiration issue after DB format upgrade

On Mon, Dec 08, 2003 at 11:33:44AM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Well, My DB currently holds more than 335,000 tokens, about 186,000 of which 
> are "old" converted ones with an atime of October, 12, BUT about 150,000 
> tokens are new ones that have been learnt in past 2 months. 

Are the messages new from the last 2 months, or did you learn old messages?

can you do a --dump and attach the results to the ticket?

Comment 5 Michel Bouissou 2003-12-09 04:54:03 UTC
Theo Van Dinter wrote: 
> Are the messages new from the last 2 months, or did you learn old messages? 
 
All new messages that have been received since my DB was upgraded 2 months 
ago. This makes about 150,000 new tokens, all with different atimes, all more 
recent than the 186,000 old "converted" ones. 
 
> can you do a --dump and attach the results to the ticket? 
 
My Bayes_DB is more than 11 MB in size. A dump would be even larger (and 
would contain personal data from legitimate messages...) 
 
Comment 6 Theo Van Dinter 2003-12-09 08:16:50 UTC
Subject: Re: [SAdev]  Bayes expiration issue after DB format upgrade

On Tue, Dec 09, 2003 at 04:54:04AM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> > can you do a --dump and attach the results to the ticket? 
>  
> My Bayes_DB is more than 11 MB in size. A dump would be still larger (and 
> contains personal data from legitimate messages...) 

Fair enough.  How about just the magic tokens and the atimes from all
the tokens?  Something like:

sa-learn --dump magic > foo
sa-learn --dump data | awk '{print $4}' >> foo

Then send up foo?  I really just want to see the distribution of atimes,
the tokens themselves aren't useful to me.


Comment 7 Michel Bouissou 2003-12-09 08:37:13 UTC
Created attachment 1616 [details]
atimes in my Bayes DB

bzip2-compressed file with the contents requested by felicity.
Comment 8 Theo Van Dinter 2003-12-09 10:27:44 UTC
Subject: Re: [SAdev]  Bayes expiration issue after DB format upgrade

On Tue, Dec 09, 2003 at 08:37:14AM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Bzipped-2 file with contents as requested by felicity.

I really ought to add something like this to the debug output.
It's useful for debugging, and it's interesting to see.

Anyway, here's your atime breakdown...   atime is the expiry atime delta
(ie: delete anything whose atime is that many seconds ago or older; the
deltas are 12h, 24h, 2d, 4d, 8d, 16d, 32d, 64d, 128d, 256d).  tokens in
range is the # of tokens between the current atime and the previous one
(ie: 0-43200, 43201-86400, 86401-172800, etc.)  total kept is the # of
tokens that would be kept given the atime value.

atime           tokens in range         total kept
43200           4990                    4990
86400           3919                    8909
172800          4135                    13044
345600          5669                    18713
691200          17091                   35804
1382400         23219                   59023
2764800         39142                   98165
5529600         239876                  338041
11059200        0                       338041
22118400        0                       338041

FYI: There are 8 tokens with atime 0, so they'll be deleted by all of
the atime deltas.  Hence the difference between the max total kept and
the number of tokens in the DB.

As we can see, at 32 days (2764800) we're close to being able to expire,
assuming expiry was going to keep the minimum of 100k tokens.  Note:
expiry actually deals with tokens to remove, not tokens kept, but it's
the same thing with an addition operation ...

So keep learning/scanning tokens, and expiry will eventually happen
by itself, as designed. :)


The slightly more technical/algorithmic answer is that expiry finds
that it can expire at 5529600 to get the 8 tokens at atime 0, but since
that's < 1000 tokens, there's really no point in doing the expire.  So it
looks at 2764800, and stops because too many tokens will be removed.
(as atime delta goes down, # of tokens to expire goes up, so there's no
point in going beyond the first value which expires too many tokens ...)
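The delta search described above can be sketched like this. It is a hypothetical Python illustration, not the real SpamAssassin code; the token counts come from the breakdown table, the goal from comment 2's debug output, and the 1000-token minimum from comment 2.

```python
# Hypothetical sketch of the expiry delta search (not the real
# SpamAssassin code).  Deltas double from 12h upward; we want a delta
# that removes at least 1000 tokens but no more than the reduction goal.

ntokens = 338049          # 338041 max "total kept" + the 8 tokens with atime 0
goal    = 235434          # desired reduction, from comment 2's debug output

# (delta seconds -> tokens kept at that delta), per the table above
kept_at = {
    43200: 4990, 86400: 8909, 172800: 13044, 345600: 18713,
    691200: 35804, 1382400: 59023, 2764800: 98165,
    5529600: 338041, 11059200: 338041, 22118400: 338041,
}

chosen = None
for delta in sorted(kept_at, reverse=True):   # try the oldest cutoff first
    removed = ntokens - kept_at[delta]
    if removed < 1000:
        continue                              # too few tokens to bother expiring
    if removed > goal:
        break                                 # every shorter delta removes even more
    chosen = delta
    break

print(chosen)   # None -- no usable delta, so the expire run is skipped
```

Walking the table: every delta from 256d down to 64d removes only the 8 atime-0 tokens (under the 1000 minimum), and the next one down, 32d, removes 239,884 (over the goal), so no delta qualifies and expiry gives up, exactly as the "couldn't find a good delta atime" debug line reported.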

Comment 9 Theo Van Dinter 2003-12-09 10:35:37 UTC
explanation in last message.
Comment 10 Michel Bouissou 2003-12-09 15:25:10 UTC
Thanks for the explanations Theo. 
 
Anyway, there's something that I feel is not so good in the current expiry 
mechanism: the fact that a Bayes DB configured with a 
bayes_expiry_max_db_size of (let's say) 150,000 tokens can grow much, much 
bigger without expiry purging it, if too many tokens have too close an atime, 
such as in a DB conversion scenario. 
 
I think that if somebody wants to give a size limit to his DB, he will 
probably be unhappy to find himself with a DB 3 or more times bigger, maybe 
for disk space reasons, maybe for performance reasons. 
 
I believe the expiry process should not let the DB grow much bigger than the 
configured max size, and that a safeguard should be implemented to prevent 
this from happening. Maybe by expiring a number of the older tokens, if not all 
of them (based upon the number of times they've been encountered, for example?). 
 
I also wonder whether the interval growing by powers of 2 might result in such 
long intervals, for big DBs, that it would have unwanted effects... 
 
Comment 11 Theo Van Dinter 2003-12-17 05:41:16 UTC
reopening because it'd be good to hash out the expiry issues.
Comment 12 Theo Van Dinter 2003-12-17 05:59:29 UTC
Subject: Re: [SAdev]  Bayes expiration issue after DB format upgrade

On Tue, Dec 09, 2003 at 03:25:11PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Anyway, there's something which I feel being not-so-good in the current expiry 
> mechanism: The fact that a Bayes DB that could be configured with a 
> bayes_expiry_max_db_size of (let's say) 150,000 token can grow much, much 
> bigger without expiry purging it, if too many tokens have a too close atime, 
> such as in a DB conversion scenario. 

Well, this is really a problem of incorrect expectations.  Most Bayes
systems have no expiry mechanism at all.  We feel that doing more than
"learn on error" is what people should be doing (à la the "likelihood
of ham vs spam" instead of just the binary "ham or spam").  So we
have the expiry method to allow for more training without enormous
growth.  That said, the expiry process is "best effort", as stated in
the documentation.

> I believe the expiry process should not let the DB grow much bigger than the 
> configured max size, and that a safeguard should be implemented for preventing 
> this to happen. Maybe by expiring a number of the older tokens, if not all of 
> them (based upon the number of times they've been encountered for example ?). 
>  
> I also wonder if the interval growing by powers of 2 may not result in so long 
> intervals, for big DBs, that it might have unwanted effects... 

Ah, the issue is a "last ditch" effort attempt is actually not usually
needed in standard usage.  In my POV (and as the expiry algorithm is
coded) there are 3 types of usage out there:

1) People who receive very little mail, and consequently have very little
growth in their DB.  At expiry time, they will end up having their tokens
spread over a relatively long timeline (say 1 month to 1 year).  In this
case, the exponential expiry deltas help quickly determine which tokens
to remove.  Since you want to expire all of the tokens with a given atime
at the same time (you can't say that token X is more important than token
Y), sorting by atime and dropping the oldest $X tokens isn't valid.
Since expiry will not occur very often, these people can take the hit
of a longer-running, but more accurate, expiry (which includes a first pass).

2) People who receive a moderate amount of mail, and consequently have
moderate growth in their DB.  At expiry time, they'll have tokens spread
over a shorter timeline (say 1 day to 1 month).  This is what I consider
"normal" usage.  Exponential expiry deltas are still helpful here, but to
make this case faster we assume the amount of learning is constant over
the timeline and estimate the age of tokens to expire from the last expiry
run.  This ends up being less accurate, as estimations are by definition,
but much much faster.  If the code detects that the estimation may be
too inaccurate (aka "something looks fishy"), it will do a "first pass"
to generate an accurate expiry atime value.

3) People who receive a large amount of mail, and consequently have
large growth in their DB.  At expiry time, they'll have tokens spread
over a very short timeline (say 1 to 24 hours).  Mostly a "ditto" of #2,
except here the problem becomes the accuracy of the database, not the
accuracy of the expire.  Bayes works well with non-common tokens (ie
"VIAGRA" vs "the"), because they will end up being polar in their ham vs
spam probability.  Consequently, those tokens are likely to not be used as
often, and therefore short expiry times will remove them making the Bayes
calculations less accurate overall.  While we could make an algorithm to
go through and remove the tokens based on some form of "commonality" and
atime ratio, it would make the expire run fairly cpu/memory intensive,
and these are exactly the people who don't want that due to the volume
of mail they're going through.  The algorithm from #2 works ok here:
assume a standard inflow volume, and estimate based on the last expire,
keeping the expire time short, but the cost is a less accurate DB overall.
To these people I typically say: learn fewer tokens, become a #2 person,
and reap the benefits.
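The estimation step in case #2 above can be sketched as follows. This is a hypothetical Python illustration, not the real SpamAssassin code: the scaling rule is inferred from the debug output in comment 1 (last delta 86400, last count 1978, goal 235434, ratio ~119.03, newdelta 725), and the "fishy" thresholds of 0.5/1.5 are invented for illustration.

```python
# Hypothetical sketch of the estimation in usage case #2 (not the real
# SpamAssassin code).  Assuming learning is roughly constant over time,
# scale the last expiry's atime delta by how many more (or fewer) tokens
# we need to remove this time; a wildly off ratio triggers a first pass.

def estimate_delta(last_delta, last_count, goal):
    ratio = goal / last_count          # how much bigger this expiry is
    new_delta = last_delta / ratio     # a shorter cutoff removes more tokens
    fishy = ratio > 1.5 or ratio < 0.5 # illustrative thresholds, not SA's
    return new_delta, ratio, fishy

# Numbers from the debug output in comment 1:
new_delta, ratio, fishy = estimate_delta(86400, 1978, 235434)
print(int(new_delta), round(ratio, 2), fishy)   # 725 119.03 True
```

With a ratio of ~119 the estimate is hopeless ("something fishy"), so the code falls back to the accurate first pass, which is exactly what the debug log showed.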


wow this is long...   anyway, so the point of all this is that in the
"common" usage described above, "last ditch" isn't going to help anyone.
the issue you raised is one of a "non-common" usage: the conversion from
an old db.  which definitely happens, and essentially kicks people from
a #1 to a #2, or a #2 to a #3.  In your case, you seem to be somewhere
between #1 and #2 normally, based on the atime breakdown as previously
posted.  unfortunately the #1 people get the worst case scenario since
they'd normally have to wait at least 1 month to learn enough tokens
to do an expire run, but have enough tokens in the DB to try expiry at
every opportunity (12 hours by default).  the more mail they receive,
the faster the problem clears itself.

the problem becomes what to do about these folks.  "delete your db and
start over" doesn't really work since it'll take a long time to get
the DB functional again.  saying "just remove the converted tokens"
doesn't work since they could potentially be the whole db.

I guess "instead of a single atime, put in random atimes based on
polarization" would solve the expiry issue (more polarized tokens get a
more recent atime, less polarized tokens get an older atime)...  but 1)
I don't actually know what this would do to accuracy of the DB, 2) it's
wrong to guess based on polarization since the more common tokens will
just get learned again quickly and you end up with the same sort of issue
(although I think this may be better than the current method).

Another possibility is that during the upgrade, we fake an atime based on
the msgcount old atime value.  but there are several things I don't like.
1) the old atime values rolled over when msgcount hit 65535, so worst case
the old db rolled over to all 0 atimes, then they convert which gives
us the same situation we have now.  2) we know the range is 0-65535 for
the old atime, but how do we map that to epoch atime?  there'd have to
be some multiplier, and a calculated base atime of, say, 6 months ago.
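The conversion idea above might look something like this. It is purely a hypothetical sketch of the proposal, never actual code: the linear mapping, the 6-month window, and the function name are all illustrative assumptions taken from the paragraph above.

```python
# Hypothetical sketch of the proposed upgrade-time conversion (never in
# the actual code; the 6-month base and linear multiplier are the
# illustrative assumptions floated above).  Map the old 16-bit msgcount
# "atime" (0..65535) onto a window ending now and starting ~6 months ago.

import time

def fake_atime(msgcount, now=None, window=180 * 86400):
    now = now or int(time.time())
    base = now - window                       # "a calculated base atime of, say, 6 months ago"
    return base + msgcount * window // 65535  # linear multiplier over the 16-bit range

now = 1_070_000_000
print(fake_atime(0, now))      # 1054448000 -- oldest tokens land 6 months back
print(fake_atime(65535, now))  # 1070000000 -- newest tokens land at "now"
```

This spreads converted tokens across the window instead of piling them on a single atime, which is what made expiry impossible here; the worst case (all msgcounts rolled over to 0) is unchanged, as noted above.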

hrm...  perhaps I've convinced myself to do this last one?  it has the
same worst case, but it should have a better common case.  then again,
for 2.70, how many people are going to be converting from 2.5x where
this will be useful?  perhaps this is a "we should have done this in
hindsight" thing?

Comment 13 Theo Van Dinter 2004-01-11 10:13:50 UTC
*** Bug 2918 has been marked as a duplicate of this bug. ***
Comment 14 Theo Van Dinter 2004-02-15 17:41:17 UTC
ok, I'm reclosing this one since I think it's essentially an "in hindsight we 
should have done" issue now... :|