Bug 2975 - bayes_seen database uncontrolled growth
Summary: bayes_seen database uncontrolled growth
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Learner
Version: unspecified
Hardware: Other other
Importance: P5 major
Target Milestone: 3.1.0
Assignee: SpamAssassin Developer Mailing List
Duplicates: 2771
Reported: 2004-01-28 00:14 UTC by Kelsey Cummings
Modified: 2005-06-01 10:40 UTC (History)
CC List: 2 users



Description Kelsey Cummings 2004-01-28 00:14:17 UTC
The bayes_seen database needs built-in expiration, like the bayes_toks
database has, to prevent uncontrolled growth.  In small installations this will
probably go unnoticed; in large installations with thousands of users, however,
it can easily account for substantial resource consumption.

It seems like after a few weeks, at most, message-ids could safely be forgotten.
Comment 1 Justin Mason 2004-01-28 11:18:07 UTC
yeah, you're right. :(

FWIW, just nuking the db with "rm" once a month would probably do the trick
acceptably as an interim measure...
Comment 2 Theo Van Dinter 2004-01-28 19:25:08 UTC
Subject: Re:  bayes_seen database uncontrolled growth

On Wed, Jan 28, 2004 at 12:12:47PM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> FWIW, just nuking the db with "rm" once a month would probably do the trick
> acceptably as an interim measure...

sorta.  deleting bayes_seen makes the tie (at least when opening r/o) fail.

Comment 3 Christian Mertes 2004-03-10 08:53:44 UTC
Yep, I train my filter at home and copy the tokens to the university where they 
are used to filter my mail. Since I have very limited disk space there and the 
bayes_seen file is only useful at home I'd like to be able to leave it out 
without producing error messages while filtering.

Regards,

Christian
Comment 4 Kelsey Cummings 2004-04-13 11:15:11 UTC
Michael, I haven't checked but I assume that the SQL Bayes code takes care of
this problem?
Comment 5 Michael Parker 2004-04-13 11:21:11 UTC
Subject: Re:  bayes_seen database uncontrolled growth

On Tue, Apr 13, 2004 at 11:15:12AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Michael, I haven't checked but I assume that the SQL Bayes code takes care of
> this problem?

No.  You can fake it with a lastupdate column in MySQL and just expire
by hand, but we don't do anything explicit for expiry.

Michael
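
A hand-run expiry of the kind described above might look like the following
sketch. It uses Python's built-in sqlite3 as a stand-in for MySQL; the table
layout (msgid, flag, lastupdate) and the 90-day cutoff are assumptions for
illustration, not the actual SpamAssassin schema.

```python
import sqlite3
import time

DAY = 86400  # seconds per day


def expire_seen(conn, max_age_days, now=None):
    """Delete seen-message rows whose lastupdate is older than max_age_days.

    Returns the number of rows removed.
    """
    now = time.time() if now is None else now
    cutoff = int(now) - max_age_days * DAY
    cur = conn.execute("DELETE FROM bayes_seen WHERE lastupdate < ?", (cutoff,))
    conn.commit()
    return cur.rowcount


def demo():
    # Hypothetical schema: one row per learned Message-ID plus a timestamp.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE bayes_seen ("
        " msgid TEXT PRIMARY KEY,"
        " flag TEXT NOT NULL,"
        " lastupdate INTEGER NOT NULL)"
    )
    now = 1000 * DAY
    rows = [
        ("<old@example.com>", "s", now - 120 * DAY),  # older than the cutoff
        ("<new@example.com>", "h", now - 5 * DAY),    # recent, should survive
    ]
    conn.executemany("INSERT INTO bayes_seen VALUES (?, ?, ?)", rows)
    removed = expire_seen(conn, 90, now=now)
    remaining = [r[0] for r in conn.execute("SELECT msgid FROM bayes_seen")]
    conn.close()
    return removed, remaining
```

Run periodically (for example from cron), this keeps the table bounded by the
learning rate times the retention window rather than growing forever.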

Comment 6 Kelsey Cummings 2004-04-13 11:29:44 UTC
Does the Bayes framework allow for an expiration to occur?  Could it be rolled
in with the token auto expiration code?
Comment 7 Michael Parker 2004-04-13 11:40:34 UTC
Subject: Re:  bayes_seen database uncontrolled growth

On Tue, Apr 13, 2004 at 11:29:45AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Does the Bayes framework allow for an expiration to occur?  Could it be rolled
> in with the token auto expiration code?

No it doesn't.  It would take a slight re-design and probably more
thought.

I see the following possible attributes of things we might want to
support in the future:

1) Straight date based expiry.  Note Date/Time when msgid was first
   learned and after N days expire all msgids > N days old.  We can
   then run this at the same time as expiration or some other process
   could handle this.

2) Similar to 1 but update the timestamp if we try to re-learn the
   msgid.  The thinking here is that for some reason this msgid was
   re-examined so let's keep it around a little bit longer to avoid
   re-learning.

3) Keep no record of learned msgids and allow exhaustive learning or
   explicitly disable learning in this case.  This would help folks
   who learn on one box and copy the bayes_toks file to a production
   box and have auto_learn turned off.  It would also allow for
   multiple learns on the same message (ie exhaustive learning, see
   http://garyrob.blogs.com/garys_longer_rants/2004/02/instructions_fo.html
   via jmason).

4) Log how many times a message has been learned, again see exhaustive
   learning stuffs.

5) Tie which tokens were learned from a particular msgid and then
   expire by msgid instead of by token atime.

6) All the various combinations of all of the above.

Anything else?

Michael
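
Options 1 and 2 from the list above differ only in whether a re-learn attempt
refreshes the timestamp. A minimal in-memory sketch of both follows; the class
name, the refresh_on_relearn switch and the 90-day default are hypothetical and
do not correspond to any existing SpamAssassin code.

```python
DAY = 86400  # seconds per day


class SeenStore:
    """Toy msgid store with date-based expiry (hypothetical, in-memory only)."""

    def __init__(self, max_age_days=90, refresh_on_relearn=False):
        self.max_age = max_age_days * DAY
        # False -> option 1 (keep first-learned time),
        # True  -> option 2 (bump the time on every re-learn attempt).
        self.refresh_on_relearn = refresh_on_relearn
        self.entries = {}  # msgid -> (flag, timestamp)

    def learn(self, msgid, flag, now):
        """Return True if msgid is new and should be learned, else False."""
        if msgid in self.entries:
            if self.refresh_on_relearn:
                old_flag, _ = self.entries[msgid]
                self.entries[msgid] = (old_flag, now)
            return False
        self.entries[msgid] = (flag, now)
        return True

    def expire(self, now):
        """Drop every msgid older than max_age; return how many were dropped."""
        cutoff = now - self.max_age
        stale = [m for m, (_, ts) in self.entries.items() if ts < cutoff]
        for m in stale:
            del self.entries[m]
        return len(stale)


def demo(refresh):
    store = SeenStore(max_age_days=90, refresh_on_relearn=refresh)
    first = store.learn("<a@example.com>", "s", now=0)
    again = store.learn("<a@example.com>", "s", now=80 * DAY)
    dropped = store.expire(now=100 * DAY)
    return first, again, dropped
```

With refresh disabled the entry is dropped at day 100; with refresh enabled the
re-learn attempt at day 80 keeps it alive past that expiry run.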

Comment 8 Daniel Quinlan 2004-08-27 17:00:01 UTC
moving accuracy and some bugs to 3.1.0 milestone
Comment 9 Daniel Quinlan 2004-08-27 17:19:26 UTC
more accuracy and performance bugs going to 3.1.0 milestone
Comment 10 Daniel Quinlan 2004-08-27 18:15:38 UTC
*** Bug 2771 has been marked as a duplicate of this bug. ***
Comment 11 Justin Mason 2005-05-03 12:04:55 UTC
we probably should have some way of doing this in 3.1.0 -- even if it's just a
support script that wipes out the db and replaces it with a new, empty one.
Comment 12 Matt Kettler 2005-05-03 13:43:27 UTC
Justin, my sentiments exactly. A lock-safe equivalent of rm -f bayes_seen would
be a pretty desirable tool anyway, and offers at least an interim solution.


As for options to solve this the "right way", I just sent this to Michael
off-line regarding comment #7 and figured this should be echoed here (with more
thoughts added to 3)


I'd say option 2) is the most consistent with how SA handles expiry of tokens,
and seems the most sensible option. 

1) could be workable as well, but 2) strikes me as an improvement.

3) could be implemented as an option that simply disables the bayes_seen portion
entirely, and isn't very relevant to expiry as it works by eliminating the need.
Unless SA is redesigned to exclusively do things this way, you'll need 1, 2, or 5.
I don't think that in the general case you want to use this as your normal mode
of operation. Protecting against accidental re-learning is good in most
environments.

4) While an interesting idea, this doesn't address or solve the problem of
expiry. If you went this way you'd still need 1, 2, or 5 to solve the
boundless-growth problem. See also thoughts on 3.

5) sounds extremely complicated, bulky in terms of storage demand, and of
limited gain. I think you'll find that this mechanism would allow the bayes_seen
to grow more-or-less without bound anyway. I state this based on the theory that
it only takes one unexpired token to retain a message ID, and the majority of
messages you train are going to have at least one "frequently seen" token that
keeps getting learned in new messages. SA's token expiry is going to favor
keeping these "frequently seen" tokens when it expires tokens, because they're
going to have a short delta-atime. (And it would be right to keep them, as
statistically speaking they are the best candidates to keep)


In summary, it sounds like the best thing to do would be 2. 
Comment 13 Justin Mason 2005-05-09 00:42:02 UTC
let's at least try to think about a quick fix for this for 3.1.0
Comment 14 Justin Mason 2005-05-10 23:09:41 UTC
ok, trivial fix for file-based dbs in 3.1.0:

- we maintain two bayes_seen files: bayes_seen and bayes_seen_old.
- if bayes_seen_old doesn't exist (eg. post-upgrade) it's treated as empty
- if bayes_seen doesn't exist, both files are treated as empty
- once stat(bayes_seen) reports that the file was created more than N days
ago, bayes_seen_old is unlinked, bayes_seen is moved to bayes_seen_old, and
a new, empty bayes_seen file is created.

N would be 90 days by default, let's say.  that's very easy and pretty fast to
implement, and deals with the problem without adding more fields or upgrading
the db format.

doesn't help for SQL dbs, or for auto-whitelist though...
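
The two-file scheme above can be sketched roughly as follows. This is only an
illustration under several assumptions: the files are plain text with one
msgid per line rather than real Berkeley DB files, the helper names are made
up, and the file's mtime is used as a stand-in for "creation time" (which
stat does not portably report, and which a real implementation would have to
track some other way, since mtime changes on every write).

```python
import os
import tempfile
import time

DAY = 86400  # seconds per day
SEEN, OLD = "bayes_seen", "bayes_seen_old"


def load(path):
    """Read msgids from a file; a missing file is treated as empty."""
    try:
        with open(path, encoding="utf-8") as fh:
            return {line.strip() for line in fh if line.strip()}
    except FileNotFoundError:
        return set()


def is_seen(directory, msgid):
    seen = os.path.join(directory, SEEN)
    if not os.path.exists(seen):
        return False  # no bayes_seen: both files are treated as empty
    return msgid in load(seen) or msgid in load(os.path.join(directory, OLD))


def rotate_if_due(directory, max_age_days=90, now=None):
    """Rotate bayes_seen to bayes_seen_old once it is older than max_age_days."""
    now = time.time() if now is None else now
    seen = os.path.join(directory, SEEN)
    old = os.path.join(directory, OLD)
    if not os.path.exists(seen):
        return False
    # Assumption: mtime approximates the file's creation time.
    if now - os.stat(seen).st_mtime <= max_age_days * DAY:
        return False
    os.replace(seen, old)  # replaces any previous bayes_seen_old
    open(seen, "w", encoding="utf-8").close()  # new, empty bayes_seen
    return True


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, SEEN)
        with open(path, "w", encoding="utf-8") as fh:
            fh.write("<a@example.com>\n")
        stale = time.time() - 100 * DAY
        os.utime(path, (stale, stale))  # pretend the file is 100 days old
        first = rotate_if_due(d)
        still_known = is_seen(d, "<a@example.com>")  # now found via the old file
        second = rotate_if_due(d)  # the fresh file is not due yet
        return first, still_known, second
```

With this layout a msgid is remembered for between N and 2N days, depending on
where in the rotation cycle it was learned.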
Comment 15 Michael Parker 2005-05-10 23:21:38 UTC
Subject: Re:  bayes_seen database uncontrolled growth

I'd be inclined to veto anything that doesn't include a solution all
around.  It seems far too hackish to just throw something in for this,
especially since we're talking about recommending the SQL solution
over the Berkeley DB based bayes.
Comment 16 Theo Van Dinter 2005-05-11 11:36:33 UTC
Subject: Re:  bayes_seen database uncontrolled growth

On Tue, May 10, 2005 at 11:21:38PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> I'd be inclined to veto anything that doesn't include a solution all
> around.  It seems far too hackish to just throw something for this,
> especially since we're talking about recommending the SQL solution
> over the Berkeley DB based bayes.

We need to do something, but a full seen expiry system isn't going to happen
for 3.1.

I still like the idea of just letting bayes_seen be optional.  If people want
to trim it, let them delete the file and have it be recreated.  IIRC, the only
place that's an issue is when going r/o w/ the DB where it requires the file
right now.
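
A sketch of what "optional" could mean for the read-only case: if the seen
database cannot be opened, treat it as empty instead of failing. This uses
Python's generic dbm module purely as a stand-in for the Berkeley DB tie; the
OptionalSeen class and the file names are hypothetical.

```python
import dbm
import os
import tempfile


class OptionalSeen:
    """Read-only view of a seen database that is allowed not to exist."""

    def __init__(self, path):
        try:
            self._db = dbm.open(path, "r")
        except dbm.error:
            # Missing or unopenable file: behave as an empty database.
            self._db = None

    def __contains__(self, msgid):
        if self._db is None:
            return False
        return msgid.encode("utf-8") in self._db

    def close(self):
        if self._db is not None:
            self._db.close()
            self._db = None


def demo():
    with tempfile.TemporaryDirectory() as d:
        present = os.path.join(d, "bayes_seen")
        with dbm.open(present, "c") as db:
            db[b"<a@example.com>"] = b"s"
        have = OptionalSeen(present)
        missing = OptionalSeen(os.path.join(d, "no_such_seen"))
        result = (
            "<a@example.com>" in have,     # learned earlier
            "<b@example.com>" in have,     # never learned
            "<a@example.com>" in missing,  # no file at all: treated as unseen
        )
        have.close()
        missing.close()
        return result
```

Under this approach, trimming the database is just a matter of deleting the
file; the next read-write open would recreate it.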

Comment 17 Justin Mason 2005-05-11 12:24:17 UTC
'I still like the idea of just letting bayes_seen be optional.  If people want
to trim it, let them delete the file and have it be recreated.  IIRC, the only
place that's an issue is when going r/o w/ the DB where it requires the file
right now.'

ok, I can go for that.
Comment 18 Justin Mason 2005-06-01 18:40:22 UTC
ok, fixed; r179482.