Bug 3471 - [review] add sanity check so too many tokens don't get expired
Summary: [review] add sanity check so too many tokens don't get expired
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Learner (show other bugs)
Version: SVN Trunk (Latest Devel Version)
Hardware: Other other
: P4 normal
Target Milestone: 3.0.0
Assignee: Theo Van Dinter
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2004-06-04 09:50 UTC by Theo Van Dinter
Modified: 2004-06-09 06:02 UTC (History)
1 user (show)



Attachment Type Modified Status Actions Submitter/CLA Status
suggested patch patch None Theo Van Dinter [HasCLA]
new version of implementation patch None Theo Van Dinter [HasCLA]
Patch File patch None Michael Parker [HasCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Theo Van Dinter 2004-06-04 09:50:26 UTC
it's been reported on the -users list that sometimes bayes expiry will get rid of the majority of tokens in 
the db.  the issue seems to occur when most of the tokens have a single atime (like when an upgrade 
happens) and expiry uses the estimation method.

I'm attaching a patch which adds a simple check to the DBM expiry (SQL doesn't do estimation?).
Comment 1 Theo Van Dinter 2004-06-04 09:50:58 UTC
Created attachment 2000 [details]
suggested patch
Comment 2 Theo Van Dinter 2004-06-04 09:51:37 UTC
I think this is 3.0 material.
Comment 3 Michael Parker 2004-06-04 16:38:10 UTC
Subject: Re:  New: add sanity check so too many tokens don't get expired

On Fri, Jun 04, 2004 at 09:50:27AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> 
> I'm attaching a patch which adds a simple check to the DBM expiry (SQL doesn't do estimation?).
> 

Actually, it does.  It determines the optimal atime to expiration and
then deletes everything < that.  If this is a problem for DBM, then it
could happen for SQL as well, but slightly harder to recover from.

Michael

Comment 4 Theo Van Dinter 2004-06-04 17:16:46 UTC
Subject: Re:  add sanity check so too many tokens don't get expired

On Fri, Jun 04, 2004 at 04:38:11PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Actually, it does.  It determines the optimal atime to expiration and
> then deletes everything < that.  If this is a problem for DBM, then it
> could happen for SQL as well, but slightly harder to recover from.

Ah.  I saw the "delete ... < ..." bit, and got confused (it was a very
long night last night...)

I guess the way to guard for this in SQL is to do a "select
count(??) ..." before the delete, and see how many tokens will be removed.
If too many, abort.  If not, do the delete call.  It adds more overhead,
but ...  At least for the delete, the appropriate rows would be in
cache. :)

Comment 5 Michael Parker 2004-06-04 18:35:30 UTC
Subject: Re:  add sanity check so too many tokens don't get expired

It's been awhile since I've looked at this code but I wonder if this
is just fixing the symptom and not the actual problem.

We should be figuring out the optimal atime value that doesn't expire
too many tokens.  So is that part of the code doing the wrong thing?

Michael

Comment 6 Theo Van Dinter 2004-06-04 19:43:26 UTC
Subject: Re:  add sanity check so too many tokens don't get expired

On Fri, Jun 04, 2004 at 06:35:31PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> We should be figuring out the optimal atime value that doesn't expire
> too many tokens.  So is that part of the code doing the wrong thing?

it's all documented in the sa-learn docs. ;)

the expiry works in 2 passes.  the first pass is what generates the
atime to use, and the second pass actually does the removal.

to make expiry go faster, the first pass will try to estimate what the
atime value should be based on what the last expiry run found, as long
as they are similar enough in what they want to accomplish (ie: if you
previously expired 5000 tokens, and you now want to expire 5000 tokens,
you can probably use the same atime delta as last time...).  this works
on the idea that message flow is relatively constant.

if estimation shouldn't be used (it's rather picky), the first pass will
actually go through all the tokens to figure out what atime value should
be used based on an exponential scale (12 hours, 24 hours, up through
256 days).

the problem is that doing estimation gives you, well, an estimate.
most of the time it's right, and sometimes it's wrong.  it's the reason
why people don't like the weatherman on tv.

so to answer your question, we can always use a better estimation
algorithm/equation, but the nature of the beast is that estimations will
sometimes be wrong so we should just put a sanity check in the bottom
to make sure we're not about to shoot off our own foot.

Comment 7 Michael Parker 2004-06-04 19:52:50 UTC
Subject: Re:  add sanity check so too many tokens don't get expired

Ahhh, yeah I forgot about the first pass/guess based on previous
expires.  Makes sense.

Michael


Comment 8 Theo Van Dinter 2004-06-07 11:51:34 UTC
Created attachment 2005 [details]
new version of implementation

this version handles both DBM and SQL
Comment 9 Theo Van Dinter 2004-06-07 11:52:07 UTC
change to review state, taking ticket
Comment 10 Justin Mason 2004-06-07 19:31:24 UTC
+1 ok, that looks good I think
Comment 11 Michael Parker 2004-06-07 22:35:01 UTC
Subject: Re:  [review] add sanity check so too many tokens don't get expired

I hate to do it, but I'm -1 on the patch as is.  There are a couple of
problems with the SQL stuff.

I'll fix it up in the morning and attach a new version.

Michael

Comment 12 Theo Van Dinter 2004-06-08 08:15:39 UTC
Subject: Re:  [review] add sanity check so too many tokens don't get expired

On Mon, Jun 07, 2004 at 10:35:02PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> I hate to do it, but I'm -1 on the patch as is.  There are a couple of
> problems with the SQL stuff.

Heh.  Don't worry about it, as long as we get a good patch in for
the issue.

For the SQL code, I copied the select section from another part of
the module and just changed the query around and added some if/else
statements.

> I'll fix it up in the morning and attach a new version.

:)

Comment 13 Michael Parker 2004-06-08 11:33:06 UTC
Created attachment 2016 [details]
Patch File

All new version.

Pretty much the same as before, however I made some changes to how the SQL
version returns on error and cleaned up the SQL stuffs.

I don't really like introducing the goto, but I couldn't think of a cleaner way
to jump to the end so we could return cleanly.	Ideas on other methods welcome.
Comment 14 Justin Mason 2004-06-08 20:43:05 UTC
Subject: Re:  [review] add sanity check so too many tokens don't get expired 

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


>I don't really like introducing the goto, but I couldn't think of a cleaner way
>to jump to the end so we could return cleanly.	Ideas on other methods welcome.

goto's fine for clean error handling IMO -- one for the style guide I
suppose ;)

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFAxiGKQTcbUG5Y7woRAkyrAJ9n5urBsGCml5wq8qwGFL9AQ4IIMQCgmxPG
I2SxdS+WRDloE+DJn/crlts=
=Z7wx
-----END PGP SIGNATURE-----

Comment 15 Justin Mason 2004-06-08 21:52:01 UTC
+1, assuming it works ;)
Comment 16 Theo Van Dinter 2004-06-09 13:52:06 UTC
+1 looks fine here
Comment 17 Michael Parker 2004-06-09 14:02:10 UTC
Committed revision 20962.