Bug 1776 - Upgrade Bayes DB format to v1
Summary: Upgrade Bayes DB format to v1
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Learner
Version: SVN Trunk (Latest Devel Version)
Hardware: Other other
Importance: P2 blocker
Target Milestone: 2.60
Assignee: Theo Van Dinter
URL:
Whiteboard:
Keywords:
Depends on: 1775
Blocks: 1523 1666
Reported: 2003-04-13 19:45 UTC by Theo Van Dinter
Modified: 2003-04-23 13:59 UTC

Description Theo Van Dinter 2003-04-13 19:45:08 UTC
The 2.50 Bayes DB format (v0) has several issues, including:

- token atime is an unsigned short, so it overflows
- token values are endian-dependent
- magic tokens could be found in messages
- no db version magic token exists


So for 2.60, we should come up with a new format that addresses these issues.
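
To make the first two issues concrete, here is a minimal Python sketch (not SpamAssassin code). It assumes only what the list states: atime lives in a 16-bit unsigned field, and values are written in the host's byte order. The helper name wrap_u16 is made up for illustration.

```python
import struct

# Assumption: the old atime field is a 16-bit unsigned value.
USHORT_MAX = 0xFFFF


def wrap_u16(value: int) -> int:
    """Keep only what fits in an unsigned 16-bit field."""
    return value & USHORT_MAX


# The same 16-bit value serialized in the two byte orders.
little = struct.pack("<H", 1)  # b'\x01\x00'
big = struct.pack(">H", 1)     # b'\x00\x01'

# A value written on a big-endian host and read back on a
# little-endian host comes out as a different number.
misread = struct.unpack("<H", big)[0]  # 256
```

Once a counter passes 65535 the stored value wraps back to 0, so the most recently seen tokens look like the oldest ones.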
Comment 1 Theo Van Dinter 2003-04-13 19:50:57 UTC
- message ids may not exist and may not be unique
Comment 2 Theo Van Dinter 2003-04-22 19:41:40 UTC
ok, my initial version is committed to head.

- atime is now an unsigned int (32-bit) so it shouldn't overflow for quite some time
- token values are forced vax (little endian) order
- magic tokens were modified to have a prefix of ^M^A^G^I^C instead of ** so
that there's no possibility of a message corrupting the magic tokens (control
chars are automatically stripped when tokenizing)
- db version magic token added and code added to set it when the db is initially
created
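
As a rough illustration of the changes listed above, the following Python sketch packs a token record with explicit little-endian 32-bit fields and builds a magic-token key from the control-character prefix. The record layout (spam count, ham count, atime), the key name DB_VERSION and the strip_control_chars helper are assumptions made for the example, not the committed code.

```python
import struct

# ^M ^A ^G ^I ^C in caret notation: each control character is the
# letter's ASCII code minus 64.
MAGIC_PREFIX = bytes(ord(c) - 64 for c in "MAGIC")  # b'\r\x01\x07\t\x03'

# Hypothetical record layout: three unsigned 32-bit little-endian ints.
TOKEN_FORMAT = "<III"


def magic_key(name: str) -> bytes:
    """Build a magic-token key; the name is whatever the caller chooses."""
    return MAGIC_PREFIX + name.encode("ascii")


def pack_token(spam: int, ham: int, atime: int) -> bytes:
    """Serialize counts and atime in a host-independent byte order."""
    return struct.pack(TOKEN_FORMAT, spam, ham, atime)


def unpack_token(raw: bytes) -> tuple:
    """Inverse of pack_token."""
    return struct.unpack(TOKEN_FORMAT, raw)


def strip_control_chars(text: str) -> str:
    """Stand-in for a tokenizer that drops control characters."""
    return "".join(ch for ch in text if ord(ch) >= 32)
```

Because the stand-in tokenizer removes every control character, no token taken from a message can start with the prefix, which is the property the magic tokens rely on.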


I left the message id issue alone for now.  It's bug 1588.  We should probably
come up with something for that too, but that's not really part of the token db
format change.

all that's left is to figure out what atime should be.  one camp (myself
included) thinks it should stay a message count; another wants a timestamp
(some variation of time_t squeezed into 2 bytes, ie: 6-18 hour blocks instead
of 1 second resolution).

since the format isn't set in stone, but the code is committed, I disabled bayes
r/w access in 2.60 until we decide on the format.
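
For a feel of what the 2-byte timestamp option would cover, here is a small Python sketch. The function names and the idea of counting blocks from a chosen epoch are assumptions for illustration; nothing like this is in the commit.

```python
SECONDS_PER_HOUR = 3600
U16_BUCKETS = 2 ** 16  # distinct values in a 2-byte field


def time_bucket(timestamp: int, block_hours: int, epoch: int = 0) -> int:
    """Map a time_t-style timestamp to a coarse block index that fits in 16 bits."""
    blocks = (timestamp - epoch) // (block_hours * SECONDS_PER_HOUR)
    return blocks % U16_BUCKETS


def coverage_years(block_hours: int) -> float:
    """How many years 65536 blocks of the given size span before wrapping."""
    return U16_BUCKETS * block_hours / 24 / 365.25
```

With 6-hour blocks a 2-byte field spans a bit under 45 years before it wraps, and with 18-hour blocks a bit under 135 years. The price is that tokens seen within the same block cannot be ordered against each other.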
Comment 3 Duncan Findlay 2003-04-22 20:53:00 UTC
Subject: Re: [SAdev]  Upgrade Bayes DB format to v1

> - magic tokens were modified to have a prefix of ^M^A^G^I^C instead of ** so
> that there's no possibility of a message corrupting the magic tokens (control
> chars are automatically stripped when tokenizing)

How about magic tokens simply starting with \000 instead of something
so complicated?

Comment 4 Justin Mason 2003-04-22 22:26:25 UTC
Subject: Re: [SAdev]  Upgrade Bayes DB format to v1 


> > - magic tokens were modified to have a prefix of ^M^A^G^I^C instead of ** so
> > that there's no possibility of a message corrupting the magic tokens (control
> > chars are automatically stripped when tokenizing)
>
> How about magic tokens simply starting with \000 instead of something
> so complicated?

I think \000 is probably best avoided -- at least because db manipulation
may become impossible in some languages in that case.
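
One way to see the concern, sketched in Python with ctypes standing in for any API that treats keys as NUL-terminated C strings. The key names are made up for the example.

```python
import ctypes


def as_c_string(key: bytes) -> bytes:
    """Return what a NUL-terminated string API would see for this key."""
    return ctypes.create_string_buffer(key).value


nul_key = b"\x00NSPAM"   # hypothetical magic token starting with \000
ctrl_key = b"\x0dNSPAM"  # same name behind a non-NUL control character
```

Here as_c_string(nul_key) comes back empty, so every \000-prefixed key would look identical to such an API, while as_c_string(ctrl_key) keeps the whole key.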

--j.

Comment 5 Theo Van Dinter 2003-04-23 07:21:44 UTC
Subject: Re:  Upgrade Bayes DB format to v1

On Tue, Apr 22, 2003 at 08:53:00PM -0700, bugzilla-daemon@hughes-family.org wrote:
> How about magic tokens simply starting with \000 instead of something
> so complicated?

Yeah, what Justin said. ;)

I wanted to specifically avoid null.  I thought about just using a single
control char, but decided it would be better to be more complex for the
magic tokens since they're so important.  It's also not wasting a ton of
space since there's only ~7 magic tokens.  So if we used 1 instead of 5,
that's 4 bytes * 7 tokens or 28 bytes saved.

Comment 6 Theo Van Dinter 2003-04-23 21:59:22 UTC
ok, I've only heard from Justin WRT atime, and he voted msgcount as well.  so
msgcount it is.  I'll re-enable bayes r/w in 2.60 and close the ticket. :)