Bug 1523 - Bayes database endianness-dependence leads to corruption when sharing between big- and little-endian platforms
Summary: Bayes database endianness-dependence leads to corruption when sharing between...
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Learner (show other bugs)
Version: 2.50
Hardware: All All
: P5 normal
Target Milestone: 2.60
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on: 1776
Blocks:
  Show dependency tree
 
Reported: 2003-02-22 12:28 UTC by nix
Modified: 2003-04-22 11:29 UTC (History)
0 users



Attachment Type Modified Status Actions Submitter/CLA Status

Note You need to log in before you can comment on or make changes to this bug.
Description nix 2003-02-22 12:28:36 UTC
[Previously mentioned on the list; raising a bug for it anyway. I think
 I've beaten my browser into cooperation]

My Bayes DB looks a little damaged; it has entries like

0.000        0 16777430        3  *really*
0.000        0 16777255        3  wildly
0.000      408 33555941     2816  they
1.000 16777275        0     2304  N:H*x:FN.N
0.000       13 16777477        5  sometimes

I guess this is a big/little-endian problem; the bayes DBs were created
via `sa-learn' on a little-endian box and are being updated (mostly) on
a big-endian one. It looks like the stored data is not endian-
independent; of course the db-format itself is, but that doesn't say
much about the data stored therein.

This should fix it (untested as yet, I'm still re-sa-learning my corpus,
I should know if it works in five hours or so):

Index: spamassassin/lib/Mail/SpamAssassin/BayesStore.pm
===================================================================
RCS file: /cvsroot/spamassassin/spamassassin/lib/Mail/SpamAssassin/BayesStore.pm,v
retrieving revision 1.21
diff -u -r1.21 BayesStore.pm
--- spamassassin/lib/Mail/SpamAssassin/BayesStore.pm	18 Feb 2003 17:47:56 -0000	1.21
+++ spamassassin/lib/Mail/SpamAssassin/BayesStore.pm	22 Feb 2003 20:09:12 -0000
@@ -758,7 +758,7 @@
 		$atime || 0);
   }
   elsif (($packed & FORMAT_FLAG) == TWO_LONGS_FORMAT) {
-    my ($packed, $ts, $th, $atime) = unpack("CLLS", $_[0] || 0);
+    my ($packed, $ts, $th, $atime) = unpack("CVVS", $_[0] || 0);
     return ($ts || 0, $th || 0, $atime || 0);
   }
   # other formats would go here...
@@ -774,7 +774,7 @@
   if ($ts < 8 && $th < 8) {
     return pack ("CS", ONE_BYTE_FORMAT | ($ts << 3) | $th, $atime);
   } else {
-    return pack ("CLLS", TWO_LONGS_FORMAT, $ts, $th, $atime);
+    return pack ("CVVS", TWO_LONGS_FORMAT, $ts, $th, $atime);
   }
 }

This'll require a bayes DB rebuild on big-endian hosts (as well as, of
course, on big/little intermixed setups, but they're fubared anyway).
Comment 1 Daniel Quinlan 2003-02-22 12:50:32 UTC
Subject: Re: [SAdev]  New: Bayes database endianness-dependence leads to corruption when sharing between big- and little-endian platforms

Shouldn't you change each "S" into a "v" too?  There are some
pack/unpack statements on other lines that also need to be changed.

Daniel

Comment 2 Ben Rosengart 2003-02-22 13:13:52 UTC
Subject: Re:  Bayes database endianness-dependence leads to corruption when sharing between big- and little-endian platforms

On Sat, 22 Feb 2003, bugzilla-daemon@hughes-family.org yowled:
> Shouldn't you change each "S" into a "v" too?  There are some
> pack/unpack statements on other lines that also need to be changed.

Argh, true. I patched in too much of a hurry.

Revised patch (again, untested until my Bayes rebuild is done):

Index: spamassassin/lib/Mail/SpamAssassin/BayesStore.pm
===================================================================
RCS file: /cvsroot/spamassassin/spamassassin/lib/Mail/SpamAssassin/BayesStore.pm,v
retrieving revision 1.21
diff -u -r1.21 BayesStore.pm
--- spamassassin/lib/Mail/SpamAssassin/BayesStore.pm	18 Feb 2003 17:47:56 -0000	1.21
+++ spamassassin/lib/Mail/SpamAssassin/BayesStore.pm	22 Feb 2003 21:11:36 -0000
@@ -738,7 +738,7 @@
 # This looks like: XXSSSHHH (XX = format bits, SSS = 3 spam-count bits, HHH = 3
 # ham-count bits).  If XX in the first byte is 11, it's packed as this 1-byte
 # representation; otherwise, if XX in the first byte is 00, it's packed as
-# "CLL", ie. 1 byte and 2 32-bit "longs" in perl pack format.
+# "CVV", ie. 1 byte and 2 32-bit "longs" in perl pack format.
 
 # Savings: roughly halves size of toks db, at the cost of a ~10% slowdown.
 
@@ -750,7 +750,7 @@
 use constant ONE_BYTE_HHH_BITS	=> 0x07;	# 00000111
 
 sub tok_unpack {
-  my ($packed, $atime) = unpack("CS", $_[0] || 0);
+  my ($packed, $atime) = unpack("Cv", $_[0] || 0);
 
   if (($packed & FORMAT_FLAG) == ONE_BYTE_FORMAT) {
     return (($packed & ONE_BYTE_SSS_BITS) >> 3,
@@ -758,7 +758,7 @@
 		$atime || 0);
   }
   elsif (($packed & FORMAT_FLAG) == TWO_LONGS_FORMAT) {
-    my ($packed, $ts, $th, $atime) = unpack("CLLS", $_[0] || 0);
+    my ($packed, $ts, $th, $atime) = unpack("CVVv", $_[0] || 0);
     return ($ts || 0, $th || 0, $atime || 0);
   }
   # other formats would go here...
@@ -772,9 +772,9 @@
   my ($ts, $th, $atime) = @_;
   $ts ||= 0; $th ||= 0; $atime ||= 0;
   if ($ts < 8 && $th < 8) {
-    return pack ("CS", ONE_BYTE_FORMAT | ($ts << 3) | $th, $atime);
+    return pack ("Cv", ONE_BYTE_FORMAT | ($ts << 3) | $th, $atime);
   } else {
-    return pack ("CLLS", TWO_LONGS_FORMAT, $ts, $th, $atime);
+    return pack ("CVVv", TWO_LONGS_FORMAT, $ts, $th, $atime);
   }
 }

I think that's caught everything.

Comment 3 Theo Van Dinter 2003-02-22 13:16:05 UTC
Subject: Re: [SAdev]  Bayes database endianness-dependence leads to corruption when sharing between big- and little-endian platforms

On Sat, Feb 22, 2003 at 01:13:52PM -0800, bugzilla-daemon@hughes-family.org wrote:
> Argh, true. I patched in too much of a hurry.
> 
> Revised patch (again, untested until my Bayes rebuild is done):

We also need to look at any other tools we make available (check_bayes_db
comes to mind) to make sure it knows how to deal with the format as well.

Also, can you attach the patches instead of pasting them in?

Comment 4 Ben Rosengart 2003-02-22 13:30:12 UTC
Subject: Re:  Bayes database endianness-dependence leads to corruption when sharing between big- and little-endian platforms

On Sat, 22 Feb 2003, bugzilla-daemon@hughes-family.org mused:
> We also need to look at any other tools we make available (check_bayes_db
> comes to mind) to make sure it knows how to deal with the format as well.

Will check.

> Also, can you attach the patches instead of pasting them in?

OK, although I'd be surprised if Bugzilla mangles patches emailed to
bugzilla-daemon :)

Comment 5 Theo Van Dinter 2003-02-22 13:32:50 UTC
Subject: Re: [SAdev]  Bayes database endianness-dependence leads to corruption when sharing between big- and little-endian platforms

On Sat, Feb 22, 2003 at 01:30:12PM -0800, bugzilla-daemon@hughes-family.org wrote:
> OK, although I'd be surprised if Bugzilla mangles patches emailed to
> bugzilla-daemon :)

I don't mean attach to an email, I mean goto bugzilla and select "add attachment". :)

Comment 6 Ben Rosengart 2003-02-22 13:37:11 UTC
Subject: Re:  Bayes database endianness-dependence leads to corruption when sharing between big- and little-endian platforms

On 22 Feb 2003, nix@esperi.demon.co.uk spake:
> On Sat, 22 Feb 2003, bugzilla-daemon@hughes-family.org mused:
>> We also need to look at any other tools we make available (check_bayes_db
>> comes to mind) to make sure it knows how to deal with the format as well.
> 
> Will check.

Attaching re-revised patch; check_bayes_db and trim_bayes_db needed
fixing.

Note that ArchiveIterator packs floats, and trim_bayes_db unpacks them;
this is not portable, and there is no portable replacement; at least,
none mentioned in the pack docs (how very like perl: a 90% solution yet
again).

I'm not sure how significant those float packs are, though.

Comment 7 Justin Mason 2003-02-23 11:09:02 UTC
Yes -- I used native-endian, for speed.  I didn't see pack('V')...

What we could probably do is to add a DB revision token, indicating the
rev # of the DB file; old bayes files are kept as-is, new ones are created
with pack('CVVS').  Hence backwards compat is preserved. 

Float packing BTW is totally unportable. Forget about it ;)

(BTW folks -- I'll be travelling next week for NAI, so replies may
be infrequent! ;)
Comment 8 Ben Rosengart 2003-02-23 12:00:17 UTC
Subject: Re: [SAdev]  Bayes database endianness-dependence leads to corruption when sharing between big- and little-endian platforms

On Sun, 23 Feb 2003, bugzilla-daemon@hughes-family.org stipulated:
> Yes -- I used native-endian, for speed.  I didn't see pack('V')...

I think speed is less important than correctness, here.

> What we could probably do is to add a DB revision token, indicating the
> rev # of the DB file; old bayes files are kept as-is, new ones are created
> with pack('CVVS').  Hence backwards compat is preserved. 

Er, you mean 'CVVv' :)

Is preservation of backwards compatibility actually important, given
that the DB can be rebuilt via sa-learn anyway?

(Hm, thinking of some of my `change-nothing' users, you may be right...)

> Float packing BTW is totally unportable. Forget about it ;)

Well, SA's doing some. I'm not sure how important this is or even if the
resulting packed data is ever written anywhere, though, because I can't
see anything unpacking the float-packed data except for trim-bayes-db.

Comment 9 Duncan Findlay 2003-02-23 17:59:10 UTC
Subject: Re: [SAdev]  Bayes database endianness-dependence leads to corruption when sharing between big- and little-endian platforms

> Is preservation of backwards compatibility actually important, given
> that the DB can be rebuilt via sa-learn anyway?

Yes. But it would be nice if sa-learn --rebuild would rewrite the DB
in the new format, if it detects an old db.

Comment 10 Theo Van Dinter 2003-02-23 18:10:01 UTC
Subject: Re: [SAdev]  Bayes database endianness-dependence leads to corruption when sharing between big- and little-endian platforms

On Sun, Feb 23, 2003 at 05:59:10PM -0800, bugzilla-daemon@hughes-family.org wrote:
> Yes. But it would be nice if sa-learn --rebuild would rewrite the DB
> in the new format, if it detects an old db.

I don't know about that.  I want "--rebuild" to rebuild, not upgrade.

A new "--upgrade" would be fine though.  Take DB version X and upgrade
to latest version.

Comment 11 Duncan Findlay 2003-02-23 18:49:48 UTC
Subject: Re: [SAdev]  Bayes database endianness-dependence leads to corruption when sharing between big- and little-endian platforms

> I don't know about that.  I want "--rebuild" to rebuild, not upgrade.

If you are already rebuliding the database, why rewrite it in the
previous format?

Comment 12 Theo Van Dinter 2003-02-23 19:04:39 UTC
Subject: Re: [SAdev]  Bayes database endianness-dependence leads to corruption when sharing between big- and little-endian platforms

On Sun, Feb 23, 2003 at 06:49:48PM -0800, bugzilla-daemon@hughes-family.org wrote:
> If you are already rebuliding the database, why rewrite it in the
> previous format?

Path of least astonishment.  There's a difference between syncing the
journal and converting between db formats.

Comment 13 Daniel Quinlan 2003-02-24 11:06:36 UTC
I have a suggestion:

For 2.51 and later: let's add a version field to the SA Bayes database format.
Major versions are not compatible with each other, minor versions are
compatible with each other.  If possible, I'd like to include the Perl DB
module name (DB_File, SDBM, etc.) and anything else that would definitely
affect compatibility in the version number, like:

  module=DB_File,bayes_db_version=1.0
Comment 14 Theo Van Dinter 2003-02-24 12:14:35 UTC
Subject: Re: [SAdev]  Bayes database endianness-dependence leads to corruption when sharing between big- and little-endian platforms

On Mon, Feb 24, 2003 at 11:06:36AM -0800, bugzilla-daemon@hughes-family.org wrote:
> For 2.51 and later: let's add a version field to the SA Bayes database format.
> Major versions are not compatible with each other, minor versions are
> compatible with each other.  If possible, I'd like to include the Perl DB
> module name (DB_File, SDBM, etc.) and anything else that would definitely
> affect compatibility in the version number, like:
> 
>   module=DB_File,bayes_db_version=1.0

a version # I understand, but module?  if you can't read the DB, you
won't be able to read what module you need.  perhaps I'm missing something.

Comment 15 Justin Mason 2003-03-17 12:00:32 UTC
my POV on this one:

Dan's version suggestion is good.  Theo, might as well include the DB type as
well, since Klaus has reported that DB_File and NDBM_File both use berkeley db on
some BSD. (FreeBSD?).   Plus it provides a little sanity-check too.

Regarding --rebuild rebuilding the db in the new format: this will actually take
place, in the current code.   When a user runs "sa-learn --rebuild
--force-expire"  SA will create a totally new _toks db file, then rename it into
place.  So that's already taken care of.

Note that the other db files are left as-is however, and not rebuilt. hmm.
but that's a separate issue...
Comment 16 Justin Mason 2003-03-18 15:39:56 UTC
OK, regarding 2.51: we don't have a fix in for this yet.  it'll require a patch
which (a) adds a version key to _toks, and (b) uses the new, portable packing
format when creating a new _toks.

I don't think it's a biggie that we don't have this fixed for 2.51; sharing a
bayes db across different platforms is pretty unusual anyway.  So let's just
document it for 2.51 instead, fix it in HEAD after 2.51 release, and backport
for a 2.52 if required.
Comment 17 Theo Van Dinter 2003-03-24 13:18:04 UTC
changing database endianness is a 2.6 thing.
Comment 18 Theo Van Dinter 2003-04-22 19:29:31 UTC
code committed to head