SA Bugzilla – Bug 1523
Bayes database endianness-dependence leads to corruption when sharing between big- and little-endian platforms
Last modified: 2003-04-22 11:29:31 UTC
[Previously mentioned on the list; raising a bug for it anyway. I think I've beaten my browser into cooperation] My Bayes DB looks a little damaged; it has entries like 0.000 0 16777430 3 *really* 0.000 0 16777255 3 wildly 0.000 408 33555941 2816 they 1.000 16777275 0 2304 N:H*x:FN.N 0.000 13 16777477 5 sometimes I guess this is a big/little-endian problem; the bayes DBs were created via `sa-learn' on a little-endian box and are being updated (mostly) on a big-endian one. It looks like the stored data is not endian- independent; of course the db-format itself is, but that doesn't say much about the data stored therein. This should fix it (untested as yet, I'm still re-sa-learning my corpus, I should know if it works in five hours or so): Index: spamassassin/lib/Mail/SpamAssassin/BayesStore.pm =================================================================== RCS file: /cvsroot/spamassassin/spamassassin/lib/Mail/SpamAssassin/BayesStore.pm,v retrieving revision 1.21 diff -u -r1.21 BayesStore.pm --- spamassassin/lib/Mail/SpamAssassin/BayesStore.pm 18 Feb 2003 17:47:56 -0000 1.21 +++ spamassassin/lib/Mail/SpamAssassin/BayesStore.pm 22 Feb 2003 20:09:12 -0000 @@ -758,7 +758,7 @@ $atime || 0); } elsif (($packed & FORMAT_FLAG) == TWO_LONGS_FORMAT) { - my ($packed, $ts, $th, $atime) = unpack("CLLS", $_[0] || 0); + my ($packed, $ts, $th, $atime) = unpack("CVVS", $_[0] || 0); return ($ts || 0, $th || 0, $atime || 0); } # other formats would go here... @@ -774,7 +774,7 @@ if ($ts < 8 && $th < 8) { return pack ("CS", ONE_BYTE_FORMAT | ($ts << 3) | $th, $atime); } else { - return pack ("CLLS", TWO_LONGS_FORMAT, $ts, $th, $atime); + return pack ("CVVS", TWO_LONGS_FORMAT, $ts, $th, $atime); } } This'll require a bayes DB rebuild on big-endian hosts (as well as, of course, on big/little intermixed setups, but they're fubared anyway).
Subject: Re: [SAdev] New: Bayes database endianness-dependence leads to corruption when sharing between big- and little-endian platforms Shouldn't you change each "S" into a "v" too? There are some pack/unpack statements on other lines that also need to be changed. Daniel
Subject: Re: Bayes database endianness-dependence leads to corruption when sharing between big- and little-endian platforms On Sat, 22 Feb 2003, bugzilla-daemon@hughes-family.org yowled: > Shouldn't you change each "S" into a "v" too? There are some > pack/unpack statements on other lines that also need to be changed. Argh, true. I patched in too much of a hurry. Revised patch (again, untested until my Bayes rebuild is done): Index: spamassassin/lib/Mail/SpamAssassin/BayesStore.pm =================================================================== RCS file: /cvsroot/spamassassin/spamassassin/lib/Mail/SpamAssassin/BayesStore.pm,v retrieving revision 1.21 diff -u -r1.21 BayesStore.pm --- spamassassin/lib/Mail/SpamAssassin/BayesStore.pm 18 Feb 2003 17:47:56 -0000 1.21 +++ spamassassin/lib/Mail/SpamAssassin/BayesStore.pm 22 Feb 2003 21:11:36 -0000 @@ -738,7 +738,7 @@ # This looks like: XXSSSHHH (XX = format bits, SSS = 3 spam-count bits, HHH = 3 # ham-count bits). If XX in the first byte is 11, it's packed as this 1-byte # representation; otherwise, if XX in the first byte is 00, it's packed as -# "CLL", ie. 1 byte and 2 32-bit "longs" in perl pack format. +# "CVV", ie. 1 byte and 2 32-bit "longs" in perl pack format. # Savings: roughly halves size of toks db, at the cost of a ~10% slowdown. @@ -750,7 +750,7 @@ use constant ONE_BYTE_HHH_BITS => 0x07; # 00000111 sub tok_unpack { - my ($packed, $atime) = unpack("CS", $_[0] || 0); + my ($packed, $atime) = unpack("Cv", $_[0] || 0); if (($packed & FORMAT_FLAG) == ONE_BYTE_FORMAT) { return (($packed & ONE_BYTE_SSS_BITS) >> 3, @@ -758,7 +758,7 @@ $atime || 0); } elsif (($packed & FORMAT_FLAG) == TWO_LONGS_FORMAT) { - my ($packed, $ts, $th, $atime) = unpack("CLLS", $_[0] || 0); + my ($packed, $ts, $th, $atime) = unpack("CVVv", $_[0] || 0); return ($ts || 0, $th || 0, $atime || 0); } # other formats would go here... @@ -772,9 +772,9 @@ my ($ts, $th, $atime) = @_; $ts ||= 0; $th ||= 0; $atime ||= 0; if ($ts < 8 && $th < 8) { - return pack ("CS", ONE_BYTE_FORMAT | ($ts << 3) | $th, $atime); + return pack ("Cv", ONE_BYTE_FORMAT | ($ts << 3) | $th, $atime); } else { - return pack ("CLLS", TWO_LONGS_FORMAT, $ts, $th, $atime); + return pack ("CVVv", TWO_LONGS_FORMAT, $ts, $th, $atime); } } I think that's caught everything.
Subject: Re: [SAdev] Bayes database endianness-dependence leads to corruption when sharing between big- and little-endian platforms On Sat, Feb 22, 2003 at 01:13:52PM -0800, bugzilla-daemon@hughes-family.org wrote: > Argh, true. I patched in too much of a hurry. > > Revised patch (again, untested until my Bayes rebuild is done): We also need to look at any other tools we make available (check_bayes_db comes to mind) to make sure it knows how to deal with the format as well. Also, can you attach the patches instead of pasting them in?
Subject: Re: Bayes database endianness-dependence leads to corruption when sharing between big- and little-endian platforms On Sat, 22 Feb 2003, bugzilla-daemon@hughes-family.org mused: > We also need to look at any other tools we make available (check_bayes_db > comes to mind) to make sure it knows how to deal with the format as well. Will check. > Also, can you attach the patches instead of pasting them in? OK, although I'd be surprised if Bugzilla mangles patches emailed to bugzilla-daemon :)
Subject: Re: [SAdev] Bayes database endianness-dependence leads to corruption when sharing between big- and little-endian platforms On Sat, Feb 22, 2003 at 01:30:12PM -0800, bugzilla-daemon@hughes-family.org wrote: > OK, although I'd be surprised if Bugzilla mangles patches emailed to > bugzilla-daemon :) I don't mean attach to an email, I mean goto bugzilla and select "add attachment". :)
Subject: Re: Bayes database endianness-dependence leads to corruption when sharing between big- and little-endian platforms On 22 Feb 2003, nix@esperi.demon.co.uk spake: > On Sat, 22 Feb 2003, bugzilla-daemon@hughes-family.org mused: >> We also need to look at any other tools we make available (check_bayes_db >> comes to mind) to make sure it knows how to deal with the format as well. > > Will check. Attaching re-revised patch; check_bayes_db and trim_bayes_db needed fixing. Note that ArchiveIterator packs floats, and trim_bayes_db unpacks them; this is not portable, and there is no portable replacement; at least, none mentioned in the pack docs (how very like perl: a 90% solution yet again). I'm not sure how significant those float packs are, though.
Yes -- I used native-endian, for speed. I didn't see pack('V')... What we could probably do is to add a DB revision token, indicating the rev # of the DB file; old bayes files are kept as-is, new ones are created with pack('CVVS'). Hence backwards compat is preserved. Float packing BTW is totally unportable. Forget about it ;) (BTW folks -- I'll be travelling next week for NAI, so replies may be infrequent! ;)
Subject: Re: [SAdev] Bayes database endianness-dependence leads to corruption when sharing between big- and little-endian platforms On Sun, 23 Feb 2003, bugzilla-daemon@hughes-family.org stipulated: > Yes -- I used native-endian, for speed. I didn't see pack('V')... I think speed is less important than correctness, here. > What we could probably do is to add a DB revision token, indicating the > rev # of the DB file; old bayes files are kept as-is, new ones are created > with pack('CVVS'). Hence backwards compat is preserved. Er, you mean 'CVVv' :) Is preservation of backwards compatibility actually important, given that the DB can be rebuilt via sa-learn anyway? (Hm, thinking of some of my `change-nothing' users, you may be right...) > Float packing BTW is totally unportable. Forget about it ;) Well, SA's doing some. I'm not sure how important this is or even if the resulting packed data is ever written anywhere, though, because I can't see anything unpacking the float-packed data except for trim-bayes-db.
Subject: Re: [SAdev] Bayes database endianness-dependence leads to corruption when sharing between big- and little-endian platforms > Is preservation of backwards compatibility actually important, given > that the DB can be rebuilt via sa-learn anyway? Yes. But it would be nice if sa-learn --rebuild would rewrite the DB in the new format, if it detects an old db.
Subject: Re: [SAdev] Bayes database endianness-dependence leads to corruption when sharing between big- and little-endian platforms On Sun, Feb 23, 2003 at 05:59:10PM -0800, bugzilla-daemon@hughes-family.org wrote: > Yes. But it would be nice if sa-learn --rebuild would rewrite the DB > in the new format, if it detects an old db. I don't know about that. I want "--rebuild" to rebuild, not upgrade. A new "--upgrade" would be fine though. Take DB version X and upgrade to latest version.
Subject: Re: [SAdev] Bayes database endianness-dependence leads to corruption when sharing between big- and little-endian platforms > I don't know about that. I want "--rebuild" to rebuild, not upgrade. If you are already rebuliding the database, why rewrite it in the previous format?
Subject: Re: [SAdev] Bayes database endianness-dependence leads to corruption when sharing between big- and little-endian platforms On Sun, Feb 23, 2003 at 06:49:48PM -0800, bugzilla-daemon@hughes-family.org wrote: > If you are already rebuliding the database, why rewrite it in the > previous format? Path of least astonishment. There's a difference between syncing the journal and converting between db formats.
I have a suggestion: For 2.51 and later: let's add a version field to the SA Bayes database format. Major versions are not compatible with each other, minor versions are compatible with each other. If possible, I'd like to include the Perl DB module name (DB_File, SDBM, etc.) and anything else that would definitely affect compatibility in the version number, like: module=DB_File,bayes_db_version=1.0
Subject: Re: [SAdev] Bayes database endianness-dependence leads to corruption when sharing between big- and little-endian platforms On Mon, Feb 24, 2003 at 11:06:36AM -0800, bugzilla-daemon@hughes-family.org wrote: > For 2.51 and later: let's add a version field to the SA Bayes database format. > Major versions are not compatible with each other, minor versions are > compatible with each other. If possible, I'd like to include the Perl DB > module name (DB_File, SDBM, etc.) and anything else that would definitely > affect compatibility in the version number, like: > > module=DB_File,bayes_db_version=1.0 a version # I understand, but module? if you can't read the DB, you won't be able to read what module you need. perhaps I'm missing something.
my POV on this one: Dan's version suggestion is good. Theo, might as well include the DB type as well, since Klaus has reported that DB_File and NDBM_File both use berkeley db on some BSD. (FreeBSD?). Plus it provides a little sanity-check too. Regarding --rebuild rebuilding the db in the new format: this will actually take place, in the current code. When a user runs "sa-learn --rebuild --force-expire" SA will create a totally new _toks db file, then rename it into place. So that's already taken care of. Note that the other db files are left as-is however, and not rebuilt. hmm. but that's a separate issue...
OK, regarding 2.51: we don't have a fix in for this yet. it'll require a patch which (a) adds a version key to _toks, and (b) uses the new, portable packing format when creating a new _toks. I don't think it's a biggie that we don't have this fixed for 2.51; sharing a bayes db across different platforms is pretty unusual anyway. So let's just document it for 2.51 instead, fix it in HEAD after 2.51 release, and backport for a 2.52 if required.
changing database endianness is a 2.6 thing.
code committed to head