Bug 3970 - bad sa-learn dump (encrypted token ?)
Status: RESOLVED INVALID
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Learner
Version: 3.0.1
Hardware: PC Linux
Importance: P5 normal
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
Reported: 2004-11-16 01:48 UTC by Eric Gerbier
Modified: 2004-11-16 00:17 UTC

Description Eric Gerbier 2004-11-16 01:48:06 UTC
With SpamAssassin 3.0.1, "sa-learn --dump all" gives me something like:
0.000          0          3          0  non-token data: bayes db version
0.000          0         77          0  non-token data: nspam
0.000          0          0          0  non-token data: nham
0.000          0       4646          0  non-token data: ntokens
0.000          0 1100538095          0  non-token data: oldest atime
0.000          0 1100593149          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0          0          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count
0.500         14          0 1100591536  90311df836
0.500          2          0 1100538095  d58eb8cdb4
...
The tokens look encrypted, and when I try to use --regexp, it does not find anything.

With SpamAssassin 2.64, I get "clear" tokens, and the regexp search works:
0.000          0          2          0  non-token data: bayes db version
0.000          0       8058          0  non-token data: nspam
0.000          0      22695          0  non-token data: nham
0.000          0     158084          0  non-token data: ntokens
0.000          0 1099178165          0  non-token data: oldest atime
0.000          0 1100598197          0  non-token data: newest atime
0.000          0 1100597807          0  non-token data: last journal sync atime
0.000          0 1100560100          0  non-token data: last expiry atime
0.000          0    1382400          0  non-token data: last expire atime delta
0.000          0       1501          0  non-token data: last expire reduction count
0.885          3          1 1100597380  o'clock
0.947          7          1 1100597380  bernoulli
0.891          9          3 1100597380  anticipate
...
Comment 1 Theo Van Dinter 2004-11-16 01:57:43 UTC
Yup.  One of the changes in v3 is that the tokens are now based on SHA-1 hash values of the raw token value.
It's mentioned in the UPGRADE document, and has been well discussed on the users list.

We should probably remove --regexp as an option since it's no longer usable as originally implemented.
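
To illustrate why --regexp stops matching once tokens are hashed, here is a minimal Python sketch. It is not SpamAssassin's code: the function name hashed_token is hypothetical, and keeping the last 5 bytes of the digest is an assumption chosen only because it yields the 10 hex characters seen in the 3.0.1 dump above.

```python
import hashlib
import re


def hashed_token(raw: str, keep_bytes: int = 5) -> str:
    """Return a fixed-length hex key derived from the SHA-1 of a raw token.

    Assumption: the last `keep_bytes` bytes of the digest are kept.
    Five bytes render as 10 hex characters, like the keys in the dump.
    """
    digest = hashlib.sha1(raw.encode("utf-8")).digest()
    return digest[-keep_bytes:].hex()


# Raw tokens taken from the 2.64 dump in the description.
tokens = ["o'clock", "bernoulli", "anticipate"]
keys = [hashed_token(t) for t in tokens]

pattern = re.compile(r"^bern")

# Against raw tokens the pattern finds what the user expects.
raw_hits = [t for t in tokens if pattern.search(t)]

# Against hashed keys it cannot match: the keys contain only the
# characters 0-9 and a-f, and "r" and "n" are not hex digits.
key_hits = [k for k in keys if pattern.search(k)]

print("raw hits:", raw_hits)
print("key hits:", key_hits)
```

A one-way hash keeps no relationship between the text of a token and the text of its key, so a pattern written for the raw token has nothing to match against in the dump.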
Comment 2 Eric Gerbier 2004-11-16 02:20:15 UTC
You can remove the "--dump data" option too, because it is no longer
useful ...
Comment 3 Theo Van Dinter 2004-11-16 08:02:53 UTC
Subject: Re:  bad sa-learn dump (encrypted token ?)

On Tue, Nov 16, 2004 at 02:20:17AM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> You can remove the "--dump data" option too, because it is no longer
> useful ...

No, it's still very useful.  There's a lot of information in knowing what's in
your database; knowing what the actual tokens are isn't so important, and in
the face of the resource improvements it's not a bad tradeoff.
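
As a rough illustration of what can still be learned from a dump without readable tokens, the following Python sketch parses dump lines and separates the "non-token data" rows from the token rows. The names parse_dump and DumpRow are hypothetical, and the column meanings (probability, spam count, ham count, atime, token) are an assumption read off the output in the description, where the "non-token data" rows carry their value in the third column.

```python
from typing import Dict, List, NamedTuple, Tuple


class DumpRow(NamedTuple):
    prob: float
    spam_count: int
    ham_count: int
    atime: int
    token: str


MAGIC_PREFIX = "non-token data: "


def parse_dump(text: str) -> Tuple[Dict[str, int], List[DumpRow]]:
    """Split dump output into magic values and per-token rows."""
    magic: Dict[str, int] = {}
    rows: List[DumpRow] = []
    for line in text.splitlines():
        parts = line.split(None, 4)
        if len(parts) != 5:
            continue  # blank or elided ("...") line
        try:
            prob = float(parts[0])
            col2, col3, col4 = int(parts[1]), int(parts[2]), int(parts[3])
        except ValueError:
            continue  # not a data line
        label = parts[4].strip()
        if label.startswith(MAGIC_PREFIX):
            # Assumption: magic rows keep their value in the third column.
            magic[label[len(MAGIC_PREFIX):]] = col3
        else:
            rows.append(DumpRow(prob, col2, col3, col4, label))
    return magic, rows


SAMPLE = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0         77          0  non-token data: nspam
0.000          0          0          0  non-token data: nham
0.000          0       4646          0  non-token data: ntokens
0.500         14          0 1100591536  90311df836
0.500          2          0 1100538095  d58eb8cdb4
...
"""

magic, rows = parse_dump(SAMPLE)
spam_only = [r for r in rows if r.ham_count == 0 and r.spam_count > 0]
print("messages learned as spam:", magic["nspam"])
print("token rows parsed:", len(rows), "spam-only:", len(spam_only))
```

Counts, probabilities and access times remain available per token even though the token text itself is a hash.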

Comment 4 Eric Gerbier 2004-11-16 08:59:33 UTC
OK, we can get general information with "--dump magic", and probably some
statistics from the data.

But the new "--backup" option can give us the same information. Will you
keep both options, with (almost) the same result?

I do not contest the resource improvement at all. It is just a pity to break
control commands. Is there any plan to (re)add this functionality (regex
search) in a future release?
Comment 5 Michael Parker 2004-11-16 09:15:56 UTC
Subject: Re:  bad sa-learn dump (encrypted token ?)

On Tue, Nov 16, 2004 at 08:59:34AM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> 
> But the new "--backup" option can give us the same information. Will you
> keep both options, with (almost) the same result?

True, backup gives you most of the data, but it doesn't give you the
bayes stats like --dump data does.  This can be useful for folks who
are interested in that sort of thing, see below.

> I do not contest the resource improvement at all. It is just a pity to break
> control commands. Is there any plan to (re)add this functionality (regex
> search) in a future release?

The conversion to binary, fixed length token keys was a HUGE win in
performance.  A large effort was made to add the option (for those who
didn't care about performance) to store the raw token value in the
database (sorry I don't have the bug number in front of me).  The sum
total of that work was that adding even the option of storing the raw
token value was a performance hit and when posed to the user community
there was very little call for this.  However, as a sort of compromise
for any future requests for this sort of data, several hooks were
added to the Plugin API that allow you to get at the raw token data.
This allows you to create a plugin to fetch this data.  The plan, in
some future version, is to expand the Plugin API to allow for something
to happen in the dump (or better named) method.
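
The idea of capturing raw tokens outside the database can be sketched as follows. This is Python for illustration only and does not use the SpamAssassin Plugin API: RawTokenIndex, record, search and token_key are hypothetical names, and the 5-byte SHA-1 truncation is an assumption. The sketch keeps a side table from hashed key to raw token, filled wherever raw tokens are visible, and uses it to run a regex search over the keys found in a dump.

```python
import hashlib
import re
from typing import Dict, Iterable, List, Tuple


def token_key(raw: str) -> str:
    """Hypothetical key function: last 5 bytes of SHA-1, as hex (assumption)."""
    return hashlib.sha1(raw.encode("utf-8")).digest()[-5:].hex()


class RawTokenIndex:
    """Side table mapping hashed keys back to the raw tokens that produced them."""

    def __init__(self) -> None:
        self._raw_by_key: Dict[str, str] = {}

    def record(self, raw_tokens: Iterable[str]) -> None:
        # Called wherever raw tokens are available, e.g. from a hook
        # that sees the tokens of a message being scanned or learned.
        for raw in raw_tokens:
            self._raw_by_key[token_key(raw)] = raw

    def search(self, dumped_keys: Iterable[str], pattern: str) -> List[Tuple[str, str]]:
        # Return (key, raw token) pairs for dumped keys whose recorded
        # raw token matches the pattern; unknown keys are skipped.
        rx = re.compile(pattern)
        hits = []
        for key in dumped_keys:
            raw = self._raw_by_key.get(key)
            if raw is not None and rx.search(raw):
                hits.append((key, raw))
        return hits


index = RawTokenIndex()
index.record(["o'clock", "bernoulli", "anticipate"])

# Keys as they might appear in a dump: one recorded, one never seen.
dumped = [token_key("bernoulli"), "90311df836"]
for key, raw in index.search(dumped, r"^bern"):
    print(key, raw)
```

The cost of such a table is the same storage overhead that was removed from the database itself, which is why it sits outside the core and is opt-in.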

This is on the todo list for some future version of SA, so I'm closing
this bug as invalid.

Michael
Comment 6 Michael Parker 2004-11-16 09:17:58 UTC
Closing as invalid, this is by design.