Bug 6663

Summary: Use unsigned database fields for int
Product: Spamassassin Reporter: Michiel Hazelhof <michiel>
Component: DocumentationAssignee: SpamAssassin Developer Mailing List <dev>
Status: NEW ---    
Severity: enhancement CC: michiel
Priority: P5    
Version: unspecified   
Target Milestone: Undefined   
Hardware: All   
OS: All   
Whiteboard:
Attachments: Test for the sql tables
New revision #1
bayes_mysql #1
userpref_mysql #1
awl_mysql #1

Description Michiel Hazelhof 2011-09-21 19:43:39 UTC
Using unsigned integers saves some database size and allows for integers to grow bigger.

My patch includes awl_mysql.sql, bayes_mysql.sql and userpref_mysql.sql.

As a secondary question should UTF-8 or another? This way it is determined via the default MySQL? (Latin might be smaller).
Comment 1 Michiel Hazelhof 2011-09-21 19:47:19 UTC
Not sure wether `last_atime_delta` and `last_expire_reduce` could be unsigned, as I haven't got enough test data.
Comment 2 Michiel Hazelhof 2011-09-21 20:21:21 UTC
Latin saves about 1.2% (database copy) but I can't verify whether the database is still intact or not,
Comment 3 Michiel Hazelhof 2011-09-21 20:37:09 UTC
Correction, didn't take the correct measurement regarding the utf8 vs latin, latin costs about 1/3 compared to utf8.
Comment 4 Mark Martinec 2011-09-21 22:22:17 UTC
> My patch includes awl_mysql.sql, bayes_mysql.sql and userpref_mysql.sql

Where is the patch? Should probably be attached here.

> As a secondary question should UTF-8 or another?

Which fields?

A fix for a MySQL data type of bayes_token.token is already in trunk,
it should be binary (just as it already is for PostgreSQL) as it
contains octets which are not associated with any character set.
See Bug 6625.

As for other fields, like username or IP address or preference
or a value of a preference, I don't know - these should probably
also not be associated with any particular character set so a
varbinary would probably suit them best, but ascii (or Latin-1)
is just as good as they are not supposed to contain any characters
outside of the printable ascii set.
Comment 5 Michiel Hazelhof 2011-09-22 08:04:55 UTC
Created attachment 4965 [details]
Test for the sql tables
Comment 6 Michiel Hazelhof 2011-09-22 08:26:57 UTC
Sorry I forgot to add the sql file.
Setting fields like bayes_token.id to unsigned , allows them to contain higher numbers and potentially avoid some problems.

All fields which never contain negative integers should be unsigned (unsigned: 0 to 4,294,967,295 and signed: -2,147,438,648 to 2,147,438,647).

I am not sure about the following fields regarding to unsigned:
bayes_expire.runtime
bayes_vars.last_atime_delta
bayes_vars.last_expire_reduce

As for UTF8 it reduces memory usage a bit (I guess, just my measurements) and is capable of holding nearly all possible characters (that should be used anyway). Also defining a default set should help with debugging purposes as well since everybody should be having the same set.
Comment 7 Michiel Hazelhof 2011-09-22 08:27:15 UTC
Created attachment 4966 [details]
New revision #1
Comment 8 Michiel Hazelhof 2011-09-22 08:27:32 UTC
Created attachment 4967 [details]
bayes_mysql #1
Comment 9 Michiel Hazelhof 2011-09-22 08:27:50 UTC
Created attachment 4968 [details]
userpref_mysql #1
Comment 10 Michiel Hazelhof 2011-09-22 08:28:08 UTC
Created attachment 4969 [details]
awl_mysql #1
Comment 11 Michiel Hazelhof 2011-10-08 11:29:53 UTC
Todo:
Determine whether the following fields can be negative: bayes_expire.runtime, bayes_vars.last_atime_delta, bayes_vars.last_expire_reduce

Add Upgrade notice for table changes (UTF-8 + unsigned), also for Bug 6625 I guess (needed to empty my table for this one).