SA Bugzilla – Bug 8042
Bayes warnings when using MySQL
Last modified: 2024-01-09 20:12:13 UTC
Created attachment 5815 [details] db_connect fix When using Bayes with MySQL, reading bayes data may not work. With debug enabled I have this warnings: Sep 9 18:13:55.180 [67917] dbg: bayes: tok_get_all: SQL error: Illegal mix of collations for operation 'in' Sep 9 18:13:55.180 [67917] dbg: bayes: cannot use bayes on this message; none of the tokens were found in the database Sep 9 18:13:55.181 [67917] dbg: bayes: not scoring message, returning undef It seems that, if some tokens are utf8, MySQL cannot correctly query the database. The attached patch fixes the issue by making sure all returned data are utf8.
+1 to commit for 4.0 release.
Targeting this for 4.0.0 release +1 for the proposed patch. That makes three votes counting Giovanni's implied vote, go ahead and commit it.
Is not using DBI::SQL_BINARY in bind_param already supposed to handle this? The token field is binary, so does it even use a collation? Did you investigate this deeper? I'm -1 until better explanation and an additional unit test, if there's a rare of of token looking like a UTF8 string and messing up things.
Or if the problem is the actual perl $token variable being in UTF8 mode fed to bind_param, is not the logical solution to utf8::encode it? mysql_enable_utf8 "When set, a data retrieved from a textual column type (char, varchar, etc) will have the UTF-8 flag turned on if necessary. This enables character semantics on that string." Seems a radical change in general character semantics for the code. So I question if it's the correct fix.
I was too quick with my vote, Henrik's points are all good. I withdraw my +1, not that I have to since a -1 overrides it until it is settled. Looks like we could use a test case to reproduce this. Giovanni, can you track down an email that led to this?
The first error is dbg: bayes: tok_get_all: SQL error: Illegal mix of collations for operation 'in' That can only be in MYSQL.pm line 464 which is the query constructed from my $multi_sql = "SELECT $token_select, spam_count, ham_count, atime FROM bayes_token WHERE id = ? AND token IN "; with the correct number of question marks in a list added at the end. Then there is my $idx = 0; $sth->bind_param(++$idx, $self->{_userid}); $sth->bind_param(++$idx, $_, DBI::SQL_BINARY) foreach (@tok); In sql/bayes_mysql.sql the schema is shown with CREATE TABLE bayes_token ( id int(11) NOT NULL default '0', token binary(5) NOT NULL default '', ... If that's how the database is set up, then shouldn't there no way that there could be a mismatch of types comparing token from that binary column with a prepared variable of type SQL_BINARY? Giovanni, is it possible that the schema is wrong in the database?
Also, I wondered if it is possible to get such an error by giving more than 5 bytes of binary string to a binary(5) column. When I looked at it I see that bayes tokens are made from the lower 40 bits of a SHA1 hash, so always exactly 5 bytes. But that's yet another indication that the UTF-8 change cannot be the correct fix, there is no way for tokens to be any character set, they are arbitrary 40-bit binary strings.
The issue was a wrong database schema, after fixing the schema bayes is working again.
I ran into this issue as well. I must have created my database back when the token column was CHAR(5) instead on BINARY(5). I agree that changing the schema is the proper solution but changing the schema on a production table with 200 million rows requires some time and effort. I found that a quick workaround is to change the query to cast the values to binary format. --- a/lib/Mail/SpamAssassin/BayesStore/MySQL.pm +++ b/lib/Mail/SpamAssassin/BayesStore/MySQL.pm @@ -438,7 +438,7 @@ sub tok_get_all { $bunch_end = $search_index + $bunch_size; for ( ; $search_index < $bunch_end; $search_index++) { - $in_str .= '?,'; + $in_str .= 'BINARY ?,'; push(@tok, $tokens[$search_index]); } chop $in_str; This will prevent the "Illegal mix of collations" error and get your Bayes working right away. I don't know if it's worth including in the next release or not but I'm posting here in case it helps someone. On another note...it would have been very helpful if the SQL error message was a `warn` instead of a `dbg`. I would have been able to catch this problem sooner if it showed up in my logs. --- a/lib/Mail/SpamAssassin/BayesStore/MySQL.pm +++ b/lib/Mail/SpamAssassin/BayesStore/MySQL.pm @@ -461,7 +461,7 @@ sub tok_get_all { my $rc = $sth->execute(); unless ($rc) { - dbg("bayes: tok_get_all: SQL error: ".$self->{_dbh}->errstr()); + warn("bayes: tok_get_all: SQL error: ".$self->{_dbh}->errstr()); $self->{_dbh}->rollback(); return []; } There are 41 other instances like this in MySQL.pm. I can submit new bug report with a patch file if there is interest. Thanks Kent