Bug 8042 - Bayes warnings when using MySQL
Summary: Bayes warnings when using MySQL
Status: RESOLVED INVALID
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Plugins (show other bugs)
Version: SVN Trunk (Latest Devel Version)
Hardware: PC Linux
: P2 blocker
Target Milestone: 4.0.0
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-09-13 08:34 UTC by Giovanni Bechis
Modified: 2024-01-09 20:12 UTC (History)
5 users (show)



Attachment Type Modified Status Actions Submitter/CLA Status
db_connect fix patch None Giovanni Bechis [HasCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Giovanni Bechis 2022-09-13 08:34:34 UTC
Created attachment 5815 [details]
db_connect fix

When using Bayes with MySQL, reading bayes data may not work.
With debug enabled I have this warnings:

Sep  9 18:13:55.180 [67917] dbg: bayes: tok_get_all: SQL error: Illegal mix of collations for operation 'in'
Sep  9 18:13:55.180 [67917] dbg: bayes: cannot use bayes on this message; none of the tokens were found in the database
Sep  9 18:13:55.181 [67917] dbg: bayes: not scoring message, returning undef

It seems that, if some tokens are utf8, MySQL cannot correctly query the database.
The attached patch fixes the issue by making sure all returned data are utf8.
Comment 1 Bill Cole 2022-09-13 13:12:42 UTC
+1 to commit for 4.0 release.
Comment 2 Sidney Markowitz 2022-09-14 04:03:09 UTC
Targeting this for 4.0.0 release

+1 for the proposed patch.

That makes three votes counting Giovanni's implied vote, go ahead and commit it.
Comment 3 Henrik Krohns 2022-09-14 04:47:48 UTC
Is not using DBI::SQL_BINARY in bind_param already supposed to handle this? The token field is binary, so does it even use a collation? Did you investigate this deeper?

I'm -1 until better explanation and an additional unit test, if there's a rare of of token looking like a UTF8 string and messing up things.
Comment 4 Henrik Krohns 2022-09-14 05:04:06 UTC
Or if the problem is the actual perl $token variable being in UTF8 mode fed to bind_param, is not the logical solution to utf8::encode it?

mysql_enable_utf8

"When set, a data retrieved from a textual column type (char, varchar, etc) will have the UTF-8 flag turned on if necessary. This enables character semantics on that string."

Seems a radical change in general character semantics for the code. So I question if it's the correct fix.
Comment 5 Sidney Markowitz 2022-09-14 05:32:01 UTC
I was too quick with my vote, Henrik's points are all good. I withdraw my +1, not that I have to since a -1 overrides it until it is settled.

Looks like we could use a test case to reproduce this. Giovanni, can you track down an email that led to this?
Comment 6 Sidney Markowitz 2022-09-14 06:25:38 UTC
The first error is
dbg: bayes: tok_get_all: SQL error: Illegal mix of collations for operation 'in'

That can only be in MYSQL.pm line 464 which is the query constructed from

  my $multi_sql = "SELECT $token_select, spam_count, ham_count, atime                                                                                                                                                    
                     FROM bayes_token                                                                                                                                                                                    
                    WHERE id = ?                                                                                                                                                                                         
                      AND token IN ";

with the correct number of question marks in a list added at the end.

Then there is 

      my $idx = 0;
      $sth->bind_param(++$idx, $self->{_userid});
      $sth->bind_param(++$idx, $_, DBI::SQL_BINARY) foreach (@tok);

In sql/bayes_mysql.sql the schema is shown with CREATE TABLE bayes_token (
  id int(11) NOT NULL default '0',
  token binary(5) NOT NULL default '', ...

If that's how the database is set up, then shouldn't there no way that there could be a mismatch of types comparing token from that binary column with a prepared variable of type SQL_BINARY?

Giovanni, is it possible that the schema is wrong in the database?
Comment 7 Sidney Markowitz 2022-09-14 06:52:00 UTC
Also, I wondered if it is possible to get such an error by giving more than 5 bytes of binary string to a binary(5) column. When I looked at it I see that bayes tokens are made from the lower 40 bits of a SHA1 hash, so always exactly 5 bytes.

But that's yet another indication that the UTF-8 change cannot be the correct fix, there is no way for tokens to be any character set, they are arbitrary 40-bit binary strings.
Comment 8 Giovanni Bechis 2022-09-14 07:41:30 UTC
The issue was a wrong database schema, after fixing the schema bayes is working again.
Comment 9 Kent Oyer 2024-01-09 20:12:13 UTC
I ran into this issue as well. I must have created my database back when the token column was CHAR(5) instead on BINARY(5). I agree that changing the schema is the proper solution but changing the schema on a production table with 200 million rows requires some time and effort. I found that a quick workaround is to change the query to cast the values to binary format. 

--- a/lib/Mail/SpamAssassin/BayesStore/MySQL.pm
+++ b/lib/Mail/SpamAssassin/BayesStore/MySQL.pm
@@ -438,7 +438,7 @@ sub tok_get_all {
 
       $bunch_end = $search_index + $bunch_size;
       for ( ; $search_index < $bunch_end; $search_index++) {
-	$in_str .= '?,';
+	$in_str .= 'BINARY ?,';
 	push(@tok, $tokens[$search_index]);
       }
       chop $in_str;


This will prevent the "Illegal mix of collations" error and get your Bayes working right away. I don't know if it's worth including in the next release or not but I'm posting here in case it helps someone.

On another note...it would have been very helpful if the SQL error message was a `warn` instead of a `dbg`. I would have been able to catch this problem sooner if it showed up in my logs. 

--- a/lib/Mail/SpamAssassin/BayesStore/MySQL.pm
+++ b/lib/Mail/SpamAssassin/BayesStore/MySQL.pm
@@ -461,7 +461,7 @@ sub tok_get_all {
       my $rc = $sth->execute();

       unless ($rc) {
-       dbg("bayes: tok_get_all: SQL error: ".$self->{_dbh}->errstr());
+       warn("bayes: tok_get_all: SQL error: ".$self->{_dbh}->errstr());
        $self->{_dbh}->rollback();
        return [];
       }

There are 41 other instances like this in MySQL.pm. I can submit new bug report with a patch file if there is interest.

Thanks
Kent