Bug 7645 - Wide character in print at /usr/bin/sa-compile line 433
Product: Spamassassin
Component: sa-compile
Version: 3.4.2
Hardware: PC Linux
Target Milestone: 4.0.0
Assignee: SpamAssassin Developer Mailing List
Depends on: 7656
Reported: 2018-10-22 07:58 UTC by Fabian Dellwing
Modified: 2022-03-09 14:36 UTC
Description Fabian Dellwing 2018-10-22 07:58:10 UTC
Created attachment 5608
sa-compile full log


Since the last update, I'm getting this annoying message each time sa-compile runs (in quiet mode).

> Wide character in print at /usr/bin/sa-compile line 433, <$fh> line 2333.
> Wide character in print at /usr/bin/sa-compile line 433, <$fh> line 2334.
> Wide character in print at /usr/bin/sa-compile line 433, <$fh> line 2849.
> Wide character in print at /usr/bin/sa-compile line 433, <$fh> line 3198.
> Wide character in print at /usr/bin/sa-compile line 433, <$fh> line 3762.

I attached the whole log (without quiet mode).

I'm running Ubuntu 14.04 (Linux mail 4.4.0-130-generic #156~14.04.1-Ubuntu SMP Thu Jun 14 13:54:07 UTC 2018 i686 athlon i686 GNU/Linux) with a German locale:

> [09:44 root@mail ~] > locale
> LANG=de_DE.UTF-8
> LC_CTYPE="de_DE.UTF-8"
> LC_TIME="de_DE.UTF-8"
> LC_PAPER="de_DE.UTF-8"
> LC_NAME="de_DE.UTF-8"

This is my sa-care.sh:

> #!/bin/bash
> chown -R root:root /etc/spamassassin/sa-update-keys 
> chmod -R o-rwx /etc/spamassassin/sa-update-keys 
> sa-update --nogpg --channel spamassassin.heinlein-support.de
> sa-update --channel updates.spamassassin.org
> sa-compile --quiet
> service amavis restart > /dev/null

The message first occurred with the last sa-update run this week. Any ideas how to fix this?

P.S. I'm aware of this 11-year-old bug, which has no solution: https://bz.apache.org/SpamAssassin/show_bug.cgi?id=5607
Comment 1 eqx 2018-11-09 13:04:18 UTC
I get the same after adding the heinlein channel.
Comment 2 Jan Brodda 2018-11-09 13:10:08 UTC
(In reply to eqx from comment #1)
> I get the same after adding the heinlein channel.

I am using the same SA rules from the channel "spamassassin.heinlein-support.de", so that might be the common factor here.
Comment 3 Bill Cole 2018-11-09 13:43:04 UTC
What version of Perl are you using?
Comment 4 Jan Brodda 2018-11-09 13:49:40 UTC
(In reply to Bill Cole from comment #3)
> What version of Perl are you using?

Perl 5.22.1 on Ubuntu 16.04.5
Comment 5 Fabian Dellwing 2018-11-09 13:51:49 UTC
> [14:50 root@mail ~] > perl -V
Comment 6 Henrik Krohns 2018-11-09 15:15:26 UTC
There are some UTF-8 rules, for example

(I've used "cat -v" to print them.)
body HS_BODY_899 /The seller hasnM-CM-"M-bM-^BM-,M-bM-^DM-"t provided any postage details yet/
body HS_BODY_1575 /diesem Grund folgende Zahlung zu stornieren. Um den dafM-CM-<r nM-CM-6tigen/

Basically the wide print error comes from outputting "scanner1.re", which ends up containing

char *Mail_SpamAssassin_CompiledRegexps_body_0_scan1(unsigned char **p){
unsigned char *q = 1 + *p;
        "diesem grund folgende zahlung zu stornieren"            {RET("HS_BODY_1575,[l=1]");}
        "the seller hasnâ"            {RET("HS_BODY_899,[l=1]");}
  [\000-\377]        { return NULL; }

Not sure if we should just print with binmode utf8 or similar, so the utf8 characters end up in scanner1.re, or perhaps convert them first to some hex \xAB value. I guess this depends on what re2c is expecting.
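
The second option mentioned above (converting to hex escapes) could look roughly like the following. This is only a Python sketch of the idea, not the Perl code in sa-compile; the helper name escape_non_ascii is made up for illustration, and whether re2c accepts the escapes in the exact form sa-compile writes its strings is not confirmed in this thread.

```python
def escape_non_ascii(pattern: str) -> str:
    """Replace every non-ASCII character with \\xNN escapes of its UTF-8 bytes.

    Hypothetical helper for illustration only; the result is pure ASCII.
    """
    out = []
    for ch in pattern:
        if ord(ch) < 0x80:
            # Plain ASCII passes through unchanged.
            out.append(ch)
        else:
            # One \xNN escape per UTF-8 byte of the character.
            out.extend(f"\\x{b:02x}" for b in ch.encode("utf-8"))
    return "".join(out)


if __name__ == "__main__":
    print(escape_non_ascii("dafür nötigen"))  # daf\xc3\xbcr n\xc3\xb6tigen
```

With output like this, nothing above code point 0x7F would reach the print call, so the question of which output layer to use would not arise for these strings.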

I'm not sure what state UTF-8 rules/checks are in anyway. If there isn't one already, we should have a doc or bug describing all the steps, from reading a .cf file with UTF-8 rules to how the rule is stored and matched against the decoded body (which is, or is not, UTF-8?), and also how sa-compile fits into all of this.
Comment 7 Henrik Krohns 2018-11-14 09:15:33 UTC
Noticed something that made me think of this bug..

"SA rules files are encoded in ISO-8859-1, not UTF-8.  You have to either encode
Japanese characters in pattern tests using \x sequences or develop a new feature
adding support for UTF-8 config files to SA."

I don't know if this is (still) true or false, but perhaps we should clarify this somewhere and optionally reject any non-ASCII configuration lines. No time to investigate right now.
Comment 8 eqx 2018-11-14 11:23:59 UTC
Thanks Henrik. I notified the maintainers of spamassassin.heinlein-support.de and pointed them here.
Comment 9 Robert R. Richter 2019-02-03 19:39:35 UTC
same problem here under Gentoo:

Wide character in print at /usr/bin/sa-compile line 433, <$fh> line 9716.
Wide character in print at /usr/bin/sa-compile line 433, <$fh> line 10641.

spamassassin 3.4.2 and perl 5.26.2

I am also using spamassassin.heinlein-support.de

Any news on this topic?
Comment 10 Bill Cole 2019-02-03 22:27:48 UTC
(In reply to Robert R. Richter from comment #9)
> same problem here under Gentoo:
> Wide character in print at /usr/bin/sa-compile line 433, <$fh> line 9716.
> Wide character in print at /usr/bin/sa-compile line 433, <$fh> line 10641.
> spamassassin 3.4.2 and perl 5.26.2
> I am also using spamassassin.heinlein-support.de
> Any news on this topic?

Not really. It's a low priority because it seems to be purely cosmetic and to occur only with a third-party ruleset.

1. A simple direct POSSIBLE fix with UNKNOWN side-effects may be to add this UNTESTED line after line 22:

  use open OUT => ':utf8';

2. A better fix will be to not use STDOUT for building the .re files.

Either change is unfit for the 3.4.3 release, which will be the terminal release for the 3.4 branch. The untested one-line possible fix may not work and may not quiet the warning while possibly breaking the rules involved. The refactoring of .re generation is simply too big to put in the final cleanup of the 3.4 branch.
Comment 11 Robert R. Richter 2019-02-03 22:42:40 UTC
I am no expert, so is it safe to just ignore these "Wide character in print at..." warnings/errors? Or are there any other side effects, so that I should remove this ruleset?

FYI: I still have one 3.4.1 installation left and there are no such warnings using this ruleset on 3.4.1. Seems to be an issue only on 3.4.2.
Comment 12 Henrik Krohns 2019-02-04 05:18:13 UTC
Well, if Heinlein is reading this: do not use UTF-8 in rule files. That's the simplest fix.

Write rules in pure latin1:


Or better yet, with UTF8 byte alternatives:

$ perl -MEncode -e 'print unpack("H*", encode("UTF-8", "ü"))'


Most portable:

$ perl -e 'print unpack("H*", "ü")'
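
For readers without Perl at hand, the byte-alternative idea can be sketched in Python as well. This is an illustration under assumptions, not a rule taken from any channel: the helper name byte_alternatives is invented here, and the pattern is tested against raw bytes only to show that one ASCII-only expression can cover both encodings.

```python
import re


def byte_alternatives(ch: str) -> str:
    """Build an ASCII-only regex fragment matching `ch` in Latin-1 or UTF-8.

    Hypothetical helper; raises UnicodeEncodeError if `ch` has no Latin-1 form.
    """
    latin1 = "".join(f"\\x{b:02x}" for b in ch.encode("latin-1"))
    utf8 = "".join(f"\\x{b:02x}" for b in ch.encode("utf-8"))
    return f"(?:{latin1}|{utf8})"


if __name__ == "__main__":
    print("ü".encode("utf-8").hex())    # c3bc
    print("ü".encode("latin-1").hex())  # fc
    fragment = byte_alternatives("ü")
    print(fragment)                     # (?:\xfc|\xc3\xbc)
    # The same pattern matches the word in either encoding.
    pattern = re.compile(("f" + fragment + "r").encode("ascii"))
    print(bool(pattern.search("für".encode("utf-8"))))    # True
    print(bool(pattern.search("für".encode("latin-1"))))  # True
```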


Some related thread:
Comment 13 Bill Cole 2019-02-04 05:57:28 UTC
(In reply to Robert R. Richter from comment #11)
> I am no expert, so is it safe to just ignore these "Wide character in print
> at..." warnings/errors? Or are there any other sideeffects so that I should
> remove this ruleset?

"Safe" is an imprecise concept, but I think ignoring those messages is safe for my understanding of safety. My understanding is that all of the rules are still being converted into compilable C and that only the specific rules that contain utf8 characters are being mangled in the process, making them generally non-matchable. See Henrik's comments above (comment #6 and comment #12) 

> FYI: I still have one 3.4.1 installation left and there are no such warnings
> using this ruleset on 3.4.1. Seems to be an issue only on 3.4.2.

That's probably because 3.4.1 was liberally sprinkled with "use bytes;" pragmas, which effectively removed handling of "wide" characters as characters rather than as a sequence of unrelated bytes. That wasn't a maintainable strategy given the modern reality of how Perl handles Unicode. If you want to understand the details, "perldoc bytes" is a place to start and it references additional documentation that may be helpful. 
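
As a rough illustration of the character-versus-byte distinction: Perl documents the "Wide character" warning as being raised for a character with an ordinal above 255 where one was not expected. The Python sketch below (the function name wide_characters is made up here) simply lists such characters in a string; it is not how sa-compile detects anything.

```python
from typing import List


def wide_characters(text: str) -> List[str]:
    """Return the characters in `text` whose code point is above 0xFF."""
    return [ch for ch in text if ord(ch) > 0xFF]


if __name__ == "__main__":
    # "ü" is U+00FC, which still fits in a single byte value.
    print(wide_characters("dafür"))        # []
    # U+2019 (right single quotation mark) does not fit in one byte.
    print(wide_characters("hasn\u2019t"))  # ['’']
```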

Because this could be seen as a problem with a 3rd-party rule distribution that is distributing rules in a bad format, I am tempted to just close this as "INVALID" (i.e. not OUR problem,) but I do think we need to nail down the code truth in documentation and probably rework sa-compile for 4.0 to create re2c input files in a more tightly specified way.
Comment 14 Henrik Krohns 2019-06-24 15:35:03 UTC
*** Bug 5607 has been marked as a duplicate of this bug. ***
Comment 15 Daniel Migowski 2020-02-10 10:23:54 UTC
I would wish for a better error message, one which says WHICH channel it was parsing. I also have heinlein, but also schaal-it.net, and cannot say for sure without more testing which of them delivers wrong characters now.
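
Until the message itself names the source, a workaround might be to scan the downloaded rule files for non-ASCII bytes and see which channel's files turn up. A minimal Python sketch, assuming the rules are stored as *.cf files under a directory you pass in (no default path is assumed, and find_non_ascii_lines is a made-up name):

```python
import sys
from pathlib import Path


def find_non_ascii_lines(rules_dir):
    """Return (path, line number) pairs for *.cf lines containing bytes >= 0x80."""
    hits = []
    for path in sorted(Path(rules_dir).rglob("*.cf")):
        # Read as bytes so no decoding assumptions are made about the file.
        with open(path, "rb") as fh:
            for lineno, line in enumerate(fh, start=1):
                if any(b >= 0x80 for b in line):
                    hits.append((str(path), lineno))
    return hits


if __name__ == "__main__":
    # Each command-line argument is treated as a rules directory to scan.
    for directory in sys.argv[1:]:
        for path, lineno in find_non_ascii_lines(directory):
            print(f"{path}:{lineno}: non-ASCII byte")
```

The file names in the output should make it clear which channel ships the affected rules.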
Comment 16 Henrik Krohns 2022-03-06 14:01:09 UTC
Closing this as 3.4 will not receive any more fixes, and I'm considering sa-compile deprecated for 4.0.0 (at least the project should vote on it officially).
Comment 17 eqx 2022-03-06 14:57:27 UTC
Thanks Henrik. Just to confirm, you are saying the issue no longer exists in sa-compile v4+, so we can stop tracking at this point?

If it still exists we may want to open a bug on v4 for tracking, until deprecation of sa-compile has been confirmed, or simply define/document the re2c input requirements more strictly.
Comment 18 Henrik Krohns 2022-03-06 15:57:51 UTC
There is no issue if one doesn't put raw UTF-8 in cf files; some guidelines about that have been added to the documentation. And as said, sa-compile will probably be gone in 4.0 (per Bug 7962).
Comment 19 Henrik Krohns 2022-03-09 10:10:12 UTC
The warning seems to have originated in fixup_re, which created UTF-8 encoded strings; it should be silenced now. Judging from the sa-compile temp files, nothing changed, so nothing should break (assuming the UTF-8 handling works properly in the first place; there aren't any unit tests for it).

Sending        trunk/lib/Mail/SpamAssassin/Plugin/BodyRuleBaseExtractor.pm
Transmitting file data .done
Committing transaction...
Committed revision 1898776.
Comment 20 Henrik Krohns 2022-03-09 14:36:56 UTC
Now UTF-8 rules might actually work:

Sending        spamassassin-3.4/lib/Mail/SpamAssassin/Plugin/BodyRuleBaseExtractor.pm
Sending        spamassassin-3.4/sa-compile.raw
Sending        trunk/sa-compile.raw
Sending        trunk/t/sa_compile.t
Transmitting file data ....done
Committing transaction...
Committed revision 1898791.