Bug 7645

Summary: Wide character in print at /usr/bin/sa-compile line 433
Product: Spamassassin
Reporter: Fabian Dellwing <f.dellwing>
Component: sa-compile
Assignee: SpamAssassin Developer Mailing List <dev>
Status: RESOLVED FIXED
Severity: normal
CC: apache-bugzilla, apache, billcole, dmigowski, duncan, eqx, jidanni, sa-bugzilla, toddr
Priority: P2    
Version: 3.4.2   
Target Milestone: 4.0.0   
Hardware: PC   
OS: Linux   
Whiteboard:
Bug Depends on: 7656    
Bug Blocks:    
Attachments: sa-compile full log

Description Fabian Dellwing 2018-10-22 07:58:10 UTC
Created attachment 5608: sa-compile full log

Hi,

Since the last update I'm getting this annoying message each time sa-compile runs (in quiet mode).

> Wide character in print at /usr/bin/sa-compile line 433, <$fh> line 2333.
> Wide character in print at /usr/bin/sa-compile line 433, <$fh> line 2334.
> Wide character in print at /usr/bin/sa-compile line 433, <$fh> line 2849.
> Wide character in print at /usr/bin/sa-compile line 433, <$fh> line 3198.
> Wide character in print at /usr/bin/sa-compile line 433, <$fh> line 3762.

I attached the whole log (without quiet mode).

I'm running Ubuntu 14.04 (Linux mail 4.4.0-130-generic #156~14.04.1-Ubuntu SMP Thu Jun 14 13:54:07 UTC 2018 i686 athlon i686 GNU/Linux) with German locale:

> [09:44 root@mail ~] > locale
> LANG=de_DE.UTF-8
> LANGUAGE=
> LC_CTYPE="de_DE.UTF-8"
> LC_NUMERIC="de_DE.UTF-8"
> LC_TIME="de_DE.UTF-8"
> LC_COLLATE="de_DE.UTF-8"
> LC_MONETARY="de_DE.UTF-8"
> LC_MESSAGES="de_DE.UTF-8"
> LC_PAPER="de_DE.UTF-8"
> LC_NAME="de_DE.UTF-8"
> LC_ADDRESS="de_DE.UTF-8"
> LC_TELEPHONE="de_DE.UTF-8"
> LC_MEASUREMENT="de_DE.UTF-8"
> LC_IDENTIFICATION="de_DE.UTF-8"
> LC_ALL=

This is my sa-care.sh:

> #!/bin/bash
> chown -R root:root /etc/spamassassin/sa-update-keys 
> chmod -R o-rwx /etc/spamassassin/sa-update-keys 
> sa-update --nogpg --channel spamassassin.heinlein-support.de
> sa-update --channel updates.spamassassin.org
> sa-compile --quiet
> service amavis restart > /dev/null

The message first occurred with the last SA update this week. Any ideas how to fix this?

P.S. I'm aware of this 11-year-old bug, which has no solution: https://bz.apache.org/SpamAssassin/show_bug.cgi?id=5607
Comment 1 eqx 2018-11-09 13:04:18 UTC
I get the same after adding the heinlein channel.
Comment 2 Jan Brodda 2018-11-09 13:10:08 UTC
(In reply to eqx from comment #1)
> I get the same after adding the heinlein channel.

I am using the same SA rules from channel "spamassassin.heinlein-support.de", so this may be the common factor here.
Comment 3 Bill Cole 2018-11-09 13:43:04 UTC
What version of Perl are you using?
Comment 4 Jan Brodda 2018-11-09 13:49:40 UTC
(In reply to Bill Cole from comment #3)
> What version of Perl are you using?

Perl 5.22.1 on Ubuntu 16.04.5
Comment 5 Fabian Dellwing 2018-11-09 13:51:49 UTC
> [14:50 root@mail ~] > perl -V
> Summary of my perl5 (revision 5 version 18 subversion 2) configuration:
>    
>   Platform:
>     osname=linux, osvers=4.4.0-127-generic, archname=i686-linux-gnu-thread-multi-64int
>     uname='linux lgw01-amd64-009 4.4.0-127-generic #153-ubuntu smp sat may 19 10:58:46 utc 2018 i686 i686 i686 gnulinux '
>     config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -D_FORTIFY_SOURCE=2 -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -Dldflags= -Wl,-Bsymbolic-functions -Wl,-z,relro -Dlddlflags=-shared -Wl,-Bsymbolic-functions -Wl,-z,relro -Dcccdlflags=-fPIC -Darchname=i686-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.18 -Darchlib=/usr/lib/perl/5.18 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.18.2 -Dsitearch=/usr/local/lib/perl/5.18.2 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Duse64bitint -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -Ui_libutil -Uversiononly -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -Dlibperl=libperl.so.5.18.2 -des'
>     hint=recommended, useposix=true, d_sigaction=define
>     useithreads=define, usemultiplicity=define
>     useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
>     use64bitint=define, use64bitall=undef, uselongdouble=undef
>     usemymalloc=n, bincompat5005=undef
>   Compiler:
>     cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fstack-protector -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',                                                                    
>     optimize='-O2 -g',                                                                                                                                                                                                                       
>     cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fstack-protector -fno-strict-aliasing -pipe -I/usr/local/include'                                                                                                                         
>     ccversion='', gccversion='4.8.4', gccosandvers=''                                                                                                                                                                                        
>     intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=12345678                                                                                                                                                                       
>     d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12                                                                                                                                                                      
>     ivtype='long long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8                                                                                                                                                      
>     alignbytes=4, prototype=define                                                                                                                                                                                                           
>   Linker and Libraries:                                                                                                                                                                                                                      
>     ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'                                                                                                                                                                                  
>     libpth=/usr/local/lib /lib/i386-linux-gnu /lib/../lib /usr/lib/i386-linux-gnu /usr/lib/../lib /lib /usr/lib                                                                                                                              
>     libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt                                                                                                                                                                            
>     perllibs=-ldl -lm -lpthread -lc -lcrypt                                                                                                                                                                                                  
>     libc=, so=so, useshrplib=true, libperl=libperl.so.5.18.2                                                                                                                                                                                 
>     gnulibc_version='2.19'                                                                                                                                                                                                                   
>   Dynamic Linking:                                                                                                                                                                                                                           
>     dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'                                                                                                                                                                        
>     cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib -fstack-protector'                                                                                                                                                               
>                                                                                                                                                                                                                                              
>                                                                                                                                                                                                                                              
> Characteristics of this binary (from libperl):                                                                                                                                                                                               
>   Compile-time options: HAS_TIMES MULTIPLICITY PERLIO_LAYERS                                                                                                                                                                                 
>                         PERL_DONT_CREATE_GVSV                                                                                                                                                                                                
>                         PERL_HASH_FUNC_ONE_AT_A_TIME_HARD                                                                                                                                                                                    
>                         PERL_IMPLICIT_CONTEXT PERL_MALLOC_WRAP                                                                                                                                                                               
>                         PERL_PRESERVE_IVUV PERL_SAWAMPERSAND USE_64_BIT_INT
>                         USE_ITHREADS USE_LARGE_FILES USE_LOCALE
>                         USE_LOCALE_COLLATE USE_LOCALE_CTYPE
>                         USE_LOCALE_NUMERIC USE_PERLIO USE_PERL_ATOF
>                         USE_REENTRANT_API
>   Locally applied patches:
>         DEBPKG:debian/cpan_definstalldirs - Provide a sensible INSTALLDIRS default for modules installed from CPAN.
>         DEBPKG:debian/db_file_ver - http://bugs.debian.org/340047 Remove overly restrictive DB_File version check.
>         DEBPKG:debian/doc_info - Replace generic man(1) instructions with Debian-specific information.
>         DEBPKG:debian/enc2xs_inc - http://bugs.debian.org/290336 Tweak enc2xs to follow symlinks and ignore missing @INC directories.
>         DEBPKG:debian/errno_ver - http://bugs.debian.org/343351 Remove Errno version check due to upgrade problems with long-running processes.
>         DEBPKG:debian/libperl_embed_doc - http://bugs.debian.org/186778 Note that libperl-dev package is required for embedded linking
>         DEBPKG:fixes/respect_umask - Respect umask during installation
>         DEBPKG:debian/writable_site_dirs - Set umask approproately for site install directories
>         DEBPKG:debian/extutils_set_libperl_path - EU:MM: Set location of libperl.a to /usr/lib
>         DEBPKG:debian/no_packlist_perllocal - Don't install .packlist or perllocal.pod for perl or vendor
>         DEBPKG:debian/prefix_changes - Fiddle with *PREFIX and variables written to the makefile
>         DEBPKG:debian/fakeroot - Postpone LD_LIBRARY_PATH evaluation to the binary targets.
>         DEBPKG:debian/instmodsh_doc - Debian policy doesn't install .packlist files for core or vendor.
>         DEBPKG:debian/ld_run_path - Remove standard libs from LD_RUN_PATH as per Debian policy.
>         DEBPKG:debian/libnet_config_path - Set location of libnet.cfg to /etc/perl/Net as /usr may not be writable.
>         DEBPKG:debian/mod_paths - Tweak @INC ordering for Debian
>         DEBPKG:debian/module_build_man_extensions - http://bugs.debian.org/479460 Adjust Module::Build manual page extensions for the Debian Perl policy
>         DEBPKG:debian/prune_libs - http://bugs.debian.org/128355 Prune the list of libraries wanted to what we actually need.
>         DEBPKG:fixes/net_smtp_docs - [rt.cpan.org #36038] http://bugs.debian.org/100195 Document the Net::SMTP 'Port' option
>         DEBPKG:debian/perlivp - http://bugs.debian.org/510895 Make perlivp skip include directories in /usr/local
>         DEBPKG:debian/cpanplus_definstalldirs - http://bugs.debian.org/533707 Configure CPANPLUS to use the site directories by default.
>         DEBPKG:debian/cpanplus_config_path - Save local versions of CPANPLUS::Config::System into /etc/perl.
>         DEBPKG:debian/deprecate-with-apt - http://bugs.debian.org/702096 Point users to Debian packages of deprecated core modules
>         DEBPKG:debian/squelch-locale-warnings - http://bugs.debian.org/508764 Squelch locale warnings in Debian package maintainer scripts
>         DEBPKG:debian/skip-upstream-git-tests - Skip tests specific to the upstream Git repository
>         DEBPKG:debian/patchlevel - http://bugs.debian.org/567489 List packaged patches for 5.18.2-2ubuntu1.6 in patchlevel.h
>         DEBPKG:debian/skip-kfreebsd-crash - http://bugs.debian.org/628493 [perl #96272] Skip a crashing test case in t/op/threads.t on GNU/kFreeBSD
>         DEBPKG:fixes/document_makemaker_ccflags - http://bugs.debian.org/628522 [rt.cpan.org #68613] Document that CCFLAGS should include $Config{ccflags}
>         DEBPKG:debian/find_html2text - http://bugs.debian.org/640479 Configure CPAN::Distribution with correct name of html2text
>         DEBPKG:debian/hurd_test_skip_stack - http://bugs.debian.org/650175 Disable failing GNU/Hurd tests dist/threads/t/stack.t
>         DEBPKG:fixes/manpage_name_Test-Harness - http://bugs.debian.org/650451 [rt.cpan.org #73399] cpan/Test-Harness: add NAME headings in modules with POD
>         DEBPKG:debian/makemaker-pasthru - http://bugs.debian.org/660195 [rt.cpan.org #28632] Make EU::MM pass LD through to recursive Makefile.PL invocations
>         DEBPKG:debian/perl5db-x-terminal-emulator.patch - http://bugs.debian.org/668490 Invoke x-terminal-emulator rather than xterm in perl5db.pl
>         DEBPKG:debian/cpan-missing-site-dirs - http://bugs.debian.org/688842 Fix CPAN::FirstTime defaults with nonexisting site dirs if a parent is writable
>         DEBPKG:fixes/memoize_storable_nstore - [rt.cpan.org #77790] http://bugs.debian.org/587650 Memoize::Storable: respect 'nstore' option not respected
>         DEBPKG:fixes/net_ftp_failed_command - [rt.cpan.org #37700] http://bugs.debian.org/491062 Net::FTP: cope gracefully with a failed command
>         DEBPKG:fixes/perlbug-patchlist - [3541c11] http://bugs.debian.org/710842 [perl #118433] Make perlbug look up the list of local patches at run time
>         DEBPKG:fixes/module_metadata_security_doc - [68cdd4b] CVE-2013-1437 documentation fix
>         DEBPKG:fixes/module_metadata_taint_fix - [bff978f] http://bugs.debian.org/722210 [rt.cpan.org #88576] untaint version, if needed, in Module::Metadata
>         DEBPKG:fixes/IPC-SysV-spelling - http://bugs.debian.org/730558 [rt.cpan.org #86736] Fix spelling of IPC_CREAT in IPC-SysV documentation
>         DEBPKG:fixes/fix-undef-source -
>         DEBPKG:fixes/CVE-2013-7422.patch - [PATCH] [perl #119505] Segfault from bad backreference
>         DEBPKG:fixes/CVE-2014-4330.patch - [PATCH] don't recurse infinitely in Data::Dumper
>         DEBPKG:fixes/CVE-2016-2381.patch - [PATCH 1/2] remove duplicate environment variables from environ
>         DEBPKG:fixes/CVE-2017-12837.patch - [PATCH] regcomp [perl #131582]
>         DEBPKG:fixes/CVE-2017-12883.patch - [PATCH] PATCH: [perl #131598]
>         DEBPKG:fixes/CVE-2015-8853-1.patch - [PATCH] PATCH [perl #123562] Regexp-matching "hangs"
>         DEBPKG:fixes/CVE-2015-8853-2.patch - [PATCH] regexec.c: Use Perl_croak_nocontext()
>         DEBPKG:fixes/CVE-2016-6185.patch - [PATCH] =?utf8?q?Don=E2=80=99t=20let=20XSLoader=20load=20relative?= =?utf8?q?=20paths?=
>         DEBPKG:fixes/CVE-2017-6512-pre.patch - [PATCH] Correct the order of tests of chmod(). (#294)
>         DEBPKG:fixes/CVE-2017-6512.patch - http://bugs.debian.org/863870 [rt.cpan.org #121951] Prevent directory chmod race attack.
>         DEBPKG:fixes/CVE-2018-6913.patch - (perl #131844) fix various space calculation issues in pp_pack.c
>         DEBPKG:fixes/CVE-2018-12015.patch - [PATCH] [PATCH] Remove existing files before overwriting them
>   Built under linux
>   Compiled at Jun 13 2018 12:40:40
>   @INC:
>     /etc/perl
>     /usr/local/lib/perl/5.18.2
>     /usr/local/share/perl/5.18.2
>     /usr/lib/perl5
>     /usr/share/perl5
>     /usr/lib/perl/5.18
>     /usr/share/perl/5.18
>     /usr/local/lib/site_perl
>     .
Comment 6 Henrik Krohns 2018-11-09 15:15:26 UTC
There are some UTF-8 rules, for example:

(I've used "cat -v" to print them..)
body HS_BODY_899 /The seller hasnM-CM-"M-bM-^BM-,M-bM-^DM-"t provided any postage details yet/
body HS_BODY_1575 /diesem Grund folgende Zahlung zu stornieren. Um den dafM-CM-<r nM-CM-6tigen/
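
For anyone who wants to see which bytes are really in those rules, here is a minimal sketch in Python (not part of SpamAssassin) that turns the "cat -v" notation back into raw bytes. The helper name decode_cat_v is made up for illustration, and the final repair step is only a guess: it assumes the HS_BODY_899 text was UTF-8 that got misread as Windows-1252 and encoded a second time.

```python
def decode_cat_v(text: str) -> bytes:
    """Turn the M- notation printed by `cat -v` back into raw bytes.

    Only the high-bit forms (M-x and M-^x) are handled; everything else is
    taken literally. Input is assumed to be plain ASCII, as cat -v prints it.
    """
    out = bytearray()
    i = 0
    while i < len(text):
        if text.startswith("M-^", i) and i + 3 < len(text):
            # M-^X: control character X with the high bit set
            out.append((ord(text[i + 3]) ^ 0x40) | 0x80)
            i += 4
        elif text.startswith("M-", i) and i + 2 < len(text):
            # M-x: printable character x with the high bit set
            out.append(ord(text[i + 2]) | 0x80)
            i += 3
        else:
            out.extend(text[i].encode("ascii"))
            i += 1
    return bytes(out)


raw = decode_cat_v('hasnM-CM-"M-bM-^BM-,M-bM-^DM-"t')
as_stored = raw.decode("utf-8")  # the text as it sits in the .cf file
# Assumption: undo one round of "UTF-8 read as Windows-1252, then re-encoded".
repaired = as_stored.encode("cp1252").decode("utf-8")

print(raw.hex(" "))
print(ascii(as_stored), ascii(repaired))
```

If that guess is right, HS_BODY_899 contains a double-encoded apostrophe (U+2019), while HS_BODY_1575 holds ordinary single-encoded UTF-8 for "ü" and "ö".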

Basically the wide print error comes from outputting "scanner1.re", which ends up containing

char *Mail_SpamAssassin_CompiledRegexps_body_0_scan1(unsigned char **p){
unsigned char *q = 1 + *p;
/*!re2c
        "diesem grund folgende zahlung zu stornieren"            {RET("HS_BODY_1575,[l=1]");}
        "the seller hasnâ"            {RET("HS_BODY_899,[l=1]");}
  [\000-\377]        { return NULL; }
*/

Not sure if we should just print with binmode utf8 or similar, so the utf8 characters end up in scanner1.re, or perhaps convert them first to some hex \xAB value. I guess this depends on what re2c is expecting.

I'm not sure what state UTF-8 rules/checks are in anyway. If it doesn't exist already, we should have some documentation or a bug describing every step, from reading a .cf file with UTF-8 rules to how the rule is stored and matched against the decoded body (which is, or is not, UTF-8?), and how sa-compile fits into all of this.
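
The "convert them first to some hex \xAB value" option could look roughly like the Python sketch below. It illustrates the idea only and is not what sa-compile does; escape_non_ascii is a hypothetical helper, and whether re2c handles such escapes the way the compiled rules need is exactly the open question above.

```python
def escape_non_ascii(text: str) -> str:
    """Replace each non-ASCII character with \\xNN escapes of its UTF-8 bytes."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)  # plain ASCII passes through unchanged
        else:
            # One escape per byte of the character's UTF-8 encoding
            out.extend(f"\\x{b:02x}" for b in ch.encode("utf-8"))
    return "".join(out)


print(escape_non_ascii("dafür nötigen"))
```

The result is pure ASCII, so printing it can no longer trigger a wide character warning, at the cost of a less readable scanner1.re.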
Comment 7 Henrik Krohns 2018-11-14 09:15:33 UTC
Noticed something that made me think of this bug..
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=5691#c16

"SA rules files are encoded in ISO-8859-1, not UTF-8.  You have to either encode
Japanese characters in pattern tests using \x sequences or develop a new feature
adding support for UTF-8 config files to SA."

I don't know whether this is (still) true or false, but perhaps we should clarify this somewhere and optionally reject any non-ASCII configuration lines. No time to investigate right now.
Comment 8 eqx 2018-11-14 11:23:59 UTC
Thanks Henrik. I notified the maintainers of spamassassin.heinlein-support.de and pointed them here.
Comment 9 Robert R. Richter 2019-02-03 19:39:35 UTC
Same problem here under Gentoo:

Wide character in print at /usr/bin/sa-compile line 433, <$fh> line 9716.
Wide character in print at /usr/bin/sa-compile line 433, <$fh> line 10641.

spamassassin 3.4.2 and perl 5.26.2

I am also using spamassassin.heinlein-support.de

Any news on this topic?
Comment 10 Bill Cole 2019-02-03 22:27:48 UTC
(In reply to Robert R. Richter from comment #9)
> Same problem here under Gentoo:
> 
> Wide character in print at /usr/bin/sa-compile line 433, <$fh> line 9716.
> Wide character in print at /usr/bin/sa-compile line 433, <$fh> line 10641.
> 
> spamassassin 3.4.2 and perl 5.26.2
> 
> I am also using spamassassin.heinlein-support.de
> 
> Any news on this topic?

Not really. It's a low priority because it seems to be purely cosmetic and occurs only with a third-party ruleset.

1. A simple direct POSSIBLE fix with UNKNOWN side-effects may be to add this UNTESTED line after line 22:

  use open OUT => ':utf8';

2. A better fix will be to not use STDOUT for building the .re files.

Either change is unfit for the 3.4.3 release, which will be the terminal release for the 3.4 branch. The untested one-line possible fix may not work and may not quiet the warning while possibly breaking the rules involved. The refactoring of .re generation is simply too big to put in the final cleanup of the 3.4 branch.
Comment 11 Robert R. Richter 2019-02-03 22:42:40 UTC
I am no expert, so is it safe to just ignore these "Wide character in print at..." warnings/errors? Or are there any other side effects so that I should remove this ruleset?

FYI: I still have one 3.4.1 installation left and there are no such warnings using this ruleset on 3.4.1. Seems to be an issue only on 3.4.2.
Comment 12 Henrik Krohns 2019-02-04 05:18:13 UTC
Well, if Heinlein is reading this: do not use UTF-8 in rule files. That's the simplest fix.

Write rules in pure latin1:

/füübar/

Or better yet, with UTF8 byte alternatives:

$ perl -MEncode -e 'print unpack("H*", encode("UTF-8", "ü"))'
c3bc

/f(?:ü|\xc3\xbc)(?:ü|\xc3\xbc)bar/

Most portable:

$ perl -e 'print unpack("H*", "ü")'
fc

/f(?:\xfc|\xc3\xbc)(?:\xfc|\xc3\xbc)bar/

Some related thread:
http://spamassassin.1065346.n5.nabble.com/UTF8-character-in-doesn-t-match-td154199.html
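
For rule authors who would rather generate the "most portable" form than type it by hand, here is a rough Python sketch. byte_alternatives is a hypothetical helper, not a SpamAssassin tool; it assumes that a Latin-1 alternative is wanted only for characters that exist in Latin-1 and that everything else is written as its UTF-8 bytes.

```python
import re


def byte_alternatives(text: str) -> str:
    """Build an ASCII-only pattern matching text in Latin-1 or UTF-8 form."""
    parts = []
    for ch in text:
        if ord(ch) < 128:
            parts.append(re.escape(ch))  # ASCII: escape regex metacharacters only
            continue
        utf8 = "".join(f"\\x{b:02x}" for b in ch.encode("utf-8"))
        if ord(ch) < 256:
            # The character exists in Latin-1, so offer both byte forms
            parts.append(f"(?:\\x{ord(ch):02x}|{utf8})")
        else:
            parts.append(utf8)  # no Latin-1 form; UTF-8 bytes only
    return "".join(parts)


pattern = byte_alternatives("füübar")
print(pattern)

# Sanity check against both byte encodings of the same word
compiled = re.compile(pattern.encode("ascii"))
print(bool(compiled.search("füübar".encode("latin-1"))),
      bool(compiled.search("füübar".encode("utf-8"))))
```

The check at the end uses Python's own regex engine on byte strings, so it only shows that the pattern is self-consistent; it says nothing about how SpamAssassin decodes a message body before matching.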
Comment 13 Bill Cole 2019-02-04 05:57:28 UTC
(In reply to Robert R. Richter from comment #11)
> I am no expert, so is it safe to just ignore these "Wide character in print
> at..." warnings/errors? Or are there any other side effects so that I should
> remove this ruleset?

"Safe" is an imprecise concept, but I think ignoring those messages is safe by my understanding of safety. My understanding is that all of the rules are still being converted into compilable C and that only the specific rules that contain UTF-8 characters are being mangled in the process, making them generally non-matchable. See Henrik's comments above (comment #6 and comment #12).

> FYI: I still have one 3.4.1 installation left and there are no such warnings
> using this ruleset on 3.4.1. Seems to be an issue only on 3.4.2.

That's probably because 3.4.1 was liberally sprinkled with "use bytes;" pragmas, which effectively removed handling of "wide" characters as characters rather than as a sequence of unrelated bytes. That wasn't a maintainable strategy given the modern reality of how Perl handles Unicode. If you want to understand the details, "perldoc bytes" is a place to start and it references additional documentation that may be helpful. 

Because this could be seen as a problem with a 3rd-party rule distribution that is distributing rules in a bad format, I am tempted to just close this as "INVALID" (i.e., not OUR problem), but I do think we need to nail down in documentation what the code actually does, and probably rework sa-compile for 4.0 to create re2c input files in a more tightly specified way.
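
The character-versus-byte distinction behind the warning can be shown outside Perl as well. The Python sketch below is only an analogy, assuming the usual rule that Perl complains when a string holding a code point above 255 is printed to a handle without an encoding layer; fits_in_single_bytes is a made-up name.

```python
def fits_in_single_bytes(text: str) -> bool:
    """True if every character could be written as one byte (code point <= 255)."""
    return all(ord(ch) <= 0xFF for ch in text)


for sample in ("dafür", "hasn\u2019t"):
    print(
        ascii(sample),
        "characters:", len(sample),
        "UTF-8 bytes:", len(sample.encode("utf-8")),
        "fits in single bytes:", fits_in_single_bytes(sample),
    )
```

A string such as "dafür" still fits byte-per-character, whereas one containing U+2019 does not; the second kind is what counts as "wide".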
Comment 14 Henrik Krohns 2019-06-24 15:35:03 UTC
*** Bug 5607 has been marked as a duplicate of this bug. ***
Comment 15 Daniel Migowski 2020-02-10 10:23:54 UTC
I would wish for a better error message, one that says WHICH channel it was parsing. I also have heinlein, but also schaal-it.net, and cannot say for sure without more testing which of them delivers wrong characters now.
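
Until the message is improved, a small script can at least show which rule files contain non-ASCII bytes. This is a minimal Python sketch rather than a SpamAssassin feature; find_non_ascii_lines is a hypothetical helper, and the directory to scan depends on the installation.

```python
from pathlib import Path


def find_non_ascii_lines(rules_dir):
    """Return (file, line number, raw line) for every .cf line with a byte > 0x7F."""
    hits = []
    for path in sorted(Path(rules_dir).rglob("*.cf")):
        with path.open("rb") as fh:  # read bytes; do not guess an encoding
            for lineno, line in enumerate(fh, start=1):
                if any(b > 0x7F for b in line):
                    hits.append((str(path), lineno, line.rstrip(b"\r\n")))
    return hits


# Example call (the path is an assumption; use your own sa-update directory):
# for name, lineno, line in find_non_ascii_lines("/var/lib/spamassassin"):
#     print(f"{name}:{lineno}: {line!r}")
```

Since sa-update normally keeps each channel in its own file or subdirectory, the file names in the output should point to the channel. Comment and description lines with non-ASCII text are reported too, so the hits still need a quick look.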
Comment 16 Henrik Krohns 2022-03-06 14:01:09 UTC
Closing this as 3.4 will not receive any more fixes, and I'm considering sa-compile deprecated for 4.0.0 (at least the project should vote on it officially).
Comment 17 eqx 2022-03-06 14:57:27 UTC
Thanks Henrik. Just to confirm, you are saying the issue no longer exists in sa-compile v4+, so we can stop tracking at this point?

If it still exists we may want to open a bug on v4 for tracking, until deprecation of sa-compile has been confirmed, or simply define/document the re2c input requirements more strictly.
Comment 18 Henrik Krohns 2022-03-06 15:57:51 UTC
There is no issue if one doesn't put raw UTF-8 in cf files; some guidelines about that have been added to the documentation. And as said, sa-compile will probably be gone in 4.0 (per Bug 7962).
Comment 19 Henrik Krohns 2022-03-09 10:10:12 UTC
The warning seems to have originated in fixup_re, which created UTF-8 encoded strings; it should be silenced now. Judging from the sa-compile temp files, nothing changed, so nothing should break (assuming the UTF-8 stuff works properly in the first place; there aren't any unit tests for it).

Sending        trunk/lib/Mail/SpamAssassin/Plugin/BodyRuleBaseExtractor.pm
Transmitting file data .done
Committing transaction...
Committed revision 1898776.
Comment 20 Henrik Krohns 2022-03-09 14:36:56 UTC
Now UTF-8 rules might actually work:

Sending        spamassassin-3.4/lib/Mail/SpamAssassin/Plugin/BodyRuleBaseExtractor.pm
Sending        spamassassin-3.4/sa-compile.raw
Sending        trunk/sa-compile.raw
Sending        trunk/t/sa_compile.t
Transmitting file data ....done
Committing transaction...
Committed revision 1898791.