Bug 7144 - To normalize_charset or not to normalize_charset, that is the question.
Summary: To normalize_charset or not to normalize_charset, that is the question.
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules (show other bugs)
Version: 3.4.0
Hardware: All All
: P2 enhancement
Target Milestone: 3.4.1
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-02-23 19:19 UTC by Mark Martinec
Modified: 2015-04-14 00:05 UTC (History)
2 users (show)



Attachment Type Modified Status Actions Submitter/CLA Status

Note You need to log in before you can comment on or make changes to this bug.
Description Mark Martinec 2015-02-23 19:19:20 UTC
To normalize_charset or not to normalize_charset, that is the question.

Isolating this decision from other problem reports (Bug 7126,
Bug 7133) so that they may be closed as they do not introduce
incompatibilities between 3.4.0 and 3.4.1 by themselves.

The documentation on normalize_charset now states:

normalize_charset ( 0 | 1)        (default: 0)
  Whether to decode non- UTF-8 and non-ASCII textual parts and recode
  them to UTF-8 before the text is given over to rules processing.
  The character set used for attempted decoding is primarily based on
  a declared character set in a Content-Type header, but if the
  decoding attempt fails a module Encode::Detect::Detector is
  consulted (if available) to provide a guess based on the actual
  text, and decoding is re-attempted. Even if the option is enabled
  no unnecessary decoding and re-encoding work is done when possible
  (like with an all-ASCII text with a US-ASCII or extended ASCII
  character set declaration, e.g. UTF-8 or ISO-8859-nn or Windows-nnnn).

  Unicode support in old versions of perl or in a core module Encode
  is likely to be buggy in places, so if the normalize_charset
  function is enabled it is advised to stick to more recent versions
  of perl (preferably 5.12 or later). The module
  Encode::Detect::Detector is optional, when necessary it will be
  used if it is available.


The final result is unchanged from what we had in 3.4.0 and earlier,
i.e. when normalize_charset is enabled the resulting decoded text
(as passed to rules and plugins) is transcoded if necessary from
the original character set encoding to UTF-8 octets.
When normalize_charset is disabled the original encoding is retained,
which may or may not be UTF-8 octets.

(There was just one exception/bug to the above claim, now fixed in
Bug 7133, where some unfortunate HTML text could end up as Unicode
perl characters (utf8 flag on) and could slow down rules. This is
now fixed, no Unicode perl characters should be reaching rules any
longer, regardless of original encoding or HTML.)

Now then, seems we have a couple of rules (mostly in 25_replace.cf
and 20_drugs.cf), which assume to be given text in Latin-1 or
Windows-1252 encoding). Such rules are only effective when
normalize_charset is off. If these rules were to be effective
when given text encoded as UTF-8, they would need to be modified.

Some of these modifications is trivial (just replacing Latin-1 char
in a regexp with UTF-8 byte sequence). Unfortunately then Latin-1
characters are used in a perl regexp character class (brackets),
there is no automatic fix, as single bytes would need to be replaced
by multiple bytes (byte pairs for Latin characters), so a character
class approach no longer works, a alternation in a regexp is needed.
The following problem description is copied from Bug 7133#c7

>> body CRAZY_EURO /€uro/
>> header SUBJ_CREDIT_FR Subject =~ /crédit/
>>
>> The /\xE2\x82\xACuro/  and  /€uro/  are even now equivalent
>> (assuming the encoding of a *.cf file is in UTF-8, which is common).
>
> There is a slight gotcha there when writing body and header rules in
> UTF-8.  The above cases are fine, SpamAssassin just sees a sequence
> of octets (UTF-8) and compares them to a sequence of octets in a text.
>
> The gotcha is when it would be desirable to include such non-ASCII
> character in a bracketed character class, e.g. [uµùúûü]. This would
> only work if both a text and regexp from rules is represented in
> Unicode (or as some single-byte encoding), i.e. each logical character
> as a indivisible entity in a character class, not as individual octets.
> Even nastier is a range, e.g. [uµù-ü], which assumes Latin-1 encoding
> and has no sensible equivalence in Unicode.
>
> We have a couple of such cases currently in our ruleset, e.g. in
> 20_drugs.cf and 25_replace.cf.  When converting such rules into
> UTF-8, a character class like [uµùúûü] would need to be converted
> to something like (u|µ|ù|ú|û|ü).


Note that this is not a new issue, there are no incompatibilities
between 3.4.0 and 3.4.1 in this regard introduced by related
Bug 7126, Bug 7130, and Bug 7141). We had this same problem even
with 3.4.0 and earlier, it's just that normalize_charset wasn't
very popular, and UTF-8 encoding was not as widespread ten years ago
as it is now (see some statistics at Bug 7126#c6).


So to summarize: turning on normalize_charset seems like a useful
feature as it can simplify/unify some rules (these may even be written
by a normal text editor in UTF-8 locale, no need for cyptic \xHH in
regexp, nor need not be concerned about different original character
sets and their encodings in a mail). Unfortunately some of the existing
rules which assume original encoding become ineffective (and could
even potentially misfire).

Eventually (in future) the clean solution would be to work entirely
in Unicode domain (perl characters). Seems we are not there yet
due to slowdown in regexp evaluation and due to a need to support
ancient perl versions.
Comment 1 Kevin A. McGrail 2015-03-10 20:32:15 UTC
My vote is set normalize_charset to 1 in the default local.cf for a new install and document in the UPGRADE for people to consider changing it on their installs.

Do we know of any rules that misfire today in the default rule package?

Regards,
KAM
Comment 2 Kevin A. McGrail 2015-03-10 21:21:28 UTC
> Do we know of any rules that misfire today in the default rule package?

From Mark: Turning it on
will make some of the drugs rules ineffective as they
assume Latin encoding of text. Other that that it should
be safe one way or another. 

So I'm a definite on recommending +1 for new installs
Comment 3 Kevin A. McGrail 2015-04-06 20:25:43 UTC
With mine and Mark's votes, we need one more.

In looking at this, if +1's we will need to make Encode::Detect a requirement rather than optional

Also need to update the UPGRADE and README to reflect this change if we get another +1.
Comment 4 Mark Martinec 2015-04-07 00:27:05 UTC
> In looking at this, if +1's we will need to make Encode::Detect a
> requirement rather than optional

Not necessarily. The Encode::Detect is now only used rarely if other
attempts fail - unlike previously in 3.4.0, where the module was
essential for operation. I wouldn't even care much for this module,
but I kept it as it's been there in use before. It is still flagged
as optional in the DependencyInfo.pm, and its importance is played down
in the DependencyInfo's report.


> Also need to update the UPGRADE and README to reflect this change
> if we get another +1.

I wonder how effective these current drugs misspellings rules are,
which assume Latin1 encoding. I haven't noticed degradation when
I began playing with normalize_charset and turned it on (rendering
them ineffective), but that's just anecdotal.

Currently I don't see an easy way to let rules know what encoding
they are dealing with, so can't make them conditional (or tflagged).
One possibility is to use 'rawbody' instead of 'body' for such rules
that expect original encoding of a message.  Rawbody avoids charset
normalization, but also avoids decoding HTML (which may or may not
affect them).

I don't have a strong opinion on the default value of normalize_charset.
For our site I certainly want it on (regardless of possibly rendering
some stock rules ineffective), as it makes it easier to write rules for
non-English text. Perhaps a gentle nudge in the release notes to suggest
people to turn normalize_charset on when upgrading to 3.4.1, but leaving
a default unchanged for this minor version update? The drag is that
there will be some users base staying on pre-3.4.1 version for quite
some time still, yet keeping their rules up-to-date.
Comment 5 Joe Quinn 2015-04-13 17:47:38 UTC
+1

My first concern was that some old versions of perl have crash bugs related to unicode. However, I can't find any issues on Google that have security implications; they seem to all be related to source code.

As for rules that assume latin-1, we can catch those in ruleqa as they lose efficacy and turn them into a pair of rules + meta for each value of normalize_charset (remembering that we still need to support the case of normalize_charset not being the default).
Comment 6 Joe Quinn 2015-04-13 20:04:38 UTC
Adding "normalize_charset 1" to local.cf led to a test failure. It's currently commented out. I have discussed it with Kevin and this is as close as either of us can come up with. Unfortunately, it means people will have to uncomment it themselves, which isn't ideal. I also adjusted the verbiage in UPGRADE to reflect this.

Mark, any ideas?

Committed revision 1673266.
Comment 7 Mark Martinec 2015-04-13 20:37:42 UTC
> Adding "normalize_charset 1" to local.cf led to a test failure.

Please provide more information. It does not fail any tests here,
including the xt/*.t tests.
Comment 8 Kevin A. McGrail 2015-04-13 20:39:32 UTC
I haven't tested but what he described was if you add normalize_charset 1 to rules/local.cf so that a default install would have it turned on, a make test fails cross_user_config_leak.t
Comment 9 Mark Martinec 2015-04-13 23:40:24 UTC
> I haven't tested but what he described was if you add normalize_charset 1 to
> rules/local.cf so that a default install would have it turned on, a make
> test fails cross_user_config_leak.t

It's a quirk of the test itself. Adding any other boolean setting
to rules/local.cf will produce the same failure:

rules/local.cf :
  normalize_charset 1
  always_trust_envelope_sender 1

$ prove t/cross_user_config_leak.t   

t/cross_user_config_leak.t .. 1/6 found='1' wanted=(none) key='normalize_charset' when=...line 266.
found='1' wanted=(none) key='always_trust_envelope_sender' when=...line 266.
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# Failed test 4 in t/cross_user_config_leak.t at line 89
found='1' wanted=(none) key='normalize_charset' when=...line 266.
                             ^^^^^^^^^^^^^^^^^

The default needs to be changed in Conf.pm.


Btw, while investigating the abov I found a bug in Conf.pm for setting
of normalize_charset, which has been there since forever but went by
unnoticed: it was only possible to turn this setting on, but not to
turn it off again. Fix:
  Sending lib/Mail/SpamAssassin/Conf.pm
Committed revision 1673314.
Comment 10 Kevin A. McGrail 2015-04-13 23:45:00 UTC
(In reply to Mark Martinec from comment #9)
> > I haven't tested but what he described was if you add normalize_charset 1 to
> > rules/local.cf so that a default install would have it turned on, a make
> > test fails cross_user_config_leak.t
> 
> It's a quirk of the test itself. Adding any other boolean setting
> to rules/local.cf will produce the same failure:

That's what I was hoping.

> The default needs to be changed in Conf.pm.

But we only want the default changed for new installs not existing installs which means we want to change the rules/local.cf and not the value in Conf.pm, correct?

So we can change local/rules.cf but need to figure out how to get that to pass test.

Or we need to push this to 3.4.2.
Comment 11 Mark Martinec 2015-04-13 23:59:45 UTC
> > The default needs to be changed in Conf.pm.
> 
> But we only want the default changed for new installs not existing installs
> which means we want to change the rules/local.cf and not the value in
> Conf.pm, correct?

Right. I was under impression that we are talking about changing a default,
never mind.

> So we can change local/rules.cf but need to figure out how to get that to
> pass test.
> 
> Or we need to push this to 3.4.2.

It's probably safer to keep it as it is, just suggesting in the docs
that it may be worthwhile to turn it on. By changing a default or by
adding it to initial local.cf we'll have a situation that two sets of
users will be seeing different behaviour, which might end up as support
trouble. If dreams come true we can move to a fuller Unicode handling
throughout (and IDN and UTF8SMTP) in 3.5.0 and change the default then  :)
Comment 12 Kevin A. McGrail 2015-04-14 00:05:03 UTC
(In reply to Mark Martinec from comment #11)
> It's probably safer to keep it as it is, just suggesting in the docs
> that it may be worthwhile to turn it on. By changing a default or by
> adding it to initial local.cf we'll have a situation that two sets of
> users will be seeing different behaviour, which might end up as support
> trouble. If dreams come true we can move to a fuller Unicode handling
> throughout (and IDN and UTF8SMTP) in 3.5.0 and change the default then  :)

Agreed.  Marking resolved and good catch on the conf.pm change BTW.  

I think that IDN/UTF8 will be big enough for a 4.0 release, I think!  Perhaps that and the rules ideas I have could be a big high-level focus.