Bug 7656 - UTF8 rules, normalize_charset etc overhaul
Summary: UTF8 rules, normalize_charset etc overhaul
Status: NEW
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Libraries (show other bugs)
Version: SVN Trunk (Latest Devel Version)
Hardware: All All
: P2 blocker
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on: 4745 6234 7072
Blocks: 7022 7645
  Show dependency tree
 
Reported: 2018-11-17 11:03 UTC by Henrik Krohns
Modified: 2019-08-09 12:38 UTC (History)
2 users (show)



Attachment Type Modified Status Actions Submitter/CLA Status

Note You need to log in before you can comment on or make changes to this bug.
Description Henrik Krohns 2018-11-17 11:03:22 UTC
There are few relating bugs, but I'm creating new to oversee this.

I don't think we should release 4.0.0 before all UTF8 related functionality works adequately and is documented properly.

I made few tests with a message that either contains latin1 or utf8 encoded text (or simple html without any encoding clauses). Also three variants with Content-Type missing or specified as such.

body RULE_LATIN1 /päivää/
body RULE_UTF8 /pÀivÀÀ/

TEXT/PLAIN  normalize_charset 0 / 1
utf8 message, no ct       RULE_UTF8   / RULE_UTF8
utf8 message, utf8 ct     RULE_UTF8   / RULE_UTF8
utf8 message, latin1 ct   RULE_UTF8   / RULE_UTF8
latin1 message, no ct     RULE_LATIN1 / <no hits>
latin1 message, utf8 ct   RULE_LATIN1 / <no hits>
latin1 message, latin1 ct RULE_LATIN1 / RULE_UTF8

TEXT/HTML  normalize_charset 0 / 1
utf8 message, no ct       RULE_UTF8 / RULE_UTF8
utf8 message, utf8 ct     RULE_UTF8 / RULE_UTF8
utf8 message, latin1 ct   RULE_UTF8 / RULE_UTF8
latin1 message, no ct     RULE_UTF8 / <no hits>
latin1 message, utf8 ct   RULE_UTF8 / <no hits>
latin1 message, latin1 ct RULE_UTF8 / RULE_UTF8

- normalize_charset 1 doesn't hit either rule unless message contains Content-Type..ISO-8859-1 ??

- html parser apparently assumes everything is UTF8. Only matches UTF8 rules?

One can't even use simple workarounds such as "body RULE_FOO /p.iv../" to match umlauts(diacritic?) from UTF8 messages, as they obviously eat up two characters.

Let's not even get into other things yet like sa-compile (bug 7645), textcat etc that all expect some correct encoding to work..

Unless people want to use multiple rules to match non-utf8 and utf8 messages, perhaps the only sane solution would be to "upgrade" all non-utf8 rules to utf8 internally and do the matching to utf8 upgraded body. In such case the two rules above would actually be duplicates and work on any message.
Comment 1 Henrik Krohns 2018-11-17 11:10:31 UTC
Lots of talk here too that I haven't digested yet..
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7022
Comment 2 Henrik Krohns 2018-11-17 11:23:49 UTC
(In reply to Henrik Krohns from comment #0)
> Unless people want to use multiple rules to match non-utf8 and utf8
> messages, perhaps the only sane solution would be to "upgrade" all non-utf8
> rules to utf8 internally and do the matching to utf8 upgraded body. In such
> case the two rules above would actually be duplicates and work on any
> message.

Basically with this I mean that normalize_charset should affect rule parsing too and encode the rules (and resulting regexes) to UTF8? I don't think we can simply tell users to "convert all your rules/files to UTF8, if you want them to work". I don't use UTF8 in my editors or Linuxes anywhere. :-)
Comment 3 Henrik Krohns 2018-11-17 14:45:25 UTC
(In reply to Henrik Krohns from comment #0)
> latin1 message, no ct     RULE_LATIN1 / <no hits>
> latin1 message, utf8 ct   RULE_LATIN1 / <no hits>
> latin1 message, no ct     RULE_UTF8 / <no hits>
> latin1 message, utf8 ct   RULE_UTF8 / <no hits>

Ok these should be now fixed..

Basically Encode::Detect::Detector thinks body "päivää" is Windows-1255 (Hebrew!!). 

dbg: message: failed decoding as declared charset UTF-8
dbg: message: decoded as detected charset windows-1255, declared UTF-8

Why are we using a module that hasn't been updated in 10 years anyway? Maybe look at Encode::Guess which has been in core atleast from 5.8.8?

I simply added latin diacretic letters to SA's own basic Win-1252 detection. I borrowed the \xc0-\xd6\xd8-\xde\xe0-\xf6\xf8-\xfe bit from textcat, also looking at https://en.wikipedia.org/wiki/Windows-1252 it seems correct. Not sure if the missing ÿ (\xff) should be added to here and textcat..

Sending        spamassassin-3.4/lib/Mail/SpamAssassin/Message/Node.pm
Sending        trunk/lib/Mail/SpamAssassin/Message/Node.pm
Transmitting file data ..done
Committing transaction...
Committed revision 1846805.
Comment 4 Henrik Krohns 2019-08-04 09:03:42 UTC
So getting back to this.

I've been running my SA with normalize_charset 1 without any ill-effects so far. Should we head towards activating it by default in 4.0.0?

Only thing left after that would be documenting what format .cf files are expected to be in. Probably just "bytes" without any special encoding? For anything else than personal use, pure ascii should be used for portability (non-ascii characters should be in \xff format).

To be compatible for both normalize_charset 0/1, it should be clearly documented that any rules expected to hit latin1 extended characters would need to be written to include both latin1/utf8 - "ä" -> (?:\xe4|\xc3\xa4). We could also detect this automatically from rules and output warning that it should be fixed.

One thing to consider would be removing the whole normalize_charset option, and just force everything normalized, plain and simple.
Comment 5 Henrik Krohns 2019-08-04 09:48:21 UTC
I tried performance tests with mass-check, there's absolutely no difference here for normalize_charset, total duration was always within normal +-2% variance.

Rule differences between these were mainly:

__HIGHBITS
MPART_ALT_DIFF_COUNT
TVD_SPACE_RATIO
__freemail_safe_fwd

As we can see from __freemail_safe_fwd, if normalize is on, we can't assume that a single dot will match a character like "ä".. committed (?:\xe4|\xc3\xa4) fix for it. Question arises whether regexes should be run with unicode semantics (. = single character) instead of matching raw bytes.

Have to investigate if the others need fixing.
Comment 6 Kevin A. McGrail 2019-08-04 18:46:14 UTC
I'm a 0 on this.  I haven't see this proposed for default on dev@ or users@ and would like to see that done.  I know I have some rules that fire differently with normalize_charset.
Comment 7 Henrik Krohns 2019-08-04 18:47:49 UTC
(In reply to Kevin A. McGrail from comment #6)
> I know I have some rules that fire differently with normalize_charset.

Could you show some examples?
Comment 8 Kevin A. McGrail 2019-08-04 19:10:31 UTC
Sure, here's one example:

#ZWNJ
#ZWNJ 200C 157 https://en.wikipedia.org/wiki/Windows-1256
# Also want to look at Unicode U+200C. 
# Also 'zero-width joiner' which is Windows-1256 0x9E and Unicode U+200D. $a

# Per RW, switching for this to work with 'normalize_charset 1', \x9d needs to be replaced with (?:\x9d|\xe2\x80\x8c)
mimeheader      __KAM_ZWNJ1     Content-Type =~ /charset.+windows-1256/i
body            __KAM_ZWNJ2     /(?:\x9D|\xe2\x80\x8c)/
tflags          __KAM_ZWNJ2     multiple maxhits=16
body            __KAM_ZWNJ3     /\&\#x200B;/i
Comment 9 Henrik Krohns 2019-08-04 19:14:37 UTC
Well yes that pretty much sums up what was already said in this bug. You can't expect to match extended ascii characters like before. It's nothing but a documentation issue.
Comment 10 Kevin A. McGrail 2019-08-04 19:19:42 UTC
Well for it to be the default in 4.0.0, I'd like it to be discussed on list, please.
Comment 11 Henrik Krohns 2019-08-04 19:27:36 UTC
This bug already floods dev@ list, if someone wants to chime in, feel free.

I have no intention of spending time posting on users@ at this stage, when it's still only on idea and much left to do. Developers are the ones who need to steer this ship.