SA Bugzilla – Bug 7133
Revisiting Bug 4046 - HTML::Parser: Parsing of undecoded UTF-8 will give garbage when decoding entities
Last modified: 2015-02-23 19:22:54 UTC
Back in 2004, Bug 4046 was opened, noting that the HTML::Parser module issues the warning "Parsing of undecoded UTF-8 will give garbage when decoding entities" when processing certain text/html mail parts. The solution proposed right away by Sebastian Jaenicke (the bug submitter) was to turn on HTML::Parser's utf8_mode(), which is mostly the right thing to do (unless we want to decode all HTML parts into Unicode first - i.e. to perl characters). Unfortunately that solution was rejected in 2005, mainly because HTML::Parser's utf8_mode requires perl 5.8 (5.7.?), which was not considered widely deployed ten years ago. So the chosen solution was just to mask the warning so that it does not show up, effectively hiding the actual problem, which was not considered serious enough, and/or possibly not fully understood.

====

Fast-forward ten years. Analyzing why some (but not all) Bayes tokens (words) obtained from text/html mail parts look like an unrecognizable 8-bit mess (as if encoded into UTF-8 twice in a row), it turns out that the culprit is incorrect use of HTML::Parser - which would indeed rightfully produce a warning, had it not been swept under the carpet.

Consider the following text:

  Ne traîne pas, adieu - Et tâche d'être heureux

which, encoded with

  Content-Transfer-Encoding: quoted-printable
  Content-Type: text/html; charset="utf-8"

could look like this in a mail message:

  From: test@example.com
  To: test@example.com
  Subject: test
  Date: Wed, 4 Feb 2015 00:01:56 +0100
  Message-ID: <54d1536476d7c@example.com>
  Mime-Version: 1.0
  Content-Transfer-Encoding: quoted-printable
  Content-Type: text/html; charset="utf-8"

  <html><body><p>
  Ne tra=C3=AEne pas, adieu - Et t&acirc;che d'=C3=AAtre heureux
  </body></html>

Note that characters are encoded as QP UTF-8, except for the â character in tâche, which is represented by the &acirc; HTML entity (a-circumflex). Both representations are perfectly legal and equivalent in such a text/html MIME part.
Unfortunately this text (after QP decoding and HTML decoding) ends up as:

  Ne traîne pas, adieu - Et t�che d'être heureux

which is (as octets):

  Ne tra<C3><AE>ne pas, adieu - Et t<E2>che d'<C3><AA>tre heureux

and this is then given to rules and to bayes tokenization. HTML::Parser does rightfully issue a warning (which is suppressed by Mail::SpamAssassin::HTML), and the result is clearly wrong: characters are encoded as UTF-8, except for the &acirc; entity, which is encoded as ISO-8859-1 (i.e. Latin-1).

And it gets worse. Replacing the '-' by the HTML entity &mdash;, like this:

  <html><body><p>
  Ne tra=C3=AEne pas, adieu &mdash; Et t&acirc;che d'=C3=AAtre heureux
  </body></html>

yields:

  Ne traîne pas, adieu — Et tâche d'être heureux

which is (as octets):

  Ne tra<C3><83><C2><AE>ne pas, adieu <E2><80><94> Et t<C3><A2>che d'<C3><83><C2><AA>tre heureux

Note that by just replacing one dash character with an entity (leaving everything else unchanged), the entire encoding turns into a mojibake scramble: characters represented by HTML entities are correct, but all other UTF-8 octet pairs are incorrectly assumed to be Latin-1 bytes and encoded into UTF-8 again, resulting in four octets each!

====

So what is going on here? As soon as some HTML entity (like &trade; or &euro; or &scaron; or &circ; or &tilde; ...) which has no representation in Latin-1 appears in a UTF-8 text, the entire text is upgraded into perl characters (i.e. Unicode, utf8 flag on), and during this process existing bytes with codes between 128 and 255 are assumed to be Latin-1, resulting in doubly-encoded UTF-8 mumble-jumble. It also means that rules and other plugins receive such text flagged as perl characters (utf8 flag on), so some rules may misfire or take longer to evaluate. Just adding one &trade; entity ruins the encoding of the entire HTML text!
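The four-octet scramble above is plain double encoding: UTF-8 octets misread as Latin-1 and encoded to UTF-8 once more. A minimal Python sketch of the byte arithmetic (Python used only as an illustration; SpamAssassin itself is Perl):

```python
# 'î' (U+00EE) is the UTF-8 pair C3 AE; misreading those two octets as
# two Latin-1 characters and re-encoding yields four octets: C3 83 C2 AE.
utf8_octets = "traîne".encode("utf-8")                    # b'tra\xc3\xaene'
mojibake = utf8_octets.decode("latin-1").encode("utf-8")
print(mojibake)                                           # b'tra\xc3\x83\xc2\xaene'
```

This reproduces exactly the `tra<C3><83><C2><AE>ne` octet dump shown above.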
Now for possible solutions. Carefully reading (and understanding) HTML::Parser's warning, "Parsing of undecoded UTF-8 will give garbage when decoding entities", gives us a clue. There are exactly two correct options when processing HTML mail parts:

1) Decode the text (based on its declared encoding) into perl characters (Unicode) and pass that to HTML::Parser for HTML parsing. (A slightly modified MS::Message::_normalize() could do that.) The result would remain as perl characters (utf8 flag on), with HTML entities properly decoded. This is the default mode of operation of HTML::Parser, and this is how SpamAssassin calls it - except that SpamAssassin violates its assumption about perl characters, giving it encoded text (as UTF-8 or ISO-8859-1 or whatever the original encoding is). No wonder HTML::Parser complains when it notices it is given UTF-8 encoded octets instead of perl characters.

2) The other option is to stick to the current processing as bytes in most parts of SpamAssassin. HTML::Parser can also deal with text encoded as UTF-8 octets, but in this case ->utf8_mode(1) needs to be set to let it know this is the case. The result remains encoded as UTF-8 octets, with HTML entities properly decoded into UTF-8 octets. This is the solution originally proposed by Sebastian Jaenicke in Bug 4046. This alternative is least disruptive to the rest of SpamAssassin, as it retains text as octets (no utf8 flag). The only drawback is when the original text is encoded in something other than UTF-8 (as in 32% of messages, according to Bug 7126). Fortunately there is a clean solution to that: turn on the normalize_charset option (see Bug 6945) - it transcodes any given encoding into UTF-8, so the rest of the processing (HTML decoding, rules, bayes) can rely on a single encoding, while still keeping the speed of processing octets (not perl characters).
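The two options can be sketched in Python as a stand-in for the Perl code paths (html.unescape plays the role of HTML::Parser's entity decoding here; this is an illustration, not the actual SpamAssassin code):

```python
import html

# QP-decoded body: raw UTF-8 octets with one named entity left in.
raw = b"Et t&acirc;che d'\xc3\xaatre heureux"

# Option 1: decode the declared charset first, then parse as characters;
# the result stays as (Unicode) characters.
chars = html.unescape(raw.decode("utf-8"))

# Option 2: keep octets end to end; entity decoding must then be UTF-8
# aware (what ->utf8_mode(1) provides in Perl), emulated here by a
# decode / unescape / encode round trip.
octets = html.unescape(raw.decode("utf-8")).encode("utf-8")

print(chars)    # Et tâche d'être heureux
print(octets)   # b"Et t\xc3\xa2che d'\xc3\xaatre heureux"
```

Either way the entity and the QP-encoded characters come out in one consistent representation, which is the whole point.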
Seems to me that option #1 is the cleanest, with good potential for the future, while option #2 is least disruptive to existing code and its speed. Keeping the status quo (with no ->utf8_mode(1) and with suppressed warnings) seems the worst.
I'm more inclined for #2 at least for 3.4.X.
> I'm more inclined for #2 at least for 3.4.X.

So am I, for the short term - keeping text encoded as octets (hopefully encoded as UTF-8 in most cases) for the rest of processing.

Meanwhile I found an already documented bug in HTML::Parser:

  https://rt.cpan.org/Public/Bug/Display.html?id=99755

which also affects SpamAssassin if we choose to take advantage of the utf8_mode of parsing. It seems that some HTML entities are left as Latin-1 octets even when utf8_mode is on. The problem seems localized to individual HTML paragraphs with the right mix of content, and I'm seeing occasional fallout from this bug as junk bayes tokens, now that I'm calling HTML::Parser with utf8_mode enabled. It also seems that HTML::Parser hasn't been maintained lately, so it will probably take some time for the bug to get fixed.

I was unable to find an easy workaround, apart from decoding text to Unicode first (utf8 flag on), letting HTML::Parser work in Unicode, and encoding the result to UTF-8 again after HTML parsing. As it happens, the sub MS::Message::Node::rendered() contains both a call to _normalize() as well as a call to HTML->parse a little further down. It may be possible to avoid one unnecessary encoding/decoding pair and carry Unicode from _normalize() to HTML::Parser, doing the encoding to UTF-8 only after HTML decoding. Will see how that approach looks.
While experimenting with our HTML decoding, I noticed that our test t/html_utf8.t expects that an HTML text is decoded into Unicode characters (not UTF-8 octets), otherwise it fails. The t/html_utf8.t installs the following rule:

  body QUOTE_YOUR /\x{201c}Your/

and runs SpamAssassin on the file t/data/spam/009, which contains the following HTML chunk:

  Click the &#8220;Your Account&#8221;

(these entities represent double quotes: Click the “Your Account”).

Grepping through our current rules, I don't see a single case of a Unicode character (like \x{9999} or \N{U+9999}) in any of the rules, although there are lots of single bytes encoded as \x99. So our rules do not seem to be expecting Unicode characters, just bytes, unlike the t/html_utf8.t test. Note this has nothing to do with the normalize_charset setting; it is entirely an effect of HTML::Parser being called with utf8_mode off (which is the default, and in effect in SpamAssassin so far).

So the question now is: is the test wrong in what it expects, i.e. should the test rule be:

  body QUOTE_YOUR /\xE2\x80\x9CYour/

instead?

The next logical question could be: in developing new rules (or updating existing rules), what should be the expected representation of a plain or HTML text as given to rules:

- in Unicode characters (i.e. QP and Base64 decoded + character set decoded (Content-Type) + HTML decoded);

- in UTF-8 octets (i.e. QP and Base64 decoded + character set decoded (Content-Type) + HTML decoded, then encoded into UTF-8 octets) - this seems to be the situation we are currently aiming at, with the normalize_charset option enabled;

- as octets of the original character set, mixed with Unicode or Latin-1 entities from HTML parts (i.e. QP and Base64 decoded; HTML entities decoded into Unicode or Latin-1) - this is essentially our present situation with normalize_charset off;

- in UTF-8 octets, mixed with Unicode or Latin-1 entities from HTML parts (i.e. QP and Base64 decoded + character set decoded (Content-Type); HTML entities decoded into Unicode or Latin-1) - this is essentially our present situation with normalize_charset on;

- in UTF-8 octets, mixed with UTF-8 or Latin-1 entities from HTML parts (i.e. QP and Base64 decoded + character set decoded (Content-Type); HTML entities recoded into UTF-8 (utf8_mode on) or to Latin-1 (bug));

- no decoding, original bytes (some rules seem to expect junk like QP-encoded characters).
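The two candidate forms of the test rule differ only in representation: \x{201c} names one Unicode character, while \xE2\x80\x9C spells out its three UTF-8 octets. A Python illustration of why each form only matches its own representation of the text (a stand-in for the Perl regexp semantics):

```python
import re

# U+201C LEFT DOUBLE QUOTATION MARK, as decoded characters and as
# UTF-8 octets.
as_chars = "\u201cYour Account\u201d"
as_bytes = as_chars.encode("utf-8")

# A character-level rule matches the character form...
assert re.search("\u201cYour", as_chars)

# ...while the byte-spelled rule matches the octet form:
assert re.search(rb"\xe2\x80\x9cYour", as_bytes)
```

So the "right" spelling of the rule depends entirely on which representation the engine hands to rules, which is the question posed above.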
Firstly, thanks for all the good work so far on this.

Thinking about this purely from the rule-writing user's point of view, totally ignoring the history and largely ignoring the underlying technical details :-) I would want to be able to include any Unicode characters directly in the rule file, and have it match the equivalent characters in the message, regardless of Content-Type charset, and regardless of any base64, quoted-printable and/or (for HTML message parts) &entity; encoding.

So I should be able to write things like

  body   CRAZY_EURO      /€uro/
  header SUBJ_CREDIT_FR  Subject =~ /crédit/

and match any occurrences of "€uro" or "crédit" regardless of what charset the message was originally encoded in and whether entities were used.

This of course would imply that rule .cf files would need to be encoded in UTF-8 (or whatever) and subjected to charset normalisation. I guess that's a whole new can of worms, but IMO it would make it far easier to address international spam patterns. After all your efforts to normalise the message, it would be a great shame to have to encode all non-ASCII characters in rules, e.g.

  body CRAZY_EURO /\x{20AC}uro/

though I would of course expect things to work if written that way. It would be an even greater shame if rules had to be written as UTF-8 bytes:

  body CRAZY_EURO /\xE2\x82\xACuro/

Next question: what effect (if any) would this have on rawbody rules?
(In reply to John Wilcock from comment #5)

Thanks for the feedback!

> Thinking about this purely from the rule-writing user's point of view,
> totally ignoring the history and largely ignoring the underlying technical
> details :-) I would want to be able to include any Unicode characters
> directly in the rule file, and have it match the equivalent characters in
> the message, regardless of Content-Type charset, and regardless of any
> base64, quoted-printable and/or (for HTML message parts) &entity; encoding.
>
> So I should be able to write things like
>
>   body CRAZY_EURO /€uro/
>   header SUBJ_CREDIT_FR Subject =~ /crédit/
>
> and match any occurrences of "€uro" or "crédit" regardless of what charset
> the message was originally encoded in and whether entities were used.

You can do that even now. For example, I have the following rules (for localized spam/phishing):

  body L_WEBTEAM2_3  m{Ček vaš email hvala}
  body L_WEBTEAM2_4  m{Administrator je e-poštni sistem}
  body L_WEBTEAM2_5  m{Hvala za vaše sodelovanje pri zaščiti!}

These are encoded as you see them here, i.e. UTF-8 encoded in a local.cf file, using a text editor (in a UTF-8 locale).

> This of course would imply that rule .cf files would need to be encoded
> in UTF-8 (or whatever) and subjected to charset normalisation.

Right. The above rules (yours or mine) work in the following cases:

- if text in a mail message is already encoded in UTF-8 (after QP and Base64 decoding);

- or if normalize_charset is enabled (even in 3.4.0, improved in trunk) and the original character set can be successfully decoded.

It does not currently work for HTML entities, which (depending on HTML::Parser idiosyncrasies, its use in SpamAssassin, and its bugs) can end up as Unicode (wide characters) or as Latin-1 - regardless of the original encoding of the text.

> I guess that's a
> whole new can of worms, but IMO it would make it far easier to address
> international spam patterns.
> After all your efforts to normalise the
> message, it would be a great shame to have to encode all non-ASCII
> characters in rules, e.g.
>
>   body CRAZY_EURO /\x{20AC}uro/

The above won't work, unless we decide to go for full decoding into Unicode.

> though I would of course expect things to work if written that way.
> It would be an even greater shame if rules had to be written as UTF-8 bytes
>
>   body CRAZY_EURO /\xE2\x82\xACuro/

The /\xE2\x82\xACuro/ and /€uro/ are even now equivalent (assuming the encoding of the *.cf file is UTF-8, which is common).

> Next question: what effect (if any) would this have on rawbody rules?

It shouldn't have any effect on rawbody rules, nor on plugins which pull the raw ('pristine') text.
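The equivalence of the two rule spellings is easy to verify: in a UTF-8 encoded .cf file the literal € is stored as exactly the three octets E2 82 AC, so both rules denote the same byte sequence (Python used as a neutral illustration):

```python
# In a UTF-8 encoded rule file, the literal '€' (U+20AC) is stored as
# the octet sequence E2 82 AC, so both spellings are the same bytes.
literal = "€uro".encode("utf-8")
spelled = b"\xe2\x82\xacuro"
assert literal == spelled
print(literal)   # b'\xe2\x82\xacuro'
```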
> body CRAZY_EURO /€uro/
> header SUBJ_CREDIT_FR Subject =~ /crédit/
>
> The /\xE2\x82\xACuro/ and /€uro/ are even now equivalent
> (assuming the encoding of a *.cf file is in UTF-8, which is common).

There is a slight gotcha there when writing body and header rules in UTF-8. The above cases are fine: SpamAssassin just sees a sequence of octets (UTF-8) and compares them to a sequence of octets in a text. The gotcha is when it would be desirable to include such a non-ASCII character in a bracketed character class, e.g. [uµùúûü]. This would only work if both the text and the regexp from rules are represented in Unicode (or in some single-byte encoding), i.e. each logical character as an indivisible entity in a character class, not as individual octets. Even nastier is a range, e.g. [uµù-ü], which assumes Latin-1 encoding and has no sensible equivalent in Unicode. We have a couple of such cases currently in our ruleset, e.g. in 20_drugs.cf and 25_replace.cf. When converting such rules to UTF-8, a character class like [uµùúûü] would need to be converted to something like (u|µ|ù|ú|û|ü).
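The character-class gotcha can be demonstrated at the byte level; the same degeneration happens to a byte-semantics regexp over UTF-8 text in Perl (Python sketch for illustration):

```python
import re

# 'µ' and 'ù' in UTF-8 are two octets each: C2 B5 and C3 B9. A bracketed
# class over those octets degenerates into a set of four single bytes...
cls = re.compile(b"[" + "µù".encode("utf-8") + b"]")

# ...which falsely matches the lone C3 lead byte of an unrelated
# character: 'é' is C3 A9.
assert cls.search("é".encode("utf-8"))

# An alternation of whole octet sequences, as suggested above, stays
# correct:
alt = re.compile(b"|".join(re.escape(c.encode("utf-8")) for c in "uµùúûü"))
assert alt.search("µ".encode("utf-8"))
assert not alt.search("é".encode("utf-8"))
```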
Bug 7133: Revisiting Bug 4046 - HTML::Parser: Parsing of undecoded UTF-8 will give garbage when decoding entities

  Sending lib/Mail/SpamAssassin/HTML.pm
  Sending lib/Mail/SpamAssassin/Message/Node.pm
  Sending lib/Mail/SpamAssassin/Message.pm
  Committed revision 1659641.

This implements the proposed changes:

- to avoid the HTML::Parser utf8_mode bug (rt#99755), provide input to the HTML parser as Unicode characters when possible (i.e. when normalize_charset is on);

- if this is not possible (no charset decoding), then turn on the utf8_mode setting in HTML::Parser, which tells it to adopt byte semantics and not complain; if the input text is encoded in US-ASCII or UTF-8 the result will be mostly correct (except for the module's rt#99755 bug); if the input is other than UTF-8 the result will be a mix of encodings: the original text will remain unchanged, while HTML entities will be encoded as UTF-8 or Latin-1. It's not perfect, but is still an improvement over the present situation (e.g. it will not turn on the utf8 flag in results);

- the sub _normalize() is now capable of returning either Unicode characters or UTF-8 octets. When we know we'll be parsing HTML we can request a character result, thus avoiding an unnecessary encode/decode pair;

- the test t/html_utf8.t is adapted as described earlier.

In summary: whatever text comes out of these decoding steps (QP/Base64 decoding, Content-Type charset decoding, HTML decoding) will remain as bytes (utf8 flag off) and be given to rules and plugins as such. It will hopefully be encoded as UTF-8 octets (when such is the original encoding of the text, or when normalize_charset is enabled).

A consequence: it is advised to turn on the normalize_charset setting. It is no longer as expensive as it used to be (the result is always bytes), and it gives a predictable encoding to further processing.
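The mix-of-encodings outcome (utf8_mode on non-UTF-8 input) can be illustrated with a crude Python stand-in; this is a hypothetical byte-level emulation, not the actual HTML::Parser code path:

```python
import html

# A Latin-1 body (charset never decoded) containing one named entity:
# 'é' is E9 and 'ô' is F4 in Latin-1.
raw_latin1 = b"caf\xe9 &acirc; bient\xf4t"

# Byte semantics pass the original octets through untouched, while the
# decoded entity is emitted as UTF-8 octets - a mixed-encoding result.
entity_utf8 = html.unescape("&acirc;").encode("utf-8")   # b'\xc3\xa2'
mixed = raw_latin1.replace(b"&acirc;", entity_utf8)
print(mixed)   # b'caf\xe9 \xc3\xa2 bient\xf4t'
```

Enabling normalize_charset avoids this case entirely, because the body is transcoded to UTF-8 before parsing.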
Created attachment 5278 [details] Suggested change
(In reply to Mark Martinec from comment #8)

> A consequence: it is advised to turn on the normalize_charset setting.
> It is no longer as expensive as it could be (the result will be in
> bytes always), and gives predictable encoding to further processing.

Should we enable this by default in the next release? Or would it be too much of a surprise for most?

  normalize_charset ( 0 | 1)        (default: 0)
    Whether to detect character sets and normalize message content to
    Unicode. Requires the Encode::Detect module, HTML::Parser version
    3.46 or later, and Perl 5.8.5 or later.
(In reply to AXB from comment #10)

> should we enable this by default in the next release?
> or would it be too much of a surprise for most?
>
> normalize_charset ( 0 | 1)        (default: 0)
>   Whether to detect character sets and normalize message content to
>   Unicode. Requires the Encode::Detect module, HTML::Parser version
>   3.46 or later, and Perl 5.8.5 or later.

It's been a year since the last release, so let's worry less about surprise and more about what is effective.
(In reply to Kevin A. McGrail from comment #11)

> > normalize_charset ( 0 | 1)        (default: 0)
> >   Whether to detect character sets and normalize message content to
> >   Unicode. Requires the Encode::Detect module, HTML::Parser version
> >   3.46 or later, and Perl 5.8.5 or later.
>
> It's been a year since the last release so let's worry less about surprise
> and what is more effective?

I say enable it (1) by default.
> normalize_charset ( 0 | 1)        (default: 0)
>   Whether to detect character sets and normalize message content to
>   Unicode. Requires the Encode::Detect module, HTML::Parser version
>   3.46 or later, and Perl 5.8.5 or later.

I need to update that text a bit. Encode::Detect is no longer a requirement (just an optional bonus), and the result is in UTF-8 bytes, not Unicode characters.

> > It's been a year since the last release so let's worry less about surprise
> > and what is more effective?
>
> I say enable it (1) by default.

That would be my advice too. Some rules would need adjusting though (like the drugs / Viagra spellings, which imply a Latin encoding) to remain effective. Or they could just be abandoned and replaced with something new.
r1659704 - forgot to check in t/html_utf8.t
r1659705 - forgot to remove test crutches "$_[1]" -> $_[1], added comments
> > normalize_charset ( 0 | 1)        (default: 0)
> >   Whether to detect character sets and normalize message content to
> >   Unicode. Requires the Encode::Detect module, HTML::Parser version
> >   3.46 or later, and Perl 5.8.5 or later.
>
> I need to update that text a bit. The Encode::Detect is no longer a
> requirement (just optional bonus), and the result is in UTF-8 bytes,
> not Unicode characters.

Actually, I have already updated that text (r1655758, 2015-01-29, Bug 7126); AXB was looking at an older version. The man page currently states:

  normalize_charset ( 0 | 1)        (default: 0)
    Whether to decode non-UTF-8 and non-ASCII textual parts and recode
    them to UTF-8 before the text is given over to rules processing. The
    character set used for attempted decoding is primarily based on a
    declared character set in a Content-Type header, but if the decoding
    attempt fails a module Encode::Detect::Detector is consulted (if
    available) to provide a guess based on the actual text, and decoding
    is re-attempted. Even if the option is enabled no unnecessary decoding
    and re-encoding work is done when possible (like with an all-ASCII
    text with a US-ASCII or extended ASCII character set declaration,
    e.g. UTF-8 or ISO-8859-nn or Windows-nnnn).

    Unicode support in old versions of perl or in a core module Encode
    is likely to be buggy in places, so if the normalize_charset function
    is enabled it is advised to stick to more recent versions of perl
    (preferably 5.12 or later). The module Encode::Detect::Detector is
    optional; when necessary it will be used if it is available.
(In reply to Mark Martinec from comment #13)

> Some rules would need adjusting though (like the drugs / Viagra spellings,
> which imply Latin encoding) to remain effective. Or just abandoned and
> replaced with something new.

And the release notes would of course need to explain the issue so that people can adapt local rules accordingly...
Closing. The HTML issue is fixed here; concerns about whether or not to enable normalize_charset (and the effects it would have on rules) are carried over to a new Bug 7144 (enhancement).