SA Bugzilla – Bug 7133
Revisiting Bug 4046 - HTML::Parser: Parsing of undecoded UTF-8 will give garbage when decoding entities
Last modified: 2015-02-23 19:22:54 UTC
Back in 2004, Bug 4046 was opened, noting that the HTML::Parser module issues the warning "Parsing of undecoded UTF-8 will give garbage when decoding entities" when processing certain text/html mail parts. The solution proposed right away by Sebastian Jaenicke (the bug submitter) was to turn on HTML::Parser's utf8_mode(), which is mostly the right thing to do (unless we want to decode all HTML parts into Unicode first - i.e. to perl characters). Unfortunately that solution was rejected in 2005, mainly because HTML::Parser's utf8_mode requires perl 5.8 (5.7.?), which was not considered widely deployed ten years ago. So the chosen solution was just to mask the warning so that it does not show up, effectively hiding the actual problem, which was not considered serious enough, and/or possibly not fully understood.

====

Fast-forward ten years. Analyzing why some (but not all) Bayes tokens (words) obtained from text/html mail parts look like an unrecognizable 8-bit mess (as if encoded into UTF-8 twice in a row), it turns out that the culprit is incorrect use of HTML::Parser - which would indeed rightfully produce a warning, had it not been swept under the carpet.

Consider the following text:

  Ne traîne pas, adieu - Et tâche d'être heureux

which, encoded with

  Content-Transfer-Encoding: quoted-printable
  Content-Type: text/html; charset="utf-8"

could look like this in a mail message:

  From: test@example.com
  To: test@example.com
  Subject: test
  Date: Wed, 4 Feb 2015 00:01:56 +0100
  Message-ID: <54d1536476d7c@example.com>
  Mime-Version: 1.0
  Content-Transfer-Encoding: quoted-printable
  Content-Type: text/html; charset="utf-8"

  <html><body><p>
  Ne tra=C3=AEne pas, adieu - Et t&acirc;che d'=C3=AAtre heureux
  </body></html>

Note that characters are encoded as QP UTF-8, except for the â character in tâche, which is represented by the &acirc; HTML entity (a-circumflex). Both representations are perfectly legal and equivalent in such a text/html MIME part.
Unfortunately this text (after QP decoding and HTML decoding) ends up as:

  Ne traîne pas, adieu - Et t�che d'être heureux

which is (as octets):

  Ne tra<C3><AE>ne pas, adieu - Et t<E2>che d'<C3><AA>tre heureux

and this is then given to rules and to bayes tokenization. HTML::Parser does rightfully issue a warning (which is suppressed by Mail::SpamAssassin::HTML), and the result is clearly wrong: characters are encoded as UTF-8, except for the &acirc; entity, which is encoded as ISO-8859-1 (i.e. Latin-1).

And it gets worse. Replacing the '-' by the HTML entity &mdash;, like this:

  <html><body><p>
  Ne tra=C3=AEne pas, adieu &mdash; Et t&acirc;che d'=C3=AAtre heureux
  </body></html>

yields:

  Ne traîne pas, adieu — Et tâche d'être heureux

which is (as octets):

  Ne tra<C3><83><C2><AE>ne pas, adieu <E2><80><94> Et t<C3><A2>che d'<C3><83><C2><AA>tre heureux

Note that by just replacing one dash character with an entity (leaving everything else unchanged), the entire encoding turns into a mojibake scramble: characters represented by HTML entities are correct, but all other UTF-8 octet pairs are incorrectly assumed to be Latin-1 bytes and encoded into UTF-8 again, resulting in four octets each!

====

So what is going on here? As soon as some HTML entity (like &trade; or &euro; or &scaron; or &circ; or &tilde; ...) which has no representation in Latin-1 appears in a UTF-8 text, the entire text is upgraded into perl characters (i.e. Unicode, utf8 flag on), and during this process existing bytes with codes between 128 and 255 are assumed to be Latin-1, resulting in doubly-encoded UTF-8 mumble-jumble. It also means that rules and other plugins receive such text flagged as perl characters (utf8 flag on), so some rules may misfire or take longer to evaluate. Just adding one &trade; entity ruins the encoding of the entire HTML text!
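The four-octet scramble above is plain double encoding: UTF-8 octets misread as Latin-1 and encoded to UTF-8 once more. A minimal Python sketch of the byte arithmetic (Python used only as an illustration; SpamAssassin itself is Perl):

```python
# 'î' (U+00EE) is the UTF-8 pair C3 AE; misreading those two octets as
# two Latin-1 characters and re-encoding yields four octets: C3 83 C2 AE.
utf8_octets = "traîne".encode("utf-8")                    # b'tra\xc3\xaene'
mojibake = utf8_octets.decode("latin-1").encode("utf-8")
print(mojibake)                                           # b'tra\xc3\x83\xc2\xaene'
```

This reproduces exactly the `tra<C3><83><C2><AE>ne` octet dump shown above.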
Now for possible solutions. Carefully reading (and understanding) HTML::Parser's warning, "Parsing of undecoded UTF-8 will give garbage when decoding entities", gives us a clue. There are exactly two correct options when processing HTML mail parts:

1) Decode the text (based on its declared encoding) into perl characters (Unicode) and pass that to HTML::Parser for HTML parsing. (A slightly modified MS::Message::_normalize() could do that.) The result would remain as perl characters (utf8 flag on), with HTML entities properly decoded. This is the default mode of operation of HTML::Parser, and this is how SpamAssassin calls it - except that SpamAssassin violates its assumption about perl characters, giving it encoded text (as UTF-8 or ISO-8859-1 or whatever the original encoding is). No wonder HTML::Parser complains when it notices it is given UTF-8 encoded octets instead of perl characters.

2) The other option is to stick to the current processing as bytes in most parts of SpamAssassin. HTML::Parser can also deal with text encoded as UTF-8 octets, but in this case ->utf8_mode(1) needs to be set to let it know this is the case. The result remains encoded as UTF-8 octets, with HTML entities properly decoded into UTF-8 octets. This is the solution originally proposed by Sebastian Jaenicke in Bug 4046. This alternative is least disruptive to the rest of SpamAssassin, as it retains text as octets (no utf8 flag). The only drawback is when the original text is encoded in something other than UTF-8 (as in 32% of messages, according to Bug 7126). Fortunately there is a clean solution to that: turn on the normalize_charset option (see Bug 6945) - it transcodes any given encoding into UTF-8, so the rest of the processing (HTML decoding, rules, bayes) can rely on a single encoding, while still keeping the speed of processing octets (not perl characters).
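The two options can be sketched in Python as a stand-in for the Perl code paths (html.unescape plays the role of HTML::Parser's entity decoding here; this is an illustration, not the actual SpamAssassin code):

```python
import html

# QP-decoded body: raw UTF-8 octets with one named entity left in.
raw = b"Et t&acirc;che d'\xc3\xaatre heureux"

# Option 1: decode the declared charset first, then parse as characters;
# the result stays as (Unicode) characters.
chars = html.unescape(raw.decode("utf-8"))

# Option 2: keep octets end to end; entity decoding must then be UTF-8
# aware (what ->utf8_mode(1) provides in Perl), emulated here by a
# decode / unescape / encode round trip.
octets = html.unescape(raw.decode("utf-8")).encode("utf-8")

print(chars)    # Et tâche d'être heureux
print(octets)   # b"Et t\xc3\xa2che d'\xc3\xaatre heureux"
```

Either way the entity and the QP-encoded characters come out in one consistent representation, which is the whole point.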
Seems to me that option #1 is the cleanest, with good potential for the future, while option #2 is least disruptive to existing code and its speed. Keeping the status quo (with no ->utf8_mode(1) and with suppressed warnings) seems the worst.
I'm more inclined for #2 at least for 3.4.X.
> I'm more inclined for #2 at least for 3.4.X.

So am I, for the short term - keeping text encoded as octets (hopefully encoded as UTF-8 in most cases) for the rest of processing.

Meanwhile I found an already documented bug in HTML::Parser:

  https://rt.cpan.org/Public/Bug/Display.html?id=99755

which also affects SpamAssassin if we choose to take advantage of the utf8_mode of parsing. It seems that some HTML entities are left as Latin-1 octets even when utf8_mode is on. The problem seems localized to individual HTML paragraphs with the right mix of content, and I'm seeing occasional fallout from this bug as junk bayes tokens, now that I'm calling HTML::Parser with utf8_mode enabled. It also seems that HTML::Parser hasn't been maintained lately, so it will probably take some time for the bug to get fixed.

I was unable to find an easy workaround, apart from decoding text to Unicode first (utf8 flag on), letting HTML::Parser work in Unicode, and encoding the result to UTF-8 again after HTML parsing. As it happens, the sub MS::Message::Node::rendered() contains both a call to _normalize() as well as a call to HTML->parse a little further down. It may be possible to avoid one unnecessary encoding/decoding pair and carry Unicode from _normalize() to HTML::Parser, doing the encoding to UTF-8 only after HTML decoding. Will see how that approach looks.
While experimenting with our HTML decoding, I noticed that our test t/html_utf8.t expects that an HTML text is decoded into Unicode characters (not UTF-8 octets), otherwise it fails. The t/html_utf8.t installs the following rule:

  body QUOTE_YOUR /\x{201c}Your/

and runs SpamAssassin on the file t/data/spam/009, which contains the following HTML chunk:

  Click the &#8220;Your Account&#8221;

(these entities represent double quotes: Click the “Your Account”).

Grepping through our current rules, I don't see a single case of a Unicode character (like \x{9999} or \N{U+9999}) in any of the rules, although there are lots of single bytes encoded as \x99. So our rules do not seem to be expecting Unicode characters, just bytes, unlike the t/html_utf8.t test. Note this has nothing to do with the normalize_charset setting; it is entirely an effect of HTML::Parser being called with utf8_mode off (which is the default, and in effect in SpamAssassin so far).

So the question now is: is the test wrong in what it expects, i.e. should the test rule be:

  body QUOTE_YOUR /\xE2\x80\x9CYour/

instead?

The next logical question could be: in developing new rules (or updating existing rules), what should be the expected representation of a plain or HTML text as given to rules:

- in Unicode characters (i.e. QP and Base64 decoded + character set decoded (Content-Type) + HTML decoded);

- in UTF-8 octets (i.e. QP and Base64 decoded + character set decoded (Content-Type) + HTML decoded, then encoded into UTF-8 octets) - this seems to be the situation we are currently aiming at, with the normalize_charset option enabled;

- as octets of the original character set, mixed with Unicode or Latin-1 entities from HTML parts (i.e. QP and Base64 decoded; HTML entities decoded into Unicode or Latin-1) - this is essentially our present situation with normalize_charset off;

- in UTF-8 octets, mixed with Unicode or Latin-1 entities from HTML parts (i.e. QP and Base64 decoded + character set decoded (Content-Type); HTML entities decoded into Unicode or Latin-1) - this is essentially our present situation with normalize_charset on;

- in UTF-8 octets, mixed with UTF-8 or Latin-1 entities from HTML parts (i.e. QP and Base64 decoded + character set decoded (Content-Type); HTML entities recoded into UTF-8 (utf8_mode on) or to Latin-1 (bug));

- no decoding, original bytes (some rules seem to expect junk like QP-encoded characters).
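The two candidate forms of the test rule differ only in representation: \x{201c} names one Unicode character, while \xE2\x80\x9C spells out its three UTF-8 octets. A Python illustration of why each form only matches its own representation of the text (a stand-in for the Perl regexp semantics):

```python
import re

# U+201C LEFT DOUBLE QUOTATION MARK, as decoded characters and as
# UTF-8 octets.
as_chars = "\u201cYour Account\u201d"
as_bytes = as_chars.encode("utf-8")

# A character-level rule matches the character form...
assert re.search("\u201cYour", as_chars)

# ...while the byte-spelled rule matches the octet form:
assert re.search(rb"\xe2\x80\x9cYour", as_bytes)
```

So the "right" spelling of the rule depends entirely on which representation the engine hands to rules, which is the question posed above.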
Firstly, thanks for all the good work so far on this.

Thinking about this purely from the rule-writing user's point of view, totally ignoring the history and largely ignoring the underlying technical details :-) I would want to be able to include any Unicode characters directly in the rule file, and have it match the equivalent characters in the message, regardless of Content-Type charset, and regardless of any base64, quoted-printable and/or (for HTML message parts) &entity; encoding.

So I should be able to write things like

  body   CRAZY_EURO      /€uro/
  header SUBJ_CREDIT_FR  Subject =~ /crédit/

and match any occurrences of "€uro" or "crédit" regardless of what charset the message was originally encoded in and whether entities were used.

This of course would imply that rule .cf files would need to be encoded in UTF-8 (or whatever) and subjected to charset normalisation. I guess that's a whole new can of worms, but IMO it would make it far easier to address international spam patterns. After all your efforts to normalise the message, it would be a great shame to have to encode all non-ASCII characters in rules, e.g.

  body CRAZY_EURO /\x{20AC}uro/

though I would of course expect things to work if written that way. It would be an even greater shame if rules had to be written as UTF-8 bytes:

  body CRAZY_EURO /\xE2\x82\xACuro/

Next question: what effect (if any) would this have on rawbody rules?
(In reply to John Wilcock from comment #5)

Thanks for the feedback!

> Thinking about this purely from the rule-writing user's point of view,
> totally ignoring the history and largely ignoring the underlying technical
> details :-) I would want to be able to include any Unicode characters
> directly in the rule file, and have it match the equivalent characters in
> the message, regardless of Content-Type charset, and regardless of any
> base64, quoted-printable and/or (for HTML message parts) &entity; encoding.
>
> So I should be able to write things like
>
>   body CRAZY_EURO /€uro/
>   header SUBJ_CREDIT_FR Subject =~ /crédit/
>
> and match any occurrences of "€uro" or "crédit" regardless of what charset
> the message was originally encoded in and whether entities were used.

You can do that even now. For example, I have the following rules (for localized spam/phishing):

  body L_WEBTEAM2_3  m{Ček vaš email hvala}
  body L_WEBTEAM2_4  m{Administrator je e-poštni sistem}
  body L_WEBTEAM2_5  m{Hvala za vaše sodelovanje pri zaščiti!}

These are encoded as you see them here, i.e. UTF-8 encoded in a local.cf file, using a text editor (in a UTF-8 locale).

> This of course would imply that rule .cf files would need to be encoded
> in UTF-8 (or whatever) and subjected to charset normalisation.

Right. The above rules (yours or mine) work in the following cases:

- if text in a mail message is already encoded in UTF-8 (after QP and Base64 decoding);

- or if normalize_charset is enabled (even in 3.4.0, improved in trunk) and the original character set can be successfully decoded.

It does not currently work for HTML entities, which (depending on HTML::Parser idiosyncrasies, its use in SpamAssassin, and its bugs) can end up as Unicode (wide characters) or as Latin-1 - regardless of the original encoding of the text.

> I guess that's a
> whole new can of worms, but IMO it would make it far easier to address
> international spam patterns.
> After all your efforts to normalise the
> message, it would be a great shame to have to encode all non-ASCII
> characters in rules, e.g.
>
>   body CRAZY_EURO /\x{20AC}uro/

The above won't work, unless we decide to go for full decoding into Unicode.

> though I would of course expect things to work if written that way.
> It would be an even greater shame if rules had to be written as UTF-8 bytes
>
>   body CRAZY_EURO /\xE2\x82\xACuro/

The /\xE2\x82\xACuro/ and /€uro/ are even now equivalent (assuming the encoding of the *.cf file is UTF-8, which is common).

> Next question: what effect (if any) would this have on rawbody rules?

It shouldn't have any effect on rawbody rules, nor on plugins which pull the raw ('pristine') text.
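The equivalence of the two rule spellings is easy to verify: in a UTF-8 encoded .cf file the literal € is stored as exactly the three octets E2 82 AC, so both rules denote the same byte sequence (Python used as a neutral illustration):

```python
# In a UTF-8 encoded rule file, the literal '€' (U+20AC) is stored as
# the octet sequence E2 82 AC, so both spellings are the same bytes.
literal = "€uro".encode("utf-8")
spelled = b"\xe2\x82\xacuro"
assert literal == spelled
print(literal)   # b'\xe2\x82\xacuro'
```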
> body CRAZY_EURO /€uro/
> header SUBJ_CREDIT_FR Subject =~ /crédit/
>
> The /\xE2\x82\xACuro/ and /€uro/ are even now equivalent
> (assuming the encoding of a *.cf file is in UTF-8, which is common).

There is a slight gotcha there when writing body and header rules in UTF-8. The above cases are fine: SpamAssassin just sees a sequence of octets (UTF-8) and compares them to a sequence of octets in a text. The gotcha is when it would be desirable to include such a non-ASCII character in a bracketed character class, e.g. [uµùúûü]. This would only work if both the text and the regexp from rules are represented in Unicode (or in some single-byte encoding), i.e. each logical character as an indivisible entity in a character class, not as individual octets. Even nastier is a range, e.g. [uµù-ü], which assumes Latin-1 encoding and has no sensible equivalent in Unicode. We have a couple of such cases currently in our ruleset, e.g. in 20_drugs.cf and 25_replace.cf. When converting such rules to UTF-8, a character class like [uµùúûü] would need to be converted to something like (u|µ|ù|ú|û|ü).
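The character-class gotcha can be demonstrated at the byte level; the same degeneration happens to a byte-semantics regexp over UTF-8 text in Perl (Python sketch for illustration):

```python
import re

# 'µ' and 'ù' in UTF-8 are two octets each: C2 B5 and C3 B9. A bracketed
# class over those octets degenerates into a set of four single bytes...
cls = re.compile(b"[" + "µù".encode("utf-8") + b"]")

# ...which falsely matches the lone C3 lead byte of an unrelated
# character: 'é' is C3 A9.
assert cls.search("é".encode("utf-8"))

# An alternation of whole octet sequences, as suggested above, stays
# correct:
alt = re.compile(b"|".join(re.escape(c.encode("utf-8")) for c in "uµùúûü"))
assert alt.search("µ".encode("utf-8"))
assert not alt.search("é".encode("utf-8"))
```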
Bug 7133: Revisiting Bug 4046 - HTML::Parser: Parsing of undecoded UTF-8 will give garbage when decoding entities

  Sending lib/Mail/SpamAssassin/HTML.pm
  Sending lib/Mail/SpamAssassin/Message/Node.pm
  Sending lib/Mail/SpamAssassin/Message.pm
  Committed revision 1659641.

This implements the proposed changes:

- to avoid the HTML::Parser utf8_mode bug (rt#99755), provide input to the HTML parser as Unicode characters when possible (i.e. when normalize_charset is on);

- if this is not possible (no charset decoding), then turn on the utf8_mode setting in HTML::Parser, which tells it to adopt byte semantics and not complain; if the input text is encoded in US-ASCII or UTF-8 the result will be mostly correct (except for the module's rt#99755 bug); if the input is other than UTF-8 the result will be a mix of encodings: the original text will remain unchanged, while HTML entities will be encoded as UTF-8 or Latin-1. It's not perfect, but is still an improvement over the present situation (e.g. it will not turn on the utf8 flag in results);

- the sub _normalize() is now capable of returning either Unicode characters or UTF-8 octets. When we know we'll be parsing HTML we can request a character result, thus avoiding an unnecessary encode/decode pair;

- the test t/html_utf8.t is adapted as described earlier.

In summary: whatever text comes out of these decoding steps (QP/Base64 decoding, Content-Type charset decoding, HTML decoding) will remain as bytes (utf8 flag off) and be given to rules and plugins as such. It will hopefully be encoded as UTF-8 octets (when such is the original encoding of the text, or when normalize_charset is enabled).

A consequence: it is advised to turn on the normalize_charset setting. It is no longer as expensive as it used to be (the result is always bytes), and it gives a predictable encoding to further processing.
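The mix-of-encodings outcome (utf8_mode on non-UTF-8 input) can be illustrated with a crude Python stand-in; this is a hypothetical byte-level emulation, not the actual HTML::Parser code path:

```python
import html

# A Latin-1 body (charset never decoded) containing one named entity:
# 'é' is E9 and 'ô' is F4 in Latin-1.
raw_latin1 = b"caf\xe9 &acirc; bient\xf4t"

# Byte semantics pass the original octets through untouched, while the
# decoded entity is emitted as UTF-8 octets - a mixed-encoding result.
entity_utf8 = html.unescape("&acirc;").encode("utf-8")   # b'\xc3\xa2'
mixed = raw_latin1.replace(b"&acirc;", entity_utf8)
print(mixed)   # b'caf\xe9 \xc3\xa2 bient\xf4t'
```

Enabling normalize_charset avoids this case entirely, because the body is transcoded to UTF-8 before parsing.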
Created attachment 5278 [details] Suggested change
(In reply to Mark Martinec from comment #8)

> A consequence: it is advised to turn on the normalize_charset setting.
> It is no longer as expensive as it could be (the result will be in
> bytes always), and gives predictable encoding to further processing.

Should we enable this by default in the next release? Or would it be too much of a surprise for most?

  normalize_charset ( 0 | 1)        (default: 0)
    Whether to detect character sets and normalize message content to
    Unicode. Requires the Encode::Detect module, HTML::Parser version
    3.46 or later, and Perl 5.8.5 or later.
(In reply to AXB from comment #10)

> should we enable this by default in the next release?
> or would it be too much of a surprise for most?
>
> normalize_charset ( 0 | 1)        (default: 0)
>   Whether to detect character sets and normalize message content to
>   Unicode. Requires the Encode::Detect module, HTML::Parser version
>   3.46 or later, and Perl 5.8.5 or later.

It's been a year since the last release, so let's worry less about surprise and more about what is effective.
(In reply to Kevin A. McGrail from comment #11)

> > normalize_charset ( 0 | 1)        (default: 0)
> >   Whether to detect character sets and normalize message content to
> >   Unicode. Requires the Encode::Detect module, HTML::Parser version
> >   3.46 or later, and Perl 5.8.5 or later.
>
> It's been a year since the last release so let's worry less about surprise
> and what is more effective?

I say enable it (1) by default.
> normalize_charset ( 0 | 1)        (default: 0)
>   Whether to detect character sets and normalize message content to
>   Unicode. Requires the Encode::Detect module, HTML::Parser version
>   3.46 or later, and Perl 5.8.5 or later.

I need to update that text a bit. Encode::Detect is no longer a requirement (just an optional bonus), and the result is in UTF-8 bytes, not Unicode characters.

> > It's been a year since the last release so let's worry less about surprise
> > and what is more effective?
>
> I say enable it (1) by default.

That would be my advice too. Some rules would need adjusting though (like the drugs / Viagra spellings, which imply a Latin encoding) to remain effective. Or they could just be abandoned and replaced with something new.
r1659704 - forgot to check in t/html_utf8.t
r1659705 - forgot to remove test crutches "$_[1]" -> $_[1], added comments
> > normalize_charset ( 0 | 1)        (default: 0)
> >   Whether to detect character sets and normalize message content to
> >   Unicode. Requires the Encode::Detect module, HTML::Parser version
> >   3.46 or later, and Perl 5.8.5 or later.
>
> I need to update that text a bit. The Encode::Detect is no longer a
> requirement (just optional bonus), and the result is in UTF-8 bytes,
> not Unicode characters.

Actually, I have already updated that text (r1655758, 2015-01-29, Bug 7126); AXB was looking at an older version. The man page currently states:

  normalize_charset ( 0 | 1)        (default: 0)
    Whether to decode non-UTF-8 and non-ASCII textual parts and recode
    them to UTF-8 before the text is given over to rules processing. The
    character set used for attempted decoding is primarily based on a
    declared character set in a Content-Type header, but if the decoding
    attempt fails a module Encode::Detect::Detector is consulted (if
    available) to provide a guess based on the actual text, and decoding
    is re-attempted. Even if the option is enabled no unnecessary decoding
    and re-encoding work is done when possible (like with an all-ASCII
    text with a US-ASCII or extended ASCII character set declaration,
    e.g. UTF-8 or ISO-8859-nn or Windows-nnnn).

    Unicode support in old versions of perl or in a core module Encode
    is likely to be buggy in places, so if the normalize_charset function
    is enabled it is advised to stick to more recent versions of perl
    (preferably 5.12 or later). The module Encode::Detect::Detector is
    optional; when necessary it will be used if it is available.
(In reply to Mark Martinec from comment #13)

> Some rules would need adjusting though (like the drugs / Viagra spellings,
> which imply Latin encoding) to remain effective. Or just abandoned and
> replaced with something new.

And the release notes would of course need to explain the issue so that people can adapt local rules accordingly...
Closing. The HTML issue is fixed here; concerns about whether or not to enable normalize_charset (and the effects it would have on rules) are carried over to a new Bug 7144 (enhancement).