7126 – Incorrect character set detections by normalize_charset

Status:	RESOLVED FIXED

Alias:	None

Product:	Spamassassin
Classification:	Unclassified
Component:	Libraries
Version:	3.4.0
Hardware:	All All

Importance:	P2 normal
Target Milestone:	3.4.1
Assignee:	SpamAssassin Developer Mailing List

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2015-01-29 15:01 UTC by Mark Martinec
Modified:	2015-02-23 17:47 UTC
CC List:	0 users

Attachment	Type	Actions	Submitter/CLA Status
The suggested replacement subroutine MS::Message::Node::_normalize()	text/plain	None	Mark Martinec
The full proposed patch (includes documentation update, Conf.pm, DependencyInfo.pm)	patch	None	Mark Martinec
The suggested replacement subroutine MS::Message::Node::_normalize() - V2	text/plain	None	Mark Martinec

Description Mark Martinec 2015-01-29 15:01:58 UTC

Noticing that several of our local mail messages are considered by
MS::Message::Node::_normalize() / normalize_charset as being written
in far-East character sets and decoded as such, which clearly does not
make sense (they are actually in UTF-8 or Windows-1252 or ISO-8859-2),
I have put up an alternative reference implementation of _normalize()
and compared the results of the two, while manually checking the
reported differences.

In our case one day's statistics show that more than 8 % of
decisions taken by _normalize() were wrong. The most common
differences were:

- decoded as big5      (should be decoded as iso-8859-2)
- decoded as euc-kr    (should be decoded as utf-8)
- decoded as euc-jp    (should be decoded as utf-8)
- decoded as shift_jis (should be decoded as windows-1252)
- decoded as utf-8     (should be decoded as windows-1252)
- not decoded          (should be decoded as gb2312)
- not decoded          (should be decoded as gbk)
- not decoded          (should be decoded as utf-8)

The source of the problem in my opinion is that the existing
_normalize() puts too much reliance on Encode::Detect::Detector
and the underlying "Mozilla's universal charset detector",
instead of trusting a declared character set (in a Content-Type),
and falling back to guesswork only when the declared character
set seems inconsistent with actual contents of a message part.

While relying primarily on guesswork may have made good sense
ten years ago, and probably still produces sensible results
in the far-East (as it errs on the side of far-Eastern character
sets), nowadays when UTF-8 is much more widespread, in my
opinion the logic is now flawed.

Comment 1 Mark Martinec 2015-01-29 15:28:12 UTC

Created attachment 5271
The suggested replacement subroutine MS::Message::Node::_normalize()

This implementation is compatible with the existing implementation;
the two agree with each other in about 92 % of cases.

Implemented with emphasis on:
- trust a declared character set as long as this assumption is not
  invalidated by the actual text (failed attempted strict decoding)
- avoid unnecessary decoding (by Encode::decode) where possible,
- avoid calling Encode::Detect::Detector unless necessary,
- produce useful debug diagnostics.
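
The priorities listed above can be sketched roughly as follows. This is
an illustration in Python, not the attached Perl subroutine: the names
normalize_sketch() and guess_charset() are hypothetical, and
guess_charset() is only a stand-in for a real detector such as
Encode::Detect::Detector (here it always answers Windows-1252).

```python
import logging
from typing import Optional

log = logging.getLogger("normalize_sketch")


def guess_charset(data: bytes) -> str:
    """Hypothetical stand-in for a real detector; always answers cp1252."""
    return "cp1252"


def normalize_sketch(data: bytes, declared: Optional[str] = None) -> bytes:
    """Return UTF-8 octets, preferring the declared charset over guessing."""
    # Pure ASCII is already valid UTF-8: no decoding, no detection.
    if data.isascii():
        return data

    # Try the declared charset first, then UTF-8, both strictly.
    for charset in (declared, "utf-8"):
        if not charset:
            continue
        try:
            text = data.decode(charset, errors="strict")
        except LookupError:
            log.debug("charset %r is unrecognized or unsupported", charset)
        except UnicodeError:
            log.debug("text is not valid %s", charset)
        else:
            log.debug("decoded as %s", charset)
            return text.encode("utf-8")

    # Only now resort to guesswork.
    guessed = guess_charset(data)
    log.debug("falling back to guessed charset %s", guessed)
    return data.decode(guessed, errors="replace").encode("utf-8")
```

The ordering is the point of the sketch: detection runs only after the
declared character set and UTF-8 have both been ruled out by a failed
strict decode.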

Comment 2 Mark Martinec 2015-01-29 16:55:36 UTC

Created attachment 5272
The full proposed patch (includes documentation update, Conf.pm, DependencyInfo.pm)

Comment 3 Mark Martinec 2015-01-29 17:25:29 UTC

Bug 7126: Incorrect character set detections by normalize_charset
  Sending lib/Mail/SpamAssassin/Conf.pm
  Sending lib/Mail/SpamAssassin/Message/Node.pm
  Sending lib/Mail/SpamAssassin/Util/DependencyInfo.pm
Committed revision 1655758.

Comment 4 Mark Martinec 2015-01-30 16:35:54 UTC

Some small refinements:
  Sending lib/Mail/SpamAssassin/Message/Node.pm
Committed revision 1656048.


- noticed that some undecodable/invalid text is 'almost ascii', but
  contains some characters like a NBSP (non-breaking space) or SHY
  (soft hyphen), or some punctuation from Windows-1252 at the codes
  which are unassigned in ISO-8859  -  so deal with that: decode it
  as Windows-1252;

- improved debugging: report if some encoding is unrecognized
  or unsupported by module Encode
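
A minimal Python sketch of the 'almost ascii' check described above.
The exact set of accepted bytes is an assumption made for illustration
(NBSP, SHY, and a few common Windows-1252 punctuation codes); the
committed Perl code may accept a different set, and the function names
are hypothetical.

```python
from typing import Optional

# Assumed set: NBSP (0xA0), SHY (0xAD), ellipsis (0x85), curly quotes
# (0x91-0x94) and dashes (0x96, 0x97) as encoded in Windows-1252.
ALMOST_ASCII_EXTRAS = frozenset(b"\xa0\xad\x85\x91\x92\x93\x94\x96\x97")


def looks_almost_ascii(data: bytes) -> bool:
    """True if every non-ASCII byte belongs to the small accepted set."""
    high = [b for b in data if b >= 0x80]
    return bool(high) and all(b in ALMOST_ASCII_EXTRAS for b in high)


def decode_almost_ascii(data: bytes) -> Optional[str]:
    """Decode as Windows-1252 when the text is 'almost ascii', else None."""
    if looks_almost_ascii(data):
        return data.decode("cp1252")
    return None
```

For example, decode_almost_ascii(b"It\x92s") yields the text with a
right single quotation mark (U+2019), while UTF-8 encoded text is left
for other steps to handle because its bytes fall outside the set.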

Comment 5 Mark Martinec 2015-02-02 12:06:04 UTC

Seen a declared character set 'ANSI X3.4-1986' (i.e. US-ASCII)
in the wild, interesting.

Bug 7126: refinements: ANSI X3.4-1986, Windows-1252 quotes
  Sending lib/Mail/SpamAssassin/Message/Node.pm
Committed revision 1656447.
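
Handling such unusual names amounts to canonicalizing the declared
character set before use. A small Python sketch, assuming a hand-made
alias table; the table contents and the name canonical_charset() are
hypothetical and not taken from Node.pm.

```python
import re

# Hypothetical alias table; keys are lowercase with '-' as separator.
CHARSET_ALIASES = {
    "ansi-x3.4-1986": "us-ascii",
    "ansi-x3.4-1968": "us-ascii",
    "ascii": "us-ascii",
}


def canonical_charset(name: str) -> str:
    """Lowercase the name, unify separators, then apply the alias table."""
    key = re.sub(r"[\s_]+", "-", name.strip().lower())
    return CHARSET_ALIASES.get(key, key)
```

With this table, both 'ANSI X3.4-1986' and 'ANSI_X3.4-1986' map to
'us-ascii', and names without an alias are returned in normalized form.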

Comment 6 Mark Martinec 2015-02-03 15:50:04 UTC

Some interesting statistics, collected from 100,000 textual mail parts
as seen in two working days at our site. A single mail message can
be counted as more than one part (e.g. text/plain + text/html in case
of multipart/alternative), so the number of mail messages analyzed is
slightly less than half that many.

The debug messages were grepped by a ': message: .*charset' and
grouped into the following groups:

  11.1%  true US-ASCII (kept unchanged)
  67.5%  valid UTF-8 as declared in Content-Type (kept unchanged)
   0.2%  valid UTF-8 as detected/guessed (kept unchanged)
  20.8%  decoded (non-UTF-8) as declared in Content-Type
   0.4%  decoded (non-UTF-8) as detected/guessed

The 'decoded' and 'as detected/guessed' only occur with a setting:
  normalize_charset 1
(otherwise these would just have been kept as unchanged octets / Mojibake).

Condensing the above further yields:

  11.1%  true US-ASCII (kept unchanged)
  67.6%  is UTF-8      (kept unchanged)
  21.3%  decoded into UTF-8 (when normalize_charset is enabled)

So, 67.6% is natively UTF-8, and 88.9% of textual parts end up
as UTF-8 if normalize_charset is enabled. The remaining 11.1%
of textual mail parts are just plain ASCII text.
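
The arithmetic behind these figures can be checked with a few lines of
Python, using only the three summary percentages quoted above (the
variable names are illustrative):

```python
from decimal import Decimal

# Summary shares of textual mail parts, in percent.
ascii_share = Decimal("11.1")    # true US-ASCII, kept unchanged
native_utf8 = Decimal("67.6")    # already UTF-8, kept unchanged
decoded_share = Decimal("21.3")  # decoded into UTF-8 by normalize_charset

# Parts that end up as UTF-8 when normalize_charset is enabled.
utf8_after_normalizing = native_utf8 + decoded_share

# The three groups together should cover all parts.
total = ascii_share + utf8_after_normalizing

print(utf8_after_normalizing, total)
```

Decimal is used instead of float so that the sums compare exactly.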


Interestingly (while not directly comparable), our 88.9% UTF-8 figure
is close to the 82.5% in "Usage of character encodings
for websites" January 2015:
  "UTF-8 is used by 82.5% of all the websites whose character
   encoding we know."
http://w3techs.com/technologies/overview/character_encoding/all

Comment 7 Mark Martinec 2015-02-06 15:21:12 UTC

(sometimes a text is declared as ISO-8859-* but is actually UTF-8)

Bug 7126 - some more tweaks at sub _normalize
  Sending lib/Mail/SpamAssassin/Message/Node.pm
Committed revision 1657862.
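
One way to catch text declared as ISO-8859-* that is actually UTF-8 is
to test for strictly valid UTF-8 before trusting the declaration. The
Python sketch below shows that heuristic; it is an assumption about the
kind of tweak involved, not a description of the committed change, and
effective_charset() is a hypothetical name.

```python
def effective_charset(data: bytes, declared: str) -> str:
    """Prefer UTF-8 over a declared ISO-8859-* when the bytes allow it."""
    # An ISO-8859-* decode rarely fails, so it cannot disprove itself;
    # strictly valid non-ASCII UTF-8 is a much stronger signal.
    if declared.lower().startswith("iso-8859-") and not data.isascii():
        try:
            data.decode("utf-8", errors="strict")
        except UnicodeDecodeError:
            return declared
        return "utf-8"
    return declared
```

Genuine ISO-8859-* text with accented letters is almost never valid
UTF-8 by accident, which is why the check runs before the declared
character set is applied.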

Comment 8 Mark Martinec 2015-02-12 13:47:03 UTC

Created attachment 5277
The suggested replacement subroutine MS::Message::Node::_normalize() - V2

In view of:

  [Bug 7133] Revisiting Bug 4046 - HTML::Parser: Parsing of undecoded UTF-8
     will give garbage when decoding entities,

  and HTML::Parser bug:
    https://rt.cpan.org/Public/Bug/Display.html?id=99755

it seems desirable to be able to obtain from sub _normalize either
decoded characters (Unicode), or encoded as UTF-8 octets,
so I have generalized the proposed replacement sub _normalize()
to provide one or the other, based on an optional parameter.
In its absence it defaults to current behaviour (returns UTF-8
octets), preserving compatibility.
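
The idea of one routine with two output forms can be sketched in Python
as below. The parameter name return_decoded and the function name are
hypothetical; the actual Perl signature is in the attachment.

```python
from typing import Union


def normalize_text(data: bytes, charset: str,
                   return_decoded: bool = False) -> Union[str, bytes]:
    """Decode `data` from `charset`; return characters or UTF-8 octets."""
    text = data.decode(charset, errors="strict")
    if return_decoded:
        return text               # decoded characters (Unicode)
    return text.encode("utf-8")   # default: UTF-8 octets, as before
```

Callers that omit the optional argument keep receiving UTF-8 octets,
which is what preserves compatibility; a caller that must hand decoded
characters to an HTML parser can ask for them explicitly.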

Attached is my last version of sub _normalize().



Bug 7126: Incorrect character set detections
by normalize_charset - sub _normalize() V2
  Sending lib/Mail/SpamAssassin/Message/Node.pm
Committed revision 1659255.

Comment 9 Mark Martinec 2015-02-23 17:47:17 UTC

This seems to work quite well, with or without normalize_charset enabled.
I don't think this change introduced any user-visible incompatibilities
with 3.4.0.  Closing.