Bug 7126 - Incorrect character set detections by normalize_charset
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Libraries
Version: 3.4.0
Hardware: All All
Importance: P2 normal
Target Milestone: 3.4.1
Assignee: SpamAssassin Developer Mailing List
Reported: 2015-01-29 15:01 UTC by Mark Martinec
Modified: 2015-02-23 17:47 UTC

Attachments:
- The suggested replacement subroutine MS::Message::Node::_normalize() (type: text/plain; status: None; submitter: Mark Martinec [HasCLA])
- The full proposed patch (includes documentation update, Conf.pm, DependencyInfo.pm) (type: patch; status: None; submitter: Mark Martinec [HasCLA])
- The suggested replacement subroutine MS::Message::Node::_normalize() - V2 (type: text/plain; status: None; submitter: Mark Martinec [HasCLA])

Description Mark Martinec 2015-01-29 15:01:58 UTC
Noticing that several of our local mail messages are considered by
MS::Message::Node::_normalize() / normalize_charset as being written
in Far Eastern character sets and decoded as such, which clearly does not
make sense (they are actually in UTF-8 or Windows-1252 or ISO-8859-2),
I have put up an alternative reference implementation of _normalize()
and compared the results of the two, while manually checking the
reported differences.

In our case one day's statistics show that more than 8% of the
decisions taken by _normalize() were wrong. The most common
differences were:

- decoded as big5      (should be decoded as iso-8859-2)
- decoded as euc-kr    (should be decoded as utf-8)
- decoded as euc-jp    (should be decoded as utf-8)
- decoded as shift_jis (should be decoded as windows-1252)
- decoded as utf-8     (should be decoded as windows-1252)
- not decoded          (should be decoded as gb2312)
- not decoded          (should be decoded as gbk)
- not decoded          (should be decoded as utf-8)

The source of the problem in my opinion is that the existing
_normalize() puts too much reliance on Encode::Detect::Detector
and the underlying "Mozilla's universal charset detector",
instead of trusting a declared character set (in a Content-Type),
and falling back to guesswork only when the declared character
set seems inconsistent with actual contents of a message part.

While relying primarily on guesswork may have made good sense
ten years ago, and probably still produces sensible results
in the Far East (as it errs on the side of Far Eastern character
sets), UTF-8 is much more widespread nowadays, and in my
opinion the logic is flawed.
Comment 1 Mark Martinec 2015-01-29 15:28:12 UTC
Created attachment 5271
The suggested replacement subroutine MS::Message::Node::_normalize()

This implementation is compatible with the existing one:
the two agree with each other in about 92% of cases.

Implemented with emphasis on:
- trust a declared character set as long as this assumption is not
  invalidated by the actual text (i.e. an attempted strict decoding fails),
- avoid unnecessary decoding (by Encode::decode) where possible,
- avoid calling Encode::Detect::Detector unless necessary,
- produce useful debug diagnostics.
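
The decision order in this list can be sketched as follows. This is a minimal illustration in Python, not the Perl code of the attached subroutine: the names normalize and try_strict, the return shape, and the guess callback (a stand-in for a detector such as Encode::Detect::Detector) are assumptions made for the example.

```python
from typing import Callable, Optional, Tuple


def try_strict(data: bytes, charset: str) -> Optional[str]:
    """Return the decoded text, or None if the charset is unknown
    or the octets are not valid in it."""
    try:
        return data.decode(charset, "strict")
    except (LookupError, ValueError):  # unknown charset or UnicodeDecodeError
        return None


def normalize(
    data: bytes,
    declared: Optional[str],
    guess: Optional[Callable[[bytes], Optional[str]]] = None,
) -> Tuple[bytes, Optional[str]]:
    """Return (UTF-8 octets, charset used); charset is None if nothing worked."""
    # 1. Pure ASCII needs no decoding at all.
    if all(b < 0x80 for b in data):
        return data, "us-ascii"
    # 2. Trust the declared charset unless a strict decode fails.
    if declared:
        text = try_strict(data, declared)
        if text is not None:
            return text.encode("utf-8"), declared.lower()
    # 3. Only now fall back to guesswork (hypothetical detector callback).
    if guess is not None:
        guessed = guess(data)
        if guessed:
            text = try_strict(data, guessed)
            if text is not None:
                return text.encode("utf-8"), guessed.lower()
    # 4. Give up and keep the octets unchanged.
    return data, None
```

Note that in Python a strict decode of a single-byte charset such as ISO-8859-2 accepts every byte value, so step 2 mainly catches wrong declarations of multi-byte charsets such as UTF-8; a real implementation would likely need additional plausibility checks.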
Comment 2 Mark Martinec 2015-01-29 16:55:36 UTC
Created attachment 5272
The full proposed patch (includes documentation update, Conf.pm, DependencyInfo.pm)
Comment 3 Mark Martinec 2015-01-29 17:25:29 UTC
Bug 7126: Incorrect character set detections by normalize_charset
  Sending lib/Mail/SpamAssassin/Conf.pm
  Sending lib/Mail/SpamAssassin/Message/Node.pm
  Sending lib/Mail/SpamAssassin/Util/DependencyInfo.pm
Committed revision 1655758.
Comment 4 Mark Martinec 2015-01-30 16:35:54 UTC
Some small refinements:
  Sending lib/Mail/SpamAssassin/Message/Node.pm
Committed revision 1656048.


- noticed that some undecodable/invalid text is 'almost ASCII', but
  contains some characters like NBSP (non-breaking space) or SHY
  (soft hyphen), or some Windows-1252 punctuation at code points
  which are unassigned in ISO-8859; such text is now decoded
  as Windows-1252;

- improved debugging: report if some encoding is unrecognized
  or unsupported by module Encode
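
The 'almost ASCII' check in the first item could look roughly like the sketch below. It is written in Python for illustration only and is not taken from the committed Perl code; the name decode_almost_ascii and the exact set of Windows-1252 punctuation bytes are assumptions.

```python
import re
from typing import Optional

# ASCII plus NBSP (0xA0), SHY (0xAD) and a hypothetical selection of
# Windows-1252 punctuation from the 0x80-0x9F range (quotes, dashes,
# ellipsis, bullet, angle quotes).
ALMOST_ASCII = re.compile(
    rb"[\x00-\x7f\xa0\xad\x82\x84\x85\x8b\x91-\x97\x9b]*"
)


def decode_almost_ascii(data: bytes) -> Optional[str]:
    """Decode as Windows-1252 if the octets are ASCII plus a few
    typical Windows-1252 characters; otherwise return None."""
    if ALMOST_ASCII.fullmatch(data):
        return data.decode("cp1252")
    return None
```

For example, decode_almost_ascii(b"It\x92s here") yields the text with a right single quotation mark (U+2019), while octets containing other high-bit bytes return None and are left to the other checks.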
Comment 5 Mark Martinec 2015-02-02 12:06:04 UTC
Seen a declared character set 'ANSI X3.4-1986' (i.e. US-ASCII)
in the wild, interesting.

Bug 7126: refinements: ANSI X3.4-1986, Windows-1252 quotes
  Sending lib/Mail/SpamAssassin/Message/Node.pm
Committed revision 1656447.
Comment 6 Mark Martinec 2015-02-03 15:50:04 UTC
Some interesting statistics, collected from 100,000 textual mail parts
as seen in two working days at our site. A single mail message can
be counted as more than one part (e.g. text/plain + text/html in case
of multipart/alternative), so the number of mail messages analyzed is
slightly less than half of that.

The debug messages were grepped for ': message: .*charset' and
grouped into the following groups:

  11.1%  true US-ASCII (kept unchanged)
  67.5%  valid UTF-8 as declared in Content-Type (kept unchanged)
   0.2%  valid UTF-8 as detected/guessed (kept unchanged)
  20.8%  decoded (non-UTF-8) as declared in Content-Type
   0.4%  decoded (non-UTF-8) as detected/guessed

The 'decoded' and 'as detected/guessed' only occur with a setting:
  normalize_charset 1
(otherwise these would just have been kept as unchanged octets / Mojibake).

Summarizing the above further yields:

  11.1%  true US-ASCII (kept unchanged)
  67.6%  is UTF-8      (kept unchanged)
  21.3%  decoded into UTF-8 (when normalize_charset is enabled)

So, 67.6% is natively UTF-8, and 88.9% of textual parts end up
as UTF-8 if normalize_charset is enabled. The remaining 11.1%
of textual mail parts is just plain ASCII text.


Interestingly (while not directly comparable), our 88.9% UTF-8 figure
corresponds closely to the 82.5% in "Usage of character encodings
for websites" January 2015:
  "UTF-8 is used by 82.5% of all the websites whose character
   encoding we know."
http://w3techs.com/technologies/overview/character_encoding/all
Comment 7 Mark Martinec 2015-02-06 15:21:12 UTC
(sometimes a text is declared as ISO-8859-* but is actually UTF-8)

Bug 7126 - some more tweaks at sub _normalize
  Sending lib/Mail/SpamAssassin/Message/Node.pm
Committed revision 1657862.
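
One way to handle text that is declared as ISO-8859-* but is actually UTF-8 is sketched below in Python. This is an assumption about a plausible approach, not a description of what revision 1657862 does; the function name effective_charset is hypothetical. It relies on the observation that octets with the high bit set rarely form valid UTF-8 by accident.

```python
def effective_charset(data: bytes, declared: str) -> str:
    """Prefer UTF-8 over a declared ISO-8859-* charset when the octets
    contain non-ASCII bytes and still decode strictly as UTF-8."""
    is_iso_8859 = declared.lower().startswith("iso-8859-")
    has_high_bit = any(b >= 0x80 for b in data)
    if is_iso_8859 and has_high_bit:
        try:
            data.decode("utf-8", "strict")
            return "utf-8"
        except UnicodeDecodeError:
            pass  # not valid UTF-8, keep the declared charset
    return declared
```

With this sketch the octets b"\xc4\x8d" declared as iso-8859-2 are treated as UTF-8, whereas b"\xe8" keeps the declared charset.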
Comment 8 Mark Martinec 2015-02-12 13:47:03 UTC
Created attachment 5277
The suggested replacement subroutine MS::Message::Node::_normalize() - V2

In view of:

  [Bug 7133] Revisiting Bug 4046 - HTML::Parser: Parsing of undecoded UTF-8
     will give garbage when decoding entities,

  and HTML::Parser bug:
    https://rt.cpan.org/Public/Bug/Display.html?id=99755

it seems desirable to be able to obtain from sub _normalize either
decoded characters (Unicode), or encoded as UTF-8 octets,
so I have generalized the proposed replacement sub _normalize()
to provide one or the other, based on an optional parameter.
In its absence it defaults to current behaviour (returns UTF-8
octets), preserving compatibility.
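
The idea of an optional parameter that selects the return form can be illustrated with the following Python sketch. It is not the Perl interface of sub _normalize(); the function name normalize_text and the parameter name return_decoded are hypothetical.

```python
from typing import Union


def normalize_text(
    data: bytes, charset: str, return_decoded: bool = False
) -> Union[str, bytes]:
    """Decode octets from the given charset; return Unicode characters
    if return_decoded is true, otherwise UTF-8 octets (the default,
    which mirrors the compatibility-preserving behaviour)."""
    text = data.decode(charset, "strict")
    return text if return_decoded else text.encode("utf-8")
```

Called without the extra argument, the sketch returns UTF-8 octets, so existing callers would be unaffected; a caller that feeds an HTML parser could request decoded characters instead.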

Attached is my last version of sub _normalize().



Bug 7126: Incorrect character set detections
by normalize_charset - sub _normalize() V2
  Sending lib/Mail/SpamAssassin/Message/Node.pm
Committed revision 1659255.
Comment 9 Mark Martinec 2015-02-23 17:47:17 UTC
This seems to work quite well, with or without normalize_charset enabled.
I don't think this change introduced any user-visible incompatibilities
with 3.4.0.  Closing.