SA Bugzilla – Bug 7126
Incorrect character set detections by normalize_charset
Last modified: 2015-02-23 17:47:17 UTC
Several of our local mail messages are treated by MS::Message::Node::_normalize() / normalize_charset as being written in far-Eastern character sets and are decoded as such, which clearly does not make sense (they are actually in UTF-8, Windows-1252 or ISO-8859-2). To investigate, I put together an alternative reference implementation of _normalize() and compared the results of the two, manually checking the reported differences.

In our case the one-day statistics show that more than 8% of the decisions taken by _normalize() were wrong. The most common differences were:

- decoded as big5 (should be decoded as iso-8859-2)
- decoded as euc-kr (should be decoded as utf-8)
- decoded as euc-jp (should be decoded as utf-8)
- decoded as shift_jis (should be decoded as windows-1252)
- decoded as utf-8 (should be decoded as windows-1252)
- not decoded (should be decoded as gb2312)
- not decoded (should be decoded as gbk)
- not decoded (should be decoded as utf-8)

In my opinion the source of the problem is that the existing _normalize() relies too heavily on Encode::Detect::Detector and the underlying "Mozilla's universal charset detector", instead of trusting the declared character set (in Content-Type) and falling back to guesswork only when the declared character set seems inconsistent with the actual contents of a message part. Relying primarily on guesswork may have made good sense ten years ago, and probably still produces sensible results in the far East (as it errs on the side of far-Eastern character sets), but now that UTF-8 is much more widespread, in my opinion the logic is flawed.
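To illustrate why pure guesswork is ambiguous, here is a small Python sketch (not Perl, and not a reproduction of what Encode::Detect::Detector does internally). It only shows that the same two octets are a valid strict decoding under several of the character sets named above, so a detector has to choose between them on statistical grounds alone.

```python
# The letter "ž" (U+017E) encoded as UTF-8 is the two octets C5 BE.
data = "ž".encode("utf-8")

# Try a strict decode under three candidate character sets.
decodings = {}
for charset in ("utf-8", "euc-kr", "iso-8859-2"):
    try:
        decodings[charset] = data.decode(charset, errors="strict")
    except UnicodeDecodeError:
        decodings[charset] = None  # octets are invalid in this charset

for charset, text in decodings.items():
    print(charset, repr(text))
```

All three decodes succeed: UTF-8 gives the intended letter, EUC-KR gives a single Hangul syllable, and ISO-8859-2 gives two Latin letters. Without the declared charset there is nothing in the octets themselves that marks one reading as the right one.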
Created attachment 5271
The suggested replacement subroutine MS::Message::Node::_normalize()

This implementation is compatible with the existing implementation; the two agree with each other in about 92% of cases. It was implemented with emphasis on:

- trusting a declared character set as long as this assumption is not invalidated by the actual text (a failed attempt at strict decoding),
- avoiding unnecessary decoding (by Encode::decode) where possible,
- avoiding calls to Encode::Detect::Detector unless necessary,
- producing useful debug diagnostics.
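The four points above can be sketched roughly as follows. This is a minimal Python illustration of the decision order only, not the attached Perl subroutine: the function name `normalize`, the `detect` callback (a stand-in for a detector such as Encode::Detect::Detector) and the log messages are all assumptions made for the example.

```python
import logging

log = logging.getLogger("normalize_sketch")  # hypothetical logger name


def normalize(data, declared=None, detect=None):
    """Return (utf8_octets, charset_used); charset_used is None if undecoded.

    `detect` is a hypothetical callback: bytes -> charset name or None.
    """
    # 1. Pure ASCII is already valid UTF-8, so no decoding is needed.
    if data.isascii():
        log.debug("charset: us-ascii, kept unchanged")
        return data, "us-ascii"

    # 2. Trust the declared charset unless a strict decode invalidates it.
    if declared:
        try:
            text = data.decode(declared, errors="strict")
        except LookupError:
            log.debug("charset %r is not recognized or supported", declared)
        except UnicodeDecodeError as err:
            log.debug("declared charset %r invalidated: %s", declared, err)
        else:
            log.debug("decoded as declared charset %r", declared)
            return text.encode("utf-8"), declared

    # 3. Only now fall back to guesswork, and verify the guess strictly too.
    guessed = detect(data) if detect is not None else None
    if guessed:
        try:
            text = data.decode(guessed, errors="strict")
        except (LookupError, UnicodeDecodeError) as err:
            log.debug("guessed charset %r rejected: %s", guessed, err)
        else:
            log.debug("decoded as guessed charset %r", guessed)
            return text.encode("utf-8"), guessed

    log.debug("not decoded, octets kept unchanged")
    return data, None
```

Note that a strict decode can never fail for a single-byte charset in which every octet is assigned, so for those charsets step 2 alone cannot catch a wrong declaration.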
Created attachment 5272
The full proposed patch (includes documentation update, Conf.pm, DependencyInfo.pm)
Bug 7126: Incorrect character set detections by normalize_charset
  Sending lib/Mail/SpamAssassin/Conf.pm
  Sending lib/Mail/SpamAssassin/Message/Node.pm
  Sending lib/Mail/SpamAssassin/Util/DependencyInfo.pm
Committed revision 1655758.
Some small refinements:
  Sending lib/Mail/SpamAssassin/Message/Node.pm
Committed revision 1656048.

- noticed that some undecodable/invalid text is 'almost ASCII' but contains a few characters such as NBSP (non-breaking space) or SHY (soft hyphen), or some Windows-1252 punctuation at codes which are unassigned in ISO-8859; such text is now decoded as Windows-1252;
- improved debugging: report if an encoding is unrecognized or unsupported by the Encode module.
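A rough Python sketch of the 'almost ASCII' check described above. The exact set of octets accepted by the committed Perl code is not shown in this report, so `ALMOST_ASCII_EXTRAS` and the function name are assumptions chosen for illustration: NBSP, SHY, and a handful of common Windows-1252 punctuation codes.

```python
# Assumed set of tolerated 8-bit codes (hypothetical, for illustration only):
#   0xA0 NBSP, 0xAD SHY,
#   0x85 ellipsis, 0x91-0x94 curly quotes, 0x95 bullet, 0x96/0x97 dashes.
ALMOST_ASCII_EXTRAS = frozenset(
    [0xA0, 0xAD, 0x85, 0x91, 0x92, 0x93, 0x94, 0x95, 0x96, 0x97]
)


def decode_almost_ascii(data):
    """Decode as Windows-1252 if every non-ASCII octet is in the assumed set.

    Returns the decoded str, or None if the text does not qualify.
    """
    high_octets = {b for b in data if b >= 0x80}
    if high_octets and high_octets <= ALMOST_ASCII_EXTRAS:
        return data.decode("windows-1252")
    return None


print(repr(decode_almost_ascii(b"It\x92s fine")))  # curly apostrophe at 0x92
print(repr(decode_almost_ascii(b"caf\xe9")))       # 0xE9 is outside the set
```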
Seen a declared character set 'ANSI X3.4-1986' (i.e. US-ASCII) in the wild, interesting.

Bug 7126: refinements: ANSI X3.4-1986, Windows-1252 quotes
  Sending lib/Mail/SpamAssassin/Message/Node.pm
Committed revision 1656447.
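One way to cope with unusual labels such as 'ANSI X3.4-1986' is to canonicalize the declared name before looking it up. The Python sketch below is only an illustration of that idea; the alias table and the function name `canonical_charset` are hypothetical and do not reflect how the committed Perl code handles it.

```python
import re

# Hypothetical alias table: normalized label -> canonical charset name.
CHARSET_ALIASES = {
    "ansi-x3.4-1986": "us-ascii",
    "ansi-x3.4-1968": "us-ascii",
    "ascii": "us-ascii",
}


def canonical_charset(label):
    """Lowercase the label, unify separators, then apply the alias table."""
    key = label.strip().strip("\"'").lower()
    key = re.sub(r"[\s_]+", "-", key)  # spaces and underscores become hyphens
    return CHARSET_ALIASES.get(key, key)


print(canonical_charset("ANSI X3.4-1986"))
```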
Some interesting statistics, collected from 100,000 textual mail parts seen over two working days at our site. A single mail message can be counted as more than one part (e.g. text/plain + text/html in the case of multipart/alternative), so the number of mail messages analyzed is slightly less than half that many.

The debug messages were grepped by ': message: .*charset' and grouped as follows:

  11.1%  true US-ASCII (kept unchanged)
  67.5%  valid UTF-8 as declared in Content-Type (kept unchanged)
   0.2%  valid UTF-8 as detected/guessed (kept unchanged)
  20.8%  decoded (non-UTF-8) as declared in Content-Type
   0.4%  decoded (non-UTF-8) as detected/guessed

The 'decoded' and 'as detected/guessed' groups only occur with the setting:

  normalize_charset 1

(otherwise these would just have been kept as unchanged octets / Mojibake).

Summarizing the above further yields:

  11.1%  true US-ASCII (kept unchanged)
  67.6%  is UTF-8 (kept unchanged)
  21.3%  decoded into UTF-8 (when normalize_charset is enabled)

So 67.6% is natively UTF-8, and 88.9% of textual parts end up as UTF-8 if normalize_charset is enabled. The remaining 11.1% of textual mail parts are just plain ASCII text.

Interestingly (while not directly comparable), our 88.9% UTF-8 figure corresponds closely to the 82.5% in "Usage of character encodings for websites", January 2015: "UTF-8 is used by 82.5% of all the websites whose character encoding we know."
http://w3techs.com/technologies/overview/character_encoding/all
(sometimes a text is declared as ISO-8859-* but is actually UTF-8)

Bug 7126 - some more tweaks at sub _normalize
  Sending lib/Mail/SpamAssassin/Message/Node.pm
Committed revision 1657862.
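The case mentioned above (text declared as ISO-8859-* that is really UTF-8) can be detected cheaply, because genuine ISO-8859 text with 8-bit characters is very rarely also valid UTF-8. The Python sketch below illustrates one plausible check; the function name is hypothetical and the actual heuristic in the committed Perl code is not shown in this report.

```python
def effective_charset(data, declared):
    """Prefer UTF-8 over a declared ISO-8859-* when the octets say so.

    Returns the charset name that should be used for decoding.
    """
    if declared.lower().startswith("iso-8859-") and not data.isascii():
        try:
            data.decode("utf-8", errors="strict")
        except UnicodeDecodeError:
            return declared  # not valid UTF-8, so keep the declaration
        return "utf-8"       # 8-bit octets form valid UTF-8 sequences
    return declared


mislabeled = "žluťoučký".encode("utf-8")
print(effective_charset(mislabeled, "ISO-8859-2"))
```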
Created attachment 5277
The suggested replacement subroutine MS::Message::Node::_normalize() - V2

In view of:

  [Bug 7133] Revisiting Bug 4046 - HTML::Parser: Parsing of undecoded UTF-8 will give garbage when decoding entities

and the HTML::Parser bug:

  https://rt.cpan.org/Public/Bug/Display.html?id=99755

it seems desirable to be able to obtain from sub _normalize either decoded characters (Unicode) or text encoded as UTF-8 octets, so I have generalized the proposed replacement sub _normalize() to provide one or the other, based on an optional parameter. In its absence it defaults to the current behaviour (returns UTF-8 octets), preserving compatibility.

Attached is my latest version of sub _normalize().

Bug 7126: Incorrect character set detections by normalize_charset - sub _normalize() V2
  Sending lib/Mail/SpamAssassin/Message/Node.pm
Committed revision 1659255.
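The characters-versus-octets distinction from the V2 description, shown as a minimal Python sketch. The function and parameter names are hypothetical; the actual name and position of the optional parameter in the Perl sub _normalize() are not given in this report.

```python
def normalize_v2(data, charset, return_decoded=False):
    """Decode `data` from `charset`.

    By default return UTF-8 octets (bytes), matching the earlier behaviour;
    with return_decoded=True return decoded characters (str) instead.
    """
    text = data.decode(charset, errors="strict")
    return text if return_decoded else text.encode("utf-8")


latin1 = b"caf\xe9"
print(type(normalize_v2(latin1, "iso-8859-1")).__name__)
print(type(normalize_v2(latin1, "iso-8859-1", return_decoded=True)).__name__)
```

Keeping octets as the default means existing callers are unaffected, while a caller such as an HTML parser that needs real characters can opt in.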
This seems to work quite well, with or without normalize_charset enabled. I don't think this change introduced any user-visible incompatibilities with 3.4.0. Closing.