SA Bugzilla – Bug 5188
Why does any charset with "Windows" bypass all preferred locales?
Last modified: 2006-12-10 23:10:41 UTC
If I've defined which locales I'm interested in, why is the code written so that any charset with "Windows" in its name bypasses that setting? For example, if I get a Windows-1255 (Hebrew) message, UNWANTED_LANGUAGE doesn't fire.

In Locales.pm:

    sub is_charset_ok_for_locales {
      my ($cs, @locales) = @_;

      $cs = uc $cs;
      $cs =~ s/[^A-Z0-9]//g;
      $cs =~ s/^3D//gs;       # broken by quoted-printable
      $cs =~ s/:.*$//gs;      # trim off multiple charsets, just use 1st

      study $cs;
      #warn "JMD $cs";

      # always OK (the net speaks mostly roman charsets)
      return 1 if ($cs eq 'USASCII');
      return 1 if ($cs =~ /^ISO8859/);
      return 1 if ($cs =~ /^ISO10646/);
      return 1 if ($cs =~ /^UTF/);
      return 1 if ($cs =~ /^UCS/);
      return 1 if ($cs =~ /^CP125/);
      return 1 if ($cs =~ /^WINDOWS/);          # argh, Windows  ######## HERE ########
      return 1 if ($cs eq 'IBM852');
      return 1 if ($cs =~ /^UNICODE11UTF[78]/); # wtf? never heard of it
      return 1 if ($cs eq 'XUNKNOWN');          # added by sendmail when converting to 8bit
      return 1 if ($cs eq 'ISO');               # Magellan, sending as 'charset=iso 8859-15'. grr

      foreach my $locale (@locales) {
        if (!defined($locale) || $locale eq 'C') { $locale = 'en'; }
        $locale =~ s/^([a-z][a-z]).*$/$1/;      # zh_TW... => zh

        my $ok_for_loc = $charsets_for_locale{$locale};
        next if (!defined $ok_for_loc);

        if ($ok_for_loc =~ /(?:^| )\Q${cs}\E(?:$| )/) {
          return 1;
        }
      }

      return 0;
    }
First of all, you are confusing UNWANTED_LANGUAGE with CHARSET_FARAWAY. The former is the rule associated with the TextCat plugin and uses the configuration option ok_languages. The latter looks at the charset and uses ok_locales.

Next, the code for checking the charset explicitly exempts the charsets Windows-*, ISO-8859-*, and CP125* from the locale check because they all use the ANSI Roman alphabet for characters in the range 0x20 to 0x7E. I see in my archives mail with charsets such as Windows-1251 (Cyrillic) that is entirely in English, with no Cyrillic characters at all.
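To make the exemption concrete, here is a minimal sketch that repeats only the normalization steps and three of the "always OK" patterns from the function quoted in the original report. The helper names normalize_charset and has_free_pass are hypothetical and are not part of SpamAssassin; the sketch only illustrates why a declared charset of windows-1255 never reaches the ok_locales loop.

```perl
use strict;
use warnings;

# Hypothetical helper: the same normalization steps as the quoted
# is_charset_ok_for_locales, pulled out for illustration only.
sub normalize_charset {
    my ($cs) = @_;
    $cs = uc $cs;
    $cs =~ s/[^A-Z0-9]//g;   # drop hyphens, quotes, spaces, etc.
    $cs =~ s/^3D//;          # leftover from quoted-printable "=3D"
    $cs =~ s/:.*$//s;        # kept for fidelity; ':' is already gone here
    return $cs;
}

# Hypothetical helper: true when the normalized charset matches one of
# the three prefix patterns discussed in this report.
sub has_free_pass {
    my ($raw) = @_;
    my $cs = normalize_charset($raw);
    return ($cs =~ /^(?:ISO8859|CP125|WINDOWS)/) ? 1 : 0;
}

for my $raw ('windows-1255', 'Windows-1251', 'ISO-8859-8', 'koi8-r') {
    printf "%-14s -> %-12s free pass: %s\n",
        $raw, normalize_charset($raw), has_free_pass($raw) ? 'yes' : 'no';
}
```

Under these patterns the Hebrew and Cyrillic Windows and ISO charsets are accepted before ok_locales is ever consulted, while a charset such as koi8-r falls through to the per-locale lookup.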
Ok so if I have

    ok_locales en th it
    ok_languages en th it

in my preferences and along comes a message in Hebrew using charset Windows-1255, you're saying that the glyphs in Hebrew are acceptable even when I've said I'm only interested in en th it locales?

From: ariel@kini12.com
Date: September 10, 2006 2:17:51 PM CDT
To: robert@elastica.com
Subject: ????? ??? ??????
X-Spam-Dcc: : grub.camros.com 1113; Body=5 Fuz1=5 Fuz2=3
X-Spam-Flag: YES
X-Spam-Checker-Version: SpamAssassin 3.1.1 (2006-03-10) on grub.camros.com
X-Spam-Level: *****
X-Spam-Status: Yes, score=5.7 required=0.6 tests=BAYES_95,FRONTPAGE,
    HTML_90_100,HTML_IMAGE_RATIO_02,HTML_MESSAGE,HTML_TITLE_SUBJ_DIFF,
    MIME_HTML_ONLY,NO_REAL_NAME,UNPARSEABLE_RELAY autolearn=no version=3.1.1
X-Spam-Report:
    * 1.0 NO_REAL_NAME From: does not include a real name
    * 0.0 UNPARSEABLE_RELAY Informational: message has unparseable relay
    *     lines
    * 0.5 HTML_IMAGE_RATIO_02 BODY: HTML has a low ratio of text to image
    *     area
    * 0.1 HTML_90_100 BODY: Message is 90% to 100% HTML
    * 0.0 HTML_MESSAGE BODY: HTML included in message
    * 3.0 BAYES_95 BODY: Bayesian spam probability is 95 to 99%
    *     [score: 0.9667]
    * 0.0 MIME_HTML_ONLY BODY: Message only has text/html MIME parts
    * 0.9 FRONTPAGE RAW: Frontpage used to create the message
    * 0.3 HTML_TITLE_SUBJ_DIFF HTML_TITLE_SUBJ_DIFF
Received: (qmail 10557 invoked from network); 10 Sep 2006 18:17:08 -0000
Received: from (HELO kini12.com) (208.53.131.241)
    by 64.34.193.12 with SMTP; 10 Sep 2006 18:17:08 -0000
Message-Id: <20060910211746.80A7F9EBB64EA933@kini12.com>
Mime-Version: 1.0
Content-Type: text/html; charset="windows-1255"
Content-Transfer-Encoding: quoted-printable
Lines: 124
Created attachment 3753
Message that I forwarded to the list. Hopefully this maintains the glyphs in the Subject in the quoted message.
> You're saying that the glyphs in Hebrew are acceptable even
> when I've said I'm only interested in en th it locales?

Well, I'm saying that the code is written to accept it because of the possibility that the message could contain only Roman alphabet characters.

But I did not close this bug report because the end result doesn't pass the "smell test". I can state the reasoning for the code being written that way, but it doesn't make sense that we cannot detect that a message in Hebrew characters is not English when we are checking the charset.

I want to leave this open for a while for comment from other developers about the possibility of treating the non-Latin Windows-*, ISO-8859-*, and CP125* charsets the same way we do KR* in message bodies, where we trigger the rule only if the majority of the body is high-bit characters. If we do that, we might have to treat headers and body differently when checking the charsets.

By the way, it is clear what we are dealing with, so you don't need to attach another example. For future reference, if you want to submit an example of a message that demonstrates a bug, it is best to attach the message itself here rather than a copy of an inline forwarding of the message, which changes the contents.
Well, the issue for me is: under what circumstances does Windows choose this charset? If it's only when somebody types a glyph, then your concern would be unwarranted. So the issue is to understand whether or not it's possible for that charset header to exist when there are no glyphs present.
> under what circumstances does Windows choose this charset?

If your native language is Hebrew and you communicate by email in Hebrew, you would install the Hebrew version of Windows, which would use that charset. You could still read and write English emails because the charset contains the same characters for 0x20 to 0x7E as the other ANSI/ISO and Windows charsets. So an English email from someone who speaks Hebrew could very well use that charset.

I'm leaning towards having the charset-faraway test for bodies not give a free pass to the non-Latin Windows, ISO, and CP125 charsets, since there is already a test for the majority of characters in the body being high-bit, which will allow through Roman alphabet emails in those charsets. Doing that would require a change to keep the free pass for the charset-faraway-header test.
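A rough sketch of the body-side idea described above, under stated assumptions: the charset has already been normalized as in Locales.pm and found not to match any ok_locales entry, and the body has already been decoded from its transfer encoding. The %NON_LATIN list, the 0.5 threshold, and the names high_bit_ratio and body_looks_faraway are all hypothetical illustrations, not SpamAssassin code.

```perl
use strict;
use warnings;

# Illustrative, incomplete list of non-Latin single-byte charsets in
# normalized form (Cyrillic, Greek, Hebrew, Arabic). An assumption.
my %NON_LATIN = map { $_ => 1 } qw(
    WINDOWS1251 WINDOWS1253 WINDOWS1255 WINDOWS1256
    CP1251 CP1253 CP1255 CP1256
    ISO88595 ISO88596 ISO88597 ISO88598
);

# Fraction of characters in $body in the range 0x80-0xFF.
sub high_bit_ratio {
    my ($body) = @_;
    my $len = length $body;
    return 0 if $len == 0;
    my $high = () = $body =~ /[\x80-\xFF]/g;   # count the matches
    return $high / $len;
}

# Hypothetical body check: flag (return 1) only when the charset is on
# the non-Latin list AND more than half of the body is high-bit.
sub body_looks_faraway {
    my ($cs, $body) = @_;
    return 0 unless $NON_LATIN{$cs};
    return high_bit_ratio($body) > 0.5 ? 1 : 0;
}

print body_looks_faraway('WINDOWS1255', "\xF9\xEC\xE5\xED"), "\n";
print body_looks_faraway('WINDOWS1255', "Hello from Tel Aviv"), "\n";
```

With a check of this shape, an English message sent with a Hebrew charset is left alone, but so is any message whose body contains no high-bit characters at all, whatever charset it declares.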
Every one of the Hebrew spam emails that I have received has charset Windows-1255 in the headers, and the body is all image spam, i.e. there is not one single high-bit character in the body itself.
I just noticed that there is already an open bug for this. I'm closing this bug as a DUPLICATE of bug 4078, and any discussion can continue there.

*** This bug has been marked as a duplicate of 4078 ***