Bug 5188 - Why does any charset with "Windows" bypass all preferred locales?
Status: RESOLVED DUPLICATE of bug 4078
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: spamassassin
Version: 3.1.7
Hardware: Other other
Importance: P5 normal
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
Reported: 2006-11-15 16:43 UTC by robert
Modified: 2006-12-10 23:10 UTC

Attachment: Message that I forwarded to the list. (text/plain), submitted by robert@elastica.com [NoCLA]

Description robert 2006-11-15 16:43:55 UTC
If I've defined which locales I'm interested in, why is the code written to let any charset with 
"Windows" in its name bypass that setting? As a result, when I get a Windows-1255 message (Hebrew), 
UNWANTED_LANGUAGE doesn't fire.

In Locales.pm:

sub is_charset_ok_for_locales {
  my ($cs, @locales) = @_;

  $cs = uc $cs; $cs =~ s/[^A-Z0-9]//g;
  $cs =~ s/^3D//gs;             # broken by quoted-printable
  $cs =~ s/:.*$//gs;            # trim off multiple charsets, just use 1st

  study $cs;
  #warn "JMD $cs";

  # always OK (the net speaks mostly roman charsets)
  return 1 if ($cs eq 'USASCII');
  return 1 if ($cs =~ /^ISO8859/);
  return 1 if ($cs =~ /^ISO10646/);
  return 1 if ($cs =~ /^UTF/);
  return 1 if ($cs =~ /^UCS/);
  return 1 if ($cs =~ /^CP125/);
  return 1 if ($cs =~ /^WINDOWS/);      # argh, Windows           ######## HERE ########
  return 1 if ($cs eq 'IBM852');
  return 1 if ($cs =~ /^UNICODE11UTF[78]/);     # wtf? never heard of it
  return 1 if ($cs eq 'XUNKNOWN'); # added by sendmail when converting to 8bit
  return 1 if ($cs eq 'ISO');   # Magellan, sending as 'charset=iso 8859-15'. grr

  foreach my $locale (@locales) {
    if (!defined($locale) || $locale eq 'C') { $locale = 'en'; }
    $locale =~ s/^([a-z][a-z]).*$/$1/;  # zh_TW... => zh

    my $ok_for_loc = $charsets_for_locale{$locale};
    next if (!defined $ok_for_loc);

    if ($ok_for_loc =~ /(?:^| )\Q${cs}\E(?:$| )/) {
      return 1;
    }
  }

  return 0;
}
Comment 1 Sidney Markowitz 2006-11-16 03:22:15 UTC
First of all, you are confusing UNWANTED_LANGUAGE with CHARSET_FARAWAY. The
former is the rule associated with the TextCat plugin and uses the configuration
option ok_languages. The latter looks at charset, and uses ok_locales.

Next, the code for checking charset explicitly exempts the charsets Windows-*
and ISO-8859-* and CP125* because they all use the same Roman alphabet for
characters in the range 0x20 to 0x7E. In my archives I see mail with charsets
such as Windows-1251 (Cyrillic) that is entirely in English, with no Cyrillic
characters at all.
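
To make the effect of those early returns concrete, here is a minimal sketch that repeats only the normalization lines from the quoted is_charset_ok_for_locales and then tests three of its "always OK" patterns. The helper name normalize_charset is hypothetical and does not exist in Locales.pm.

```perl
use strict;
use warnings;

# Hypothetical helper: repeats only the normalization steps quoted in
# the description; it is not a function in Locales.pm.
sub normalize_charset {
  my ($cs) = @_;
  $cs = uc $cs;
  $cs =~ s/[^A-Z0-9]//g;   # drop hyphens, quotes, spaces, '=' and ':'
  $cs =~ s/^3D//gs;        # broken by quoted-printable
  $cs =~ s/:.*$//gs;       # as quoted; no ':' survives the strip above
  return $cs;
}

foreach my $name ('windows-1255', 'Windows-1251', 'ISO-8859-8', 'koi8-r') {
  my $cs = normalize_charset($name);
  # Three of the "always OK" patterns from the quoted code.
  my $always_ok = ($cs =~ /^(?:ISO8859|CP125|WINDOWS)/) ? 'yes' : 'no';
  print "$name => $cs (always OK: $always_ok)\n";
}
```

With these inputs, windows-1255 normalizes to WINDOWS1255 and matches /^WINDOWS/, so the function returns before ok_locales is ever consulted; koi8-r normalizes to KOI8R and falls through to the per-locale lookup.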
Comment 2 robert 2006-11-16 06:37:39 UTC
Ok so if I have 

ok_locales en th it
ok_languages en th it 

in my preferences

and along comes a message in Hebrew using charset Windows-1255

You're saying that the glyphs in Hebrew are acceptable even when I've 
said I'm only interested in en th it locales?

From: ariel@kini12.com
Date: September 10, 2006 2:17:51 PM CDT
To: robert@elastica.com
Subject: ????? ??? ??????
X-Spam-Dcc: : grub.camros.com 1113; Body=5 Fuz1=5 Fuz2=3
X-Spam-Flag: YES
X-Spam-Checker-Version: SpamAssassin 3.1.1 (2006-03-10) on grub.camros.com
X-Spam-Level: *****
X-Spam-Status: Yes, score=5.7 required=0.6 tests=BAYES_95,FRONTPAGE, 
HTML_90_100,HTML_IMAGE_RATIO_02,HTML_MESSAGE,HTML_TITLE_SUBJ_DIFF, 
MIME_HTML_ONLY,NO_REAL_NAME,UNPARSEABLE_RELAY autolearn=no  version=3.1.1
X-Spam-Report: *  1.0 NO_REAL_NAME From: does not include a real name *  0.0 UNPARSEABLE_RELAY 
Informational: message has unparseable relay *      lines *  0.5 HTML_IMAGE_RATIO_02 BODY: HTML 
has a low ratio of text to image *      area *  0.1 HTML_90_100 BODY: Message is 90% to 100% HTML *  
0.0 HTML_MESSAGE BODY: HTML included in message *  3.0 BAYES_95 BODY: Bayesian spam 
probability is 95 to 99% *      [score: 0.9667] *  0.0 MIME_HTML_ONLY BODY: Message only has text/
html MIME parts *  0.9 FRONTPAGE RAW: Frontpage used to create the message *  0.3 
HTML_TITLE_SUBJ_DIFF HTML_TITLE_SUBJ_DIFF
Received: (qmail 10557 invoked from network); 10 Sep 2006 18:17:08 -0000
Received: from  (HELO kini12.com) (208.53.131.241) by 64.34.193.12 with SMTP; 10 Sep 2006 
18:17:08 -0000
Message-Id: <20060910211746.80A7F9EBB64EA933@kini12.com>
Mime-Version: 1.0
Content-Type: text/html; charset="windows-1255"
Content-Transfer-Encoding: quoted-printable
Lines: 124
Comment 3 robert 2006-11-16 06:53:31 UTC
Created attachment 3753 [details]
Message that I forwarded to the list.

Hopefully this preserves the glyphs in the Subject of the quoted message.
Comment 4 Sidney Markowitz 2006-11-16 10:43:33 UTC
> You're saying that the glyphs in Hebrew are acceptable even
> when I've said I'm only interested in en th it locales?

Well, I'm saying that the code is written to accept it because of the
possibility that the message could contain only Roman alphabet characters.

But I did not close this bug report because the end result doesn't pass the
"smell test". I can state the reasoning for the code being written that way, but
it doesn't make sense to not be able to detect that a message in Hebrew
characters is not English when we are checking the charset.

I want to leave this open for a while for comment from other developers about
the possibility of treating the non-Latin Windows-* and ISO-8859-* and CP125*
charsets the same way we do KR* in message bodies, where we trigger the rule
only if the majority of the body is high-bit characters. If we do that, we might
have to treat headers and body differently when checking the charsets.
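
As a rough illustration of a "majority of the body is high-bit" test, the sketch below counts bytes in the 0x80-0xFF range against all non-whitespace bytes. The function name mostly_high_bit and the one-half threshold are assumptions made for this example; this is not SpamAssassin's actual implementation.

```perl
use strict;
use warnings;

# Hypothetical helper, not SpamAssassin's code: returns 1 when more than
# half of the non-whitespace bytes in $body have the high bit set.
sub mostly_high_bit {
  my ($body) = @_;
  my $total = () = $body =~ /\S/g;           # count non-whitespace bytes
  return 0 if $total == 0;                   # empty body: nothing to judge
  my $high = () = $body =~ /[\x80-\xff]/g;   # count bytes 0x80..0xFF
  return ($high * 2 > $total) ? 1 : 0;
}

print mostly_high_bit("Plain English text"), "\n";
print mostly_high_bit("\xf9\xec\xe5\xed hi"), "\n";
```

A body written only in the 0x20 to 0x7E range scores 0 regardless of the declared charset, which is the property that would let English mail in a non-Latin code page through.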

By the way, it is clear what we are dealing with, so you don't need to
attach another example. For future reference, if you want to submit an example
of a message that demonstrates a bug, it is best to attach the message itself
here rather than a copy of an inline forwarding of the message, which changes
the contents.
Comment 5 robert 2006-11-18 18:06:58 UTC
Well, the issue for me is under what circumstances Windows chooses this charset. If it's only when 
somebody types a glyph, then your concern would be unwarranted. So the issue is to understand 
whether or not it's possible for that charset header to exist when there are no glyphs present.
Comment 6 Sidney Markowitz 2006-11-18 22:03:17 UTC
> under what circumstances does Windows choose this charset?

If your native language is Hebrew and you communicate by email in Hebrew, you
would install the Hebrew version of Windows, which would use that charset. You
could still read and write English emails because the charset contains the same
characters for 0x20 to 0x7E as the other ANSI/ISO and Windows charsets.

So an English email from someone who speaks Hebrew could very well use that charset.

I'm leaning towards having the charset-faraway test for bodies not give a free
pass to the non-Latin Windows, ISO, and CP125 charsets, since there is already a
test for the majority of characters in the body being high-bit which will allow
through Roman alphabet emails in those charsets. Doing that would require a
change to keep the free pass for the charset-faraway-header test.
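
A minimal sketch of that split, under stated assumptions: the function names, the 'header'/'body' argument, and the hand-picked list of non-Latin code pages (Cyrillic, Greek, Hebrew, Arabic, Thai) are all hypothetical and not taken from Locales.pm. Charset names are expected in the normalized form the quoted code produces, e.g. WINDOWS1255.

```perl
use strict;
use warnings;

# Illustrative list only (an assumption for this sketch): Windows/CP
# 1251, 1253, 1255, 1256 and ISO-8859-5, -6, -7, -8, -11.
sub is_non_latin_charset {
  my ($cs) = @_;
  return 1 if $cs =~ /^(?:WINDOWS|CP)125[1356]$/;
  return 1 if $cs =~ /^ISO8859(?:5|6|7|8|11)$/;
  return 0;
}

# Hypothetical policy: headers keep the free pass for the whole family,
# bodies lose it for the non-Latin members.
sub gets_free_pass {
  my ($cs, $where) = @_;   # $where is 'header' or 'body'
  return 0 unless $cs =~ /^(?:ISO8859|CP125|WINDOWS)/;
  return 1 if $where eq 'header';
  return is_non_latin_charset($cs) ? 0 : 1;
}

foreach my $cs (qw(WINDOWS1252 WINDOWS1255 ISO885915 ISO885911)) {
  printf "%-12s header=%d body=%d\n",
    $cs, gets_free_pass($cs, 'header'), gets_free_pass($cs, 'body');
}
```

A result of 0 here would not mean the rule fires; it only means the charset falls through to the per-locale lookup, and for bodies the existing high-bit majority test would still apply.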
Comment 7 robert 2006-11-19 07:18:08 UTC
Every one of the Hebrew spam emails that I have received has charset Windows-1255 in the headers, and 
the body is all image spam, i.e. not one single high-bit char in the body itself.
Comment 8 Sidney Markowitz 2006-12-10 23:10:41 UTC
I just noticed that there is already an open bug for this. I'm closing this bug
as DUPLICATE of bug 4078, and any discussion can continue there.


*** This bug has been marked as a duplicate of 4078 ***