Bug 4352 - Yahoo.com mail false positive for SUBJ_ILLEGAL_CHARS
Summary: Yahoo.com mail false positive for SUBJ_ILLEGAL_CHARS
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules
Version: 3.0.2
Hardware: Other other
Importance: P5 normal
Target Milestone: 3.3.0
Assignee: SpamAssassin Developer Mailing List
URL: http://mail.yahoo.com
Whiteboard:
Keywords:
Duplicates: 5848
Depends on:
Blocks:
 
Reported: 2005-05-20 05:41 UTC by era eriksson
Modified: 2009-06-30 15:33 UTC (History)



Attachments (description; type; status; submitter/CLA status):
Sample 8859-1 message from Yahoo, incorrectly tagged as 7bit us-ascii; text/plain; None; era eriksson [NoCLA]
Rules used to test for improvements; text/plain; None; Bob Menschel [HasCLA]

Description era eriksson 2005-05-20 05:41:30 UTC
It appears that Yahoo mail blindly sends raw 8-bit characters as us-ascii
without any attempt to figure out what encoding or MIME encapsulation to use.

(It's not obvious that this is a "false positive". They're doing the wrong thing
and the rule fires like it should. But they're fairly popular and that may give
them the right to ignore some standards.)

The problem is compounded I believe by the tendency of Firefox to put ISO-8859-1
in forms even when the containing page is UTF8-encoded (like the mail
composition page in Yahoo's web user interface). But even if that were fixed,
the data should be correctly identified when it's not basic 7-bit US-ASCII,
which is still the default for email.
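
As a rough illustration of what "raw 8-bit characters" means here, the following Python sketch lists the bytes in a header value that fall outside 7-bit US-ASCII, and shows how a sender could encode the same text as an RFC 2047 encoded-word instead. The function name and the sample subject are made up for this example, and this is not SpamAssassin's actual test.

```python
from email.header import Header


def raw_8bit_positions(header_value: bytes) -> list[int]:
    """Return the offsets of bytes outside 7-bit US-ASCII (0x80-0xFF)."""
    return [i for i, b in enumerate(header_value) if b >= 0x80]


# Hypothetical Subject text, as a Latin-1 web form might submit it.
text = "Pr\u00f6va det h\u00e4r"
raw_subject = text.encode("iso-8859-1")

# Sent as-is, the header contains unencoded 8-bit bytes.
print(raw_8bit_positions(raw_subject))

# A compliant sender would wrap the text in an RFC 2047 encoded-word,
# which keeps the header itself pure ASCII.
encoded_subject = Header(text, "iso-8859-1").encode()
print(encoded_subject)
```

The encoded form carries its charset with it, so the receiving side does not have to guess.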

Given that SpamAssassin is in fairly widespread use, perhaps Yahoo! could be
persuaded to fix their broken system, so that the rule doesn't need to have an
exemption for their servers.
Comment 1 era eriksson 2005-05-20 05:42:29 UTC
Created attachment 2882
Sample 8859-1 message from Yahoo, incorrectly tagged as 7bit us-ascii
Comment 2 Bob Menschel 2005-05-24 21:06:22 UTC
Statistics for 3.0 indicate this rule hits ham about 0.01% of the time. In my
own corpus, a quick scan indicates I have 19 ham and over 27k spam that hit this
rule. Of that spam, a very significant fraction claims to be from Yahoo, and
some of it probably is (though a quick look at about 100 didn't reveal any). 

I'm thinking that this is a case where we may need to let statistics rule ... if
hams increase in number, they'll drive the scores down. 

What we need, then, is someone with such hams in their corpus to participate in
the 3.1 pre-release mass-check process...
Comment 3 era eriksson 2005-05-24 23:03:10 UTC
Just to reemphasize, if this were Mozilla, I'd tag it as "advocacy" and try to
convince Yahoo to get their act together instead.

In any locale where anything but plain 7-bit ASCII is the norm, you will get
occasional false positives. They're probably relatively rare, because (a) the
"weird" characters are not very frequent in many languages; (b) some people are
still scared of 8-bit and will coerce their own writing into uncomfortable 7-bit
renderings, especially perhaps in the Subject: and other header data, precisely
because some systems go ballistic when there's 8-bit data (even more than 15
years after ISO-8859-1 became a widely supported de facto standard for email in
practice!); and (c) tech savvy people in those locales tend to shun applications
which still don't manage to get this right.

It's surprising that Yahoo! can ignore this issue; they're fairly popular
internationally and it's odd that this lack of standards compliance on their
part doesn't actually cause more trouble.
Comment 4 Bob Menschel 2005-07-16 12:13:26 UTC
Agreed with your last comment. Unfortunately we have no direct influence over
Yahoo. And since the rule is so productive, and since I don't see any way to
"fix" this problem without significantly weakening the rule, closing as WONTFIX.
If anyone has any good ideas how to avoid the ham hits, please reopen.
Comment 5 Justin Mason 2005-07-16 18:17:06 UTC
actually, we could indeed whitelist -- it's only mail from Yahoo! mail, right? 
just turn the rule into a meta something like (__SUBJ_ILLEGAL_CHARS &&
!__RCVD_FROM_YAHOO), where __RCVD_FROM_YAHOO matches yahoo relay IPs.
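
The logic of such a meta rule can be sketched in Python as follows. This only illustrates the boolean combination, not SpamAssassin rule syntax. The helper names are hypothetical, and the Yahoo check is an assumption for this example: it matches a hostname suffix in Received headers, whereas the suggestion above is to match relay IPs.

```python
import re

# Hypothetical stand-in for __RCVD_FROM_YAHOO: matches a yahoo.com hostname
# in a Received header. A real rule would more likely check relay IPs.
YAHOO_RELAY = re.compile(r"\.yahoo\.com\b", re.IGNORECASE)


def subj_illegal_chars(subject: bytes) -> bool:
    """Simplified stand-in for __SUBJ_ILLEGAL_CHARS: any unencoded 8-bit byte."""
    return any(b >= 0x80 for b in subject)


def rcvd_from_yahoo(received_headers: list[str]) -> bool:
    """True if any Received header mentions a yahoo.com relay."""
    return any(YAHOO_RELAY.search(h) for h in received_headers)


def meta_rule(subject: bytes, received_headers: list[str]) -> bool:
    """Fire only when the subject has 8-bit bytes AND the mail is not from Yahoo."""
    return subj_illegal_chars(subject) and not rcvd_from_yahoo(received_headers)


latin1_subject = "Caf\u00e9 i morgon".encode("iso-8859-1")
print(meta_rule(latin1_subject, ["from mx.example.org by mail.example.net"]))
print(meta_rule(latin1_subject, ["from web1.mail.yahoo.com by mail.example.net"]))
```

The exemption only helps if the underlying test and the Yahoo check are evaluated on the same message, so both are kept as sub-conditions of one rule.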
Comment 6 Bob Menschel 2005-07-18 06:48:49 UTC
Subjects that match this rule, where ham is not from Yahoo: 

Subject: IPS-English POLITICS-US: ”Coddling” at GITMO, or Just Humane Treatment?
("smart" quotes)
X-Mailer: IPSCOM-MAIL-SYSTEM
mailing list, no Yahoo headers anywhere.

X-Amavis-Alert: BAD HEADER Non-encoded 8-bit data (char 92 hex) in message
header 'Subject'
        Subject: Offre d\222emploi: Coordon... ^
Subject: [APC Forum] Offre d’emploi: Coordonnateur de l’information APC
X-Mailman-Version: 2.0.6
List-Post: <mailto:apc.forum@lists.apc.org>
No yahoo headers anywhere

Subject: Introducing eDVD 4 – the “add anything to your DVDs” tool!
From: "Sonic Solutions" <sonic@reply.digitalriver.com>
No yahoo headers anywhere

X-Mailer: /capad/tools/lm-runner_mime 0.6
Subject: Oppose Repressive Measures Promoted as “Reform”
From: ACLU Action Network <action@dcaclu.org>
No yahoo headers anywhere

etc.  And, per bug 4484, there appear to be significant numbers of European
non-spam messages that hit this test.

Comment 7 Bob Menschel 2005-07-23 20:01:18 UTC
Created attachment 3034
Rules used to test for improvements

Attached are some alternate rules I put together to try to eliminate ham hits
from Yahoo.  Didn't have any significant impact here (removed one ham), but
testing against other corpora might show better statistics. My stats: 
(First numeric frequencies, followed by percentage frequencies)

OVERALL     SPAM      HAM     S/O    RANK  SCORE  NAME
 297078   139375   157703    0.469   0.00   0.00  (all messages)
  22159    22129       30    0.999   1.00   0.20  SUBJ_ILLEGAL_CHARS3
  22159    22129       30    0.999   1.00   0.60  SUBJ_ILLEGAL_CHARS6
  22159    22129       30    0.999   1.00   0.40  SUBJ_ILLEGAL_CHARS4
  22159    22129       30    0.999   1.00   0.50  SUBJ_ILLEGAL_CHARS5
  22165    22134       31    0.999   0.90   2.88  SUBJ_ILLEGAL_CHARS
  22102    22072       30    0.999   0.80   0.10  SUBJ_ILLEGAL_CHARS2
   7527     7527        0    1.000   0.70   2.42  MSGID_YAHOO_CAPS
    468      468        0    1.000   0.50   0.50  SARE_HELO_YAHOO
  11397    10826      571    0.955   0.00   1.67  FORGED_YAHOO_RCVD

OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
 297078   139375   157703    0.469   0.00    0.00  (all messages)
100.000  46.9153  53.0847    0.469   0.00    0.00  (all messages as %)
  7.459  15.8773   0.0190    0.999   1.00    0.20  SUBJ_ILLEGAL_CHARS3
  7.459  15.8773   0.0190    0.999   1.00    0.60  SUBJ_ILLEGAL_CHARS6
  7.459  15.8773   0.0190    0.999   1.00    0.40  SUBJ_ILLEGAL_CHARS4
  7.459  15.8773   0.0190    0.999   1.00    0.50  SUBJ_ILLEGAL_CHARS5
  7.461  15.8809   0.0197    0.999   0.90    2.88  SUBJ_ILLEGAL_CHARS
  7.440  15.8364   0.0190    0.999   0.80    0.10  SUBJ_ILLEGAL_CHARS2
  2.534   5.4005   0.0000    1.000   0.70    2.42  MSGID_YAHOO_CAPS
  0.158   0.3358   0.0000    1.000   0.50    0.50  SARE_HELO_YAHOO
  3.836   7.7675   0.3621    0.955   0.00    1.67  FORGED_YAHOO_RCVD
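
For readers relating the two tables, here is a small Python sketch that recomputes the percentage columns and S/O from the raw counts. It assumes S/O is derived from the per-class hit rates (spam% / (spam% + ham%)) rather than from raw counts, which is what the figures above appear consistent with. The function name is made up for this example.

```python
def spam_ratio(spam_hits: int, ham_hits: int, total_spam: int, total_ham: int) -> float:
    """S/O computed from per-class hit rates (assumed formula, see lead-in)."""
    spam_rate = spam_hits / total_spam
    ham_rate = ham_hits / total_ham
    return spam_rate / (spam_rate + ham_rate)


# Corpus totals taken from the "(all messages)" row.
TOTAL_SPAM, TOTAL_HAM = 139375, 157703

for name, spam_hits, ham_hits in [
    ("SUBJ_ILLEGAL_CHARS", 22134, 31),
    ("FORGED_YAHOO_RCVD", 10826, 571),
]:
    so = spam_ratio(spam_hits, ham_hits, TOTAL_SPAM, TOTAL_HAM)
    spam_pct = 100 * spam_hits / TOTAL_SPAM
    ham_pct = 100 * ham_hits / TOTAL_HAM
    print(f"{name}: SPAM%={spam_pct:.4f} HAM%={ham_pct:.4f} S/O={so:.3f}")
```

Dividing raw counts instead (10826 / 11397) would give about 0.950 for FORGED_YAHOO_RCVD, which does not match the 0.955 in the table, so the rate-based form seems the better reading.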
Comment 8 Justin Mason 2006-12-12 12:40:20 UTC
moving RFEs and low-priority stuff to 3.3.0 target
Comment 9 Matt Kettler 2008-03-13 03:00:23 UTC
*** Bug 5848 has been marked as a duplicate of this bug. ***
Comment 10 Dilyan Palauzov 2008-03-30 15:20:34 UTC
If the MTA supports the UTF8SMTP extension, as defined in draft-ietf-eai-smtpext (http://tools.ietf.org/html/draft-ietf-eai-smtpext-11), and UTF8 headers, as described in draft-ietf-eai-utf8headers (http://tools.ietf.org/html/draft-ietf-eai-utf8headers-09), then it is correct to send 8-bit characters in mail headers. The exceptions are cases where SpamAssassin knows that the MTA does not advertise the 8BITMIME SMTP extension (RFC 1652), or the client does not make use of it, or the client uses 8BITMIME but the MTA does not offer UTF8SMTP.

Therefore my suggestion is to remove the tests for 8-bit characters in mail headers, or to add an option to spamc or local.cf informing spamd whether 8-bit characters are allowed in headers.
Comment 11 Justin Mason 2008-03-31 01:49:41 UTC
(In reply to comment #10)
> If the MTA supports the UTF8SMTP extension, as defined in
> draft-ietf-eai-smtpext (http://tools.ietf.org/html/draft-ietf-eai-smtpext-11)
> and UTF8 headers, described in draft-ietf-eai-utf8headers
> (http://tools.ietf.org/html/draft-ietf-eai-utf8headers-09), then it is correct
> to send 8 bit characters in mail headers. With the exceptions, where
> spamassassin knows that the MTA does not advertise 8BITMIME SMTP extension (RFC
> 1652), or the client does not make use of it, or the client uses 8BITMIME, but
> the MTA does not offer UTF8SMTP .
> 
> Therefore my suggestion is to remove the tests for 8 bit characters in mail
> headers, or add option to spamc or local.cf informing spamd if 8bit characters
> are allowed in headers.

SA isn't a standards-compliance testing tool -- we just determine whether a rule is good at matching spam, or not.  as those drafts become standards, and are adopted, the ham hitrate will increase, the rule's score will drop, and the rule will eventually become too poor to use -- at which point we'll drop it.

As a matter of interest, are there any ways to tell that a set of RFC-2822 headers use draft-ietf-eai-utf8headers headers, or not?  scanning for UTF-8 chars?
Comment 12 Dilyan Palauzov 2008-03-31 16:52:42 UTC
> As a matter of interest, are there any ways to tell that a set of 
> RFC-2822 headers use draft-ietf-eai-utf8headers headers, or not?  scanning 
> for UTF-8 chars?

draft-ietf-eai-utf8headers can be used, only if the MTA advertises UTF8SMTP. The sending client does not confirm that UTF8SMTP will be used, it is just used. However UTF8SMTP can be used only when 8BITMIME is offered by the MTA and the client declared its usage for the current message (in MAIL FROM: ... BODY=8BITMIME).

Hence, my suggestion is to let spamc accept one more parameter and report to spamd whether the client has used 8BITMIME and the MTA offered UTF8SMTP. In this case, using UTF8 in headers should lead to a lower spam score than in all other cases (all other cases = the mail contains 8-bit headers and the client does not use 8BITMIME, or the server/MTA does not offer UTF8SMTP or 8BITMIME) and than the score currently associated with SUBJ_ILLEGAL_CHARS.
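
The condition could be sketched in Python roughly like this. The class and field names are hypothetical. Neither spamc nor spamd has such a parameter today, so this only shows the decision the extra information would allow.

```python
from dataclasses import dataclass


@dataclass
class SmtpSessionInfo:
    """Hypothetical facts the MTA would pass along about the delivering session."""
    mta_offered_8bitmime: bool
    mta_offered_utf8smtp: bool
    client_declared_8bitmime: bool  # MAIL FROM: ... BODY=8BITMIME


def utf8_headers_plausible(info: SmtpSessionInfo) -> bool:
    """True only when every precondition for UTF8 headers was met on this hop."""
    return (
        info.mta_offered_8bitmime
        and info.mta_offered_utf8smtp
        and info.client_declared_8bitmime
    )


ok = SmtpSessionInfo(True, True, True)
no_utf8smtp = SmtpSessionInfo(True, False, True)
print(utf8_headers_plausible(ok), utf8_headers_plausible(no_utf8smtp))
```

Only the first case would qualify for the reduced score. Every other combination would keep the current SUBJ_ILLEGAL_CHARS behaviour.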
Comment 13 Justin Mason 2008-04-01 01:32:27 UTC
(In reply to comment #12)
> > As a matter of interest, are there any ways to tell that a set of 
> > RFC-2822 headers use draft-ietf-eai-utf8headers headers, or not?  scanning 
> > for UTF-8 chars?
> 
> draft-ietf-eai-utf8headers can be used, only if the MTA advertises UTF8SMTP.
> The sending client does not confirm that UTF8SMTP will be used, it is just
> used. However UTF8SMTP can be used only when 8BITMIME is offered by the MTA and
> the client declared its usage for the current message (in MAIL FROM: ...
> BODY=8BITMIME).
> 
> Hence, my suggestion is to let spamc accept one more parameter and reports to
> spamd if the client has used 8BITMIME and the MTA offered UTF8SMTP . In this
> case using UTF8 in headers shall lead to less spam scores, than in all other
> cases (all other cases = the mail contains 8 bit headers and the client does
> not use 8BITMIME or the server/MTA does offer UTF8SMTP or 8BITMIME) and the
> currently associated scores for SUBJ_ILLEGAL_CHARS .

Unfortunately that will only indicate if the *most recent* MTA->MTA hop was utf-8.  If a previous hop did not use that, then any 8-bit data present may not be valid utf-8, even if the last hop used it successfully.
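
Since the transport flags cannot vouch for earlier hops, one fallback is to inspect the bytes themselves. The Python sketch below checks whether a header that contains 8-bit data decodes cleanly as UTF-8. The function name is made up for this example, and the check is only a heuristic: short Latin-1 strings can happen to be valid UTF-8 as well.

```python
def looks_like_utf8_header(raw: bytes) -> bool:
    """True if the header has 8-bit bytes and all of it decodes as UTF-8."""
    if all(b < 0x80 for b in raw):
        return False  # pure ASCII: nothing to decide
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Byte 0x92 is a Windows-1252 right single quote, as in the amavis alert
# quoted earlier in this bug. On its own it is not valid UTF-8.
cp1252_subject = b"Offre d\x92emploi"
utf8_subject = "Offre d\u2019emploi".encode("utf-8")

print(looks_like_utf8_header(cp1252_subject))
print(looks_like_utf8_header(utf8_subject))
```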
Comment 14 Justin Mason 2009-06-30 15:33:51 UTC
: 319...; svn commit -m "bug 4352: fix SUBJ_ILLEGAL_CHARS to whitelist yahoo.com webmail, which seems common enough to make a special case for their bugs" t.rules rules
Sending        rules/20_head_tests.cf
Adding         t.rules/SUBJ_ILLEGAL_CHARS
Adding         t.rules/SUBJ_ILLEGAL_CHARS/fp-bug4352-att2882
Transmitting file data ..
Committed revision 789992.