SA Bugzilla – Bug 4352
Yahoo.com mail false positive for SUBJ_ILLEGAL_CHARS
Last modified: 2009-06-30 15:33:51 UTC
It appears that Yahoo mail blindly sends raw 8-bit characters as us-ascii without any attempt to figure out what encoding or MIME encapsulation to use. (It's not obvious that this is a "false positive". They're doing the wrong thing and the rule fires like it should. But they're fairly popular and that may give them the right to ignore some standards.) The problem is compounded I believe by the tendency of Firefox to put ISO-8859-1 in forms even when the containing page is UTF8-encoded (like the mail composition page in Yahoo's web user interface). But even if that were fixed, the data should be correctly identified when it's not basic 7-bit US-ASCII, which is still the default for email. Given that SpamAssassin is in fairly widespread use, perhaps Yahoo! could be persuaded to fix their broken system, so that the rule doesn't need to have an exemption for their servers.
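To make the symptom concrete, here is a minimal Python sketch (not SpamAssassin code) contrasting a Subject sent as raw ISO-8859-1 bytes with the same text wrapped in an RFC 2047 encoded-word. The sample subject and the helper name `has_raw_8bit` are invented for illustration; the attached sample message is not reproduced here.

```python
from email.header import Header


def has_raw_8bit(header_value: bytes) -> bool:
    """Return True if the header bytes contain anything outside 7-bit ASCII."""
    return any(byte > 0x7F for byte in header_value)


# Hypothetical subject text; not taken from the attached sample message.
subject_text = "Café à Montréal"

# What the report describes: ISO-8859-1 bytes sent with no charset label at all.
raw_subject = subject_text.encode("iso-8859-1")

# What a compliant sender would emit: an ASCII-only RFC 2047 encoded-word.
encoded_subject = Header(subject_text, charset="iso-8859-1").encode()

print(has_raw_8bit(raw_subject))                      # raw form carries 8-bit bytes
print(has_raw_8bit(encoded_subject.encode("ascii")))  # encoded form is pure ASCII
print(encoded_subject)
```

The encoded form survives any 7-bit transport and tells the receiver which charset to use, which is exactly the information missing from the raw form.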
Created attachment 2882 [details] Sample 8859-1 message from Yahoo, incorrectly tagged as 7bit us-ascii
Statistics for 3.0 indicate this rule hits ham about 0.01% of the time. In my own corpus, a quick scan indicates I have 19 ham and over 27k spam that hit this rule. Of that spam, a very significant fraction claims to be from Yahoo, and some of it probably is (though a quick look at about 100 didn't reveal any). I'm thinking that this is a case where we may need to let statistics rule ... if hams increase in number, they'll drive the scores down. What we need, then, is someone with such hams in their corpus to participate in the 3.1 pre-release mass-check process...
Just to reemphasize, if this were Mozilla, I'd tag it as "advocacy" and try to convince Yahoo to get their act together instead. In any locale where anything but plain 7-bit ASCII is the norm, you will get occasional false positives. They're probably relatively rare, because (a) the "weird" characters are not very frequent in many languages; (b) some people are still scared of 8-bit and will coerce their own writing into uncomfortable 7-bit renderings, especially perhaps in the Subject: and other header data, precisely because some systems go ballistic when there's 8-bit data (even more than 15 years after ISO-8859-1 became a widely supported de facto standard for email in practice!); and (c) tech savvy people in those locales tend to shun applications which still don't manage to get this right. It's surprising that Yahoo! can ignore this issue; they're fairly popular internationally and it's odd that this lack of standards compliance on their part doesn't actually cause more trouble.
Agreed with your last comment. Unfortunately we have no direct influence over Yahoo. And since the rule is so productive, and since I don't see any way to "fix" this problem without significantly weakening the rule, closing as WONTFIX. If anyone has any good ideas how to avoid the ham hits, please reopen.
actually, we could indeed whitelist -- it's only mail from Yahoo! mail, right? just turn the rule into a meta something like (__SUBJ_ILLEGAL_CHARS && !__RCVD_FROM_YAHOO), where __RCVD_FROM_YAHOO matches yahoo relay IPs.
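The logic of that meta rule can be sketched in Python as follows. This is only an illustration of the boolean combination, not the real rule: SpamAssassin's actual SUBJ_ILLEGAL_CHARS test is not reproduced, and since no Yahoo relay IPs are given here, the `YAHOO_RELAY` pattern is a hypothetical stand-in that matches a relay hostname in a Received header instead.

```python
import re

# Hypothetical stand-in for __RCVD_FROM_YAHOO: matches a "Received: from
# <something>.yahoo.com" line rather than real relay IPs (none are listed here).
YAHOO_RELAY = re.compile(rb"^Received:\s+from\s+\S+\.yahoo\.com\b", re.I | re.M)

# Subject header, including any folded continuation lines.
SUBJECT = re.compile(rb"^Subject:(.*(?:\r?\n[ \t].*)*)", re.I | re.M)


def subj_has_8bit(headers: bytes) -> bool:
    """Simplified stand-in for __SUBJ_ILLEGAL_CHARS: any raw 8-bit byte in Subject."""
    match = SUBJECT.search(headers)
    return bool(match) and any(byte > 0x7F for byte in match.group(1))


def rcvd_from_yahoo(headers: bytes) -> bool:
    return bool(YAHOO_RELAY.search(headers))


def meta_subj_illegal_chars(headers: bytes) -> bool:
    """(__SUBJ_ILLEGAL_CHARS && !__RCVD_FROM_YAHOO) as a plain boolean expression."""
    return subj_has_8bit(headers) and not rcvd_from_yahoo(headers)


via_yahoo = b"Received: from web123.mail.yahoo.com\r\nSubject: Caf\xe9\r\n"
via_other = b"Received: from lists.example.org\r\nSubject: Caf\xe9\r\n"
print(meta_subj_illegal_chars(via_yahoo), meta_subj_illegal_chars(via_other))
```

A hostname in a Received header is easy to forge, so a real exemption would need to check a trusted relay, as the comment suggests.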
Subjects that match this rule, where ham is not from Yahoo:

Subject: IPS-English POLITICS-US: ”Coddling” at GITMO, or Just Humane Treatment?
("smart" quotes)
X-Mailer: IPSCOM-MAIL-SYSTEM
mailing list, no Yahoo headers anywhere.

X-Amavis-Alert: BAD HEADER Non-encoded 8-bit data (char 92 hex) in message header 'Subject'
Subject: Offre d\222emploi: Coordon...
                ^
Subject: [APC Forum] Offre d’emploi: Coordonnateur de l’information APC
X-Mailman-Version: 2.0.6
List-Post: <mailto:apc.forum@lists.apc.org>
No yahoo headers anywhere

Subject: Introducing eDVD 4 – the “add anything to your DVDs” tool!
From: "Sonic Solutions" <sonic@reply.digitalriver.com>
No yahoo headers anywhere

X-Mailer: /capad/tools/lm-runner_mime 0.6
Subject: Oppose Repressive Measures Promoted as “Reform”
From: ACLU Action Network <action@dcaclu.org>
No yahoo headers anywhere

etc.

And, per bug 4484, there appear to be significant numbers of European non-spam messages that hit this test.
Created attachment 3034 [details] Rules used to test for improvements

Attached are some alternate rules I put together to try to eliminate ham hits from Yahoo. Didn't have any significant impact here (removed one ham), but testing against other corpora might show better statistics.

My stats: (First numeric frequencies, followed by percentage frequencies)

OVERALL%    SPAM%     HAM%     S/O    RANK   SCORE  NAME
  297078   139375   157703   0.469    0.00    0.00  (all messages)
   22159    22129       30   0.999    1.00    0.20  SUBJ_ILLEGAL_CHARS3
   22159    22129       30   0.999    1.00    0.60  SUBJ_ILLEGAL_CHARS6
   22159    22129       30   0.999    1.00    0.40  SUBJ_ILLEGAL_CHARS4
   22159    22129       30   0.999    1.00    0.50  SUBJ_ILLEGAL_CHARS5
   22165    22134       31   0.999    0.90    2.88  SUBJ_ILLEGAL_CHARS
   22102    22072       30   0.999    0.80    0.10  SUBJ_ILLEGAL_CHARS2
    7527     7527        0   1.000    0.70    2.42  MSGID_YAHOO_CAPS
     468      468        0   1.000    0.50    0.50  SARE_HELO_YAHOO
   11397    10826      571   0.955    0.00    1.67  FORGED_YAHOO_RCVD

OVERALL%    SPAM%     HAM%     S/O    RANK   SCORE  NAME
  297078   139375   157703   0.469    0.00    0.00  (all messages)
 100.000  46.9153  53.0847   0.469    0.00    0.00  (all messages as %)
   7.459  15.8773   0.0190   0.999    1.00    0.20  SUBJ_ILLEGAL_CHARS3
   7.459  15.8773   0.0190   0.999    1.00    0.60  SUBJ_ILLEGAL_CHARS6
   7.459  15.8773   0.0190   0.999    1.00    0.40  SUBJ_ILLEGAL_CHARS4
   7.459  15.8773   0.0190   0.999    1.00    0.50  SUBJ_ILLEGAL_CHARS5
   7.461  15.8809   0.0197   0.999    0.90    2.88  SUBJ_ILLEGAL_CHARS
   7.440  15.8364   0.0190   0.999    0.80    0.10  SUBJ_ILLEGAL_CHARS2
   2.534   5.4005   0.0000   1.000    0.70    2.42  MSGID_YAHOO_CAPS
   0.158   0.3358   0.0000   1.000    0.50    0.50  SARE_HELO_YAHOO
   3.836   7.7675   0.3621   0.955    0.00    1.67  FORGED_YAHOO_RCVD
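For readers unfamiliar with the S/O column: the figures above appear consistent with S/O = SPAM% / (SPAM% + HAM%), that is, a ratio of per-corpus hit rates rather than of raw counts. The short Python sketch below recomputes it from the counts in the first table. The formula is inferred from the numbers shown here, and the function name is made up.

```python
# Corpus sizes from the "(all messages)" row.
TOTAL_SPAM = 139375
TOTAL_HAM = 157703


def s_over_o(spam_hits: int, ham_hits: int) -> float:
    """Spam hit rate divided by the sum of spam and ham hit rates.

    Assumes the rule hit at least one message, so the denominator is non-zero.
    """
    spam_rate = spam_hits / TOTAL_SPAM
    ham_rate = ham_hits / TOTAL_HAM
    return spam_rate / (spam_rate + ham_rate)


# (spam hits, ham hits) taken from the count table above.
rules = {
    "SUBJ_ILLEGAL_CHARS": (22134, 31),
    "MSGID_YAHOO_CAPS": (7527, 0),
    "FORGED_YAHOO_RCVD": (10826, 571),
}

for name, (spam_hits, ham_hits) in rules.items():
    print(f"{name:20s} {s_over_o(spam_hits, ham_hits):.3f}")
```

FORGED_YAHOO_RCVD is the row that distinguishes the two readings: the raw-count ratio 10826 / 11397 rounds to 0.950, whereas the rate-based ratio gives the 0.955 listed in the table.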
moving RFEs and low-priority stuff to 3.3.0 target
*** Bug 5848 has been marked as a duplicate of this bug. ***
If the MTA supports the UTF8SMTP extension, as defined in draft-ietf-eai-smtpext (http://tools.ietf.org/html/draft-ietf-eai-smtpext-11), and UTF8 headers, described in draft-ietf-eai-utf8headers (http://tools.ietf.org/html/draft-ietf-eai-utf8headers-09), then it is correct to send 8-bit characters in mail headers. The exceptions are the cases where SpamAssassin knows that the MTA does not advertise the 8BITMIME SMTP extension (RFC 1652), or the client does not make use of it, or the client uses 8BITMIME but the MTA does not offer UTF8SMTP. Therefore my suggestion is to remove the tests for 8-bit characters in mail headers, or to add an option to spamc or local.cf informing spamd whether 8-bit characters are allowed in headers.
(In reply to comment #10)
> If the MTA supports the UTF8SMTP extension, as defined in
> draft-ietf-eai-smtpext (http://tools.ietf.org/html/draft-ietf-eai-smtpext-11)
> and UTF8 headers, described in draft-ietf-eai-utf8headers
> (http://tools.ietf.org/html/draft-ietf-eai-utf8headers-09), then it is correct
> to send 8 bit characters in mail headers. With the exceptions, where
> spamassassin knows that the MTA does not advertise 8BITMIME SMTP extension (RFC
> 1652), or the client does not make use of it, or the client uses 8BITMIME, but
> the MTA does not offer UTF8SMTP .
>
> Therefore my suggestion is to remove the tests for 8 bit characters in mail
> headers, or add option to spamc or local.cf informing spamd if 8bit characters
> are allowed in headers.

SA isn't a standards-compliance testing tool -- we just determine whether a rule is good at matching spam, or not. As those drafts become standards and are adopted, the ham hit rate will increase, the rule's score will drop, and the rule will eventually become too poor to use -- at which point we'll drop it.

As a matter of interest, are there any ways to tell that a set of RFC-2822 headers use draft-ietf-eai-utf8headers headers, or not? Scanning for UTF-8 chars?
> As a matter of interest, are there any ways to tell that a set of
> RFC-2822 headers use draft-ietf-eai-utf8headers headers, or not? scanning
> for UTF-8 chars?

draft-ietf-eai-utf8headers can be used only if the MTA advertises UTF8SMTP. The sending client does not confirm that UTF8SMTP will be used; it is just used. However, UTF8SMTP can be used only when 8BITMIME is offered by the MTA and the client has declared its use for the current message (in MAIL FROM: ... BODY=8BITMIME).

Hence, my suggestion is to let spamc accept one more parameter and report to spamd whether the client has used 8BITMIME and the MTA offered UTF8SMTP. In that case, UTF-8 in headers should lead to a lower spam score than in all other cases (all other cases = the mail contains 8-bit headers and the client does not use 8BITMIME, or the server/MTA does not offer UTF8SMTP or 8BITMIME) and lower than the score currently associated with SUBJ_ILLEGAL_CHARS.
(In reply to comment #12)
> > As a matter of interest, are there any ways to tell that a set of
> > RFC-2822 headers use draft-ietf-eai-utf8headers headers, or not? scanning
> > for UTF-8 chars?
>
> draft-ietf-eai-utf8headers can be used, only if the MTA advertises UTF8SMTP.
> The sending client does not confirm that UTF8SMTP will be used, it is just
> used. However UTF8SMTP can be used only when 8BITMIME is offered by the MTA and
> the client declared its usage for the current message (in MAIL FROM: ...
> BODY=8BITMIME).
>
> Hence, my suggestion is to let spamc accept one more parameter and reports to
> spamd if the client has used 8BITMIME and the MTA offered UTF8SMTP . In this
> case using UTF8 in headers shall lead to less spam scores, than in all other
> cases (all other cases = the mail contains 8 bit headers and the client does
> not use 8BITMIME or the server/MTA does offer UTF8SMTP or 8BITMIME) and the
> currently associated scores for SUBJ_ILLEGAL_CHARS .

Unfortunately that will only indicate if the *most recent* MTA->MTA hop was utf-8. If a previous hop did not use that, then any 8-bit data present may not be valid utf-8, even if the last hop used it successfully.
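On the "scanning for UTF-8 chars" idea, a content-based check can at least separate 8-bit header data that is well-formed UTF-8 from data that cannot be, whatever the SMTP hops negotiated. The Python sketch below is one possible approach, not something SpamAssassin is stated to do; the function name and the labels it returns are made up for illustration.

```python
def classify_header_bytes(raw: bytes) -> str:
    """Classify raw header bytes as 'ascii', 'utf8', or '8bit-not-utf8'."""
    if all(byte < 0x80 for byte in raw):
        return "ascii"
    try:
        raw.decode("utf-8")  # strict decoding raises on malformed sequences
    except UnicodeDecodeError:
        return "8bit-not-utf8"
    return "utf8"


# 0x92 is the lone 8-bit byte Amavis flagged earlier in this bug; on its own it
# is a UTF-8 continuation byte, so the header cannot be valid UTF-8.
print(classify_header_bytes(b"Offre d\x92emploi"))
print(classify_header_bytes("Offre d’emploi".encode("utf-8")))
print(classify_header_bytes(b"Offre d'emploi"))
```

Well-formed UTF-8 does not prove the sender intended UTF-8, but legacy 8-bit text such as ISO-8859-1 or Windows-1252 rarely happens to decode cleanly, so the check is a usable heuristic rather than a guarantee.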
: 319...; svn commit -m "bug 4352: fix SUBJ_ILLEGAL_CHARS to whitelist yahoo.com webmail, which seems common enough to make a special case for their bugs" t.rules rules
Sending        rules/20_head_tests.cf
Adding         t.rules/SUBJ_ILLEGAL_CHARS
Adding         t.rules/SUBJ_ILLEGAL_CHARS/fp-bug4352-att2882
Transmitting file data ..
Committed revision 789992.