SA Bugzilla – Bug 3576
Not valid ISO codes should be tagged
Last modified: 2005-01-21 02:11:48 UTC
We are receiving a lot of mails with an invalid ISO code in the header, such as "iso-8237-4". It would be nice if such invalid ISO codes could be tacked.
I mean tagged not tacked :)
All valid charsets are listed here: http://www.iana.org/assignments/character-sets I might do something with this later this week.
Created attachment 2098 [details] All valid code pages according to the previously mentioned url (man I love regex ;) ) I attached a list of all available code pages and aliases. I'm writing a rough rule from this, but it would probably need some optimization ;)
Created attachment 2099 [details] rules to catch invalid charsets in content-type, subject and html I've attached a set of rules to do what you asked, but the results (at least on my system) are disappointing (only one spam hit on 85000 messages). At least it does not hit ham.
Wow. 12h to fix a call! We get a lot of such mails. I will test it and give you a response soon.
I've got hundreds of wrong hits. It seems that the tagged mails do not have an entry in the header with "charset=xxxx". How does this rule work? Every hit is an invalid_charset_2 hit. What is rawbody?
Here is one example of a mail which was tagged as invalid_charset_2, but I expected only an invalid_charset_1 hit for this mail:

Received: from pfx2.example.com (sg001168.intranet.example.com [140.100.200.10])
    by sv028081.exchange.example.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2657.72)
    id 3PVCBKYQ; Thu, 8 Jul 2004 13:11:26 +0200
Received: from localhost (localhost [127.0.0.1])
    by pfx2.example.com (example Internal Mail-System) with ESMTP id BC99612ED4
    for <SPAMBuffer@example.com>; Thu, 8 Jul 2004 13:11:27 +0200 (CEST)
Received: by pfx2.example.com (example Internal Mail-System, from userid 501)
    id 6CB3812FBE; Thu, 8 Jul 2004 13:11:27 +0200 (CEST)
Received: from mail.example.com (extern.postfix.example.com [140.100.155.100])
    by pfx2.example.com (example Internal Mail-System) with ESMTP id 5965512EC2
    for <SPAMBuffer@example.com>; Thu, 8 Jul 2004 13:11:27 +0200 (CEST)
Received: by mail.example.com (example Mail-System, from userid 501)
    id 209582641; Thu, 8 Jul 2004 11:11:25 +0000 (UTC)
Received: from localhost by mail4.example.com
    with SpamAssassin (2.63 2004-01-11); Thu, 08 Jul 2004 13:11:24 +0200
From: "Royce Peterson" <celrlrogfht@icq.com>
To: iain.barbour@exampleib.com
Subject: *****SPAM***** this is the best brindisi squalid
Date: Thu, 08 Jul 2004 13:19:15 +0200
Message-Id: <20040708111111.78E95260B@mail.example.com>
X-Spam-DCC: xmailer: mail4 1192; Body=1 Fuz1=1 Fuz2=1
X-Spam-Flag: YES
X-Spam-Checker-Version: SpamAssassin 2.63 (2004-01-11) on mail4.example.com
X-Spam-Level: *********************
X-Spam-Status: Yes, hits=21.8 required=6.3 tests=BAYES_99,
    FORGED_RCVD_NET_HELO,HTML_50_60,HTML_FONT_BIG,HTML_FONT_INVISIBLE,
    HTML_MESSAGE,HTML_TITLE_UNTITLED,INVALID_CHARSET_2,MIME_HTML_ONLY,
    MIME_HTML_ONLY_MULTI,MSGID_FROM_MTA_SHORT,RCVD_IN_BL_SPAMCOP_NET,
    RCVD_IN_DSBL,RCVD_IN_NJABL,RCVD_IN_NJABL_PROXY,RCVD_IN_RFCI,
    RCVD_IN_SORBS,RCVD_IN_SORBS_HTTP,RCVD_IN_SORBS_SOCKS autolearn=spam
    version=2.63
X-Spam-Pyzor: Reported 0 times.
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="----------=_40ED2BDC.CB948519"
X-AntiVirus: checked by AntiVir MailGate (version: 2.0.2-6; AVE: 6.26.0.3; VDF: 6.26.0.19; host: pfx2)

This is a multi-part message in MIME format.

------------=_40ED2BDC.CB948519
Content-Type: text/plain
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

Your Email was identified as SPAM and did not reached the recipient. Please check for the reasons below.

Ihre Email ist als SPAM Mail identifiziert worden und wurde dem Empfaenger nicht zugestellt. Die Begruendung finden Sie weiter unten.

Contact address: SPAM@example.com.

Content preview: Untitled Document Say goodbye to expensive Refills! We are not retreating - we are advancing in another Direction. - General Douglas MacArthur (1880-1964) R_X Warehouse Direct! URI:http://Geraldine.tgoiwe.com/_55d958a932f5b91262baa654773c6a8e/ >>more info<< [...]=20

Content analysis details: (21.8 points, 6.3 required)

0.1 HTML_MESSAGE            BODY: HTML included in message
0.3 HTML_FONT_BIG           BODY: HTML has a big font
5.4 BAYES_99                BODY: Bayesian spam probability is 99 to 100% [score: 1.0000]
0.3 MIME_HTML_ONLY          BODY: Message only has text/html MIME parts
0.4 HTML_TITLE_UNTITLED     BODY: HTML title contains "Untitled"
0.6 HTML_FONT_INVISIBLE     BODY: HTML font color is same as background
0.1 HTML_50_60              BODY: Message is 50% to 60% HTML
1.0 INVALID_CHARSET_2       BODY: INVALID_CHARSET_2
3.0 MSGID_FROM_MTA_SHORT    Message-Id was added by a relay
4.1 FORGED_RCVD_NET_HELO    Host HELO'd using the wrong IP network
1.1 RCVD_IN_SORBS_HTTP      RBL: SORBS: sender is open HTTP proxy server [210.205.152.10 listed in dnsbl.sorbs.net]
0.5 RCVD_IN_NJABL_PROXY     RBL: NJABL: sender is an open proxy [210.205.152.10 listed in dnsbl.njabl.org]
0.1 RCVD_IN_SORBS           RBL: SORBS: sender is listed in SORBS [210.205.152.10 listed in dnsbl.sorbs.net]
0.1 RCVD_IN_NJABL           RBL: Received via a relay in dnsbl.njabl.org [210.205.152.10 listed in dnsbl.njabl.org]
1.2 RCVD_IN_SORBS_SOCKS     RBL: SORBS: sender is open SOCKS proxy server [210.205.152.10 listed in dnsbl.sorbs.net]
0.7 RCVD_IN_DSBL            RBL: Received via a relay in list.dsbl.org [<http://dsbl.org/listing?ip=3D210.205.152.10= >]
1.5 RCVD_IN_BL_SPAMCOP_NET  RBL: Received via a relay in bl.spamcop.net [Blocked - see <http://www.spamcop.net/bl.shtml?210.205.152= .10>]
0.1 RCVD_IN_RFCI            RBL: Sent via a relay in ipwhois.rfc-ignorant= .org [$ has inaccurate or missing WHOIS data at th= e] [RIR]
1.1 MIME_HTML_ONLY_MULTI    Multipart message only has text/html MIME par= ts

The original message was not completely plain text, and may be unsafe to open with some email clients; in particular, it may contain a virus, or confirm that your address can receive spam. If you wish to view it, it may be safer to save it to a file and open it with an editor.

------------=_40ED2BDC.CB948519
Content-Type: message/rfc822; x-spam-type=original
Content-Description: original message before SpamAssassin
Content-Disposition: attachment
Content-Transfer-Encoding: 7bit

Return-Path: <celrlrogfht@icq.com>
Received: from 212.149.48.150 (unknown [210.205.152.10])
    by mail.example.com (example Mail-System) with SMTP id 78E95260B;
    Thu, 8 Jul 2004 13:11:11 +0200 (CEST)
Original-Encoded-Information-Types: multipart/alternative
Language: English
Disclose-Recipients: No
Reply-To: "Royce Peterson" <celrlrogfht@icq.com>
From: "Royce Peterson" <celrlrogfht@icq.com>
To: iain.barbour@exampleib.com
Subject: this is the best brindisi squalid
Date: Thu, 08 Jul 2004 13:19:15 +0200
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="--613821744884439"
Message-Id: <20040708111111.78E95260B@mail.example.com>

----613821744884439
Content-Type: text/html; charset="iso-3846-0"
Content-Transfer-Encoding: 7Bit

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>Untitled Document</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>
<body bgcolor="#0099FF" text="#FFFFFF" link="#FFFFFF">
<p><font color="#FFFF33" size="4" face="Arial, Helvetica, sans-serif"></font> <font face="Arial, Helvetica, sans-serif">Say goodbye to expensive Refills! <br>
<font color="#0099FF">We are not retreating - we are advancing in another Direction. - General Douglas MacArthur (1880-1964) </font></font></p>
<h2><font face="Arial, Helvetica, sans-serif">R_X Warehouse Direct!</font> </h2>
<p><font face="Arial, Helvetica, sans-serif"><a href="http://Geraldine.tgoiwe.com/_55d958a932f5b91262baa654773c6a8e/">>>mo re info<<</a></font></p>
<p><font color="#0099FF" face="Arial, Helvetica, sans-serif">Everyone is a genius at least once a year; a real genius has his original ideas closer together. - Georg Lichtenberg (1742-1799) <br><br>
If you were plowing a field which would you rather use? Two strong oxen or 1024 chickens? - Seymour Cray (1925-1996) father of supercomputing</font></p>
</body>
</html>
----613821744884439--

------------=_40ED2BDC.CB948519--
I'll look into it over the weekend. The rawbody rule hits on, for example, illegal charset codes in the HTML content of a message. You could temporarily disable the second rule until I've taken a look at it. Remember that rules posted to the Bugzilla system might not work completely as expected until it says so. Please also note that it is easier for people working on these bugs if you use the attachment feature to attach sample messages.
Hi, thanks for the answer. I also tried to fix the problem myself, but I'm not a fan of regex :) I also found that my example was not the best: the rule really does match this example. So I will attach some other examples of wrongly tagged mails to this bug report, as you requested. ... and don't forget to do some other, more fun things at the weekend... Greetings Frank
Created attachment 2104 [details] Example for wrong hit 1
Created attachment 2105 [details] Example for wrong hit 2
Created attachment 2106 [details] Example for wrong hit 3
Hmz... I don't see why it hits in those messages. As far as I can see the regex is correct, I might be missing something...
Created attachment 2117 [details] Test Perl program for regex
Hi, regex is not my world... I attached a small test program for regex rules. This program also tells me that something is wrong:

charset="us-ascii"
Matched: |<charset="us-ascii">|
charset="fggfh"
Matched: |<charset="fggfh">|
charset=us-ascii
No match.

As you can see, charset="us-ascii" matches and it shouldn't; charset=us-ascii did not match, and that's OK. Frank
Created attachment 2121 [details] New, optimized incorrect charset catcher This one should do better. Perl was backtracking around the ['"]? to make the rule match. The rule is now written in such a way that this can't happen. I optimized the list along the way.
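The backtracking problem described above can be reproduced in miniature. The sketch below uses Python's `re` module rather than Perl, and a made-up three-entry whitelist (`VALID`) in place of the much longer list in the attachment; both pattern names are hypothetical and neither is the attached rule. It only illustrates why an optional quote placed before a negative lookahead lets the engine "give the quote back" and match a valid charset.

```python
import re

# Hypothetical, heavily shortened whitelist (assumption, not the attached list).
VALID = r"(?:us-ascii|iso-8859-1|utf-8)"

# Naive form: optional quote first, then a negative lookahead for a valid name.
# If the lookahead fails after the quote was consumed, the engine retries with
# the quote NOT consumed, and the lookahead then sees '"us-ascii' and succeeds.
naive = re.compile(r"charset=[\"']?(?!" + VALID + r")", re.I)

# Safer form: the optional quote sits inside the lookahead, so there is
# nothing outside it that backtracking could change.
safer = re.compile(r"charset=(?![\"']?" + VALID + r"\b)", re.I)

for sample in ['charset="us-ascii"', "charset=us-ascii", 'charset="fggfh"']:
    print(sample, bool(naive.search(sample)), bool(safer.search(sample)))
```

With the naive pattern the quoted valid name is reported as a hit while the unquoted one is not, which is the same asymmetry the test program output above shows.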
Created attachment 2122 [details] Stupid last error fixed
Created attachment 2123 [details] This should do it (for real)
Great! That looks much better. I've got 10 correct hits in 2 min. It's very interesting that spammers are not able to use their tools. Or how else would I get something like this?: Content-Type: text/html; charset=%CHARSET :) I will take a look at the tagged mails over the day. It seems that we only get mails of the type SARE_ILL_CS_2, but a lot of them. Frank
Created attachment 2124 [details] Updated to swim around two FP's
I have one FP so far, but with your first version: <META http-equiv=3DContent-Type content=3D"text/html; charset=3Diso-8859-1"> Is that what you have fixed in the newest version?
FP or not FP..... Content-Type: text/plain; charset= This was a "not delivered" message. It had nothing after the =. But is that worth tagging?
Next FP with the newest version: <META HTTP-EQUIV=3D"Content-Type" CONTENT=3D"text/html; charset=3Diso-885= 9-1">
And the next "not delivered" message from another company, with only: Content-Type: text/plain; charset= It seems that this is normal.
Hard job: <META http-equiv=3DContent-Type content=3D"text/html; charset=3Diso-8859-1"= > I think this kind of FP should be fixed before I send you more.... Is the CR at the end of the line a problem?
It seems that spaces are also hit: Content-Type: text/plain; charset= "iso-8859-1"
I'll update the rule to catch this (it is quite easy):
Content-Type: text/plain; charset=
Just add $| after (?!. I'll run some tests on this tonight.
I'm having more trouble with the following:
charset=3Diso-885= 9-1">
I can think of a rule to catch this, but it would be VERY ugly. I might just check if the line doesn't end with a '=' sign.
<META http-equiv=3DContent-Type content=3D"text/html; charset=3Diso-8859-1"= >
should already be fixed; if not, please forward me the whole message privately.
Adding whitespace shouldn't be too hard. I'll have a look.
One of these two lines from a single mail was tagged: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Type: text/x-vcard; charset=utf8;
Subject: Re: Not valid ISO codes should be tagged
it is interesting BTW. I think it must be deliberate for some reason. --j.
All the HTML mail FPs were sent as attachments to a new mail. So that is the reason for the CR in the mail body. I think the problem is the header of an attached mail. What happens when I write something like charset="1234" in the text of a mail?
The major problem seems to be the CR. This mail was not an attachment: <META HTTP-EQUIV=3D"Content-Type" CONTENT=3D"text/html; = charset=3DISO-8859-1">
OMG that rule is ugly! It's best redone as a plugin or eval test. You don't need an RE to get the charset out (just call the internal functions to parse Content-Type), and the RE can probably be made into something more efficient like a table lookup.
I'm all in for it, but haven't got the experience in perl to get this done. It would need to check both the content-type header and the HTML tags that can contain a charset.
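As a rough illustration of the "parse Content-Type, then do a table lookup" idea, here is a minimal sketch in Python using the standard `email` package. It is not SpamAssassin code and does not use its internal functions; `KNOWN_CHARSETS` is a made-up, shortened table and `unknown_charsets` is a hypothetical helper name. It covers only the Content-Type headers of the MIME parts, not charsets declared in HTML tags.

```python
from email import message_from_string

# Hypothetical, shortened table (assumption); a real one would be generated
# from the IANA character-sets list plus whatever aliases are tolerated.
KNOWN_CHARSETS = {"us-ascii", "iso-8859-1", "iso-8859-15", "utf-8", "windows-1252"}


def unknown_charsets(raw_message):
    """Return charset labels from MIME part headers that are not in the table."""
    msg = message_from_string(raw_message)
    bad = []
    for part in msg.walk():
        charset = part.get_param("charset")  # None if there is no such parameter
        # Skip missing, empty, or RFC 2231 encoded (tuple) values in this sketch.
        if not isinstance(charset, str) or not charset.strip():
            continue
        if charset.strip().lower() not in KNOWN_CHARSETS:
            bad.append(charset)
    return bad


SAMPLE = (
    "MIME-Version: 1.0\n"
    'Content-Type: text/html; charset="iso-3846-0"\n'
    "\n"
    "<html><body>hello</body></html>\n"
)
print(unknown_charsets(SAMPLE))
```

Because the header parser already strips quotes and the lookup is an exact set membership test, none of the quote or backtracking handling from the regex version is needed here.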
Created attachment 2125 [details] Updated to swim around the other FP's that were logged. New rule results are below. The one FP is from someone setting the charset to "ansi", which is not a valid value. But since it is a Microsoft Tech Newsletter, I might add it anyway.

OVERALL    SPAM     HAM     S/O    RANK   SCORE  NAME
  35727   12376   23351   0.346    0.00    0.00  (all messages)
    665     664       1   0.999    1.00    0.50  SARE_ILL_CS_2
      1       1       0   1.000    0.00    0.50  SARE_ILL_CS_1

OVERALL%    SPAM%     HAM%     S/O    RANK   SCORE  NAME
   35727    12376    23351   0.346    0.00    0.00  (all messages)
 100.000  34.6405  65.3595   0.346    0.00    0.00  (all messages as %)
   1.861   5.3652   0.0043   0.999    1.00    0.50  SARE_ILL_CS_2
   0.003   0.0081   0.0000   1.000    0.00    0.50  SARE_ILL_CS_1

I've disabled the subject, from and to lines because they're not catching anything.
Morning :) Like every day, it looks much better. No FP so far, but a lot of hits:

out.26275: charset="iso-9999-9"
out.26287: charset="iso-3808-1"
out.26299: charset=3D"iso-2D52-3"
out.26311: charset="iso-9680-8"
out.26323: charset="iso-4458-8"
out.26362: charset="iso--"
out.26395: charset="iso-5833-9"
out.26407: charset="iso-5305-3"
out.26419: charset="iso-9089-6"
out.26431: charset="iso-9976-8"

I will see by the afternoon whether it's OK now.
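For readers unfamiliar with the S/O column, the figures for SARE_ILL_CS_2 can be reproduced from the raw counts. The sketch below assumes S/O for a rule is the spam hit rate divided by the sum of the spam and ham hit rates (each rate taken within its own corpus); that assumption is not stated in this report, but it reproduces the 0.999 shown, whereas the plain ratio 664/665 would print as 0.998. The function name `so_ratio` is made up for illustration.

```python
def so_ratio(spam_hits, ham_hits, spam_total, ham_total):
    """S/O under the stated assumption: per-corpus hit rates, then spam / (spam + ham)."""
    spam_rate = spam_hits / spam_total
    ham_rate = ham_hits / ham_total
    return spam_rate / (spam_rate + ham_rate)


# Counts taken from the results above: 12376 spam and 23351 ham messages.
cs2 = so_ratio(664, 1, 12376, 23351)
cs1 = so_ratio(1, 0, 12376, 23351)

print(f"SPAM% {100 * 664 / 12376:.4f}  HAM% {100 * 1 / 23351:.4f}")
print(f"SARE_ILL_CS_2 S/O {cs2:.3f}")
print(f"SARE_ILL_CS_1 S/O {cs1:.3f}")
```

Normalising by corpus size matters here because the ham corpus is almost twice as large as the spam corpus, so a single ham hit weighs less than a single spam hit.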
200 hits and no FP until now...
Now I have the first wrong hit. It was a very colorful joke mail: Content-Type: text/plain; charset=iso-8859-1 Content-Type: text/html; charset=iso-8859-1 <META http-equiv=3DContent-Type content=3D"text/html; charset=3Diso-8859-1"= > and I think the last line was the problem...
That seems to be backtracking-related as well... I think I know how to catch this bugger. I'm testing again tonight. The change would be: "(?:3d?)?" -> "(?:3d?)?(?!3d)"
It seems that everything is allowed: <META http-equiv=3DContent-Type content=3D"text/html; charset=3DISO-8859-1">
Here I'm not sure whether this was a spam mail or not: Content-type: text/plain; charset=x-euc-jp Is that a valid ISO code?
This was not spam, but I think the ISO code is not valid: Content-Type: text/html; charset=Cp1252
Oh, the next mail with that: Content-Type: text/plain; charset=Cp1252 It seems that this is a valid code.
One of them matched: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Type: text/x-vcard; charset=utf8;
wrong hit: <META http-equiv=3DContent-Type content=3D"text/html; charset=3Dunicode">
What is wrong here? Content-Type: text/html; charset="ISO-8859-1" Content-Type: text/plain; charset="ISO-8859-1"
This also happens sometimes in newsletters, but I think it's OK to leave them tagged: Content-Type: text/plain; charset=ISO8859-1
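Several of the labels reported above (utf8, Cp1252, ISO8859-1) are not spelled the way the IANA list spells them, yet legitimate software emits them. As a point of comparison only, the sketch below asks Python's own codec registry whether it accepts a label; this is a different and more lenient table than the IANA list, `python_knows` is a hypothetical helper, and no claim is made here about labels such as x-euc-jp, unicode or windows-874.

```python
import codecs


def python_knows(label):
    """True if Python's codec registry resolves this charset label."""
    try:
        codecs.lookup(label)
        return True
    except LookupError:
        return False


for label in ["utf8", "Cp1252", "ISO8859-1", "iso-3846-0", "iso-9999-9"]:
    print(label, python_knows(label))
```

A strict rule built only from the registered names will therefore tag some legitimate mail; whether that is acceptable is the policy question raised in the comments above.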
Created attachment 2134 [details] Another update to fix the =3d issues
I may be being dumb here, but I thought SpamAssassin 3 was capable of looking at the message body after content-transfer-encodings had been dealt with? It seems to me that a rule looking at body text really shouldn't have to explicitly know about '=' being encoded in quoted-printable as '=3D'. Surely there must be a better way?
You mean something like: body SARE_ILL_CS Content-transfer =~ /ansi|unicode|437|8(?:5[/i That would be helpful.
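To make the point concrete: once quoted-printable is decoded, both the =3D escapes and the soft line breaks (a trailing '=' at the end of a line) disappear, and a plain pattern is enough. The sketch below uses Python's standard `quopri` module on a line modelled on one of the false positives reported earlier; it only illustrates decoding before matching and says nothing about how SpamAssassin itself exposes decoded bodies to rules.

```python
import quopri
import re

# Raw quoted-printable text: '=3D' encodes '=', and '=' at end of line is a
# soft line break that splits "iso-8859-1" across two lines.
raw = (
    b'<META HTTP-EQUIV=3D"Content-Type" CONTENT=3D"text/html; charset=3Diso-885=\n'
    b'9-1">\n'
)

decoded = quopri.decodestring(raw).decode("ascii")
match = re.search(r"charset=[\"']?([\w.:-]+)", decoded, re.I)

print(decoded.strip())
print(match.group(1))
```

After decoding, the charset label is contiguous again, so the rule no longer needs special cases for '3D' or for a line ending in '='.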
It seems that one valid code is missing from the list: iso-8851-15
Another valid code which is missing: windows-874
There are a few hits I can't understand, from a single mail: Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset="ISO-8859-1" Content-Disposition: inline Content-Length: 8217 ... Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="ISO-8859-1" Content-Disposition: inline Content-Length: 1585
Valid ISO code which is missing: Content-type: text/plain; charset=x-euc-jp
Valid code missing: Content-Type: text/plain; charset=Cp1252
Apart from some missing charsets... is this rule working OK now?
Subject: Out of Office AutoReply: Not valid ISO codes should be tagged 19.07.04 - 03.08.04 In case of problems or questions, please send a mail to mailservices@commerzbank.com.
lol! Outlook meeting its usual quality standards there ;)
Outlook was here before me :) It's hard work to switch 70,000 mailboxes to another system... but the backbone is Linux, and that's my part. I can't see any other problem at the moment. But it would be nice to have a version with all valid codes, because I get a copy of every tagged mail, and there are too many mails with the missing valid codes for me to check them all. Can you include the missing codes? Then I will take a look at it for some more days. Frank
Hi, is there a chance that this problem will be solved soon? Frank
If I already did this...
... then I can close it as FIXED in HEAD.

OVERALL%    SPAM%     HAM%     S/O    RANK   SCORE  NAME
  394900   318772    76128   0.807    0.00    0.00  (all messages)
 100.000  80.7222  19.2778   0.807    0.00    0.00  (all messages as %)
   5.810   7.1973   0.0000   1.000    0.95    1.00  MIME_BAD_ISO_CHARSET