SA Bugzilla – Bug 7215
Towards supporting IDNA (Internationalizing Domain Names in Applications)
Last modified: 2022-03-06 12:48:00 UTC
Opening this ticket to coordinate our efforts towards supporting Internationalized Domain Names (which is also coupled with better use of the Unicode features of Perl).

As Kevin's plan is to play with this during the summer, I'm attaching my current work in this area to avoid duplicating work. None of this is set in stone yet, so it is open to reshuffling of code or ditching/replacing/reorganizing it altogether. The main idea is to provide some tools and example code.

It makes use of the perl module Net::LibIDN, and issues a warning if this module is not available (and then the feature is off). It should be compatible with existing code. It might even work with perl 5.8.9, although 5.12 or later would be a better choice for its much improved Unicode support.

I have been running this changed code (in SA trunk (4.0), Perl 5.20 and 5.22) for the last two months: it solves my immediate problem of turning U-labels (in Unicode URIs) into ASCII Compatible Encoding (ACE) for the purpose of URI lookups against black/white-lists. Not perfect, but better than nothing.

The main problem there is that the text parser (or HTML parser) does a poor job of extracting Unicode URIs from text, e.g. it has no notion of Unicode whitespace or of the set of characters allowed in U-labels. Instead of the more complex task of fixing the text parser, as a stop-gap solution I added some sanitation code for extracted URIs: trimming prefix and suffix characters that cannot appear in a valid Unicode URI. This sanitation code would eventually be removed once the parser is improved.

The general-purpose subroutines provided are:
- MS::Util::idn_to_ascii
- MS::Util::is_valid_utf_8

and the three user-defined character classes:
- InIDNAWhitespace,
- InIDNAFullStop,
- InIDNA2008
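For illustration only, here is a minimal sketch of what a UTF-8 validity check and a user-defined character class could look like. It is not the code from the attachment: the names is_valid_utf_8_sketch and normalize_dots_sketch are hypothetical, and the content of InIDNAFullStop shown here (the four label separators listed in RFC 3490, section 3.1) is an assumption about what the real class contains.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for is_valid_utf_8(): true if the octet string
# decodes as strict UTF-8. The attached patch may do this differently.
sub is_valid_utf_8_sketch {
  my ($octets) = @_;
  my $copy = $octets;   # decode() may modify its argument when CHECK is set
  return eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# User-defined property: one hexadecimal code point per line.
# Assumed content: U+002E, U+3002, U+FF0E, U+FF61 (RFC 3490, section 3.1).
sub InIDNAFullStop {
  return "002E\n3002\nFF0E\nFF61\n";
}

# Replace every IDNA full stop with an ASCII dot.
# Expects a character string (already decoded), not raw octets.
sub normalize_dots_sketch {
  my ($str) = @_;
  $str =~ s/\p{InIDNAFullStop}/./g;
  return $str;
}
```

A call such as normalize_dots_sketch("www\x{3002}example\x{FF0E}com") turns both ideographic and fullwidth full stops into plain dots before the name is split into labels.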
Created attachment 5313 [details] Provides idn_to_ascii() and is_valid_utf_8(), and some char classes
Some examples from our logs:

util: idn_to_ascii: converted to ACE (0): /www.tretjičlen.si/ -> /www.xn--tretjilen-qfb.si/
util: idn_to_ascii: converted to ACE (0): /www.fenster-türen-technik.de/ -> /www.xn--fenster-tren-technik-xec.de/
idn_to_ascii: converted to ACE (0): /www.zdš.si/ -> /www.xn--zd-mta.si/
util: idn_to_ascii: extracted: /www.ichc2016.com’’/ -> /www.ichc2016.com/
util: idn_to_ascii: extracted: /www.incose.org)会员/ -> /www.incose.org/
Created attachment 5322 [details] Introduce Util::idn_to_ascii and make use of it

Here is a focused variant of the previously attached patch, which provides Util::idn_to_ascii (which calls Net::LibIDN if it is available and is needed), and makes use of it where necessary (i.e. wherever a DNS query is being assembled).

The following feature, mentioned in the opening posting, is _not_ included in this patch; it should be implemented elsewhere:

| Instead of the more complex task of fixing
| the text parser, as a stop-gap solution I added some sanitation
| code for extracted URIs: trimming prefix and suffix characters
| that cannot appear in a valid Unicode URI. This sanitation code
| would eventually be removed when a parser is improved.

Similarly, the user-defined character classes InIDNAWhitespace, InIDNAFullStop, and InIDNA2008 are _not_ included in this patch, so that it can remain clean and focused on the single task of introducing idn_to_ascii() and is_valid_utf_8().

trunk:
  Sending lib/Mail/SpamAssassin/AsyncLoop.pm
  Sending lib/Mail/SpamAssassin/Dns.pm
  Sending lib/Mail/SpamAssassin/DnsResolver.pm
  Sending lib/Mail/SpamAssassin/Plugin/AskDNS.pm
  Sending lib/Mail/SpamAssassin/Plugin/HeaderEval.pm
  Sending lib/Mail/SpamAssassin/Plugin/URIDNSBL.pm
  Sending lib/Mail/SpamAssassin/Util.pm
Committed revision 1694252.

Please yell if this is for some reason unacceptable :)
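As a rough illustration of the "call Net::LibIDN if it is available and is needed" idea, a wrapper could be sketched as below. This is not the committed Util::idn_to_ascii: the name idn_to_ascii_sketch and the choice to pass the input through unchanged on failure are assumptions made for the example. The only external call used is Net::LibIDN::idn_to_ascii($hostname, $charset), which returns undef on error.

```perl
use strict;
use warnings;
use Encode ();

# Net::LibIDN is optional: remember whether it could be loaded.
my $have_libidn = eval { require Net::LibIDN; 1 } ? 1 : 0;
my $warned = 0;

# Hypothetical sketch. Takes a domain name as a Perl character string
# and returns an ASCII (ACE) name where a conversion is possible.
sub idn_to_ascii_sketch {
  my ($name) = @_;
  # Nothing to do for undef or pure-ASCII names.
  return $name if !defined $name || $name !~ /[^\x00-\x7F]/;
  if (!$have_libidn) {
    # Feature is off: warn once, then pass the name through unchanged.
    warn "module Net::LibIDN not available, IDN conversion is off\n"
      unless $warned++;
    return $name;
  }
  my $octets = Encode::encode('UTF-8', $name);
  my $ace = Net::LibIDN::idn_to_ascii($octets, 'UTF-8');
  return defined $ace ? $ace : $name;   # keep the input if conversion fails
}
```

With this shape, callers that assemble DNS queries can run every name through the wrapper unconditionally: plain ASCII names take the early return and never touch the library.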
I see that Jenkins is not happy: the t/idn_dots.t test failed, although it does pass here (perl 5.22). Seems like the version of perl on Jenkins is 5.8.6 (i.e. 11 years old:) . Will look for a workaround...
I tested locally with 5.8.6 and reproduced the failure. Testing now with 5.17.8.
Got it:

trunk: Override the silly global "use bytes", breaks Unicode handling
  Sending lib/Mail/SpamAssassin/Util.pm
Committed revision 1694272.

This common idiom 'use bytes' will keep biting us on the way to better Unicode support. We should eventually get rid of it throughout.
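To make the effect concrete, here is a small sketch (not the committed change) showing how a lexically scoped "no bytes" restores character semantics inside code that otherwise runs under "use bytes". The sub name lengths_sketch is made up for the example.

```perl
use strict;
use warnings;

# Returns (length in octets, length in characters) of a string.
sub lengths_sketch {
  my ($s) = @_;
  use bytes;                     # stands in for a module-wide "use bytes"
  my $octets = length $s;        # counts octets of the internal UTF-8 form
  my $chars;
  { no bytes; $chars = length $s; }   # local override: counts characters
  return ($octets, $chars);
}

# The label "zdš" as a character string; š is U+0161, two octets in UTF-8.
my ($octets, $chars) = lengths_sketch("zd\x{161}");
```

For this label the two counts differ (4 octets, 3 characters), which is exactly the kind of mismatch that breaks code operating on Unicode labels under a global "use bytes".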
Failed on 5.14.8; likely your use bytes change will fix it, though.

NOTE: For IDN work, I think we will want to require a newer perl, possibly using perlbrew or something similar for SA to install its own perl installation so we aren't distro dependent. Thoughts?
> NOTE: For IDN work, I think will want to require a newer perl and possibly
> using perlbrew or something similar for SA to install its own perl
> installation so we aren't distro dependent. Thoughts?

The oldest 5.8 in perlbrew is 5.8.9; I tried it and it works.

I didn't notice the error because I did have Net::LibIDN installed, which did the dots replacement instead of our explicit code, thus masking the problem in the idn_dots.t test.

If we can afford to have multiple versions of perl installed and running under Jenkins, perlbrew is probably the most straightforward option. Dependency modules would need to be installed (e.g. by cpanm) into each brew instance.
My expectation, as we use more UTF8 (or 16) in the internal guts of SA, is that perl 5.14 becomes the baseline install; from my memory at least, that is where it works well.

I want to set a good expectation that a newer perl is needed for SA 4.X, unless you think I'm just being elitist.
> My expectation with using more UTF8 (or 16) as the internal guts for SA is
> that perl 5.14 becomes a baseline install where that works well from my
> memory at least.
> I want to set a good expectation that newer perl is needed for SA 4.X unless
> you think I'm just being elitist.

I agree that 5.14 or maybe 5.12 is the baseline for more serious Unicode support. Although we may do it in steps: bump up the minimal version one notch with each specific problem we encounter during development. With basic Unicode support (and even the user-defined character classes) it seems 5.8.9 is still able to cope somehow.

> unless you think I'm just being elitist.

Elitist? We are running 5.22 and 5.20 on our servers here :)) Have to dig deep to find something running 5.16, let alone 5.14 around here.
(In reply to Kevin A. McGrail from comment #7)
> Failed on 5.14.8 likely your use bytes change will fix it though.
>
> NOTE: For IDN work, I think will want to require a newer perl and possibly
> using perlbrew or something similar for SA to install its own perl
> installation so we aren't distro dependent. Thoughts?

There are precompiled problems everywhere, but I keep away from them :=) Gentoo and FreeBSD work for me, and google kills the rest of my day now :/
(In reply to Mark Martinec from comment #10)
> > My expectation with using more UTF8 (or 16) as the internal guts for SA is
> > that perl 5.14 becomes a baseline install where that works well from my
> > memory at least.
> > I want to set a good expectation that newer perl is needed for SA 4.X unless
> > you think I'm just being elitist.
>
> I agree that 5.14 or maybe 5.12 is the baseline for more serious
> Unicode support. Although we may do it in steps: bump up the minimal
> version one notch with each specific problem we encounter during
> development. With basic Unicode support (and even the user-defined
> character classes) it seems the 5.8.9 is still able to cope somehow.
>
> > unless you think I'm just being elitist.
>
> Elitist? We are running 5.22 and 5.20 on our servers here :))
> Have to dig deep to find something running 5.16, let alone 5.14
> around here.

Hah. Well, Jenkins is stable and the 5.8.6 tests worked, though a bit noisy with "INFO: module Net::LibIDN not available" warnings.

For 4.X, we need good, stable UTF support, which in my experience means using 5.14+.

That said, RHEL/CentOS 5, which isn't EOL until 2017, ships with 5.8.8. RHEL/CentOS 6 ships with 5.10.1. And RHEL/CentOS 7 ships with 5.16.3.

I could examine other distros, but I think we can require 5.14.8+ for 4.X, especially since I'm expecting distros won't include 4.X except on the next major release.

To make things easier for those with older perls, we can document and even provide some automation to use the system/distro perl to bootstrap a newer version of perl dedicated to SA with something like perlbrew. It will require compilation tools/libraries, but effectively it's like installing your own JRE for a specific product.

See https://github.com/gugod/App-perlbrew under the Synopsis to see just how easily people can add alternate perl versions to their system.

Any arguments against changing trunk INSTALL to require 5.14.8, as well as the PACKAGING, Makefile.PL and UPGRADE files?

I can also look at making Makefile.PL bootstrap with system perl and download perlbrew to make a newer perl available.
> Hah. Well jenkins is stable and 5.8.6 tests worked though a bit noisy with
> INFO: module Net::LibIDN not available warnings.

Can turn that warn() into info() for comfort, at least temporarily.

> For 4.X, we need good, stable UTF support which in my experience means using
> 5.14+.
> I could examine other distros but I think we can require 5.14.8+ for 4.X
> especially since I'm expecting distros won't include 4.X except on the next
> major release.
> Any arguments against changing trunk INSTALL to require 5.14.8 as well as
> the PACKAGING, Makefile.PL and UPGRADE files? I can also look at making
> Makefile.PL bootstrap with system perl and download perlbrew to make a newer
> perl available.

I think it's way too early to set the minimal version of perl as a firm requirement. We haven't even started the more intricate work; it's still a long way to 4.0. It's fine to make 5.14 a recommended minimal version, but I'd prefer to enforce it in Makefile.PL much closer to a 4.0 release.

For starters I'd be happy to have 5.10 as a firm requirement, which will enable the use of:

- the possessive quantifier syntactic sugar ( "?+", "*+", "++", "{min,max}+" ), particularly useful in rules;
- the defined-or operator ( // ), which can cut down the clutter in code somewhat ( $a // $b is equivalent to: defined $a ? $a : $b, similarly: $c //= $d instead of: $c = $d unless defined $c ).

Also, the Digest::SHA module became a core module, so we can get rid of Digest::SHA1 as a backward compatibility fallback.

Also in 5.10:

(perl5100delta)
  The regular expression engine is no longer recursive, meaning that patterns that used to overflow the stack will either die with useful explanations, or run to completion.

And:

(perl5100delta)
  Alternations, where possible, are optimised into more efficient matching structures. String literal alternations are merged into a trie and are matched simultaneously. This means that instead of O(N) time for matching N alternations at a given point, the new code performs in O(1) time. [...]
  Note: Much code exists that works around perl's historic poor performance on alternations. Often the tricks used to do so will disable the new optimisations.
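The two 5.10 language features mentioned above can be tried out with a few lines. This is just a sketch; the sub names and the default values (53, 5) are made up for the example.

```perl
use strict;
use warnings;
use 5.010;

# Defined-or: fall back only when the left side is undef,
# so legitimate false values such as 0 or '' are kept.
sub port_or_default {
  my ($port) = @_;
  return $port // 53;
}

# Defined-or assignment: fill in a missing option.
sub with_defaults {
  my (%opt) = @_;
  $opt{timeout} //= 5;
  return \%opt;
}

# Possessive quantifier: \d++ never gives back the digits it consumed,
# so there is nothing left for the final \d and the match fails at once.
sub possessive_matches { return $_[0] =~ /^\d++\d$/ ? 1 : 0 }

# The greedy form backtracks by one digit and succeeds.
sub greedy_matches { return $_[0] =~ /^\d+\d$/ ? 1 : 0 }
```

In rules, the possessive form is mostly useful for cutting off pointless backtracking on long runs of characters that can never lead to a match.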
Collected a couple of debug log entries produced by idn_to_ascii() to get some feeling for how successful the conversion is.

It seems the IDN handling can be quite useful, as shown in the following samples. Most of these are in Russian Cyrillic, some are Slovenian (and I remember some German samples, but can't find them in my recent logs):

util: idn_to_ascii: converted to ACE (0):
  /www.грузтранском.рф/ -> /www.xn--80afmnkeilbmhk.xn--p1ai/
  /www.отличное-мнение.рф/ -> /www.xn----itbbaldqlgdbdd6c9d.xn--p1ai/
  /www.правильный-директ.рф/ -> /www.xn----7sbgjfpcfnewtnj6a0kpa.xn--p1ai/
  /грамотное-сео.рф/ -> /xn----7sbijb3bhhbdnti.xn--p1ai/
  /грузтранском.рф/ -> /xn--80afmnkeilbmhk.xn--p1ai/
  /frižider.si/ -> /xn--friider-fxb.si/
  /www.žarnice.si/ -> /www.xn--arnice-2pb.si/
  /žarnice.si/ -> /xn--arnice-2pb.si/
  /www.контролируемый-имидж.рф/ -> /www.xn----htbbggcafgkndfnb5ad5ay6n.xn--p1ai/
  /www.на-отдых-в-сахару.рф/ -> /www.xn------5cddalo2fm2ajiwwf9g.xn--p1ai/
  /www.работа-на-себя.рф/ -> /www.xn-----6kcabde5a3ehtuh4q.xn--p1ai/
  /www.стройметком.рф/ -> /www.xn--e1ahegchekikf.xn--p1ai/
  /делай-деньги.ком.рф/ -> /xn----7sbkbcddzes1a4p.xn--j1aef.xn--p1ai/
  /заказ-грузоперевозок.орг.рф/ -> /xn----7sbajemakccd1aj5cdblpe9c.xn--c1avg.xn--p1ai/
  /играть-за-деньги.рф/ -> /xn-----6kcbmegiogj2d5a3a4mh.xn--p1ai/
  /идеал-мастер.рф/ -> /xn----7sbbnfdp1ak6bjm.xn--p1ai/
  /курсы-шоуменов.рф/ -> /xn----dtbislhedmkue7dyb.xn--p1ai/
  /люди-и-цифры.рф/ -> /xn-----jlcqbbp0c9as2d0a.xn--p1ai/
  /на-отдых-в-сахару.рф/ -> /xn------5cddalo2fm2ajiwwf9g.xn--p1ai/
  /обучаем-иностранному.рф/ -> /xn----7sbbbvt0adhbachd0aprjp8d.xn--p1ai/
  /плавучая-баня.рф/ -> /xn----7sbabed5dwak7b5b6fe.xn--p1ai/
  /работа-на-себя.рф/ -> /xn-----6kcabde5a3ehtuh4q.xn--p1ai/
  /скайвуд-лиственница.рф/ -> /xn----7sbbgbkjwcdjr3aa2cirm2e.xn--p1ai/
  /такси-московское.орг.рф/ -> /xn----7sbhmmlcbpubc4aede.xn--c1avg.xn--p1ai/

...but there are also plenty of samples which indicate a miserable failure of the URL extraction code to properly delimit a URL from the surrounding text when dealing with UTF-8 encoded (normalized) text:

util: idn_to_ascii: alternative dots normalized:
  /自由。”/ -> /自由.”/

util: idn_to_ascii: conversion to ACE failed (0):
  /他一向令女人神魂颠倒的抚摸,就真的那么令她讨厌吗?/
  /www.EPChinaShow.com&t=China’s Largest and Most Authorized Electric Power Exhibition/
  /经五年没有见到你了,求求你了妈妈,陪陪我,好不好?”/

util: idn_to_ascii: converted to ACE (0):
  /www.eme2015.org】/ -> /www.eme2015.xn--org-003b/
  /www.pdma.org)会员/ -> /www.pdma.xn--org)-ye6ft1z/
  /www.uradni-list.si•/ -> /www.uradni-list.xn--si-g3t/
  /#IUS_INFO_ČISTOPISI/ -> /xn--#ius_info_istopisi-3gc/
  /#STROKOVNI_ČLANKI/ -> /xn--#strokovni_lanki-27b/
  /179英文.files/ -> /xn--179-4p8fh21k.files/
  /kjn.uradni-list.si / -> /kjn.uradni-list.si /
  /t.c…”/ -> /t.xn--c...-jb7a/
  /www.WaterNexus.net│info@WaterNexus.net/ -> /www.WaterNexus.xn--netinfo@waternexus-w78l.net/
  /www.adatours.com / -> /www.adatours.com /
  /www.aijssnet.com With/ -> /www.aijssnet.com with/
  /www.aloftcupertino.com / -> /www.aloftcupertino.com /
  /www.defensedaily.com that/ -> /www.defensedaily.com that/
  /www.disclaimer-uk.wur.nl / -> /www.disclaimer-uk.wur.nl /
  /www.eme2015.org】/ -> /www.eme2015.xn--org-003b/
  /www.hotchips.org For/ -> /www.hotchips.org for/
  /www.hoti.org EarlyRegistration/ -> /www.hoti.org earlyregistration/
  /www.laboratory-journal.com”/ -> /www.laboratory-journal.xn--com-9o0a/
  /www.ontoresinc.com)provides/ -> /www.ontoresinc.com)provides/
  /www.particulars.eu”/ -> /www.particulars.xn--eu-02t/
  /www.pdma.org)会员/ -> /www.pdma.xn--org)-ye6ft1z/
  /www.sunseeker.deÂ/ -> /www.sunseeker.xn--de-qia/
  /Õ÷¸åº¯Ó¢ÎÄ1.files/ -> /xn-- o 1-7ea01bd9ezbpw814d5ma.files/
  /”http/ -> /xn--http-fb7a/
  /中文11.files/ -> /xn--11-py2cs33g.files/
  /自由.”/ -> /xn--sny74y.xn--ivg/
  /ó/ -> /xn--kda/

It seems the URL extraction code is an area that calls for much more love in the near future. Properly recognizing Unicode delimiters is one obvious defect, but a trickier one is probably recognizing boundaries in Chinese, Japanese, and Korean writing.

It seems it would be valuable to reach the contributors of this project: http://emaillab.jp/spamassassin/ja-patch/
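To illustrate the kind of stop-gap trimming discussed in this ticket, here is a rough sketch that cuts an extracted candidate down to its leading run of plausible host name characters. It is an approximation for discussion, not the sanitation code from the first attachment: the name trim_host_sketch is hypothetical, and the character set used (letters, marks, digits, hyphen, dot) is an assumption that is much cruder than a real InIDNA2008 class. It also does nothing for the CJK boundary problem.

```perl
use strict;
use warnings;

# Hypothetical sketch. Expects a character string (already decoded).
# Returns the trimmed host name, or undef if nothing usable is left.
sub trim_host_sketch {
  my ($str) = @_;
  return unless defined $str;
  $str =~ s/^[^\p{L}\p{N}]+//;                       # drop leading junk
  my ($host) = $str =~ /^([\p{L}\p{M}\p{N}.\-]+)/;   # longest plausible run
  return unless defined $host;
  $host =~ s/\.+$//;                                 # drop trailing dots
  return length $host ? $host : undef;
}
```

On inputs like "www.incose.org)会员" or "www.ichc2016.com’’" from the logs above, this stops at the first character outside the assumed set, so the trailing bracket, quotes and following text never reach the ACE conversion.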
trunk: add Net::LibIDN as an optional dependency, add one more missing call to idn_to_ascii() in URIDNSBL.pm
  Sending lib/Mail/SpamAssassin/Plugin/URIDNSBL.pm
  Sending lib/Mail/SpamAssassin/Util/DependencyInfo.pm
Committed revision 1695622.
Created attachment 5330 [details] Let RegistryBoundaries.pm be able to deal with IDN

Bug 7215: Towards supporting IDNA - handle IDN domain boundaries
  Sending lib/Mail/SpamAssassin/Conf.pm
  Sending lib/Mail/SpamAssassin/Plugin/HeaderEval.pm
  Sending lib/Mail/SpamAssassin/RegistryBoundaries.pm
  Sending lib/Mail/SpamAssassin/Util.pm
  Sending rules/20_aux_tlds.cf
Committed revision 1703247.

Let RegistryBoundaries handle Internationalizing Domain Names in Unicode, update documentation on directives util_rb_tld, util_rb_2tld, util_rb_3tld, update comment in 20_aux_tlds.cf.

This is now possible too (but not required, handles ACE just fine like before):

util_rb_tld कॉम 佛山 慈善 集团 在线 한국 点看 คอม ভারত 八卦 موقع 公益 公司 移动 我爱你
util_rb_tld москва қаз онлайн сайт срб бел קום 时尚 淡马锡 орг नेट 삼성 சிங்கப்பூர் 商标
util_rb_tld 商店 商城 дети мкд 新闻 工行 كوم 中文网 中信 中国 中國 娱乐 谷歌 భారత్ ලංකා
util_rb_tld ભારત भारत 网店 संगठन 餐厅 网络 ком укр 香港 飞利浦 台湾 台灣 手机 мон الجزائر
util_rb_tld عمان ایران امارات بازار الاردن بھارت المغرب السعودية سودان مليسيا 닷컴 政府
util_rb_tld شبكة გე 机构 组织机构 健康 ไทย سورية рус рф تونس 大拿 みんな グーグル 世界 ਭਾਰਤ
util_rb_tld 网址 닷넷 コム 游戏 vermögensberater vermögensberatung 企业 信息 مصر قطر 广东
util_rb_tld இலங்கை இந்தியா հայ 新加坡 فلسطين 政务
trunk: Bug 7215: Towards supporting IDNA
- tags AUTHORDOMAIN and SENDERDOMAIN to ACE,
- add metadata X-AuthorDomain and X-SenderDomain (to facilitate testing),
- domain to ACE in a call to Mail::DKIM::AuthorDomainPolicy->fetch
  Sending lib/Mail/SpamAssassin/PerMsgStatus.pm
  Sending lib/Mail/SpamAssassin/Plugin/DKIM.pm
Committed revision 1707578.
Partly related:

trunk: Added a test case for international mail (as allowed by RFC 6532 - SMTPUTF8) and a test
  Adding t/data/nice/unicode1
  Adding t/header_utf8.t
Committed revision 1707600.
Hi Mark,

Are we assuming that we still want to leave this in trunk and NOT backport to 3.4?
> Assuming that we still want to leave this in trunk and NOT backport to 3.4?

Yes, I think these changes are too heavyweight for a minor release.

[ but it's also true that I would not like to run 3.4 now at our site in production (with a reasonably fresh version of perl), as 4.0 (=trunk) is better behaved and has been better tested (by me), compared to 3.4 with multiple lightly tested backports ]
Understood. My plan is not to backport the full IDN stuff. I will have a few more bugs backported and then that will be 3.4.2. Then perhaps we get 4.0 (3.5?) moving since these major changes have been tested in production. I can switch to the same version too for testing.
Continuing to target this for 4.0. Working hard to produce our last (hopefully) 3.4.2 release.
Wasn't all this committed to trunk ages ago? Everything works fine as far as I understand. Closing.