Bug 7215 - Towards supporting IDNA (Internationalizing Domain Names in Applications)
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Libraries
Version: SVN Trunk (Latest Devel Version)
Hardware: All All
Importance: P2 enhancement
Target Milestone: 4.0.0
Assignee: SpamAssassin Developer Mailing List
 
Reported: 2015-06-22 19:06 UTC by Mark Martinec
Modified: 2022-03-06 12:48 UTC

Attachments (type; status; submitter/CLA status):
- Provides idn_to_ascii() and is_valid_utf_8(), and some char classes (patch; None; Mark Martinec [HasCLA])
- Introduce Util::idn_to_ascii and make use of it (patch; None; Mark Martinec [HasCLA])
- Let RegistryBoundaries.pm be able to deal with IDN (patch; None; Mark Martinec [HasCLA])

Description Mark Martinec 2015-06-22 19:06:26 UTC
Opening this ticket to coordinate our efforts towards supporting
Internationalizing Domain Names (which is also coupled with
better use of Unicode features of Perl).

As Kevin's plan is to play with this during the summer, I'm attaching
my current work in this area to avoid duplicating work. None of
this is yet set in stone, so it is open to reshuffling of code
or ditching/replacing/reorganizing it altogether. The main idea
is to provide some tools and example code.

It makes use of a perl module Net::LibIDN, and issues a warning
if this module is not available (and then the feature is off).
Should be compatible with existing code. Might even work with
perl 5.8.9, although 5.12 or later would be a better choice
for its much improved Unicode support.

I have been running this changed code (in SA trunk (4.0), Perl 5.20 and
5.22) for the last two months: it solves my immediate problem
of turning U-labels (in Unicode URIs) into ASCII Compatible Encoding
(ACE) for the purpose of URI lookups against black/white-lists.
Not perfect, but better than nothing.

The main problem there is that a text parser (or HTML parser)
does a poor job of extracting Unicode URIs from text, e.g. it
has no notion of Unicode whitespace or a set of characters
allowed in U-labels. Instead of the more complex task of fixing
the text parser, as a stop-gap solution I added some sanitation
code for extracted URIs: trimming prefix and suffix characters
that cannot appear in a valid Unicode URI. This sanitation code
would eventually be removed when a parser is improved.
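
A minimal sketch of this kind of trimming, written in Python rather than the Perl of the attached patch. The function names and the character test are assumptions made for illustration: the test accepts letters, combining marks, decimal digits, hyphen and dot, which is only a rough stand-in for the InIDNA2008 class.

```python
import unicodedata


def is_hostname_char(ch: str) -> bool:
    """Rough approximation of 'may appear in a Unicode hostname'.

    Assumption: letters (L*), combining marks (M*), decimal digits (Nd),
    hyphen and dot are allowed; everything else delimits the name.
    """
    if ch in "-.":
        return True
    category = unicodedata.category(ch)
    return category[0] in ("L", "M") or category == "Nd"


def extract_hostname(candidate: str) -> str:
    """Return the first run of plausible hostname characters in candidate."""
    start = 0
    # Skip leading characters that cannot start a hostname.
    while start < len(candidate) and not is_hostname_char(candidate[start]):
        start += 1
    end = start
    # Extend the run until the first character that cannot be part of it.
    while end < len(candidate) and is_hostname_char(candidate[end]):
        end += 1
    # Stray dots or hyphens at either end are not part of a hostname.
    return candidate[start:end].strip(".-")


print(extract_hostname("www.ichc2016.com\u2019\u2019"))  # www.ichc2016.com
print(extract_hostname("\u201dhttp"))                      # http
```

Cutting at the first disallowed character, rather than only stripping a trailing run, also handles input where valid letters follow a delimiter such as ")".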

Provided general-purpose subroutines are:
- MS::Util::idn_to_ascii
- MS::Util::is_valid_utf_8

and the three user-defined character classes:
- InIDNAWhitespace,
- InIDNAFullStop,
- InIDNA2008
Comment 1 Mark Martinec 2015-06-22 19:08:07 UTC
Created attachment 5313
Provides idn_to_ascii() and is_valid_utf_8(), and some char classes
Comment 2 Mark Martinec 2015-06-22 19:16:23 UTC
Some examples from our logs:

util: idn_to_ascii: converted to ACE (0):
  /www.tretjičlen.si/ -> /www.xn--tretjilen-qfb.si/

util: idn_to_ascii: converted to ACE (0):
  /www.fenster-türen-technik.de/ -> /www.xn--fenster-tren-technik-xec.de/

idn_to_ascii: converted to ACE (0):
  /www.zdš.si/ -> /www.xn--zd-mta.si/

util: idn_to_ascii: extracted:
  /www.ichc2016.com’’/ -> /www.ichc2016.com/

util: idn_to_ascii: extracted:
  /www.incose.org)会员/ -> /www.incose.org/
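
The conversions in these log lines can be reproduced with a short sketch. This is Python, not the Perl Util::idn_to_ascii from the patch; it relies on Python's built-in "idna" codec (IDNA 2003), and the function name is reused here only for illustration.

```python
def idn_to_ascii(hostname: str) -> str:
    """Convert a Unicode hostname to ACE; pure-ASCII names pass through."""
    if hostname.isascii():
        # Nothing to convert, mirrors the 'only if needed' behaviour.
        return hostname
    # The codec splits on dots and encodes each non-ASCII label as xn--...
    return hostname.encode("idna").decode("ascii")


print(idn_to_ascii("www.tretjičlen.si"))  # www.xn--tretjilen-qfb.si
print(idn_to_ascii("www.zdš.si"))         # www.xn--zd-mta.si
print(idn_to_ascii("www.example.com"))    # www.example.com
```

Labels containing characters the codec rejects raise UnicodeError, so a caller would likely want to catch that and fall back to the unconverted name.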
Comment 3 Mark Martinec 2015-08-05 15:49:57 UTC
Created attachment 5322
Introduce Util::idn_to_ascii and make use of it

Here is a focused variant of the previously attached patch,
which provides the Util::idn_to_ascii (which calls Net::LibIDN
if it is available and is needed), and makes use of it where
necessary (i.e. wherever DNS query is being assembled).

The following feature as mentioned in the opening posting
is _not_ included with this patch - it should be implemented
elsewhere:
| Instead of the more complex task of fixing
| the text parser, as a stop-gap solution I added some sanitation
| code for extracted URIs: trimming prefix and suffix characters
| that cannot appear in a valid Unicode URI. This sanitation code
| would eventually be removed when a parser is improved.

Similarly the user-defined character classes InIDNAWhitespace,
InIDNAFullStop, and InIDNA2008 are _not_ included in this patch
so that it can remain clean and focused on a single task of
introducing idn_to_ascii() and is_valid_utf_8().

trunk:
  Sending lib/Mail/SpamAssassin/AsyncLoop.pm
  Sending lib/Mail/SpamAssassin/Dns.pm
  Sending lib/Mail/SpamAssassin/DnsResolver.pm
  Sending lib/Mail/SpamAssassin/Plugin/AskDNS.pm
  Sending lib/Mail/SpamAssassin/Plugin/HeaderEval.pm
  Sending lib/Mail/SpamAssassin/Plugin/URIDNSBL.pm
  Sending lib/Mail/SpamAssassin/Util.pm
Committed revision 1694252.


Please yell if this is for some reason unacceptable :)
Comment 4 Mark Martinec 2015-08-05 17:06:07 UTC
I see that Jenkins is not happy: the t/idn_dots.t test failed,
although it does pass here (perl 5.22). Seems like the version
of perl on Jenkins is 5.8.6 (i.e. 11 years old:) .
Will look for a workaround...
Comment 5 Kevin A. McGrail 2015-08-05 17:08:49 UTC
I tested locally with 5.8.6 and reproduced the failure.  Testing now with 5.17.8.
Comment 6 Mark Martinec 2015-08-05 17:46:59 UTC
got it:

trunk:
  Override the silly global "use bytes", breaks Unicode handling
    Sending lib/Mail/SpamAssassin/Util.pm
Committed revision 1694272.


This common idiom 'use bytes' will keep biting us on the way to
better Unicode support. Should get rid of it eventually throughout.
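
As an analogy only (Python has no 'use bytes' pragma, and this is not the actual failure in idn_dots.t), the sketch below shows the general pitfall: the same operation gives different answers when text is handled as UTF-8 bytes instead of as characters. The sample name is made up.

```python
name = "ŽARNICE.si"          # hypothetical sample with one non-ASCII letter
raw = name.encode("utf-8")   # the same text as UTF-8 octets

# Character semantics: Ž is one character and case-folds like any letter.
chars_lower = name.lower()

# Byte semantics: Ž is two octets, and only ASCII octets are lowercased.
bytes_lower = raw.lower().decode("utf-8")

print(len(name), len(raw))   # 10 11
print(chars_lower)           # žarnice.si
print(bytes_lower)           # Žarnice.si
```

Length, case mapping and character-class matching all diverge in this way once a string is treated as octets, which is why byte semantics applied globally gets in the way of Unicode-aware code.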
Comment 7 Kevin A. McGrail 2015-08-05 17:59:13 UTC
Failed on 5.14.8; likely your 'use bytes' change will fix it, though.

NOTE: For IDN work, I think we will want to require a newer perl and possibly use perlbrew or something similar for SA to install its own perl installation so we aren't distro dependent.  Thoughts?
Comment 8 Mark Martinec 2015-08-05 18:12:30 UTC
> NOTE: For IDN work, I think we will want to require a newer perl and possibly
> use perlbrew or something similar for SA to install its own perl
> installation so we aren't distro dependent.  Thoughts?

The oldest 5.8 in perlbrew is 5.8.9, tried it and it works.
I didn't notice the error because I did have a Net::LibIDN installed,
which did the dots replacement instead of our explicit code,
thus masking the problem in the idn_dots.t test.
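
For reference, a minimal sketch of what explicit dots replacement amounts to, in Python rather than the Perl used in Util.pm. RFC 3490 (section 3.1) names three characters besides U+002E that must be recognized as label separators; the function name and the sample input are assumptions for illustration.

```python
import re

# Label separators other than "." listed in RFC 3490, section 3.1:
#   U+3002 ideographic full stop
#   U+FF0E fullwidth full stop
#   U+FF61 halfwidth ideographic full stop
ALTERNATIVE_DOTS = re.compile("[\u3002\uFF0E\uFF61]")


def normalize_dots(hostname: str) -> str:
    """Replace the alternative full stops with an ASCII dot."""
    return ALTERNATIVE_DOTS.sub(".", hostname)


print(normalize_dots("www\u3002example\uFF0Ecom"))  # www.example.com
```

The pattern is applied to decoded text, so each separator is matched as one character; a test that exercises this path without an IDN library installed is what would catch a regression here.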

If we can afford to have multiple versions of perl installed and
running under jenkins, perlbrew is probably the most straightforward.
Dependency modules would need to be installed (e.g. by cpanm) into
each brew instance.
Comment 9 Kevin A. McGrail 2015-08-05 18:14:59 UTC
My expectation is that, with more UTF-8 (or UTF-16) in the internal guts of SA, perl 5.14 becomes the baseline install; from my memory at least, that is where this works well.

I want to set a good expectation that newer perl is needed for SA 4.X unless you think I'm just being elitist.
Comment 10 Mark Martinec 2015-08-05 18:56:51 UTC
> My expectation with using more UTF8 (or 16) as the internal guts for SA is
> that perl 5.14 becomes a baseline install where that works well from my
> memory at least.
> I want to set a good expectation that newer perl is needed for SA 4.X unless
> you think I'm just being elitist.

I agree that 5.14 or maybe 5.12 is the baseline for more serious
Unicode support. Although we may do it in steps: bump up the minimal
version one notch with each specific problem we encounter during
development. With basic Unicode support (and even the user-defined
character classes) it seems the 5.8.9 is still able to cope somehow.

> unless you think I'm just being elitist.

Elitist? We are running 5.22 and 5.20 on our servers here :))
Have to dig deep to find something running 5.16, let alone 5.14
around here.
Comment 11 Benny Pedersen 2015-08-05 19:10:30 UTC
(In reply to Kevin A. McGrail from comment #7)
> Failed on 5.14.8 likely your use bytes change will fix it though.
> 
> NOTE: For IDN work, I think we will want to require a newer perl and possibly
> use perlbrew or something similar for SA to install its own perl
> installation so we aren't distro dependent.  Thoughts?

there are precompiled problems everywhere, but I keep away from them :=)

Gentoo and FreeBSD work for me, and Google kills the rest of my day now :/
Comment 12 Kevin A. McGrail 2015-08-05 20:38:45 UTC
(In reply to Mark Martinec from comment #10)
> > My expectation with using more UTF8 (or 16) as the internal guts for SA is
> > that perl 5.14 becomes a baseline install where that works well from my
> > memory at least.
> > I want to set a good expectation that newer perl is needed for SA 4.X unless
> > you think I'm just being elitist.
> 
> I agree that 5.14 or maybe 5.12 is the baseline for more serious
> Unicode support. Although we may do it in steps: bump up the minimal
> version one notch with each specific problem we encounter during
> development. With basic Unicode support (and even the user-defined
> character classes) it seems the 5.8.9 is still able to cope somehow.
> 
> > unless you think I'm just being elitist.
> 
> Elitist? We are running 5.22 and 5.20 on our servers here :))
> Have to dig deep to find something running 5.16, let alone 5.14
> around here.

Hah. Well jenkins is stable and 5.8.6 tests worked though a bit noisy with INFO: module Net::LibIDN not available warnings.

For 4.X, we need good, stable UTF support which in my experience means using 5.14+.  

That said, RHEL/CentOS 5 which isn't EOL until 2017 ships with 5.8.8. RHEL/CentOS 6 ships with 5.10.1. And RHEL/CentOS 7 ships with 5.16.3.  

I could examine other distros but I think we can require 5.14.8+ for 4.X especially since I'm expecting distros won't include 4.X except on the next major release.

To make things easier for those with older perls, we can document and even provide some automation to use the system/distro perl to bootstrap a newer version of perl dedicated for SA with something like perlbrew.  It will require compilation tools/libraries but effectively it's like installing your own JRE for a specific product.

See https://github.com/gugod/App-perlbrew under the Synopsis to see just how easily people can add alternate perl versions to their system.  

Any arguments against changing trunk INSTALL to require 5.14.8 as well as the PACKAGING, Makefile.PL and UPGRADE files?  I can also look at making Makefile.PL bootstrap with system perl and download perlbrew to make a newer perl available.
Comment 13 Mark Martinec 2015-08-05 23:00:29 UTC
> Hah. Well jenkins is stable and 5.8.6 tests worked though a bit noisy with
> INFO: module Net::LibIDN not available warnings.

Can turn that warn() into info() for comfort, at least temporarily.

> For 4.X, we need good, stable UTF support which in my experience means using
> 5.14+.  
> I could examine other distros but I think we can require 5.14.8+ for 4.X
> especially since I'm expecting distros won't include 4.X except on the next
> major release.
> Any arguments against changing trunk INSTALL to require 5.14.8 as well as
> the PACKAGING, Makefile.PL and UPGRADE files?  I can also look at making
> Makefile.PL bootstrap with system perl and download perlbrew to make a newer
> perl available.

I think it's way too early to set the minimal version of perl as a
firm requirement. We haven't even started the more intricate work,
it's still a long way to 4.0.  It's fine to make 5.14 as a recommended
minimal version, but I'd prefer to enforce such in Makefile.PL much
closer to a 4.0 release.



For starters I'd be happy to have 5.10 as a firm requirement,
which will enable the use of a possessive quantifier syntactic sugar
( "?+","*+", "++", "{min,max}+" ), particularly useful in rules,
and the defined-or operator ( // ), which can cut down the clutter
in code somewhat:
( $a // $b  is equivalent to:  defined $a ? $a : $b,
  similarly: $c //= $d  instead of:  $c = $d unless defined $c ).

Also the Digest::SHA module became a core module, so we can get
rid of Digest::SHA1 as a backward compatibility fallback.

Also in 5.10: (perl5100delta) The regular expression engine is
no longer recursive, meaning that patterns that used to overflow
the stack will either die with useful explanations, or run
to completion.

And: (perl5100delta) Alternations, where possible, are optimised
into more efficient matching structures. String literal alternations
are merged into a trie and are matched simultaneously.  This means
that instead of O(N) time for matching N alternations at a given
point, the new code performs in O(1) time.
[...] Note: Much code exists that works around perl's historic
poor performance on alternations. Often the tricks used to do so
will disable the new optimisations.
Comment 14 Mark Martinec 2015-08-11 23:19:59 UTC
Collected a couple of debug log entries produced by idn_to_ascii()
to get some feeling on how successful the conversion is.

Seems the IDN handling can be quite useful as shown in the following
samples. Most of these are in Russian Cyrillic, some are Slovenian
(and I remember some German samples, but can't find them in my
recent logs):

util: idn_to_ascii: converted to ACE (0):
  /www.грузтранском.рф/ -> /www.xn--80afmnkeilbmhk.xn--p1ai/
  /www.отличное-мнение.рф/ -> /www.xn----itbbaldqlgdbdd6c9d.xn--p1ai/
  /www.правильный-директ.рф/ -> /www.xn----7sbgjfpcfnewtnj6a0kpa.xn--p1ai/
  /грамотное-сео.рф/ -> /xn----7sbijb3bhhbdnti.xn--p1ai/
  /грузтранском.рф/ -> /xn--80afmnkeilbmhk.xn--p1ai/
  /frižider.si/ -> /xn--friider-fxb.si/
  /www.žarnice.si/ -> /www.xn--arnice-2pb.si/
  /žarnice.si/ -> /xn--arnice-2pb.si/
  /www.контролируемый-имидж.рф/ -> /www.xn----htbbggcafgkndfnb5ad5ay6n.xn--p1ai/
  /www.на-отдых-в-сахару.рф/ -> /www.xn------5cddalo2fm2ajiwwf9g.xn--p1ai/
  /www.работа-на-себя.рф/ -> /www.xn-----6kcabde5a3ehtuh4q.xn--p1ai/
  /www.стройметком.рф/ -> /www.xn--e1ahegchekikf.xn--p1ai/
  /делай-деньги.ком.рф/ -> /xn----7sbkbcddzes1a4p.xn--j1aef.xn--p1ai/
  /заказ-грузоперевозок.орг.рф/ -> /xn----7sbajemakccd1aj5cdblpe9c.xn--c1avg.xn--p1ai/
  /играть-за-деньги.рф/ -> /xn-----6kcbmegiogj2d5a3a4mh.xn--p1ai/
  /идеал-мастер.рф/ -> /xn----7sbbnfdp1ak6bjm.xn--p1ai/
  /курсы-шоуменов.рф/ -> /xn----dtbislhedmkue7dyb.xn--p1ai/
  /люди-и-цифры.рф/ -> /xn-----jlcqbbp0c9as2d0a.xn--p1ai/
  /на-отдых-в-сахару.рф/ -> /xn------5cddalo2fm2ajiwwf9g.xn--p1ai/
  /обучаем-иностранному.рф/ -> /xn----7sbbbvt0adhbachd0aprjp8d.xn--p1ai/
  /плавучая-баня.рф/ -> /xn----7sbabed5dwak7b5b6fe.xn--p1ai/
  /работа-на-себя.рф/ -> /xn-----6kcabde5a3ehtuh4q.xn--p1ai/
  /скайвуд-лиственница.рф/ -> /xn----7sbbgbkjwcdjr3aa2cirm2e.xn--p1ai/
  /такси-московское.орг.рф/ -> /xn----7sbhmmlcbpubc4aede.xn--c1avg.xn--p1ai/


...but there are also plenty of samples which indicate a miserable
failure of the URL extraction code in properly delimiting a URL
from surrounding text when dealing with UTF-8 encoded (normalized)
text:

util: idn_to_ascii: alternative dots normalized:
  /自由。”/ -> /自由.”/

util: idn_to_ascii: conversion to ACE failed (0):
  /他一向令女人神魂颠倒的抚摸,就真的那么令她讨厌吗?/
  /www.EPChinaShow.com&t=China’s Largest and Most Authorized Electric Power Exhibition/
  /经五年没有见到你了,求求你了妈妈,陪陪我,好不好?”/

util: idn_to_ascii: converted to ACE (0):
  /www.eme2015.org】/ -> /www.eme2015.xn--org-003b/
  /www.pdma.org)会员/ -> /www.pdma.xn--org)-ye6ft1z/
  /www.uradni-list.si•/ -> /www.uradni-list.xn--si-g3t/
  /#IUS_INFO_ČISTOPISI/ -> /xn--#ius_info_istopisi-3gc/
  /#STROKOVNI_ČLANKI/ -> /xn--#strokovni_lanki-27b/
  /179英文.files/ -> /xn--179-4p8fh21k.files/
  /kjn.uradni-list.si / -> /kjn.uradni-list.si /
  /t.c…”/ -> /t.xn--c...-jb7a/
  /www.WaterNexus.net│info@WaterNexus.net/ -> /www.WaterNexus.xn--netinfo@waternexus-w78l.net/
  /www.adatours.com / -> /www.adatours.com /
  /www.aijssnet.com With/ -> /www.aijssnet.com with/
  /www.aloftcupertino.com / -> /www.aloftcupertino.com /
  /www.defensedaily.com that/ -> /www.defensedaily.com that/
  /www.disclaimer-uk.wur.nl / -> /www.disclaimer-uk.wur.nl /
  /www.eme2015.org】/ -> /www.eme2015.xn--org-003b/
  /www.hotchips.org   For/ -> /www.hotchips.org for/
  /www.hoti.org EarlyRegistration/ -> /www.hoti.org earlyregistration/
  /www.laboratory-journal.com”/ -> /www.laboratory-journal.xn--com-9o0a/
  /www.ontoresinc.com)provides/ -> /www.ontoresinc.com)provides/
  /www.particulars.eu”/ -> /www.particulars.xn--eu-02t/
  /www.pdma.org)会员/ -> /www.pdma.xn--org)-ye6ft1z/
  /www.sunseeker.deÂ/ -> /www.sunseeker.xn--de-qia/
  /Õ÷¸åº¯Ó¢ÎÄ1.files/ -> /xn-- o 1-7ea01bd9ezbpw814d5ma.files/
  /”http/ -> /xn--http-fb7a/
  /中文11.files/ -> /xn--11-py2cs33g.files/
  /自由.”/ -> /xn--sny74y.xn--ivg/
  /ó/ -> /xn--kda/

Seems the URL extraction code is an area that calls for much more
love in the near future. Properly recognizing Unicode delimiters
is one obvious defect, but a trickier one is probably dealing with
recognizing boundaries in Chinese, Japanese, and Korean writing.
Seems it would be valuable to reach contributors of the project:
  http://emaillab.jp/spamassassin/ja-patch/
Comment 15 Mark Martinec 2015-08-12 23:11:27 UTC
trunk:
  add Net::LibIDN as an optional dependency,
  add one more missing call to idn_to_ascii() in URIDNSBL.pm
Sending lib/Mail/SpamAssassin/Plugin/URIDNSBL.pm
Sending lib/Mail/SpamAssassin/Util/DependencyInfo.pm
Committed revision 1695622.
Comment 16 Mark Martinec 2015-09-15 16:40:31 UTC
Created attachment 5330
Let RegistryBoundaries.pm be able to deal with IDN

Bug 7215: Towards supporting IDNA - handle IDN domain boundaries
  Sending lib/Mail/SpamAssassin/Conf.pm
  Sending lib/Mail/SpamAssassin/Plugin/HeaderEval.pm
  Sending lib/Mail/SpamAssassin/RegistryBoundaries.pm
  Sending lib/Mail/SpamAssassin/Util.pm
  Sending rules/20_aux_tlds.cf
Committed revision 1703247.

Let RegistryBoundaries handle Internationalizing Domain Names in Unicode,
update documentation on directives util_rb_tld, util_rb_2tld, util_rb_3tld,
update comment in 20_aux_tlds.cf.


This is now possible too (but not required, handles ACE just fine like before):

util_rb_tld कॉम 佛山 慈善 集团 在线 한국 点看 คอม ভারত 八卦 موقع 公益 公司 移动 我爱你
util_rb_tld москва қаз онлайн сайт срб бел קום 时尚 淡马锡 орг नेट 삼성 சிங்கப்பூர் 商标
util_rb_tld 商店 商城 дети мкд 新闻 工行 كوم 中文网 中信 中国 中國 娱乐 谷歌 భారత్ ලංකා
util_rb_tld ભારત भारत 网店 संगठन 餐厅 网络 ком укр 香港 飞利浦 台湾 台灣 手机 мон الجزائر
util_rb_tld عمان ایران امارات بازار الاردن بھارت المغرب السعودية سودان مليسيا 닷컴 政府
util_rb_tld شبكة გე 机构 组织机构 健康 ไทย سورية рус рф تونس 大拿 みんな グーグル 世界 ਭਾਰਤ
util_rb_tld 网址 닷넷 コム 游戏 vermögensberater vermögensberatung 企业 信息 مصر قطر 广东
util_rb_tld இலங்கை இந்தியா հայ 新加坡 فلسطين 政务
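
One way to picture how a TLD list can accept both forms is to store every entry in ACE and convert each lookup key the same way. The sketch below is Python with a hypothetical TldSet class; it is not how RegistryBoundaries.pm is actually implemented, and it uses Python's built-in "idna" codec for the conversion.

```python
def to_ace(label: str) -> str:
    """Lowercase a label and convert it to ACE (ASCII labels pass through)."""
    return label.lower().encode("idna").decode("ascii")


class TldSet:
    """Hypothetical TLD registry keyed by the ACE form of each label."""

    def __init__(self) -> None:
        self._ace: set[str] = set()

    def add(self, *labels: str) -> None:
        # Entries may be given as U-labels or as ACE; both end up as ACE.
        for label in labels:
            self._ace.add(to_ace(label))

    def __contains__(self, label: str) -> bool:
        return to_ace(label) in self._ace


tlds = TldSet()
tlds.add("рф", "орг", "ком", "si")

print("рф" in tlds)        # True
print("xn--p1ai" in tlds)  # True
print("РФ" in tlds)        # True
print("example" in tlds)   # False
```

Normalizing to a single stored form keeps lookups to one set membership test regardless of which form a configuration line or a parsed hostname uses.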
Comment 17 Mark Martinec 2015-10-08 16:15:55 UTC
trunk:
 Bug 7215: Towards supporting IDNA
  - tags AUTHORDOMAIN and SENDERDOMAIN to ACE,
  - add metadata X-AuthorDomain and X-SenderDomain (to facilitate testing),
  - domain to ACE in a call to Mail::DKIM::AuthorDomainPolicy->fetch
Sending lib/Mail/SpamAssassin/PerMsgStatus.pm
Sending lib/Mail/SpamAssassin/Plugin/DKIM.pm
Committed revision 1707578.
Comment 18 Mark Martinec 2015-10-08 18:51:25 UTC
partly related:

trunk:
 Added a test case for international mail (as allowed
 by RFC 6532 - SMTPUTF8) and a test
  Adding t/data/nice/unicode1
  Adding t/header_utf8.t
Committed revision 1707600.
Comment 19 Kevin A. McGrail 2017-04-11 06:23:51 UTC
Hi Mark, 

Assuming that we still want to leave this in trunk and NOT backport to 3.4?
Comment 20 Mark Martinec 2017-04-11 10:36:13 UTC
> Assuming that we still want to leave this in trunk and NOT backport to 3.4?

Yes, I think these changes are too heavyweight for a minor release.


[ but it's also true that I would not like to run 3.4 now at our site
  in production (with a reasonably fresh version of perl), as the
  4.0(=trunk) is better behaved and has been better tested (by me),
  compared to 3.4 with multiple lightly tested backports ]
Comment 21 Kevin A. McGrail 2017-04-11 11:54:51 UTC
Understood.  My plan is not to backport the full IDN stuff.

I will have a few more bugs backported and then that will be 3.4.2.

Then perhaps we get 4.0 (3.5?) moving since these major changes have been tested in production.

I can switch to the same version too for testing.
Comment 22 Kevin A. McGrail 2018-08-28 03:01:44 UTC
Continuing to target this for 4.0.  Working hard to produce our last (hopefully) 3.4.2 release.
Comment 23 Henrik Krohns 2022-03-06 12:48:00 UTC
Wasn't all this committed to trunk ages ago? Everything works fine, as I understand it. Closing.