Bug 3827 - [review] SURBL ccTLD list updated, please update SA TLD code
Summary: [review] SURBL ccTLD list updated, please update SA TLD code
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: spamassassin
Version: SVN Trunk (Latest Devel Version)
Hardware: Other other
Importance: P5 minor
Target Milestone: 3.0.1
Assignee: SpamAssassin Developer Mailing List
URL: http://spamcheck.freeapp.net/two-leve...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2004-09-27 01:17 UTC by Jeff Chan
Modified: 2004-10-02 09:52 UTC

Attachment                             Type   Status  Submitter/CLA Status
patch to update to the list provided   patch  None    Theo Van Dinter [HasCLA]
same as before, minus .de              patch  None    Theo Van Dinter [HasCLA]

Description Jeff Chan 2004-09-27 01:17:17 UTC
>>> | name,ai,au,bd,bh,ck,eg,et,fk,il,in,kh,kr,mk,mt,na,
>>> | np,nz,pg,pk,qa,sa,sb,sg,sv,ua,ug,uk,uy,vn,za,zw
 
I have updated the ccTLDs for these, removed some duplicates, and added some
data for a few other ccTLDs.  The results are in:

  http://spamcheck.freeapp.net/two-level-tlds

Really this is just for completeness since geographic domains other than .us
aren't used in spams too often.
Comment 1 Theo Van Dinter 2004-09-28 18:38:31 UTC
ok, looking at the diff...  I'm pretty happy with the additions, so I'm left with the removals that are
questionable (aka, why were they removed?):

< org\.au
< notaires\.fr
< nui\.hu
< nm\.kr
< ac\.pa
< co\.sv
Comment 2 Jeff Chan 2004-09-28 18:50:56 UTC
Those were corrected typos and duplicates.
Comment 3 Theo Van Dinter 2004-09-28 19:00:19 UTC
committed for 3.1, r47440

will attach a 3.0.1 patch shortly
Comment 4 Theo Van Dinter 2004-09-28 19:00:58 UTC
Created attachment 2387
patch to update to the list provided
Comment 5 Jeff Chan 2004-09-28 19:18:55 UTC
BTW, on the advice of a German registrar and others, we've removed the .de
entries from the list.  They are not proper generic ccTLDs.  .de apparently has
no reserved second level generic geographic TLDs. Removed are:

nic.de
denic.de
moebel.de
glueckwunsch.de
buecher.de
boerse.de
kueche.de
buero.de
fluege.de
kuechen.de
aerzte.de
reisebuero.de
doctoren.de
aero.de
museum.de
coop.de
pro.de
Comment 6 Daniel Quinlan 2004-09-28 19:20:12 UTC
+1

Some of these don't really seem like we should be bothering to break them out.

somerandomname.com can all be one domain as far as I'm concerned.  I think
ccTLDs really need to represent some organizational unit larger than a
simple company or organization.
Comment 7 Justin Mason 2004-09-28 19:25:47 UTC
+1

but note:

'somerandomname.com can all be one domain as far as I'm concerned.  I think
ccTLDs really need to represent some organizational unit larger than a
simple company or organization.'

the point is not how big the registrar is -- it's if a spammer can obtain NS and
SOA records in a zone under that domain.  if they can, then we have a possible
hole that we'll miss in our rules; if they can't, we don't have a problem.
Comment 8 Jeff Chan 2004-09-28 19:30:31 UTC
'the point is not how big the registrar is -- it's if a spammer can obtain NS and
SOA records in a zone under that domain.  if they can, then we have a possible
hole that we'll miss in our rules; if they can't, we don't have a problem.'

If delegations are available under a given domain, then those delegations can be
abused independently of the parent domain.  So lists like this help us know
which child domain levels need to be checked.
Comment 9 Theo Van Dinter 2004-09-28 20:28:28 UTC
Created attachment 2388
same as before, minus .de
Comment 10 Theo Van Dinter 2004-09-28 20:29:16 UTC
Subject: Re:  SURBL ccTLD list updated, please update SA TLD code

On Tue, Sep 28, 2004 at 07:18:56PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> BTW, on the advice of a German registrar and others, we've removed the .de
> entries from the list.  They are not proper generic ccTLDs.  .de apparently has
> no reserved second level generic geographic TLDs. Removed are:

ok, updated in 3.1 and in the 3.0.1 patch.

Comment 11 Malte S. Stretz 2004-09-29 16:52:15 UTC
-0.5:  From the maintenance POV that's a nightmare, can we please split that
list up into two REs, one consisting of official second-level domains and one
of other providers which offer such domains?

Actually, I don't think we should hard-code the list of unofficial domains
anyway (there will always be n+1 more providers for such things) but if you
really want, please split the lists.  An alternative for 3.1 or 3.2 could be
to read this list from a file in DATADIR which can be updated from somewhere
with a simple wget call.
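
A minimal sketch of the file-based alternative, written in Python rather than SpamAssassin's Perl. The file format (one entry per line, `#` comments, a leading `!` marking a provider-run entry) and the name `load_tld_lists` are assumptions made for illustration, not existing SpamAssassin code. It keeps the two lists separate, as requested above.

```python
def load_tld_lists(lines):
    """Split list entries into official and provider-run second-level domains.

    Hypothetical format: one entry per line, '#' starts a comment, and a
    leading '!' marks an entry that is not an official registry boundary.
    """
    official, unofficial = set(), set()
    for raw in lines:
        entry = raw.split("#", 1)[0].strip().lower()
        if not entry:
            continue  # blank line or comment-only line
        if entry.startswith("!"):
            unofficial.add(entry[1:].strip())
        else:
            official.add(entry)
    return official, unofficial


# In practice `lines` would come from an open file in DATADIR.
sample = [
    "# official registry boundaries",
    "co.uk",
    "ac.uk",
    "",
    "# provider-run zones",
    "!gmxhome.de   # example provider named in this thread",
]
official, unofficial = load_tld_lists(sample)
```

Because the function accepts any iterable of lines, the same code works on a downloaded file or on an in-memory list for testing.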
Comment 12 Malte S. Stretz 2004-09-29 17:03:29 UTC
Hmmm... as the code is already in 3.0.0 it's probably the best idea to just
update it.  But especially those .fr, .hu and .jp "word"-domains (maybe I
missed some) look as useless as the .de ones.

In the initial report Jeff wrote 'Really this is just for completeness since
geographic domains other than .us aren't used in spams too often.' -- then IMO
we should not include any of the others, as the bigger the RE, the more
overhead we have.
Comment 13 Theo Van Dinter 2004-09-29 17:12:54 UTC
Subject: Re:  [review] SURBL ccTLD list updated, please update SA TLD code

On Wed, Sep 29, 2004 at 05:03:30PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> In the initial report Jeff wrote 'Really this is just for completeness since
> geographic domains other than .us aren't used in spams too often.' -- then IMO
> we should not include any of the others, as the bigger the RE, the more
> overhead we have.

Yeah, I actually was noticing that the 4TLD and 3TLD domains are REs, but the
2TLD ones are simple "example.tld" strings.  So in theory, we could just grab
the last two sections of the FQDN, then do a hash lookup.

For the 3.0.1 code, I'd like to just do the update.  For the 3.1 code, I'm
tempted to change it around.

If things like SURBL are only going to list actual domains, we need to deal
with that correctly.
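
A rough sketch of the hash-lookup idea, in Python rather than SpamAssassin's Perl. The names `TWO_LEVEL_TLDS` and `registered_domain` are hypothetical, the set is a tiny illustrative subset, and the 3TLD and 4TLD cases that are currently REs are deliberately not handled.

```python
# Tiny illustrative subset; the real list is far longer.
TWO_LEVEL_TLDS = {"co.uk", "ac.uk"}


def registered_domain(fqdn, two_level=TWO_LEVEL_TLDS):
    """Trim a hostname to the domain that should be checked.

    Takes the last two labels of the FQDN and looks them up in a set
    (a hash lookup). On a hit, one more label is kept.
    """
    labels = fqdn.lower().rstrip(".").split(".")
    if len(labels) <= 2:
        return ".".join(labels)
    keep = 3 if ".".join(labels[-2:]) in two_level else 2
    return ".".join(labels[-keep:])
```

The lookup cost stays constant as the list grows, which is the point of replacing one large alternation with a hash.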

Comment 14 Justin Mason 2004-09-29 17:16:39 UTC
'If things like SURBL are only going to list actual domains, we need to deal
with that correctly.'

what d'you mean -- actual registrar-boundary domains, or any domain that a
spammer could possibly register, even if not with an ICANN-recognized registrar
boundary?
Comment 15 Theo Van Dinter 2004-09-29 17:59:49 UTC
Subject: Re:  [review] SURBL ccTLD list updated, please update SA TLD code

On Wed, Sep 29, 2004 at 05:16:40PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> what d'you mean -- actual registrar-boundary domains, or any domain that a
> spammer could possibly register, even if not with an ICANN-recognized registrar
> boundary?

I'd say registrar boundaries.  I want to find all the proper domains.

Comment 16 Jeff Chan 2004-09-30 01:57:08 UTC
Does that mean domains like "medecin.fr" would stay in?  I think the principle
of these is that doctors could register subdomains under that one, etc.
Comment 17 Malte S. Stretz 2004-09-30 03:39:02 UTC
Is medecin.fr an "official" subdomain by the French NIC (whatever it is
called)?  (And is it actually abused?)  If not, what's the difference to other
(free) "ccTLDs", like maybe gmxhome.de (just one which came to my mind) or
somerandomprovider.fr?  Just like the removed .de domains which were actually
just a random list of generic words.
 
We just can't keep an up-to-date list of every provider which offers 
third-level domains in our codebase, especially not in one big RE.  (I must 
admit that I don't exactly know what this RE is used for but the above is true 
anyway.) 
 
For 3.0 I'm fine (aka +1) if the RE is updated as suggested.  If not 
impossible, I'd love to see those generic-word-domains go from the list so 
that only the official boundaries stay, but if the SURBL code needs to have 
these they can stay in but that has to be fixed to something more dynamic for 
3.1. 
Comment 18 Justin Mason 2004-09-30 10:05:11 UTC
I think I should clarify the pros and cons of listing non-ICANN-registrar
domain boundaries, since there seems to be some confusion.

When we initially considered how SURBL and other RHSBL-style domain tests should
work, we considered the possible abusable holes that spammers could use.  This
is one of them.   Here's how it works:

1. if we only list ICANN-registrar domain boundaries (ie, "com", "co.uk",
"info", "cn" et al), then we have a smaller regexp and less maintenance
2. however, if "sh.cn" is a small company that offers for-free or for-pay
subdelegation to third parties, and a spammer registers "foo.sh.cn", but there
are nonspam domains at "bar.sh.cn", "baz.sh.cn", we cannot list them (because
we'd have to list "sh.cn" and hit all the nonspam domains too).  in other words,
we have a hole in our rules and in SURBL.
3. therefore we should list any "registrar boundary" where a company or
organisation allows third parties to register domains under their domain, even
if it's not an "official" one.

(what is an "official" one anyway?  do ICANN maintain a list of all the
sub-ccTLD delegators, like whoever deals with registration for .co.uk, .ac.uk,
et al?)

So the danger is that if we cut the list down, we'll provide a hole spammers can
drive through.   If you all are OK with that, then fine ;)
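
The hole described in points 1 to 3 can be illustrated with a small Python sketch. `listing_key` is a hypothetical helper, not SpamAssassin or SURBL code: it returns the name a blocklist would have to list for a host, namely the longest known boundary suffix plus one more label.

```python
def listing_key(host, boundaries):
    """Return the longest matching boundary suffix plus one label.

    Returns None if the host matches no boundary or is itself a boundary.
    """
    labels = host.lower().rstrip(".").split(".")
    for i in range(len(labels)):  # try the longest suffix first
        if ".".join(labels[i:]) in boundaries:
            if i == 0:
                return None  # the host itself is a boundary
            return ".".join(labels[i - 1:])
    return None


icann_only = {"com", "co.uk", "info", "cn"}
extended = icann_only | {"sh.cn"}

# Without "sh.cn" as a boundary, spam and nonspam hosts share one key,
# so listing the spammer would also hit the nonspam domains.
shared = {listing_key(h, icann_only) for h in ("foo.sh.cn", "bar.sh.cn")}

# With "sh.cn" as a boundary, each delegation gets its own key.
separate = {listing_key(h, extended) for h in ("foo.sh.cn", "bar.sh.cn")}
```

Here `shared` contains the single key `sh.cn`, while `separate` contains `foo.sh.cn` and `bar.sh.cn`, so only the second configuration lets the spammer's delegation be listed on its own.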

Comment 19 Malte S. Stretz 2004-09-30 10:30:42 UTC
'2. however, if "sh.cn" is a small company that offers for-free or for-pay 
subdelegation to third parties, and a spammer registers "foo.sh.cn", but there 
are nonspam domains at "bar.sh.cn", "baz.sh.cn", we cannot list them (because 
we'd have to list "sh.cn" and hit all the nonspam domains too).' 
 
k, thanks for the explanation.  But my point still stands:  There are n
providers which offer third-level domains, with lim(n)->oo.  We simply can't
list all those providers in one big RE (or even a hash) because (a) there are
too many of them and (b) they may change under our feet.

When I grep through my spam I could maybe tell you 10 such providers which are
actually abused and aren't listed there (one from my head which is very much
abused: 0catch.com) and think of a list of ones which aren't yet (like
gmxhome.de and internet-provider.net).

So as long as those currently listed aren't abused very much I don't see a
point in including them.
 
'(what is an "official" one anyway?  do ICANN maintain a list of all the 
sub-ccTLD delegators, like whoever deals with registration for .co.uk, .ac.uk, 
et al?)' 
 
I don't think ICANN maintains such a list; some NICs even changed their mind 
at some time and started to offer direct second-level domains, too.  But at 
least for the "official" ones the possibility that such a domain is abused is 
higher than with some random provider. 
Comment 20 Justin Mason 2004-10-02 16:23:29 UTC
oh -- forgot to mention: +1 ;)
Comment 21 Theo Van Dinter 2004-10-02 17:52:30 UTC
r51815