Bug 5813 - [review] several TLDs are not parsed by URI text scanner in PerMsgStatus.pm
Summary: [review] several TLDs are not parsed by URI text scanner in PerMsgStatus.pm
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: spamassassin (show other bugs)
Version: SVN Trunk (Latest Devel Version)
Hardware: Other other
: P2 normal
Target Milestone: 3.2.6
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard: needs 1 votes for 3.2.6
Keywords:
Depends on: 5780
Blocks:
  Show dependency tree
 
Reported: 2008-02-07 05:45 UTC by Sidney Markowitz
Modified: 2010-06-11 09:48 UTC (History)
2 users (show)



Attachment Type Modified Status Actions Submitter/CLA Status
Fix both table and regexp to have correct list of TLDs and add all of them to uri_text.t tests patch None Sidney Markowitz [HasCLA]
Combined patch against 3.2 for bugs 5780 and 5813 patch None Sidney Markowitz [HasCLA]

Note You need to log in before you can comment on or make changes to this bug.
Description Sidney Markowitz 2008-02-07 05:45:56 UTC
"Schemeless" URI text strings such as example.com which do not begin with http:
or https: or ftp: or mailto:  www. or ftp. are checked to see if they have a
domain that is a valid TLD. That check is done in PerMsgStatus.pm with a regexp
that has a comment saying that it was generated from a list of TLDs in
Util/RegistrarBoundaries.pm

The following URIs are not detected by that regexp, even though they are in the
table in Util/RegistrarBoundaries.pm.

Prefixing these URIs with 'www.' so they are processed through that table does
cause them to be detected:

example.asia
example.cat
example.cv
example.cx
example.cy
example.cz
example.jobs
example.mobi
example.tel
example.travel
example.xxx
Comment 1 Sidney Markowitz 2008-02-07 06:49:30 UTC
Looking at the list I posted in the description I noticed that .xxx is there
even though it is not a valid TLD. That led me to compare the table in
Util/RegistrarBoundaries.pm with the official list in
http://data.iana.org/TLD/tlds-alpha-by-domain.txt

I haven't done a careful diff, but I do notice that we include .cs which is not
an official TLD, as wel as xxx, so the table is not correct. Does anyone have
any objection to my simply snarfing the official list into
Util/RegistrarBoundaries.pm, minus the 11 IDN .test TLDS (Those are the ones
that look like XN--0ZWM56D and are in the root table only for an IDN test that
was recently run)?

Also, the comment in PerMsgStatus says that the regexp was generated using
Regexp::Optimizer, but the Regexp::Optimizer suggests using Regexp::List when
you are dealing with a plain list of words with no metacharacters, which is what
we have here. Does anyone know a reason why we should stick with Regexp::Optimizer?
Comment 2 Sidney Markowitz 2008-02-07 19:30:27 UTC
Created attachment 4257 [details]
Fix both table and regexp to have correct list of TLDs and add all of them to uri_text.t tests

This has been committed to trunk revision 619753.

I've attached the patch here and I'm targtting this bug to 3.2.5 to make it
easy to apply to branch 3.2 if we decide to apply the fix for bug 5780 to
branch 3.2. If we do not apply 5780 to branch 3.2, this bug will be in the
branch but a new patch against 3.2 would have to be worked up.
Comment 3 Sidney Markowitz 2008-02-07 19:46:33 UTC
I've decided to flag this for review and voting, but please look at bug 5780
first, as it has to be voted and committed before this one can.
Comment 4 Jeff Chan 2008-02-08 00:40:21 UTC
IMO using the official TLD table should usually be preferred.
Comment 5 Sidney Markowitz 2008-02-08 01:10:00 UTC
(reply to comment 4)

> using the official TLD table should usually be preferred

I see I didn't make explicit in comment 2 what I did for the attachment 4257 [details].
That patch does what I talked about in comment 1, i.e., it fixes the table in
Util/RegistrarBoundaries.pm to match the official TLD table at
http://data.iana.org/TLD/tlds-alpha-by-domain.txt with the exception of the IDN
test entries, and it redoes the regexp in PerMsgStatus.pm to be the Regexp::List
optimized form of that table.

However, I should mention the discussion from the sa-dev list in which it was
pointed out that there are some TLDs that are defunct, with no domains or active
registrar. These are bv gb pm sj so um yt which perhaps should be rmoved from
the list because they cannot point to a valid URL or mail address, so could only
cause FPs.
Comment 6 Sidney Markowitz 2008-02-08 02:58:16 UTC
Here is a summary of comments from sa-dev that should be here for the record:

  -----------

Michael Parker said:

Would anyone object to removing .so from this list?  The .so TLD is  
basically dead and we've found that lots of bogus domains like lib*.so  
are being caught by this.  Also sometimes you'll have spammers who are  
putting in gibberish or funny punctuation and you'll get sentences  
like 'blah blah.So this is'.  It also occurs with a couple of other  
domains but .so is by far the worst.

For more info on the .so domain you can read about it here:
http://en.wikipedia.org/wiki/.so_%28domain_name%29

 ------------

To which I replied:

After reading the Wikipedia entries for the TLDs that they list as unused, I
would be in 
favor of removing all of them

  bv gb pm sj so um yt

If spammers can't use them as domains, then all including them could do would be to 
generate false positives.

The three listed as "phaseout", su tp yu, on the other hand, I would keep,
because it 
appears from the Wikipedia articles that they are in active use either during a
transition 
period or because their phaseout is being resisted.

If we can reach consensus on this, I'll make the changes to the code I committed.

 ---------------------

And then Justin Mason said:

I'm happy enough with that. btw this discussion should be on
a bug...


  -----------------
And here we are
Comment 7 Sidney Markowitz 2008-02-08 12:53:07 UTC
I have updated trunk to remove bv gb pm sj so um yt from the table and regexp
and to change the test cases to endure that they are not parsed as TLDs.

Committed revision 620000
Comment 8 Sidney Markowitz 2008-02-08 14:09:21 UTC
Created attachment 4258 [details]
Combined patch against 3.2 for bugs 5780 and 5813

This patch includes the fix for bug 5780 as well as this bug, and removes the
unused TLDs bv gb pm sj so um yt. Please use this to review both bug 5780 and
5813 for branch 3.2.
Comment 9 Justin Mason 2008-02-10 15:46:16 UTC
+1, why not ;)
Comment 10 Justin Mason 2008-03-05 04:29:58 UTC
testing new BZ instance
Comment 11 Yusuf Goolamabbas 2008-05-25 15:48:02 UTC
Hi, Should this be also in 3.2.5 given the recent push to roll out a release soon. I think it needs only one more vote
Comment 12 Justin Mason 2008-05-26 01:58:52 UTC
(In reply to comment #11)
> Hi, Should this be also in 3.2.5 given the recent push to roll out a release
> soon. I think it needs only one more vote

I'd be happy to see it go in -- I've given it my +1 -- but OTOH, it's a very big change.
Comment 13 Justin Mason 2008-06-01 03:39:00 UTC
ok, there's been no motion; it looks likely this'll wait for 3.2.6 (or more likely 3.3.0).
Comment 14 Julian Mehnle 2009-10-08 12:46:48 UTC
Just as a heads-up, .so might soon spring back into existence.
Comment 15 Mark Martinec 2010-06-11 07:21:42 UTC
Oops, sorry, retargeting back to 3.2.6. It is already in 3.3.
Should probably just be closed.
Comment 16 Karsten Bräckelmann 2010-06-11 09:48:53 UTC
Agreed, closing Resolved FIXED for 3.3.

Non-trivial patch, also depends on bug 5780 (see comment 2) to be applied to the maintenance 3.2 branch. Stalled for 2 years now.