SA Bugzilla – Bug 5813
[review] several TLDs are not parsed by URI text scanner in PerMsgStatus.pm
Last modified: 2010-06-11 09:48:53 UTC
"Schemeless" URI text strings such as example.com which do not begin with http: or https: or ftp: or mailto: www. or ftp. are checked to see if they have a domain that is a valid TLD. That check is done in PerMsgStatus.pm with a regexp that has a comment saying that it was generated from a list of TLDs in Util/RegistrarBoundaries.pm The following URIs are not detected by that regexp, even though they are in the table in Util/RegistrarBoundaries.pm. Prefixing these URIs with 'www.' so they are processed through that table does cause them to be detected: example.asia example.cat example.cv example.cx example.cy example.cz example.jobs example.mobi example.tel example.travel example.xxx
Looking at the list I posted in the description I noticed that .xxx is there even though it is not a valid TLD. That led me to compare the table in Util/RegistrarBoundaries.pm with the official list in http://data.iana.org/TLD/tlds-alpha-by-domain.txt I haven't done a careful diff, but I do notice that we include .cs which is not an official TLD, as wel as xxx, so the table is not correct. Does anyone have any objection to my simply snarfing the official list into Util/RegistrarBoundaries.pm, minus the 11 IDN .test TLDS (Those are the ones that look like XN--0ZWM56D and are in the root table only for an IDN test that was recently run)? Also, the comment in PerMsgStatus says that the regexp was generated using Regexp::Optimizer, but the Regexp::Optimizer suggests using Regexp::List when you are dealing with a plain list of words with no metacharacters, which is what we have here. Does anyone know a reason why we should stick with Regexp::Optimizer?
Created attachment 4257 [details] Fix both table and regexp to have correct list of TLDs and add all of them to uri_text.t tests This has been committed to trunk revision 619753. I've attached the patch here and I'm targtting this bug to 3.2.5 to make it easy to apply to branch 3.2 if we decide to apply the fix for bug 5780 to branch 3.2. If we do not apply 5780 to branch 3.2, this bug will be in the branch but a new patch against 3.2 would have to be worked up.
I've decided to flag this for review and voting, but please look at bug 5780 first, as it has to be voted and committed before this one can.
IMO using the official TLD table should usually be preferred.
(reply to comment 4) > using the official TLD table should usually be preferred I see I didn't make explicit in comment 2 what I did for the attachment 4257 [details]. That patch does what I talked about in comment 1, i.e., it fixes the table in Util/RegistrarBoundaries.pm to match the official TLD table at http://data.iana.org/TLD/tlds-alpha-by-domain.txt with the exception of the IDN test entries, and it redoes the regexp in PerMsgStatus.pm to be the Regexp::List optimized form of that table. However, I should mention the discussion from the sa-dev list in which it was pointed out that there are some TLDs that are defunct, with no domains or active registrar. These are bv gb pm sj so um yt which perhaps should be rmoved from the list because they cannot point to a valid URL or mail address, so could only cause FPs.
Here is a summary of comments from sa-dev that should be here for the record: ----------- Michael Parker said: Would anyone object to removing .so from this list? The .so TLD is basically dead and we've found that lots of bogus domains like lib*.so are being caught by this. Also sometimes you'll have spammers who are putting in gibberish or funny punctuation and you'll get sentences like 'blah blah.So this is'. It also occurs with a couple of other domains but .so is by far the worst. For more info on the .so domain you can read about it here: http://en.wikipedia.org/wiki/.so_%28domain_name%29 ------------ To which I replied: After reading the Wikipedia entries for the TLDs that they list as unused, I would be in favor of removing all of them bv gb pm sj so um yt If spammers can't use them as domains, then all including them could do would be to generate false positives. The three listed as "phaseout", su tp yu, on the other hand, I would keep, because it appears from the Wikipedia articles that they are in active use either during a transition period or because their phaseout is being resisted. If we can reach consensus on this, I'll make the changes to the code I committed. --------------------- And then Justin Mason said: I'm happy enough with that. btw this discussion should be on a bug... ----------------- And here we are
I have updated trunk to remove bv gb pm sj so um yt from the table and regexp and to change the test cases to endure that they are not parsed as TLDs. Committed revision 620000
Created attachment 4258 [details] Combined patch against 3.2 for bugs 5780 and 5813 This patch includes the fix for bug 5780 as well as this bug, and removes the unused TLDs bv gb pm sj so um yt. Please use this to review both bug 5780 and 5813 for branch 3.2.
+1, why not ;)
testing new BZ instance
Hi, Should this be also in 3.2.5 given the recent push to roll out a release soon. I think it needs only one more vote
(In reply to comment #11) > Hi, Should this be also in 3.2.5 given the recent push to roll out a release > soon. I think it needs only one more vote I'd be happy to see it go in -- I've given it my +1 -- but OTOH, it's a very big change.
ok, there's been no motion; it looks likely this'll wait for 3.2.6 (or more likely 3.3.0).
Just as a heads-up, .so might soon spring back into existence.
Oops, sorry, retargeting back to 3.2.6. It is already in 3.3. Should probably just be closed.
Agreed, closing Resolved FIXED for 3.3. Non-trivial patch, also depends on bug 5780 (see comment 2) to be applied to the maintenance 3.2 branch. Stalled for 2 years now.