SA Bugzilla – Bug 5292
URIDNSBL erroneously matches substrings of words with accents
Last modified: 2019-08-21 09:21:09 UTC
Using URIDNSBL, the string 'Cinéma.ca' matches the domain ma.ca.
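For illustration only (the real parser is Perl code in SpamAssassin, not this), a naive schemeless-domain regex sketch shows why the accented character produces the false match: 'é' falls outside the ASCII character class, so matching simply restarts right after it and the trailing substring looks like a valid domain. The pattern and TLD list here are made up for the example.

```python
import re

# Hypothetical naive pattern, similar in spirit to a schemeless-domain
# scanner: ASCII letters/digits/dots/hyphens followed by a known TLD.
NAIVE = re.compile(r'[A-Za-z0-9.-]+\.(?:ca|com|net)\b')

# 'é' is not in the character class, so the scan restarts after it,
# and the trailing substring "ma.ca" looks like a complete domain.
print(NAIVE.search('Cinéma.ca').group())  # → ma.ca
```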
Does 'Cinéma.ca' work as a link in any mail user agent?
FWIW, this doesn't really have anything to do with URIDNSBL; it's all about the URI text parser looking for raw domains. URIDNSBL just takes that information and queries for it; it's not involved in the actual parsing.
I'll let you guys specify the appropriate component to report the bug against (spamassassin?). As for 'Cinéma.ca' being a link in MUAs: not in Thunderbird, and Outlook only makes a link if it is prefixed by http[s]. While I'm at it (this could be a separate bug or feature request): URLs of the form 'http://www.startof adomain.com', broken across a line, shouldn't match adomain.com as they do now. This is causing quite a few false positives with multi.uribl.com.
A few other FPs: "QuébecRencontres.com" matches becrencontres.com. A weird one: "He told me.well.in fact, he didn't say much" matches well.in. (Why does the HTML part contain "me...well..." while the dots vanished from the text version? A bug in the Hotmail composer, I suppose.) For the latter case, what about requiring "www." when there's no "http://"? I don't think MUAs make links out of URIs that start with neither http nor www.
(In reply to comment #4)
> A weird one: "He told me.well.in fact, he didn't say much" matches well.in.
> For the latter case, what about requiring "www." when there's no "http://"?
> I don't think MUAs make links out of URIs that start with neither http nor www.

The issue is spam that says "type gotofoo.com in your browser" ... we're trying to catch that. What this comes down to is:

- do we limit ourselves to www.* strings?
- do we limit ourselves to old-school TLDs (.com, .net, etc.)?
- do we limit ourselves to obvious URLs only (https?://)?
- or do we not bother, and have people complain that we don't catch this type of thing?
(In reply to comment #5)
> What this comes down to is:
>
> - do we limit ourselves to www.* strings?
> - do we limit ourselves to old-school TLDs (.com, .net, etc.)?
> - do we limit ourselves to obvious URLs only (https?://)?
> - or do we not bother, and have people complain that we don't catch this type of thing?

A balance of each? Catching http://anything, or www. plus an old-school TLD, while validating that the previous line doesn't end with the start of the URL (in case of an improper line break), would be nice. I hate spam as much as you guys, but false positives are also very annoying.
Improved with the latest commit: the schemeless parser will now only match pure alphanumeric strings.

Sending        spamassassin-3.4/lib/Mail/SpamAssassin/PerMsgStatus.pm
Sending        trunk/lib/Mail/SpamAssassin/PerMsgStatus.pm
Transmitting file data ..done
Committing transaction...
Committed revision 1865612.
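The fix as described ("only match pure alphanumeric") can be sketched like this in Python; the actual change is Perl in PerMsgStatus.pm, and the function and pattern names below are invented for the example. The key idea: reject a candidate token outright if it contains anything beyond ASCII alphanumerics, dots, and hyphens, instead of letting a match start mid-token after the offending character.

```python
import re

# Token must consist solely of ASCII alphanumerics, dots and hyphens.
PURE = re.compile(r'[A-Za-z0-9.-]+')
# Very rough domain shape: two or more dot-separated alphanumeric labels.
DOMAIN = re.compile(r'[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?'
                    r'(?:\.[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?)+')

def schemeless_domains(text):
    """Extract domain-like tokens, dropping any token that contains
    characters outside the pure-alphanumeric set, so 'Cinéma.ca' can
    no longer yield the substring 'ma.ca'."""
    found = []
    for token in text.split():
        if not PURE.fullmatch(token):
            continue  # e.g. 'Cinéma.ca' is rejected whole because of 'é'
        if DOMAIN.fullmatch(token):
            found.append(token)
    return found

print(schemeless_domains('Cinéma.ca vs ma.ca'))  # → ['ma.ca']
```

A standalone 'ma.ca' still matches (it may genuinely be a domain), but the accented word no longer contributes a bogus substring hit.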