Bug 5292 - URIDNSBL erroneously matches substrings of words with accents
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Plugins
Version: SVN Trunk (Latest Devel Version)
Hardware: PC Linux
Importance: P5 normal
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
Reported: 2007-01-10 11:52 UTC by Martin Lathoud
Modified: 2019-08-21 09:21 UTC
Description Martin Lathoud 2007-01-10 11:52:48 UTC
using URIDNSBL,
the string Cinéma.ca matches domain ma.ca.
Comment 1 Justin Mason 2007-01-10 12:05:30 UTC
does 'Cinéma.ca' work as a link in any mail user-agent?
Comment 2 Theo Van Dinter 2007-01-10 12:13:53 UTC
fwiw, this doesn't really have anything to do with uridnsbl.  it's all about the
uri text parser looking for raw domains.  uridnsbl would take that information
and query for it, but it's not involved in actual parsing.
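
The failure mode described in this comment can be reproduced with a minimal sketch. The pattern below is a hypothetical, simplified stand-in for a raw-domain text parser (it is not SpamAssassin's actual regex, and the TLD list is an assumption chosen to cover the examples in this report). Because the label class only accepts ASCII characters, scanning restarts right after an accented letter and reports the tail of the word as a domain.

```python
import re

# Hypothetical, simplified pattern: an ASCII label followed by one of a few TLDs.
# This is NOT SpamAssassin's regex; it only illustrates the reported behaviour.
NAIVE_DOMAIN = re.compile(r"[a-z0-9-]+\.(?:com|net|org|ca|in)\b", re.IGNORECASE)

def naive_domains(text):
    """Return every substring that looks like a bare ASCII domain, lowercased."""
    return [m.group(0).lower() for m in NAIVE_DOMAIN.finditer(text)]

for sample in ("Cinéma.ca", "QuébecRencontres.com"):
    print(sample, "->", naive_domains(sample))
```

With this assumed pattern, "Cinéma.ca" yields ma.ca and "QuébecRencontres.com" yields becrencontres.com, the same substrings reported in this bug.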
Comment 3 Martin Lathoud 2007-01-10 12:25:31 UTC
I'll let you guys pick the appropriate component to file this bug under
(spamassassin?). As for Cinéma.ca being a link in MUAs: not in Thunderbird.
Outlook makes a link if it is prefixed by http[s]. While I'm at it (could be
another bug or feature request), URLs of the form http://www.startof
adomain.com
shouldn't match adomain.com as they do now. It's causing quite a few FPs with
multi.uribl.com.
Comment 4 Martin Lathoud 2007-01-12 09:04:24 UTC
a few other FPs:
"QuébecRencontres.com" matches becrencontres.com
a weird one:
"He told me.well.in fact, he didn't say much" matches well.in (why is the html
code containing me...well... but the dots vanished from the txt version? a bug
in hotmail composer I suppose).
For the last case, what about requiring www. when there's no http://? I don't
think MUAs make links out of URIs that have neither http nor www.
Comment 5 Theo Van Dinter 2007-01-12 14:59:49 UTC
(In reply to comment #4)
> a weird one:
> "He told me.well.in fact, he didn't say much" matches well.in (why is the html
> code containing me...well... but the dots vanished from the txt version? a bug
> in hotmail composer I suppose).
> For the last case, what about requiring www. when there's no http://? I don't
> think MUAs make links out of URIs that have neither http nor www.

The issue is spam that says "type gotofoo.com in your browser" ... we're trying
to catch that.

What this comes down to is:

- do we limit ourselves to www.* strings?
- do we limit ourselves to old-school TLDs? (.com, .net, etc.)
- do we limit ourselves to obvious URLs only (https?://)
- do we not bother and have people complain that we don't catch this type of thing?
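
The trade-off between the first three options can be sketched as follows. The patterns are hypothetical, one per option, and are not taken from SpamAssassin; the policy names are invented for this illustration. The `re.ASCII` flag is an assumption that mimics a matcher treating accented letters as non-word characters.

```python
import re

FLAGS = re.IGNORECASE | re.ASCII

# Hypothetical patterns, one per option in the list above.
POLICIES = {
    "www_only": re.compile(r"\bwww\.[a-z0-9-]+(?:\.[a-z0-9-]+)+", FLAGS),
    "old_school_tld": re.compile(r"\b[a-z0-9-]+\.(?:com|net|org)\b", FLAGS),
    "scheme_only": re.compile(r"\bhttps?://[a-z0-9-]+(?:\.[a-z0-9-]+)+", FLAGS),
}

def matches(policy, text):
    """Return the substrings that the named policy would treat as URIs."""
    return [m.group(0) for m in POLICIES[policy].finditer(text)]

# Sample strings taken from this bug report.
SAMPLES = [
    "type gotofoo.com in your browser",
    "He told me.well.in fact, he didn't say much",
    "QuébecRencontres.com",
]

for name in POLICIES:
    for sample in SAMPLES:
        print(name, repr(sample), matches(name, sample))
```

Under these assumed patterns, www_only and scheme_only both miss "gotofoo.com", which is the spam case this comment wants to catch. old_school_tld catches it and no longer matches well.in, but it still extracts becRencontres.com from the accented word, so restricting TLDs alone does not address the accent problem.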
Comment 6 Martin Lathoud 2007-01-12 16:05:20 UTC
(In reply to comment #5)
> What this comes down to is:
> 
> - do we limit ourselves to www.* strings?
> - do we limit ourselves to old-school TLDs? (.com, .net, etc.)
> - do we limit ourselves to obvious URLs only (https?://)
> - do we not bother and have people complain that we don't catch this type of thing?

A balance of each? Catching http://anything or www.anything with an old-school
TLD, and checking that the match is not the tail of a name that started at the
end of the previous line (in case of an improper line break), would be nice.
I hate spam as much as you guys but false positives are also very annoying.
Comment 7 Henrik Krohns 2019-08-21 09:21:09 UTC
Improved with the latest commit; the schemeless parser will only match pure alphanumeric now.

Sending        spamassassin-3.4/lib/Mail/SpamAssassin/PerMsgStatus.pm
Sending        trunk/lib/Mail/SpamAssassin/PerMsgStatus.pm
Transmitting file data ..done
Committing transaction...
Committed revision 1865612.
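
One possible reading of "only match pure alphanumeric" is sketched below. This is an assumption about the intent, not the code from revision 1865612: a schemeless candidate is accepted only when the whole whitespace-delimited token consists of ASCII alphanumeric labels (hyphens allowed inside a label) separated by dots. The TLD list and the function name are hypothetical.

```python
import re

# Hypothetical strict pattern: the entire token must be ASCII labels plus a TLD.
PURE_DOMAIN = re.compile(
    r"(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:com|net|org|ca|in)",
    re.IGNORECASE | re.ASCII,
)

def strict_domains(text):
    """Return schemeless domains whose whole token is purely ASCII alphanumeric."""
    found = []
    for token in text.split():
        # Drop surrounding punctuation such as a trailing comma or period.
        candidate = token.strip(".,;:!?()\"'")
        if PURE_DOMAIN.fullmatch(candidate):
            found.append(candidate.lower())
    return found

for sample in ("Cinéma.ca", "QuébecRencontres.com", "type gotofoo.com in your browser"):
    print(sample, "->", strict_domains(sample))
```

In this sketch the two accented examples from this report produce no match, while "gotofoo.com" is still caught. A string such as me.well.in is itself purely alphanumeric, so a check of this kind alone would not reject it.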