SA Bugzilla – Full Text Bug Listing
Summary: | PORN_4 catches legitimate URLs | ||
---|---|---|---|
Product: | Spamassassin | Reporter: | vince.delvecchio |
Component: | Rules | Assignee: | SpamAssassin Developer Mailing List <dev> |
Status: | RESOLVED FIXED | ||
Severity: | normal | ||
Priority: | P2 | ||
Version: | 2.41 | ||
Target Milestone: | --- | ||
Hardware: | All | ||
OS: | Solaris | ||
Whiteboard: |
Description
vince.delvecchio
2002-09-27 13:26:02 UTC
It would be better to change the filter to recognize word boundaries instead:

\banal\b

If there are variants of any word that need to be matched, expand the expression.

Subject: Re: [SAdev] PORN_4 catches legitimate URLs

Porn 4 is a URI type rule. There won't be word boundaries in domain names/urls/etc. hence using \b is pointless.

Subject: Re: [SAdev] PORN_4 catches legitimate URLs

On Sep 27, 9:00pm, bugzilla-daemon@hughes-family.org wrote:
> http://www.hughes-family.org/bugzilla/show_bug.cgi?id=1035
>
> ------- Additional Comments From mkettler_sa@comcast.net 2002-09-27 17:55 -------
> Subject: Re: [SAdev] PORN_4 catches legitimate URLs
>
> Porn 4 is a URI type rule. There won't be word boundaries in domain
> names/urls/etc. hence using \b is pointless.

Umm... in http://cesario.rutgers.edu, there are quite a few word boundaries:
|http|://|cesario|.|rutgers|.|edu| - everyplace I put a |.

-Allen

Subject: Re: [SAdev] PORN_4 catches legitimate URLs

Yes, dots in domains are word breaks, but in the case of porno domains, the domains are usually word concatenations. it's rarely www.anal.fucking.com or anal.fucking.com, it's usually things like:

www.analfucking.com
www.analsexsluts.com

The VAST majority of porno domains will not match with the \b's in place. In fact, I doubt you'd be able to find one. I know my corpus contains zero matches for grep "\.anal\." *, does yours?

In fact, very few commercial website links contain dots in any place other than www. and .com. Some do, but not many. This kind of thing is by far more common in .edu's than anyplace else. and I doubt rutgers will allow fucking.anal.whores.rutgers.edu to exist ;)

>Umm... in http://cesario.rutgers.edu, there are quite a few word boundaries:
>|http|://|cesario|.|rutgers|.|edu| - everyplace I put a |.
>
> -Allen

Vince, thanks for doing the heavy lifting on this bug, those regexp fragments work perfectly.
That's now in CVS testing, looks like it does the trick, and once it's verified to get better results than PORN_4, it's in.

ok, fixed; beats PORN_4
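
For anyone reading this thread later, here is a rough sketch of the \b disagreement above. It uses Python's re module rather than the Perl engine SpamAssassin runs on, and the unbounded pattern is only an assumed stand-in for contrast, not the actual PORN_4 definition. The /analysis/ URL is a made-up example; the others are quoted from the comments above.

```python
import re

# Pattern proposed in the thread: the word must sit between word boundaries.
BOUNDED = re.compile(r"\banal\b", re.IGNORECASE)
# Plain substring match, for contrast only (assumption, not the real PORN_4 rule).
UNBOUNDED = re.compile(r"anal", re.IGNORECASE)

urls = [
    "http://cesario.rutgers.edu",        # quoted in the thread
    "http://www.analfucking.com",        # concatenated form cited in the thread
    "http://www.anal.fucking.com",       # dotted form the thread calls rare
    "http://www.example.com/analysis/",  # hypothetical legitimate URL
]


def classify(url):
    """Return (bounded_hit, unbounded_hit) for one URL string."""
    return (bool(BOUNDED.search(url)), bool(UNBOUNDED.search(url)))


for url in urls:
    bounded, unbounded = classify(url)
    print(f"{url:40} bounded={bounded!s:5} unbounded={unbounded}")
```

Under these assumptions both sides of the argument show up: the bounded pattern stops hitting a legitimate word such as "analysis", but it also misses the concatenated domains and only fires on the dotted form.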