SA Bugzilla – Bug 1035
PORN_4 catches legitimate URLs
Last modified: 2002-10-29 17:56:22 UTC
I work for a company named Analog Devices and was somewhat distressed when a bulk mailing (not spam) from another department in the company triggered my spam filter. The culprit was a URL of the form http://www.analog.com, which matched "anal" in the PORN_4 test. Because it was a mass mailing, it had a couple of other hits (remove me from this list, probably) and PORN_4 pushed it over the threshold of 5 hits. Could you please change the filter to "|anal(?!og)|"?

I realize that this is only one test, but since it has a relatively high weight (2.8 in v2.41) corresponding to the fact that most porn URLs *are* spam, there is effectively a high cost to making a mistake and lumping the non-porn in with the porn.

I did a dictionary search on the other terms in PORN_4 and came up with a long list of words, many of which do occur in real non-porn web site addresses. Again, since there is a fairly high cost to making a mistake, it seems like it would be good to try to address some of these. It would have been nice if I could match against a list of domain names to figure out exactly what domain names use these words, but I don't have such a list, so I was left to Google on the words and see if any web sites with those words in the site name came up. Based on that process, my list of most likely to be misanalyzed is:

analog analy[sz]e cluster document ecumeni(c|sm) essex illustrat* illustrious middlesex recumbent slocum sussex thirteen fourteen fifteen sixteen seventeen eighteen nineteen

I ran these on my own spam collection, and found that a while ago I did get a bunch of non-porn spam with URLs http://thirteen.<something> and also http://sixteen.<something>. Not sure what to do about that. No other hits.

Here are regexp fragments for those, including all the teens.
(?<!es)(?<!dle|sus)sex
anal(?!og|y[sz])
(?<!thir|four|eigh|nine)(?<!fif|six)(?<!seven)teen
(?<!slo)cum(?!(?<=docum)ent|(?<=ecum)eni[cs]|(?<=recum)bent)
lust(?!(?<=illust)(?:rat|rious)|(?<=clust)er)

I enclose the full list, but sadly, I think it may be too large to do anything useful with.

------------------------------------
Raw data: /usr/dict/words against PORN_4

accumulate acumen analeptic analgesic analyses analysis analyst analytic asexual balustrade banal bisexual bluster blustery canal canteen circumcircle circumcise circumcision circumference circumferential circumflex circumlocution circumpolar circumscribe circumscription circumspect circumsphere circumstance circumstantial circumvent circumvention cluster cryptanalysis cryptanalyst cryptanalytic cryptanalyze cucumber cumbersome cumin cumulate cumulus document documentary documentation ecumenic ecumenist eighteen eighteenth encumber encumbrance erotic erotica Essex fifteen fifteenth fluster fourteen fourteenth heterosexual homosexual honeysuckle illustrate illustrious incumbent lackluster lesbian lust lustful lustrous lusty mecum Middlesex modicum naughty nineteen nineteenth nymphomania nymphomaniac panty pornographer pornography psychoanalysis psychoanalyst psychoanalytic pussy pussycat recumbent sapsucker schoolgirl schoolgirlish sclerotic scum seersucker seventeen seventeenth sex sextet sextillion sexton sextuple sextuplet sexual sexy sixteen sixteenth Slocum slut Steen succumb suck suckling Sussex taboo talcum tecum teen teenage teensy thirteen thirteenth unisex whore

I have categorized these as follows:

These words have no porn-related meaning, and these letter sequences seem unlikely to appear in a context referring to porn.
accumulate acumen analeptic analgesic analy[sz]* analyst analytic balustrade bluster blustery circumcircle circumference circumferential circumflex circumlocution circumpolar circumscribe circumscription circumspect circumsphere circumstance circumstantial circumvent circumvention cluster cucumber cumbersome cumulate cumulus document ecumeni[cs]* encumber encumbrance fluster honeysuckle illustrat* illustrious incumbent lackluster lustrous Middlesex modicum recumbent sapsucker sclerotic seersucker sextet sextillion sexton sextuple Slocum succumb suckling Sussex

These words have meanings primarily or exclusively unrelated to porn, but it is possible that a URL matching one of these strings does refer to porn (usually because we might match parts of two words).

banal canal canteen cumin Essex mecum pussycat scum Steen talcum tecum teensy

There are many non-porn web sites containing these number names, but it is possible they could be used in a porn context.

thirteen fourteen fifteen sixteen seventeen eighteen nineteen

These words have perfectly reasonable non-sexual meanings, but they seem to also frequently appear in porn site names (certainly the words appear frequently in porn spam).

naughty pussy schoolgirl suck taboo teen teenage

These words have only sexual meanings, although many of them could still be used in a non-porn context.

asexual bisexual heterosexual homosexual lesbian circumcise circumcision nymphomania nymphomaniac panty pornographer pornography sex sexual sexy slut unisex whore
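As a rough check of the five regexp fragments in the report, the sketch below joins them into one alternation and runs them over the "most likely to be misanalyzed" list. It is written in Python rather than Perl, on the assumption that the two engines treat these particular lookaheads and fixed-width lookbehinds the same way. The names FRAGMENTS, PATTERN, LEGIT and hits() are made up for the illustration, and the rest of the real PORN_4 pattern is not reproduced because the report does not quote it.

```python
import re

# The five fragments exactly as quoted in the report.
# Assumption: joined with "|" here; how they are combined inside the
# real PORN_4 rule is not shown in this bug.
FRAGMENTS = [
    r"(?<!es)(?<!dle|sus)sex",
    r"anal(?!og|y[sz])",
    r"(?<!thir|four|eigh|nine)(?<!fif|six)(?<!seven)teen",
    r"(?<!slo)cum(?!(?<=docum)ent|(?<=ecum)eni[cs]|(?<=recum)bent)",
    r"lust(?!(?<=illust)(?:rat|rious)|(?<=clust)er)",
]
PATTERN = re.compile("|".join(FRAGMENTS), re.IGNORECASE)

# Words from the reporter's "most likely to be misanalyzed" list,
# with the bracketed variants spelled out.
LEGIT = [
    "analog", "analyse", "analyze", "cluster", "document",
    "ecumenic", "ecumenism", "essex", "illustrate", "illustrious",
    "middlesex", "recumbent", "slocum", "sussex",
    "thirteen", "fourteen", "fifteen", "sixteen",
    "seventeen", "eighteen", "nineteen",
]


def hits(text):
    """Return True if any of the five fragments matches inside text."""
    return PATTERN.search(text) is not None


if __name__ == "__main__":
    for word in LEGIT:
        print(word, hits(word))
    # The URL that started the report, plus two words that should still hit.
    for text in ["http://www.analog.com", "teenage", "canal"]:
        print(text, hits(text))
```

With these fragments none of the listed words match, and neither does http://www.analog.com. Words from the other categories, such as "teenage" and "canal", still match, which is consistent with the reporter's categorization above.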
It would be better to change the filter to recognize word boundaries instead:

\banal\b

If there are variants of any word that need to be matched, expand the expression.
Subject: Re: [SAdev] PORN_4 catches legitimate URLs

Porn 4 is a URI type rule. There won't be word boundaries in domain names/urls/etc. hence using \b is pointless.
Subject: Re: [SAdev] PORN_4 catches legitimate URLs

On Sep 27, 9:00pm, bugzilla-daemon@hughes-family.org wrote:
> http://www.hughes-family.org/bugzilla/show_bug.cgi?id=1035
>
> ------- Additional Comments From mkettler_sa@comcast.net 2002-09-27 17:55 -------
> Subject: Re: [SAdev] PORN_4 catches legitimate URLs
>
> Porn 4 is a URI type rule. There won't be word boundaries in domain
> names/urls/etc. hence using \b is pointless.

Umm... in http://cesario.rutgers.edu, there are quite a few word boundaries:

|http|://|cesario|.|rutgers|.|edu| - everyplace I put a |.

-Allen
Subject: Re: [SAdev] PORN_4 catches legitimate URLs

Yes, dots in domains are word breaks, but in the case of porno domains, the domains are usually word concatenations. It's rarely www.anal.fucking.com or anal.fucking.com; it's usually things like:

www.analfucking.com
www.analsexsluts.com

The VAST majority of porno domains will not match with the \b's in place. In fact, I doubt you'd be able to find one. I know my corpus contains zero matches for grep "\.anal\." *, does yours?

In fact, very few commercial website links contain dots in any place other than www. and .com. Some do, but not many. This kind of thing is far more common in .edu's than anyplace else, and I doubt rutgers will allow fucking.anal.whores.rutgers.edu to exist ;)

>Umm... in http://cesario.rutgers.edu, there are quite a few word boundaries:
>|http|://|cesario|.|rutgers|.|edu| - everyplace I put a |.
>
> -Allen
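Both points in this exchange can be seen in a small sketch, again in Python and assuming its \b behaves like Perl's for plain ASCII URLs. The split shows where the word/non-word transitions fall in Allen's example, and the two patterns compare the suggested \banal\b with a bare substring match. The names pieces(), BOUNDED and UNBOUNDED are invented for the illustration, anal.example.com is a hypothetical dotted host, and the bare "anal" only stands in for the unanchored term, not the full PORN_4 rule.

```python
import re

# The word-boundary form suggested earlier in the thread.
BOUNDED = re.compile(r"\banal\b")
# A bare substring match, standing in for the unanchored term
# (assumption: the real PORN_4 pattern is longer and is not quoted here).
UNBOUNDED = re.compile(r"anal")


def pieces(url):
    """Split a URL into alternating runs of word and non-word characters.

    A \\b boundary sits at every join between two neighbouring pieces.
    """
    return re.findall(r"\w+|\W+", url)


if __name__ == "__main__":
    print(pieces("http://cesario.rutgers.edu"))

    urls = [
        "http://www.analog.com",        # legitimate URL from the report
        "http://www.analsexsluts.com",  # concatenated name quoted in the thread
        "http://anal.example.com",      # hypothetical dotted form
    ]
    for url in urls:
        print(url, bool(BOUNDED.search(url)), bool(UNBOUNDED.search(url)))
```

The split reproduces Allen's |http|://|cesario|.|rutgers|.|edu| breakdown, so boundaries do exist in URLs. The bounded pattern only fires on the dotted form, though: it skips www.analog.com as intended, but it also skips the concatenated name, which is the objection raised in the reply.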
Vince, thanks for doing the heavy lifting on this bug; those regexp fragments work perfectly. That's now in CVS testing and looks like it does the trick. Once it's verified to get better results than PORN_4, it's in.
ok, fixed; beats PORN_4