Bug 1035 - PORN_4 catches legitimate URLs
Summary: PORN_4 catches legitimate URLs
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules
Version: 2.41
Hardware: All Solaris
Importance: P2 normal
Target Milestone: ---
Assignee: SpamAssassin Developer Mailing List
 
Reported: 2002-09-27 13:26 UTC by vince.delvecchio
Modified: 2002-10-29 17:56 UTC



Description vince.delvecchio 2002-09-27 13:26:02 UTC
I work for a company named Analog Devices and was somewhat distressed when a 
bulk mailing (not spam) from another department in the company triggered my 
spam filter.  The culprit was a URL of the form http://www.analog.com which 
matched "anal" in the PORN_4 test.  Because it was a mass mailing, it had a 
couple of other hits (remove me from this list, probably) and PORN_4 pushed it 
over the threshold of 5 hits.

Could you please change the filter to "|anal(?!og)|"?

I realize that this is only one test, but since it has a relatively high weight 
(2.8 in v2.41) corresponding to the fact that most porn URLs *are* spam, there 
is effectively a high cost to making a mistake and lumping the non-porn in with 
the porn.

I did a dictionary search on the other terms in PORN_4 and came up with a long 
list of words, many of which do occur in real non-porn web site addresses.  
Again, since there is a fairly high cost to making a mistake, it seems like it 
would be good to try to address some of these.

It would have been nice to match against a list of domain names to figure 
out exactly which domains use these words, but I don't have such a list, so 
I was left to Google the words and see whether any web sites with those 
words in the site name came up.  Based on that process, my list of the words 
most likely to be misclassified is:

analog analy[sz]e cluster document ecumeni(c|sm) essex illustrat* illustrious
middlesex recumbent slocum sussex

thirteen fourteen fifteen sixteen seventeen eighteen nineteen

I ran these against my own spam collection and found that a while ago I did 
get a bunch of non-porn spam with URLs of the form http://thirteen.<something> 
and http://sixteen.<something>.  I'm not sure what to do about that.  There 
were no other hits.

Here are regexp fragments for those, including all the teens.

(?<!es)(?<!dle|sus)sex
anal(?!og|y[sz])
(?<!thir|four|eigh|nine)(?<!fif|six)(?<!seven)teen
(?<!slo)cum(?!(?<=docum)ent|(?<=ecum)eni[cs]|(?<=recum)bent)
lust(?!(?<=illust)(?:rat|rious)|(?<=clust)er)
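
The fragments above can be exercised with a short script. This is a minimal sketch in Python rather than SpamAssassin's Perl; it assumes the five fragments are simply OR-ed together and matched case-insensitively, which this report does not confirm. The names `FRAGMENTS`, `COMBINED`, and `is_flagged` are hypothetical, as are the `http://www.<word>.com` test URLs.

```python
import re

# The five fragments exactly as given in the report.
FRAGMENTS = [
    r"(?<!es)(?<!dle|sus)sex",
    r"anal(?!og|y[sz])",
    r"(?<!thir|four|eigh|nine)(?<!fif|six)(?<!seven)teen",
    r"(?<!slo)cum(?!(?<=docum)ent|(?<=ecum)eni[cs]|(?<=recum)bent)",
    r"lust(?!(?<=illust)(?:rat|rious)|(?<=clust)er)",
]

# Assumption: fragments are OR-ed together and matched case-insensitively.
COMBINED = re.compile("|".join(FRAGMENTS), re.IGNORECASE)


def is_flagged(url):
    """Return True if any fragment matches anywhere in the URL."""
    return COMBINED.search(url) is not None


# Words from the reporter's "most likely" list, plus the teens.
LEGITIMATE = [
    "analog", "analyse", "analyze", "cluster", "document", "ecumenic",
    "ecumenism", "essex", "illustrate", "illustrious", "middlesex",
    "recumbent", "slocum", "sussex", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]

# Words from the raw data that the fragments do not exclude.
STILL_MATCHED = ["teenage", "sextet", "bluster"]

false_positives = [w for w in LEGITIMATE if is_flagged(f"http://www.{w}.com")]
still_flagged = [w for w in STILL_MATCHED if is_flagged(f"http://www.{w}.com")]

print("legitimate words still flagged:", false_positives)
print("other dictionary words still flagged:", still_flagged)
```

The exclusions are deliberately narrow: words outside the short list, such as "sextet" or "bluster", still match.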

I enclose the full list, but sadly, I think it may be too large to do anything 
useful with.

------------------------------------

Raw data:  /usr/dict/words against PORN_4

accumulate acumen analeptic analgesic analyses analysis analyst analytic
asexual balustrade banal bisexual bluster blustery canal canteen
circumcircle circumcise circumcision circumference circumferential
circumflex circumlocution circumpolar circumscribe circumscription
circumspect circumsphere circumstance circumstantial circumvent
circumvention cluster cryptanalysis cryptanalyst cryptanalytic
cryptanalyze cucumber cumbersome cumin cumulate cumulus document
documentary documentation ecumenic ecumenist eighteen eighteenth
encumber encumbrance erotic erotica Essex fifteen fifteenth fluster
fourteen fourteenth heterosexual homosexual honeysuckle illustrate
illustrious incumbent lackluster lesbian lust lustful lustrous lusty
mecum Middlesex modicum naughty nineteen nineteenth nymphomania
nymphomaniac panty pornographer pornography psychoanalysis psychoanalyst
psychoanalytic pussy pussycat recumbent sapsucker schoolgirl
schoolgirlish sclerotic scum seersucker seventeen seventeenth sex sextet
sextillion sexton sextuple sextuplet sexual sexy sixteen sixteenth
Slocum slut Steen succumb suck suckling Sussex taboo talcum tecum teen
teenage teensy thirteen thirteenth unisex whore
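
A scan like the one that produced this list could be sketched as follows. This is not the script actually used; the full PORN_4 term list is not reproduced in this report, so `TERMS` holds only the five substrings that the fragments above address, and `scan_words` and `SAMPLE` are hypothetical names for illustration.

```python
import re

# Assumption: only the five substrings addressed by the report's fragments.
# The real PORN_4 rule contains more terms than these.
TERMS = ["anal", "cum", "sex", "lust", "teen"]
TERM_RE = re.compile("|".join(map(re.escape, TERMS)), re.IGNORECASE)


def scan_words(words):
    """Return the words that contain any term as a substring."""
    return [w for w in words if TERM_RE.search(w)]


# A small inline sample stands in for a dictionary file.
SAMPLE = ["analog", "Essex", "cucumber", "cluster", "canteen", "apple", "router"]
hits = scan_words(SAMPLE)
print(hits)
```

To run it against a real word list, pass the stripped lines of a dictionary file such as /usr/dict/words to `scan_words` instead of `SAMPLE`.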

I have categorized these as follows:

These words have no porn-related meaning, and these letter sequences
seem unlikely to appear in a context referring to porn.

  accumulate acumen analeptic analgesic analy[sz]* analyst analytic
  balustrade bluster blustery circumcircle circumference circumferential
  circumflex circumlocution circumpolar circumscribe circumscription
  circumspect circumsphere circumstance circumstantial circumvent
  circumvention cluster cucumber cumbersome cumulate cumulus document
  ecumeni[cs]* encumber encumbrance fluster honeysuckle illustrat*
  illustrious incumbent lackluster lustrous Middlesex modicum recumbent
  sapsucker sclerotic seersucker sextet sextillion sexton sextuple
  Slocum succumb suckling Sussex

These words have meanings primarily or exclusively unrelated to porn,
but it is possible that a URL matching one of these strings does refer
to porn (usually because we might match parts of two words).

  banal canal canteen cumin Essex mecum pussycat scum Steen talcum
  tecum teensy

There are many non-porn web sites containing these number names, but it
is possible they could be used in a porn context.

  thirteen fourteen fifteen sixteen seventeen eighteen nineteen

These words have perfectly reasonable non-sexual meanings, but they seem
to also frequently appear in porn site names (certainly the words appear
frequently in porn spam).

  naughty pussy schoolgirl suck taboo teen teenage

These words have only sexual meanings, although many of them could still
be used in a non-porn context.

  asexual bisexual heterosexual homosexual lesbian circumcise circumcision
  nymphomania nymphomaniac panty pornographer pornography sex sexual sexy
  slut unisex whore
Comment 1 Daniel Quinlan 2002-09-27 14:55:12 UTC
It would be better to change the filter to recognize word boundaries instead:

  \banal\b

If there are variants of any word that need to be matched, expand the expression.
Comment 2 Matt Kettler 2002-09-27 17:55:58 UTC
Subject: Re: [SAdev]  PORN_4 catches legitimate URLs

Porn 4 is a URI type rule. There won't be word boundaries in domain 
names/urls/etc. hence using \b is pointless.

Comment 3 Allen Smith 2002-09-27 18:05:12 UTC
Subject: Re: [SAdev]  PORN_4 catches legitimate URLs

On Sep 27,  9:00pm, bugzilla-daemon@hughes-family.org wrote:
> http://www.hughes-family.org/bugzilla/show_bug.cgi?id=1035
> 
> ------- Additional Comments From mkettler_sa@comcast.net  2002-09-27 17:55 -------
> Subject: Re: [SAdev]  PORN_4 catches legitimate URLs
> 
> Porn 4 is a URI type rule. There won't be word boundaries in domain 
> names/urls/etc. hence using \b is pointless.

Umm... in http://cesario.rutgers.edu, there are quite a few word boundaries: 
|http|://|cesario|.|rutgers|.|edu| - everyplace I put a |.

	-Allen

Comment 4 Matt Kettler 2002-09-27 18:38:46 UTC
Subject: Re: [SAdev]  PORN_4 catches legitimate URLs

Yes, dots in domains are word breaks, but in the case of porno domains, the 
domains are usually word concatenations. It's rarely www.anal.fucking.com 
or anal.fucking.com; it's usually things like:

www.analfucking.com
www.analsexsluts.com

The VAST majority of porno domains will not match with the \b's in place. 
In fact, I doubt you'd be able to find one. I know my corpus contains zero 
matches for grep "\.anal\." *, does yours?
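
The trade-off discussed in this thread can be checked directly. The sketch below is in Python rather than the Perl that SpamAssassin uses; it compares the word-boundary pattern from Comment 1, the bare substring, and the lookahead proposed in the description. The host `anal.example.com` is hypothetical and exists only to show a case where `\b` does match; the other two URLs come from this report.

```python
import re

PATTERNS = {
    "bounded": re.compile(r"\banal\b"),       # Comment 1's suggestion
    "substring": re.compile(r"anal"),         # bare substring match
    "lookahead": re.compile(r"anal(?!og)"),   # the description's proposal
}

URLS = [
    "http://www.analog.com",         # legitimate URL from the description
    "http://www.analsexsluts.com",   # concatenated example from Comment 4
    "http://anal.example.com",       # hypothetical: term as its own label
]

# results[url][name] is True when that pattern matches the URL.
results = {
    url: {name: bool(rx.search(url)) for name, rx in PATTERNS.items()}
    for url in URLS
}

for url, row in results.items():
    print(url, row)
```

The word-boundary version avoids the false positive on analog.com but also misses the concatenated domain, while the lookahead version handles both.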

In fact, very few commercial website links contain dots in any place other 
than www. and .com. Some do, but not many. This kind of thing is by far 
more common in .edu's than anyplace else, and I doubt rutgers will allow 
fucking.anal.whores.rutgers.edu to exist ;)


>Umm... in http://cesario.rutgers.edu, there are quite a few word boundaries:
>|http|://|cesario|.|rutgers|.|edu| - everyplace I put a |.
>
>         -Allen

Comment 5 Justin Mason 2002-10-23 15:18:33 UTC
Vince,

Thanks for doing the heavy lifting on this bug; those regexp fragments
work perfectly.  The revised rule is now in CVS for testing and looks like
it does the trick.  Once it's verified to get better results than PORN_4,
it's in.
Comment 6 Justin Mason 2002-10-30 02:56:22 UTC
OK, fixed; the revised rule beats PORN_4.