Bug 1375 - do RBL look-ups on URLs
Summary: do RBL look-ups on URLs
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Rules (show other bugs)
Version: unspecified
Hardware: Other other
Importance: P2 normal
Target Milestone: 3.0.0
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Duplicates: 2001 2948 (view as bug list)
Depends on:
Blocks:
 
Reported: 2003-01-14 12:11 UTC by Robert J. Accettura
Modified: 2004-03-02 09:02 UTC (History)
8 users (show)



Attachment Type Modified Status Actions Submitter/CLA Status
test implementation patch None Daniel Quinlan [HasCLA]
Perform DNSBL tests on spamvertised URLs' IP addresses text/plain None Florian L. Klein [NoCLA]
Perform DNSBL tests on spamvertised URLs' IP addresses patch None Florian L. Klein [NoCLA]

Description Robert J. Accettura 2003-01-14 12:11:07 UTC
SpamAssassin should check image links in HTML for IP addresses, which are
commonly used by spammers.  Legitimate newsletters mostly use their full domain
names, and more commonly use a service like Akamai to keep HTML mail speedy.

Individuals making legitimate HTML mail with images tend to use the full URL
since they host with places like geocities, .mac, etc. and don't know the IP
address.
Comment 1 Daniel Quinlan 2003-01-14 17:32:08 UTC
> and more commonly use a service like akamai to keep HTML mail speedy.

I don't think that's true.

Also, we already test for numeric IP addresses in URLs.  There's a rule for that.

This does suggest to me the idea of doing RBL look-ups for URLs in emails.
That might work really well and I'd be in favor of further experimentation
on that idea.  I am changing the Summary accordingly.
Comment 2 Allen Smith 2003-01-16 16:47:08 UTC
Dan:

>This does suggest to me the idea of doing RBL look-ups for URLs in emails.
>That might work really well and I'd be in favor of further experimentation
>on that idea.  I am changing the Summary accordingly.

Agreed. IIRC, at least one milter in use checks URLs versus
abuse.rfc-ignorant.org, and ex.dnsbl.org (returning 127.0.0.3 especially) and
in.dnsbl.org (checking vs 127.0.0.[2-46]) should work well.

	-Allen
Comment 3 Michael Bell 2003-02-24 20:20:41 UTC
Neat idea!

However, I can imagine some deliberately hidden URLs designed to maliciously 
cause SA to have long DNS timeouts. Just a thought... I guess you guys already 
have to worry about that from RECEIVED headers.
Comment 4 Daniel Quinlan 2003-02-24 21:08:52 UTC
Subject: Re: [SAdev]  do RBL look-ups on URLs

> However I can imagine some deliberately hidden URL's designed to
> maliciously cause SA to have long DNS timeouts. Just a thought...I
> guess you guys already have to worry bout that from RECEIVED headers.

Well, this idea might not work so well.  For one thing, I'm not sure
what percentage of URLs point at known spammer sites.  Some probably
point to temporary free space and such.

The DNS timeout problem shouldn't be a major issue.  We can cap the
number to be tested and we won't be doing reverse lookups, just
blacklist lookups.  Long timeouts are worst for MX/A record reverse
lookups.  (I disabled that rule in the pre-built SAproxy since it caused
so many problems with timeouts it wasn't worth it.)

Daniel

Comment 5 Daniel Quinlan 2003-05-04 02:15:15 UTC
Whee.  Experimenting with an initial implementation of this now.

Here's my initial test set.  I won't check all of these in (some don't
work so well), just letting you know which ones I'm trying so far.

header T_URI_IN_RFCI_ABUSE      rbleval:check_rbl_uris('rfci-abuse', 'abuse.rfc-ignorant.org.')
header T_URI_IN_RFCI_DSN        rbleval:check_rbl_uris('rfci-dsn', 'dsn.rfc-ignorant.org.')
header T_URI_IN_RFCI_POSTMASTER rbleval:check_rbl_uris('rfci-postmaster', 'postmaster.rfc-ignorant.org.')
header T_URI_IN_RFCI_WHOIS      rbleval:check_rbl_uris('rfci-whois', 'whois.rfc-ignorant.org.')
header T_URI_IN_EX_DNSBL        rbleval:check_rbl_uris('ex-dnsbl', 'ex.dnsbl.org.')
header T_URI_IN_EX_DNSBL_EASYDNS        rbleval:check_rbl_sub('ex-dnsbl', '127.0.0.3')
header T_URI_IN_EX_DNSBL_SPAMSITES      rbleval:check_rbl_sub('ex-dnsbl', '127.0.0.2')
header T_URI_IN_DEADBEEF        rbleval:check_rbl_uris('deadbeef', 'bl.deadbeef.com.')
header T_URI_IN_IN_DNSBL        rbleval:check_rbl_uris('in-dnsbl', 'in.dnsbl.org.')
header T_URI_IN_PIGS            rbleval:check_rbl_uris('pigs', 'bandwidth-pigs.monkeys.com.')
Comment 6 Marc Perkel 2003-05-04 07:51:37 UTC
Subject: Re: [SAdev]  do RBL look-ups on URLs

It will be interesting to see how this rule works. A lot of spam doesn't 
come from the server it links to.

But - I've made some rules that blacklist based on specific hand-built 
URIs, and with about 150 that I test for, I am really catching a lot of 
spam, and the accuracy of what is caught is almost 100%. I REALLY think 
that the ability to somehow automatically generate a blacklist of URIs 
of spam links will be a very effective spam control tool.


Comment 7 Daniel Quinlan 2003-05-04 19:08:54 UTC
Subject: Re:  do RBL look-ups on URLs

bugzilla-daemon  <bugzilla-daemon@hughes-family.org> writes:

> But - I've made some rules that blacklist based on specific hand-built
> URIs, and with about 150 that I test for, I am really catching a lot of
> spam, and the accuracy of what is caught is almost 100%. I REALLY think
> that the ability to somehow automatically generate a blacklist of URIs
> of spam links will be a very effective spam control tool.

Maintaining such a list inside of SpamAssassin is the wrong way to go.
We can only afford to list URLs/hostnames that are particularly
frequent.  The URLs change frequently, would need to be very numerous to
be much more effective than what we have now, and we don't have the
capacity to maintain so many.

The best route would be for someone to create a new RBL (a domain-based
one) with spam domains.  Perhaps even one that provided for some way to
do look-ups of full or partial URIs (some type of encoding to allow URIs
to be expressed in hostnames).  The list would need maintenance,
specific policies for listing/delisting, etc. -- the usual stuff.

Daniel

Comment 8 Antony Mawer 2003-05-04 23:44:43 UTC
Subject: Re: [SAdev]  do RBL look-ups on URLs 


> The best route would be for someone to create a new RBL (a domain-based
> one) with spam domains.  Perhaps even one that provided for some way to
> do look-ups of full or partial URIs (some type of encoding to allow URIs
> to be expressed in hostnames).  The list would need maintenance,
> specific policies for listing/delisting, etc. -- the usual stuff.

I've suggested it before a few times, but no movement as yet....

--j.

Comment 9 Daniel Quinlan 2003-05-05 00:09:45 UTC
Subject: Re:  do RBL look-ups on URLs

> I've suggested it before a few times, but no movement as yet....

It's definitely a better idea than a lot of the RBLs out there.  :-)

T_URI_IN_RFCI_DSN is pretty good.  Maybe a meta test is worth trying in
here somewhere...

OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
   6000     3000     3000    0.500   0.00    0.00  (all messages)
100.000  50.0000  50.0000    0.500   0.00    0.00  (all messages as %)
  0.433   0.8667   0.0000    1.000   0.91    0.01  T_URI_IN_IN_DNSBL
  2.950   5.8000   0.1000    0.983   0.87    0.01  T_URI_IN_RFCI_DSN
  0.317   0.4667   0.1667    0.737   0.36    0.01  T_URI_IN_DEADBEEF
 10.017  13.2667   6.7667    0.662   0.28    0.01  T_URI_IN_RFCI_ABUSE
 10.567  13.5667   7.5667    0.642   0.26    0.01  T_URI_IN_RFCI_POSTMASTER
  3.717   4.2000   3.2333    0.565   0.17    0.01  T_URI_IN_EX_DNSBL_SPAMSITES
  5.550   5.7333   5.3667    0.517   0.13    0.01  T_URI_IN_RFCI_WHOIS
  0.000   0.0000   0.0000    0.500   0.11    0.01  T_URI_IN_EX_DNSBL_EASYDNS
  1.117   1.0000   1.2333    0.448   0.08    0.01  T_URI_IN_EX_DNSBL
  0.433   0.2333   0.6333    0.269   0.02    0.01  T_URI_IN_PIGS

Comment 10 Marc Perkel 2003-05-05 06:08:23 UTC
Subject: Re: [SAdev]  do RBL look-ups on URLs

I think that an RBL based on what the spam links to as well as where 
spam comes from would be very effective too.

Comment 11 Daniel Quinlan 2003-05-13 13:14:11 UTC
Created attachment 964 [details]
test implementation
Comment 12 Daniel Quinlan 2003-05-13 17:08:06 UTC
Results just aren't all that great for these tests.  The only
one worth much is the T_URI_IN_RFCI_DSN test which has matches
because it finds mailto: and standard-format addresses in the
body that are listed in DSN.  I was able to get all of those by
adding these lines to check_rbl_from_host(), but unfortunately
it slightly decreased the accuracy, so I'm not going to make that
change either:

  for my $uri ($self->get_uri_list()) {
    if ($uri =~ /^mailto:.*?\@([^\s>?@]+\.[^\s>?@]+)/i) {
      $hosts{lc($1)} = 1;
    }
  }

(Note that get_uri_list() really should cache the URI list.)
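
For what it's worth, a rough sketch of what that caching could look like.
This is illustrative only; parse_uris_from_body() is a stand-in for the
existing extraction code, not a real method:

  # sketch only: memoize the URI list on the object so repeated callers
  # don't re-scan the rendered body each time
  sub get_uri_list {
    my ($self) = @_;

    # return the cached copy if we've already built it
    return @{$self->{uri_list}} if defined $self->{uri_list};

    my @uris = $self->parse_uris_from_body();   # stand-in for the real code
    $self->{uri_list} = \@uris;
    return @uris;
  }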

Raw results:

OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
   2000     1000     1000    0.500   0.00    0.00  (all messages)
100.000  50.0000  50.0000    0.500   0.00    0.00  (all messages as %)
  0.100   0.2000   0.0000    1.000   0.90    0.01  T_URI_IN_IN_DNSBL
  2.850   5.6000   0.1000    0.982   0.86    0.01  T_URI_IN_RFCI_DSN
  0.600   0.9000   0.3000    0.750   0.38    0.01  T_URI_IN_DEADBEEF
 11.750  16.3000   7.2000    0.694   0.32    0.01  T_URI_IN_RFCI_ABUSE
  9.450  12.8000   6.1000    0.677   0.30    0.01  T_URI_IN_RFCI_POSTMASTER
  6.200   7.7000   4.7000    0.621   0.23    0.01  T_URI_IN_RFCI_WHOIS
  0.000   0.0000   0.0000    0.500   0.11    0.01  T_URI_IN_PIGS

The meta tests weren't all that interesting.  Closing bug.  Maybe
someone can revive this idea and gain something from the test implementation
once we have some real URL DNSBLs.
Comment 13 Simon Lyall 2003-05-23 21:54:10 UTC
I've done some further testing along these lines and think it's worth reopening
this bug.

Doing RBL lookups on hostnames in URLs appears to yield very good results. I ran
a check comparing email less than 6 hours old in part of my customer spam spool
and then checked ham less than 6 hours old for the same customers.

OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
   3880     3415      465   0.880   0.00   0.00   (all messages)
100.000    88.02    11.98   0.880   0.00   0.00   (all messages as %)
  4.681    5.329     0.00   1.000   0.00   1.00   URL_IN_BRAZIL_BH
  2.577    2.928     0.00   1.000   0.00   1.00   URL_IN_CHINA_BH   
 27.216   30.922     1.08   0.995   0.00   1.00   URL_IN_SBL
 27.061   30.746     1.29   0.994   0.00   1.00   URL_IN_SPEWS

NOTE: The above scores are from my script below rather than SA. I manually
checked most of the FP emails and found that they were in fact spam missed by SA. I
removed these from the ham corpus EXCEPT for 3 emails already downloaded that I
couldn't check and 1 possible real FP (looked very marginal). Thus the FP number
is overstated. I put in the S/O and rank fields myself, so they might not be correct.

Can someone run my script (or a variation) on their own corpus to see what
results they get? It would appear that the spammer URLs can vanish very quickly,
so a corpus less than 12 hours old is best. If others get similar results, then
this bug should be reopened.

Script as follows; it requires the rblcheck program:

#!/bin/sh

file=$1

echo "$file being checked"

for q in `grep -i "href=\"http://" $file | cut -f2- -d: | cut -f3 -d/ | cut -f1 -d\" | cut -f2 -d@ | sort | uniq | grep [A-Za-z0-9]"\."[A-Za-z] `
do

    ip=`host -t a $q | grep "has address" | head -1 | cut -f4 -d" " `

    if [ ! -z $ip ]
    then
        echo "$file `/root/temp/rbl/rblcheck -c -s brazil.blackholes.us $ip`  " | grep -v "not RBL" | cut -f1,6 -d" "
        echo "$file `/root/temp/rbl/rblcheck -c -s china.blackholes.us $ip`  " | grep -v "not RBL" | cut -f1,6 -d" "
        echo "$file `/root/temp/rbl/rblcheck -c -s sbl.spamhaus.org $ip`  " | grep -v "not RBL" | cut -f1,6 -d" "
        echo "$file `/root/temp/rbl/rblcheck -c -s spews.relays.osirusoft.com $ip`  " | grep -v "not RBL" | cut -f1,6 -d" "
    fi
done
Comment 14 Daniel Quinlan 2003-05-23 22:21:06 UTC
reopening bug
Comment 15 Daniel Quinlan 2003-05-23 22:57:09 UTC
In my original testing, I used DNSBLs that supported direct queries of
domains.  However, your tests using SBL and SPEWS require a lookup of
the hostname to get an IP address followed by the actual RBL lookup on
SPEWS and SBL.  In other words, the hit-rate is going to be lower than
what you're getting due to the RBL timeout (which was just lowered to
a default of 10) and it's going to slow down SpamAssassin by a fair
amount (even with the timeout to limit the maximum) since the DNS pipeline
is now going to be two units long instead of one.  In addition, because
each one will use a different DNS server, the persistent timeout code
that I've been working on for RBLs won't be of any use.

Using 87 spam URLs from the last 24 hours (83 spam), I timed how long it took
to do lookups on each.  The average was 1.02 seconds, standard deviation of
2.14 seconds, with 4 coming in at 10 seconds (probably the default timeout
for the host command).  Considering this and given how long RBL checks tend
to take right now (average response is usually from < 1 second to about 5 or
6 seconds for the ones that are sometimes slow, average of 1-4 seconds), a
double-lookup really would add an average of 1-2 seconds to each check.

Even so, your results seem promising.
Comment 16 Antony Mawer 2003-05-24 14:33:58 UTC
Subject: Re: [SAdev]  do RBL look-ups on URLs 


>   4.681    5.329     0.00   1.000   0.00   1.00   URL_IN_BRAZIL_BH
>   2.577    2.928     0.00   1.000   0.00   1.00   URL_IN_CHINA_BH   
>  27.216   30.922     1.08   0.995   0.00   1.00   URL_IN_SBL
>  27.061   30.746     1.29   0.994   0.00   1.00   URL_IN_SPEWS

BTW checking against OPM will be very valuable I would guess.  That
and SBL would be the main ones.

--j.

Comment 17 Daniel Quinlan 2003-05-27 23:12:45 UTC
This should really wait until 2.70.
Comment 18 Daniel Quinlan 2003-06-20 12:51:39 UTC
*** Bug 2001 has been marked as a duplicate of this bug. ***
Comment 19 Evan Langlois 2003-11-12 13:58:41 UTC
Could someone email me exactly how to implement this?  I'd be using
spamhaus/SBL, and possibly others, and plan on emerging SpamAssassin within the
next few days, and this would be the primary function.  From what I've seen,
checking the URI against the SBL finds almost everything, with no false
positives.

evan@ddos.com
Comment 20 Justin Mason 2003-11-12 14:18:34 UTC
Subject: Re: [SAdev]  do RBL look-ups on URLs 

>Could someone email me exactly how to implement this?  I'd be using
>spamhaus/SBL, and possibly others, and plan on emerging SpamAssassin within the
>next few days, and this would be the primary function.  From what I've seen,
>checking the URI against the SBL finds almost everything, with no false
>positives.

Hi Evan --

in Mail::SpamAssassin::PerMsgStatus, there's a method to get the
list of URIs from the message body, and a method to check the IPs
in the message against the DNSBLs.

The thing to do would be to fix the latter to call the former,
then extract the hostnames from the URIs, and resolve all the 
hostnames into IP addresses.  Then add those IPs to the list
of IPs (from the headers), so they'll be checked against
the DNSBLs.

(A good option might be to restrict the lookups to just run
against the SBL and one or two others, but that should remain
untried until we see how it does to start with.)
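
A rough, standalone sketch of that flow, using Net::DNS directly rather than
the PerMsgStatus internals; the helper names and the choice of SBL as the
zone are illustrative only:

  #!/usr/bin/perl -w
  # Sketch: pull hostnames out of URIs, resolve them to A records, and
  # query an IP-based DNSBL for each resulting address.
  use strict;
  use Net::DNS;

  my $res = Net::DNS::Resolver->new(udp_timeout => 5, tcp_timeout => 5);

  sub host_from_uri {
    my ($uri) = @_;
    return lc $1 if $uri =~ m{^(?:https?|ftp)://(?:[^/\@]*\@)?([^/:?#]+)}i;
    return lc $1 if $uri =~ m{^mailto:[^\@]+\@([^\s>?]+)}i;
    return undef;
  }

  sub ips_for_host {
    my ($host) = @_;
    my $q = $res->query($host, 'A') or return ();
    return map { $_->address } grep { $_->type eq 'A' } $q->answer;
  }

  sub listed_in {
    my ($ip, $zone) = @_;                  # e.g. 'sbl.spamhaus.org'
    my $rev = join '.', reverse split /\./, $ip;
    return defined $res->query("$rev.$zone", 'A');
  }

  for my $uri (@ARGV) {
    my $host = host_from_uri($uri) or next;
    for my $ip (ips_for_host($host)) {
      print "$uri ($ip) listed in SBL\n" if listed_in($ip, 'sbl.spamhaus.org');
    }
  }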

The big problem I can see is that this will be *very* slow,
due to the possibly very large number of DNS lookups involved
to resolve all those hostnames, and the potentially very large
number of addresses listed.

But some investigation would be *really* useful, thanks ;)

--j.

Comment 21 Kenneth Porter 2004-01-05 11:39:04 UTC
For spammers that use a friendly registrar to create constantly changing
domains, it would be desirable to extend this to check the NS records, as the
changing domains often use the same fixed set of name servers.
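
A small sketch of that NS-based variation (Net::DNS again; the blacklist
zone is just a placeholder, since no NS-specific list existed at the time):

  # Sketch: look up a domain's NS records, resolve the name servers, and
  # check those addresses against an IP-based DNSBL.  Illustrative only.
  use strict;
  use Net::DNS;

  my $res    = Net::DNS::Resolver->new;
  my $zone   = 'sbl.spamhaus.org';         # placeholder blacklist zone
  my $domain = shift or die "usage: $0 domain\n";

  sub ns_ips {
    my ($dom) = @_;
    my @ips;
    my $ns = $res->query($dom, 'NS') or return ();
    for my $rr (grep { $_->type eq 'NS' } $ns->answer) {
      my $a = $res->query($rr->nsdname, 'A') or next;
      push @ips, map { $_->address } grep { $_->type eq 'A' } $a->answer;
    }
    return @ips;
  }

  for my $ip (ns_ips($domain)) {
    my $rev = join '.', reverse split /\./, $ip;
    print "name server $ip is listed\n" if $res->query("$rev.$zone", 'A');
  }
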
Comment 22 Chris Santerre 2004-01-06 09:00:27 UTC
I guess I should chime in on what I've seen. Running the bigevil list, I've
started looking at newer URIs in openrbl.org. I noticed the newer URIs have
been listed in openrbl.org frequently. HOWEVER, this only covers what I actually
receive. I run DNSRBLs at the MTA level, so I block a ton.

Having rethought this, I'm beginning to wonder if the lookup time is
worth it. What might be better is having a server do this URI RBL lookup and
create an automated "bigevil" file of its own. Then clients can check
against the local file for a match first. Finding a hit would cause further
lookups to be skipped. Sort of a local RBL mirror, but in .cf file form.

Yeah, maybe I should just have another coffee and rethink all my ideas :)
Comment 23 Mikael Olsson 2004-01-15 09:47:36 UTC
I'm mainly chiming in here to say that I think URI domain lookups would
be a very worthwhile idea.  And also to say that I for one would gladly
take the extra 1-2 second hit per mail to look up domain->IP and then
query e.g. SBL and SPEWS for that IP address.  But then again, my mail
server load isn't very high, so maybe that's just me.

Case in point: 105 domains registered by Atriks, all pointing to the
same web server IP:
http://c0ffee.badf00d.org/atriksdomains-ip.txt

One thing to keep in mind when deciding the implementation is spammers 
that randomize the first DNS component in the URI, e.g. atriks (again):
http://wwhxwxqwqwudxnwcqnrnkwdqcmcmd0627.openbsdmailservers.com/

A system that works with whole URIs wouldn't work here. One that tries
to figure out the actual user-registerable domain would perhaps work,
but that requires knowledge of how TLDs work, lest one suddenly blacklist
e.g. ".co.uk" or ".com.tw" or ".com.au" or ".tm.se" or "lastname.name",
etc etc..

A system that simply resolves the name and checks the resulting IP address
against IP-based RBLs would be foolproof.
Comment 24 Florian L. Klein 2004-01-20 15:30:45 UTC
Created attachment 1714 [details]
Perform DNSBL tests on spamvertised URLs' IP addresses

See http://thread.gmane.org/gmane.mail.spam.spamassassin.general/33572 and (in
German)
http://groups.google.com/groups?selm=crb9ob.3pg.ln@news.home.docsnyder.de

I'm using the patch on production email systems with SpamAssassin 2.60 (Debian
Woody), 2.61 (Debian Sarge) and the current CVS snapshot (Debian Sarge).
Comment 25 Florian L. Klein 2004-01-20 15:41:53 UTC
Comment on attachment 1714 [details]
Perform DNSBL tests on spamvertised URLs' IP addresses

Sorry, I gzip'd the attachment before submitting it. I'll resubmit it
uncompressed.
Comment 26 Florian L. Klein 2004-01-20 15:45:40 UTC
Created attachment 1715 [details]
Perform DNSBL tests on spamvertised URLs' IP addresses

(Retried patch submission, not gzip'd this time ;-))

See http://thread.gmane.org/gmane.mail.spam.spamassassin.general/33572 and (in
German)
http://groups.google.com/groups?selm=crb9ob.3pg.ln@news.home.docsnyder.de

I'm using the patch successfully on production email systems running
SpamAssassin 2.60 (Debian Woody), 2.61 (Debian Sarge) and the current CVS
snapshot (Debian Sarge).
Comment 27 Justin Mason 2004-01-20 16:04:30 UTC
Subject: Re:  do RBL look-ups on URLs 

>One thing to keep in mind when deciding the implementation is spammers 
>that randomize the first DNS component in the URI, e.g. atriks (again):
>http://wwhxwxqwqwudxnwcqnrnkwdqcmcmd0627.openbsdmailservers.com/

>A system that works with whole URIs wouldn't work here. One that tries
>to figure out the actual user-registerable domain would perhaps work,
>but that requires knowledge of how TLDs work, lest one suddenly blacklist
>e.g. ".co.uk" or ".com.tw" or ".com.au" or ".tm.se" or "lastname.name",
>etc etc..

That's not a big problem; we already have code in 2.70 that understands
which CCTLDs use subdelegation (i.e. those).

>A system that simply resolves the name and checks the resulting IP address
>against IP-based RBLs would be fool proof.

Although perhaps resolving a name like the openbsdmailservers.com one
above might confirm an email address, if the name contained the address in
encoded form.  But still, I think it may be worthwhile (if optional,
maybe).

Perhaps it could include heuristics to detect encoded-address hostname
parts, and replace those with its own random hostname part text?

BTW another point -- regarding spammers overloading the system by sending
200 URIs in a single message.  IMO the best approach to deal with that
problem is to select 5 URIs to analyze from the message, with preference
given to the largest IMG tags first.

--j.

Comment 28 Florian L. Klein 2004-01-20 16:40:15 UTC
Comment on attachment 1715 [details]
Perform DNSBL tests on spamvertised URLs' IP addresses

>diff -ruN spamassassin/lib/Mail/SpamAssassin/Conf.pm spamassassin.new/lib/Mail/SpamAssassin/Conf.pm
>--- spamassassin/lib/Mail/SpamAssassin/Conf.pm	2003-12-17 16:06:29.000000000 +0100
>+++ spamassassin.new/lib/Mail/SpamAssassin/Conf.pm	2004-01-20 23:57:19.000000000 +0100
>@@ -107,6 +107,10 @@
> use constant TYPE_URI_EVALS     => 0x0011;
> use constant TYPE_META_TESTS    => 0x0012;
> use constant TYPE_RBL_EVALS     => 0x0013;
>+use constant TYPE_URIIP_TESTS     => 0x0014;
>+use constant TYPE_URIIP_EVALS     => 0x0015;
>+use constant TYPE_URIIP_RBL_TESTS => 0x0016;
>+use constant TYPE_URIIP_RBL_EVALS => 0x0017;
> 
> $VERSION = 'bogus';     # avoid CPAN.pm picking up version strings later
> 
>@@ -2121,6 +2125,19 @@
>       next;
>     }
> 
>+# URI IP addresses
>+    if (/^uriip\s+(\S+)\s+(?:rbl)?eval:(.*)$/) {
>+      my ($name, $fn) = ($1, $2);
>+
>+      if ($fn =~ /^check_uriip_rbl/) {
>+	$self->add_test ($name, $fn, TYPE_URIIP_RBL_EVALS);
>+      }
>+#     else {
>+#	$self->add_test ($name, $fn, TYPE_URIIP_EVALS);
>+#     }
>+      next;
>+    }
>+
> =item rawbody SYMBOLIC_TEST_NAME /pattern/modifiers
> 
> Define a raw-body pattern test.  C<pattern> is a Perl regular expression.
>@@ -2633,6 +2650,9 @@
> 	elsif ($type == TYPE_RBL_EVALS) {
> 	  $self->{rbl_evals}->{$name} = \@args;
> 	}
>+	elsif ($type == TYPE_URIIP_RBL_EVALS) {
>+	  $self->{uriip_rbl_evals}->{$name} = \@args;
>+	}
> 	elsif ($type == TYPE_RAWBODY_EVALS) {
> 	  $self->{rawbody_evals}->{$name} = \@args;
> 	}
>diff -ruN spamassassin/lib/Mail/SpamAssassin/EvalTests.pm spamassassin.new/lib/Mail/SpamAssassin/EvalTests.pm
>--- spamassassin/lib/Mail/SpamAssassin/EvalTests.pm	2003-12-17 09:09:00.000000000 +0100
>+++ spamassassin.new/lib/Mail/SpamAssassin/EvalTests.pm	2004-01-20 23:57:19.000000000 +0100
>@@ -1329,6 +1329,18 @@
>   $self->check_rbl_backend($rule, $set, $rbl_server, 'TXT', $subtest);
> }
> 
>+sub check_uriip_rbl {
>+  my ($self, $rule, $set, $rbl_server, $subtest) = @_;
>+  my @ips = @{$self->{uriips}};
>+  eval {
>+    foreach my $ip (@ips) {
>+      next unless ($ip =~ /(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})/);
>+      $self->do_rbl_lookup($rule, $set, 'A', $rbl_server,
>+			   "$4.$3.$2.$1.$rbl_server", $subtest);
>+    }
>+  };
>+}
>+
> # run for first message 
> sub check_rbl_sub {
>   my ($self, $rule, $set, $subtest) = @_;
>diff -ruN spamassassin/lib/Mail/SpamAssassin/PerMsgStatus.pm spamassassin.new/lib/Mail/SpamAssassin/PerMsgStatus.pm
>--- spamassassin/lib/Mail/SpamAssassin/PerMsgStatus.pm	2003-12-17 16:06:29.000000000 +0100
>+++ spamassassin.new/lib/Mail/SpamAssassin/PerMsgStatus.pm	2004-01-20 23:57:19.000000000 +0100
>@@ -122,6 +122,9 @@
>     $self->{conf}->set_score_set ($set|2);
>   }
> 
>+  # IPs of spamvertised URIs
>+  $self->{uriips} = [ ];
>+
>   # pre-chew Received headers
>   $self->parse_received_headers();
> 
>@@ -1743,12 +1746,67 @@
>   return @{$self->{uri_list}};
> }
> 
>+sub do_resolve_uri {
>+  my ($self, $uri) = @_;
>+  my @ips = ();
>+
>+  $uri =~ s/^http:\/\///;
>+  $uri =~ s/^mailto:\/\///;
>+  $uri =~ s/\/.*$//;
>+  $uri =~ s/^.*\@//;
>+
>+  @ips = $self->lookup_all_ips($uri);
>+
>+  return @ips;
>+}
>+
>+sub do_body_uriip_tests {
>+  my ($self, @ips) = @_;
>+  local ($_);
>+
>+  dbg ("running uriip tests; score so far=".$self->{hits});
>+  foreach my $ip (@ips) {
>+    dbg ("Testing spamvertised IP '$ip'");
>+    push(@{$self->{uriips}}, $ip);
>+  }
>+
>+  my $evalhash = $self->{conf}->{uriip_rbl_evals};
>+  my ($rulename, @args);
>+  my $debugenabled = $Mail::SpamAssassin::DEBUG->{enabled};
>+
>+  while (my ($rulename, $test) = each %{$evalhash}) {
>+    my $score = $self->{conf}->{scores}->{$rulename};
>+    next unless $score;
>+
>+    $self->{test_log_msgs} = ();
>+
>+    my ($function, @args) = @{$test};
>+    my $result;
>+    eval {
>+      $result = $self->$function($rulename, @args);
>+    };
>+
>+    if ($@) {
>+      warn "Failed to run $rulename URIIP RBL SpamAssassin test, skipping:\n".
>+		"\t($@)\n";
>+      $self->{rule_errors}++;
>+      next;
>+    }
>+  }
>+}
>+
> sub do_body_uri_tests {
>   my ($self, $textary) = @_;
>   local ($_);
> 
>   dbg ("running uri tests; score so far=".$self->{hits});
>   my @uris = $self->get_uri_list();
>+  my @ips  = ();
>+
>+  foreach my $uri (@uris) {
>+    push (@ips, $self->do_resolve_uri($uri));
>+  }
>+  $self->do_body_uriip_tests(@ips);
> 
>   my $doing_user_rules = 
>     $self->{conf}->{user_rules_to_compile}->{Mail::SpamAssassin::Conf::TYPE_URI_TESTS};
>@@ -2166,7 +2224,6 @@
>     $self->{test_log_msgs} = ();	# clear test state
> 
>     my ($function, @args) = @{$test};
>-
>     my $result;
>     eval {
>        $result = $self->$function($rulename, @args);
>diff -ruN spamassassin/rules/20_uriip_tests.cf spamassassin.new/rules/20_uriip_tests.cf
>--- spamassassin/rules/20_uriip_tests.cf	1970-01-01 01:00:00.000000000 +0100
>+++ spamassassin.new/rules/20_uriip_tests.cf	2004-01-20 23:58:36.000000000 +0100
>@@ -0,0 +1,196 @@
>+# SpamAssassin rules file: RBL tests of spamvertised IPs
>+#
>+# Please don't modify this file as your changes will be overwritten with
>+# the next update. Use @@LOCAL_RULES_DIR@@/local.cf instead.
>+# See 'perldoc Mail::SpamAssassin::Conf' for details.
>+#
>+# This program is free software; you can redistribute it and/or modify
>+# it under the terms of either the Artistic License or the GNU General
>+# Public License as published by the Free Software Foundation; either
>+# version 1 of the License, or (at your option) any later version.
>+#
>+# See the file "License" in the top level of the SpamAssassin source
>+# distribution for more details.
>+#
>+###########################################################################
>+
>+require_version @@VERSION@@
>+
>+# Don't activate too many of these rulesets, as the number of DNS
>+# queries per email will become very high!
>+
>+### Spamvertised sites listed on "common" DNSBLs ###
>+#
>+# Spamhaus Block List
>+#
>+uriip HOSTED_SBL eval:check_uriip_rbl('sbl', 'sbl.spamhaus.org.')
>+describe HOSTED_SBL URL is hosted at a site listed in the Spamhaus Block List.
>+tflags HOSTED_SBL net
>+
>+# Spam Prevention Early Warning System
>+#
>+uriip HOSTED_SPEWS_L1 eval:check_uriip_rbl('spews', 'l1.spews.dnsbl.sorbs.net.')
>+describe HOSTED_SPEWS_L1 URL is hosted at a site listed in the SPEWS (Level 1) blacklist.
>+tflags HOSTED_SPEWS_L1 net
>+#
>+uriip HOSTED_SPEWS_L2 eval:check_uriip_rbl('spews', 'l2.spews.dnsbl.sorbs.net.')
>+describe HOSTED_SPEWS_L2 URL is hosted at a site listed in the SPEWS (Level 2) blacklist.
>+tflags HOSTED_SPEWS_L2 net
>+
>+
>+# Habeas(TM) violators blacklist
>+#
>+uriip HOSTED_HABEAS_VIOLATOR eval:check_uriip_rbl('hil', 'sa-hil.habeas.com.')
>+describe HOSTED_HABEAS_VIOLATOR Uses a URL whose IP has been caught as Habeas violator
>+tflags HOSTED_HABEAS_VIOLATOR net
>+
>+
>+### ISPs known to tolerate spamvertised sites ###
>+#
>+#uriip HOSTED_AT_ABOVE eval:check_uriip_rbl('above', 'above.blackholes.us.')
>+#describe HOSTED_AT_ABOVE Uses a URL hosted at AboveNet
>+#tflags HOSTED_AT_ABOVE net
>+
>+#uriip HOSTED_AT_ATT eval:check_uriip_rbl('att', 'att.blackholes.us.')
>+#describe HOSTED_AT_ATT Uses a URL hosted at AT&T
>+#tflags HOSTED_AT_ATT net
>+
>+#uriip HOSTED_AT_BELLSOUTH eval:check_uriip_rbl('bellsouth', 'bellsouth.blackholes.us.')
>+#describe HOSTED_AT_BELLSOUTH Uses a URL hosted at Bellsouth
>+#tflags HOSTED_AT_BELLSOUTH net
>+
>+uriip HOSTED_AT_CHINANET eval:check_uriip_rbl('chinanet', 'chinanet.blackholes.us.')
>+describe HOSTED_AT_CHINANET Uses a URL hosted at Chinanet
>+tflags HOSTED_AT_CHINANET net
>+
>+#uriip HOSTED_AT_CIBERLYNX eval:check_uriip_rbl('ciberlynx', 'ciberlynx.blackholes.us.')
>+#describe HOSTED_AT_CIBERLYNX Uses a URL hosted at Ciberlynx
>+#tflags HOSTED_AT_CIBERLYNX net
>+
>+#uriip HOSTED_AT_COGENTCO eval:check_uriip_rbl('cogentco', 'cogentco.blackholes.us.')
>+#describe HOSTED_AT_COGENTCO Uses a URL hosted at Cogent
>+#tflags HOSTED_AT_COGENTCO net
>+
>+#uriip HOSTED_AT_COMCAST eval:check_uriip_rbl('comcast', 'comcast.blackholes.us.')
>+#describe HOSTED_AT_COMCAST Uses a URL hosted at Comcast
>+#tflags HOSTED_AT_COMCAST net
>+
>+#uriip HOSTED_AT_COVAD eval:check_uriip_rbl('covad', 'covad.blackholes.us.')
>+#describe HOSTED_AT_COVAD Uses a URL hosted at Covad
>+#tflags HOSTED_AT_COVAD net
>+
>+#uriip HOSTED_AT_CW eval:check_uriip_rbl('cw', 'cw.blackholes.us.')
>+#describe HOSTED_AT_CW Uses a URL hosted at Cable & Wireless
>+#tflags HOSTED_AT_CW net
>+
>+#uriip HOSTED_AT_HE eval:check_uriip_rbl('he', 'he.blackholes.us.')
>+#describe HOSTED_AT_HE Uses a URL hosted at HE.net
>+#tflags HOSTED_AT_HE net
>+
>+#uriip HOSTED_AT_HOSTCENTRIC eval:check_uriip_rbl('hostcentric', 'hostcentric.blackholes.us.')
>+#describe HOSTED_AT_HOSTCENTRIC Uses a URL hosted at Hostcentric
>+#tflags HOSTED_AT_HOSTCENTRIC net
>+
>+#uriip HOSTED_AT_INTERBUSINESS eval:check_uriip_rbl('interbusiness', 'interbusiness.blackholes.us.')
>+#describe HOSTED_AT_INTERBUSINESS Uses a URL hosted at Interbusiness
>+#tflags HOSTED_AT_INTERBUSINESS net
>+
>+#uriip HOSTED_AT_INTERNAP eval:check_uriip_rbl('internap', 'internap.blackholes.us.')
>+#describe HOSTED_AT_INTERNAP Uses a URL hosted at Internap
>+#tflags HOSTED_AT_INTERNAP net
>+
>+#uriip HOSTED_AT_LEVEL3 eval:check_uriip_rbl('level3', 'level3.blackholes.us.')
>+#describe HOSTED_AT_LEVEL3 Uses a URL hosted at Level3
>+#tflags HOSTED_AT_LEVEL3 net
>+
>+#uriip HOSTED_AT_QWEST eval:check_uriip_rbl('qwest', 'qwest.blackholes.us.')
>+#describe HOSTED_AT_QWEST Uses a URL hosted at QWest
>+#tflags HOSTED_AT_QWEST net
>+
>+#uriip HOSTED_AT_RACKSPACE eval:check_uriip_rbl('rackspace', 'rackspace.blackholes.us.')
>+#describe HOSTED_AT_RACKSPACE Uses a URL hosted at Rackspace
>+#tflags HOSTED_AT_RACKSPACE net
>+
>+#uriip HOSTED_AT_ROGERS eval:check_uriip_rbl('rogers', 'rogers.blackholes.us.')
>+#describe HOSTED_AT_ROGERS Uses a URL hosted at Rogers
>+#tflags HOSTED_AT_ROGERS net
>+
>+#uriip HOSTED_AT_RR eval:check_uriip_rbl('rr', 'rr.blackholes.us.')
>+#describe HOSTED_AT_RR Uses a URL hosted at RoadRunner
>+#tflags HOSTED_AT_RR net
>+
>+#uriip HOSTED_AT_SERVEPATH eval:check_uriip_rbl('servepath', 'servepath.blackholes.us.')
>+#describe HOSTED_AT_SERVEPATH Uses a URL hosted at ServePath
>+#tflags HOSTED_AT_SERVEPATH net
>+
>+#uriip HOSTED_AT_SPRINT eval:check_uriip_rbl('sprint', 'sprint.blackholes.us.')
>+#describe HOSTED_AT_SPRINT Uses a URL hosted at Sprint
>+#tflags HOSTED_AT_SPRINT net
>+
>+#uriip HOSTED_AT_TELUS eval:check_uriip_rbl('telus', 'telus.blackholes.us.')
>+#describe HOSTED_AT_TELUS Uses a URL hosted at Telus
>+#tflags HOSTED_AT_TELUS net
>+
>+#uriip HOSTED_AT_VALUENET eval:check_uriip_rbl('valuenet', 'valuenet.blackholes.us.')
>+#describe HOSTED_AT_VALUENET Uses a URL hosted at ValueNet
>+#tflags HOSTED_AT_VALUENET net
>+
>+uriip HOSTED_AT_VERIO eval:check_uriip_rbl('verio', 'verio.blackholes.us.')
>+describe HOSTED_AT_VERIO Uses a URL hosted at Verio
>+tflags HOSTED_AT_VERIO net
>+
>+#uriip HOSTED_AT_VERIZON eval:check_uriip_rbl('verizon', 'verizon.blackholes.us.')
>+#describe HOSTED_AT_VERIZON Uses a URL hosted at Verizon
>+#tflags HOSTED_AT_VERIZON net
>+
>+#uriip HOSTED_AT_WANADOOFR eval:check_uriip_rbl('wanadoo-fr', 'wanadoo-fr.blackholes.us.')
>+#describe HOSTED_AT_WANADOOFR Uses a URL hosted at Wanadoo France
>+#tflags HOSTED_AT_WANADOOFR net
>+
>+#uriip HOSTED_AT_XO eval:check_uriip_rbl('xo', 'xo.blackholes.us.')
>+#describe HOSTED_AT_XO Uses a URL hosted at XO.com
>+#tflags HOSTED_AT_XO net
>+
>+
>+### Countries with severe spam problems ###
>+#
>+#uriip HOSTED_IN_ARGENTINA eval:check_uriip_rbl('argentina', 'argentina.blackholes.us.')
>+#describe HOSTED_IN_ARGENTINA Uses a URL hosted in Argentina
>+#tflags HOSTED_IN_ARGENTINA net
>+
>+#uriip HOSTED_IN_BRAZIL eval:check_uriip_rbl('brazil', 'brazil.blackholes.us.')
>+#describe HOSTED_IN_BRAZIL Uses a URL hosted in Brazil
>+#tflags HOSTED_IN_BRAZIL net
>+
>+uriip HOSTED_IN_CHINA eval:check_uriip_rbl('china', 'china.blackholes.us.')
>+describe HOSTED_IN_CHINA Uses a URL hosted in China
>+tflags HOSTED_IN_CHINA net
>+
>+uriip HOSTED_IN_KOREA eval:check_uriip_rbl('korea', 'korea.blackholes.us.')
>+describe HOSTED_IN_KOREA Uses a URL hosted in Korea
>+tflags HOSTED_IN_KOREA net
>+
>+#uriip HOSTED_IN_MALAYSIA eval:check_uriip_rbl('malaysia', 'malaysia.blackholes.us.')
>+#describe HOSTED_IN_MALAYSIA Uses a URL hosted in Malaysia
>+#tflags HOSTED_IN_MALAYSIA net
>+
>+#uriip HOSTED_IN_NIGERIA eval:check_uriip_rbl('nigeria', 'nigeria.blackholes.us.')
>+#describe HOSTED_IN_NIGERIA Uses a URL hosted in Nigeria
>+#tflags HOSTED_IN_NIGERIA net
>+
>+uriip HOSTED_IN_RUSSIA eval:check_uriip_rbl('russia', 'russia.blackholes.us.')
>+describe HOSTED_IN_RUSSIA Uses a URL hosted in Russia
>+tflags HOSTED_IN_RUSSIA net
>+
>+#uriip HOSTED_IN_SINGAPORE eval:check_uriip_rbl('singapore', 'singapore.blackholes.us.')
>+#describe HOSTED_IN_SINGAPORE Uses a URL hosted in Singapore
>+#tflags HOSTED_IN_SINGAPORE net
>+
>+#uriip HOSTED_IN_TAIWAN eval:check_uriip_rbl('taiwan', 'taiwan.blackholes.us.')
>+#describe HOSTED_IN_TAIWAN Uses a URL hosted in Taiwan
>+#tflags HOSTED_IN_TAIWAN net
>+
>+#uriip HOSTED_IN_THAILAND eval:check_uriip_rbl('thailand', 'thailand.blackholes.us.')
>+#describe HOSTED_IN_THAILAND Uses a URL hosted in Thailand
>+#tflags HOSTED_IN_THAILAND net
>+
>diff -ruN spamassassin/rules/50_scores.cf spamassassin.new/rules/50_scores.cf
>--- spamassassin/rules/50_scores.cf	2003-12-17 07:14:52.000000000 +0100
>+++ spamassassin.new/rules/50_scores.cf	2004-01-20 23:57:19.000000000 +0100
>@@ -999,6 +999,56 @@
> score USER_IN_MORE_SPAM_TO -20.000
> score USER_IN_ALL_SPAM_TO -100.000
> 
>+# Spamvertised IPs within black-hat netblocks
>+
>+# Be careful with the scores - some legitimate emails may contain
>+# (informational) links to spamvertised sites - score them high enough
>+# but not too high.
>+
>+# These ones have been proven as *very* useful.
>+score HOSTED_SBL 4.0
>+score HOSTED_SPEWS_L1 4.0
>+score HOSTED_SPEWS_L2 2.0
>+score HOSTED_HABEAS_VIOLATOR 4.0
>+
>+# Only to be activated if a regional or ISP-specific spam problem is
>+# evolving (yet that's what SBL and SPEWS are good for).
>+score HOSTED_AT_ABOVE 1.5
>+score HOSTED_AT_ATT 1.5
>+score HOSTED_AT_BELLSOUTH 1.5
>+score HOSTED_AT_CHINANET 4.0
>+score HOSTED_AT_CIBERLYNX 4.0
>+score HOSTED_AT_COGENTCO 2.0
>+score HOSTED_AT_COMCAST 2.0
>+score HOSTED_AT_COVAD 1.5
>+score HOSTED_AT_CW 1.5
>+score HOSTED_AT_HE 1.5
>+score HOSTED_AT_HOSTCENTRIC 1.5
>+score HOSTED_AT_INTERBUSINESS 2.0
>+score HOSTED_AT_INTERNAP 2.0
>+score HOSTED_AT_LEVEL3 1.5
>+score HOSTED_AT_QWEST 2.0
>+score HOSTED_AT_RACKSPACE 2.0
>+score HOSTED_AT_ROGERS 2.0
>+score HOSTED_AT_RR 2.0
>+score HOSTED_AT_SERVEPATH 2.0
>+score HOSTED_AT_SPRINT 2.0
>+score HOSTED_AT_TELUS 1.5
>+score HOSTED_AT_VALUENET 1.5
>+score HOSTED_AT_VERIO 2.5
>+
>+score HOSTED_IN_ARGENTINA 1.5
>+score HOSTED_IN_BRAZIL 1.5
>+score HOSTED_IN_CHINA 3.0
>+score HOSTED_IN_KOREA 2.5
>+score HOSTED_IN_MALAYSIA 1.5
>+score HOSTED_IN_NIGERIA 2.0
>+score HOSTED_IN_RUSSIA 2.0
>+score HOSTED_IN_SINGAPORE 1.5
>+score HOSTED_IN_TAIWAN 1.5
>+score HOSTED_IN_THAILAND 1.5
>+
>+
> #
> # Habeas: http://www.habeas.com/
> #
Comment 29 Mikael Olsson 2004-01-24 13:52:24 UTC
I've implemented the previous patch, though only checking against
"actual" RBLs.  Ruling out entire countries and ISPs is a wee 
bit dicey for a corporate environment.

Out of 417 spams (past 2+ days)

HOSTED_SBL              319 (76%)
HOSTED_SPEWS_L1         291 (70%)
HOSTED_SPEWS_L2         295 (71%)
HOSTED_HABEAS_VIOLATOR    0 ( 0%)

Now we define VBAD as "SBL || SPEWS_L1 || HABEAS_VIOLATOR"
And MBAD as "SPEWS_L2 && !VBAD"

HOSTED_VBAD             325 (78%)
HOSTED_MBAD               4 ( 1%)


Let's see how RCVD_ rules match:

RCVD_IN_SBL             166 (40%)
RCVD_IN_SPEWS_L1        166 (40%)
RCVD_IN_SPEWS_L2        169 (41%)
HABEAS_VIOLATOR         2   ( 0%)

And now to find out how this matches up with RCVD_ checks.

HOSTED_SBL && !RCVD_IN_SBL           160 (38%)
HOSTED_SPEWS_L1 && !RCVD_IN_SPEWS_L1 137 (33%)


So, we can more or less conclude that people that spam
from SBLed MTAs also host their sites on SBLed web servers.

But the hit rate of checking URIs is twice that of 
sender checks.


There is however a bit of a problem with the scoring, imo.
SPEWS L1 and SBL lists much of the same:

HOSTED_SBL && HOSTED_SPEWS_L1  285 out of a possible 291

So, I'm using the following scoring to avoid too many
RBL-only false positives:

  score HOSTED_SBL 0.5
  score HOSTED_SPEWS_L1 0.5
  score HOSTED_HABEAS_VIOLATOR 0.5

  describe MY_HOSTED_VBAD Contains URIs hosted in SBL/SPEWSL1/HABEASVIO 
locations
  meta     MY_HOSTED_VBAD HOSTED_SBL || HOSTED_SPEWS_L1 || 
HOSTED_HABEAS_VIOLATOR
  score    MY_HOSTED_VBAD 2.0


  score HOSTED_SPEWS_L2 0.01

  describe MY_HOSTED_MBAD  Contains URIs hosted in SPEWSL2 locations
  meta     MY_HOSTED_MBAD  ( HOSTED_SPEWS_L2 ) && !MY_HOSTED_VBAD
  score    MY_HOSTED_MBAD  1.0


Of course, there's a similar problem with FPs in sender lookups
and URI IP lookups (quite likely), but that's for another bug.

Comment 30 Daniel Quinlan 2004-01-24 15:20:13 UTC
> Although perhaps resolving a name like the openbsdmailservers.com one
> above might confirm an email address, if the name contained the address in
> encoded form.  But still, I think it may be worthwhile (if optional,
> maybe).

I'm very concerned about this aspect.  Confirming email addresses is something
we cannot do.  What about just looking up the A/MX record for the domain itself
and checking those?  That should be safe.  Nobody is going to register one
domain per spam victim, but doing a one-way hash between hostname and user is
not too hard.  It doesn't have to be something that easily decodes into an
email address, it could just be an English word in a table, like:

  Hamlet -> quinlan@pathname.com
  Mouse  -> jm@jmason.org

etc.
Comment 31 Justin Mason 2004-01-24 15:46:14 UTC
FYI -- I got in a discussion about this elsewhere, and here's a comment I posted.

big problem with querying A records for URLs in mail messages,
at scan time, is that this can be *very expensive* in terms of runtime. ...

Consider a spammer who wants to DDOS someone's mail site.  If they know
that site uses a scanner which will perform A lookups on all URLs
in the message, they set up a really slow nameserver for a zone,
and use URLs in that zone in their messages, then send hundreds of
msgs.  Scanner will take forever, mail will back up, ouch.

Alternatively, if the scanner times out after 30 seconds of checking
URL A records, then they insert maybe 5 links with really slow A
records, in tiny img tags (let's say) so that humans will overlook
them, and 1 link with the *real* payload after that.   The scanner
will time out after checking several, and not get to the real meat.

If we randomly select N urls to check from a 200-URL message, this
also provides a way for them to get around it; they just throw in
hundreds of junk links to Yahoo! etc.

We can keep coming up with new ways to heuristically determine why URLs
are likely to be spammy, but there's a whole metric crapload of ways for
them to avoid it, or attack it, IMO.

I'm thinking a good approach to this problem would be this:

  - run an offline scanner (something like SpamAssassin's "mass-check")
    over a spam/spamtrap corpus periodically
  - this scanner greps out the IMG SRC and A href links
  - parses out hostname parts
  - does SBL/XBL/whatever lookups *in parallel* so timeouts are not a
    bottleneck
  - if a hostname uses an SBL-listed IP, create a SpamAssassin rule for
    that hostname
  - output the SpamAssassin ruleset to catch those URLs using the "uri"
    rule type
  - also, or alternatively, add them to a DNSBL of "spammer URLs"
    for network lookups

This has 2 benefits:

  - spammers listing legit URLs like www.yahoo.com do not cause FPs,
    because those are not on BL-listed IPs
  - super-slow servers will not bottleneck the scanner itself, just
    the offline rule-generation step
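
A rough sketch of that offline pass (sequential rather than parallel for
brevity, and every name here -- the corpus fed on stdin, the SBL zone, the
generated rule names -- is a placeholder; this is not mass-check itself):

  #!/usr/bin/perl -w
  # Sketch: grep URI hostnames out of a spam corpus on stdin, resolve them,
  # check the SBL, and emit "uri" rules for hosts sitting on listed IPs.
  use strict;
  use Net::DNS;

  my $res = Net::DNS::Resolver->new;
  my %seen;

  while (<>) {
    while (m{https?://([A-Za-z0-9.-]+\.[A-Za-z]{2,})}g) {
      $seen{lc $1} = 1;
    }
  }

  my $n = 0;
  for my $host (sort keys %seen) {
    my $q = $res->query($host, 'A') or next;
    for my $rr (grep { $_->type eq 'A' } $q->answer) {
      my $rev = join '.', reverse split /\./, $rr->address;
      next unless $res->query("$rev.sbl.spamhaus.org", 'A');
      (my $re = $host) =~ s/\./\\./g;      # escape dots for the rule regexp
      $n++;
      print "uri      SPAMMER_URI_$n /$re/i\n";
      print "describe SPAMMER_URI_$n URI hostname hosted on an SBL-listed IP\n";
      print "score    SPAMMER_URI_$n 2.0\n";
      last;
    }
  }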

(oh look, Chris Santerre suggested that! Great minds think alike, Chris ;)

Comments?   I would be *very* interested in getting this working, given that
spam nowadays seems to be using a lot of self-hosting and/or proxies to host
their sites.

re danger of confirming email addresses.  Consider this link:

  img src=http://9eea82a2a786474ac9ceebe1ba296ad4.spamscumbag.biz

That's my address md5-encoded.  It's also a valid address in a wildcard
zone.  To avoid this, I'd suggest that we detect long strings in parts of
the hostname that could be wildcard zones, and throw in some random bits,
or just use random bits ourselves...  a wildcard zone will respond to
anything there.
Comment 32 Justin Mason 2004-01-24 15:48:26 UTC
also, generating rules based on a domain's NS records would also be very
valuable, I think.
Comment 33 Justin Mason 2004-01-24 18:17:18 UTC
Thinking out loud here -- but perhaps the correct approach for these "expensive",
slow rule-generation steps is to come up with a way to centralise them and
generate downloadable rules files from that data?
Comment 34 Justin Mason 2004-01-24 19:03:18 UTC
Dan said:

'Confirming email addresses is something
we cannot do.  What about just looking up the A/MX record for the domain itself
and checking those?  That should be safe.  Nobody is going to register one
domain per spam victim, but doing a one-way hash between hostname and user is
not too hard.  It doesn't have to be something that easily decodes into an
email address, it could just be an English word in a table, like:

  Hamlet -> quinlan@pathname.com
  Mouse  -> jm@jmason.org'

ok -- how's about this algorithm:

1. split hostname into host, domain parts, e.g. "www.slashdot.org" becomes
"www", "slashdot.org"; "www.foo.co.uk" becomes "www", "foo.co.uk";
"three.levels.of.crap.foo.org" becomes "three.levels.of.crap", "foo.org". (we
already have a RE in 2.70 to match the CCTLDs that do ".co.uk"-style subdelegation.)

2. if host != "www", empty, or one of a known set of ok hostnames (determined
empirically from our corpora), then replace it with something different (like
random text) to avoid confirmation.

3. perform lookups etc.
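
A rough sketch of steps 1 and 2 above (the two-level-TLD list is a tiny
stand-in for the real 2.70 table, and the "OK" host names are invented):

  # Sketch of the split-and-randomize idea.  Lists below are placeholders.
  use strict;

  my @two_level = qw(co.uk com.tw com.au tm.se com.br co.nz);
  my %ok_host   = map { $_ => 1 } ('', 'www', 'web', 'ftp', 'mail');

  sub split_hostname {
    my ($fqdn) = @_;
    my $tld_re = join '|', map { quotemeta } @two_level;
    if ($fqdn =~ /^(.*?)\.?([^.]+\.(?:$tld_re))$/i
        || $fqdn =~ /^(.*?)\.?([^.]+\.[^.]+)$/) {
      return ($1, $2);                     # (host part, registerable domain)
    }
    return ('', $fqdn);
  }

  sub safe_lookup_name {
    my ($fqdn) = @_;
    my ($host, $domain) = split_hostname($fqdn);
    return $fqdn if $ok_host{lc $host};    # known-harmless host part: keep it
    my $random = sprintf 'x%08x', int rand 0xffffffff;
    return "$random.$domain";              # avoid confirming an encoded address
  }

  # e.g. safe_lookup_name('wwhx...0627.openbsdmailservers.com')
  #   -> something like 'x1a2b3c4d.openbsdmailservers.com'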

This should still work OK, because:

1. spammers are using wildcard DNS to do addy confirmation (if they are) and to
evade URL filters with random hostnames (if they're not)

2. the level of granularity between a spammer URI and a nonspam one, will be at
the domain level.  Can anyone think of a case where

     host-a.domain.com = spammer
     host-b.domain.com = nonspam

?  All I can think of is something like demon.co.uk who assign subdomains, but
they have a strong antispam clue, do not have a spammer infestation, and are not
the kind of URLs we're talking about catching with these rules anyway.

3. spammers cannot register enough domains to act as addy confirmation
mechanism.  if they have a list of 300000 addresses, that'd require 300000
domains.  Expensive!
Comment 35 Mikael Olsson 2004-01-25 02:18:54 UTC
I don't like the idea of having to run mass-checks manually and
extracting domain names to check from that -- mostly because most
people won't do it.

How about this:

- Extract registerable domain part using reportedly existing heuristics
  (hostpart.spammer.co.uk -> spammer.co.uk)

- Lookup in local cache file (getting to that later) for 
  "spammer.co.uk". If there's a hit, we know the results. If not:

- Fire up full DNS lookups for the given hostpart.spammer.co.uk.
  Get the IP. Check the IP in configured RBLs. Store the results
  in the cache file as something like:
    spammer.co.uk  dnsbl.foo.org 127.0.0.1 127.0.0.2 127.0.0.3
    spammer.co.uk  dnsbl.bar.org 127.0.0.2
  or: for no hit
    spammer.co.uk


Now, when more spam arrives from the same spammer, we're likely
to have the results on file.  Of course, we need to time out the
cache entries somehow; the timeout should probably be configured
on a per-RBL basis given the frequency of updates.  SBL and SPEWS
are pretty static and can probably stay listed for several hours
(a full day?).
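
A minimal sketch of such a cache: a flat text file with lines like the ones
shown above, plus a leading timestamp column so entries can expire.  The
path and TTL are placeholders:

  # Sketch only: one "timestamp domain zone hit-addresses..." line per lookup.
  use strict;

  my $cache_file = '/var/cache/sa-uri-rbl.cache';   # placeholder path
  my $ttl        = 6 * 3600;                        # expire after ~6 hours

  sub cache_read {
    my %cache;
    open my $fh, '<', $cache_file or return {};
    while (<$fh>) {
      my ($when, $domain, $zone, @hits) = split;
      next if !defined $zone || time - $when > $ttl;  # skip stale/odd lines
      $cache{"$domain $zone"} = \@hits;  # empty list means "looked up, no hit"
    }
    close $fh;
    return \%cache;
  }

  sub cache_append {
    my ($domain, $zone, @hits) = @_;
    open my $fh, '>>', $cache_file or return;
    print $fh join(' ', time, $domain, $zone, @hits), "\n";
    close $fh;
  }

  # usage sketch:
  #   my $c = cache_read();
  #   if (exists $c->{"spammer.co.uk sbl.spamhaus.org"}) { ...reuse result... }
  #   else { ...do the DNS work, then cache_append(...)... }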

Of course, if some spammer starts registering in dyndns sites we
don't know of, we'll be damning all users of that dyndns site. :/
Comment 36 Daniel Quinlan 2004-01-28 13:05:22 UTC
*** Bug 2948 has been marked as a duplicate of this bug. ***
Comment 37 dan hollis 2004-02-02 16:46:28 UTC
Blocking URIs by IP country of origin is *very* effective.

99.9% of spam sites are hosted on Chinese/Korean/Russian IPs. Blocking Russian URI IPs
will block most of the criminal spam too (cc fraud, child porn, etc.).

FWIW cn.rbl.cluecentral.net, kr.rbl.cluecentral.net, ru.rbl.cluecentral.net work
well for me.

Blocking on NS is a bit iffy, though there are a number of known hosting
providers who willingly host spam domains and can be blacklisted. Giving end
users the option to block on that would be very nice. Anyone know a blacklist of NS?
Comment 38 Kai 2004-02-02 21:24:31 UTC
See also my email to sa-dev just days before Florian Klein
posted his first patches to implement this: I've used his
patches with nearly unbeatable success since, with the
only real caveat being that his rules seem to home in on
FQDNs of email addresses (completely undesired, but
a relic of other SA code, apparently). The workaround
for this is to use META rules that check for HTML_MESSAGE
as well, but this limits their usefulness and keeps the
original HOSTED_IN_*/HOSTED_AT_* rules from reporting the
IP matched.




Date: Fri, 24 Oct 2003 02:37:10 -0400
From: Kai <kai-sa-devel@conti.nu>
To: spamassassin-devel@lists.sourceforge.net
Subject: [SAdev] a new plan for SA: DNSBL and DNS scanning of embedded URL 
hostnames

Hello,

 I wish to propose the following new method to be implemented in future
versions of SA - but I unfortunately lack familiarity with the code
base, so I am unlikely to do the programming :)

I've had this concept in my head for a couple of months, but have not
seen anyone else uttering it.

1) virtually all spam contains either a URL or an email address, or
   both as a means of contacting the spammers. These are easily
   available with current code for URL/URI rules.

2) hostnames in URLs and email addresses tend to resolve to valid IP
   numbers, and have name service from DNS servers at known and
   valid IP addresses to be functional.

3) people are attempting to use SA for assigning scores based on
   appearance of arbitrary IP numbers (in headers) and domain
   names, as witnessed by
   http://www.stearns.org/sa-blacklist/sa-blacklist.current and
   http://www.merchantsoverseas.com/wwwroot/gorilla/evilrules.cf
   This does not scale: maintaining these rulesets by hand and
   distributing them to a larger audience is exceedingly hard and
   completely impractical.
   
4) IP numbers are relatively easily mapped to geographical regions
   (by RIR : ARIN, LACNIC, RIPE, APNIC, JPNIC) - this is a backup
   classification criteria in addition to 5) :

5) IP numbers are listed in great quantities in DNSBLs, with listing
   criteria as diverse as country/region, open relay/proxy, spam
   source, or spam support services like spamvertized web page
   or DNS hosting. Operation of DNSBLs is a well-established
   'science', it scales, it is manageable, there is plenty of
   choice.

6) 'roaming' websites have appeared that are hosted via reverse
   proxies on 1000's of compromised, trojaned and unfirewalled
   (Windows) machines, for both port 80/tcp (http) traffic, as
   well as 53/udp DNS traffic - with delegated nameservers changing
   records for these sites every few minutes, and keeping extremely
   short TTLs (less than 10 min.)

I propose the inclusion of code and rulesets to achieve the following
three goals, with a fourth one being designated a 'far future' goal:

a) based on the concept of the current DNSBL lookups for IP numbers in
   mail headers: extend that concept to every hostname contained
   in URLs or email address FQDNs found in the message body or any
   header line: Subject:, From: and Return-Path: come to mind, primarily.

b) based on the concept of a), lookup all host nameserver records (IP
   numbers) for said host/domain names (based on statement #2 above)
   in DNSBL's as well, and permit rulesets to assign scores.
   A rule computing a score based on the NUMBER of such NS records
   (in case some crafty spammer tries to DoS this concept by listing
   200+ DNS servers for his domains), and a limit for the number of
   DNSBL lookups so done to a reasonable number is required.

c) lookup the zone SOA values of a given website's domain name records,
   and assign a score based on arbitrary ranges of these values:
   refresh, retry, expiry, minimum time.

d) future concept: follow the spamvertised URL and determine if the
   page gets redirected to some target page and server that can again
   be treated with goal b)
   

Desired result:
- we can now assign arbitrary scores for spamvertized websites and
  their DNS servers that have their IP addresses appear in any DNSBL,
  or have suspiciously 'mobile' DNS configurations.

Example:
- Just ONE rule assigning a substantial score for every hostname
  resolving to an IP number listed in the cn-kr.blackholes.us DNSBL
  would be enough to reliably cut off whatever air supply Alan Ralsky
  thinks he currently has: Web *AND* DNS-hosting in China, and
   criminal spamvertising by means of breaking and entering through
   100,000s of open proxies. Take a very deep breath before going
  under, Alan, I say.


Thanks,
bye,Kai


--
"Just say No" to Spam                                     Kai Schlichting
New York, Palo Alto, You name it             Sophisticated Technical Peon
Kai's SpamShield <tm> is FREE!                  http://www.SpamShield.org
|                                                                       |
LeasedLines-FrameRelay-IPLs-ISDN-PPP-Cisco-Consulting-VoiceFax-Data-Muxes
WorldWideWebAnything-Intranets-NetAdmin-UnixAdmin-Security-ReallyHardMath


Comment 39 Justin Mason 2004-02-03 15:35:46 UTC
q for the people who've tested Florian's patch -- what's the speed hit like?
Comment 40 Yusuf Goolamabbas 2004-02-03 19:28:21 UTC
I am trying to play around with the new plugin mechanism of 2.70 and thought I
might try to solve this with that mechanism (not sure if this is the right
approach). Whilst PerMsgStatus->get_uri_list() gives me all the URIs in a
message, I haven't figured out the appropriate way to tokenize the hostnames and
then call rbl_check_from_host(). Initially, I am looking to do only hostname
lookups, and probably look up a cdb file first, then look up DNS.
The population of the cdb file could occur via spam-traps.
Comment 41 Kai 2004-02-03 22:27:36 UTC
> q for the people who've tested Florian's patch -- what's the speed hit like?

Difficult to measure, since performance is relative to general system
performance. Counting the number of DNSBL lookups seems necessary in this
context as well. How would we go about this?

Local stats here, based on spamd logging:

Celeron 533Mhz machine with avrg. load approaching 2.0 during mail
receipt: full SA bayes, full SA network tests

- minimum scan time per mail with spamd = 0.5s
- stats since Nov 22nd (2.5 months of data):
  57,779 mails scanned
  397,921 cumulative seconds logged by spamd
  6.88s average per mail scanned.
  95th percentile: 25.5s
  90th percentile: 16.7s
  80th percentile: 10.2s
  70th percentile:  6.2s
- weekly averages in the last 4 weeks, (Sun-Sun):
  7.1s  (ending Jan 31) per mail scanned
  8.7s  (ending Jan 24)
  12.6s (ending Jan 17)
  10.8s (ending Jan 10)
- 13 HOSTED_AT_* rules activated (score != 0)
- 10 HOSTED_IN_* rules activated (score != 0)

This system slows down to a crawl with a load > 5.0 when SpamShield,
dummy-smtpd and spamd are cranking at sustained bursts of up to
5 sim. hostile SMTP connects/sec getting trapped, fended off and
the connecting hosts firewalled in near-realtime.


Some observations, and mitigation techniques to not fall prey to
a message designed to generate a flood/DoS against SA:

- should keep short-time (15 min.) stats on DNS response
  time, especially for re-use within the same mail body
- score (possibly intentionally) slow DNS responses
  for the URLs from servers against them, especially for
  subsequent lookups
- possibly forgo subsequent lookups against the same DNS servers
  marked 'slow' for other URL hostnames.
- control DNS lookups very specifically, and prevent automatic
  recursive lookups, but do 2-stage queries instead: root nameservers
  and those governing entire TLDs are seldom slow, while delegated
  DNS servers in spammer hands might be; we only want to query the
  latter once or twice, if they're slow.
- come up with a gradient score dependent on number of
  URLs encountered for a given mail.
- create rule to look up directly-delegated DNS servers (from TLDs)
  in DNSBLs as well. Those pesky Ralsky servers in .CN and .BR
- forget about looking up ANY DNSBL-listings for ANY FQDNs of email
  addresses, period. There's too few pieces of spam around that
  do NOT have http:// URLs and only provide an email address as
  a sole point of contact. We are covering those special pieces of
  spam with the nigerian rules (which need some updating, hmm).




Comment 42 Mikael Olsson 2004-02-05 02:55:32 UTC
Kai <kai-sa-devel@conti.nu> wrote:
> Just ONE rule assigning a substantial score for every hostname
> resolving to an IP number listed in the cn-kr.blackholes.us DNSBL
> would be enough to reliably cut off whatever air supply Alan Ralsky
> thinks he currently has: Web *AND* DNS-hosting in China,

I see that you do not do business with China or Korea. Did you know that
China alone accounts for more than a fifth of the world's population?
And that the population of Korea and China together is nearly five
times that of the US?

If you want to blackhole all of them on your server, you are of
course free to do so, but please do not advocate that the whole world
should do so via gratuitous default SpamAssassin rules, sir.

It's bad enough that they are all taught in school to be polite and
begin their letters with "Dear Sir" and get rewarded with a 3.something
SA score. (Yes, I changed that already, thank you very much. There is
obviously not enough Chinese mail in the GA ham corpus.)


SBL & co are there to dynamically nail ISPs that can't be bothered 
to get rid of spammers. It works very well when applied to web 
server IPs, as has already been demonstrated.

Comment 43 dan hollis 2004-02-05 11:18:07 UTC
Since you are so eager to do business with China and Korea, I'll be more than
happy to forward my Chinese and Korean spam to you. Deal?
Comment 44 Kai 2004-02-05 11:28:56 UTC
mike-spamassassin@clueby4.org wrote:

> I see that you do not do business with China nor Korea.

Correct, and that's "local configuration customization" at work.

> please do not advocate that all the world
> should do so via gratitious default spamassassin rules, sir.
[...]
I have advocated the grilling of Ralsky and corrupt Chinese networks,
before anything else.
Default values, particularly for new rules and code, are exceedingly
conservative, and have been throughout the SA development process as
far as I can tell. Indeed, no wider testing of this code and its accompanying
(or further customized) rules has taken place yet.


> It's bad enough that they are all taught in school to be polite and
> begin their letters with "Dear Sir" and get rewarded with a 3.something 
> SA score. 

And once again, not all rules work equally well for all people. Anyone
slapping SA on their mail server and not willing to monitor its behavior
or customize it to match local requirements can expect unexpected and
undesired results.

For what it's worth, my local scores pertaining to China, Korea, both,
and SBL- and SORBS-listed space:

score HOSTED_IN_CHINA           4.0
score HOSTED_IN_KOREA           2.0
score HOSTED_IN_CNKR            3.0
score HOSTED_AT_SBL             10.0
score HOSTED_AT_SORBS           5.0

These rules are meta rules redefined from the originals in Florian's patch,
along the lines of:

uriip __HOSTED_IN_CHINA eval:check_uriip_rbl('china', 'china.blackholes.us.')
meta HOSTED_IN_CHINA    (HTML_MESSAGE && __HOSTED_IN_CHINA && !__HOSTED_IN_CNKR)
describe HOSTED_IN_CHINA Uses a URL hosted in China
tflags HOSTED_IN_CHINA net

[other rules similar, etc.]

Clearly, my default threshold is not 5.0 either.


Back on topic: all involved/interested in URL-based classification
MUST NOT miss the following presentation by Ken Schneider, Brightmail,
about Brightmail's URL filtering:
http://www.spamconference.org/webcast.html
Choose the "Morning 2" session with the appropriate bandwidth, and
skip forward to 1 hour and 4 minutes into the file.
Sorry for the RealDumbProprietary(tm) format.

Brightmail coming out into the open and airing it like this worries
me: they have a history of filing for patents for what they do.
Given that they call this 'tremendously successful' (or something
along those lines), this could become a patent battleground
around the 'next big thing' in spam-filtering technology.

Voice your ideas in public, often and early (= create prior art),
I'd say.

bye,Kai
Comment 45 Justin Mason 2004-02-05 13:06:08 UTC
'Brightmail coming out into the open and airing it like this worries
me: they have a history of filing for patents for what they do.
Given that they call this 'tremendously successful' (or something
along those lines), this could become a patent battleground
around the 'next big thing' in spam-filtering technology.'

There are several other organisations who've been using this technique recently,
so prior art exists.

However, it's a good point -- I would recommend that anyone who's been
investigating this, and is concerned about patents, put up a webpage
ASAP, detailing the idea, a little bit of code history, timestamps,
pointers to web.archive.org and mailing list archives, keywords to
search Google for, etc., so that anyone who in the future needs
proof of prior art for such an "invention" can track it down
and identify who did what first.

I know that Mark Reynolds was talking about it on SA-talk several
years ago, FWIW -- *long* before I heard of BM doing this: cf
http://bl.reynolds.net.au/ksi/
Comment 46 dan hollis 2004-02-15 17:34:07 UTC
Has anyone thought of sending a certified letter of notification to BM and the
USPTO alerting them to prior art for this method, in case BM tries to patent it?

Then the USPTO and BM couldn't claim they didn't know about the prior art, and
if BM went ahead and tried to patent it anyway, they could be charged with
patent fraud.
Comment 47 Justin Mason 2004-02-26 20:50:53 UTC
OK, an implementation of this is now checked in, as the plugin
Mail::SpamAssassin::Plugin::URIDNSBL.

It's fully event-driven, and will give up 2 seconds after the Received-header
DNSBL lookups complete, so slow DNS servers etc. won't have much effect.

Yet to do: select the URIs to look up more carefully; currently it just picks
20 at random.
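For anyone wanting to try it from CVS, the configuration should look roughly
like this (directive names as used by the checked-in plugin; the zone and the
score are only examples -- see the plugin's documentation for the
authoritative syntax):

loadplugin Mail::SpamAssassin::Plugin::URIDNSBL

uridnsbl  URIBL_SBL  sbl.spamhaus.org.  TXT
header    URIBL_SBL  eval:check_uridnsbl('URIBL_SBL')
describe  URIBL_SBL  Contains a URL whose host is listed in the SBL
tflags    URIBL_SBL  net
score     URIBL_SBL  1.0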
Comment 48 Kai 2004-03-02 12:55:19 UTC
I have not had time to install the CVS-current version, but it appears
that Ralsky is already adapting: how might we best exclude URLs that
appear in HREFs with NO clickable link text (or an unreasonably small
number of characters of it)? Note that the xccwbvai.com link got
fully hit by HOSTED_AT_SBL under the previous (Florian's) code,
but we want to avoid unnecessary DNSBL lookups.

Return-Path: <kaikai@NETSCAPE.NET>
Received: from file-srv.DUNON.BE (u212-239-180-160.adsl.pi.be [212.239.180.160])
        by conti.nu (8.12.10/8.12.10) with ESMTP id i22EkxB3022923
        for <kai@EXAMPLE.TLD>; Tue, 2 Mar 2004 09:47:04 -0500 (EST)
Received: from tanner ([61.171.33.130]) by file-srv.DUNON.BE with Microsoft 
SMTPSVC(5.0.2195.6713);
         Tue, 2 Mar 2004 15:47:45 +0100
From: "Kedoathiel"<kaikai@NETSCAPE.NET>
To: kai@EXAMPLE.TLD
Subject: kai: CI(AL1S)  w0rks in as 1lttle as 3o m1nutes and 1asts for up t0 36 
hours.
Mime-Version: 1.0
Content-Type: text/html; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-ID: <FILE-SRV38GkNmRAlWt000a0381@file-srv.DUNON.BE>
X-OriginalArrivalTime: 02 Mar 2004 14:47:46.0867 (UTC) FILETIME=
[53ABA030:01C40065]
Date: 2 Mar 2004 15:47:46 +0100

<html><body bgcolor=#FFFFFF text=#000000><b><font color=#FF0000> kai:<br>
CI(ALI)S  is alm0nd pi1l--it acts quicker and lasts much longer! 
</font></b><br><br>
 1: Overall e*rectile function! <br> 2: Partners' Satisfaction with s-exual 
interc0urse . <br> 3: satisfaction with the hardness of e_rections. <br> 4: 
doctor&FDA a'pproved !
<p><font color=#FF0000><b>kai:</b><br>
  <b><a href=http://kai.xccwbvai.com/as>V`i`s`i`t Our S`i`t`e and O`r`d`e`r  
H`e`r`e </a><br><a href=http://kai.net></a><br><br><br><br><br><p><a 
href=http://kai.com></a></p><p>.</p></b></font>
</P>
</BODY></HTML>
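One way to get at the "no clickable link" case without rendering anything is
to walk the HTML and record, for each href, the visible text inside the
anchor. A sketch using HTML::Parser (a real module; the helper and its
'empty' flag are illustrative, not existing SA code):

use strict;
use warnings;
use HTML::Parser;

# Return one hashref per <a href=...>...</a>: the href, its anchor text,
# whether it wrapped an <img>, and whether it looks empty.
sub classify_links {
    my ($html) = @_;
    my (@links, $cur);
    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $attr) = @_;
            if ($tag eq 'a' && defined $attr->{href}) {
                $cur = { href => $attr->{href}, text => '', has_img => 0 };
            } elsif ($tag eq 'img' && $cur) {
                $cur->{has_img} = 1;   # image-only links need separate handling
            }
        }, 'tagname,attr' ],
        text_h => [ sub { $cur->{text} .= $_[0] if $cur }, 'dtext' ],
        end_h  => [ sub {
            if ($_[0] eq 'a' && $cur) { push @links, $cur; undef $cur; }
        }, 'tagname' ],
    );
    $p->parse($html);
    $p->eof;
    for my $l (@links) {
        (my $t = $l->{text}) =~ s/\s+//g;
        $l->{empty} = (length($t) == 0 && !$l->{has_img}) ? 1 : 0;
    }
    return @links;
}

Run against the sample above, the kai.xccwbvai.com href should come back with
real anchor text, while the kai.net and kai.com hrefs come back flagged empty.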

Comment 49 dan hollis 2004-03-02 13:00:08 UTC
The only safe exclusion would be empty links. Anything else could be abused by
spammers to produce false negatives.
Comment 50 Sidney Markowitz 2004-03-02 13:20:16 UTC
I disagree about excluding only empty links. We should exclude anything that is
invisible. If there is a way of making a link invisible that doesn't cause it
to be excluded, the spammer could include as many such links as needed either
to cause a DoS (if we looked up all of them) or to make it statistically
unlikely that we would hit the one spammish link (if we used a random sample).
Doing a half-way job of implementing this will just cause the spammers to
exploit the cases we don't handle, and we might as well not bother.

Isn't there already code to determine whether text is invisible so it can be
ignored by some tests? Can that be used to test the visibility of a link?

Even if that is true, there is a thornier problem: they can use an image with
no text for the link, and we have no way of knowing whether it points to a
visible or an invisible image. What do we do with a hundred hrefs to innocent
places like kai.com, each with a clickable area that is an IMG link to a
one-pixel spacer GIF from some non-spam website, mixed with one href to the
real spam site whose clickable area is an IMG link to a picture that says
"click here"?

I don't want to be negative or to give the spammers any ideas, but I expect that
they would figure this one out on their own.
Comment 51 dan hollis 2004-03-02 13:48:22 UTC
There is no reliable programmatic method to determine whether a link is
invisible or not; the only reliable way is to check for an empty link.

If you can figure out a reliable way to determine invisible links, you'll have
solved most of the current unsolved problems in modern AI research, and will
probably win a number of international scientific awards and medals.

The distributed nature of DNS would seem to defeat any attempt at a DoS via
link lookups.

The only thing spammers would achieve by loading up spams with bogus links is
making it less likely that their spams would get through, which is
self-defeating and rather unlikely to survive for long. After all, the
spammers' goal is to get spams to the recipients, not to get them blocked.
Comment 52 Sidney Markowitz 2004-03-02 14:35:00 UTC
Sorry, but I disagree with most of the previous comment.

Before even getting into the arguments, there is a simple counterexample to your
proposal of ignoring just empty links. The example that was attached a few
comments ago shows a spammer already including an href to an innocent site
kai.com with an empty link area. Your proposal would result in the next spam
from that person including the same href with font size 1 text, making the test
useless. There is no reason to add a useless test.

We are not talking about a general open-ended AI problem. The browser solves the
problem already by interpreting the HTML and rendering pixels. If there are
enough pixels in contrasting foreground and background colors in an area that is
declared as a clickable hotspot, then the link is visible. The question is not
if it is possible to do the same thing, but how close can we get to the same
determination using only a reasonable amount of processing. We already have code
to determine if text has been made invisible by being inside an HTML comment or
in an invisible color or in a very tiny font. We need that already to catch
attempts to make invisible non-spam content dominate the scoring.

That still leaves open the different problem that an image can be visible or
invisible and we cannot tell without downloading it from a website, possibly
triggering a webbug. I don't know how to get around that one, which means that
while I strongly disagree that this is an "AI problem" whose solution would give
us a place in history, I do agree that we may not be able to solve the general
problem.

Most importantly, I disagree with your conclusions:

"the distributed nature of dns would seem to defeat any attempts at dos by
looking up links"

If SA has to look up hundreds of legitimate domains to process each message,
that will slow down processing too much.

Spammers can create throwaway domains and host them on DNS servers that are
designed to slow down anything that queries them. The distributed nature of
DNS only helps to the degree that queries are cached, and spam can contain
variations of host names that ensure it doesn't.

The way to avoid a DoS is not to look up absolutely every link, but to choose
a random sample. That, however, lets the spammer set their own probability of
detection by how many invisible links they include for each visible link.

"the only thing spammers would achieve by loading up spams with bogus links, is
making it less likely that their spams would get through"

The links would only be "bogus" in the sense that they are not really links
the spammer wants anybody to click on. They could point to real, innocent
websites that we would not want on any RBL, like the kai.com example. They
would not appear when someone reads the spam, so they will not be clicked on.
The only thing that might look up the domains of those hrefs would be spam
filters, which will find that they are innocent.

What _might_ work is a rule that is DoS-proof because it looks up only a limited
number of hrefs, and another rule that penalizes mail that has enough links that
it may be an attempt to introduce chaff to fool the first rule. Both of those
would be made more effective by ignoring links that have invisible text. I still
don't know what we would do about links that use images for their clickable area.

Comment 53 Loren Wilton 2004-03-02 14:58:04 UTC
I see one major problem with skipping invisible links in at least some form of 
examination.  That is where the spammer will put the web bugs.  After all, they 
don't need to be (and generally aren't) visible.

I consider the "lookup DoS" argument somewhat moot.  Count the number of links
in the message and compare it to the text size.  If there are more than x*10^n
links on the page, or the links make up more than y% of the total body size,
declare the thing to be spam without looking anything up.

Incidentally, this makes an argument for self-scoring tests.  A test that 
counted URIs in the body and gave a score based on the number of hits (possibly 
times a factor or with upper and lower bounds) would make this sort of decision 
a lot easier than it currently is.

       Loren
Comment 54 dan hollis 2004-03-02 15:00:20 UTC
Re: your comments

"If SA has to look up hundreds of legitimate domains to process each message,
that will slow down processing too much."

If they are including hundreds of links in each message, that alone is a good
trigger rule for SA -- again, a self-defeating attack that I don't think any
spammer would use for long.

"Spammers can create throwaway domains and host them on DNS servers that are
designed to slow down anything that queries them. The distributed nature of DNS
only helps to the degree that queries are cached, but spam cam contain
variations of host names that will ensure that doesn't help."

You can look up the domain's delegated nameservers (its NS/SOA records) and
check them against an RBL of known bad DNS servers (or bad hosting networks in
general, e.g. an NS that points to a DNS server in China). Since the
delegation records are kept by the parent (TLD) servers, "deliberately slow
DNS servers" won't have any effect there; a single match would be good enough
to bail out on.

As for bogus domains, NXDOMAIN comes back fairly quickly (but if SiteFinder
ever comes back... ugh). Lots of NXDOMAINs would definitely make another good
high-scoring SA rule.

I categorically disagree with your assertion that empty-link tests are
useless. By that logic there would be no reason for 99% of the rules in SA,
because spammers do not always use the same tricks and keep changing them. If
even one spammer uses the technique, that is enough to justify the rule's
existence (and a false positive on it is rather unlikely).

IMHO your proposal to interpret images is even worse by several orders of
magnitude than DNS lookups, because now SA has to download referenced images and
interpret them, which is far more resource intensive than simple SOA queries.

There simply is no way to determine whether a link is visible or not; whatever
you check for, spammers would just _always_ link to images to defeat any
visibility check you could possibly do, or they would simply make the links
visible (but useless), and then all your effort on this checking is wasted.

The empty-href check is a simple one and would already be catching spams today.
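A sketch of the "check the delegated DNS servers" idea with Net::DNS (a real
module; the DNSBL zone passed in is just an example, and this is not existing
SA code):

use strict;
use warnings;
use Net::DNS;

# Resolve a domain's NS records, then test each nameserver's A record
# against an IP-based DNSBL.  Returns the nameservers that are listed.
sub nameservers_listed {
    my ($domain, $dnsbl) = @_;
    my $res = Net::DNS::Resolver->new(udp_timeout => 2);
    my $ns_reply = $res->query($domain, 'NS') or return ();
    my @listed;
    for my $ns (grep { $_->type eq 'NS' } $ns_reply->answer) {
        my $a_reply = $res->query($ns->nsdname, 'A') or next;
        for my $a (grep { $_->type eq 'A' } $a_reply->answer) {
            my $rev = join('.', reverse split /\./, $a->address);
            push @listed, $ns->nsdname
                if $res->query("$rev.$dnsbl", 'A');   # listed => defined reply
        }
    }
    return @listed;
}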
Comment 55 Daniel Quinlan 2004-03-02 15:21:24 UTC
Subject: Re:  do RBL look-ups on URLs

> I see one major problem with skipping invisible links in at least some
> form of examination.  That is where the spammer will put the web bugs.
> After all, they don't need to be (and generally aren't) visible.

Web bugs are usually URLs that are loaded without any action on the part
of the user.  We can always test those.

> I consider the "lookup DOS" argument somewhat moot.  Count the number
> of links in the message and divide into the text size.  If there are
> more than x*10^n links on the page, or the links are more than y% of
> the total body size, declare the thing to be spam without looking
> anything up.

We can try that as a separate test.

A test for "more than y%" probably won't work (even with a minimum
length, I suspect) since non-spammers do that all the time.

> Incidentally, this makes an argument for self-scoring tests.  A test
> that counted URIs in the body and gave a score based on the number of
> hits (possibly times a factor or with upper and lower bounds) would
> make this sort of decision a lot easier than it currently is.

We do that with range tests all the time.  Breaking tests into ranges is
a bit clumsy and I think having a scoring function would work better,
but we lack the code to handle it.

Comment 56 Sidney Markowitz 2004-03-02 15:57:09 UTC
Re: Dan Hollis' comments

Whatever disagreements may be left, I think we are very close on what we are
proposing. You say we should skip hrefs that are invisible due to having empty
link areas. I say we should skip them and as long as we are looking for that
also look for ones that match the existing small font and invisible color tests.
The purpose is the same as testing for empty link areas and we already have the
code.

I did not propose looking up image references to decide if they are visible: I
mentioned that as something which would _not_ be practical to do. I agree that
we should implement rules that will help even if they are not perfect, which
means that we should have this DNSBL rule even if we cannot tell what an
image-only link points to.

I agree that if spammers have to include hundreds of links to obfuscate one real
one, that in itself would be a good trigger rule. The point I was trying to make
is that if we include the DNSBL rule we should 1) use a small random sample to
avoid a DoS, and 2) add a rule to catch too many links.

So really, we are in agreement, but I'm extending your suggestion that we use
the DNSBL rule modified to ignore empty link text. The extensions (sketched
below) are:

1) Ignore for the DNSBL rule not just empty link text but anything that matches
the existing tests for "invisible" text based on font size and color.

2) Count the number of such ignored links for a possible separate rule that
penalizes them; there is little reason for invisible links in legitimate email.

3) Separately count the number of links with an image-only clickable area and
score a rule for that separately. Assume that they are not invisible for the
purpose of DNSBL checking, since invisible ones will only be useful to the
spammer in quantity and the quantity will trigger this rule.
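A sketch of how 1) and the counting in 2)/3) could fit together, assuming a
classify_links()-style helper like the one sketched earlier (List::Util is
real; the sample size and hash keys are illustrative):

use strict;
use warnings;
use List::Util qw(shuffle);

my $SAMPLE = 10;    # cap on DNSBL lookups per message

# Split links into a random sample of visible ones (for DNSBL lookups)
# and a count of hidden ones (for a separate chaff-penalty rule).
sub pick_lookups_and_chaff {
    my (@links) = @_;
    my @visible = grep { !$_->{empty} } @links;
    my @hidden  = grep {  $_->{empty} } @links;
    my @sample  = grep { defined } (shuffle @visible)[0 .. $SAMPLE - 1];
    return {
        lookups      => \@sample,          # feed these to the DNSBL check
        hidden_count => scalar @hidden,    # score this separately
    };
}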
Comment 57 Mikael Olsson 2004-03-02 17:11:52 UTC
Re: DoS protection -- why not simply count the number of _unique_
domains in the mail and declare "spam!" if they're too many?
(Refuse to do RBL lookups and trigger some other rule instead)

Legitimate mailers use ImageReady to split their images up
in itty bits (lord knows why), yes, but they'll all point to
the same place.  Only a spammer uses dozens of different
host names.
Comment 58 Daniel Quinlan 2004-03-02 17:15:51 UTC
Subject: Re:  do RBL look-ups on URLs

> Re: DoS protection -- why not simply count the number of _unique_
> domains in the mail and declare "spam!" if they're too many?
> (Refuse to do RBL lookups and trigger some other rule instead)

-1 Redundant

Look, write some code.  Repeating the same comments and debate points is not
really all that helpful.  This idea has already been suggested in this thread
at least once.

Comment 59 Mikael Olsson 2004-03-02 17:28:39 UTC
> > Re: DoS protection -- why not simply count the number of _unique_
> > domains in the mail 
> 
> -1 Redundant

Um, no, this idea hasn't been mentioned before. The other ideas were
all about extracting a random % of all URIs, counting the number of 
URLs or checking the ratio of URLs to text length -- and the latter two
will result in lots of FPs for people that like to use ImageReady.

Counting the number of unique hostnames (not whole URI strings!) is a 
direct measure of the DNS work that SA would have to do.
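That guard is cheap to sketch; the count of unique hostnames approximates the
DNS work a message would cost (the threshold and the surrounding usage are
illustrative assumptions):

use strict;
use warnings;

# True if the message references more distinct hostnames than we are
# willing to spend DNS lookups on.
sub too_many_unique_hosts {
    my ($limit, @hosts) = @_;
    my %seen = map { lc($_) => 1 } @hosts;
    return scalar(keys %seen) > $limit;
}

# e.g. skip the DNSBL lookups (and fire some other rule) above 20 hosts:
# if (too_many_unique_hosts(20, @hosts)) { ... }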
Comment 60 Loren Wilton 2004-03-02 17:51:12 UTC
> Legitimate mailers use ImageReady to split their images up
> in itty bits (lord knows why), yes, but they'll all point to
> the same place.  Only a spammer uses dozens of different
> host names.

I think you are going to hit ham on this test no matter how you word it.
Innocent mail like "hi Fred, here are the 12 best sites I've found to get
plane reservations!" -- followed, of course, by about 14 URLs to different
Orbitz-type sites.

Which isn't to say that it is necessarily a bad test.  I just suspect that the 
thing will hit far more ham than really desired, no matter how it is set up.  
Might be good as part of a meta though.

I think there are at least two cases that can be tested for here.  One is a
whole bunch of URLs to basically the same host (sharing the last node or two
before .com), all pointing to the spam site.  The other is a form of poisoning
that uses lots of random URLs, many of which might not even be real.  Most of
those would likely show up as invisible links though, since if the sucker
clicks one it won't take him to the spammer's site.

It would probably be good to have the ability to count both types separately 
and make decisions based on the count, to make it easy to adjust for future 
spammer preferences.  For instance, I'm getting a whole lot of stuff today from 
a new spammer that is sending this sort of stuff:

"http://manley.chattel.gluttonearth.com/gld/gld.php"
"http://trinitarian.device.gluttonearth.com/gld/gld.jpg"
"http://rip.burundi.gluttonearth.com/gld/lucky.php"
"http://improvident.centerline.gluttonearth.com/id/cease.html"
"http://denver.troika.gluttonearth.com/gld/morning.jpg"

He seems to have a couple main domains with 2-3 random words in front of the 
main domain name.

       Loren
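For hostnames like those, collapsing to a base domain before counting or
looking anything up removes the random prefixes.  A crude sketch (keeping only
the last two labels is wrong for ccTLD registrations like example.com.cn; real
code would want a proper suffix list):

use strict;
use warnings;

sub base_domain {
    my ($host) = @_;
    my @labels = split /\./, lc $host;
    return lc $host if @labels <= 2;
    return join('.', @labels[-2, -1]);
}

print base_domain('manley.chattel.gluttonearth.com'), "\n";  # gluttonearth.com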
Comment 61 Daniel Quinlan 2004-03-02 18:00:24 UTC
Reopening bug to remove myself as owner; I'm tired of getting comments on
this RESOLVED bug.
Comment 62 Daniel Quinlan 2004-03-02 18:01:09 UTC
Reassigning bug to list.
Comment 63 Daniel Quinlan 2004-03-02 18:02:40 UTC
Closing bug.

Please stop adding comments to this bug.  It has been resolved and the code
is working.  If you want to make additional requests, then open one new bug
per request (or one new bug per set of very closely related requests).