Bug 2384 - RFE: use SA data to generate RBL lists
Summary: RFE: use SA data to generate RBL lists
Status: RESOLVED WORKSFORME
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: spamassassin
Version: SVN Trunk (Latest Devel Version)
Hardware: Other
OS: Other
Importance: P5 enhancement
Target Milestone: Future
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2003-08-30 07:44 UTC by Marc Perkel
Modified: 2019-07-08 10:04 UTC



Attachment: Parse email, find non-local relays (text/plain), submitted by Rich Puhek [HasCLA]

Description Marc Perkel 2003-08-30 07:44:00 UTC
Here's a thought - suppose that SpamAssassin had a database of host IP addresses
and stored the number of messages from each IP and the average SpamAssassin
score from that IP.

With that information, one could make a rule that if a host sent over, say, 200
messages and had an average score over 10, you could add points. Something that
works much like auto-whitelisting, except that it's done by IP address rather
than by From address.
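
Something along these lines could be sketched as follows. This is an illustrative Python mock-up, not SpamAssassin code; the class name, thresholds, and penalty value are all invented for the example:

```python
# Hypothetical per-IP reputation table: count messages and total score per
# sending IP, and add points once a host has sent enough mail at a high
# enough average score.

class IPReputation:
    MIN_MESSAGES = 200    # "over say 200 messages"
    MIN_AVG_SCORE = 10.0  # "an average score over 10"
    PENALTY = 3.0         # points to add; an arbitrary choice here

    def __init__(self):
        self.stats = {}  # ip -> (message_count, total_score)

    def record(self, ip, score):
        count, total = self.stats.get(ip, (0, 0.0))
        self.stats[ip] = (count + 1, total + score)

    def penalty(self, ip):
        count, total = self.stats.get(ip, (0, 0.0))
        if count > self.MIN_MESSAGES and total / count > self.MIN_AVG_SCORE:
            return self.PENALTY
        return 0.0
```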

Additionally - black lists could be generated and shared.

Just a thought - who likes this idea?
Comment 1 Devin Nate 2003-08-31 11:31:16 UTC
I was thinking about this myself, and generally like the idea. My only concern
is keeping 'significant' data around, because there sure are a lot of IP
addresses, and storing info on each one could get costly. I'd make the mechanism
only keep a site once it had accumulated a certain number of points (I was
thinking 200 points), with points calculated as average_score * times_seen
(so a site with an average score of 20 that had sent 10 emails would have 200
points).
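
Note that average_score * times_seen is just the sum of the scores seen, so the threshold reduces to a running total. A minimal sketch (the 200-point threshold comes from the comment; everything else is illustrative):

```python
# A host becomes worth keeping once its accumulated score crosses the
# threshold: average_score * times_seen == sum(scores).

THRESHOLD = 200.0

def is_significant(scores):
    """True once a host's scores sum past the threshold."""
    return sum(scores) >= THRESHOLD
```

For example, a site averaging 20 points over 10 emails accumulates exactly 200 points and just qualifies.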

The big issue: which IP address(es) do you consider significant?

Method #1: Received Headers

Received headers are pretty gross sometimes. The IP address listed at the "top"
of the Received header list isn't the one you want. For example, at our site,
there are internal email-routing Received headers which will appear at the top
of the list. This happens as emails move through our secondary MX hosts to our
primary MX host. Therefore, keeping track of the "top" Received header seems
not to be right.

So, the next thought is to take the originating IP address, the one at the
"bottom" of the Received header list. Except that one can be manipulated by a
spammer, who just needs to insert some crappy Received line at the bottom of the
list. Ironically, looking through my spam, I see that they have done exactly
that on this spam email, geesh, and it was my second pick. The "bottom" Received
line is... well formatted, creative, and wrong. It's wrong because it's claiming
to be our mail server. We run qmail, and our Received lines are in qmail
format - this line is definitely not ours, and yet if you trusted only the
bottom Received line, you'd be trusting bogus info. So the "bottom" Received
line seems no good...

As best I can see, looking through emails, you'd need to be able to isolate the
Received line of a trusted MX host (probably one of your own servers) and figure
out what IP address was talking to the trusted MX host, and use that. That'd
ensure you get a significant IP address. The logic would have to be something
like: start at the top of the Received list, and walk back until you get to a
non-trusted host, and then use the previous entry.

This could be done; you'd need a config parameter setting a regexp that'd ONLY
match your trusted MX hosts. Definitely a pain in the neck.
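
That walk might look something like the sketch below. This is hypothetical illustration code, assuming each Received hop has already been parsed into a (by_ip, from_ip) pair ordered newest-first - and that parsing is itself the hard part:

```python
import re

def first_untrusted_relay(hops, trusted_re):
    """Walk the Received chain from newest to oldest; return the first IP
    that handed mail to one of our trusted hosts but is not itself trusted."""
    for by_ip, from_ip in hops:
        if not re.match(trusted_re, by_ip):
            break  # stamped by a host we don't trust; stop walking
        if not re.match(trusted_re, from_ip):
            return from_ip  # a trusted host received from an outsider
    return None  # mail never left our own network, or the chain was unusable
```

For example, with trusted hosts matching r"10\.0\.0\.", a chain of [("10.0.0.2", "10.0.0.1"), ("10.0.0.1", "203.0.113.7")] yields "203.0.113.7".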


Method #2: URL matching

Spammers need to get you somewhere, and they really need an href link to get you
there. So, scan through an email and grab all of the visible href URLs
(anything invisible or in comments is ignored). Convert all href URLs into both
domain format and IP address(es) (a single domain may resolve to multiple IPs).
You risk auto-blacklisting shared hosts - whatever; we have some shared hosting
features, and we specifically disallow spammers and have never had an issue with
it. If you're stupid enough to mix spammers with legit business, you deserve
what you get.
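
A rough standard-library sketch of that extraction; the regex-based href grab is a stand-in for a real HTML parser, and the visibility checks are omitted entirely:

```python
import re
import socket
from urllib.parse import urlparse

def href_hosts(html):
    """Distinct hostnames found in href attributes (no visibility check)."""
    urls = re.findall(r'href=["\']([^"\']+)["\']', html, re.IGNORECASE)
    return {urlparse(u).hostname for u in urls if urlparse(u).hostname}

def resolve_all(host):
    """All IPv4 addresses a hostname resolves to (one domain, many IPs)."""
    try:
        return sorted({info[4][0]
                       for info in socket.getaddrinfo(host, 80, socket.AF_INET)})
    except socket.gaierror:
        return []
```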

Domains are a different thing. DNS can have an almost endless list of hostnames
that map to a single real host. Version one of a patch would probably only
record the URL address listed; however, subsequent work might go into finding
trends, so that sitea.spammer.com, siteb.spammer.com, sitec.spammer.com, and
sited.spammer.com might get stuck together somehow. There are a few ways of
doing this, and I don't know what I think about them right now.

Given Method #1 and Method #2, which are not mutually exclusive, there is then
the question of scoring:

I believe that scoring would be best handled like the BAYES or AWL system. It'd
have to give a +/- score to bias the overall score towards the long-term
average, but it'd also have to consider how confident it is (i.e., if you've
seen an IP address 2 times, give no points; 10 times, give reduced points; 100
times, OK, now we're confident, give it full points). It would also have to
consider how old/useful the data is and expire data at some point. This would
probably be both db-size driven and time driven.
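
The confidence part might look like the sketch below, using the 2/10/100 sighting counts from the comment; the linear ramp is just one assumed curve, and all names are illustrative:

```python
FULL_CONFIDENCE = 100  # sightings at which we give full points
MIN_SIGHTINGS = 3      # below this, give no points at all

def weighted_score(long_term_avg, times_seen):
    """Scale the long-term average score by how confident we are in it."""
    if times_seen < MIN_SIGHTINGS:
        return 0.0
    confidence = min(times_seen / FULL_CONFIDENCE, 1.0)
    return long_term_avg * confidence
```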

Hmm, I think I'm most interested in Method #2, and scoring needs some
consideration. 

Are you thinking about making some patches?

--
Devin Nate

Comment 2 Devin Nate 2003-08-31 13:37:01 UTC
Humm, looking at the code a brief moment regarding learning IP addresses from
URLs, it looks like all of the URIs are stuck into a list in HTML.pm,
@{$self->{html_text}}, and it looks like that list is given to the BAYES
engine. (Can anyone verify that BAYES gets the URI list?)

If that's the case, one might be able to do a few things in addition to the
current system:

1) The URI list is sometimes encoded (using %-encoding hacks). Decode URIs
(they seem to be in raw form when passed to the URI list) and only record the
domain name. Only grab href/form/base URIs (specifically don't get image sites,
because they are often innocent bystanders and the spammers are 'stealing' their
images).

2) The URI is often a hostname. Resolve hostnames into a list of IP addresses.

3) Convert all IP addresses into dot decimal format (avoid hex encoding).

4) Add the decoded domain name & resolved decoded IP addresses to the URI list
in HTML.pm so that the BAYES engine can learn them.
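
Steps 1 and 3 can be sketched with the standard library. This is illustrative Python, not the HTML.pm code, and hostname-to-IP resolution (step 2) is left out:

```python
import ipaddress
from urllib.parse import unquote, urlparse

def normalized_host(uri):
    """Percent-decode a URI, keep only the host, and convert hex or plain
    integer IP forms to dotted decimal."""
    host = urlparse(unquote(uri)).hostname
    if host is None:
        return None
    try:
        # e.g. 0x7f000001 -> 127.0.0.1
        return str(ipaddress.ip_address(int(host, 0)))
    except ValueError:
        return host  # a normal hostname, or already dotted decimal
```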

Putting them into a separate database seems possible too. The tricky part
(UNDERSTATEMENT) seems to be deciding which of the URLs are spammers' and which
are innocent bystanders (e.g. www.sec.gov, some story on yahoo.com, "as seen on
cnn.com", etc.). (I guess you'd want to look for the most prominent links - by
size, color, image, etc. - not very easy.)

--
Devin Nate
Comment 3 Brian White 2003-09-02 06:30:21 UTC
Subject: Re: [SAdev]  Host IP Database - An automated SPEWS

> I was thinking about this myself, and generally like the idea. My only concern
> is keeping 'significant' data around, because there sure are a lot of IP
> addresses, and storing info on each one could get costly. I'd make the mechanism
> only keep a site once it had accumulated a certain number of points (I was
> thinking 200 points), with points calculated as average_score * times_seen
> (so a site with an average score of 20 that had sent 10 emails would have 200
> points).

I think there are some things to learn from Razor about this.  If it's a
community effort, then we need a way for people to acquire "status" as
they make more and more good reports.

Also, it needs to be reasonably distributed in order to be resilient to
DDoS attacks.  Anything that works is going to be attacked.  Making it a
neutral peer-to-peer type network also means that there is no single
point of failure for lawsuits.

But wouldn't something like this be much like a distributed and cooperative
MAPS system?


> Method #1: Received Headers
> 
> Received headers are pretty gross sometimes. The IP address listed at the "top"
> of the Received header list isn't the one. For example, at our site, there are
> internal email routing received headers which will appear at the top of the
> list. This happens as emails move through our secondary MX hosts to our primary
> MX host. Therefore, keeping track of the "top" Received Header list seems to not
> be right.

This should be pretty easy to automate.  Just do "mx domain.com" to find
out all the MX records for your domain and search for the first Received
header that says "received by valid.mx.ip.address from
spammer.idiot.ip.address".
Or is there something I'm missing here?


> Method #2: URL matching
> 
> Spammers need to get you somewhere, and they really need an href link to get you
> there. So, scan through an email and grab all of the visible href URLs
> (anything invisible or in comments is ignored). Convert all href URLs into both
> domain format and IP address(es) (a single domain may resolve to multiple IPs).
> You risk auto-blacklisting shared hosts - whatever; we have some shared hosting
> features, and we specifically disallow spammers and have never had an issue with
> it. If you're stupid enough to mix spammers with legit business, you deserve
> what you get.

I like this a lot!  What about redirectors, though?


> Domains are a different thing. DNS can have an almost endless list of hostnames
> that match to a single real host. Version one of a patch would probably only
> record the url address listed, however, subsequent work might go into finding
> trends. So that... sitea.spammer.com, siteb.spammer.com, sitec.spammer.com,
> sited.spammer.com might get stuck together somehow. There are a few ways of
> doing this, and I don't know what I think about them right now.

It seems like the existing BAYES tests are probably the best choice for
domain matching.  A distributed version may batch better and/or more
frequently, but matching the resolved IP address in a distributed manner
would probably make up for this and more.

                                          Brian
                                 ( bcwhite@precidia.com )

-------------------------------------------------------------------------------
    Many times the difference between failure and success is doing something
                   nearly right... or doing it exactly right.

Comment 4 Brian White 2003-09-02 06:34:25 UTC
Subject: Re: [SAdev]  Host IP Database - An automated SPEWS

> Humm, looking at the code a brief moment regarding learning IP addresses from
> URLs, it looks like all of the URIs are stuck into a list in HTML.pm,
> @{$self->{html_text}}, and it looks like that list is given to the BAYES
> engine. (Can anyone verify that BAYES gets the URI list?)

Don't forget to match URIs within text/plain, too.  Most mailers will
recognize

			http://bcwhite.dhs.org/

and automatically make it a link.


> 4) Add the decoded domain name & resolved decoded IP addresses to the URI list
> in HTML.pm so that the BAYES engine can learn them.

Would Bayes or another database be effective here when used by only a
single person?  The messages of all spammers are similar enough for Bayes to
be effective, but there's no "similarity" between IP addresses; they either
match exactly or not at all.  A single person may not see enough references
to the same spam site for this to be effective.

                                          Brian
                                 ( bcwhite@precidia.com )


Comment 5 Devin Nate 2003-09-02 15:46:15 UTC
Gah, I spent way too much time over the weekend thinking about how to collect
only spammer data. Unfortunately, I didn't find some amazing solution. Finding
the correct MX record is tough - not everything stamps an equally formatted
Received line. I see our MX servers don't even include their hostname or IP
address (something else to fix). So, finding the first non-trusted Received
line isn't trivial.

As to URI scanning - it seems easy enough to grab the URIs. It's difficult to
grab only visible URIs. And it's really friggen hard to figure out which URIs
are just 'as seen on somesomesite.com', 'in compliance with fcc.com
regulation...', or 'as featured on this yahoo shopping site...'. Even assuming
you have well-scored URLs, how do you score URLs found in an email? Some spams
have a ton of URLs; some have only 1. Some spams refer to non-spam sites; some
do not. So what happens when your mother sends you a picture that's hosted on
one of the sites (e.g. yahoo) that a spammer also periodically links to? The
trick seems to be one of looking for the SIGNIFICANT URI... the one in bold.
Except then images come into the picture and convolute matters. Worst case, a
spammer embeds 100 URLs into a spam and hides them in various ways, with only 1
significant URL.

I haven't even started thinking about how to make a distributed effort. That
assumes a person knows how to reliably collect the info in the first place. I
like the idea of attacking the URI, since a spammer needs to get you to a
website to sell something. I just don't have a clue (besides a horrendously
ugly HTML parser) how to narrow down the URIs collected.

--
Devin Nate
Comment 6 Brian White 2003-09-03 06:13:54 UTC
Subject: Re: [SAdev]  Host IP Database - An automated SPEWS

> Gah, I spent way too much time thinking on the weekend about how to only get
> spammer data. Unfortunately, I didn't find some amazing solution. Finding the
> correct MX record is tough - not everything stamps an equally formatted received
> line. I see our MX servers don't even include their hostname or IP address
> (something else to fix). So, finding the first non-trusted received line isn't
> trivial.

How annoying.  Would it be reasonably easy to recognize _most_ mailers
and allow it to be customized by a user if need be?


> As to URI scanning - it seems easy enough to grab the URIs. It's difficult to
> grab only visible URIs. And it's really friggen hard to figure out which URIs
> are just 'as seen on somesomesite.com', 'in compliance with fcc.com
> regulation...', or 'as featured on this yahoo shopping site...'. Even assuming
> you have well-scored URLs, how do you score URLs found in an email? Some spams
> have a ton of URLs; some have only 1. Some spams refer to non-spam sites; some
> do not. So what happens when your mother sends you a picture that's hosted on
> one of the sites (e.g. yahoo) that a spammer also periodically links to? The
> trick seems to be one of looking for the SIGNIFICANT URI... the one in bold.
> Except then images come into the picture and convolute matters. Worst case, a
> spammer embeds 100 URLs into a spam and hides them in various ways, with only 1
> significant URL.

I think the best solution would be to only look for bad URIs.  If it's not
a spam URI, ignore it completely.  No ham will have spam URIs, and spam _must_
have one or more (otherwise the spam is pretty pointless).


> I haven't even started thinking about how to make a distributed effort. That
> assumes a person knows how to reliably collect the info in the first place. I
> like the idea of attacking the URI, since a spammer needs to get you to a
> website to sell something. I just don't have a clue (besides a horrendeously
> ugly html parser) how to narrow down the URI's collected.

Fully distributed systems (with no "master" or "slave" relationships) are
wonderful fun!  We could simplify things, though, since it isn't essential
for all the servers to be in perfect sync.  As long as each can do its job
independently of the others, it doesn't matter if they are marginally
different in the results they might return.

                                          Brian
                                 ( bcwhite@precidia.com )


Comment 7 Rich Puhek 2003-09-03 09:09:04 UTC
Created attachment 1310 [details]
Parse email, find non-local relays

I hacked together the attached script as part of auto-generating my own RBL.
It takes an email on STDIN (I have a dummy user which pipes to spamparser.pl),
scans the email for Received lines, parses out the IP address, and compares it
to a list of "good" servers.

The @our_relays list is a list of local IP addresses. The parser skips over
these when looking at Received lines. The first non-local relay is entered into
my DB of received spam (along with the actual spam, the date, etc.).

The key contribution to this idea is the Received header parsing. 

The script has some rudimentary protection against forwarded emails, MIME
sections, etc.
Comment 8 Rich Puhek 2003-09-03 09:25:49 UTC
Subject: Re: [SAdev]  Host IP Database - An automated SPEWS



bugzilla-daemon@bugzilla.spamassassin.org wrote:

> Given Method #1 and Method #2, which are not mutually exclusive, there is then
> the question of scoring:
> 
> I believe that scoring would be best handled like the BAYES or AWL system. It'd
> have to give a +/- score to bias the overall score towards the long term
> average, but it'd also have to consider how confident it is (i.e. if you've seen
> an ip address 2 times, you give no points, 10 times, you give reduced points,
> 100 times, ok now we're confident give it full points). It would also have to
> consider how old/useful the data is and expire data at some point in time. This
> would probably be db size driven and time driven.
> 
> Hmm, I think I'm most interested in Method #2, and scoring needs some
> consideration. 
> 

I've pondered something like the above. My long term plan is to create a 
system that will take info from SA regarding hosts, SA scores of email 
from that host, rate of email from that host, and possibly other info, 
and create a pair of local RBLs (I planned on using the RBL concept so 
that it is easy to integrate with MTAs). One RBL, "warn", would result 
in 4xx temp failures. The other, "deny", would result in 5xx permanent 
failures.

The idea is that we'd get an immediate response to something like a 
small server we never communicate with suddenly sending 100 fairly 
spammy messages per minute. At the same time, AOL mailservers sending 
100 "very spammy" messages per minute might be "normal" for a given user 
base.

The 4xx response list will throttle spammers (and possibly legit traffic).
Broken spamware will continue to aggressively hammer the server, but legit
servers will back off and retry. The idea is that the "warn" list will be
cleaned of legit servers before the typical 4h retry interval.
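
On the MTA side, consulting the two lists might look like the sketch below. The zone names are placeholders (the real project never published them), and the lookup is the conventional RBL trick of querying the reversed octets under the zone:

```python
import socket

def rbl_listed(ip, zone):
    """Standard RBL check: 1.2.0.192.<zone> resolves iff 192.0.2.1 is listed."""
    query = ".".join(reversed(ip.split("."))) + "." + zone
    try:
        socket.gethostbyname(query)
        return True   # any answer means the IP is listed
    except socket.gaierror:
        return False

def smtp_response(ip):
    if rbl_listed(ip, "deny.rbl.example.net"):   # placeholder zone
        return "554 5.7.1 Blocked"               # 5xx permanent failure
    if rbl_listed(ip, "warn.rbl.example.net"):   # placeholder zone
        return "451 4.7.1 Try again later"       # 4xx temporary failure
    return "250 OK"
```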

I've created a sourceforge project (activespam), but haven't pushed up 
any code or documentation yet. Need to get my chicken-scratchings 
transferred to HTML first.

--Rich

-- 

_________________________________________________________

Rich Puhek
ETN Systems Inc.
2125 1st Ave East
Hibbing MN 55746

tel:   218.262.1130
email: rpuhek@etnsystems.com
_________________________________________________________

Comment 9 Justin Mason 2003-09-03 09:44:20 UTC
Subject: Re: [SAdev]  Host IP Database - An automated SPEWS

> > Gah, I spent way too much time thinking on the weekend about how to only get
> > spammer data. Unfortunately, I didn't find some amazing solution. Finding the
> > correct MX record is tough - not everything stamps an equally formatted received
> > line. I see our MX servers don't even include their hostname or IP address
> > (something else to fix). So, finding the first non-trusted received line isn't
> > trivial.
> 
> How annoying.  Would it be reasonably easy to recognize _most_ mailers
> and allow it to be customized by a user if need be?

Note that 2.60 already has support for detecting and/or specifying which
Received headers are "trusted", and makes the results available as two separate
metadata headers -- X-Relays-Trusted and X-Relays-Untrusted.  See the Conf
manpage to figure out how to add those to the output so you can see what to
match against.

> > As to URI scanning - it seems easy enough to grab the URIs. It's difficult to
> > grab only visible URIs. And it's really friggen hard to figure out which URIs
> > are just 'as seen on somesomesite.com', 'in compliance with fcc.com
> > regulation...', or 'as featured on this yahoo shopping site...'. Even assuming
> > you have well-scored URLs, how do you score URLs found in an email? Some spams
> > have a ton of URLs; some have only 1. Some spams refer to non-spam sites; some
> > do not. So what happens when your mother sends you a picture that's hosted on
> > one of the sites (e.g. yahoo) that a spammer also periodically links to? The
> > trick seems to be one of looking for the SIGNIFICANT URI... the one in bold.
> > Except then images come into the picture and convolute matters. Worst case, a
> > spammer embeds 100 URLs into a spam and hides them in various ways, with only 1
> > significant URL.
> 
> I think the best solution would be to only look for bad URIs.  If it's not
> a spam URI, ignore it completely.  No ham will have spam URIs, and spam _must_
> have one or more (otherwise the spam is pretty pointless).

Or use Bayesian logic to determine which are spammy, which are not, and which
are "in between".
In fact, just dumping the contents of the bayes db and grepping for URI
components would work fine -- Bayes already (a) decodes URIs, (b) breaks
them apart and (c) tracks spam/ham occurrences ;)

--j.

Comment 10 Devin Nate 2003-09-03 21:18:29 UTC
Ok... so, a few interesting things have come from this. SA 2.60 apparently has
the ability to find the first non-trusted relay. I have corrected our email
server (the situation was tcpserver -l 0; we run qmail) so that it properly
stamps Received lines. This may make IP addresses from the Received headers
usable, which would be seriously cool. I may try a first crack at that.

I am totally ignoring the distributed mechanism for now; it's been done, Razor2
does it, the DNS blacklists do it, DCC does it, spamd does it: it's not
impossible, and there are several models to choose from. Collecting accurate
information, on the other hand, is more difficult.

I still believe URIs are a problem, even having read all of the posts. Some
URIs appear only in spam. Some URIs will only appear in ham. And a whole ton of
URIs can exist in either, and presumably as soon as we write a rule that looks
for spam URIs, the spammers will add ham-type URIs. So, the simple thing to say
is 'when we find an email with a spam URI, we give it hella points cause it
must be spam'.

The problem is, how do we know it's a spam URI in the first place? I have a ton
of spam that refers to legit websites. I also have a great deal of spam which
has multiple different URLs in it. Much of my spam has only 1 link on the whole
page; much of my spam has a friggen ton of links. Some links are image links,
some area links, some base/href links, some form CGI actions, etc. I can write
a program that'll work today - that's easy enough; it's when the spammers start
adding ham-ish URLs to confuse the filters that things get tricky.

What I think the answer starts with is some sort of probability. If a message
has 1 link on it, and only 1 link, and gets 50 points, it's going to be a spam
link. If it's a spam with 10 different links, perhaps each unique, and gets a
score of 20... then what? For the investor-club spams, there's often a link
saying something like 'we comply with federal law; click here www.fcc.gov for
more info' or some such thing. Obviously, www.fcc.gov isn't an address we want
to hit. Another spam I viewed linked to specific pages on www.yahoo.com...
clearly spammers can set up mini sites on such services, but there's a holy ton
of email floating around linking to www.yahoo.com in one form or another.

Hand training does begin to solve this problem. Perhaps the answer is simply to
have people submit domains as spam sites, have someone manually verify them,
and then set up a distributed database to deal with that. I'd prefer some sort
of autolearning also.

What I don't want is a polluted autolearned database. Spammers can easily
embed non-visible or non-significant URLs to cause a relatively effective DoS
if we simply learn on any URI that's presented in an email. Hint: generate a
spam that'll get 100 points, stick URLs to ibm.com, microsoft.com, yahoo.com,
hotmail.com, etc. in it, and watch a naive autolearner blacklist all emails
with links to those sites. Spammers already include random words to confuse
filters; random URLs would be easy enough too. With some trickery, you could
probably get those URLs embedded in such a way that a parser would see them
but not display them. Hmmm, I don't know how to solve that. Any suggestions?

Still thinking about this!

Thanks,
Devin Nate
Comment 11 Brian White 2003-09-04 06:12:34 UTC
Subject: Re: [SAdev]  Host IP Database - An automated SPEWS

> The problem is, how do we know it's a spam URI in the first place? I have a ton
> of spam that refers to legit websites. I also have a great deal of spam which
> has multiple different URLs in it. Much of my spam has only 1 link on the whole
> page; much of my spam has a friggen ton of links. Some links are image links,
> some area links, some base/href links, some form CGI actions, etc. I can write
> a program that'll work today - that's easy enough; it's when the spammers start
> adding ham-ish URLs to confuse the filters that things get tricky.
> 
> What I think the answer starts with is some sort of probability. If a message
> has 1 link on it, and only 1 link, and gets 50 points, it's going to be a spam
> link. If it's a spam with 10 different links, perhaps each unique, and gets a
> score of 20... then what? For the investor-club spams, there's often a link
> saying something like 'we comply with federal law; click here www.fcc.gov for
> more info' or some such thing. Obviously, www.fcc.gov isn't an address we want
> to hit. Another spam I viewed linked to specific pages on www.yahoo.com...
> clearly spammers can set up mini sites on such services, but there's a holy ton
> of email floating around linking to www.yahoo.com in one form or another.

Just normal training on a corpus of ham/spam should solve this pretty
quickly.  Unlike Bayes/CRM training on words/phrases, URIs (or rather,
the IP addresses they map to) are a bit different.  It's not ham/spam
but rather neutral/spam.  Words are ambiguous; IP addresses are not.

Thus, the weighting values grow much like word values in Bayes but the
combining would be different.  A "neutral" address would not weight
negatively (where a negative number means "non-spam") while a "spam"
address would weight positively.  It may be as simple as just taking
the highest-ranking address and using that as your indicator.

The obvious problem is if somebody forwards a spam message, but that's
already a problem and the reason why no single rule in SA is enough to
get something classified as spam.


> What I don't want is a polluted autolearned database. Spammers can easily
> embed non-visible or non-significant URLs to cause a relatively effective DoS
> if we simply learn on any URI that's presented in an email. Hint: generate a
> spam that'll get 100 points, stick URLs to ibm.com, microsoft.com, yahoo.com,
> hotmail.com, etc. in it, and watch a naive autolearner blacklist all emails
> with links to those sites. Spammers already include random words to confuse
> filters; random URLs would be easy enough too. With some trickery, you could
> probably get those URLs embedded in such a way that a parser would see them
> but not display them. Hmmm, I don't know how to solve that. Any suggestions?

The "Achilles heel" for this is the fact that spam may include neutral URIs
as well as spam ones, but ham will never include the spam ones.  Thus, anything
that is unknown, or seen more than 1 time in ham for every 100 times in spam,
can probably just be ignored.
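
As a sketch, that filter is nearly a one-liner; the 1-in-100 ratio comes straight from the paragraph above, and the ham/spam counts are assumed to come from corpus training:

```python
def is_spam_indicator(ham_seen, spam_seen):
    """Treat a URI host as a spam sign only if it is known in spam and its
    ham rate is at most one sighting per hundred spam sightings."""
    if spam_seen == 0:
        return False  # unknown or ham-only: ignore
    return ham_seen * 100 <= spam_seen
```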

                                          Brian
                                 ( bcwhite@precidia.com )


Comment 12 Justin Mason 2004-02-18 18:25:07 UTC
Method #2 is already handled, BTW -- Bayes tracks URI data.

Method #1 however is not (yet).  retitling bug to more correctly reflect that.
'tools/bayes_dump_to_trusted_networks' in the 3.0.0 svn trunk may be interesting
-- just use the inverse to find *un*trustworthy hosts ;)
Comment 13 Daniel Quinlan 2005-03-30 01:08:36 UTC
move bug to Future milestone (previously set to Future -- I hope)
Comment 14 Henrik Krohns 2019-07-08 10:04:59 UTC
Closing ancient stale bug. Probably not relevant anymore.