SA Bugzilla – Bug 2384
RFE: use SA data to generate RBL lists
Last modified: 2019-07-08 10:04:59 UTC
Here's a thought - suppose that spamassassin had a database of host IP addresses and stored the number of messages from that IP and the average SpamAssassin score from that IP. With that information one could make a rule that adds points if a host sent over, say, 200 messages and had an average score over 10. Something that worked much like auto-whitelisting, except that it's done by IP address rather than From address. Additionally, blacklists could be generated and shared. Just a thought - who likes this idea?
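A minimal sketch of the per-IP tally described above, in Python. The 200-message and average-score-over-10 thresholds are the ones the comment suggests; the class and method names are invented for illustration:

```python
class HostStats:
    """Per-IP tally: message count and running average SA score.
    Thresholds default to the comment's suggestion (200 messages,
    average score over 10)."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def record(self, score):
        # Called once per message seen from this IP.
        self.count += 1
        self.total += score

    @property
    def average(self):
        return self.total / self.count if self.count else 0.0

    def should_penalize(self, min_count=200, min_avg=10.0):
        # Add points only once we have both volume and a bad average.
        return self.count >= min_count and self.average > min_avg
```

Real code would also need per-IP persistence and expiry, which the later comments discuss.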
I was thinking about this myself, and generally like the idea. My only concern is keeping 'significant' data around, because there sure are a lot of IP addresses, and storing info on each one could get costly. I'd make the mechanism kick in as soon as a site had accumulated a certain number of points (I was thinking 200 points), with points calculated as average_score * times_seen (so a site with an average score of 20 that had sent 10 emails would have 200 points).

The big issue: which IP address(es) do you consider significant?

Method #1: Received Headers

Received headers are pretty gross sometimes. The IP address listed at the "top" of the Received header list isn't necessarily the right one. For example, at our site, there are internal email routing Received headers which will appear at the top of the list. This happens as emails move through our secondary MX hosts to our primary MX host. Therefore, keeping track of the "top" of the Received header list seems wrong.

So, the next thought is to take the originating IP address, the one at the "bottom" of the Received header list. Except that one can be manipulated by a spammer, who just needs to insert some crappy Received line at the bottom of the list. Ironically, looking through my spam, I see that they have done exactly that on this spam email, geesh, and it was my second pick. The "bottom" Received line is well formatted, creative, and wrong. It's wrong because it claims to be our mail server. Now, we run qmail, and our Received lines are in qmail format - this line is definitely not ours, and yet, if you trusted only the bottom Received line, you'd be trusting bogus info. So the "bottom" Received line seems no good either.

As best I can see, looking through emails, you'd need to be able to isolate the Received line of a trusted MX host (probably one of your own servers), figure out what IP address was talking to that trusted MX host, and use that. That would ensure you get a significant IP address.
The logic would have to be something like: start at the top of the Received list, walk back until you get to a non-trusted host, and then use the previous entry. This could be done; you'd need a config parameter setting a regexp that'd ONLY match your trusted MX hosts. Definitely a pain in the neck.

Method #2: URL matching

Spammers need to get you somewhere, and they really need an href link to get you there. So, scan through an email and get all of the visible href URLs (anything invisible or in comments is ignored). Convert all href URLs into both domain format and IP address(es) (a single domain may resolve to multiple IPs). You risk auto-blacklisting shared hosts - whatever; we have some shared hosting features, and we specifically disallow spammers and have never had an issue with it. If you're stupid enough to mix spammers with legit business, you deserve what you get.

Domains are a different thing. DNS can have an almost endless list of hostnames that map to a single real host. Version one of a patch would probably only record the URL address listed; however, subsequent work might go into finding trends, so that sitea.spammer.com, siteb.spammer.com, sitec.spammer.com, and sited.spammer.com might get stuck together somehow. There are a few ways of doing this, and I don't know what I think about them right now.

Given Method #1 and Method #2, which are not mutually exclusive, there is then the question of scoring. I believe that scoring would be best handled like the BAYES or AWL system. It'd have to give a +/- score to bias the overall score towards the long-term average, but it'd also have to consider how confident it is (i.e. if you've seen an IP address 2 times, you give no points; 10 times, you give reduced points; 100 times, OK, now we're confident, give it full points). It would also have to consider how old/useful the data is and expire data at some point in time. This would probably be db-size driven and time driven.
Hmm, I think I'm most interested in Method #2, and scoring needs some consideration. Are you thinking about making some patches? -- Devin Nate
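The "walk the Received list until you hit a non-trusted host" idea from this message could be sketched roughly as follows. This is a Python sketch, not SpamAssassin's actual code: the trusted-host regexp is a stand-in for the proposed config parameter, and the header parsing is deliberately naive:

```python
import re

# Stand-in for the proposed config parameter: a regexp matching ONLY
# your trusted MX hosts.  The address ranges here are placeholders.
TRUSTED_MX_RE = re.compile(r'\b(?:10\.0\.0\.\d{1,3}|192\.0\.2\.\d{1,3})\b')

IP_RE = re.compile(r'(\d{1,3}(?:\.\d{1,3}){3})')

def first_untrusted_ip(received_headers):
    """Walk the Received headers from the top (newest first) and return
    the first relay IP that is not a trusted MX host -- the address
    that actually spoke to your mail infrastructure."""
    for header in received_headers:
        match = IP_RE.search(header)
        if match and not TRUSTED_MX_RE.search(match.group(1)):
            return match.group(1)
    return None
```

A real implementation would have to cope with the messy, non-uniform Received formats discussed later in the thread.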
Hmm, looking at the code for a brief moment regarding learning IP addresses from URLs, it looks like all of the URIs are stuck into a list in HTML.pm, @{$self->{html_text}}, and it looks like that list is given to the BAYES engine. (Can anyone verify that BAYES gets the URI: list?) If that's the case, one might be able to do a few things in addition to the current system:

1) The URI list is sometimes encoded (using %-encoding hacks). Decode URIs (they seem to be in raw form when passed to the URI list) and only record the domain name. Only grab href/form/base URIs (specifically don't get image sites, because they are often innocent bystanders and the spammers are 'stealing' their images).
2) The URI is often a hostname. Resolve hostnames into a list of IP addresses.
3) Convert all IP addresses into dotted-decimal format (avoid hex encoding).
4) Add the decoded domain name and resolved IP addresses to the URI list in HTML.pm so that the BAYES engine can learn them. Putting them into a separate database seems possible too.

The tricky part (UNDERSTATEMENT) seems to be deciding which of the URLs are spammers' and which are innocent bystanders (e.g. www.sec.gov, some story on yahoo.com, "as seen on cnn.com", etc.). (I guess you'd want to look for the most prominent links - by size, color, image, etc. - not very easy). -- Devin Nate
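Steps 1-3 of the list above (decode %-escapes, keep only the domain, resolve to dotted-decimal IPs) might look something like this Python sketch. The function names are invented, and a real patch would also need the href/form/base filtering described in step 1:

```python
import socket
from urllib.parse import unquote, urlparse

def decode_host(uri):
    """Step 1: undo %-encoding hacks and keep only the domain-name
    part of the URI.  Returns None for non-URI text."""
    host = urlparse(unquote(uri)).hostname
    return host.lower() if host else None

def resolve_host(host):
    """Steps 2 and 3: resolve a hostname to its dotted-decimal IPv4
    addresses (a single domain may resolve to several)."""
    try:
        return sorted({info[4][0] for info in
                       socket.getaddrinfo(host, 80, socket.AF_INET)})
    except socket.gaierror:
        return []
```

Note that `resolve_host` does live DNS lookups, so its results depend on the network you run it from.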
Subject: Re: [SAdev] Host IP Database - An automated SPEWS

> I was thinking about this myself, and generally like the idea. My only concern is keeping 'significant' data around, because there sure are a lot of IP addresses and to store info on each one could get costly. I'd make the mechanism that as soon as a site had accumulated a certain number of points (I was thinking 200 points), and points being calculated as average_score * times_seen (so a site with an average score of 20, that had sent 10 emails, would have 200 points).

I think there are some things to learn from Razor about this. If it's a community effort, then we need a way for people to acquire "status" as they make more and more good reports. Also, it needs to be reasonably distributed in order to be resilient to DDoS attacks. Anything that works is going to be attacked. Making it a neutral peer-to-peer type network also means that there is no single point of failure for lawsuits, too. But wouldn't something like this be much like a distributed and cooperative MAPS system?

> Method #1: Received Headers
>
> Received headers are pretty gross sometimes. The IP address listed at the "top" of the Received header list isn't necessarily the right one. For example, at our site, there are internal email routing Received headers which will appear at the top of the list. This happens as emails move through our secondary MX hosts to our primary MX host. Therefore, keeping track of the "top" of the Received header list seems wrong.

This should be pretty easy to automate. Just do "mx domain.com" to find out all the MX records for your domain and search for the first Received header that says "received by valid.mx.ip.address from spammer.idiot.ip.address". Or is there something I'm missing here?

> Method #2: URL matching
>
> Spammers need to get you somewhere, and they really need an href link to get you there.
> So, scan through an email and get all of the visible href URLs (anything invisible or in comments is ignored). Convert all href URLs into both domain format and IP address(es) (a single domain may resolve to multiple IPs). You risk auto-blacklisting shared hosts - whatever; we have some shared hosting features, and we specifically disallow spammers and have never had an issue with it. If you're stupid enough to mix spammers with legit business, you deserve what you get.

I like this a lot! What about redirectors, though?

> Domains are a different thing. DNS can have an almost endless list of hostnames that map to a single real host. Version one of a patch would probably only record the URL address listed; however, subsequent work might go into finding trends, so that sitea.spammer.com, siteb.spammer.com, sitec.spammer.com, and sited.spammer.com might get stuck together somehow. There are a few ways of doing this, and I don't know what I think about them right now.

It seems like the existing BAYES tests are probably the best choice for domain matching. A distributed version may batch better and/or more frequently, but matching the resolved IP address in a distributed manner would probably make up for this and more.

Brian ( bcwhite@precidia.com )
-------------------------------------------------------------------------------
Many times the difference between failure and success is doing something nearly right... or doing it exactly right.
Subject: Re: [SAdev] Host IP Database - An automated SPEWS

> Hmm, looking at the code for a brief moment regarding learning IP addresses from URLs, it looks like all of the URIs are stuck into a list in HTML.pm, @{$self->{html_text}}, and it looks like that list is given to the BAYES engine. (Can anyone verify that BAYES gets the URI: list?)

Don't forget to match URIs within text/plain, too. Most mailers will recognize http://bcwhite.dhs.org/ and automatically make it a link.

> 4) Add the decoded domain name and resolved IP addresses to the URI list in HTML.pm so that the BAYES engine can learn them.

Would Bayes or another database be effective here when used by only a single person? The messages of all spammers are similar enough for Bayes to be effective, but there's no "similarity" between IP addresses; they either match exactly or not at all. A single person may not see enough references to the same spam site for this to be effective.

Brian ( bcwhite@precidia.com )
Gah, I spent way too much time thinking on the weekend about how to only get spammer data. Unfortunately, I didn't find some amazing solution.

Finding the correct MX record is tough - not everything stamps an equally formatted Received line. I see our MX servers don't even include their hostname or IP address (something else to fix). So, finding the first non-trusted Received line isn't trivial.

As to URI scanning - it seems easy enough to grab the URIs. It's difficult to only grab visible URIs. And it's really friggen hard to figure out which URIs are just 'as seen on somesomesite.com', and 'in compliance with fcc.com regulation...', and 'as featured on this yahoo shopping site..'. Even assuming you have well-scored URLs, how do you score URLs found in an email? Some spams have a ton of URLs, some have only 1. Some spams refer to non-spam sites, some do not. So what happens when your mother sends you a picture that's hosted on one of these sites (e.g. yahoo) that a spammer also periodically will link to? The trick seems to be one of looking for the SIGNIFICANT URI.. the one in bold. Except then images come into the picture and complicate matters. Worst case, a spammer embeds 100 URLs into a spam and hides them in various ways, with only 1 significant URL.

I haven't even started thinking about how to make a distributed effort. That assumes a person knows how to reliably collect the info in the first place. I like the idea of attacking the URI, since a spammer needs to get you to a website to sell something. I just don't have a clue (besides a horrendously ugly HTML parser) how to narrow down the URIs collected. -- Devin Nate
Subject: Re: [SAdev] Host IP Database - An automated SPEWS

> Gah, I spent way too much time thinking on the weekend about how to only get spammer data. Unfortunately, I didn't find some amazing solution. Finding the correct MX record is tough - not everything stamps an equally formatted Received line. I see our MX servers don't even include their hostname or IP address (something else to fix). So, finding the first non-trusted Received line isn't trivial.

How annoying. Would it be reasonably easy to recognize _most_ mailers and allow it to be customized by a user if need be?

> As to URI scanning - it seems easy enough to grab the URIs. It's difficult to only grab visible URIs. And it's really friggen hard to figure out which URIs are just 'as seen on somesomesite.com', and 'in compliance with fcc.com regulation...', and 'as featured on this yahoo shopping site..'. Even assuming you have well-scored URLs, how do you score URLs found in an email? Some spams have a ton of URLs, some have only 1. Some spams refer to non-spam sites, some do not. So what happens when your mother sends you a picture that's hosted on one of these sites (e.g. yahoo) that a spammer also periodically will link to? The trick seems to be one of looking for the SIGNIFICANT URI.. the one in bold. Except then images come into the picture and complicate matters. Worst case, a spammer embeds 100 URLs into a spam and hides them in various ways, with only 1 significant URL.

I think the best solution would be to only look for bad URIs. If it's not a spam URI, ignore it completely. No ham will have spam URIs and spam _must_ have one or more (otherwise the spam is pretty pointless).

> I haven't even started thinking about how to make a distributed effort. That assumes a person knows how to reliably collect the info in the first place. I like the idea of attacking the URI, since a spammer needs to get you to a website to sell something.
> I just don't have a clue (besides a horrendously ugly HTML parser) how to narrow down the URIs collected.

Fully distributed systems (with no "master" or "slave" relationships) are wonderful fun! We could simplify things, though, since it isn't essential for all the servers to be in perfect sync. As long as each can do its job independently of the others, then it doesn't matter if they are marginally different in the results they might return.

Brian ( bcwhite@precidia.com )
Created attachment 1310 [details] Parse email, find non-local relays

I hacked together the attached script as part of auto-generating my own RBL. It takes an email on STDIN (I have a dummy user which pipes to spamparser.pl), scans the email for Received lines, parses out the IP address, and compares it to a list of "good" servers. The @our_relays list is a list of local IP addresses. The parser skips over these when looking at Received lines. The first non-local relay is entered into my DB of received spam (along with the actual spam, the date, etc.). The key contribution to this idea is the Received header parsing. The script has some rudimentary protection against forwarded emails, MIME sections, etc.
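The attached Perl script isn't reproduced here, but its core loop as described (skip @our_relays, take the first non-local IP from the Received lines) could be sketched in Python along these lines; the relay list and regexp are placeholders:

```python
import email
import re

# Assumed local relay IPs -- the script's @our_relays equivalent.
OUR_RELAYS = {"10.0.0.1", "10.0.0.2"}
IP_RE = re.compile(r'(\d{1,3}(?:\.\d{1,3}){3})')

def first_nonlocal_relay(raw_message):
    """Scan Received lines top-down; return the first IP address that
    is not one of our own relays (the RBL candidate)."""
    msg = email.message_from_string(raw_message)
    for received in msg.get_all("Received") or []:
        for ip in IP_RE.findall(received):
            if ip not in OUR_RELAYS:
                return ip
    return None
```

In a setup like the one described, this would run on mail piped from a dummy user and feed the returned IP into the RBL database.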
Subject: Re: [SAdev] Host IP Database - An automated SPEWS

bugzilla-daemon@bugzilla.spamassassin.org wrote:

> Given Method #1 and Method #2, which are not mutually exclusive, there is then the question of scoring. I believe that scoring would be best handled like the BAYES or AWL system. It'd have to give a +/- score to bias the overall score towards the long-term average, but it'd also have to consider how confident it is (i.e. if you've seen an IP address 2 times, you give no points; 10 times, you give reduced points; 100 times, OK, now we're confident, give it full points). It would also have to consider how old/useful the data is and expire data at some point in time. This would probably be db-size driven and time driven.
>
> Hmm, I think I'm most interested in Method #2, and scoring needs some consideration.

I've pondered something like the above. My long-term plan is to create a system that will take info from SA regarding hosts, SA scores of email from those hosts, the rate of email from each host, and possibly other info, and create a pair of local RBLs (I planned on using the RBL concept so that it is easy to integrate with MTAs). One RBL, "warn", would result in 4xx temp failures. The other, "deny", would result in 5xx permanent failures.

The idea is that we'd get an immediate response to something like a small server we never communicate with suddenly sending 100 fairly spammy messages per minute. At the same time, AOL mailservers sending 100 "very spammy" messages per minute might be "normal" for a given user base. The 4xx response list will throttle spammers (and possibly legit traffic). Broken spamware will continue to aggressively hammer the server, but legit servers will back off and retry. The idea is that the "warn" list will be cleaned of legit servers before the typical 4h retry interval.

I've created a sourceforge project (activespam), but haven't pushed up any code or documentation yet.
Need to get my chicken-scratchings transferred to HTML first.

--Rich

--
Rich Puhek
ETN Systems Inc.
2125 1st Ave East
Hibbing MN 55746
tel: 218.262.1130
email: rpuhek@etnsystems.com
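The warn/deny split described above could be sketched as a simple classifier. The rate and score thresholds below are invented for illustration, not taken from the activespam project, and the known-good exemption stands in for the "normal for AOL" case:

```python
def classify_host(msgs_per_min, avg_score, is_known_good=False):
    """Place a sending host on the 'deny' (5xx) list, the 'warn' (4xx)
    list, or neither, based on its sending rate and average SA score.
    All thresholds here are illustrative placeholders."""
    if is_known_good:
        return None          # e.g. big providers where spammy mail is "normal"
    if msgs_per_min > 50 and avg_score > 10:
        return "deny"        # MTA answers 5xx permanent failure
    if msgs_per_min > 10 and avg_score > 5:
        return "warn"        # MTA answers 4xx; legit servers back off and retry
    return None
```

The point of the two tiers, as the message explains, is that a "warn" entry can be cleaned off the list before a legitimate server's typical 4-hour retry comes around.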
Subject: Re: [SAdev] Host IP Database - An automated SPEWS

> > Gah, I spent way too much time thinking on the weekend about how to only get spammer data. Unfortunately, I didn't find some amazing solution. Finding the correct MX record is tough - not everything stamps an equally formatted Received line. I see our MX servers don't even include their hostname or IP address (something else to fix). So, finding the first non-trusted Received line isn't trivial.
>
> How annoying. Would it be reasonably easy to recognize _most_ mailers and allow it to be customized by a user if need be?

Note that 2.60 already has support for detecting and/or specifying which Received headers are "trusted" and makes the results available as 2 separate metadata headers -- X-Relays-Trusted and X-Relays-Untrusted. See the Conf manpage to figure out how to add those to the output so you can see what to match against.

> > As to URI scanning - it seems easy enough to grab the URIs. It's difficult to only grab visible URIs. And it's really friggen hard to figure out which URIs are just 'as seen on somesomesite.com', and 'in compliance with fcc.com regulation...', and 'as featured on this yahoo shopping site..'. Even assuming you have well-scored URLs, how do you score URLs found in an email? Some spams have a ton of URLs, some have only 1. Some spams refer to non-spam sites, some do not. So what happens when your mother sends you a picture that's hosted on one of these sites (e.g. yahoo) that a spammer also periodically will link to? The trick seems to be one of looking for the SIGNIFICANT URI.. the one in bold. Except then images come into the picture and complicate matters. Worst case, a spammer embeds 100 URLs into a spam and hides them in various ways, with only 1 significant URL.
>
> I think the best solution would be to only look for bad URIs. If it's not a spam URI, ignore it completely.
> No ham will have spam URIs and spam _must_ have one or more (otherwise the spam is pretty pointless).

Or use Bayesian logic to determine which are spammy, which are not, and which are "in between". In fact, just dumping the contents of the bayes db and grepping for URI components would work fine -- Bayes already (a) decodes URIs, (b) breaks them apart and (c) tracks spam/ham occurrences ;)

--j.
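The "dump the bayes db and grep for URI components" suggestion might look like the sketch below. The dump field layout assumed here (probability, spam count, ham count, atime, token) is modelled loosely on `sa-learn --dump data` output and varies between SpamAssassin versions, so treat it as an illustration, not a parser for real dumps:

```python
def grep_uri_tokens(dump_lines):
    """Filter bayes-dump lines for URI-ish tokens.  Assumed line format:
    probability, spam count, ham count, atime, token.  Returns a list of
    (token, nspam, nham) tuples for tokens that look like URI parts."""
    hits = []
    for line in dump_lines:
        fields = line.split()
        if len(fields) == 5 and fields[4].startswith(("UD:", "http", "www.")):
            hits.append((fields[4], int(fields[1]), int(fields[2])))
    return hits
```

The spam/ham counts this recovers are exactly the raw material the thread's scoring ideas need.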
Ok.. so, a few interesting things have come from this.

SA 2.60 apparently has the ability to find the first non-trusted relays. I have corrected our email server (the situation was tcpserver -l 0; we run qmail), so that it properly stamps Received lines. This may make IP addresses from the Received headers usable, which would be seriously cool. I may try a first crack at that.

I am totally ignoring the distributed mechanism for now; it's been done - Razor2 does it, the DNS blacklists do it, DCC does it, spamd does it. It's not impossible, and there are several models to choose from. Collecting accurate information, on the other hand, is more difficult.

I still believe URIs are a problem, even having read all of the posts. Some URIs appear only in spam. Some URIs will only appear in ham. And a whole ton of URIs can exist in either, and presumably as soon as we write a rule that looks for spam URIs, the spammers will add ham-type URIs. So, the simple thing to say is 'when we find an email with a spam URI, we give it hella points because it must be spam'. The problem is, how do we know it's a spam URI in the first place? I have a ton of spam that refers to legit websites. I also have a great deal of spam which has multiple different URLs on it. Much of my spam has only 1 link on the whole page; much of my spam has a friggen ton of links. Some links are image links, some area links, some base/href links, some form cgi actions, etc. I can write a program that'll work today - that's easy enough; it's when the spammers start adding ham-type URLs to confuse the filters that things get tricky.

What I think the answer starts with is some sort of probability. If a message has 1 link on it, and only 1 link, and gets 50 points: it's going to be a spam link. If it's a spam with 10 different links, perhaps each unique, and gets a score of 20... then what?
For the investor club spams, there's often a link saying something like 'we comply with federal law; click here www.fcc.gov for more info' or some such thing. Obviously, www.fcc.gov isn't an address we want to hit. In another spam I viewed, it linked to specific pages on www.yahoo.com .. clearly spammers can set up mini sites on such services, but there's a holy ton of email floating around linking to www.yahoo.com in one form or another. Hand training does begin to solve this problem. Perhaps the answer is to simply have someone submit domains as spam sites, have someone manually verify them, and then set up a distributed database to deal with that. I'd prefer some sort of autolearning also.

What I don't want to happen is a polluted autolearned database. The spammers can easily embed non-visible URLs or non-significant URLs to cause a relatively effective DoS if we simply learn on any URI that's presented in an email. Hint: generate a spam that'll get 100 points and stick URLs to ibm.com, microsoft.com, yahoo.com, hotmail.com, etc. in it, and watch a naive autolearner blacklist all emails with links to those sites. Spammers already include random words to confuse filters; random URLs would be easy enough too. With some trickery, you could probably get those URLs embedded in such a way that a parser would see them, but not display them. Hmmm, I don't know how to solve that. Any suggestions?

Still thinking about this! Thanks, Devin Nate
Subject: Re: [SAdev] Host IP Database - An automated SPEWS

> The problem is, how do we know it's a spam URI in the first place? I have a ton of spam that refers to legit websites. I also have a great deal of spam which has multiple different URLs on it. Much of my spam has only 1 link on the whole page; much of my spam has a friggen ton of links. Some links are image links, some area links, some base/href links, some form cgi actions, etc. I can write a program that'll work today - that's easy enough; it's when the spammers start adding ham-type URLs to confuse the filters that things get tricky.
>
> What I think the answer starts with is some sort of probability. If a message has 1 link on it, and only 1 link, and gets 50 points: it's going to be a spam link. If it's a spam with 10 different links, perhaps each unique, and gets a score of 20... then what? For the investor club spams, there's often a link saying something like 'we comply with federal law; click here www.fcc.gov for more info' or some such thing. Obviously, www.fcc.gov isn't an address we want to hit. In another spam I viewed, it linked to specific pages on www.yahoo.com .. clearly spammers can set up mini sites on such services, but there's a holy ton of email floating around linking to www.yahoo.com in one form or another.

Just normal training on a corpus of ham/spam should solve this pretty quickly. Unlike Bayes/CRM training on words/phrases, URIs (or rather, the IP addresses they map to) are a bit different. It's not ham/spam but rather neutral/spam. Words are ambiguous; IP addresses are not. Thus, the weighting values grow much like word values in Bayes, but the combining would be different. A "neutral" address would not weight negatively (where a negative number means "non-spam") while a "spam" address would weight positively. It may be as simple as just taking the highest-ranking address and using that as your indicator.
The obvious problem is if somebody forwards a spam message, but that's already a problem, and it's the reason why no single rule in SA is enough to get something classified as spam.

> What I don't want to happen is a polluted autolearned database. The spammers can easily embed non-visible URLs or non-significant URLs to cause a relatively effective DoS if we simply learn on any URI that's presented in an email. Hint: generate a spam that'll get 100 points and stick URLs to ibm.com, microsoft.com, yahoo.com, hotmail.com, etc. in it, and watch a naive autolearner blacklist all emails with links to those sites. Spammers already include random words to confuse filters; random URLs would be easy enough too. With some trickery, you could probably get those URLs embedded in such a way that a parser would see them, but not display them. Hmmm, I don't know how to solve that. Any suggestions?

The "Achilles heel" for this is the fact that spam may include neutral URIs as well as spam ones, but ham will never include the spam ones. Thus, anything that is unknown, or seen more than 1 time in ham for every 100 times in spam, can probably just be ignored.

Brian ( bcwhite@precidia.com )
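Two heuristics from this exchange - the "ignore anything seen more than once in ham per 100 times in spam" rule just above, and the earlier "take the highest-ranking address" combiner - could be sketched together. The `min_seen` confidence floor is an invented parameter, not something stated in the thread:

```python
def is_spam_uri(nspam, nham, min_seen=10):
    """The ratio rule above: a URI counts as spammy only if it is
    essentially never seen in ham (fewer than 1 ham hit per 100 spam
    hits).  min_seen is an invented confidence floor."""
    if nspam < min_seen:
        return False          # not enough evidence yet
    return nham * 100 < nspam

def uri_indicator(weights):
    """'Take the highest-ranking address': neutral addresses (weight 0
    or below) never pull the score down; the worst address dominates."""
    return max((w for w in weights if w > 0), default=0.0)
```

Combined, this filters each message's URIs down to the known-spammy ones and then lets the single worst hit drive the score, which matches the neutral/spam (rather than ham/spam) framing above.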
Method #2 is already handled, BTW -- Bayes tracks URI data. Method #1, however, is not (yet). Retitling bug to more correctly reflect that.

'tools/bayes_dump_to_trusted_networks' in the 3.0.0 svn trunk may be interesting -- just use the inverse to find *un*trustworthy hosts ;)
move bug to Future milestone (previously set to Future -- I hope)
Closing ancient stale bug. Probably not relevant anymore..