SA Bugzilla – Bug 4770
use ASN data as Bayes token
Last modified: 2007-01-22 04:32:32 UTC
Karsten M. Self _aaaages_ ago noted a strong correlation between the ASN a message was relayed from and its spamminess. URLs:

http://kmself.home.netcom.com/
http://twiki.iwethey.org/Main/SpamByASN
http://linuxmafia.com/~karsten/Images/spam-by-asn.png
http://linuxmafia.com/~karsten/Images/cum-spam-by-asn.png
http://linuxmafia.com/~karsten/monthly-asn-report-current.txt
http://linuxmafia.com/~karsten/Download/procmail-asn-header

I thought we had a bug tracking this, but it appears we didn't. Anyway, here's a bugzilla entry for this.

aspath.routeviews.org seems to provide the most useful data:

dig 101.96.218.195.aspath.routeviews.org. IN TXT

;; QUESTION SECTION:
;101.96.218.195.aspath.routeviews.org. IN TXT

;; ANSWER SECTION:
101.96.218.195.aspath.routeviews.org. 86400 IN TXT "12682 6461 3356 8760" "195.218.96.0" "19"

;; AUTHORITY SECTION:
aspath.routeviews.org. 86400 IN NS routeviews.org.

that's the AS numbers it passed through, the IP range, and the CIDR mask. the AS numbers in particular would be good bayes tokens, if the correlation still stands.
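For illustration, here is a minimal Python sketch of how the query name in the dig example is formed and how the three TXT strings could be split apart. The function names are made up for this sketch and are not part of any SpamAssassin code; it assumes the AS path contains only plain numbers (no AS sets) and does not perform any DNS query itself.

```python
import ipaddress

ZONE = "aspath.routeviews.org"  # zone name taken from the dig example above


def query_name(ip, zone=ZONE):
    """Reverse the IPv4 octets and append the zone, DNSBL-style."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def parse_txt(strings):
    """Split the three TXT strings into (list of AS numbers, network prefix)."""
    as_path, network, masklen = strings
    asns = [int(a) for a in as_path.split()]  # assumes plain numeric AS path
    prefix = ipaddress.ip_network(f"{network}/{masklen}")
    return asns, prefix


if __name__ == "__main__":
    print(query_name("195.218.96.101"))
    asns, prefix = parse_txt(["12682 6461 3356 8760", "195.218.96.0", "19"])
    print(asns, prefix)
```

Each AS number in the parsed path could then be turned into a separate token.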
Karsten responded by email...

> > I was just googling around my name and spam, ran across SA's Bug 4770:
> >
> > http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4770
> >
> > First off, thanks for filing that ;-)
> >
> > Second: I've got a user base now, well, two other guys doing
> > spam-by-asn reporting. One of them's got stats posted on ASN as well:
> >
> > http://spam.thegrebs.com/reports/spam_by_asn.pl
> > http://spam.thegrebs.com/reports/spam_by_cidr.pl
> > http://spam.thegrebs.com/reports/spam_by_provider.pl
> >
> > ....specifics are a tad different from my results, but the overall
> > pattern is the same.
> >
> >
> > My own historical stats are posted here:
> >
> > http://linuxmafia.com/~karsten/monthly-asn-report
> > http://linuxmafia.com/~karsten/monthly-cidr-report
> >
> > ... with data from January, 2004 (with a couple of breaks) by month:
> >
> > http://linuxmafia.com/~karsten/monthly-asn-report-200401.txt
> > .
> > .
> > .
> > http://linuxmafia.com/~karsten/monthly-asn-report-200601.txt
> >
> > http://linuxmafia.com/~karsten/monthly-cidr-report200401.txt
> > .
> > .
> > .
> > http://linuxmafia.com/~karsten/monthly-cidr-report-200601.txt
> >
> > The general rule has held, for _my_ sample (YMMV) that:
> >
> > 2-5 ASNs account for 25% of all spam.
> > 11 - 30 ASNs account for 50% of all spam.
> >
> > ... with the concentration actually increasing for the most part over
> > the study period.
> >
> > Aggregating by CIDR gives spectacular aggregation -- 25% comes in at
> > about 15 CIDR blocks, but most of the top 60 or so CIDRs are very, very
> > spammy.
> >
> > I think the really valuable place for this would be in the MTA itself
> > rather than spamassassin, but if there's built-in support, so much the
> > better.
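The "N ASNs account for X% of all spam" figures can be reproduced from any per-message list of ASNs. The following Python sketch shows one way to compute them; the function name and the sample data are hypothetical and are not taken from Karsten's reports.

```python
from collections import Counter


def asns_needed(spam_asns, share):
    """Return how many of the busiest ASNs are needed to cover `share` of all spam."""
    counts = Counter(spam_asns)
    total = sum(counts.values())
    covered = 0
    for rank, (_asn, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered >= share * total:
            return rank
    return 0  # empty input


if __name__ == "__main__":
    # Hypothetical sample: one ASN label per spam message (100 messages).
    sample = (["AS1"] * 30 + ["AS2"] * 20 + ["AS3"] * 10
              + [f"AS{i}" for i in range(4, 44)])
    print(asns_needed(sample, 0.25), asns_needed(sample, 0.50))
```

The same function works for CIDR blocks if the input list holds prefixes instead of ASNs.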
just adding a comment -- nowadays this would be best implemented either as Karsten has (upfront as a message-annotating filter) or in SpamAssassin as a plugin which annotates the message using add_header(). both ways expose the data for bayes. in terms of what makes sense for SA, the latter is more logical IMO -- and less overhead, since it reduces forks and message parsing required.
Matthias Leisi has written a plugin to do this: http://matthias.leisi.net/archives/174-ASN-and-SpamAssassin.html
Created attachment 3786 [details]
Plugin to add _ASN_ and _ASNCIDR_ tags

The plugin jm referred to in comment #3; it has an Apache license (text copied from one of the other SA source files) to make reuse easy. Code is pod'ed.

Since the asn.routeviews.org zone is handled by a single nameserver only (as I write this: 128.223.61.18), usage should probably be limited to low-volume sites.
thanks Matthias! yep, I definitely think this should be kept inactive by default; I'm pretty sure routeviews would not be happy if we shipped it enabled. aiming (optimistically) at 3.2.0...
Should I rewrite it so that it fits in the Mail::SpamAssassin::Plugin package?
that'd be awesome, thanks ;)
Created attachment 3787 [details]
Plugin to add _ASN_ and _ASNCIDR_ tags (revised)

* Moved to Mail::SpamAssassin::Plugin package
* Cleaned up POD doc, added warning on routeviews.org load
* Cleaned up debug output
* Added warning on zero-length items
Chris Pollock noticed that the plugin gives less accurate results than the procmail recipe by Karsten M. Self (see links in comment #1), and I've also seen a number of "non-responses" in my own corpus. The procmail recipe uses host(1) with "-R 10" (ten retries upon failure), which is pretty aggressive but gives more accurate results.

One possible solution is to set up a local mirror of the asn.routeviews.org zone using the data from ftp://ftp.routeviews.org/dnszones/. However, this data is not available through rsync (which increases bandwidth) and only in BIND format (which results in enormous memory consumption). [I just asked them if they would offer them.]

The second possible solution is to not use SA's check_rbl_txt() but do direct queries from within the plugin using Net::DNS with appropriate retries etc. Would that be acceptable? Is it necessary to do async lookups or are the plugins themselves called asynchronously?
it's not necessary to do async lookups, no. If the addition of a few seconds of latency is acceptable (which it probably will be, IMO), then that may be the best option. +1 I would suggest using our own frontend for Net::DNS, though, Mail::SpamAssassin::DnsResolver -- it works around a Net::DNS bug. 10 retries might be overkill, though ;) (Down the line, there's plenty of time to modify it to use the async lookup infrastructure. that would be a good idea so that the lookup/retry/lookup/retry/... chain can happen in parallel with other rules. not urgent though.) Also, it'd be great if asn.routeviews.org was rsyncable -- I'm sure there'd be a lot of people willing to set up mirrors, too... that would be a good way to get the zone usable for internet-scale lookups without hammering that guy's personal infrastructure.
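As a rough illustration of the "direct queries with a bounded number of retries" idea, here is a Python sketch. It is not how the plugin or Mail::SpamAssassin::DnsResolver works; `resolve` is a hypothetical caller-supplied function that returns a list of TXT strings, returns an empty list, or raises OSError on failure.

```python
import time


def lookup_with_retries(resolve, name, retries=3, delay=0.5):
    """Call resolve(name) until it gives a non-empty answer or retries run out."""
    for attempt in range(1 + retries):
        try:
            answer = resolve(name)
            if answer:
                return answer
        except OSError:
            pass  # treat a failed query like an empty answer and try again
        if attempt < retries:
            time.sleep(delay)  # short pause before the next attempt
    return None  # no usable answer after the initial try plus all retries
```

A small default such as three retries keeps the added latency bounded, in line with the remark that ten retries might be overkill.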
Created attachment 3788 [details]
Plugin to add _ASN_ and _ASNCIDR_ tags (revised 2)

* Changed from check_rbl_txt to SA's internal async DNS
* Does not require a 0.001 score any more
* Prepend "AS" to the ASN, eg "AS2828" to make it more distinct
Created attachment 3789 [details]
Plugin to add _ASN_ and _ASNCIDR_ tags (revised 3)

* Tested with 3.1.7 (previously only with 3.1.0 and 3.1.3)
* Fixed typo
* Make sure we do not look up 127.0.0.1 -- relays_external->[0] is (sometimes?) 127.0.0.1 when called in spamd(8)/Postfix' content_filter context as opposed to spamassassin(1).
* Removed the _handle_hit call and the "return 1"s, as they caused a default score of 1.0 on the rule that called the asn_lookup() eval function.
* Added a parameter to the asn_lookup() eval function to specify the number of simultaneous DNS queries
Couple of suggestions/comments:

1) I wouldn't cut out just 127.0.0.1 IPs, I'd cut out all private IPs. You can determine if it's a private IP with the following type of check:

if ($scanner->{relays_external}->[0]->{ip_private}) { .....

You might also want to just limit yourself to relays_untrusted.

2) This part really worries me:

"Please make sure that your use of the plugin does not overload their infrastructure - this generally means that B<you should not use this plugin in a high-volume environment> or that you should use a local mirror of the zone (see ftp://ftp.routeviews.org/dnszones/)."

With that sort of caveat I'll be -1 for inclusion in the base pkg; you could put it up on the wiki of course. I think that if it's included, even in turned off state, enough people will turn it on to possibly cause a problem.
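The suggestion in 1) amounts to walking the relay list and picking the first address that is not private. A minimal Python sketch of that selection follows; the function name is hypothetical, and Python's notion of "private" may not match exactly what SpamAssassin records in ip_private.

```python
import ipaddress


def first_public_relay(relay_ips):
    """Return the first relay IP that is not private, loopback or link-local."""
    for ip in relay_ips:
        addr = ipaddress.ip_address(ip)
        if addr.is_private or addr.is_loopback or addr.is_link_local:
            continue  # skip addresses that cannot map to a public ASN
        return ip
    return None  # nothing worth looking up


if __name__ == "__main__":
    print(first_public_relay(["127.0.0.1", "10.1.2.3", "195.218.96.101"]))
```

Returning None lets the caller skip the DNS query entirely when every relay is internal.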
'With that sort of caveat I'll be -1 for inclusion in the base pkg, you could put it up on the wiki of course. I think that if its included, even in turned off state, that enough people will turn it on to possibly cause a problem.' hmm -- I'd tend to disagree ;) I doubt many people would enable it, if it incurs a latency hit for no immediate increase in accuracy (ie it just generates additional tokens for bayes). As such I think it'd be OK to have in the base distro, commented. (Another alternative might be to ask the guy who runs the zone if he'd be ok with its inclusion in this form, too.) apart from that -- Michael's comments about IP choice are correct, though...
(In reply to comment #14)
> (Another alternative might be to ask the guy who runs the zone if he'd be ok
> with its inclusion in this form, too.)

I would hope that we'd do that for any new DNS test that we include (that's not a part of an already used combined zone). At the very least it'd be nice to give the operator a heads up as to why his traffic is suddenly increasing.
(In reply to comment #15)
> I would hope that we'd do that for any new DNS test that we include (that's not
> a part of an already used combined zone). At the very least it'd be nice to
> give the operator a heads up as to why his traffic is suddenly increasing.

I already tried to contact them through the help /at/ routeviews.org address provided on the site, not specifically for inclusion in SA, but generally for rsync'ing / mirroring of their zone. I haven't received an answer yet -- it may be helpful if somebody has a better way to contact them (it's hosted by uoregon.edu).

As to the choice of IP addresses (comment #13 and #14): An update should be ready later today.
Open issue (see "TODO" section in the POD): For some IP addresses, an AS announces more than one network (more/less specific, eg a.b.c.d/20 and a.b.c.e/23). what's the preferred option to handle these more/less specific announcements? Just add them to the _ASNCIDR_ tag (eg space separated)? Currently the last answer wins.
'Open issue (see "TODO" section in the POD): For some IP addresses, an AS announces more than one network (more/less specific, eg a.b.c.d/20 and a.b.c.e/23). what's the preferred option to handle these more/less specific announcements? Just add them to the _ASNCIDR_ tag (eg space separated)? Currently the last answer wins.' I'd vote for adding all of the answers, space-separated...
Created attachment 3790 [details]
Plugin to add _ASN_ and _ASNCIDR_ tags (revised 4)

* Use of $scanner->{relays_untrusted} to determine the IP address to look up, skipping {ip_private} relays (comment #13)
* Add multiple responses (more/less specific networks) to _ASNCIDR_, space-separated (comment #18)

Regarding load on routeviews.org infrastructure (comment #15): I received an answer from the project and we are discussing the potential load implication. There are in fact three nameservers, but they are inconsistently advertised through DNS. I'll update here as soon as I have more information.

What is your guesstimate: How much load (eg queries/day) would a white-/blacklist receive if it were added to SA and enabled by default? If it were included, but disabled?
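To make the "all answers, space-separated" behaviour concrete, here is a Python sketch that combines several answers into the two tag values, with the "AS" prefix from revision 2. The function name, the tuple layout and the example prefixes are assumptions for illustration; the exact output format of the real plugin may differ.

```python
def build_tags(answers):
    """Combine (origin ASN, network, mask length) answers into tag strings.

    Duplicates are dropped; the order of first appearance is kept.
    """
    asns, cidrs = [], []
    for asn, network, masklen in answers:
        asn_token = f"AS{asn}"  # "AS" prefix makes the token more distinct
        cidr_token = f"{network}/{masklen}"
        if asn_token not in asns:
            asns.append(asn_token)
        if cidr_token not in cidrs:
            cidrs.append(cidr_token)
    return {"ASN": " ".join(asns), "ASNCIDR": " ".join(cidrs)}


if __name__ == "__main__":
    # Hypothetical more/less specific announcements from the same AS.
    print(build_tags([(8760, "195.218.96.0", 19), (8760, "195.218.96.0", 23)]))
```

Collecting every answer avoids the "last answer wins" behaviour described earlier in the thread.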
Update on the usage of asn.routeviews.org:

1) John Heasly, the developer of the tools around the asn and aspath.routeviews.org zones, writes by mail:

| As Joel mentioned, we have a daemon specially written to handle these zones.
| I would not expect there to be any problem with the additional load.

2) John is working on an rbldnsd format of the data. There is one missing(?) feature in rbldnsd (return a default A/TXT record instead of NXDOMAIN) which I'm taking up with the developer of rbldnsd.

3) For those wanting to set up a local mirror, rsync has been made available at rsync://archive.routeviews.org/routeviews/dnszones:

| rsync rsync://archive.routeviews.org/routeviews/dnszones/aspath.zone .

IMHO this should solve the concerns brought up in comment #13 and #15.
I did some statistics on the ASN data gathered over the past couple of days (see http://matthias.leisi.net/archives/176-Where-does-your-spam-come-from.html). It seems that the ASN data alone is not too helpful, but a combined view on ASN and prefixes announced by these ASNs (the _ASNCIDR_ tag) helps to identify "hotspots". In that light, the two tags may well be helpful as Bayes tokens.
so, just to clarify -- the asn.routeviews.org developers are happy for (off-by-default) support to be included in SA?
Yes, see comment #20. John Heasly wrote by mail: | I would not expect there to be any problem with the additional load. I'll forward you the complete mail privately.
update: we followed it up just to get a definite answer --

>> Are you fine with the plugin pointing to asn.routeviews.org being
>> distributed (and the additional load this may create)?
>
>That is fine.

so we're good to go, IMO. Michael?
Michael -- is your veto still in place? please comment.
I remove my veto.
cool -- thanks! will apply shortly
ok, thanks -- applied:

: jm 1375...; svn commit -m "bug 4770: add ASN.pm plugin, contributed by Matthias Leisi <matthias at leisi.net>" lib/Mail/SpamAssassin/Plugin/ASN.pm MANIFEST rules/v320.pre
Sending MANIFEST
Adding lib/Mail/SpamAssassin/Plugin/ASN.pm
Sending rules/v320.pre
Transmitting file data ...
Committed revision 496501.
Created attachment 3826 [details]
patch as applied

oops. Daryl just pointed out -- we need to sort out a CLA first... Matthias, could you fax through a CLA? http://www.apache.org/licenses/#clas
btw, it's now possible to file CLAs via email: 'CLAs can be filed electronically now. You can send PGP/GPG-signed emails with the scanned PDFs of the signed CLA form to secretary@apache.org and legal-archive@apache.org. This removes the need to fax or send physical mails of the CLA.' handy!
(In reply to comment #30)

CLA completed, signed and submitted to secretary@apache.org and legal-archive@apache.org (Cc: Justin).

As someone mentioned on the dev list, there is a superfluous copyright line in the plugin as attached to this bug:

| # <@LICENSE>
| # Copyright 2006 dnswl.org, Matthias Leisi <matthias@leisi.net>

Justin, can you just remove it when you re-apply the patch or shall I attach a revised version?
thanks! I'll replace with the default ASF license block, as seen in the other .pm files, if that's ok.
JimJag just noted the CLA as received, so I'll apply this rsn...
(In reply to comment #32)
> I'll replace with the default ASF license block, as seen in the other .pm files,
> if that's ok.

That's perfect, thanks.
ok, re-applied with the ASF license header:

svn commit -m "bug 4770: re-apply Mail::SpamAssassin::Plugin::ASN patch, now that licensing is sorted. exposes ASN data as a Bayes token and the _ASNCIDR_ and _ASN_ header-rewriting tags. thanks to Matthias Leisi <matthias /at/ leisi.net>"
Sending CREDITS
Sending MANIFEST
Adding lib/Mail/SpamAssassin/Plugin/ASN.pm
Sending rules/v320.pre
Transmitting file data ....
Committed revision 498595.