Bug 4770 - use ASN data as Bayes token
Status: RESOLVED FIXED
Product: Spamassassin
Classification: Unclassified
Component: Rules
Version: SVN Trunk (Latest Devel Version)
Hardware: Other other
Importance: P2 minor
Target Milestone: 3.2.0
Assigned To: SpamAssassin Developer Mailing List
Reported: 2006-01-25 23:53 UTC by Justin Mason
Modified: 2007-01-22 04:32 UTC

Attachment                                          Type        Status  Submitter/CLA Status
Plugin to add _ASN_ and _ASNCIDR_ tags              text/plain  None    Matthias Leisi [HasCLA]
Plugin to add _ASN_ and _ASNCIDR_ tags (revised)    text/plain  None    Matthias Leisi [HasCLA]
Plugin to add _ASN_ and _ASNCIDR_ tags (revised 2)  text/plain  None    Matthias Leisi [HasCLA]
Plugin to add _ASN_ and _ASNCIDR_ tags (revised 3)  text/plain  None    Matthias Leisi [HasCLA]
Plugin to add _ASN_ and _ASNCIDR_ tags (revised 4)  text/plain  None    Matthias Leisi [HasCLA]
patch as applied                                    patch       None    Justin Mason [HasCLA]

Description Justin Mason 2006-01-25 23:53:16 UTC
Karsten M. Self _aaaages_ ago noted a strong correlation between the ASN a
message was relayed from, and spamminess.  urls:

http://kmself.home.netcom.com/
http://twiki.iwethey.org/Main/SpamByASN
http://linuxmafia.com/~karsten/Images/spam-by-asn.png
http://linuxmafia.com/~karsten/Images/cum-spam-by-asn.png
http://linuxmafia.com/~karsten/monthly-asn-report-current.txt
http://linuxmafia.com/~karsten/Download/procmail-asn-header

I thought we had a bug tracking this, but it appears we didn't.  Anyway, here's
a bugzilla entry for this.

aspath.routeviews.org seems to provide the most useful data:

dig 101.96.218.195.aspath.routeviews.org. IN TXT

;; QUESTION SECTION:
;101.96.218.195.aspath.routeviews.org. IN TXT

;; ANSWER SECTION:
101.96.218.195.aspath.routeviews.org. 86400 IN TXT "12682 6461 3356 8760" "195.218.96.0" "19"

;; AUTHORITY SECTION:
aspath.routeviews.org.  86400   IN      NS      routeviews.org.


that's the ASNs it passed through (the AS path), the IP range, and the CIDR mask.

the ASN numbers in particular would be good bayes tokens, if the correlation
still stands.
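
For illustration, the lookup above can be sketched in Python (not the language of any plugin discussed in this bug). The sketch builds the reversed-octet query name used in the dig example and splits the three TXT strings into an AS path and a CIDR block. The function names are hypothetical and no DNS query is sent; the TXT strings are copied from the answer section above.

```python
import ipaddress


def aspath_query_name(ip, zone="aspath.routeviews.org"):
    """Reverse the IPv4 octets and append the zone, as in the dig example."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone + "."


def parse_aspath_txt(strings):
    """Split the three TXT strings into an AS path and a CIDR block."""
    as_path, network, masklen = strings
    return {
        "as_path": [int(asn) for asn in as_path.split()],
        "cidr": f"{network}/{int(masklen)}",
    }


# Values taken from the dig output in the description.
name = aspath_query_name("195.218.96.101")
record = parse_aspath_txt(["12682 6461 3356 8760", "195.218.96.0", "19"])
```

Each ASN in `record["as_path"]` could then be exposed as a separate token.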
Comment 1 Justin Mason 2006-02-21 16:20:02 UTC
Karsten responded by email...

> > I was just googling around my name and spam, ran across SA's Bug 4770:
> > 
> >     http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4770
> > 
> > First off, thanks for filing that ;-)
> > 
> > Second:  I've got a user base now, well, two other guys doing
> > spam-by-asn reporting.  One of them's got stats posted on ASN as well:
> > 
> >     http://spam.thegrebs.com/reports/spam_by_asn.pl
> >     http://spam.thegrebs.com/reports/spam_by_cidr.pl
> >     http://spam.thegrebs.com/reports/spam_by_provider.pl
> > 
> > ....specifics are a tad different from my results, but the overall
> > pattern is the same.
> > 
> > 
> > My own historical stats are posted here:
> > 
> >     http://linuxmafia.com/~karsten/monthly-asn-report
> >     http://linuxmafia.com/~karsten/monthly-cidr-report
> > 
> > ... with data from January, 2004 (with a couple of breaks) by month:
> > 
> >     http://linuxmafia.com/~karsten/monthly-asn-report-200401.txt
> >      .
> >      .
> >      .
> >     http://linuxmafia.com/~karsten/monthly-asn-report-200601.txt
> > 
> >     http://linuxmafia.com/~karsten/monthly-cidr-report200401.txt
> >      .
> >      .
> >      .
> >     http://linuxmafia.com/~karsten/monthly-cidr-report-200601.txt
> > 
> > The general rule has held, for _my_ sample (YMMV) that:
> > 
> >   2-5 ASNs account for 25% of all spam.
> >   11 - 30 ASNs account for 50% of all spam.
> > 
> > ... with the concentration actually increasing for the most part over
> > the study period.
> > 
> > Aggregating by CIDR gives spectacular aggregation -- 25% comes in at
> > about 15 CIDR blocks, but most of the top 60 or so CIDRs are very, very
> > spammy.
> > 
> > I think the really valuable place for this would be in the MTA itself
> > rather than spamassassin, but if there's built-in support, so much the
> > better.
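
A concentration figure of the kind quoted above ("N ASNs account for 25% of all spam") can be computed roughly as follows. This is a sketch with a hypothetical function name and made-up sample data, not Karsten's reporting script: it counts messages per ASN and returns how many of the busiest ASNs are needed to reach a given share.

```python
from collections import Counter


def asns_needed_for_share(spam_asns, share):
    """Return how many of the busiest ASNs cover at least `share` of messages."""
    counts = Counter(spam_asns)
    total = sum(counts.values())
    covered = 0
    for rank, (_asn, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered >= share * total:
            return rank
    return 0  # empty input


# Made-up sample: one ASN label per spam message (not real measurements).
sample = ["AS1"] * 5 + ["AS2"] * 3 + ["AS3"] * 1 + ["AS4"] * 1
```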
Comment 2 Justin Mason 2006-12-04 10:39:54 UTC
just adding a comment --

nowadays this would be best implemented either as Karsten has (upfront as a
message-annotating filter) or in SpamAssassin as a plugin which annotates the
message using add_header().  both ways expose the data for bayes.  in terms of
what makes sense for SA, the latter is more logical IMO -- and less overhead,
since it reduces forks and message parsing required.
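
To illustrate the "message-annotating filter" variant only, here is a minimal Python sketch that adds the ASN data as a header before the message reaches the classifier. It does not use SpamAssassin's add_header() mechanism, and the header name X-ASN and the function name are assumptions made for the example.

```python
from email import message_from_string


def annotate_with_asn(raw_message, asn, cidr):
    """Return the message text with a single X-ASN header (hypothetical name)."""
    msg = message_from_string(raw_message)
    del msg["X-ASN"]  # drop any copy supplied by the sender; no error if absent
    msg["X-ASN"] = f"{asn} {cidr}"
    return msg.as_string()
```

Removing an existing header first keeps a sender from planting misleading tokens under the same name.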
Comment 3 Justin Mason 2006-12-15 05:33:53 UTC
Matthias Leisi has written a plugin to do this:

http://matthias.leisi.net/archives/174-ASN-and-SpamAssassin.html
Comment 4 Matthias Leisi 2006-12-15 09:45:37 UTC
Created attachment 3786
Plugin to add _ASN_ and _ASNCIDR_ tags

The plugin jm referred to in comment #3; it has an Apache license (text copied
from one of the other sa source files) to make reuse easy. Code is pod'ed.

Since the asn.routeviews.org zone is handled by a single nameserver only (as I
write this: 128.223.61.18) usage should probably be limited to low-volume
sites.
Comment 5 Justin Mason 2006-12-15 09:59:05 UTC
thanks Matthias!  yep, I definitely think this should be kept inactive by
default; I'm pretty sure routeviews would not be happy if we shipped it enabled.

aiming (optimistically) at 3.2.0...
Comment 6 Matthias Leisi 2006-12-15 10:35:14 UTC
Should I rewrite it so that it fits in the Mail::SpamAssassin::Plugin package? 
Comment 7 Justin Mason 2006-12-15 10:47:21 UTC
that'd be awesome, thanks ;)
Comment 8 Matthias Leisi 2006-12-15 11:43:22 UTC
Created attachment 3787
Plugin to add _ASN_ and _ASNCIDR_ tags (revised)

* Moved to Mail::SpamAssassin::Plugin package
* Cleaned up POD doc, added warning on routeviews.org load
* Cleaned up debug output
* Added warning on zero-length items
Comment 9 Matthias Leisi 2006-12-16 01:09:53 UTC
Chris Pollock noticed that the plugin gives less accurate results than the
procmail recipe by Karsten M. Self (see links in comment #1), and I've also seen
a number of "non-responses" in my own corpus. 

The procmail recipe uses host(1) with "-R 10" (ten retries upon failure) which
is pretty aggressive but gives more accurate results. 

One possible solution is to set up a local mirror of the asn.routeviews.org zone
using the data from ftp://ftp.routeviews.org/dnszones/. However this data is not
available through rsync (which increases bandwidth) and only in BIND format
(which results in enormous memory consumption). [I just asked them if they would
offer them.]

The second possible solution is to not use SA's check_rbl_txt() but do direct
queries from within the plugin using Net::DNS with appropriate retries etc. 

Would that be acceptable? Is it necessary to do async lookups or are the plugins
themselves called asynchronously? 
Comment 10 Justin Mason 2006-12-16 03:41:18 UTC
it's not necessary to do async lookups, no.  If the addition of a few seconds of
latency is acceptable (which it probably will be, IMO), then that may be the
best option. +1

I would suggest using our own frontend for Net::DNS, though,
Mail::SpamAssassin::DnsResolver -- it works around a Net::DNS bug.

10 retries might be overkill, though ;)

(Down the line, there's plenty of time to modify it to use the async lookup
infrastructure.  that would be a good idea so that the
lookup/retry/lookup/retry/... chain can happen in parallel with other rules. not
urgent though.)
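
The synchronous lookup-with-retries approach discussed here can be sketched as below. This is Python for illustration, not code from the plugin or from Mail::SpamAssassin::DnsResolver; the function name, the `lookup` callable and the default of three retries are all assumptions.

```python
import time


def lookup_with_retries(lookup, name, retries=3, delay=0.0):
    """Call lookup(name) until it returns a non-empty answer or attempts run out."""
    for attempt in range(1 + retries):
        try:
            answer = lookup(name)
        except OSError:
            answer = None  # treat resolver/network errors as "no answer"
        if answer:
            return answer
        if attempt < retries and delay:
            time.sleep(delay)  # optional pause between attempts
    return None
```

Passing the resolver in as a callable keeps the retry policy separate from the DNS library in use, so the same loop could later be replaced by an asynchronous version.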

Also, it'd be great if asn.routeviews.org was rsyncable -- I'm sure there'd be a
lot of people willing to set up mirrors, too... that would be a good way to get
the zone usable for internet-scale lookups without hammering that guy's personal
infrastructure.
Comment 11 Matthias Leisi 2006-12-17 04:23:00 UTC
Created attachment 3788
Plugin to add _ASN_ and _ASNCIDR_ tags (revised 2) 

* Changed from check_rbl_txt to SA's internal async DNS
* Does not require a 0.001 score any more
* Prepend "AS" to the ASN, eg "AS2828" to make it more distinct
Comment 12 Matthias Leisi 2006-12-17 06:22:17 UTC
Created attachment 3789
Plugin to add _ASN_ and _ASNCIDR_ tags (revised 3)

* Tested with 3.1.7 (previously only with 3.1.0 and 3.1.3)
* Fixed typo
* Make sure we do not look up 127.0.0.1 -- relays_external->[0] is
(sometimes?) 127.0.0.1 when called in spamd(8)/Postfix' content_filter context
as opposed to spamassassin(1).
* Removed the _handle_hit call and the "return 1"s, as they caused a default
score of 1.0 on the rule that called the asn_lookup() eval function.
* Added a parameter to the asn_lookup() eval function to specify the number of
simultaneous DNS queries
Comment 13 Michael Parker 2006-12-17 10:14:16 UTC
Couple of suggestions/comments:

1) I wouldn't cut out just 127.0.0.1 IPs; I'd cut out all private IPs. You can
determine if it's a private IP with the following type of check:
if ($scanner->{relays_external}->[0]->{ip_private}) { .....

You might also want to just limit yourself to relays_untrusted.

2) This part really worries me:

"Please make sure
that your use of the plugin does not overload their infrastructure -
this generally means that B<you should not use this plugin in a
high-volume environment> or that you should use a local mirror of the
zone (see ftp://ftp.routeviews.org/dnszones/)."

With that sort of caveat I'll be -1 for inclusion in the base pkg, you could put
it up on the wiki of course.  I think that if it's included, even in turned off
state, that enough people will turn it on to possibly cause a problem.
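
The idea in point 1 (skip every private address, not just 127.0.0.1) can be sketched outside SpamAssassin like this. The Python below uses the standard ipaddress module, whose notion of "private" is an assumption here and may not match SA's ip_private flag exactly; the function name is hypothetical.

```python
import ipaddress


def first_public_relay(relay_ips):
    """Return the first relay IP that is not private/loopback, else None."""
    for ip in relay_ips:
        try:
            addr = ipaddress.ip_address(ip)
        except ValueError:
            continue  # ignore unparsable entries
        if not addr.is_private:
            return str(addr)
    return None
```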
Comment 14 Justin Mason 2006-12-17 13:38:06 UTC
'With that sort of caveat I'll be -1 for inclusion in the base pkg, you could put
it up on the wiki of course.  I think that if it's included, even in turned off
state, that enough people will turn it on to possibly cause a problem.'

hmm -- I'd tend to disagree ;)  I doubt many people would enable it, if it
incurs a latency hit for no immediate increase in accuracy (ie it just generates
additional tokens for bayes).  As such I think it'd be OK to have in the base
distro, commented.

(Another alternative might be to ask the guy who runs the zone if he'd be ok
with its inclusion in this form, too.)

apart from that -- Michael's comments about IP choice are correct, though...
Comment 15 Daryl C. W. O'Shea 2006-12-17 14:07:52 UTC
(In reply to comment #14)
> (Another alternative might be to ask the guy who runs the zone if he'd be ok
> with its inclusion in this form, too.)

I would hope that we'd do that for any new DNS test that we include (that's not
a part of an already used combined zone).  At the very least it'd be nice to
give the operator a heads up as to why his traffic is suddenly increasing.
Comment 16 Matthias Leisi 2006-12-17 23:55:40 UTC
(In reply to comment #15)

> I would hope that we'd do that for any new DNS test that we include (that's not
> a part of an already used combined zone).  At the very least it'd be nice to
> give the operator a heads up as to why his traffic is suddenly increasing.

I already tried to contact them through the help /at/ routeviews.org address
provided on the site, not specifically for inclusion in SA, but generally for
rsync'ing / mirroring of their zone. 

I haven't received an answer yet -- it may be helpful if somebody has a better
way to contact them (it's hosted by uoregon.edu). 

As to the choice of IP addresses (comment #13 and #14): An update should be
ready later today.
Comment 17 Matthias Leisi 2006-12-17 23:58:52 UTC
Open issue (see "TODO" section in the POD): For some IP addresses, an AS
announces more than one network (more/less specific, eg a.b.c.d/20 and a.b.c.e/23). 

what's the preferred option to handle these more/less specific announcements?
Just add them to the _ASNCIDR_ tag (eg space separated)? Currently the last
answer wins.
Comment 18 Justin Mason 2006-12-18 03:05:03 UTC
'Open issue (see "TODO" section in the POD): For some IP addresses, an AS
announces more than one network (more/less specific, eg a.b.c.d/20 and a.b.c.e/23). 

what's the preferred option to handle these more/less specific announcements?
Just add them to the _ASNCIDR_ tag (eg space separated)? Currently the last
answer wins.'

I'd vote for adding all of the answers, space-separated...
Comment 19 Matthias Leisi 2006-12-18 09:11:08 UTC
Created attachment 3790
Plugin to add _ASN_ and _ASNCIDR_ tags (revised 4)

* Use of $scanner->{relays_untrusted} to determine the IP address to look up,
skipping {ip_private} relays (comment #13)
* Add multiple responses (more/less specific networks) to _ASNCIDR_,
space-separated (comment #18)
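
The tag handling described in these two points and in comment #11 (the "AS" prefix, space-separated multiple answers) might look roughly like this. It is a Python sketch, not the plugin's Perl code; the function name, the input shape and the de-duplication of repeated values are assumptions, and the addresses in the usage note are made up.

```python
def build_tags(answers):
    """Combine (asn, network, masklen) answers into space-separated tag values."""
    asns, cidrs = [], []
    for asn, network, masklen in answers:
        asn_tag = f"AS{asn}"           # prefix makes the token more distinct
        cidr = f"{network}/{masklen}"
        if asn_tag not in asns:        # assumption: list each value only once
            asns.append(asn_tag)
        if cidr not in cidrs:
            cidrs.append(cidr)
    return {"ASN": " ".join(asns), "ASNCIDR": " ".join(cidrs)}
```

For example, `build_tags([("2828", "198.18.0.0", "20"), ("2828", "198.18.2.0", "23")])` yields one ASN value and two space-separated CIDR values, instead of letting the last answer win.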

Regarding load on routeviews.org infrastructure (comment #15): I received an
answer from the project and we are discussing the potential load implication.
There are in fact three nameservers, but they are inconsistently advertised
through DNS. I'll update here as soon as I have more information. 

What is your guesstimate: How much load (eg queries/day) would a
white-/blacklist receive if it were added to SA and enabled by default? If it
were included, but disabled?
Comment 20 Matthias Leisi 2006-12-19 16:54:42 UTC
Update on the usage of asn.routeviews.org:

1) John Heasly, the developer of the tools around the asn and
aspath.routeviews.org zones, writes by mail:

| As Joel mentioned, we have a daemon specially written to handle these zones.
| I would not expect there to be any problem with the additional load.

2) John is working on an rbldnsd format of the data. There is one missing(?)
feature in rbldnsd (return a default A/TXT record instead of NXDOMAIN) which I'm
taking up with the developer of rbldnsd.

3) For those wanting to set up a local mirror, rsync has been made available at
rsync://archive.routeviews.org/routeviews/dnszones:

| rsync rsync://archive.routeviews.org/routeviews/dnszones/aspath.zone .

IMHO this should solve the concerns brought up in comment #13 and #15.
Comment 21 Matthias Leisi 2006-12-31 06:10:48 UTC
I did some statistics on the ASN data gathered over the past couple of days (see
http://matthias.leisi.net/archives/176-Where-does-your-spam-come-from.html). 

It seems that the ASN data alone is not too helpful, but a combined view on ASN
and prefixes announced by these ASNs (the _ASNCIDR_ tag) helps to identify
"hotspots". In that light, the two tags may well be helpful as Bayes tokens.

Comment 22 Justin Mason 2007-01-02 10:36:16 UTC
so, just to clarify -- the asn.routeviews.org developers are happy for
(off-by-default) support to be included in SA?
Comment 23 Matthias Leisi 2007-01-02 14:21:10 UTC
Yes, see comment #20. John Heasly wrote by mail:

| I would not expect there to be any problem with the additional load.

I'll forward you the complete mail privately.
Comment 24 Justin Mason 2007-01-07 10:18:09 UTC
update: we followed it up just to get a definite answer --

>> Are you fine with the plugin pointing to asn.routeviews.org being
>> distributed (and the additional load this may create)?
>
>That is fine.

so we're good to go, IMO.  Michael?
Comment 25 Justin Mason 2007-01-14 06:25:19 UTC
Michael -- is your veto still in place?  please comment.
Comment 26 Michael Parker 2007-01-15 10:14:28 UTC
I remove my veto.
Comment 27 Justin Mason 2007-01-15 10:23:47 UTC
cool -- thanks!  will apply shortly
Comment 28 Justin Mason 2007-01-15 13:33:44 UTC
ok, thanks -- applied:

: jm 1375...; svn commit -m "bug 4770: add ASN.pm plugin, contributed by
Matthias Leisi <matthias at leisi.net>"  lib/Mail/SpamAssassin/Plugin/ASN.pm
MANIFEST rules/v320.pre
Sending        MANIFEST
Adding         lib/Mail/SpamAssassin/Plugin/ASN.pm
Sending        rules/v320.pre
Transmitting file data ...
Committed revision 496501.
Comment 29 Justin Mason 2007-01-15 14:13:37 UTC
Created attachment 3826
patch as applied

oops.  Daryl just pointed out -- we need to sort out a CLA first...

Matthias, could you fax through a CLA?  http://www.apache.org/licenses/#clas
Comment 30 Justin Mason 2007-01-17 15:44:19 UTC
btw, it's now possible to file CLAs via email:

'CLAs can be filed
electronically now.  You can send PGP/GPG-signed emails with the
scanned PDFs of the signed CLA form to secretary@apache.org and
legal-archive@apache.org.  This removes the need to fax or send
physical mails of the CLA.'

handy!
Comment 31 Matthias Leisi 2007-01-20 08:41:25 UTC
(In reply to comment #30)
CLA completed, signed and submitted to secretary@apache.org and
legal-archive@apache.org (Cc: Justin).

As someone mentioned on the dev list, there is a superfluous copyright line in
the plugin as attached to this bug:

| # <@LICENSE>
| # Copyright 2006 dnswl.org, Matthias Leisi <matthias@leisi.net>

Justin, can you just remove it when you re-apply the patch or shall I attach a
revised version? 
Comment 32 Justin Mason 2007-01-21 08:03:51 UTC
thanks!

I'll replace with the default ASF license block, as seen in the other .pm files,
if that's ok.
Comment 33 Justin Mason 2007-01-21 08:27:04 UTC
JimJag just noted the CLA as received, so I'll apply this rsn...
Comment 34 Matthias Leisi 2007-01-21 11:37:19 UTC
(In reply to comment #32)

> I'll replace with the default ASF license block, as seen in the other .pm files,
> if that's ok.

That's perfect, thanks.
Comment 35 Justin Mason 2007-01-22 04:32:32 UTC
ok, re-applied with the ASF license header:

svn commit -m "bug 4770: re-apply Mail::SpamAssassin::Plugin::ASN patch, now
that licensing is sorted.  exposes ASN data as a Bayes token and the _ASNCIDR_
and _ASN_ header-rewriting tags.  thanks to Matthias Leisi <matthias /at/
leisi.net>"
Sending        CREDITS
Sending        MANIFEST
Adding         lib/Mail/SpamAssassin/Plugin/ASN.pm
Sending        rules/v320.pre
Transmitting file data ....
Committed revision 498595.