SA Bugzilla – Bug 5891
Let AWL keep separate records for DKIM-signed and unsigned mail
Last modified: 2009-11-25 05:40:51 UTC
The change to be submitted shortly lets AWL plugin keep separate records in an SQL database for author addresses with a valid DKIM or DomainKeys signature, and separate records for unsigned addresses. The idea is that popular freemail domains like gmail.com and yahoo.com are frequently abused to send spam, and it is much more efficient for a spammer to send their junk directly, instead of going to trouble of submitting through a freemail provider. And if a spam/fraud does come from a provider, at least the yahoo.com is quite responsive to legitimate/proven complaints sent to abuse@yahoo.com. The resulting effect is that a message with a valid DKIM/DK signature from such providers is far less likely to be spam than a message with the same author address which carries no valid signature, so it makes sense to keep separate AWL averages for each type, yielding significantly more useful AWL scores. For example: yahoo.com not signed, avg.score= 14.8 yahoo.com valid sign., avg.score= -0.7 gmail.com not signed, avg.score= 2.9 gmail.com valid sign., avg.score= -3.3 It is implemented by adding one additional field 'signedby' to an SQL table awl, which receives a signing identity if a message carries a valid DKIM or DomainKeys signature, otherwise the field is set to an empty string. The field awl.ip is ignored when selecting records with a nonempty awl.signedby. For compatibility with existing SQL tables without a signedby field the new feature needs to be explicitly enabled in local.cf: auto_whitelist_distinguish_signed 1 Upgrading instructions are provided in README.awl, the awl_pg.sql and the awl_pg.sql schemas already carry the new field - it does no harm (and no good) if a field exists but a feature is not enabled. I have tested it under MySQL 5.1 and PostgreSQL 8.2, along with an upgrading (ALTER TABLE) procedure. The code is in use at our site for the last couple of months, but was lacking documentation touches. As an alternative to adding a new field, it would be just as fine to re-purpose a field awl.ip to carry either an IP addrress or a signer id. As this field if very short (10 characters), tables would need to be modified one way or another, so I chose a somewhat cleaner approach. One side-benefit of collected data is that average scores can be obtained for each signing domain, yielding some form of a 'reputation score' for each domain. The new code does provide a sub get_signer_reputation() which provides an average score for a supplied signing id, and this score can be used to adjust the AWL result. The call to get_signer_reputation is currently disabled to save one SQL select (an extra index on signedby is needed to make this quick enough). It turned out that after a while this average 'reputation score' settles to a rather constant value, so dynamically fetching it every time is a bit of an unnecessary effort, and a manual mechanism suffices. Since I don't use bdb databases for AWL I did not venture into modifying DBBasedAddrList.pm, so enabling the feature only has an effect when AWL database is in SQL. Improvements welcome.
$ svn -m 'PR-5891: let AWL keep separate records for DKIM-signed and unsigned mail' ci Sending lib/Mail/SpamAssassin/AutoWhitelist.pm Sending lib/Mail/SpamAssassin/DBBasedAddrList.pm Sending lib/Mail/SpamAssassin/PersistentAddrList.pm Sending lib/Mail/SpamAssassin/Plugin/AWL.pm Sending lib/Mail/SpamAssassin/SQLBasedAddrList.pm Sending sql/README.awl Sending sql/awl_mysql.sql Sending sql/awl_pg.sql Transmitting file data ........ Committed revision 650852.
> As an alternative to adding a new field, it would be just as fine to > re-purpose a field awl.ip to carry either an IP addrress or a signer id. > As this field if very short (10 characters), tables would need to be > modified one way or another, so I chose a somewhat cleaner approach. Btw, with a r720963 (on 2008-11-26) I have lengthened the awl.ip field to 45 characters, which was necessary to support IPv6 addresses. This opened a previously dismissed possibility of using the same awl.ip field for a signing domain when a message carries valid signature(s). It's probably still cleaner to use a separate field as the current code does - I just thought I should mention it. As a message can carry more than one valid DKIM signature, keeping multiple signing domains in one SQL field is still somewhat unclean from a point of view of a database design - storing signers in a separate table would be nicer. From a practical viewpoint, the current simpler solution suffices for the time being. Btw, is anybody (except me) running AWL on SQL with updated schema and 'auto_whitelist_distinguish_signed 1' ?
Closing, if there are no objections. The feature if off by default.
Reopening, to solicit some further comments, and document recent work. The following changes mostly just polished my previous work on the topic, no changes in functionality (apart from a bug fix). r817328 | mmartinec | 2009-09-21 19:24:02 +0200 (Mon, 21 Sep 2009) AWL/AutoWhitelist/SQLBasedAddrList cleanups: remove leftovers of signer reputations; prevent a new nonsigned query from updating a signed SQL entry; average across all signed entries in case of multiple valid signatures (e.g. author domain + signing remailer); cleanup (long lines, minor details); documentation consistency r816683 | mmartinec | 2009-09-18 17:51:14 +0200 (Fri, 18 Sep 2009) When storing an IPv6 address to AWL database, append a '::' to a /48 network address - although it wastes 2 chars, it's nice to make it look like a syntactically correct IPv6 address; just fits into SQL 16 char field.
> Reopening, to solicit some further comments, and document recent work. Now to my question. Considering that an awl.ip field needs to be extended from 10 characters (as in SA 3.2.3) to 16 characters or more to accommodate an /48 IPv6 address (RIPE is allocating /48 blocks to organizations) - perhaps this opportunity could be seized and to also require adding a field awl.signedby, as needed for the topic of this PR. Note that SpamAssassin/MTA can see an IPv6 address in Received header fields even if running on a host with no IPv6 connectivity - e.g. in a header field inserted by a 'trusted' MTA. Doing that could eliminate a need for the auto_whitelist_distinguish_signed option, which is only there for the purpose of compatibility with an unchanged awl SQL database. The less options, the fewer opportunity for support questions, misconfigurations, undertested code ... Also some admins would choose to keep auto_whitelist_distinguish_signed at its default (off) compatibility setting solely out of laziness and for sticking blindly to defaults, which has unfavourable affect on AWL results. So, I propose that with upgrading to SA 3.3.0 and keeping existing AWL database on SQL, one is obliged to do both changes (extend awl.ip and add awl.signedby), not just the first. And to get rid of my compatibility option auto_whitelist_distinguish_signed. The required change is described in sql/README.awl : MySQL: ALTER TABLE awl MODIFY ip varchar(16); ALTER TABLE awl DROP PRIMARY KEY, ADD signedby varchar(255) NOT NULL DEFAULT '', ADD PRIMARY KEY (username,email,signedby,ip); PostgreSQL: ALTER TABLE awl ALTER ip TYPE varchar(16); DROP INDEX awl_pkey; ALTER TABLE awl ADD signedby varchar(255) NOT NULL DEFAULT '', ADD PRIMARY KEY (username,email,signedby,ip);
Mark, you positive adding a awl.signedby column is going to change AWL's life? Isn't that most of the dkim and spf-signed mail from freemail providers is after all caming from a well-defined bunch of net bunches? This would end in a co-relation between the signedby field and the ip one. Signedby would then add almost nothing to the equation. Also, please note that a awl.signedby would make the AWL test a network-based one, which actually is not. I do understand AWL needs to adjust for ipv6 addresses. I would also agree it could be "squeezed" a bit further. But my very personal thinking about this, that is the one of (mostly) a SA user, is that anything is going to break existing setups must be worth the troubles. So, the ipv6 problem do or will exist. Since patching it would require some changes to the AWL table structure, I am quite in favor of adding anything can squeeze more from AWL. But pardon me, I think a signedby column may at most make a difference in border cases. In another bug you opened about AWL I said that detecting the assigned bunch address to which a source belongs was unfeasible. Well, that's true, but I forgot ASN routes. We can instead use ASN routes to discriminate the source. SA already uses the ASN plugin to discover the ASN number related to a source. The ASN plugin issues a DNS query to ans.routeviews.org in order to get it, but the reply from routeviews.org also contains route netaddress and netmask length... So, AWL should store a cidr instead of a fixed-length address. This would at least mean adding a column to the AWL table and change somehow the content (and possibly the name) of the ip one in order to accomodate a variable-length, IPvX network address. This would add a bit of complexity to the AWL code, but I believe it is worth the effort. I don't have numbers for this and I really would like to. What do you think about it?
(In reply to comment #6) > Mark, you positive adding a awl.signedby column is going to change AWL's life? > > Isn't that most of the dkim and spf-signed mail from freemail providers is > after all coming from a well-defined bunch of net bunches? This would end in a > co-relation between the signedby field and the ip one. Signedby would then add > almost nothing to the equation. Let's keep the awl.signedby optional, as it is now for some time (enabled by setting the: auto_whitelist_distinguish_signed 1). I agree there is a strong correlation between an IP address and a signing domain for direct mail. But it also does help validate some re-routed mail (forwarding, and non-clobbering mailing lists like the users@spamassassin or postfix-users). The collected data is also useful for assessing domain reputations (the code to dynamically do that was in the awl for a while, but I removed it, as it suffices to collect reputations offline from the awl database). > Also, please note that a awl.signedby would make the AWL test a network-based > one, which actually is not. Right. > I do understand AWL needs to adjust for ipv6 addresses. I would also agree it > could be "squeezed" a bit further. But my very personal thinking about this, > that is the one of (mostly) a SA user, is that anything is going to break > existing setups must be worth the troubles. > > So, the ipv6 problem do or will exist. Since patching it would require some > changes to the AWL table structure, I am quite in favor of adding anything can > squeeze more from AWL. But pardon me, I think a signedby column may at most > make a difference in border cases. The IPv6 issues and the configurable net mask length is now tracked by Bug 6203. > In another bug you opened about AWL I said that detecting the assigned bunch > address to which a source belongs was unfeasible. Well, that's true, but I > forgot ASN routes. We can instead use ASN routes to discriminate the source. SA > already uses the ASN plugin to discover the ASN number related to a source. The > ASN plugin issues a DNS query to ans.routeviews.org in order to get it, but the > reply from routeviews.org also contains route netaddress and netmask length... > So, AWL should store a cidr instead of a fixed-length address. This would at > least mean adding a column to the AWL table and change somehow the content (and > possibly the name) of the ip one in order to accomodate a variable-length, IPvX > network address. This would add a bit of complexity to the AWL code, but I > believe it is worth the effort. I don't have numbers for this and I really > would like to. What do you think about it? It is a good idea worth remembering, indeed the ASN data is already available (if the plugin is enabled), it would just need to be passed to awl and managed somehow. Let's leave this for some future version, where awl could be enhanced into a more general purpose reputation-assessing tool. I'd just close this ticket for now (3.3.0), but may be reopened later for some new thoughts on future versions.