Bug 5891 - Let AWL keep separate records for DKIM-signed and unsigned mail
Summary: Let AWL keep separate records for DKIM-signed and unsigned mail
Status: RESOLVED FIXED
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: Libraries (show other bugs)
Version: unspecified
Hardware: Other All
: P5 enhancement
Target Milestone: 3.3.0
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-04-23 05:47 UTC by Mark Martinec
Modified: 2009-11-25 05:40 UTC (History)
2 users (show)



Attachment Type Modified Status Actions Submitter/CLA Status

Note You need to log in before you can comment on or make changes to this bug.
Description Mark Martinec 2008-04-23 05:47:05 UTC
The change to be submitted shortly lets AWL plugin keep separate records
in an SQL database for author addresses with a valid DKIM or DomainKeys
signature, and separate records for unsigned addresses. The idea is
that popular freemail domains like gmail.com and yahoo.com are frequently
abused to send spam, and it is much more efficient for a spammer to send
their junk directly, instead of going to trouble of submitting through
a freemail provider. And if a spam/fraud does come from a provider, at
least the yahoo.com is quite responsive to legitimate/proven complaints
sent to abuse@yahoo.com.

The resulting effect is that a message with a valid DKIM/DK signature
from such providers is far less likely to be spam than a message with
the same author address which carries no valid signature, so it makes
sense to keep separate AWL averages for each type, yielding significantly
more useful AWL scores.

For example:
  yahoo.com  not signed,  avg.score= 14.8
  yahoo.com  valid sign., avg.score= -0.7

  gmail.com  not signed,  avg.score=  2.9
  gmail.com  valid sign., avg.score= -3.3

It is implemented by adding one additional field 'signedby' to an SQL
table awl, which receives a signing identity if a message carries a
valid DKIM or DomainKeys signature, otherwise the field is set to an
empty string. The field awl.ip is ignored when selecting records with
a nonempty awl.signedby.

For compatibility with existing SQL tables without a signedby field
the new feature needs to be explicitly enabled in local.cf:
  auto_whitelist_distinguish_signed 1

Upgrading instructions are provided in README.awl, the awl_pg.sql and
the awl_pg.sql schemas already carry the new field - it does no harm
(and no good) if a field exists but a feature is not enabled.

I have tested it under MySQL 5.1 and PostgreSQL 8.2, along with an
upgrading (ALTER TABLE) procedure. The code is in use at our site for
the last couple of months, but was lacking documentation touches.

As an alternative to adding a new field, it would be just as fine to
re-purpose a field awl.ip to carry either an IP addrress or a signer id.
As this field if very short (10 characters), tables would need to be
modified one way or another, so I chose a somewhat cleaner approach.

One side-benefit of collected data is that average scores can be obtained
for each signing domain, yielding some form of a 'reputation score'
for each domain. The new code does provide a sub get_signer_reputation()
which provides an average score for a supplied signing id, and this score
can be used to adjust the AWL result. The call to get_signer_reputation
is currently disabled to save one SQL select (an extra index on signedby
is needed to make this quick enough). It turned out that after a while
this average 'reputation score' settles to a rather constant value, so
dynamically fetching it every time is a bit of an unnecessary effort,
and a manual mechanism suffices.

Since I don't use bdb databases for AWL I did not venture into modifying
DBBasedAddrList.pm, so enabling the feature only has an effect when
AWL database is in SQL. Improvements welcome.
Comment 1 Mark Martinec 2008-04-23 05:54:54 UTC
$ svn -m 'PR-5891: let AWL keep separate records
          for DKIM-signed and unsigned mail' ci
Sending        lib/Mail/SpamAssassin/AutoWhitelist.pm
Sending        lib/Mail/SpamAssassin/DBBasedAddrList.pm
Sending        lib/Mail/SpamAssassin/PersistentAddrList.pm
Sending        lib/Mail/SpamAssassin/Plugin/AWL.pm
Sending        lib/Mail/SpamAssassin/SQLBasedAddrList.pm
Sending        sql/README.awl
Sending        sql/awl_mysql.sql
Sending        sql/awl_pg.sql
Transmitting file data ........
Committed revision 650852.
Comment 2 Mark Martinec 2009-01-06 17:33:20 UTC
> As an alternative to adding a new field, it would be just as fine to
> re-purpose a field awl.ip to carry either an IP addrress or a signer id.
> As this field if very short (10 characters), tables would need to be
> modified one way or another, so I chose a somewhat cleaner approach.

Btw, with a r720963 (on 2008-11-26) I have lengthened the awl.ip field
to 45 characters, which was necessary to support IPv6 addresses.
This opened a previously dismissed possibility of using the same awl.ip
field for a signing domain when a message carries valid signature(s).
It's probably still cleaner to use a separate field as the current
code does - I just thought I should mention it.

As a message can carry more than one valid DKIM signature, keeping
multiple signing domains in one SQL field is still somewhat unclean
from a point of view of a database design - storing signers in a
separate table would be nicer. From a practical viewpoint, the
current simpler solution suffices for the time being.

Btw, is anybody (except me) running AWL on SQL with updated schema
and  'auto_whitelist_distinguish_signed 1'  ?
Comment 3 Mark Martinec 2009-08-04 18:00:19 UTC
Closing, if there are no objections.
The feature if off by default.
Comment 4 Mark Martinec 2009-09-22 04:57:47 UTC
Reopening, to solicit some further comments, and document recent work.

The following changes mostly just polished my previous work on the topic,
no changes in functionality (apart from a bug fix).

r817328 | mmartinec | 2009-09-21 19:24:02 +0200 (Mon, 21 Sep 2009)
AWL/AutoWhitelist/SQLBasedAddrList cleanups: remove leftovers of
signer reputations; prevent a new nonsigned query from updating
a signed SQL entry; average across all signed entries in case of
multiple valid signatures (e.g. author domain + signing remailer);
cleanup (long lines, minor details); documentation consistency

r816683 | mmartinec | 2009-09-18 17:51:14 +0200 (Fri, 18 Sep 2009)
When storing an IPv6 address to AWL database, append
a '::' to a /48 network address -  although it wastes
2 chars, it's nice to make it look like a syntactically
correct IPv6 address; just fits into SQL 16 char field.
Comment 5 Mark Martinec 2009-09-22 05:44:05 UTC
> Reopening, to solicit some further comments, and document recent work.

Now to my question.

Considering that an awl.ip field needs to be extended from 10 characters
(as in SA 3.2.3) to 16 characters or more to accommodate an /48 IPv6
address (RIPE is allocating /48 blocks to organizations) - perhaps
this opportunity could be seized and to also require adding a field
awl.signedby, as needed for the topic of this PR.

Note that SpamAssassin/MTA can see an IPv6 address in Received header
fields even if running on a host with no IPv6 connectivity - e.g. in
a header field inserted by a 'trusted' MTA.

Doing that could eliminate a need for the auto_whitelist_distinguish_signed
option, which is only there for the purpose of compatibility with an
unchanged awl SQL database. The less options, the fewer opportunity
for support questions, misconfigurations, undertested code ...

Also some admins would choose to keep auto_whitelist_distinguish_signed
at its default (off) compatibility setting solely out of laziness and for
sticking blindly to defaults, which has unfavourable affect on AWL results.

So, I propose that with upgrading to SA 3.3.0 and keeping existing AWL
database on SQL, one is obliged to do both changes (extend awl.ip and
add awl.signedby), not just the first. And to get rid of my compatibility
option auto_whitelist_distinguish_signed.

The required change is described in sql/README.awl :

MySQL:

  ALTER TABLE awl MODIFY ip varchar(16);

  ALTER TABLE awl
    DROP PRIMARY KEY,
    ADD signedby varchar(255) NOT NULL DEFAULT '',
    ADD PRIMARY KEY (username,email,signedby,ip);

PostgreSQL:

  ALTER TABLE awl ALTER ip TYPE varchar(16);

  DROP INDEX awl_pkey;
  ALTER TABLE awl
    ADD signedby varchar(255) NOT NULL DEFAULT '',
    ADD PRIMARY KEY (username,email,signedby,ip);
Comment 6 Giampaolo Tomassoni 2009-09-22 07:17:50 UTC
Mark, you positive adding a awl.signedby column is going to change AWL's life?

Isn't that most of the dkim and spf-signed mail from freemail providers is after all caming from a well-defined bunch of net bunches? This would end in a co-relation between the signedby field and the ip one. Signedby would then add almost nothing to the equation.

Also, please note that a awl.signedby would make the AWL test a network-based one, which actually is not.

I do understand AWL needs to adjust for ipv6 addresses. I would also agree it could be "squeezed" a bit further. But my very personal thinking about this, that is the one of (mostly) a SA user, is that anything is going to break existing setups must be worth the troubles.

So, the ipv6 problem do or will exist. Since patching it would require some changes to the AWL table structure, I am quite in favor of adding anything can squeeze more from AWL. But pardon me, I think a signedby column may at most make a difference in border cases.

In another bug you opened about AWL I said that detecting the assigned bunch address to which a source belongs was unfeasible. Well, that's true, but I forgot ASN routes. We can instead use ASN routes to discriminate the source. SA already uses the ASN plugin to discover the ASN number related to a source. The ASN plugin issues a DNS query to ans.routeviews.org in order to get it, but the reply from routeviews.org also contains route netaddress and netmask length... So, AWL should store a cidr instead of a fixed-length address. This would at least mean adding a column to the AWL table and change somehow the content (and possibly the name) of the ip one in order to accomodate a variable-length, IPvX network address. This would add a bit of complexity to the AWL code, but I believe it is worth the effort. I don't have numbers for this and I really would like to. What do you think about it?
Comment 7 Mark Martinec 2009-11-25 05:40:51 UTC
(In reply to comment #6)
> Mark, you positive adding a awl.signedby column is going to change AWL's life?
> 
> Isn't that most of the dkim and spf-signed mail from freemail providers is
> after all coming from a well-defined bunch of net bunches? This would end in a
> co-relation between the signedby field and the ip one. Signedby would then add
> almost nothing to the equation.

Let's keep the awl.signedby optional, as it is now for some time (enabled
by setting the: auto_whitelist_distinguish_signed 1).

I agree there is a strong correlation between an IP address and a signing
domain for direct mail. But it also does help validate some re-routed mail
(forwarding, and non-clobbering mailing lists like the users@spamassassin
or postfix-users). The collected data is also useful for assessing domain
reputations (the code to dynamically do that was in the awl for a while,
but I removed it, as it suffices to collect reputations offline from the
awl database).

> Also, please note that a awl.signedby would make the AWL test a network-based
> one, which actually is not.

Right.

> I do understand AWL needs to adjust for ipv6 addresses. I would also agree it
> could be "squeezed" a bit further. But my very personal thinking about this,
> that is the one of (mostly) a SA user, is that anything is going to break
> existing setups must be worth the troubles.
> 
> So, the ipv6 problem do or will exist. Since patching it would require some
> changes to the AWL table structure, I am quite in favor of adding anything can
> squeeze more from AWL. But pardon me, I think a signedby column may at most
> make a difference in border cases.

The IPv6 issues and the configurable net mask length is now tracked by Bug 6203.

> In another bug you opened about AWL I said that detecting the assigned bunch
> address to which a source belongs was unfeasible. Well, that's true, but I
> forgot ASN routes. We can instead use ASN routes to discriminate the source. SA
> already uses the ASN plugin to discover the ASN number related to a source. The
> ASN plugin issues a DNS query to ans.routeviews.org in order to get it, but the
> reply from routeviews.org also contains route netaddress and netmask length...
> So, AWL should store a cidr instead of a fixed-length address. This would at
> least mean adding a column to the AWL table and change somehow the content (and
> possibly the name) of the ip one in order to accomodate a variable-length, IPvX
> network address. This would add a bit of complexity to the AWL code, but I
> believe it is worth the effort. I don't have numbers for this and I really
> would like to. What do you think about it?

It is a good idea worth remembering, indeed the ASN data is already available
(if the plugin is enabled), it would just need to be passed to awl and managed
somehow. Let's leave this for some future version, where awl could be enhanced
into a more general purpose reputation-assessing tool.

I'd just close this ticket for now (3.3.0), but may be reopened later for
some new thoughts on future versions.