Bug 6281 - Speeding up the pipeline from rules update to a final user
Status: NEW
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: RuleQA
Version: 3.3.0
Hardware: All All
Importance: P3 normal
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-01-08 09:13 UTC by Mark Martinec
Modified: 2015-04-12 15:32 UTC

Description Mark Martinec 2010-01-08 09:13:45 UTC
The New Year's Y2K10 blunder showed the importance of having a quick way to
update people's rules through sa-update. Even after that event, which somewhat
polished the rusty parts of the pipeline, the second incident (Bug 6279)
demonstrated that there is still a lot of room for improvement, and plenty of
chances for things to break - this one took about 15 hours to propagate
rules to the last DNS server.

Let's see if we can improve some of the steps involved, and avoid some
potential or real traps.

Here are the steps of the pipeline:

- commit updated rule (human response time, SVN availability, ..??..);
- propagate rules to mirrors (I'm not familiar with how this currently works);
- propagate changed rule ID to a master DNS server for spamassassin.org,
  increment its zone sequence number, reload the zone;
- propagate zone update to ALL slave DNS servers;
- let users run sa-update - periodically, or manually for intervention.
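
To make the "increment its zone sequence number" step concrete, here is a
minimal sketch. It assumes the zone uses the common date-based YYYYMMDDnn
serial convention; the function name next_serial is hypothetical and is not
part of any SpamAssassin tooling.

```python
from datetime import date


def next_serial(current: int, today: date) -> int:
    """Return the next zone serial, assuming a YYYYMMDDnn serial layout."""
    # The first change on a given day starts at nn = 00.
    base = int(today.strftime("%Y%m%d")) * 100
    # Later changes on the same day count up; max() also keeps the serial
    # strictly increasing if `current` is already ahead of today's base.
    return max(base, current + 1)
```

Slaves only transfer a zone when they see a serial higher than their own, so
whatever scheme is used, the value must never go backwards.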
Comment 1 Mark Martinec 2010-01-08 09:16:25 UTC
Bug 6279 Comment 8 (Kevin A. McGrail):

My $0.02: This sounds like something more for infra where DNS notifies /
transfers aren't being properly received.
And, if needed, I originally got involved with helping the SA project because
of DNS.  I run very nice, stable, disparate network name servers and would be
happy to be a DNS mirror (or primary) if it helps.
KAM


Bug 6279 Comment 9 (Justin Mason):

We should take this off the ticket (and probably to the dev@ list).
but +1 to replacing hyperreal.org with KAM's servers if possible
Comment 2 Mark Martinec 2010-01-08 09:33:02 UTC
Also closely related:


Bug 6265 Comment 7 (Warren Togami):
Fedora's RPM now runs sa-update by default on a nightly basis.
sa-update is no longer optional.


Bug 6265 Comment 8 (AXB):
Apologies for hijacking this bug...
I truly hope you're not hammering the donated sa-updates servers with this and
that RedHat uses its own sa-update server.


Bug 6265 Comment 9 (Warren Togami):
Mitigating factors:
* It skips sa-update if spamd or amavisd is not running, and we don't run spamd
by default if you have the package installed.
* It also delays a random amount of time before doing it, so it won't bog down
the server with requests all at the same moment.
Is this still too much?


Bug 6265 Comment 10 (AXB):
for default setups, once a week would be more than enough.
"dedicated" admins can tweak, most won't
imo, ideally, distros which enable sa-update by default should provide
resources to cover the load they're adding.
I imagine this auto sa-update could also land in mainstream RHEL / CentOS etc,
the load added can become... HUGE, plus the pressure put on donated time to run
the stuff to keep an even larger user base happy.
Honestly, don't think it's a good idea to enable sa-update by default. Dunno
what others think....especially the ones supplying the sa-update
infrastructure.


Bug 6265 Comment 13 (Justin Mason):
> > Honestly, don't think it's a good idea to enable sa-update by default. Dunno
> > what others think....especially the ones supplying the sa-update
> > infrastructure.
> I think it's a good idea but I *thought* sa-update checked DNS for the
> availability of an update not an http query.
> KAM
yep it does.  I'm quite happy to see sa-update enabled by default, to be
honest; if we need more mirrors, we need more mirrors, and that's easily done. 
in the meantime sa-update will do the right thing, retry where necessary, etc.
at some point we should do the "random offset from hour" thing ourselves, but
the right way (as per my coworker Colm):
http://www.stdlib.net/~colmmacc/2009/09/14/period-pain/
http://www.stdlib.net/~colmmacc/2009/09/27/period-pain-part-2/


Bug 6265 Comment 14 (Warren Togami):
http://cvs.fedoraproject.org/viewvc/devel/spamassassin/sa-update.cronscript?revision=1.7&view=co
Here is the script we run from cron by default as of 3.3.0.
* Looks for daemons and runs sa-update only if it sees a running daemon.
* Random delay up to 2 hours.
* .d directory to specify arbitrary channels in separate files, makes it easy
to add/remove channels automatically using packages later.
* Restarts the appropriate daemon after successful sa-update.
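
For readers who cannot reach the linked script, the behaviour listed above can
be sketched roughly as follows. This is not the Fedora script; it is an
illustration under assumptions: the daemon check uses pgrep with exact process
names (real process names may differ), the channel .d directory is left out,
and all function names are hypothetical. The restart step is left to the
caller.

```python
import random
import subprocess
import time

MAX_DELAY_SECONDS = 2 * 60 * 60   # "Random delay up to 2 hours"
DAEMONS = ("spamd", "amavisd")    # daemons named in this thread


def daemon_running(name: str) -> bool:
    """Return True if pgrep finds a process with exactly this name."""
    result = subprocess.run(["pgrep", "-x", name], stdout=subprocess.DEVNULL)
    return result.returncode == 0


def should_update(is_running=daemon_running) -> bool:
    """Only bother the update servers if a mail-filter daemon is running."""
    return any(is_running(name) for name in DAEMONS)


def pick_delay(rng: random.Random, max_delay: int = MAX_DELAY_SECONDS) -> int:
    """Pick a random delay in seconds so hosts do not all query at once."""
    return rng.randint(0, max_delay)


def run_once(rng=None, sleep=time.sleep) -> bool:
    """Return True when new rules were installed and a restart is needed."""
    if not should_update():
        return False
    sleep(pick_delay(rng or random.Random()))
    # sa-update exits with 0 only when an update was downloaded and installed.
    return subprocess.run(["sa-update"]).returncode == 0
```

A cron job would call run_once() and restart the appropriate daemon only when
it returns True.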
Comment 3 Warren Togami 2010-01-08 09:53:55 UTC
If the goal is to speed up the pipeline to final user, then is sa-update once a day really enough?  Consider the possibility of twice a day at random intervals.

Most of the time the sa-update channel would not have changed.  The cost of a negative check is only a DNS query.  Surely this is not so onerous?
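
As a rough illustration of why a negative check is cheap: the client only has
to look up one TXT record whose name is built from the reversed SpamAssassin
version (the 0.3.3.updates.spamassassin.org name appears later in this bug)
and compare its value with the locally installed rule revision. The helper
names below are hypothetical, and treating any larger published revision as
an update is an assumption; the actual comparison in sa-update may differ.

```python
def update_query_name(sa_version: str,
                      channel: str = "updates.spamassassin.org") -> str:
    """Build the TXT query name, e.g. 3.3.0 -> 0.3.3.updates.spamassassin.org."""
    reversed_version = ".".join(reversed(sa_version.split(".")))
    return f"{reversed_version}.{channel}"


def update_available(local_revision: int, txt_value: str) -> bool:
    """Compare the published revision (TXT record content) with the local one."""
    # Assumption: a strictly larger published revision means "update".
    return int(txt_value) > local_revision
```

Only when update_available() is true does the client need to contact a mirror
over HTTP.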
Comment 4 John Hardin 2010-01-08 15:22:14 UTC
(In reply to comment #2)

> Bug 6265 Comment 8 (AXB):
> I truly hope you're not hammering the donated sa-updates servers with this
> and that RedHat uses its own sa-update server.
> 
> Bug 6265 Comment 9 (Warren Togami):
> Mitigating factors:
> * It skips sa-update if spamd or amavisd is not running, and we don't run
> spamd by default if you have the package installed.
> * It also delays a random amount of time before doing it, so it won't bog
> down the server with requests all at the same moment.
> Is this still too much?

At this point I'd like to bring up Coral distributed caching again... (bug 6181)
Comment 5 Daryl C. W. O'Shea 2010-01-08 18:01:02 UTC
FWIW, I'm completely happy with the hosting setup of the default sa-update channel as it is now.  I'd rather not farm it out.  Eventually being able to compile statistics on versions in use, update intervals, number of clients, etc, would be of much interest... at least to me (and possibly to the ASF if we ever need to justify our use of resources... we probably consume one of the highest units of resources per committer).

Regarding the DNS for spamassassin.org... it's not managed by infra.  We manage that directly so please don't bother them about it.
Comment 6 Matthias Leisi 2010-01-09 00:43:30 UTC
The dnswl.org project can donate some DNS resources to SpamAssassin.
Comment 7 Daryl C. W. O'Shea 2010-01-09 14:30:27 UTC
BTW, we should probably return to talking to ASF infra about them running secondaries for DNS.  We started talking about it a year or so ago and then got sidetracked by being reprimanded for using the zone server for production services (which we still do anyway out of necessity).  Last time around they wanted to control the master zone file rather than slave off of our shadow master.

I think I'll talk to Paul again to see if this time we can get the ASF DNS servers set up as slaves of our shadow master.

I appreciate the offers so far, but I think our first step should be utilizing the infrastructure that has been sponsored (via ASF sponsorship) before we use the resources of others.  If after getting DNS onto the ASF servers we still need more DNS servers we could consider expanding to the outside then.
Comment 8 Kevin A. McGrail 2010-01-10 07:37:19 UTC
(In reply to comment #7)
> BTW, we should probably return to talking to ASF infra about them running
> secondaries for DNS.  We started talking about it a year or so ago and then got
> sidetracked by being reprimanded for using the zone server for production
> services (which we still do anyway out of necessity).  Last time around they
> wanted to control the master zone file rather than slave off of our shadow
> master.
> 
> I think I'll talk to Paul again to see if this time we can get the ASF DNS
> servers set up as slaves of our shadow master.
> 
> I appreciate the offers so far, but I think our first step should be utilizing
> the infrastructure that has been sponsored (via ASF sponsorship) before we use
> the resources of others.  If after getting DNS onto the ASF servers we still
> need more DNS servers we could consider expanding to the outside then.

+1 on that. I had assumed Infra was already involved.  However, in addition to getting more secondaries, we have to figure out why the current servers aren't getting timely notifies and transfers.
Comment 9 Mark Martinec 2010-01-10 15:12:23 UTC
I followed what happened to my (test) change r897247 on 60_whitelist_dkim.cf
(Bug 6279 #c7), letting a script poll the four DNS servers every minute:

Rule change committed on 2010-01-08 16:15 UTC,
zone serial number at that time was 2010010800,
TXT record showed "897136" (last change).

41 hours (!) later (2010-01-10 09:00 UTC) the three a,b,c.auth-ns.sonic.net
DNS servers started to pick up the change, which took about 6 minutes
to propagate to all three, including sequence number update on a zone SOA,
and a rules TXT record update.

The ns.hyperreal.org was left behind for another 10 minutes on a zone
SOA seq.no. change (still ok), but it took another 50 minutes for it
to pick up the TXT record change too. This coincides with the TTL on the
TXT record at the time of the SOA update by ns.hyperreal.org, which may
indicate that update notifications are not passed between the master
and a slave.

Which is weird in itself, as the SOA for spamassassin.org shows that
ns.hyperreal.org is the master and the other three are slaves,
not the other way around, as follows from propagation times. Strange.
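
For anyone who wants to repeat the measurement, a polling loop along these
lines would do. It is a sketch, not the script actually used: the function
names are hypothetical, the query is injected so the loop can be exercised
without touching DNS, and the optional dig_txt helper assumes the dig
command-line tool is installed.

```python
import subprocess
import time
from typing import Callable, Dict, Iterable, Optional


def dig_txt(name: str, server: str) -> str:
    """Ask one name server for a TXT record via dig (assumes dig is installed)."""
    out = subprocess.run(["dig", "+short", "TXT", name, "@" + server],
                         capture_output=True, text=True).stdout
    return out.strip().strip('"')


def watch_propagation(servers: Iterable[str], expected: str,
                      query: Callable[[str], str],
                      interval: float = 60.0,
                      max_polls: Optional[int] = None,
                      clock: Callable[[], float] = time.time,
                      sleep: Callable[[float], None] = time.sleep
                      ) -> Dict[str, float]:
    """Poll each server until it returns `expected`; record when it first did."""
    first_seen: Dict[str, float] = {}
    pending = list(servers)
    polls = 0
    while pending and (max_polls is None or polls < max_polls):
        for server in list(pending):
            if query(server) == expected:
                first_seen[server] = clock()
                pending.remove(server)
        polls += 1
        # Wait only if another polling round will actually happen.
        if pending and (max_polls is None or polls < max_polls):
            sleep(interval)
    return first_seen
```

With real servers, the query argument could be something like
lambda s: dig_txt("0.3.3.updates.spamassassin.org", s).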

As the TTL on a TXT record is 3600 seconds, clients could take
up to a further hour to notice a change. A client that polls hourly
but has bad luck could just miss the event, which may suggest that
we should not pick a round number for a TTL, but perhaps 50 minutes
(in view of Justin's link
http://www.stdlib.net/~colmmacc/2009/09/14/period-pain/ ).
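
To put rough numbers on this, here is a simplified upper bound. It is a model
under assumptions, not a measurement: the client is assumed to poll through a
caching resolver that honours the TTL, and the authoritative servers are
assumed to already carry the new record. The function name is hypothetical.

```python
def worst_case_notice_delay(ttl_seconds: int, poll_interval_seconds: int) -> int:
    """Upper bound on how long a polling client can keep seeing the old record."""
    # A cached answer fetched just before the change can live for a full TTL;
    # once it expires, the client's next poll may be a full interval away.
    return ttl_seconds + poll_interval_seconds


hourly = 3600
print(worst_case_notice_delay(3600, hourly))  # TTL of one hour
print(worst_case_notice_delay(3000, hourly))  # TTL of 50 minutes
```

Under this model an hourly poller can lag by up to two hours with the current
TTL, and by up to 110 minutes with a 50-minute TTL.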

But the principal question here is, why it took 41 hours from SVN check-in
to the first zone change. What steps are involved here? Cron job?
Is a manual intervention required or is this supposed to propagate
automatically?
Comment 10 Justin Mason 2010-01-11 03:51:07 UTC
'But the principal question here is, why it took 41 hours from SVN check-in
to the first zone change. What steps are involved here? Cron job?
Is a manual intervention required or is this supposed to propagate
automatically?'

http://wiki.apache.org/spamassassin/SaUpdateBackend details the steps between
check-in and sa-update publishing.  41 hours sounds like a very long time,
however, agreed; I would have expected 24 hours to be the highest potential
delay.
Comment 11 Mark Martinec 2010-01-21 07:24:52 UTC
Tonight's change went better: it was picked up by both mirrors and
all three DNS servers at about 9:45 UTC, which loosely corresponds
to a cron run start at 8:50 UTC as documented in the SaUpdateBackend wiki.

The ns.hyperreal.org DNS was again late for lunch by about 15 minutes,
it started to serve a fresh 0.3.3.updates.spamassassin.org TXT record
only after a TTL on the previous one expired (which can take up to one hour).
Again, this is most weird, as the ns.hyperreal.org is supposed to be a
master and the other three slaves. Perhaps it is behind some caching-only
front-end DNS.
Comment 12 Daryl C. W. O'Shea 2010-01-21 15:14:58 UTC
(In reply to comment #11)
> The ns.hyperreal.org DNS was again late for lunch by about 15 minutes,
> it started to serve a fresh 0.3.3.updates.spamassassin.org TXT record
> only after a TTL on the previous one expired (which can take up to one hour).
> Again, this is most weird, as the ns.hyperreal.org is supposed to be a
> master and the other three slaves. Perhaps it is behind some caching-only
> front-end DNS.

It looks like ns.hyperreal.org isn't getting or accepting notifies.  All four public NSes are slaves.  ns.hyperreal.org just happens to be listed as the public master.  There is a hidden master in actuality.

I wouldn't worry about it too much.  We're going to drop hyperreal.org in favour of ASF infrastructure.  I plan to discuss it with Paul this weekend.  It was on last weekend's agenda but something else came up. ;)
Comment 13 Justin Mason 2010-01-27 02:20:42 UTC
moving most remaining 3.3.0 bugs to 3.3.1 milestone
Comment 14 Justin Mason 2010-01-27 03:16:27 UTC
reassigning, too
Comment 15 Justin Mason 2010-03-23 16:33:43 UTC
moving all open 3.3.1 bugs to 3.3.2
Comment 16 Karsten Bräckelmann 2010-03-23 17:42:47 UTC
Moving back off of Security, which got changed by accident during the mass Target Milestone move.
Comment 17 Kevin A. McGrail 2013-06-21 16:08:00 UTC
Moving all open bugs where target is defined and 3.4.0 or lower to 3.4.1 target
Comment 18 Kevin A. McGrail 2015-04-12 15:32:36 UTC
I've long since fixed the DNS issue but improving ruleqa for faster turnaround, forced updates, etc. is still an important issue.  Moving to undefined as this is a ruleQA issue and doesn't depend on code anymore.