SA Bugzilla – Bug 6281
Speeding up the pipeline from a rules update to the end user
Last modified: 2015-04-12 15:32:36 UTC
The New Year's Y2K10 blunder showed the importance of having a quick way to update people's rules through sa-update. Even after that event, which polished some of the rusty parts of the pipeline, a second incident (Bug 6279) demonstrated there is still a lot of room for improvement and plenty of chances for things to break; that one took about 15 hours to propagate rules to the last DNS server. Let's see if we can improve some of the steps involved and avoid some potential or real traps. The steps of the pipeline are roughly:

- commit the updated rule (human response time, SVN availability, ..??..);
- propagate rules to mirrors (I'm not familiar with how this currently works);
- propagate the changed rule ID to the master DNS server for spamassassin.org, increment its zone serial number, reload the zone;
- propagate the zone update to ALL slave DNS servers;
- let users run sa-update, periodically or manually.
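The steps above compound: each stage can just miss its window, so worst-case delays add up across the whole pipeline. A minimal Python sketch of that arithmetic; the stage latencies are illustrative assumptions, not measured values from the project's infrastructure:

```python
# Illustrative model of the update pipeline's worst-case latency.
# Stage names follow the pipeline steps above; all figures are
# assumed for illustration only.

PIPELINE = [
    ("commit rule change", 0.0),              # counted as t=0
    ("mirror build cron (daily run)", 24.0),  # change lands just after a run
    ("master DNS zone update", 0.5),
    ("slave DNS transfer / notify", 1.0),
    ("client TXT record TTL expiry", 1.0),
    ("client sa-update poll (daily)", 24.0),
]

def worst_case_hours(stages):
    """Sum worst-case delays, assuming each stage just misses its window."""
    return sum(delay for _, delay in stages)

if __name__ == "__main__":
    for name, delay in PIPELINE:
        print(f"{name:35s} up to {delay:4.1f} h")
    print(f"worst case end-to-end: {worst_case_hours(PIPELINE):.1f} h")
```

With these assumed numbers the worst case already exceeds two days, which is the shape of the problem even before any step misbehaves.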
Bug 6279 Comment 8 (Kevin A. McGrail): My $0.02: this sounds like something more for infra, where DNS notifies / transfers aren't being properly received. And, if needed: I originally got involved with helping the SA project because of DNS. I run very nice, stable, disparate-network name servers and would be happy to be a DNS mirror (or primary) if it helps. KAM

Bug 6279 Comment 9 (Justin Mason): We should take this off the ticket (and probably to the dev@ list), but +1 to replacing hyperreal.org with KAM's servers if possible.
Also closely related:

Bug 6265 Comment 7 (Warren Togami): Fedora's RPM now runs sa-update by default on a nightly basis. sa-update is no longer optional.

Bug 6265 Comment 8 (AXB): Apologies for hijacking this bug... I truly hope you're not hammering the donated sa-update servers with this, and that RedHat uses its own sa-update server.

Bug 6265 Comment 9 (Warren Togami): Mitigating factors:
* It skips sa-update if spamd or amavisd is not running, and we don't run spamd by default if you have the package installed.
* It also delays a random amount of time before doing it, so it won't bog down the server with requests all at the same moment.
Is this still too much?

Bug 6265 Comment 10 (AXB): For default setups, once a week would be more than enough. "Dedicated" admins can tweak; most won't, imo. Ideally, distros which enable sa-update by default should provide resources to cover the load they're adding. I imagine this auto sa-update could also land in mainstream RHEL / CentOS etc.; the load added could become... HUGE, plus the pressure put on donated time to run the stuff to keep an even larger user base happy. Honestly, I don't think it's a good idea to enable sa-update by default. Dunno what others think... especially the ones supplying the sa-update infrastructure.

Bug 6265 Comment 13 (Justin Mason):
> > Honestly, don't think its a good idea to enable sa-update by default. Dunno
> > what others think....especially the ones supplying the sa-update
> > infrastructure.
> I think it's a good idea but I *thought* sa-update checked DNS for the
> availability of an update not an http query.
> KAM
Yep, it does. I'm quite happy to see sa-update enabled by default, to be honest; if we need more mirrors, we need more mirrors, and that's easily done. In the meantime sa-update will do the right thing, retry where necessary, etc.
At some point we should do the "random offset from hour" thing ourselves, but the right way (as per my coworker Colm):
http://www.stdlib.net/~colmmacc/2009/09/14/period-pain/
http://www.stdlib.net/~colmmacc/2009/09/27/period-pain-part-2/

Bug 6265 Comment 14 (Warren Togami): http://cvs.fedoraproject.org/viewvc/devel/spamassassin/sa-update.cronscript?revision=1.7&view=co
Here is the script we run from cron by default as of 3.3.0:
* Looks for daemons and runs sa-update only if it sees a running daemon.
* Random delay of up to 2 hours.
* A .d directory to specify arbitrary channels in separate files, which makes it easy to add/remove channels automatically using packages later.
* Restarts the appropriate daemon after a successful sa-update.
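One way to do the "random offset" right, in the spirit of Colm's posts, is a deterministic per-host offset rather than a fresh random sleep on every run. This is a hedged Python sketch of that idea, not what the Fedora cronscript actually does (it sleeps a fresh random amount each run):

```python
import hashlib

def host_offset(hostname: str, period_seconds: int = 3600) -> int:
    """Deterministic per-host offset within a polling period.

    Hashing the hostname spreads the population of clients roughly
    evenly over the period, while each individual host keeps a fixed
    phase from run to run (unlike a fresh random sleep, which makes a
    host's effective polling interval jitter between 0 and 2 periods).
    """
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % period_seconds

if __name__ == "__main__":
    # Hypothetical hostnames, for illustration only.
    for host in ("mx1.example.org", "mx2.example.org"):
        print(host, host_offset(host))
```

A cron job could then sleep `host_offset($(hostname))` seconds before calling sa-update, giving stable, evenly spread load on the mirrors.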
If the goal is to speed up the pipeline to final user, then is sa-update once a day really enough? Consider the possibility of twice a day at random intervals. Most of the time the sa-update channel would not have changed. The cost of a negative check is only a DNS query. Surely this is not so onerous?
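To make the cost argument concrete: sa-update (a Perl tool) checks the revision advertised in the channel's DNS TXT record against the locally installed one, and only downloads when it is newer. A hedged Python sketch of just that comparison, with the actual DNS lookup left out (the record name and payload format, e.g. "897136" in a TXT record under updates.spamassassin.org, follow the values discussed in this bug):

```python
def update_available(installed: int, txt_record: str) -> bool:
    """Return True when the channel's advertised revision is newer.

    `txt_record` is the payload of the channel's TXT record, e.g.
    "897136". A missing, empty, or malformed record is treated as
    "no update", so a transient DNS failure never triggers a download.
    """
    try:
        return int(txt_record.strip().strip('"')) > installed
    except (ValueError, AttributeError):
        return False
```

A negative check is therefore a single cached-friendly DNS query and an integer comparison; polling twice a day instead of once costs the mirrors essentially nothing.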
(In reply to comment #2)
> Bug 6265 Comment 8 (AXB):
> I truly hope you're not hammering the donated sa-updates servers with this
> and that RedHat uses its own sa-update server.
>
> Bug 6265 Comment 9 (Warren Togami):
> Mitigating factors:
> * It skips sa-update if spamd or amavisd is not running, and we don't run
> spamd by default if you have the packaged installed.
> * It also delays a random amount of time before doing it, so it wont bog
> down the server with requests all at the same moment.
> Is this still too much?

At this point I'd like to bring up Coral distributed caching again... (bug 6181)
FWIW, I'm completely happy with the hosting setup of the default sa-update channel as it is now. I'd rather not farm it out. Eventually being able to compile statistics on versions in use, update intervals, number of clients, etc, would be of much interest... at least to me (and possibly to the ASF if we ever need to justify our use of resources... we probably consume one of the highest units of resources per committer). Regarding the DNS for spamassassin.org... it's not managed by infra. We manage that directly so please don't bother them about it.
The dnswl.org project can donate some DNS resources to SpamAssassin.
BTW, we should probably return to talking to ASF infra about them running secondaries for DNS. We started talking about it a year or so ago and then got side tracked by being reprimanded for using the zone server for production services (which we still do anyway out of necessity). Last time around they wanted to control the master zone file rather than slave off of our shadow master. I think I'll talk to Paul again to see if this time we can get the ASF DNS servers setup as slaves of our shadow master. I appreciate the offers so far, but I think our first step should be utilizing the infrastructure that has been sponsored (via ASF sponsorship) before we use the resources of others. If after getting DNS onto the ASF servers we still need more DNS servers we could consider expanding to the outside then.
(In reply to comment #7)
> BTW, we should probably return to talking to ASF infra about them running
> secondaries for DNS. We started talking about it a year or so ago and then got
> side tracked by being reprimanded for using the zone server for production
> services (which we still do anyway out of necessity). Last time around they
> wanted to control the master zone file rather than slave off of our shadow
> master.
>
> I think I'll talk to Paul again to see if this time we can get the ASF DNS
> servers setup as slaves of our shadow master.
>
> I appreciate the offers so far, but I think our first step should be utilizing
> the infrastructure that has been sponsored (via ASF sponsorship) before we use
> the resources of others. If after getting DNS onto the ASF servers we still
> need more DNS servers we could consider expanding to the outside then.

+1 on that. I had assumed Infra was already involved. However, in addition to getting more secondaries, we have to figure out why the current servers are not getting timely notifies and transfers.
I followed what happened to my (test) change r897247 on 60_whitelist_dkim.cf (Bug 6279 #c7), letting a script poll the four DNS servers every minute.

The rule change was committed on 2010-01-08 16:15 UTC; the zone serial number at that time was 2010010800, and the TXT record showed "897136" (the last change). 41 hours (!) later (2010-01-10 09:00 UTC), the three a,b,c.auth-ns.sonic.net DNS servers started to pick up the change, which took about 6 minutes to propagate to all three, including the serial number update in the zone SOA and the rules TXT record update. The ns.hyperreal.org server was left behind for another 10 minutes on the zone SOA serial change (still ok), but it took another 50 minutes for it to pick up the TXT record change too. This coincides with the TTL on the TXT record at the time of the SOA update by ns.hyperreal.org, which may indicate that update notifications are not passed between master and slave. That is weird in itself, as the SOA for spamassassin.org shows that ns.hyperreal.org is the master and the other three are slaves, not the other way around as the propagation times would suggest. Strange.

As the TTL on the TXT record is 3600 seconds, clients could experience up to a further hour's delay before noticing a change. A client that polls hourly but has bad luck could just miss the event, which may suggest that we should not pick a round number for the TTL, but perhaps 50 minutes (in view of Justin's link http://www.stdlib.net/~colmmacc/2009/09/14/period-pain/ ).

But the principal question here is why it took 41 hours from SVN check-in to the first zone change. What steps are involved here? A cron job? Is manual intervention required, or is this supposed to propagate automatically?
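The TTL observation can be made concrete. Assuming a resolver that keeps serving a cached record until its TTL has fully passed, a TTL equal to the poll period can nearly double the client's delay, while any TTL strictly below the poll period (e.g. 50 minutes against hourly polls) bounds the delay to a single poll interval. A small sketch of that worst case:

```python
def worst_case_staleness(poll_s: int, ttl_s: int) -> int:
    """Worst-case seconds between a record change and a polling client
    observing it.

    Pessimistic assumptions: the client cached the old record at a poll
    immediately before the change, and the resolver serves a cached
    record until strictly after its TTL has elapsed.
    """
    # Number of polls until the old cached copy has definitely expired.
    polls_to_expiry = ttl_s // poll_s + 1
    return polls_to_expiry * poll_s

if __name__ == "__main__":
    print("hourly poll, 3600s TTL:", worst_case_staleness(3600, 3600))
    print("hourly poll, 3000s TTL:", worst_case_staleness(3600, 3000))
```

So a round 3600s TTL against hourly polls risks a two-hour worst case, while a 50-minute (3000s) TTL caps it at one hour, which is the point behind not choosing a TTL equal to the polling period.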
'But the principal question here is, why it took 41 hours from SVN check-in to the first zone change. What steps are involved here? Cron job? Is a manual intervention required or is this supposed to propagate automatically?'

http://wiki.apache.org/spamassassin/SaUpdateBackend details the steps between check-in and sa-update publishing. Agreed that 41 hours sounds like a very long time, though; I would have expected 24 hours to be the highest potential delay.
Tonight's change went better: it was picked up by both mirrors and all three DNS servers at about 9:45 UTC, which loosely corresponds to the cron run starting at 8:50 UTC as documented in the SaUpdateBackend wiki. The ns.hyperreal.org DNS was again late for lunch by about 15 minutes; it started to serve a fresh 0.3.3.updates.spamassassin.org TXT record only after the TTL on the previous one expired (which can take up to one hour). Again, this is most weird, as ns.hyperreal.org is supposed to be the master and the other three slaves. Perhaps it is behind some caching-only front-end DNS.
(In reply to comment #11)
> The ns.hyperreal.org DNS was again late for lunch by about 15 minutes,
> it started to serve a fresh 0.3.3.updates.spamassassin.org TXT record
> only after a TTL on the previous one expired (which can take up to one hour).
> Again, this is most weird, as the ns.hyperreal.org is supposed to be a
> master and the other three slaves. Perhaps it is behind some caching-only
> front-end DNS.

It looks like ns.hyperreal.org isn't getting or accepting notifies. All four public NSes are slaves; ns.hyperreal.org just happens to be listed as the public master. There is a hidden master in actuality. I wouldn't worry about it too much. We're going to drop hyperreal.org in favour of ASF infrastructure. I plan to discuss it with Paul this weekend. It was on last weekend's agenda but something else came up. ;)
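The hidden-master layout described in that reply can be sketched in BIND named.conf terms. This is a hedged illustration only: all addresses are documentation placeholders, not the project's real configuration, and the zone file path is made up.

```
// On the hidden master (never listed in the zone's NS records).
// Addresses are placeholders (RFC 5737 documentation ranges).
zone "spamassassin.org" {
    type master;
    file "master/spamassassin.org.zone";
    // Push NOTIFY to every public slave, including the server that
    // the SOA happens to name as "master" (ns.hyperreal.org here):
    also-notify { 192.0.2.10; 192.0.2.11; 192.0.2.12; 192.0.2.13; };
    allow-transfer { 192.0.2.10; 192.0.2.11; 192.0.2.12; 192.0.2.13; };
};

// On each public NS, including the nominal "master":
zone "spamassassin.org" {
    type slave;
    masters { 198.51.100.5; };  // the hidden master
};
```

A slave that silently drops or never receives those NOTIFY messages, as ns.hyperreal.org appears to, falls back to its SOA refresh timer plus record TTLs, which matches the observed ~1-hour lag.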
moving most remaining 3.3.0 bugs to 3.3.1 milestone
reassigning, too
moving all open 3.3.1 bugs to 3.3.2
Moving back off of Security, which got changed by accident during the mass Target Milestone move.
Moving all open bugs where target is defined and 3.4.0 or lower to 3.4.1 target
I've long since fixed the DNS issue but improving ruleqa for faster turnaround, forced updates, etc. is still an important issue. Moving to undefined as this is a ruleQA issue and doesn't depend on code anymore.