Bug 6876 - Partial / diff sync for sa-update
Summary: Partial / diff sync for sa-update
Status: NEW
Alias: None
Product: Spamassassin
Classification: Unclassified
Component: sa-update
Version: unspecified
Hardware: All All
Importance: P2 enhancement
Target Milestone: Undefined
Assignee: SpamAssassin Developer Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-12-11 09:48 UTC by Matthias Leisi
Modified: 2013-01-07 15:54 UTC



Description Matthias Leisi 2012-12-11 09:48:01 UTC
Creating this ticket to ensure the issue is not lost in the e-mail discussion around the sa-update mirrors: 

| I noticed that there are some IPs who download every 5 
| to 20 minutes (and are actually downloading, eg 
| 1395916.tar.gz at 251213 bytes according to the Apache 
| log). I'm sure it is worse on the other mirrors.

We have no way to know that this isn't just many servers 
behind a single firewall address. We know from tickets that 
some installations have hundreds and hundreds of servers, so 
this could be correct.

| Should this be limited? And if yes, within sa-update or 
| on the infrastructure level?

How to limit it has been debated repeatedly with no good 
answer. At worst, perhaps sa-update could download (not 
check) only once per day: it creates a file, and if that 
file is less than 23 hours old, the program aborts?  That is 
easy for admins to work around if they need to, but it serves 
as a simple barrier for those whose installations might have 
gone wacky.
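
A minimal sketch of the stamp-file idea above, in Python for illustration only (sa-update itself is a Perl script, and this is not its code). The function names, the stamp path and the injectable `now` argument are all hypothetical:

```python
import os
import time

# 23 hours, the threshold suggested above (hypothetical constant name).
MIN_INTERVAL_SECONDS = 23 * 60 * 60


def download_allowed(stamp_path, now=None, min_interval=MIN_INTERVAL_SECONDS):
    """Return True if there is no stamp file or it is at least min_interval old."""
    now = time.time() if now is None else now
    try:
        last = os.path.getmtime(stamp_path)
    except OSError:
        return True  # no stamp file yet, so this is the first download
    return (now - last) >= min_interval


def record_download(stamp_path, now=None):
    """Create the stamp file if needed and set its mtime to `now`."""
    now = time.time() if now is None else now
    with open(stamp_path, "a"):
        pass  # opening in append mode creates the file without truncating it
    os.utime(stamp_path, (now, now))
```

An admin who needs more frequent downloads only has to delete the stamp file, which matches the "easy to work around" property described above.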

| It's a bit the same effect we had back when we allowed 
| transfer of the dnswl.org zone files via HTTP. There, we 
| moved completely to using rsync, which I don't believe is 
| an option for sa-update (or is there a "native Perl" 
| implementation of an rsync client?).

For me, I want to keep things simpler so we can keep things 
running with minimal oversight.
Comment 1 John Hardin 2012-12-11 14:55:17 UTC
I've suggested it before and I'll suggest it again: if the volume of sa-update downloads is larger than we're comfortable supporting with available infrastructure, then the config files should try using the official mirror servers. This is simple, transparent, automatically distributed, retrieves from a cache server topologically near the requesting host, and is (in my experience) reliable. It also makes updates robust in the face of temporary unavailability of the mirror servers.

I'm not offering this as an alternative to code changes that rate-limit sa-update download attempts, but as another approach to use along with such limits to manage load.
Comment 2 Kevin A. McGrail 2013-01-04 18:34:08 UTC
(In reply to comment #1)
> I've suggested it before and I'll suggest it again: if the volume of
> sa-update downloads is larger than we're comfortable supporting with
> available infrastructure, then the config files should try using the
> official mirror servers. This is simple, transparent, automatically
> distributed, retrieves from a cache server topologically near the requesting
> host, and is (in my experience) reliable. It also makes updates robust in
> the face of temporary unavailability of the mirror servers.
> 
> I'm not offering this as an alternative to code changes that rate-limit
> sa-update download attempts, but as another approach to use along with such
> limits to manage load.

I believe the issue is that these are unnecessary checks; see, for example, bug 6655.

If someone is checking for updates, that just hits DNS and is minor. But we have systems that, for no known reason, check repeatedly for updates.

While I agree with you John, the reality is that replacing and keeping the existing infrastructure running for the entire project is time consuming.  I want to try and focus on publishing code rather than changing/improving infrastructure as much as possible.  

Regards,
KAM
Comment 3 John Hardin 2013-01-04 20:06:45 UTC
(In reply to comment #2)
> (In reply to comment #1)
> > I've suggested it before and I'll suggest it again: if the volume of
> > sa-update downloads is larger than we're comfortable supporting with
> > available infrastructure, then the config files should try using the
> > official mirror servers.

Argh! I don't know how "official mirror servers" crept into that comment.

What I've suggested before and was suggesting again here was to first try CORAL-ified URLs.

> > This is simple, transparent, automatically
> > distributed, retrieves from a cache server topologically near the requesting
> > host, and is (in my experience) reliable. It also makes updates robust in
> > the face of temporary unavailability of the mirror servers.
> > 
> > I'm not offering this as an alternative to code changes that rate-limit
> > sa-update download attempts, but as another approach to use along with such
> > limits to manage load.
> 
> I believe the issue is that these are unnecessary checks; see, for example,
> bug 6655.
> 
> If someone is checking for updates, that just hits DNS and is minor. But we
> have systems that, for no known reason, check repeatedly for updates.
> 
> While I agree with you John, the reality is that replacing and keeping the
> existing infrastructure running for the entire project is time consuming.  I
> want to try and focus on publishing code rather than changing/improving
> infrastructure as much as possible.  

Trying CORAL first is no change whatsoever to infrastructure; I apologize that my misstatement about "official mirror servers" gave that impression. It is either a minor change to the base sa-update URLs file, or a (likely) minor code change in sa-update, to first try downloading:

  http://buildbot.spamassassin.org.nyud.net:8080/updatestage/1422798.tar.gz

before trying to download:

  http://buildbot.spamassassin.org/updatestage/1422798.tar.gz

Simply append ".nyud.net:8080" to the base host FQDN and you try to retrieve the file from a transparent, automatic, distributed cache service.
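
The hostname rewrite described above can be sketched as follows. This is an illustrative Python snippet, not code from sa-update; `coralize` and `candidate_urls` are hypothetical names, and the only facts taken from this thread are the ".nyud.net" suffix and port 8080:

```python
from urllib.parse import urlsplit, urlunsplit

CORAL_SUFFIX = ".nyud.net"  # suffix quoted in this thread


def coralize(url, port=8080):
    """Return `url` with the CORAL suffix appended to its host name.

    Pass port=None to omit the explicit port from the rewritten URL.
    """
    parts = urlsplit(url)
    if not parts.hostname:
        raise ValueError("URL has no host: %r" % url)
    host = parts.hostname + CORAL_SUFFIX
    netloc = host if port is None else "%s:%d" % (host, port)
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


def candidate_urls(url, port=8080):
    """Return the CORAL-ified URL first, then the original as a fallback."""
    return [coralize(url, port), url]
```

For the example in this comment, `candidate_urls("http://buildbot.spamassassin.org/updatestage/1422798.tar.gz")` yields the two URLs shown above, in the order they would be tried.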
Comment 4 Michael Parker 2013-01-04 20:43:14 UTC
> Trying CORAL first is no change whatsoever to infrastructure; I apologize
> that my misstatement about "official mirror servers" gave that impression.
> It is either a minor change to the base sa-update URLs file, or a (likely)
> minor code change in sa-update, to first try downloading:
>
>   http://buildbot.spamassassin.org.nyud.net:8080/updatestage/1422798.tar.gz
>
> before trying to download:
>
>   http://buildbot.spamassassin.org/updatestage/1422798.tar.gz
>
> Simply append ".nyud.net:8080" to the base host FQDN and you try to retrieve
> the file from a transparent, automatic, distributed cache service.

We used to do this and changed it for some reason. You might want to search the history to see why, to make sure it wasn't some sort of issue rather than just no longer needing the cache.
Comment 5 John Hardin 2013-01-04 21:50:19 UTC
(In reply to comment #4)
> > simply append ".nyud.net:8080" to the base host FQDN and you try to retrieve 
> > the file from a transparent automatic distributed cache service.
> 
> We used to do this and changed it for some reason. You might want to search
> the history to see why, to make sure it wasn't some sort of issue rather
> than just no longer needing the cache.

ISTR the reason was that some people found it (or believed it to be) unreliable.

If sa-update is robust in the face of a corrupt download from one mirror (i.e. it tries another if one fails) this shouldn't be a problem. A corrupt download is a potential problem and a common failure mode that is not specific to any particular source.

Coral has been reliable in my experience.
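
A rough sketch of the "try another source if one fails" behaviour discussed above, in Python for illustration. It is not sa-update's actual logic: the function name, the injected `fetch` callable and the use of a SHA-256 digest as the corruption check are all assumptions made for this example:

```python
import hashlib


def fetch_verified(urls, expected_sha256, fetch):
    """Try each URL in order; return (url, body) for the first good download.

    `fetch` is any callable that takes a URL and returns bytes, raising
    OSError when the source is unreachable. Returns None if every source fails.
    """
    for url in urls:
        try:
            body = fetch(url)
        except OSError:
            continue  # source unreachable: move on to the next one
        if hashlib.sha256(body).hexdigest() == expected_sha256:
            return url, body
        # Digest mismatch: treat it as a corrupt download and try the next source.
    return None
```

With a check like this in place, a cache that occasionally serves a bad copy only costs one extra request, since the next source in the list is tried automatically.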
Comment 6 Tom Schulz 2013-01-07 15:04:01 UTC
>  http://buildbot.spamassassin.org.nyud.net:8080/updatestage/1422798.tar.gz

Hmm.. A non-standard port. That will not go through our firewall unless I
add a rule for it. It is not too hard to do that as long as I know that I
need to. Another place for some documentation.
Comment 7 Henrik Krohns 2013-01-07 15:54:07 UTC
Coral has (also) been available on the standard port for ages.

http://buildbot.spamassassin.org.nyud.net/updatestage/1422798.tar.gz