The Dos and Don’ts of software mirroring

Posted on October 5, 2005, under general.

I was reminded today that it’s 3 years this week since the first re-launch went live, and we got swamped with about 400Mbit/sec of RedHat 8 downloads. In those 3 years I’ve really enjoyed administering “ftp dot”, designing and building 3 successive systems, getting to present papers about it to share our experiences, and beating the big guys on a minuscule budget, just by putting in the time and effort to really know our systems.

We now mirror well over 50,000 distinct projects (although I’m cheating, since most of those are via SourceForge), over 100 distinct targets, nearly 5 terabytes of content and just shy of 11 million files through 4Gbit/sec of capacity; we’ve come a long way. We’ve served over a gigabit/sec of real traffic and over 23,000 concurrent downloads from a single webserver. We think we’re definitely a contender for hosting the world’s busiest webserver (it’s hard to tell, the pornography guys don’t publish stats). Those 3 years have also given us a lot of experience with common problems that arise, and to mark our third birthday (of sorts), I thought it would be a good idea to share some of that experience and maybe make the whole process easier for everyone.

It has always annoyed me that although the mirroring community is a vital part of the Open Source ecosystem, it doesn’t itself operate on any Open Source principles. There really is no central discussion forum or mailing list for mirror server operators and there isn’t much cross-pollination of ideas, which is a shame – because there are some incredibly impressive mirrors, from which we’ve learned a lot. If any mirror operators are reading, and would like to be on such a list, I’ve created a mirror-ops mailing list for that purpose. I think it would be great to see some co-operation, and maybe even a central point of contact for software distributors.

The last 3 years have also seen growth in technology. Excellent protocols and software like BitTorrent have really taken off and become part of the software distribution landscape. But file-sharing isn’t a catch-all, and it is still generally not as fast or easy for end-users as downloading from a well-connected nearby mirror, so it will be a while before traditional mirrors become outdated. All the same, we’ve noticed that a lot of software distributors underrate the continued usefulness of software mirrors. Even if file-sharing protocols could reach everyone and be just as speedy, there is still a significant portion of people who always default to their local mirror when looking for content. To use management buzz-speak: it’s a key distribution chain because that’s where they expect to find it. In some cases, if they don’t find it there, the transient desire may disappear, and they’ll never download your software. It’s weird, but there you go; these people exist.

Just to give one example, we deliberately are not listed as a mirror of Apache software, because our counterparts in BT Ireland do a fine job of it already and we didn’t see a point in crowding the list. Nevertheless we serve a lot of Apache downloads from people who’ve manually found the content, or simply expect it to be there (a lot of our own customers fall into this category).

Software mirroring is still important to Open Source. It saves volunteers an awful lot of money in potential bandwidth usage, and it provides resilience and redundancy where it otherwise wouldn’t be available. And yet, mainly due to the aforementioned lack of co-operation between mirror servers, a lot of projects find problems the hard way, repeat the mistakes of others or make it harder to get their content to end-users than it need be. To help try and correct this, here are 10 “Dos and Don’ts” for software projects, based on the most frequent problems and questions we’ve encountered over the last 3 years. If you’re involved in an Open Source project, consider forwarding it to whoever looks after the distribution side of things.

  1. Do make it easy and clear how to mirror your project.

    If there isn’t a simple “How to mirror” guide somewhere on your site, there needs to be. Mirrors are going out of their way to give you free resources; please don’t make it hard for them. Searching for the word “mirror” on your front page should probably lead to a useful link too.

    If you are mailing mirror operators asking to get your software mirrored, firstly I apologise on behalf of mirror operators that there is no single point of contact that would reach most of us. For your part, make the instructions concise and simple, and when asking to be mirrored, include all of the information that would be necessary to do so.

    Do these things and you’ll have mirror operators doing the hard work for you. You’d be surprised how often they get requests from end-users that content be mirrored. Just having a good guide will get you nearly all of the way there, and you may never even need to intervene in order to get your project mirrored.

  2. Don’t require dynamic content on your mirrors.

    In my opinion this is the cardinal sin of software mirroring. It’s bad enough that projects do it but it’s even worse that a lot of mirror server administrators let them away with it. Yes, I’m looking at you PHP and Eclipse.

    Mirror servers typically have a lot of storage and a lot of bandwidth, a twin combination that makes them not unattractive targets for script kiddies and more clueful attackers. Mirror server admins do not have the time, or the inclination, to review your PHP code, your server-side includes or, well, any of your dynamic content. Do you really want to be responsible for a flaw which doesn’t just get you compromised but also turns dozens if not hundreds of well-resourced mirror servers into someone’s botnet?

    When you think about it, a list of mirrors is a very convenient source of well-resourced systems running identical code. If a bug is found, it could be disastrous. And I don’t care how good you think your code is; it contains bugs. If that wasn’t reason enough, mirroring systems are for distributing content, not generating it; we really don’t need fancy tricks wasting CPU.

  3. Do use rsync and HTTP.

    The vast majority of software mirroring is done via rsync, and it is more than good enough for any project. Mirroring via FTP is a huge pain in the backside, and mirroring via HTTP is at least 100 times worse again. rsync was designed with this task in mind, and since the vast majority of projects a mirror server will typically mirror use rsync, you’ll find that using rsync allows you to integrate with their architecture much more easily and quickly.

    While I’m ranting about preferred protocols: if you maintain a list of mirrors, which typically lists the URLs with one for ftp://mirror-server/, one for http://mirror-server/ and so on, you should prefer HTTP in your layout. Have it be the default option, list it first, whatever; either way, try and get users using HTTP and not FTP. O.k., so I’m a little biased, as an Apache httpd committer; but hey, I’m also a mod_ftp committer, and I’m telling you HTTP is a much better protocol for downloads, and there are far fewer end-user problems that way.

  4. Don’t ever change the contents of a mirrored file.

    O.k., this isn’t a 100% don’t, there are some exceptions (like for a README file, or for a packages list file – which has to stay in one place but change over time), but for the most part you should never ever, ever change the content of a mirrored file, especially if it’s a binary file. If the file changes, then so should the filename, usually by incrementing some part of a version number (which is, of course, a part of the filename, right?).

    There are a few reasons why this is good practice: a) it ensures the change will actually be mirrored. If a file’s size hasn’t changed, a lot of systems won’t re-mirror it – even if the mtime has changed. b) It guards against a race condition we see a lot: the packages list file, or an FTP or HTTP index, says one thing about a file (like, say, a checksum or a file size) and then when the user downloads it, they don’t match, because the file has changed between the time they downloaded the package list and the time they downloaded the file itself. We get a fair number of end-user reports telling us they’ve downloaded a corrupt package, when all that actually happened was that they were unlucky enough to do an apt/yum/ports upgrade at the same time as the content was being re-mirrored.

    We have looked at building atomic mirroring systems, using per-project snapshots, or clever hardlink and symlink tricks and then keeping track of a user’s session and directing them to the correct copy – but doing these kinds of things server-side is highly non-trivial and resource intensive, whereas it’s incredibly easy for you to just not modify file contents, ever!
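
    One way to catch an in-place change before it reaches the mirrors is to compare checksums between two snapshots of the release tree. What follows is only a minimal Python sketch, not part of any existing mirroring tool. The function names, the manifest layout (relative path mapped to SHA-256 digest) and the `allowed` set for files like a README or packages list are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable, List


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Reads the whole file into memory; acceptable for a sketch.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def changed_in_place(old: Dict[str, str], new: Dict[str, str],
                     allowed: Iterable[str] = ()) -> List[str]:
    """Return names present in both manifests whose contents differ.

    Names in `allowed` (hypothetical exceptions such as README) are skipped.
    """
    exempt = set(allowed)
    return sorted(
        name for name, digest in new.items()
        if name in old and old[name] != digest and name not in exempt
    )
```

    Running `changed_in_place` on the manifests from before and after a release would list any file that kept its name but changed its content, which is exactly the case that should have had a new version number instead.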

  5. Do use a tiered hierarchical distribution chain.

    If your project starts to develop anything beyond a dozen or so mirrors and contains more than a small amount of content, please strongly consider managing a tiered hierarchy of mirrors. For one thing, this will reduce the bandwidth you have to use to synchronise with those mirrors, but for another it means that the network of mirrors has a much better chance of replicating all of the content when the primary is swamped (which is exactly when you want your mirrors to have the content).

    There are other solutions to this problem, like well-managed hidden-primaries, but realistically, using a tiered hierarchy of mirrors is the easiest and best way to cope with this problem. All it requires is for you to nominate a few tier 1 mirrors and ask other mirrors to sync from their closest one.

    Do this, and you’ll avoid the RedHat problem, where they spend money on managing more primaries, bandwidth and administrators than they actually need, but still have busy times close to releases when the mirror network can’t sync the content. This dramatically increases the workload needed for mirror operators to donate their hard disk space and bandwidth to RedHat. Contrast this with another distro, Debian, who manage a tiered mirroring system. We have never had any problems syncing Debian content, and the procedure is stunningly simple. Oh, and we’ve never had to wade through >100 mails a month for Debian either. They do this with some volunteer effort, a single primary and a healthy dose of forethought as their only resources; you can too.

    If you want to go the extra mile, and really make life easy for yourself and your mirrors, you should consider push mirroring, which allows you to trigger mirror updates on demand.

  6. Don’t overburden mirror operators with stuff they do not need to know about.

    You should aim to keep your distribution system simple and efficient, with as few dependencies and as little administrative intervention necessary as possible. Think long and hard before you require your mirrors to keep track of 100 new MIME types in their webserver configuration; is there another way you could do this? Try to remember that it’s highly likely that mirror operators will be mirroring a lot more than just your project.

    If you have a mailing list for your mirrors, please try and limit traffic to what absolutely everyone on that list needs to know. A >100 mail per month junk-mail box à la the RedHat mirrors list is considerably less than ideal. And clamav folks: the mirrors do not all need to know when you add and remove DNS records.

  7. Do include timestamps in your mirroring system.

    Timestamps, like the ones you’ll find at are a great idea, that not enough projects use. They are very simple to implement – all you have to do is ask your mirror operators to add a line like “date > somedir/`hostname`” at the end of their mirroring script.

    In return for that simple command, you gain a lot. The obvious application is in determining whether a mirror is up to date or not, which can be integrated into a monitoring system if you want. But even better than that, these timestamp files add trace-ability and loop-prevention to the mirroring network.

    More than a few mirror operators sync content from a nearby mirror rather than a primary, and occasionally the operator of that nearby mirror can have the very same thought, “hey, let’s mirror from that nearby mirror”, and voila you have two mirrors happily syncing from each other, without ever updating any content. You’d be surprised just how common that is. Don’t let our stupidity infect your project.
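
    To make the monitoring use concrete, here is a rough Python sketch of both halves: writing a per-host trace file at the end of a sync, and finding mirrors whose trace file is too old. It departs from the one-line `date` command by storing seconds since the epoch, which is easier to compare. The function names, the directory layout and that file format are assumptions for illustration, not an existing convention.

```python
import socket
import time
from pathlib import Path
from typing import List, Optional


def write_trace(trace_dir: Path, hostname: Optional[str] = None,
                now: Optional[float] = None) -> Path:
    """Record when this mirror last finished syncing, in a file named after the host."""
    host = hostname or socket.gethostname()
    stamp = int(time.time() if now is None else now)
    trace_dir.mkdir(parents=True, exist_ok=True)
    path = trace_dir / host
    # Assumed format: one line holding seconds since the epoch.
    path.write_text(f"{stamp}\n")
    return path


def stale_mirrors(trace_dir: Path, max_age_seconds: int,
                  now: Optional[float] = None) -> List[str]:
    """Return hostnames whose trace file is older than max_age_seconds."""
    current = time.time() if now is None else now
    stale = []
    for path in sorted(trace_dir.iterdir()):
        stamp = int(path.read_text().strip())
        if current - stamp > max_age_seconds:
            stale.append(path.name)
    return stale
```

    Because the trace directory is mirrored along with everything else, a downstream mirror carries the trace files of every mirror upstream of it. Two mirrors syncing from each other would show up here as a pair whose timestamps keep updating while the primary’s never does.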

  8. Don’t ask for logs.

    It is understandable that some projects would like to monitor how popular their software is, and how frequently it is downloaded. I can sympathise with SunFreeWare, whose very funding depends on such information being available. But in most cases, sorting log information is onerous for mirror operators. Although we can all agree that it would be great to have per-project logs readily available, it represents a serious amount of complexity and effort on the server-side, and that will decrease your likelihood of getting mirrors.

    Try to remember that mirrors will be mirroring dozens, if not hundreds, if not thousands of projects, and it is not easy to split out separate logfiles for each one. When you have log files that are millions of lines long per-day, it becomes a real task to try and organise that information and collate it. In most cases, mirror operators just ignore these requirements and do not split out the logs, or send them anywhere. The request goes silently ignored, and they probably feel just a small twinge of guilt about it.

    In other cases, the information just isn’t available. To give a hideous example; for a short time the only FTP daemon we could find that would satisfy all of our requirements (IPv6 support, Largefile downloads, chrootable, passive FTP support) logged to … wait for it … wtmp, leaving “last” as our log interpreter. So don’t even ask.

  9. Do use relative links.

    For the love of all that is simple, if you are having HTML content mirrored, use relative links and relative links only. There are a surprising number of site mirror procedures which involve running Perl commands to mogrify absolute links in-place. This is a very tired, old, solved problem, people. If you can’t fit all of your content into a neat hierarchy, it probably needs to be re-organised anyway. Don’t make your lack of organisation a problem for your mirrors to re-solve.
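
    If your pages are generated, the relative form can be computed at build time on your side instead of being patched on every mirror. A small Python sketch, assuming site-absolute paths as input; the function name and the example paths are made up for illustration.

```python
import posixpath


def relative_link(from_page: str, to_path: str) -> str:
    """Return the link to `to_path` as it should appear inside `from_page`.

    Both arguments are site-absolute paths such as "/download/index.html".
    """
    return posixpath.relpath(to_path, start=posixpath.dirname(from_page))


# A page under /download/ linking to a page under /docs/:
print(relative_link("/download/index.html", "/docs/install.html"))  # ../docs/install.html
```

    The resulting link works unchanged whether the tree is served from the site root or from a sub-directory on a mirror.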

  10. Don’t get automatic updates wrong.

    Increasingly, projects are starting to distribute the load that automatic updates represent across the project’s mirrors. In my opinion this is a good thing; it represents a lot of trust in the mirroring system, but when correctly managed it is entirely workable.

    Doing this is not without its pitfalls though. The increased number of users polling a server frequently does add up, and this needs to be taken into account. This is practically what conditional HTTP requests were invented for, and they are generally the way to go about it. If the update tools are not making conditional requests, they need to be. Even if they are being “clever” and using a HEAD request and then a GET if they figure they need it, what they really need to be doing is making an “If-Modified-Since:” or “If-None-Match:” GET request, because it reduces the overall number of requests and it is handled better by the server in most cases (which will usually just stat() the content).
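
    As an illustration of what an update tool could do, here is a minimal Python sketch of a conditional GET using only the standard library. The function names and the example URL are hypothetical; the client remembers when it last fetched the file (and the ETag, if the server sent one) and replays them as validators. A 304 response means nothing changed and no body is transferred.

```python
import urllib.error
import urllib.request
from email.utils import formatdate
from typing import Optional


def build_conditional_request(url: str, last_fetch_epoch: Optional[float] = None,
                              etag: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request carrying whatever validators the client has saved."""
    request = urllib.request.Request(url)
    if last_fetch_epoch is not None:
        # HTTP dates are always expressed in GMT.
        request.add_header("If-Modified-Since", formatdate(last_fetch_epoch, usegmt=True))
    if etag is not None:
        request.add_header("If-None-Match", etag)
    return request


def fetch_if_changed(request: urllib.request.Request) -> Optional[bytes]:
    """Return the new body, or None if the server answered 304 Not Modified."""
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:  # urllib reports 304 as an HTTPError
            return None
        raise
```

    A single request either returns the new content or a short 304, so there is no separate HEAD round trip.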

    Another well-intentioned, but oft-repeated mistake is to try and distribute the load “intelligently” but fall down on the implementation. Everyone gets that hard-coding something like “check for updates every hour on the hour” is a bad idea, because you’ll flood the server every hour. But a lot of people seem to solve this problem by doing; “every hour on the hour wait a random number of seconds, between 0 and 3599, and then update”. This is wrong.

    Although this approach does spread the load stochastically, and will create an even spread of requests, it actually doubles the number of requests it takes to guarantee updates are polled at least at a certain frequency, which is usually what users want. The example I’ve just given actually guarantees that there will be an update at least every two hours, not every hour. There is a chance that in hour “A” the random number will be 0, and then in hour “A+1” it will be 3599 – those two updates just occurred almost two hours apart. So to guarantee an update every hour, you’d need to have a job scheduled every half hour with a random wait-time between 0 and 1800 seconds.

    This is nonsensical, and actually doubles the load on the server for the expected user-experience. The correct thing to do is to calculate a random wait-time once, at install time, at configuration time, whenever, and to always use that offset. So on system A, it’s “update every 17 minutes past the hour” and on system B it’s “update every 53 minutes past the hour”.

    The server gets the same, even spread of requests, but only half the number of requests to guarantee the same frequency. Makes a lot more sense to me.
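
    The arithmetic can be checked with a short Python sketch. It picks a random offset once and stores it (the file location is whatever the installer chooses; the function names are made up for illustration), and it models the poll times under each scheme so the worst-case gap can be compared.

```python
import random
from pathlib import Path
from typing import List, Sequence

PERIOD = 3600  # seconds between scheduled update checks


def load_or_create_offset(path: Path, period: int = PERIOD) -> int:
    """Pick a random offset once and reuse the stored value on every later call."""
    if path.exists():
        return int(path.read_text().strip())
    offset = random.randrange(period)
    path.write_text(f"{offset}\n")
    return offset


def poll_times(offsets: Sequence[int], period: int = PERIOD) -> List[int]:
    """Absolute poll times (in seconds) when period i uses offsets[i]."""
    return [i * period + offset for i, offset in enumerate(offsets)]


def max_gap(times: Sequence[int]) -> int:
    """Longest interval between consecutive polls."""
    return max(later - earlier for earlier, later in zip(times, times[1:]))


# Fresh random wait each hour, unluckiest draw: 0 then 3599.
print(max_gap(poll_times([0, 3599])))           # 7199
# Fixed offset of 17 minutes past the hour.
print(max_gap(poll_times([1020, 1020, 1020])))  # 3600
```

    In this model the per-hour random wait can leave 7199 seconds between polls, while the fixed offset never exceeds 3600, for the same number of requests.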

Wow, I’m only now starting to realise how long this blog post is, and I haven’t even covered other things like pointing out the usefulness of jigdo, keeping mirrors within your DNS zone (think, how mad and bad an idea writing CVSup in Modula-3 was, and our funny examples of how to get rid of broken HTTP clients. Hopefully it will prove of value to someone, and happy birthday us!

2 Replies to "The Dos and Don’ts of software mirroring"


Software mirroring - some guidelines  on October 5, 2005

[...] Colm MacCárthaigh’s blog has a very interesting entry this evening on running software mirrors. He’s involved with HEANET, so he knows what he’s talking about [...]


jmason  on October 5, 2005

Excellent article! I’m looking forward to the one about getting rid of broken HTTP clients btw. ;)
