Posted on November 26, 2009, under general.
The most important point to get is that for nearly all real-world use cases, the actual time at which a “scheduled” task runs doesn’t matter. Tasks that have to occur at a specific time on Tuesday are vanishingly rare. cron is one of the most abused tools going: rather than encode specific times, it would make more sense to let a scheduler decide when the tasks should be run, based on criteria such as overall network and system load.
It’s not even hard. There are plenty of scheduling algorithms and theory to borrow from, and many large organisations even have private implementations that let you be a bit more fuzzy about when tasks are run. But there’s another level to the periodicity problem that is worth thinking about.
Rather than simply using numbers and values that come readily to humans, it can be worth putting more effort and research into the values of the periods themselves. This isn’t meant in some fetishistic sense. Yes, for, say, virus updates, it’s possible to produce a gigantic linear algebra equation, with hundreds of parameters, that would balance the likelihood and cost of a security breach against the cost and frequency of checking for updates, and it would come out with some answer. But that’s a lot of work for little gain.
More interesting, and more tractable, are the effects that arise when multiple periodic tasks coincide. These are really common in distributed systems, and a real pain to debug and diagnose.
It could be as simple as the case we’ve been looking at: a cron job that runs once a day, but across many systems. Or it could be as complex as a full-blown peer-to-peer app that’s got a control loop with multiple peers, a supernode or two and a user-interface polling loop.
And a pattern that’s repeated over and over again is that people choose “convenient” values for the periods, and these choices are so common that when the periods end up in phase with each other we get constructive interference: elevated load events whenever the tasks coincide.
Take for example three loops: one with a period of 5 minutes, one with 10 minutes and one with 30 minutes. If the loops end up in phase, then every 30 minutes we have all three tasks running at once. It’s a mess, and it’s an easy one to prevent: use prime numbers for the values of the periods. Two loops with distinct prime periods p and q only coincide once every p × q minutes, so at least that way the number of coincidental events is minimized, and if any load events do show up with a periodicity, it is very straightforward to identify which single event, or combination of events, is responsible. I can easily imagine a cron replacement that runs exactly like this, but I never get around to writing it.
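To put some numbers on that, here is a rough sketch that counts how often loops coincide over a day, assuming they all start in phase at minute zero. The function names and the prime periods 5, 11 and 29 are just illustrative choices, not anything from a real scheduler.

```python
from functools import reduce
from math import gcd


def lcm_all(periods):
    """Least common multiple: how often ALL the loops fire together."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods)


def coincidences(periods, horizon):
    """Count the minutes in 1..horizon at which two or more loops fire."""
    count = 0
    for t in range(1, horizon + 1):
        firing = sum(1 for p in periods if t % p == 0)
        if firing >= 2:
            count += 1
    return count


DAY = 24 * 60  # minutes in a day

convenient = [5, 10, 30]  # the "human friendly" periods from the example
primes = [5, 11, 29]      # hypothetical prime replacements of similar size

for name, periods in (("convenient", convenient), ("primes", primes)):
    print(name, periods,
          "all coincide every", lcm_all(periods), "minutes;",
          coincidences(periods, DAY), "overlapping minutes per day")
```

With the convenient periods, all three loops line up every 30 minutes and some pair overlaps 144 times a day. With the prime periods, all three only line up every 1595 minutes (less than once a day), and pairs overlap 39 times a day.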
And these sorts of loops show up in places you might not necessarily think of. Caches are a good example. If you serve every piece of content on your website with the same Max-Age, then you can expect a thundering herd of requests whenever a browser or proxy expires them all at the same time. On the other hand, if you use prime number cache lifetimes for each resource, you’ll get a much more nicely staggered and spread out series of requests. It’s a really simple, neat optimisation. Tuning things doesn’t have to be hard.
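As a minimal sketch of the cache idea, the snippet below hands each resource its own prime lifetime, starting from a base value in seconds. The resource paths, the one-hour base and the helper names are all made up for illustration. The only real-world piece is the `max-age` directive of the HTTP `Cache-Control` header.

```python
def is_prime(n):
    """Trial division; plenty fast for cache lifetimes in seconds."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True


def prime_lifetimes(base_seconds, count):
    """Return `count` distinct primes, starting at the first prime >= base."""
    found = []
    n = base_seconds
    while len(found) < count:
        if is_prime(n):
            found.append(n)
        n += 1
    return found


# Hypothetical resources; each gets a different prime lifetime near one hour.
resources = ["/index.html", "/style.css", "/app.js"]
lifetimes = prime_lifetimes(3600, len(resources))
headers = {path: f"Cache-Control: max-age={age}"
           for path, age in zip(resources, lifetimes)}

for path, header in headers.items():
    print(path, "->", header)
```

The lifetimes stay within a few seconds of the intended hour, so nothing is cached for meaningfully longer or shorter than before. The expiries just stop falling on the same second each time around.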