Period Pain

Posted on September 14, 2009, under coding.

Imagine it was your job – along with 1,000 other people – to pick a number between 1 and 60. You can use any method you like (though you must all use the same one), but if 30 or more of you choose the same number, those of you who did will be shot. Would you let the group pick numbers at random?

Probably not: there’s always a chance it could go horribly wrong. And what is that chance? We could derive the exact probability of having to shoot people easily enough from the uniform distribution, but forget that – let’s do it with code. It’s a lot easier to understand, and it’s a good habit too.

import random

count = 0

# Simulate 10,000 runs of 1,000 people
# picking a number between 0 and 59.
for run in range(10000):
    numbers = [0] * 60

    for person in range(1000):
        r = random.randint(0, 59)
        numbers[r] += 1

    if sorted(numbers)[-1] >= 30:
        count += 1

print(count / 10000.0)

If you run this, hopefully you’ll get an answer of around 0.1. In other words, about 10% of the time we expect at least one number to be chosen by 30 or more people. Those don’t seem like great odds, and I wouldn’t play Russian roulette with lives like that.
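Out of curiosity, we can cross-check the simulation with a back-of-the-envelope calculation: the count for any single number is Binomial(1000, 1/60), and a union bound over the 60 numbers gives a slight overestimate of the chance that any number reaches 30. A minimal sketch:

```python
from math import comb

n, bins, threshold = 1000, 60, 30
p = 1 / bins

# Exact binomial tail: P(one particular number is picked
# by `threshold` or more of the n people).
tail = sum(comb(n, k) * p**k * (1 - p)**(n - k)
           for k in range(threshold, n + 1))

# Union bound over all 60 numbers. It slightly overestimates,
# because the per-number counts are negatively correlated, but
# for rare events it's close - and it lands in the same
# ballpark as the simulated 0.1.
print(bins * tail)
```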

Yet this is almost exactly what we do with a lot of automated tasks in some large-scale distributed systems. A pattern that can be observed over and over again is that someone writes a scheduled task that runs once an hour, day, month or whatever, but sleeps a random amount of time (typically between 0 and 3600 seconds) when it starts – a very naive attempt to distribute impact evenly across the larger system.

The march of time

The impacts can be pretty serious – it might be extended load on a dependent service, or it might simply be too many busy nodes in a cluster. Or it might be a two-day global outage of a popular VoIP service. Too many things happening at once is usually a bad thing in distributed systems.

Security and anti-virus updates, apt/yum/ports repositories and auditing tools in particular seem to get this pattern wrong. Here’s a good example, from Ubuntu’s daily apt script:

# sleep for a random intervall of time (default 30min)
# (some code taken from cron-apt, thanks)
random_sleep()
{
    RandomSleep=1800
    eval $(apt-config shell RandomSleep APT::Periodic::RandomSleep)
    if [ $RandomSleep -eq 0 ]; then
        return
    fi
    if [ -z "$RANDOM" ] ; then
        # A fix for shells that do not have this bash feature.
        RANDOM=$(dd if=/dev/urandom count=1 2> /dev/null | cksum | cut -c"1-5")
    fi
    TIME=$(($RANDOM % $RandomSleep))
    sleep $TIME
}

This is a very bad pattern – I’d go so far as to say that it’s actually worse than letting everything happen at the same time in the first place. It has two particularly dangerous qualities:

  1. It increases the effective period of the task

    If a task is running once an hour, it’s probably because we need to do something once an hour. That may seem tautological, but there’s a subtlety: the clock-face hours we define as humans are arbitrary, and by “once an hour” we should really mean “at least once in any 60-minute interval”, not “once between 1PM and 2PM”.

    If we pick a random sleep time, we might end up running just under two hours apart. If at 1PM we pick a sleep of 0 seconds, and then at 2PM we pick 3599 seconds – look, we just ran two real hours apart! Unsurprisingly the converse can also happen, and we’ll run just 1 second apart, doing heaven knows what to load along the way.

  2. The system as a whole has to cope with dangerous load spikes

    Using our earlier example, if we allocated the numbers evenly, each value would be chosen only 16 or 17 times. We could plan for an impact of, say, 20 running at once. But as we’ve seen, if we pick random numbers every time, then in 1 of every 10 runs we’re going to have to cope with an impact of 30. That’s 50% more load, because of a one-line error!

    If this task is running every hour, then every so often we’re going to have to deal with an impact of 40 or more. And it will appear totally out of the blue – it’s a random occurrence. Nice! To take it to the extreme, there is a small – but finite – chance of all 1,000 systems choosing the same value. Avoiding this is exactly why we’re spreading the load in the first place, so why leave it to chance?

    I used to run a busy Ubuntu mirror, and every day between 6:25 and 6:55 we’d see a gigantic wedge of load that could have been distributed a lot more evenly – though I think the worst of these problems have since been fixed.
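To make the first problem concrete, here’s a tiny sketch (the helper name is mine) of two consecutive “hourly” firings under the random-sleep pattern – the gap between them can be anywhere from 1 second to just under two hours:

```python
import random

HOUR = 3600

def firing_time(hour_index):
    # Scheduled start of the hour, plus an independent random
    # sleep of up to an hour -- the pattern criticised above.
    return hour_index * HOUR + random.randint(0, HOUR - 1)

# Gap between two consecutive runs: 3600 + (r2 - r1) seconds,
# i.e. anywhere from 1 to 7199 seconds.
gap = firing_time(1) - firing_time(0)
print(gap)
```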

The optimal fix for the problem is simple: coordinate. Use a central allocator, or a gossip protocol, to ensure that every slot has at most N/M consumers.
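As a minimal illustration of the central-allocator idea (the function name is mine, not from any particular system), simple round-robin assignment guarantees that no slot ever holds more than ceil(N/M) consumers:

```python
def assign_slots(node_ids, slots=60):
    """Assign each node a slot, round-robin, so that no slot
    holds more than ceil(N / slots) nodes."""
    return {node: i % slots for i, node in enumerate(sorted(node_ids))}

# 1,000 nodes over 60 slots: every slot gets 16 or 17 nodes.
allocation = assign_slots(f"node-{i:04d}" for i in range(1000))
```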

This isn’t always possible though – Open Source security updates are usually polled from relatively dumb mirror servers that don’t keep enough state to be able to divide load like this. But there are still two better approaches.

Firstly, you can pick a random number just once, and then reuse that same number every time. This guarantees that the task really does run once an hour, and it makes load predictable for the distributed system as a whole.
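A sketch of this first approach (the state-file path is made up for illustration): pick the offset once, store it, and read it back on every subsequent run:

```python
import random

def persistent_offset(path, max_sleep=3600):
    """Return the stored sleep offset, creating it with a
    random value on the very first run."""
    try:
        with open(path) as f:
            return int(f.read())
    except (OSError, ValueError):
        offset = random.randrange(max_sleep)
        with open(path, "w") as f:
            f.write(str(offset))
        return offset

# e.g. sleep(persistent_offset("/var/lib/mytask/offset"))
```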

Secondly, you can use any uniformly distributed unique piece of data, local to the node, to choose your number. I like to use an MD5 hash of the MAC address, but even the hostname would do.
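A sketch of the second approach (the helper name is mine): hash a node-local identifier and reduce it modulo the number of slots:

```python
import hashlib

def stable_slot(key, slots=60):
    """Map a node-local identifier (MAC address, hostname, ...)
    to a stable slot in [0, slots)."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % slots

# e.g. sleep stable_slot(hostname) minutes into each hour
```

Because the hash is deterministic, each node lands in the same slot run after run, and a fleet of distinct hostnames spreads roughly uniformly across the slots.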

Probability-wise, the two approaches are identical, but in the real world you get fewer collisions with the latter – probably because of the near-identical entropy state that identical systems are in when you mass-boot a fleet of several hundred. In both cases, where aberrations do emerge through bad luck, they at least only have to be identified and fixed once. We’re no longer playing the load lottery once an hour.

But this is only half the story. What about other automated tasks, with their own periods – how do we avoid colliding with them? And how do we architect timings in distributed systems in general to make them a lot easier to debug? That’s what the next blog post will be about. Stay tuned.

14 Replies to "Period Pain"

Cillian  on September 14, 2009

The apticron package now uses the first approach you mention – it picks a random time at installation to use in its daily crontab. When I first spotted “late” apticron emails, I initially suspected serious clock drift on the servers :-)

Peter Kankowski  on September 15, 2009

See http://en.wikipedia.org/wiki/Birthday_Paradox#Collision_counting for expected number of collided items. For n = 1000 and d = 60 it is 940 collisions.

Cliff  on September 15, 2009

Stupid question from a non programmer. How do you “convert” an MD5 hash of the MAC address into a number between 0 and 59? I have a client that checks in randomly every hour on about 1500 server and I’d like to change it to a fixed number based on your suggestion.

Thanks.

jcorn  on September 15, 2009

The md5 hash is just a number, but it’s printed in hexadecimal for clarity. So we use the number it represents, divide by the maximum possible value (32 Fs in a row), and multiply by the number of values we want it to fit in, 60.

md5 / (32 Fs in a row) * 60

jcorn  on September 15, 2009

You know, I changed my mind: multiply by 59 instead. I always get all mixed up by this procedure.

colmmacc  on September 15, 2009

@cliff, @jcorn – the easiest and fastest way is to use the mod operator.

$md5sum % 60

will turn it into an evenly distributed number in the 0:59 range.

gavinmc  on September 15, 2009

It might be useful if cron could have option #1 built into it, so you could do something like:

R 4 * * * root test -x /usr/sbin/cron-apt && /usr/sbin/cron-apt

and cron could make up a random valid number for the R and remember it either permanently for that line or until it was next restarted.

Foobot  on September 16, 2009

1. Taking the *daily* apt script as the example, and then extrapolating it to a nominally hourly event is a bit off. A 60 min wobble factor is perfectly reasonable for a daily event. Especially if it’s something like this which is kicked off across the globe at the same *local* time.
A 60 min wobble for an hourly event is bonkers, because 60 mins is a stupid value to use, not because randomising the wait period is necessarily bad.

2. If we’re talking about worst-cases then “Pick a random number once” still has the same worst-case as “random every time”; everybody might randomly pick the same number, only this way they’ll never pick again and the thundering herd will be a permanent feature of that time-of-day. Predictability is nice but I’m not sure that’s solving the right problem :)

Brian  on September 17, 2009

The cron.daily for spamassassin in ubuntu seems to have this feature, if you’re volunteering to keep a list…

brady  on September 20, 2009

Here is an example of something that uses the md5sum of the hostname using modern bashisms. It’s not the best code in the world but it works. =)

(the last line is a cheap abs() function as the mod function in bash returns negative numbers)

SYS_HASH=$(echo $HOSTNAME | md5sum | awk '{print $1}')
SLEEP_TIME=$(( (16#$SYS_HASH)%3600 ))
SLEEP_TIME=${SLEEP_TIME/-/}

/~colmmacc/ » Period Pain part 2  on September 27, 2009

[...] week I wrote about problems with periodicity but it was only half of the problem. But before moving on to the second half, it seems like a good [...]

/~colmmacc/ » Period Pain 3  on November 26, 2009

[...] promised, though it’s been a while coming, I wrote that there’d be a followup on scheduling [...]

warpedvisions.org :: Link: The problem of periodic processes  on January 3, 2010

[...] smart writeup of the problem of periodic processes, or why to consider gossipy protocols. window.onload = [...]

diamond  on February 3, 2010

Nice writeup, I must bear this in mind.

Btw, in the interests of simplicity, line 14 of your first code snippet could be simplified from:

if sorted(numbers)[-1] >= 30:

to:

if max(numbers) >= 30:

Steve
