## Weighty matters

Posted on May 15, 2011, under coding, general.

One of my favourite programming exercises is to write a text generator using Markov Chains. It usually takes a relatively small amount of code and is especially useful for learning new programming languages. A useful goal is to have a tool that ingests a large body of text, breaks it into N-tuples (where N is chosen by the operator), and then emits a random text of length L that uses those tuples at the same frequency as the original body did.
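That exercise can be sketched in just a few lines. Here is a minimal word-level version (the function names, the toy corpus and the choice of N=2 are mine, not a fixed spec); note that picking uniformly from the stored follower lists reproduces the original tuple frequencies, because duplicates are kept:

```python
import random

def build_chain(words, n=2):
    # Map each n-tuple of words to the list of words that follow it;
    # duplicates in the lists preserve the original frequencies.
    chain = {}
    for i in range(len(words) - n):
        key = tuple(words[i:i + n])
        chain.setdefault(key, []).append(words[i + n])
    return chain

def generate(chain, length, n=2):
    # Start at a random tuple and walk the chain until `length` words
    # are emitted or we hit a tuple with no recorded followers.
    out = list(random.choice(list(chain)))
    while len(out) < length:
        followers = chain.get(tuple(out[-n:]))
        if not followers:
            break
        out.append(random.choice(followers))
    return ' '.join(out)

text = "the cat sat on the mat and the cat ran off".split()
chain = build_chain(text, n=2)
print(generate(chain, 8))
```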

On its own, that task can produce some fun results, but it’s also a very repurpose-able technique. You can use this kind of statistical generation to simulate realistic network jitter (record some N-tuples of observed RTT with ping), or awesome simulated-user fuzz tests (record some N-tuples of observed user inputs). It’s surprising that it isn’t more common. But in my experience of working with newcomers to these problems, a common first tripping point is how to do weighted selection at all. Put most simply, if we have a table of elements;

| element | weight |
|---------|--------|
| A       | 2      |
| B       | 1      |
| C       | 1      |

how do we write a function that will choose A about half the time, and B and C about a quarter each? It also turns out that this is a really interesting design problem. We can choose to implement a random solution, a non-random solution, a solution that runs in constant time, a solution that runs in linear time and a solution that runs in logarithmic time. This post is about those potential solutions.
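As a baseline, here is a sketch of the most direct implementation: draw one random number in [0, total weight), then walk the cumulative weights until we pass it. This is only one point in the design space; the sections below explore the others:

```python
import random

def weighted_choice(elements):
    # One draw in [0, total weight), then a walk over cumulative weights.
    total = sum(e['weight'] for e in elements)
    r = random.randint(0, total - 1)
    cumulative = 0
    for element in elements:
        cumulative += element['weight']
        if r < cumulative:
            return element

elements = [ { 'name' : 'A', 'weight' : 2 },
             { 'name' : 'B', 'weight' : 1 },
             { 'name' : 'C', 'weight' : 1 } ]
```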

### Non-random solutions

When coming to the problem, the first thing to decide is whether we really want the selection to be random or not. One could imagine a function that tries to keep track of previous selections, for example;

```python
elements = [ { 'name' : 'A', 'weight' : 2 },
             { 'name' : 'B', 'weight' : 1 },
             { 'name' : 'C', 'weight' : 1 } ]

def select_element(elements, count):
    total_weight = sum(e['weight'] for e in elements)
    cumulative_weight = 0
    for element in elements:
        cumulative_weight += element['weight']
        if count < cumulative_weight:
            return ( element,
                     (count + 1) % total_weight )

# select some elements
c = 0
(i, c) = select_element(elements, c)
(j, c) = select_element(elements, c)
(k, c) = select_element(elements, c)
```

This function uses the “count” parameter to remember where it left off last time, and runs in time proportionate to the total number of elements. Used correctly, this function will return A the first two times it’s called, followed by B, followed by C, followed by A twice again, and so on. Hopefully that’s obvious.

There’s a very simple optimisation possible to make it run in constant time: just sacrifice memory proportionate to the total sum of weights;

```python
elements = [ { 'name' : 'A' },
             { 'name' : 'A' },
             { 'name' : 'B' },
             { 'name' : 'C' } ]

def select_element(elements, count):
    return ( elements[count],
             (count + 1) % len(elements) )
```

This basic approach is surprisingly common: it's how a lot of networking devices implement weighted path selection.

### Random solutions

But if we're feeding into some kind of simulation, or a fuzz test, then randomised selection is probably a better thing. Luckily, there are at least three ways to do it. The first is to use the same space-inefficient trick as our "optimised" non-random selection: flatten the list of elements into an array and just randomly jump into it;

```python
import random

elements = [ { 'name' : 'A' },
             { 'name' : 'A' },
             { 'name' : 'B' },
             { 'name' : 'C' } ]

def select_element(elements):
    return random.choice(elements)
```

As before, this runs in constant time, but it's easy to see how this could get really ugly if the weights are more like;

| element | weight |
|---------|--------|
| A       | 998    |
| B       | 1      |
| C       | 1      |

Unfortunately, common datasets can follow distributions that are just like this (Zipfian, for example). It would take an unreasonably large amount of space to store all of the words in the English language, proportionate to their frequencies, using this method.

### Reservoir sampling

But luckily we have a way to avoid all of this space, and it has some other uses too. It's called reservoir sampling, and it's pretty magical. Reservoir sampling is a form of statistical sampling that lets you choose a limited number of samples from an arbitrarily large stream of data, without knowing how big the stream is in advance. It doesn't seem related to weighted selection, but it is.

Imagine you have a stream of data, and it looks something like;

"Banana", "Car", "Bus", "Apple", "Orange", "Banana", ...

and you want to choose a sample of 3 events. All events in the stream should have equal probability of making it into the sample. A better real-world example is collecting a sample of 1,000 web server requests over the last minute, but you get the idea.

What a reservoir sample does is simple;

```python
import random

def reservoir_sample(events, size):
    events_observed = 0
    sample = [ None ] * size
    for event in events:
        if events_observed >= len(sample):
            r = random.randint(0, events_observed)
        else:
            r = events_observed
        if r < len(sample):
            sample[r] = event
        events_observed += 1
    return sample
```

and how this works is pretty cool. The first 3 events have a 100% likelihood of being sampled; r will be 0, 1 and 2 in their case. The interesting part is after that. The fourth element has a 3/4 probability of being selected (r is drawn from 0, 1, 2, 3, and any value less than 3 selects it). So that's pretty simple.

But consider the likelihood of an element already in the sample of "staying". For any given element, there are two ways of "staying": either the fourth element is not chosen (1/4 probability), or the fourth element is chosen (3/4 probability) but a different element is picked to be replaced (2/3). If you do the math;

```
1/4 + (3/4 * 2/3)  =>
1/4 + 6/12         =>
3/12 + 6/12        =>
9/12               =>
3/4
```

We see that any given element has a 3/4 likelihood of staying, which is exactly what we want. There have been four elements observed, and all of them have had a 3/4 probability of being in our sample. Let's extend this by an iteration, and see what happens at element 5.

When we get to this element, it has a 3/5 chance of being selected (r is a random number from the set 0, 1, 2, 3, 4; if it is one of 0, 1 or 2 then the element will be chosen). We've already established that the previous elements had a 3/4 probability of being in the sample. Those elements that are in the sample again have two ways of staying: either the fifth element isn't chosen (2/5 likelihood), or it is (3/5), but a different element is replaced (2/3). Again, let's do the math;

```
3/4 * (2/5 + (3/5 * 2/3))  =>
3/4 * (2/5 + 6/15)         =>
3/4 * (2/5 + 2/5)          =>
3/4 * 4/5                  =>
3/5
```

So, once again, all elements have a 3/5 likelihood of being in the sample. Every time another element is observed, the math gets a bit longer, but it stays the same - they always have equal probability of being in the final sample.
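If the induction feels slippery, it's easy to convince yourself empirically. Here's a self-contained, slightly condensed restatement of the sampler above, run many times over a 10-element stream:

```python
import random

def reservoir_sample(events, size):
    # The i-th event (0-indexed) replaces a random slot with
    # probability size / (i + 1); the first `size` events fill the slots.
    sample = [None] * size
    for i, event in enumerate(events):
        r = random.randint(0, i) if i >= size else i
        if r < size:
            sample[r] = event
    return sample

counts = [0] * 10
for _ in range(20000):
    for chosen in reservoir_sample(range(10), 3):
        counts[chosen] += 1
# Every element should land in a size-3 sample about 3/10 of the
# time, so each count should be near 6000.
```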

So what does this have to do with random selection? Well imagine our original weighted elements as a stream of events, and that our goal is to choose a sample of 1.

```python
import random

elements = [ { 'name' : 'A' },
             { 'name' : 'A' },
             { 'name' : 'B' },
             { 'name' : 'C' } ]

def select_element(elements):
    elements_observed = 0
    chosen_element = None
    for element in elements:
        r = random.randint(0, elements_observed)
        if r < 1:
            chosen_element = element
        elements_observed += 1
    return chosen_element
```

So far, this is actually worse than what we've had before. We're still using space in memory proportionate to the total sum of weights, and we're running in linear time. But now that we've structured things like this, we can use the weights to cheat;

```python
import random

elements = [ { 'name' : 'A', 'weight' : 2 },
             { 'name' : 'B', 'weight' : 1 },
             { 'name' : 'C', 'weight' : 1 } ]

def select_element(elements):
    total_weight = 0
    chosen_element = None
    for element in elements:
        total_weight += element['weight']
        r = random.randint(0, total_weight - 1)
        if r < element['weight']:
            chosen_element = element
    return chosen_element
```

It's the same kind of self-correcting math as before, except now we can take bigger jumps rather than having to take steps of just 1 every time.
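A quick empirical check of that claim (restating the function so the snippet stands alone): over many runs, A should win about half the time, and B and C about a quarter each:

```python
import random

def select_element(elements):
    # Weighted reservoir selection of a single element: each candidate
    # displaces the current choice with probability weight / total_so_far.
    total_weight = 0
    chosen_element = None
    for element in elements:
        total_weight += element['weight']
        if random.randint(0, total_weight - 1) < element['weight']:
            chosen_element = element
    return chosen_element

elements = [ { 'name' : 'A', 'weight' : 2 },
             { 'name' : 'B', 'weight' : 1 },
             { 'name' : 'C', 'weight' : 1 } ]
counts = {'A': 0, 'B': 0, 'C': 0}
for _ in range(40000):
    counts[select_element(elements)['name']] += 1
# Expect counts of roughly 20000 / 10000 / 10000.
```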

We now have a loop that runs in time proportionate to the total number of elements, and also uses memory proportionate to the total number of elements. That's pretty good, but there's another optimisation we can make.

### Using a tree

If there are a large number of elements, it can still be a pain to iterate over them all just to select one. One solution to this problem is to compile the weighted elements into a weighted binary tree, so that a selection needs only O(log n) operations.

Let's take a larger weighted set;

| element | weight |
|---------|--------|
| A       | 1      |
| B       | 2      |
| C       | 3      |
| D       | 1      |
| E       | 1      |
| F       | 1      |
| G       | 1      |
| H       | 1      |
| I       | 1      |
| J       | 1      |

which we can express in tree form, where each node has the cumulative weight of its children.

Tree-solutions like this lend themselves easily to recursion;

```python
import random

elements = [ { 'name' : 'A', 'weight' : 10 },
             { 'name' : 'B', 'weight' : 20 },
             { 'name' : 'C', 'weight' : 30 },
             { 'name' : 'D', 'weight' : 10 },
             { 'name' : 'E', 'weight' : 10 },
             { 'name' : 'F', 'weight' : 10 },
             { 'name' : 'G', 'weight' : 10 },
             { 'name' : 'H', 'weight' : 10 },
             { 'name' : 'I', 'weight' : 10 },
             { 'name' : 'J', 'weight' : 10 } ]

# Compile a set of weighted elements into a weighted tree
def compile_choices(elements):
    if len(elements) > 1:
        left  = elements[ : len(elements) // 2 ]
        right = elements[ len(elements) // 2 : ]
        return [ { 'child'  : compile_choices(left),
                   'weight' : sum( map( lambda x : x['weight'], left ) ) },
                 { 'child'  : compile_choices(right),
                   'weight' : sum( map( lambda x : x['weight'], right ) ) } ]
    else:
        return elements

# Choose an element from a weighted tree
def tree_weighted_choice(tree):
    if len(tree) > 1:
        total_weight = tree[0]['weight'] + tree[1]['weight']
        if random.randint(0, total_weight - 1) < tree[0]['weight']:
            return tree_weighted_choice( tree[0]['child'] )
        else:
            return tree_weighted_choice( tree[1]['child'] )
    else:
        return tree[0]['name']

tree = compile_choices(elements)
```

And there we have it! A pretty good trade-off, we get sub-linear selection time with a relatively small overhead in memory.
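To convince ourselves the tree really does respect the weights, here's a condensed, self-contained restatement of the two functions above, sampled many times over the A=2, B=1, C=1 table from earlier:

```python
import random

def compile_choices(elements):
    # Split the element list in half and record each half's total weight.
    if len(elements) > 1:
        mid = len(elements) // 2
        left, right = elements[:mid], elements[mid:]
        return [{'child': compile_choices(left),
                 'weight': sum(e['weight'] for e in left)},
                {'child': compile_choices(right),
                 'weight': sum(e['weight'] for e in right)}]
    return elements

def tree_weighted_choice(tree):
    # Descend left or right with probability proportional to subtree weight.
    if len(tree) > 1:
        total = tree[0]['weight'] + tree[1]['weight']
        if random.randint(0, total - 1) < tree[0]['weight']:
            return tree_weighted_choice(tree[0]['child'])
        return tree_weighted_choice(tree[1]['child'])
    return tree[0]['name']

elements = [{'name': 'A', 'weight': 2},
            {'name': 'B', 'weight': 1},
            {'name': 'C', 'weight': 1}]
tree = compile_choices(elements)
counts = {'A': 0, 'B': 0, 'C': 0}
for _ in range(40000):
    counts[tree_weighted_choice(tree)] += 1
# A should land near 20000, B and C near 10000 each.
```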

### A Huffman(ish) tree

I should have included this in the first version of this post, and Fergal was quick to point it out in the comments, but there is a further optimisation we can make. Instead of using a tree that is balanced purely by the number of nodes, as above, we can use a tree that is balanced by the sum of weights.

To re-use his example, imagine you have elements weighted [ 10000, 1, 1, 1 ... ] (with a hundred ones). It doesn't make sense to bury the very weighty element deep in the tree; instead it should go near the root, so that we minimise the expected number of lookups.

```python
def huffman_choices(elements):
    if len(elements) > 1:
        total_weight = sum( map( lambda x : x['weight'], elements ) )
        sorted_elements = sorted( elements,
                                  key = lambda x : x['weight'],
                                  reverse = True )
        left = []
        right = []
        observed_weight = 0
        for element in sorted_elements:
            if observed_weight < total_weight / 2:
                left.append(element)
            else:
                right.append(element)
            observed_weight += element['weight']
        return [ { 'child'  : huffman_choices(left),
                   'weight' : sum( map( lambda x : x['weight'], left ) ) },
                 { 'child'  : huffman_choices(right),
                   'weight' : sum( map( lambda x : x['weight'], right ) ) } ]
    else:
        return elements
```

None of these techniques are novel, in fact they're all quite common and very old, yet for some reason they don't have broad awareness. Reservoir sampling - which on its own is incredibly useful - is only afforded a few sentences in Knuth.

But dealing with randomness and sampling is one of the inevitable complexities of programming. If you've never done it, it's worth taking the above and trying to write your own weighted Markov chain generator. And then have even more fun thinking about how to test it.

## Period Pain

Posted on September 14, 2009, under coding.

Imagine it was your job – along with 1,000 other people – to pick a number between 1 and 60. You can use any method you like (though you must all use the same one), but if 30 or more of you choose the same number, those of you who did would be shot. Would you let the group pick numbers at random?

Probably not, there’s always a chance it could go horribly wrong. And that chance? We could derive the correct p-value for having to shoot people easily enough from the uniform distribution, but forget that, let’s do it with code. It’s a lot easier to understand – and it’s a good habit too.

```python
import random

count = 0

# Simulate 10,000 runs of 1,000 people
# picking a number between 0 and 59.
for runs in range(10000):
    numbers = [ 0 ] * 60
    for people in range(1000):
        r = random.randint(0, 59)
        numbers[ r ] += 1
    if max(numbers) >= 30:
        count += 1

print(count / 10000.0)
```

If you run this, hopefully you’ll get an answer that’s around “0.1”. In other words, about 10% of the time we expect to have at least one number being chosen by 30 or more people. Those don’t seem like great odds, and I wouldn’t play Russian roulette with lives like that.
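For a rough analytic cross-check (a sketch that treats the 60 numbers as independent binomials, which slightly overstates the true figure because the counts must sum to 1,000):

```python
from math import comb

n, p = 1000, 1 / 60
# Chance that one particular number is picked 30 or more times...
tail = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(30, n + 1))
# ...and that at least one of the 60 numbers is.
approx = 1 - (1 - tail)**60
print(approx)  # roughly 0.1
```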

Yet this is almost exactly what we do with a lot of automated tasks in some large-scale distributed systems. A pattern that can be observed over and over again is that someone writes a scheduled task that runs once an hour, day, month or whatever, but then sleeps a random amount of time (typically between 0 and 3600 seconds) when it starts, in a very naive attempt to distribute impact across the larger system evenly. The impacts can be pretty serious – it might be extended load on a dependent service, or it might simply be too many busy nodes in a cluster. Or it might be a two-day global outage of a popular VoIP service. Too many things happening at once is usually a bad thing in distributed systems.

Security and anti-virus updates, apt/yum/ports repositories and auditing tools in particular seem to get this pattern wrong. Here’s a good example, from Ubuntu’s daily apt script:

```shell
# sleep for a random intervall of time (default 30min)
# (some code taken from cron-apt, thanks)
random_sleep()
{
    RandomSleep=1800
    eval $(apt-config shell RandomSleep APT::Periodic::RandomSleep)
    if [ $RandomSleep -eq 0 ]; then
        return
    fi
    if [ -z "$RANDOM" ] ; then
        # A fix for shells that do not have this bash feature.
        RANDOM=$(dd if=/dev/urandom count=1 2> /dev/null | cksum | cut -c"1-5")
    fi
    TIME=$(($RANDOM % $RandomSleep))
    sleep $TIME
}
```

This is a very bad pattern – I’d go so far as to say that it’s actually worse than letting everything happen at the same time in the first place. It has two particularly dangerous qualities:

### 1. It increases the effective period of the task

If a task is running once an hour, it’s probably because we need to do something once an hour. That may seem tautological, but there’s subtlety. The clock-time hours we define as humans are arbitrary; by “once an hour” we should really mean “at least once in any 60-minute interval”, not “once between 1PM and 2PM”.

If we pick a random sleep time, we might end up running just under two hours apart. If at 1PM we pick 0 sleep seconds, and then at 2PM we pick 3599 sleep seconds – look, we just ran two real hours apart! Unsurprisingly the converse can happen too, and we’ll run just 1 second apart, doing heavens knows what to load along the way.

### 2. The system as a whole has to cope with dangerous load spikes

Using our earlier example, if we allocated the numbers evenly, each value would get chosen only 16 or 17 times. We could plan for an impact of, say, 20 running at once. But as we’ve seen, if we pick random numbers every time, then in 1 of every 10 runs we’re going to have to cope with an impact of 30. That’s 50% more load, because of a one-line error!

If this task is running every hour, then every couple of months or so we’re going to have to deal with an impact approaching 40. And it will appear totally out of the blue; it’s a random occurrence. Nice! To take it to the extreme, there is a small – but finite – chance of all 1,000 systems choosing the same value. Avoiding this is why we are spreading the load in the first place, so why leave it to chance?

I used to run a busy Ubuntu mirror, and every day between 6:25 and 6:55 we’d see a gigantic wedge of load that could be distributed a lot more evenly. Though I think the worst of these problems have now been fixed.

The optimal fix for the problem is simple: coordinate. Use a central allocator, or a gossip protocol, to ensure that every slot has at most N/M consumers (for N consumers spread across M slots).

This isn’t always possible though – Open Source security updates are usually polled from relatively dumb mirror servers, that don’t keep enough state to be able to divide load like this. But there are still two better approaches.

Firstly, you can pick a random number just once, and then reuse the same random number every time. This guarantees that the task does actually run once an hour, and it makes load predictable for the distributed system as a whole.
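A sketch of that first approach; the state-file path and function name here are illustrative, not from any particular tool:

```python
import os
import random

def stable_random_offset(path, max_seconds=3600):
    # Pick a random offset once, persist it, and reuse it on every run,
    # so the task keeps a genuinely fixed period.
    if os.path.exists(path):
        with open(path) as f:
            return int(f.read())
    offset = random.randint(0, max_seconds - 1)
    with open(path, 'w') as f:
        f.write(str(offset))
    return offset
```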

Secondly, you can use any uniformly distributed unique piece of data, local to the node, to choose your number. I like to use an MD5 hash of the MAC address, but even the hostname would do.
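And a sketch of the second; note that uuid.getnode() can fall back to a random value on hosts where no MAC address is readable, so treat this as illustrative:

```python
import hashlib
import uuid

def host_offset(max_seconds=3600):
    # Hash the MAC address down to a stable per-host slot
    # in [0, max_seconds).
    mac = uuid.getnode().to_bytes(6, 'big')
    return int.from_bytes(hashlib.md5(mac).digest(), 'big') % max_seconds
```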

Probability-wise, the two approaches are identical, but in the real world you get fewer collisions with the latter – probably due to the similar state of entropy identical systems will be in when you mass boot a fleet of several hundred. In both cases, where aberrations emerge due to bad luck, they at least only have to be identified and fixed once. We’re no longer playing the load lottery once an hour.

But this is only half the story … what about other automated tasks with their own periods … how do we avoid colliding with them? And how do we architect timings in distributed systems in general to make them a lot easier to debug? That’s what the next blog post will be about. Stay tuned.