<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>/~colmmacc/ &#187; coding</title>
	<atom:link href="http://www.stdlib.net/~colmmacc/category/coding/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.stdlib.net/~colmmacc</link>
	<description>An Irishman's Fiery</description>
	<lastBuildDate>Sun, 15 May 2011 17:01:26 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.2</generator>
		<item>
		<title>Weighty matters</title>
		<link>http://www.stdlib.net/~colmmacc/2011/05/15/weighty-matters/</link>
		<comments>http://www.stdlib.net/~colmmacc/2011/05/15/weighty-matters/#comments</comments>
		<pubDate>Sun, 15 May 2011 03:49:57 +0000</pubDate>
		<dc:creator>colmmacc</dc:creator>
				<category><![CDATA[coding]]></category>
		<category><![CDATA[general]]></category>

		<guid isPermaLink="false">http://www.stdlib.net/~colmmacc/?p=752</guid>
		<description><![CDATA[One of my favourite programming exercises is to write a text generator using Markov Chains. It usually takes a relatively small amount of code and is especially useful for learning new programming languages. A useful goal is to be able to have a tool that ingests a large body of text, breaks it into N-tuples (where [...]]]></description>
			<content:encoded><![CDATA[<p>One of my favourite programming exercises is to write a text generator using Markov Chains. It usually takes a relatively small amount of code and is especially useful for learning new programming languages. A useful goal is to be able to have a tool that ingests a large body of text, breaks it into N-tuples (where N is chosen by the operator), and then emits a random text of length L that uses those tuples at the same frequency as the original body did. </p>
<p>On its own, that task can produce some fun results, but it&#8217;s also a very repurpose-able technique. You can use this kind of statistical generation to simulate realistic network jitter (record some N-tuples of observed RTT with ping), or awesome simulated-user fuzz tests (record some N-tuples of observed user inputs). It&#8217;s surprising that it isn&#8217;t more common. </p>
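<p>As a rough illustration of the generator described above, here is a minimal sketch. The names build_chain and generate are made up for this example, it works on whole words, and it assumes the input has more than N words.</p>
<pre>
```python
import random
from collections import defaultdict

def build_chain(words, n):
    # Map each n-word tuple to the list of words seen right after it.
    # Repeats are kept, so frequent followers are proportionally likelier.
    chain = defaultdict(list)
    for i in range(len(words) - n):
        chain[tuple(words[i:i + n])].append(words[i + n])
    return chain

def generate(chain, length, rng=random):
    # Start from a random tuple, then repeatedly pick a follower.
    state = rng.choice(sorted(chain))
    output = list(state)
    while length > len(output):
        followers = chain.get(state)
        if not followers:
            break  # dead end: this tuple only appeared at the end of the text
        output.append(rng.choice(followers))
        state = tuple(output[-len(state):])
    return output[:length]

words = "the cat sat on the mat and the cat ran".split()
chain = build_chain(words, 2)
print(' '.join(generate(chain, 8)))
```
</pre>
<p>Picking uniformly from a list that keeps duplicates is itself a small weighted selection, which is the problem the rest of this post is about.</p>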
<div align="center"><img src="http://www.stdlib.net/~colmmacc/odd-one-out.jpg" alt="" /></div>
<p>But when approaching these problems, from experience of working with newcomers, what seems to be a common first tripping point is how to do weighted selection at all. Put most simply, if we have a table of elements;</p>
<table>
<tr>
<td><strong>element</strong></td>
<td><strong>weight</strong></td>
</tr>
<tr>
<td>A</td>
<td>2</td>
</tr>
<tr>
<td>B</td>
<td>1</td>
</tr>
<tr>
<td>C</td>
<td>1</td>
</tr>
</table>
<p>how do we write a function that will choose A about half the time, and B and C about a quarter each?  It also turns out that this is a really interesting design problem. We can choose to implement a random solution, a non-random solution, a solution that runs in constant time, a solution that runs in linear time and a solution that runs in logarithmic time. This post is about those potential solutions. </p>
<h3>Non-random solutions</h3>
<p>When coming to the problem, the first thing to decide is whether we really want the selection to be random or not. One could imagine a function that tries to keep track of previous selections, for example;</p>
<div class="codecolorer-container python default" style="overflow:auto;white-space:nowrap;border: 1px solid #9F9F9F;width:435px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">elements = <span style="color: black;">&#91;</span> <span style="color: black;">&#123;</span> <span style="color: #483d8b;">'name'</span> : <span style="color: #483d8b;">'A'</span> , <span style="color: #483d8b;">'weight'</span> : <span style="color: #ff4500;">2</span> <span style="color: black;">&#125;</span>, <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: black;">&#123;</span> <span style="color: #483d8b;">'name'</span> : <span style="color: #483d8b;">'B'</span> , <span style="color: #483d8b;">'weight'</span> : <span style="color: #ff4500;">1</span> <span style="color: black;">&#125;</span>, <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: black;">&#123;</span> <span style="color: #483d8b;">'name'</span> : <span style="color: #483d8b;">'C'</span> , <span style="color: #483d8b;">'weight'</span> : 1 <span style="color: black;">&#125;</span> <span style="color: black;">&#93;</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> select_element<span style="color: black;">&#40;</span>elements, count<span style="color: black;">&#41;</span>:<br />
&nbsp; total_weight = 0<br />
&nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> element <span style="color: #ff7700;font-weight:bold;">in</span> elements:<br />
&nbsp; &nbsp; total_weight += element<span style="color: black;">&#91;</span><span style="color: #483d8b;">'weight'</span><span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> count <span style="color: #66cc66;">&lt;</span> total_weight:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #ff7700;font-weight:bold;">return</span> &nbsp;<span style="color: black;">&#40;</span> element , <br />
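&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: black;">&#40;</span>count + 1<span style="color: black;">&#41;</span> <span style="color: #66cc66;">%</span> <span style="color: #008000;">sum</span><span style="color: black;">&#40;</span>e<span style="color: black;">&#91;</span><span style="color: #483d8b;">'weight'</span><span style="color: black;">&#93;</span> <span style="color: #ff7700;font-weight:bold;">for</span> e <span style="color: #ff7700;font-weight:bold;">in</span> elements<span style="color: black;">&#41;</span> <span style="color: black;">&#41;</span><br />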
<br />
<span style="color: #808080; font-style: italic;"># select some elements</span><br />
c = <span style="color: #ff4500;">0</span><br />
<span style="color: black;">&#40;</span>i , c<span style="color: black;">&#41;</span> = select_element<span style="color: black;">&#40;</span>elements, c<span style="color: black;">&#41;</span><br />
<span style="color: black;">&#40;</span>j , c<span style="color: black;">&#41;</span> = select_element<span style="color: black;">&#40;</span>elements, c<span style="color: black;">&#41;</span><br />
<span style="color: black;">&#40;</span>k , c<span style="color: black;">&#41;</span> = select_element<span style="color: black;">&#40;</span>elements, c<span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
<p>This function uses the &#8220;count&#8221; parameter to remember where it left off last time, and runs in time proportionate to the total number of elements.  Used correctly, the first two times it&#8217;s called this function will return A, followed by B, followed by C, followed by A twice again and so on. Hopefully that&#8217;s obvious. </p>
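<p>To make that sequence concrete, here is a small self-contained sketch. It assumes the counter wraps around at the sum of the weights (4 here), and the name weighted_round_robin is only for illustration.</p>
<pre>
```python
def weighted_round_robin(elements, count):
    # Walk the cumulative weights until we pass the counter.
    total_weight = sum(e['weight'] for e in elements)
    cumulative = 0
    for element in elements:
        cumulative += element['weight']
        if cumulative > count:
            # Wrap the counter at the total weight, not the element count.
            return element, (count + 1) % total_weight

elements = [{'name': 'A', 'weight': 2},
            {'name': 'B', 'weight': 1},
            {'name': 'C', 'weight': 1}]

count = 0
sequence = []
for _ in range(8):
    element, count = weighted_round_robin(elements, count)
    sequence.append(element['name'])
print(''.join(sequence))  # AABCAABC
```
</pre>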
<p>There&#8217;s a very simple optimisation possible to make it run in constant time, just sacrifice memory proportionate to the total sum of weights;</p>
<div class="codecolorer-container python default" style="overflow:auto;white-space:nowrap;border: 1px solid #9F9F9F;width:435px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">elements = <span style="color: black;">&#91;</span> <span style="color: black;">&#123;</span> <span style="color: #483d8b;">'name'</span> : <span style="color: #483d8b;">'A'</span> <span style="color: black;">&#125;</span>, <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: black;">&#123;</span> <span style="color: #483d8b;">'name'</span> : <span style="color: #483d8b;">'A'</span> <span style="color: black;">&#125;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: black;">&#123;</span> <span style="color: #483d8b;">'name'</span> : <span style="color: #483d8b;">'B'</span> <span style="color: black;">&#125;</span>, <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: black;">&#123;</span> <span style="color: #483d8b;">'name'</span> : <span style="color: #483d8b;">'C'</span> <span style="color: black;">&#125;</span> <span style="color: black;">&#93;</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> select_element<span style="color: black;">&#40;</span>elements, count<span style="color: black;">&#41;</span>:<br />
&nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> &nbsp;<span style="color: black;">&#40;</span> element<span style="color: black;">&#91;</span>count<span style="color: black;">&#93;</span> , <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: black;">&#40;</span>count + 1<span style="color: black;">&#41;</span> <span style="color: #66cc66;">%</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>elements<span style="color: black;">&#41;</span> <span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
<p>This basic approach is surprisingly common; it's how a lot of networking devices implement weighted path selection. </p>
<h3>Random solutions</h3>
<p>But if we're feeding into some kind of simulation, or a fuzz-test, then randomised selection is probably a better thing. Luckily, there are at least 3 ways to do it. The first approach is to use the same space-inefficient approach our "optimised" non-random selection did. Flatten the list of elements into an array and just randomly jump into it;</p>
<div class="codecolorer-container python default" style="overflow:auto;white-space:nowrap;border: 1px solid #9F9F9F;width:435px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">random</span><br />
<br />
elements = <span style="color: black;">&#91;</span> <span style="color: black;">&#123;</span> <span style="color: #483d8b;">'name'</span> : <span style="color: #483d8b;">'A'</span> <span style="color: black;">&#125;</span>, <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: black;">&#123;</span> <span style="color: #483d8b;">'name'</span> : <span style="color: #483d8b;">'A'</span> <span style="color: black;">&#125;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: black;">&#123;</span> <span style="color: #483d8b;">'name'</span> : <span style="color: #483d8b;">'B'</span> <span style="color: black;">&#125;</span>, <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: black;">&#123;</span> <span style="color: #483d8b;">'name'</span> : <span style="color: #483d8b;">'C'</span> <span style="color: black;">&#125;</span> <span style="color: black;">&#93;</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> select_element<span style="color: black;">&#40;</span>elements<span style="color: black;">&#41;</span>:<br />
&nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #dc143c;">random</span>.<span style="color: black;">choice</span><span style="color: black;">&#40;</span>elements<span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
<p>As before, this runs in constant time, but it's easy to see how this could get really ugly if there are some weights that are more like;</p>
<table>
<tr>
<td><strong>element</strong></td>
<td><strong>weight</strong></td>
</tr>
<tr>
<td>A</td>
<td>998</td>
</tr>
<tr>
<td>B</td>
<td>1</td>
</tr>
<tr>
<td>C</td>
<td>1</td>
</tr>
</table>
<p>Unfortunately, common datasets can follow distributions that are just like this (Zipfian, for example). It would take an unreasonably large amount of space to store all of the words in the English language, proportionate to their frequencies, using this method. </p>
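<p>A quick sketch of that space cost, assuming integer weights. The flatten helper is hypothetical and only expands a weighted table into the flat array used above.</p>
<pre>
```python
import random

def flatten(elements):
    # Repeat each element once per unit of weight (integer weights assumed).
    flat = []
    for element in elements:
        name = element['name']
        flat.extend([name] * element['weight'])
    return flat

skewed = [{'name': 'A', 'weight': 998},
          {'name': 'B', 'weight': 1},
          {'name': 'C', 'weight': 1}]

flat = flatten(skewed)
print(len(flat))            # 1000 slots to describe just 3 elements
print(random.choice(flat))  # constant-time pick, 'A' about 99.8% of the time
```
</pre>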
<h3>Reservoir sampling</h3>
<p>But luckily we have a way to avoid using all of this space, and it has some other useful applications too. It's called reservoir sampling, and it's pretty magical. Reservoir sampling is a form of statistical sampling that lets you choose a limited number of samples from an arbitrarily large stream of data, without knowing how big it is in advance. It doesn't seem related, but it is.</p>
<p>Imagine you have a stream of data, and it looks something like;</p>
<p>  "Banana" , "Car" , "Bus", "Apple", "Orange" , "Banana" , ....</p>
<p>and you want to choose a sample of 3 events. All events in the stream should have equal probability of making it into the sample. A better real-world example is collecting a sample of 1,000 web server requests over the last minute, but you get the idea. </p>
<p>What a reservoir sample does is simple;</p>
<div class="codecolorer-container python default" style="overflow:auto;white-space:nowrap;border: 1px solid #9F9F9F;width:435px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">random</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> reservoir_sample<span style="color: black;">&#40;</span>events, size<span style="color: black;">&#41;</span>:<br />
&nbsp; events_observed = 0<br />
&nbsp; sample = <span style="color: black;">&#91;</span> <span style="color: #008000;">None</span> <span style="color: black;">&#93;</span> <span style="color: #66cc66;">*</span> size<br />
<br />
&nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> event <span style="color: #ff7700;font-weight:bold;">in</span> events:<br />
&nbsp; &nbsp;<span style="color: #ff7700;font-weight:bold;">if</span> events_observed <span style="color: #66cc66;">&gt;</span>= <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>sample<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp;r = <span style="color: #dc143c;">random</span>.<span style="color: black;">randint</span><span style="color: black;">&#40;</span>0, events_observed<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp;<span style="color: #ff7700;font-weight:bold;">else</span>:<br />
&nbsp; &nbsp; &nbsp;r = events_observed<br />
<br />
&nbsp; &nbsp;<span style="color: #ff7700;font-weight:bold;">if</span> r <span style="color: #66cc66;">&lt;</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>sample<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp;sample<span style="color: black;">&#91;</span>r<span style="color: black;">&#93;</span> = event &nbsp; &nbsp; <br />
&nbsp; &nbsp;events_observed += 1<br />
<br />
&nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> sample</div></td></tr></tbody></table></div>
<p>and how this works is pretty cool. The first 3 events have a 100% likelihood of being sampled, since r will be 0, 1 and 2 in their case. The interesting part is after that. For the fourth element, r is drawn from 0, 1, 2, 3 and the element is kept only if r is 0, 1 or 2, so it has a 3/4 probability of being selected. So that's pretty simple.</p>
<p>But consider the likelihood that an element already in the sample "stays". For any given element, there are two ways of "staying". One is that the fourth element is not chosen (1/4 probability), another is that the fourth element is chosen (3/4 probability) but that a different element is selected (2/3) to be replaced. If you do the math; </p>
<pre>
1/4 + (3/4 * 2/3)   =>
1/4 + 6/12          =>
3/12 + 6/12         =>
9/12                =>
3/4
</pre>
<p>We see that any given element has a 3/4 likelihood of staying. Which is exactly what we want. There have been four elements observed, and all of them have had a 3/4 probability of being in our sample. Let's extend this an iteration, and see what happens at element 5.</p>
<p>When we get to this element, it has a 3/5 chance of being selected (r is a random number from the set 0,1,2,3,4 - if it is one of 0, 1 or 2 then the element will be chosen). We've already established that the previous elements had a 3/4 probability of being in the sample. Those elements that are in the sample again have two ways of staying, either the fifth element isn't chosen (2/5 likelihood) or it is, but a different element is replaced (3/5 * 2/3) . Again, let's do the math;</p>
<pre>
3/4 * (2/5 + (3/5 * 2/3))  =>
3/4 * (2/5 + 6/15)         =>
3/4 * (2/5 + 2/5)          =>
3/4 * 4/5                  =>
3/5
</pre>
<p>So, once again, all elements have a 3/5 likelihood of being in the sample. Every time another element is observed, the math gets a bit longer, but it stays the same - they always have equal probability of being in the final sample. </p>
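<p>The same arithmetic can be checked for any number of observed events with exact fractions. This is only a sketch of the argument above, not part of the sampler itself, and inclusion_probability is a name chosen for the example.</p>
<pre>
```python
from fractions import Fraction

def inclusion_probability(size, observed):
    # Probability that any one of the first `size` events is still in the
    # sample after `observed` events, following the argument above.
    p = Fraction(1)
    for n in range(size + 1, observed + 1):
        chosen = Fraction(size, n)  # the n-th event is selected
        # Stay if the n-th event is skipped, or if it evicts another slot.
        survives = (1 - chosen) + chosen * Fraction(size - 1, size)
        p *= survives
    return p

print(inclusion_probability(3, 4))    # 3/4
print(inclusion_probability(3, 5))    # 3/5
print(inclusion_probability(3, 100))  # 3/100
```
</pre>
<p>Each step multiplies by (n - 1) / n, so the product telescopes to size / observed.</p>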
<p>So what does this have to do with random selection? Well imagine our original weighted elements as a stream of events, and that our goal is to choose a sample of 1. </p>
<div class="codecolorer-container python default" style="overflow:auto;white-space:nowrap;border: 1px solid #9F9F9F;width:435px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">random</span><br />
<br />
elements = <span style="color: black;">&#91;</span> <span style="color: black;">&#123;</span> <span style="color: #483d8b;">'name'</span> : <span style="color: #483d8b;">'A'</span> <span style="color: black;">&#125;</span>, <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: black;">&#123;</span> <span style="color: #483d8b;">'name'</span> : <span style="color: #483d8b;">'A'</span> <span style="color: black;">&#125;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: black;">&#123;</span> <span style="color: #483d8b;">'name'</span> : <span style="color: #483d8b;">'B'</span> <span style="color: black;">&#125;</span>, <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: black;">&#123;</span> <span style="color: #483d8b;">'name'</span> : <span style="color: #483d8b;">'C'</span> <span style="color: black;">&#125;</span> <span style="color: black;">&#93;</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> select_element<span style="color: black;">&#40;</span>elements<span style="color: black;">&#41;</span>:<br />
&nbsp; elements_observed = 0<br />
&nbsp; chosen_element = <span style="color: #008000;">None</span><br />
&nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> element <span style="color: #ff7700;font-weight:bold;">in</span> elements:<br />
&nbsp; &nbsp; r = <span style="color: #dc143c;">random</span>.<span style="color: black;">randint</span><span style="color: black;">&#40;</span>0, elements_observed<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> r <span style="color: #66cc66;">&lt;</span> 1:<br />
&nbsp; &nbsp; &nbsp; &nbsp; chosen_element = element<br />
&nbsp; &nbsp; elements_observed += 1<br />
&nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> chosen_element</div></td></tr></tbody></table></div>
<p>So far, this is actually worse than what we've had before. We're still using space in memory proportionate to the total sum of weights, and we're running in linear time. But now that we've structured things like this, we can use the weights to cheat;</p>
<div class="codecolorer-container python default" style="overflow:auto;white-space:nowrap;border: 1px solid #9F9F9F;width:435px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">random</span><br />
<br />
elements = <span style="color: black;">&#91;</span> <span style="color: black;">&#123;</span> <span style="color: #483d8b;">'name'</span> : <span style="color: #483d8b;">'A'</span> , <span style="color: #483d8b;">'weight'</span> : <span style="color: #ff4500;">2</span> <span style="color: black;">&#125;</span>, <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: black;">&#123;</span> <span style="color: #483d8b;">'name'</span> : <span style="color: #483d8b;">'B'</span> , <span style="color: #483d8b;">'weight'</span> : <span style="color: #ff4500;">1</span> <span style="color: black;">&#125;</span>, <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: black;">&#123;</span> <span style="color: #483d8b;">'name'</span> : <span style="color: #483d8b;">'C'</span> , <span style="color: #483d8b;">'weight'</span> : 1 <span style="color: black;">&#125;</span> <span style="color: black;">&#93;</span><br />
<br />
<span style="color: #ff7700;font-weight:bold;">def</span> select_element<span style="color: black;">&#40;</span>elements<span style="color: black;">&#41;</span>:<br />
&nbsp; total_weight = 0<br />
&nbsp; chosen_element = <span style="color: #008000;">None</span><br />
&nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> element <span style="color: #ff7700;font-weight:bold;">in</span> elements:<br />
&nbsp; &nbsp; total_weight += element<span style="color: black;">&#91;</span><span style="color: #483d8b;">'weight'</span><span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; r = <span style="color: #dc143c;">random</span>.<span style="color: black;">randint</span><span style="color: black;">&#40;</span>0, total_weight - 1<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> r <span style="color: #66cc66;">&lt;</span> element<span style="color: black;">&#91;</span><span style="color: #483d8b;">'weight'</span><span style="color: black;">&#93;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;chosen_element = element<br />
<br />
&nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> chosen_element</div></td></tr></tbody></table></div>
<p>It's the same kind of self-correcting math as before, except now we can take bigger jumps rather than having to take steps of just 1 every time.</p>
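<p>As a sanity check on that claim, the final probabilities can be worked out exactly. The sketch below assumes the selection rule just described (replace the current choice with probability weight / running total) and uses a made-up helper name.</p>
<pre>
```python
from fractions import Fraction

def final_choice_probabilities(weights):
    # For each position: the chance it is chosen when visited and then
    # never overwritten by any later element.
    running_totals = []
    total = 0
    for w in weights:
        total += w
        running_totals.append(total)
    probabilities = []
    for i, w in enumerate(weights):
        p = Fraction(w, running_totals[i])
        for j in range(i + 1, len(weights)):
            p *= 1 - Fraction(weights[j], running_totals[j])
        probabilities.append(p)
    return probabilities

print(final_choice_probabilities([2, 1, 1]))
# [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
```
</pre>
<p>For the A, B, C table this gives one half, one quarter and one quarter, matching weight divided by total weight.</p>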
<p>We now have a loop that runs in time proportionate to the total number of elements, and also uses memory proportionate to the total number of elements. That's pretty good, but there's another optimisation we can make.</p>
<h3>Using a tree</h3>
<p>If there are a large number of elements, it can still be a pain to have to iterate over them all just to select one. One solution to this problem is to compile the weighted elements into a weighted binary tree, so that we need only perform O(log n) operations, where n is the number of elements. </p>
<p>Let's take a larger weighted set;</p>
<table>
<tr>
<td><strong>element</strong></td>
<td><strong>weight</strong></td>
</tr>
<tr>
<td>A</td>
<td>1</td>
</tr>
<tr>
<td>B</td>
<td>2</td>
</tr>
<tr>
<td>C</td>
<td>3</td>
</tr>
<tr>
<td>D</td>
<td>1</td>
</tr>
<tr>
<td>E</td>
<td>1</td>
</tr>
<tr>
<td>F</td>
<td>1</td>
</tr>
<tr>
<td>G</td>
<td>1</td>
</tr>
<tr>
<td>H</td>
<td>1</td>
</tr>
<tr>
<td>I</td>
<td>1</td>
</tr>
<tr>
<td>J</td>
<td>1</td>
</tr>
</table>
<p>which we can express in tree form;</p>
<div align="center"><img src="http://www.stdlib.net/~colmmacc/weighted-tree.png" alt=""/></div>
<p>where each node has the cumulative weight of its children. </p>
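<p>For comparison only, the same logarithmic lookup can be sketched without an explicit tree, using a binary search over running totals. This is a different technique from the tree in this post, and the function names are invented for the example.</p>
<pre>
```python
import bisect
import itertools
import random

def compile_cumulative(elements):
    # Running totals of the weights: 1, 3, 6, 7, ... for the table above.
    return list(itertools.accumulate(e['weight'] for e in elements))

def bisect_weighted_choice(elements, cumulative, rng=random):
    # Pick a point in [0, total) and binary-search for the slot it falls in.
    r = rng.randrange(cumulative[-1])
    return elements[bisect.bisect_right(cumulative, r)]

weights = [1, 2, 3, 1, 1, 1, 1, 1, 1, 1]
elements = [{'name': name, 'weight': weight}
            for name, weight in zip('ABCDEFGHIJ', weights)]
cumulative = compile_cumulative(elements)
print(cumulative[-1])  # 13, the weight a root node would carry
print(bisect_weighted_choice(elements, cumulative)['name'])
```
</pre>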
<p>Tree-solutions like this lend themselves easily to recursion;</p>
<div class="codecolorer-container python default" style="overflow:auto;white-space:nowrap;border: 1px solid #9F9F9F;width:435px;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">random</span><br />
<br />
elements = <span style="color: black;">&#91;</span> <span style="color: black;">&#123;</span> &nbsp;<span style="color: #483d8b;">'name'</span> : <span style="color: #483d8b;">'A'</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #483d8b;">'weight'</span> : <span style="color: #ff4500;">10</span> <span style="color: black;">&#125;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: black;">&#123;</span> &nbsp;<span style="color: #483d8b;">'name'</span> : <span style="color: #483d8b;">'B'</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #483d8b;">'weight'</span> : <span style="color: #ff4500;">20</span> <span style="color: black;">&#125;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: black;">&#123;</span> &nbsp;<span style="color: #483d8b;">'name'</span> : <span style="color: #483d8b;">'C'</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #483d8b;">'weight'</span> : <span style="color: #ff4500;">30</span> <span style="color: black;">&#125;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: black;">&#123;</span> &nbsp;<span style="color: #483d8b;">'name'</span> : <span style="color: #483d8b;">'D'</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #483d8b;">'weight'</span> : <span style="color: #ff4500;">10</span> <span style="color: black;">&#125;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: black;">&#123;</span> &nbsp;<span style="color: #483d8b;">'name'</span> : <span style="color: #483d8b;">'E'</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #483d8b;">'weight'</span> : <span style="color: #ff4500;">10</span> <span style="color: black;">&#125;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: black;">&#123;</span> &nbsp;<span style="color: #483d8b;">'name'</span> : <span style="color: #483d8b;">'F'</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #483d8b;">'weight'</span> : <span style="color: #ff4500;">10</span> <span style="color: black;">&#125;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: black;">&#123;</span> &nbsp;<span style="color: #483d8b;">'name'</span> : <span style="color: #483d8b;">'G'</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #483d8b;">'weight'</span> : <span style="color: #ff4500;">10</span> <span style="color: black;">&#125;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: black;">&#123;</span> &nbsp;<span style="color: #483d8b;">'name'</span> : <span style="color: #483d8b;">'H'</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #483d8b;">'weight'</span> : <span style="color: #ff4500;">10</span> <span style="color: black;">&#125;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: black;">&#123;</span> &nbsp;<span style="color: #483d8b;">'name'</span> : <span style="color: #483d8b;">'I'</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #483d8b;">'weight'</span> : <span style="color: #ff4500;">10</span> <span style="color: black;">&#125;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: black;">&#123;</span> &nbsp;<span style="color: #483d8b;">'name'</span> : <span style="color: #483d8b;">'J'</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #483d8b;">'weight'</span> : <span style="color: #ff4500;">10</span> <span style="color: black;">&#125;</span> <span style="color: black;">&#93;</span><br />
<br />
<span style="color: #808080; font-style: italic;"># Compile a set of weighted elements into a weighted tree</span><br />
<span style="color: #ff7700;font-weight:bold;">def</span> compile_choices<span style="color: black;">&#40;</span>elements<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>elements<span style="color: black;">&#41;</span> <span style="color: #66cc66;">&gt;</span> 1:<br />
&nbsp; &nbsp; &nbsp; &nbsp; left = &nbsp;elements<span style="color: black;">&#91;</span> : <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>elements<span style="color: black;">&#41;</span> // 2 <span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; right &nbsp; &nbsp;= &nbsp;elements<span style="color: black;">&#91;</span> <span style="color: black;">&#40;</span><span style="color: #008000;">len</span><span style="color: black;">&#40;</span>elements<span style="color: black;">&#41;</span> // 2<span style="color: black;">&#41;</span> : <span style="color: black;">&#93;</span><br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: black;">&#91;</span> <span style="color: black;">&#123;</span> <span style="color: #483d8b;">'child'</span>: compile_choices<span style="color: black;">&#40;</span>left<span style="color: black;">&#41;</span> ,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #483d8b;">'weight'</span> : <span style="color: #008000;">sum</span><span style="color: black;">&#40;</span> <span style="color: #008000;">map</span><span style="color: black;">&#40;</span> <span style="color: #ff7700;font-weight:bold;">lambda</span> x : x<span style="color: black;">&#91;</span><span style="color: #483d8b;">'weight'</span><span style="color: black;">&#93;</span> , left <span style="color: black;">&#41;</span> <span style="color: black;">&#41;</span> <span style="color: black;">&#125;</span> ,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: black;">&#123;</span> <span style="color: #483d8b;">'child'</span>: compile_choices<span style="color: black;">&#40;</span>right<span style="color: black;">&#41;</span> ,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #483d8b;">'weight'</span> : <span style="color: #008000;">sum</span><span style="color: black;">&#40;</span> <span style="color: #008000;">map</span><span style="color: black;">&#40;</span> <span style="color: #ff7700;font-weight:bold;">lambda</span> x : x<span style="color: black;">&#91;</span><span style="color: #483d8b;">'weight'</span><span style="color: black;">&#93;</span> , right <span style="color: black;">&#41;</span> <span style="color: black;">&#41;</span> <span style="color: black;">&#125;</span> <span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">else</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> elements<br />
<br />
<span style="color: #808080; font-style: italic;"># Choose an element from a weighted tree</span><br />
<span style="color: #ff7700;font-weight:bold;">def</span> tree_weighted_choice<span style="color: black;">&#40;</span>tree<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>tree<span style="color: black;">&#41;</span> <span style="color: #66cc66;">&gt;</span> <span style="color: #ff4500;">1</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; total_weight = tree<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span><span style="color: black;">&#91;</span><span style="color: #483d8b;">'weight'</span><span style="color: black;">&#93;</span> + tree<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span><span style="color: black;">&#91;</span><span style="color: #483d8b;">'weight'</span><span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #dc143c;">random</span>.<span style="color: black;">randint</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">0</span>, total_weight - <span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span> <span style="color: #66cc66;">&lt;</span> tree<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span><span style="color: black;">&#91;</span><span style="color: #483d8b;">'weight'</span><span style="color: black;">&#93;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> tree_weighted_choice<span style="color: black;">&#40;</span> tree<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span><span style="color: black;">&#91;</span><span style="color: #483d8b;">'child'</span><span style="color: black;">&#93;</span> <span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">else</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> tree_weighted_choice<span style="color: black;">&#40;</span> tree<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span><span style="color: black;">&#91;</span><span style="color: #483d8b;">'child'</span><span style="color: black;">&#93;</span> <span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">else</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> tree<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span><span style="color: black;">&#91;</span><span style="color: #483d8b;">'name'</span><span style="color: black;">&#93;</span><br />
<br />
tree = compile_choices<span style="color: black;">&#40;</span>elements<span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
<p></code></p>
<p>And there we have it! A pretty good trade-off, we get sub-linear selection time with a relatively small overhead in memory.</p>
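<p>One way to sanity-check the balanced tree is to sample from it many times and compare the observed frequencies with the weights. The following is a minimal sketch in Python 3, assuming a non-empty list of elements with positive integer weights; it re-implements the two functions above so that it runs on its own, and the <code>observed_frequencies</code> helper and the <code>rng</code> parameter are additions for illustration, not part of the original code.</p>

```python
import random
from collections import Counter

def compile_choices(elements):
    # Same shape as the tree above: a leaf is a one-element list, an
    # internal node is a two-element list of {'child', 'weight'} dicts.
    # Assumes `elements` is non-empty.
    if len(elements) == 1:
        return elements
    mid = len(elements) // 2
    halves = (elements[:mid], elements[mid:])
    return [{'child': compile_choices(half),
             'weight': sum(e['weight'] for e in half)} for half in halves]

def tree_weighted_choice(tree, rng):
    # Walk down from the root, choosing a side in proportion to its weight.
    if len(tree) == 1:
        return tree[0]['name']
    total_weight = tree[0]['weight'] + tree[1]['weight']
    side = 0 if rng.randrange(total_weight) < tree[0]['weight'] else 1
    return tree_weighted_choice(tree[side]['child'], rng)

def observed_frequencies(elements, samples, seed=1):
    # Hypothetical helper: the fraction of samples in which each name came up.
    rng = random.Random(seed)
    tree = compile_choices(elements)
    counts = Counter(tree_weighted_choice(tree, rng) for _ in range(samples))
    return {e['name']: counts[e['name']] / samples for e in elements}

elements = [{'name': 'a', 'weight': 1}, {'name': 'b', 'weight': 2},
            {'name': 'c', 'weight': 3}, {'name': 'd', 'weight': 4}]
total = sum(e['weight'] for e in elements)
frequencies = observed_frequencies(elements, 20000)
for e in elements:
    # Prints the name, the expected share and the observed share.
    print(e['name'], e['weight'] / total, frequencies[e['name']])
```

<p>The observed shares should land close to 0.1, 0.2, 0.3 and 0.4; how close is "close enough" is exactly the kind of question that makes testing random code interesting.</p>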
<h3>A huffman(ish) tree</h3>
<p>I should have included this in the first version of this post, and Fergal was quick to point it out in the comments, but there is a further optimisation we can make. Instead of using a tree that is balanced purely by the number of nodes, as above, we can use a tree that is balanced by the sum of weights. </p>
<p>To re-use his example, imagine you have elements that are weighted [ 10000, 1 , 1 , 1 ... ] (with a hundred ones). It doesn't make sense to bury the very weighty element deep in the tree; instead it should go near the root, so that we minimise the expected number of lookups.</p>
<div class="codecolorer-container python default" style="overflow:auto;white-space:nowrap;border: 1px solid #9F9F9F;width:435px;height:300px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br />12<br />13<br />14<br />15<br />16<br />17<br />18<br />19<br />20<br />21<br />22<br /></div></td><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #ff7700;font-weight:bold;">def</span> huffman_choices<span style="color: black;">&#40;</span>elements<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>elements<span style="color: black;">&#41;</span> <span style="color: #66cc66;">&gt;</span> 1:<br />
&nbsp; &nbsp; &nbsp; &nbsp; total_weight = <span style="color: #008000;">sum</span><span style="color: black;">&#40;</span> <span style="color: #008000;">map</span><span style="color: black;">&#40;</span> <span style="color: #ff7700;font-weight:bold;">lambda</span> x : x<span style="color: black;">&#91;</span><span style="color: #483d8b;">'weight'</span><span style="color: black;">&#93;</span> , elements <span style="color: black;">&#41;</span> <span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; sorted_elements = <span style="color: #008000;">sorted</span><span style="color: black;">&#40;</span>elements, key = <span style="color: #ff7700;font-weight:bold;">lambda</span> x : x<span style="color: black;">&#91;</span><span style="color: #483d8b;">'weight'</span><span style="color: black;">&#93;</span> <span style="color: black;">&#41;</span><span style="color: black;">&#91;</span>::-1<span style="color: black;">&#93;</span><br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; left = <span style="color: black;">&#91;</span><span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; right = <span style="color: black;">&#91;</span><span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; observed_weight = 0<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> element <span style="color: #ff7700;font-weight:bold;">in</span> sorted_elements:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> observed_weight <span style="color: #66cc66;">&lt;</span> total_weight / 2:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; left.<span style="color: black;">append</span><span style="color: black;">&#40;</span>element<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">else</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; right.<span style="color: black;">append</span><span style="color: black;">&#40;</span>element<span style="color: black;">&#41;</span><br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; observed_weight += element<span style="color: black;">&#91;</span><span style="color: #483d8b;">'weight'</span><span style="color: black;">&#93;</span><br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: black;">&#91;</span> <span style="color: black;">&#123;</span> <span style="color: #483d8b;">'child'</span>: huffman_choices<span style="color: black;">&#40;</span>left<span style="color: black;">&#41;</span> ,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #483d8b;">'weight'</span> : <span style="color: #008000;">sum</span><span style="color: black;">&#40;</span> <span style="color: #008000;">map</span><span style="color: black;">&#40;</span> <span style="color: #ff7700;font-weight:bold;">lambda</span> x : x<span style="color: black;">&#91;</span><span style="color: #483d8b;">'weight'</span><span style="color: black;">&#93;</span> , left <span style="color: black;">&#41;</span> <span style="color: black;">&#41;</span> <span style="color: black;">&#125;</span> ,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: black;">&#123;</span> <span style="color: #483d8b;">'child'</span>: huffman_choices<span style="color: black;">&#40;</span>right<span style="color: black;">&#41;</span> ,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #483d8b;">'weight'</span> : <span style="color: #008000;">sum</span><span style="color: black;">&#40;</span> <span style="color: #008000;">map</span><span style="color: black;">&#40;</span> <span style="color: #ff7700;font-weight:bold;">lambda</span> x : x<span style="color: black;">&#91;</span><span style="color: #483d8b;">'weight'</span><span style="color: black;">&#93;</span> , right <span style="color: black;">&#41;</span> <span style="color: black;">&#41;</span> <span style="color: black;">&#125;</span> <span style="color: black;">&#93;</span><br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">else</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">return</span> elements</div></td></tr></tbody></table></div>
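<p>To see what the weight-balanced tree buys us, here is a rough sketch in Python 3 that compares the expected number of lookups for Fergal's example under both splitting strategies. The names <code>balanced_split</code>, <code>weight_split</code> and <code>expected_lookups</code> are made up for this illustration, and the code assumes positive weights.</p>

```python
def balanced_split(elements):
    # Split by element count, as compile_choices does.
    mid = len(elements) // 2
    return elements[:mid], elements[mid:]

def weight_split(elements):
    # Split by accumulated weight, heaviest first, as huffman_choices does.
    ordered = sorted(elements, key=lambda e: e['weight'], reverse=True)
    total = sum(e['weight'] for e in ordered)
    left, right, seen = [], [], 0
    for element in ordered:
        (left if seen < total / 2 else right).append(element)
        seen += element['weight']
    return left, right

def weighted_depth(elements, split, depth=0):
    # Sum of weight * depth over every leaf of the tree that `split` builds.
    if len(elements) == 1:
        return elements[0]['weight'] * depth
    left, right = split(elements)
    return (weighted_depth(left, split, depth + 1) +
            weighted_depth(right, split, depth + 1))

def expected_lookups(elements, split):
    # Average number of levels descended per weighted choice.
    return weighted_depth(elements, split) / sum(e['weight'] for e in elements)

elements = [{'name': 'heavy', 'weight': 10000}]
elements += [{'name': 'light-%d' % n, 'weight': 1} for n in range(100)]

print(expected_lookups(elements, balanced_split))
print(expected_lookups(elements, weight_split))
```

<p>With these weights the count-balanced tree needs about six lookups per choice, while the weight-balanced tree needs only a little over one, because the heavy element sits directly under the root.</p>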
<p>None of these techniques are novel; in fact they're all quite common and very old, yet for some reason they aren't widely known. Reservoir sampling - which on its own is incredibly useful - is only afforded a few sentences in <a href="http://amzn.com/0321751043">Knuth</a>. </p>
<p>But dealing with randomness and sampling is one of the inevitable complexities of programming. If you've never done it, it's worth taking the above and trying to write your own weighted Markov chain generator. And then have even more fun thinking about how to test it.</code></p>
]]></content:encoded>
			<wfw:commentRss>http://www.stdlib.net/~colmmacc/2011/05/15/weighty-matters/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Period Pain</title>
		<link>http://www.stdlib.net/~colmmacc/2009/09/14/period-pain/</link>
		<comments>http://www.stdlib.net/~colmmacc/2009/09/14/period-pain/#comments</comments>
		<pubDate>Mon, 14 Sep 2009 16:15:27 +0000</pubDate>
		<dc:creator>colmmacc</dc:creator>
				<category><![CDATA[coding]]></category>

		<guid isPermaLink="false">http://www.stdlib.net/~colmmacc/?p=473</guid>
		<description><![CDATA[Imagine it was your job &#8211; along with 1,000 other people &#8211; to pick a number between 1 and 60. You can use any method you like (though you must all use the same one), but if 30 or more of you choose the same number, those of you who did would be shot. Would [...]]]></description>
			<content:encoded><![CDATA[<p>Imagine it was your job &#8211; along with 1,000 other people &#8211; to pick a number between 1 and 60.  You can use any method you like (though you must all use the same one), but if 30 or more of you choose the same number, those of you who did would be shot. Would you let the group pick numbers at random?  </p>
<p>Probably not, there&#8217;s always a chance it could go horribly wrong. And that chance? We could derive the correct p-value for having to shoot people easily enough from the uniform distribution, but forget that, let&#8217;s do it with code. It&#8217;s a lot easier to understand &#8211; and it&#8217;s a good habit too.</p>
<div class="codecolorer-container python default" style="overflow:auto;white-space:nowrap;border: 1px solid #9F9F9F;width:435px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br />12<br />13<br />14<br />15<br />16<br />17<br /></div></td><td><div class="python codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">random</span><br />
<br />
count = <span style="color: #ff4500;">0</span><br />
<br />
<span style="color: #808080; font-style: italic;"># Simulate 10,000 runs of 1,000 people</span><br />
<span style="color: #808080; font-style: italic;"># &nbsp;picking a number between 0 and 59.</span><br />
<span style="color: #ff7700;font-weight:bold;">for</span> runs <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">range</span><span style="color: black;">&#40;</span>10000<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; numbers = <span style="color: black;">&#91;</span>0<span style="color: black;">&#93;</span> <span style="color: #66cc66;">*</span> 60<br />
<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">for</span> people <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #008000;">range</span><span style="color: black;">&#40;</span>1000<span style="color: black;">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; r = <span style="color: #dc143c;">random</span>.<span style="color: black;">randint</span><span style="color: black;">&#40;</span>0, 59<span style="color: black;">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; numbers<span style="color: black;">&#91;</span> r <span style="color: black;">&#93;</span> += 1<br />
<br />
&nbsp; &nbsp; <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #008000;">max</span><span style="color: black;">&#40;</span>numbers<span style="color: black;">&#41;</span> <span style="color: #66cc66;">&gt;</span>= 30:<br />
&nbsp; &nbsp; &nbsp; &nbsp; count += 1<br />
<br />
<span style="color: #ff7700;font-weight:bold;">print</span><span style="color: black;">&#40;</span>count/<span style="color: #ff4500;">10000.0</span><span style="color: black;">&#41;</span></div></td></tr></tbody></table></div>
<p>If you run this, hopefully you&#8217;ll get an answer that&#8217;s around &#8220;0.1&#8221;. In other words, about 10% of the time we expect to have at least one number being chosen by 30 or more people. Those don&#8217;t seem like great odds, and I wouldn&#8217;t play Russian roulette with lives like that. </p>
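<p>For comparison, the simulated figure can be cross-checked against an exact calculation. The sketch below (Python 3.8 or later, for <code>math.comb</code>) computes the binomial tail probability that one particular number is picked by 30 or more of the 1,000 people, and then multiplies by 60 to get an upper bound (the union bound) on the chance that any number is. The function name <code>tail_probability</code> is made up for this example.</p>

```python
from fractions import Fraction
from math import comb

def tail_probability(people, slots, threshold):
    # Exact P(one given slot is picked by at least `threshold` people) when
    # every person picks uniformly at random: a binomial tail with p = 1/slots.
    # Each term is C(people, k) * (slots - 1)**(people - k) / slots**people.
    numerator = sum(comb(people, k) * (slots - 1) ** (people - k)
                    for k in range(threshold, people + 1))
    return float(Fraction(numerator, slots ** people))

p_single = tail_probability(1000, 60, 30)
upper_bound = 60 * p_single  # union bound over all 60 numbers

print(p_single, upper_bound)
```

<p>The upper bound should come out slightly above the simulated value, since it counts runs in which two numbers both reach 30 twice.</p>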
<p>Yet this is almost exactly what we do with a lot of automated tasks in some large-scale distributed systems. A pattern that can be observed over and over again is that someone writes a scheduled task that runs once an hour, day, month or whatever but then sleeps a random amount of time (typically between 0 and 3600 seconds) when it starts &#8211; in a very naive attempt to distribute impact across the larger system evenly.  </p>
<div align="center"><img src="http://www.stdlib.net/~colmmacc/time.jpg" alt="The march of time" /></div>
<p>The impacts can be pretty serious &#8211; it might be extended load on a dependent service, or it might simply be too many busy nodes in a cluster. Or it might be a <a href="http://heartbeat.skype.com/2007/08/what_happened_on_august_16.html">two-day global outage of a popular VoIP service</a>. Too many things happening at once is usually a bad thing in distributed systems. </p>
<p>Security and anti-virus updates, apt/yum/ports repositories and auditing tools in particular seem to get this pattern wrong. Here&#8217;s a good example, from Ubuntu&#8217;s daily apt script:</p>
<div class="codecolorer-container bash default" style="overflow:auto;white-space:nowrap;border: 1px solid #9F9F9F;width:435px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br />12<br />13<br />14<br />15<br />16<br /></div></td><td><div class="bash codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #666666; font-style: italic;"># sleep for a random intervall of time (default 30min)</span><br />
<span style="color: #666666; font-style: italic;"># (some code taken from cron-apt, thanks)</span><br />
random_sleep<span style="color: #7a0874; font-weight: bold;">&#40;</span><span style="color: #7a0874; font-weight: bold;">&#41;</span><br />
<span style="color: #7a0874; font-weight: bold;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #007800;">RandomSleep</span>=1800<br />
&nbsp; &nbsp; <span style="color: #7a0874; font-weight: bold;">eval</span> $<span style="color: #7a0874; font-weight: bold;">&#40;</span>apt-config shell RandomSleep APT::Periodic::RandomSleep<span style="color: #7a0874; font-weight: bold;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #000000; font-weight: bold;">if</span> <span style="color: #7a0874; font-weight: bold;">&#91;</span> <span style="color: #007800;">$RandomSleep</span> <span style="color: #660033;">-eq</span> 0 <span style="color: #7a0874; font-weight: bold;">&#93;</span>; <span style="color: #000000; font-weight: bold;">then</span><br />
&nbsp; &nbsp; <span style="color: #7a0874; font-weight: bold;">return</span><br />
&nbsp; &nbsp; <span style="color: #000000; font-weight: bold;">fi</span><br />
&nbsp; &nbsp; <span style="color: #000000; font-weight: bold;">if</span> <span style="color: #7a0874; font-weight: bold;">&#91;</span> <span style="color: #660033;">-z</span> <span style="color: #ff0000;">&quot;<span style="color: #007800;">$RANDOM</span>&quot;</span> <span style="color: #7a0874; font-weight: bold;">&#93;</span> ; <span style="color: #000000; font-weight: bold;">then</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #666666; font-style: italic;"># A fix for shells that do not have this bash feature.</span><br />
&nbsp; &nbsp; <span style="color: #007800;">RANDOM</span>=$<span style="color: #7a0874; font-weight: bold;">&#40;</span><span style="color: #c20cb9; font-weight: bold;">dd</span> <span style="color: #000000; font-weight: bold;">if</span>=<span style="color: #000000; font-weight: bold;">/</span>dev<span style="color: #000000; font-weight: bold;">/</span>urandom <span style="color: #007800;">count</span>=1 2<span style="color: #000000; font-weight: bold;">&gt;</span> <span style="color: #000000; font-weight: bold;">/</span>dev<span style="color: #000000; font-weight: bold;">/</span>null <span style="color: #000000; font-weight: bold;">|</span> cksum <span style="color: #000000; font-weight: bold;">|</span> <span style="color: #c20cb9; font-weight: bold;">cut</span> <span style="color: #660033;">-c</span><span style="color: #ff0000;">&quot;1-5&quot;</span><span style="color: #7a0874; font-weight: bold;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #000000; font-weight: bold;">fi</span><br />
&nbsp; &nbsp; <span style="color: #007800;">TIME</span>=$<span style="color: #7a0874; font-weight: bold;">&#40;</span><span style="color: #7a0874; font-weight: bold;">&#40;</span><span style="color: #007800;">$RANDOM</span> <span style="color: #000000; font-weight: bold;">%</span> <span style="color: #007800;">$RandomSleep</span><span style="color: #7a0874; font-weight: bold;">&#41;</span><span style="color: #7a0874; font-weight: bold;">&#41;</span><br />
&nbsp; &nbsp; <span style="color: #c20cb9; font-weight: bold;">sleep</span> <span style="color: #007800;">$TIME</span><br />
<span style="color: #7a0874; font-weight: bold;">&#125;</span></div></td></tr></tbody></table></div>
<p>This is a very bad pattern &#8211; I&#8217;d go so far as to say that it&#8217;s actually worse than letting everything happen at the same time in the first place. It has two particularly dangerous qualities:</p>
<ol>
<li>
<h3>It increases the effective period of the task</h3>
<p>If a task is running once an hour &#8211; it&#8217;s probably because we need to do something once an hour. That may seem tautological, but there&#8217;s subtlety. The clock-time hours we define as humans are arbitrary; by once an hour we should really mean &#8220;at least once in a 60 minute interval&#8221;, <strong>not</strong> &#8220;once between 1PM and 2PM&#8221;. </p>
<p>If we pick a random sleep time, we might end up running just under <strong>two</strong> hours apart. If at 1PM we pick 0 sleep seconds, and then at 2PM we pick 3599 sleep seconds &#8211; look, we just ran almost two real hours apart! Unsurprisingly the converse can happen, and we&#8217;ll run just 1 second apart, doing heaven knows what to load along the way.</p>
</li>
<li>
<h3>The system as a whole has to cope with dangerous load spikes</h3>
<p>Using our earlier example, if we allocated the numbers evenly, each value would get chosen only 16 or 17 times. We could plan for an impact of say 20 running at once. But as we&#8217;ve seen, if we pick random numbers every time, then 1 in every 10 runs, we&#8217;re going to have to cope with an impact of 30.  That&#8217;s 50% more load, because of a one-line error!</p>
<p>If this task is running every hour, then about 1 in every 7 weeks, we&#8217;re going to have to deal with an impact of 40 or more. And it will appear totally out of the blue, it&#8217;s a random occurrence. Nice! To take it to the extreme, there is a small &#8211; but finite &#8211; chance of all 1,000 systems choosing the same value. But avoiding this is why we are spreading the load in the first place, so why leave it to chance?</p>
<p>I used to run a busy Ubuntu mirror, and every day between 6:25 and 6:55 we&#8217;d see a gigantic wedge of load that could be distributed a lot more evenly. Though I think the worst of these problems have now been fixed.</p>
</li>
</ol>
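<p>The first of these qualities is easy to see in a short simulation. The sketch below (Python 3) models a single node that wakes on the hour and then sleeps a fresh random number of seconds, and reports the shortest and longest gaps between consecutive runs over a simulated year. The names <code>run_times</code> and <code>gaps</code> are invented for this example.</p>

```python
import random

PERIOD = 3600  # seconds between scheduled wake-ups

def run_times(hours, rng):
    # Each hour the task wakes on the hour, then sleeps a fresh random
    # 0..3599 seconds before doing its work.
    return [hour * PERIOD + rng.randrange(PERIOD) for hour in range(hours)]

def gaps(times):
    # Seconds between each pair of consecutive runs.
    return [later - earlier for earlier, later in zip(times, times[1:])]

rng = random.Random(42)
observed = gaps(run_times(24 * 7 * 52, rng))

# The gap can be anywhere from 1 second to 7199 seconds.
print(min(observed), max(observed))
```

<p>Over a year of hourly runs the shortest gap is typically a matter of seconds and the longest is close to two hours, even though the task is nominally hourly.</p>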
<p>The optimal fix for the problem is simple: coordinate &#8211; use a central allocator, or a gossip protocol, to ensure that, with N consumers and M slots, every slot has at most N/M consumers (rounded up). </p>
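<p>As a minimal sketch of what a central allocator could do (Python 3), assuming it knows the full list of node identifiers: sort the nodes and deal them out to slots round-robin. The function <code>allocate_slots</code> and the node names are hypothetical.</p>

```python
from collections import Counter

def allocate_slots(node_ids, slot_count):
    # Deal nodes out to slots round-robin, so that no slot ever holds more
    # than ceil(N / M) of the N nodes across M slots.
    ordered = sorted(node_ids)
    return {node: index % slot_count for index, node in enumerate(ordered)}

nodes = ["node-%04d" % n for n in range(1000)]  # hypothetical node names
assignment = allocate_slots(nodes, 60)
load = Counter(assignment.values())

print(min(load.values()), max(load.values()))
```

<p>For 1,000 nodes and 60 slots, every slot ends up with 16 or 17 nodes, which is the even allocation mentioned earlier.</p>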
<p>This isn&#8217;t always possible though &#8211; Open Source security updates are usually polled from relatively dumb mirror servers, that don&#8217;t keep enough state to be able to divide load like this. But there are still two better approaches. </p>
<p>Firstly, you can pick a random number just once, and then reuse the <strong>same</strong> random number every time.  This guarantees that the task does actually run once an hour, and it makes load predictable for the distributed system as a whole. </p>
<p>Secondly, you can use any uniformly distributed unique piece of data, local to the node, to choose your number. I like to use an MD5 hash of the MAC address, but even the hostname would do. </p>
<p>Probability-wise, the two approaches are identical, but in the real world you get fewer collisions with the latter &#8211; probably due to the similar state of entropy identical systems will be in when you mass boot a fleet of several hundred. In both cases, where aberrations emerge due to bad luck, they at least only have to be identified and fixed <em>once</em>. We&#8217;re no longer playing the load lottery once an hour.</p>
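<p>The second approach can be sketched in a few lines of Python 3. This assumes the identifier is available as a string (a MAC address or hostname); the function name <code>stable_offset</code> and the example MAC address are made up for illustration.</p>

```python
import hashlib

def stable_offset(identifier, period=3600):
    # Hash a node-local identifier and reduce it to an offset in seconds.
    # The same identifier always yields the same offset, so the task keeps
    # a fixed place within every period.
    digest = hashlib.md5(identifier.encode("utf-8")).hexdigest()
    return int(digest, 16) % period

# Hypothetical MAC address; a hostname string would work the same way.
print(stable_offset("00:16:3e:12:34:56"))
```

<p>A node would compute this once at start-up, or every time since the answer never changes, and sleep for that many seconds after the scheduled wake-up. MD5 is used here only to spread values evenly, not for security.</p>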
<p>But this is only half the story &#8230; what about other automated tasks with their own periods &#8230; how do we avoid colliding with them? And how do we architect timings in distributed systems in general to make them a lot easier to debug? That&#8217;s what the next blog post will be about. Stay tuned.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.stdlib.net/~colmmacc/2009/09/14/period-pain/feed/</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
		<item>
		<title>N+1</title>
		<link>http://www.stdlib.net/~colmmacc/2009/09/02/n1/</link>
		<comments>http://www.stdlib.net/~colmmacc/2009/09/02/n1/#comments</comments>
		<pubDate>Wed, 02 Sep 2009 09:35:10 +0000</pubDate>
		<dc:creator>colmmacc</dc:creator>
				<category><![CDATA[coding]]></category>
		<category><![CDATA[general]]></category>

		<guid isPermaLink="false">http://www.stdlib.net/~colmmacc/?p=430</guid>
		<description><![CDATA[I rarely write introspective or meta blog posts, in fact I rarely even use the word &#8220;I&#8221; on this blog (one of the habits you should develop as a manager or team-member is to use the word &#8220;we&#8221; almost all of the time) &#8230; so I hope you&#8217;ll forgive this brief, obnoxious, self-centered, round-up of [...]]]></description>
			<content:encoded><![CDATA[<p>I rarely write introspective or meta blog posts, in fact I rarely even use the word &#8220;I&#8221; on this blog (one of the habits you should develop as a manager or team-member is to use the word &#8220;we&#8221; almost all of the time) &#8230; so I hope you&#8217;ll forgive this brief, obnoxious, self-centered, round-up of the last year. I&#8217;ll try to smatter it with some useful observations to make up for it. </p>
<p>Today it&#8217;s one year since I started working for Amazon.com, and officially moved in to my new place and hence back to Dublin. Before that, I&#8217;d been spending most of my time in Amsterdam and traveling. By just September last year, I&#8217;d managed 78 flights in 2008. In the year since, it&#8217;s been a much more comfortable, and relatively conservative, 24. Some visits to Seattle, New York, New Orleans, Barcelona, Bristol and London keeping me occupied and worldly.</p>
<div align="center"><img src="http://www.stdlib.net/~colmmacc/moi.jpg" alt="Me!" /></div>
<p>I&#8217;ve learned a lot in the last year, a lot more and in very different directions, than I expected. Amazon turns out to be even more interesting than I had anticipated, and getting thrown in at the deep end early on &#8211; with a new Amazon Web Service to help build and support &#8211; has been an amazing experience. Before I started, I naively thought that scale was mostly about understanding distributed systems problems and performant designs. It is about those things, but not mostly.</p>
<p>In reality (well, in my opinion) the hard problems of scale are the incredible attention to detail and testing it requires, because when your rapidly-changing code is handling billions of requests &#8211; even a tiny fault that is triggered on one in every million requests will get you paged in no time at all. I&#8217;m not sure that there is any other way to learn it than having to support it for real. I recommend it.</p>
<p>I&#8217;m getting a chance to work on stuff I enjoy, building some big things and making some important and critical code go really really fast. It&#8217;s always nice to feel that features exist &#8211; that the universe is different in some way &#8211; because of work you&#8217;ve been involved in. That should always be the measure of progress; &#8220;How has the universe been changed?&#8221;, everything else is meta-work. </p>
<p>I am learning technical things at Amazon; when I started I had never written a line of Perl, now I&#8217;ve written entire Perl frameworks, a very basic Perl compiler, a bytecode analyzer, perform low-level code-reviews, and teach Perl once a week. The experience has reinforced my opinion that the notion of &#8220;knowing programming language X&#8221; is itself a broken anti-pattern. </p>
<p>That said, Perl was comparatively difficult to learn. Coming to it with about 15 years programming experience in many languages, it still took about 5 weeks to become proficient in it to the point that I really understood all of the magic symbols, operators and patterns in front of me at a fundamental level. At a guess, it took just 2 weeks to reach the same level of proficiency in Python. </p>
<p>But that learning is self-driven, the <em>real</em> learning experience at Amazon is how things are organised and managed. How a huge multinational can structure and orientate itself such that things can happen incredibly efficiently and quickly is fascinating to observe and participate in. It&#8217;s like getting a free MBA. I can recommend that too.</p>
<p>On the college front &#8211; it&#8217;s been a strange year, as it was also my final year doing my BSc. in Computer Science in Trinity College, Dublin. It was tough going, mostly just to stick through it, but I think worth it in the end. Somehow came out with a first class honours degree. I&#8217;ve decided not to progress with a part-time postgrad just yet &#8211; it seems like a good time to try not doing so much college for a change, but maybe next year. </p>
<p>In the last year, I&#8217;ve learned two new instruments, banjo &#8211; for the fun of it &#8211; to a level where I can now keep up in sessions &#8211; and piano &#8211; to a lesser level but I can now arrange and play relatively intricate pieces. My place has worked out very well, and a year on I&#8217;m still very happy to be living here, it&#8217;s ideal. Ups and downs in my personal life in the last year have been extreme, but I&#8217;ve learned a lot from those too and am mostly the better for it.</p>
<p>But now that I have some more free time, I have to admit I&#8217;m not entirely sure what to do with it. I&#8217;m finding things to do with my evenings, and catching up with friends properly, but still have itches to be a bit more productive. I haven&#8217;t done as much Apache stuff as I&#8217;d have liked in the last 2 years, but the urge to write a webserver from scratch using a more functional programming oriented approach (though not necessarily a lambda-calculus derived language) is strong &#8230; and also pointless.</p>
<p>This past year also saw the final, conclusive, victory of sense in the Irish E-voting debacle. In short; we were correct all along, and the system has been completely abandoned. I have to admit it was a bit fun to be able to gloat about it on the radio. </p>
<p>So now &#8230; what next for my free time &#8230; well maybe you can help. To further compound the impression of arrogance and self-obsession I have no doubt created, it&#8217;s like this; I&#8217;m pretty clever. I&#8217;ve got above average maths, analytical and linguistics skills. I&#8217;m an expert programmer/developer and builder of things. I&#8217;m a fairly decent musician and photographer with some basic sense of style, design and composition.</p>
<p>Oh god there&#8217;s more! I&#8217;m politically knowledgeable, and I know how to manipulate the political and press systems and strategise (with a proven track record through two lobby groups). I&#8217;ve worked for two start-ups, and I know how to make things happen. I&#8217;m a quick learner, and I know that when something is worthwhile, realistic and interesting I can throw a huge amount of effort and pragmatism right at it. I like doing cool things for free, and earn just enough to be able to.  </p>
<p>Now all that said, to moderate things I&#8217;m also basically an introvert and can be pretty awkward and quiet around too many unfamiliar people, am about the last person you&#8217;d ever want to take to a bar (I don&#8217;t drink for a start), know close to nothing about popular culture and have read maybe only 4 or 5 fiction books in the last 10 years. So there&#8217;s a strong counter-argument that I&#8217;m also an illiterate anti-social bore to be kept in mind here too. But thankfully, not a terrible one.</p>
<p>But with all of that in mind, what does need doing? Especially in Ireland/Dublin &#8211; because it would be nice to involve meeting people and getting better at that whole social thing. Any ideas? The bigger the better. Any technical, political, or social gaps that really need filling? Anybody need some help? What would you, or indeed, Jesus do? There are a few ideas knocking around already, and I&#8217;ll be sure to update when they are more concrete. How should we change the universe?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.stdlib.net/~colmmacc/2009/09/02/n1/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
		<item>
		<title>Optimising strlen()</title>
		<link>http://www.stdlib.net/~colmmacc/2009/03/01/optimising-strlen/</link>
		<comments>http://www.stdlib.net/~colmmacc/2009/03/01/optimising-strlen/#comments</comments>
		<pubDate>Sun, 01 Mar 2009 15:51:25 +0000</pubDate>
		<dc:creator>colmmacc</dc:creator>
				<category><![CDATA[coding]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[optimisation]]></category>
		<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://www.stdlib.net/~colmmacc/?p=270</guid>
		<description><![CDATA[Optimisation has a bit of a bad rep these days, good advice such as Hoare&#8217;s dictum that &#8220;Premature optimization is the root of all evil&#8221;, has led to a stern outlook on adding obfuscated mess to gain efficiency. Sometimes going really really fast just makes you really really insane. Economically, the convenience of developers has [...]]]></description>
			<content:encoded><![CDATA[<p>Optimisation has a bit of a bad rep these days. Good advice, such as <a href="http://c2.com/cgi/wiki?PrematureOptimization">Hoare&#8217;s dictum</a> that &#8220;Premature optimization is the root of all evil&#8221;, has led to a stern outlook on adding obfuscated mess to gain efficiency. Sometimes going really really fast just makes you really really insane.</p>
<div align="center"><a href="http://flickr.com/photos/colmmacc/2728354476/"><img src="http://www.stdlib.net/~colmmacc/mass.jpg" alt=""/></a></div>
<p>Economically, the convenience of developers has won out over efficiency. Very high-level languages are near ubiquitous. Very few people have to squeeze implementations into so many CPU operations, or such and such an amount of memory. On the contrary, click-dragging many millions of CPU operations and many millions of bytes of memory in the form of some library, form-control or icon in an even more inefficient IDE is the norm. </p>
<p>But as others have put far better, advocating the complete absence of optimisation is advocating a <a href="http://www.acm.org/ubiquity/views/v7i24_fallacy.html">fallacy</a>. Focused, intelligent, and crucially &#8211; well documented &#8211; optimisation is a very important skill that can save lots of time and money. To give three concrete examples;</p>
<dl>
<dt>1. Deriving optimal buffer sizes</dt>
<dd>
<p>At <a href="http://www.heanet.ie/">HEAnet</a> when we were scaling <a href="http://ftp.heanet.ie/">ftp.heanet.ie</a> to cope with over 50,000 concurrent downloads from a single host, we ran a lot of experiments to figure out the optimal read buffer size.</p>
<p>This is very rarely done; most buffers are chosen arbitrarily, with 4k and 8k being common for various reasons. Our experiments showed that the optimal buffer size was actually around 40k, and when we made this change within <a href="http://httpd.apache.org">Apache</a> we measured a 25% capacity improvement.</p></dd>
<dt>2. Using integers to compare strings</dt>
<dd>
<p>The objective of a contract I was involved in some years ago was to analyse a <strong>lot</strong> of HTTP data, from a load-balancer, to try and determine some statistics about it. Our input was billions of requests and getting the reports as quickly as possible was important (it fed into a health-check system, with a rolling average).</p>
<p>One of the key stages of the process was the very start of the request, because we would branch on the type of request (GET, HEAD, CONNECT, PUT, POST &#8230; etc). A branch here was bad for two reasons: it interfered with pipelining and with the CPU caches, which is especially wasteful when most requests are GETs anyway. </p>
<p>Re-ordering the branches such that the GET case was a &#8220;fall-through&#8221; (that it didn&#8217;t involve a jump) helped a little, but there was still some inefficiency going on. The string comparison itself seemed to be wasting some of the L1 cache on us. </p>
<p>So, as a crazy solution, we treated the first 4 bytes of the request as an integer &#8211; and used integer comparison. On x86, &#8220;GET&#8221; == 5522759, &#8220;HEAD&#8221;  == 1145128264 and so on. The first four bytes just happen to be unique in HTTP methods, and the check can happen in the CPU registers directly without having to dereference pointers.  </p>
<p>The app got about twice as fast, but this was probably due to some other variable now fitting in the L1 cache. Plenty of explanatory comments in the code made this an acceptable, if still crazy, optimisation to make. It also taught everyone involved exactly what SIGBUS really is.</p></dd>
<dt>3. Replacing inefficient calls</dt>
<dd>
<p>Sometimes the simplest optimisations are the most effective. At my current job, we managed to speed one of our external services up by about 20% just by replacing the top 3 calls that showed up in <a href="http://www.cs.utah.edu/dept/old/texinfo/as/gprof_toc.html">gprof</a> with more standard implementations that are implemented in x86 assembly.</p>
<p>This is classic optimisation: run a profiler and go for a targeted attack. It can still be amazing what it achieves.</p>
</dd>
</dl>
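<p>The integer comparison from example 2 can be sketched roughly as follows. This is an illustration, not the code from that contract: the names <em>load4</em>, <em>classify</em> and the <em>method</em> enum are hypothetical, and the sketch compares against &#8220;GET &#8221; with its trailing space, as the bytes would appear at the start of a request line. The value 5522759 quoted above corresponds to the bytes &#8216;G&#8217;, &#8216;E&#8217;, &#8216;T&#8217; followed by a zero byte, read as a little-endian 32-bit integer.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical set of request types, for illustration only. */
enum method { M_GET, M_HEAD, M_POST, M_PUT, M_OTHER };

/* Hypothetical helper: read four bytes into a 32-bit integer exactly as
 * they sit in memory. memcpy() sidesteps the unaligned load that can
 * raise SIGBUS on stricter architectures. */
static uint32_t load4(const char *p)
{
    uint32_t w;
    memcpy(&w, p, sizeof w);
    return w;
}

/* Classify a request line by its first four bytes. The caller must
 * guarantee that at least four bytes are readable at req. */
static enum method classify(const char *req)
{
    assert(req != NULL);
    const uint32_t w = load4(req);

    /* Most common case first. */
    if (w == load4("GET ")) return M_GET;
    if (w == load4("HEAD")) return M_HEAD;
    if (w == load4("POST")) return M_POST;
    if (w == load4("PUT ")) return M_PUT;
    return M_OTHER;
}
```

<p>Building the constants with the same helper keeps the comparison independent of byte order; an optimising compiler will usually fold them into plain integer constants, though that is worth checking in the generated assembly.</p>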
<p>Unfortunately one of the problems with optimisation is that it&#8217;s <em>hard</em>. Being able to profile something is relatively straightforward and repeatable, but even the basic step of <em>knowing that your application is slow</em> is non-trivial. Sure there might be a high-level SLA, but you can always try throwing hardware at a problem. </p>
<p>Knowing that something is slow, or large, involves being able to make educated estimates about implementations, from first principles. Optimisers can approach a problem and think &#8220;well we&#8217;re accessing X much data, and performing about Y basic CPU operations, the fundamental limit is probably around Z&#8221;. That&#8217;s just step 1. </p>
<p>The other key part is that optimisers tend to have a deep, intuitive, understanding of systems at every level. They know not just a programming language, but the intricacies of how it is implemented. They&#8217;ll know how memory managers, file-systems, compilers, CPUs, networks and all sorts of things in-between actually work. This takes years, and quite a few &#8220;ahhhh&#8221; moments at 3AM in the middle of some nightmarish failure scenario.</p>
<h3>strlen()</h3>
<p>So where to start? One good place to look is standard libraries: they contain many implementations of very basic routines that are called so often that they have to be optimal. Additionally, the really common routines are implemented in nearly every language, and you can look at the trade-offs made within each. It&#8217;s a good way to learn a lot. One great case-study is strlen(). </p>
<p>Before we go on, let&#8217;s get some things defined. For the most part here, we&#8217;re going to look at C style strings. A quick refresher; in C, a string is just a 0-terminated array of bytes (<em>char</em> is the C type for a character). They look like this;</p>
<div class="codecolorer-container c default" style="overflow:auto;white-space:nowrap;border: 1px solid #9F9F9F;width:435px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="c codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #808080; font-style: italic;">/* A C string initialised as an array */</span><br />
<span style="color: #993333;">char</span> hello<span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">=</span> <span style="color: #009900;">&#123;</span> <span style="color: #ff0000;">'h'</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">'e'</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">'l'</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">'l'</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">'o'</span><span style="color: #339933;">,</span> 0 <span style="color: #009900;">&#125;</span><span style="color: #339933;">;</span><br />
<br />
<span style="color: #808080; font-style: italic;">/* The same C string */</span><br />
<span style="color: #993333;">char</span> hello<span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">=</span> <span style="color: #ff0000;">&quot;hello&quot;</span><span style="color: #339933;">;</span><br />
<br />
<span style="color: #808080; font-style: italic;">/* A pointer to a C string */</span><br />
<span style="color: #993333;">char</span> <span style="color: #339933;">*</span> hello <span style="color: #339933;">=</span> <span style="color: #ff0000;">&quot;hello&quot;</span><span style="color: #339933;">;</span></div></td></tr></tbody></table></div>
<p>strlen(), as the name suggests, returns the length of a string;</p>
<div class="codecolorer-container c default" style="overflow:auto;white-space:nowrap;border: 1px solid #9F9F9F;width:435px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="c codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #808080; font-style: italic;">/** <br />
&nbsp;* Return the length of a string.<br />
&nbsp;*<br />
&nbsp;* @param str The string to measure<br />
&nbsp;* @return &nbsp; &nbsp;The string's length<br />
&nbsp;*/</span><br />
size_t strlen<span style="color: #009900;">&#40;</span><span style="color: #993333;">const</span> <span style="color: #993333;">char</span> <span style="color: #339933;">*</span> str<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></div></td></tr></tbody></table></div>
<p>Some important things; <em>strlen(&#8220;hello&#8221;)</em> returns 5, not 6. The zero at the end doesn&#8217;t count as part of the string. Zero is the only thing that defines the end of a string, all sorts of unprintable and control characters can be inside a string. </p>
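<p>A small sketch of those rules, using the standard <em>strlen</em> and <em>sizeof</em>. The array names and the <em>check_lengths</em> helper are made up for illustration.</p>

```c
#include <assert.h>
#include <string.h>

/* Six bytes of storage: five letters plus the terminating zero. */
static const char hello[] = "hello";

/* Control characters are ordinary string content; only a zero byte ends it. */
static const char control[] = "a\tb\nc";

/* An embedded zero cuts the string short as far as strlen() is concerned. */
static const char cut[] = "ab\0cd";

/* Hypothetical helper that checks the claims made in the comments above. */
static void check_lengths(void)
{
    assert(sizeof hello == 6 && strlen(hello) == 5);
    assert(sizeof control == 6 && strlen(control) == 5);
    assert(sizeof cut == 6 && strlen(cut) == 2);
}
```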
<p>When someone is learning to program, a typical first attempt at a <em>strlen</em> implementation will be something like this:</p>
<h4>Method 1: an iterative for-loop</h4>
<div class="codecolorer-container c default" style="overflow:auto;white-space:nowrap;border: 1px solid #9F9F9F;width:435px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="c codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">size_t strlen<span style="color: #009900;">&#40;</span><span style="color: #993333;">const</span> <span style="color: #993333;">char</span> <span style="color: #339933;">*</span> str<span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; size_t len<span style="color: #339933;">;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span>len <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span> str<span style="color: #009900;">&#91;</span>len<span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span> len<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #b1b100;">return</span> len<span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
<p>This method is easy to understand, and read, but it&#8217;s not very good. It runs in time O(n) and involves a lot of jumps.</p>
<p>The next thing programmers usually realise is that they don&#8217;t have to use array indices, that pointers are enough.</p>
<h4>Method 2: sacrifice a variable and readability</h4>
<div class="codecolorer-container c default" style="overflow:auto;white-space:nowrap;border: 1px solid #9F9F9F;width:435px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="c codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">size_t strlen<span style="color: #009900;">&#40;</span><span style="color: #993333;">const</span> <span style="color: #993333;">char</span> <span style="color: #339933;">*</span> str<span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp;<span style="color: #993333;">const</span> <span style="color: #993333;">char</span> <span style="color: #339933;">*</span> ptr<span style="color: #339933;">;</span><br />
<br />
&nbsp; &nbsp;<span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span>ptr <span style="color: #339933;">=</span> str<span style="color: #339933;">;</span> <span style="color: #339933;">*</span>ptr<span style="color: #339933;">;</span> <span style="color: #339933;">++</span>ptr<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <br />
&nbsp; &nbsp;<span style="color: #b1b100;">return</span> ptr <span style="color: #339933;">-</span> str<span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
<p>Depending on how clever the compiler is, this code may be slightly faster, because there won&#8217;t be so many additions to the base pointer. We increment the pointer only, and use the difference between it and the base-pointer as the length. This code is less readable though, and probably counts as premature optimisation.</p>
<p>This method is still O(n) and still involves a lot of jumps. Let&#8217;s see what we can do about that next.</p>
<h4>Method 3: partially unroll the loop</h4>
<div class="codecolorer-container c default" style="overflow:auto;white-space:nowrap;border: 1px solid #9F9F9F;width:435px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="c codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">size_t strlen<span style="color: #009900;">&#40;</span><span style="color: #993333;">const</span> <span style="color: #993333;">char</span> <span style="color: #339933;">*</span> str<span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp;<span style="color: #993333;">const</span> <span style="color: #993333;">char</span> <span style="color: #339933;">*</span> ptr <span style="color: #339933;">=</span> str<span style="color: #339933;">;</span><br />
<br />
&nbsp; &nbsp;<span style="color: #b1b100;">while</span><span style="color: #009900;">&#40;</span>1<span style="color: #009900;">&#41;</span><br />
&nbsp; &nbsp;<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #b1b100;">if</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">!*</span><span style="color: #009900;">&#40;</span>ptr<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #000000; font-weight: bold;">break</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #b1b100;">if</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">!*</span><span style="color: #009900;">&#40;</span>ptr<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #000000; font-weight: bold;">break</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #b1b100;">if</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">!*</span><span style="color: #009900;">&#40;</span>ptr<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #000000; font-weight: bold;">break</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #b1b100;">if</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">!*</span><span style="color: #009900;">&#40;</span>ptr<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #000000; font-weight: bold;">break</span><span style="color: #339933;">;</span><br />
&nbsp; &nbsp; <span style="color: #009900;">&#125;</span><br />
<br />
&nbsp; &nbsp; <span style="color: #b1b100;">return</span> <span style="color: #009900;">&#40;</span>ptr <span style="color: #339933;">-</span> 1<span style="color: #009900;">&#41;</span> <span style="color: #339933;">-</span> str<span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
<p>Now we&#8217;re into serious unreadability territory, and by the way, the above is how <a href="http://cr.yp.to/">djb</a> implements <em>strlen</em>. But we have actually gained some efficiency, now for every jump operation that the loop creates, we test for the 0 value 4 times. The choice of <em>4</em> times is arbitrary here, but an interesting exercise would be to vary this number and test each possibility. </p>
<p>More tests will interfere with pipelines, but there will be fewer jumps. Who knows where the sweet spot is. Though this implementation is faster, it&#8217;s still O(n). Unfortunately in C, we&#8217;re doomed to an O(n) implementation, best case, but we&#8217;re still not done &#8230; we can do something about the very size of n.</p>
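<p>One way to try different unroll factors is to generate the variants with a macro and time each one. The following is only a sketch: the names <em>CHECK</em>, <em>DEFINE_UNROLLED</em>, <em>strlen_x1</em> through <em>strlen_x8</em> and <em>time_unrolled</em> are invented here, and <em>clock()</em> is a coarse timer. An optimising compiler may also rewrite these loops, so the results are only meaningful after looking at the generated code.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* One zero test; leaves the loop once the terminator has been consumed. */
#define CHECK if (!*(p++)) break;

/* Define a strlen variant whose loop body is `body`, i.e. CHECK repeated
 * as many times as the unroll factor. */
#define DEFINE_UNROLLED(name, body)                 \
    static size_t name(const char *str)             \
    {                                               \
        const char *p = str;                        \
        for (;;) { body }                           \
        return (size_t)((p - 1) - str);             \
    }

DEFINE_UNROLLED(strlen_x1, CHECK)
DEFINE_UNROLLED(strlen_x2, CHECK CHECK)
DEFINE_UNROLLED(strlen_x4, CHECK CHECK CHECK CHECK)
DEFINE_UNROLLED(strlen_x8, CHECK CHECK CHECK CHECK CHECK CHECK CHECK CHECK)

/* Return the CPU seconds spent calling fn(s) reps times. */
static double time_unrolled(size_t (*fn)(const char *), const char *s, int reps)
{
    volatile size_t sink = 0;   /* keeps the calls from being optimised away */
    clock_t start;
    int i;

    assert(fn != NULL && s != NULL && reps > 0);
    start = clock();
    for (i = 0; i < reps; i++)
        sink += fn(s);
    (void)sink;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

<p>Calling <em>time_unrolled(strlen_x4, input, 1000000)</em> for each variant, on inputs that resemble the real workload, gives a rough comparison.</p>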
<h4>Method 4: word-wise checks</h4>
<p>Just like the example I gave earlier, where we used integers to compare 4 bytes of a string all at once, we can do something even cleverer with strlen. We can construct a test, that with one small exception, can determine if <em>any</em> byte within a <em>long</em> is set to zero. This works for 2, 4 and 8 byte word-sizes.</p>
<p>It works by setting up a bit pattern like &#8220;01111110 11111110 11111110 11111111&#8221;, and then we add an arbitrary word to it. Anything except 0 in any of the bytes <em>should</em> cause an overflow into the neighbouring bytes. So by performing an addition, and then testing the &#8220;hole&#8221; bits, we can tell that there was likely a zero. There is one case where we&#8217;ll get a false-positive, which can happen when bit 31 is set &#8211; but that&#8217;s easy to check for. </p>
<p>It&#8217;s too long to reproduce here, but you can see the <a href="http://www.stdlib.net/~colmmacc/strlen.c.html">entire glibc implementation</a> as a great example.</p>
<p>With this method we have divided our problem space, N is now N/4 or N/8, and the checks happen much more quickly. Though with really small strings, this method may actually be <em>worse</em>. The reason for that is that word-wise checks also have to be word-aligned. So if the string starts on a mis-aligned byte we may have to step through up to 3 characters one at a time (for 32-bit words; up to 7 for 64-bit words) to become word-aligned. </p>
<p>Additionally when we do find a zero, we still have to perform 4 tests to figure out exactly which byte it was. If you had a lot of mis-aligned 7 byte strings, this method would be highly sub-optimal. </p>
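<p>To make the idea concrete, here is a sketch of a word-wise zero test. Note that it is not the glibc addition-and-holes pattern described above: it uses a different, well-known subtraction-based test that has no false positives. The function names are hypothetical. To stay within defined behaviour, the sketch assumes the caller knows the buffer capacity, that the capacity is a multiple of 4, and that a zero byte exists within it, so it never reads past the buffer.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Non-zero if any of the four bytes in w is zero. Subtracting 1 from each
 * byte borrows into that byte's top bit only when the byte was zero (or had
 * its top bit set already, which the "& ~w" term rules out). */
static int has_zero_byte(uint32_t w)
{
    return ((w - 0x01010101u) & ~w & 0x80808080u) != 0;
}

/* Length of the zero-terminated string in buf.
 * Assumptions (ours, for the sketch): cap is the buffer size in bytes,
 * cap is a multiple of 4, and the terminator lies within the buffer. */
static size_t wordwise_strlen_padded(const char *buf, size_t cap)
{
    size_t i = 0;

    assert(buf != NULL && cap % 4 == 0);

    /* Skip four bytes at a time while no byte in the word is zero. */
    while (i + 4 <= cap) {
        uint32_t w;
        memcpy(&w, buf + i, sizeof w);   /* alignment-safe load */
        if (has_zero_byte(w))
            break;
        i += 4;
    }

    /* Find exactly which of the (at most four) bytes was the zero. */
    while (i < cap && buf[i] != '\0')
        i++;

    return i;
}
```

<p>The final byte-by-byte loop is the &#8220;still have to perform 4 tests&#8221; cost mentioned above.</p>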
<h4>Method 5: Outsource the problem</h4>
<p>In the many-year war between RISC and &#8220;kitchen sink&#8221; architectures, the latter appears to have won. x86 is king, and that means we usually have plenty of instructions at our disposal. Two such instructions are &#8220;repnz&#8221; (repeat while non zero) and &#8220;scasb&#8221; (scan string), which combined allow us to instruct the CPU to go find the next zero in a range of memory, using whatever tricks it likes.</p>
<p>The CPU can implement any of the approaches we&#8217;ve already talked about, but much more is possible. Modern memory, and SRAM in particular, has been designed to make parallelised searches possible. The chips can be electrically probing many sequences of bits all at once, and feed into a tree of gates that makes sure the earliest match wins &#8211; O(log n), in hardware. This is highly optimal, and if you&#8217;re using strlen() on an x86 host right now, that&#8217;s probably what&#8217;s happening.</p>
<p>But what else can we do? What if we make some trade-offs about the very nature of C strings?</p>
<h4>Method 6: Add a cache</h4>
<p>C is rightly criticised for how it handles strings, it&#8217;s pretty dumb, and C buffer overflows remain a major source of security problems. There really isn&#8217;t any excuse. Treating strings as arbitrary regions of zero-terminated memory has one main advantage; it allows strings to be separated in-place. One can take &#8220;Hello World&#8221; and make it &#8220;Hello&#8221; and &#8220;World&#8221; with the very simple insertion of a zero byte.</p>
<p>But strings are not separated a whole lot, and it tends to be a one-time operation. If we throw away this &#8220;feature&#8221;, there is a much more optimal way to get string length &#8211; just keep recording it in a cache.</p>
<div class="codecolorer-container c default" style="overflow:auto;white-space:nowrap;border: 1px solid #9F9F9F;width:435px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="c codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #993333;">typedef</span> <span style="color: #993333;">struct</span> <span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #993333;">char</span> <span style="color: #339933;">*</span> str<span style="color: #339933;">;</span><br />
&nbsp; &nbsp; size_t len<span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span> <span style="color: #993333;">string</span><span style="color: #339933;">;</span><br />
<br />
size_t strlen<span style="color: #009900;">&#40;</span><span style="color: #993333;">const</span> <span style="color: #993333;">string</span> <span style="color: #339933;">*</span> s<span style="color: #009900;">&#41;</span><br />
<span style="color: #009900;">&#123;</span><br />
&nbsp; &nbsp; <span style="color: #b1b100;">return</span> s-&gt;<span style="color: #202020;">len</span><span style="color: #339933;">;</span><br />
<span style="color: #009900;">&#125;</span></div></td></tr></tbody></table></div>
<p>Of course, we needn&#8217;t modify the API here, instead of creating a new type, we could just keep a static index of pointer values (say a hash map) and record the lengths there. In fact, we could even have a compiler do this for us by defining the return value of the function as <em>const</em>. </p>
<p>But we do need to <em>enforce</em> the API. Now everything that deals with strings has to update this cache, and we can&#8217;t permit the programmer to mess with the string internally. Our string functions are the only valid way.</p>
<p>This is actually the norm. There are quite a few APIs for doing this in C, and it&#8217;s also what djb does in his code. Almost all modern languages take this approach, and have an internal record of every string length in their symbol table. Finally, we have a real O(1) solution.</p>
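<p>What &#8220;enforcing the API&#8221; might look like in practice is sketched below. This is not any particular library&#8217;s interface: the <em>lstring</em> type and its functions are hypothetical, and the point is only that every operation that changes the string also updates the cached length.</p>

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char  *data;  /* kept zero-terminated so it can still be handed to C APIs */
    size_t len;   /* cached length, excluding the terminator */
} lstring;

/* Create an lstring from a C string. Returns 0 on success, -1 on failure. */
static int lstring_init(lstring *s, const char *src)
{
    size_t n;

    assert(s != NULL && src != NULL);
    n = strlen(src);                 /* the one O(n) scan we pay for */
    s->data = malloc(n + 1);
    if (s->data == NULL)
        return -1;
    memcpy(s->data, src, n + 1);
    s->len = n;
    return 0;
}

/* Append a C string, updating the cached length. Returns 0 or -1. */
static int lstring_append(lstring *s, const char *src)
{
    size_t n;
    char *grown;

    assert(s != NULL && src != NULL);
    n = strlen(src);
    grown = realloc(s->data, s->len + n + 1);
    if (grown == NULL)
        return -1;                   /* the original string is left untouched */
    memcpy(grown + s->len, src, n + 1);
    s->data = grown;
    s->len += n;
    return 0;
}

/* O(1): no scanning, just report the cached value. */
static size_t lstring_len(const lstring *s)
{
    return s->len;
}

static void lstring_free(lstring *s)
{
    free(s->data);
    s->data = NULL;
    s->len = 0;
}
```

<p>Anything that writes into <em>data</em> behind the API&#8217;s back would leave <em>len</em> stale, which is exactly why the string functions have to be the only valid way in.</p>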
<h4>What next?</h4>
<p>But it doesn&#8217;t end there. Every solution above had trade-offs, and there are plenty of other ways of doing it. </p>
<p>Some text processors, for example, are designed to accommodate lots of string concatenation, so strings are optimistically allocated much more memory than they need, with lots of zeroes at the end. For these strings, if we know how much memory is allocated, we can implement strlen as a word-wise binary search, which will be O(log n).  </p>
<p>Some XML processors, for another example, know that some very high percentage of the time a closing tag will be exactly the same length as the name of the opening tag + 3 (&#8216;&lt;&#8217;, &#8216;/&#8217; and &#8216;&gt;&#8217;) &#8211; and take that as a best guess, and skip straight to that length and see if they find what they expect. </p>
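<p>A sketch of that binary search follows. For brevity it searches byte by byte rather than word by word, and the function name is made up. It relies on an assumption that must hold for the trick to work at all: every byte from the end of the string to the end of the allocation is zero, and the string itself contains no zero bytes.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Length of a string stored in a zero-padded buffer of cap bytes.
 * Precondition: buf[len .. cap-1] are all zero. Runs in O(log cap). */
static size_t bsearch_strlen_padded(const char *buf, size_t cap)
{
    size_t lo = 0;      /* every byte before lo is known to be non-zero */
    size_t hi = cap;    /* every byte from hi onwards is known to be zero */

    assert(buf != NULL);

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (buf[mid] != '\0')
            lo = mid + 1;   /* the string extends past mid */
        else
            hi = mid;       /* the first zero is at mid or earlier */
    }
    return lo;
}
```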
<p>Other times, optimisers perform statistical analysis on their inputs and find out where the peaks are, and optimise the tests to go for those paths first. There really is an endless series of possibilities, and this is for something as simple as getting the length of a string. Underneath every little problem is a world of opportunity and a wealth of material from very clever people. (The Varnish source code, and the <a href="http://varnish.projects.linpro.no/wiki/ArchitectNotes">Architect Notes</a> are also a great source of inspiration).</p>
<p>A few years ago, at <a href="http://www.joost.com/">Joost</a>, we came across exactly this problem: an application that was spending about 10% of its time inside strlen. Within an hour or so of rewriting some primitives, it was down to less than a hundredth of a percent. You can&#8217;t drag and drop that kind of improvement in an IDE.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.stdlib.net/~colmmacc/2009/03/01/optimising-strlen/feed/</wfw:commentRss>
		<slash:comments>20</slash:comments>
		</item>
	</channel>
</rss>

