Weighty matters
http://www.stdlib.net/~colmmacc/2011/05/15/weighty-matters/
Sun, 15 May 2011 03:49:57 +0000

One of my favourite programming exercises is to write a text generator using Markov Chains. It usually takes a relatively small amount of code and is especially useful for learning new programming languages. A useful goal is to be able to have a tool that ingests a large body of text, breaks it into N-tuples (where N is chosen by the operator), and then emits a random text of length L that uses those tuples at the same frequency as the original body did.
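
For concreteness, here is a minimal sketch of the exercise (my own illustration, not code from the post), working at the word level with 2-tuples and reading from a hypothetical corpus.txt. It leans on the naive trick of storing repeated successors in a list, so that random.choice is already a frequency-weighted selection, which is exactly the problem the rest of this post pulls apart.

import random
from collections import defaultdict

def build_chain(words, n=2):
    # count the successors of every (n-1)-word prefix; repeats in the
    # successor list encode the observed frequency
    chain = defaultdict(list)
    for i in range(len(words) - n + 1):
        prefix = tuple(words[i : i + n - 1])
        chain[prefix].append(words[i + n - 1])
    return chain

def generate(chain, length):
    prefix = random.choice(list(chain.keys()))
    output = list(prefix)
    while len(output) < length:
        successors = chain.get(tuple(output[-len(prefix):]))
        if not successors:
            # dead end: re-seed from a random prefix
            prefix = random.choice(list(chain.keys()))
            output.extend(prefix)
            continue
        output.append(random.choice(successors))
    return ' '.join(output)

words = open('corpus.txt').read().split()   # corpus.txt is a placeholder
print(generate(build_chain(words, n=2), 50))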

On its own, that task can produce some fun results, but it’s also a very repurpose-able technique. You can use this kind of statistical generation to simulate realistic network jitter (record some N-tuples of observed RTT with ping), or awesome simulated-user fuzz tests (record some N-tuples of observed user inputs). It’s surprising that it isn’t more common.

But in my experience of working with newcomers to these problems, a common first tripping point is how to do weighted selection at all. Put most simply, if we have a table of elements;

element   weight
A         2
B         1
C         1

how do we write a function that will choose A about half the time, and B and C about a quarter each? It also turns out that this is a really interesting design problem. We can choose to implement a random solution, a non-random solution, a solution that runs in constant time, a solution that runs in linear time and a solution that runs in logarithmic time. This post is about those potential solutions.

Non-random solutions

When coming to the problem, the first thing to decide is whether we really want the selection to be random or not. One could imagine a function that tries to keep track of previous selections, for example;

elements = [ { 'name' : 'A' , 'weight' : 2 },
             { 'name' : 'B' , 'weight' : 1 },
             { 'name' : 'C' , 'weight' : 1 } ]

def select_element(elements, count):
  # The counter cycles through the sum of the weights, so each element
  # gets a share of the cycle proportionate to its weight.
  total_weight = sum(e['weight'] for e in elements)
  cumulative = 0
  for element in elements:
    cumulative += element['weight']
    if count < cumulative:
      return ( element ,
               (count + 1) % total_weight )

# select some elements
c = 0
(i , c) = select_element(elements, c)
(j , c) = select_element(elements, c)
(k , c) = select_element(elements, c)

This function uses the “count” parameter to remember where it left off last time, and runs in time proportionate to the total number of elements. Used correctly, it will return A the first two times it’s called, followed by B, then C, then A twice again, and so on. Hopefully that’s obvious.

There’s a very simple optimisation to make it run in constant time: just sacrifice memory proportionate to the total sum of the weights;

elements = [ { 'name' : 'A' },
             { 'name' : 'A' },
             { 'name' : 'B' },
             { 'name' : 'C' } ]

def select_element(elements, count):
  # The weights are flattened into repeated entries, so stepping through
  # the list visits each element in proportion to its weight.
  return ( elements[count] ,
           (count + 1) % len(elements) )

This basic approach is surprisingly common; it's how a lot of networking devices implement weighted path selection.

Random solutions

But if we're feeding into some kind of simulation, or a fuzz-test, then randomised selection is probably a better fit. Luckily, there are at least three ways to do it. The first approach is to use the same space-inefficient layout as our "optimised" non-random selection: flatten the list of elements into an array and just randomly jump into it;

import random

elements = [ { 'name' : 'A' },
             { 'name' : 'A' },
             { 'name' : 'B' },
             { 'name' : 'C' } ]

def select_element(elements):
  return random.choice(elements)

As before, this runs in constant time, but it's easy to see how this could get really ugly if the weights are more like;

element   weight
A         998
B         1
C         1

Unfortunately, common datasets can follow distributions that are just like this (Zipfian, for example). It would take an unreasonably large amount of space to store all of the words in the English language, proportionate to their frequencies, using this method.

Reservoir sampling

But luckily we have a way to avoid all of this space, and it's got some other useful applications too. It's called reservoir sampling, and it's pretty magical. Reservoir sampling is a form of statistical sampling that lets you choose a limited number of samples from an arbitrarily large stream of data, without knowing how big it is in advance. It doesn't seem related, but it is.

Imagine you have a stream of data, and it looks something like;

"Banana" , "Car" , "Bus", "Apple", "Orange" , "Banana" , ....

and you want to choose a sample of 3 events. All events in the stream should have equal probability of making it into the sample. A better real-world example is collecting a sample of 1,000 web server requests over the last minute, but you get the idea.

What a reservoir sample does is simple;

import random

def reservoir_sample(events, size):
  events_observed = 0
  sample = [ None ] * size

  for event in events:
    if events_observed >= len(sample):
      # randint is inclusive, so r is drawn from 0 .. events_observed
      r = random.randint(0, events_observed)
    else:
      r = events_observed

    if r < len(sample):
      sample[r] = event
    events_observed += 1

  return sample

and how this works is pretty cool. The first 3 events have a 100% likelihood of being sampled; r will be 0, 1 and 2 respectively. The interesting part comes after that. The fourth element has a 3/4 probability of being selected (r is drawn from 0, 1, 2 and 3, and only 0, 1 or 2 place it in the sample). So that's pretty simple.

But consider the likelihood of an element already in the sample of "staying". For any given element, there are two ways of "staying". One is that the fourth element is not chosen (1/4 probability); the other is that the fourth element is chosen (3/4 probability) but a different element is selected to be replaced (2/3 probability). If you do the math;

1/4 + (3/4 * 2/3)   =>
1/4 + 6/12          =>
3/12 + 6/12         =>
9/12                =>
3/4

We see that any given element has a 3/4 likelihood of staying. Which is exactly what we want. There have been four elements observed, and all of them have had a 3/4 probability of being in our sample. Let's extend this an iteration, and see what happens at element 5.

When we get to this element, it has a 3/5 chance of being selected (r is a random number from the set 0,1,2,3,4 - if it is one of 0, 1 or 2 then the element will be chosen). We've already established that the previous elements had a 3/4 probability of being in the sample. Those elements that are in the sample again have two ways of staying: either the fifth element isn't chosen (2/5 likelihood) or it is, but a different element is replaced (3/5 * 2/3). Again, let's do the math;

3/4 * (2/5 + (3/5 * 2/3))  =>
3/4 * (2/5 + 6/15)         =>
3/4 * (2/5 + 2/5)          =>
3/4 * 4/5                  =>
3/5

So, once again, all elements have a 3/5 likelihood of being in the sample. Every time another element is observed, the math gets a bit longer, but it stays the same - they always have equal probability of being in the final sample.
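
If the algebra feels slippery, a quick empirical check helps. This little harness is my own addition (it assumes the reservoir_sample function above): it counts how often each of ten stream items lands in a sample of three, and each should come out at roughly 3/10.

from collections import Counter

runs = 100000
tallies = Counter()
for _ in range(runs):
    for item in reservoir_sample(range(10), 3):
        tallies[item] += 1

for item in sorted(tallies):
    print("%d %.3f" % (item, tallies[item] / float(runs)))
# every item should print close to 0.300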

So what does this have to do with random selection? Well imagine our original weighted elements as a stream of events, and that our goal is to choose a sample of 1.

import random

elements = [ { 'name' : 'A' },
             { 'name' : 'A' },
             { 'name' : 'B' },
             { 'name' : 'C' } ]

def select_element(elements):
  # A reservoir of size 1: each element overwrites the current choice
  # with probability 1 / (number of elements observed so far).
  elements_observed = 0
  chosen_element = None
  for element in elements:
    r = random.randint(0, elements_observed)
    if r < 1:
      chosen_element = element
    elements_observed += 1
  return chosen_element

So far, this is actually worse than what we've had before. We're still using space in memory proportionate to the total sum of weights, and we're running in linear time. But now that we've structured things like this, we can use the weights to cheat;

import random

elements = [ { 'name' : 'A' , 'weight' : 2 },
             { 'name' : 'B' , 'weight' : 1 },
             { 'name' : 'C' , 'weight' : 1 } ]

def select_element(elements):
  # A weighted reservoir of size 1: each element displaces the current
  # choice with probability weight / total-weight-observed-so-far.
  total_weight = 0
  chosen_element = None
  for element in elements:
    total_weight += element['weight']
    r = random.randint(0, total_weight - 1)
    if r < element['weight']:
      chosen_element = element

  return chosen_element

It's the same kind of self-correcting math as before, except now we can take bigger jumps rather than having to take steps of just 1 every time.
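
As a quick sanity check (my addition, not from the original post), tallying a few hundred thousand calls to the weighted version above should show A being chosen about half the time and B and C about a quarter each.

from collections import Counter

runs = 100000
tallies = Counter(select_element(elements)['name'] for _ in range(runs))
for name in sorted(tallies):
    print("%s %.3f" % (name, tallies[name] / float(runs)))
# expect roughly: A 0.500, B 0.250, C 0.250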

We now have a loop that runs in time proportionate to the total number of elements, and also uses memory proportionate to the total number of elements. That's pretty good, but there's another optimisation we can make.

Using a tree

If there are a large number of elements, it can still be a pain to have to iterate over them all just to select one. One solution to this problem is to compile the weighted elements into a weighted binary tree, so that we need only perform O(log) operations.

Let's take a larger weighted set;

element   weight
A         1
B         2
C         3
D         1
E         1
F         1
G         1
H         1
I         1
J         1

which we can express in tree form, where each node has the cumulative weight of its children.

Tree-solutions like this lend themselves easily to recursion;

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
import random

elements = [ {  'name' : 'A',
                'weight' : 10 },
             {  'name' : 'B',
                'weight' : 20 },
             {  'name' : 'C',
                'weight' : 30 },
             {  'name' : 'D',
                'weight' : 10 },
             {  'name' : 'E',
                'weight' : 10 },
             {  'name' : 'F',
                'weight' : 10 },
             {  'name' : 'G',
                'weight' : 10 },
             {  'name' : 'H',
                'weight' : 10 },
             {  'name' : 'I',
                'weight' : 10 },
             {  'name' : 'J',
                'weight' : 10 } ]

# Compile a set of weighted elements into a weighted tree
def compile_choices(elements):
    if len(elements) > 1:
        left  = elements[ : len(elements) // 2 ]
        right = elements[ len(elements) // 2 : ]

        return [ { 'child': compile_choices(left) ,
                   'weight' : sum( map( lambda x : x['weight'] , left ) ) } ,
                 { 'child': compile_choices(right) ,
                   'weight' : sum( map( lambda x : x['weight'] , right ) ) } ]
    else:
        return elements

# Choose an element from a weighted tree
def tree_weighted_choice(tree):
    if len(tree) > 1:
        total_weight = tree[0]['weight'] + tree[1]['weight']
        if random.randint(0, total_weight - 1) < tree[0]['weight']:
            return tree_weighted_choice( tree[0]['child'] )
        else:
            return tree_weighted_choice( tree[1]['child'] )
    else:
        return tree[0]['name']

tree = compile_choices(elements)

And there we have it! A pretty good trade-off: we get sub-linear selection time with a relatively small overhead in memory.
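
To see the tree in action, here is an illustrative usage (my own, not from the post) that samples from the compiled tree and compares the observed frequencies with the weights; the total weight is 130, so A should show up around 10/130ths of the time, B around 20/130ths, and so on.

from collections import Counter

runs = 130000
tree = compile_choices(elements)
tallies = Counter(tree_weighted_choice(tree) for _ in range(runs))
for name in sorted(tallies):
    print("%s %.3f" % (name, tallies[name] / float(runs)))
# expect roughly: A 0.077, B 0.154, C 0.231, and 0.077 for each of D to J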

A huffman(ish) tree

I should have included this in the first version of this post, and Fergal was quick to point it out in the comments, but there is a further optimisation we can make. Instead of using a tree that is balanced purely by the number of nodes, as above, we can use a tree that is balanced by the sum of the weights.

To re-use his example, imagine if you have elements that are weighted [ 10000, 1 , 1 , 1 ... ] (with a hundred ones). It doesn't make sense to bury the very weighty element deep in the tree; instead it should go near the root, so that on average we minimise the expected number of lookups.

def huffman_choices(elements):
    if len(elements) > 1:
        total_weight = sum( map( lambda x : x['weight'] , elements ) )
        sorted_elements = sorted(elements, key = lambda x : x['weight'] )[::-1]

        left = []
        right = []
        observed_weight = 0
        for element in sorted_elements:
            if observed_weight < total_weight / 2:
                left.append(element)
            else:
                right.append(element)

            observed_weight += element['weight']

        return [ { 'child': huffman_choices(left) ,
                   'weight' : sum( map( lambda x : x['weight'] , left ) ) } ,
                 { 'child': huffman_choices(right) ,
                   'weight' : sum( map( lambda x : x['weight'] , right ) ) } ]
    else:
        return elements
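
As a rough illustration of the difference (mine, with made-up weights in the spirit of Fergal's example), compiling a heavily skewed set with huffman_choices puts the weighty element right beside the root, while the count-balanced compile_choices buries it several levels deep.

# assumes compile_choices and huffman_choices as defined above
skewed = [ { 'name' : 'HEAVY', 'weight' : 10000 } ] + \
         [ { 'name' : 'tiny%d' % i, 'weight' : 1 } for i in range(100) ]

def depth_of(tree, name, depth=0):
    # walk the tree and report how many comparisons it takes to reach an element
    if len(tree) > 1:
        for branch in tree:
            found = depth_of(branch['child'], name, depth + 1)
            if found is not None:
                return found
        return None
    return depth if tree[0]['name'] == name else None

print("%d %d" % (depth_of(compile_choices(skewed), 'HEAVY'),
                 depth_of(huffman_choices(skewed), 'HEAVY')))
# the count-balanced tree buries HEAVY several levels deep;
# the weight-balanced tree keeps it one comparison from the root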

None of these techniques are novel; in fact they're all quite common and very old, yet for some reason they don't have broad awareness. Reservoir sampling - which on its own is incredibly useful - is only afforded a few sentences in Knuth.

But dealing with randomness and sampling is one of the inevitable complexities of programming. If you've never done it, it's worth taking the above and trying to write your own weighted Markov chain generator. And then have even more fun thinking about how to test it.

Prime and Proper
http://www.stdlib.net/~colmmacc/2010/09/02/prime-and-proper/
Thu, 02 Sep 2010 05:57:14 +0000

A common software engineering pattern is the need to test elements for set membership.

Practical questions like “is this item in my cache” arise surprisingly often. Bloom Filters are a great mechanism for performing this test, and if you haven’t encountered them before – they’re well worth knowing about. But over the last year, I’ve been experimenting with a different method for these kinds of membership tests on enumerated sets, and thought it was worth sharing.

It’s definitely not the first time the technique has been used; Cian found some papers on its use. But it’s not a common pattern, which is a pity – because it’s incredibly simple, and lends itself very well to modern hardware.

First things first, some huge disclaimers;

  • The approach only works on enumerated sets. Each element in the problem space has to be assigned a unique counted number. Hashes won’t do.
  • The method is constrained by the size of the product of the total number of elements and the maximum size of any set.

So, it’s very simple: take each element in the problem space, assign it a unique prime number, and then represent sets as the products of those primes. This works because of the fundamental theorem of arithmetic.

Here’s a summary of operations;

Computing a set:
Compute the product of all of the elemental primes in the set.

s = a * b * c

Testing for element membership:
Test the modulus of the set number.

(s % a) == 0

Adding to the set:
Multiply the existing set product by the new element’s prime number.

s *= d

Removing from the set:
Divide the existing set product by the element’s prime number.

s /= a
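
The post doesn't include code, but a minimal sketch of the idea might look like this, with a hypothetical hand-picked assignment of primes to a small enumerated problem space:

# enumerate the problem space once, assigning each element a unique prime
elements = ['red', 'green', 'blue', 'yellow']
prime_of = dict(zip(elements, [2, 3, 5, 7]))

s = prime_of['red'] * prime_of['blue']   # the set {red, blue}
print(s % prime_of['red'] == 0)          # membership test: True
print(s % prime_of['green'] == 0)        # False

s *= prime_of['green']                   # add green to the set
s //= prime_of['red']                    # remove red from the set
print(s % prime_of['red'] == 0)          # False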

Now at this point, you’re probably skeptical about the usefulness of the technique, given the constraints. Obviously other operations like unions and intersections between sets are possible, but they require factorisation – and so are not particularly efficient (though you’d be surprised how quickly they do run). But look at the benefits;

  • Unlike Bloom filters, there are zero false positives, and zero false negatives. The method is 100% precise.
  • Due to their use in cryptography; libraries, CPUs and other hardware have arisen that are highly efficient at computing the products and modulus of very large numbers.

As a case-in-point, let’s use the internet as an example. Suppose you want to model the internet as a system of interconnected autonomous systems and paths. A really common test when using such models is to determine whether or not a particular system is on a particular path.

If we take the internet as 100,000 autonomous systems, and the longest practical path on the internet as containing 20 such systems (longer paths exist, but are oddities – in real terms these numbers are both over-estimates), the largest product we would ever need to compute is smaller than 2^410. That’s much smaller than even modern cryptographic primes themselves, and much much smaller than their products. There is a lot more room.

Actually, that was my first real-world use-case for this technique – modeling the internet in a few hundred lines of python, on a laptop – as a prototype. Surprisingly, a Macbook Air could perform well over one million such membership tests per second without skipping a beat – after reordering to assign the most popular AS numbers the lowest prime-numbers. And incidentally, if that kind of problem interests you – Cloudfront are hiring.

Now a question, for the information theorists: how is it that this method can seemingly perfectly encode N bits of information in fewer than N bits? For example it takes 2 bits to represent the number “2” and 2 bits to represent the number “3”, yet to represent both through their product, “6”, it takes only 3 bits. But this isn’t like addition or some other operation – there really only is one way of factorising 6 to get back to “3” and “2”. The information is seemingly encoded perfectly.

Fact-checking Ireland Offline
http://www.stdlib.net/~colmmacc/2010/03/29/fact-checking-ireland-offline/
Mon, 29 Mar 2010 19:40:01 +0000

One of the saddest things about Ireland and technology is the degree to which the incumbent telco, Eircom, has been permitted to dominate and skew the market. The privatisation of Eircom was very very badly designed, effectively creating a monopoly for important national infrastructure. There have been multiple legal cases between Eircom and the Regulator, and many many complaints from consumers and competitors, but over the years little has happened.

Sadder still, and not unrelated, is that the most prominent activist group on this issue; Ireland Offline, have consistently been one of the most ineffective, maladministered bunch of ill-informed whinge-baggers going. Despite media prominence, many many column inches, a strong membership, public meetings, and more, I can’t think of a single achievement Ireland Offline could point to as a success. Some of their goals may have been achieved over the years, but only after geological delay and through coincidence.

On the contrary, I can think of many many falsehoods, myths and bizarre theories the group has promoted. My favourite being the simple falsehood that the majority of schools in Ireland have satellite broadband connectivity. But the latest effort takes the prize.

In an anonymous, unattributed blog-post with no contact details; http://irelandoffline.org/2010/03/eircom-and-next-generation-broadband/, Ireland Offline are making some very strange claims.

eircom today have announced their “Next Generation Broadband” promising speeds of up to 8MB to all subscribers, they claim to have also finally addressed the contention issue (which they call congestion). What is not in the headlines is the subtle but important change from a speed based model to a usage based model of charging.

In reality, there has always been a usage-based model of charging. In fact this is why Ireland Offline was founded, when Esat decided to place caps on their “no limits” service. Also in reality – the real costs of operating a network are aligned with usage and are not fixed. Anyone with even a modicum of familiarity about how networks and economics work can realise that the only long-term sustainable business models for Internet consumption are going to be usage-based.

The only alternative would be for the light users to effectively subsidise heavier ones. Does that really make sense? Is that equitable? Is there really a solid case for internet socialism?

Complaints about the pricing are justified, but the model? This kind of demand is very alienating, and in place of such childishness, it would be much better to see a concerted economic argument for better broadband. What is the lack of good broadband doing for business, and so on.

The only thing “Next Generation” about this product is the charging policy. This is a 200% increase on the average user’s bill for those who can least afford it. This product is only Dublin based for the moment.

Nowhere can I find any evidence of this 200% increase. My bill would decrease; I don’t know if I’m an average user or not, but I’d have thought I am.

This is yet another attack on the hard pressed telecommunications consumers of Ireland. Not only do they have to endure the highest line-rental on the planet they now have to endure a €2 charge for each Gigabyte they go over their paltry caps. For instance the “Broadband Basic” has a minuscule 10GB cap barely enough for the average “tech savy” family.

In reality this package will cost €50 for all but the lightest users. That’s €50 + line rental making the total €75.36 per month. This is hardly a giant leap forward it’s more like a gigantic leap into the dark past and into your wallet.

This is just false. Firstly, the cap for the Broadband basic (also called Broadband regular) package is 30GB, and line-rental is included. At least according to Eircom. Hell, it almost makes me feel good about giving Eircom my business.

In order to incur a € 76 bill, you’d actually have to use at least 43 GB. That’s not light usage. I’ve heard that there’s a ceiling to the charges too, at 50GB, which means that if you use 80GB or more, you have a total bill of € 101, but I can’t find any documentation supporting this. I’m guessing Eircom continue to reserve the right to kick off persistent ultra-heavy users, but still; even 43GB is quite a lot of bytes, and about 35 hours of HD streaming video.

Now Eircom’s product launch is not without its faults. It’s still pretty expensive, by international standards, and in my opinion their statement that “With Next Generation Broadband (NGB) you will not be sharing your broadband connection with other Internet users.” is simply false; upstream congestion can and will occur. The fundamental competitiveness problems in the market are not being fixed, and this will serve as a distraction.

But forget all of that: the biggest problem with the product is that it is only available in Dublin. Gee, that sounds like the kind of thing that a lobby group should be leading with. It’s almost like it’s exactly the kind of complaint that many many rural politicians would get onboard about. Instead it’s more of a byline to kooky economics and mistaken information.

Ireland Offline; seriously? Is this what it means to “be a source of accurate and impartial information on relevant technologies”? Come on.

Holy Orders
http://www.stdlib.net/~colmmacc/2010/01/02/holy-orders/
Sat, 02 Jan 2010 15:08:21 +0000

So, yesterday marked the day that the new Blasphemy law came into force in Ireland. It also marked the day when Atheist Ireland published 25 “blasphemous” quotes, in a supposed act of defiance.

Atheist Ireland have gone about it in a very strange way; the url in their blog post is not a hyperlink, and the quotes aren’t simply in their post or on their front page. That’s their prerogative, of course, but I do wonder if it’s a sign of overt caution. Either way, the circus seems to have worked, and CNN and the BBC have both picked it up, among many more.

As a press-stunt, it’s genius. Getting the attention of the international media like that is not easy, and through careful choice of celebrities to quote, and the right tone, Atheist Ireland have pulled off a well-executed PR coup.

But as an act of online advocacy, and of affecting political change, frankly it’s stupid. And I think that Atheist Ireland will have approximately zero success. Such is the magnitude of their “not getting it” that they are probably forever doomed to an existence of committeeism and tokenism in near-equal measure.

Firstly, the action itself is ineffective, it does not – in my opinion – constitute any kind of an offence. Let’s go to the source, the 2009 Defamation Act;

(2) For the purposes of this section, a person publishes or utters blasphemous matter if—

(a) he or she publishes or utters matter that is grossly abusive or insulting in relation to matters held sacred by any religion, thereby causing outrage among a substantial number of the adherents of that religion, and

(b) he or she intends, by the publication or utterance of the matter concerned, to cause such outrage.

(3) It shall be a defence to proceedings for an offence under this section for the defendant to prove that a reasonable person would find genuine literary, artistic, political, scientific, or academic value in the matter to which the offence relates.

Clearly there is political, scientific and academic value in Atheist Ireland’s statements. If they really wanted to engage in civil disobedience, it would take something akin to cartoons of the prophet, getting it on in a threesome with Buddha and Jesus while being bitten by some vampire Jehovah’s Witnesses. That’s what the law was designed to impede. To be clear, I don’t agree with that law; I think freedom of speech is more important than a bizarre right not to be offended, but the law simply isn’t of the nature that Atheist Ireland and others imply.

I’m not sure if it’s willful misrepresentation, but conveying the impression that the new law is designed to outlaw such trivial, and relatively innocuous, statements of the sort that Atheist Ireland quotes only serves to undermine Atheist Ireland’s own credibility. In light of the actual facts, it’s hard to take them seriously on the issue. Sacrificing integrity and accuracy for punchy PR is never a great idea, long term.

To be fair to Atheist Ireland, they don’t actually claim that their statements are prosecutable, and are making the broader (but again, tokenistic) point that the law is simply silly and unworkable. Great, that’s an important point – but what use is making so much noise about it? This actually entrenches politicians and makes them less and less likely to respond favourably. It’s ironic that this is the kind of reactionary “we must be listened to” tokenism that led to the law in the first place.

PR-led advocacy groups, which are really pressure groups, almost never work in Ireland. There simply isn’t any political incentive to respond to pressure on these kinds of issues. Across all of Ireland there are maybe 200 swing votes on the Blasphemy issue, thinly spread across the constituencies. Elected politicians, rightly, place it very low on the agenda. Going to 2 funerals, fixing some lamps, and changing a speed limit or two will get you as many votes – and in one parish.

Real, successful, advocacy groups don’t look like this. Instead there are small, targeted and strategic efforts. It’s lots of small meetings, tactically-directed briefs and letters and relationship building. The effective groups work somewhat within the system, more quietly, and get a lot more done. Groups like Barnardos get more mileage from lobbying the Department of Finance behind the scenes than they do out of a hundred press releases.

When we were running our campaigns against E-voting, I’m convinced we made ten times as much progress by working quietly behind the scenes than we ever did through PR-based activity. In order to run a successful effort, I think it’s important to keep some basics in mind;

  • Think in terms of the people you are trying to influence, and how they perceive it. How do you make your cause be in their interest? What can you do for them?

  • If it’s a political change you need, how do you frame the issue in terms of jobs, money or core political ideology (ideally patriotism)? If you need to influence a small number of politicians, how do you make the issue about their legacy?

  • If you plan to take your case to the courts, how do you influence the context in which judges rule? How do you develop and frame legal theory, how do you make it such that “your side” seems like the obviously equitable one, to judges and their peer-group. An article in a legal journal espousing your side will go a long way.

  • When you do make a press-release or PR effort, consider very carefully the question “How will this actually progress our cause? What are we aiming to achieve here?”. Inciting rage, generating “pressure” and elevating one’s personal profile should be non-goals; they aren’t effective. A lot of progress can be made by suppressing your ego and not taking credit; ghost-writing political speeches and news editorials is incredibly common, for example. Have those been considered first?

    An effective goal would be “increasing awareness, and converting this into membership and resources”. A clueful organisation ends their campaign with a notice to donate money, rather than to sign a petition. It’s an important distinction; the money can help make real change, most petitions are ignored.

Bottom line; always judge an advocacy group by the amount of actual progress they make on their issues, not column inches. And if you are an advocacy group, ask at every stage – for every action – “what is the sequence of events I hope to trigger that will actually cause the change I desire?”.

Now, to be more constructive. What would I do? Firstly, I think the priority should be to convert all of the PR into money, and to use that money to fund legal research. Small, but targeted academic briefs – aiming for 3 or more papers a year establishing the jurisprudence of the blasphemy law. The aim would be to establish a credible context in which it showed that the law is out of place, that it doesn’t fit with many of our international treaty commitments and is comparatively regressive.

Next priority would be to start to frame the issue as anti-republican. Ireland is at a very important cross-roads, in the wake of the Ryan and Murphy reports there are calls for more church-state separation. The way to capitalise on this is to make the case for a newly invigorated Republicanism, one of the great Irish political ideals. Republics are supposed to be by and for the people, relatively free of church interference. Our proclamation says;

The Republic guarantees religious and civil liberty, equal rights and equal opportunities to all its citizens, and declares its resolve to pursue the happiness and prosperity of the whole nation and all of its parts, cherishing all of the children of the nation equally and oblivious of the differences carefully fostered by an alien government, which have divided a minority from the majority in the past.

It’s not that hard to frame the blasphemy law as anti-republican.

With a few short years of that kind of hard work, it would be possible to create a context in which an Irish politician could see it as in their interest, and in the betterment of their legacy, that they bravely spear-head a new Republican ideal.

Backed by research and briefs which will help them not look stupid to their peer-group, or in the face of opposition questions, we can give them the confidence to propose it in the first place. This is what achieving real change looks like: careful groundwork. The system simply isn’t going to respond to whining from a marginalised minority group.

The good news is that it’s doable. It took us a long time to get it finished, but a relatively small group of e-voting campaigners finally managed to get a key government policy 100% reversed. And this was in spite of far more political (and real economic) capital having been invested than in the new blasphemy law. There were many many small meetings, letters, briefing documents, ghost-written speeches and editorials, researched papers, legal strategies and only about 2 press releases per year on average – with a cluster around 2004 during the run up to the elections.

To me, based on experience with ICTE and other lobby groups, that’s a lot more like what this kind of successful advocacy looks like. I do hope Atheist Ireland, or another secular-interest group, can find the time and experience to develop these issues properly, but for now it doesn’t seem promising.

Update: Throwing caution to the wind, Atheist Ireland have now made the url I mentioned a hyperlink and published the quotes on their front page. Also, there’s some lively discussion in the comments to this post.

Point Break
http://www.stdlib.net/~colmmacc/2009/12/28/point-break/
Mon, 28 Dec 2009 21:12:52 +0000

So, about 2 minutes after taking this photo;

[Photo: Hot!]

I slipped on the paraffin-laden solid granite pavement, everything went flying, and I ended up getting a very different kind of photo entirely;

That’s my right arm, and somewhere in there a minimally-deforming radial fracture that I don’t remotely have the training to actually see. Though I definitely don’t like the look of that suspiciously lumpy bit of bone. Here’s another look;

As fractures go, I think I got the best kind. It’s now 3 weeks later and I’m almost back to a full range of movement and strength is starting to return. I can write, type, take photos and most importantly, play music once again. The people at St. James’s Hospital have been very good, and if you ever plan to break anything I can recommend doing it near them.

Science is really really cool. Randomised controlled trials have shown that a simple millennia-old sling (not a cast) is the best treatment, so that’s what I got. About a century of rigorous bio-chemical engineering has led to little pills I could buy in a pharmacy that magically suppress complex sources of pain with minimal side-effects.

Over that same century we’ve progressed from a naive understanding of “Röntgen radiation” as mysterious emanations from a vacuum tube, to a complex quantum-mechanical model of X-Ray interactions that allows us to record these highly-ionising photon jet-streams on a semi-conductor, convert the impression into a digital image and then have it pop up on my clinician’s desktop. Right now, reading a random blog on the internet, you can see inside my body – as easily as you might look out the window.

And social science is used too. Because people frequently forget their appointments, there are text message reminders; a few days, and one day before anything that’s scheduled. Again, a trial has shown that this is both cost-effective and clinically beneficial. And when I go to the physio-therapy clinic, I get given a simple set of tried and tested exercises that have been shown to lead to improvements.

Cooler still is that despite all of the science, and care and attention involved in the whole process – really it’s the body doing the work. Through some magic – mostly unknown – system of DNA signaling, controlled protein unfolding, stem-cells and “stuff”, tiny microscopic organic “bits” somehow communicate and coordinate the building of whole new bone, mostly in the right place. So without having to do all that much, I can play Fussball and write dumb blog entries again.

It’s almost worth doing just for the experience.

Con-science
http://www.stdlib.net/~colmmacc/2009/12/16/con-science/
Wed, 16 Dec 2009 17:43:38 +0000

I’m sitting on a train, the Enterprise, moving along at about 140 kilometers per hour between Drogheda and Dublin. Something is causing practically every particle in my body to spark into a new life at a different position in space one instant to the next.

Dancing and spinning

For each one of those particles, there’s some chance it could just pop into life on the far side of the moon, or even the universe. But something, I guess “motion” is averaging everything out and look, there the sum of me pops, every tiny instant getting slightly closer to Dublin.

If the train was in the vacuum of space – or somehow free from the effects of friction – it wouldn’t even take anything to keep this atomic leap-frogging going. Each small micro-part of me would just keep on dancing. And somehow at the same time, it’s exactly as if I were staying still all along – there’s no real difference. This has been tested.

But if for any reason a change of tempo was required, to move from a waltz to a samba, something intangible and ephemeral – mysteriously lumped into the word “energy” – is required. It’s weird and mystical and it doesn’t make a whole lot of sense, but it can be measured and described – and relied upon.

And as I sit here and write, using the word “I” like that, it feels like there’s a “me”. It seems as if there’s a real sense of ownership over my thinking, I can’t tell where my thoughts come from, but they are mine.

I feel like I’m free to choose things, when I get off of the train I could get straight into a taxi or walk to a tram. I don’t know which I’ll do yet, there are a lot of factors to consider, but neither choice seems pre-ordained, and once it happens it will feel like “I” owned that choice.

“Free will” seems as real to me as the forward motion of this train, it’s an inexplicable dance, but it’s my tune. And that makes less sense. Reality is describable by all of these symbols and equations, and we can make predictions with them. This has been tested.

There are even divisions between what is predictable in detail, and what is predictable only in the aggregate, random and unknowable at the finer grain.

Here in my head, it doesn’t seem like my thoughts are predictable, or could be. If they are just an inevitable cadence, then that “I” is merely an illusion. It also doesn’t seem like thoughts are random or unknowable, to me anyway. I have patterns of thinking, recurring themes, and a detectable personality. It can’t be the same dance.

How many miracles are there on this train? Motion is miraculous enough, that “I” can think, that the universe even exists seems miraculous too. But most astonishing, is that these simple realities are seemingly contradictory.

I’m still on that train, I haven’t gone anywhere, there haven’t been any angels visiting me, or deitious interventions. These thoughts haven’t led anyone to complicated dogmas, wars or ceremonies. Some say science is cold. Maybe the dance is a myth.

And if this mental dance should one day just end, where are the real love-songs? Songs that speak to the real meaning, the genuine warmth, of spending a fleeting, passing, bittersweet whirl around the floor with someone. Not an infinitesimal slice of eternity living in hope of a better dance; that’s cold. This has been tested.

Role Models
http://www.stdlib.net/~colmmacc/2009/12/13/role-models/
Sun, 13 Dec 2009 18:00:21 +0000

Consciously, I’ve never been keen on the idea of role-models. Thinking it synonymous with hero-worship, it has always seemed a bit of an anti-pattern to me. Why try to emulate anyone? There are enough people in the world behaving the same as someone else; being different and original is definitely more useful, even if it makes you a bit crazy. When I did a dubious “leadership style” test I came up as “anti-follower”, so maybe it’s just another form of contrarianism on my part.

Over time, I’ve found that the best way to learn is by example, even if it’s a process of unconscious osmosis. And when I’ve spent time on what is sometimes called “personal development” I’ve found that there is real benefit in reading the biographies and the writings of truly awesome people. It certainly seems more productive than reading self-help books that are written in truisms and marketing crap.

I thought I’d share some of the people who I’ve really benefited from reading about, truly amazing people.

Richard P. Feynman

RPF is a legend; a Nobel laureate physicist with an uncanny ability to explain complex ideas, an anti-authoritarian maverick who loved to screw with officialdom but most of all an incredibly generous, warm, loving guy (even if a womaniser at times). His writings on physics and his letters to his first, dying, wife are an inspiration.

Robert M. Pirsig

Pirsig, someone genuinely crazy enough to have been institutionalised, still managed to write one of the best sellers of the 20th century and to invent a philosophical system that many consider to have merit. ZMM is amazingly well written, all the more so when you consider that every paragraph was planned out in advance on index cards. Worryingly, his narrator in ZMM is the only literary character I’ve ever strongly identified with.

Grace Hopper

Grace Hopper signed up for the US navy during World War 2, and rose (primarily as a reservist) to the rank of commodore/rear-admiral back when this was incredibly unusual. But more than this, she was an excellent experimenter, and kept a rigorous lab-book, despite being mainly a computer scientist. She was a strong believer in getting things done, and coined the phrases “dare to do” and “It’s easier to ask forgiveness than it is to get permission”. Seriously awesome woman. Oh yeah, and she invented the compiler.

Doc Watson

Doc Watson has been blind nearly his entire life, but that doesn’t stop him from being truly the most amazing guitar picker the world has ever seen, or doing crazy things like mending the tiles on his roof. His solo runs and accompaniment are incredibly good, and he’s somehow maintained humility in the face of multiple grammy awards and playing for the president on a regular basis. Another doer, he just kept going and became more productive after the tragic death of his duet partner and son Merle.

Dolly Parton

Dolly is a self-described mis-fit, but she is also a very very shrewd business woman as well as being a dedicated humanitarian and gifted songwriter. She is one of the really great singers, and is emotionally invested in every song she sings (even the ones that sound like bubblegum, listen to how sad she is in “Here you come again”).

CP Snow

CP Snow was basically a troll, but a very very good one. His arguments, lectures and writings weren’t always rigorous and balanced but they were always enlightening, thought-provoking and forward thinking. Most famously he identified the tension between literary and scientific cultures and made a great case for the unfair treatment of science. A scientist and a well-regarded author, CP Snow is a great example that it is possible to straddle both worlds.

Peter Watson

Peter Watson is a prolific researcher and writer, the volume of his output and the breadth of his knowledge is unfathomable. I’m constantly reading something of his. He has methodically and thoroughly condensed practically all of known intellectual history, writing about all of the inventions of the human mind. His writing is great, but it also brings home how relatively ordinary our time in history really is, yet serves as a great reminder that so many things we take for granted even had to be invented.

No doubt I’ll think of more now that I’ve put a list together. I’ve been fortunate enough to meet some of these people, but I’ve also been even more fortunate in that other people I’ve come across in my life have served as role models (starting naturally enough with my parents). I don’t intend this post as a meme, but if you have role-models, I’d be interested in hearing about them. As I mentioned, it’s definitely a great benefit to read about such inspiring people.

Period Pain 3
http://www.stdlib.net/~colmmacc/2009/11/26/period-pain-3/
Thu, 26 Nov 2009 21:44:49 +0000

As promised, though it’s been a while coming, here is the followup on scheduling periodic tasks.

The most important point to get is that for nearly all real-world use cases the actual time that a “scheduled” task runs at doesn’t matter. Tasks that have to occur at a specific time on Tuesday are vanishingly rare. cron is one of the most abused tools going; rather than encode specific times, it would make more sense to let a scheduler decide when tasks should be run, based on criteria such as overall network and system load.

It’s not even hard. There are plenty of scheduling algorithms and theory to borrow from, and many large organisations even have private implementations that let you be a bit more fuzzy about when tasks are run. But there’s another level to the periodicity problem that is worth thinking about.

Rather than simply using numbers and values that come readily to humans, it can be worth putting more effort and research into the values of the periods themselves. This isn’t meant in some fetishistic sense. Yes, for say virus updates, it’s possible to produce a gigantic linear algebra equation, with hundreds of parameters, that would balance the likelihood and cost of a security breach against the cost and frequency of checking for updates, and it would come out with some answer, but that’s a lot of work for little gain.

More interesting, and more tractable, are the effects that arise when multiple periodic tasks coincide. These are really common in distributed systems, and a real pain to debug and diagnose.

It could be as simple as the case we’ve been looking at; a cron job that runs once a day, but across many systems, or it could be as complex as a full-blown peer to peer app that’s got a control loop with multiple peers, a supernode or two and a user-interface polling loop.

And a pattern that’s repeated over and over again is that people choose “convenient” values for the periods, and these choices are so common that when the periods end up in phase with each other we get constructive interference and elevated load events as the tasks coincide.

[Figure: Aligned Events]

Take for example 3 loops – one with a period of 5 minutes, one with 10 minutes and one with 30 minutes. If the loops end up in phase then every 30 minutes we have all three tasks running at once. It’s a mess, and it’s an easy one to prevent – use prime numbers for the values of the periods;

[Figure: Aligned Events]

at least that way the number of coincidental events is minimized, and if any load events show up with a periodicity it is very straightforward to identify what single event, or combination of events, is responsible. I sometimes imagine a cron replacement that works exactly like this, but I never get around to writing it.
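
To put a rough number on it, here is a back-of-the-envelope sketch (my own, not from the post) that counts the minutes in a day when all three loops fire together, assuming they all start in phase at minute zero:

def coincidences(periods, minutes=24 * 60):
    # count the minutes at which every loop fires at the same time
    return sum(1 for m in range(minutes)
               if all(m % p == 0 for p in periods))

print(coincidences([5, 10, 30]))   # 48 times a day
print(coincidences([5, 11, 29]))   # once a day, at minute zero only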

And these sorts of loops show up in places you might not necessarily think of. Caches are a good example. If you serve every piece of content on your website with the same Max-Age, then you can expect a thundering herd of requests whenever a browser or proxy expires them all at the same time. On the other hand, if you use prime-number cache lifetimes for each resource, you’ll get a much more nicely staggered and spread-out series of requests. It’s a really simple, neat optimisation. Tuning things doesn’t have to be hard.

Period Pain part 2
http://www.stdlib.net/~colmmacc/2009/09/27/period-pain-part-2/
Sun, 27 Sep 2009 12:37:36 +0000

Last week I wrote about problems with periodicity, but that was only half of the problem. Before moving on to the second half, it seems like a good time to post some clarifications.

I wrote that using some locally unique well-distributed value, such as a mac address, was better than choosing a random number once. But crucially, I left out how to do such a thing. A few commenters asked what the best way might be, including some good examples.

To be a bit more rigorous about it, and to make sure, the great people at HEAnet provided me with an anonymous list (the prefixes had been stripped) of over 200,000 IPv6 addresses that have used ftp.heanet.ie in the last month. Included in that list were over 150,000 EUI-64 style addresses, which look like this …

2001:880:18:1:214:4fff:fe02:e6ee

the last 64 bits include a slightly modified version of the user’s MAC address. The details are straightforward, but you can take it from me that “214:4fff:fe02:e6ee” corresponds to a MAC address of “00:14:4F:02:E6:EE”, and that the md5sum of that string is “d32227ed9a3bf7d8714590f837884286”.

MAC addresses and the hash are both really just numbers: a 48-bit number and a 128-bit number respectively. Bash can handle these kinds of numbers natively, and if you need a well-distributed number between say 0 and 999 then the mod operator is perfect:

# Prove that bash can handle even 128-bit numbers
colmmacc@infiltrator (~) $ echo $(( 0xd32227ed9a3bf7d8714590f837884286 ))
8162089295436857990

# Use the MAC address directly to pick a number
colmmacc@infiltrator (~) $ MACADDR=`/sbin/ifconfig | grep HWad \
                               | awk '{print $5 }' | head -1`
colmmacc@infiltrator (~) $ echo $((  16#$MACADDR  % 1000 )) | sed 's/^-//'
174

# Use the md5sum of the MAC address to pick a number
colmmacc@infiltrator (~) $  echo $((  `echo $MACADDR | md5sum |\
                             cut -d\  -f1| sed 's/^/0x/'` \
                             % 1000 )) | sed 's/^-//'
363

As per one of the comments on the previous post, from brady, getting rid of any minus sign (the last sed operation) is a cheap form of abs().

But which is better: randomness, MAC addresses or the md5sums? To get rid of any temporal bias, I’ve graphed the distribution of the above operations for 18,365 real-world MAC addresses from one day’s worth of requests to ftp.heanet.ie.

[Graph: MAC address distributions]

MD5 sums come out slightly ahead (a stddev of 4.6 compared to 4.81), and performed essentially as well as random numbers should (around a stddev of 4.55). Using the MAC addresses on their own, without md5summing, should be good enough for most purposes too.
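
For anyone who wants to reproduce that comparison without the HEAnet data, here is a rough sketch (mine, using synthetic random MAC addresses rather than the real dataset) that buckets each strategy into 1,000 slots and reports the standard deviation of the bucket counts:

import hashlib
import random

def stddev(buckets):
    mean = sum(buckets) / float(len(buckets))
    return (sum((b - mean) ** 2 for b in buckets) / float(len(buckets))) ** 0.5

slots = 1000
macs = ['%012x' % random.getrandbits(48) for _ in range(18365)]

raw = [0] * slots
hashed = [0] * slots
for mac in macs:
    raw[int(mac, 16) % slots] += 1
    hashed[int(hashlib.md5(mac.encode()).hexdigest(), 16) % slots] += 1

print("%.2f %.2f" % (stddev(raw), stddev(hashed)))
# with purely random MACs both land near sqrt(18365/1000), about 4.3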

So why not use a random number? Well two reasons.

  1. It’s harder – you have to store the state somewhere. A MAC address, on the other hand, is already stored state; you can look it up each time and it will be relatively stable.
  2. A lot of the time automated tasks are being installed at provisioning time – there isn’t actually that much real entropy available, so the randomness either tends to be weak, or you contribute to exhausting entropy and denying it to more useful things.

And lastly, James pointed out that from the point of view of a single host, half an hour of jitter doesn’t really matter. He’s dead right – but of course the combined effects do matter for the distributed system – and the next post will be about how to exploit that property to get better scheduling.

Period Pain
http://www.stdlib.net/~colmmacc/2009/09/14/period-pain/
Mon, 14 Sep 2009 16:15:27 +0000

Imagine it was your job – along with 1,000 other people – to pick a number between 1 and 60. You can use any method you like (though you must use the same one), but if more than 30 of you choose the same number, those of you who did would be shot. Would you let the group pick numbers at random?

Probably not, there’s always a chance it could go horribly wrong. And that chance? We could derive the correct p-value for having to shoot people easily enough from the uniform distribution, but forget that, let’s do it with code. It’s a lot easier to understand – and it’s a good habit too.

import random

count = 0

# Simulate 10,000 runs of 1,000 people
#  picking a number between 0 and 59.
for runs in range(10000):
    numbers = [0] * 60

    for people in range(1000):
        r = random.randint(0, 59)
        numbers[ r ] += 1

    if sorted(numbers)[-1] >= 30:
        count += 1

print(count / 10000.0)

If you run this, hopefully you’ll get an answer that’s around “0.1”. In other words, about 10% of the time we expect to have at least one number being chosen by 30 or more people. Those don’t seem like great odds, and I wouldn’t play Russian roulette with lives like that.

Yet this is almost exactly what we do with a lot of automated tasks in some large-scale distributed systems. A pattern that can be observed over and over again is that someone writes a scheduled task that runs once an hour, day, month or whatever, but then sleeps a random amount of time (typically between 0 and 3600 seconds) when it starts – in a very naive attempt to distribute impact across the larger system evenly.

[Image: The march of time]

The impacts can be pretty serious – it might be extended load on a dependent service, or it might simply be too many busy nodes in a cluster. Or it might be a two-day global outage of a popular VoIP service. Too many things happening at once is usually a bad thing in distributed systems.

Security and anti-virus updates, apt/yum/ports repositories and auditing tools in particular seem to get this pattern wrong. Here’s a good example, from Ubuntu’s daily apt script:

# sleep for a random intervall of time (default 30min)
# (some code taken from cron-apt, thanks)
random_sleep()
{
    RandomSleep=1800
    eval $(apt-config shell RandomSleep APT::Periodic::RandomSleep)
    if [ $RandomSleep -eq 0 ]; then
    return
    fi
    if [ -z "$RANDOM" ] ; then
        # A fix for shells that do not have this bash feature.
    RANDOM=$(dd if=/dev/urandom count=1 2> /dev/null | cksum | cut -c"1-5")
    fi
    TIME=$(($RANDOM % $RandomSleep))
    sleep $TIME
}

This is a very bad pattern – I’d go so far as to say that it’s actually worse than letting everything happen at the same time in the first place. It has 2 particularly dangerous qualities;

  1. It increases the effective period of the task

    If a task is running once an hour – it’s probably because we need to do something once an hour. That may seem tautological, but there’s a subtlety. The clock-time hours we define as humans are arbitrary; by once an hour we should really mean “at least once in a 60 minute interval”, not “once between 1PM and 2PM”.

    If we pick a random sleep time, we might end up running just under two hours apart. If at 1PM we pick 0 sleep seconds, and then at 2PM we pick 3599 sleep seconds – look, we just ran two real hours apart! Unsurprisingly the converse can happen, and we’ll run just 1 second apart, doing heavens knows what to load along the way.

  2. The system as a whole has to cope with dangerous load spikes

    Using our earlier example, if we allocated the numbers evenly, each value would get chosen only 16 or 17 times. We could plan for an impact of say 20 running at once. But as we’ve seen, if we pick random numbers every time, then 1 in every 10 runs we’re going to have to cope with an impact of 30. That’s 50% more load, because of a one-line error!

    If this task is running every hour, then about 1 in every 7 weeks we’re going to have to deal with an impact of 40 or more. And it will appear totally out of the blue; it’s a random occurrence. Nice! To take it to the extreme, there is a small – but finite – chance of all 1,000 systems choosing the same value. But avoiding this is why we are spreading the load in the first place, so why leave it to chance?

    I used to run a busy Ubuntu mirror, and every day between 6:25 and 6:55 we’d see a gigantic wedge of load that could be distributed a lot more evenly. Though I think the worst of these problems have now been fixed.

The optimal fix for the problem is simple; coordinate – use a central allocator, or a gossip protocol, to ensure that every slot has at most N/M consumers.

This isn’t always possible though – Open Source security updates are usually polled from relatively dumb mirror servers that don’t keep enough state to be able to divide load like this. But there are still two better approaches.

Firstly, you can pick a random number just once, and then reuse the same random number every time. This guarantees that the task does actually run once an hour, and it makes load predictable for the distributed system as a whole.

Secondly, you can use any uniformly distributed unique piece of data, local to the node, to choose your number. I like to use an MD5 hash of the MAC address, but even the hostname would do.
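
A small sketch of that second approach (my own, not from the post), deriving a stable offset from a hash of the hostname so that every run of the task lands in the same well-distributed slot:

import hashlib
import socket

def stable_offset(slots=3600):
    # hash a stable, locally unique identifier (the hostname here; a MAC
    # address works just as well) into a fixed slot between 0 and slots-1
    digest = hashlib.md5(socket.gethostname().encode()).hexdigest()
    return int(digest, 16) % slots

# an hourly task would sleep for this many seconds on every run,
# instead of picking a fresh random delay each time
print(stable_offset())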

Probability-wise, the two approaches are identical, but in the real world you get fewer collisions with the latter – probably due to the similar state of entropy identical systems will be in when you mass boot a fleet of several hundred. In both cases, where aberrations emerge due to bad luck, they at least only have to be identified and fixed once. We’re no longer playing the load lottery once an hour.

But this is only half the story … what about other automated tasks with their own periods … how do we avoid colliding with them? And how do we architect timings in distributed systems in general to make them a lot easier to debug? That’s what the next blog post will be about. Stay tuned.
