Archive for April, 2008

AWS, AppEngine and the future of data

Posted on April 21, 2008, under general.

Since I last blogged about Google’s new AppEngine service, I’ve been playing around with it and comparing it to Amazon’s Web Services, getting a feel for each of their abilities and limitations.

It’s not a like-by-like comparison; Google’s AppEngine is a higher level service. Unlike AWS, it presents an application framework – not an operating system – and it takes full responsibility for balancing the load and handling resilience and diversity. They also have different target markets.

Target markets

Google / Amazon markets

Google’s target is direct; developers; there’s no need for any intermediaries like system or data administrators, they can start off for free and ramp up if it takes off, and they can replicate the entire environment on their laptop. The barrier to entry is very low (as long as you know Python) and encourages experimentation. It’s easy to see teenagers writing OpenSocial and MySpace widgets on Google AppEngine. If AppEngine didn’t exist, its stereotypical user would be writing iPhone apps, a shining star developer at some


, or just the best Counterstrike sniper in their city and that’s not a put down.

Amazon’s target is more serious; entrepreneurs; there’s more control and freedom, at the cost of a little commitment. The services are designed to slot into business plans in ways that make sense to both the lead architect and the CFO. It’s easy to see small but growing enterprises and a bit of a marketing budget using AWS. If AWS didn’t exist, its stereotypical user would be employing more admins, buying more machines and going through the pain of building or renting comms space.

Amazon are adding new services to AWS every few months, and Google are open about their plans to expand on theirs. No doubt we’ll see some interesting new offerings from both of them, but in the meantime we can compare them based on their capabilities today.

Ultimately the differentiators for these kinds of services will probably be the languages supported and the management tooling, but personally, the part I find the most interesting to compare is data storage and processing. I also think it’s the most important fundamental change that these computing clouds bring about.

Information is power, knowledge is wealth and data is very literally the bits they’re made of, or so the cliché goes. The entire field of IT, Information technology is really about data, it’s pretty key stuff, and managing it is the hard part of building an online service. So how do they compare when you want to provide a data-driven service?

Architecting geographically resilient services

Well, before we can quite answer that question we need to go over how a global-scale geographically resilient service looks and works. This stuff is hard, with a lot of theory and problems, so I hope you’ll forgive me if I boil it down to the three big challenges;

  1. Getting a user’s queries to different sites

    Using either stupid DNS tricks (seriously, that’s a technical term), IP anycast using BGP, application-level redirection/load balancing or a combination of these, a user’s queries need to get to different sites depending on conditions. At the very least, the system needs to be able to route around an outage of one site. Advanced features include route optimisation, where users queries are served from the location closest to them network-wise.

  2. Getting a user’s data to every site

    Even in the most basic case (where all data is local to a single user), there’s a risk that the site a user is being served by will go offline suddenly, so that data needs to be synchronised to every site when written. For more typical cases the entire dataset is being used by all sites anyway, as
    global statistics and user data are interesting to other users.

  3. The speed of light

    Our present speed limit is the speed of light in a vacuum. You’re probably reading this sentence faster than you could ever send a single photon to your antipode and back again. Planet-scale communications using electromagnetic radiation are forever, intractably, hampered in this way (thankfully our own planet has the conveniently placed Pacific Ocean to keep a tab on the problem).

    Earth is 67ms wide

    As if that weren’t bad enough news for us, we don’t beam information through a vacuum but actually slow it down by bouncing it around in tiny bundles of dense glass. Raw bits travel at about 0.66 c. We also have the discourtesy to convert photons to electrons a few dozen times, wrap them in error codes, and shove them into protocols that require math to happen every time they hit a hop on a traceroute.

    End result: as information leaves the desktop, and lives in distributed computing clouds, it moves quite slowly and well within the range of human perception.

Outsourcing these hard problems, and getting the benefits and economies of scale that Google and Amazon have, is what these products are really about.

At the intersection of those three challenges is the worst problem of all; handling a data-level netsplit, when users may be interacting with multiple sites that each have different ideas about the state of play. This, more than anything else, exerts the strongest influence on both the data and application architectures.

  1. A common or garden netsplit

    A netsplit in action

    Scenario: two (or more) sites stay online and reachable to some internet users, but have poor connectivity to each other. Now neither site has any automated way of knowing that the other is “really” down (ie from the entire internet) or that any slowdown is temporary, so both sites should keep serving users – but data isn’t being synchronized between the sites. There needn’t be a full outage for it to be a problem, it might just be a delay – a clogged link, or some faulty storage infrastructure – but the result is the same, inconsistent data in real time. Or as Amazon put it;

    Amazon SimpleDB keeps multiple copies of each domain. When data is written or updated (using PutAttributes, DeleteAttributes, CreateDomain or DeleteDomain) and “Success” is returned, all copies of the data are updated. However, it takes time for the data to propagate to all storage locations. The data will eventually be consistent, but an immediate read might not show the change.

    Consistency is usually reached within seconds, but a high system load or network partition might increase this time. Performing a read after a short period of time should return the updated data.

    Designers put in multiple resilient paths between sites, add out-of-band management systems, and wake up Network Operation Centre engineers at 2AM to try and minimise these problems, but it can and does still happen. Some commentators think that it’s just matter of time before service providers can get around these problems, but really it’s not, unless they know something about the speed of light the rest of us have missed.

  1. A virtual netsplit

    Scenario; there are 2 (or more) sites, say 6,000km, or better expressed as 100ms, apart. There’s a user in the middle, half-way between both. For that user, it only takes 50ms to reach either site, but it takes 200ms for any reliable commit to their data to roundtrip. Adding 200ms of delay to every user query is not acceptable, so we must live with a little lazy data replication.

    What happens if they end up being served by both sites? There’s two ways this can happen – one is ordinary enough, their main site goes offline suddenly, and they failover to the other, but their data didn’t make it in time. The other is far more insidious.

    a netsplit

    When DNS tricks or IP anycast are used for global load distribution, there are bizarre corner cases in which a user may be intermittently served by different sites. Their DNS or proxy servers may have source addresses in different networks, or maybe their ISP doesn’t carry a full routing table and they’re a victim of some equal-cost multi-path routing (although that’s a severe case, which will kill all TCP services anyway, it’s still a problem with low TTL DNS tricks).

    If the application queries rely on state from previous requests, it can get very confusing very quickly when the data is taking longer to replicate than the users queries are flipping between sites. (As an aside: imagine this problem multiplied by factorial(N) where N is your total number of users and welcome the pain of implementing a stateful peer to peer network).

So, those are the problems, and there are good and practical engineering measures used to mitigate them; decent global load-balancing techniques will minimise any flipping of users between sites, well-maintained and monitored resilient links between sites will keep splits very rare, and a larger number of more widely dispersed sites will keep alternative sites closer-by and allow for tiered data synchronization.

Why am I going on about netsplits and the speed of light? well, the thing to take from all this is that there are no guarantees about data consistency across sites in real time, and reasons behind this are fundamental. You can have excellent data consistency, at the cost of some latency, or you can have low latency with slower guarantees on data consistency across sites, but never both; sorry.

Often, this doesn’t matter at all, there are so few writes – or they are not critical to interactivity – that it’s acceptable to have a single master site and some delay from far away locations. But many web applications are highly interactive and data heavy. Every time you make a move on scrabulous, you need your opponent to see it. When you add a photo to a set on flickr, you expect your friends to be able to see it.

Data-modeling with Object and Tuple stores

The constraints all of this places on data management have large effects on how both Amazon’s and Google’s load-distribution and data-stores work, they’ve both been designed with this in mind. There’s no referential integrity, uniqueness of records is not enforced at all, nearly all of the features we’ve become addicted to from RDBMSs are missing, and the documentation makes no guarantees of data consistency or even disclaims it.

Developers would love it, but it’s simply impossible to provide a global-scale super-resilient super-low-latency SQL-alike data-store. So what happens when you learn to live with lazy data?

Well it’s 2008, and these days it’s hard to imagine any new web service that didn’t have support for tags and folksonomies. So what if you’re selling furniture online, couches can have tags. Providing services to a salesforce? Yep, you need tags. A forum for talking about candle-making? how could you remember what it was like before tags. Tags tags tags, everybody needs tags. So, let’s add them to an AWS or AppEngine site.

I’ve always thought there was a hidden market to be tapped into in tagged blob storage. Todays dotComs are all about conquering a middleware niche, so why not make the best tagged blob service there is – let other sites worry about what the blob actually is. So this should be easy peasy, right? Tags are just a one to many attribute. We knock up something like;

Google AppEngine API python:

class Blob(db.Model):
    blob = db.BlobProperty(required = True)
    tag = db.ListProperty(db.Category)

blob = Blob(blob = bindata, tag = ["colmmacc", "cc2.0bync"])

AWS SimpleDB pseudocode:

PUT (blob, $uuid), (data, $s3url), (tag, colmmacc), (tag, cc2.0bync)

We’re done, aren’t we? we can now tag blobs, select them by tag, and even the list of tags on a per-blob basis. That’ll add an extra zero to our selling price, and it only took 5 minutes, great stuff. Well, not quite, because we don’t just want tags, the marketing department says that’s not cool enough, we want tagclouds.

A tagcloud

That subtle change in usage pattern makes a big difference though. Now we need to count tags across all of our blobs, and probably against other dimensions too – like which users own the blobs, because we want per-user and global tagclouds. To get an idea of the difference in complexity this all implies, have a good read of a suggested SQL schema for tags.

Google’s data-store is really an object store, and rather than designing data schemas, if you think about how you would solve a problem if it was a bunch of in-memory objects, with standard data-structures like trees and indexes, you’ll generally come up with a good data model for your problem. If we try to re-create something like the SQL-schema for tags, we won’t get very far. We can create tag and linker objects;

class Tag(db.Model):
    name = db.StringProperty(required = True)

class Blob(db.Model):
    blob = db.BlobProperty(required = True)

class TagBlobLink(db.Model):
    tag  = db.ReferenceProperty(Tag)
    blob = db.ReferenceProperty(Blob)

but this let’s us down in one big way; we can’t enforce the uniqueness of tags. Although we can hammer out something like;

# See if our tag exists
     t = Tag.all().filter("name =", tagname).fetch(1)[0]

# otherwise create it
except IndexError:
     t = Tag(name = tagname)

without a global, reliable, locking service it’s just not safe – two instances might create identically named tag objects at the same time. The same problem crops up if you try to maintain your own reciprocal indexes of tags and blobs in the Tag and Blob objects themselves – when it’s time to add a new tag or blob, you can’t guarantee uniqueness without a lock-step. Unfortunately Google don’t provide AppEngine authors with access to Chubby or any other locking service.

One solution may be to use the data-store itself as a locking service, but you’d have to have a good idea of any likely data propagation times, build sleep() calls in, and still hope for the best. A better, more acceptable solution, is probably to tolerate some conflicts – simply allow for the possibility of two

Tag[tagname = "colmmacc"]

objects – and have an offline vacuum process which merges the conflicts, particularly easy in the case like this.

A bigger problem with the Google Datastore API is the extremely poor support for operators and extended query results. It’s not even possible to




query conditions, to


full result sets or retrieve more than 1,000 results from any query. That makes implementing our tag-cloud even harder again, because we’re going to have tags associated with more then 1,000 blobs.

If we wanted to build a tag cloud for two users (which would normally use an OR), we would have to read in all of their blobs – 1,000 at a time, store and sort them all to eliminate duplicates and then count the results iteratively. That’s a lot of overhead to have at the application layer.

Amazon’s data-store is all about rows of associative tuples. As all attributes are string-based tuples, it makes it possible for them to indexed in multiple dimensions and directions – and when you think about it, it’s easy to see how almost any data model can be described in that way.

Going back to our simpledb example:

PUT (blob, $uuid), (data, $s3url), (tag, colmmacc), (tag, cc2.0bync)

The great thing about simpledb is that you don’t have to update the whole row at once. You can add one tuple at a time. It might not make it to all sites instantly, but it at least keeps the consistency problem local to that attribute – not the entire row/object. Since I don’t have to worry about keeping a list consistent, I tag the blob by adding an attribute.

There are still some race conditions, but they tend to match human expectations. For example, John at site A might add a tag “foo” at time T, then Sally at site B might add the same tag at time T+1. Then John changes his mind and removes it at time T+2, but then Sally’s change might come in and trounce it. But that’s what happens when humans make conflicting changes in real-time anyway.

SimpleDB has full support for OR’s, NOT’s and other such logical operators. It doesn’t have a COUNT, so you still have to read in the results to count them – this time only 250 results at a time – but at least you don’t need to sort them. Amazon doesn’t have a locking system either per se, though it does have a Queue service, and any CS undergrad will be able to tell you how to implement a semaphore-based lock using a queue. But that’s definitely not what it’s intended for.

Still, both examples are ugly; to get our counts of tags, we’re having to paginate through result sets iteratively, consuming memory and CPU in the process – that’s hardly optimal. One solution would be to replicate in our application what many RDBMS’s do, and to maintain an N dimensional cache of our tag counts, and update it on write.

Distribute the operations, not just values

But that brings me to something else, besides a locking service, that I think is missing. Wouldn’t it be cool if you could distribute operations instead of just values? Lets say our tag cache looked like:

Tag Count
sex 4816
love 510
rockandroll 1377

Whether we’re using Amazon or Google’s services, if we want to add a new tag and update the cache, we have to read the value, increment it, and then write it out again. If this happens simultaneously at two sites, our count will be wrong, and we’ll be missing a value.

Instead, imagine an API that let you define those values as counters, and you were given a set of distributable operations,




. The cloud could apply the operation as it distributes, regardless of what the value is. Although the values still won’t be instantaneously consistent, it will be roughly right, and we’re never going to be negating ourselves.

One way to implement this kind of behaviour in application space is to use SimpleDB’s tuples like a stack, pop-ing operations when we’re sure they have successfully distributed. When we want to add or subtract to our total, we push an operation, a row might look like this;

(tag, love), (count, 510), (pending-operation, "increment,timestamp,uuid")

the pending-operation consists of the operation name, a timestamp so that we can later
get rid of them, and a uuid or hostid to handle the case where the same operation occurs at the same time from two different hosts.

Now, when we want to perform a count, we take our total, and proceed through the pending-operations – if any – at runtime. SimpleDB’s eventual consistency model will have been adding pending-operations as the data synchronises, so we might be a bit out, but that’s ok. Then, to keep the stack of pending-operations to a reasonable size, we have an offline process, or a conditional block, which does the following:

for operation in pending-operations:
     # 10 minutes should be plenty of time for operations to propagate
     if now() - operation[timestamp]  > (10 * 60):
            count = operation[operation](count)

     # Pop the operation from the stack

Of course to apply the above safely from a conditional block, we’d still need to use a lock – underscoring the importance of that missing feature, but it avoids expensive locking at run time. it’s a bit like have a map-reduce for real-time distributed data, though the examples here are simplistic, there are plenty of useful commutative functions and operators.

Counters like this are common enough in distributed applications, how else would you know how many people have viewed your OkCupid profile that day? But there’s another very closely related data type for which distributed operations rock; inventory.

Let’s say I have 200 concert tickets to sell online, and I’m pretty sure it’s going to sell out in minutes (it’s Bob Dylan performing to a small audience). If I use a global lock and counter, my site is going to be slow, and it will have a hard time as it is. If I start a count at 200 and use a distributed decrement approach like above, there’s a big risk of overflowing past 0 and upsetting some people.

Instead, there’s a better way – use a registry or ticketing system. Have an inventory data-type, and allow it to be divided between the sites on demand. So, say it starts at 200, and we tell the API to carve it up 10 tickets at a time; each site should take a batch of 10 tickets, and when it gets down to only 2 left, request another 10 and so on. This gives us the best of both worlds, we have a local inventory that we won’t take seconds to lock, and we guarantee that we won’t overflow the total.

If you’re using EC2, and you know how many sites you’re running in, you can build this kind of logic into an application easily enough using Amazon’s Simple Queue Service, but it sure would be a nice feature to have available more directly.

Data from the other side

One big gap present in both services is data analysis. That might seem like a trivial thing, but it’s a big deal – analysing your business and usage data can tell you can awful lot about what’s happening and what directions to take.

Both Google and AWS have reasonable services for reporting your usage; there are graphs, logs, and all of the data is there. In Google’s case, since all of the requests go through its application framework, it sees everything and it’s all collated into one easy-to-use analytics-like interface. You get joined-up access to HTTP logs and Python tracebacks.

If I were Amazon, I’d be working on a simple logging service, with plugins for Apache, syslogd, all of the programming language, and whatever else could ever want to log. Having everything in one place is a big deal, you can then correlate events, establish usage patterns and identify attacks and uncommon behaviour.

But what both miss is the higher level; not just the graphs and numbers, but statistical analyses like Holt-Winters projections that will allow users to identify real dips in their throughput which could be due to technical or marketing problems.

Service monitoring and usage analysis is a big problem for businesses, and my guess is that many would jump at the chance to benefit from the scale and experience of a provider like Amazon or Google.

Interesting times

These are certainly interesting times for data management. Internet-scale applications are coming to the masses, and very clever computer scientists and architects are coming up with abstractions and models that allow ordinary developers to cope with the fallout of using a planet instead of a hard-disk.

Are we going to see mid-level services like locking and distributable operations, that are present in some DBMS’s become part of this too? or will they turn out to be dangerous anti-patterns that prove too tempting to abusive and lazy developers. I hope it’s the former, but in the meantime it’s great to think out and learn from the design decisions made by the industry giants. They’ve thought very hard about these things.

The tag-cloud image is from Luca Cremonini, via the Wikimedia Commons project.

Apachecon EU 2008

Posted on April 11, 2008, under apache.

Somewhat unexpectedly, I’ve spent today at Apachecon Europe 2008 in Amsterdam, and managed to deliver a talk on APR on short-notice. It’s been great to catch up with people I haven’t seen in a long time, years in many cases. Hopefully I’ll be in New Orleans later this year to catch up over a full apachecon, and get back to hacking again.

APR Hello World

The presentation wasn’t exactly well-prepared, so there are only 13 slides, but for the record … here they are. And yep, even the Hello World example contains a bug!