HOWTO: Making an amazing Canvas print with reads

Posted on September 20, 2008, under general, photography.

I recently moved into my own apartment in Dublin, and after a lot of painting and hard work, it needed decorating. If you like taking photos, it’s a great opportunity to put in the extra effort and take some shots that are worthy of enlargement and putting up on a wall. Over the last few weeks I’ve gotten 6 prints done up on canvas, and am getting a feel for what makes amazing prints and what doesn’t.

Reads on Nassau Street in Dublin have a great canvas printing service. but there are some caveats to doing it right and avoiding some extra charges.

Step 1: Take a great photo

Photos with lots of colour and brightness seem to work best, avoid too much detail, landscapes and large-aperture portraits work great. For my largest print, I went out on a mission, and took this photo in the Phoenix park;

Remember if you want to wrap the photo around the canvas you need to make sure that there’s nothing important in the 4.5 cm around the edges.

Step 2: process the photo for canvas

Printing on canvas takes thick ink, to get something amazing looking the photo needs to be bright, and vibrant. I use Lightroom, so I increased the vibrance by about 25% and upped the exposure and brightness by around 15% - it will look a bit glarey on a monitor, but it will come out right on canvas. The same can be done in the gimp, iPhoto, or photoshop.

Lightroom also has an option to “sharpen for matte” when exporting, so I chose that, and exported at 100% quality.

Step 3: burn the files to disc

Reads have a “disc handling” fee of 5 or 10 euro, which essentially covers their costs of getting the files off of the disc and doing things with them. Most customers need to have their files cropped, rotated, sorted and that kind of thing. However if you put a little effort in, you can avoid the charge entirely. First, make sure that your images are exactly the right dimensions for the size you want (keep the resolution as high as possible, but match the aspect ratio of your target size). Next, name the images in an obvious way - if you want a 30×40 canvas print, with the image wrapped around the edges, name the file “40×30-canvas-colour-wrapped-something.jpg” - this means it takes about 5 seconds to give them the files (on a USB fob will do btw) and they won’t waste any time doing anything.

The prints are ready within 24 hours, well all of mine were.

Admire the result!

New place, new job!

Posted on September 1, 2008, under general.

So a few months ago I got myself into a lifetime’s worth of debt and bought a place, in Dublin. I haven’t gotten around to taking some good photos of it yet - especially as the decorating part isn’t finished yet. In the meantime, here’s a photo of the view from the Liffey bank walk it overlooks - it’s about a 3 minute walk down to where I took it.


Click on the photo for more. From my windows I can see the Dublin Mountains, Howth Head, the trees of Phoenix park, putting greens, and the Liffey itself. It’s like being in the countryside, yet it’s only about 10 minutes from Trinity College. It rocks!. I can strongly recommend Chapelizod if you’re looking for anywhere to stay in Dublin.

But as if this wasn’t enough, I have a new job, working for Amazon.com - based in Dublin.

So far so good; today was my first day and yet I have already been sorted out with a free t-shirt, attended the annual summer party (it was good too), set up with a pc , laptop, screens, phones and I’m getting an idea of the absolutely ginormous size of everything that is Amazon. There are some very interesting projects and challenges to be tackled, I’m looking forward to getting stuck in!

Apachecon US registration now open!

Posted on August 3, 2008, under apache, general.

Registration for Apachecon US 2008, in the wonderful city of New Orleans this November, is now open.

Apachecon US 2008 logo

For the first time ever, instead of just speaking, I’ll be running a 2 day tutorial/hands-on workshop on how to scale a webserver/site. Both intensive performance tuning, and the architectural considerations that come with high scalability.

Based on what it takes to performance-tune;

  • a single webserver to handle over 80,000 concurrent downloads
  • a website to many thousands of SSL connections per second

and the thought, state-avoidance, cacheability and expandability that goes into turning;

  • clusters of slow dynamic back-ends into a speedy and responsive user experience

Everything from the basics of configuration and benchmarking to high level monitoring will be covered.

It should be fun! There might be a small bit of theory; how virtual memory managers, file-systems and schedulers work, but the emphasis will be on the practical and experimental. Attendees will be given remote machines for server and client testing, and by going and doing it, we’ll step through the configuration, test and tune iteration cycle.

Let me know if you have any suggestions!

Irish Open Source Technology Conference

Posted on June 11, 2008, under general.

The Irish Open Source Technology Conference is on in Dublin next week and looks like it’s going to good fun.

We’ve had events like Guadec and Apachecon in Dublin before, and some smaller events mostly aimed at business, but this is the first real Open Source conference Dublin has seen, and as a bonus there’ll be a Firefox 3 launch party!

From the agenda it looks like it will be good fun! With a bit of luck it’ll become a regular event. I’ll be there, on the Wednesday evening - it should be a good opportunity to catch up with folk I haven’t seen in a year or more. Let me know if you’re going and say hi.

Security breach disclosure practice

Posted on May 15, 2008, under general.

For a long time now, we at Digital Rights Ireland have been campaigning for a law which would oblige companies who store our data to inform us of the details of any security breaches.

This is a hot-topic, with recent disclosures from the Bank of Ireland, and the Irish Blood Transfusion Service, of just this nature. Today I received a letter in the post from Adobe, informing me that some details I uploaded to their website may have been similarly subject to compromise.

As part of the process for making a student-discount purchase with Adobe, I was asked to upload a scan or photograph of my student card - which I was happy to do - and it appears that those images may have been available for others to view. According to the letter, this process may have been used for credit card details in some cases.

In the absence of a law obliging them to so, Adobe, BoI and the IBTS are actually to be commended for telling us about these breaches of security. Of course the notifications may be driven by an increasing consensus that not to do so would be a true negligence, as the real-world ramifications and triviality of identify theft is increasingly apparent, but it is welcome nonetheless. It is better at least to know that it has happened.

When the Bank of Ireland revealed its problems with laptop theft, it was big news, and widely discussed; ordinary consumers expressed fears, the data protection commissioner made recommendations and our collective security has improved since. Already, the adverse commercial effects of these notifications are spurring other businesses to review and audit their own practices. This can’t be but a good thing.

But despite these voluntary notifications, and the emerging consensus of their necessity, there are similar events like this every week that go unreported. Maybe the new minister for Justice, Dermot Ahern, with relevant experience from the Dept. of Communications, can remedy this situation. In the meantime, I think we’re actually better off with the companies who do tell us about these problems, at least they are proving a track record that the customer should matter.

AWS, AppEngine and the future of data

Posted on April 21, 2008, under general.

Since I last blogged about Google’s new AppEngine service, I’ve been playing around with it and comparing it to Amazon’s Web Services, getting a feel for each of their abilities and limitations.

It’s not a like-by-like comparison; Google’s AppEngine is a higher level service. Unlike AWS, it presents an application framework - not an operating system - and it takes full responsibility for balancing the load and handling resilience and diversity. They also have different target markets.

Target markets

Google / Amazon markets

Google’s target is direct; developers; there’s no need for any intermediaries like system or data administrators, they can start off for free and ramp up if it takes off, and they can replicate the entire environment on their laptop. The barrier to entry is very low (as long as you know Python) and encourages experimentation. It’s easy to see teenagers writing OpenSocial and MySpace widgets on Google AppEngine. If AppEngine didn’t exist, its stereotypical user would be writing iPhone apps, a shining star developer at some $BigCo, or just the best Counterstrike sniper in their city and that’s not a put down.

Amazon’s target is more serious; entrepreneurs; there’s more control and freedom, at the cost of a little commitment. The services are designed to slot into business plans in ways that make sense to both the lead architect and the CFO. It’s easy to see small but growing enterprises and a bit of a marketing budget using AWS. If AWS didn’t exist, its stereotypical user would be employing more admins, buying more machines and going through the pain of building or renting comms space.

Amazon are adding new services to AWS every few months, and Google are open about their plans to expand on theirs. No doubt we’ll see some interesting new offerings from both of them, but in the meantime we can compare them based on their capabilities today.

Ultimately the differentiators for these kinds of services will probably be the languages supported and the management tooling, but personally, the part I find the most interesting to compare is data storage and processing. I also think it’s the most important fundamental change that these computing clouds bring about.

Information is power, knowledge is wealth and data is very literally the bits they’re made of, or so the cliché goes. The entire field of IT, Information technology is really about data, it’s pretty key stuff, and managing it is the hard part of building an online service. So how do they compare when you want to provide a data-driven service?

Architecting geographically resilient services

Well, before we can quite answer that question we need to go over how a global-scale geographically resilient service looks and works. This stuff is hard, with a lot of theory and problems, so I hope you’ll forgive me if I boil it down to the three big challenges;

  1. Getting a user’s queries to different sites

    Using either stupid DNS tricks (seriously, that’s a technical term), IP anycast using BGP, application-level redirection/load balancing or a combination of these, a user’s queries need to get to different sites depending on conditions. At the very least, the system needs to be able to route around an outage of one site. Advanced features include route optimisation, where users queries are served from the location closest to them network-wise.

  2. Getting a user’s data to every site

    Even in the most basic case (where all data is local to a single user), there’s a risk that the site a user is being served by will go offline suddenly, so that data needs to be synchronised to every site when written. For more typical cases the entire dataset is being used by all sites anyway, as
    global statistics and user data are interesting to other users.

  3. The speed of light

    Our present speed limit is the speed of light in a vacuum. You’re probably reading this sentence faster than you could ever send a single photon to your antipode and back again. Planet-scale communications using electromagnetic radiation are forever, intractably, hampered in this way (thankfully our own planet has the conveniently placed Pacific Ocean to keep a tab on the problem).

    Earth is 67ms wide

    As if that weren’t bad enough news for us, we don’t beam information through a vacuum but actually slow it down by bouncing it around in tiny bundles of dense glass. Raw bits travel at about 0.66 c. We also have the discourtesy to convert photons to electrons a few dozen times, wrap them in error codes, and shove them into protocols that require math to happen every time they hit a hop on a traceroute.

    End result: as information leaves the desktop, and lives in distributed computing clouds, it moves quite slowly and well within the range of human perception.

Outsourcing these hard problems, and getting the benefits and economies of scale that Google and Amazon have, is what these products are really about.

At the intersection of those three challenges is the worst problem of all; handling a data-level netsplit, when users may be interacting with multiple sites that each have different ideas about the state of play. This, more than anything else, exerts the strongest influence on both the data and application architectures.

  1. A common or garden netsplit

    A netsplit in action

    Scenario: two (or more) sites stay online and reachable to some internet users, but have poor connectivity to each other. Now neither site has any automated way of knowing that the other is “really” down (ie from the entire internet) or that any slowdown is temporary, so both sites should keep serving users - but data isn’t being synchronized between the sites. There needn’t be a full outage for it to be a problem, it might just be a delay - a clogged link, or some faulty storage infrastructure - but the result is the same, inconsistent data in real time. Or as Amazon put it;

    Amazon SimpleDB keeps multiple copies of each domain. When data is written or updated (using PutAttributes, DeleteAttributes, CreateDomain or DeleteDomain) and “Success” is returned, all copies of the data are updated. However, it takes time for the data to propagate to all storage locations. The data will eventually be consistent, but an immediate read might not show the change.

    Consistency is usually reached within seconds, but a high system load or network partition might increase this time. Performing a read after a short period of time should return the updated data.

    Designers put in multiple resilient paths between sites, add out-of-band management systems, and wake up Network Operation Centre engineers at 2AM to try and minimise these problems, but it can and does still happen. Some commentators think that it’s just matter of time before service providers can get around these problems, but really it’s not, unless they know something about the speed of light the rest of us have missed.

  1. A virtual netsplit

    Scenario; there are 2 (or more) sites, say 6,000km, or better expressed as 100ms, apart. There’s a user in the middle, half-way between both. For that user, it only takes 50ms to reach either site, but it takes 200ms for any reliable commit to their data to roundtrip. Adding 200ms of delay to every user query is not acceptable, so we must live with a little lazy data replication.

    What happens if they end up being served by both sites? There’s two ways this can happen - one is ordinary enough, their main site goes offline suddenly, and they failover to the other, but their data didn’t make it in time. The other is far more insidious.

    a netsplit

    When DNS tricks or IP anycast are used for global load distribution, there are bizarre corner cases in which a user may be intermittently served by different sites. Their DNS or proxy servers may have source addresses in different networks, or maybe their ISP doesn’t carry a full routing table and they’re a victim of some equal-cost multi-path routing (although that’s a severe case, which will kill all TCP services anyway, it’s still a problem with low TTL DNS tricks).

    If the application queries rely on state from previous requests, it can get very confusing very quickly when the data is taking longer to replicate than the users queries are flipping between sites. (As an aside: imagine this problem multiplied by factorial(N) where N is your total number of users and welcome the pain of implementing a stateful peer to peer network).

So, those are the problems, and there are good and practical engineering measures used to mitigate them; decent global load-balancing techniques will minimise any flipping of users between sites, well-maintained and monitored resilient links between sites will keep splits very rare, and a larger number of more widely dispersed sites will keep alternative sites closer-by and allow for tiered data synchronization.

Why am I going on about netsplits and the speed of light? well, the thing to take from all this is that there are no guarantees about data consistency across sites in real time, and reasons behind this are fundamental. You can have excellent data consistency, at the cost of some latency, or you can have low latency with slower guarantees on data consistency across sites, but never both; sorry.

Often, this doesn’t matter at all, there are so few writes - or they are not critical to interactivity - that it’s acceptable to have a single master site and some delay from far away locations. But many web applications are highly interactive and data heavy. Every time you make a move on scrabulous, you need your opponent to see it. When you add a photo to a set on flickr, you expect your friends to be able to see it.

Data-modeling with Object and Tuple stores

The constraints all of this places on data management have large effects on how both Amazon’s and Google’s load-distribution and data-stores work, they’ve both been designed with this in mind. There’s no referential integrity, uniqueness of records is not enforced at all, nearly all of the features we’ve become addicted to from RDBMSs are missing, and the documentation makes no guarantees of data consistency or even disclaims it.

Developers would love it, but it’s simply impossible to provide a global-scale super-resilient super-low-latency SQL-alike data-store. So what happens when you learn to live with lazy data?

Well it’s 2008, and these days it’s hard to imagine any new web service that didn’t have support for tags and folksonomies. So what if you’re selling furniture online, couches can have tags. Providing services to a salesforce? Yep, you need tags. A forum for talking about candle-making? how could you remember what it was like before tags. Tags tags tags, everybody needs tags. So, let’s add them to an AWS or AppEngine site.

I’ve always thought there was a hidden market to be tapped into in tagged blob storage. Todays dotComs are all about conquering a middleware niche, so why not make the best tagged blob service there is - let other sites worry about what the blob actually is. So this should be easy peasy, right? Tags are just a one to many attribute. We knock up something like;

Google AppEngine API python:

class Blob(db.Model):
    blob = db.BlobProperty(required = True)
    tag = db.ListProperty(db.Category)

blob = Blob(blob = bindata, tag = ["colmmacc", "cc2.0bync"])

AWS SimpleDB pseudocode:

PUT (blob, $uuid), (data, $s3url), (tag, colmmacc), (tag, cc2.0bync)

We’re done, aren’t we? we can now tag blobs, select them by tag, and even the list of tags on a per-blob basis. That’ll add an extra zero to our selling price, and it only took 5 minutes, great stuff. Well, not quite, because we don’t just want tags, the marketing department says that’s not cool enough, we want tagclouds.

A tagcloud

That subtle change in usage pattern makes a big difference though. Now we need to count tags across all of our blobs, and probably against other dimensions too - like which users own the blobs, because we want per-user and global tagclouds. To get an idea of the difference in complexity this all implies, have a good read of a suggested SQL schema for tags.

Google’s data-store is really an object store, and rather than designing data schemas, if you think about how you would solve a problem if it was a bunch of in-memory objects, with standard data-structures like trees and indexes, you’ll generally come up with a good data model for your problem. If we try to re-create something like the SQL-schema for tags, we won’t get very far. We can create tag and linker objects;

class Tag(db.Model):
    name = db.StringProperty(required = True)

class Blob(db.Model):
    blob = db.BlobProperty(required = True)

class TagBlobLink(db.Model):
    tag  = db.ReferenceProperty(Tag)
    blob = db.ReferenceProperty(Blob)

but this let’s us down in one big way; we can’t enforce the uniqueness of tags. Although we can hammer out something like;

# See if our tag exists
try:
     t = Tag.all().filter("name =", tagname).fetch(1)[0]

# otherwise create it
except IndexError:
     t = Tag(name = tagname)
     t.put()

without a global, reliable, locking service it’s just not safe - two instances might create identically named tag objects at the same time. The same problem crops up if you try to maintain your own reciprocal indexes of tags and blobs in the Tag and Blob objects themselves - when it’s time to add a new tag or blob, you can’t guarantee uniqueness without a lock-step. Unfortunately Google don’t provide AppEngine authors with access to Chubby or any other locking service.

One solution may be to use the data-store itself as a locking service, but you’d have to have a good idea of any likely data propagation times, build sleep() calls in, and still hope for the best. A better, more acceptable solution, is probably to tolerate some conflicts - simply allow for the possibility of two Tag[tagname = "colmmacc"] objects - and have an offline vacuum process which merges the conflicts, particularly easy in the case like this.

A bigger problem with the Google Datastore API is the extremely poor support for operators and extended query results. It’s not even possible to OR or NOT query conditions, to COUNT full result sets or retrieve more than 1,000 results from any query. That makes implementing our tag-cloud even harder again, because we’re going to have tags associated with more then 1,000 blobs.

If we wanted to build a tag cloud for two users (which would normally use an OR), we would have to read in all of their blobs - 1,000 at a time, store and sort them all to eliminate duplicates and then count the results iteratively. That’s a lot of overhead to have at the application layer.

Amazon’s data-store is all about rows of associative tuples. As all attributes are string-based tuples, it makes it possible for them to indexed in multiple dimensions and directions - and when you think about it, it’s easy to see how almost any data model can be described in that way.

Going back to our simpledb example:

PUT (blob, $uuid), (data, $s3url), (tag, colmmacc), (tag, cc2.0bync)

The great thing about simpledb is that you don’t have to update the whole row at once. You can add one tuple at a time. It might not make it to all sites instantly, but it at least keeps the consistency problem local to that attribute - not the entire row/object. Since I don’t have to worry about keeping a list consistent, I tag the blob by adding an attribute.

There are still some race conditions, but they tend to match human expectations. For example, John at site A might add a tag “foo” at time T, then Sally at site B might add the same tag at time T+1. Then John changes his mind and removes it at time T+2, but then Sally’s change might come in and trounce it. But that’s what happens when humans make conflicting changes in real-time anyway.

SimpleDB has full support for OR’s, NOT’s and other such logical operators. It doesn’t have a COUNT, so you still have to read in the results to count them - this time only 250 results at a time - but at least you don’t need to sort them. Amazon doesn’t have a locking system either per se, though it does have a Queue service, and any CS undergrad will be able to tell you how to implement a semaphore-based lock using a queue. But that’s definitely not what it’s intended for.

Still, both examples are ugly; to get our counts of tags, we’re having to paginate through result sets iteratively, consuming memory and CPU in the process - that’s hardly optimal. One solution would be to replicate in our application what many RDBMS’s do, and to maintain an N dimensional cache of our tag counts, and update it on write.

Distribute the operations, not just values

But that brings me to something else, besides a locking service, that I think is missing. Wouldn’t it be cool if you could distribute operations instead of just values? Lets say our tag cache looked like:

Tag Count
sex 4816
love 510
rockandroll 1377

Whether we’re using Amazon or Google’s services, if we want to add a new tag and update the cache, we have to read the value, increment it, and then write it out again. If this happens simultaneously at two sites, our count will be wrong, and we’ll be missing a value.

Instead, imagine an API that let you define those values as counters, and you were given a set of distributable operations, increment(value), decrement(value). The cloud could apply the operation as it distributes, regardless of what the value is. Although the values still won’t be instantaneously consistent, it will be roughly right, and we’re never going to be negating ourselves.

One way to implement this kind of behaviour in application space is to use SimpleDB’s tuples like a stack, pop-ing operations when we’re sure they have successfully distributed. When we want to add or subtract to our total, we push an operation, a row might look like this;

(tag, love), (count, 510), (pending-operation, "increment,timestamp,uuid")

the pending-operation consists of the operation name, a timestamp so that we can later
get rid of them, and a uuid or hostid to handle the case where the same operation occurs at the same time from two different hosts.

Now, when we want to perform a count, we take our total, and proceed through the pending-operations - if any - at runtime. SimpleDB’s eventual consistency model will have been adding pending-operations as the data synchronises, so we might be a bit out, but that’s ok. Then, to keep the stack of pending-operations to a reasonable size, we have an offline process, or a conditional block, which does the following:

for operation in pending-operations:
     # 10 minutes should be plenty of time for operations to propagate
     if now() - operation[timestamp]  > (10 * 60):
            count = operation[operation](count)

     # Pop the operation from the stack
     operation.remove()

Of course to apply the above safely from a conditional block, we’d still need to use a lock - underscoring the importance of that missing feature, but it avoids expensive locking at run time. it’s a bit like have a map-reduce for real-time distributed data, though the examples here are simplistic, there are plenty of useful commutative functions and operators.

Counters like this are common enough in distributed applications, how else would you know how many people have viewed your OkCupid profile that day? But there’s another very closely related data type for which distributed operations rock; inventory.

Let’s say I have 200 concert tickets to sell online, and I’m pretty sure it’s going to sell out in minutes (it’s Bob Dylan performing to a small audience). If I use a global lock and counter, my site is going to be slow, and it will have a hard time as it is. If I start a count at 200 and use a distributed decrement approach like above, there’s a big risk of overflowing past 0 and upsetting some people.

Instead, there’s a better way - use a registry or ticketing system. Have an inventory data-type, and allow it to be divided between the sites on demand. So, say it starts at 200, and we tell the API to carve it up 10 tickets at a time; each site should take a batch of 10 tickets, and when it gets down to only 2 left, request another 10 and so on. This gives us the best of both worlds, we have a local inventory that we won’t take seconds to lock, and we guarantee that we won’t overflow the total.

If you’re using EC2, and you know how many sites you’re running in, you can build this kind of logic into an application easily enough using Amazon’s Simple Queue Service, but it sure would be a nice feature to have available more directly.

Data from the other side

One big gap present in both services is data analysis. That might seem like a trivial thing, but it’s a big deal - analysing your business and usage data can tell you can awful lot about what’s happening and what directions to take.

Both Google and AWS have reasonable services for reporting your usage; there are graphs, logs, and all of the data is there. In Google’s case, since all of the requests go through its application framework, it sees everything and it’s all collated into one easy-to-use analytics-like interface. You get joined-up access to HTTP logs and Python tracebacks.

If I were Amazon, I’d be working on a simple logging service, with plugins for Apache, syslogd, all of the programming language, and whatever else could ever want to log. Having everything in one place is a big deal, you can then correlate events, establish usage patterns and identify attacks and uncommon behaviour.

But what both miss is the higher level; not just the graphs and numbers, but statistical analyses like Holt-Winters projections that will allow users to identify real dips in their throughput which could be due to technical or marketing problems.

Service monitoring and usage analysis is a big problem for businesses, and my guess is that many would jump at the chance to benefit from the scale and experience of a provider like Amazon or Google.

Interesting times

These are certainly interesting times for data management. Internet-scale applications are coming to the masses, and very clever computer scientists and architects are coming up with abstractions and models that allow ordinary developers to cope with the fallout of using a planet instead of a hard-disk.

Are we going to see mid-level services like locking and distributable operations, that are present in some DBMS’s become part of this too? or will they turn out to be dangerous anti-patterns that prove too tempting to abusive and lazy developers. I hope it’s the former, but in the meantime it’s great to think out and learn from the design decisions made by the industry giants. They’ve thought very hard about these things.

The tag-cloud image is from Luca Cremonini, via the Wikimedia Commons project.

Apachecon EU 2008

Posted on April 11, 2008, under apache.

Somewhat unexpectedly, I’ve spent today at Apachecon Europe 2008 in Amsterdam, and managed to deliver a talk on APR on short-notice. It’s been great to catch up with people I haven’t seen in a long time, years in many cases. Hopefully I’ll be in New Orleans later this year to catch up over a full apachecon, and get back to hacking again.

APR Hello World

The presentation wasn’t exactly well-prepared, so there are only 13 slides, but for the record … here they are. And yep, even the Hello World example contains a bug!

Google Apps Engine - a quick review

Posted on April 8, 2008, under general.

Yesterday Google launched AppEngine, a new hosting service for Python Applications, and having had a few hours to play with it - I’m finding it quite slick. It took about 10 minutes to go from nothing to a fully working first-application from the documentation, with the SDK making things very easy.

I’m going to to write some applications, for personal and work use, so I’ll get a better chance to review it in detail, but so far I’m very impressed - it does what it says on the tin.

From my immediate needs, the only thing that I’d like that is missing is support for scheduled tasks (run this python script in the same context, once an hour) - so that I can pre-generate some particularly burdensome output. Some other things would be nice, like the ability to act as a mail receiving gateway - not just sending e-mails, but that’s mostly useful for helpdesk, or e-commerce style apps, and there are workarounds.

I haven’t been this excited about a Google product in years, this is good stuff! Now, time to port my photo-blog :-)

Off to Andorra

Posted on March 5, 2008, under general.

For the first time in 4 years, I’m going on a real holiday, one of those bizarre periods of time that apparently involve not doing any work. For the first time in even longer, probably at least 10 years, I plan not to be on-line at all during that time. There’s been a regime building up to this important milestone; at first some evenings went by without checking e-mail, then even entire weekends, so I think I’m ready - it’s time.

I’m really looking forward to spending some quality time with earhart, and in a totally new departure for me - learning to Ski. Taking a week without thinking about work or college at all should do me a world of good, and hopefully I’ll come back refreshed and prepared to take on both again at full speed. But if you’re looking for me next week, that’s where I am - I’ll have my phone with me for genuine emergencies, but that should be it.

As part of the prep work for a ski-trip, I decided to buy a ski jacket rather than borrow one, and was amazed at the service I got from Port West. I ordered an excellent, breathable, fleece-lined, dozens-of-pockets-having high quality jacket at 11 AM on a Thursday, and it was delivered less than 24 hours later, for only €76 including delivery, which seems to be at most half what I would have otherwise paid in a shop. Already it’s turned out to be useful, as we’ve been having blizzard-like conditions in TCD this week. Highly recommended!

FlickrTouchr.py - keep your flickr album and favourites on an iPod Touch or iPhone

Posted on January 21, 2008, under general, photography.

One of the great things I really like about the iPod touch is that it makes an excellent portable viewer for photos. In return for sacrificing 1GB of potential music storage, I get about 2,000 browseable, zoomable photos onto something that’s always in my pocket. Cool!

Flickr sets on an Ipod     Flickr favourites on an Ipod

To make the process easier, here’s a python script to synchronise the photos on an ipod from a flickr account. It’s at http://www.stdlib.net/~colmmacc/flickrtouchr.py.txt. It should run on any *nix platform (personally I’m running it on a Mac) and there’s some kind of a chance it might even work on Windows (but who knows).

The whole thing is a single python file, as implementing the flickr API calls myself rather than use any of various python flickr libraries (as they each seem poorly maintained) seemed easier. One of the reasons I’m even putting it online is that beyond anything else, it may be useful as an example of a truly minimal python implementation with the flickr API.

You run the script like so;

    flickrtouchr directory-name

and after it asks you to authenticate against your flickr account, it will progress through syncing your entire flickr account into a directory hierarchy. It keeps a small cache of authentication data in the directory too. It will be organised like;

    directory-name/set-name/[photos]
    directory-name/other-set-name/[photos]
    directory-name/No Set/[photos]
    directory-name/Favourites/[photos]

If you have the same photo in multiple sets it will use hard-links to preserve local disk-space. It also only downloads the most optimal version of the photograph available for the iPod touch/iPhone screen. Everytime you run it, it downloads the missing photos and new sets. Once you configure iTunes to synchronise photos from a folder, rather than iPhoto, it works great.

iTunes syncing from a folder

Of course it will work with any other device or software that takes photo from a hierarchy of folders. Happy flickring.