Archive for November, 2005

Duff’s Device and Scheduler Benchmarking

Posted on November 28, 2005, under general.

In the past year, I’ve given the Scaling Apache talk more than a few times and essentially I think there’s one genuinely useful thing in it: using dd as a window into scheduler and vm performance. This actually came about by accident more than design, but it’s proved to be a reasonably good, quick way of getting a handle on how well a system will perform. It’s not intended to be as useful as something like lmbench, aimbench or tiobench, but it will certainly tell you whether a system has improved and will allow you to compare two systems with each other. And since the book I reviewed in my previous post dealt with analytical system administration, I thought I’d add a small contribution to the blogosphere.

About three and a half years ago, when zero-copy I/O was a hot topic in performance tuning, we decided to set about determining the optimum buffer size for such I/O. If you’ve never encountered zero-copy before, it’s a pretty obvious idea really. When data is being transferred from one place to another and various different parts of hardware and software are involved, they each have their own buffer size, the size of the chunk of data they read at a time. So, a disk drive might read 64 KBytes at a time, but then the kernel might read 128 KBytes at a time and the application only 4 KBytes at a time. This means that memory somewhere in the system has to buffer all of the gaps, and it also means that we have a lot more I/O operations than we need. This becomes even worse when you’re faced with buffers that differ by only a small amount, a bit like this:

Since we were going to go to the trouble of ensuring that buffer sizes matched up everywhere from the socket layer through to the application layer, it made a lot of sense to pick the most efficient buffer size. We reasoned that even the complex mash of disk drives, RAID memory, kernel subsystems and the virtual filesystem layer must have some preferred data size. The simplest method to try and put a figure on this is to read with lots and lots of different buffer sizes and measure the speed of those reads. So we wrote dder.sh:

#!/bin/sh

STARTNUM="1"
ENDNUM="102400"

# create a 100 MB file
dd bs=1024 count=102400 if=/dev/zero of=local.tmp

# Clear the record
rm -f record

# Find the most efficient size
for size in `seq $STARTNUM $ENDNUM`; do
        dd bs=$size if=local.tmp of=/dev/null 2>> record
done

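# get rid of junk: keep only the throughput figure from each "transferred" line
# and pair it with the buffer size that produced it
grep "transf" record | awk '{ print $7 }' | cut -b 2- | cat -n | \
while read number result ; do
        printf '%s %s\n' "$(( number + STARTNUM - 1 ))" "$result"
done > record.sane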

This rather hacky script relies on the pretty common dd-performance-counter.patch which makes dd output a “transferred in X seconds” message when done. When run, the script produces a nice simple file, formatted “<buffer size> <bytes/second>”, which plotted with gnuplot comes out with something looking like this:

It’s a bit of a mess, isn’t it? Well, no, this graph is actually a wealth of information. There are over 100,000 measurements in there; that’s a good deal more than is the basis for our knowledge of the existence of most sub-atomic particles. We just need to try and read it.

This messy graph, which is typical of a memory-constrained x86 system, conveys a lot. First off, it’s pretty clear that there is a general trend, and that the peak is somewhere around the 30,000 byte mark. This is pretty surprising in and of itself, since nearly all Unix software uses 4096 or 1024 bytes as its default buffer size. On modern systems, where 30k of memory really is not a whole lot, this may be sub-optimal and there could be a strong case for changing those defaults to something much larger.

Looking more closely at the peak, we can see that it corresponds to over 1.1 x 10^9 bytes/sec, which is 8.8 Gigabit/sec (something not a lot of people realise is that the SI prefixes are interpreted exclusively as powers of 10, never powers of 2, when dealing with bandwidth), which is close to saturation for a 10 Gigabit interface.
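For anyone who wants to check that arithmetic, here is a minimal sketch of the conversion in shell. The function name to_gbit is made up for illustration and is not part of dder.sh; it only assumes that an awk is on the PATH.

```sh
#!/bin/sh
# Convert a bytes/second figure into decimal Gigabit/second.
# to_gbit is a hypothetical helper name, not part of dder.sh.
to_gbit() {
    # 8 bits per byte; for bandwidth, Giga means 10^9 rather than 2^30
    awk -v bps="$1" 'BEGIN { printf "%.1f\n", bps * 8 / 1e9 }'
}

# The peak read off the graph: 1.1 x 10^9 bytes/sec
to_gbit 1100000000
```

Fed the peak figure, this prints 8.8, matching the number quoted above.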

This particular data set was actually obtained from some RAID5 7200 RPM IDE disks, and there was a lot more I/O going on at the time too. It can be surprising just how little difference fast disks make to this kind of sequential-read load. Try to remember this the next time you’ve been staring at iostat or sar output trying to get better performance for your load or debug an I/O problem: even 7200 RPM disks can push 10 Gig if the load is organised correctly, so replacing the disks may not be the answer. (Though of course loads with a lot of seeks and writes, like a database, really do need fast disks, for reasons unrelated to raw throughput.)

The next thing we can see on the graph should be pretty clear: there is obvious bandation. Although it’s a bit spread out, there are bands of white between the more densely packed red regions. Anyone familiar with steady-state physical processes will recognise this as a tell-tale sign of an underlying resonant frequency and its harmonics, and repeating the experiment always reproduces similar bands. Experimenting on different systems and different operating systems with differing underlying hardware seems to indicate that these frequencies coincide with optimal and highly sub-optimal buffer sizes for the physical hard disks.

What is really useful about the bandation though is that it allows us to discern the next bundle of information from the graph. There are some clear sudden shifts up and down. The whole plot seems to instantaneously move upwards and downwards by about 3 x 10^8 bytes per second, bands and all. The graph looks like the superposition of an exponential curve and some step functions, or a square wave of varying periodicity.

Now, since we’ve gone to the trouble to determine that the bands are due to a physical component, and physical components don’t instantaneously change, it’s pretty unlikely that those sudden shifts are due to physical wear or something at the hardware level. Instead, it turns out those shifts are a reflection of how well the system scheduler and virtual-memory manager are co-operating to provide the dd processes with the resources they need. Sudden deprioritisation of the dd process, or kswapd becoming active, becomes very apparent. The jerkiness of the graph is a reflection of how uniformly (or non-uniformly as the case may be) the system affords the dd processes CPU time.

To illustrate this, here’s a graph from the same system with 4 times the amount of memory:

And here again with 12 times the original amount of memory:

We can now see that the dramatic shifts have nearly disappeared entirely, which shows just how big an effect memory has, even on processes which are barely using any. The virtual memory manager can eat a lot of CPU time when you’re short on memory, and processes become a lot more predictable when this doesn’t have to happen. Another thing which is clear from the graphs is that the peak hasn’t moved, it’s still around the 30 KByte mark, and the bandation is still present, though now we can count the 4 clear bands. This lends more credibility to our theory about the dependence of those features on underlying physical factors.

Now, let’s look at a graph from a totally different kind of system:

Note the difference in scale: the peak in the previous graphs wouldn’t even get a quarter of the way up this scale. This system looks like it could fill a 40 Gigabit/sec interface, and the graph seems to approach a gentle asymptote rather than peak and then recede. The bands are crystal clear and we have some really neatly clustered measurements around the harmonics, and there is almost no shifting. This graph is from a dual Itanium with 32GB of RAM. Turns out that all of Intel’s and HP’s work optimising the I/O paths for the chipset wasn’t entirely a waste of time, especially when you consider that the bus frequency is less than half that of the systems the prior graphs are from. Despite the terrible reputation Itanium has, it seems that it’s actually quite a good platform if raw throughput is what you want.

Now, I’m not really certain of the general-purpose usefulness of all of the above information. As it turns out, the optimal buffer sizes aren’t useful for true zero copy when dealing with the internet because, for the most part, it’s only possible to send a mere 1500 bytes at a time over internet links, and as the graphs show, 1500 bytes is way down in the sub-optimal range of any system.

We use large buffer sizes anyway and buy enough memory to cope. Where I have found the graphs useful is as a good visual aid to demonstrate to audiences the differences between certain systems and the efficacy of some tunings, as it’s always easy enough to see the differences. Tune the scheduler/vm better and the graph gets smoother, get faster disks and the plot gets higher, that sort of thing.

But what I am sure about is that approaching these things analytically, designing experiments, taking measurements, interpreting results and refining theories is a much more productive way to get the most out of systems. The four graphs in this post convey a huge amount of information, as long as you know how to read it, and the means of measurement could hardly be any simpler. This was all based on just one variable, and it still taught us a whole lot.

There are subtle tweaks that can be made too. For example, instead of using the bytes/second number from dd itself, we could record the time before calling dd and after its exit within the shellscript. This number would not only measure the I/O efficiency, but also incorporate the fork() and exec() speed, which can often be useful values.
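As a rough sketch of that tweak, the helper below times each dd run from the shell instead of parsing dd’s own output. It assumes GNU date (for the %N nanosecond format) and a shell with 64-bit arithmetic; time_read is a made-up name, and none of this is taken from the original dder.sh.

```sh
#!/bin/sh
# Wall-clock timing of a single dd read, measured from the shell so that
# the cost of fork() and exec() is included in the figure.
# Assumes GNU date: %N (nanoseconds) is not available in every date.
# time_read is a hypothetical helper name.
time_read() {
    size=$1
    file=$2
    start=$(date +%s%N)
    dd bs="$size" if="$file" of=/dev/null 2>/dev/null
    end=$(date +%s%N)
    # Print "<buffer size> <elapsed nanoseconds>"
    echo "$size $(( end - start ))"
}

# Example use, mirroring the loop in dder.sh:
#   for size in $(seq 1 102400); do time_read "$size" local.tmp; done > record.wall
```

Note that this records elapsed time rather than bytes/second, so lower is better; dividing the file size by the elapsed time gives a figure comparable with the original plots.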

At this point many system administrators will probably be tearing their hair out at my over-analysis and the sheer obviousness of much of the information in the graphs, but the real point is in going to the trouble of making them. The point is that once you get into the habit of doing these things analytically, and getting really used to having the data available to you in such a readily accessible form, you start to see entirely new ways in which you can optimise your system.

Just look at the amount of incredibly useful information which is condensed into Justin’s latest graphs on SpamAssassin rulesets for another example. Or have a browse around HEAnet’s MRTGs and see how much information you can parse quickly. So why on earth are we still staring at iostat, top and other ncurses and simple text apps?

Oh, and buy this book. Now hopefully I can finally put that talk to bed, and have forced myself into writing a new one.

Review: Analytical Network and System Administration

Posted on November 25, 2005, under general.

After spending about 2 weeks on my desk, I finally got around to reading Analytical Network and System Administration, by Mark Burgess. It’s an excellent book, and a useful resource for all system administrators, but it might not be entirely what you expect.

Unlike Burgess’ previous system administration books, or indeed any other book in the field, this is a true academic textbook and approaches system administration from a more formal perspective that has been sorely lacking. This book fills all of the gaps between what a real B.Eng. would typically teach and what a CompSci course usually covers. It explores and explains the science behind systems and networks.

If you already know things like process control theory, information theory, set theory and boolean algebra, digital signal processing, statistics (of all kinds), queuing theory and so on, you’ll hit the ground running and find this an excellent text which will help you apply those tools to greater effect within your systems. If you’d like to learn any of those things, get a real feel for what makes complex systems tick, and let applying the scientific method to systems empower you, this is a great book for that too, though you might need some other texts to help explain some of the theory. This book is not afraid of real maths, and it’s reasonably difficult to open a page that doesn’t have an equation or graph requiring at least some third-level mathematical education to digest.

There are some omissions; there isn’t much real network theory, for example. Spanning tree, Dijkstra and Floyd get less than a sentence each and there’s not much discussion of how queueing theory affects network utilisation. Likewise, the coverage of variable co-dependence and the effect on experimental observations doesn’t mention the Taguchi method, although the Nash equilibrium is covered later in the book. Still, it strikes a balance between overwhelming a reader and introducing real practical analytical methods.

On the whole, the book is an amazingly thorough collection of the theories applicable to systems. A strong theme throughout the book is that systems interact with humans and vice versa, and that this has to be borne in mind. The book is subtitled “Managing Human-Computer Systems”, and this even goes so far as to take into account the periodicity of daily usage patterns, and by the end of the book, you might even want to run that through a Fourier transform to see if there are any other underlying patterns. (Incidentally, our MRTGs show at least 3 such frequencies: daily, academic-yearly and calendar-yearly.)

System Administration right now is very much an ad-hoc, haphazard art, and truly formal best practice has yet to emerge. But this book is the best start yet. Rather than vague handwaving and hunch-based estimates, this book presents a real metric-based approach to system administration. It shows that although these are complex systems, the mathematical tools do exist to monitor, measure and refine them, and that we can and should put them to use.

The book is naturally aimed as a text for academic programmes on System Administration, something we should see more and more of, but it is still a very useful reference for any engineer who has found themselves in the field. If you know what a steady-state process looks like, what hysteresis is, who Claude Shannon was, why bitwise operations are relatively quick, and can calculate a standard deviation, you should buy this book right now. If you’d like to know those kinds of things, it’s still a good buy and its excellent but terse introductions to each area will serve you well as insight into the usefulness of these theories as you learn them.

What I’d love to see next is for someone to bridge the gap between what an MBA usually teaches and what a system administrator does. There are an awful lot of commonalities between system and financial administration, particularly in how to perform audits in a robust and repeatable manner and how to analyse the efficacy of administration. The business world has developed a lot of theory for the management of people and processes, and not all of it is nonsense. It’d be great to see that book written too.

Getting rid of errant HTTP requests

Posted on November 24, 2005, under apache, general, mirroring.

When you run a busy website, you’re bound to pick up a lot of wacky requests, and some downright broken clients. Annoying behaviour can range from repeated nuisance requests to a full-scale Denial of Service attack. Competent mail server administrators will be very familiar with protocol-level techniques to try and hinder these requests, but you don’t hear much about it within HTTP.

This is partly because, luckily, abuse is rarer in HTTP and because not many people actually read their logs all that closely. Over at ftp.heanet.ie though, we get many, many millions of requests per day and the bad boys all add up. We see broken browsers, broken mirror scripts, huge “wget -r” or lftp grabs of massive portions of our tree, paths on our server hardcoded into some applications – which consider us the most reliable place to fetch an XSL file or check for the latest version of a perl package.

And we’re not alone: Dave Malone had to deal with a ridiculous NTP client which was using his web server as a time source (yep, NTP via HTTP date headers!), and it wasn’t even being polite enough to use a HEAD request, it was actually using “GET / HTTP/1.0\r\n\r\n”. We had to patch Apache to get around that one.

Over time, we’ve developed a few tactics for helping us defeat these annoying requests and get them off of our server as quickly as possible. The first trick of course is to identify them in the first place. Any decent log analysis package or even just getting a regular “feel” for traffic through mod_status will quickly identify any odd requests. If you see 250,000 requests for an XSL file, you know something is up. Likewise if you observe that a particular host is constantly connected to you, it’s possible there’s something that needs looking at.
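Short of a full log analysis package, a first pass can be as simple as counting requests per path. The sketch below assumes an access log in Apache’s common or combined log format, where the requested path is the seventh whitespace-separated field; the function name top_paths and the example log location are illustrative only.

```sh
#!/bin/sh
# List the most frequently requested paths in an access log.
# Assumes common/combined log format, where field 7 is the request path.
# top_paths is a hypothetical helper name.
top_paths() {
    log=$1
    n=${2:-10}    # how many paths to show, 10 by default
    awk '{ print $7 }' "$log" | sort | uniq -c | sort -rn | head -n "$n"
}

# Example use (the log path is an assumption):
#   top_paths /var/log/apache2/access.log 20
```

Swapping $7 for $1 gives the same breakdown per client address, which is a quick way to spot a host that is constantly connected.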

The next thing to look at is whether these requests are really a problem or not. In our case we can tolerate 250,000 requests for an XSL file; after all, we’re not short of bandwidth, which is the main resource being used. But it’s not something we would want to leave unchecked; we’re there to serve all sorts of content – not just XSL files. Huge “wget -r”s or clients which poll far too often are a concern for us though, because we optimise the server for the long downloads that make up most of our load. We don’t want to see lots and lots of small requests and all of the context switching that entails. They reduce the responsiveness of the system.

Unfortunately, it’s pretty rare that these illegitimate requests come from single, fixed, IP addresses, and when they do more often than not that address is a proxy or a NAT box and serves many other legitimate requests, so just applying an ACL doesn’t always suit. Instead we use mod_setenvif and mod_rewrite to classify requests based on the more exact nature of the requests.

Once we’ve done that, how we deal with them falls into 4 different categories:

  1. Malformed output

    The first thing we tried was simply returning malformed data at the URL they expect.

    So if a client was persistently querying say /foo/bar/blah.xsl , we would return an XSL file that was crafted to be utterly broken and contain lots of comments explaining why (though of course only to this client, other users get the original file). This is the same tactic Dave Malone employed to combat the bold NTP clients. We patched Apache so that a date header set with mod_headers would work (ordinarily Apache doesn’t let anything else set a Date header) and returned Dec 31st 1999 to every such client.

  2. Teergrubing

    For a lot of cases, malformed output works pretty well. But for others, typically automated processes long forgotten by their owners, it does little. For those we next tried a variation on the “teergrubing” used in some SMTP-level anti-spam defences. We just redirect to a cgi that does only a very little more than something like:

    #!/bin/sh
    
    echo "Content-type: text/plain"
    echo
    
    while true; do
        echo please contact mirrors@heanet.ie
        sleep 10
    done

    That worked pretty well, and caught a lot of brokenness, including people who had hard-coded us as a DTD source. It still left us with some annoying stragglers, though.

  3. CPU exhaustion

    The next trick we try, and one which is really pretty dirty, is to try and mount a Denial of Service attack on the HTTP client. Typically these clients aren’t well written, and if they even have any loop-prevention, it’s basic in the extreme. We exploit this by trying to cause their client to loop and loop between successive HTTP requests. Now, it’s relatively easy for anyone to detect a URI that redirects to itself, even with one or two levels of indirection, so instead we do:

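    #!/bin/bash
    # $RANDOM is a bash/ksh feature; a strictly POSIX sh may leave it empty.
    # The explicit Status makes this a client-side redirect; Apache treats a
    # path-only Location without one as an internal redirect.
    echo "Status: 302 Found"
    echo "Location: /thiscgi?$RANDOM"
    echo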

    Now that’s what I call mean. Our system can easily take the load and bandwidth this causes; theirs cannot, and it can pretty quickly wear them out, so soon enough we see the requests die.

  4. Memory exhaustion

    Something else I’ve been playing with lately is using the Content-Length: header, or the features of chunked-encoding, to try and exhaust memory on these clients. Many of the dumb clients seem to allocate a single buffer for the entire response, especially when using chunked-encoding. By trying to make the client allocate several Gigabytes of memory, they can occasionally be stopped dead in their tracks.

In reality, we actually implement the above with tiny Apache modules rather than CGI, for the sake of efficiency. Of course all of these are really only appropriate when you’re dealing with remote tasks someone is inappropriately running, or if they’re using woefully outdated harmful software, and if the requests are easily categorisable, and even then only if the CPU hit it causes on the server results in a net drop in these requests over time.

Hallo Von München

Posted on November 20, 2005, under general.

I’m over in Munich, where it’s starting to get a bit cold. I haven’t had long here (though I’ll be back in 2 weeks) but it’s a very nice place. I’ve been staying out with Noirin, at the OlympiaZentrum, right beside the Olympic Stadium, the BMW building and lots more.

Colm in Munich

As usual, the integrated transport system is an awful lot better than Dublin’s mess, especially as Munich is actually smaller than Dublin, both in terms of population and geographical size. It’s a real pity that Dublin doesn’t have anything comparable, because a great public transport network has a great social impact too. Here, London, New York, Paris and everywhere I’ve been that has a real network, there have always been examples of people from all walks of life, side by side on the same journeys.

In Dublin we have transport apartheid, with real class segregation. For example if you live in south east Dublin, the journey into town you take every day will go blissfully through leafy suburb after leafy suburb, classy Georgian Dublin, and then you’ll probably park right behind BTs and make use of the Powerscourt Centre, and maybe Grafton St. if you’re feeling adventurous. If you don’t feel like a drive, sure you can get the Dart anyway. All the while missing out on the real Dublin. Occasionally there might be a bus journey where you’ll have to tolerate two dozen teenagers with their 150 euro each to get plastered in the Wesley, but there won’t be much dealing with an upper deck full of scumbags smoking away. To me, that’s one of the biggest failings of a lack of transport integration. It helps perpetuate an attitude of real social ignorance, of comfortable isolation. It makes Ross O’Carroll-Kelly a possibility.

Unfortunately for us, Munich had the Olympics to solve its Transport problems and we’ve got Martin Cullen.

’srl

Posted on November 13, 2005, under general.

I’ve been incredibly busy this month, with Digital Rights Ireland, the HEAnet National Networking Conference and some other things. The conference went well, I gave a presentation on ftp.heanet.ie, with the Apache tuning things, but the real highlight for me was that the whole Digital Rights agenda has gone mainstream. There were 4 talks in this area, including excellent plenaries from Mike Nowlan and Karlin Lillington. Hearing the head of a computer services department speak positively about copyfighting left me with a real impression that a lot of important arguments have really been won.

Between multiple trips to Munich to see Noirin, and San Diego for LISA and Apachecon I’ve got 38 hours flying time over the next 4 weeks, so it’s unlikely I’ll update this blog much. But I’m going to try and put up more useful content, I have a few drafts sitting around that may prove interesting too.

Yesterday, thanks to Noirin, Apache got Irish Language error pages, which is pretty cool! I’ve got them installed on a few servers already, and they’ll ship with 2.2 when it releases soon. It’d be great to see more minority languages supported too, I’m surprised we beat Welsh.

Highly Recommended

Posted on November 1, 2005, under music.

The last 16 CDs I’ve bought:

Allison Krauss and Union Station - Lonely runs both ways
Karan Casey - Chasing the Sun
Cathie Ryan - The Farthest Wave
Danu - the Road less travelled
Danu - When all is said and done
The Ultimate Dolly Parton
John Doyle - Wayward Son
Hayseed Dixie - Kiss my Grass
Kila - Luna Park
Diana Krall - The girl in the other room
Kate Rusby - The girl who couldn't fly
Kate Rusby - Underneath the Stars
Serenity, the Soundtrack
Solas - Waiting for an Echo
Michael McGoldrick - Wired
Jennifer and Hazel Wrigley - Mither O' The Sea

All superb, particularly recommended are Kate Rusby and Danu, if you’ve never listened.