Archive for November, 2005

Duff’s Device and Scheduler Benchmarking

Posted on November 28, 2005, under general.

In the past year, I’ve given the Scaling Apache talk more than a few times, and essentially I think there’s one genuinely useful thing in it: using dd as a window into scheduler and VM performance. This actually came about by accident more than design, but it has proved to be a reasonably good, quick way of getting a handle on how well a system will perform. It’s not intended to be as useful as something like lmbench, aimbench or tiobench, but it will certainly tell you whether a system has improved and will allow you to compare two systems with each other. And since the book I reviewed in my previous post dealt with analytical system administration, I thought I’d add a small contribution to the blogosphere.

About three and a half years ago, when zero-copy I/O was a hot topic in performance tuning, we decided to set about determining the optimum buffer size for such I/O. If you’ve never encountered zero-copy before, it’s a pretty obvious idea really. When data is being transferred from one place to another and various different parts of hardware and software are involved, they each have their own buffer size: the size of the chunk of data they read at a time. So, a disk drive might read 64 Kbytes at a time, but then the kernel might read 128 Kbytes at a time and the application only 4 Kbytes at a time. This means that memory somewhere in the system has to buffer all of the gaps, and it also means that we have a lot more I/O operations than we need. This becomes even worse when you’re faced with buffers that differ by only a small amount, a bit like this:
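
To get a rough feel for the operation-count side of this, here’s a purely illustrative comparison (the block sizes are made up for the example rather than measured from any real device): copying the same megabyte with two different block sizes shows up directly in the records in/out counts that dd reports on stderr.

# Purely illustrative: the same 1 MB copied with two different block sizes.
# The "records in" / "records out" counts on stderr show the difference.
dd if=/dev/zero of=/dev/null bs=64k count=16    # 16 reads for 1 MB
dd if=/dev/zero of=/dev/null bs=4k count=256    # 256 reads for the same 1 MB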

Since we were going to go to the trouble of ensuring that buffer sizes matched up everywhere from the socket layer through to the application layer, it made a lot of sense to pick the most efficient buffer size. We reasoned that even the complex mash of disk drives, RAID memory, kernel subsystems and the virtual filesystem layer must have some preferred data size. The simplest method to try and put a figure on this is to read with lots and lots of different buffer sizes and measure the speed of those reads. So we wrote dder.sh:

#!/bin/sh

STARTNUM="1"
ENDNUM="102400"

# create a 100 MB file
dd bs=1024 count=102400 if=/dev/zero of=local.tmp

# Clear the record
rm -f record

# Find the most efficient size
for size in `seq $STARTNUM $ENDNUM`; do
        dd bs=$size if=local.tmp of=/dev/null 2>> record
done

# Get rid of junk and produce "<buffer size> <bytes/second>" lines.
# (This assumes the patched dd prints something like
#  "... transferred in X seconds (N bytes/sec)", so field 7 is "(N"
#  and the cut strips the leading bracket.)
grep "transf" record | awk '{ print $7 }' | cut -b 2- | cat -n | \
while read number result ; do
        echo -n $(( $number + $STARTNUM - 1 ))
        echo " " $result
done > record.sane

This rather hacky script relies on the pretty common dd-performance-counter.patch, which makes dd output a “transferred in X seconds” message when done. When run, the script produces a nice simple file, formatted “<buffer size> <bytes/second>”, which, plotted with gnuplot, comes out looking something like this:
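
For anyone who wants to reproduce the plots, something along these lines will do the job (the exact gnuplot settings aren’t important, this is just one way of doing it):

#!/bin/sh
# One way to plot record.sane: buffer size on the x-axis, throughput on the y-axis.
gnuplot <<'EOF'
set terminal png
set output "record.png"
set xlabel "buffer size (bytes)"
set ylabel "throughput (bytes/second)"
plot "record.sane" using 1:2 with dots notitle
EOF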

It’s a bit of a mess, isn’t it? Well, no, this graph is actually a wealth of information. There are over 100,000 measurements in there; that’s a good deal more than underpins our knowledge of the existence of most sub-atomic particles. We just need to learn how to read it.

This messy graph, which is typical of a memory-constrained x86 system, conveys a lot. First off, it’s pretty clear that there is a general trend, and that the peak is somewhere around the 30,000 byte mark. This is pretty surprising in and of itself, since nearly all Unix software uses 4096 or 1024 bytes as its default buffer size. On modern systems, where 30k of memory really is not a whole lot, this may be sub-optimal and there could be a strong case for changing those defaults to something much larger.

Looking more closely at the peak, we can see that it corresponds to over 1.1 x 10^9 bytes/sec, which is 8.8 Gigabit/sec (something not a lot of people realise: when dealing with bandwidth, the SI prefixes are interpreted exclusively as powers of 10, never powers of 2), close to saturation for a 10 Gigabit interface.

This particular data set was actually obtained from some RAID5 7200 RPM IDE disks, and there was a lot more I/O going on at the time too. It can be surprising just how little difference fast disks make to this kind of sequential-read load. Try to remember, next time you’re staring at iostat or sar output trying to get better performance for your load or to debug an I/O problem, that even 7200 RPM disks can push 10 Gig if the load is organised correctly; replacing the disks may not be the answer (though of course loads with a lot of seeks and writes, like a database, really do need fast disks, for reasons unrelated to raw throughput).

The next thing we can see on the graph should be pretty clear: there is obvious banding. Although it’s a bit spread out, there are bands of white between the more densely packed red regions. Anyone familiar with steady-state physical processes will recognise this as a tell-tale sign of an underlying resonant frequency and its harmonics, and repeating the experiment always reproduces similar bands. Experimenting with different systems and different operating systems on differing underlying hardware seems to indicate that these frequencies coincide with optimal and highly sub-optimal buffer sizes for the physical hard disks.

What is really useful about the banding, though, is that it allows us to discern the next bundle of information from the graph. There are some clear sudden shifts up and down. The whole plot seems to instantaneously move upwards and downwards by about 3 x 10^8 bytes per second, bands and all. The graph looks like the superposition of an exponential curve and some step functions, or a square wave of varying periodicity.

Now, since we’ve gone to the trouble of determining that the bands are due to a physical component, and physical components don’t instantaneously change, it’s pretty unlikely that those sudden shifts are due to physical wear or something at the hardware level. Instead, it turns out those shifts are a reflection of how well the system scheduler and virtual-memory manager are co-operating to provide the dd processes with the resources they need. Sudden deprioritisation of the dd process, or kswapd becoming active, shows up very clearly. The jerkiness of the graph is a reflection of how uniformly (or non-uniformly, as the case may be) the system affords the dd processes CPU time.
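
If you want to check that interpretation on your own system, a crude approach (just a suggestion, it isn’t how the graphs in this post were produced) is to log VM activity alongside the benchmark run and see whether bursts of swapping or system CPU time line up with the shifts:

#!/bin/sh
# Crude check: record vmstat output once a second for the duration of the
# benchmark, so swap activity and system CPU time can be compared against
# any sudden shifts in the resulting plot.
vmstat 1 > vmstat.log &
VMSTAT_PID=$!
sh dder.sh
kill $VMSTAT_PID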

To illustrate this, here’s a graph from the same system with 4 times the amount of memory:

And here again with 12 times the original amount of memory:

We can now see that the dramatic shifts have all but disappeared, which shows just how big an effect memory has, even on processes which are barely using any. The virtual memory manager can eat a lot of CPU time when you’re short on memory, and processes become a lot more predictable when this doesn’t have to happen. Another thing which is clear from the graphs is that the peak hasn’t moved; it’s still around the 30k mark, and the banding is still present, though now we can count the 4 clear bands. This lends more credibility to our theory about the dependence of those features on underlying physical factors.

Now, let’s look at a graph from a totally different kind of system:

Note the difference in scale; the peak in the previous graphs wouldn’t even get a quarter of the way up this scale. This system looks like it could fill a 40 Gigabit/sec interface, and the curve seems to approach a gentle asymptote rather than peaking and then falling away. The bands are crystal clear, we have some really neatly clustered measurements around the harmonics, and there is almost no shifting. This graph is from a dual Itanium with 32GB of RAM. It turns out that all of Intel and HP’s work optimising the I/O paths for the chipset wasn’t entirely a waste of time, especially when you consider that the bus frequency is less than half that of the systems the prior graphs are from. Despite the terrible reputation Itanium has, it seems that it’s actually quite a good platform if raw throughput is what you want.

Now, I’m not really certain of the general-purpose usefulness of all of the above information. As it turns out, the optimal buffer sizes aren’t useful for true zero-copy when dealing with the internet, because, for the most part, it’s only possible to send a mere 1500 bytes at a time over internet links, and as the graphs show, 1500 bytes is way down in the sub-optimal range of any system.

We use large buffer sizes anyway and buy enough memory to cope. Where I have found the graphs useful is as a good visual aid to demonstrate to audiences the differences between certain systems and the efficacy of some tunings, as it’s always easy enough to see the differences. Tune the scheduler/vm better and the graph gets smoother; get faster disks and the plot gets higher, that sort of thing.
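
For what it’s worth, on a Linux box “using large buffer sizes anyway” mostly comes down to raising the socket buffer ceilings and having the application ask for bigger buffers. The values below are purely illustrative, not a recommendation:

#!/bin/sh
# Purely illustrative values: raise the per-socket buffer ceilings so that
# applications may request buffers of up to 256 Kbytes.
sysctl -w net.core.rmem_max=262144
sysctl -w net.core.wmem_max=262144
# The application still has to ask for the larger buffer itself,
# e.g. with setsockopt(fd, SOL_SOCKET, SO_SNDBUF, ...).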

But what I am sure about is that approaching these things analytically, designing experiments, taking measurements, interpreting results and refining theories is a much more productive way to get the most out of systems. The four graphs in this post convey a huge amount of information, as long as you know how to read it, and the means of measurement could hardly be any simpler. This was all based on just one variable, and it still taught us a whole lot.

There are subtle tweaks that can be made too. For example, instead of using the bytes/second number from dd itself, we could record the time before calling dd and after its exit within the shell script. That number would not only measure the I/O efficiency, but also incorporate the fork() and exec() speed, which are often useful to know.
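
A minimal sketch of that variation might look something like this (it assumes GNU date with the sub-second %N format, and bc for the arithmetic; neither is part of the original dder.sh):

#!/bin/sh
# Time each dd run from outside the process, so the figure includes
# fork() and exec() overhead as well as the raw I/O speed.
FILESIZE=104857600    # the same 100 MB local.tmp created by dder.sh
for size in `seq 1 102400`; do
        start=`date +%s.%N`
        dd bs=$size if=local.tmp of=/dev/null 2> /dev/null
        end=`date +%s.%N`
        elapsed=`echo "$end - $start" | bc -l`
        rate=`echo "$FILESIZE / $elapsed" | bc -l`
        echo "$size $rate"
done > record.wallclock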

At this point many system administrators will probably be tearing their hair out at my over-analysis and the sheer obviousness of much of the information in the graphs, but the real point is in going to the trouble of making them. Once you get into the habit of doing these things analytically, and get really used to having the data available to you in such a readily accessible form, you start to see entirely new ways in which you can optimise your system.

Just look at the amount of incredibly useful information which is condensed into Justin’s latest graphs on SpamAssassin rulesets for another example. Or have a browse around HEAnet’s MRTGs and see how much information you can parse quickly. So why on earth are we still staring at iostat, top and other ncurses and simple text apps?

Oh, and buy this book. Now hopefully I can finally put that talk to bed, and have forced myself into writing a new one.

Review: Analytical Network and System Administration

Posted on November 25, 2005, under general.

After it spent about two weeks sitting on my desk, I finally got around to reading Analytical Network and System Administration, by Mark Burgess. It’s an excellent book, and a useful resource for all system administrators, but it might not be entirely what you expect.

Unlike Burgess’ previous system administration books, or indeed any other book in the field, this is a true academic textbook and approaches system administration from a more formal perspective that has been sorely lacking. This book fills all of the gaps between what a real B.Eng. would typically teach and what a CompSci course usually covers. It explores and explains the science behind systems and networks.

If you already know things like process control theory, information theory, set theory and boolean algebra, digital signal processing, statistics (of all kinds), queueing theory and so on, you’ll hit the ground running and find this an excellent text which will help you apply those tools to greater effect within your systems. If you’d like to learn any of those things, get a real feel for what makes complex systems tick, and be empowered by applying the scientific method to your systems, this is a great book for that too, though you might need some other texts to help explain some of the theory. This book is not afraid of real maths, and it’s reasonably difficult to open a page that doesn’t have an equation or graph requiring at least some third-level mathematical education to digest.

There are some omissions; there isn’t much real network theory, for example. Spanning tree, Dijkstra and Floyd get less than a sentence each, and there’s not much discussion of how queueing theory affects network utilisation. Likewise, the coverage of variable co-dependence and its effect on experimental observations doesn’t mention the Taguchi method, although the Nash equilibrium is covered later in the book. Still, it strikes a balance between overwhelming the reader and introducing real, practical analytical methods.

On the whole, the book is an amazingly thorough collection of the theories applicable to systems. A strong theme throughout the book is that systems interact with humans and vice versa, and that this has to be borne in mind. The book is subtitled “Managing Human-Computer Systems”, and this even goes so far as to take into account the periodicity of daily usage patterns; by the end of the book, you might even want to run that through a Fourier transform to see if there are any other underlying patterns. (Incidentally, our MRTGs show at least 3 such frequencies: daily, academic-yearly and calendar-yearly.)

System Administration right now is very much an ad-hoc, haphazard art, and truly formal best practice has yet to emerge. But this book is the best start yet. Rather than vague handwaving and hunch-based estimates, it presents a real metric-based approach to system administration. It shows that although these are complex systems, the mathematical tools do exist to monitor, measure and refine them, and that we can and should put them to use.

The book is naturally intended as a text for academic programmes on System Administration, something we should see more and more of, but it is still a very useful reference for any engineer who has found themselves in the field. If you know what a steady-state process looks like, what hysteresis is, who Claude Shannon was, why bitwise operations are relatively quick, and can calculate a standard deviation, you should buy this book right now. If you’d like to know those kinds of things, it’s still a good buy, and its excellent but terse introductions to each area will serve you well as insight into the usefulness of these theories as you learn them.

What I’d love to see next is for someone to bridge the gap between what an MBA usually teaches and what a system administrator does. There are an awful lot of commonalities between system and financial administration, particularly in how to perform audits in a robust and repeatable manner and how to analyse the efficacy of administration. The business world has developed a lot of theory for the management of people and processes, and not all of it is nonsense. It’d be great to see that book written too.