First results from the Niagara Benchmarking

Posted on March 11, 2006, under general, niagara.

In order to make sure that the real benchmarks are as efficient as they can be, we’ve repeated our usual procedure using dd to determine the most efficient buffer size on the platform. More details about that procedure can be found in my earlier mis-titled blog post on scheduler benchmarking.

For the sake of comparison, I’ll repeat the results from our dual Itanium, which has 32GB of memory;

[Graph: buffer size vs bytes/sec]

The important information to derive from the graph is the smoothness of the lines (which is a function of how well the scheduler and VM perform) and the absolute value of the bytes/sec number. The Itanium box can push about 3.5 x 10^9 bytes per second, or 3.5 Gigabytes/sec, which is 28 Gigabits/sec. Now bear in mind that the procedure involved is not multi-threaded or even multi-process, so we can very generously guess that the dual-CPU system could push about 56 Gigabits/sec of pure I/O throughput, completely ignoring the overhead implicit in multi-CPU I/O scheduling.

The benchmarking process relies on the presence of a version of dd with the dd-performance-counter patch from Debian, which the Sun box doesn’t have. Luckily, however, SUN now have much of the source for Solaris online, so I popped over to the OpenSolaris code browser, grabbed a copy of dd.c and, based on the Debian patch, came up with the following patch. I also modified our dder.sh script a little to cope with the lack of seq;

#!/usr/bin/bash

STARTNUM="1"
ENDNUM="102400"

# create a 100 MB file
./dd bs=1024 count=102400 if=/dev/zero of=local.tmp

# Clear the record
rm -f record

# Find the most efficient size
i=$STARTNUM
while test $i -le $ENDNUM; do
    ./dd bs=$i if=local.tmp of=/dev/null 2>> record
    i=$(( $i + 1 ))
done

# get rid of junk: the patched dd prints a summary line of the form
# "<N> bytes transferred in <T> secs (<R> bytes/sec)", so pull out the
# bytes/sec figure and pair it with the buffer size that produced it
grep "transf" record | awk '{ print $7 }' | cut -b 2- | cat -n | \
while read number result ; do
    echo -n $(( $number + $STARTNUM - 1 ))
    echo " " $result
done > record.sane
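
For anyone who wants to reproduce the graphs, record.sane ends up as two columns (buffer size, then bytes/sec), so a quick gnuplot sketch along these lines is enough to turn it into a plot (any plotting tool will do; this isn’t part of dder.sh itself);

#!/usr/bin/bash
# Plot record.sane (column 1: buffer size in bytes, column 2: bytes/sec).
# Just one way of doing it; substitute your favourite plotting tool.
gnuplot <<'EOF'
set terminal png
set output "record.png"
set xlabel "buffer size (bytes)"
set ylabel "bytes/sec"
plot "record.sane" using 1:2 with lines title "dd throughput"
EOF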

After a few hours of running, here is the result;

[Graph: buffer size vs bytes/sec]

So, we have a graph which is very similar in shape to the dual-Itanium box, except it’s a whole order of magnitude less in raw throughput terms. As we’ve seen above, a process could push up to 3.5 Gigabytes/sec on the Itanium box; on the Niagara box that becomes 0.34 Gigabytes/sec, or about 2.72 Gigabit/sec. Now, the Niagara box is virtualised in that the process runs on just one of its 32 logical CPUs. So if we extrapolate from there, and make the same generous guess as we did for the dual-Itanium box, we’d get about 87 Gigabit/sec, again completely ignoring the multi-CPU overhead.
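
If you want to check that arithmetic, it’s nothing more sophisticated than this (a quick bc sketch of the back-of-the-envelope numbers; it isn’t part of the benchmark scripts themselves);

# Back-of-the-envelope only, not part of the benchmark.
# Niagara: ~0.34 Gigabytes/sec from a single-threaded dd run.
echo "0.34 * 8"      | bc -l   # ~2.72 Gigabits/sec on one logical CPU
echo "0.34 * 8 * 32" | bc -l   # ~87 Gigabits/sec, generously scaled to 32 logical CPUs
# Itanium, for comparison: ~3.5 Gigabytes/sec single-threaded.
echo "3.5 * 8"       | bc -l   # ~28 Gigabits/sec
echo "3.5 * 8 * 2"   | bc -l   # ~56 Gigabits/sec, generously scaled to 2 CPUs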

Now, bearing in mind that the Itanium box would have to deal with the overhead of managing two physical CPUs, and the Niagara box with the overhead of managing 32 logical CPUs on 4 physical CPUs, there probably isn’t very much between them in reality in terms of how much raw overall I/O they can push. If I had to guess at this stage which would win, I’d say the Niagara box, but hopefully we’ll get much more meaningful information over the next few weeks.

Either way, both systems can probably comfortably saturate a 10 Gigabit/sec interface, and can certainly have a single process saturate a gigabit interface, which is all they ever really have to be engineered for; beyond that the number doesn’t matter a whole lot, unless you’re running a very, very busy database server. But this information is still very useful. For one thing, it gives me some confidence that with a properly tuned Apache build we can blow SUN’s own benchmarking numbers away; this system looks like it’s capable of very decent I/O performance. It also confirms the architectural and engineering decisions SUN says they’ve made. This system is architected for parallelism: it’s not supposed to have super amazing performance for any single-process task, it’s designed to run lots of those tasks better than anything else can, all at the same time.

They’re not lying: this is very different from hyperthreading. With hyperthreading we don’t see anything like these results; when we run our graphs we get plots like this;

[Graph: buffer size vs bytes/sec]

and it barely changes when we turn hyper-threading off. Hyper-threading seems like a convenient interface to enable better pipelining, and there’s nothing wrong with that, but if you’re running one process it won’t make a difference. Niagara, on the other hand, seems to behave just like a load of individual CPUs, with a lot less cross-subsidisation (if any). So, if you really want amazing single-process performance, or you have a requirement for a single process to be able to sustain, say, a 10 Gigabit/sec download, then Niagara is definitely not the right platform, but then SUN don’t claim that it is. SUN have made the design choice to really build for multi-threaded systems, and our benchmark here seems to validate that.

The only other information that can be gleaned from our graph is that the lines are slightly less smooth than on the Itanium box. Only a little, and frankly I’m surprised they’re as smooth as they are considering the amount of virtualisation that’s going on. The graph is still vastly smoother than any x86 plot we’ve ever produced, and about the only conclusion we can draw is that if you had some real-time task that was sensitive to the microsecond, Niagara probably wouldn’t be as good a choice of platform as Itanium, though it would still be much, much better than x86. Again, this is not the market SUN are aiming Niagara at, and we’re really just validating some engineering choices here.

So, armed with our new knowledge of the system’s potential, we’re going to really put this system to task and hopefully get the most useful information of all out of it: just how much real network throughput it can manage, and just how many concurrent downloads it can really handle.

Update:
O.k., so it looks like the roughness of the graph can be neatly explained by the Solaris scheduler granularity. Reading about it here, with pointers from Paul, reveals that by default ordinary processes are scheduled at a 100Hz granularity; contrary to my statement about real-time applications, there are APIs available which expose much higher-frequency scheduling, so ignore that part. I’m also informed that the Niagara SMT is likewise a means to increase pipelining efficiency, but that the pipelines are shorter and the switching latency much smaller. That does seem to be borne out by the above.
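
For what it’s worth, the sort of interface Paul was pointing at looks roughly like this on Solaris; treat it as a sketch only (priocntl needs root for the RT class, and a runaway real-time process can starve the rest of the box, and the buffer size below is just an example value);

#!/usr/bin/bash
# Sketch only: priocntl(1) can move a process into the real-time (RT)
# scheduling class, with a much finer time quantum than the default
# timeshare class. Needs root; use with care.

# List the scheduling classes configured on this box.
priocntl -l

# Re-run a single dd pass in the RT class with a 1 millisecond quantum
# (-t 1 -r 1000 means a quantum of 1/1000th of a second).
priocntl -e -c RT -t 1 -r 1000 ./dd bs=65536 if=local.tmp of=/dev/null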

2 Replies to "First results from the Niagara Benchmarking"

/~colmmacc/ » Blog Archive » Niagara vs ftp.heanet.ie Showdown  on March 23, 2006

[...] Another hack we applied to speed up Apache was to change the default buffer size, which is buried in the bowels of APR and can only be changed at build-time. In each case, the buffer size was changed as per the most efficient value (as determined by our previous benchmarks on single-threaded I/O). Don’t try this at home kids, unless you really know what you’re doing. [...]

/~colmmacc/ » Blog Archive » Niagara Benchmarks: Update  on March 27, 2006

[...] Since I’ve been posting material up on Niagara, a few other httpd and solaris folk have been chiming in with some more expert opinions. I’ve also received more dder output from various platforms, which I plan to post up when I get a chance. [...]
