Archive for 'niagara'

Niagara boxes deployed!

Posted on August 6, 2006, under general, niagara.

So, earlier this week, we deployed murphy.redbrick.dcu.ie, the T2000 Sun donated as the new main login server for RedBrick.

It’s been named Murphy after RedBrick founder David Murphy, who died tragically early this year. It’s fitting both that a main RedBrick machine be named after him and that it is a Sun server. Before becoming one of Google’s earliest employees in Ireland, David had a successful career at Sun Ireland.

It’s running Ubuntu, and so far the usual problems of a migration have been minimal. There was, as ever, a large number of programs that users didn’t tell us were missing during the 2 month pre-migration period, as well as a few binaries in user directories which needed to be recompiled. We’re also having a problem with nscd. But apart from those relatively minor issues (compared to other migrations in the past!), it’s gone very well indeed, and it’s nice to have 16 Gigabytes of RAM to play with!

Separately from that, after playing with some different installations on it, I’ve installed the T1000 and plan to deploy that in the next week or two. It’s also performing very well for us, and should make an amazing replacement for the Celeron with 256M of RAM it’s replacing! It will be used as a development host for some Apache work, a web host for some high-traffic sites (including this one) and a few other things. We haven’t decided where to colocate it just yet, but have a few options.

Solar Wind

Posted on May 23, 2006, under general, niagara, photography.

So, this time last week I was in San Francisco, courtesy of Sun, which was brilliant. Originally the plan was that I would be receiving a T2000 on-stage from Jonathan Schwartz during the JavaOne keynote. There were a lot of last-minute changes though, and in the end I managed to evade that, though I did get my access-all-areas badge for the general session – and managed to catch Jonathan Schwartz during the pre-keynote practice.

Union Square

SUN’s Global Communications team did arrange for a few press interviews, to cover the try and buy program. It was a curious experience: although I’ve done more than a few media interviews at this point, they’ve always been on political issues, never with “analysts”, who ask very different sorts of questions. Still, it was very pressure free, I could do them or not, and there was no line to toe or anything like that, which was good.

I didn’t quite get to meet everyone on my trip, but I did bump into fellow Apache committer Wendy Smoak through pure coincidence, and Tim Bray, who I got to thank for helping me get the T2000, and who put me in his Video blog! Mark was also kind enough to give me a great walking tour of San Francisco.

By far the best thing from the conference, from my point of view at least, was Mark Shuttleworth appearing and giving us some news about Ubuntu. It looks like 6.06 (Dapper) is going to have Sparc as an official distro, which is an excellent sign of Fabio’s great work. Niagara is fully supported by the kernel that ships, but alas the T1000 doesn’t quite work yet due to some issues with the Broadcom driver which are currently being blogged about.

Speaking of analysts, the discredited (at least if you follow GrokLaw) analyst Laura DiDio has weighed in on Ubuntu, so now I can confidently predict it will be a roaring success. Actually, over on ftp.heanet.ie, we’re getting ready for what could be an interesting week when 6.06 is actually released.

Golden Gate Bridge

Yet more reviews of the T2000 boxes are appearing, including one which is on Linux as well as Solaris, which gets some interesting results. It’s a bit worrying that only 24 CPUs showed up. There are probably still some T1 Linux kernel issues to be found.

The guys in SUN QA (based in Dublin) are doing some excellent profiling work too, and their Solaris numbers are surpassing mine at this point, so when I get the T1000 online (in a week or two hopefully) I’ll see if I can contribute any more useful feedback.

More Ubuntu on T2000

Posted on April 13, 2006, under apache, general, niagara.

Fabbione, the Ubuntu-sparc maintainer, got in touch and helped me out with our Ubuntu on T2000 issues. Turns out that installing dapper can be a bit sensitive to changes in the archive while you’re installing it. A re-install fixed the problems, and this issue will totally disappear once dapper is marked stable – as the archive will settle down.

He also pointed me at Dave Miller’s latest kernel with T1 fixes, which I’ve built from git, and showed me the way to the libc6-sparcv9v and libc6-sparc64v packages, which contain runtime optimisations for the platform. And the result is stunning. Ubuntu is now outperforming even Solaris express, and we’re sustaining 22,183.43 requests per second – using out of the box Apache 2.2.0. Not a single kernel tuning, Apache tuning, or anything beyond “CFLAGS=-Os” applied.

Frankly, I’m amazed. Amazed enough that I’ve rebooted into Nevada twice now just to confirm it’s not a change in the test environment. This machine just gets better and better, and Linux/Ubuntu really helps it get there. Of course things like hardware SSL acceleration don’t quite work in Linux yet, but I’m sure it’ll get there.

Now I know we’re going to be buying a few T1 boxes this year, and although I’ll be using Solaris for debugging and development work (where it is a superb environment), it’s looking less and less attractive for production deployments.

Ubuntu on a T2000

Posted on April 12, 2006, under apache, general, niagara.

After going through the pain of setting up a rarp daemon and debugging a very odd tftp problem (it turns out Sun’s OBP tftp client and tftpd-hpa do not interoperate), we finally got Ubuntu installed on the T2000. I absolutely love old-school boot managers. I just really like the simplicity of being able to connect to a serial line (especially from our Cyclades), and do anything. With the T2000 I can interrupt to the OBP, but also to the ALOM (advanced lights out manager) and the system controller. It’s great, but it could do with a “boot pxe” option :-)

The debian-install system isn’t as optimised for 9600 baud serial as the Nevada or Solaris 10 installers were, so it is occasionally annoying to sit through the screen refresh, but overall the install took much less time (about 30 minutes, compared to about 1 hour and 30 minutes) and worked without any big problems.

The ubuntu sparc distribution is a bit hairy right now, and I wouldn’t recommend it for anyone who isn’t very experienced with linux. For example, out of the box, /dev/null, /dev/zero, /dev/random and more have the wrong permissions – which breaks a lot of things – and the dummy start-stop-daemon binary has been left installed, which means that almost no init scripts actually work.

However, for all that, apt does work – and it kicks the living daylights out of anything Solaris has to offer in the way of patch or package management. It took me about 3 minutes to get a fully working compile environment up and running, and it took about 20 seconds to get the box fully up to date. Sun badly need this kind of ease-of-deployment, or there’s simply no way people like me could consider deploying Solaris on hundreds of machines. We keep over 100 Debian boxes up-to-date in an average of about 15 minutes’ work per week, using a combination of apt, sudo and our own apticron. Solaris simply has nothing to compete.

Ok, problem number 1 is that Solaris isn’t a distro, so it can’t really provide updates for most packages, but there isn’t even a credible way of upgrading the native Solaris-bundled stuff. Solaris may be great at providing the tick-box-compatible certification necessary for running many commercial applications, but it really is dire when compared to a good Linux distro in terms of ease of administration. It’s literally a waste of time. Everyone in Sun who works on Solaris should be forced to abandon BFU and start using what their customers have to deal with. Allow to bake for about 10 weeks, and out will come apt.

Anyway, back to Ubuntu;

colmmacc@murphy:~$ cat /proc/cpuinfo
cpu             : UltraSparc T1 (Niagara)
fpu             : UltraSparc T1 integrated FPU
prom            : OBP 4.19.0 2005/10/27 17:24
type            : sun4v
ncpus probed    : 32
ncpus active    : 32
D$ parity tl1   : 0
I$ parity tl1   : 0
Cpu0Bogo        : 2001.08
Cpu0ClkTck      : 000000003b9aca00
Cpu1Bogo        : 2000.02
Cpu1ClkTck      : 000000003b9aca00
Cpu2Bogo        : 2000.02
Cpu2ClkTck      : 000000003b9aca00
Cpu3Bogo        : 2000.01
Cpu3ClkTck      : 000000003b9aca00
Cpu4Bogo        : 2000.03
Cpu4ClkTck      : 000000003b9aca00
Cpu5Bogo        : 2000.03
Cpu5ClkTck      : 000000003b9aca00
Cpu6Bogo        : 2000.01
Cpu6ClkTck      : 000000003b9aca00
Cpu7Bogo        : 2000.02
Cpu7ClkTck      : 000000003b9aca00
Cpu8Bogo        : 2000.03
Cpu8ClkTck      : 000000003b9aca00
Cpu9Bogo        : 2000.03
Cpu9ClkTck      : 000000003b9aca00
Cpu10Bogo       : 2000.02
Cpu10ClkTck     : 000000003b9aca00
Cpu11Bogo       : 2000.02
Cpu11ClkTck     : 000000003b9aca00
Cpu12Bogo       : 2000.03
Cpu12ClkTck     : 000000003b9aca00
Cpu13Bogo       : 2000.02
Cpu13ClkTck     : 000000003b9aca00
Cpu14Bogo       : 2000.02
Cpu14ClkTck     : 000000003b9aca00
Cpu15Bogo       : 2000.02
Cpu15ClkTck     : 000000003b9aca00
Cpu16Bogo       : 2000.03
Cpu16ClkTck     : 000000003b9aca00
Cpu17Bogo       : 2000.02
Cpu17ClkTck     : 000000003b9aca00
Cpu18Bogo       : 2000.02
Cpu18ClkTck     : 000000003b9aca00
Cpu19Bogo       : 2000.02
Cpu19ClkTck     : 000000003b9aca00
Cpu20Bogo       : 2000.03
Cpu20ClkTck     : 000000003b9aca00
Cpu21Bogo       : 2000.02
Cpu21ClkTck     : 000000003b9aca00
Cpu22Bogo       : 2000.02
Cpu22ClkTck     : 000000003b9aca00
Cpu23Bogo       : 2000.02
Cpu23ClkTck     : 000000003b9aca00
Cpu24Bogo       : 2000.03
Cpu24ClkTck     : 000000003b9aca00
Cpu25Bogo       : 2000.02
Cpu25ClkTck     : 000000003b9aca00
Cpu26Bogo       : 2000.03
Cpu26ClkTck     : 000000003b9aca00
Cpu27Bogo       : 2000.02
Cpu27ClkTck     : 000000003b9aca00
Cpu28Bogo       : 2000.03
Cpu28ClkTck     : 000000003b9aca00
Cpu29Bogo       : 2000.02
Cpu29ClkTck     : 000000003b9aca00
Cpu30Bogo       : 2000.02
Cpu30ClkTck     : 000000003b9aca00
Cpu31Bogo       : 2000.02
Cpu31ClkTck     : 000000003b9aca00
MMU Type        : Hypervisor (sun4v)
State:
CPU0:           online
CPU1:           online
CPU2:           online
CPU3:           online
CPU4:           online
CPU5:           online
CPU6:           online
CPU7:           online
CPU8:           online
CPU9:           online
CPU10:          online
CPU11:          online
CPU12:          online
CPU13:          online
CPU14:          online
CPU15:          online
CPU16:          online
CPU17:          online
CPU18:          online
CPU19:          online
CPU20:          online
CPU21:          online
CPU22:          online
CPU23:          online
CPU24:          online
CPU25:          online
CPU26:          online
CPU27:          online
CPU28:          online
CPU29:          online
CPU30:          online
CPU31:          online
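For anyone who would rather not count those lines by eye, here is a minimal Python sketch that summarises a listing in this format. The function name and the shortened sample are mine, and I am assuming that the ClkTck field is a hexadecimal tick rate in Hz; if so, 0x3b9aca00 works out to exactly 1,000,000,000, i.e. 1GHz.

```python
import re

# Shortened, illustrative excerpt in the same format as the listing above.
SAMPLE = """\
cpu             : UltraSparc T1 (Niagara)
prom            : OBP 4.19.0 2005/10/27 17:24
ncpus probed    : 32
Cpu0Bogo        : 2001.08
Cpu0ClkTck      : 000000003b9aca00
Cpu1Bogo        : 2000.02
Cpu1ClkTck      : 000000003b9aca00
State:
CPU0:           online
CPU1:           online
"""

def summarise_cpuinfo(text):
    """Return (number of CPUs reported online, {cpu index: clock ticks})."""
    clocks = {}
    online = 0
    for line in text.splitlines():
        # Split on the first colon only, so values such as "17:24" survive.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        match = re.fullmatch(r"Cpu(\d+)ClkTck", key)
        if match:
            # Assumption: the field is hexadecimal ticks per second.
            clocks[int(match.group(1))] = int(value, 16)
        elif re.fullmatch(r"CPU\d+", key) and value == "online":
            online += 1
    return online, clocks

online, clocks = summarise_cpuinfo(SAMPLE)
print(online, "CPUs online;", clocks[0] / 1e9, "GHz on CPU 0")
```

Run against the full output above, this should report 32 CPUs online, which matches the "ncpus active" line.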

Nice! But how does it perform? Surprisingly well actually. Linux performs much better on the single-threaded I/O test. Again, I’ll hold off on the graphs, but the dd tests under Linux were roughly 10 times faster than Solaris. In rather an odd way, this didn’t translate into better single-download performance though – instead, under Linux we could only push about 60MB/sec.

And for the number that really matters; requests per second. Linux managed a neat 18,210.57 requests per second, which is within 10% of Nevada and more or less identical to Solaris 10. I should point out that the ridiculous Linux OOM killer did rear its ugly head during our more insane scalability testing though (try starting 500,000 threads due to a typo and you’ll find out all about it!). Solaris handles OOM much more gracefully IMO.

Another note is that “make -j 64” under Linux built Apache (and apr, apr-util) in about 2 minutes, compared to 5 minutes for “dmake -j 64” under Solaris, but that could be due to almost anything; it probably isn’t indicative of better FLOP performance. All in all, I’d say that Linux – which only started working on T1 a few short weeks ago – compares reasonably favourably with Solaris on the T2000. I’d also say that the T2000 is even better value for money with this information in hand, because it presents a greater range of options.

I wouldn’t rush out and run Linux on the box in production quite yet though. Ubuntu-sparc still needs some work, and there are doubtless many T1 kernel bugs yet to be found and ironed out. dtrace still represents a huge win on Solaris, and when the T1000 arrives, I can see us running dtrace dozens of times a day – it’s already helped me determine a huge amount of information useful for tuning Apache, and for reworking some of the code. But that said, for a production environment, once the Linux kernel and the Ubuntu distro get a bit more stable, they would be my personal choices for the T2000. The working day is just too short to sacrifice the wins of apt.

This is probably my last really large post on the T2000, as in a week, I’m going to drive it over to RedBrick, but it’s been great fun benchmarking the box, and I’m very, very grateful to Sun – for both their kind donations, and the opportunity to test the platform. It is very, very impressive kit. My Scaling Apache talk is going to be on at ApacheCon EU 2006 (check out the new logo btw), and I’ll be including some more detail there too, including some of the more useful information we’ve gleaned from using dtrace (I’m currently working on a per-nanosecond break down of an Apache request – how cool is that!), so do come along to that.

Nevada Testing

Posted on April 11, 2006, under apache, general, niagara.

Before we give Ubuntu a try on the T2000, we’ve upgraded it to the latest build of Solaris Express (nevada) with the latest version of the e1000 driver, and it certainly does make some improvement. The single-threaded I/O results are not much different, so I’m beginning to suspect that these results are more related to the 7200 RPM disks than anything else. If we get a chance to hook the T2000 up to a serious RAID array with 15K RPM disks, we’ll give it another shot. I’m going to post soon with the I/O graphs for a range of processors and operating systems, so I’ll save that graph for a small while.

Before upgrading, we applied some more tunings that we found on the T2000 SpecWeb page, but they didn’t make much difference it has to be said. As I previously blogged, regular Solaris 10 managed to push 15298.68 requests per second. Well straight out of the box, Nevada was pushing over 18,000 – which is a 20% improvement, not bad! The kernel bug which caused the event MPM to crash is now fixed, and using the event MPM gets us to … drum-roll … 20,417.33 requests per second. Better yet, the really nifty kssl functionality on Solaris means that you can use the event MPM without having to worry about its lack of support for mod_ssl right now.

Although the single-threaded I/O numbers have not really improved, Nevada is better for single-threaded downloads, and here we see a neat doubling of performance and the T2000 has no problem maintaining 80MB/sec (that’s 640 Mbit/sec) downloads, which is more than good enough.
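The arithmetic behind those figures can be checked with a couple of lines of Python. This is only a sketch with helper names of my own choosing. Note that “over 18,000” is a floor rather than an exact figure, so the first gain printed is a lower bound.

```python
def mbytes_to_mbits(mbytes_per_sec):
    """Convert a rate in megabytes per second to megabits per second."""
    return mbytes_per_sec * 8

def percent_gain(before, after):
    """Percentage improvement of `after` relative to `before`."""
    return (after - before) / before * 100

SOLARIS_10_RPS = 15298.68      # regular Solaris 10, from the earlier post
NEVADA_RPS_FLOOR = 18000       # "over 18,000", so a lower bound
NEVADA_EVENT_RPS = 20417.33    # Nevada with the event MPM

print(mbytes_to_mbits(80), "Mbit/sec")
print(round(percent_gain(SOLARIS_10_RPS, NEVADA_RPS_FLOOR), 1), "% at least")
print(round(percent_gain(SOLARIS_10_RPS, NEVADA_EVENT_RPS), 1), "% with event MPM")
```

So the event MPM result is roughly a third better than the Solaris 10 number, and the out-of-the-box gain is a little under 18% at minimum.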

One funny thing I have noticed is a bug in the T2000 firmware;

Sun Fire T200, No Keyboard
Copyright 2005 Sun Microsystems, Inc.  All rights reserved.

A google search for it seems to indicate it’s universal. But I can’t find a bug ID :-) Maybe the product line got renamed, or maybe it took legal a while to negotiate with the Terminator people, or maybe it’s just a typo, but either way I like it, it’s fun to have a few anachronisms, it adds character.

Not a Rio!

Posted on March 29, 2006, under apache, general, niagara.

Ok, so yet more generosity from Sun today. I’ve been told we can keep the T2000! Incredibly, the T1000 is still on its way, and that will be with us some time next week – but I can now recycle the cardboard from the T2000. Brilliant!

We would actually be more than comfortable deploying a T2000 as a host for ftp.heanet.ie, but things are not that simple. We’d have to migrate the 12TB of data from XFS to UFS, for a start. We’d also prefer to buy a server, with a support contract and so on (update: the donations actually come with SunService contracts), anyway. But the T2000 would give us better performance, and a much better NFS stack, which greatly expands our options for the future. When we next purchase an ftp.heanet.ie, Niagara is at the top of our list.

So, the T2000 will go to RedBrick, and we’re working on something else for the T1000. In the immediate future, it will serve as an excellent development and testing machine for Apache, and we’ll use it for the dtrace work (now well underway, by the way). It will also be incredibly useful for us to have a machine for ongoing comparison purposes, and maybe even some OpenSolaris hacking.

Stay tuned for more benchmarking work though, results will be coming soon.

Update: SUN have also very kindly offered to cover Nóirín’s travel costs to ApacheCon Europe in Dublin (read how this will help her, here). Nóirín has been doing a lot of work behind the scenes editing my blog-posts into readable English (and also improving the spelling to that above the level of a 10 year old!) and with Sun’s help, the universe has been particularly efficient at getting the karmic reward in order.

Erie in Eire

Posted on March 28, 2006, under apache, general, niagara.

Good news, SUN are donating a Niagara T1000 to us (and hence RedBrick), excellent! Damien, from the Sun performance team here in Ireland, called to tell us it’s on its way. We should have it in the next week or two. This isn’t part of the original contest, it’s from the Sun engineering team, and we’re very grateful. I’m sure RedBrick will put it to excellent use, and we’ll get them thinking about a server name right away!

After doing a lot of analysis today, we’ve figured out what was up with the benchmarks – we were inserting a systematic delay of about 30ms into every connection. How? By not using epoll() on the client side. Turns out that the blocking select() time in the clients was actually a very significant factor in how quickly the benchmarks were running.
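To illustrate the general idea (this is not the code of any of the benchmarking clients, just a small Python sketch with a function name I made up), the standard selectors module picks the best readiness API the platform offers, which is epoll on Linux, instead of a plain select() call:

```python
import selectors
import socket

def wait_for_readable(socks, timeout=1.0):
    """Return the sockets in `socks` that have data waiting to be read.

    DefaultSelector chooses the most efficient mechanism available
    (epoll on Linux, kqueue on the BSDs, falling back to select).
    """
    with selectors.DefaultSelector() as sel:
        for sock in socks:
            sel.register(sock, selectors.EVENT_READ)
        return [key.fileobj for key, _events in sel.select(timeout)]

# Tiny demonstration with a connected pair of local sockets.
client, server = socket.socketpair()
server.sendall(b"HTTP/1.1 200 OK\r\n")
print(wait_for_readable([client]) == [client])
client.close()
server.close()
```

A real load generator would keep one selector alive for thousands of connections rather than building one per call; the per-call version here just keeps the example short.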

What this means is that although the results from the previous benchmarks are almost certainly still in the correct order, and that the comparison of relative performances is valid, the absolute requests per second numbers are invalid, and there’s a lot of room for improvement, probably beyond 25,000 requests per second. I’m now planning to give everything a serious try with the latest Nevada build, instead of update 1, which is what we’re currently running.
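A rough way to see why a fixed delay hurts the absolute numbers so much: here is a back-of-the-envelope sketch in Python. It assumes, purely for illustration, that each client connection issues its requests one after another and that the 30ms delay is the only cost per request; the function names are mine.

```python
import math

def per_connection_cap(delay_ms):
    """Most requests per second one serial connection can complete
    if every request costs at least `delay_ms` milliseconds."""
    return 1000 / delay_ms

def min_connections(target_rps, delay_ms):
    """Fewest concurrent connections needed to reach `target_rps`
    when each request is held up by `delay_ms` milliseconds."""
    return math.ceil(target_rps * delay_ms / 1000)

print(round(per_connection_cap(30), 1), "requests/sec per connection")
print(min_connections(25000, 30), "connections needed for 25,000 requests/sec")
```

Under those assumptions a single connection tops out at about 33 requests per second, and reaching 25,000 would take at least 750 connections in flight, before the server has done any work at all.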

We also have a list of other cool things to try out, and from the sounds of things, the guys in EastPoint have extremely extensive knowledge about tunings and just what levels of performance can be achieved. As a lot of their work is being opened up as part of the OpenSolaris effort, there may well be some great ways Sun and Apache can help each other out. Hopefully, we might even see some of them at ApacheCon.

We’ve still got over a month with the T2000 left though, and plenty more to try out on it, so we’re not letting it rest just yet. It’s a very, very strong candidate for the next ftp.heanet.ie iteration, and it looks like there may be a solution within UFS to our single-threaded I/O numbers. ZFS would definitely help out there too.

It’s hard for me to look neutral anymore, since we’ve been donated a box, or maybe it’s easier – since we don’t really have anything to gain – but I have to say that I am very, very impressed, not only by Sun’s latest technologies and their approach to engineering (which has always been good), but also by the genuinely open and receptive nature they’ve shown. I’ve had contact with a few Sun employees during this trial, and they’ve all been extremely helpful. There was never any hint of pressure, or even corporate schmooze, just simple and honest advice, which is good to see. I still think Solaris has a bit to go before we could mass-deploy it in production again, mainly to do with the lack of a decent packaging system. That said, Sun now looks like an organisation which is genuinely receptive to feedback, so before the 60 days are up, I’ll try and write a blog post comparing Solaris to other platforms, for common administrative and automation tasks.

Anyway, thanks Sun, especially all of the people who’ve been speaking to me over the past few days – we’ll put the server to good use.

Niagara Benchmarks: Update

Posted on March 27, 2006, under apache, general, niagara.

Since I’ve been posting material up on Niagara, a few other httpd and solaris folk have been chiming in with some more expert opinions. I’ve also received more dder output from various platforms, which I plan to post up when I get a chance.

What I couldn’t wait for though, is to communicate the effect some single changes in our benchmarking setup have achieved. A few days ago I raved about the 5700 requests per second I was getting out of the Niagara box. Turns out that was a load of crap, here’s what I’m getting now;

Requests per second:    15298.68 [#/sec] (mean) 

And here’s what an active ftp.heanet.ie is pushing;

Requests per second:    4445.26 [#/sec] (mean) 

Abandoning siege, and using the latest version of ab is revealing some fundamental limitations in our previous benchmarks and some errors in our assumptions. O.k., so my assumptions. My bad, and I wanted to try and rectify it as quickly as possible. As I scramble together some time over the next few days, we’ll recheck our other results, and see what else is lurking under the hood in terms of performance. We suspect there’s more room for growth.
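For comparing runs like the two above, a small Python sketch along these lines pulls the figure out of ab’s summary line. The regular expression and function name are my own, written against the line format shown in this post rather than any documented guarantee about ab’s output.

```python
import re

# Matches the summary line exactly as it appears in the output above.
AB_RPS = re.compile(r"^Requests per second:\s+([\d.]+)\s+\[#/sec\]", re.MULTILINE)

def requests_per_second(ab_output):
    """Extract the mean requests-per-second figure from ab output."""
    match = AB_RPS.search(ab_output)
    if match is None:
        raise ValueError("no 'Requests per second' line found")
    return float(match.group(1))

niagara = requests_per_second("Requests per second:    15298.68 [#/sec] (mean)")
heanet = requests_per_second("Requests per second:    4445.26 [#/sec] (mean)")
print(round(niagara / heanet, 2), "times the live server's rate")
```

On the two lines quoted above the ratio comes out at roughly 3.4.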

Thanks to Brian Akins for nailing the problem. Oh, and we’re hearing a lot of very very good things about, and seeing some very nice numbers from, Opteron systems too.

SUN contest form

Posted on March 27, 2006, under general, niagara.

So, the SUN contest form went online today. It does seem to have changed a lot in character since it originally started, but I guess it’s SUN’s hardware to give away. You can find the terms and conditions online too. Anyway, here’s how I filled it out;

First Name: Colm
Last Name: MacCarthaigh
Phone: PHONE
Company/Org: HEAnet
Location: Dublin
Try and Buy Sales Number: No Idea
May we contact you: yes

Describe application workload for all systems tested (ie Dynamic content currently using PHP and JSP, Page images requested using 2 parallel HTTP connections, simulates active users accessing their email via a standard Web browser):

T1000/T2000 Ultra-scalable Apache httpd deployment
Other system(s) Ultra-scalable Apache httpd deployment

Describe hardware configuration for all systems tested (active cores, # of disks, memory, network settings and topology):

Sun Fire T1000/T2000: T2000 with 16 GB of RAM
Other System(s): Dual 1.5Ghz Itanium with 32GB of RAM
Dual 3.2Ghz Xeon with 12GB of RAM

What were the results?

Sun Fire T1000/T2000: See blog http://www.stdlib.net/~colmmacc/category/niagara/
Other System(s): See blog http://www.stdlib.net/~colmmacc/category/niagara/

How is this good/how does it compare to previous results?
Sun Fire T1000/T2000: See blog http://www.stdlib.net/~colmmacc/category/niagara/
Other System(s): See blog http://www.stdlib.net/~colmmacc/category/niagara/

What is the SWaP measurement?

Sun Fire T1000/T2000: The SWaP measurement is highly questionable and very little scientific or statistical information is included on how it is considered relevant. It vastly over-simplifies a complicated metric, and hence is not included in any of my results. I have calculated power usage differences and extrapolated rough costs, as that is reasonably deterministic.

Did you find/solve any bugs? Yes, multiple. These are in the process of being resolved. See blog comments.

What can we do to improve the system/software? Solaris badly needs a credible packaging system, a clone of apt would be ideal.

I wish to make clear that I DO NOT accept all of the terms of this competition. I most certainly do not agree with;

12. Publicity: Except where prohibited, participation in the Contest constitutes a Contestant’s consent to Sun’s use of his/her name, likeness, voice, opinions, biographical information, place of residence and Performance Results for promotional purposes in any media without further payment or consideration. Contestant also agrees to assist Sun in promoting and marketing the Contest.

In particular, Sun may make absolutely no use of my place of residence whatsoever.

External links to posted results or blogs:

http://www.stdlib.net/~colmmacc/category/niagara/

I’m blogging it here just to make sure that my refusal to accept some of the ridiculous conditions is on-record. These weren’t part of the original announcement; in fact it seemed a lot simpler and a lot less bureaucratic when announced. But we’ll see how the contest progresses anyway.

Niagara vs ftp.heanet.ie Showdown

Posted on March 23, 2006, under apache, general, mirroring, niagara.

So, after a week with the Niagara T2000, I’ve managed to find some time to do some more detailed benchmarks, and the results are very impressive. The T2000 is definitely an impressive piece of equipment, it seems very, very capable, and we may very well end up going with the platform for our mirror server. Bottom line, the T2000 was able to handle over 3 times the number of transactions per-second and about 60% more concurrent downloads than the current ftp.heanet.ie machine can (a dual Itanium with 32GB of memory) running identical software. Its advantages were even bigger than that again, when compared to a well-specced x86 machine. Not bad!

The Introduction

ftp.heanet.ie is one of the single busiest webservers in the world. We handle many millions of downloads per day, but unusually for a high-demand site, we do it all from one machine. This is usually a bad idea, but as a mirror server has built-in resilience (in the form of a world-wide network of mirrors), and as we can’t afford 20 terabytes of ultra-scalable, network-available storage, we use a single machine with directly attached storage, and rely on our ability to tune the machine to within an inch of its life. We regularly serve up to 1.2 Gigabit/sec, and have handled over 27,000 concurrent downloads. There’s some more detail on our previous set-up (which is mostly identical to the current one) in my paper on Apache Scalability.

Over four years ago, when I started in HEAnet, Solaris and Sparc hardware represented about 50% of our Unix systems. Now it represents less than 2%, so I’ve had less and less opportunity to tinker on Solaris in the last few years, but have kept up with it enough to know how to use dtrace, and to still understand the Solaris fundamentals. At ApacheCon US 2005, Covalent had a T2000 along as a demonstration machine. I got to play with it a little and was very impressed. Unlike prior experiences, this machine felt very responsive. There was no waiting for the output of commands, no listening to the whirring of hard disks, and the benchmarking numbers it was producing weren’t bad either.

When Jonathan Schwartz announced the “Free Niagara box for 60 Days” deal, we jumped at the opportunity to test one of these boxes – which may be ideal for our needs. It took a while for Sun to iron out some administrative problems, but they certainly held up their end of the deal, and a nice shiny T2000 arrived a little over a week ago, for us to try out.

The Machines

To get a better sense of the machine’s performance in comparison to our other options, we rustled together a Dell 2850 Dual 3.2Ghz Xeon with 12GB of RAM, running Debian, and our current Dell 7250 Itanium (which is a dual 1.5Ghz with 32GB of RAM).

T2000 2850 7250

Throughout the benchmarking, the machine used for firing off the benchmarks (using ab, httperf and siege) was another Dell 2850, this time a dual 2.8Ghz Xeon with 4GB of memory. For performing the concurrency and latency tests, we used more, similarly-configured (and identical each time), 2850’s and 2650’s to run yet more parallel benchmarks.

As ftp.heanet.ie is a live system which we can’t simply take off-air because we want to complete some benchmarks, we ran the tests during its quietest periods of use. To be fair, we also made sure that the other two systems – when benchmarked – were loaded with a baseline of 40 requests per second, with an average concurrency of around 300. After initially determining which machines were “winning” the benchmarks we tried to structure the load to favour the “loser” of the benchmarks, if any decision was needed. This means that where one machine comes out on top, the margin by which it wins is actually a conservative estimate.

Ordinarily, we try to drastically reduce the number of services on a machine, to free up memory and scheduler time on the system. However, as the T2000 came with a large number of services running, and it’s not entirely easy to determine what is and isn’t actually a critical service, we shut down obvious candidates – such as the various network filesystem daemons – but left some others alone. Again, if anything, this means that our results are actually conservative for the Sun, although they probably do reflect a real-world set-up, which will have these services running.

The Preparation

As no system comes configured perfectly for such extreme tests, we did a number of things to each machine we tested, to achieve as much performance as we could manage. Since my Solaris skills are rustier than my Linux skills by a fair margin, it’s more than possible that our benchmarks under-represent the performance of the T2000.

T2000
The first thing we did after receiving the system was to get smpatch configured, and to run “smpatch update”. Getting the system completely up to date took a good 6 hours, and that still only covered critical and security updates, as we don’t have a subscription for everything else. Being a Debian and Ubuntu user, this is annoying. “apt-get update && apt-get dist-upgrade” would have done the same thing, and upgraded everything in about 15 minutes, at the very, very longest. Hopefully though, that will be improved upon.

Next, we installed the SUNWspro suite, in order to have a compiler, linker and so on – which is mighty useful for compiling Apache from source! Some reasonably trivial invocations of apachebench seem to show that this compiler produces faster binaries than gcc. Over the years, there have been claims that 64-bit binaries are actually slower than 32-bit binaries. Our testing didn’t show much of a difference, but just in case there is one, we used 32-bit builds of Apache, though with the correct largefile-magic, so that we could still transfer very large files.

We didn’t apply many Solaris kernel tunings, mainly because the Solaris team seem to be working hard to get rid of them, and putting a lot of effort into making the default behaviour ultra-scalable. Nevertheless, we upped max_nprocs various times to cope with the insane number of processes we were creating. Keeping an eye on tcp:tcp_conn_hash_size with ndd seemed to show little problem with the default values, and this is the main Solaris tunable we’ve had to tune in the past.

Apart from mounting the filesystems with the “noatime” mount-option, we did no filesystem tuning, which is something I’m keen to improve on, particularly if we can try out ZFS. Again, if anything, this means that the performance of the T2000 may be under-represented. However, as our benchmarking was restricted to just 3 files, with no directory traversals, probably not by much. If anyone has any pointers on intensive filesystem tuning on Solaris, please send them my way!

Itanium 7250
The Itanium box runs version 2.6.15.2 of the Linux kernel and our list of related sysctls looks like this:

net/core/wmem_default=5000000
net/core/wmem_max=5000000
net/core/rmem_default=5000000
net/core/rmem_max=5000000
net/ipv4/tcp_rmem="8192 87380  1747600"
net/ipv4/tcp_wmem="8192 87380  1747600"
net/ipv4/tcp_wmem="8192 10000000 10000000"
net/core/netdev_max_backlog=25
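
As a rough illustration (not the tooling used for these tests), the list above can be turned into the /proc/sys paths each setting corresponds to. The sketch below only parses and prints; it writes nothing to the system, and the helper name parse_sysctl_lines is made up for this example.

```python
# Sketch only: parse "key=value" sysctl lines and map them to /proc/sys paths.
# Nothing is written to the system; the helper name is hypothetical.

SETTINGS = """\
net/core/wmem_default=5000000
net/core/wmem_max=5000000
net/core/rmem_default=5000000
net/core/rmem_max=5000000
net/ipv4/tcp_rmem="8192 87380  1747600"
net/ipv4/tcp_wmem="8192 87380  1747600"
net/ipv4/tcp_wmem="8192 10000000 10000000"
net/core/netdev_max_backlog=25
"""


def parse_sysctl_lines(text):
    """Return a list of (proc_path, value) pairs, in the order given."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        # For the keys used here, dotted and slashed forms map to the same path.
        path = "/proc/sys/" + key.strip().replace(".", "/")
        # Drop the quotes and collapse repeated spaces in multi-value settings.
        value = " ".join(value.strip().strip('"').split())
        pairs.append((path, value))
    return pairs


if __name__ == "__main__":
    for path, value in parse_sysctl_lines(SETTINGS):
        print(f"{path} <- {value}")
```

Applied in the order listed, a key that appears twice ends up with its last value, so tcp_wmem would finish at "8192 10000000 10000000".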

We also up the txqueuelen on our interfaces to 50000, for achieving super-high throughput to our Geant users. The XFS filesystem was mounted with the “noatime” and “ihashsize=65535” mount options.

2850 Xeon
For the sake of consistency, the 2.6.15.2 kernel was also installed on the Xeon box, with the same system and interface settings as the Itanium box. The ext3 filesystem used was mounted with the “noatime” mount-option.
Apache
Common to each box were the usual Apache tunings we apply. For each machine, we tried to determine the quickest MPM to use. In the case of the two Dell boxes, this was the event MPM, which was ahead of the worker MPM by about 2%. We couldn’t get the event MPM working on Solaris (more about that later), so we used the worker MPM – which was over twice as fast as prefork on the platform.

As Solaris seemed to respond better to more LWPs than PIDs, we ran with 64 threads per child – which is not at all an unreasonable number. Increasing beyond this did give us slightly better results, but the potential for 64 downloads to die at once, when there’s a problem, is just about enough real-world risk to deal with, for me. The relevant configuration stanza looks like:

<IfModule mpm_worker_module>
    ServerLimit            1563
    ThreadLimit            64
    StartServers           10
    MaxClients             100032
    MinSpareThreads        25
    MaxSpareThreads        75
    ThreadsPerChild        64
    MaxRequestsPerChild    0
</IfModule>

Note: these are stupid values for a real-world server, and will waste a lot of memory for the scoreboard. They are really only useful if you are doing some insane benchmarking and testing.
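
The numbers in that stanza hang together in a specific way: with the worker MPM, MaxClients has to fit within ServerLimit × ThreadsPerChild, and ThreadsPerChild within ThreadLimit. A small sketch of that arithmetic, with a helper name (check_worker_stanza) invented for the example:

```python
# Values copied from the worker MPM stanza above.
stanza = {
    "ServerLimit": 1563,
    "ThreadLimit": 64,
    "MaxClients": 100032,
    "ThreadsPerChild": 64,
}


def check_worker_stanza(cfg):
    """Return a list of problems found; an empty list means the numbers agree."""
    problems = []
    if cfg["ThreadsPerChild"] > cfg["ThreadLimit"]:
        problems.append("ThreadsPerChild exceeds ThreadLimit")
    capacity = cfg["ServerLimit"] * cfg["ThreadsPerChild"]
    if cfg["MaxClients"] > capacity:
        problems.append(f"MaxClients exceeds ServerLimit * ThreadsPerChild ({capacity})")
    if cfg["MaxClients"] % cfg["ThreadsPerChild"]:
        problems.append("MaxClients is not a multiple of ThreadsPerChild")
    return problems


print(check_worker_stanza(stanza))  # [] since 1563 * 64 == 100032
```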

We naturally set “AllowOverride None”. Interestingly, although sendfile() functions flawlessly on Solaris (unlike on Linux), using it seemed to have an impact on performance. Using it did reduce the amount of memory used by Apache on the box, but it gave slower performance than just read() and write() – so perhaps its blocking characteristics are slightly different. Thus, we set “EnableSendfile off” and used MMap instead (via “EnableMmap”) which seemed to be the fastest way to ship bytes.

Another hack we applied to speed up Apache was to change the default buffer size, which is buried in the bowels of APR and can only be changed at build-time. In each case, the buffer size was changed as per the most efficient value (as determined by our previous benchmarks on single-threaded I/O). Don’t try this at home kids, unless you really know what you’re doing.

So, with our tunings applied, we set about performing our benchmarks, and for the sake of sticking with the showdown theme, I’ve divided the results into good, bad and ugly. (No, there weren’t really any ugly results – it’s just a fun theme for a post).

The Good

Power Usage
As I’d previously blogged, one of the first things we were able to measure was the power usage of the machine. Much to my amazement, it remained at the original level (+/- 20%) of current draw for the duration of our tests, peaking at a mere 1.2 Amps, or about 290 Watts. This compares pretty favourably with our Dells, though I should add that the Dells both have more disks in their chassis than the T2000.

Machine    | Average draw (Amperes) | Peak (Amperes) | Yearly cost
Sun T2000  | 1                      | 1.2            | €210
Dell 2850  | 1.6                    | 2              | €350
Dell 7250  | 1.8                    | 2.2            | €395

Costs are calculated on the average draw, at the Irish commercial ESB rate, and do not include cooling costs (roughly triple the number to get the overall yearly cost). Electricity supply was 240V, so multiply the Amperes by 240 to get the raw number of watts. These results were calculated using an APC metered PDU. This is not a scientific instrument, and it’s entirely possible that results are inaccurate. Some rough calibration did show that the unit produced consistent results, so personally I’m confident enough about the order in which the machines are ranked, but I wouldn’t go so far as to be certain of the raw numbers produced. We really need a good power meter to produce that kind of reliability.
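
For anyone who wants to redo the sums, here is a minimal sketch of the conversion described above. It treats volts × amps as watts, exactly as the text does, and assumes the machine draws its average current all year. The electricity rate isn't given here, so the sketch only back-computes the rate each yearly cost implies; the function names are invented for the example.

```python
SUPPLY_VOLTS = 240
HOURS_PER_YEAR = 24 * 365

# (average amps, peak amps, quoted yearly cost in euro), from the table above.
machines = {
    "Sun T2000": (1.0, 1.2, 210),
    "Dell 2850": (1.6, 2.0, 350),
    "Dell 7250": (1.8, 2.2, 395),
}


def watts(amps, volts=SUPPLY_VOLTS):
    """Raw power figure: current multiplied by supply voltage."""
    return amps * volts


def yearly_kwh(average_amps):
    """Energy used in a year at a constant average draw."""
    return watts(average_amps) * HOURS_PER_YEAR / 1000


for name, (average, peak, cost) in machines.items():
    kwh = yearly_kwh(average)
    print(f"{name}: {watts(average):.0f} W average, {watts(peak):.0f} W peak, "
          f"{kwh:.0f} kWh/year, implied rate {cost / kwh:.3f} euro/kWh")
```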

Requests per second
How many requests the machine can handle in a second is probably the most valuable statistic when talking about webserver performance. It’s a direct measure of how many user requests you can handle. Fellow ASF committer, Dan Diephouse, has been producing some interesting stats for requests-per-second for webservices (and they are impressive), however we were more interested in how many plain-old static files the machine could really ship in a hurry. And without further ado, those numbers are;

Concurrent Downloads

Sun’s own benchmarks have quoted up to 2500 requests per second, which we didn’t find particularly impressive. Our current box – merely a dual Itanium – can do 2700 requests per second without much trouble. I’m happy to confirm though, that the tricks we do to reduce Apache’s memory usage on Linux have as much of an effect on Solaris. Our results are averaged over 5 runs of the testing, during which the T2000 managed a very, very impressive 5718 requests per second. Not bad!

Despite the new kernel, the x86 box still struggled to push out a disappointing 982 requests per second, while our Itanium churned through a reliable 2712 requests per second.

Concurrency
Unfortunately, neither the siege nor apachebench utilities can cope with the levels of concurrency we test with these days, as there are simply far too many sockets involved. Tuning the client machine itself becomes a serious task in order to be able to cope with the sheer volume of outbound requests. We currently have some commercial traffic generation and scaling testers in our test-lab, but we decided not to use those either. Instead, multiple servers were thrown at the problem and we used 11 machines all-in, all running instances of siege at the same time. The instances were fired off by hand, but within a few seconds of each other, and more than enough requests (100,000) were used, to ensure that the processes were given enough time to ramp up to the level of parallelism required. Each machine was on the same LAN as the server we were benchmarking.

With those limitations in mind, the test certainly allowed us to find out the rough breaking point of each machine. On any system, sustaining over 10,000 concurrent requests would involve denying some requests outright, but the cut-off or breaking point was defined as the point when the server got to 50% availability. We used some other tricks, like assigning the server multiple IP addresses and targeting each client at a different address, to a) give the tuple-tracking code in the IP stacks an easier time and b) allow us to easily track how many clients each server was sustaining.
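
The 50% cut-off can be expressed in a few lines. This is only a sketch of the definition, not the procedure used in the tests, and the sample figures in it are hypothetical placeholders rather than measurements.

```python
def breaking_point(samples, threshold=0.5):
    """Return the lowest concurrency whose availability is at or below
    the threshold, or None if the server never dropped that far."""
    for concurrency, availability in sorted(samples):
        if availability <= threshold:
            return concurrency
    return None


# Hypothetical (concurrency, availability) pairs, for illustration only.
samples = [
    (10_000, 0.99),
    (20_000, 0.93),
    (30_000, 0.71),
    (40_000, 0.48),
    (50_000, 0.22),
]

print(breaking_point(samples))  # 40000
```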

Also, in each case, the system was pretty much unusable by the time we were done! After killing all of the connections, the Linux boxes would take about 5 minutes before becoming responsive enough that we could get to a shell prompt. The T2000 would take about 20 minutes, although I think that if we reserved more processes for the root uid, that might change – sshd seemed responsive enough, but would block on fork() when trying to create a shell process.

Concurrent Downloads

As you can see, the T2000 was able to sustain about 83,000 concurrent downloads, and my limited dtrace skills tell me that thread-creation at that point seemed to be the main limiting factor, which is hardly surprising. For us, that number represents an upper limit on what the machine could handle when faced with a barrage of clients. Of course, no server should ever be allowed to get into that kind of insane territory, but it’s always good to know that there is plenty of headroom. More to the point, it means that availability at the lower levels of concurrency is much higher. Compared to the 57,000 concurrent connections our Itanium box can handle, and the 27,000 our Xeon box can handle, it looks like the T2000 would be a very, very good choice of server for our load.

Latency vs concurrency
I would have liked to have been able to measure availability vs concurrency, but unfortunately our method of testing doesn’t really allow for this. Although we can sum the availabilities as seen by each client participating in the benchmark, this doesn’t always time-average correctly. In other words, if we used two client systems, and client A reported 90% availability and client B reported 80% availability, does that mean 85% uptime overall, or 80%? Unfortunately, it doesn’t mean either. Averaging only works if the two figures are perfectly overlapped in time, so it’s an average – but weighted in proportion to the lack of an overlap. The real availability is somewhere between 80% and 85%, and it’s very hard to figure out where. If the client systems were identical in hardware terms, we could come close to solving the problem by firing off the benchmarks with the at command, but our systems aren’t all that close in terms of spec.

Instead, what we can do, is to measure the latency as it increases with concurrency, in each case taking the worst value from our benchmarking clients. Benchmarking from a single system shows that there is a very high degree of correlation between an increase in latency and a decrease in availability, so this measurable gives us a good idea of both.

Latency vs Concurrency

Overall, the T2000 performs very impressively. At very low numbers of concurrency, it actually has a higher latency than either of the Dell machines we tested, but these latencies are of the order of tens of milliseconds. In other words, the network latency makes a bigger difference in the overall scheme of things.

With no concurrency at all, the T2000 would exhibit latency of 9 milliseconds, compared to the Itanium’s 1 millisecond (and in fact, ab actually outputs 0, so it’s less than 1 millisecond) and at 1000 concurrent requests the T2000 would have 48 milliseconds, compared to 12 milliseconds for the Dual Itanium box. However, as we scaled up the concurrency, the latency numbers change fairly rapidly, in favour of the T2000. Due to the huge changes in scale, we’ve had to use a logarithmic graph, but at 50,000 concurrent downloads, our Itanium would take up to 38 seconds to respond to a client, compared to the T2000’s 26 seconds. At 83,000 downloads, which only the T2000 could manage, the latency had gone up to 57 seconds, but it still responded.

Overall, I think it’s fair to say that while the T2000 doesn’t seem to have ultra-low latency performance, it has much better scalability and provides much better availability as more and more connections are added. So again, overall, the T2000 is still the better webserver.

The Bad

I’m a bit reluctant to label these results “bad”, because they really are in areas in which Sun have never claimed the machine will perform. The Niagara platform is architected for parallelism; it’s not supposed to give great performance for any single-threaded task. If you have a load which requires great performance to a single client, Sun have an array of other hardware they’d prefer to sell you instead. However, since some aspects of single-threaded performance do have a direct impact on webserver performance, I’ve included some relevant ones here.

Single-threaded I/O
As I’ve previously blogged, one of the first benchmarks we run on any machine is to determine how much I/O a single-threaded task can drive, and what the most efficient buffer size to achieve that is. There’s much more detail in the linked blog post, but the summary information can be easily graphed:

Maximum single-thread throughput

These results may be attributable in part to the relatively slow system disks that the T2000 ships with, and much better performance can probably be derived by using a faster disk setup. On the other hand, the performance Linux achieves is mainly due to the very aggressive vfs caching it performs. Unlike the Linux box, the T2000 produces the same throughput numbers whether it is the first time or the tenth time it has read a file. Linux, on the other hand, takes much longer to serve a file the first time, but after that, it’s served from RAM.
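
To give a feel for what a single-threaded buffer-size sweep looks like, here is a minimal sketch. It is not the benchmark behind the graph; the function names and the small temporary file are assumptions made to keep the example self-contained. Each buffer size is read more than once, since a cached second pass can look very different from the first.

```python
import os
import tempfile
import time


def read_throughput(path, buffer_size):
    """Read the whole file with one thread; return bytes per second."""
    total = 0
    start = time.perf_counter()
    # buffering=0 gives raw reads, so buffer_size is what reaches read().
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(buffer_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total / elapsed if elapsed > 0 else float("inf")


def sweep(path, buffer_sizes, passes=2):
    """Return {buffer_size: [throughput of each pass]}."""
    return {size: [read_throughput(path, size) for _ in range(passes)]
            for size in buffer_sizes}


if __name__ == "__main__":
    # A small throwaway file keeps the sketch runnable anywhere; a real test
    # would use files much larger than RAM to keep the page cache out of it.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(8 * 1024 * 1024))
    try:
        for size, rates in sweep(tmp.name, [4096, 65536, 1048576]).items():
            print(size, [f"{rate / 1e6:.0f} MB/s" for rate in rates])
    finally:
        os.unlink(tmp.name)
```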

It’s also useful to put these results in context; what they mean is that a single-threaded task, doing as pure and simple an I/O task as possible, can push 3.5Gigabytes per second. The Niagara box comes with 4 Gigabit/sec interfaces, so even a single-threaded task could fill that, 7 times over. Still, if I were deploying a load with a large and very active database component, I would do some more extensive testing to ensure that any single-threaded I/O constraints had no overall effect.

Single-download throughput
After gathering the numbers on single-threaded I/O, and confirming that the T2000 could easily saturate its 4 Gigabit interfaces – at any level of concurrency high enough to generate that level of traffic – we decided to see if the I/O numbers exhibited themselves for a single download. To perform this benchmark we went back to basics, and used curl and wget to grab a 1 Gigabyte file repeatedly. To help the systems out, we increased the MTU to 9000 bytes and made sure the TCP window size was big enough to take the entire file straight away. We also monitored for any packet loss during the tests (there was none).

Due to the way we handle the load-balancing of our network interfaces on the Linux boxes, which is per-flow, any single download is limited to 1Gigabit/second. Sure enough, wget reported a neat 123 MB/sec fairly reliably. Since the balancing was per-flow, it’s entirely possible the machine can actually ship faster downloads, and neither system seemed under any strain while doing this. With the T2000 on the other hand, we could push no more than 48 MB/sec, which is still a very respectable 384Mbit/sec.
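
The unit conversions behind these figures are easy to get wrong by a factor of 8, so here they are spelled out. This sketch assumes decimal units (1 MB = 10^6 bytes, 1 Mbit = 10^6 bits), which is how the figures in the text line up; the function name is invented for the example.

```python
def mbytes_to_mbits(mbytes_per_sec):
    """Convert megabytes per second to megabits per second."""
    return mbytes_per_sec * 8


print(mbytes_to_mbits(48))   # 384  (the T2000's single download)
print(mbytes_to_mbits(123))  # 984  (just under a 1 Gigabit/sec flow)

# Single-threaded I/O figure from earlier: 3.5 Gigabytes/sec against
# 4 Gigabit/sec of total interface capacity.
print(3.5 * 8 / 4)           # 7.0
```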

Single download performance

Apart from increasing the MTU and Window size, we didn’t apply any Solaris-specific tunings for improving these numbers, so again, it’s possible that these numbers are under-representing true possible performance. And once again, we really have to put these numbers into context. As a whole, the T2000 has no problems saturating its 4 Gigabit/sec of connectivity, and that’s what it’s designed for – parallelism. All our numbers mean, is that if you wanted truly incredible performance for any single download, this probably isn’t the right architecture. Outside of where I work, and other high-speed research networks, I’m not aware of any place where high-speed, single-flow statistics really matter a whole lot, especially for HTTP. The network is usually a limiting factor anyway. I mean, how many people have jumboframe capable multi-gig WANs?

The Ugly

Ok, so ugly is a bad choice of word. But like I said, this is a “showdown”. While testing the event MPM, we did manage to upset the Solaris kernel to the extent that it actually crashed;

panic[cpu21]/thread=300024a7020: BAD TRAP: type=31 rp=2a102c87720 addr=0 mmu_fsr=0
occurred in module "genunix" due to a NULL pointer dereference

httpd: trap type = 0x31
pid=652, pc=0x10fb4dc, sp=0x2a102c86fc1, tstate=0x4400001607, context=0x514
g1-g7: 0, 0, 12, 38, 0, 0, 300024a7020

Nice! I haven’t looked into this in detail yet, but it’s likely due to the unusual synchronisation semantics the event MPM features right now. The event MPM is marked as experimental, and if you’re not an Apache developer, you probably shouldn’t be running it. Still, the thread-handling code within the MPM all runs as a non-root user, so it really shouldn’t be able to cause the kernel to crash. Then again, it was handling about 30,000 requests at the time, with no accept mutex. This isn’t exactly within the normal range of expected behavior for a userland application. Since switching to the worker MPM, we’ve had flawless performance and not a single crash.

The Conclusion
The T2000 is one very impressive piece of kit, and at a list price of around €15,000 ($16,995), costs less than half of the price of the dual Itanium we’ve been benchmarking it against (it’s also less than I can price up a comparable x86 box for – seems to be the memory that does it). We may very well go with the platform for our next iteration of ftp.heanet.ie.

The benchmarks we’ve run were all run with our own load in mind, but hopefully they’re still of some use to others. If you’re thinking about giving the platform a try, do run your own benchmarks though, don’t take our word for it. It’s always better to have these things validated and improved upon.

The Future
We’re not finished benchmarking just yet, we still have more planned! The Niagara box has some impressive SSL-offload features, and if we get a chance, we’d like to test those capabilities. We just need to get the hacked-up engine3-supporting versions of openssl and flood onto the box, which will involve a bit of research. Some of the Apache SpamAssassin guys may try running some SpamAssassin benchmarks on the machine too, which should be impressive, as they lend themselves to parallelisation very well. We’re also going to try and improve on our above tests, and I’ll keep blogging about the results as we manage to do that.

Rather tantalisingly, there’s a comment on Dan Kegel’s C10K page saying that “Doug Royer noted that he’d gotten 100,000 connections on Solaris 2.6 while he was working on the Sun calendar server”, but it doesn’t give any details of the hardware involved. But still, 100,000 connections, on 2.6! It gives me hope that with more tuning, the T2000 might be capable of scaling beyond the 83,000 we had.

If I develop some more free time, I also hope to use the machine to instrument Apache httpd (and maybe apr) for dtrace. Do check out Matty’s mod_dtrace though, for a cool module which instruments all of the handlers.

In the meantime, you can check out all of my blog posts about the Niagara box through my new Niagara category. Mads is also keeping tabs on other benchmarks taking place within the ASF community.

The Cheeky Part!

I don’t know what the status of Jonathan’s offer to be allowed to keep a server, at the discretion of the Niagara team, is – but we might as well give it a try.

Although we’re seriously considering the platform for the future, HEAnet doesn’t have a use for a Niagara box right now, but the other participants in our benchmarking efforts (and hopefully we’ll be blogging their results soon enough too) do – DCU’s Networking Society, RedBrick. RedBrick just celebrated 10 years as a networking society, and 5 years ago, Sun donated a massive E450 to the society, on which we ran our 2000 user shell server for 2 years.

We even pulled out all of the stops at the time, and had the Taoiseach (the Irish Prime Minister) turn out to launch the machine. I’m hoping we can convince Sun to donate the Niagara box to RedBrick, where they can use it for even more testing and benchmarking, as it really is an ideal machine for a shell environment. Lots and lots of low-memory parallel tasks.

So if you thought this round-up was of any use, digg it, link to it, or mail it to your local Sun Niagara team member, and we’ll see if we can be useful enough to merit a donation!