Archive for March, 2006
Not a Rio!
Ok, so yet more generosity from Sun today. I’ve been told we can keep the T2000! Incredibly, the T1000 is still on its way, and that will be with us some time next week – but I can now recycle the cardboard from the T2000. Brilliant!
We would actually be more than comfortable deploying a T2000 as a host for ftp.heanet.ie, but things are not that simple. We’d have to migrate the 12TB of data from XFS to UFS, for a start. We’d also prefer to buy a server, with a support contract and so on (update: the donations actually come with SunService contracts), anyway. But the T2000 would give us better performance, and a much better NFS stack, which greatly expands our options for the future. When we next purchase an ftp.heanet.ie, Niagara is at the top of our list.
So, the T2000 will go to RedBrick, and we’re working on something else for the T1000. In the immediate future, it will serve as an excellent development and testing machine for Apache, and we’ll use it for the dtrace work (now well underway, by the way). It will also be incredibly useful for us to have a machine for ongoing comparison purposes, and maybe even some OpenSolaris hacking.
Stay tuned for more benchmarking work though, results will be coming soon.
Update: SUN have also very kindly offered to cover Nóirín’s travel costs to ApacheCon Europe in Dublin (read how this will help her, here). Nóirín has been doing a lot of work behind the scenes editing my blog-posts into readable English (and also improving the spelling to above the level of a 10-year-old!) and with Sun’s help, the universe has been particularly efficient at getting the karmic reward in order.
Erie in Eire
Good news, SUN are donating a Niagara T1000 to us (and hence RedBrick), excellent! Damien, from the Sun performance team here in Ireland, called to tell us it’s on its way. We should have it in the next week or two. This isn’t part of the original contest, it’s from the Sun engineering team, and we’re very grateful. I’m sure RedBrick will put it to excellent use, and we’ll get them thinking about a server name right away!
After doing a lot of analysis today, we’ve figured out what was up with the benchmarks – we were inserting a systematic delay of about 30ms into every connection. How? By not using epoll() on the client side. Turns out that the blocking select() time in the clients was actually a very significant factor in how quickly the benchmarks were running.
What this means is that although the results from the previous benchmarks are almost certainly still in the correct order, and the comparison of relative performances is valid, the absolute requests per second numbers are invalid, and there’s a lot of room for improvement, probably beyond 25,000 requests per second. I’m now planning to give everything a serious try with the latest Nevada build, instead of update 1, which is what we’re currently running.
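To see why a fixed per-connection delay caps the absolute numbers without changing the ordering, a rough back-of-the-envelope sketch helps. It assumes every client connection simply sits idle for the extra 30ms, so the ceiling is concurrency divided by the time spent per request. The concurrency levels and the 1ms service time are made up for illustration and are not values from our runs.

```python
# Rough ceiling on requests/sec when every connection carries a fixed
# client-side delay. The 30 ms figure is from the post; the concurrency
# levels and the 1 ms service time are illustrative assumptions.

CLIENT_DELAY_S = 0.030   # systematic select() delay per connection
SERVICE_TIME_S = 0.001   # assumed server time per request (hypothetical)


def max_requests_per_second(concurrency, service_time_s, client_delay_s):
    """Little's law: throughput <= concurrency / time spent per request."""
    return concurrency / (service_time_s + client_delay_s)


for concurrency in (10, 100, 500):
    with_delay = max_requests_per_second(concurrency, SERVICE_TIME_S, CLIENT_DELAY_S)
    without = max_requests_per_second(concurrency, SERVICE_TIME_S, 0.0)
    print(f"{concurrency:>4} connections: {with_delay:9.0f} req/s with delay, "
          f"{without:9.0f} req/s without")
```

Under these assumptions the delay dominates the per-request time, which is one way a client-side wait could hold every server's absolute numbers down by a similar factor.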
We also have a list of other cool things to try out, and from the sounds of things, the guys in EastPoint have extremely extensive knowledge about tunings and just what levels of performance can be achieved. As a lot of their work is being opened up as part of the OpenSolaris effort, there may well be some great ways Sun and Apache can help each other out. Hopefully, we might even see some of them at ApacheCon.
We’ve still got over a month with the T2000 left though, and plenty more to try out on it, so we’re not letting it rest just yet. It’s a very, very strong candidate for the next ftp.heanet.ie iteration, and it looks like there may be a solution within UFS to our single-threaded I/O numbers. ZFS would definitely help out there too.
It’s hard for me to look neutral anymore, since we’ve been donated a box, or maybe it’s easier – since we don’t really have anything to gain – but I have to say that I am very, very impressed, not only by Sun’s latest technologies and their approach to engineering (which has always been good), but also by the genuinely open and receptive nature they’ve shown. I’ve had contact with a few Sun employees during this trial, and they’ve all been extremely helpful. There was never any hint of pressure, or even corporate schmooze, just simple and honest advice, which is good to see. I still think Solaris has a bit to go before we could mass-deploy it in production again, mainly to do with the lack of a decent packaging system. That said, Sun now looks like an organisation which is genuinely receptive to feedback, so before the 60 days are up, I’ll try and write a blog post comparing Solaris to other platforms, for common administrative and automation tasks.
Anyway, thanks Sun, especially all of the people who’ve been speaking to me over the past few days – we’ll put the server to good use.
Niagara Benchmarks: Update
Since I’ve been posting material up on Niagara, a few other httpd and Solaris folk have been chiming in with some more expert opinions. I’ve also received more dder output from various platforms, which I plan to post up when I get a chance.
What I couldn’t wait for though, is to communicate the effect some single changes in our benchmarking setup have achieved. A few days ago I raved about the 5700 requests per second I was getting out of the Niagara box. Turns out that was a load of crap, here’s what I’m getting now;
Requests per second: 15298.68 [#/sec] (mean)
And here’s what an active ftp.heanet.ie is pushing;
Requests per second: 4445.26 [#/sec] (mean)
Abandoning siege, and using the latest version of ab is revealing some fundamental limitations in our previous benchmarks and some errors in our assumptions. O.k., so my assumptions. My bad, and I wanted to try and rectify it as quickly as possible. As I scramble together some time over the next few days, we’ll recheck our other results, and see what else is lurking under the hood in terms of performance. We suspect there’s more room for growth.
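For anyone collecting these figures from several runs, the summary line ab prints is easy to pull out with a regular expression. This is only a sketch: the function name is my own, and the two sample strings are the lines quoted above. Bear in mind that the second figure comes from a live, loaded machine, so the ratio is not a like-for-like comparison.

```python
import re

# Matches the summary line that ab (ApacheBench) prints, for example
# "Requests per second:    15298.68 [#/sec] (mean)".
RPS_LINE = re.compile(r"Requests per second:\s+([\d.]+)\s+\[#/sec\]")


def requests_per_second(ab_output):
    """Return the mean requests/sec found in ab output, or None if absent."""
    match = RPS_LINE.search(ab_output)
    return float(match.group(1)) if match else None


niagara = requests_per_second("Requests per second: 15298.68 [#/sec] (mean)")
live_ftp = requests_per_second("Requests per second: 4445.26 [#/sec] (mean)")

print(f"T2000: {niagara} req/s, live ftp.heanet.ie: {live_ftp} req/s")
print(f"ratio: {niagara / live_ftp:.2f}x")
```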
Thanks to Brian Akins for nailing the problem. Oh, and we’re hearing a lot of very very good things about, and seeing some very nice numbers from, Opteron systems too.
SUN contest form
So, the SUN contest form went online today. It does seem to have changed a lot in character since it originally started, but I guess it’s SUN’s hardware to give away. You can find the terms and conditions online too. Anyway, here’s how I filled it out;
First Name: Colm
Last Name: MacCarthaigh
Phone: PHONE
Company/Org: HEAnet
Location: Dublin
Try and Buy Sales Number: No Idea
May we contact you: yes
Describe application workload for all systems tested (ie Dynamic content currently using PHP and JSP, Page images requested using 2 parallel HTTP connections, simulates active users accessing their email via a standard Web browser):
T1000/T2000 Ultra-scalable Apache httpd deployment
Other system(s) Ultra-scalable Apache httpd deployment
Describe hardware configuration for all systems tested (active cores, # of disks, memory, network settings and topology):
Sun Fire T1000/T2000: T2000 with 16 GB of RAM
Other System(s): Dual 1.5Ghz Itanium with 32GB of RAM
Dual 3.2Ghz Xeon with 12GB of RAM
What were the results?
Sun Fire T1000/T2000: See blog http://www.stdlib.net/~colmmacc/category/niagara/
Other System(s): See blog http://www.stdlib.net/~colmmacc/category/niagara/
How is this good/how does it compare to previous results?
Sun Fire T1000/T2000: See blog http://www.stdlib.net/~colmmacc/category/niagara/
Other System(s): See blog http://www.stdlib.net/~colmmacc/category/niagara/
What is the SWaP measurement?
Sun Fire T1000/T2000: The SWaP measurement is highly questionable and very little scientific or statistical information is included on how it is considered relevant. It vastly over-simplifies a complicated metric, and hence is not included in any of my results. I have calculated power usage differences and extrapolated rough costs, as that is reasonably deterministic.
Did you find/solve any bugs? Yes, multiple. These are in the process of being resolved. See blog comments.
What can we do to improve the system/software? Solaris badly needs a credible packaging system, a clone of apt would be ideal.
I wish to make clear that I DO NOT accept all of the terms of this competition. I most certainly do not agree with;
12. Publicity: Except where prohibited, participation in the Contest constitutes a Contestant’s consent to Sun’s use of his/her name, likeness, voice, opinions, biographical information, place of residence and Performance Results for promotional purposes in any media without further payment or consideration. Contestant also agrees to assist Sun in promoting and marketing the Contest.
In particular, Sun may make absolutely no use of my place of residence whatsoever.
External links to posted results or blogs:
http://www.stdlib.net/~colmmacc/category/niagara/
I’m blogging it here just to make sure that my refusal to accept some of the ridiculous conditions is on-record. These weren’t part of the original announcement, which in fact seemed a lot simpler and a lot less bureaucratic when announced. But we’ll see how the contest progresses anyway.
Monolithic stupidity
I just caught this via digg, http://monolith.sourceforge.net/, man are these guys dumb.
Not only is what they’re doing incredibly illegal (those mono files certainly can be considered derivative works, and regardless transitive copying is involved), but they actually think they’re the first people to come up with this. Anyone remember the XOR encrypted DeCSS code that used the US Bill of Rights as its pad?
This really really is not the way to go about fighting these things.
Niagara vs ftp.heanet.ie Showdown
So, after a week with the Niagara T2000, I’ve managed to find some time to do some more detailed benchmarks, and the results are very impressive. The T2000 is definitely an impressive piece of equipment; it seems very, very capable, and we may very well end up going with the platform for our mirror server. Bottom line, the T2000 was able to handle over 3 times the number of transactions per-second and about 60% more concurrent downloads than the current ftp.heanet.ie machine can (a dual Itanium with 32GB of memory) running identical software. Its advantages were even bigger than that again, when compared to a well-specced x86 machine. Not bad!
The Introduction
ftp.heanet.ie is one of the single busiest webservers in the world. We handle many millions of downloads per day, but unusually for a high-demand site, we do it all from one machine. This is usually a bad idea, but as a mirror server has built-in resilience (in the form of a world-wide network of mirrors), and as we can’t afford 20 terabytes of ultra-scalable, network-available storage, we use a single machine with directly attached storage, and rely on our ability to tune the machine to within an inch of its life. We regularly serve up to 1.2 Gigabit/sec, and have handled over 27,000 concurrent downloads. There’s some more detail on our previous set-up (which is mostly identical to the current one) in my paper on Apache Scalability.
Over four years ago, when I started in HEAnet, Solaris and Sparc hardware represented about 50% of our Unix systems. Now it represents less than 2%, so I’ve had less and less opportunity to tinker on Solaris in the last few years, but have kept up with it enough to know how to use dtrace, and to still understand the Solaris fundamentals. At ApacheCon US 2005, Covalent had a T2000 along as a demonstration machine. I got to play with it a little and was very impressed. Unlike prior experiences, this machine felt very responsive. There was no waiting for the output of commands, no listening to the whirring of hard disks, and the benchmarking numbers it was producing weren’t bad either.
When Jonathan Schwartz announced the “Free Niagara box for 60 Days” deal, we jumped at the opportunity to test one of these boxes – which may be ideal for our needs. It took a while for Sun to iron out some administrative problems, but they certainly held up their end of the deal, and a nice shiny T2000 arrived a little over a week ago, for us to try out.
The Machines
To get a better sense of the machine’s performance in comparison to our other options, we rustled together a Dell 2850 Dual 3.2Ghz Xeon with 12GB of RAM, running Debian, and our current Dell 7250 Itanium (which is a dual 1.5Ghz with 32GB of RAM).
Throughout the benchmarking, the machine used for firing off the benchmarks (using ab, httperf and siege) was another Dell 2850, this time a dual 2.8Ghz Xeon with 4GB of memory. For performing the concurrency and latency tests, we used more, similarly-configured (and identical each time), 2850s and 2650s to run yet more parallel benchmarks.
As ftp.heanet.ie is a live system which we can’t simply take off-air because we want to complete some benchmarks, we ran the tests during its quietest periods of use. To be fair, we also made sure that the other two systems – when benchmarked – were loaded with a baseline of 40 requests per second, with an average concurrency of around 300. After initially determining which machines were “winning” the benchmarks we tried to structure the load to favour the “loser” of the benchmarks, if any decision was needed. This means that where one machine comes out on top, the margin by which it wins is actually a conservative estimate.
Ordinarily, we try to drastically reduce the number of services on a machine, to free up memory and scheduler time on the system. However, as the T2000 came with a large number of services running, and it’s not entirely easy to determine what is and isn’t actually a critical service, we shut down obvious candidates – such as the various network filesystem daemons – but left some others alone. Again, if anything, this means that our results are actually conservative for the Sun, although they probably do reflect a real-world set-up, which will have these services running.
The Preparation
As no system comes configured perfectly for such extreme tests, we did a number of things to each machine we tested, to achieve as much performance as we could manage. Since my Solaris skills are rustier than my Linux skills by a fair margin, it’s more than possible that our benchmarks under-represent the performance of the T2000.
- T2000
- The first thing we did after receiving the system was to get smpatch configured, and to run “smpatch update”. Getting the system completely up to date took a good 6 hours, and that still only covered critical and security updates, as we don’t have a subscription for everything else. Being a Debian and Ubuntu user, this is annoying. “apt-get update && apt-get dist-upgrade” would have done the same thing, and upgraded everything in about 15 minutes, at the very, very longest. Hopefully though, that will be improved upon.
Next, we installed the SUNWspro suite, in order to have a compiler, linker and so on – which is mighty useful for compiling Apache from source! Some reasonably trivial invocations of apachebench seem to show that this compiler produces faster binaries than gcc. Over the years, there have been claims that 64-bit binaries are actually slower than 32-bit binaries. Our testing didn’t show much of a difference, but just in case there is one, we used 32-bit builds of Apache, though with the correct largefile-magic, so that we could still transfer very large files.
We didn’t apply many Solaris kernel tunings, mainly because the Solaris team seem to be working hard to get rid of them, and putting a lot of effort into making the default behaviour ultra-scalable. Nevertheless, we upped max_nprocs various times to cope with the insane number of processes we were creating. Keeping an eye on tcp:tcp_conn_hash_size with ndd seemed to show little problem with the default values, and this is the main Solaris tunable we’ve had to tune in the past.
Apart from mounting the filesystems with the “noatime” mount-option, we did no filesystem tuning, which is something I’m keen to improve on, particularly if we can try out ZFS. Again, if anything, this means that the performance of the T2000 may be under-represented. However, as our benchmarking was restricted to just 3 files, with no directory traversals, probably not by much. If anyone has any pointers on intensive filesystem tuning on Solaris, please send them my way!
- Itanium 7250
- The Itanium box runs version 2.6.15.2 of the Linux kernel and our list of related sysctl’s looks like this;
net/core/wmem_default=5000000
net/core/wmem_max=5000000
net/core/rmem_default=5000000
net/core/rmem_max=5000000
net/ipv4/tcp_rmem="8192 87380 1747600"
net/ipv4/tcp_wmem="8192 87380 1747600"
net/ipv4/tcp_wmem="8192 10000000 10000000"
net/core/netdev_max_backlog=25
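One thing worth checking in a list like that is whether any key is assigned more than once, since a later assignment would normally override an earlier one when the settings are applied in order. The sketch below parses the list as quoted above and reports repeats; the function names are my own, and it says nothing about which value was actually intended.

```python
import shlex
from collections import Counter

# The sysctl settings exactly as listed above.
SETTINGS = """
net/core/wmem_default=5000000
net/core/wmem_max=5000000
net/core/rmem_default=5000000
net/core/rmem_max=5000000
net/ipv4/tcp_rmem="8192 87380 1747600"
net/ipv4/tcp_wmem="8192 87380 1747600"
net/ipv4/tcp_wmem="8192 10000000 10000000"
net/core/netdev_max_backlog=25
"""


def parse_settings(text):
    """Return (key, value) pairs in order; shlex copes with quoted values."""
    pairs = []
    for token in shlex.split(text):
        key, _, value = token.partition("=")
        pairs.append((key, value))
    return pairs


def duplicate_keys(pairs):
    """Return the keys that are assigned more than once."""
    counts = Counter(key for key, _ in pairs)
    return [key for key, count in counts.items() if count > 1]


pairs = parse_settings(SETTINGS)
effective = dict(pairs)  # later assignments overwrite earlier ones

for key in duplicate_keys(pairs):
    print(f"{key} is set more than once; last value: {effective[key]}")
```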
We also up the txqueuelen on our interfaces to 50000, for achieving super-high throughput to our Geant users. The XFS filesystem was mounted with the “noatime” and “ihashsize=65535” mount options.
- 2850 Xeon
- For the sake of consistency, the 2.6.15.2 kernel was also installed on the Xeon box, with the same system and interface settings as the Itanium box. The ext3 filesystem used was mounted with the “noatime” mount-option.
- Apache
- Common to each box were the usual Apache tunings we apply. For each machine, we tried to determine the quickest MPM to use. In the case of the two Dell boxes, this was the event MPM, which was ahead of the worker MPM by about 2%. We couldn’t get the event MPM working on Solaris (more about that later), so we used the worker MPM – which was over twice as fast as prefork on the platform.
As Solaris seemed to respond better to more LWP’s than PID’s, we ran with 64 threads per child – which is not at all an unreasonable number. Increasing beyond this did give us slightly better results, but the potential for 64 downloads to die at once, when there’s a problem, is just about enough real-world risk to deal with, for me. The relevant configuration stanza looks like:
<IfModule mpm_worker_module>
ServerLimit 1563
ThreadLimit 64
StartServers 10
MaxClients 100032
MinSpareThreads 25
MaxSpareThreads 75
ThreadsPerChild 64
MaxRequestsPerChild 0
</IfModule>

Note: these are stupid values for a real-world server, and will waste a lot of memory for the scoreboard. They are really only useful if you are doing some insane benchmarking and testing.
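The numbers in that stanza are tied together: with the worker MPM, MaxClients cannot exceed ServerLimit multiplied by ThreadsPerChild, and ThreadsPerChild cannot exceed ThreadLimit. A quick sanity check, with the values copied from the stanza and helper names of my own choosing, might look like this:

```python
# Values copied from the worker MPM stanza above.
config = {
    "ServerLimit": 1563,
    "ThreadLimit": 64,
    "StartServers": 10,
    "MaxClients": 100032,
    "MinSpareThreads": 25,
    "MaxSpareThreads": 75,
    "ThreadsPerChild": 64,
    "MaxRequestsPerChild": 0,
}


def worker_capacity(cfg):
    """Most simultaneous connections the process/thread limits allow."""
    return cfg["ServerLimit"] * cfg["ThreadsPerChild"]


def is_consistent(cfg):
    """True when MaxClients and ThreadsPerChild fit within their limits."""
    return (cfg["MaxClients"] <= worker_capacity(cfg)
            and cfg["ThreadsPerChild"] <= cfg["ThreadLimit"])


print(f"capacity: {worker_capacity(config)} threads, "
      f"MaxClients: {config['MaxClients']}, consistent: {is_consistent(config)}")
```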
We naturally set “AllowOverride None”. Interestingly, although sendfile() functions flawlessly on Solaris (unlike on Linux), using it seemed to have an impact on performance. Using it did reduce the amount of memory used by Apache on the box, but it gave slower performance than just read() and write() – so perhaps its blocking characteristics are slightly different. Thus, we set “EnableSendfile off” and used MMap instead (via “EnableMmap”) which seemed to be the fastest way to ship bytes.
Another hack we applied to speed up Apache was to change the default buffer size, which is buried in the bowels of APR and can only be changed at build-time. In each case, the buffer size was changed as per the most efficient value (as determined by our previous benchmarks on single-threaded I/O). Don’t try this at home kids, unless you really know what you’re doing.
So, with our tunings applied, we set about performing our benchmarks, and for the sake of sticking with the showdown theme, I’ve divided the results into good, bad and ugly. (No, there weren’t really any ugly results – it’s just a fun theme for a post).
The Good
- Power Usage
- As I’d previously blogged, one of the first things we were able to measure was the power usage of the machine. Much to my amazement, it remained at the original level (+/- 20%) of current draw for the duration of our tests, peaking at a mere 1.2 Amps, or about 290 Watts. This compares pretty favourably with our Dells, though I should add that the Dells both have more disks in their chassis than the T2000.
| Machine | Average draw | Peak | Yearly cost |
|---|---|---|---|
| Sun T2000 | 1 Amperes | 1.2 Amperes | €210 |
| Dell 2850 | 1.6 Amperes | 2 Amperes | €350 |
| Dell 7250 | 1.8 Amperes | 2.2 Amperes | €395 |

Costs are calculated on the average draw, at the Irish commercial ESB rate, and do not include cooling costs (roughly triple the number to get the overall yearly cost). Electricity supply was 240V, so multiply the Amperes by 240 to get the raw numbers of watts.

These results were calculated using an APC metered PDU. This is not a scientific instrument, and it’s entirely possible that results are inaccurate. Some rough calibration did show that the unit produced consistent results, so personally I’m confident enough about the order in which the machines are ranked, but I wouldn’t go so far as to be certain of the raw numbers produced. We really need a good power meter to produce that kind of reliability.
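For anyone who wants to redo the sums, the conversion from current draw to watts and yearly energy is simple enough to sketch. The electricity rate isn't quoted above, so the code only back-calculates the rate implied by each yearly cost; treat that as a rough consistency check rather than the actual ESB tariff.

```python
SUPPLY_VOLTS = 240
HOURS_PER_YEAR = 24 * 365

# name: (average amps, peak amps, quoted yearly cost in euro), from the table above
machines = {
    "Sun T2000": (1.0, 1.2, 210),
    "Dell 2850": (1.6, 2.0, 350),
    "Dell 7250": (1.8, 2.2, 395),
}


def watts(amps, volts=SUPPLY_VOLTS):
    """Raw power draw, ignoring power factor."""
    return amps * volts


def yearly_kwh(amps):
    """Energy used in a year at a constant current draw."""
    return watts(amps) * HOURS_PER_YEAR / 1000


for name, (average, peak, cost) in machines.items():
    kwh = yearly_kwh(average)
    print(f"{name}: {watts(average):.0f} W average, {watts(peak):.0f} W peak, "
          f"{kwh:.0f} kWh/year, implied rate EUR {cost / kwh:.3f}/kWh")
```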
- Requests per second
- How many requests the machine can handle in a second is probably the most valuable statistic when talking about webserver performance. It’s a direct measure of how many user requests you can handle. Fellow ASF committer, Dan Diephouse, has been producing some interesting stats for requests-per-second for webservices (and they are impressive), however we were more interested in how many plain-old static files the machine could really ship in a hurry. And without further ado, those numbers are;
Sun’s own benchmarks have quoted up to 2500 requests per second, which we didn’t find particularly impressive. Our current box – merely a dual Itanium – can do 2700 requests per-second without much trouble. I’m happy to confirm though, that the tricks we do to reduce Apache’s memory usage on Linux have as much of an effect on Solaris. Our results are averaged over 5 runs of the testing, during which the T2000 managed a very, very impressive 5718 requests per second. Not bad!
Despite the new kernel, the x86 box still struggled to push out a disappointing 982 requests per second, while our Itanium churned through a reliable 2712 requests per second.
- Concurrency
- Unfortunately, neither the siege nor apachebench utilities can cope with the levels of concurrency we test with these days, as there are simply far too many sockets involved. Tuning the client machine itself becomes a serious task in order to be able to cope with the sheer volume of outbound requests. We currently have some commercial traffic generation and scaling testers in our test-lab, but we decided not to use those either. Instead, multiple servers were thrown at the problem and we used 11 machines all-in, all running instances of siege at the same time. The instances were fired off by hand, but within a few seconds of each other, and more than enough requests (100,000) were used, to ensure that the processes were given enough time to ramp up to the level of parallelism required. Each machine was on the same LAN as the server we were benchmarking.
With those limitations in mind, the test certainly allowed us to find out the rough breaking point of each machine. On any system, sustaining over 10,000 concurrent requests would involve denying some requests outright, but the cut-off or breaking point was defined as the point when the server got to 50% availability. We used some other tricks, like assigning the server multiple IP addresses and targeting each client at a different address, to a) give the tuple-tracking code in the IP stacks an easier time and b) allow us to easily track how many clients each server was sustaining.
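The 50% cut-off is easy to express in code. The sketch below picks the lowest tested concurrency at which availability has fallen to the threshold; the sample measurements are invented purely to exercise the function and are not results from our tests.

```python
def breaking_point(samples, threshold=0.5):
    """Return the lowest concurrency whose availability is at or below threshold.

    samples is an iterable of (concurrent_requests, availability) pairs,
    with availability expressed as a fraction between 0 and 1.
    """
    for concurrency, availability in sorted(samples):
        if availability <= threshold:
            return concurrency
    return None  # availability never dropped that far in the tested range


# Hypothetical measurements, for illustration only.
hypothetical = [
    (10_000, 0.97),
    (20_000, 0.91),
    (30_000, 0.74),
    (40_000, 0.49),
    (50_000, 0.22),
]

print(breaking_point(hypothetical))
```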
Also, in each case, the system was pretty much unusable by the time we were done! After killing all of the connections, the Linux boxes would take about 5 minutes before becoming responsive enough that we could get to a shell prompt. The T2000 would take about 20 minutes, although I think that if we reserved more processes for the root uid, that might change – sshd seemed responsive enough, but would block on fork() when trying to create a shell process.
As you can see, the T2000 was able to sustain about 83,000 concurrent downloads, and my limited dtrace skills tell me that thread-creation at that point seemed to be the main limiting factor, which is hardly surprising. For us, that number represents an upper limit on what the machine could handle when faced with a barrage of clients. Of course, no server should ever be allowed to get into that kind of insane territory, but it’s always good to know that there is plenty of headroom. More to the point, it means that availability at the lower levels of concurrency is much higher. Compared to the 57,000 concurrent connections our Itanium box, and the 27,000 our Xeon box can handle, it looks like the T2000 would be a very, very good choice of server for our load.
- Latency vs concurrency
- I would have liked to have been able to measure availability vs concurrency, but unfortunately our method of testing doesn’t really allow for this. Although we can sum the availabilities as seen by each client participating in the benchmark, this doesn’t always time-average correctly. In other words, if we used two client systems, and client A reported 90% availability and client B reported 80% availability, does that mean 85% uptime overall, or 80%? Unfortunately, it doesn’t mean either. Averaging only works if the two figures are perfectly overlapped in time, so it’s an average – but weighted in proportion to the lack of an overlap. The real availability is somewhere between 80% and 85%, and it’s very hard to figure out where. If the client systems were identical in hardware terms, we could come close to solving the problem by firing off the benchmarks with the at command, but our systems aren’t all that close in terms of spec.
Instead, what we can do, is to measure the latency as it increases with concurrency, in each case taking the worst value from our benchmarking clients. Benchmarking from a single system shows that there is a very high degree of correlation between an increase in latency and a decrease in availability, so this measurable gives us a good idea of both.
Overall, the T2000 performs very impressively. At very low numbers of concurrency, it actually has a higher latency than either of the Dell machines we tested, but these latencies are of the order of tens of milliseconds. In other words, the network latency makes a bigger difference in the overall scheme of things.
With no concurrency at all, the T2000 would exhibit latency of 9 milliseconds, compared to the Itanium’s 1 millisecond (and in fact, ab actually outputs 0, so it’s less than 1 millisecond) and at 1000 concurrent requests the T2000 would have 48 milliseconds, compared to 12 milliseconds for the Dual Itanium box. However, as we scaled up the concurrency, the latency numbers change fairly rapidly, in favour of the T2000. Due to the huge changes in scale, we’ve had to use a logarithmic graph, but at 50,000 concurrent downloads, our Itanium would take up to 38 seconds to respond to a client, compared to the T2000’s 26 seconds. At 83,000 downloads, which only the T2000 could manage, the latency had gone up to 57 seconds, but it still responded.
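Putting just the figures quoted above side by side shows where the advantage flips. This sketch uses only those data points; concurrency 1 stands in for "no concurrency at all", and the Itanium's sub-millisecond figure is recorded as 1ms, which is an upper bound.

```python
# Worst-case latencies quoted above, in milliseconds, keyed by concurrency.
t2000 = {1: 9, 1000: 48, 50_000: 26_000, 83_000: 57_000}
itanium = {1: 1, 1000: 12, 50_000: 38_000}   # 1 ms at concurrency 1 is an upper bound


def latency_ratio(concurrency):
    """T2000 latency divided by Itanium latency at a shared data point."""
    return t2000[concurrency] / itanium[concurrency]


for level in sorted(set(t2000) & set(itanium)):
    ratio = latency_ratio(level)
    leader = "T2000" if ratio < 1 else "Itanium"
    print(f"{level:>6} concurrent: ratio {ratio:.2f}, lower latency on {leader}")
```

With so few points, all this can say is that the crossover lies somewhere between 1000 and 50,000 concurrent requests.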
Overall, I think it’s fair to say that while the T2000 doesn’t seem to have ultra-low latency performance, it has much better scalability and provides much better availability as more and more connections are added. So again, overall, the T2000 is still the better webserver.
The Bad
I’m a bit reticent to label these results “bad”, because they really are in areas in which Sun have never claimed the machine will perform. The Niagara platform is architected for parallelism, it’s not supposed to give great performance for any single-threaded task. If you have a load which requires great performance to a single client, Sun have an array of other hardware they’d prefer to sell you instead. However, since some aspects of single-threaded performance do have a direct impact on webserver performance, I’ve included some relevant ones here.
- Single-threaded I/O
- As I’ve previously blogged, one of the first benchmarks we run on any machine is to determine how much I/O a single-threaded task can drive, and what the most efficient buffer size to achieve that is. There’s much more detail in the linked blog post, but the summary information can be easily graphed:
These results may be attributable in part to the relatively slow system disks that the T2000 ships with, and much better performance can probably be derived by using a faster disk setup. On the other hand, the performance Linux achieves is mainly due to the very aggressive vfs caching it performs. Unlike the Linux box, the T2000 produces the same throughput numbers whether it is the first time or the tenth time it has read a file. Linux, on the other hand, takes much longer to serve a file the first time, but after that, it’s served from RAM.
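If you want to try this kind of test yourself, a minimal single-threaded read sweep can be sketched in a few lines. This is not the tool we used; the scratch file size and the buffer sizes are arbitrary choices for illustration, and because the file has only just been written, a run like this mostly measures cached reads, which is exactly the effect described above.

```python
import os
import tempfile
import time


def read_throughput(path, buffer_size):
    """Read the whole file with one thread; return bytes per second."""
    total = 0
    start = time.perf_counter()
    # buffering=0 gives a raw file object, so each read() is one system call.
    with open(path, "rb", buffering=0) as handle:
        while True:
            chunk = handle.read(buffer_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total / elapsed if elapsed > 0 else float("inf")


def sweep(path, buffer_sizes):
    """Measure throughput for each buffer size in turn."""
    return {size: read_throughput(path, size) for size in buffer_sizes}


# Small scratch file so the sketch runs quickly (size is an arbitrary choice).
with tempfile.NamedTemporaryFile(delete=False) as scratch:
    scratch.write(os.urandom(4 * 1024 * 1024))
    scratch_path = scratch.name

try:
    results = sweep(scratch_path, [4096, 65536, 1048576])
    for size, rate in results.items():
        print(f"{size:>8} byte buffer: {rate / 1e6:10.1f} MB/s")
finally:
    os.remove(scratch_path)
```

For numbers that mean anything, the file would need to be much larger than RAM or read cold, and each buffer size run several times.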
It’s also useful to put these results in context; what they mean is that a single-threaded task, doing as pure and simple an I/O task as possible, can push 3.5Gigabytes per second. The Niagara box comes with 4 Gigabit/sec interfaces, so even a single-threaded task could fill that, 7 times over. Still, if I were deploying a load with a large and very active database component, I would do some more extensive testing to ensure that any single-threaded I/O constraints had no overall effect.
- Single-download throughput
- After gathering the numbers on single-threaded I/O, and confirming that the T2000 could easily saturate its 4 Gigabit interfaces – at any level of concurrency high enough to generate that level of traffic – we decided to see if the I/O numbers exhibited themselves for a single download. To perform this benchmark we went back to basics, and used curl and wget to grab a 1 Gigabyte file repeatedly. To help the systems out, we increased the MTU to 9000 bytes and made sure the TCP window size was big enough to take the entire file straight away. We also monitored for any packet loss during the tests (there was none).
Due to the way we handle the load-balancing of our network interfaces on the Linux boxes, which is per-flow, any single download is limited to 1Gigabit/second. Sure enough, wget reported a neat 123 MB/sec fairly reliably. Since the balancing was per-flow, it’s entirely possible the machine can actually ship faster downloads, and neither system seemed under any strain while doing this. With the T2000 on the other hand, we could push no more than 48 MB/sec, which is still a very respectable 384Mbit/sec.
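The throughput figures here mix bytes and bits, so it's worth spelling out the conversions. The sketch below assumes decimal units (1 Gigabit = 1000 Megabit) and 8 bits per byte, and uses only figures quoted in this post.

```python
BITS_PER_BYTE = 8


def to_bits(rate_in_bytes):
    """Convert a rate in bytes/sec to bits/sec (same prefix, e.g. MB to Mbit)."""
    return rate_in_bytes * BITS_PER_BYTE


def to_bytes(rate_in_bits):
    """Convert a rate in bits/sec to bytes/sec (same prefix)."""
    return rate_in_bits / BITS_PER_BYTE


# Single-threaded I/O: 3.5 Gigabytes/sec against 4 Gigabit/sec of interfaces.
single_thread_gbit = to_bits(3.5)
fill_factor = single_thread_gbit / 4
print(f"{single_thread_gbit:.0f} Gbit/s, {fill_factor:.0f}x the interfaces")

# Single download on the T2000: 48 MB/sec.
t2000_mbit = to_bits(48)
print(f"T2000 single download: {t2000_mbit} Mbit/s")

# Linux boxes: 123 MB/sec measured against a 1 Gigabit/sec per-flow limit.
line_rate_mbyte = to_bytes(1000)
print(f"1 Gbit/s is {line_rate_mbyte:.0f} MB/s; measured 123 MB/s "
      f"is {123 / line_rate_mbyte:.0%} of that")
```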
Apart from increasing the MTU and Window size, we didn’t apply any Solaris-specific tunings for improving these numbers, so again, it’s possible that these numbers are under-representing true possible performance. And once again, we really have to put these numbers into context. As a whole, the T2000 has no problems saturating its 4 Gigabit/sec of connectivity, and that’s what it’s designed for – parallelism. All our numbers mean, is that if you wanted truly incredible performance for any single download, this probably isn’t the right architecture. Outside of where I work, and other high-speed research networks, I’m not aware of any place where high-speed, single-flow statistics really matter a whole lot, especially for HTTP. The network is usually a limiting factor anyway. I mean, how many people have jumboframe capable multi-gig WANs?
The Ugly
Ok, so ugly is a bad choice of word. But like I said, this is a “showdown”. While testing the event MPM, we did manage to upset the Solaris kernel to the extent that it actually crashed;
panic[cpu21]/thread=300024a7020: BAD TRAP: type=31 rp=2a102c87720 addr=0 mmu_fsr=0 occurred in module "genunix" due to a NULL pointer dereference
httpd: trap type = 0x31
pid=652, pc=0x10fb4dc, sp=0x2a102c86fc1, tstate=0x4400001607, context=0x514
g1-g7: 0, 0, 12, 38, 0, 0, 300024a7020
Nice! I haven’t looked into this in detail yet, but it’s likely due to the unusual synchronisation semantics the event MPM features right now. The event MPM is marked as experimental, and if you’re not an Apache developer, you probably shouldn’t be running it. Still, the thread-handling code within the MPM all runs as a non-root user, so it really shouldn’t be able to cause the kernel to crash. Then again, it was handling about 30,000 requests at the time, with no accept mutex. This isn’t exactly within the normal range of expected behavior for a userland application. Since switching to the worker MPM, we’ve had flawless performance and not a single crash.
The Conclusion
The T2000 is one very impressive piece of kit, and at a list price of around €15,000 ($16,995), costs less than half of the price of the dual Itanium we’ve been benchmarking it against (it’s also less than I can price up a comparable X86 box for – seems to be the memory that does it). We may very well go with the platform for our next iteration of ftp.heanet.ie.
The benchmarks we’ve run were all run with our own load in mind, but hopefully they’re still of some use to others. If you’re thinking about giving the platform a try, do run your own benchmarks though, don’t take our word for it. It’s always better to have these things validated and improved upon.
The Future
We’re not finished benchmarking just yet, we still have more planned! The Niagara box has some impressive SSL-offload features, and if we get a chance, we’d like to test those capabilities. We just need to get the hacked-up engine3-supporting versions of openssl and flood onto the box, which will involve a bit of research. Some of the Apache SpamAssassin guys may try running some SpamAssassin benchmarks on the machine too, which should be impressive, as they lend themselves to parallelisation very well. We’re also going to try and improve on our above tests, and I’ll keep blogging about the results as we manage to do that.
Rather tantalisingly, there’s a comment on Dan Kegel’s C10K page saying that “Doug Royer noted that he’d gotten 100,000 connections on Solaris 2.6 while he was working on the Sun calendar server”, but it doesn’t give any details of the hardware involved. But still, 100,000 connections, on 2.6! It gives me hope that with more tuning, the T2000 might be capable of scaling beyond the 83,000 we had.
If I develop some more free time, I also hope to use the machine to instrument Apache httpd (and maybe apr) for dtrace. Do check out Matty’s mod_dtrace though, for a cool module which instruments all of the handlers.
In the meantime, you can check out all of my blog posts about the Niagara box through my new Niagara category. Mads is also keeping tabs on other benchmarks taking place within the ASF community.
The Cheeky Part!
I don’t know what the current status is of Jonathan’s offer to be allowed to keep a server, at the discretion of the Niagara team – but we might as well give it a try.
Although we’re seriously considering the platform for the future, HEAnet doesn’t have a use for a Niagara box right now, but the other participants in our benchmarking efforts (and hopefully we’ll be blogging their results soon enough too) do – DCU’s Networking Society, RedBrick. RedBrick just celebrated 10 years as a networking society, and 5 years ago, Sun donated a massive E450 to the society, on which we ran our 2000 user shell server for 2 years.
We even pulled out all of the stops at the time, and had the Taoiseach (the Irish Prime Minister) turn out to launch the machine. I’m hoping we can convince SUN to donate the Niagara box to RedBrick, where they can use it for even more testing and benchmarking, as it really is an ideal machine for a shell environment. Lots and lots of low-memory parallel tasks.
So if you thought this round-up was of any use, digg it, link to it, or mail it to your local Sun Niagara team member, and we’ll see if we can be useful enough to merit a donation!
Sponsor Noirin!
Noirin is an Apache httpd-docs committer, and among other things has authored the Irish translation of the Apache error documents, improved the mod_rewrite documentation, done a massive overhaul of the mod_ssl documentation (now it no longer references things which haven’t existed in 5 years!), added event MPM documentation and fixed a load of grammar and spelling mistakes. Although she normally lives in Dublin, she’s currently spending a year in Munich on a European Exchange (Erasmus) study programme (she’s doing a Degree in Computational Linguistics). She’s one of httpd’s youngest committers and one of a very few women committers (which we’re trying to encourage!).
Annoyingly, although ApacheCon is in Dublin this year, she isn’t, and as a poor student even getting to the hackathon isn’t very easy for her, with the World Cup in Germany adding a lot to the flight costs around the time. It would cost her about 400 euro just to be in Dublin for the hackathon, and ApacheCon itself costs around 900 euro. So she’s looking for sponsors, corporate or otherwise.
First results from the Niagara Benchmarking
In order to make sure that the real benchmarks are as efficient as they can be, we’ve repeated our usual procedure using dd to determine the most efficient buffer size on the platform. More details about that procedure can be found in my earlier mis-titled blog post on scheduler benchmarking.
For the sake of comparison, I’ll repeat the results from our dual Itanium, which has 32GB of memory;

The important information to derive from the graph is the smoothness of the lines (which is a function of how well the scheduler and VM perform) and the absolute value of the bytes/sec number. The Itanium box can push about 3.5 x 10^9 bytes per second, or 3.5 Gigabytes/sec, which is 28 Gigabits/sec. Now bear in mind that the procedure involved is not multi-threaded or even multi-process, so we can very generously guess that the dual-CPU system could push about 56 Gigabits/sec of pure I/O throughput, completely ignoring the overhead implicit in multi-CPU I/O scheduling.
The benchmarking process relies on the presence of a version of dd with the dd-performance-counter patch from Debian, which the Sun box doesn’t have. Luckily however, SUN now have much of the source for Solaris online, so I popped over to the OpenSolaris code browser and grabbed a copy of dd.c and based on the Debian patch came up with the following patch. I also modified our dder.sh script a little to cope with the lack of seq;
#!/usr/bin/bash
STARTNUM="1"
ENDNUM="102400"
# create a 100 MB file
./dd bs=1024 count=102400 if=/dev/zero of=local.tmp
# Clear the record
rm -f record
# Find the most efficient size
i=$STARTNUM
while test $i -le $ENDNUM; do
    ./dd bs=$i if=local.tmp of=/dev/null 2>> record
    i=$(( $i + 1 ))
done
# get rid of junk
grep "transf" record | awk '{ print $7 }' | cut -b 2- | cat -n | \
while read number result ; do
    echo -n $(( $number + $STARTNUM - 1 ))
    echo " " $result
done > record.sane
and after a few hours of running, here is the result;
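If you’re reproducing this, the peak can also be pulled straight out of record.sane rather than read off a plot. The following is only a sketch: it assumes each line of the file holds a block size followed by a purely numeric bytes/sec figure, and best_block_size is a made-up helper name, not part of dder.sh.

```shell
#!/usr/bin/env bash
# Print the block size with the highest throughput from a record.sane-style
# file. Assumption: two whitespace-separated columns per line, block size in
# bytes followed by a purely numeric bytes/sec value.
best_block_size() {
    awk '$2 + 0 > best { best = $2 + 0; rate = $2; size = $1 }
         END { if (size != "") print size, rate }' "$1"
}

# Hypothetical usage, once the benchmark run has finished:
#   best_block_size record.sane
```

The helper prints the winning block size and its throughput on one line, and prints nothing at all if the file is empty.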

So, we have a graph which is very similar in shape to the dual-Itanium box, except it’s a whole order of magnitude less in raw throughput terms. As we’ve seen above, a process could push up to 3.5 Gigabytes/sec on the Itanium box; on the Niagara box that becomes 0.34 Gigabytes/sec, or about 2.72 Gigabit/sec. Now, the Niagara box is virtualised, in that the process runs on just one logical CPU and the Niagara box has 32 of them. So if we extrapolate from there, making the same generous guess as we did for the dual-Itanium box, we’d get 87 Gigabit/sec, again completely ignoring the multi-CPU overhead.
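For anyone who wants to check the back-of-the-envelope numbers, here is a rough sketch of the extrapolation. It simply multiplies the single-process figure by 8 bits per byte and by the CPU count, with no allowance for multi-CPU overhead; the function name is invented for illustration.

```shell
#!/usr/bin/env bash
# Naive aggregate estimate: single-process gigabytes/sec, times 8 bits per
# byte, times the number of CPUs. Deliberately ignores multi-CPU overhead,
# so treat the result as a generous upper guess.
# Arguments: gigabytes_per_sec cpu_count
naive_gbits() {
    awk -v gbytes="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", gbytes * 8 * cpus }'
}

naive_gbits 3.5 2     # dual Itanium, two physical CPUs
naive_gbits 0.34 1    # Niagara, one logical CPU
naive_gbits 0.34 32   # Niagara, all 32 logical CPUs
```

This prints 56.00, 2.72 and 87.04 Gigabit/sec respectively, which is where the comparison between the two boxes comes from.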
Now bearing in mind that the Itanium box would have to deal with the overhead of managing two physical CPUs, and the Niagara box with the overhead of managing 32 logical CPUs on 8 physical cores, there probably isn’t very much between them in reality in terms of how much raw overall I/O they can push. If I had to guess at this stage which would win, I’d say the Niagara box, but hopefully we’ll get much more meaningful information over the next few weeks.
Either way, both systems can probably comfortably saturate a 10Gigabit/sec interface, and can certainly have a single process saturate a gigabit interface, which is all they ever have to be engineered for. Beyond that the number doesn’t matter a whole lot, unless you’re running a very very busy database server. But this information is still very useful. For one thing, it gives me some confidence that with a properly tuned Apache build we can blow SUN’s own benchmarking numbers away; this system looks like it’s capable of very decent I/O performance. It also confirms the architectural and engineering decisions SUN says they’ve made. This system is architected for parallelism. It’s not supposed to have super amazing performance for any single-process task; it’s designed to be able to run lots of those tasks better than anything else can, all at the same time.
They’re not lying, this is very different from hyperthreading. With hyperthreading, we don’t see anything like these results, when we run our graphs we get plots like this;

and it barely changes when we turn hyper-threading off. Hyper-threading seems like a convenient interface to enable better pipelining, and there’s nothing wrong with that, but if you’re running one process it won’t make a difference. Niagara on the other hand seems to behave just like a load of individual CPUs, with a lot less cross-subsidisation (if any). So, if you really want amazing single-process performance, or you have a requirement for a single process to be able to sustain say a 10Gigabit/sec download, then Niagara is definitely not the right platform, but then SUN don’t claim that it is. SUN have made the design choice to really build for multi-threaded systems and our benchmark here seems to validate that.
The only other information that can be gleaned from our graph is that the lines are slightly less smooth than on the Itanium box. Only a little, and frankly I’m surprised they’re as smooth as they are considering the amount of virtualisation which is going on. The graph is still vastly more smooth than any x86 plot we’ve ever performed and about the only conclusion we can make is that if you had some real-time task that was sensitive to the microsecond, Niagara probably wouldn’t be as good a choice of platform as Itanium, though still much much better than x86. Again, this is not the market SUN are aiming Niagara at, and we’re really just validating some engineering choices here.
So, armed with our new knowledge of the system’s potential, we’re going to really put this system to task and hopefully get the most useful information of all out of it: just how much real network throughput can it manage, and just how many concurrent downloads can it really handle?
Update:
O.k., so it looks like the roughness of the graph can be neatly explained by the Solaris scheduler granularity. Reading about it here, with pointers from Paul, reveals that by default ordinary processes have a 100Hz scheduling period. Contrary to my statement about real-time applications, there are APIs available which expose much higher frequency scheduling, so ignore that part. Also, I’m informed that the Niagara SMT is also a means to increase pipelining efficiency, but that the pipelines are shorter and the switching latency much smaller. That does seem to be borne out by the above.
Sun Goodness
Following on from my previous post about our experiences trying to get a Sun server, I got some great help from some SUN employees, not least Paul Jakma (of Quagga fame) and after filling in one form and posting a comment on Jonathan Schwartz’s blog, and telling a small fib about what country we’re in, the T2000 arrived yesterday.
Right now, I’m on study leave from work, but I’m in a few half-days and plan to steadily put the system through its paces. We went for the T2000, with 16GB of memory, and prtdiag shows me a grand total of 32 logical processors, nice!
MB/CMP0/P0 0 1000 MHz SUNW,UltraSPARC-T1 MB/CMP0/P1 1 1000 MHz SUNW,UltraSPARC-T1 MB/CMP0/P2 2 1000 MHz SUNW,UltraSPARC-T1 MB/CMP0/P3 3 1000 MHz SUNW,UltraSPARC-T1 MB/CMP0/P4 4 1000 MHz SUNW,UltraSPARC-T1 MB/CMP0/P5 5 1000 MHz SUNW,UltraSPARC-T1 MB/CMP0/P6 6 1000 MHz SUNW,UltraSPARC-T1 MB/CMP0/P7 7 1000 MHz SUNW,UltraSPARC-T1 MB/CMP0/P8 8 1000 MHz SUNW,UltraSPARC-T1 MB/CMP0/P9 9 1000 MHz SUNW,UltraSPARC-T1 MB/CMP0/P10 10 1000 MHz SUNW,UltraSPARC-T1 MB/CMP0/P11 11 1000 MHz SUNW,UltraSPARC-T1 MB/CMP0/P12 12 1000 MHz SUNW,UltraSPARC-T1 MB/CMP0/P13 13 1000 MHz SUNW,UltraSPARC-T1 MB/CMP0/P14 14 1000 MHz SUNW,UltraSPARC-T1 MB/CMP0/P15 15 1000 MHz SUNW,UltraSPARC-T1 MB/CMP0/P16 16 1000 MHz SUNW,UltraSPARC-T1 MB/CMP0/P17 17 1000 MHz SUNW,UltraSPARC-T1 MB/CMP0/P18 18 1000 MHz SUNW,UltraSPARC-T1 MB/CMP0/P19 19 1000 MHz SUNW,UltraSPARC-T1 MB/CMP0/P20 20 1000 MHz SUNW,UltraSPARC-T1 MB/CMP0/P21 21 1000 MHz SUNW,UltraSPARC-T1 MB/CMP0/P22 22 1000 MHz SUNW,UltraSPARC-T1 MB/CMP0/P23 23 1000 MHz SUNW,UltraSPARC-T1 MB/CMP0/P24 24 1000 MHz SUNW,UltraSPARC-T1 MB/CMP0/P25 25 1000 MHz SUNW,UltraSPARC-T1 MB/CMP0/P26 26 1000 MHz SUNW,UltraSPARC-T1 MB/CMP0/P27 27 1000 MHz SUNW,UltraSPARC-T1 MB/CMP0/P28 28 1000 MHz SUNW,UltraSPARC-T1 MB/CMP0/P29 29 1000 MHz SUNW,UltraSPARC-T1 MB/CMP0/P30 30 1000 MHz SUNW,UltraSPARC-T1 MB/CMP0/P31 31 1000 MHz SUNW,UltraSPARC-T1
I’m going to have to install some more software (not least Apache!) and apply some tunings before it can be benchmarked properly, but already we’re getting some useful information. We’ve plugged the system into a metered Power Distribution Unit (PDU), to get a sense of how much current it draws, and while the meter is not granular enough to tell me what each port uses, here’s the graph of before/after plugging it in;
Plugging the T2000 in is almost exactly half-way through the graph, and you see a small spike there as the system controller comes online. The next ramp then is when I issued the poweron command and the whole system came online. The power unit only measures in 0.1 amp steps, and the step corresponds to an increase from 5.8 amps before plugging in the unit to 6.8 amps after it is fully powered on.
So, 1 Amp at 220 volts is 220 Watts +/- 10% given the accuracy of the unit. That’s pretty good for a beefy server. The Niagara platform is not the subject of SUN’s famous ads comparing their power usage to Dell’s – those were x64 servers – but still it’s good to see that power usage has been kept in trim. I remember when E450s would guzzle many times that number of watts. Our Dell 2850s use about 290 Watts each +/- 10%, when they’re not busy, for the sake of comparison.
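The arithmetic behind that figure is simple enough to script. This is just a sketch, taking the 220 volt supply and the 10% tolerance as given above rather than as anything measured; power_range is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Estimate the power draw from a step in PDU current, with a tolerance band.
# Assumptions: the supply voltage and the percentage tolerance are the
# figures quoted in the text, and the load is treated as purely resistive.
# Arguments: amps_before amps_after volts tolerance_percent
power_range() {
    awk -v a="$1" -v b="$2" -v v="$3" -v t="$4" 'BEGIN {
        w = (b - a) * v
        printf "%.0f W (%.0f to %.0f W)\n", w, w * (1 - t / 100), w * (1 + t / 100)
    }'
}

power_range 5.8 6.8 220 10
```

For the readings above this prints "220 W (198 to 242 W)", so even at the top of the band the T2000 draws less than the 290 Watts or so our idle Dell 2850s use.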
I’ll keep blogging our results from all of the testing and benchmarking as we go through it, and since I’m due to talk at the Irish System Administrators Guild this Tuesday, I’ll probably include a lot of the results there too.


