
Optimising strlen()

Posted on March 1, 2009, under coding, general.

Optimisation has a bit of a bad reputation these days. Good advice, such as Hoare's dictum that "premature optimization is the root of all evil", has led to a stern outlook on adding obfuscated mess to gain efficiency. Sometimes going really, really fast just makes you really, really insane.

Economically, the convenience of developers has won out over efficiency. Very high-level languages are near ubiquitous. Very few people have to squeeze implementations into so many CPU operations, or such and such an amount of memory. On the contrary, click-dragging many millions of CPU operations and many millions of bytes of memory in the form of some library, form-control or icon in an even more inefficient IDE is the norm.

But as others have put it far better, advocating the complete absence of optimisation is advocating a fallacy. Focused, intelligent and, crucially, well-documented optimisation is a very important skill that can save a lot of time and money. To give three concrete examples:

1. Deriving optimal buffer sizes

At HEAnet, when we were scaling ftp.heanet.ie to cope with over 50,000 concurrent downloads from a single host, we ran a lot of experiments to figure out the optimal read buffer size.

This is very rarely done; most buffers are chosen arbitrarily, with 4k and 8k being common for various reasons. Our experiments showed that the optimal buffer size was actually around 40k, and when we made this change within Apache we measured a 25% capacity improvement.

2. Using integers to compare strings

The objective of a contract I was involved in some years ago was to analyse a lot of HTTP data, from a load-balancer, to try to determine some statistics about it. Our input was billions of requests, and getting the reports as quickly as possible was important (it fed into a health-check system, with a rolling average).

One of the key stages of the process was the very start of the request, because we would branch on the type of request (GET, HEAD, CONNECT, PUT, POST, etc.). A branch here was bad for two reasons: it interfered with pipelining and with the CPU caches, which is especially inefficient when most requests were GETs anyway.

Re-ordering the branches so that the GET case was a "fall-through" (that is, it didn't involve a jump) helped a little, but there was still some inefficiency going on. The string comparison itself seemed to be wasting some of the L1 cache on us.

So, as a crazy solution, we treated the first 4 bytes of the request as an integer and used integer comparison. On x86, "GET" == 5522759, "HEAD" == 1145128264 and so on. The first four bytes just happen to be unique across the HTTP methods, and the check can happen in the CPU registers directly, without having to dereference pointers.

The app got about twice as fast, but this was probably due to some other variable now fitting in the L1 cache. Plenty of explanatory comments in the code made this an acceptable, if still crazy, optimisation to make. It also taught everyone involved exactly what SIGBUS really is.
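To make the trick concrete, here is a hedged sketch of that kind of dispatcher. The enum, names and constants are invented for illustration; using memcpy for the 4-byte load (which compilers turn into a single word load) sidesteps the unaligned-access problem behind the SIGBUS lesson above.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical method dispatcher: load the first 4 bytes of the
 * request line into an integer and branch on that, instead of
 * doing string comparisons. */
enum method { M_GET, M_HEAD, M_POST, M_OTHER };

static uint32_t first4(const char * s)
{
    uint32_t w;
    memcpy(&w, s, sizeof w);  /* compiles to one 32-bit load */
    return w;
}

enum method classify(const char * req)  /* req must be >= 4 bytes */
{
    /* note: no trailing NUL, these arrays hold exactly 4 bytes */
    static const char g[4] = "GET ", h[4] = "HEAD", p[4] = "POST";
    uint32_t w = first4(req);

    if (w == first4(g)) return M_GET;   /* most common case first */
    if (w == first4(h)) return M_HEAD;
    if (w == first4(p)) return M_POST;
    return M_OTHER;
}
```

Because "GET " (with its trailing space from the request line), "HEAD" and "POST" all differ within their first four bytes, one register compare settles the branch.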

3. Replacing inefficient calls

Sometimes the simplest optimisations are the most effective. At my current job, we managed to speed one of our external services up by about 20% just by replacing the top 3 calls that showed up in gprof with standard implementations that are written in x86 assembly.

This is classic optimisation, run a profiler and go for a targeted attack, but it can still be amazing what it can achieve.

Unfortunately, one of the problems with optimisation is that it's hard. Profiling something is relatively straightforward and repeatable, but even the basic step of knowing that your application is slow is non-trivial. Sure, there might be a high-level SLA, but you can always try throwing hardware at a problem.

Knowing that something is slow, or large, involves being able to make educated estimates about implementations, from first principles. Optimisers can approach a problem and think "well, we're accessing X much data and performing about Y basic CPU operations, so the fundamental limit is probably around Z". That's just step 1.

The other key part is that optimisers tend to have a deep, intuitive understanding of systems at every level. They know not just a programming language, but the intricacies of how it is implemented. They'll know how memory managers, file-systems, compilers, CPUs, networks and all sorts of things in between actually work. This takes years, and quite a few "ahhhh" moments at 3AM in the middle of some nightmarish failure scenario.

strlen()

So where to start? One good place to look is standard libraries; they contain many implementations of very basic routines that are called so often that they have to be optimal. Additionally, the really common routines are implemented in nearly every language, and you can look at the trade-offs made within each. It's a good way to learn a lot. One great case-study is strlen().

Before we go on, let's get some things defined. For the most part here, we're going to look at C-style strings. A quick refresher: in C, strings are just a zero-terminated array of bytes (char is the C type for a character). They look like this:

/* A C string initialised as an array */
char hello[] = { 'h', 'e', 'l', 'l', 'o', 0 };

/* The same C string */
char hello[] = "hello";

/* A pointer to a C string */
char * hello = "hello";

strlen(), as the name suggests, returns the length of a string:

/**
 * Return the length of a string.
 *
 * @param str The string to measure
 * @return    The string's length
 */

size_t strlen(const char * str);

Some important things: strlen("hello") returns 5, not 6. The zero at the end doesn't count as part of the string. Zero is the only thing that defines the end of a string; all sorts of unprintable and control characters can appear inside a string.

When someone is learning to program, a typical first attempt at a strlen implementation will be something like this:

Method 1: an iterative for-loop

size_t strlen(const char * str)
{
    size_t len;

    for (len = 0; str[len]; len++);

    return len;
}

This method is easy to understand and read, but it's not very good. It runs in O(n) time and involves a lot of jumps.

The next thing programmers usually realise is that they don't have to use array indices; pointers are enough.

Method 2: sacrifice a variable and readability

size_t strlen(const char * str)
{
    const char * ptr;

    for (ptr = str; *ptr; ++ptr);

    return ptr - str;
}

Depending on how clever the compiler is, this code may be slightly faster, because there won’t be so many additions to the base pointer. We increment the pointer only, and use the difference between it and the base-pointer as the length. This code is less readable though, and probably counts as premature optimisation.

This method is still O(n) and still involves a lot of jumps. Let’s see what we can do about that next.

Method 3: partially unroll the loop

size_t strlen(const char * str)
{
    const char * ptr = str;

    while (1)
    {
        if (!*(ptr++)) break;
        if (!*(ptr++)) break;
        if (!*(ptr++)) break;
        if (!*(ptr++)) break;
    }

    return (ptr - 1) - str;
}

Now we're into serious unreadability territory (and, by the way, the above is how djb implements strlen). But we have actually gained some efficiency: now, for every jump operation that the loop creates, we test for the zero value 4 times. The choice of 4 is arbitrary here; an interesting exercise would be to vary this number and test each possibility.

More tests will interfere with pipelines, but there will be fewer jumps; who knows where the sweet spot is. Still, though this implementation is faster, it's still O(n). In C we're doomed to an O(n) implementation at best, but we're still not done ... we can do something about the very size of n.

Method 4: word-wise checks

Just like the example I gave earlier, where we used an integer to compare 4 bytes of a string all at once, we can do something even cleverer with strlen. We can construct a test that, with one small exception, can determine whether any byte within a long is set to zero. This works for 2-, 4- and 8-byte word sizes.

It works by setting up a bit pattern like "01111110 11111110 11111110 11111111" (0x7efefeff), and then adding an arbitrary word to it. Anything except 0 in any of the bytes should cause an overflow into the neighbouring bytes. So by performing an addition, and then testing the "hole" bits, we can tell that there was likely a zero. There is one case where we'll get a false positive, which can happen when bit 31 is set, but that's easy to check for.

It’s too long to reproduce here, but you can see the entire glibc implementation as a great example.
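As a taste of it, here is a much-simplified sketch of the magic-number test, assuming 32-bit words; the function name is invented, and the real glibc code also handles 64-bit words and is far more heavily tuned.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified word-wise strlen sketch (32-bit words only). */
size_t wordwise_strlen(const char * str)
{
    const char * p = str;

    /* Step byte-by-byte until p sits on a 4-byte boundary */
    while ((uintptr_t) p % sizeof(uint32_t) != 0) {
        if (*p == 0)
            return p - str;
        p++;
    }

    const uint32_t * wp = (const uint32_t *) p;
    const uint32_t magic = 0x7efefeffU; /* the "holes" pattern above */

    for (;;) {
        /* Reading a whole aligned word may look past the terminator,
         * but never past the end of the word containing it */
        uint32_t word = *wp++;

        /* Adding magic carries out of every non-zero byte; if the
         * hole bits changed, some byte was probably zero (false
         * positives can occur when bit 31 is set) */
        if ((((word + magic) ^ ~word) & ~magic) != 0) {
            const char * cp = (const char *) (wp - 1);
            for (size_t i = 0; i < sizeof(uint32_t); i++)
                if (cp[i] == 0)
                    return (cp + i) - str;
            /* False positive: no zero byte after all, keep going */
        }
    }
}
```

Note how both costs discussed below show up here: the byte-by-byte alignment prologue, and the per-byte re-check once the word test fires.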

With this method we have divided our problem space: n is now n/4 or n/8, and the checks happen much more quickly. With really small strings, though, this method may actually be worse. The reason is that word-wise checks also have to be word-aligned, so if the string starts on a mis-aligned byte we may have to scan up to 3 characters (for 32-bit words; it's 7 characters for 64-bit words) before becoming word-aligned.

Additionally, when we do find a zero, we still have to perform 4 tests to figure out exactly which byte it was. If you had a lot of mis-aligned 7-byte strings, this method would be highly sub-optimal.

Method 5: Outsource the problem

In the many-years war between RISC and "kitchen sink" architectures, the latter appears to have won. x86 is king, and that means we usually have plenty of instructions at our disposal. Two such instructions are "repnz" (repeat while non-zero) and "scasb" (scan string), which combined allow us to instruct the CPU to go and find the next zero in a range of memory, using whatever tricks it likes.

The CPU can implement any of the approaches we've already talked about, but much more is possible. Modern memory, and SRAM in particular, has been designed to make parallelised searches possible. The chips can be electrically probing many sequences of bits all at once, feeding into a tree of gates that makes sure the earliest match wins: O(log n), in hardware. This is highly optimal, and if you're using strlen() on an x86 host right now, that's probably what's happening.
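For the curious, a toy illustration of handing the scan to the CPU, using GCC/Clang extended inline assembly on x86-64. The function name is ours, and this is a sketch of the repnz/scasb idea only, not how any production libc does it.

```c
#include <stddef.h>

/* Toy strlen using repne scasb: AL holds the byte to find (0),
 * RDI the start address, RCX a maximum count. */
size_t strlen_scasb(const char * s)
{
    size_t count = (size_t) -1;  /* effectively no limit */
    const char * p = s;

    __asm__("repne scasb"            /* scan until [rdi] == al */
            : "+c" (count), "+D" (p) /* rcx and rdi are updated */
            : "a" ((unsigned char) 0)
            : "cc", "memory");

    /* repne decrements rcx once per byte, including the zero byte */
    return (size_t) -1 - count - 1;
}
```

Amusingly, on recent CPUs this tends to lose to the SIMD-based implementations shipped in modern libcs, which is its own lesson about measuring rather than assuming.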

But what else can we do? What if we make some trade-offs about the very nature of C strings?

Method 6: Add a cache

C is rightly criticised for how it handles strings; it's pretty dumb, and C buffer overflows remain a major source of security problems. There really isn't any excuse. Treating strings as arbitrary regions of zero-terminated memory has one main advantage: it allows strings to be separated in-place. One can take "Hello World" and make it "Hello" and "World" with the very simple insertion of a zero byte.

But strings are not separated a whole lot, and it tends to be a one-time operation. If we throw away this "feature", there is a much more efficient way to get string length: just keep recording it in a cache.

typedef struct {
    char * str;
    size_t len;
} string;

size_t strlen(const string * s)
{
    return s->len;
}

Of course, we needn't modify the API here; instead of creating a new type, we could just keep a static index of pointer values (say, a hash map) and record the lengths there. In fact, we could even have a compiler do some of this for us, by marking the function as const so that repeated calls can be cached.

But we do need to enforce the API: now everything that deals with strings has to update this cache, and we can't permit the programmer to mess with the string internally. Our string functions become the only valid way.

This is actually the norm. There are quite a few APIs for doing this in C, and it's also what djb does in his code. Almost all modern languages take this approach, keeping an internal record of every string's length. Finally, we have a real O(1) solution.

What next?

But it doesn’t end there. Every solution above had trade-offs, and there are plenty of other ways of doing it.

Some text processors, for example, are designed to accommodate lots of string concatenation, so strings are optimistically allocated much more memory than they need, with lots of zeroes at the end. For these strings, if we know how much memory is allocated, we can implement strlen as a word-wise binary search, which is O(log n).
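A sketch of that binary search, under a hypothetical invariant: the buffer is `alloc` bytes, the string contains no interior zero bytes, and everything after the terminator (including the last byte) is zero-filled. It probes byte-by-byte for clarity; a word-wise version would probe whole words instead.

```c
#include <stddef.h>

/* Binary-search strlen for a zero-padded buffer of known size. */
size_t padded_strlen(const char * buf, size_t alloc)
{
    size_t lo = 0, hi = alloc;

    /* Find the smallest index holding a zero byte: O(log n) probes */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (buf[mid] == 0)
            hi = mid;        /* terminator is at mid or earlier */
        else
            lo = mid + 1;    /* still inside the string */
    }

    return lo;
}
```

The invariant is doing all the work here: the moment a stray zero can appear inside the string, or non-zero garbage after it, the search breaks down.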

Some XML processors, for another example, know that a very high percentage of the time a closing tag will be exactly the same length as the name of the opening tag plus 3 ('<', '/' and '>'), so they take that as a best guess, skip straight to that length, and see if they find what they expect.

Other times, optimisers perform statistical analysis on their inputs, find out where the peaks are, and optimise the tests to go for those paths first. There really is an endless series of possibilities, and this is for something as simple as getting the length of a string. Underneath every little problem is a world of opportunity, and a wealth of material from very clever people. (The Varnish source code, and its Architect Notes, are also a great source of inspiration.)

A few years ago, at Joost, we came across exactly this problem: an application that was spending about 10% of its time inside strlen. Within an hour or so of rewriting some primitives, it was down to less than a hundredth of a percent. You can't drag and drop that kind of improvement in an IDE.