Optimising strlen()

Posted on March 1, 2009, under coding, general.

Optimisation has a bit of a bad reputation these days. Good advice, such as Hoare’s dictum that “premature optimization is the root of all evil”, has led to a stern outlook on adding obfuscated mess to gain efficiency. Sometimes going really, really fast just makes you really, really insane.

Economically, the convenience of developers has won out over efficiency. Very high-level languages are near ubiquitous. Very few people have to squeeze implementations into so many CPU operations, or such and such an amount of memory. On the contrary, click-dragging many millions of CPU operations and many millions of bytes of memory in the form of some library, form-control or icon in an even more inefficient IDE is the norm.

But, as others have put far better, advocating the complete absence of optimisation is advocating a fallacy. Focused, intelligent and, crucially, well-documented optimisation is a very important skill that can save lots of time and money. To give three concrete examples:

1. Deriving optimal buffer sizes

At HEAnet, when we were scaling ftp.heanet.ie to cope with over 50,000 concurrent downloads from a single host, we ran a lot of experiments to find the optimal read buffer size.

This is very rarely done; most buffers are chosen arbitrarily, with 4k and 8k being common for various reasons. Our experiments showed that the optimal buffer size was actually around 40k, and when we made this change within Apache we measured a 25% capacity improvement.

2. Using integers to compare strings

The objective of a contract I was involved in some years ago was to analyse a large volume of HTTP data from a load-balancer and to derive some statistics from it. Our input was billions of requests, and getting the reports out as quickly as possible was important (they fed into a health-check system, with a rolling average).

One of the key stages of the process was the very start of the request, because we would branch on the type of request (GET, HEAD, CONNECT, PUT, POST, etc.). A branch here was bad for two reasons: it interfered with pipelining and with the CPU caches, which was especially wasteful given that most requests were GETs anyway.

Re-ordering the branches so that the GET case was a “fall-through” (so that it didn’t involve a jump) helped a little, but there was still some inefficiency. The string comparison itself seemed to be wasting some of the L1 cache on us.

So, as a crazy solution, we treated the first 4 bytes of the request as an integer and used integer comparison. On x86, which is little-endian, “GET” == 5522759, “HEAD” == 1145128264 and so on. The first four bytes just happen to be unique across the HTTP methods, and the check can happen directly in CPU registers without having to dereference pointers.

The app got about twice as fast, but this was probably due to some other variable now fitting in the L1 cache. Plenty of explanatory comments in the code made this an acceptable, if still crazy, optimisation to make. It also taught everyone involved exactly what SIGBUS really is.
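For illustration, here is a hedged sketch of the trick; the helper name and the comparison against “GET ” (with the trailing space of a request line) are mine, not the original code. Using memcpy rather than casting the pointer also sidesteps the unaligned access that introduces you to SIGBUS:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the integer-comparison trick. On a
 * little-endian machine the bytes 'G','E','T',' ' read as one
 * 32-bit integer are 0x20544547. memcpy avoids the unaligned
 * dereference that can raise SIGBUS on stricter architectures. */
static int is_get(const char *request)
{
    uint32_t word;

    memcpy(&word, request, sizeof word);  /* first 4 bytes of the request */
    return word == 0x20544547;            /* "GET " on little-endian */
}
```

A good compiler turns the memcpy into a single load, so nothing is lost over the raw cast.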

3. Replacing inefficient calls

Sometimes the simplest optimisations are the most effective. At my current job, we managed to speed one of our external services up by about 20% just by replacing the top 3 calls that showed up in gprof with more standard implementations, written in x86 assembly.

This is classic optimisation: run a profiler and make a targeted attack. It can still be amazing what that achieves.

Unfortunately one of the problems with optimisation is that it’s hard. Being able to profile something is relatively straightforward and repeatable, but even the basic step of knowing that your application is slow is non-trivial. Sure there might be a high-level SLA, but you can always try throwing hardware at a problem.

Knowing that something is slow, or large, involves being able to make educated estimates about implementations, from first principles. Optimisers can approach a problem and think “well, we’re accessing X much data and performing about Y basic CPU operations, so the fundamental limit is probably around Z”. That’s just step 1.

The other key part is that optimisers tend to have a deep, intuitive understanding of systems at every level. They know not just a programming language, but the intricacies of how it is implemented. They’ll know how memory managers, file-systems, compilers, CPUs, networks and all sorts of things in-between actually work. This takes years, and quite a few “ahhhh” moments at 3AM in the middle of some nightmarish failure scenario.

strlen()

So where to start? One good place to look is standard libraries: they contain many implementations of very basic routines that are called so often that they have to be optimal. Additionally, the really common routines are implemented in nearly every language, and you can examine the trade-offs made within each. It’s a good way to learn a lot. One great case-study is strlen().

Before we go on, let’s get some things defined. For the most part here, we’re going to look at C-style strings. A quick refresher: in C, strings are just a zero-terminated array of bytes (char is the C type for a character). They look like this:

/* A C string initialised as an array */
char hello[] = { 'h', 'e', 'l', 'l', 'o', 0 };

/* The same C string */
char hello[] = "hello";

/* A pointer to a C string */
char * hello = "hello";

strlen(), as the name suggests, returns the length of a string:

/**
 * Return the length of a string.
 *
 * @param str The string to measure
 * @return    The string's length
 */

size_t strlen(const char * str);

Some important things: strlen(“hello”) returns 5, not 6. The zero at the end doesn’t count as part of the string. Zero is the only thing that defines the end of a string; all sorts of unprintable and control characters can appear inside one.
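To make those rules concrete, a small demonstration (the array name is mine):

```c
#include <string.h>

/* An embedded zero ends the string early: the array below holds
 * five bytes of data, but as a C string it is just "hi". */
static const char tricky[] = { 'h', 'i', 0, '!', 0 };
```

strlen(“hello”) is 5, strlen(tricky) is 2, and control characters such as tabs and newlines count like any other byte.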

When someone is learning to program, a typical first attempt at a strlen implementation will be something like this:

Method 1: an iterative for-loop

size_t strlen(const char * str)
{
    size_t len;

    for (len = 0; str[len]; len++);

    return len;
}

This method is easy to understand, and read, but it’s not very good. It runs in time O(n) and involves a lot of jumps.

The next thing programmers usually realise is that they don’t have to use array indices, that pointers are enough.

Method 2: sacrifice a variable and readability

size_t strlen(const char * str)
{
    const char * ptr;

    for (ptr = str; *ptr; ++ptr);

    return ptr - str;
}

Depending on how clever the compiler is, this code may be slightly faster, because there won’t be so many additions to the base pointer. We increment the pointer only, and use the difference between it and the base-pointer as the length. This code is less readable though, and probably counts as premature optimisation.

This method is still O(n) and still involves a lot of jumps. Let’s see what we can do about that next.

Method 3: partially unroll the loop

size_t strlen(const char * str)
{
    const char * ptr = str;

    while (1)
    {
        if (!*(ptr++)) break;
        if (!*(ptr++)) break;
        if (!*(ptr++)) break;
        if (!*(ptr++)) break;
    }

    return (ptr - 1) - str;
}

Now we’re into serious unreadability territory and, by the way, the above is how djb implements strlen. But we have actually gained some efficiency: now, for every jump operation that the loop creates, we test for the zero value 4 times. The choice of 4 is arbitrary here; an interesting exercise would be to vary this number and test each possibility.

More tests will interfere with pipelines, but there will be fewer jumps; who knows where the sweet spot is. Still, though this implementation is faster, it’s still O(n). With plain C strings we’re doomed to an O(n) best case, but we’re not done yet … we can do something about the very size of n.
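As a sketch of that exercise, the unroll factor can be made a compile-time knob; the function and macro names here are mine, and the compiler is trusted to unroll the constant-bound inner loop:

```c
#include <stddef.h>

/* Hypothetical benchmark knob: compile with -DUNROLL=2, 4, 8, ...
 * and time each variant. With a constant bound, an optimising
 * compiler will typically unroll the inner loop completely. */
#ifndef UNROLL
#define UNROLL 4
#endif

size_t strlen_unrolled(const char *str)
{
    const char *ptr = str;

    for (;;) {
        int i;

        /* UNROLL zero-tests per backwards jump of the outer loop. */
        for (i = 0; i < UNROLL; i++)
            if (!*(ptr++))
                return (size_t)(ptr - 1 - str);
    }
}
```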

Method 4: word-wise checks

Just like the example I gave earlier, where we used integers to compare 4 bytes of a string all at once, we can do something even cleverer with strlen. We can construct a test that, with one small exception, determines whether any byte within a long is zero. This works for 2, 4 and 8 byte word-sizes.

It works by setting up a bit pattern like “01111110 11111110 11111110 11111111”, and then adding an arbitrary word to it. Anything except 0 in a byte causes a carry over into the neighbouring byte. So by performing the addition, and then testing the “hole” bits, we can tell that there was likely a zero. There is one case where we’ll get a false positive, which can happen when bit 31 is set, but that’s easy to check for.

It’s too long to reproduce here, but you can see the entire glibc implementation as a great example.
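Here is a simplified 32-bit sketch of the idea, though — not the glibc code itself, and the helper names are mine. Because the quick test can false-positive, candidate words are always re-checked byte by byte:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The "hole bits" test described above. Adding 0x7efefeff makes
 * every non-zero byte carry into a neighbouring hole bit; an
 * undisturbed hole means some byte was probably zero. The test can
 * false-positive when bit 31 is set, hence the byte re-check below. */
static int word_may_have_zero(uint32_t w)
{
    const uint32_t magic = 0x7efefeffU;

    return (((w + magic) ^ ~w) & ~magic) != 0;
}

size_t strlen_wordwise(const char *str)
{
    const char *p = str;

    /* Walk byte-wise until we reach a 4-byte boundary. */
    while (((uintptr_t) p & 3) != 0) {
        if (*p == 0)
            return (size_t)(p - str);
        p++;
    }

    for (;;) {
        uint32_t w;

        /* p is aligned; memcpy keeps the load portable. Like the
         * real implementations, this may read a few bytes past the
         * zero, but never past the aligned word that contains it. */
        memcpy(&w, p, sizeof w);
        if (word_may_have_zero(w)) {
            if (p[0] == 0) return (size_t)(p - str);
            if (p[1] == 0) return (size_t)(p + 1 - str);
            if (p[2] == 0) return (size_t)(p + 2 - str);
            if (p[3] == 0) return (size_t)(p + 3 - str);
            /* False positive (bit 31 was set): keep scanning. */
        }
        p += 4;
    }
}
```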

With this method we have divided our problem space: N is now N/4 or N/8, and the checks happen much more quickly. With really small strings, though, this method may actually be worse. The reason is that word-wise checks also have to be word-aligned, so if the string starts on a mis-aligned byte we may have to scan up to 3 characters byte-by-byte (for 32-bit words; up to 7 for 64-bit) before becoming word-aligned.

Additionally, when we do find a zero, we still have to perform up to 4 tests to figure out exactly which byte it was. If you had a lot of mis-aligned 7-byte strings, this method would be highly sub-optimal.

Method 5: Outsource the problem

In the long war between RISC and “kitchen sink” architectures, the latter appears to have won. x86 is king, and that means we usually have plenty of instructions at our disposal. Two such instructions are “repnz” (repeat while non-zero) and “scasb” (scan string byte), which combined allow us to instruct the CPU to go and find the next zero in a range of memory, using whatever tricks it likes.

The CPU can implement any of the approaches we’ve already talked about, but much more is possible. Modern memory, and SRAM in particular, has been designed to make parallelised searches possible. The chips can electrically probe many sequences of bits all at once, feeding into a tree of gates that makes sure the earliest match wins: O(log n), in hardware. This is highly efficient, and if you’re using strlen() on an x86 host right now, that’s probably what’s happening.
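For the curious, here is a hedged sketch of what handing the search to the CPU looks like, assuming x86-64 and GCC-style inline assembly (the function name is mine, and real library implementations are considerably more sophisticated):

```c
#include <stddef.h>

/* scasb compares the byte at [rdi] with al and advances rdi; the
 * repne prefix repeats it, decrementing rcx, until a match (our
 * zero byte) is found. rdi stops one byte past the zero. */
size_t strlen_scasb(const char *str)
{
    const char *end = str;
    size_t limit = (size_t) -1;  /* rcx: effectively "no limit" */

    __asm__ volatile (
        "repne scasb"
        : "+D" (end), "+c" (limit)  /* rdi scans, rcx counts down */
        : "a" (0)                   /* al holds the byte we seek */
        : "cc", "memory"
    );

    return (size_t)(end - str) - 1;
}
```

Worth noting: on today’s x86, SIMD-based strlen implementations generally outrun rep/scasb, so this is a demonstration of the instruction rather than a recommendation.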

But what else can we do? What if we make some trade-offs about the very nature of C strings?

Method 6: Add a cache

C is rightly criticised for how it handles strings: it’s pretty dumb, and C buffer overflows remain a major source of security problems. There really isn’t any excuse. Treating strings as arbitrary regions of zero-terminated memory has one main advantage: it allows strings to be separated in-place. One can take “Hello World” and make it “Hello” and “World” with the very simple insertion of a zero byte.

But strings are not separated a whole lot, and it tends to be a one-time operation. If we throw away this “feature”, there is a much better way to get a string’s length: just keep it recorded in a cache.

typedef struct {
    char * str;
    size_t len;
} string;

size_t strlen(const string * s)
{
    return s->len;
}

Of course, we needn’t modify the API here: instead of creating a new type, we could just keep a static index of pointer values (say, a hash map) and record the lengths there. In fact, we could even have a compiler help us, by marking the function with an attribute like GCC’s pure, so that repeated calls with the same argument can be folded away.

But we do need to enforce the API. Now everything that deals with strings has to update this cache, and we can’t permit the programmer to mess with the string’s internals. Our string functions become the only valid way in.

This is actually the norm. There are quite a few APIs for doing this in C, and it’s also what djb does in his code. Almost all modern languages take this approach, keeping an internal record of every string’s length. Finally, we have a real O(1) solution.
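A sketch of what such an API might look like; the names are illustrative rather than taken from any real library, and allocation-failure handling is omitted for brevity:

```c
#include <stdlib.h>
#include <string.h>

/* Every routine that touches the bytes also maintains len, so
 * taking the length is a field read: O(1). */
typedef struct {
    char  *str;
    size_t len;
} string;

string string_from(const char *src)
{
    string s;

    s.len = strlen(src);              /* measured once, at creation */
    s.str = malloc(s.len + 1);
    memcpy(s.str, src, s.len + 1);
    return s;
}

void string_append(string *dst, const string *src)
{
    dst->str = realloc(dst->str, dst->len + src->len + 1);
    memcpy(dst->str + dst->len, src->str, src->len + 1);
    dst->len += src->len;             /* the cache stays coherent */
}
```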

What next?

But it doesn’t end there. Every solution above had trade-offs, and there are plenty of other ways of doing it.

Some text processors, for example, are designed to accommodate lots of string concatenation, so strings are optimistically allocated much more memory than they need, with lots of zeroes at the end. For these strings, if we know how much memory is allocated, we can implement strlen as a word-wise binary search, which will be O(log n).
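A byte-wise sketch of that idea (the helper name is mine; it assumes the invariant that the string’s bytes are all non-zero and everything after them, up to the capacity, is zero):

```c
#include <stddef.h>

/* buf has cap bytes: a run of non-zero string bytes followed by
 * zero padding. The predicate buf[i] == 0 is therefore monotone,
 * so the first zero byte - the length - can be found by binary
 * search in O(log n) probes. */
size_t padded_strlen(const char *buf, size_t cap)
{
    size_t lo = 0, hi = cap;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (buf[mid] != 0)
            lo = mid + 1;   /* string continues past mid */
        else
            hi = mid;       /* first zero is at mid or before it */
    }
    return lo;              /* index of the first zero byte */
}
```

The word-wise version the post describes works the same way, just probing a word at a time with the zero-byte test from method 4.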

Some XML processors, for another example, know that a very high percentage of the time a closing tag will be exactly the same length as the name of the opening tag plus 3 (‘<’, ‘/’ and ‘>’), so they take that as a best guess, skip straight to that length, and see if they find what they expect.

Other times, optimisers perform statistical analysis on their inputs, find out where the peaks are, and optimise the tests to go for those paths first. There really is an endless series of possibilities, and this is for something as simple as getting the length of a string. Underneath every little problem is a world of opportunity and a wealth of material from very clever people. (The Varnish source code and its Architect Notes are also a great source of inspiration.)

A few years ago, at Joost, we came across exactly this problem: an application that was spending about 10% of its time inside strlen. Within an hour or so of rewriting some primitives, it was down to less than a hundredth of a percent. You can’t drag and drop that kind of improvement in an IDE.

20 Replies to "Optimising strlen()"

Justin Mason  on March 2, 2009

that glibc trick is nifty; I’d never seen it before. thanks!

Colm MacCarthaigh  on March 2, 2009

Yeah, it’s cool :-) Wikipedia has another interesting word-wise example on its Strlen page, but it’s actually less efficient on x86 in terms of clock-cycles.

Pádraig Brady  on March 2, 2009

Nice post Colm!

Here’s my optimal solution:

size_t strlen(const char* str) { return strchr(str,0)-str; }

Now just need to optimize strchr :p

colmmacc  on March 2, 2009

lol :-) Actually that reminds me of a completely different tack, tail recursion!

size_t strlen(const char * str)
{
    if (!*str)
        return 0;
    else
        return 1 + strlen(str + 1);
}

Who knows how that might do on a truly stackless non von-neumann machine.

Pádraig Brady  on March 2, 2009

wrt method 5, it’s worth noting that there are text processing instructions in SSE 4.2 in intel’s i7. For an example of using those for strlen see:
http://smallcode.weblogs.us/strcmp_and_strlen_using_sse_4.2
(there is a link at the bottom of that to more strlen goodness).
I think with a new enough gcc and the right arch selected, __builtin_strlen could use this? Here’s a chance to plug my gcc -march identifier:
http://www.pixelbeat.org/scripts/gcccpuopt

paulj  on March 2, 2009

Nice post..

Dumb question: How does the additive compare manage to avoid reading past the end of the string? (It seems to me method requires all the code involved to co-operate and align all strings via padding, but I must be missing something).

colmmacc  on March 2, 2009

lines 39 to 43 do the alignment necessary to perform the checks. It’s true that at the very end of string we might read memory that isn’t part of our string at all, but the conditionals from line 130 onwards are in-order, so the length will still be correct. Does that make sense?

Joe Drumgoole  on March 3, 2009

Loop unrolling I’d come across, but the bitwise comparison stuff and the statistical analysis are new to me. Great stuff!

paulj  on March 4, 2009

Yeah, the alignment ‘shove’ at the beginning is fine. I mean the end: it seems it can read up to 3 bytes past the end of the string, which doesn’t seem quite kosher. (It seems to me the method requires all the code involved to co-operate and align all strings via padding, but I must be missing something.)

Right, so is that safe? It doesn’t seem like it could cause any faults on any typical machine, as it’ll stay within the longword, but is it really kosher? (I’d be interested in the explanation).

paulj  on March 4, 2009

Gah.. perhaps you could allow the comment box to have scrollbars, to give submitters a cue they’ve got old crap above their reply. :)

Keith Gaughan  on March 4, 2009

“More tests will interfere with pipelines, but there will be less jumping.”

That’s pretty much dependent on the machine architecture, mind. For instance, processors, such as the ARM, with predicated instructions don’t have this problem, so it’s hard to squeeze more performance out of a strlen() implementation than this variant of method 3:

; Entry:
; r0 – String pointer
; Exit:
; r0 – Length of the string.
; r1, r2 corrupted.
MOV r1, r0
.loop LDR r2, [r0]
TST r2, #&000000FF
ADDNE r0, r0, #1
TSTNE r2, #&0000FF00
ADDNE r0, r0, #1
TSTNE r2, #&00FF0000
ADDNE r0, r0, #1
TSTNE r2, #&FF000000
ADDNE r0, r0, #1
BNE loop
RSB r0, r1, r0
MOV pc, lr

Mind though, I haven’t written any ARM assembly in some nine years or so, so caveats apply, and that obviously won’t work on a big-endian ARM, and it assumes the string is word-aligned.

josh  on March 4, 2009

Eh, my preferred optimization for strlen is to stop calling it so much. Other library functions tend to return a pointer to the beginning of the string, which you already have, rather than the end of the string, even though they typically already have that internally. If you don’t throw that data away, you don’t have to recalculate it.

baryluk  on March 4, 2009

@colmmacc, Actually it’s not tail recurisve.

Proper version:

size_t strlen2(const char * str, size_t a)
{
    if (!*str)
        return a;
    else
        return strlen2(str + 1, a + 1);
}

size_t strlen(const char * str)
{
    return strlen2(str, 0);
}

Justin  on March 5, 2009

Method 3 reminds me an awful lot of Duff’s device, which is even more fun
http://en.wikipedia.org/wiki/Duff%27s_device

colmmacc  on March 5, 2009

@paulj – both pages from mmap should be aligned along word boundaries, and memory on the stack, so it seems safe. If the zero is completely absent then we might get a protection fault, but that’s fair enough. I can’t see any other magic :/

@baryluk true.

j  on March 5, 2009

> You can’t drag and drop that kind of improvement in an IDE.

But you can get it for free if you use modern libraries or a modern language!

wolf550e  on March 5, 2009

@paulj: the error you get from reading past the end of a string (or any array) is a page fault. Reading past the array inside the page is completely “kosher” – the processor and the OS can’t help you if you read (or overwrite) a different variable than the one you intended. Only when you are trying to touch a page that’s not yours will you trap (You won’t get an error if the next page is also yours). Anyway, since the error can only happen on page boundary (that’s how the mmu works), and that is usually not less than 4096 bytes (sometimes much more!), while reading a properly null terminated string in word size chunks you will be completely ok. Barring off-by-one bugs, of course.

Leonid Volnitsky  on March 5, 2009

My somewhat similar travails in optimising a static-sized array:
http://volnitsky.com/project/lvvlib/array.html

Paul Jakma  on May 18, 2009

colmmacc/wolf: Yes, I know about pages and mmap(). They (or analogues) might be very common, but they’re not part of the C spec, and there definitely are machines without. Anyway, it doesn’t matter because that’s not what the code is generally relying on. It’s relying on the longword access to never cause a fault.

So is that assumption safe? Obviously it is if memory access faults are on longword-aligned, page-boundaries. But do machines exist with sub-longword fault or memory-mapping granularity? (E.g. could you have RAM mapped at the bottom of a longword and, say, an I/O port at the top?).

Basically, my question is, I get the impression this code relies on common machine-properties for correctness – not on any guarantees made by C. Am I correct in thinking that?

inmarket  on June 18, 2009

Actually – on some architectures method 1 may be the most efficient. The C compiler could automatically unroll the loop (like method 3). The compiler may produce better code from method 1 than method 2 if it supports a double base-indirect memory test instruction. The memory cache may pre-read a long word even on byte access, and may support multiple byte instructions simultaneously (a common feature on modern processors), thus giving you all the optimisations of the later methods without the code overheads. Remember, longer code also means less efficient instruction cache use.
The point of all this is that any optimisation is COMPLETELY processor dependent, and even supposedly obvious optimisations may in fact slow the routine down. The only way to tell is to profile, based on a thorough understanding of the processor and memory architecture.
