Thursday, June 12, 2008

Lately I've been working on a new hash table implementation, combining ideas from several papers, and benchmarking every little idiom. As impressive as modern compiler optimization has become, it's still interesting to find amazing speedups you can achieve by avoiding conditionals. For example, I have a Bucket class that holds up to four items. Now I could store the count of items in the bucket, or I could count them on demand, an issue of speed versus memory. Should I choose to count them, then how I choose to count them becomes important. The naive way might look like:


size_t size(char* keys)
{
int i;
int count = 0;
for (i = 0; i < 4; ++i)
if (keys[i] > 0)
count++;
return count;
}


I won't bore you with the body of main that I used to test this with, but basically calling this MAX_INT times, (and doing something with the return so the compiler won't optimize it all away), I get these times on a 1.8 mhz core duo:


real 0m12.796s
user 0m12.785s
sys 0m0.000s


So lets look at an unconventional way to compute size:

int size(char* keys)
{
return (keys[0] != 0) +
(keys[1] != 0) +
(keys[2] != 0) +
(keys[3] != 0);
}

This version, in the same loop as the previous example, turns in the following times:

real 0m0.004s
user 0m0.004s
sys 0m0.000s

Oh, yeah. Considering this function can get called on every insertion in my modified Cuckoo Hash, this is a big improvement. Obviously the new hand unrolled version isn't as flexible wrt bucket size, so let's see how a hybrid approach might perform:

int size(char* keys, size_t bsize)
{
unsigned int count = 0;
int i;

for (i = 0; i < bsize; ++i)
count += (keys[i] != 0);
return count;
}

Okay this still performs better than the original:

real 0m2.200s
user 0m2.196s
sys 0m0.004s

Not quite as good, but still better if you need the flexibility of specifying array size at runtime. I'm sure there's a way to describe a recursive C++ template to get the benefit of the unrolled version if you are content to specify array size at compile time.

Of course, the unrolled version could also use SIMD instructions for an even greater speedup.

2 comments:

Tomy said...

Sorry I haven't responded sooner. Sounds like you're doing the same research as me. I'll add Javascript to your list. I want an object model with LLVM as the foundation, with the ability to implement Javascript, Smalltalk, Ruby, and why not through in Self and Objective-C.

I sometimes hang out on irc in the rubinius channel. I'm on Google Talk as Tomy, or my email i nospam-hudson at speakeasy dot net.

Tomy said...

Doh, my G-Talk name is seattletomy.