Wednesday, 3 January 2007

Tight loop optimization in C++

Duff's Device is a fairly well known technique used for optimizing small loops. It uses switch case statement fall through that [when compiled] mimicks loop unrolling but is generally much faster and cleaner to use than standard loop unrolling. This is a breif article on the usage and speed benefits of using Duff's Device for optimizing tight loops.

I first used Duff's Device whilst developing a cross platform graphics library from which the examples given are drawn from. They may not be perfect but definately indicate correct usage of the device.

Firstly here is the C++ implementation of Duffs Device, in the form of a #define. This #define takes care of all the operation of the device, all you need to do it use the right parameters with it.

#define DUFF_DEVICE_8(aCount, aAction) \
{ \
int count_ = (aCount); \
int times_ = (count_ + 7) >> 3; \
switch (count_ & 7){ \
case 0: do { aAction; \
case 7: aAction; \
case 6: aAction; \
case 5: aAction; \
case 4: aAction; \
case 3: aAction; \
case 2: aAction; \
case 1: aAction; \
} while (--times_ > 0); \
} \
}

Before I go on to discuss the device itself there are a few hidden optimizations in the code [which you could take not of and employ elsewhere in your code].

  • Firstly you will notice that a shift has been done rather than a division, shifts are much faster than multiplications and divisions when dealing with powers of 2, so where possible you should use shifts.
  • Secondly the while condition contains --times_ rather than times_--. Pre-decrement [and increment] is faster than post decrement / increment. This is because post operations require a copy of the variable to be made. Obviously this overhead is minimal, but if speed is of the essense small savings count.
  • Also in the while condition you can see the comparison is against zero [counts down] rather than against aCount [counts up]. This is beacause a comparison against 0 is much faster than any other number.
N.b. many compilers will replace perform these optimisations on compile time so they are not always needed.

So how can you use Duff's Device in your code?

You simply use it like a normal define, taking care to finish each line with a semi colon.
void Bitmap::Blank(COLORREF colour)
{
unsigned long *blitPtr = m_BitmapData;
unsigned long *blitEnd = blitPtr + (m_Width*m_Height);
unsigned int index=0;

int numPixels = static_cast<int>(blitEnd-blitPtr);
if (numPixels > 0)
{
DUFF_DEVICE_8(numPixels, blitPtr[index++] = colour;);
}
}


This code is designed to simly fill my bitmap with a certain colour defines by COLORREF [simply a DWORD aka long which is 4 bytes on most machines].
The first operation I do is to get the pointer to the start of my array [i.e. the start of the bitmap] and then get the end of my array using some pointer arithmetic.

Onto the usage of the Duff Device itself; as you can see numPixels is passed as the count parameter and "blitPtr[index++] = colour;" is passed as the code to run in each iteration. Notice the trailing ; inside the statement which is required as it is text replacement and not a function.

Quite simply it iterates through each pixel [long] in our m_BitmapData array and assigns the colour to it.

In tests on my machine Duff's Device, in this instance, is faster than a memcpy. This is not to say that it will always be faster, but in this instance it was. The point is that no matter how efficient a certain algorithm or piece of code is, the generic algorithms / hardware accelerated variations will usually be faster for most instances.

So if you do have a very tight loop that requires speeding up then why not look to Duff's Device to see if it could improve the efficiency of the loop.

It is worth pointing out that more obvious optimisations should be made before looking to a novel solution such as this. i.e. remove allocations or any code that doesn't need to be in the loop - simple!

3 comments:

Christopher said...

Nice article. The other optimization I'd like to point out is that Duff's Device uses a do-while loop (which requires one branch at the end) as opposed to a plain while loop (which requires branching at both the beginning and end). Reduced branching is indeed the primary benefit of Duff's Device, which is basically a loop unrolling technique.

Yesha said...

> DUFF_DEVICE_8(numPixels,
> blitPtr[index++] = colour;);

If you are that extremlly performance sensetive so you use loop unrolling you propobly not want to access your blitPtr with an index. Calculataing the address with the index involves a multiplication (maybe optimized to a shift?) and will propobly take more time than you save by skipping those branch tests in the loop. You propobly want to do something like:
blitPtr++ = colour;

shantanu said...

Nice article..Here is another one for loop unrolling:
http://www.safercode.com/blog/2009/01/27/tweak-your-code-for-speed-unroll-those-loops-part-1.html