Tuesday, July 31, 2012

Accelerate your SIMD knowledge to accelerate your code: Tricks to blaze through SSE code

Often when dealing with SSE code you find yourself referring to the documentation to understand what an instruction does, and it can impede on your time.  This is a part of a mini series I'm writing "Accelerate your SIMD knowledge to accelerate your code".  Lets begin shall we? 

1) Eat those instruction mnemonics
Learn the instruction mnemonic rules, doing this will greatly improve your speed in hand writing SSE code.
SSE Has five categories of instructions:
  • Data movement instructions
  • Arithmetic instructions
  • Comparison instructions
  • Logical instructions
  • Shuffle instruction 
All data movement instructions start with MOV, and have either [U/A/H/L] after MOV. These four letters describe something.
 
U = unaligned
A = aligned
H = high (64 bits high)
L = low (64 bits low)

There is combination HL (high to low) and combination, LH (low to high) which moves SRC[H/L] -> DST[H/L]; finally PS is postfixed after the choice of instruction. Knowing this you can easily think of something to yourself: "I need to move the upper 64bit of this register to the lower 64bits of this other register"; which is easy as MOV (move) H(high) L(low) PS, or [MOVHLPS].  If you need an unalligned move for 128 bits, we know unalligned is U so MOVUPS is the instruction we need.

2) Order your memory, and fries with that.
SSE greatly benefits from predictable memory patterns, it prefers linear over arbitrary data.  Iterating over a linked list of vectors to normalize is faster than incrementing a float pointer for every four elements of the vector and performing the normalize on the next four elements.

Aliasing memory strangely stalls the pipeline: i.e converting back from float to _m128 etc are wasteful operations.  However aliasing memory in respect to SSE is fine.  __mm_set_ps for example composes a _m128 from four floats, aliasing an aligned structure of four elements , or an array to _m128 would work the same.
Example: 

/* allowed */
float x = 1,y = 2,z = 3, w = 0;
typedef struct { float x,y,z,w; } vec3 __attribute__((aligned(16)));
vec3 v3 = {x,y,z};
_m128 p = *(m128*)&v3;
/* same as this */
float x = 1,y = 2,z = 3, w = 0;
_m128 p = _mm_set_ps(x,y,z,w);
Aliasing memory like this allows for flexibility.

3) Nuke the runtime selection
Testing CPUID at runtime to determine if SSE is present is a common trick programs do for CPU compatibility.  These are often fine for selecting the correct optimized routines, but many people get it wrong.
Consider the following example:

void (*add)(const vec3*, const vec3*) = NULL;
int main() {
  add = (HasSSE()) ? &sse_add : &scalar_add;
}

This code is terribly wrong, because the function pointer calls are indirect calls, SSE doesn't like indirect calls, it instantly prevents any sort of pipelinability, and could very well be slower than the scalar version.
The right way to doing this would involve two separate translation units for each version of the function, and simply calling the correct version at runtime.

4) Punch a hole in your type system (C++ only)
When working with C++ and SSE intrinsics you often see yourself using lots of explicit casts, like reinterpret_cast, const_cast.  It's no wonder this is an issue especially with things like _mm_malloc, which allocates an safely aligned chunk of memory for SSE operations.  The problem is it returns void*, and C++ doesn't allow void* to be implicitly convertible to any pointer type.  To work around this I've often found myself punching a hole in C++'s type system via a simple structure.

class voidptr_t {
  void *ptr;
public: voidptr_t (void *pptr) : ptr (pptr) {}
   template<typename T>
  operator T *() { return (T*) ptr; }
};

As such when using _mm_malloc the void * will implicitly be converted to a voidptr_t using voidptr_t::voidptr_t(void *), and then the voidptr_t will implicitly convert itself to whatever pointer type you are trying to assign to.

5) Forget about prefetching
Unless you're writing code for CPU's before AMD anthlon XP or Intel Pentium 4, you're
wasting your time. Now virtually every CPU includes automatic cache prefetching. Which is costless and more accurate.