Friday, April 6, 2012

Addition the bit hack way

Disclaimer:  I would ask that none of this code actually be used in real software; it doesn't offer anything. If your CPU has an ADD instruction, use it. This is purely a teaching aid.

Addition is a basic mathematical operation which can be emulated with a few bitwise operations.  Let's look at a canonical implementation:

uint8_t add(uint8_t x, uint8_t y) {
  uint8_t a;
  uint8_t b;

  do {
    a = x & y;  /* carry bits                   */
    b = x ^ y;  /* sum without the carries      */
    x = a << 1; /* carries move up one position */
    y = b;
  } while (a);

  return b;
}

To understand what is going on, let's work through a basic example:
Let in:x = 00000011 (3)
Let in:y = 00000100 (4)

When we invoke this add function the steps taking place are:
  • AND   -- a = (x&y) 
  • XOR   -- b = (x^y)
  • LSHFT -- x = (a<<1)
Resulting in the following:
  • x = 00000011  (input)
  • y = 00000100  (input)
  • a = 00000000  (x AND y -- no carry bits)
  • b = 00000111  (x XOR y)
  • x = 00000000  (a << 1)
  • y = 00000111  (b -- the sum, 7)
Since a is zero the loop exits and b (7) is returned.
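To convince yourself the loop really does converge for every input, a quick throwaway harness (the main below is just an illustration, not part of the post's code) can compare it against the CPU's own ADD over all 8-bit pairs:

#include <stdint.h>
#include <stdio.h>

/* the same canonical add as above */
static uint8_t add(uint8_t x, uint8_t y) {
  uint8_t a, b;
  do {
    a = x & y;
    b = x ^ y;
    x = a << 1;
    y = b;
  } while (a);
  return b;
}

int main(void) {
  unsigned i, j;
  for (i = 0; i < 256; i++)
    for (j = 0; j < 256; j++)
      if (add((uint8_t)i, (uint8_t)j) != (uint8_t)(i + j))
        printf("mismatch at %u + %u\n", i, j);
  printf("%u\n", add(3, 4)); /* the worked example: prints 7 */
  return 0;
}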
This canonical implementation works, but it's slow.  We can make it slightly faster; to see why, let's take a look at the assembly this code translates to (note: all assembly here is non-optimized x86_64):
add:
  pushq %rbp
  movq  %rsp,   %rbp
  movl  %edi,   %edx
  movl  %esi,   %eax
  movb  %dl, -20(%rbp) ; x
  movb  %al, -24(%rbp) ; y
  .loop:
    movzbl -24(%rbp),   %eax 
    movzbl -20(%rbp),   %edx
    andl       %edx,    %eax
    movb       %al,  -1(%rbp)
    movzbl -24(%rbp),   %eax
    movzbl -20(%rbp),   %edx
    xorl       %edx,    %eax
    movb       %al,  -2(%rbp)
    movzbl  -1(%rbp),   %eax
    addl       %eax,    %eax
    movb       %al, -20(%rbp)
    movzbl  -2(%rbp),   %eax
    movb       %al, -24(%rbp)
    cmpb       $0x0, -1(%rbp) 
    jne .loop
  movzbl -2(%rbp), %eax
  popq             %rbp
  ret

That is 9+(15*iterations) instructions.  Assuming the best-case scenario of a single iteration, that is 24 instructions.  We can refine our implementation to be slightly faster:
void add(uint8_t *x, uint8_t *y) {
  do {
    *y = (*x^*y);           /* *y becomes the carry-less sum           */
    *x = (*x&(*x^*y)) << 1; /* *x^*y now recovers the old *y, so this
                               is the old carry (*x & *y), shifted up  */
  } while (*x);
}

In this version we've removed the return and replaced our arguments with pointers, storing the result of the addition in *y and using no temporaries.  Best case, that's 13 fewer instructions:
add:
  movzbl (%rsi), %eax
  .loop:
    movzbl (%rdi),      %ecx
    movl    %eax,       %edx
    andl    %ecx,       %edx
    xorl    %ecx,       %eax
    leal   (%rdx,%rdx), %ecx
    testb   %dl,        %dl
    movb    %cl,     (%rdi)
    movb    %al,     (%rsi)
    jne .loop
  rep
  ret

Of course, since this function is small we can most likely make it inline (which could allow for better optimization).  I'd personally prefer a macro over the inline keyword.

#define add(x,y) do{*y=(*x^*y),*x=(*x&(*x^*y))<<1;} while(*x)
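Either form is used the same way: pass the addresses of two bytes and the sum ends up in the second one.  A minimal usage sketch (variable names are only illustrative), with the usual caveat that the macro evaluates its arguments more than once, so don't hand it expressions with side effects:

#include <stdint.h>
#include <stdio.h>

#define add(x,y) do{*y=(*x^*y),*x=(*x&(*x^*y))<<1;} while(*x)

int main(void) {
  uint8_t x = 3, y = 4;
  add(&x, &y);       /* the sum accumulates into y                     */
  printf("%u\n", y); /* prints 7; note x is clobbered to 0 by the loop */
  return 0;
}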

That concludes addition by bit hacking. I've played around with many other methods, but this is 2+(9*iterations) instructions, or 11 instructions in the best case. I challenge anyone to beat that best case.  It must be C code (no assembly) -- post your solution in the comments.  Hint: abuse the stack.

Thursday, April 5, 2012

Checking if divisible by 10 or 5 the bitwise way.

My problem: check if something was divisible by 10.  The rules were that I couldn't use the modulus, division or multiplication operators; the check had to be a single expression, with no branches or loop constructs; pointer arithmetic wasn't allowed, and neither were floating-point data types.  The code had to work on an x86 CPU and had to be ANSI C.

After spending a few hours I've come up with the following solution:
((x<<0x1F)+(x<<0x1E)+(x<<0x1B)+(x<<0x1A)+
 (x<<0x17)+(x<<0x16)+(x<<0x13)+(x<<0x12)+
 (x<<0x0F)+(x<<0x0E)+(x<<0x0B)+(x<<0x0A)+
 (x<<0x07)+(x<<0x06)+(x<<0x03)+(x<<0x02)+x<0x33333334)&&!(x&1)
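One caveat worth spelling out: x has to be an unsigned 32-bit value for this to behave (a left shift by 31 on a signed int is undefined, and the sum needs to wrap modulo 2^32).  A minimal wrapper -- the function name is just for illustration -- might look like:

#include <stdint.h>

/* Divisible-by-10 test using only shifts, adds and a compare.
 * The shifted adds compute x * 0xCCCCCCCD modulo 2^32. */
static int divisible_by_10(uint32_t x) {
  return ((x<<0x1F)+(x<<0x1E)+(x<<0x1B)+(x<<0x1A)+
          (x<<0x17)+(x<<0x16)+(x<<0x13)+(x<<0x12)+
          (x<<0x0F)+(x<<0x0E)+(x<<0x0B)+(x<<0x0A)+
          (x<<0x07)+(x<<0x06)+(x<<0x03)+(x<<0x02)+x < 0x33333334u)
         && !(x & 1);
}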

Now, this isn't exactly a fast solution, especially on a P4, which lacks a barrel shifter; it did, however, lead me to a faster solution.  The rules said I couldn't use the multiplication operator, but after removing those shifted adds and replacing them with a multiply by the constant 0xCCCCCCCD, I had the following:

(x*0xCCCCCCCD) < 0x33333334 &&!(x&1)

Much to my surprise this is faster than using the modulus operator (x%10), as well as the division operator.  A MUL is actually the fastest way x86 has to divide by NPOT (non-power-of-two) constants.  The test can be expressed as:
(x * 5^-1 mod 2^32) < 0x33333334

Explaining how this works is quite simple: the constant 0xCCCCCCCD is the inverse of 5 modulo 2^32.  If x is a multiple of 5, say x = 5k, then x * 0xCCCCCCCD mod 2^32 is just k, which is at most 0x33333333 (floor((2^32-1)/5)); if x is not a multiple of 5 the product lands above that range.  For example, 35 * 0xCCCCCCCD mod 2^32 = 7, while 36 * 0xCCCCCCCD mod 2^32 = 0xCCCCCCD4.  Of course I couldn't stop here.  I took it a bit further and came up with the following code:

!((0x19999999 * x + (x >> 1) + (x >> 3)) >> 28) [
"\x0\x1\x2\x2\x3\x3\x4\x5\x5\x6\x7\x7\x8\x8\x9\x0"
]

This is a lot more complicated to prove, mainly because it's simply odd code.  To get the most obvious oddity out of the way: C allows both array[idx] and idx[array], because both decay to *(array+idx) and *(idx+array), which are equivalent.  Now for the explanation:

The table here contains the following: 0,1,2,2,3,3,4,5,5,6,7,7,8,8,9,0.
The computed index maps the multiples of 5 to either 0 or 8.  It is better expressed as (8/5 * x) % 16 (the error term is never too big) -- the real equivalence being:
((8/5 + 0x0.00000002/5) * x - delta) % 16 for delta in [0, 0.625/2^28]
This one uses a lot more operations to do the same thing -- 3 shifts, 2 adds, and one MUL -- but it's still a lot faster on a Pentium D than the former.
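Wrapped up as a function (the wrapper and its name are only illustrative; the expression itself is unchanged), and keeping in mind that in C the subscript binds before the !, so the string is indexed first and the looked-up byte is then negated:

#include <stdint.h>

static int divisible_by_10_lut(uint32_t x) {
  /* the index is roughly (8/5 * x) % 16; multiples of 10 land on a zero
   * entry and everything else on a nonzero one, so !table[index] is the
   * answer */
  return !((0x19999999u * x + (x >> 1) + (x >> 3)) >> 28)[
    "\x0\x1\x2\x2\x3\x3\x4\x5\x5\x6\x7\x7\x8\x8\x9\x0"
  ];
}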

Of course I took it two more steps further.  The rules were that it had to be ANSI C, but I sort of broke that bit by dropping down to some x86 assembly (which is where the best solutions came from).


MUL x, 0xCCCCCCCD, x
ROR x, 1
CMP x, 0x19999999
JBE hit_table ; same table {0,1,2,2,3,3,4,5,5,6,7,7,8,8,9,0}

To the CPU this is the same as the one above, just expressed in a simpler fashion.  Essentially the observation is:

x * 0xCCCCCCCD is even exactly when x is even (because 0xCCCCCCCD is odd), so a right rotation by one -- moving the last bit up to the first -- followed by a compare against 0x19999999 gives you the same answer.  That is, if you have ROR.
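In C the same idea looks roughly like the sketch below; the function name and the explicit rotate-by-one idiom are only illustrative, and a decent compiler will turn that shift pair into a single ROR:

#include <stdint.h>

static int divisible_by_10_ror(uint32_t x) {
  uint32_t p = x * 0xCCCCCCCDu;        /* multiply by the inverse of 5 mod 2^32 */
  uint32_t r = (p >> 1) | (p << 31);   /* rotate right by one (ROR)             */
  return r <= 0x19999999u;             /* 0x19999999 == floor(0xFFFFFFFF / 10)  */
}

The separate evenness check disappears because the rotated-out low bit lands in the top bit, where the unsigned compare against 0x19999999 rejects any odd x.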

So I rewrote it another way for ROR-less systems.  This method does the same as the first, and I find it a lot nicer -- too many MOVs, though.  The real power comes from the x86 LEA instruction.

MOV %edx, 0x33333334
MOV %ebx, %eax
MUL %edx
AND %edx, x
LEA %edx, [%edx+%edx*4]

The reason this works is that the x86 MUL instruction puts its double-width result in two registers (EDX:EAX); the sad part is you cannot express this directly in ANSI C.
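With a 64-bit intermediate -- which strict ANSI C doesn't give you, hence the remark -- the double-width idea can be sketched like this (the function name and the final compare are my own reconstruction of the idea, not the post's code):

#include <stdint.h>

static int divisible_by_10_mulhi(uint32_t x) {
  /* high 32 bits of x * 0x33333334, i.e. roughly x / 5 */
  uint32_t q = (uint32_t)(((uint64_t)x * 0x33333334u) >> 32);
  /* LEA-style multiply by 5 (q + q*4), then compare with the original */
  return (q + q * 4) == x && !(x & 1);
}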

But in the end the winner for performance ended up being:
(x * 0xCCCCCCCD) < 0x33333334 && !(x&1)
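And a quick brute-force way to convince yourself the winner agrees with the obvious version (something to run once and throw away):

#include <stdint.h>
#include <stdio.h>

int main(void) {
  uint32_t x = 0;
  do {
    int fast = (x * 0xCCCCCCCDu) < 0x33333334u && !(x & 1);
    int slow = (x % 10u) == 0;
    if (fast != slow) { printf("mismatch at %u\n", x); return 1; }
  } while (x++ != 0xFFFFFFFFu);  /* covers every 32-bit value */
  printf("ok\n");
  return 0;
}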