Friday, April 6, 2012

Addition the bit hack way

Disclaimer:  I would ask that none of this code actually be used in software; it doesn't offer anything, if your CPU has an ADD instruction, use it.  Otherwise this is just a teaching aid.

Addition is a basic mathematical operation which can be emulated with some bitwise operations.  Lets look at a canonical implementation:

uint8_t add(uint8_t x, uint8_t y) {
  uint8_t a;
  uint8_t b;

  do {
    a = x & y;
    b = x ^ y;
    x = a << 1;
    y = b;
  } while (a);

  return b;

To understand what is going on lets follow out with a basic example:
Let in:x = 00000011 (3)
Let in:y = 00000100 (4)

When we invoke this add function the steps taking place are:
  • AND   -- a = (x&y) 
  • XOR   -- b = (x^y)
  • LSHFT -- x = (a<<1)
Resulting in the following:
  • x = 00000011
  • y = 00000100
  • a = 00000000
  • b = 00000100
  • x = 00000000
  • y = 00000111  -- XOR a:b = 7
This canonical implementation works, but it's slow.  We can make it slightly faster, to see why, lets take a look at the assembly that code translates to (note: all code is x86_64 non-optimized)
  pushq %rbp
  movq  %rsp,   %rbp
  movl  %edi,   %edx
  movl  %esi,   %eax
  movb  %dl -20(%rbp) ; y
  movb  %al -24(%rbp) ; x
    movzbl -24(%rbp),   %eax 
    movzbl -20(%rbp),   %edx
    andl       %edx,    %eax
    movb       %al,  -1(%rbp)
    movzbl -24(%rbp),   %eax
    movzbl -20(%rbp),   %edx
    xorl       %edx,    %eax
    movb       %al,  -2(%rbp)
    movzbl  -1(%rbp),   %eax
    addl       %eax,    %eax
    movb       %al, -20(%rbp)
    movzbl  -2(%rbp),   %eax
    movb       %al, -24(%rbp)
    cmpb       $0x0, -1(%rbp) 
    jne .loop
  movzbl -2(%rbp), %eax
  popq             %rbp

There is 9+(15*iterations) worth instructions here.  Assuming the best case senerio, that is 24 instructions.  We can refine our implementation to be slightly faster.
void add(uint8_t *x, uint8_t *y) {
  do {
    *y = (*x^*y);
    *x = (*x&(*x^*y)) << 1;

  } while (*x);

In this version we've removed the return, and replaced our arguments with pointers to our variables, storing the result of the addition in *y, and using no temporaries!.  This is (best case) 14! instruction less:
  movzbl (%rsi), %eax
    movzbl (%rdi),      %ecx
    movl    %eax,       %edx
    andl    %ecx,       %edx
    xorl    %ecx,       %eax
    leal   (%rdx,%rdx), %ecx
    testb   %dl,        %dl
    movb    %cl      (%rdi)
    movb    %al      (%rsi)
    jne .loop

Of course since this function is small we can most likely make it inline, (which could allow for better optimization).  I'd personally perfer a macro before the inline keyword.

#define add(x,y) do{*y=(*x^*y),*x=(*x&(*x^*y))<<1;} while(*x)

That concludes addition by bit hacking. I've played around with many other methods, but this is 3+(9*iterations) worth instructions, or best case (11 instructions). I challenge anyone to beat this for best case.  Must be C code (no assembly) .. post solution in the comments.  Hint: abuse stack.

No comments:

Post a Comment