
Microsoft makes a stupid inline assembler! (vent)

Ken g6

Programming Moderator, Elite Member
OK, so I have a need to count the number of trailing zeros in a 64-bit number really fast. In a 64-bit x86 OS I can use the bsf instruction. Heck, it maps directly to the C __builtin_ctzll() function. But in 32 bits it doesn't. This particular application often has few zeros at the end, so the 32-bit version of __builtin_ctzll() isn't as fast as the following GCC assembly:

Code:
asm volatile(
"    bsfl	%[k0l], %[pos]	\n"
"    jnz	bsflok		\n"
"    bsfl	%[k0h], %[pos]	\n"
"    addl	$32, %[pos]	\n"
"bsflok:"
: [pos] "=r" (pos)
: [k0l] "rm" ((unsigned int)(n)),
[k0h] "rm" ((unsigned int)((n) >> 32))
: "cc" );

OK, so that works great, but now I need to port it to Visual C++. First, after reversing the operands for Intel syntax, I try sticking the contents of each of those operands inline in the assembly. I get syntax errors. I try casting to a struct of two 32-bit unsigned integers instead. I get syntax errors. Finally, after a couple of hours, I turn up the following on this page:
Note

Microsoft and Borland inline assemblers do not support type casts.

ARGH! D:

OK, fine. So I make an inline function, where n gets passed in as a const, I copy it to two const unsigned ints, nlo and nhi, add a non-const "pos", and stick those in the inline assembly. It says something about "Improper operand type". 😡

I'm gathering the Microsoft assembler isn't smart enough to realize it should pick a register, any free register, and just use that for pos.

I guess if anyone has any bright ideas they can post them; but I'm just going back to __builtin_ctzll().
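Two small hazards lurk in the GCC snippet above, for anyone copying it: the named label bsflok can collide if the compiler emits the block more than once (e.g. after inlining), and the plain "=r" output allows pos to share a register with the high-half input, which the first bsfl would clobber before the second one reads it. A hedged sketch with a GCC local label and an earlyclobber output (function and operand names are illustrative, not from the thread; n is assumed nonzero, as in the thread):

```cpp
#include <cstdint>

// Same two-bsf trick, sketched with two fixes: a local label ("1:") so the
// block can safely be emitted more than once after inlining, and an
// earlyclobber output ("=&r") so pos cannot share a register with the
// high-half input. n is assumed nonzero, as elsewhere in this thread.
static inline unsigned ctz64_asm(std::uint64_t n) {
    unsigned pos;
    __asm__ volatile(
        "bsfl %[lo], %[pos]\n\t"
        "jnz  1f\n\t"
        "bsfl %[hi], %[pos]\n\t"
        "addl $32, %[pos]\n\t"
        "1:"
        : [pos] "=&r" (pos)
        : [lo] "rm" (static_cast<std::uint32_t>(n)),
          [hi] "rm" (static_cast<std::uint32_t>(n >> 32))
        : "cc");
    return pos;
}
```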
 
Ehh, you just have to do things a bit more traditionally.

This should do what you want.

Code:
    __int64 iValue = 0x5000000000000000;
    int pos;
    _asm {
        xor eax, eax
        lea edx, iValue
        bsf eax, dword ptr [edx]
        jnz bsflok
        bsf eax, dword ptr [edx+4]
        add eax, 32
bsflok:
        mov pos, eax
    }

You can use any registers you want without saving them; however, using ebx, esi, or edi will cause the compiler to emit extra code to preserve them. The docs say you should preserve other state, like the flags, but I don't. I can't imagine a situation where the flags need preserving across an inline assembly block.

Edit: BTW, you should clear your register before doing the operation. If the operand being tested is zero, bsf leaves the destination officially undefined (in practice eax is simply unchanged), and you could end up with strange values.
 
I guess if anyone has any bright ideas they can post them; but I'm just going back to __builtin_ctzll().

I think this is my bright idea. D:

My only other (non-bright) idea is this:
Code:
inline
uint32 CountTheDamnBits( uint64 x ) {
#ifdef __64_BITS
  return __builtin_ctzll(x);
#else
  uint32 lower_count = __builtin_ctzl( static_cast<uint32>(x) ); // note ctzl not ctzll
  uint32 upper_count = __builtin_ctzl( static_cast<uint32>(x>>32) );
  return (lower_count >= 32) ? (lower_count+upper_count) : lower_count;
#endif
}
... and hope that the peephole optimizer does something reasonable. I also assume that __builtin_ctzl actually exists... I dunno. Damn MS product.
 
I've been learning x86 assembly for my compilers class, and I must say I always manage to confuse myself when translating between Intel and ATT syntax... especially with non-commutative operations.
 
🙂 He he, I would have said the same thing about the GCC inline syntax. I think the __asm() function is extremely ugly. Microsoft's asm { } thingy looks much prettier to me (that, and I love the Intel syntax over the AT&T syntax).

Though, MS does do a lot of things that I don't so much like about their assembly stuff. For example, they insert TONS of code around their assembly to make sure you don't screw up the system. GCC is much more willing to say "You want that, I'll do it, so I hope you know what you are doing". That said, things like passing labels and variables into the asm with GCC are freaking annoying. No named variables? MS may not cast, but at least they allow the use of named variables.

I guess I will say that the GCC inline assembly is much more feature-rich when it comes to what you can command the compiler to output. Microsoft's inline assembly doesn't really offer any guarantees other than that it will try to put out what you want.
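For what it's worth, GCC's extended asm does have named operands (the %[name] syntax in Ken's snippet), which takes some of the sting out of the operand lists. A minimal illustrative sketch, not from the thread:

```cpp
// Minimal GCC extended-asm sketch showing named operands: copy src to dst.
// [out] and [in] are arbitrary labels local to this one asm statement.
static inline int copy_via_asm(int src) {
    int dst;
    __asm__("movl %[in], %[out]"
            : [out] "=r" (dst)
            : [in]  "r" (src));
    return dst;
}
```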
 
I think this is my bright idea. D:

My only other (non-bright) idea is this:
[snip]
... and hope that the peephole optimizer does something reasonable. I also assume that __builtin_ctzl actually exists... I dunno. Damn MS product.

I don't think that will even work. The 64-bit value is guaranteed to be non-zero, which is why I don't have to do what Schmide said. But the lower 32 bits can be zero, and I believe the result of __builtin_ctzl is undefined when its argument is zero. :\

I could do:
Code:
inline
uint32 CountTheDamnBits( uint64 x ) {
#ifdef __64_BITS
  return __builtin_ctzll(x);
#else
  if(static_cast<uint32>(x) != 0) {
    return __builtin_ctzl( static_cast<uint32>(x) ); // note ctzl not ctzll
  } else {
    return __builtin_ctzl( static_cast<uint32>(x>>32) ) + 32;
  }
#endif
}

But the trick with my code is that bsf sets a zero flag when its argument is zero (even though its result is also undefined), so I could skip the separate test of the lower bits.
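The same skip-the-separate-test structure is available without asm on MSVC, because _BitScanForward's return value plays the role of bsf's ZF: it is zero exactly when the source was zero. A hedged sketch (the fallback branch uses the GCC builtins for comparison; the function name is mine, and x is assumed nonzero as in the thread):

```cpp
#include <cstdint>

#if defined(_MSC_VER)
#include <intrin.h>
// _BitScanForward returns 0 when its source is zero, mirroring bsf's ZF,
// so the low half needs no separate == 0 check.
static inline unsigned ctz64_split(std::uint64_t x) {
    unsigned long pos;
    if (_BitScanForward(&pos, static_cast<unsigned long>(x & 0xFFFFFFFFu)))
        return static_cast<unsigned>(pos);
    _BitScanForward(&pos, static_cast<unsigned long>(x >> 32));
    return static_cast<unsigned>(pos) + 32u;
}
#else
// GCC/Clang equivalent; __builtin_ctz is undefined for 0, so the low
// half is tested explicitly before falling through to the high half.
static inline unsigned ctz64_split(std::uint64_t x) {
    std::uint32_t lo = static_cast<std::uint32_t>(x);
    return lo ? static_cast<unsigned>(__builtin_ctz(lo))
              : 32u + static_cast<unsigned>(__builtin_ctz(
                          static_cast<std::uint32_t>(x >> 32)));
}
#endif
```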

Cogman: It's like the difference between C's printf and C++'s cout. Sure, cout looks nice, until you need to format a float in a specific way.
 
I know all the compiler optimization huggers are going to flame me for this, but I like to skip the static casts and use regular casts and dereferencing to achieve the same result. It basically forces the compiler, even in debug builds, to optimize the addressing and produce faster code. Downside: it breaks endian portability, but there are tricks if you wish to make it cross-compilation friendly.

VC code using static casts
Code:
    __int64 iValue=0x5000000000000000;
    int pos;
    if(static_cast<__int32>(iValue) != 0) {
        _BitScanForward( (unsigned long *)&pos, static_cast<__int32>(iValue) ); 
    } else {
        _BitScanForward( (unsigned long *)&pos, static_cast<__int32>(iValue>>32) ); 
        pos +=32;
    }

Dereferenced addressing version.
Code:
    __int64 iValue=0x5000000000000000;
    int pos;

    if(*((int *)(&iValue))) {
        _BitScanForward( (unsigned long *)&pos, *((int *)(&iValue)) ); 
    } else {
        _BitScanForward( (unsigned long *)&pos, *((int *)(&iValue)+1) ); 
        pos +=32;
    }
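One caveat worth noting on the dereferenced version: in C++, reading a __int64 through an int * formally violates the strict-aliasing rules, even though MSVC tolerates it. A sketch of the same single-load extraction via memcpy, which stays well-defined (little-endian layout assumed, per the endianness caveat above; the helper name is mine):

```cpp
#include <cstdint>
#include <cstring>

// Extract the high 32 bits of a 64-bit value without pointer type-punning;
// compilers typically fold the memcpy into the same single 32-bit load.
// Assumes little-endian layout: the high half sits at byte offset 4.
static inline std::uint32_t high32(std::uint64_t v) {
    std::uint32_t hi;
    std::memcpy(&hi, reinterpret_cast<const unsigned char *>(&v) + 4, sizeof hi);
    return hi;
}
```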
 
The problem with dereferencing is that it forces the values to be in RAM, rather than in registers.
 
The problem with dereferencing is that it forces the values to be in RAM, rather than in registers.

It doesn't force anything, it simply tells the compiler to use a different set of rules when determining what it is you want it to do.

Code:
*((int *)(&iValue)+1)

This tells the compiler to: get the address of iValue, treat it as an int ptr, go to the value after it, then pass that value to the function.

While

Code:
static_cast<__int32>(iValue>>32)

This tells the compiler to: take the 64-bit int value, shift it 32 bits to the right, then convert that to an int and pass it to the function. Although it makes the right decision in the optimized version, it's really told to do the work in an inefficient manner.

For example.

Unoptimized version of static cast
Code:
        _BitScanForward( (unsigned long *)&pos, static_cast<__int32>(iValue>>32) ); 
00072FED  mov         eax,dword ptr [iValue] 
00072FF0  mov         edx,dword ptr [ebp-8] 
00072FF3  mov         cl,20h 
00072FF5  call        @ILT+455(__allshr) (711CCh) 
00072FFA  bsf         eax,eax 
00072FFD  mov         dword ptr [pos],eax 
        pos +=32;
00073000  mov         eax,dword ptr [pos] 
00073003  add         eax,20h 
00073006  mov         dword ptr [pos],eax

Unoptimized version of dereferenced.
Code:
        _BitScanForward( (unsigned long *)&pos, *((int *)(&iValue)+1) ); 
00B9353D  mov         eax,dword ptr [ebp-8] 
00B93540  bsf         ecx,eax 
00B93543  mov         dword ptr [pos],ecx 
        pos +=32;
00B93546  mov         eax,dword ptr [pos] 
00B93549  add         eax,20h 
00B9354C  mov         dword ptr [pos],eax

Optimized version of static cast
Code:
        _BitScanForward( (unsigned long *)&pos, static_cast<__int32>(iValue>>32) ); 
013E1010  mov         eax,dword ptr [esp+8] 
013E1014  bsf         edx,eax 
013E1017  mov         ecx,eax          ; *
        pos +=32;
013E1019  mov         eax,edx 
013E101B  sar         ecx,1Fh          ; *
013E101E  mov         dword ptr [esp+4],edx          ; *
013E1022  add         eax,20h
* (starred lines) for some reason the compiler failed to remove parts of the shift and unnecessarily saved the value on the stack.

Optimized version of dereference
Code:
        _BitScanForward( (unsigned long *)&pos, *((int *)(&iValue)+1) ); 
00241010  bsf         ecx,dword ptr [esp+8] 
        pos +=32;
00241015  mov         eax,ecx 
00241017  mov         dword ptr [esp+4],ecx          ; *
0024101B  add         eax,20h
* (starred line) unnecessarily saved the value on the stack.

It just goes to show you that compilers are good but not perfect, and you can use some tricks to help them in their endeavor to produce better code, often at the cost of intuitive readability.
 