x86 assembly anyone? (and simple C question)

xtknight

Elite Member
Oct 15, 2004
12,974
0
71
Alright, I have attached some C code that I would like a fast assembly version of. Also, for inline assembly in C I'd use __asm {...}, correct? I know very little assembly: I know jmp and jne jump, and mul and imul multiply, but I don't know when to use which or even how to start an assembly program. I know a little about the registers (eax-edx). If you won't write it, could you point me to a good beginner's tutorial? This code will be used for realtime video blending (part of supersampling). I really think that if someone can give me a framework, I can get the hang of it and finish it myself. What I actually need is the average of a bunch of values, but I'm going to try to modify the code to do that myself.

I also have a side question. For the for loop, I want to use i<640 and not i<=640 because the array starts at 0, right? Also, I hope I'm right in assuming that this program takes the value at each matrix location, halves it, and takes the ceiling if the result is fractional. The total memory allocation of the rgb matrix would be 1 byte (8-bit int) * 640 * 480, correct? And is the ceiling of 1.0 still 1.0?

Thanks in advance.
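A minimal sketch of the side questions, assuming the buffer is declared as `unsigned char rgb[640][480]` (the attached code is not shown in the thread, so the name, type, and layout here are assumptions). Valid indices run from 0 to 639, so the loop test is `i < 640`, and the array occupies 640 * 480 * 1 = 307,200 bytes. As for the last question, ceil() returns the smallest integer value not less than its argument, so the ceiling of 1.0 is still 1.0.

```c
#include <assert.h>
#include <stddef.h>

#define WIDTH  640   /* assumed dimensions, taken from the question */
#define HEIGHT 480

/* Hypothetical buffer: one 8-bit value per pixel. */
static unsigned char rgb[WIDTH][HEIGHT];

/* Write `value` to every element and return how many elements were visited. */
size_t touch_all(unsigned char value)
{
    size_t visited = 0;
    for (int i = 0; i < WIDTH; i++) {        /* i < 640: indices 0..639 */
        for (int j = 0; j < HEIGHT; j++) {   /* j < 480: indices 0..479 */
            rgb[i][j] = value;
            visited++;
        }
    }
    /* One byte per element, so the count equals the size in bytes. */
    assert(visited == sizeof rgb);
    return visited;
}
```

Using `i <= 640` instead would touch `rgb[640][...]`, which is one row past the end of the array.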
 

imported_FishTaco

Golden Member
Apr 28, 2004
1,120
0
0
I don't know if hand-coded assembly will be any faster than what a good C compiler produces in this case, but you can use a bit shift to avoid the function call and the conversion to float and back to int.

something like this maybe:

rgb[ i ][ j ] = (rgb[ i ][ j ] + 1) >> 1;


(Extra spaces in the brackets stop the forum from turning the line into italics.)

Maybe someone who knows one of the SIMD instruction sets can help you more.
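As a rough sketch of the shift approach applied to a whole buffer, assuming the pixels are stored as `unsigned char` and passed as a flat pointer plus a length (the function name and signature are made up for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Halve each 8-bit value in place, rounding halves up.
   For a non-negative integer v, (v + 1) >> 1 gives the same result
   as ceil(v / 2.0), with no float conversion and no function call. */
void halve_round_up(unsigned char *buf, size_t n)
{
    assert(buf != NULL || n == 0);
    for (size_t k = 0; k < n; k++) {
        buf[k] = (unsigned char)((buf[k] + 1) >> 1);
    }
}
```

A 640x480 array could be passed as `halve_round_up(&rgb[0][0], 640 * 480)`, since a two-dimensional C array is contiguous in memory.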
 

DaveSimmons

Elite Member
Aug 12, 2001
40,730
670
126
Originally posted by: FishTaco
something like this maybe:

rgb[ i ][ j ] = (rgb[ i ][ j ] + 1) >> 1;
Keeping it int and using a shift is definitely faster, though if the sum is held in an int8 the +1 will overflow when the value is already 255.

If mapping 0/1 -> 0, ... 254/255 -> 127 is acceptable then just the shift without +1 would work.

If this is a sequel to the supersampling thread you could do something like:
1. use an array of int16 for the pixels
2. assign pixel = source 1 + source 2 (int8 + int8 = 0-510)
3. use the shift without +1 since (2) already maps to 0-255 when you /2

... or pass in 2 pixel arrays and do 2,3 using a local temp int16, assign the result to one of the input arrays
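The two-array variant could look something like the sketch below. The function name and the flat pointer-plus-length interface are assumptions for illustration. One caveat: in standard C the operands of `+` are promoted to int before the addition, so with unsigned 8-bit storage the expression itself does not wrap, and the 16-bit temporary mainly documents the intent. The overflow concern becomes real when the arithmetic is actually done in 8 bits, for example in assembly byte registers or SIMD byte lanes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Average two 8-bit sources element by element and store the result in dst.
   The result truncates: sums 0/1 -> 0, ..., 508/509 -> 254, 510 -> 255. */
void blend_pair(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t k = 0; k < n; k++) {
        uint16_t sum = (uint16_t)(dst[k] + src[k]);  /* 0..510 fits in 16 bits */
        dst[k] = (uint8_t)(sum >> 1);                /* back to 0..255 */
    }
}
```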
 

xtknight

Elite Member
Oct 15, 2004
12,974
0
71
Yup, that's promising. Based on counting in my head and looping the code a varying number of times, it looks like with the binary shift I can do that operation at about 100 FPS. Definitely acceptable. Just for the challenge it would be nice to get some SIMD code to work though. Moral of the story: never underestimate pure C.

DaveSimmons: This is indeed my attempt at making the VB.NET program faster, or even realtime, so I can make an open source DirectShow filter or something out of it.

Why an int16 for pixels? I only need 0-255 so int8 would be adequate.
Actually I made a stupid mistake and mislabeled that variable. I need 3 variables to do the R, G, and B, don't I? I don't actually know how the data is going to be coming in at this point. Something along the lines of a GDI bitmap function using scanlines, it's confusing for me...

So
unsigned __int8 sourceR [640][480]; /* unsigned, so each value covers 0-255 */
unsigned __int8 sourceG [640][480];
unsigned __int8 sourceB [640][480];

Sorry for the confusion but I'm not going to store the average of them in a variable per se, I'm going to directly write them out to the new bitmap.

My code will go somewhat like this:

SetNewPixel((GetPixelRed(x,y)+GetPixelRed(x-1,y-1)+GetPixelRed(x-1,y-2)+GetPixelRed(x-2,y-2))/4,..green,..blue);

I assume >>2 would be /4?
Also if I have the number of samples stored in a variable, how would I do the equivalent shift? Like for example say __int8 SSAA=4;
Somehow I need to interpret that variable or something. How would I take that 4 and "internally convert" the code to >>2 or, if it was 2 how could I >>1 it. Perhaps avg>>(SSAA^0.5);? Also it would probably be faster to store the calculation of SSAA^0.5 in a variable itself so the math isn't required each time. The reason I'm doing ^0.5 (square root) is the binary shifts. Unless I'm really off, the shifts are respectively:

divide by 2 = >>1
divide by 4 = >>2
divide by 8 = >>3
divide by x = >>(x^0.5)

Or, can I not put that (x^0.5) expression after the >> sign?
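For what it's worth, the shift count for a power-of-two sample count is the base-2 logarithm, not the square root. The two happen to agree for 4 (both give 2), but not for 2 or 8, as the table above shows. Also, in C the `^` operator is bitwise XOR rather than exponentiation, and the right-hand side of `>>` has to be an integer. A minimal sketch of computing the shift once and reusing it, with hypothetical function names:

```c
#include <assert.h>

/* Return log2(samples) for a power-of-two sample count (1, 2, 4, 8, ...). */
unsigned shift_for_samples(unsigned samples)
{
    /* Only powers of two can be divided exactly with a single shift. */
    assert(samples != 0 && (samples & (samples - 1)) == 0);
    unsigned shift = 0;
    while (samples > 1) {
        samples >>= 1;
        shift++;
    }
    return shift;
}

/* Divide a non-negative sum by the sample count using the precomputed shift. */
unsigned average_from_sum(unsigned sum, unsigned shift)
{
    return sum >> shift;
}
```

Calling `shift_for_samples(SSAA)` once before the pixel loop and passing the result to every `average_from_sum` call keeps the logarithm out of the inner loop. For unsigned values, `>>2` is indeed the same as `/4`.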

Thanks a ton guys. The binary shifting is probably 5x faster. By the time I use SSAA with more than 2 samples, though, I'm not going to end up with 100 FPS, that's for sure, so I'll want to look into SIMD as well. Or maybe I'll just give in and map the video as a texture on a surface and let the GPU AA it. :p (Err, wait, do I need the new 7800GTX transparency AA for that? :|) I still might make a picture program out of this, but video is better suited to Direct3D9 at this point. For photos the one advantage of the CPU method is that I can endlessly tweak it, unlike the GPU's. Maybe someday I'll end up with the best AA algorithm ever conceived.
 

DaveSimmons

Elite Member
Aug 12, 2001
40,730
670
126
> SetNewPixel((GetPixelRed(x,y)+GetPixelRed(x-1,y-1)+GetPixelRed(x-1,y-2)+GetPixelRed(x-2,y-2))/4,..green,..blue);

1. Each value is 0-255 for one pixel, but their sum will overflow an int8; it only fits again after you shift or divide to compute the average.

2. Faster than using a variable for the shift or divide would be separate blocks of code for each case (2 pixel, 4 pixel, 8 pixel). The compiler can optimize better given constants instead of variables.

3. Calling GetPixelRed(), -Green(), and -Blue() separately may be slower than calling one GetRGB() function, if your library offers it, and sharing the result in a joint R,G,B loop.

if ( SS4 )
{
    int16 nRed = ( Get(x-1, y-1) + Get(x, y-1) + Get(x-1, y) + Get(x, y) ) >> 2;

    Set( (int8) nRed, ... );
}
else if ( SS8 )
{
    ....
}
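The three points above might combine along these lines. This is only a sketch under assumptions: it supposes each pixel can be fetched once as a packed 0x00RRGGBB value (the layout and every function name here are hypothetical, not taken from any particular library), and it uses one function per sample count so each shift is a compile-time constant.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical packed pixel layout: 0x00RRGGBB. */
static uint32_t pack_rgb(unsigned r, unsigned g, unsigned b)
{
    assert(r <= 255 && g <= 255 && b <= 255);
    return ((uint32_t)r << 16) | ((uint32_t)g << 8) | (uint32_t)b;
}

/* Extract one 8-bit channel; shift is 16 (red), 8 (green) or 0 (blue). */
static uint16_t channel(uint32_t pixel, unsigned shift)
{
    return (uint16_t)((pixel >> shift) & 0xFF);
}

/* Four-sample case: each channel sum is at most 4 * 255 = 1020,
   so a 16-bit temporary is wide enough. The shift is the constant 2. */
uint32_t average4(uint32_t p0, uint32_t p1, uint32_t p2, uint32_t p3)
{
    uint16_t r = channel(p0, 16) + channel(p1, 16) + channel(p2, 16) + channel(p3, 16);
    uint16_t g = channel(p0, 8)  + channel(p1, 8)  + channel(p2, 8)  + channel(p3, 8);
    uint16_t b = channel(p0, 0)  + channel(p1, 0)  + channel(p2, 0)  + channel(p3, 0);
    return pack_rgb(r >> 2, g >> 2, b >> 2);
}

/* Two-sample case: same idea with the constant shift 1. */
uint32_t average2(uint32_t p0, uint32_t p1)
{
    uint16_t r = channel(p0, 16) + channel(p1, 16);
    uint16_t g = channel(p0, 8)  + channel(p1, 8);
    uint16_t b = channel(p0, 0)  + channel(p1, 0);
    return pack_rgb(r >> 1, g >> 1, b >> 1);
}
```

The caller would pick `average2` or `average4` once, outside the pixel loop, rather than testing the sample count for every pixel.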