SIMD and MMX programming

invidia

Platinum Member
Oct 8, 2006
2,151
1
0
I'm stuck on this problem and quite new to assembly, so bear with me.

I'm supposed to make a SIMD version of the function below. I was trying to write it with the MMX instruction set.
a1 and a2 are just arrays of 256 random integers. When I run this code, the outputs of the two versions don't match. The pointers are throwing me off and confusing me. Plus, I'm new to assembly. Any suggestions?

// Regular C++ code
void Add(unsigned char *a1, unsigned char *a2, unsigned short *out){
    for (int j = 0; j < 256; j++){
        out[j] = a1[j] + a2[j];
    }
}


// SIMD version of the above
void AddSIMD(unsigned char *a1, unsigned char *a2, unsigned short *out){

    __asm{
        mov eax, dword ptr [a1];
        mov ecx, dword ptr [a2];
        mov edx, dword ptr [out];

        movdqu xmm0, XMMWORD ptr [eax];
        movdqu xmm1, XMMWORD ptr [ecx];
        paddd xmm0, xmm1;
        movdqu XMMWORD ptr [edx], xmm0;
    }
}
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,836
4,816
75
OK, there are quite a few issues here. First, I notice that out is an array of shorts, while the other arrays are chars. Are those shorts supposed to be 16-bit, while the chars are 8-bit? As I recall, there is an instruction to take 4 8-bit values in a dword and break them out into 4 16-bit values in a qword. You probably need to do that for both a1 and a2.

Next, you have a loop in the C++ code. You haven't implemented a loop in the assembly code. Instead of incrementing by one byte, you'll need to increment by 4 with each iteration. You might increment eax and ecx by 4 and edx by 8 (4 shorts take 8 bytes). Alternatively, loops often count down in assembly, so you could store the base value for one pointer in e.g. ebx, move all the other pointers to their respective ends, and work backwards until that one pointer matches ebx. Or, I imagine, you could put the loop outside the __asm block.

Finally, I hear that SSE2 allows doing everything MMX does, but with twice the values.
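The pointer stepping described above (4 source bytes and 8 output bytes per iteration) can be sketched in plain C++ before any assembly is involved. This is only an illustration: the name AddChunked and the length parameter n are made up here, and the inner scalar loop is a stand-in for whatever SIMD instructions end up handling one chunk.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch: walk the arrays 4 elements at a time, the way the
// assembly loop would bump eax/ecx by 4 bytes and edx by 8 bytes.
void AddChunked(const unsigned char *a1, const unsigned char *a2,
                unsigned short *out, std::size_t n) {
    assert(n % 4 == 0);                  // assumption: length is a multiple of 4
    const unsigned char *end = a1 + n;   // loop until the a1 pointer reaches its end
    while (a1 != end) {
        // Stand-in for the SIMD work on one chunk of 4 elements.
        for (int k = 0; k < 4; ++k) {
            out[k] = static_cast<unsigned short>(a1[k] + a2[k]);
        }
        a1 += 4;   // 4 chars  = 4 bytes
        a2 += 4;   // 4 chars  = 4 bytes
        out += 4;  // 4 shorts = 8 bytes
    }
}
```

Calling AddChunked(a1, a2, out, 256) should give the same results as the original Add, which makes it a handy reference when checking the assembly version.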
 

Ken g6

I was hoping you'd come by with some answers. Those look great; I'll have to try them!

P.S. Mind if I quote you?
 

invidia

Thanks, but the code has to be inline assembly using the MMX registers. I don't think we can use SSE or higher yet. Sorry I don't have a copy of the updated code, but I added a loop and have it increment correctly. When I debug/test it, the xmm0 and xmm1 registers load the first 16 numbers of each array correctly.

But when I do the addition with paddd, there is wraparound or saturation (not sure which) if the resulting sum is over 255 (e.g. 240 + 230 = 470, but the result in the array comes to 214). I'm guessing there's some type of overflow, and I'm not sure how to handle it.


EDIT: In fact, there was no mention of whether I have to use inline assembly or not, so I'll give the intrinsics a shot. Thanks degibson.
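On the 240 + 230 question: a result of 214 is what you would expect from wraparound, since only the low 8 bits of the sum survive and 470 mod 256 = 214. Saturation would have clamped the result to 255 instead. A minimal sketch in plain C++ of the three behaviours (the helper names are made up for illustration and do not come from any library):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: add two bytes and keep only the low 8 bits (wraparound).
std::uint8_t add_wrap_u8(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::uint8_t>(a + b);
}

// Hypothetical helper: add two bytes and clamp at 255 (unsigned saturation).
std::uint8_t add_saturate_u8(std::uint8_t a, std::uint8_t b) {
    unsigned sum = static_cast<unsigned>(a) + b;
    return static_cast<std::uint8_t>(sum > 255u ? 255u : sum);
}

// Hypothetical helper: widen to 16 bits first, so the full sum is kept.
std::uint16_t add_widened_u16(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::uint16_t>(static_cast<std::uint16_t>(a) + b);
}
```

With a = 240 and b = 230, the first helper gives 214, the second 255, and the third 470, which is the value the scalar Add function stores in its unsigned short output.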
 

Ken g6

To handle the overflow, you need to work with the numbers as 16-bit numbers (words). To convert your 8-bit numbers to 16-bit numbers, you need the PUNPCKLBW instruction (and also the PUNPCKHBW instruction if you load 64 bits at a time), with the first operand being the register that holds the bytes you loaded and the second a register full of 0's. Then add the resulting words with PADDW rather than PADDD.

P.S.: Hint: If you're using xmm registers, and loading 16 bytes at a time, you're using SSE2. :)
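For anyone going the intrinsics route instead of inline assembly, here is a rough sketch of the unpack-then-add idea using SSE2 intrinsics from <emmintrin.h>. It assumes an x86 or x86-64 compiler with SSE2 available; the function name AddSSE2 and the length parameter n are made up for this example, and it is one possible way to do it, not the assignment's required solution. _mm_unpacklo_epi8 and _mm_unpackhi_epi8 correspond to PUNPCKLBW and PUNPCKHBW, and _mm_add_epi16 corresponds to PADDW.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>  // SSE2 intrinsics (x86 / x86-64 only)

// Hypothetical sketch: widen 16 bytes from each input to 16-bit words,
// then add the words so that sums above 255 are not truncated.
void AddSSE2(const unsigned char *a1, const unsigned char *a2,
             unsigned short *out, std::size_t n) {
    assert(n % 16 == 0);  // assumption: length is a multiple of 16
    const __m128i zero = _mm_setzero_si128();
    for (std::size_t j = 0; j < n; j += 16) {
        // Unaligned 16-byte loads (MOVDQU).
        __m128i x = _mm_loadu_si128(reinterpret_cast<const __m128i *>(a1 + j));
        __m128i y = _mm_loadu_si128(reinterpret_cast<const __m128i *>(a2 + j));
        // Interleave each byte with a zero byte: 8-bit -> 16-bit, zero-extended.
        __m128i xlo = _mm_unpacklo_epi8(x, zero);  // elements 0..7
        __m128i xhi = _mm_unpackhi_epi8(x, zero);  // elements 8..15
        __m128i ylo = _mm_unpacklo_epi8(y, zero);
        __m128i yhi = _mm_unpackhi_epi8(y, zero);
        // 16-bit adds, then store 8 shorts per register.
        _mm_storeu_si128(reinterpret_cast<__m128i *>(out + j),
                         _mm_add_epi16(xlo, ylo));
        _mm_storeu_si128(reinterpret_cast<__m128i *>(out + j + 8),
                         _mm_add_epi16(xhi, yhi));
    }
}
```

Since the largest possible sum is 255 + 255 = 510, a 16-bit lane can never overflow here, so the output should match the scalar Add element for element.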
 

degibson

Golden Member
Mar 21, 2008
1,389
0
0
Originally posted by: Ken g6
I was hoping you'd come by with some answers. Those look great; I'll have to try them!

P.S. Mind if I quote you?

I don't mind a quote taken out of context, provided it is sufficiently entertaining. It is ;-)