SSE ALU Max

dodge1this

Junior Member
Oct 14, 2006
3
0
0
Hi all,

This is my first post and I hope it is relevant here. I have been trying to measure peak ALU performance on CPUs. I have an Athlon 3200+ in my machine and I have been trying to write demo code that shows the maximum GFLOPS it can reach. For this I am using the Visual Studio 2005 SSE intrinsics. I have written a small program that loads two float[4] arrays into two __m128 variables and then performs dependent adds on them, one after the other. Finally, I store the result back. There are only 3 memory ops (2 loads and 1 store), while the number of ALU ops is very high (around 200K), so memory can't be the bottleneck.
However, when I do this, I get a peak of 1.7 GFLOP/s, while my CPU runs at 1.8 GHz. Shouldn't the theoretical peak be 1.8 x 4 = 7.2 GFLOP/s, so that I should get around 5 or so in the worst case? Might there be dependency-related issues, or ALU optimizations that I am missing? I can paste the code if someone wants to take a look at it.

Thanks and regards,
Kshitij.
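
For readers following along, here is a minimal sketch of the kind of kernel described above. It is not the poster's actual code (which was not posted). The function names dependent_add_chain and naive_peak_gflops are made up for illustration, and unaligned loads are used only so the sketch works with any float[4].

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_add_ps, ... */

/* Hypothetical reconstruction: add b to a `iters` times in one serial
   chain, then store the four lanes to out. Every add needs the result
   of the add before it. */
void dependent_add_chain(const float a[4], const float b[4],
                         float out[4], size_t iters)
{
    __m128 acc = _mm_loadu_ps(a);    /* memory op 1: load */
    __m128 inc = _mm_loadu_ps(b);    /* memory op 2: load */
    for (size_t i = 0; i < iters; ++i)
        acc = _mm_add_ps(acc, inc);  /* depends on the previous acc */
    _mm_storeu_ps(out, acc);         /* memory op 3: store */
}

/* The peak assumed in the post: clock in GHz times floats per cycle. */
double naive_peak_gflops(double clock_ghz, int floats_per_cycle)
{
    return clock_ghz * floats_per_cycle;
}
```

Timing a call with iters around 200K and dividing 4 * iters by the elapsed time gives the measured GFLOP/s figure; naive_peak_gflops(1.8, 4) gives the 7.2 it is being compared against.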
 

firewolfsm

Golden Member
Oct 16, 2005
1,848
29
91
Paste the code...I won't understand it...and I can't really help you...but paste the code.
 

zephyrprime

Diamond Member
Feb 18, 2001
7,512
2
81
Your code is probably correct. That's the level of performance you can get from an A64 or P4 doing SSE. SSE on both the P4 and the A64 is really a second-class citizen compared to the 32-bit integer units. Take a look at the optimization manuals for the P4 or the instruction set manual for the A64 to check the instruction latencies.

The biggest problem is that both the A64 and the P4 actually only have 64-bit SSE units. When you execute _mm_add_ps, the processor breaks the 128-bit operand into two separate 2x32-bit vectors. Only Conroe and the upcoming Barcelona implement 128-bit SSE.
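
To put a number on the 64-bit point, here is a rough back-of-the-envelope helper. It assumes a single, fully pipelined add pipe that accepts one 64-bit half per cycle; that is a simplification, and add_peak_gflops is a made-up name.

```c
/* Rough add-only peak in GFLOP/s for one add pipe.
   unit_bits is the width of the SSE execution unit (64 per the post
   above). A 128-bit add then needs 128 / unit_bits passes through it.
   Assumption: one pass can start every cycle. */
double add_peak_gflops(double clock_ghz, int unit_bits)
{
    const int vector_bits = 128;      /* width of an __m128 */
    const int floats_per_vector = 4;  /* four 32-bit lanes */
    int passes = vector_bits / unit_bits;
    return clock_ghz * floats_per_vector / passes;
}
```

Under these assumptions add_peak_gflops(1.8, 64) comes to 3.6, half of the 7.2 that a full 128-bit unit would give at the same clock.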

You're also slowed down by doing dependent adds. This generates pipeline bubbles, since the result of the prior add must be known before the dependent operation can start. Executing the add takes 2-3 cycles, as I recall.
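
A common way around the dependency stall is to keep several independent accumulators in flight. The sketch below is one possible arrangement, not something taken from the original program: four_chain_add and chain_gflops are hypothetical names, and the chain count of four is an arbitrary choice.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Throughput of ONE dependent chain: a 4-float add finishes only every
   latency_cycles cycles. The latency value is an input, not a claim. */
double chain_gflops(double clock_ghz, double latency_cycles)
{
    return clock_ghz * 4.0 / latency_cycles;
}

/* Performs 4 * iters adds of b, split over four accumulators that do
   not depend on each other, so adds from different chains can overlap
   in the pipeline. The chains are combined once at the end. */
void four_chain_add(const float a[4], const float b[4],
                    float out[4], size_t iters)
{
    __m128 inc  = _mm_loadu_ps(b);
    __m128 acc0 = _mm_loadu_ps(a);
    __m128 acc1 = _mm_setzero_ps();
    __m128 acc2 = _mm_setzero_ps();
    __m128 acc3 = _mm_setzero_ps();
    for (size_t i = 0; i < iters; ++i) {
        acc0 = _mm_add_ps(acc0, inc);  /* chain 0 */
        acc1 = _mm_add_ps(acc1, inc);  /* chain 1, independent of 0 */
        acc2 = _mm_add_ps(acc2, inc);  /* chain 2 */
        acc3 = _mm_add_ps(acc3, inc);  /* chain 3 */
    }
    acc0 = _mm_add_ps(_mm_add_ps(acc0, acc1), _mm_add_ps(acc2, acc3));
    _mm_storeu_ps(out, acc0);
}
```

As a sanity check on the numbers in this thread, chain_gflops(1.8, 4.0) is 1.8, so an add latency of about 4 cycles (an assumption, slightly above the 2-3 recalled here) would land close to the 1.7 GFLOP/s that was measured.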
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
I think on an Athlon you can get peak throughput just using fadd and fmul, until you run out of registers and start creating false dependencies (at which point you will need to throw SSE operations into the mix). Athlons and Athlon64s have an FP adder and an FP multiplier that are both fully pipelined (so you can get 2 FP ops/cycle, or 3 if you count the "misc" pipe operations as FP ops).
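
The post talks about x87 fadd and fmul; the same pairing idea can be sketched with SSE intrinsics, on the assumption that packed adds and packed multiplies are also sent to the separate adder and multiplier pipes. The function add_mul_mix is a hypothetical name, and a real benchmark would likely need more chains of each kind to cover the latency.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Runs one add chain and one multiply chain side by side. The two
   chains never read each other's results, so an add and a multiply
   can be in flight at the same time. */
void add_mul_mix(const float add_step[4], const float mul_step[4],
                 float add_out[4], float mul_out[4], size_t iters)
{
    __m128 astep = _mm_loadu_ps(add_step);
    __m128 mstep = _mm_loadu_ps(mul_step);
    __m128 sum   = _mm_setzero_ps();     /* add chain starts at 0 */
    __m128 prod  = _mm_set1_ps(1.0f);    /* multiply chain starts at 1 */
    for (size_t i = 0; i < iters; ++i) {
        sum  = _mm_add_ps(sum, astep);   /* work for the FP adder */
        prod = _mm_mul_ps(prod, mstep);  /* work for the FP multiplier */
    }
    _mm_storeu_ps(add_out, sum);
    _mm_storeu_ps(mul_out, prod);
}
```

Keep the multiply step close to 1.0 in a long-running test so the product neither overflows nor underflows.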