SSE ALU Max

dodge1this · Mar 30, 2007

Hi all,

This is my first post and I hope it finds some relevance here. I have been trying to reach peak ALU performance on CPUs. I have an Athlon 3200+ in my machine and I have been trying to write demo code that shows its maximum GFlops. For this purpose, I have been using the Visual Studio 2005 SSE intrinsics. I have written a small program that loads two float[4] arrays into two __m128 variables and then performs dependent adds on them, one after the other. Finally, I store the result back. The number of memory ops is 3 (2 loads and 1 store), while the number of ALU ops is very high (around 200K), so memory can't be the bottleneck.
However, when I do this, I get a peak of 1.7 GFlops, while the CPU I have runs at 1.8 GHz. Shouldn't the theoretical peak be 1.8 x 4 = 7.2 GFlops, and shouldn't I get something around 5 or so in the worst case? Might there be dependency-related issues or other ALU optimizations that I am missing? I can paste the code if someone wants to take a look at it.

Thanks and regards,
Kshitij.

lambchops511 · Apr 2, 2007

Paste the code, and the part showing how you calculated the GFlops/s.

firewolfsm · Apr 2, 2007

Paste the code...I won't understand it...and I can't really help you...but paste the code.

zephyrprime · Apr 3, 2007

Your code is probably correct. That's the level of performance you can get from an A64 or P4 doing SSE. SSE in both the P4 and A64 is really a second-class citizen compared to the 32-bit integer units. Check the optimization manuals for the P4 or the instruction set manual for the A64 for the instruction latencies.

The biggest problem is that both the A64 and the P4 actually only have 64-bit SSE units. When you execute _mm_add_ps, the processor internally breaks the 128-bit operand into two separate 2x32-bit halves. Only in the Conroe and the upcoming Barcelona is 128-bit SSE implemented.
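
To put rough numbers on the 64-bit point, here is a back-of-the-envelope sketch in C. The helper name `peak_gflops` is made up for illustration, and the floats-per-cycle figures are assumptions taken from the description above (a full 128-bit add unit retiring 4 floats per cycle versus a 64-bit unit retiring 2), not measured values.

```c
/* Rough peak estimate: clock rate (GHz) times float results retired per cycle.
 * Hypothetical helper for illustration only. */
double peak_gflops(double clock_ghz, double floats_per_cycle)
{
    return clock_ghz * floats_per_cycle;
}

/* Assumed inputs for the 1.8 GHz Athlon discussed in this thread:
 *   peak_gflops(1.8, 4.0)  -> the original poster's 7.2 GFlops expectation
 *   peak_gflops(1.8, 2.0)  -> about 3.6 GFlops if the add unit is only 64 bits wide
 */
```

Under those assumptions the add-only ceiling is roughly half of what the original estimate expected, before any latency effects are counted.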

You're also slowed down by doing dependent adds. These generate pipeline bubbles, since the result of the prior add must be known before the next dependent add can start. Executing the add takes 2-3 cycles as I recall.
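
As a sketch of the dependency point, the two functions below do the same number of packed adds, once as a single chain and once split over four independent chains. The names `adds_dependent` and `adds_independent` are hypothetical, and four chains is an arbitrary choice: the number needed to hide the latency depends on the actual add latency of the CPU. This is not the original poster's code, which was never posted. A compiler may also transform these loops, so it is worth checking the generated assembly before timing anything.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_add_ps, _mm_loadu_ps, ... */

/* Sum the four lanes of a vector into one float. */
static float lane_sum(__m128 v)
{
    float out[4];
    _mm_storeu_ps(out, v);
    return out[0] + out[1] + out[2] + out[3];
}

/* n packed adds in a single chain: every add needs the previous result. */
float adds_dependent(const float inc[4], int n)
{
    __m128 step = _mm_loadu_ps(inc);
    __m128 acc = _mm_setzero_ps();
    int i;
    for (i = 0; i < n; ++i)
        acc = _mm_add_ps(acc, step);
    return lane_sum(acc);
}

/* The same n packed adds (n assumed to be a multiple of 4), split over
 * four accumulators that do not depend on each other inside the loop. */
float adds_independent(const float inc[4], int n)
{
    __m128 step = _mm_loadu_ps(inc);
    __m128 acc0 = _mm_setzero_ps();
    __m128 acc1 = _mm_setzero_ps();
    __m128 acc2 = _mm_setzero_ps();
    __m128 acc3 = _mm_setzero_ps();
    int i;
    for (i = 0; i < n; i += 4) {
        acc0 = _mm_add_ps(acc0, step);
        acc1 = _mm_add_ps(acc1, step);
        acc2 = _mm_add_ps(acc2, step);
        acc3 = _mm_add_ps(acc3, step);
    }
    /* Combine the four partial results once, after the loop. */
    return lane_sum(_mm_add_ps(_mm_add_ps(acc0, acc1),
                               _mm_add_ps(acc2, acc3)));
}
```

Timing both with the same `n` and comparing would show how much of the gap to peak comes from the dependency chain rather than from the width of the SSE units.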

CTho9305 · Apr 3, 2007

I think on an Athlon you can get peak throughput just using fadd and fmul, until you run out of registers and start creating false dependencies (at which point you will need to throw SSE operations into the mix). Athlons and Athlon64s have an FP adder and an FP multiplier that are both fully pipelined (so you can get 2 FP ops/cycle, or 3 if you count the "misc" pipe operations as FP ops).
