
Core 2 theoretical maximum floating point throughput

eLiu

Diamond Member
Hi all,
I'm trying to get a handle on what the theoretical maximum floating-point throughput of the Core 2 processor is (working w/doubles). So assume we have infinite registers, gigantic ROB/RS/etc, lol. Also assume that we aren't running SSE & x87 simultaneously--just focus on SSE2.

So for Core 2, the SSE execution units live on 3 different ports; and there are 2 for adding and 1 for multiplying.
http://www.realworldtech.com/page.cfm?ArticleID=RWT040208182719&p=6

Does that mean that once the pipeline is filled, I could nominally finish:
3*2 = 6 floating point ops per cycle? (3 128-bit-wide exec units, so each can do 2 ops, for a total of 4 adds & 2 mults)
This is also assuming that an SSE add/mult corresponds to just 1 uop... I don't know anything about uops, so maybe that's wrong.

Or do those ports have nothing to do with anything, and I can only finish 2 (mult or add, not both) per cycle? (As you can see, I'm not 100% sure what ports have to do with anything. Preliminary reading leads me to believe that in a cycle, we can issue at most 1 uop to each port. So no dice doing an SSE mult and an x87 add at once, even though they are on separate 'execution units.')

More "realistically," if I were, say, computing a dot product, it seems like the max would be much less. The additions should cost nothing, since they happen in parallel with the multiplications (once the pipe is full), right? So my "best" floating point throughput would be a little better than 2 per cycle.

Am I even close?

-Eric
 
So assume we have infinite registers, gigantic ROB/RS/etc, lol. Also assume that we aren't running SSE & x87 simultaneously--just focus on SSE2...

The answer depends on how much you idealize. Those "ports" mean that, regardless of the number of reservation-station entries, Core 2 is only flowing three uops per cycle. And it depends on the instruction mix.

But I think the answer to your question is "3". They just have to be the "right" mix of floating point ops, never go to cache, etc.
 
Core 2 can work with 128 bits at a time. So it can work on two doubles simultaneously.

On the other hand, adds only go in port 1, with a latency of 3; multiplies only go in port 0, with a latency of 5; and not much can go in port 5. (2, 3, and 4 are memory access-related.) Can SHUFPS be used to shuffle double-precision values? If so, the answer could be up to 6; if not, up to 4.

Edit: P.S. You might be interested in seeing my source. 😉
 
Core 2 can work with 128 bits at a time. So it can work on two doubles simultaneously.

On the other hand, adds only go in port 1, with a latency of 3; multiplies only go in port 0, with a latency of 5; and not much can go in port 5. (2, 3, and 4 are memory access-related.) Can SHUFPS be used to shuffle double-precision values? If so, the answer could be up to 6; if not, up to 4.

Edit: P.S. You might be interested in seeing my source. 😉

I'm too lazy to look at your source... and you may already know the answers to my follow-ups anyway.

- Are the FUs fully pipelined?
- If so, you can get more ops executing concurrently, but that doesn't change the throughput (3, limited by "ports" in the diagram, unless I'm missing something)
- If they're not fully pipelined, then throughput = 1/mean_service_time * num_ports.
 
Practically everything (or everything practical) is fully pipelined in Core 2. What you're missing is that he's talking about SSE instructions. So one add instruction, which takes three cycles on one port but is fully pipelined, adds two DP FP numbers to two other numbers.
 
Practically everything (or everything practical) is fully pipelined in Core 2.

I wonder what, if anything, isn't pipelined. I'm so out of touch with x64 offerings that I don't even know the exotic floating-point operations... so I can't guess.

What you're missing is that he's talking about SSE instructions. So one add instruction, which takes three cycles on one port but is fully pipelined, adds two DP FP numbers to two other numbers.

Ah... I got hung up on the diagram and didn't notice that OP was talking specifically about doubles.
 
Core 2 can work with 128 bits at a time. So it can work on two doubles simultaneously.

On the other hand, adds only go in port 1, with a latency of 3; multiplies only go in port 0, with a latency of 5; and not much can go in port 5. (2, 3, and 4 are memory access-related.) Can SHUFPS be used to shuffle double-precision values? If so, the answer could be up to 6; if not, up to 4.

Edit: P.S. You might be interested in seeing my source. 😉

HADDPD can issue to port 5, but its reciprocal throughput is 1.5. That's kind of grasping at straws, lol. There's DPPD too... doing a whole dot product! Though I'm not entirely sure that DPPD is actually better than doing it the "old way" with some shuf & mulpd.

I guess the page I linked to is somewhat misleading. I should've checked Agner Fog first, thanks for that 🙂

I think my statement about dot products (more generally, matrix-matrix products) is still roughly right? You'd expect a little more than 2 ops/cycle at best.
 
Though I'm not entirely sure that DPPD is actually better than doing it the "old way" with some shuf & mulpd.

It should be. It's basically skipping the retirement of the instructions and feeding them directly back into the pipeline, probably shortening the pipeline by a few cycles.

I don't think you'll ever get near or above 2 FP ops per cycle in a real-world dot product situation, as you'll always be limited by memory or by dependent operands. Theoretically, if you were iterating a 3x3 or 4x4 Markov chain you could probably do it, as it would fit in the register set and allow enough parallel operations.
 