Hi all,
I'm trying to get a handle on what the theoretical maximum floating point throughput of the Core 2 processor is (working with doubles). So assume we have infinitely many registers, a gigantic ROB/RS/etc., lol. Also assume that we aren't running SSE and x87 simultaneously; just focus on SSE2.
So for Core 2, as I understand it, the SSE execution units live on 3 different ports, and there are 2 for adding and 1 for multiplying.
http://www.realworldtech.com/page.cfm?ArticleID=RWT040208182719&p=6
Does that mean that once the pipeline is filled, I could nominally finish:
3*2 = 6 floating point ops per cycle? (Three 128-bit-wide execution units, so each can do 2 double-precision ops per cycle, for a total of 4 adds and 2 mults.)
This is also assuming that an SSE add/mult corresponds to just 1 uop... I don't know anything about uops, so maybe that's wrong.
Or do those ports have nothing to do with anything, and I can only finish 2 (mult or add, not both) per cycle? (As you can see, I'm not 100% sure what ports have to do with anything. Preliminary reading leads me to believe that in a cycle we can issue at most 1 uop to each port. So no dice doing an SSE mult and an x87 add at once if they share a port, even though they are on separate 'execution units.')
More "realistically," if I were, say, computing a dot product, it seems like the max would be much less. The additions should cost nothing, since they happen in parallel with the multiplications (once the pipe is full), right? So my "best" floating point throughput would be a little better than 2 per cycle.
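To make that arithmetic concrete, here is a tiny back-of-envelope sketch (Python, used only as a calculator). The unit counts are just my reading of the article and may well be wrong; the function name is made up for illustration, and nothing here is measured.

```python
# Back-of-envelope peak, under the assumptions in this post (not measured).
SIMD_WIDTH_BITS = 128
DOUBLE_BITS = 64
lanes = SIMD_WIDTH_BITS // DOUBLE_BITS  # 2 doubles per packed SSE2 op


def peak_flops_per_cycle(num_units, lanes_per_unit):
    """Assumes every unit can start one packed op (1 uop) per cycle."""
    return num_units * lanes_per_unit


add_units, mul_units = 2, 1  # my reading of the article; may be wrong
adds = peak_flops_per_cycle(add_units, lanes)
mults = peak_flops_per_cycle(mul_units, lanes)
print(adds, mults, adds + mults)
```

If the real unit counts differ, only `add_units` and `mul_units` need to change.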
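Here is a rough count for the dot product under that idealized model, just to show what I'm tallying. The assumptions are mine and unverified: one packed multiply and one packed add can start every cycle, each is 1 uop, and the cycle cost of loads, latency, and the final horizontal sum is ignored. The function name and the length 1000 are hypothetical.

```python
import math


def dot_product_estimate(n, lanes=2):
    """Idealized op count for a length-n dot product of doubles.

    Assumes (not measured) that one packed multiply and one packed add
    can start each cycle, so the adds overlap with the multiplies.
    """
    packed_mults = math.ceil(n / lanes)
    cycles = packed_mults      # adds are hidden behind the multiplies
    mult_flops = n
    add_flops = n - 1          # summing n products takes n - 1 additions
    return cycles, mult_flops, add_flops


cycles, mult_flops, add_flops = dot_product_estimate(1000)
print(mult_flops / cycles)                 # counting multiplies only
print((mult_flops + add_flops) / cycles)   # counting multiplies and adds
```

So the figure depends on whether the overlapped additions are counted as flops: multiplies alone come to 2 per cycle in this model, while counting the adds as well gives a larger number.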
Am I even close?
-Eric