> Okay, hey, the 4-wide decoder and 32-byte fetch is great, but you forget that for one core that means only 2-wide decoding and 16-byte fetch.

No, read again. I said that an average of 2 insns per clock would probably be enough, unless you waste the decode bandwidth.

I still maintain that 2-wide with SNB-class branch prediction is probably better in the real world than 3-wide with Phenom-class prediction.

> Decoding is an issue because AMD would need 6-wide decoders, which is nearly impossible with x86. The main reason IBM was obviously able to get it right with Power is that they do not have the decoding issues. And that is what makes Bulldozer suffer.

AMD has always done some predecode, including measuring instruction lengths, when code is loaded from L2 into L1i (which makes an L1i miss much worse than it looks if you only consider the latencies). With that predecode, I don't think 6 insns per clock is that close to impossible. Possibly still too expensive, though.
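To make the predecode point concrete, here is a toy C sketch (my own illustration with made-up numbers, not AMD's actual mechanism or storage format): assume the L2-to-L1i fill path has already recorded the length of the instruction starting at each byte of a fetch window. Picking several instruction start points per cycle then becomes a handful of trivial lookups instead of several dependent variable-length x86 parses, which is the part that makes a wide pick less far-fetched.

```c
#include <stdio.h>

#define FETCH_BYTES 16
#define PICK_WIDTH   6   /* the hypothetical 6-wide case discussed above */

int main(void)
{
    /* Predecoded length of the instruction starting at each byte offset;
     * 0 marks bytes that are not an instruction start. The values are
     * invented purely for illustration. */
    unsigned char insn_len[FETCH_BYTES] =
        { 3, 0, 0, 1, 2, 0, 5, 0, 0, 0, 0, 2, 0, 3, 0, 0 };

    /* The walk below is written sequentially, but each step is a single
     * table lookup; real hardware can turn the same boundary marks into
     * simple pick logic instead of re-parsing x86 bytes every cycle. */
    unsigned offset = 0, picked = 0;
    while (picked < PICK_WIDTH && offset < FETCH_BYTES && insn_len[offset]) {
        printf("slot %u: insn at byte %2u, length %u\n",
               picked, offset, insn_len[offset]);
        offset += insn_len[offset];
        picked++;
    }
    printf("picked %u insns out of a %u-byte fetch window\n",
           picked, FETCH_BYTES);
    return 0;
}
```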
> You have to be careful with execution width. The width you use tells me that you are talking about micro-ops, not macro-ops. So just consider that Phenom II was 6-wide!

No, Phenom II could execute at most 3 ops per clock, one for each of the pipes.
> But it is very misleading to discuss the execution width in micro-ops, because what counts in the end is the macro-op execution width, and there you have, for each core:
>
> Sandy Bridge: Decoding 3 wide*, Execution 3 wide + MacroOp fusion (= 4 if fused), Max allowed address ops 2
> Bulldozer: Decoding 2 wide, Execution 2 wide, Max allowed address ops 2

Can't SNB execute 5 in the situation of (all independent, and with spare decode bandwidth from elsewhere) add, add, add, mov reg mem, mov reg mem?

There are 4 result buses, so BD is Decode: 2 wide, Execution: 4 wide (2 of which can be ALU ops and 2 can be memory ops/inc/dec/lea).
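If anyone wants to poke at that five-instruction mix themselves, here is a rough micro-benchmark sketch (GNU C with x86-64 inline asm; the iteration count and register choices are arbitrary, and the loop's own compare/branch dilutes the result a bit). Each iteration issues three independent adds and two independent loads.

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t a = 0, b = 0, c = 0;
    uint64_t buf[2] = { 1, 2 };

    for (uint64_t i = 0; i < 100000000ULL; i++) {
        /* Three independent ALU adds plus two independent loads:
         * the add, add, add, mov reg mem, mov reg mem case above. */
        __asm__ volatile(
            "add $1, %[a]      \n\t"
            "add $1, %[b]      \n\t"
            "add $1, %[c]      \n\t"
            "mov (%[p]),  %%r8 \n\t"
            "mov 8(%[p]), %%r9 \n\t"
            : [a] "+r"(a), [b] "+r"(b), [c] "+r"(c)
            : [p] "r"(buf)
            : "r8", "r9", "memory");
    }

    /* Print the counters so the values are visibly used. */
    printf("%llu %llu %llu\n", (unsigned long long)a,
           (unsigned long long)b, (unsigned long long)c);
    return 0;
}
```

Build with something like `gcc -O2` and run under `perf stat` to read the instruction and cycle counts; the measured IPC is only an approximation of the 5-per-clock question, not an exact answer.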
Shared instruction decoding does raise some interesting possibilities. Notably, during a code-cache miss, and during a branch mispredict when you know you missed but don't yet know the real target, the other thread can in theory use all the decode bandwidth to run ahead in decode a little, and then in turn let the thread that missed use more than its share when it does finally get data. As this would increase the effective decode bandwidth exactly in the situation where it matters most, it could in theory give quite an advantage. I have no idea if AMD does this, or if they just do the inflexible "if a thread isn't sleeping, it gets all the even/odd clocks".
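For what it's worth, here is a deliberately crude C model of that idea (the widths, cycle counts, and policy names are invented, and it says nothing about what Bulldozer actually does): two threads share a 4-wide decoder, thread 0 sits out a stretch of cycles on an instruction-cache miss, and strict even/odd alternation is compared against handing the stalled thread's cycles to its neighbour. It also assumes the borrowing thread always has fetched bytes ready and a queue to put decoded ops into, and it leaves out the "pay it back after the miss" half of the argument.

```c
#include <stdio.h>
#include <stdbool.h>

#define CYCLES        100
#define DECODE_WIDTH  4
#define MISS_START    20   /* thread 0 waits on an I-cache miss...     */
#define MISS_END      40   /* ...during cycles [MISS_START, MISS_END)  */

/* Count instructions decoded per thread under one arbitration policy.
 * flexible == false: strict alternation; a stalled owner's cycle is wasted.
 * flexible == true : a stalled owner's cycle goes to the other thread. */
static void run(bool flexible, unsigned decoded[2])
{
    decoded[0] = decoded[1] = 0;
    for (unsigned cycle = 0; cycle < CYCLES; cycle++) {
        unsigned owner   = cycle & 1;   /* even cycles belong to thread 0 */
        bool     stalled = (owner == 0 &&
                            cycle >= MISS_START && cycle < MISS_END);
        if (!stalled)
            decoded[owner] += DECODE_WIDTH;
        else if (flexible)
            decoded[1] += DECODE_WIDTH; /* neighbour borrows the slots */
        /* else: this cycle's decode slots are simply lost */
    }
}

int main(void)
{
    unsigned fixed[2], flex[2];
    run(false, fixed);
    run(true,  flex);
    printf("strict even/odd : t0=%3u  t1=%3u  total=%3u\n",
           fixed[0], fixed[1], fixed[0] + fixed[1]);
    printf("borrow on stall : t0=%3u  t1=%3u  total=%3u\n",
           flex[0],  flex[1],  flex[0] + flex[1]);
    return 0;
}
```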
