Some Bulldozer and Bobcat articles have sprung up

Cogman · Aug 30, 2010

JFAMD said:
Actually in a multithreaded FPU environment, we would have an advantage.

In AVX we will have 8 256-bit units
In non-AVX we will have 16 128-bit units.

Compared to everything I have seen on the server Sandybridge, they will have 8 256-bit AVX units, so we are generally tied on AVX code, but on non-AVX code they will only have 8 128-bit units, or half the FP capability.

Remember that most apps will not be recompiled to take advantage of AVX right away, so we have an advantage.

Also, unless they have changed their scheduler, they have 1 that covers 2 integer threads and the FPU. We have one for each integer thread plus one for the FPU, so in a multithreaded environment I would bet on Bulldozer.

So wait, I'm probably just looking at this slide wrong. Will this appear as 1 core or 2? If it is 2, then I stand by my claim that highly threaded FP performance might suffer. But if it is 1 then my claim is off the wall and you can kindly ignore me.

IntelUser2000 · Aug 30, 2010

Scali said:
I don't think I do, can you elaborate as to what makes you think that?

Sure. The "4:6" FP unit quote.

Yes, but I was mainly talking about single-threaded performance and gaming.
While games are more multi-threaded these days, they aren't exactly on the best terms with Amdahl's Law, if you know what I mean.

But they didn't clarify what the 50% is. Since it's server, it's a combination of both single and multi. That's not hard to understand, and JFAMD is a server guy. And I think it's a bit naive for someone as knowledgeable as you to conclude how desktop will look by judging server performance.

jvroig · Aug 30, 2010

Scali said:
You have two 128-bit FMAC units per module. Zambezi will have 4 modules, correct? Now that would be 4*2 = 8 128-bit units, not 16.

He's talking about Interlagos, their 16-core Bulldozer server chip. He never talks about desktop.

JFAMD · Aug 30, 2010

Cogman said:
So wait, I'm probably just looking at this slide wrong. Will this appear as 1 core or 2? If it is 2, then I stand by my claim that highly threaded FP performance might suffer. But if it is 1 then my claim is off the wall and you can kindly ignore me.

This will be 2 cores.

But see my earlier post on FP a few replies up.

Cogman · Aug 30, 2010

IntelUser2000 said:
Sure. The "4:6" FP unit quote.

But they didn't clarify what the 50% is. Since it's server, it's a combination of both single and multi. That's not hard to understand, and JFAMD is a server guy. And I think it's a bit naive for someone as knowledgeable as you to conclude how desktop will look by judging server performance.

I wouldn't say it is naive. Server and desktop architectures have traditionally been pretty closely linked. Server chips are generally pretty good predictors of how their desktop counterparts will behave.

JFAMD · Aug 30, 2010

Oh, and round 2 of 20 questions went live this morning:

http://bit.ly/a0ykVq

If you want to ask a question, use the form:
http://blogs.amd.com/work/2010/08/10/20-questions-–-bulldozer-style/

jvroig · Aug 30, 2010

Cogman said:
So wait, I'm probably just looking at this slide wrong. Will this appear as 1 core or 2? If it is 2, then I stand by my claim that highly threaded FP performance might suffer. But if it is 1 then my claim is off the wall and you can kindly ignore me.

That will appear as 2 cores.

But since he is comparing servers, and their top end Bulldozer (Interlagos) will have 16 cores (hence, 8 modules, hence, 8 256-bit FPUs) vs the top end SB, then his math checks out.
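For anyone who wants to check the unit counts being thrown around, here is a rough Python sketch. It assumes only figures stated in this thread (2 cores per module, two 128-bit FMAC units per module that pair up as one 256-bit unit for AVX, 4 modules for Zambezi). The function name and layout are made up for illustration, and nothing here is measured.

```python
# Back-of-envelope FPU counts using only figures quoted in this thread.
# Assumptions (not measurements): 2 cores per module, two 128-bit FMAC
# units per module, which gang together as one 256-bit unit for AVX.

CORES_PER_MODULE = 2
FMACS_PER_MODULE = 2   # 128-bit each, per the posts above


def bulldozer_fp_units(cores):
    """Return (modules, 128-bit units, 256-bit units) for a Bulldozer chip."""
    modules = cores // CORES_PER_MODULE
    units_128 = modules * FMACS_PER_MODULE
    units_256 = modules          # both FMACs combined for one AVX op
    return modules, units_128, units_256


interlagos = bulldozer_fp_units(16)   # server part
zambezi = bulldozer_fp_units(8)       # desktop part, 4 modules

print("Interlagos:", interlagos)
print("Zambezi:   ", zambezi)

# Peak FP width per cycle is the same whichever way the units are counted.
assert interlagos[1] * 128 == interlagos[2] * 256
```

This is where the "16 vs 8" confusion comes from: 16 128-bit units is the Interlagos figure, 8 is the Zambezi figure.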

But that is just "mental math", if you get what I mean. Intel has silicon in hand, and Anand even got around to running benchmarks on it. AMD has not shown anything of the sort, so it's hard to say just how hard Bulldozer will suck. Or maybe not suck. Who knows.

As for me, seeing as to how I've just read this:

Idontcare said:
begone with your logically steadfast refusal to condemn that which you have yet to study! I'll have none of it, none of it I say, ya hear!

... then, for the moment, I will safely file away the ten pages worth of hate and fail against Bulldozer as a desktop product that I have laboriously typed away, at least until launch. Then by god, the benchmarks better be awe-inspiring, or righteous nerd rage shall be unleashed.

Scali · Aug 30, 2010

IntelUser2000 said:
Sure. The "4:6" FP unit quote.

I don't see it that way, I'm talking about efficiency.

IntelUser2000 said:
But they didn't clarify what the 50% is. Since it's server, it's a combination of both single and multi. That's not hard to understand, and JFAMD is a server guy. And I think it's a bit naive for someone as knowledgeable as you to conclude how desktop will look by judging server performance.

As far as I know, BD will be used for both high-end desktop (Zambezi) and server systems.

Scali · Aug 30, 2010

jvroig said:
He's talking about Interlagos, their 16-core Bulldozer server chip. He never talks about desktop.

Oh, but then we should compare to Magny Cours (the current server chip).
12 cores, 24 SIMD units.
Against 16 units on Interlagos?

AtenRa · Aug 30, 2010

Scali said:
Exactly there:
Phenom II can also do 2x128 bit per core. That is 4x128 bit per 2 cores/threads, vs 2x128 bit per module/2 cores/threads for BD.
See: http://en.wikipedia.org/wiki/File:AMD_K10_Arch.svg
Note that there are 3 FPU execution ports, two capable of 128-bit SIMD, and one additional FMISC, which also isn't present on BD.

Phenom II can do one 128-bit FADD and one 128-bit FMUL

I think one FMAC can do both FADD and FMUL, so 2 x FMACs means 2 x FADD + 2 x FMUL, plus BD's FP unit also has 2 MMX pipes (ports).

IntelUser2000 · Aug 30, 2010

Cogman said:
I wouldn't say it is naive. Server and desktop architectures have traditionally been pretty closely linked. Server chips are generally pretty good predictors of how their desktop counterparts will behave.

The reason I say this is because of where the conclusion came from. It's from "50% faster than predecessor...".

Look at the Round 2 of the questions from JFAMD's link. Now they say 80%. We are trying to scrutinize 5-10% difference when AMD themselves are playing with 20% easily. See the futility here?

And how close are client and server anyway? Even on 2P servers Nehalem was almost a no-brainer purchase decision over the Core 2 based parts, while on the desktop it wasn't that clear cut. There is some overlap in general, but not enough given the amount of variability we are playing with.

jvroig · Aug 30, 2010

Scali said:
Oh, but then we should compare to Magny Cours (the current server chip).
12 cores, 24 SIMD units.
Against 16 units on Interlagos?

That would be absolutely right.
In fact, the "33% more cores for 50% more performance" refers specifically to Magny Cours vs Interlagos. That quote caused people to condemn Bulldozer on the desktop (not server) before knowing anything about the architecture (way before Hot Chips), or whether single-thread performance could differ from max multi-threaded throughput.

Cerb · Aug 30, 2010

Scali said:
You mean they are 'good enough', as long as your standards aren't that high.
The *best* gaming CPUs come from Intel. Especially if we get into multi-GPU configurations. I hate the AMD camp always bringing budget into the equation, when clearly I was discussing performance here.

If discussing performance at any cost, why even bother replying to an AMD thread? They can't hang with Intel on that. AMD caught Intel with its pants down, and that won't happen again, at least for a solid decade.

One's standards are almost entirely what one has to spend, at the moment. Pure performance is rarely a consideration.

<- in the AMD camp, apparently, running all Intel CPUs

jvroig · Aug 30, 2010

IntelUser2000 said:
The reason I say this is because of where the conclusion came from. It's from "50% faster than predecessor...".

Look at the Round 2 of the questions from JFAMD's link. Now they say 80%. We are trying to scrutinize 5-10% difference when AMD themselves are playing with 20% easily. See the futility here?

You are mixing up two different statements, the 50% and the 80%.

50% is a chip performance statement (12 core MC vs 16 core Interlagos): "33% more cores, 50% more performance"

80% is a module comparison to CMP - it is not new, this is what they've been saying all the while, alongside the 50% quote: compared to CMP ("true" dual core), a module only delivers 80% throughput. Looks bad, but that's a "per module" measure - in the bigger picture, you save space, power, and, in the end, get more performance because you get more cores.
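To keep the two statements apart, here is the arithmetic as a small Python sketch. It uses only the numbers quoted in this thread. The variable names are invented for illustration, and the "full cores" are hypothetical standalone Bulldozer-class cores, not Magny Cours cores.

```python
# Arithmetic behind the two AMD statements, using only numbers from this thread.
# "cmp_core_equivalents" is a made-up label for illustration.

mc_cores = 12          # Magny Cours
il_cores = 16          # Interlagos
chip_speedup = 1.50    # "50% more performance", chip vs chip

core_increase = il_cores / mc_cores - 1          # about 0.33 -> "33% more cores"

module_vs_cmp = 0.80   # one module = 80% of two full (CMP) cores
cores_per_module = 2
module_in_full_cores = module_vs_cmp * cores_per_module

il_modules = il_cores // cores_per_module
cmp_core_equivalents = il_modules * module_in_full_cores

print(f"core increase: {core_increase:.0%}")
print(f"one module ~ {module_in_full_cores:.1f} full cores")
print(f"{il_modules} modules ~ {cmp_core_equivalents:.1f} full cores at full load")
```

The 50% figure compares two different chips. The 80% figure compares a module against a hypothetical pair of unshared cores of the same design, so the two numbers cannot be divided into each other.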

JFAMD · Aug 30, 2010

Comparing Magny Cours to Bulldozer you get 12 128-bit FPUs vs. 16 128-bit FPUs.

Scali · Aug 30, 2010

AtenRa said:
Phenom II can do one 128-bit FADD and one 128-bit FMUL

We're talking SIMD here though. x87 is no longer relevant in x64, it has been superseded by SSE2+.

AtenRa said:
I think one FMAC can do both FADD and FMUL, so 2 x FMACs means 2 x FADD + 2 x FMUL, plus BD's FP unit also has 2 MMX pipes (ports).

From what I understand (that's how the diagrams show it), it really only has two execution ports, which handle x87 and SIMD.
Aside from that, the combined FADD and FMUL require new instructions, so recompile of applications... Or perhaps op fusion, but I have not heard AMD say that they can do this.

Scali · Aug 30, 2010

Cerb said:
If discussing performance at any cost, why even bother replying to an AMD thread? They can't hang with Intel on that. AMD caught Intel with its pants down, and that won't happen again, at least for a solid decade.

One's standards are almost entirely what one has to spend, at the moment. Pure performance is rarely a consideration.

<- in the AMD camp, apparently, running all Intel CPUs

This post was completely unnecessary. We're just trying to figure out where Bulldozer will stand compared to current CPUs (both Intel and AMD), based on what we know so far, and what we deem possible and impossible, barring the use of pixie dust and magic unicorns.

jvroig · Aug 30, 2010

@JFAMD,

Great info on Round 2 (better than Round 1, but it's probably a personal thing). The second question there is actually something I've had in mind but never got around to asking. Like how modern OSes are HT-aware now, I would imagine it would be a performance guarantee to make sure modern OSes are somehow also "module-aware", so that 2 threads don't always end up in a single module, completely negating any performance drop caused by resource sharing.

Why exclude Apple, though? You only mentioned Windows and Linux operating systems specifically.
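The "module-aware" idea could look something like this toy Python sketch. It is purely hypothetical: it is not how any real OS scheduler works, and the function name and the (module, core) slot format are invented for illustration.

```python
# Hypothetical sketch of "module-aware" thread placement: fill one core in
# every module before putting a second thread in any module. Not how any real
# OS scheduler is implemented; the names here are invented for illustration.

def place_threads(n_threads, n_modules, cores_per_module=2):
    """Return a list of (module, core) slots, spreading across modules first."""
    capacity = n_modules * cores_per_module
    if n_threads > capacity:
        raise ValueError("more threads than cores")
    slots = []
    for t in range(n_threads):
        module = t % n_modules        # round-robin over modules
        core = t // n_modules         # second core only after all modules used
        slots.append((module, core))
    return slots


# 4 threads on a 4-module chip: one thread per module, nothing shared.
print(place_threads(4, 4))
# 6 threads: modules 0 and 1 now each run two threads.
print(place_threads(6, 4))
```

With up to one thread per module, no thread shares a module, so the sharing penalty would only kick in once the thread count exceeds the module count.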

Scali · Aug 30, 2010

JFAMD said:
Comparing Magny Cours to Bulldozer you get 12 128-bit FPUs vs. 16 128-bit FPUs.

I don't think that is correct, as each K10 core has two SIMD units, so two 128-bit FPU operations in parallel.
Which makes for 24, not 12 (this is where most of the saving in the Bulldozer module design is being done).
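The 12 vs 24 disagreement is down to what gets counted as a unit. A quick Python sketch of both conventions, using only figures quoted in this thread (the variable names are made up for illustration):

```python
# Two ways of counting 128-bit FP units, using figures quoted in this thread.
# Which convention is "right" is exactly what is being argued; this only
# makes the arithmetic explicit.

MC_CORES = 12                 # Magny Cours
IL_MODULES = 8                # Interlagos, 16 cores / 2 per module
FMACS_PER_MODULE = 2          # 128-bit FMAC units per Bulldozer module

# Convention A (JFAMD's post): one 128-bit FPU per K10 core.
mc_fpus = MC_CORES * 1
# Convention B (Scali's post): two 128-bit SIMD pipes per K10 core.
mc_simd_pipes = MC_CORES * 2

il_units = IL_MODULES * FMACS_PER_MODULE

print(f"Magny Cours: {mc_fpus} FPUs, or {mc_simd_pipes} SIMD pipes")
print(f"Interlagos:  {il_units} FMAC units")
```

Under convention A Interlagos gains units (12 to 16). Under convention B it loses pipes (24 to 16), though the K10 pipes are not interchangeable the way two FMACs are, so a raw count does not settle throughput either way.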

Riek · Aug 30, 2010

Scali said:
I don't think that is correct, as each K10 core has two SIMD units, so two 128-bit FPU operations in parallel.
Which makes for 24, not 12 (this is where most of the saving in the Bulldozer module design is being done).

I looked at the diagram you posted of the K10, and although you are correct that it has 2 units that can execute SSE instructions, I don't think it can store 2*128b/cycle (nor load 2, as I see it, though I could be wrong).
Wasn't that the big difference in design? That K10 can load/store 128 bits in one go, whereas previously it needed 2 cycles to load? I never was under the assumption that K10 could execute 2 SSE instructions/cycle.

Scali · Aug 30, 2010

Riek said:
I looked at the diagram you posted of the K10, and although you are correct that it has 2 units that can execute SSE instructions, I don't think it can store 2*128b/cycle (nor load 2, as I see it, though I could be wrong).
Wasn't that the big difference in design? That K10 can load/store 128 bits in one go, whereas previously it needed 2 cycles to load? I never was under the assumption that K10 could execute 2 SSE instructions/cycle.

It can load two 128-bit values per cycle, and execute 2 128-bit SIMD operations per cycle.
It can only store one 128-bit value per cycle though:
http://www.xbitlabs.com/articles/cpu/display/amd-k10_7.html
Then again, that is usually the pattern... eg you load two values, then add/multiply/do whatever, and store the single result.
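As a rough illustration of why one store per cycle is usually enough, here is a toy throughput bound in Python. It takes the per-cycle limits stated above at face value and ignores latency, decode width, and which operation can issue to which pipe, so treat it as a simplification. The function name is invented.

```python
# Toy throughput bound for K10 using the per-cycle limits stated above:
# 2 x 128-bit loads, 2 x 128-bit SIMD ops, 1 x 128-bit store per cycle.
# It ignores latency, decode limits and which op can go to which pipe.

from fractions import Fraction

LOADS_PER_CYCLE = 2
OPS_PER_CYCLE = 2
STORES_PER_CYCLE = 1


def min_cycles_per_iteration(loads, ops, stores):
    """Lower bound on cycles per loop iteration from port limits alone."""
    return max(
        Fraction(loads, LOADS_PER_CYCLE),
        Fraction(ops, OPS_PER_CYCLE),
        Fraction(stores, STORES_PER_CYCLE),
    )


# c[i] = a[i] + b[i]: two loads, one op, one store.
# Loads and the store both set the bound; the op pipes are half idle.
print(min_cycles_per_iteration(loads=2, ops=1, stores=1))

# c[i] = a[i] * b[i] + d[i] as separate mul and add: three loads, two ops, one store.
# Here the loads become the limit, not the single store.
print(min_cycles_per_iteration(loads=3, ops=2, stores=1))
```

In both patterns the single store port is never the only bottleneck, which fits the load-two, compute, store-one pattern.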

jvroig · Aug 30, 2010

Edited out. Sorry. Don't want to get the joke out of hand and make John Fruehe's job harder, don't want people knocking at his door expecting >40% single-thread performance improvement because of me

JFAMD · Aug 30, 2010

jvroig said:
Hehe, didn't I say that somewhere in this thread?

Just because I enjoy seeing you take on the AMD camp, here's throwing them a bone (I'm surprised nobody cited this first)

-33% more cores, 50% more performance (server comparison, i.e., maximal multi-thread output is likely, no cores sitting around at idle as in a desktop)

That means:
12 cores = 100 performance units.
16 cores = 150 performance units.

To get the "performance units per core", we simply do the math:
100 performance units / 12 cores = 8.33 performance units per core
150 performance units / 16 cores = 9.38 performance units per core

That shows only a 13% improvement in "performance units per core".

NO - DON'T DO THIS.

It takes me 45 minutes to get home in rush hour traffic. So, if I am driving the same route at 3AM it should be 45 minutes, right?

This math is ALL FAULTY.

Every time this shows up in a post it gets copied to 10 different places.

You cannot determine single threaded performance based on a fully utilized server benchmark. Period.
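To make the objection concrete, here is the quoted arithmetic as a small Python sketch, with a note on what the result actually is. The "performance units" are arbitrary, the inputs are just the figures from the quote, and the variable names are invented.

```python
# Reproduces the arithmetic quoted above and notes what it does and does not measure.
# "Performance units" are arbitrary; the inputs are the figures from the quote.

mc_cores, mc_perf = 12, 100.0    # Magny Cours, baseline
il_cores, il_perf = 16, 150.0    # Interlagos, "50% more performance"

mc_per_core = mc_perf / mc_cores     # about 8.33
il_per_core = il_perf / il_cores     # 9.375

gain = il_per_core / mc_per_core - 1     # about 0.125, quoted as "13%"

print(f"throughput per core, fully loaded: {mc_per_core:.2f} vs {il_per_core:.2f}")
print(f"per-core throughput gain: {gain:.1%}")

# This is chip throughput divided by core count with every core busy.
# It is the rush-hour number: it says how fast each car moves when the road
# is full, not how fast one car goes on an empty road at 3AM.
```

The division is valid as a statement about throughput per core under full load. It just says nothing about a single thread running alone, which is the case the traffic analogy is about.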

Martimus · Aug 30, 2010

Sometimes I never learn. If you were unfortunate enough to read my post, you at least understand how I feel. However, that post really added nothing worthwhile to this discussion.

jvroig · Aug 30, 2010

@JFAMD
I'm sorry, that was a joke, as I indicated in the disclaimer.

I realize now that if some people don't realize that, they might actually make your job harder.

So to anybody who didn't get that I was kidding because they didn't read the disclaimer, I was interpreting relevant PR data and "spun" it to the best possible outcome to get an incredible single-thread boost, and it was all a joke. Please do not quote that as any reasonable figure or bother JFAMD for any comment about it.

EDIT:

JFAMD said:
It takes me 45 minutes to get home in rush hour traffic. So, if I am driving the same route at 3AM it should be 45 minutes, right?

Actually, I did account for "rush hour traffic", by mentioning the 20% dual-thread penalty and adjusting the single-thread figure as necessary. I am now confused whether you only have a problem with the "low" estimate (13%) and not the adjusted high 40% estimate. Or do you mean it should be even higher than 40%?

Regardless, it was not meant to be taken seriously at all, whether your protestations are about 13% or (more rightfully) the basis of the figures (a server benchmark, no context at all regarding test environment, apps involved, etc - my basis for regarding the 40% figure as not founded in fact).

I hope I didn't cause you too much grief

Sorry, man. Would you prefer I edit out that post? I am not in the habit of ever editing out anything I post (it screws up the flow of the thread), but if it will keep your job from getting harder, I suppose it won't be too much of a bother for me, especially since I was only kidding.
