Some Bulldozer and Bobcat articles have sprung up

Cogman

Lifer
Sep 19, 2000
10,286
145
106
Actually, in a multithreaded FPU environment, we would have an advantage.

In AVX we will have 8 256-bit units
In non-AVX we will have 16 128-bit units.

From everything I have seen on the server Sandy Bridge, they will have 8 256-bit AVX units, so we are generally tied on AVX code, but on non-AVX code they will only have 8 128-bit units, or half the FP capability.

Remember that most apps will not be recompiled to take advantage of AVX right away, so we have an advantage.

Also, unless they have changed their scheduler, they have 1 that covers 2 integer threads and the FPU. We have one for each integer thread plus one for the FPU, so in a multithreaded environment I would bet on Bulldozer.

1282327138uwa7eZO5M3_1_3_l.jpg


So wait, I'm probably just looking at this slide wrong. Will this appear as 1 core or 2? If it is 2, then I stand by my claim that highly threaded FP performance might suffer. But if it is 1 then my claim is off the wall and you can kindly ignore me :D.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
I don't think I do, can you elaborate as to what makes you think that?

Sure. The "4:6" FP unit quote.

Yes, but I was mainly talking about single-threaded performance and gaming.
While games are more multi-threaded these days, they aren't exactly on the best terms with Amdahl's Law, if you know what I mean.

But they didn't clarify what the 50% refers to. Since it's server, it's a combination of both single and multi. That's not hard to understand, and JFAMD is a server guy. And I think it is a bit naive for someone as knowledgeable as you to conclude how desktop will look by judging server performance.
 

jvroig

Platinum Member
Nov 4, 2009
2,394
1
81
You have two 128-bit FMAC units per module. Zambezi will have 4 modules, correct? Now that would be 4*2 = 8 128-bit units, not 16.
He's talking about Interlagos, their 16-core Bulldozer server chip. He never talks about desktop.
 

JFAMD

Senior member
May 16, 2009
565
0
0
1282327138uwa7eZO5M3_1_3_l.jpg


So wait, I'm probably just looking at this slide wrong. Will this appear as 1 core or 2? If it is 2, then I stand by my claim that highly threaded FP performance might suffer. But if it is 1 then my claim is off the wall and you can kindly ignore me :D.

This will be 2 cores.

But see my earlier post on FP a few replies up.
 

Cogman

Lifer
Sep 19, 2000
10,286
145
106
Sure. The "4:6" FP unit quote.



But they didn't clarify what the 50% refers to. Since it's server, it's a combination of both single and multi. That's not hard to understand, and JFAMD is a server guy. And I think it is a bit naive for someone as knowledgeable as you to conclude how desktop will look by judging server performance.

I wouldn't say it is naive. Server and desktop architectures have traditionally been pretty closely linked. Server chips are generally pretty good predictors of how their desktop counterparts will behave.
 

jvroig

Platinum Member
Nov 4, 2009
2,394
1
81
So wait, I'm probably just looking at this slide wrong. Will this appear as 1 core or 2? If it is 2, then I stand by my claim that highly threaded FP performance might suffer. But if it is 1 then my claim is off the wall and you can kindly ignore me
That will appear as 2 cores.

But since he is comparing servers, and their top end Bulldozer (Interlagos) will have 16 cores (hence, 8 modules, hence, 8 256-bit FPUs) vs the top end SB, then his math checks out.
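The unit counts being traded back and forth can be checked with a few lines of Python. This is only a sketch of the arithmetic in the posts above: the constants (2 cores per module, two 128-bit FMACs per module, the pair acting as one 256-bit unit for AVX) are the figures quoted in this thread, and the function name `fp_units` is made up for illustration.

```python
# Figures assumed from the thread: 2 cores and two 128-bit FMACs per module.
CORES_PER_MODULE = 2
FMACS_PER_MODULE = 2


def fp_units(cores: int) -> dict:
    """Count FP units for a chip with the given number of cores."""
    modules = cores // CORES_PER_MODULE
    return {
        "modules": modules,
        # Non-AVX code: each 128-bit FMAC counts as its own unit.
        "units_128": modules * FMACS_PER_MODULE,
        # AVX code: the two FMACs in a module are counted as one 256-bit unit.
        "units_256": modules,
    }


interlagos = fp_units(16)  # 16-core server part
zambezi = fp_units(8)      # 4-module desktop part
print("Interlagos:", interlagos)
print("Zambezi:", zambezi)
```

Under those assumptions the 16-core part comes out at 16 128-bit units or 8 256-bit units, and the 4-module desktop part at 8 128-bit units, which matches both sets of numbers quoted above.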

But that is just "mental math", if you get what I mean. Intel has silicon in hand, and Anand even got around to running benchmarks on it. AMD has not shown anything of the sort, so it's hard to say just how hard Bulldozer will suck. Or maybe not suck. Who knows.

As for me, seeing as to how I've just read this:
begone with your logically steadfast refusal to condemn that which you have yet to study! I'll have none of it, none of it I say, ya hear!
... then, for the moment, I will safely file away the ten pages worth of hate and fail against Bulldozer as a desktop product that I have laboriously typed away, at least until launch. Then by god, the benchmarks better be awe-inspiring, or righteous nerd rage shall be unleashed.
 

Scali

Banned
Dec 3, 2004
2,495
0
0
Sure. The "4:6" FP unit quote.

I don't see it that way, I'm talking about efficiency.

But they didn't clarify what the 50% refers to. Since it's server, it's a combination of both single and multi. That's not hard to understand, and JFAMD is a server guy. And I think it is a bit naive for someone as knowledgeable as you to conclude how desktop will look by judging server performance.

As far as I know, BD will be used for both high-end desktop (Zambezi) and server systems.
 

Scali

Banned
Dec 3, 2004
2,495
0
0
He's talking about Interlagos, their 16-core Bulldozer server chip. He never talks about desktop.

Oh, but then we should compare to Magny Cours (the current server chip).
12 cores, 24 SIMD units.
Against 16 units on Interlagos?
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
Exactly there:
Phenom II can also do 2x128 bit per core. That is 4x128 bit per 2 cores/threads, vs 2x128 bit per module/2 cores/threads for BD.
See: http://en.wikipedia.org/wiki/File:AMD_K10_Arch.svg
Note that there are 3 FPU execution ports, two capable of 128-bit SIMD, and one additional FMISC, which also isn't present on BD.

Phenom II can do one 128-bit FADD and one 128-bit FMUL

I think one FMAC can do FADD and FMUL, so 2 x FMACs means 2 x FADD + 2 x FMUL, plus BD's FP unit also has 2 MMX pipes (ports).
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
I wouldn't say it is naive. Server and desktop architectures have traditionally been pretty closely linked. Server chips are generally pretty good predictors of how their desktop counterparts will behave.

The reason I say this is because of where the conclusion came from. It's from "50% faster than predecessor...".

Look at the Round 2 of the questions from JFAMD's link. Now they say 80%. We are trying to scrutinize 5-10% difference when AMD themselves are playing with 20% easily. See the futility here?

And how close are client and server anyway? Even in 2P servers, Nehalem was almost a no-brainer purchase over the Core 2 based parts, while on the desktop it wasn't that clear-cut. There is some overlap in general, but not enough given the amount of variability we are playing with.
 

jvroig

Platinum Member
Nov 4, 2009
2,394
1
81
Oh, but then we should compare to Magny Cours (the current server chip).
12 cores, 24 SIMD units.
Against 16 units on Interlagos?
That would be absolutely right.
In fact, the "33% more cores for 50% more performance" refers specifically to Magny Cours vs Interlagos, which caused people to condemn Bulldozer desktop (not server), even before knowing anything about the architecture (way before Hot Chips) or whether it's possible that single-thread performance could possibly differ from max multi-threaded throughput.
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
You mean they are 'good enough', as long as your standards aren't that high.
The *best* gaming CPUs come from Intel. Especially if we get into multi-GPU configurations. I hate the AMD camp always bringing budget into the equation, when clearly I was discussing performance here.
If discussing performance at any cost, why even bother replying to an AMD thread? They can't hang with Intel on that. AMD caught Intel with its pants down, and that won't happen again, at least for a solid decade.

One's standards are almost entirely what one has to spend, at the moment. Pure performance is rarely a consideration.

<- in the AMD camp, apparently, running all Intel CPUs :rolleyes:
 

jvroig

Platinum Member
Nov 4, 2009
2,394
1
81
The reason I say this is because of where the conclusion came from. It's from "50% faster than predecessor...".

Look at the Round 2 of the questions from JFAMD's link. Now they say 80%. We are trying to scrutinize 5-10% difference when AMD themselves are playing with 20% easily. See the futility here?
You are mixing up two different statements, the 50% and the 80%.

50% is a chip performance statement (12 core MC vs 16 core Interlagos): "33% more cores, 50% more performance"

80% is a module comparison to CMP - it is not new, this is what they've been saying all the while, alongside the 50% quote: compared to CMP ("true" dual core), a module only delivers 80% throughput. Looks bad, but that's a "per module" measure - in the bigger picture, you save space, power, and, in the end, get more performance because you get more cores.
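To keep the two figures apart, here is a small Python sketch of what each one measures. The numbers come straight from the quotes above; the baseline of 2.0 for a "true" dual core (CMP) is a hypothetical normalisation chosen for illustration, not an AMD figure.

```python
def percent_more(new: float, old: float) -> float:
    """Percentage increase of new over old."""
    return (new - old) / old * 100.0


# Statement 1: whole-chip comparison, 12-core Magny Cours vs 16-core Interlagos.
core_gain = percent_more(16, 12)      # "33% more cores"
chip_gain = percent_more(150, 100)    # "50% more performance", normalised

# Statement 2: one module vs a CMP dual core.
# Hypothetical baseline: two fully independent cores deliver 2.0 units.
CMP_THROUGHPUT = 2.0
module_throughput = 0.8 * CMP_THROUGHPUT  # "a module only delivers 80%"

print(f"cores: +{core_gain:.1f}%, chip performance: +{chip_gain:.1f}%")
print(f"module throughput relative to one core: {module_throughput:.1f}")
```

The first pair of numbers describes a fully loaded chip; the second describes one module against one dual core. Neither says anything directly about a single thread running alone.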
 

JFAMD

Senior member
May 16, 2009
565
0
0
Comparing Magny Cours to Bulldozer you get 12 128-bit FPUs vs. 16 128-bit FPUs.
 

Scali

Banned
Dec 3, 2004
2,495
0
0
Phenom II can do one 128-bit FADD and one 128-bit FMUL

We're talking SIMD here though. x87 is no longer relevant in x64, it has been superseded by SSE2+.

I think one FMAC can do FADD and FMUL, so 2 x FMACs means 2 x FADD + 2 x FMUL, plus BD's FP unit also has 2 MMX pipes (ports).

From what I understand (that's how the diagrams show it), it really only has two execution ports, which handle x87 and SIMD.
Aside from that, the combined FADD and FMUL require new instructions, so recompile of applications... Or perhaps op fusion, but I have not heard AMD say that they can do this.
 

Scali

Banned
Dec 3, 2004
2,495
0
0
If discussing performance at any cost, why even bother replying to an AMD thread? They can't hang with Intel on that. AMD caught Intel with its pants down, and that won't happen again, at least for a solid decade.

One's standards are almost entirely what one has to spend, at the moment. Pure performance is rarely a consideration.

<- in the AMD camp, apparently, running all Intel CPUs :rolleyes:

This post was completely unnecessary. We're just trying to figure out where Bulldozer will stand compared to current CPUs (both Intel and AMD), based on what we know so far, and what we deem possible and impossible, barring the use of pixie dust and magic unicorns.
 

jvroig

Platinum Member
Nov 4, 2009
2,394
1
81
@JFAMD,

Great info on Round 2 (better than Round 1, but it's probably a personal thing). The second question there is actually something I've had in mind but never got around to asking. Like how modern OSes are HT-aware now, I would imagine it would be a performance guarantee to make sure modern OSes are somehow also "module-aware", such that 2 threads don't always end up in a single module, to completely negate any performance drop caused by resource sharing.

Why exclude Apple, though? You only mentioned Windows and Linux operating systems specifically.
 

Scali

Banned
Dec 3, 2004
2,495
0
0
Comparing Magny Cours to Bulldozer you get 12 128-bit FPUs vs. 16 128-bit FPUs.

I don't think that is correct, as each K10 core has two SIMD units, so two 128-bit FPU operations in parallel.
Which makes for 24, not 12 (this is where most of the saving in the Bulldozer module design is being done).
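The disagreement here is about what gets counted, and a short Python sketch makes the two conventions explicit. The per-core figures are assumptions taken from the posts: two 128-bit SIMD units per K10 core, and two 128-bit FMACs shared by the two cores of a Bulldozer module (so one per core on average). `total_simd_units` is a hypothetical helper name.

```python
def total_simd_units(cores: int, units_per_core: int) -> int:
    """Total 128-bit SIMD execution units across the chip."""
    return cores * units_per_core


# Counting execution units, as in this post.
magny_cours_units = total_simd_units(12, 2)   # 12 K10 cores, 2 units each
interlagos_units = total_simd_units(16, 1)    # 16 cores, 2 FMACs per 2-core module

# Counting one FPU block per core, as in the "12 vs. 16" statement.
magny_cours_fpus = 12
interlagos_fpus = 16

print("execution units:", magny_cours_units, "vs", interlagos_units)
print("FPU blocks:     ", magny_cours_fpus, "vs", interlagos_fpus)
print("unit ratio Interlagos/Magny Cours:", interlagos_units / magny_cours_units)
```

Both counts are consistent with the thread; they just answer different questions. Note that an FMAC can issue a fused multiply-add, so raw unit counts do not translate directly into throughput.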
 

Riek

Senior member
Dec 16, 2008
409
15
76
I don't think that is correct, as each K10 core has two SIMD units, so two 128-bit FPU operations in parallel.
Which makes for 24, not 12 (this is where most of the saving in the Bulldozer module design is being done).

I looked at the diagram you posted of the K10, and although you are correct that it has 2 units with room to execute SSE instructions, I don't think it can store 2*128b/cycle (nor load 2, as I see it; I could be wrong though).
Wasn't that the big difference in design? That K10 can load/store 128 bits in one go, whereas previously it needed 2 cycles to load? I was never under the assumption that K10 could execute 2 SSE instructions/cycle.
 

Scali

Banned
Dec 3, 2004
2,495
0
0
I looked at the diagram you posted of the K10, and although you are correct that it has 2 units with room to execute SSE instructions, I don't think it can store 2/cycle (nor load 2, as I see it; I could be wrong though).
Wasn't that the big difference in design? That K10 can load/store 128 bits in one go, whereas previously it needed 2 cycles to load? I was never under the assumption that K10 could execute 2 SSE instructions/cycle.

It can load two 128-bit values per cycle, and execute 2 128-bit SIMD operations per cycle.
It can only store one 128-bit value per cycle though:
http://www.xbitlabs.com/articles/cpu/display/amd-k10_7.html
Then again, that is usually the pattern... eg you load two values, then add/multiply/do whatever, and store the single result.
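As a back-of-envelope check on that pattern, the Python sketch below treats each resource as an independent per-cycle limit and reports the bottleneck. The per-cycle limits are the ones claimed in this post (2 loads, 1 store, 2 SIMD operations); the model itself is a simplifying assumption that ignores latency, caches, decode width and any sharing between load and store ports.

```python
# Per-cycle limits for 128-bit operations on K10, as stated in the post above.
PER_CYCLE = {"load": 2, "store": 1, "simd_op": 2}


def cycles_per_iteration(loads: int, stores: int, simd_ops: int) -> float:
    """Minimum cycles per loop iteration under a simple bottleneck model."""
    need = {"load": loads, "store": stores, "simd_op": simd_ops}
    return max(need[kind] / PER_CYCLE[kind] for kind in need)


# c[i] = a[i] + b[i]: load two values, one operation, store one result.
typical = cycles_per_iteration(loads=2, stores=1, simd_ops=1)

# A store-heavy loop (one load, two stores) is limited by the single store.
store_heavy = cycles_per_iteration(loads=1, stores=2, simd_ops=1)

print("load-load-op-store:", typical, "cycle(s) per iteration")
print("store-heavy:", store_heavy, "cycle(s) per iteration")
```

Under this model the single store per cycle does not hurt the common load-load-op-store loop, since the two loads already take a full cycle; it only becomes the limit when a loop writes more than it reads.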
 

jvroig

Platinum Member
Nov 4, 2009
2,394
1
81
Edited out. Sorry. Don't want to get the joke out of hand and make John Fruehe's job harder, don't want people knocking at his door expecting >40% single-thread performance improvement because of me :)
 

JFAMD

Senior member
May 16, 2009
565
0
0
Hehe, didn't I say that somewhere in this thread?

Just because I enjoy seeing you take on the AMD camp, here's throwing them a bone (I'm surprised nobody cited this first)

-33% more cores, 50% more performance (server comparison, i.e., maximal multi-thread output is likely, no cores sitting around at idle as in a desktop)

That means:
12 cores = 100 performance units.
16 cores = 150 performance units.

To get the "performance units per core", we simply do the math:
100 performance units / 12 cores = 8.33 performance units per core
150 performance units / 16 cores = 9.38 performance units per core

That shows only a 13% improvement in "performance units per core".

NO - DON'T DO THIS.

It takes me 45 minutes to get home in rush hour traffic. So, if I am driving the same route at 3AM it should be 45 minutes, right?

This math is ALL FAULTY.

Every time this shows up in a post it gets copied to 10 different places.

You cannot determine single threaded performance based on a fully utilized server benchmark. Period.
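For reference, the arithmetic from the quoted post can be reproduced in a few lines of Python. The inputs are the normalised "performance units" from the quote, not measurements, and `throughput_per_core` is a hypothetical helper name. What the result describes is average throughput per core on a fully loaded chip, which is exactly the quantity that, per the objection above, should not be read as single-threaded performance.

```python
def throughput_per_core(total_performance: float, cores: int) -> float:
    """Average throughput per core when every core is busy."""
    return total_performance / cores


magny_cours = throughput_per_core(100, 12)   # 12 cores = 100 performance units
interlagos = throughput_per_core(150, 16)    # 16 cores = 150 performance units

# Relative change in fully loaded per-core throughput, in percent.
loaded_gain_pct = (interlagos / magny_cours - 1) * 100

print(f"Magny Cours: {magny_cours:.2f} units per core (all cores loaded)")
print(f"Interlagos:  {interlagos:.2f} units per core (all cores loaded)")
print(f"Change under full load: {loaded_gain_pct:.1f}%")
```

The division is correct as arithmetic, but a core sharing a module, caches and memory bandwidth with 15 busy neighbours is in the "rush hour" case. The number says nothing about how one thread performs on an otherwise idle chip.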
 

Martimus

Diamond Member
Apr 24, 2007
4,490
157
106
Sometimes I never learn. If you were unfortunate enough to read my post, you at least understand how I feel. However, that post really added nothing worthwhile to this discussion.
 

jvroig

Platinum Member
Nov 4, 2009
2,394
1
81
@JFAMD
I'm sorry, that was a joke, as I indicated in the disclaimer.

I realize now that if some people don't realize that, they might actually make your job harder.

So to anybody who didn't get that I was kidding because they didn't read the disclaimer, I was interpreting relevant PR data and "spun" it to the best possible outcome to get an incredible single-thread boost, and it was all a joke. Please do not quote that as any reasonable figure or bother JFAMD for any comment about it.


EDIT:

It takes me 45 minutes to get home in rush hour traffic. So, if I am driving the same route at 3AM it should be 45 minutes, right?
Actually, I did account for "rush hour traffic" by mentioning the 20% dual-thread penalty and adjusting the single-thread figure as necessary. I am now confused as to whether you only have a problem with the "low" estimate (13%) and not the adjusted high 40% estimate. Or do you mean it should be even higher than 40%?

Regardless, it was not meant to be taken seriously at all, whether your protestations are about 13% or (more rightfully) the basis of the figures (a server benchmark, no context at all regarding test environment, apps involved, etc - my basis for regarding the 40% figure as not founded in fact).

I hope I didn't cause you too much grief :) Sorry, man. Would you prefer I edit out that post? I am not in the habit of ever editing out anything I post (it screws up the flow of the thread), but if it will help keep your job from getting harder, I suppose it won't be too much of a bother for me, especially since I was only kidding.
 