YABulldozerT: AMD FX Processor Prices Lower Than Expected

busydude · Sep 22, 2011

LOL_Wut_Axel said:
No, 180%.

I don't understand where you got the 80% number.

JF-AMD's reply in xtremesystems:

JFAMD @ XS said:
OK, daddy is going to do some math, everyone follow along please.

First: There is only ONE performance number that has been legally cleared, 16-core Interlagos will give 50% more throughput than 12-core Opteron 6100. This is a statement about throughput and about server workloads only. You CANNOT make any client performance assumptions about that statement.

Now, let's get started.

First, everything that I am about to say below is about THROUGHPUT and throughput is different than speed. If you do not understand that, then please stop reading here.

Second, ALL comparisons are against the same cores, these are not comparison different generations nor are they comparisons against different architectures.

Assume that a processor core has 100% throughput.

Adding a second core to an architecture is typically going to give ~95% greater throughput. There is obviously some overhead because the threads will stall, the threads will wait for each other and the threads may share data. So, two completely independent cores would equal 195% (100% for the first core, 95% for the second core.)

Looking at SPEC int and SPEC FP, Hyperthreading gives you 14% greater throughput for integer and 22% greater throughput for FP. Let's just average the two together.

One core is 100%. Two cores are 118%. Everyone following so far? We have 195% for 2 threads on 2 cores and we have 118% for 2 threads on 1 core.

Now, one bulldozer core is 100%. Running 2 threads on 2 seperate modules would lead to ~195%, it's consistent with running on two independent cores.

Running 2 threads on the same module is ~180%.

You can see why the strategy is more appealing than HT when it comes to threaded workloads. And, yes, the world is becoming more threaded.

Now, where does the 90% come from? What is 180% /2? 90%.

People have argued that there is a 10% overhead for sharing because you are not getting 200%. But, as we saw before, 2 cores actually only equals 195%, so the net per core if you divide the workload is actually 97.5%, so it is roughly a 7-8% delta from just having cores.

Now, before anyone starts complaining about this overhead and saying that AMD is compromising single thread performance (because the fanboys will), keep in mind that a processor with HT equals ~118% for 2 threads, so per thread that equals 59%, so there is a ~36% hit for HT. This is specifically why I think that people need to stay away from talking about it. If you want to pick on AMD for the 7-8%, you have to acknowledge the ~36% hit from HT. But ultimately that is not how people jusdge these things. Having 5 people in a car consumes more gas than driving alone, but nobody talks about the increase in gas consumption because it is so much less than 5 individual cars driving to the same place.

So, now you know the approximate metrics about how the numbers work out. But what does that mean to a processor? Well, let's do some rough math to show where the architecture shines.

An Orochi die has 8 cores. Let's say, for sake of argument, that if we blew up the design and said not modules, only independent cores, we'd end up with about 6 cores.

Now let's compare the two with the assumption that all of the cores are independent on one and in modules on the other. For sake of argument we will assume that all cores scale identically and that all modules scale identically. The fact that incremental cores scale to something less than 100% is already comprehended in the 180% number, so don't fixate on that. In reality the 3rd core would not be at 95% but we are holding that constant for example.

Mythical 6-core bulldozer:
100% + 95% + 95% + 95% + 95% + 95% = 575%

Orochi die with 4 modules:
180% + 180% + 180% + 180% = 720%

What if we had just done a 4 core and added HT (keeping in the same die space):
100% + 95% +95% +95% + 18% + 18% + 18% + 18% = 457%

What about a 6 core with HT (has to assume more die space):
100% + 95% +95% +95% +95% +95% + 18% + 18% + 18% + 18% + 18% + 18% = 683%

(Spoiler alert - this is a comparison using the same cores, do NOT start saying that there is a 25% performance gain over a 6-core Thuban, which I am sure someone is already starting to type.)

The reality is that by making the architecture modular and by sharing some resources you are able to squeeze more throughput out of the design than if you tried to use independent cores or tried to use HT. In the last example I did not take into consideration that the HT circuitry would have delivered an extra 5% circuitry overhead....

Every design has some degree of tradeoff involved, there is no free lunch. The goal behind BD was to increase core count and get more throughput. Because cores scale better than HT, it's the most predictable way to get there.

When you do the math on die space vs. throughput, you find that adding more cores is the best way to get to higher throughput. Taking a small hit on overall performance but having the extra space for additional cores is a much better tradeoff in my mind.

Nothing I have provided above would allow anyone to make a performance estimate of BD vs. either our current architecture or our compeition, so, everyone please use this as a learning experience and do not try to make a performance estimate, OK?

NostaSeronx · Sep 22, 2011

Idontcare said:
Shouldn't the CMT architecture have an advantage over the CMP architecture when there is only one thread running owing to the fact that the thread running on the CMT architecture has more resources available to it because the otherwise shared resources are not being shared at that point?

Is this not the case?

All the advantages of Hyperthreading is in AMD's CMT

So, name off Hyperthreading advantages then add the fact that Bulldozer has static physical pipelines for both "cores" in the OS which removes all disadvantages that Hyperthreading has

Idontcare said:
But I still don't get how both the CMT and CMP designs would yield the same 1x (i.e. the same IPC).

Core A in CMP has 4 IPC Core A in CMT has 4 IPC
Max possible IPC is static and stays the same over the designs

Only the Front End/L2/Floating Point really combined

LOL_Wut_Axel · Sep 22, 2011

busydude said:
I don't understand where you got the 80% number.

JF-AMD's reply in xtremesystems:

I don't understand where you didn't get the number. I said 180%, not 80%. That's what John says about running two threads in one module in the quote as well, so I don't know what you're arguing about.

LOL_Wut_Axel · Sep 22, 2011

Idontcare said:
So it would be something more like:
Dual-Core CMP => 1x, 2x
Dual-Core CMT => 1x, 1.8x

But I still don't get how both the CMT and CMP designs would yield the same 1x (i.e. the same IPC).

Shouldn't the CMT architecture have an advantage over the CMP architecture when there is only one thread running owing to the fact that the thread running on the CMT architecture has more resources available to it because the otherwise shared resources are not being shared at that point?

Is this not the case?

Because the CMT architecture when running a single thread has the same resources available to it as the CMP would, and not more? 2x 128-bit FMAC can be "linked" together so it can handle 256-bit SSE and AVX. If AMD were to go dedicated for this they'd have to ditch the module concept and instead make each core have 1x 256-bit FMAC, but that would require a much bigger die. In single-threaded, 1x 256-bit FMAC should have the same performance as 2x 128-bit FMAC in both SSE and AVX.

SolMiester · Sep 22, 2011

trollolo · Sep 22, 2011

SolMiester said:

window users have to go through all that? pfffft

:~$ sudo apt-get install senseofhumor

dbigers · Sep 22, 2011

We put up with that in exchange for the latest games. Actually I dual boot and I have gotten some games to work ok using WINE. But, nice to see another Linux user either way.

Nemesis 1 · Sep 22, 2011

busydude said:
I don't understand where you got the 80% number.

JF-AMD's reply in xtremesystems:

Really JF hasn't proven to be a reliable source at all . It is whats proven is he attacks those who disagree with him . and so far have been proven a better source.

Thats what these threads are about . JF was here hyping and dresden boy was hyping the micro arch for how long know? If we were all screaming how BD was going to destroy intel your post would not say the same thing . But now that the other shoe has fallen the hype has gone south and its hard for ya to take . Send mailing adderess to me in PM I will send ya a beutiful cring towel

sm625 · Sep 22, 2011

There's no reason not to hype the bd uarch. On paper it looks pretty good. I havent seen anyone explain why it would perform worse than phenom when nearly every aspect of it is improved, on paper anyway. Barcelona didnt look that great on paper. They didnt double the size of the FPU, they didnt increase the number of int clusters, they didnt have a better OoO... nothing like that at all. I wont pass judgement till the 2nd stepping, after the launch stepping.

Nemesis 1 · Sep 22, 2011

Explain to me how its better on paper . Your understanding of Micro arch. Is way differant than mine . The thing a see that is shared resources and the added hop. . I don't want someones elses understanding of BD microarch but yours.

If I want to debate JF on the microarch I will debate him on it . I just can't wait to see AVX on Intel Vs AMD bd performance . After its been recompiled for both microarch

Nemesis 1 · Sep 22, 2011

In post 126 JF goes on to tell us this and that . and uses the word THROUGHPUT on the server side . Mind telling me what clock speed is of the server BD JF is talking about . and I see that 50% number again. LOL . This guy is a complete xxxx. I suppose that you are going to in the server market pit AMD Vs. Intel based on price . Thats not going to happen. It will be AMDs best against Intels best SB-E . It will be a laugher for sure but it won't be me that loses his smile.

Regarding: "This guy is a complete xxxx."

Nemesis 1, personal attacks are not permitted here. You are fully aware of this. Offer an apology and edit your post, or accept an infraction point. Your choice.
Anandtech Moderator - Keysplayr

Well I retract my statement he is a xxxx .

Very well. Now you have 2 infraction points. One for the attack on JFAMD and the other for spitting in my face with this lame attempt at faux cooperation. You managed to not remove your initial insult and then add a second. Not.... a..... game....
You're on your way out. Keep it up.
Anandtech Moderator - Keysplayr

Googer · Sep 22, 2011

http://lenzfire.com/2011/09/amd-bulldozer-official-price-and-release-date-46209/

busydude · Sep 22, 2011

Nemesis 1 said:
JF was here hyping and dresden boy was hyping the micro arch for how long know? If we were all screaming how BD was going to destroy intel your post would not say the same thing . But now that the other shoe has fallen the hype has gone south and its hard for ya to take . Send mailing adderess to me in PM I will send ya a beutiful cring towel

So what? Didn't your wife do the same before C2D launch? Yes, C2D launch was spectacular and in the end you were right. Why can't you give them a chance.. and do the talking after BD launches?

NostaSeronx · Sep 23, 2011

Nemesis 1 said:
In post 126 JF goes on to tell us this and that . and uses the word THROUGHPUT on the server side . Mind telling me what clock speed is of the server BD JF is talking about . and I see that 50% number again. LOL . This guy is a complete tool. I suppose that you are going to in the server market pit AMD Vs. Intel based on price . Thats not going to happen. It will be AMDs best against Intels best SB-E . It will be a laugher for sure but it won't be me that loses his smile.

Is this your image of JF-AMD?

and this is to the underlined word in your quote

RussianSensation · Sep 23, 2011

Googer said:
http://lenzfire.com/2011/09/amd-bulldozer-official-price-and-release-date-46209/

AMD FX-8120 for $205 with overclocking is going to be a sweet CPU for content creation users and heavy mathematical computation work!

Now an 8 core CPU w/ 4.0ghz Turbo (!) priced below a 2500k pretty much seals the deal in my eyes that it will not be as competitive in 1-4 threaded apps (esp. not in overclocked vs. overclocked case), but pull away in 5-8 threaded tasks.

I have to say the marketing force is strong with this one: 8 cores for $200 and change. If this was an Apple product, it would sell for $400+. :biggrin:

AtenRa · Sep 23, 2011

Idontcare said:
So it would be something more like:

Dual-Core CMP => 1x, 2x
Dual-Core CMT => 1x, 1.8x

But I still don't get how both the CMT and CMP designs would yield the same 1x (i.e. the same IPC).

Shouldn't the CMT architecture have an advantage over the CMP architecture when there is only one thread running owing to the fact that the thread running on the CMT architecture has more resources available to it because the otherwise shared resources are not being shared at that point?

Is this not the case?

Theoretically speaking,
Lets assume that we have the same cores both in CMP and in CMT.

Core has 4 Integer execution pipelines (IPC = 1.3)

The only difference between the two designs is the shared Front End (in BD the FP as well)

Now, if a single core has 1.3 IPC it will have the same IPC both in CMP and in CMT when we talking about that single core(Integer), except for FP in BD(i'll come to that later)

Why ??

In CMP it is easy to understand, both cores are the same and everything is double (double the Front End, double the Execution units, double the L caches etc). So we have two cores with the same IPC (1.3)

Core 1 = Core has 4 Integer execution pipelines (IPC = 1.3)
Core 2 = Core has 4 Integer execution pipelines (IPC = 1.3)

(No sharing anything)

(Theoretically, in order to keep it simple and understand the differences of the design, each core will keep the same IPC in CMP as a single core )

Now in CMT, we combine the same cores we used in CMP but we share a single Front End. When we have a single thread, meaning a single core of the module will be used, that core has all the resources of the module for him self alone to use (except the second execution Integer unit ie the second core).

But because the single core CANNOT use any of the second Integer execution units, it's IPC will be the same (1.3) as a single core would have.

Single thread
Core 1 = Core has 4 Integer execution pipelines (IPC 1.3) (no sharing anything)

When we have two threads in CMT, the two threads use the same single shared Front End (but they don't share the Integer execution units) and because of the single Shared Front End, the IPC will degrade per core.

Two threads

Core 1 = Core has 4 Integer execution pipelines (IPC 1.2) (Sharing the front end = lower performance vs CMP)

Core 2 = Core has 4 Integer execution pipelines (IPC 1.2) (Sharing the front end = lower performance vs CMP)

Floating Point

Now because the FP is shared between two cores and each core can use the entire FP or two cores can share it (half each), then when we have a single thread that thread can take advantage of the entire FP and it will have higher IPC (per core) but when we have two threads they will share the FP execution units in two so each core will have lower IPC.

Single core : FP has 4 Execution Pipelines

Single thread
FP has 4 Execution Pipelines (IPC 1.8) (single core use all the resources)

Two threads : They share the 4 FP Execution Units

FP has 4 Execution Pipelines
Core 1: can use 2 FP Execution Pipelines (IPC 0.9)
Core 2: can use 2 FP Execution Pipelines (IPC 0.9)

Note that in dual FP Threads in CMT, we also share the same Single shared Front End AND the FP execution Units and we have a higher performance penalty than in Integer.

AtenRa · Sep 23, 2011

Just to add,

CMT is not about higher IPC (Performance) but smaller die size and lower power usage. They compromise and have a little IPC penalty when we have two threads in CMT vs CMP but at a smaller die size (shared Front end etc) and lower power usage.

Those were the goals of the Bulldozer architecture, smaller die size and lower power usage for CMP characteristics.

grimpr · Sep 23, 2011

Nemesis 1 said:
Really JF hasn't proven to be a reliable source at all . It is whats proven is he attacks those who disagree with him . and so far have been proven a better source.

Thats what these threads are about . JF was here hyping and dresden boy was hyping the micro arch for how long know? If we were all screaming how BD was going to destroy intel your post would not say the same thing . But now that the other shoe has fallen the hype has gone south and its hard for ya to take . Send mailing adderess to me in PM I will send ya a beutiful cring towel

The donkey accused the rooster for being a stubborn big head.

psolord · Sep 23, 2011

grimpr said:
The donkey accused the rooster for being a stubborn big head.

lol I thought that was a Greek only saying!

Dresdenboy · Sep 23, 2011

Nemesis 1 said:
and dresden boy was hyping the micro arch for how long know?

And when exactly did you see me doing that? When I mentioned, that BD's IPC would be constant compared to older archs?
http://forums.anandtech.com/showthread.php?p=32263642&highlight=#post32263642

I think that whatever I write will be wrong in your eyes

Phynaz · Sep 23, 2011

AtenRa said:
Now because the FP is shared between two cores and each core can use the entire FP or two cores can share it (half each)

You may be mistaken here. Hasn't AMD said the FP unit can be used by only one thread at a time? In other words, the FP unit can't be 'shared' by the INT units.

formulav8 · Sep 23, 2011

grimpr said:
The donkey accused the rooster for being a stubborn big head.

Yuk Yuk Yuk

Don't remember ever hearing that before...

inf64 · Sep 23, 2011

Dresdenboy said:
And when exactly did you see me doing that? When I mentioned, that BD's IPC would be constant compared to older archs?
http://forums.anandtech.com/showthread.php?p=32263642&highlight=#post32263642

I think that whatever I write will be wrong in your eyes

CHeck his posting history and after that use the neat forum option that has a list

.

Idontcare · Sep 23, 2011

AtenRa said:
Just to add,

CMT is not about higher IPC (Performance) but smaller die size and lower power usage. They compromise and have a little IPC penalty when we have two threads in CMT vs CMP but at a smaller die size (shared Front end etc) and lower power usage.

Those were the goals of the Bulldozer architecture, smaller die size and lower power usage for CMP characteristics.

Yeah I realize now the source of my confusion above, thanks for contributing the detailed response.

I had this in my mind:

But I was assuming the CMT design actually resulted in more favorable single-threaded performance over the CMP design, now I realize I was actually "double counting" the available resources for the CMT design over the CMP design.

Whoops.

So the final question I have is this "80%" versus "180%" number. AMD slides clearly only state "80%"...and 80% x 2 = 1.6

So is the performance scaling in going from 1 thread in one module to having two threads in one module going to merely be 1x -> 1.6x for applications which would have otherwise scaled perfectly on a CMP architecture 1x -> 2x?

inf64 · Sep 23, 2011

Idontcare said:
So is the performance scaling in going from 1 thread in one module to having two threads in one module going to merely be 1x -> 1.6x for applications which would have otherwise scaled perfectly on a CMP architecture 1x -> 2x?

Well first of all,no application out there scales perfectly with more cores. It's more like 1.8-1.9x over single core in a dual core CMP chip with no SMT. Now that we have acknowledge that,we can go back to that slide from AMD. They did say 80% of CMP design and this means 1.6x uplift versus ~1.9 which is CMP.

Now in AMD's case,the FP workloads are probably the ones that put down the average figure the most. Without FP in the mix(shared Flexfp),you would probably have something like conventional CMP scaling. But due to shared nature of big floating pint unit , the average is down to 80%.

edit: I see AtenRa posted similar post

.

YABulldozerT: AMD FX Processor Prices Lower Than Expected

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Senior member

Junior Member

Lifer

Diamond Member

Lifer

Lifer

Lifer

Diamond Member

Diamond Member

Elite Member

Lifer

Lifer

Golden Member

Platinum Member

Golden Member

Lifer

Diamond Member

Diamond Member

Elite Member

Diamond Member