
S/A: "AMD outs bulldozer based orochi die"

P.S. I thought the following comment was interesting.
Well, that info is the same old info we've had since modules were made known. "At least in initial designs" is probably just the author's way of saying "well, this is supposedly how it is according to the little info they've shared, but this can change since the final product isn't out yet".
 
Here are some simple numbers (for demonstration only):

1 thread on 1 core: 100%
2 threads on 1 core (HT): 120%
2 threads on 2 cores in 1 module: 180%
2 threads on 2 cores in 2 modules: ~200%

I say ~200% because Amdahl's law prevents it from really being 200%, but you get the general idea.
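To make those demonstration numbers concrete, here's a quick sketch in Python; the 20% and 80% uplifts are just the ballpark figures being tossed around in this thread, not measurements.

```python
# Demonstration only: aggregate throughput relative to 1 thread on 1 core.
# The 20% HT uplift and 80% second-core uplift are the rough figures quoted
# in this thread, not benchmark results.
BASE = 1.00          # 1 thread on 1 core = 100%
HT_UPLIFT = 0.20     # second thread on the same HT core adds ~20%
CMT_UPLIFT = 0.80    # second core in the same module adds ~80% (JFAMD's figure)

scenarios = {
    "1 thread on 1 core":               BASE,
    "2 threads on 1 core (HT)":         BASE + HT_UPLIFT,
    "2 threads on 2 cores, 1 module":   BASE + CMT_UPLIFT,
    "2 threads on 2 cores, 2 modules":  2 * BASE,   # ~200%, before Amdahl's law
}

for name, total in scenarios.items():
    print(f"{name}: {total:.0%}")
```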

As you can see, when you start adding more threads, an HT system takes more of a penalty for putting multiple threads on the same core.

That is good to see. I thought that 2 threads on 2 cores in 1 module would be 160% (80% each), so a fully loaded 8-core BD would perform similarly to a theoretical penalty-free 7.2-core BD. That should definitely compete well with, and possibly surpass, Gulftown, assuming the clocks are there. If Intel keeps Gulftown at $900+, that would certainly make it easier to jump over to AMD next round.
 
Based on the Hot Chips presentation of 80% throughput for a module, then it most certainly is 160%, and not 180%.

Either JFAMD was just in a hurry when he wrote his post, or maybe clarification is in order.
 
 
Well, the performance delta from a separate module means BD will have to have some pretty significant IPC improvements to keep up with SB.

I'm impressed with that scaling nonetheless.
 
It is 80% for the second core, not both. 100 + 80 = 180.

And server will have turbo, but as someone put it earlier "turbo done right." I am not a fan of the way my competitor implemented turbo today. For server workloads it is too little and too inconsistent.
 
Yes, I like the coloring as well, but I think it's artificial, like the colored electron microscope shots of viruses etc.; they originally have no color.



OK, I guess there's nothing like the power of showing an actual fabricated sample of a design sitting right in front of you to deliver the message, "See, we got it done! It's not phantomware phenomware!"

Corrected

(sorry, I couldn't resist)
 
There will be lots of enhancements to the memory controller, significant advantages there. But no details until launch. The big message there is that there will be a big improvement in memory throughput, and that improvement is not tied to higher speed memory per se. If you put in the same speed memory you would see a pretty good jump in throughput.

Good to know. I'm sort of hoping for more HT Assist improvements (yeah, I know, like I'm going to be fiddling with any enterprise-class systems) since it seemed to do good things for Istanbul and Magny-Cours. NB/IMC improvements are all good in my book, says the NB-overclocking whore.
 
So now the max throughput for one module is 90% of a traditionally-designed dual core?

Unless I am mistaken that has been the discussed trade-off in going with a CMT design since day one?

 
Unless I am mistaken that has been the discussed trade-off in going with a CMT design since day one?
I remember that from long ago, yes, and even from JFAMD as well, way before Hot Chips (whenever he says "HT = ~20% more, while BD way = 80% more"). I guess he remains consistent even to this day.

But AMD at Hot Chips quotes an 80% figure in a different context: http://www.anandtech.com/Gallery/Album/754#6

Every figure quoted so far is an estimate (far worse, actually: an estimate of averages), of course, but it's disconcerting when estimates don't match. Since the Hot Chips info is official AND most recent, one would naturally conclude it is the most up-to-date.

The difference between 160% and 180% is rather significant. In terms of what the second core contributes, the first is "just a bit over half" of a full extra core, while the second is "over three-fourths", or "pretty close to a full" extra core.

I realize I am a non-native English speaker so perhaps it is my fault that I misinterpret the wording in the slide, but no matter how I try to dissect the sentence, "80% of the CMP performance" means 80% of a traditional dual core, and in no way could be interpreted as "additional 80% performance added to a single core". So if a single core is 100% and a dual core is 200%, then the module will be 80% of that, which is 160%. Not 100+80=180%.
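To put the two readings side by side, here's the arithmetic as a tiny sketch; both lines just apply the slide's 80% figure, nothing more.

```python
# Two possible readings of "80% of the CMP performance" for one module.
cmp_dual_core = 2.00                # a traditional dual core = 200% of one core

reading_a = 0.80 * cmp_dual_core    # "80% of a dual core"    -> 160%
reading_b = 1.00 + 0.80             # "second core adds 80%"  -> 180%

print(f"80% of a CMP dual core:  {reading_a:.0%}")   # 160%
print(f"100% + 80% second core:  {reading_b:.0%}")   # 180%
```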

What do you make of that?

Of course, it could all be as harmless as: AMD would rather be officially too conservative with the estimate (so they quoted 80% as the average, probably the "low" estimate their engineers came up with), while JFAMD would rather be a little bit bolder while still being grounded in the reality that their labs tell them, so he quotes 90% (probably the "high" estimate their engineers came up with).
 
Would it be easier if we simplified it to "two to three times the throughput of HT"?

I think too many people are getting too wrapped up and fixated on percentages.

Arguing it is 60 or 80 when the competition is at 20 is really at the point of diminishing returns.
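For what it's worth, here's a rough back-of-the-envelope check of that simplification, using only the ballpark uplift ranges mentioned in this thread (assumptions, not measurements):

```python
# Ratio of the CMT uplift to the HT uplift, using the thread's rough ranges.
ht_uplift  = (0.20, 0.30)   # ~20-30% from Hyper-Threading
cmt_uplift = (0.60, 0.80)   # ~60-80% from the second core in a module

low  = cmt_uplift[0] / ht_uplift[1]   # 0.60 / 0.30 = 2.0
high = cmt_uplift[1] / ht_uplift[0]   # 0.80 / 0.20 = 4.0
print(f"CMT uplift is roughly {low:.1f}x to {high:.1f}x the HT uplift")
```

Depending on which ends of the ranges you pair up, it lands anywhere from 2x to 4x, so "two to three times" sits comfortably inside that range.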

Let's worry about this when the benchmarks are out.
 
It's because the gap between 160% and 180% is huge, not just spare change. If the current product lineup is used to quantify it, a 20% performance difference is probably around 3-4 SKUs apart (say, Phenom II 925 vs Phenom II 965). Now, if the delta between the Hot Chips estimate and your estimate were just a measly 5%, nobody would care (or at least, I wouldn't; I'm sure someone out there would, chances are). But as it was, 160 and 180 are totally different beasts altogether. However, this is in the past now, since you did concede to a wider range of performance targets, and so the rational conclusion is that you are indeed quoting a "high" estimate while at Hot Chips a more cautious "low" estimate was used. It makes sense, but yes, you did have to clarify it for those of us who do care.

Let's worry about this when the benchmarks are out.
I certainly would, but at least I'd like to know how to interpret those benchmarks when they come. If I see 70% throughput of CMP, should I be disappointed? What if I see 80% - is that still short of the goal, or actually a bull's-eye? If I see 90%, should that mean AMD over-delivered, or did AMD just deliver as promised / hit the bull's-eye? That's the reason behind the "nit-pickiness" you may have been irritated at - we're just trying to clarify exactly what you are promising so that we can use that to judge the benchmarks that come later. Your promises (and by this I mean "you" as in AMD as a company) will be our basis for judging whether benchmarks are "bad", "acceptable", "good", or "Intel-crushing silicon awesomeness".
 


Maybe JFAMD will correct me, or not. From what I read, this is what I gather.

OK, let's break this down to nuts and bolts. I know this CPU architecture is a bit confusing because of the modules and cores.

Let's just look at the pipeline layout in an i7 and compare that to what they are telling us about Bulldozer.

An i7 has 6 physical cores. These cores each have a Hyper-Threading component that shoves instructions through when the physical core is underutilized. That gives the operating system the appearance of 12 logical cores even though there are only 6. Many people have stated this gives a net improvement of 20-30% in CPU efficiency versus a regular non-hyperthreaded 6-core.

The Bulldozer modules are a bit different. BD for short.

Each module has 2 cores, so in a 4-module chip you have 8 logical cores.

The advantage is that it behaves like an 8-core chip with lower TDP, because the architecture shares a lot of components that are normally underutilized.

So instead of having 6 physical cores that give you 80% efficiency and 6 virtual Hyper-Threading cores that give those cores a 20% increase over the physical core,

BD gives you 8 cores all running at 80% efficiency.

In theory. If I understand the points they make on the architecture correctly, it's more cores in the same die space, in effect.
 
I certainly would, but at least I'd like to know how to interpret those benchmarks when they come. If I see 70% throughput of CMP, should I be disappointed? What if I see 80%, is that still short of the goal, or actually bulls-eye? If I see 90%, should that mean AMD over-delivered, or did AMD just deliver as promised / bulls-eye? That's the reason behind the "nit-pickiness" you may have been irritated at - we're just trying to clarify exactly what you are promising so that we can use that to judge the benchmarks to come later. Your promises (and by this I mean "you" as in AMD as a company) will be our basis in judging whether benchmarks are "bad", "acceptable", "good", or "intel-raping silicon-awesomeness".

The way I have seen it proposed is that a single thread should get 100% of the resources, and therefore 100% of the performance possible from the chip. The question you are getting at is whether a second core on the same module would add 60% or 80% performance.

Now it could be that it adds 80% performance, but the first thread would lose 20% performance due to sharing some of the functions of the module (which would really be adding 60%, I guess). The other option is that the first thread is unaffected by the second thread, and only the second thread gets the 20% performance hit due to waiting for shared resources to free up. The third option is something you brought up, which is that each thread will only get 90% of its theoretical maximum performance due to shared resources.

Of course none of this really means much, since the type of work being done by each thread will really affect how much of the shared resources have conflicts between the threads. Also, the timing of when those resources will be needed for each thread will affect how efficient each core is as well. I don't see how any number being thrown out there today will mean much until the concept is tested with useful code.

Maybe if we knew the percentage of the modules resources that are shared, and can not be used by both threads at the same time, then we might be able to draw some more meaningful conclusions.
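If it helps, here's the arithmetic behind the three options as a small sketch; the per-thread splits are just the hypothetical cases from the post above, and what matters is the resulting module throughput versus a traditional (CMP) dual core at 200%.

```python
# Hypothetical per-thread splits from the three options above (not data).
scenarios = {
    "first thread loses 20%, second runs at 80%":   (0.80, 0.80),
    "first thread untouched, second loses 20%":     (1.00, 0.80),
    "both threads lose ~10% to shared resources":   (0.90, 0.90),
}

CMP_DUAL = 2.00   # traditional dual-core baseline

for name, (t1, t2) in scenarios.items():
    total = t1 + t2
    print(f"{name}: module throughput {total:.0%} "
          f"({total / CMP_DUAL:.0%} of a CMP dual core)")
```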
 
Against my better judgement, I will try to point out that what you said not only has no bearing on the conversation, but also pretty much butchers the concept, and HT as well. I have commented previously that you don't seem to understand how HT works at all. Your new comment about "6 physical cores that give you 80% efficiency and 6 virtual Hyper-Threading cores that give those cores a 20% increase over the physical core" reinforces my belief, and to add to that, it seems you don't understand Bulldozer either, or even just CPU design in general.

I hardly know where to start.

1.) First of all, in an i7, any single core runs at 100% efficiency. By this, we simply take a single core's performance as the baseline, hence, 100%. Why you would randomly tag it as "80%" is beyond me.

2.) HT adds ~20% to that baseline. Maybe 20-30%, no problem. Final performance is then anywhere around 120%.

3.) There used to be a penalty when OSes were non-HT-aware, such as when HT first appeared in the Pentium line. This is long gone, and modern operating systems have schedulers smart enough to avoid the penalty. There probably are a rare few exceptions, but they are now just that: the exception rather than the rule. As a rule, hyperthreading is a clear win as far as performance is concerned.

4.) Equating the core efficiency of an i7 with the core efficiency of a Bulldozer (setting both at 80%, at parity) betrays your assumption that, at maximal multi-threaded loads (as you say, "8 cores all running"), the BD module suffers no penalty. In fact, the opposite is true: performance-wise, there is a penalty to dual-threading in each module (that is, in fact, the topic of the conversation you entered - attempting to clarify just how much of a penalty should be expected). So a BD chip will not have the same performance characteristics as an i7 - an i7, being a traditional CMP-based chip, will always have the baseline 100% throughput per core, while a Bulldozer chip, being a CMT-based chip, will always be below the baseline (a known tradeoff).

5.) Since you believe that multi-threaded throughput remains constant from a regular CMP to a CMT chip like Bulldozer, and yet significant die space is saved ("more cores in the same die space, in effect"), I cannot imagine that you actually understand the topic at hand. What you are suggesting is magic.
 
Where's the facepalm smiley?

Hyperthreading just stuffs more through the pipeline.

http://en.wikipedia.org/wiki/Hyper-threading



 
The way I have seen it proposed is that a single thread should get 100% of the resources, and therfore 100% of the performance possible from the chip.
Not exactly 100%, but negligible penalty - probably anywhere around 95 - 99.9%. But yes, that is mostly true.

The question you are getting at is whether a second core on the same module would add 60% or 80% performance.
I suppose it can be said like that, especially in cases where threads being run aren't really related to each other.

Now it could be that it adds 80% performance, but the first thread would lose 20% performance due to sharing some of the functions of the module. (which would really be adding 60% I guess)
Since throughput is still 80% of CMP, then yes, this would still be in-line with the Hot Chips statement. It doesn't matter if the first core loses 20%, or the first core loses 40% while the second core gets to be 100% efficient. All that matters is the max throughput achievable is 80% of what a comparable CMP chip would achieve.

The other option is that the first thread is unaffected by the second thread, and only the second thread gets the 20% performance hit due to waiting for shared resources to free up.
If you mean this to be an explanation of the Hot Chips statement, then I will have to disagree. The statement is 80% of the throughput of CMP. By this option you present, the max throughput will be 90% of CMP. We thus end up at square one with the Hot Chips vs JFAMD numbers. I believe this to be an unlikely explanation.

The third option is something you brought up, which is that each thread will only get 90% of their theoretical maximum performance due to shared resources.
Yes. Like your first option, it really doesn't matter whether they get 90% each, or one gets 60% while the other gets 100%. As long as the max throughput of a module is 80% of a CMP equivalent, then all is well as it will end up hitting the Hot Chips target.

Of course none of this really means much, since the type of work being done by each thread will really affect how much of the shared resources have conflicts between the threads. Also, the timing of when those resources will be needed for each thread will affect how efficient each core is as well. I don't see how any number being thrown out there today will mean much until the concept is tested with useful code.
Figures quoted today matter, in the same way that the 40% figure bandied about before mattered (and wasn't met). It's their target. It's what is being promised. If they tell us to expect 80%, then we'll look forward to getting 80%, and not just a measly 50%. If they tell us to expect 90%, then we'll look forward to getting 90%.

Anyway, it's pretty much been settled with JFAMD casting a wider net. Like I said in an earlier post, the rational conclusion simply seems to be that one quoted a lower estimate, while another quoted a higher estimate, all based on the estimates brought forth by their engineers. It's just nice to have seeming inconsistencies clarified, although none of them are earth-shattering in nature.
 
4.) Equating the core efficiency of an i7 with the core efficiency of a Bulldozer (setting both at 80%, at parity) betrays your assumption that, at maximal multi-threaded loads (as you say, "8 cores all running"), the BD module suffers no penalty. In fact, the opposite is true: performance-wise, there is a penalty to dual-threading in each module (that is, in fact, the topic of the conversation you entered - attempting to clarify just how much of a penalty should be expected). So a BD chip will not have the same performance characteristics as an i7 - an i7, being a traditional CMP-based chip, will always have the baseline 100% throughput per core, while a Bulldozer chip, being a CMT-based chip, will always be below the baseline (a known tradeoff).

I realize that my understanding of the acronyms you are using is shaky at best. (I work in an industry where 95% of our terms are acronyms, so they all just mix together at some point.) I hope that I was understanding you correctly, since I was equating CMT to AMD's module approach, while I equated CMP to Hyperthreading. If I am wrong in that assumption, then you can completely ignore my post 😛

I don't understand why you believe that the baseline is less than 100% on a CMT chip, while it is 100% on a CMP chip. The two are very similar in that they share resources, but the CMT chip shares fewer resources. The CMP method does have penalties on the original thread when using HT, due to cache thrashing and scheduling conflicts (since neither thread running through the pipeline has priority over the other, the first thread will likely have new delays that it wouldn't have if there was no second thread, due to waiting for resources the second thread is using.)

CMP has more inherent penalties, due to the fact that most of the resources are shared. CMT should have many of these same penalties, but it should NOT have any additional penalties, since it does not introduce any new areas for penalties to occur over Hyper Threading (at least that I can see, but I can be pretty blind at times).

What we are trying to figure out is what penalties CMT will have in comparison to Hyper-Threading and to full-blown separate cores. AMD's implementation of CMT is very similar to Intel's implementation of Hyper-Threading, except where Intel shares the majority of the core between two threads, AMD shares far less of the module between each thread (each thread even has its own scheduler; even the FPU has a separate scheduler). The more I think about the two setups, the more similarities I see.
 
Hyperthreading just stuffs more through the pipeline.
That's all you can say after a 400-word reply that doesn't even contest that or have that as its primary message?

You've quoted the entire post, and all you have is a Wikipedia link, and nowhere in there will it say that an i7 core is 80% efficient as you claim, nor will it even remotely touch any of the 4 other points I bothered to enumerate.
 
while I equated CMP to Hyperthreading
No, Hyperthreading would be SMT 🙂

CMP = chip multiprocessor (real cores, traditional, no HT)
CMT = clustered / cluster-based multi-threading (module approach)
SMT = simultaneous multi-threading (HT)

100% is the baseline, so a quad i7 will have a max throughput (we are talking of multi-threads, as I did clearly say; we have already gone over the issue that at single-threads only, there is negligible penalty, so we aren't discussing that anymore) of no less than 400%. With HT on, we can add ~80% to that, so we get ~480%, maybe 500% even. Still, it is above the baseline. Core efficiency goes up, not down.

For CMT, we already accept a penalty. For simple core-efficiency metrics (and realize here that "efficiency" in this conversation was misused to mean performance or throughput - not by me, I simply followed it through), we lose performance immediately. In the end, it can still be a win, if the performance loss through CMT is more than offset by the additional cores made available (and this is part of the picture AMD paints, so yes, we can count on this, especially in server land). But as it is, per-core "efficiency" (throughput) goes down in multi-threaded workloads; we just count on having more cores to end up with competitive or better performance. Hence my bewilderment at putting the throughput of a CMP and a CMT design at parity when all cores are running.
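To make that concrete, here's a rough full-chip sketch under the same assumptions used above (HT adds ~20% per core, a module's second core adds ~80%); these are the thread's ballpark numbers, not benchmarks.

```python
# Rough whole-chip throughput, relative to a single core at 100%.
chips = {
    "quad-core i7, HT off":  4 * 1.00,   # 400%
    "quad-core i7, HT on":   4 * 1.20,   # ~480% (HT adds ~20% per core)
    "4-module / 8-core BD":  4 * 1.80,   # ~720% (second core adds ~80%)
}

for name, total in chips.items():
    print(f"{name}: {total:.0%} of a single core")
```

Which, incidentally, is the same arithmetic behind the earlier "fully loaded 8-core BD performs like a theoretical penalty-free 7.2-core BD" remark.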
 