Rumour: Bulldozer 50% Faster than Core i7 and Phenom II.


AtenRa

Lifer
Feb 2, 2009
14,002
3,357
136
And wouldn't you agree that, given the leaked price points and die sizes, a BD core should be compared to an Intel thread, and a BD module to an Intel core?

Two problems: a BD core is smaller (fewer transistors) than an SB core, so it is unfair to compare them because SB will always be faster at the same frequency; and secondly, a BD module (L2 included) is larger than an SB core, so it is unfair to SB to compare them because the BD module will always be faster when running two threads.

I would only compare price/performance in the applications I want to run, so it comes down to the individual to decide what they need: higher IPC or more cores.
 
Last edited:

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
I'd be the first to admit that I am not a computer science academic, and I am not familiar with the research you are talking about. The only paper that I've seen is this one:

http://www.eecs.harvard.edu/~vj/images/4/4e/Msr09search.pdf

Which is by Microsoft, and talks about how, because searches must be latency constrained, single-thread performance still matters (if the search must be returned by a deadline each time, slower single-thread performance degrades search result quality).

This is certainly true for some classes of workloads; I was thinking more along the lines of transcoding and rendering -- which are purely throughput oriented.
Many transactional workloads fare similarly. MS's stats are interesting there (they actually do a good job of showing the potential strengths of small CPUs in a distributed environment), but basically, if a task is waiting on small requests to get done, needing each output for the next input, sacrificing time per operation to gain operations per unit time can turn out less than stellar. Doubly so if you have users waiting for responses. Heavy SMT and small CPUs have begun to catch on, and will grow, but not such that big CPUs will be replaced everywhere, so much as that big CPUs will be replaced in areas where they cannot be utilized well. Even in cases where power bills may be saved, it might not always be the best choice.
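To put a number on that latency point, here's a minimal Python sketch with made-up figures (not measurements of any real CPU) showing why a chain of dependent requests gains nothing from slower-but-more-numerous cores:

[code]
# Toy model: each request needs the previous request's output, so the chain is
# strictly serial and total time scales with per-request latency, no matter how
# many small cores are available to run other work in parallel.
def chained_request_time(n_requests, latency_per_request):
    return n_requests * latency_per_request

BIG_CORE_LATENCY = 1.0    # made-up time units per request on a fast core
SMALL_CORE_LATENCY = 2.0  # a slower core, even if you have racks of them

print(chained_request_time(100, BIG_CORE_LATENCY))    # 100.0
print(chained_request_time(100, SMALL_CORE_LATENCY))  # 200.0 -- the user waits twice as long
[/code]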

Right now, current Intel Xeons are generally fast enough that you may be better off getting them and turning HT off (or getting no-HT models) than buying AMD, as a purchasing decision, if you feel HT is not in your best interests (Intel does not claim HT replaces real cores). AMD may have many much cheaper Opterons, but the total cost difference usually ends up being very small, whether big vendor or custom. Maybe AMD's could have gotten you 10% more performance, thanks to more cores/$, but that's a maybe. OTOH, Intel's superior performance per core is guaranteed, at least right now. If AMD can get good enough performance per thread with BD, that very same thinking could make their CPUs attractive again.

How much of a hit do you think a thread running on a BD core will take when a second thread is added to the same core?
How will you add a second thread to a BD core? BD only allows one per core. The point of two threads per module is to improve overall performance, while using less space and power, such that two threads per module nets you as much performance as with two whole cores. While there is a theoretical performance hit, that's based on research that uses old Alphas (seriously, every paper I've found starts with a 21264). Chances are very good that if AMD had made a new CPU without sharing the front-end, the single-threaded task performance would be no different, since that performance is so dependent on speculation and OoOE, outside of floating point work.
 
Last edited:

magomago

Lifer
Sep 28, 2002
10,973
14
76
Two problems: a BD core is smaller (fewer transistors) than an SB core, so it is unfair to compare them because SB will always be faster at the same frequency; and secondly, a BD module (L2 included) is larger than an SB core, so it is unfair to SB to compare them because the BD module will always be faster when running two threads.

I would only compare price/performance in the applications I want to run, so it comes down to the individual to decide what they need: higher IPC or more cores.

Yup you hit it spot on.

The only way to truly compare these processors when they are out is to look at the performance (or performance/price) in the applications and take them as a whole.

Of course I'm sure that most sites will try to do something stupid and make up pseudo tests in order to see which 'core' or 'module' is faster in different conditions (be it comparing 1 core to a module, or 2 cores to a module)...and this will yield a LOT of flaming and stupid threads everywhere
 

Khato

Golden Member
Jul 15, 2001
1,221
274
136
Two problems: a BD core is smaller (fewer transistors) than an SB core, so it is unfair to compare them because SB will always be faster at the same frequency; and secondly, a BD module (L2 included) is larger than an SB core, so it is unfair to SB to compare them because the BD module will always be faster when running two threads.

I would only compare price/performance in the applications I want to run, so it comes down to the individual to decide what they need: higher IPC or more cores.

While I agree with the sentiment, the die size estimates that I've seen don't quite align with that assessment. For Bulldozer, 294mm^2 total with ~31mm^2 per module equates to ~19mm^2 per module without the L2 cache and ~15.5mm^2 if you take out one of the integer cores. A Sandy Bridge core, meanwhile, is at ~17.2mm^2, dropping down to the same ~15.5mm^2 when removing the L2 cache. Of course, it's not really fair comparing logic die size considering that AMD's floor-planning favors logic density at the cost of 'dead space' routing between logic blocks, which implies that the Sandy Bridge core logic is markedly smaller.

And yes, for the sake of die/logic size comparison, I would consider a single Bulldozer 'core' to be a module minus the L2 cache and the second integer core. This being equal in size to a Sandy Bridge core implies that there's at least the potential for it to keep up in single-threaded performance. Whether or not that comes to pass is something that we'll see soon enough.
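To make the arithmetic explicit, here's a quick back-of-the-envelope in Python using the rough estimates quoted above (none of these are official figures):

[code]
# Rough die-area estimates from the post above, in mm^2 (not official numbers).
BD_MODULE        = 31.0   # one Bulldozer module, L2 included
BD_MODULE_NO_L2  = 19.0   # module minus its L2
BD_CORE_ESTIMATE = 15.5   # module minus L2 and one integer core
SB_CORE          = 17.2   # one Sandy Bridge core, L2 included
SB_CORE_NO_L2    = 15.5   # Sandy Bridge core minus its 256 KB L2

one_int_core = BD_MODULE_NO_L2 - BD_CORE_ESTIMATE   # ~3.5 mm^2 implied per integer core
print(f"Implied size of one BD integer core: ~{one_int_core:.1f} mm^2")
print(f"BD 'core' vs SB core (both without L2): {BD_CORE_ESTIMATE} vs {SB_CORE_NO_L2} mm^2")
# Both land at ~15.5 mm^2, which is the basis for saying a BD 'core' at least
# has the area budget to keep up with a Sandy Bridge core.
[/code]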

Heh, for all the noise that's been made about the advantages of AMD's approach... It's nothing more than a further level of SMT beyond Intel's Hyper-Threading - they're simply duplicating execution logic (in addition to the control logic necessary for SMT) in accordance with where they felt multi-threaded workloads are trending. What's odd is that they didn't allow for either 3 or 4 threads to be executed per module to keep that duplicated logic busy.
 

PreferLinux

Senior member
Dec 29, 2010
420
0
0
Think about this: the 8130P is positioned to compete against the 2600k. 2600k is marketed as "4C/8T" while 8130P is marketed as "4M/8T".

I'm assuming you want to know the performance hit when running two threads on a single...thread? That's silly and obvious... you know the answer.

The point he was making is Bulldozer's modular design which allows for higher "core" count (two cores sharing some circuitry) is a more efficient way to run highly threaded applications than adding hyperthreading.
Wrong, it will be marketed as 8 cores.

Isn't that already accounted for in the 36%? If a thread fully utilizes the core, HT gets you nothing. You get the good boost from HT when your software is written by monkeys and plays hide and seek with pointers.

Unless you meant when the other thread is completely idle? Because, in that case I don't care about performance -- it's when your queues are full and everything is firing on full cylinders when it counts. Every other time, I just want it to suck as little power as possible.
What I mean is that with normal code, you get less than a 36% performance penalty per thread, as each thread isn't trying to use the whole core.

My talk about code-monkeys is self-deprecating -- I'd call myself a code monkey. Also, in case my previous post seemed overly harsh: I absolutely love HT and other features like it that allow us programmers to be lazy. Features that let programmers do less work are features that save us a lot of money.

Also, it's not necessary to keep all the parts of the core utilized to block the other thread -- you only need to fill up one part that the other thread also needs. On SNB, that would probably be the cache write unit. Also, as the FPU shares issue ports with the ALU's on Intel processors, using one of them also blocks the other.
I wouldn't call myself a code-monkey, but that would be reasonably accurate.

I think you will find that HT will force the two threads to share the resources, but that doesn't help any if there is only the one unit.
 

386DX

Member
Feb 11, 2010
197
0
0
I'm starting to think we're setting our hopes too high for Bulldozer.

If you look at AMD's leaked marketing slide: http://www.xbitlabs.com/news/cpu/di..._Range_Microprocessor_to_Cost_320_Report.html

You can see that, at least according to that slide, 4 core BD does not match i5-2500k.

Apparently it takes an FX-6110 6-core BD to match or slightly beat a 2500k, with the 8-core BD matching the 2600k.
It's only the AMD fanbois that are setting their hopes too high for BD. You can debate hypotheticals, specs, etc., but in the end, looking at how they are trying to market the product will give you a good idea about its performance. The fact that AMD has not talked at all about single-threaded performance and only talks about cores, etc. should tell you that BD is gonna fall short of SB clock for clock. When you're trying to market something, you never point out your product's obvious shortcomings; you only focus on its strengths and divert/ignore everything else. It's why you don't see Ferrari advertising the super fuel efficiency of their cars, or the ample amount of cargo space.

The marketing slides and price list further point to this. The FX4110 is priced at $190, which is less than the $216 for Intel's i5-2500K; it's a pretty safe bet the FX4110 will be slower than the i5-2500 in pretty much all tasks. The pricing of the FX6110 at $240 would suggest the 6-core BD will be faster than the 4-core i5-2500 in multi-threaded situations, and most likely slower in single-threaded. The pricing of the FX4110 makes it more competitive with the i5-2400; I suspect performance should be similar, mainly due to the FX4110 being clocked higher and turboing higher.

If that slide is close to accurate, and the leaked pricing is also close to accurate, it basically means this:

2500k for $225, or 6 core BD for $240 (about the same performance)
2600k for $320, or 8 core BDp for $320 (about the same performance)

Again, speculation, but, if that leaked slide and those prices are true\close, anyone waiting on BD should just go buy a 2500k or 2600k now, at least from a gaming perspective, and be able to upgrade to Ivy later on without changing the MB...of course, it's so close now, might as well wait to see if it's better than hoped...

The 8-core BD pricing, and the P model, are another sign of what to expect. If BD were all that it's made out to be, it would be a safe bet that on the slides you'd see "Superior Performance" instead of "More Cores Overclocked". The pricing for the 8-core FX8110 at $290 puts it nearly identical to the i7-2600 price. Once again, it'll probably be Intel faster at single-threaded and the 8-core BD faster at multi-threaded... but probably not by as big a margin as AMD hopes.

That's where the oddball 8-core FX8130P comes in. It's priced higher than Intel's unlocked i7-2600K, and was probably put out to show BD's 8 cores winning in multi-threaded benchmarks. From what we know, the "P" refers to the CPU being 125W compared to the 95W of the rest of the BD CPUs. The fact that they have to put out a special "P" version of the 8-core shows that AMD is already pushing BD to its limits (i.e. not much room for overclocking). Remember that all the FX CPUs come with an unlocked multiplier, so there really is no point in putting out two versions of the 8-core. They could just release the FX8110 at the FX8130P's speed, or just release an FX8130 at 95W. The fact that they didn't (can't) would suggest that to get BD to run at FX8130P speed they had to increase its TDP. All things being equal (same CPU family), the only reason you have to increase TDP is if you have to increase the voltage to the CPU, which basically means AMD has to overclock to get BD to run at FX8130P speed, which isn't a good sign for a brand new architecture.
 

videoclone

Golden Member
Jun 5, 2003
1,465
0
0
386DX - looks spot on! The reviews will show the real performance of BD.

I think what we all must take from BD is that AMD should be back in line with Intel and not behind.
 

jimbo75

Senior member
Mar 29, 2011
223
0
0
That's where the oddball 8-core FX8130P comes in. It's priced higher than Intel's unlocked i7-2600K, and was probably put out to show BD's 8 cores winning in multi-threaded benchmarks. From what we know, the "P" refers to the CPU being 125W compared to the 95W of the rest of the BD CPUs. The fact that they have to put out a special "P" version of the 8-core shows that AMD is already pushing BD to its limits (i.e. not much room for overclocking). Remember that all the FX CPUs come with an unlocked multiplier, so there really is no point in putting out two versions of the 8-core. They could just release the FX8110 at the FX8130P's speed, or just release an FX8130 at 95W. The fact that they didn't (can't) would suggest that to get BD to run at FX8130P speed they had to increase its TDP. All things being equal (same CPU family), the only reason you have to increase TDP is if you have to increase the voltage to the CPU, which basically means AMD has to overclock to get BD to run at FX8130P speed, which isn't a good sign for a brand new architecture.

Err what? You didn't actually believe AMD were going to release only a 95W chip to take on the 980X (and now 990X)? 35W is a HUGE amount to be behind and the 8130P will go that extra ten yards.
 

Riek

Senior member
Dec 16, 2008
409
14
76
It's only the AMD fanbois that are setting their hopes too high for BD. You can debate hypotheticals, specs, etc., but in the end, looking at how they are trying to market the product will give you a good idea about its performance. The fact that AMD has not talked at all about single-threaded performance and only talks about cores, etc. should tell you that BD is gonna fall short of SB clock for clock. When you're trying to market something, you never point out your product's obvious shortcomings; you only focus on its strengths and divert/ignore everything else. It's why you don't see Ferrari advertising the super fuel efficiency of their cars, or the ample amount of cargo space.
How is that relevant? You cannot compare IPC alone and use that to say CPU 1 is faster than CPU 2, just like you can't use the frequency of CPU 1 alone to state it is faster than CPU 2.

performance = IPC (for the application) * frequency

In this case, BD seems able to reach higher frequencies, so it doesn't need the same IPC to deliver the same performance.
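As a trivial Python illustration of that formula, with made-up numbers (no real IPC figures implied):

[code]
# performance = IPC (for the application) * frequency
def performance(ipc, freq_ghz):
    return ipc * freq_ghz

baseline     = performance(ipc=1.00, freq_ghz=3.4)  # hypothetical higher-IPC part
higher_clock = performance(ipc=0.85, freq_ghz=4.0)  # 15% lower IPC, clocked higher

print(baseline, higher_clock)  # 3.4 vs 3.4 -- same throughput despite the IPC deficit
[/code]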

The marketing slides and price list further point to this. The FX4110 is priced at $190, which is less than the $216 for Intel's i5-2500K; it's a pretty safe bet the FX4110 will be slower than the i5-2500 in pretty much all tasks. The pricing of the FX6110 at $240 would suggest the 6-core BD will be faster than the 4-core i5-2500 in multi-threaded situations, and most likely slower in single-threaded. The pricing of the FX4110 makes it more competitive with the i5-2400; I suspect performance should be similar, mainly due to the FX4110 being clocked higher and turboing higher.
AMD is in the spot where it needs to perform better for the same cost. It will most likely be clocked higher than the competing SB version... but that is by design. However, how much is the difference between the i5-2400 and the i7-2600K? That surely does not line up with going from 2M to 4M, not even with going from 2M to 3M.

So the FX4110 could be clocked higher to compete, which means that the higher-module-count BDs will use turbo boost to reach the same or higher frequencies to compete (see below regarding Intel's top-performing CPUs).


The 8-core BD pricing, and the P model, are another sign of what to expect. If BD were all that it's made out to be, it would be a safe bet that on the slides you'd see "Superior Performance" instead of "More Cores Overclocked". The pricing for the 8-core FX8110 at $290 puts it nearly identical to the i7-2600 price. Once again, it'll probably be Intel faster at single-threaded and the 8-core BD faster at multi-threaded... but probably not by as big a margin as AMD hopes.
Not sure why SB would be faster in single-threaded now; SB-E and Gulftown have single-threaded turbo boost clocks very close to the 2500 and 2600 (a 100MHz difference), so all the top-line Intel CPUs are also very close to each other in real single-threaded applications (e.g. the 2500 within ~3% of the 2600, within ~3% of SB-E). So again, why would single-threaded be faster on SB (on average)? If AMD is able to compete with SB using a high-clocked 4-core, they can use turbo boost 2 to clock modules high enough to compete with those models as well.

That's where the oddball 8-core FX8130P comes in. It's priced higher than Intel's unlocked i7-2600K, and was probably put out to show BD's 8 cores winning in multi-threaded benchmarks. From what we know, the "P" refers to the CPU being 125W compared to the 95W of the rest of the BD CPUs. The fact that they have to put out a special "P" version of the 8-core shows that AMD is already pushing BD to its limits (i.e. not much room for overclocking). Remember that all the FX CPUs come with an unlocked multiplier, so there really is no point in putting out two versions of the 8-core. They could just release the FX8110 at the FX8130P's speed, or just release an FX8130 at 95W. The fact that they didn't (can't) would suggest that to get BD to run at FX8130P speed they had to increase its TDP. All things being equal (same CPU family), the only reason you have to increase TDP is if you have to increase the voltage to the CPU, which basically means AMD has to overclock to get BD to run at FX8130P speed, which isn't a good sign for a brand new architecture.

Of course AMD does not overclock its CPUs. We don't know at which speeds the FX8130P will run or what the real power consumption will be. As a matter of fact, SB-E and Gulftown have a TDP of 130W; AMD might be very competitive with those CPUs. Those CPUs have no single-threaded benefit compared to the 4-cores and only get a real advantage when 5 or more threads are used, but this is where BD will shine as well. The increased TDP on BD is probably needed to run 8 cores above a 3.2GHz base (>3.5GHz turbo for all?)... the same reason why Intel needs 130W on their 6-core chips. Just for you to look at it another way:
Suppose they had:
FX4 95W
FX6 95W
FX8 125W

Would you say the same thing? Isn't the 8-core at a 95W TDP a sign of a very good CPU and good yields, and, instead, the 125W 8-core clocked really high a sign of the reverse?
 
Last edited:

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
Checks task manager, sees 1,277 threads running on dual core i5.:eek:
That's 1,277 threads in memory. A dual-core i5, with HT on, can only run 4 of them at a time. A quad-core (2-module) BD will also run no more than 4 at a time.
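A minimal sketch of that distinction, assuming Python with the third-party psutil package installed (the counts will obviously vary by machine):

[code]
# Count hardware threads vs. threads that merely exist in memory.
import os
import psutil  # third-party package; assumed installed

logical_cpus = os.cpu_count()  # e.g. 4 on a dual-core i5 with HT

total_threads = 0
for proc in psutil.process_iter():
    try:
        total_threads += proc.num_threads()
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass  # processes can vanish or be off-limits mid-iteration

print(f"Hardware threads that can execute simultaneously: {logical_cpus}")
print(f"Threads resident in memory across all processes: {total_threads}")
# The OS time-slices the thousand-plus resident threads onto those few
# hardware threads; only `logical_cpus` of them run at any given instant.
[/code]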

Heh, for all the noise that's been made about the advantages of AMD's approach... It's nothing more than a further level of SMT beyond Intel's Hyper-Threading - they're simply duplicating execution logic (in addition to the control logic necessary for SMT) in accordance with where they felt multi-threaded workloads are trending. What's odd is that they didn't allow for either 3 or 4 threads to be executed per module to keep that duplicated logic busy.
It is less about speculation of where the workloads are trending, which I'm sure AMD and Intel are on the same page about (if not publicly :)), than about the best way to improve performance in any given workload, with resources available.

Think of a CPU not as being a big monolithic device, but many small devices networked together. If one device on that network always remained at <50%, when others were at 100%, surely you could use only one of them per two of the other devices, and get by just fine, right? Also, if the communications between each CPU's copies of one or more devices causes you to wait, not just on the work being done, but on the actual communication itself, wouldn't you be better off making one more powerful device, to reduce or remove that network chatter that is wasting your time?

There can come a point where, no matter how powerful each unit of each core is, certain units being duplicated on a per-core basis will result in worse performance than if shared between cores. Intel can use their process advantage to keep from having to worry about all of this for some time. AMD cannot. On one hand, that no one else has yet made a real high-performance processor, for COTS code, using CMT, is pretty cool, pioneering stuff. On the other hand, it's very much AMD having to design their way around the handicap of having inferior physical technology with which to make their CPUs.
It's nothing more than a further level of SMT beyond Intel's Hyper-Threading - they're simply duplicating execution logic (in addition to the control logic necessary for SMT)
It's not a further level of SMT. Where AMD may use SMT, they will just be using SMT. If you duplicate the execution logic, such that it isn't shared by multiple threads, it's not SMT. Both methods result in multithreaded processors, but are significantly different approaches. SMT targets idle time, assuming that if units are not busy, they need more to do. CMT targets waste, assuming that if units are not busy, there must be fat to trim. They are orthogonal concepts. If you trim fat to the point that the cluster can keep all thread execution units busy, while not idling other parts of the CPU (IE, no redundancy, and also no bottlenecks to functional units, either), but real code doesn't do that, you could add SMT, and get just as much benefit (or detriment) as with a pure CMP that gets SMT added to it.

For a visual analogy, consider CMP v. CMT (more threads/cluster as X increases, up to an optimum for a given uarch--probably 4 or 8) as a horizontal axis, and 1 thread/execution unit (no SMT) v. many threads/execution unit (extreme SMT) as a vertical axis. Assume workloads with multiple threads operating at the same time, a CPU made to run each thread fast, and that for any given combination, die space, TDP, and/or xtor count are made the same. Left to right, performance/xtor improves. Bottom to top, performance/xtor improves. Plot by total throughput, and the winner will be in the top-right (CMT+SMT) corner. Plot by average service time, or % reaching QoS targets, and the winner will be in the bottom-right corner (CMT, no SMT). Plot by known performance of actual CPUs, which are dependent on memory resources that are never fast enough, and...uh, let me get my TARDIS, because BD is going to be the first commodity high-performance CMT CPU out there, AFAIK. Ultimately, CMT is the future, but AMD is taking the plunge before any other commodity CPU makers/designers.

They are treated as exclusive, opposed options, because AMD is using one, and Intel the other, towards a similar goal, at this point in time (just ignore that FPU...yeah, that one, right there, shared by the BD cores :)).

CMT should have five main advantages over SMT, if each is used exclusively, as in BD's integer v. SB's integer:
1. Each thread can run about as fast as if the other shared resources weren't in use, except cache (cache is an area where both will have similar problems).
2. In cases where on-chip network latencies, and switch and wire delays of duplicated logic, may slow performance, or limit clock speeds, the penalties can be reduced, or even entirely removed, with sharing of resources.
3. As execution units get wider and deeper, and clock speeds grow, it will be harder to keep them full, not merely by way of IPC, but because the logic controlling them tends to scale on log or quadratic curves, such that more, narrower execution units can be far easier to come up with, and cheaper all-around, than a wide set of shared execution units (IoW, potential for improved performance per R&D dollar spent, v. CMP or CMP+SMT).
4. CMT's effectiveness, in a chip made for high performance with a few threads, will not depend on inefficient code execution (merely that dividing the execution resources caps peak performance, compared to being wider, yet otherwise identical), where SMT can be dependent on such.
5. All of that combined should make it easier to reach higher clock speeds within a given TDP, and improve per-thread resource utilization, enough to more than make up for the very minor penalty of having narrower execution resources. It is quite possible that a pure CMP, with all non-cluster features of BD, could be slower, for a single threaded task, than BD will be, if just due to lower clock speed limits at a given TDP.
Now, for the encore presentation. A drum roll, if you please...
6. Since the first Opterons, AMD has had a marketing stance that adding real cores is the best way to improve multithreaded performance. CMT allows them to keep the cost per thread down, necessarily by sharing resources, without having to eat crow about that, by treating a set of integer thread execution resources as one core. They'll just have to be clever about talking up the FPU, but they've handled that pretty well, so far.

* This has gotten a contextual meaning of, "not clustered multithreading," used in conjunction with CMT, though is often used for any CPU with multiple cores on the same die or package, outside of the CMT-centric context. After all, if you put more than one cluster on a chip, have you not just made a CMP of CMT processors? If you want to go read up a little on this stuff, that discrepancy could be confusing, at first.
 
Last edited:

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
The fact that AMD has not talked at all about single-threaded performance and only talks about cores, etc. should tell you that BD is gonna fall short of SB clock for clock.
This "fact" is no fact at all:

While high-throughput performance was a primary goal for Bulldozer, AMD made a significant investment in delivering high, single-thread performance levels. A major contributor to this strategy is in scaling the core structures and an aggressive frequency goal (low gates per clock). Another major component of the single-thread performance strategy is Bulldozer’s investment in instruction and data prefetching.
(Michael Butler et al., Bulldozer: An Approach to Multithreaded Compute Performance)

The marketing slides and price list further point to this. The FX4110 is priced at $190, which is less than the $216 for Intel's i5-2500K; it's a pretty safe bet the FX4110 will be slower than the i5-2500 in pretty much all tasks. The pricing of the FX6110 at $240 would suggest the 6-core BD will be faster than the 4-core i5-2500 in multi-threaded situations, and most likely slower in single-threaded. The pricing of the FX4110 makes it more competitive with the i5-2400; I suspect performance should be similar, mainly due to the FX4110 being clocked higher and turboing higher.
I'm not sure if these prices already reached "fact" status. I still see them as rumor/speculation, albeit a not that unrealistic one.

The 8-core BD pricing, and the P model, are another sign of what to expect. If BD were all that it's made out to be, it would be a safe bet that on the slides you'd see "Superior Performance" instead of "More Cores Overclocked". The pricing for the 8-core FX8110 at $290 puts it nearly identical to the i7-2600 price. Once again, it'll probably be Intel faster at single-threaded and the 8-core BD faster at multi-threaded... but probably not by as big a margin as AMD hopes.
The problem with such a high-level view of performance is the missing granularity regarding specific tasks. It could be that the average performance across all tested apps/games/synthetic benchmarks is about the same, while for the whole group of games (although they also differ a lot) BD might be 10% slower on average, for video encoding 10% faster, and for synthetic memory benchmarks 20% faster on average. Any variant is possible. We might begin with cycle-exact simulations of the architecture using a modified PTLSim, but this would only help so much and would mean a lot of work.
 

Khato

Golden Member
Jul 15, 2001
1,221
274
136
http://forums.anandtech.com/showpost.php?p=31752021&postcount=2414

Not going to quote the entire argument in the interests of saving space. Especially since it's merely detailing a difference in semantics.

Why? Because AMD's approach was likely derived after analyzing workloads and deciding that, in their intended design, everything except for the integer cores was typically IDLE half the time. So what did they do? It doesn't make sense to halve the rest of the logic and kill single-threaded performance, so instead they added SMT control logic to all other portions of the core design and simply duplicated the integer core. It really can't be called anything other than SMT because you have... Fetch, Decode, Branch Prediction, and FPU at least as shared logic, without which the duplicated integer logic sure can't do much.

At the same time, I wouldn't agree that this approach is 'the future'. It's the future for lazy designs that don't want to maximize single-threaded performance in the process. They could have spent the exact same die area creating a massive monolithic integer core and extracted the exact same multi-threaded performance through a typical SMT implementation while getting far higher single-threaded performance. But such a design is markedly more time consuming than 'simple' copy/paste (yeah, not quite -that- simple, but with AMD's apparent floor-plan approach it's close to it.)

To end, let's go through those points, oh, though I guess that I should call that last paragraph a response to point number 3, since that's exactly what it is, haha.

1. I'd sure hope it's twice as fast considering that AMD's approach is to use twice the die size. Really is the same as point number 3.
2. Amusing thought, but it's by no means an 'advantage' seeing as how those constraints apply to the design no matter what. It could just as easily end up being a disadvantage.
3. Already detailed above, and this is indeed an advantage for AMD considering the marked difference in development costs. However, it's a disadvantage in terms of single-threaded performance.
4. Nonsense. Unless you're going back to point number 1, where sure AMD's approach should be twice as fast when executing across duplicated logic since it's using twice the die space.
5. What does this have to do with AMD's supposedly non-SMT approach vs Intel's more typical SMT?
6. Yes, I know, this is the entire reason that AMD is trying to say that they aren't doing SMT. But it's nothing more than marketing.
 

RoyG

Member
Jan 28, 2010
38
2
71
Wow, history is repeating itself. Barcelona was said to be 50% faster than Core2 Quad too :)

---------------------------------------
AMD : Barcelona quad-core 50% faster than Intel's quad-core Xeon

Posted on Apr 23rd 2007 by Wolfgang Gruener

Sunnyvale (CA) – Following last week's sobering financial news, AMD today provided some news that may calm down worried analysts and investors.

There are lots of question marks surrounding AMD's first quad-core CPU, the Barcelona server/workstation. How fast will it be and - most importantly - will it be able to outpace Intel's quad-core Xeon 5300-series ("Clovertown") processors? It will, says AMD, and in fact the company has upgraded its performance expectations.

So far, the company had claimed that Barcelona will surpass the performance of Clovertown by about 40% at any given clock speed. Now the company says that it believes that Barcelona will have a 50% advantage over Clovertown in floating point and 20% in integer performance "over the competition's highest-performing quad-core processor at the same frequency."

AMD did not release the specifics of Barcelona. Intel's Clovertown currently tops out at 2.66 GHz, but the company has begun supplying limited numbers of 3.0 GHz server quad-core processors.

AMD also announced the new dual-core Opteron models 2222 SE and 8222 SE, which had already been somewhat announced at the beginning of this month: AMD claims that the new 3.0 GHz Opterons beat comparable Intel 5100 series processors in three server-specific benchmarks (SPECint_rate_2006, SPECint_rate2006, SPECompM2001) by up to 24%.

The 2222 SE is available for a tray price of $873; the 8222 SE is priced at $2149. Intel's 3.0 GHz Xeon 5160 processor currently sells for $851.
 
Last edited:

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
Why? Because AMD's approach was likely derived after analyzing workloads and deciding that, in their intended design, everything except for the integer cores was typically IDLE half the time. So what did they do? It doesn't make sense to halve the rest of the logic and kill single-threaded performance,
No. There is no killing to be done. Let's say your execution units are at 100%, and unit A in each CPU is running at 70%. But, in tight code when only one thread runs, it can reach 100%. If you share it between cores, at 140% original capacity, what have you lost? Nothing.
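Here's that argument as a tiny toy model in Python, using the same made-up utilization figures; it illustrates the logic only, and isn't a claim about any actual unit in BD:

[code]
# Two private copies of "unit A" average 70% busy each, but a lone thread in
# tight code can peak at 100%. Replace them with one shared unit A sized at
# 140% of the original capacity and nothing is lost in this toy model.
PRIVATE_CAPACITY    = 1.00
AVG_DEMAND_PER_CORE = 0.70   # typical utilization per core
PEAK_SINGLE_THREAD  = 1.00   # tight single-threaded code

shared_capacity = 1.4 * PRIVATE_CAPACITY

print(shared_capacity >= 2 * AVG_DEMAND_PER_CORE)  # True: covers both cores' average demand
print(shared_capacity >= PEAK_SINGLE_THREAD)       # True: a single thread can still peak
# Compared with two private units (2.0 total capacity), the shared design
# spends ~30% less capacity on unit A for the same delivered performance here.
[/code]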

Now, with a non-cache-coherent system, running a highly efficient ISA, the above might be more easily preventable. DX10 and newer GPUs are good examples. Your average PC CPU, however, is far from that scalability ideal. It is quite probable that, very often, some parts will go underutilized, such that there is either (a) no way to use them at 100% with many threads running, or (b) there is no way to use them at 100% with many cores even on the die, and/or (c) their efficiency drops as cores or cache banks are added, due to network issues. They can all safely be turned into a shared resource of lesser complexity than the totality of the duplicated units, without harming performance.

So, let's repeat that consolidation over and over again, but with a new design, that has all the K8's problems fixed (each BD int core has that taken care of, while STARS left much to be desired). Now, let's assume we're done, and we have the same performance at the same clock speed, for each core, but only using 70% of the xtors (ignoring far caches, anyway). Between saving on xtors in general, and simplifying the logic, we've not only reduced power consumption at a given speed, but have gained clock speed headroom, without sacrificing one iota of IPC. Both single and multithreaded performance will have significantly improved at any given TDP.

That's the potential promise of CMT.
so instead they added SMT control logic to all other portions of the core design and simply duplicated the integer core.
Nope. The other way around. They took two whole cores, and trimmed the fat, until trimming more would hurt each core's performance (the 4-wide decoder was a bit surprising, though...I was expecting closer to 6, for occasional peaks). There should be enough performance in the shared front-end to keep both execution units busy, if the code allows. The front end should not be over-provisioned.
It really can't be called anything other than SMT because you have... Fetch, Decode, Branch Prediction, and FPU at least as shared logic, without which the duplicated integer logic sure can't do much.
The FPU is using SMT of some kind. The integer logic, though, can be, and is, called something other than SMT. It's called CMT. Here is an easy-to-Google paper, with nice diagrams on page 3. The C stands for cluster, referring to grouping otherwise independent items.
At the same time, I wouldn't agree that this approach is 'the future'. It's the future for lazy designs that don't want to maximize single-threaded performance in the process.
In that case, what company isn't lazy, today?
AMD, Intel: can't chase single-thread performance, due to needing more threads on the die, so must balance several threads at a time, and idle those cores when they aren't needed, in a fairly strict power envelope.
IBM: going all into SMT, cost and power be damned.
Tilera: we can cram more compute kernels into a chip than you can eat popped corn kernels during a bad movie.
ARM: OK single-thread performance, good power efficiency, A5 MP on the way for threads galore.
Oracle: we're gonna run each thread like it's 1989, but run so many, it sets records, and replaces whole racks of other servers.

Neither Intel nor AMD maximize single thread performance absolutely, though both of them do maximize it within their many-core restraints.
They could have spent the exact same die area creating a massive monolithic integer core and extracted the exact same multi-threaded performance through a typical SMT implementation while getting far higher single-threaded performance.
My jar of magic pixie dust is slap empty. Would people really buy a 200W CPU 10% faster than a 2600K, with fewer threads, assuming AMD even could do it? I wouldn't. I think the idea that AMD could truly one-up Intel, by following behind Intel, is laughable. Intel can leverage smaller and faster xtors, yet also relies on that ability. AMD must exploit that as a weakness.

Also, it's not big wide execution units that make x86 int performance. It's efficient use of the caches, good prefetchers, and good branch predictors. Is that everything? No. But, you can't run any faster than those allow, when running around looking through pointer after pointer.
But such a design is markedly more time consuming than 'simple' copy/paste (yeah, not quite -that- simple, but with AMD's apparent floor-plan approach it's close to it.)
How is it a copy and paste? I'm not seeing how you can split each core from its twin, within a module.

1. I'd sure hope it's twice as fast considering that AMD's approach is to use twice the die size. Really is the same as point number 3.
2. Amusing thought, but it's by no means an 'advantage' seeing as how those constraints apply to the design no matter what. It could just as easily end up being a disadvantage.
3. Already detailed above, and this is indeed an advantage for AMD considering the marked difference in development costs. However, it's a disadvantage in terms of single-threaded performance.
Where are you getting twice the die size? AMD has given very ambiguous numbers, and no CMP version of BD was ever developed, that we know of. The size of what a single core of a new CMP design would be is unknown.

There will be an optimum amount of shared resources per set of distributed execution resources. After this point, other means will need to be used to keep performance improving. Up to this point (somewhat of an unknown, for high speed x86 cores), there will be no disadvantages to sharing more resources per cluster. Whether it works out amazingly well for AMD or not, conservative use of CMT should only be a gain.
4. Nonsense. Unless you're going back to point number 1, where sure AMD's approach should be twice as fast when executing across duplicated logic since it's using twice the die space.
Not nonsense. If a thread can utilize 90% of an SB core, and you add another thread that can do the same, your best-case scenario will be an 11% throughput improvement, and each thread will run at about 55% of the speed of just one. In the worst case, that 90% is all cache, and your total performance and per-thread performance will drop (usually HPC and video stuff, but DBs are not immune, either, even with the new Cores).
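For concreteness, the arithmetic behind those figures as a toy Python model (it assumes two identical threads and a hard 100% utilization cap, nothing more):

[code]
# Best-case SMT scaling when one thread already uses `util` of the core.
def smt_best_case(util):
    total = min(2 * util, 1.0) / util  # combined throughput vs. one thread alone
    per_thread = total / 2             # each thread's share of that throughput
    return total, per_thread

for util in (0.9, 0.5):
    total, each = smt_best_case(util)
    print(f"solo utilization {util:.0%}: total {total:.2f}x, each thread at {each:.1%}")
# 90% -> ~1.11x total (an 11% gain), each thread at ~55% of its solo speed.
# 50% -> 2.00x total in the ideal case, matching HT's best case described below.
[/code]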

HT's ideal case is that no CPU resource is being used more than 50%, including caches, and/or that the two threads can use a shared cache well, in which case you will get much better performance. This kind of code tends to be really bad about cache misses and branch mispredicts, so the theoretical near 100% improvement practically never happens.

While they've fixed most of the major performance drops by HT, QoS can still be harmed by it, and it can certainly be irksome when it causes even a small % of performance drop by being used, because you happen to be running efficient software (it has to annoy the developers of such software to no end, too).

In the CMT-no-SMT case, using another core should, even in the worst case, result in close to a 100% improvement, since it is basically another real core, while using a much smaller fraction of the space and overall complexity. In the small cases where it may occasionally be slightly worse (again, that narrow decoder and all that follows it), it should be worse by so little that the added clock speed it enables will more than make up for it. That doesn't mean BD will be mind-blowing, just that if it sucks, the module-based design shouldn't be where the blame goes.
5. What does this have to do with AMD's supposedly non-SMT approach vs Intel's more typical SMT?
Simpler things run faster. Smaller things run faster. Intel can get around this benefit of CMT for some generations yet, by making smaller and faster xtors, which will let their CMP-only CPUs get by for a bit longer. AMD needs to follow the spirit of RISC, and simplify for that speed. A CMP of monolithic cores could very well run slower, due to latencies on the chip, or just switching more xtors, and thus using more power.

6. Yes, I know, this is the entire reason that AMD is trying to say that they aren't doing SMT. But it's nothing more than marketing.
For the FPU, they aren't hiding that they are using SMT, but they do seem to be going to pains to describe it in very different terms than Intel likes to describe HT, and I'm sure some AMD employees spent hours just on that. For the int, though...it's not SMT. It's more real cores, just as AMD has talked up in the past, but not made as pure CMPs.
 

Voo

Golden Member
Feb 27, 2009
1,684
0
76
Simpler things run faster. Smaller things run faster.
Yep, after all it's a well known fact that all those complications like multiple cores, sophisticated cache coherency protocols or out of order execution really only slow down programs ;) I think you'll agree that you simplified a bit too much here - for some things simplicity up to a certain degree is positive (say the ISA) but there's a reason even RISC chips are generally getting more and more complicated with every generation (cf ARM and their cache implementations for one obvious example)

I pretty much agree with the overall notion of the post though

PS: And really, anyone thinking that adding more cores is simply copy+paste on a modern ASIC is nuts - sure, your design can make it simpler and you can avoid some pitfalls, but it's still far from trivial - there are enough fun things like parasitic capacitances around to make sure it stays fun :p
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,415
1,734
136
Yep, after all it's a well known fact that all those complications like multiple cores, sophisticated cache coherency protocols or out of order execution really only slow down programs ;) I think you'll agree that you simplified a bit too much here - for some things simplicity up to a certain degree is positive (say the ISA) but there's a reason even RISC chips are generally getting more and more complicated with every generation (cf ARM and their cache implementations for one obvious example)

I don't think that was what he meant at all. As I read it, he meant things like the design of the register file and schedulers, not the ISA of the processor.
 

Khato

Golden Member
Jul 15, 2001
1,221
274
136
Bah, guess I actually have to go quote-happy this time around :(

Why? Because AMD's approach was likely derived after analyzing workloads and deciding that, in their intended design, everything except for the integer cores was typically IDLE half the time. So what did they do? It doesn't make sense to halve the rest of the logic and kill single-threaded performance,
No. There is no killing to be done. Let's say your execution units are at 100%, and unit A in each CPU is running at 70%. But, in tight code when only one thread runs, it can reach 100%. If you share it between cores, at 140% original capacity, what have you lost? Nothing.
I'm guessing that my statement was not properly understood, since that's the only explanation for that response. Let me attempt to rephrase in similar terms. If a fictional design in the intended workloads averaged 90% utilization of its integer resources and 45% utilization on everything else... Then it doesn't make sense to halve everything else in order to have a more balanced core design - sure it would increase the utilization of everything else, simply because there's less of it, heh. It'd also decrease integer resource utilization due to dependencies and drastically decrease performance (aka kill.)

so instead they added SMT control logic to all other portions of the core design and simply duplicated the integer core.
Nope. The other way around. They took two whole cores, and trimmed the fat, until trimming more would hurt each core's performance (the 4-wide decoder was a bit surprising, though...I was expecting closer to 6, for occasional peaks). There should be enough performance in the shared front-end to keep both execution units busy, if the code allows. The front end should not be over-provisioned.
Eh, okay. I know I sure wasn't there in the high level architecture design meetings 5+ years ago when those decisions were made.

It really can't be called anything other than SMT because you have... Fetch, Decode, Branch Prediction, and FPU at least as shared logic, without which the duplicated integer logic sure can't do much.
The FPU is using SMT of some kind. The integer logic, though, can be, and is, called something other than SMT. It's called CMT. Here is an easy-to-Google paper, with nice diagrams on page 3. The C stands for cluster, referring to grouping otherwise independent items.
I'll pretend to be an academic for a moment and proclaim that AMD has invented SCMT! Or maybe even P-SCMT, if they did separate issue queues like Hyper-Threading. After all, according to that paper, Bulldozer is neither SMT nor CMT because of the shared FPU; quoting from page 2, "The primary difference between the P-SMT and CMT approaches is that the former assigns threads to execution units at issue time, while in the more highly partitioned CMT processor, this assignment is done at dispatch time by steering each instruction to a particular cluster." As for the actual performance and energy conclusions of that paper, it's unfortunate that they only compared various 16-thread designs, none of which are comparable to the processors we're interested in.

At the same time, I wouldn't agree that this approach is 'the future'. It's the future for lazy designs that don't want to maximize single-threaded performance in the process.
In that case, what company isn't lazy, today?
AMD, Intel: can't chase single-thread performance, due to needing more threads on the die, so must balance several threads at a time, and idle those cores when they aren't needed, in a fairly strict power envelope.
IBM: going all into SMT, cost and power be damned.
Tilera: we can cram more compute kernels into a chip than you can eat popped corn kernels during a bad movie.
ARM: OK single-thread performance, good power efficiency, A5 MP on the way for threads galore.
Oracle: we're gonna run each thread like it's 1989, but run so many, it sets records, and replaces whole racks of other servers.

Neither Intel nor AMD maximize single thread performance absolutely, though both of them do maximize it within their many-core restraints.
Haha, I'll certainly agree with that entire assessment! Especially enjoy the Oracle since that's ever so true.

They could have spent the exact same die area creating a massive monolithic integer core and extracted the exact same multi-threaded performance through a typical SMT implementation while getting far higher single-threaded performance.
My jar of magic pixie dust is slap empty.
Did Apple steal it again? :p Sorry, couldn't resist... Continuing with the rest.
Would people really buy a 200W CPU 10% faster than a 2600K, with fewer threads, assuming AMD even could do it? I wouldn't. I think the idea that AMD could truly one-up Intel, by following behind Intel, is laughable. Intel can leverage smaller and faster xtors, yet also relies on that ability. AMD must exploit that as a weakness.

Also, it's not big wide execution units that make x86 int performance. It's efficient use of the caches, good prefetchers, and good branch predictors. Is that everything? No. But, you can't run any faster than those allow, when running around looking through pointer after pointer.
Pretty sure I didn't say anything about such a design being more power hungry or having fewer threads... But you are correct in that I should have stated it as far higher potential single-threaded performance, which likely wouldn't be realized often at all. The point being that the same resources in an equivalent SMT configuration could hit the same multi-threaded performance while putting no constraints on single-threaded potential. It's just markedly more difficult to design an adequate scheduler of that width.

But such a design is markedly more time consuming than 'simple' copy/paste (yeah, not quite -that- simple, but with AMD's apparent floor-plan approach it's close to it.)
How is it a copy and paste? I'm not seeing how you can split each core from its twin, within a module.
Okay, it's a copy, reflect, and paste. At least that's what the die shot implies, and is the only sensible way to implement the design. (I did the same thing on a ALU layout for one of my VLSI courses back in college.) Doing it any other way vastly increases the amount of back-end work necessary for no purpose. Oh, and guess I should have been more specific that I'm talking in terms of design implementation.

CMT should have five main advantages over SMT, if each is used exclusively, as in BD's integer v. SB's integer:
1. Each thread can run about as fast as if the other shared resources weren't in use, except cache (cache is an area where both will have similar problems).
1. I'd sure hope it's twice as fast considering that AMD's approach is to use twice the die size. Really is the same as point number 3.
Where are you getting twice the die size? AMD has given very ambiguous numbers, and no CMP version of BD was ever developed, that we know of. The size of what a single core of a new CMP design would be is unknown.
Yay for triple-quote to ensure that proper context is clear. Now, in that context of comparing the integer design approaches, AMD's "CMT" doubles the die size used (okay, it's actually more like 1.98x due to the bits that SMT needs to add) vs a SB type SMT... There's no need for figures or anything - when you duplicate the integer logic you double the die space that that logic is going to use.


4. CMT's effectiveness, in a chip made for high performance with a few threads, will not depend on inefficient code execution (merely that dividing the execution resources caps peak performance, compared to being wider, yet otherwise identical), where SMT can be dependent on such.
4. Nonsense. Unless you're going back to point number 1, where sure AMD's approach should be twice as fast when executing across duplicated logic since it's using twice the die space.
Not nonsense. If a thread can utilize 90% of an SB core, and you add another thread that can do the same, your best-case scenario will be an 11% throughput improvement, and each thread will run at about 55% of the speed of just one. In the worst case, that 90% is all cache, and your total performance and per-thread performance will drop (usually HPC and video stuff, but DBs are not immune, either, even with the new Cores).

HT's ideal case is that no CPU resource is being used more than 50%, including caches, and/or that the two threads can use a shared cache well, in which case you will get much better performance. This kind of code tends to be really bad about cache misses and branch mispredicts, so the theoretical near 100% improvement practically never happens.
My claim of nonsense and reference back to point number 1 was that there's no merit of 'CMT' that increases its potential multi-threaded performance in comparison to SMT unless you give it more execution units to work with. Now the statement of SMT only providing an increase in performance with inefficient code would be correct, but it's still quite 'effective' at keeping execution units busy when running efficient code. On the flip-side, depending upon its implementation, a 'CMT' design could easily find inefficient code resulting in idle execution units - everything available thus far implies that this could well be the case with bulldozer.


5. All of that combined should make it easier to reach higher clock speeds within a given TDP, and improve per-thread resource utilization, enough to more than make up for the very minor penalty of having narrower execution resources. It is quite possible that a pure CMP, with all non-cluster features of BD, could be slower, for a single threaded task, than BD will be, if just due to lower clock speed limits at a given TDP.
5. What does this have to do with AMD's supposedly non-SMT approach vs Intel's more typical SMT?
Simpler things run faster. Smaller things run faster. Intel can get around this benefit of CMT for some generations yet, by making smaller and faster xtors, which will let their CMP-only CPUs get by for a bit longer. AMD needs to follow the spirit of RISC, and simplify for that speed. A CMP of monolithic cores could very well run slower, due to latencies on the chip, or just switching more xtors, and thus using more power.
Smaller and simpler does indeed run faster. But again, how does that turn into an advantage of 'CMT' vs SMT? Sure compared to a huge CMP there's an advantage. But all the rest is more a function of other design decisions rather than some superiority of 'CMT'.
 

Khato

Golden Member
Jul 15, 2001
1,221
274
136
PS: And really, anyone thinking that adding more cores is simply copy+paste on a modern ASIC is nuts - sure, your design can make it simpler and you can avoid some pitfalls, but it's still far from trivial - there are enough fun things like parasitic capacitances around to make sure it stays fun :p

I'm nuts for other reasons sure, but not that one. Especially when it comes to high performance blocks that are not just straight synthesized logic, any duplication does amount to a copy/paste. Yes, after that's done the automated tools for signal wire routing and timings come into play, but that doesn't negate the starting point. It's far easier than having to design a high performance block that's twice the size.
 

Absolution75

Senior member
Dec 3, 2007
983
3
81
I didn't read this entire thread, but do the leaked prices of, say, the 8-core BD refer to BD modules or actual cores (e.g. 2 cores per module)? If the BD were 8-module... that would be interesting.
 

Steelski

Senior member
Feb 16, 2005
700
0
0
I didn't read this entire thread, but do the leaked prices of, say, the 8-core BD refer to BD modules or actual cores (e.g. 2 cores per module)? If the BD were 8-module... that would be interesting.

But there is more. I just read that the FX-8130 has a TDP of 125W, clocks at 3.8GHz, and Turbo Cores up to 4.2GHz...
The 8-core FX-8110 CPU clocks at 3.6GHz with a TDP of 95W...

http://news.mydrivers.com/1/194/194271.htm

Could be fake, or simply wrong.. But interesting if true.
 

AtenRa

Lifer
Feb 2, 2009
14,002
3,357
136
I believe we will not see more than 3.2GHz (base frequency) for an 8-core BD (16MB of cache) at 95W TDP, at least in the beginning.
 