And wouldn't you agree that, given the leaked price points and die sizes, a BD core should be compared to an Intel thread, and a BD module to an Intel core?
Many transactional workloads fare similarly. MS' stats are interesting there (they actually do a good job of showing the potential strengths of small CPUs in a distributed environment), but basically, if a task is waiting on small requests to get done, needing their output for the next input, sacrificing time/operation to gain operations/time can turn out less than stellar. Doubly so if you have users waiting for responses. Heavy SMT and small CPUs have begun to catch on, and will grow, but not such that big CPUs will be replaced everywhere, so much as that big CPUs will be replaced in areas where they cannot be utilized well. Even in cases where power bills may be saved, it might not always be the best choice.

Quote:
I'd be the first to admit that I am not a computer science academic, but I am not familiar with the research you are talking about. The only paper that I've seen is this one:
http://www.eecs.harvard.edu/~vj/images/4/4e/Msr09search.pdf
Which is by Microsoft, but talks about how, because searches must be latency-constrained, single-threaded performance still matters (if each search must be returned by a deadline, slower single-threaded performance degrades search-quality results).

This is certainly true for some classes of workloads; I was thinking more along the lines of transcoding and rendering, which are purely throughput-oriented.
How will you add a second thread to a BD core? BD only allows one per core. The point of two threads per module is to improve overall performance while using less space and power, such that two threads per module net you as much performance as two whole cores would. While there is a theoretical performance hit, that's based on research that uses old Alphas (seriously, every paper I've found starts with a 21264). Chances are very good that if AMD had made a new CPU without sharing the front-end, single-threaded performance would be no different, since that performance is so dependent on speculation and OoOE, outside of floating-point work.

Quote:
How much of a hit do you think a thread running on a BD core will take when a second thread is added to the same core?
Quote:
How will you add a second thread to a BD core? BD only allows one per core.
Two problems: a BD core is smaller (fewer transistors) than an SB core, so it is unfair to compare them because SB will always be faster at the same frequency; and a BD module (L2 included) is larger than an SB core, so it is unfair to SB to compare them because a BD module will always be faster when running two threads.
I would only compare price/performance in the applications I'd like to run, so it comes down to the individual to decide what they need: higher IPC or more cores.
Wrong, it will be marketed as 8 cores.

Quote:
Think about this: the 8130P is positioned to compete against the 2600K. The 2600K is marketed as "4C/8T" while the 8130P is marketed as "4M/8T".
I'm assuming you want to know the performance hit when running two threads on a single...thread? That's silly and obvious... you know the answer.
The point he was making is that Bulldozer's modular design, which allows for a higher "core" count (two cores sharing some circuitry), is a more efficient way to run highly threaded applications than adding Hyper-Threading.
What I mean is that with normal code, you get less than a 36% performance penalty per thread, as each thread isn't trying to use the whole core.

Quote:
Isn't that already accounted for in the 36%? If a thread fully utilizes the core, HT gets you nothing. You get the good boost from HT when your software is written by monkeys and plays hide-and-seek with pointers.
Unless you meant when the other thread is completely idle? Because in that case I don't care about performance -- it's when your queues are full and everything is firing on all cylinders that it counts. Every other time, I just want it to suck as little power as possible.
I wouldn't call myself a code monkey, but that would be reasonably accurate.

Quote:
My talk about code monkeys is self-deprecating -- I'd call myself a code monkey. Also, since my previous post seemed overly harsh: I absolutely love HT and other features like it that allow us programmers to be lazy. Features that let programmers do less work are features that save us a lot of money.
Also, it's not necessary to keep all parts of the core utilized to block the other thread -- you only need to fill up one part that the other thread also needs. On SNB, that would probably be the cache write unit. Also, since the FPU shares issue ports with the ALUs on Intel processors, using one also blocks the other.
It's only the AMD fanbois that are setting their hopes too high for BD. You can debate hypotheticals, specs, etc., but in the end, looking at how they are trying to market the product will give you a good idea of its performance. The fact that AMD has not talked at all about single-threaded performance and only talks about cores, etc., should tell you that BD is gonna fall short of SB clock for clock. When you try to market something, you never point out your product's obvious shortcomings; you only focus on its strengths and divert/ignore everything else. It's why you don't see Ferrari advertising the super fuel efficiency of their cars, or the ample amount of cargo space.

Quote:
I'm starting to think we're setting our hopes too high for Bulldozer.
If you look at AMD's leaked marketing slide: http://www.xbitlabs.com/news/cpu/di..._Range_Microprocessor_to_Cost_320_Report.html
You can see that, at least according to that slide, the 4-core BD does not match the i5-2500k.
Apparently it takes an FX-6110 6-core BD to match/slightly beat a 2500k, with the 8-core BD matching the 2600k.
If that slide is close to accurate, and the leaked pricing is also close to accurate, it basically means this:
2500k for $225, or 6-core BD for $240 (about the same performance)
2600k for $320, or 8-core BD for $320 (about the same performance)
Again, speculation, but if that leaked slide and those prices are true/close, anyone waiting on BD should just go buy a 2500k or 2600k now, at least from a gaming perspective, and be able to upgrade to Ivy later on without changing the MB... of course, it's so close now, might as well wait and see if it's better than hoped...
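As a sanity check on those numbers, the cost-per-performance gap is easy to compute. This is a sketch only: the prices are leaks and "about the same performance" is an assumption taken from the slide, not a measurement.

```python
# Back-of-the-envelope cost per unit of performance from the leaked prices.
# Both the prices and the "about the same performance" pairing are rumors,
# so treat the output as illustration only.
matchups = {
    "2500k vs 6-core BD": (225, 240),
    "2600k vs 8-core BD": (320, 320),
}

for name, (intel_price, amd_price) in matchups.items():
    premium = (amd_price / intel_price - 1) * 100
    print(f"{name}: BD costs {premium:+.1f}% more for the same performance")
# 2500k vs 6-core BD: BD costs +6.7% more for the same performance
# 2600k vs 8-core BD: BD costs +0.0% more for the same performance
```

So even taking the leak at face value, BD's 6-core only loses on value by a few percent, which is well inside the noise of rumored pricing.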
That's where the oddball 8-core FX8130P comes in. It's priced higher than Intel's unlocked i2600K and was probably put out to show BD's 8 cores winning in multi-threaded benchmarks. From what we know, the "P" refers to the CPU being 125W, compared to the 95W of the rest of the BD CPUs. The fact that they have to put out a special "P" version of the 8-core shows that AMD is already pushing BD to its limits (i.e., not much room for overclocking). Remember that all the FX CPUs come with unlocked multipliers, so there really is no point in putting out two versions of the 8-core. They could just release the FX8110 at the FX8130P's speed, or just release an FX8130 at 95W. The fact that they didn't (can't) would suggest that to get BD to run at FX8130P speed they had to increase its TDP; all things being equal (same CPU family), the only reason you have to increase TDP is if you have to increase the voltage to the CPU, which basically means AMD has to overclock to get BD to run at FX8130P speed, which isn't a good sign for a brand-new architecture.
How is that relevant? You cannot compare IPC and use that to say CPU 1 is faster than CPU 2, just like you can't use the frequency of CPU 1 to state it is faster than CPU 2.

Quote:
It's only the AMD fanbois that are setting their hopes too high for BD. You can debate hypotheticals, specs, etc., but in the end, looking at how they are trying to market the product will give you a good idea of its performance. The fact that AMD has not talked at all about single-threaded performance and only talks about cores, etc., should tell you that BD is gonna fall short of SB clock for clock. When you try to market something, you never point out your product's obvious shortcomings; you only focus on its strengths and divert/ignore everything else. It's why you don't see Ferrari advertising the super fuel efficiency of their cars, or the ample amount of cargo space.
AMD is in the spot where it needs to perform better for the same cost. It will most likely be clocked higher than the competing SB version... but that is by design. However, how much is the difference between the i2400 and the i7 2600K? That surely does not line up with going from 2M to 4M, not even going from 2M to 3M.

Quote:
The marketing slides and price list further point to this fact. The FX4110 is priced at $190, which is less than $216 for Intel's i2500k; it's a pretty safe bet the FX4110 will be slower than the i2500 in pretty much all tasks. The pricing of the FX6110 at $240 would suggest the 6-core BD will be faster than the 4-core i2500 in multi-threaded situations, and most likely slower in single-threaded. The pricing of the FX4110 makes it more competitive with the i2400; I suspect performance should be similar, mainly due to the FX4110 being clocked higher and turboing higher.
Not sure why SB would be faster in single-threaded now. SB-E and Gulftown have a single-threaded turbo boost very close to the 2500 and 2600 (100 MHz difference), so all the top-line Intel CPUs are also very close to each other in real single-threaded applications (e.g., 2500 < 3% of 2600 < 3% of SB-E). So again, why would single-threaded be faster on SB (on average)? If AMD is able to compete with SB using a high-clocked 4-core, they can use Turbo Boost 2 to clock modules high enough to compete with those models also.

Quote:
The 8-core BD pricing and the P is another sign of what to expect. If BD were all that it's made out to be, it would be a safe bet that on the slides you'd see "Superior Performance" instead of "More Cores Overclocked". The pricing of the 8-core FX8110 @ $290 puts it nearly identical to the i2600's price. Once again, it'll probably be Intel faster at single-thread and the 8-core BD faster at multi-thread... but probably not by as big a margin as AMD hopes.
That's 1,277 threads in memory. A dual-core i5, with HT on, can only run 4 of them at a time. A quad-core (2-module) BD will also run no more than 4 at a time.

Quote:
Checks task manager, sees 1,277 threads running on a dual-core i5.
It is less about speculation of where the workloads are trending, which I'm sure AMD and Intel are on the same page about (if not publicly), than about the best way to improve performance in any given workload with the resources available.

Quote:
Heh, for all the noise that's been made about the advantages of AMD's approach... It's nothing more than a further level of SMT than Intel's Hyper-Threading - they're simply duplicating execution logic (in addition to the control logic necessary for SMT) in accordance with where they felt multi-threaded workloads are trending. What's odd is that they didn't allow for either 3 or 4 threads to be executed per module to keep that duplicated logic busy.
It's not a further level of SMT. Where AMD may use SMT, they will just be using SMT. If you duplicate the execution logic, such that it isn't shared by multiple threads, it's not SMT. Both methods result in multithreaded processors, but they are significantly different approaches. SMT targets idle time, assuming that if units are not busy, they need more to do. CMT targets waste, assuming that if units are not busy, there must be fat to trim. They are orthogonal concepts. If you trim fat to the point that the cluster can keep all of a thread's execution units busy, while not idling other parts of the CPU (i.e., no redundancy, but also no bottlenecks to functional units), but real code doesn't do that, you could add SMT and get just as much benefit (or detriment) as with a pure CMP that gets SMT added to it.

Quote:
It's nothing more than a further level of SMT than Intel's hyperthreading - they're simply duplicating execution logic (in addition to control logic necessary for SMT)...
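The difference can be sketched with a toy throughput model: SMT shares one wide pool of execution units at issue time, while CMT steers each thread to its own narrower cluster at dispatch time. The widths below are hypothetical, picked only for illustration, and the model ignores dependencies, caches, and branch prediction -- which is exactly why the wide core's single-thread edge rarely shows up in full on real code.

```python
from math import ceil

# Toy model: a 4-wide SMT core vs. a CMT module of two 2-wide clusters.
# Widths are hypothetical; dependencies and memory latency are ignored.
def smt_cycles(ops_a, ops_b, width=4):
    # Both threads issue into one shared pool of execution units.
    return ceil((ops_a + ops_b) / width)

def cmt_cycles(ops_a, ops_b, cluster_width=2):
    # Each thread is steered to its own cluster at dispatch time.
    return max(ceil(ops_a / cluster_width), ceil(ops_b / cluster_width))

# Two busy threads: same total width, so the approaches tie.
print(smt_cycles(100, 100), cmt_cycles(100, 100))  # 50 50
# One thread alone: the shared SMT pool can throw its full width at it,
# while the CMT cluster is capped at its own width.
print(smt_cycles(100, 0), cmt_cycles(100, 0))  # 25 50
```

In practice, a lone thread is usually limited by dependent instructions and memory latency long before it is limited by issue width, which is the argument above for why trimming per-core width costs little.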
This "fact" is no fact at all:

Quote:
The fact that AMD has not talked at all about single-threaded performance and only talks about cores, etc., should tell you that BD is gonna fall short of SB clock for clock.
Quote (Michael Butler et al., "Bulldozer: An Approach to Multithreaded Compute Performance"):
While high-throughput performance was a primary goal for Bulldozer, AMD made a significant investment in delivering high single-thread performance levels. A major contributor to this strategy is scaling the core structures and an aggressive frequency goal (low gates per clock). Another major component of the single-thread performance strategy is Bulldozer's investment in instruction and data prefetching.
I'm not sure if these prices have already reached "fact" status. I still see them as rumor/speculation, albeit not an unrealistic one.

Quote:
The marketing slides and price list further point to this fact. The FX4110 is priced at $190, which is less than $216 for Intel's i2500k; it's a pretty safe bet the FX4110 will be slower than the i2500 in pretty much all tasks. The pricing of the FX6110 at $240 would suggest the 6-core BD will be faster than the 4-core i2500 in multi-threaded situations, and most likely slower in single-threaded. The pricing of the FX4110 makes it more competitive with the i2400; I suspect performance should be similar, mainly due to the FX4110 being clocked higher and turboing higher.
The problem with such a high-level view of performance is the missing granularity regarding specific tasks. It could be that the average performance across all tested apps/games/synthetic benches is about the same, while for the whole group of games (although they also differ a lot) BD might be 10% slower on average, for video encoding 10% faster, and for synthetic memory benches 20% faster on average. Any variant is possible. We might begin with cycle-exact simulations of the architecture using a modified PTLSim, but that would help only so much and means a lot of work.

Quote:
The 8-core BD pricing and the P is another sign of what to expect. If BD were all that it's made out to be, it would be a safe bet that on the slides you'd see "Superior Performance" instead of "More Cores Overclocked". The pricing of the 8-core FX8110 @ $290 puts it nearly identical to the i2600's price. Once again, it'll probably be Intel faster at single-thread and the 8-core BD faster at multi-thread... but probably not by as big a margin as AMD hopes.
http://forums.anandtech.com/showpost.php?p=31752021&postcount=2414
No. There is no killing to be done. Let's say your execution units are at 100%, and unit A in each CPU is running at 70%. But in tight code, when only one thread runs, it can reach 100%. If you share it between cores, at 140% of the original capacity, what have you lost? Nothing.

Quote:
Why? Because AMD's approach was likely derived after analyzing workloads and deciding that, in their intended design, everything except for the integer cores was typically IDLE half the time. So what did they do? It doesn't make sense to halve the rest of the logic and kill single-threaded performance, ...
Nope. The other way around. They took two whole cores and trimmed the fat, until trimming more would hurt each core's performance (the 4-wide decoder was a bit surprising, though... I was expecting closer to 6, for occasional peaks). There should be enough performance in the shared front-end to keep both execution units busy, if the code allows. The front end should not be over-provisioned.

Quote:
...so instead they added SMT control logic to all other portions of the core design and simply duplicated the integer core.
The FPU is using SMT of some kind. The integer logic, though, can be, and is, called something other than SMT. It's called CMT. Here is an easy-to-Google paper, with nice diagrams on page 3. The C stands for cluster, referring to grouping otherwise independent items.

Quote:
It really can't be called anything other than SMT because you have... fetch, decode, branch prediction, and the FPU at least as shared logic, without which the duplicated integer logic sure can't do much.
In that case, what company isn't lazy today?

Quote:
At the same time, I wouldn't agree that this approach is 'the future'. It's the future for lazy designs that don't want to maximize single-threaded performance in the process.
My jar of magic pixie dust is slap empty. Would people really buy a 200W CPU 10% faster than a 2600K, with fewer threads, assuming AMD even could do it? I wouldn't. I think the idea that AMD could truly one-up Intel by following behind Intel is laughable. Intel can leverage smaller and faster xtors, yet also relies on that ability. AMD must exploit that as a weakness.

Quote:
They could have spent the exact same die area creating a massive monolithic integer core and extracted the exact same multi-threaded performance through a typical SMT implementation while getting far higher single-threaded performance.
How is it a copy and paste? I'm not seeing how you can split each core from its twin within a module.

Quote:
But such a design is markedly more time-consuming than 'simple' copy/paste (yeah, not quite -that- simple, but with AMD's apparent floor-plan approach it's close to it).
Where are you getting twice the die size? AMD has given very ambiguous numbers, and no CMP version of BD was ever developed, that we know of. The size of a single core of a new CMP design is unknown.

Quote:
1. I'd sure hope it's twice as fast considering that AMD's approach is to use twice the die size. Really, it's the same as point number 3.
2. Amusing thought, but it's by no means an 'advantage' seeing as how those constraints apply to the design no matter what. It could just as easily end up being a disadvantage.
3. Already detailed above, and this is indeed an advantage for AMD considering the marked difference in development costs. However, it's a disadvantage in terms of single-threaded performance.
Not nonsense. If a thread can utilize 90% of an SB core, and you add another thread that can do the same, your best-case scenario is an 11% throughput improvement, with each thread running at about 55% of the speed of just one. In the worst case, that 90% is all cache, and your total performance and per-thread performance will both drop (usually HPC and video stuff, but DBs are not immune either, even with the new Cores).

Quote:
4. Nonsense. Unless you're going back to point number 1, where sure, AMD's approach should be twice as fast when executing across duplicated logic, since it's using twice the die space.
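That best-case arithmetic can be checked directly with a toy model. The 90% figure is the post's assumption, not a measurement, and cache effects are ignored.

```python
# Toy model of two identical threads sharing one core via SMT.
# per_thread_demand is the fraction of the core one thread can keep busy.
def smt_totals(per_thread_demand, capacity=1.0):
    total = min(2 * per_thread_demand, capacity)   # core caps out at 100%
    gain = total / per_thread_demand - 1           # throughput gain vs. one thread
    per_thread = (total / 2) / per_thread_demand   # each thread's relative speed
    return gain, per_thread

gain, speed = smt_totals(0.9)
print(f"{gain:.1%} more throughput, each thread at {speed:.1%} speed")
# 11.1% more throughput, each thread at 55.6% speed

# HT's ideal case, mentioned later in the thread: nothing used over 50%.
gain, speed = smt_totals(0.5)
print(f"{gain:.1%} more throughput, each thread at {speed:.1%} speed")
# 100.0% more throughput, each thread at 100.0% speed
```

The same function also reproduces the near-100% ideal case: when a single thread only demands half the core, a second thread costs the first one nothing.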
Simpler things run faster. Smaller things run faster. Intel can get around this benefit of CMT for some generations yet by making smaller and faster xtors, which will let their CMP-only CPUs get by for a bit longer. AMD needs to follow the spirit of RISC, and simplify for that speed. A CMP of monolithic cores could very well run slower, due to latencies on the chip, or just from switching more xtors, and thus using more power.

Quote:
5. What does this have to do with AMD's supposedly non-SMT approach vs Intel's more typical SMT?
For the FPU, they aren't hiding that they are using SMT, but they do seem to be going to pains to describe it in very different terms than Intel likes to describe HT, and I'm sure some AMD employees spent hours just on that. For the int side, though... it's not SMT. It's more real cores, just as AMD has talked up in the past, but not made as pure CMPs.

Quote:
6. Yes, I know, this is the entire reason that AMD is trying to say that they aren't doing SMT. But it's nothing more than marketing.
Yep, after all, it's a well-known fact that all those complications like multiple cores, sophisticated cache-coherency protocols, or out-of-order execution really only slow down programs. I think you'll agree that you simplified a bit too much here - for some things, simplicity up to a certain degree is positive (say, the ISA), but there's a reason even RISC chips are generally getting more and more complicated with every generation (cf. ARM and their cache implementations for one obvious example).

Quote:
Simpler things run faster. Smaller things run faster.
I'm guessing that my statement was not properly understood, since that's the only explanation for that response. Let me attempt to rephrase in similar terms. If a fictional design, in its intended workloads, averaged 90% utilization of its integer resources and 45% utilization of everything else... then it doesn't make sense to halve everything else in order to have a more balanced core design - sure, it would increase the utilization of everything else, simply because there's less of it, heh. It'd also decrease integer-resource utilization due to dependencies and drastically decrease performance (aka kill).

Quote:
No. There is no killing to be done. Let's say your execution units are at 100%, and unit A in each CPU is running at 70%. But in tight code, when only one thread runs, it can reach 100%. If you share it between cores, at 140% of the original capacity, what have you lost? Nothing.

Why? Because AMD's approach was likely derived after analyzing workloads and deciding that, in their intended design, everything except for the integer cores was typically IDLE half the time. So what did they do? It doesn't make sense to halve the rest of the logic and kill single-threaded performance, ...
Eh, okay. I know I sure wasn't there in the high-level architecture design meetings 5+ years ago when those decisions were made.

Quote:
Nope. The other way around. They took two whole cores and trimmed the fat, until trimming more would hurt each core's performance (the 4-wide decoder was a bit surprising, though... I was expecting closer to 6, for occasional peaks). There should be enough performance in the shared front-end to keep both execution units busy, if the code allows. The front end should not be over-provisioned.

...so instead they added SMT control logic to all other portions of the core design and simply duplicated the integer core.
I'll pretend to be an academic for a moment and proclaim that AMD has invented SCMT! Or maybe even P-SCMT, if they did separate issue queues like Hyper-Threading. After all, according to that paper, Bulldozer is neither SMT nor CMT because of the shared FPU; quoting from page 2: "The primary difference between the P-SMT and CMT approaches is that the former assigns threads to execution units at issue time, while in the more highly partitioned CMT processor, this assignment is done at dispatch time by steering each instruction to a particular cluster." As for the actual performance and energy conclusions of that paper, it's unfortunate that they only compared various 16-thread designs, none of which are comparable to the processors we're interested in.

Quote:
The FPU is using SMT of some kind. The integer logic, though, can be, and is, called something other than SMT. It's called CMT. Here is an easy-to-Google paper, with nice diagrams on page 3. The C stands for cluster, referring to grouping otherwise independent items.

It really can't be called anything other than SMT because you have... fetch, decode, branch prediction, and the FPU at least as shared logic, without which the duplicated integer logic sure can't do much.
Haha, I'll certainly agree with that entire assessment! I especially enjoy the Oracle one, since that's ever so true.

Quote:
In that case, what company isn't lazy today?

At the same time, I wouldn't agree that this approach is 'the future'. It's the future for lazy designs that don't want to maximize single-threaded performance in the process.
AMD, Intel: can't chase single-thread performance alone, due to needing more threads on the die, so they must balance several threads at a time, and idle those cores when they aren't needed, within a fairly strict power envelope.
IBM: going all-in on SMT, cost and power be damned.
Tilera: we can cram more compute kernels into a chip than you can eat popped corn kernels during a bad movie.
ARM: OK single-thread performance, good power efficiency, A5 MP on the way for threads galore.
Oracle: we're gonna run each thread like it's 1989, but run so many, it sets records, and replaces whole racks of other servers.
Neither Intel nor AMD maximizes single-thread performance absolutely, though both of them do maximize it within their many-core constraints.
Did Apple steal it again? Sorry, couldn't resist... Continuing with the rest.

Quote:
My jar of magic pixie dust is slap empty.

They could have spent the exact same die area creating a massive monolithic integer core and extracted the exact same multi-threaded performance through a typical SMT implementation while getting far higher single-threaded performance.
Pretty sure I didn't say anything about such a design being more power-hungry or having fewer threads... But you are correct in that I should have stated it as far higher potential single-threaded performance, which likely wouldn't be realized often at all. The point being that the same resources in an equivalent SMT configuration could hit the same multi-threaded performance while placing no constraints on single-threaded potential. It's just markedly more difficult to design an adequate scheduler of that width.

Quote:
Would people really buy a 200W CPU 10% faster than a 2600K, with fewer threads, assuming AMD even could do it? I wouldn't. I think the idea that AMD could truly one-up Intel by following behind Intel is laughable. Intel can leverage smaller and faster xtors, yet also relies on that ability. AMD must exploit that as a weakness.
Also, it's not big, wide execution units that make x86 int performance. It's efficient use of the caches, good prefetchers, and good branch predictors. Is that everything? No. But you can't run any faster than those allow when you're running around looking through pointer after pointer.
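The pointer-after-pointer point can be illustrated with a small sketch (hypothetical code, not from the thread): the two loops below do identical arithmetic, but in the linked traversal every load address depends on the previous load, so prefetchers and wide execution can't help it the way they help a contiguous scan.

```python
import random

N = 10_000
values = list(range(N))

# Contiguous traversal: addresses are sequential and prefetch-friendly.
dense_sum = sum(values)

# Pointer chase: visit the same values through a shuffled "next" chain,
# so each step's address is only known after the previous load completes.
random.seed(0)
order = list(range(N))
random.shuffle(order)
next_idx = [0] * N
for a, b in zip(order, order[1:]):
    next_idx[a] = b
next_idx[order[-1]] = order[0]  # close the cycle

chased_sum, i = 0, order[0]
for _ in range(N):
    chased_sum += values[i]
    i = next_idx[i]

print(dense_sum == chased_sum)  # True: same work, very different access pattern
```

Timing the two loops on real hardware (e.g., in C, over arrays larger than the last-level cache) is where the prefetcher and branch-predictor point above shows up as an order-of-magnitude gap.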
Okay, it's a copy, reflect, and paste. At least that's what the die shot implies, and it's the only sensible way to implement the design. (I did the same thing in an ALU layout for one of my VLSI courses back in college.) Doing it any other way vastly increases the amount of back-end work necessary, for no purpose. Oh, and I guess I should have been more specific that I'm talking in terms of design implementation.

Quote:
How is it a copy and paste? I'm not seeing how you can split each core from its twin within a module.

But such a design is markedly more time-consuming than 'simple' copy/paste (yeah, not quite -that- simple, but with AMD's apparent floor-plan approach it's close to it).
Yay for triple-quote to ensure that proper context is clear. Now, in that context of comparing the integer design approaches, AMD's "CMT" doubles the die size used (okay, it's actually more like 1.98x, due to the bits that SMT needs to add) vs an SB-type SMT... There's no need for figures or anything - when you duplicate the integer logic, you double the die space that that logic is going to use.

Quote:
Where are you getting twice the die size? AMD has given very ambiguous numbers, and no CMP version of BD was ever developed, that we know of. The size of a single core of a new CMP design is unknown.

1. I'd sure hope it's twice as fast considering that AMD's approach is to use twice the die size. Really, it's the same as point number 3.

CMT should have five main advantages over SMT, if each is used exclusively, as in BD's integer vs SB's integer:
1. Each thread can run about as fast as if the other shared resources weren't in use, except cache (cache is an area where both will have similar problems).
My claim of nonsense and reference back to point number 1 was that there's no merit of 'CMT' that increases its potential multi-threaded performance in comparison to SMT, unless you give it more execution units to work with. Now, the statement that SMT only provides an increase in performance with inefficient code would be correct, but it's still quite 'effective' at keeping execution units busy when running efficient code. On the flip side, depending on its implementation, a 'CMT' design could easily find inefficient code resulting in idle execution units - everything available thus far implies that this could well be the case with Bulldozer.

Quote:
Not nonsense. If a thread can utilize 90% of an SB core, and you add another thread that can do the same, your best-case scenario is an 11% throughput improvement, with each thread running at about 55% of the speed of just one. In the worst case, that 90% is all cache, and your total performance and per-thread performance will both drop (usually HPC and video stuff, but DBs are not immune either, even with the new Cores).

4. Nonsense. Unless you're going back to point number 1, where sure, AMD's approach should be twice as fast when executing across duplicated logic, since it's using twice the die space.

4. CMT's effectiveness, in a chip made for high performance with a few threads, will not depend on inefficient code execution (merely that dividing the execution resources caps peak performance, compared to being wider yet otherwise identical), where SMT can be dependent on such.
HT's ideal case is that no CPU resource is being used more than 50%, including caches, and/or that the two threads can use a shared cache well, in which case you will get much better performance. This kind of code tends to be really bad about cache misses and branch mispredicts, so the theoretical near 100% improvement practically never happens.
Smaller and simpler does indeed run faster. But again, how does that turn into an advantage of 'CMT' vs SMT? Sure, compared to a huge CMP there's an advantage. But all the rest is more a function of other design decisions than of some superiority of 'CMT'.

Quote:
Simpler things run faster. Smaller things run faster. Intel can get around this benefit of CMT for some generations yet by making smaller and faster xtors, which will let their CMP-only CPUs get by for a bit longer. AMD needs to follow the spirit of RISC, and simplify for that speed. A CMP of monolithic cores could very well run slower, due to latencies on the chip, or just from switching more xtors, and thus using more power.

5. What does this have to do with AMD's supposedly non-SMT approach vs Intel's more typical SMT?

5. All of that combined should make it easier to reach higher clock speeds within a given TDP, and improve per-thread resource utilization enough to more than make up for the very minor penalty of having narrower execution resources. It is quite possible that a pure CMP, with all the non-cluster features of BD, could be slower for a single-threaded task than BD will be, if just due to lower clock-speed limits at a given TDP.
PS: And really, anyone thinking that adding more cores is simply copy+paste on a modern ASIC is nuts - sure, your design can make it simpler and you can avoid some pitfalls, but it's still far from trivial - there are enough fun things like parasitic capacitances around to make sure it stays fun.
I didn't read this entire thread, but do the leaked prices of, say, the 8-core BD refer to BD modules or actual cores (e.g., 2 cores per module)? If the BD were 8-module... that would be interesting.