
Bulldozer arch versus Sandy Bridge arch

Same IPC and same power consumption? And the same die size/manufacturing cost? Same clock speeds?

Too many other things are involved here to say that just hitting the same IPC would make anything "more better".

Same IPC for Bulldozer might be possible, but in doing so the die size might swell to 500mm^2 and power consumption might double in the process (Pollack's Rule).
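Back-of-the-envelope, that's what Pollack's Rule predicts. A quick sketch (the area figures below are illustrative assumptions, not measured Bulldozer or Sandy Bridge numbers):

```python
import math

# Pollack's Rule (rule of thumb): single-thread performance scales roughly
# with the square root of the die area / transistor budget spent on a core.
# All numbers are illustrative, not real BD/SB measurements.

def perf_gain(area_ratio: float) -> float:
    """Approximate performance ratio from scaling core area by area_ratio."""
    return math.sqrt(area_ratio)

def area_cost(perf_ratio: float) -> float:
    """Inverse: area ratio needed to reach a given performance ratio."""
    return perf_ratio ** 2

print(round(perf_gain(2.0), 2))   # doubling core area buys only ~1.41x performance
print(round(area_cost(1.4), 2))   # a 1.4x IPC target costs roughly 1.96x the area
```

So chasing Sandy Bridge's per-core performance through sheer core size alone roughly squares the area bill, which is where the "500mm^2 and double the power" ballpark comes from.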

Would that be "more better"?
 
Okay, you beat me to it. Too many variables are involved here. A battle that was already lost, lol.

I was thinking more from a design point of view. Mainly Intel HT vs AMD Modules.
 
Identical IPC.
BD was a proof of concept of CMT that never paid off. Had AMD managed to design an 8 core CPU with SB's IPC, I would have to plug it in to the nearest nuclear power plant. 😉
 
BD was a proof of concept of CMT that never paid off. Had AMD managed to design an 8 core CPU with SB's IPC, I would have to plug it in to the nearest nuclear power plant. 😉

Is that Country Music Television or Charcot-Marie-Tooth disease? Google didn't throw up anything more relevant.
 
Bulldozer is a way cooler name than Sandy Bridge. Sandy Bridge sounds like something my daughter would put in to cross the moat around her Disney castle. Bulldozer sounds like something that would knock over the castle/moat/little sandy bridge/etc. Unfortunately for AMD, Sandy Bridge is a LOT better in every other metric.
 
Same IPC would require a monster module. If they can get 15-20% more than the 8150 out of the next revision it will actually be a solid chip in the $250-350 range.
 
Okay, you beat me to it. Too many variables are involved here. A battle that was already lost, lol.

I was thinking more from a design point of view. Mainly Intel HT vs AMD Modules.
Ultimately, the greatest difference would be the kind of overall CPU performance you could get for <$150, and at some decent power consumption. AMD's approach would be the better fit for AMD, since Intel has smaller and faster on-chip memory cells and interfaces in production than anyone else on the planet.

As far as caches go, if your working set is large enough, and/or you face a hard decision as to aligning data versus compacting data, and/or you use a language that abstracts that away from you, then even with very accurate speculative execution, prefetching, and L1 eviction, you can be waiting on cache. OOOE helps hide this, and in some cases, can completely hide it, but it still isn't perfect.

HT (mostly-shared SMT) helps solve the problem of execution units sitting idle because instructions are waiting on others going through the pipelines, and/or waiting on caches. If your threads don't need too much of your cache (Intel can throw more cache at a performance problem with ease, as well) or too much data bandwidth, aren't high-IPC, and aren't sensitive to regularity in task completion latency, SMT can be very good...and all those conditions do represent quite a bit of loopy data processing.

The problems of HT are that your caches are now effectively halved in many cases, you have increased chances for each thread's needs to conflict with the other's in terms of fetching and evicting, and as the code in threads gets more efficient, you face an increased likelihood that each thread will take much longer than with no HT, due to fighting over execution resources and L1/L2 cache bandwidth. Worse yet, atomic operations can slow things down even further. While SMT can be a blessing, and takes very little in the way of extra space or power, why your execution units are idling matters, and just trying to fix it by loading on more threads is not always a good idea.
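Both sides of that tradeoff can be sketched with a toy throughput model (purely illustrative; the issue width and per-thread IPC figures are my assumptions, not Intel's numbers, and cache/bandwidth contention is ignored):

```python
def smt_throughput(ipc_per_thread: float, n_threads: int, issue_width: float) -> float:
    """Toy model: total sustained IPC is capped by the core's issue width,
    which all SMT threads on the core share. Ignores cache and bandwidth
    contention, which make the high-IPC case even worse in practice."""
    return min(ipc_per_thread * n_threads, issue_width)

WIDTH = 4.0  # assumed issue width of one SMT core

# Low-IPC threads: the second thread fills otherwise-idle slots,
# so SMT nearly doubles throughput.
print(smt_throughput(1.5, 2, WIDTH))  # 3.0 (vs 1.5 with one thread)

# High-IPC threads: both threads fight over the same 4 slots, so each
# effectively runs at 2.0 instead of its solo 3.5.
print(smt_throughput(3.5, 2, WIDTH))  # 4.0 total, each thread slower
```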

CMT (BD's int cores; the FP unit is CMT+SMT) helps solve the problem of execution units being unavailable, because typical CPU front ends and execution units are designed for heavily optimized loops. But not making them good at such loops would be plain stupid, since optimization of just a few lines of code can make huge differences in any language compiled straight to binary. While somewhat coarse-grained (I would expect that to improve over time), the shared front end allows code that can use wide fetch/decode to do so, while code that can't isn't forced to. In code coming straight out of common compilers, the typical result should be 90%+ the performance of two whole cores, while reducing size and power by more than the performance drop. Low-IPC code will be as fast as on two whole cores. All of this appears to work out quite well on BD.

In that case, adding dedicated execution units for each thread serves a similar purpose to SMT's adding of threads to execution units. The advantage should be not only that typical code runs about as well, but that bandwidth-bound and latency-bound code will have more execution and on-chip memory resources free at any given time, whether running few or many threads. For AMD, it serves a marketing purpose as well as a technical one, in that it should target precisely those areas where SMT is inferior in performance to CMP of more cores, while not giving up performance where SMT is good (IE, comparing by total thread counts and power consumption).

If near caches were not shared between threads, the negatives should be minimal, with highly-efficient high-IPC threads able to bring each 'core' down to maybe 80% or so of a plain CMP, and even that much of a drop should be atypical. The real gotcha would be that very-high-IPC loops, typically hand-tuned, across many threads, would be bottlenecked by the front end. In that case, an SMT CPU and CMT CPU with similar end-to-end widths and memory bandwidths would perform similarly on a basis of the width of each front end (see synthetic int tests on a FX-8150 v. i5-2500). IOW, even if BD had lived up to its full promise, SB-E would not look remotely bad.
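The same kind of toy model works for a CMT module (again a sketch; the widths are assumptions loosely patterned on BD's two narrow integer cores behind a shared front end, and the sharing-overhead figure is invented for illustration):

```python
def cmt_throughput(ipc_per_thread: float, n_threads: int,
                   core_width: float, frontend_width: float) -> float:
    """Toy model: each thread owns its execution units (core_width),
    but fetch/decode (frontend_width) is shared across the module."""
    per_thread = min(ipc_per_thread, core_width)
    return min(per_thread * n_threads, frontend_width)

# Assumed module: two 2-wide integer cores behind a shared 4-wide front end.

# Low-IPC threads: as fast as two whole cores, the front end keeps up.
print(cmt_throughput(1.0, 2, 2.0, 4.0))  # 2.0

# Hand-tuned high-IPC loops: both cores want 2.0 each and the shared
# front end is exactly saturated; any front-end stall hits both threads.
print(cmt_throughput(2.0, 2, 2.0, 4.0))  # 4.0

# With coarse-grained sharing overhead (assume an effective ~3.5-wide
# front end), two high-IPC threads get 3.5 instead of 4.0, roughly the
# "90%+ of two whole cores" ballpark described above.
print(cmt_throughput(2.0, 2, 2.0, 3.5))  # 3.5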

However, all of that working out as better at this and better at that would be premised on similar single-threaded performance, which is lacking; and the shared L2 complicates it all a bit, as well. With worse single-threaded performance, worse performance per clock on average, and high power consumption, CMT itself working as advertised is not much consolation to the customer who buys a computer. Most will be better served by Intel CPUs w/o HT than by AMD BD CPUs.
 
Same IPC would require a monster module. If they can get 15-20% more than the 8150 out of the next revision it will actually be a solid chip in the $250-350 range.
You reckon it is the architecture as a whole that requires heaps of power?

or just poor execution?
 
You reckon it is the architecture as a whole that requires heaps of power?

or just poor execution?

Is there a difference?

It all comes down to project management, the tradeoffs made, the work ethic/effort invested, time and resources, etc.

At the end of the day it comes down to the people, how many are involved, and their capabilities in capitalizing on opportunities versus looking for excuses to explain failure in advance.

[Image: the project management triangle (good, fast, cheap: pick two)]
 
Nice drawing skills :thumbsup:

I get your point. This must be really hilarious for you to deal with people like me, lol.
 

Nice. And just to clarify: "fast" doesn't mean the product is fast, but rather how long it takes to get the product created. I tell my boss every so often that if it wasn't for such a competitive schedule, we could seriously clean up a lot of loose ends. Oh well, here's to "good enough, better luck next time" methodologies.
 
Actually, for the Bulldozer project, it was neither a good project, nor a fast project. I have no idea if it was a cheap project though. I doubt it. It seems they went 0/3
 
You know, 3dfx had a similarly uber-ambitious project before they went bust.

-BUT-

More specifically, it's because AMD doesn't have talents like IDC 🙂
 
Actually, for the Bulldozer project, it was neither a good project, nor a fast project. I have no idea if it was a cheap project though. I doubt it. It seems they went 0/3

I kind of wonder what nightmare the project manager was going through. The project was late and I really would like to believe that they knew they had a performance problem but they couldn't delay the product anymore to fix it.

I know we had a couple close calls where one of our features was modelled incorrectly and so the performance we thought we got wasn't there at all once we started simulating it. But it was nothing that a couple months of jamming in circuits and new instructions couldn't fix. 😀
 
You know, 3dfx had a similarly uber-ambitious project before they went bust.

-BUT-

More specifically, it's because AMD doesn't have talents like IDC 🙂

Oh, AMD has talent, it's just crazily under-resourced and over-stressed like few others have been. It's easy to make it look easy when you have no skin in the game (referring to myself), but I'm really bush-league compared to the experts at AMD, seriously.

I kind of wonder what nightmare the project manager was going through. The project was late and I really would like to believe that they knew they had a performance problem but they couldn't delay the product anymore to fix it.

I know we had a couple close calls where one of our features was modelled incorrectly and so the performance we thought we got wasn't there at all once we started simulating it. But it was nothing that a couple months of jamming in circuits and new instructions couldn't fix. 😀

LOL, reminds me of "In theory, there is no difference between theory and practice. But, in practice, there is." 😛
 