bryanW1995
Lifer
- May 22, 2007
- 11,144
- 32
- 91
@IDC:
Cache latency effect for K8 is a good point, but this happened w/o significantly changing the rest of the microarchitecture.
With BD, actually everything changed. And there are already some known bottlenecks regarding streaming-store bandwidth, which will be addressed w/ BDv2 (Piledriver).
Each thread running on a BD module should have access to the same L1 BW as both threads running on a SB core combined (2x64b R + 1x64b W). And even the capacity and associativity for 2 threads (2x4 ways, 2x16kB) match what's available to 2 threads on a SB core. Here SB has a clear advantage for single threads, since one thread can use the full cache.
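The per-thread bandwidth claim above is easy to sanity-check with back-of-envelope math. A minimal sketch, using the port widths from the post; the clock speeds are placeholder assumptions, not measured values:

```python
# Peak L1D bandwidth per thread from port count * port width * clock.
# Port widths (2x64b read + 1x64b write) are from the post; clocks are
# illustrative assumptions.

def l1_bw_gbps(reads_per_cycle, writes_per_cycle, width_bits, clock_ghz):
    """Peak L1 bandwidth in GB/s: ports * width * clock."""
    bytes_per_cycle = (reads_per_cycle + writes_per_cycle) * width_bits / 8
    return bytes_per_cycle * clock_ghz

# Each BD thread gets the full 2R + 1W per cycle to itself:
bd_per_thread = l1_bw_gbps(2, 1, 64, clock_ghz=3.6)   # 24 B/cycle * 3.6 GHz = 86.4 GB/s

# On an SB core the same port budget is shared by 2 SMT threads:
sb_per_thread = l1_bw_gbps(2, 1, 64, clock_ghz=3.4) / 2

print(f"BD thread peak L1 BW: {bd_per_thread:.1f} GB/s")
print(f"SB thread peak L1 BW (both threads active): {sb_per_thread:.1f} GB/s")
```

With one thread idle, of course, the SB thread gets the full port budget back, which is the single-thread advantage the post refers to.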
Matthias, just the man I was wanting to run into!
Been meaning to catch you at some point to pick your brain a bit...what is your assessment of the reasoning for the per-core IPC gap between Bulldozer and Sandy Bridge?
What is it about SB that makes it so superior to BD (is it more decoders? faster cache?), and what do think AMD will do/should do/can do to improve bulldozer's IPC with piledriver?
If the cache is a red herring or dead-end excuse for the current IPC, what is holding the IPC back?
With Prescott it was pretty straightforward as to what was killing IPC - the pipeline was so long that the penalties from mispredicts and so on were unforgiving on performance.
Do you see a similar problem being Bulldozer's Achilles heel?
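The Prescott effect described above can be put in rough numbers with a toy CPI model: charge every mispredicted branch a full pipeline refill and see what it does to IPC. All the inputs below (branch frequency, mispredict rate, flush depths) are illustrative assumptions, not measurements of either chip:

```python
# Toy model: how pipeline depth turns branch mispredicts into lost IPC.
# All parameter values are illustrative, not measured.

def effective_ipc(base_ipc, branch_freq, mispredict_rate, flush_penalty_cycles):
    """IPC after charging each mispredicted branch a full pipeline flush."""
    # Extra cycles per instruction spent refilling the pipeline:
    penalty_cpi = branch_freq * mispredict_rate * flush_penalty_cycles
    return 1 / (1 / base_ipc + penalty_cpi)

# Same workload, two hypothetical pipeline depths:
short_pipe = effective_ipc(2.0, branch_freq=0.2, mispredict_rate=0.05, flush_penalty_cycles=12)
long_pipe  = effective_ipc(2.0, branch_freq=0.2, mispredict_rate=0.05, flush_penalty_cycles=30)

print(f"12-cycle flush: {short_pipe:.2f} IPC")  # ~1.61
print(f"30-cycle flush: {long_pipe:.2f} IPC")   # ~1.25
```

Under these made-up numbers, stretching the flush penalty from 12 to 30 cycles costs over 20% of the achieved IPC with no other change, which is the kind of silent tax a long pipeline imposes.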
Not to derail your discussion here, but I thought this diagram was an interesting comparison between BD and Westmere. Granted this is over a year old, but it shows some striking differences in CPU development.
http://realworldtech.com/page.cfm?ArticleID=RWT082610181333&p=10
I also agree, it should be more obvious from testing what the "thing" is that is holding BD back. I just get this sense (based on nothing I can point to) that BD is like a NASCAR engine with "some" restrictor plate limiting it to 100 mph. I keep expecting AMD to come out and say something like "wow, holy crap, we shipped this thing and it turns out half the modules were deactivated by accident," or "turns out the cache showed up in tests but it was actually disabled the entire time," etc.
Also, I am placing my bets on the shared instruction decode as being one significant bottleneck, along with pre-fetching not being mature enough to offset the cache size/timings.
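For what the shared-decode theory would mean in practice: BD's module has one decoder serving two threads, alternating between them, so when both threads are busy each one sees only half the decode width. A minimal sketch of that ceiling; the widths and cycle counts are illustrative, not a claim about actual sustained throughput:

```python
# Sketch of the shared front-end ceiling: one decoder round-robins whole
# cycles between the module's two threads, so each busy thread sees at
# most half the decode width. Numbers are illustrative assumptions.

def decoded_ops_per_thread(decode_width, active_threads, cycles):
    """Ops decoded per thread when the decoder alternates whole cycles."""
    return decode_width * cycles // active_threads

one_thread  = decoded_ops_per_thread(4, 1, cycles=100)  # 400 ops
two_threads = decoded_ops_per_thread(4, 2, cycles=100)  # 200 ops each

print(f"1 thread active:  {one_thread} ops decoded")
print(f"2 threads active: {two_threads} ops decoded per thread")
```

If the back end can retire more than that per thread, the decoder becomes the limiter exactly as the bet above suggests.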
What it is, exactly, that failed to deliver is something I haven't been able to nail down yet.
You are not the only one
But I believe I have found something; I'll let you know in a few days.