bryanW1995
Lifer
- May 22, 2007
- 11,144
- 32
- 91
@IDC:
Cache latency effect for K8 is a good point, but this happened w/o significantly changing the rest of the microarchitecture.
With BD, actually everything changed. And there are already some known bottlenecks regarding streaming-store bandwidth, which will be addressed w/ BDv2 (Piledriver).
Each thread running on a BD module should have access to the same L1 BW as both threads running on a SB core combined (2x64b R + 1x64b W). And even the capacity and associativity for 2 threads (2x4 ways, 2x16kB) match what's available to 2 threads on a SB core. Here SB has a clear advantage for single threads, since one thread can use the full cache.
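The per-thread bandwidth claim above is easy to sanity-check with back-of-envelope math. A minimal sketch, using the port widths from the post; the clock speeds are placeholder assumptions, not measured values:

```python
# Peak L1D bandwidth per thread from port count * port width * clock.
# Port widths (2x64b read + 1x64b write) are from the post; clocks are
# illustrative assumptions.

def l1_bw_gbps(reads_per_cycle, writes_per_cycle, width_bits, clock_ghz):
    """Peak L1 bandwidth in GB/s: ports * width * clock."""
    bytes_per_cycle = (reads_per_cycle + writes_per_cycle) * width_bits / 8
    return bytes_per_cycle * clock_ghz

# Each BD thread gets the full 2R + 1W per cycle to itself:
bd_per_thread = l1_bw_gbps(2, 1, 64, clock_ghz=3.6)   # 24 B/cycle * 3.6 GHz = 86.4 GB/s

# On an SB core the same port budget is shared by 2 SMT threads:
sb_per_thread = l1_bw_gbps(2, 1, 64, clock_ghz=3.4) / 2

print(f"BD thread peak L1 BW: {bd_per_thread:.1f} GB/s")
print(f"SB thread peak L1 BW (both threads active): {sb_per_thread:.1f} GB/s")
```

With one thread idle, of course, the SB thread gets the full port budget back, which is the single-thread advantage the post refers to.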
Matthias, just the man I was wanting to run into!
Been meaning to catch you at some point to pick your brain a bit...what is your assessment of the reasoning for the per-core IPC gap between Bulldozer and Sandy Bridge?
What is it about SB that makes it so superior to BD (is it more decoders? faster cache?), and what do think AMD will do/should do/can do to improve bulldozer's IPC with piledriver?
If the cache is a red herring or dead-end excuse for the current IPC, what is holding the IPC back?
With Prescott it was pretty straightforward as to what was killing IPC - the pipeline was so long that the penalties from mispredicts and so on were unforgiving on performance.
Do you see a similar problem being Bulldozer's Achilles heel?
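The Prescott effect described above can be put in rough numbers with a toy CPI model: charge every mispredicted branch a full pipeline refill and see what it does to IPC. All the inputs below (branch frequency, mispredict rate, flush depths) are illustrative assumptions, not measurements of either chip:

```python
# Toy model: how pipeline depth turns branch mispredicts into lost IPC.
# All parameter values are illustrative, not measured.

def effective_ipc(base_ipc, branch_freq, mispredict_rate, flush_penalty_cycles):
    """IPC after charging each mispredicted branch a full pipeline flush."""
    # Extra cycles per instruction spent refilling the pipeline:
    penalty_cpi = branch_freq * mispredict_rate * flush_penalty_cycles
    return 1 / (1 / base_ipc + penalty_cpi)

# Same workload, two hypothetical pipeline depths:
short_pipe = effective_ipc(2.0, branch_freq=0.2, mispredict_rate=0.05, flush_penalty_cycles=12)
long_pipe  = effective_ipc(2.0, branch_freq=0.2, mispredict_rate=0.05, flush_penalty_cycles=30)

print(f"12-cycle flush: {short_pipe:.2f} IPC")  # ~1.61
print(f"30-cycle flush: {long_pipe:.2f} IPC")   # ~1.25
```

Under these made-up numbers, stretching the flush penalty from 12 to 30 cycles costs over 20% of the achieved IPC with no other change, which is the kind of silent tax a long pipeline imposes.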
Not to derail your discussion here, but I thought this diagram was an interesting comparison between BD and Westmere. Granted this is over a year old, but it shows some striking differences in CPU development.
http://realworldtech.com/page.cfm?ArticleID=RWT082610181333&p=10
I also agree, it should be more obvious from testing what the "thing" is that is holding BD back. I just get this sense (based on nothing I can point to) that BD is like a NASCAR engine with "some" restrictor plate limiting it to 100 mph. I keep expecting AMD to come out and say something like "wow, holy crap, we shipped this thing and it turns out half the modules were deactivated by accident," or "turns out the cache showed up in tests but it was actually disabled the entire time," etc.
Also, I am placing my bets on the shared instruction decode as being one significant bottleneck, along with pre-fetching not being mature enough to offset the cache size/timings.
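For what the shared-decode theory would mean in practice: BD's module has one decoder serving two threads, alternating between them, so when both threads are busy each one sees only half the decode width. A minimal sketch of that ceiling; the widths and cycle counts are illustrative, not a claim about actual sustained throughput:

```python
# Sketch of the shared front-end ceiling: one decoder round-robins whole
# cycles between the module's two threads, so each busy thread sees at
# most half the decode width. Numbers are illustrative assumptions.

def decoded_ops_per_thread(decode_width, active_threads, cycles):
    """Ops decoded per thread when the decoder alternates whole cycles."""
    return decode_width * cycles // active_threads

one_thread  = decoded_ops_per_thread(4, 1, cycles=100)  # 400 ops
two_threads = decoded_ops_per_thread(4, 2, cycles=100)  # 200 ops each

print(f"1 thread active:  {one_thread} ops decoded")
print(f"2 threads active: {two_threads} ops decoded per thread")
```

If the back end can retire more than that per thread, the decoder becomes the limiter exactly as the bet above suggests.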
What it is, exactly, that failed to deliver is something I haven't been able to nail down yet.
You are not the only one
But I believe I have found something; I'll let you know in a few days.