Ex-AMD Engineer explains Bulldozer fiasco

Blitzvogel · Oct 17, 2011

Martimus said:
Also, this quote from IDC might be a little bit Prescient. http://forums.anandtech.com/showpost.php?p=30358536&postcount=231

Isn't Bobcat essentially a lightweight Athlon x2? I honestly don't see Bobcat architecture being used for anything outside of low power computing solutions.

Makaveli · Oct 17, 2011

NostaSeronx said:
315mm² x 80% => 252mm²
5.99 pts / 80% => 7.49 pts in Cinebench

If only if it were true, I would have reacted like:

Instead. I reacted like

I've seen a ton of people post this phrase.

Was this written by a grade 5 student ??

Its disappointed!!!

Martimus · Oct 17, 2011

NUSNA_Moebius said:
Isn't Bobcat essentially a lightweight Athlon x2? I honestly don't see Bobcat architecture being used for anything outside of low power computing solutions.

Bobcat was a bottoms up design creating a cheap and low-powered processor. It used a very large amount of automation with the P&R in the design. It also shows that the Engineer who made these claims doesn't really have much of a leg to stand on, since it is both smaller and more powerful than the Atom.

Munky · Oct 17, 2011

NUSNA_Moebius said:
Isn't Bobcat essentially a lightweight Athlon x2? I honestly don't see Bobcat architecture being used for anything outside of low power computing solutions.

Bobcat is a dual-issue design, and it offers roughly 80% of the performance of the original K8, which was a 3-issue. I think with a modern, widened front end (similar to what Intel did with SB) Bobcat has potential to grow into a power-efficient architecture that would surpass Bulldozer.

Caza · Oct 17, 2011

As with most automated tools, the smaller the design the better the P&R. Bobcat had this advantage.

Bobcat - 380 Million Transistors
Zambezi - 2 Billion Transistors

Broheim · Oct 17, 2011

Makaveli said:
I've seen a ton of people post this phrase.

Was this written by a grade 5 student ??

Its disappointed!!!

I think meme is a pretty cool guy, eh mispels and doesn't afraid of anything.

Blitzvogel · Oct 18, 2011

Munky said:
Bobcat is a dual-issue design, and it offers roughly 80% of the performance of the original K8, which was a 3-issue. I think with a modern, widened front end (similar to what Intel did with SB) Bobcat has potential to grow into a power-efficient architecture that would surpass Bulldozer.

3 issue as in three ALUs? *CPU hardware noob*

Has anyone done any studies/research on overclocking Bobcat derived parts to see how the TDP and TDW changes?

Isn't each "core" in a BD module 2 issue? Where does a Bobcat core sit in reference to each core in a Bulldozer module? Judging from indicated performances, 8x Bobcat cores might hold their own in overall capability vs a full 4 module Bulldozer chip unless we get into AVX or something like that. It really seems that AMD was overly confident in the module's ability to decode and schedule data into the two integer cores, at least with today's programs.

podspi · Oct 18, 2011

Martimus said:
Bobcat was a bottoms up design creating a cheap and low-powered processor. It used a very large amount of automation with the P&R in the design. It also shows that the Engineer who made these claims doesn't really have much of a leg to stand on, since it is both smaller and more powerful than the Atom.

:thumbsup: Finally somebody brings this up. There is no way BD or SB (or really any modern CPU) is designed by hand anymore. Sounds like a disgruntled ex-employee who is convinced CPUs just "aren't made the way they used to be" which, AFAIK is a good thing...

BlueBlazer · Oct 18, 2011

Here is another ex-AMD engineer.......

Mitch Alsup said:
Message from discussion Bulldozer on Slashdot

MitchAlsup View profile

On Aug 25, 11:38 pm, Brett Davis <gg...@yahoo.com> wrote:

> In article <ggtgp-2F5622.01163525082...@news.isp.giganews.com>,
> Brett Davis <gg...@yahoo.com> wrote:
> K10 has one major bottleneck outside the issue pipeline to executing
> more instructions per cycle. The 16 byte decode unit will give you 3.5
> instructions per cycle on average, less for SSE code, as few as 2.5.

> A 32 byte decode unit will be idle greater than 50% of the time on average.
> Huge die area and a huge win to share.

> The k10 retirement unit can only retire 3 instructions a cycle, Bulldozer
> will do 4.

(Ahem) K10 is BullDozer, K8 is Opteron and follow-ons.

> The third AGU was never used, waste of die area and heat.

The issue was that the 3rd unit was used a lot, only to run into the
dual-only ported DataCache. This caused sequencing issues.

> The third ALU is of more concern, Intel will standardize benchmarks to
> make this look bad, even though I know it was used 1% on average.

So what else is new.

> AMD now has separate load and store pipelines, this can be a huge advantage.
> For every 90 instructions on average you will have 60 integer ops, 20 loads,
> and 10 stores.

We measured very close to 50% of x86 instructions having memory
reference attachments. So, for every 90 x86 instructions, on would
expect 45 memory references wiht a general ratio of just over 2 reads
to 1 write. Thus, I would expect 30-33 reads and 12-15 writes.

> The branch unit is not on the Bulldozer slides,

We always put these in the ALUs with means to redirect the front-end
on discovery of mispredict.

> Bulldozer will be faster than K10, the question is how much,

When I left, BD was supposed to be 20-25% faster frequency wise, and
loose a little architectural figure (5%-ish) of merit due to the
microarchitecture. The surprising thing was the lack of mention of
frequency in the market-droid-ing.

Mitch

sm625 · Oct 18, 2011

I once did the signal routing on a 64 bit memory bus on a 14 layer pcb. I hand routed every trace and did my own impedence matching calculations. I made sure every trace was the same length, and made sure stub lengths were the same. Right now we are having problems with a similar design. Someone else did the routing, and now we are having single bit intermittent data errors. :hmm:

krumme · Oct 18, 2011

From BlueBlazer post - great link btw. Mitch said:
"When I left, BD was supposed to be 20-25% faster frequency wise..."
It corresponds with what charlie wrote today about GF not being responsible for the lack of frequency.
Thats interesting if true, the nearest conclusion would otherwise be as anand indicated to look at GF not delivering. But probably we have to look at what charlie say is 1000 things - cuts. We will see q1 if there is som serious improvement potential for the future.
http://semiaccurate.com/2011/10/17/bulldozer-doesnt-have-just-a-single-problem/

Idontcare · Oct 18, 2011

krumme said:
From BlueBlazer post - great link btw. Mitch said:
"When I left, BD was supposed to be 20-25% faster frequency wise..."
It corresponds with what charlie wrote today about GF not being responsible for the lack of frequency.
Thats interesting if true, the nearest conclusion would otherwise be as anand indicated to look at GF not delivering. But probably we have to look at what charlie say is 1000 things - cuts. We will see q1 if there is som serious improvement potential for the future.
http://semiaccurate.com/2011/10/17/bulldozer-doesnt-have-just-a-single-problem/

I don't buy it.

(1) the dude left AMD ages ago, no way a clockspeed envelope had been tied down by then.

(2) that would imply AMD really did intend to implement a horrid latency cache hierarchy and planned to just make up for it with uber-clocks...meaning they intended for less than Stars-core IPC all along.

Point #1 I've seen countless times, the guy has presented himself as arrogant and out-of-touch with the state of the industry, I doubt he was anywhere close to the levels of decision making within AMD where bulldozer clockspeed targets were decided.

Point #2 I just refuse to believe for no other reason than I just can't believe AMD intended for the cache to be what it is, nor can I convince myself that they really did intend to take a step back in IPC and just make up for it with higher clocks.

I believe they planned for equivalent IPC but to enable higher clocks. I also believe their cache hierarchy is holding them back. The issue I have with having any confidence whatsoever in the cache-impacting IPC argument though (I'm arguing with myself here) is that cache latency and the congestion from it is something that is readily simulated as well as being "baked in" when they design the microarchitecture.

The latency certainly is exactly what they intended, but maybe they failed to simulate the congestion that would come from it with conventional instructions mixes?

thilanliyan · Oct 18, 2011

Idontcare said:
I
I believe they planned for equivalent IPC but to enable higher clocks. I also believe their cache hierarchy is holding them back. The issue I have with having any confidence whatsoever in the cache-impacting IPC argument though (I'm arguing with myself here) is that cache latency and the congestion from it is something that is readily simulated as well as being "baked in" when they design the microarchitecture.

The latency certainly is exactly what they intended, but maybe they failed to simulate the congestion that would come from it with conventional instructions mixes?

The person who wrote the Lostcirucuits article on BD also said the same thing about the cache. If they can simulate that easily why the heck would they have left it as is??!! It's so frustrating to have AMD lagging lol...I want the companies to be fairly equal like ATI/AMD and nVidia are.

SickBeast · Oct 18, 2011

Lookit, when people design things, there are usually a small number of people at the top who are in charge. Those people are to blame for Bulldozer failing, and at least one of them has been held accountable (the old CEO).

Bulldozer has a bunch of design flaws that a layperson such as myself can point out. It's not rocket science. An ex-engineer can come out and blame this or that, but the fact remains that the "modules" are a dumb idea, along with the longer pipeline.

The problem is, designing a CPU is such a complex project, it's nearly impossible to get it done properly in a cohesive package. You can have 8 teams designing 8 different parts of it, but then how do you get them all to work together and work well?

AMD tried and failed. I think they will be able to make some gains by doing another re-spin, however there is only so much they can get from a CPU that did not have efficiency as its #1 mandate (as it should have IMO).

Vesku · Oct 18, 2011

There was the Linux cache patch and there have been benchmarks and software that seems to indicate some sort of congestion. Doubt it can all be blamed on cache definitely more complicated and harder to simulate stuff going on in the shared front end. Hope AMD still has some bright engineers to sort it all out.

SickBeast · Oct 18, 2011

After the P4 came the P4C with hyperthreading. It wasn't much better. AMD would be lucky to get even that much of a boost out of Bulldozer.

NostaSeronx · Oct 18, 2011

Makaveli said:
I've seen a ton of people post this phrase.

Was this written by a grade 5 student ??

Its disappointed!!!

SickBeast said:
After the P4 came the P4C with hyperthreading. It wasn't much better. AMD would be lucky to get even that much of a boost out of Bulldozer.

Don't you mean Piledriver? Bulldozer is pretty much is a done deal...

Idontcare · Oct 18, 2011

thilanliyan said:
The person who wrote the Lostcirucuits article on BD also said the same thing about the cache. If they can simulate that easily why the heck would they have left it as is??!! It's so frustrating to have AMD lagging lol...I want the companies to be fairly equal like ATI/AMD and nVidia are.

I've been meaning to read their article, they usually do really good power-consumption tests. Gonna go check it out right now, thanks for reminding me :thumbsup:

podspi · Oct 18, 2011

SickBeast said:
Lookit, when people design things, there are usually a small number of people at the top who are in charge. Those people are to blame for Bulldozer failing, and at least one of them has been held accountable (the old CEO).

Bulldozer has a bunch of design flaws that a layperson such as myself can point out. It's not rocket science. An ex-engineer can come out and blame this or that, but the fact remains that the "modules" are a dumb idea, along with the longer pipeline.

The problem is, designing a CPU is such a complex project, it's nearly impossible to get it done properly in a cohesive package. You can have 8 teams designing 8 different parts of it, but then how do you get them all to work together and work well?

AMD tried and failed. I think they will be able to make some gains by doing another re-spin, however there is only so much they can get from a CPU that did not have efficiency as its #1 mandate (as it should have IMO).

I have to disagree on both counts. IMHO, I think the module idea is brilliant (just as SMT is, which also originally had a less than impressive debut). The idea of sharing what (shouldn't) be a bottleneck in most situations, and keeping discrete what will be is not a bad idea. Longer pipelines are also not a bad idea, if you can get the clockspeeds high enough to compensate. Otherwise, couldn't we just go towards the other extreme (shorter and shorter pipelines)?

In the case of Bulldozer, something (many things, if some are to be believed) went wrong. But that doesn't mean many of the underlying ideas behind the architecture are "dumb".

Idontcare · Oct 18, 2011

thilanliyan said:
The person who wrote the Lostcirucuits article on BD also said the same thing about the cache. If they can simulate that easily why the heck would they have left it as is??!! It's so frustrating to have AMD lagging lol...I want the companies to be fairly equal like ATI/AMD and nVidia are.

Ouch, this page says it all. Achilles heel right there.

Lets take a look at the worst offenders in the current design:

L1D cache size too small and too slow. Especially at 16 kB size there should not be a reason to need 4 cycles access latency.

L2 cache latency: 27 cycles. This is almost twice the access latency of the L2 cache in Phenom II and while the L2 cache here is substantially larger, the combination of the insufficient L1 size with the extremely slow L2 cache is a recipe for disaster. I dare say that by reducing the L2 latency to 12-15 cycles, Zambezi would most likely see a 20-30% performance increase. Of course, this is pure speculation because I have not run any simulations.

thilanliyan · Oct 19, 2011

Idontcare said:
Ouch, this page says it all. Achilles heel right there.

[/LIST]

It that fixable in a respin?

Idontcare · Oct 19, 2011

thilanliyan said:
It that fixable in a respin?

No, monkeying around with things like cache timings and size are not something you can get away with in a respin.

It could be addressed in piledriver, provided they planned for such changes maybe 2 yrs ago or so.

frostedflakes · Oct 19, 2011

Well if the respin allows them to hit higher clocks, that will help with the cache throughput, right? I thought that was one of the reasons it was under performing, BD wasn't able to come close to the clocks they were hoping for.

Or do the cache problems go beyond just not being able to meet clock speed targets?

Idontcare · Oct 19, 2011

frostedflakes said:
Well if the respin allows them to hit higher clocks, that will help with the cache throughput, right? I thought that was one of the reasons it was under performing, BD wasn't able to come close to the clocks they were hoping for.

Or do the cache problems go beyond just not being able to meet clock speed targets?

Higher clocks doesn't mean higher IPC (cache discussion is relating to IPC-limitations)...just means higher performance because the clocks are higher.

Throughput will go up, of course, commensurate with the need for throughput to go up as clocks go up, otherwise IPC would go down as the clocks went up, making performance not go up (stays flat).

Schmide · Oct 19, 2011

Idontcare said:
No, monkeying around with things like cache timings and size are not something you can get away with in a respin.

I can't believe they thought a 4-27-86 cache would be viable under any circumstances?

Wouldn't cache speed/latency be determined by how fast that part of the chip can run? If you can't deliver the proper clocks due to some bad power plane management or other issue, you're going to have to run it slower.

Ex-AMD Engineer explains Bulldozer fiasco

Platinum Member

Diamond Member

Diamond Member

Diamond Member

Junior Member

Diamond Member

Platinum Member

Golden Member

Senior member

Diamond Member

Diamond Member

Elite Member

Lifer

Lifer

Diamond Member

Lifer

Diamond Member

Elite Member

Golden Member

Elite Member

Lifer

Elite Member

Diamond Member

Elite Member

Diamond Member