Ex-AMD Engineer explains Bulldozer fiasco

PlasmaBomb · Oct 19, 2011

Makaveli said:
I've seen a ton of people post this phrase.

Was this written by a grade 5 student ??

Its disappointed!!!

It is a meme.

zlejedi · Oct 19, 2011

frostedflakes said:
Well, if the respin allows them to hit higher clocks, that will help with the cache throughput, right? I thought that was one of the reasons it was underperforming: BD wasn't able to come close to the clocks they were hoping for.

Or do the cache problems go beyond just not being able to meet clock speed targets?

Actually, at higher clock speeds the latencies might be getting even worse, as measured here:

http://forum.purepc.pl/findpost-t322141-p3662624.html

PlasmaBomb · Oct 19, 2011

zlejedi said:
Actually, at higher clock speeds the latencies might be getting even worse, as measured here:

http://forum.purepc.pl/findpost-t322141-p3662624.html

That's showing that the L1 and L2 caches run at core speed and the L3 cache runs at uncore speed, which isn't increasing.

A respin might let them run both higher core and uncore speeds; it depends on what the issues are and where they feel they can get the most benefit out of tweaking...
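To put rough numbers on the core-versus-uncore point, here is a minimal Python sketch. The 20-cycle L2 figure is the one quoted later in this thread; the L3 cycle count and the uncore clock are made-up placeholders for illustration, not measurements of any real chip.

```python
def latency_ns(cycles, clock_ghz):
    """Convert a latency in clock cycles to nanoseconds for a given clock."""
    return cycles / clock_ghz


def as_core_cycles(ns, core_ghz):
    """Express a latency in nanoseconds as equivalent core-clock cycles."""
    return ns * core_ghz


L2_CYCLES = 20          # core-clock cycles, as quoted in this thread
L3_UNCORE_CYCLES = 40   # hypothetical value, for illustration only
UNCORE_GHZ = 2.0        # hypothetical uncore clock, held fixed

if __name__ == "__main__":
    # L3 runs on the uncore clock, so its latency in ns does not change
    # when only the core clock is raised.
    l3_ns = latency_ns(L3_UNCORE_CYCLES, UNCORE_GHZ)
    for core_ghz in (3.0, 4.0, 5.0):
        l2_ns = latency_ns(L2_CYCLES, core_ghz)
        l3_core = as_core_cycles(l3_ns, core_ghz)
        print(f"core {core_ghz} GHz: L2 {l2_ns:.2f} ns, "
              f"L3 {l3_ns:.2f} ns = {l3_core:.0f} core cycles")
```

With the uncore clock fixed, the L3 latency stays the same in nanoseconds, so it costs more core cycles as the core clock rises. That is one way a cycle-based readout can look worse at higher clocks even though nothing got slower in absolute terms.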

Idontcare · Oct 19, 2011

PlasmaBomb said:
A respin might let them run both higher core and uncore speeds; it depends on what the issues are and where they feel they can get the most benefit out of tweaking...

Changing L1$ and L2$ at this point would be akin to changing the size of the fuel-pump on the gas line in your car but not doing anything with the downstream components (injectors, displacement, exhaust ports, etc).

Look at all the benches where the ram speed is changed from DDR3-1333 to DDR3-1866, even for things memory limited like Llano.

If the architecture is not designed to need the bandwidth, then adding more bandwidth isn't going to fix a bottleneck.

I'd like to make the argument that it is the cache that is holding Bulldozer back, but the architecture had to have been tuned to use the cache as is... otherwise they would have implemented a different cache hierarchy.

Simply changing the cache without changing the rest of the architecture is not going to make an impact. But they must change the cache if they are going to change anything else under the hood and expect much performance to come of it.

exar333 · Oct 19, 2011

Idontcare said:
No, monkeying around with things like cache timings and size is not something you can get away with in a respin.

It could be addressed in Piledriver, provided they planned for such changes maybe 2 yrs ago or so.

This.

AMD is not deactivating half the cache here or anything, this is a complete re-work of the cache system. Huge design and testing impacts.

krumme · Oct 19, 2011

Can the cache measurements be "biased" by an unbalanced GF process, causing unforeseen jitter or whatever? Sorry, I don't have the technical insight, but I'm simply trying to understand more of the situation.

Idontcare · Oct 19, 2011

krumme said:
Can the cache measurements be "biased" by an unbalanced GF process, causing unforeseen jitter or whatever? Sorry, I don't have the technical insight, but I'm simply trying to understand more of the situation.

No, not by process-induced variation or anything like that.

But it can be measured wrong; that has happened in the past for both AMD and Intel architectures.

Update: As many astute readers have pointed out, Core 2's prefetchers are able to work their magic with ScienceMark 2.0, which results in the significant memory latency advantage over AMD's Athlon 64 FX-62. This advantage will not always exist; where it doesn't, AMD will continue to have lower latency memory access and where it does, Intel can gain performance advantages similar to what ScienceMark 2.0 shows.

Updated - 1/5/07: Although AMD previously did not mention any issues with our findings, we were contacted today and informed that the latency information both ScienceMark and CPU-Z produced is incorrect. The Brisbane core's L2 latency should be 14 cycles, up from 12 cycles and not 20 cycles. This would help explain the relatively low impact on application performance that we've seen across the board. We are still waiting to hear back from AMD on a handful of other issues regarding Brisbane and will update you as soon as we have more information.

Munky · Oct 20, 2011

NUSNA_Moebius said:
3 issue as in three ALUs? *CPU hardware noob*

Has anyone done any studies/research on overclocking Bobcat derived parts to see how the TDP and TDW changes?

Isn't each "core" in a BD module 2 issue? Where does a Bobcat core sit in reference to each core in a Bulldozer module? Judging from indicated performances, 8x Bobcat cores might hold their own in overall capability vs a full 4 module Bulldozer chip unless we get into AVX or something like that. It really seems that AMD was overly confident in the module's ability to decode and schedule data into the two integer cores, at least with today's programs.

In general terms, by "issue" I meant how many instruction decoders it has, or how many instructions the front end can simultaneously issue to the ALUs.

Dresdenboy · Oct 20, 2011

Idontcare said:
I believe they planned for equivalent IPC but to enable higher clocks. I also believe their cache hierarchy is holding them back. The issue I have with having any confidence whatsoever in the cache-impacting IPC argument though (I'm arguing with myself here) is that cache latency and the congestion from it is something that is readily simulated as well as being "baked in" when they design the microarchitecture.

The latency certainly is exactly what they intended, but maybe they failed to simulate the congestion that would come from it with conventional instructions mixes?

How can we be sure that the cache latency is the main reason? Sounds to me a bit like (using CA - car analogies for the sake of having just a touch screen keyboard right now) "Oh, the new Corvette has a lower max. mph! Ah, I can see why: the diameter of the exhausts is smaller!" It's just too early to say such things w/o some decent profiling.

Well, what's known about the caches.. one to one
BD : 4c L1, 2x 2R/1W (48B)
SB : 4c L1, 1x R/W (48B combined)
BD : 2x16k (4 way) for 2 threads
SB : 1x32k (8 way) for 2 threads
BD : 20c L2 2048kB
SB : 9(?)c L2 256kB
...

Dresdenboy · Oct 20, 2011

NUSNA_Moebius said:
3 issue as in three ALUs? *CPU hardware noob*

Has anyone done any studies/research on overclocking Bobcat derived parts to see how the TDP and TDW changes?

Isn't each "core" in a BD module 2 issue? Where does a Bobcat core sit in reference to each core in a Bulldozer module? Judging from indicated performances, 8x Bobcat cores might hold their own in overall capability vs a full 4 module Bulldozer chip unless we get into AVX or something like that. It really seems that AMD was overly confident in the module's ability to decode and schedule data into the two integer cores, at least with today's programs.

According to AIDA measurements, BD's raw integer instruction throughput is a little higher (10-15% IIRC), while FP t'put is more like 2-4x for more than 32b floats, as expected.

Idontcare · Oct 20, 2011

Dresdenboy said:
How can we be sure that the cache latency is the main reason? Sounds to me a bit like (using CA - car analogies for the sake of having just a touch screen keyboard right now) "Oh, the new Corvette has a lower max. mph! Ah, I can see why: the diameter of the exhausts is smaller!" It's just too early to say such things w/o some decent profiling.

Well, what's known about the caches.. one to one
BD : 4c L1, 2x 2R/1W (48B)
SB : 4c L1, 1x R/W (48B combined)
BD : 2x16k (4 way) for 2 threads
SB : 1x32k (8 way) for 2 threads
BD : 20c L2 2048kB
SB : 9(?)c L2 256kB
...

I think you might have misread my post.

Cause and effect...I am NOT arguing that the cache is the cause of the lower IPC (the effect).

I am arguing that the cache is likely preventing them from going to higher IPC.

Look at ARM and its lowly memory bandwidth. Is the lowly memory bandwidth the reason for ARM's low performance? No, not at all. But ARM is never going to become high performance so long as it has low memory bandwidth.

That's my argument about the cache... it's clearly designed from the beginning to be higher latency and smaller (L1$ Stars core vs. BD)... meaning the designers had to have known for years that IPC was going to be lower.

But if they plan for higher IPC in future iterations (Piledriver, etc) then the current cache MUST be improved, otherwise it will hold them back as much as the current memory bandwidth of ARM will keep its performance held down if they tried to create an 80W TDP SKU today clocked at 5GHz.

But we do know how sensitive IPC can be to cache latency because AMD did this experiment already with the 90nm -> 65nm Athlons, where they increased the L2$ latency by a mere 2 cycles (12 -> 14 cycles) and IPC went down because of this.

If you gave anyone the cache specifications of Thuban and the cache specs of Zambezi, and without any more information asked that person to speculate on the performance and IPC of Thuban versus Zambezi, they would be able to pretty much tell you exactly what you see in the benches.

And there is a reason why they can do that. Cache matters; if it didn't, we wouldn't bother to measure and quantify it from an enthusiast and reviewer standpoint.
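As a rough illustration of why a 2-cycle L2 change can show up in IPC, here is a minimal Python sketch of the textbook average-memory-access-time formula. Only the 12 and 14 cycle L2 latencies come from the post above; the L1 latency, miss rates, and memory latency are hypothetical placeholders, and real IPC depends on much more than this one number.

```python
def amat(l1_cycles, l1_miss, l2_cycles, l2_miss, mem_cycles):
    """Average memory access time (in cycles) for a two-level cache.

    Every access pays the L1 latency; L1 misses additionally pay the L2
    latency; L2 misses additionally pay the memory latency.
    """
    return l1_cycles + l1_miss * (l2_cycles + l2_miss * mem_cycles)


# Hypothetical workload parameters, for illustration only.
L1_CYCLES = 3
L1_MISS_RATE = 0.05
L2_MISS_RATE = 0.20
MEM_CYCLES = 200

if __name__ == "__main__":
    for l2_cycles in (12, 14):  # the two L2 latencies mentioned in the post
        cycles = amat(L1_CYCLES, L1_MISS_RATE, l2_cycles,
                      L2_MISS_RATE, MEM_CYCLES)
        print(f"L2 = {l2_cycles} cycles -> average access {cycles:.2f} cycles")
```

The extra cost scales with the L1 miss rate, so workloads that live mostly in L1 barely notice the change while L2-heavy workloads feel it more. That would be consistent with an impact that is real but varies from one application to the next.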

qliveur · Oct 20, 2011

PlasmaBomb said:
It is a meme.

I think that he's talking about the misspelling of "disappoint", as "dissapoint", which is very prevalent on the internet.

Idontcare · Oct 20, 2011

qliveur said:
I think that he's talking about the misspelling of "disappoint", as "dissapoint", which is very prevalent on the internet.

If you had "a point" in your post, and I were to "diss" your point...wouldn't that action be to "diss a point"..."dissapoint"?

(j/k)

bryanW1995 · Oct 20, 2011

Martimus said:
Also, this quote from IDC might be a little bit Prescient. http://forums.anandtech.com/showpost.php?p=30358536&postcount=231

Well, AMD clearly put more/better (or maybe even more better) resources into bobcat, so they might have been planning that as the next NEXT cpu family for a while. However, unlike intel, they can't just ramble around for 5+ years while it happens. They are likely pushing it hard even now, and it will come out ASAP if they can make it work.

bryanW1995 · Oct 20, 2011

Idontcare said:
No, monkeying around with things like cache timings and size is not something you can get away with in a respin.

It could be addressed in Piledriver, provided they planned for such changes maybe 2 yrs ago or so.

But didn't they deliberately increase the latency to enable higher clocks? Dropping it to 20, much less 12-15, would be a huge task, right?

podspi · Oct 20, 2011

bryanW1995 said:
Well, AMD clearly put more/better (or maybe even more better ) resources into bobcat, so they might have been planning that as the next NEXT cpu family for a while. However, unlike intel, they can't just ramble around for 5+ years while it happens. They are likely pushing it hard even now, and it will come out ASAP if they can make it work.

Bobcat also has lower IPC than stars

I remember JF saying at one point that customers had asked him if they were going to get super-dense BC servers, and JF said that they would be better served through BD. I wonder if that is still the case.

bryanW1995 · Oct 20, 2011

He would have said that regardless of whether it was true or just possibly true, since his job is to sell lots of Opterons and the Opterons of today/near future are BD-based. If bobcat/stars/athlon xp/arm becomes the "Opteron" of the future, then he'll push that agenda. And that's not a knock on John at all; that's what any person in his position should do.

exar333 · Oct 20, 2011

podspi said:
Bobcat also has lower IPC than stars

I remember JF saying at one point that customers had asked him if they were going to get super-dense BC servers, and JF said that they would be better served through BD. I wonder if that is still the case.

You could get a WHOLE LOT more BC cores vs. BD cores for the same power envelope.

podspi · Oct 20, 2011

ExarKun333 said:
You could get a WHOLE LOT more BC cores vs. BD cores for the same power envelope.

Is that true, though? Bobcat is great for what it is, but AMD has claimed that TDP per core is going down to ~ 5W (on server) for BD. Of course, AMD also said IPC wouldn't go down, so I guess that isn't a terribly strong argument :$

If it turns out that you're right and perf/W ends up being much higher for BC, we actually COULD see an Opteron BC. Maybe the rumored 28nm Opterons are in fact them...

SickBeast · Oct 20, 2011

All of these shenanigans about the cache.

Bulldozer has a longer pipeline than any processor currently on the market.

Bulldozer artificially inflates core count by forcing extra integer units to share the floating point units.

Those are the reasons why the thing is so inefficient.

I'm sure that a better cache could help somewhat, but the crux of the problem is that Bulldozer is inherently inefficient and seems to have been created to enhance a pissing contest involving words like core count and megahertz.

Schmide · Oct 21, 2011

SickBeast said:
All of these shenanigans about the cache.

I don't think you realize what effect caches have on modern processors?

SickBeast said:
Bulldozer has a longer pipeline than any processor currently on the market.

A longer pipeline doesn't necessarily kill IPC if you have a large enough register file to keep things moving.

SickBeast said:
Bulldozer artificially inflates core count by forcing extra integer units to share the floating point units.

Those are the reasons why the thing is so inefficient.

I'm sure that a better cache could help somewhat, but the crux of the problem is that Bulldozer is inherently inefficient and seems to have been created to enhance a pissing contest involving words like core count and megahertz.

I'm sure the last thing in their mind is buzz words. They've been around long enough to know the game.

Ironically, I would say Bulldozer's cores are kind of efficient considering L1 writes are slower than Thuban L2 writes. Think of it this way: when you retire an instruction's data, it takes twice as long as on Thuban and Sandy Bridge.

Why am I writing this? You don't care. You're just here to be a rabble rouser.

Riek · Oct 21, 2011

podspi said:
Is that true, though? Bobcat is great for what it is, but AMD has claimed that TDP per core is going down to ~ 5W (on server) for BD. Of course, AMD also said IPC wouldn't go down, so I guess that isn't a terribly strong argument :$

If it turns out that you're right and perf/W ends up being much higher for BC, we actually COULD see an Opteron BC. Maybe the rumored 28nm Opterons are in fact them...

Weren't they going to release an 8c 2.5GHz BD @ 32W TDP? Or was that a hoax?

Dresdenboy · Oct 21, 2011

bryanW1995 said:
But didn't they deliberately increase the latency to enable higher clocks? Dropping it to 20, much less 12-15, would be a huge task, right?

"Dropping" to 20 isn't necessary since the 2MB L2 latency is 20c according to the Software Optimization Manual, AIDA and other latency measurement tools. Sandra might suffer from a wrong way of measurement, or at least they didn't adapt the code correctly.

Edit:
David Kanter's article with fresh information from Hot Chips last year already cites the 18-20c latency for L2 (18 is with 1MB L2): http://realworldtech.com/page.cfm?ArticleID=RWT082610181333&p=8

@all:
Please refrain from discussing wrong measurements like they were showing the truth. It's boring. 20 cycles is already enough, no need to artificially inflate it ^^

@IDC:
Cache latency effect for K8 is a good point, but this happened w/o significantly changing the rest of the microarchitecture.

With BD, actually everything changed. And there are already some known bottlenecks regarding streaming store bandwidth, which will be addressed w/ BDv2 (Piledriver).

Each thread running on a BD module should have access to the same L1 BW as both threads running on a SB core combined (2x64b R + 1x64b W). And even the amount and ways for 2 threads (2x4 ways, 2x16kB) match what's available to 2 threads on a SB core. Here SB has a clear advantage for single threads.

SickBeast · Oct 21, 2011

Schmide, if AMD is able to "fix" Bulldozer with a new cache with the respin they are working on, I will give your comments some due credit. Otherwise, I stand by my opinion.

AMD seems to be blaming Global Foundries for the problem, so perhaps part of it is indeed manufacturing related.

As for your personal attacks...well...sometimes the pot calls the kettle black. In this case I'm no kettle.

exar333 · Oct 21, 2011

podspi said:
Is that true, though? Bobcat is great for what it is, but AMD has claimed that TDP per core is going down to ~ 5W (on server) for BD. Of course, AMD also said IPC wouldn't go down, so I guess that isn't a terribly strong argument :$

If it turns out that you're right and perf/W ends up being much higher for BC, we actually COULD see an Opteron BC. Maybe the rumored 28nm Opterons are in fact them...

Another marketing move by AMD. Let's just say that an 8-core BD = 4-core SB in a specific task. They both use the same power. AMD's 'per core' power usage will be 1/2 that of Intel, but they still consume the same overall power.

Now back to reality: BD uses 2x the power of SB while providing similar performance in some tasks. The power usage per core is now comparable.

Power/core is meaningless. Performance/watt is the important metric. In a highly-threaded application, I would choose 50 BC cores over 8 BD cores. I am still curious how AMD arrived at the TDP ratings for BD. They use 2x the power at load over SB, but are only rated 30W higher TDP.
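The arithmetic behind that point can be sketched in a few lines of Python. All wattages and the normalized performance figure below are hypothetical round numbers chosen to mirror the two scenarios described above; they are not measured values for any real chip.

```python
def watts_per_core(total_watts, cores):
    """Power divided evenly across cores (the 'per core' marketing figure)."""
    return total_watts / cores


def perf_per_watt(performance, total_watts):
    """Work delivered per watt consumed (the metric that actually matters)."""
    return performance / total_watts


# Hypothetical scenarios: (cores, total watts, normalized performance).
SCENARIOS = {
    "equal power, 8-core": (8, 100.0, 1.0),
    "equal power, 4-core": (4, 100.0, 1.0),
    "2x power, 8-core": (8, 200.0, 1.0),
}

if __name__ == "__main__":
    for name, (cores, watts, perf) in SCENARIOS.items():
        print(f"{name}: {watts_per_core(watts, cores):.1f} W/core, "
              f"{perf_per_watt(perf, watts):.4f} perf/W")
```

In the equal-power case the 8-core part shows half the watts per core yet the same perf/W as the 4-core part. In the 2x-power case its watts per core merely match the 4-core part while its perf/W is half, which is the gap the per-core figure hides.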
