What made AMD stray from K10?


nehalem256

Lifer
Apr 13, 2012
15,669
8
0
I believe you should read it again, because you completely missed it. I even said AMD still had the advantage.

I never said Intel's solution was faster, but it is much better integrated.

Intel's iGPU gets direct access to the L3 cache and can communicate with the cores directly, while AMD's solution sits on a bus (one would guess PCIe- or HT-like in nature).

Trinity_unb.png


Also, it's nonsense to say Intel's iGPU doesn't support compute, because it does.

Why do the technical details of how the GPU is integrated matter if it does not produce any results?

It's like when AMD introduced the first native quad-core CPU. Unfortunately, it was inferior to Intel's non-native quad-core.
 

jhu

Lifer
Oct 10, 1999
11,918
9
81
AMD would do better in GPU compute if they added support for 64-bit floats across their GPU lineup, which NVIDIA has done from low end to high end.
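
For anyone wanting to check this on their own hardware, here's a minimal sketch (assuming the pyopencl package and at least one OpenCL runtime are installed) that lists each OpenCL device and whether it advertises double-precision support. cl_khr_fp64 is the standard extension flag; some AMD parts of that era exposed cl_amd_fp64 instead.

```python
# Minimal sketch: list OpenCL devices and whether they advertise
# double-precision (FP64) support. Assumes pyopencl is installed and
# at least one OpenCL runtime/driver is present on the system.
import pyopencl as cl

for platform in cl.get_platforms():
    for device in platform.get_devices():
        # Standard flag is cl_khr_fp64; older AMD devices used cl_amd_fp64.
        has_fp64 = any(ext in device.extensions
                       for ext in ("cl_khr_fp64", "cl_amd_fp64"))
        print(f"{platform.name.strip()} / {device.name.strip()}: "
              f"FP64 {'supported' if has_fp64 else 'not supported'}")
```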
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
146
106
Why do the technical details of how the GPU is integrated matter if it does not produce any results?

It's like when AMD introduced the first native quad-core CPU. Unfortunately, it was inferior to Intel's non-native quad-core.

You can't compare the two.

It matters a lot, since memory bandwidth is the main weakness of iGPUs, and access to a very fast L3 helps solve that. Not to mention that certain compute tasks on the HD 4000 really show what a fast L3 cache can do.

And AMD is the one to talk about integration, yet they are so far behind. With the upcoming Haswell and its expanded iGPU, it's just not looking pretty.
 

pelov

Diamond Member
Dec 6, 2011
3,510
6
0
Why do the technical details of how the GPU is integrated matter if it does not produce any results?

It's like when AMD introduced the first native quad-core CPU. Unfortunately, it was inferior to Intel's non-native quad-core.

^^ This. And integration implies integration with the CPU, whereas it's Intel's chips that are completely separate. The "integrated" L3 cache isn't integrated at all; it's a separate 1.5 MB block of L3 that's portioned off for the on-die GPU.

The bandwidth issue AMD has certainly needs to be addressed, but what does this have to do with "integration"? Integration means integrating the CPU and GPU, and AMD is miles ahead of Intel in this regard. Kaveri will bring a unified memory address space as well, providing even further integration.

The approach AMD currently takes with Onion and Garlic, and how dependent they are on DDR3 and the IMC, is certainly an issue, but what this has to do with integration is beyond me. You can argue that AMD will have to change its approach with their APUs if they want to remove that bandwidth bottleneck, but I've been saying this for months :p

Intel has literally no "integration" whatsoever, nor have I seen it on their roadmaps. The only integration I've seen is the implication that MIC/Larrabee 2.0 will make it into their CPUs sometime in the near future, Skylake at the earliest, and that's only if their x86-everywhere approach takes off this time. That's going to be a tough sell, considering they tried it before with the original Larrabee and people weren't buying it.

And AMD is the one to talk about integration, yet they are so far behind. With the upcoming Haswell and its expanded iGPU, it's just not looking pretty.

By your own definition, integration means "it has a completely separate cache from the rest of the processor." You could then have a billion different CPUs, each with its own separate cache, doing completely separate work that has nothing to do with the others' work, but because they're on the same die they're integrated? As opposed to a CPU+GPU that share a workload but, because they don't have completely separate caches (the LLC in Intel's case), are considered non-integrated?

I'm gonna chalk that up as a failure to grasp the English language, because that's hella stupid.
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
146
106
^^ This. And integration implies integration with the CPU, whereas it's Intel's chips that are completely separate. The "integrated" L3 cache isn't integrated at all; it's a separate 1.5 MB block of L3 that's portioned off for the on-die GPU.

The bandwidth issue AMD has certainly needs to be addressed, but what does this have to do with "integration"? Integration means integrating the CPU and GPU, and AMD is miles ahead of Intel in this regard. Kaveri will bring a unified memory address space as well, providing even further integration.

Using your own words: you are full of poopoo now. :thumbsdown:

I already showed and explained it: shared LLC + ring bus.
 

pelov

Diamond Member
Dec 6, 2011
3,510
6
0
Using your own words: you are full of poopoo now. :thumbsdown:

I already showed and explained it: shared LLC + ring bus.

And what if they do share a DDR bus? Are they then integrated or does this cache HAVE to be on the CPU, by your own definition and apparently no one else's?

It matters a lot, since memory bandwidth is the main weakness of iGPUs, and access to a very fast L3 helps solve that. Not to mention that certain compute tasks on the HD 4000 really show what a fast L3 cache can do.

Mind you, I've been saying this since last year when I saw Llano: they have to address the DDR dependency if they wish to push the envelope on throughput. Even with this handicap, AMD outperforms Intel's IGP on a larger node. But what does a potential future problem mean for current integration? Your problem arises from a failure to grasp either the English language or microarchitecture, or you're just nitpicking anything you possibly can, and I have a feeling it's the latter.
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
146
106
And what if they do share a DDR bus? Are they then integrated or does this cache HAVE to be on the CPU, by your own definition and apparently no one else's?

You might want to read this:
http://www.realworldtech.com/fusion-llano/2/

In contrast, Sandy Bridge has tighter integration – using the on-die ring interconnect and L3 cache. Data is passed through the ring interconnect, but can be shared either through the cache or memory. The ring interconnect is 32B wide with 6 agents and operates at the core frequency (>3GHz). Data usually comes from either the 4 slices of the L3 cache or the memory controller, which resides in the system agent. The peak bandwidth is over 400GB/s, but the practical bandwidth is lower, since many accesses have to go through multiple stops on the ring interconnect. The Sandy Bridge power management is also fully unified for both CPU and GPU, so that when one is idle, the other may ‘borrow’ the thermal headroom.

The Sandy Bridge CPU cores can access data and synchronize with the GPU’s portion of the L3 cache. For example, the CPU can write graphics commands into the GPU’s L3 cache, which the GPU then reads. The GPU can also explicitly flush data back to the L3 cache for the CPU to access with very high performance (e.g. for offloading from the GPU to CPU). Passing data between the GPU and CPU through the cache (instead of memory) is one area where Intel’s GPU integration is substantially ahead of AMD. In many respects, the communication model is much more bi-directional rather than the traditional one-way flow in a graphics pipeline.

Trinity vs. IB is the same story.
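
As a rough sanity check of the quoted ">400GB/s" peak figure: 32 bytes per cycle per ring stop at a core clock of roughly 3.4 GHz, multiplied across four L3 slices, lands in that neighborhood. The clock and slice count below are illustrative assumptions, not figures for any specific SKU.

```python
# Back-of-the-envelope check of the quoted ring/L3 peak bandwidth.
# The 3.4 GHz clock and 4 L3 slices are illustrative assumptions.
ring_width_bytes = 32          # 32B-wide ring, per the quoted description
core_clock_hz = 3.4e9          # assumed core clock (ring runs at core clock)
l3_slices = 4                  # a Sandy Bridge quad-core has 4 cache slices

per_slice_gbps = ring_width_bytes * core_clock_hz / 1e9
aggregate_gbps = per_slice_gbps * l3_slices
print(f"~{per_slice_gbps:.0f} GB/s per slice, ~{aggregate_gbps:.0f} GB/s aggregate")
# -> ~109 GB/s per slice, ~435 GB/s aggregate, consistent with "over 400GB/s"
```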
 

piesquared

Golden Member
Oct 16, 2006
1,651
473
136
Why do the technical details of how the GPU is integrated matter if it does not produce any results?

It's like when AMD introduced the first native quad-core CPU. Unfortunately, it was inferior to Intel's non-native quad-core.

^^ Bingo. The implementation means nothing if it doesn't work, and Intel's bottom-feeder GPUs are the perfect example of that.

With the upcoming Kaveri, hasntwell isn't going to look very good at all.
 

pelov

Diamond Member
Dec 6, 2011
3,510
6
0
using the on-die ring interconnect and L3 cache.

AMD's approach goes through the DDR bus.

The Sandy Bridge power management is also fully unified for both CPU and GPU, so that when one is idle, the other may ‘borrow’ the thermal headroom.

Already implemented in Trinity. It's a variation of Turbo that depends on how the load is split.

What does portioning off 1.5MB of cache have to do with integration if one end isn't helping the other? How is it integrated at all? Shares the interconnect? Is that all it takes for "integration"?
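
To make the "borrowed thermal headroom" idea concrete, here is a toy sketch of a package-level power budget being split between CPU and GPU according to relative load. The policy and numbers are invented purely for illustration and do not reflect either vendor's actual firmware.

```python
# Toy illustration of unified CPU/GPU power management: one package-level
# budget, with the busier side allowed to "borrow" headroom the other side
# isn't using. Policy and numbers are invented for illustration only.
def split_tdp(package_tdp_w, cpu_load, gpu_load, floor_w=5.0):
    """Return (cpu_budget_w, gpu_budget_w) from loads in [0, 1]."""
    total_load = cpu_load + gpu_load
    if total_load == 0:
        return floor_w, floor_w              # both idle: minimal budgets
    cpu_share = cpu_load / total_load
    spendable = package_tdp_w - 2 * floor_w  # headroom above the floors
    cpu_budget = floor_w + spendable * cpu_share
    gpu_budget = floor_w + spendable * (1 - cpu_share)
    return cpu_budget, gpu_budget

# GPU-heavy game: the GPU borrows most of the headroom the lightly loaded CPU leaves.
print(split_tdp(100.0, cpu_load=0.2, gpu_load=0.9))   # -> roughly (21.4, 78.6)
```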
 

Blitzvogel

Platinum Member
Oct 17, 2010
2,012
23
81
K10 was probably a technological dead end for AMD, even with a bolted-on 256-bit FPU/vector capability. Bulldozer, like Phenom, is beginning as a dud, but it could become a gem with enough love and care, especially if it matches the price/performance of Intel's offerings and flanks them via APUs, where the graphics features clearly match and well exceed the price/performance of Intel's competing processors. The all-in-one package of a Trinity or Bobcat APU is excellent; it's AMD's high end that is not, and the low and medium end is where AMD needs to focus its CPU business until it fixes its high end or introduces a new paradigm (like an APU with a few CPU cores but a massive array of graphics processing elements).
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
With the upcoming Kaveri, hasntwell isn't going to look very good at all.

"hasntwell"? LOL :D that's good, made me laugh.

Got me thinking, the name "Haswell" is actually a perfect setup for all kinds of denigrating modifications - Hasnothing, Hashype, Hastemps...

K10 was probably a technological dead end for AMD, even with a bolted-on 256-bit FPU/vector capability.

The thing with this kind of thinking is that it could have been equally applied to the P3 microarchitecture as reasoning for the transition to the P4 as well, only we saw what happened there and it turned out that going back to the P3 lineage and further enhancing it was actually exactly what the doctor ordered.

So when I see what are blindingly self-evident and obvious parallels between the P3/P4/Core2 history and the K10/BD/?? billion dollar effort to reinvent that wheel AMD-style, I really have to wonder just how full-circle this is all going to become when AMD decides to roll out "??".

It took Intel 4 nodes to get to Core 2 after releasing the P4. AMD is only one node into their BD cycle. And they just brought back Jim Keller, one guy who is probably going to be quite interested in seeing the K10 revisited as a replacement for BD.

I wouldn't want to rule that out given that we've all seen this movie before.
 

nismotigerwvu

Golden Member
May 13, 2004
1,568
33
91
I think it's a movie we'd all be buying tickets for if it brings about the same leap in performance that Netburst to Conroe brought us.
 

pelov

Diamond Member
Dec 6, 2011
3,510
6
0
And they just brought back Jim Keller, one guy who is probably going to be quite interested in seeing the K10 revisited as a replacement for BD.

I wouldn't want to rule that out given that we've all seen this movie before.

IDC, I think it would be misplaced hope to think AMD can keep chugging along with the K# strategy, because they'd only be playing second fiddle to Intel, and as far as profits go that never worked. They were chasing IPC and single-threaded performance and only falling further behind in both. Couple that with Intel's now-accelerating lead in fabrication and node size, and that approach would be a death wish, particularly with the desktop playing an ever-diminishing role in computing.

I'm sort of hoping they go APU only from here on out. That way at least their CMT approach would make some sense ;P
 

pantsaregood

Senior member
Feb 13, 2011
993
37
91
Is it really that far-fetched to think that AMD could produce a Kx CPU that can compete with Sandy Bridge in IPC? Does it need to compete in IPC if clocks can be increased?

Staple a 256-bit FPU, SSE4, AES, and AVX onto a Phenom II. Boost L3 cache to full speed. Increase IPC by 5% by using Llano cores. Produce an eight-core unit at 4.0 GHz. You suddenly have a complete beast of a CPU.

Also, Llano was the only real IPC boost K10 ever saw. Phenom II just threw on extra L3 cache.

I know it isn't as easy as I make it sound, but this is very parallel to Netburst. P6 scaled rapidly, then around 1 GHz things started to get rough. The 1.13 GHz Coppermine actually got cancelled due to poor yields. When Intel revisited Pentium III with Pentium M, they overcame that clock issue and pushed to 2.26 GHz. Core Duo, also a direct successor to Pentium III, made it up to 2.33 GHz.
 

nehalem256

Lifer
Apr 13, 2012
15,669
8
0
Is it really that far-fetched to think that AMD could produce a Kx CPU that can compete with Sandy Bridge in IPC? Does it need to compete in IPC if clocks can be increased?

Staple a 256-bit FPU, SSE4, AES, and AVX onto a Phenom II. Boost L3 cache to full speed. Increase IPC by 5% by using Llano cores. Produce an eight-core unit at 4.0 GHz. You suddenly have a complete beast of a CPU.

If this were as easy as you claim, why didn't AMD do it in all the years it put out the Phenom II arch? It seems like such an obvious move that there must be some reason.


Also, Llano was the only real IPC boost K10 ever saw. Phenom II just threw on extra L3 cache.

I know it isn't as easy as I make it sound, but this is very parallel to Netburst. P6 scaled rapidly, then around 1 GHz things started to get rough. The 1.13 GHz Coppermine actually got cancelled due to poor yields. When Intel revisited Pentium III with Pentium M, they overcame that clock issue and pushed to 2.26 GHz. Core Duo, also a direct successor to Pentium III, made it up to 2.33 GHz.

Well, they also die-shrunk it from 180nm to 90nm. And this was back in a time when CPUs scaled up much faster in terms of clock speed.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
If this were as easy as you claim, why didn't AMD do it in all the years it put out the Phenom II arch? It seems like such an obvious move that there must be some reason.

It is easy, but expensive. For a given process node, the clockspeed at which you can run a given SRAM layout scales in proportion to the SRAM cell size and the corresponding operating voltage.

This is why you see the SRAM cell size for the L1$ being so large (a large amount of die area per megabit) versus that of the lower-clocked, higher-latency L3$ on both Intel and AMD designs.

The L3$ cell size is reduced so they can fit much more of it onto the die, trading speed for size to minimize the expense of the growing die.

If AMD (or Intel) wanted their L3$ to run at "full clockspeed," they'd have to increase the cell size of the SRAM so it could reliably function at the higher clockspeed (and the resulting higher operating temperatures), as well as take a power-consumption hit, in addition to having a much larger die for the same amount of L3$.

I hope folks can understand that all of this is taken into consideration, and the tradeoffs are optimized by modeling the effects of a faster but smaller L3$, a faster, same-sized but more expensive L3$, and a slower, same-sized L3$ across a huge suite of user apps.

The L3$ is targeted at its current clockspeed and cache size based on loads of modeling and educated engineering inputs. Of course they knew a faster L3$ would make for higher IPC; that wasn't the challenge, though. They had to produce a CPU that met its production cost targets (die size and intrinsic yieldability, something that factors into SRAM cell selection) as well as delivered on a performance target they were guided in advance to assume would be competitive years down the road, when the chip actually came to market.
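
As a rough worked example of the die-area side of that tradeoff: the cell areas below are invented ballpark numbers for a ~32nm-class process, used only to show how quickly a "full-speed" cell inflates the footprint of an 8 MB L3$.

```python
# Rough illustration of the L3$ area tradeoff described above.
# Cell areas are invented ballpark figures for a ~32nm-class process,
# used only to show how the arithmetic works.
BITS_PER_MB = 8 * 1024 * 1024

def cache_area_mm2(size_mb, cell_area_um2, overhead=1.3):
    """Approximate array area; 'overhead' folds in tags, sense amps, wiring."""
    bits = size_mb * BITS_PER_MB
    return bits * cell_area_um2 * overhead / 1e6   # um^2 -> mm^2

dense_cell  = 0.17   # assumed dense (slower) cell, um^2
speedy_cell = 0.30   # assumed larger cell sized for full core clock, um^2

for label, cell in (("dense/slow L3$", dense_cell), ("full-speed L3$", speedy_cell)):
    print(f"8 MB {label}: ~{cache_area_mm2(8, cell):.0f} mm^2")
# The same 8 MB of L3$ costs substantially more die area (and power) when every
# cell is sized to run reliably at core clocks, which is the tradeoff described above.
```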
 

nenforcer

Golden Member
Aug 26, 2008
1,779
20
81
Is it really that far-fetched to think that AMD could produce a Kx CPU that can compete with Sandy Bridge in IPC? Does it need to compete in IPC if clocks can be increased?

Staple a 256-bit FPU, SSE4, AES, and AVX onto a Phenom II. Boost L3 cache to full speed. Increase IPC by 5% by using Llano cores. Produce an eight-core unit at 4.0 GHz. You suddenly have a complete beast of a CPU.

phenomx8.jpg
 

SlowSpyder

Lifer
Jan 12, 2005
17,305
1,002
126
You can't compare the two.

It matters a lot, since memory bandwidth is the main weakness of iGPUs, and access to a very fast L3 helps solve that. Not to mention that certain compute tasks on the HD 4000 really show what a fast L3 cache can do.

And AMD is the one to talk about integration, yet they are so far behind. With the upcoming Haswell and its expanded iGPU, it's just not looking pretty.


I'm not saying Intel's approach isn't more elegant, or that it may not be very important going forward (who knows), but the two are very comparable. Let's say I told you that the C2Qs were 'brutally bolted together': two dual-cores that have to communicate over a slow bus, vs. the Phenom II with its fancy HT links and shared L3 cache. But then we see the benches.

As long as AMD's APUs can put up the numbers, I think no one will care how they are integrated.
 

sm625

Diamond Member
May 6, 2011
8,172
137
106
AMD is just stupid. They needed to differentiate themselves. The uArch of K10 was fine; they needed to turn it into a SoC, like Bobcat was doing, only more so. Heterogeneous compute. SIMDs in the FPU. ARM cores right next to x86 cores. The ability to turn off the big, hungry cores and let the ARM cores run at low loads. The Windows desktop, simple web page rendering, network and I/O processing: all sorts of things can be done on an ARM core, using x86 emulation if needed. (Think Transmeta, only on a smaller scale.) But ideally with direct support at the kernel level from Microsoft.
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
146
106
I'm not saying Intel's approach isn't more elegant, or that it may not be very important going forward (who knows), but the two are very comparable. Let's say I told you that the C2Qs were 'brutally bolted together': two dual-cores that have to communicate over a slow bus, vs. the Phenom II with its fancy HT links and shared L3 cache. But then we see the benches.

As long as AMD's APUs can put up the numbers, I think no one will care how they are integrated.

I'm not saying AMD is bad in any way; I'm just saying they really need to stay sharp. With Haswell's supposed 40 EUs plus improvements, it's starting to look really bad for AMD if they don't get their act together.

Core 2 had the benefit (just like AMD's GPUs today) of being vastly superior. But currently Intel is moving at a much faster pace than AMD on the GPU side as well. That's why I think it's gonna be really hard for AMD on the GPU side after Haswell/Broadwell. Haswell is only about half a year away, Broadwell about a year and a half. Tock tick tock tick...

In 2014, it might very well be AMD's 28nm vs Intel's 14nm.
 

nehalem256

Lifer
Apr 13, 2012
15,669
8
0
AMD is just stupid. They needed to differentiate themselves. The uArch of K10 was fine; they needed to turn it into a SoC, like Bobcat was doing, only more so. Heterogeneous compute. SIMDs in the FPU. ARM cores right next to x86 cores. The ability to turn off the big, hungry cores and let the ARM cores run at low loads. The Windows desktop, simple web page rendering, network and I/O processing: all sorts of things can be done on an ARM core, using x86 emulation if needed. (Think Transmeta, only on a smaller scale.) But ideally with direct support at the kernel level from Microsoft.

Because surely Microsoft is going to add support for toggling between CPU instruction sets just because AMD asked for it.

If you want to go down that road, it would make more sense to just include a couple of Bobcat cores alongside the K10 cores. Then you could run the slow Bobcat cores for low loads.

Also, I believe Trinity is already very good at idle power consumption, so it would seem to give a better user experience to momentarily spin up your fast core, render the website, then spin it back down to an idle state. Which is what AMD did, with no need for Microsoft to rewrite Windows.
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
146
106
Mixed ARM+x86 CPUs don't make sense at all, unless you wish to retire one of them.

Atom/Bobcat already show in smartphones, tablets, etc. that ARM isn't needed for anything.
 

Blitzvogel

Platinum Member
Oct 17, 2010
2,012
23
81
"hasntwell"? LOL :D that's good, made me laugh.

Got me thinking, the name "Haswell" is actually a perfect setup for all kinds of denigrating modifications - Hasnothing, Hashype, Hastemps...



The thing with this kind of thinking is that it could have been equally applied to the P3 microarchitecture as reasoning for the transition to the P4 as well, only we saw what happened there and it turned out that going back to the P3 lineage and further enhancing it was actually exactly what the doctor ordered.

So when I see what are blindingly self-evident and obvious parallels between the P3/P4/Core2 history and the K10/BD/?? billion dollar effort to reinvent that wheel AMD-style, I really have to wonder just how full-circle this is all going to become when AMD decides to roll out "??".

It took Intel 4 nodes to get to Core 2 after releasing the P4. AMD is only one node into their BD cycle. And they just brought back Jim Keller, one guy who is probably going to be quite interested in seeing the K10 revisited as a replacement for BD.

I wouldn't want to rule that out given that we've all seen this movie before.

But AMD obviously wanted multithreading-like capabilities in their CPU cores, so the split-core BD architecture did make sense; plus, I'm sure the split FPU was an advantage in 128-bit workloads. It's clearly a chip made for the server market, we all know that; it's just not great for the home user who still requires brute strength per core but also needs expanded vector/SIMD throughput on their CPU.

I can see AMD going back to K10 in the future, but I think they're in it for the long haul with BD. Quad-core K10s with 256-bit FPUs sound great and all, but AMD effectively made that unnecessary with Fusion. AMD just needs to do a better job getting developers to make use of Fusion, and to better meld the CPU and GPU sides to more efficiently allocate and share resources; but it's the x86 IPC that really sucks for AMD. "Moar cores" obviously doesn't fix everything, especially when your competition is a process node ahead and their equivalent node was better in the first place.

AMD has been kicked out of the high end. The medium and low end is where they can still create great value per dollar. The problem is that most shoppers have no idea who AMD is, or why they shouldn't just go with Intel. The basic user wouldn't know the difference, and you can't rely on the typical Best Buy yokel to know anything beyond GHz either. Even if you could explain it to them and they understood, they'd say they have no need for expanded graphics performance or anything GPGPU-related. Even if the GPGPU performance meant something for their needs, they most likely wouldn't know how to take advantage of it. DirectCompute and OpenCL support from the get-go is a plus, but programs need to autodetect the hardware from the start to make it as user-friendly and smooth as possible.
 

Homeles

Platinum Member
Dec 9, 2011
2,580
0
0
AMD is just stupid. They needed to differentiate themselves. The uArch of K10 was fine; they needed to turn it into a SoC, like Bobcat was doing, only more so. Heterogeneous compute. SIMDs in the FPU. ARM cores right next to x86 cores. The ability to turn off the big, hungry cores and let the ARM cores run at low loads. The Windows desktop, simple web page rendering, network and I/O processing: all sorts of things can be done on an ARM core, using x86 emulation if needed. (Think Transmeta, only on a smaller scale.) But ideally with direct support at the kernel level from Microsoft.
Yes, I'm sure you're more intelligent than the electrical engineers at AMD. Clearly you know better than they do.

:rolleyes: