How will heterogeneous CPUs (Fusion) handle bandwidth limitations?

jondeker

Junior Member
May 30, 2010
18
0
0
When Intel/AMD add a decent GPU into their CPUs, there are going to be a lot of bandwidth limitations. I haven't read anything about how they will approach this problem.

DDR4 will be out at 12 GB/s per channel?
Quad channel would bring that to 48 GB/s, which is still limiting, and it's a bit impractical for people to buy DIMMs in sets of four.

Will GPU access to the CPU cache be enough?

Will they add eDRAM on the package?
 

frostedflakes

Diamond Member
Mar 1, 2005
7,925
1
0
My guess is they won't, at least not in the near future. The GPU portion will be memory-bandwidth limited compared to discrete cards using high-speed GDDR.

Also, DDR3 is already available with a bandwidth of 12.8 GB/s per channel and more. From what I've read, DDR4 is expected to double that, so it should offer 25.6 GB/s for a single channel.
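For reference, those per-channel figures fall straight out of the transfer rate times the 8-byte width of a 64-bit channel. A quick sketch, using DDR3-1600 and DDR4-3200 as the illustrative speed grades behind the numbers above:

[CODE]
/* Peak per-channel DDR bandwidth: transfers per second * 8 bytes (64-bit bus). */
#include <stdio.h>

static double channel_gbs(double mts) {   /* mts = mega-transfers per second */
    return mts * 8.0 / 1000.0;            /* GB/s (decimal)                  */
}

int main(void) {
    printf("DDR3-1600, one channel:  %.1f GB/s\n", channel_gbs(1600));      /* 12.8  */
    printf("DDR4-3200, one channel:  %.1f GB/s\n", channel_gbs(3200));      /* 25.6  */
    printf("DDR4-3200, quad channel: %.1f GB/s\n", 4 * channel_gbs(3200));  /* 102.4 */
    return 0;
}
[/CODE]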
 

punker

Member
Oct 24, 2007
121
0
0
When Intel/AMD add a decent GPU into their CPUs, there are going to be a lot of bandwidth limitations. I haven't read anything about how they will approach this problem.

DDR4 will be out at 12 GB/s per channel?
Quad channel would bring that to 48 GB/s, which is still limiting, and it's a bit impractical for people to buy DIMMs in sets of four.

Will GPU access to the CPU cache be enough?

Will they add eDRAM on the package?



The GPU will have its own dedicated RAM channel and shared turbo cache.
 

heyheybooboo

Diamond Member
Jun 29, 2007
6,278
0
0
AMD's charts show an independent on-die buffer for the GPU, separate from the CPU cache structure.

Everything appears to be interconnected with HT links to the memory controller. It's hard to tell from the lines on the chart I saw how many links there will be, or how they might work dependently/independently of each other.




 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
They will eventually become homogeneous. A homogeneous out-of-order architecture requires less bandwidth because it shares its caches among fewer threads (giving each thread more local storage and thus requiring fewer off-chip accesses). Note that a GPU wastes practically all of its fast storage on thread contexts, while CPUs save a lot of bandwidth due to high cache hit ratios.

They can achieve such a high-throughput homogeneous architecture, while limiting power consumption, by executing 1024-bit AVX instructions (AVX-1024) on 256-bit execution units. That would allow clock-gating the majority of the power-hungry out-of-order execution logic for up to 3/4 of the time.
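To make the 3/4 figure concrete, here is a purely conceptual C sketch; AVX-1024 is hypothetical and the types below are made up, but it shows the idea of one wide operation being cracked into four 256-bit slices, so the fetch/decode/rename machinery only has to produce new work once every four execution cycles:

[CODE]
/* Conceptual sketch only: a hypothetical 1024-bit add executed as four
 * 256-bit slices. One decoded instruction keeps a 256-bit unit busy for
 * four cycles, so the front end could be clock-gated roughly 3/4 of the time. */
typedef struct { float lane[8]; } v256;   /* stand-in for one 256-bit register */
typedef struct { v256 slice[4]; } v1024;  /* hypothetical 1024-bit register    */

static v1024 add_v1024(v1024 a, v1024 b) {
    v1024 r;
    for (int s = 0; s < 4; ++s)           /* one 256-bit slice per cycle       */
        for (int i = 0; i < 8; ++i)
            r.slice[s].lane[i] = a.slice[s].lane[i] + b.slice[s].lane[i];
    return r;
}
[/CODE]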

A homogeneous architecture also enables a wide range of optimizations. With a GPU you don't bother to minimize your task size; you just attempt to churn through it. But that wastes a lot of bandwidth. With a homogeneous architecture that combines high throughput with high sequential performance, you can make complex decisions fast enough not to waste precious resources on redundant work.

Note that the vast majority of GPGPU applications run faster on the CPU than on an integrated GPU. This balance will only shift further in favor of the CPU once AVX2 (which adds FMA and gather support) and AVX-1024 have been added.
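For a sense of what FMA and gather buy, here is a minimal AVX2 sketch (the function and array names are just illustrative); it gathers eight floats through an index table and folds a multiply-add into one instruction:

[CODE]
/* Minimal AVX2 sketch: gather eight floats via an index table, then fused
 * multiply-add into an accumulator. Needs AVX2+FMA hardware; build with
 * something like gcc -O2 -mavx2 -mfma. */
#include <immintrin.h>
#include <stddef.h>

void scaled_gather_accum(float *dst, const float *table, const int *idx,
                         const float *scale, size_t n) {
    for (size_t i = 0; i + 8 <= n; i += 8) {
        __m256i vidx = _mm256_loadu_si256((const __m256i *)(idx + i));
        __m256  vtab = _mm256_i32gather_ps(table, vidx, 4); /* gather (AVX2) */
        __m256  vscl = _mm256_loadu_ps(scale + i);
        __m256  vdst = _mm256_loadu_ps(dst + i);
        vdst = _mm256_fmadd_ps(vtab, vscl, vdst);           /* FMA: a*b + c  */
        _mm256_storeu_ps(dst + i, vdst);
    }
}
[/CODE]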
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
Once they become fully integrated, what we now see as the GPU part will be a fancy vector coprocessor, like AltiVec on steroids, and its capabilities will be better matched to the CPU. Right now, and into the near future, it's a bit lopsided, because AMD has a poor CPU but a great GPU and good drivers, while Intel is on the opposite side of the spectrum.

Once the latency penalty of going from 'CPU' to 'GPU' is gone, wasteful loops that the 'CPU' could handle better with more efficient code will simply be run on the 'CPU'.

As far as bandwidth waste goes, note that there has been a trend for decades now: memory simply does not scale up with processors. Big iron could add more channels. Everyone else designed their HW and SW to use less DRAM I/O, making up for it with caches, with algorithm choices that favor doing a lot of work on small pieces of data, and with data structures that are friendly to HW prefetchers and cache eviction policies, rather than working directly on large data structures (sometimes that's unavoidable, of course).
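A textbook example of that "much work on small pieces of data" style is loop blocking: process the data in cache-sized tiles so each value fetched from DRAM is reused many times before being evicted. A rough sketch (the tile size is just a placeholder, not tuned for any particular chip):

[CODE]
/* Blocked matrix multiply: each TILE x TILE block is loaded from DRAM once
 * and reused from cache, instead of streaming the full matrices repeatedly.
 * Assumes c[] is zeroed by the caller. */
#include <stddef.h>

#define TILE 64  /* placeholder tile edge; pick so the working set fits in cache */

void blocked_matmul(float *c, const float *a, const float *b, size_t n) {
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE)
                for (size_t i = ii; i < ii + TILE && i < n; ++i)
                    for (size_t k = kk; k < kk + TILE && k < n; ++k) {
                        float aik = a[i * n + k];
                        for (size_t j = jj; j < jj + TILE && j < n; ++j)
                            c[i * n + j] += aik * b[k * n + j];
                    }
}
[/CODE]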

GPUs are beginning to do the same thing, with caches and local memories. It's not a problem that can be solved overnight, and it won't be. But it is only insurmountable if you assume that the parallel processing engines in CPUs 10+ years from now will be just like your current GPU, except integrated into the CPU. Previously they were transistor- and power-efficient for graphics; now they are becoming more efficient at generalized I/O for tons of threads. Over time they will continue to evolve to be more efficient with respect to memory use, spending more space and power on resource management and I/O, and less space and power on the functional units.
 

GammaLaser

Member
May 31, 2011
173
0
0
I think we'll also eventually see integration of DRAM onto the package.

It's inevitable if we want to keep adding data-hungry cores while keeping the packaging cost in check.
 

nenforcer

Golden Member
Aug 26, 2008
1,767
1
76
[Image: intelgpu.png]


In the meantime we can all dream of discrete Intel GPU parts!
 

sangyup81

Golden Member
Feb 22, 2005
1,082
1
81
AMD will just be inefficient at the start until the product matures. It took the K10 a while to become what the Phenom II is. Imagine if the K10 had been more like the Phenom II from the start... 3 years ago?
 

Arkadrel

Diamond Member
Oct 19, 2010
3,681
2
0
[Image: intelgpu.png]


In the meantime we can all dream of discrete Intel GPU parts!


Only that is a "CPU" that you put into your PCIe slot.
*It* doesn't do graphics; it does GPGPU-ish calculations (only, but very well).

This is the card that will KILL Nvidia's CUDA and its use in servers.
Basically, Intel plans on "stealing" Nvidia's most lucrative business segment, the top line for professionals/servers.
 

sm625

Diamond Member
May 6, 2011
8,172
137
106
Legit Reviews just did another review of Llano with different memory timings. It seems as if gaming performance will just keep going up in a linear fashion if memory bandwidth can be increased. The only cutoff point I am seeing is due to having to loosen the RAM timings as the speed increases. I haven't seen any indication that Llano can reach a memory bandwidth sufficient to shift the bottleneck onto the shaders. Until someone does that, we won't really know just how fast these integrated shaders really can be. If they did a quad-channel Llano we might see it actually outperform a 5750.
 

Topweasel

Diamond Member
Oct 19, 2000
5,436
1,654
136
Legit Reviews just did another review of Llano with different memory timings. It seems as if gaming performance will just keep going up in a linear fashion if memory bandwidth can be increased. The only cutoff point I am seeing is due to having to loosen the RAM timings as the speed increases. I haven't seen any indication that Llano can reach a memory bandwidth sufficient to shift the bottleneck onto the shaders. Until someone does that, we won't really know just how fast these integrated shaders really can be. If they did a quad-channel Llano we might see it actually outperform a 5750.

But then it wouldn't be worth it in terms of the target market. Hell, we're probably lucky it's dual channel.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,355
1,547
136
Note that a GPU wastes practically all of its fast storage on thread contexts, while CPUs save a lot of bandwidth due to high cache hit ratios.

There's a reason for this. Texture fetches, which account for most of the memory read bandwidth used by GPUs, have no temporal locality of access at all. In fact, in the common case of mipmaps and rendering a larger texture onto a smaller area, they have the opposite: after you have used a texel once, it instantly becomes the least likely to be reused data point in your data set. Because of this, caches on GPUs only help in combining accesses (there is a lot of spatial locality). You cannot get high cache hit ratios for texture fetches, no matter how smart your cache architecture is.

A somewhat similar effect happens on writes: nearly all writes go to the frame buffer, and unless you tile it somehow, it doesn't cache well because you rarely write to the same location within a short time window.

GPU architects are not idiots; if their workloads could use caches, they would build products with them.

As for frame buffers, they have the advantage of being quite small: about half of the total external bandwidth requirements of the GPU vanish once you can put some 20-30 MB of fast dedicated local buffer near it. This will perhaps be feasible in the Haswell timeframe.
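That 20-30 MB figure is easy to sanity-check; a rough sketch assuming a double-buffered 1080p target with 32-bit color and a 32-bit depth buffer:

[CODE]
/* Rough sanity check of the 20-30 MB estimate for an on-package frame buffer:
 * 1920x1080, 32-bit color, double buffered, plus a 32-bit depth buffer. */
#include <stdio.h>

int main(void) {
    const double pixels = 1920.0 * 1080.0;
    const double mb     = 1024.0 * 1024.0;
    const double color  = pixels * 4 / mb;         /* ~7.9 MB per color buffer */
    const double depth  = pixels * 4 / mb;         /* ~7.9 MB depth buffer     */
    printf("~%.1f MB total\n", 2 * color + depth); /* ~23.7 MB                 */
    return 0;
}
[/CODE]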
 

sm625

Diamond Member
May 6, 2011
8,172
137
106
But then it wouldn't be worth it in terms of the target market. Hell, we're probably lucky it's dual channel.

I wonder about that. Two extra channels on a motherboard and socket would cost what to a Dell, $10? It would cost maybe $5 more for the CPU. And then we're talking about being able to use the most dirt-cheap RAM in the world, so I would estimate no more than $20 more for 4x2GB than for 2x2GB. So we're talking about $35 more in total. For 5750-level performance? Not to mention the whole system would be loads faster, not just the games.
 

wuliheron

Diamond Member
Feb 8, 2011
3,536
0
0
When Intel/AMD add a decent GPU into their CPUs, there are going to be a lot of bandwidth limitations. I haven't read anything about how they will approach this problem.

DDR4 will be out at 12 GB/s per channel?
Quad channel would bring that to 48 GB/s, which is still limiting, and it's a bit impractical for people to buy DIMMs in sets of four.

Will GPU access to the CPU cache be enough?

Will they add eDRAM on the package?

DDR4's release date got pushed back to 2015, so no immediate relief from that direction.

There's a rumor going around that Intel has already managed to piggyback 1GB of DDR2 eDRAM on an Ivy Bridge for roughly the bandwidth of a Radeon 5770. The big question is how expensive it will be.

AMD's Llano seems to take a different approach, favoring overclocked system RAM, and I would guess we'll see combinations of the two win out in the long run. With memristors and other alternatives making significant strides, it's anyone's guess what the next several years will bring.
 

Topweasel

Diamond Member
Oct 19, 2000
5,436
1,654
136
I wonder about that. Two extra channels on a motherboard and socket would cost what to a Dell, $10? It would cost maybe $5 more for the CPU. And then we're talking about being able to use the most dirt-cheap RAM in the world, so I would estimate no more than $20 more for 4x2GB than for 2x2GB. So we're talking about $35 more in total. For 5750-level performance? Not to mention the whole system would be loads faster, not just the games.

Well, for one, it would increase the complexity of the board, which would also increase costs. It would increase the complexity of the CPU, which would cost more money, and it would increase the needed pin density, which means a changeover to LGA, which would increase board costs again. Maybe the end result is $20-$30 more expensive for the platform.

Llano, on the other hand, is an expensive chip transistor-wise that is supposed to compete in an area where not just $10 but even $1 can drastically impact adoption. People already have a hard enough time understanding where Llano is trying to fit; adding $20 to it, no matter the model, makes it even harder to understand where it really sits in competition with the i3 and SB Pentiums.
 

ElFenix

Elite Member
Super Moderator
Mar 20, 2000
102,414
8,356
126
Well, for one, it would increase the complexity of the board, which would also increase costs. It would increase the complexity of the CPU, which would cost more money, and it would increase the needed pin density, which means a changeover to LGA, which would increase board costs again. Maybe the end result is $20-$30 more expensive for the platform.

Llano, on the other hand, is an expensive chip transistor-wise that is supposed to compete in an area where not just $10 but even $1 can drastically impact adoption. People already have a hard enough time understanding where Llano is trying to fit; adding $20 to it, no matter the model, makes it even harder to understand where it really sits in competition with the i3 and SB Pentiums.

A lot of Llano's transistors are very dense. It's barely bigger than a 4-core SB. Area is the expense, not transistors.
 

GammaLaser

Member
May 31, 2011
173
0
0
DDR4's release date got pushed back to 2015, so no immediate relief from that direction.

False. DRAM manufacturers are already producing test chips/modules, and they expect to begin mass production by next year. It may not be until 2015 that it takes the majority of the market, but that's a different issue.

Three months after Samsung, it's over to Hynix to announce its first DDR4 chips. Made on a 3x nm process, these 256 MB chips are used on a 2 GB ECC SO-DIMM stick.


These are DDR4-2400 chips, while Samsung has produced just DDR4-2133. DDR4 memory should officially go up to DDR4-3200, which is double the bandwidth of DDR3-1600. At this speed, Hynix has kept voltage at 1.2 V, which will be the DDR4 standard.

Of course, DDR4 is only just showing the first signs of life, and the JEDEC standard will only be finalised in the second half of this year. Hynix moreover says that it plans to begin high-volume production over a year from now, in the second half of 2012! According to iSuppli, DDR4 should represent 5% of the DRAM market come 2013, and 50% in 2015, against 71% for DDR3 memory in 2012 and 49% in 2014.

http://www.behardware.com/news/11425/hynix-produces-its-first-ddr4-modules.html
 

sm625

Diamond Member
May 6, 2011
8,172
137
106
Llano, on the other hand, is an expensive chip transistor-wise that is supposed to compete in an area where not just $10 but even $1 can drastically impact adoption. People already have a hard enough time understanding where Llano is trying to fit; adding $20 to it, no matter the model, makes it even harder to understand where it really sits in competition with the i3 and SB Pentiums.

If AMD truly believes that the future is Fusion, then they need a way to feed it. If it costs $30 more, then so be it. At least if they do this now, they will have all the pieces in place for integration with BD cores. With quad-channel memory and 50% more IPC per core, it is easy to imagine better-than-5770-level performance out of a $150 APU. If they did this, I'd be confident in saying that my Sapphire 5750 was the last GPU card I will ever buy.
 

Topweasel

Diamond Member
Oct 19, 2000
5,436
1,654
136
If AMD truly believes that the future is Fusion, then they need a way to feed it. If it costs $30 more, then so be it. At least if they do this now, they will have all the pieces in place for integration with BD cores. With quad-channel memory and 50% more IPC per core, it is easy to imagine better-than-5770-level performance out of a $150 APU. If they did this, I'd be confident in saying that my Sapphire 5750 was the last GPU card I will ever buy.

We aren't talking about Fusion as the greatest thing since sliced bread; we are talking about Fusion as a cost-effective, low-cost laptop and desktop proof-of-concept CPU. Llano would still be highly successful, and graphics, especially in the laptop models, would be exceedingly good compared to its competitors even on a single channel.

Trinity isn't really going to fix this either, because the bar will move with GPU die shrinks, and people, no matter the memory bandwidth, will find a way to call it bandwidth starved. The fact is you can keep throwing tech at it and shift the bottleneck, but there is always going to be one. Next it would be the CPU. The cost-effective solution for this generation and the next is to use dual-channel memory.
 

Topweasel

Diamond Member
Oct 19, 2000
5,436
1,654
136
A lot of Llano's transistors are very dense. It's barely bigger than a 4-core SB. Area is the expense, not transistors.

It's still bigger than a CPU that tops out at $150 should be, and its competitor is a 2-core SB.
 

podspi

Golden Member
Jan 11, 2011
1,965
71
91
It's still bigger than a CPU that tops out at $150 should be, and its competitor is a 2-core SB.

On the other hand, look at GPU prices and sizes. Doesn't look so ridiculous then...

How do we judge the size of these things: by their CPU size, or by their GPU size?