How will heterogeneous cpu's (fusion) handle bandwidth limitations?

sm625 · Jul 20, 2011

When IMC+ socket 939 came around, there was a period of time where the cpu was far from bandwidth starved. That period ushered in a 3+ year era where any kind of dual channel configuration resulted in sufficient bandwidth. Now we're pushing up against bandwidth limits again. At least AMD is anyway. (Intel has crap gpu so they dont need memory bandwidth.) 192/256 bit memory bus would remove that limit for a few years. We are already in an age where many x86/ARM mainboards are manufactured much like a discrete gpu card, with memory chips soldered onto one or both sides of the board. We could easily have a 192 bit memory interface composed of just 6 discrete DDR ICs. For a bit more cost we could do it with 3 DDR ICs, but there is no need for that because only a tablet would need that level of miniaturization, and 192 bit memory would be overkill for a tablet.

zephyrprime · Jul 20, 2011

Most memory bandwidth consumption is due to texturing. The need for texturing is not increasing as much as the need for shader power is. So eventually, the bandwidth requirements of GPUS will become relatively modest compared to how much bandwidth ram can provide. However, that point in time will probably be about 5 years out so until then, we're screwed.

Idontcare · Jul 20, 2011

GammaLaser said:
I think we'll also eventually see integration of DRAM onto the package.

It's inevitable if we want to keep adding data-hungry cores while keeping the packaging cost in check.

Intel presented some slides at an IDF a while back regarding die-stacking possibilities for getting DRAM closer to the CPU (presumably for bandwidth and latency improvements).

(^ I don't have the original graphic on this computer but here is the photoshopped version that is in my photobucket account from a thread here long ago, probably a lot more info in that original ATF thread if you can find it)

GammaLaser · Jul 20, 2011

Idontcare said:
Intel presented some slides at an IDF a while back regarding die-stacking possibilities for getting DRAM closer to the CPU (presumably for bandwidth and latency improvements).

(^ I don't have the original graphic on this computer but here is the photoshopped version that is in my photobucket account from a thread here long ago, probably a lot more info in that original ATF thread if you can find it)

Very interesting, especially the way they put the DRAM/CPU dies under the IHS. Thanks :thumbsup:

Idontcare · Jul 20, 2011

GammaLaser said:
Very interesting, especially the way they put the DRAM/CPU dies under the IHS. Thanks :thumbsup:

Unfortunately I don't remember which part of that slide was the original versus which part was my effort to creatively solve a challenge of the step-height delta's.

IIRC the original slide has the dram die fully cover the underlying CPU die as if they were the same size, which then becomes a hot-spot issue for the dram lying over the active parts of the CPU.

At any rate we know die-stacking is an established technology and all these companies have very smart engineers working on stuff today that we won't see for another 5 yrs. They probably already have die stacking stuff well underway and are now tasking themselves with developing the N+2 generation stuff.

GammaLaser · Jul 20, 2011

Idontcare said:
Unfortunately I don't remember which part of that slide was the original versus which part was my effort to creatively solve a challenge of the step-height delta's.

IIRC the original slide has the dram die fully cover the underlying CPU die as if they were the same size, which then becomes a hot-spot issue for the dram lying over the active parts of the CPU.

At any rate we know die-stacking is an established technology and all these companies have very smart engineers working on stuff today that we won't see for another 5 yrs. They probably already have die stacking stuff well underway and are now tasking themselves with developing the N+2 generation stuff.

Also, silicon interposers might be a feasible interim solution considering all the thermal issues that die-on-die stacking has.

Topweasel · Jul 20, 2011

podspi said:
On the other hand, look at GPU prices and sizes. Doesn't look so ridiculous then...

How do we judge the size of these things, by their relative CPU size, or GPU size?

I am looking at it vs. its competitor and AMD's target segment for the chip. They aren't competing against a discrete card. They are competing against using another companies integrated graphics/cannibalizing their low end GPU's to do so. They do it by leveraging their GPU capabilities because the CPU portion is lacking. So again, while Trinity and beyond can look to upping the target segmants, this one is completely about market penetration and proof of concept.

Kind of like the original Xbox. Wasn't a real console but wasn't a real PC. It worked well enough for Microsoft to design the Xbox360, which was from the get go the destination. Its that same thing with Fusion. I would say that even Trinity isn't going to be that product that brings it all together, its going to be an important step, but its not going to happen till you see the blend of GCN and CPU. In that respects AMD needs to be price competitive and its not by increasing the costs of what is already a decently large (if not as huge as you might have felt I was trying to say) CPU and platform even by a couple of bucks. If they were trying to do anything else they would have canceled Llano and just waited for Trinity.

Idontcare · Jul 21, 2011

GammaLaser said:
Also, silicon interposers might be a feasible interim solution considering all the thermal issues that die-on-die stacking has.

I doubt anything but TSV would be considered given the timeline involved.

pantsaregood · Jul 21, 2011

From what I've seen, current CPUs aren't really pushing their current bandwidth limitations at all. Sandy Bridge, Thuban, and Gulftown will all function with minimal performance loss at DDR3 (or in the case of Thuban, even DDR2) 1066 speeds.

The GPU portion of modern processors, however, is pretty bandwidth starved. I recall Anandtech overclocking SB's HD 3000 by some absurd amount and receiving almost no performance gain due to bandwidth limitation. Llano's GPU similarly (and likely, far more significantly) is bandwidth limited. Llano's IMC natively runs at DDR3 1866, and using slower RAM actually results in tangible performance loss. It would likely see gains from running faster RAM, as well.

GammaLaser · Jul 21, 2011

Idontcare said:
I doubt anything but TSV would be considered given the timeline involved.

Seeing as TSV is an option in both cases, interposer would be a nice stepping stone towards die-to-die stacking. Rumors were going around that IVB would have interposers, but I see 22nm + tri-gate + TSV as a ton of risk to implement at the same time. Then again who knows what Intel is really up to

GaiaHunter · Jul 21, 2011

Tessellation is supposed to alleviate the bandwidth requirements, although so far all the know game tessellation implementations out there achieve nothing of the sort.

Idontcare · Jul 21, 2011

GammaLaser said:
Seeing as TSV is an option in both cases, interposer would be a nice stepping stone towards die-to-die stacking. Rumors were going around that IVB would have interposers, but I see 22nm + tri-gate + TSV as a ton of risk to implement at the same time. Then again who knows what Intel is really up to

Oops, sorry, I didn't realize we were entertaining a 22nm intercept for the stacked chips, my mind was firmly on a 14nm intercept target.

The rumors regarding IB and die stacking came from Charlie's incorrect article aimed at rumoring the notion that IB was going to have 3d gates for the memory areas (i.e. that the core logic areas were still going to be planar CMOS).

TSV certainly won't be HVM ready for logic applications in 2012 when IB is launching. 2014 for Haswell yes, 2012 not a chance.

wuliheron · Jul 21, 2011

pantsaregood said:
From what I've seen, current CPUs aren't really pushing their current bandwidth limitations at all. Sandy Bridge, Thuban, and Gulftown will all function with minimal performance loss at DDR3 (or in the case of Thuban, even DDR2) 1066 speeds.

The GPU portion of modern processors, however, is pretty bandwidth starved. I recall Anandtech overclocking SB's HD 3000 by some absurd amount and receiving almost no performance gain due to bandwidth limitation. Llano's GPU similarly (and likely, far more significantly) is bandwidth limited. Llano's IMC natively runs at DDR3 1866, and using slower RAM actually results in tangible performance loss. It would likely see gains from running faster RAM, as well.

Exactly, it took years of tweaking but finally they've got a pretty good idea of how to do multicore processing. Hyperthreading, cache, prefetch, etc. all make major contributions to bandwidth. Now we need to learn how to do the same for a cpu/gpu combination and I suspect the learning curve will be even steeper.

Adding edram is the equivalent of adding cache, but at some point must reach diminishing returns. Adding more and faster system ram can likewise help, but has to also reach diminishing returns eventually. That leaves heterogeneous processing architecture as the really creative area where significant gains can be made, but also where the possibilities are almost endless.

CPUarchitect · Jul 22, 2011

wuliheron said:
That leaves heterogeneous processing architecture as the really creative area where significant gains can be made, but also where the possibilities are almost endless.

You meant homogeneous, right? The only way to reduce bandwidth needs is to reduce the number of threads, and give full control to the developers. AVX2's gather support and fused multiply-add allow the CPU to perform high-throughput shader processing, and beyond. The only thing lacking is power efficiency, but that might be achieved by executing 1024-bit instructions on 256-bit execution units in four cycles and taking advantage of the clock gating opportunities this lower instruction rate creates.

Unifying the CPU and GPU would give us the best if both worlds. That convergence has been going on for years now.

Search

How will heterogeneous cpu's (fusion) handle bandwidth limitations?

sm625

Diamond Member

zephyrprime

Diamond Member

Idontcare

Elite Member

GammaLaser

Member

Idontcare

Elite Member

GammaLaser

Member

Topweasel

Diamond Member

Idontcare

Elite Member

pantsaregood

Senior member

GammaLaser

Member

GaiaHunter

Diamond Member

Idontcare

Elite Member

wuliheron

Diamond Member

CPUarchitect

Senior member

TRENDING THREADS