
WCCFAMD Carrizo APU on the 28nm Node Will Have Stacked DRAM On Package

csbin

Senior member
[RUMOR] Notice the shiny tag at the start. Continue. We have received some information regarding AMD’s upcoming Carrizo APU; since it is unverified, I will be treating it as a rumor. The report comes from the Italian site bitsandchips.it and states that AMD’s upcoming flagship APUs will have stacked DRAM while staying on the 28nm node.




Now it goes without saying that you need to keep that pinch of salt handy throughout this post. However, this news, if true, is very interesting. We know for a fact that APUs benefit a lot from fast memory, and if these APUs truly support HBM then we can expect some very substantial per-clock performance gains jumping from Kaveri to Carrizo, even while staying on the same node. Another important point to note is that with the Carrizo APU the implementation of HSA will be perfected, probably resulting in significant gains in compute as well as gaming.
Now, we already know that AMD is working with Hynix to create stacked DRAM. We also know that this memory will come in two types, namely 3DS and HBM (don’t be fooled by the lack of “3D” in the latter name; both are stacked). The memory in question here is the HBM variant, which features the highest bandwidth, and I know for a fact that two types are already in production: the 2-Hi and 4-Hi variants. You can find a detailed analysis in my Pascal Architecture Analysis. Now, the max bandwidth of a single HBM stack is 128-256 GB/s (compare this to the 28 GB/s of a single GDDR5 chip), so we are looking at an enormous jump in bandwidth, albeit at reduced clocks (most probably around 1000 MHz).
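Those bandwidth figures fall straight out of the interface math. A quick sketch (the bus widths and per-pin rates are commonly cited gen-1 HBM and GDDR5 figures, used here as assumptions):

```python
# Peak-bandwidth math for a single HBM stack vs. a single GDDR5 chip.
# Assumed figures: 1024-bit HBM interface at 1-2 Gbps per pin (gen-1 HBM),
# and a 32-bit GDDR5 chip at 7 Gbps per pin.

def peak_bandwidth_gbs(bus_width_bits: int, pin_rate_gbps: float) -> float:
    """Peak bandwidth in GB/s = bus width (bits) * per-pin rate (Gb/s) / 8."""
    return bus_width_bits * pin_rate_gbps / 8

print(peak_bandwidth_gbs(1024, 1.0))  # 128.0 GB/s per stack
print(peak_bandwidth_gbs(1024, 2.0))  # 256.0 GB/s per stack
print(peak_bandwidth_gbs(32, 7.0))    # 28.0 GB/s for one GDDR5 chip
```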




Now, we previously received a much more authentic report that the APU would feature DDR4 support, but this is obviously better. Carrizo’s die size will be smaller than Kaveri’s according to the same source, though I am not sure how they aim to accomplish this if the HBM is truly “on-package”. The stacked DRAM will be manufactured on the 20nm node, but the APU will stay at 28nm. Previous leaks had suggested that the upcoming APU will be compatible with the FM2+ socket and have a TDP no greater than 65W. However, the last authentic leak was quite a while back, and AMD’s plans could have changed in the meantime. We will be waiting for more information on this front; in the meantime, this should serve as good food for thought, if nothing else.

Read more: http://wccftech.com/amd-carrizo-apu-28nm-stacked-dram-alleges-italian-leak/#ixzz37LwoX9Pc
 
Just wondering what Carrizo would do with all that memory bandwidth. Will the iGPU really be able to make use of it all?
 
HBM can't come soon enough. My only question for the more knowledgeable among us: how close does this get us to Crystalwell bandwidth?

I can't wait for next-gen GPUs with this tech and cheap laptops with even better entry-level gaming performance. Not to mention that being able to pick up the cheapest memory for an APU build could bring even more price-conscious people into the PC gamer fold.
 
Gen 1 Crystalwell runs at 1.6 GHz and gives 50 GB/s. Gen 2 Crystalwell (Broadwell, possibly Skylake) runs at 2 GHz, which would give up to 62.5 GB/s (timings are relaxed though, so the final number will be interesting to see).

Compare that to the rumored 64 GB/s and 128 GB/s of Carrizo.
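For what it's worth, the gen-2 figure follows from simply scaling the gen-1 number with the eDRAM clock. A quick sketch (assuming bandwidth scales linearly with clock and the link width is unchanged):

```python
# Scaling the quoted gen-1 Crystalwell figure linearly with the eDRAM clock.
# Assumes bandwidth is purely clock-limited and the link width is unchanged.

GEN1_CLOCK_GHZ = 1.6
GEN1_BW_GBS = 50.0  # GB/s per direction, as quoted above

def crystalwell_bw(clock_ghz: float) -> float:
    """Estimated bandwidth in GB/s at a given eDRAM clock."""
    return GEN1_BW_GBS * clock_ghz / GEN1_CLOCK_GHZ

print(round(crystalwell_bw(2.0), 1))  # 62.5 GB/s at the gen-2 clock
```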

If this rumor is true, Carrizo probably looks a lot like this:
95dd2b6d.jpg
 
It would triple the bandwidth of current APUs.
To give it some scale: it is like going from an HD 7770 to an HD 7950, a massive +200% bandwidth increase!

I don't think it will happen with Carrizo, but it will happen sooner rather than later.
 
Does latency come into play, though? I assume Intel's method ends up with considerably lower latency?

Unrelated question: did the APU in the PS4 end up using an interposer like this? I thought it just ended up with GDDR5 attached to the APU in typical fashion, like on a graphics card.

 
Does latency come into play, though? I assume Intel's method ends up with considerably lower latency?
It doesn't matter for the GPU, but for the CPU it does... a little.

I have no idea what the latency difference would be, but it's probably not that huge.
Unrelated question: did the APU in the PS4 end up using an interposer like this? I thought it just ended up with GDDR5 attached to the APU in typical fashion, like on a graphics card.
No, it didn't.
 
Well, they have to do something like this fairly soon, because these things don't really make sense right now 🙂

It will be fascinating to see how well it does if it does arrive, though.
 
Interesting. It is sort of disappointing that the general memory interface isn't what's improving here (DDR4, or something significantly better than DDR4), but at the same time, that kind of bandwidth increase is far beyond anything you can expect from the usual improvements in JEDEC memory specs.

Unfortunately, this almost guarantees that max clock speeds for Carrizo will be lower than even Kaveri's. Or does it?
 
Definitely. Look at some Kaveri overclocking results: even with DDR3-2400, a GPU overclock from 720 MHz to 1 GHz+ has almost no impact. http://www.eteknix.com/amd-kaveri-a10-7850k-overclocking-unleashing-gcns-potential/6/ Seems like a pretty massive memory bottleneck.

Yes, I agree that a fast cache would be beneficial. But the question is how fast does it have to be? At some point there will be diminishing returns.

And note that this is Carrizo we're talking about, not some top end AMD discrete GFX card. So the number of GPU cores and hence required memory bandwidth will be much lower in Carrizo.
 
The 2.5D High Bandwidth Memory tooling at GlobalFoundries and TSMC only supports 20nm planar and FinFET processes.
====
Here is a comparison to DDR3 and GDDR5:
kzgFBb4.jpg


Same latency as GDDR5 but 4.6x more bandwidth.
 
A big fat 1GB L3 HBM cache should help a lot.

How big does it have to be? In the AnandTech article on Crystalwell, Intel said that:

"There’s only a single size of eDRAM offered this generation: 128MB. Since it’s a cache and not a buffer (and a giant one at that), Intel found that hit rate rarely dropped below 95%. It turns out that for current workloads, Intel didn’t see much benefit beyond a 32MB eDRAM however it wanted the design to be future proof. Intel doubled the size to deal with any increases in game complexity, and doubled it again just to be sure."

So basically Intel is saying that there is little point in having more than 32 MB.
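A small average-memory-access-time (AMAT) sketch shows why those last few points of hit rate still matter; the latencies here are illustrative placeholders, not measured numbers:

```python
# Average memory access time (AMAT) sketch: misses pay the full DRAM penalty,
# so even a small hit-rate drop shows up directly in the average.
# The latency values are illustrative placeholders, not measured numbers.

def amat_ns(hit_rate: float, cache_ns: float = 30.0, dram_ns: float = 100.0) -> float:
    """AMAT = hit_time + miss_rate * miss_penalty, in nanoseconds."""
    return round(cache_ns + (1.0 - hit_rate) * dram_ns, 2)

print(amat_ns(0.95))  # 35.0 ns at the 95% hit rate Intel quotes
print(amat_ns(0.93))  # 37.0 ns -> a 2-point hit-rate drop costs ~2 ns per access
```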
 
The implementation of HBM in APUs and GPUs is a double L3.

Installation Cache which is SRAM and low latency. (The traditional L3 cache)
HBM Stack which is DRAM and high latency. (The novel L3 cache)

As a non-limiting example, for a die-stacked DRAM over a multicore processor, the installation cache may be placed on the same chip as the multicore processor's memory controller. In some embodiments, the main cache may contain a logic layer upon which the installation cache may be placed. Regardless of how the installation cache is implemented in any particular system, the installation cache provides the advantage of low latency cache returns allowing the use of higher latency die-stacked DRAM L3 cache memory with reduced risk of increasing cache misses.
 
Even with a bigger L3, hit rate is what matters: a 2% difference is big if you need to go out to main memory. I would assume this cache would be shared by the CPU and GPU, so an even split becomes 512 MB each (big assumption).

I remember S/A had a die shot of an interposer for SI cards that never came to fruition, I believe due to costs. That was a few years ago, so I would assume costs have come down somewhat and yields have increased. I do think AMD needs this to compete with Intel's high-end mobile graphics offerings.

I could see them doing this for the top bin; if anything fails, fuse off the HBM and use a normal interface. I do hope they offer this. I also hope they focus more on CPU improvements this round, as Kaveri seemed to focus mostly on the graphics side.
 
So if this is true, how soon will we get affordable APU-based laptops and tablets with gaming potential similar to the PlayStation 4's? Of course we will need a die shrink or two, but how many: one, two, three, four?

45w tdp
35w tdp
15w tdp
5w tdp

(I am using teraflops to make the comparison easier, since these are all the same architecture and thus scale similarly. Comparing against other architectures, such as Nvidia's, by teraflops does not work.)
The PlayStation 4 is 1.84 teraflops.
The 7970M is 2.176 teraflops (100W TDP) (the 8970M and R9 M290X are the same chip, but with boost clocks that can go 50 MHz faster).
The 7950M is 1.792 teraflops (75W TDP).
The R9 M270X is 1.382 teraflops (50W TDP).
The R9 M265X is 1.155 teraflops (35W TDP).
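For anyone wanting to re-derive those numbers: GCN's single-precision rating is simply shaders × 2 FLOPs per cycle × clock. A quick sketch (the shader counts and clocks are the commonly published specs, taken as assumptions):

```python
# GCN single-precision throughput: TFLOPS = shaders * 2 FLOPs/cycle * clock.
# Shader counts and clocks are the commonly published specs (assumptions here).

def gcn_tflops(shaders: int, clock_mhz: float) -> float:
    """Single-precision TFLOPS for a GCN part (1 FMA = 2 FLOPs per ALU)."""
    return shaders * 2 * clock_mhz * 1e6 / 1e12

print(gcn_tflops(1152, 800))  # PS4:   1.8432
print(gcn_tflops(1280, 850))  # 7970M: 2.176
```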

Die shrinks usually get you a 30% reduction in power consumption if you keep everything else the same, right? Thus we are talking about a 7970M becoming a ~49W chip after two die shrinks, and a 7950M becoming a ~37W chip after two die shrinks.

----

Can someone check my math? Also how much power consumption is due to the memory? And how much power consumption would stack memory take?
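Happy to sanity-check it. Treating each shrink as a flat 30% power cut (a rule of thumb, not a guarantee), the math works out like this:

```python
# Rule-of-thumb check of the die-shrink math: each shrink cuts power ~30%,
# i.e. a 0.7x factor per node (an assumption, not a guarantee).

def power_after_shrinks(tdp_watts: float, shrinks: int, factor: float = 0.7) -> float:
    """TDP estimate after applying the shrink factor n times."""
    return tdp_watts * factor ** shrinks

print(round(power_after_shrinks(100, 2)))  # 7970M after two shrinks: 49 W
print(round(power_after_shrinks(75, 2)))   # 7950M after two shrinks: 37 W
```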
 
One HBM stack's TDP is less than 4 watts.

If placed in an 82-watt TDP part (Amethyst XT), it would come out to around 90 watts with HBM, and 100+ watts with HBM + GDDR5 + VRMs.
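The arithmetic behind the ~90 W figure, assuming two stacks (the stack count is my assumption):

```python
# Power-budget sum: an ~82 W GPU part plus HBM stacks at <4 W each.
# The two-stack count is an assumption for this sketch.

GPU_TDP_W = 82
HBM_STACK_W = 4   # upper bound per stack, per the claim above
STACKS = 2        # assumed stack count

total_w = GPU_TDP_W + STACKS * HBM_STACK_W
print(total_w)  # 90
```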
 