WCCF: AMD Carrizo APU on the 28nm Node Will Have Stacked DRAM On Package

On die fabric and off die Hypertransport are very different things... And HT is point to point.
AMD is taking the new Hypertransport and doing it on-die and off-die.

http://www.hypertransport.org/docs/uploads/Why_Torus_Main.pdf

Freedom Fabric Gen 2; Hypertransport 2D or 3D Torus
Scalable Coherent Fabric; cHypertransport 1D or 2D Torus

With the new HyperTransport expansion slots as well as PCIe.

I wonder how much the interconnect has evolved from this:
http://www.overclock3d.net/gfx/articles/2007/05/16132923973l.jpg

With eight, sixteen, or thirty-two 128-bit channels, a ring-based interconnect might be needed.
 
AMD is taking the new Hypertransport and doing it on-die and off-die.

http://www.hypertransport.org/docs/uploads/Why_Torus_Main.pdf

That PDF in no way supports what you just said. 😵 The only mention of "on die" is that the controller can be embedded on die... wow, revolutionary, AMD have only been doing this since the original Opteron! The "torus" is referring to the topology of a HPC cluster- a topology made up of dozens of point-to-point connections. Each node has a handful of point to point connections, connecting it to its neighbours. It is in no way a ringbus. Hypertransport is a standard for communicating off-die with other processors, not an on-die fabric.
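
To make the distinction concrete, here is a minimal sketch (Python; the 4x4 grid size is just an illustrative assumption) of how a 2D torus gives each node its own point-to-point links to its neighbours:

# Toy model: neighbours of a node in a 4x4 2D torus cluster.
# Each node has four dedicated point-to-point links (W/E/N/S),
# wrapping around at the edges. There is no shared ring bus.
def torus_neighbours(x, y, width=4, height=4):
    return [
        ((x - 1) % width, y),   # west
        ((x + 1) % width, y),   # east
        (x, (y - 1) % height),  # north
        (x, (y + 1) % height),  # south
    ]

# Node (0, 0) links directly to (3, 0), (1, 0), (0, 3) and (0, 1).
print(torus_neighbours(0, 0))

Every hop in such a fabric is a discrete link between exactly two nodes, which is the opposite of a shared ring.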
 
Just want to point out that it is most likely HyperTransport with cache coherence, while the interpretation I'm getting is that it is completely different from previous implementations.
HyperTransport uses a single control line to determine when the link is carrying a control packet (the control signal is asserted) or a data packet (the control signal is de-asserted). Deterministic control of packet type is a significant feature of the link because the control signal can be used to insert control packets in the middle of a long data packet. A special HyperTransport Priority Request Interleaving™ feature contributes to the very low latency characteristics of the HyperTransport link by enabling concurrent data streams to be initiated in the middle of a longer data stream.
One of the first changes is that data packets and control packets no longer share the same stream.

Control packets now have their own interconnect and data packets now have their own interconnect.
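
To illustrate what the quoted passage describes for classic HyperTransport, here is a toy model (Python; my own illustration, not the actual HT packet encoding) of how a single CTL line lets a control packet cut into the middle of a long data transfer:

# Toy illustration of Priority Request Interleaving: the CTL line marks
# whether the link currently carries a control packet (CTL=1) or a data
# packet (CTL=0), so a control packet can be injected mid-transfer.
def interleave(data_words, control_packet, inject_at):
    stream = []
    for i, word in enumerate(data_words):
        if i == inject_at:
            stream.append(("CTL=1", control_packet))  # preempting control packet
        stream.append(("CTL=0", word))                # ordinary data word
    return stream

# A read request sneaks into the middle of a six-word data stream.
for entry in interleave(["d0", "d1", "d2", "d3", "d4", "d5"], "RdReq", 3):
    print(entry)

If control and data really get their own interconnects, this kind of in-band interleaving would no longer be necessary.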

From what I can gather from the patent, the HyperTransport expansion slot is for both PCI Express and HyperTransport.
https://www.google.com/patents/US20120258611

PCIE, or DATA for HT => 16 lanes of bidirectional connections
HT sideband, or CONTROL for HT => 4 lanes of bidirectional connections
A PCIE component can plug into the first connector portion and the second connector portion is unused. A component compliant with a HyperTransport link can plug into both portions of the connector. Thus, the same socket may be used to couple components compliant with either type of link, providing flexibility to expand a system architecture implemented by an exemplary printed circuit board assembly.
a printed circuit board assembly (e.g., printed circuit board assembly 600) includes a printed circuit board (e.g., printed circuit board 602) populated with a socket (e.g., socket 604 including an interface, e.g., interface 605) for a processor including an integrated PCIE/HyperTransport interface, memory slots (e.g., dual in-line memory module slots 606), and a flexible expansion slot including a flexible bus (e.g., a bus including conductive traces 612, 614, 616, and 618) and a flexible connector (e.g., connector 400).
In at least one embodiment, circuit 620, which is coupled between conductive trace portions 612(a) and 612(b), is populated with switches, capacitors, resistors, and/or jumpers that are configured to implement AC coupling for a PCIE link or DC coupling for a HyperTransport link. In at least one embodiment of printed circuit board 602, conductive traces 616 and 618 are also included to couple sideband signals of a HyperTransport link between connector 400 and socket 604. Accordingly, a flexible slot of printed circuit board assembly 600 is configured to receive a component consistent with either a PCIE link or a HyperTransport link. That is, the flexible slot of printed circuit board assembly 600 is configured to receive a connector compliant with a communications interface consistent with either of the PCIE or HyperTransport protocols.
A first connector portion (e.g., portion 402) includes contacts to support a PCIE slot (e.g., 16 lanes). A second connector portion (e.g., portion 404) includes additional contacts for additional signals (e.g., four lanes for HyperTransport and sideband signals) required by the HyperTransport slot.
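
One rough way to picture the patent's flexible slot is as a data structure (Python; the field names and the configure() helper are mine, not from the filing):

# Sketch of the two-portion slot from the patent: a PCIe card uses only
# the first portion; an HT card uses both, picking up the four extra
# lanes plus the sideband signals, with the coupling circuit set to DC.
SLOT = {
    "portion_1": {"lanes": 16},                   # PCIE x16, or HT data
    "portion_2": {"lanes": 4, "sideband": True},  # HT only
}

def configure(card_type):
    if card_type == "PCIE":
        return {"lanes": 16, "coupling": "AC", "portion_2": "unused"}
    if card_type == "HT":
        return {"lanes": 16 + 4, "coupling": "DC", "sideband": True}
    raise ValueError(card_type)

print(configure("PCIE"))
print(configure("HT"))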
http://www.linkedin.com/in/jeanchittilappilly
Working on verifying AMD's coherent hyper transport fabric for Radeon & APU products
http://www.linkedin.com/pub/mike-osborn/10/586/797
Definition and documentation of a high density server fabric;
Definition of a Scalable Coherent SOC interconnect architecture for x86 and ARM based designs, including fabric protocol and interface definitions; Architecture, documentation, and implementation of a scalable ring-based transport. Led a design team responsible for uArch and implementation of both the protocol and transport layer of an AMD-ambidextrous SOC interconnect fabric
It scales out from the microarchitecture to full systems architecture.

===
Now, with all this information:

"Mexico*" CPU == *My guess for the next Opteron
20-nm LPM
16 Excavator Cores / ~20 mm² per module is my expectation / 8 × 20 mm² => ~160 mm² (Cores + L2)
(HBM)L3 Interface and (DDR4)IMC Interface / No clue / ~40 mm² (Rough Estimate)
Northbridge + Interconnect + PHY / No clue / ~100 mm² (Rough Estimate)
Total: roughly ~300 mm², give or take
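
A quick sanity check of that arithmetic (Python; every number here is my own rough estimate, not an AMD figure):

# Back-of-the-envelope die-size check for the guessed "Mexico" Opteron.
cores_l2  = 8 * 20    # eight Excavator modules at ~20 mm^2 each (cores + L2)
mem_if    = 40        # HBM (L3) interface + DDR4 IMC interface, rough estimate
nb_fabric = 100       # northbridge + interconnect + PHY, rough estimate
print(cores_l2 + mem_if + nb_fabric)  # ~300 mm^2 total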
 
http://wccftech.com/evidence-amd-apus-featuring-highbandwidth-stacked-memory-surfaces/

"More Evidence Regarding AMD APUs Featuring High-Bandwidth Stacked Memory Surfaces:"


[Slide: AMD Carrizo APU with stacked memory]

[Slide: AMD APU memory technology, two-level stacked memory]


There are more slides at the link referenced... I don't know whether some of them have been published publicly before, though, since they are dated Feb 27, 2014.

I think the big question is when we'll see this in actual AMD products. Is Excavator likely within the timeframe? Or will we have to wait until AMD moves to 20/16/14 nm...
 
[Slide: AMD APU memory technology, two-level stacked memory]

I think the big question is when we'll see this in actual AMD products. Is Excavator likely within the timeframe? Or will we have to wait until AMD moves to 20/16/14 nm...

If you look at what this slide says about stacked memory, it appears that AMD has only finished the evaluation phase for HBM on APUs. If that is true, I wouldn't expect to see HBM on an APU until 14/16FF, unless this slide is deliberately sandbagging, or fake.
 
An interesting little aside- I was reading some old RWT stories today, and found this quote in the Llano analysis piece:

If AMD was particularly aggressive, they might use 3D packaging to attach high bandwidth memory to Trinity to improve graphics performance. One of the last real advantages of a discrete GPU is high bandwidth and dedicated memory. Even as little as 256MB of attached DRAM using WideIO or LP-DDR3 could bring GPU performance to a new level, at a time when programmable graphics will begin hitting its stride. However, there are still a number of thermal challenges to 3D integration, so 2013 or 2014 may be more realistic.

http://www.realworldtech.com/fusion-llano/4/ Funny how things turn out!
 
So stacked DRAM has been researched for quite some time. What's holding it back? Heat issues?

And for it to be fully utilized, doesn't the iGPU performance have to be increased quite a lot?
 
WRT iGPU performance, no. The 7850K has been shown to be severely lacking in memory bandwidth when it comes to iGPU performance. If you're running something like DDR3-1600 with "normal" timings, you can run up your iGPU speed without seeing any appreciable performance increase (while overclocking the RAM does produce positive results). It seems fairly obvious that the iGPU is starved for bandwidth.
 
WRT iGPU performance, no. The 7850K has been shown to be severely lacking in memory bandwidth when it comes to iGPU performance. If you're running something like DDR3-1600 with "normal" timings, you can run up your iGPU speed without seeing any appreciable performance increase (while overclocking the RAM does produce positive results). It seems fairly obvious that the iGPU is starved for bandwidth.

I'm not questioning whether it'll improve iGPU performance. But I wondered what iGPU performance it takes to fully utilize the increased memory bandwidth that comes with stacked DRAM? After all we're talking about 128-256 GBps with a single stacked DRAM compared to 28 GBps for GDDR5 (see OP).
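
For scale, some quick bandwidth arithmetic (Python; the configurations are assumed ones, chosen just to show the size of the gap):

# Peak bandwidth = transfer rate x bus width x channels.
def bw_gb_s(mt_per_s, bus_width_bits, channels=1):
    return mt_per_s * (bus_width_bits / 8) * channels / 1000

print(bw_gb_s(2133, 64, channels=2))  # dual-channel DDR3-2133: ~34 GB/s
print(bw_gb_s(1000, 1024))            # one 1024-bit HBM stack at 1 Gb/s/pin: ~128 GB/s

Even an aggressive DDR3 setup is roughly a quarter of what a single first-generation HBM stack delivers.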
 
That is the problem to me. An APU is always going to be limited relative to a discrete card. If you eliminate the memory bottleneck, the 7850k would still be only equivalent to a HD7750 because of the number of shaders and thermal constraints. So on the desktop, a discrete card is still much more powerful and flexible. In mobile though, something like 7750 performance would be very attractive.
 
That is the problem to me. An APU is always going to be limited relative to a discrete card. If you eliminate the memory bottleneck, the 7850k would still be only equivalent to a HD7750 because of the number of shaders and thermal constraints. So on the desktop, a discrete card is still much more powerful and flexible. In mobile though, something like 7750 performance would be very attractive.

With HBM the memory bandwidth bottleneck is removed, and the APU will obsolete the smallest discrete GPU chip of any generation. In the last generation that means the HD 7750 GDDR5 version. Going forward, at the future nodes (GF 20LPM / Samsung 14LPE), the APU can easily sport 1024 GCN 2.0 cores.

Since a notebook GPU product stack normally has three different chips, e.g. GK107 (entry), GK106 (mid-level), and GK104 (high end), the APU takes out the entry level, where the sales volume is the highest. In the long term the dGPU market, especially for notebooks, will be a much smaller market.
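
A rough shader-throughput comparison to back that up (Python; the clocks are published figures, and 2 FLOPs per shader per clock assumes FMA):

# Peak single-precision throughput = shaders x 2 FLOPs x clock.
def tflops(shaders, clock_ghz):
    return shaders * 2 * clock_ghz / 1000

print(tflops(512, 0.72))   # A10-7850K iGPU (512 shaders @ 720 MHz): ~0.74 TFLOPS
print(tflops(1024, 0.8))   # hypothetical 1024-core GCN APU @ 800 MHz: ~1.6 TFLOPS

With the bandwidth bottleneck gone, a 1024-core part would sit comfortably above HD 7750 territory on paper.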
 
Cost and yield issues. Installing TSVs correctly is still hard.

This.

A lot of this stuff is years and years away from being commercially viable for any product which is targeting the consumer market (i.e. where cost trumps performance).
 
A classic example we can relate to today is this one from 2005:
[Image: CPU, MCH, and VRM combined on one package]


All three of these are combined today, but it took eight years from the first demonstration.
 
This.

A lot of this stuff is years and years away from being commercially viable for any product which is targeting the consumer market (i.e. where cost trumps performance).
Indeed. Even NVIDIA has said outright that the amount of VRAM Pascal will ship with comes down to cost. A not-so-subtle hint that they expect large memory configurations to be expensive, which in turn implies it's going to be used for pro-grade products first and foremost.
 
According to Intel, there is no use in going beyond 32 MB eDRAM size. So shouldn't that apply to stacked DRAM too, i.e. it can be kept small and hence not that expensive?
 
According to Intel, there is no use in going beyond 32 MB eDRAM size. So shouldn't that apply to stacked DRAM too, i.e. it can be kept small and hence not that expensive?
HBM gets its bandwidth from its large number of dies, not unlike NAND. If you're going to use only a small amount of it, like a cache, you're better off with eDRAM or SRAM as Intel has done.

NVIDIA, for their part, intends to use it as primary memory, not a cache. The Pascal test vehicle has no other RAM besides the stacked DRAM.
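
To put numbers on the first point (Python; these are assumed first-generation-HBM-class figures), the bandwidth comes from running many modest-speed channels in parallel across the stack:

# HBM aggregate bandwidth: many wide channels at a modest per-pin rate.
channels_per_stack = 8     # assumed: eight independent 128-bit channels
channel_width_bits = 128
pin_rate_gb_s      = 1.0   # assumed: ~1 Gb/s per pin
stack_bw = channels_per_stack * channel_width_bits * pin_rate_gb_s / 8
print(stack_bw)  # ~128 GB/s per stack

A small 32 MB cache wouldn't exploit an interface that wide, which is why eDRAM or SRAM makes more sense at cache-like capacities.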
 
HBM gets its bandwidth from its large number of dies, not unlike NAND. If you're going to use only a small amount of it, like a cache, you're better off with eDRAM or SRAM as Intel has done.

NVIDIA, for their part, intends to use it as primary memory, not a cache. The Pascal test vehicle has no other RAM besides the stacked DRAM.

In that case, isn't it totally unrealistic that a mid-range APU like Carrizo would get expensive HBM / Stacked-DRAM as suggested by the article in the OP?
 
Quite possibly, but the APUs do badly need it - they simply don't really make sense without it - so you can bet they'd love to make one using it.

Even if it's a specialist/pricey high-end part to show what is going to be possible going forward.
 
Quite possibly, but the APUs do badly need it - they simply don't really make sense without it - so you can bet they'd love to make one using it.

Even if it's a specialist/pricey high-end part to show what is going to be possible going forward.

The APU needs it only for gaming; for everything else, including GPGPU, DDR3 is just fine.
 
The APU needs it only for gaming; for everything else, including GPGPU, DDR3 is just fine.

Has anyone run benches proving (or disproving) this allegation? Something as simple as a LibreOffice spreadsheet bench at different memory settings (higher DDR3 speeds, same timings) would do the trick. Especially on the 7850K which is the APU most in need of extra memory bandwidth right now.
 
Has anyone run benches proving (or disproving) this allegation? Something as simple as a LibreOffice spreadsheet bench at different memory settings (higher DDR3 speeds, same timings) would do the trick. Especially on the 7850K which is the APU most in need of extra memory bandwidth right now.

Luxmark 2.0 GPU score is identical between the A10-7850K and the HD 7750 GDDR5.
I'm sure most GPGPU applications will exhibit the same behavior. Computation at that level of hardware is not that memory-bandwidth dependent, and DDR3 can cope so far.
Games, on the other hand, become very memory dependent because of the visuals, especially at higher resolutions.


http://www.anandtech.com/show/7677/amd-kaveri-review-a8-7600-a10-7850k/14
[Benchmark chart from the AnandTech Kaveri review]


http://www.tomshardware.com/reviews/radeon-hd-7770-7750-benchmark,3135-13.html
[Luxmark chart from the Tom's Hardware HD 7770/7750 review]
 