WCCF: AMD Carrizo APU on the 28nm Node Will Have Stacked DRAM On Package

On die fabric and off die Hypertransport are very different things... And HT is point to point.
AMD is taking the new Hypertransport and doing it on-die and off-die.

http://www.hypertransport.org/docs/uploads/Why_Torus_Main.pdf

Freedom Fabric Gen 2; Hypertransport 2D or 3D Torus
Scalable Coherent Fabric; cHypertransport 1D or 2D Torus

With the new HyperTransport expansion slots as well as PCIe.

I wonder how much the interconnect has evolved from this:
http://www.overclock3d.net/gfx/articles/2007/05/16132923973l.jpg

With eight, sixteen, or thirty-two 128-bit channels, a ring-based interconnect might be needed.
 
AMD is taking the new Hypertransport and doing it on-die and off-die.

http://www.hypertransport.org/docs/uploads/Why_Torus_Main.pdf

That PDF in no way supports what you just said. 😵 The only mention of "on die" is that the controller can be embedded on die... wow, revolutionary, AMD have only been doing this since the original Opteron! The "torus" is referring to the topology of a HPC cluster- a topology made up of dozens of point-to-point connections. Each node has a handful of point to point connections, connecting it to its neighbours. It is in no way a ringbus. Hypertransport is a standard for communicating off-die with other processors, not an on-die fabric.
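
To make the distinction concrete, here is a minimal sketch (Python; the 4x4 grid size is just an illustrative assumption) of how a 2D torus gives each node its own point-to-point links to its neighbours:

# Toy model: neighbours of a node in a 4x4 2D torus cluster.
# Each node has four dedicated point-to-point links (W/E/N/S),
# wrapping around at the edges. There is no shared ring bus.
def torus_neighbours(x, y, width=4, height=4):
    return [
        ((x - 1) % width, y),   # west
        ((x + 1) % width, y),   # east
        (x, (y - 1) % height),  # north
        (x, (y + 1) % height),  # south
    ]

# Node (0, 0) links directly to (3, 0), (1, 0), (0, 3) and (0, 1).
print(torus_neighbours(0, 0))

Every hop in such a fabric is a discrete link between exactly two nodes, which is the opposite of a shared ring.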
 
Just want to point out that it is most likely HyperTransport with cache coherence, while the interpretation I'm getting is that it is completely different from previous implementations.
HyperTransport uses a single control line to determine when the link is carrying a control packet (the control signal is asserted) or a data packet (the control signal is de-asserted). Deterministic control of packet type is a significant feature of the link because the control signal can be used to insert control packets in the middle of a long data packet. A special HyperTransport Priority Request Interleaving™ feature contributes to the very low latency characteristics of the HyperTransport link by enabling concurrent data streams to be initiated in the middle of a longer data stream.
One of the first changes is that data packets and control packets no longer share the same stream.

Control packets now have their own interconnect and data packets now have their own interconnect.
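
To illustrate what the quoted passage describes for classic HyperTransport, here is a toy model (Python; my own illustration, not the actual HT packet encoding) of how a single CTL line lets a control packet cut into the middle of a long data transfer:

# Toy illustration of Priority Request Interleaving: the CTL line marks
# whether the link currently carries a control packet (CTL=1) or a data
# packet (CTL=0), so a control packet can be injected mid-transfer.
def interleave(data_words, control_packet, inject_at):
    stream = []
    for i, word in enumerate(data_words):
        if i == inject_at:
            stream.append(("CTL=1", control_packet))  # preempting control packet
        stream.append(("CTL=0", word))                # ordinary data word
    return stream

# A read request sneaks into the middle of a six-word data stream.
for entry in interleave(["d0", "d1", "d2", "d3", "d4", "d5"], "RdReq", 3):
    print(entry)

If control and data really get their own interconnects, this kind of in-band interleaving would no longer be necessary.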

From what I can gather from the patent, the HyperTransport expansion slot is for both PCI Express and HyperTransport.
https://www.google.com/patents/US20120258611

PCIE, or DATA for HT => 16 lanes of bidirectional connections
HT sideband, or CONTROL for HT => 4 lanes of bidirectional connections
A PCIE component can plug into the first connector portion and the second connector portion is unused. A component compliant with a HyperTransport link can plug into both portions of the connector. Thus, the same socket may be used to couple components compliant with either type of link, providing flexibility to expand a system architecture implemented by an exemplary printed circuit board assembly.
a printed circuit board assembly (e.g., printed circuit board assembly 600) includes a printed circuit board (e.g., printed circuit board 602) populated with a socket (e.g., socket 604 including an interface, e.g., interface 605) for a processor including an integrated PCIE/HyperTransport interface, memory slots (e.g., dual in-line memory module slots 606), and a flexible expansion slot including a flexible bus (e.g., a bus including conductive traces 612, 614, 616, and 618) and a flexible connector (e.g., connector 400).
In at least one embodiment, circuit 620, which is coupled between conductive trace portions 612(a) and 612(b), is populated with switches, capacitors, resistors, and/or jumpers that are configured to implement AC coupling for a PCIE link or DC coupling for a HyperTransport link. In at least one embodiment of printed circuit board 602, conductive traces 616 and 618 are also included to couple sideband signals of a HyperTransport link between connector 400 and socket 604. Accordingly, a flexible slot of printed circuit board assembly 600 is configured to receive a component consistent with either a PCIE link or a HyperTransport link. That is, the flexible slot of printed circuit board assembly 600 is configured to receive a connector compliant with a communications interface consistent with either of the PCIE or HyperTransport protocols.
A first connector portion (e.g., portion 402) includes contacts to support a PCIE slot (e.g., 16 lanes). A second connector portion (e.g., portion 404) includes additional contacts for additional signals (e.g., four lanes for HyperTransport and sideband signals) required by the HyperTransport slot.
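
One rough way to picture the patent's flexible slot is as a data structure (Python; the field names and the configure() helper are mine, not from the filing):

# Sketch of the two-portion slot from the patent: a PCIe card uses only
# the first portion; an HT card uses both, picking up the four extra
# lanes plus the sideband signals, with the coupling circuit set to DC.
SLOT = {
    "portion_1": {"lanes": 16},                   # PCIE x16, or HT data
    "portion_2": {"lanes": 4, "sideband": True},  # HT only
}

def configure(card_type):
    if card_type == "PCIE":
        return {"lanes": 16, "coupling": "AC", "portion_2": "unused"}
    if card_type == "HT":
        return {"lanes": 16 + 4, "coupling": "DC", "sideband": True}
    raise ValueError(card_type)

print(configure("PCIE"))
print(configure("HT"))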
http://www.linkedin.com/in/jeanchittilappilly
Working on verifying AMD's coherent hyper transport fabric for Radeon & APU products
http://www.linkedin.com/pub/mike-osborn/10/586/797
Definition and documentation of a high density server fabric;
Definition of a Scalable Coherent SOC interconnect architecture for x86 and ARM based designs, including fabric protocol and interface definitions; Architecture, documentation, and implementation of a scalable ring-based transport. Led a design team responsible for uArch and implementation of both the protocol and transport layer of an AMD-ambidextrous SOC interconnect fabric
It scales out from the microarchitecture to full systems architecture.

===
Now, with all this information:

"Mexico*" CPU == *My guess for the next Opteron
20-nm LPM
16 Excavator Cores / ~20 mm² per module is my expectation / 8 × 20 mm² => ~160 mm² (Cores + L2)
(HBM)L3 Interface and (DDR4)IMC Interface / No clue / ~40 mm² (Rough Estimate)
Northbridge + Interconnect + PHY / No clue / ~100 mm² (Rough Estimate)
Total: roughly ~300 mm², give or take
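
A quick sanity check of that arithmetic (Python; every number here is my own rough estimate, not an AMD figure):

# Back-of-the-envelope die-size check for the guessed "Mexico" Opteron.
cores_l2  = 8 * 20    # eight Excavator modules at ~20 mm^2 each (cores + L2)
mem_if    = 40        # HBM (L3) interface + DDR4 IMC interface, rough estimate
nb_fabric = 100       # northbridge + interconnect + PHY, rough estimate
print(cores_l2 + mem_if + nb_fabric)  # ~300 mm^2 total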
 
http://wccftech.com/evidence-amd-apus-featuring-highbandwidth-stacked-memory-surfaces/

"More Evidence Regarding AMD APUs Featuring High-Bandwidth Stacked Memory Surfaces:"


[Slide: AMD Carrizo APU with stacked memory]

[Slide: AMD APU memory technology, two-level stacked memory]


There are more slides at the link referenced... I don't know whether some of them have been published publicly before, though, since they are dated Feb 27, 2014.

I think the big question is when we'll see this in actual AMD products. Is Excavator likely within the timeframe? Or will we have to wait until AMD moves to 20/16/14 nm...
 
[Slide: AMD APU memory technology, two-level stacked memory]

I think the big question is when we'll see this in actual AMD products. Is Excavator likely within the timeframe? Or will we have to wait until AMD moves to 20/16/14 nm...

If you look at what this slide says about stacked memory, it appears that AMD has only finished the evaluation phase for HBM on APUs. If that is true, I wouldn't expect to see HBM on an APU until 14/16FF, unless this slide is deliberately sandbagging, or fake.
 
An interesting little aside- I was reading some old RWT stories today, and found this quote in the Llano analysis piece:

If AMD was particularly aggressive, they might use 3D packaging to attach high bandwidth memory to Trinity to improve graphics performance. One of the last real advantages of a discrete GPU is high bandwidth and dedicated memory. Even as little as 256MB of attached DRAM using WideIO or LP-DDR3 could bring GPU performance to a new level, at a time when programmable graphics will begin hitting its stride. However, there are still a number of thermal challenges to 3D integration, so 2013 or 2014 may be more realistic.

http://www.realworldtech.com/fusion-llano/4/ Funny how things turn out!
 
So stacked DRAM has been researched for quite some time. What's holding it back? Heat issues?

And for it to be fully utilized, doesn't the iGPU performance have to be increased quite a lot?
 
WRT iGPU performance, no. The 7850K has been shown to be severely lacking in memory bandwidth when it comes to iGPU performance. If you're running something like DDR3-1600 with "normal" timings, you can run up your iGPU speed without seeing any appreciable performance increase (while overclocking the RAM does produce positive results). It seems fairly obvious that the iGPU is starved for bandwidth.
 
WRT iGPU performance, no. The 7850K has been shown to be severely lacking in memory bandwidth when it comes to iGPU performance. If you're running something like DDR3-1600 with "normal" timings, you can run up your iGPU speed without seeing any appreciable performance increase (while overclocking the RAM does produce positive results). It seems fairly obvious that the iGPU is starved for bandwidth.

I'm not questioning whether it'll improve iGPU performance. But I wondered what iGPU performance it takes to fully utilize the increased memory bandwidth that comes with stacked DRAM? After all we're talking about 128-256 GBps with a single stacked DRAM compared to 28 GBps for GDDR5 (see OP).
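
For scale, some quick bandwidth arithmetic (Python; the configurations are assumed ones, chosen just to show the size of the gap):

# Peak bandwidth = transfer rate x bus width x channels.
def bw_gb_s(mt_per_s, bus_width_bits, channels=1):
    return mt_per_s * (bus_width_bits / 8) * channels / 1000

print(bw_gb_s(2133, 64, channels=2))  # dual-channel DDR3-2133: ~34 GB/s
print(bw_gb_s(1000, 1024))            # one 1024-bit HBM stack at 1 Gb/s/pin: ~128 GB/s

Even an aggressive DDR3 setup is roughly a quarter of what a single first-generation HBM stack delivers.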
 
That is the problem to me. An APU is always going to be limited relative to a discrete card. If you eliminate the memory bottleneck, the 7850k would still be only equivalent to a HD7750 because of the number of shaders and thermal constraints. So on the desktop, a discrete card is still much more powerful and flexible. In mobile though, something like 7750 performance would be very attractive.
 
That is the problem to me. An APU is always going to be limited relative to a discrete card. If you eliminate the memory bottleneck, the 7850k would still be only equivalent to a HD7750 because of the number of shaders and thermal constraints. So on the desktop, a discrete card is still much more powerful and flexible. In mobile though, something like 7750 performance would be very attractive.

With HBM the memory bandwidth bottleneck is removed, and the APU will obsolete the smallest discrete GPU chip of any generation. In the last generation that means the HD 7750 GDDR5 version. Going forward, at the future nodes (GF 20LPM / Samsung 14LPE), the APU can easily sport 1024 GCN 2.0 cores.

Since a notebook GPU product stack normally has three different chips, e.g. GK107 (entry), GK106 (mid-level), and GK104 (high end), the APU takes out the entry level, where the sales volume is the highest. In the long term the dGPU market, especially for notebooks, will be a much smaller market.
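
A rough shader-throughput comparison to back that up (Python; the clocks are published figures, and 2 FLOPs per shader per clock assumes FMA):

# Peak single-precision throughput = shaders x 2 FLOPs x clock.
def tflops(shaders, clock_ghz):
    return shaders * 2 * clock_ghz / 1000

print(tflops(512, 0.72))   # A10-7850K iGPU (512 shaders @ 720 MHz): ~0.74 TFLOPS
print(tflops(1024, 0.8))   # hypothetical 1024-core GCN APU @ 800 MHz: ~1.6 TFLOPS

With the bandwidth bottleneck gone, a 1024-core part would sit comfortably above HD 7750 territory on paper.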
 
Cost and yield issues. Installing TSVs correctly is still hard.

This.

A lot of this stuff is years and years away from being commercially viable for any product which is targeting the consumer market (i.e. where cost trumps performance).
 
A classic example we can relate to today is this one from 2005:
[Image: CPU, MCH, and VRM combined on one package]


All three of these are combined today, but it took eight years from the first demonstration.
 
This.

A lot of this stuff is years and years away from being commercially viable for any product which is targeting the consumer market (i.e. where cost trumps performance).
Indeed. Even NVIDIA has said outright that the amount of VRAM Pascal will ship with comes down to cost. A not-so-subtle hint that they expect large memory configurations to be expensive, which in turn implies it's going to be used for pro-grade products first and foremost.
 
According to Intel, there is no use in going beyond 32 MB eDRAM size. So shouldn't that apply to stacked DRAM too, i.e. it can be kept small and hence not that expensive?
 
According to Intel, there is no use in going beyond 32 MB eDRAM size. So shouldn't that apply to stacked DRAM too, i.e. it can be kept small and hence not that expensive?
HBM gets its bandwidth from its large number of dies, not unlike NAND. If you're going to use only a small amount of it, like a cache, you're better off with eDRAM or SRAM as Intel has done.

NVIDIA, for their part, intends to use it as primary memory, not a cache. The Pascal test vehicle has no other RAM besides the stacked DRAM.
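
To put numbers on the first point (Python; these are assumed first-generation-HBM-class figures), the bandwidth comes from running many modest-speed channels in parallel across the stack:

# HBM aggregate bandwidth: many wide channels at a modest per-pin rate.
channels_per_stack = 8     # assumed: eight independent 128-bit channels
channel_width_bits = 128
pin_rate_gb_s      = 1.0   # assumed: ~1 Gb/s per pin
stack_bw = channels_per_stack * channel_width_bits * pin_rate_gb_s / 8
print(stack_bw)  # ~128 GB/s per stack

A small 32 MB cache wouldn't exploit an interface that wide, which is why eDRAM or SRAM makes more sense at cache-like capacities.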
 
HBM gets its bandwidth from its large number of dies, not unlike NAND. If you're going to use only a small amount of it, like a cache, you're better off with eDRAM or SRAM as Intel has done.

NVIDIA, for their part, intends to use it as primary memory, not a cache. The Pascal test vehicle has no other RAM besides the stacked DRAM.

In that case, isn't it totally unrealistic that a mid-range APU like Carrizo would get expensive HBM / Stacked-DRAM as suggested by the article in the OP?
 
Quite possibly, but the APUs do badly need it - they simply don't really make sense without it - so you can bet they'd love to make one using it.

Even if it's a specialist/pricey high-end part to show what is going to be possible going forward.
 
Quite possibly, but the APUs do badly need it - they simply don't really make sense without it - so you can bet they'd love to make one using it.

Even if it's a specialist/pricey high-end part to show what is going to be possible going forward.

The APU needs it only for gaming; for everything else, including GPGPU, DDR3 is just fine.
 
The APU needs it only for gaming; for everything else, including GPGPU, DDR3 is just fine.

Has anyone run benches proving (or disproving) this allegation? Something as simple as a LibreOffice spreadsheet bench at different memory settings (higher DDR3 speeds, same timings) would do the trick. Especially on the 7850K which is the APU most in need of extra memory bandwidth right now.
 
Has anyone run benches proving (or disproving) this allegation? Something as simple as a LibreOffice spreadsheet bench at different memory settings (higher DDR3 speeds, same timings) would do the trick. Especially on the 7850K which is the APU most in need of extra memory bandwidth right now.

Luxmark 2.0 GPU score is identical between the A10-7850K and the HD 7750 GDDR5.
I'm sure most GPGPU applications will exhibit the same behavior. Computation at that level of hardware is not that memory-bandwidth dependent, and DDR3 can cope so far.
Games, on the other hand, become very memory dependent because of the visuals, especially at higher resolutions.


http://www.anandtech.com/show/7677/amd-kaveri-review-a8-7600-a10-7850k/14
[Benchmark chart from the AnandTech Kaveri review]


http://www.tomshardware.com/reviews/radeon-hd-7770-7750-benchmark,3135-13.html
[Luxmark chart from the Tom's Hardware HD 7770/7750 review]
 