Speculation: AMD's APU with HBM

William Gaatjes · Jul 8, 2018

Vattila said:
In this speculation thread, I have used HBM as a generic term for High Bandwidth Memory, with no regard to the specific version, or generation, of the particular HBM specification.

In this discussion, since we are speculating about the future, I think we need to see the APU chip as a black box, and accept any form of integration within that black box, including MCM solutions.

MCM solutions based on new packaging technology and "chiplet" design is an inevitable part of the roadmap of the industry, it seems, as monolithic die design hits its limits.

Well, that's merely what Intel did.

As a programmer by profession, I view HSA as the overarching vision for AMD's Fusion design philosophy — in particular, the simplification of the programming model and memory model of the system, with seamless and coherent memory access, reduced latency, pre-emptive task switching, and a standardised API and hardware-agnostic ISA for the heterogeneous compute units in the system.

As a programmer I face two kinds of algorithms; sequential and parallel. Some of the latter runs very well on a GPU, but to this day, I have never taken advantage of that, even though my 20+ years old software (road design) may be very well suited for that and see massive speed and efficiency gains.

Why? It is not easy. There are some tools (such as C++ AMP) that are promising, and the C++ language standard committee is now doing great work to enable parallelisation seamlessly. But there is still a long way to go before running computations on a GPU is as simple as running them on the CPU.

For example, a game should seamlessly be able to use all hardware acceleration available on a system. Even if a discrete GPU card is present and runs the game's main graphics load, the game should take advantage of the gigaflops of parallel compute performance available in todays integrated GPUs. Just disabling such a powerful coprocessor is a sad waste. In the future, maybe a game will use the integrated GPU to run AI, while using the discrete GPU for the graphics load.

To bring my digression back on topic; my interest in APUs are not for graphics, and not to replace more powerful discrete cards for gamers and professional users that need them. I am interested in technology, and I'd like to see HSA become reality, so that programmers can simply and seamlessly make use of any acceleration available in their user's systems. A big part of that is to remove bottlenecks in memory latency and bandwidth — which is what this discussion is about.

You know what makes me wonder for about 2 years now when thinking about apu and hbm memory.
HBM2 for consoles and laptops.
Let say that there is 8GB or even 16GB of HBM2.
Since the hbm channel width is made up of sections of 128 bits and totals 1024 bit( 8 channels of 128 bit). It would be possible to divide the hbm memory into a pool only for the cpu and a pool only for the gpu but both pools controlled by the same memory controller. One 128 bit channel set up as 2 64 bit pseudo channels for the cpu and 7 128 bit channels for the gpu.
Or 2 channels for the cpu and 6 for the gpu.
This would allow maximum usage and since it works in parallel, the cpu and gpu would not have to wait for each other. And since it is all connected to the same memory controller,
it is still possible to do zero copy latency reducing tricks to reduce memory traffic. The memory is shared virtually for the system but in hardware it runs parallel.
That is, if this arrangement is possible at all. But if it is possible, then the need for extra system ram would no longer be there.
This would allow for flexible management of the memory and a massive increase in performance and reduction of the bill of material because system memory is no longer needed.
That would be a bonus for APUs.

IntelUser2000 · Jul 8, 2018

You are merely generalizing. We can adopt such a view to everything. And take it to its logical extreme.

Likely this line of thought and futurists' being consistently wrong about their predictions are related. Just assume everything will progress in a linear line! Mostly it exposes lack of full understanding and the details it took to get to such a point.

This would allow for flexible management of the memory and a massive increase in performance and reduction of the bill of material because system memory is no longer needed.

BOM reduction isn't guaranteed. It would take many years for the cost to come down to a point comparable to current parts without HBM but needing system memory. The volumes and optimizations favor the DRAM side heavily. Micron has a slide that said its due to the sheer inertia that allowed DRAM to remain as the dominant system memory. It has the volume and investment(not just $, but people working on it) that continually broke through previously perceived barriers.

In the near future, it'll be still sold as a premium product as the performance is higher. DRAM is almost a commodity product. Competition is cutthroat, and they'll do anything to stand out. Seeing RGB LEDs everywhere is driven by similar reasons. Makes the product flashy, catches our attention, which drives sales.

Vattila · Jul 8, 2018

William Gaatjes said:
But if it is possible, then the need for extra system ram would no longer be there.

Without going into technical details, I think the thought of eliminating system DRAM altogether is an alluring one. As @IntelUser2000 points out, it cannot happen across the industry and all segments, but it may be feasible in some niches — maybe an Apple product will be first out of the gate.

whm1974 · Jul 8, 2018

Vattila said:
Without going into technical details, I think the thought of eliminating system DRAM altogether is an alluring one. As @IntelUser2000 points out, it cannot happen across the industry and all segments, but it may be feasible in some niches — maybe an Apple product will be first out of the gate.

And how would go about eliminating system DRAM altogether and still provide a usable amount of memory? Or that matter allow for memory upgrades later without replacing the entire unit?

Vattila · Jul 8, 2018

whm1974 said:
And how would go about eliminating system DRAM altogether and still provide a usable amount of memory? Or that matter allow for memory upgrades later without replacing the entire unit?

Not feasible for your server or your gaming PC, but perhaps for your 2-in-1 notebook.

William Gaatjes · Jul 8, 2018

IntelUser2000 said:
You are merely generalizing. We can adopt such a view to everything. And take it to its logical extreme.

Likely this line of thought and futurists' being consistently wrong about their predictions are related. Just assume everything will progress in a linear line! Mostly it exposes lack of full understanding and the details it took to get to such a point.

BOM reduction isn't guaranteed. It would take many years for the cost to come down to a point comparable to current parts without HBM but needing system memory. The volumes and optimizations favor the DRAM side heavily. Micron has a slide that said its due to the sheer inertia that allowed DRAM to remain as the dominant system memory. It has the volume and investment(not just $, but people working on it) that continually broke through previously perceived barriers.

In the near future, it'll be still sold as a premium product as the performance is higher. DRAM is almost a commodity product. Competition is cutthroat, and they'll do anything to stand out. Seeing RGB LEDs everywhere is driven by similar reasons. Makes the product flashy, catches our attention, which drives sales.

When it comes to HBM, the same thing is said about for example 3d xpoint memory.
Because it is new and competing with an existing technology, everybody laughs it away even though it has serious advantages. The reason dram got big because everybody is making it. Granted there are some patents that rambus for example owns but still. The issue here is how many manufacturers there are.
There are many players in the ddrx dram market. And i admit that ddrx dram is not going enywhere soon, but we see more and more that when a product is developed that stands out and has good support that it will happen. HBM may not replace all memory of course, but a powerful system based on an APU with HBM2 is really not impossible.

Vattila said:
Without going into technical details, I think the thought of eliminating system DRAM altogether is an alluring one. As @IntelUser2000 points out, it cannot happen across the industry and all segments, but it may be feasible in some niches — maybe an Apple product will be first out of the gate.

Maybe Apple, maybe a game console.
Or perhaps a deep learning player.
Remember how the memory controller of vega and the vega workstation cards function...
A ddr memory bus is no longer needed to have massive amounts of memory available.

CatMerc · Jul 8, 2018

An organic substrate can't do HBM.

CatMerc · Jul 8, 2018

PeterScott said:
IIRC off die IF uses re-purposed PCIe lanes, so it is just PCIe running a different protocol.

IF isn't any specific bus, it's a control and communication protocol. It can run off of any bus, including but not limited to PCI-E.

richaron said:
Both of these assume IF is much faster than PCIe, but I haven't seen much info comparing them in reality. I know IF is scalable, but so is PCIe up to a certain point.

I think of IF as an AMD in-house and more versatile version of PCIe, which can work over the exact same pins as PCIe. For example I understand my Ryzen + Vega system is technically capable of using IF to connect the two over the PCIe x 16 hardware. IIRC the IF connection would be slightly faster than PCIe v3, but it wouldn't be heaps faster simply because it's IF vs PCIe like you're implying. So is the internal IF in an APU wider or clocked higher? And will this always be the case?

Ditto. It's not faster or slower than PCI-E because that's comparing apples and oranges. IF can run over PCI-E, it can run over custom busses like GMI that connects different dies on EPYC/Threadripper. It can run over xGMI that's supposed to come with Vega 20 (NVLink equivalent?).

PeterScott · Jul 8, 2018

Since I don't think there will be a "Super" APU, I am interested in where the mainstream APU is going.

I think that depends a lot AMDs next socket update. AM5?

Ultimately the GPU in an APU is limited by memory bandwidth, and the question is how AM5 will improve bandwidth and by how much?

Could AMD move to triple channel memory in the mainstream for AM5?

Could the package be enlarged to allow for optional HBM chips under the heatspreader with the APU? I wonder if this is even worth doing for anything like a mainstream APU. Ultimately I don't think it is.

So mainstream socket based APU is likely to see small evolutionary performance tweaks on the GPU side, though they could increase the CPU core count quite easily.

So that is my APU expectation. No "Super" APU, at least not on the GPU side where I think most people are looking for a breakthrough and thus no real need/point for HBM. Core count increase most likely. Either from dual CCX or CCX core count expansion.

Significant GPU improvement will continue to rely on discrete GPU parts, which I don't consider an APU, and that can be packaged a number of ways, but it changes little functionally or economically.

whm1974 · Jul 8, 2018

PeterScott said:
Significant GPU improvement will continue to rely on discrete GPU parts, which I don't consider an APU, and that can be packaged a number of ways, but it changes little functionally or economically.

Yeah if you want or need graphics performance then you will need a discrete GPU with it's own memory to get it. There is simply no way around that.

Vattila · Jul 8, 2018

PeterScott said:
Ultimately the GPU in an APU is limited by memory bandwidth, and the question is how AM5 will improve bandwidth and by how much?

AM5 will support DDR5, I guess, doubling bandwidth, assuming dual-channel. Regarding size of the socket, to allow for HBM and MCM solutions, I presume they will ensure the socket is big enough to support their plans for a couple of years, assuming they continue to be true to their platform stability philosophy.

PeterScott · Jul 8, 2018

Vattila said:
AM5 will support DDR5, I guess, doubling bandwidth, assuming dual-channel. Regarding size of the socket, to allow for HBM and MCM solutions, I presume they will ensure the socket is big enough to support their plans for a couple of years, assuming they continue to be true to their platform stability philosophy.

Not really doubling DDR4.
https://www.anandtech.com/show/12710/cadence-micron-demo-ddr5-subsystem

It's coming at DDR5-4400 intially. Most people are already running something faster than DDR4-2200.

Like most memory standards, the peak of the old standard is and the base of the new standard are often the same.

You can already buy DDR4-4400.

Vattila · Jul 8, 2018

PeterScott said:
It's coming at DDR5-4400 intially.

Same article says "DDR5-6400 Eventually", so hopefully, it has ample headroom.

beginner99 · Jul 9, 2018

Something else:

Meaning there won't be any such APU simply because it's expensive to make (especially masking costs) and will be a niche product so that the sales can't justify the development costs. There is simply little need for such a product. If the iGPU is not good enough you simply choose one with a dGPU. Putting such a huge GPU on the same die has little to no advantage in fact it could even pose a problem for cooling.

Simply a CPU + dGPU is more cost effective.

NTMBK · Jul 9, 2018

beginner99 said:
Something else:

Meaning there won't be any such APU simply because it's expensive to make (especially masking costs) and will be a niche product so that the sales can't justify the development costs. There is simply little need for such a product. If the iGPU is not good enough you simply choose one with a dGPU. Putting such a huge GPU on the same die has little to no advantage in fact it could even pose a problem for cooling.

Simply a CPU + dGPU is more cost effective.

At 7nm, a "huge" GPU won't take up that much space. The die area of Raven Ridge's 11CU GPU would probably be enough for about a 20CU 7nm GPU, at a rough guess. How are you going to keep that fed? DDR4/5 speeds aren't increasing quickly enough.

NTMBK · Jul 9, 2018

PeterScott said:
Not really doubling DDR4.
https://www.anandtech.com/show/12710/cadence-micron-demo-ddr5-subsystem

It's coming at DDR5-4400 intially. Most people are already running something faster than DDR4-2200.

Like most memory standards, the peak of the old standard is and the base of the new standard are often the same.

You can already buy DDR4-4400.

How many laptops do you see running massive memory controller overclocks and non-JEDEC hot clocked RAM? Zero.

CatMerc · Jul 9, 2018

NTMBK said:
At 7nm, a "huge" GPU won't take up that much space. The die area of Raven Ridge's 11CU GPU would probably be enough for about a 20CU 7nm GPU, at a rough guess. How are you going to keep that fed? DDR4/5 speeds aren't increasing quickly enough.

Well, considering DDR5 starts at 4400MT/s (equivalent to DDR4 2133), and has improvements to increase the real world bandwidth, I think it would do fine for a 20CU APU lol

AtenRa · Jul 9, 2018

beginner99 said:
Something else:

Meaning there won't be any such APU simply because it's expensive to make (especially masking costs) and will be a niche product so that the sales can't justify the development costs. There is simply little need for such a product. If the iGPU is not good enough you simply choose one with a dGPU. Putting such a huge GPU on the same die has little to no advantage in fact it could even pose a problem for cooling.

Simply a CPU + dGPU is more cost effective.

I will have to disagree,

1. TAM for such a product is high in mobile segment, just see how many GTX1050/1060 equipped laptops are sold each quarter. It will also allow for the creation of OEM SFF Desktop gaming PCs.
2. Creating double the dies (one for the CPU and a second one for the dGPU) will increase the cost of R&D. Double the masks vs a single mask for the APU.
3. Having a CPU + dGPU will increase the BOM. More expensive motherboards because of GDDR5 vs HBM.
4. Integrating everything in a SoC design will decrease space allowing for a smaller, thinner Laptop design. Further decreasing BOM.
5. An SoC design (CPU + iGPU + HBM ) will be more efficient than a CPU + dGPU with GDDR-5, commanding a higher ASP.

Personally I believe that currently AMD is not concerned about the R&D cost and the TAM for such a product but the Manufacturing volumes. They seam to be 14nm and 12nm volume constrained as of today and creating a big 300-350mm die will simple add more problems to the 14nm dGPUs and 12nm CPUs volums.

At 7nm they could make a 6Core APU with 20-24 CU iGPU with 4GB HBM memory. That die will be smaller than today's 4C + 11 CU APU with only the 4GB HBM for added cost but would command extremely higher ASP than todays Ryzen 2700U.

PeterScott · Jul 9, 2018

AtenRa said:
I will have to disagree,

1. TAM for such a product is high in mobile segment, just see how many GTX1050/1060 equipped laptops are sold each quarter. It will also allow for the creation of OEM SFF Desktop gaming PCs.
2. Creating double the dies (one for the CPU and a second one for the dGPU) will increase the cost of R&D. Double the masks vs a single mask for the APU.
3. Having a CPU + dGPU will increase the BOM. More expensive motherboards because of GDDR5 vs HBM.
4. Integrating everything in a SoC design will decrease space allowing for a smaller, thinner Laptop design. Further decreasing BOM.
5. An SoC design (CPU + iGPU + HBM ) will be more efficient than a CPU + dGPU with GDDR-5, commanding a higher ASP.

Personally I believe that currently AMD is not concerned about the R&D cost and the TAM for such a product but the Manufacturing volumes. They seam to be 14nm and 12nm volume constrained as of today and creating a big 300-350mm die will simple add more problems to the 14nm dGPUs and 12nm CPUs volums.

At 7nm they could make a 6Core APU with 20-24 CU iGPU with 4GB HBM memory. That die will be smaller than today's 4C + 11 CU APU with only the 4GB HBM for added cost but would command extremely higher ASP than todays Ryzen 2700U.

1: Great, you have sales figures for how many GTX1050/1060 laptops are sold? I'd like to see them. I am betting it isn't as high as you think. Most people buy laptops with no dGPU.

2: No, creating a multipurpose dies that server multiple markets makes more sense. NVidia owns both the laptop dGPU market, and the discrete GPU card market with the same 1030/1050/1060/1080 chips. It's not creating double the mask for one market. It is using one mask to cover two markets, and that makes much more sense.

The problem with the "Super" APU attempting to win this market, is that you actually have to win it with one inflexible, very expensive die. So you come out with your ultra expensive APU, and NVida puts GTX 1160 into laptops (Lenovo already let slip these are coming) for lower cost, and higher performance, and you are totally sunk. You just created a large white elephant.

The dGPU option is flexible. It allows you to reuse your desktop dGPU card dies, as mobile dies, and cover all the price/performance levels.

NVidia actually blankets the laptop dGPU market with the dies from GT1030/GTX1050/GTX1060/GTX1080. All 4 have mobile variants, and you can flexibly pair it with your CPU of choice (almost all still choosing Intel CPU for design wins).

A Super APU really can't compete with the flexibility and economy of scale of simply re-using desktop GPU as mobile GPUs to cover the whole range of options.

APUs will stick to low performance(relatively), low cost GPU sections, to server that much bigger TAM that doesn't buy dGPU in their laptops (and desktops).

The higher performance GPU niche will continue to be met by reusing desktop GPUs.

french toast · Jul 9, 2018

I really would like AM5 to be 3 channel ddr5.
Presumably we are going to 6 then 8 core CCX, plus more powerful igpu's, without some kind of vram or L4 cache I don't see it working very well with dual channel ddr5 and no vram.

LTC8K6 · Jul 9, 2018

french toast said:
I really would like AM5 to be 3 channel ddr5.
Presumably we are going to 6 then 8 core CCX, plus more powerful igpu's, without some kind of vram or L4 cache I don't see it working very well with dual channel ddr5 and no vram.

We can't afford dual channel ddr4, now we gotta' pay for triple channel ddr5...

whm1974 · Jul 9, 2018

LTC8K6 said:
We can't afford dual channel ddr4, now we gotta' pay for triple channel ddr5...

I doubt that we will see triple channel memory with socket AM5 anyway due to cost.

beginner99 · Jul 9, 2018

NTMBK said:
At 7nm, a "huge" GPU won't take up that much space. The die area of Raven Ridge's 11CU GPU would probably be enough for about a 20CU 7nm GPU, at a rough guess. How are you going to keep that fed? DDR4/5 speeds aren't increasing quickly enough.

It's not a technical issue, it's a financial issue. Too niche. Too low volume to be worth development cost, especially mask costs (they alone are in the 100+ million range so you must make that much profit, not sales from such a die to break even.).

AtenRa said:
I will have to disagree,

What PeterScott said. Sales numbers especially compared to laptops without dGPU? Raven ridge will sell far more often,probably order(s) of magnitude than any such hypothetical APU. Any any money spent on this can't be used for something else like Zen3 and breaking into the x86 server market again which is not a niche but a huge growing market.

whm1974 · Jul 9, 2018

Right now AMD doesn't have the resources and can't afford to be developing products that will be niche with limited market appeal.

Personally I think AMD's strategy with using the same socket for APUs and CPUs with no iGPU and higher core counts is rather smart.

french toast · Jul 9, 2018

LTC8K6 said:
We can't afford dual channel ddr4, now we gotta' pay for triple channel ddr5...

whm1974 said:
I doubt that we will see triple channel memory with socket AM5 anyway due to cost.

I think by 20/21 we will be on 16 cores mainstream, also with igpu to worry about, I would hope by then Tripple channel is a go.

whm1974 said:
Right now AMD doesn't have the resources and can't afford to be developing products that will be niche with limited market appeal.

Personally I think AMD's strategy with using the same socket for APUs and CPUs with no iGPU and higher core counts is rather smart.

What is the feasibility of a crystalwell type L4 cache? On 7nm this has to be feasible for 6 core/16CU APU's does it not?
Alongside quad channel lpddr5 option.

Speculation: AMD's APU with HBM

What will AMD's first "super APU" look like?

Small monolithic die (1 CCX, 1 GPU) with low-cost HBM on package substrate

Small monolithic die (same) with low-cost HBM on interposer

Small monolithic die (same) with ordinary HBM on interposer

Large monolithic die (2+ CCXs, 1 GPU) with HBM on interposer

Separate CPU die (2+ CCXs) and GPU die with HBM on interposer

Something else (specify below)

Lifer

Elite Member

Senior member

Diamond Member

Senior member

Lifer

Golden Member

Golden Member

Platinum Member

Diamond Member

Senior member

Platinum Member

Senior member

Diamond Member

Lifer

Lifer

Golden Member

Lifer

Platinum Member

Senior member

Lifer

Diamond Member

Diamond Member

Diamond Member

Senior member