There's little point in going with a large iGPU if you cannot offer advantages in cost and power over a dGPU. The latter is flexible and has a better market perception.
A large iGPU also limits you to pretty much a single market, because nobody would pair it with a dGPU: the combination would just cost more.
When has a large iGPU ever offered a cost advantage over a comparable dGPU? Do you think the manufacturers making them will produce large dies just to sell them at the same price as a regular, boring iGPU?
Nvidia killed both Intel's Iris and Kaby-G efforts by offering discounts on their low end dGPUs. It's that simple.
Sure, dGPUs are in theory more expensive, but GDDR memory modules are produced in much higher volume and have been forever. Whether you go with 3D-stacked cache or HBM, it's lower volume and likely more expensive. GPU boards are built every day, in every segment. New design, low volume = higher cost.
So for just a 2x performance gain over a regular iGPU, you might (potentially, maybe) get a setup with better idle battery life than a dGPU setup, but one that's less flexible and costs about the same or even more.
And that advantage lasts just a single year, because they're impossible to upgrade. More e-waste, and no repairability either!
Designing ONE chip per process node, instead of two or even three separate chips, benefits both companies like AMD and Nvidia, and consumers.
I think if a 256-bit bus comes to mainstream platforms, Nvidia's dGPUs up to the 106-class die will disappear, because they will simply be useless.
I expect that at some point, AMD will be designing three APUs, something like this:
Let's say, for the sake of the discussion, that Strix Point is the first architecture to have three APUs in the lineup.
Small Strix Point: 4P/8E, 8 CU (1024 ALU), 128-bit DDR5/LPDDR5 bus, no Infinity Cache. Sort of AMD's A16 chip.
"Normal" Strix Point: 8P/8E, 16 CU (2048 ALU), 128-bit DDR5/LPDDR5 bus, 32 MB Infinity Cache. Sort of AMD's M2 chip.
Large Strix Point: 8P/16E, 32 CU (4096 ALU), 256-bit LPDDR5 bus, 64 MB Infinity Cache. Sort of AMD's M2 Pro chip.
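To put rough numbers on those buses, here's a back-of-the-envelope sketch. The LPDDR5X-7500 speed grade is my assumption for illustration, not something from the lineup above; real parts span roughly 6400-8533 MT/s:

```python
# Peak DRAM bandwidth for the hypothetical Strix Point buses above.
# Assumption: LPDDR5X at 7500 MT/s (speed grades vary, ~6400-8533 MT/s).
def peak_bandwidth_gbs(bus_bits: int, transfer_rate_mts: int) -> float:
    """Peak bandwidth in GB/s: (bus width in bytes) * MT/s / 1000."""
    return bus_bits / 8 * transfer_rate_mts / 1000

narrow = peak_bandwidth_gbs(128, 7500)  # Small / "Normal" Strix Point
wide = peak_bandwidth_gbs(256, 7500)    # Large Strix Point

print(f"128-bit: {narrow:.0f} GB/s")  # 120 GB/s
print(f"256-bit: {wide:.0f} GB/s")    # 240 GB/s
```

240 GB/s raw is in the same ballpark as a mid-range dGPU's memory subsystem before you even count the Infinity Cache, which is why the 32 CU part isn't automatically starved.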
For the "Normal" and Large Strix Point, AMD would design one type of chiplet combining the DDR5/LPDDR5 memory controller with Infinity Cache, exactly like Navi 31's cache+memory-controller chiplets, which makes the designs interchangeable at the chiplet level. You simply take two more chiplets for the Large Strix Point.
The APU portion of these designs stays monolithic; only the cache is moved to chiplets, and the die size of the Large Strix Point would NOT exceed 250 mm2. An APU like this offers console levels of performance without exceeding a certain thermal threshold, while also allowing new, smaller footprints and form factors for desktop PCs. It would fit basically everywhere OEMs would want to compete against Apple: AIOs, mini PCs, laptops, etc. Three designs spanning mobile to desktop. It saves development costs, saves manufacturing costs, saves time.
A 32 CU design is large, to the degree that we are talking about desktop RX 6800-6800 XT performance levels in an APU. Is it pointless? Hell, no.
And lastly, the price. For something like this to be viable you need real benefits: low manufacturing costs, high scalability, and a large enough market for volume. Is it impossible to imagine what I've described: a 250 mm2 die, with 4 chiplets on a larger, cheaper node, costing $499 (assuming no hyperinflation in the coming years)?
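A quick sanity check on whether a 250 mm2 die leaves room for a $499 product. Every input here is an assumption for illustration (leading-edge wafer prices and defect densities are not public), using the standard gross-dies-per-wafer approximation and a simple Poisson yield model:

```python
import math

# Back-of-the-envelope silicon cost for a 250 mm^2 monolithic APU die.
# ALL inputs are assumptions, not AMD figures.
WAFER_DIAMETER_MM = 300
WAFER_COST_USD = 17_000          # assumed leading-edge wafer price
DIE_AREA_MM2 = 250
DEFECT_DENSITY_PER_CM2 = 0.1     # assumed mature-node defect density

def dies_per_wafer(die_area: float, diameter: float = WAFER_DIAMETER_MM) -> int:
    """Standard gross-dies-per-wafer approximation (accounts for edge loss)."""
    r = diameter / 2
    return int(math.pi * r**2 / die_area - math.pi * diameter / math.sqrt(2 * die_area))

def poisson_yield(die_area_mm2: float, d0_per_cm2: float) -> float:
    """Simple Poisson yield model: y = exp(-A * D0), with A in cm^2."""
    return math.exp(-(die_area_mm2 / 100) * d0_per_cm2)

gross = dies_per_wafer(DIE_AREA_MM2)
good = gross * poisson_yield(DIE_AREA_MM2, DEFECT_DENSITY_PER_CM2)
print(f"~{gross} gross dies, ~{good:.0f} good dies, ~${WAFER_COST_USD / good:.0f}/die")
```

Under these assumptions the big die lands somewhere around $90 of silicon before the cheaper-node chiplets, packaging, and margin, so a $499 product isn't obviously absurd; the point is only that the number is sensitive to the assumed wafer price and yield.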
Mark this as speculation. But it should give you all an idea of where things are going.
Great. So which is more ideal for meeting the bandwidth and power requirements, then?
1. Doubling the bus width to 256-bit and using LPDDR5
2. Using stacked 3D V-Cache as an SLC/Infinity Cache
What will happen is both of them.
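The two levers compose, which is why "both" is the likely answer. A rough sketch of why, with assumed numbers (LPDDR5X-7500 and a 50% cache hit rate are illustrative picks; real hit rates depend on cache size, resolution, and workload):

```python
# Both options attack the same problem: effective bandwidth to the GPU.
# Assumptions: LPDDR5X at 7500 MT/s, 50% cache hit rate (illustrative).
def dram_bw_gbs(bus_bits: int, transfer_rate_mts: int) -> float:
    """Peak DRAM bandwidth in GB/s."""
    return bus_bits / 8 * transfer_rate_mts / 1000

def effective_bw_gbs(dram_bw: float, hit_rate: float) -> float:
    """Cache hits never touch DRAM, so DRAM traffic shrinks by the hit
    rate; effective bandwidth scales by 1 / (1 - hit_rate)."""
    return dram_bw / (1 - hit_rate)

base = dram_bw_gbs(128, 7500)             # 120 GB/s: today's 128-bit bus
wider_bus = dram_bw_gbs(256, 7500)        # 240 GB/s: option 1 alone
with_cache = effective_bw_gbs(base, 0.5)  # 240 GB/s: option 2 alone
both = effective_bw_gbs(wider_bus, 0.5)   # 480 GB/s: both combined
```

Either option alone roughly doubles effective bandwidth; stacking them quadruples it, and the cache additionally cuts DRAM power per byte delivered, since hits stay on-package.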