Discussion RDNA4 + CDNA3 Architectures Thread

Page 3 - Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

DisEnchantment

Golden Member
Mar 3, 2017
1,586
5,694
136

With the GFX940 patches in full swing since the first week of March, it is looking like MI300 is not in the distant future!
Usually AMD takes around 3 quarters to get the support into LLVM and amdgpu. Lately, since RDNA2, the window in which they push support for new devices is much reduced, to prevent leaks.
But looking at the flurry of code in LLVM, it is a lot of commits. Maybe because the US Govt is starting to prepare the SW environment for El Capitan (perhaps to avoid a slow bring-up situation like Frontier's, for example).

See here for the GFX940 specific commits
Or Phoronix

There is a lot more if you know whom to follow in LLVM review chains (before getting merged to github), but I am not going to link AMD employees.

I am starting to think MI300 will launch around the same time as Hopper, probably only a couple of months later!
Although I believe Hopper had the problem of not having a host CPU capable of PCIe 5 in the very near future, so it might have gotten pushed back a bit until SPR and Genoa arrive later in 2022.
If PVC slips again, I believe MI300 could launch before it :grimacing:

This is nuts, MI100/200/300 cadence is impressive.


Previous thread on CDNA2 and RDNA3 here

 

soresu

Platinum Member
Dec 19, 2014
2,535
1,734
136
  • Each GCD is basically functional as a GPU, with the LLC/CP on the base die and SEDs on top.
  • The patent calls the individual stacked die (base + SEDs) a GCD (see below). There are multiple GCDs in the GPU.
  • SEDs could perhaps be the GCX?
RedGamingTech covered this I think in one of their RDNA4 videos.

SED = Shader Execution Die.

The wording implies that it is basically a GCD/GCX in principle.

Naming it "Shader Execution Die" though makes it agnostic to either graphics specialised or pure compute use.

This is smart from a patent law perspective, because it generalises the terminology to cover all massively parallel processors using the shader type model, whether they have specialised graphics hardware (ala RDNA) or not (ala CDNA).
 

Joe NYC

Golden Member
Jun 26, 2021
1,864
2,147
106
Don't know who this All The Watts fellow is, but it seems he is reporting some stuff which is now echoed by a few LeakTubers.

View attachment 75949

Not coincidentally, I found a patent around this, as described in the above leak. I am not sure if said leaker started reading patents and made up leaks, hahaha.
Basically the patent has 1x, 2x and 3x GCX configs as shown in the leak LOL.
Shader Engine Die (SED) --> stacked on top of base die
Base dies --> memory controller + CP + LLC
LSI used to connect the base dies.
Inventor is Mike Mantor, Senior Fellow at AMD.

20220320042 - DIE STACKING FOR MODULAR PARALLEL PROCESSORS

Differences between the leak and the patent:
  • There are MCDs in that leak, whereas in the patent the IC is within the base die.
  • Each GCD is basically functional as a GPU, with the LLC/CP on the base die and SEDs on top.
  • The patent calls the individual stacked die (base + SEDs) a GCD (see below). There are multiple GCDs in the GPU.
  • SEDs could perhaps be the GCX?




View attachment 75950
From this patent it is basically MI300 tech.

UPDATE:
Twitter Account deleted.

So it looks like Mi300, where the differences are:
- No HBM, GDDR instead
- No interposer die
- Silicon bridges replace the interposer dies

Unlike Navi 3x, where the MCD is a separate die, theoretically stackable, in Navi 4x the base die would contain 2 of the MCDs and all of the cache.

The compute die and the silicon bridge would be stacked on top of the base die. The base die would seem quite large.

If AMD planned for up to 2 layers of cache stacked on each MCD, each 16 MB, then the base die can have 2 x 3 of these = 6 x 16 = 96 MB. (It does not have to be that, but just to illustrate what can make the base die grow to ~150 mm2.)

Then the compute die on top would be ~125 mm2, with a thin ~50 mm2 strip of silicon bridge. So the products could be:

Navi 43: 1 base die + 1 compute die
Navi 42: 2 base dies + 2 compute dies + 1 bridge
Navi 41: 3 base dies + 3 compute dies + 2 bridges

So for illustration the total silicon would be:
Navi 43: 150 mm2 N6 + 125 mm2 of N4(?) = 275 mm2
Navi 42: 350 mm2 N6 (incl. bridge) + 250 mm2 of N4(?) = 600 mm2
Navi 41: 550 mm2 N6 (incl. bridges) + 375 mm2 of N4(?) = 925 mm2
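For illustration, the area bookkeeping above can be written as a tiny script. All die sizes and the config names are this thread's guesses, not confirmed numbers:

```python
# Hypothetical die-area tally for the speculated Navi 4x lineup.
# All sizes (mm^2) are guesses from the post, not confirmed numbers.
BASE_DIE_MM2 = 150     # N6 base die (MCDs + cache folded in)
COMPUTE_DIE_MM2 = 125  # N4(?) compute die stacked on top
BRIDGE_MM2 = 50        # thin silicon bridge strip

def total_area(n_base, n_compute, n_bridge):
    """Return (N6 area, N4 area, total) in mm^2 for one config."""
    n6 = n_base * BASE_DIE_MM2 + n_bridge * BRIDGE_MM2
    n4 = n_compute * COMPUTE_DIE_MM2
    return n6, n4, n6 + n4

configs = {"Navi 43": (1, 1, 0), "Navi 42": (2, 2, 1), "Navi 41": (3, 3, 2)}
for name, cfg in configs.items():
    n6, n4, total = total_area(*cfg)
    print(f"{name}: {n6} mm2 N6 + {n4} mm2 N4 = {total} mm2")
```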
 

GodisanAtheist

Diamond Member
Nov 16, 2006
6,581
6,802
136

I wonder if we're going to see a funny trend where as we get deeper into a given memory standard's lifecycle, on die cache gets bigger and bigger, then when a new mem standard drops (like GDDR7) then cache sizes will shrink way back thanks to the additional bandwidth.

Only to then start growing again as we get into the 4th and 5th years of the current memory type.
 

Stuka87

Diamond Member
Dec 10, 2010
6,240
2,559
136

I wonder if we're going to see a funny trend where as we get deeper into a given memory standard's lifecycle, on die cache gets bigger and bigger, then when a new mem standard drops (like GDDR7) then cache sizes will shrink way back thanks to the additional bandwidth.

Only to then start growing again as we get into the 4th and 5th years of the current memory type.

I think it comes down to cost. Right now, GDDR6X is more expensive than adding cache. So by AMD adding a lot of cache, they could use the much cheaper GDDR6.

I think even with GDDR7, cache will still be beneficial. Especially if AMD goes forward with having multiple GCD chiplets.
 

GodisanAtheist

Diamond Member
Nov 16, 2006
6,581
6,802
136
I think it comes down to cost. Right now, GDDR6X is more expensive than adding cache. So by AMD adding a lot of cache, they could use the much cheaper GDDR6.

I think even with GDDR7, cache will still be beneficial. Especially if AMD goes forward with having multiple GCD chiplets.

- The cost is more a function of how large a memory controller the die needs, I think. Right now GDDR6 (and X) is too slow for modern workloads, so chipmakers had the option of either putting a massive memory controller on die or a smaller controller and more cache.

Obviously the solution seems to have been to add more cache (kinda weird for the 7900 series since they ended up with a larger bus and less cache, but that might be on account of stacked V-Cache coming in the future).

Faster GDDR7 will mean that either bus sizes or cache sizes can shrink thanks to the additional speed, and I guess it's just a question of economics which one saves more die space.
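The bus-vs-cache tradeoff being described is essentially a hit-rate-weighted bandwidth average. A toy model, with made-up numbers that are purely illustrative of no particular GPU:

```python
# Toy model of the bus-vs-cache tradeoff: effective bandwidth is a
# hit-rate-weighted average of LLC and DRAM bandwidth. All numbers
# below are illustrative, not measurements of any real product.
def effective_bw(hit_rate, cache_bw_gbs, dram_bw_gbs):
    """Blend LLC and DRAM bandwidth by cache hit rate."""
    return hit_rate * cache_bw_gbs + (1 - hit_rate) * dram_bw_gbs

# A narrow bus with a big LLC can rival a wide bus with a small LLC:
narrow_big_cache = effective_bw(0.55, 2000, 400)
wide_small_cache = effective_bw(0.20, 2000, 900)
print(narrow_big_cache, wide_small_cache)
```

The same blend explains why a new, faster memory standard lets the cache (or the bus) shrink while keeping the effective number constant.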
 

Kepler_L2

Senior member
Sep 6, 2020
303
942
106
- The cost is more a function of how large a memory controller the die needs, I think. Right now GDDR6 (and X) is too slow for modern workloads, so chipmakers had the option of either putting a massive memory controller on die or a smaller controller and more cache.

Obviously the solution seems to have been to add more cache (kinda weird for the 7900 series since they ended up with a larger bus and less cache, but that might be on account of stacked V-Cache coming in the future).

Faster GDDR7 will mean that either bus sizes or cache sizes can shrink thanks to the additional speed, and I guess it's just a question of economics which one saves more die space.
RDNA3 already reduced LLC sizes, and IC is less than 50% of the MCD area. Reducing it further in RDNA4 makes no sense from either a perf or an economic POV. Using smaller buses per tier makes a lot more sense economically, but we already saw that in RDNA2, and now NVIDIA is following along (192-bit 4070 Ti, 128-bit 4060).
 

DisEnchantment

Golden Member
Mar 3, 2017
1,586
5,694
136
Well, for the MCD in N31 we already have 1 TB/s. This can be increased quite easily by using more beachfront.
Apple's M1 Ultra, which employs InFO-LSI or EFB, also has only 2.5 TB/s.

/edit:
But in general I agree with you. A silicon bridge is more likely.

Potential MI300 config with one LSI bridge in the center.

20230069294 - MULTI-DIE COMMUNICATIONS COUPLINGS USING A SINGLE BRIDGE DIE
 


DisEnchantment

Golden Member
Mar 3, 2017
1,586
5,694
136
After a string of patents hinting at earlier attempts at a multi-die GPU, the latest one from AMD seems like a much more workable solution.

20230095365 - DISTRIBUTED GEOMETRY
Systems, apparatuses, and methods for performing geometry work in parallel on multiple chiplets are disclosed. A system includes a chiplet processor with multiple chiplets for performing graphics work in parallel. Instead of having a central distributor to distribute work to the individual chiplets, each chiplet determines on its own the work to be performed. For example, during a draw call, each chiplet calculates which portions to fetch and process of one or more index buffer(s) corresponding to one or more graphics object(s) of the draw call. Once the portions are calculated, each chiplet fetches the corresponding indices and processes the indices. The chiplets perform these tasks in parallel and independently of each other. When the index buffer(s) are processed, one or more subsequent step(s) in the graphics rendering process are performed in parallel by the chiplets.
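The "each chiplet determines on its own the work to be performed" idea from the abstract can be sketched roughly as below. The function name and the even, contiguous split are my own illustration, not the patent's actual scheme:

```python
# Rough sketch of the "no central distributor" idea: each chiplet
# derives its own slice of the index buffer purely from (its ID,
# chiplet count, buffer size), with no coordination between chiplets.
def chiplet_slice(chiplet_id, num_chiplets, num_indices):
    """Contiguous [start, end) range of indices this chiplet fetches."""
    per_chiplet = -(-num_indices // num_chiplets)  # ceiling division
    start = chiplet_id * per_chiplet
    end = min(start + per_chiplet, num_indices)
    return start, end

# Every chiplet runs the same computation independently and in
# parallel; together the slices cover the whole buffer with no overlap.
print([chiplet_slice(i, 3, 10) for i in range(3)])  # [(0, 4), (4, 8), (8, 10)]
```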



Crazy to think AMD started working on these multi-GCD concepts back in 2019, with the patents below.
11232622 : Data flow in a distributed graphics processing unit architecture

From <https://www.freepatentsonline.com/11232622.html>
20220207827 : SYSTEMS AND METHODS FOR DISTRIBUTED RENDERING USING TWO-LEVEL BINNING
From <https://www.freepatentsonline.com/y2022/0207827.html>
 

Saylick

Diamond Member
Sep 10, 2012
3,058
6,101
136
Tom's has a small update on MI300:
 

Joe NYC

Golden Member
Jun 26, 2021
1,864
2,147
106
I was about to post this as well, thanks.
The biggest takeaways for me:
  • They confirmed that the base die houses Infinity Cache (of unknown size)
  • They confirmed that the bottom-right square is indeed the CPU (as I assumed in January)

Those were extremely safe bets.

What I find to be new:
- the 4-way configuration, a first for AMD since Opteron
- no external DRAM slots? At least not visible (to me)
- water cooling for the nodes was expected
- each one of the Mi300s seems to be paired with a network card (for the supercomputer connectivity)
 

BorisTheBlade82

Senior member
May 1, 2020
653
997
106
Those were extremely safe bets.

What I find to be new:
- the 4-way configuration, a first for AMD since Opteron
- no external DRAM slots? At least not visible (to me)
- water cooling for the nodes was expected
- each one of the Mi300s seems to be paired with a network card (for the supercomputer connectivity)
Yes, I know - the IF$ especially.
For the CPU placement there was still some speculation recently.
But nice to get official confirmation nevertheless.
 

Joe NYC

Golden Member
Jun 26, 2021
1,864
2,147
106
Jim from AdoredTV has some new tidbits on Mi300. Three models will be released:
- Mi300a - APU with 6 GPU chiplets + 3 CPU chiplets (24 cores)
- Mi300c - CPU only, with 96 Zen 4 Genoa cores
- Mi300x - GPU only, with 8 GPU chiplets, and an option of 128 / 192 GB of HBM3, most likely from Hynix starting to offer 12-high stacks of HBM3

Mi300c is looking good, likely a generational uplift over Genoa / Genoa-X SP5 CPUs for workloads that fit inside the shared system-level cache and/or HBM3 memory.

 

jamescox

Senior member
Nov 11, 2009
636
1,103
136
Those were extremely safe bets.

What I find to be new:
- the 4-way configuration, a first for AMD since Opteron
- no external DRAM slots? At least not visible (to me)
- water cooling for the nodes was expected
- each one of the Mi300s seems to be paired with a network card (for the supercomputer connectivity)
I assume the large green card is the Slingshot network card. It appears to be raised up off the surface of the board. The white pieces around the CPUs look raised up also, so perhaps some kind of RAM coolers or baffles with DIMMs underneath? It is a very compact system, probably without much airflow. A lot of HPC applications need TBs of RAM, so the 128 GB of HBM is not sufficient by itself. They could possibly connect DRAM with CXL. AMD uses the HBM like a cache, so the speed of external memory may be reduced in importance.

It looks like Grace-Hopper has HBM and also LPDDR on the package, but it doesn't seem like that much. AMD might have an advantage if they support off-package DRAM.
 

jamescox

Senior member
Nov 11, 2009
636
1,103
136
View attachment 78696

Potential MI300 config with one LSI bridge in the center.

20230069294 - MULTI-DIE COMMUNICATIONS COUPLINGS USING A SINGLE BRIDGE DIE
The base dies are rectangular, so they only have 2 orientations. With the single LSI bridge, it seems like they would need to put the interface for it on both the top and the bottom of one edge of the base die, with 1 interface going unused. It would be rotated 180 degrees for those on the left vs. the right, the same as gen 1 Epyc Naples.

I thought that they might use the Infinity Fabric fanout like they do with the GCD/MCD dies. That is almost 900 GB/s. That causes issues with connecting to the diagonally placed dies though. There is almost certainly not a giant silicon interposer under the whole package; that would be massive and expensive. If they have to use EFB to connect to the HBM, then it seems like they would just use it to connect the base dies also. If you use it in one place, then you are already essentially paying the price for it; the whole base die must be elevated on copper pillars. Are there any other options for connecting to the HBM? The Infinity Fabric fanout is likely much cheaper and fast enough for an HBM stack, but then that would require custom HBM.
 

jamescox

Senior member
Nov 11, 2009
636
1,103
136
Jim from AdoredTV has some new tidbits on Mi300. Three models will be released:
- Mi300a - APU with 6 GPU chiplets + 3 CPU chiplets (24 cores)
- Mi300c - CPU only, with 96 Zen 4 Genoa cores
- Mi300x - GPU only, with 8 GPU chiplets, and an option of 128 / 192 GB of HBM3, most likely from Hynix starting to offer 12-high stacks of HBM3

Mi300c is looking good, likely a generational uplift over Genoa / Genoa-X SP5 CPUs for workloads that fit inside the shared system-level cache and/or HBM3 memory.

I assume that the SH5 socket will have a larger max power than the SP5 socket, so it will be really interesting to see 96 cores with possibly a much larger power budget, if these actually exist. SH5 isn’t really the “data center” socket; that is SP5. SH5 is the HPC socket.
 

jamescox

Senior member
Nov 11, 2009
636
1,103
136
Jim from AdoredTV has some new tidbits on Mi300. Three models will be released:
- Mi300a - APU with 6 GPU chiplets + 3 CPU chiplets (24 cores)
- Mi300c - CPU only, with 96 Zen 4 Genoa cores
- Mi300x - GPU only, with 8 GPU chiplets, and an option of 128 / 192 GB of HBM3, most likely from Hynix starting to offer 12-high stacks of HBM3

Mi300c is looking good, likely a generational uplift over Genoa / Genoa-X SP5 CPUs for workloads that fit inside the shared system-level cache and/or HBM3 memory.


This doesn’t have anything about AI chiplets. Will the AI chiplets take the place of GPU chiplets? If it can operate as all GPU, then I guess they probably can mix and match any that they want. Perhaps the APU launches first with the other variants a little later.
 

Joe NYC

Golden Member
Jun 26, 2021
1,864
2,147
106
I assume the large green card is the Slingshot network card. It appears to be raised up off the surface of the board. The white pieces around the CPUs look raised up also, so perhaps some kind of RAM coolers or baffles with DIMMs underneath? It is a very compact system, probably without much airflow. A lot of HPC applications need TBs of RAM, so the 128 GB of HBM is not sufficient by itself. They could possibly connect DRAM with CXL. AMD uses the HBM like a cache, so the speed of external memory may be reduced in importance.

It looks like Grace-Hopper has HBM and also LPDDR on the package, but it doesn't seem like that much. AMD might have an advantage if they support off-package DRAM.
128 GB x 4 nodes would be 512 GB. We will see if that's sufficient.

Jim from AdoredTV mentioned that the pure GPU cards may have 192 GB, which probably means 12-high HBM stacks as opposed to 8-high.

The current 8-high stacks on Mi300a appear to hold 16 GB each; 12-high stacks, 24 GB.

There is also theoretical support for 16-high, which with double-capacity DRAM dies would be 64 GB per stack, but 48 GB per stack (12-high) may be the practical limit for this generation.

So theoretically, the capacity could go up to 256 GB per MCM using 8-high stacks and 384 GB using 12-high.

I have not seen anyone, including Nvidia, using the double-capacity DRAM modules / layers. The Nvidia H100 has only 6 stacks as opposed to 8 for Mi300.
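The stack math above is simple multiplication; a quick sketch, where the stack heights and per-die capacities are the thread's speculated values, not published specs:

```python
# Back-of-envelope HBM capacity math from the post. Stack heights and
# per-die capacities are speculated values, not published specs.
def package_capacity_gb(num_stacks, stack_height, die_gb=2):
    """Total HBM per package: stacks x dies-per-stack x GB-per-die."""
    return num_stacks * stack_height * die_gb

print(package_capacity_gb(8, 8))      # 128 GB: eight 8-high stacks
print(package_capacity_gb(8, 12))     # 192 GB: eight 12-high stacks
print(package_capacity_gb(8, 8, 4))   # 256 GB with double-density dies
print(package_capacity_gb(8, 12, 4))  # 384 GB with double-density dies
```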

We will see if there is support for local DIMM slots. Probably not, and the SH5 socket probably goes overboard with PCIe Gen 5 and CXL lanes.

Or perhaps some of the lanes will be Infinity Fabric for the mesh interconnect of 4-way nodes. Not sure how those would re-use the PCIe / CXL lanes or if they are somehow dedicated.
 

Joe NYC

Golden Member
Jun 26, 2021
1,864
2,147
106
I assume that the SH5 socket will have a larger max power than the SP5 socket, so it will be really interesting to see 96 cores with possibly a much larger power budget, if these actually exist. SH5 isn’t really the “data center” socket; that is SP5. SH5 is the HPC socket.
We will see if the proliferation of AMD sockets will be an asset or a detriment in the datacenter. There will be 3, almost 4, sockets:

- SP5 - regular Genoa / Bergamo
- SH5 - Mi300
- SP6 - Siena
- AM5 - AMD is starting to push this for the super low end and micro-servers

I think socket SH5 is the most forward-looking; there is a non-zero probability that going forward, with the Mi400 line, SH5 could become the main thrust for AMD while SP5 starts to get de-emphasized.

Architecturally, the Mi300 line of products is just way ahead of Genoa. Chiplet interconnect plus shared system-level memory plus power-efficient high-bandwidth memory is a generation ahead of the Genoa socket.

Cloud providers want to get rid of local DIMMs and move them to CXL-based pools, while still having local memory from HBM.

Moving to Mi400: if the Turin Dense chiplet is an N3 16-core die, perhaps sacrificing some of its L3, then placing 12 of them on Mi400 would give 12 x 16 = 192 cores. The system-level memory, which will likely be 1 to 2 GB, will compensate for the lack or small size of L3. Rumor has it that L2 is doubling too, to 2 MB, so 16 x 2 MB = 32 MB of L2 alone per chiplet.
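The Mi400 speculation above, as plain arithmetic; the chiplet count, cores per chiplet, and L2 per core are all rumored numbers, nothing official:

```python
# Core-count and L2 arithmetic for the rumored Mi400 configuration.
# All inputs are speculation from the thread, not announced specs.
def mi400_estimate(chiplets=12, cores_per_chiplet=16, l2_mb_per_core=2):
    """Return (total cores, MB of L2 per chiplet)."""
    return chiplets * cores_per_chiplet, cores_per_chiplet * l2_mb_per_core

total_cores, l2_per_chiplet = mi400_estimate()
print(total_cores, l2_per_chiplet)  # 192 cores, 32 MB of L2 per chiplet
```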
 

Joe NYC

Golden Member
Jun 26, 2021
1,864
2,147
106
This doesn’t have anything about AI chiplets. Will the AI chiplets take the place of GPU chiplets? If it can operate as all GPU, then I guess they probably can mix and match any that they want. Perhaps the APU launches first with the other variants a little later.
The GPU chiplets are now AI chiplets. :)


The difference from CDNA2 to CDNA3 is that CDNA3 will support compute on much smaller data sizes: from FP64, which was the mainstream HPC demand, down to FP32 and various 16- and 8-bit integer and float data types.

If the compute units that performed a single FP64 operation can perform 8 x 8-bit operations, there you have 8x the performance just from supporting AI-friendly data types.
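The 8x figure is just width arithmetic; assuming a lane's packed throughput scales with how many narrow values fit in 64 bits (an idealized upper bound, not a CDNA3 spec):

```python
# Width arithmetic behind the "8x from narrower types" argument: if a
# lane's packed throughput scales with how many values fit in 64 bits,
# each width yields 64/bits ops per former FP64 op. This is an
# idealized upper bound for illustration, not a CDNA3 spec sheet.
def ops_per_fp64_lane(bits):
    assert 64 % bits == 0, "width must divide 64"
    return 64 // bits

for bits in (64, 32, 16, 8):
    print(f"{bits:2d}-bit: {ops_per_fp64_lane(bits)}x")  # 1x, 2x, 4x, 8x
```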
 

jamescox

Senior member
Nov 11, 2009
636
1,103
136
We will see if the proliferation of AMD sockets will be an asset or a detriment in the datacenter. There will be 3, almost 4, sockets:

- SP5 - regular Genoa / Bergamo
- SH5 - Mi300
- SP6 - Siena
- AM5 - AMD is starting to push this for the super low end and micro-servers

I think socket SH5 is the most forward-looking; there is a non-zero probability that going forward, with the Mi400 line, SH5 could become the main thrust for AMD while SP5 starts to get de-emphasized.

Architecturally, the Mi300 line of products is just way ahead of Genoa. Chiplet interconnect plus shared system-level memory plus power-efficient high-bandwidth memory is a generation ahead of the Genoa socket.

Cloud providers want to get rid of local DIMMs and move them to CXL-based pools, while still having local memory from HBM.

Moving to Mi400: if the Turin Dense chiplet is an N3 16-core die, perhaps sacrificing some of its L3, then placing 12 of them on Mi400 would give 12 x 16 = 192 cores. The system-level memory, which will likely be 1 to 2 GB, will compensate for the lack or small size of L3. Rumor has it that L2 is doubling too, to 2 MB, so 16 x 2 MB = 32 MB of L2 alone per chiplet.
SP3 and SH5 cover different markets. Even the SP5 socket is problematic in some cases due to needing the board space for 12 DIMM slots per socket. There seem to be some board/chassis power limitations, at least for 2U systems; some are not supporting the higher-power-consumption Genoa parts. Without any other sockets they would only have 2-channel or 12-channel, which doesn't cover all markets very well. It sounds like SP6 may actually be designed to be very similar to SP3, with possible support for the same heat sinks and such. SP6 systems may come out much more quickly than Genoa systems did, since a board could essentially be swapped into an existing Milan chassis. A lot of servers do not need the expense of even the full SP5 socket. SH5 will likely have really high power consumption and also significantly higher cost, making it unsuitable for a lot of systems.

I assume that the MCM-style design without stacking is much cheaper to make than something like MI300 or even Sapphire Rapids. The prices I have seen at the 24-to-32-core level were around $2000 more for Sapphire Rapids than for the roughly equivalent Genoa part, and the Genoa part has lower power consumption. I don't know if we know what connectivity SH5 has yet. It likely has really high-speed socket-to-socket links that would be massive overkill for most servers. I have seen some Grace-Hopper diagrams showing 18x NV4 links for 900 GB/s socket-to-socket bandwidth, so I assume AMD has something similar with Infinity Fabric GPU links. It also seems to be unclear how much off-package memory it supports. Grace-Hopper seems to have only 512 GB of LPDDR on package. That isn't enough for a lot of HPC applications, but it may be backed by CXL-attached memory.

SH5 will be HPC prices, but likely a lot cheaper than Nvidia Grace-Hopper systems. Intel's similar part seems to be delayed? SP5-based processors will probably maintain performance and power-consumption advantages over Intel parts. Since they are cheaper to make, AMD can use SP5 MCM processors to dominate the regular server market. SP6 will take low-end servers. SH5 will be HPC, not general servers. "Servers" is too general a term; there is a big difference between general servers and GPU "servers" for AI or other HPC applications. One might think an SH5 CPU would be good for database applications due to the possibly massive amount of Infinity Cache, but the Infinity Cache may not be as low-latency as V-Cache. Database servers may be better off with just multiple layers of V-Cache die in an MCM package. We don't know the amount of Infinity Cache yet. They may have 4 to 6 PCIe links (GPU-GPU capable?) per base die, so possibly much more IO than a regular Epyc IO die with only 2 x16 per quadrant. It is unclear how many off-package memory controllers it has, if any. There is 128 GB of HBM cache, so that might be backed by CXL-attached memory.

Anyway, SP5 and SH5 are not going to be interchangeable. I don't think we will see AMD trying to push SH5 down into the general server market, as it is likely specialized for HPC. They can always start using stacked die in an SP5 package if it makes sense to do so. It doesn't have to stay an MCM.