News [Toms] AMD multi chiplet gpu patent

GodisanAtheist

Diamond Member
Nov 16, 2006
6,719
7,016
136
If I'm reading that patent right, it looks like AMD's design sort of mimics what they're doing with their Zen chiplets: a master/IO chiplet ganged together with a set of subchiplets that do the actual processing.

As was mentioned, this looks like a fine method of handling compute and machine learning, but the data latency issues would tank any sort of graphics processing workload.

Curious to see how much Infinity Cache figures into hiding latency issues in this design.
 

Bigos

Member
Jun 2, 2019
127
281
136
From the looks of things, the chiplets will be symmetrical (given how they are meant to sit closely together in a square formation in a 4-chiplet configuration). One will be the primary one simply because it is connected directly to the PCIe bus, and the rest need to be accessed through it. I wonder if that means there will be more duplicated structures than usual (like multiple VCN blocks).

Regarding latency, it is pretty much not an issue in GPU workloads with very long pipelines. Using a tile-based rasterizer will also make it possible to subdivide the geometry between the chiplets to reduce the inter-chiplet bandwidth requirements. However, things like texture/geometry access might need to go through the interposer, as each chiplet will contain its own memory bank (something akin to NUMA or Zen 1 EPYC). I wonder how much bandwidth the interposer solution will allow for. It was evidently not possible on an organic substrate (like EPYC and Ryzen 3000+ use), as it would probably burn too much power for the required bandwidth.
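
A rough sketch of the tile idea, not taken from the patent: screen tiles are handed out to chiplets in a repeating 2x2 pattern, and each triangle is sent only to the chiplets whose tiles its bounding box overlaps. The tile size, the checkerboard layout and all function names here are assumptions made up for illustration.

```python
# Hypothetical sketch: checkerboard assignment of screen tiles to chiplets.
# TILE, GRID and the layout are assumptions, not details from the patent.

TILE = 32   # tile edge in pixels (assumed)
GRID = 2    # 2x2 pattern, i.e. the 4-chiplet configuration


def chiplet_for_tile(tx, ty):
    """Map tile coordinates to a chiplet index in a repeating GRID x GRID pattern."""
    return (ty % GRID) * GRID + (tx % GRID)


def tiles_touched(xmin, ymin, xmax, ymax):
    """Tiles overlapped by a triangle's screen-space bounding box (inclusive pixels)."""
    return [(tx, ty)
            for ty in range(ymin // TILE, ymax // TILE + 1)
            for tx in range(xmin // TILE, xmax // TILE + 1)]


def bin_triangle(bbox):
    """Return the set of chiplets that need to see this triangle."""
    return {chiplet_for_tile(tx, ty) for tx, ty in tiles_touched(*bbox)}


small = bin_triangle((5, 5, 20, 20))     # fits inside a single tile
large = bin_triangle((0, 0, 100, 100))   # spans a 4x4 block of tiles
print(small, large)
```

Under these assumptions a small triangle lands on one chiplet and only large ones get duplicated across several. It says nothing about texture reads, which can still hit another chiplet's memory and are the part that would lean on the interposer.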
 

Dribble

Platinum Member
Aug 9, 2005
2,076
611
136
The way to reduce the latency and inter-chip transfer issues is to optimise for them, which means coding the game renderer with them in mind. The easiest place to do this would be the GPU's drivers, but that's only possible with a high level driver (something AMD killed when they helped usher in the current age of low level drivers). With a low level driver it becomes, to a much greater extent, the game dev's problem. As we know, game devs really aren't keen on this sort of thing as it's too much work (see the death of Xfire and SLI, which died because making them work moved from being a driver problem to being the game devs' problem, and they just weren't interested).
Even if AMD (and of course Nvidia and Intel) provided libraries to help support this, they'd all be different: unlike high level drivers, which all support the same unified interface, these would all have completely different interfaces. It would just be a ton of work for the devs to sort out, which they won't be keen to do.
 

Bigos

Member
Jun 2, 2019
127
281
136
At the very least, Vulkan provides facilities to program for tile-based renderers by partitioning render passes into subpasses. That should make it possible to reduce the inter-tile dependencies and allow the tiles to be scheduled onto different chiplets. This should be done transparently in the driver.

Crossfire/SLI worked with very little communication between the GPUs. Both the dedicated bridges and the PCIe interconnect provide an order of magnitude less bandwidth than even the GPU memory bus, let alone the on-chip cache interconnects. The solution described in the patent seems to imply this won't be an issue anymore thanks to high performance interconnect through a passive interposer.
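
To put a rough number on that gap, here is a back-of-envelope calculation. The two parts compared (a PCIe 3.0 x16 link and a 256-bit GDDR6 bus at 16 Gbps per pin) are assumed examples chosen for illustration, not figures from the patent or from any specific card discussed here.

```python
# Back-of-envelope bandwidth comparison; the chosen parts are assumed examples.

def pcie3_gb_per_s(lanes):
    """PCIe 3.0: 8 GT/s per lane with 128b/130b encoding, one direction, in GB/s."""
    return 8e9 * (128 / 130) * lanes / 8 / 1e9


def gddr6_gb_per_s(bus_bits, gbps_per_pin):
    """Peak memory bandwidth in GB/s for a given bus width and per-pin data rate."""
    return bus_bits * gbps_per_pin / 8


pcie = pcie3_gb_per_s(16)        # x16 slot
vram = gddr6_gb_per_s(256, 16)   # 256-bit bus, 16 Gbps per pin
ratio = vram / pcie

print(f"PCIe 3.0 x16: {pcie:.2f} GB/s")
print(f"GDDR6 256-bit @ 16 Gbps: {vram:.0f} GB/s")
print(f"ratio: {ratio:.1f}x")
```

With these example parts the memory bus comes out a bit over 30 times faster than the PCIe link, which is the kind of gap that kept Crossfire/SLI to schemes needing very little cross-GPU traffic.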

The patent also says that the solution is meant to appear as a "single GPU" to the CPU. It remains to be seen whether that means the solution will be 100% hardware/firmware based or whether the driver will be involved. I don't think the game engine will have to be involved, though, unless there is a "NUMA" mode or something like that.
 

gorobei

Diamond Member
Jan 7, 2007
3,654
980
136
AMD's work on active interposers (ButterDonut) suggests they can get the bandwidth needed, but the 'single gpu' part may be why they need Xilinx FPGAs, which would let them fix things with a patch later instead of needing to get it perfect in hardware from the start.
 

soresu

Platinum Member
Dec 19, 2014
2,617
1,812
136
but the 'single gpu' part maybe why they need xylinx fpga
Nah, FPGA tech acquisition is just about diversifying and remaining competitive with Intel.

Not counting ML optimised HW in CDNA (or splitting compute/gaming uArch), the Xilinx acquisition represents the first major diversification AMD has made in years.

If they had not been in dire straits financially for at least half a decade they might have bought into FPGA tech much sooner. As it is, they had to prioritise belt tightening and core uArch R&D to keep going - now they have regained a lot of investor confidence* and can afford to splurge on new investments.

Right now Qualcomm has bought Nuvia, I believe in response to Samsung/RDNA as much as for its competitive CPU possibilities - it remains to be seen just how well prepared Qualcomm's Adreno team was for the last couple of years' DX12 Ultimate and Vulkan RT features, so they may be facing uncertain times unless they also cut a deal with AMD (or IMG Tec for Wizard/Caustic RT licensing).

*if in part due to Intel's process foibles vs their fabless model.