Question: Beginner GPU uarch questions (scratching my head too long)?


Junior Member
Oct 25, 2020
Sorry if inappropriate
  1. AMD talked about pipeline rebalancing in RDNA2. Does this involve reducing latency via reorganization of the VGPRs and scalar registers and the wave size (16 <- 20), or the distribution of wavefronts across SIMDs?
  2. Earlier reports indicated that the shaders would traverse the BVH, with the traversal data stored in the texture cache. How does that work with a Ray Accelerating Unit?
  3. How would the Infinity Cache be structured differently from a traditional CPU L3 cache?
  4. Bandwidth management, beginning with RDNA1's shared L1 cache, the LLC, and Infinity Fabric, hints at a future chiplet approach. How would they manage conflicting GPU processing (triangles?) across chiplets without a hit to latency or power?


Jun 2, 2019
I will try to answer some of your questions, while still not being an expert in the field (read: take with a grain of salt).

1. The wave size did not change between RDNA 1 and RDNA 2 - it is still 32 or 64. What did change is the number of wavefronts (i.e. bundles of 32/64 threads) that can be resident simultaneously on a single CU. By "simultaneously" I mean "interleaved": each cycle (or maybe less often?) the CU selects which wavefront to issue from, and there are now fewer wavefronts to choose from. The assumption is that this made it possible for AMD to optimize some other aspects (like the clock speed).
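That interleaving is easy to show with a toy model: each cycle the SIMD issues from whichever resident wavefront is ready, so more resident waves hide more of the memory latency. All numbers here are made up for illustration, not RDNA's actual figures:

```python
# Toy model of a SIMD picking one ready wavefront per cycle.
# Hypothetical numbers; real RDNA scheduling is far more complex.
def run(num_waves, mem_latency, instrs_per_wave):
    # Each wave issues one instruction, then stalls on a memory
    # access for `mem_latency` cycles before it can issue again.
    ready_at = [0] * num_waves   # cycle at which each wave can next issue
    done = [0] * num_waves       # instructions completed per wave
    cycle = 0
    while min(done) < instrs_per_wave:
        # Issue from the first wave that is ready this cycle, if any.
        for w in range(num_waves):
            if done[w] < instrs_per_wave and ready_at[w] <= cycle:
                done[w] += 1
                ready_at[w] = cycle + mem_latency  # stalls until then
                break
        cycle += 1
    return cycle

# Few resident waves leave the SIMD idle most cycles; many waves keep it busy.
few = run(num_waves=4, mem_latency=16, instrs_per_wave=8)
many = run(num_waves=16, mem_latency=16, instrs_per_wave=8)
```

With 16 waves an instruction issues every cycle; with 4 waves the SIMD sits idle 12 cycles out of every 16, even though each individual wave makes the same progress per issue.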
2. I always assumed that the ray acceleration unit works like a texture unit - the shader sends a request for data and then (after 10-1000 cycles) receives it. With how ray-tracing is described in RDNA 2 (i.e. it cannot work simultaneously with the texture unit), I assume this was a correct assumption - the same unit is responsible for fetching texture data and fetching BVH data. But still, texture units write data into the texture cache (after it is decompressed), which speeds up further texture lookups. I wonder if something similar is employed for the ray acceleration units - if so, it will likely use the same texture cache. They might just write ray intersection results into the cache to be processed by the sibling thread invocations.
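The traversal loop I have in mind looks roughly like this: the shader owns the stack and the loop, while the intersection test is the part a fixed-function unit would answer. The node layout, names, and slab test are my own illustration, not AMD's actual format:

```python
# Sketch of shader-driven BVH traversal. `aabb_hit` stands in for the
# box test a ray acceleration unit would perform in hardware.
def aabb_hit(box, origin, direction):
    """Slab test: does the ray (origin + t*direction, t >= 0) hit the
    axis-aligned box given as (min_corner, max_corner)?"""
    tmin, tmax = 0.0, float("inf")
    for axis in range(3):
        if direction[axis] == 0.0:
            # Ray parallel to this slab: must already lie inside it.
            if not (box[0][axis] <= origin[axis] <= box[1][axis]):
                return False
            continue
        t0 = (box[0][axis] - origin[axis]) / direction[axis]
        t1 = (box[1][axis] - origin[axis]) / direction[axis]
        if t0 > t1:
            t0, t1 = t1, t0
        tmin, tmax = max(tmin, t0), min(tmax, t1)
    return tmin <= tmax

def traverse(nodes, origin, direction):
    """Stack-based traversal: the shader manages the stack, the 'RA unit'
    (aabb_hit here) answers the per-node intersection queries."""
    hits, stack = [], [0]           # start at root node 0
    while stack:
        node = nodes[stack.pop()]
        if not aabb_hit(node["box"], origin, direction):
            continue
        if "prims" in node:         # leaf: record candidate primitives
            hits.extend(node["prims"])
        else:                       # inner node: push children
            stack.extend(node["children"])
    return hits

# A two-leaf BVH: a ray through the left half should hit primitive 0 only.
nodes = [
    {"box": ((0, 0, 0), (4, 4, 4)), "children": [1, 2]},
    {"box": ((0, 0, 0), (2, 4, 4)), "prims": [0]},
    {"box": ((2, 0, 0), (4, 4, 4)), "prims": [1]},
]
```

The point of the split is that the loop, the stack, and the divergence all stay in the shader - only the box/triangle arithmetic is offloaded, which matches the "send request, wait, receive data" texture-unit model.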
3. I am no expert in cache design, but I would assume that the Infinity Cache is optimized for bandwidth rather than latency. Additionally, there will be many access points to this cache, which might also influence how it is designed - increasing the number of read ports, banking/partitioning the cache, etc. Still, I believe this cache is structured similarly to a CPU cache (i.e. it is divided into sets with multiple ways, each holding a cache line and a tag), but I also think it will have some facilities for color/depth compression (which are very important bandwidth/memory-size optimization techniques).
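The set/way/tag structure I mean, as a minimal sketch (sizes and the eviction policy are arbitrary, not Infinity Cache's real geometry):

```python
# Minimal set-associative lookup: an address splits into tag/index/offset,
# the index picks a set, and the tag is compared against every way in it.
LINE = 64     # bytes per cache line (illustrative)
WAYS = 16     # associativity (illustrative)
SETS = 1024   # number of sets (illustrative)

def split(addr):
    """Break a byte address into (tag, set index, byte offset in line)."""
    offset = addr % LINE
    index = (addr // LINE) % SETS
    tag = addr // (LINE * SETS)
    return tag, index, offset

cache = [[None] * WAYS for _ in range(SETS)]  # each way holds just a tag

def access(addr):
    """Return True on a hit; on a miss, fill a way (naive eviction)."""
    tag, index, _ = split(addr)
    ways = cache[index]
    if tag in ways:
        return True
    victim = ways.index(None) if None in ways else 0
    ways[victim] = tag
    return False
```

The knobs a bandwidth-oriented design would turn are exactly the ones outside this sketch: how many of these lookups can proceed per cycle (ports/banks), not the set/way/tag bookkeeping itself.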
4. If by "shared L1 cache" you mean the patent that shares this cache between CUs, I would assume it is not being used in RDNA2 - or rather, there is no proof of that. I still think it is not suitable for graphics workloads, but I would like to be proven wrong. Regarding chiplets, my limited understanding of GPU pipelines says that it would mostly make sense for tile-based rasterizers: you just schedule tiles between the chiplets while each of them manages its own data. Since AMD GPUs already support tile-based rasterizing as of GFX9+ (Vega & Navi), I would think that it is the future of chiplet processing. This approach has many challenges, but it sounds the most reasonable of the bunch (the alternative being putting everything on a high-bandwidth interposer and pretending nothing has changed).
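The "schedule tiles between chiplets" idea, in a few lines (tile size and chiplet count are arbitrary illustrations):

```python
# Round-robin screen tiles across chiplets; each chiplet then rasterizes
# only the tiles it owns against its local copy of the geometry bins.
TILE = 32  # pixels per tile edge (illustrative)

def assign_tiles(width, height, num_chiplets):
    """Return a mapping of chiplet id -> list of (tile_x, tile_y) it owns."""
    owners = {c: [] for c in range(num_chiplets)}
    tiles_x = (width + TILE - 1) // TILE   # ceil-divide to cover the edge
    tiles_y = (height + TILE - 1) // TILE
    for ty in range(tiles_y):
        for tx in range(tiles_x):
            owners[(ty * tiles_x + tx) % num_chiplets].append((tx, ty))
    return owners

# 1920x1080 at 32px tiles -> 60x34 tiles spread evenly over 4 chiplets.
owners = assign_tiles(1920, 1080, 4)
```

The hard part the sketch hides is exactly the questioner's concern: a triangle spanning several tiles must be binned to every chiplet that owns one of them, and that cross-chiplet traffic is where the latency/power cost would show up.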