Thanks for the detailed paper analysis, @basix.
Here's an older 2019 paper that looks related to AMD's DSMEM implementation:
https://adwaitjog.github.io/docs/pdf/remote-core-pact19.pdf
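For context on the term: this is roughly the programming model NVIDIA exposes for DSMEM on Hopper (blocks in a thread-block cluster reading each other's shared memory directly instead of bouncing through L2). A minimal CUDA sketch, nothing AMD-specific, just to make "remote core access" concrete:

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
namespace cg = cooperative_groups;

// Each block fills its own shared buffer, then reads its neighbour's copy
// directly over DSMEM -- no round trip through L2/HBM. Requires sm_90.
__global__ void __cluster_dims__(2, 1, 1) dsmem_exchange(int *out) {
    __shared__ int buf[256];
    cg::cluster_group cluster = cg::this_cluster();

    buf[threadIdx.x] = blockIdx.x * 1000 + threadIdx.x;
    cluster.sync();                                   // local writes visible cluster-wide

    unsigned peer = cluster.block_rank() ^ 1;         // the other block in the cluster
    int *remote = cluster.map_shared_rank(buf, peer); // pointer into the peer's shared memory

    out[blockIdx.x * blockDim.x + threadIdx.x] = remote[threadIdx.x];
    cluster.sync();                                   // keep peers resident until reads finish
}

int main() {
    int *d_out;
    cudaMalloc(&d_out, 2 * 256 * sizeof(int));
    dsmem_exchange<<<2, 256>>>(d_out);                // grid of 2 blocks = one cluster
    cudaDeviceSynchronize();
    cudaFree(d_out);
}
```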
Another difference between the papers is 28 CUs vs. 80.
On YT, researchers compared the 2020 scheme (private and shared) with the 2021 one (shared only) as the difference between a building where everyone has their own kitchen but can share ingredients, and one big shared kitchen. Indeed not apples to apples.
Something between C5 and C10 (16 or 8 CUs per cluster) seemed to yield the best results regarding performance, chip area and power efficiency. The rumored 12 CUs per RDNA 5 SE seem to sit right in that sweet spot.
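Back-of-the-envelope for why moderate cluster sizes tend to win (all numbers invented, just to show the shape of the trade-off those papers sweep): pooling collapses duplicated lines, so the capacity one CU can effectively reach grows with cluster size, while the latency/arbitration cost grows too.

```cpp
#include <cstdio>

// All numbers are made up for illustration; the papers sweep these parameters
// in simulation, this just shows the shape of the trade-off.
int main() {
    const double l0_kib_per_cu = 32.0;  // assumed private L0 capacity per CU
    const double hop_ns        = 2.0;   // assumed extra latency per hop inside a cluster

    printf("CUs/cluster  capacity reachable by one CU (KiB)  avg extra latency (ns)\n");
    int sizes[] = {4, 6, 8, 12, 16, 32};
    for (int n : sizes) {
        double reachable_kib = l0_kib_per_cu * n;       // duplicates collapse into one pool
        double extra_lat     = hop_ns * (n - 1) / 2.0;  // crude linear model of cluster hops
        printf("%11d  %35.0f  %22.1f\n", n, reachable_kib, extra_lat);
    }
}
```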
There's actually a patent for adaptive CU clustering:
https://patents.google.com/patent/US11360891B2
Also, the simulations are pretty much based on a GCN CU. RDNA's split L0s merged into one unit would align with PR40 in the 2021 paper, except for the cache size differences. RDNA 5 SE = 12/24 CUs, except for AT02. Clustering could be C6, C12, C8, maybe even irregular clustering, perhaps based on some clever heuristic or ML block to maximize performance.
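Purely as a thought experiment, an "irregular clustering" pass could look something like this: agglomeratively merge the CU groups with the highest observed cross-CU sharing until a size cap is hit. The affinity counters, policy and cap below are all invented; the patent only claims that the clustering is adaptive.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical heuristic: merge the CU groups with the highest sampled
// cross-CU sharing, under a size cap, yielding possibly irregular clusters.
using Affinity = std::vector<std::vector<double>>;  // affinity[i][j]: sampled CU<->CU sharing

std::vector<std::vector<int>> cluster_cus(int num_cus, const Affinity &aff, int max_size) {
    std::vector<std::vector<int>> groups(num_cus);
    for (int i = 0; i < num_cus; ++i) groups[i] = {i};          // start: one CU per group

    while (true) {
        int best_a = -1, best_b = -1;
        double best = 0.0;
        for (size_t a = 0; a < groups.size(); ++a)
            for (size_t b = a + 1; b < groups.size(); ++b) {
                if (groups[a].size() + groups[b].size() > (size_t)max_size) continue;
                double s = 0.0;
                for (int i : groups[a]) for (int j : groups[b]) s += aff[i][j];
                if (s > best) { best = s; best_a = (int)a; best_b = (int)b; }
            }
        if (best_a < 0) break;                                  // nothing left worth merging
        groups[best_a].insert(groups[best_a].end(),
                              groups[best_b].begin(), groups[best_b].end());
        groups.erase(groups.begin() + best_b);
    }
    return groups;
}

int main() {
    Affinity aff = {{0, 9, 1, 1}, {9, 0, 1, 1}, {1, 1, 0, 5}, {1, 1, 5, 0}};
    for (auto &g : cluster_cus(4, aff, 2)) {        // expected: {0,1} and {2,3}
        for (int cu : g) printf("CU%d ", cu);
        printf("| ");
    }
    printf("\n");
}
```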
Maybe the P-GEMM speedup in the 2020 paper was a result of reaping the benefits of private and global at the same time?
The remaining applications that don't perform optimally could be covered by bigger L1 caches overall.
If the shared GFX12.5+ LDS+L0 slab is configurable, then the L0 could be massive as long as that doesn't conflict with the LDS allocation.
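A rough sketch of how I picture such a configurable slab (total size and granularity are made up, GFX12.5 details aren't public): whatever the kernel doesn't claim as LDS simply becomes L0.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical: a unified per-WGP SRAM slab carved into LDS and L0 at some
// allocation granularity. Sizes and granularity are invented for illustration.
struct SlabConfig {
    static constexpr int slab_kib    = 256;  // assumed total LDS+L0 slab per WGP
    static constexpr int granule_kib = 16;   // assumed repartitioning granularity
    int lds_kib;
    int l0_kib;
};

// Give the kernel the LDS it asked for; everything left over becomes L0.
SlabConfig partition_slab(int lds_needed_kib) {
    int granules = (lds_needed_kib + SlabConfig::granule_kib - 1) / SlabConfig::granule_kib;
    SlabConfig cfg{};
    cfg.lds_kib = granules * SlabConfig::granule_kib;
    assert(cfg.lds_kib <= SlabConfig::slab_kib);     // LDS demand wins; L0 takes the rest
    cfg.l0_kib = SlabConfig::slab_kib - cfg.lds_kib;
    return cfg;
}

int main() {
    int requests[] = {0, 32, 64, 128};
    for (int lds : requests) {
        SlabConfig c = partition_slab(lds);
        printf("LDS request %3d KiB -> LDS %3d KiB, L0 %3d KiB\n", lds, c.lds_kib, c.l0_kib);
    }
}
```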
The bigger L2$ of RDNA 5 would for sure help as well, and its interconnect could double as a secondary/hierarchical, truly chip-global cache-sharing network: not writing back to L2$, but sharing directly between SE L1 cache clusters. Not sure the L2 interconnect and its bandwidth headroom would allow that, though.
In the 2021 paper there seems to be an inverse correlation between L2 reply BW and shared-L1 communication. In other words, the more an application benefits from the shared L1 (fewer L1 misses), the fewer requests get forwarded to the L2. Maybe the freed-up BW in the NoC is enough to satisfy the additional L1-to-L1 traffic.
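Quick sanity check of that accounting (rates invented, not numbers from the paper): every miss a peer L1 can satisfy is a request/reply pair that no longer touches the L2, so the new L1-to-L1 traffic is largely funded by the L2 reply bandwidth it frees.

```cpp
#include <cstdio>

int main() {
    // Invented rates, purely to illustrate the accounting, not data from the paper.
    const double l1_misses_per_cu_cycle = 0.30;  // assumed miss rate with private L1s
    const double line_bytes             = 64.0;

    double remote_hit_fracs[] = {0.0, 0.25, 0.50, 0.75};
    for (double f : remote_hit_fracs) {
        // Misses served by a peer L1 never reach the L2; they become L1<->L1 traffic instead.
        double l1_to_l1 = l1_misses_per_cu_cycle * f * line_bytes;
        double l2_reply = l1_misses_per_cu_cycle * (1.0 - f) * line_bytes;
        printf("remote-hit %2.0f%%: L2 reply %5.1f B/cyc/CU, L1<->L1 %5.1f B/cyc/CU\n",
               f * 100.0, l2_reply, l1_to_l1);
    }
}
```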
But I assume AMD would aim for a balanced approach, where power and area get reduced for all applications while most applications gain performance. Gaming would benefit from the reduced power draw, ML/AI from the increased performance.
Sounds about right. BTW here's another interesting patent for dataflow execution:
https://patents.google.com/patent/US11947487B2
As previously stated, there are many more patents that alter data-management, cache/memory, execution, and scheduling characteristics, at least some of which should find their way into actual products. Perhaps as soon as RDNA 5 if we're lucky.
Interesting optimisations
Another new cache system optimisation paper (2021):
https://jbk5155.github.io/publications/MICRO_2021.pdf
- Idle, unused I-cache and LDS capacity can now be used for the TLB. There's a massive IPC boost (>100%) in some applications and 30.1% on average. The paper also highlights the current issues with irregular memory access patterns and how this new approach counters them. I assume this would be beneficial to ray tracing and other irregular applications? If not in RDNA 5, then in later µarchs. The increased flexibility reminds me of another patent I shared:
https://patents.google.com/patent/US12265484B2
This one sounds a lot like the flexible cache in M3 and later: the VRF is a cache, and everything can be reconfigured to a different cache type depending on the specific needs of the workloads being executed.
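My toy mental model of the MICRO 2021 mechanism, for what it's worth (structures, promotion policy and the page-walk stub are mine, not the paper's): a second-chance TLB parked in whatever I-cache/LDS capacity the running kernel leaves idle.

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>

// Toy model only: a second-chance TLB living in idle I-cache/LDS banks.
// Structures, sizes and the promotion policy are invented, not the paper's.
struct SpareSramTlb {
    std::unordered_map<uint64_t, uint64_t> l1_tlb;     // small fixed hardware TLB
    std::unordered_map<uint64_t, uint64_t> spare_tlb;  // entries parked in idle I$/LDS capacity
    bool spare_enabled = true;                          // off when the kernel needs its LDS back

    uint64_t translate(uint64_t vpn) {
        if (auto it = l1_tlb.find(vpn); it != l1_tlb.end()) return it->second;
        if (spare_enabled) {
            if (auto it = spare_tlb.find(vpn); it != spare_tlb.end()) {
                l1_tlb[vpn] = it->second;               // promote on a spare-capacity hit
                return it->second;
            }
        }
        uint64_t ppn = page_walk(vpn);                  // the expensive path irregular apps hammer
        l1_tlb[vpn] = ppn;
        if (spare_enabled) spare_tlb[vpn] = ppn;        // extra reach beyond the hardware TLB
        return ppn;
    }

    static uint64_t page_walk(uint64_t vpn) { return vpn ^ 0xABCD; }  // stand-in for a real walk
};

int main() {
    SpareSramTlb tlb;
    printf("%llx\n", (unsigned long long)tlb.translate(0x1234));  // walks, fills both TLBs
    printf("%llx\n", (unsigned long long)tlb.translate(0x1234));  // now an L1 TLB hit
}
```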
And why stop there. Apparently even registers can bypass the L2 for inter-register data requests, beyond the WGP takeover mode employed in the PS5 Pro. Here it looks like a GPR analog to DSMEM (not an AMD patent):
https://patents.google.com/patent/US20210398339A1/en
Though I'm not sure if true LDS + VGPR global sharing à la the L1 sharing in the 2020 and 2021 papers (replication mitigation) is even possible; maybe some hybrid private/global implementation?
With SRAM scaling pretty much dead, business as usual won't work. So I'm really not surprised that AMD seems to be actively investigating basically every avenue to squeeze as much performance out of limited SRAM capacities as possible. It'll be interesting to see how this materializes in RDNA 5 and later.