Well, maybe more memory bandwidth simply isn't required:
- Revamped CUs and their respective low-level caches (bigger capacity)
- Out-of-order execution (increase hardware utilization of ALUs and cache)
- Maybe L0 cache sharing across multiple CUs (reduce wasted SRAM capacity, reduce LLC & DRAM bandwidth requirements)
- Universal compression (smaller memory footprint, reduce bandwidth requirements)
- DGF & DMM (smaller memory footprint, reduce bandwidth requirements)
- Neural techniques like NTC, which aim to reduce data fetching from DRAM and instead spend more compute on the matrix engines (whose performance mostly relies on the CUs' low-level caches) to generate or extract data and information
- Work graphs and procedural algorithms with dynamic execution at the CU level (reduces code footprints and bandwidth pressure on higher-level caches and DRAM)
All those things aim to maximize usage of low level CU resources, increase data locality and reduce load on higher level structures like LLC and DRAM.
It seems that there is much going on regarding rethinking GPU architecture as a whole.
This is a great summary but perhaps AMD wants to go even further than this. It really depends on just how clean slate GFX13 is.
#1: Maybe instead of bigger caches, a universal M3-esque flexible cache to maximize compute/area and minimize cachemem overhead.
#2: Hopefully without massive area overhead. The GhOST paper indicated this is plausible.
#3: Yeah, like the 2020 AMD paper. Flexible clustering and global/private management based on compiler and other hints. Combined with dataflow execution this could be a game-changer for ML. Much less pressure on L2 and VRAM. Maybe it could be expanded to other WGP caches and register files. Perhaps the WGP VGPR takeover mode Cerny talked about during the Road to PS5 Pro talk could be extended across multiple WGPs.
#5: This is probably a compute-for-cache tradeoff, but yeah, a considerable benefit to on-die cachemem usage.
#7: Hopefully well beyond that.
8 days ago NVIDIA published this research paper, suggesting it's possible to basically bolt dataflow execution onto existing architectures with only modest adjustments, although it is still far from feature complete (see section 7). Despite this, sizeable speedups and reductions in VRAM BW traffic were achieved for inference and training. And interestingly, section 8 clearly outlines how Kitsune leverages tile programming, mirroring recent moves with CUDA Tile, and also how it's generally far more applicable than Work Graphs, which are mostly limited to shader pipelines.
I'm bringing this up because AMD is already exploring a dataflow API paradigm shift with Work Graphs, so why not go all the way and implement sweeping changes on the HW and compiler side to fundamentally change how workloads are managed on GPUs. While Work Graphs might be a push for next-gen graphics API standardization, even with the impressive patent-derived optimizations I doubt they give anywhere close to the full picture of what GFX13 and later could be capable of in terms of compute and ML perf. They would probably need a brand new API and a clean-slate compiler to fully tap into this.
Some other considerations to reduce memory and cache pressure (far from complete):
Summarizing prev info:
- Decentralized and locally autonomous distributed scheduling and dispatch (less pressure on higher level caches)
- Mapping data accesses to exploit ^ (^)
- Leaf nodes (ray/tri): Prefiltering and DGF nodes = parallel fixed-point testers (increased intersections/kB of cache)
- Payload sorting (^^^)
- Deferred any-hit shaders (increased cache and ALU utilization)
New:
- TLAS and upper BLAS (ray/box): sorting rays into coherent bundles to be tested together to reduce redundant calculations (less cachemem overhead; rough sketch of the idea below)
- Sophisticated lookup tables to reuse expensive (transcendental) math and more general vector calculations (^; toy LUT sketch below)
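On the ray-bundling point, here's a minimal CPU-side C++ sketch of what I mean, purely illustrative (the names Ray, bundle_key, sort_into_bundles and the cell_size parameter are made up, and real RT hardware/drivers would obviously do this very differently): bucket rays by direction octant plus a coarse origin cell so rays that traverse roughly the same TLAS/upper-BLAS boxes get tested back to back and share node fetches.

```cpp
// Toy sketch: bin rays into coherent bundles by direction octant + quantized origin.
// Hypothetical host-side C++ for illustration only; all names are made up.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct Ray { float ox, oy, oz; float dx, dy, dz; };

// Key = 3-bit direction octant in the high bits, coarse (wrapping) origin cell below.
// Rays sharing a key tend to hit the same TLAS/upper-BLAS boxes, so one fetched
// node services many rays (less cache thrash, fewer redundant box tests).
static uint32_t bundle_key(const Ray& r, float cell_size) {
    uint32_t octant = (r.dx < 0 ? 1u : 0u) | (r.dy < 0 ? 2u : 0u) | (r.dz < 0 ? 4u : 0u);
    auto cell = [cell_size](float v) {
        return static_cast<uint32_t>(static_cast<int32_t>(std::floor(v / cell_size)) & 0x1FF);
    };
    return (octant << 27) | (cell(r.ox) << 18) | (cell(r.oy) << 9) | cell(r.oz);
}

// Sort ray indices so that consecutive rays form coherent bundles.
std::vector<uint32_t> sort_into_bundles(const std::vector<Ray>& rays, float cell_size) {
    std::vector<uint32_t> order(rays.size());
    for (uint32_t i = 0; i < order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(), [&](uint32_t a, uint32_t b) {
        return bundle_key(rays[a], cell_size) < bundle_key(rays[b], cell_size);
    });
    return order; // traverse rays in this order, e.g. a wave-sized bundle at a time
}
```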
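And on the lookup-table point, an equally hand-wavy sketch: precompute a small table for an expensive function (sin as a stand-in) and reuse it via linear interpolation, trading a few KB of on-die SRAM/cache for repeated transcendental evaluations. The SinLUT name, table size, range reduction and choice of function are arbitrary here; a real HW/driver table would pick all of these differently.

```cpp
// Toy sketch: small lookup table + linear interpolation standing in for a
// transcendental (sin over [0, 2*pi)). Illustration only.
#include <cmath>
#include <cstddef>
#include <vector>

class SinLUT {
public:
    explicit SinLUT(std::size_t entries = 1024) : table_(entries + 1) {
        for (std::size_t i = 0; i <= entries; ++i)
            table_[i] = std::sin(2.0f * kPi * static_cast<float>(i) / static_cast<float>(entries));
    }

    // Approximate sin(x) for any x: range-reduce into one period, then lerp
    // between the two nearest precomputed samples.
    float operator()(float x) const {
        float t = x / (2.0f * kPi);
        t -= std::floor(t);                              // wrap to [0, 1)
        float pos = t * static_cast<float>(table_.size() - 1);
        std::size_t i = static_cast<std::size_t>(pos);
        float frac = pos - static_cast<float>(i);
        return table_[i] * (1.0f - frac) + table_[i + 1] * frac;
    }

private:
    static constexpr float kPi = 3.14159265358979f;
    std::vector<float> table_;
};
```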
The tail end of Moore's Law demands that every stone be turned, and I just hope AMD has taken the bold route rather than the cautious one. We'll see in ~2027.