The Chiphell guy pretty much concurs with everything said so far, 4 dies, 2027 launch, very specifically no AT1.
Oh and also the doubling of CU size.
Can you share links to the specific comments? Couldn't find them.
Edit: Disputed info about cache hierarchy characteristics of NVIDIA and AMD. Don't take this as fact. Skip to #1201 for the quote from the Volta tuning guide.
The L0+LDS merger is long overdue imo.
NVIDIA merged the texture and shared caches into L1 with Turing. Maxwell/Pascal had a unified texture (really a dcache)/L1 cache plus separate shared memory. It seems the local SRAM stores define what a core is more than the actual processing blocks do. The NVIDIA SM has been partitioned into four processing blocks, each with its own warp scheduler, since Maxwell in 2014.
The RDNA4 CU is also split into two partitions, so within a WGP that's four in total. Merging the L0s into one would be enough to qualify the WGP as a CU, but AMD obviously wants a shared LDS+L0 like NVIDIA's.
This is simplified and might be misleading. It's likely not about merging separate caches but about permitting an underlying combined data cache to be more flexible. There's a slim chance of a flexible cache like Apple's M3, but even then the VGPR would still be partitioned per SIMD32, though the VGPR size could be variable.
Unless AMD shrinks the new CU partitions in half (aligning with Turing and later), I just can't see how MLID's CU numbers would be possible. For reference, using Ampere's bogus math, AMD currently has 256 "ROCm cores" per WGP while NVIDIA has 128 CUDA cores per SM.
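The "bogus math" above is easy to reproduce. The per-unit breakdown below is my own assumption of how that counting convention works (dual-issue FP32 counted twice, as NVIDIA did with Ampere):

```python
# Ampere-style "core" counting, reproducing the 256-vs-128 numbers above.
# The breakdown per unit is my assumption, not an official figure.

SIMD_LANES = 32

# RDNA 3/4 WGP: 2 CUs x 2 SIMD32 each, dual-issue FP32 counted twice.
cus_per_wgp, simds_per_cu, fp32_issue = 2, 2, 2
rocm_cores_per_wgp = cus_per_wgp * simds_per_cu * SIMD_LANES * fp32_issue

# Ampere SM: 4 partitions, each with 16 FP32 + 16 FP32/INT32 lanes.
partitions, lanes_per_partition = 4, 16 + 16
cuda_cores_per_sm = partitions * lanes_per_partition

print(rocm_cores_per_wgp, cuda_cores_per_sm)  # 256 128
```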
IDK how RDNA 4's CU partitions work, but based on the specs I'd think they can do either Wave64 or Wave32 x 2 (this might be incorrect). If RDNA 5 shrinks the partitions in half, then maybe AMD does something insane like Wave32 or Wave16 x 2 to brute-force PT and branchy code. Alternatively, if they don't shrink the CUs, maybe a Wave32 + Wave16 x 2 mode, or perhaps a Wave24 x 2 + Wave14 mode. They could get really creative here with a sophisticated warp scheduler coupled with local launcher functionality like in the Kepler patent, except even better: Wave32 x 2 / Wave16 x 4 in the big design, or Wave32 / Wave16 x 2 in the small design. The big-partition design (WGP = CU) has more flexibility here.
This is far fetched speculation so please don't take it too seriously.
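To picture what such a flexible scheduler would be choosing between, here's a toy enumeration of how a lane budget could be carved into waves of a given size set. The function, the 64-lane budget, and the {32, 16} size set are all my own illustration, not anything AMD has described:

```python
def wave_partitions(lanes, sizes=(32, 16)):
    """Enumerate non-increasing ways to split a lane budget into wave sizes.

    Purely illustrative: neither the lane budget nor the allowed
    sizes reflect any confirmed RDNA 5 capability.
    """
    if lanes == 0:
        return [[]]
    out = []
    for s in sizes:
        if s <= lanes:
            smaller = tuple(x for x in sizes if x <= s)  # keep order non-increasing
            out.extend([s] + rest for rest in wave_partitions(lanes - s, smaller))
    return out

print(wave_partitions(64))  # [[32, 32], [32, 16, 16], [16, 16, 16, 16]]
```

A hypothetical scheduler would pick among these per clock depending on occupancy and divergence.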
What are workgraphs inherently good at? Branchy code, so why not align the dispatch mode with this (fewer threads per warp)? If AMD fixed the scheduling (which would require significant area investment) this would be extremely beneficial to path tracing and branchy code in general. Perhaps workgraphs are good enough on their own, and then in addition to SWC, AMD could implement clever methods like ray coherency sorting, similar to Imagination Technologies' packet coherency gatherer, and avoid relying on a bandaid (thread coherency sorting).
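For anyone unfamiliar with what coherency sorting does, here's a toy sketch that bins rays by direction octant so each wave traverses similarly-pointed rays. This is my own minimal illustration, not Imagination's actual packet coherency gatherer, which bins on far richer keys and repacks dynamically:

```python
def octant(d):
    """Classify a ray direction by the sign of each component (8 octants)."""
    return (d[0] < 0) + ((d[1] < 0) << 1) + ((d[2] < 0) << 2)

def coherency_sort(ray_dirs):
    """Reorder ray indices so rays in the same octant end up adjacent.

    Toy illustration only: rays pointing the same way tend to walk the
    same BVH nodes, so packing them together reduces divergence.
    """
    bins = {}
    for i, d in enumerate(ray_dirs):
        bins.setdefault(octant(d), []).append(i)
    return [i for key in sorted(bins) for i in bins[key]]

rays = [(1, 1, 1), (-1, 1, 1), (1, 1, 1), (-1, -1, -1)]
print(coherency_sort(rays))  # [0, 2, 1, 3]
```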
What about the split register files and instruction caches?
- Ignore this, most likely misleading or false. I couldn't find any accurate info to suggest otherwise
Still wonder what will happen with the instruction and scalar caches. NVIDIA seems to have unified both, since they only mention an L0-i cache. Same thing for the VRF. I did see mentions of NVIDIA being scalar since Tesla (G80), and superscalar with Turing. GCN was a scalar and vector hybrid, and RDNA4 doesn't change anything here. Essentially NVIDIA is SIMT and AMD is scalar + SIMD. That explains why AMD has to maintain separate VGPR + SGPR files and separate scalar and instruction caches, while NVIDIA can get by with just an L0-i cache + VRF.
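A crude way to picture the scalar + SIMD split is to count register slots for a single "uniform base + per-lane offset" add. This is entirely my own toy model (the slot accounting is a simplification, and NVIDIA also has uniform-register hardware in newer architectures):

```python
WAVE = 32  # lanes per wave, assumed for illustration

def simt_add(base, offsets):
    """SIMT-style: a uniform value is (conceptually) replicated per lane in the VRF."""
    base_vreg = [base] * WAVE                   # one copy of the uniform per lane
    result = [b + o for b, o in zip(base_vreg, offsets)]
    return result, 2 * WAVE                     # slots: base vreg + offset vreg

def scalar_simd_add(base, offsets):
    """GCN/RDNA-style: the uniform lives once in an SGPR, offsets in a VGPR."""
    result = [base + o for o in offsets]        # SGPR broadcast into the SIMD op
    return result, 1 + WAVE                     # slots: 1 SGPR + offset VGPR

offsets = list(range(WAVE))
r1, slots1 = simt_add(100, offsets)
r2, slots2 = scalar_simd_add(100, offsets)
print(r1 == r2, slots1, slots2)  # True 64 33
```

Same result either way; the hybrid model just factors the uniform out of the vector file, which is why the separate SGPR file and scalar cache exist at all.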
As you can clearly see I don't understand the implications of any of this, so it's well above my level of comprehension, but I still think it deserves to at least be discussed.
RDNA5 is the biggest architectural change since GCN, so is there any chance they could theoretically go the NVIDIA route and go scalar instead of a vector + scalar hybrid? That would mean no more split VGPR + SGPR and no split instruction and scalar caches, right?
But wouldn't it also ruin BWC for consoles and be a huge pain for AMD's SW team? Plus the CDNA 4 LLVM code still mentions SIMD for GFX12.5.
Can someone please say no to ^
Please don't it's futile