Josh128
Golden Member
- Oct 14, 2022
> 50,000 x 1KW = Another half gigawatt of AMD AI infrastructure. The AI saga continues.

You are off by one order of magnitude: 50k * 1kW = 50 MW.
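For anyone who wants to double-check, the correction works out like this (a quick sketch, purely illustrative):

```python
# Sanity check: 50,000 accelerators at ~1 kW each.
units = 50_000
watts_per_unit = 1_000                    # 1 kW
total_mw = units * watts_per_unit / 1e6   # watts -> megawatts
print(total_mw, "MW")                     # 50.0 MW -- an order of magnitude short of "half a gigawatt" (500 MW)
```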
eyyyy it's a collaborative design with Meta.
> Workgroup clusters + globally accessible LDS

Work graphs can morph recursively by splitting and merging nodes. These nodes also share data between each other, so maybe one could assume this new inter-CU fabric is beneficial here as well?
> What about going one step further (as you mentioned "crackpot solution"):
> Shared L1 caches across one SE: https://adwaitjog.github.io/docs/pdf/sharedl1-pact20.pdf
> - L1 in this context would be the new unified L0/LDS of gfx1250+
> - DNN execution got accelerated by 2...3x

Think that's the most interesting HW paper I've ever read. Is it by NVIDIA? No, it's written by AMD researchers in 2020. The description in the patent aligns a lot with your Shared L1 paper.
> Jobs people, jobs

in this economy?
> in this economy?

You can always pick cherries and make Alibaba trinkets in the factory during the Big Beautiful Leap Forward™
might as well announce an OpenAI partnership
> Forget about weird coprocessor models.

Yeah + @basix's shared L1 makes this redundant, but only if RDNA 5 has it.
> Did somebody else already post this new AMD publication?

Nope.
> GPUs are pretty damn good at running GEMM and are really bad at doing RT.

Looks like RDNA 5 might address those weaknesses. Shifting the bottleneck from cache/mem to logic.
> No, it's simply that some people don't really see all the noise and blur and praise TAA and its derivatives despite the fact that they destroy image clarity, small details and make your head spin with the ghosting in motion.

Ever tried DLSS 4? When it launched even r/f**kTAA praised it as a TAA deblur filter. For non-TAA noise problems (Lumen, Nanite...) maybe some variant of an MLP as a future solution.
> Private per SE makes sense, because of all the high bandwidth connections you need to have ("wiring nightmare" as Huynh described it).

I don't think it has to be SE-private; it can be global like in the paper and the patent I linked. Not sure if this is practical, and it directly conflicts with Huynh, so it might just be per Shader Array. This shared L1 scheme makes the L2 less important; it then probably only needs to hold prefetched data.
> Detailed reporting and analysis moved to declutter thread. Available here:

you should probably make a substack and rant there.
> Private per SE makes sense, because of all the high bandwidth connections you need to have ("wiring nightmare" as Huynh described it).

From the 2020 paper: "Overall, the total power under DynEB is similar to baseline, with <1% reduction averaged across all evaluated applications"
And at some point further distances mean higher latencies and higher energy draw, which undermines efficiency.
And then the complexity of managing all those memory accesses.
You could theoretically share across a full chip. But as soon as you are thinking about disaggregated GPU chiplets - having that on e.g. MI300/M400 with stacked XCDs - keeping it local to just one SE makes more sense to me.
But as I said, 12 * 448kB is already >5 MByte of fast local cache. This is a huge increase compared to 128kB + 32kB + 512kB of e.g. RDNA3 LDS, L0 and L1. And the L1$ gets shared across 16 CU.
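The capacity comparison above is easy to verify (a sketch; the 12 x 448 kB figure is the rumored unified L0/LDS configuration from this thread, not a confirmed spec):

```python
# Rumored fast local storage: 12 units of 448 kB unified L0/LDS (assumption from the post).
rdna5_fast_kb = 12 * 448
# RDNA3 figures as listed in the post: LDS + L0 + L1.
rdna3_fast_kb = 128 + 32 + 512
print(rdna5_fast_kb, "kB")              # 5376 kB, i.e. > 5 MB
print(rdna5_fast_kb / 1024, "MB")       # 5.25 MB
print("increase: ~%.1fx" % (rdna5_fast_kb / rdna3_fast_kb))
```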
The L2$ is still important. SE-to-SE transfers, memory-side caching and probably global command processor accesses (ACEs as well?) will use the L2$.
Even more words. Impressive.
> Delegating as much work to individual SEs as possible. Repeating myself but this is clearly tailored to chiplets.

If AMD really did pursue the path of chiplet GPUs in the past, having a few remains of that in RDNA5 should make sense.
> If AMD really did pursue the path of chiplet GPUs in the past, having a few remains of that in RDNA5 should make sense.

Why do you talk as if chiplet gpu is a dead idea
And to be honest, they have chiplet GPUs already (MI300 series), which should benefit directly from big and shared caches.
And as you pointed out, there is even some super-linear scaling with more cache in case of bigger GPUs.
MI450 does maybe allow to share data between the XCDs in order to reap those benefits.
Giving SEs more autonomy generally helps in two ways:
- More localized cache accesses and therefore bandwidth efficiency. Crossing chiplet boundaries will cost something
- Better scaling to more SEs. If you can distribute work in a lightweight manner and the SE does the heavy lifting, more SEs should scale better due to localized cache accesses and scheduling. As long as there is enough work to do. This applies for both graphics and ML / HPC accelerators.
> Why do you talk as if chiplet gpu is a dead idea

because it is (for gaming GPU definitely)
> because it is (for gaming GPU definitely)

What is the technological issue here? Seems like AMD solved the software problem 🤔
> because it is (for gaming GPU definitely)

nope.
> nope.

alright, then wake me up when there's a new gaming GPU using chiplets
> Why do you talk as if chiplet gpu is a dead idea

Dead for RDNA5. Lives with CDNA5. Might get resurrected for RDNA6+.
What if AMD can resuscitate it for RDNA 6 or 7, etc.?
> I still like the idea of chiplet based gaming GPUs

They're win more devices for people with more money than sanity.
> I still like the idea of chiplet based gaming GPUs. And it would be cool if AMD can pull that off in the future

Bring back the days of the Radeon HD 7990
> They're win more devices for people with more money than sanity.

I'd argue that what makes the biggest difference for whether chiplet GPUs make sense or don't is silicon defect density. There have been serious concerns that as we scale to smaller litho, defect density grows until large chips become unmanufacturable. If you expect that, splitting your design into many smaller chiplets you can test separately can make it a lot more economical to make.
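The defect-density argument can be made concrete with the textbook Poisson yield model, Y = exp(-A·D0). This is only an illustrative sketch: the 0.2 defects/cm² figure and the die areas below are made up for the example, not real process numbers.

```python
import math

def poisson_yield(area_mm2: float, d0_per_cm2: float) -> float:
    """Fraction of defect-free dies under the simple Poisson yield model Y = exp(-A * D0)."""
    return math.exp(-(area_mm2 / 100.0) * d0_per_cm2)

D0 = 0.2  # illustrative defect density, defects per cm^2
print(f"600 mm^2 monolithic die: {poisson_yield(600, D0):.1%}")  # ~30.1%
print(f"150 mm^2 chiplet:        {poisson_yield(150, D0):.1%}")  # ~74.1%
# With known-good-die testing, each small chiplet yields ~74%, and only
# defective chiplets are discarded -- not an entire 600 mm^2 die.
```

The gap widens as defect density grows, which is exactly the scenario where splitting a big GPU into separately testable chiplets pays off.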
