oh my god you're going by the NV marketing blurbs.
it's over
See, no one is correcting you, so I'll assume you're correct, but I don't understand why. I'm very confused, I'm trying to grasp this, and let's just say the info online is not easy to find. Please read the entire comment before commenting.
I still don't understand how the cache architecture can be unchanged since Fermi when NVIDIA went out of their way to communicate changes to developers with Volta, and no doubt did the same for earlier changes like Maxwell's reworked SM cache programming model (IDK what else to call it). Here's an example (there's more) from the NVIDIA Volta tuning guide, which is written for SWEs, not marketing:
- The answer is hidden in plain sight: read the entire description of the cache system in the Volta tuning guide on NVIDIA's website and connect the dots. Previous comments now reflect this.
"Like Pascal, Volta combines the functionality of the L1 and texture caches into a unified L1/Texture cache which acts as a coalescing buffer for memory accesses, gathering up the data requested by the threads of a warp prior to delivery of that data to the warp.
Volta increases the maximum capacity of the L1 cache to 128 KB, more than 7x larger than the GP100 L1. Another benefit of its union with shared memory, the Volta L1 improves in terms of both latency and bandwidth compared to Pascal."
Link: https://docs.nvidia.com/cuda/volta-tuning-guide/index.html#unified-shared-memory-l1-texture-cache
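FWIW, that unification is even visible from the CUDA runtime API, not just the prose: Volta introduced a carveout hint that only makes sense if shared memory and L1 are partitions of one physical slab. A minimal sketch of what I mean (the kernel is just a hypothetical placeholder so the attribute has a symbol to attach to; the 50% value is an arbitrary example):

```cuda
#include <cuda_runtime.h>

// Hypothetical placeholder kernel, only here to carry the attribute.
__global__ void dummyKernel(float* out) {
    __shared__ float tile[1024];  // allocated out of the unified slab on Volta+
    tile[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x];
}

int main() {
    // Volta+ only: hint what fraction of the unified L1/shared slab should
    // be reserved as shared memory, with the remainder acting as L1. A knob
    // like this would be meaningless if they were physically separate arrays.
    cudaFuncSetAttribute(dummyKernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         50);  // request roughly a 50/50 split (a hint, not a guarantee)

    float* out;
    cudaMalloc(&out, 1024 * sizeof(float));
    dummyKernel<<<1, 1024>>>(out);
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}
```

cudaFuncAttributePreferredSharedMemoryCarveout arrived with CUDA 9 / compute capability 7.0 and is documented as a preference the driver may override, which again reads like software partitioning one physical structure.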
Are you saying that NVIDIA is communicating BS in this tuning guide, or is there something here that explains the discrepancy between the misleading SM diagram (marketing blurb) and the true HW implementation, which if I understand you correctly is unchanged since Fermi? Why would NVIDIA change the cache programming model for Maxwell and Volta if the HW is unchanged since Fermi? It makes no sense. Again, what am I missing here? Do both companies have one big slab with different separate datapaths, or is it something else entirely?
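On the Maxwell point, the programming-model change is concrete, as far as I can tell. On Fermi/Kepler you could pick the split of the single 64 KB L1/shared array per kernel; from Maxwell onward the same call still compiles and is accepted, but there is nothing left to repartition, because shared memory became a dedicated array and L1 merged into the texture cache. A rough sketch of the old knob (kernel name is a placeholder):

```cuda
#include <cuda_runtime.h>

__global__ void k(float* out) { out[threadIdx.x] = (float)threadIdx.x; }

int main() {
    // Fermi/Kepler: L1 and shared memory lived in one 64 KB array per SM,
    // and this call chose the split (e.g. 48 KB shared / 16 KB L1).
    // Maxwell/Pascal: shared memory got its own dedicated array and L1
    // merged with the texture cache, so this preference stopped doing anything.
    cudaFuncSetCacheConfig(k, cudaFuncCachePreferShared);

    float* out;
    cudaMalloc(&out, 32 * sizeof(float));
    k<<<1, 32>>>(out);
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}
```

If the underlying arrays had never changed, I don't see why this knob would have been introduced at one point and then quietly neutered at another.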
If none of the info NVIDIA is providing online is correct, then I assume AMD's WGP diagram is, by that extension, also extremely misleading. If so, how can we uncover the true architecture at the HW level, devoid of any abstractions and simplifications (marketing BS)? I couldn't find any publicly available info about this online, or even about what that would encompass.
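The only vendor-independent approach I'm aware of is microbenchmarking: pointer-chase loops whose working set you sweep across sizes, so that latency jumps reveal each real cache level and its capacity regardless of what the diagrams claim (this is what the "dissecting the architecture via microbenchmarking" papers do). A minimal sketch of the technique; the N and STRIDE values are placeholders you would sweep, not tuned numbers:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Pointer chase: every load depends on the previous one, so cycles/load
// approximates the latency of whichever cache level the working set
// (N * STRIDE * 4 bytes) currently fits in. Sweep N to find the levels.
__global__ void pchase(unsigned int* a, int iters, unsigned long long* out) {
    unsigned int j = 0;
    long long start = clock64();
    for (int i = 0; i < iters; ++i)
        j = a[j];                         // serialized dependent loads
    long long stop = clock64();
    out[0] = (unsigned long long)(stop - start);
    out[1] = j;                           // keep the chase from being optimized away
}

int main() {
    const int N = 4096, STRIDE = 16;      // 256 KB working set here; sweep this
    unsigned int* h = new unsigned int[N * STRIDE]();
    for (int i = 0; i < N; ++i)
        h[i * STRIDE] = (unsigned int)(((i + 1) % N) * STRIDE);  // circular chain

    unsigned int* d; unsigned long long* dout;
    cudaMalloc(&d, N * STRIDE * sizeof(unsigned int));
    cudaMalloc(&dout, 2 * sizeof(unsigned long long));
    cudaMemcpy(d, h, N * STRIDE * sizeof(unsigned int), cudaMemcpyHostToDevice);

    int iters = N * 4;                    // first lap warms the cache
    pchase<<<1, 1>>>(d, iters, dout);     // one thread: no parallelism, pure latency

    unsigned long long r[2];
    cudaMemcpy(r, dout, sizeof(r), cudaMemcpyDeviceToHost);
    printf("avg cycles/load: %.1f\n", (double)r[0] / iters);

    cudaFree(d); cudaFree(dout); delete[] h;
    return 0;
}
```

Plot cycles/load against working-set size and the plateaus are the real hierarchy, whatever the marketing diagrams say; the same idea ports to HIP for AMD.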
By extension, without intricate knowledge of the HW, the L0+LDS merger for CDNA5 is meaningless (can't be understood), and it's not something you can understand just by looking at "L0+LDS" (which implies shared). I mean, there has to be a HW-level change, because this functionality is for GFX12.5+ only and thus cannot be exposed via the HW we have right now.
If the following is inaccurate, then please tell me how AMD vs. NVIDIA could roughly be compared. There has to be a way to compare the cache systems of these architectures, right?
Do we agree that AMD's L0 vector cache is roughly equivalent to NVIDIA's texture cache (a misleading name since the GPGPU era), and that the L1 instruction cache in Maxwell-Pascal is roughly equivalent in functionality to the scalar and instruction caches in RDNA1-4?
For AMD that only leaves the LDS, and for NVIDIA (pre-Turing) the L1 and shared memory. So what's comparable here: LDS and shared memory, or LDS and shared memory + L1? Or is it just impossible to compare this and/or the other SM/CU cache systems between NVIDIA and AMD?
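At least at the source level, the pairing seems clear to me: the exact same __shared__ declaration compiles to SM shared memory under CUDA and to LDS under HIP, so LDS vs. shared memory is the programmer-visible equivalence even if the physical backing differs. A trivial block reduction as an illustration (CUDA shown; the HIP version is character-for-character the same kernel):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// The same source-level construct on both vendors:
//   CUDA: __shared__ -> SM shared memory (carved from the unified L1/SMEM
//         slab on Volta+, a dedicated array on Maxwell/Pascal)
//   HIP:  __shared__ -> the CU/WGP LDS (a dedicated scratchpad)
__global__ void blockReduce(const float* in, float* out) {
    __shared__ float tile[256];          // shared memory on NV, LDS on AMD
    int t = threadIdx.x;
    tile[t] = in[blockIdx.x * 256 + t];
    __syncthreads();
    for (int s = 128; s > 0; s >>= 1) {  // tree reduction within the block
        if (t < s) tile[t] += tile[t + s];
        __syncthreads();
    }
    if (t == 0) out[blockIdx.x] = tile[0];
}

int main() {
    const int nBlocks = 4, n = nBlocks * 256;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, nBlocks * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    blockReduce<<<nBlocks, 256>>>(in, out);
    cudaDeviceSynchronize();
    printf("block 0 sum: %.1f\n", out[0]);  // expect 256.0
    cudaFree(in); cudaFree(out);
    return 0;
}
```

Whether L1 also belongs in that comparison is exactly my question, since post-Volta it shares a slab with shared memory while the LDS (so far) does not.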
Also, I still don't understand why NVIDIA would remove the L1 instruction cache from the SM diagram when it's still there. They don't do this for data center GPUs and didn't do it pre-Turing, yet every single client GPU SM description since Turing shows no L1-i cache. I've also searched specifically for "L1 instruction cache Turing" and "L1-i Turing" and got zero hits. It makes no sense that this is a simple case of lying by omission. Please explain this discrepancy.
In case @adroc_thurston doesn't want to address my questions, is there anyone else who wants to give it a go?