Seems @Kepler_L2 was right on the money about VGPRs.
C++:
unsigned getTotalNumVGPRs(const MCSubtargetInfo *STI) {
  if (STI->getFeatureBits().test(FeatureGFX90AInsts))
    return 512;
  if (!isGFX10Plus(*STI))
    return 256;
  bool IsWave32 = STI->getFeatureBits().test(FeatureWavefrontSize32);
  if (STI->getFeatureBits().test(FeatureGFX11FullVGPRs))
    return IsWave32 ? 1536 : 768;
  return IsWave32 ? 1024 : 512;
}
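This code shows the wave32 VGPR file going to 1536 VGPRs * 32 lanes * 4 bytes = 192 KiB (+50% over the 1024 VGPRs, i.e. 128 KiB, returned without FeatureGFX11FullVGPRs).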
Since the number of VGPRs per bank has not changed (i.e. 256), this means a full 6 VGPR banks (1536 / 256) for full-blown dual x32 ALUs in one SIMD.
It also hints at a 1-cycle wave64 mode: it seems they can now gang together two adjacent VGPR banks (note that the number of VGPRs is halved when not in wave32) to form 3 banks of wave64 operands for full 1-cycle wave64.
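The arithmetic above can be double-checked at compile time. This is only a sketch: `vgprFileBytes` and the `k...` constants are made-up names (not LLVM APIs), and it assumes 4 bytes per lane per VGPR and 256 VGPRs per bank, as in the post.

```cpp
#include <cstdint>

// Hypothetical helper (not an LLVM function): bytes of VGPR storage,
// assuming every VGPR holds one 4-byte value per lane.
constexpr std::uint64_t vgprFileBytes(std::uint64_t numVGPRs,
                                      std::uint64_t lanes,
                                      std::uint64_t bytesPerLane = 4) {
  return numVGPRs * lanes * bytesPerLane;
}

// Values returned by getTotalNumVGPRs in the snippet above.
constexpr std::uint64_t kFullWave32 = 1536; // FeatureGFX11FullVGPRs, wave32
constexpr std::uint64_t kFullWave64 = 768;  // FeatureGFX11FullVGPRs, wave64
constexpr std::uint64_t kBaseWave32 = 1024; // GFX10+ without the feature

// Assumption taken from the post: 256 VGPRs per bank.
constexpr std::uint64_t kVGPRsPerBank = 256;

static_assert(vgprFileBytes(kFullWave32, 32) == 192 * 1024, "192 KiB");
static_assert(vgprFileBytes(kBaseWave32, 32) == 128 * 1024, "128 KiB");
// Wave64 has half the registers but twice the lanes: same storage.
static_assert(vgprFileBytes(kFullWave64, 64) == 192 * 1024, "192 KiB");
// 6 banks in wave32, 3 paired banks in wave64.
static_assert(kFullWave32 / kVGPRsPerBank == 6, "6 banks");
static_assert(kFullWave64 / kVGPRsPerBank == 3, "3 paired banks");
```

The wave64 case landing on exactly the same 192 KiB is what makes the "two banks ganged per operand" reading plausible, though the code alone does not prove it.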
C++:
unsigned getVGPRAllocGranule(const MCSubtargetInfo *STI,
                             Optional<bool> EnableWavefrontSize32) {
  if (STI->getFeatureBits().test(FeatureGFX90AInsts))
    return 8;
  bool IsWave32 = EnableWavefrontSize32 ?
                      *EnableWavefrontSize32 :
                      STI->getFeatureBits().test(FeatureWavefrontSize32);
  if (STI->getFeatureBits().test(FeatureGFX11FullVGPRs))
    return IsWave32 ? 24 : 12;
The allocation granule is also 24 in wave32 mode (2*2*6), and 12 in wave64.
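To make the granule concrete, here is a rough sketch of how a per-wave VGPR request would round up to it, and how many waves then fit by register budget alone. `roundUpToGranule` and `wavesByVGPRBudget` are hypothetical helpers, not LLVM functions. The sketch ignores any hardware cap on waves per SIMD (not shown in the snippet) and assumes a wave always gets at least one granule.

```cpp
// Hypothetical helper: round a VGPR request up to a multiple of the granule.
// Assumption: a request of 0 still occupies one granule.
constexpr unsigned roundUpToGranule(unsigned requested, unsigned granule) {
  return requested == 0 ? granule
                        : ((requested + granule - 1) / granule) * granule;
}

// Hypothetical helper: waves that fit when only the VGPR budget is limiting.
constexpr unsigned wavesByVGPRBudget(unsigned totalVGPRs, unsigned requested,
                                     unsigned granule) {
  return totalVGPRs / roundUpToGranule(requested, granule);
}

// Wave32 with FeatureGFX11FullVGPRs: 1536 VGPRs, granule 24.
static_assert(roundUpToGranule(100, 24) == 120, "100 rounds up to 120");
static_assert(wavesByVGPRBudget(1536, 100, 24) == 12, "12 waves fit");
// Wave64 with FeatureGFX11FullVGPRs: 768 VGPRs, granule 12.
static_assert(roundUpToGranule(100, 12) == 108, "100 rounds up to 108");
static_assert(wavesByVGPRBudget(768, 100, 12) == 7, "7 waves fit");
```

So a shader asking for 100 VGPRs would actually occupy 120 in wave32, and the bigger register file is what keeps the wave count up despite the coarser granule.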
Looks like N31 and N32 will be compute monsters. As expected, 11.0.2 and 11.0.3 (N33) don't have this feature and they go the VOPD route.
reviews.llvm.org
Another unique thing about N31 is native fp16 ops: vector registers 0-127 contain Lo and Hi 16-bit floats. In theory they can do 4x native fp16 ops (not matrix) per cycle per SIMD. This will be great for FSR kind of stuff.
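As a rough picture of the Lo/Hi split, the sketch below packs two fp16 bit patterns into one 32-bit register value. The bit layout (Lo in bits 15:0, Hi in bits 31:16) is my assumption, and `packHalves`, `loHalf` and `hiHalf` are hypothetical names, not anything from LLVM or the ISA docs.

```cpp
#include <cstdint>

// Assumption: the Lo half sits in bits 15:0 and the Hi half in bits 31:16.
constexpr std::uint32_t packHalves(std::uint16_t lo, std::uint16_t hi) {
  return static_cast<std::uint32_t>(lo) |
         (static_cast<std::uint32_t>(hi) << 16);
}

constexpr std::uint16_t loHalf(std::uint32_t reg) {
  return static_cast<std::uint16_t>(reg & 0xFFFFu);
}

constexpr std::uint16_t hiHalf(std::uint32_t reg) {
  return static_cast<std::uint16_t>(reg >> 16);
}

// IEEE 754 half-precision bit patterns: 1.0 is 0x3C00, 2.0 is 0x4000.
constexpr std::uint16_t kHalfOne = 0x3C00;
constexpr std::uint16_t kHalfTwo = 0x4000;

// Registers 0-127, two halves each: 256 addressable 16-bit operands per lane.
constexpr unsigned kHalfOperandsPerLane = 128 * 2;

static_assert(packHalves(kHalfOne, kHalfTwo) == 0x40003C00u, "packed value");
static_assert(loHalf(packHalves(kHalfOne, kHalfTwo)) == kHalfOne, "Lo half");
static_assert(hiHalf(packHalves(kHalfOne, kHalfTwo)) == kHalfTwo, "Hi half");
```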