My analysis of the Xbox Series X power supply estimates the 52 CU GPU at 1.825 GHz is drawing roughly 115-120 W. The Series X GPU die area is < 300 sq mm given the total die size of 360 sq mm for the Series X SoC. Given that we know the Series X GPU physically has 56 CU (52 active) laid out in a 2 SE, 4 SA config with 7 WGP / 14 CU per SA and a 320-bit GDDR6 memory bus, I can say with reasonable confidence that Navi 21 at 505 sq mm is 96 CU (4 SE, 8 SA, 6 WGP / 12 CU per SA) with a 384-bit memory bus. The dual-pipe graphics command processor should ensure performance scales linearly with the added SEs, SAs and CUs.
Given that the Render Back Ends are part of the SA in the RDNA1 architecture, I expect a similar design in RDNA2. That would give the 4 SE, 8 SA config on Navi 21 a total of 32 RBEs and 128 ROPs (4 RBEs per SA, 4 ROPs per RBE, as on Navi 10).
Navi 21 - 505 sq mm, 96 CU, 4 SE, 8 SA, 6 WGP / 12 CU per SA (8 x 12 = 96 CU), 384-bit GDDR6, 32 RBE (Render Back End), 128 ROPs. My math leads me to believe they can clock this GPU at a 2 GHz game clock at 260-270 W.
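The arithmetic behind that estimate can be sketched as follows. The FP32 formula is standard (64 lanes per CU, 2 ops per clock via FMA); the linear power scaling from the Series X figure is my own simplifying assumption, not anything AMD has published:

```python
# Back-of-the-envelope check of the Navi 21 conjecture above, scaling from
# the Series X reference point (52 active CUs @ 1.825 GHz, ~115-120 W).
# The linear power-scaling assumption is mine, not AMD's.

def fp32_tflops(cus: int, clock_ghz: float) -> float:
    """Peak FP32 throughput: 64 lanes per CU x 2 ops (FMA) per clock."""
    return cus * 64 * 2 * clock_ghz / 1000

# Known reference point: Xbox Series X GPU (~12.15 TFLOPS).
xsx_tflops = fp32_tflops(52, 1.825)

# Conjectured Navi 21.
n21_tflops = fp32_tflops(96, 2.0)

# Naive power scaling: assume GPU power grows linearly with CU count x clock.
# Real power rises faster than linearly with clock (voltage climbs too), so
# treat this as an optimistic lower bound.
xsx_gpu_watts = 117.5  # midpoint of the 115-120 W estimate
watts_per_cu_ghz = xsx_gpu_watts / (52 * 1.825)
n21_watts = watts_per_cu_ghz * 96 * 2.0

print(f"Series X: {xsx_tflops:.2f} TFLOPS")
print(f"Navi 21:  {n21_tflops:.2f} TFLOPS, ~{n21_watts:.0f} W (linear scaling)")
```

The linear scaling lands around 238 W for the GPU alone; adding board overhead (memory, VRM losses, fans) is what puts the conjectured card in the 260-270 W range.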
I present my conjecture at this point as well
512-bit Bus Width (spoiler has 384-bit Bus width as well)
384-bit Bus width
Based on latest commits
Highlights vs Navi 10
- New PCI device supporting HDMI over USB-C (From amdgpu commit)
- Additional SDMA engine supporting DMA via XGMI/IF (see spoiler/From amdgpu commit)
- Doubled shader Engines (4SE)
- MEC queues per pipe are reduced by half (From amdgpu commit), probably indicating that the shader array size will not be increased wrt N10.
- ME pipes doubled (From amdgpu commit), which could indicate each pipe feeding 2 SE.
- Reduced wavefronts per SIMD suggesting improved ILP and/or reduced latencies.
- Primitive binning support removed.
DMA Engines
Conjecture for RTRT based on patents
The issue with RTRT w/o HW acceleration is that the shader gets occupied for long periods while BVH intersection and traversal are done entirely in the shader ALUs (read: heavily tanked fps). Add to that the enormous bandwidth requirements due to the nature of BVH traversal.
These operations are very memory bandwidth intensive and have high occurrences of random accesses. For example, each ray may fetch over 24 different 64 byte nodes. These operations are also very arithmetic logic unit (ALU) and/or compute unit intensive. These ray traces suffer from very high divergence due to different traversal lengths, (where average wave utilization is 30%), are vector general purpose register (VGPR) use intensive, and waves waterfall frequently due to high probability of containing both triangle and box nodes.
Navi2x introduces HW acceleration for ray intersection alongside the texture filter unit in the CU, which makes use of all the necessary infrastructure of the CU, thereby reducing die area and complexity.
A fixed function BVH intersection testing and traversal (a common and expensive operation in ray tracers) logic is implemented on texture processors. This enables the performance and power efficiency of the ray tracing to be substantially improved without expending high area and effort costs. High bandwidth paths within the texture processor and shader units that are used for texture processing are reused for BVH intersection testing and traversal. In general, a texture processor receives an instruction from the shader unit that includes ray data and BVH node pointer information. The texture processor fetches the BVH node data from memory using, for example, 16 double word (DW) block loads. The texture processor performs four ray-box intersections and children sorting for box nodes and 1 ray-triangle intersection for triangle nodes. The intersection results are returned to the shader unit.
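To make the box-node step concrete, here is a sketch of the standard slab-method ray-AABB test that such an intersection engine evaluates in fixed function: one ray against the four child boxes of a node, with hits sorted so traversal visits the nearest child first. The code is my own illustration, not AMD's implementation:

```python
# Illustrative sketch (my code, not AMD's) of the slab-method ray-box test a
# fixed-function intersection engine performs: a box node holds four child
# AABBs, the unit tests one ray against all of them and returns the hits
# sorted by entry distance so traversal can descend into the nearest child.

def ray_aabb(origin, inv_dir, lo, hi):
    """Slab test: return the entry distance t, or None on a miss."""
    tmin, tmax = 0.0, float("inf")
    for o, inv, a, b in zip(origin, inv_dir, lo, hi):
        t1, t2 = (a - o) * inv, (b - o) * inv
        tmin = max(tmin, min(t1, t2))
        tmax = min(tmax, max(t1, t2))
    return tmin if tmin <= tmax else None

def intersect_box_node(origin, direction, children):
    """Test one ray against up to 4 child AABBs; sort hits by distance."""
    inv_dir = tuple(1.0 / d if d != 0.0 else float("inf") for d in direction)
    hits = []
    for idx, (lo, hi) in enumerate(children):
        t = ray_aabb(origin, inv_dir, lo, hi)
        if t is not None:
            hits.append((t, idx))
    return sorted(hits)  # nearest child first, like the engine's child sort

# A ray along +x: one box ahead of it (hit at t = 5), one behind it (miss).
children = [(( 5, -1, -1), ( 6, 1, 1)),
            ((-6, -1, -1), (-5, 1, 1))]
print(intersect_box_node((0, 0, 0), (1, 0, 0), children))  # [(5.0, 0)]
```

The triangle-node case is analogous but runs a single ray-triangle test; either way, only the compacted results go back to the shader unit.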
Each CU has 4 texture processors, which house the ray intersection engine; the traditional texture filter unit now sits inside the texture processor as well.
62 CUs @ 2 GHz can perform 0.5 trillion ray-triangle intersection tests per second or 2 trillion ray-box intersection tests per second.
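Those throughput figures fall out of the unit counts above, assuming (my inference) each texture processor retires one BVH node per clock:

```python
# Reproducing the throughput claim above. Assumption (mine): each of the 4
# texture processors per CU retires one BVH node per clock; a triangle node
# is 1 ray-triangle test, a box node is 4 ray-box tests.
CUS, TP_PER_CU, CLOCK_HZ = 62, 4, 2.0e9

nodes_per_sec = CUS * TP_PER_CU * CLOCK_HZ  # node tests per second
tri_tests_per_sec = nodes_per_sec * 1       # ~0.5 trillion/s
box_tests_per_sec = nodes_per_sec * 4       # ~2 trillion/s

print(f"{tri_tests_per_sec / 1e12:.3f}T ray-triangle tests/s")
print(f"{box_tests_per_sec / 1e12:.3f}T ray-box tests/s")
```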
The shader may or may not use the intersection engine for RT, but when the need arises the speedup is very significant.
At this point it is not explicit whether the shader unit waits for the intersection result or can do other work in the meanwhile.
CU w/ Texture Processor with Intersection engine
Memory and Bus width
Personally I think the 384-bit bus w/ 12 GB VRAM @ 16-17 Gbps is more likely. But still...
- 12 GB VRAM for a 384-bit bus or 16 GB for a 512-bit bus
- 16-17 Gbps GDDR6 (18 Gbps+ GDDR6 has signal integrity issues)
- 768-816 GB/s for the 384-bit bus or 1024 GB/s for the 512-bit bus
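The bandwidth numbers in that list are just bus width times per-pin data rate:

```python
# Checking the bandwidth bullets above: GDDR6 bandwidth in GB/s is
# bus width (bits) / 8, times the per-pin data rate in Gbps.
def gddr6_bandwidth_gbs(bus_bits: int, gbps: float) -> float:
    return bus_bits / 8 * gbps

for bus, rate in [(384, 16), (384, 17), (512, 16)]:
    print(f"{bus}-bit @ {rate} Gbps -> {gddr6_bandwidth_gbs(bus, rate):.0f} GB/s")
```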
Clocks and CUs
In general AMD will attempt to clock Navi2x as high as possible. The throughput of the geometry engine, primitive units, etc. increases with frequency, so the CUs have a better chance of staying occupied and irrelevant operations are discarded early in the pipeline.
However, compute shaders could still benefit from overall CU count.
This is an interesting balance to watch. Increasing CUs will not always help in all cases. Increasing clocks should always help so long as there is no bottleneck being hit somewhere.
Caches
The intersection engines and RTRT in general would benefit more from cache/BW increases rather than just pure ALU throughput increase.
There will be generous increases of the L0 and L2 caches. How L1 pans out would be interesting. The Scalar Data Cache and LDS may or may not see some increase.

L0, which houses the texture data, will also house the BVH data structure and should see a big bump in size. It is currently 16 KB. On consoles this value might not have grown much in order to keep die size from ballooning, but on desktop there is a good chance it will be greatly increased (64 KB?) to keep more data as close to the intersection engine as possible.
This should benefit regular non RTRT operations as well.
In addition, by utilizing the texture processor infrastructure, the large buffers for ray storage and BVH caching that are typically required in a hardware ray tracing solution are eliminated, as the existing VGPRs and texture cache can be used in their place.
L2 is globally accessible and this should be greatly increased as well to amplify the BW available and to minimize the trips to memory.
Navi1x uses 256 KB L2 slices. I would surmise we should see this raised to 512 KB per slice, which together with compression should help with BW. I hope the 512 KB per-slice limit is raised for Navi2x.
All of this caching would raise the die size considerably, more so than pure ALU fixed function blocks. I would surmise, Navi2x would use the die area for a lot of cache and not only for increasing CU count.
Other noteworthy things
- Improved MES (Micro Engine Scheduler), which is the HW scheduler. How this pans out with WDDM 2.7 remains to be seen.
- Large increase of DVFS modules.
- There is a new patent which describes how to selectively boost a number of CUs, this was for GPU compute and virtualization but if applied to Navi 2x would be an interesting concept.
- HINT-BASED FINE-GRAINED DYNAMIC VOLTAGE AND FREQUENCY SCALING IN GPUS (20200183485)

Abstract
A processing system dynamically scales at least one of voltage and frequency at a subset of a plurality of compute units of a graphics processing unit (GPU) based on characteristics of a kernel or workload to be executed at the subset. A system management unit for the processing system receives a compute unit mask, designating the subset of a plurality of compute units of a GPU to execute the kernel or workload, and workload characteristics indicating the compute-boundedness or memory bandwidth-boundedness of the kernel or workload from a central processing unit of the processing system. The system management unit determines a dynamic voltage and frequency scaling policy for the subset of the plurality of compute units of the GPU based on the compute unit mask and the workload characteristics.
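The mechanism the abstract describes can be sketched roughly as follows. All names, frequencies and thresholds here are mine for illustration; the patent only specifies the inputs (CU mask, workload hint) and that a per-subset DVFS policy comes out:

```python
# Illustrative sketch of the patent's idea: a system management unit takes a
# CU mask plus a workload-boundedness hint and picks a DVFS policy for just
# that subset of CUs. Names, clocks and thresholds are hypothetical.
from dataclasses import dataclass

@dataclass
class DvfsPolicy:
    frequency_mhz: int
    voltage_mv: int

def select_policy(cu_mask: int, hint: str) -> DvfsPolicy:
    """Pick per-subset DVFS settings from the workload hint.

    A compute-bound kernel benefits from boosting the masked CUs; a
    memory-bound one gains little from higher clocks, so those CUs can run
    lower and leave power headroom for the memory subsystem.
    """
    active_cus = bin(cu_mask).count("1")
    if hint == "compute_bound":
        # Fewer active CUs leave more power headroom to boost them.
        boost = 200 if active_cus <= 16 else 100
        return DvfsPolicy(frequency_mhz=1800 + boost, voltage_mv=1050)
    elif hint == "memory_bound":
        return DvfsPolicy(frequency_mhz=1400, voltage_mv=900)
    return DvfsPolicy(frequency_mhz=1800, voltage_mv=1000)

# A small compute-bound kernel pinned to 8 CUs gets the larger boost.
print(select_policy(0xFF, "compute_bound"))
```

Applied to Navi 2x, this is what selectively boosting a subset of CUs would look like from the driver side.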
Update:
Added some tidbits from Krteq.