Question Speculation: RDNA2 + CDNA Architectures thread


uzzi38

Platinum Member
Oct 16, 2019
2,626
5,927
146
All die sizes are within 5mm^2. The poster here has been right about some things in the past afaik, and to his credit was the first to say 505mm^2 for Navi21, which other people have since backed up. Even still though, take the following with a pinch of salt.

Navi21 - 505mm^2

Navi22 - 340mm^2

Navi23 - 240mm^2

Source is the following post: https://www.ptt.cc/bbs/PC_Shopping/M.1588075782.A.C1E.html
 

DiogoDX

Senior member
Oct 11, 2012
746
277
136
How many ROPs does the XBOX Series X have? Techpowerup lists 80, but I know they put estimated specs on the site and I couldn't find it in the official Microsoft announcement.
 

Stuka87

Diamond Member
Dec 10, 2010
6,240
2,559
136
How many ROPs does the XBOX Series X have? Techpowerup lists 80, but I know they put estimated specs on the site and I couldn't find it in the official Microsoft announcement.

We don't actually know. Most places seem to think it will be 64; TPU is the only place to put 80. 80 would be a strange number. I would expect it to be 64 or 96.
 
  • Like
Reactions: DiogoDX

Geranium

Member
Apr 22, 2020
83
101
61

Adored on both AMD and Nvidia next gen GPUs.
This rumor about AMD's GPU is pure redacted.
A big 64+ CU part will not clock as high as the PS5's GPU portion.
The reason the PS5's GPU clocks high is that it is a narrow design with only 36 CUs.
The Xbox SX's GPU is 52 CUs and it doesn't clock as high.
Also, AMD would go with a 4096-bit HBM bus before another 512-bit GDDR bus.

Also, 400W TBP?? The last AMD single-GPU card with a high TBP was the RX Vega 64 Liquid edition at 345W, and that was a limited edition. The highest AMD has gone for a normal card is 300W TBP.

We have a zero tolerance policy regarding profanity in the tech sub-forums.
Don't do it again.

Iron Woode

Super Moderator
 
Last edited by a moderator:

maddie

Diamond Member
Jul 18, 2010
4,740
4,674
136
This rumor about AMD's GPU is pure humanshit.
A big 64+ CU part will not clock as high as the PS5's GPU portion.
The reason the PS5's GPU clocks high is that it is a narrow design with only 36 CUs.
The Xbox SX's GPU is 52 CUs and it doesn't clock as high.
Also, AMD would go with a 4096-bit HBM bus before another 512-bit GDDR bus.

Also, 400W TBP?? The last AMD single-GPU card with a high TBP was the RX Vega 64 Liquid edition at 345W, and that was a limited edition. The highest AMD has gone for a normal card is 300W TBP.
I think you're misusing information

The PS5 & XBox speed difference is due to power not ability. Within a fairly large range, (wider + slower) > (narrower + faster)
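That (wider + slower) > (narrower + faster) point can be sketched with the usual first-order model: performance scales with CUs × clock, while power scales roughly with CUs × clock × voltage², and voltage has to rise with clock. The voltage-curve numbers below are invented purely for illustration, not measured data:

```python
# First-order illustration of (wider + slower) > (narrower + faster).
# Perf ~ CUs * clock; power ~ CUs * clock * V^2, with V rising with clock,
# so power grows roughly cubically with clock. All V/f numbers are made up.
def perf(cus, ghz):
    return cus * ghz

def power(cus, ghz, base_ghz=1.5, base_v=0.9, v_per_ghz=0.15):
    v = base_v + v_per_ghz * (ghz - base_ghz)  # toy voltage/frequency curve
    return cus * ghz * v ** 2

wide = (52, 1.825)   # Series X-like: wide and slower
narrow = (36, 2.23)  # PS5-like: narrow and faster
for name, (cus, ghz) in [("wide", wide), ("narrow", narrow)]:
    print(name, f"perf={perf(cus, ghz):.1f}", f"power={power(cus, ghz):.1f}")
```

With these toy numbers the wide config delivers more perf at similar power, i.e. better perf/watt, which is the point being made.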
 

kurosaki

Senior member
Feb 7, 2019
258
250
86
You have to keep in mind that the consoles' power budget is quite limited, and within a quite small compartment. Both CPU and GPU are in the same package, so it not only has to cool a 100W CPU, but also X watts of GPU. That's almost 100 watts extra you could put into a discrete GPU.
 

uzzi38

Platinum Member
Oct 16, 2019
2,626
5,927
146
And power is unlimited on desktop?
It's far less limited than what you get in a console.

The Series X is working with a 255W budget for the entire console: the SoC (with sustained frequencies regardless of workload, so the Zen 2 cores need to be capable of 3.6GHz in a full AVX workload, for example, AND the GPU needs to be capable of 1825MHz under full load), 10 GDDR6 modules (~30W) and two 4W NVMe drives, whilst also accounting for 12V rail power efficiency losses, differences in silicon quality, and on top of that some extra budget as a failsafe just in case.

I'm 100% certain a 64CU GPU at over 2GHz would easily fit into a very manageable form factor for desktops, and maybe even 80CUs at around 2GHz could fit in as well. But that depends on how well RDNA2 can scale up.
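A rough sketch of how that 255W budget splits up, using the figures above; the 90% rail efficiency is an assumed number, not something from this post:

```python
# Rough budget sketch: total console budget minus conversion losses,
# memory and storage leaves the SoC's share (before safety margin).
# The 90% 12V rail efficiency is an assumption for illustration.
total_w = 255
gddr6_w = 30             # ~10 GDDR6 modules at roughly 3W each
nvme_w = 2 * 4           # two 4W NVMe drives
rail_efficiency = 0.90   # assumed 12V conversion efficiency

delivered = total_w * rail_efficiency
soc_budget = delivered - gddr6_w - nvme_w
print(f"~{soc_budget:.0f}W left for the entire SoC before failsafe margin")
```

That leaves on the order of 190W for CPU + GPU combined, which is consistent with the GPU-alone estimates discussed later in the thread.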
 

raghu78

Diamond Member
Aug 23, 2012
4,093
1,475
136
It's far less limited than what you get in a console.

The Series X is working with a 255W budget for the entire console: the SoC (with sustained frequencies regardless of workload, so the Zen 2 cores need to be capable of 3.6GHz in a full AVX workload, for example, AND the GPU needs to be capable of 1825MHz under full load), 10 GDDR6 modules (~30W) and two 4W NVMe drives, whilst also accounting for 12V rail power efficiency losses, differences in silicon quality, and on top of that some extra budget as a failsafe just in case.

I'm 100% certain a 64CU GPU at over 2GHz would easily fit into a very manageable form factor for desktops, and maybe even 80CUs at around 2GHz could fit in as well. But that depends on how well RDNA2 can scale up.

My analysis of the Xbox Series X power supply estimates the 52 CU GPU at 1.825 GHz is drawing roughly 115-120W. The Series X GPU die area is < 300 sq mm, given the total die size of 360 sq mm for the Series X SoC. Given that we know the Series X GPU has 56 CUs laid out in a 2 SE, 4 SA config with 7 WGP / 14 CU per SA and a 320-bit GDDR6 memory bus, I can say with reasonable confidence that Navi 21 at 505 sq mm is 96 CU (4 SE, 8 SA, 6 WGP / 12 CU per SA) with a 384-bit memory bus. The dual-pipe graphics command processor should ensure the performance scales linearly with the added SEs, SAs and CUs.

RDNA2 Sienna Cichlid.jpg

Given how the Render Back Ends are part of the SA in the RDNA1 architecture, I expect a similar design in RDNA2. That would mean the 4 SE, 8 SA config on Navi 21 has 32 RBEs and 128 ROPs.

Navi 10-block-diagram.jpg


Navi 21 - 505 sq mm, 96 CU, 4 SE, 8 SA, 6 WGP / 12 CU per SA (8 x 12 = 96 CU), 384-bit GDDR6, 32 RBE (Render Back End), 128 ROPs. My math leads me to believe they can clock this GPU at a 2GHz game clock at 260W - 270W.
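The configuration arithmetic above can be checked with a quick sketch. The per-SA WGP and RBE counts (and 4 ROPs per RBE) are the speculative numbers from this post, not confirmed specs:

```python
# Sketch of the Navi 21 config arithmetic above. Per-SA WGP/RBE counts
# and 4 ROPs per RBE are speculative figures from the post, not specs.
def gpu_config(shader_engines, arrays_per_se, wgp_per_sa, rbe_per_sa, rops_per_rbe=4):
    shader_arrays = shader_engines * arrays_per_se
    cus = shader_arrays * wgp_per_sa * 2   # each WGP contains 2 CUs
    rbes = shader_arrays * rbe_per_sa
    rops = rbes * rops_per_rbe
    return {"SAs": shader_arrays, "CUs": cus, "RBEs": rbes, "ROPs": rops}

navi21 = gpu_config(shader_engines=4, arrays_per_se=2, wgp_per_sa=6, rbe_per_sa=4)
print(navi21)  # {'SAs': 8, 'CUs': 96, 'RBEs': 32, 'ROPs': 128}
```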
 

Geranium

Member
Apr 22, 2020
83
101
61
It's far less limited than what you get in a console.

The Series X is working with a 255W budget for the entire console: the SoC (with sustained frequencies regardless of workload, so the Zen 2 cores need to be capable of 3.6GHz in a full AVX workload, for example, AND the GPU needs to be capable of 1825MHz under full load), 10 GDDR6 modules (~30W) and two 4W NVMe drives, whilst also accounting for 12V rail power efficiency losses, differences in silicon quality, and on top of that some extra budget as a failsafe just in case.

I'm 100% certain a 64CU GPU at over 2GHz would easily fit into a very manageable form factor for desktops, and maybe even 80CUs at around 2GHz could fit in as well. But that depends on how well RDNA2 can scale up.
I am not saying that 2GHz is not possible on 64+ CUs. I was talking about Adored's claim of 2700MHz+ on 80 CUs.

It seems that grey doesn't exist anymore.
??

With Nvidia supposedly going for 350 watts it doesn't matter very much; the only question is if they can properly cool it.
GA100 is much bigger than the die area of the supposed Big Navi die. And it will be cooled by a very expensive server cooler which is not available to the average customer.
 

Konan

Senior member
Jul 28, 2017
360
291
106
I am not saying that 2GHz is not possible on 64+ CUs. I was talking about Adored's claim of 2700MHz+ on 80 CUs.

Adored debunked and retracted the 2.7GHz, so we might as well drop talking about that speed; it's not going to happen, made-up fairy stuff. He said the information was wrong. Same with the 400W.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,602
5,788
136
My analysis of the Xbox Series X power supply estimates the 52 CU GPU at 1.825 GHz is drawing roughly 115-120W. The Series X GPU die area is < 300 sq mm, given the total die size of 360 sq mm for the Series X SoC. Given that we know the Series X GPU has 56 CUs laid out in a 2 SE, 4 SA config with 7 WGP / 14 CU per SA and a 320-bit GDDR6 memory bus, I can say with reasonable confidence that Navi 21 at 505 sq mm is 96 CU (4 SE, 8 SA, 6 WGP / 12 CU per SA) with a 384-bit memory bus. The dual-pipe graphics command processor should ensure the performance scales linearly with the added SEs, SAs and CUs.

Given how the Render Back Ends are part of the SA in the RDNA1 architecture, I expect a similar design in RDNA2. That would mean the 4 SE, 8 SA config on Navi 21 has 32 RBEs and 128 ROPs.

Navi 21 - 505 sq mm, 96 CU, 4 SE, 8 SA, 6 WGP / 12 CU per SA (8 x 12 = 96 CU), 384-bit GDDR6, 32 RBE (Render Back End), 128 ROPs. My math leads me to believe they can clock this GPU at a 2GHz game clock at 260W - 270W.
I present my conjecture at this point as well :)

512-bit Bus Width (spoiler has 384-bit Bus width as well)

1592068212903.png


384-bit Bus width
1592068235493.png

Based on latest commits

Highlights vs Navi 10
  • New PCI device supporting HDMI over USB-C (From amdgpu commit)
  • Additional SDMA engine supporting DMA via XGMI/IF (see spoiler/From amdgpu commit)
  • Doubled shader Engines (4SE)
    • MEC queues per pipe are reduced by half (From amdgpu commit), probably indicating that the shader array size will not be increased wrt N10.
    • ME pipes doubled (From amdgpu commit) which could probably indicate each pipe feeding 2 SE
  • Reduced wavefronts per SIMD suggesting improved ILP and/or reduced latencies.
  • Primitive binning support removed.
DMA Engines
1592068464351.png

Conjecture for RTRT based on patents
The issue with RTRT without HW acceleration is that the shader gets occupied for long periods while BVH intersection and traversal are done fully in the shader ALUs (read: heavily tanked fps). Add to that the enormous bandwidth requirements due to the nature of BVH traversal.
These operations are very memory bandwidth intensive and have high occurrences of random accesses. For example, each ray may fetch over 24 different 64 byte nodes. These operations are also very arithmetic logic unit (ALU) and/or compute unit intensive. These ray traces suffer from very high divergence due to different traversal lengths, (where average wave utilization is 30%), are vector general purpose register (VGPR) use intensive, and waves waterfall frequently due to high probability of containing both triangle and box nodes.
Navi2x introduces HW acceleration for ray intersection alongside the texture filter unit in the CU, which makes use of all the necessary infrastructure of the CU, thereby reducing die area and complexity.
A fixed function BVH intersection testing and traversal (a common and expensive operation in ray tracers) logic is implemented on texture processors. This enables the performance and power efficiency of the ray tracing to be substantially improved without expanding high area and effort costs. High bandwidth paths within the texture processor and shader units that are used for texture processing are reused for BVH intersection testing and traversal. In general, a texture processor receives an instruction from the shader unit that includes ray data and BVH node pointer information. The texture processor fetches the BVH node data from memory using, for example, 16 double word (DW) block loads. The texture processor performs four ray-box intersections and children sorting for box nodes and 1 ray-triangle intersection for triangle nodes. The intersection results are returned to the shader unit.

Each CU has 4 Texture processors which houses the ray intersection engine, the traditional texture filter unit will now be inside the texture processor.

The texture processor performs four ray-box intersections and children sorting for box nodes and 1 ray-triangle intersection for triangle nodes.
62CUs@2GHz can perform 0.5 Trillion ray-triangle intersection tests per second or 2 Trillion ray-box intersection tests per second

The shader may or may not use the intersection engine for RT, but when the need arises to use it the speedup is very significant.
At this point it is not explicit whether the shader unit waits for the intersection result or can do something in the meanwhile.
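The throughput figures above work out if each CU carries 4 texture processors, each doing 1 ray-triangle or 4 ray-box tests per clock. A back-of-the-envelope sketch of those numbers (speculative figures from this thread, not confirmed hardware specs):

```python
# Back-of-envelope check of the intersection throughput figures above.
# Assumes 4 texture processors per CU, each performing 1 ray-triangle
# test or 4 ray-box tests per clock -- speculative thread numbers.
def intersection_rate(cus, clock_ghz, tp_per_cu=4, box_tests_per_tp=4):
    clocks_per_sec = clock_ghz * 1e9
    tri = cus * tp_per_cu * clocks_per_sec                      # ray-triangle tests/s
    box = cus * tp_per_cu * box_tests_per_tp * clocks_per_sec   # ray-box tests/s
    return tri, box

tri, box = intersection_rate(cus=62, clock_ghz=2.0)
print(f"{tri/1e12:.2f}T ray-tri/s, {box/1e12:.2f}T ray-box/s")  # ~0.50T and ~1.98T
```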

CU w/ Texture Processor with Intersection engine
1592068677094.png
Memory and Bus width
Personally I think the 384 Bit Bus w/ 12 GB VRAM @16-17 Gbps is more likely. But still...
  • 12GB VRAM for 384 Bit Bus or 16GB for 512 Bit Bus
  • 16-17 Gbps GDDR6 (18 Gbps+ GDDR6 has signal integrity issues)
    • 768-816 GB/s for 384 Bit or 1024 GB/s for 512 Bit bus
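Those bandwidth figures follow directly from bus width times per-pin data rate; the 16-17 Gbps rates are the assumptions stated above:

```python
# GDDR6 bandwidth: bus width (bits) / 8 * per-pin data rate (Gbps) -> GB/s.
def gddr6_bandwidth_gbs(bus_bits, gbps_per_pin):
    return bus_bits / 8 * gbps_per_pin

print(gddr6_bandwidth_gbs(384, 16))  # 768.0 GB/s
print(gddr6_bandwidth_gbs(384, 17))  # 816.0 GB/s
print(gddr6_bandwidth_gbs(512, 16))  # 1024.0 GB/s
```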

Clocks and CUs
In general AMD will attempt to clock Navi2x as high as possible. The throughput of the geometry engine, primitive units etc. increases with frequency, so the CUs have a better chance of being kept occupied and irrelevant operations are discarded earlier in the pipeline.
However, compute shaders could still benefit from overall CU count.
This is an interesting balance to watch. Increasing CUs will not always help in all cases. Increasing clocks should always help so long as there isn't a bottleneck being hit somewhere.

Caches
The intersection engines, and RTRT in general, would benefit more from cache/BW increases than from a pure ALU throughput increase.
There will be generous increases in the L0/L2 caches. It will be interesting to see how L1 pans out.
The scalar data cache and LDS may or may not see some increase.

L0, which houses the texture data, will house the BVH data structure and should see a big bump in size. It is currently at 16KB.
On consoles this value might not have been increased much in order to keep the die size from ballooning, but on desktop there is a good chance it will be greatly increased (64KB?) to allow more data to be as close to the intersection engine as possible.
This should benefit regular non RTRT operations as well.
In addition, by utilizing the texture processor infrastructure, the large buffers for ray storage and BVH caching that are typically required in a hardware raytracing solution are eliminated, as the existing VGPRs and texture cache can be used in their place.

L2 is globally accessible and should be greatly increased as well, to amplify the available BW and minimize trips to memory.
Navi1x uses 256 KB L2 slices.
I would surmise we should see this raised to 512KB per slice, which together with compression should help with BW. I hope the 512KB-per-slice limit is raised for Navi2x.

All of this caching would raise the die size considerably, more so than pure ALU fixed function blocks. I would surmise, Navi2x would use the die area for a lot of cache and not only for increasing CU count.

Other noteworthy things
  • Improved MES (Micro Engine Scheduler), which is the HW scheduler. How this pans out with WDDM 2.7 remains to be seen.
  • Large increase of DVFS modules.
    • There is a new patent which describes how to selectively boost a number of CUs, this was for GPU compute and virtualization but if applied to Navi 2x would be an interesting concept.
      • HINT-BASED FINE-GRAINED DYNAMIC VOLTAGE AND FREQUENCY SCALING IN GPUS
        20200183485 Abstract
        A processing system dynamically scales at least one of voltage and frequency at a subset of a plurality of compute units of a graphics processing unit (GPU) based on characteristics of a kernel or workload to be executed at the subset. A system management unit for the processing system receives a compute unit mask, designating the subset of a plurality of compute units of a GPU to execute the kernel or workload, and workload characteristics indicating the compute-boundedness or memory bandwidth-boundedness of the kernel or workload from a central processing unit of the processing system. The system management unit determines a dynamic voltage and frequency scaling policy for the subset of the plurality of compute units of the GPU based on the compute unit mask and the workload characteristics.
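A toy sketch of the policy flow that abstract describes; the function, clock and voltage numbers below are hypothetical, purely to illustrate the idea of picking a V/F policy for a masked subset of CUs from a workload hint:

```python
# Hypothetical illustration of the hint-based per-CU DVFS idea from the
# patent abstract: choose a voltage/frequency policy for the CUs named
# by a mask, based on whether the kernel is compute- or bandwidth-bound.
# All clock/voltage values here are invented for illustration.
def dvfs_policy(cu_mask, compute_bound_hint, total_cus=32):
    """cu_mask: bitmask of CUs assigned to the kernel.
    compute_bound_hint: True if compute-bound, False if bandwidth-bound."""
    active_cus = [i for i in range(total_cus) if cu_mask & (1 << i)]
    if compute_bound_hint:
        # Compute-bound: boost frequency/voltage on the active subset.
        return {"cus": active_cus, "freq_mhz": 2200, "voltage_mv": 1100}
    # Bandwidth-bound: lower CU clocks, power is better spent on memory.
    return {"cus": active_cus, "freq_mhz": 1500, "voltage_mv": 850}

print(dvfs_policy(0b1111, compute_bound_hint=True))
```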

Update:
Added some tidbits from Krteq.
 
Last edited:

Krteq

Senior member
May 22, 2015
991
671
136
You can also add info from the RadeonSI MESA commits:

ac_gpu_info.c
Code:
if (info->chip_class >= GFX10_3)
    info->max_wave64_per_simd = 16;
else if (info->chip_class == GFX10)
    info->max_wave64_per_simd = 20;
else if (info->family >= CHIP_POLARIS10 && info->family <= CHIP_VEGAM)
    info->max_wave64_per_simd = 8;
gitlab.freedesktop.org/mesa - ac,radeonsi: start adding support for gfx10.3

There is a 16-entry wavefront controller per SIMD for Navi 2 now (20 entries for Navi, 8 for Polaris/Vega). It seems they changed this to mask latencies etc.
 

tajoh111

Senior member
Mar 28, 2005
298
312
136
The 2.7+GHz was unbelievable to begin with. I don't think I have seen a 5700 XT get past 2.5GHz on LN2.

I think people's hype went into overdrive because of the PS5 boost clocks. But people should be wary of these clocks, since they are generally the least important ones. People with more hardware knowledge typically want actual sustained clocks, base clocks or all-core clocks. That is the clock Microsoft published for their Xbox Series X.

What has almost certainly happened is that Sony was caught off guard by the Series X CU advantage. As a result, they are publishing only the boost clock, which almost no company does.

Nvidia, Intel and AMD always list their base clock and will sometimes even forget to list their boost clocks. What I suspect is that the PS5's clocks in actual use are closer to the Xbox Series X's, e.g. 1.90-2GHz.

Publishing a 1.9GHz speed with a 2304-shader part would mean the PS5 would only be 8.7 TFLOPs, which would make the PS5's compute rating more similar to the Xbox One X than the Series X (the IPC of Navi 2 would not be taken into account by the general public), and it would be humiliating for Sony and kill their hype, since their consoles are launching at the same time.

As a result, Sony has hidden their base clock, not mentioned anything about typical clocks, and only mentioned their boost clock, which still shows a significant inferiority in compute compared to the Series X. And so they have got the media to mention the PS5's SSD advantage and speed at every chance, even getting articles focused entirely on SSDs at strategic times. But the PS5 is likely going to be a tad underpowered compared to the Xbox Series X, and it showed in some of the demos.

One thing you might have noticed with the PS5 demos a couple of days ago was that the FPS was kind of lacking, particularly in the Resident Evil demo below.


Pragmata was also on the stuttery side.

I think most of the performance per watt gain in RDNA 2 is going to come from having a wider design and using the maturity of 7nm for power savings, particularly for Big Navi, since big chips have more leakage, which means power climbs quickly with clocks.

What I would guess is that Big Navi is 300 watts and clocked around 1.9-2GHz: a boost clock of around 2.1GHz and a base clock of 1.8GHz. Basically 2x as fast as a 5700 XT, but at 300 watts, which would equate to AMD's stated improvement in performance per watt.

2/(300/225) = 1.5, or a 50% improvement in performance per watt.
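That arithmetic is simple enough to sketch; the 2x performance and 300W vs 225W figures are the guesses above, not known specs:

```python
# Perf/watt scaling: relative performance divided by relative power.
# Guessed figures: 2x the 5700 XT's performance at 300W vs its 225W.
def perf_per_watt_gain(perf_ratio, new_watts, old_watts):
    return perf_ratio / (new_watts / old_watts)

gain = perf_per_watt_gain(perf_ratio=2.0, new_watts=300, old_watts=225)
print(f"{(gain - 1) * 100:.0f}% better perf/watt")  # prints "50% better perf/watt"
```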
 

Geranium

Member
Apr 22, 2020
83
101
61
Adored debunked and retracted the 2.7GHz, so we might as well drop talking about that speed; it's not going to happen, made-up fairy stuff. He said the information was wrong. Same with the 400W.
I am not promoting that number. I am saying why that number is not possible for that kind of CU count.
But a lot of people still believe those numbers and that power consumption figure.
 
  • Like
Reactions: Konan

Geranium

Member
Apr 22, 2020
83
101
61
Whom exactly? None here as far as I can see. Be careful of channeling Don Quixote.
You will see them after the launch of the card with a reasonable clock speed and TBP, just like the 5GHz and $99 rumors after the Matisse launch.
Note: Matisse did launch at $99, but it was nearly one year later and with 4C/8T, not what the rumor suggested.
 

soresu

Platinum Member
Dec 19, 2014
2,660
1,860
136
I think most of the performance per watt gain in RDNA 2 is going to come from having a wider design and using the maturity of 7nm for power savings, particularly for Big Navi, since big chips have more leakage, which means power climbs quickly with clocks.
Indications from the latest slide imply that simplified logic + IPC gains will combine to give that 50% perf/watt bump, likely also combined with minor process gains.

Simplified logic presumably giving more clock per watt, and IPC giving more FPS per FLOP.

With the IPC gains of the Vega iteration in Renoir added to a more refined RDNA, the 50% efficiency improvement would not surprise me at all - AMD already stated publicly at Renoir launch that those gains will be rolled into RDNA2.
 
Last edited:

Olikan

Platinum Member
Sep 23, 2011
2,023
275
126
I think most of the performance per watt gain in RDNA 2 is going to come from having a wider design and using the maturity of 7nm for power savings, particularly for Big Navi, since big chips have more leakage, which means power climbs quickly with clocks.
Nope, we know both consoles' clocks and CU counts; no RDNA1 GPU could reach those numbers... even if it were ported to 5nm, it doesn't add up.