
Question Speculation: RDNA2 + CDNA Architectures thread


andermans

Member
Sep 11, 2020
71
65
51
IPC comparison between 40CU: https://www.computerbase.de/2021-03/amd-radeon-rdna2-rdna-gcn-ipc-cu-vergleich/

Somehow, RDNA2 is behind RDNA1 in most cases.
Is it just immature drivers?
I think the expectation is mostly that GPUs don't necessarily increase IPC. Consider that IPC here is not really instructions per clock for the whole device but instructions per clock per core. So suddenly what you call a core matters. On CPUs a prime way to increase IPC is to add execution units, but on GPUs it is often easier to just go to more cores so the IPC for a given core doesn't increase that much.

Changes that could increase performance (I'd rather not use instructions because with all the fixed function hardware it is a really bad indicator) per clock are:

1) New accelerators like raytracing. The problem with measuring this is that nobody seems to have supported the software path you'd use before hardware support arrived.
2) Non-shader features like variable rate shading or mesh shading.
3) Improvements in the memory hierarchy.
4) Maybe some improvements in, say, branch prediction.

I don't think AMD talked about anything like 4, and it looks like 3 has been traded off against a smaller bus, so that is a win-some-lose-some situation. That leaves 1 and 2, which are in a situation like AVX-512 tends to be on the CPU side: if you have programs that make use of it, it is great; otherwise it is kinda useless.

Of course sometimes there is a real performance per clock per core increase. RDNA1 was actually a great example because AMD was able to remove a lot of the bottlenecks to keep the shader units busy.
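The core-count point above can be made concrete with a toy throughput model (all numbers here are hypothetical illustrations, not real RDNA figures):

```python
# Toy model: device throughput ~ cores * per-core IPC * clock (GHz).
# All numbers are hypothetical illustrations, not real RDNA figures.
def throughput(cores, ipc_per_core, clock_ghz):
    return cores * ipc_per_core * clock_ghz

base = throughput(40, 1.00, 1.9)          # baseline design
wider_cores = throughput(40, 1.25, 1.9)   # CPU-style: +25% per-core IPC
more_cores = throughput(50, 1.00, 1.9)    # GPU-style: +25% cores, same per-core IPC

# Both routes deliver the same device throughput, but only the first
# shows up as an "IPC" gain in a per-core comparison.
assert wider_cores == more_cores
```

Which is why a flat per-CU "IPC" number between GPU generations doesn't say much on its own.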
 

GodisanAtheist

Diamond Member
Nov 16, 2006
3,070
1,566
136
I think the expectation is mostly that GPUs don't necessarily increase IPC. Consider that IPC here is not really instructions per clock for the whole device but instructions per clock per core. So suddenly what you call a core matters. On CPUs a prime way to increase IPC is to add execution units, but on GPUs it is often easier to just go to more cores so the IPC for a given core doesn't increase that much.

Changes that could increase performance (I'd rather not use instructions because with all the fixed function hardware it is a really bad indicator) per clock are:

1) New accelerators like raytracing. The problem with measuring this is that nobody seems to have supported the software path you'd use before hardware support arrived.
2) Non-shader features like variable rate shading or mesh shading.
3) Improvements in the memory hierarchy.
4) Maybe some improvements in, say, branch prediction.

I don't think AMD talked about anything like 4, and it looks like 3 has been traded off against a smaller bus, so that is a win-some-lose-some situation. That leaves 1 and 2, which are in a situation like AVX-512 tends to be on the CPU side: if you have programs that make use of it, it is great; otherwise it is kinda useless.

Of course sometimes there is a real performance per clock per core increase. RDNA1 was actually a great example because AMD was able to remove a lot of the bottlenecks to keep the shader units busy.
- Excellent points.

Also worth noting that RDNA2 clocks *really high*. It's entirely possible that AMD engineers said "ok, we can give up 5% 'IPC', but the trade-off will be cranking clocks up 35-40%, ultimately allowing much more real work to get done".

I'm no computer engineer, but if life has taught me anything, it's that everything is just a series of trade-offs, and AMD engineers made several to ultimately get to their 30% performance increase over their prior 40 CU design on the same process at the same power...

Edit: Also worth noting that RDNA 1 basically did not see any real performance gains from core OCs (and limited gains even from mem OCs), so clearly there were some major changes under the hood to not only allow RDNA2 to clock as high as it does, but also realize performance gains from those high clocks.
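Spelling out that trade-off arithmetic (using the rough numbers from the post; the 37.5% is just the midpoint of the 35-40% range):

```python
# Rough numbers from the post: give up ~5% per-clock performance,
# gain 35-40% clock speed (midpoint of 37.5% used here).
ipc_factor = 0.95
clock_factor = 1.375
net_gain = ipc_factor * clock_factor
print(round(net_gain, 2))  # 1.31 -> roughly the ~30% uplift mentioned
```

So a small "IPC" regression is entirely consistent with a large real-world gain once clocks are factored in.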
 

Gideon

Golden Member
Nov 27, 2007
1,353
2,586
136
Also worth noting that RDNA2 clocks *really high*. It's entirely possible that AMD engineers said "ok, we can give up 5% 'IPC', but the trade-off will be cranking clocks up 35-40%, ultimately allowing much more real work to get done".
Yeah, IPC comparisons at vastly different clocks aren't the best idea. It's natural that performance does not scale linearly with clocks; if you compare a 10700K to a 6700 (non-K) Skylake, I'm pretty sure the latter will have "better IPC" despite being the same microarchitecture.

Then again, there are some differences in RDNA2 that hurt performance on 40 CU designs but were made to make the 80 CU design possible (with reasonable die area). One is that the 6700 XT has half the triangle discard capability of the 5700 XT.
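A toy Amdahl-style model (my assumption, not measured data) shows why performance can't scale linearly with core clock when memory runs at a fixed speed:

```python
# Toy Amdahl-style model (assumption, not measured data): a fraction m of
# frame time is memory-bound and does not speed up with the core clock.
def speedup(f, m):
    """Overall speedup when the core clock is raised by factor f."""
    return 1.0 / (m + (1.0 - m) / f)

print(round(speedup(1.5, 0.0), 3))  # 1.5   -> perfect scaling, nothing memory-bound
print(round(speedup(1.5, 0.3), 3))  # ~1.30 -> "IPC" appears lower at the higher clock
```

The higher-clocked part looks like it "lost IPC" even though nothing about the cores changed.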
 

GodisanAtheist

Diamond Member
Nov 16, 2006
3,070
1,566
136
Yeah, IPC comparisons at vastly different clocks aren't the best idea. It's natural that performance does not scale linearly with clocks; if you compare a 10700K to a 6700 (non-K) Skylake, I'm pretty sure the latter will have "better IPC" despite being the same microarchitecture.

Then again, there are some differences in RDNA2 that hurt performance on 40 CU designs but were made to make the 80 CU design possible (with reasonable die area). One is that the 6700 XT has half the triangle discard capability of the 5700 XT.
- Indeed.

In that same link, the site starts off with RDNA/RDNA2/GCN all normalized at a 1000 MHz clock, and while GCN is left in the dust, RDNA 1 squeaks ahead of RDNA 2.

Further down they normalize at 2000 MHz, and RDNA 2 then takes the 'IPC' lead (GCN cannot clock that high and drops out).

Edit: Looks like the later tests are at 1440P as well.
 

Glo.

Diamond Member
Apr 25, 2015
4,754
3,392
136
- Indeed.

In that same link, the site starts off with RDNA/RDNA2/GCN all normalized at a 1000 MHz clock, and while GCN is left in the dust, RDNA 1 squeaks ahead of RDNA 2.

Further down they normalize at 2000 MHz, and RDNA 2 then takes the 'IPC' lead (GCN cannot clock that high and drops out).

Edit: Looks like the later tests are at 1440P as well.
And in 1440p it appears that RDNA2 is faster in "IPC" than RDNA1 was.

Unfathomable.
 

Mopetar

Diamond Member
Jan 31, 2011
5,791
2,559
136
Kind of an odd result. I'm assuming that the publishers, having noticed the same, would have rerun the tests just to make sure it wasn't a fluke. Now the question is whether it's due to the resolution change or the clock speed increase; the latter doesn't make much sense, so I'd assume it comes down to something that RDNA2 does better or improved upon.
 

gdansk

Senior member
Feb 8, 2011
639
326
136
Maybe an oversimplification on my part, but I was under the impression IPC was far less important on a GPU. They work on problems for which "more cores" is a working strategy. In that environment, instructions per watt is the prime target in order to allow larger designs.

Did they show the power consumption of each device at the given clock rates? Regardless, it appears the cache arrangement allows them to get similar performance with lower memory bandwidth.
 

moinmoin

Platinum Member
Jun 1, 2017
2,503
3,166
136
IPC becomes very important once the two other parameters are exhausted: frequency and number of cores (which for GPUs is essentially limitless thanks to embarrassingly parallel computation).

With RDNA, the narrative set by AMD is that its Ryzen team looked at the Radeon chips, and higher frequency was achieved through simplification of the CU cores. When one thinks about it, GPU frequencies have been very low compared to CPUs, so the frequency improvements achieved over the last several gens have been low-hanging fruit (highest boost clock for each gen: RX200/300: 1050, RX400: 1266, Vega: 1677, Radeon VII: 1750, RDNA: 1980, RDNA2: 2581 MHz).
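From that boost-clock list, the gen-over-gen gains can be computed directly:

```python
# Gen-over-gen boost clock gains, using the MHz figures listed above.
clocks = {"RX200/300": 1050, "RX400": 1266, "Vega": 1677,
          "Radeon VII": 1750, "RDNA": 1980, "RDNA2": 2581}
names = list(clocks)
for prev, cur in zip(names, names[1:]):
    gain = clocks[cur] / clocks[prev] - 1
    print(f"{prev} -> {cur}: {gain:+.0%}")
# RX400 -> Vega (+32%) and RDNA -> RDNA2 (+30%) are the two biggest jumps.
```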

With RDNA2, AMD doubled the maximum number of CUs from 40 to 80 and, if you look at the second page of the computerbase article, achieved a rather stable per-CU performance scaling of ~75%, whether going from 40 to 60, 72 or 80 CUs. This in particular is very promising for further CU count expansions using MCM.
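One way to read that ~75% figure (the model here is my interpretation, not the article's): each CU beyond the 40 CU baseline contributes at ~75% efficiency.

```python
# Assumed model of the ~75% scaling figure: each CU beyond the
# 40 CU baseline contributes at ~75% of a baseline CU.
def rel_perf(cus, base_cus=40, scaling=0.75):
    return 1.0 + scaling * (cus - base_cus) / base_cus

for cus in (60, 72, 80):
    print(cus, round(rel_perf(cus), 2))
# 60 -> 1.38, 72 -> 1.6, 80 -> 1.75
```

Under that reading, doubling CUs buys ~75% more performance, which is the figure MCM designs would need to beat or match.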
 

lightmanek

Senior member
Feb 19, 2017
289
528
136
It's very hard to measure IPC of a GPU at vastly different clocks as we have no control of memory timings and internal frequency dividers. Very likely that some components designed to scale with high clock are simply tanking RDNA 2 performance at much lower clocks, where GCN or RDNA 1 run these components at lower latencies or have more of them (see Gideon's post above about discard units). That's the reason tests were done at two fixed clock settings and that's mainly why we see lead changing from RDNA 1 to RDNA 2.
 

leoneazzurro

Senior member
Jul 26, 2016
355
470
136
It's very hard to measure IPC of a GPU at vastly different clocks as we have no control of memory timings and internal frequency dividers. Very likely that some components designed to scale with high clock are simply tanking RDNA 2 performance at much lower clocks, where GCN or RDNA 1 run these components at lower latencies or have more of them (see Gideon's post above about discard units). That's the reason tests were done at two fixed clock settings and that's mainly why we see lead changing from RDNA 1 to RDNA 2.
A simple example: Infinity Cache. Its efficiency is likely to change a lot with GPU clocks.
 

andermans

Member
Sep 11, 2020
71
65
51
There weren't any 40 CU Vega parts though. The only one that lines up with a modern GPU is the Radeon VII, which has the same 60 CUs as the 6800.

I don't know about how it would stack up clock for clock, but the Radeon VII is regularly matched or even beaten by the 40 CU 5700 XT in benchmarks.
I do think the Radeon VII had some bottlenecks that made it really hard to use that many CUs though. Would be interesting to disable some CUs per shader engine to get 40 CUs total and then compare.
 

TESKATLIPOKA

Senior member
May 1, 2020
396
396
96
Twitter
Dimgray Cavefish GPU
<For Premium 1080p Gaming>
~1440p < RTX 3060 < 1080p
~April
~CNY 2,499
~32 Compute Units
~About 236 mm2
~64MB Infinity Cache
~128-bit 16Gbps GDDR6 with 8GB VRAM
I am quite skeptical about fitting 64 MB of IC within 236 mm² when N22, with an additional 8 CUs, 32 MB more IC and a 64-bit wider GDDR6 bus, is ~100 mm² bigger.
Price is ~384 USD, which is a lot.
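Spelling out the comparison (rumored and implied figures from the post only, nothing confirmed):

```python
# Rumored "Dimgray Cavefish" part vs. N22 (6700 XT), per the post above.
# All figures are rumored or implied ("~100 mm^2 bigger"), not confirmed.
rumor = {"area_mm2": 236, "cus": 32, "ic_mb": 64, "bus_bits": 128}
n22 = {"area_mm2": rumor["area_mm2"] + 100,  # "~100 mm^2 bigger"
       "cus": rumor["cus"] + 8,              # 8 more CUs
       "ic_mb": rumor["ic_mb"] + 32,         # 32 MB more Infinity Cache
       "bus_bits": rumor["bus_bits"] + 64}   # 64-bit wider GDDR6 bus
# The extra ~100 mm^2 on N22 must cover all three additions at once,
# which is why 64 MB of cache inside only 236 mm^2 looks tight.
print(n22)  # {'area_mm2': 336, 'cus': 40, 'ic_mb': 96, 'bus_bits': 192}
```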
 

PhoBoChai

Member
Oct 10, 2017
119
389
106
Infinity Cache has its own clock domain, and is locked at 1.94 GHz iirc.
There is nothing to suggest this. AMD's slides refer to the 1.94 GHz as the fabric that ties the cache and GPU together, not the actual cache operating frequency.

AMD claims they used Ryzen L3 SRAM libraries for RDNA2 on 7nm, and we know Ryzen's L3 has no problems running >4.5 GHz. There should be no issues at all tying Infinity Cache clocks to the engine clock, while the fabric clock is linked to the memory controller.

As for the IF clocks, they're not static; there are two states, 1.4 GHz power saving and 1.94 GHz boost.
 
