Question Speculation: RDNA2 + CDNA Architectures thread

uzzi38 · Apr 28, 2020

All die sizes are within 5mm^2. The poster here has been right on some things in the past afaik, and to his credit was the first to say 505mm^2 for Navi21, which other people have since backed up. Even so, take the following with a pinch of salt.

Navi21 - 505mm^2

Navi22 - 340mm^2

Navi23 - 240mm^2

Source is the following post: https://www.ptt.cc/bbs/PC_Shopping/M.1588075782.A.C1E.html

kurosaki · Sep 23, 2020

Macros96 said:
3.) They have successfully avoided accurate leaks and there is a big-die plus HBM card at the top of the stack.

I'm fervently hoping for #3, but know it will be expensive.

Like, 10 000 kola nuts-expensive?!

Glo. · Sep 23, 2020

AtenRa said:
Are you sure that increasing clocks will gain you 1 to 1 performance ??

Depends on the Frequency/IPC design that AMD aimed for. If you want to scale your GPUs to clock so high, you do not aim for the most optimal frequency/IPC curve to be on the same level as your previous generation of GPUs. You want to schedule as much as possible, with each cycle.

Stuka87 · Sep 23, 2020

CastleBravo said:
The improved capabilities don't reflect improved performance in current games, but there is still some chance that future games will make proper use of it. The NV marbles demo showed a massive improvement.

nVidia (and other companies) like to "forcefully" make their demos look better on new hardware: lots of optimizations on the new cards, zero optimizations on the old cards. The demo looks very cool and has nice visuals, but it should never be used to compare performance between two generations of cards.

raghu78 · Sep 23, 2020

AtenRa said:
Are you sure that increasing clocks will gain you 1 to 1 performance ??

I'm expecting the 40CU RDNA 2 card to be on the same level as the RTX2080/Super and RTX3060.

If a 2SE,4SA, 20WGP/40CU, 64 ROPs, 192 bit GDDR6 RDNA2 graphics card can match a 5 GPC, 38 SM, 80 ROPs, 256 bit GDDR6 Ampere graphics card then the comparisons up the stack will get brutal as AMD is likely to have better scaling due to the 80 CU being 4SE, 8SA, 40 WGP/80CU, 128 ROPs and most likely 2048 bit HBM2E. All I can say is NV are in for a hard contest this fall.

Saylick · Sep 23, 2020

Glo. said:
P.S. RDNA1 CUs launch 5 instructions per clock. RDNA2 CUs launch 7 instructions per clock 😉.

Out of curiosity, do you have a source on the 5 inst/clk for RDNA1? I saw 2-4 inst/clock per CU for RDNA1 from the whitepaper, but no indication of what the average inst/clk is for the WGP.

Glo. · Sep 23, 2020

raghu78 said:
If a 2SE,4SA, 20WGP/40CU, 64 ROPs, 192 bit GDDR6 RDNA2 graphics card can match a 5 GPC, 38 SM, 80 ROPs, 256 bit GDDR6 Ampere graphics card then the comparisons up the stack will get brutal as AMD is likely to have better scaling due to the 80 CU being 4SE, 8SA, 40 WGP/80CU, 128 ROPs and most likely 2048 bit HBM2E. All I can say is NV are in for a hard contest this fall.

It's simpler.

RDNA2 CU compares to one SM from Ampere.

But RDNA2 GPUs clock higher.

Saylick said:
Out of curiosity, do you have a source on the 5 inst/clk for RDNA1? I saw 2-4 inst/clock per CU for RDNA1 from the whitepaper, but no indication of what the average inst/clk is for the WGP.

Oh, did I make a mistake? I thought I read in the RDNA1 whitepaper that it was 5 instructions per clock, not 4.

dr1337 · Sep 23, 2020

Glo. said:
It's simpler.

RDNA2 CU compares to one SM from Ampere.

But RDNA2 GPUs clock higher.

Oh, did I make a mistake? I thought I read in the RDNA1 whitepaper that it was 5 instructions per clock, not 4.

https://www.amd.com/system/files/documents/rdna-whitepaper.pdf

Can confirm, page 9 of the whitepaper says 2-4 instructions per cycle. It would seem that RDNA2 has an increase vs. RDNA1, unless somehow the Xbox Series X slides are lying or aren't truly accurate.

Glo. · Sep 23, 2020

So it's 4 RDNA1 instructions vs 7 in RDNA2, with each cycle.

That's a pretty hefty increase in scheduling...

Saylick · Sep 23, 2020

Glo. said:
Oh, did I make a mistake? I thought I read in the RDNA1 whitepaper that it was 5 instructions per clock, not 4.

No worries.

Glo. said:
So it's 4 RDNA1 instructions vs 7 in RDNA2, with each cycle.

That's a pretty hefty increase in scheduling...

dr1337 said:
https://www.amd.com/system/files/documents/rdna-whitepaper.pdf

Can confirm, page 9 of the whitepaper says 2-4 instructions per cycle. It would seem that RDNA2 has an increase vs. RDNA1, unless somehow the Xbox Series X slides are lying or aren't truly accurate.

Yeah, so it's effectively 4-8 inst/clk for RDNA 1 at the CU level vs. 7 inst/clk for RDNA 2 per the Xbox Series X presentation. Knowing the actual average inst/clk for RDNA 1 will let us know what the potential improvement in IPC can be.

Glo. · Sep 23, 2020

Saylick said:
No worries.

Yeah, so it's effectively 4-8 inst/clk for RDNA 1 at the WGP level vs. 7 inst/clk for RDNA 2 per the Xbox Series X presentation. Knowing the actual average inst/clk for RDNA 1 will let us know what the potential improvement in IPC can be.

Uhhh, Isn't it 7 Instructions PER CU, not per WGP?

If it's per CU, and there are 5 CUs per WGP, it means 35 instructions per WGP with each cycle.

O_O

Saylick · Sep 23, 2020

Glo. said:
Uhhh, Isn't it 7 Instructions PER CU, not per WGP?

If it's per CU, and there are 5 CUs per WGP, it means 35 instructions per WGP with each cycle.

O_O

Yeah, it's at the CU level, my mistake. I tried to ninja edit my post after I realized my folly. 😉

Regardless, I think my point still stands: 2-4 IPC per SIMD or 4-8 IPC per CU for RDNA 1, up to 7 IPC per CU for RDNA 2.

EDIT: It's 2 CUs per WGP, not 5. 😛

andermans · Sep 23, 2020

I think the quote from the whitepaper is "Each of the four SIMDs can request instructions every cycle and the instruction cache can deliver 32B (typically 2-4 instructions) every clock to each of the SIMDs". So that would be up to a theoretical 8 per CU/16 per WGP.

In practice, if you look at what units it can issue to I believe the numbers are unchanged though (with 1 vector ALU unit per SIMD you're obviously not issuing more than 2 per CU), or at least there is no indication of any changes. Looking at the 7 I'm also pretty sure those are not the only ones that can be issued, there are categories not named (like LDS or NOPs). Pretty sure RDNA1 and RDNA2 have these but they were not included in the 7 explicitly listed on the xbox slides.

Glo. · Sep 23, 2020

Here is my question.

Are we sure Microsoft is not talking about 7 instructions per SIMD, and not per CU?

Ajay · Sep 23, 2020

raghu78 said:
If a 2SE,4SA, 20WGP/40CU, 64 ROPs, 192 bit GDDR6 RDNA2 graphics card can match a 5 GPC, 38 SM, 80 ROPs, 256 bit GDDR6 Ampere graphics card then the comparisons up the stack will get brutal as AMD is likely to have better scaling due to the 80 CU being 4SE, 8SA, 40 WGP/80CU, 128 ROPs and most likely 2048 bit HBM2E. All I can say is NV are in for a hard contest this fall.

HBM just isn't going to show its face on a consumer GPU. Don't know why people keep repeating this idea.

AtenRa · Sep 23, 2020

Glo. said:
Here is my question.

Are we sure Microsoft is not talking about 7 instructions per SIMD, and not per CU?

There are only two SIMDs (2x32 ALUs) per CU in RDNA, so there is no way to launch more than 2 vector instructions per CU, i.e. one per SIMD, each cycle.

raghu78 · Sep 23, 2020

andermans said:
I think the quote from the whitepaper is "Each of the four SIMDs can request instructions every cycle and the instruction cache can deliver 32B (typically 2-4 instructions) every clock to each of the SIMDs". So that would be up to a theoretical 8 per CU/16 per WGP.

In practice, if you look at what units it can issue to I believe the numbers are unchanged though (with 1 vector ALU unit per SIMD you're obviously not issuing more than 2 per CU), or at least there is no indication of any changes. Looking at the 7 I'm also pretty sure those are not the only ones that can be issued, there are categories not named (like LDS or NOPs). Pretty sure RDNA1 and RDNA2 have these but they were not included in the 7 explicitly listed on the xbox slides.

In RDNA the instruction cache which is shared per WGP can fetch 32 bytes (2-4 instructions) per clock cycle per SIMD. 4 SIMDs are present in a dual CU WGP. So 4-8 instructions per clock cycle per CU and 8-16 instructions per clock cycle per WGP.

RDNA2 does 7 instructions per clock cycle per CU. If that's sustained every clock cycle, then the key question is whether the RDNA2 instruction cache fetches 2x the bytes per SIMD per clock cycle. Even if we assume an average of 6 instructions per clock cycle per CU for RDNA1, 7 instructions per clock cycle per CU is still roughly 17% higher instruction throughput for RDNA2. As always the devil is in the details. 🙂
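For anyone who wants to check the arithmetic, here is a rough Python sketch of the numbers above. It only multiplies out the figures quoted in this thread (2-4 instructions fetched per SIMD per clock, 2 SIMDs per CU, 2 CUs per WGP, 7 per CU from the Xbox slides). The average of 6 per CU is purely an assumption, and all the names are made up for illustration; this is not a model of how the hardware actually issues.

```python
# Back-of-the-envelope fetch/issue arithmetic using only figures quoted in the thread.
FETCH_INSTR_PER_SIMD = (2, 4)  # "typically 2-4 instructions" per 32B fetch (whitepaper quote)
SIMDS_PER_CU = 2               # 4 SIMDs in a dual-CU WGP, per the posts above
CUS_PER_WGP = 2


def fetch_range(simd_count):
    """Return the (min, max) instructions fetched per clock for simd_count SIMDs."""
    low, high = FETCH_INSTR_PER_SIMD
    return (low * simd_count, high * simd_count)


rdna1_per_cu = fetch_range(SIMDS_PER_CU)
rdna1_per_wgp = fetch_range(SIMDS_PER_CU * CUS_PER_WGP)

RDNA2_PER_CU = 7               # figure from the Xbox Series X slides
rdna2_per_wgp = RDNA2_PER_CU * CUS_PER_WGP

ASSUMED_RDNA1_AVG_PER_CU = 6   # assumption: midpoint of the 4-8 range, not a measured value
uplift = RDNA2_PER_CU / ASSUMED_RDNA1_AVG_PER_CU - 1

print("RDNA1 fetch per CU :", rdna1_per_cu)
print("RDNA1 fetch per WGP:", rdna1_per_wgp)
print("RDNA2 per WGP      :", rdna2_per_wgp)
print(f"7 vs assumed 6     : {uplift:.1%} higher")
```

Note that this compares a fetch-rate range for RDNA1 with a per-category issue figure for RDNA2, so the ratio is only as meaningful as that assumption.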

andermans · Sep 23, 2020

raghu78 said:
RDNA2 does 7 instructions per clock cycle per CU. If that's sustained every clock cycle, then the key question is whether the RDNA2 instruction cache fetches 2x the bytes per SIMD per clock cycle. Even if we assume an average of 6 instructions per clock cycle per CU for RDNA1, 7 instructions per clock cycle per CU is still roughly 17% higher instruction throughput for RDNA2. As always the devil is in the details. 🙂

I'm pretty sure the numbers mentioned are just theoretical maximums in the different categories (e.g. 2 vector ALU instructions/cycle with 2 vector ALU units is obviously a max, not an average; same for the others), so that wouldn't be sustained.

Olikan · Sep 23, 2020

Also the max wavefronts was reduced from 20 to 16... It's another indicator there is more IPC in rdna2

Timorous · Sep 23, 2020

AtenRa said:
This is not going to happen, sorry, but then the XBOX SX would be faster than the RTX2080Ti, which it's not.

The 60CU die will compete against GA104 (RTX3070) and the 80CU die will compete against the GA102 (RTX3080/90)

For 40 CUs to compete with the 2080Ti RDNA2 needs a 40% IPC increase (measured crudely as fps/Tflop) with PS5 clockspeeds. If it could do this in a 225W envelope then AMD will also hit their 50% perf/watt target as the 2080Ti is about 50% faster than the 5700XT at 4k.

If this is the IPC gain then this card would have 11.5 Tflops which means the Series X would also be 2080Ti tier.

The fact a 36CU @ 2.23Ghz + 8c16t zen2 @ 3.5Ghz based SoC with 16GB GDDR6 can be powered by a 350W PSU (which will have plenty of overhead) suggests that a 2.23Ghz 40CU dGPU part could be done for less than the 225W tbp of the 5700XT.

I think a 40CU 2.23Ghz part is more likely to land in 2080S territory than 2080Ti territory due to memory bandwidth but I also suspect the 3070 will also be closer to the 2080S than the Ti.
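The Tflops figures can be sanity-checked with a quick Python sketch. It assumes the usual peak FP32 formula of CUs x 64 lanes x 2 FLOPs per FMA x clock; the function name is made up for illustration, and the 52 CU / 1.825 GHz Series X configuration comes from Microsoft's public specs, not from this thread.

```python
# Peak FP32 throughput sketch: CUs * lanes per CU * FLOPs per lane per clock * clock.
LANES_PER_CU = 64            # 2 x SIMD32 per CU in RDNA
FLOPS_PER_LANE_PER_CLK = 2   # one fused multiply-add counted as 2 FLOPs


def fp32_tflops(cus, clock_ghz):
    """Theoretical peak FP32 TFLOPS for a given CU count and clock in GHz."""
    gflops = cus * LANES_PER_CU * FLOPS_PER_LANE_PER_CLK * clock_ghz
    return gflops / 1000.0


configs = {
    "40 CU @ 2.23 GHz (hypothetical dGPU)": (40, 2.23),
    "36 CU @ 2.23 GHz (PS5)": (36, 2.23),
    "52 CU @ 1.825 GHz (Series X, public spec)": (52, 1.825),
}

for name, (cus, clock) in configs.items():
    print(f"{name}: {fp32_tflops(cus, clock):.2f} TFLOPS")
```

By this formula the 40CU part at PS5 clocks lands at about 11.4 Tflops, a touch below the Series X on paper. These are peak numbers only and say nothing about memory bandwidth.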

blckgrffn · Sep 23, 2020

Oh man I love being on this train. You guys are making me legit excited 😆

I am also in the camp that is finding it hard to believe the 3070 base card will exceed the 2080 ti and expect it to instead be above the 2080S.

Also, are we really caring what a 40CU card does at 4K? It seems like that is still a domain for the highest tier cards if you want sustained 60+ FPS.

Show me it trouncing my 5700xt at 1440p and I’ll be opening my wallet.

TESKATLIPOKA · Sep 23, 2020

Timorous said:
For 40 CUs to compete with the 2080Ti RDNA2 needs a 40% IPC increase (measured crudely as fps/Tflop) with PS5 clockspeeds. If it could do this in a 225W envelope then AMD will also hit their 50% perf/watt target as the 2080Ti is about 50% faster than the 5700XT at 4k.

RTX 2080Ti is 50% faster than RX 5700 XT. For the same performance you need 25% better IPC and 20% higher clockspeed. Of course you need enough bandwidth to not be a bottleneck.
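The two gains multiply rather than add, which is where the 50% comes from. A minimal Python sketch, assuming performance scales linearly with both IPC and clockspeed and bandwidth is not a bottleneck (function names are made up for illustration):

```python
# Compound speedup sketch: IPC and clock gains multiply under a linear-scaling assumption.

def combined_speedup(ipc_gain, clock_gain):
    """Overall fractional speedup from fractional IPC and clock gains."""
    return (1 + ipc_gain) * (1 + clock_gain) - 1


def required_ipc_gain(target_speedup, clock_gain):
    """IPC gain needed to hit a target speedup given a clock gain."""
    return (1 + target_speedup) / (1 + clock_gain) - 1


print(f"25% IPC x 20% clock -> {combined_speedup(0.25, 0.20):.0%} faster")
print(f"IPC needed for +50% at +20% clock -> {required_ipc_gain(0.50, 0.20):.0%}")
```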

Konan · Sep 23, 2020

TESKATLIPOKA said:
RTX 2080Ti is 50% faster than RX 5700 XT. For the same performance you need 25% better IPC and 20% higher clockspeed. Of course you need enough bandwidth to not be a bottleneck.

Yeah 192bit bus and none of that extra cache magic that will be on higher SKUs.

If anything, everything is over-exaggerated. Same thing happens every couple of years followed by "wait for the next one".

Avalon · Sep 23, 2020

Throwing my hat in the ring here and saying Navi 21 will be at least as fast as this card:

kurosaki · Sep 23, 2020

TESKATLIPOKA said:
RTX 2080Ti is 50% faster than RX 5700 XT. For the same performance you need 25% better IPC and 20% higher clockspeed. Of course you need enough bandwidth to not be a bottleneck.

Ehhh.
The 2080 Ti is mostly 25% faster. Are we on the same forum? Anand bench. Or are you solely looking at the 1080p figures in certain games?

TESKATLIPOKA · Sep 23, 2020

kurosaki said:
Ehhh.
The 2080 Ti is mostly 25% faster. Are we on the same forum? Anand bench. Or are you solely looking at the 1080p figures in certain games?

Techpowerup average performance in 4K.
