
Question Speculation: RDNA2 + CDNA Architectures thread

Page 117
The RX 5700 XT has a clock speed of about 1887 MHz on average, so a 25% higher clock for Big Navi would actually be 2359 MHz, but the info only says about 2.2 GHz, which is just 16.6% higher.
Another thing is that doubling the number of CUs won't increase performance by 100%, but by ~90-95%, and you also need double the ROPs.
It should be at least at the level of the RTX 3080, but reviews will tell us the truth.
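The percentages in the post above can be checked with simple arithmetic. A quick sketch; the 1887 MHz average and 2.2 GHz figures are the thread's numbers, not official specs:

```python
# Speculative clock comparison using the thread's numbers (not official specs).
rx5700xt_avg_mhz = 1887    # observed average clock per the post
rumored_n21_mhz = 2200     # the rumored "about 2.2 GHz"

# Clock that a true +25% over the 5700 XT's average would require:
clock_for_25_pct = rx5700xt_avg_mhz * 1.25                  # 2358.75 ~ 2359 MHz
# Gain actually implied by the 2.2 GHz rumor:
implied_gain_pct = (rumored_n21_mhz / rx5700xt_avg_mhz - 1) * 100

print(round(clock_for_25_pct), round(implied_gain_pct, 1))  # 2359 16.6
```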

The RX 5700 XT's official game clock is 1755 MHz, but you can see it run anywhere between 1800 and 1850 MHz depending on the game. So if the N21 official game clock is 2.2 GHz, we should not be surprised to see it run between 2250 and 2300 MHz. Game clock to game clock, it's roughly 25%. More importantly, RDNA2 has higher perf/clock (which we need to see in press reviews). RDNA2 has improved (and probably larger) caches. So here are the vectors:

2x the CUs, 25% higher clocks, 10-15% higher perf/clock. Even without the higher perf/clock, N21 is going to end up at more than 2x the RX 5700 XT. With higher perf/clock it will pull even further ahead.
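Those vectors multiply, so the estimate can be sketched as below. All factors are the thread's speculation, including the ~90-95% scaling from doubled CUs mentioned earlier:

```python
# Rough multiplicative estimate of N21 vs RX 5700 XT from the thread's
# speculated factors: ~1.90-1.95x from doubled CUs (imperfect scaling),
# 1.25x clocks, 1.10-1.15x perf/clock.
def relative_perf(cu_factor, clock_factor, ipc_factor):
    """Multiply independent speedup factors (a simplification)."""
    return cu_factor * clock_factor * ipc_factor

low = relative_perf(1.90, 1.25, 1.10)    # pessimistic end
high = relative_perf(1.95, 1.25, 1.15)   # optimistic end
print(f"{low:.2f}x to {high:.2f}x")      # 2.61x to 2.80x
```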
 
If not, then I say we will see a 384-bit GDDR6 bus.

A 384-bit bus won't fit 16 GB of VRAM; you get 12 or 24 GB instead.
Here are two possibilities: a 256-bit bus with some "magic", or HBM2e. A 512-bit bus is too big and complicated, and AMD last used one with the Radeon 200/300 series.
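The 12/24 GB point follows from how GDDR6 is attached: each chip has a 32-bit interface, so the bus width fixes the chip count and the chip density fixes the capacity. A quick sketch, assuming the 1 GB and 2 GB chip densities available at the time:

```python
# Why a 384-bit GDDR6 bus implies 12 or 24 GB rather than 16 GB:
# each GDDR6 chip has a 32-bit interface, so bus width fixes the chip count.
def vram_options(bus_width_bits, chip_densities_gb=(1, 2)):
    chips = bus_width_bits // 32          # one chip per 32-bit channel
    return [chips * d for d in chip_densities_gb]

print(vram_options(384))   # [12, 24]
print(vram_options(256))   # [8, 16] -- 16 GB needs the 256-bit option
```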
 
The RX 5700 XT's official game clock is 1755 MHz, but you can see it run anywhere between 1800 and 1850 MHz depending on the game. So if the N21 official game clock is 2.2 GHz, we should not be surprised to see it run between 2250 and 2300 MHz. Game clock to game clock, it's roughly 25%. More importantly, RDNA2 has higher perf/clock (which we need to see in press reviews). RDNA2 has improved (and probably larger) caches. So here are the vectors:

2x the CUs, 25% higher clocks, 10-15% higher perf/clock. Even without the higher perf/clock, N21 is going to end up at more than 2x the RX 5700 XT. With higher perf/clock it will pull even further ahead.
Didn't I also say I expect at least RTX 3080 performance? Because 2x the 5700 XT is RTX 3080 performance at 4K.
 
If RDNA2 with a 256-bit bus turns out faster than the 3070 with the same bus, then it can; we just don't know in which games it saves enough bandwidth, and whether the limit is the memory bus or the GPU.

For sure, miners are DEAD and will not buy AMD anymore, as mining performance is strictly tied to raw memory bandwidth and bus width.

Also, don't forget that a wider bus means a much more expensive PCB design and GPU-side memory controller.
Yep, it appears that anything needing mostly unique data for each operation will be at a disadvantage. This explains the split to the CDNA architecture for HPC, distributed computing, etc. This is a good thing for gamers.
 
Didn't I also say I expect at least RTX 3080 performance? Because 2x the 5700 XT is RTX 3080 performance at 4K.
Yeah. But again, unless AMD has messed up, that's the least you can expect. In reality, if perf/clock is higher, then RTX 3090-level perf at 4K, and better at 1440p, is expected.
 
CUs don't scale linearly, so doubling the CUs, even with a ton of optimizations in the process, will only yield about 70-80% more performance, assuming the memory system and clocks are the same. We know it will have higher clocks, so maybe a good 15% more performance from that, and it does come close to double the RX 5700 XT's performance, which would put it at around RTX 3080 performance numbers.
 
CUs don't scale linearly, so doubling the CUs, even with a ton of optimizations in the process, will only yield about 70-80% more performance, assuming the memory system and clocks are the same. We know it will have higher clocks, so maybe a good 15% more performance from that, and it does come close to double the RX 5700 XT's performance, which would put it at around RTX 3080 performance numbers.
In order to go chiplet for dGPUs, you need to achieve the best possible, almost perfect scaling with CU count. The first step toward that is redesigning the caches to achieve the highest possible internal bandwidth.

If RDNA2 brings those cache improvements, designed for perfect or almost perfect scaling, we should see better scaling with CU count than what we have seen with RDNA1.

And we have seen that that architecture achieved not 70% scaling going from the RX 5500 XT's CU count to the RX 5700 XT's, but 86%.
 
Scaling with CU count going from 22 CUs to 40 CUs was around 93% with RDNA1, taking a game-clock-to-game-clock comparison into account. If we take the game clock on the 5500 XT versus the max clock on the 5700 XT, we are just shy of 90%. But going from 40 to 80 CUs is certainly a bigger step.
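One way to make the scaling percentages concrete is to divide the measured speedup by the ideal speedup implied by the CU ratio alone. A sketch, where both this definition and the 1.69x example speedup are illustrative assumptions, not measured data:

```python
# Scaling efficiency = measured speedup / ideal speedup from the CU ratio.
# (This definition, and the 1.69x speedup below, are illustrative assumptions.)
def scaling_efficiency(cu_small, cu_big, measured_speedup):
    ideal = cu_big / cu_small             # 40/22 ~ 1.82x if scaling were perfect
    return measured_speedup / ideal

# E.g. a 40-CU part measuring 1.69x a 22-CU part at matched clocks:
print(f"{scaling_efficiency(22, 40, 1.69):.0%}")   # 93%
```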
For chiplets, it doesn't matter how many CUs you have; you have to have the best possible, linear scaling with CU count.
 
For (GPU) chiplets... latencies are most important. Just saying
Please explain, as everything I've read from GPU designers has stated that increased latency is much easier to manage in gaming GPUs than in CPUs.

By the way, interposer-based designs have interconnect latencies of 1-2 ns (per a several-years-old Xilinx slide).
 
Please explain, as everything I've read from GPU designers has stated that increased latency is much easier to manage in gaming GPUs than in CPUs.

GPUs have lots of latency-hiding mechanisms, but they incur costs. RDNA was designed to lower shader latencies by widening the SIMD width to 32. The rationale was that lower latency means fewer waves are needed to hide it, which means fewer registers are needed to sustain performance.
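The chain of reasoning in that last sentence can be put into arithmetic. A sketch with hypothetical numbers; the 16- and 8-cycle latencies and 64 registers per wave are made up for illustration:

```python
# Waves needed to hide a latency ~ latency / issue interval, and register
# consumption scales with the number of resident waves (all numbers hypothetical).
def waves_needed(latency_cycles, issue_interval_cycles):
    return -(-latency_cycles // issue_interval_cycles)   # ceiling division

def registers_in_flight(waves, regs_per_wave):
    return waves * regs_per_wave

# If lowering shader latency halves the cycles to hide (say 16 -> 8):
print(registers_in_flight(waves_needed(16, 1), 64))   # 1024
print(registers_in_flight(waves_needed(8, 1), 64))    # 512
```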
 
Please explain, as everything I've read from GPU designers has stated that increased latency is much easier to manage in gaming GPUs than in CPUs.

Yeah, I also wonder about the scaling. In fact, scaling would be far more important in a monolithic GPU, as it makes a larger and larger die less and less useful performance-wise and far more expensive to manufacture (yields!). Adding another "cheap" chiplet at sub-optimal scaling has a far lower cost than making a bigger die at sub-optimal scaling.

The real issue is latency, and possibly bandwidth, between chiplets, and how to connect them together: e.g. some kind of IO die, or does each chiplet go to memory directly? (Unlikely, as each one would need a memory controller.)

In fact, this last part could be the driving factor behind the cache redesign. Too much traffic to the IO die and too high memory latency (chiplet -> IO -> memory), so needing fewer memory accesses would be a huge bonus.
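The benefit of a cache that absorbs traffic before it reaches the IO die can be sketched with a simple hit/miss latency model; all figures below are hypothetical:

```python
# Average memory latency under a simple hit/miss model: a bigger cache
# (higher hit rate) keeps more traffic off the chiplet -> IO -> memory path.
def avg_latency_ns(hit_rate, cache_ns, mem_ns):
    return hit_rate * cache_ns + (1 - hit_rate) * mem_ns

# Hypothetical figures: 20 ns on-die cache hit, 300 ns round trip via an IO die.
print(round(avg_latency_ns(0.50, 20, 300), 1))   # 160.0
print(round(avg_latency_ns(0.80, 20, 300), 1))   # 76.0
```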
 
I really doubt any future GPU chiplet design is going to split the compute units... more likely something similar to Zen 2: an IO die and a separate compute die.
 
I really doubt any future GPU chiplet design is going to split the compute units... more likely something similar to Zen 2: an IO die and a separate compute die.
Well, with Zen 2 you got Epyc 2 chips containing 8 chiplets and thus up to 64 cores. Imagine that kind of scalability for GPU CUs without having to use a single ~500mm² monolith.
 
Well, with Zen 2 you got Epyc 2 chips containing 8 chiplets and thus up to 64 cores. Imagine that kind of scalability for GPU CUs without having to use a single ~500mm² monolith.

I always thought GPU workloads were not that sensitive to latency and instead craved bandwidth, given how video cards use GDDRx. So in theory, splitting the shader array across chiplets could yield very big GPUs without a single big die, yes. I suppose if it were that easy they would have brought such a solution to market already. But maybe it is not.
 
I always thought GPU workloads were not that sensitive to latency and instead craved bandwidth, given how video cards use GDDRx. So in theory, splitting the shader array across chiplets could yield very big GPUs without a single big die, yes. I suppose if it were that easy they would have brought such a solution to market already. But maybe it is not.
Bandwidth is definitely the much bigger issue, seeing how performance scaling comes to a screeching halt once, e.g., the APUs hit the dual-channel DDR4 bandwidth bottleneck. With potential CU chiplets, that bandwidth would need to be a multiple of the number of chiplets to guarantee no bottleneck, and only a solution with some form of overprovisioning seems feasible if the design is to be realistically fully scalable. At that point latency can become crucial as well; after all, a GPU has only the time of a frame to do its work. The higher the framerate, the less time there is; the more chiplets, and the more unavoidable crosstalk between them, the more latency can add up, ending up as a bottleneck for the achievable framerate.

With pure compute there's no framerate to keep in mind and communication between chiplets can be avoided, so I expect CDNA to adopt a chiplet approach much earlier, and much more aggressively, than RDNA ever will.
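The frame-time budget mentioned above shrinks quickly with framerate, which is why added inter-chiplet latency matters more for graphics than for pure compute. A trivial sketch:

```python
# A GPU only has one frame's worth of time to finish its work, so any
# latency added between chiplets eats into a shrinking budget at high fps.
def frame_budget_ms(fps):
    return 1000.0 / fps

for fps in (60, 144, 240):
    print(fps, f"{frame_budget_ms(fps):.2f} ms")   # 16.67, 6.94, 4.17 ms
```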
 
Bandwidth is definitely the much bigger issue, seeing how performance scaling comes to a screeching halt once, e.g., the APUs hit the dual-channel DDR4 bandwidth bottleneck. With potential CU chiplets, that bandwidth would need to be a multiple of the number of chiplets to guarantee no bottleneck, and only a solution with some form of overprovisioning seems feasible if the design is to be realistically fully scalable. At that point latency can become crucial as well; after all, a GPU has only the time of a frame to do its work. The higher the framerate, the less time there is; the more chiplets, and the more unavoidable crosstalk between them, the more latency can add up, ending up as a bottleneck for the achievable framerate.

With pure compute there's no framerate to keep in mind and communication between chiplets can be avoided, so I expect CDNA to adopt a chiplet approach much earlier, and much more aggressively, than RDNA ever will.
I can't see a GPU chiplet solution without an active interposer tbh
 
I always thought that GPU workloads were not that sensitive in latency and cherished bandwidth, given how videocards use GDDRx. So in theory splitting the shader array across chiplets could result in very big GPUs without a single big die used yes. I suppose if it was that easy they would have brought such a solution to the market already. But maybe it is not.
The 7970 X2 was a dual-GPU card anyway.
 
GPUs have lots of latency-hiding mechanisms, but they incur costs. RDNA was designed to lower shader latencies by widening the SIMD width to 32. The rationale was that lower latency means fewer waves are needed to hide it, which means fewer registers are needed to sustain performance.
GCN has a 64-thread wave issued 16 lanes at a time to a 64-shader CU. RDNA has a 32-thread wave issued all at once to a 32-wide shader array, with everything doubled up to allow backwards compatibility with GCN.

AFAIK, this was done to improve occupancy and reduce idle shader resources, not because data latency was a problem.
 