
Question Speculation: RDNA2 + CDNA Architectures thread

Page 117
The RX 5700 XT has a clock speed of about 1887 MHz on average, so a 25% higher clock for Big Navi would actually be 2359 MHz, but the info only says about 2.2 GHz, which is just 16.6% higher.
Another thing is that doubling the number of CUs won't increase performance by 100%, but by ~90-95%, and you also need double the ROPs.
It should be at least at the level of the RTX 3080, but reviews will tell us the truth.
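The percentages in the post above can be checked with simple arithmetic. A quick sketch; the 1887 MHz average and 2.2 GHz figures are the thread's numbers, not official specs:

```python
# Speculative clock comparison using the thread's numbers (not official specs).
rx5700xt_avg_mhz = 1887    # observed average clock per the post
rumored_n21_mhz = 2200     # the rumored "about 2.2 GHz"

# Clock that a true +25% over the 5700 XT's average would require:
clock_for_25_pct = rx5700xt_avg_mhz * 1.25                  # 2358.75 ~ 2359 MHz
# Gain actually implied by the 2.2 GHz rumor:
implied_gain_pct = (rumored_n21_mhz / rx5700xt_avg_mhz - 1) * 100

print(round(clock_for_25_pct), round(implied_gain_pct, 1))  # 2359 16.6
```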

The RX 5700 XT's official game clock is 1755 MHz, but you can see it run anywhere between 1800 and 1850 MHz depending on the game. So if the N21 official game clock is 2.2 GHz, we should not be surprised to see it run between 2250 and 2300 MHz. Game clock to game clock, it's roughly 25%. More importantly, RDNA2 has higher perf/clock (which we need to see in press reviews). RDNA2 has improved (and probably larger) caches. So here are the vectors:

2x the CUs, 25% higher clocks, 10-15% higher perf/clock. Even without the higher perf/clock, N21 is going to end up at more than 2x the RX 5700 XT. With higher perf/clock it will pull even further ahead.
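Those vectors multiply, so the estimate can be sketched as below. All factors are the thread's speculation, including the ~90-95% scaling from doubled CUs mentioned earlier:

```python
# Rough multiplicative estimate of N21 vs RX 5700 XT from the thread's
# speculated factors: ~1.90-1.95x from doubled CUs (imperfect scaling),
# 1.25x clocks, 1.10-1.15x perf/clock.
def relative_perf(cu_factor, clock_factor, ipc_factor):
    """Multiply independent speedup factors (a simplification)."""
    return cu_factor * clock_factor * ipc_factor

low = relative_perf(1.90, 1.25, 1.10)    # pessimistic end
high = relative_perf(1.95, 1.25, 1.15)   # optimistic end
print(f"{low:.2f}x to {high:.2f}x")      # 2.61x to 2.80x
```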
 
If not, then I say we will see a 384-bit GDDR6 bus.

A 384-bit bus won't fit 16 GB of VRAM; you get 12 or 24 GB instead.
Here are two possibilities: a 256-bit bus with some "magic", or HBM2e. A 512-bit bus is too big and complicated, and AMD last used one with the Radeon 200/300 series.
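The 12/24 GB point follows from how GDDR6 is attached: each chip has a 32-bit interface, so the bus width fixes the chip count and the chip density fixes the capacity. A quick sketch, assuming the 1 GB and 2 GB chip densities available at the time:

```python
# Why a 384-bit GDDR6 bus implies 12 or 24 GB rather than 16 GB:
# each GDDR6 chip has a 32-bit interface, so bus width fixes the chip count.
def vram_options(bus_width_bits, chip_densities_gb=(1, 2)):
    chips = bus_width_bits // 32          # one chip per 32-bit channel
    return [chips * d for d in chip_densities_gb]

print(vram_options(384))   # [12, 24]
print(vram_options(256))   # [8, 16] -- 16 GB needs the 256-bit option
```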
 
The RX 5700 XT's official game clock is 1755 MHz, but you can see it run anywhere between 1800 and 1850 MHz depending on the game. So if the N21 official game clock is 2.2 GHz, we should not be surprised to see it run between 2250 and 2300 MHz. Game clock to game clock, it's roughly 25%. More importantly, RDNA2 has higher perf/clock (which we need to see in press reviews). RDNA2 has improved (and probably larger) caches. So here are the vectors:

2x the CUs, 25% higher clocks, 10-15% higher perf/clock. Even without the higher perf/clock, N21 is going to end up at more than 2x the RX 5700 XT. With higher perf/clock it will pull even further ahead.
Didn't I also say I expect at least RTX 3080 performance? Because 2x the 5700 XT is RTX 3080 performance at 4K.
 
If RDNA2 with a 256-bit bus turns out faster than the 3070 with the same bus, then it can; we just don't know in which games it saves enough bandwidth, and whether the limit is the memory bus or the GPU.

For sure, miners are DEAD and will not buy AMD anymore, as mining performance is strictly tied to raw memory bandwidth and bus width.

Also, don't forget that a wider bus means a much more expensive PCB design and GPU-side memory controller.
Yep, it appears that anything needing mostly unique data for each operation will be at a disadvantage. This explains the split to the CDNA architecture for HPC, distributed computing, etc. This is a good thing for gamers.
 
Didn't I also say I expect at least RTX 3080 performance? Because 2x the 5700 XT is RTX 3080 performance at 4K.
Yeah. But again, unless AMD has messed up, that's the least you can expect. In reality, if perf/clock is higher, then RTX 3090-level perf at 4K, and better at 1440p, is expected.
 
CUs don't scale linearly, so doubling the CUs, even with a ton of optimizations in the process, will only yield about 70-80% more performance, assuming the memory system and clocks are the same. We know it will have higher clocks, so maybe a good 15% more performance from that, and it does come close to double the RX 5700 XT's performance, which would put it at around RTX 3080 performance numbers.
 
CUs don't scale linearly, so doubling the CUs, even with a ton of optimizations in the process, will only yield about 70-80% more performance, assuming the memory system and clocks are the same. We know it will have higher clocks, so maybe a good 15% more performance from that, and it does come close to double the RX 5700 XT's performance, which would put it at around RTX 3080 performance numbers.
In order to go chiplet for dGPUs, you need to achieve the best possible, almost perfect scaling with CU count. The first step toward that is redesigning the caches to achieve the highest possible internal bandwidth.

If RDNA2 brings those cache improvements, designed for perfect or almost perfect scaling, we should see better scaling with CU count than what we have seen with RDNA1.

And we have seen that that architecture achieved not 70% scaling going from the RX 5500 XT's CU count to the RX 5700 XT's, but 86%.
 
Scaling with CU count going from 22 CUs to 40 CUs was around 93% with RDNA1, taking a game-clock-to-game-clock comparison into account. If we take the game clock on the 5500 XT versus the max clock on the 5700 XT, we are just shy of 90%. But going from 40 to 80 CUs is certainly a bigger step.
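One way to make the scaling percentages concrete is to divide the measured speedup by the ideal speedup implied by the CU ratio alone. A sketch, where both this definition and the 1.69x example speedup are illustrative assumptions, not measured data:

```python
# Scaling efficiency = measured speedup / ideal speedup from the CU ratio.
# (This definition, and the 1.69x speedup below, are illustrative assumptions.)
def scaling_efficiency(cu_small, cu_big, measured_speedup):
    ideal = cu_big / cu_small             # 40/22 ~ 1.82x if scaling were perfect
    return measured_speedup / ideal

# E.g. a 40-CU part measuring 1.69x a 22-CU part at matched clocks:
print(f"{scaling_efficiency(22, 40, 1.69):.0%}")   # 93%
```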
For chiplets, it doesn't matter how many CUs you have; you have to have the best possible, linear scaling with CU count.
 
For (GPU) chiplets... latencies are most important. Just saying
Please explain, as everything I've read from GPU designers has stated that increased latency is much easier to manage in gaming GPUs than in CPUs.

By the way, interposer-based designs have interconnect latencies of 1-2 ns (per a several-years-old Xilinx slide).
 
Please explain, as everything I've read from GPU designers has stated that increased latency is much easier to manage in gaming GPUs than in CPUs.

GPUs have lots of latency-hiding mechanisms, but they incur costs. RDNA was designed to lower shader latencies by widening the SIMD width to 32. The rationale was that lower latency means fewer waves are needed to hide it, which means fewer registers are needed to sustain performance.
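The chain of reasoning in that last sentence can be put into arithmetic. A sketch with hypothetical numbers; the 16- and 8-cycle latencies and 64 registers per wave are made up for illustration:

```python
# Waves needed to hide a latency ~ latency / issue interval, and register
# consumption scales with the number of resident waves (all numbers hypothetical).
def waves_needed(latency_cycles, issue_interval_cycles):
    return -(-latency_cycles // issue_interval_cycles)   # ceiling division

def registers_in_flight(waves, regs_per_wave):
    return waves * regs_per_wave

# If lowering shader latency halves the cycles to hide (say 16 -> 8):
print(registers_in_flight(waves_needed(16, 1), 64))   # 1024
print(registers_in_flight(waves_needed(8, 1), 64))    # 512
```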
 
Please explain, as everything I've read from GPU designers has stated that increased latency is much easier to manage in gaming GPUs than in CPUs.

Yeah, I also wonder about the scaling. In fact, scaling would be far more important in a monolithic GPU, as it makes a larger and larger die less and less useful performance-wise and far more expensive to manufacture (yields!). Adding another "cheap" chiplet at sub-optimal scaling has a far lower cost than making a bigger die at sub-optimal scaling.

The real issue is latency, and possibly bandwidth, between chiplets, and how to connect them together: e.g. some kind of IO die, or does each chiplet go to memory directly? (Unlikely, as each one would need a memory controller.)

In fact, this last part could be the driving factor behind the cache redesign. Too much traffic to the IO die and too high memory latency (chiplet -> IO -> memory), so needing fewer memory accesses would be a huge bonus.
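The benefit of a cache that absorbs traffic before it reaches the IO die can be sketched with a simple hit/miss latency model; all figures below are hypothetical:

```python
# Average memory latency under a simple hit/miss model: a bigger cache
# (higher hit rate) keeps more traffic off the chiplet -> IO -> memory path.
def avg_latency_ns(hit_rate, cache_ns, mem_ns):
    return hit_rate * cache_ns + (1 - hit_rate) * mem_ns

# Hypothetical figures: 20 ns on-die cache hit, 300 ns round trip via an IO die.
print(round(avg_latency_ns(0.50, 20, 300), 1))   # 160.0
print(round(avg_latency_ns(0.80, 20, 300), 1))   # 76.0
```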
 
I really doubt any future GPU chiplet design is going to split the compute units... more likely something similar to Zen 2: an IO die and a separate compute die.
 
I really doubt any future GPU chiplet design is going to split the compute units... more likely something similar to Zen 2: an IO die and a separate compute die.
Well, with Zen 2 you got Epyc 2 chips containing 8 chiplets and thus up to 64 cores. Imagine that kind of scalability for GPU CUs without having to use a single ~500mm² monolith.
 
Well, with Zen 2 you got Epyc 2 chips containing 8 chiplets and thus up to 64 cores. Imagine that kind of scalability for GPU CUs without having to use a single ~500mm² monolith.

I always thought GPU workloads were not that sensitive to latency and instead craved bandwidth, given how video cards use GDDRx. So in theory, splitting the shader array across chiplets could yield very big GPUs without a single big die, yes. I suppose if it were that easy they would have brought such a solution to market already. But maybe it is not.
 
I always thought GPU workloads were not that sensitive to latency and instead craved bandwidth, given how video cards use GDDRx. So in theory, splitting the shader array across chiplets could yield very big GPUs without a single big die, yes. I suppose if it were that easy they would have brought such a solution to market already. But maybe it is not.
Bandwidth is definitely the much bigger issue, seeing how performance scaling comes to a screeching halt once, e.g., the APUs hit the dual-channel DDR4 bandwidth bottleneck. With potential CU chiplets, that bandwidth would need to be a multiple of the number of chiplets to guarantee no bottleneck, and only a solution with some form of overprovisioning seems feasible if the design is to be realistically fully scalable. At that point latency can become crucial as well; after all, a GPU has only the time of a frame to do its work. The higher the framerate, the less time there is; the more chiplets, and the more unavoidable crosstalk between them, the more latency can add up, ending up as a bottleneck for the achievable framerate.

With pure compute there's no framerate to keep in mind and communication between chiplets can be avoided, so I expect CDNA to adopt a chiplet approach much earlier, and much more aggressively, than RDNA ever will.
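The frame-time budget mentioned above shrinks quickly with framerate, which is why added inter-chiplet latency matters more for graphics than for pure compute. A trivial sketch:

```python
# A GPU only has one frame's worth of time to finish its work, so any
# latency added between chiplets eats into a shrinking budget at high fps.
def frame_budget_ms(fps):
    return 1000.0 / fps

for fps in (60, 144, 240):
    print(fps, f"{frame_budget_ms(fps):.2f} ms")   # 16.67, 6.94, 4.17 ms
```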
 
Bandwidth is definitely the much bigger issue, seeing how performance scaling comes to a screeching halt once, e.g., the APUs hit the dual-channel DDR4 bandwidth bottleneck. With potential CU chiplets, that bandwidth would need to be a multiple of the number of chiplets to guarantee no bottleneck, and only a solution with some form of overprovisioning seems feasible if the design is to be realistically fully scalable. At that point latency can become crucial as well; after all, a GPU has only the time of a frame to do its work. The higher the framerate, the less time there is; the more chiplets, and the more unavoidable crosstalk between them, the more latency can add up, ending up as a bottleneck for the achievable framerate.

With pure compute there's no framerate to keep in mind and communication between chiplets can be avoided, so I expect CDNA to adopt a chiplet approach much earlier, and much more aggressively, than RDNA ever will.
I can't see a GPU chiplet solution without an active interposer tbh
 
I always thought that GPU workloads were not that sensitive in latency and cherished bandwidth, given how videocards use GDDRx. So in theory splitting the shader array across chiplets could result in very big GPUs without a single big die used yes. I suppose if it was that easy they would have brought such a solution to the market already. But maybe it is not.
The 7970 X2 was a dual-GPU card anyway.
 
GPUs have lots of latency-hiding mechanisms, but they incur costs. RDNA was designed to lower shader latencies by widening the SIMD width to 32. The rationale was that lower latency means fewer waves are needed to hide it, which means fewer registers are needed to sustain performance.
GCN has a 64-thread wave issued 16 lanes at a time to a 64-shader CU. RDNA has a 32-thread wave issued all at once to a 32-wide shader array, with everything doubled up to allow backwards compatibility with GCN.

AFAIK, this was done to improve occupancy and reduce idle shader resources, not because data latency was a problem.
 