Question Speculation: RDNA2 + CDNA Architectures thread

uzzi38 · Apr 28, 2020

All die sizes are within 5mm^2. The poster here has been right on some things in the past afaik, and to his credit was the first to saying 505mm^2 for Navi21, which other people have backed up. Even still though, take the following with a pich of salt.

Navi21 - 505mm^2

Navi22 - 340mm^2

Navi23 - 240mm^2

Source is the following post: https://www.ptt.cc/bbs/PC_Shopping/M.1588075782.A.C1E.html

uzzi38 · Aug 18, 2020

DXDiag said:
Your post in nothing more than damage control at this point, a shared unit can never beat a dedicated unit, worse yet, RDNA2 does BVH traversal on the shaders as well (Turing does it on RT cores). So RT acceleration is shared on two levels with RDNA2, not one.

And no Textures units are not overbudgeted on modern GPUs, they are just the right amount for regular texturing, 16X AF filtering, and texture heavy shaders and effects.

View attachment 28198
View attachment 28199

More damage control, Tensores are not necessary, but they are fast enough to offset any performance loss due to using ML to upscale the image, without tensors the loss would be bigger.

The shaders don't do the BVH traversal, it says they can run in parallel to the BVH traversal.

In other words, while the BVH traversal is taking place, they can perform other calculations while they wait for the BVH data.

That's called "not wasting compute resources".

And we're not talking about a shared unit vs a dedicated one if you didn't realise. We're talking about 4 shared units vs a single dedicated one.

So, let me restate what I said before: the implementation in RDNA2 is vastly different to that of Turing. Attempting to judge which is better based off of technical specs alone is a meaningless waste of time.

As for the portion about tensors, you're correct there, the performance loss will be larger. Well, it would be more accurate to say that the performance improvement from performing the same DLSS2.0 algorithm would me smaller.

Fact of the matter is that neither of us knows how long the tensors are used during the stage of the pipeline whilst they perform the algorithm, and to what degree of a slowdown it would cause. We have no degree as to the difficulty and accuracy of the algorithm AMD would use.

But I'm going to be a bit honest here - I simply cannot believe that it would not be possible to use the same algorithm as DLSS2.0 on RDNA2. I'm quite positive it'll be usuable, just much slower, such that the performance uplift will be much smaller. Still present though.

In the end, the upscaling is just another portion of the pipeline and unlike with Turing's architecture, running off the shaders still allows you to perform other computations at the same time (whereas running tensor operations on Turing prevents you from doing anything else on that SM at the same time).

You know, having remembered that last portion, I think I'm going to defer again to the same conclusion as the raytracing one.

The implementation is so different that I don't really think we can actually judge which will be appreciably better or if one will be severely handicapped vs the other. Once again, it's best to wait and see.

DisEnchantment · Aug 18, 2020

Konan said:
Hmmmm. RDNA2 implementation of RT shows that the RT operations is sharing with Textures saying you can either do one or the other but not both at the same time. Won't that impact overall RT performance delivery??
Why can't they be done at the same time?

It can't be done at the same time because AMD uses the same datapath used by the TMU and the Ray unit, using a Mux as stated in their patent. Reason being to minimize transistor bloat.
TMU writes to L1 whereas Ray units read from L0 ( and L2 ) and LDS and write to LDS to be consumed by the next shader stages.

AMD's architecture is more tailored toward inline ray tracing of DXR1.1.
When the CU is doing RTRT it is not producing anything that can be consumed by the TMU, because the shader code is waiting for hit miss results before committing.
However due to much larger CU count, having a dual pipe could allow independent command streams to run so while the other is engaged with some long running work the other "core" of the CP, could still engage since there is enough horsepower to do something else.
On top of this the ACE can still engage in parallel of the Command Processor to do tasks which can be handled by compute shaders using CUs which are not engaged.
Dual pipe CP is a smart way to improve shader occupancy for high CU count parts, provided the SW can handle it.
This is my understanding at the moment based on the limited slides and from DX12U presentation.

Glo. · Aug 18, 2020

DXDiag said:
Yes it will impact it, RT with RDNA 2 will be slower than RT in Turing.

No.

Krteq · Aug 18, 2020

Konan said:
Hmmmm. RDNA2 implementation of RT shows that the RT operations is sharing with Textures saying you can either do one or the other but not both at the same time. Won't that impact overall RT performance delivery??
Why can't they be done at the same time?

Hmm, where in the graphics pipeline you need to do RT operations at the same time as texturing operations?

AtenRa · Aug 18, 2020

Krteq said:
Hmm, where in the graphics pipeline you need to do RT operations at the same time as texturing operations?

Yea people should learn about DX raster and DXR pipeline first and them make conclusions about performances.

Olikan · Aug 18, 2020

7 instructions/clock, is a nice increase from 4 as rdna1 and GCN does...

yet, 2 of those are "control" instructions, control what exacly?

Gideon · Aug 18, 2020

Anandtech mentions an interesting dibit from Xbox the talk:

CUs have 25% better perf/clock compared to last gen

Unfortunately, as this is an Xbox talk, it most certainly means vs Polaris (XBOX One X). Therefore no real perfomance gain over RDNA1, As the original 5700 XT slides also mentions 25% increase vs Vega:

will offer 25% better performance per clock per core and 50% better power efficiency than AMD’s current-generation Vega architecture.

Fortunately the 50% perf/watt claim seems to translate directly into clock-speed so RDNA2 is still a huge improvement over RDNA1.

leoneazzurro · Aug 18, 2020

I think there is a discrepancy in this slide as publieby videocardz:

https://cdn.videocardz.com/1/2020/08/Xbox-Series-X-Slide16.jpg

with respect to this:

https://images.anandtech.com/doci/15994/202008180221421_575px.jpg

In the second one the bottom of the page is covered. but in the first there is a reference about power consumption: 10x the pixel fill rate at 1x the power consumption. Now, the 10x is clearly referred to the Xbox One so the 1x power consumption should be referred to the same console. Question is, Xbox One has a 120W max power draw.
While 170W are the power figures for the Xbox One X. But in that case the improvement is not 10x in any case, especially for pixel fill rate.
So an error? Something Videocardz or their leaker added to the slide on their own (unlikely)? Or does Xbox Series X draw less power than expected?

Karnak · Aug 18, 2020

leoneazzurro said:
So an error? Something Videocardz or their leaker added to the slide on their own (unlikely)?

Nope, there are just two slides. One with the mentioned fill rate/raw power/power consumption at the bottom and one without.

With: https://cdn.mos.cms.futurecdn.net/w4NFnxWXdsDhUJNYpWMNYP-2560-80.jpg
Without: https://cdn.mos.cms.futurecdn.net/g7CS3xPkrd4LpjXCiCA3NM-2560-80.jpg

Microsoft Xbox Series X's AMD Architecture Deep Dive at Hot Chips 2020

AMD's next-gen console APUs come out of hiding.

www.tomshardware.com

leoneazzurro · Aug 18, 2020

Thank you. In this case, the question stays: if the reference to the pixel fill rate is to the Xbox One, does this indicate a lower power consumptin for the Xbox Series X too? Or did MS be quite "rough" by considering 120W and 170W "the same"?

DisEnchantment · Aug 18, 2020

Olikan said:
7 instructions/clock, is a nice increase from 4 as rdna1 and GCN does...

yet, 2 of those are "control" instructions, control what exacly?

Indeed I have been wondering about this. This is a very intriguing topic.
I could speculate it has something to do with the control flow which in the past was also done by the scalar units.
Another possibility is the handling of branching code during RT and for synchronizing different blocks.

Gideon said:
Unfortunately, as this is an Xbox talk, it most certainly means vs Polaris (XBOX One X). Therefore no real perfomance gain over RDNA1, As the original 5700 XT slides also mentions 25% increase vs Vega:

I wondered about this as well.
Per CU is a good measure, but there is not a lot more you can do with the SIMDs without major rearchitecting which would have been hard without breaking compatibility with older GCN used in previous games. Here comes the GE/NGG/VRS.
According to former Sony dev, Matt Hargett, because of increased GE throughput, in the end what the CU has to do is much lesser because a lot of invisible triangles are simply not processed. In the end the RDNA2 GPU can do more with same CUs.
Therefore it seems a bit unnerving to see that NGG culling is crashing on Linux.
Another thing would be maximized shader occupancy due to dual core CP.
I still expect L0/L2 to be different on desktop though which will have an impact on perf/clock.
1.5x perf/watt only from clocks is a bit nuts when you recollect that RDNA1 can hover around 1.9 GHz.

lobz · Aug 18, 2020

DXDiag said:
Your post in nothing more than damage control at this point, a shared unit can never beat a dedicated unit, worse yet, RDNA2 does BVH traversal on the shaders as well (Turing does it on RT cores). So RT acceleration is shared on two levels with RDNA2, not one.

And no Textures units are not overbudgeted on modern GPUs, they are just the right amount for regular texturing, 16X AF filtering, and texture heavy shaders and effects.

View attachment 28198
View attachment 28199

More damage control, Tensores are not necessary, but they are fast enough to offset any performance loss due to using ML to upscale the image, without tensors the loss would be bigger.

Accusing @uzzi38 of all people with damage control is not going to do you any good here.

soresu · Aug 18, 2020

uzzi38 said:
But I'm going to be a bit honest here - I simply cannot believe that it would not be possible to use the same algorithm as DLSS2.0 on RDNA2. I'm quite positive it'll be usuable, just much slower, such that the performance uplift will be much smaller. Still present though.

Definitely not.

DLSS 2.0 uses the Tensor cores which are only used for that purpose in current games.*

RDNA could likely do it, but would require some shaders partitioned just for that purpose that would otherwise be doing graphics and async compute work - though I suppose if you are rendering to a lower resolution anyway that is less of an issue.

*is RT monte carlo denoising a separate algorithm to DLSS on Turing cards?

Either way the Tensor cores on Turing have little to do otherwise, leaving the base shaders to do as much as they can on graphics and compute.

soresu · Aug 18, 2020

uzzi38 said:
But I'm going to be a bit honest here - I simply cannot believe that it would not be possible to use the same algorithm as DLSS2.0 on RDNA2. I'm quite positive it'll be usuable, just much slower, such that the performance uplift will be much smaller. Still present though.

If I had to guess based upon previous nVidia software strategies of 'optimisation' I would bet that even DLSS 2.0 is probably not quite as efficient as it could be, simply because they know there is plenty of silicon to handle it while the base shaders do all the rest of the work.

Albeit I do wonder as you say what penalty is incurred for stopping to transfer between the base shader part of the GPU and the Tensor cores - exactly how well nVidia have engineered that copy procedure will determine how efficient it is on the whole, just as with the RT cores.

uzzi38 · Aug 18, 2020

soresu said:
Definitely not.

DLSS 2.0 uses the Tensor cores which are only used for that purpose in current games.*

RDNA could likely do it, but would require some shaders partitioned just for that purpose that would otherwise be doing graphics and async compute work - though I suppose if you are rendering to a lower resolution anyway that is less of an issue.

*is RT monte carlo denoising a separate algorithm to DLSS on Turing cards?

Either way the Tensor cores on Turing have little to do otherwise, leaving the base shaders to do as much as they can on graphics and compute.

Like I noted near the bottom of that post, when an SM is doing work on the Tensor cores, the rest of the SM is inactive.

https://twitter.com/x/status/1245884197601824768

AMD doesn't need seperate, dedicated units. The main difference they'll have is their shaders are slower at the work than the tensor cores could on their own, not that Turing can seperate both DLSS from the traditional graphics pipeline providing them a major advantage.

To be completely honest, I'm quite sure this is a scenario AMD have already evaluated and decided against adding tensors or something similar to them to RDNA. They have the tech ready to go already, they'd have done it by now if they wanted to.

soresu · Aug 18, 2020

uzzi38 said:
They have the tech ready to go already, they'd have done it by now if they wanted to.

Ready to go might be a bit of a stretch as it's only just going into CDNA1 this year.

It will be interesting to see how separate their own tensor/matrix solution is from the regular compute shaders.

DXDiag · Aug 18, 2020

uzzi38 said:
And we're not talking about a shared unit vs a dedicated one if you didn't realise. We're talking about 4 shared units vs a single dedicated one.

Nope, 4 simple units with simple tasks, vs one big independent advanced unit, worse yet they are shared.

uzzi38 said:
In other words, while the BVH traversal is taking place, they can perform other calculations while they wait for the BVH data.

Nope, BVH traversal is NOT accelerated on AMD RT cores, it is done on shaders, while the RT cores finish doing ray intersections.

uzzi38 said:
So, let me restate what I said before: the implementation in RDNA2 is vastly different to that of Turing. Attempting to judge which is better based off of technical specs alone is a meaningless waste of time.

I disagree, they are vastly different in the sense that AMD's solution is hybrid, underpowered, more prone to performance drops and is very sensitive requiring careful optimizations.

Have a read here:

https://www.reddit.com/r/Amd/comments/ic4bn1

uzzi38 · Aug 19, 2020

DXDiag said:
Nope, 4 simple units with simple tasks, vs one big independent advanced unit, worse yet they are shared.

4 simple units that can individually do ray-box or together do ray-triangle vs 1 independent unit that can do one of both.

We call that a "different implementation". In some cases it'll be faster, in other slower. If only I hadn't said that before...

DXDiag said:
Nope, BVH traversal is NOT accelerated on AMD RT cores, it is done on shaders, while the RT cores finish doing ray intersections.

This is only half correct. The patent describes that the RT units use a state machine to determine a list of bodes that should be traversed next in the order that need to be traversed back, which is returned to the shader.

My understanding from the patent is that the main reason this is done is for flexibility - you send all of the potential branches back to the shader, which decides whether or not certain branches should even be calculated. In Turing, the RT cores will automatically attempt to perform intersection testing on any possible branches, returning all hits along the way as workloads for a shader to pick up and finiah off.

The idea behind AMD's solution is to prevent a waste of your hardware resources by only performing the raytracing needed. It's actually one of the features being brought about by DX12U called inline raytracing.

If you follow Twitter discussions on tech at all, you'll know about this guy that does great die annotations called GPUsAreMagic or Nemez. Here's an imgur link to a load of stuff he's said about the feature and how both AMD and Nvidia go about raytracing:

https://imgur.com/a/ayGZKdz

His explanation is several times better than I could hope to give. It's actually a discussion from many months ago, back when we first found out about AMD's patent.

DXDiag said:
I disagree, they are vastly different in the sense that AMD's solution is hybrid, underpowered, more prone to performance drops and is very sensitive requiring careful optimizations.

Have a read here:

https://www.reddit.com/r/Amd/comments/ic4bn1

Well I hope this should explain why I've been so insistent on "it depends". The most likely scenario is that different games will perform differently on both architectures. The implementation in RDNA should allow for significantly more optimisation as to what should and shouldn't be raytraced, and I imagine in games that do utilise inline raytracing RDNA will take a significant lead. The implementation in Turing may well have a significant leg-up when there are no such implementations in place.

But this is just speculation, and we don't really know how things will turn out again. So once again: the two are entirely different implementations, both with their own strengths and weaknesses. At this point I'm starting to wonder how many times I'll need to say this before you begin to accept it...

itsmydamnation · Aug 19, 2020

DXDiag said:
Nope, 4 simple units with simple tasks, vs one big independent advanced unit, worse yet they are shared.

Nope, BVH traversal is NOT accelerated on AMD RT cores, it is done on shaders, while the RT cores finish doing ray intersections.

I disagree, they are vastly different in the sense that AMD's solution is hybrid, underpowered, more prone to performance drops and is very sensitive requiring careful optimizations.

Have a read here:

https://www.reddit.com/r/Amd/comments/ic4bn1

Seeing as your speaking like your an expect .

Functionally prove it

ie show the life of different rays with different bounces/refections/refractions and how it fits in the pipeline (functionally not high level BS) and how one solution is far superior to the other.

soresu · Aug 19, 2020

itsmydamnation said:
Seeing as your speaking like your an expect .

Functionally prove it

ie show the life of different rays with different bounces/refections/refractions and how it fits in the pipeline (functionally not high level BS) and how one solution is far superior to the other.

Proof is hard to come by when no production hardware exists on the open market for either console or the PC cards using standard RDNA2.

sontin · Aug 19, 2020

uzzi38 said:
4 simple units that can individually do ray-box or together do ray-triangle vs 1 independent unit that can do one of both.

We call that a "different implementation". In some cases it'll be faster, in other slower. If only I hadn't said that before...

No, it wont be faster. nVidia has the same numbers of RT Cores like AMD in each compute unit. But unlike AMD every RT Core is doing the whole acceleration part (BVH travel and intersection test).

Its a superior implementation which can be used totally free from the other units and doesnt stall the cores for BVH travel.

moinmoin · Aug 19, 2020

sontin said:
No, it wont be faster. nVidia has the same numbers of RT Cores like AMD in each compute unit. But unlike AMD every RT Core is doing the whole acceleration part (BVH travel and intersection test).

Its a superior implementation which can be used totally free from the other units and doesnt stall the cores for BVH travel.

Or so you are told and so you chose to believe anyway. How about we wait for benchmarks instead of prematurely jumping to conclusions?

sontin · Aug 19, 2020

it wont be faster because nVidia is doing the exact same thing. The difference is that nVidia has a fixed-function unit for the whole process and AMD is rerouting work to the shaders back and forth. It cant be faster but it may be not slower.

DXDiag · Aug 19, 2020

moinmoin said:
Or so you are told and so you chose to believe anyway. How about we wait for benchmarks instead of prematurely jumping to conclusions?

We will wait for benchmarks of course, but this is a tech forum, we predict and extrapolate behaviors based on the info we have.

uzzi38 said:
His explanation is several times better than I could hope to give. It's actually a discussion from many months ago, back when we first found out about AMD's patent.

I explicitly chatted with several NVIDIA engineers and developers over discord at the DX12U stream event, and asked them directly whether DXR1.1 would be slow on Turing hardware, the answer was a resounding NO. They stated DXR 1.1 will work just as well as DXR1.0 on Turing.

Krteq · Aug 19, 2020

There is no reason why Turing should be slower in inline RT in DXR1.1, but AMD's implementation can be faster in inline RT than in current DXR 1.0 RT.

Question Speculation: RDNA2 + CDNA Architectures thread

Platinum Member

Platinum Member

Golden Member

Diamond Member

Golden Member

Lifer

Platinum Member

Platinum Member

Golden Member

Senior member

Golden Member

Golden Member

Platinum Member

Diamond Member

Diamond Member

Platinum Member

Diamond Member

Member

Platinum Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Member

Golden Member