Question Speculation: RDNA2 + CDNA Architectures thread

uzzi38 · Apr 28, 2020

All die sizes are within 5mm^2. The poster here has been right on some things in the past afaik, and to his credit was the first to saying 505mm^2 for Navi21, which other people have backed up. Even still though, take the following with a pich of salt.

Navi21 - 505mm^2

Navi22 - 340mm^2

Navi23 - 240mm^2

Source is the following post: https://www.ptt.cc/bbs/PC_Shopping/M.1588075782.A.C1E.html

Glo. · Aug 19, 2020

Leave Nvidia discussion out of AMD thread, please.

maddie · Aug 19, 2020

DXDiag said:
We will wait for benchmarks of course, but this is a tech forum, we predict and extrapolate behaviors based on the info we have.

I explicitly chatted with several NVIDIA engineers and developers over discord at the DX12U stream event, and asked them directly whether DXR1.1 would be slow on Turing hardware, the answer was a resounding NO. They stated DXR 1.1 will work just as well as DXR1.0 on Turing.

What in the world does this really mean. I hope you realize that slow is not really quantifiable in absolute terms. One person's fast is another's slow.

uzzi38 · Aug 19, 2020

sontin said:
No, it wont be faster. nVidia has the same numbers of RT Cores like AMD in each compute unit. But unlike AMD every RT Core is doing the whole acceleration part (BVH travel and intersection test).

Its a superior implementation which can be used totally free from the other units and doesnt stall the cores for BVH travel.

Nvidia's RT cores handle a single calculation per clock. This is the same for both ray-box and ray-triangle. I agree that Nvidia's solution will cause less bottlenecks, but you're being silly if you don't recognise that AMD has got a distinct advantage in this area. Similarily, the handling of traversal on the RT cores trades up flexibility for speed. Where that flexibility is needed, Turing will take a notable hit to performance.

There no harm in admitting both architectures have their strengths and weaknesses. That's kind of what makes tech interesting. I honestly do see AMD's implementation struggling in older RTRT games, and ones not particularily well optimised for the consoles. However, ones that are, RDNA will come out ahead in.

Just repeating "muh dedicated coars" doesn't really push a conversation very far.

DXDiag said:
I explicitly chatted with several NVIDIA engineers and developers over discord at the DX12U stream event, and asked them directly whether DXR1.1 would be slow on Turing hardware, the answer was a resounding NO. They stated DXR 1.1 will work just as well as DXR1.0 on Turing.

"I spoke to Nvidia about whether or not their current products would age badly. They said no."

That's what I read here. Seriously, what kind of response were you expecting from Nvidia? And you had the nerve to say I was performing damage control earlier?

Anywho, after seeing this thread Nemes had a little more to say:

Uh, enjoy?

Stuka87 · Aug 19, 2020

Some good details on the XBSX GPU in this article that went up last night:

Hot Chips 2020 Live Blog: Microsoft Xbox Series X System Architecture (6:00pm PT)

www.anandtech.com

GoodRevrnd · Aug 19, 2020

Stuka87 said:
Some good details on the XBSX GPU in this article that went up last night:

Hot Chips 2020 Live Blog: Microsoft Xbox Series X System Architecture (6:00pm PT)

www.anandtech.com

Wonder if "Efficient enhancements rather than isolated cores" is commentary on RT implementation.

uzzi38 · Aug 19, 2020

GoodRevrnd said:
Wonder if "Efficient enhancements rather than isolated cores" is commentary on RT implementation.

That actually does read like what I linked about inline RTRT. Surprised I missed that slide now. Thanks for that!

Stuka87 · Aug 19, 2020

GoodRevrnd said:
Wonder if "Efficient enhancements rather than isolated cores" is commentary on RT implementation.

It is specifically talking about the RT implementation.

GodisanAtheist · Aug 19, 2020

Any word on RDNA2 transistor density improvements? Looks like XBSX is still at ~40Mtr/mm2 but that's sort of to be expected given the design cycle of that part and the high volume market it will be addressing.

Wonder if AMD will be able to get some additional die size savings based on what they've learned with the release of Renoir (which i think is around 60Mtr/mm2 so a 50% density increase).

Maybe they're saving that for RDNA3 as a refresh...

uzzi38 · Aug 19, 2020

GodisanAtheist said:
Any word on RDNA2 transistor density improvements? Looks like XBSX is still at ~40Mtr/mm2 but that's sort of to be expected given the design cycle of that part and the high volume market it will be addressing.

Wonder if AMD will be able to get some additional die size savings based on what they've learned with the release of Renoir (which i think is around 60Mtr/mm2 so a 50% density increase).

Maybe they're saving that for RDNA3 as a refresh...

From the looks of it the Series X looks to ve a whole ~1Mtr/mm^2 more dense than Navi10!

Yeah, not much has changed. Interestingly though, each CU appears to be ever so slightly smaller (difference is <1mm^2)

Also, I'd be surprised to see RDNA3 as a refresh, kind of hard to get another 50% perf/W with just a refresh if you ask me now that nodes are providing less returns on power efficiency than ever before.

GodisanAtheist · Aug 19, 2020

uzzi38 said:
From the looks of it the Series X looks to ve a whole ~1Mtr/mm^2 more dense than Navi10!

Yeah, not much has changed. Interestingly though, each CU appears to be ever so slightly smaller (difference is
Also, I'd be surprised to see RDNA3 as a refresh, kind of hard to get another 50% perf/W with just a refresh if you ask me now that nodes are providing less returns on power efficiency than ever before.

- I can see RDNA2 being the "tick" - new features and efficiencies, with RDNA3 being the "tock" - a shrink and refinement of the existing feature set (maybe with more packed in).

Only in this case, they would shrink within the same node since their first round on 7nm was so conservative given the kind of transistor density 7nm is capable of (I think it caps out at 80Mtr/mm2, but that's not really practical while we know 60Mtr/mm2 is definitely possible given Renoir).

sontin · Aug 20, 2020

uzzi38 said:
Nvidia's RT cores handle a single calculation per clock. This is the same for both ray-box and ray-triangle. I agree that Nvidia's solution will cause less bottlenecks, but you're being silly if you don't recognise that AMD has got a distinct advantage in this area. Similarily, the handling of traversal on the RT cores trades up flexibility for speed. Where that flexibility is needed, Turing will take a notable hit to performance.

A RT Core is a math unit. Like a TensorCore it's doing much more work than a standard "core". Or how do you think a 2080TI can be 10x faster with just 72 RT Cores than a 1080TI?
And there doesnt exist any flexibility with AMD's approach, too. They still need a BVH to accelerate Raytracing.

Stuka87 · Aug 20, 2020

sontin said:
A RT Core is a math unit. Like a TensorCore it's doing much more work than a standard "core". Or how do you think a 2080TI can be 10x faster with just 72 RT Cores than a 1080TI?
And there doesnt exist any flexibility with AMD's approach, too. They still need a BVH to accelerate Raytracing.

Er, AMD's approach has MORE flexibility, not less. Any time you have fixed function hardware, you throw flexibility out the window. AMD's approach should also offer far greater linearity in performance as you scale up to larger chips.

But again, we don't know. Trying to argue a point with an unreleased product with extremely limited information is pointless.

sontin · Aug 20, 2020

No, it hasnt because AMD has to travel through the BVH-tree. The same limitation nVidia has just they have to use the compute units to calculate the next BVH-leaf.

soresu · Aug 20, 2020

sontin said:
They still need a BVH to accelerate Raytracing.

BVH is not hardware, it is just a data structure - strictly speaking it is not necessary for ray intersection/traversal acceleration, it just happens to be the one most focused on.

sontin said:
A RT Core is a math unit. Like a TensorCore it's doing much more work than a standard "core". Or how do you think a 2080TI can be 10x faster with just 72 RT Cores than a 1080TI?

RT cores are not doing "much more" work, they are simply optimised for a specific type of work as Tensor cores are.

General compute flexibility comes with a cost to efficiency in any given task - this is precisely why nVidia made Tensor and RT cores.

Either way,by MS's own HotChips slides we know that their RT acceleration is significantly custom, so drawing conclusions on standard RDNA2 uArch RT performance based on those slides is clearly a flawed position.

If not AMD would have released RDNA2 slides of their own prior to HotChips.

uzzi38 · Aug 20, 2020

sontin said:
A RT Core is a math unit. Like a TensorCore it's doing much more work than a standard "core". Or how do you think a 2080TI can be 10x faster with just 72 RT Cores than a 1080TI?

Those cores aren't always 100% active. The number of times where you'd be making a tradeoff between texturing and raytracing is relatively low given they're entirely different parts of the rendering pipeline.

sontin said:
And there doesnt exist any flexibility with AMD's approach, too. They still need a BVH to accelerate Raytracing.

If you're not going to read what was written next time don't reply. It wastes both of our times.

The flexibility is in regards to performing optimisations on what you are tracing. Inline raytracing and more direct in-hardware support for it allows you to be write more complex logic on how you determine when you should continue on traversing the BVH tree. The flexibility mentionned in the patent and by Microsoft refers to this. Assuming the current implementations of RTRT are performed exactly as described by the Turing Whitepaper, you would start off by constructing the BVH and sending it over to the RT cores to calculate on, along with the number of bounces, and if allowed to do so, the RT core will continue to traverse the BVH until it finds every single hit/miss along the way before sending that back to the shaders to draw the scenes. Nemes does state that most implementations don't actually do this anyway though.

This implementation straight up doesn't work with inline raytracing, hence the flexibility part of what I wrote. Provided Nemes is correct, most games end up and in the case of future games, will end up performing RTRT on Turing (and probably Ampere too) the way they will on RDNA2, because that method allows them to optimise their games better at relatively small costs.

uzzi38 · Aug 20, 2020

soresu said:
Either way,by MS's own HotChips slides we know that their RT acceleration is significantly custom, so drawing conclusions on standard RDNA2 uArch RT performance based on those slides is clearly a flawed position.

Oh weird, I thought they said it was the standard RDNA2 implementation, but looking through the Live Blog and it seems you're correct. Microsoft did indeed say it was a custom implementation.

moinmoin · Aug 20, 2020

uzzi38 said:
Oh weird, I thought they said it was the standard RDNA2 implementation, but looking through the Live Blog and it seems you're correct. Microsoft did indeed say it was a custom implementation.

The extend of the customization still needs to clear up. Both Microsoft and Sony have to empathize and likely exaggerate their custom implementations since otherwise both Series X and PS5 are pretty close hardware wise.

Gideon · Aug 20, 2020

Regarding RT perf one also has to adress the elephant in the room: Memory latency and bandwidth.

There is some good discussion about the complexity of mitigating mem-latency going on about it in this thread:

https://www.reddit.com/r/hardware/comments/icq4pq/_/g24zlon

Caches can help as well as other clever optimisations bundling rays that are going to same direction together.

On top of latency its also very bandwidth intensive. AMD themselves suggested to avoid any memory intensive operations in their single-shaders raytracing talk while waiting for the result from RT units as raytracing hogd up memory.

TL;DR
What I'm trying to say is, that in the end it won't matter how good your fixed-function RT units are, when you're bottonecked by memory. Therefore I'm also sceptical any vendors solution will outperform others by multiples

sontin · Aug 20, 2020

uzzi38 said:
Those cores aren't always 100% active. The number of times where you'd be making a tradeoff between texturing and raytracing is relatively low given they're entirely different parts of the rendering pipeline.

Has nothing to do with the RTA units. RTA only working with Raytracing. And with that the shader core has to calculate the next BVH-leaf.

The flexibility is in regards to performing optimisations on what you are tracing. Inline raytracing and more direct in-hardware support for it allows you to be write more complex logic on how you determine when you should continue on traversing the BVH tree. The flexibility mentionned in the patent and by Microsoft refers to this. Assuming the current implementations of RTRT are performed exactly as described by the Turing Whitepaper, you would start off by constructing the BVH and sending it over to the RT cores to calculate on, along with the number of bounces, and if allowed to do so, the RT core will continue to traverse the BVH until it finds every single hit/miss along the way before sending that back to the shaders to draw the scenes. Nemes does state that most implementations don't actually do this anyway though.

This implementation straight up doesn't work with inline raytracing, hence the flexibility part of what I wrote. Provided Nemes is correct, most games end up and in the case of future games, will end up performing RTRT on Turing (and probably Ampere too) the way they will on RDNA2, because that method allows them to optimise their games better at relatively small costs.

And yet Microsoft writes the opposite: https://devblogs.microsoft.com/directx/dxr-1-1/

Pls dont take anything on the internet for real. DXR1.1 doesnt change anything for the tracing part of the rays:

Inline raytracing in shaders starts with instantiating a RayQuery object as a local variable, acting as a state machine for ray query with a relatively large state footprint. The shader interacts with the RayQuery object’s methods to advance the query through an acceleration structure and query traversal information.

The API hides access to the acceleration structure (e.g. data structure traversal, box, triangle intersection), leaving it to the hardware/driver. All necessary app code surrounding these fixed-function acceleration structure accesses, for handling both enumerated candidate hits and the result of a query (e.g. hit vs miss), can be self-contained in the shader driving the RayQuery.

soresu · Aug 20, 2020

uzzi38 said:
Oh weird, I thought they said it was the standard RDNA2 implementation, but looking through the Live Blog and it seems you're correct. Microsoft did indeed say it was a custom implementation.

That was my thought too prior to looking through the slides.

It may be that the first person who declared it to be standard RDNA2 was less informed.

Either that or they simply did not want to go into detail at the time, and using the term custom might have brought up doubt as to which implementation was better and if it impacted their previously superior 12 TF to 10 TF positioning vs PS5 where RT is concerned.

Given the rumors that PS5 may have RDNA3 features built in I could see MS playing it safe earlier on rather than giving cause for speculation - I guess now it's just a waiting game until we get more exacting details on the PS5 implementation.

jpiniero · Aug 20, 2020

soresu said:
Given the rumors that PS5 may have RDNA3 features built in I could see MS playing it safe earlier on rather than giving cause for speculation - I guess now it's just a waiting game until we get more exacting details on the PS5 implementation.

I would be very surprised if the GPU on the PS5 was meaningfully different beyond having 36 CUs versus 52.

FaaR · Aug 20, 2020

soresu said:
Given the rumors that PS5 may have RDNA3 features built in

These types of wishful thinking fanboy-friendly rumors are almost never true. For example, I remember people back in the day essentially betting their lives on Wii main ASIC having special hardware not present in the Gamecube.

Yeah, other than the Wifi and USB stuff, that wasn't the case. lol

uzzi38 · Aug 20, 2020

sontin said:
Has nothing to do with the RTA units. RTA only working with Raytracing. And with that the shader core has to calculate the next BVH-leaf.

I'm referring to AMD's implementation. Microsoft's has the RT units shared for resources with the TMUs, so only one can do anything at any given clock for any given CU. The other cannot be used.

sontin said:
And yet Microsoft writes the opposite: https://devblogs.microsoft.com/directx/dxr-1-1/

Pls dont take anything on the internet for real. DXR1.1 doesnt change anything for the tracing part of the rays:

You should have probably read further down too:

Inline raytracing gives developers the option to drive more of the raytracing process. As opposed to handing work scheduling entirely to the system. This could be useful for many reasons:

Perhaps the developer knows their scenario is simple enough that the overhead of dynamic shader scheduling is not worthwhile. For example a well constrained way of calculating shadows.

It could be convenient/efficient to query an acceleration structure from a shader that doesn’t support dynamic-shader-based rays. Like a compute shader.

It might be helpful to combine dynamic-shader-based raytracing with the inline form. Some raytracing shader stages, like intersection shaders and any hit shaders, don’t even support tracing rays via dynamic-shader-based raytracing. But the inline form is available everywhere.

Another combination is to switch to the inline form for simple recursive rays. This enables the app to declare there is no recursion for the underlying raytracing pipeline, given inline raytracing is handling recursive rays. The simpler dynamic scheduling burden on the system might yield better efficiency. This trades off against the large state footprint in shaders that use inline raytracing.

The basic assumption is that scenarios with many complex shaders will run better with dynamic-shader-based raytracing. As opposed to using massive inline raytracing uber-shaders. And scenarios that would use a very minimal shading complexity and/or very few shaders might run better with inline raytracing.

What I wrote was my own understanding after talking with Nemes. It seems like I was partially wrong with my understanding, but I'm not entirely sure. The first few points talk about inline RTRT as being almost like the opposite way around to achieving what I was talking about. Instead of limiting the results of sending rays out, instead it's

1. A technique performed when attempting to perform RTRT on a limited number of objects. The example MS give is with shadows, so you would use inline RTRT in a dimly lit area with a limited number of lighting objects.

2. Certain shaders - such as compute shaders - aren't compatible with the standard RTRT methods.

3. More stuff about support.

4. When working with simple recursive rays, inline RTRT can also be more efficient than the normal method.

That being said, nothing here is entirely contradictory to the basis for my point in that it's used for optimising scenarios where the work you want to do is more simple, I was just quite wrong on how it does that. None of this also contradicts with what Nemes has said regarding ebing able to perform inline RTRT in the RT cores itself (or well, lack of that functionality). It says that the exact implementation is left to the hardware/drivers itself, but does not clearly specify how either would handle it at all.

Would love for some extra reading material though, so if you do actually find something that actually does prove what Nemes wrote wrong, please do share. Or something that proves him rught for that matter. For now, the few pages here will have to suffice...

soresu · Aug 20, 2020

jpiniero said:
I would be very surprised if the GPU on the PS5 was meaningfully different beyond having 36 CUs versus 52.

I did not mean to imply that this is actually the case - only that MS PR may have been influenced by events and rumors at one time or another, especially during the last few months when so much has been in flux.

FaaR said:
These types of wishful thinking fanboy-friendly rumors are almost never true. For example, I remember people back in the day essentially betting their lives on Wii main ASIC having special hardware not present in the Gamecube.

I also have a particular beef with Nintendo over inflating PR of hardware specs - they clearly implied that WiiU CPU was POWER7 based by using the Watson AI computer in early language about it to generate hype.

As we know it turned out to be little more than an even higher clocked, triple core version of the GC/Wii CPU which uses a much earlier, and less performant PPC core - needless to say I wasn't impressed at the time and Nintendo have long since dropped beneath my RADAR so far as interesting HW internals are concerned.

As for the PS5 RDNA3 feature rumors though, you are wrong to say such things with an example by Sony themselves less than 4 years ago.

PS4 Pro used what was basically Rapid Packed Math (double rate FP16), yet in a GPU uArch that was otherwise pretty much Polaris.

RPM did not in fact make it into a PC desktop GPU uArch until Vega - oddly it did not even make it into the XB1X/Scorpio GPU a year later than PS4 Pro.

Given these things, it is hardly the "wishful thinking fanboy-friendly" type of rumors - it clearly has recent precedence, especially regarding a Sony console.

It's certainly unlikely to be anything close to full fat RDNA3, but a feature or 2 making it into PS5 is not such a stretch depending on its maturity, and whether it was actually developed for PS5 in the first place (as I believe was the case for RPM and PS4 Pro).

FaaR · Aug 20, 2020

soresu said:
Given these things, it is hardly the "wishful thinking fanboy-friendly" type of rumors - it clearly has recent precedence, especially regarding a Sony console.

It's not precedence to point at one unrelated hardware feature in a different generation of likewise unrelated hardware and then somehow extrapolating that into an internet rumor regarding two different generations of hardware being true.

Likewise I'm not wrong when I said that these rumors really are almost never true. They really are almost never true! This basic sort of rumor has been around for ages, and they're a dime a dozen whenever a new console generation is about to come out, surely you've noticed that by now.

For example, some nutty guy who went under the handle of Chaphack (and a whole bunch of similar variations, because he kept getting banned) at Beyond3D, who was a huge MS fanboy, claimed before what later became known as Xbox 360 was revealed that it would have a "Tejas" Pentium4 CPU clocked at 10GHz. As we now know, that wasn't quite the case. lol

The basic principle behind these rumors and how they proliferate are easy enough to understand; people really want their preferred console to be special in some way. Hence all the dumb stuff we heard about the Wii, and the Wii U too, and many other consoles besides. Hell, console manufacturers sometimes engage in such rumormongery themselves, like when Sony BS-ingly claimed that exporting PS2s to Iraq had been banned by the UN, because the consoles were allegedly so powerful they could be converted into control systems for cruise missiles! lol

And not long ago I heard some loose talk about MS special sauce they'd held back about the Xbox SeX for example. But we've now had a hotchips presentation about the machine and there doesn't seem to be any such sauce in there, or they'd most likely mentioned it. Because why wouldn't they? A hot chips presentation is meant to brag about the machine and its capabilities, why leave stuff out on purpose, thus making it look weaker. And the thing goes on sale in like two months or maybe three anyhow. The time for keeping hardware secrets for this coming gen of consoles is essentially up.

So one oddball feature from PS Pro doesn't give this rumor any special credibility. It's just yet one more unsubstantiated claim with nothing in reality to back it up.

Question Speculation: RDNA2 + CDNA Architectures thread

Platinum Member

Diamond Member

Diamond Member

Platinum Member

Diamond Member

Diamond Member

Platinum Member

Diamond Member

Diamond Member

Platinum Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Platinum Member

Platinum Member

Platinum Member

Diamond Member

Golden Member

Diamond Member

Platinum Member

Lifer

Golden Member

Platinum Member

Platinum Member

Golden Member