Discussion RDNA 5 / UDNA (CDNA Next) speculation

adroc_thurston · Thursday at 5:52 PM

soresu said:
Well I mean, semi custom does at least get discussed

Kind of, mostly in terms of gfx roadmaps.
But again, FAD is for real meat, not fluff talks with gfx ISVs.

MrMPFR · Thursday at 5:55 PM

SolidQ said:
Here in topic was this - https://www.freepatentsonline.com/y2025/0104328.html

That's just one patent. There are more related to traversal in HW:
Traversal recursion for acceleration structure traversal
Graphics processing unit traversal engine

With SWC it brings RT up to Level 3.5 like Alchemist and Ada Lovelace and later.
Everything else is unfortunately shaky (except DGF in HW). This was barely even a teaser for RDNA 5.

But this part was interesting:
"One top of those performance increases (BVH traversal in HW), there's other features in the works, too, such as flexible and efficient data structures for the geometry being ray traced."

Have to assume this goes well beyond DMMs and DGF. How far who knows.
This patent implementing partial BVH computations directly within RT cores (sorting and reductions) popped up last week:
System and Method for Bounding Volume Hierarchy Construction

But more likely to be referencing to something akin to the the overlay trees and delta instances compression patents:
Acceleration structures with delta instances
Overlay trees for ray tracing

^Just patents. Who knows what actually ends up in RDNA 5.

Ignore^. @Kepler_L2 has spoken. It's just old boring DGF. Really hoped for more in RDNA 5 even if it's still beyond Blackwell.

~~Edit:~~ Huynh talked about the new BHV traversal HW reducing load on GPU shaders AND the CPU (It's possible he misspoke). Is the reduced CPU load from BVH in HW and/or actual HW BVH management as mentioned in the newer patent?

Again probably ignore. Seems like the novel partial BVH build in HW patent most likely absent in RDNA5. What a shame Level 5 RT implementation would've been massive.

MrMPFR · Thursday at 6:11 PM

adroc_thurston said:
because they don't really do it like anyone else.

Didn't expect it to be that different.

adroc_thurston said:
RDNA5 will have more stuff.

Sure. Just a teaser if you can even call it that.

adroc_thurston said:
was it really hype.
They just talked a bit about challenges ahead.

You're right but the lazy tech press will find a way to spin it as hype xD

adroc_thurston said:
FAD is for roadmaps and serious people, not console toddlerslop. get real.

Rewatch the 2020 FAD. There's tons of details on RDNA 2 and confirmation for NG consoles, but format would have had to be complete different.

adroc_thurston · Thursday at 6:13 PM

MrMPFR said:
Didn't expect it to be that different.

They like meth.

MrMPFR said:
Sure. Just a teaser if you can even call it that.

Yuh.

MrMPFR said:
You're right but the lazy tech press will find a way to spin it as hype xD

wccftech article in 3... 2... 1.

MrMPFR said:
Rewatch the 2020 FAD. There's tons of details on RDNA 2 and confirmation for NG consoles, but format would have had to be complete different.

The 2022 one had like one or two slides for RDNA3.

Keller_TT · Thursday at 6:41 PM

RnR_au said:
Hehe - python is just being used as a scripting language calling highly optimised 'AI primitives' coded in C/C++.

There is a thing called MegaKernel - you describe the computation graph for your LLM in python code and then it compiles a single gpu kernel that is highly optimised in terms of memory accesses. Very interesting stuff. Very fast and no C++

Compiling LLMs into a MegaKernel: A Path to Low-Latency Inference

TL;DR: We developed a compiler that automatically transforms LLM inference into a single megakernel — a fused GPU kernel that performs…

zhihaojia.medium.com

A smidge offtopic though.... looking forward to the 128GB RDNA 5 AI cards!!

Btw, my alma mater, Uni Heidelberg, started a project called hipSYCL, which has been renamed to AdaptiveCpp (fully open source on GitHub), which is using standard C++ 17 to move over from CUDA for HPC, GPGPU work. It was specifically started to extract the best from AMD GPUs, and its foundational papers were published on AMD testbench. But it is meant to be vendor neutral for CPUs + GPUs.. It is a super project, and I'm glad that I could do few little things for it. It is not specifically targeted at ML, but one can write ML kernels nevertheless.

Whatever Lattner critiqued about OpenCL blowing it with its terrible governance and mismatch of competitive interests holding back, this one goes a long way solving it, as it is started and managed by University led pure scientific research for real-world needs.

Reg Python instead of C++ for ML/AI and making CUDA moot, that's called Mojo, though Mojo is much more than that. They just recently added GPU programming support for RDNA 3, and 4.

RnR_au · Thursday at 7:09 PM

Keller_TT said:
Whatever Lattner critiqued about OpenCL blowing it with its terrible governance and mismatch of competitive interests holding back, this one goes a long way solving it, as it is started and managed by University led pure scientific research for real-world needs.

Hear! Hear!

basix · Friday at 5:41 AM

adroc_thurston said:
SE.
They're doing LDS to LDS transfers.

What about going one step further (as you mentioned "crackpot solution"):
Shared L1 caches across one SE: https://adwaitjog.github.io/docs/pdf/sharedl1-pact20.pdf
- L1 in this context would be the new unified L0/LDS of gfx1250+

DNN execution got accelerated by 2...3x

Mopetar · Friday at 11:36 AM

adroc_thurston said:
Wait no it is there.

adroc_thurston said:
or not.
god NV documentation is painfully obtuse when it comes to arch-specific features.

adroc_thurston said:
it is there.

It would appear that NVidia has some form of quantum technology in their products. Based off of this comment I expect MLID to be reporting on it within the week.

Win2012R2 · Friday at 11:37 AM

MrMPFR said:
Level 5 RT implementation would've been massive.

Does it become self-aware at that level?

adroc_thurston · Friday at 12:40 PM

MrMPFR said:
Level 5 RT implementation would've been massive.

This 'level' stuff is Fake and Gay since none of that slop addresses the main issue of doing RTRT on things not Larrabee.

basix · Friday at 12:45 PM

Win2012R2 said:
Does it become self-aware at that level?

RT Level 5 is only
- hard(a)ware
- hard-aware
- what(a)ever

MrMPFR · Friday at 1:38 PM

Win2012R2 said:
Does it become self-aware at that level?

I'm just using Imagination Technologies's old levels of RT (each higher level build upon prev): https://gfxspeak.com/featured/the-levels-tracing/
Level 1 = SW emulaton
Level 2 = Ray tri/box (RDNA 2+)
Level 3 = HW BVH processing (RTX 20-30)
Level 3.5 = Thread coherency sorting (ARC, 40-50 series, M3 and later and ?RDNA 5)
Level 4 = Ray coherency sorting (PowerVR Photon)
Level 5 = HW BVH construction (PowerVR GR6500)

It's completely meaningless for performance but a good gauge of architectural sophistication (number of fixed-function HW blocks). BTW Imagination scrapped Level 5 since it wasn't worth it.
Don't take it too seriously.

adroc_thurston said:
This 'level' stuff is Fake and Gay since none of that slop addresses the main issue of doing RTRT on things not Larrabee.

I'll be interesting to see where RDNA 5 lands. Register renaming is already a step towards CPU territory but not enough.

Also the entire point of Level 4 is avoid that overhead entirely by making RT behave differently to align with SIMD rather than MIMD. SER/SWC are bandaids. They don't fix the problem at its root unlike ray coherency sorting does. Rays heading in the same direction need to be batched and run together, instead of randomly assigning rays heading in left and right within a SIMD. Until that happens RT will always prefer MIMD.

basix said:
RT Level 5 is only
- hard(a)ware
- hard-aware
- what(a)ever

Lol

adroc_thurston · Friday at 1:41 PM

MrMPFR said:
Also the entire point of Level 4 is avoid that overhead entirely by making RT behave differently to align with SIMD rather than MIMD

it's all Fake and Gay since you're still adding chains of very latency-sensitive ops to a hardware pipeline that is just not built for it.
RTRT is just a really, really, really bad workload for anything, but especially GPUs that have like 200ns of L2 latency alone.

Keller_TT · Friday at 2:27 PM

YouTube decided to show me this channel "Threat Interactive", and this guy lays into the RT/PT kool aid, the current Unreal slop, and Digital Foundry's crap about "pushing gaming tech".

The guy has subsequently released a 2nd part to this today, but this part from 10 days ago is about Callisto Protocol's implementation of BRDF:

marees · Friday at 2:33 PM

Keller_TT said:
YouTube decided to show me this channel "Threat Interactive", and this guy lays into the RT/PT kool aid, the current Unreal slop, and Digital Foundry's crap about "pushing gaming tech".

The guy has subsequently released a 2nd part to this today, but this part from 10 days ago is about Callisto Protocol's implementation of BRDF:

What is the RDNA 5 connection ?

Is it PT ??

Keller_TT · Friday at 2:45 PM

marees said:
What is the RDNA 5 connection ?

Is it PT ??

Not specific to RDNA 5 or any graphics card, but just the current trajectory pushed by the incumbent powers that be - Read Epic, Nvidia, and graphics built on Unreal Engine.

His channel is about graphics tech in game engines.

Darkmont · Friday at 2:52 PM

Threat Interactive is a sludge posting grifter who pays his rent by appealing to 104 IQ redditors with a big stiffy for hating games made after 2015. Replace any graphics buzzwords he hates with "woke" and the output is 1:1

poke01 · Friday at 2:55 PM

Don’t fall for marketing buzz words from any company, that AMD/PS video was pure puke.

Likewise that RT level chart is funny coming from IMG

marees · Friday at 2:59 PM

Keller_TT said:
Not specific to RDNA 5 or any graphics card, but just the current trajectory pushed by the incumbent powers that be - Read Epic, Nvidia, and graphics built on Unreal Engine.

His channel is about graphics tech in game engines.

The combo of nanite with RT has wrecked many games

Plus Lumen implementation produces a result that is very hard to optimize for low end (SVOGI - Voxel Cone Tracing gives much better bang for buck)

I believe Epic UE5 has some work to do on UE5 for performance on low end cards

adroc_thurston · Friday at 3:02 PM

marees said:
The combo of nanite with RT has wrecked many games

No, the devs are just incompetent.

MrMPFR · Friday at 3:27 PM

adroc_thurston said:
it's all Fake and Gay since you're still adding chains of very latency-sensitive ops to a hardware pipeline that is just not built for it.
RTRT is just a really, really, really bad workload for anything, but especially GPUs that have like 200ns of L2 latency alone.

Was just reporting the stuff mentioned in the patent filing and the PowerVR Photon Whitepaper (ignore this as the patent is more interesting). Leaving the Packet Coherency Gather related patent here in case anyone is interested: https://patents.google.com/patent/US20220068008A1

And a few quotes from the patent and there's more:
"It (coherency gathering) can allow geometry information to be read once, and to be tested against multiple rays. This also facilitates parallel implementation—for example, using a Single Instruction Multiple Data (SIMD) model—whereby separate hardware-units process the different rays (of the same group) in parallel against the same geometry information."

"By gathering rays according to each specific instance of each BLAS node, the system can arrange for a group of rays that share the same transform as well as the same BLAS node to be scheduled for testing together. Therefore, at most one memory request should be required to retrieve the transform for intersection-testing a given group of rays. According to examples, this is further facilitated by using an instance transform cache."

"When an instance transform is first required, it is loaded into the instance transform cache. The next time the same instance transform is used for intersection testing, it can be expected that it can be retrieved from the instance transform cache without needing to load it from the external memory. This reduces the memory access overhead."

Doesn't sound fake to me, but like a well thought out system of multiple HW optimization that AMD would want to license or reach through different means and then include in RDNA 5. That's assuming they're serious about solving the subpar coherency and cachemem overload problem plaguing RT rn.

Edit: Arrrgh I added the wrong link. You can find the real patent now.

Win2012R2 · Friday at 7:00 PM

adroc_thurston said:
No, the devs are just incompetent.

Including Gearbox?

If they can't do it, then who can?

marees said:
I believe Epic UE5 has some work to do on UE5 for performance on low end cards

They must do two things:
1) get to UE6 real quick because UE5 is now more or less toxic keyword, new games that are well made using it are better off stop saying which engine they've got
2) they need to fix upgrade situation - games dev who start making a game on major version X should be able to upgrade seamlessly to a minor version, otherwise it's total BS

adroc_thurston · Friday at 7:05 PM

Win2012R2 said:
Including Gearbox?

*Especially* Gearbox.

Win2012R2 said:
If they can't do it, then who can

Embark.

Bigos · Friday at 9:06 PM

Darkmont said:
Threat Interactive is a sludge posting grifter who pays his rent by appealing to 104 IQ redditors with a big stiffy for hating games made after 2015. Replace any graphics buzzwords he hates with "woke" and the output is 1:1

And the fact modern games look bad is just me imagining things?

I am in the camp "truth is between the two opposing sides".

Darkmont · Friday at 9:55 PM

Bigos said:
And the fact modern games look bad is just me imagining things?

Neva said that

Bigos said:
I am in the camp "truth is between the two opposing sides".

Yes ThreatInteractive is a tard and game devs have the unenviable position of performance vs release deadlines

Discussion RDNA 5 / UDNA (CDNA Next) speculation

Diamond Member

Member

Member

Diamond Member

Member

Platinum Member

Senior member

Diamond Member

Golden Member

Diamond Member

Senior member

Member

Diamond Member

Member

Golden Member

Member

Member

Diamond Member

Golden Member

Diamond Member

Member

Golden Member

Diamond Member

Senior member

Member