Discussion RDNA 5 / UDNA (CDNA Next) speculation


itsmydamnation

Diamond Member
Feb 6, 2011
3,045
3,835
136
They'll probably want to transition to transformer models like Nvidia, as those train better with AI-generated data.
AMD said they were already using a transformer along with a CNN in FSR4.

And what does a transformer model have to do with the GEMM architecture inside a shader core? The difference between the models is how layers and data are linked/related to each other; you're going to spec a unit around target input/output and computation. That's what has made GPUs so "good": they have that balance of flexibility vs. throughput/power, backed by a big memory subsystem, so they adapt easily to changing/emerging parallel workloads.
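To underline that point, both model families bottom out in the same matrix kernels. A minimal NumPy sketch (my own toy illustration, not AMD's FSR4 code) of a conv layer via im2col and an attention layer both reducing to plain GEMMs:

[CODE=python]
import numpy as np

rng = np.random.default_rng(0)

# Convolution as GEMM (im2col): a 3x3 conv over an 8x8 single-channel image.
img = rng.standard_normal((8, 8))
kernel = rng.standard_normal((3, 3))
patches = np.stack([img[i:i + 3, j:j + 3].ravel()   # each row = one 3x3 patch
                    for i in range(6) for j in range(6)])
conv_out = patches @ kernel.ravel()                 # the whole conv is one GEMV

# Attention as GEMM: QK^T and weights @ V are plain matrix multiplies.
Q, K, V = (rng.standard_normal((16, 64)) for _ in range(3))
scores = Q @ K.T / np.sqrt(64)                      # GEMM #1
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
attn_out = weights @ V                              # GEMM #2

print(conv_out.shape, attn_out.shape)               # (36,) (16, 64)
[/CODE]

Whether the layers are convolutional or attention-based only changes how these GEMMs are chained, not what the matrix unit in the shader core has to execute.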
 
Reactions: ToTTenTranz

ToTTenTranz

Senior member
Feb 4, 2021
456
845
136
Are there any papers (by AMD) on this stuff?
For FSR4 I only know about the white papers, but AMD's been filing patents on the subject since as far back as 2019:

 

Win2012R2

Senior member
Dec 5, 2024
979
979
96
GFX1250 (MI450X) has significant improvements to VOPD and a new VOPD3 encoding which almost completely eliminates scenarios where dual-SIMDs can't be used
So if this is done on the client side, would it make dual-issue cores act more like a real doubling of perf?
 

soresu

Diamond Member
Dec 19, 2014
3,896
3,331
136
Don't assume that performance will magically double. Raw vector throughput is not the typical bottleneck for performance.
Something something fixed function hardware pipeline (for gfx).

Something something branch prediction misses.
 

basix

Member
Oct 4, 2024
148
304
96
The thing with vector performance is also that you never reach it if INT instructions are intermixed. Per the Turing presentations, about 30% of all vector instructions in gaming are INT. So doubling FP vector FLOPS will yield only ~1.4x gaming performance, pretty much what we saw with Turing -> Ampere. Some compute applications did show a 2x speedup; those seem to be purely FP-based calculations.

"True doubling" in gaming could only be achieved by adding a dedicated third, INT-only pipeline that runs in parallel with the two FP pipelines. If 3 pipelines are awkward for HW design and scheduling, 1x INT and 3x FP could be an alternative. Maybe 1x INT, 1x INT/FP, 2x FP to maximize utilization in corner cases. With the latter configuration it could roughly look like this (see the sketch after this list):
- 1.5x throughput in FP-only scenarios
- ~2x throughput in mixed FP/INT scenarios (gaming)
- 1x throughput in INT-only scenarios (same as Blackwell), or 2x compared to Ampere
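A toy issue-rate model makes these numbers concrete (my own sketch, assuming compute-bound work and perfect scheduling; the pipe configs are illustrative): steady-state IPC is bounded by the total pipe count, by the FP-capable pipes, and by the INT-capable pipes.

[CODE=python]
def ipc(fp_only, int_only, flexible, int_frac):
    """Steady-state instructions/cycle for a given INT fraction of the mix."""
    total = fp_only + int_only + flexible
    bounds = [total]                                            # issue-width limit
    if int_frac < 1.0:
        bounds.append((fp_only + flexible) / (1.0 - int_frac))  # FP-side limit
    if int_frac > 0.0:
        bounds.append((int_only + flexible) / int_frac)         # INT-side limit
    return min(bounds)

configs = {                      # (FP-only, INT-only, FP/INT) pipes per scheduler
    "Turing":       (1, 1, 0),
    "Ampere/Ada":   (1, 0, 1),
    "Blackwell":    (0, 0, 2),
    "hypothetical": (2, 1, 1),   # 2x FP, 1x INT, 1x INT/FP, as proposed above
}

for f in (0.0, 0.3, 0.5, 1.0):   # INT fraction; ~0.3 is the Turing gaming figure
    row = ", ".join(f"{name}: {ipc(*pipes, f):.2f}" for name, pipes in configs.items())
    print(f"INT={f:.0%} -> IPC  {row}")
[/CODE]

FP throughput is then (1 - INT fraction) x IPC, which reproduces the numbers above: at a 30% INT mix Ampere lands at 1.4x Turing, Blackwell ties Ampere, and the 4-pipe config sits at ~2x Blackwell; at a 50/50 mix the 4-pipe config keeps 67% of its peak FP rate while Blackwell drops to 50%.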

A pretty decent uplift compared to Blackwell, considering that you only add two additional pipes (1x FP unit and 1x INT unit).

A long time ago, long before the Blackwell release, I speculated that Blackwell could introduce additional FP/INT pipes. That version never saw the light of day; Blackwell just added INT capability to the FP-only pipe. The result would look like the following (resembling the numbers listed above regarding FP throughput), but "Blackwell" in the graph below is not actual Blackwell, rather some possible future iteration of Nvidia's SMs:
- Full FP throughput can be sustained up to a ~30% INT/FP ratio; Ampere etc. fall below 75%
- Even at a 50% ratio, 67% of the FP throughput could be maintained, while Ampere/Lovelace/Blackwell drop to 50% FP throughput.
[Attached chart: More_FP_Pipes_IPC.JPG, FP throughput vs. INT/FP instruction ratio for the SM pipe configurations discussed above]
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,646
2,465
136
I think you are both massively overestimating what compute can do for you. For a true doubling they don't just need another INT pipeline, they'd need to double all the memory-access-related resources. At which point the units would be twice the size and there would be no point.

The relative cost of moving data to the units, compared to the actual compute, grows every silicon generation. The point of dual issue is that the cost of everything but the compute has gotten so high that it's worth having twice the compute, even if half of it is mostly idle, just to make sure that everything else is 100% utilized every cycle.
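Order-of-magnitude arithmetic makes this point. The figures below are assumed ballpark values of the kind quoted in public architecture/process-energy talks, not vendor data:

[CODE=python]
# Assumed order-of-magnitude energy figures (illustrative, not vendor numbers):
FLOP_PJ      = 1.0     # assumed: one FP32 FMA, ~1 pJ on a modern node
SRAM_READ_PJ = 25.0    # assumed: operand fetch from a large on-chip SRAM
DRAM_READ_PJ = 1000.0  # assumed: operand fetch from off-chip DRAM

for src, pj in (("SRAM", SRAM_READ_PJ), ("DRAM", DRAM_READ_PJ)):
    print(f"{src} operand fetch ~ {pj / FLOP_PJ:.0f}x the cost of the math itself")
[/CODE]

Once the operands are in flight anyway, a second, often-idle ALU is nearly free, which is the economic case for dual issue described above.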
 

basix

Member
Oct 4, 2024
148
304
96
I know what you are referring to. FLOPS are cheap. And for sure, many if not most memory-related structures would need to be updated.

I am not saying that this will ever be implemented. But ~2x typical gaming performance at less than 2x HW effort (even if you have to double memory, scheduling etc.) doesn't seem so unreasonable.

Ampere's doubled FP32 was cheap and yielded nice benefits, so it was a very decent update to the SM at limited cost. But I definitely see a future where SMs get wider, as that potentially scales better than just adding more SMs.
 
Reactions: Win2012R2

Win2012R2

Senior member
Dec 5, 2024
979
979
96
Blackwell just added INT capability to the FP only pipe.
That seems to have made no difference in terms of gaming IPC:

"NVIDIA's Blackwell architecture delivered an average IPC advantage of just 1% over the older Ada Lovelace."


Perhaps Blackwell is just very broken for gaming; maybe they'll fix this in the 6000 series.
 

basix

Member
Oct 4, 2024
148
304
96
INT throughput obviously does not limit gaming performance if you look at the average INT/FP instruction mix, so additional INT TOPS won't help you.
 

cshughes

Junior Member
Jul 17, 2025
1
0
6
New set of patent applications for RT




This time they are adding BVH traversal in HW.

RAYTRACING STRUCTURE TRAVERSAL BASED ON WORK ITEMS
From <https://www.freepatentsonline.com/y2025/0104328.html>
A processor employs work items to manage traversal of an acceleration structure, such as a ray tracing structure, at a hardware traversal engine of a processing unit. The work items are structures having a relatively small memory footprint, where each work item is associated both with a ray and with a corresponding portion of the acceleration structure. The hardware traversal engine employs the work items to manage the traversal of the corresponding portion of the acceleration structure for the corresponding ray.

No more traversal done in shaders: just one IMAGE_BVH_INTERSECT_RAY, then get the result without needing to push the current context to a stack.
Additionally, the Traversal Engine can do a lot more parallel traversals in HW (for child nodes) than is possible with shaders for a given CU count.
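Reading the abstract, the scheme sounds like a pool of tiny (ray, node) records replacing per-thread traversal stacks. A toy sketch of that idea (my own reading of the abstract, not the actual patent logic; real traversal would also cull children by box intersection, omitted here):

[CODE=python]
from collections import deque
from dataclasses import dataclass

@dataclass
class WorkItem:          # deliberately tiny: just a ray id + a node index
    ray: int
    node: int

def traverse(bvh, num_rays):
    """bvh: dict node -> list of child node ids ([] for a leaf)."""
    pool = deque(WorkItem(r, 0) for r in range(num_rays))  # all rays start at the root
    hits = []
    while pool:
        item = pool.popleft()                 # HW engine: pick any ready work item
        children = bvh[item.node]
        if not children:                      # leaf: ray/primitive test would go here
            hits.append((item.ray, item.node))
        else:
            for c in children:                # fan out child visits as new work items
                pool.append(WorkItem(item.ray, c))   # no per-ray stack needed
    return hits

# Tiny 2-level BVH: root 0 -> internal 1, 2 -> leaves 3..6
bvh = {0: [1, 2], 1: [3, 4], 2: [5, 6], 3: [], 4: [], 5: [], 6: []}
print(traverse(bvh, num_rays=2))
[/CODE]

Because items from different rays and different subtrees sit in one pool, the engine can interleave as many traversals per cycle as it has intersection units, rather than being limited by per-wave stack state in the CUs.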

DisEnchantment - great post. Can I DM you? A couple of follow-up questions.