Discussion RDNA 5 / UDNA (CDNA Next) speculation


itsmydamnation

Diamond Member
Feb 6, 2011
3,045
3,835
136
They'll probably want to transition to transformer models like Nvidia, as those train better with AI-generated data.
AMD said they were already using a transformer along with a CNN in FSR4.

And what does a transformer model have to do with the GEMM architecture inside a shader core? The difference between the models is how layers and data are linked/related to each other; you're going to spec a unit around target input/output and computation. That's what has made GPUs so "good": they have that balance of flexibility vs. throughput/power, backed by a big memory subsystem, so they adapt easily to changing/emerging parallel workloads.
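To underline that point, both model families bottom out in the same matrix kernels. A minimal NumPy sketch (my own toy illustration, not AMD's FSR4 code) of a conv layer via im2col and an attention layer both reducing to plain GEMMs:

[CODE=python]
import numpy as np

rng = np.random.default_rng(0)

# Convolution as GEMM (im2col): a 3x3 conv over an 8x8 single-channel image.
img = rng.standard_normal((8, 8))
kernel = rng.standard_normal((3, 3))
patches = np.stack([img[i:i + 3, j:j + 3].ravel()   # each row = one 3x3 patch
                    for i in range(6) for j in range(6)])
conv_out = patches @ kernel.ravel()                 # the whole conv is one GEMV

# Attention as GEMM: QK^T and weights @ V are plain matrix multiplies.
Q, K, V = (rng.standard_normal((16, 64)) for _ in range(3))
scores = Q @ K.T / np.sqrt(64)                      # GEMM #1
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
attn_out = weights @ V                              # GEMM #2

print(conv_out.shape, attn_out.shape)               # (36,) (16, 64)
[/CODE]

Whether the layers are convolutional or attention-based only changes how these GEMMs are chained, not what the matrix unit in the shader core has to execute.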
 
Reactions: ToTTenTranz

ToTTenTranz

Senior member
Feb 4, 2021
456
845
136
Are there any papers (by AMD) on this stuff?
For FSR4 I only know about the white papers, but AMD's been filing patents on the subject since as far back as 2019:

 

Win2012R2

Senior member
Dec 5, 2024
979
979
96
GFX1250 (MI450X) has significant improvements to VOPD and a new VOPD3 encoding which almost completely eliminates scenarios where dual-SIMDs can't be used
So if this is done on the client side, would it make dual-issue cores act more like a real doubling of perf?
 

soresu

Diamond Member
Dec 19, 2014
3,896
3,331
136
Don't assume that performance will magically double. Raw vector throughput is not the typical bottleneck for performance.
Something something fixed function hardware pipeline (for gfx).

Something something branch prediction misses.
 

basix

Member
Oct 4, 2024
148
304
96
The thing with vector performance is also that you never reach it if INT instructions are intermixed. Per the Turing presentations, about 30% of all vector instructions in gaming are INT. So doubling FP vector FLOPS will yield only ~1.4x gaming performance, pretty much what we saw with Turing -> Ampere. Some compute applications did show a 2x speedup; those seem to be purely FP-based calculations.

"True doubling" in gaming could only be achieved by adding a dedicated third, INT-only pipeline that runs in parallel with the two FP pipelines. If 3 pipelines are awkward for HW design and scheduling, 1x INT and 3x FP could be an alternative. Maybe 1x INT, 1x INT/FP, 2x FP to maximize utilization in corner cases. With the latter configuration it could roughly look like this (see the sketch after this list):
- 1.5x throughput in FP-only scenarios
- ~2x throughput in mixed FP/INT scenarios (gaming)
- 1x throughput in INT-only scenarios (same as Blackwell), or 2x compared to Ampere
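A toy issue-rate model makes these numbers concrete (my own sketch, assuming compute-bound work and perfect scheduling; the pipe configs are illustrative): steady-state IPC is bounded by the total pipe count, by the FP-capable pipes, and by the INT-capable pipes.

[CODE=python]
def ipc(fp_only, int_only, flexible, int_frac):
    """Steady-state instructions/cycle for a given INT fraction of the mix."""
    total = fp_only + int_only + flexible
    bounds = [total]                                            # issue-width limit
    if int_frac < 1.0:
        bounds.append((fp_only + flexible) / (1.0 - int_frac))  # FP-side limit
    if int_frac > 0.0:
        bounds.append((int_only + flexible) / int_frac)         # INT-side limit
    return min(bounds)

configs = {                      # (FP-only, INT-only, FP/INT) pipes per scheduler
    "Turing":       (1, 1, 0),
    "Ampere/Ada":   (1, 0, 1),
    "Blackwell":    (0, 0, 2),
    "hypothetical": (2, 1, 1),   # 2x FP, 1x INT, 1x INT/FP, as proposed above
}

for f in (0.0, 0.3, 0.5, 1.0):   # INT fraction; ~0.3 is the Turing gaming figure
    row = ", ".join(f"{name}: {ipc(*pipes, f):.2f}" for name, pipes in configs.items())
    print(f"INT={f:.0%} -> IPC  {row}")
[/CODE]

FP throughput is then (1 - INT fraction) x IPC, which reproduces the numbers above: at a 30% INT mix Ampere lands at 1.4x Turing, Blackwell ties Ampere, and the 4-pipe config sits at ~2x Blackwell; at a 50/50 mix the 4-pipe config keeps 67% of its peak FP rate while Blackwell drops to 50%.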

A pretty decent uplift compared to Blackwell, considering that you only add two additional pipes (1x FP unit and 1x INT unit).

A long time ago, long before the Blackwell release, I speculated that Blackwell could introduce additional FP/INT pipes. That version never saw the light of day; Blackwell just added INT capability to the FP-only pipe. The result would look like the following (resembling the numbers listed above regarding FP throughput), but "Blackwell" in the graph below is not actual Blackwell, rather some possible future iteration of Nvidia's SMs:
- Full FP throughput can be sustained up to a ~30% INT/FP ratio; Ampere etc. fall below 75%
- Even at a 50% ratio, 67% of the FP throughput could be maintained, while Ampere/Lovelace/Blackwell drop to 50% FP throughput.
[Attached chart: More_FP_Pipes_IPC.JPG, FP throughput vs. INT/FP instruction ratio for the SM pipe configurations discussed above]
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,646
2,465
136
I think you are both massively overestimating what compute can do for you. For a true doubling they don't just need another INT pipeline, they'd need to double all the memory-access-related resources. At which point the units would be twice the size and there would be no point.

The relative cost of moving data to the units, compared to the actual compute, grows every silicon generation. The point of dual issue is that the cost of everything but the compute has gotten so high that it's worth having twice the compute, even if half of it is mostly idle, just to make sure that everything else is 100% utilized every cycle.
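Order-of-magnitude arithmetic makes this point. The figures below are assumed ballpark values of the kind quoted in public architecture/process-energy talks, not vendor data:

[CODE=python]
# Assumed order-of-magnitude energy figures (illustrative, not vendor numbers):
FLOP_PJ      = 1.0     # assumed: one FP32 FMA, ~1 pJ on a modern node
SRAM_READ_PJ = 25.0    # assumed: operand fetch from a large on-chip SRAM
DRAM_READ_PJ = 1000.0  # assumed: operand fetch from off-chip DRAM

for src, pj in (("SRAM", SRAM_READ_PJ), ("DRAM", DRAM_READ_PJ)):
    print(f"{src} operand fetch ~ {pj / FLOP_PJ:.0f}x the cost of the math itself")
[/CODE]

Once the operands are in flight anyway, a second, often-idle ALU is nearly free, which is the economic case for dual issue described above.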
 

basix

Member
Oct 4, 2024
148
304
96
I know what you are referring to. FLOPS are cheap. And for sure, many if not most memory-related structures would need to be updated.

I am not saying that this will ever be implemented. But ~2x typical gaming performance at less than 2x HW effort (even if you have to double memory, scheduling etc.) doesn't seem so unreasonable.

Ampere's doubled FP32 was cheap and yielded nice benefits, so it was a very decent update to the SM at limited cost. But I definitely see a future where SMs get wider, as that potentially scales better than just adding more SMs.
 
Reactions: Win2012R2

Win2012R2

Senior member
Dec 5, 2024
979
979
96
Blackwell just added INT capability to the FP only pipe.
That seems to have made no difference in terms of gaming IPC:

"NVIDIA's Blackwell architecture delivered an average IPC advantage of just 1% over the older Ada Lovelace."


Perhaps Blackwell is just very broken for gaming; maybe they'll fix this in the 6000 series.
 

basix

Member
Oct 4, 2024
148
304
96
INT throughput obviously does not limit gaming performance if you look at the average INT/FP instruction mix, so additional INT TOPS won't help you.
 

cshughes

Junior Member
Jul 17, 2025
1
0
6
New set of patent applications for RT




This time they are adding BVH traversal in HW.

RAYTRACING STRUCTURE TRAVERSAL BASED ON WORK ITEMS
From <https://www.freepatentsonline.com/y2025/0104328.html>
A processor employs work items to manage traversal of an acceleration structure, such as a ray tracing structure, at a hardware traversal engine of a processing unit. The work items are structures having a relatively small memory footprint, where each work item is associated both with a ray and with a corresponding portion of the acceleration structure. The hardware traversal engine employs the work items to manage the traversal of the corresponding portion of the acceleration structure for the corresponding ray.

No more traversal done in shaders: just one IMAGE_BVH_INTERSECT_RAY, then get the result without needing to push the current context to a stack.
Additionally, the Traversal Engine can do a lot more parallel traversals in HW (for child nodes) than is possible with shaders for a given CU count.
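Reading the abstract, the scheme sounds like a pool of tiny (ray, node) records replacing per-thread traversal stacks. A toy sketch of that idea (my own reading of the abstract, not the actual patent logic; real traversal would also cull children by box intersection, omitted here):

[CODE=python]
from collections import deque
from dataclasses import dataclass

@dataclass
class WorkItem:          # deliberately tiny: just a ray id + a node index
    ray: int
    node: int

def traverse(bvh, num_rays):
    """bvh: dict node -> list of child node ids ([] for a leaf)."""
    pool = deque(WorkItem(r, 0) for r in range(num_rays))  # all rays start at the root
    hits = []
    while pool:
        item = pool.popleft()                 # HW engine: pick any ready work item
        children = bvh[item.node]
        if not children:                      # leaf: ray/primitive test would go here
            hits.append((item.ray, item.node))
        else:
            for c in children:                # fan out child visits as new work items
                pool.append(WorkItem(item.ray, c))   # no per-ray stack needed
    return hits

# Tiny 2-level BVH: root 0 -> internal 1, 2 -> leaves 3..6
bvh = {0: [1, 2], 1: [3, 4], 2: [5, 6], 3: [], 4: [], 5: [], 6: []}
print(traverse(bvh, num_rays=2))
[/CODE]

Because items from different rays and different subtrees sit in one pool, the engine can interleave as many traversals per cycle as it has intersection units, rather than being limited by per-wave stack state in the CUs.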

DisEnchantment - great post. Can I DM you? A couple of follow-up questions.