> They'll probably want to transition to transformer models like Nvidia, as those train better with AI generated data.

Are there any papers (by AMD) on this stuff?
> They'll probably want to transition to transformer models like Nvidia, as those train better with AI generated data.

AMD said they were already using a transformer along with a CNN in FSR4.
> Are there any papers (by AMD) on this stuff?

For FSR4 I only know about the white papers, but AMD's been filing patents on the subject since all the way back to 2019:
> GFX1250 (MI450X) has significant improvements to VOPD and a new VOPD3 encoding which almost completely eliminates scenarios where dual-SIMDs can't be used https://github.com/llvm/llvm-project/pull/147602

They have no problem iterating on IP fast.
> GFX1250 (MI450X) has significant improvements to VOPD and a new VOPD3 encoding which almost completely eliminates scenarios where dual-SIMDs can't be used https://github.com/llvm/llvm-project/pull/147602

Is this a significant departure from GFX12 then?
> GFX1250 (MI450X) has significant improvements to VOPD and a new VOPD3 encoding which almost completely eliminates scenarios where dual-SIMDs can't be used https://github.com/llvm/llvm-project/pull/147602

I hope they give every implementation of GFX1250 6 cache ports then.
> GFX1250 (MI450X) has significant improvements to VOPD and a new VOPD3 encoding which almost completely eliminates scenarios where dual-SIMDs can't be used

So if this is done on the client it would make dual issue cores act more like real double perf?
> So if this is done on the client it would make dual issue cores act more like real double perf?

Don't assume that performance will magically double. Raw vector throughput is not the typical bottleneck for performance.
> Don't assume that performance will magically double. Raw vector throughput is not the typical bottleneck for performance.

Something something fixed function hardware pipeline (for gfx).
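(To put rough numbers on that, with made-up figures: if only ~40% of a frame were purely vector-ALU bound, even a perfect doubling of vector throughput would cap the speedup at 1/(0.6 + 0.4/2) ≈ 1.25x.)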
> Is this a significant departure from GFX12 then?

Yeah
> So if this is done on the client it would make dual issue cores act more like real double perf?

For things that are bottlenecked only by Vector perf, it now acts more like Turing -> Ampere
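For anyone wondering what actually gets dual-issued, here's a minimal HIP-style sketch (purely illustrative; whether the compiler really emits VOPD pairs like v_dual_fmac_f32 depends on the target, the register-bank rules and the scheduler):

```cpp
#include <hip/hip_runtime.h>

// Two independent FMAs per lane: candidates for one dual-issue (VOPD) pair,
// so both halves of the SIMD do useful work in the same cycle.
__global__ void axpy2(const float* __restrict__ x, const float* __restrict__ y,
                      float* __restrict__ out0, float* __restrict__ out1,
                      float a, float b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float r0 = a * x[i] + y[i];   // independent of r1 -> pairable
    float r1 = b * x[i] + y[i];   // independent of r0 -> pairable
    out0[i] = r0;
    out1[i] = r1;

    // A dependent chain (r = a*r + y[i]; r = a*r + y[i]; ...) can't be packed
    // into one pair, because the second op needs the first one's result.
}
```

Whether code like this sees anything close to 2x is exactly the "vector throughput isn't the usual bottleneck" caveat above.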
> @Kepler_L2 is RDNA5 also late 2026?

H2 2026 as per the last leaked roadmap
> @Kepler_L2 is RDNA5 also late 2026?

Should be
> For things that are bottlenecked only by Vector perf, it now acts more like Turing -> Ampere

So finally AMD's dual issue should be close to Ampere's?
> So finally AMD's dual issue should be close to Ampere's?

Looks like it
> Looks like it

Sadly both are pretty poor (in terms of ideal double perf), but I guess that's one step closer to Nvidia's perf, so that's a good thing
> Blackwell just added INT capability to the FP only pipe.

That seemed to have made no difference in terms of gaming IPC
DisEnchantment - great post. Can I DM you? A couple follow-up questions.

New set of patent applications for RT:
- OVERLAY TREES FOR RAY TRACING
- FRUSTUM-BOUNDING VOLUME INTERSECTION DETECTION USING HEMISPHERICAL PROJECTION
- TRAVERSAL RECURSION FOR ACCELERATION STRUCTURE TRAVERSAL
- SPHERE-BASED RAY-CAPSULE INTERSECTOR FOR CURVE RENDERING
- SPLIT BOUNDING VOLUMES FOR INSTANCES
- TRAVERSAL AND PROCEDURAL SHADER BOUNDS REFINEMENT
- NEURAL NETWORK-BASED RAY TRACING
- RAYTRACING STRUCTURE TRAVERSAL BASED ON WORK ITEMS
- LOSSY GEOMETRY COMPRESSION USING INTERPOLATED NORMALS FOR USE IN BVH BUILDING AND RENDERING
This time they are adding BVH traversal in HW, in this patent application:
RAYTRACING STRUCTURE TRAVERSAL BASED ON WORK ITEMS
From <https://www.freepatentsonline.com/y2025/0104328.html>
A processor employs work items to manage traversal of an acceleration structure, such as a ray tracing structure, at a hardware traversal engine of a processing unit. The work items are structures having a relatively small memory footprint, where each work item is associated both with a ray and with a corresponding portion of the acceleration structure. The hardware traversal engine employs work items to manage the traversal of the corresponding portion of the acceleration structure for the corresponding ray.
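The patent text doesn't spell out a layout, but as a rough C++ sketch (every field below is a guess, purely for intuition), a work item only needs enough state to tie one ray to one subtree and resume from there:

```cpp
#include <cstdint>

// Hypothetical layout, not from the patent: just enough state for the
// traversal engine to resume one ray in one portion of the acceleration
// structure, kept small so many work items can stay in flight at once.
struct TraversalWorkItem {
    uint32_t ray_id;      // which ray this work item belongs to
    uint32_t node_index;  // BVH node / subtree to visit next
    float    t_max;       // current closest hit, for early termination
    uint32_t flags;       // e.g. cull mode, subtree kind
};
static_assert(sizeof(TraversalWorkItem) == 16, "small memory footprint");
```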
No more traversal done in shaders: just one IMAGE_BVH_INTERSECT_RAY, then get the result back without needing to push the current context onto a stack.
Additionally, the Traversal Engine can do a lot more parallel traversals in HW (for child nodes) than is possible with shaders for a given CU count.
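Roughly what that means for shader code, as a C++-flavoured sketch (all types and helpers here are made up; only IMAGE_BVH_INTERSECT_RAY is a real instruction, and the one-call HW path is just how I read the patent):

```cpp
#include <cstdint>

struct Ray        { float origin[3], dir[3], t_max; };
struct Hit        { float t; uint32_t prim; bool valid; };
struct NodeResult { bool is_leaf; Hit hit; uint32_t children[4]; int num_children; };

// Stand-ins, not real intrinsics:
NodeResult bvh_intersect_node(const Ray&, uint32_t node); // one node test (think IMAGE_BVH_INTERSECT_RAY)
Hit        bvh_traverse(const Ray&, uint32_t root);       // hypothetical whole-tree HW traversal

// Today (RDNA2-4 style): the shader owns the loop and the stack around the per-node test.
Hit traverse_in_shader(const Ray& ray, uint32_t root) {
    uint32_t stack[64];        // traversal stack lives in the shader (no overflow handling, sketch only)
    int top = 0;
    stack[top++] = root;
    Hit best{ray.t_max, 0, false};
    while (top > 0) {
        NodeResult r = bvh_intersect_node(ray, stack[--top]);
        if (r.is_leaf) {
            if (r.hit.valid && r.hit.t < best.t) best = r.hit;   // keep closest hit
        } else {
            for (int i = 0; i < r.num_children; ++i)
                stack[top++] = r.children[i];                    // shader-managed push
        }
    }
    return best;
}

// With a hardware traversal engine: one request, the engine tracks the per-ray
// work items itself (and can walk several child subtrees in parallel) and returns the hit.
Hit traverse_in_hw(const Ray& ray, uint32_t root) {
    return bvh_traverse(ray, root);
}
```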