Dense Geometry Format (DGF)
End result: The prefilter and DGF nodes allow for a smaller BVH footprint, massively reduce load on the memory subsystem, and permit fast, low-precision, parallel bulk processing of triangle intersection tests. As a result, a sizeable speedup is achieved while the area investment for ray/triangle intersection logic is reduced.
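The full DGF encoding isn't public, but the core idea behind a smaller footprint is familiar: quantize vertex positions to a compact fixed-point grid inside a local bounding box. Here's a minimal sketch of that principle (the 8-bit width and layout are illustrative assumptions, not the actual DGF spec):

```python
# Hypothetical sketch: quantize triangle vertices to a shared 8-bit grid
# inside their bounding box, in the spirit of a dense geometry format.
# Bit width and layout are assumptions, not AMD's actual encoding.

def quantize_vertices(vertices, bits=8):
    """Map float3 vertices into integer grid coordinates within their AABB."""
    lo = tuple(min(v[i] for v in vertices) for i in range(3))
    hi = tuple(max(v[i] for v in vertices) for i in range(3))
    scale = (1 << bits) - 1
    def q(v):
        return tuple(
            round((v[i] - lo[i]) / ((hi[i] - lo[i]) or 1.0) * scale)
            for i in range(3)
        )
    return lo, hi, [q(v) for v in vertices]

lo, hi, packed = quantize_vertices(
    [(0.0, 0.0, 0.0), (1.0, 0.5, 0.25), (0.5, 1.0, 0.75)]
)
# Each coordinate now fits in 8 bits instead of a 32-bit float,
# and intersection tests can run on the narrow integers in bulk.
```

Low-precision coordinates like these are exactly what cheap, parallel conservative intersection testers can chew through, with full precision reserved for the few candidates that survive.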
Multiple Ray Tracing Patent Filings
One is about configurable convex polygon ray/edge testing, which allows sharing edge results between polygons and eliminates duplicative intersection tests. The filing describes the benefit as follows:
"By efficiently sharing edge test results among polygons with shared edges, inside/outside testing for groups of polygons can be made more efficient."
It can be implemented with full or reduced precision, making ray tracing more cost-effective.
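The idea of sharing edge tests can be sketched in a few lines. Two triangles that share an edge only differ in the edge's direction, so the already-computed signed result can be negated and reused. This is an illustrative 2D sketch with made-up helper names, not the patented implementation:

```python
# Hypothetical sketch of sharing edge-test results between adjacent
# polygons. The patented scheme's precision and layout are not public.

def edge_test(a, b, p):
    """Signed area test: > 0 means p lies to the left of edge a->b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def inside(edges):
    """Point is inside a convex polygon if all edge signs agree."""
    return all(e >= 0 for e in edges) or all(e <= 0 for e in edges)

p = (0.25, 0.25)                 # e.g. a ray/plane intersection point
a, b, c, d = (0, 0), (1, 0), (0, 1), (1, 1)

shared = edge_test(b, c, p)      # edge b->c is shared by both triangles
tri1 = inside([edge_test(a, b, p), shared, edge_test(c, a, p)])
# The neighbor sees the shared edge reversed (c->b), so its result is
# just the negated, already-computed value: no duplicate test needed.
tri2 = inside([edge_test(b, d, p), edge_test(d, c, p), -shared])
```

For a strip or fan of polygons, every interior edge is tested once instead of twice, which is where the savings come from.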
Three other patent filings leverage displaced micro-meshes (DMMs) and an accelerator unit (AU) that creates them.
I cannot figure out how this DMM implementation differs from NVIDIA's now-deprecated DMM implementation in Ada Lovelace. It sounds very similar, although some differences are to be expected.
IDK what benefits to expect here, except perhaps a lower BVH build cost and a smaller BVH.
Streaming Wave Coalescer (SWC)
The Streaming Wave Coalescer implements thread coherency sorting similar to Intel's TSU and NVIDIA's SER implementations. It does this by using sorting bins and hard keys to sort divergent threads across waves following the same instruction path, thereby coalescing the threads into new waves.
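The binning idea is straightforward to sketch: divergent threads from several waves are tagged with a key (for example, the shader path they will execute next), grouped by that key, and repacked into coherent waves. Wave size and key choice below are assumptions for illustration only:

```python
# Hypothetical sketch of key-based thread coalescing across waves.
# Real waves are 32/64 lanes wide; the key would be hardware-derived.

from collections import defaultdict

WAVE_SIZE = 4  # tiny wave for the example

def coalesce(threads):
    """threads: list of (thread_id, sort_key). Returns coherent waves."""
    bins = defaultdict(list)
    for tid, key in threads:
        bins[key].append(tid)          # sorting bin per instruction path
    waves = []
    for key in sorted(bins):
        group = bins[key]
        for i in range(0, len(group), WAVE_SIZE):
            waves.append((key, group[i:i + WAVE_SIZE]))
    return waves

# Two divergent waves: threads interleave between shader paths A and B.
incoming = [(0, "A"), (1, "B"), (2, "A"), (3, "B"),
            (4, "B"), (5, "A"), (6, "A"), (7, "B")]
waves = coalesce(incoming)
# Result: one full wave running only path A, one running only path B,
# instead of two waves each half-stalled on divergence.
```

The payoff is that every lane in a repacked wave executes the same instructions, which is exactly what a SIMD machine wants.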
The spill-after programming model offers developers granular control over when and how thread state is spilled to memory when execution is reordered to different lanes. This helps avoid the excessive cache usage and memory accesses that would otherwise cause large increases in latency and costly front-end stalls when leveraging the SWC.
Just like SER, the SWC should help boost path tracing performance, although the implementation looks different and appears to be enabled by default.
Local Launchers and Work Graph Scheduler
End result:
#1 Decentralized local scheduling: A decentralized GPU scheduling architecture that delegates scheduling to the lowest possible level in the scheduling hierarchy, handing almost complete scheduling autonomy to the Shader Engines (via the WGS) and allowing WGPs to launch their own work. This improves scheduling latency and allows much more fine-grained scheduling.
#2 Bottom-up scalable architecture: This is a bottom-up instead of top-down GPU scheduling paradigm. Everything operates on the assumption that local knows best, although brakes are built into the system: a higher-level scheduler takes control if a local scheduler is overloaded or can't feed its WGPs properly. Since each SE functions as its own GPU core, scaling is no longer dictated by the scheduling capabilities of the global processor, but by how quickly it can prepare work and load balance across SEs.
#3 A boon for chiplet-based GPUs: Preparing work in a globally shared mailbox and doing some load balancing across SEs is far less demanding than micromanaging everything. As a result, wider GPU designs should benefit the most, and for chiplet-based architectures the speedup could be even greater due to the latency mitigation and the bottom-up scheduling paradigm.
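The mailbox-plus-local-schedulers split described above can be sketched as follows. All the class and queue names here are illustrative assumptions, not AMD's terminology beyond "Shader Engine":

```python
# Hypothetical sketch of bottom-up scheduling: the global processor only
# fills a shared mailbox and nudges work toward lightly loaded SEs, while
# each Shader Engine's local scheduler pulls and dispatches autonomously.

from collections import deque

class ShaderEngine:
    def __init__(self, name, capacity=4):
        self.name = name
        self.queue = deque()
        self.capacity = capacity   # "brake": past this, stop accepting

    def overloaded(self):
        return len(self.queue) >= self.capacity

    def pull(self, mailbox):
        # The local scheduler grabs work on its own while it has room.
        while mailbox and not self.overloaded():
            self.queue.append(mailbox.popleft())

def global_balance(mailbox, engines):
    """The global scheduler only steers work toward the least-loaded SEs."""
    for se in sorted(engines, key=lambda s: len(s.queue)):
        se.pull(mailbox)

mailbox = deque(f"task{i}" for i in range(6))
engines = [ShaderEngine("SE0"), ShaderEngine("SE1", capacity=2)]
global_balance(mailbox, engines)
# SE1 hits its capacity brake after two tasks; SE0 absorbs the rest.
```

Note how the global side never touches individual dispatches; it only moves batches between queues, which is why it scales with SE count rather than with total work items.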
A Few Important Patent Filings
The RECONFIGURABLE VIRTUAL GRAPHICS AND COMPUTE PROCESSOR PIPELINE patent filing allows general-purpose shaders to emulate fixed-function hardware and take over when a fixed-function bottleneck occurs.
Another patent filing, ACCELERATED DRAW INDIRECT FETCHING, leverages fixed-function hardware (an accelerator) to speed up indirect fetching, lowering computational latency, and allows
"...different types of aligned or unaligned data structures are usable with equivalent or nearly equivalent performance."