Last info dump was mainly scheduling related. Now I'll address some additional potential RT-related changes in RDNA 5 and correct my old ignorant comments.
The scope of the proposed changes is massive, which explains the length of the post. It probably takes 15 minutes to read it all.
The following contains patent-derived info, so the usual caveats apply. It may not happen, but it's very likely all things considered. The exact implementation may not mirror the patents 1:1, but the impacts should be roughly the same.
Displaced Micro-Maps (DMMs)
Kepler listed the three patents related to an AMD DMM implementation as beyond the current µArch, and I doubted it, since DMM has been supported on the RTX 40 series since 2022: how could AMD's implementation be beyond Ada's DMM decompression engine (which Blackwell removed)?
I'm sorry, Kepler, for not bothering to actually read the patents. The leapfrogging is obvious and significant.
AMD uses a special method that replaces the traditional three-level subdivision of one DMM base triangle (64 micro-triangles) with 14 triangles derived from three tetrahedra plus two triangles. In other words, a ~4.6x reduction in triangle count.
The standard technique would describe the DMM bounding volume using two or three lower-level BVHs within it, corresponding to BVH8 or BVH4 node implementations respectively. But with a wide, parallel, low-precision integer (pre-filtering) pipeline, the three subdivision levels can be described with just one BVH14 node. In other words, once a ray hit on the DMM base triangle is confirmed, only one additional traversal step (the BVH14 node) is needed, and everything is evaluated in one go. In addition, floating-point post-tests and their data are only fetched if one of the triangle tests is inconclusive, so in many instances they'll be skipped entirely.
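To make the counting concrete, here's the arithmetic behind those figures (a sketch; the 4^3 subdivision depth is my own inference from the 64-triangle number):

```python
# Illustrative arithmetic only: a DMM base triangle subdivided three times
# yields 4**3 = 64 micro-triangles; the encoding described above replaces
# those with three tetrahedra (4 triangles each) plus two extra triangles.
subdivided_tris = 4 ** 3                 # 64 micro-triangles
amd_tris = 3 * 4 + 2                     # 14 triangles (3 tetrahedra + 2 tris)
reduction = subdivided_tris / amd_tris   # ~4.57x, i.e. the ~4.6x quoted above
print(f"{subdivided_tris} -> {amd_tris} triangles (~{reduction:.1f}x fewer)")
```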
And if you wonder whether that's possible, here's a patent that mentions an exemplary implementation in which "...each group of low-precision testers 306 can test a ray against 8 to 16 triangles at once.” So BVH14 doesn't sound impossible, albeit it's an odd number and a very wide BVH for ray/tri evaluations.
There's more as well, such as the Tessellation Circuitry that decodes the subdivisions and prepares DMMs, probably aligning with the improved scheme AMD came up with, and the Bounding Circuitry that generates the respective prism volumes on demand, which helps shrink the BVH size significantly.
I doubt NVIDIA does their prism volume builds on the GPU, but even if they do, AMD's pre-filtering, very wide BVH14 format, and new DMM triangle encoding scheme are still major changes.
Prefiltering Computational Efficiency
I've already explained what this is, so I won't go into too much detail, but I will provide additional info. A pre-filtering pipeline is a set of smaller and simpler integer-based ray intersection testers for triangles, bounding boxes, or other primitives (linear swept spheres...) that run these evaluations in parallel across a very wide BVH, with 8-16 child nodes/primitives in one example.
This seems counterintuitive given how demanding ray/tri evaluations in particular are, but the problem with RT is not calculations, it's cache/memory bottlenecks. Thus it's desirable to make the BVH as wide and shallow as possible. In addition, the reduced-precision integer-based intersection testers are tiny in comparison:
"Performing arithmetic on reduced precision values is significantly simpler and therefore requires significantly less silicon area (i.e., multiple times less area) than arithmetic on full precision, and we expect that it should also use less power and reduce latency."
I don't think 2-3x intersections/area is enough here, as you're not just going from an FP32 to a ~FP16 pipeline but to ~INT16, which removes a lot of the overhead associated with FP. This is described in greater detail here:
"Generating the quantization box 602 and overlaying the quantization box 602 with the grid advantageously enables an intersection testing circuitry to perform tests using arithmetic with just small Q-bit integer values (or slightly larger, e.g., Q+3 bit values). By comparison, conventional testers use single-precision floating point arithmetic that perform computations on 23-bit mantissas and need additional computational resources for 8-bit exponents, sign bits, Not a Number (NaN), denormalized numbers, etc. Using fixed-point circuitry for testing allows for much smaller arithmetic logical unit (ALU) circuitry.”
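As a sketch of the quantization-box idea from the quote (the Q value, function names, and rounding details are my own assumptions, not from the patent): child boxes are re-expressed as small integer offsets inside a parent quantization box, so the hot intersection path only needs Q-bit compares instead of FP32 arithmetic.

```python
import math

Q = 8  # bits per quantized coordinate (assumption)

def quantize_aabb(box_min, box_max, qbox_min, qbox_max, q=Q):
    """Snap a child AABB onto the 2**q grid of the quantization box.
    Min corners round down and max corners round up, so the quantized box
    conservatively contains the original (false hits possible, never misses)."""
    scale = (1 << q) - 1
    def lo(v, a, b):
        return max(0, math.floor((v - a) / (b - a) * scale))
    def hi(v, a, b):
        return min(scale, math.ceil((v - a) / (b - a) * scale))
    return ([lo(v, a, b) for v, a, b in zip(box_min, qbox_min, qbox_max)],
            [hi(v, a, b) for v, a, b in zip(box_max, qbox_min, qbox_max)])

# Each corner now needs 3 x Q bits instead of 3 x 32 bits.
qlo, qhi = quantize_aabb([0.1, 0.5, 0.25], [0.3, 0.9, 0.75],
                         [0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
```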
I was foolish enough to ask Copilot for a viable answer, but it was all over the place: anywhere from 2x to 10x, constantly flip-flopping, and as always severely sycophantic. If someone smarter than me can reasonably guesstimate the increase in intersections per area/watt/KB of cache at the same process node, I don't think I'm the only one who would appreciate it.
Otherwise I guess we just have to wait for the official AMD breakdowns where intersections/mm^2 and cache overhead reduction will surely be discussed.
As previously mentioned, FP tests are still needed for inconclusive triangles, but it's just one test, and these cases only arise when the INT tests don't confirm, which in many cases they will. And in some cases FP tests can be avoided entirely:
"...stand-alone low-precision test may be used for coarse level-of-detail cases, e.g., where an object is tiny, or far off in the distance (and so is only a handful of pixels in the image), or otherwise may be rendered crudely. In such cases, t-values and barycentric coordinates can be calculated as described above, and the full-precision testing can be entirely avoided by using the low-precision test results as the only parameter in the calculation"
I see no reason why this can't be implemented through the compiler, so significant additional speedups can be expected in ALL ray-traced games.
To summarize: in all likelihood AMD will shrink the footprint of the FP testing circuitry significantly in RDNA 5, since it won't be used very often. They'll more than offset it with a humongous increase in intersection rate from tons of parallel INT testers, which could make the intersection-testing logic at iso-area on N3P vs N4 massively more potent than what is possible with FP tests.
Prefiltering Memory Efficiency
A large reduction in cache overhead per BVH node can be expected as well, but this will depend more on how well the data is compressed than on the raw ray/tri intersection rate. Just like in prior implementations, this will be the bottleneck determining BVH traversal rate.
As for BVH data, all you need to know is that the data required for computations gets compressed as integer math instead of floating point. Since we're not dealing with decimals, precision can be reduced from single-precision (FP32) to roughly half-precision (INT16):
“The power-of-two box dimensions can provide computational benefits in that the bounding box can be compactly stored, e.g., by only storing the bfloat16 value of a minimum corner and exponent byte of the power-of-two box dimensions.”
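To show how compact that storage is, here's a sketch of the quoted encoding (my own code; the single shared exponent byte and signed-exponent convention are assumptions): 3 bfloat16 values plus one exponent byte is 7 bytes per box, versus 24 bytes for two FP32 corners.

```python
import struct

def to_bfloat16(x):
    """Truncate an FP32 value to bfloat16 (keep the top 16 bits)."""
    return struct.unpack('<I', struct.pack('<f', x))[0] >> 16

def from_bfloat16(h):
    return struct.unpack('<f', struct.pack('<I', h << 16))[0]

# Power-of-two box: 3 x bfloat16 min corner (6 bytes) + 1 exponent byte
# for the edge length 2**e.
def encode_pow2_box(box_min, e):
    return [to_bfloat16(v) for v in box_min], e & 0xFF

def decode_pow2_box(enc_min, e_byte):
    e = struct.unpack('<b', struct.pack('<B', e_byte))[0]  # signed exponent
    lo = [from_bfloat16(h) for h in enc_min]
    return lo, [v + 2.0 ** e for v in lo]

enc, e_byte = encode_pow2_box([1.0, 2.0, 3.0], 2)  # edge length 2**2 = 4
lo, hi = decode_pow2_box(enc, e_byte)
```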
Prefiltering Additional Info
Special precomputations are required to enable ray setup for pre-filtering pipelines, but AMD has another patent for that, making it viable at runtime. I won't go into it here, but these steps are significant since FP and INT ray evaluations are quite different.
Most of the benefits are still possible without DGF. AMD has a fallback method using pre-filtering nodes that only quantize vertices, versus DGF's more comprehensive scheme. But this can still enable CBLAS, significant BVH compression, and massive speedups as a result of pre-filtering.
The new quantized OBBs that use Platonic solids are now the default encoding scheme (prefetched) for Oriented Bounding Boxes and are built for mass pre-filtering of ray/box intersections. This should help reduce HW complexity and lower the OBB BVH footprint:
"...provides several benefits, such as reducing the complexity of the hardware for applying the orientation of the oriented bounding box to the ray. The amount of data required to be stored is reduced as compared with an implementation that uses more orientation data because the reduced number of possible rotations can be represented with a smaller amount of data.”
Will it be as good as RDNA 4's OBBs? No, but it'll be superior to regular AABBs, and once again the bottleneck isn't computations, it's cache/memory related.
Dense Geometry Format
It's just old boring DGF. I really hoped for more in RDNA 5, even if it's still beyond Blackwell.
I'm taking that back. DGF is great, and Cerny is 100% correct in that DGF enables flexible and efficient data structures by keeping as much data as possible wrapped into cache-aligned, fixed-size packages. I was just expecting additional changes related to data structures like overlay trees and delta instances in HW.
Once again, bothering to read the related patents in their entirety reveals that DGF + pre-filtering has synergies that go well beyond what you would expect and is definitely a clean-slate RT pipeline in itself.
Pre-filtering with DGF is also much faster on the decode side, as the data doesn't need to be de-quantized, enabling faster DGF/pre-filtering node decompression compared to full precision (FP). And since there's dedicated DGF/pre-filter decompression hardware, we can expect this decompression to be sufficiently quick.
DGF has superior cache/memory characteristics on all architectures, but it really needs pre-filtering to shine:
"...parallel rejection testing of large groups of triangles enables a ray tracing circuitry to perform many ray-triangle intersections without fetching additional node data (since the data can be simply decoded from the DGF node, without the need of duplicating data at multiple memory locations). This improves the compute-to-bandwidth ratio of ray traversal and provides a corresponding speedup.”
In addition it
"...removes the need for fine-grained spatial partitioning at the triangle level, resulting in a lower memory footprint and lower memory bandwidth during traversal.”
Another benefit of DGF pre-filtering is that it doesn't need to fetch high-precision triangle data unless a floating-point test is necessary. So keeping everything fixed-point by default should debloat the private RT core caches significantly.
Otherwise, all the previously described characteristics of pre-filtering nodes still apply. The only difference is that DGF decoding takes longer, but compression is much improved.
Sharing the results of inside/outside ray/edge tests removes redundant calculations and, from what I can gather, exploits the triangle strips in DGF. It can even be done using low-precision parallel intersection testers, just like DGF itself.
So there's a lot more to DGF than the apparent first impression suggests. It's really a cascade of optimizations derived from the original idea of cache-aligned, coherent, fixed-size geometry blocks. Someone at AMD probably had a eureka moment regarding pre-filtering during the early development of DGF.
Leaf Node Ray Coherency Sorting
Yes, that's right, AMD actually mentions this in one patent. This is really the last thing I expected, but it totally makes sense: since we have fixed-size leaf nodes with DGF and pre-filtering nodes, why not leverage that for ray coherency sorting? I suspect this ray coalescing will be comprehensive and could benefit from dedicated sorting HW.
Whether RDNA 5 includes additional mechanisms for coherency sorting beyond this and SWC, who knows, but at least this implies a level 4 implementation for leaf nodes.
Sure, this isn't as comprehensive as Imagination Technologies' Packet Coherency Gather (sorry for providing an incorrect patent link; fixed), which covers the entire RT pipeline, but AMD did find their own workaround for the leaf-node triangle tests, which are by far the most problematic and divergent. This is another step up in cache-system efficiency: from one or multiple memory transactions per triangle for each ray, to one transaction per DGF block for each ray, and now just one to a few transactions for all rays coalesced against that node (which contains 1-4 DGF blocks). Yes, FP tests will still be needed, but again these only have to be done once or a few times, not on a per-ray basis.
Ray coalescing requires deferred execution to work, and this will be a major change vs the current paradigm, which races through the entire BVH and immediately executes an any-hit shader once a hit is determined. In addition, there's another patent for deferred any-hit shader evaluation that coalesces execution items against an any-hit shader, so this is possibly a recurring theme with RDNA 5.
Anyway, what we can expect is probably a pipeline that executes as you would normally expect from the TLAS down to the DGF node (or pre-filtering node if you substitute 1:1), where the proposed RT pipeline could be as follows: #1 wait for all rays to reach a DGF node, #2 sort rays by DGF node trajectory, #3 schedule all rays hitting one DGF node at one RT core and fetch the data, and #4 only when all rays have been processed, switch DGF node.
For anyone doubting this, here's the quote:
"The intersection scheduler pipelines data decoding requests to the decompressor 500 for multiple rays coalesced against the same node. All the rays coalesced against the same node are sent first before switching to another node. The decompressor 500 switches to the next ray only after decoding the last batch test for the current ray. Further, the decompressor 500 switches to the next node only after decoding the last batch test of the last ray for current node it is processing.”
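A toy model of that scheduling order (purely illustrative; all names are mine): group rays by the leaf node they're queued against and drain each node's batch completely before switching, so each node is decoded and fetched once per batch instead of once per ray.

```python
from collections import defaultdict

def coalesced_schedule(ray_to_node):
    """ray_to_node maps ray_id -> leaf node id; returns per-node batches.
    All rays coalesced against the same node are scheduled back-to-back,
    so node data is decoded once and amortized over every ray in the batch."""
    buckets = defaultdict(list)
    for ray, node in ray_to_node.items():
        buckets[node].append(ray)
    return [(node, rays) for node, rays in buckets.items()]

batches = coalesced_schedule({0: 'A', 1: 'B', 2: 'A', 3: 'A', 4: 'B'})
# Node 'A' is fully processed for rays [0, 2, 3] before switching to 'B'.
```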
DGF + DMM = Goated CBLAS
The wild potential synergies between DGF and DMM become apparent when you, yet again, bother to read the entire patent.
This patent mentions data nodes corresponding to a DGF Array consisting of multiple DGF blocks:
"...each data block can store data for up to 64 triangles, whereas each array can store data pertaining to 256 triangles, 256 vertices, and 64 materials.” Other patents mention clusters of 65-128 tris in a BVH data node, so that seems more likely, but we'll see.
Then, with the DMM multiplier on top, we get 256 x 64 tris, or up to 16,384 triangles per data node/CBLAS in a BVH. With the proposed 65-128 tri clusters this is still 4,160-8,192 tris/CLAS. Compared to DGF CBLAS without DMM, that's 64x fewer leaf nodes. The leap over the conventional approach is even greater, with a ~3-4 orders-of-magnitude reduction in the number of BVH leaf nodes.
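Multiplying the quoted figures through (illustrative arithmetic only; the 64x DMM factor is the level-3 subdivision count from earlier):

```python
dmm_factor = 64                      # micro-triangles per DMM base triangle
dgf_array_tris = 256                 # triangles per DGF array (quoted patent)
cluster_low, cluster_high = 65, 128  # cluster sizes from the other patents

tris_per_node = dgf_array_tris * dmm_factor          # 16,384 tris per data node
cluster_range = (cluster_low * dmm_factor,           # 4,160 ...
                 cluster_high * dmm_factor)          # ... 8,192 tris per CLAS
```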
For detailed micro-polygon geometry such as Nanite, the speedup in BVH build times would make RTX Mega Geometry look pathetic in comparison, not to mention the BVH footprint. With DGF, AMD only says the BVH can be compressed 2.7-3x, but that's with full-precision data. The DMM compression factor depends on the asset mesh, but it's ~6.5-20x per the Ada Lovelace whitepaper. Regardless, this easily amounts to more than an order of magnitude smaller BVH footprint compared to RTX Mega Geometry.
Other Impacts of DGF + DMM
I have to caution that this kind of setup causes a big shift in the ratio between ray/tri and ray/box intersections and may not be practical in all situations. But the biggest benefit is the increased impact of ray coalescing. The implementation would make the non-leaf part of the BVH shallower and minimize the computational overhead from ray/box intersections while increasing the number of triangles exposed to ray coalescing and ray/tri intersections. Given how large these data nodes would be, two ray coalescing events seem like the way to go: one at the DGF block and another at the DMM base triangle.
This would expose three traversal steps to the impact of ray coalescing: two to reach the DMM base triangles and one to locate the intersected DMM sub-triangle. For a 128-tri DGF array/node + DMM that would be BVH8 -> BVH16 -> BVH14. For a 256-tri DGF array/node + DMM the pipeline would be BVH16 -> BVH16 -> BVH14.
The DGF pipeline would be as described earlier, and the DMM pipeline would closely mirror it: #1 traverse all DGF nodes for all rays until they hit DMM base triangles, #2 sort rays by DMM base triangle trajectory, #3 schedule all rays hitting one DMM base triangle to run together within the RT core and fetch the data, and #4 when all rays have been processed, switch DMM base triangle. It would still run within the higher-level DGF node step, but coalesce ray tests on a per-DMM-base-triangle basis instead of a per-DGF-node basis. In doing so, the prism-volume overhead at the Bounding Circuitry and the additional DMM base-triangle data fetches can be massively reduced; the effects here are similar to those observed with DGF nodes.
What lies beyond DMMs?
I found this novel Holger Gruen patent, but I haven't seen any AMD research papers or additional patents on it, so I don't think this one is making its way into RDNA 5. It encodes a high-resolution mesh into a low-resolution mesh by
"...determining points on curved surfaces of curved surface patches defined for one of triangles and bi-linear quadrangles of the low resolution version of the high resolution mesh."
Instead of ray tracing triangles, as is the case with DMM and most other techniques, this one traces rays against entire curved-surface patches with interpolated normals from the points along the patches. It's possible the ray evaluation is too complex, or something else, IDK, but it achieves "...higher compression ratios than conventional techniques" and is superior to DMM in terms of processing overhead since it can "...encode the offsets without evaluating data".
Conclusion:
The scope of the patent-derived RT core changes related to DMMs, DGF, and pre-filtering for RDNA 5 is on its own easily enough to qualify as a clean-slate ray tracing pipeline. I've talked about these changes before, but reading the patents again, properly this time, reveals additional info that expands the scope of the changes significantly. This isn't about brute-forcing the problem by mindlessly throwing more logic at it, but about redesigning the entire pipeline with ingenious optimizations derived from first-principles thinking.
Then there's the big surprise in ray coherency sorting for leaf nodes. I did not see that coming, and it'll be a massive deal for ray/triangle intersection testing.
The RT core implementation now appears far more impressive and, without a shadow of a doubt, well beyond the 50 series on many fronts. NVIDIA had better be cooking something good with Rubin or find a new thing to chase (they probably will).
Obviously the caveats still apply and we need confirmation from an official source or a reputable leaker to be 100% sure, but I'll assume it's happening.