Hmmm... OoO will certainly increase IPC (very excited to see it implemented in a GPU), but I'm not sure if it's crucial for RT.
From my limited understanding, the BVH structure is "fixed" per frame, and once a ray is cast, calculating intersections against it isn't a random task.
Changes to the BVH structure combined with better caches should bring more benefits for RT, I suppose ...
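To put my mental model into code, here's a minimal, illustrative sketch of stack-based BVH traversal (plain CPU-side C++; the node layout and names are made up, not any vendor's actual format). The structure itself is static for the frame, but which nodes each ray touches, and in what order, is data-dependent, which I guess is where scheduling and latency hiding could still matter:

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Toy axis-aligned bounding box and BVH node layout (illustrative only).
struct AABB { float min[3], max[3]; };

struct BVHNode {
    AABB bounds;
    int32_t left;       // index of left child, or -1 if leaf
    int32_t right;      // index of right child, or -1 if leaf
    int32_t firstTri;   // first triangle index if leaf
    int32_t triCount;   // triangle count if leaf
};

struct Ray {
    float origin[3];
    float invDir[3];    // 1 / direction, precomputed
    float tMax;
};

// Slab test: does the ray hit this box before tMax?
bool intersectAABB(const Ray& r, const AABB& b) {
    float tNear = 0.0f, tFar = r.tMax;
    for (int axis = 0; axis < 3; ++axis) {
        float t0 = (b.min[axis] - r.origin[axis]) * r.invDir[axis];
        float t1 = (b.max[axis] - r.origin[axis]) * r.invDir[axis];
        if (t0 > t1) std::swap(t0, t1);
        tNear = std::max(tNear, t0);
        tFar  = std::min(tFar,  t1);
    }
    return tNear <= tFar;
}

// Stack-based traversal: the BVH never changes during the frame, but the
// sequence of node fetches below is fully data-dependent on the ray.
int traverse(const std::vector<BVHNode>& nodes, const Ray& ray) {
    int visited = 0;
    int stack[64];
    int sp = 0;
    stack[sp++] = 0;                      // start at the root
    while (sp > 0) {
        const BVHNode& node = nodes[stack[--sp]];
        ++visited;
        if (!intersectAABB(ray, node.bounds))
            continue;                     // this branch outcome differs per ray
        if (node.left < 0) {
            // Leaf: ray/triangle intersection tests would go here.
            continue;
        }
        stack[sp++] = node.left;          // scattered reads: children can live
        stack[sp++] = node.right;         // far apart in memory
    }
    return visited;                       // varies a lot even between nearby rays
}
```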
It has to be significant, otherwise Kepler wouldn't have mentioned it alongside branch prediction. But an explanation would be appreciated.
CMIIW, but aren't OoO execution and branch prediction a slippery slope towards MIMD and CPU-style mega-cores? And isn't the entire point of GPUs SIMD parallelism?
Seems like it would be preferable to prioritize data locality and other methods for boosting SIMD occupancy rather than brute-forcing the issue with complex branch prediction and OoO execution.
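To make "SIMD occupancy" concrete, here's a toy model of my own (not how any real scheduler accounts for it): when lanes in a 32-wide warp disagree on a branch, the hardware steps through both sides with inactive lanes masked off, so the fraction of useful lane-slots drops.

```cpp
#include <array>
#include <cstdio>

// Toy SIMT model: a 32-wide warp executes a branch by running BOTH sides
// with inactive lanes masked off whenever the lanes disagree.
constexpr int kWarpSize = 32;

double branchOccupancy(const std::array<bool, kWarpSize>& takesThenPath,
                       int thenCost, int elseCost) {
    int thenLanes = 0;
    for (bool t : takesThenPath) thenLanes += t ? 1 : 0;
    int elseLanes = kWarpSize - thenLanes;

    // Useful work: each lane only needs its own side of the branch.
    long useful = static_cast<long>(thenLanes) * thenCost +
                  static_cast<long>(elseLanes) * elseCost;

    // Issued work: if the warp diverges, all 32 lanes step through both sides.
    long issued = 0;
    if (thenLanes > 0) issued += static_cast<long>(kWarpSize) * thenCost;
    if (elseLanes > 0) issued += static_cast<long>(kWarpSize) * elseCost;

    return static_cast<double>(useful) / static_cast<double>(issued);
}

int main() {
    std::array<bool, kWarpSize> mixed{}, sorted{};
    for (int i = 0; i < kWarpSize; ++i) mixed[i] = (i % 2 == 0);   // 50/50 split
    for (int i = 0; i < kWarpSize; ++i) sorted[i] = true;          // coherent warp

    // Divergent warp: only ~50% of issued lane-slots do useful work.
    std::printf("divergent warp occupancy: %.2f\n", branchOccupancy(mixed, 20, 20));
    // Coherent warp (e.g. after grouping work by branch outcome): 100%.
    std::printf("coherent  warp occupancy: %.2f\n", branchOccupancy(sorted, 20, 20));
}
```

Grouping work so that the lanes in a warp agree attacks that waste at the source, which is what the list below is about.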
A future GPU design could accomplish this by implementing the following (non-exhaustive):
- SWC/Thread coherency sorting
- Other forms of coherency sorting, like the ray coherency sorting seen in PowerVR Photon (see the sketch after this list).
- Memory and scheduling changes prioritizing data locality and fine-grained control
The third item requires the GPU Work Graphs API to maximize its performance and benefits.
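For the coherency-sorting items above, here's a minimal sketch of the general idea (my own C++ illustration, not PowerVR Photon's or anyone's actual hardware, and the sort key is deliberately simplistic): before traversal/shading, reorder or bin secondary rays by a cheap key such as the direction octant, so rays that end up in the same warp tend to walk similar parts of the BVH and hit similar materials.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Ray { float origin[3]; float dir[3]; };

// Cheap 3-bit coherence key: the octant of the ray direction.
// Real schemes also fold in the origin cell, hit material, shader ID, etc.
uint32_t directionOctant(const Ray& r) {
    return (r.dir[0] < 0.0f ? 1u : 0u) |
           (r.dir[1] < 0.0f ? 2u : 0u) |
           (r.dir[2] < 0.0f ? 4u : 0u);
}

// Reorder rays so that rays with the same key are adjacent; consecutive
// groups of 32 then map to warps that traverse and shade more coherently.
void sortRaysForCoherence(std::vector<Ray>& rays) {
    std::stable_sort(rays.begin(), rays.end(),
                     [](const Ray& a, const Ray& b) {
                         return directionOctant(a) < directionOctant(b);
                     });
}
```

Real hardware would presumably compact rays into per-bin queues in fixed function rather than doing a full sort, but the effect on warp coherence is the same in spirit.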
But OoO scheduling on a GPU could still happen at some point. Not the kind CPUs do, but there seems to be a method that goes more than halfway towards an idealized implementation with a tiny area overhead of 0.007%: a 6.9% median speedup, up to 36%, with no slowdowns, and 100x less area overhead than implementing MIMD on a GPU via FSMs (not the same thing, just for comparison). This obviously won't be AMD's or NVIDIA's exact implementation, but GhOST is the first method without the usual drawbacks like slowdowns and large area overhead, so theirs could resemble it in some areas.
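To give a rough feel for what limited OoO on a GPU could mean, here's a toy issue-stage model of my own (not GhOST's actual mechanism, and certainly not AMD's or NVIDIA's): instead of always issuing a warp's oldest instruction and stalling when it isn't ready, the scheduler peeks a few entries into the warp's instruction buffer and issues the first ready instruction that has no register hazard against the older instructions it would bypass.

```cpp
#include <array>
#include <cstdio>
#include <optional>
#include <vector>

// Toy per-warp instruction record for an issue-stage model.
struct Instr {
    int dstReg;                 // register written (-1 = none)
    std::array<int, 2> srcRegs; // registers read (-1 = unused)
    bool ready;                 // operands available, no structural hazard
};

bool sameReg(int a, int b) { return a >= 0 && a == b; }

// In-order baseline: issue the warp's oldest instruction or nothing.
std::optional<size_t> issueInOrder(const std::vector<Instr>& ibuf) {
    if (!ibuf.empty() && ibuf.front().ready) return 0;
    return std::nullopt;
}

// Limited lookahead: scan a small window past a stalled head and issue the
// first ready instruction with no RAW/WAW/WAR hazard against the older,
// still-unissued instructions it would bypass. Everything else stays in order.
std::optional<size_t> issueWithLookahead(const std::vector<Instr>& ibuf,
                                         size_t window) {
    for (size_t i = 0; i < ibuf.size() && i < window; ++i) {
        const Instr& cand = ibuf[i];
        if (!cand.ready) continue;
        bool hazard = false;
        for (size_t j = 0; j < i; ++j) {   // every older instruction being skipped
            const Instr& older = ibuf[j];
            hazard |= sameReg(cand.srcRegs[0], older.dstReg) ||  // RAW
                      sameReg(cand.srcRegs[1], older.dstReg);
            hazard |= sameReg(cand.dstReg, older.dstReg);        // WAW
            hazard |= sameReg(older.srcRegs[0], cand.dstReg) ||  // WAR
                      sameReg(older.srcRegs[1], cand.dstReg);
        }
        if (!hazard) return i;
    }
    return std::nullopt;
}

int main() {
    // Head of the buffer is a load still waiting on memory (not ready);
    // the next instruction is independent of it.
    std::vector<Instr> ibuf = {
        {/*dst*/ 1, {{10, -1}}, /*ready*/ false},  // r1 = load [r10]  (stalled)
        {/*dst*/ 2, {{ 3,  4}}, /*ready*/ true },  // r2 = r3 + r4     (independent)
        {/*dst*/ 5, {{ 1,  2}}, /*ready*/ false},  // r5 = r1 * r2     (depends on both)
    };
    auto inOrder   = issueInOrder(ibuf);
    auto lookahead = issueWithLookahead(ibuf, /*window=*/4);
    std::printf("in-order issues:  %s\n", inOrder ? "slot 0" : "nothing (stall)");
    std::printf("lookahead issues: slot %zu\n", lookahead.value());
}
```

The paper linked below has the real mechanism and the evaluation behind those numbers; this is only meant to show why a small lookahead window plus simple dependence checks doesn't have to cost CPU-class area.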
- GhOST OoO paper:
https://ieeexplore.ieee.org/document/10609594
- MIMD execution on GPU patent (AMD)
https://www.patents-review.com/a/20...s-enabling-mimd-like-execution-flow-simd.html
^IIRC Kepler said this was for CDNA:
It's less about the BVH structure and more about how ray-traversal-related scheduling, execution, and memory accesses are handled on a GPU. Ideally you would want the data contained within the SM from start to finish, even with multiple bounces. That should lower power consumption, slash memory and scheduling latency, and boost performance.
But the BVH still needs fundamental changes, and RTX Mega Geometry isn't enough.
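On the "data contained within the SM from start to finish" point, here's a minimal software analogy (my own sketch; the two stage functions are placeholder stubs, not a real renderer): a megakernel-style loop that keeps the whole bounce chain in local state, as opposed to writing rays out to memory between bounces.

```cpp
#include <cstdint>

struct Ray   { float o[3], d[3]; };
struct Hit   { bool valid; float p[3], n[3]; uint32_t material; };
struct Color { float r, g, b; };

// Placeholder stubs for the expensive stages; a real renderer's BVH traversal
// and material evaluation (as discussed above) would go here.
Hit traceClosestHit(const Ray& r) { (void)r; return Hit{false, {}, {}, 0}; }
Color shadeAndSample(const Hit& h, Ray& nextRay, bool& terminated) {
    (void)h; (void)nextRay; terminated = true; return Color{0, 0, 0};
}

// "Megakernel"-style loop: the ray, its accumulated color and the bounce chain
// stay in local variables (registers, on a GPU) for the whole path, so nothing
// is spilled to off-chip queues between bounces.
Color tracePath(Ray ray, int maxBounces) {
    Color accum{0, 0, 0};
    for (int bounce = 0; bounce < maxBounces; ++bounce) {
        Hit hit = traceClosestHit(ray);
        if (!hit.valid) break;
        bool terminated = false;
        Color c = shadeAndSample(hit, ray, terminated);  // updates 'ray' in place
        accum.r += c.r; accum.g += c.g; accum.b += c.b;
        if (terminated) break;
    }
    return accum;
}
```

A wavefront-style design instead pushes the ray into a global queue after each bounce so another kernel (possibly on another SM) picks it up; that buys coherence and occupancy at the cost of the off-chip traffic and latency the quote wants to avoid.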