So it seems, but the patent specifies it can be any non-local cache, so they could be coupled to the Shader Engine private cache, L2, or MALL.
If the reduced L2 (AT2 = 24MB L2 vs Navi 48 = 64MB MALL) is accurate and CCUs are leveraged for RDNA 5, then coupling them to L2 would shrink the L2 available to other processes, since they require a sizeable dedicated allocation. I wonder how AMD engineers would tackle this. An SE cache implementation could happen as well. It would require a much bigger SE cache, but it would bring other benefits like superior cache latency, closer integration with the CUs (routing and latency), etc. The latter seems more likely given the supposed (not confirmed IIRC) overhaul with autonomous SE scheduling and dispatch (WGS and ADC). In that case, CCUs outside the SEs would complicate things a lot.
Remember the CCU offloads work from the CU, so the overhead might be lower than it seems. No need to duplicate instructions, but yeah, there's still some overhead. The question is how much.
Highly doubt that. But some BW-heavy RT instructions could be offloaded to the CCUs.
Again, lots of unanswered questions, but this is one of the most interesting AMD patents in a long time and a massive architectural departure from anything previous.
Edit: Forgot this is the Zen 7 thread, so would this be feasible at all for a CPU architecture? Maybe AMD does in-cache processing on 3D V-Cache for Zen 7 xD