Wild speculation, but I believe the decoupling of the FE/SE clocks is because future generations will split them up further. Instead of multi-GCD (which is hard) we'll likely see a FED (Front-End Die) containing GCP, HWS, DCN, VCN and PCIe, while the GCD is literally just Shader Engines.
A GCD with just shader engines still has to be scalable, because you cannot have a single GCD with, for example, 8 SEs. So they still need to link multiple GCDs, and patents also indicate consistent investigation in this direction.
Something like multiple smaller GCDs linked around a CP die with bridges all around, or a multithreaded CP on two independent GCDs.
Is there really that much traffic between the shader engines? If not, it could be as you wrote but with multiple SE chiplets, with the unit size corresponding to the low-end model.
Shader Processor/Rasterizers/ROP/L1/etc. are all in the SE.
The biggest data exchange between the shaders and the fixed-function HW is at GL1 and GDS. That is where the shaders export the results of their operations and where they pick up the data for the next stage of operation.
And there is a lot of synchronization happening here due to the sequential nature of the rendering pipeline.
Once the shaders are done and have exported their results, the rasterizer takes over using the data exported to GDS.
Similarly, geometry output in the SE goes into GDS and can be picked up by the shaders next if needed.
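To make the handoff described above a bit more concrete, here is a toy sketch in Python. It is only an illustration of "export, sync, pick up" through a shared store: the dict stands in for GDS, and the stage names, data and function name are all made up for the example. It says nothing about how the actual hardware is implemented.

```python
# Toy model only: a dict stands in for the shared store (GDS in the posts above).
# Stage names and data are hypothetical, chosen just to show the ordering.

def run_pipeline(vertices):
    shared = {}   # stand-in for the shared on-chip store
    log = []      # order in which the stages ran

    # Stage 1: "shaders" export their results to the shared store.
    shared["positions"] = [(x * 2, y * 2) for x, y in vertices]
    log.append("shader_export")

    # Sync point: the next stage cannot start until the export exists.
    if "positions" not in shared:
        raise RuntimeError("rasterizer started before shader export")

    # Stage 2: "rasterizer" picks up the exported data and writes its own output.
    shared["fragments"] = [(int(x), int(y)) for x, y in shared["positions"]]
    log.append("rasterize")

    # Stage 3: "shaders" pick up that output for the next stage of work.
    shaded = [(x, y, (x + y) % 256) for x, y in shared["fragments"]]
    log.append("shade_fragments")
    return shaded, log


shaded, log = run_pipeline([(1, 2), (3, 4)])
print(log)
```

Each stage only starts once the previous stage's data is in the shared store, which is the sequential dependency the posts above are pointing at.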
So there has to be a big crossbar between the shader engines here, and it has to link up with the CP.
For multi-GCD to work transparently, they need to link up the CP/L1/L2/GDS. That is why there is a big blob of interconnects around the CP in all the die shots.
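As a rough back-of-the-envelope on why that interconnect gets big, here is a small Python sketch. It assumes a naive topology where every shader engine has a direct link to every other shader engine plus one link to the CP. That is an assumption for illustration only; a real crossbar is not built as point-to-point links, and the function name is hypothetical.

```python
def naive_link_count(num_se: int) -> int:
    """Links needed if every SE connects to every other SE and to the CP.

    Assumption: naive all-to-all wiring, used only to show how the count grows.
    """
    if num_se < 1:
        raise ValueError("need at least one shader engine")
    se_to_se = num_se * (num_se - 1) // 2   # one link per SE pair
    se_to_cp = num_se                       # one link from each SE to the CP
    return se_to_se + se_to_cp


for n in (2, 4, 8):
    print(n, "SEs ->", naive_link_count(n), "links")
```

Under that assumption, 2 SEs need 3 links, 4 SEs need 10, and 8 SEs need 36, so the wiring grows much faster than the SE count.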