Share the front end too, and go Full Bulldozer!
Intel has a different technology it could use here, from the VISC buyout.
Tremont core: 0.88 mm², total quad-core cluster size: 5.14 mm²
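For a rough feel of what those two area numbers imply, here is a minimal sketch of the arithmetic. It assumes the 5.14 mm² cluster figure includes the four cores, so whatever is left over is the shared part (L2 and cluster logic); that split is my assumption, and the function name is made up for illustration.

```python
# Area figures quoted above; everything else here is illustrative.
CORE_AREA_MM2 = 0.88      # one Tremont core
CLUSTER_AREA_MM2 = 5.14   # whole quad-core cluster
CORES_PER_CLUSTER = 4


def area_breakdown(core_area, cluster_area, cores):
    """Return (area of all cores, remaining cluster area, remaining share)."""
    cores_total = core_area * cores
    # Assumption: the cluster figure already contains the cores.
    remainder = cluster_area - cores_total
    return cores_total, remainder, remainder / cluster_area


cores_total, remainder, remainder_share = area_breakdown(
    CORE_AREA_MM2, CLUSTER_AREA_MM2, CORES_PER_CLUSTER
)
print(f"four cores:      {cores_total:.2f} mm^2")     # 3.52 mm^2
print(f"rest of cluster: {remainder:.2f} mm^2")       # 1.62 mm^2
print(f"rest as share:   {remainder_share:.1%}")      # 31.5%
```

So under that assumption roughly two thirds of the cluster is core area, which is where the local front-ends and back-ends live.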
1. Get rid of local front-ends.
2. Use a single non-processor integrated global front-end.
3. Implement a large micro-op/L0i cache in the place of the local front-end.
4. etc.
Ideally, the local memory execution would go as well, replaced by a global memory execution cluster.
1. Get rid of local back-ends.
2. Use a single non-processor integrated global back-end.
3. Implement a large L0d cache in place of the local back-end.
4. etc.
Then, watch as all four 3-ALU pipelines become one 12-wide virtual core or twelve 1-wide virtual cores and everything in between.
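To make "everything in between" concrete, here is a small sketch that enumerates the ways twelve ALU pipelines could be grouped into virtual cores. It only counts groupings by width; it says nothing about how a real global front-end would schedule them, and `virtual_core_configs` is a hypothetical name, not anything from Intel or VISC.

```python
PHYSICAL_CORES = 4
ALUS_PER_CORE = 3


def virtual_core_configs(total_pipes, max_width=None):
    """Yield every split of total_pipes ALU pipelines into virtual cores.

    Each result is a tuple of virtual-core widths in non-increasing order,
    e.g. (6, 3, 3) means one 6-wide and two 3-wide virtual cores.
    """
    if max_width is None:
        max_width = total_pipes
    if total_pipes == 0:
        yield ()
        return
    # Pick the widest core first, then split what is left with cores
    # no wider than that one, so each grouping appears exactly once.
    for width in range(min(total_pipes, max_width), 0, -1):
        for rest in virtual_core_configs(total_pipes - width, width):
            yield (width,) + rest


configs = list(virtual_core_configs(PHYSICAL_CORES * ALUS_PER_CORE))
print(configs[0])    # (12,) : one 12-wide virtual core
print(configs[-1])   # (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) : twelve 1-wide
print(len(configs))  # 77 distinct groupings
```

The two extremes are the ones named above; the other 75 are the "in between" cases, like (6, 6) or (4, 4, 2, 2).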
!!1.5~4.5 MB L2!! -> <<GFE Branch Predict/Fetch/Pick/Decode -> 128KB L1i -> GFE Dispatch/Retire/Allocate>> -> ~~16KB L0i, etc.~~
!!L2 voltage/clock domain!!
<<GFE&BE voltage/clock domain>>
~~Processor voltage/clock domain~~
There is also the Mirage config, which allows the Atom cores to be converted back into InO processors. Since the global part is OoO in itself, this leads to one 12-wide OoO virtual core executing on four 3-wide InO physical cores.
The FPU is integrated within the core for Intel, so with the Mirage setup it can dispatch a part of each AVX512 op to each AVX128 SIMD unit. However, from a performance angle it is more ideal for each of the cores to get wider SIMDs instead.
Instead of a 12-wide 64-bit I-pipe, 4-wide 128-bit M-pipe, and 4-wide 128-bit A-pipe, due to the efficiencies given by the Mirage setup it would be a 12-wide 64-bit I-pipe, 4-wide 512-bit M-pipe, and 4-wide 512-bit A-pipe.
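A toy sketch of the two SIMD options, for anyone who wants the numbers spelled out. The first part splits one 512-bit vector add (modelled as 16 lanes of 32 bits) into four chunks, one per core's 128-bit unit; the second part just multiplies pipes by width. The function names and the 32-bit lane size are assumptions for illustration only.

```python
def split_vector(lanes, units=4):
    """Split one wide vector into equal chunks, one chunk per SIMD unit."""
    if len(lanes) % units:
        raise ValueError("lane count must divide evenly across the units")
    chunk = len(lanes) // units
    return [lanes[i * chunk:(i + 1) * chunk] for i in range(units)]


def vector_add(a, b, units=4):
    """Add two wide vectors by handing each chunk to a separate unit."""
    result = []
    for part_a, part_b in zip(split_vector(a, units), split_vector(b, units)):
        # In hardware each pair of chunks would go to a different core's
        # SIMD unit; here they are simply processed one after another.
        result.extend(x + y for x, y in zip(part_a, part_b))
    return result


def pipe_bits_per_cycle(pipes, width_bits):
    """Peak bits one pipe type can process per cycle across the cluster."""
    return pipes * width_bits


a = list(range(16))   # 16 x 32-bit lanes = one 512-bit vector
b = [10] * 16
print(split_vector(a)[0])            # [0, 1, 2, 3] : one 128-bit chunk
print(vector_add(a, b)[:4])          # [10, 11, 12, 13]
print(pipe_bits_per_cycle(4, 128))   # 512  : four 128-bit pipes
print(pipe_bits_per_cycle(4, 512))   # 2048 : four 512-bit pipes
```

With 128-bit units the whole cluster finishes one AVX512 op per cycle per pipe type; with 512-bit units each core can take a full one, which is the 4x difference between the two layouts.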
The current CPU with VISC at Intel has a legacy going back to this:
https://dada.cs.washington.edu/smt/memoryLogix.pdf