Third image. It allows the big core to be even bigger than it otherwise could be* for the highest ST performance, while the many small cores are there so MT performance isn't sacrificed.
It's a sort of task specialization but for the multi-core era.
*Let's say that without hybrid, GC is 20% faster than WC. With hybrid, rather than having 16 of those cores, you would have only 8 GC cores, but each 40% faster than WC, plus 8 Gracemont cores that add to MT performance (each outperforming SKL by 5% or so).
I don't buy this, not for performance in consumer desktops. The ratio between small and big cores doesn't bring that much of an advantage once you take into consideration the higher thread count needed to reach optimal throughput, not to mention that it likely requires tasks that keep scaling well across many threads (low MT diminishing returns) to get there.
Did some napkin math last night while reading this thread, so I'll just copy-paste it below. If we assume GC = 1.5x Skylake IPC, Gracemont = 1x Skylake IPC, and SMT yields 20%, let's compare throughput potential and topology constraints:
Code:
8 big + 8 small (1x area)
8 x 1.5 x 1.2 = 14.4
8 x 1 = 8
Throughput @ 24T = 22.4
Throughput @ 16T = 20
Throughput @ 12T = 16
Will require dual ring bus, some kind of mesh or new type of interconnect.
Latency sensitive tasks will probably run only on the big cluster for best results.
10 big (1x area)
10 x 1.5 x 1.2 = 18
Throughput @ 24T ~ 18
Throughput @ 16T = 16.8
Throughput @ 12T = 15.6
Same old ring bus, all cores readily available for everything.
12 big (1.2X area)
12 x 1.5 x 1.2 = 21.6
Throughput @ 24T = 21.6
Throughput @ 16T = 19.2
Throughput @ 12T = 18
Maybe too much of a stretch for ring bus, maybe still doable.
8 big + 16 small (1.2X area)
Throughput @ 32T = 30.4
Throughput @ 24T = 28
Throughput @ 16T = 20
Throughput @ 12T = 16
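For anyone who wants to check or tweak the numbers above, here is a rough Python sketch of the same napkin math. It assumes the fill order implied by the figures: one thread per big core first, then one thread per small core, and SMT siblings on the big cores last. The function name `throughput` and its parameters are made up for illustration, and like the rest of this it's max potential at equal clocks, not a prediction.

```python
def throughput(threads, big, small=0, big_ipc=1.5, small_ipc=1.0, smt_gain=0.2):
    """Estimate peak throughput in 'Skylake core' units for a given thread count.

    Assumed fill order (an assumption, not a real scheduler):
      1. one thread per big core
      2. one thread per small core
      3. second (SMT) thread on each big core, worth smt_gain of a big core
    Threads beyond what the chip can host add nothing.
    """
    on_big = min(threads, big)
    on_small = min(threads - on_big, small)
    on_smt = min(threads - on_big - on_small, big)
    return (on_big * big_ipc
            + on_small * small_ipc
            + on_smt * big_ipc * smt_gain)


# Configurations from the post: (label, big cores, small cores)
configs = [
    ("8 big + 8 small", 8, 8),
    ("10 big", 10, 0),
    ("12 big", 12, 0),
    ("8 big + 16 small", 8, 16),
]

for label, big, small in configs:
    row = ", ".join(
        f"{t}T = {throughput(t, big, small):.1f}" for t in (32, 24, 16, 12)
    )
    print(f"{label}: {row}")
```

Changing `big_ipc`, `small_ipc` or `smt_gain` shows how sensitive the comparison is to those three guesses, e.g. dropping `big_ipc` to 1.4 to match the 40% figure from the quoted post.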
Some observations:
- 12T workloads would work just as well on 10 big as on 8+8
- 8+8 will likely use only the big cores in gaming, pure 8 big core chips will be smaller and just as fast
- 12 big can match 8+8 in throughput, incidentally this may look a lot like Alder Lake vs. Zen 4
- 8+16 really starts to shine in MT, but is 32T a consumer load anymore?
I couldn't cover the influence of power savings brought by the small cores, but then again we'd have to take other things into consideration as well:
- small cores may or may not reach big core frequency, meaning the math is purely about max potential anyway
- significant changes in interconnect may actually offset power gains brought by a relatively small cluster of 8 small cores
- it's power on enthusiast desktop, we're playing with 150-200W right now and don't seem to mind... so why start caring now?
From my POV this 8+8 Alder Lake, if true, is the same type of experiment as Lakefield: very promising when looking at isolated parts, but quite troublesome to optimize once you put everything together in a cohesive package. Both Lakefield and Alder Lake successors will probably be the real deal where design decisions & prior experience bring hefty performance results, but it seems that lately all we do with Intel is dream about the generation after the next. Luckily both TGL and RKL-S are far more conventional and ready for today's software.