Also, people say the E core has been gaining so much because it started from the lower end of the scale. That's a lazy explanation.
Well, what do you say about the different approach the E core team uses? There haven't been any new ideas on the P core team since Sandy Bridge in 2011, while the E core team has been bringing new ones every 1-2 generations. And every generation has been a sweeping change:
Atom Bonnell - Macro Op execution:
https://www.anandtech.com/show/2493/9
Atom Silvermont - First OoOE, proper memory subsystem
Atom Goldmont - OoOE FP, 3-way decode, 16KB pre-decode L2 cache
Atom Goldmont Plus - 4-way backend, 64KB predecode L2 cache
Atom Tremont - Clustered decode, 128KB predecode L2
Gracemont - Improved clustered 2x3 decode with an auto-balancer. The OD-ILD (on-demand instruction length decoder) replaces the 128KB predecode L2 and can handle large code footprints, so now clustered decode works on all code.
Skymont - 3x3 decode, with commonly used ucode instructions handled per cluster to improve parallelism. This is probably an area/power-efficient way of having a Fast Path for instructions. Doubled FP/Vector units, which benefits all existing code out there, not just a few workloads.
They improved branch prediction, BTB size, and backend width on Crestmont, a mere Tick!
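To make the clustered-decode idea concrete, here is a toy model. All the numbers and names below are made up for illustration, not real Gracemont parameters: the point is only that round-robining basic blocks across two narrow decoders can finish a stream of blocks in fewer cycles than one decoder of the same width working strictly in order.

```python
# Toy model of clustered instruction decode (conceptual sketch only,
# NOT the actual Gracemont implementation). Baseline: one 3-wide
# decoder works through basic blocks in program order. Clustered:
# basic blocks are round-robined to two independent 3-wide decoders
# that run in parallel.

def cycles_single_decoder(blocks, width=3):
    """One `width`-wide decoder processes basic blocks in order.
    Simplification: a taken branch ends the fetch group, so each
    basic block starts on a fresh cycle."""
    return sum(-(-n // width) for n in blocks)  # ceil(n / width)

def cycles_clustered(blocks, clusters=2, width=3):
    """Basic blocks are round-robined to `clusters` independent
    width-wide decoders; total time is the busiest cluster."""
    busy = [0] * clusters
    for i, n in enumerate(blocks):
        busy[i % clusters] += -(-n // width)
    return max(busy)

blocks = [5, 2, 7, 3, 4, 6]  # instruction counts per basic block
print(cycles_single_decoder(blocks))  # 11 cycles in order
print(cycles_clustered(blocks))       # 7 cycles with two clusters
```

The same trick is why the approach scales: adding a third cluster (as on Skymont) grows decode width without building one enormous serial decoder.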
Willing to try new things
-Macro-op execution, rather than cracking everything into micro-ops
-Clustered decode
-Taking out 128KB L2 Predecode and replacing it with OD-ILD
-A very wide 16-wide retire on Skymont to save resources elsewhere. Probably also benefits clustered decode
-More Stores than Loads
Different ideas at a fundamental level
-Clustered decode can handle loops directly, so there is no Loop Stream Buffer
-Rather than shared buffers, everything is independent
-Many, many simple ports over few very powerful ones
-Doubling FP units benefits everyone, versus AVX needing a recompile every time
This is a truly inspired and dynamic team, and this is why the future is likely with them. AMD is using clustered decode on Zen 5, and so far David Huang isn't seeing it work for single-threaded code. Tremont did it better 4 years ago.