You don't get it? Let me describe it to you.
The Haifa IDC team is very conservative in their changes. Nehalem as a core doesn't bring much; it's essentially a tick. SMT, IMC, QPI, and the cache level changes are all architecture-agnostic. Then you have Sandy Bridge, but Haswell, Skylake, Icelake, Sunny Cove, Golden Cove, Lion Cove are all expansions of previous concepts. No new architectural ideas.
Pentium M - 32+32KB L1 cache, Micro Op Fusion, Dedicated Stack manager, real dynamic SpeedStep, L2 way splitting to save power.
Core 2 - Macro Op Fusion, Memory Disambiguation
Sandy Bridge - Uop cache, Physical Registers, Branch predictor improvements without xtor increase, AVX256 support using same number of ports, Ring Bus, real Turbo Mode.
Even the cores above greatly increased core size ISO-process: 50% larger, for 15-20% improvements. All of them come with improved branch prediction.
Let's compare E cores:
Bonnell - 2-issue in-order core, SMT
Silvermont - Out of Order execution, non blocking memory instructions, SMT removed, same core size as previous gen. 50% faster per clock
Goldmont - 3-way decode, OoOE fully pipelined FP, 16KB L2 predecode cache, Fetch and I-cache decoupled, 20B fetch, from 16B, 30% faster per clock
Goldmont Plus - Widens backend to 4-wide from 3, 64KB L2 predecode, 30% faster per clock
Tremont - 32KB D-cache from 24KB, 6-wide using 2x 3-wide clustered decode, 2x16B fetch, 128KB L2 predecode, 30% faster per clock
Gracemont - L1-I doubled to 64KB, replaces the L2 predecode with a new feature called On-Demand Instruction Length Decoder (OD-ILD), Clustered decode has a load balancer that inserts a fake branch to address cases where there aren't enough branches (meaning no clustering), 2x16B from OD-ILD and 2x32B from I-cache, supports AVX2 using 128-bit vector units, 30% faster per clock
Skymont - 9-wide using 3x3-wide, Nanocode to improve ILP, ultra-wide retirement to save overall resources, literally doubles FP from 2x to 4x units, 30% faster per clock in Int and 60% faster in FP
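
To put the list above in perspective, here is a rough sketch that compounds the per-clock gains quoted for each E core generation. It assumes the quoted figures simply multiply from one generation to the next and were measured on comparable workloads, which the numbers above don't guarantee. Skymont uses the integer figure; the names `GAINS` and `cumulative_gains` are made up for this illustration.

```python
# Per-clock gains as quoted in the list above (Skymont: integer figure).
GAINS = [
    ("Silvermont", 0.50),
    ("Goldmont", 0.30),
    ("Goldmont Plus", 0.30),
    ("Tremont", 0.30),
    ("Gracemont", 0.30),
    ("Skymont", 0.30),
]


def cumulative_gains(gains, baseline="Bonnell"):
    """Return (core, per-clock performance relative to baseline) pairs.

    Assumption: each generation's gain multiplies onto the previous one.
    """
    total = 1.0
    result = [(baseline, total)]
    for name, gain in gains:
        total *= 1.0 + gain
        result.append((name, total))
    return result


for name, factor in cumulative_gains(GAINS):
    print(f"{name:14s} {factor:5.2f}x per clock vs Bonnell")
```

Under that assumption, Skymont lands at roughly 5.6x Bonnell per clock. Treat it as an order-of-magnitude picture, not a benchmark.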
The efficiency gains, performance gains, execution efficiency and rate of innovation don't even come close between the two teams. The E core team is far superior, even compared with the P core team in its best days. And those amazing 30% gains came at a linear area/power increase.
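
A quick back-of-the-envelope on what that means per unit of area, using only the figures quoted in this post. One loud assumption: "linear" is read here as area growing by the same fraction as performance (30% for 30%). The helper name `perf_per_area` is made up for this sketch.

```python
def perf_per_area(perf_gain, area_gain):
    """Relative performance per unit area after one generation.

    1.0 means perf/area is unchanged; below 1.0 means it got worse.
    """
    return (1.0 + perf_gain) / (1.0 + area_gain)


# P core figures quoted earlier: 50% larger for 15-20% improvement.
p_low = perf_per_area(0.15, 0.50)
p_high = perf_per_area(0.20, 0.50)

# E core figure: 30% faster at a linear area increase.
# ASSUMPTION: "linear" is taken to mean area also grows by 30%.
e_core = perf_per_area(0.30, 0.30)

print(f"P core perf/area per generation: {p_low:.2f} to {p_high:.2f}")
print(f"E core perf/area per generation: {e_core:.2f}")
```

With those inputs the P core loses around 20-23% of its performance per area every generation, while the E core holds it flat.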
In Silvermont they removed SMT and got a 50% perf improvement at the same area ISO-process with OoOE. In Tremont, they added a novel feature (clustered decode). In Gracemont they addressed the weaknesses of that feature (the load balancer) while cutting out another feature (the L2 predecode) and replacing it with a better one (OD-ILD). Such a breakneck pace of modifications without screwing up and regressing is amazing.
Meanwhile, the P core team stayed at 16B fetch all the way from Pentium II in 1998 to Sunny Cove in 2020. Only Golden Cove doubled it to 32B. While it can be argued that 16B is enough in the average scenario, since the average x86 instruction is about 4 bytes and 16B satisfies 4-way decode, there will be bottlenecks whenever instructions run longer than average. Goldmont, a tiny core, increased it to 20B. In Gracemont it's 2x16B from OD-ILD and 2x32B from I-cache.
Skymont is 3x32B, while Lion Cove's 32B fetch has to serve all 8 decoders. Skymont also slightly outperforms Lion Cove in the all-important branch prediction, and it's wider too.
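
The fetch argument can be put in numbers. This sketch divides fetch bytes per cycle by what the decoders would consume if every instruction were the 4-byte average mentioned above. A ratio of 1.0 means fetch only just keeps up with the average case, so any run of longer instructions starves the decoders. The widths and fetch sizes are the ones quoted in this post; `AVG_INSN_BYTES`, `FRONT_ENDS` and `fetch_headroom` are made-up names, and peak bytes per cycle is of course a simplification of how a real front end behaves.

```python
AVG_INSN_BYTES = 4  # average x86 instruction length assumed in the post

# (fetch bytes per cycle, decode width), as quoted in the post.
FRONT_ENDS = {
    "P core, 16B fetch / 4-way decode": (16, 4),
    "Lion Cove, 32B fetch / 8 decoders": (32, 8),
    "Skymont, 3x32B fetch / 9-wide decode": (3 * 32, 9),
}


def fetch_headroom(fetch_bytes, decode_width, avg_bytes=AVG_INSN_BYTES):
    """Fetch bandwidth divided by average-case decoder demand."""
    return fetch_bytes / (decode_width * avg_bytes)


for name, (fetch_bytes, width) in FRONT_ENDS.items():
    ratio = fetch_headroom(fetch_bytes, width)
    print(f"{name}: {ratio:.2f}x headroom")
```

On these numbers the old 16B/4-wide setup and Lion Cove both sit at exactly 1.0x, while Skymont has about 2.7x.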
Why was the P core team so conservative in some areas, while blowing budget and power on others?