On the other hand, dropping SMT (or HT, for Intel) takes the physical logic units off the die. It reduces complexity and cost, and in turn will use anywhere from 10% to 22-25% less power under full load. You could laser-fuse the logic units off so there's no active state under load for the associated core, but you're wasting die space and increasing complexity and manufacturing costs if you can't get it right and need to bin heavily.
I think the road for SMT is unclear in mobile/desktop computing. Does it make much sense for an 8 + 32 CPU? Let's say SMT provides 25% more throughput and E cores are 66% of P-core perf. Then it's 10 + 21 vs 8 + 21 P-core equivalents, which is only about a 7% IDEAL throughput increase.
The keyword being IDEAL, as a real SoC will be thermal/power limited and will need to drop clocks in the P and E clusters. So our theoreticals might not apply, or might even produce negative scaling in some thermally restricted scenario where the P cores overheat.
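To make the back-of-envelope math above easy to play with, here is a minimal sketch. It assumes a purely linear model where throughput is counted in "P-core equivalents"; the function names and the `p_clock_scale_with_smt` knob are made up for illustration and are not measured data. With 8 P cores, 32 E cores, a 25% SMT gain and E cores at 66% of a P core, the ideal uplift comes out at about 7%, and it goes negative once the P clocks have to drop below roughly 80% to stay within thermals.

```python
def ideal_throughput(p_cores, e_cores, smt_gain=0.0, e_ratio=0.66,
                     p_clock_scale=1.0, e_clock_scale=1.0):
    """Total throughput in P-core equivalents under a simple linear model.

    p_clock_scale / e_clock_scale are hypothetical derating factors
    (1.0 = no thermal or power throttling).
    """
    p_part = p_cores * (1.0 + smt_gain) * p_clock_scale
    e_part = e_cores * e_ratio * e_clock_scale
    return p_part + e_part


def smt_uplift(p_cores=8, e_cores=32, smt_gain=0.25, e_ratio=0.66,
               p_clock_scale_with_smt=1.0):
    """Relative throughput change from enabling SMT on the P cores.

    p_clock_scale_with_smt is an assumed clock drop that only applies
    when SMT is on (e.g. because the P cluster runs hotter).
    """
    base = ideal_throughput(p_cores, e_cores, 0.0, e_ratio)
    with_smt = ideal_throughput(p_cores, e_cores, smt_gain, e_ratio,
                                p_clock_scale=p_clock_scale_with_smt)
    return with_smt / base - 1.0


if __name__ == "__main__":
    print(f"ideal uplift:            {smt_uplift():.1%}")
    print(f"P clocks at 90% w/ SMT:  {smt_uplift(p_clock_scale_with_smt=0.90):.1%}")
    print(f"P clocks at 75% w/ SMT:  {smt_uplift(p_clock_scale_with_smt=0.75):.1%}")
```

The break-even point in this toy model is a P clock scale of 1 / 1.25 = 0.8: any SMT-induced clock drop bigger than 20% eats the whole gain.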
Intel might have some interesting sharing between cores in mind. What are the main users of transistors? Caching structures, be it memory, branches or TLBs. It would be madness to share the "L1" of those due to performance, but "L2" structures are really massive, multi-ported structures that sit idle if
So Intel might have gone for a "Penryn"-like module of 2 cores, where each core has an "L0" of 48KB and an L1 of 256KB, then an L2 of some 6-8MB shared by the 2 cores, and a traditional L3 on the SoC.
The L2 TLB could be shared too, and so could a massive L2 BTB. Is there a huge shared L2 TLB? Very much possible.
That kind of sharing increases performance in a straightforward, energy-saving way without blowing up the per-core transistor budget.