Yes, that was the rumor. 8 P + 32 E would take roughly the same die area as 12 P + 16 E. But 8 P + 32 E (48 threads) would be faster in highly multi-threaded tasks than 12 P + 16 E (40 threads).
Let's give the P cores a benchmark that highlights their strength and see what happens.
Rough math assumptions:
- We'll use the Chips and Cheese libx264 Transcode chart since that favors P cores over E cores.
- Assume a 125 W processor load.
- Assume a 10 W uncore with all the various tiles (IO, iGPU, SOC, etc). This is low, especially if the iGPU is used, but it favors the P cores by leaving more power for them.
- Assume that each P core is given double the power of each E core. (Feel free to change this assumption and recalculate as you wish.)
12 P + 16 E Math:
- Each block of 4 P cores uses 23 W and performs 7.7 frames/s according to the Chips and Cheese benchmark.
- Each block of 4 E cores uses 11.5 W and performs 4.5 frames/s according to the Chips and Cheese benchmark.
- Total power = 10 W (uncore) + 3 * 23 W (P cores) + 4 * 11.5 W (E cores) = 125 W
- Total performance: 3 * 7.7 frames/s + 4 * 4.5 frames/s = 41.1 frames/s
8 P + 32 E Math:
- Each block of 4 P cores uses 19.17 W and performs 6.8 frames/s according to the Chips and Cheese benchmark.
- Each block of 4 E cores uses 9.58 W and performs 4.1 frames/s according to the Chips and Cheese benchmark.
- Total power = 10 W (uncore) + 2 * 19.17 W (P cores) + 8 * 9.58 W (E cores) ≈ 125 W
- Total performance: 2 * 6.8 frames/s + 8 * 4.1 frames/s = 46.4 frames/s
The 8 P + 32 E wins over 12 P + 16 E (46.4 vs. 41.1 frames/s, about 13% faster) on the benchmark that preferred P cores. Even if you subtract a few percent for Amdahl's law, 8 P + 32 E still wins.
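If you want to redo the math with a different P-to-E power ratio or uncore figure, here is a rough Python sketch of the power split used above. The function names are my own, and the frames/s values are just the numbers I read off the Chips and Cheese chart for these power levels. The script does not model the power/performance curve, so if you change the power split you have to look up new frames/s values yourself.

```python
def block_power(total_w, uncore_w, p_blocks, e_blocks, p_to_e_ratio=2.0):
    """Split the core power budget across 4-core blocks.

    Assumes every P block gets p_to_e_ratio times the power of an E block
    (the "double the power" assumption from the post).
    Returns (watts per P block, watts per E block).
    """
    core_budget = total_w - uncore_w
    e_block_w = core_budget / (p_blocks * p_to_e_ratio + e_blocks)
    return p_to_e_ratio * e_block_w, e_block_w


def total_fps(p_blocks, p_fps, e_blocks, e_fps):
    """Sum per-block throughput, assuming perfect scaling (no Amdahl penalty)."""
    return p_blocks * p_fps + e_blocks * e_fps


# Per-block frames/s are the chart readings quoted in the post, not computed.
configs = {
    "12P+16E": dict(p_blocks=3, e_blocks=4, p_fps=7.7, e_fps=4.5),
    "8P+32E": dict(p_blocks=2, e_blocks=8, p_fps=6.8, e_fps=4.1),
}

for name, c in configs.items():
    p_w, e_w = block_power(125, 10, c["p_blocks"], c["e_blocks"])
    fps = total_fps(c["p_blocks"], c["p_fps"], c["e_blocks"], c["e_fps"])
    print(f"{name}: P block {p_w:.2f} W, E block {e_w:.2f} W, total {fps:.1f} frames/s")
```

With the assumptions above this reproduces the 23 W / 11.5 W and 19.17 W / 9.58 W block figures and the 41.1 and 46.4 frames/s totals.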