Now that we've smoked out the rat, I can't wait for details on Zen 5 LP.
Not just the perf, but what did they take out, power draw, etc.
Be interesting to see areal density too.
Apparently there will be only a small number of LP cores. [Purpose: to host background tasks in idle situations / connected standby maybe, not to prop up Cinebench. ;-)] Thus, areal density, while not unimportant, may not be a central design goal. For Zen 5 LP, that is.
Given that Bergamo was only a 1.33x increase in core count over Genoa, and the Zen 5 successor is supposed to be more like 1.5x, there must be a significant difference in layout there too.
Genoa and Bergamo still have some spare room under the lid. (According to published photos, not that I've delidded one myself.) I guess the new IOD for Turin and Turin-Dense could be a more slender rectangle than Genoa's and Bergamo's IOD, for a better shoreline-to-area ratio, both for putting the additionally needed GMI links on the chip and for facilitating their routing on the package.
The 96-core Threadripper 7995WX, however, is very crowded under the hood. But whether there will be a direct Zen 5 based successor to it remains to be seen anyway.
I have 352 Genoa cores myself.
4x 64c/128t and 1x 96c/192t according to Mark's signature. All 1P I think.
Are you able to utilize them properly or do you need to resort to putting your workloads in VMs for better core occupancy?
In distributed computing, we often run n instances of single-threaded processes. This scales without problem to that many threads (on Linux; I am not up to date with Windows). Sometimes we run fewer instances of multi-threaded processes. With some of those applications, performance suffers a lot if the threads of one process end up running on different CCXs. That has been an issue with Zen 1 through 4 and obviously will remain one with Zen 5. Hard to say what will happen with Zen 6 and its substantially changed SoCs. (Or with Strix Halo already, in fact.) The problem is two-fold: inter-thread shared data ends up in more caches than strictly needed, and inter-thread communication across CCX boundaries is slow and energy-costly. But we don't need VMs or even containers to solve this; we can do it with helper tools, or in the case of EPYCs use a BIOS option which (ab)uses NUMA hints to coerce a NUMA-aware operating system into cache-aware thread scheduling. (Neither Windows' nor Linux's kernel implements a cache-aware scheduling policy. The kernel developers probably have their reasons for leaving this to userspace.)
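The "helper tools" approach can be as simple as setting a CPU affinity mask per process. A minimal sketch for Linux, assuming 8 cores per CCX and linear logical-CPU numbering (both are assumptions; check lscpu or /sys/devices/system/cpu/cpu*/cache on real hardware before relying on this):

```python
import os

CORES_PER_CCX = 8  # assumption; varies by product (e.g. dense Zen 4c CCXs differ)

def ccx_cores(ccx_index, cores_per_ccx=CORES_PER_CCX):
    """Logical CPU ids belonging to one CCX, assuming linear numbering."""
    start = ccx_index * cores_per_ccx
    return set(range(start, start + cores_per_ccx))

def pin_to_ccx(pid, ccx_index):
    """Confine a process and all of its threads to a single CCX,
    so they share one L3 instead of bouncing data between caches."""
    os.sched_setaffinity(pid, ccx_cores(ccx_index))

# Example: before launching instance i of a multi-threaded worker,
# pin the current process to CCX i, then exec the workload:
# pin_to_ccx(0, i)
```

The same effect can be had without code via taskset(1) or numactl(8); the point is merely that it's userspace-level policy, not something the kernel scheduler does for you.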
EDIT: CPUs with a unified last-level cache would remove this inconvenience. But the price to pay would be higher last-level cache latency and higher chip manufacturing costs.
EDIT 2: Some distributed computing enthusiasts do have Intel P+E CPUs, but even though an E-core performs roughly on par with one P-core HT thread, these CPUs are still [...troublesome...]
Not sure if the Win11 scheduler has been improved, but Linux is supposedly better at dealing with hybrid cores:
https://www.phoronix.com/news/Linux-6.5-Intel-Hybrid-Sched
When Intel's offering was 8c/16t + 8c/8t, even with scheduling like that (regardless of whether it is implemented in kernelspace or userspace), you are left with a large asymmetry. It is now better with 8c/16t + 16c/16t at the top end, but still not symmetric.
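To put rough numbers on that asymmetry, here's a back-of-the-envelope sketch using the rule of thumb from above (one E-core roughly equals one P-core HT thread; that equivalence is an assumption, not a measurement):

```python
def thread_balance(p_cores, e_cores, smt=2):
    """P hardware threads vs E cores, under the rule of thumb that
    one E core is roughly worth one P-core SMT thread (assumption)."""
    p_threads = p_cores * smt
    return p_threads, e_cores

# Alder Lake top end: 8c/16t P + 8c/8t E
print(thread_balance(8, 8))    # (16, 8): a lopsided 2:1 split
# Raptor Lake top end: 8c/16t P + 16c/16t E
print(thread_balance(8, 16))   # (16, 16): even counts, but the two halves
                               # still differ in per-thread behavior
```

Even with matched counts, a scheduler (or a userspace pinning tool) still has to decide which half of the machine each worker lands on, which is the residual asymmetry the post refers to.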
As an aside, recall how Intel solved this in their LGA1700 Xeon line:
ark.intel.com (Though this line is not targeted at compute servers.)