Discussion Triple architecture CPU or dissimilar dual socket motherboard

igor_kavinski · Mar 10, 2022

Intel P-core - Really good single threaded performance

Intel E-core - Acceptable and scalable multicore performance per area

AMD Zen 3+ - Unmatched multicore performance per watt

We have a problem. We want all three types of cores in our PC. We don't want to buy two different PCs and juggle our tasks between them based on which CPU core is best for our use case.

Here's what needs to happen. Intel needs to reach out to AMD and propose a combined CPU with all three cores. AMD shouldn't mind licensing their core to Intel since this arrangement would free them from the burden of catering to desktop users and they can simply focus on their server clients. AMD gets royalties for their CPU cores used in this combined CPU assembled by Intel.

If this is not possible due to disagreement between Intel and AMD, then someone like ASUS needs to build a dual socket mobo incorporating both Intel and AMD chipsets. A KVM solution on the mobo would allow switching between the two systems. It can be a full ATX mobo. For PC users like us, we get the best of both worlds in a relatively compact PC and enjoy less hassle plus time saved from not having to daydream about the other side of the fence. I know this seems like a huge and almost insurmountable engineering problem but I have confidence that the best of the best can make this happen. They just need to believe that something like this would be well received. Please show your support for one or both of these crazy ideas and help make this miracle happen.

StefanR5R · Mar 11, 2022

igor_kavinski said:
We want [...] three types of cores in our PC.

Some tablet computers are sold with three types of cores currently. Snapdragon 8 Gen 1 comes to my mind, which consists of 1 Cortex X2 + 3 Cortex A710 + 4 Cortex A510.

In computers which are not battery powered, homogeneous CPUs are fine IMO. SMT and power management already bring a great deal of complications to operating systems' process schedulers.

Markfw · Mar 11, 2022

nicalandia said:
Just updated the info with the correct data. Actually at 158 mm^2 we get 48C/48T for a total of 45,000 points in CB R23. We will see that on Sierra Forest in a few years.

Well, thanks for the numbers. BUT it's all *estimation* vs facts. And it's one benchmark. My 12700F is getting killed at PrimeGrid about 50% more throughput. It does units in 64% of the time and has 50% more threads working, at the same power. Not sure about the size.

nicalandia · Mar 11, 2022

Markfw said:
Well, thanks for the numbers. BUT it's all *estimation* vs facts.

That is all we currently have to base our numbers on. The 12700F has only 4 Gracemont cores. The 12900K has 8 and the 13900K will have 16. Intel Sierra Forest is expected to have 60 E-cores per tile (240 per 4-tile SoC).

Intel Sierra Forest the E-Core Xeon Intel Needs

We discuss the Intel Sierra Forest announcement at Intel Investor Meeting 2022. This is the high E-core count chip Intel needs

www.servethehome.com

Markfw · Mar 11, 2022

nicalandia said:
That is all we currently have to base our numbers on. The 12700F has only 4 Gracemont cores. The 12900K has 8 and the 13900K will have 16. Intel Sierra Forest is expected to have 60 E-cores per tile (240 per 4-tile SoC).

Intel Sierra Forest the E-Core Xeon Intel Needs

We discuss the Intel Sierra Forest announcement at Intel Investor Meeting 2022. This is the high E-core count chip Intel needs

www.servethehome.com

And we have no numbers as to how that compares to Milan-X or Genoa. These exist, and Sierra Forest does NOT.

Also, saying "Acceptable and scalable multicore performance per area" is pretty much undeniable.

But saying "Unmatched multicore performance per area" is very debatable.

igor_kavinski · Mar 11, 2022

StefanR5R said:
Snapdragon 8 Gen 1 comes to my mind, which consists of 1 Cortex X2 + 3 Cortex A710 + 4 Cortex A510.

But I would prefer these CPU manufacturers standardize on some chipset interface so companies like ASUS etc. can assemble the CPUs with different cores from different manufacturers based on demand. That would turn decision making into hell when buying a new PC but it would also lead to unique and fun experiences.

Here's an idea even crazier than dissimilar socket mobo. How about a game engine that queries what type of cores are available and then rewrites itself in memory to optimize itself to make the best use of available resources based on AI training done by the developers at their HQ? Won't that be uber cool? It would be like the Transmeta CPU that takes 24 hours or so to reach full speed but the benefit is that it keeps learning new tricks to make itself better and better. And it's not bound to any architecture. A new CPU comes to market, the developers or even the end user just gets the AI training profile and the game engine adapts itself to run as fast as it can on the new CPU without needing to do any manual optimization.
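The "queries what type of cores are available" step is the tractable part of that idea. Here is a minimal sketch of it, assuming a Linux system whose kernel exposes hybrid Intel cores through the cpu_core and cpu_atom directories under /sys/devices (that layout is an assumption, not something stated in this thread). The function names are hypothetical, and the self-rewriting, AI-trained part is not attempted.

```python
from __future__ import annotations

from pathlib import Path


def parse_cpu_list(text: str) -> list[int]:
    """Expand a Linux cpulist string such as "0-3,8,10-11" into CPU ids."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def detect_core_types(sysfs_root: str = "/sys/devices") -> dict[str, list[int]]:
    """Return {"P": [...], "E": [...]} if the hybrid sysfs entries exist.

    Assumption: P-cores are listed in cpu_core/cpus and E-cores in
    cpu_atom/cpus. On a homogeneous CPU (or another OS) this returns {}.
    """
    found: dict[str, list[int]] = {}
    for label, pmu_dir in (("P", "cpu_core"), ("E", "cpu_atom")):
        path = Path(sysfs_root) / pmu_dir / "cpus"
        if path.is_file():
            found[label] = parse_cpu_list(path.read_text())
    return found


if __name__ == "__main__":
    print(detect_core_types() or "no hybrid core information found")
```

An engine could use a result like this to pin latency-sensitive threads to the P list and background work to the E list, which is far more modest than rewriting itself in memory.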

nicalandia · Mar 11, 2022

Markfw said:
And we have no numbers as to how that compares to Milan-X or Genoa. These exist, and Sierra Forest does NOT.

Also, saying "Acceptable and scalable multicore performance per area" is pretty much undeniable.

But saying "Unmatched multicore performance per area" is very debatable.

We can extrapolate numbers from the information we currently have, at least to measure the performance per area you wanted numbers for, even if it's just one benchmark (CB R23). And in that application Gracemont cores reign supreme in performance/area.

You know what? Let me pull the SPEC Info we have from Anandtech Bench....

Hitman928 · Mar 11, 2022

nicalandia said:
We can extrapolate numbers from the information we currently have, at least to measure the performance per area you wanted numbers for, even if it's just one benchmark (CB R23). And in that application Gracemont cores reign supreme in performance/area.

You know what? Let me pull the SPEC2017 Info we have from Anandtech Bench....

I don't think your die size numbers hold up. You seem to be including the whole CCD for Zen 3 versus just the Core+L2+Partial L3 for Gracemont. From eyeballing it, a Zen 3 core + L2 is only ~4 mm^2 versus nearly 9 mm^2 for 4x Gracemont + L2.

nicalandia · Mar 11, 2022

Hitman928 said:
I don't think your die size numbers hold up. You seem to be including the whole CCD for Zen 3 versus just the Core+L2+Partial L3 for Gracemont. From eyeballing it, a Zen 3 core + L2 is only ~4 mm^2 versus nearly 9 mm^2 for 4x Gracemont + L2.

Let me double check that for you.

Hitman928 · Mar 11, 2022

nicalandia said:
Let me double check that for you.

You're also ignoring thermal/power constraints in your hypothetical 48C Gracemont CPU performance estimate.

ryanjagtap · Mar 11, 2022

igor_kavinski said:
But I would prefer these CPU manufacturers standardize on some chipset interface so companies like ASUS etc. can assemble the CPUs with different cores from different manufacturers based on demand. That would turn decision making into hell when buying a new PC but it would also lead to unique and fun experiences.

Here's an idea even crazier than dissimilar socket mobo. How about a game engine that queries what type of cores are available and then rewrites itself in memory to optimize itself to make the best use of available resources based on AI training done by the developers at their HQ? Won't that be uber cool? It would be like the Transmeta CPU that takes 24 hours or so to reach full speed but the benefit is that it keeps learning new tricks to make itself better and better. And it's not bound to any architecture. A new CPU comes to market, the developers or even the end user just gets the AI training profile and the game engine adapts itself to run as fast as it can on the new CPU without needing to do any manual optimization.

I think the recently announced UCIe standard could be close to what you want, and it's an exciting concept. But given what we saw with their Kaby Lake-G lineup, I don't think combined Intel and AMD products will be available anytime soon.

nicalandia · Mar 11, 2022

Hitman928 said:
I don't think your die size numbers hold up. You seem to be including the whole CCD for Zen 3 versus just the Core+L2+Partial L3 for Gracemont. From eyeballing it, a Zen 3 core + L2 is only ~4 mm^2 versus nearly 9 mm^2 for 4x Gracemont + L2.

I got the numbers and will use the Anandtech SPEC2017 suite for MT performance metrics.

The Intel 12th Gen Core i9-12900K Review: Hybrid Performance Brings Hybrid Complexity

www.anandtech.com

Full sized Zen3 Vermeer CCD with L2$ and L3$ size is about 68.42 mm^2 (Extrapolated from Locuza Original Cezanne die shot annotations, will update it when I find available info)

16C/16T Gracemont Core Cluster with Full L2$/L3$ is 52.68 mm^2 (Extrapolated from Locuza original measurement of a single quad core cluster with only L2$)

I am extrapolating the L3$ On Both CPU uArch to make it fair(L3$ on Vermeer is HUGE) but we can also do just with L2$ which will keep Locuza's original measurements to the letter.

Now to the Numbers. As you can see the 8C/16T is the sweet spot on the Performance/Area for Vermeer

8C/8T Gracemont Core die area 26.34 mm^2 FP performance per area is 1.44, INT Performance/Area 1.13
8C/16T Vermeer Zen3 Core die area 68.42 mm^2 , FP performance per area 0.688, INT Performance/Area 0.744

So in the same area you can fit 8C/16T Zen3 you could fit 20C/20T Gracemont Cores

This is all we currently have(SPEC is pretty good at measuring MT Performance in INT and FP workloads), with this information we can clearly see that Gracemont Cores can't be beat at Performance/Area, Zen3 is also pretty good at that, much better than Golden Cove for sure.
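The "20C/20T in the same area" figure can be reproduced from the two areas quoted above. This is a minimal sketch, assuming Gracemont can only be added in whole 4-core clusters and that the extrapolated 52.68 mm^2 for 16 cores divides evenly across clusters; the function name is hypothetical and the areas are the estimates from this post, not measured values.

```python
import math

CORES_PER_CLUSTER = 4

# Area estimates quoted in the post (mm^2, including extrapolated L3).
GRACEMONT_16C_AREA = 52.68
VERMEER_8C_AREA = 68.42


def gracemont_cores_in_area(area_mm2: float) -> int:
    """Count the Gracemont cores that fit in area_mm2, in whole clusters."""
    clusters_in_16c = 16 // CORES_PER_CLUSTER
    cluster_area = GRACEMONT_16C_AREA / clusters_in_16c  # 13.17 mm^2
    whole_clusters = math.floor(area_mm2 / cluster_area)
    return whole_clusters * CORES_PER_CLUSTER


print(gracemont_cores_in_area(VERMEER_8C_AREA))  # 20
```

The result depends entirely on how much L3 is attributed to each cluster, so a different cache assumption changes the core count.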

Hitman928 · Mar 11, 2022

nicalandia said:
I got the numbers and will use the Anandtech SPEC2017 suite for MT performance metrics.

The Intel 12th Gen Core i9-12900K Review: Hybrid Performance Brings Hybrid Complexity

www.anandtech.com

Full sized Zen3 Vermeer CCD with L2$ and L3$ size is about 68.42 mm^2 (Extrapolated from Locuza Original Cezanne die shot annotations, will update it when I find available info)

View attachment 58498

View attachment 58499

16C/16T Gracemont Core Cluster with Full L2$/L3$ is 52.68 mm^2 (Extrapolated from Locuza original measurement of a single quad core cluster with only L2$)

View attachment 58500

I am extrapolating the L3$ On Both CPU uArch to make it fair(L3$ on Vermeer is HUGE) but we can also do just with L2$ which will keep Locuza's original measurements to the letter.

Now to the Numbers. As you can see the 8C/16T is the sweet spot on the Performance/Area for Vermeer

View attachment 58504

8C/8T Gracemont Core die area 26.34 mm^2 FP performance per area is 1.44, INT Performance/Area 1.13
8C/16T Vermeer Zen3 Core die area 68.42 mm^2 , FP performance per area 0.688, INT Performance/Area 0.744

This is all we currently have(SPEC is pretty good at measuring MT Performance in INT and FP workloads), with this information we can clearly see that Gracemont Cores can't be beat at Performance/Area, Zen3 is also pretty good at that, much better than Golden Cove for sure.

Cezanne has 16 MiB of L3 and Vermeer has 32 MiB of L3. I don't understand why 12 MiB of L3 cache is considered "full" cache for Gracemont, can you explain? Also, how are you getting 26.34 mm^2 for Gracemont area? (Edit: never mind, I see you are cutting the hypothetical 16-core Gracemont in half, but that would remove half of the L3 as well, which would have a large effect on performance, especially in something like SPEC. You couldn't then use this size for your perf/area calculation. In Anandtech's SPEC tests, Gracemont has access to 30 MiB of L3 cache.)

If we were to use SPECfp, then a better AMD example would probably be Rembrandt as memory bandwidth has a significant effect on SPEC Rate-N results, though I don't think clear die shots have been shown for RMB yet. RMB will have a slight process advantage, but even if you assume no die shrink from Cezanne (probably at least a decent assumption), your perf/area comparison will come out very different as RMB gets a score of ~44 in SPECfp2017 and ~41 in SPECint2017.

Hitman928 · Mar 11, 2022

@nicalandia This is all academic anyway, but I actually think using Cinebench as the performance benchmark and cores+L2 as the area would be the way to go. This is still a strained comparison, as you can really only compare whole products and there is no independent Gracemont product (yet, at least). Introducing L3, though, just opens a can of worms that makes it even more difficult to compare. Stopping at L2 and using Cinebench (which doesn't care much about L3 cache or memory bandwidth) is probably the closest you will get if you want a way to compare that doesn't require a complex and in-depth analysis.

nicalandia · Mar 11, 2022

Hitman928 said:
Cezanne has 16 MiB of L3 and Vermeer has 32 MiB of L3. I don't understand why 12 MiB of L3 cache is considered "full" cache for Gracemont, can you explain?

Thanks, I have updated with correct MiB option on Cezanne/Vermeer

Gracemont Cores are built into Alder Lake Ring Bus and they need to have a $ coherence, that has been set at 3MiB per block

amd6502 · Mar 11, 2022

nicalandia said:
Just updated the info with the correct data. Actually at 158 mm^2 we get 48C/48T for a total of 45,000 points in CB R23. We will see that on Sierra Forest in a few years.

That's not surprising. MT scales linearly with number of cores, and IPC uplift scales roughly with the square root of transistor count. At some point big cores are disadvantaged at perf/watt against a smaller core, and we've long surpassed that point. The sweet spot is probably an Atom or cat-core sized transistor count.

If AMD ported Jaguar/Puma to a modern process as efficient as current Epycs, with a very large number of cores on the CPU, then we could expect it to have a big advantage in multithread performance per watt. Of course that comes at the cost of much lower performance per core; that's the trade-off.
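The scaling argument above can be put into numbers. This is a minimal sketch, assuming the square-root rule holds exactly, that throughput scales perfectly with core count, and that a fixed transistor budget is split evenly across cores. Power, uncore, and memory limits are ignored, and the units are arbitrary. The function name is hypothetical.

```python
import math


def relative_throughput(budget: float, n_cores: int) -> tuple[float, float]:
    """Split a fixed transistor budget evenly across n_cores.

    Returns (per_core_perf, total_perf), with per-core performance taken
    as sqrt(transistors per core) and total throughput as
    n_cores * per_core_perf (ideal multithreaded scaling assumed).
    """
    per_core = math.sqrt(budget / n_cores)
    return per_core, n_cores * per_core


for n in (1, 4, 16):
    per_core, total = relative_throughput(16.0, n)
    print(f"{n:2d} cores: per-core {per_core:.1f}, total {total:.1f}")
```

Under these assumptions, total throughput grows as sqrt(n_cores * budget) while per-core performance shrinks as 1/sqrt(n_cores), which is the trade-off described in the post.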

nicalandia · Mar 11, 2022

amd6502 said:
If AMD ported Jaguar/Puma to a modern process as efficient as current Epycs, with a very large number of cores on the CPU, then we could expect it to have a big advantage in multithread performance per watt.

True, Locuza made a similar comment. He came to the conclusion that Jaguar/Puma were more efficient per area than ARM processors. Let me try to find that.

Markfw · Mar 11, 2022

nicalandia said:
Thanks, I have updated with correct MiB option on Cezanne/Vermeer

Gracemont Cores are built into Alder Lake Ring Bus and they need to have a $ coherence, that has been set at 3MiB per block

View attachment 58512

Based on this your area estimates will be off, as NO L3 was in that.

Hitman928 · Mar 11, 2022

nicalandia said:
Thanks, I have updated with correct MiB option on Cezanne/Vermeer

Gracemont Cores are built into Alder Lake Ring Bus and they need to have a $ coherence, that has been set at 3MiB per block

View attachment 58512

That's not how it works. It seems you are referring to how the L3 is built from 'slices' of each unit, but that has more to do with the physical design and actually splits into 2.5MiB slices. If something is stored in L3, Gracemont cores can find it, it doesn't matter where in L3 it is stored. They have access to the whole L3 cache. The difference between the E cores and P cores is that the E cores share an L2 in groups of 4 whereas the P cores have their own private L2s. So, if you are going to use Gracemont performance from a 12900k, you need to include the whole L3 area as part of the calculation. Now, how much of that L3 it is effectively using is a big unknown which is part of the reason I suggested stopping at L2 and using something that isn't very sensitive to large L3 cache or memory bandwidth.

Hitman928 · Mar 11, 2022

Quick napkin math:

Zen 3 core + L2 = 4 mm^2
Gracemont 4 cores + L2 = 8.8 mm^2

8 Zen 3 cores + L2 = 32 mm^2
8 Gracemont cores + L2 = 17.6 mm^2

8 Zen 3 cores score ~15.5K points in Cinebench r23
8 Gracemont cores score ~7.5K points in Cinebench r23 (according to Puget Systems numbers)

Zen 3 perf/area = ~484 pts/mm^2
Gracemont perf/area = ~426 pts/mm^2

Now, of course you have the L3 factor and DDR5 vs DDR4, but again, Cinebench is fairly insensitive to these things, much more so than most benchmarks. Maybe when ADL-U comes out with 2P + 8E we can start to see tests which focus more on the 8 cores and have varying degrees of L3 cache for comparison.
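The napkin math above is easy to re-run with other inputs. This is a minimal sketch using only the approximate areas and Cinebench R23 scores quoted in this post; the function name is hypothetical.

```python
def perf_per_area(score: float, area_mm2: float) -> float:
    """Benchmark points per mm^2 of core + L2 area."""
    return score / area_mm2


# Areas: 8 Zen 3 cores at ~4 mm^2 each; 2 Gracemont clusters at ~8.8 mm^2 each.
zen3_area = 8 * 4.0
gracemont_area = 2 * 8.8

zen3 = perf_per_area(15_500, zen3_area)
gracemont = perf_per_area(7_500, gracemont_area)

print(f"Zen 3:     {zen3:.0f} pts/mm^2")       # 484
print(f"Gracemont: {gracemont:.0f} pts/mm^2")  # 426
print(f"Ratio Zen 3 / Gracemont: {zen3 / gracemont:.2f}")
```

Because both the areas and the scores are rounded estimates, the gap between the two results is small enough that modest changes to either input could flip the ordering.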

Markfw · Mar 11, 2022

No matter how this turns out, I stick by this:

But saying "Unmatched multicore performance per area" is very debatable.

Yes, I doubt it.

Exist50 · Mar 11, 2022

Markfw said:
No matter how this turns out, I stick by this:

But saying "Unmatched multicore performance per area" is very debatable.

Yes, I doubt it.

It's an objective statement. There's no room for debate. I'm not sure why people would be surprised about it either.

Markfw · Mar 11, 2022

Exist50 said:
It's an objective statement. There's no room for debate. I'm not sure why people would be surprised about it either.

Post 44 proved it wrong. Did you see that? Quit being a know-it-all, you don't know.

nicalandia · Mar 11, 2022

Okay, I was able to find the information from Locuza, so no need to extrapolate. I was very close on my guesstimate.

I don't think it's fair to leave the huge L3$ on Zen3 out.

Hitman928 · Mar 11, 2022

@nicalandia If we want to make some pretty big assumptions and maybe be favorable to Gracemont, we can look at the SPEC numbers too. If we assume that the E cores only use half of the L3 cache available to them in Anandtech's tests, then we can compare pretty directly to RMB. Locuza seems to estimate that each L3 slice in ADL is the same size as a 4E core cluster (with L2). Pretty sure this isn't accurate, but it is probably close enough for this very rough academic exercise.

That would make the 'effective die size' of Gracemont with L3 to be 4E cores + 4E cores + 15 MiB of L3 cache or 8.8 mm^2 + 8.8 mm^2 + (8.8 mm^2 * 5) = 61.6 mm^2
RMB with L3 (assuming no shrink benefit from N7 -> N6 transition) is 52.7 mm^2 with 8 cores + 16 MiB of L3 cache.

Gracemont scores 29.81 and 38.07 in SPECint and SPECfp respectively.
RMB scores ~41 and ~44 in SPECint and SPECfp respectively.

Gracemont:
SPECint -> 0.484 pts/mm^2
SPECfp -> 0.618 pts/mm^2

RMB:
SPECint -> 0.778 pts/mm^2
SPECfp -> 0.835 pts/mm^2

If you used the full L3 size for Gracemont, obviously its perf/area would decrease. It bears repeating that there are some big assumptions here, and my gut feeling (not evidence-based at all) is that Zen 3 probably benefits more from a larger L3 than Gracemont does. But at the very least, you can say that Zen 3 is extremely competitive with, if not superior to, Gracemont in perf/area.

All of this of course ignores that they are on completely different, if somewhat comparable, nodes and AMD may benefit from a density advantage due to the process used. We'd have to have a transistor count for Gracemont alone to make any kind of even suggestion about this though.
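The 'effective die size' estimate above hinges on how many L3 slices are charged to the E cores. This is a minimal sketch that makes that an explicit parameter, using the scores and areas from this post. It assumes, as the post does, that one L3 slice is the same size as one 4E cluster (8.8 mm^2) and that 5 slices hold 15 MiB. Treating the full 30 MiB as 10 slices is a further assumption for illustration, and the function name is hypothetical.

```python
CLUSTER_AREA = 8.8  # mm^2, one 4E cluster with L2 (estimate from the thread)
RMB_AREA = 52.7     # mm^2, 8 cores + 16 MiB L3 (estimate from the thread)

SCORES = {
    "Gracemont": {"SPECint": 29.81, "SPECfp": 38.07},
    "RMB": {"SPECint": 41.0, "SPECfp": 44.0},
}


def gracemont_area(l3_slices: int, clusters: int = 2) -> float:
    """Effective area: E clusters plus L3 slices, each counted at 8.8 mm^2."""
    return CLUSTER_AREA * (clusters + l3_slices)


def perf_per_area(scores: dict[str, float], area: float) -> dict[str, float]:
    """Divide each score by the area and round to 3 decimals."""
    return {name: round(value / area, 3) for name, value in scores.items()}


half_l3 = perf_per_area(SCORES["Gracemont"], gracemont_area(5))
full_l3 = perf_per_area(SCORES["Gracemont"], gracemont_area(10))
rmb = perf_per_area(SCORES["RMB"], RMB_AREA)

print("Gracemont, half L3:", half_l3)
print("Gracemont, full L3:", full_l3)
print("RMB:               ", rmb)
```

With 5 slices this reproduces the figures in the post. Charging all 10 slices to the E cores lowers the Gracemont result further, as the post notes.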

Markfw · Mar 11, 2022

What part of "debatable" does not work here?
