Solved! ARM Apple High-End CPU - Intel replacement

Richie Rich

Senior member
Jul 28, 2019
470
229
76
The first rumor about an Intel replacement in Apple products has surfaced:
  • ARM based high-end CPU
  • 8 cores, no SMT
  • IPC +30% over Cortex A77
  • desktop performance (Core i7/Ryzen R7) with much lower power consumption
  • introduction with a new-generation MacBook Air in mid-2020 (MacBook Pro and iMac also under consideration)
  • massive AI accelerator

Source Coreteks:
 
  • Like
Reactions: vspalanki
Solution
What an understatement :D And it looks like it doesn't want to die. Yet.


Yes, A13 is competitive against Intel chips but the emulation tax is about 2x. So given that A13 ~= Intel, for emulated x86 programs you'd get half the speed of an equivalent x86 machine. This is one of the reasons they haven't yet switched.

Another reason is that it would prevent the use of Windows on their machines, something some say is very important.

The level of ignorance in this thread would be shocking if it weren't depressing.
Let's state some basics:

(a) History. Apple has never let backward compatibility limit what they do. They are not Intel, they are not Windows. They don't sell perpetual compatibility as a feature. Christ, the big...

naukkis

Senior member
Jun 5, 2002
701
569
136
But jumping to the conclusion that it must be Rosetta's fault, and that there is no way in hell it can be more than 80% performance efficient, is way too premature, considering what Apple wants to achieve. If they want to migrate their whole platform, they have to offer at least 90% of native performance through Rosetta's translation.

No they don't; that's probably impossible, and even at 100% efficiency it still wouldn't be anything spectacular. What they can do is make faster CPUs. Instead of the 7 nm, 7 W-limited SoC from the iPad, Macs will use a CPU built on a 5 nm high-performance process, two generations newer, with more relaxed power limits, which could easily be twice as fast. With a CPU twice as fast as the A12Z, emulated x86 speed will be at least as fast as Intel's offerings, probably even faster. And that's emulated performance; native ARM binaries will be way faster than anything Intel can offer.
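
A back-of-the-envelope sketch of that argument (a rough illustration only; the A12Z-roughly-equals-Intel baseline, the ~2x emulation tax, and the ~2x faster Mac chip are all assumptions taken from this thread, not measurements):

Code:
# Back-of-the-envelope sketch; every factor is a rough assumption from the
# thread, not a measured number.
intel_native = 1.0          # baseline: current Intel Mac per-core performance
a12z_native = 1.0           # assumption: A12Z big core roughly on par with Intel
emulation_efficiency = 0.5  # assumption: ~2x "emulation tax" under translation
mac_chip_speedup = 2.0      # assumption: 5 nm, higher-power Mac chip ~2x the A12Z

emulated_x86 = a12z_native * mac_chip_speedup * emulation_efficiency
native_arm = a12z_native * mac_chip_speedup
print(f"emulated x86 vs Intel: {emulated_x86 / intel_native:.1f}x")  # ~1.0x
print(f"native ARM vs Intel:   {native_arm / intel_native:.1f}x")    # ~2.0x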
 

name99

Senior member
Sep 11, 2010
404
303
136
What makes you believe that the small cores are completely unused in Geekbench on iOS and are not contributing performance?

Wouldn't that 25% performance deficit compared to iOS come from the small cores?

I will answer: yes, they were used to improve the performance of the large cores, both in ST and in multicore. That was the whole point of Apple touting the benefits of the small cores, starting I think with the A10 chips.

Everybody jumped to the conclusion that it must be because of the emulation, based only on armchair estimates from the scores themselves, without putting the TECHNOLOGY behind it in context.

As I have said many times, Rosetta 2 is way more efficient than you guys believe, based on the performance of Shadow of the Tomb Raider, which Apple demoed on this very development kit with the A12Z.

So let me put this thought to you. What if reality is different from your beliefs, and those scores are actually legit, but only show big-core performance and IPC, excluding the smaller cores, which may actually have yielded a pretty decent performance boost on iOS, both ST and multicore? Apple touted many, many times that the benefit of their big.LITTLE implementation is that those cores can work simultaneously.

In the benchmarks on macOS Big Sur, the small cores are not working.

Be it the architecture, or... the fact that it is a different platform than iOS?

If all of this is correct, then everything falls into place.

P.S. If any of you would stop, how can I put this... loving Apple's design teams for a second, you would see that there might be a different perspective on those benchmarks.

But jumping to the conclusion that it must be Rosetta's fault, and that there is no way in hell it can be more than 80% performance efficient, is way too premature, considering what Apple wants to achieve. If they want to migrate their whole platform, they have to offer at least 90% of native performance through Rosetta's translation.

Dude, you have been repeatedly corrected in this thread by people who know a LOT more than you. Some of us worked at Apple very close to the CPU, some of us have friends at Apple, and some of us just follow things very closely.

We don't owe you anything. We have lives. We ARE kind enough to tell you when you are wrong, and where you are wrong. If you want to ignore this and demand that we sink to your level, go right ahead. But don't expect us to follow you there.

So think very carefully about what your goals are.
Are your goals to UNDERSTAND technology (which, gee whiz, means listening to the people who know more than you)?
Or are your goals to be a shill for a tech company?

If you want to go into PR or Marketing, sure, head down the path of crazy claims and shutting your eyes to any evidence you don't like. But if your goal is to be an engineer, then start behaving like one!
 

Glo.

Diamond Member
Apr 25, 2015
5,657
4,409
136
No they don't; that's probably impossible, and even at 100% efficiency it still wouldn't be anything spectacular. What they can do is make faster CPUs. Instead of the 7 nm, 7 W-limited SoC from the iPad, Macs will use a CPU built on a 5 nm high-performance process, two generations newer, with more relaxed power limits, which could easily be twice as fast. With a CPU twice as fast as the A12Z, emulated x86 speed will be at least as fast as Intel's offerings, probably even faster. And that's emulated performance; native ARM binaries will be way faster than anything Intel can offer.
Edit: GB5 is a very multiplatform benchmark.

It has versions for Mac and for iOS. If the benchmark gives pretty much linear scores across different platforms, shouldn't one assume that the translation layer is not what causes the 25% performance difference? Especially since on macOS one part of the equation is missing: the small cores, which contributed performance for any workload.
Dude, you have been repeatedly corrected in this thread by people who know a LOT more than you. Some of us worked at Apple very close to the CPU, some of us have friends at Apple, and some of us just follow things very closely.

We don't owe you anything. We have lives. We ARE kind enough to tell you when you are wrong, and where you are wrong. If you want to ignore this and demand that we sink to your level, go right ahead. But don't expect us to follow you there.

So think very carefully about what your goals are.
Are your goals to UNDERSTAND technology (which, gee whiz, means listening to the people who know more than you)?
Or are your goals to be a shill for a tech company?

If you want to go into PR or Marketing, sure, head down the path of crazy claims and shutting your eyes to any evidence you don't like. But if your goal is to be an engineer, then start behaving like one!
So you haven't read my post, yet replied to it anyway. Read it, look at the big picture, then talk to me.
 
Last edited:

SarahKerrigan

Senior member
Oct 12, 2014
339
468
136
Edit: GB5 is a very multiplatform benchmark.

It has versions for Mac and for iOS. If the benchmark gives pretty much linear scores across different platforms, shouldn't one assume that the translation layer is not what causes the 25% performance difference? Especially since on macOS one part of the equation is missing: the small cores, which contributed performance for any workload.

Apply Occam's razor here.

Which is more likely?
a) that emulation efficiency is going to be nowhere near 90-95%, because in the real world 50% is actually very good and 70% is excellent, and the reason small cores aren't showing up is that they aren't exposed to the guest?
b) That Apple is getting near-perfect translation efficiency, which would be utterly unprecedented, and native performance is, for some reason, hugely lower than it is on the iPad despite the OS being almost identical and GB, like all semi-competent benchmarks, being designed in such a way that it doesn't depend on syscall performance in the critical path?

If you think it's the latter, what differences between iOS and macOS do you think are the cause? What syscalls do you expect GB to be using that take up far more execution time on macOS than on iOS? You're making the claim, so be specific.
 

blckgrffn

Diamond Member
May 1, 2003
9,110
3,028
136
www.teamjuchems.com
No they don't; that's probably impossible, and even at 100% efficiency it still wouldn't be anything spectacular. What they can do is make faster CPUs. Instead of the 7 nm, 7 W-limited SoC from the iPad, Macs will use a CPU built on a 5 nm high-performance process, two generations newer, with more relaxed power limits, which could easily be twice as fast. With a CPU twice as fast as the A12Z, emulated x86 speed will be at least as fast as Intel's offerings, probably even faster. And that's emulated performance; native ARM binaries will be way faster than anything Intel can offer.

It's OK, Intel is ready to keep winning despite what any benchmarks might say. You know - platform this, technobabble that, point to something shiny in the distance. /s Which is what you say when you are virtually required by the market to keep up with demand as a silicon juggernaut and you are leveraging cutting-edge technology from 2016 into 2020/2021 in very high volumes.

That said, who cares about current Intel performance? It's a dinosaur. Even they don't want to talk about it anymore.

If in 2017 you had said "Well, this other vendor's architecture performs better than AMD FX, look out x86!" with great conviction, it would have been pretty funny.

Back in the 28nm AMD vs 22/14nm Intel days it seemed obvious that AMD was pursuing a questionable architecture and that Intel was crushing architecturally and from a process standpoint.

Now we see it the other way around: it sure looks like AMD's architecture has all this potential, and they have a fab advantage. While the Queen of Blades and others have talked about how Graviton (not Apple) and other more hyperscale ARM implementations are faring on the high-margin server side, that still seems fairly tangential and really unlikely, to me, to be relevant to Apple in the immediate future. (Google tells me Apple is smart and uses AWS & Azure for its cloud computing infrastructure.)

If this thread is to be believed, there is even more excitement about what Apple can do... for their walled garden of PC users. This is a very small pool of users. I am underwhelmed in that I don't see how this has such a huge impact on the x86 ecosystem. I say this as a household that is iPhone/iPad rich.

Given that AMD, Apple, and other ARM integrators can buy their way into the same silicon, I'd expect a convergence of performance levels, with each side boasting some advantage in specialized scenarios. Intel (and maybe Samsung?) being able to develop their own chips on their own silicon probably deserves some long-term consideration. It seems a reasonable conclusion that by 2030, given the amazing capex required to keep silicon technology moving, there might be even less variation on the silicon side.

All that said, I really enjoy that these threads continue to get updates from people who are both very intelligent and well placed, or (non-exclusively) entrenched in a certain viewpoint that cannot be swayed. By all means, please continue. It bums me out when there are no threads with updates when I refresh the forum page. :D Thank you all!
 

Glo.

Diamond Member
Apr 25, 2015
5,657
4,409
136
Apply Occam's razor here.

Which is more likely?
a) that emulation efficiency is going to be nowhere near 90-95%, because in the real world 50% is actually very good and 70% is excellent, and the reason small cores aren't showing up is that they aren't exposed to the guest?
b) That Apple is getting near-perfect translation efficiency, which would be utterly unprecedented, and native performance is, for some reason, hugely lower than it is on the iPad despite the OS being almost identical and GB, like all semi-competent benchmarks, being designed in such a way that it doesn't depend on syscall performance in the critical path?

If you think it's the latter, what differences between iOS and macOS do you think are the cause? What syscalls do you expect GB to be using that take up far more execution time on macOS than on iOS? You're making the claim, so be specific.
I don't know, I only asked the question.

How about touching the topic from this post: https://forums.anandtech.com/thread...tel-replacement.2571738/page-49#post-40210997
 

SarahKerrigan

Senior member
Oct 12, 2014
339
468
136
Given that AMD, Apple, and other ARM integrators can buy their way into the same silicon, I'd expect a convergence of performance levels, with each side boasting some advantage in specialized scenarios. Intel (and maybe Samsung?) being able to develop their own chips on their own silicon probably deserves some long-term consideration. It seems a reasonable conclusion that by 2030, given the amazing capex required to keep silicon technology moving, there might be even less variation on the silicon side.

Not sure what you mean by this - Apple's microarchitecture is proprietary; nobody else can buy into it. Same goes for several of the server microarchitectures.
 
Last edited:

SarahKerrigan

Senior member
Oct 12, 2014
339
468
136
I don't know, I only asked the question.

How about touching the topic from this post: https://forums.anandtech.com/thread...tel-replacement.2571738/page-49#post-40210997

Small cores don't affect big-core ST perf on iOS (or anywhere else - Android, NT, whatever). Period. How on earth do you imagine that would even work? With parallelism, there's no free lunch: you want higher ST, you build a faster core; you can't just magically split a single thread across multiple cores. There's really not another option.
 
Last edited:
  • Like
Reactions: Tlh97 and Etain05

blckgrffn

Diamond Member
May 1, 2003
9,110
3,028
136
www.teamjuchems.com
Not sure what you mean by this - Apple's microarchitecture is proprietary; nobody else can buy into it. Same goes for several of the server microarchitectures.

Sorry, I mean that if they (AMD, Apple, Amazon, etc.) can all buy into the same *node* for producing their CPUs, and all have dedicated, intelligent design teams, then any advantages are likely to be application/implementation-specific in nature.

In your own Graviton vs Epyc example, specific integer-based workloads may be better on one platform due to faster/more efficient hardware for that purpose, while other types of workloads are hindered by a relative lack of L3 cache. Some product manager and architect presumably made that trade-off on purpose.
 

Cardyak

Member
Sep 12, 2018
72
159
106
Just to drop something here in this thread: the Cortex X1 does not have 30% higher IPC than the A77, it has 30% higher single-thread performance at ISO-frequency.

But the ISO-frequency of the Cortex X1 is 3 GHz (5 nm), while the A77's is 2.6 GHz (7 nm).

So the frequency is about 15% higher for the Cortex X1; scaling this down to work out the true IPC improvement gives:

1.3/1.15 = 1.13 = 13%

Source for all of this is WikiChip
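
For what it's worth, a quick sketch of the scaling arithmetic exactly as this post states it; the clock figures are the post's own assumptions, and the premise is disputed a couple of posts below (ARM's published comparison was iso-frequency):

Code:
# IPC-scaling arithmetic as stated in the post above (premise contested below):
# take the claimed +30% single-thread gain and divide out the assumed clock gap.
claimed_perf_uplift = 1.30   # claimed X1 vs A77 single-thread gain
x1_clock_ghz = 3.0           # clock the post assumes for the Cortex X1
a77_clock_ghz = 2.6          # clock the post assumes for the Cortex A77

clock_ratio = x1_clock_ghz / a77_clock_ghz              # ~1.15
implied_ipc_uplift = claimed_perf_uplift / clock_ratio
print(f"clock ratio: {clock_ratio:.2f}")                # 1.15
print(f"implied IPC uplift: {implied_ipc_uplift:.2f}")  # ~1.13, i.e. ~13%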

[Attached image: comparison table (from WikiChip)]
 

SarahKerrigan

Senior member
Oct 12, 2014
339
468
136
Just to drop something here in this thread: the Cortex X1 does not have 30% higher IPC than the A77, it has 30% higher single-thread performance at ISO-frequency.

But the ISO-frequency of the Cortex X1 is 3 GHz (5 nm), while the A77's is 2.6 GHz (7 nm).

So the frequency is about 15% higher for the Cortex X1; scaling this down to work out the true IPC improvement gives:

1.3/1.15 = 1.13 = 13%

Source for all of this is WikiChip

View attachment 24609

I don't think you know what "iso-frequency" means. And whoever put that table together clearly managed to massively garble what ARM actually said.

The X1-vs-A77 comparisons were iso-process and iso-frequency - that means everything in question was on the same process and running at 3GHz.

https://images.anandtech.com/doci/15813/A78-X1-crop-12.png is not terribly ambiguous.
 
Last edited:
  • Like
Reactions: Tlh97

Doug S

Platinum Member
Feb 8, 2020
2,201
3,405
136
Sorry, I mean that if they (AMD, Apple, Amazon, etc.) can all buy into the same *node* for producing their CPUs, and all have dedicated, intelligent design teams, then any advantages are likely to be application/implementation-specific in nature.

Sure AMD and Amazon can buy their way into the same silicon as Apple - in terms of getting the same TSMC N5 process that Apple will be using this fall for its various A14 derivatives going into phones, tablets and Macs.

That's been true for a long time, though. Intel is the only one left standing that owns its own fabs and can be at an advantage (historically at least one process generation ahead of everyone else, until a few years ago) or at a disadvantage (their current situation, which they are unaccustomed to) versus foundry processes.

And I'm really not sure why you are whining about people talking about Apple going to ARM here and saying it is irrelevant to you because they have a small market share in a walled garden (the walled garden is iOS, not macOS, BTW). This thread is explicitly about Apple going ARM... if you want to talk about AMD there are plenty of threads for that.

But recognize that the reason AMD has access to a process better than Intel's has a lot to do with Apple choosing TSMC for foundry services and the tens of billions of investment that has resulted from that. Had Apple made a deal with Intel for foundry services as was rumored now and again over the past decade, AMD would likely find themselves in a very different place.
 
  • Like
Reactions: Etain05

Etain05

Junior Member
Oct 6, 2018
11
22
81
What makes you believe that the small cores are completely unused in Geekbench on iOS and are not contributing performance?

Wouldn't that 25% performance deficit compared to iOS come from the small cores?

I will answer: yes, they were used to improve the performance of the large cores, both in ST and in multicore. That was the whole point of Apple touting the benefits of the small cores, starting I think with the A10 chips.

Everybody jumped to the conclusion that it must be because of the emulation, based only on armchair estimates from the scores themselves, without putting the TECHNOLOGY behind it in context.

As I have said many times, Rosetta 2 is way more efficient than you guys believe, based on the performance of Shadow of the Tomb Raider, which Apple demoed on this very development kit with the A12Z.

So let me put this thought to you. What if reality is different from your beliefs, and those scores are actually legit, but only show big-core performance and IPC, excluding the smaller cores, which may actually have yielded a pretty decent performance boost on iOS, both ST and multicore? Apple touted many, many times that the benefit of their big.LITTLE implementation is that those cores can work simultaneously.

In the benchmarks on macOS Big Sur, the small cores are not working.

Be it the architecture, or... the fact that it is a different platform than iOS?

If all of this is correct, then everything falls into place.

P.S. If any of you would stop, how can I put this... loving Apple's design teams for a second, you would see that there might be a different perspective on those benchmarks.

But jumping to the conclusion that it must be Rosetta's fault, and that there is no way in hell it can be more than 80% performance efficient, is way too premature, considering what Apple wants to achieve. If they want to migrate their whole platform, they have to offer at least 90% of native performance through Rosetta's translation.

This is one of the most absurd things I have ever read. How on Earth could the little cores help in the single-core performance, on any imaginable platform? It’s right there in the name: single-core. The little cores are irrelevant in single-core performance, because, by definition, single-core tasks test a single core.

Once we get past the absurdity of that, let's discuss the rest with the data @Gideon already provided in this thread:

A12Z iPad Pro/ iOS : 1115 ST (one single BIG core) / 4670 MT (4x small cores + 4x BIG cores)
A12Z DTK/macOS : 844 ST (one single BIG core) / 2943 MT (only 4x BIG cores)

ST: 844 / 1115 * 100 = 75.69%
MT: 2943 / 4670 * 100 = 63%

You can clearly see that the ST and MT ratios are different. The obvious difference between ST and MT that can cause the discrepancy is that the little cores are used in MT on the iPad but not on the DTK; that is why the two ratios are not both around 75%. The ST test, on the other hand, is exactly the same on both devices: the little cores make no difference there because they aren't used on either device in ST. So comparing ST is extremely straightforward, as others have already told you in this thread.

So unless you want to continue to pretend that in ST the very same chip with the very same clock speed provides 25% lower performance simply because macOS is somehow less optimised and iOS has software magic beans, the obvious answer is that the 25% lower performance is caused by Rosetta 2. And that would be an incredible success. Only a 25% penalty for translating x86 software would be amazing.
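
A minimal sketch of those two ratios, using only the scores already quoted above:

Code:
# Geekbench 5 ratios: A12Z iPad Pro running natively on iOS vs. the A12Z DTK
# running the x86 build of GB5 under Rosetta 2.
ipad_st, ipad_mt = 1115, 4670   # native: 1 big core (ST), 4 big + 4 small (MT)
dtk_st, dtk_mt = 844, 2943      # Rosetta 2: only the 4 big cores exposed

print(f"ST ratio: {dtk_st / ipad_st:.1%}")  # ~75.7% -> ~25% translation penalty
print(f"MT ratio: {dtk_mt / ipad_mt:.1%}")  # ~63.0% -> small cores missing too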

As for your other point that you like to mention regarding Shadow of the Tomb Raider's performance on the DTK, you fail to realise that Shadow of the Tomb Raider is an x86 game that on macOS uses Metal. Apple was very explicit in telling us that under Rosetta 2, Metal calls are made directly to the GPU with very little overhead, providing extremely good performance. So that is why the test was very impressive, certainly not because Rosetta 2 is so fantastically good (impossibly so) that the performance penalty is less than 10%. And that's without mentioning that Shadow of the Tomb Raider is primarily a GPU test to begin with, not a CPU one.

It's actually laughable the mental jiujitsu or gymnastics you are trying to do just to hold on to the notion that somehow Apple's chip design is inferior to Intel or AMD's.
 

beginner99

Diamond Member
Jun 2, 2009
5,208
1,580
136
It's actually laughable the mental jiujitsu or gymnastics you are trying to do just to hold on to the notion that somehow Apple's chip design is inferior to Intel or AMD's.

I think the difference is that Apple makes CPUs for client devices for the average user. This means wide cores with very fast ST performance, for web browsing for example. Now they will scale them up a bit (mostly MT, I would assume) to use them also in laptops and desktops.

Intel and AMD make server CPUs optimized for server and workstation usage and scale the design down so they can be put into desktops and laptops. In fact, in the case of Ryzen it's the exact same CPU with a slightly different (cheaper) IO die.

The issue for AMD and Intel is that servers have quite different needs than clients, so scaling a server CPU down to client doesn't work all that well (reduce cores but increase frequency). Apple doesn't cross any such boundaries; they remain strictly in the client space, where ST is king. Even for average PC use, 2 big cores coupled with 4 small cores for background stuff would be enough for most users.

Hence the debates and conflict here. Apple cores will be better for the average user. But once you move into more complex stuff like compiling, heavy FP usage, or server-type usage like databases, then Intel/AMD will show their advantages. But most people don't do that on phones, tablets or laptops.
 

defferoo

Member
Sep 28, 2015
47
45
91
I think the difference is that Apple makes CPUs for client devices for the average user. This means wide cores with very fast ST performance, for web browsing for example. Now they will scale them up a bit (mostly MT, I would assume) to use them also in laptops and desktops.

Intel and AMD make server CPUs optimized for server and workstation usage and scale the design down so they can be put into desktops and laptops. In fact, in the case of Ryzen it's the exact same CPU with a slightly different (cheaper) IO die.

The issue for AMD and Intel is that servers have quite different needs than clients, so scaling a server CPU down to client doesn't work all that well (reduce cores but increase frequency). Apple doesn't cross any such boundaries; they remain strictly in the client space, where ST is king. Even for average PC use, 2 big cores coupled with 4 small cores for background stuff would be enough for most users.

Hence the debates and conflict here. Apple cores will be better for the average user. But once you move into more complex stuff like compiling, heavy FP usage, or server-type usage like databases, then Intel/AMD will show their advantages. But most people don't do that on phones, tablets or laptops.
I don’t think this is really the case. Intel and AMD also design a single core and scale up. Why do you think Intel releases their low power chips before they release desktop chips, and Intel’s server chips lag one to two years behind desktop chips. At its core, Intel’s server chips are their desktop chips scaled up. AMDs server chips are basically 8 desktop clusters with an infinity fabric interconnect and server specific features like ECC, increased I/O bandwidth.

Obviously Apple has proven themselves in the mobile space, but have not done the same in the desktop or server space. This means until they do, it’s very easy for somebody to say that they wouldn’t be able to compete in those hypothetical markets (or that mobile and desktop aren’t comparable even if they’re running the same workload because reasons).

This is why their macOS on Apple silicon transition is so interesting. ARM is finally going to have high performance CPUs in the laptop/desktop and workstation/server market. We finally get to find out if all those naysayers were right or wrong, and my bet is that they’re going to be eating so much crow.
 

Gideon

Golden Member
Nov 27, 2007
1,608
3,570
136
This is one of the most absurd things I have ever read. How on Earth could the little cores help in the single-core performance, on any imaginable platform? It’s right there in the name: single-core. The little cores are irrelevant in single-core performance, because, by definition, single-core tasks test a single core.

Thanks, I was just writing a very similar reply about the absurdity of this claim before I decided it wasn't worth it and went to bed. So essentially Apple implemented the infamous "reverse-hyperthreading"? :D

@Glo. if you had ever written and compiled any single- or multithreaded code at all, you'd realize the absurdity of this. This is an extraordinary claim that requires extraordinary evidence; you can't just throw it around when reality doesn't fit your world-view. If this were possible it would be the biggest breakthrough (... ever?) in hardware/software design, a holy grail if you will.

An analogy:
Two guys are arguing about two announced phones with the same SoC, screen, battery and similar dimensions, while one is 50g lighter. One guy claims the obvious: phone "A" must have a lighter chassis and probably a more elegant design. The other guy goes: "no-no-no, this can't be! That phone is actually much heavier, it's just that the OEM implemented an anti-gravity device inside the chassis that cheats during weighing, just wait and you'll see!".


My take as a software developer
(one who has limited experience with low-level languages but has actually, you know, written, compiled and run compute-limited multithreaded C++ code):
  1. Expecting 75% efficiency from Rosetta seems really good, but totally within reasonable bounds.
  2. Explaining the difference by syscall performance in a widely used cross-platform benchmark (which would have to be, for no apparent reason, entirely different on macOS than on iOS on the same SoC) is highly unlikely and doesn't line up with any other evidence on the subject.
  3. Expecting some kind of "reverse-hyperthreading" at play (e.g. two cores running a single-core task) is comically ridiculous

FFS, people have run their own compiled code on their own A12 phone and Kaby Lake Mac and gotten similar results; you can't really cheat in those.

TL;DR

I get why you're grasping at straws. I'm in the same boat as you in the sense that I also have no love for Apple. I'm actually quite sad that AMD, Intel and other ARM vendors haven't been able to design client-focused SoCs with similar IPC, as then all the vendors could reap the benefits. But I won't start denying reality when it doesn't fit my world-view.
 
Last edited:

Richie Rich

Senior member
Jul 28, 2019
470
229
76
Just to drop something here in this thread: the Cortex X1 does not have 30% higher IPC than the A77, it has 30% higher single-thread performance at ISO-frequency.

But the ISO-frequency of the Cortex X1 is 3 GHz (5 nm), while the A77's is 2.6 GHz (7 nm).

So the frequency is about 15% higher for the Cortex X1; scaling this down to work out the true IPC improvement gives:

1.3/1.15 = 1.13 = 13%

Source for all of this is WikiChip

View attachment 24609
You are wrong, because ARM has made huge IPC gains like this repeatedly in the past (the ratios are recomputed in the sketch after this list):
  • Cortex A75 ............ PPC 180 pts/GHz
  • Cortex A76 ............ PPC 253 pts/GHz ..... 253/180=1.405........ 40% IPC uplift to last gen
  • Cortex A77 ............ PPC 286 pts/GHz ..... 286/253=1.130........ 13% IPC uplift to last gen
  • Cortex X1 .............. PPC 371 pts/GHz ..... 371/286=1.3............. 30% IPC uplift to last gen
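
A small sketch recomputing those uplift ratios from the PPC figures as listed (the figures themselves are this post's own, taken at face value):

Code:
# Generation-over-generation uplift implied by the PPC (points/GHz) list above.
ppc = {"Cortex A75": 180, "Cortex A76": 253, "Cortex A77": 286, "Cortex X1": 371}

names = list(ppc)
for prev, cur in zip(names, names[1:]):
    uplift = ppc[cur] / ppc[prev] - 1.0
    print(f"{prev} -> {cur}: {uplift:+.0%}")
# Cortex A75 -> Cortex A76: +41%
# Cortex A76 -> Cortex A77: +13%
# Cortex A77 -> Cortex X1:  +30%
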
The huge A75 -> A76 40% IPC jump came from going from a 2xALU+1xBranch core to a completely new, wider 3xALU+1xBranch design, which is comparable with x86 4xALU designs (2xBranch shared on those ALU ports).

  • The A77 was 4xALU+2xBranch; a lot of complex ops were moved to the second ALU, doubling throughput in many ways.
  • The A77 added new 2x store ports, which boosted FPU operations by 35% according to SPECfp.
  • The A78/X1 increased ALU functionality while keeping the same 4xALU+2xBranch scheme.

And Matterhorn will be a new core line-up, I guess an answer to Apple's 6xALU A11 Monsoon design released in 2017. If a core design takes 4 years, that's 2017 + 4 = 2021. Cortex X1 is just the beginning. I expect a Matterhorn-based Cortex X2 to reach 80% of Apple's IPC/PPC, somewhere between the A12 and A13 (+70% higher IPC than Zen 2). However, with a higher clock speed than the A12 it will reach and beat today's desktop Ryzens/9900K/Tiger Lakes.

ARM could also prepare a Cortex X2 version for supercomputers with boosted FPU/SIMD to demonstrate the power of SVE2's 2048-bit capability. Such a Cortex FX2 with 4x1024-bit FPUs would be a kind of Freddy Krueger for the x86 dream, and even for Nvidia GPUs in supercomputers too.
 
Last edited:

soresu

Platinum Member
Dec 19, 2014
2,616
1,812
136
you want higher ST, you build a faster core; you can't just magically split a single thread across multiple cores. There's really not another option here.
Weelllll, yes and no.

As you say there is no free lunch, though there is a very difficult but possible way to turn one thread into several dynamically on the fly.

Popularised to near meme proportions years ago was "Reverse hyper threading", a slang term for the Speculative Multi Threading technique.

From what I remember, transactional memory was supposed to be a key feature for getting it to work, but after a big hoohah about their Mitosis SpecMT compiler there was basically nothing more from Intel about SpecMT.
Expecting some kind of "reverse-hyperthreading" at play (e.g. two cores running a single-core task) is comically ridiculous
See my answer to SarahKerrigan above.

I do believe that the Soft Machines VISC core "Global Frontend" was supposed to be a glorified SpecMT engine, but it could just as easily have been snake oil as it has never seen the light of day in silicon thus far.
Instead of the 7 nm, 7 W-limited SoC from the iPad, Macs will use a CPU built on a 5 nm high-performance process, two generations newer, with more relaxed power limits, which could easily be twice as fast. With a CPU twice as fast as the A12Z, emulated x86 speed will be at least as fast as Intel's offerings, probably even faster.
Cortex X1 .............. PPC 371 pts/GHz ..... 371/286=1.3............. 30% IPC uplift to last gen
30% for scalar FP and Int yes, but could be as much as a 2x improvement in NEON performance due to the doubled SIMD units over A77.
The huge A75 -> A76 40% IPC jump came from going from a 2xALU+1xBranch core to a completely new, wider 3xALU+1xBranch design, which is comparable with x86 4xALU designs (2xBranch shared on those ALU ports).
I don't know where you got that 40% number from; I can only assume it was a "battery efficiency" improvement number.

The FP/ASIMD number is close to that 40% figure, but few things use FP on a phone and SIMD is not for every workload.

JS also improves close to 40% - but that is a mercurial benchmark IMHO, depending on both the benchmark itself and the JS engine running it.

This is from the official ARM PR on the Anandtech article from the A76 announcement:


Best to work from the integer IPC numbers as the base for improvement; they often come first in PR messaging for a reason.

Something also to note: next-gen video codecs will rely heavily on ML techniques, so the improvements in the A76, X1 and Matterhorn will tilt the next-gen battles heavily in ARM's favor.
 
  • Like
Reactions: Tlh97

soresu

Platinum Member
Dec 19, 2014
2,616
1,812
136
And Matterhorn will be a new core line-up, I guess an answer to Apple's 6xALU A11 Monsoon design released in 2017. If a core design takes 4 years, that's 2017 + 4 = 2021. Cortex X1 is just the beginning. I expect Matterhorn to reach 80% of Apple's IPC/PPC, somewhere between the A12 and A13 (+70% higher IPC than Zen 2). However, with a higher clock speed than the A12 it will reach and beat today's desktop Ryzens/9900K/Tiger Lakes.
Bear in mind we have no idea whether Matterhorn is a Cortex-A or Cortex-X design yet.

I'm inclined to think that we will at least have a new big A core somewhere between A78 and X1 performance, but much lighter on power/area than X1.

Also, AnandTech's projection puts the X1 equal to the A13 in FP, and less than 11% behind the A13 in Int. Whatever the X2 core is, it should be superior to the A13, at least if AnandTech's projections work out.
[Attached chart: AnandTech SPEC projection for the A78 and X1]
 
  • Like
Reactions: Tlh97

AkulaMD

Member
May 20, 2017
56
17
81
IMHO Samsung's Mongoose is a much better design than Zen 1, but in the much tougher mobile environment it was a big fail, while the much worse Zen 1 is celebrated as a great design in the x86 world. The clash of those two worlds will be really epic.
Which iteration of Mongoose, if I may ask?

Thank you in advance.
 

SarahKerrigan

Senior member
Oct 12, 2014
339
468
136
Weelllll, yes and no.

As you say there is no free lunch, though there is a very difficult but possible way to turn one thread into several dynamically on the fly.

Popularised to near meme proportions years ago was "Reverse hyper threading", a slang term for the Speculative Multi Threading technique.

From what I remember, transactional memory was supposed to be a key feature for getting it to work, but after a big hoohah about their Mitosis SpecMT compiler there was basically nothing more from Intel about SpecMT.

Sure. SpMT theoretically exists, and I actually considered mentioning it as an aside in my post. It's just that nobody has ever actually shipped SpMT-capable hardware - it's been repeatedly announced (Rock scout threading, Soft Machines VISC) and then quietly vanished before anyone could buy it. Even if it did exist, it would likely be either multiple hard contexts on one core, or multiple identical cores.

Regardless, it's abundantly clear that SpMT is not, in fact, the source of the A12 microarchitecture's high performance, and Gideon's anti-gravity analogy above is basically right on. Until someone implements it in a shipping product, it remains firmly in the category of hypothetical magic fairy dust.
 
Last edited:

beginner99

Diamond Member
Jun 2, 2009
5,208
1,580
136
Why do you think Intel releases their low-power chips before they release desktop chips, and Intel's server chips lag one to two years behind the desktop chips?

They release the mobile chips before desktop because efficiency matters more for mobile, and mobile chips usually have fewer cores, so yields are higher. This is especially obvious with the 10nm fiasco.

Servers would also profit from efficiency, but those dies are much larger and yields usually aren't ideal early in a new process. On top of that, server CPUs simply need more validation than client CPUs.

The situation at Intel is special anyway due to their 10nm fiasco. In the case of AMD, Epyc was released shortly after Ryzen, with no 2-year delay (well, it's the same chiplets with different IO, so no surprise there).

My point still stands, especially with AMD: the desktop chips simply are reused server chips. Only Renoir is a custom mobile job, but the CPU cores are still the same (with less cache).

Now, this is speculation, but if we look at how Graviton2 performs in certain benchmarks (see Phoronix), one can really see the difference between a client-first and a server-first architecture (Graviton2 has extremely slow Linux kernel compile times, and the database benches are also very slow). So stuff that relies on branch prediction and large caches, in general complex stuff, is where Intel/AMD shine. Maybe Apple too, but that remains to be seen.
(I mention compiling because many devs use MacBooks, and if that takes a large hit, they will probably move away from Apple.)
 
  • Like
Reactions: Carfax83 and Tlh97