Solved! ARM Apple High-End CPU - Intel replacement

Richie Rich

Senior member
Jul 28, 2019
470
229
76
The first rumor about an Intel replacement in Apple products has surfaced:
  • ARM based high-end CPU
  • 8 cores, no SMT
  • IPC +30% over Cortex A77
  • desktop performance (Core i7/Ryzen R7) with much lower power consumption
  • introduction with a new-generation MacBook Air in mid-2020 (MacBook Pro and iMac also under consideration)
  • massive AI accelerator

Source Coreteks:
 
  • Like
Reactions: vspalanki
Solution
What an understatement :D And it looks like it doesn't want to die. Yet.


Yes, A13 is competitive against Intel chips but the emulation tax is about 2x. So given that A13 ~= Intel, for emulated x86 programs you'd get half the speed of an equivalent x86 machine. This is one of the reasons they haven't yet switched.

Another reason is that it would prevent the use of Windows on their machines, something some say is very important.

The level of ignorance in this thread would be shocking if it weren't depressing.
Let's state some basics:

(a) History. Apple has never let backward compatibility limit what they do. They are not Intel, they are not Windows. They don't sell perpetual compatibility as a feature. Christ, the big...

naukkis

Senior member
Jun 5, 2002
701
569
136
But jumping to the conclusion that it must be Rosetta's fault, and that there is no way in hell it can be more than 80% performance efficient, is way too premature, considering what Apple wants to achieve. If they want to migrate their whole platform, they have to offer at least 90% of native performance through Rosetta's translation.

No they don't; that's probably impossible, and even at 100% efficiency it still wouldn't be anything spectacular. What they can do is make faster CPUs. Instead of the 7 nm, 7 W-limited SoC from the iPad, Macs will use a CPU built on a 5 nm high-performance process, two generations newer, with more relaxed power limits, which could easily be twice as fast. With a CPU twice as fast as the A12Z, emulated x86 speed will be at least as fast as Intel's offerings, probably even faster. And that's emulated performance; native ARM binaries will be way faster than anything Intel can offer.
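
A back-of-the-envelope sketch of that argument (a rough illustration only; the A12Z-roughly-equals-Intel baseline, the ~2x emulation tax, and the ~2x faster Mac chip are all assumptions taken from this thread, not measurements):

Code:
# Back-of-the-envelope sketch; every factor is a rough assumption from the
# thread, not a measured number.
intel_native = 1.0          # baseline: current Intel Mac per-core performance
a12z_native = 1.0           # assumption: A12Z big core roughly on par with Intel
emulation_efficiency = 0.5  # assumption: ~2x "emulation tax" under translation
mac_chip_speedup = 2.0      # assumption: 5 nm, higher-power Mac chip ~2x the A12Z

emulated_x86 = a12z_native * mac_chip_speedup * emulation_efficiency
native_arm = a12z_native * mac_chip_speedup
print(f"emulated x86 vs Intel: {emulated_x86 / intel_native:.1f}x")  # ~1.0x
print(f"native ARM vs Intel:   {native_arm / intel_native:.1f}x")    # ~2.0x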
 

name99

Senior member
Sep 11, 2010
404
303
136
What makes you believe that the small cores are completely unused in Geekbench on iOS and are not contributing performance?

Wouldn't that 25% performance deficit compared to iOS come from the small cores?

I will answer: yes, they were used to improve the performance of the large cores, both in ST and in multicore. That was the whole point of Apple touting the benefits of the small cores, starting I think with the A10 chips.

Everybody jumped to the conclusion that it must be because of the emulation, based only on armchair estimates from the scores themselves, without putting the TECHNOLOGY behind it in context.

As I have said many times, Rosetta 2 is way more efficient than you guys believe, based on the performance of Shadow of the Tomb Raider, which Apple demoed on this very development kit with the A12Z.

So let me put this thought to you. What if reality is different from your beliefs, and those scores are actually legit, but only show big-core performance and IPC, excluding the smaller cores, which may actually have yielded a pretty decent performance boost on iOS, both ST and multicore? Apple touted many, many times that the benefit of their big.LITTLE implementation is that those cores can work simultaneously.

In the benchmarks on macOS Big Sur, the small cores are not working.

Be it the architecture, or... the fact that it is a different platform than iOS?

If all of this is correct, then everything falls into place.

P.S. If any of you would stop, how can I put this... loving Apple's design teams for a second, you would see that there might be a different perspective on those benchmarks.

But jumping to the conclusion that it must be Rosetta's fault, and that there is no way in hell it can be more than 80% performance efficient, is way too premature, considering what Apple wants to achieve. If they want to migrate their whole platform, they have to offer at least 90% of native performance through Rosetta's translation.

Dude, you have been repeatedly corrected in this thread by people who know a LOT more than you. Some of us worked at Apple very close to the CPU, some of us have friends at Apple, and some of us just follow things very closely.

We don't owe you anything. We have lives. We ARE kind enough to tell you when you are wrong, and where you are wrong. If you want to ignore this and demand that we sink to your level, go right ahead. But don't expect us to follow you there.

So think very carefully about what your goals are.
Are your goals to UNDERSTAND technology (which, gee whiz, means listening to the people who know more than you)?
Or are your goals to be a shill for a tech company?

If you want to go into PR or Marketing, sure, head down the path of crazy claims and shutting your eyes to any evidence you don't like. But if your goal is to be an engineer, then start behaving like one!
 

Glo.

Diamond Member
Apr 25, 2015
5,657
4,409
136
No they don't; that's probably impossible, and even at 100% efficiency it still wouldn't be anything spectacular. What they can do is make faster CPUs. Instead of the 7 nm, 7 W-limited SoC from the iPad, Macs will use a CPU built on a 5 nm high-performance process, two generations newer, with more relaxed power limits, which could easily be twice as fast. With a CPU twice as fast as the A12Z, emulated x86 speed will be at least as fast as Intel's offerings, probably even faster. And that's emulated performance; native ARM binaries will be way faster than anything Intel can offer.
Edit: GB5 is a very multiplatform benchmark.

It has versions for Mac and for iOS. If the benchmark gives pretty much linear scores across different platforms, shouldn't one assume that the translation layer is not what causes the 25% performance difference? Especially since on macOS one part of the equation is missing: the small cores, which contributed performance for any workload.
Dude, you have been repeatedly corrected in this thread by people who know a LOT more than you. Some of us worked at Apple very close to the CPU, some of us have friends at Apple, and some of us just follow things very closely.

We don't owe you anything. We have lives. We ARE kind enough to tell you when you are wrong, and where you are wrong. If you want to ignore this and demand that we sink to your level, go right ahead. But don't expect us to follow you there.

So think very carefully about what your goals are.
Are your goals to UNDERSTAND technology (which, gee whiz, means listening to the people who know more than you)?
Or are your goals to be a shill for a tech company?

If you want to go into PR or Marketing, sure, head down the path of crazy claims and shutting your eyes to any evidence you don't like. But if your goal is to be an engineer, then start behaving like one!
So you haven't read my post, yet replied to it anyway. Read it, look at the big picture, then talk to me.
 
Last edited:

SarahKerrigan

Senior member
Oct 12, 2014
339
468
136
Edit: GB5 is a very multiplatform benchmark.

It has versions for Mac and for iOS. If the benchmark gives pretty much linear scores across different platforms, shouldn't one assume that the translation layer is not what causes the 25% performance difference? Especially since on macOS one part of the equation is missing: the small cores, which contributed performance for any workload.

Apply Occam's razor here.

Which is more likely?
a) that emulation efficiency is going to be nowhere near 90-95%, because in the real world 50% is actually very good and 70% is excellent, and the reason small cores aren't showing up is that they aren't exposed to the guest?
b) That Apple is getting near-perfect translation efficiency, which would be utterly unprecedented, and native performance is, for some reason, hugely lower than it is on the iPad despite the OS being almost identical and GB, like all semi-competent benchmarks, being designed in such a way that it doesn't depend on syscall performance in the critical path?

If you think it's the latter, what differences between iOS and macOS do you think are the cause? What syscalls do you expect GB to be using that take up far more execution time on macOS than on iOS? You're making the claim, so be specific.
 

blckgrffn

Diamond Member
May 1, 2003
9,110
3,028
136
www.teamjuchems.com
No they don't; that's probably impossible, and even at 100% efficiency it still wouldn't be anything spectacular. What they can do is make faster CPUs. Instead of the 7 nm, 7 W-limited SoC from the iPad, Macs will use a CPU built on a 5 nm high-performance process, two generations newer, with more relaxed power limits, which could easily be twice as fast. With a CPU twice as fast as the A12Z, emulated x86 speed will be at least as fast as Intel's offerings, probably even faster. And that's emulated performance; native ARM binaries will be way faster than anything Intel can offer.

It's OK, Intel is ready to keep winning despite what any benchmarks might say. You know - platform this, technobabble that, point to something shiny in the distance. /s Which is what you say when you are virtually required by the market to keep up with demand as a silicon juggernaut and you are leveraging cutting-edge technology from 2016 into 2020/2021 in very high volumes.

That said, who cares about current Intel performance? It's a dinosaur. Even they don't want to talk about it anymore.

If in 2017 you had said "Well, this other vendor's architecture performs better than AMD FX, look out x86!" with great conviction, it would have been pretty funny.

Back in the 28nm AMD vs 22/14nm Intel days it seemed obvious that AMD was pursuing a questionable architecture and that Intel was crushing architecturally and from a process standpoint.

Now we see it the other way around: it sure looks like AMD's architecture has all this potential, and they have a fab advantage. While the Queen of Blades and others have talked about how Graviton (not Apple) and other more hyperscale ARM implementations are faring on the high-margin server side, that still seems fairly tangential and really unlikely, to me, to be relevant to Apple in the immediate future. (Google tells me Apple is smart and uses AWS & Azure for its cloud computing infrastructure.)

If this thread is to be believed, there is even more excitement about what Apple can do... for their walled garden of PC users. This is a very small pool of users. I am underwhelmed in that I don't see how this has such a huge impact on the x86 ecosystem. I say this as a household that is iPhone/iPad rich.

Given that AMD, Apple, and other ARM integrators can buy their way into the same silicon, I'd expect a convergence of performance levels, with each side boasting some advantage in specialized scenarios. Intel (and maybe Samsung?) being able to develop their own chips on their own silicon probably deserves some long-term consideration. It seems a reasonable conclusion that by 2030, given the amazing capex required to keep silicon technology moving, there might be even less variation on the silicon side.

All that said, I really enjoy that these threads continue to get updates from people who are both very intelligent and well placed, or (non-exclusively) entrenched in a certain viewpoint that cannot be swayed. By all means, please continue. It bums me out when there are no threads with updates when I refresh the forum page. :D Thank you all!
 

Glo.

Diamond Member
Apr 25, 2015
5,657
4,409
136
Apply Occam's razor here.

Which is more likely?
a) that emulation efficiency is going to be nowhere near 90-95%, because in the real world 50% is actually very good and 70% is excellent, and the reason small cores aren't showing up is that they aren't exposed to the guest?
b) That Apple is getting near-perfect translation efficiency, which would be utterly unprecedented, and native performance is, for some reason, hugely lower than it is on the iPad despite the OS being almost identical and GB, like all semi-competent benchmarks, being designed in such a way that it doesn't depend on syscall performance in the critical path?

If you think it's the latter, what differences between iOS and macOS do you think are the cause? What syscalls do you expect GB to be using that take up far more execution time on macOS than on iOS? You're making the claim, so be specific.
I don't know, I only asked the question.

How about touching the topic from this post: https://forums.anandtech.com/thread...tel-replacement.2571738/page-49#post-40210997
 

SarahKerrigan

Senior member
Oct 12, 2014
339
468
136
Given that AMD, Apple, and other ARM integrators can buy their way into the same silicon, I'd expect a convergence of performance levels, with each side boasting some advantage in specialized scenarios. Intel (and maybe Samsung?) being able to develop their own chips on their own silicon probably deserves some long-term consideration. It seems a reasonable conclusion that by 2030, given the amazing capex required to keep silicon technology moving, there might be even less variation on the silicon side.

Not sure what you mean by this - Apple's microarchitecture is proprietary; nobody else can buy into it. Same goes for several of the server microarchitectures.
 
Last edited:

SarahKerrigan

Senior member
Oct 12, 2014
339
468
136
I don't know, I only asked the question.

How about touching the topic from this post: https://forums.anandtech.com/thread...tel-replacement.2571738/page-49#post-40210997

Small cores don't affect big-core ST perf on iOS (or anywhere else - Android, NT, whatever). Period. How on earth do you imagine that would even work? With parallelism, there's no free lunch: you want higher ST, you build a faster core; you can't just magically split a single thread across multiple cores. There's really not another option.
 
Last edited:
  • Like
Reactions: Tlh97 and Etain05

blckgrffn

Diamond Member
May 1, 2003
9,110
3,028
136
www.teamjuchems.com
Not sure what you mean by this - Apple's microarchitecture is proprietary; nobody else can buy into it. Same goes for several of the server microarchitectures.

Sorry, I mean that if they (AMD, Apple, Amazon, etc.) can all buy into the same *node* for producing their CPUs, and all have dedicated, intelligent design teams, then any advantages are likely to be application/implementation-specific in nature.

In your own Graviton vs Epyc example, specific integer-based workloads may be better on one platform due to faster/more efficient hardware for that purpose, while other types of workloads are hindered by a relative lack of L3 cache. Some product manager and architect presumably made that trade-off on purpose.
 

Cardyak

Member
Sep 12, 2018
72
159
106
Just to drop something here in this thread: the Cortex X1 does not have 30% higher IPC than the A77, it has 30% higher single-thread performance at ISO-frequency.

But the ISO-frequency of the Cortex X1 is 3 GHz (5 nm), while the A77's is 2.6 GHz (7 nm).

So the frequency is about 15% higher for the Cortex X1; scaling this down to work out the true IPC improvement gives:

1.3/1.15 = 1.13 = 13%

Source for all of this is WikiChip
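
For what it's worth, a quick sketch of the scaling arithmetic exactly as this post states it; the clock figures are the post's own assumptions, and the premise is disputed a couple of posts below (ARM's published comparison was iso-frequency):

Code:
# IPC-scaling arithmetic as stated in the post above (premise contested below):
# take the claimed +30% single-thread gain and divide out the assumed clock gap.
claimed_perf_uplift = 1.30   # claimed X1 vs A77 single-thread gain
x1_clock_ghz = 3.0           # clock the post assumes for the Cortex X1
a77_clock_ghz = 2.6          # clock the post assumes for the Cortex A77

clock_ratio = x1_clock_ghz / a77_clock_ghz              # ~1.15
implied_ipc_uplift = claimed_perf_uplift / clock_ratio
print(f"clock ratio: {clock_ratio:.2f}")                # 1.15
print(f"implied IPC uplift: {implied_ipc_uplift:.2f}")  # ~1.13, i.e. ~13%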

[Attached image: comparison table (from WikiChip)]
 

SarahKerrigan

Senior member
Oct 12, 2014
339
468
136
Just to drop something here in this thread: the Cortex X1 does not have 30% higher IPC than the A77, it has 30% higher single-thread performance at ISO-frequency.

But the ISO-frequency of the Cortex X1 is 3 GHz (5 nm), while the A77's is 2.6 GHz (7 nm).

So the frequency is about 15% higher for the Cortex X1; scaling this down to work out the true IPC improvement gives:

1.3/1.15 = 1.13 = 13%

Source for all of this is WikiChip

View attachment 24609

I don't think you know what "iso-frequency" means. And whoever put that table together clearly managed to massively garble what ARM actually said.

The X1-vs-A77 comparisons were iso-process and iso-frequency - that means everything in question was on the same process and running at 3GHz.

https://images.anandtech.com/doci/15813/A78-X1-crop-12.png is not terribly ambiguous.
 
Last edited:
  • Like
Reactions: Tlh97

Doug S

Platinum Member
Feb 8, 2020
2,201
3,405
136
Sorry, I mean that if they (AMD, Apple, Amazon, etc.) can all buy into the same *node* for producing their CPUs, and all have dedicated, intelligent design teams, then any advantages are likely to be application/implementation-specific in nature.

Sure AMD and Amazon can buy their way into the same silicon as Apple - in terms of getting the same TSMC N5 process that Apple will be using this fall for its various A14 derivatives going into phones, tablets and Macs.

That's been true for a long time, though. Intel is the only one left standing that owns its own fabs and can be at an advantage (historically at least one process generation ahead of everyone else, until a few years ago) or at a disadvantage (their current situation, which they are unaccustomed to) versus foundry processes.

And I'm really not sure why you are whining about people talking about Apple going to ARM here and saying it is irrelevant to you because they have a small market share in a walled garden (the walled garden is iOS, not macOS, BTW). This thread is explicitly about Apple going ARM... if you want to talk about AMD there are plenty of threads for that.

But recognize that the reason AMD has access to a process better than Intel's has a lot to do with Apple choosing TSMC for foundry services and the tens of billions of investment that has resulted from that. Had Apple made a deal with Intel for foundry services as was rumored now and again over the past decade, AMD would likely find themselves in a very different place.
 
  • Like
Reactions: Etain05

Etain05

Junior Member
Oct 6, 2018
11
22
81
What makes you believe that the small cores are completely unused in Geekbench on iOS and are not contributing performance?

Wouldn't that 25% performance deficit compared to iOS come from the small cores?

I will answer: yes, they were used to improve the performance of the large cores, both in ST and in multicore. That was the whole point of Apple touting the benefits of the small cores, starting I think with the A10 chips.

Everybody jumped to the conclusion that it must be because of the emulation, based only on armchair estimates from the scores themselves, without putting the TECHNOLOGY behind it in context.

As I have said many times, Rosetta 2 is way more efficient than you guys believe, based on the performance of Shadow of the Tomb Raider, which Apple demoed on this very development kit with the A12Z.

So let me put this thought to you. What if reality is different from your beliefs, and those scores are actually legit, but only show big-core performance and IPC, excluding the smaller cores, which may actually have yielded a pretty decent performance boost on iOS, both ST and multicore? Apple touted many, many times that the benefit of their big.LITTLE implementation is that those cores can work simultaneously.

In the benchmarks on macOS Big Sur, the small cores are not working.

Be it the architecture, or... the fact that it is a different platform than iOS?

If all of this is correct, then everything falls into place.

P.S. If any of you would stop, how can I put this... loving Apple's design teams for a second, you would see that there might be a different perspective on those benchmarks.

But jumping to the conclusion that it must be Rosetta's fault, and that there is no way in hell it can be more than 80% performance efficient, is way too premature, considering what Apple wants to achieve. If they want to migrate their whole platform, they have to offer at least 90% of native performance through Rosetta's translation.

This is one of the most absurd things I have ever read. How on Earth could the little cores help in the single-core performance, on any imaginable platform? It’s right there in the name: single-core. The little cores are irrelevant in single-core performance, because, by definition, single-core tasks test a single core.

Once we get past the absurdity of that, let's discuss the rest with the data @Gideon already provided in this thread:

A12Z iPad Pro/ iOS : 1115 ST (one single BIG core) / 4670 MT (4x small cores + 4x BIG cores)
A12Z DTK/macOS : 844 ST (one single BIG core) / 2943 MT (only 4x BIG cores)

ST: 844 / 1115 * 100 = 75.69%
MT: 2943 / 4670 * 100 = 63%

You can clearly see that the ST and MT ratios are different. The obvious difference between ST and MT that can cause the discrepancy is that the little cores are used in MT on the iPad but not on the DTK; that is why the two ratios are not both around 75%. The ST test, on the other hand, is exactly the same on both devices: the little cores make no difference there because they aren't used on either device in ST. So comparing ST is extremely straightforward, as others have already told you in this thread.

So unless you want to continue to pretend that in ST the very same chip with the very same clock speed provides 25% lower performance simply because macOS is somehow less optimised and iOS has software magic beans, the obvious answer is that the 25% lower performance is caused by Rosetta 2. And that would be an incredible success. Only a 25% penalty for translating x86 software would be amazing.
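
A minimal sketch of those two ratios, using only the scores already quoted above:

Code:
# Geekbench 5 ratios: A12Z iPad Pro running natively on iOS vs. the A12Z DTK
# running the x86 build of GB5 under Rosetta 2.
ipad_st, ipad_mt = 1115, 4670   # native: 1 big core (ST), 4 big + 4 small (MT)
dtk_st, dtk_mt = 844, 2943      # Rosetta 2: only the 4 big cores exposed

print(f"ST ratio: {dtk_st / ipad_st:.1%}")  # ~75.7% -> ~25% translation penalty
print(f"MT ratio: {dtk_mt / ipad_mt:.1%}")  # ~63.0% -> small cores missing too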

As for your other point that you like to mention regarding Shadow of the Tomb Raider's performance on the DTK, you fail to realise that Shadow of the Tomb Raider is an x86 game that on macOS uses Metal. Apple was very explicit in telling us that under Rosetta 2, Metal calls are made directly to the GPU with very little overhead, providing extremely good performance. So that is why the test was very impressive, certainly not because Rosetta 2 is so fantastically good (impossibly so) that the performance penalty is less than 10%. And that's without mentioning that Shadow of the Tomb Raider is primarily a GPU test to begin with, not a CPU one.

It's actually laughable the mental jiujitsu or gymnastics you are trying to do just to hold on to the notion that somehow Apple's chip design is inferior to Intel or AMD's.
 

beginner99

Diamond Member
Jun 2, 2009
5,208
1,580
136
It's actually laughable the mental jiujitsu or gymnastics you are trying to do just to hold on to the notion that somehow Apple's chip design is inferior to Intel or AMD's.

I think the difference is that Apple makes CPUs for client devices for the average user. This means wide cores with very fast ST performance, for web browsing for example. Now they will scale them up a bit (mostly MT, I would assume) to use them also in laptops and desktops.

Intel and AMD make server CPUs optimized for server and workstation usage and scale the design down so they can be put into desktops and laptops. In fact, in the case of Ryzen it's the exact same CPU with a slightly different (cheaper) IO die.

The issue for AMD and Intel is that servers have quite different needs than clients, so scaling a server CPU down to client doesn't work all that well (reduce cores but increase frequency). Apple doesn't cross any such boundaries; they remain strictly in the client space, where ST is king. Even for average PC use, 2 big cores coupled with 4 small cores for background stuff would be enough for most users.

Hence the debates and conflict here. Apple cores will be better for the average user. But once you move into more complex stuff like compiling, heavy FP usage, or server-type usage like databases, then Intel/AMD will show their advantages. But most people don't do that on phones, tablets or laptops.
 

defferoo

Member
Sep 28, 2015
47
45
91
I think the difference is that Apple makes CPUs for client devices for the average user. This means wide cores with very fast ST performance, for web browsing for example. Now they will scale them up a bit (mostly MT, I would assume) to use them also in laptops and desktops.

Intel and AMD make server CPUs optimized for server and workstation usage and scale the design down so they can be put into desktops and laptops. In fact, in the case of Ryzen it's the exact same CPU with a slightly different (cheaper) IO die.

The issue for AMD and Intel is that servers have quite different needs than clients, so scaling a server CPU down to client doesn't work all that well (reduce cores but increase frequency). Apple doesn't cross any such boundaries; they remain strictly in the client space, where ST is king. Even for average PC use, 2 big cores coupled with 4 small cores for background stuff would be enough for most users.

Hence the debates and conflict here. Apple cores will be better for the average user. But once you move into more complex stuff like compiling, heavy FP usage, or server-type usage like databases, then Intel/AMD will show their advantages. But most people don't do that on phones, tablets or laptops.
I don’t think this is really the case. Intel and AMD also design a single core and scale up. Why do you think Intel releases their low power chips before they release desktop chips, and Intel’s server chips lag one to two years behind desktop chips. At its core, Intel’s server chips are their desktop chips scaled up. AMDs server chips are basically 8 desktop clusters with an infinity fabric interconnect and server specific features like ECC, increased I/O bandwidth.

Obviously Apple has proven themselves in the mobile space, but have not done the same in the desktop or server space. This means until they do, it’s very easy for somebody to say that they wouldn’t be able to compete in those hypothetical markets (or that mobile and desktop aren’t comparable even if they’re running the same workload because reasons).

This is why their macOS on Apple silicon transition is so interesting. ARM is finally going to have high performance CPUs in the laptop/desktop and workstation/server market. We finally get to find out if all those naysayers were right or wrong, and my bet is that they’re going to be eating so much crow.
 

Gideon

Golden Member
Nov 27, 2007
1,608
3,570
136
This is one of the most absurd things I have ever read. How on Earth could the little cores help in the single-core performance, on any imaginable platform? It’s right there in the name: single-core. The little cores are irrelevant in single-core performance, because, by definition, single-core tasks test a single core.

Thanks, I was just writing a very similar reply about the absurdity of this claim before I decided it wasn't worth it and went to bed. So essentially Apple implemented the infamous "reverse-hyperthreading"? :D

@Glo. if you had ever written and compiled any single- or multithreaded code at all, you'd realize the absurdity of this. This is an extraordinary claim that requires extraordinary evidence; you can't just throw it around when reality doesn't fit your world-view. If this were possible it would be the biggest breakthrough (... ever?) in hardware/software design, a holy grail if you will.

An analogy:
Two guys are arguing about two announced phones with the same SoC, screen, battery and similar dimensions, while one is 50g lighter. One guy claims the obvious: phone "A" must have a lighter chassis and probably a more elegant design. The other guy goes: "no-no-no, this can't be! That phone is actually much heavier, it's just that the OEM implemented an anti-gravity device inside the chassis that cheats during weighing, just wait and you'll see!".


My take as a software developer
(one who has limited experience with low-level languages but has actually, you know, written, compiled and run compute-limited multithreaded C++ code):
  1. Expecting 75% efficiency from Rosetta seems really good, but totally within reasonable bounds.
  2. Explaining the difference by syscall performance in a widely used cross-platform benchmark (which would have to be, for no apparent reason, entirely different on macOS than on iOS on the same SoC) is highly unlikely and doesn't line up with any other evidence on the subject.
  3. Expecting some kind of "reverse-hyperthreading" at play (e.g. two cores running a single-core task) is comically ridiculous

FFS, people have run their own compiled code on their own A12 phone and Kaby Lake Mac and gotten similar results; you can't really cheat in those.

TL;DR

I get why you're grasping at straws. I'm in the same boat as you in the sense that I also have no love for Apple. I'm actually quite sad that AMD, Intel and other ARM vendors haven't been able to design client-focused SoCs with similar IPC, as then all the vendors could reap the benefits. But I won't start denying reality when it doesn't fit my world-view.
 
Last edited:

Richie Rich

Senior member
Jul 28, 2019
470
229
76
Just to drop something here in this thread: the Cortex X1 does not have 30% higher IPC than the A77, it has 30% higher single-thread performance at ISO-frequency.

But the ISO-frequency of the Cortex X1 is 3 GHz (5 nm), while the A77's is 2.6 GHz (7 nm).

So the frequency is about 15% higher for the Cortex X1; scaling this down to work out the true IPC improvement gives:

1.3/1.15 = 1.13 = 13%

Source for all of this is WikiChip

View attachment 24609
You are wrong, because ARM has made huge IPC gains like this repeatedly in the past (the ratios are recomputed in the sketch after this list):
  • Cortex A75 ............ PPC 180 pts/GHz
  • Cortex A76 ............ PPC 253 pts/GHz ..... 253/180=1.405........ 40% IPC uplift to last gen
  • Cortex A77 ............ PPC 286 pts/GHz ..... 286/253=1.130........ 13% IPC uplift to last gen
  • Cortex X1 .............. PPC 371 pts/GHz ..... 371/286=1.3............. 30% IPC uplift to last gen
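
A small sketch recomputing those uplift ratios from the PPC figures as listed (the figures themselves are this post's own, taken at face value):

Code:
# Generation-over-generation uplift implied by the PPC (points/GHz) list above.
ppc = {"Cortex A75": 180, "Cortex A76": 253, "Cortex A77": 286, "Cortex X1": 371}

names = list(ppc)
for prev, cur in zip(names, names[1:]):
    uplift = ppc[cur] / ppc[prev] - 1.0
    print(f"{prev} -> {cur}: {uplift:+.0%}")
# Cortex A75 -> Cortex A76: +41%
# Cortex A76 -> Cortex A77: +13%
# Cortex A77 -> Cortex X1:  +30%
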
The huge A75 -> A76 40% IPC jump came from going from a 2xALU+1xBranch core to a completely new, wider 3xALU+1xBranch design, which is comparable with x86 4xALU designs (2xBranch shared on those ALU ports).

  • The A77 was 4xALU+2xBranch; a lot of complex ops were moved to the second ALU, doubling throughput in many ways.
  • The A77 added new 2x store ports, which boosted FPU operations by 35% according to SPECfp.
  • The A78/X1 increased ALU functionality while keeping the same 4xALU+2xBranch scheme.

And Matterhorn will be a new core line-up, I guess an answer to Apple's 6xALU A11 Monsoon design released in 2017. If a core design takes 4 years, that's 2017 + 4 = 2021. Cortex X1 is just the beginning. I expect a Matterhorn-based Cortex X2 to reach 80% of Apple's IPC/PPC, somewhere between the A12 and A13 (+70% higher IPC than Zen 2). However, with a higher clock speed than the A12 it will reach and beat today's desktop Ryzens/9900K/Tiger Lakes.

ARM could also prepare a Cortex X2 version for supercomputers with boosted FPU/SIMD to demonstrate the power of SVE2's 2048-bit capability. Such a Cortex FX2 with 4x1024-bit FPUs would be a kind of Freddy Krueger for the x86 dream, and even for Nvidia GPUs in supercomputers too.
 
Last edited:

soresu

Platinum Member
Dec 19, 2014
2,616
1,812
136
you want higher ST, you build a faster core; you can't just magically split a single thread across multiple cores. There's really not another option here.
Weelllll, yes and no.

As you say there is no free lunch, though there is a very difficult but possible way to turn one thread into several dynamically on the fly.

Popularised to near meme proportions years ago was "Reverse hyper threading", a slang term for the Speculative Multi Threading technique.

From what I remember, transactional memory was supposed to be a key feature for getting it to work, but after a big hoohah about their Mitosis SpecMT compiler there was basically nothing more from Intel about SpecMT.
Expecting some kind of "reverse-hyperthreading" at play (e.g. two cores running a single-core task) is comically ridiculous
See my answer to SarahKerrigan above.

I do believe that the Soft Machines VISC core "Global Frontend" was supposed to be a glorified SpecMT engine, but it could just as easily have been snake oil as it has never seen the light of day in silicon thus far.
Instead of the 7 nm, 7 W-limited SoC from the iPad, Macs will use a CPU built on a 5 nm high-performance process, two generations newer, with more relaxed power limits, which could easily be twice as fast. With a CPU twice as fast as the A12Z, emulated x86 speed will be at least as fast as Intel's offerings, probably even faster.
Cortex X1 .............. PPC 371 pts/GHz ..... 371/286=1.3............. 30% IPC uplift to last gen
30% for scalar FP and Int yes, but could be as much as a 2x improvement in NEON performance due to the doubled SIMD units over A77.
The huge A75 -> A76 40% IPC jump came from going from a 2xALU+1xBranch core to a completely new, wider 3xALU+1xBranch design, which is comparable with x86 4xALU designs (2xBranch shared on those ALU ports).
I don't know where you got that 40% number from; I can only assume it was a "battery efficiency" improvement number.

The FP/ASIMD number is close to that 40% figure, but few things use FP on a phone and SIMD is not for every workload.

JS also improves close to 40% - but that is a mercurial benchmark IMHO, depending on both the benchmark itself and the JS engine running it.

This is from the official ARM PR on the Anandtech article from the A76 announcement:


Best to work from the integer IPC numbers as the base for improvement; they often come first in PR messaging for a reason.

Something also to note: next-gen video codecs will rely heavily on ML techniques, so the improvements in the A76, X1 and Matterhorn will tilt the next-gen battles heavily in ARM's favor.
 
  • Like
Reactions: Tlh97

soresu

Platinum Member
Dec 19, 2014
2,616
1,812
136
And Matterhorn will be a new core line-up, I guess an answer to Apple's 6xALU A11 Monsoon design released in 2017. If a core design takes 4 years, that's 2017 + 4 = 2021. Cortex X1 is just the beginning. I expect Matterhorn to reach 80% of Apple's IPC/PPC, somewhere between the A12 and A13 (+70% higher IPC than Zen 2). However, with a higher clock speed than the A12 it will reach and beat today's desktop Ryzens/9900K/Tiger Lakes.
Bear in mind we have no idea whether Matterhorn is a Cortex-A or Cortex-X design yet.

I'm inclined to think that we will at least have a new big A core somewhere between A78 and X1 performance, but much lighter on power/area than X1.

Also, AnandTech's projection puts the X1 equal to the A13 in FP, and less than 11% behind the A13 in Int. Whatever the X2 core is, it should be superior to the A13, at least if AnandTech's projections work out.
[Attached chart: AnandTech SPEC projection for the A78 and X1]
 
  • Like
Reactions: Tlh97

AkulaMD

Member
May 20, 2017
56
17
81
IMHO Samsung's Mongoose is a much better design than Zen 1, but in the much tougher mobile environment it was a big fail, while the much worse Zen 1 is celebrated as a great design in the x86 world. The clash of those two worlds will be really epic.
Which iteration of Mongoose, if I may ask?

Thank you in advance.
 

SarahKerrigan

Senior member
Oct 12, 2014
339
468
136
Weelllll, yes and no.

As you say there is no free lunch, though there is a very difficult but possible way to turn one thread into several dynamically on the fly.

Popularised to near meme proportions years ago was "Reverse hyper threading", a slang term for the Speculative Multi Threading technique.

From what I remember, transactional memory was supposed to be a key feature for getting it to work, but after a big hoohah about their Mitosis SpecMT compiler there was basically nothing more from Intel about SpecMT.

Sure. SpMT theoretically exists, and I actually considered mentioning it as an aside in my post. It's just that nobody has ever actually shipped SpMT-capable hardware - it's been repeatedly announced (Rock scout threading, Soft Machines VISC) and then quietly vanished before anyone could buy it. Even if it did exist, it would likely be either multiple hard contexts on one core, or multiple identical cores.

Regardless, it's abundantly clear that SpMT is not, in fact, the source of the A12 microarchitecture's high performance, and Gideon's anti-gravity analogy above is basically right on. Until someone implements it in a shipping product, it remains firmly in the category of hypothetical magic fairy dust.
 
Last edited:

beginner99

Diamond Member
Jun 2, 2009
5,208
1,580
136
Why do you think Intel releases their low-power chips before they release desktop chips, and Intel's server chips lag one to two years behind the desktop chips?

They release the mobile chips before desktop because efficiency matters more for mobile, and mobile chips usually have fewer cores, so yields are higher. This is especially obvious with the 10nm fiasco.

Servers would also profit from efficiency, but those dies are much larger and yields usually aren't ideal early in a new process. On top of that, server CPUs simply need more validation than client CPUs.

The situation at Intel is special anyway due to their 10nm fiasco. In the case of AMD, Epyc was released shortly after Ryzen, with no 2-year delay (well, it's the same chiplets with different IO, so no surprise there).

My point still stands, especially with AMD: the desktop chips simply are reused server chips. Only Renoir is a custom mobile job, but the CPU cores are still the same (with less cache).

Now, this is speculation, but if we look at how Graviton2 performs in certain benchmarks (see Phoronix), one can really see the difference between a client-first and a server-first architecture (Graviton2 has extremely slow Linux kernel compile times, and the database benches are also very slow). So stuff that relies on branch prediction and large caches, in general complex stuff, is where Intel/AMD shine. Maybe Apple too, but that remains to be seen.
(I mention compiling because many devs use MacBooks, and if that takes a large hit, they will probably move away from Apple.)
 
  • Like
Reactions: Carfax83 and Tlh97