Solved! ARM Apple High-End CPU - Intel replacement

Page 48 - AnandTech Forums

SarahKerrigan

Member
Oct 12, 2014
187
173
116
A78 has either a 64 or 32 KiB 4-way L1 cache, which isn't going to help in most workloads compared to Zen 2's 32 KiB 8-way associative cache. It's germane to the subject of some supposed instruction set superiority: there is almost none.
I agree. There isn't much, although I would venture that on the flip side the costs of designing and validating an x86 microarchitecture are higher than an ARM64 microarchitecture.

My comments above were re: N1; I don't have A78 hardware to play with so I can't speak to that. But I find it generally plausible that at least X1 exceeds Zen2 and SKL at iso clock.
 

Thala

Senior member
Nov 12, 2014
961
306
136
Resource usage in this sense is die area, transistors and perhaps power. The front end of both designs can decode and issue a similar number of micro-ops. x86-64 instructions decode to more micro-ops but this isn't an issue as these instructions correspond to multiple ARMv8 instructions. On paper there is no way to expect A78 to have 40% higher IPC than both Skylake and Zen 2.
That's a much too simplified view. There are issues all over the place in the x86 front end and back end - you run out of architectural registers early, the back end has very limited freedom to schedule memory accesses, flush write buffers, or speculatively schedule reads, and broken-down x86 uops either carry a memory-reference dependence or no architectural-register dependence at all, just to name a few. Having the same amount of compute resources (e.g. ALUs) does not help much if you cannot use them as efficiently.
 

Thala

Senior member
Nov 12, 2014
961
306
136
I agree. There isn't much, although I would venture that on the flip side the costs of designing and validating an x86 microarchitecture are higher than an ARM64 microarchitecture.

My comments above were re: N1; I don't have A78 hardware to play with so I can't speak to that. But I find it generally plausible that at least X1 exceeds Zen2 and SKL at iso clock.
I am sure that the A77 already has higher relative performance (aka IPC) than both Zen2 and SKL. In fact the A77 should be very close to Sunny Cove/Ice Lake.
 

gdansk

Senior member
Feb 8, 2011
475
131
116
That's a much too simplified view. There are issues all over the place in the x86 front end and back end - you run out of architectural registers early, the back end has very limited freedom to schedule memory accesses, flush write buffers, or speculatively schedule reads, and broken-down x86 uops either carry a memory-reference dependence or no architectural-register dependence at all, just to name a few. Having the same amount of compute resources (e.g. ALUs) does not help much if you cannot use them as efficiently.
It is narrow. Unfortunately, as these characteristics are not well known and seldom show up in ISA-comparison research, I'll wait for actual benchmark results for the A78 before assuming its ISA has any impact beyond forcing compilers to work around the weak memory model.
 

SarahKerrigan

Member
Oct 12, 2014
187
173
116
I am sure that the A77 already has higher relative performance (aka IPC) than both Zen2 and SKL. In fact the A77 should be very close to Sunny Cove/Ice Lake.
My assumption has generally been that A77 at least matches what I've seen from N1. If A78 is indeed an improvement over that, that's a good place to be.
 

Thala

Senior member
Nov 12, 2014
961
306
136
It is narrow. Unfortunately, as these characteristics are not well known and seldom show up in ISA-comparison research, I'll wait for actual benchmark results for the A78 before assuming its ISA has any impact beyond forcing compilers to work around the weak memory model.
On a single core, even with a weakly ordered memory model, you can assume sequentially consistent ordering* - so there is no workaround the compiler has to apply. The weakly ordered memory model comes into play when there are other observers in the system - like other cores.

*This means, for instance, that a load will always observe the value of the most recent write in program order to the same address.
 

gdansk

Senior member
Feb 8, 2011
475
131
116
On a single core, even with a weakly ordered memory model, you can assume sequentially consistent ordering* - so there is no workaround the compiler has to apply. The weakly ordered memory model comes into play when there are other observers in the system - like other cores.
I'm curious. How does the compiler know ahead of time if some other core will observe the result?
 

Thala

Senior member
Nov 12, 2014
961
306
136
I'm curious. How does the compiler know ahead of time if some other core will observe the result?
It does not know at all. It is the task of the multi-threading library/operating system to add the required barriers. So technically only the provider of the synchronization implementation (like semaphores and locks) has to know the details of the memory model. In other words, as long as you stay at the application level and use the provided synchronization primitives, you do not have to care at all.

I think the fundamental insight is that ordering alone does not help without synchronization - as a consequence, it is sufficient to require ordering only at the synchronization boundaries.

To make a concrete example: a thread on core A does 100 writes and then signals a thread on core B that it is done. An x86 core makes sure that all the writes are observable in program order, even though the observer is only interested in all 100 writes being observable after synchronization. An ARM core schedules the writes more optimally with respect to performance and power, unconstrained by ordering requirements; after the 100 writes it executes a memory barrier that makes them all observable.
 
Last edited:

gdansk

Senior member
Feb 8, 2011
475
131
116
It does not know at all. It is the task of the multi-threading library/operating system to add the required barriers. So technically only the provider of the synchronization implementation (like semaphores and locks) has to know the details of the memory model.
That's interesting, so without pthreads it won't enforce ordering at all?
 

Thala

Senior member
Nov 12, 2014
961
306
136
That's interesting, so without pthreads it won't enforce ordering at all?
Yes - you typically will not see any barriers in application code (which uses pthreads), but you will find them in the pthread library implementation wherever the memory model requires them.

If you write (pseudo)code like this (assume A and B are initialized to 0):

write B <- 1
write A <- 1

and on the other core:

while (!A);
read B
On ARM it is feasible that you read B = 0, while on Intel it is guaranteed that you read B = 1. The compiler will not help here.
 
Last edited:

gdansk

Senior member
Feb 8, 2011
475
131
116
Perhaps I had misjudged the IPC of modern phones. Out of curiosity I ran our test suite from work. It's single-threaded and essentially matches HTTP requests to firewall actions. I was able to run it on an (Android) Snapdragon 865 phone and a (Debian) desktop with LuaJIT v2.1.

Snapdragon 865 (~2.84 GHz), average of three runs: 81.74 s (projected to 4.2 GHz: 55.27 s)
R5 3600 (~4.2 GHz), average of three runs: 46.94 s

So A77 is closer than I expected to Zen2 but of course it's always what you do with it.
 

Doug S

Member
Feb 8, 2020
143
172
76
Three things we should all be excited about from Apple's own silicon:

1) Use of the ARMv9 architecture.
2) LPDDR5 and its bandwidth - 6400 MT/s gives 128 GB/s of memory bandwidth.
3) Apple essentially making a developer platform for the ARM architecture, with far higher adoption than ever before.

And for those three reasons alone, plus the fact that the OS that benefits most from all of this is going to be Linux, I for one am rooting for the Apple silicon team.
Has ARMv9 even been finalized by ARM? If so, when did that happen? Apple would have had to tape out the design late last year/early this year to be able to ship in systems this December as announced. They can't tape out a chip with an architecture that hasn't been finalized, so I'm pretty skeptical of the Mac running on ARMv9. Eventually sure, but not this year.

Besides, what does ARMv9 deliver that the optional parts of ARMv8 don't? Is SVE2 an ARMv9-only thing? Because I thought that was going into ARMv8.6 or whatever is next. That's really the only compelling addition to the ARM architecture I'm aware of.
 

Thala

Senior member
Nov 12, 2014
961
306
136
Perhaps I had misjudged the IPC of modern phones. Out of curiosity I ran our test suite from work. It's single-threaded and essentially matches HTTP requests to firewall actions. I was able to run it on an (Android) Snapdragon 865 phone and a (Debian) desktop with LuaJIT v2.1.

Snapdragon 865 (~2.84 GHz), average of three runs: 81.74 s (projected to 4.2 GHz: 55.27 s)
R5 3600 (~4.2 GHz), average of three runs: 46.94 s

So A77 is closer than I expected to Zen2 but of course it's always what you do with it.
On pure C/C++ code the A77 should win against Zen2 in most cases, even with some margin. If you are running JIT-compiled code, you can always have the issue that the JIT engine is better optimized for one architecture than the other. (That is also why I dislike browser JavaScript benchmarks.)
 

gdansk

Senior member
Feb 8, 2011
475
131
116
On pure C/C++ code the A77 should win against Zen2 in most cases, even with some margin. If you are running JIT-compiled code, you can always have the issue that the JIT engine is better optimized for one architecture than the other. (That is also why I dislike browser JavaScript benchmarks.)
Yes, of course. But that's also true when comparing anything built with, say, clang or gcc - probably to a lesser degree, because those are used by everyone. But I'm measuring with what I have access to. The LuaJIT compiler doesn't use AVX2 (the latest it supports is SSE4.1), so it's really leaving a lot of performance on the floor.
 

Thala

Senior member
Nov 12, 2014
961
306
136
Yes, of course. But that's also true when comparing anything built with, say, clang or gcc. But I'm measuring with what I have access to. The LuaJIT compiler doesn't use AVX2 (the latest it supports is SSE4.1), so it's really leaving a lot of performance on the floor.
Sure, no problem - we run with what we have :) At least it gives some indication.

For instance, I have lots of code compiled which I run on both my Surface Pro X (Cortex-A76) and my desktop (Skylake), and the A76 is very close to Skylake if I normalize for frequency. I don't have an A77, but from my A76 numbers I can easily conclude that it will outperform Skylake by a healthy margin.
 

insertcarehere

Senior member
Jan 17, 2013
289
97
101
Butthurt or outraged, hmm? ;)

Again, Shadow of the Tomb Raider was running on this very silicon with better performance than Renoir can deliver at 1080p. This is the game example people used to prove how advanced and efficient Rosetta 2 was after the keynote.

Now they turn the tables to claim it's not optimized yet?

I guess Rosetta 2 has to be pretty efficient at translating the code after all, hmm?
If a 2-year-old chip is actually able to outperform AMD Renoir in games under SW emulation, then x86 is truly screwed, regardless of what the Geekbench results say and how many cores AMD decides to stuff into their next Threadripper.
 
  • Like
Reactions: Etain05

Eug

Lifer
Mar 11, 2000
22,698
301
126
I don't know if this has been posted already or not, but a couple of months ago Bloomberg said some ARM Macs coming within the year would be 12-core, with 8 performance cores and 4 efficiency cores. Furthermore, the same article said that Apple is already looking at designs with more than 12 cores, but that wouldn't come until later.

I think this makes sense. Apple is said to go with a new iMac design this year, and their first version of it will actually be Intel supposedly. ARM would come later. A good range IMO would be A14X or something like that for up to the mid-tier iMacs and MacBook Pros, and then a higher end variant for the top end models. The iMac Pro and Mac Pro would come later with more than 12 cores.
 

soresu

Golden Member
Dec 19, 2014
1,179
443
136
If a 2-year-old chip is actually able to outperform AMD Renoir in games under SW emulation, then x86 is truly screwed, regardless of what the Geekbench results say and how many cores AMD decides to stuff into their next Threadripper.
There's a bit to unpack here, so a few points:

The Apple Axx core may be good, but unless they acquired a magic bean to boost its power it won't do jack against a Zen2 64C Threadripper at heavy threaded compute tasks.

Threadrippers are not gaming chips - unless by gaming you mean gaming while the PC is doing something (or several somethings) in the background.

Threadrippers are basically workstation chips strangely marketed to gamers (they even have ECC memory capability, albeit unvalidated).

You are not directly comparing apples to apples here, if you'll excuse the pun.

There is no Renoir-based Mac - so you are not making a direct comparison between Renoir and Axx on the same software platform; it's not even the same graphics code, as the Mac version was written in Metal instead of D3D12.

Another thing to bear in mind is that plenty of games (SOTR included) can vary significantly in performance depending on the level being played, as shown by the more in-depth benchmarks over at Phoronix. This means that a short demo of a single level does not in any way represent the full performance of any system.

The Rosetta demo video was not an exhaustive benchmark but a demonstration, so you'll forgive me for being somewhat skeptical about the results?

Lastly, I watched the Rosetta video - SOTR is a very nice-looking game, but at that crippled low texture resolution it's hardly worth playing at all on any platform; so while it's certainly an impressive feat of binary translation, it's somewhat unimpressive beyond that.
 

soresu

Golden Member
Dec 19, 2014
1,179
443
136
So A77 is closer than I expected to Zen2 but of course it's always what you do with it.
Indeed, OS network stack tuning for specific scenarios may well enter into the equation also, as well as the network hardware on the platforms which will certainly affect the result.
 

soresu

Golden Member
Dec 19, 2014
1,179
443
136
Has ARMv9 even been finalized by ARM? If so, when did that happen? Apple would have had to tape out the design late last year/early this year to be able to ship in systems this December as announced. They can't tape out a chip with an architecture that hasn't been finalized, so I'm pretty skeptical of the Mac running on ARMv9. Eventually sure, but not this year.

Besides, what does ARMv9 deliver that the optional parts of ARMv8 don't? Is SVE2 an ARMv9-only thing? Because I thought that was going into ARMv8.6 or whatever is next. That's really the only compelling addition to the ARM architecture I'm aware of.
SVE2 is an optional feature of v8-A at present, as SVE is, as far as I am aware.

It was only announced early last year, along with TME, to prepare software (compiler + OS) and hardware people for its impending arrival, but it is not some mandatory part of v8.6-A (the -A part is important, as -M and -R have different limitations and instructions).

TME (Transactional Memory Extension) is pretty significant too.

While I do believe that SVE2 will supersede NEON as the mandatory and preferred SIMD extension in v9-A, I have no idea whether the same will be true of TME.

The announcement for SVE2 and TME referred to them both as "multi-year investments" in the context of a major change - more than just a minor point-release of the ISA - so this could well imply that both are fated to be mandatory in v9-A.

It would be my guess that they will announce v9-A at ARM TechCon in October, though COVID could still change that.
 

soresu

Golden Member
Dec 19, 2014
1,179
443
136
A78 has either a 64 or 32 KiB 4-way L1 cache, which isn't going to help in most workloads compared to Zen 2's 32 KiB 8-way associative cache. It's germane to the subject of some supposed instruction set superiority: there is almost none.
Zen2 has 2x 256-bit AVX2 SIMD units, so really only the X1 from ARM themselves can compare to it directly, now that it has 4x 128-bit NEON units.

I'll be really interested to see how those 2 compare on SIMD heavy loads once a server oriented variant of X1 is wrought in silicon.
 

Doug S

Member
Feb 8, 2020
143
172
76
The Apple Axx core may be good, but unless they acquired a magic bean to boost its power it won't do jack against a Zen2 64C Threadripper at heavy threaded compute tasks.

Well of course not - Apple puts 2 or 4 big cores on a chip, and that's a long way from 64 cores. They could build a 64-core solution if they really wanted to, but that's overkill even for the Mac Pro.

I'm not sure why anyone is arguing about 64 core Threadrippers in comparison with the chips going into the Mac. Apple has never tried to compete with the fastest PC workstations with the Mac Pro - the top end PC workstations support two sockets so even if Apple had supported the fastest Xeons they'd still only hit 50% of the performance of the top end PC workstations which would support two such Xeons.

So if Apple ships an ARM Mac Pro that can't match Threadripper I guess the "x86 is better than ARM" people will still have an argument to hang their hat on - but they'll claim it is because Apple's cores are somehow unsuitable for scaling to 64 cores, not that Apple simply didn't think that's a market segment worth pursuing given their customer base and mere 5% share of the PC market.

The other problem with the idea of going that big is that macOS doesn't scale as well as Linux, or even as well as Windows. Since there isn't, and never has been, such a thing as a macOS enterprise server, Apple hasn't had a reason to invest in the sort of changes (finer-grained locking, improved scheduler and I/O algorithms, network stacks that work across cores, and so on) needed to scale to dozens or hundreds of cores like Linux and Windows can. (Linux can actually scale to thousands of cores; I'm not sure about Windows.)

For some applications it wouldn't matter - if the problem can be easily divided up like rendering then it isn't too big of a deal. But where it does matter you start to see diminishing returns as cores are added. Not because of any issue with the architecture, but because at some point the kernel can't get out of its own way.
 

Glo.

Diamond Member
Apr 25, 2015
3,877
1,809
136
If a 2-year-old chip is actually able to outperform AMD Renoir in games under SW emulation, then x86 is truly screwed, regardless of what the Geekbench results say and how many cores AMD decides to stuff into their next Threadripper.
Which is exactly the point. If it runs that well, then it means Rosetta 2 is far better at translating the code, and the performance hit is far lower, than fellow forum members' calculations suggest.

So those 800 points may not be 75% like people WANT it to be, but, for example, 90-95% of the performance of Apple silicon on macOS.
 

SarahKerrigan

Member
Oct 12, 2014
187
173
116
Which is exactly the point. If it runs that well, then it means Rosetta 2 is far better at translating the code, and the performance hit is far lower, than fellow forum members' calculations suggest.

So those 800 points may not be 75% like people WANT it to be, but, for example, 90-95% of the performance of Apple silicon on macOS.
And it may be 1%, too - that has just as much evidence behind it as 90-95%. Nobody gets 90%+ of native performance in a translation environment, even a really good one.

Go back to proclaiming that Apple is never, ever going to bring the Mac to ARM. That was more entertaining.
 
