Solved! ARM Apple High-End CPU - Intel replacement

Richie Rich · Oct 14, 2019

There is a first rumor about Intel replacement in Apple products:

ARM based high-end CPU
8 cores, no SMT
IPC +30% over Cortex A77
desktop performance (Core i7/Ryzen R7) with much lower power consumption
introduction with new gen MacBook Air in mid 2020 (considering also MacBook PRO and iMac)
massive AI accelerator

Source Coreteks:

Carfax83 · Nov 7, 2019

Thala said:
I was not talking about changing microarchtiecture, but just clocking the current architecture higher. Did you understand the point of the 5GHz CPU beeing at the voltage and frequency limits while the A13 is not?

Yes I understood, but I'm not certain the A13 could clock that high, and even if it could, how would it scale in performance seeing as it was never intended to run at such high clock speeds?

From my admitted limited understanding of microarchitecture, it seems that many or all of them have a sweet spot in terms of operating frequency. You even see this with GPUs. Pascal scales linearly really well up till a certain frequency, then the performance damn near flat lines as the clock speed continues to increase but performance tapers off, especially due to insufficient bandwidth.

Nothingness · Nov 7, 2019

Carfax83 said:
Yeah I did some reading about SVE2. Neither SVE or SVE2 has yet to be implemented yet though, so I guess we'll have to wait and see how they turn out.

One can already play with SVE (not SVE2) with QEMU and experimental versions of gcc/clang.

By the time SVE/SVE2 is implemented, AVX-512 should be a mainstay in x86 vector computing, and Intel will probably be looking at AVX-1024. Judging by how aggressive Intel is going at implementing it in their consumer lineup that is.

SVE is size-agnostic contrary to AVX. It can go up to 2048 bits if someone is crazy enough to do it.

Anyway what matters is hardware units width and number of units: some AVX-512 chips have a single FMA for instance; or one could have AVX-512 on 256-bit units thus requiring several micro-ops.

soresu · Nov 7, 2019

Carfax83 said:
By the time SVE/SVE2 is implemented, AVX-512 should be a mainstay in x86 vector computing, and Intel will probably be looking at AVX-1024. Judging by how aggressive Intel is going at implementing it in their consumer lineup that is.

The mess of market fragmentation that is AVX 512 is bad enough, throw in AVX 1024 and it will only get worse.

I would put SVE1 out of your mind completely, it was clearly designed mostly with HPC in mind - SVE2 however seems to have instruction parity (or superiority) with NEON coupled with SVE's vector length agnostic nature.

I'm hoping that the future will see the 'little' cores gaining 128 bit SVE2 support as a minimum baseline, so then you get 'big' cores going 256 or 512+ bit depending on the generation or use case.

I think 5nm will be the minimum process we will see SVE2 go mainstream on, though perhaps even 3nm depending on when Samsung's MBCFET process ramps.

TheGiant · Nov 7, 2019

Nothingness said:
One can already play with SVE (not SVE2) with QEMU and experimental versions of gcc/clang.

SVE is size-agnostic contrary to AVX. It can go up to 2048 bits if someone is crazy enough to do it.

Anyway what matters is hardware units width and number of units: some AVX-512 chips have a single FMA for instance; or one could have AVX-512 on 256-bit units thus requiring several micro-ops.

IMO what matters is what can cool it
vector units=power hog
however, I am glad AVX512 gets some real uses, Intel SVT does the job when other have not
IMO SVE will have hard time to go mainstream, because as it looks "dynamic" on paper, the SW and especially HW (power, heat..) will be a bottleneck

Thala · Nov 7, 2019

Carfax83 said:
Yes I understood, but I'm not certain the A13 could clock that high, and even if it could, how would it scale in performance seeing as it was never intended to run at such high clock speeds?

Between 0.9V and 1.4V there is quite a bit of frequency scaling possible. Performance scales almost linear, provided you also increase cache sizes and if necessary increase SoC DRAM bandwidth.
Not sure what you mean with "intended clock speed" - but surely DRAM throughput and cache sizes needs to adapted - you do want to have you SoC balanced and not the CPU waiting most of the time. The same applies if you increase the core count. Are you suggesting Apple is unable to balance the system when they increase core count and frequency?

To be clear here, when we are talking about a desktop class Apple CPU, we are not talking about overclocked A13 SoC. They will up the core count, frequency along with caches and memory controller.

Thala · Nov 7, 2019

soresu said:
I'm hoping that the future will see the 'little' cores gaining 128 bit SVE2 support as a minimum baseline, so then you get 'big' cores going 256 or 512+ bit depending on the generation or use case.

Thats dangerous having heterogenous architectures in a big.LITTLE setup. Look at Lakemont - Icelake has AVX512 while Tremont has not - so the whole SoC would not offer AVX512 at all.
However on the smaller cores can you still offer 256 or 512 bit SVE2 while the uarch implementation uses smaller ALUs and thus reduced performance per clock - but you still have guaranteed binary compatibility.
Take Cortex A5 as example. It has very small NEON units - but as it is only employing 64 bit ALUs most intstructions take 2 cycles.

soresu · Nov 7, 2019

Thala said:
Thats dangerous having heterogenous architectures in a big.LITTLE setup. Look at Lakemont - Icelake has AVX512 while Tremont has not - so the whole SoC would not offer AVX512 at all.
However on the smaller cores can you still offer 256 or 512 bit SVE2 while the uarch implementation uses smaller ALUs and thus reduced performance per clock - but you still have guaranteed binary compatibility.
Take Cortex A5 as example. It has very small NEON units - but as it is only employing 64 bit ALUs most intstructions take 2 cycles.

The vector length agnostic nature of SVE2 means it should be be homogeneous from a code perspective, assuming the same ISA version of course - unless I read their brief incorrectly and it's only agnostic to a compiler for auto vectorisation.

Thala · Nov 7, 2019

soresu said:
The vector length agnostic nature of SVE2 means it should be be homogeneous from a code perspective, assuming the same ISA version of course - unless I read their brief incorrectly and it's only agnostic to a compiler for auto vectorisation.

Well yeah you need to look at actual SVE code to understand how this works. For instance properly written SVE code requires to update your loop variables by vector-size. With other words, while binary compatible, the amount of iterations exectuted are inherently different between different vector-sizes - it should be clear that you need less (dynamic) instructions with wider vectors. Apparently if you do want to dynamically migrate SVE code while running such loops between cores with different verctor-size implementation will fail badly.
So even if the code is binary compatible, it does not mean it can be pre-empted at any point in time and continue running at a different vector-size implementation.

soresu · Nov 7, 2019

Thala said:
Well yeah you need to look at actual SVE code to understand how this works. For instance properly written SVE code requires to update your loop variables by vector-size. With other words, while binary compatible, the amount of iterations exectuted are inherently different between different vector-sizes - it should be clear that you need less (dynamic) instructions with wider vectors. Apparently if you do want to dynamically migrate SVE code while running such loops between cores with different verctor-size implementation will fail badly.
So even if the code is binary compatible, it does not mean it can be pre-empted at any point in time and continue running at a different vector-size implementation.

So basically migration requires a completion of the currently running instructions, from big to little cores and vice versa.

I guess that might not be ideal - but if the instruction load is large enough they should be executing on the big cores anyway, similar to the current use case for the pairings.

My guess would be that they will reveal another revision of the BigLittle/DynamiQ architecture along with the first pairing of SVE2 big and little cores which makes it an easier sell from a compiler and migration perspective.

DrMrLordX · Nov 7, 2019

WRT SVE2, there will only be so many vector length implementations available. I would consider it rare to see anything in hardware above 512 bit SVE2. So if you want to run chip-agnostic SVE2 code, you produce three loop variants per likely vector length (128 bit, 256 bit, 512 bit) and then call it a day. Not perfect, but not terribly difficult either.

Thala · Nov 7, 2019

DrMrLordX said:
So if you want to run chip-agnostic SVE2 code, you produce three loop variants per likely vector length (128 bit, 256 bit, 512 bit) and then call it a day. Not perfect, but not terribly difficult either.

No, you dont have to do this. You just write vector length agnostic SVE code and thats it ... not 3 loop variants. My point was that whatever agnostic code you write, the OS is not allowed to migrate that code dynamically between different vector length implementations.

So basically migration requires a completion of the currently running instructions, I guess that might not be ideal - but if the instruction load is large enough they should be executing on the big cores anyway, similar to the current use case for the pairings.

You do think too narrow. It will also fail if code running on the small cores suddenly starts using SVE instructions and the OS will migrate to the larger cores.

soresu · Nov 7, 2019

DrMrLordX said:
I would consider it rare to see anything in hardware above 512 bit SVE2.

I'm pretty sure that SVE was originally based on an experimental instruction set called ARGON referenced in a research paper that ARM published.

From what I remember there were diminishing returns above a certain vector length - that might not necessarily be 512 bit, but anything higher would almost certainly require a large area in the core.

As I posted somewhere a few posts back, even the Fujitsu A64FX main CPU core uses only 512 bit SVE, and that is intended for a supercomputer after all.

soresu · Nov 7, 2019

Thala said:
You do think too narrow. It will also fail if code running on the small cores suddenly starts using SVE instructions and the OS will migrate to the larger cores.

I assumed the vice versa of big to little was implied by my post, I'll amend it.

Thala · Nov 7, 2019

soresu said:
I assumed the vice versa of large to small was implied by my post, I'll amend them.

I thought your argument was, that SVE code would run on the larger cores only? I did say that having different vector length implementations in a big.LITTLE setup would prevent migration. As consequence different vector length implementation in a big.LITTLE setup can be ruled out.

soresu · Nov 7, 2019

To be honest I'm confused why they used such a sterile name like SVE for the new vector instruction set in the first place - especially given they used Helium for v8-M.

They should have named it Xenon, like the element used in high power lamps in cars and some torchlights.

Thala · Nov 7, 2019

soresu said:
They should have named it Xenon, like the element used in high power lamps in cars and some torchlights.

These lamps are all LED nowadays - also sterile name - coincidence?

soresu · Nov 7, 2019

Thala said:
I did say that having different vector length implementations in a big.LITTLE setup would prevent migration. As consequence different vector length implementation in a big.LITTLE setup can be ruled out.

Is this limitation explicitly laid out in the SVE2 documentation?

If it's in SVE docs I wouldn't be surprised given it wasn't meant for general use - but the SVE2 announcement made a point of highlighting NEON media/DSP instruction parity, even going so far as to note superior simulated execution speed of said instructions at 128 bit.

Which makes me think it was designed with general use in mind, ergo they likely have some solution for big little type designs.

By the time Matterhorn comes out DynamiQ will probably be as old as the initial big little layout was back then, so as I said earlier I would expect a further revision of that set up to be forthcoming - unless of course they plan to abandon that solution with Matterhorn and switch to AVFS/DVFS or whatever AMD and Intel use to scale clock and voltage dynamically..

soresu · Nov 7, 2019

Thala said:
These lamps are all LED nowadays - also sterile name - coincidence?

As in Lead?

Mate you are worse than my grandad with that joke.

Of course you could go one element up the periodic table and call the ISA Krypton!

Carfax83 · Nov 7, 2019

soresu said:
The mess of market fragmentation that is AVX 512 is bad enough, throw in AVX 1024 and it will only get worse.

I've seen this complaint about AVX 512 oft mentioned on this forum. Yet despite the fragmentation, AVX2 and AVX 512 are definitely being utilized. As I've said before, I was literally shocked at the performance gains that Intel was getting from their SVT codecs!

And then the AV1 decoder dav1d is making strong use of AVX2, and will incorporate AVX 512 once Intel's 10nm cores start proliferating.

So no matter what, AVX 512 is going to be a big thing it seems.

Carfax83 · Nov 7, 2019

Thala said:
To be clear here, when we are talking about a desktop class Apple CPU, we are not talking about overclocked A13 SoC. They will up the core count, frequency along with caches and memory controller.

OK I misunderstood your initial claim. I thought you were talking about overclocking the A13. But yeah, I would love to see a beefed up A13 or A14 derivative on desktop, just out of curiosity.

Is there any data on how much bandwidth the A13's L1 and L2 cache have?

Thala · Nov 7, 2019

soresu said:
Is this limitation explicitly laid out in the SVE2 documentation?

Look, that limitation has not to be mentioned in the documentation - as the documentation does not reason about context switching to a different implementation-width.
Look at it from this perspective: How are you going to store the execution state from wider registers and then restoring the very same execution state into smaller registers? The state will just not fit...period. This is just common sense.

soresu said:
Which makes me think it was designed with general use in mind, ergo they likely have some solution for big little type designs.

As i said, solution is that both big and little core on the same SoC have the the same vector-width - on register level. The ALUs can be smaller as i already pointed out above.

By the time Matterhorn comes out DynamiQ will probably be as old as the initial big little layout was back then, so as I said earlier I would expect a further revision of that set up to be forthcoming - unless of course they plan to abandon that solution with Matterhorn and switch to AVFS/DVFS or whatever AMD and Intel use to scale clock and voltage dynamically..

Not sure why you bring up DynamIQ? DynamIQ does not prevent implementation of any DVFS scheme...Power and clocking is part of the integration not part of the ARM IP delivery - it does not contain PLLs, LDOs, DCDCs, sensors nor any other analog circuits.

Thala · Nov 7, 2019

Carfax83 said:
I've seen this complaint about AVX 512 oft mentioned on this forum. Yet despite the fragmentation, AVX2 and AVX 512 are definitely being utilized. As I've said before, I was literally shocked at the performance gains that Intel was getting from their SVT codecs! And then the AV1 decoder dav1d is making strong use of AVX2, and will incorporate AVX 512 once Intel's 10nm cores start proliferating.

So no matter what, AVX 512 is going to be a big thing it seems.

Forget about AVX512 in this context - thats really old school tech compared to SVE. And while you get similar speedups compared to SVE when hand optimizing the code, SVE has much better supports for compilers to auto-generate good code. In addition SVE code is agnostic to vector length so it runs essentially on any SVE implementation.

TheGiant · Nov 7, 2019

Thala said:
Forget about AVX512 in this context - thats really old school tech compared to SVE. And while you get similar speedups compared to SVE when hand optimizing the code, SVE has much better supports for compilers to auto-generate good code. In addition SVE code is agnostic to vector length so it runs essentially on any SVE implementation.

do you see any use of SVE in CFD/numerical math computing?
I am asking because I've seen too much good tech designs to fail on the change phase

soresu · Nov 7, 2019

TheGiant said:
do you see any use of SVE in CFD/numerical math computing?
I am asking because I've seen too much good tech designs to fail on the change phase

This is a slide from linked PDF on the SVE2 and TME announcement blog entry:

soresu · Nov 7, 2019

Thala said:
As i said, solution is that both big and little core on the same SoC have the the same vector-width - on register level. The ALUs can be smaller as i already pointed out above.

Soooo, what you are basically saying is that I was right in the first place about being able to have cores with differing vector execution strengths in the same SoC, because that is all that I meant.

When you start to get into the register level specifics, you might as well have started speaking gobbledygook to me I'm afraid.

Where code and hardware are concerned I'm at best a naive to middling functional programmer of Python (in Maya) with a hobbyist interest in hardware, rather than any detailed knowledge of uArch specifics.

Solved! ARM Apple High-End CPU - Intel replacement

Senior member

Diamond Member

Diamond Member

Diamond Member

Senior member

Golden Member

Golden Member

Diamond Member

Golden Member

Diamond Member

Lifer

Golden Member

Diamond Member

Diamond Member

Golden Member

Diamond Member

Golden Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Golden Member

Golden Member

Senior member

Diamond Member

Diamond Member