Solved! ARM Apple High-End CPU - Intel replacement


Richie Rich

Senior member
Jul 28, 2019
470
229
76
There is a first rumor about Intel replacement in Apple products:
  • ARM based high-end CPU
  • 8 cores, no SMT
  • IPC +30% over Cortex A77
  • desktop performance (Core i7/Ryzen R7) with much lower power consumption
  • introduction with new gen MacBook Air in mid 2020 (considering also MacBook PRO and iMac)
  • massive AI accelerator

Source Coreteks:
 
Solution
What an understatement :D And it looks like it doesn't want to die. Yet.


Yes, A13 is competitive against Intel chips but the emulation tax is about 2x. So given that A13 ~= Intel, for emulated x86 programs you'd get half the speed of an equivalent x86 machine. This is one of the reasons they haven't yet switched.

Another reason is that it would prevent the use of Windows on their machines, something some say is very important.

The level of ignorance in this thread would be shocking if it weren't depressing.
Let's state some basics:

(a) History. Apple has never let backward compatibility limit what they do. They are not Intel, they are not Windows. They don't sell perpetual compatibility as a feature. Christ, the big...

soresu

Platinum Member
Dec 19, 2014
2,662
1,862
136
Are you telling me that SIGBUS is not an unaligned memory access error, even though I'm telling you I'm fixing it by aligning the data the program parses? This is with kernel fixup enabled, so that is already helping.

You are more than welcome to take a look.

But it is unaligned memory access, no way around it. That's why I'm writing a tool to fix the 3D models so the game can run on ARM CPUs.

All I can say is I have a renewed love for x86 after this experience. I've never written any AVX code and I have no experience with AVX.
Are these problems related to the original retail code, or stuff that has since been overhauled with more modern code (ie C++ 14/17)?

I have never seen anything like an exact run down of what has been changed since the original source release.
 

Shivansps

Diamond Member
Sep 11, 2013
3,855
1,518
136

The problem is that there was a hidden requirement that all 3D model data must be aligned in every respect: both the "block" size and all offsets inside the data must be divisible by 4, so that when you parse the data and move/assign the pointers you never land on an unaligned position. No one knew this until I tried to run it on the RPi, so third-party model tools, some dating from 20 years ago, do not account for it, and no one ever noticed because it just works on x86. These are cases that neither the kernel, the compiler nor the CPU can deal with; the kernel itself is fixing up a lot of these unaligned accesses but it can't fix them all, hence the SIGBUS. While there are problems related to new stuff in the code (the shield mesh collision optimisation system is unaligned by design and just can't run on ARM, so I had to disable it), the other problem goes back 20+ years, to when people first used model tools to edit/create 3D models. The 1999 x86 CPUs were able to deal with this just fine, which is why it went undiscovered until I tried to run it on the RPi.

In the end this is a simple file read and parse operation, one that is very common in pretty much every piece of C/C++ software that needs some file I/O. This is why I said it is not so easy to move from x86 to ARM.

You have no idea how surprised I was to discover this was still an issue on an A76 core. I always thought this was just an optimization, that some archs may run into performance issues with wrongly aligned data, but a "hell no, SIGBUS" in a modern core? I could not believe it.
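
For what it's worth, the usual portable way around this kind of fault is to copy the bytes out of the file buffer instead of casting a pointer into it. A minimal sketch follows, assuming a plain byte buffer; `read_u32` is a hypothetical helper name, not anything from the FSO source, and the value comes back in host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value from any byte offset without casting the pointer.
 * memcpy has no alignment requirement, so this is valid C no matter
 * where `offset` falls inside the buffer. */
static uint32_t read_u32(const uint8_t *buf, size_t len, size_t offset)
{
    uint32_t value;

    /* Refuse to read past the end of the buffer. */
    assert(offset <= len && len - offset >= sizeof value);
    memcpy(&value, buf + offset, sizeof value);
    return value;
}
```

A parser written this way does not care whether the model tool padded its blocks to a multiple of 4, at the price of touching every field access in the loader.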
 

soresu

Platinum Member
Dec 19, 2014
2,662
1,862
136
Does it work properly on either of the QEMU or Exagear x86 emulators?
 

Nothingness

Platinum Member
Jul 3, 2013
2,410
745
136
Are you telling me that SIGBUS is not an unaligned memory access error, even though I'm telling you I'm fixing it by aligning the data the program parses? This is with kernel fixup enabled, so that is already helping.

You are more than welcome to take a look.

But it is unaligned memory access, no way around it. That's why I'm writing a tool to fix the 3D models so the game can run on ARM CPUs.
What CPU are you running it on? An ARMv7 CPU with the A bit set can force alignment checking.

EDIT: found you mentioned A76; RPi doesn't run A76. Can you provide a disassembly of the faulting instruction?

All I can say is I have a renewed love for x86 after this experience. I've never written any AVX code and I have no experience with AVX.
Yeah you'd better stick to it, ARM stinks.

Oh by the way don't pack your struct on AVX or you'll get alignment faults.
 

DrMrLordX

Lifer
Apr 27, 2000
21,629
10,841
136
@Richie Rich

Regarding the Kunpeng 920 boards:

Interesting! I'm not sure how Kunpeng 920 stacks up against existing A76-based cores (like Snapdragon 855) but I'm guessing the Kunpeng 920-based PCs will have up to 8 v110 cores as opposed to the fastest mobile ARM SoCs that have (at most) 4xA76 + 4x A53. Not counting Apple's products, of course.

If v110 proves to be roughly-equivalent to an A76 per clocks, having 8xv110 would be very interesting indeed. It should be able to boot any Linux distro that can run on Huawei's server hardware right out-of-the-box. I would think.
 

Nothingness

Platinum Member
Jul 3, 2013
2,410
745
136
In the end this is a simple file read and parse operation, one that is very common in pretty much every piece of C/C++ software that needs some file I/O. This is why I said it is not so easy to move from x86 to ARM.

You have no idea how surprised I was to discover this was still an issue on an A76 core. I always thought this was just an optimization, that some archs may run into performance issues with wrongly aligned data, but a "hell no, SIGBUS" in a modern core? I could not believe it.
I think you're drawing conclusions a bit fast. You didn't even say whether you're running 64-bit code. You can't even tell which CPU you're running on, as you talk about the A76, which is not what the RPi runs. You didn't provide a disassembly of the faulting instruction.

It seems to me you just want to hate it :)
 

Thala

Golden Member
Nov 12, 2014
1,355
653
136
In the end this is a simple file read and parse operation, one that is very common in pretty much every piece of C/C++ software that needs some file I/O. This is why I said it is not so easy to move from x86 to ARM.

You have no idea how surprised I was to discover this was still an issue on an A76 core. I always thought this was just an optimization, that some archs may run into performance issues with wrongly aligned data, but a "hell no, SIGBUS" in a modern core? I could not believe it.

You still do not understand the HW part. Having the CPU support unaligned access is a waste of transistors, power and performance... not a single modern CPU (or any modern extension to x86) supports unaligned access. And as I said, it is not "still" an issue on ARMv8, since ARMv7 supports unaligned access, but ARM finally got rid of unaligned access for efficiency reasons.

You still don't understand the SW part either. Millions of programs just work on ARM because proper serialization/de-serialization to file is implemented. The problem in your case is that someone thought it was a good idea to remove proper padding in the files, therefore some models work and others do not. In the big picture this is a non-issue, yet you make it sound as if it were a common one.
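
To illustrate what "proper serialization/de-serialization" can look like, here is a minimal sketch that writes and reads each field byte by byte in a fixed little-endian order, so the on-disk layout depends on neither struct padding nor pointer alignment. The `chunk_header` record and every function name below are hypothetical and only there for illustration; none of it comes from the game's file format.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical in-memory record: a 1-byte tag and a 4-byte length. */
struct chunk_header {
    uint8_t  tag;
    uint32_t length;
};

/* On disk the record is exactly 5 bytes, with no padding at all. */
enum { CHUNK_HEADER_DISK_SIZE = 5 };

/* Store a 32-bit value as four bytes, least significant byte first. */
static void put_u32_le(uint8_t *out, uint32_t v)
{
    for (int i = 0; i < 4; i++)
        out[i] = (uint8_t)(v >> (8 * i));
}

/* Rebuild a 32-bit value from four little-endian bytes. */
static uint32_t get_u32_le(const uint8_t *in)
{
    uint32_t v = 0;

    for (int i = 0; i < 4; i++)
        v |= (uint32_t)in[i] << (8 * i);
    return v;
}

static void chunk_header_write(uint8_t *out, const struct chunk_header *h)
{
    assert(out && h);
    out[0] = h->tag;
    put_u32_le(out + 1, h->length);   /* offset 1 is misaligned, which is fine here */
}

static void chunk_header_read(struct chunk_header *h, const uint8_t *in)
{
    assert(h && in);
    h->tag = in[0];
    h->length = get_u32_le(in + 1);
}
```

Only single bytes are ever loaded or stored through the buffer pointer, so the code never issues a multi-byte access at an odd address.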
 

Nothingness

Platinum Member
Jul 3, 2013
2,410
745
136

That should close the discussion, unless one has decided to hate an arch at the first issue encountered.

I remember fondly when 64-bit machines started to arrive. Everyone was complaining that software didn't work. It was funny to see people blame the machines instead of the poorly written software.
 

soresu

Platinum Member
Dec 19, 2014
2,662
1,862
136
@Richie Rich
If v110 proves to be roughly-equivalent to an A76 per clocks, having 8xv110 would be very interesting indeed. It should be able to boot any Linux distro that can run on Huawei's server hardware right out-of-the-box. I would think.

That should close the discussion, unless one has decided to hate an arch at the first issue encountered.

I remember fondly when 64-bit machines started to arrive. Everyone was complaining that software didn't work. It was funny to see people blame the machines instead of the poorly written software.
To be fair FSO is a combination of 20 year old code and various overhauls since the source release of FS2 years ago - I'm just thankful that any work gets done on it at all.

Having said that, I look forward to the day I can watch this beauty glide across the screen from my RPi4:

1575740161678.png
 

Nothingness

Platinum Member
Jul 3, 2013
2,410
745
136
To be fair FSO is a combination of 20 year old code and various overhauls since the source release of FS2 years ago - I'm just thankful that any work gets done on it at all.
The problem is not that the software might have issues. The problem is blaming the hardware. ARM CPUs from roughly the last 15 years do support unaligned memory accesses. I don't know what issue the OP has, but it's not that ARM doesn't support unaligned ld/st in general; it's something else, such as packed structures + SIMD (which can lead to unaligned access faults on x86 too, I've experienced it first hand) or a poorly configured OS.
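
As a rough illustration of the packed-structure point, the sketch below compares a naturally laid-out struct with a packed one. It assumes GCC or Clang for `__attribute__((packed))`; the struct and function names are made up for the example, and no SIMD code is included.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Natural layout: the compiler inserts padding so that each member
 * sits at a multiple of its own alignment. */
struct natural {
    uint8_t  kind;
    uint32_t id;
    double   weight;
};

/* Packed layout (GCC/Clang extension): no padding, so `id` and
 * `weight` start at odd offsets. */
struct __attribute__((packed)) squeezed {
    uint8_t  kind;
    uint32_t id;
    double   weight;
};

_Static_assert(offsetof(struct squeezed, id) == 1, "packed: id follows kind");
_Static_assert(offsetof(struct squeezed, weight) == 5, "packed: weight follows id");

/* Non-zero when `offset` is not a multiple of `alignment`. */
static int is_misaligned(size_t offset, size_t alignment)
{
    assert(alignment != 0);
    return offset % alignment != 0;
}
```

Accessing `s.id` through the struct is fine because the compiler knows the type is packed; the trouble starts when `&s.id` is handed to code (or to an instruction, such as an aligned SIMD load) that assumes natural alignment.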
 

Shivansps

Diamond Member
Sep 11, 2013
3,855
1,518
136
I think you're drawing conclusions a bit fast. You didn't even say whether you're running 64-bit code. You can't even tell which CPU you're running on, as you talk about the A76, which is not what the RPi runs. You didn't provide a disassembly of the faulting instruction.

It seems to me you just want to hate it :)

A72, no idea why I had A76 on my mind. All I can say is that it is now working after I aligned all data inside the 3D models so that no pointer lands on an unaligned position. Like it or not, that is unaligned memory access, and if I disable the kernel unaligned fixup the game does crash, because some parts of the code use unaligned access (mostly the scripting part).

BTW, the RPi4 is not using AArch64 yet; the kernel was in beta last time I checked. I definitely aim to see whether this issue doesn't happen on AArch64. Why do you want to know the instruction that is causing the SIGBUS?

You still do not understand the HW part. Having the CPU support unaligned access is a waste of transistors, power and performance... not a single modern CPU (or any modern extension to x86) supports unaligned access. And as I said, it is not "still" an issue on ARMv8, since ARMv7 supports unaligned access, but ARM finally got rid of unaligned access for efficiency reasons.

You still don't understand the SW part either. Millions of programs just work on ARM because proper serialization/de-serialization to file is implemented. The problem in your case is that someone thought it was a good idea to remove proper padding in the files, therefore some models work and others do not. In the big picture this is a non-issue, yet you make it sound as if it were a common one.

I can understand ARM not wanting to support it. But I don't understand this: ARMv7 supported it, ARMv8 in 32-bit mode somehow does not, but it is supported again in 64-bit mode (AArch64)? Again, I believed ARM supported this until I encountered this problem.
It's not just the file padding; some parts of the code were developed with unaligned access, sadly.
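
The "everything divisible by 4" rule that the model-fixing tool has to enforce comes down to two small helpers: round a size or offset up to the next multiple of the alignment, and check whether a pointer already sits on such a boundary. This is only a sketch under that assumption; the names are hypothetical and the real tool is not shown in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round n up to the next multiple of `alignment` (a power of two).
 * Overflow for n close to SIZE_MAX is not handled in this sketch. */
static size_t align_up(size_t n, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (n + alignment - 1) & ~(alignment - 1);
}

/* Non-zero when p is a multiple of `alignment` (a power of two). */
static int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}
```

A block of 13 bytes, for example, would be padded to `align_up(13, 4)`, which is 16, so the block that follows starts on a 4-byte boundary again.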
 

Richie Rich

Senior member
Jul 28, 2019
470
229
76
So in the end the problem was between the chair and the keyboard. I'm glad we can stop blaming ARM, which is a major platform worldwide.
 

Nothingness

Platinum Member
Jul 3, 2013
2,410
745
136
A72, no idea why I had A76 on my mind. All I can say is that it is now working after I aligned all data inside the 3D models so that no pointer lands on an unaligned position. Like it or not, that is unaligned memory access, and if I disable the kernel unaligned fixup the game does crash, because some parts of the code use unaligned access (mostly the scripting part).

BTW, the RPi4 is not using AArch64 yet; the kernel was in beta last time I checked. I definitely aim to see whether this issue doesn't happen on AArch64. Why do you want to know the instruction that is causing the SIGBUS?
Because not all instructions accept being unaligned (SIMD ones for instance on x86) so I was wondering if you had run one of these instructions. Also see below for 32-bit ARM.

I can understand ARM not wanting to support it. But I don't understand this: ARMv7 supported it, ARMv8 in 32-bit mode somehow does not, but it is supported again in 64-bit mode (AArch64)? Again, I believed ARM supported this until I encountered this problem.
It's not just the file padding; some parts of the code were developed with unaligned access, sadly.
ldrd/strd (that is 2x32-bit ld/st) need alignment in 32-bit mode.

Here is an example on an NVIDIA Jetson:
Code:
$ cat align.c
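#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void)
{
  static uint8_t bytes[32];
  uint64_t *p64;
  uint32_t *p32;

  /* Fill the buffer with 0..31 so each load shows which bytes it read. */
  for (int i = 0; i < 32; i++)
    bytes[i] = i;

  /* 32-bit loads at every byte offset. The casts are deliberately
     misaligned: that is exactly what this test probes. */
  for (int i = 0; i < (int)(sizeof(bytes) - sizeof(*p32)); i++) {
    p32 = (uint32_t *)&bytes[i];
    printf("%02d %p %08" PRIx32 "\n", i, (void *)p32, *p32);
  }

  /* Same again with 64-bit loads. */
  for (int i = 0; i < (int)(sizeof(bytes) - sizeof(*p64)); i++) {
    p64 = (uint64_t *)&bytes[i];
    printf("%02d %p %016" PRIx64 "\n", i, (void *)p64, *p64);
  }

  return 0;
}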

In 64-bit mode:
Code:
$ gcc-5 align.c -O -Wall -o align64 -static -g
$ ./align64
00 0x48ea30 03020100
01 0x48ea31 04030201
02 0x48ea32 05040302
03 0x48ea33 06050403
04 0x48ea34 07060504
05 0x48ea35 08070605
06 0x48ea36 09080706
07 0x48ea37 0a090807
08 0x48ea38 0b0a0908
09 0x48ea39 0c0b0a09
10 0x48ea3a 0d0c0b0a
11 0x48ea3b 0e0d0c0b
12 0x48ea3c 0f0e0d0c
13 0x48ea3d 100f0e0d
14 0x48ea3e 11100f0e
15 0x48ea3f 1211100f
16 0x48ea40 13121110
17 0x48ea41 14131211
18 0x48ea42 15141312
19 0x48ea43 16151413
20 0x48ea44 17161514
21 0x48ea45 18171615
22 0x48ea46 19181716
23 0x48ea47 1a191817
24 0x48ea48 1b1a1918
25 0x48ea49 1c1b1a19
26 0x48ea4a 1d1c1b1a
27 0x48ea4b 1e1d1c1b
00 0x48ea30 0706050403020100
01 0x48ea31 0807060504030201
02 0x48ea32 0908070605040302
03 0x48ea33 0a09080706050403
04 0x48ea34 0b0a090807060504
05 0x48ea35 0c0b0a0908070605
06 0x48ea36 0d0c0b0a09080706
07 0x48ea37 0e0d0c0b0a090807
08 0x48ea38 0f0e0d0c0b0a0908
09 0x48ea39 100f0e0d0c0b0a09
10 0x48ea3a 11100f0e0d0c0b0a
11 0x48ea3b 1211100f0e0d0c0b
12 0x48ea3c 131211100f0e0d0c
13 0x48ea3d 14131211100f0e0d
14 0x48ea3e 1514131211100f0e
15 0x48ea3f 161514131211100f
16 0x48ea40 1716151413121110
17 0x48ea41 1817161514131211
18 0x48ea42 1918171615141312
19 0x48ea43 1a19181716151413
20 0x48ea44 1b1a191817161514
21 0x48ea45 1c1b1a1918171615
22 0x48ea46 1d1c1b1a19181716
23 0x48ea47 1e1d1c1b1a191817

All variations of alignment work perfectly in 64-bit mode.

In 32-bit mode:
Code:
$ arm-linux-gnueabihf-gcc-5 align.c -O -Wall -o align32 -static -g
$ gdb align32
(...)
(gdb) r
Starting program: /home/nvidia/FS2/Align/align32 
00 0x79eec 03020100
01 0x79eed 04030201
02 0x79eee 05040302
03 0x79eef 06050403
04 0x79ef0 07060504
05 0x79ef1 08070605
06 0x79ef2 09080706
07 0x79ef3 0a090807
08 0x79ef4 0b0a0908
09 0x79ef5 0c0b0a09
10 0x79ef6 0d0c0b0a
11 0x79ef7 0e0d0c0b
12 0x79ef8 0f0e0d0c
13 0x79ef9 100f0e0d
14 0x79efa 11100f0e
15 0x79efb 1211100f
16 0x79efc 13121110
17 0x79efd 14131211
18 0x79efe 15141312
19 0x79eff 16151413
20 0x79f00 17161514
21 0x79f01 18171615
22 0x79f02 19181716
23 0x79f03 1a191817
24 0x79f04 1b1a1918
25 0x79f05 1c1b1a19
26 0x79f06 1d1c1b1a
27 0x79f07 1e1d1c1b
00 0x79eec 0706050403020100
(gdb) disassemble 
(...)
   0x000104ea <+78>:    movs    r7, #1
   0x000104ec <+80>:    adds    r3, r6, r4
=> 0x000104ee <+82>:    ldrd    r0, r1, [r3]
   0x000104f2 <+86>:    strd    r0, r1, [sp]
   0x000104f6 <+90>:    mov    r2, r4
   0x000104f8 <+92>:    mov    r1, r5
   0x000104fa <+94>:    mov    r0, r7
   0x000104fc <+96>:    bl    0x22674 <__printf_chk>

As you can see, 32-bit unaligned accesses work fine even on AArch32. It's only the ldrd instruction that checks alignment.

I hope that clears things up about ARM not supporting unaligned accesses in general which is simply not correct :)
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
I always assumed CISC vs RISC was the answer.

Their Atom cores are much, much smaller. 7 Goldmont cores in 1 Skylake core.

It must have to do with Intel needing to target the desktop with very high power levels and thermal density.

If they aimed for say 3GHz clocks they can do things that can't be done on current cores. Skylake, Sunny Cove and Zen cores have a 17-18 stage pipeline and the uop cache hit can reduce that by a few stages.

The A77 has a 13-stage pipeline and also has a uop cache. Reducing pipeline stages will result in higher perf/clock and a smaller core due to less complexity. Each pipeline stage is said to impact performance by 2-4%.
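
Taking those figures at face value, a back-of-the-envelope estimate looks like the sketch below. It assumes the per-stage cost simply adds up, which is a simplification made for this example and not something measured; the function name is made up.

```c
#include <assert.h>

/* Estimated perf/clock gain, in percent, from removing `stages`
 * pipeline stages, assuming each stage costs `percent_per_stage`
 * and that the costs are purely additive. */
static int additive_gain_percent(int stages, int percent_per_stage)
{
    assert(stages >= 0 && percent_per_stage >= 0);
    return stages * percent_per_stage;
}
```

Going from 17 stages to 13 at 2% per stage gives `additive_gain_percent(17 - 13, 2)`, i.e. 8%, and going from 18 to 13 at 4% per stage gives 20%, so the rule of thumb puts the potential gain somewhere between 8% and 20%.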

With a lower clock target they can also reduce the cache latencies. Rather than 5 cycles for Sunny Cove, 4 cycles or even 3 is possible. For Skylake-X's mesh, it would be a benefit too, since the mesh can run at clocks close to the CPU frequency rather than being something like a third behind.

We don't know what can be done if Intel purpose optimizes it for low power.
 

Thala

Golden Member
Nov 12, 2014
1,355
653
136
With a lower clock target they can also reduce the cache latencies. Rather than 5 cycles for Sunny Cove, 4 cycles or even 3 is possible. For Skylake-X's mesh, it would be a benefit too, since the mesh can run at clocks close to the CPU frequency rather than being something like a third behind.

I doubt that you can significantly reduce latencies. The thing with low-power design is that you target low voltage and low leakage in the first place, and lower frequency is the result, not the target. In order to still achieve something like 2.5+GHz you will need those pipeline stages.
Take the Cortex A72 as an example: it topped out at something around 2GHz in contemporary designs, yet the architecture can achieve 4GHz+, as TSMC has demonstrated.
 

soresu

Platinum Member
Dec 19, 2014
2,662
1,862
136
I doubt that you can significantly reduce latencies. The thing with low-power design is that you target low voltage and low leakage in the first place, and lower frequency is the result, not the target. In order to still achieve something like 2.5+GHz you will need those pipeline stages.
Take the Cortex A72 as an example: it topped out at something around 2GHz in contemporary designs, yet the architecture can achieve 4GHz+, as TSMC has demonstrated.
There aren't many high end A72 designs on a more efficient process out there to compare against - mainly the HiSilicon Kirin 950/955, which is still mobile.

I don't imagine that you would run an A72 at 4 GHz on a phone even with 5nm efficiency.

There are strangely a couple of 28nm A72 designs which made it into many different products - RK3399 and MT8173.
 

Thala

Golden Member
Nov 12, 2014
1,355
653
136
I don't imagine that you would run an A72 at 4 GHz on a phone even with 5nm efficiency.

Of course not, this was not my point at all. I just used this example to show that there is not much latency saving left, because we are already using high-performance architectures in mobile; we have to because of low-voltage and low-leakage requirements. We just never run them at 4+GHz because that would defeat the low-power property.
 

Nothingness

Platinum Member
Jul 3, 2013
2,410
745
136
I couldn't find the generic ARM thread any more. So here it goes.

This is the take of a programmer on Surface Pro X. Much more interesting than the dumb reviews I read.

 

Thala

Golden Member
Nov 12, 2014
1,355
653
136
I couldn't find the generic ARM thread any more. So here it goes.

This is the take of a programmer on Surface Pro X. Much more interesting than the dumb reviews I read.


Interesting read!
Now, the reason why _InterlockedIncrement is faster on ARM64 while AcquireSpinlockWithYield is not is that the spinlock implementation is using far too many barriers. Even worse, with ARMv8 we have loads with acquire and stores with release semantics. To top it off, they are using both: the primitives with acquire/release semantics PLUS the memory barriers, in both _InterlockedIncrement and AcquireSpinlockWithYield.
So _InterlockedIncrement could be even faster on ARM64.

Example from iOS. Note that you see no barrier instruction at all; the barriers are implicit due to the acquire/release semantics.
_OSAtomicAdd32Barrier:
ldaxr w8, [x1]
add w8, w8, w0
stlxr w9, w8, [x1]
cbnz w9, _OSAtomicAdd32Barrier
mov x0, x8
ret lr

And then spinning around a wait or yield is not really what you want. Today's architectures (not x86/x64) can go into low-power mode via wfe instructions when spinning; see the Linux ARM64 spinlock implementation.

ARM64 should always be faster than x86 (at iso clock) when it comes to low-level synchronization and cache coherency, when programmed properly... locked primitives are so last century... every modern architecture (of the last 20 years, mind you) uses the LL/SC concept, in the best case with acquire and release semantics like AArch64.
This is a good example of how bad code can drag an excellent architecture down...

The question is, where can I file a bug report with Microsoft now?
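
In portable C11 the same idea, ordering carried by the atomic operation itself with no separate fence, can be sketched as below. The function and type names are made up for this example and this is not the Windows or iOS implementation; on AArch64 a compiler will typically lower these to load-acquire/store-release exclusives like the ones above, or to a single instruction where ARMv8.1 atomics are available. The lock is a plain busy-wait and makes no attempt to use wfe.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Atomically add `amount` and return the new value. The acquire and
 * release ordering belongs to the operation, so no extra fence is issued. */
static int32_t add32_acq_rel(_Atomic int32_t *target, int32_t amount)
{
    int32_t old;

    assert(target != 0);
    old = atomic_fetch_add_explicit(target, amount, memory_order_acq_rel);
    /* Add in unsigned arithmetic so wrap-around is not undefined behaviour. */
    return (int32_t)((uint32_t)old + (uint32_t)amount);
}

/* Minimal test-and-set spinlock: acquire on lock, release on unlock. */
typedef struct {
    atomic_flag locked;
} spinlock;

static void spin_lock(spinlock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire)) {
        /* Busy-wait until the current holder clears the flag. */
    }
}

static void spin_unlock(spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

Stacking a full barrier on top of operations like these is the "both at once" pattern criticised above: the ordering is then paid for twice.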
 

soresu

Platinum Member
Dec 19, 2014
2,662
1,862
136
Not sure if this was already posted, but I found an article on WikiChip about the future of Marvell (formerly Cavium) ThunderX server processors.

Link here.

It discusses targeted generational gains of >2x for ThunderX3 and onwards.

ThunderX3 will be on 7nm in 2020, and will use a derivative uArch of Vulcan called Triton, and supposedly has 4x 128 bit NEON units.

SVE was also mentioned - likely for TX4, but possibly for TX3 as well.
 

DrMrLordX

Lifer
Apr 27, 2000
21,629
10,841
136
ThunderX3 will be on 7nm in 2020, and will use a derivative uArch of Vulcan called Triton, and supposedly has 4x 128 bit NEON units.

Interesting that they're still going after NEON. You would think after Toshiba took the plunge on 512b SVE years ago that Marvell/Cavium would be shooting for SVE2. Is it possible that they're going to do some kind of hardware-level op conversion so that the NEON units can handle SVE2 instructions?
 

soresu

Platinum Member
Dec 19, 2014
2,662
1,862
136
Interesting that they're still going after NEON. You would think after Toshiba took the plunge on 512b SVE years ago that Marvell/Cavium would be shooting for SVE2. Is it possible that they're going to do some kind of hardware-level op conversion so that the NEON units can handle SVE2 instructions?
Was Toshiba involved in the Fujitsu A64FX project?

Either way SVE2 is brand new relatively speaking - we are only just now seeing A64FX wafers using SVE, 3+ years after it was first announced.
 

DrMrLordX

Lifer
Apr 27, 2000
21,629
10,841
136
Was Toshiba involved in the Fujitsu A64FX project?

Woops, no, that's my brain being stupid. It was Fujitsu, not Toshiba.

Either way SVE2 is brand new relatively speaking - we are only just now seeing A64FX wafers using SVE, 3+ years after it was first announced.

Development arcs are pretty long and wide. If ThunderX3 went into planning before SVE2 was announced, I guess the designers had a choice between continuing to support NEON or supporting the (then still relatively new) SVE standard.
 

Arkaign

Lifer
Oct 27, 2006
20,736
1,377
126
I couldn't find the generic ARM thread any more. So here it goes.

This is the take of a programmer on Surface Pro X. Much more interesting than the dumb reviews I read.


Very interesting indeed!

I wouldn't really criticize SPX reviews overall though, as general user needs/pros/cons observed are drastically different than a developer and someone interested inherently in the ARM performance as it is (on native and x86).

To the general buyer of windows 10 mobile devices, what the architecture is makes little difference beyond 'hmm, interesting'. What matters is how it runs the things they want to run. And the SPX is a generally sluggish, patience-busting device. Performance IF you have ARM native stuff is acceptable if not overly exciting, but the emulation performance is an abomination. The chief benefits are a couple mm thinner, a bit lighter, and excellent battery life, but at the cost of monumentally slower performance outside of rare scenarios.

More native W10 ARM stuff could change that equation, but it's sort of a chicken and egg, and MS has never been trustworthy to follow through on non-mainstream OS branching. WinRT, Hell even WinXP x64, Win iA64, all extremely half baked, poorly supported, or prematurely abandoned.