News Big movements afoot in ARM server land

soresu · Sep 22, 2020

Haven't seen a new thread discussing it yet and I'm loath to use necromancy on an older thread, so here's one to talk about the Neoverse V1 and future N2 cores just announced.

V1 (codename Zeus on previous roadmap) is available for licensing now and seems to be basically the Cortex X1/Hera core with 2x256b SVE1 units, projected as max 96C per socket and +50% IPC over N1.

N2 (codename Perseus) is coming next year and supposedly based on the Cortex Axx core to succeed A78/Hercules, so possibly Matterhorn - it has 2x 128b SVE units though ARM are tight lipped on whether they are SVE2 or not, projected as max 128C per socket and +40% IPC over N1.

Link here to the Anandtech article with lotsa depth.

Thala · Sep 25, 2020

soresu said:
Right now it is optional, and therefore only developers targeting specialized use cases (i.e. Fujitsu/Riken's post-K supercomputer) will make the effort to put it in their applications (outside of auto-vectorized compiler code).

SVE/SVE2 is ideal to be made mandatory, because of the possibility of deploying just a single small 128-bit unit in order to be compliant - so you can put this even in a small core of a big.LITTLE system without blowing it up too much.
I assume we can expect an update to the small cores (A55 successor) next year, which hopefully will sport SVE2 as baseline.
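
To illustrate why a single 128-bit unit can be enough for compliance, here is a minimal Python sketch of a vector-length-agnostic loop in the style SVE encourages. It is a simulation only: the helper names `whilelt` (modelled loosely on the SVE instruction of the same name) and `vla_add` are made up for illustration, and the 32-bit element size is an assumption. The same loop body gives the same result whether the hardware vector is 128, 256 or 512 bits wide.

```python
def whilelt(i, n, lanes):
    """Build a predicate: lane k is active while i + k < n."""
    return [i + k < n for k in range(lanes)]


def vla_add(a, b, vl_bits, elem_bits=32):
    """Add two equal-length lists, one hardware vector at a time.

    vl_bits is the (simulated) hardware vector length; the loop never
    hard-codes it, it only asks how many lanes fit.
    """
    lanes = vl_bits // elem_bits
    n = len(a)
    out = [0] * n
    i = 0
    while i < n:
        pred = whilelt(i, n, lanes)      # handles the loop tail as well
        for k, active in enumerate(pred):
            if active:                   # inactive lanes touch nothing
                out[i + k] = a[i + k] + b[i + k]
        i += lanes
    return out


a = list(range(10))
b = [100] * 10
results = {vl: vla_add(a, b, vl) for vl in (128, 256, 512)}
```

The narrow implementation simply takes more iterations; the code itself does not need to change.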

ThatBuzzkiller · Sep 25, 2020

Thala said:
There is lots of coding space left in 32 bit - SVE does not require 64 bit at all.

Actually there isn't ...

SVE already makes some compromises in order to conserve instruction encoding space. Some SVE opcodes only have destructive forms which can complicate designing a compiler. Predication is mostly only available with destructive opcodes which can make autovectorization harder ...

EVEX (AVX-512) prefix by comparison doesn't have either of those limitations so it's more powerful from a compiler design standpoint ...

Thala · Sep 25, 2020

ThatBuzzkiller said:
Actually there isn't ...

SVE already makes some compromises in order to conserve instruction encoding space. Some SVE opcodes only have destructive forms which can complicate designing a compiler. Predication is mostly only available with destructive opcodes which can make autovectorization harder ...

EVEX (AVX-512) prefix by comparison doesn't have either of those limitations so it's more powerful from a compiler design standpoint ...

1) Either unpredicated and 3 operand or predicated and destructive. And due to this, there is quite a bit encoding space left - as i said. FMLA is 3-operand + predicate.
2) Destructive opcodes have never been a hurdle from compiler design standpoint nor for autovectorization - it would just add moves or stores wherever necessary.

ThatBuzzkiller · Sep 25, 2020

Thala said:
2) Destructive opcodes have never been a hurdle from compiler design standpoint nor for autovectorization - it would just add moves or stores wherever necessary.

Sure but copying data from the source operand register can take up valuable space for the extra temporary registers which can potentially impact performance so this is far from a trivial issue like you would imply ...

Predication not being available for constructive opcodes in SVE means less opportunities for autovectorization ...

With EVEX (AVX-512), you don't have either of those limitations since every opcode has 3-operands or is constructive and masking comes for free from a programmer's perspective ...

With SVE/SVE2, these are your options or 'compromises' would be a better word here ...

Destructive opcodes: Masking is available but you get only 2-operands which is a source and destination register ... (2-operands can result in sub-optimal codegen from the compiler which can negatively affect performance)

Constructive opcodes: Masking is unavailable for many instructions but you get 3-operands ... (without masking/predication you can't autovectorize some kernels)

soresu · Sep 26, 2020

Thala said:
I assume we can expect an update to the small cores (A55 successor) next year

Please god yes, make it so.

It's almost 3 and a half years since A55 came out and not a single streamer stick uses it yet (I have the Fire TV Stick 4k).

Still holding out hope that the new Android TV Chromecast uses the X4 version of the Amlogic S905 that has both A55 cores and AV1 decode.

Coupled with it almost certainly using the ARM64 version of Android instead of the ARM32/v7-A version that Fire TV devices are perpetually stuck on it would be a very nice upgrade, and one that means I would not have to sideload Kodi anymore to boot.

It's a bit depressing that basically everything not using a big core is still using the ancient A53 core which is almost 8 years old at this point.

This is why Intel still gets so many design wins for Atom - no one is actually making use of the more recent little ARM cores and they are coming out too infrequently.

Hopefully we are getting at least 40% ST IPC over A55 because it was a small upgrade in 2017 and even worse now - based on the ST increase projected for A65/E1 40% should definitely be achievable.

Thala · Sep 26, 2020

ThatBuzzkiller said:
Sure but copying data from the source operand register can take up valuable space for the extra temporary registers which can potentially impact performance so this is far from a trivial issue like you would imply ...
Predication not being available for constructive opcodes in SVE means less opportunities for autovectorization ...

This is not the case.
"op a,b,c" is totally equivalent with "mov a,b; op a,a,c". No more space in form of temporary registers at all. Destructive opcodes have never been an issue for vectorization.
It has some impact on the microarchitecture implementation, because you want to have the additional move for free as often as possible.
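
A minimal sketch of the equivalence claimed above, on a toy register machine. The tuple encoding and the names `lower`, `run3` and `run2` are assumptions made up for illustration, not any real compiler's API. It rewrites one 3-operand instruction into a move plus a destructive 2-operand instruction and lets you check that both leave the same register state. The one awkward case, where the destination is also the second source of a non-commutative op, is flagged rather than handled.

```python
import operator

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}
COMMUTATIVE = {"add", "mul"}


def lower(instr):
    """Lower (op, dst, a, b), meaning dst = a op b, to destructive form.

    Returns a list of ("mov", dst, src) and (op, dst, src) tuples, where
    a destructive (op, dst, src) means dst = dst op src.
    """
    op, dst, a, b = instr
    if dst == a:
        return [(op, dst, b)]            # already destructive
    if dst == b:
        if op in COMMUTATIVE:
            return [(op, dst, a)]        # swap the sources
        raise ValueError("needs a reversed opcode or a scratch register")
    return [("mov", dst, a), (op, dst, b)]


def run3(instr, regs):
    """Execute one 3-operand instruction on a copy of regs."""
    op, dst, a, b = instr
    regs = dict(regs)
    regs[dst] = OPS[op](regs[a], regs[b])
    return regs


def run2(instrs, regs):
    """Execute a list of mov / destructive instructions on a copy of regs."""
    regs = dict(regs)
    for op, dst, src in instrs:
        if op == "mov":
            regs[dst] = regs[src]
        else:
            regs[dst] = OPS[op](regs[dst], regs[src])
    return regs


start = {"a": 0, "b": 7, "c": 3}
instr = ("sub", "a", "b", "c")           # op a,b,c
lowered = lower(instr)                   # mov a,b; op a,a,c
same = run3(instr, start) == run2(lowered, start)
```

Note that no register beyond a, b and c appears in the lowered sequence; the cost is the extra move instruction.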

Doug S · Sep 26, 2020

soresu said:
No, this has already been ratified by ARM themselves in the SVE2/TME announcement - future chips with SVE2 will retain NEON code compatibility.

Yes, but that's because SVE2 is supported as an optional part of ARMv8, and NEON is a mandatory part of v8. If ARM revs to v9 and makes SVE2 mandatory, why in the world should they keep NEON around? I'm willing to bet heavily that if/when they rev to v9, NEON is gone.

Just like you could choose whether to design cores that ran both ARMv7 and ARMv8 code, when v9 appears you'll be able to design chips that are compatible with v8+v9 (which could run NEON code as part of that v8 capability) or v9 only.

Those who need backwards compatibility with NEON and v8 code in general would design chips that ran both, until they were able to deprecate the older stuff. Just like Apple did with their first v8 cores also running v7 code and then a few years later going ARMv8/64 bit only and dropping v7 entirely from their newer cores.

name99 · Sep 26, 2020

Thala said:
1) Either unpredicated and 3 operand or predicated and destructive. And due to this, there is quite a bit encoding space left - as i said. FMLA is 3-operand + predicate.
2) Destructive opcodes have never been a hurdle from compiler design standpoint nor for autovectorization - it would just add moves or stores wherever necessary.

The issue is not that opcode space is 100% full; it's that SVE (and a few other instruction types) are already forced to encode the instruction as essentially two pieces placed back to back.
So what we have is an ISA that is already essentially a combination of 32 and 64 bit, but with a lot of downside and none of the upside. The back to back instructions can be fused, of course, and are. But every fuse that's performed in hardware, rather than encoded in the ISA, takes away from a dynamic fuse opportunity. And the suboptimal encoding that is forced upon us by requiring that both halves of the 64-bit instruction be valid instructions means that we still don't have lots of instruction space available.

Look, anything CAN be done -- at the cost of sub-optimality. But at some point the payoff, compared to struggling along, becomes substantial.
Or you become the US, unwilling to ever put up with five years of pain for the sake of getting your measurement system in sync with the rest of the world, so that you have endless on-going low-level pain. (Or the equivalent in x86 where, sure, keep telling us all that x86 imposes no burden whatever, even though Intel's creation of new micro-architectures seems to have ground to a halt...)

name99 · Sep 26, 2020

Thala said:
This is not the case.
"op a,b,c" is totally equivalent with "mov a,b; op a,a,c". No more space in form of temporary registers at all. Destructive opcodes have never been an issue for vectorization.
It has some impact on the microarchitecture implementation, because you want to have the additional move for free as often as possible.

Look, the SVE designers specifically called out that the instruction encoding space restricted their choices. I don't know why you're so insistent on arguing that a sub-optimal solution is in fact superior when we've specifically been told that is not the case.
Go look at

https://alastairreid.github.io/papers/sve-ieee-micro-2017.pdf

especially section 4.

soresu · Sep 26, 2020

Doug S said:
Yes, but that's because SVE2 is supported as an optional part of ARMv8, and NEON is a mandatory part of v8. If ARM revs to v9 and makes SVE2 mandatory, why in the world should they keep NEON around?

The situation is not nearly so simple as that.

There is definitely far more 'legacy' v8-A NEON code now than there was v7-A code around when Cortex A57 based products finally brought v8-A into the mainstream, let alone when Apple released A7 iRectangles.

Even if a v9-A core came out next year it would still take years for SVE2 code to replace all the NEON assembly in current applications.

This may go faster in the current age of ML/AI, given the possibility of ML-augmented compilers generating auto-vectorized code as good as or better than human handwritten assembly.

But until we actually start seeing such things in practical use in GCC or LLVM it's pointless to hang anything on it speeding up a transition period.

All those JIT and AOT compilers for Javascript, Webassembly and all the assorted dynarec JIT's in the gamut of emulators will take no small amount of time to be converted to SVE2 - and given many are already pretty performant on NEON code with more modern A76+ cores I doubt that devs would be in a great hurry to deprecate NEON either.

Either way the only reason 32 bit support is completely deprecated on Apple cores is because they completely control the product hardware and software stacks and the app store that they draw from.

32 bit (v7-A/ARM32) support is still in every big and little core ARM has announced even as recently as A78/X1.

OTOH I do believe that v9-A will forcibly deprecate running ARM32 code by anything but emulation, which absolutely makes sense at this point, now that even Google Play stopped accepting new apps and updates that ship only ARM32 code more than a year ago.

I just hope that ARM/Linaro are working with Google, Linux et al to make sure that ARM32 emulation where necessary is still performant.

A/// · Sep 26, 2020

I think it's time I dust off my ARM books I bought and barely ever read. This thread is incredibly interesting but there's large gaps in my knowledge where ARM is concerned. I was relaying my frustration with taking it all in with another member. @soresu's post is the kick in the butt I need to stop wasting time and pick up my books. Provided I can find them.

Thala · Sep 26, 2020

name99 said:
Look, the SVE designers specifically called out that the instruction encoding space restricted their choices. I don't know why you're so insistent on arguing that a sub-optimal solution is in fact superior when we've specifically been told that is not the case.

I am insistent on the fact, that it is a non issue for vectorization. I am insistent on the fact that there is no additional register pressure. Finally i did already mention, that you have to cope with the issue at microarchitectural level, which is relatively trivial - not sure where you ever read something about "superior". Would be nice, if you would follow the discussion a bit more carefully next time - i was replying to some very specific statements.

Having either fixed 64 bit instructions or, god forbid, variable length instructions has much bigger downsides in the bigger picture. The only thing i would give a nod is a mixed 32bit 64bit ISA but as i said, i do not consider the current approach much of a downer.

Doug S · Sep 26, 2020

soresu said:
The situation is not nearly so simple as that.

There is definitely far more 'legacy' v8-A NEON code now than there was v7-A code around when Cortex A57 based products finally brought v8-A into the mainstream, let alone when Apple released A7 iRectangles.

Even if a v9-A core came out next year it would still take years for SVE2 code to replace all the NEON assembly in current applications.

Can you name one ARM server application that isn't open source? If you have the source you don't care, you just recompile for a v9 target. Besides, even Apple took a few years to phase out ARMv7 and that was within the iOS ecosystem where they have pretty tight control, so there will be plenty of time for people to recompile, and newer versions of the open source software to come out that has optimized hand generated SVE2 code where that might matter.

Just as a thought experiment, let's say ARM formally announced the v9 spec Monday. Given Nuvia's stated timelines for late '21/early '22 do you think there would be any reason at all for them to support v8? They could go v9 only from their very first implementation, and never be able to run NEON code. If they do deliver the fastest ARM server cores around, surely it would be worth their customers' trouble to recompile.

ThatBuzzkiller · Sep 27, 2020

Thala said:
This is not the case.
"op a,b,c" is totally equivalent with "mov a,b; op a,a,c". No more space in form of temporary registers at all. Destructive opcodes have never been an issue for vectorization.
It has some impact on the microarchitecture implementation, because you want to have the additional move for free as often as possible.

Technically, a destructive 2-operand format will use an extra temporary register in that case ...

2-operand format:

Suppose reg0 = a and reg1 = b reg2 = undefined (you start out with 2 registers here)
mov src0:reg0 dest:reg2 (you are now using 3 registers, you started with 2 but now you need a temporary register here to copy value from register #0 to register #2)
op src0:reg0 src1:reg1 dest:reg0 (3 registers used in total)

3-operand format:

Suppose reg0 = a and reg1 = b reg2 = undefined (you start out with 2 registers here)
op src0:reg0 src1:reg1 dest:reg2 (you end with 3 registers used but notice that's there's no intermediary step here)

That potentially means spilling can happen with destructive 2-operand instructions during the copying step, unlike with the constructive 3-operand format, even though both cases start with 2 registers and end with 3 registers used in total. If you want to reuse data, a machine with a destructive 2-operand format will have to use 3 registers earlier, before performing the operation. This ultimately means that a destructive 2-operand format will comparatively place more limitations on data reuse in some programs and thus will complicate compiler design ...

Now that I've elaborated on further what I meant by "temporary register", I will tackle the subject of predication and masking ...

The problem with SVE here isn't the destructive instructions, since they support masking. It's the constructive instructions, which lack support for masking. Masking is helpful for auto-vectorization since some loops may contain conditionals. Unfortunately with SVE, not every instruction supports predication or masking, so some loops don't get auto-vectorized ...

ARM Ltd might have to go back to the drawing board again and create a new SIMD extension if compiler issues start to crop up because it'll discourage adoption among other vendors since it's added implementation complexity with nearly no benefit ...
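
For readers unfamiliar with why masking matters for loops with conditionals, here is a small Python simulation. It is a sketch only: `masked_add`, `scalar_loop` and `vector_loop` are hypothetical names, the 4-lane vector width is an assumption, and merging predication (inactive lanes keep the old destination value) is the behaviour being modelled. The loop `if x[i] > 0: y[i] += x[i]` is vectorized by turning the condition into a per-lane predicate.

```python
def masked_add(dst, src, pred):
    """Destructive, predicated add: dst[k] += src[k] where pred[k] is set.

    Inactive lanes keep dst's old value (merging predication).
    """
    return [d + s if p else d for d, s, p in zip(dst, src, pred)]


def scalar_loop(x, y):
    """Reference version: a plain loop with a conditional in the body."""
    y = list(y)
    for i in range(len(x)):
        if x[i] > 0:
            y[i] += x[i]
    return y


def vector_loop(x, y, lanes=4):
    """Vectorized version: the branch becomes a predicate per lane."""
    y = list(y)
    for i in range(0, len(x), lanes):
        xs, ys = x[i:i + lanes], y[i:i + lanes]
        pred = [v > 0 for v in xs]       # vector compare yields the mask
        y[i:i + lanes] = masked_add(ys, xs, pred)
    return y


x = [1, -2, 3, 0, 5, -6]
y = [10] * 6
scalar_result = scalar_loop(x, y)
vector_result = vector_loop(x, y)
```

Without a predicated form of the add, the vectorizer would need some other way to leave the inactive lanes untouched, such as a separate select step.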

Thala · Sep 27, 2020

ThatBuzzkiller said:
Technically, a destructive 2-operand format will use an extra temporary register in that case ...

2-operand format:

Suppose reg0 = a and reg1 = b reg2 = undefined (you start out with 2 registers here)
mov src0:reg0 dest:reg2 (you are now using 3 registers, you started with 2 but now you need a temporary register here to copy value from register #0 to register #2)
op src0:reg0 src1:reg1 dest:reg0 (3 registers used in total)

3-operand format:

Suppose reg0 = a and reg1 = b reg2 = undefined (you start out with 2 registers here)
op src0:reg0 src1:reg1 dest:reg2 (you end with 3 registers used but notice that's there's no intermediary step here)

That potentially means spilling can happen with destructive 2-operand instructions during the copying step, unlike with the constructive 3-operand format, even though both cases start with 2 registers and end with 3 registers used in total. If you want to reuse data, a machine with a destructive 2-operand format will have to use 3 registers earlier, before performing the operation. This ultimately means that a destructive 2-operand format will comparatively place more limitations on data reuse in some programs and thus will complicate compiler design ...

Now that I've elaborated on further what I meant by "temporary register", I will tackle the subject of predication and masking ...

The problem with SVE here isn't the destructive instructions, since they support masking. It's the constructive instructions, which lack support for masking. Masking is helpful for auto-vectorization since some loops may contain conditionals. Unfortunately with SVE, not every instruction supports predication or masking, so some loops don't get auto-vectorized ...

ARM Ltd might have to go back to the drawing board again and create a new SIMD extension if compiler issues start to crop up because it'll discourage adoption among other vendors since it's added implementation complexity with nearly no benefit ...

There is so much wrong with your post i am only going to make 2 comments:
1) Spilling is explicit and not something that just happens. The compiler will not add additional spilling in either variant - the code is either feasible with all 3 active registers allocated or not...there is nothing in between. You should ask yourself, where the compiler would add spilling and why.
2) The compiler will always use the predicated version wherever necessary. There is nothing forcing the compiler to use the non-destructive versions.

ThatBuzzkiller · Sep 27, 2020

Thala said:
There is so much wrong with your post i am only going to make 2 comments:
1) Spilling is explicit and not something that just happens. The compiler will not add additional spilling in either variant

That's a misconception on your part. The compiler can mostly do whatever it wants including implicit register allocation to give whatever result is desired by the programmer ... (spilling can happen behind your back like it or not)

Thala said:
- the code is either feasible with all 3 active registers allocated or not...there is nothing in between. You should ask yourself, where the compiler would add spilling and why.

I explained this to you already and it's absolutely not true that nothing happens in between with the destructive 2-operand instructions. You have to allocate more memory if you want to reuse the data before doing the operation in hand. You'd have to allocate as much as 50% more memory before doing any actual work in your program ...

Also, the other trade-off with having less operands is a longer program which can increase the time it takes to execute the program ...

Thala said:
2) The compiler will always use the predicated version wherever necessary. There is nothing forcing the compiler to use the non-destructive versions.

What if there isn't a predicated version of these instructions ?

Thala · Sep 28, 2020

ThatBuzzkiller said:
That's a misconception on your part. The compiler can mostly do whatever it wants including implicit register allocation to give whatever result is desired by the programmer ... (spilling can happen behind your back like it or not)
I explained this to you already and it's absolutely not true that nothing happens in between with the destructive 2-operand instructions. You have to allocate more memory if you want to reuse the data before doing the operation in hand. You'd have to allocate as much as 50% more memory before doing any actual work in your program ...

Look, I am arguing from the viewpoint of the compiler and not from the viewpoint of the high level programmer. So no, spilling does not just happen - it is a deliberate decision by the compiler because it is running out of free registers (e.g. the associated graph coloring problem has no solution)
It could very well be, that if you are developing a compiler, it would emit additional spilling code in such situations - i cant argue against this. However any reasonable compiler would not because it is not necessary.
In theory you could convince yourself, that the instance of the graph coloring problem is precisely the same for both variants. If this is too abstract, I make it easy. Show a single example where a compiler would need to add additional spilling code. And while you search for such an example, you will convince yourself that such an example does not exist.
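
One way to check the claim on the specific example from this thread is to count live registers at each program point, which is the quantity a graph-coloring allocator cares about. The sketch below is a toy backward liveness pass: the `max_live` helper and the (defs, uses) encoding are hypothetical, and it assumes b and c are still needed afterwards, i.e. the data-reuse case. It covers only this one sequence and is not a general proof.

```python
def max_live(instrs, live_out):
    """Return the peak number of simultaneously live registers.

    instrs is a list of (defs, uses) set pairs in program order;
    live_out is the set of registers still needed after the last one.
    """
    live = set(live_out)
    peak = len(live)
    for defs, uses in reversed(instrs):
        # Standard backward step: live-in = (live-out - defs) | uses
        live = (live - defs) | uses
        peak = max(peak, len(live))
    return peak


# 3-operand form: op a,b,c
three_op = [({"a"}, {"b", "c"})]

# 2-operand form: mov a,b ; op a,a,c
two_op = [({"a"}, {"b"}), ({"a"}, {"a", "c"})]

# Data-reuse case: the result a and both inputs are needed afterwards.
live_out = {"a", "b", "c"}

peak_three = max_live(three_op, live_out)
peak_two = max_live(two_op, live_out)
```

In this example both forms peak at the same number of live registers; the 2-operand form reaches that peak one instruction earlier, after the move.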

What if there isn't a predicated version of these instructions ?

Well the ISA would be flawed then. Just to be sure, that was not the argument so far?

ThatBuzzkiller · Sep 28, 2020

Thala said:
Look, I am arguing from the viewpoint of the compiler and not from the viewpoint of the high level programmer. So no, spilling does not just happen - it is a deliberate decision by the compiler because it is running out of free registers (e.g. the associated graph coloring problem has no solution)
It could very well be, that if you are developing a compiler, it would emit additional spilling code in such situations - i cant argue against this. However any reasonable compiler would not because it is not necessary.
In theory you could convince yourself, that the instance of the graph coloring problem is precisely the same for both variants. If this is too abstract, I make it easy. Show a single example where a compiler would need to add additional spilling code. And while you search for such an example, you will convince yourself that such an example does not exist.

I concede that I might not be able to think up such an example but I think I've supported the case for my original argument that you would need a temporary register or additional memory during data reuse with a destructive 2-operand format machine ...

I'm not quite sure I would call them the 'same' because the "previous state" between the 2-operand and 3-operand machines are different ...

2-operand format: The previous state can contain 3 registers and the next state will be 3 registers.
3-operand format: The previous state has EXACTLY 2 registers and the next state will be 3 registers.

While the total registers used in the end are both the same, their prior state isn't so we can't exactly be certain that they'll have the same behaviour in all cases during execution ...

Thala said:
Well the ISA would be flawed then. Just to be sure, that was not the argument so far?

Would that make SVE a flawed ISA extension ? Right now there are a few instructions that come exclusively without masking and there aren't alternative versions that comes with masking support ...
