Discussion: ARM Cortex/Neoverse IP + SoCs (no custom cores)


Nothingness

Diamond Member
Jul 3, 2013
3,292
2,360
136
Meh, really? What is really the point then?

I thought it would just be suboptimal but still work without recompiling (after recompiling it would be suboptimal anyway, because you can have your abstracted length-agnostic SIMD, but it turns out that idea sucks in practice, w.r.t. shuffle instructions for example).
Cf. an example in the post above.

My general vector knowledge is limited, so I might be off, but to me it looks VL agnostic in that simple example.
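For what it's worth, here's a minimal sketch of what I mean, using the C intrinsics for both (just an elementwise float add, so it deliberately avoids the shuffle cases mentioned above; I'm going from the ACLE and RVV v1.0 intrinsics specs as I understand them):

```c
#include <stddef.h>
#include <stdint.h>
#include <arm_sve.h>       /* SVE ACLE intrinsics (AArch64 build)  */
#include <riscv_vector.h>  /* RVV v1.0 intrinsics (RISC-V build)   */
/* In reality these two functions live in separate translation units for
   their respective targets; they are shown together only for comparison. */

/* SVE: no vector width appears in the source. svcntw() reports how many
   32-bit lanes this implementation has, and the whilelt predicate masks
   off the tail of the array. */
void add_sve(float *dst, const float *a, const float *b, int64_t n) {
    for (int64_t i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32(i, n);
        svfloat32_t va = svld1(pg, a + i);
        svfloat32_t vb = svld1(pg, b + i);
        svst1(pg, dst + i, svadd_x(pg, va, vb));
    }
}

/* RVV: vsetvl asks the hardware how many elements it will process this
   iteration, which also takes care of the tail. */
void add_rvv(float *dst, const float *a, const float *b, size_t n) {
    while (n > 0) {
        size_t vl = __riscv_vsetvl_e32m1(n);
        vfloat32m1_t va = __riscv_vle32_v_f32m1(a, vl);
        vfloat32m1_t vb = __riscv_vle32_v_f32m1(b, vl);
        __riscv_vse32_v_f32m1(dst, __riscv_vfadd_vv_f32m1(va, vb, vl), vl);
        a += vl; b += vl; dst += vl; n -= vl;
    }
}
```

Neither version names a vector width anywhere, which is the point; the disagreement above is more about how the less regular operations (shuffles and friends) behave.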
 

camel-cdr

Member
Feb 23, 2024
30
95
51
  • Like
Reactions: Nothingness

naukkis

Golden Member
Jun 5, 2002
1,004
849
136
Thanks a lot! I won't claim I understand the generated R-V code, but it looks similar to the AArch64 one.

It does not. RVV code is a vector ISA, extremely similar to the old Cray syntax. SVE code instead is clearly SIMD, and the compiler does the vector-length handling instead of the hardware. RVV code is really simple and easy to debug; SVE is the direct opposite.
 

naukkis

Golden Member
Jun 5, 2002
1,004
849
136
SVE's problem is that only part of its vector instructions are vector-length agnostic. I really don't understand what kind of designer comes up with that kind of solution and is satisfied with it. So to write actually useful code for SVE, the programmer has to target a specific vector length, or code so that it works with every execution width (which is practically impossible). So SVE is only useful when it's used with fixed-width vectors, and ARM is pretty much locking it to 128-bit vectors, which is pretty much their best hope of getting any support for SVE.

RVV is different: it's a real vector ISA, and code written for it runs on SIMD units of different widths; the hardware does the extra work needed to run on different-width execution units. And it shows: there are plenty of RVV designs, whereas SVE designs come from ARM only and suck.
 

FlameTail

Diamond Member
Dec 15, 2021
4,384
2,761
106
SVE's problem is that only part of its vector instructions are vector-length agnostic. I really don't understand what kind of designer comes up with that kind of solution and is satisfied with it. So to write actually useful code for SVE, the programmer has to target a specific vector length, or code so that it works with every execution width (which is practically impossible). So SVE is only useful when it's used with fixed-width vectors, and ARM is pretty much locking it to 128-bit vectors, which is pretty much their best hope of getting any support for SVE.

RVV is different: it's a real vector ISA, and code written for it runs on SIMD units of different widths; the hardware does the extra work needed to run on different-width execution units. And it shows: there are plenty of RVV designs, whereas SVE designs come from ARM only and suck.
So does it make sense for ARM to implement something like RVV in say... ARMv10?
 

naukkis

Golden Member
Jun 5, 2002
1,004
849
136
No. Vector ISAs universally suck dog balls. Copy AVX512/10/yaddayadda.

Any particular reason for that? Coding for a vector ISA is actually possible; SIMD coding instead is just self-torture, and it doesn't matter how long the vector types a CPU supports are, they are mostly left unused. So the best solutions are either not using SIMD at all or going to a vector ISA. SIMD is a middleman for masochists to use.
 

adroc_thurston

Diamond Member
Jul 2, 2023
6,039
8,527
106
Any particular reason for that?
SIMD actually represents the underlying hardware reasonably well.
SIMD coding instead is just self-torture, and it doesn't matter how long the vector types a CPU supports are, they are mostly left unused. So the best solutions are either not using SIMD at all or going to a vector ISA. SIMD is a middleman for masochists to use.
is this the magical autovec compiler shilling?
 

naukkis

Golden Member
Jun 5, 2002
1,004
849
136
SIMD actually represents the underlying hardware reasonably well.

But why add hardware in a way that makes using that hardware a bigger logical problem?

is this the magical autovec compiler shilling?

No. Vector ISA hardware abstracts away the hard parts of SIMD programming to the point where coding is almost the same for scalar and vector targets. That abstraction also makes it possible to be vector-length agnostic. Some hardware designers are against vector ISAs, but from the software point of view you want a vector ISA instead of SIMD every time.
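As a trivial illustration of that claim: the scalar version of the loop sketched earlier in the thread has the same shape, and the vector-ISA versions only add the per-iteration vector-length or predicate bookkeeping:

```c
#include <stddef.h>

/* Scalar reference loop: the RVV/SVE intrinsic versions sketched earlier in
   the thread keep this exact structure and only add a per-iteration vector
   length (RVV) or predicate (SVE). */
void add_scalar(float *dst, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```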
 

adroc_thurston

Diamond Member
Jul 2, 2023
6,039
8,527
106
But why add hardware in a way that makes using that hardware a bigger logical problem?
Performance.
Vector ISA hardware abstracts away the hard parts of SIMD programming to the point where coding is almost the same for scalar and vector targets.
That's every bit as bullshit as 'SIMT' and you know it.
but from the software point of view you want a vector ISA instead of SIMD every time.
Software people hate performance.
 

naukkis

Golden Member
Jun 5, 2002
1,004
849
136
Performance.

That's every bit as bullshit as 'SIMT' and you know it.

Software people hate performance.

You know that "performance" is nowadays coming from SIMT GPUs? Vector ISA cpu can rival those GPU-solutions - pretty much next gen supercomputer CPUs will be RVV-based.
 

camel-cdr

Member
Feb 23, 2024
30
95
51
No. Vector ISAs universally suck dog balls. Copy AVX512/10/yaddayadda.
Take the AVX10 spec, change the encodings of AVX10/128, AVX10/256 and AVX10/512 to overlap, remove the 0.1% of instructions that don't make sense anymore, and add an instruction that returns the vector length.
Now you have a scalable vector ISA, and it's possible to write length-agnostic code.
This maps even better to the hardware than AVX-512/VL, because you don't need to pretend you natively support 128-bit and 256-bit wide execution units when you only have 512-bit wide execution units.
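For comparison, here's roughly what the masked loop looks like with today's real AVX-512 intrinsics. The only thing standing between this and length-agnostic source is that the 16-lane width is a compile-time constant rather than something the code can query at run time; the query instruction proposed above does not exist today:

```c
#include <immintrin.h>
#include <stddef.h>

/* Today's fixed-width AVX-512 version: masked, so no scalar tail loop is
   needed, but "16 lanes" is baked in at compile time. Replacing that
   constant with a runtime vector-length query is the missing piece for
   making this source length-agnostic. */
void add_avx512(float *dst, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i += 16) {
        size_t remaining = n - i;
        __mmask16 m = (remaining >= 16) ? (__mmask16)0xFFFF
                                        : (__mmask16)((1u << remaining) - 1);
        __m512 va = _mm512_maskz_loadu_ps(m, a + i);
        __m512 vb = _mm512_maskz_loadu_ps(m, b + i);
        _mm512_mask_storeu_ps(dst + i, m, _mm512_add_ps(va, vb));
    }
}
```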
 

MS_AT

Senior member
Jul 15, 2024
743
1,509
96
Take the AVX10 spec, change the encodings of AVX10/128, AVX10/256 and AVX10/512 to overlap, remove the 0.1% of instructions that don't make sense anymore, and add an instruction that returns the vector length.
Now you have a scalable vector ISA, and it's possible to write length-agnostic code.
This maps even better to the hardware than AVX-512/VL, because you don't need to pretend you natively support 128-bit and 256-bit wide execution units when you only have 512-bit wide execution units.
Depends on how you want to treat it, but I would say Zen 4 did the pretending quite well, if the other way around ;)
 
  • Like
Reactions: Tlh97 and camel-cdr

FlameTail

Diamond Member
Dec 15, 2021
4,384
2,761
106
Rumour:

"News has come out about the Dimensity 9500 and Snapdragon 8 Elite 2. Both chips are expected to see a 20% increase in single and multi performance thanks to SME. (For the Snapdragon 8E2, the single-core score is 4000 on the GB6.)

By the way, the 8G5 uses a mix of Samsung Foundry SF2 and TSMC N3P"

Source

SF2 is a renamed node, previously known as SF3P.
The wording of this rumour suggests that Dimensity 9500 will also have SME.

If true, this means the next triplet of ARM Cortex cores (X930, A730, A530) will have SME support!

That hardly surprises me. I knew it was only a matter of time before stock ARM cores got SME, ever since ARM announced KleidiAI this year.

I am very curious how ARM will implement SME.

Apple has been the first and only vendor so far to implement SME. The way they have done it is that the SME calculations are done by a coprocessor. The SME block sits outside the CPU cores and is shared by the cores in the cluster. Each cluster gets one SME block.
[Die shot: Apple SoC with the SME/AMX blocks labelled]
You can see the SME/AMX blocks labelled in the above dieshot. The P-core cluster has one block, and the E-cluster has one block.
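For a sense of what that block actually executes, here's a rough sketch of a single SME outer-product accumulate using the ACLE SME intrinsics. I'm going from Arm's published ACLE material, so treat the exact intrinsic and attribute names as my assumption rather than gospel:

```c
#include <arm_sme.h>   /* ACLE SME intrinsics; pulls in the SVE types */
#include <stdint.h>

/* Sketch: one rank-1 (outer product) update of two float vectors accumulated
   into ZA tile 0. FMOPA is the core SME instruction; a matmul kernel streams
   many of these per tile before reading the tile back out. The __arm_*
   keyword attributes mark the function as running in streaming mode and
   updating the caller's ZA state. */
void rank1_update(const float *a, const float *b, uint64_t n)
    __arm_streaming __arm_inout("za")
{
    svbool_t prow = svwhilelt_b32((uint64_t)0, n);   /* predicate over tile rows    */
    svbool_t pcol = svwhilelt_b32((uint64_t)0, n);   /* predicate over tile columns */
    svfloat32_t va = svld1(prow, a);
    svfloat32_t vb = svld1(pcol, b);
    svmopa_za32_m(0, prow, pcol, va, vb);            /* ZA0 += va * vb^T */
}
```

Feeding loads like those at high enough bandwidth is exactly where the cache hierarchy question below comes in.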

Considering that Nuvia was a scion of the Apple CPU team, and that the Oryon CPU has a similar topology to Apple's CPUs (clusters of 2-6 cores with a big shared L2), it's safe to assume that Qualcomm's SME implementation in Oryon will be very similar to Apple's.

But how will ARM do it?

As far as I know, Apple's way isn't the only way to implement SME. ARM could give each core its own private SME block that is part of the CPU core itself. However, this means the matrix throughput won't be as high as Apple's, because it would not be feasible to give each core a large private SME block (the die area/cost would be prohibitive). On the other hand, latency could be lower than with Apple's approach, since the SME block would be inside the CPU core itself.

Or ARM could implement SME Apple's way, by sharing an SME block across a cluster of cores. But they'll have to leap over several obstacles to do that:

Firstly, ARM doesn't have a cache hierarchy like Apple's.

ARM = pL1/pL2/sL3
Apple = pL1/sL2

As I understand it, Apple's low latency and high capacity shared L2 is crucial for feeding the SME block.

ARM's L2 is low latency, but the capacity is not high enough. The fact that it's private is also challenging, as this means you cannot have a shared SME block by connecting it to the private L2.

On the other hand, ARM could connect the SME block to the shared L3 cache. But the L3 latency is higher, and the L3 is shared amongst a larger number of cores. ARM's latest DSU supports up to 14 cores. Apple's maximum cluster size is 6 cores at the moment, so that's the maximum number of cores an Apple SME block has to serve. If ARM puts a shared SME block across 14 cores, it might end up spread too thin.

Please go ahead and share your own views on this matter, and correct me if I am mistaken.
 

soresu

Diamond Member
Dec 19, 2014
3,899
3,331
136
Anyone got a solid idea of the IPC increase from A715 -> A720?

I'm trying to make guesses about A730 perf over A76 for the upcoming RK3688, but actual IPC gain figures seem to be difficult to find on A710+, and of course we don't have any data yet on A725 beyond ARM's own anaemic PR.

They claim that A715 matches IPC (presumably scalar) with X1, so that would make it a round 30% IPC gain over A77, with A77 being a claimed 20% over A76.

So theoretically A76 -> A715 = +56% IPC.

Even without clock gains, that should make RK3688-based products a decent upgrade over their predecessors on that alone.

Can someone link me that site which shows µArch design layouts for ARM CPUs?
 

soresu

Diamond Member
Dec 19, 2014
3,899
3,331
136
Multiplying the 1.56 from A76 -> A715 gives a total figure of 1.81x for A76 -> A720.
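Spelling the compounding out (the A715 -> A720 step is just back-derived from the 1.81x total, so treat that ~16% figure as an assumption rather than an ARM claim):

```c
#include <stdio.h>

int main(void) {
    double a76_to_a77   = 1.20;  /* ARM's claimed +20% for A77 over A76        */
    double a77_to_a715  = 1.30;  /* A715 ~= X1 ~= A77 + 30%                    */
    double a715_to_a720 = 1.16;  /* assumed; back-derived from the 1.81x total */

    double a76_to_a715 = a76_to_a77 * a77_to_a715;    /* ~1.56 */
    double a76_to_a720 = a76_to_a715 * a715_to_a720;  /* ~1.81 */

    printf("A76 -> A715: %.2fx\nA76 -> A720: %.2fx\n", a76_to_a715, a76_to_a720);
    return 0;
}
```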

Sweet - RK3688 isn't going to have an X µArch CPU core, but it should be great for emulation, and perhaps a bit of FEX emu + Proton action with older Windows games too.

Hopefully by then PanVK will be up to parity with Turnip so it at least supports as much of the DXVK and Zink specific VK extensions as possible.
 