Rumour:
"News has come out about the Dimensity 9500 and Snapdragon 8 Elite 2. Both chips are expected to see a 20% increase in single and multi performance thanks to SME. (For the Snapdragon 8E2, the single-core score is 4000 on the GB6.)
By the way, the 8G5 uses a mix of Samsung Foundry SF2 and TSMC N3P"
Source
SF2 is a renamed node, previously known as SF3P.
The wording of this rumour suggests that Dimensity 9500 will also have SME.
If true, this means the next triplet of ARM Cortex cores (X930, A730, A530) will have SME support!
That hardly surprises me. I knew it was only a matter of time before stock ARM cores got SME, ever since ARM announced KleidiAI this year.
I am very curious how ARM will implement SME.
Apple has been the first and only vendor so far to implement SME. The way they have done it is that the SME calculations are handled by a coprocessor. The SME block sits outside the CPU cores and is shared by the cores in the cluster. Each cluster gets one SME block.

You can see the SME/AMX blocks labelled in the above dieshot. The P-core cluster has one block, and the E-core cluster has one block.
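For context on what that coprocessor actually computes: SME's workhorse operation is an outer-product accumulate into a ZA tile. Below is a scalar C sketch of that operation, purely as a reference model of the architectural instruction, not of Apple's (or anyone's) hardware; the 512-bit streaming vector length (and hence the 16x16 FP32 tile) is an assumption for illustration.

```c
/* Scalar reference model of SME's FP32 outer-product accumulate (FMOPA):
 * za[i][j] += zn[i] * zm[j].
 * Assumption for illustration: 512-bit SVL -> 16 FP32 lanes, 16x16 FP32 tile. */
#include <stddef.h>

#define SVL_F32 16  /* assumed: 512-bit SVL / 32-bit elements */

void fmopa_ref(float za[SVL_F32][SVL_F32],
               const float zn[SVL_F32],
               const float zm[SVL_F32])
{
    for (size_t i = 0; i < SVL_F32; i++)
        for (size_t j = 0; j < SVL_F32; j++)
            za[i][j] += zn[i] * zm[j];
}
```

The key property is that one such operation performs O(SVL^2) multiply-accumulates while reading only O(SVL) fresh operand data, which is part of why a single dedicated block per cluster can be so area-efficient.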
Considering that Nuvia was a scion of the Apple CPU team, and that the Oryon CPU has a similar topology to Apple's CPUs (clusters of 2-6 cores with a big shared L2), it's safe to assume that Qualcomm's SME implementation in Oryon will be very similar to Apple's.
But how will ARM do it?
As far as I know, Apple's way isn't the only way to implement SME. ARM could give each core its own private SME block that is part of the CPU core itself. In that case the matrix throughput won't be as high as Apple's, because it would not be feasible to give each core a large private SME block (the die area/cost would be prohibitive). However, latency could be lower than with Apple's approach, since the SME block would be inside the CPU core itself.
Or ARM could implement SME the way Apple does, by sharing an SME block across a cluster of cores. But they'll have to leap over several obstacles to do that:
Firstly, ARM doesn't have a cache hierarchy like Apple's.
ARM = private L1 / private L2 / shared L3
Apple = private L1 / shared L2
As I understand it, Apple's low latency and high capacity shared L2 is crucial for feeding the SME block.
ARM's L2 is low latency, but its capacity is not high enough. The fact that it's private is also a challenge: you cannot create a shared SME block by connecting it to a per-core private L2.
On the other hand, ARM could connect the SME block to the shared L3 cache. But L3 latency is higher, and the L3 is shared amongst a larger number of cores. ARM's latest DSU supports up to 14 cores, while Apple's maximum cluster size is currently 6 cores, so 6 is the most cores an Apple SME block ever has to serve. If ARM puts a single shared SME block behind 14 cores, it could end up spread too thin.
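To put rough numbers on the "feeding" problem, here is a hedged back-of-envelope in C. All figures (512-bit SVL, one FP32 outer product retired per cycle, a 3 GHz clock) are my own assumptions for illustration, not vendor data; the point is only that even one SME block can demand hundreds of GB/s of operand traffic, which is much easier to satisfy from a big, low-latency shared L2 than from a more distant L3.

```c
/* Back-of-envelope operand bandwidth for a single SME block.
 * All parameters below are assumptions for illustration only. */
#include <stdio.h>

int main(void)
{
    const int    svl_bits  = 512;            /* assumed streaming vector length */
    const int    lanes     = svl_bits / 32;  /* 16 FP32 lanes                   */
    const double clock_ghz = 3.0;            /* assumed SME block clock         */

    const int macs_per_op   = lanes * lanes; /* 256 MACs per FP32 outer product */
    const int operand_bytes = 2 * lanes * 4; /* 128 B of fresh inputs per op    */

    /* If one outer product retires per cycle and both input vectors are
     * streamed in from cache each cycle: */
    printf("MACs per outer product: %d\n", macs_per_op);
    printf("Fresh operand bytes per outer product: %d\n", operand_bytes);
    printf("Required input bandwidth: %.0f GB/s\n", operand_bytes * clock_ghz);
    return 0;
}
```

In a real tiled GEMM kernel the input vectors get reused across several ZA tiles, so the sustained demand is lower than this worst case, but the data still has to come from somewhere close and fast, which is the author's point about Apple's shared L2.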
Please go ahead and share your own views on this matter, and correct me if I am mistaken.