SarahKerrigan
Senior member
- Oct 12, 2014
Having actually WRITTEN code for SVE, I can answer this one easily enough: SVE is vector-length-agnostic and therefore compatible across implementations. You seem to be imagining it as "you run a 2048b instruction and it simply takes 16 cycles" - which is how it would work on a traditional vector machine (SX, Cray X1/X2, etc.). That is NOT how SVE works. Instead, you essentially loop based on the interrogated vector length: if you want to add all elements of a 32-element, 1024-bit vector, for instance, you get the number of elements your hardware supports, do the add operation, then subtract the number of elements computed from 32 on each iteration. So on a 128b machine you end up running the inner loop for eight iterations, while on a 1024b machine the inner loop runs only once. (This is a slight oversimplification, but roll with it.)

OK. So code written for the ARM Fujitsu A64FX with 512-bit SVE, using 512-bit registers, will not run on Matterhorn with 256-bit SVE2? Another thing: 2048-bit SVE2 can be scaled in 128-bit increments, so there are 16 possible SVE2 vector lengths (128 * 16 = 2048)... this looks very messy to me. It would also mean that you cannot pair little cores with big cores of different SIMD widths.
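The loop pattern described above can be sketched in plain C. This is a simulation, not real SVE code: `hw_elems` is a hypothetical stand-in for the element count the hardware reports at run time (what `CNTW` would give you on an actual SVE machine), and the inner scalar loop stands in for a single predicated vector add. The point is that the same loop body works unchanged whether the "machine" is 128-bit (4 floats) or 1024-bit (32 floats).

```c
#include <assert.h>
#include <stddef.h>

/* Simulated vector-length-agnostic reduction: sum n floats, processing
   hw_elems elements per "vector" iteration. hw_elems models the width
   the implementation reports; the code never hard-codes it. */
static float vla_sum(const float *v, size_t n, size_t hw_elems) {
    float total = 0.0f;
    size_t i = 0;
    while (i < n) {                              /* loop on remaining work  */
        size_t chunk = (n - i < hw_elems) ? n - i : hw_elems;
        for (size_t j = 0; j < chunk; j++)       /* one "vector add" step   */
            total += v[i + j];
        i += chunk;                              /* advance by elements done */
    }
    return total;
}
```

For a 32-element sum, a 128-bit machine (`hw_elems = 4`) runs the vector step eight times and a 1024-bit machine (`hw_elems = 32`) runs it once - same code, same result.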
I think this solves Vector Length Agnostic instructions (VLA):
https://indico.math.cnrs.fr/event/4705/attachments/2362/2940/ARM_SVE_tutorial.pdf
- Vectors cannot be initialised from a compile-time constant in memory, so... INDEX Zd.S, #1, #4: Zd = [ 1, 5, 9, 13, 17, 21, 25, 29 ]
- Predicates also cannot be initialised from memory, so... PTRUE Pd.S, MUL3: Pd = [ T, T, T, T, T, T, F, F ]
- Vector loop increment and trip count are unknown at compile time, so... INCD Xi: increment scalar Xi by the number of 64-bit dwords in a vector; WHILELT Pd.D, Xi, Xe: next-iteration predicate Pd = [ while i++ < e ]
- Vector register spill & fill must adjust to the vector length, so... ADDVL SP, SP, #-4: decrement the stack pointer by (4*VL); STR Z1, [SP, #3, MUL VL]: store vector Z1 to address (SP + 3*VL)
page 8.
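The INCD/WHILELT pair from the slide can be modeled in C to show how predication handles the loop tail with no scalar cleanup. This is a hypothetical model, not real SVE: `VL_ELEMS` assumes a fixed 4-doubleword machine width, and `whilelt()` mimics WHILELT Pd.D, Xi, Xe by activating lane k while (i + k) < e and returning whether any lane is active (like branching on the condition flags after the real instruction).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define VL_ELEMS 4  /* assumed machine width: 4 doublewords per vector */

/* Model of WHILELT: lane k is active while (i + k) < e.
   Returns true if any lane is active. */
static bool whilelt(bool pred[VL_ELEMS], size_t i, size_t e) {
    bool any = false;
    for (size_t k = 0; k < VL_ELEMS; k++) {
        pred[k] = (i + k) < e;
        any |= pred[k];
    }
    return any;
}

/* Predicated loop: add 1 to the first n elements of x. When n is not a
   multiple of VL_ELEMS, the final predicate simply masks off the extra
   lanes - inactive lanes touch nothing, so no tail loop is needed. */
static void inc_all(long *x, size_t n) {
    bool p[VL_ELEMS];
    for (size_t i = 0; whilelt(p, i, n); i += VL_ELEMS)  /* INCD-style step */
        for (size_t k = 0; k < VL_ELEMS; k++)
            if (p[k]) x[i + k] += 1;
}
```

With n = 7 and a 4-lane "machine", the second iteration runs with predicate [T, T, T, F], so element 7 is never touched - which is exactly why VLA code needs no compile-time knowledge of the trip count.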
As far as I understand VLA, it will be able to chop an arbitrarily long vector into lengths suitable for the local SIMD unit to process, regardless of that unit's hardware width. So little cores with a slow 128-bit SIMD FPU could be paired with 4x256-bit SIMD FPUs, thanks to VLA. Or do you see VLA working in a different way?
In no case here is anything 2048b. There are no 2048b registers. There is no "add 2048b" instruction. There's just an "add vector" instruction at the machine's vector length, and a way to use it intelligently to compose code streams that scale to longer vector lengths. The only thing that's 2048b is the maximum vector length the hardware is allowed to support.
As for big.little, my expectation would be that SVE implementations within a core cluster must be consistent, but I would be interested to see ways around that.