Rumour:
"News has come out about the Dimensity 9500 and Snapdragon 8 Elite 2. Both chips are expected to see a 20% increase in single and multi performance thanks to SME. (For the Snapdragon 8E2, the single-core score is 4000 on the GB6.)
By the way, the 8G5 uses a mix of Samsung Foundry SF2 and TSMC N3P"
Source
SF2 is a renamed node, previously known as SF3P.
The wording of this rumour suggests that Dimensity 9500 will also have SME.
If true, this means the next triplet of ARM Cortex cores (X930, A730, A530) will have SME support!
That hardly surprises me. I knew it was only a matter of time before stock ARM cores got SME, ever since ARM announced KleidiAI this year.
I am very curious how ARM will implement SME.
Apple has been the first and only vendor so far to implement SME. The way they have done it is that the SME calculations are handled by a coprocessor. The SME block sits outside the CPU cores and is shared by the cores in the cluster. Each cluster gets one SME block.

You can see the SME/AMX blocks labelled in the above dieshot. The P-core cluster has one block, and the E-core cluster has one block.
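For context on what that coprocessor actually computes: SME's workhorse operation is an outer-product accumulate into a ZA tile. Below is a scalar C sketch of that operation, purely as a reference model of the architectural instruction, not of Apple's (or anyone's) hardware; the 512-bit streaming vector length (and hence the 16x16 FP32 tile) is an assumption for illustration.

```c
/* Scalar reference model of SME's FP32 outer-product accumulate (FMOPA):
 * za[i][j] += zn[i] * zm[j].
 * Assumption for illustration: 512-bit SVL -> 16 FP32 lanes, 16x16 FP32 tile. */
#include <stddef.h>

#define SVL_F32 16  /* assumed: 512-bit SVL / 32-bit elements */

void fmopa_ref(float za[SVL_F32][SVL_F32],
               const float zn[SVL_F32],
               const float zm[SVL_F32])
{
    for (size_t i = 0; i < SVL_F32; i++)
        for (size_t j = 0; j < SVL_F32; j++)
            za[i][j] += zn[i] * zm[j];
}
```

The key property is that one such operation performs O(SVL^2) multiply-accumulates while reading only O(SVL) fresh operand data, which is part of why a single dedicated block per cluster can be so area-efficient.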
Considering that Nuvia was a scion of the Apple CPU team, and that the Oryon CPU has a similar topology to Apple's CPUs (clusters of 2-6 cores with a big shared L2), it's safe to assume that Qualcomm's SME implementation in Oryon will be very similar to Apple's.
But how will ARM do it?
As far as I know, Apple's way isn't the only way to implement SME. ARM could give each core its own private SME block that is part of the CPU core itself. In that case the matrix throughput won't be as high as Apple's, because it would not be feasible to give each core a large private SME block (the die area/cost would be prohibitive). However, latency could be lower than with Apple's approach, since the SME block would be inside the CPU core itself.
Or ARM could implement SME the way Apple does, by sharing an SME block across a cluster of cores. But they'll have to leap over several obstacles to do that:
Firstly, ARM doesn't have a cache hierarchy like Apple's.
ARM = private L1 / private L2 / shared L3
Apple = private L1 / shared L2
As I understand it, Apple's low latency and high capacity shared L2 is crucial for feeding the SME block.
ARM's L2 is low latency, but its capacity is not high enough. The fact that it's private is also a challenge: you cannot create a shared SME block by connecting it to a per-core private L2.
On the other hand, ARM could connect the SME block to the shared L3 cache. But L3 latency is higher, and the L3 is shared amongst a larger number of cores. ARM's latest DSU supports up to 14 cores, while Apple's maximum cluster size is currently 6 cores, so 6 is the most cores an Apple SME block ever has to serve. If ARM puts a single shared SME block behind 14 cores, it could end up spread too thin.
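To put rough numbers on the "feeding" problem, here is a hedged back-of-envelope in C. All figures (512-bit SVL, one FP32 outer product retired per cycle, a 3 GHz clock) are my own assumptions for illustration, not vendor data; the point is only that even one SME block can demand hundreds of GB/s of operand traffic, which is much easier to satisfy from a big, low-latency shared L2 than from a more distant L3.

```c
/* Back-of-envelope operand bandwidth for a single SME block.
 * All parameters below are assumptions for illustration only. */
#include <stdio.h>

int main(void)
{
    const int    svl_bits  = 512;            /* assumed streaming vector length */
    const int    lanes     = svl_bits / 32;  /* 16 FP32 lanes                   */
    const double clock_ghz = 3.0;            /* assumed SME block clock         */

    const int macs_per_op   = lanes * lanes; /* 256 MACs per FP32 outer product */
    const int operand_bytes = 2 * lanes * 4; /* 128 B of fresh inputs per op    */

    /* If one outer product retires per cycle and both input vectors are
     * streamed in from cache each cycle: */
    printf("MACs per outer product: %d\n", macs_per_op);
    printf("Fresh operand bytes per outer product: %d\n", operand_bytes);
    printf("Required input bandwidth: %.0f GB/s\n", operand_bytes * clock_ghz);
    return 0;
}
```

In a real tiled GEMM kernel the input vectors get reused across several ZA tiles, so the sustained demand is lower than this worst case, but the data still has to come from somewhere close and fast, which is the author's point about Apple's shared L2.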
Please go ahead and share your own views on this matter, and correct me if I am mistaken.