Pretty sure the clue is in the codename "MATterhorn".
Preeettty sure the latter is in opposition to the former there?
Again, A64FX is a REALLY high end supercomputer chip and they stuck with 2x512 - there is no way Matterhorn has a 2048 bit SVE2 unit unless they went forward 5 to 10 years in time and grabbed the latest fab node to bring back to the past.
Forksheets and VFETs for the win.
At most ARM themselves are targeting datacenters and high end servers with Matterhorn and its immediate successors; 2048 bit is just overkill.
Don't be TOO certain of your "2048 bits is overkill" claims...
How exactly would the bit"width" be measured for matrix registers and operations?
What we DO know is:
- Apple has an AMX unit on the A13 large core. (So SOMEONE thinks that sort of functionality is useful on a "phone" core)
- It's likely that the AMX instructions and functionality are close to what ARM is planning for ARMv9. Not certain, of course, but it would be silly for Apple to go in a gratuitously different direction, and then have to redo the design and compiler support when they move to ARMv9
- Apple says that the AMX unit has 1Tops performance. How do you get there?
Assume 2.5GHz. Let's say one op is an add or a mult, so a MAC gives a factor of 2, on 8-bit data. 1Tops / (2.5GHz x 2) means we need a further amplification of 200, i.e. about 200 MACs per cycle. Well, a 2048-bit register is 256 bytes wide, and there's your factor of ~200...
Of course that's a somewhat bogus comparison because those high TOPs numbers are from small-ish SQUARE matrix-matrix multiplication, not from dot products or even level 2 BLAS (matrix-vector multiply). They reflect/require an aggressive sea of MAC units, but not super large registers. But they DO show how, if you insist on using "width" to talk about the performance of your TPU, that's the sort of number you'd back out to.
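For anyone who wants to check the arithmetic above, here is a rough Python sketch. The 2.5GHz clock, the 2-ops-per-MAC convention and the 8-bit elements are the assumptions from the post. The function names and the square outer-product grid are made up purely for illustration; none of this is a claim about how Apple's AMX or Matterhorn is actually built.

```python
import math

# Assumptions taken from the post, not from any vendor documentation.
TARGET_OPS_PER_SEC = 1e12   # the quoted ~1 Tops figure
CLOCK_HZ = 2.5e9            # assumed 2.5 GHz clock
OPS_PER_MAC = 2             # one multiply + one add
ELEMENT_BITS = 8            # 8-bit data


def macs_per_cycle(target_ops, clock_hz, ops_per_mac):
    """MACs that must complete every cycle to hit the target ops/sec."""
    return target_ops / (clock_hz * ops_per_mac)


def lanes_in_register(register_bits, element_bits):
    """How many elements of a given size fit in one register."""
    return register_bits // element_bits


def square_grid_side(macs_needed):
    """Smallest N such that a hypothetical N x N MAC grid covers macs_needed."""
    return math.isqrt(macs_needed - 1) + 1


needed = math.ceil(macs_per_cycle(TARGET_OPS_PER_SEC, CLOCK_HZ, OPS_PER_MAC))
lanes_2048 = lanes_in_register(2048, ELEMENT_BITS)

# "Width" you would back out if every MAC had its own 8-bit lane.
backed_out_width_bits = needed * ELEMENT_BITS

# Alternative: an N x N grid fed by two N-element vectors (outer product).
side = square_grid_side(needed)
operand_vector_bits = side * ELEMENT_BITS

print(f"MACs per cycle needed:        {needed}")
print(f"8-bit lanes in 2048 bits:     {lanes_2048}")
print(f"Backed-out 'width' in bits:   {backed_out_width_bits}")
print(f"Square grid side:             {side} ({side * side} MACs)")
print(f"Operand vector width in bits: {operand_vector_bits}")
```

The last two lines are the point: the same MAC count can be reached with operand vectors far narrower than 2048 bits if the MACs are arranged as a grid, which is why backing a "width" out of a TOPS number says little about register size.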
Don't confuse two different issues:
- the width of the "registers" used for the TPU part of A13 and Matterhorn AND
- the width of the SVE/SVE2 registers used by Matterhorn (and whatever future Apple core adds SVE/2)
We don't know that these even share registers, or how they share them.
Basically we have wandered into the point that EVERY tech discussion eventually wanders into, where a concept that was useful five years ago (eg "the nm of a process"...) continues to be used far beyond the point where it is of engineering relevance, because most of the participants in the discussion are more interested in horse races and scoring points than in understanding/accepting/admitting that the world has changed and their old score cards are no longer relevant.