News ARM Matterhorn

soresu

Platinum Member
Dec 19, 2014
2,617
1,812
136
Found this on the ARM TechCon blog:
[Image: 1570595006623.png]
Someone mentioned this to me recently as something related to ARMv9.0-A, possibly the post Hercules big core.

Link for the ARM TechCon blog here.

A quote under another image:
"MatMul will double CPU GEMM performance"

"Matterhorn will introduce what Arm is calling "Secure-EL2"
Isolating individual processes within secure memory to avoid cross-contamination.
There will also be stronger protections against return oriented programming exploits."
 

soresu

Found this on a related ARM blog from one of the TechCon speakers:

"Since we announced the Cortex-A73, we’ve gradually increased machine learning (ML) performance generation-over-generation and today, we’re working to significantly broaden our CPU coverage for ML. In order to enable this new digital world, we need to push compute to a higher level, which is why we’ve added Matrix Multiply (MatMul) to our next-generation Cortex CPU, “Matterhorn”, effectively doubling ML performance over previous generations."
 

soresu

Someone mentioned this to me recently as something related to ARMv9.0-A, possibly the post Hercules big core.
I did mention Matterhorn as the first ARMv9.0-A architecture on the performance core roadmap here
You were the someone in question; I believe I did ask for sources at the time!

Without a source it becomes 'possibly' instead of definitely.

Thanks for linking that thread back though, I lost the R Pi 4 information PDF I posted on there back then.
 

soresu

I did mention Matterhorn as the first ARMv9.0-A architecture on the performance core roadmap here
There is another TechCon keynote about this time tomorrow with the ARM head technical fellow Peter Greenhalgh, called "Creating Compute".

Considering they name-dropped Matterhorn yesterday, I believe they may do the same with v9.0-A tomorrow, though I've likely jinxed the possibility now.
 

Andrei.

Senior member
Jan 26, 2015
316
386
136
^ Utter garbage with no relation to Matterhorn.

30% IPC over A77 is 20% lower IPC than the A13. Apple has no plans to switch over from their own custom cores for their Arm Macbook. The A13 largely matches the 9900K, he seems to have no idea of this. Not to mention the GPU rubbish. Ignore such YouTubers.
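To spell out the arithmetic behind that first claim, here is a rough sketch that takes both percentages at face value and normalises the A77 to 1.0. The variable names are mine and the figures are the ones quoted above, not measurements.

```python
# Relative IPC, normalised so that Cortex-A77 = 1.0.
# Both percentages are the claims from the post above, not measured data.
a77_ipc = 1.0
rumoured_ipc = a77_ipc * 1.30     # "30% IPC over A77"
a13_ipc = rumoured_ipc / 0.80     # rumoured core sits "20% lower" than the A13

print(f"Implied A13 IPC relative to A77: {a13_ipc / a77_ipc:.3f}x")
```

In other words, the two numbers together imply the A13 would already be a bit over 1.6x the A77 in IPC.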
 

Asterox

Golden Member
May 15, 2012
1,026
1,775
136
^ Utter garbage with no relation to Matterhorn.

30% IPC over A77 is 20% lower IPC than the A13. Apple has no plans to switch over from their own custom cores for their Arm Macbook. The A13 largely matches the 9900K, he seems to have no idea of this. Not to mention the GPU rubbish. Ignore such YouTubers.

You are totally wrong no doubt. If you watch his channel or all videos closely, you will notice that in many cases he was on target.

Time will tell what Apple's real, concrete future plans are. As a reminder, Threadripper was not on AMD's first-priority plan or schedule either.
 

soresu

8cx uses A76 which is a 4 wide core, A13 is 6 wide and more than a little bigger than A76 already - the only way that makes sense is maybe at 5nm.

Even then it took Apple over 2 years between the announcement of v8.0-A and their v8 based Cyclone A7 core - if v9.0-A is announced tomorrow there won't be an Apple core based on it next year, possibly not even the year after.
 

Andrei.

Totally wrong no doubt? Remind me again why I bother posting here.

The idea that Apple, which currently has the strongest mobile GPU in the market, and strongest CPU architecture out there outright, would suddenly fall back to have Arm provide them inferior IP is just so laughable that it's not worth discussing. The Apple macbook uses their own cores and their own GPU, and that's a fact.

Matterhorn is a Q4 2021 / Q1 2022 silicon product. But let's not let that minor detail get in the way of spreading made-up rumours.
 

ksec

Senior member
Mar 5, 2010
420
117
116
I am pretty sure Matterhorn will be based on another ARMv8 revision.

ARMv9 will likely have an eye on High Performance platform, with everything they have learned in Mobile and recent server development. Considering it took Arm 4 years to make ARMv8, I think it will be another one or two before we see ARMv9.

And I think it will ( controversially ) look more like what people would call CISC rather than RISC.
 

soresu

I am pretty sure Matterhorn will be based on another ARMv8 revision.

ARMv9 will likely have an eye on High Performance platform, with everything they have learned in Mobile and recent server development. Considering it took Arm 4 years to make ARMv8, I think it will be another one or two before we see ARMv9.

And I think it will ( controversially ) look more like what people would call CISC rather than RISC.
I have been thinking v8.6-A actually, it supports the MatMul instructions we know to be in Matterhorn.
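For context, a rough sketch of why a matrix-multiply instruction could plausibly "double" int8 GEMM throughput. It assumes the MatMul in question is the Armv8.6-A int8 matrix multiply-accumulate (SMMLA) and compares it against the older dot-product instruction (SDOT) on a 128-bit vector. The Python functions are only illustrative models of the arithmetic; the function names are mine, and none of this says anything about how Matterhorn actually implements it.

```python
def sdot_128(acc, a, b):
    """Model of SDOT on a 128-bit vector.

    Four int32 lanes; each lane adds the dot product of four int8 pairs.
    """
    return [acc[i] + sum(a[4 * i + k] * b[4 * i + k] for k in range(4))
            for i in range(4)]


def smmla_128(acc, a, b):
    """Model of SMMLA on a 128-bit vector.

    a:   2x8 int8 matrix, row-major (16 values)
    b:   8x2 int8 matrix held transposed, i.e. as two rows of 8 (16 values)
    acc: 2x2 int32 accumulator, row-major (4 values)
    """
    return [acc[2 * i + j] + sum(a[8 * i + k] * b[8 * j + k] for k in range(8))
            for i in range(2) for j in range(2)]


# Multiply-accumulates performed by one instruction on the same 128-bit inputs.
MACS_PER_SDOT = 4 * 4         # 4 lanes x 4 products each
MACS_PER_SMMLA = 2 * 2 * 8    # 2x2 outputs x 8 products each

a = list(range(16))
b = [1] * 16
print("SDOT result: ", sdot_128([0] * 4, a, b))
print("SMMLA result:", smmla_128([0] * 4, a, b))
print("MACs per instruction, SMMLA vs SDOT:", MACS_PER_SMMLA / MACS_PER_SDOT)
```

Same register width, twice the multiply-accumulates per instruction, which is at least consistent with the "double GEMM performance" line from the slides.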
 

Richie Rich

Senior member
Jul 28, 2019
470
229
76
@soresu reminded me to recap the Matterhorn architecture:
  • IPC around Apple's A12... so +60% more PPC than current x86 competitors
  • FPU upgrade from 2x128-bit to 2x256-bit
  • ISA ARMv9
  • 2048-bit SVE2 SIMD
  • silicon in Q4 21/Q1 22

AWS Graviton 4(?) and Ampere's next server CPU, if based on the Matterhorn uarch, will be serious server contenders. There are a few questions though:
  1. MatMul is tied to Matterhorn, but for SVE it is optional. Will Matterhorn include MatMul by default, or will there be differences between Cortex cores?
  2. What about backward compatibility with NEON?
  3. Named as Cortex A80, A81, A82?
 

soresu

MatMul is related to Matterhorn but for SVE is optional. Will Matterhorn include MatMul by default or will be there differences between Cortex cores?
Pretty sure the clue is in the codename "MATterhorn".
  • FPU upgrade from 2x128-bit to 2x256-bit
  • 2048-bit SVE2 SIMD
Preeettty sure the latter is in opposition to the former there?

Again, A64FX is a REALLY high end supercomputer chip and they stuck with 2x512 - there is no way Matterhorn has a 2048 bit SVE2 unit unless they went forward 5 to 10 years in time and grabbed the latest fab node to bring back to the past.

Forksheets and VFETs for the win.

At most, ARM themselves are targeting datacenters and high end servers with Matterhorn and its immediate successors; 2048 bit is just overkill.
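To put rough numbers on the widths being thrown around, here is a small sketch counting FP32 lanes per cycle for each configuration mentioned in the thread. The pipe counts for the 2048-bit cases are my assumption purely for comparison (nobody has stated one), and lanes per cycle is only a crude proxy for datapath size and power.

```python
def fp32_lanes_per_cycle(pipes, width_bits):
    """FP32 elements processed per cycle by `pipes` SIMD units of `width_bits` each."""
    return pipes * (width_bits // 32)


# (label, pipes, width in bits); labels follow the figures quoted in the thread.
configs = [
    ("2x128-bit (current, per the thread)", 2, 128),
    ("2x256-bit (claimed FPU upgrade)", 2, 256),
    ("2x512-bit (A64FX)", 2, 512),
    ("1x2048-bit (assumed single pipe)", 1, 2048),
    ("2x2048-bit (assumed two pipes)", 2, 2048),
]

for label, pipes, width in configs:
    print(f"{label:38s} {fp32_lanes_per_cycle(pipes, width):4d} FP32 lanes/cycle")
```

Note that 2048 bits is the maximum vector length the SVE architecture allows, so quoting it says what the ISA permits, not what any given core builds.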
 

soresu

What about backward compatibility with NEON?
Backward compatibility with NEON code was outlined already in the ARM blog that announced SVE2 - they also pretty much said that 128 bit SVE2 implementations should still match or outperform NEON code in most such circumstances.
 

name99

Senior member
Sep 11, 2010
404
303
136
Pretty sure the clue is in the codename "MATterhorn".

Preeettty sure the latter is in opposition to the former there?

Again, A64FX is a REALLY high end supercomputer chip and they stuck with 2x512 - there is no way Matterhorn has a 2048 bit SVE2 unit unless they went forward 5 to 10 years in time and grabbed the latest fab node to bring back to the past.

Forksheets and VFETs for the win.

At most, ARM themselves are targeting datacenters and high end servers with Matterhorn and its immediate successors; 2048 bit is just overkill.

Don't be TOO certain of your "2048 bits is overkill" claims...
How exactly would the bit"width" be measured for matrix registers and operations?
What we DO know is
- Apple has an AMX unit on the A13 large core. (So SOMEONE thinks that sort of functionality is useful on a "phone" core)

- It's likely that the AMX instructions and functionality are close to what ARM is planning for ARMv9. Not certain, of course, but it would be silly for Apple to go in a gratuitously different direction, and then have to redo the design and compiler support when they move to ARMv9

- Apple says that the AMX unit has 1 TOPS performance. How do you get there?
2.5GHz. Let's say one op is an add or a mult, so a MAC gives a factor of 2, on 8-bit data. So we need a further amplification of 200. Well, a 2048-bit register is 256 bytes wide, and there's your factor of ~200...
Of course that's a somewhat bogus comparison because those high TOPs numbers are from small-ish SQUARE matrix-matrix multiplication, not from dot products or even level 2 BLAS (matrix-vector multiply). They reflect/require an aggressive sea of MAC units, but not super large registers. But they DO show how, if you insist on using "width" to talk about the performance of your TPU, that's the sort of number you'd back out to.
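The back-of-envelope numbers above, written out as a quick sketch. The 2.5GHz clock, the 2-ops-per-MAC convention and the 8-bit data are the assumptions stated in the post; the 16x16 outer-product array at the end is a purely hypothetical layout of my own to illustrate the "sea of MAC units" point, not a description of Apple's AMX.

```python
TARGET_OPS_PER_S = 1e12   # "1 TOPS"
CLOCK_HZ = 2.5e9          # assumed clock
OPS_PER_MAC = 2           # one multiply + one add

# How many int8 MACs per cycle are needed to reach the target?
macs_per_cycle_needed = TARGET_OPS_PER_S / (CLOCK_HZ * OPS_PER_MAC)

# If each byte of a 2048-bit register feeds one MAC per cycle:
int8_lanes_2048 = 2048 // 8
peak_ops_2048 = int8_lanes_2048 * OPS_PER_MAC * CLOCK_HZ

print(f"MACs/cycle needed for 1 TOPS: {macs_per_cycle_needed:.0f}")
print(f"int8 lanes in 2048 bits:      {int8_lanes_2048}")
print(f"Peak at 2048 bits 'wide':     {peak_ops_2048 / 1e12:.2f} TOPS")

# Hypothetical alternative: a 16x16 grid of MACs doing an outer product.
# It needs only two 16-element int8 vectors as input each cycle.
N = 16
outer_product_macs = N * N          # MACs per cycle
outer_product_input_bits = 2 * N * 8

print(f"16x16 outer product: {outer_product_macs} MACs/cycle "
      f"from {outer_product_input_bits} input bits")
```

So the same ~256 MACs per cycle can come either from a notional 2048-bit "vector" or from a square MAC array fed by two 128-bit operands, which is why "width" is a shaky way to describe a matrix unit.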

Don't confuse two different issues:
- the width of the "registers" used for the TPU part of A13 and Matterhorn AND
- the width of the SVE/SVE2 registers used by Matterhorn (and whatever future Apple core adds SVE/2)

We don't know that these even share registers, or how they share them.

Basically we have wandered into the point that EVERY tech discussion eventually wanders into, where a concept that was useful five years ago (eg "the nm of a process"...) continues to be used far beyond the point where it is of engineering relevance, because most of the participants in the discussion are more interested in horse races and scoring points than in understanding/accepting/admitting that the world has changed and their old score cards are no longer relevant.
 

SarahKerrigan

Senior member
Oct 12, 2014
339
468
136
Don't be TOO certain of your "2048 bits is overkill" claims...
How exactly would the bit"width" be measured for matrix registers and operations?
What we DO know is
- Apple has an AMX unit on the A13 large core. (So SOMEONE thinks that sort of functionality is useful on a "phone" core)

- It's likely that the AMX instructions and functionality are close to what ARM is planning for ARMv9. Not certain, of course, but it would be silly for Apple to go in a gratuitously different direction, and then have to redo the design and compiler support when they move to ARMv9

- Apple says that the AMX unit has 1 TOPS performance. How do you get there?
2.5GHz. Let's say one op is an add or a mult, so a MAC gives a factor of 2, on 8-bit data. So we need a further amplification of 200. Well, a 2048-bit register is 256 bytes wide, and there's your factor of ~200...
Of course that's a somewhat bogus comparison because those high TOPs numbers are from small-ish SQUARE matrix-matrix multiplication, not from dot products or even level 2 BLAS (matrix-vector multiply). They reflect/require an aggressive sea of MAC units, but not super large registers. But they DO show how, if you insist on using "width" to talk about the performance of your TPU, that's the sort of number you'd back out to.

Don't confuse two different issues:
- the width of the "registers" used for the TPU part of A13 and Matterhorn AND
- the width of the SVE/SVE2 registers used by Matterhorn (and whatever future Apple core adds SVE/2)

We don't know that these even share registers, or how they share them.

Basically we have wandered into the point that EVERY tech discussion eventually wanders into, where a concept that was useful five years ago (eg "the nm of a process"...) continues to be used far beyond the point where it is of engineering relevance, because most of the participants in the discussion are more interested in horse races and scoring points than in understanding/accepting/admitting that the world has changed and their old score cards are no longer relevant.

Soresu was responding to a post specifically claiming Matterhorn would have 2048b SVE2. It won't.
 

soresu

Apple says that the AMX unit has 1 TOPS performance. How do you get there?
TOPS is not TFLOPS - TOPS is more of a measure of ML computation power than for general floating point compute as TFLOPS are.

By now Apple will have a GPU capable of 1 TFLOPS though, about par with SD 855's GPU, the Adreno 650.

There are plenty of ML inference accelerators in mobile SoC's these days that can push 1 TOPS or more, it's becoming the new testosterone measuring contest across the industry between SoC and Phone vendors with all the new uses of ML in play.
- It's likely that the AMX instructions and functionality are close to what ARM is planning for ARMv9. Not certain, of course, but it would be silly for Apple to go in a gratuitously different direction, and then have to redo the design and compiler support when they move to ARMv9
Why? It's already working now on an ARMv8-A platform, v9 has nothing to do with it.

AMX was never designed to be an industry standard, it's just yet another example of Apple doing their own thing regardless of what anyone else is doing, ala the Metal API instead of Vulkan.
 

name99

TOPS is not TFLOPS - TOPS is more of a measure of ML computation power than for general floating point compute as TFLOPS are.

By now Apple will have a GPU capable of 1 TFLOPS though, about par with SD 855's GPU, the Adreno 650.

There are plenty of ML inference accelerators in mobile SoC's these days that can push 1 TOPS or more, it's becoming the new testosterone measuring contest across the industry between SoC and Phone vendors with all the new uses of ML in play.

Why? It's already working now on an ARMv8-A platform, v9 has nothing to do with it.

AMX was never designed to be an industry standard, just yet another example of Apple doing their own thing regardless of what anyone else is doing, ala the Metal API.

(a) It's not a redacted contest. I was trying to provide the context in which the statement "2048-bits wide" is not a completely insane claim; it's simply a claim that matrix-multiply units need to be conceptualized differently from vector units; but if you INSIST on conceptualizing them like vector units, then "2048 bits" is where your summary lands.

(b) Yes, we all know GPUs can do more OPS, and NPUs even more OPS.
Once again it's not a dick measuring contest! It's an attempt to understand what's happening in that part of *the CPU*. What's done on the CPU is interesting and useful, apart from what's also done in external accelerators.

(c) ARMv9 is coming soon. Shipped products maybe next year, maybe 2022.
If you think Apple's designs NOW are uninformed by what those designs will be under ARMv9, perhaps that's why you're commenting on the internet, not running Apple's CPU division?

ARM has, as far as I know, not released the DETAILS of how matrix multiplication is to be performed under 8.6A, just this blog post:

But clearly it only makes sense for this to mostly track where they want to go with v9, and with where Apple wants to go.


Profanity is not allowed in the tech forums.

AT Mod Usandthem
 

soresu

It's not a redacted.
For yours truly? No.

For them? Constantly - that's how business competition PR works, it's how they keep selling new generations of their closed hardware/software ecosystem to the sheep amongst us.

Bumping numbers is their way of drumming up interest to the less well informed vs those of us that actually read the deep dives to see a more accurate picture of what those numbers mean, if they actually mean anything at all (as with ARM Mali's less than impressive real world gains vs their internal benchmarks).
I was trying to provide the context in which the statement "2048-bits wide" is not a completely insane claim
It is insane at current process nodes. Again, Fujitsu's SUPERCOMPUTER chip only went up to 512x2 SVE units - which makes 2048 for even a server/datacenter variant of Axx seem a complete long shot.

It's not just an easy numbers game when designing processor cores. SIMD width doesn't come for free in either power or transistor budget, and Apple's 6 wide core is already both comparatively large and power hungry next to the competition, even now at 7nm. Their greater IPC does them little good in the power consumption arena vs the latest Cortex-Axx cores, given that little to nothing on their closed software platform can even tax their 6 wide cores anyway.

Just look at Intel's trouble with AVX512 - yes, they have been hampered by process node foibles, but the underlying problem is still there, and why Intel won't touch 1024 bit SIMD with a barge pole until at least their 5nm equivalent, if they ever do now that they seem to finally have a concrete GPU roadmap.

As to v9-A, little to nothing is known about it, so there is little point speculating on whether Apple's current core path is following, let alone its future path - if anything v8.6-A could be seen as a way to bring Apple's AMX sideshow loosely back under the ARM umbrella retroactively.

Don't be fooled into the impression that Apple are the cool kids that always play nice with others and are incapable of strong arming people - just look what they did to poor IMG TEC if you need any proof of their duplicitous nature when dealing with IP vendors.
 

name99

For yours truly? No.

For them? Constantly - that's how business competition PR works, it's how they keep selling new generations of their closed hardware/software ecosystem to the sheep amongst us.

Bumping numbers is their way of drumming up interest to the less well informed vs those of us that actually read the deep dives to see a more accurate picture of what those numbers mean, if they actually mean anything at all (as with ARM Mali's less than impressive real world gains vs their internal benchmarks).

It is insane at current process nodes. Again, Fujitsu's SUPERCOMPUTER chip only went up to 512x2 SVE units - which makes 2048 for even a server/datacenter variant of Axx seem a complete long shot.

It's not just an easy numbers game when designing processor cores. SIMD width doesn't come for free in either power or transistor budget, and Apple's 6 wide core is already both comparatively large and power hungry next to the competition, even now at 7nm. Their greater IPC does them little good in the power consumption arena vs the latest Cortex-Axx cores, given that little to nothing on their closed software platform can even tax their 6 wide cores anyway.

Just look at Intel's trouble with AVX512 - yes, they have been hampered by process node foibles, but the underlying problem is still there, and why Intel won't touch 1024 bit SIMD with a barge pole until at least their 5nm equivalent, if they ever do now that they seem to finally have a concrete GPU roadmap.

As to v9-A, little to nothing is known about it, so there is little point speculating on whether Apple's current core path is following, let alone its future path - if anything v8.6-A could be seen as a way to bring Apple's AMX sideshow loosely back under the ARM umbrella retroactively.

Don't be fooled into the impression that Apple are the cool kids that always play nice with others and are incapable of strong arming people - just look what they did to poor IMG TEC if you need any proof of their duplicitous nature when dealing with IP vendors.

(a) Apple's current large cores are 7-wide.

(b) I tried to explain the context of my statement, that a distinction needs to be drawn between "2048" as a statement of storage and "2048" as a statement of density of operations.
*Woosh*
Then I tried again.
*Woosh*.

I can only conclude that you are arguing in bad faith, meaning that from this point on it makes no sense to engage with you.

(c) "As to v9-A, little to nothing is known about it, so there is little point speculating on whether Apple's current core path is following, let alone its future path"
If that's your theory of tech, then what are you doing in a thread about Matterhorn???
 

soresu

If that's your theory of tech, then what are you doing in a thread about Matterhorn???
Unlike v9-A, which has not even been confirmed outside of LinkedIn job details, Matterhorn has actually had some official details released at the recent ARM conference last October, chiefly that it has MatMul, and that was about it for the public info dump. In fact, that information is literally in the first posts of this thread, so this entire paragraph is redundant.

I'm sure we could easily speculate about SVE2 and TME being mandated parts of v9-A amongst other things, but it would be as wild a speculation as talking about AVX 1024 at present. ARM are, I have to say, fantastically tight-lipped as things go; even AMD aren't so good at keeping the covers on things.
(a) Apple's current large cores are 7-wide.
I congratulate you on correcting my mistake, however this revelation only serves to hammer my point home - that core is already huge, and having 2048 bit SIMD would just make it soooo much bigger AND MOAR POWAA HUNGRY.

No seriously, 2048 bit SIMD at even N5P would not be pretty for power consumption.

I did not say it is impossible, mind you, only utterly ridiculous at the current semi process node, hence my joke reply to Richie Rich about needing forksheet and VFET based nodes, which are several years out yet. We still have at least one to two nodes of finFET left at TSMC before a shift to nanosheet/MBCFET, and they aren't going to switch to nanosheet only to change again at the next node, so forksheet is probably four nodes away at least.
 

DrMrLordX

Lifer
Apr 27, 2000
21,582
10,785
136
(b) I tried to explain the context of my statement, that a distinction needs to be drawn between "2048" as a statement of storage and "2048" as a statement of density of operations.

But none of us are really discussing "storage", we're discussing "density of operations", almost exclusively. That's ultimately going to determine maximum throughput and power draw.
 