Discussion [WikiChip Fuse] The x86 Advanced Matrix Extension (AMX) Brings Matrix Operations; To Debut with Sapphire Rapids

tamz_msc

Diamond Member
Jan 5, 2017
3,699
3,547
136
  • Like
Reactions: Tlh97 and Burpo

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,439
14,409
136
If you can't beat AMD at a plain x86 implementation, change the rules by adding new instructions, like AVX512 did. Then make the software vendors use it.

Back in the 70s and early 80s, Tektronix had a matrix chip in their 4051/4052. It really did speed things up a LOT, like 1000-fold.
Then just after the 8086 came out, there was the 8087 co-processor. Same thing.
 

SAAA

Senior member
May 14, 2014
541
126
116
You know, I don't care which vendor does it as long as software speeds up and I can enjoy it as an end user. Still, this is mostly aimed at AI stuff, and probably some big customer asked for it, so they delivered. Who knows when it will make its appearance on client cores.
 
  • Like
Reactions: pcp7

moinmoin

Diamond Member
Jun 1, 2017
4,926
7,609
136
Years ago I was told Sapphire Rapids would be the next chance of seeing Intel completely overhaul its Core design. This looks more like Intel's usual MO though: some evolutionary changes to the existing design and a couple of all-new (and as such initially completely unsupported) instructions.
 

jpiniero

Lifer
Oct 1, 2010
14,487
5,155
136
Years ago I was told Sapphire Rapids would be the next chance of seeing Intel completely overhaul its Core design. This looks more like Intel's usual MO though: some evolutionary changes to the existing design and a couple of all-new (and as such initially completely unsupported) instructions.

Yup, gotta figure that was scrapped.
 

gorobei

Diamond Member
Jan 7, 2007
3,649
974
136
Interestingly, from what little info is out there on what exactly a tensor core is, the takeaway I get is that they are just dedicated matrix-array computation units. If that is true, it means you can do path tracing on the CPU side rather than the GPU, which jibes with the next Unreal Engine not using Nvidia's or AMD's dedicated hardware for global illumination.
 

jpiniero

Lifer
Oct 1, 2010
14,487
5,155
136
You know, I don't care which vendor does it as long as software speeds up and I can enjoy it as an end user. Still, this is mostly aimed at AI stuff, and probably some big customer asked for it, so they delivered. Who knows when it will make its appearance on client cores.

Well, it's not in Alder Lake.
 

dmens

Platinum Member
Mar 18, 2005
2,271
917
136
Interestingly, from what little info is out there on what exactly a tensor core is, the takeaway I get is that they are just dedicated matrix-array computation units. If that is true, it means you can do path tracing on the CPU side rather than the GPU, which jibes with the next Unreal Engine not using Nvidia's or AMD's dedicated hardware for global illumination.

LOL no, this hack doesn't have anywhere close to enough threads to do that.
 

gorobei

Diamond Member
Jan 7, 2007
3,649
974
136
LOL no, this hack doesn't have anywhere close to enough threads to do that.
I'm not saying this iteration, but in a gen or two after it, game devs could think about doing the sampling for path tracing on whatever matrix instruction set Intel/AMD end up using.
 

teejee

Senior member
Jul 4, 2013
361
199
116
Isn't this just catching up with the AI accelerators we have in modern ARM SoCs (like Apple's Bionic)?
(But using CPU instructions instead of a separate accelerator on the die.)
 

beginner99

Diamond Member
Jun 2, 2009
5,208
1,580
136
I'm not saying this iteration, but in a gen or two after it, game devs could think about doing the sampling for path tracing on whatever matrix instruction set Intel/AMD end up using.

Note that Sapphire Rapids is the name of a server platform. So, like with all incarnations of AVX, it will trickle down slowly and the low end will not get it anytime soon. I'm not even sure if the newest Pentiums support AVX2 nowadays or still have it fused off.
 
Mar 11, 2004
23,020
5,485
146
Interestingly, from what little info is out there on what exactly a tensor core is, the takeaway I get is that they are just dedicated matrix-array computation units. If that is true, it means you can do path tracing on the CPU side rather than the GPU, which jibes with the next Unreal Engine not using Nvidia's or AMD's dedicated hardware for global illumination.

Aren't they just doing a shortcut version using GPU compute? Which I think they had been developing for years, before they even knew what AMD or Nvidia ray-tracing hardware would look like, so we'll see if they change things to utilize that hardware as it becomes more common.
 

Richie Rich

Senior member
Jul 28, 2019
470
229
76
Looks like Intel is stepping up the instruction set game. No support for regular data types though. More details to be found here.
An AMX tile is 16 × 64 bytes (1 kB), which is 8,192 bits.

Why is Intel announcing this just one year before releasing Sapphire Rapids? Developing such an AMX-capable FPU will take AMD and VIA around four years. Why isn't Intel releasing the specs well ahead of time so that everybody can collaborate and bring some improvements? This monopolistic behavior from Intel damages the whole platform.
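
For reference, here is roughly what programming one of those 1 kB tiles looks like with the intrinsics Intel has published so far. This is a minimal, untested sketch against the documented 64-byte tile-config layout and the -mamx-tile/-mamx-int8 intrinsics; there is no shipping hardware yet to verify it on, so treat it purely as an illustration:

```c
// Minimal AMX sketch (untested): configure 16-row x 64-byte tiles and do an
// int8 dot-product-accumulate. Needs a compiler with -mamx-tile -mamx-int8.
// (On Linux you would also need to request AMX permission via arch_prctl
// before first use.)
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// 64-byte tile configuration blob, fields per the published AMX spec:
// byte 0 = palette, bytes 16..47 = per-tile bytes-per-row (u16),
// bytes 48..63 = per-tile row counts (u8).
struct tileconfig {
    uint8_t  palette_id;
    uint8_t  start_row;
    uint8_t  reserved[14];
    uint16_t colsb[16];
    uint8_t  rows[16];
};

void int8_tile_madd(const int8_t *a, const int8_t *b, int32_t *c)
{
    struct tileconfig cfg __attribute__((aligned(64)));
    memset(&cfg, 0, sizeof cfg);
    cfg.palette_id = 1;
    for (int t = 0; t < 3; t++) {      // tiles 0..2: 16 rows x 64 bytes = 1 kB each
        cfg.rows[t]  = 16;
        cfg.colsb[t] = 64;
    }
    _tile_loadconfig(&cfg);

    _tile_loadd(1, a, 64);             // A: 16x64 int8
    _tile_loadd(2, b, 64);             // B: 16x64 int8 (pre-packed for the dp form)
    _tile_zero(0);                     // C accumulator: 16x16 int32
    _tile_dpbssd(0, 1, 2);             // C += A * B (signed int8 -> int32)
    _tile_stored(0, c, 64);

    _tile_release();
}
```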
 

tamz_msc

Diamond Member
Jan 5, 2017
3,699
3,547
136
An AMX tile is 16 × 64 bytes (1 kB), which is 8,192 bits.

Why is Intel announcing this just one year before releasing Sapphire Rapids? Developing such an AMX-capable FPU will take AMD and VIA around four years. Why isn't Intel releasing the specs well ahead of time so that everybody can collaborate and bring some improvements? This monopolistic behavior from Intel damages the whole platform.
Intel can do whatever they want with respect to instruction sets because they're the market leader in x86 CPUs. That isn't going to change anytime soon, and unless the likes of Agner Fog, with his idealized instruction set, are taken seriously by the manufacturers, you can forget about having them work together on any sort of universally supported instruction set.
 

gorobei

Diamond Member
Jan 7, 2007
3,649
974
136
Aren't they just doing a shortcut version using GPU compute? Which I think they had been developing it for years, before they even knew what AMD or Nvidia ray-tracing hardware would look like, so we'll see if they change things to utilize that hardware as it becomes more common.
Don't really know, as there hasn't been a huge amount of detail out of Epic about the engine. I would assume that stuff would come out at next year's GDC if we aren't still under quarantine.

But I have been seeing a few other game-engine news blurbs about using the CPU rather than the GPU for GI, while reflection, refraction, area lights, and shadows stay on the GPU. I imagine DXR is able to handle passing that stuff off to the GPU with no issues for the dev, but GI has a bunch of variant methods that might be uniquely coded/formatted for the game engine.

Cortex found an NV paper (patent?) on something called a traversal accelerator, which is supposed to speed up the Monte Carlo-type sampling for path-traced GI. While that custom module might be nice to have available as a dev, I can't imagine Unreal Engine, which has to run on consoles, PC, and mobile, wanting to optimize for such a niche hardware possibility when tons of spare CPU threads will be ubiquitous across all platforms with each new hardware generation.
 

dmens

Platinum Member
Mar 18, 2005
2,271
917
136
An AMX tile is 16 × 64 bytes (1 kB), which is 8,192 bits.

Why is Intel announcing this just one year before releasing Sapphire Rapids? Developing such an AMX-capable FPU will take AMD and VIA around four years. Why isn't Intel releasing the specs well ahead of time so that everybody can collaborate and bring some improvements? This monopolistic behavior from Intel damages the whole platform.

Because it is a hack that no developer will use and its main purpose is so marketing can claim "AI leadership" in a slide deck.
 
  • Like
Reactions: lightmanek

name99

Senior member
Sep 11, 2010
404
303
136
Isn't this just catching up with the AI accelerators we have in modern ARM SoCs (like Apple's Bionic)?
(But using CPU instructions instead of a separate accelerator on the die.)

Uh, Apple's A13 has something apparently very close to this; even called AMX (Apple Matrix Extensions). We know very little about these (beyond a claim of "one trillion 8-bit operations per second").

It's interesting to note that at WWDC we were not given further details about this, even as an aside, or as part of a talk on Accelerate (Apple's general framework for accelerating various types of numerical code). On the other hand, it's also interesting to note that the effort in LLVM to define a native matrix type, and to optimize various types of code to utilize that matrix type (which can then be mapped onto TPUs in various target CPUs or GPUs), is being led by Apple folk...
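
For the curious, the Clang side of that LLVM work is already usable behind a flag. A trivial sketch, assuming you compile with -fenable-matrix; this is just the portable language extension, not anything Apple-AMX-specific:

```c
// Clang matrix-type extension (compile with: clang -fenable-matrix).
// The multiply below is lowered to LLVM's matrix intrinsics, which a backend
// could in principle map onto dedicated matrix hardware.
typedef float m4x4_t __attribute__((matrix_type(4, 4)));

m4x4_t mul(m4x4_t a, m4x4_t b) {
    return a * b;                // 4x4 * 4x4 matrix multiplication
}

float trace(m4x4_t m) {
    float t = 0.0f;
    for (int i = 0; i < 4; ++i)
        t += m[i][i];            // element access is m[row][column]
    return t;
}
```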

Quite what AMX is remains unclear. Apple described it as part of the CPU (ie NOT an accelerator like an NPU). It could be proprietary instructions; alternatively it could be an implementation of the Matrix instructions added to ARMv8.6 -- this is my bet.


(If this is ARMv8.6, it's even possible that Apple has been delaying details as part of an agreement with ARM as ARM finalizes the precise details of every aspect of 8.6 and its documentation. Presumably this will all match Apple, but there may be parts, like the 32-bit behavior, or some interactions with the OS, hypervisor, debugging, and performance registers, where Apple doesn't care much about the ARM details because they have already done things their way.)

You could dismiss this as a fail by Apple, but I'd describe it more as "it's very difficult to get all of (HW, compiler, client SW, etc.) absolutely synchronized".
It would be very interesting to see the performance of various machine learning benchmarks (both inference and learning) on an A12 compared to an A13 to see what the differences are... (The NPU on the A13 is about 15-20% faster than on the A12, but it's probably optimized for inference. The new A13 AMX blocks are apparently for learning, and it's possible that Apple hasn't yet even hooked them up to Apple frameworks.
i.e. maybe the big reveal, requiring also some OS support [context swapping] and new APIs, will come with iOS 14 in September?)

My guess is that, regardless of the details, these will be visible and in use (certainly via Apple frameworks, perhaps as direct LLVM compile targets) by the time Intel ships their AMX.
 
  • Like
Reactions: Richie Rich

Richie Rich

Senior member
Jul 28, 2019
470
229
76
Because it is a hack that no developer will use and its main purpose is so marketing can claim "AI leadership" in a slide deck.
That's the major difference in philosophy. Intel releases ISA extensions to screw AMD up and increase their monopoly in the x86 world. But they forget that ARM CPUs have SVE vectors, a much more powerful extension that includes matrix multiplication. The Japanese Fugaku supercomputer (CPU only), based on ARM CPUs with SVE vectors, clearly demonstrated that they are able to beat GPU-based supercomputers (a big problem for AMD's CDNA lineup). ARM vendors collaborated and came up with SVE2, which will replace the old NEON vectors and will be available in every smartphone starting next year, 2021 (as part of ARMv9 Matterhorn).

AMD will need a whole decade to implement AMX in their CPUs (AMD has no AVX512 support after 7 years; they will go around it using CDNA GPUs). As a result, almost nobody will use Intel's AMX extension, the same way AVX512 has never been widely used. That's the way into hell.
 

mikk

Diamond Member
May 15, 2012
4,108
2,100
136
As a result, almost nobody will use Intel's AMX extension, the same way AVX512 has never been widely used. That's the way into hell.

Wide usage has nothing to do with AMD, because AMD is so minor in the market that it doesn't matter what they have or don't have; it wouldn't have changed the usefulness of AVX512 in consumer apps. The bigger issue is Intel's AVX512 abstinence in their mainstream CPUs, which didn't support AVX512 until Icelake-U, which is minor as well. And you could say the same about AVX2: it isn't widely used either. Consumer productivity apps with real AVX2 usage are generally very rare. I believe the most prominent consumer productivity application is probably x265 for HEVC encoding, which supports AVX512 as well, by the way: https://www.prlog.org/12701604-mult...ntel-avx-512-instructions-on-4k-encoding.html
 
  • Like
Reactions: Schmide

samboy

Senior member
Aug 17, 2002
217
77
101
Anyone know if the AMX extensions are IEEE754 floating point compliant?

That is, if you used standard IEEE754 floating-point operations to calculate the dot product of a vector, will you get the same answer/error as using the AMX dot-product implementation?

Doesn't matter too much for a game engine; but for other applications this can be a big deal if you want the same deterministic behavior on a number of platforms (including Intel chips without the AMX extension).

I hope that Intel has not ignored this aspect; if it has, then the extension is not usable for some applications.
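
To make the concern concrete, here is a toy example in plain C (nothing AMX-specific, just ordinary floats): two mathematically equivalent dot products, accumulated in different orders, round to different answers under IEEE754, which is exactly the kind of divergence a fixed-function dot-product unit can introduce:

```c
#include <stdio.h>

// Toy illustration of the determinism concern: the same inputs summed in a
// different association round differently under IEEE754. Hardware dot-product
// units often accumulate in a different order (or at a wider precision) than
// a scalar loop would.
static float dot_scalar(const float *a, const float *b, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += a[i] * b[i];                 // strict left-to-right float adds
    return s;
}

static float dot_pairwise(const float *a, const float *b, int n) {
    // Same inputs, different association: sum even and odd lanes separately.
    float s0 = 0.0f, s1 = 0.0f;
    for (int i = 0; i + 1 < n; i += 2) {
        s0 += a[i] * b[i];
        s1 += a[i + 1] * b[i + 1];
    }
    return s0 + s1;
}

int main(void) {
    float a[4] = {1e8f, 1.0f, -1e8f, 1.0f};
    float b[4] = {1.0f, 1.0f, 1.0f, 1.0f};
    printf("scalar:   %.9g\n", dot_scalar(a, b, 4));   // the 1.0 is absorbed: prints 1
    printf("pairwise: %.9g\n", dot_pairwise(a, b, 4)); // 1e8 cancels exactly: prints 2
    return 0;
}
```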
 

tamz_msc

Diamond Member
Jan 5, 2017
3,699
3,547
136
AVX512 is finding applications in various AI workloads, for example this one. So it is premature to declare that AMX wouldn't find any applications.
 
  • Like
Reactions: mikk

beginner99

Diamond Member
Jun 2, 2009
5,208
1,580
136
AVX512 is finding applications in various AI workloads, for example this one. So it is premature to declare that AMX wouldn't find any applications.

Intel MKL uses AVX512, and if you use a NumPy version compiled against Intel MKL, you benefit from AVX512: basically double the speed compared to AVX2 for matrix stuff. (With the downside that you get half the cores compared to AMD, so it's not really worth it.)
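
If you want to see the same effect without NumPy in the way, here's a minimal sketch of calling MKL's GEMM directly from C; cblas_sgemm is the standard CBLAS entry point, and MKL picks its AVX2 or AVX-512 kernels at runtime based on the CPU it finds (link against MKL, e.g. -lmkl_rt):

```c
#include <mkl.h>   // MKL's header; provides the standard CBLAS interface

// Minimal single-precision GEMM: C = A * B with square N x N matrices.
// MKL dispatches to AVX2 or AVX-512 code paths at runtime depending on the
// CPU, which is where the roughly 2x speedup on matrix work comes from.
void gemm_example(int n)
{
    float *A = mkl_malloc((size_t)n * n * sizeof *A, 64);
    float *B = mkl_malloc((size_t)n * n * sizeof *B, 64);
    float *C = mkl_malloc((size_t)n * n * sizeof *C, 64);

    for (int i = 0; i < n * n; ++i) { A[i] = 1.0f; B[i] = 2.0f; C[i] = 0.0f; }

    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0f, A, n, B, n, 0.0f, C, n);

    mkl_free(A); mkl_free(B); mkl_free(C);
}
```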
 
Mar 11, 2004
23,020
5,485
146
Don't really know, as there hasn't been a huge amount of detail out of Epic about the engine. I would assume that stuff would come out at next year's GDC if we aren't still under quarantine.

But I have been seeing a few other game-engine news blurbs about using the CPU rather than the GPU for GI, while reflection, refraction, area lights, and shadows stay on the GPU. I imagine DXR is able to handle passing that stuff off to the GPU with no issues for the dev, but GI has a bunch of variant methods that might be uniquely coded/formatted for the game engine.

Cortex found an NV paper (patent?) on something called a traversal accelerator, which is supposed to speed up the Monte Carlo-type sampling for path-traced GI. While that custom module might be nice to have available as a dev, I can't imagine Unreal Engine, which has to run on consoles, PC, and mobile, wanting to optimize for such a niche hardware possibility when tons of spare CPU threads will be ubiquitous across all platforms with each new hardware generation.

They said something about it with the release of that tech demo here fairly recently. They seemed to indicate to me that they were doing a lot fewer traces or something to begin with (hence my "shortcut" comment).

Nvidia GPUs have Tensor cores, but for whatever reason it seems most games aren't using them despite claims they'd boost things (I think there was talk of using them for DLSS, but at least some of the games doing DLSS aren't using them; they're doing it in some other manner, which is weird since they were going to the work of utilizing the ray-tracing bits - although even that seems like a bit of a sham, recalling how DICE scaled back the tracing a lot and did other things in order to keep performance from tanking on BF1 or whatever game it was).