Discussion Intel current and future Lakes & Rapids thread

Page 454

DrMrLordX

Lifer
Apr 27, 2000
21,582
10,785
136
Perhaps AVX VNNI and AVX512 VNNI are two different things. That's because, as you correctly said, AVX512 VNNI instructions are only defined in conjunction with 512-bit registers. In addition, as far as I remember, if you implement any (of the many) AVX512 extensions, you need to at least implement AVX512F.

The only information I can find about AVX-VNNI so far is this:


Most other search results seem to reference AVX512-VNNI


"most use" is relative. Given the fragmented state of all the different AVX512 extensions, I believe that there is hardly any commercial application that uses AVX512. Even Intel's own Embree library uses mostly AVX or SSE.

Well, it seems that OpenVINO supports AVX512-VNNI. But I don't know if that counts as a commercial application. Also TensorFlow, PyTorch, and uhhhh something something I dunno.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
My understanding of AVX was that cracking it once (i.e. from 256b to 2x128b, or 512b to 2x256b) is doable without too much effort, but cracking it twice (512b to 4x128b) is disproportionately more complicated. Might be hearsay, but we'll have to see what "NextMont"/Meteor Lake does.

Also, why wouldn't Gracemont have 256-bit FP units? Because back with Goldmont it had full 128-bit FP units. One thing they mentioned was "vector performance".

The client versions of Icelake/Tigerlake don't have full 512-bit support either.
 

DrMrLordX

Lifer
Apr 27, 2000
21,582
10,785
136
Also, why wouldn't Gracemont have 256-bit FP units? Because back with Goldmont it had full 128-bit FP units. One thing they mentioned was "vector performance".

Does Tremont have 2x128 or 1x256?

The client versions of Icelake/Tigerlake don't have full 512-bit support either.

There's more to TigerLake than that, though. Look at some of AnandTech's AVX-512 benchmarks in their recent TigerLake-H 8c review. 3DPM v2 performance on TigerLake-H is pretty beastly. I don't think it's competitive in that bench with Skylake-X at the same core count, but it's still better than Skylake/CoffeeLake/Comet Lake with its 2x256b AVX2.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
Does Tremont have 2x128 or 1x256?

It's 2x128; you can't have only 1x256 because one unit needs to be for FPadd and the other for FPmultiply. The support has been there since the original Goldmont chip. Silvermont is like the Pentium 4 in that it supports SSE2 but needs 2 cycles to complete a 128-bit operation; it was with Core 2 that they made the units 128 bits wide.

SSE2-->SSE4.1: 128-bit
AVX: 256-bit
AVX2: 256-bit with FMA

There's more to TigerLake than that though. Look at some of Anandtech's AVX-512 benchmarks in their recent TigerLake-H 8c review. 3DPM v2 performance on TigerLake-H is pretty beastly.

Yes of course. AVX-512 brings other benefits and the Sunny Cove core allows more to work using AVX-512. The double load/store also helps immensely with AVX-512.

But throughput-wise? Same as Skylake client.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
Hmm. Does anyone really believe that Gracemont is going to be 2x256b then?

I would like to point out that AVX128 is a thing.

It's the same as AVX-512 being executed on 2x 256-bit units in ICL/TGL, and SSE2 on 64-bit units in the Pentium 4.

You are talking about Jaguar right? That was a 28nm CPU and it had 128-bit FPUs, and double the L/S bandwidth which helps for FP performance in particular.
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
It's the same with AVX-512 being executed using 2x-256bit units in ICL/TGL and SSE2 using 64-bit units in Pentium 4.

I remember reading numerous times on Realworldtech forums that ICL and TGL have one native FMA 512 bit unit, and if I recall, the load and store units are 2x256 bit. Andrei was corrected by some of those forum users when he made a mistake in one of his reviews.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
I remember reading numerous times on Realworldtech forums that ICL and TGL have one native FMA 512 bit unit, and if I recall, the load and store units are 2x256 bit. Andrei was corrected by some of those forum users when he made a mistake in one of his reviews.

It's right in their x86 optimization manual:
All processors based on Ice Lake Client microarchitecture contain a single 512-bit FMA unit, whereas some of the processors based on Skylake Server microarchitecture contain two such units. Both processors contain two 256-bit FMA units. The power consumed by Ice Lake Client FMA units is the same, whereas on Skylake Server the 512-bit units consume twice as much.

Agner Fog:
There are three ports that can handle integer vector arithmetic and logic operations, port 0, 1, and 5. Port 0 and 1 have 256 bit width, while port 5 has 512 bit width. A 512 bit vector operation can use either port 5 or port 0 and 1 combined. For example, integer vector addition with a vector size of up to 256 bits has a throughput of three instructions per clock cycle, while addition of 512 bit vectors has a throughput of two instructions per clock cycle because port 0 and 1 need to be combined to make a 512-bit operation. Only port 0 and 1 can handle floating point vector operations. Both have 256 bit width. The throughput is two floating point operations per clock cycle with scalars and vectors of 128 or 256 bits, while the throughput is one floating point vector operation per clock cycle with 512 bit vectors, using port 0 and 1 combined.

From Icelake-SP Hot Chips presentation:
Server enhancements – larger Mid-level Cache (L2) + second FMA

So Sunny Cove has full AVX-512 performance on integer workloads, but for floating-point workloads its throughput is identical to Skylake client, since it has to pair up the two 256-bit FP units to execute AVX-512.
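To make the arithmetic in the Agner Fog quote concrete, here is a rough Python sketch. It is a toy model, not a simulator: the port names and widths are taken from the quote above, and the rule that narrower ports fuse to form one wider operation is a simplifying assumption of mine. The helper names `ops_per_cycle` and `bits_per_cycle` are made up for illustration.

```python
# Port widths in bits for client Sunny Cove, as described in the quote above.
INT_PORTS = {"p0": 256, "p1": 256, "p5": 512}
FP_PORTS = {"p0": 256, "p1": 256}


def ops_per_cycle(ports, vector_bits):
    """Vector instructions issued per cycle for a given vector width.

    Assumption: a port at least as wide as the vector handles one
    instruction per cycle on its own; narrower ports are fused until
    their combined width covers one instruction.
    """
    wide = [w for w in ports.values() if w >= vector_bits]
    narrow = [w for w in ports.values() if w < vector_bits]
    return len(wide) + sum(narrow) // vector_bits


def bits_per_cycle(ports, vector_bits):
    """Total vector bits processed per cycle at that width."""
    return ops_per_cycle(ports, vector_bits) * vector_bits


for name, ports in (("int", INT_PORTS), ("fp", FP_PORTS)):
    for width in (256, 512):
        print(name, width, ops_per_cycle(ports, width), bits_per_cycle(ports, width))
```

Under this model, integer goes from 768 to 1024 bits per cycle when moving to 512-bit vectors, while FP stays at 512 bits per cycle either way, which is the point being made about FMA throughput.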

Certainly area won't be a big limiter to implementing full 256-bit AVX2 on Gracemont, since we can see from Knights Landing that even full AVX-512 support is quite small. It's just inefficient in Core.
 

DrMrLordX

Lifer
Apr 27, 2000
21,582
10,785
136
You are talking about Jaguar right? That was a 28nm CPU and it had 128-bit FPUs, and double the L/S bandwidth which helps for FP performance in particular.

Actually I was more thinking Summit Ridge:


The AVX instructions support both 128-bit and 256-bit SIMD. The 128-bit versions can be useful to improve old code without needing to widen the vectorization, and avoid the penalty of going from SSE to AVX, they are also faster on some early AMD implementations of AVX. This mode is sometimes known as AVX-128

Yeah, technically Jaguar (and all the CON cores) supported AVX, but using AVX never really improved performance on those chips. Jaguar was about as fast with SSE4.x, while the CON cores were at their fastest using XOP. It wasn't until Summit Ridge that AMD had a CPU where 128b AVX code provided a significant performance advantage over older 128b SIMD (such as SSE4.x; Summit Ridge had no support for XOP). The guy who wrote/maintains y-cruncher actually referred to the instruction set he used for Summit Ridge as "ADX". Why, I don't know, but he did.

Now how all this relates to Alder Lake (specifically Gracemont) . . .we'll see I guess? If they are able to fully implement 2x256b AVX2 without bloating die size then great! But obviously they aren't making the jump to 1x512b for AVX-512, which is why we went down this rabbit hole in the first place.
 

jur

Junior Member
Nov 23, 2016
15
1
81
Just FYI guys, Ice Lake and Tiger Lake seem to be able to execute 2 instructions per cycle for a limited subset of AVX-512 instructions (this is on zmm registers). Check out https://github.com/InstLatx64 for details. This is still a very minor improvement in throughput, because Skylake already did 3x256 per cycle.

There are lines such as:
Code:
AVX512BW :VPADDB zmm, zmm, zmm  L:   0.78ns=  1.0c  T:   0.39ns=  0.50c
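For anyone who wants to read these dumps programmatically, here is a small Python sketch that parses a line like the one above. I am assuming `L:` is latency and `T:` is reciprocal throughput, each given in nanoseconds and in core cycles; the regex and the `parse` helper are my own, not part of the InstLatx64 project.

```python
import re

LINE = "AVX512BW :VPADDB zmm, zmm, zmm  L:   0.78ns=  1.0c  T:   0.39ns=  0.50c"

# Assumed layout: "L: <ns>ns= <cycles>c  T: <ns>ns= <cycles>c"
PATTERN = re.compile(
    r"L:\s*(?P<lat_ns>[\d.]+)ns=\s*(?P<lat_c>[\d.]+)c\s+"
    r"T:\s*(?P<thr_ns>[\d.]+)ns=\s*(?P<thr_c>[\d.]+)c"
)


def parse(line):
    """Extract latency, instructions per cycle, and the implied clock speed."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("unrecognized line format")
    lat_ns = float(m["lat_ns"])
    lat_c = float(m["lat_c"])
    thr_c = float(m["thr_c"])
    return {
        "latency_cycles": lat_c,
        # A reciprocal throughput of 0.50c means two instructions per cycle.
        "instr_per_cycle": 1.0 / thr_c,
        # cycles / ns = GHz at which the test was running.
        "implied_ghz": lat_c / lat_ns,
    }


print(parse(LINE))
```

With the line above, this gives 2 instructions per cycle on zmm registers and an implied test clock of roughly 1.28 GHz.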
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
Just FYI guys, Ice Lake and Tiger Lake seem to be able to execute 2 instructions per cycle for a limited subset of AVX-512 instructions (this is on zmm registers). Check out https://github.com/InstLatx64 for details. This is still a very minor improvement in throughput, because Skylake already did 3x256 per cycle.

AVX-512 allows you to vectorize code that couldn't be vectorized before, and ICL/TGL has full 512-bit support for integer. So in some cases you aren't going from 256 to 512, but from 64 to 512. Of course, if it weren't for the massive fragmentation Intel created, it could have been much better. I mean, no support on Core-based Pentium/Celeron, really? They end at SSE4!

It's in FMA where it doesn't advance compared to Skylake client. But it's also in FP where it uses a massive amount of power.

@DrMrLordX I think if we analyze the die size, AVX2 is about the limit before it becomes too much of an adder for die size for, say, Tremont. So if you significantly increase the size again by implementing AVX-512 even just for integer (never mind doubling it as with server), then it makes no sense.

Even for Core chips AVX2 was the point where the tradeoff between die area/power and performance was ok. When they doubled it again with AVX-512 on Skylake, we started having issues.
 

mikk

Diamond Member
May 15, 2012
4,112
2,108
136
According to Moore's Law Is Dead, Intel plans to lift the ADL-S embargo on October 25th. He says only 125W K-SKUs will be launched this year and the others will follow in Q1 2022. ADL-P will be announced at CES in early 2022. Some Raptor Lake info: it's an 8+16 design arriving roughly 1 year after ADL-S. Same Gracemont cores, but enhanced Golden Cove with more IPC and improved perf/watt.

 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136

Numerous codecs and renderers also use AVX-512. Intel SVT-AV1 uses AVX-512 to great effect apparently, and is going to be used by Netflix for streaming. It performed very well on the Ice Lake Xeon in the Phoronix review:

 
  • Like
Reactions: Magic Carpet

Asterox

Golden Member
May 15, 2012
1,026
1,775
136
According to Moore's Law Is Dead, Intel plans to lift the ADL-S embargo on October 25th. He says only 125W K-SKUs will be launched this year and the others will follow in Q1 2022. ADL-P will be announced at CES in early 2022. Some Raptor Lake info: it's an 8+16 design arriving roughly 1 year after ADL-S. Same Gracemont cores, but enhanced Golden Cove with more IPC and improved perf/watt.


A lot of ifs and hows from Intel, etc.

If AL 8+8 is 125W, I wonder how Intel will hit the red target if Raptor Lake is 8+16.

If Alder Lake and Raptor Lake use the same 10nm, the red part is just another classic morning fog.
 

diediealldie

Member
May 9, 2020
77
68
61
I'm thinking out loud here, but maybe AVX-512 could be supported in a similar way to how AMD implemented 256-bit AVX on Zen 1 with two 128-bit units.
Probably not that easy. Unlike previous AVX instructions, AVX512 works as a utility as well. Pattern search, crypto-related operations are not trivial SIMD operations. Implementation must be pretty hard. In any case, Intel needs to find a way to reduce software development burdens.
 

jpiniero

Lifer
Oct 1, 2010
14,510
5,159
136
A lot of ifs and hows from Intel, etc.

If AL 8+8 is 125W, I wonder how Intel will hit the red target if Raptor Lake is 8+16.

If Alder Lake and Raptor Lake use the same 10nm, the red part is just another classic morning fog.

At base clock the small cores likely don't draw that much.
 
  • Like
Reactions: dullard

mikk

Diamond Member
May 15, 2012
4,112
2,108
136
A lot of ifs and hows from Intel, etc.

If AL 8+8 is 125W, I wonder how Intel will hit the red target if Raptor Lake is 8+16.

If Alder Lake and Raptor Lake use the same 10nm, the red part is just another classic morning fog.


Improved perf/watt from Raptor Cove and a further improved 10nm. They may have to clock 8 Gracemont a little lower than 4 Gracemont, but it's the little core anyway. Increasing the big core count within the same power envelope would have been more troublesome.
 
  • Like
Reactions: Tlh97 and dullard

DrMrLordX

Lifer
Apr 27, 2000
21,582
10,785
136
Some Raptor Lake info: it's an 8+16 design arriving roughly 1 year after ADL-S. Same Gracemont cores, but enhanced Golden Cove with more IPC and improved perf/watt.

That's . . . weird. They could have added more Golden Cove (especially if perf/watt is improved; @isoclocks that puts them at an advantage). Instead they double up on Gracemont? What's really going on here? MLiD being goofy again?

Actually, LLVM/Clang has been updated as well, if you look at this commit. You can also see that the instructions are available for 128-bit and 256-bit vector sizes under AVX-VNNI.

I wonder if AVX-VNNI is backwards-compatible with AVX512-VNNI?
 

eek2121

Platinum Member
Aug 2, 2005
2,904
3,906
136
According to Moore's Law Is Dead, Intel plans to lift the ADL-S embargo on October 25th. He says only 125W K-SKUs will be launched this year and the others will follow in Q1 2022. ADL-P will be announced at CES in early 2022. Some Raptor Lake info: it's an 8+16 design arriving roughly 1 year after ADL-S. Same Gracemont cores, but enhanced Golden Cove with more IPC and improved perf/watt.

Has he ever beaten leaks and gotten anything right? No, he has not. Desktop Alder Lake will be released before mobile (that much is a given, considering Intel just refreshed TGL-U and just launched TGL-H... sorry, "Moore's Law is Dead", many of us knew this for longer than you did). From my current understanding, Intel plans to announce ADL-S in late November or late December this year. They have not decided on a final date (I actually know this for a fact, which is why MLID needs to kindly go away), but importantly, please realize that announcement != release. A desktop ADL-S is not going to be widely available before next year. I will absolutely, 100% put money down on a bet on this. I've been saying this for months, but these YouTubers keep trying to get their clicks...

A lot of ifs and hows from Intel, etc.

If AL 8+8 is 125W, I wonder how Intel will hit the red target if Raptor Lake is 8+16.

If Alder Lake and Raptor Lake use the same 10nm, the red part is just another classic morning fog.
Please stop with the bold and red text. You are usually wrong (almost as bad as a certain other member of this forum whose name also happens to start with an A), and while you can occasionally be comedic, you annoy more often than not, especially those of us who have the misfortune of browsing this forum on mobile.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
Probably not that easy. Unlike previous AVX instructions, AVX512 works as a utility as well. Pattern search, crypto-related operations are not trivial SIMD operations. Implementation must be pretty hard. In any case, Intel needs to find a way to reduce software development burdens.

Guys, if the CPU supports FMA, it doesn't need to be broken down into two ops. Jaguar had to do that because it didn't have it, so the FPadd executed it in two cycles and so did FPmul.

CPUs that support FMA, such as Icelake (the first Intel CPU with FMA was Haswell), will only have half the throughput, but each AVX-512 instruction will still have 1 cycle latency.
 

coercitiv

Diamond Member
Jan 24, 2014
6,151
11,686
136
Has he ever beaten leaks and gotten anything right?
It's worse than that, sometimes he just deletes videos:

People forget this again and again: a leaker is not somebody who has interesting information, but someone who has previously established themselves as a credible source.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Actually, LLVM/Clang has been updated as well, if you look at this commit. You can also see that the instructions are available for 128-bit and 256-bit vector sizes under AVX-VNNI.


Good find, as predicted earlier in this thread and now confirmed by LLVM commit log:

2. Introduce ExplicitVEXPrefix flag so that vpdpbusd/vpdpbusds/vpdpbusds/vpdpbusds instructions only use vex-encoding when user explicity add {vex} prefix.

So basically, instructions from AVX512 that used to be encoded with the EVEX prefix now also come with a VEX encoding (as used by AVX/AVX2) and do the same operations on 128/256-bit registers. Smart thing for Intel to do.
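For reference, here is what "the same operations" means for vpdpbusd, as a plain Python model. This is my reading of the documented semantics (unsigned bytes from the first source times signed bytes from the second, four products summed into each 32-bit accumulator lane, wrapping rather than saturating). The function names are mine, and the only thing the register width changes is the number of lanes.

```python
def to_signed8(x):
    """Reinterpret a byte value 0..255 as a signed 8-bit integer."""
    return x - 256 if x >= 128 else x


def vpdpbusd_lane(acc, a_bytes, b_bytes):
    """One 32-bit lane: acc + sum(u8(a[i]) * s8(b[i])), wrapped to int32."""
    total = acc + sum(
        (a & 0xFF) * to_signed8(b & 0xFF) for a, b in zip(a_bytes, b_bytes)
    )
    total &= 0xFFFFFFFF  # non-saturating form wraps around
    return total - (1 << 32) if total >= (1 << 31) else total


def vpdpbusd(acc, a, b, vector_bits):
    """Model a whole register; vector_bits is 128, 256, or 512."""
    lanes = vector_bits // 32
    if len(acc) != lanes or len(a) != 4 * lanes or len(b) != 4 * lanes:
        raise ValueError("operand sizes do not match the vector width")
    return [
        vpdpbusd_lane(acc[i], a[4 * i:4 * i + 4], b[4 * i:4 * i + 4])
        for i in range(lanes)
    ]


# 256-bit example: 8 lanes, each accumulating 4 * (255 * 127).
print(vpdpbusd([0] * 8, [255] * 32, [127] * 32, 256))
```

The per-lane math does not depend on the encoding, so a VEX form at 128/256 bits gives the same results per lane as the EVEX form would.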

I wonder if AVX-VNNI is backwards-compatible with AVX512-VNNI?

Very unlikely as EVEX and VEX encodings are two different things, and on old AVX512 hardware, those AVX VEX encoded instructions probably mean nothing and will trap. Needs testing tho.
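Since the two are separate feature bits, software would have to check them separately. Below is a minimal Python sketch for Linux that reads the `flags` line from `/proc/cpuinfo`; the flag names `avx_vnni` and `avx512_vnni` are the ones the kernel reports, while the helper names are hypothetical. As far as I know, the underlying CPUID bits are leaf 7 subleaf 1 EAX bit 4 for AVX-VNNI and leaf 7 subleaf 0 ECX bit 11 for AVX512-VNNI.

```python
def vnni_support(flags_line):
    """Report which VNNI flavours appear in a cpuinfo-style flags string."""
    flags = set(flags_line.split())
    return {
        "avx_vnni": "avx_vnni" in flags,        # VEX-encoded, 128/256-bit
        "avx512_vnni": "avx512_vnni" in flags,  # EVEX-encoded
    }


def read_cpu_flags(path="/proc/cpuinfo"):
    """Return the first 'flags' line of /proc/cpuinfo (Linux only)."""
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return line.split(":", 1)[1]
    return ""


# Sample flag strings, made up for illustration.
print(vnni_support("fpu sse2 avx avx2 avx512f avx512_vnni"))
print(vnni_support("fpu sse2 avx avx2 avx_vnni"))
```

A chip that reports only `avx512_vnni` would be expected to fault on the VEX-encoded forms, so a dispatcher needs one code path per flag rather than assuming one implies the other.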
 
  • Like
Reactions: Tlh97 and coercitiv

jpiniero

Lifer
Oct 1, 2010
14,510
5,159
136
That's . . . weird. They could have added more Golden Cove (especially if perf/watt is improved; @isoclocks that puts them at an advantage). Instead they double up on Gracemont? What's really going on here? MLiD being goofy again?

We discussed this earlier... Marketing likes the idea of having more cores and it would do better in MT benchmarks fully loaded. Adding another 2 slots is going to make the ring pretty long.