Discussion Intel current and future Lakes & Rapids thread

DrMrLordX · Jun 8, 2021

Thala said:
Perhaps AVX VNNI and AVX512 VNNI are two different things. That's because, as you correctly said, AVX512 VNNI instructions are only defined in conjunction with 512-bit registers. In addition, as far as I remember, it is required that if you implement any (of the many) AVX512 extensions, you need to at least implement AVX512F.

The only information I can find about AVX-VNNI so far is this:

GCC 11 Lands Support For Intel AVX-VNNI - Phoronix

www.phoronix.com

Most other search results seem to reference AVX512-VNNI

"most use" is relative. Given the fragmented state of all the different AVX512 extensions, I believe there is hardly any commercial application that uses AVX512. Even Intel's own Embree library is using mostly AVX or SSE.

Well, it seems that OpenVINO supports AVX512-VNNI, but I don't know if that counts as a commercial application. Also TensorFlow, PyTorch, and uhhhh something something I dunno.

IntelUser2000 · Jun 8, 2021

Exist50 said:
My understanding of AVX was that cracking it once (i.e. from 256b to 2x128b, or 512b to 2x256b) is doable without too much effort, but cracking it twice (512b to 4x128b) is disproportionately more complicated. Might be hearsay, but we'll have to see what "NextMont"/Meteor Lake does.

Also, why wouldn't Gracemont have 256-bit FP units? Because back with Goldmont it had full 128-bit FP units. One thing they mentioned was "vector performance".

The client versions of Icelake/Tigerlake don't have full 512-bit support either.

DrMrLordX · Jun 8, 2021

IntelUser2000 said:
Also, why wouldn't Gracemont have 256-bit FP units? Because back with Goldmont it had full 128-bit FP units. One thing they mentioned was "vector performance".

Does Tremont have 2x128 or 1x256?

The client versions of Icelake/Tigerlake don't have full 512-bit support either.

There's more to TigerLake than that though. Look at some of Anandtech's AVX-512 benchmarks in their recent TigerLake-H 8c review. 3DPM v2 performance on TigerLake-H is pretty beastly. I don't think it's competitive in that bench with Skylake-X at the same core count, but it's still better than Skylake/CoffeeLake/Comet Lake with its 2x256b AVX2.

IntelUser2000 · Jun 8, 2021

DrMrLordX said:
Does Tremont have 2x128 or 1x256?

It's 2x128. You can't have just 1x256, because one unit needs to handle FP add and the other FP multiply. That support has been there since the original Goldmont chip. Silvermont is like the Pentium 4 in that it supports SSE2 but needs 2 cycles to complete a 128-bit operation; it was with Core 2 that they made the units 128 bits wide.

SSE2-->SSE4.1: 128-bit
AVX: 256-bit
AVX2: 256-bit with FMA

There's more to TigerLake than that though. Look at some of Anandtech's AVX-512 benchmarks in their recent TigerLake-H 8c review. 3DPM v2 performance on TigerLake-H is pretty beastly.

Yes of course. AVX-512 brings other benefits and the Sunny Cove core allows more to work using AVX-512. The double load/store also helps immensely with AVX-512.

But throughput-wise? Same as Skylake client.

DrMrLordX · Jun 8, 2021

IntelUser2000 said:
It's 2x128

Hmm. Does anyone really believe that Gracemont is going to be 2x256b then?

SSE2-->SSE4.1: 128-bit
AVX: 256-bit
AVX2: 256-bit with FMA

I would like to point out that AVX128 is a thing.

IntelUser2000 · Jun 8, 2021

DrMrLordX said:
Hmm. Does anyone really believe that Gracemont is going to be 2x256b then?

I would like to point out that AVX128 is a thing.

It's the same with AVX-512 being executed using 2x-256bit units in ICL/TGL and SSE2 using 64-bit units in Pentium 4.
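To make the "wide instruction on narrower units" idea concrete, here is a toy sketch in Python. It is only a model of the concept discussed above, not a description of how any Intel or AMD core actually schedules micro-ops; the function names and lane counts are illustrative assumptions.

```python
# Toy model of "cracking": a wide vector op is split into narrower pieces
# that each fit the physical execution unit. Purely illustrative.

def vec_add(a, b):
    """Lane-wise add of two equal-length vectors (one 'native' op)."""
    return [x + y for x, y in zip(a, b)]

def cracked_add(a, b, unit_lanes):
    """Emulate a wide add on a unit that handles only unit_lanes lanes per op.

    Returns the full-width result and the number of narrow ops issued.
    """
    assert len(a) == len(b) and len(a) % unit_lanes == 0
    result, ops = [], 0
    for i in range(0, len(a), unit_lanes):
        result += vec_add(a[i:i + unit_lanes], b[i:i + unit_lanes])
        ops += 1
    return result, ops

# A 512-bit vector of 32-bit lanes has 16 lanes; a 256-bit unit handles 8.
a = list(range(16))
b = [10 * x for x in range(16)]
wide = vec_add(a, b)
halves, n_ops_256 = cracked_add(a, b, unit_lanes=8)    # 512b -> 2x256b
quarters, n_ops_128 = cracked_add(a, b, unit_lanes=4)  # 512b -> 4x128b
```

The result is identical either way; what changes is how many narrow ops are needed per wide instruction (2 versus 4 here), which is the throughput cost being argued about.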

You are talking about Jaguar right? That was a 28nm CPU and it had 128-bit FPUs, and double the L/S bandwidth which helps for FP performance in particular.

Carfax83 · Jun 8, 2021

IntelUser2000 said:
It's the same with AVX-512 being executed using 2x-256bit units in ICL/TGL and SSE2 using 64-bit units in Pentium 4.

I remember reading numerous times on Realworldtech forums that ICL and TGL have one native FMA 512 bit unit, and if I recall, the load and store units are 2x256 bit. Andrei was corrected by some of those forum users when he made a mistake in one of his reviews.

IntelUser2000 · Jun 8, 2021

Carfax83 said:
I remember reading numerous times on Realworldtech forums that ICL and TGL have one native FMA 512 bit unit, and if I recall, the load and store units are 2x256 bit. Andrei was corrected by some of those forum users when he made a mistake in one of his reviews.

It's right in their x86 optimization manual:

All processors based on Ice Lake Client microarchitecture contain a single 512-bit FMA unit, whereas some of the processors based on Skylake Server microarchitecture contain two such units. Both processors contain two 256-bit FMA units. The power consumed by Ice Lake Client FMA units is the same, whereas on Skylake Server the 512-bit units consume twice as much.

Agner Fog:

There are three ports that can handle integer vector arithmetic and logic operations, port 0, 1, and 5. Port 0 and 1 have 256 bit width, while port 5 has 512 bit width. A 512 bit vector operation can use either port 5 or port 0 and 1 combined. For example, integer vector addition with a vector size of up to 256 bits has a throughput of three instructions per clock cycle, while addition of 512 bit vectors has a throughput of two instructions per clock cycle because port 0 and 1 need to be combined to make a 512-bit operation. Only port 0 and 1 can handle floating point vector operations. Both have 256 bit width. The throughput is two floating point operations per clock cycle with scalars and vectors of 128 or 256 bits, while the throughput is one floating point vector operation per clock cycle with 512 bit vectors, using port 0 and 1 combined.

From Icelake-SP Hot Chips presentation:

Server enhancements – larger Mid-level Cache (L2) + second FMA

So Sunny Cove has full AVX-512 performance on Integer workloads, but for floating point workloads it's identical to Skylake, even doubling up the 256-bit FP units to work AVX-512.
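The arithmetic behind that conclusion can be sketched quickly. The per-cycle operation counts below are taken from the Agner Fog passage quoted above; the 32-bit lane width and the helper name are assumptions chosen for illustration.

```python
def lanes_per_cycle(vector_bits, ops_per_cycle, lane_bits=32):
    """Peak lanes processed per clock for a given vector width."""
    return (vector_bits // lane_bits) * ops_per_cycle

# Floating point: two ops/cycle at 256 bits, one op/cycle at 512 bits
# (ports 0 and 1 combined).
fp_256 = lanes_per_cycle(256, 2)
fp_512 = lanes_per_cycle(512, 1)

# Integer: three ops/cycle at 256 bits, two ops/cycle at 512 bits
# (port 5, or ports 0 and 1 combined).
int_256 = lanes_per_cycle(256, 3)
int_512 = lanes_per_cycle(512, 2)

print(fp_256, fp_512, int_256, int_512)
```

With these inputs the floating point peak is the same at 256 and 512 bits (16 lanes per cycle each), while the integer peak rises from 24 to 32 lanes per cycle.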

Certainly area won't be a big limiter to implementing full 256-bit AVX2 on Gracemont, since we can see from Knights Landing that even full AVX-512 support is quite small. It's just inefficient in Core.

DrMrLordX · Jun 9, 2021

IntelUser2000 said:
You are talking about Jaguar right? That was a 28nm CPU and it had 128-bit FPUs, and double the L/S bandwidth which helps for FP performance in particular.

Actually I was more thinking Summit Ridge:

Advanced Vector Extensions - Wikipedia

en.wikipedia.org

The AVX instructions support both 128-bit and 256-bit SIMD. The 128-bit versions can be useful to improve old code without needing to widen the vectorization, and avoid the penalty of going from SSE to AVX, they are also faster on some early AMD implementations of AVX. This mode is sometimes known as AVX-128

Yeah technically Jaguar (and all the CON cores) supported AVX, but using AVX never really improved performance on those chips. Jaguar was about as fast with SSE4.x while the CON cores were at their fastest using XOP. It wasn't until Summit Ridge that AMD had a CPU where 128b AVX code provided a significant performance advantage over older 128b SIMD (such as SSE4.x; Summit Ridge had no support for XOP). The guy who wrote/maintains y-cruncher actually referred to the instruction set he used for Summit Ridge as "ADX". Why, I don't know, but he did.

Now how all this relates to Alder Lake (specifically Gracemont) . . .we'll see I guess? If they are able to fully implement 2x256b AVX2 without bloating die size then great! But obviously they aren't making the jump to 1x512b for AVX-512, which is why we went down this rabbit hole in the first place.

jur · Jun 9, 2021

Just FYI guys, Ice Lake and Tiger Lake seem to be able to execute 2 instructions per cycle for a limited subset of avx512 instructions (this is on zmm registers). Check out: https://github.com/InstLatx64 This is still a very minor improvement in throughput, because Skylake already did 3x256 per cycle.

There are lines such as:

Code:

AVX512BW :VPADDB zmm, zmm, zmm  L:   0.78ns=  1.0c  T:   0.39ns=  0.50c
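For anyone reading these dumps for the first time, here is a rough sketch of how that line can be turned into instructions per cycle. The regular expression is my own guess at the layout based on this single line, not an official InstLatx64 format description.

```python
import re

LINE = "AVX512BW :VPADDB zmm, zmm, zmm  L:   0.78ns=  1.0c  T:   0.39ns=  0.50c"

# Assumed layout: "<ext> :<instruction>  L: <ns>ns= <cycles>c  T: <ns>ns= <cycles>c"
PATTERN = re.compile(
    r"^(\S+)\s*:(.+?)\s+L:\s*([\d.]+)ns=\s*([\d.]+)c\s+T:\s*([\d.]+)ns=\s*([\d.]+)c"
)

def parse_instlat_line(line):
    """Extract latency and reciprocal throughput from one result line."""
    m = PATTERN.match(line)
    if m is None:
        raise ValueError("line does not match the assumed layout")
    ext, instr, lat_ns, lat_c, tput_ns, tput_c = m.groups()
    lat_ns, lat_c, tput_ns, tput_c = map(float, (lat_ns, lat_c, tput_ns, tput_c))
    return {
        "extension": ext,
        "instruction": instr.strip(),
        "latency_cycles": lat_c,
        # T is cycles per instruction, so invert it for instructions per cycle.
        "instr_per_cycle": 1.0 / tput_c,
        # cycles / ns gives the clock (in GHz) the measurement implies.
        "implied_clock_ghz": lat_c / lat_ns,
    }

info = parse_instlat_line(LINE)
print(info)
```

The T value is a reciprocal throughput, so 0.50c means two of these zmm instructions retire per cycle.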

Magic Carpet · Jun 9, 2021

Thala said:
"most use" is relative. Given the fragmented state of all the different AVX512 extensions, I believe there is hardly any commercial application that uses AVX512.

Acustica Audio, off the top of my head.

IntelUser2000 · Jun 9, 2021

jur said:
Just FYI guys, Ice Lake and Tiger Lake seem to be able to execute 2 instructions per cycle for a limited subset of avx512 instructions (this is on zmm registers). Check out: https://github.com/InstLatx64 This is still a very minor improvement in throughput, because Skylake already did 3x256 per cycle.

AVX-512 allows you to vectorize instructions that couldn't be done before, and ICL/TGL has full 512-bit support for Integer. So in some cases you aren't going from 256 to 512, but 64 to 512. Of course, if it weren't for the massive fragmentation Intel created, it could have been much better. I mean, no support on Core-based Pentium/Celeron, really? They end at SSE4!

It's in FMA where it doesn't advance compared to Skylake client. But it's also in FP where it uses a massive amount of power.

@DrMrLordX I think if we analyze the die size, AVX2 is about the limit before it becomes too much of an adder for die size for, say, Tremont. So if you significantly increase the size again by implementing even AVX-512 for Integer (never mind doubling it as with server), then it makes no sense.

Even for Core chips AVX2 was the point where the tradeoff between die area/power and performance was ok. When they doubled it again with AVX-512 on Skylake, we started having issues.

mikk · Jun 9, 2021

According to Moore's Law Is Dead, Intel plans to lift the ADL-S embargo on October 25th. He says only 125W K-SKUs will be launched this year and the others will follow in Q1 2022. ADL-P will be announced at CES in early 2022. Some Raptor Lake info: it's an 8+16 design roughly 1 year after ADL-S. Same Gracemont cores but enhanced Golden Cove with more IPC and improved perf/watt.

Carfax83 · Jun 9, 2021

Magic Carpet said:
Acustica Audio, off the top of my head.


Numerous codecs and renderers also use AVX-512. Intel SVT-AV1 uses AVX-512 to great effect apparently, and is going to be used by Netflix for streaming. It performed very well on the Ice Lake Xeon in the Phoronix review:

Asterox · Jun 9, 2021

mikk said:
According to Moore's Law Is Dead, Intel plans to lift the ADL-S embargo on October 25th. He says only 125W K-SKUs will be launched this year and the others will follow in Q1 2022. ADL-P will be announced at CES in early 2022. Some Raptor Lake info: it's an 8+16 design roughly 1 year after ADL-S. Same Gracemont cores but enhanced Golden Cove with more IPC and improved perf/watt.

A lots of if, or how from Intel etc.

If AL 8+8 is 125W, i wonder how will Intel hit red target if Raptor Lake is 8+16.

If Alder Lake and Raptor Lake use same 10nm, red is just another classic morning fog.

semiman · Jun 9, 2021

Shivansps said:
I'm thinking out loud here, but maybe AVX-512 is supported in a similar way to how AMD implemented AVX-256 on Zen 1 with two AVX-128 units.

Probably not that easy. Unlike previous AVX instructions, AVX512 works as a utility as well. Pattern search, crypto-related operations are not trivial SIMD operations. Implementation must be pretty hard. In any case, Intel needs to find a way to reduce software development burdens.

jpiniero · Jun 9, 2021

Asterox said:
A lots of if, or how from Intel etc.

If AL 8+8 is 125W, i wonder how will Intel hit red target if Raptor Lake is 8+16.

If Alder Lake and Raptor Lake use same 10nm, red is just another classic morning fog.

At base clock the small cores likely don't draw that much.

Thala · Jun 9, 2021

DrMrLordX said:
The only information I can find about AVX-VNNI so far is this:

GCC 11 Lands Support For Intel AVX-VNNI - Phoronix

www.phoronix.com

Actually LLVM/Clang has been updated as well, if you look at this commit. You can also see that the instructions are available for 128-bit and 256-bit vector sizes under AVX-VNNI.

mikk · Jun 9, 2021

Asterox said:
A lots of if, or how from Intel etc.

If AL 8+8 is 125W, i wonder how will Intel hit red target if Raptor Lake is 8+16.

If Alder Lake and Raptor Lake use same 10nm, red is just another classic morning fog.

Improved perf/watt from Raptor Cove and further improved 10nm. They may have to clock 8 Gracemont a little lower than 4 Gracemont, it's the little core anyways. More troublesome in the same power envelope would have been a core increase from the big core.

DrMrLordX · Jun 9, 2021

mikk said:
Some Raptor Lake info: it's an 8+16 design roughly 1 year after ADL-S. Same Gracemont cores but enhanced Golden Cove with more IPC and improved perf/watt.

That's . . . weird. They could have added more Golden Cove (especially if perf/watt is improved; @isoclocks that puts them at an advantage). Instead they double up on Gracemont? What's really going on here? MLiD being goofy again?

Thala said:
Actually LLVM/Clang has been updated as well, if you look at this commit. You can also see that the instructions are available for 128-bit and 256-bit vector sizes under AVX-VNNI.

I wonder if AVX-VNNI is backwards-compatible with AVX512-VNNI?

eek2121 · Jun 9, 2021

mikk said:
According to Moore's Law Is Dead, Intel plans to lift the ADL-S embargo on October 25th. He says only 125W K-SKUs will be launched this year and the others will follow in Q1 2022. ADL-P will be announced at CES in early 2022. Some Raptor Lake info: it's an 8+16 design roughly 1 year after ADL-S. Same Gracemont cores but enhanced Golden Cove with more IPC and improved perf/watt.

Has he ever beaten leaks and gotten anything right? No, he has not. Desktop Alder Lake will be released before mobile (that much is a given considering Intel just refreshed TGL-U and just launched TGL-H... sorry "Moore's Law is Dead", many of us knew this for longer than you did). From my current understanding, Intel plans to announce ADL-S in late November or late December this year. They have not decided on a final date (I actually know this for a fact, that's why MLID needs to kindly go away), but importantly, please realize that announcement != release. Desktop ADL-S is not going to be widely available before next year. I will absolutely 100% put money down for a bet on this. I've been saying this for months, but these YouTubers keep trying to get their clicks...

Asterox said:
A lots of if, or how from Intel etc.

If AL 8+8 is 125W, i wonder how will Intel hit red target if Raptor Lake is 8+16.

If Alder Lake and Raptor Lake use same 10nm, red is just another classic morning fog.

Please stop with the bold and red text. You are usually wrong (almost as bad as a certain other member of this forum whose name also happens to start with an A) and while occasionally you can be comedic, you annoy more often than not, especially those of us who have the misfortune of browsing this forum on mobile.

IntelUser2000 · Jun 10, 2021

diediealldie said:
Probably not that easy. Unlike previous AVX instructions, AVX512 works as a utility as well. Pattern search, crypto-related operations are not trivial SIMD operations. Implementation must be pretty hard. In any case, Intel needs to find a way to reduce software development burdens.

Guys, if the CPU supports FMA, it doesn't need to be broken down into two ops. Jaguar had to do that because it didn't have it, so the FPadd executed it in two cycles and so did FPmul.

CPUs that support FMA, such as Icelake (the first Intel CPU with FMA was Haswell), will only have half the throughput, but each AVX-512 instruction will still have 1 cycle latency.

coercitiv · Jun 10, 2021

eek2121 said:
Has he ever beaten leaks and gotten anything right?

It's worse than that, sometimes he just deletes videos:

Zen 4 vs Golden Cove, a video full of fan fiction tier analysis = nuked off the earth

"Zen 4 will have SMT4, 100% sure" = nuked

Video about a Navi lineup that was completely false = nuked

Another Navi lineup "leak"? = nuked

Threadripper fanfiction? = nuked

Zen 3 I/O die speculation = nuked

People forget this again and again: a leaker is not somebody who has interesting information, but someone who has previously established themselves as a credible source.

JoeRambo · Jun 10, 2021

Thala said:
Actually LLVM/Clang has been updated as well, if you look at this commit. You can also see that the instructions are available for 128-bit and 256-bit vector sizes under AVX-VNNI.

Good find, as predicted earlier in this thread and now confirmed by LLVM commit log:

2. Introduce ExplicitVEXPrefix flag so that vpdpbusd/vpdpbusds/vpdpbusds/vpdpbusds instructions only use vex-encoding when user explicity add {vex} prefix.

So basically instructions from AVX512 that used to be encoded with EVEX prefix, now come with VEX ( from AVX / AVX2 ) and do the same operations on 128/256bit registers. Smart thing for Intel to do.
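For reference, a minimal scalar sketch of what one of these instructions (vpdpbusd) computes, written in Python for readability. This models the non-saturating form as I understand it: per 32-bit lane, four unsigned bytes from one source are multiplied by four signed bytes from the other, and the sum is added to the accumulator with wraparound. The helper names are hypothetical, and the lane count is just the register width divided by 32.

```python
def to_signed(value, bits):
    """Interpret the low `bits` of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def vpdpbusd_lane(acc, a_bytes, b_bytes):
    """One 32-bit lane: acc + sum(unsigned a[i] * signed b[i]) over 4 bytes."""
    assert len(a_bytes) == len(b_bytes) == 4
    total = acc
    for a, b in zip(a_bytes, b_bytes):
        total += (a & 0xFF) * to_signed(b, 8)
    return to_signed(total, 32)  # wrap to 32 bits, no saturation

def vpdpbusd(acc, a, b):
    """Whole-vector form: acc holds 32-bit lanes, a and b hold 4 bytes per lane."""
    assert len(a) == len(b) == 4 * len(acc)
    return [
        vpdpbusd_lane(acc[i], a[4 * i:4 * i + 4], b[4 * i:4 * i + 4])
        for i in range(len(acc))
    ]

# One lane: 100 + (1*1 + 2*(-1) + 3*2 + 255*(-128))
one_lane = vpdpbusd([100], [1, 2, 3, 255], [0x01, 0xFF, 0x02, 0x80])

# A 256-bit register holds 8 such lanes (a 128-bit one 4, a 512-bit one 16).
ymm_like = vpdpbusd([0] * 8, [1] * 32, [2] * 32)
```

The same per-lane operation applies at every width; only the number of lanes differs between the VEX (128/256-bit) and EVEX forms.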

DrMrLordX said:
I wonder if AVX-VNNI is backwards-compatible with AVX512-VNNI?

Very unlikely as EVEX and VEX encodings are two different things, and on old AVX512 hardware, those AVX VEX encoded instructions probably mean nothing and will trap. Needs testing tho.
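Since the two encodings are advertised by separate CPUID feature bits, software has to check them independently. A hedged sketch of that check follows: the bit positions are the ones I understand Intel to document (AVX512_VNNI at CPUID.(EAX=7,ECX=0):ECX bit 11, AVX-VNNI at CPUID.(EAX=7,ECX=1):EAX bit 4). The code does not execute CPUID itself; the register values are hypothetical inputs, and `pick_kernel` is a made-up name for illustration.

```python
AVX512_VNNI_BIT = 11  # CPUID.(EAX=7, ECX=0):ECX, bit 11
AVX_VNNI_BIT = 4      # CPUID.(EAX=7, ECX=1):EAX, bit 4

def vnni_support(leaf7_0_ecx, leaf7_1_eax):
    """Decode the two VNNI feature bits from raw CPUID register values."""
    return {
        "avx512_vnni": bool((leaf7_0_ecx >> AVX512_VNNI_BIT) & 1),
        "avx_vnni": bool((leaf7_1_eax >> AVX_VNNI_BIT) & 1),
    }

def pick_kernel(support):
    """Choose a code path; the names are illustrative, not a real library API."""
    if support["avx512_vnni"]:
        return "evex"      # EVEX-encoded AVX512-VNNI path
    if support["avx_vnni"]:
        return "vex"       # VEX-encoded AVX-VNNI path
    return "fallback"      # plain AVX2 or scalar

# Hypothetical register values: a CPU with only AVX512-VNNI,
# one with only AVX-VNNI, and one with neither.
only_avx512 = vnni_support(1 << 11, 0)
only_avx = vnni_support(0, 1 << 4)
neither = vnni_support(0, 0)
```

A real dispatcher would also need to confirm that the OS has enabled the relevant register state, which this sketch leaves out.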

jpiniero · Jun 10, 2021

DrMrLordX said:
That's . . . weird. They could have added more Golden Cove (especially if perf/watt is improved; @isoclocks that puts them at an advantage). Instead they double up on Gracemont? What's really going on here? MLiD being goofy again?

We discussed this earlier... Marketing likes the idea of having more cores and it would do better in MT benchmarks fully loaded. Adding another 2 slots is going to make the ring pretty long.
