
Discussion Intel current and future Lakes & Rapids thread


DrMrLordX

Lifer
Apr 27, 2000
17,268
6,267
136
Also, why wouldn't Gracemont have 256-bit FP units? Because back with Goldmont it had full 128-bit FP units. One thing they mentioned was "vector performance".
Does Tremont have 2x128 or 1x256?

The client versions of Icelake/Tigerlake don't have full 512-bit support either.
There's more to TigerLake than that though. Look at some of Anandtech's AVX-512 benchmarks in their recent TigerLake-H 8c review. 3DPM v2 performance on TigerLake-H is pretty beastly. I don't think it's competitive in that bench with Skylake-X at the same core count, but it's still better than Skylake/CoffeeLake/Comet Lake with its 2x256b AVX2.
 

IntelUser2000

Elite Member
Oct 14, 2003
7,497
2,279
136
Does Tremont have 2x128 or 1x256?
It's 2x128. You can't have just 1x256, because one unit needs to be for FP add and the other for FP multiply. That support has been there since the original Goldmont chip. Silvermont is like the Pentium 4 in that it supports SSE2 but needs 2 cycles to complete a 128-bit operation; it was with Core 2 that Intel made the units 128 bits wide.

SSE2-->SSE4.1: 128-bit
AVX: 256-bit
AVX2: 256-bit with FMA

There's more to TigerLake than that though. Look at some of Anandtech's AVX-512 benchmarks in their recent TigerLake-H 8c review. 3DPM v2 performance on TigerLake-H is pretty beastly.
Yes of course. AVX-512 brings other benefits and the Sunny Cove core allows more to work using AVX-512. The double load/store also helps immensely with AVX-512.

But throughput-wise? Same as Skylake client.
 

IntelUser2000

Elite Member
Oct 14, 2003
7,497
2,279
136
Hmm. Does anyone really believe that Gracemont is going to be 2x256b then?

I would like to point out that AVX128 is a thing.
It's the same as AVX-512 being executed on 2x 256-bit units in ICL/TGL, or SSE2 on 64-bit units in the Pentium 4.

You are talking about Jaguar right? That was a 28nm CPU and it had 128-bit FPUs, and double the L/S bandwidth which helps for FP performance in particular.
 

Carfax83

Diamond Member
Nov 1, 2010
6,056
859
126
It's the same as AVX-512 being executed on 2x 256-bit units in ICL/TGL, or SSE2 on 64-bit units in the Pentium 4.
I remember reading numerous times on Realworldtech forums that ICL and TGL have one native FMA 512 bit unit, and if I recall, the load and store units are 2x256 bit. Andrei was corrected by some of those forum users when he made a mistake in one of his reviews.
 

IntelUser2000

Elite Member
Oct 14, 2003
7,497
2,279
136
I remember reading numerous times on Realworldtech forums that ICL and TGL have one native FMA 512 bit unit, and if I recall, the load and store units are 2x256 bit. Andrei was corrected by some of those forum users when he made a mistake in one of his reviews.
It's right in their x86 optimization manual:
All processors based on Ice Lake Client microarchitecture contain a single 512-bit FMA unit, whereas some of the processors based on Skylake Server microarchitecture contain two such units. Both processors contain two 256-bit FMA units. The power consumed by Ice Lake Client FMA units is the same, whereas on Skylake Server the 512-bit units consume twice as much.
Agner Fog:
There are three ports that can handle integer vector arithmetic and logic operations, port 0, 1, and 5. Port 0 and 1 have 256 bit width, while port 5 has 512 bit width. A 512 bit vector operation can use either port 5 or port 0 and 1 combined. For example, integer vector addition with a vector size of up to 256 bits has a throughput of three instructions per clock cycle, while addition of 512 bit vectors has a throughput of two instructions per clock cycle because port 0 and 1 need to be combined to make a 512-bit operation. Only port 0 and 1 can handle floating point vector operations. Both have 256 bit width. The throughput is two floating point operations per clock cycle with scalars and vectors of 128 or 256 bits, while the throughput is one floating point vector operation per clock cycle with 512 bit vectors, using port 0 and 1 combined.
From Icelake-SP Hot Chips presentation:
Server enhancements – larger Mid-level Cache (L2) + second FMA
So Sunny Cove has full AVX-512 performance on Integer workloads, but for floating point workloads it's identical to Skylake, even doubling up the 256-bit FP units to work AVX-512.
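
The port arithmetic in the Agner Fog passage can be checked with a small sketch. This is only an illustrative model, not a simulator: the `Port` class and the `ops_per_cycle` helper are hypothetical names, and the rule that two 256-bit ports fuse into one 512-bit operation is taken from the quoted text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    name: str
    width_bits: int
    does_fp: bool  # whether the port handles floating point vector ops


# Port layout as described in the quoted Agner Fog passage (client Sunny Cove).
SUNNY_COVE_CLIENT = [
    Port("p0", 256, True),
    Port("p1", 256, True),
    Port("p5", 512, False),
]


def ops_per_cycle(ports, vector_bits, fp):
    """Vector operations per clock for a given vector width and op type."""
    usable = [p for p in ports if p.does_fp or not fp]
    if vector_bits <= 256:
        # Every usable port can take one op of up to 256 bits.
        return len(usable)
    # 512-bit ops: natively wide ports count once, 256-bit ports pair up.
    native = sum(1 for p in usable if p.width_bits >= 512)
    narrow = sum(1 for p in usable if p.width_bits == 256)
    return native + narrow // 2


def bits_per_cycle(ports, vector_bits, fp):
    return ops_per_cycle(ports, vector_bits, fp) * vector_bits


if __name__ == "__main__":
    for fp in (False, True):
        for width in (256, 512):
            kind = "fp" if fp else "int"
            print(kind, width, ops_per_cycle(SUNNY_COVE_CLIENT, width, fp),
                  bits_per_cycle(SUNNY_COVE_CLIENT, width, fp))
```

Under this model, floating point moves 512 bits per clock whether the code uses 256-bit or 512-bit vectors, while integer goes from 768 to 1024 bits per clock, which is the point being made above.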

Certainly area won't be a big limiter to implementing full 256-bit AVX2 on Gracemont, since we can see from Knights Landing that even full AVX-512 support is quite small. It's just inefficient in Core.
 

DrMrLordX

Lifer
Apr 27, 2000
17,268
6,267
136
You are talking about Jaguar right? That was a 28nm CPU and it had 128-bit FPUs, and double the L/S bandwidth which helps for FP performance in particular.
Actually I was more thinking Summit Ridge:


The AVX instructions support both 128-bit and 256-bit SIMD. The 128-bit versions can be useful to improve old code without needing to widen the vectorization, and avoid the penalty of going from SSE to AVX, they are also faster on some early AMD implementations of AVX. This mode is sometimes known as AVX-128
Yeah, technically Jaguar (and all the CON cores) supported AVX, but using AVX never really improved performance on those chips. Jaguar was about as fast with SSE4.x, while the CON cores were at their fastest using XOP. It wasn't until Summit Ridge that AMD had a CPU where 128b AVX code provided a significant performance advantage over older 128b SIMD (such as SSE4.x; Summit Ridge had no support for XOP). The guy who wrote/maintains y-cruncher actually referred to the instruction set he used for Summit Ridge as "ADX". Why, I don't know, but he did.

Now how all this relates to Alder Lake (specifically Gracemont) . . . we'll see, I guess? If they are able to fully implement 2x256b AVX2 without bloating die size then great! But obviously they aren't making the jump to 1x512b for AVX-512, which is why we went down this rabbit hole in the first place.
 

jur

Junior Member
Nov 23, 2016
14
1
81
Just FYI guys, Ice Lake and Tiger Lake seem to be able to execute 2 instructions per cycle for a limited subset of AVX-512 instructions (this is on zmm registers); see https://github.com/InstLatx64 for the data. This is still a very minor improvement in throughput, because Skylake already did 3x256 per cycle.

There are lines such as:
Code:
AVX512BW :VPADDB zmm, zmm, zmm  L:   0.78ns=  1.0c  T:   0.39ns=  0.50c
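
To read a line like this, here is a minimal sketch that pulls out the numbers. It assumes `L:` is latency and `T:` is reciprocal throughput (cycles per instruction), which is consistent with the "2 instructions per cycle" reading above; the `parse_timing` helper and its regex are illustrative and are not part of the InstLatx64 tooling.

```python
import re

LINE = "AVX512BW :VPADDB zmm, zmm, zmm  L:   0.78ns=  1.0c  T:   0.39ns=  0.50c"

# Matches the "L: <ns>ns= <cycles>c  T: <ns>ns= <cycles>c" tail of the line.
PATTERN = re.compile(
    r"L:\s*(?P<lat_ns>[\d.]+)ns=\s*(?P<lat_c>[\d.]+)c\s+"
    r"T:\s*(?P<thr_ns>[\d.]+)ns=\s*(?P<thr_c>[\d.]+)c"
)


def parse_timing(line):
    """Extract latency and throughput figures from one dump line."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("no timing fields found in line")
    thr_cycles = float(m["thr_c"])
    thr_ns = float(m["thr_ns"])
    return {
        "latency_cycles": float(m["lat_c"]),
        "reciprocal_throughput_cycles": thr_cycles,
        # 0.50 cycles per instruction means 2 instructions per cycle.
        "instructions_per_cycle": 1.0 / thr_cycles,
        # cycles / ns gives the clock in GHz during the measurement.
        "implied_clock_ghz": thr_cycles / thr_ns,
    }


if __name__ == "__main__":
    print(parse_timing(LINE))
```

The implied clock works out to roughly 1.28 GHz from both the latency and the throughput fields, so the two columns are at least internally consistent.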
 

IntelUser2000

Elite Member
Oct 14, 2003
7,497
2,279
136
Just FYI guys, Ice Lake and Tiger Lake seem to be able to execute 2 instructions per cycle for a limited subset of AVX-512 instructions (this is on zmm registers); see https://github.com/InstLatx64 for the data. This is still a very minor improvement in throughput, because Skylake already did 3x256 per cycle.
AVX-512 allows you to vectorize code that couldn't be vectorized before, and ICL/TGL has full 512-bit support for integer. So in some cases you aren't going from 256 to 512, but from 64 to 512. Of course, if it weren't for the massive fragmentation Intel created, it could have been much better. I mean, no support on Core-based Pentium/Celeron, really? They end at SSE4!

It's in FMA where it doesn't advance compared to Skylake client. But it's also in FP where it uses a massive amount of power.

@DrMrLordX I think if we analyze the die size, AVX2 is about the limit before it adds too much die area for, say, Tremont. So if you significantly increase the size again by implementing AVX-512 even just for integer (never mind doubling it as on server), it makes no sense.

Even for Core chips AVX2 was the point where the tradeoff between die area/power and performance was ok. When they doubled it again with AVX-512 on Skylake, we started having issues.
 

mikk

Diamond Member
May 15, 2012
3,138
947
136
According to Moore's Law Is Dead, Intel plans to lift the ADL-S embargo on October 25th. He says only the 125W K-SKUs will launch this year, with the others following in Q1 2022. ADL-P will be announced at CES in early 2022. Some Raptor Lake info: it's an 8+16 design arriving roughly 1 year after ADL-S, with the same Gracemont cores but an enhanced Golden Cove with more IPC and improved perf/watt.

 

Asterox

Senior member
May 15, 2012
531
772
136
According to Moore's Law Is Dead, Intel plans to lift the ADL-S embargo on October 25th. He says only the 125W K-SKUs will launch this year, with the others following in Q1 2022. ADL-P will be announced at CES in early 2022. Some Raptor Lake info: it's an 8+16 design arriving roughly 1 year after ADL-S, with the same Gracemont cores but an enhanced Golden Cove with more IPC and improved perf/watt.

A lot of ifs and hows from Intel, etc.

If AL 8+8 is 125W, I wonder how Intel will hit red target if Raptor Lake is 8+16.

If Alder Lake and Raptor Lake use the same 10nm, red is just another classic morning fog.
 

diediealldie

Junior Member
May 9, 2020
10
18
41
I'm thinking out loud here, but maybe AVX-512 is supported in a similar way to how AMD implemented AVX-256 on Zen 1 with two AVX-128 units.
Probably not that easy. Unlike previous AVX instructions, AVX-512 works as a utility as well. Pattern search and crypto-related operations are not trivial SIMD operations, so implementation must be pretty hard. In any case, Intel needs to find a way to reduce the software development burden.
 

jpiniero

Diamond Member
Oct 1, 2010
9,339
1,887
136
A lot of ifs and hows from Intel, etc.

If AL 8+8 is 125W, I wonder how Intel will hit red target if Raptor Lake is 8+16.

If Alder Lake and Raptor Lake use the same 10nm, red is just another classic morning fog.
At base clock the small cores likely don't draw that much.
 
  • Like
Reactions: dullard

mikk

Diamond Member
May 15, 2012
3,138
947
136
A lot of ifs and hows from Intel, etc.

If AL 8+8 is 125W, I wonder how Intel will hit red target if Raptor Lake is 8+16.

If Alder Lake and Raptor Lake use the same 10nm, red is just another classic morning fog.

Improved perf/watt from Raptor Cove and a further improved 10nm. They may have to clock 8 Gracemont a little lower than 4 Gracemont; it's the little core anyway. An increase in big-core count would have been more troublesome in the same power envelope.
 
  • Like
Reactions: Tlh97 and dullard

DrMrLordX

Lifer
Apr 27, 2000
17,268
6,267
136
Some Raptor Lake info: it's an 8+16 design arriving roughly 1 year after ADL-S, with the same Gracemont cores but an enhanced Golden Cove with more IPC and improved perf/watt.
That's . . . weird. They could have added more Golden Cove (especially if perf/watt is improved; @isoclocks that puts them at an advantage). Instead they double up on Gracemont? What's really going on here? MLiD being goofy again?

Actually LLVM/Clang has been updated as well, if you look at this commit. You can also see that the instructions are available for 128-bit and 256-bit vector sizes under AVX-VNNI.
I wonder if AVX-VNNI is backwards-compatible with AVX512-VNNI?
 

eek2121

Senior member
Aug 2, 2005
990
1,071
136
According to Moore's Law Is Dead, Intel plans to lift the ADL-S embargo on October 25th. He says only the 125W K-SKUs will launch this year, with the others following in Q1 2022. ADL-P will be announced at CES in early 2022. Some Raptor Lake info: it's an 8+16 design arriving roughly 1 year after ADL-S, with the same Gracemont cores but an enhanced Golden Cove with more IPC and improved perf/watt.

Has he ever beaten leaks and gotten anything right? No, he has not. Desktop Alder Lake will be released before mobile (that much is a given considering Intel just refreshed TGL-U and just launched TGL-H...sorry "Moore's Law is Dead", many of us knew this for longer than you did). From my current understanding, Intel plans to announce ADL-S in late November or late December this year. They have not decided on a final date (I actually know this for a fact, that's why MLID needs to kindly go away), but importantly, please realize that announcement != release. Desktop ADL-S is not going to be widely available before next year. I will absolutely 100% put money down for a bet on this. I've been saying this for months, but these youtubers keep trying to get their clicks...

A lot of ifs and hows from Intel, etc.

If AL 8+8 is 125W, I wonder how Intel will hit red target if Raptor Lake is 8+16.

If Alder Lake and Raptor Lake use the same 10nm, red is just another classic morning fog.
Please stop with the bold and red text. You are usually wrong (almost as bad as a certain other member of this forum whose name also happens to start with an A), and while you can occasionally be comedic, you annoy more often than not, especially those of us who have the misfortune of browsing this forum on mobile.
 

IntelUser2000

Elite Member
Oct 14, 2003
7,497
2,279
136
Probably not that easy. Unlike previous AVX instructions, AVX-512 works as a utility as well. Pattern search and crypto-related operations are not trivial SIMD operations, so implementation must be pretty hard. In any case, Intel needs to find a way to reduce the software development burden.
Guys, if the CPU supports FMA, it doesn't need to be broken down into two ops. Jaguar had to do that because it didn't have it, so the FPadd executed it in two cycles and so did FPmul.

CPUs that support FMA, such as Icelake (Intel's first FMA CPU was Haswell), will only have half the throughput, but each AVX-512 instruction will still have 1 cycle latency.
 

coercitiv

Diamond Member
Jan 24, 2014
4,363
5,698
136
Has he ever beaten leaks and gotten anything right?
It's worse than that, sometimes he just deletes videos:
People forget this again and again: a leaker is not somebody who has interesting information, but someone who has previously established themselves as a credible source.
 

JoeRambo

Golden Member
Jun 13, 2013
1,119
979
136
Actually LLVM/Clang has been updated as well, if you look at this commit. You can also see that the instructions are available for 128-bit and 256-bit vector sizes under AVX-VNNI.

Good find, as predicted earlier in this thread and now confirmed by LLVM commit log:

2. Introduce ExplicitVEXPrefix flag so that vpdpbusd/vpdpbusds/vpdpbusds/vpdpbusds instructions only use vex-encoding when user explicity add {vex} prefix.
So basically, instructions from AVX-512 that used to be encoded with the EVEX prefix now come with VEX (from AVX/AVX2) and do the same operations on 128/256-bit registers. Smart thing for Intel to do.

I wonder if AVX-VNNI is backwards-compatible with AVX512-VNNI?
Very unlikely as EVEX and VEX encodings are two different things, and on old AVX512 hardware, those AVX VEX encoded instructions probably mean nothing and will trap. Needs testing tho.
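
Because the VEX and EVEX forms are separate features, software would have to check for each one independently. A minimal sketch of that check is below, assuming a Linux-style `flags` line where the two features appear as `avx_vnni` and `avx512_vnni`; the helper names are hypothetical and the flag spellings should be verified against the kernel in use.

```python
def vnni_support(flags_line):
    """Report which VNNI encodings a CPU advertises, given its flags string."""
    flags = set(flags_line.split())
    return {
        # VEX-encoded forms on 128/256-bit registers (AVX-VNNI).
        "vex_vnni": "avx_vnni" in flags,
        # EVEX-encoded forms (AVX512-VNNI).
        "evex_vnni": "avx512_vnni" in flags,
    }


def read_cpu_flags(path="/proc/cpuinfo"):
    """Return the first 'flags' line from a Linux cpuinfo file, or ''."""
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return line.split(":", 1)[1]
    return ""


if __name__ == "__main__":
    # Made-up flag strings for illustration, not dumps from real parts.
    print(vnni_support("fpu sse2 avx avx2 avx512f avx512_vnni"))
    print(vnni_support("fpu sse2 avx avx2 avx_vnni"))
```

Neither flag implies the other in this sketch, which matches the expectation above that VEX-encoded VNNI would not run on older AVX-512 hardware.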
 
  • Like
Reactions: Tlh97 and coercitiv

jpiniero

Diamond Member
Oct 1, 2010
9,339
1,887
136
That's . . . weird. They could have added more Golden Cove (especially if perf/watt is improved; @isoclocks that puts them at an advantage). Instead they double up on Gracemont? What's really going on here? MLiD being goofy again?
We discussed this earlier... Marketing likes the idea of having more cores and it would do better in MT benchmarks fully loaded. Adding another 2 slots is going to make the ring pretty long.
 

dullard

Elite Member
May 21, 2001
22,801
1,035
126
If AL 8+8 is 125W, I wonder how Intel will hit red target if Raptor Lake is 8+16.
Like jpiniero and mikk said, the smaller cores aren't much of a power problem. I'll just put some rough numbers to it.

The similar 10 nm Jasper Lake N6000 has a 6 W TDP with 4 cores -- that works out to 1.5 W per core. Since the cores don't actually use the whole 6 W (the GPU and uncore share the same 6 W power envelope), the small cores actually use less than 1.5 W each. If we trust the Alder Lake mobile leaks, the small cores could be under 1 W each (necessary to meet the 5 W design with 1 bigger and 4 smaller cores: https://www.techpowerup.com/news-tags/Gracemont#g279478-2). But I'll stick with 1.5 W for this post, since the number is to some extent arbitrarily adjustable just by changing the base clock.

So, if Alder Lake's 8 small cores were similar to the N6000, they would use less than 12 W total. That leaves more than 113 W for the 8 bigger cores. If that rumor is correct and some Raptor Lake chips have 16 small cores, then that means the small cores still would use less than 24 W. That leaves more than 101 W for the 8 bigger cores.

The 8-core i9 11900 only has a TDP of 65 W. So big cores do not NEED the full 125 W. They'll be just fine with 101 W. Sure, they'll take more power if available, especially during the initial turbo period. But, they'll have acceptable performance with just 101 W.
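
The rough numbers above can be reproduced with a few lines. This is only a sketch of the same back-of-the-envelope arithmetic: `big_core_budget` is a hypothetical helper, and the 1.5 W per small core is the post's own upper-bound estimate, not a measured figure.

```python
def big_core_budget(package_w, small_cores, small_core_w=1.5):
    """Split a package power budget between small cores and everything else.

    Returns (watts used by small cores, watts left for the big cores).
    """
    small_total = small_cores * small_core_w
    return small_total, package_w - small_total


if __name__ == "__main__":
    # Upper bound per small core, from the Jasper Lake N6000: 6 W TDP / 4 cores.
    per_core = 6 / 4
    for name, small in (("Alder Lake 8+8", 8), ("Raptor Lake 8+16 (rumor)", 16)):
        used, left = big_core_budget(125, small, per_core)
        print(f"{name}: small cores <= {used} W, big cores >= {left} W")
```

Dropping `small_core_w` to 1.0 (the figure suggested by the mobile leaks) leaves 109 W for the big cores even with 16 small cores.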
 
  • Like
Reactions: Tlh97 and SAAA

dullard

Elite Member
May 21, 2001
22,801
1,035
126
That's . . . weird. They could have added more Golden Cove (especially if perf/watt is improved; @isoclocks that puts them at an advantage). Instead they double up on Gracemont? What's really going on here? MLiD being goofy again?
Intel has publicly stated that their long-term vision is to mix and match cores to meet your needs. One rumor of one combination does not imply that it is the only combination of cores that will be produced. Given that Intel is using more power than the competition for similar performance, adding more low-power cores is their best strategy to combat that problem.

I think you are over-estimating the performance of Golden Cove and under-estimating the performance of Gracemont. The two cores look to be shaping up to be much more similar in performance than many people think.
 
