Discussion Intel current and future Lakes & Rapids thread


uzzi38

Platinum Member
Oct 16, 2019
2,746
6,653
146
That secret sauce is BEOL improvements.
You don't say.

I quoted what he said as proof that they didn't make significant improvements to the node between Zen 2 XT SKUs and Zen 3. They used fundamentally the same "recipe" as he states it with both.
 
  • Like
Reactions: Tlh97 and lobz

tamz_msc

Diamond Member
Jan 5, 2017
3,865
3,729
136
You don't say.

I quoted what he said as proof that they didn't make significant improvements to the node between Zen 2 XT SKUs and Zen 3. They used fundamentally the same "recipe" as he states it with both.
:rolleyes:
Then explain what "secret recipe" means in this context, if not making changes to the process of fabrication in order to get better electrical characteristics out of it.
 

Saylick

Diamond Member
Sep 10, 2012
3,923
9,142
136
:rolleyes:
Then explain what "secret recipe" means in this context, if not making changes to the process of fabrication in order to get better electrical characteristics out of it.
I think it means that the design rules of the node are fundamentally the same, but the choice and optimization of transistor types can be refined so that you get the most out of the node. Recall what happened with Fermi's refresh: same TSMC 40nm node for the GTX 480 and GTX 580, but Nvidia optimized the use of transistor types so that leakier transistors were used only where absolutely needed.

Emphasis mine:
Thus the trick to making a good GPU is to use leaky transistors where you must, and use slower transistors elsewhere. This is exactly what NVIDIA did for GF100, where they primarily used 2 types of transistors differentiated in this manner. At a functional unit level we’re not sure which units used what, but it’s a good bet that most devices operating on the shader clock used the leakier transistors, while devices attached to the base clock could use the slower transistors. Of course GF100 ended up being power hungry – and by extension we assume leaky anyhow – so that design didn’t necessarily work out well for NVIDIA.

For GF110, NVIDIA included a 3rd type of transistor, which they describe as having “properties between the two previous ones”. Or in other words, NVIDIA began using a transistor that was leakier than a slow transistor, but not as leaky as the leakiest transistors in GF100. Again we don’t know which types of transistors were used where, but in using all 3 types NVIDIA ultimately was able to lower power consumption without needing to slow any parts of the chip down. In fact this is where virtually all of NVIDIA’s power savings come from, as NVIDIA only outright removed few if any transistors considering that GF110 retains all of GF100’s functionality.

Also, see the following image:
n5-cells.png

Same node, multiple choices for stand-by power (leakage) and clock speed. You can optimize the transistor mix and net perf/W gains without any refinement of the node or architecture.

Edit: I think a good cooking analogy would be: the node is the ingredients, the architecture is the recipe, and transistor optimizations are the chef's ability. The quality of the ingredients and the recipe may stay the same, but a more skilled chef can produce a better product due to better execution in the cooking steps. A lousy chef can turn grade A5 wagyu filet mignon into a smoldering char if they don't know what they are doing.
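The transistor-mix idea can also be put in code. Below is a toy model with invented numbers (not real foundry data): low-Vt "leaky" cells switch fast but burn static power, high-Vt cells are slower but frugal, and picking the least-leaky cell that still meets each block's clock target cuts leakage without lowering any clock.

```python
# Toy model of transistor-type selection. All numbers are illustrative
# assumptions, not real foundry data.

# Hypothetical cell types: (max_clock_ghz, leakage_watts_per_block)
CELLS = {
    "low_vt":  (2.0, 3.0),   # fast, leaky
    "std_vt":  (1.6, 1.0),   # middle ground (GF110's "3rd type")
    "high_vt": (1.2, 0.3),   # slow, frugal
}

def chip_leakage(assignment):
    """Total leakage for a mapping of block -> cell type."""
    return sum(CELLS[cell][1] for cell in assignment.values())

def feasible(assignment, targets):
    """Every block must still meet its clock target with its cell type."""
    return all(CELLS[cell][0] >= targets[block]
               for block, cell in assignment.items())

# Per-block clock targets (GHz): only the shader array needs the hot clock.
targets = {"shaders": 1.8, "rops": 0.9, "memio": 0.9, "video": 0.5}

# GF100-style: everything built from the leaky library.
naive = {block: "low_vt" for block in targets}

# GF110-style: least-leaky cell that still meets each block's target.
tuned = {block: min((c for c in CELLS if CELLS[c][0] >= targets[block]),
                    key=lambda c: CELLS[c][1])
         for block in targets}

assert feasible(naive, targets) and feasible(tuned, targets)
print(round(chip_leakage(naive), 1), round(chip_leakage(tuned), 1))  # 12.0 3.9
```

Both assignments hit the same clock targets; only the cell selection changes, which is exactly the sense in which perf/W can improve with no change to node or architecture.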
 
Last edited:

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Rocket Lake T part base clocks. Unsurprisingly pretty terrible. For comparison, the 8-core 10700T's base is 2 GHz.

Intel fell victim to their own TDP and base clock definitions. They pretty much need to make sure that some Intel hater running AVX512 Prime95 can hit the guaranteed base clocks within a 35 W TDP, and that is harder to do than with AVX256 workloads.
 

jpiniero

Lifer
Oct 1, 2010
16,493
6,983
136
Intel fell victim to their own TDP and base clock definitions. They pretty much need to make sure that some Intel hater running AVX512 Prime95 can hit the guaranteed base clocks within a 35 W TDP, and that is harder to do than with AVX256 workloads.

You know, that really shouldn't be a problem since the resulting flops should be similar. This isn't Skylake Server where you have two AVX-512 units.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
You know, that really shouldn't be a problem since the resulting flops should be similar. This isn't Skylake Server where you have two AVX-512 units.

Well, thank sanity that they have moved away from those dual-FMA-capable AVX512 units. While peak flops might be similar due to 2x256 FMA versus 1x512 FMA, the actual throughput will be higher in everything that is not Linpack, for two reasons:

1) RKL actually has 2 AVX512 units and can execute two instructions on 512 bits in the same cycle, using the ALU-type instructions that matter to real code throughput: MOVs, shifts, shuffles, logicals. The ports are asymmetrical, but that is plenty for client.
2) It simply has a more powerful core, which again helps in achieving peak performance at the resulting power draw.

So 1.4 GHz vs 2 GHz is very OK for the overall IPC gain and AVX512 support. It won't beat any efficiency records on 14nm, that's for sure.
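The "peak flops might be similar" point is simple arithmetic. A minimal sketch (assuming FP32 lanes; an FMA counts as two floating-point ops per lane):

```python
# Peak FLOPs per cycle per core = FMA units x vector lanes x 2 (mul + add).

def peak_flops_per_cycle(fma_units, vector_bits, element_bits=32):
    lanes = vector_bits // element_bits
    return fma_units * lanes * 2  # each FMA lane does a multiply and an add

skylake_client = peak_flops_per_cycle(fma_units=2, vector_bits=256)  # 2x256 FMA
rocket_lake    = peak_flops_per_cycle(fma_units=1, vector_bits=512)  # 1x512 FMA
skylake_server = peak_flops_per_cycle(fma_units=2, vector_bits=512)  # 2x512 FMA

print(skylake_client, rocket_lake, skylake_server)  # 32 32 64
```

Two 256-bit FMA pipes and one 512-bit FMA pipe reach the same 32 FLOPs/cycle peak; only the Skylake Server dual-512 configuration doubles it.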
 

jpiniero

Lifer
Oct 1, 2010
16,493
6,983
136
1) RKL actually has 2 AVX512 units and can execute two instructions on 512 bits in the same cycle, using the ALU-type instructions that matter to real code throughput: MOVs, shifts, shuffles, logicals. The ports are asymmetrical, but that is plenty for client.

I'd be surprised if it has 2 AVX-512 units since Sunny/Willow doesn't.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
I'd be surprised if it has 2 AVX-512 units since Sunny/Willow doesn't.

They do; people are so focused on FMA that anything that does not do 2x FMA/MUL/FADD/FMUL is dismissed as not having "true" 2x AVX512 units.

See for yourself:


Also


for a quick reference of which instructions should have a throughput of 0.50c.
 
  • Like
Reactions: Tlh97 and Pilum

dmens

Platinum Member
Mar 18, 2005
2,275
965
136
They do; people are so focused on FMA that anything that does not do 2x FMA/MUL/FADD/FMUL is dismissed as not having "true" 2x AVX512 units.

See for yourself:


Also


for a quick reference of which instructions should have a throughput of 0.50c.

lol, way to confuse throughput with latency
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
lol, way to confuse throughput with latency

Yes, throughput in cycles is a rather standard way to express how many of a certain operation can happen per cycle. 1c is self-explanatory, 0.5c means 2 per cycle, and 0.33c means 3 per cycle. And it has nothing to do with the latency of said operation.

If you actually opened either of the links you'd know that Sunny Cove has AVX512 ALUs on PORT0 and PORT5, plus extra 512-bit hardware on PORT5, and the instruction printout fully confirms it, with multiple cases of 0.5c throughput for ALU-type operations involving ZMM registers:
3278 AVX512F : VPADDD zmm, zmm, zmm L: 0.41 ns = 1.0c T: 0.21 ns = 0.51c
3287 AVX512BW : VPSUBB zmm1, zmm1, zmm2 L: 0.41 ns = 1.0c T: 0.21 ns = 0.51c

And so on, as expected, for the relevant operations.

So please stop making a fool of yourself; don't rush to destroy the remainder of your supposed ex-Intel "credibility" with FUD comments like that.
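For reference, the cycle figures in dumps like that are just the measured times multiplied by the test clock. A quick conversion, assuming the run clocked at roughly 2.44 GHz (which is what 1.0c = 0.41 ns implies):

```python
# Convert measured nanoseconds back to cycles at the inferred test clock.
# The clock is an assumption derived from the quoted dump, not a spec value.

freq_ghz = 1.0 / 0.41          # cycles per ns, i.e. clock in GHz (~2.44)

def to_cycles(time_ns):
    return time_ns * freq_ghz

lat_c  = to_cycles(0.41)   # VPADDD zmm latency
tput_c = to_cycles(0.21)   # VPADDD zmm reciprocal throughput

print(round(lat_c, 2), round(tput_c, 2))  # 1.0 0.51
```

The 0.51c reciprocal throughput at 1.0c latency is what you expect from two issue ports, independent of how long each individual operation takes.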
 

jpiniero

Lifer
Oct 1, 2010
16,493
6,983
136
The AT article says 2x256 or 1x512. I don't think you can do ports 0+1 and 5 at the same time. And I don't think the non-FMA AVX-512 instructions impact power draw the way the FMA ones do.
 

jur

Member
Nov 23, 2016
45
32
91
lol, way to confuse throughput with latency
I'm more reading than writing, but this is interesting. He didn't confuse anything, but he did look at the wrong lines. 512-bit FMA should use zmm registers, and those lines show the expected numbers: 4c latency and 1c throughput. There are AVX512 instructions that operate on lower-width vectors, and those indeed have higher throughput, as shown in the table.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
The AT article says 2x256 or 1x512. I don't think you can do ports 0+1 and 5 at the same time. And I don't think the non-FMA AVX-512 instructions impact power draw the way the FMA ones do.

Are we going to split hairs now? It has AVX512 hardware on two ports, and it does impact power and performance. Not as much as full FMA units would; those would drop base clocks to the basement.

The thing about FMA on a CPU is that it is a Linpack-only thing; for real-world FP performance one needs massive memory bandwidth to match the flops, which doesn't happen in laptops, AFAIK.

I'm more reading than writing, but this is interesting. He didn't confuse anything, but he did look at the wrong lines. 512-bit FMA should use zmm registers, and those lines show the expected numbers: 4c latency and 1c throughput. There are AVX512 instructions that operate on lower-width vectors, and those indeed have higher throughput, as shown in the table.

I have never claimed it has 2x FMA; in fact, I corrected the member who said it has one AVX512 unit. And there are plenty of operations that use zmm and have a throughput of 0.5c, just the expected ones: ALU ops and logicals, plus all the real-world stuff that does not show up in the table, like real code mixing movs, shuffles, logicals, ALU and FPU operations. All of it can happen at a nice rate on client hardware.
 

dmens

Platinum Member
Mar 18, 2005
2,275
965
136
I'm more reading than writing, but this is interesting. He didn't confuse anything, but he did look at the wrong lines. 512bit fma should use zmm registers. These lines show expected numbers; 4c latency and 1 for throughput. There are avx512 instructions that operate on lower width vectors and those indeed have higher throughput - as shown in the table.

Actually, he did both. He looked at the wrong lines (DQ vs F), and he missed the fact that latency and throughput are orthogonal: you cannot correlate the two. Just compare the KNL AVX-512 latencies versus throughputs versus these implementations.

Either way this is pure FUD considering what the actual power cost of AVX is versus the rest of the system. Blaming AVX for high power is absolutely misleading.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Actually, he did both. He looked at the wrong lines (DQ vs F), and he missed the fact that latency and throughput are orthogonal: you cannot correlate the two. Just compare the KNL AVX-512 latencies versus throughputs versus these implementations.

LOL what? What does KNL have to do with this? Latency is latency, throughput is throughput. If there were only one AVX512 unit, it would not go beyond 1c throughput on any operation involving ZMM (well, technically the register "rename" stage would still allow 0.2c throughput for certain idioms like zeroing with XOR, etc.).
And DQ vs F? What does the name of an AVX512 extension have to do with instruction latency or throughput, as long as the CPU supports it and it operates on 512-bit vectors, as relevant to this discussion?

Either way this is pure FUD considering what the actual power cost of AVX is versus the rest of the system. Blaming AVX for high power is absolutely misleading.

A random Chinese guy runs an AVX512 FP load from AIDA that uses 250 W. Can't blame AVX for it. Makes perfect sense. Got it.
 
Last edited:

jur

Member
Nov 23, 2016
45
32
91
Actually, he did both. He looked at the wrong lines (DQ vs F), and he missed the fact that latency and throughput are orthogonal: you cannot correlate the two. Just compare the KNL AVX-512 latencies versus throughputs versus these implementations.

Either way this is pure FUD considering what the actual power cost of AVX is versus the rest of the system. Blaming AVX for high power is absolutely misleading.
The fact is that some AVX512 instructions using zmm registers show a throughput of 0.5c. How is that possible if the thing does not have two AVX512 units? It does not have a full 2x AVX512, but it does more than 1x AVX512 per cycle.
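That 0.5c reasoning can be sketched with a toy issue-port model. The port assignments below follow the Sunny Cove picture discussed in this thread (512-bit ALUs on ports 0 and 5, a single fused 512-bit FMA pipe) but are illustrative assumptions, not measured data:

```python
# Toy issue-port model: for a stream of independent copies of one op,
# reciprocal throughput is 1 / (number of ports that can issue it).
# Port names below are assumptions for illustration, not official docs.

def reciprocal_throughput(op_ports, n_ops=1000):
    """Cycles per op for independent ops, one op per free port per cycle."""
    cycles, issued = 0, 0
    while issued < n_ops:
        issued += min(len(op_ports), n_ops - issued)
        cycles += 1
    return cycles / n_ops

PORTS = {
    "vpaddd_zmm": {"p0", "p5"},  # 512-bit integer ALU available on two ports
    "vfmadd_zmm": {"p05"},       # 512-bit FMA: single fused pipe
}

print(reciprocal_throughput(PORTS["vpaddd_zmm"]))  # 0.5
print(reciprocal_throughput(PORTS["vfmadd_zmm"]))  # 1.0
```

Two ports that both accept an op give 0.5c; a single (even fused) pipe caps out at 1.0c, which matches the measured split between ZMM ALU ops and ZMM FMA.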
 

dmens

Platinum Member
Mar 18, 2005
2,275
965
136
LOL what? What does KNL have to do with this? Latency is latency, throughput is throughput. If there were only one AVX512 unit, it would not go beyond 1c throughput on any operation involving ZMM (well, technically the register "rename" stage would still allow 0.2c throughput for certain idioms like zeroing with XOR, etc.).

Because they can play stupid games like packing certain instructions. Knights Mill did this. Knights Landing had two full AVX-512 units but higher ALU latency, so your method here would fall flat on its face.
 
  • Like
Reactions: Tlh97 and lobz

ondma

Diamond Member
Mar 18, 2018
3,276
1,679
136
Yes, throughput in cycles is a rather standard way to express how many of a certain operation can happen per cycle. 1c is self-explanatory, 0.5c means 2 per cycle, and 0.33c means 3 per cycle. And it has nothing to do with the latency of said operation.

If you actually opened either of the links you'd know that Sunny Cove has AVX512 ALUs on PORT0 and PORT5, plus extra 512-bit hardware on PORT5, and the instruction printout fully confirms it, with multiple cases of 0.5c throughput for ALU-type operations involving ZMM registers:
3278 AVX512F : VPADDD zmm, zmm, zmm L: 0.41 ns = 1.0c T: 0.21 ns = 0.51c
3287 AVX512BW : VPSUBB zmm1, zmm1, zmm2 L: 0.41 ns = 1.0c T: 0.21 ns = 0.51c

And so on, as expected, for the relevant operations.

So please stop making a fool of yourself; don't rush to destroy the remainder of your supposed ex-Intel "credibility" with FUD comments like that.
I think that ship has already sailed.
 

RTX2080

Senior member
Jul 2, 2018
334
533
136
Videocardz pushed an article about this: https://videocardz.com/newz/intel-core-i9-11900kf-heats-up-to-98c-with-360mm-aio-cooler

Here's a comparison of the 10980XE/9980XE under AIDA FPU load:

It seems that for these HEDT SKUs, which also implement AVX512, consumption under the AIDA FPU stress load without AVX doesn't change much, but it's weird that some SKUs show lower consumption with AVX on (thermal throttling?). Nevertheless, ~250 watts is the level of a 9980XE@4.4GHz, which is ~258 watts; let's hope the Z590 board used by that leaker is immature.

 
Last edited:

mikk

Diamond Member
May 15, 2012
4,291
2,381
136
Yeah sure, keep reading your roadmap tea leaves and guessing how chip validation works, LOL. Whatever you are talking about is likely your own fantasy... don't expect me to guess what that may be. By the way, I never said they can release the internal schedule; I said they can do better than give a half-year window.


You are completely ignorant, clueless, trolling, or just stupid on this topic. Once again, there is a window of several weeks if the projected production schedule is 6-12 months away. This is not a secret; these roadmaps have leaked in the past. They have a rough estimate and most likely could narrow it down to 1-2 months, but the thing is they do not share these more exact schedules with the public, for several reasons. Second half, first half, end of the year, holiday season, etc. are the typical public announcements, without sharing too much. Your chronic drama style looks foolish.
 

uzzi38

Platinum Member
Oct 16, 2019
2,746
6,653
146
I think it means that the design rules of the node are fundamentally the same, but the choice and optimization of transistor types can be refined so that you get the most out of the node. Recall what happened with Fermi's refresh: same TSMC 40nm node for the GTX 480 and GTX 580, but Nvidia optimized the use of transistor types so that leakier transistors were used only where absolutely needed.

Emphasis mine:


Also, see the following image:
n5-cells.png

Same node, multiple choices for stand-by power (leakage) and clock speed. You can optimize the transistor mix and net perf/W gains without any refinement of the node or architecture.

Edit: I think a good cooking analogy would be: the node is the ingredients, the architecture is the recipe, and transistor optimizations are the chef's ability. The quality of the ingredients and the recipe may stay the same, but a more skilled chef can produce a better product due to better execution in the cooking steps. A lousy chef can turn grade A5 wagyu filet mignon into a smoldering char if they don't know what they are doing.
Not gonna lie, that's a far better explanation than I could ever have given. I need to keep a copy of this for future reference, haha.