Discussion Intel current and future Lakes & Rapids thread


uzzi38

Platinum Member
Oct 16, 2019
2,746
6,653
146
That secret sauce is BEOL improvements.
You don't say.

I quoted what he said as proof that they didn't make significant improvements to the node between Zen 2 XT SKUs and Zen 3. They used fundamentally the same "recipe" as he states it with both.
 
  • Like
Reactions: Tlh97 and lobz

tamz_msc

Diamond Member
Jan 5, 2017
3,865
3,729
136
You don't say.

I quoted what he said as proof that they didn't make significant improvements to the node between Zen 2 XT SKUs and Zen 3. They used fundamentally the same "recipe" as he states it with both.
:rolleyes:
Then explain what "secret recipe" means in this context, if not making changes to the process of fabrication in order to get better electrical characteristics out of it.
 

Saylick

Diamond Member
Sep 10, 2012
3,923
9,142
136
:rolleyes:
Then explain what "secret recipe" means in this context, if not making changes to the process of fabrication in order to get better electrical characteristics out of it.
I think it means that the design rules of the node are fundamentally the same, but the choice and optimization of transistor types can be refined so that you get the most out of the node. Recall what happened with Fermi's refresh: same TSMC 40nm node for the GTX 480 and GTX 580, but Nvidia optimized the use of transistor types so that leakier transistors were used only where absolutely needed.

Emphasis mine:
Thus the trick to making a good GPU is to use leaky transistors where you must, and use slower transistors elsewhere. This is exactly what NVIDIA did for GF100, where they primarily used 2 types of transistors differentiated in this manner. At a functional unit level we’re not sure which units used what, but it’s a good bet that most devices operating on the shader clock used the leakier transistors, while devices attached to the base clock could use the slower transistors. Of course GF100 ended up being power hungry – and by extension we assume leaky anyhow – so that design didn’t necessarily work out well for NVIDIA.

For GF110, NVIDIA included a 3rd type of transistor, which they describe as having “properties between the two previous ones”. Or in other words, NVIDIA began using a transistor that was leakier than a slow transistor, but not as leaky as the leakiest transistors in GF100. Again we don’t know which types of transistors were used where, but in using all 3 types NVIDIA ultimately was able to lower power consumption without needing to slow any parts of the chip down. In fact this is where virtually all of NVIDIA’s power savings come from, as NVIDIA only outright removed few if any transistors considering that GF110 retains all of GF100’s functionality.

Also, see the following image:
n5-cells.png

Same node, multiple choices for stand-by power (leakage) and clock speed. You can optimize the transistor mix and net perf/W gains without any refinement of the node or architecture.

Edit: I think a good cooking analogy would be: the node is the ingredients, the architecture is the recipe, and transistor optimizations are the chef's ability. The quality of the ingredients and the recipe may stay the same, but a more skilled chef can produce a better product due to better execution in the cooking steps. A lousy chef can turn grade A5 wagyu filet mignon into a smoldering char if they don't know what they are doing.
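The transistor-mix idea can also be put in code. Below is a toy model with invented numbers (not real foundry data): low-Vt "leaky" cells switch fast but burn static power, high-Vt cells are slower but frugal, and picking the least-leaky cell that still meets each block's clock target cuts leakage without lowering any clock.

```python
# Toy model of transistor-type selection. All numbers are illustrative
# assumptions, not real foundry data.

# Hypothetical cell types: (max_clock_ghz, leakage_watts_per_block)
CELLS = {
    "low_vt":  (2.0, 3.0),   # fast, leaky
    "std_vt":  (1.6, 1.0),   # middle ground (GF110's "3rd type")
    "high_vt": (1.2, 0.3),   # slow, frugal
}

def chip_leakage(assignment):
    """Total leakage for a mapping of block -> cell type."""
    return sum(CELLS[cell][1] for cell in assignment.values())

def feasible(assignment, targets):
    """Every block must still meet its clock target with its cell type."""
    return all(CELLS[cell][0] >= targets[block]
               for block, cell in assignment.items())

# Per-block clock targets (GHz): only the shader array needs the hot clock.
targets = {"shaders": 1.8, "rops": 0.9, "memio": 0.9, "video": 0.5}

# GF100-style: everything built from the leaky library.
naive = {block: "low_vt" for block in targets}

# GF110-style: least-leaky cell that still meets each block's target.
tuned = {block: min((c for c in CELLS if CELLS[c][0] >= targets[block]),
                    key=lambda c: CELLS[c][1])
         for block in targets}

assert feasible(naive, targets) and feasible(tuned, targets)
print(round(chip_leakage(naive), 1), round(chip_leakage(tuned), 1))  # 12.0 3.9
```

Both assignments hit the same clock targets; only the cell selection changes, which is exactly the sense in which perf/W can improve with no change to node or architecture.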
 
Last edited:

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Rocket Lake T part base clocks. Unsurprisingly pretty terrible. For comparison, the 8-core 10700T's base is 2 GHz.

Intel fell victim to their own TDP and base clock definitions. They pretty much need to make sure that some Intel hater running AVX512 Prime95 can hit the guaranteed base clocks within a 35 W TDP, and that is harder to do than with AVX256 workloads.
 

jpiniero

Lifer
Oct 1, 2010
16,493
6,983
136
Intel fell victim to their own TDP and base clock definitions. They pretty much need to make sure that some Intel hater running AVX512 Prime95 can hit the guaranteed base clocks within a 35 W TDP, and that is harder to do than with AVX256 workloads.

You know, that really shouldn't be a problem since the resulting flops should be similar. This isn't Skylake Server where you have two AVX-512 units.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
You know, that really shouldn't be a problem since the resulting flops should be similar. This isn't Skylake Server where you have two AVX-512 units.

Well, thank sanity that they have moved away from those dual-FMA-capable AVX512 units. While peak flops might be similar due to 2x256 FMA versus 1x512 FMA, the actual throughput will be higher in everything that is not Linpack, for two reasons:

1) RKL actually has 2 AVX512 units and can execute two instructions on 512 bits in the same cycle, using the ALU-type instructions that matter to real code throughput: MOVs, shifts, shuffles, logicals. The ports are asymmetrical, but that is plenty for client.
2) It simply has a more powerful core, which again helps in achieving peak performance at the resulting power draw.

So 1.4 GHz vs 2 GHz is very OK for the overall IPC gain and AVX512 support. It won't beat any efficiency records on 14nm, that's for sure.
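The "peak flops might be similar" point is simple arithmetic. A minimal sketch (assuming FP32 lanes; an FMA counts as two floating-point ops per lane):

```python
# Peak FLOPs per cycle per core = FMA units x vector lanes x 2 (mul + add).

def peak_flops_per_cycle(fma_units, vector_bits, element_bits=32):
    lanes = vector_bits // element_bits
    return fma_units * lanes * 2  # each FMA lane does a multiply and an add

skylake_client = peak_flops_per_cycle(fma_units=2, vector_bits=256)  # 2x256 FMA
rocket_lake    = peak_flops_per_cycle(fma_units=1, vector_bits=512)  # 1x512 FMA
skylake_server = peak_flops_per_cycle(fma_units=2, vector_bits=512)  # 2x512 FMA

print(skylake_client, rocket_lake, skylake_server)  # 32 32 64
```

Two 256-bit FMA pipes and one 512-bit FMA pipe reach the same 32 FLOPs/cycle peak; only the Skylake Server dual-512 configuration doubles it.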
 

jpiniero

Lifer
Oct 1, 2010
16,493
6,983
136
1) RKL actually has 2 AVX512 units and can execute two instructions on 512 bits in the same cycle, using the ALU-type instructions that matter to real code throughput: MOVs, shifts, shuffles, logicals. The ports are asymmetrical, but that is plenty for client.

I'd be surprised if it has 2 AVX-512 units since Sunny/Willow doesn't.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
I'd be surprised if it has 2 AVX-512 units since Sunny/Willow doesn't.

They do; people are so focused on FMA that anything that does not do 2x FMA/MUL/FADD/FMUL is dismissed as not having "true" 2x AVX512 units.

See for yourself:


Also


for a quick reference of which instructions should have a throughput of 0.50c.
 
  • Like
Reactions: Tlh97 and Pilum

dmens

Platinum Member
Mar 18, 2005
2,275
965
136
They do; people are so focused on FMA that anything that does not do 2x FMA/MUL/FADD/FMUL is dismissed as not having "true" 2x AVX512 units.

See for yourself:


Also


for a quick reference of which instructions should have a throughput of 0.50c.

lol, way to confuse throughput with latency
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
lol, way to confuse throughput with latency

Yes, throughput in cycles is a rather standard way to express how many of a certain operation can happen per cycle. 1c is self-explanatory, 0.5c means 2 per cycle, and 0.33c means 3 per cycle. And it has nothing to do with the latency of said operation.

If you actually opened either of the links you'd know that Sunny Cove has AVX512 ALUs on PORT0 and PORT5, plus extra 512-bit hardware on PORT5, and the instruction printout fully confirms it, with multiple cases of 0.5c throughput for ALU-type operations involving ZMM registers:
3278 AVX512F : VPADDD zmm, zmm, zmm L: 0.41 ns = 1.0c T: 0.21 ns = 0.51c
3287 AVX512BW : VPSUBB zmm1, zmm1, zmm2 L: 0.41 ns = 1.0c T: 0.21 ns = 0.51c

And so on, as expected, for the relevant operations.

So please stop making a fool of yourself; don't rush to destroy the remainder of your supposed ex-Intel "credibility" with FUD comments like that.
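For reference, the cycle figures in dumps like that are just the measured times multiplied by the test clock. A quick conversion, assuming the run clocked at roughly 2.44 GHz (which is what 1.0c = 0.41 ns implies):

```python
# Convert measured nanoseconds back to cycles at the inferred test clock.
# The clock is an assumption derived from the quoted dump, not a spec value.

freq_ghz = 1.0 / 0.41          # cycles per ns, i.e. clock in GHz (~2.44)

def to_cycles(time_ns):
    return time_ns * freq_ghz

lat_c  = to_cycles(0.41)   # VPADDD zmm latency
tput_c = to_cycles(0.21)   # VPADDD zmm reciprocal throughput

print(round(lat_c, 2), round(tput_c, 2))  # 1.0 0.51
```

The 0.51c reciprocal throughput at 1.0c latency is what you expect from two issue ports, independent of how long each individual operation takes.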
 

jpiniero

Lifer
Oct 1, 2010
16,493
6,983
136
The AT article says 2x256 or 1x512. I don't think you can do ports 0+1 and 5 at the same time. And I don't think the non-FMA AVX-512 instructions impact power draw the way the FMA ones do.
 

jur

Member
Nov 23, 2016
45
32
91
lol, way to confuse throughput with latency
I'm more reading than writing, but this is interesting. He didn't confuse anything, but he did look at the wrong lines. 512-bit FMA should use zmm registers, and those lines show the expected numbers: 4c latency and 1c throughput. There are AVX512 instructions that operate on lower-width vectors, and those indeed have higher throughput, as shown in the table.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
The AT article says 2x256 or 1x512. I don't think you can do ports 0+1 and 5 at the same time. And I don't think the non-FMA AVX-512 instructions impact power draw the way the FMA ones do.

Are we going to split hairs now? It has AVX512 hardware on two ports, and it does impact power and performance. Not as much as full FMA units would; those would drop base clocks to the basement.

The thing about FMA on a CPU is that it is a Linpack-only thing; for real-world FP performance one needs massive memory bandwidth to match the flops, which doesn't happen in laptops, AFAIK.

I'm more reading than writing, but this is interesting. He didn't confuse anything, but he did look at the wrong lines. 512-bit FMA should use zmm registers, and those lines show the expected numbers: 4c latency and 1c throughput. There are AVX512 instructions that operate on lower-width vectors, and those indeed have higher throughput, as shown in the table.

I have never claimed it has 2x FMA; in fact, I corrected the member who said it has one AVX512 unit. And there are plenty of operations that use zmm and have a throughput of 0.5c, just the expected ones: ALU ops and logicals, plus all the real-world stuff that does not show up in the table, like real code mixing movs, shuffles, logicals, ALU and FPU operations. All of it can happen at a nice rate on client hardware.
 

dmens

Platinum Member
Mar 18, 2005
2,275
965
136
I'm more reading than writing, but this is interesting. He didn't confuse anything, but he did look at the wrong lines. 512bit fma should use zmm registers. These lines show expected numbers; 4c latency and 1 for throughput. There are avx512 instructions that operate on lower width vectors and those indeed have higher throughput - as shown in the table.

Actually, he did both. He looked at the wrong lines (DQ vs F), and he missed the fact that latency and throughput are orthogonal: you cannot correlate the two. Just compare the KNL AVX-512 latencies versus throughputs versus these implementations.

Either way this is pure FUD considering what the actual power cost of AVX is versus the rest of the system. Blaming AVX for high power is absolutely misleading.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Actually, he did both. He looked at the wrong lines (DQ vs F), and he missed the fact that latency and throughput are orthogonal: you cannot correlate the two. Just compare the KNL AVX-512 latencies versus throughputs versus these implementations.

LOL what? What does KNL have to do with this? Latency is latency, throughput is throughput. If there were only one AVX512 unit, it would not go beyond 1c throughput on any operation involving ZMM (well, technically the register "rename" stage would still allow 0.2c throughput for certain idioms like zeroing with XOR, etc.).
And DQ vs F? What does the name of an AVX512 extension have to do with instruction latency or throughput, as long as the CPU supports it and it operates on 512-bit vectors, as relevant to this discussion?

Either way this is pure FUD considering what the actual power cost of AVX is versus the rest of the system. Blaming AVX for high power is absolutely misleading.

A random Chinese guy runs an AVX512 FP load from AIDA that uses 250 W. Can't blame AVX for it. Makes perfect sense. Got it.
 
Last edited:

jur

Member
Nov 23, 2016
45
32
91
Actually, he did both. He looked at the wrong lines (DQ vs F), and he missed the fact that latency and throughput are orthogonal: you cannot correlate the two. Just compare the KNL AVX-512 latencies versus throughputs versus these implementations.

Either way this is pure FUD considering what the actual power cost of AVX is versus the rest of the system. Blaming AVX for high power is absolutely misleading.
The fact is that some AVX512 instructions using zmm registers show a throughput of 0.5c. How is that possible if the thing does not have two AVX512 units? It does not have a full 2x AVX512, but it does more than 1x AVX512 per cycle.
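That 0.5c reasoning can be sketched with a toy issue-port model. The port assignments below follow the Sunny Cove picture discussed in this thread (512-bit ALUs on ports 0 and 5, a single fused 512-bit FMA pipe) but are illustrative assumptions, not measured data:

```python
# Toy issue-port model: for a stream of independent copies of one op,
# reciprocal throughput is 1 / (number of ports that can issue it).
# Port names below are assumptions for illustration, not official docs.

def reciprocal_throughput(op_ports, n_ops=1000):
    """Cycles per op for independent ops, one op per free port per cycle."""
    cycles, issued = 0, 0
    while issued < n_ops:
        issued += min(len(op_ports), n_ops - issued)
        cycles += 1
    return cycles / n_ops

PORTS = {
    "vpaddd_zmm": {"p0", "p5"},  # 512-bit integer ALU available on two ports
    "vfmadd_zmm": {"p05"},       # 512-bit FMA: single fused pipe
}

print(reciprocal_throughput(PORTS["vpaddd_zmm"]))  # 0.5
print(reciprocal_throughput(PORTS["vfmadd_zmm"]))  # 1.0
```

Two ports that both accept an op give 0.5c; a single (even fused) pipe caps out at 1.0c, which matches the measured split between ZMM ALU ops and ZMM FMA.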
 

dmens

Platinum Member
Mar 18, 2005
2,275
965
136
LOL what? What does KNL have to do with this? Latency is latency, throughput is throughput. If there were only one AVX512 unit, it would not go beyond 1c throughput on any operation involving ZMM (well, technically the register "rename" stage would still allow 0.2c throughput for certain idioms like zeroing with XOR, etc.).

Because they can play stupid games like packing certain instructions. Knights Mill did this. Knights Landing had two full AVX-512 units but higher ALU latency, so your method here would fall flat on its face.
 
  • Like
Reactions: Tlh97 and lobz

ondma

Diamond Member
Mar 18, 2018
3,276
1,679
136
Yes, throughput in cycles is a rather standard way to express how many of a certain operation can happen per cycle. 1c is self-explanatory, 0.5c means 2 per cycle, and 0.33c means 3 per cycle. And it has nothing to do with the latency of said operation.

If you actually opened either of the links you'd know that Sunny Cove has AVX512 ALUs on PORT0 and PORT5, plus extra 512-bit hardware on PORT5, and the instruction printout fully confirms it, with multiple cases of 0.5c throughput for ALU-type operations involving ZMM registers:
3278 AVX512F : VPADDD zmm, zmm, zmm L: 0.41 ns = 1.0c T: 0.21 ns = 0.51c
3287 AVX512BW : VPSUBB zmm1, zmm1, zmm2 L: 0.41 ns = 1.0c T: 0.21 ns = 0.51c

And so on, as expected, for the relevant operations.

So please stop making a fool of yourself; don't rush to destroy the remainder of your supposed ex-Intel "credibility" with FUD comments like that.
I think that ship has already sailed.
 

RTX2080

Senior member
Jul 2, 2018
334
533
136
Videocardz pushed an article about this: https://videocardz.com/newz/intel-core-i9-11900kf-heats-up-to-98c-with-360mm-aio-cooler

Here's a comparison of the 10980XE/9980XE under AIDA FPU load:

It seems that for these HEDT SKUs, which also implement AVX512, consumption under the AIDA FPU stress load without AVX doesn't change much, but it's weird that some SKUs show lower consumption with AVX on (thermal throttling?). Nevertheless, ~250 watts is the level of a 9980XE@4.4GHz, which is ~258 watts; let's hope the Z590 board used by that leaker is immature.

 
Last edited:

mikk

Diamond Member
May 15, 2012
4,291
2,381
136
Yeah sure, keep reading your roadmap tea leaves and guessing how chip validation works, LOL. Whatever you are talking about is likely your own fantasy... don't expect me to guess what that may be. By the way, I never said they can release the internal schedule; I said they can do better than give a half-year window.


You are completely ignorant, clueless, trolling, or just stupid on this topic. Once again, there is a window of several weeks if the projected production schedule is 6-12 months away. This is not a secret; these roadmaps have leaked in the past. They have a rough estimate and most likely could narrow it down to 1-2 months, but the thing is they do not share these more exact schedules with the public, for several reasons. Second half, first half, end of the year, holiday season, etc. are the typical public announcements, without sharing too much. Your chronic drama style looks foolish.
 

uzzi38

Platinum Member
Oct 16, 2019
2,746
6,653
146
I think it means that the design rules of the node are fundamentally the same, but the choice and optimization of transistor types can be refined so that you get the most out of the node. Recall what happened with Fermi's refresh: same TSMC 40nm node for the GTX 480 and GTX 580, but Nvidia optimized the use of transistor types so that leakier transistors were used only where absolutely needed.

Emphasis mine:


Also, see the following image:
n5-cells.png

Same node, multiple choices for stand-by power (leakage) and clock speed. You can optimize the transistor mix and net perf/W gains without any refinement of the node or architecture.

Edit: I think a good cooking analogy would be: the node is the ingredients, the architecture is the recipe, and transistor optimizations are the chef's ability. The quality of the ingredients and the recipe may stay the same, but a more skilled chef can produce a better product due to better execution in the cooking steps. A lousy chef can turn grade A5 wagyu filet mignon into a smoldering char if they don't know what they are doing.
Not gonna lie, that's a far better explanation than I could ever have given. I need to keep a copy of this for future reference, haha.