Discussion Intel current and future Lakes & Rapids thread


Thunder 57

Platinum Member
Aug 19, 2007
2,675
3,797
136
...I just can't see them going with large monolithic client CPUs anymore (except for laptop).

I agree. It will be interesting to see what they mean by "Cache Redesign". I think at the very least they go to 1MB L2 with a non-inclusive L3.
 

jur

Junior Member
Nov 23, 2016
17
2
81
I agree. It will be interesting to see what they mean by "Cache Redesign". I think at the very least they go to 1MB L2 with a non-inclusive L3.

Desktop/notebook Tigerlake will have 256 KB L2 and 3 MB L3 per core. There were some leaks; maybe wccftech has news about it somewhere, and there were also some Twitter posts.

My impression is that large L2 caches burn quite some power - see Bulldozer and Skylake-X. Besides, when comparing client Skylake to server Skylake, one can see that for typical desktop workloads and gaming a large L2 is not really beneficial. Maybe even the opposite, since it has higher latency. Server Skylake is another story with its 2x AVX-512 units - it's a different optimization point.
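To put rough numbers on that latency-vs-capacity tradeoff, here is a back-of-the-envelope average-memory-access-time sketch. The hit rates and the L2-miss penalty are made-up illustrative values, not measurements of either chip:

```python
# Rough AMAT comparison of a small, fast L2 versus a large, slower L2.
# All hit rates and the miss penalty are illustrative guesses.

def amat(l2_latency, l2_hit_rate, miss_penalty):
    """Average cycles for an access serviced by L2 or beyond."""
    return l2_latency + (1 - l2_hit_rate) * miss_penalty

L3_PENALTY = 40  # assumed extra cycles to fetch from L3 on an L2 miss

client = amat(l2_latency=12, l2_hit_rate=0.80, miss_penalty=L3_PENALTY)  # 256 KiB-style
server = amat(l2_latency=14, l2_hit_rate=0.88, miss_penalty=L3_PENALTY)  # 1 MiB-style

print(f"client-style L2: {client:.1f} cycles")  # 12 + 0.20*40 = 20.0
print(f"server-style L2: {server:.1f} cycles")  # 14 + 0.12*40 = 18.8
```

With these invented numbers the bigger, slower L2 comes out ahead - but only if it actually buys you the extra hits, which is exactly the desktop-vs-server workload point above.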
 

Thunder 57

Platinum Member
Aug 19, 2007
2,675
3,797
136
Or, one of these hybrid inclusive/non-inclusive L3$ designs.

Don't think I've heard of that. How can it be inclusive and non-inclusive? Are there CPUs out there that use that design?

Desktop/notebook Tigerlake will have 256 KB L2 and 3 MB L3 per core. There were some leaks; maybe wccftech has news about it somewhere, and there were also some Twitter posts.

My impression is that large L2 caches burn quite some power - see Bulldozer and Skylake-X. Besides, when comparing client Skylake to server Skylake, one can see that for typical desktop workloads and gaming a large L2 is not really beneficial. Maybe even the opposite, since it has higher latency. Server Skylake is another story with its 2x AVX-512 units - it's a different optimization point.

I think the large L2 on BD was to compensate for its horrible latency. Could be wrong. But they did get BD down to surprisingly low power by Steamroller/Excavator. Interestingly, with Excavator they cut the L2 cache while increasing the L1.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
I think the large L2 on BD was to compensate for its horrible latency.
"The L2 cache’s large capacity and high associativity makes destructive interference between the two threads less likely." <- BULLDOZER: AN APPROACH TO MULTITHREADED COMPUTE PERFORMANCE

"For "Bulldozer" AMD implemented a write-thru L1D cache and focused on improving pre-fetch algorithms and increasing L1D cache bandwidth on popular small block transfers. For those less frequent, larger block transfers we rely upon the efficiencies built into our large 16-way 2MB L2.

Based on the average workloads today, we see this new design - despite the smaller L1 cache - as a more efficient way to process data." Mike Butler, Senior Fellow Design Engineer, AMD / HardOCP Readers Ask AMD Bulldozer Questions

¯\_(ツ)_/¯
The large L2 seems to stem more from the smaller L1 caches, which would have thrashed a hypothetical smaller L2 option more aggressively. I guess it could figuratively be the same with SMT: high associativity and high capacity are preferred in server workloads.

Skylake => 256 KiB/core - 4-way set associative - 12-cycle latency
Skylake-SP => 1 MiB/core - 16-way set associative - 14-cycle latency

Technically, somehow Skylake-SP should have more effective multi-threading per core.

I have identified three different Willowcove implementations; a Server WLC(WillowcoveX), a Client WLC(Willowcove), and a Mobility WLC(Willowcove_M). Just because TGL doesn't have large L2 doesn't mean SPR won't.

[Attachment: l2 cache.png]
^-- Bulldozer had an option for 1 MB 16-way w/ 18-cycle latency. But, is the 2-cycle gap worth the cache misses?
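One way to frame that question: the 1 MB option saves 2 cycles on every L2 hit, but every extra miss it causes costs a trip further out. A quick break-even sketch (the miss penalties are assumed round numbers, not measured Bulldozer figures):

```python
# Break-even point for trading BD's 2 MB L2 down to a 1 MB L2
# that is 2 cycles faster. The smaller cache wins only while
#   latency_saved > extra_miss_rate * miss_penalty.

def break_even_extra_miss_rate(latency_saved, miss_penalty):
    return latency_saved / miss_penalty

for penalty in (40, 60, 100):  # assumed cost of an extra L2 miss, in cycles
    rate = break_even_extra_miss_rate(2, penalty)
    print(f"miss penalty {penalty:3d} cycles -> break-even at {rate:.1%} extra misses")
```

So halving the L2 only pays off if it adds at most a few percent to the miss rate - plausible for small desktop working sets, unlikely for the server workloads BD was aimed at.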
 

jpiniero

Lifer
Oct 1, 2010
14,585
5,209
136

A reference to Eagle Stream WS (Sapphire Rapids). It would make sense if Intel just cancels Icelake Server and just rolls with Sapphire Rapids.
 

mikk

Diamond Member
May 15, 2012
4,135
2,136
136
“[Sunny Cove has an] 800 instruction window, sustains between 3 and 6 x86 instructions per clock,” says Keller, “massive data predictors, massive branch predictors… We’re working on a generation that’s significantly bigger than this and closer to the linear curve on performance. This is a really big mindset change.”

I wonder if the significantly bigger CPU architecture will be Golden Cove or some other successor of it.
 

Thala

Golden Member
Nov 12, 2014
1,355
653
136
Don't think I've heard of that. How can it be inclusive and non-inclusive? Are there CPU's out there that use that design?

All of Arm's cores starting from the Cortex-A75 with the DSU have an exclusive L2$ and a pseudo-exclusive L3$.
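For anyone wondering how exclusive differs from inclusive in practice, here is a toy sketch of the fill/evict rules only - no sets, ways, or coherence state, purely illustrative:

```python
# Toy illustration of inclusive vs. exclusive L2/L3 fill policies.
# Real caches track sets, ways, and coherence; this only shows
# where a line may live after it moves between levels.

l2, l3 = set(), set()

def fill_inclusive(line):
    # Inclusive L3: every L2 line is duplicated in L3,
    # so the L3's effective capacity shrinks by the L2 contents.
    l2.add(line)
    l3.add(line)

def fill_exclusive(line):
    # Exclusive: a line lives in exactly one level; filling it
    # into L2 removes it from L3.
    l2.add(line)
    l3.discard(line)

def evict_exclusive(line):
    # Victim moves down: exclusive caches spill L2 evictions into L3.
    l2.discard(line)
    l3.add(line)

fill_exclusive("A")
assert "A" in l2 and "A" not in l3
evict_exclusive("A")
assert "A" not in l2 and "A" in l3
```

The payoff of exclusivity is effective capacity: L2 and L3 hold distinct lines instead of the L3 duplicating everything in the L2. The cost is the victim traffic on every L2 eviction, which is why "pseudo-exclusive" designs relax the rule.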
 

Ajay

Lifer
Jan 8, 2001
15,433
7,849
136

A reference to Eagle Stream WS (Sapphire Rapids). It would make sense if Intel just cancels Icelake Server and just rolls with Sapphire Rapids.
Eh, Intel will probably produce some Icelake-SP CPUs just to protect their investor community. Sapphire Rapids will ship in volume.
 

Ajay

Lifer
Jan 8, 2001
15,433
7,849
136

I wonder if the significantly bigger CPU architecture will be Golden Cove or some other successor of it.

Hard to imagine Intel pulling this off on 14nm. So by 2021, 10nm would need to have great yields. I suppose it could be targeting mobile only again.
 

Ajay

Lifer
Jan 8, 2001
15,433
7,849
136
Most likely Ocean Cove or later on 7 nm.
That's supposed to be implemented in 2022 (according to WikiChip), so it makes sense. Intel's going to need some big jumps in "IPC" to make up for likely clock rate regressions at 7nm. I'm interested to see what happens, as lower power seems to equal speed for recent node classes. It's tough to build out a large (very high xtor budget), high-performance core and keep power down at the same time.

We just need to wait three years for the competition between AMD and Intel to get red hot - word.
 

Thunder 57

Platinum Member
Aug 19, 2007
2,675
3,797
136
"The L2 cache’s large capacity and high associativity makes destructive interference between the two threads less likely." <- BULLDOZER: AN APPROACH TO MULTITHREADED COMPUTE PERFORMANCE

"For "Bulldozer" AMD implemented a write-thru L1D cache and focused on improving pre-fetch algorithms and increasing L1D cache bandwidth on popular small block transfers. For those less frequent, larger block transfers we rely upon the efficiencies built into our large 16-way 2MB L2.

Based on the average workloads today, we see this new design - despite the smaller L1 cache - as a more efficient way to process data." Mike Butler, Senior Fellow Design Engineer, AMD / HardOCP Readers Ask AMD Bulldozer Questions

¯\_(ツ)_/¯
The large L2 seems to stem more from the smaller L1 caches, which would have thrashed a hypothetical smaller L2 option more aggressively. I guess it could figuratively be the same with SMT: high associativity and high capacity are preferred in server workloads.

Skylake => 256 KiB/core - 4-way set associative - 12-cycle latency
Skylake-SP => 1 MiB/core - 16-way set associative - 14-cycle latency

Technically, somehow Skylake-SP should have more effective multi-threading per core.

I have identified three different Willowcove implementations; a Server WLC(WillowcoveX), a Client WLC(Willowcove), and a Mobility WLC(Willowcove_M). Just because TGL doesn't have large L2 doesn't mean SPR won't.

[Attachment: l2 cache.png]
^-- Bulldozer had an option for 1 MB 16-way w/ 18-cycle latency. But, is the 2-cycle gap worth the cache misses?

No, it is not worth the misses. The problem is that the latency is too high as it is. If Skylake-X is 1MB at 14 cycles while BD's 1MB would have been 18, you are already nearly 30% slower at the same clocks. Now, if BD had hit the clocks they wanted, it would have been less of an issue.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
No, it is not worth the misses. The problem is that the latency is too high as it is. If Skylake-X is 1MB at 14 cycles while BD's 1MB would have been 18, you are already nearly 30% slower at the same clocks. Now, if BD had hit the clocks they wanted, it would have been less of an issue.
You are comparing different L2 designs, so there are going to be differences. The closest to AMD's L2 is Goldmont, since it is also shared between cores.

For example:
Goldmont -- 1 MB L2 16-way shared between two cores is 17-cycle latency.
Goldmont Plus -- 2 MB L2 16-way shared between two cores (per WikiChip) is 19-cycle latency.

AMD's clock targets for BD/PD were only 20% higher than Husky (the 32nm Stars core), which they did in fact hit when they launched.
 

Thunder 57

Platinum Member
Aug 19, 2007
2,675
3,797
136
You are comparing different L2 designs, so there are going to be differences. The closest to AMD's L2 is Goldmont, since it is also shared between cores.

For example:
Goldmont -- 1 MB L2 16-way shared between two cores is 17-cycle latency.
Goldmont Plus -- 2 MB L2 16-way shared between two cores (per WikiChip) is 19-cycle latency.

AMD's clock targets for BD/PD were only 20% higher than Husky (the 32nm Stars core), which they did in fact hit when they launched.

Goldmont is also a low-power design. I know they are different designs, but I can't think of any x86 CPU that had L2 latency similar to or greater than BD's. Regarding clocks:

The CPU was supposed to offer 20 to 30% higher clock speeds at roughly the same power consumption, but in the end it could only offer a 10% boost at slightly higher power consumption.

Also from that page:

Lowly threaded desktop applications run best in a large, low latency L2 cache. But for server applications, we found worse problems than the L2 cache.

I know you are a fan of BD, but it had flaws like any CPU does. It just had a bit too many things go wrong, and therefore it isn't well regarded.
 

Ajay

Lifer
Jan 8, 2001
15,433
7,849
136
Tigerlake. Rocket Lake is rumoured to have GT1 Gen12 graphics. This is GT2 Gen12
I just wonder what the Tigerlake supply will be. Intel has tweaked the process (transistor 'optimization'), but yields could also be increased in the design phase (more redundancy, and optimizations to improve clock speeds).