Discussion Intel current and future Lakes & Rapids thread


Thunder 57

Platinum Member
Aug 19, 2007
2,675
3,797
136
...I just can't see them going with large monolithic client CPUs anymore (except for laptop).

I agree. It will be interesting to see what they mean by "Cache Redesign". I think at the very least they go to 1MB L2 with a non-inclusive L3.
 

jur

Junior Member
Nov 23, 2016
17
2
81
I agree. It will be interesting to see what they mean by "Cache Redesign". I think at the very least they go to 1MB L2 with a non-inclusive L3.

Desktop/notebook Tigerlake will have 256 KB L2 and 3 MB L3 per core. There were some leaks; maybe wccftech has news about it somewhere, and there were also some Twitter posts.

My impression is that large L2 caches burn quite some power - see Bulldozer and Skylake-X. Besides, when comparing client Skylake to server Skylake, one can see that for typical desktop workloads and gaming a large L2 is not really beneficial. Maybe even the opposite, since it has higher latency. Server Skylake is another story with its 2x AVX-512 units - it's a different optimization point.
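To put rough numbers on that latency-vs-capacity tradeoff, here is a back-of-the-envelope average-memory-access-time sketch. The hit rates and the L2-miss penalty are made-up illustrative values, not measurements of either chip:

```python
# Rough AMAT comparison of a small, fast L2 versus a large, slower L2.
# All hit rates and the miss penalty are illustrative guesses.

def amat(l2_latency, l2_hit_rate, miss_penalty):
    """Average cycles for an access serviced by L2 or beyond."""
    return l2_latency + (1 - l2_hit_rate) * miss_penalty

L3_PENALTY = 40  # assumed extra cycles to fetch from L3 on an L2 miss

client = amat(l2_latency=12, l2_hit_rate=0.80, miss_penalty=L3_PENALTY)  # 256 KiB-style
server = amat(l2_latency=14, l2_hit_rate=0.88, miss_penalty=L3_PENALTY)  # 1 MiB-style

print(f"client-style L2: {client:.1f} cycles")  # 12 + 0.20*40 = 20.0
print(f"server-style L2: {server:.1f} cycles")  # 14 + 0.12*40 = 18.8
```

With these invented numbers the bigger, slower L2 comes out ahead - but only if it actually buys you the extra hits, which is exactly the desktop-vs-server workload point above.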
 

Thunder 57

Platinum Member
Aug 19, 2007
2,675
3,797
136
Or, one of these hybrid inclusive/non-inclusive L3$ designs.

Don't think I've heard of that. How can it be inclusive and non-inclusive? Are there CPUs out there that use that design?

Desktop/notebook Tigerlake will have 256 KB L2 and 3 MB L3 per core. There were some leaks; maybe wccftech has news about it somewhere, and there were also some Twitter posts.

My impression is that large L2 caches burn quite some power - see Bulldozer and Skylake-X. Besides, when comparing client Skylake to server Skylake, one can see that for typical desktop workloads and gaming a large L2 is not really beneficial. Maybe even the opposite, since it has higher latency. Server Skylake is another story with its 2x AVX-512 units - it's a different optimization point.

I think the large L2 on BD was to compensate for its horrible latency. Could be wrong. But they did get BD down to surprisingly low power by Steamroller/Excavator. Interestingly, with Excavator they cut the L2 cache while increasing the L1.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
I think the large L2 on BD was to compensate for its horrible latency.
"The L2 cache’s large capacity and high associativity makes destructive interference between the two threads less likely." <- BULLDOZER: AN APPROACH TO MULTITHREADED COMPUTE PERFORMANCE

"For "Bulldozer" AMD implemented a write-thru L1D cache and focused on improving pre-fetch algorithms and increasing L1D cache bandwidth on popular small block transfers. For those less frequent, larger block transfers we rely upon the efficiencies built into our large 16-way 2MB L2.

Based on the average workloads today, we see this new design - despite the smaller L1 cache - as a more efficient way to process data." Mike Butler, Senior Fellow Design Engineer, AMD / HardOCP Readers Ask AMD Bulldozer Questions

¯\_(ツ)_/¯
The large L2 seems to stem more from the smaller L1 caches, which would have thrashed a hypothetical smaller L2 option more aggressively. I guess it could figuratively be the same with SMT: high associativity and high capacity are preferred in server workloads.

Skylake => 256 KiB/core - 4-way set associative - 12-cycle latency
Skylake-SP => 1 MiB/core - 16-way set associative - 14-cycle latency

Technically, somehow Skylake-SP should have more effective multi-threading per core.

I have identified three different Willowcove implementations; a Server WLC(WillowcoveX), a Client WLC(Willowcove), and a Mobility WLC(Willowcove_M). Just because TGL doesn't have large L2 doesn't mean SPR won't.

[Attachment: l2 cache.png]
^-- Bulldozer had an option for 1 MB 16-way w/ 18-cycle latency. But, is the 2-cycle gap worth the cache misses?
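One way to frame that question: the 1 MB option saves 2 cycles on every L2 hit, but every extra miss it causes costs a trip further out. A quick break-even sketch (the miss penalties are assumed round numbers, not measured Bulldozer figures):

```python
# Break-even point for trading BD's 2 MB L2 down to a 1 MB L2
# that is 2 cycles faster. The smaller cache wins only while
#   latency_saved > extra_miss_rate * miss_penalty.

def break_even_extra_miss_rate(latency_saved, miss_penalty):
    return latency_saved / miss_penalty

for penalty in (40, 60, 100):  # assumed cost of an extra L2 miss, in cycles
    rate = break_even_extra_miss_rate(2, penalty)
    print(f"miss penalty {penalty:3d} cycles -> break-even at {rate:.1%} extra misses")
```

So halving the L2 only pays off if it adds at most a few percent to the miss rate - plausible for small desktop working sets, unlikely for the server workloads BD was aimed at.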
 

jpiniero

Lifer
Oct 1, 2010
14,585
5,209
136

A reference to Eagle Stream WS (Sapphire Rapids). It would make sense if Intel just cancels Icelake Server and just rolls with Sapphire Rapids.
 

mikk

Diamond Member
May 15, 2012
4,135
2,136
136
“[Sunny Cove has an] 800 instruction window, sustains between 3 and 6 x86 instructions per clock,” says Keller, “massive data predictors, massive branch predictors… We’re working on a generation that’s significantly bigger than this and closer to the linear curve on performance. This is a really big mindset change.”

I wonder if the significantly bigger CPU architecture will be Golden Cove or some other successor of it.
 

Thala

Golden Member
Nov 12, 2014
1,355
653
136
Don't think I've heard of that. How can it be inclusive and non-inclusive? Are there CPU's out there that use that design?

All of Arm's cores starting from the Cortex-A75 with the DSU have an exclusive L2$ and a pseudo-exclusive L3$.
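For anyone wondering how exclusive differs from inclusive in practice, here is a toy sketch of the fill/evict rules only - no sets, ways, or coherence state, purely illustrative:

```python
# Toy illustration of inclusive vs. exclusive L2/L3 fill policies.
# Real caches track sets, ways, and coherence; this only shows
# where a line may live after it moves between levels.

l2, l3 = set(), set()

def fill_inclusive(line):
    # Inclusive L3: every L2 line is duplicated in L3,
    # so the L3's effective capacity shrinks by the L2 contents.
    l2.add(line)
    l3.add(line)

def fill_exclusive(line):
    # Exclusive: a line lives in exactly one level; filling it
    # into L2 removes it from L3.
    l2.add(line)
    l3.discard(line)

def evict_exclusive(line):
    # Victim moves down: exclusive caches spill L2 evictions into L3.
    l2.discard(line)
    l3.add(line)

fill_exclusive("A")
assert "A" in l2 and "A" not in l3
evict_exclusive("A")
assert "A" not in l2 and "A" in l3
```

The payoff of exclusivity is effective capacity: L2 and L3 hold distinct lines instead of the L3 duplicating everything in the L2. The cost is the victim traffic on every L2 eviction, which is why "pseudo-exclusive" designs relax the rule.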
 

Ajay

Lifer
Jan 8, 2001
15,433
7,849
136

A reference to Eagle Stream WS (Sapphire Rapids). It would make sense if Intel just cancels Icelake Server and just rolls with Sapphire Rapids.
Eh, Intel will probably produce some Icelake-SP CPUs just to protect their investor community. Sapphire Rapids will ship in volume.
 

Ajay

Lifer
Jan 8, 2001
15,433
7,849
136

I wonder if the significantly bigger CPU architecture will be Golden Cove or some other successor of it.

Hard to imagine Intel pulling this off on 14nm. So by 2021, 10nm would need to have great yields. I suppose it could be targeting mobile only again.
 

Ajay

Lifer
Jan 8, 2001
15,433
7,849
136
Most likely Ocean Cove or later on 7 nm.
That's supposed to be implemented in 2022 (according to WikiChip), so it makes sense. Intel's going to need some big jumps in "IPC" to make up for likely clock rate regressions at 7nm. I'm interested to see what happens, as lower power seems to equal speed for recent node classes. It's tough to build out a large (very high xtor budget), high-performance core and keep power down at the same time.

We just need to wait three years for the competition between AMD and Intel to get red hot - word.
 

Thunder 57

Platinum Member
Aug 19, 2007
2,675
3,797
136
"The L2 cache’s large capacity and high associativity makes destructive interference between the two threads less likely." <- BULLDOZER: AN APPROACH TO MULTITHREADED COMPUTE PERFORMANCE

"For "Bulldozer" AMD implemented a write-thru L1D cache and focused on improving pre-fetch algorithms and increasing L1D cache bandwidth on popular small block transfers. For those less frequent, larger block transfers we rely upon the efficiencies built into our large 16-way 2MB L2.

Based on the average workloads today, we see this new design - despite the smaller L1 cache - as a more efficient way to process data." Mike Butler, Senior Fellow Design Engineer, AMD / HardOCP Readers Ask AMD Bulldozer Questions

¯\_(ツ)_/¯
The large L2 seems to stem more from the smaller L1 caches, which would have thrashed a hypothetical smaller L2 option more aggressively. I guess it could figuratively be the same with SMT: high associativity and high capacity are preferred in server workloads.

Skylake => 256 KiB/core - 4-way set associative - 12-cycle latency
Skylake-SP => 1 MiB/core - 16-way set associative - 14-cycle latency

Technically, somehow Skylake-SP should have more effective multi-threading per core.

I have identified three different Willowcove implementations; a Server WLC(WillowcoveX), a Client WLC(Willowcove), and a Mobility WLC(Willowcove_M). Just because TGL doesn't have large L2 doesn't mean SPR won't.

[Attachment: l2 cache.png]
^-- Bulldozer had an option for 1 MB 16-way w/ 18-cycle latency. But, is the 2-cycle gap worth the cache misses?

No, it is not worth the misses. The problem is that the latency is too high as it is. If Skylake-X is 1MB at 14 cycles while BD's 1MB would have been 18, you are already nearly 30% slower at the same clocks. Now, if BD had hit the clocks they wanted, it would have been less of an issue.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
No, it is not worth the misses. The problem is that the latency is too high as it is. If Skylake-X is 1MB at 14 cycles while BD's 1MB would have been 18, you are already nearly 30% slower at the same clocks. Now, if BD had hit the clocks they wanted, it would have been less of an issue.
You are comparing different L2 designs, so there are going to be differences. The closest to AMD's L2 is Goldmont, since it is also shared between cores.

For example:
Goldmont -- 1 MB L2 16-way shared between two cores is 17-cycle latency.
Goldmont Plus -- 2 MB L2 16-way shared between two cores (per WikiChip) is 19-cycle latency.

AMD's clock targets for BD/PD were only 20% higher than Husky (the 32nm Stars core), which they did in fact hit when they launched.
 

Thunder 57

Platinum Member
Aug 19, 2007
2,675
3,797
136
You are comparing different L2 designs, so there are going to be differences. The closest to AMD's L2 is Goldmont, since it is also shared between cores.

For example:
Goldmont -- 1 MB L2 16-way shared between two cores is 17-cycle latency.
Goldmont Plus -- 2 MB L2 16-way shared between two cores (per WikiChip) is 19-cycle latency.

AMD's clock targets for BD/PD were only 20% higher than Husky (the 32nm Stars core), which they did in fact hit when they launched.

Goldmont is also a low-power design. I know they are different designs, but I can't think of any x86 CPU that had L2 latency similar to or greater than BD's. Regarding clocks:

The CPU was supposed to offer 20 to 30% higher clock speeds at roughly the same power consumption, but in the end it could only offer a 10% boost at slightly higher power consumption.

Also from that page:

Lowly threaded desktop applications run best in a large, low latency L2 cache. But for server applications, we found worse problems than the L2 cache.

I know you are a fan of BD, but it had flaws like any CPU does. It just had a bit too many things go wrong, and therefore it isn't well regarded.
 

Ajay

Lifer
Jan 8, 2001
15,433
7,849
136
Tigerlake. Rocket Lake is rumoured to have GT1 Gen12 graphics. This is GT2 Gen12
I just wonder what the Tigerlake supply will be. Intel has tweaked the process (transistor 'optimization'), but yields could also be increased in the design phase (more redundancy, and optimizations to improve clock speeds).