Discussion Intel current and future Lakes & Rapids thread

Page 497

Racan

Golden Member
Sep 22, 2012
Yes. If Zen 4's a derivative, ~10% IPC is a more realistic expectation. Zen->Zen 2's 15% was abnormally high for a tick iteration. Zen 2->Zen 3 isn't relevant given that Zen 3 was a tock. As far as V-Cache goes, you shouldn't expect it on anything other than a crazy-expensive halo SKU; I'm talking like $1500. Stacking is still in its infancy and the tools being used can't handle the volume to do anything else. Plus the extra L3 is mainly useful for compile workloads, so I could see any capacity being bought up by corporate customers.

Now with that said, it's not like Zen 4's immediately dead in the water or anything. It'll still have a commanding power and area lead over Golden Cove, even factoring in Golden Cove's assumed IPC lead. Being on N5 will also give it extra clocks and efficiency. Plus it looks like it'll have AVX-512, which we now know Alder Lake lacks. So even if there's a slight IPC disadvantage, Zen 4 should easily make that up in other areas. If Alder Lake were only Golden Cove, AMD wouldn't need to worry. They'd probably break even on desktop and dominate in mobile while offering AVX-512, but Alder Lake isn't just Golden Cove.

The real problem for AMD is going to be Gracemont (frankly, I've been saying this for a while). It probably won't help a ton in gaming, since the schedulers will have some teething issues. But in mobile, AMD simply doesn't have an answer. Gracemont could also be used in servers. Imagine a Sapphire Rapids fork that uses 240 Gracemont cores instead of 60 Golden Cove cores (or maybe mixes and matches on the die); that could be an extremely compelling product, depending on your workload.

That would be extremely disappointing. It would make Zen 4 the weakest improvement since Zen->Zen+, especially after the longest wait between Zen architectures. I hope it will be more substantial than that.
 

Saylick

Diamond Member
Sep 10, 2012
If rumors are to be believed, the IPC increase for Zen 4 is >20%. Moore's Law is Dead heard that the jump from Zen 2 to Zen 4 is similar to that of the original Zen 1. So if you take the IPC gain for Zen 1 as 52%, and knowing that Zen 2 to Zen 3 is 19%, that implies Zen 3 to Zen 4 is ~28%. I don't think it's going to be that high, but I personally think AMD can match Intel's 19% IPC increase, if not outdo it.
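The ~28% figure comes from dividing the compounded uplifts rather than subtracting them. A minimal Python sketch of that arithmetic, taking the rumored 52% Zen 2->Zen 4 jump and the 19% Zen 2->Zen 3 figure from the post as inputs (the helper name remaining_gain is just illustrative):

```python
# IPC uplifts compound multiplicatively, so the "missing" step is a ratio.
# Inputs are the figures used in the post above (a rumor, not a measurement).

def remaining_gain(total_gain: float, known_gain: float) -> float:
    """Gain still needed after `known_gain` to reach `total_gain`.

    Both inputs are fractional uplifts (0.52 means +52%).
    """
    return (1 + total_gain) / (1 + known_gain) - 1

zen2_to_zen4 = 0.52  # rumor: Zen 2 -> Zen 4 similar to Zen 1's 52% jump
zen2_to_zen3 = 0.19  # Zen 2 -> Zen 3 IPC uplift quoted in the post

zen3_to_zen4 = remaining_gain(zen2_to_zen4, zen2_to_zen3)
print(f"Implied Zen 3 -> Zen 4 IPC gain: {zen3_to_zen4:.1%}")
```

It prints an implied gain of about 27.7%, i.e. the ~28% quoted; the result is only as good as the rumor it starts from.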
 

dullard

Elite Member
May 21, 2001
Just in case some haven't seen it, here is the official Architecture day slide deck:
It has a lot of stuff in it that I didn't see in the couple of review websites that I read. The slide deck covers: Golden Cove, Gracemont, Thread Director, Alder Lake, Xe HPG, Sapphire Rapids, IPU, Oak Springs Canyon, Arrow Creek, Mount Evans, and Ponte Vecchio.

The one that really stuck out as an impossible target is 1000x by 2025 (slides 5 to 11). That is, assuming that 1000x refers to performance.
 

IntelUser2000

Elite Member
Oct 14, 2003
Gracemont versus Skylake
Single thread
-40% higher performance at same power, or 40% of the power at same performance
-Total performance is 7-8% higher

Multi Thread
-4C Gracemont delivers 80% higher performance than 2C4T Skylake. Given SPEC CPU's scaling in the 85-90% range, the few-% advantage in single-thread performance carries over here without losses
-80% less power at same performance

A Gracemont cluster, consisting of 4 cores and the 2MB L2 cache, takes up the same area as a single Skylake core with its 256KB L2 cache. Skylake is larger, at 8.7mm², since it's on the 14nm process. A Tremont cluster roughly equals a Sunny Cove core. So again, a very efficient area/performance improvement.
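Combining the area claim with the 80% MT claim gives a rough perf-per-area figure. This is a back-of-envelope sketch, not a measurement: it assumes a 4-core Gracemont cluster occupies the same area as one Skylake core (so 2C/4T Skylake takes two such areas) and that the +80% MT figure holds. It also mixes Intel 7 with 14nm, so process and architecture gains are lumped together.

```python
# Back-of-envelope MT performance per unit area, using only the post's claims.
# Area is normalized so that one Skylake core + 256KB L2 == 1.0.
SKYLAKE_CORE_AREA = 1.0
GRACEMONT_CLUSTER_AREA = 1.0  # assumption from the post: 4C + 2MB L2 ~ one Skylake core

def perf_per_area(perf: float, area: float) -> float:
    """Relative performance delivered per unit of normalized die area."""
    return perf / area

# 2C/4T Skylake: baseline performance, two cores' worth of area.
skylake = perf_per_area(1.0, 2 * SKYLAKE_CORE_AREA)
# 4C Gracemont: +80% MT performance (claimed), one cluster's worth of area.
gracemont = perf_per_area(1.8, GRACEMONT_CLUSTER_AREA)

ratio = gracemont / skylake
print(f"Gracemont MT perf/area vs Skylake: {ratio:.1f}x")
```

Under those assumptions it works out to about 3.6x.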

Full AVX2 support with FMA and 256-bit FP!

Innovative new concepts are introduced, and, just as the x86 optimization manual alluded to, it has a hardware load balancer to improve throughput on the dual-cluster decode.

Golden Cove's 19% gains are meh considering the sizable changes.
 

IntelUser2000

Elite Member
Oct 14, 2003
So if Golden Cove is only 19% faster per clock compared to Sunny Cove, and Sunny Cove is 18% faster than Skylake, and Gracemont is 7% faster than Skylake,

Golden Cove is only 31% faster than Gracemont. So much for a "little core". It sips power like a little core and uses die area like a little core, but it performs much better.
 

ondma

Diamond Member
Mar 18, 2018
Your point still stands, but the gains in the big cores are not simply additive. The real gain is 1.18 x 1.19, or 40%. So GC is 33% faster than Gracemont. The problem with Gracemont is that it lacks hyperthreading and can't clock as high as the big cores.
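Chaining the thread's per-clock figures as ratios makes the Golden Cove vs Gracemont comparison explicit. The inputs below are the numbers quoted in these posts (Intel's claims, with the 19% baseline taken as Sunny Cove as the posts assume), not independent measurements:

```python
# Per-clock performance normalized to Skylake = 1.0, using the thread's figures.
skylake = 1.0
sunny_cove = skylake * 1.18      # +18% over Skylake
golden_cove = sunny_cove * 1.19  # +19% over Sunny Cove (baseline as assumed in the thread)
gracemont = skylake * 1.07       # ~+7% over Skylake in single thread

gc_vs_skylake = golden_cove / skylake - 1
gc_vs_gracemont = golden_cove / gracemont - 1
print(f"Golden Cove vs Skylake:   +{gc_vs_skylake:.1%}")
print(f"Golden Cove vs Gracemont: +{gc_vs_gracemont:.1%}")
```

This gives about +40.4% over Skylake and, dividing by 1.07, about +31.2% over Gracemont; subtracting 7 points from 40% instead gives 33%.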
 

Abwx

Lifer
Apr 2, 2011
Gracemont versus Skylake
Single thread
-40% higher performance at same power, or 40% less power at same performance
-Total performance is 7-8% higher

A shrunk SKL would consume about 50% less power, as well as get 35-40% better perf at isofrequency, and the 8% higher perf is 2C vs 1C/2T.


Golden Cove's 19% gains are meh considering the sizable changes.

Considering the software used to get this number, I'd guess there's some marketing going on as a means to get the same as below:

[Image: Zen 3 architecture slide (Zen3_arch_7.jpg)]
 

gdansk

Diamond Member
Feb 8, 2011
Just in case some haven't seen it, here is the official Architecture day slide deck:
It has a lot of stuff in it that I didn't see in the couple of review websites that I read. The slide deck covers: Golden Cove, Grace Mont, Thread Director, Alder Lake, Xe HPG, Sapphire Rapids, IPU, Oak Springs Canyon, Arrow Creek, Mount Evans, and Ponte Vecchio.

The one that really stuck out like an impossible target is 1000x by 2025 (slides 5 to 11). That is, assuming that 1000x refers to performance.
They probably mean 1000x in some niche like low-precision int/fp matrix operations. Shouldn't be too hard to achieve, given that by 2025 they'll have massive 600W+ MCM GPU derivatives of Xe HP or Xe HPC. And right now they have nothing comparable except a few Xeons with VNNI extensions.
 

uzzi38

Platinum Member
Oct 16, 2019
19% IPC is strong for a generational uplift, but disappointing considering that the baseline for that measurement is Cypress Cove, which already displayed lower IPC than Zen 3 (and mind you, also lower than both Sunny and Willow Cove), and considering the sheer size of some of the core's structures compared to Z3 as well. I mean, 2x the ROB? Jeez.

Gracemonts are definitely the real star of the show here.
 

gdansk

Diamond Member
Feb 8, 2011
Regardless, it should put them ahead of Zen 3 IPC and (I presume) Intel will keep these clocked slightly higher than Zen 3. So it comes down to how much Zen 3D will actually improve in these small benchmarks for IPC claims.

So I expect Zen 3D and Alder Lake to be trading blows. On mobile, however, it looks like Zen 3 will be much less competitive. Alder Lake will have the IPC and (probably) the maximum clock speed to win in ST. And those 8 E cores will more than make up for the 2 missing P cores in MT workloads, even if GC cannot sustain good boost clocks at low power.
 

uzzi38

Platinum Member
Oct 16, 2019
Regardless, it should put them ahead of Zen 3 IPC and (I presume) Intel will keep these clocked slightly higher than Zen 3. So it comes down to how much Zen 3D will actually improve in these small benchmarks for IPC claims. I think we might see a year where which is better depends entirely on whether your workload prefers more L3 or has more ILP for Golden Cove to work with.
Zen 3 with V-Cache is probably going to be very workload-dependent in its gains. For synthetics, SPEC is definitely going to benefit the most; in Cinebench and CPU-Z I expect minor improvements at best.

In terms of real-world workloads, though? Anything that reaches into memory often: gaming, code compilation, video editing... Actually, I struggle to think of a real workload that won't benefit significantly.
 

JoeRambo

Golden Member
Jun 13, 2013
Just in case some haven't seen it, here is the official Architecture day slide deck:
https://download.intel.com/newsroom.../intel-architecture-day-2021-presentation.pdf

The full presentation has some real "gems" about hardware-assisted scheduling. I had a good laugh at this slide:
[Image: Architecture Day slide on hardware-assisted scheduling]


Well, that is a great example of throwing performance under the bus. Imagine a game thread, say one of the thread pool members specialized in heavy processing of physics or AI or whatever data each frame, that has finished its work for the frame and is busy-waiting in a spin loop for the next frame to start.
The L2 cache is hot and full of relevant data. Then morons from Intel and Microsoft arrive and move this recently-busy-but-now-idle thread to a small core. The context switch takes a ton of time, caches are cold, and once the thread is back to work, it is promoted to a big core again and the cycle repeats, at the huge cost of cache misses.

So Intel and MS move in to rein in this ping-pong with scheduler tunables and heuristics and so on, and we are back to square zero with schedulers. Don't forget that we are already at a negative starting position, since the scheduler needs to take into account all that rich performance data from the hardware to make its decisions, and that also takes cycles and dirties caches.
While it is nice to have less static scheduling that can actually react to changes in the characteristics of threads, that will still come at a cost to both peak performance and performance consistency. Alder Lake will be great at full MT load and great up to 8T of load, but will suffer from the scheduler everywhere else, and those problems might not even rear their ugly heads until games start to use just the right number of threads in the wrong way.
 

Abwx

Lifer
Apr 2, 2011
Regardless, it should put them ahead of Zen 3 IPC and (I presume) Intel will keep these clocked slightly higher than Zen 3. So it comes down to how much Zen 3D will actually improve in these small benchmarks for IPC claims.

Numbers for small cores are not Spec_int but Spec_rate, which is a full-system test that takes big advantage of bandwidth.
 

Hulk

Diamond Member
Oct 9, 1999
The Golden Cove performance improvement/architectural changes are what I expected. If true, the Gracemont power/performance compared to Skylake is kind of mind-blowing. ST Gracemont is basically faster, more power efficient, and much smaller than Skylake. This raises the question of why we even need the Golden Cove cores.

After thinking about it a bit, I believe the reason stems from the fact that ST performance is still ultimately very important. There are many apps, like Handbrake, that don't scale well beyond 8 cores. Throwing 40 Gracemont cores at such an app, in the same die space as Alder Lake, won't deliver the same performance as 8+8 ADL. That's partly because of Golden Cove's higher clocks and higher inherent IPC, and partly because, as I wrote above, quite a few apps still don't scale linearly with core count.

Starting to get just a little more excited about ADL...
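The scaling argument above can be illustrated with a heterogeneous form of Amdahl's law. This is a toy sketch under loud assumptions: the serial part runs on the fastest core, the parallel part splits perfectly in proportion to core speed, a Gracemont core has speed 1.0, a Golden Cove core has speed 1.31 (the per-clock ratio discussed earlier in the thread, ignoring its higher clocks), and 40 E cores stand in for the 8+8 die area as in the post. The parallel fractions are arbitrary.

```python
from typing import Sequence

def run_time(parallel_fraction: float, core_speeds: Sequence[float]) -> float:
    """Normalized run time for a job that takes 1.0 on a single speed-1.0 core.

    Assumes the serial part runs on the fastest core and the parallel part
    is split perfectly across all cores in proportion to their speed.
    """
    serial = (1 - parallel_fraction) / max(core_speeds)
    parallel = parallel_fraction / sum(core_speeds)
    return serial + parallel

# Hypothetical per-core speeds (Gracemont = 1.0, Golden Cove per-clock = 1.31).
all_e_cores = [1.0] * 40
hybrid_8p_8e = [1.31] * 8 + [1.0] * 8

for p in (0.80, 0.90, 0.99):
    e_speedup = 1 / run_time(p, all_e_cores)
    h_speedup = 1 / run_time(p, hybrid_8p_8e)
    print(f"parallel={p:.2f}  40E speedup={e_speedup:5.2f}  8P+8E speedup={h_speedup:5.2f}")
```

With these inputs 8P+8E wins at a 0.80 parallel fraction (about 5.1x vs 4.5x), the two are roughly even at 0.90, and 40 E cores pull well ahead at 0.99: the less an app scales, the more the few fast cores matter.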
 

gdansk

Diamond Member
Feb 8, 2011
I just noticed that since Alder Lake is releasing on "Intel 7", Charlie was right that Intel would never release a 10 nm desktop chip :laughing:
If I recall correctly, Charlie was talking about their first-generation 10nm. He was already right, long ago, as they trashed their first-generation 10nm, even if they kept the name as 10nm ESF.
 

eek2121

Diamond Member
Aug 2, 2005

We already knew everything that Intel announced today. It is just that a lot of people did not believe the rumors/leaks. I remember getting yelled at for saying that Gracemont would be faster than Skylake…
 

dullard

Elite Member
May 21, 2001
The full presentation has some real "gems" about hardware-assisted scheduling. I had a good laugh at this slide:
[Image: Architecture Day slide on hardware-assisted scheduling]


Well, that is a great example of throwing performance under the bus. Imagine a game thread, say one of the thread pool members specialized in heavy processing of physics or AI or whatever data each frame, that has finished its work for the frame and is busy-waiting in a spin loop for the next frame to start.
The L2 cache is hot and full of relevant data. Then morons from Intel and Microsoft arrive and move this recently-busy-but-now-idle thread to a small core. The context switch takes a ton of time, caches are cold, and once the thread is back to work, it is promoted to a big core again and the cycle repeats, at the huge cost of cache misses.

So Intel and MS move in to rein in this ping-pong with scheduler tunables and heuristics and so on, and we are back to square zero with schedulers. Don't forget that we are already at a negative starting position, since the scheduler needs to take into account all that rich performance data from the hardware to make its decisions, and that also takes cycles and dirties caches.
While it is nice to have less static scheduling that can actually react to changes in the characteristics of threads, that will still come at a cost to both peak performance and performance consistency. Alder Lake will be great at full MT load and great up to 8T of load, but will suffer from the scheduler everywhere else, and those problems might not even rear their ugly heads until games start to use just the right number of threads in the wrong way.
1) Simple fix to your concern: put a minimum idle limit in the guidance for considering a thread idle. Suppose a game is running at 60 fps. Then each frame takes 16.7 ms. Don't move an idle thread unless it has been idle for more than 16.7 ms. Then your game cannot be impacted at all by the scheduler. No ping-pong possible. Note: I'm not saying 16.7 ms is the ideal number; this is just an example. I'm not saying that this is an easy thing to solve (which is why it took big changes from both Intel and Microsoft). I'm just saying that your example is an easy thing to solve.

2) It is a new hardware-based microcontroller doing the heavy lifting. Thus, with separate hardware doing much of the calculation, there is less work for the software: fewer cycles and less dirty cache.

3) Why are you suddenly running some processor heavy other program in the middle of gaming?
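Point 1 can be made concrete with a toy model of the frame-loop scenario quoted above. Everything here is an illustrative assumption: the function count_migrations, the 10/12/15/16 ms per-frame work pattern, and the rule that every demotion to an E-core is followed by a promotion back when the next frame starts. It says nothing about how Thread Director or the Windows scheduler actually decide; it only counts how often an "idle longer than N ms" rule would move the worker.

```python
from typing import Iterable

FRAME_MS = 1000 / 60  # 60 fps -> ~16.7 ms per frame

def count_migrations(busy_ms_per_frame: Iterable[float], frame_ms: float,
                     idle_threshold_ms: float) -> int:
    """Count P->E and E->P moves for a hypothetical idle-threshold policy.

    A thread is demoted only if its idle gap before the next frame is
    longer than `idle_threshold_ms`; it is promoted back when work resumes.
    """
    migrations = 0
    for busy_ms in busy_ms_per_frame:
        idle_ms = max(0.0, frame_ms - busy_ms)  # spin/idle time left in this frame
        if idle_ms > idle_threshold_ms:
            migrations += 2  # demoted while idle, promoted again next frame
    return migrations

# Assumed per-frame work for the physics/AI worker, repeated for 600 frames.
workload = [10.0, 12.0, 15.0, 16.0] * 150

for threshold in (0.0, 2.0, FRAME_MS):
    moves = count_migrations(workload, FRAME_MS, threshold)
    print(f"idle threshold {threshold:5.1f} ms -> {moves} migrations in {len(workload)} frames")
```

With no threshold every frame costs two migrations (1200 in total); a 2 ms threshold halves that, and a one-frame threshold eliminates them for this workload, while a thread that genuinely sits idle for longer than a frame would still be moved.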
 

dullard

Elite Member
May 21, 2001
After thinking about it a bit I believe the reason stems from the fact that ultimately ST performance is still very important. There are many apps, like Handbrake that don't scale well beyond 8 cores.
Exactly. The user interface needs high ST performance for a fast "feel". You could have a billion slow cores, and while they would crunch numbers quickly, the user interface thread would feel like a dog. And many applications just physically can't be divided into a large number of threads, no matter what the programmer tries. There is always a need for a few very fast ST cores.
 

gdansk

Diamond Member
Feb 8, 2011
Presently available schedulers already try to avoid ping-ponging threads, even between cores of the same type. The same mechanisms that keep busy threads running on the same core should apply here. Intel presented an example, and I expect it is exactly that: an example. I presume that Intel engineers know that tasks using a spin lock do so because it is a low-latency lock. But who knows.
 

Hougy

Member
Jan 13, 2021
If I recall correctly Charlie was talking about their first generation 10nm. He was already right, long ago, as they trashed their first generation 10nm. Even if they kept the name as 10nm ESF.
Actually, he said they cancelled 10 nm, meaning all 10 nm products. He even admitted he was wrong.
 