Discussion Intel current and future Lakes & Rapids thread


Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
The counter-argument one might introduce is that GC is the new "medium" core, which is supposedly being groomed into replacing Cove cores in many products starting with the next generations.

I'm confused by this sentence. When you say GC I'm assuming you mean Golden Cove right? But Golden Cove is the high performance core so how could it be groomed into replacing Cove cores when it's already a Cove core?

I think you meant to say Gracemont or GM, unless there's something I don't know about.
 

Gideon

Golden Member
Nov 27, 2007
1,608
3,573
136
If Gracemont is as good as these leaks suggest and Raptor Lake can easily fit 16, Intel is missing a huge opportunity by not offering 128-core SKUs (and perhaps also smaller 80- or 64-core ones) for hyperscalers.

It would fit many web workloads much better than big cores and counter Ampere's Altra offerings very well.
 

coercitiv

Diamond Member
Jan 24, 2014
6,151
11,686
136
I'm confused by this sentence. When you say GC I'm assuming you mean Golden Cove right? But Golden Cove is the high performance core so how could it be groomed into replacing Cove cores when it's already a Cove core?

I think you meant to say Gracemont or GM, unless there's something I don't know about.
My bad, I meant Gracemont.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
Regarding scaling, Cinebench will likely be very low overhead, and close to ideal. They're just parallel render threads. Not much complexity. The real test will be more mixed workloads like gaming and content creation.

It's not just software overhead I'm talking about. When it comes to sharing, the algorithm cannot account for all scenarios.

Like for example L2 cache bandwidth on Core 2. IXBTLabs had a good article about that. Depending on how much data is common to both cores, the L2 cache bandwidth would drop a lot, and at the point of the largest contention, would drop to zero!

(Likely one of the reasons they moved to private L2 caches)

Yes things have improved and the engineers have learned new things, but now the complexity has been upped by something like an order of magnitude. You have two completely different cores trying to load balance an application.

We also cannot compare a closely controlled, single-tasking OS like a mobile operating system to Windows. Really, the Windows world is a wild wild west out there.

Indeed, Cinebench might come close, but everything else might be a disappointment. We've officially reached the boiling point of hype. You know what happens after water boils? Well we turn it off! This much hype might result in huge disappointment for lots of people.

I think you meant to say Gracemont or GM, unless there's something I don't know about.

(That assumes MLiD is correct about Raptor Lake. We will see. Space-wise you might be able to fit 256 cores in the same die area of 56C Sapphire Rapids)

Gracemont being so powerful also supports what I said above. If they wanted to use a little core, then they'd take Airmont. But the worst-case scenario would be horrible, since you'd notice responsiveness drop noticeably. I mean you'd think it's stuttering. Airmont would literally be an "idle core".

But Gracemont is going to end up roughly 3x the performance per clock of Airmont! 3GHz+ Skylake is no slouch.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
Chips and Cheese has an interesting article about uop caches and decoders.

It confirms what engineers have said before. The impact of decoders on power consumption is anywhere from 0.48% to 9.12%, and the author notes the figures overestimate decoder power, since performance also improves, increasing utilization of other areas of the core.

Maximum of 9% is a lot, but in the overall scheme of things nothing. It also confirms my belief that Intel/AMD is behind ARM because of execution issues and not so much the ISA.

Also, Intel's talk about uop caches having an 85% hit rate seems to be quite optimistic. CPU-Z and Ian's 3DPM benchmark are very high at 95-99%, but other applications fall between 25% and 65%.

Assuming an average hit rate of 45%, and taking Intel's claim that a uop cache hit reduces the branch mispredict penalty by 5 cycles, we end up with something like a 17 stage pipeline chip.

Goldmont Plus is 13 stages, meaning 6 stages lower in the best case, 1 stage lower in the worst case, and 4 stages lower on average. If you take Intel's saying in the Netburst era that "each pipeline stage impacts performance by 2-5%", then something like 10% per clock performance is due to fewer pipeline stages alone.
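
To make the arithmetic above explicit, here is a rough sketch in Python. The 14 and 19 stage figures are not stated outright; they are inferred from the "1 stage lower" and "6 stages lower" comparisons against the 13 stage Goldmont Plus, and the function name is made up for illustration.

```python
# Sketch of the pipeline-depth estimate, under the assumptions named above.
HIT_STAGES = 14            # inferred: Goldmont Plus (13) is 1 stage lower on a uop cache hit
MISS_STAGES = 19           # inferred: 5 cycles longer when the uop cache misses
GOLDMONT_PLUS_STAGES = 13  # figure quoted in the post


def effective_stages(hit_rate, hit_stages=HIT_STAGES, miss_stages=MISS_STAGES):
    """Average pipeline depth, weighted by the uop cache hit rate."""
    return hit_rate * hit_stages + (1.0 - hit_rate) * miss_stages


avg_stages = effective_stages(0.45)            # 45% average hit rate from the post
stage_gap = avg_stages - GOLDMONT_PLUS_STAGES  # how much shorter Goldmont Plus is

# Apply the quoted "2-5% per pipeline stage" rule of thumb to that gap.
gain_low = stage_gap * 0.02
gain_high = stage_gap * 0.05

print(f"effective big-core depth: {avg_stages:.2f} stages")
print(f"gap to Goldmont Plus:     {stage_gap:.2f} stages")
print(f"per-clock effect:         {gain_low:.1%} to {gain_high:.1%}")
```

The ~10% figure sits toward the low end of the range this produces, so it reads as a conservative pick rather than a midpoint.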

The guys at RWT also talked about the impact of decoders on transistors for x86 microprocessors. They were talking about quadratic impact, but some say exponential impact is also a possibility. This is likely the reason why Intel goes with what's called a 4-1-1-1 approach and uses micro/macro op fusion so more can be fed to the 1 throughput decoders.

Assuming a quadratic change,
- 2 decoders = 4x
- 3 decoders = 9x
- 4 decoders = 16x

In the x86 optimization manual about Tremont they talk about the dual cluster being more linear. If we take what they said at face value,
- 2x 3 decoders = 18x

Meaning there's a possibility that dual 3 decode clusters might be minimally larger than 4 decoders on Core chips. Of course if you read the manual it has limitations. But assuming they can up the utilization, this might be the way to go for future x86 chips.
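
A quick sketch of that scaling model, assuming cost grows with the square of the decoder count inside a cluster and linearly with the number of clusters. That is only the simplified reading used in this post, not a measured figure, and `decoder_cost` is a hypothetical helper.

```python
def decoder_cost(width, clusters=1):
    """Relative decoder cost: quadratic in width per cluster, linear in cluster count.

    Both scaling rules are assumptions taken from the discussion, not measurements.
    """
    return clusters * width ** 2


configs = [
    ("2 decoders", 2, 1),
    ("3 decoders", 3, 1),
    ("4 decoders", 4, 1),
    ("2x 3 decoders (dual cluster)", 3, 2),
]

for label, width, clusters in configs:
    print(f"{label}: {decoder_cost(width, clusters)}x")
```

Under the same assumption a single 6-wide decoder would come out at 36x, which is the cost the clustered layout sidesteps.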
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
In the x86 optimization manual about Tremont they talk about the dual cluster being more linear. If we take what they said at face value,
-2x 3 decoders = 18x

AMD has 4 complex decoders. I'm not sure what the upper limits for instruction fetch are, but that is working all the time. Intel big cores, meanwhile, have been stuck with 1 complex + X simple decoders since the Pentium days; if the instruction stream is not friendly to that scheme, tough luck.
With 3+3 clusters Intel is moving to another, equally limiting scheme: the instruction stream needs to have a branch every few instructions to make use of that second cluster of decoders. I don't like it a single bit :)
With no uOP cache, having just 3 decoders is a glass jaw in situations where decoder bandwidth is needed the most: when your ROB is empty due to a cold start, flush, mispredict or whatever. And a ROB being filled slowly robs the execution units downstream of OoO execution opportunities.

Remember Apple has 8 wide decode, so having 3+3 is not ambitious ( understatement of the year ).
 

DrMrLordX

Lifer
Apr 27, 2000
21,582
10,785
136
Refresh my memory, aside from those all star employees moving on, didn't Murthy fire a few thousand engineers in various points of their career?

Good question! I don't know the answer to that. Though we could make a research project out of it sometime if you like.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
AMD has 4 complex decoders. I'm not sure what the upper limits for instruction fetch are, but that is working all the time. Intel big cores, meanwhile, have been stuck with 1 complex + X simple decoders since the Pentium days; if the instruction stream is not friendly to that scheme, tough luck.

Complex versus simple decoders is a very small matter. SSE and AVX operations all decode into a single uop each and can be handled by all 4 decoders.

Also this is what it says for Tremont: "While Tremont microarchitecture did not build a dynamic mechanism to load balance the decode clusters, future generations of Intel Atom processors will include hardware to recognize and mitigate these cases without the need for explicit insertions of taken branches into the assembly code."

Based on the rumored performance(~Skylake) Gracemont should improve things considerably.
 

uzzi38

Platinum Member
Oct 16, 2019
2,565
5,575
146
Regarding scaling, Cinebench will likely be very low overhead, and close to ideal. They're just parallel render threads. Not much complexity. The real test will be more mixed workloads like gaming and content creation.
R20 also has the additional benefit of finishing within the PL2 duration, which also aids ADL.
 

uzzi38

Platinum Member
Oct 16, 2019
2,565
5,575
146
If Gracemont is as good as these leaks suggest and Raptor Lake can easily fit 16, Intel is missing a huge opportunity by not offering 128-core SKUs (and perhaps also smaller 80- or 64-core ones) for hyperscalers.

It would fit many web workloads much better than big cores and counter Ampere's Altra offerings very well.

Intel are definitely aware of that - there's a product for that (many Atoms for servers) as well.

EDIT: The codename is actually apparently already in the news. Neat.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Complex versus simple decoders is a very small matter. SSE and AVX operations all decode into a single uop each and can be handled by all 4 decoders.

Yeah, it is true that most instructions are simple and decode into one uOP. But my gut feeling is that those special situations that beat branch prediction, miss uOP caches and so on already involve not-so-"simple" instructions, which results in less decode throughput when it is needed the most.
Of course Intel has hard calcs and hard simulation data, they know better for sure.

Also this is what it says for Tremont: "While Tremont microarchitecture did not build a dynamic mechanism to load balance the decode clusters, future generations of Intel Atom processors will include hardware to recognize and mitigate these cases without the need for explicit insertions of taken branches into the assembly code."

Based on the rumored performance(~Skylake) Gracemont should improve things considerably.

They have already thrown in a 64KB instruction cache; extra hardware to move from 3-wide decode to, say, 4.5-wide decode on average would work wonders for performance. That Chips and Cheese data for Zen 2 made people wonder about uOP caches and just why Apple is beating them with 8-wide decode.


One thing is for sure, Alder Lake is turning into a very interesting CPU. Both big cores and Atom cores will need deep dives, and given the state of the art for investigation these days, even if Intel does not provide a word we will know soon after release :)
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
That Chips and Cheese data for Zen 2 made people wonder about uOP caches and just why Apple is beating them with 8-wide decode.

We've had plenty of discussions about that on this forum. From what I recall, the major points were:

1) Process node. Apple has access to the latest cutting edge process node, while AMD has had to make do with inferior nodes until Zen 2, and Intel was stuck on 14nm++++++ for years.

2) Apple has top to bottom control over their product stack, so they can optimize to a much higher degree than either Intel or AMD.

3) Intel and AMD design CPUs for a wider variety of platforms and workloads, while Apple focuses strictly on mobile which favors high IPC, lower clocked CPUs; relatively speaking at least.

4) AMD was way behind Intel for many years to the point of near irrelevance. Of course Zen changed that and now they have a strong foundation to build high performance CPUs for the future.

Those were the main points as I recall, but number one is the most impactful I think. If Intel had never fumbled so badly on 10nm, they would be on 7nm++ by now, with 5nm imminent and this forum wouldn't be extolling the M1's performance to this degree.

Not trying to diminish what Apple has accomplished by any means, but those factors contributed heavily to the perception of Apple's CPU superiority no doubt.

One thing is for sure, Alder Lake is turning into a very interesting CPU. Both big cores and Atom cores will need deep dives, and given the state of the art for investigation these days, even if Intel does not provide a word we will know soon after release :)

Yep, I can't wait to do a full stack upgrade if Alder Lake proves as capable as I think it's going to be. :cool:
 

dullard

Elite Member
May 21, 2001
24,998
3,327
126
And for mobile 2x performance is no problem. Tigerlake is 4 cores. The 2+8 is likely going to be faster than current Tigerlake. When you go 4+8 or 6+8 I can totally believe it being twice as fast.

But for desktop being 2x as fast? Desktop is already super high clocked with Rocketlake and the maximum Alderlake config is 8+8. Remember it's a marketing document. Of course when they say it'll be "twice as fast" it'll be an up to figure. Like some-of-our-products will offer 2x the performance.

But now the hype has suddenly reached a fever pitch. People are believing in scores that need Gracemont to be Sunny Cove performance, or even Golden Cove!
Minor technicality, but there are now multiple Tiger Lake chips with 8 cores. Two -B chips and six -H chips. For example, you can get a Dell laptop with 8 core Tiger Lake delivered to your home this week: https://www.dell.com/en-us/shop/gam...ptop/spd/alienware-m15-r6-laptop/wnm15r6exkfs Note: that is low stock, the more expensive laptops are not in limited quantities though.

You are correct that they will be "up to" 2x the performance not always 2x, it says "up to" right on the image. And you are correct that Gracemont will not be at Sunny Cove or Golden Cove performance.

But, Gracemont doesn't need to be at Sunny Cove levels for Alder Lake to get 2x desktop performance. Suppose that Intel is correct that Golden Cove is ~20% faster at some tasks compared to Sunny Cove. Suppose that Gracemont is ~20% slower than Sunny Cove (note this is just an example, I haven't seen that piece of data yet). Then you get 8 cores of performance for Sunny Cove and 8*1.2 + 8*0.8 = 16 for Alder Lake (twice as fast). Again, I'll take that math with a huge grain of salt--especially since frequency values will impact the math significantly. But, in certain use cases almost double desktop performance is not out of the realm of possibility.
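
The back-of-the-envelope sum above can be written out as a tiny sketch. It deliberately ignores frequency, Hyperthreading and scheduling overhead, exactly like the example, and the per-core factors are the hypothetical ones from this post rather than real data; `relative_throughput` is just an illustrative name.

```python
def relative_throughput(core_groups):
    """Sum of (core count * per-core performance relative to Sunny Cove)."""
    return sum(count * per_core for count, per_core in core_groups)


# Baseline: 8 Sunny Cove class cores at 1.0 each.
baseline = relative_throughput([(8, 1.0)])

# Example Alder Lake: 8 Golden Cove at +20% and 8 Gracemont at -20% (both hypothetical).
alder_lake = relative_throughput([(8, 1.2), (8, 0.8)])

ratio = alder_lake / baseline
print(f"baseline: {baseline:.1f}, Alder Lake: {alder_lake:.1f}, ratio: {ratio:.2f}x")
```

Changing either per-core factor shows how quickly the 2x figure moves, which is why the grain of salt matters.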
 

LightningZ71

Golden Member
Mar 10, 2017
1,627
1,898
136
On point 2, it seems like that stack control allows Apple to transparently direct certain tasks to fixed function and specialized co-processor units that accelerate a lot of those functions greatly. Those are things that, in the x86 world, require a lot of work from the software developers for each package, standards compliance from the OS, and driver support from the vendors. In the Apple world, for the programmer, it's transparent, and for the user, it just works.
 

A///

Diamond Member
Feb 24, 2017
4,352
3,154
136
Good question! I don't know the answer to that. Though we could make a research project out of it sometime if you like.
I am sure there's plenty of Murthy voodoo dolls out there as a result. Though it depends on how much polyester batting JoAnn Fabrics sold during his tenure and how many sparklers were sold after his butt got the boot.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
But, Gracemont doesn't need to be at Sunny Cove levels for Alder Lake to get 2x desktop performance. Suppose that Intel is correct that Golden Cove is ~20% faster at some tasks compared to Sunny Cove. Suppose that Gracemont is ~20% slower than Sunny Cove (note this is just an example, I haven't seen that piece of data yet). Then you get 8 cores of performance for Sunny Cove and 8*1.2 + 8*0.8 = 16 for Alder Lake (twice as fast).

You forgot two very important points, frequency and Hyperthreading.

Hyperthreading is responsible for a 30% performance gain. Also, current desktop CPUs clock 20-30% faster than Gracemont will.

Then you get 8 x 1.2 + (8 x 0.85 / 1.3 / 1.2) ≈ 14

This also assumes yet another large caveat: Hybrid will work without any overhead.

Simple arithmetic also doesn't take into account how the current cores perform: 6000 for Rocketlake, and 925 for 10W Tremont.

Your simple calculation results in 7200 for Golden Cove and 4800 for Gracemont. My calculation says 7200 for Golden Cove and nearly 3300 for Gracemont.

3300 seems like a big stretch to me, never mind 4800.
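
For anyone who wants to check the numbers, here is a small sketch of the adjusted estimate. It uses only the factors from this post (0.85 per clock, 30% Hyperthreading gain, 20% clock deficit, 6000 for 8-core Rocketlake); the constant and function names are made up, and zero hybrid overhead is still assumed.

```python
# Figures quoted in the post; everything below is derived from them.
ROCKET_LAKE_SCORE = 6000   # 8-core Rocketlake score quoted above
GOLDEN_COVE_UPLIFT = 1.2   # Golden Cove ~20% faster than Sunny Cove
GRACEMONT_PER_CLOCK = 0.85 # Gracemont per-clock factor used in the formula
HT_GAIN = 1.3              # Hyperthreading gain Gracemont does not get
CLOCK_DEFICIT = 1.2        # big desktop cores clock ~20% higher


def gracemont_factor(per_clock=GRACEMONT_PER_CLOCK, ht_gain=HT_GAIN,
                     clock_deficit=CLOCK_DEFICIT):
    """Gracemont throughput relative to a Hyperthreaded Sunny Cove class core."""
    return per_clock / ht_gain / clock_deficit


# Core-equivalents for an 8+8 part, baseline being 8 for Rocketlake.
total_equivalents = 8 * GOLDEN_COVE_UPLIFT + 8 * gracemont_factor()

# The same factors expressed as scores for 8 cores of each type.
golden_cove_score = ROCKET_LAKE_SCORE * GOLDEN_COVE_UPLIFT
gracemont_score = ROCKET_LAKE_SCORE * gracemont_factor()
simple_gracemont_score = ROCKET_LAKE_SCORE * 0.8  # the unadjusted "20% slower" case

print(f"8+8 core-equivalents:  {total_equivalents:.2f}")
print(f"Golden Cove x8:        {golden_cove_score:.0f}")
print(f"Gracemont x8 adjusted: {gracemont_score:.0f}")
print(f"Gracemont x8 simple:   {simple_gracemont_score:.0f}")
```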
 

dullard

Elite Member
May 21, 2001
24,998
3,327
126
You forgot two very important points, frequency and Hyperthreading.

Hyperthreading is responsible for a 30% performance gain. Also, current desktop CPUs clock 20-30% faster than Gracemont will.

Then you get 8 x 1.2 + (8 x 0.85 / 1.3 / 1.2) ≈ 14

This also assumes yet another large caveat: Hybrid will work without any overhead.
Um, you left off an important part of my message in your quote: "take that math with a huge grain of salt--especially since frequency values will impact the math significantly". I don't know the final launch frequencies, so I left it out and mentioned the fact. It was not something that I forgot.

Intel claims major hardware scheduling changes. We do not yet know how well or how poorly those will perform.

A 30% boost from Hyperthreading is only the very best case scenario. Lots of uses see closer to a 10% to 20% gain, and some uses see performance losses with Hyperthreading turned on (HPC is a notable example, where benchmarks run for weeks at a time, not seconds). Hyperthreading adds a lot of power and heat, but it gives more threads. So there is a balance between needing to throttle down frequencies and having more threads running. On the whole, most software gets a small boost. But only a few programs get a 30% boost.
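
To show how much the assumed Hyperthreading gain moves the result, here is a rough sweep over the quoted formula. The 0.85 per-clock factor and 20% clock deficit are taken from the quoted post and held fixed; the function name is hypothetical, and none of this accounts for scheduling overhead or final frequencies.

```python
def alder_lake_equivalents(ht_gain, per_clock=0.85, clock_deficit=1.2,
                           golden_cove_uplift=1.2, big=8, small=8):
    """Core-equivalents for a big+small part using the quoted formula."""
    small_factor = per_clock / ht_gain / clock_deficit
    return big * golden_cove_uplift + small * small_factor


# Sweep the Hyperthreading gain across the 10% to 30% range discussed above.
for ht_gain in (1.10, 1.20, 1.30):
    total = alder_lake_equivalents(ht_gain)
    print(f"HT gain {ht_gain - 1:.0%}: {total:.2f} core-equivalents "
          f"({total / 8:.2f}x an 8-core baseline)")
```

The spread between the 10% and 30% cases is under one core-equivalent, so the Hyperthreading assumption matters less here than the frequency one.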
 

lobz

Platinum Member
Feb 10, 2017
2,057
2,856
136
From what I can gather, Gracemont isn’t close to Sunny/Golden Cove except in applications that utilize AVX.

More importantly, people have underestimated both Golden Cove and Gracemont.

EDIT: specifically, the 11900K scored 69% of what the 5950X did, according to AnandTech. Golden Cove is said to be 20% faster, but let’s say it is only 10% faster…
So you now just walked past the AVX post of the same person you're answering to.
 

lobz

Platinum Member
Feb 10, 2017
2,057
2,856
136
This is the reality. It's completely different between Intel and AMD when it comes to unconfirmed information in basically every forum. Good leaks from Intel must be fantasy and things like that; there is much more scepticism involved. With AMD a huge hype train will grow, and this is also a reason why there have historically been more AMD related fakes on the web, by the way. It's been like that for a very long time, no matter if it was the Bulldozer or Zen+ era. I know that most people don't like what I said given that the big majority is pro AMD, but it's reality. DrMrLordX above actually is a good example: he can't deal with Raja Koduri being (possibly) right on ADL-S and comes up with this spam. Some people would never admit he was right.
Comedy gold, keep it coming, Zucker and Mikk. Reality check != not believing good rumors. Come on...
 

aigomorla

CPU, Cases&Cooling Mod PC Gaming Mod Elite Member
Super Moderator
Sep 28, 2005
20,839
3,174
126
Some leaked TDP numbers for Alder Lake..

That big die seems like it will be an absolute beast.

Core i9-12900K
  • P-Core 1-2C 5.3GHz/8C 5.0GHz
  • E-Core 1-4C 3.9GHz/8C 3.7GHz
  • 30MB L3 Cache
  • PL1=125W/PL2=228W
Core i7-12700K
  • P-Core 1-2C 5.0GHz/8C 4.7GHz
  • E-Core 1-2C 3.8GHz/4C 3.6GHz
  • 25MB L3 Cache
  • PL1=125W/PL2=228W
Core i5-12600K
  • P-Core 1-2C 4.9GHz/6C 4.5GHz
  • E-Core 1-2C 3.6GHz/4C 3.4GHz
  • 20MB L3 Cache
  • PL1=125W/PL2=228W

228W on that big die.... is like massive.
 

Thunder 57

Platinum Member
Aug 19, 2007
2,647
3,706
136
Some leaked TDP numbers for Alder Lake..

That big die seems like it will be an absolute beast.

Core i9-12900K
  • P-Core 1-2C 5.3GHz/8C 5.0GHz
  • E-Core 1-4C 3.9GHz/8C 3.7GHz
  • 30MB L3 Cache
  • PL1=125W/PL2=228W
Core i7-12700K
  • P-Core 1-2C 5.0GHz/8C 4.7GHz
  • E-Core 1-2C 3.8GHz/4C 3.6GHz
  • 25MB L3 Cache
  • PL1=125W/PL2=228W
Core i5-12600K
  • P-Core 1-2C 4.9GHz/6C 4.5GHz
  • E-Core 1-2C 3.6GHz/4C 3.4GHz
  • 20MB L3 Cache
  • PL1=125W/PL2=228W

228W on that big die.... is like massive.

Little bit less than Rocket Lake, which may as well shoot flames itself. If it has performance to match it might not be so bad. Probably the most interesting Intel CPU launch in a while. I'm actually excited to see what it does.
 

aigomorla

CPU, Cases&Cooling Mod PC Gaming Mod Elite Member
Super Moderator
Sep 28, 2005
20,839
3,174
126
Little bit less than Rocket Lake, which may as well shoot flames itself. If it has performance to match it might not be so bad. Probably the most interesting Intel CPU launch in a while. I'm actually excited to see what it does.

same.. big cores and little cores.... would be very interesting, and would most likely rock on a laptop.

But I really see no point in this on a desktop.
It's not like it's attached to a battery, and if it were, I would need a Mr. Fusion to power it off grid.

Give me BIG and BIGGER cores instead. :D
 

Thunder 57

Platinum Member
Aug 19, 2007
2,647
3,706
136
same.. big cores and little cores.... would be very interesting, and would most likely rock on a laptop.

But I really see no point in this on a desktop.
It's not like it's attached to a battery, and if it were, I would need a Mr. Fusion to power it off grid.

Give me BIG and BIGGER cores instead. :D

Well Intel does call them big.bigger. I guess that is what you were getting at.
 

A///

Diamond Member
Feb 24, 2017
4,352
3,154
136
Some leaked TDP numbers for Alder Lake..

That big die seems like it will be an absolute beast.

Core i9-12900K
  • P-Core 1-2C 5.3GHz/8C 5.0GHz
  • E-Core 1-4C 3.9GHz/8C 3.7GHz
  • 30MB L3 Cache
  • PL1=125W/PL2=228W
Core i7-12700K
  • P-Core 1-2C 5.0GHz/8C 4.7GHz
  • E-Core 1-2C 3.8GHz/4C 3.6GHz
  • 25MB L3 Cache
  • PL1=125W/PL2=228W
Core i5-12600K
  • P-Core 1-2C 4.9GHz/6C 4.5GHz
  • E-Core 1-2C 3.6GHz/4C 3.4GHz
  • 20MB L3 Cache
  • PL1=125W/PL2=228W

228W on that big die.... is like massive.
The VideoCardz report references extrapolated numbers based on a qual sample, though it is incredibly safe to presume Intel will be coming in hot and heavy. This is one of those times where "wait for benches" should hold one's attention. I recall 9th through 11th gen looking incredible pre-release only to be midline once they came out. There are some gems in these generations, but the majority of the product stack isn't worth it.