Discussion: Intel current and future Lakes & Rapids thread


IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
The 12900K lists 6 MB of L3 cache for the efficiency cores, which makes me think it's a typo, but if it weren't, that would further suggest the cache is the culprit.

It shouldn't matter that much on the ring bus. It should have access to the entire 25MB pool, unless an extra cycle or two of latency affects performance that much.

The L2 cache being the bottleneck makes more sense. There is a reason multi-core-oriented CPUs moved to private L2 caches and a shared L3.

In a heavy workload, the cache is getting thrashed really badly.
 

tomatosummit

Member
Mar 21, 2019
184
177
116
It shouldn't matter that much on the ring bus. It should have access to the entire 25MB pool, unless an extra cycle or two of latency affects performance that much.

The L2 cache being the bottleneck makes more sense. There is a reason multi-core-oriented CPUs moved to private L2 caches and a shared L3.
While I also think the L2 is the big bottleneck, the shared access to the L3 might be additionally restrictive. It's still 4 cores on one ring stop.
Or am I overthinking it because the access goes through the L2 module anyway?
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
While I also think the L2 is the big bottleneck, the shared access to the L3 might be additionally restrictive. It's still 4 cores on one ring stop.
Or am I overthinking it because the access goes through the L2 module anyway?

It might be worth testing on the 12900K or the 12600K then. According to wikichip's logic, the 12600K will have 2MB for the Gracemont module.


Remember that for Gracemont the L2 is a shared cache, meaning for 4 cores it has 1/4 the capacity per core compared to a single thread, in addition to potential contention. With a 2MB module, that works out to just 512KB per core. Contention will be an issue for the L3 as well, but capacity-wise the ring will allow access to all 25MB (unless it has arbitrary restrictions).

Of course, we are assuming the caches are the culprit here, not something else.
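If someone wants to poke at it, a pointer-chase sweep makes the cache boundaries visible: pin to an E-core, grow the working set, and watch where ns/load jumps. A rough Linux-only sketch; the default core ID (16, often the first E-core on a 12900K), the sweep range, and the iteration count are my assumptions, so adjust for your chip:

```c
/* Pointer-chase latency sweep, a rough Linux-only sketch. Past ~2 MB on an
 * E-core you'd fall out of the shared L2 and onto the ring/L3.
 * Build: gcc -O2 chase.c -o chase   Run: ./chase [core] */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv)
{
    int core = argc > 1 ? atoi(argv[1]) : 16;   /* assumed E-core ID */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    sched_setaffinity(0, sizeof(set), &set);

    const size_t line = 64 / sizeof(void *);    /* pointers per cache line */

    for (size_t ws = 256u << 10; ws <= 64u << 20; ws <<= 1) {
        size_t slots = ws / 64;                 /* one pointer per line */
        void **buf = malloc(ws);
        size_t *order = malloc(slots * sizeof(size_t));

        /* Random permutation of the line slots, linked into one big cycle
         * so the prefetcher can't guess the next address. */
        for (size_t i = 0; i < slots; i++) order[i] = i;
        for (size_t i = slots - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            size_t t = order[i]; order[i] = order[j]; order[j] = t;
        }
        for (size_t i = 0; i < slots; i++)
            buf[order[i] * line] = &buf[order[(i + 1) % slots] * line];

        /* Dependent loads: each one must wait for the previous to finish. */
        void **p = &buf[order[0] * line];
        const size_t iters = 20 * 1000 * 1000;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < iters; i++) p = (void **)*p;
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%6zu KB: %5.2f ns/load\n", ws >> 10, ns / iters);
        volatile void *sink = p; (void)sink;    /* keep the chase alive */
        free(order);
        free(buf);
    }
    return 0;
}
```

Run it once pinned to a P-core and once to an E-core; a step around 2MB on the E-core side would be the shared L2 giving out.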
 

Hulk

Diamond Member
Oct 9, 1999
5,138
3,727
136
It might be worth testing on the 12900K or the 12600K then. According to wikichip's logic, the 12600K will have 2MB for the Gracemont module.


Remember that for Gracemont the L2 is a shared cache, meaning for 4 cores it has 1/4 the capacity per core compared to a single thread, in addition to potential contention. With a 2MB module, that works out to just 512KB per core. Contention will be an issue for the L3 as well, but capacity-wise the ring will allow access to all 25MB (unless it has arbitrary restrictions).

Of course, we are assuming the caches are the culprit here, not something else.

Peanut gallery chiming in here with a question.

Does the fact that Gracemont does not have HT somewhat alleviate the stress on the L2 as compared to an HT CPU? Or, stated another way, when designing a CPU that is not going to be hyperthreaded, would the L2 be sized smaller than for an HT CPU?
 

Thala

Golden Member
Nov 12, 2014
1,355
653
136
Peanut gallery chiming in here with a question.

Does the fact that Gracemont does not have HT somewhat alleviate the stress on the L2 as compared to an HT CPU? Or, stated another way, when designing a CPU that is not going to be hyperthreaded, would the L2 be sized smaller than for an HT CPU?

The working set is larger when doing fine-grained thread switching, so ideally you want the caches to be larger, starting from the L1$.
 

Thala

Golden Member
Nov 12, 2014
1,355
653
136
Remember that for Gracemont the L2 is a shared cache, meaning for 4 cores it has 1/4 the capacity per core compared to a single thread, in addition to potential contention.

This is typically not the case, because there is sharing. On one hand there is code sharing, which is pretty obvious. But shared data structures also require only a single copy in a shared cache, as opposed to multiple copies in private caches. Of course, the amount of sharing is larger when all cores are running the same application (e.g. multithreaded workloads like Cinebench).
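As a toy illustration (not Gracemont-specific; the table size, thread count, and pass count are made up for the example): four threads scanning one shared 1MB table fit comfortably in a 2MB shared L2, while giving each thread a private copy quadruples the combined footprint:

```c
/* Toy example of shared vs. private data footprint. Four threads scan one
 * shared 1 MB table: a shared L2 holds it once. Pass any argument and each
 * thread instead gets a private copy: 4 MB combined, which would blow out
 * a 2 MB shared L2.
 * Build: gcc -O2 -pthread share.c -o share
 * Run pinned to one module, e.g.: taskset -c 16-19 ./share  (core IDs assumed) */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define TABLE_BYTES (1u << 20)   /* 1 MB lookup table */
#define THREADS 4
#define PASSES 2000

static unsigned char shared_table[TABLE_BYTES];

static void *reader(void *arg)
{
    const unsigned char *t = arg;       /* shared table or a private copy */
    unsigned long sum = 0;
    for (int p = 0; p < PASSES; p++)
        for (size_t i = 0; i < TABLE_BYTES; i += 64)   /* one read per line */
            sum += t[i];
    return (void *)sum;                 /* keep the loads from being elided */
}

int main(int argc, char **argv)
{
    int use_private = argc > 1;
    pthread_t th[THREADS];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < THREADS; i++) {
        void *t = shared_table;
        if (use_private) {              /* 4 private copies, 4x the footprint */
            t = malloc(TABLE_BYTES);
            memcpy(t, shared_table, TABLE_BYTES);
        }
        pthread_create(&th[i], NULL, reader, t);
    }
    for (int i = 0; i < THREADS; i++)
        pthread_join(th[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("%s: %.2f s\n",
           use_private ? "private copies (~4 MB total)" : "shared table (~1 MB total)",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    return 0;
}
```

Timing both modes while pinned to one module, and comparing L2 misses with perf stat, would show the effect.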
 

Mopetar

Diamond Member
Jan 31, 2011
8,487
7,726
136
This is typically not the case, because there is sharing. On one hand there is code sharing, which is pretty obvious. But shared data structures also require only a single copy in a shared cache, as opposed to multiple copies in private caches. Of course, the amount of sharing is larger when all cores are running the same application (e.g. multithreaded workloads like Cinebench).

It probably depends on whether the scheduler is doing a good job of assigning threads based on those shared resources. If it puts something random, like a background process from an unrelated application, on the same module, it's unlikely to be accessing the same memory.
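In other words, the difference between the good and bad case is roughly this (a sketch; core IDs 16 and 17 are assumed to be two cores on the same Gracemont module, since on a 12900K the E-cores typically show up as CPUs 16-23; check lscpu -e for the real topology):

```c
/* Rough sketch of what good placement buys: pin two threads that read the
 * same data onto one E-core module so they share a single L2.
 * Build: gcc -O2 -pthread pin.c -o pin */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static int shared_data[1 << 18];        /* 1 MB: fits in a 2 MB module L2 */

static void *worker(void *arg)
{
    int core = *(int *)arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    long sum = 0;
    for (int p = 0; p < 1000; p++)
        for (size_t i = 0; i < sizeof(shared_data) / sizeof(int); i++)
            sum += shared_data[i];
    printf("core %d done (sum %ld)\n", core, sum);
    return NULL;
}

int main(void)
{
    /* Same module: one copy of shared_data in the shared L2. Point one ID
     * at a core on a different module and each L2 caches its own copy. */
    int cores[2] = {16, 17};
    pthread_t th[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&th[i], NULL, worker, &cores[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(th[i], NULL);
    return 0;
}
```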
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
Does the fact that Gracemont does not have HT somewhat alleviate the stress on the L2 as compared to an HT CPU? Or, stated another way, when designing a CPU that is not going to be hyperthreaded, would the L2 be sized smaller than for an HT CPU?

So with Hyperthreading you have to deal with sharing of resources, right?

Then if you have the L2 cache being shared as well, that's a further complication. Overall, for scaling, separate L2s and a large shared L3 for all cores is best. You lose a bit on single thread, but the scaling is better, so you quickly win out. It's mostly about multi-core scaling, though; the Hyperthreading benefit is just a side thing.

A mid-level L2 also has the advantage that it fills the large capacity and latency gap between the L1 and the last-level cache, to better fit all scenarios, since we're talking about a general-purpose CPU. You have to cover all bases, and then some, because you don't know what people will use it for.

This is typically not the case, because there is sharing. On one hand there is code sharing, which is pretty obvious.

You are right, but a significant amount won't be shared. So when you are talking big numbers, such as having 4x the cores, the per-core amount will be substantially smaller. In this case, it might as well be 1/4.
 

Ajay

Lifer
Jan 8, 2001
16,094
8,114
136
Sapphire Rapids on eBay? Someone's in trouble.
Makes me wonder if Intel even tracks trays of CPUs that are sent to OEMs for testing (guessing that's where this came from). Looks like a pull from something, a bit scratched up. Maybe even a non-working sample that got tossed into a plastic bin on a shelf? Grabbed by an 'enterprising' tech or janitor and sold on eBay under a fake name and address? Intel, one would think, has thousands of these (or more) floating around at various OEMs/ODMs testing servers, peripherals, and enterprise software for the upcoming SPR release.
 
Last edited:

repoman27

Senior member
Dec 17, 2018
384
540
136
Well, Intel sent YuuKi_AnS an SPR engineering sample nearly a year ago, which he went ahead and delidded.

 

DrMrLordX

Lifer
Apr 27, 2000
22,901
12,970
136
Makes me wonder if Intel even tracks trays of CPUs that are sent to OEMs for testing (guessing that's where this came from). Looks like a pull from something, a bit scratched up. Maybe even a non-working sample that got tossed into a plastic bin on a shelf? Grabbed by an 'enterprising' tech or janitor and sold on eBay under a fake name and address? Intel, one would think, has thousands of these (or more) floating around at various OEMs/ODMs testing servers, peripherals, and enterprise software for the upcoming SPR release.

They probably track it by the markings on the lid.
 

repoman27

Senior member
Dec 17, 2018
384
540
136
I'm curious what people would think of 4+16 for the P die.
It would net the most threads for the least area / power. I don't see it happening for Meteor Lake though. Looking closely at the ADL-P and MTL CPU tile wafer shots, I've come around to agreeing with you that the MTL-M compute die is a 2P+8E design. I get the feeling Intel isn't going to reuse CPU tiles at all for Meteor Lake. (I mean, why bother understanding your own strategy, right?) So I think we'll see 2+8 and 6+8 LP tiles, and probably an 8+16 HP tile.
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,106
136
It would net the most threads for the least area / power. I don't see it happening for Meteor Lake though. Looking closely at the ADL-P and MTL CPU tile wafer shots, I've come around to agreeing with you that the MTL-M compute die is a 2P+8E design. I get the feeling Intel isn't going to reuse CPU tiles at all for Meteor Lake. (I mean, why bother understanding your own strategy, right?) So I think we'll see 2+8 and 6+8 LP tiles, and probably an 8+16 HP tile.
Oh, it's not a possibility for Meteor Lake. But beyond that? Who can say...

As for tile reuse, I think it'll be quite interesting to see what gets reused, and where. Mixing and matching across the MTL lineup should be expected, but what about other product categories? Networking? Graphics?
 

Hulk

Diamond Member
Oct 9, 1999
5,138
3,727
136
What are the advantages of using compute tiles for desktop parts whose lower core counts would fit on a monolithic chip?

If Raptor desktop is 8+16 and Meteor is 8+16, what's the advantage of the compute tiles? Is it simply production cost?
 

LightningZ71

Platinum Member
Mar 10, 2017
2,508
3,190
136
Ask AMD: its APUs have been sporting core counts equivalent to the desktop X-series processors for years. Breaking the various SoC functions up into chiplets/tiles made on processes tailored to each function can yield smaller individual ICs that have higher yields per wafer and perform better in their totality, since no single process has to compromise for every function.
 

dullard

Elite Member
May 21, 2001
25,994
4,608
126
What are the advantages of using compute tiles for desktop parts whose lower core counts would fit on a monolithic chip?

If Raptor desktop is 8+16 and Meteor is 8+16, what's the advantage of the compute tiles? Is it simply production cost?
1) Higher yields with tiles. Instead of needing to throw away an entire monolithic chip (total loss of that silicon area), you might have several good tiles and one bad tile in the same area (mostly usable silicon area). A quick back-of-envelope yield model is sketched after this list.

2) One design problem doesn't hold up the whole generation. Suppose there is a problem with a new iGPU but the new CPU is performing great. With a monolithic chip, the thing can't ship. With tiles, you can use a previous iGPU combined with your new CPU. This eliminates the need for long delays and the expenses of backporting (like the 26.5 months between Comet Lake and Alder Lake and the necessary costs of creating Rocket Lake).

3) Flexibility. This is related to #2, but you essentially can ship products incrementally when they are ready rather than waiting for all new concepts to be perfected. The CPU team doesn't need to wait for the GPU team to be ready (and vice versa). They can launch what they have and move on to the next project. Being able to have your people work more independently gives your designs, planning, etc far more flexibility. Heck, Intel can even outsource some of the tiles to other companies for even more flexibility.

The drawbacks are of course higher latency, higher power, and/or more costly packaging.
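To put rough numbers on point #1: with a simple Poisson defect model, yield is Y = exp(-D * A). The defect density and die areas below are made-up illustration values, not anything Intel has published:

```c
/* Back-of-envelope yield comparison, Y = exp(-D * A), with assumed numbers:
 * D = 0.1 defects/cm^2 and 2 cm^2 of total silicon, split into 4 tiles.
 * Build: gcc yield.c -lm -o yield */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double d0 = 0.1;     /* defects per cm^2 (assumed) */
    const double area = 2.0;   /* total silicon in cm^2 (assumed) */
    const int tiles = 4;

    double mono = exp(-d0 * area);           /* the whole die must be clean */
    double tile = exp(-d0 * area / tiles);   /* each tile is binned alone */

    printf("monolithic %.1f cm^2 die: %4.1f%% yield\n", area, 100 * mono);
    printf("each %.2f cm^2 tile:     %4.1f%% yield\n", area / tiles, 100 * tile);
    /* ~82% vs ~95%: a defect now scraps 0.5 cm^2 instead of the full 2 cm^2. */
    return 0;
}
```

Real models (defect clustering, known-good-die test costs) are messier, but that's the gist of the yield argument.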
 
Last edited:

Exist50

Platinum Member
Aug 18, 2016
2,452
3,106
136
1) Higher yields with tiles. Instead of needing to throw away an entire monolithic chip (total loss of that silicon area), you might have several good tiles and one bad tile in the same area (mostly usable silicon area).

2) One design problem doesn't hold up the whole generation. Suppose there is a problem with a new iGPU but the new CPU is performing great. With a monolithic chip, the thing can't ship. With tiles, you can use a previous iGPU combined with your new CPU. This eliminates the need for long delays and the expenses of backporting (like the 26.5 months between Comet Lake and Alder Lake and the necessary costs of creating Rocket Lake).

3) Flexibility. This is related to #2, but you essentially can ship products incrementally when they are ready rather than waiting for all new concepts to be perfected. The CPU team doesn't need to wait for the GPU team to be ready (and vice versa). They can launch what they have and move on to the next project. Being able to have your people work more independently gives your designs, planning, etc far more flexibility. Heck, Intel can even outsource some of the tiles to other companies for even more flexibility.

The drawbacks are, of course, higher latency and/or more costly packaging.
It's also somewhat worse from a power perspective. Supposedly one of the first Meteor Lake proposals from the design team, in a "best we can do" kind of spirit, was monolithic on N3. The current topology is very much a compromise dictated from on high, and very last minute.
 

dullard

Elite Member
May 21, 2001
25,994
4,608
126
It's also somewhat worse from a power perspective. Supposedly one of the first Meteor Lake proposals from the design team, in a "best we can do" kind of spirit, was monolithic on N3. The current topology is very much a compromise dictated from on high, and very last minute.
Noted. I added that to the drawback list above.
 

DrMrLordX

Lifer
Apr 27, 2000
22,901
12,970
136
Supposedly one of the first Meteor Lake proposals from the design team, in a "best we can do" kind of spirit, was monolithic on N3. The current topology is very much a compromise dictated from on high, and very last minute.

. . . uh oh.
 

Ajay

Lifer
Jan 8, 2001
16,094
8,114
136
It's also somewhat worse from a power perspective. Supposedly one of the first Meteor Lake proposals from the design team, in a "best we can do" kind of spirit, was monolithic on N3. The current topology is very much a compromise dictated from on high, and very last minute.
Hmm, kind of weird since Intel has been pushing multiple dice on package (EMIB, FOVEROS) for quite a while. What the heck is going on with that company?! Gee, let's NOT use the packaging tech that WE developed :rolleyes:
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,106
136
. . . uh oh.
Hmm, kind of weird since Intel has been pushing multiple dice on package (EMIB, FOVEROS) for quite a while. What the heck is going on with that company?! Gee, let's NOT use the packaging tech that WE developed :rolleyes:
The story I've heard is that everyone was just called into the main conference room and shown a presentation of basically "This is Meteor Lake now". Supposedly it elicited a mixture of silence and chuckling.

Between this, the COVID-instigated hiring freezes, the addition of Arrow Lake, and Microsoft poaching half of Intel's best talent in Oregon, I consider myself a Meteor Lake sceptic even independent of any process issues.

And Ajay, right tool for the job. Advanced packaging is a useful tool, but it doesn't beat monolithic in performance or power. And IIRC, at the time, MTL was still more low-power targeted. Though with the process issues and a more flexible lineup, it might not be a bad idea in retrospect. But IDC will probably do things differently for Lunar Lake and whatever else they own.