Discussion Intel current and future Lakes & Rapids thread


Det0x

Golden Member
Sep 11, 2014
1,465
4,999
136
With different per-core clock speeds, how can it be synchronous? That's what I'm trying to dig into here. IIRC, Haswell added separate DVFS to Intel's ring bus, but it was still asynchronous before then.
L3 / ring runs at peak core-clockspeed for any given CCD
 
  • Like
Reactions: Tlh97 and Joe NYC

Exist50

Platinum Member
Aug 18, 2016
2,452
3,106
136
L3 / ring runs at peak core-clockspeed for any given CCD
View attachment 83176
Alright, thanks. So they need to have a clock domain crossing between the cores and the ring. For Intel, the ring doesn't peak as high, but it's still in the ballpark, within a GHz or so. Unlikely to make a huge difference either way.
 

Hitman928

Diamond Member
Apr 15, 2012
6,695
12,370
136
Guess I showed the wrong screenshot if you're going to take everything in bad faith. Of course the difference is not that big in this scenario, since the 7800X3D is among the lowest-clocking Zen 4 parts.
View attachment 83178
Do I need to dig up a 6 GHz screenshot of plain old regular Zen 4?

I believe @Exist50 was talking about Intel's cache frequency in your quote, not AMD's.
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,106
136
Guess I showed the wrong screenshot if you're going to take everything in bad faith. Of course the difference is not that big in this scenario, since the 7800X3D is among the lowest-clocking Zen 4 parts.
View attachment 83178
Do I need to dig up a 6 GHz screenshot of plain old regular Zen 4?
That second sentence was "For Intel"....
 
  • Like
Reactions: Det0x

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
I have the opposite impression. GLC is a terrible core. Bloated and power hungry compared to its competitors. Add on a less than ideal process and core count deficit, plus a sprinkling of bugs, and you can more or less explain SPR's problems without needing to look at the memory subsystem.

The original incarnation in ADL was not that good; RPL increased the L2 cache to 2MB and improved L3 speeds a little bit, and it was already decent vs Zen 3 on a similar process.
The massive AVX-512 machinery that got disabled is of course plain stupid, even more so when Zen 4 showed how it's done properly with 256-bit units without blowing up area and power that much.

GLC scales really well with memory latency, when its OoO capacity is used to execute instructions instead of waiting for memory. It would really shine coupled with AMD's speedy L3, even more so with stacked L3.

And we arrive at the starting point of this whole discussion: Intel is not blind to these problems, and they are doing something about it with an additional cache level and most likely a rebalance of the current memory subsystem.
There are multiple ways to arrive at their desired destination of "feeding GLC or an even wider core", but I think the fundamentals remain the same:

1) The L1 cache has to stay at 48KB if the basic TLB unit stays 4KB; they can't do the ARM trick of a 128KB L1D cache
2) They need a massive "2nd" level cache to increase performance and power efficiency by not hitting the "uncore". At some point increasing the core-private L2 stops making sense, I guess; it becomes better to share cache transistors between cores and burn a tiny amount of transistors to ensure "fair" sharing between cores in a cluster to please the cloud crowd.

So a 16-32MB "L2" per cluster => ~2x the cycle latency => they need to insert an additional cache level to not destroy GLC performance completely. Something like a 10-12 cycle L2 of 256-512KB is just right for this task, and it makes some sense to make it inclusive in the next level.
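To put rough numbers on why that extra level is needed, here's a quick Python sketch of the cost of an L1 miss under both layouts. Every figure in it (latencies, hit rates) is an assumption picked only to illustrate the argument, not anything leaked:

# Back-of-the-envelope cost of servicing an L1 miss, in cycles.
# Every latency and hit rate below is an illustrative assumption.

L1_MISS_RATE = 0.10   # assumed fraction of accesses that miss the 48KB L1
BIG_L2_LAT   = 30     # assumed: 16-32MB shared "L2", ~2x today's ~15-cycle 2MB L2
SMALL_L2_LAT = 11     # assumed: the inserted 256-512KB, 10-12 cycle level
SMALL_L2_HIT = 0.70   # assumed fraction of L1 misses caught by the small level

# Option A: on every L1 miss the core goes straight to the big, slow shared cache.
miss_cost_flat = BIG_L2_LAT

# Option B: a small private level filters most L1 misses first
# (serial lookup: a miss there pays its latency plus the big cache's on top).
miss_cost_filtered = SMALL_L2_LAT + (1 - SMALL_L2_HIT) * BIG_L2_LAT

print(f"avg cycles per L1 miss, big cache only : {miss_cost_flat:.1f}")
print(f"avg cycles per L1 miss, with small L2  : {miss_cost_filtered:.1f}")
print(f"added latency per access vs 5-cycle L1 : "
      f"{L1_MISS_RATE * miss_cost_flat:.2f} vs {L1_MISS_RATE * miss_cost_filtered:.2f}")

With those (assumed) numbers, the small inserted level claws back roughly a third of the penalty of the slower shared cache.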

That's my reasoning and attempt to make sense of that "L0" news.
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,106
136
The original incarnation in ADL was not that good; RPL increased the L2 cache to 2MB and improved L3 speeds a little bit, and it was already decent vs Zen 3 on a similar process.
The massive AVX-512 machinery that got disabled is of course plain stupid, even more so when Zen 4 showed how it's done properly with 256-bit units without blowing up area and power that much.

GLC scales really well with memory latency, when its OoO capacity is used to execute instructions instead of waiting for memory. It would really shine coupled with AMD's speedy L3, even more so with stacked L3.

And we arrive at the starting point of this whole discussion: Intel is not blind to these problems, and they are doing something about it with an additional cache level and most likely a rebalance of the current memory subsystem.
There are multiple ways to arrive at their desired destination of "feeding GLC or an even wider core", but I think the fundamentals remain the same:

1) The L1 cache has to stay at 48KB if the basic TLB unit stays 4KB; they can't do the ARM trick of a 128KB L1D cache
2) They need a massive "2nd" level cache to increase performance and power efficiency by not hitting the "uncore". At some point increasing the core-private L2 stops making sense, I guess; it becomes better to share cache transistors between cores and burn a tiny amount of transistors to ensure "fair" sharing between cores in a cluster to please the cloud crowd.

So a 16-32MB "L2" per cluster => ~2x the cycle latency => they need to insert an additional cache level to not destroy GLC performance completely. Something like a 10-12 cycle L2 of 256-512KB is just right for this task, and it makes some sense to make it inclusive in the next level.

That's my reasoning and attempt to make sense of that "L0" news.
Ah, so you're thinking that the current L1 size will remain the floor, but they'll add a larger cache tier / readjust the hierarchy above it? Because when I hear "L0 cache", I'm thinking something smaller and lower latency than the L1, with the rest more or less left the same.
 

Geddagod

Golden Member
Dec 28, 2021
1,531
1,622
106
Ah, so you're thinking that the current L1 size will remain the floor, but they'll add a larger cache tier / readjust the hierarchy above it? Because when I hear "L0 cache", I'm thinking something smaller and lower latency than the L1, with the rest more or less left the same.
Xino claims it's what @JoeRambo is describing: L1 as the floor, an 'old-capacity L2 cache' (I'm guessing 512KB or something), then the regular L2, which is probably going to be at least 2MB, and then the L3.
 

Geddagod

Golden Member
Dec 28, 2021
1,531
1,622
106
A couple of side notes:
  • When is Intel going to show us that SRF die shot they promised :c
  • Interesting that Raichu hasn't yet commented on the new ARL leaks. Even for false leaks, he usually says something... and also, for the memes, I expect an MLID cope video on Thursday
  • Speaking of ARL, I wonder if LNC's entire frequency/power curve is screwed up, or if it's just the top end of the curve, if it struggles to hit the 5GHz-ish range.
  • Intel's Q2 earnings call is on the 27th, hopefully with some updates. Hopefully they'll also solidify DMR on the roadmap (2025 or 2026).
  • Wonder if Intel's reluctance to talk about ARL (even in comparison to LNL) is due to them having troubles with the development of ARL.
  • More Intel cost cutting. At least it's not the engineers...
  • NUC isn't dead, kinda?
 

SpudLobby

Golden Member
May 18, 2022
1,041
702
106
A pity they aren't working on an MTL refresh using Intel 3, or is there a chance an MTL refresh could use Intel 3? I guess not. If they really intend to use Intel 4 MTL for ARL-U, we might see ARL with a big variation of process nodes and architectures. Some people are saying 20A is still planned for some SKUs. Maybe N3B for 8+16, 20A for 6+8, and Intel 4 for 2+8.
Yep, this is my thought. But don't hold your breath.

It's Intel.
 

SpudLobby

Golden Member
May 18, 2022
1,041
702
106
So what's going on with this rumored L0? Are they going to make the L1 cache larger and add an L0, shifting the hierarchy up a bit, or just pad the cache hierarchy with a new smaller, lower-latency level below the current L1?
 

moinmoin

Diamond Member
Jun 1, 2017
5,242
8,456
136
NUC isn't dead, kinda?
Likely little more than ASUS licensing/buying the brand. At best they keep using the same ODM that Intel used (though with the move to plastic housings I can't say I care much; the aluminium ones before that were nice).
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Ah, so you're thinking that the current L1 size will remain the floor, but they'll add a larger cache tier / readjust the hierarchy above it? Because when I hear "L0 cache", I'm thinking something smaller and lower latency than the L1, with the rest more or less left the same.
"L0 cache below L1" => the real question is how would that cache really operate.
I think with caching, if we were to ignore engineering things, two things matter -> latency and capacity. Current L1 is 48KB at 5 cycle latency. Below them we have VRF at 1 ( or 0 cycles if stars align ) . VRF is very substantial with hundreds of 64 and 256(512) bit registers. Tens of kilobytes really even before we consider "metadata".

Then there is the architectural problem of making said L0 fast enough. I think the P4 initially had a 2-cycle L1 of 8KB. Then it got nerfed to 16KB at 4 cycles. But that was a different era of clocks, with much smaller code and data sizes, and 32 bits.
Would such an L0 cache make sense today? Nope. Once we consider that you need to check the L0 (even at 2 cycles) and then, on a miss, check the L1 (which is 5 cycles today), the end result is probably a loss of perf and energy, at considerable expense of area and complexity.
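To put purely illustrative numbers on the latency side of that (a quick Python sketch; every figure is assumed, and it says nothing about the energy/area cost):

# Latency break-even for a hypothetical 2-cycle L0 in front of today's 5-cycle L1,
# assuming a serial lookup (check L0, then L1 on a miss). All numbers are assumptions.

L0_LAT = 2   # cycles, assumed
L1_LAT = 5   # cycles, roughly today's L1 load-to-use

def avg_load_latency(l0_hit_rate: float) -> float:
    # Hit: pay the L0 latency. Miss: pay the L0 latency, then the full L1 lookup on top.
    return L0_LAT + (1.0 - l0_hit_rate) * L1_LAT

for hit in (0.3, 0.4, 0.6, 0.8):
    print(f"L0 hit rate {hit:.0%}: avg load latency {avg_load_latency(hit):.1f} cycles "
          f"(vs {L1_LAT} with no L0)")

# Break-even: L0_LAT + (1 - h) * L1_LAT = L1_LAT  =>  h = L0_LAT / L1_LAT = 40%

So the L0 only pays off on latency if it catches a large share of loads, which is a lot to ask from something necessarily tiny sitting under a 48KB L1.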

Intel's and AMD's L1s are where they are today due to limitations inherent to the 4KB page size on x86.

The linked article is a good read if you want to know why. Apple gets around it by having 16KB pages, so a 128KB L1D works for them; they'd have a 32KB L1D if they were on x86 with a similar setup.
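The arithmetic behind that constraint: with a VIPT L1 you index with untranslated page-offset bits, so each way can hold at most one page and capacity can only grow with associativity. A quick sketch (the way counts are just the commonly reported ones):

# VIPT constraint: max L1 capacity = page size * number of ways.

def max_vipt_l1_kb(page_kb: int, ways: int) -> int:
    # A VIPT cache indexed purely by page-offset bits can hold at most one page per way.
    return page_kb * ways

print(max_vipt_l1_kb(4, 12))   # 48  -> Golden Cove's 48KB, 12-way L1D on 4KB x86 pages
print(max_vipt_l1_kb(4, 8))    # 32  -> Zen 4's 32KB, 8-way L1D; also what an 8-way design is stuck at on x86
print(max_vipt_l1_kb(16, 8))   # 128 -> Apple's 128KB, 8-way L1D, enabled by 16KB pages

So with 4KB pages the only lever left is associativity, and that gets expensive quickly.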

Xino claims it's what @JoeRambo is describing: L1 as the floor, an 'old-capacity L2 cache' (I'm guessing 512KB or something), then the regular L2, which is probably going to be at least 2MB, and then the L3.

The same latency/size considerations apply, I think, to this small L2; without making the next level way larger than the current 2MB, it does not make much sense to insert that additional level IMO.
 

mikk

Diamond Member
May 15, 2012
4,296
2,382
136
  • Wonder if Intel's reluctance to talk about ARL (even in comparison to LNL) is due to them having troubles with the development of ARL.


I always wondered why they talked much more about LNL and basically nothing about ARL. And now it turns out ARL-U won't even be using a real ARL. What about ARL-P? Maybe it's the same.
 

Geddagod

Golden Member
Dec 28, 2021
1,531
1,622
106
The same latency/size considerations apply, I think, to this small L2; without making the next level way larger than the current 2MB, it does not make much sense to insert that additional level IMO.
Ye, I think 2MB really is the minimum here. Interesting how both Zen 5 and LNC look to be getting some big changes in their cache subsystems (Zen 5's increased L1 and the 'ladder' cache (istg if it's just mesh lol)).
 

Geddagod

Golden Member
Dec 28, 2021
1,531
1,622
106
I always wondered why they talked much more about LNL and basically nothing about ARL. And now it turns out ARL-U won't even be using a real ARL. What about ARL-P? Maybe it's the same.
ARL-P is like the one sku rumored to use 20A lol. The 6+8 die. Are you referring to the ARL-U not being real ARL bcuz of MTL-R or something filling up that lineup? In that case, I wouldn't be surprised if LNL fills up the premium 'U' segment, considering it looks like the 'U' segment base TDP is 15 watts.
 

mikk

Diamond Member
May 15, 2012
4,296
2,382
136
ARL-P is like the one sku rumored to use 20A lol. The 6+8 die. Are you referring to the ARL-U not being real ARL bcuz of MTL-R or something filling up that lineup? In that case, I wouldn't be surprised if LNL fills up the premium 'U' segment, considering it looks like the 'U' segment base TDP is 15 watts.


Only ARL-S gets the new extensions like SHA512 or AVX-VNNI-INT16, according to Intel's programming reference PDF as well as the latest GCC 14 compiler release. The other ARL parts don't get the new ones, which could suggest they are based on an older architecture like Redwood Cove + Crestmont.
 
  • Like
Reactions: moinmoin

Exist50

Platinum Member
Aug 18, 2016
2,452
3,106
136
The linked article is a good read if you want to know why. Apple gets around it by having 16KB pages, so a 128KB L1D works for them; they'd have a 32KB L1D if they were on x86 with a similar setup.
Eh, even there, they propose a way to make a bigger L1. They just dismiss it as too big and power hungry. In general, it's a poor bet to assume there's no engineering solution to these sorts of scaling problems.

As for latency, it's entirely possible to make a lower-latency cache than 5 cycles. They could probably have a single-cycle cache if they backed off the frequency a bit.
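Rough sketch of that relationship, with made-up access times; the cycle count is just the absolute access time measured in clock periods:

# Cache latency in cycles = absolute access time * clock frequency.
# Both access times below are assumptions for illustration only.

def cache_cycles(access_time_ns: float, clock_ghz: float) -> float:
    return access_time_ns * clock_ghz

print(f"{cache_cycles(0.86, 5.8):.1f}")   # ~5.0: a ~0.86ns L1 path at 5.8 GHz, i.e. today's 5-cycle L1
print(f"{cache_cycles(0.86, 4.0):.1f}")   # ~3.4: the same array at 4 GHz already spans fewer cycles
print(f"{cache_cycles(0.30, 4.0):.1f}")   # ~1.2: a smaller/faster array at lower clock approaches single-cycle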

If what you propose is what happens, then great. I just wouldn't bet on some of those assumptions holding long term.
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,106
136
More Intel cost cutting. At least it's not the engineers...
I'm not convinced the engineers are unaffected. Certainly they've made drastic cuts to engineering as well. But at a certain point, it's hard to distinguish what "wave" of layoffs they're at and how many people actually remain.
I wouldn't be surprised if LNL fills up the premium 'U' segment, considering it looks like the 'U' segment base TDP is 15 watts.
That would make sense, since LNL would clearly be the more premium of the two. Hard to say what TDP range this "MX" line will target, but if nothing else, 4+4 vs (a theoretical) 2+8 probably isn't a hugely meaningful difference for this market.
Wonder if Intel's reluctance to talk about ARL (even in comparison to LNL) is due to them having troubles with the development of ARL.
If there are unique troubles with one of the two, it would have to be LNL. I think they're just talking about LNL more because it's so different from MTL.
 
  • Like
Reactions: SpudLobby

Geddagod

Golden Member
Dec 28, 2021
1,531
1,622
106
I'm not convinced the engineers are unaffected. Certainly they've made drastic cuts to engineering as well. But at a certain point, it's hard to distinguish what "wave" of layoffs they're at and how many people actually remain.
Ian reported a new round of marketing layoffs in the past 24 hrs (yesterday). I hope it doesn't affect engineers in this 'round'. Not that I hate the people who have jobs in marketing, I mean it must suck getting laid off, but Intel needs their engineers more than ever.
If there are unique troubles with one of the two, it would have to be LNL. I think they're just talking about LNL more because it's so different from MTL.
That would make sense. It appears as if LNL's uncore changes are the main star of the show, so even if LNC is a bit lackluster, LNL as a whole could still be really impressive.

A bit unrelated, but I wonder if both AMD and Intel kinda shot themselves in the foot recently with their all-core frequency boosts. The 7950X boosts up to 5.4 GHz all-core, and the 7600X literally boosts to the same speed in ST as it does in MT (5.5 GHz).
The 13900K meanwhile has its P-cores go up to 5.5 GHz and its E-cores to 4.3 GHz, which again seems really close to the P-core peak ST frequency of 5.8 GHz.
Usually, when a company moves to a newer node, they are able to boost the all-core frequency as a large part of their MT gains, but it appears that for ARL and Zen 5, at maxed-out power draws, the only MT perf gain they are going to get is from IPC increases, since the all-core frequency doesn't have anywhere to go.
Obv this doesn't mean that there aren't any benefits, since efficiency and perf at lower power draws are going to improve, but in terms of total performance uplift, without a care for power, the uplift could look small. For example, the 7950X vs the 5950X was nearly a 50% perf uplift in MT CB23, largely due to the massive all-core frequency uplift, but for the 8950X vs the 7950X, I would expect the MT uplift to be at most 35%, since it's going to have to be essentially all IPC, given the 7950X all-core clocks are already so high.
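As a toy model (every number below is a guess/assumption, purely to illustrate the point), MT throughput at the power wall scales roughly with cores x all-core clock x IPC, so once clocks and core counts stop moving, IPC is the whole uplift:

# Toy model: MT throughput at the power wall ~ cores * all-core clock * IPC.
# Every figure below is an assumption/guess purely to illustrate the point.

def mt_perf(cores: int, all_core_ghz: float, ipc: float) -> float:
    return cores * all_core_ghz * ipc

old_gen = mt_perf(16, 4.0, 1.00)          # assumed ~4.0 GHz all-core previous gen, IPC normalized to 1.0
cur_gen = mt_perf(16, 5.4, 1.13)          # assumed ~13% IPC gain plus the big all-core clock jump
nxt_gen = mt_perf(16, 5.4, 1.13 * 1.20)   # assumed ~20% IPC gain, with clocks having nowhere to go

print(f"clock + IPC generation : {cur_gen / old_gen - 1:+.0%}")   # roughly +50%
print(f"IPC-only generation    : {nxt_gen / cur_gen - 1:+.0%}")   # roughly +20%

Tune the assumed IPC and clocks however you like; the shape stays the same once the all-core clock stops moving.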
I think for this reason alone it should make sense for PTL/Zen 6 to increase core counts (8+32 for PTL maybe, 24 cores for Zen 6?). Those shouldn't have large IPC increases, and unless we see max-frequency core clocks hitting like 7 GHz, we won't see large gains in all-core frequency either. Obv we could also see Zen 6/PTL only increase MT perf marginally (kinda like Zen 3) too, but I hope that with increased competition between AMD and Intel we could see more (though feeding all those cores with bandwidth is a problem that is also going to have to be solved, obviously).