Discussion Intel current and future Lakes & Rapids thread


Saylick

Diamond Member
Sep 10, 2012
4,052
9,472
136
That happens inside a single core. Rentable cores are supposedly going to take this a step further and allow the instructions of a single thread to be executed out of order on different cores.
But why do that if the same core has enough width in the execution engine to do it in parallel and "in-house"? That's what superscalar execution is all about. If a single thread can't saturate the execution engine, then you can use SMT to bring in another thread.

If there truly is a thread that has a portion that is so embarrassingly parallel that bringing in another core to do the work benefits it, a GPU likely would be a better choice. In reality, I don't think there are too many programs that fall in this Goldilocks region where they run best on what would effectively be a 20+ wide execution engine (made from two or more smaller CPU cores) but are not so embarrassingly parallel that a GPU could run them faster.
 

Saylick

Diamond Member
Sep 10, 2012
4,052
9,472
136
That's the problem! SMT is going away!
Sure, or you can also have a deeper reorder buffer (ROB) so that you can shuffle instructions around to prevent the core from idling.

Again, I'm not sure what the goal is for Rentable Units. If the intent is to basically do the elusive "Reverse HT", whereby smaller cores "fuse" into a bigger core, then they should just call it what it is.
 
Jul 27, 2020
28,109
19,175
146
1692224051411.png
I know my MS Paint skills suck. Let's move on :)

So let's assume the thread is sliced into 5 streams based on branches.

Ideally, P-core will predict well and take the correct branch and breeze through the execution. E-core will execute the other branches but it will be for nought.

Let's look at the more realistic scenario:

P-core executes the first stream successfully. Makes a booboo at the 2nd branching stream and takes the wrong turn. While the P-core's pipeline is being flushed, execution control is handed over to E-core. It has already executed the other branch in advance so it has the results ready. So taking this result, control is handed back to P-core and it begins executing the third stream instead of having to start from scratch. It executes the 3rd stream successfully and mispredicts on the 4th one. Rinse. Repeat and so on.
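
To put rough numbers on the scenario above, here is a toy cycle count. It is only a sketch of the idea as described in this post, not of anything Intel has confirmed: the function name `run_thread` and all three cost constants are made up for illustration.

```python
# All three costs are hypothetical placeholders, not real Intel figures.
STREAM_CYCLES = 10   # cycles to execute one stream of the thread
FLUSH_PENALTY = 15   # cycles lost when the P-core flushes its pipeline
HANDOFF_COST = 3     # cycles to hand control to the E-core and back

def run_thread(predictions_correct, use_ecore):
    """Return total cycles for a thread sliced into streams at branches.

    predictions_correct[i] is True when the P-core guessed branch i correctly.
    """
    cycles = 0
    for correct in predictions_correct:
        cycles += STREAM_CYCLES          # P-core runs the path it predicted
        if correct:
            continue                     # E-core's work on the other path is discarded
        if use_ecore:
            cycles += HANDOFF_COST       # E-core already holds the other path's result
        else:
            cycles += FLUSH_PENALTY + STREAM_CYCLES  # flush, then redo the right path
    return cycles

# Five streams; the P-core mispredicts at the 2nd and 4th, as in the post.
outcome = [True, False, True, False, True]
baseline = run_thread(outcome, use_ecore=False)
with_ecore = run_thread(outcome, use_ecore=True)
print(baseline, with_ecore)
```

With perfect prediction both variants cost the same, so in this model any gain comes entirely from the mispredicted streams, and the E-core's work on correctly predicted branches is pure overhead.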
 

Saylick

Diamond Member
Sep 10, 2012
4,052
9,472
136
Thank you for explaining like I'm five. The MS Paint sketches were on point.
 

A///

Diamond Member
Feb 24, 2017
4,351
3,160
136
I can believe it if Intel Israel is behind this innovation. They seem to be the ones always rescuing Intel in their time of need.
Intel Israel has also messed up historically. Nothing special about it. Just because they came up with Core, and thus Core 2 Duo, means nothing nearly 20 years later. I seem to recall a company that released a set of processors over 5 years ago that has been a thorn in Intel's backside ever since, and that Intel can't manage to take out.
 
Jul 27, 2020
28,109
19,175
146
Intel Israel has also messed up historically. Nothing special about it. Just because they came up with Core, and thus Core 2 Duo, means nothing nearly 20 years later.
Most recently: https://www.pcworld.com/article/395006/intels-alder-lake-what-you-need-to-know.html

Arik Gihon, the chief architect of Alder Lake, told attendees that Alder Lake will be able to clock the memory speed up or down, saving power, in response to real-time heuristic analysis of the work being performed.
1692225160982.png
Imagine where Intel would be at today if that guy in their Israel R&D center hadn't designed Alder Lake...
 

A///

Diamond Member
Feb 24, 2017
4,351
3,160
136
Jul 27, 2020
28,109
19,175
146
Yep, love me a heterogeneous processor that needs more cores than the competition and more power to perform on a near equal footing. who doesn't?
I'm saying it could be a LOT worse. How about only 10 or 12 P-cores max and you can't use all of them at once otherwise the CPU throttles down to 3 GHz in all core workloads.
 

A///

Diamond Member
Feb 24, 2017
4,351
3,160
136
I'm saying it could be a LOT worse. How about only 10 or 12 P-cores max and you can't use all of them at once otherwise the CPU throttles down to 3 GHz in all core workloads.
I think worse would have been if Tejas had made it out of the lab. Quite honestly, it would have been nice to have cheaper electrical heat then, because gas was very expensive in those years.
 

H433x0n

Golden Member
Mar 15, 2023
1,224
1,606
106
Yep, love me a heterogeneous processor that needs more cores than the competition and more power to perform on a near equal footing. who doesn't?
Heterogeneous is the way things are going. Soon there will be 3 or more different cores powering client processors.
 

A///

Diamond Member
Feb 24, 2017
4,351
3,160
136
Heterogeneous is the way things are going. Soon there will be 3 or more different cores powering client processors.
It is, but Intel's method may have cost them dearly in the long run. Too early to tell. We'll get a better idea in late 2024 or mid 2025 with AMD's Zen 6 normal and dense cores being on the same package. There's been some discussion at length in the Zen 5 thread about how this will turn out. Who knows, tbh. Anyway, it's Friday, which means it's wine night, toodlie doos.
 

naukkis

Golden Member
Jun 5, 2002
1,020
853
136
P-core executes the first stream successfully. Makes a booboo at the 2nd branching stream and takes the wrong turn. While the P-core's pipeline is being flushed, execution control is handed over to E-core. It has already executed the other branch in advance so it has the results ready. So taking this result, control is handed back to P-core and it begins executing the third stream instead of having to start from scratch. It executes the 3rd stream successfully and mispredicts on the 4th one. Rinse. Repeat and so on.

I really don't see how it's possible to share thread execution between cores. Not only do the L1 caches have to be shared between cores, but the register file too. That kind of execution splitting is only possible when cores share L1 caches and register files. Is Intel designing something like Bulldozer but with at least partially shared register and L1-caches?
 
Jul 27, 2020
28,109
19,175
146
Is Intel designing something like Bulldozer but with at least partially shared register and L1-caches?
One possibility is that each compute cluster will be composed of an identical number of paired P-cores and E-cores with shared caches. Another crazy possibility is that the E-core cluster is stacked on top of the P-core cluster with shared cache in between. But this is all speculation so far. Maybe we won't see rentable cores before Lunar Lake.
 

SiliconFly

Golden Member
Mar 10, 2023
1,924
1,284
106
From what I understand, Intel will slice a single thread into different instruction streams and try to process them on different cores so that if one instruction stream comes to a halt for whatever reason, the processing keeps going on, on the other cores and results are ready before the execution reaches that part of the process's instructions.
I don't think so. It's just not possible to "slice" a thread and have the pieces execute in parallel on different cores, due to race conditions. It's a nightmare scenario for any core. Not achievable. Thread performance can never exceed core performance.

From what I gather, rentable units are 2 (or more) cores coupled into a core complex. And when the main core wants to increase IPC, it just "borrows" (or rents) resources from other idle core(s) in the core complex, like registers, caches, ALUs, TLBs, etc. Massively interconnected and bridgeable core logic.

If implemented right, lightly threaded (or single-threaded) workloads can gain a significant bump in performance. But under heavy multi-threaded workloads, it falls back to native physical core performance (like RPL without hyper-threading).
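
A toy model of that "renting" idea, purely as a sketch of the guess in this post: the `Core` class, the `effective_alus` helper and the ALU counts are all hypothetical, and real hardware would have to deal with latency, clocking and arbitration that this ignores.

```python
from dataclasses import dataclass

@dataclass
class Core:
    name: str
    alus: int            # execution units owned by this core (hypothetical count)
    busy: bool = False   # True when the core is running its own thread

def effective_alus(main, complex_cores):
    """ALUs available to `main`: its own plus those rented from idle neighbours."""
    rented = sum(c.alus for c in complex_cores if c is not main and not c.busy)
    return main.alus + rented

core0 = Core("core0", alus=6, busy=True)
core1 = Core("core1", alus=6)
pair = [core0, core1]

light = effective_alus(core0, pair)   # core1 is idle, so core0 rents its units
core1.busy = True
heavy = effective_alus(core0, pair)   # both busy: back to native core width
print(light, heavy)
```

The lightly threaded case sees the combined width, while the fully loaded case falls back to what each physical core owns, which is the behaviour the post describes.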
 

SiliconFly

Golden Member
Mar 10, 2023
1,924
1,284
106
Lol

No. Only person who thinks it's RWC+ is witeken...
At this point, it's getting real hard to say whether ARL has RWC+ or LNC. Too many leaks say it can be either. But initial ARL leaks clearly said it's LNC. And now considering the rumors that say ARL doesn't have hyper-threading, it looks more likely ARL has the first iteration of Jim Keller's core (i.e., LNC).

But then again, Intel being Intel, it's better to be safe than sorry. Even the s****y Kaby Lake came out after Skylake with hyper-threading disabled for lame reasons. If they have their way, ARL could be the next Kaby Lake. Don't underestimate Intel. ;)
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
I really don't see how it's possible to share thread execution between cores. Not only do the L1 caches have to be shared between cores, but the register file too. That kind of execution splitting is only possible when cores share L1 caches and register files. Is Intel designing something like Bulldozer but with at least partially shared register and L1-caches?

I 100% agree that thread execution is not sharable between cores, and that speculative execution ahead on a different core is a load of BS that would not work, nor would it make any sense from an energy efficiency or post-Spectre security point of view even if it did.

Can L2 be shared without any complications and with huge gains? Damn sure, Penryn had 6MB of L2 for two cores with awesome latency. Even if SRAM is not scaling, additional transistors spent on logic to make sharing "smooth" do make sense, and with the rumoured extra lower cache level it would work.
Can L1 be shared? Makes little sense, as size is limited anyway; any "sharing" will add latency and kill performance.
Can the PRF be shared? 300x512 bits + 300x64 bits is nearly 50KB of brutally multiported structure in Golden Cove already, size bounded by ROB sizes. I doubt it can be shared without introducing problems with clocking, energy saving etc. But who knows, esp. the FP side is very wasteful currently.
Can various load/store queues be shared? When L2 is shared and is inclusive of L1? Why not, as long as it does not introduce glass jaws like Bulldozer.
Can L2 BTB and TLBs be shared? For sure, these are huge and there was a massive increase in security-related transistors to make them not prone to "Spectre"-like meltdowns. Might as well make them even more massive and shared by two cores.
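
A quick check of the PRF arithmetic, as a sketch only: the helper `prf_bytes` is a made-up name, and it counts just the raw data bits from the round figures in the post (300 entries of 512 bits plus 300 entries of 64 bits). On those numbers alone the total comes out lower than the size quoted, so the quoted size presumably counts more than the bare register bits (that is an assumption; the post doesn't say).

```python
def prf_bytes(entries_and_widths):
    """Total storage in bytes for a list of (entry_count, bits_per_entry) pairs."""
    total_bits = sum(entries * bits for entries, bits in entries_and_widths)
    return total_bits // 8

# Round figures quoted in the post: 300 x 512-bit vector + 300 x 64-bit integer.
raw = prf_bytes([(300, 512), (300, 64)])
print(raw, "bytes =", round(raw / 1024, 1), "KiB of raw register data")
```

Either way, the point about porting stands: the cost of a register file is driven far more by the number of read/write ports than by its raw capacity.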
 

lightisgood

Senior member
May 27, 2022
250
121
86

Attachment: 6th_Xeon.png

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
27,250
16,108
136

This is A0 stepping.
So, probably SRF will be up to 432 cores.
No benchmarks at all ?? Nothing leaked ?
 

Abwx

Lifer
Apr 2, 2011
11,885
4,873
136

This is A0 stepping.
So, probably SRF will be up to 432 cores.
2 x 48 = 96
3 x 48 = 144
 

Henry swagger

Senior member
Feb 9, 2022
512
313
106

This is A0 stepping.
So, probably SRF will be up to 432 cores.
Wonder what clock speed they'll run at? 🤔 ... maybe 4 GHz all-core 😁
 

Abwx

Lifer
Apr 2, 2011
11,885
4,873
136
'3' x N = PRQ
SRF is not PRQ yet.

X x N = 144
Repeatedly, Yuuk_AnS says that is A0 stepping (i.e. X = 1 , therefore, N = 144).
3 x 144 = 432

Consequently, SRF will be up to 432 cores.
350W is not enough for 432 cores; for 144 cores that's about 2.4 W/core.
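
The core-count and power arithmetic in this exchange can be checked in a few lines. This is a sketch under the thread's own assumptions: 144 cores per tile (N) with the A0 sample being a single tile, and the 350 W figure treated as total package power. `watts_per_core` is just an illustrative helper name.

```python
PACKAGE_WATTS = 350      # power figure quoted in the post, assumed to be per package
CORES_PER_TILE = 144     # N, if the A0 sample is a single tile (X = 1)

def watts_per_core(tiles):
    """Return (total cores, watts available per core) for a given tile count."""
    cores = tiles * CORES_PER_TILE
    return cores, PACKAGE_WATTS / cores

for tiles in (1, 3):
    cores, watts = watts_per_core(tiles)
    print(f"{tiles} tile(s): {cores} cores, {watts:.2f} W/core")
```

At three tiles the same 350 W would leave well under 1 W per core, which is the objection being raised here.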