Discussion Intel current and future Lakes & Rapids thread

Saylick · Aug 16, 2023

igor_kavinski said:
That happens inside a single core. Rentable cores are supposedly going to take this a step further and allow the instructions of a single thread to be executed out of order on different cores.

But why do that if the same core has enough width in the execution engine to do it in parallel and "in-house"? That's what superscalar execution is all about. If a single thread can't saturate the execution engine, then you can use SMT to bring in another thread.

If there truly is a thread that has a portion that is so embarrassingly parallel that bringing in another core to do the work benefits it, a GPU likely would be a better choice. In reality, I don't think there are too many programs that fall in this Goldilocks region where it runs best on what would effectively be a 20+ wide execution engine (made from two or more smaller CPU cores) but not so embarrassingly parallel that a GPU could run it faster.

igor_kavinski · Aug 16, 2023

Saylick said:
If a single thread can't saturate the execution engine, then you can use SMT to bring in another thread.

That's the problem! SMT is going away!

Saylick · Aug 16, 2023

igor_kavinski said:
That's the problem! SMT is going away!

Sure, or you can also have a deeper reorder buffer (ROB) so that you can shuffle instructions around to prevent the core from idling.

Again, I'm not sure what the goal is for Rentable Units. If the intent is to basically do the elusive "Reverse HT", whereby smaller cores "fuse" into a bigger core, then they should just call it what it is.

igor_kavinski · Aug 16, 2023

I know my MS Paint skills suck. Let's move on

So let's assume the thread is sliced into 5 streams based on branches.

Ideally, P-core will predict well and take the correct branch and breeze through the execution. E-core will execute the other branches but it will be for nought.

Let's look at the more realistic scenario:

P-core executes the first stream successfully. Makes a booboo at the 2nd branching stream and takes the wrong turn. While the P-core's pipeline is being flushed, execution control is handed over to E-core. It has already executed the other branch in advance so it has the results ready. So taking this result, control is handed back to P-core and it begins executing the third stream instead of having to start from scratch. It executes the 3rd stream successfully and mispredicts on the 4th one. Rinse. Repeat and so on.
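As a rough illustration of the hand-off described above, here is a toy cycle-count model in Python. It is only a sketch under made-up assumptions: the cycle costs (`work`, `wrong_path`, `flush`, `handoff`) and the function names are hypothetical, the E-core is assumed to always have the other branch's result ready in time, and nothing here reflects how rentable units would actually be implemented.

```python
# Toy model only: all cycle costs below are invented for illustration.

def baseline_cycles(mispredicts, work, wrong_path, flush):
    """P-core alone: every stream is executed on the P-core, and a
    mispredict adds wasted wrong-path cycles plus a pipeline flush."""
    total = 0
    for missed in mispredicts:
        if missed:
            total += wrong_path + flush
        total += work  # the correct path still has to run on the P-core
    return total


def helper_cycles(mispredicts, work, wrong_path, flush, handoff):
    """P-core plus a helper E-core that pre-executes the other branch.
    On a mispredict the P-core takes the E-core's result instead of
    re-running the stream (assumes the result is always ready)."""
    total = 0
    for missed in mispredicts:
        if missed:
            total += wrong_path + flush + handoff
        else:
            total += work
    return total


# Five streams; the 2nd and 4th branches are mispredicted, as in the post.
mispredicts = [False, True, False, True, False]
base = baseline_cycles(mispredicts, work=100, wrong_path=30, flush=20)
helped = helper_cycles(mispredicts, work=100, wrong_path=30, flush=20, handoff=10)
print(base, helped)
```

The model leaves out the energy the E-core spends on branches that turn out not to be needed, which is the "for nought" case mentioned above.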

Saylick · Aug 16, 2023

igor_kavinski said:
I know my MS Paint skills suck. Let's move on

So let's assume the thread is sliced into 5 streams based on branches.

Ideally, P-core will predict well and take the correct branch and breeze through the execution. E-core will execute the other branches but it will be for nought.

Let's look at the more realistic scenario:

P-core executes the first stream successfully. Makes a booboo at the 2nd branching stream and takes the wrong turn. While the P-core's pipeline is being flushed, execution control is handed over to E-core. It has already executed the other branch in advance so it has the results ready. So taking this result, control is handed back to P-core and it begins executing the third stream instead of having to start from scratch. It executes the 3rd stream successfully and mispredicts on the 4th one. Rinse. Repeat and so on.

Thank you for explaining like I'm five. The MS Paint sketches were on point.

A/// · Aug 16, 2023

igor_kavinski said:
I can believe it if Intel Israel is behind this innovation. They seem to be the ones always rescuing Intel in their time of need.

Intel israel has also messed up historically. nothing special about it. just because they came up with core and thus core 2 duo means nothing nearly 20 years later. I seem to recall a company that released a set of processors over 5 years ago that has been a thorn in intel's backside ever since and that intel can't manage to take out.

igor_kavinski · Aug 16, 2023

A/// said:
Intel israel has also messed up historically. nothing special about it. just because they came up with core and thus core 2 duo means nothing nearly 20 years later.

Most recently: https://www.pcworld.com/article/395006/intels-alder-lake-what-you-need-to-know.html

Arik Gihon, the chief architect of Alder Lake, told attendees that Alder Lake will be able to clock the memory speed up or down, saving power, in response to real-time heuristic analysis of the work being performed.

Imagine where Intel would be at today if that guy in their Israel R&D center hadn't designed Alder Lake...

A/// · Aug 16, 2023

igor_kavinski said:
Most recently: https://www.pcworld.com/article/395006/intels-alder-lake-what-you-need-to-know.html

Imagine where Intel would be at today if that guy in their Israel R&D center hadn't designed Alder Lake...

Yep, love me a heterogeneous processor that needs more cores than the competition and more power to perform on a near equal footing. who doesn't?

igor_kavinski · Aug 16, 2023

A/// said:
Yep, love me a heterogeneous processor that needs more cores than the competition and more power to perform on a near equal footing. who doesn't?

I'm saying it could be a LOT worse. How about only 10 or 12 P-cores max and you can't use all of them at once otherwise the CPU throttles down to 3 GHz in all core workloads.

A/// · Aug 16, 2023

igor_kavinski said:
I'm saying it could be a LOT worse. How about only 10 or 12 P-cores max and you can't use all of them at once otherwise the CPU throttles down to 3 GHz in all core workloads.

I think worse would have been if tejas made it out of the lab. quite honestly it would have been nice having cheaper electrical heat then, because gas was very expensive in those years.

H433x0n · Aug 16, 2023

A/// said:
Yep, love me a heterogeneous processor that needs more cores than the competition and more power to perform on a near equal footing. who doesn't?

Heterogeneous is the way things are going. Soon there will be 3 or more different cores powering client processors.

A/// · Aug 16, 2023

H433x0n said:
Heterogeneous is the way things are going. Soon there will be 3 or more different cores powering client processors.

it is, but intel's method may have cost them dearly in the long run. too early to tell. we'll get a better idea in late 2024 or mid 2025 with amd's zen 6 normal and dense being on the same package. there's been some discussion at length in the zen 5 thread about how this will go, but who knows tbh. anyway it's friday which means it's wine night, toodlie doos.

naukkis · Aug 17, 2023

igor_kavinski said:
P-core executes the first stream successfully. Makes a booboo at the 2nd branching stream and takes the wrong turn. While the P-core's pipeline is being flushed, execution control is handed over to E-core. It has already executed the other branch in advance so it has the results ready. So taking this result, control is handed back to P-core and it begins executing the third stream instead of having to start from scratch. It executes the 3rd stream successfully and mispredicts on the 4th one. Rinse. Repeat and so on.

I really don't see how it's possible to share thread execution between cores. Not only do the L1-caches have to be shared between cores, but the register file too. That kind of execution splitting is only possible when cores share L1-caches and register files. Is Intel designing something like Bulldozer but with at least partially shared register and L1-caches?

igor_kavinski · Aug 17, 2023

naukkis said:
Is Intel designing something like Bulldozer but with at least partially shared register and L1-caches?

One possibility is that each compute cluster will be composed of an identical number of paired P-cores and E-cores with shared caches. Another crazy possibility is that the E-core cluster is stacked on top of the P-core cluster with shared cache in between. But this is all speculation so far. Maybe we won't see rentable cores before Lunar Lake.

SiliconFly · Aug 17, 2023

igor_kavinski said:
From what I understand, Intel will slice a single thread into different instruction streams and try to process them on different cores so that if one instruction stream comes to a halt for whatever reason, the processing keeps going on, on the other cores and results are ready before the execution reaches that part of the process's instructions.

I don't think so. It's just not possible to "slice" a thread and have the pieces execute in parallel on different cores due to race conditions. It's a nightmare scenario for any core. Not achievable. Thread performance can never exceed core performance.

From what I gather, rentable units are 2 (or more) cores coupled into a core complex. And when the main core wants to increase IPC, it just "borrows" (or rents) resources from other idle core(s) from the core complex, like registers, caches, alu, tlb, etc. Massively interconnected and bridgeable core logic.

If implemented right, lightly threaded (or single-threaded) workloads can gain a significant bump in performance. But under heavy multi-threaded workloads, it falls back to native physical core performance (like RPL without hyper-threading).

SiliconFly · Aug 17, 2023

Geddagod said:
Lol

No. Only person who thinks it's RWC+ is witeken...

At this point, it's getting real hard to say whether ARL has RWC+ or LNC. Too many leaks say it can be either. But initial ARL leaks clearly said it's LNC. And now considering the rumors that say ARL doesn't have hyper-threading, it looks more likely ARL has the first iteration of Jim Keller's core (i.e., LNC).

But then again, Intel being Intel, it's better to be safe than sorry. Even the s****y kaby lake came out after skylake with hyper-threading disabled for lame reasons. If they have their way, ARL could be the next kaby lake. Don't underestimate Intel.

igor_kavinski · Aug 17, 2023

SiliconFly said:
From what I gather, rentable units are 2 (or more) cores coupled into a core complex.

Ah. Now that term rentable makes sense.

JoeRambo · Aug 17, 2023

naukkis said:
I really don't see how it's possible to share thread execution between cores. Not only do the L1-caches have to be shared between cores, but the register file too. That kind of execution splitting is only possible when cores share L1-caches and register files. Is Intel designing something like Bulldozer but with at least partially shared register and L1-caches?

I 100% agree that thread execution is not sharable between cores, and that speculative execution ahead on a different core is a load of bs that would not work, nor would it make any sense from an energy efficiency or post-Spectre security point of view even if it did.

Can L2 be shared without any complications and with huge gains? Damn sure, Penryn had 6MB of L2 for two cores with awesome latency. Even if SRAM is not scaling, additional transistors spent on logic to make sharing "smooth" does make sense and with rumoured extra lower cache level it would work.
Can L1 be shared? Makes little sense, as size is limited anyway, any "sharing" will add latency and kill performance.
Can PRF be shared? 300x512bits + 300x64bits is over 20KB of brutally multiported structure in Golden Cove already, size bounded by ROB sizes. I doubt it can be shared without introducing problems with clocking, energy saving etc. But who knows, esp FP side is very wasteful currently.
Can various load/store queues be shared? When L2 is shared and is inclusive of L1? Why not, as long as it does not introduce glass jaws like Bulldozer.
Can L2 BTB and TLBs be shared? For sure, these are huge and there was massive increase in security related transistors to make them not prone to "Spectre" like meltdowns. Might as well make them even more massive and shared by two cores.
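For the PRF sizing point, the raw storage can be estimated directly from the entry counts and widths quoted above. A small Python sketch, assuming those round numbers (300 entries of 512 bits and 300 entries of 64 bits) and counting data bits only; the helper name `prf_bytes` is made up for the example:

```python
def prf_bytes(entries, bits_per_entry):
    """Raw data storage of a register file in bytes (no ports, tags, or ECC)."""
    return entries * bits_per_entry // 8


vector_bytes = prf_bytes(300, 512)   # 512-bit vector/FP registers
integer_bytes = prf_bytes(300, 64)   # 64-bit integer registers
total_bytes = vector_bytes + integer_bytes

print(vector_bytes, integer_bytes, total_bytes, round(total_bytes / 1024, 1))
```

The vector side accounts for most of the total in this estimate. Heavy multiporting would make the physical structure considerably larger than the raw bit count suggests.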

lightisgood · Aug 17, 2023

Intel's 144 Core Sierra Forest A0 Stepping Spotted: 108MB L3 Cache and 350W TDP | Hardware Times

The A0 stepping of Intel’s upcoming Xeon Sierra Forest processors has leaked out, courtesy of @yuuki_ans. It includes info regarding two chips: A 96-core and a 144-core CPU. Both have a TDP of 350W, the same as Sapphire Rapids and AMD’s Bergamo cloud offerings. On the memory side, the 96-core...

www.hardwaretimes.com

This is A0 stepping.
So, probably SRF will be up to 432 cores.

Markfw · Aug 17, 2023

lightisgood said:
Intel's 144 Core Sierra Forest A0 Stepping Spotted: 108MB L3 Cache and 350W TDP | Hardware Times

The A0 stepping of Intel’s upcoming Xeon Sierra Forest processors has leaked out, courtesy of @yuuki_ans. It includes info regarding two chips: A 96-core and a 144-core CPU. Both have a TDP of 350W, the same as Sapphire Rapids and AMD’s Bergamo cloud offerings. On the memory side, the 96-core...

www.hardwaretimes.com

This is A0 stepping.
So, probably SRF will be up to 432 cores.

No benchmarks at all? Nothing leaked?

Abwx · Aug 17, 2023

lightisgood said:
Intel's 144 Core Sierra Forest A0 Stepping Spotted: 108MB L3 Cache and 350W TDP | Hardware Times

The A0 stepping of Intel’s upcoming Xeon Sierra Forest processors has leaked out, courtesy of @yuuki_ans. It includes info regarding two chips: A 96-core and a 144-core CPU. Both have a TDP of 350W, the same as Sapphire Rapids and AMD’s Bergamo cloud offerings. On the memory side, the 96-core...

www.hardwaretimes.com

This is A0 stepping.
So, probably SRF will be up to 432 cores.

2 x 48 = 96
3 x 48 = 144

Henry swagger · Aug 17, 2023

lightisgood said:
Intel's 144 Core Sierra Forest A0 Stepping Spotted: 108MB L3 Cache and 350W TDP | Hardware Times

The A0 stepping of Intel’s upcoming Xeon Sierra Forest processors has leaked out, courtesy of @yuuki_ans. It includes info regarding two chips: A 96-core and a 144-core CPU. Both have a TDP of 350W, the same as Sapphire Rapids and AMD’s Bergamo cloud offerings. On the memory side, the 96-core...

www.hardwaretimes.com

This is A0 stepping.
So, probably SRF will be up to 432 cores.

Wonder what clock speed they'll run at? 🤔 ... maybe 4GHz all core 😁

A/// · Aug 17, 2023

pat should take up drinking.

lightisgood · Aug 18, 2023

Abwx said:
2 x 48 = 96
3 x 48 = 144

'3' x N = PRQ
SRF is not PRQ yet.

X x N = 144
Yuuki_AnS has repeatedly said that this is the A0 stepping (i.e. X = 1, therefore N = 144).
3 x 144 = 432

Consequently, SRF will be up to 432 cores.

Abwx · Aug 18, 2023

lightisgood said:
'3' x N = PRQ
SRF is not PRQ yet.

X x N = 144
Yuuki_AnS has repeatedly said that this is the A0 stepping (i.e. X = 1, therefore N = 144).
3 x 144 = 432

Consequently, SRF will be up to 432 cores.

350W is not enough for 432 cores; for 144 cores that's already only about 2.4W/core.
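The power budget argument is simple division. A quick Python check, assuming the whole 350W TDP is split evenly across cores (this ignores uncore, memory controller, and I/O power, so real per-core budgets would be lower); `watts_per_core` is just an illustrative helper:

```python
TDP_WATTS = 350  # TDP quoted in the leak for both the 96- and 144-core parts


def watts_per_core(tdp_watts, cores):
    """Naive even split of package TDP across all cores."""
    return tdp_watts / cores


for cores in (96, 144, 432):
    print(f"{cores} cores: {watts_per_core(TDP_WATTS, cores):.2f} W/core")
```

At the same TDP, a 432-core part would have a third of the per-core budget of the 144-core part.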
