Discussion Intel current and future Lakes & Rapids thread

jpiniero · Feb 27, 2021

uzzi38 said:
MindFactory is selling 11700Ks for 479 Euros, and people are receiving them now.

Reviewers are going to be pissed if people buy it and post benchmarks. Although if those prices are legit, Rocket Lake is not going to be all that interesting.

uzzi38 · Feb 27, 2021

Oh and here's some real TGL-H. Boost and base clocks there. Seems like no TVB nor TB3.0 at all any more.

https://twitter.com/x/status/1365657163461390337

coercitiv · Feb 27, 2021

The first benchmarks run on a recently bought 11700K are in on Hardwareluxx:

https://twitter.com/x/status/1365749432801763328

moonbogg · Feb 27, 2021

coercitiv said:
The first benchmarks run on a recently bought 11700K are in on Hardwareluxx:

https://twitter.com/x/status/1365749432801763328

Hmm. I feel like I should say something more, but I'm struggling here.

coercitiv · Feb 27, 2021

moonbogg said:
Hmm. I feel like I should say something more, but I'm struggling here.

Are you under NDA as well? 😛

Hulk · Feb 27, 2021

5GHz on all cores for the 11700K? Why bother with the 11900K if this is true?

moonbogg · Feb 27, 2021

Hulk said:
5GHz on all cores for the 11700K? Why bother with the 11900K if this is true?

Because for 50% more money you get an easy 5.1 GHz on all cores (or something).

Det0x · Feb 27, 2021

Geekbench 5 compared to my 5950x

Generic vs ASUS System Product Name - Geekbench

browser.geekbench.com

Rocket Lake pretty much only wins in AES-XTS compared to Zen 3

Hulk · Feb 27, 2021

moonbogg said:
Because for 50% more money you get an easy 5.1 GHz on all cores (or something).

Ah yes. Thanks for reminding me. Don't know how I missed that. Gonna grab me an 11900K the second they are available!

Hulk · Feb 27, 2021

Det0x said:
Geekbench 5 compared to my 5950x

Generic vs ASUS System Product Name - Geekbench

browser.geekbench.com

Rocket Lake pretty much only wins in AES-XTS compared to Zen 3

Zen 3 is still a beast but that comparison seems to show a strong 5950X vs a weak 11700K. Of course we don't know how they are clocked, but here's a comparison that shows things as being a bit more equal.

ASUS System Product Name vs Gigabyte Technology Co., Ltd. B550 AORUS PRO AX - Geekbench

browser.geekbench.com

Det0x · Feb 27, 2021

Hulk said:
Zen 3 is still a beast but that comparison seems to show a strong 5950X vs a weak 11700K. Of course we don't know how they are clocked, but here's a comparison that shows things as being a bit more equal.

ASUS System Product Name vs Gigabyte Technology Co., Ltd. B550 AORUS PRO AX - Geekbench

browser.geekbench.com

Hmm, the lower score comes from the run at the higher clock speed. Something doesn't add up here.

4975 MHz = 1746 points

https://browser.geekbench.com/v5/cpu/6718899.gb5

4888 MHz = 1806 points

https://browser.geekbench.com/v5/cpu/6664831.gb5
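
To put a number on the mismatch, the two results can be normalised by their reported clocks. This is only a rough sketch: it assumes the clock Geekbench reports is the clock the run actually sustained, which may not be true, and the dictionary keys and the `points_per_ghz` helper are names made up for illustration.

```python
# Geekbench 5 results quoted above, keyed by result ID:
# (reported clock in MHz, score in points).
runs = {
    "6718899": (4975, 1746),
    "6664831": (4888, 1806),
}


def points_per_ghz(clock_mhz: float, score: float) -> float:
    """Score divided by the reported clock, in points per GHz."""
    return score / (clock_mhz / 1000.0)


for run_id, (clock_mhz, score) in runs.items():
    ratio = points_per_ghz(clock_mhz, score)
    print(f"run {run_id}: {clock_mhz} MHz, {score} points, {ratio:.1f} points/GHz")
```

If the reported clocks were reliable, both runs of the same chip would land on roughly the same points/GHz figure, so a visible gap suggests that the reported clock, memory setup, or background load differed between the two runs.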

Hulk · Feb 27, 2021

Det0x said:
Hmm, the lower score comes from the run at the higher clock speed. Something doesn't add up here.

4975 MHz = 1746 points

https://browser.geekbench.com/v5/cpu/6718899.gb5

4888 MHz = 1806 points

https://browser.geekbench.com/v5/cpu/6664831.gb5

Agreed. That's why we need real, vetted reviews. Geekbench scores are all over the place. When I compare using Geekbench I generally try to use non-K parts in Dell or other systems where I know overclocking is difficult, not possible, and/or not worth it to the buyer since they are buying a Dell 😉

jpiniero · Feb 27, 2021

https://www.notebookcheck.net/VAIO-Z-Core-i7-11375H-Review-The-Laptop-for-CEOs-and-Executives.524276.0.html

Review of a Vaio laptop with the 11375H. PL2 is 65 W. Laptop itself isn't great but the single thread scores are very good.

AMDK11 · Feb 27, 2021

If only he had run FlopsCPU v1.4, which measures both single-thread IPC and multi-threaded throughput, at a locked clock.

jpiniero · Feb 27, 2021

coercitiv said:
The first benchmarks run on a recently bought 11700K are in on Hardwareluxx:

The 11700K used 192 W at 4.6 GHz (locked all-core turbo) on all cores in Cinebench R23. That's AVX2, I believe, and not AVX-512. At 125 W it dropped to 4 GHz.
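
Those two data points give a rough idea of how steeply power climbs with clock on this chip. The sketch below assumes power follows a simple power law, P ∝ f^k, between the two points, which is a simplification (package power also includes uncore and leakage, and voltage is not known here). The `scaling_exponent` helper is a made-up name for illustration.

```python
import math


def scaling_exponent(p_hi: float, f_hi: float, p_lo: float, f_lo: float) -> float:
    """Exponent k in P ~ f**k implied by two (power, frequency) points."""
    return math.log(p_hi / p_lo) / math.log(f_hi / f_lo)


# Figures from the Hardwareluxx numbers quoted above (Cinebench R23, all cores).
exponent = scaling_exponent(p_hi=192.0, f_hi=4.6, p_lo=125.0, f_lo=4.0)
print(f"implied exponent: {exponent:.2f}")
print(f"W per GHz at 4.6 GHz: {192.0 / 4.6:.1f}")
print(f"W per GHz at 4.0 GHz: {125.0 / 4.0:.1f}")
```

With only two points this is an estimate at best, but an exponent near 3 would be consistent with the usual rule of thumb that voltage has to rise along with frequency at the top of the curve.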

eek2121 · Feb 27, 2021

The 11980HK looks like quite a part. 8 cores, 16 threads, 2.6/5.0 GHz, 45 W TDP.

Hulk · Feb 27, 2021

I realize I've asked this question before here but I really don't know why Intel spent all of that die space for the larger L2/L3 caches in Tiger Lake (Willow Cove)? The result was an average IPC decrease compared to Sunny Cove. They could have saved a little die space if they held it to 4 cores and/or possibly added a 6 core part.

Perhaps when the 8 core parts come out the larger caches will be necessary to feed the cores?

I just can't seem to wrap my head around it especially when they go back to Sunny Cove for Rocket Lake.

jpiniero · Feb 28, 2021

Hulk said:
I realize I've asked this question before here but I really don't know why Intel spent all of that die space for the larger L2/L3 caches in Tiger Lake (Willow Cove)? The result was an average IPC decrease compared to Sunny Cove. They could have saved a little die space if they held it to 4 cores and/or possibly added a 6 core part.

I had thought that somehow it helps with the frequency gains they got. Also IIRC Golden Cove's cache sizes are the same as Willow Cove so maybe that had mostly to do with it.

Hulk · Feb 28, 2021

jpiniero said:
I had thought that somehow it helps with the frequency gains they got. Also IIRC Golden Cove's cache sizes are the same as Willow Cove so maybe that had mostly to do with it.

Caches on a processor came into being when there was an imbalance between the speed of the processor and the data bus; their job is to ensure the front end of the CPU has all of the data/instructions it requires to operate optimally.

Are you theorizing that when Sunny Cove gained approximately 1 GHz moving to 10SF, Intel deemed the existing cache structure insufficient? That's a good theory.

I'm thinking it definitely has something to do with the slower memory subsystem on mobile devices because for Rocket Lake they went with Sunny Cove.

jpiniero · Feb 28, 2021

Hulk said:
Are you theorizing that when Sunny Cove gained approximately 1 GHz moving to 10SF, Intel deemed the existing cache structure insufficient? That's a good theory.

No, I was thinking that the cache changes were needed to get the frequency increase on 10 nm. That doesn't really make any sense, admittedly.

It may just be that the cache structure was intended to give better performance but the L3 speed was butchered to get Tiger Lake out of the door.

Cardyak · Feb 28, 2021

Hulk said:
I realize I've asked this question before here but I really don't know why Intel spent all of that die space for the larger L2/L3 caches in Tiger Lake (Willow Cove)? The result was an average IPC decrease compared to Sunny Cove. They could have saved a little die space if they held it to 4 cores and/or possibly added a 6 core part.

Perhaps when the 8 core parts come out the larger caches will be necessary to feed the cores?

I just can't seem to wrap my head around it especially when they go back to Sunny Cove for Rocket Lake.

jpiniero said:
It may just be that the cache structure was intended to give better performance but the L3 speed was butchered to get Tiger Lake out of the door.

I’ve been pondering this as well. It seems that the L2 cache in Willow Cove had a very impressive upgrade (a 2.5x increase in size for only a 1 cycle increase in latency), but the L3 cache change was so poor that it actually caused a performance regression in certain workloads.

I have several theories such as:

1. As jpiniero stated, Intel screwed up the L3 implementation and just rushed it out the door anyway (fairly unlikely)

2. The L3 cache change causes a performance regression in some workloads but an improvement in others, and the regressions are over-represented in AnandTech's review, which skews the entire IPC average calculation. Other benchmarks show a slight IPC gain for Willow Cove of around 4%-5% (somewhat plausible)

3. The L3 cache change was inevitable as future core designs scale up. Intel had to bite the bullet and alter the way the L3 cache was implemented at some point in the near future, so they decided to do it now. Maybe it’s one of those things in design and engineering where you have to go back a step and regress in the short term so you can move forward 2 or 3 steps in the future and gain more performance in the long term. (Personally I think this is the most likely)

Hulk · Feb 28, 2021

Cardyak said:
I’ve been pondering this as well. It seems that the L2 cache in Willow Cove had a very impressive upgrade (a 2.5x increase in size for only a 1 cycle increase in latency), but the L3 cache change was so poor that it actually caused a performance regression in certain workloads.

I have several theories such as:

1. As jpiniero stated, Intel screwed up the L3 implementation and just rushed it out the door anyway (fairly unlikely)

2. The L3 cache change causes a performance regression in some workloads but an improvement in others, and the regressions are over-represented in AnandTech's review, which skews the entire IPC average calculation. Other benchmarks show a slight IPC gain for Willow Cove of around 4%-5% (somewhat plausible)

3. The L3 cache change was inevitable as future core designs scale up. Intel had to bite the bullet and alter the way the L3 cache was implemented at some point in the near future, so they decided to do it now. Maybe it’s one of those things in design and engineering where you have to go back a step and regress in the short term so you can move forward 2 or 3 steps in the future and gain more performance in the long term. (Personally I think this is the most likely)

Adding on to your #3 do you think that for 8 core mobile Tiger Lake the larger cache structures will really come into play since the data/instruction requirements will be doubled? We'll never know because there will never be a Sunny Cove 8 core mobile part I assume?

uzzi38 · Feb 28, 2021

This is accurate as far as TDP for SPR goes: davidbepo (5 GHz overload) on Twitter: "SPR ⚡🔥 is 350W first VERY early number for genoa is Genoa 3,9 (320W)" / Twitter

Don't mind the Bepo maths on performance ratings.

AMDK11 · Feb 28, 2021

Overall, looking at the changes in the Cypress Cove microarchitecture compared to Skylake, the average IPC increase of 18% should be in the bank, because Intel has not introduced changes this large between any previous pair of microarchitectures, which is quite intriguing.

Sunny Cove / Cypress Cove

Allocation 5-wide (Skylake 4-wide, Haswell 4-wide, Sandy Bridge 4-wide, Nehalem 4-wide, Conroe (Core 2) 4-wide)
Reorder buffer (ROB) 352 entries in flight (Skylake 224, Haswell 192, Sandy Bridge 168, Nehalem 128, Conroe (Core 2) 96)
Scheduler 160 entries (Skylake 97, Broadwell 64, Haswell 60, Sandy Bridge 54, Nehalem 36, Conroe (Core 2) 32)
Register files - integer 280 entries + FP 224 entries (Skylake 180 + 168, Haswell 168 + 168, Sandy Bridge 160 + 144, Nehalem N/A, Conroe (Core 2) N/A)
Dispatch 10-wide (from scheduler to execution unit ports) (Skylake 8-wide, Haswell 8-wide, Sandy Bridge 6-wide, Nehalem 6-wide, Conroe (Core 2) 6-wide)

x86 Skylake core, 217 million transistors
Front-end
L1 instruction cache 32KB 8-way
µOP cache of 1536 entries
ITLB 8 entries (2M)
Allocation Queue (IDQ) 64 µOP / thread or 128 µOP single thread
LSD can detect loops of up to 64 µOP / thread or 128 µOP single thread
5-way x86 decoder (1 complex, 4 simple)
Back-end
Allocation 4-wide
Reorder buffer (ROB) 224 entries in flight
Scheduler 97 entries
Register files - integer 180 entries + FP 168 entries
Dispatch 8-wide (from scheduler to execution unit ports)
Execution Engine
3x FP-ALU (floating-point arithmetic logic units (2x FMAC 256-bit))
1x ALU (arithmetic logic unit)
1x store data
3x AGU (2x load address, 1x store address)
Memory subsystem
In-flight loads 72 entries (loads in flight from L1D)
In-flight stores 56 entries (stores in flight to L1D)
L1 data cache 32KB 8-way
L2 cache 256KB 4-way

--------------------------------------------------

x86 Cypress Cove core, 300 million transistors
Front-end
L1 instruction cache 32KB 8-way
µOP cache of 2.25K (2304) entries
Smarter prefetchers
Improved branch predictor
ITLB 16 entries (2M, doubled)
Allocation Queue (IDQ) 70 µOP / thread or 140 µOP single thread
LSD can detect loops of up to 70 µOP / thread or 140 µOP single thread
5-way x86 decoder (1 complex, 4 simple)
Back-end
Allocation 5-wide
Reorder buffer (ROB) 352 entries in flight
Scheduler 160 entries
Register files - integer 280 entries + FP 224 entries
Dispatch 10-wide (from scheduler to execution unit ports)
Execution Engine
3x FP-ALU (floating-point arithmetic logic units (1x FMAC 512-bit or 2x FMAC 256-bit)) (in fact it is 1x FMAC 512-bit + 1x FMAC 256-bit)
1x ALU (arithmetic logic unit)
2x store data
2x AGU (load address)
2x AGU (store address)
Memory subsystem
In-flight loads 128 entries (loads in flight from L1D)
In-flight stores 72 entries (stores in flight to L1D)
L1 data cache 48KB 12-way
L2 cache 512KB 8-way
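
For a quick sense of scale, the Skylake and Cypress Cove figures listed above can be turned into growth percentages. This is a minimal sketch that uses only the numbers from the two lists; the `structures` table and the `growth_percent` helper are names made up for illustration, and a bigger structure does not translate one-to-one into IPC.

```python
# (Skylake, Cypress Cove) sizes taken from the two lists above.
structures = {
    "allocation width": (4, 5),
    "ROB entries": (224, 352),
    "scheduler entries": (97, 160),
    "integer registers": (180, 280),
    "FP registers": (168, 224),
    "dispatch ports": (8, 10),
    "in-flight loads": (72, 128),
    "in-flight stores": (56, 72),
    "L1D cache (KB)": (32, 48),
    "L2 cache (KB)": (256, 512),
}


def growth_percent(old: float, new: float) -> float:
    """Relative growth from old to new, in percent."""
    return (new / old - 1.0) * 100.0


for name, (skylake, cypress_cove) in structures.items():
    pct = growth_percent(skylake, cypress_cove)
    print(f"{name:<20} {skylake:>4} -> {cypress_cove:>4}  (+{pct:.0f}%)")
```

Most of the out-of-order structures grow by far more than the allocation and dispatch widths do, which fits the point about how large this step is compared with earlier generations.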

Hulk · Mar 1, 2021

AMDK11 said:
Overall, looking at the changes in the Cypress Cove microarchitecture compared to Skylake, the average IPC increase of 18% should be in the bank, because Intel has not introduced changes this large between any previous pair of microarchitectures, which is quite intriguing.

To me it looks like Intel has been flip-flopping between front and back end improvements.

Sandy Bridge seemed to focus on the front end: adding a uop cache, improving the branch predictor unit, the ring bus, doubling the decode queue, a larger reorder buffer. There were back end improvements but the focus seemed to be feeding the decoders. According to my "work rate" analysis from Anandtech testing (earlier in this thread) that gave Sandy an 11.8% boost over Nehalem.

Then I have a feeling Sandy didn't perform as well as their simulations suggested and they went ahead and found the bottlenecks and made some tweaks with Ivy Bridge, the first and only tick+, and got another 6.7%, or about 19% over Nehalem, which I think is what they were expecting over Nehalem with Sandy in the first place.

With all of this improvement to the front end the back end was now the bottleneck so with Haswell the big change was the execution ports going from 6 to 8. 8.7% uptick for Haswell.

For Skylake the front end was again the bottleneck, so the big change was the addition of another simple decoder and other larger structures to support it. 8.9% improvement for Skylake.

With Sunny Cove the back end was again the bottleneck, but they learned from Haswell and not only increased the execution ports from 8 to 10, they also made sure those extra ports would be utilized by increasing the data and L2 caches, as well as increasing the size of buffers, registers, loads, stores; basically the Haswell and Skylake improvements again in one swoop. 21% improvement, again calculated from Anandtech tests. If I had to guess, Sunny Cove is a very balanced design from front end to back end, hence the large improvement from Skylake.
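
As a quick sanity check on the percentages above, the per-generation gains can be compounded to see where each core lands relative to Nehalem. This is only a sketch: the inputs are the "work rate" estimates quoted in this post, not measured IPC, and the `cumulative_gains` helper is a made-up name for illustration.

```python
# Per-generation "work rate" gains quoted above, in percent over the previous core.
gains = [
    ("Sandy Bridge", 11.8),
    ("Ivy Bridge", 6.7),
    ("Haswell", 8.7),
    ("Skylake", 8.9),
    ("Sunny Cove", 21.0),
]


def cumulative_gains(steps):
    """Compound the per-step gains; returns (name, total percent over baseline)."""
    total = 1.0
    result = []
    for name, pct in steps:
        total *= 1.0 + pct / 100.0
        result.append((name, (total - 1.0) * 100.0))
    return result


for name, pct_over_nehalem in cumulative_gains(gains):
    print(f"{name:<13} +{pct_over_nehalem:.1f}% over Nehalem")
```

Compounding rather than adding is what turns 11.8% plus 6.7% into the "about 19%" figure for Ivy Bridge, and on these estimates the chain ends with Sunny Cove at roughly 71% over Nehalem.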

I think if Golden Cove is to achieve a similar throughput improvement, it's going to take a fairly large overhaul. Another simple decoder, two more execution ports, the Willow Cove cache structure, DDR5 memory, and improvement/enlargement of most internal structures. Basically a wider, smarter Sunny Cove with inherently faster main memory access. After Golden Cove I don't know. How much more juice can they squeeze from this design, which dates back to Banias circa 2003? Or maybe they can continue to add decoders, ports, larger structures, and smarter OoO scheduling?
Time will tell.

But back to the point of my post. After Conroe, for a while Intel didn't need to do more than what I would call "half steps": improving the front or back end massively, but not both at the same time, which is what they did with Sunny Cove due to the threat from AMD.

I expect Golden Cove to be a "full step" ahead.

I don't know a lot about Zen 3 or AMD architectures but they seem to have taken the Core design philosophy and gone hog wild with it. 4 complex decoders and like 14 or 16 or something execution ports? We'll know the "work rate" or "throughput" comparison for Sunny Cove vs Zen 3 in a few weeks with the release of Rocket Lake, but if they are comparable then I think Intel probably has achieved the same result as AMD with a slightly smaller yet smarter design. And importantly, possibly with more room for improvement going forward. It's all super interesting to follow. What a great race to watch.
