
Discussion Intel current and future Lakes & Rapids thread


Hulk

Platinum Member
Oct 9, 1999
2,974
413
126
5GHz on all cores for the 11700K? Why bother with the 11900K if this is true?
 

Det0x

Senior member
Sep 11, 2014
497
588
136
Zen 3 is still a beast but that comparison seems to show a strong 5950X vs a weak 11700K. Of course we don't know how they are clocked, but here's a comparison that shows things as being a bit more equal.

Hmm, the lower score runs at the higher clock speed... something doesn't add up here.

4975 MHz = 1746 points

4888 MHz = 1806 points
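A quick way to see the mismatch is points per MHz, a crude per-clock efficiency proxy. A minimal Python sketch, using just the two numbers quoted:

```python
# Points per MHz for the two quoted Geekbench runs: a crude
# per-clock efficiency proxy (numbers taken from the post above).
runs = [(4975, 1746), (4888, 1806)]  # (clock in MHz, single-core score)
for mhz, score in runs:
    print(f"{mhz} MHz -> {score} pts = {score / mhz:.3f} pts/MHz")
```

If both runs were the same chip behaving consistently, the pts/MHz figures should be nearly identical; here the slower-clocked run comes out ahead per clock.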
 

Hulk

Platinum Member
Oct 9, 1999
2,974
413
126
Hmm, the lower score runs at the higher clock speed... something doesn't add up here.

4975 MHz = 1746 points

4888 MHz = 1806 points
Agreed. That's why we need real, vetted reviews. Geekbench scores are all over the place. When I compare using Geekbench I generally try to use non-K parts in Dell or other systems where I know overclocking is difficult, if not impossible, and/or not worth it to the buyer, since they are buying a Dell ;)
 

AMDK11

Member
Jul 15, 2019
40
28
51
If only he had run FlopsCPU v1.4 at a locked clock, which measures both single-thread IPC and multi-threading.
 

Hulk

Platinum Member
Oct 9, 1999
2,974
413
126
I realize I've asked this question before, but I really don't know why Intel spent all of that die space on the larger L2/L3 caches in Tiger Lake (Willow Cove). The result was an average IPC decrease compared to Sunny Cove. They could have saved a little die space if they had held it to 4 cores, and/or possibly added a 6-core part.

Perhaps when the 8-core parts come out the larger caches will be necessary to feed the cores?

I just can't seem to wrap my head around it, especially when they went back to Sunny Cove for Rocket Lake.
 

jpiniero

Diamond Member
Oct 1, 2010
9,063
1,747
126
I realize I've asked this question before, but I really don't know why Intel spent all of that die space on the larger L2/L3 caches in Tiger Lake (Willow Cove). The result was an average IPC decrease compared to Sunny Cove. They could have saved a little die space if they had held it to 4 cores, and/or possibly added a 6-core part.
I had thought that it somehow helps with the frequency gains they got. Also, IIRC Golden Cove's cache sizes are the same as Willow Cove's, so maybe it had mostly to do with that.
 

Hulk

Platinum Member
Oct 9, 1999
2,974
413
126
I had thought that it somehow helps with the frequency gains they got. Also, IIRC Golden Cove's cache sizes are the same as Willow Cove's, so maybe it had mostly to do with that.
Caches on a processor came into being when an imbalance arose between the speed of the processor and the data bus, namely to ensure the front end of the CPU has all of the data/instructions it requires to operate optimally.

Are you theorizing that when Sunny Cove gained approximately 1GHz moving to 10SF, Intel deemed the existing cache structure insufficient? That's a good theory.

I'm thinking it definitely has something to do with the slower memory subsystem on mobile devices, because for Rocket Lake they went with Sunny Cove.
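To make the trade-off concrete, here's a toy average-memory-access-time (AMAT) sketch. All latencies and miss rates below are illustrative placeholders, not Tiger Lake's real numbers:

```python
# Toy AMAT model: expected cycles per access, walking down the hierarchy.
# All latencies and miss rates here are illustrative, not measured values.
def amat(l1_hit, l1_miss, l2_hit, l2_miss, l3_hit, l3_miss, mem):
    return l1_hit + l1_miss * (l2_hit + l2_miss * (l3_hit + l3_miss * mem))

# A bigger but slower cache hierarchy can still win overall if the
# miss-rate drop outweighs the added hit latency.
smaller_faster = amat(4, 0.10, 13, 0.40, 34, 0.30, 200)
bigger_slower = amat(4, 0.10, 14, 0.35, 45, 0.20, 200)
print(smaller_faster, bigger_slower)
```

Whether the trade pays off depends entirely on a workload's miss rates, which may be why results swing both ways across reviews.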
 

jpiniero

Diamond Member
Oct 1, 2010
9,063
1,747
126
Are you theorizing that when Sunny Cove gained approximately 1GHz moving to 10SF, Intel deemed the existing cache structure insufficient? That's a good theory.
No, I was thinking that the cache changes were needed to get the frequency increase on 10 nm. That doesn't really make any sense, admittedly.

It may just be that the cache structure was intended to give better performance, but the L3 speed was butchered to get Tiger Lake out the door.
 

Cardyak

Member
Sep 12, 2018
45
73
61
I realize I've asked this question before, but I really don't know why Intel spent all of that die space on the larger L2/L3 caches in Tiger Lake (Willow Cove). The result was an average IPC decrease compared to Sunny Cove. They could have saved a little die space if they had held it to 4 cores, and/or possibly added a 6-core part.

Perhaps when the 8-core parts come out the larger caches will be necessary to feed the cores?

I just can't seem to wrap my head around it, especially when they went back to Sunny Cove for Rocket Lake.

It may just be that the cache structure was intended to give better performance, but the L3 speed was butchered to get Tiger Lake out the door.
I've been pondering this as well. It seems the L2 cache in Willow Cove got a very impressive upgrade (a 2.5x increase in size for only a 1-cycle increase in latency), but the L3 cache change was so poor it actually caused a performance regression in certain workloads.

I have several theories:

1. As jpiniero stated, Intel screwed up the L3 implementation and just rushed it out the door anyway (fairly unlikely).

2. The L3 cache change causes a performance regression in some workloads but an improvement in others, and the regressions are over-represented in AnandTech's review, which skews the entire IPC average calculation. Other benchmarks show a slight IPC gain for Willow Cove of around 4%-5% (somewhat plausible).

3. The L3 cache change was inevitable as future core designs scale up. Intel had to bite the bullet and alter the way the L3 cache is implemented at some point, so they decided to do it now. Maybe it's one of those things in design and engineering where you have to take a step back and regress in the short term so you can move forward 2 or 3 steps and gain more performance in the long term. (Personally, I think this is the most likely.)
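As a toy illustration of #2 (the subscore ratios below are invented, not real review data), a couple of large cache-sensitive regressions can drag a whole "average IPC" figure below break-even even when most workloads gain:

```python
# Toy illustration: how a few L3-sensitive regressions can skew an
# "average IPC" calculation. Ratios are made up, NOT real review data.
from statistics import geometric_mean

mostly_gains = [1.05, 1.04, 1.06, 1.05]          # typical workloads: small gains
with_regressions = mostly_gains + [0.85, 0.80]   # two cache-sensitive outliers

print(geometric_mean(mostly_gains))      # reads as roughly a 5% IPC gain
print(geometric_mean(with_regressions))  # the outliers pull the mean below 1.0
```

So which benchmarks end up in the suite can flip the headline number from "gain" to "regression" without any single result being wrong.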
 

Hulk

Platinum Member
Oct 9, 1999
2,974
413
126
I've been pondering this as well. It seems the L2 cache in Willow Cove got a very impressive upgrade (a 2.5x increase in size for only a 1-cycle increase in latency), but the L3 cache change was so poor it actually caused a performance regression in certain workloads.

I have several theories:

1. As jpiniero stated, Intel screwed up the L3 implementation and just rushed it out the door anyway (fairly unlikely).

2. The L3 cache change causes a performance regression in some workloads but an improvement in others, and the regressions are over-represented in AnandTech's review, which skews the entire IPC average calculation. Other benchmarks show a slight IPC gain for Willow Cove of around 4%-5% (somewhat plausible).

3. The L3 cache change was inevitable as future core designs scale up. Intel had to bite the bullet and alter the way the L3 cache is implemented at some point, so they decided to do it now. Maybe it's one of those things in design and engineering where you have to take a step back and regress in the short term so you can move forward 2 or 3 steps and gain more performance in the long term. (Personally, I think this is the most likely.)
Adding on to your #3: do you think that for 8-core mobile Tiger Lake the larger cache structures will really come into play, since the data/instruction requirements will be doubled? I assume we'll never know, because there will never be an 8-core Sunny Cove mobile part.
 

AMDK11

Member
Jul 15, 2019
40
28
51
Overall, looking at the changes in the Cypress Cove microarchitecture compared to Skylake, the average IPC increase of 18% should be in the bag, because Intel has not introduced changes this large between consecutive microarchitectures before, which is quite intriguing.

Sunny / Cypress Cove

5-wide instruction allocation (Skylake 4-wide, Haswell 4-wide, Sandy Bridge 4-wide, Nehalem 4-wide, Conroe (Core 2) 4-wide)
Reorder buffer (OoO ROB): 352 entries in flight (Skylake 224, Haswell 192, Sandy Bridge 168, Nehalem 128, Conroe (Core 2) 96)
Scheduler: 160 entries (Skylake 97, Broadwell 64, Haswell 60, Sandy Bridge 54, Nehalem 36, Conroe (Core 2) 32)
Register files: 280 integer + 224 FP entries (Skylake 180 + 168, Haswell 168 + 168, Sandy Bridge 160 + 144, Nehalem N/A, Conroe (Core 2) N/A)
10-wide dispatch from the scheduler to the execution-unit ports (Skylake 8-wide, Haswell 8-wide, Sandy Bridge 6-wide, Nehalem 6-wide, Conroe (Core 2) 6-wide)




x86 Skylake core: 217 million transistors
Front end
L1 instruction cache: 32KB, 8-way
µOP cache: 1536 entries
ITLB: 8 entries (2M pages)
Allocation queue (IDQ): 64 µOPs/thread, or 128 µOPs single-threaded
LSD can detect loops up to 64 µOPs/thread, or 128 µOPs single-threaded
5-way x86 decoder (1 complex, 4 simple)
Back end
4-wide instruction allocation
Reorder buffer (OoO ROB): 224 entries in flight
Scheduler: 97 entries
Register files: 180 integer + 168 FP entries
8-wide dispatch from the scheduler to the execution-unit ports
Execution engine
3x FP-ALU (arithmetic-logic/floating-point units; 2x 256-bit FMAC)
1x ALU (arithmetic logic unit)
1x store-data unit
3x AGU (2x load address, 1x store address)
Memory subsystem
In-flight loads: 72 entries (to L1D)
In-flight stores: 56 entries (to L1D)
L1 data cache: 32KB, 8-way
L2 cache: 256KB, 4-way

--------------------------------------------------------------------------------

x86 Cypress Cove core: 300 million transistors
Front end
L1 instruction cache: 32KB, 8-way
µOP cache: 2250 entries
Smarter prefetchers
Improved branch predictor
ITLB: 16 entries (2M pages, doubled)
Allocation queue (IDQ): 70 µOPs/thread, or 140 µOPs single-threaded
LSD can detect loops up to 70 µOPs/thread, or 140 µOPs single-threaded
5-way x86 decoder (1 complex, 4 simple)
Back end
5-wide instruction allocation
Reorder buffer (OoO ROB): 352 entries in flight
Scheduler: 160 entries
Register files: 280 integer + 224 FP entries
10-wide dispatch from the scheduler to the execution-unit ports
Execution engine
3x FP-ALU (arithmetic-logic/floating-point units; 1x 512-bit FMAC or 2x 256-bit FMAC; in fact 1x 512-bit FMAC + 1x 256-bit FMAC)
1x ALU (arithmetic logic unit)
2x store-data units
2x AGU (load address)
2x AGU (store address)
Memory subsystem
In-flight loads: 128 entries (to L1D)
In-flight stores: 72 entries (to L1D)
L1 data cache: 48KB, 12-way
L2 cache: 512KB, 8-way
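For a rough sense of scale, here's a small Python sketch computing the growth of the main out-of-order structures between the two lists above (entry counts as quoted; the percentages are just derived ratios):

```python
# Growth of the main out-of-order structures, Skylake -> Sunny/Cypress
# Cove, using the entry counts quoted in the post above.
structures = {
    "ROB": (224, 352),
    "Scheduler": (97, 160),
    "Int registers": (180, 280),
    "FP registers": (168, 224),
    "In-flight loads": (72, 128),
    "In-flight stores": (56, 72),
}
for name, (skylake, cove) in structures.items():
    growth = (cove / skylake - 1) * 100
    print(f"{name:16s} {skylake:4d} -> {cove:4d}  (+{growth:.0f}%)")
```

Every structure grows by roughly a third to three quarters, which is consistent with a wholesale widening of the core rather than a tweak.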
 

Hulk

Platinum Member
Oct 9, 1999
2,974
413
126
Overall, looking at the changes in the Cypress Cove microarchitecture compared to Skylake, the average IPC increase of 18% should be in the bag, because Intel has not introduced changes this large between consecutive microarchitectures before, which is quite intriguing.

Sunny / Cypress Cove

5-wide instruction allocation (Skylake 4-wide, Haswell 4-wide, Sandy Bridge 4-wide, Nehalem 4-wide, Conroe (Core 2) 4-wide)
Reorder buffer (OoO ROB): 352 entries in flight (Skylake 224, Haswell 192, Sandy Bridge 168, Nehalem 128, Conroe (Core 2) 96)
Scheduler: 160 entries (Skylake 97, Broadwell 64, Haswell 60, Sandy Bridge 54, Nehalem 36, Conroe (Core 2) 32)
Register files: 280 integer + 224 FP entries (Skylake 180 + 168, Haswell 168 + 168, Sandy Bridge 160 + 144, Nehalem N/A, Conroe (Core 2) N/A)
10-wide dispatch from the scheduler to the execution-unit ports (Skylake 8-wide, Haswell 8-wide, Sandy Bridge 6-wide, Nehalem 6-wide, Conroe (Core 2) 6-wide)




x86 Skylake core: 217 million transistors
Front end
L1 instruction cache: 32KB, 8-way
µOP cache: 1536 entries
ITLB: 8 entries (2M pages)
Allocation queue (IDQ): 64 µOPs/thread, or 128 µOPs single-threaded
LSD can detect loops up to 64 µOPs/thread, or 128 µOPs single-threaded
5-way x86 decoder (1 complex, 4 simple)
Back end
4-wide instruction allocation
Reorder buffer (OoO ROB): 224 entries in flight
Scheduler: 97 entries
Register files: 180 integer + 168 FP entries
8-wide dispatch from the scheduler to the execution-unit ports
Execution engine
3x FP-ALU (arithmetic-logic/floating-point units; 2x 256-bit FMAC)
1x ALU (arithmetic logic unit)
1x store-data unit
3x AGU (2x load address, 1x store address)
Memory subsystem
In-flight loads: 72 entries (to L1D)
In-flight stores: 56 entries (to L1D)
L1 data cache: 32KB, 8-way
L2 cache: 256KB, 4-way

--------------------------------------------------------------------------------

x86 Cypress Cove core: 300 million transistors
Front end
L1 instruction cache: 32KB, 8-way
µOP cache: 2250 entries
Smarter prefetchers
Improved branch predictor
ITLB: 16 entries (2M pages, doubled)
Allocation queue (IDQ): 70 µOPs/thread, or 140 µOPs single-threaded
LSD can detect loops up to 70 µOPs/thread, or 140 µOPs single-threaded
5-way x86 decoder (1 complex, 4 simple)
Back end
5-wide instruction allocation
Reorder buffer (OoO ROB): 352 entries in flight
Scheduler: 160 entries
Register files: 280 integer + 224 FP entries
10-wide dispatch from the scheduler to the execution-unit ports
Execution engine
3x FP-ALU (arithmetic-logic/floating-point units; 1x 512-bit FMAC or 2x 256-bit FMAC; in fact 1x 512-bit FMAC + 1x 256-bit FMAC)
1x ALU (arithmetic logic unit)
2x store-data units
2x AGU (load address)
2x AGU (store address)
Memory subsystem
In-flight loads: 128 entries (to L1D)
In-flight stores: 72 entries (to L1D)
L1 data cache: 48KB, 12-way
L2 cache: 512KB, 8-way
To me it looks like Intel has been alternating between front-end and back-end improvements.

Sandy Bridge seemed to focus on the front end: adding the µOP cache, improving the branch prediction unit, the ring bus, doubling the decode queue, a larger reorder buffer. There were back-end improvements too, but the focus seemed to be feeding the decoders. According to my "work rate" analysis of Anandtech testing (earlier in this thread), that gave Sandy an 11.8% boost over Nehalem.

Then I have a feeling Sandy didn't perform as well as their simulations suggested, so they found the bottlenecks and made some tweaks with Ivy Bridge, the first and only tick+, and got another 6.7%, or about 19% over Nehalem, which I think is what they were expecting from Sandy in the first place.

With all of this improvement to the front end, the back end was now the bottleneck, so with Haswell the big change was the execution ports going from 6 to 8. An 8.7% uptick for Haswell.

For Skylake the front end was again the bottleneck, so the big change was the addition of another simple decoder, plus larger structures to support it. An 8.9% improvement for Skylake.

With Sunny Cove the back end was again the bottleneck, but they learned from Haswell and not only increased the execution ports from 8 to 10, they also made sure those extra ports would be utilized by increasing the data and L2 caches, as well as the size of buffers, registers, loads, and stores: basically the Haswell and Skylake improvements again in one swoop. A 21% improvement, again calculated from Anandtech tests. If I had to guess, Sunny Cove is a very balanced front-to-back design, hence the large improvement over Skylake.

I think if Golden Cove is to achieve a similar throughput improvement, it's going to take a fairly large overhaul: another simple decoder, two more execution ports, the Willow Cove cache structure, DDR5 memory, and improvement/enlargement of most internal structures. Basically a wider, smarter Sunny Cove with inherently faster main-memory access. After Golden Cove, I don't know. How much more juice can they squeeze from this design, which dates back to Banias circa 2003? Or maybe they can continue to add decoders, ports, larger structures, and smarter OoO scheduling? Time will tell.
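Compounding those per-generation gains in Python (my own rough Anandtech-derived figures, so treat the totals as approximate):

```python
# Compounding the per-generation "work rate" estimates from above.
# Figures are rough Anandtech-derived numbers, not official IPC claims.
gains = [
    ("Sandy Bridge", 0.118),
    ("Ivy Bridge", 0.067),   # cumulative here lands around +19% vs Nehalem
    ("Haswell", 0.087),
    ("Skylake", 0.089),
    ("Sunny Cove", 0.21),
]
total = 1.0
for gen, g in gains:
    total *= 1 + g
    print(f"{gen:12s} +{g * 100:4.1f}%  cumulative vs Nehalem: +{(total - 1) * 100:.0f}%")
```

The gains multiply rather than add, which is why the cumulative figure ends up well past a simple sum of the steps.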

But back to the point of my post. After Conroe, for a while Intel didn't need to do more than what I would call "half steps," meaning improve the front or the back end massively, but not both at the same time. Doing both at once is what they did with Sunny Cove, due to the threat from AMD.

I expect Golden Cove to be a "full step" ahead.

I don't know a lot about Zen 3 or AMD architectures, but they seem to have taken the Core design philosophy and gone hog wild with it: 4 complex decoders and something like 14 or 16 execution ports? We'll know the "work rate" or "throughput" comparison for Sunny Cove vs Zen 3 in a few weeks with the release of Rocket Lake, but if they are comparable, then I think Intel has probably achieved the same result as AMD with a slightly smaller yet smarter design, and importantly, possibly with more room for improvement going forward. It's all super interesting to follow. What a great race to watch.
 

Carfax83

Diamond Member
Nov 1, 2010
6,051
850
126
I expect Golden Cove to be a "full step" ahead.
What do you think about the merits of Intel's monolithic design approach vs AMD's chiplet-based design philosophy?

Honestly, I still prefer monolithic designs, as they seem more efficient overall. Intel has made some really excellent IMCs that simply have much better memory performance than AMD's, especially in terms of latency.

It was actually shocking that Skylake could still compete somewhat favorably with Zen 2, despite being a much older microarchitecture. Of course, Zen 3 decisively put an end to that.

With Golden Cove, though, Intel looks like it will finally make a comeback. I'm still disappointed that they seemingly abandoned AVX-512 with Golden Cove. Or perhaps they will bring back a HEDT line that has full-sized Golden Cove cores all around and none of that big-little crap.

A Golden Cove based 10-core CPU with quad-channel DDR5 would be right up my alley :cool:
 

AMDK11

Member
Jul 15, 2019
40
28
51
x86 Cypress Cove core: 300 million transistors
Front end
L1 instruction cache: 32KB, 8-way
µOP cache: 2250 entries
Smarter prefetchers
Improved branch predictor
ITLB: 16 entries (2M pages, doubled)
Allocation queue (IDQ): 70 µOPs/thread, or 140 µOPs single-threaded
LSD can detect loops up to 70 µOPs/thread, or 140 µOPs single-threaded
5-way x86 decoder (1 complex, 4 simple)
Back end
5-wide instruction allocation
Reorder buffer (OoO ROB): 352 entries in flight
Scheduler: 160 entries
Register files: 280 integer + 224 FP entries
10-wide dispatch from the scheduler to the execution-unit ports
Execution engine
3x FP-ALU (arithmetic-logic/floating-point units; 1x 512-bit FMAC or 2x 256-bit FMAC; in fact 1x 512-bit FMAC + 1x 256-bit FMAC)
1x ALU (arithmetic logic unit)
2x store-data units
2x AGU (load address)
2x AGU (store address)
Memory subsystem
In-flight loads: 128 entries (to L1D)
In-flight stores: 72 entries (to L1D)
L1 data cache: 48KB, 12-way
L2 cache: 512KB, 8-way

--------------------------------------------------------------------------------

x86 Zen 3 core
Front end
L1 instruction cache: 32KB, 8-way
µOP cache: 4096 entries
4-way x86 decoder (4 complex)
Back end
Reorder buffer (OoO ROB): 256 entries in flight
Integer scheduler: 96 entries
FP scheduler: 64 entries
Integer register file: 192 entries
FP register file: 160 entries
10-wide integer dispatch from the scheduler to the execution-unit ports
3-wide FP dispatch
Execution engine
6x FPU (2x 256-bit FMAC floating-point units)
4x ALU (arithmetic logic units)
2x store-data units
1x dedicated branch unit
3x AGU (3x load address or 2x store address)
Memory subsystem
In-flight loads: 72 entries (to L1D)
In-flight stores: 64 entries (to L1D)
L1 data cache: 32KB, 8-way
L2 cache: 512KB, 8-way
 
