Discussion Intel current and future Lakes & Rapids thread


IntelUser2000

Elite Member
Oct 14, 2003
8,271
3,174
136
You need hybrid bonding to get close to monolithic. Foveros has a smaller bump pitch than EMIB, which is probably why it's used here, but it's still not quite monolithic.
So Intel says on-die fabrics need 0.1pJ/bit, while Foveros requires 0.3pJ/bit. That's still a fair bit higher, but organic interposers are over 1pJ/bit, and EMIB is only slightly below 1pJ/bit.
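For a sense of scale, link power is just energy per bit times bit rate. A quick sketch using the pJ/bit figures above (the 100 GB/s bandwidth is an assumed illustrative number, not a spec for any of these links):

```python
# Rough die-to-die link power from the energy-per-bit figures quoted above.
# The pJ/bit values come from the post; 100 GB/s is an assumed bandwidth.

def link_power_w(pj_per_bit: float, gb_per_s: float) -> float:
    """Power (watts) = energy per bit * bits per second."""
    bits_per_s = gb_per_s * 8e9  # GB/s -> bits/s
    return pj_per_bit * 1e-12 * bits_per_s

for name, energy in [("on-die fabric", 0.1), ("Foveros", 0.3),
                     ("EMIB", 0.9), ("organic interposer", 1.2)]:
    print(f"{name:18s} ~{link_power_w(energy, 100):.2f} W at 100 GB/s")
```

At 100 GB/s that works out to roughly 0.08 W on-die vs. 0.24 W over Foveros vs. close to 1 W over the package-level options, which is why the interconnect choice matters for the idle floor.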

When you are talking about datapaths for I/O, they cannot be shut down as readily as the individual blocks can, because they need to meet QoS and bandwidth/latency requirements, and because you need to serve every block that uses the bus. If you have 50 blocks on a bus but only one is active, you still need to have the whole bus online.

It seems that at a higher level you can use techniques such as power gating to lower power use, but they don't use it for, say, the decoders, as that's too latency critical.

So theoretically it should offer both faster on/off transitions for the chipset, actually allowing it to reach much lower idle power (and/or reach it more often), and also a lower absolute floor for the power the communication link uses. As it stands, the ~1.5W or so TDP of the on-package chipset is often reached even under very light load. Yes, you can get it really low at idle, but it's very easy to knock it out of that state.

That is theory, of course. Atom didn't benefit from things being on-die until Silvermont, when they updated to a proper on-die bus.

Unfortunately, I think you'll be disappointed, but I hope my pessimism ends up being incorrect.
This is an opportunity. I think there's a chance for them to improve this greatly, though they're still playing catch-up at this point.

Also, good to know davidbepo is a crock.
 
Last edited:

IntelUser2000

Elite Member
Oct 14, 2003
8,271
3,174
136
If my estimate of the die size is roughly correct, the total area (with empty space) is around 178 mm^2.
So the top left should be Intel4 8c+16c compute part. The tile below would be N4 IO.
The center part would be N3 iGPU but it has space for 384 EU so it either is:
2 media engine + 384 EU
or
2 media engine + 192 EU + 6 tensor processing core (TPC) or similar
The lean tile on the right would be the N5 SoC.
No, that's wrong. Many people have explained this over many pages.

Top left is 2+8. Bottom tiny die is GPU. Center large tile is SoC or basically what's called PCH in current terms. Right is I/O.

You are probably following the twitter feed of someone like Wildcracks, who said the same thing about the center being the iGPU rather than the SoC, but we discussed this a while ago. He's wrong, though it was a reasonable expectation.

And the numbers you are using aren't even from him; they're pure imagination. Arrowlake is 384EU, not Meteorlake.
 
Last edited:

Exist50

Senior member
Aug 18, 2016
910
883
136
It seems that at a higher level you can use techniques such as power gating to lower power use, but they don't use it for, say, the decoders, as that's too latency critical.

So theoretically it should offer both faster on/off transitions for the chipset, actually allowing it to reach much lower idle power (and/or reach it more often), and also a lower absolute floor for the power the communication link uses. As it stands, the ~1.5W or so TDP of the on-package chipset is often reached even under very light load. Yes, you can get it really low at idle, but it's very easy to knock it out of that state.

That is theory, of course. Atom didn't benefit from things being on-die until Silvermont, when they updated to a proper on-die bus.
Oh, there are plenty of theoretical advantages, don't get me wrong, but I'm not convinced any will actually be realized with Meteor Lake, as rushed as it seems to be.

Top left is 2+8. Bottom tiny die is GPU. Center large tile is SoC or basically what's called PCH in current terms. Right is I/O.
You swapped the GPU and IO.
 

nicalandia

Golden Member
Jan 10, 2019
1,623
1,911
106

JoeRambo

Golden Member
Jun 13, 2013
1,603
1,712
136
This one graph from the same source is very interesting:


While the latency improvements in the ~1-4MB range are obviously due to the 2MB of L2 (remarkably, it is not slower than the 1MB cache in the <1MB range), Intel seems to have remembered their premier cache skills, and the 8-32MB range is looking improved as well (even more so considering that graph is log scale: very sizeable improvements in cycles for what is also a larger cache with more stops on the ring due to two more clusters of small cores).
Intel seems hell-bent on feeding their very wide and powerful core, and combined with faster DDR5 support it will perform substantially better than ADL.
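Latency curves like that are typically produced with a pointer-chasing microbenchmark, where each load depends on the previous one so prefetchers can't hide the latency. A minimal sketch of the idea (Python for readability; a real measurement would use C and raw arrays so interpreter overhead doesn't dominate the numbers):

```python
import random
import time

def pointer_chase_ns(n_elems: int, iters: int = 200_000) -> float:
    """Average time (ns) per dependent load over a random cyclic chain."""
    order = list(range(n_elems))
    random.shuffle(order)
    # Link the shuffled elements into one cycle so every slot is visited
    # in a pattern the hardware prefetcher can't predict.
    chain = [0] * n_elems
    for i in range(n_elems):
        chain[order[i]] = order[(i + 1) % n_elems]
    idx = 0
    t0 = time.perf_counter()
    for _ in range(iters):
        idx = chain[idx]  # each access depends on the previous result
    return (time.perf_counter() - t0) / iters * 1e9

# Sweeping n_elems from a few KB worth of entries up past the L3 size
# traces out the latency steps at each cache level, i.e. the graph above.
```

The log-scale x-axis in such graphs is exactly the swept working-set size.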
 

Carfax83

Diamond Member
Nov 1, 2010
6,189
979
126
While the latency improvements in the ~1-4MB range are obviously due to the 2MB of L2 (remarkably, it is not slower than the 1MB cache in the <1MB range), Intel seems to have remembered their premier cache skills, and the 8-32MB range is looking improved as well (even more so considering that graph is log scale: very sizeable improvements in cycles for what is also a larger cache with more stops on the ring due to two more clusters of small cores).
Intel seems hell-bent on feeding their very wide and powerful core, and combined with faster DDR5 support it will perform substantially better than ADL.
Looks like your prediction a while back is going to be accurate. I think a lot of people are underestimating the performance of Raptor Lake. At this point, I am almost certain Raptor Lake will have higher single-thread performance than Zen 4, and it will also be highly competitive in multithreaded apps, though Zen 4 will be stronger there overall.

Raptor Lake should also take the gaming crown.
 

Carfax83

Diamond Member
Nov 1, 2010
6,189
979
126
I predict Raptor Lake will be slower than the 5800X3D in games that like cache.
Well, cache is getting significantly increased in Raptor Lake. The 5800X3D is barely faster than the 12900K when the latter is paired with high-speed DDR5, which compensates somewhat for the lack of cache.

But what you stated could also be the case for Zen 4 as well. There's a high chance that Zen 4 will be slower than the 5800X3D in many games.
 

nicalandia

Golden Member
Jan 10, 2019
1,623
1,911
106
Well, cache is getting significantly increased in Raptor Lake. The 5800X3D is barely faster than the 12900K when the latter is paired with high-speed DDR5, which compensates somewhat for the lack of cache.

But what you stated could also be the case for Zen 4 as well. There's a high chance that Zen 4 will be slower than the 5800X3D in many games.
You guys are banking on that additional 60% increase in L2$, which gives an additional 6 MiB of L2$. The L3 remains the same. Zen 4 will have a 100% increase in L2, and its L3 will also remain the same.

Sapphire Rapids should shed some light on the gaming performance of Raptor Lake, because it also has 2 MiB of L2 per performance core.
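For reference, the 60% and 6 MiB figures are simple arithmetic, assuming 8 P-cores going from 1.25 MiB of L2 each on Alder Lake to 2 MiB each on Raptor Lake (the commonly reported sizes):

```python
# Per-P-core L2 sizes (MiB) for Alder Lake vs Raptor Lake, 8 P-cores.
# The 1.25/2.0 MiB sizes are the commonly reported figures, not official specs.
adl_l2, rpl_l2, p_cores = 1.25, 2.0, 8

extra_l2_mib = (rpl_l2 - adl_l2) * p_cores       # additional L2 across P-cores
growth_pct = round((rpl_l2 / adl_l2 - 1) * 100)  # per-core size increase

print(extra_l2_mib, growth_pct)  # prints "6.0 60": 6 MiB extra, 60% per core
```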
 
Last edited:
  • Like
Reactions: Tlh97 and AAbattery

Carfax83

Diamond Member
Nov 1, 2010
6,189
979
126
You guys are banking on that additional 60% increase in L2$, which gives an additional 6 MiB of L2$. The L3 remains the same. Zen 4 will have a 100% increase in L2, and its L3 will also remain the same.
L3 is going from 30MB in Alder Lake to 36MB in Raptor Lake, and it looks like it will shave some cycles off the access time. The memory controller will also be better.
 

nicalandia

Golden Member
Jan 10, 2019
1,623
1,911
106
L3 is going from 30MB in Alder Lake to 36MB in Raptor Lake, and it looks like it will shave some cycles off the access time. The memory controller will also be better.
No, that's just the 13900K, or SKUs that have 16 E-cores; the rest will remain the same. The 13700K will likely be 8+8 with 30 MiB of L3, so the per-core L3 remains the same.
 

Carfax83

Diamond Member
Nov 1, 2010
6,189
979
126
No, that's just the 13900K, or SKUs that have 16 E-cores; the rest will remain the same. The 13700K will likely be 8+8 with 30 MiB of L3, so the per-core L3 remains the same.
The E-cores are getting their L2 cache doubled as well, so that should take some of the pressure off the L3 cache and make it more performant.
 

JoeRambo

Golden Member
Jun 13, 2013
1,603
1,712
136
I think 30 or 36MB of L3 is almost the same deal, just a 20% increase, when AMD is operating with 200% increases due to their X3D stuff.

More important are:

1) 2MB of L2 cache at ~the same speed as ADL. This increases performance both in the obvious way, through higher hit rates, and by reducing misses to L3, which is not exactly the strongest part of the GC core.
2) L3 seems to be faster, and that is very important, in both the obvious "cache" ways and in less obvious ones, like inter-thread communication and locks running at the speed of the L3 cache.
3) Memory is getting faster with DDR5 speeds, the IMC is tuned, and due to (2) each request spends less time checking L3 for a hit/miss.
4) The E-core L2 might get moved off the voltage/frequency plane of the L3 cache, allowing the cache to clock higher with the E-cores running. 3.7GHz is anemic in 2022, when AMD is running a 5.05GHz coupled L3 with awesome latency.

Overall, a really good direction to take for a manufacturer whose chips had the same latency with the IMC on chip as AMD had with a separate IOD.
 

nicalandia

Golden Member
Jan 10, 2019
1,623
1,911
106
The E-cores are getting their L2 cache doubled as well, so that should take some of the pressure off the L3 cache and make it more performant.
While games might take advantage of the L3 not being used by the E-cores (6 additional MiB), they will not take advantage of the L2 on the E-cores, since they have plenty of P-cores to choose from.
 

eek2121

Golden Member
Aug 2, 2005
1,914
2,229
136
Looks like your prediction a while back is going to be accurate. I think a lot of people are underestimating the performance of Raptor Lake. At this point, I am almost certain Raptor Lake will have higher single-thread performance than Zen 4, and it will also be highly competitive in multithreaded apps, though Zen 4 will be stronger there overall.

Raptor Lake should also take the gaming crown.
Raptor Lake might be slightly faster than Alder Lake in ST performance, but the difference is likely to be less than 10%. I believe you are overestimating what a small addition of cache will do. Look at AMD APUs vs. their CPUs: the CPUs have DOUBLE the cache of the APUs, yet it helps very little in most cases: CPU 2021 Benchmarks - Compare Products on AnandTech

Zen 4 is likely to be much faster.

Also note that apparently the 5800X3D is in bench! CPU 2021 Benchmarks - Compare Products on AnandTech
 

lobz

Platinum Member
Feb 10, 2017
2,021
2,725
136
Meteor Lake bottom die being just a passive interposer? I don't even.

Meteor Lake is Lakefield 2.0. The bottom die is the PCH, which per tradition is built on the N-1 node, in this case 10nm Foveros.

Proof: see the high density package and tell me where the PCH is... https://wccftech.com/intel-shows-off-14th-gen-meteor-lake-standard-high-density-die-packages-cpu-tiles-produced-by-intel-gpu-tiles-by-tsmc/

This is my first post in years just to correct this nonsensical "theory". Foveros is an active interposer.

Edit: if Intel just needed passive connections it would use EMIB, which is Intel's ultra-low-cost alternative to a passive interposer. @jpiniero @IntelUser2000 @ashFTW
As long as it's On Track™, sure.
 
  • Like
Reactions: ftt

IntelUser2000

Elite Member
Oct 14, 2003
8,271
3,174
136
Look at AMD APUs vs. their CPUs: the CPUs have DOUBLE the cache of the APUs, yet it helps very little in most cases:
The difference here is that the L2 cache latency will not change in Raptorlake. A faster L3 cache will also help, since it's quite slow in Alderlake.

We don't know the extent of the performance improvement, that's all.
 
  • Like
Reactions: Carfax83

Carfax83

Diamond Member
Nov 1, 2010
6,189
979
126
Raptor Lake might be slightly faster than Alder Lake in ST performance, but the difference is likely to be less than 10%. I believe you are overestimating what a small addition of cache will do. Look at AMD APUs vs. their CPUs: the CPUs have DOUBLE the cache of the APUs, yet it helps very little in most cases: CPU 2021 Benchmarks - Compare Products on AnandTech
@IntelUser2000 and @JoeRambo already stated why this is incorrect in regards to the cache. Also, Raptor Lake should get a decent IPC uplift, higher clock speeds and a better IMC which is capable of using higher DDR5 frequencies off the bat compared with Zen 4.

I think 10% is on the low end, but as with anything it will depend on the workload. It's conceivable that in some workloads, the performance gain could be less or greater than 10%.

Zen 4 is likely to be much faster.
Honestly either way, the consumer wins. I'm sure Raptor Lake won't just walk over Zen 4, and Zen 4 won't stomp Raptor Lake into the ground. The more competitive they are with each other, the better it will be.
 
  • Like
Reactions: tjf81 and ashFTW

JoeRambo

Golden Member
Jun 13, 2013
1,603
1,712
136
Look at AMD APUs vs. their CPUs: the CPUs have DOUBLE the cache of the APUs, yet it helps very little in most cases:
AMD's APUs are a bad example of how cache helps. APUs have the IMC on die and substantially better memory latency, so the impact of having half the L3 is reduced. You can't really compare between them, as it's hard to weigh higher hit rates against faster misses that benefit all accesses.

On the topic of RPL improvements in IPC and ST performance in general: I have no idea, but Intel seems to be making the right changes. Probably a 5-10% average improvement, with some outliers gaining less and benefiting only from better clocks, and some gaining much more from the cache improvements.
 
  • Like
Reactions: Carfax83

moinmoin

Diamond Member
Jun 1, 2017
3,592
5,053
136
Overall, a really good direction to take for a manufacturer whose chips had the same latency with the IMC on chip as AMD had with a separate IOD.
That's the crux. Intel had some catching up to do there (all the Skylake clones had excellent cache latency) so it's important it does just that.
 

Carfax83

Diamond Member
Nov 1, 2010
6,189
979
126
That's the crux. Intel had some catching up to do there (all the Skylake clones had excellent cache latency) so it's important it does just that.
But Golden Cove's cache bandwidth is far greater than Skylake's, right? Usually higher bandwidth also means higher latency, though correct me if I'm wrong.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,271
3,174
136
But Golden Cove's cache bandwidth is far greater than Skylake's, right? Usually higher bandwidth also means higher latency, though correct me if I'm wrong.
There isn't a direct relation.

However, you can optimize in a certain direction. Sacrificing latency for higher bandwidth is a switch in mindset, from single-thread to multi-thread optimization. So when they moved, with Nehalem, from L2 being the LLC to a private L2 with L3 as the LLC, it was a push toward better multi-threaded performance. The L3 on Nehalem was slower than the L2 on Core 2 in single-threaded applications, for example.

With multi-threaded applications, the bandwidth of the caches needs to scale with the cores to keep performance scaling.

They are in a transitional phase, moving away from the endless Skylake stagnation, so there will be a lot of improvements, but in theory what I said would apply.

That's the crux. Intel had some catching up to do there (all the Skylake clones had excellent cache latency) so it's important it does just that.
Another thing is that Skylake likely wasn't meant to scale above certain core counts. The greatest benefit of the ring bus was its simplicity, so for relatively low core counts it beat other implementations, even though in theory it didn't sound that fast.

The most stops they've had on a single ring was 12, I think? Not so simple anymore. In Raptorlake they'll reach that number again. The push for ridiculous frequency doesn't help either.
 
  • Like
Reactions: Tlh97 and Vattila

Carfax83

Diamond Member
Nov 1, 2010
6,189
979
126
There isn't a direct relation.

However, you can optimize in a certain direction. Sacrificing latency for higher bandwidth is a switch in mindset, from single-thread to multi-thread optimization. So when they moved, with Nehalem, from L2 being the LLC to a private L2 with L3 as the LLC, it was a push toward better multi-threaded performance. The L3 on Nehalem was slower than the L2 on Core 2 in single-threaded applications, for example.

With multi-threaded applications, the bandwidth of the caches needs to scale with the cores to keep performance scaling.

They are in a transitional phase, moving away from the endless Skylake stagnation, so there will be a lot of improvements, but in theory what I said would apply.
The reason why I thought that was because in the Chips and Cheese deep dive article, the author stated:

High bandwidth at high clocks doesn’t come for free. At all cache levels, Golden Cove has to cope with more latency than Zen 3. In exchange, Golden Cove’s L1 and L2 caches are larger than AMD’s, and deliver more bandwidth.
This to me implies Intel made a trade-off, accepting some additional latency in exchange for big, high-bandwidth caches, and mitigating it with the much bigger ROB. But perhaps the extra latency is due more to the bigger cache size than to the enormous bandwidth.
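One way to see the bandwidth/latency interplay (and why a bigger ROB helps) is Little's law: sustaining bandwidth B at latency L requires B * L bytes in flight. A sketch with illustrative numbers, not measured Golden Cove figures:

```python
def lines_in_flight(bandwidth_gbs: float, latency_ns: float,
                    line_bytes: int = 64) -> float:
    """Outstanding cache lines needed to sustain a bandwidth at a latency.

    Little's law: bytes in flight = bandwidth * latency.
    (GB/s times ns conveniently multiplies out to plain bytes.)
    """
    return bandwidth_gbs * latency_ns / line_bytes

# Illustrative: 100 GB/s at 40 ns latency needs ~62 outstanding cache
# lines, which deeper out-of-order structures make possible to track.
print(lines_in_flight(100, 40))  # -> 62.5
```

So a core can accept higher latency without losing bandwidth only if it can keep more requests in flight, which is exactly what a larger ROB and more miss-tracking buffers provide.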
 
