Discussion Intel current and future Lakes & Rapids thread

Exist50 · May 13, 2022

dullard said:
Note: I am not going to deny that cores in the big tile exist. I hope they do, since the more technology and the more cores out there, the better off we will be. But even that link doesn't specify Meteor Lake. We can infer that it is Meteor Lake from the same 3-tile graphic used in other Meteor Lake presentations (ignoring the obvious simplification that Meteor Lake isn't 3 tiles). But that is also the same image used for Arrow Lake. So these are still just hints at this point for those of us without inside information.

Well, Arrow Lake is supposed to use the same SoC tile as Meteor Lake, right? So...

nicalandia said:
Is this better?

View attachment 61429

Purely in terms of the CPU tile, yeah, that looks right.

IntelUser2000 · May 14, 2022

Exist50 said:
You need hybrid bonding to get close to monolithic. Foveros has a smaller bump pitch than EMIB, which is probably why it's used here, but it's still not quite monolithic.

So Intel says on-die fabrics need 0.1pJ/bit, and Foveros requires 0.3pJ/bit. Still a fair bit higher, but organic interposers are over 1pJ/bit, and EMIB is slightly below 1pJ/bit.
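To put those pJ/bit figures in perspective, here is a rough sketch that turns them into link power at a given traffic level. The pJ/bit values are the ones quoted above; the 100 GB/s traffic number and the exact values picked for "slightly below 1" and "over 1" are assumptions for illustration only, not Meteor Lake specs.

```python
# Link power = energy per bit * bits moved per second.
# pJ/bit values come from the post above; everything else is illustrative.
PJ_PER_BIT = {
    "on-die fabric": 0.1,
    "Foveros": 0.3,
    "EMIB": 0.9,                # "slightly below 1" in the post; 0.9 is a placeholder
    "organic interposer": 1.0,  # "over 1" in the post; 1.0 is only a lower bound
}


def link_power_watts(pj_per_bit: float, gigabytes_per_s: float) -> float:
    """Power spent moving data across a link at the given sustained bandwidth."""
    bits_per_s = gigabytes_per_s * 1e9 * 8
    return pj_per_bit * 1e-12 * bits_per_s


if __name__ == "__main__":
    traffic_gb_s = 100.0  # hypothetical sustained die-to-die traffic
    for name, pj in PJ_PER_BIT.items():
        watts = link_power_watts(pj, traffic_gb_s)
        print(f"{name:>20}: {watts:.2f} W at {traffic_gb_s:.0f} GB/s")
```

The absolute numbers depend entirely on the assumed traffic, but the ratios between the interconnect types do not.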

When you are talking about datapaths for I/O, they cannot be shut down as readily as individual blocks can, because they need to meet QoS and bandwidth/latency requirements. You also need to serve every block that uses the bus. If you have 50 blocks on a bus but only one is active, you still need to have the whole bus online.

It seems that at a higher level you can use things like power gating to lower power use, but they don't use it for, say, the decoders, as it's too latency critical.

So theoretically it should offer faster on/off transitions for the chipset, which would actually allow it to reach a much lower-power idle (and/or reach it more often), and it should also reduce the absolute floor of power the communication link uses. As things stand, the ~1.5W or so TDP of the on-package chipset is often reached even under very light load. Yes, you can get that really low in idle, but it's very easy to knock it out of that state.

That is theory of course. Atom didn't benefit from things being on-die until Silvermont when they updated to a proper on-die bus.

Unfortunately, I think you'll be disappointed, but I hope my pessimism ends up being incorrect.

This is an opportunity. I think there's a chance for them to improve this greatly, though still playing catch up at this point.

Also, good to know davidbepo is a crock.

IntelUser2000 · May 14, 2022

Anhiel said:
If my estimation on the die size is roughly correct the total area (with empty space) is around 178 mm^2.
So the top left should be Intel4 8c+16c compute part. The tile below would be N4 IO.
The center part would be N3 iGPU but it has space for 384 EU so it either is:
2 media engine + 384 EU
or
2 media engine + 192 EU + 6 tensor processing core (TPC) or similar
The right lean tile would be N5 SOC.

No, that's wrong. Many people have explained this over many pages.

Top left is 2+8. Bottom tiny die is GPU. Center large tile is SoC or basically what's called PCH in current terms. Right is I/O.

You are probably following the twitter feed of someone like Wildcracks who said the same thing about the center being the iGPU not SoC, but we discussed this a while ago. He's wrong, although it's a reasonable expectation.

And the numbers you are using aren't even from him but complete imagination. Arrow Lake is 384EU, not Meteor Lake.

Exist50 · May 14, 2022

IntelUser2000 said:
It seems that at a higher level you can use things like power gating to lower power use, but they don't use it for, say, the decoders, as it's too latency critical.

So theoretically it should offer faster on/off transitions for the chipset, which would actually allow it to reach a much lower-power idle (and/or reach it more often), and it should also reduce the absolute floor of power the communication link uses. As things stand, the ~1.5W or so TDP of the on-package chipset is often reached even under very light load. Yes, you can get that really low in idle, but it's very easy to knock it out of that state.

That is theory of course. Atom didn't benefit from things being on-die until Silvermont when they updated to a proper on-die bus.

Oh there are plenty of theoretical advantages, don't get me wrong, but I'm not convinced any will actually get realized with Meteor Lake as rushed as it seems to be.

IntelUser2000 said:
Top left is 2+8. Bottom tiny die is GPU. Center large tile is SoC or basically what's called PCH in current terms. Right is I/O.

You swapped the GPU and IO.

eek2121 · May 15, 2022

witeken said:
This is my first post in years

It's all my fault, isn't it? 🤣

itsmydamnation · May 15, 2022

eek2121 said:
It's all my fault, isn't it? 🤣

At least you didn't summon "the one"

nicalandia · May 18, 2022

This is supposed to be for the Raptor Lake 13900K, but it might as well be a 14900K, since the cache size has not changed for Meteor Lake.

Alleged Intel Core i9-13900K Raptor Lake CPU-Z Shows Up To 68 MB Cache: Higher Clocks & Increased Cache Designed To Tackle AMD's Raphael 'Zen 4' CPUs

Intel's Core i9-13900K 13th Gen Raptor Lake Desktop CPU shows up with up to 68 MB increased cache in a leaked CPU-z screenshot.

wccftech.com

JoeRambo · May 19, 2022

This one graph from the same source is very interesting:

https://twitter.com/x/status/1526909564741791745

While the latency improvements in the ~1-4MB range are obviously due to the 2MB of L2 (remarkably, it is not slower than 1MB in the <1MB range), Intel seems to have remembered its premier cache skills, and the 8-32MB range is looking improved too. That is even more notable considering the graph is log scale: these are very sizeable improvements in cycles for what is also a larger cache with more stops on the ring, due to two more clusters of small cores.
Intel seems hell-bent on feeding its very wide and powerful core, and combined with faster DDR5 support it will perform substantially better than ADL.

Carfax83 · May 19, 2022

JoeRambo said:
While the latency improvements in the ~1-4MB range are obviously due to the 2MB of L2 (remarkably, it is not slower than 1MB in the <1MB range), Intel seems to have remembered its premier cache skills, and the 8-32MB range is looking improved too. That is even more notable considering the graph is log scale: these are very sizeable improvements in cycles for what is also a larger cache with more stops on the ring, due to two more clusters of small cores.
Intel seems hell-bent on feeding its very wide and powerful core, and combined with faster DDR5 support it will perform substantially better than ADL.

Looks like your prediction a while back is going to be accurate. I think a lot of people are underestimating the performance of Raptor Lake. At this point, I am almost certain Raptor Lake will have higher single-thread performance than Zen 4, and will also be highly competitive in multithreaded apps, though Zen 4 will be stronger there overall.

Raptor Lake should also take the gaming crown.

nicalandia · May 19, 2022

I predict Raptor Lake to be slower than the 5800X3D in games that like cache.

Carfax83 · May 19, 2022

nicalandia said:
I predict Raptor Lake to be slower than the 5800X3D in games that like Cache.

Well, cache is getting significantly increased in Raptor Lake. The 5800X3D is barely faster than the 12900K when the latter is paired with high speed DDR5, which compensates for the lack of cache somewhat.

But what you stated could also be the case for Zen 4 as well. There's a high chance that Zen 4 will be slower than the 5800X3D in many games.

nicalandia · May 19, 2022

Carfax83 said:
Well, cache is getting significantly increased in Raptor Lake. The 5800X3D is barely faster than the 12900K when the latter is paired with high speed DDR5, which compensates for the lack of cache somewhat.

But what you stated could also be the case for Zen 4 as well. There's a high chance that Zen 4 will be slower than the 5800X3D in many games.

You guys are banking on that additional 60% increase in L2$ which gives an additional 6 MiB of L2$. The L3 remains the same. Zen4 will have a 100% increase on the L2 and L3 will remain the same.

Sapphire Rapids should shed some light on the gaming performance of Raptor Lake, because it also has 2 MiB of L2 per performance core.
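The L2 arithmetic in the post above can be checked with a quick sketch. The 1.25 MiB per P-core figure for Alder Lake is an assumption here (it is the value that makes the quoted 60% and 6 MiB figures work out with 8 P-cores), and so is the 0.5 to 1 MiB per-core step for Zen 4, which simply matches the quoted 100%.

```python
def pct_increase(old: float, new: float) -> float:
    """Relative increase from old to new, in percent."""
    return (new - old) * 100.0 / old


P_CORES = 8            # P-core count on the top SKU
ADL_L2_MIB = 1.25      # assumed Alder Lake L2 per P-core
RPL_L2_MIB = 2.0       # 2 MiB per P-core, as stated in the thread

# Extra L2 across all P-cores, and the per-core percentage step.
extra_l2_mib = P_CORES * (RPL_L2_MIB - ADL_L2_MIB)
rpl_l2_pct = pct_increase(ADL_L2_MIB, RPL_L2_MIB)

# Zen 4: assumed 0.5 -> 1.0 MiB per core, i.e. a doubling.
zen4_l2_pct = pct_increase(0.5, 1.0)

if __name__ == "__main__":
    print(f"Raptor Lake P-core L2: +{rpl_l2_pct:.0f}% per core, "
          f"+{extra_l2_mib:.0f} MiB total")
    print(f"Zen 4 L2: +{zen4_l2_pct:.0f}% per core")
```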

Carfax83 · May 19, 2022

nicalandia said:
You guys are banking on that additional 60% increase in L2$ which gives 6 additional MiB of L2$. The L3 remains the same. Zen4 will have a 100% increase on the L2 and L3 will remain the same.

L3 is going from 30MB in Alder Lake to 36MB in Raptor Lake, and it looks like it will shave some cycles off the access time. The memory controller will also be better.

nicalandia · May 19, 2022

Carfax83 said:
L3 is going from 30MB in Alder Lake to 36MB in Raptor Lake, and it looks like it will shave some cycles off the access time. The memory controller will also be better.

No, that's just the 13900K or SKUs that have 16 E-cores; the rest will remain the same. The 13700K will likely be 8+8 with 30 MiB of L3, so the per-core L3 remains the same.
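A quick way to see the "per-core L3 stays the same" point is to divide L3 by ring stops. This is only a sketch: it assumes each P-core and each cluster of 4 E-cores counts as one ring stop (the cluster size is inferred from the "two more clusters" remark earlier in the thread for 8 extra E-cores), and it uses the 30 and 36 MiB figures quoted here.

```python
E_CORES_PER_CLUSTER = 4  # assumption: E-cores share a ring stop in groups of 4


def ring_stops(p_cores: int, e_cores: int) -> int:
    """Number of core ring stops under the one-stop-per-cluster assumption."""
    return p_cores + e_cores // E_CORES_PER_CLUSTER


def l3_per_stop_mib(l3_mib: float, p_cores: int, e_cores: int) -> float:
    """L3 capacity divided evenly across core ring stops."""
    return l3_mib / ring_stops(p_cores, e_cores)


if __name__ == "__main__":
    # (label, L3 in MiB, P-cores, E-cores), using the figures from the thread
    configs = [
        ("8+16 with 36 MiB", 36, 8, 16),
        ("8+8 with 30 MiB", 30, 8, 8),
    ]
    for label, l3, p, e in configs:
        print(f"{label}: {ring_stops(p, e)} stops, "
              f"{l3_per_stop_mib(l3, p, e):.1f} MiB per stop")
```

Under those assumptions both configurations come out to the same slice size per stop, so the larger total is just more slices rather than bigger ones.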

Carfax83 · May 19, 2022

nicalandia said:
No, that's just the 13900K or SKUs that have 16 E-cores; the rest will remain the same. The 13700K will likely be 8+8 with 30 MiB of L3, so the per-core L3 remains the same.

The E cores are getting their L2 cache doubled as well, so that should take some of the pressure off of the L3 cache and make it more performant.

JoeRambo · May 19, 2022

I think 30 or 36MB of L3 is almost the same deal, just a 20% increase, while AMD is operating with 200% increases thanks to their X3D stuff.

What matters more:

1) 2MB of L2 cache at ~the same speed as ADL. It increases performance in the obvious way through higher hit rates, and it also reduces misses to L3, which is not exactly the strongest part of the GC core.
2) L3 seems to be faster, and that is very important, both in the obvious "cache" ways and in less obvious ones like inter-thread communication and locks running at the speed of the L3 cache.
3) Memory is getting faster in DDR5 speed, the IMC is tuned, and due to (2) each request takes less time checking L3 for a hit/miss.
4) E-core L2 might get moved off the voltage/frequency plane of the L3 cache, allowing the cache to clock better with E-cores running. 3.7 GHz is anemic in 2022, when AMD is running a 5050 MHz coupled L3 with awesome latency.

Overall, a really good direction to take for a manufacturer whose chips had the same latency with the IMC on chip as AMD had with a separate IOD.
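To illustrate why the cache clock in point 4 matters, here is a minimal sketch converting a fixed cycle count into nanoseconds at the two clocks mentioned (3.7 GHz and 5050 MHz). The 70-cycle figure is purely hypothetical and is not a measured L3 latency for either vendor.

```python
def cycles_to_ns(cycles: float, clock_ghz: float) -> float:
    """Time taken by a fixed number of cycles at a given clock."""
    return cycles / clock_ghz


if __name__ == "__main__":
    hypothetical_cycles = 70  # made-up cycle count, for illustration only
    for label, ghz in [("3.7 GHz cache clock", 3.7), ("5.05 GHz cache clock", 5.05)]:
        ns = cycles_to_ns(hypothetical_cycles, ghz)
        print(f"{label}: {ns:.1f} ns for {hypothetical_cycles} cycles")
```

The same number of cycles costs roughly a third more wall-clock time at the lower clock, which is the gap a decoupled, faster-clocking cache would close.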

nicalandia · May 19, 2022

Carfax83 said:
The E cores are getting their L2 cache doubled as well, so that should take some of the pressure off of the L3 cache and make it more performant.

While games might take advantage of the L3 that is not being used by the E-cores (6 additional MiB), they will not take advantage of the L2 on the E-cores, since they have plenty of P-cores to choose from.

eek2121 · May 19, 2022

Carfax83 said:
Looks like your prediction a while back is going to be accurate. I think a lot of people are underestimating the performance of Raptor Lake. At this point, I am almost certain Raptor Lake will have higher single-thread performance than Zen 4, and will also be highly competitive in multithreaded apps, though Zen 4 will be stronger there overall.

Raptor Lake should also take the gaming crown.

Raptor Lake might be slightly faster than Alder Lake in ST performance, but the difference is likely to be less than 10%. I believe you are overestimating what a small addition to cache will do. Look at AMD APUs vs. their CPUs. Their CPUs have DOUBLE the cache of the APUs, yet it helps very little in most cases: CPU 2021 Benchmarks - Compare Products on AnandTech

Zen 4 is likely to be much faster.

Also note that apparently the 5800X3D is in bench! CPU 2021 Benchmarks - Compare Products on AnandTech

lobz · May 19, 2022

witeken said:
Meteor Lake bottom die being just a passive interposer? I don't even.

Meteor Lake is Lakefield 2.0. The bottom die is the PCH, which per tradition is built on the N-1 node, in this case 10nm Foveros.

Proof: see the high density package and tell me where the PCH is... https://wccftech.com/intel-shows-of...pu-tiles-produced-by-intel-gpu-tiles-by-tsmc/

This is my first post in years just to correct this nonsensical "theory". Foveros is active interposer.

Edit: if Intel just needed passive connections it would use EMIB, which is Intel's ultra-low cost alternative to passive interposer. @jpiniero @IntelUser2000 @ashFTW

As long as it's On Track™, sure.

IntelUser2000 · May 19, 2022

eek2121 said:
Look at AMD APUs vs. their CPUs. Their CPUs have DOUBLE the cache of the APUs, yet it helps very little in most cases:

The difference here is that the L2 cache latency will not change in Raptor Lake. Faster L3 cache will also help, since it's quite slow in Alder Lake.

We don't know the extent of the performance improvement, that's all.

Carfax83 · May 20, 2022

eek2121 said:
Raptor Lake might be slightly faster than Alder Lake in ST performance, but the difference is likely to be less than 10%. I believe you are overestimating what a small addition to cache will do. Look at AMD APUs vs. their CPUs. Their CPUs have DOUBLE the cache of the APUs, yet it helps very little in most cases: CPU 2021 Benchmarks - Compare Products on AnandTech

@IntelUser2000 and @JoeRambo already stated why this is incorrect in regards to the cache. Also, Raptor Lake should get a decent IPC uplift, higher clock speeds and a better IMC which is capable of using higher DDR5 frequencies off the bat compared with Zen 4.

I think 10% is on the low end, but as with anything it will depend on the workload. It's conceivable that in some workloads, the performance gain could be less or greater than 10%.

Zen 4 is likely to be much faster.

Honestly either way, the consumer wins. I'm sure Raptor Lake won't just walk over Zen 4, and Zen 4 won't stomp Raptor Lake into the ground. The more competitive they are with each other, the better it will be.

JoeRambo · May 20, 2022

eek2121 said:
Look at AMD APUs vs. their CPUs. Their CPUs have DOUBLE the cache of the APUs, yet it helps very little in most cases:

AMD's APUs are bad examples for showing how cache helps. APUs have the IMC on die and substantially better memory latency, so the impact of having half the L3 is reduced. You can't really compare between them, as it's hard to weigh a higher hit rate against faster misses, which benefit every access that goes to memory.

On the topic of RPL improvements in IPC and ST performance in general -> I have no idea, but Intel seems to be making the right changes. Probably a 5-10% average improvement, with some outliers gaining less and benefiting just from better clocks, and some gaining way more due to the cache improvements.
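The hit-rate-versus-faster-misses trade-off can be sketched with a simple average access time model for requests that reach L3. Every number below (latencies and miss rates) is hypothetical and chosen only to show how the two effects can roughly cancel; none of it is measured data for any AMD part.

```python
def avg_access_ns(l3_hit_ns: float, l3_miss_rate: float, memory_ns: float) -> float:
    """Average latency of a request that reaches L3: hit cost plus weighted miss cost."""
    return l3_hit_ns + l3_miss_rate * memory_ns


# Hypothetical "CPU": bigger L3 (fewer misses) but slower trip to memory.
cpu_like = avg_access_ns(l3_hit_ns=10.0, l3_miss_rate=0.20, memory_ns=80.0)

# Hypothetical "APU": half the L3 (more misses) but faster memory via on-die IMC.
apu_like = avg_access_ns(l3_hit_ns=10.0, l3_miss_rate=0.28, memory_ns=65.0)

if __name__ == "__main__":
    print(f"bigger L3, slower memory : {cpu_like:.1f} ns average")
    print(f"smaller L3, faster memory: {apu_like:.1f} ns average")
```

With these made-up inputs the two configurations land within a couple of nanoseconds of each other, which is why the APU comparison says little about what extra cache does on its own.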

moinmoin · May 21, 2022

JoeRambo said:
Overall, a really good direction to take for a manufacturer whose chips had the same latency with the IMC on chip as AMD had with a separate IOD.

That's the crux. Intel had some catching up to do there (all the Skylake clones had excellent cache latency) so it's important it does just that.

Carfax83 · May 22, 2022

moinmoin said:
That's the crux. Intel had some catching up to do there (all the Skylake clones had excellent cache latency) so it's important it does just that.

But Golden Cove's cache bandwidth is far greater than Skylake's, right? Usually higher bandwidth also means higher latency, though correct me if I'm wrong.

IntelUser2000 · May 22, 2022

Carfax83 said:
But Golden Cove's cache bandwidth is far greater than Skylake's, right? Usually higher bandwidth also means higher latency, though correct me if I'm wrong.

There isn't a direct relation.

However, you could optimize in a certain direction. Sacrificing latency to get higher bandwidth is a switch in mindset, from single thread to multi-thread optimization. So when they moved from L2 being the LLC to L2 being private and L3 being the LLC with Nehalem, it was a push towards better multi-threaded performance. L3 on Nehalem would be slower than L2 on Core 2 in single threaded applications for example.

With multi-threaded applications, you need the bandwidth of the caches to scale with the core count to keep performance scaling.

They are in a transitional phase, moving away from the endless Skylake stagnation, so there will be a lot of improvements, but in theory what I said would apply.
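One way to see why bandwidth and latency are not directly tied is Little's law: sustained bandwidth is the amount of data in flight divided by latency. The sketch below uses hypothetical request counts and latencies (and an assumed 64-byte line) to show a cache with double the latency delivering the same bandwidth once it tracks twice as many outstanding requests.

```python
LINE_BYTES = 64  # assumed cache line size


def bandwidth_gb_s(outstanding_requests: int, latency_ns: float) -> float:
    """Little's law: bytes in flight / latency. Bytes per ns equals GB/s."""
    return outstanding_requests * LINE_BYTES / latency_ns


if __name__ == "__main__":
    # Hypothetical designs, not measurements of any real core.
    low_latency = bandwidth_gb_s(outstanding_requests=16, latency_ns=20.0)
    high_latency = bandwidth_gb_s(outstanding_requests=32, latency_ns=40.0)
    print(f"16 requests in flight at 20 ns: {low_latency:.1f} GB/s")
    print(f"32 requests in flight at 40 ns: {high_latency:.1f} GB/s")
```

A single thread that cannot keep many requests in flight feels the latency directly, while many threads together can fill the deeper queue, which fits the single-thread versus multi-thread optimization point above.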

moinmoin said:
That's the crux. Intel had some catching up to do there (all the Skylake clones had excellent cache latency) so it's important it does just that.

Another thing is that Skylake likely wasn't meant to scale above certain core counts. The greatest benefit of the ring bus was its simplicity. So for relatively low core counts it beat other implementations, even though in theory it didn't sound so fast.

The most stops they've put on a single ring was 12, I think? Not so simple anymore. In Raptor Lake they'll reach that number again. Also, the push for ridiculous frequency doesn't help.
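As a rough illustration of why more ring stops hurt, this sketch computes the average hop count between two distinct stops. It assumes an idealized bidirectional ring with shortest-path routing and uniform traffic between stops, and it ignores any non-core stops, so treat it as a toy model rather than a description of Intel's actual ring.

```python
def avg_ring_hops(stops: int) -> float:
    """Average shortest-path hop count between distinct stops on a bidirectional ring."""
    total_hops = 0
    pairs = 0
    for src in range(stops):
        for dst in range(stops):
            if src == dst:
                continue
            direct = abs(src - dst)
            # Traffic can go either way around the ring; take the shorter side.
            total_hops += min(direct, stops - direct)
            pairs += 1
    return total_hops / pairs


if __name__ == "__main__":
    for n in (4, 8, 10, 12):
        print(f"{n:>2} stops: {avg_ring_hops(n):.2f} hops on average")
```

The average distance grows roughly linearly with the stop count, so each added stop makes every L3 access a little slower unless the ring itself gets faster.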
