Discussion Intel current and future Lakes & Rapids thread


ashFTW

Senior member
Sep 21, 2020
I don't recall it being on schedule though?
10nm delays are to blame for that. Hopefully that's not something Intel will repeat, at least not to that degree. Otherwise they will be toast!
 

Exist50

Platinum Member
Aug 18, 2016
Pat said himself Intel wants to reach process performance-per-watt parity with TSMC in 2024. So I'm not sure we can call anything earlier on par already.
But TSMC's best in Perf/Watt during 2023-24 is N3/N3E, so that still makes sense. I'd really hope that Intel 3 is at least close enough to TSMC N4P/N4X by 2024.

Not necessary, but do you really want to commit to making reticle-limit-sized tiles on a new process (Intel 3)?
Intel 3 shouldn't really be "new" by 2024. Intel has a full year for yield learnings via Intel 4. And of course, not every core has to be functional.

EMIB seems to take a lot of space on the SPR die
It does, but the larger the die, the lower the relative cost. And Foveros isn't free either. MTL will undoubtedly show some sort of die-to-die keep-out zone.

If done properly, it should provide shorter paths to I/O and memory by placing them on the base die. It also provides an opportunity for a large cache to be placed there.
There's some opportunity there, especially from a cache perspective, but think about how many wafers that would require, even for a passive interposer! Also, stacking cores on top of IO might be impossible from a thermal perspective, and most IO isn't that latency critical.

And once you look at the Falcon Shores design, which needs x86 chiplets as well, why not build them that way to start with? Less engineering, faster time to market, etc.
Compute tiles assembled with Foveros vs attached with EMIB wouldn't really make a difference there. Can disaggregate IO in both cases.

10nm delays are to blame for that
At least partially, but not entirely.
 

Doug S

Diamond Member
Feb 8, 2020
Two 600+ mm2 tiles could reach 100 cores with perfect yield.


Don't forget the reticle limit is going to halve when the high-NA EUV machines come online. Sure, that's not a concern for Intel 3, but designers are likely to shy away from such designs knowing they are a very short-term solution.
 

ashFTW

Senior member
Sep 21, 2020
Intel 3 shouldn't really be "new" by 2024. Intel has a full year for yield learnings via Intel 4. And of course, not every core has to be functional.
~50 cores of 10 mm2 each can be put on a 650 mm2 tile that uses EMIB to connect to another such tile and an "IO tile". The extra 150 mm2 is for EMIB support and accelerators. And let's assume 10-15% of the cores have to be disabled due to defects. That's 85-90 functioning cores across the two tiles. Is that enough? In 2024??

People here don’t even believe Intel can yield 400 mm2 SPR die on now fairly mature Intel 7. Is full year of yield learning good enough to have such high confidence in yielding ginormous die? There is also plenty of learning Intel 3 has to go through for new features. And not everything is going to be orthogonal there.

It does, but the larger the die, the lower the relative cost. And Foveros isn't free either. MTL will undoubtedly show some sort of die-to-die keep-out zone.
True that. I have used 20% area "lost to Foveros" in my previous calculations. But perhaps both EMIB and Foveros can be used better, to have a low impact on the main tile, by moving the bulk of the interface logic to the EMIB or the base tiles? The first use of EMIB on the Hades Canyon NUC to bridge an AMD GPU with HBM was a bit deceptive, mostly due to my own ignorance at the time, since it presumably required no changes to the GPU die (or the HBM).

There's some opportunity there, especially from a cache perspective, but think about how many wafers that would require, even for a passive interposer! Also, stacking cores on top of IO might be impossible from a thermal perspective, and most IO isn't that latency critical.
How does one have a passive interposer/base die with cache on it? Don’t the cache transistors need switching??

If you absolutely require bigger caches to support the large number of cores, what other options does one have, other than stacking? And what about memory accesses? Don't those need to be low latency?

Compute tiles assembled with Foveros vs attached with EMIB wouldn't really make a difference there. Can disaggregate IO in both cases.
Falcon Shores etc., being intermediate steps to zettascale, are really large designs. Making them 2.5D would make the socket stupendously huge. They are evolving from the "bridge" products, and hence should continue using Foveros.
 

ashFTW

Senior member
Sep 21, 2020
Don't forget the reticle limit is going to halve when the high-NA EUV machines come online. Sure, that's not a concern for Intel 3, but designers are likely to shy away from such designs knowing they are a very short-term solution.
I have also argued this point before, and hence I've been leaning towards smaller (max ~400 mm2) top dies and 4-stack Foveros.
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
Golden Cove client core vs server core, plus Golden Cove vs Zen 3 size with cores and cache only, no memory or PCIe PHY
And what does that have to do with anything? No benchmarks, nothing, just some pictures that may not even be shown at the same scale.
 

ashFTW

Senior member
Sep 21, 2020
And what does that have to do with anything? No benchmarks, nothing, just some pictures that may not even be shown at the same scale.
You want benchmarks posted along with die shots? 😂 How‘s that going to add anything here that hasn’t already been discussed ad nauseam before?

Golden Cove client core vs server core, plus Golden Cove vs Zen 3 size with cores and cache only, no memory or PCIe PHY
Several interesting observations can be made:
  • The images are to scale, so they can be compared
  • The size of the 8-core cluster of client Golden Cove with L2/L3 compares quite nicely with that of Zen 3; the difference in favor of Zen 3 can be largely attributed to GC's much larger core.
  • Density-wise, Intel 7 is indeed quite close to TSMC N7, which justifies Intel's renaming of its process nodes.
  • The server GC is so much bigger than the client GC, which gives a huge area advantage to AMD with EPYC.
  • The Advanced Matrix Extensions (AMX) take up a huge area on the server GC core. With tighter CPU/GPU integration coming, was this really necessary? It now has to be carried forward in future cores along with all the AVX-512 mess. At least these instructions are not over-engineered!
 

IntelUser2000

Elite Member
Oct 14, 2003
@Exist50 Don't know why you think it needs Intel 3 to be competitive with TSMC N4. Already SemiWiki and WikiChip say Intel 4 is between N5 and N3, and closer to N3. Perhaps on density you are right, but Intel 3 is another 18% gain in performance; that's a gain equal to N7 to N5 or N5 to N3.

Two 600+ mm2 tiles could reach 100 cores with perfect yield. So, Foveros is not necessary, but do you really want to commit to making such large tiles on a new process (Intel 3)? I would rather stitch together much smaller (say 100 mm2) top tiles using Foveros; EMIB seems to take a lot of space on the SPR die, so I'm a bit wary of that. Also, going 3D has advantages. If done properly, it should provide shorter paths to I/O and memory interfaces on the base die. It also provides an opportunity for a large cache to be placed there.

Always be conservative with semiconductors, always. The hype goes to the stratosphere every. single. time. I am inclined to believe Falcon Shores is a way different thing from Granite Rapids. Yes, within the Falcon Shores platform you'll get the flexibility, but not something that essentially fits in the same socket. Remember I said Sapphire Rapids HBM is BGA. Optimizing for such a heterogeneous configuration will mean different x86 cores to maximize the configuration. Using GNR exactly as-is would be a waste.

Also, since we don't know how much Raptor Cove gains over Golden Cove, never mind Redwood Cove over its predecessor, and never mind again Lion Cove (which we speculate is in GNR) over its predecessors, that estimate of 650 mm2 for 100 perfect-yield cores can easily change by plus or minus 30%!

Look at how the FP block shrunk by 40% on Intel 4. That's roughly a 1.67x (~67%) density gain for that block. We don't know how that applies to the blocks that make server Golden Cove extra large compared to client Golden Cove.

Remember how we talk about how Golden Cove is inefficient in die area? I think they'll make this better generation after generation. The older diagram had 60(sixty) blocks per tile on Intel 4!

Intel also demonstrated a 6T SRAM module that's comparable in power efficiency to the 8T module but far smaller. It's still 20% larger than the regular 6T SRAM, but the 8T SRAM is a further 40% larger than that. It's an example of design optimization. They have been using 8T SRAM for the L1 and L2 caches for over a decade now.
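
Putting those ratios into numbers (just the relative areas as I read the claim, with the regular 6T module normalized to 1.0):

Code:
# Relative SRAM module areas implied above (regular 6T normalized to 1.0).
# These are only the ratios quoted in the post, not measured figures.
regular_6t = 1.00
new_6t     = regular_6t * 1.20   # "still 20% larger than the regular 6T SRAM"
old_8t     = new_6t * 1.40       # "the 8T SRAM is a further 40% larger"

print(f"regular 6T: {regular_6t:.2f}")
print(f"new 6T    : {new_6t:.2f}")
print(f"8T        : {old_8t:.2f} -> new 6T saves ~{1 - new_6t / old_8t:.0%} vs 8T")

So under that reading, the new 6T module gives back roughly 29% of the area versus the 8T module while keeping comparable power efficiency.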

Ian says Granite Rapids is using HD libraries, which don't exist for Intel 4.
 

IntelUser2000

Elite Member
Oct 14, 2003
Intel 7 using HP libraries for Alder Lake also explains why Gracemont is relatively larger than Tremont. Still, Crestmont on Intel 4, even with the HP library, is back in the ~1 mm2 range.
 

ashFTW

Senior member
Sep 21, 2020
@Exist50 Don't know why you think it needs Intel 3 to be competitive with TSMC N4. Already two analyst sites out there (one being WikiChip) say Intel 4 is between N5 and N3, and closer to N3. Perhaps on density you are right, but Intel 3 is another 18% gain in performance; that's a gain equal to N7 to N5 or N5 to N3.
Good density gains will also come with backside power delivery on Intel 20A using PowerVia, which is better than N2's buried power rails.

[attached slide]

Always be conservative with semiconductors, always. The hype goes to the stratosphere every. single. time. I am inclined to believe Falcon Shores is a way different thing from Granite Rapids. Yes, within the Falcon Shores platform you'll get the flexibility, but not something that essentially fits in the same socket. Remember I said Sapphire Rapids HBM is BGA. Optimizing for such a heterogeneous configuration will mean different x86 cores to maximize the configuration. Using GNR exactly as-is would be a waste.
I have discussed using both Granite Rapids and Sierra Forest tiles as-is (to increase reuse) within the Falcon Shores design. To be used as-is, they have to be on the same platform. These are top tiles, so BGA considerations don't matter. Or am I missing something? Have you seen the two "system block diagrams" that I previously posted here, and more recently on Twitter?



Also, since we don't know how much Raptor Cove gains over Golden Cove, never mind Redwood Cove over its predecessor, and never mind again Lion Cove (which we speculate is in GNR) over its predecessors, that estimate of 650 mm2 for 100 perfect-yield cores can easily change by plus or minus 30%!

Look at how the FP block shrunk by 40% on Intel 4. That's roughly a 1.67x (~67%) density gain for that block. We don't know how that applies to the blocks that make server Golden Cove extra large compared to client Golden Cove.
One has to make certain assumptions when trying to make predictions. I assumed a 10 mm2 core size on Intel 3 for whatever core they end up using. Unless there is a complete redesign, the cores just keep getting bigger every generation.

Remember how we talk about how Golden Cove is inefficient in die area? I think they'll make this better generation after generation. The older diagram had 60(sixty) blocks per tile on Intel 4!
What it doesn’t show is area dedicated to EMIB and fixed function accelerators, which could easily occupy 10 of those blocks. And how many blocks will be lost to defects, especially on a new process. Now does that 60 blocks still look competitive? And, I don’t know how much to read into that image; Intel is not going to signal the exact core count so early into its development, giving big advantage to its competitor.
 

IntelUser2000

Elite Member
Oct 14, 2003
@ashFTW I just don't believe it'll be such a simple mix-and-match of off-the-shelf parts as you imply. The claims they have touted are dramatic. They'll likely need substantial system-level changes to achieve this. GNR, on the other hand, is a general-purpose server platform. The two go against each other.

Also, volumes are going to be a fraction of a fraction of Xeon volumes, like Xeon Phi and the HBM versions. They do things that sometimes don't make sense to us, like BGA 300W+ chips, but it's very specific and focused on that market. To say such chips have things in common is not entirely false, but it's quite a stretch.

We'll see about the process density, since N2's density gain is a lot, lot less than expected. According to that very slide, the buried-power-rail area scaling is "good", yet we got only ~10%. I think TSMC is just reducing risk by focusing on performance and leaving density for the next generation, because the move to the new FET is a huge change. Think what happened in the FinFET transition: 20nm was the density focus, and 16/14nm was the new transistor with almost no density gains.

Considering Granite Rapids with a new core is only a 10-12% gain, and given the bit from MLID about Pat focusing more on execution over risky designs, I don't know if I expect density gains out of 20A.
 

ashFTW

Senior member
Sep 21, 2020
@ashFTW I just don't believe it'll be such a simple mix-and-match of off-the-shelf parts as you imply. The claims they have touted are dramatic. They'll likely need substantial system-level changes to achieve this. GNR, on the other hand, is a general-purpose server platform. The two go against each other.
They absolutely do not. There is actually a ton of synergy. Note that all the "general purpose server platform" features come from the base tiles. These features will be needed not only by Granite Rapids and Sierra Forest, which share the platform as Intel has already announced, but also by Falcon Shores, which is going into a Xeon socket. Why not build a common platform and Foveros (Direct) chiplet-interconnect "design rules" using UCIe? Using a varying number (1 to 4) of top tiles, the entire "metal" GNR server range can be covered. For example, a 32-core Silver Xeon could be made with just one top tile, as shown in my block diagrams.
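
A tiny sketch of how that scaling could work; the 32 cores per top tile is my own assumption based on the block diagrams, not an announced figure:

Code:
# One common base platform, with 1-4 identical top tiles spanning the product range.
# cores_per_top_tile is an assumption, not an Intel number.
cores_per_top_tile = 32

for top_tiles in range(1, 5):
    print(f"{top_tiles} top tile(s): {top_tiles * cores_per_top_tile} cores")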

Edit: An additional thing I want to point out is that Intel has publicly stated its huge ambition to reach zettascale over the next ~6 years. They are not going to get there without exploiting every single synergy along the way. Falcon Shores is part of "wave 2" on this path, and it requires x86 cores and Xe cores on the same package inside a Xeon socket. The right interfaces have to be there so that innovations can proceed in parallel and can just plug in without having to rework the whole thing. With wave 3, my guess is that the base tiles will be evolved for higher I/O and memory bandwidths. Some of the interfaces may move to optical PHYs. The final wave, if necessary, should be a design refinement to apply all the lessons learned along the way and to move critical pieces to the latest process where it makes sense.

Also volumes are going to be fraction of a fraction of Xeons. Like Xeon Phi, and HBM versions. Like they do things that sometimes don't make sense to us like BGA 300W+ chips but it's very specific and focused at the market. To say such chips have things in common is not entirely false but quite distant.
One of Intel’s problem is that they make too many parts. It’s as if the engineering teams and the architects don’t collaborate much with each other. It explains lot of the delays and constant redefinition/rework. They need to focus on making fewest number of parts with ample engineering resources and focus dedicated to each part, and then stitching these parts together as needed into products, no matter whether they are high volume or not.

This is the future, and it's not very distant. The entire client platform with Meteor Lake is moving to a disaggregated design using Foveros and Foveros Omni. Different-size CPU and GPU chiplets will be able to "plug into" a common base tile to build products across the entire client range. This is coming next year! Why would it not make sense for the server chips, which actually have much smaller volumes? With the "bridge" series of chips, Intel already has extensive experience building very high-TDP parts. Foveros Direct is coming next year to further facilitate this while reducing power.
 

ashFTW

Senior member
Sep 21, 2020
Ian says Granite Rapids is using HD libraries, which don't exist for Intel 4.
If true, that’s indeed good news! As i said on my tweets, I expected Sierra Forest to be built like that, but not Granite Rapids. But with such large number of cores, chasing very high single thread performance makes little sense. This could have been one of the main reasons GNR was moved to Intel 3.
 

ashFTW

Senior member
Sep 21, 2020
Considering Granite Rapids with a new core is only a 10-12% gain, and given the bit from MLID about Pat focusing more on execution over risky designs, I don't know if I expect density gains out of 20A.
Watch this Applied Materials video to see how density gains are achieved automatically via backside power delivery DTCO, without changes to lithography. The video is long, but quite informative!

This might be all Intel does for a density increase in 20A. With 18A they will have 🤞 high-NA EUV to further address density.

 

lobz

Platinum Member
Feb 10, 2017
That would still be a flat doubling over Genoa/Bergamo, which is 96/128c for Zen 4/4c.

And it makes sense for that to be unrealistic. They have a presumably large architectural change (generally means bigger die/more transistors), but only a minor density improvement from N4, and are limited to the same socket. Where would they get the space for double the cores?

To try to get back on topic, Granite Rapids vs Turin is shaping up to be much more interesting than I expected. I was originally thinking we'd see a matchup in late 2023 between RWC-based Granite Rapids on Intel 4 and Zen5-based Turin on N3, which (for a healthy/more normal N3) would be a beatdown. Instead we're going to get rough process parity (Intel 3 vs TSMC 4) and probably Lion Cove vs Zen 5. Should be a much more "even" match up.

As for core count, I'm expecting Turin and Granite Rapids to be pretty similar at the end of the day, probably in the ballpark of 100-150 cores (hopefully towards the top end) for the max config. Also, I have no idea why people are referencing SPR's topology. Intel has shown Granite Rapids diagrams that are at least close enough.
I fully understand your scepticism this time, but I'm not willing to give up on the possibility just yet. We'll probably be a bit more informed towards the end of the year; I can't wait 🙂 The coming period could shape up to be a tiny revolution in itself, where all the ambitious packaging techniques, new interconnects, EUV becoming more prevalent, and the inevitable move away from fins in process tech are all finally starting to become more realistic - in terms of commercial use, I mean.
 

nicalandia

Diamond Member
Jan 10, 2019
And what does that have to do with anything? No benchmarks, nothing, just some pictures that may not even be shown at the same scale.
Just some die annotations for comparison; benchmarks are not required, as they share the same core (Golden Cove) and will perform about the same on standard benchmarks. What is telling is how much larger the server core is compared to the client one. I think Intel, in order to stand out from the competition, made design choices that will hurt them in the long run, since die area is quite expensive.
 

jpiniero

Lifer
Oct 1, 2010
Intel Xeon W-3433 16C/32T based on Intel Fishhawk Falls HEDT



I am currently checking to see what type of CPU it is (an Ice Lake Refresh or a Sapphire Rapids)

Definitely Sapphire.

Edit: The Xeon W-3323 has the full 8 channels, so this is a 4-tile product. I wonder if Intel is going to ensure that the core counts are equal on each tile.
 

eek2121

Diamond Member
Aug 2, 2005
Good luck trying to find AMD server hardware. At least where I live (U.A.E.), asking for AMD is met with blank stares. Even the Azure data centers here have only the EPYC-based HBv2 series available, with no option other than 120 cores.
Dell's lead time is 3-4 weeks for non-Milan-X SKUs.
Intel Xeon W-3433 16C/32T based on Intel Fishhawk Falls HEDT



I am currently checking to see what type of CPU it is (an Ice Lake Refresh or a Sapphire Rapids)

Probably the most intriguing part coming out of Intel for the next 6 months, in my eyes. Hopefully they can scale the core counts up to something decent for HEDT. Too bad they won't have a desktop variant.
 

nicalandia

Diamond Member
Jan 10, 2019
Hopefully they can scale the core counts up to something decent for HEDT. Too bad they won't have a desktop variant.
Their HEDT segment is dead. It's called Xeon Workstation now.

This new Xeon W-3433 is the natural successor to the Ice Lake Xeon W-3335 (also 16 cores), which lags behind the Zen 2 3955WX. This new W-3433 might be a good match for the 5955WX, but will be no match for the 7950WX.
