Discussion Intel current and future Lakes & Rapids thread


Geddagod

Golden Member
Dec 28, 2021
1,057
888
96
I think the subject of "optimal" chiplet sizes deserves a bit more nuance. A larger chiplet size decreases yields, yes, but it also means less interconnect overhead (in both area and power) and a larger L3 domain (particularly useful for VM bucketing). AMD's solution is empirically successful, but I don't think it's necessarily the only viable path. And obviously that equilibrium is heavily dependent on what packaging tech is available.

But I agree with IntelUser2000 here that the specifics of their chiplet implementation aren't Intel's main problem right now. Sure, it weighs on their financials, but if they end up PnP competitive with AMD, they can at least get decent revenue. And obviously, since Intel also fabs them, their effective wafer prices should be substantially cheaper than what AMD sees. Though with their talk of an "internal foundry model", that tradeoff might change somewhat.
It certainly seems to help that AMD doesn't use super expensive packaging methods for their interconnects, such as giant interposers, at least for their EPYC CPUs. I also suspect the economics of packaging a bunch of chiplets place an effective cap on how many chiplets AMD can add before they are forced to start increasing the number of cores on each chiplet (beyond power and engineering limitations, I mean), but I don't think they are approaching that crossroads with Zen 4 Genoa quite yet.
 

Hitman928

Diamond Member
Apr 15, 2012
4,838
6,979
136
Thanks. That's an interesting slide. It's been a while since I heard the 0.5 number, but it encompassed TSMC at the time as well, so I'm curious about the disconnect. Perhaps 0.5 is the earliest possible time, but for TSMC's lead customers (Apple, historically Huawei), they need better, pushing back the actual start of volume production.

Funny enough, I once heard Cannon Lake's DD number some years back. Not going to repeat it precisely, but let's just say that decimal point is going a long way to the right.

Are you sure you weren't thinking 0.15 D0? At 0.5 D0, you would get something less than 20% yield on a decently large die (e.g. 400 mm2). Or maybe 0.5 D0 is approximately when they are entering risk production?
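For anyone who wants to check the math on that, here's a minimal sketch using the simple Poisson yield model (Murphy's and negative-binomial models, which fabs also quote, come out a bit higher for large dies), with the 400 mm2 die size from above:

Code:
import math

def poisson_yield(area_mm2, d0_per_cm2):
    """Fraction of dies that come out with zero defects."""
    return math.exp(-(area_mm2 / 100.0) * d0_per_cm2)

for d0 in (0.5, 0.15):
    y = poisson_yield(400, d0)   # the 400 mm2 die from the post above
    print(f"D0 = {d0:>4}/cm^2 -> {y:.0%} defect-free")

# D0 =  0.5/cm^2 -> 14% defect-free
# D0 = 0.15/cm^2 -> 55% defect-free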
 

Exist50

Platinum Member
Aug 18, 2016
2,445
3,043
136
Are you sure you weren't thinking 0.15 D0? At 0.5 D0, you would get something less than 20% yield on a decently large die (e.g. 400 mm2). Or maybe 0.5 D0 is approximately when they are entering risk production?
Nah, definitely meant 0.5 for volume, but again, been a while since I had that conversation.
At 0.5 D0, you would get something less than 20% yield on a decently large die (e.g. 400 mm2).
That's assuming no recovery, right? And 400mm2 is also pretty large for the first die on a new process. The numbers seem pretty reasonable.
 

Hitman928

Diamond Member
Apr 15, 2012
4,838
6,979
136
Nah, definitely meant 0.5 for volume, but again, been a while since I had that conversation.

That would be a very low standard for volume production, but then again, Intel has been burning cash like crazy recently.

That's assuming no recovery, right? And 400mm2 is also pretty large for the first die on a new process. The numbers seem pretty reasonable.

No recovery, but even so, 400 mm2 is only about half the max reticle size. Once you open up to customers and say you are ready for high volume, you had better be ready for bigger designs, because those are your biggest customers. If you take something much more mainstream like Apple's A14/A15 processors and put it through a fab with 0.5 D0, you'd be talking around 60% yield. I don't think Apple would be too happy about that. Yes, there will be redundancies in place on the die for defects, but it really depends on where the defect lands, so it's not like you can use redundancies to get back close to 100% yield when you start so low. Plus, your die will have to be bigger to begin with because of the added redundancies. Higher yield means you don't have to worry as much, so you can go lighter on the redundancies.

Even Nvidia's A100 monster GPU at over 800 mm2 should be yielding somewhere around 45% on TSMC (before redundancies). On a 0.5 D0 process, you'd be looking at around 5%.
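As a rough cross-check of those figures, here's a sketch using Murphy's yield model; the die sizes are public estimates (GA100 ~826 mm2, A15 ~108 mm2) and the ~0.1/cm2 "mature N7" defect density is my own assumption:

Code:
import math

def murphy_yield(area_mm2, d0_per_cm2):
    # Murphy's model: Y = ((1 - exp(-A*D0)) / (A*D0))^2, with A in cm^2
    x = (area_mm2 / 100.0) * d0_per_cm2   # mean defects per die
    return ((1 - math.exp(-x)) / x) ** 2

print(f"A100 (~826 mm2) at D0=0.1: {murphy_yield(826, 0.1):.0%}")  # ~46%
print(f"A100 (~826 mm2) at D0=0.5: {murphy_yield(826, 0.5):.0%}")  # ~6%
print(f"A15  (~108 mm2) at D0=0.5: {murphy_yield(108, 0.5):.0%}")  # ~60%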
 

Exist50

Platinum Member
Aug 18, 2016
2,445
3,043
136
If you take something much more mainstream like Apple's A14/A15 processors and put it through a fab with 0.5 D0, you'd be talking around 60% yield. I don't think Apple would be too happy about that. Yes, there will be redundancies in place on the die for defects
As I said, Apple likely has a higher bar, but they're an interesting reference point because they're extremely aggressive with redundant hardware and recoverability.
Even Nvidia's A100 monster GPU at over 800 mm2 should be yielding somewhere around 45% on TSMC (before redundancies). On a 0.5 D0 process, you'd be looking at around 5%.
The A100 is quite significantly cut down. 6912/8192 shaders. And that's the max config and ignoring any redundancy mechanisms. I don't think that 0.5 number is nearly as unrealistic as you make it out to be. Actually, that 5% number reminds me of early GK110 rumors. Nvidia's flagship dies have long pushed the limits.
 

Exist50

Platinum Member
Aug 18, 2016
2,445
3,043
136
1.875MB of L3, combined with 2MB of private? L2 for GLC. SNC had 1.5MB of L3 combined with 1.25MB of shared? L2 cache. Seems like a decent enough uplift from the previous architecture. Certainly weird, though, that the L3 is smaller than the L2.
Don't private caches have less effective space than a shared cache when cores are working on something that needs the same data? Since the data has to be 'replicated' in both cores' private L2 caches, but in a shared cache there just has to be one instance of it? Might be a factor in why L2 increased proportionally more than L3 did between generations. Could be totally off base on this though.
Both SNC and GLC have private L2. And yes, shared cache can get you more effective capacity than a bunch of private caches, but you do suffer latency and power overhead, plus managing interference. I don't think the rumored increase is entirely out of the question, but it does seem like an odd choice for such an incremental product. I'm very curious how large the die is...
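To put a rough number on the shared-vs-private capacity point, here's a toy model; the core count, per-core SRAM budget, and 50% sharing fraction are made-up illustration values, nothing to do with GLC or Zen specifically:

Code:
def max_ws_private(total_kb, cores):
    # Each core must fit its whole working set (shared lines included) in its own cache.
    return total_kb / cores

def max_ws_shared(total_kb, cores, shared_frac):
    # Unique footprint: cores * w * (1 - shared_frac) + w * shared_frac <= total_kb
    return total_kb / (cores * (1 - shared_frac) + shared_frac)

total_kb, cores = 8 * 2048, 8          # 8 cores, 2 MB of SRAM budget per core
print(max_ws_private(total_kb, cores))              # 2048.0 KB per core
print(round(max_ws_shared(total_kb, cores, 0.5)))   # ~3641 KB per core if half the data is shared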
 

Hitman928

Diamond Member
Apr 15, 2012
4,838
6,979
136
As I said, Apple likely has a higher bar, but they're an interesting reference point because they're extremely aggressive with redundant hardware and recoverability.

The A100 is quite significantly cut down. 6912/8192 shaders. And that's the max config and ignoring any redundancy mechanisms. I don't think that 0.5 number is nearly as unrealistic as you make it out to be. Actually, that 5% number reminds me of early GK110 rumors. Nvidia's flagship dies have long pushed the limits.

I can tell you that none of the companies I worked for would ever go into volume manufacturing with yield rates like this, not unless the fab was basically willing to eat a large portion of dies with defects or give the equivalent discount on wafer prices to bring the effective yield to be in line with industry standards. This would not be a winning proposition for the foundry though.
 

jpiniero

Lifer
Oct 1, 2010
14,178
4,969
136
I can tell you that none of the companies I worked for would ever go into volume manufacturing with yield rates like this, not unless the fab was basically willing to eat a large portion of dies with defects or give the equivalent discount on wafer prices to bring the effective yield to be in line with industry standards. This would not be a winning proposition for the foundry though.

But are you talking about products that would be 100% good chips only, or accounting for the cut-down ones too?
 

Exist50

Platinum Member
Aug 18, 2016
2,445
3,043
136
I can tell you that none of the companies I worked for would ever go into volume manufacturing with yield rates like this
Well again, those yield numbers are ignoring all binning and recovery mechanisms. You're not going to be making an 800mm2 die on a bleeding edge node without some way to sell the partially-broken ones. As mentioned, Nvidia doesn't even sell a fully enabled config.

For that matter, we have no real way of knowing what their volume breakdown is over the life of the product. Do customers buy the A100 just at release (like phone sales), or do they typically wait a while? Enterprise tends to do the latter. Given the yield curves posted above, shifting the volume a few months would make a large impact on yields.
or give the equivalent discount on wafer prices to bring the effective yield to be in line with industry standards
Well that would surely be a given. The customers aren't going to be the ones paying for poor yields. As you say, no company would agree to that, at least not for a proven node. But that's not how foundry pricing is generally done anyways.
 

BorisTheBlade82

Senior member
May 1, 2020
638
956
106
BS Buster is a second-tier leaker at best, in line with RGT, if he had any legitimate leaks before.
To be honest, for me BS Buster is not a leaker at all. He is just some other guy quoting somebody else's tweets and interpreting them.
So basically the same as me (and maybe you) - but with a bit more followers.

Wait, so this lego thing is not happening? Could it be because of the complexity of interconnecting all these tiles?




BTW, not having everything under one hood brings, besides those interconnection losses, more flexibility in how the system can be configured...
I still do not believe that the interconnects are the root cause of Intel's struggles.
 

Geddagod

Golden Member
Dec 28, 2021
1,057
888
96
Someone might want to double check me on this (again), but even comparing L1i$, RWC and Zen 4 seem pretty similar in terms of area used. The same pattern repeats in the 512KB L2 SRAM blocks of RWC vs Zen 4. It seems to me that despite Zen 4 having an on-paper density advantage of twenty-something percent, AMD, most likely due to design choices, isn't taking advantage of it.
This is also really similar to Zen 3 having a nearly 20% density advantage over Intel 7 HCC SRAM, but its L2 actually being 7% larger than GLC's, and only 7% less dense without padding.
 

Hitman928

Diamond Member
Apr 15, 2012
4,838
6,979
136
But are you talking about products that would be 100% good chips only, or accounting for the cut-down ones too?

Defect-free chips, though silicon quality is another factor that isn't being considered here.

Well again, those yield numbers are ignoring all binning and recovery mechanisms. You're not going to be making an 800mm2 die on a bleeding edge node without some way to sell the partially-broken ones. As mentioned, Nvidia doesn't even sell a fully enabled config.

For that matter, we have no real way of knowing what their volume breakdown is over the life of the product. Do customers buy the A100 just at release (like phone sales), or do they typically wait a while? Enterprise tends to do the latter. Given the yield curves posted above, shifting the volume a few months would make a large impact on yields.

Well that would surely be a given. The customers aren't going to be the ones paying for poor yields. As you say, no company would agree to that, at least not for a proven node. But that's not how foundry pricing is generally done anyways.

My point is that no fab will continue to exist shipping 0.5 D0 at volume, even with redundancies and cut down SKUs taken into consideration. Those are risk production numbers at best which no one uses for volume. If the foundry wants to financially cover that defect density at volume, then they could probably persuade customers to use them, but they're not going to be financially viable long term. If the argument is that the defect density will decrease over time, then you have to keep in mind that you have a new node every few years and if you are releasing to volume at that defect rate then you are multiplying your losses every few years to try and keep up. This is not a winning business model.
 

Doug S

Platinum Member
Feb 8, 2020
2,018
3,099
106
I can tell you that none of the companies I worked for would ever go into volume manufacturing with yield rates like this, not unless the fab was basically willing to eat a large portion of dies with defects or give the equivalent discount on wafer prices to bring the effective yield to be in line with industry standards. This would not be a winning proposition for the foundry though.

Were the companies you were working for interested in using leading edge processes the moment they go mass production? There are compromises involved in going leading edge; you can insist on (and get) higher yields when you are designing for trailing edge.
 

Hitman928

Diamond Member
Apr 15, 2012
4,838
6,979
136
Were the companies you were working for interested in using leading edge processes the moment they go mass production? There are compromises involved in going leading edge; you can insist on (and get) higher yields when you are designing for trailing edge.

No, we're N-1 at best. Most of my work is further back than that or on III-V processes. But we still get the PDKs in house to evaluate and discuss yields and costs and such with the foundries. Apple is always on the leading edge and is the biggest customer there but they aren't accepting 0.5 D0, I guarantee you that. Since they work with TSMC they don't have to either as TSMC doesn't release to volume at such a high defect density.

Edit: So what would make sense to me is if the D0 = 0.5 threshold is a foundry's baseline requirement to enter risk production and start taking engineering runs from key customers. This would make much more sense, as the specs and performance for the process should be pretty much locked in by then. It would give the key customers a decent number of chips for testing to prepare for the actual volume runs, and the foundry the chance to run additional volume for data and to finish refining the flow for volume.
 

BorisTheBlade82

Senior member
May 1, 2020
638
956
106
While I have more of a conscious incompetence on this topic, I always gathered that anything above 0.2 is just not financially feasible. And @Hitman928 at least gives me every reason to believe he possesses conscious competence in this regard.

/edit: Just try any of the free yield calculators on a 300mm wafer with any Intel/AMD die you like and play around with defect densities from 0.5 to 0.05 - you might be quite shocked. And 0.05 is the ballpark where I expect TSMC N5 to be right now.
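A quick version of that exercise in a few lines of Python: simple Poisson yield, the usual dies-per-wafer approximation, and an arbitrary 200 mm2 die as a stand-in, so the absolute numbers are only illustrative:

Code:
import math

def dies_per_wafer(area_mm2, wafer_mm=300.0):
    # Standard approximation: wafer area / die area minus an edge-loss term.
    r = wafer_mm / 2
    return int(math.pi * r**2 / area_mm2 - math.pi * wafer_mm / math.sqrt(2 * area_mm2))

def poisson_yield(area_mm2, d0):
    return math.exp(-(area_mm2 / 100.0) * d0)

AREA = 200.0   # stand-in for a mid-sized client/server die, pick your own
for d0 in (0.5, 0.2, 0.1, 0.05):
    total = dies_per_wafer(AREA)
    good = total * poisson_yield(AREA, d0)
    print(f"D0={d0:<4}: {total} dies/wafer, ~{good:.0f} defect-free")

# D0=0.5 : 306 dies/wafer, ~113 defect-free
# D0=0.2 : 306 dies/wafer, ~205 defect-free
# D0=0.1 : 306 dies/wafer, ~251 defect-free
# D0=0.05: 306 dies/wafer, ~277 defect-free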
 

Exist50

Platinum Member
Aug 18, 2016
2,445
3,043
136
My point is that no fab will continue to exist shipping 0.5 D0 at volume, even with redundancies and cut down SKUs taken into consideration.
I mean, you can do the math. 0.5DD with a moderately sized die and/or decent recovery options gives plenty of usable dies. Obviously, lower defect density would be better, but you're acting like it's insanity.

For modern high complexity SoCs, there's a lot of effort put into defect resiliency and redundancy. Needing to disable a component outright is one of the last lines of defense.
 

Exist50

Platinum Member
Aug 18, 2016
2,445
3,043
136
While I have more of a conscious incompetence on this topic, I always gathered that anything above 0.2 is just not financially feasible. And @Hitman928 at least gives me every reason to believe he possesses conscious competence in this regard.

/edit: Just try any of the free yield calculators on a 300mm wafer with any Intel/AMD die you like and play around with defect densities from 0.5 to 0.05 - you might be quite shocked. And 0.05 is the ballpark where I expect TSMC N5 to be right now.
Have you found a calculator that lets you specify some metric of defect tolerance? Perfect dies might be comparatively rare, but you don't need perfect dies. Quite common to have redundant interface wires, array bits, etc. so you can tolerate a failure without even compromising the product. Good odds AMD does the same for the 3D V-cache contacts as well.
 

Hitman928

Diamond Member
Apr 15, 2012
4,838
6,979
136
I mean, you can do the math. 0.5DD with a moderately sized die and/or decent recovery options gives plenty of usable dies. Obviously, lower defect density would be better, but you're acting like it's insanity.

Because in the corporate world, it kind of is. The only way you could make that work at volume is if your process was so far ahead of anyone else that customers are willing to take on the additional costs to get the competitive advantage. Even then, there will be limits because the chip designer's end customers will only tolerate so much price increase, no matter how much better something may be.

And again, redundancies in the chip only get you so far and cause additional increases in cost themselves. Making chips, especially on leading nodes, is crazy expensive. No one wants to be making it worse by going to a foundry with well above industry standard defect rates.
 

Exist50

Platinum Member
Aug 18, 2016
2,445
3,043
136
Because in the corporate world, it kind of is. The only way you could make that work at volume is if your process was so far ahead of anyone else that customers are willing to take on the additional costs to get the competitive advantage. Even then, there will be limits because the chip designer's end customers will only tolerate so much price increase, no matter how much better something may be.

And again, redundancies in the chip only get you so far and cause additional increases in cost themselves. Making chips, especially on leading nodes, is crazy expensive. No one wants to be making it worse by going to a foundry with well above industry standard defect rates.
Again, you seem to be assuming A) the cost of poor initial yields will be paid by the first customer(s) and B) the yields are terrible (like your 5% number). Neither is true. All fabs charge an amortized rate, factoring in competition, otherwise who would volunteer to be first? And again, just ballpark the math. See what the yields are for, say, a typical 100mm2 phone chip if you can tolerate 1-2 defects per die. No one's trying to make a perfect 800mm2 die.
 

Hitman928

Diamond Member
Apr 15, 2012
4,838
6,979
136
Again, you seem to be assuming A) the cost of poor initial yields will be paid by the first customer(s) and B) the yields are terrible (like your 5% number). Neither is true. All fabs charge an amortized rate, factoring in competition, otherwise who would volunteer to be first? And again, just ballpark the math. See what the yields are for, say, a typical 100mm2 phone chip if you can tolerate 1-2 defects per die. No one's trying to make a perfect 800mm2 die.

No, I'm assuming that no one would use a foundry with terrible yields when they have much better options. Your example is way over simplifying it too. Do you know what it takes for any given design to be able to tolerate 2 defects per die, no matter their location?
 

Exist50

Platinum Member
Aug 18, 2016
2,445
3,043
136
No, I'm assuming that no one would use a foundry with terrible yields when they have much better options.
Again, I urge you to actually look at the numbers for a realistic product, instead of an imaginary perfect 800mm2 die that no one cares about making. What do you think the yields would be for, say, a phone chip? Because a square, 100mm2 die (approximate high end phone chip size) at 0.5DD gets you ~62% with no recovery at all. And how many of those failing dies would just have one defect somewhere? If even half of those dies are recoverable, then we're talking 80% yields, or 20% less than perfect.

If you don't think that's remotely viable for the first quarter of production on a leading node, then what is your criteria, so we can talk real numbers?
Do you know what it takes for any given design to be able to tolerate 2 defects per die, no matter their location?
You don't have to be able to tolerate defects in any location; it's all probabilistic. You add redundancy to particularly sensitive circuits, large arrays, etc. And you combine that with other recovery techniques like downbinning to end up with your final product mix. These techniques have been commonplace for ages now.
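Spelling out the arithmetic behind the 62%/80% claim above: a plain Poisson model lands at ~61% rather than 62%, and the "half of failing dies are recoverable" figure is just the illustrative assumption from that paragraph, not a measured number:

Code:
import math

def poisson_pmf(k, mean):
    return mean**k * math.exp(-mean) / math.factorial(k)

mean_defects = (100 / 100.0) * 0.5      # 100 mm2 die at D0 = 0.5/cm^2
p0 = poisson_pmf(0, mean_defects)       # defect-free dies
p1 = poisson_pmf(1, mean_defects)       # exactly one defect

print(f"zero defects:       {p0:.0%}")                                    # ~61%
print(f"exactly one defect: {p1:.0%}")                                    # ~30%
print(f"perfect + half of defective recovered: {p0 + 0.5*(1 - p0):.0%}")  # ~80%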
 

Hitman928

Diamond Member
Apr 15, 2012
4,838
6,979
136
Again, I urge you to actually look at the numbers for a realistic product, instead of an imaginary perfect 800mm2 die that no one cares about making. What do you think the yields would be for, say, a phone chip? Because a square, 100mm2 die (approximate high end phone chip size) at 0.5DD gets you ~62% with no recovery at all. And how many of those failing dies would just have one defect somewhere? If even half of those dies are recoverable, then we're talking 80% yields, or 20% less than perfect.

If you don't think that's remotely viable for the first quarter of production on a leading node, then what is your criteria, so we can talk real numbers?

Industry standard is below 0.2 based upon my experience with the foundries I've worked with.

You don't have to be able to tolerate defects in any location; it's all probabilistic. You add redundancy to particularly sensitive circuits, large arrays, etc. And you combine that with other recovery techniques like downbinning to end up with your final product mix. These techniques have been commonplace for ages now.

I'll put it this way: if I'm a manager looking at using a process with 0.1 D0 versus 0.5 D0, with an estimated die size of 100 mm2, then I'm looking at an estimated yield rate of 91% versus 62%. Now, say my acceptable yield rate is 95%. For each process I have to analyze how much extra design, verification, and test time I need to budget, as well as the area increase for each design, to hit that acceptable yield rate.

Additionally, things like cache can be made redundant fairly easily; things like IO and logic, not so much. So how much of my design is cache versus logic? How much is interconnect and IO? How many "units" (e.g. cores) of my design can I afford to lose and still have a salvageable product? If I get a defect or two and they happen to land in the IO or the interconnect, is it even recoverable or salvageable at all?

Basically, how much more expensive is my chip going to be after all the additional design, validation, and added area needed to hit acceptable yields? How much of that is NRE and how much is ongoing manufacturing cost? And what if I look at all the additional costs with the worse fab and figure out that, for the same money, I could just go with the better fab and increase my base design size and design time to come up with an even more competitive product at the same price?

So yes, you can recover and salvage, but this isn't wave-of-a-wand stuff that happens for free, and the worse the defect rate, the more headaches you have and the less competitive a product you get.
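To illustrate the shape of that trade-off, here is a toy cost-per-sellable-die comparison. The wafer price, the ~8% redundancy area padding, and the recovery fractions are all hypothetical placeholders, not real figures from any fab or design:

Code:
import math

WAFER_COST = 17000.0   # hypothetical leading-edge wafer price, USD
WAFER_MM = 300.0

def dies_per_wafer(area_mm2):
    r = WAFER_MM / 2
    return int(math.pi * r**2 / area_mm2 - math.pi * WAFER_MM / math.sqrt(2 * area_mm2))

def cost_per_sellable_die(area_mm2, d0, recovery_frac):
    y0 = math.exp(-(area_mm2 / 100.0) * d0)     # defect-free fraction (Poisson)
    sellable = y0 + recovery_frac * (1 - y0)    # plus recovered/salvaged dies
    return WAFER_COST / (dies_per_wafer(area_mm2) * sellable)

# Better fab: 100 mm2 die, D0 = 0.1, modest reliance on recovery.
print(f"${cost_per_sellable_die(100, 0.1, 0.3):.0f} per sellable die")   # ~$28
# Worse fab: same design padded ~8% with extra redundancy, D0 = 0.5,
# leaning much harder on recovery.
print(f"${cost_per_sellable_die(108, 0.5, 0.5):.0f} per sellable die")   # ~$36

Even with the heavier recovery, the padded die on the worse fab comes out roughly 30% more expensive per sellable unit in this toy setup, and that's before counting any of the extra design and validation NRE.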