Discussion Intel current and future Lakes & Rapids thread


Geddagod

Golden Member
Dec 28, 2021
1,057
888
96
I think the subject of "optimal" chiplet sizes deserves a bit more nuance. A larger chiplet size decreases yields, yes, but it also means less interconnect overhead (in both area and power) and a larger L3 domain (particularly useful for VM bucketing). AMD's solution is empirically successful, but I don't think it's necessarily the only viable path. And obviously that equilibrium is heavily dependent on what packaging tech is available.

But I agree with IntelUser2000 here that the specifics of their chiplet implementation aren't Intel's main problem right now. Sure, it weighs on their financials, but if they end up PnP competitive with AMD, they can at least get decent revenue. And obviously, since Intel also fabs them, their effective wafer prices should be substantially cheaper than what AMD sees. Though with their talk of an "internal foundry model", that tradeoff might change somewhat.
It certainly seems to help that AMD doesn't use super expensive packaging methods for their interconnects, such as giant interposers, at least for their EPYC CPUs. I also suspect the economics of packaging a bunch of chiplets place an effective cap on how many chiplets AMD can add before they are forced to start increasing the number of cores on each chiplet (beyond power and engineering limitations, I mean), but I don't think they are approaching that crossroads with Zen 4 Genoa quite yet.
 

Hitman928

Diamond Member
Apr 15, 2012
4,838
6,979
136
Thanks. That's an interesting slide. It's been a while since I heard the 0.5 number, but it encompassed TSMC at the time as well, so I'm curious about the disconnect. Perhaps 0.5 is the earliest possible time, but for TSMC's lead customers (Apple, historically Huawei), they need better, pushing back the actual start of volume production.

Funny enough, I once heard Cannon Lake's DD number some years back. Not going to repeat it precisely, but let's just say that decimal point is going a long way to the right.

Are you sure you weren't thinking 0.15 D0? At 0.5 D0, you would get something less than 20% yield on a decently large die (e.g. 400 mm2). Or maybe 0.5 D0 is approximately when they are entering risk production?
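For anyone who wants to check the math on that, here's a minimal sketch using the simple Poisson yield model (Murphy's and negative-binomial models, which fabs also quote, come out a bit higher for large dies), with the 400 mm2 die size from above:

Code:
import math

def poisson_yield(area_mm2, d0_per_cm2):
    """Fraction of dies that come out with zero defects."""
    return math.exp(-(area_mm2 / 100.0) * d0_per_cm2)

for d0 in (0.5, 0.15):
    y = poisson_yield(400, d0)   # the 400 mm2 die from the post above
    print(f"D0 = {d0:>4}/cm^2 -> {y:.0%} defect-free")

# D0 =  0.5/cm^2 -> 14% defect-free
# D0 = 0.15/cm^2 -> 55% defect-free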
 

Exist50

Platinum Member
Aug 18, 2016
2,445
3,043
136
Are you sure you weren't thinking 0.15 D0? At 0.5 D0, you would get something less than 20% yield on a decently large die (e.g. 400 mm2). Or maybe 0.5 D0 is approximately when they are entering risk production?
Nah, definitely meant 0.5 for volume, but again, been a while since I had that conversation.
At 0.5 D0, you would get something less than 20% yield on a decently large die (e.g. 400 mm2).
That's assuming no recovery, right? And 400mm2 is also pretty large for the first die on a new process. The numbers seem pretty reasonable.
 

Hitman928

Diamond Member
Apr 15, 2012
4,838
6,979
136
Nah, definitely meant 0.5 for volume, but again, been a while since I had that conversation.

That would be a very low standard for volume production, but then again, Intel has been burning cash like crazy recently.

That's assuming no recovery, right? And 400mm2 is also pretty large for the first die on a new process. The numbers seem pretty reasonable.

No recovery, but even so, 400 mm2 is only about half the max reticle size. Once you open up to customers and say you are ready for high volume, you had better be ready for bigger designs, because those are your biggest customers. If you take something much more mainstream like Apple's A14/A15 processors and put it through a fab with 0.5 D0, you'd be talking around 60% yield. I don't think Apple would be too happy about that. Yes, there will be redundancies in place on the die for defects, but it really depends on where the defect lands, so it's not like you can use redundancies to get back close to 100% yield when you start so low. Plus, your die will have to be bigger to begin with because of the added redundancies. Higher yield means you don't have to worry as much, so you can go lighter on the redundancies.

Even Nvidia's A100 monster GPU at over 800 mm2 should be yielding somewhere around 45% on TSMC (before redundancies). On a 0.5 D0 process, you'd be looking at around 5%.
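As a rough cross-check of those figures, here's a sketch using Murphy's yield model; the die sizes are public estimates (GA100 ~826 mm2, A15 ~108 mm2) and the ~0.1/cm2 "mature N7" defect density is my own assumption:

Code:
import math

def murphy_yield(area_mm2, d0_per_cm2):
    # Murphy's model: Y = ((1 - exp(-A*D0)) / (A*D0))^2, with A in cm^2
    x = (area_mm2 / 100.0) * d0_per_cm2   # mean defects per die
    return ((1 - math.exp(-x)) / x) ** 2

print(f"A100 (~826 mm2) at D0=0.1: {murphy_yield(826, 0.1):.0%}")  # ~46%
print(f"A100 (~826 mm2) at D0=0.5: {murphy_yield(826, 0.5):.0%}")  # ~6%
print(f"A15  (~108 mm2) at D0=0.5: {murphy_yield(108, 0.5):.0%}")  # ~60%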
 

Exist50

Platinum Member
Aug 18, 2016
2,445
3,043
136
If you take something much more mainstream like Apple's A14/A15 processors and put it through a fab with 0.5 D0, you'd be talking around 60% yield. I don't think Apple would be too happy about that. Yes, there will be redundancies in place on the die for defects
As I said, Apple likely has a higher bar, but they're an interesting reference point because they're extremely aggressive with redundant hardware and recoverability.
Even Nvidia's A100 monster GPU at over 800 mm2 should be yielding somewhere around 45% on TSMC (before redundancies). On a 0.5 D0 process, you'd be looking at around 5%.
The A100 is quite significantly cut down. 6912/8192 shaders. And that's the max config and ignoring any redundancy mechanisms. I don't think that 0.5 number is nearly as unrealistic as you make it out to be. Actually, that 5% number reminds me of early GK110 rumors. Nvidia's flagship dies have long pushed the limits.
 

Exist50

Platinum Member
Aug 18, 2016
2,445
3,043
136
1.875MB of L3, combined with 2MB of private? L2 for GLC. SNC had 1.5MB of L3 combined with 1.25MB of shared? L2 cache. Seems like a decent enough uplift from the previous architecture. Certainly weird, though, that the L3 is smaller than the L2.
Don't private caches have less effective space than a shared cache when cores are working on something that needs the same data? Since the data has to be 'replicated' in both cores' private L2 caches, but in a shared cache there just has to be one instance of it? Might be a factor in why L2 increased proportionally more than L3 did between generations. Could be totally off base on this though.
Both SNC and GLC have private L2. And yes, shared cache can get you more effective capacity than a bunch of private caches, but you do suffer latency and power overhead, plus managing interference. I don't think the rumored increase is entirely out of the question, but it does seem like an odd choice for such an incremental product. I'm very curious how large the die is...
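To put a rough number on the shared-vs-private capacity point, here's a toy model; the core count, per-core SRAM budget, and 50% sharing fraction are made-up illustration values, nothing to do with GLC or Zen specifically:

Code:
def max_ws_private(total_kb, cores):
    # Each core must fit its whole working set (shared lines included) in its own cache.
    return total_kb / cores

def max_ws_shared(total_kb, cores, shared_frac):
    # Unique footprint: cores * w * (1 - shared_frac) + w * shared_frac <= total_kb
    return total_kb / (cores * (1 - shared_frac) + shared_frac)

total_kb, cores = 8 * 2048, 8          # 8 cores, 2 MB of SRAM budget per core
print(max_ws_private(total_kb, cores))              # 2048.0 KB per core
print(round(max_ws_shared(total_kb, cores, 0.5)))   # ~3641 KB per core if half the data is shared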
 

Hitman928

Diamond Member
Apr 15, 2012
4,838
6,979
136
As I said, Apple likely has a higher bar, but they're an interesting reference point because they're extremely aggressive with redundant hardware and recoverability.

The A100 is quite significantly cut down. 6912/8192 shaders. And that's the max config and ignoring any redundancy mechanisms. I don't think that 0.5 number is nearly as unrealistic as you make it out to be. Actually, that 5% number reminds me of early GK110 rumors. Nvidia's flagship dies have long pushed the limits.

I can tell you that none of the companies I worked for would ever go into volume manufacturing with yield rates like this, not unless the fab was basically willing to eat a large portion of dies with defects or give the equivalent discount on wafer prices to bring the effective yield to be in line with industry standards. This would not be a winning proposition for the foundry though.
 

jpiniero

Lifer
Oct 1, 2010
14,178
4,969
136
I can tell you that none of the companies I worked for would ever go into volume manufacturing with yield rates like this, not unless the fab was basically willing to eat a large portion of dies with defects or give the equivalent discount on wafer prices to bring the effective yield to be in line with industry standards. This would not be a winning proposition for the foundry though.

But are you talking about products that would be 100% good chips only, or accounting for the cut-down ones too?
 

Exist50

Platinum Member
Aug 18, 2016
2,445
3,043
136
I can tell you that none of the companies I worked for would ever go into volume manufacturing with yield rates like this
Well again, those yield numbers are ignoring all binning and recovery mechanisms. You're not going to be making an 800mm2 die on a bleeding edge node without some way to sell the partially-broken ones. As mentioned, Nvidia doesn't even sell a fully enabled config.

For that matter, we have no real way of knowing what their volume breakdown is over the life of the product. Do customers buy the A100 just at release (like phone sales), or do they typically wait a while? Enterprise tends to do the latter. Given the yield curves posted above, shifting the volume a few months would make a large impact on yields.
or give the equivalent discount on wafer prices to bring the effective yield to be in line with industry standards
Well that would surely be a given. The customers aren't going to be the ones paying for poor yields. As you say, no company would agree to that, at least not for a proven node. But that's not how foundry pricing is generally done anyways.
 

BorisTheBlade82

Senior member
May 1, 2020
638
956
106
BS Buster is a second-tier leaker at best, in line with RGT, if he had any legitimate leaks before.
To be honest, for me BS Buster is not a leaker at all. He is just some other guy quoting somebody else's tweets and interpreting them.
So basically the same as me (and maybe you) - but with a bit more followers.

Wait, so this lego thing is not happening? Could it be because of the complexity of interconnecting all these tiles?




BTW, not having everything under one hood brings, besides those interconnection losses, more flexibility in how the system can be configured...
I still do not believe that the interconnects are the root cause of Intel's struggles.
 

Geddagod

Golden Member
Dec 28, 2021
1,057
888
96
Someone might want to double check me on this (again), but even comparing L1i$, RWC and Zen 4 seem pretty similar in terms of area used. The same pattern repeats in the 512KB L2 SRAM blocks of RWC vs Zen 4. It seems to me that despite Zen 4 having an on-paper density advantage of twenty-something percent, AMD, most likely due to design choices, isn't taking advantage of it.
This is also really similar to Zen 3 having a nearly 20% density advantage over Intel 7 HCC SRAM, but its L2 actually being 7% larger than GLC's, and only 7% less dense without padding.
 

Hitman928

Diamond Member
Apr 15, 2012
4,838
6,979
136
But are you talking about products that would be 100% good chips only, or accounting for the cut-down ones too?

Defect-free chips, though silicon quality is another factor that isn't being considered here.

Well again, those yield numbers are ignoring all binning and recovery mechanisms. You're not going to be making an 800mm2 die on a bleeding edge node without some way to sell the partially-broken ones. As mentioned, Nvidia doesn't even sell a fully enabled config.

For that matter, we have no real way of knowing what their volume breakdown is over the life of the product. Do customers buy the A100 just at release (like phone sales), or do they typically wait a while? Enterprise tends to do the latter. Given the yield curves posted above, shifting the volume a few months would make a large impact on yields.

Well that would surely be a given. The customers aren't going to be the ones paying for poor yields. As you say, no company would agree to that, at least not for a proven node. But that's not how foundry pricing is generally done anyways.

My point is that no fab will continue to exist shipping 0.5 D0 at volume, even with redundancies and cut down SKUs taken into consideration. Those are risk production numbers at best which no one uses for volume. If the foundry wants to financially cover that defect density at volume, then they could probably persuade customers to use them, but they're not going to be financially viable long term. If the argument is that the defect density will decrease over time, then you have to keep in mind that you have a new node every few years and if you are releasing to volume at that defect rate then you are multiplying your losses every few years to try and keep up. This is not a winning business model.
 

Doug S

Platinum Member
Feb 8, 2020
2,018
3,099
106
I can tell you that none of the companies I worked for would ever go into volume manufacturing with yield rates like this, not unless the fab was basically willing to eat a large portion of dies with defects or give the equivalent discount on wafer prices to bring the effective yield to be in line with industry standards. This would not be a winning proposition for the foundry though.

Were the companies you were working for interested in using leading edge processes the moment they go mass production? There are compromises involved in going leading edge; you can insist on (and get) higher yields when you are designing for trailing edge.
 

Hitman928

Diamond Member
Apr 15, 2012
4,838
6,979
136
Were the companies you were working for interested in using leading edge processes the moment they go mass production? There are compromises involved in going leading edge; you can insist on (and get) higher yields when you are designing for trailing edge.

No, we're N-1 at best. Most of my work is further back than that or on III-V processes. But we still get the PDKs in house to evaluate and discuss yields and costs and such with the foundries. Apple is always on the leading edge and is the biggest customer there but they aren't accepting 0.5 D0, I guarantee you that. Since they work with TSMC they don't have to either as TSMC doesn't release to volume at such a high defect density.

Edit: So what would make sense to me is if the D0 = 0.5 threshold is a foundry's baseline requirement to enter risk production and start taking engineering runs from key customers. This would make much more sense, as the specs and performance for the process should be pretty much locked in by then. It would give the key customers a decent number of chips for testing to prepare for the actual volume runs, and the foundry the chance to run additional volume for data and to finish refining the flow for volume.
 

BorisTheBlade82

Senior member
May 1, 2020
638
956
106
While I have more of a conscious incompetence on this topic, I always gathered that anything above 0.2 is just not financially feasible. And @Hitman928 at least gives me every reason to believe he possesses conscious competence in this regard.

/edit: Just try any of the free yield calculators on a 300mm wafer with any Intel/AMD die you like and play around with defect densities from 0.5 to 0.05 - you might be quite shocked. And 0.05 is the ballpark where I expect TSMC N5 to be right now.
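A quick version of that exercise in a few lines of Python: simple Poisson yield, the usual dies-per-wafer approximation, and an arbitrary 200 mm2 die as a stand-in, so the absolute numbers are only illustrative:

Code:
import math

def dies_per_wafer(area_mm2, wafer_mm=300.0):
    # Standard approximation: wafer area / die area minus an edge-loss term.
    r = wafer_mm / 2
    return int(math.pi * r**2 / area_mm2 - math.pi * wafer_mm / math.sqrt(2 * area_mm2))

def poisson_yield(area_mm2, d0):
    return math.exp(-(area_mm2 / 100.0) * d0)

AREA = 200.0   # stand-in for a mid-sized client/server die, pick your own
for d0 in (0.5, 0.2, 0.1, 0.05):
    total = dies_per_wafer(AREA)
    good = total * poisson_yield(AREA, d0)
    print(f"D0={d0:<4}: {total} dies/wafer, ~{good:.0f} defect-free")

# D0=0.5 : 306 dies/wafer, ~113 defect-free
# D0=0.2 : 306 dies/wafer, ~205 defect-free
# D0=0.1 : 306 dies/wafer, ~251 defect-free
# D0=0.05: 306 dies/wafer, ~277 defect-free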
 

Exist50

Platinum Member
Aug 18, 2016
2,445
3,043
136
My point is that no fab will continue to exist shipping 0.5 D0 at volume, even with redundancies and cut down SKUs taken into consideration.
I mean, you can do the math. 0.5DD with a moderately sized die and/or decent recovery options gives plenty of usable dies. Obviously, lower defect density would be better, but you're acting like it's insanity.

For modern high complexity SoCs, there's a lot of effort put into defect resiliency and redundancy. Needing to disable a component outright is one of the last lines of defense.
 

Exist50

Platinum Member
Aug 18, 2016
2,445
3,043
136
While I have more of a conscious incompetence on this topic, I always gathered that anything above 0.2 is just not financially feasible. And @Hitman928 at least gives me every reason to believe he possesses conscious competence in this regard.

/edit: Just try any of the free yield calculators on a 300mm wafer with any Intel/AMD die you like and play around with defect densities from 0.5 to 0.05 - you might be quite shocked. And 0.05 is the ballpark where I expect TSMC N5 to be right now.
Have you found a calculator that lets you specify some metric of defect tolerance? Perfect dies might be comparatively rare, but you don't need perfect dies. Quite common to have redundant interface wires, array bits, etc. so you can tolerate a failure without even compromising the product. Good odds AMD does the same for the 3D V-cache contacts as well.
 

Hitman928

Diamond Member
Apr 15, 2012
4,838
6,979
136
I mean, you can do the math. 0.5DD with a moderately sized die and/or decent recovery options gives plenty of usable dies. Obviously, lower defect density would be better, but you're acting like it's insanity.

Because in the corporate world, it kind of is. The only way you could make that work at volume is if your process was so far ahead of anyone else that customers are willing to take on the additional costs to get the competitive advantage. Even then, there will be limits because the chip designer's end customers will only tolerate so much price increase, no matter how much better something may be.

And again, redundancies in the chip only get you so far and cause additional increases in cost themselves. Making chips, especially on leading nodes, is crazy expensive. No one wants to be making it worse by going to a foundry with well above industry standard defect rates.
 

Exist50

Platinum Member
Aug 18, 2016
2,445
3,043
136
Because in the corporate world, it kind of is. The only way you could make that work at volume is if your process was so far ahead of anyone else that customers are willing to take on the additional costs to get the competitive advantage. Even then, there will be limits because the chip designer's end customers will only tolerate so much price increase, no matter how much better something may be.

And again, redundancies in the chip only get you so far and cause additional increases in cost themselves. Making chips, especially on leading nodes, is crazy expensive. No one wants to be making it worse by going to a foundry with well above industry standard defect rates.
Again, you seem to be assuming A) the cost of poor initial yields will be paid by the first customer(s) and B) the yields are terrible (like your 5% number). Neither is true. All fabs charge an amortized rate, factoring in competition, otherwise who would volunteer to be first? And again, just ballpark the math. See what the yields are for, say, a typical 100mm2 phone chip if you can tolerate 1-2 defects per die. No one's trying to make a perfect 800mm2 die.
 

Hitman928

Diamond Member
Apr 15, 2012
4,838
6,979
136
Again, you seem to be assuming A) the cost of poor initial yields will be paid by the first customer(s) and B) the yields are terrible (like your 5% number). Neither is true. All fabs charge an amortized rate, factoring in competition, otherwise who would volunteer to be first? And again, just ballpark the math. See what the yields are for, say, a typical 100mm2 phone chip if you can tolerate 1-2 defects per die. No one's trying to make a perfect 800mm2 die.

No, I'm assuming that no one would use a foundry with terrible yields when they have much better options. Your example is way over simplifying it too. Do you know what it takes for any given design to be able to tolerate 2 defects per die, no matter their location?
 

Exist50

Platinum Member
Aug 18, 2016
2,445
3,043
136
No, I'm assuming that no one would use a foundry with terrible yields when they have much better options.
Again, I urge you to actually look at the numbers for a realistic product, instead of an imaginary perfect 800mm2 die that no one cares about making. What do you think the yields would be for, say, a phone chip? Because a square, 100mm2 die (approximate high end phone chip size) at 0.5DD gets you ~62% with no recovery at all. And how many of those failing dies would just have one defect somewhere? If even half of those dies are recoverable, then we're talking 80% yields, or 20% less than perfect.

If you don't think that's remotely viable for the first quarter of production on a leading node, then what is your criteria, so we can talk real numbers?
Do you know what it takes for any given design to be able to tolerate 2 defects per die, no matter their location?
You don't have to be able to tolerate defects in any location; it's all probabilistic. You add redundancy to particularly sensitive circuits, large arrays, etc. And you combine that with other recovery techniques like downbinning to end up with your final product mix. These techniques have been commonplace for ages now.
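Spelling out the arithmetic behind the 62%/80% claim above: a plain Poisson model lands at ~61% rather than 62%, and the "half of failing dies are recoverable" figure is just the illustrative assumption from that paragraph, not a measured number:

Code:
import math

def poisson_pmf(k, mean):
    return mean**k * math.exp(-mean) / math.factorial(k)

mean_defects = (100 / 100.0) * 0.5      # 100 mm2 die at D0 = 0.5/cm^2
p0 = poisson_pmf(0, mean_defects)       # defect-free dies
p1 = poisson_pmf(1, mean_defects)       # exactly one defect

print(f"zero defects:       {p0:.0%}")                                    # ~61%
print(f"exactly one defect: {p1:.0%}")                                    # ~30%
print(f"perfect + half of defective recovered: {p0 + 0.5*(1 - p0):.0%}")  # ~80%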
 

Hitman928

Diamond Member
Apr 15, 2012
4,838
6,979
136
Again, I urge you to actually look at the numbers for a realistic product, instead of an imaginary perfect 800mm2 die that no one cares about making. What do you think the yields would be for, say, a phone chip? Because a square, 100mm2 die (approximate high end phone chip size) at 0.5DD gets you ~62% with no recovery at all. And how many of those failing dies would just have one defect somewhere? If even half of those dies are recoverable, then we're talking 80% yields, or 20% less than perfect.

If you don't think that's remotely viable for the first quarter of production on a leading node, then what is your criteria, so we can talk real numbers?

Industry standard is below 0.2 based upon my experience with the foundries I've worked with.

You don't have to be able to tolerate defects in any location; it's all probabilistic. You add redundancy to particularly sensitive circuits, large arrays, etc. And you combine that with other recovery techniques like downbinning to end up with your final product mix. These techniques have been commonplace for ages now.

I'll put it this way: if I'm a manager looking at using a process with 0.1 D0 versus 0.5 D0, with an estimated die size of 100 mm2, then I'm looking at an estimated yield rate of 91% versus 62%. Now, say my acceptable yield rate is 95%. For each process I have to analyze how much extra design, verification, and test time I need to budget, as well as the area increase for each design, to hit that acceptable yield rate.

Additionally, things like cache can be made redundant fairly easily; things like IO and logic, not so much. So how much of my design is cache versus logic? How much is interconnect and IO? How many "units" (e.g. cores) of my design can I afford to lose and still have a salvageable product? If I get a defect or two and they happen to land in the IO or the interconnect, is it even recoverable or salvageable at all?

Basically, how much more expensive is my chip going to be after all the additional design, validation, and added area needed to hit acceptable yields? How much of that is NRE and how much is ongoing manufacturing cost? And what if I look at all the additional costs with the worse fab and figure out that, for the same money, I could just go with the better fab and increase my base design size and design time to come up with an even more competitive product at the same price?

So yes, you can recover and salvage, but this isn't wave-of-a-wand stuff that happens for free, and the worse the defect rate, the more headaches you have and the less competitive a product you get.
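To illustrate the shape of that trade-off, here is a toy cost-per-sellable-die comparison. The wafer price, the ~8% redundancy area padding, and the recovery fractions are all hypothetical placeholders, not real figures from any fab or design:

Code:
import math

WAFER_COST = 17000.0   # hypothetical leading-edge wafer price, USD
WAFER_MM = 300.0

def dies_per_wafer(area_mm2):
    r = WAFER_MM / 2
    return int(math.pi * r**2 / area_mm2 - math.pi * WAFER_MM / math.sqrt(2 * area_mm2))

def cost_per_sellable_die(area_mm2, d0, recovery_frac):
    y0 = math.exp(-(area_mm2 / 100.0) * d0)     # defect-free fraction (Poisson)
    sellable = y0 + recovery_frac * (1 - y0)    # plus recovered/salvaged dies
    return WAFER_COST / (dies_per_wafer(area_mm2) * sellable)

# Better fab: 100 mm2 die, D0 = 0.1, modest reliance on recovery.
print(f"${cost_per_sellable_die(100, 0.1, 0.3):.0f} per sellable die")   # ~$28
# Worse fab: same design padded ~8% with extra redundancy, D0 = 0.5,
# leaning much harder on recovery.
print(f"${cost_per_sellable_die(108, 0.5, 0.5):.0f} per sellable die")   # ~$36

Even with the heavier recovery, the padded die on the worse fab comes out roughly 30% more expensive per sellable unit in this toy setup, and that's before counting any of the extra design and validation NRE.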