Discussion Intel current and future Lakes & Rapids thread


Hitman928

Diamond Member
Apr 15, 2012
6,024
10,352
136
Except Intel...

Like I said before, they've been burning cash like crazy recently so. . . maybe. The only example of something like this I know of is when AMD first spun off GF. The foundry was struggling to keep up with Intel, and as part of their agreement, for the first 2 or 3 years or so, AMD didn't have to pay for defective dies. It was actually a nice little bonus for AMD at the time, even though they were behind Intel on process tech, but it caused GF to lose even more money than they did once it expired. GF was never profitable until they stopped chasing advanced nodes and stuck with more niche nodes that had already reached really good yield rates and addressed somewhat underserved customer needs. GF would have gone bankrupt if it weren't for AMD's contractual obligation to buy from them and Abu Dhabi's determination, and very large bank account, to keep a foothold in the technology race.
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,102
136
I'll put it this way: if I'm a manager looking at a process with a d0 of 0.1 versus one with a d0 of 0.5 (defects/cm2), with an estimated die size of 100 mm2, then I'm looking at an estimated yield of roughly 91% versus 62%.
If you require only perfect dies, which no one does. So let's say half of those defects are recoverable. Then you're at 95% vs 80% (rounding for laziness). In other words, ~20% more good dies from the better process, and with that gap shrinking by the month. You seriously think a company can't ship a product with those numbers?
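
(Aside: those figures line up with Murphy's yield model; a minimal sketch, assuming d0 is in defects/cm2 and that "recoverable" simply halves the effective defect density:)

```python
import math

def murphy_yield(d0_cm2: float, die_area_mm2: float) -> float:
    """Murphy's yield model: Y = ((1 - e^(-A*D)) / (A*D))^2."""
    ad = (die_area_mm2 / 100.0) * d0_cm2  # die area converted to cm^2, times d0
    return ((1.0 - math.exp(-ad)) / ad) ** 2

for d0 in (0.1, 0.5):
    perfect = murphy_yield(d0, 100)      # every defect kills the die
    salvage = murphy_yield(d0 / 2, 100)  # assume half of all defects are recoverable
    print(f"d0={d0}: {perfect:.0%} perfect-die yield, {salvage:.0%} with recovery")

# d0=0.1: 91% perfect-die yield, 95% with recovery
# d0=0.5: 62% perfect-die yield, 78% with recovery
```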

And TSMC has ~50% gross margins. You don't think there's room for transient differences to be priced in? And of course, that assumes two entirely equivalent nodes.
Now, say I can accept a 95% yield rate. Well, for each process I now have to analyze how much extra design, verification, and test time I need to budget, as well as the area increase for each design, to hit my acceptable yield rate.
Where are you getting that number from? Why, as a customer, would you even care what the fab's yields are so long as they can get you the number of good dies promised?
Additionally, things like cache can be made redundant fairly easily
And since cache alone is about half a typical die, that's huge. Synthesized arrays are also easy to add redundancy to, and that's another huge chunk right there (probably a good half+ of the remainder). Again, redundancy is the norm these days.
What if I look at all the additional costs with the worse fab and figure out that for the same costs, I can just go with the better fab and increase my base design size and design time to come up with an even more competitive product at the same price?
Amortized yields are factored into the price customers pay. Again, no one would be willing to be first to a node if they had to absorb all the cost and risk. Apple's certainly not paying for TSMC's N3 slips.
So yes, you can recover and salvage, but this is not wave-of-a-wand stuff that happens for free, and the worse the defect rate, the more headaches and the less competitive a product you get.
Yes, none of this is free, but the worst of the costs are absorbed by the fab, and on the design side, it's already the default practice to have mitigations in place. And again, this is a "problem" that gets better by the month. I think you're taking some very reasonable economic considerations and blowing them up into existential issues.
 

jpiniero

Lifer
Oct 1, 2010
15,103
5,661
136
Like I said before, they've been burning cash like crazy recently so. . . maybe.

Well, that's because of the future nodes. As crazy as it sounds, I still believe the plan is to sell (pre-pre-pre) risk production so they can claim "4 nodes in 5 years" or whatever it was. The smaller Meteor Lake CPU tile is only 40 mm2, and I reckon they could slice and dice it down to 0 big + 2 small if need be.
 

Hitman928

Diamond Member
Apr 15, 2012
6,024
10,352
136
If you require only perfect dies, which no one does. So let's say half of those defects are recoverable. Then you're at 95% vs 80% (rounding for laziness). In other words, ~20% more good dies from the better process, and with that gap shrinking by the month. You seriously think a company can't ship a product with those numbers?

What do you mean by recoverable? Do you mean due to the built-in redundancy? What type of design are you talking about, and how much area do you have to add to hit that recovery rate on average?


And TSMC has ~50% gross margins. You don't think there's room for transient differences to be priced in? And of course, that assumes two entirely equivalent nodes.

Given the ever increasing costs and difficulties of node progressions, I'd say there's probably not that much room there, no. Gross margin doesn't take into account all of the research and node development costs, administration overhead, customer engineering support, etc. These are big-time costs.

Where are you getting that number from? Why, as a customer, would you even care what the fab's yields are so long as they can get you the number of good dies promised?

Because a fab sells wafers, not dies, and the purchase agreement for those wafers will be done based upon the fab's yield data.

Even if we assume that doesn't matter, say the fab will even cover the remaining yield difference financially at an average yield per design of 95% vs. 65%. At HVM of even 20K WSPM and $8000 per wafer, the fab would then be covering roughly $48M to start the first month alone.
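
(For anyone checking the arithmetic, a quick sketch; WSPM is wafer starts per month, and the 95%/65% yields are the hypothetical above:)

```python
wspm = 20_000        # wafer starts per month at HVM
wafer_price = 8_000  # USD per wafer
yield_gap = 0.95 - 0.65  # better process vs. worse process, per the hypothetical

# Value of the good dies lost to the yield gap that the fab would be covering
monthly_coverage = wspm * wafer_price * yield_gap
print(f"${monthly_coverage / 1e6:.0f}M per month")  # prints: $48M per month
```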

And since cache alone is about half a typical die, that's huge. Synthesized arrays are also easy to add redundancy to, and that's another huge chunk right there (probably a good half+ of the remainder). Again, redundancy is the norm these days.

Fabs tape out way more than cutting edge CPUs and GPUs, but even ignoring that, again, you have to calculate the cost of all that redundancy; it is not free, and no amount of redundancy will make a 0.5 d0 process attractive versus a process with an industry-standard defect rate.

Amortized yields are factored into the price customers pay. Again, no one would be willing to be first to a node if they had to absorb all the cost and risk. Apple's certainly not paying for TSMC's N3 slips.

Which is why they work hand in hand and Apple aligns schedules as best as possible with actual HVM ready nodes from TSMC.

Yes, none of this is free, but the worst of the costs are absorbed by the fab, and on the design side, it's already the default practice to have mitigations in place. And again, this is a "problem" that gets better by the month. I think you're taking some very reasonable economic considerations and blowing them up into existential issues.

Default practice to have mitigations for industry standard defect rates, yes. But if you are dealing with a foundry with 5x the defect density of what you are used to, you're stepping outside of default practice. And again, how much redundancy is needed and how effective it can be is design-dependent. Not everyone tapes out with half+ of the die being large arrays that are easy to make redundant.

Believe what you want, I don't know what else to say. All I can tell you is that any fab trying to get customers with that defect rate is not going to have many customers or is not going to be competitive long term with that kind of business model. If your argument is that you could do it, well yeah you could. You can do all kinds of things. Doesn't mean it is a winning strategy.
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,102
136
What do you mean by recoverable? Do you mean due to the built-in redundancy?
A combination of techniques: redundancy, array parity, downbinning, etc. And these techniques are ubiquitous across modern chips.
Given the ever increasing costs and difficulties of node progressions, I'd say there's probably not that much room there, no. Gross margin doesn't take into account all of the research and node development costs, administration overhead, customer engineering support, etc. These are big-time costs.
Lmao, what on earth are you talking about? TSMC is making enormous profits. They're not just scraping by.
At HVM of even 20K WSPM and $8000 per wafer, the fab would then be covering roughly $48M to start the first month alone.
Ok, and? $50M, and decreasing from there, is plenty acceptable when you consider what the dies actually sell for and the profit margins involved. And that's, once again, with fundamentally flawed assumptions.
Fabs tape out way more than cutting edge CPUs and GPUs
There's not much else that uses the bleeding-edge nodes, and you would be extremely hard-pressed to find a complex modern design lacking in SRAM and synthesized logic. And some ASICs that do lack them are even more granular/binnable. Crypto ASICs, for example.
but even ignoring that, again, you have to calculate the cost of all that redundancy, it is not free
Redundancy is pretty darn cheap compared to the benefit it brings. Look at error correction algorithms for a reference point. Same principle.
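
(To put rough numbers on that, a sketch: assume defects land in a repairable SRAM array as a Poisson process and each spare row fixes one defect. The array fraction, spare count, and area overhead below are illustrative, not from any real design.)

```python
import math

def poisson_cdf(k_max: int, lam: float) -> float:
    """P(X <= k_max) for a Poisson random variable with mean lam."""
    return sum(math.exp(-lam) * lam**k / math.factorial(k)
               for k in range(k_max + 1))

die_area_cm2 = 1.0   # a 100 mm2 die
array_frac = 0.5     # assume half the die is repairable array (illustrative)
d0 = 0.5             # defects per cm^2, the "bad" process above
spares = 4           # spare rows; assume each repairs one array defect
area_cost = 0.02     # ~2% extra die area for the spares (illustrative)

lam = d0 * die_area_cm2 * array_frac   # expected defects landing in the array
repairable = poisson_cdf(spares, lam)  # die survives if array defects <= spares
print(f"P(all array defects repairable) = {repairable:.3%} at ~{area_cost:.0%} area cost")
# prints: P(all array defects repairable) = 99.999% at ~2% area cost
```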
Default practice to have mitigations for industry standard defect rates, yes.
Yet this entire time, you've been presenting numbers assuming no such mitigations exist. They have an even bigger impact when defect rates are higher. And again, why are you assuming 0.5 isn't a standard defect density for a new node? I can assure you multiple fabs have shipped far worse.
Believe what you want, I don't know what else to say. All I can tell you is that any fab trying to get customers with that defect rate is not going to have many customers or is not going to be competitive long term with that kind of business model.
And I'm pointing out the numbers and empirical observations don't support that conclusion. What is a "winning strategy" if not whatever makes money?
 

Hitman928

Diamond Member
Apr 15, 2012
6,024
10,352
136
While I have more of a conscious incompetence on this topic, I always gathered that anything above 0.2 is just not financially feasible. And @Hitman928 at least gives me reason to believe he possesses conscious competence in this regard.

/edit: Just try any of the free yield calculators on a 300 mm wafer with any Intel/AMD die you like and play around with defect densities from 0.5 to 0.05; you might be quite shocked. And 0.05 is the ballpark where I expect TSMC N5 to be right now.
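
(A minimal version of such a calculator, for the curious; it assumes Murphy's yield model and the common gross-die approximation, with a 600 mm2 die picked as an example:)

```python
import math

def gross_dies(wafer_diam_mm: float, die_area_mm2: float) -> int:
    """Common approximation for gross dies per wafer (area term minus edge loss)."""
    r = wafer_diam_mm / 2.0
    return int(math.pi * r**2 / die_area_mm2
               - math.pi * wafer_diam_mm / math.sqrt(2.0 * die_area_mm2))

def murphy_yield(d0_cm2: float, die_area_mm2: float) -> float:
    ad = (die_area_mm2 / 100.0) * d0_cm2
    return ((1.0 - math.exp(-ad)) / ad) ** 2

die = 600  # mm2, roughly a big server/GPU die
for d0 in (0.5, 0.2, 0.1, 0.05):
    g = gross_dies(300, die)
    print(f"d0={d0}: {g} gross dies, {g * murphy_yield(d0, die):.0f} defect-free")

# d0=0.5:  90 gross dies,  9 defect-free
# d0=0.05: 90 gross dies, 67 defect-free
```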

My last comment on this topic will be just to address the 0.2 you mention (that I also mentioned). Defect density is not a set thing across all nodes. An older node (say 180 nm) can get by with a bigger defect density. A modern node (e.g., 7 nm or 5 nm) needs a lower defect density. This is because the feature density of the process itself becomes so high that the same size defect is much more destructive on a modern node versus an old node. In other words, if you have a defect that takes out one transistor at 180 nm, it could completely wreck a cluster of FETs at 5 nm, so you have to have a stricter defect density requirement to adjust for this.
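
(A toy way to see it: count how many minimum-pitch wiring tracks a particle of fixed size can bridge. The pitches below are rough illustrative numbers, not exact process specs.)

```python
# Same particle, two nodes: on the denser node it bridges several tracks,
# so a defect of a given size kills more circuitry.
defect_diameter_nm = 100
for node, metal_pitch_nm in [("180 nm-class", 500), ("5 nm-class", 30)]:
    tracks_hit = defect_diameter_nm // metal_pitch_nm + 1
    print(f"{node}: ~{tracks_hit} track(s) affected")

# 180 nm-class: ~1 track(s) affected
# 5 nm-class:   ~4 track(s) affected
```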

Lastly, when bringing up a node, there are multiple sources of defects which can act together to form defect patterns much different from the seemingly random distribution you see in wafer yield calculators. There is a paper on mapping and classifying defect patterns during process bring-up that goes through this. It is behind an IEEE paywall, but it has some good info if you have access. The pattern examples they give seem to be publicly available, though, so I will post them here. The yellow dots are defects. The "none" class is what you would typically see from a wafer yield calculator, but the defects can be much more concentrated in certain areas (mainly while you are bringing the process up), which will make those dies unsalvageable. The examples given are obviously extreme because they want to show off their defect mapping technique, but the idea is the same even when the number of defects is greatly reduced.

[Image: Typical examples of nine wafer defect classes]


Wafer defect patterns recognition based on OPTICS and multi-label classification | IEEE Conference Publication | IEEE Xplore
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,102
136
A modern node (e.g., 7 nm or 5 nm) needs a lower defect density. This is because the feature density of the process itself becomes so high that the same size defect is much more destructive on a modern node versus an old node.
Defect density should generally take that into account. A scratch wouldn't be one defect.
 

BorisTheBlade82

Senior member
May 1, 2020
680
1,069
136
@Exist50
Honestly, I couldn't express my reasoning on this topic better than @Hitman928 already did. While I mostly agree with your general stance in this forum, in this specific case I have to side with him. And I don't consider it best practice to be overly nitpicky, pedantic, and to strawman. Sorry, mate.
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,102
136
@Exist50
Honestly, I couldn't express my reasoning on this topic better than @Hitman928 already did. While I mostly agree with your general stance in this forum, in this specific case I have to side with him. And I don't consider it best practice to be overly nitpicky, pedantic, and to strawman. Sorry, mate.
I don't mean to be argumentative or nitpicky, but to be honest, I don't even see how this is a matter of opinion. Those yield expectations simply aren't in line with the industry for leading-edge nodes, and I presented some data and explained the design considerations for why that is the case. These are things that people working on modern SoCs are familiar with. So I'm honestly more confused than anything else as to why this is such a contentious topic. I don't think TSMC, or especially Samsung, would bat an eye at it.

But in the interest of not beating a dead horse further, I'll drop it too.
 
  • Like
Reactions: BorisTheBlade82

Exist50

Platinum Member
Aug 18, 2016
2,452
3,102
136
AMD not using super-expensive packaging methods for their interconnects, such as giant interposers, at least for their EPYC CPUs, certainly helps, it seems. I also suspect the economics of packaging a bunch of chiplets place an effective cap on how many chiplets AMD can add before they are forced to start increasing the number of cores on each chiplet (beyond power and engineering limitations, I mean), but I don't think they are approaching that crossroads with Zen 4 Genoa quite yet.
This question got me thinking. Is throwing more and more silicon at their server platforms a sustainable long-term solution for Intel, AMD, etc? At what point is the cost so high that it alienates too much of the market? Or do they just keep further subdividing the server platform into more socket tiers? Eventually they'll either need new memory tech, or they're not going to be able to fit enough physical memory channels in a standard width rack... That, or things plateau for a while.
 

Kocicak

Golden Member
Jan 17, 2019
1,062
1,117
136
I still do not believe that the interconnects are the root cause of Intel's struggles.
Are you sure that interconnecting all those 31 (that is THIRTY-ONE) tiles is an easy thing to do? And not only that, you need to test each of these tiles and the interconnects thoroughly before you mount them.

How many perfectly functioning pieces of silicon (incl. interconnects) need to be put together (without making any errors in that step either) to make a functioning unit?

In my opinion this is a nightmare to make.
 

BorisTheBlade82

Senior member
May 1, 2020
680
1,069
136
Are you sure that interconnecting all those 31 (that is THIRTY-ONE) tiles is an easy thing to do? And not only that, you need to test each of these tiles and the interconnects thoroughly before you mount them.

How many perfectly functioning pieces of silicon (incl. interconnects) need to be put together (without making any errors in that step either) to make a functioning unit?

In my opinion this is a nightmare to make.
Just to make sure: I am in no way saying that this is a trivial task - see conscious incompetence.

But IMHO this is more or less a divide-and-conquer topic. As a start, you are coming from known good dies, Compute tiles as well as EMIB bridges. From then on it is a question of increasing packaging yield - a solvable problem from my PoV.
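
(To make the divide-and-conquer point concrete, a sketch with made-up numbers; the kgd_rate and bond_yield values are assumptions, and the 31 tiles come from the post above:)

```python
n_tiles = 31
kgd_rate = 0.999    # assumed: a tile that passed sort test is actually good
bond_yield = 0.999  # assumed: each tile attach / EMIB bond succeeds

# Every tile must be truly good AND every attach must succeed
package_yield = (kgd_rate * bond_yield) ** n_tiles
print(f"{package_yield:.1%}")  # ~94%: workable, but every "9" matters at 31 tiles
```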
 

DrMrLordX

Lifer
Apr 27, 2000
21,998
11,552
136
I can tell you that none of the companies I worked for would ever go into volume manufacturing with yield rates like this, not unless the fab was basically willing to eat a large portion of the dies with defects or give an equivalent discount on wafer prices to bring the effective yield in line with industry standards. This would not be a winning proposition for the foundry, though.

Intel already went that route with IceLake-SP, and possibly other products. IceLake-SP was notorious.
 

eek2121

Diamond Member
Aug 2, 2005
3,098
4,386
136
Like I said before, they've been burning cash like crazy recently so. . . maybe. The only example of something like this I know of is when AMD first spun off GF. The foundry was struggling to keep up with Intel and basically as part of their agreement, the first 2 or 3 years or something, AMD didn't have to pay for defective dies. It was actually a nice little bonus for AMD at the time, even though they were behind Intel on process tech, but it caused GF to even more money than they did once it expired. GF was never profitable until they stopped chasing advanced nodes and stuck with a bit more niche nodes that had already reached really good yield rates and addressed somewhat under served customer needs. GF would have gone bankrupt if it weren't for AMD's contractual obligation to buy from them and the Saudi's determination to have a foothold in the technology race and their very large bank account to keep it going.

If GF could have made 7nm work they would have likely kept AMD’s business and they would absolutely be profitable today.

I am waiting for the day we see AMD using IFS. 🤣
 
  • Haha
Reactions: lightmanek

scannall

Golden Member
Jan 1, 2012
1,960
1,678
136
If GF could have made 7nm work they would have likely kept AMD’s business and they would absolutely be profitable today.

I am waiting for the day we see AMD using IFS. 🤣
They got a working 7nm node from IBM as a gift. I don't know if they didn't want to spend the money on the equipment, or if they couldn't get the equipment in time for it to be profitable though. ASML is sold out for who knows how long.
 
Jul 27, 2020
19,613
13,476
146
They got a working 7nm node from IBM as a gift. I don't know if they didn't want to spend the money on the equipment, or if they couldn't get the equipment in time for it to be profitable though. ASML is sold out for who knows how long.
Arabs had huge shares in GF. Maybe they were too risk averse and vetoed spending billions on equipment for future needs. Must have hired some "experts" to tell them that the physics of shrinking transistors were riddled with insurmountable hurdles.
 
  • Like
Reactions: lightmanek

Geddagod

Golden Member
Dec 28, 2021
1,295
1,368
106
No idea what happened to SPR.
I would love to see the power allocation on these SPR parts vs. Milan or Rome, though. If there is a non-hardware-related reason why SPR is weaker than Milan in some cases, this would be my best guess. Power allocation becomes hugely important in these parts, with IO and L3 cache power consumption tweaks able to create drastic changes in performance. So much so that Milan vs. Rome was originally a 15% perf/watt regression, but with power shifted between the cores and the IO and L3 cache in a different system, it became a perf/watt uplift over Rome (in AnandTech's testing).
What makes this even more believable is that CB23 isn't exactly a memory-intensive workload, so shifting power away from the IO towards the cores themselves could show a benefit.
However, this is only if the issue is non-hardware-related. It could just be that Intel 7 + GLC is way worse at lower voltages/clocks than Zen 3 (which I think is also true). Also, the difference between what I personally expected (slightly faster than Milan) and reality (slightly slower than Milan) is so small that I don't think there really is an issue per se, just SPR performing worse than expected.