
Discussion: The beauty of AMD chiplet design


Kocicak

Golden Member
I have not seen this discussed before. Besides all the obvious advantages of the chiplet design, there is one more thing: the current 8-core chiplet will stay relevant and usable for two or even three years. I believe that even that far in the future it will be usable in some low-end processors or other applications. The same little universal 8-core chiplet, produced in such high volume, allows the development cost to be amortized so widely that overall it will be extremely cheap to produce.

Why not split the consumer processor line into two parts: a higher-end part, which would get new-generation chiplets every time they are released, and lower-end processors, whose compute chiplet would be updated, for example, every second year? I believe it would be a very cost-effective approach, and it would really allow consumers to enjoy the benefits of the design and of high-volume (and therefore low-cost) production.
 
See this for an explanation of how not all is as it seems.

The take-home is that, even at the point the silicon leaves the foundry, splitting the floor plan across nodes is likely to reduce recurring costs, not increase them.

Once you consider the benefits of flexible packaging to meet market demand, plus harvesting and binning, it's fairly clear that splitting across two nodes is a big winner.

That’s all good, but if the cost savings are so good, why haven’t we seen this before?

I don’t doubt AMD has a good reason for doing this, but I do doubt all the statements in this thread making assumptions about costs. I’m skeptical that this is the second coming as some are treating it.
 
Manufacturing non-speed-critical components on an older node is definitely a cost-beneficial move. While the die size of such a chip will be larger, the per-transistor cost will be lower (especially at first), and it obviously frees up capacity on the 7nm node. It also increases yield for the package as a whole (several smaller chips will have a better combined yield than a single larger one).
The tradeoff is obviously power efficiency and footprint: having part of the CPU functionality on an older-node chip will eat more power (longer interconnects and bigger transistors).
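
To put rough numbers on that yield point, here is a minimal sketch using the usual first-order Poisson yield model, yield ≈ exp(−defect density × area). The die areas and defect densities below are illustrative guesses, not AMD's actual figures:

```python
import math

def yield_rate(defect_density_per_mm2: float, area_mm2: float) -> float:
    """First-order Poisson yield model: probability of zero defects on a die."""
    return math.exp(-defect_density_per_mm2 * area_mm2)

# Illustrative numbers only (not AMD's real figures):
d_7nm, d_14nm = 0.005, 0.001  # defects/mm^2: new node vs. mature node
mono_area = 200.0             # hypothetical monolithic 7nm die, mm^2
core_area = 80.0              # hypothetical 7nm CPU chiplet, mm^2
io_area = 120.0               # hypothetical 14nm I/O die, mm^2

y_mono = yield_rate(d_7nm, mono_area)
y_split = yield_rate(d_7nm, core_area) * yield_rate(d_14nm, io_area)

print(f"monolithic 7nm yield: {y_mono:.1%}")   # ~36.8%
print(f"chiplet + I/O yield:  {y_split:.1%}")  # ~59.4%
```

Even requiring both dies to be good, the split comes out well ahead, because the exponential punishes a large die on a high-defect node disproportionately.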

But chiplets have other benefits as well. They could allow mixing and matching anything AMD wants to do: DDR4, DDR5, PCIe 3/4, a fat or slim iGPU, and any kind of CPU chiplet they can come up with. I'm sure they could make a cheap Stoney Ridge successor using an I/O die with single-channel DDR4, an R2-class iGPU, and a 2C/4T CPU chiplet.


This is all a throwback to the earlier years of computing, as a similar design was very prominent in the past. Think of the I/O chiplet as the northbridge and the CPU chiplet as the CPU in its socket. Nowadays everything is within a single package.
 
Chiplets are less expensive to manufacture. Reducing the size of the die that you're producing means that you're more likely to get more functional parts back.

AMD also doesn't have to design multiple different chips, which saves on additional work and means they don't need to create separate mask sets for those other chips.

Both the fixed and per unit costs are going to decrease on the whole. Packaging might be slightly more expensive, but not enough to offset the other cost savings.
I've been thinking (uh-oh) about the chiplet design. We now have it consisting mainly of core and cache sections that together probably comprise more than 80-85% of the area, versus ~50% for Zen 1. This should mean that if AMD can salvage down to 4C per chiplet, pretty much every defect will occur on a part of the die (core and cache) that still allows them to obtain a useful part. The whole early-node defect-density bugbear can be neutralized. The main thing improved fab defect density will bring is more fully functional 8C dies, which might not matter as much if a lot of 6C products are sold.

This is a separate factor from the small size of the die.
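
As a rough sanity check on that idea, here's a hedged back-of-the-envelope sketch. It assumes defects land uniformly at random and that any defect inside the core/cache region can be fused off to leave a 6C part; the area and defect density are invented for illustration:

```python
import math

area = 75.0          # hypothetical 7nm chiplet area, mm^2
salvage_frac = 0.85  # share of area (cores + L3) where a defect still leaves a 6C part
d = 0.005            # illustrative defect density, defects/mm^2

p_clean = math.exp(-d * area)   # fully functional 8C die
p_defective = 1 - p_clean
# Approximation: a defective die is salvageable if its defect falls in the
# core/cache region (ignores multi-defect corner cases).
p_salvaged_6c = p_defective * salvage_frac

print(f"clean 8C dies:     {p_clean:.1%}")                  # ~68.7%
print(f"salvageable as 6C: {p_salvaged_6c:.1%}")            # ~26.6%
print(f"total usable:      {p_clean + p_salvaged_6c:.1%}")  # ~95.3%
```

Under these made-up numbers, harvesting turns a ~69% naive yield into ~95% usable dies, which is the "bugbear neutralized" effect in miniature.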
 
That’s all good, but if the cost savings are so good, why haven’t we seen this before?

I don’t doubt AMD has a good reason for doing this, but I do doubt all the statements in this thread making assumptions about costs. I’m skeptical that this is the second coming as some are treating it.
We have seen this before; it was actually the norm.
Everything up until LGA 1366/1156 was designed this way, just not in a single package, but rather with separate chips (northbridges).
 
I've been thinking (uh-oh) about the chiplet design. We now have it consisting mainly of core and cache sections that together probably comprise more than 80-85% of the area, versus ~50% for Zen 1. This should mean that if AMD can salvage down to 4C per chiplet, pretty much every defect will occur on a part of the die (core and cache) that still allows them to obtain a useful part. The whole early-node defect-density bugbear can be neutralized. The main thing improved fab defect density will bring is more fully functional 8C dies, which might not matter as much if a lot of 6C products are sold.

Not every chip with defects can be salvaged by turning off the defective parts. If the defect is in a logic transistor, then not using that transistor (by turning off a core, for instance) will let you use that chip. However, if the defect is a short between ground and power planes, the chip will melt itself if you give it power, no matter what parts you intend to turn off.

Of course, only losing guaranteed-to-be-useless chips like ones with major shorts still greatly improves yields. It just doesn't bring them to 100%.
 
Not every chip with defects can be salvaged by turning off the defective parts. If the defect is in a logic transistor, then not using that transistor (by turning off a core, for instance) will let you use that chip. However, if the defect is a short between ground and power planes, the chip will melt itself if you give it power, no matter what parts you intend to turn off.

Of course, only losing guaranteed-to-be-useless chips like ones with major shorts still greatly improves yields. It just doesn't bring them to 100%.
Just a much higher yield than a simple use of defect density rates would suggest. Correct?
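
Something like this toy extension of the earlier sketch, with an invented share of defects that are fatal (power/ground shorts and the like) no matter where they land:

```python
import math

area, d = 75.0, 0.005   # same illustrative chiplet as before
salvage_frac = 0.85     # defects in core/cache can be fused off
fatal_frac = 0.10       # assumed share of defects that kill the die outright

p_clean = math.exp(-d * area)
p_defective = 1 - p_clean
usable = p_clean + p_defective * (1 - fatal_frac) * salvage_frac

print(f"naive yield (all defects fatal): {p_clean:.1%}")  # ~68.7%
print(f"with harvesting, minus fatals:   {usable:.1%}")   # ~92.7%
```

The usable figure lands between the naive defect-density yield and 100%, as described.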
 
That’s all good, but if the cost savings are so good, why haven’t we seen this before?

I don’t doubt AMD has a good reason for doing this, but I do doubt all the statements in this thread making assumptions about costs. I’m skeptical that this is the second coming as some are treating it.

Because this node jump is the point where the size of the die would be dominated by the space needed for the traces onto the package. So shrinking much of it to 7nm would bring no size benefit - to make an 8C you'd just see empty silicon with traces running through it.

If you made 16C your baseline, then it wouldn't be so bad in terms of used silicon, but it'd mean manufacturing costs are prohibitively high for something you might then have to fuse 12 cores off of to fill a 4C market segment.
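
To illustrate the pad-limited point: if the die must expose a fixed number of package connections around its edge at a fixed pitch, the minimum die size is set by the pad count rather than the logic. A toy calculation, with every number invented:

```python
# Toy pad-limited die-size estimate; all numbers are invented for illustration.
n_pads = 800          # hypothetical package connections the die must expose
pad_pitch_mm = 0.04   # hypothetical pad pitch along the die edge

min_perimeter = n_pads * pad_pitch_mm  # single ring of peripheral pads
min_side = min_perimeter / 4           # assume a square die
pad_limited_area = min_side ** 2       # 64 mm^2 floor regardless of logic

logic_area_7nm = 40.0  # hypothetical area the shrunk 7nm logic actually needs
print(f"pad-limited floor: {pad_limited_area:.0f} mm^2, "
      f"logic needs only {logic_area_7nm:.0f} mm^2")
```

The gap between the two numbers is the "empty silicon with traces running through it" from the post above.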
 
We have seen this before; it was actually the norm.
Everything up until LGA 1366/1156 was designed this way, just not in a single package, but rather with separate chips (northbridges).

Exactly this.

Prior to the A64 and Nehalem, the CPU had a link to the northbridge and that was it; the NB handled communication with the rest of the system, with some additional features placed on the southbridge.

Then the A64 integrated the memory controller, and that did reduce latency, but if you compare the A64 to Core 2, Core 2's latency is not higher by enough to make it poor in gaming. Core 2 was absolutely dominant in gaming over the A64, and initially Nehalem (Intel's first CPU with an IMC) was hit-and-miss in terms of improving over Core 2. This was all with Core 2 using an FSB and a northbridge; an on-package I/O die is going to be significantly better than that.

I think the latency problem is overblown. Just like the odd memory configuration of the 2990WX was blamed for its poor performance, when in reality it was the Windows scheduler, since performance in Linux is significantly better. I am sure latency plays a part, but I bet the bigger issue is simply that Intel has been so dominant in the CPU space for so long that a lot of game engines are tuned to Intel architectures.
 
I think the latency problem is overblown.

Depends on the uarch and workload. Remember that Zen/Zen+ has an inherent IF latency penalty just from migrating threads or even doing L3 cache reads across CCXs. You don't have to hit the memory controller to suffer from that latency. Conroe covered up a lot of the deficiencies of its FSB-based northbridge/memory controller by having a (for the time) strong cache subsystem, one that blew away AMD's in performance.

Zen2 may not have 4c CCXs divided by an IF link internal to the chiplet. If that is the case, then that element of the IF latency penalty will be gone. Now all you have to worry about is the memory controller. Having 16 MB of L3 fully accessible to all eight cores of a chiplet will be a game-changer for Zen2 performance.
 
Depends on the uarch and workload. Remember that Zen/Zen+ has an inherent IF latency penalty just from migrating threads or even doing L3 cache reads across CCXs. You don't have to hit the memory controller to suffer from that latency. Conroe covered up a lot of the deficiencies of its FSB-based northbridge/memory controller by having a (for the time) strong cache subsystem, one that blew away AMD's in performance.

Zen2 may not have 4c CCXs divided by an IF link internal to the chiplet. If that is the case, then that element of the IF latency penalty will be gone. Now all you have to worry about is the memory controller. Having 16 MB of L3 fully accessible to all eight cores of a chiplet will be a game-changer for Zen2 performance.

I agree regarding internal latency and there is a lot AMD can do to improve it.

I was solely (and I should have explicitly stated it) talking about memory latency.
 
I was solely (and I should have explicitly stated it) talking about memory latency.

Well, I get that, but if you look at what Zen2 may well have done, we now have a situation where any one core in a chiplet will have full access to 16MB of L3 without inter-CCX latency penalties. That's up from 8MB on Zen/Zen+. Unless an application was very well-constructed with respect to its working set, odds were pretty good that an application that used more than 8MB of memory would house some of its vital data in L3 on another CCX, negating most of the advantage of L3 in the first place. You may as well hit main memory at that point (outside of main memory requiring a cache snoop anyway, just because that's how it works).

For many applications, Zen2 will effectively double your L3 cache without actually adding any more L3 cache. If that makes any sense. That'll go a long way towards hiding system RAM latency just the same way strong cache architecture helped Conroe.
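
To make the "effective doubling" concrete, here's a toy model: assume a uniformly accessed working set, and that on Zen/Zen+ a core only gets fast hits from its own CCX's 8MB slice (treating cross-CCX hits as costing near-memory latency), versus a hypothetical Zen 2 chiplet with all 16MB fast. This is a cartoon, not a cache simulation:

```python
def fast_hit_rate(working_set_mb: float, fast_l3_mb: float) -> float:
    """Fraction of accesses served from low-latency L3, uniform-access cartoon."""
    return min(working_set_mb, fast_l3_mb) / working_set_mb

for ws in (4, 8, 12, 16, 32):
    zen1 = fast_hit_rate(ws, 8.0)   # Zen/Zen+: only the local CCX's 8MB is "fast"
    zen2 = fast_hit_rate(ws, 16.0)  # hypothetical Zen 2: full 16MB fast per chiplet
    print(f"{ws:>2} MB working set: Zen/Zen+ {zen1:.0%} fast hits vs Zen2 {zen2:.0%}")
```

The divergence between the 8MB and 16MB working sets is exactly the band where the unified L3 behaves like "doubled" cache.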
 
That’s all good, but if the cost savings are so good, why haven’t we seen this before?

I don’t doubt AMD has a good reason for doing this, but I do doubt all the statements in this thread making assumptions about costs. I’m skeptical that this is the second coming as some are treating it.
We HAVE actually seen this approach before - all the way back with the first-generation Core i3s, which had the CPU on one die, and the iGPU, memory controller and PCIe controller on another die within the same LGA1156 package.

Which raises the question of why, if doing it that way was cheaper, they went to a monolithic design with the Sandy Bridge i3s.
 
We HAVE actually seen this approach before - all the way back with the first-generation Core i3s, which had the CPU on one die, and the iGPU, memory controller and PCIe controller on another die within the same LGA1156 package.

Which raises the question of why, if doing it that way was cheaper, they went to a monolithic design with the Sandy Bridge i3s.

Well, if it was used for Celeron, Pentium, i3 and i5 CPUs, that suggests it was cheap; otherwise it would not have been worthwhile in such a low-margin segment. I guess the benefits of monolithic were better at the time, but since we have a larger variety of core counts available now, the balance has switched.
 
For all the supposed "beauty" of this design, I would still bet that once 7nm EUV is working we'll see a monolithic APU.

Perhaps we will see a monolithic APU - but I wouldn't bet on it. You've gotta assume that AMD were not going out on a limb making the I/O controller and CPU chiplet, and that there is a bit of joined-up thinking going on.

-- I/O Controller
-- x86 chiplet
-- GPU chiplet
-- custom accelerator chiplets for folks that have the wallet to pay for them.
 
Perhaps we will see a monolithic APU - but I wouldn't bet on it. You've gotta assume that AMD were not going out on a limb making the I/O controller and CPU chiplet, and that there is a bit of joined-up thinking going on.

-- I/O Controller
-- x86 chiplet
-- GPU chiplet
-- custom accelerator chiplets for folks that have the wallet to pay for them.
Yeah, this is a whole Pandora's box thing. I doubt AMD ever moves back to a single monolithic die in the medium term. Maybe they might bounce back and forth between dies that can work standalone and chiplet + I/O designs, but even then I doubt it. The only time AMD will switch back is if there is a major performance hold-up with the I/O being external like that. AMD has created a solution for themselves that allows adopting new processes almost immediately for high-volume parts and maximizes yield on the wafers they buy. It's going to take a lot to convince AMD that they need to create a huge monolithic design and assume all of the risks they are avoiding now.
 
Yeah, this is a whole Pandora's box thing. I doubt AMD ever moves back to a single monolithic die in the medium term. Maybe they might bounce back and forth between dies that can work standalone and chiplet + I/O designs, but even then I doubt it. The only time AMD will switch back is if there is a major performance hold-up with the I/O being external like that. AMD has created a solution for themselves that allows adopting new processes almost immediately for high-volume parts and maximizes yield on the wafers they buy. It's going to take a lot to convince AMD that they need to create a huge monolithic design and assume all of the risks they are avoiding now.

I bet the next Ryzen APU (the one with real Zen 2 cores) is monolithic.
 
I bet the next Ryzen APU (the one with real Zen 2 cores) is monolithic.
Depends on how many cores.

If they do an 8-core APU, then I can see the same CPU chiplet but an I/O+GPU die. I don't see a CPU chiplet + GPU chiplet + an I/O die, as GPUs love bandwidth way more than CPUs, and that wouldn't help power. My guess is the next-gen consoles will also be like this.

If they stick to 4 cores then I see monolithic; I don't expect 6 cores.
 
Yeah, that is the downside of the quad-core CCX. I think 6 cores would be kind of the sweet spot for an APU if it were an easy option.

It's also the genius of the 4c CCX. Intel in SKL-X has a huge, complex mesh that makes CPU-to-CPU communication latency skyrocket. By going 4c with a crossbar to the next CCX, those 4 cores have almost ring-bus-small latency (at least to the next-nearest CPU on a ring bus). It keeps the power-hungry IF at 6 connections for a 4c CCX instead of 15 for a fully connected 6c CCX (see the quick check below). I think an APU might be a bad spot to push CCX size; the intra-CCX IF power draw would increase more than the IF cost of adding a second CCX.
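
For the link-count arithmetic: a fully connected crossbar between n endpoints needs n(n−1)/2 links, so 4 cores need 6 links while 6 cores need 15; the 21 often quoted would correspond to seven endpoints (e.g. if the L3 or an external IF port counts as a node, which is an assumption on my part). A quick check:

```python
def crossbar_links(n: int) -> int:
    """Links in a full mesh of n endpoints: n choose 2."""
    return n * (n - 1) // 2

for n in (4, 6, 7, 8):
    print(f"{n} endpoints -> {crossbar_links(n)} links")
# 4 -> 6, 6 -> 15, 7 -> 21, 8 -> 28
```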
 
Regarding the latency concerns, the following chart was published on Twitter:

[attached chart: Zen 2 memory latency, via Twitter]
Since the two graphs show different scales I adapted the Zen 2 one and overlaid them to better illustrate the difference:

[overlaid Zen / Zen 2 latency graphs]


If this is real, there's a significant latency improvement between ~6 and ~24MB, while beyond that it's slightly worse (though the graph does a lot of crude interpolation; knowing the exact data points would be worth more).
 