64 core EPYC Rome （Zen2）Architecture Overview？

Gideon · Nov 15, 2018

Zapetu said:
I also find the lack of editing in the new forums highly limiting.

No Editing ?! :O

But yeah, IMO not using bold fonts for unread threads also makes it much harder to pick out interesting information.

moinmoin · Nov 15, 2018

Zapetu said:
But AMDs current CCX implementation is really only uniform for 4 cores with a reasonable amount of direct links between them

Yes, that's my point. Every further expansion above it degrades latency, so latency uniformity with low latency is already gone at that point. The different approaches just decide by how much. CCX is totally uniform for 4 cores, complexity makes more cores unlikely. Ring bus isn't but I guess one could argue that up to some amount of cores the increase is negligible. 85ns worst case is 2.5 times the best case of 34ns. Mesh has high uniformity with high power consumption; while worst case is just 19% above the best, the best case starts at 69.3ns, close to the ring bus' worst.

AMD obviously decided that there must be other feasible approaches. Let's see how Zen 2/Epyc 2's architecture turns out latency wise.

(Regarding lack of editing, a time window for fixing typos would be nicer indeed. Also I'm afraid locking OPs from being edited will just ensure more throw away threads to be created.)

Abwx · Nov 15, 2018

Atari2600 said:
IPC = Instructions Per Clock.

It's something that many seem to now equate to Instructions Per Core - when it never meant this.

So you'd be looking at an "Instructions Per Core" improvement of 1.13 x 1.13 ~ 1.28 [considering both IPC and clock rate]

I m aware of the diference, i just pointed that the two numbers are surprisingly close.

Other than this the C-Ray bench displayed by AMD put Zen 2 at 13-26% better IPC than Zen 1 depending of the Epyc 2 sample frequency wich should be in the 1.8-2GHz range

The only alleged link on this respect is a ChipHell CB R15 run at 1.8GHz base., and if it s legit then it s more than 13% in legacy non AVX/FMA FP.

Topweasel · Nov 15, 2018

moinmoin said:
Yes, that's my point. Every further expansion above it degrades latency, so latency uniformity with low latency is already gone at that point. The different approaches just decide by how much. CCX is totally uniform for 4 cores, complexity makes more cores unlikely. Ring bus isn't but I guess one could argue that up to some amount of cores the increase is negligible. 85ns worst case is 2.5 times the best case of 34ns. Mesh has high uniformity with high power consumption; while worst case is just 19% above the best, the best case starts at 69.3ns, close to the ring bus' worst.

AMD obviously decided that there must be other feasible approaches. Let's see how Zen 2/Epyc 2's architecture turns out latency wise.

(Regarding lack of editing, a time window for fixing typos would be nicer indeed. Also I'm afraid locking OPs from being edited will just ensure more throw away threads to be created.)

It's an easy solution. Zen was starting out with 8 cores. They have a mesh solution that grows out from core to core, CCX to CCX, die to die, even socket to socket. So AMD was at the tail end of where a ring bus was feasible for a start. Also comparing Zen's intra CCX. First I don't know where you get the numbers from. Intra CCX numbers are a lot better than that. But this assumes that AMD would have had similar results on a ring bus. Topping all of this off a Ring bus kills the modular ability and has to be redesigned for every die. On top of all of that is that the uniformity gives them an established system that they can more easily tweak in future system both increasing bandwidth and lowering latency.

moinmoin · Nov 15, 2018

Topweasel said:
First I don't know where you get the numbers from.

Agner Fog's measurements from Microarchitectures (on page 157).

Topweasel said:
Intra CCX numbers are a lot better than that.

We were talking about uniformity of latency. That the ring bus is not uniform. And that the mesh that replaces the ring bus is significantly more uniform but starts at a latency close to the ring bus' worst case.

Since you seem to want intra CCX numbers, Agner Fog has it at 40 clocks (page 219).

Topweasel · Nov 15, 2018

moinmoin said:
Agner Fog's measurements from Microarchitectures (on page 157).

157 is for Skylake and As for Ryzen it was what I thought it was a uniform 40ns. Which makes best case scenario Skylake 50% slower and twice as fast as worst case scenario. On top of that it's not noted what version of the Arch this was tested for or I didn't note it. This is could be 4c Skylake. What about 6C and 8C Skylake (CFL and CFL-R)? It's very very very doubtful that a ring bus hold up for much more than 4 cores in comparison. Now it's an established component of the cores and socket and AMD can tweak it each gen. The final part of this is all things considered there is reason Intel went Mesh with SL-X and AMD for Zen. Servers. No amount of mulling over minor losses in very specific use cases (games) is going to make a ring bus worth AMD doing anything other than the IF-mesh.

moinmoin · Nov 15, 2018

You are talking to the wrong person.

Topweasel · Nov 15, 2018

Yeah in retrospect I think you are right. My apologies.

Zapetu · Nov 18, 2018

Design Costs for 16nm, 10nm, 7nm and 5nm Nodes

Using some pixel counting I made a chart of the manufacturing costs of newer nodes. I only found a couple of links containing respective information about 16nm or lower nodes. Here's the original chart (Big Trouble At 3nm) and here on pages 9 and 11 is some additional information (EUV MASK TECHNOLOGY AND ECONOMICS: IMPACT OF MASK COSTS ON PATTERNING STRATEGY (PDF)).

I tried my best to be as accurate as possible but there's still a margin of error (at least ±$1M, might be even ±$2M).

It's an oversimplification (and it's not true) to assume that each chip would cost about the same to design and develop. Reusing IP, architecture and quite likely also software (I don't know the licensing terms) would reduce significant amount of the costs. Designing a completely new novel architecture will presumably cost (many times) more than what is predicted in the charts. A simple die shrink from 14LPP to 12LP (E.g. Pinnacle Ridge and Polaris 30) might has been only a couple of $10M while the original designs (especially Summit Ridge) were much more expensive. Just a reminder that AMD could very well have made different designs for Rome, Matisse and ~~Picasso~~ (more likely Renoir) including separate 7nm dies but they might as well have gone with the chiplet route (for better yields and binning). We'll find out eventually but right now none of the options can be excluded.

Here's a list of different GloFo and TSMC designs starting from 14 nm node that AMD currently has made (that I know of):

Global Foundries 14LPP
- Summit Ridge - ~ 213 mm² (AnandTech review)
- Polaris 10 - ~ 232 mm² (AnandTech review)
- Polaris 11 - ~ 123 mm²
- Polaris 12 - ~ 101 mm²
- Vega 10 - ~ 486 mm² (AnandTech review)
- Raven Ridge - ~ 210 mm² (AnandTech review)
- ('Vega M' in Kaby Lake G) - ? mm²
- (Subor Z+ SoC) - ? mm² (reported ~ $400M total (whole device?))
- Vega Mobile 16 and 20 - ? mm²
Global Foundries 12LP
- Pinnacle Ridge - ~ 213 mm² (AnandTech review)
- Polaris 30 - ~ 232mm² (?) (AnandTech review)
TSMC 7 nm (WikiChip)
- Vega 20 (7 nm HPC) - ~ 336 mm²
- Rome 8 core chiplet (7 nm HPC) - ~ 73 mm²
GloFo 14LPP or 14HP (WikiChip)
- Rome IO die - ~ 420 mm²

Hopefully this clarifies some things a little bit more. Whatever Matisse is a a chiplet or a monolithic design depends on many factors but AMD doesn't have to choose the chiplet route just because of financial reason. It's still peculiar how the chiplet die size was chosen to be quite small.

Addition: Since I made this post my signature, I will put some other very useful links here in the end. I have no connections to anyone in the industry but I do like to observe and learn about these kind of things. I hope that someone finds these useful.

MIT 6.004 L13: The Memory Hierarchy

MIT 6.004 L14: Hardware Caches

MIT 6.004 L21: Cache Coherence

MIT 6.004 L22: Advanced Multicores

PotatoWithEarsOnSide · Nov 18, 2018

Increased yield plus the GF WSA would be the most obvious reasons.

Zapetu · Nov 18, 2018

exquisitechar said:
The amazing @kokhua predicts on Twitter that there is no L4$. I believe him.

amd6502 said:
On these forums it seemed the opposite to me. He infers some sort of L4 based on the larger than expected size of the IO die. Where is the tweet?

You may have already found it but here it is (scroll a little bit up). He's been hard at work and has calculated die areas for different parts of the Zeppelin chip, and has come to the conclusion that there is not enough room for the L4 cache. It may very well be so but I'm still not willing to give up on L4 (or a similar eDRAM solution) as long as 14HPC is still a possibility.

AdoredTV has posted a new video and while it contains some interesting speculation, there's one thing that I'm questioning somewhat. At 6:20 and more specifically at 8:03 he starts talking about, if I understood him correctly, cutting the IO die in smaller pieces to reuse different quadrants of the IO die for Ryzen CPUs. Is that even possible using today's technologies? While I do like the concept and with it even a large GPU chip could be split in half to create two smaller GPU chips that would work on their own. The problem might be, if it would otherwise be somehow feasible, that some of the parts would have to be duplicated for smaller chips to work and that would waste some space from the large die. Still the concept is really neat but I haven't heard about anything that would allow it to be done. Here's the video:

I like his speculation though, even if he occasionally gets something wrong (as do I but he's obviously much more aware of the semiconductor industry than I am). But still, in my opinion, more likely outcome would be for AMD to create separate (lithography) masks for the 1/4 cut IO. Still overall I liked the video very much and I do like to speculate as well.

amd6502 · Nov 18, 2018

Nice twitter find! I think he says not enough for 512MB L4 which would be a lot. But more land in "next city" meaning post-rome update (7nm?) might have such a gigantic L4? Would a 128M-256MB L4 be worth it as a buffer?

Oops never mind (I found it) he says no L4:

ROME I/O die is ~420 mm^2, so balance is 265 mm^2. In Zeppelin, the controllers/random logic and glue amount to ~68 mm^2. Multiplying by 4 gives 272 mm^2, more then the balance 265 mm^2. Based on this alone, I think we can conclude that there is no L4 cache.

Zapetu · Nov 18, 2018

I made new versions of my Rome and Ryzen chiplet designs to further point out, why the chiplet design sort of makes a lot of sense. AdoredTV also talked about it in his latest video. I will try to use some 'spoiler tags' to not make this post too image heavy.

Here's one important question, though. We all know that Vega 20 is a 336 mm² die on TSMC 7 nm and therefore yields already should be good enough for reasonable die sizes. If Matisse is after all going to be a separate monolithic die (>150mm²) also on 7 nm, then why didn't AMD make their Rome chiplet design like this:

Sure, you can always say that 8 chiplet design is better for yields and binning, like AdoredTV has many times pointed out and it is a very good argument. But if over 150mm² Matisse would be just fine for desktop then why wouldn't about the same size chiplet design be chosen for server CPU? The distance between the chiplets and the IO die would be very short and on-package routing much simpler. TR (ThreadRipper) would also work fine with larger dies. If AMD is going to very aggressively bin their 75mm² chiplets and only use the best ones for EPYC, then is there enough market for TR3 to use all of those not fully functioning surplus chiplets (and also IO dies on more mature process). As I have stated before, I also think that ~~Picasso~~ (more like Renoir) APU might very well be a monolithic 7 nm design (because mobile CPUs would greatly benefit from lower power consumption), but i'm not so sure about Matisse. There are benefits to all designs.

And in no way am I saying that 4 x 16C CCD + IOD would have been a better choice than 8 x 8C CCD IOD. 16C CCD would have just been a little easier from the design point of view to use but obviously, as AdoredTV very well pointed out, yields and binning do matter a lot. AMD could have gone with even 16 x 4C CCDs but without EMIB/silicon interposer like connection it would have been a packaging nightmare. Still there are pros and cons for both design choices, but AMD went the long way to do the 8 chiplet one.

Here's a new version of the Rome package based on what knowledge we have right now:

Matisse (Ryzen 3000)

Here are some pros and cons for both the chiplet and the monolithic design:

Chiplet design (1x or 2x CCD (~75mm² ) and 1x ~105mm² IOD)
- (+) Better yields for smaller size chiplet, better binning for higher clock speeds
- (+) Up to 32MB L3 for each CCD (instead of e.g. 16MB)
- (+) IO does't scale that well and doesn't waste 7 nm space
- (+) Helps to fill WSA requirements with IOD
- (+/-) Use the same design as larger IOD
  - (+) Can use the exact same design and IP as the large IOD, only cut in quarters
  - (-) If larger IOD uses 14HP then smaller one also has to (else redesign for different process)
  - (+/-) Might also have L4 cache if the larger IOD has that
- (-) Longer latencies due to IF-fabric link between CCD and IOD, might affect gaming performance significantly
Monolithic design (~150 (± 50)mm² with 8 to 12 cores and 16 to 24MB L3 (or 32 to 48MB L3, bigger die size))
- (+) IMCs (integrated memory controllers) for lower memory access latencies
- (+) More predictable performance in latency sensitive workloads (e.g. gaming)
- (+/-) Less L3 cache might be enough due to IMC
- (-) Lower yields (many partly functioning chips can be used for low-end parts, though)
- (-) IO takes a lot more (valuable) 7 nm die space that could be used for e.g. larger caches
- (-) Must axe some things to permit smaller die size, 12 cores is quite likely the max, might be 8 cores only
- (+/-) Design costs
  - (-) Needs practically to design all IO (IMC, PCIe, USB, SATA, etc.) for 7 nm process and can't directly reuse Rome's IOD features in a copy-paste manner
  - (+) If (~~Picasso~~) Renoir is also a monolithic design then most of the IO can be used for it also, no need to resign those again, can focus on iGPU (Vega/Navi?)

~~(Picasso)~~ Renoir (Ryzen 3000 (Mobile) APU)

NOTE: This should possibly be Ryzen 4000 APU and not the 3000 series one.

Here are some pros and cons for both the chiplet and the monolithic design:

Chiplet design (1x CCD (~75mm² ) and 1x ~175mm² IO + GPU die)
- Based on Vattila's design here.
- Basically just Raven Ridge with an 8 core chiplet
- (+) Better yields for smaller size chiplet, better binning for lower power consumption
- (+) Up to 32MB L3 and up to 8 cores for each CCD (instead of e.g. 4MB L3 in 4 core CCX)
- (+) IO does't scale that well and doesn't waste 7 nm space
- (+) Helps to fill WSA requirements with IO + GPU die
- (+/-) Design costs
  - (+) Can use Vega GPU, IMC, PCIe and general IO from Raven Ridge
  - (-) Has to design some parts of the Infinity Fabric 2.0 for 14LPP or 12LP to connect to 7 nm CCD
  - (+/-) Might need some redesigning if 12LP is used instead of 14LPP (like Pinnacle Ridge or Polaris 30)
    - 12LP might provide some better power consumption but likely not denser design since AMD has never used 12LP fully and it's only a stop-gap node
  - (+) Since APUs are not top-end performance parts, no need to add any L4 cache (14LPP and 12LP don't support eDRAM anyway)
- (+/-) There is enough performance packed in a single 8 core 32MB CCD that a 4 core 4MB CCX really doesn't do better even with IMC
- (-) Higher power consumption because all data and code must go to the CPU cores through on-package IF-link, might not be the best for mobile applications but 7 nm CPU should be optimizable for lower power draw even if caches are larger and there are a couple of more cores
- (-) Can't benefit at all from 7 nm GPU, power efficiency might not be much better or might even be a little bit worse than Raven Ridge, CPU performance would be likely better, though
Monolithic design (~100 ... 200mm² (?) with 4 to 6 cores and 8 to 12MB L3 )
- (+++) GPU on 7 nm for much better power efficiency, big selling point compared to other 14nm+++ mobile CPUs/APUs
- (+) IMCs (integrated memory controllers) for lower memory access latencies
- (+) More predictable performance in latency sensitive workloads (e.g. gaming) although 4 cores isn't much to begin with, 6 cores would be much better (or even 8 cores, although that might be overkill)
- (+/-) Less L3 cache might be enough due to IMC
- (-) Lower yields (many partly functioning chips can be used for low-end desktop/laptop parts, though)
- (-) IO takes a lot more (valuable) 7 nm die space that could be used for e.g. GPU, there's most likely no point to move just IO in a separate die, because a additional IF-link draws extra power (bad for mobile devices)
- (+/-) Design costs
  - (-) Needs practically to design all IO (IMC, PCIe, USB, SATA, etc.) for 7 nm process and needs also a GPU designed for 7 nm process (can't just copy-paste 14LPP Vega)
  - (+) If Matisse would also be a monolithic design then could share some design efforts for lower overall design costs

Edit: Some more thoughts

~~If (Picasso) Renoir would use a different process (e.g. 7FF) than Matisse~~ (7HPC) then it wouldn't necessarily be that easy to port different IPs and designs (memory controllers, other IO including PCIe, GPUs, etc.) between them. That would make chiplet + IO die a much easier approach, although it would lack integrated memory controllers ()IMCs) and likely have some latency issues.
As raghu78 pointed out here, AMD is only going to use N7 HPC (7 nm HPC, 7HPC) for all CPUs and GPUs in 2019. That means that sharing different designs and IP (like DDR4 PHYs and such) is not that hard. While chiplet design has it's benefits especially from manufacturing point of view, monolithic dies for (~~Picasso~~) Renoir and Matisse are not out of the question. We'll see later on what route AMD has chosen.

PotatoWithEarsOnSide · Nov 18, 2018

If they used 16C chiplets, as per your speculative suggestion, then it wouldn't look all that pretty for AM4. The chiplet would be as large as the IO, and would need to be placed to one side or other instead of spread equally on both sides. On top of this, the 8C chiplet potentially allows space for a 7nm GPU.
I like this modular approach. I suspect that the improvement that it brings must be significantly offsetting the negatives to the design. If AMD's biggest weakness is the upfront costs of developing for many different segments, then this whole approach seems to have knocked that on it's head.

Tuna-Fish · Nov 18, 2018

Zapetu said:
why didn't AMD make their Rome chiplet design like this:

Yields.

Zapetu said:
But if over 150mm² Matisse would be just fine for desktop then why wouldn't about the same size chiplet design be chosen for server CPU?

Why make an inferior choice if you don't have to? If they want to avoid the extra latency of an IO die, they have to make Matisse single die. This makes sense for desktop. This then constrains them to ~150mm^2 die, and they just have to eat the loss this causes them.

However, with the server chips they don't have such a constraint. And they can pick their die size freely, given some constraints, and smaller is always better. 7nm yields are good enough to make chips, but they are not at full mature levels yet. Actual, current yield numbers are trade secrets and the curve is usually only fully revealed after the yields are mature, so we have to guess. But using the numbers AdoredTV used:

(264*2)/667= 0.79. Moving to 16-core CCD means giving up 20% of your product. Smaller is always better. You never ask, "why didn't they just double it". The reason is always obvious. It costs a lot of money to move to bigger die. Even if they are using a bigger die for products that need it, they still want the smallest possible die for the product that don't. The more interesting question is, why didn't they just halve it? Why not use 16x4 core dies? That would buy them another 10% of yield. I think the reason for that is that CCX is 8 cores now.

PotatoWithEarsOnSide · Nov 18, 2018

The defective 8C dies could still be salvageable as 6C and 4C CPUs, whereas a defective 4C die couldn't be salvaged at all within the Ryzen spectrum of products.
In all likelihood, fully functional 8C dies PLUS salvaged defective dies is greater yielding than using 4C dies as base.

Topweasel · Nov 18, 2018

Easy Answer. Threadripper. I believe that TR3 will stay at 32 cores but offer better performance based on several factors (like one die not having to talk to another to access memory). I believe it is arranged that way so that they can remove 4 chips without unbalancing heat. To help with this design I believe EPYC CPU's 32c and under will be using 4 dies instead of 8.

More complicated. Yields. It is true that Vega 20 that will be sold only as a workstation/hpc card for well over a grand (4+ maybe) is using same process and significantly larger. But not great yields on a chip like that can be masked by the ASP of the chip. They need to be able to price each die for ~$100 for low end Ryzen's. They clock higher than Vega20. So they need to maximize yields at every opportunity.

Even more complex answer. Spacing. Current running theory is that the Navi APU will be set up as 2 chiplets and an IO chip. So on top of my personal theory that Ryzen 3k is going to be single Chiplet / Single IO and then Zen 3 will be Dual chiplet single IO (moving standard desktop to 16 cores). AMD will need space and affordability on the APU's to include a 2 chiplets and an IO for $100-$200 ASP.

PotatoWithEarsOnSide · Nov 18, 2018

I suspect that low end Ryzens will be on defective dies, which effectively have zero cost; they'll have factored their manufacturing costs based soly upon fully functional dies.
If they're paying $15k per wafer, even with 400 fully functional dies they are below $40 per die. Of course, there's the IO die to add too, which is likely pretty neglible, and the R&D costs (which are likely to be spread over a much larger volume of SKUs sold anyway). I fully expect margins to be rather high even at AMD pricing levels.

Topweasel · Nov 18, 2018

The IO die isn't going to be negligable and defunct dies aren't going to be a zero sum. On desktop lower cost CPU's are volume sellers. Which has historically meant for AMD selling working dies cut down to spec to meet demand. Unless AMD has really bad yields this will still be the case. Like every company it's on AMD to keep costs down and margins high to remain profitable and there is a mile in between that and Intel pricing for AMD to be value oriented pricing while being disruptive with performance. That said I don't expect the whole package to be much more expensive for AMD then Summit Ridge. But you aren't doing that with a 150-180mm core chiplet and a sizeable 14nm IO die.

Zapetu · Nov 18, 2018

PotatoWithEarsOnSide said:
If they used 16C chiplets, as per your speculative suggestion, then it wouldn't look all that pretty for AM4.

Exactly. And with ~150mm² CCD there would not have been much of a point to make Matisse a chiplet design anyway,. I was only speculating that if both Matisse and (~~Picasso~~) Renoir would be monolithic designs then Rome could as well have been 16C CCD. And I'm not saying that it would have been a better design than 8C CCD but that it would have been one usable design choice.

Tuna-Fish said:
Yields.

Why make an inferior choice if you don't have to? If they want to avoid the extra latency of an IO die, they have to make Matisse single die. This makes sense for desktop. This then constrains them to ~150mm^2 die, and they just have to eat the loss this causes them.

True. 16C CCD would have been a substantially inferior design for yields and binning but there are many pros and cons for every design. As Vattila has many times pointed out, having fully connected quads on each hierarchical level is quite a nice design choice. In my opinion, his earlier design would have been much nicer with only 4 chiplets where each chiplet would have had 4 fully connected 4C CCXs and all CCD-to-IOD IF-links would then again been a fully connected quad on the IO die. That would have made the IOD somewhat simpler. Still, I'm not saying that there's something wrong with AMD's way of doing it or even that this 16C design would have been a better choice. And I'm sure AMD made the best possible design they could all things considered.

Tuna-Fish said:
7nm yields are good enough to make chips, but they are not at full mature levels yet. Actual, current yield numbers are trade secrets and the curve is usually only fully revealed after the yields are mature, so we have to guess.

Still they want to go with Vega 20 (336 mm²) first... But it's true that it's not their bread an butter and they really want to do well with server CPUs. Therefore it's much more important design choice what they did with their EPYC line and they can be much more risky with their Radeon Instinct line. That's actually only product which competes very well with nVidia's current offerings (in highest end).

Tuna-Fish said:
(264*2)/667= 0.79. Moving to 16-core CCD means giving up 20% of your product. Smaller is always better. You never ask, "why didn't they just double it". The reason is always obvious. It costs a lot of money to move to bigger die. Even if they are using a bigger die for products that need it, they still want the smallest possible die for the product that don't.

We'll that's certainly more true with 7 nm than it was with with larger nodes. But Nvidia for example would like to make chips that were even bigger than the reticle limit. IBM does very big monolithic chips and so does Intel. Obviously no one makes their chips larger than they need to be but monolithic designs do have many benefits of which none has anything to do with better yields or binning.

It's true that AMD has to do it the smart way as they have no other choice but it seems to have panned out very well for them.

Tuna-Fish said:
The more interesting question is, why didn't they just halve it? Why not use 16x4 core dies? That would buy them another 10% of yield. I think the reason for that is that CCX is 8 cores now.

There are many pros and cons for both 2x 4C CCX and 1x 8C CCX and we still don't know for sure which one it's going to be. It might not be that far-fetched, though, that 8x CCX is overall a better choice for many server tasks (I'm sure many would disagree) as 4C CCX should have the best possible best case latency for L3 cache but 8C CCX wouldn't have the same kind of cross CCX latency "issues" as 2x 4D CCX on the same die might still have. I think that 16 chiplets on an organic packaging would be quite hard to route without an EMiB-like technology or a silicon interposer so maybe 8 core chiplets were the best AMD currently could viably do.

And I'm in no way disagreeing with you but would like look at this from many different points of view. Many nice points, a very good explanation of yields and a lot to think about.

Zapetu · Nov 18, 2018

Topweasel said:
That said I don't expect the whole package to be much more expensive for AMD then Summit Ridge. But you aren't doing that with a 150-180mm core chiplet and a sizeable 14nm IO die.

True. And there really isn't that much spare room on AM4 package but it can be done . I'm guessing that purely from a market placement point of view they could aim for lowest possible die sizes using the following kinds of designs:

Picasso APU (7 nm SoC? or HPC?) 4 to 6 core monolithic die (2 to 6 cores), Vega 9 to 12CUs (or later Navi), smallest possible die size with no unnecessary extra features, GPU is easily bandwidth limited anyway (Edit: I was referring to the wrong APU)
Edit: Renoir APU 4 to 6 core monolithic die (2 to 6 cores), Vega/Navi, small die size
Matisse (7 nm HPC) 8 to 12 cores monolithic die (6 to 12 cores), quite a small die size, not too big L3, etc. or only 8 cores but larger L3
ThreadRipper (7 nm HPC, 14 nm HP?)- 16 to 32 cores, all EPYC Rome leftover stuff: 1-8 chiplets + IO die
EPYC (same as TR) - 8 (?) to 64 cores, all fully working IO dies

I find it more likely that if they're going with the chiplet design on AM4, they would choose either the 2x chiplets and 1/4 IOD die (because it's "very easy" to design) but maybe that's for the next gen or they could also go with a single chiplet + ~1/6 IOD without GPU approach if they absolutely have to. If they really would use 14HP, then any kind of iGPU is kind of a long shot. But if they settle for 14LPP on the big IOD, then sure, there are plenty of Vega designs easily portable.

All the points that have been presented about yields and binning are very good and valid but maybe AMD wants to first perfect their chiplet approach on servers before moving it to mainstream.

Topweasel said:
Current running theory is that the Navi APU will be set up as 2 chiplets and an IO chip. So on top of my personal theory that Ryzen 3k is going to be single Chiplet / Single IO and then Zen 3 will be Dual chiplet single IO (moving standard desktop to 16 cores). AMD will need space and affordability on the APU's to include a 2 chiplets and an IO for $100-$200 ASP.

EMIB-like packaging technologies would greatly help products like mobile APU with HBM(2) and would permit low latency / low power interconnects between CPU / GPU chiplets and the IO die. Once they get over the organic packaging phase, then sure, the future will be chiplets. Bu you never know how soon AMD is going to try something, even the Rome chiplet design was a very bold move with traditional packaging technologies.

More advanced packaging methods would also help to develop MCM GPUs that are in great need for AMD to compete on top again. And current organic packaging is not high bandwidth enough for GPUs and unfortunately Navi will almost certainly be a monolithic design as most of you already know.

I will stop linking to AdoredTV's videos soon (someone else can do it later) but let me first direct some of you to my favorite video of his, although likely most of you have already seen it. Even passive silicon interposers might not be fast enough and active interposers (could be bridge chiplets as well) with some clever routing algorithms are needed. But maybe in a few years time...

I thing this was the first video link that H T C posted on these forums, although personally I had seen few times before.

maddie · Nov 18, 2018

As one who expected a good chance of an interposer based solution, I was conscious of an increased cost for this choice.

With the organic package based integration of the chiplets a reality, why is there still discussion of this being too expensive for the desktop?
Is the cost of placing 2 die on an organic package that expensive, to the point where it negates the other multitude of savings?
Do these not matter?
Lower design costs, lower inventory costs, faster response to varying sales, etc?

Tuna-Fish · Nov 18, 2018

maddie said:
As one who expected a good chance of an interposer based solution, I was conscious of an increased cost for this choice.

Multiple chiplets on an organic base are not substantially more expensive than one monolithic chip. There is a little bit of extra cost in packaging and testing, but that is probably less than the savings. The biggest reason to go monolithic on desktop is DRAM latency -- since they are using lines on organic package, they will have to use SerDes, and that always costs a few cycles. For servers, the unified memory space is worth losing a little bit on latency, but on desktop, especially in games, it still hurts.

Even given that, I honestly think that it's quite unlikely that there would be 3 separate 7nm Zen2 dies. Aside from the chiplet for EPYC and TR, I think that an APU die is possible, but I honestly think it would be unlikely that there would be a separate one for >8 core Ryzen. Either entire non-HEDT market is served from the APU chip, or there is a separate IO chip that gets paired with 2 cpu chiplets.

jpiniero · Nov 18, 2018

Zapetu said:
Still they want to go with Vega 20 (336 mm²) first... But it's true that it's not their bread an butter and they really want to do well with server CPUs. Therefore it's much more important design choice what they did with their EPYC line and they can be much more risky with their Radeon Instinct line. That's actually only product which competes very well with nVidia's current offerings (in highest end).

Sure looks like Real ASP's are way way higher on Instinct than even Epyc. That's why it was first.

msi2 · Nov 18, 2018

Tuna-Fish said:
Multiple chiplets on an organic base are not substantially more expensive than one monolithic chip. There is a little bit of extra cost in packaging and testing, but that is probably less than the savings. The biggest reason to go monolithic on desktop is DRAM latency -- since they are using lines on organic package, they will have to use SerDes, and that always costs a few cycles. For servers, the unified memory space is worth losing a little bit on latency, but on desktop, especially in games, it still hurts.

Even given that, I honestly think that it's quite unlikely that there would be 3 separate 7nm Zen2 dies. Aside from the chiplet for EPYC and TR, I think that an APU die is possible, but I honestly think it would be unlikely that there would be a separate one for >8 core Ryzen. Either entire non-HEDT market is served from the APU chip, or there is a separate IO chip that gets paired with 2 cpu chiplets.

So, reading that, would it mean that a console APU has to be monolithic? Mainly for latency reasons?

64 core EPYC Rome （Zen2）Architecture Overview？

Platinum Member

Diamond Member

Lifer

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Member

Senior member

Member

Senior member

Member

Senior member

Golden Member

Senior member

Diamond Member

Senior member

Diamond Member

Member

Member

Diamond Member

Golden Member

Lifer

Junior Member