Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 6000)


maddie

Diamond Member
Jul 18, 2010
There are a lot of possibilities if stacking is involved. It could be directly on the IO die or it could be a chiplet stacked on top of the IO die or a chiplet stacked on top of a larger interposer with other chiplets. If the IO die is made on the latest process, then it may make sense for it to be directly on the same die. For lower cost systems, it would make sense for it to just be directly on the IO die with no stacking; basically the same as current Zen 3.

That kind of comes back to making a chip for stacking on an interposer vs. a non-stacked solution. If they make a CPU chiplet specifically designed for stacking, then how do you do a lower-end design where stacking is possibly too expensive? Do they make two different chiplets? It seems like they wouldn't; the lower end would just be a fully integrated APU.

Where does the IO die with graphics fit, and what market does it cover? It might be that the integrated graphics is so much better than previous solutions (due to DDR5, large caches, or something else) that it can compete well with low-end discrete graphics. That would change the market positioning if the IO-die graphics were sufficient for 1080p. I have been suspicious of graphics in the IO die due to the market segmentation: if you are going for cheap, then a monolithic APU seems to make more sense. It does make some sense to have some graphics functionality across the whole line, though.
What size of die is needed to allow all of the IO connections? Could it be that the minimum die size needed for the 7/6nm IO die has wasted space that they're filling with graphics?
 

moinmoin

Platinum Member
Jun 1, 2017
AMD designed the CCDs for it knowing that it would only be available for use in Q4 2021.
Or did they? I really wish I had been a fly on the wall back when Zen 1 through 3 were all planned. The fact that all three were planned close together imo makes it rather likely that the decisions to go with Zeppelin for Zen 1, chiplets/IOD for Zen 2 and V-Cache for Zen 3 were among the earlier goals in their conception. Makes one wonder how early Zen 2 going to TSMC was a given. And as somebody else on here (@lobz) already mentioned, AMD already had plenty of experience with TSVs thanks to Fury X. Since AMD actually developed HBM, chances are high the new techs used were developed in close cooperation as well.
 

Vattila

Senior member
Oct 22, 2004
Here is David Schor's take on V-Cache:

AMD 3D Stacks SRAM Bumplessly – WikiChip Fuse


I think it is pretty obvious by now that the CCD solution with optional V-Cache will remain a key feature of Zen 4. It is nicely flexible and low risk. The big question in my view is whether they will finally move away from the slow and power-hungry interconnect implemented in the organic substrate to a faster, wider and more power-efficient chiplet interconnect on silicon interposer and/or over embedded silicon bridges. The ugly mock-ups from the "leaked" sources indicate that they will not. I think and hope they will. Despite the positive feedback to my own mock-ups based on silicon interposers and bridges, I find it hard to gauge the general consensus here. Do you think the current interconnect in the package can be extended to Zen 4 — with higher bandwidth demands and more chiplets complicating the routing further — or is AMD bound to move to a more efficient interconnect on silicon?



 

DrMrLordX

Lifer
Apr 27, 2000
Mmmm ... I am not sure I understood the relation between Milan-X and Genoa
Milan-X will presumably feature v-cache. Two scenarios:

1). AMD delays Genoa until Q4 2022 to feature v-cache on all Genoa products (essentially turning it into Genoa-X). Milan-X extends the life of their server/workstation offerings until Genoa is finally ready.
2). AMD keeps Genoa on track for a release to interested customers without v-cache while simultaneously releasing Milan-X to customers that would prefer the extra L3 for cache-intensive applications. AMD follows up with Genoa-X after Q4 2022 to better meet the needs of customers that had interest in Milan-X.

Which do you think is more likely?

I'm in agreement with you that there are enough server customers out there that want the improvements of Genoa but don't really need stacked L3. But AMD is bringing it to their EPYC lineup (Milan-X), and it's logical to conclude that there will be a Genoa product with die stacking in the future, once the packaging tech is fully validated for N5.
 

MadRat

Lifer
Oct 14, 1999
The chips that use V-cache stacked probably are not the same ones using an integrated GPU. I'd think different market segments.
 

Vattila

Senior member
Oct 22, 2004
The fact that all three were planned close together imo makes it rather likely that the decisions to go with Zeppelin for Zen 1, chiplets/IOD for Zen 2 and V-Cache for Zen 3 were among the earlier goals in their conception.
I am particularly impressed by how they lowered design risk around the Zen 3 core revamp, by keeping the package design and chiplet topology identical, while planning for a risk-free extension of the L3 with V-Cache. Pretty slick. Going forward, how they will evolve the chiplet topology and interconnect are the most interesting design issues in my view, as well as which components (e.g. GPU, FPGA, HBM, VPU) they may include in the package.
 

maddie

Diamond Member
Jul 18, 2010
Or did they? I really wish I had been a fly on the wall back when Zen 1 through 3 were all planned. The fact that all three were planned close together imo makes it rather likely that the decisions to go with Zeppelin for Zen 1, chiplets/IOD for Zen 2 and V-Cache for Zen 3 were among the earlier goals in their conception. Makes one wonder how early Zen 2 going to TSMC was a given. And as somebody else on here (@lobz) already mentioned, AMD already had plenty of experience with TSVs thanks to Fury X. Since AMD actually developed HBM, chances are high the new techs used were developed in close cooperation as well.
Is v-cache simply TSVs?

I read stacking is old school, TSVs are old school. Nothing to see here. What about Cu>Cu direct bonding?

I don't know how they're doing this. What is causing the Cu fusion at the interface? If you know, please tell me.

This is, in my opinion, a fair ways different from old-style TSVs with solder micro-bumps.
 

Vattila

Senior member
Oct 22, 2004
What about Cu>Cu direct bonding? I don't know how they're doing this. What is causing the Cu fusion at the interface? If you do please tell me.
SemiEngineering article on bonding:

Bonding Issues For Multi-Chip Packages

"One proposed alternative, copper-copper direct bonding, has the advantage of simplicity. With no intervening layer, temperature and pressure fuse the top and bottom pads into a single piece of metal, making the strongest possible connection. That’s the idea behind thermocompression bonding. Copper pillars on one die match pads on a second die. Heat and pressure drive diffusion across the interface to make a permanent bond."

semiengineering.com

PS. Also see cold welding — how two clean flat surfaces of the same metal spontaneously fuse in a vacuum.
 

maddie

Diamond Member
Jul 18, 2010
SemiEngineering article on bonding:

Bonding Issues For Multi-Chip Packages

"One proposed alternative, copper-copper direct bonding, has the advantage of simplicity. With no intervening layer, temperature and pressure fuse the top and bottom pads into a single piece of metal, making the strongest possible connection. That’s the idea behind thermocompression bonding. Copper pillars on one die match pads on a second die. Heat and pressure drive diffusion across the interface to make a permanent bond."

semiengineering.com
Good find.

Was just going to speculate if pressure fusion was in play. Saved that post.
 

maddie

Diamond Member
Jul 18, 2010
SemiEngineering article on bonding:

Bonding Issues For Multi-Chip Packages

"One proposed alternative, copper-copper direct bonding, has the advantage of simplicity. With no intervening layer, temperature and pressure fuse the top and bottom pads into a single piece of metal, making the strongest possible connection. That’s the idea behind thermocompression bonding. Copper pillars on one die match pads on a second die. Heat and pressure drive diffusion across the interface to make a permanent bond."

semiengineering.com

PS. Also see cold welding — how two clean flat surfaces of the same metal spontaneously fuse in a vacuum.
Well, something else is happening. The full quote seems to rule out mass production via pressure fusion alone.

"One proposed alternative, copper-copper direct bonding, has the advantage of simplicity. With no intervening layer, temperature and pressure fuse the top and bottom pads into a single piece of metal, making the strongest possible connection. That’s the idea behind thermocompression bonding. Copper pillars on one die match pads on a second die. Heat and pressure drive diffusion across the interface to make a permanent bond. Typical temperatures in the range of 300 ºC soften the copper, allowing the two surfaces to conform to each other. Thermocompression bonding can take 15 to 60 minutes, though, and requires a controlled atmosphere to prevent copper oxidation."

The vacuum method appears much more suited for mass fabrication.

It's truly amazing how far ahead TSMC is with this in connections/mm², and it still has far to go. Intel's 3rd-gen Foveros will be similar.

Connection density (connections/mm²):

EMIB: 350
EMIB (future): 750
Foveros: 400
Foveros (future): 1,600
Foveros (future 2): >10,000
TSMC SoIC: >8,000
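A quick back-of-envelope in Python to put those densities in perspective; the per-connection data rate here is a made-up illustrative number, not a vendor figure:

```python
# Back-of-envelope: aggregate bandwidth per mm^2 of a stacked interface,
# given connection density (connections/mm^2) and an ASSUMED per-pin
# data rate. The 2 Gbit/s figure is an illustrative guess, not a
# published Intel/TSMC number.

densities = {                  # connections per mm^2 (figures quoted above)
    "EMIB":               350,
    "EMIB (future)":      750,
    "Foveros":            400,
    "Foveros (future)":  1600,
    "Foveros (future 2)": 10000,
    "TSMC SoIC":         8000,
}

GBIT_PER_PIN = 2.0  # assumed signalling rate per connection (hypothetical)

for name, per_mm2 in densities.items():
    gbytes = per_mm2 * GBIT_PER_PIN / 8  # GB/s per mm^2
    print(f"{name:20s} {per_mm2:>6} /mm^2 -> ~{gbytes:,.0f} GB/s per mm^2")
```

Even at that modest assumed rate, SoIC-class density works out to roughly 2 TB/s per square millimetre of overlap, which is why on-silicon links don't need to be serialized the way substrate links do.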
 

Vattila

Senior member
Oct 22, 2004
The WikiChip article was a good read, which made me inclined to speculate further. I think the "structural silicon" dies above the core complexes are not going to remain "dummy" dies forever. Schor speculates that the dummy dies may include thick copper traces to aid heat transfer. However, why not put all that copper to use? I see an opportunity to put fat vector engines in these dies. Instead of squeezing 512-bit wide SIMD units and data paths into the core below to support AVX-512, instead put (perhaps even wider) vector units in the dies above. If they can implement it as an option, even better for both risk management and SKU options (cost). Of course, when the processor is running all those vector units at full tilt, the processor may have to throttle down. But that is not unlike Intel processors today. Unlike Intel's processors, the x86 core in my proposed design would not incur die size penalty for AVX-512, nor any compromises in the core design floorplan that may impede clock speed and efficiency. I.e. without the vector engines (no die present, disabled or not active) the core will execute logic code optimally at high frequency with the current power-efficient 256-bit data paths. For wide vector code, all the power budget is allocated to the vector engine above, with dynamic frequency control within the TDP.

Even though cores and vector engines will be limited by power budget, and would have to dynamically adjust frequency, it seems to me that you could achieve pretty amazing performance per area this way, with a very flexible mix between logic and vector processing.

[Attachment: Zen Vector Engine Chiplets (speculation) mock-up]
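As a toy illustration of that frequency trade-off under a shared TDP (the cubic power-frequency scaling and every constant below are made up for illustration, not AMD figures):

```python
# Toy model: one core and a hypothetical stacked vector-engine die
# sharing a fixed power budget. Assumes dynamic power scales roughly
# as f^3 (P ~ C * f * V^2, with V roughly tracking f). All constants
# are invented for illustration.

TDP = 15.0  # watts available to one core plus its stacked vector unit

def freq_at_power(power_w, k):
    """Frequency (GHz) reachable at a given power, with P = k * f^3."""
    return (power_w / k) ** (1.0 / 3.0)

K_CORE = 0.15  # core power coefficient (made up)
K_VEC  = 0.50  # wide vector die: heavier per-GHz cost (made up)

# Scalar-only workload: the core gets the whole budget.
print(f"scalar-only core: {freq_at_power(TDP, K_CORE):.2f} GHz")

# Vector-heavy workload: the budget is split and both domains clock down.
for core_share in (0.7, 0.5, 0.3):
    f_core = freq_at_power(TDP * core_share, K_CORE)
    f_vec  = freq_at_power(TDP * (1 - core_share), K_VEC)
    print(f"core {core_share:.0%} of TDP: core {f_core:.2f} GHz, vector {f_vec:.2f} GHz")
```

The point is just that the frequency penalty grows with the cube root of the power given away, so handing most of the budget to the vector die still leaves the core at a usable clock.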
 

jamescox

Senior member
Nov 11, 2009
What size of die is needed to allow all of the IO connections? Could it be that the minimum die size needed for the 7/6nm IO die has wasted space that they're filling with graphics?
Unfortunately, I have no idea. It is a rather ridiculous number of pins required for a single IO die. I don't know what the ball density is on their packaging, though. If they have things stacked on top of an interposer, then that reduces the number of connections significantly. The current IO die has 8 x16 PCI Express links, 8 DDR4 channels, and 8 x32 connections to the CPU dies, plus some other miscellaneous connections. You also have power and ground pins. If the CPU dies were stacked on top, then that removes the IO-die-to-CPU connections.
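Tallying just the signal connections implied by those figures gives a feel for the scale (the per-interface pin counts below are rough approximations picked for illustration; power and ground pins are ignored):

```python
# Rough tally of signal pins on a Zen 3 era server IO die, from the
# figures above: 8 x16 PCIe links, 8 DDR4 channels, 8 x32 links to the
# CPU dies. The per-interface pin costs are approximations, not
# AMD-published numbers.

pcie_lanes = 8 * 16
pcie_pins  = pcie_lanes * 4       # differential TX + RX pair per lane

ddr4_channels = 8
ddr_pins = ddr4_channels * 288    # ~288 contacts per DDR4 channel (approx.)

ifop_bits = 8 * 32
ifop_pins = ifop_bits * 2         # assume differential signalling

total = pcie_pins + ddr_pins + ifop_pins
print(f"PCIe: {pcie_pins}, DDR4: {ddr_pins}, CPU links: {ifop_pins}")
print(f"signal pins before power/ground/misc: ~{total}")
```

Thousands of signal connections before power delivery is even counted, which is why the minimum die and package size may be set by connectivity rather than logic area.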
 

jamescox

Senior member
Nov 11, 2009
The WikiChip article was a good read, which made me inclined to speculate further. I think the "structural silicon" dies above the core complexes are not going to remain "dummy" dies forever. Schor speculates that the dummy dies may include thick copper traces to aid heat transfer. However, why not put all that copper to use? I see an opportunity to put fat vector engines in these dies. Instead of squeezing 512-bit wide SIMD units and data paths into the core below to support AVX-512, instead put (perhaps even wider) vector units in the dies above. If they can implement it as an option, even better for both risk management and SKU options (cost). Of course, when the processor is running all those vector units at full tilt, the processor may have to throttle down. But that is not unlike Intel processors today. Unlike Intel's processors, the x86 core in my proposed design would not incur die size penalty for AVX-512, nor any compromises in the core design floorplan that may impede clock speed and efficiency. I.e. without the vector engines (no die present, disabled or not active) the core will execute logic code optimally at high frequency with the current power-efficient 256-bit data paths. For wide vector code, all the power budget is allocated to the vector engine above, with dynamic frequency control within the TDP.

Even though cores and vector engines will be limited by power budget, and would have to dynamically adjust frequency, it seems to me that you could achieve pretty amazing performance per area this way, with a very flexible mix between logic and vector processing.

You are stacking a high-power chip on top of another high-power chip. If this is doable, then you should just be able to stack multiple CPUs on top of each other. I don't know what they do to optimize the thermal transfer through the structural silicon, but they have said that it is optimized in some manner. Besides the thermal limitations, they are also limited in what material they can put there. They can't just put pieces of copper there, since the difference in thermal expansion rates could cause damage as the chip heats up and cools down.
 

jamescox

Senior member
Nov 11, 2009
Here is David Schor's take on V-Cache:

AMD 3D Stacks SRAM Bumplessly – WikiChip Fuse


I think it is pretty obvious by now that the CCD solution with optional V-Cache will remain a key feature of Zen 4. It is nicely flexible and low risk. The big question in my view is whether they will finally move away from the slow and power-hungry interconnect implemented in the organic substrate to a faster, wider and more power-efficient chiplet interconnect on silicon interposer and/or over embedded silicon bridges. The ugly mock-ups from the "leaked" sources indicate that they will not. I think and hope they will. Despite the positive feedback to my own mock-ups based on silicon interposers and bridges, I find it hard to gauge the general consensus here. Do you think the current interconnect in the package can be extended to Zen 4 — with higher bandwidth demands and more chiplets complicating the routing further — or is AMD bound to move to a more efficient interconnect on silicon?



Is there anything from AMD indicating that they actually plan on using HBM on an EPYC die? With the amount of SRAM cache, I don't know if an HBM L4 cache would actually be that useful. Epyc is going to have a lot of DDR5 channels and a lot of SRAM cache.

At the moment, I am kind of thinking that initial Zen 4 may still be similar to Zen 3, just with PCI Express 5 level interconnect speeds. The large interposers have scaling issues. How do you efficiently and cost-effectively make intermediate products? Zen 3 goes from 8 to 64 cores, although the 8-core version still uses 8 CPU dies. You probably don't want to use a giant interposer for a smaller number of CPU dies, so do they use multiple interposer sizes?

I had the idea that they may split up the IO die into multiple active interposers, but this may be unlikely. It makes some sense since it would be very modular. The current IO die is around 435 square millimeters; I don't know what that will shrink to on a TSMC process, but splitting it into multiple active interposers may make sense. Some parts will get bigger if they use more memory channels and upgrade to PCI Express 5. The internal pathways will need to be wider and/or higher-clocked to handle the extra bandwidth requirements. The IO will not scale down as well, so the die may still be rather large if it is a single chip rather than some kind of stacked solution.

A possibly much lower risk option would be to have close to the same layout except with the cpu to IO die connections replaced by local silicon interconnects. That would make the common case very cheap. The version with low cpu chip count (4 or 6, depending on layout) would have very short embedded silicon bridges. For routing under another cpu chiplet, they would need either a really long silicon bridge or they would need to kind of daisy chain them. I could see them using rows of 3 chiplets with tiny embedded silicon bridges between each. The silicon bridges could probably be all the same and very small. The number of hops probably isn't that important. The connection would be very wide and they wouldn't need to be serialized. They might be able to set up the cpu chiplet to pass through multiple connections such that they aren't shared.

Any of these may contain stacked components, like stacked L3 caches or stacked IO die:

1. Same as Zen 3, except double the link speed to pci-express 5 levels and done.
2. Giant interposer(s) under everything as in mock-ups. Expensive and may be hard to scale to different number of chiplets.
3. Tiny, modular interposers with IO and other chiplets stacked. Possibly expensive. Still need to connect interposers together, so may be unlikely; why use an interposer rather than local silicon interconnect?
4. Same type of layout as Zen 3, except use local silicon interconnect to connect CPU chiplets to the IO die. No actual interposers. Possibly daisy-chain CPU chiplets with local silicon interconnect. May be a shared link or pass-through. The bandwidth could be ridiculously high, so even if it were shared, it might not be an issue.

Any other ideas? There are a lot of possibilities with stacking, so there may be a surprise, but we have some idea of what connection technology TSMC has available.

I am kind of leaning towards #4. It would help a lot with power consumption and bandwidth without introducing any scaling issues. It is obviously still very modular and the cost would scale. You could make the common 4 cpu chiplet version with 4 cpu chiplets, 4 LSI die, and the IO die. You could scale all of the way up to 8, 12 or more cpu chiplets with the cost scaling with it. I don't know if the local silicon interconnect can be used across multiple products the way the cache die probably can. It could be a standardized, wide infinity fabric protocol chip that might get used to connect other infinity fabric chiplets, like gpus and fpga devices. It is also possible that the initial version is just #1 and extra 3D stuff comes later. Perhaps they have a multi-layer IO die to take advantage of different processes, but leave the interconnect mostly the same.
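A quick sanity check on the shared-link point in option 4, with all link numbers assumed for illustration rather than taken from any spec:

```python
# Toy comparison: aggregate bandwidth of one wide parallel die-to-die
# link (option 4's local silicon interconnect) vs. the DDR5 traffic a
# daisy-chained chiplet would actually generate. Link width and clock
# are invented illustrative values, not AMD/TSMC specs.

LINK_WIDTH_BITS = 1024   # assumed very wide unserialized parallel link
LINK_CLOCK_GHZ  = 2.0    # assumed link clock

link_gbs = LINK_WIDTH_BITS * LINK_CLOCK_GHZ / 8   # GB/s
print(f"one wide link: ~{link_gbs:.0f} GB/s")

# Three chiplets daisy-chained on one link back to the IO die:
print(f"shared three ways: ~{link_gbs / 3:.0f} GB/s per chiplet")

# For scale: one DDR5-4800 channel moves 4.8 GT/s * 8 bytes.
ddr5_channel = 4.8 * 8
print(f"one DDR5-4800 channel: ~{ddr5_channel:.1f} GB/s")
```

Even split three ways, the assumed link comfortably exceeds what a DDR5 channel can feed it, which is the sense in which sharing "might not be an issue."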
 

DisEnchantment

Senior member
Mar 3, 2017
The WikiChip article was a good read, which made me inclined to speculate further. I think the "structural silicon" dies above the core complexes are not going to remain "dummy" dies forever. Schor speculates that the dummy dies may include thick copper traces to aid heat transfer. However, why not put all that copper to use? I see an opportunity to put fat vector engines in these dies. Instead of squeezing 512-bit wide SIMD units and data paths into the core below to support AVX-512, instead put (perhaps even wider) vector units in the dies above. If they can implement it as an option, even better for both risk management and SKU options (cost). Of course, when the processor is running all those vector units at full tilt, the processor may have to throttle down. But that is not unlike Intel processors today. Unlike Intel's processors, the x86 core in my proposed design would not incur die size penalty for AVX-512, nor any compromises in the core design floorplan that may impede clock speed and efficiency. I.e. without the vector engines (no die present, disabled or not active) the core will execute logic code optimally at high frequency with the current power-efficient 256-bit data paths. For wide vector code, all the power budget is allocated to the vector engine above, with dynamic frequency control within the TDP.

Even though cores and vector engines will be limited by power budget, and would have to dynamically adjust frequency, it seems to me that you could achieve pretty amazing performance per area this way, with a very flexible mix between logic and vector processing.

Like you said, vector units seem like a good fit: clocked low and wide, with, for example, twice or thrice the latency of the built-in AVX units.
Throughput should be good if they are wide enough. One caveat, though, is that with A+A/CXL coherent systems it is not really a game changer.
Heat transfer and power delivery would be tough problems to solve.

I am very curious what else they come up with in order to customize the core itself, or they might just make a different CCD

Which do you think is more likely?
My guess is as good as anybody's, but it sounds reasonable that they will have non-stacked Genoa first, and those that need the V-Cache will get it later.
 

Tuna-Fish

Golden Member
Mar 4, 2011
Do you think the current interconnect in the package can be extended to Zen 4 — with higher bandwidth demands and more chiplets complicating the routing further — or is AMD bound to move to a more efficient interconnect on silicon?
They obviously can extend the current system to Zen 4, it just wouldn't be as good. Will they? This depends entirely on something we cannot know. What is the cost of using better interconnects, and what is the maximum manufacturing throughput for chips using them?

For reasons that have been done to death, including in this thread, a better interconnect would be awesome and would give a very substantial improvement in performance (lower latency, higher throughput, lower non-core power which allows higher core power, etc.). In that sense, it is not a question of whether AMD wants it; they almost certainly do. The question is when, and whether, AMD is able to do it. If the facilities used for bonding chips are only capable of producing less than half of the chips AMD currently sells, then it's not happening yet. If the process adds too large a fixed cost to manufacturing, it's not happening for anything without high enough margins.
 

MadRat

Lifer
Oct 14, 1999
The WikiChip article was a good read, which made me inclined to speculate further. I think the "structural silicon" dies above the core complexes are not going to remain "dummy" dies forever. Schor speculates that the dummy dies may include thick copper traces to aid heat transfer. However, why not put all that copper to use? I see an opportunity to put fat vector engines in these dies. Instead of squeezing 512-bit wide SIMD units and data paths into the core below to support AVX-512, instead put (perhaps even wider) vector units in the dies above. If they can implement it as an option, even better for both risk management and SKU options (cost). Of course, when the processor is running all those vector units at full tilt, the processor may have to throttle down. But that is not unlike Intel processors today. Unlike Intel's processors, the x86 core in my proposed design would not incur die size penalty for AVX-512, nor any compromises in the core design floorplan that may impede clock speed and efficiency. I.e. without the vector engines (no die present, disabled or not active) the core will execute logic code optimally at high frequency with the current power-efficient 256-bit data paths. For wide vector code, all the power budget is allocated to the vector engine above, with dynamic frequency control within the TDP.

Even though cores and vector engines will be limited by power budget, and would have to dynamically adjust frequency, it seems to me that you could achieve pretty amazing performance per area this way, with a very flexible mix between logic and vector processing.

If by vector engine you mean a SIMT extension for cryptography functions, or maybe a SIMD extension for a double-precision FPU? That would get awfully hot on top of the cores below. It sounds like something more for its own real estate on the same plane as the chiplets.
 

MadRat

Lifer
Oct 14, 1999
Is v-cache simply TSVs?

I read stacking is old school, TSVs are old school. Nothing to see here. What about Cu>Cu direct bonding?

I don't know how they're doing this. What is causing the Cu fusion at the interface? If you know, please tell me.

This is, in my opinion, a fair ways different from old-style TSVs with solder micro-bumps.
Pretty cool how the Fury X used 5mm x 7mm chips on an interposer. Memory in-package with 4096-bit connections! That was 6 years ago. The biggest problem was that TSVs drove high prices. Funny thing is that Nvidia dominated with 256-bit connections to GDDR6. Not sure if it was cooling problems from stacking or their HBCC, but AMD stumbled with Fury X. So whether things have improved with the AMD HBCC controller, only time will tell.

 

lobz

Golden Member
Feb 10, 2017
Pretty cool how that Fury X used 5mm x 7mm chips on an interposer. Memory on-die with 4096-bit connects! That was 6 years ago. The biggest problem here is that TSV drove high prices. Funny thing is that Nvidia dominated it with 256-bit connects to GDDR6. Not sure if it was cooling problems from stacking or their HBCC, but AMD stumbled with Fury X. So whether things improved or not with the AMD HBCC controller, only time will tell.

At the time, that was an actual engineering marvel, regardless of how unfeasible it was for the consumer market.
 

eek2121

Senior member
Aug 2, 2005
I am particularly impressed by how they lowered design risk around the Zen 3 core revamp, by keeping the package design and chiplet topology identical, while planning for a risk-free extension of the L3 with V-Cache. Pretty slick. Going forward, how they will evolve the chiplet topology and interconnect are the most interesting design issues in my view, as well as which components (e.g. GPU, FPGA, HBM, VPU) they may include in the package.
I am more impressed that they are going to squeeze another 10-20% out of 7nm, without even touching the core design.

Intel management needs to fire themselves.
 

moinmoin

Platinum Member
Jun 1, 2017
Schor speculates that the dummy dies may include thick copper traces to aid heat transfer. However, why not put all that copper to use? I see an opportunity to put fat vector engines in these dies.
A: "Guys, we got dummy silicon layers to make use of above the hotspots now. What to do with them?"
B: "We already got patents on how to use paths in silicon to more efficiently move away heat from those hotspots. Let's use that."
C: "Naw, let's instead multiply the heat by stacking fat vector units above them! YOLO!" 😜

Seriously though, as I wrote before, I'm pretty sure AMD already considers hotspots in the way they design silicon layouts, adding deliberate dark-silicon spacing where hotspots are, etc. While adding fat vector engines above that is not impossible, it literally increases the layout difficulty by another dimension. We may well get to the point where this vertical space is made active use of, exploiting it for much more "cooling space" while retaining dense interconnects within the core logic, but I don't think we are there yet.
 
