Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)


jamescox

Senior member
Nov 11, 2009
637
1,103
136
Especially when the real reason is so obvious, there is no need to speculate. The whole purpose of chiplets is cost efficiency. AMD is widening its chip to increase IPC. Going from 5N to 4N doesn't provide significant density improvements, and AMD will probably choose a blend for the best performance/efficiency. More than 8 cores just doesn't make sense economically.
There have been some rumors saying that the L2 cache may be shared across all 8 cores. Such an L2 cache would have rather low density due to the lack of scaling of cache and also because L2 needs to be much faster than L3. If those rumors are true, then there may be a large amount of die area with just L2. Sharing L2 across 8 cores seemed unlikely, but I think most Apple processors have used shared L2 among multiple cores. They may be able to stack L3 on top of the L2 and maybe the GMI link area. They might be able to make a cheap version with no L3. I am not sure how large the die area would be. I expect Zen 5 to have significantly increased FP processing capability, which takes a lot of die area and a lot of bandwidth. Such streaming code does not necessarily need an increase in cache size though.

I have been thinking that they may move to having the L3 off the cpu die. They could do V-Cache on top of the cpu die, or possibly use something like an MCD-type chip with stacked cache, or some kind of base die set-up with IO and cache die(s) under the cpu chiplets. That seems like it makes a lot of sense for cache coherency, but it would imply that they are going to use the infinity fabric fan-out type connection or other stacking to connect to the IO die. A regular serdes-based GMI would not be fast enough for caches. Having the cpu stacked on top of the IO and cache components makes a lot of sense from a power dissipation perspective and would allow them to fit more chiplets on an AM5 package, if they need increased core count.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,602
5,788
136
ExecuFix mentioned max cTDP of Turin at 50% more than Genoa.
I suppose we could guess a similar increase in core count, assuming they didn't regress in per-core TDP on a newer node: between 128 and 144 cores (33%-50% more cores).
400W max cTDP / ~4.2W per core for Genoa (9654), 280W max cTDP / ~4.4W per core for Milan (7763)/Rome (7H12), ignoring the IOD for this math.
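For what it's worth, here is the same back-of-envelope math in a few lines of Python; the 600W figure is just the rumored "+50% over Genoa" cTDP, not a confirmed spec, and the IOD is ignored as in the post above.

Python:
# Per-core power of the current top parts (IOD excluded, as above)
baselines = {"Genoa (EPYC 9654)": 400 / 96, "Milan (EPYC 7763)": 280 / 64}
for name, w_per_core in baselines.items():
    print(f"{name}: ~{w_per_core:.2f} W per core")       # ~4.17 W and ~4.38 W

turin_ctdp = 400 * 1.5                                    # 600 W, the rumored +50% over Genoa
print(round(turin_ctdp / baselines["Genoa (EPYC 9654)"]), "cores at flat per-core power")  # ~144
for cores in (128, 144):
    print(cores, "cores ->", round(turin_ctdp / cores, 2), "W per core")   # 4.69 / 4.17 W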
 

Tigerick

Senior member
Apr 1, 2022
655
537
106
Below is my drawing of possible Zen5 chiplet layout:
  • If Turin is coming with double the core count (192 vs 96), then AMD has to find a way to fit double the cores within the same die area. As you can see from the die shot of Genoa, there is not much area left.
  • Remember Zen 2 with the CCX design? AMD has mentioned that the Zen 4 core + L2 cache takes about 3.84mm2 on the N5 process. We can assume a Zen 5 core with the same L2 cache uses around 4 mm2 of die area on the N4P process. So 4 * 16 = 64 mm2, plus around 10 mm2 of Infinity Fabric for die-to-die connect. 74mm2 is almost the same die size as the Zen 4 chiplet excluding L3 cache (see the rough area math sketched below).
  • L3 cache has to be external for this layout to work, and there are multiple ways to stack it; we will see how AMD handles the design.
  • Apple has been using an 8 P-core shared L2 cache design since the M1/M2 Pro/Max, so AMD should be able to make it work, especially since AMD has experience with the CCX design.
[Attachments: Zen5Layout.png, AMD slide, delidded Genoa package photo]
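For reference, the bullet-point area math above restated as a tiny script; the 4 mm2 Zen 5 core+L2 figure and the 10 mm2 fabric allowance are the poster's assumptions, not confirmed numbers.

Python:
zen4_core_l2_mm2 = 3.84     # AMD-stated Zen 4 core + L2 on N5
zen5_core_l2_mm2 = 4.0      # assumed Zen 5 core + L2 on N4P
cores_per_ccd    = 16
fabric_mm2       = 10       # assumed die-to-die (Infinity Fabric) area

ccd_mm2 = zen5_core_l2_mm2 * cores_per_ccd + fabric_mm2
print(ccd_mm2, "mm^2")      # 74 mm^2, with the L3 assumed to live off-die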
 

Joe NYC

Golden Member
Jun 26, 2021
1,942
2,283
106
They could fit a ridiculous amount of cache on each one, but that large of a cache has not proved to be that useful for GPUs. The infinity cache on RDNA3 is only 96 MB. It has 8 stacks of HBM3, so it doesn't need the bandwidth boost from infinity cache. It already has ridiculously high bandwidth. If it is stacked with SoIC rather than other stacking tech, then that could be a very different beast.

Yes, it is SoIC, Hybrid Bond stacking.

Eliminating all of the L3 and all of the I/O will allow more efficient use of the N5 (or N4) compute die area, using even less area for I/O than RDNA3. Hybrid bonding has higher density than RDL, so even less area needs to be used for connecting with the outside world.

We will see how the various workloads (from scientific to AI) respond to extremely high bandwidth (and capacity) of system level cache. The need for extremely high bandwidth is well known, and it is the primary bottleneck (not compute), so we will see how much the cache helps in addition to HBM.

That could allow compute units to have massive local caches rather than a monolithic, but much farther L3 cache. All chiplets used would need to be designed with that in mind though. The Zen 4 chiplets likely would not be able to use it. In fact it is unclear how stacked Zen 4 chiplets will work on a base die anyway.

It will have to be a different CCD of Zen 4, to work stacked on top of the base die.

But once AMD has this type of implementation of the Zen 4 CCD, I wonder if there may be any possibility of using it elsewhere, such as 1/4 of the Mi300 with a CPU-only compute unit inside the AM5 socket, with 2 stacks of HBM for 32 GB of memory.

It would have higher cost than current desktop Zen 4 chips, but a shared Last Level Cache of perhaps 256 MB of SRAM would be another level beyond the V-Cache versions of Zen 4, and it would side step the problems that the 7950x3d would face, those problems being:
- more challenging cooling process with cache on top of die
- cache not being shared, causing some issues of threads jumping between CCDs, and losing access to previously used L3 content
- asymmetry of the 7950x3d implementation

Or a laptop implementation, with one GPU, one CPU compute unit and also HBM memory, eliminating the need for laptop to have DRAM.

Given AMD's modular approach, something like embedded MCD (off package memory controllers + cache) , embedded IO die, bridge chips (LSI), etc seems like it may make more sense; chiplets that can be used across many different products rather than just MI300. I am not sure if we have any info on what the MI300 will have for off package connectivity in the SH5 socket. Will it have the same IO as SP5? HPC often needs TB of memory, so it can't just be the 128 GB of HBM. Also, sending signals across these giant interposers may be problematic. They don't daisy chain or run silicon under cpu chiplets in Epyc; it is better to just go the IFOP route with a separate connection than to have to route across multiple chips in silicon. The Epyc IO die already has a number of switches and repeaters internally that add latency. Due to the scaling differences between IO, cache, and logic, I am still thinking that the "base die" may be something made out of a number of different pieces of silicon, made on slightly different processes.

Remains to be seen if there is going to be any DDR5 controllers on the base die, and whether the SH5 socket has that capability.

The desire on the part of the cloud providers is to move to CXL memory that could be pooled. So there would be a tradeoff between having pins for multiple channels of DDR5 or having far more PCIe lanes, which could be reconfigured as CXL lanes for memory or storage access, or as Infinity Fabric lanes for connecting, say, 4-way / 4-socket Mi300 nodes.

If that is the approach SH5 takes, then the re-use of the IO dies would be limited or impossible. If AMD does take the base die approach to Zen 5, even if the IO die is different between SP5 and SH5, Zen 5 then, in theory, could have re-use of the Zen 5 CCDs, if the new standard becomes being stacked on top of the base die.

I think AMD might just as well have 2 different architectures (and sockets) and let client decide what works best for them.

For AMD to grow its market share in datacenter, they may in fact need to have solutions for number of different tasks, from Sienna SP6, to Genoa SP5 and Mi300 SH5.
 
Last edited:

Joe NYC

Golden Member
Jun 26, 2021
1,942
2,283
106
Maybe V-cache Zen 5 is a 2 Hi stack.

According to SkyJuice, the RDNA3 MCD can already accommodate a 2-Hi stack, and according to the leaker GrayMont (who has since deleted his account), Zen 4 also has that potential capability.

So, it remains to be seen. It would be surprising to me if AMD did not make any improvement to V-Cache capacity in Zen 4. Since we know that the stacked die is the same 64 MB as Zen 3, 2 layers would be the only way to add extra capacity.
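For reference, the capacity math for a 1-Hi vs 2-Hi stack, assuming the Zen 4 CCD keeps its 32 MB on-die L3 and each stacked die stays at 64 MB as stated above:

Python:
on_die_l3_mb    = 32        # Zen 4 CCD base L3
vcache_layer_mb = 64        # per stacked SRAM die, same as Zen 3 V-Cache

for layers in (1, 2):
    total = on_die_l3_mb + layers * vcache_layer_mb
    print(f"{layers}-Hi: {total} MB L3 per CCD")    # 96 MB vs 160 MB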
 

Joe NYC

Golden Member
Jun 26, 2021
1,942
2,283
106
They could alternatively stack it underneath. Would have its own challenges, but might be better for thermals. Though we never did get a deep dive on what the limiting factor actually is for clocks w/ v-cache.

I think they could still get away with 96c reasonably comfortably if they had to, but if not, they could wait for a 3nm Zen 5 version (refresh?) and then do a sort of mid-cycle upgrade. Or they could just make a new socket with a 128c, 16 channel config, but that's probably not an ideal outcome.

Yeah, since they are already doing it with the Mi300 card.

And if that is the decision, that the CCD will be on top of the L3 / System Level Cache, it would have to be "Standard", not optional as is the case now.

In case it is standard, then the L3 can be completely removed from the CPU CCD, saving space, and the area for the Infinity Fabric connections would become hybrid-bond TSVs, which are more dense and area efficient.

OTOH, in an interview on AnandTech, Mike Clark mentioned (unclear if it was in relation to Zen 5 as well as Zen 4) that AMD wants to retain the ability to sell lower-end chips without stacked V-Cache...
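To put the density claim in rough numbers, here is a ballpark comparison of hybrid-bond pads vs conventional microbumps. The pitches are commonly cited industry figures (TSMC SoIC around 9 um vs ~36 um microbumps), not AMD-disclosed values for any specific product.

Python:
hybrid_pitch_um, microbump_pitch_um = 9, 36

# Connection density scales with 1/pitch^2
density_gain = (microbump_pitch_um / hybrid_pitch_um) ** 2
print(f"~{density_gain:.0f}x more connections per mm^2 with hybrid bonding")   # ~16x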
 

yuri69

Senior member
Jul 16, 2013
388
619
136
I would expect Zen 5 to start using stacked die in some manner, so it may look more like MI300 than Genoa, which is partially why I am so interested in exactly what is in MI300. If they can economically use the same interconnect used in RDNA3/MCDs, then that would reduce power consumption significantly. Going up to pci-e 5 or 6 speeds for SerDes-based GMI has to cost a lot of power. Speculating on how stacked die are going to be used or arranged is very difficult without more information.
AMD is all about making the production cheap. That is the primary motivation behind the chiplets. MI300 seems to be the first really complex and expensive product - the target market demands are high.

Mandating Zen 5 to use complex packaging doesn't seem that great. Remember, in the times of Rome/Milan the platform cost made low-tier server offerings prohibitive. This has been amplified by the 12-channel SP5, hence the need to split the server platforms by adding the lower-tier 6-channel SP6. Making Zen 5 default to stacking would make the cost high again. The client cost would also be affected.

I would expect AMD to keep things cheap (with all the drawbacks/bottlenecks of the IFOP approach).
 

maddie

Diamond Member
Jul 18, 2010
4,740
4,674
136
AMD is all about making the production cheap. That is the primary motivation behind the chiplets. MI300 seems to be the first really complex and expensive product - the target market demands are high.

Mandating Zen 5 to use complex packaging doesn't seem that great. Remember, in the times of Rome/Milan the platform cost made low-tier server offerings prohibitive. This has been amplified by the 12-channel SP5, hence the need to split the server platforms by adding the lower-tier 6-channel SP6. Making Zen 5 default to stacking would make the cost high again. The client cost would also be affected.

I would expect AMD to keep things cheap (with all the drawbacks/bottlenecks of the IFOP approach).
There is very little area scaling for cache (95%?), so once [cost of L3 cache in 6nm + die stacking cost] < [cost of L3 cache in 3nm], they can do it cheaper. Basically the wafer cost difference will determine profitability.

My thinking is that we're already there, and Zen 5, being a more clean-sheet design, will reflect this.
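A minimal sketch of that inequality; all dollar figures below are made-up placeholders (real wafer, yield, and bonding costs aren't public), and the only point is that stacking wins once the 6nm-plus-bonding side drops below the 3nm side.

Python:
def l3_cost(wafer_cost, yielded_mm2_per_wafer, l3_mm2):
    return wafer_cost / yielded_mm2_per_wafer * l3_mm2

# Assume the L3 block barely shrinks between nodes (the poor-SRAM-scaling point)
l3_mm2_n6, l3_mm2_n3 = 36.0, 34.0                       # hypothetical block sizes
cost_n6 = l3_cost(10_000, 55_000, l3_mm2_n6) + 4.0      # + assumed per-die bonding cost
cost_n3 = l3_cost(20_000, 55_000, l3_mm2_n3)

print(f"6nm stacked L3: ~${cost_n6:.2f}  vs  3nm on-die L3: ~${cost_n3:.2f}")
# Stacking is cheaper whenever cost_n6 < cost_n3, i.e. the wafer-cost gap outweighs bonding cost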
 

MrTeal

Diamond Member
Dec 7, 2003
3,569
1,699
136
RGT claims only 16C/32T for the top Zen 5 desktop AM5 part. That seems kinda low as I was expecting AMD to do Zen 5 + Zen 5C (or whatever they'll call it). They could easily do 8C Zen 5 + 16C Zen 5C for a total of 24 Zen 5 cores. ISA would be the same, Zen 5C might clock ~15-20% lower but that's fine as they would still get ~30% boost versus 16C Zen 5 in MT workloads.
I definitely don't want to get into a situation like being stuck at 4C/8T for 10 years, but 16/32 doesn't seem like too much of a limitation in MT workloads at present. Unless I've missed something fundamental, I'm not sure 8 full cores and 16 compact cores wouldn't be a step back on the desktop. They're not really low-power efficiency cores, and they're also not as performant as a full core. They might show higher MT numbers in Cinebench, but regressions in other things. At least for right now, the number of people who would benefit much from 32C/64T on a consumer platform is pretty small versus single-threaded improvements. What they really need is to get TR releases closer in time to the release of the desktop platforms.
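Quick check of the ~30% MT arithmetic in the quoted post; it assumes a Zen 5C core matches a full core per clock and only gives up 15-20% frequency, which is the quoted post's assumption rather than anything AMD has confirmed.

Python:
full_cores, compact_cores = 8, 16
baseline_full_cores = 16

for clock_penalty in (0.15, 0.20):
    hybrid_throughput = full_cores + compact_cores * (1 - clock_penalty)
    gain = hybrid_throughput / baseline_full_cores - 1
    print(f"{clock_penalty:.0%} lower clocks -> ~{gain:.0%} more MT throughput")
# prints ~35% (at -15%) and ~30% (at -20%), roughly in line with the quoted ~30%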
 
  • Like
Reactions: Kaluan

Mopetar

Diamond Member
Jan 31, 2011
7,837
5,992
136
If it's nothing new the patent is invalid and completely wasted.

Patents cover specific implementations, not ideas. It's why companies like Ford, GM, Toyota, etc. are still receiving patents for engines despite the engine itself being well over a century old.

The fact that a patent was granted means that the patent office agreed with AMD that whatever they did was different enough from existing implementations to warrant a patent.
 

Joe NYC

Golden Member
Jun 26, 2021
1,942
2,283
106
AMD is all about making the production cheap. That is the primary motivation behind the chiplets. MI300 seems to be the first really complex and expensive product - the target market demands are high.

Mandating Zen 5 to use complex packaging doesn't seem that great. Remember, in the times of Rome/Milan the platform cost made low-tier server offerings prohibitive. This has been amplified by the 12-channel SP5, hence the need to split the server platforms by adding the lower-tier 6-channel SP6. Making Zen 5 default to stacking would make the cost high again. The client cost would also be affected.

I would expect AMD to keep things cheap (with all the drawbacks/bottlenecks of the IFOP approach).

One thing that was said by AMD, I think by Lisa, is that the cost of packaging, such as 3D stacking, is a function of volume. As the volume goes up, the price goes down.

We don't know where things are as far as the crossover of costs now, and where things may be, say, a year from now when Zen 5 is launching.

I doubt there is something about 3D stacking that is inherently very expensive. I think the concern that AMD may have is volume: whether TSMC will have sufficient capacity for millions of chips to be stacked.
 
Last edited:

Exist50

Platinum Member
Aug 18, 2016
2,445
3,043
136
The fact that a patent was granted means that the patent office agreed with AMD that whatever they did was different enough from existing implementations to warrant a patent.
Eh, the patent system is a mess, particularly with tech. Half of it seems to be outsourced to the courts to fix later. This is one of my favorite examples of a patent that by all rights shouldn't exist: https://patents.google.com/patent/US7862780
I doubt there is something about 3D stacking that is inherently very expensive. I think the concern that AMD may have is volume: whether TSMC will have sufficient capacity for millions of chips to be stacked.
Last I heard, one of the critical machines for hybrid bonding was still in risk/trial production levels. Probably this year or next, we'll start to see those actually start showing up in volume, and thus many more/more common hybrid bonded designs.

Though that said, advanced packaging like that isn't free. I think it'll be a number of years yet before costs come down enough for it to be part of every product (like stacking all of the L3), but unless there's a radical change in SRAM scaling trends, it does seem like this will be used in more and more places with time.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,602
5,788
136
So, it remains to be seen. It would be surprising to me if AMD did not make any improvement to V-Cache capacity in Zen 4. Since we know that the stacked die is the same 64 MB as Zen 3, 2 layers would be the only way to add extra capacity.
Zen 3 V-Cache can have many stacks of SRAM, they did not do it probably because of excessive latency and power.
Also Andreas Schilling showed the BIOS config for the 8 Hi Stacks.
I doubt there is something about 3D stacking that is inherently very expensive. I think the concern that AMD may have is volume: whether TSMC will have sufficient capacity for millions of chips to be stacked.
Since the 5800X3D is one of the highest-selling, if not the highest-selling, AM4 chips, I would say the economics are slowly working out. There is no dearth of 5800X3D at all; the chip can be had for less than 350 bucks. Many current and upcoming products are chiplet based with advanced packaging: 5800X3D, 7XX0X3D, Milan-X, Genoa-X, RDNA3, MI300, STX (supposedly), RDNA4 (supposedly).
This year TF-AMD's biggest packaging facility will come online in Malaysia. TSMC's AP2C came online in 2H2022.
 

Saylick

Diamond Member
Sep 10, 2012
3,146
6,364
136
Zen 3 V-Cache can have many stacks of SRAM, they did not do it probably because of excessive latency and power.
Also Andreas Schilling showed the BIOS config for the 8 Hi Stacks.

Since the 5800X3D is one of the highest-selling, if not the highest-selling, AM4 chips, I would say the economics are slowly working out. There is no dearth of 5800X3D at all; the chip can be had for less than 350 bucks. Many current and upcoming products are chiplet based with advanced packaging: 5800X3D, 7XX0X3D, Milan-X, Genoa-X, RDNA3, MI300, STX (supposedly), RDNA4 (supposedly).
This year TF-AMD's biggest packaging facility will come online in Malaysia. TSMC's AP2C came online in 2H2022.
The packaging facility doesn't do the 3D stacking, if I am not mistaken. That's just substrates and all that jazz. That 3D stacking (hybrid bonding) is entirely on TSMC, no?
 
  • Like
Reactions: Mopetar

DisEnchantment

Golden Member
Mar 3, 2017
1,602
5,788
136
The packaging facility doesn't do the 3D stacking, if I am not mistaken. That's just substrates and all that jazz. That 3D stacking (hybrid bonding) is entirely on TSMC, no?
Yes, all FE packaging (e.g. SoIC) is done by TSMC. AP2C is one of the latest 3D packaging fabs at TSMC. But other BEOL stuff (fan-outs etc.) can be done by TF-AMD.
 

Exist50

Platinum Member
Aug 18, 2016
2,445
3,043
136
Zen 3 V-Cache can have many stacks of SRAM, they did not do it probably because of excessive latency and power.
Multi-layer stacking has additional complications vs 2-layer. Yes, AMD has support for it in BIOS, but that doesn't necessarily mean the manufacturing side is ready for it.
 

Mopetar

Diamond Member
Jan 31, 2011
7,837
5,992
136
I'm assuming that the extra V-Cache stacks would just increase the associativity of the cache. That's probably something that cloud providers would want on their server chips, but I'm not sure it adds a lot for consumers outside of a few niche workloads. I'm not even sure how much extra scaling games would get out of the next 64 MB of cache.

It's definitely something that might be really useful for CDNA GPUs as it lets them work more effectively with larger data sets. With AMD moving the infinity cache to separate chiplets, they could stack multiple v-cache layers on each chiplet to create a massive amount of cache. Even with the current setup you could theoretically create a GPU with 3 GB of cache assuming 64 MB layers stacked 8-hi.
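Spelling out that 3 GB figure, assuming an RDNA3-style part with 6 MCDs (as on Navi 31) and the hypothetical 64 MB layers stacked 8-high on each:

Python:
mcds, layers, mb_per_layer = 6, 8, 64
total_mb = mcds * layers * mb_per_layer
print(total_mb, "MB =", total_mb / 1024, "GB")   # 3072 MB = 3.0 GB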
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
Yes, it is SoIC, Hybrid Bond stacking.

Eliminating all of the L3 and all of the I/O will allow more efficient use of the N5 (or N4) compute die area, using even less area for I/O than RDNA3. Hybrid bonding has higher density than RDL, so even less area needs to be used for connecting with the outside world.

We will see how the various workloads (from scientific to AI) respond to extremely high bandwidth (and capacity) of system level cache. The need for extremely high bandwidth is well known, and it is the primary bottleneck (not compute), so we will see how much the cache helps in addition to HBM.



It will have to be a different CCD of Zen 4, to work stacked on top of the base die.

But once AMD has this type of implementation of the Zen 4 CCD, I wonder if there may be any possibility of using it elsewhere, such as 1/4 of the Mi300 with a CPU-only compute unit inside the AM5 socket, with 2 stacks of HBM for 32 GB of memory.

It would have higher cost than current desktop Zen 4 chips, but a shared Last Level Cache of perhaps 256 MB of SRAM would be another level beyond the V-Cache versions of Zen 4, and it would side step the problems that the 7950x3d would face, those problems being:
- more challenging cooling process with cache on top of die
- cache not being shared, causing some issues of threads jumping between CCDs, and losing access to previously used L3 content
- asymmetry of the 7950x3d implementation

Or a laptop implementation, with one GPU, one CPU compute unit and also HBM memory, eliminating the need for laptop to have DRAM.



Remains to be seen if there is going to be any DDR5 controllers on the base die, and whether the SH5 socket has that capability.

The desire on the part of the cloud providers is to move to CXL memory that could be pooled. So there would be a tradeoff between having pins for multiple channels of DDR5 or having far more PCIe lanes, which could be reconfigured as CXL lanes for memory or storage access, or as Infinity Fabric lanes for connecting, say, 4-way / 4-socket Mi300 nodes.

If that is the approach SH5 takes, then the re-use of the IO dies would be limited or impossible. If AMD does take the base die approach to Zen 5, even if the IO die is different between SP5 and SH5, Zen 5 then, in theory, could have re-use of the Zen 5 CCDs, if the new standard becomes being stacked on top of the base die.

I think AMD might just as well have 2 different architectures (and sockets) and let client decide what works best for them.

For AMD to grow its market share in datacenter, they may in fact need to have solutions for number of different tasks, from Sienna SP6, to Genoa SP5 and Mi300 SH5.
Has AMD confirmed that it is SoIC? It has obvious bandwidth advantages, but other types of stacking allow them to mix chips on different processes, or even chips made at different fabs. Also, they are not mutually exclusive. They can have an SoIC stack under or on top of chips using other stacking methods.

I think I already said something about CXL possibly being used for MI300. They definitely need more off package memory. With 128 GB of HBM cache, CXL as backing store would be fine, so it is unclear if SH5 will have DDR memory controllers. The connectivity to neighboring GPUs needs to be very high, so they may use the same infinity fabric fan out used for RDNA3/MCDs or some bridge chips.

I don’t know if they need the Zen 4 chiplets to be different. Making a version specifically for MI300 is unlikely, but there is some possibility of a refresh version meant for stacking that is more widely used. I think it may also be possible to use a standard Zen 4 die. As I said somewhere earlier, they could have designed connectivity over the IO area for cpu chiplets. It would not be SoIC, but CPUs just do not need that much bandwidth. In fact, the bandwidth from micro-solder ball type stacking is sufficient for most things and probably overkill for CPUs.

AMD has been making more and more specialized chips since they have a much larger budget now, but they still need to make things in a modular and reusable manner to keep cost down. They can't compete with Intel without being very efficient in their usage of expensive silicon. Nvidia seems to be able to afford to keep on making giant, monolithic die for now, but they will need to go the chiplet route also, just for scalability. There may be a single base die per gpu in MI300, but I think it may make sense to have separate cache and IO die embedded under the compute die. Perhaps an IO / switch die extends under neighboring gpu devices to provide the connectivity.
 
  • Like
Reactions: Joe NYC and ftt

MadRat

Lifer
Oct 14, 1999
11,910
238
106
I would think they would centralize the cache and memory controller on one edge of the package and then route to the periphery. The kind of topology they choose would determine the routing. Can they modularly build something to support mesh, star, and bus topologies?
 
  • Like
Reactions: Joe NYC

DisEnchantment

Golden Member
Mar 3, 2017
1,602
5,788
136
There is very little area scaling for cache (95%?), so once [cost of L3 cache in 6nm + die stacking cost] < [cost of L3 cache in 3nm], they can do it cheaper. Basically the wafer cost difference will determine profitability.

My thinking is that we're already there, and Zen 5, being a more clean-sheet design, will reflect this.
From the economic perspective, it sounds intriguing indeed. Quite compelling when you think about the Zen 4 CCD, with 35% of the die area composed of L3 and the 2x GMI links holding all those TCoils and related circuitry for the line drivers.

But then, when you think about the complexities they need to tackle, they might not do it after all. They are going to lose a lot of frequency headroom due to the thermal density. The presentation below is from Naffziger in 2021.
[Attachment: slide from Naffziger's 2021 presentation]
For instance, 7800X3D loses 10%+ top end frequency over 7700X.
Since Zen 4 missed the N4P PDK at tape out (N4P taped out in 2H2022, but Zen 4 taped out one year earlier than that), Zen 5 might gain even more frequency than Zen 4 for the N4P based parts.
6GHz+ should be quite easy for N4P based Zen 5 (my 7950X consistently hit 5.88GHz on 280mm AIO), I suppose, whereas a 3D stacked Zen 5 might not make it past 5GHz.
Additionally disclosures of V-Cache variants would be odd since it is unclear where they would stack them otherwise.
Besides, for next year N5/N4 nodes are already mature and plentiful: ~180k wpm from F18P1~4 and Arizona F21P1, and the likes of Apple moving to N3 again with ~160k wpm from F18P5~8.

For N3E based Zen 5, they can work with FinFlex for minimizing the size of the L3 SRAM arrays if need be.

However, if they use the Silicon Bridge interconnect and can replace the TCoils/line drivers with a more parallel interface (HBI, for instance) at even lower power than the Cu RDL fanouts, they could cut back on the CCD die size by a few mm2. I keep seeing patents like the ones below, so maybe they will do it after all; but if not, these Cu RDL fanout links should still help replace substrate-based routing at roughly 1/7 of the 2pJ/bit of the 36Gbps GMI3. They can increase the link width and speed to reduce the penalty of going to the L3 on another CCD.
  • Multirow semiconductor chip connections [patent figure attached]
  • HYBRID BRIDGED FANOUT CHIPLET CONNECTIVITY [patent figure attached]
  • Circuit board with bridge chiplets [patent figure attached]
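For a rough sense of what that 1/7 energy ratio means at the link level: the 2pJ/bit and the 1/7 factor come from the post above, while the 64 GB/s of link traffic is purely an illustrative assumption, not a known GMI3 spec.

Python:
gmi3_pj_per_bit   = 2.0
fanout_pj_per_bit = gmi3_pj_per_bit / 7          # ~0.29 pJ/bit

link_bw_gbytes = 64                              # hypothetical sustained GB/s on one CCD link
bits_per_s = link_bw_gbytes * 8e9

for name, pj in (("SerDes GMI3", gmi3_pj_per_bit), ("Cu RDL fan-out", fanout_pj_per_bit)):
    watts = bits_per_s * pj * 1e-12
    print(f"{name}: ~{watts:.2f} W at {link_bw_gbytes} GB/s")   # ~1.02 W vs ~0.15 W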
 

Geddagod

Golden Member
Dec 28, 2021
1,154
1,017
106
From the economic perspective, it sounds intriguing indeed. Quite compelling when you think about the Zen 4 CCD, with 35% of the die area composed of L3 and the 2x GMI links holding all those TCoils and related circuitry for the line drivers.

But then, when you think about the complexities they need to tackle, they might not do it after all. They are going to lose a lot of frequency headroom due to the thermal density. The presentation below is from Naffziger in 2021.
View attachment 76406
For instance, 7800X3D loses 10%+ top end frequency over 7700X.
Since Zen 4 missed the N4P PDK at tape out (N4P taped out in 2H2022, but Zen 4 taped out one year earlier than that), Zen 5 might gain even more frequency than Zen 4 for the N4P based parts.
6GHz+ should be quite easy for N4P based Zen 5 (my 7950X consistently hit 5.88GHz on 280mm AIO), I suppose, whereas a 3D stacked Zen 5 might not make it past 5GHz.
Additionally disclosures of V-Cache variants would be odd since it is unclear where they would stack them otherwise.
Besides, for next year N5/N4 nodes are already mature and plentiful: ~180k wpm from F18P1~4 and Arizona F21P1, and the likes of Apple moving to N3 again with ~160k wpm from F18P5~8.

For N3E based Zen 5, they can work with FinFlex for minimizing the size of the L3 SRAM arrays if need be.

However, if they use the Silicon Bridge interconnect and can replace the TCoils/line drivers with a more parallel interface (HBI, for instance) at even lower power than the Cu RDL fanouts, they could cut back on the CCD die size by a few mm2. I keep seeing patents like the ones below, so maybe they will do it after all; but if not, these Cu RDL fanout links should still help replace substrate-based routing at roughly 1/7 of the 2pJ/bit of the 36Gbps GMI3. They can increase the link width and speed to reduce the penalty of going to the L3 on another CCD.
Aren't wider archs also harder to clock higher?
GLC used a way better node than CML, and a deeper pipeline too, but only tied its max clock speeds.
Plus, if base Zen 5 is 3D stacked like you think (I still don't think it will be), even if it's stacked underneath, that should cause a clock regression too.
idk if we will see a 6GHz Zen 5. Even non-3D-stacked versions might not hit it... Maybe they push out one SKU that reaches it just for marketing.
 

moinmoin

Diamond Member
Jun 1, 2017
4,950
7,659
136
But then, when you think about the complexities they need to tackle, they might not do it after all. They are going to lose a lot of frequency headroom due to the thermal density.
I wonder if they could just turn it on its head. The thermal density is in the cores; you want them to directly touch the IHS. The current V-Cache and its dummy spacers prevent that.

But could L3$ with integrated GMI links form the base die, and the CCD (now without L3$ and GMI links) go on top of that? Disadvantage: The two dies always have to be put together. Advantages in the ideal case: L3$+GMI die could stay cheaper and denser, and may actually facilitate another increase in L3$ size not feasible otherwise. CCD should decrease in size and potentially cost by saving all the space for L3$+GMI. And all the thermally dense cores stay atop close to the IHS. Potential additional disadvantage: Thermal density increases even further as a result of that.

I don't think this will happen with Zen 5. But if lack of SRAM scaling should continue for more node gens this approach seems really beneficial to me. The bridges could actually be a way toward that.
 

maddie

Diamond Member
Jul 18, 2010
4,740
4,674
136
From the economic perspective, it sounds intriguing indeed. Quite compelling when you think about the Zen 4 CCD, with 35% of the die area composed of L3 and the 2x GMI links holding all those TCoils and related circuitry for the line drivers.

But then, when you think about the complexities they need to tackle, they might not do it after all. They are going to lose a lot of frequency headroom due to the thermal density. The presentation below is from Naffziger in 2021.
View attachment 76406
For instance, 7800X3D loses 10%+ top end frequency over 7700X.
Since Zen 4 missed the N4P PDK at tape out (N4P taped out in 2H2022, but Zen 4 taped out one year earlier than that), Zen 5 might gain even more frequency than Zen 4 for the N4P based parts.
6GHz+ should be quite easy for N4P based Zen 5 (my 7950X consistently hit 5.88GHz on 280mm AIO), I suppose, whereas a 3D stacked Zen 5 might not make it past 5GHz.
Additionally disclosures of V-Cache variants would be odd since it is unclear where they would stack them otherwise.
Besides, for next year N5/N4 nodes are already mature and plentiful: ~180k wpm from F18P1~4 and Arizona F21P1, and the likes of Apple moving to N3 again with ~160k wpm from F18P5~8.

For N3E based Zen 5, they can work with FinFlex for minimizing the size of the L3 SRAM arrays if need be.

However, if they use the Silicon Bridge interconnect and can replace the TCoils/line drivers with a more parallel interface (HBI, for instance) at even lower power than the Cu RDL fanouts, they could cut back on the CCD die size by a few mm2. I keep seeing patents like the ones below, so maybe they will do it after all; but if not, these Cu RDL fanout links should still help replace substrate-based routing at roughly 1/7 of the 2pJ/bit of the 36Gbps GMI3. They can increase the link width and speed to reduce the penalty of going to the L3 on another CCD.
Strictly from a cooling viewpoint.
His 2nd point indicates a base cache layer with the (logic + L1 & L2) on an upper layer.
His 3rd point says, if you can, do not put an additional BEOL layer above your hottest sections.

We would have a low-power heat load (the L3) entering the top layer from below, plus the computation layer's heat flux going directly to the CPU cover plate.

At present we have the entire CPU heat load crossing the bonding layer across the entire die (blanks over cores) plus an additional BEOL layer.

Without doing a detailed analysis, my intuition tells me that the cooling situation will be better for the cache below design and not comparable to the present V-cache.
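A toy way to frame that intuition: treat each layer between the hot cores and the cold plate as a series thermal resistance. Every value below is an illustrative placeholder, not measured data for any AMD product; the point is only that the cache-below arrangement removes two terms from the stack above the cores.

Python:
def junction_temp(power_w, resistances_k_per_w, t_cold_plate_c=40.0):
    # Series thermal resistances from the cores up to the cold plate
    return t_cold_plate_c + power_w * sum(resistances_k_per_w)

core_power_w = 80.0                             # hypothetical CCD load

# Cache (or blank spacer) on top of the cores: heat crosses the bond layer and the
# extra die before reaching the TIM/IHS.
r_cache_on_top = [0.05, 0.10, 0.08, 0.15]       # compute BEOL, bond layer, top die, TIM+IHS
# Cache below the cores: the compute die sits directly under the TIM/IHS.
r_cache_below  = [0.05, 0.15]                   # compute BEOL, TIM+IHS

print("cache on top :", junction_temp(core_power_w, r_cache_on_top), "C")   # ~70 C
print("cache below  :", junction_temp(core_power_w, r_cache_below), "C")    # ~56 C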

FinFlex can save space on a monolithic layout, but aren't the optimized libraries for the 6nm cache the same saving?