Info: 64 MB V-Cache on 5XXX Zen 3, Average +15% in Games


Kedas

Senior member
Dec 6, 2018
355
339
136
Well, we now know how they will bridge the long wait until Zen 4 on AM5 in Q4 2022.
Production start for V-Cache is the end of this year, which is too early for Zen 4, so this is certainly coming to AM4.
The +15%, Lisa said, is "like an entire architectural generation".
 
Last edited:
  • Like
Reactions: Tlh97 and Gideon

StefanR5R

Elite Member
Dec 10, 2016
6,551
10,293
136
Genoa will have 12 CCDs per package; Bergamo will most certainly not, as 128 is not easily divided by 12. Hence, there is an obvious likelihood that Bergamo's CCDs could be either smaller or larger than Genoa's CCDs. (The IOD may potentially differ too.)

It's a bit early to break out MS Paint to piece a Zen 4c die shot together.
 

uzzi38

Platinum Member
Oct 16, 2019
2,746
6,653
146
Without a chiplet resize I just don't see where they are going to fit 128 regular-sized Zen 4 cores... hence Zen 4c, which is said to be dense (as in SRAM density, not logic, since TSMC isn't there yet, especially at 5nm)
Oh right, that's what I was saying too. Assuming that Zen 4c and Zen 4 cores are the same physical size (including caches) is a terrible idea. Assuming that the chiplets are the exact same physical size is also a terrible idea. They have more space to work with (8 chiplets vs 12 on Genoa), and Zen 4c doesn't need to be a particularly high-clocking design.
 
  • Like
Reactions: Tlh97 and Saylick

tomatosummit

Member
Mar 21, 2019
184
177
116
Without a chiplet resize I just don't see where they are going to fit 128 regular-sized Zen 4 cores... hence Zen 4c, which is said to be dense (as in SRAM density, not logic, since TSMC isn't there yet, especially at 5nm)
The CCD probably doesn't have to be the same size, but a small change won't make that much difference. SP5 is looking huge right now, and if there are fewer CCDs than Genoa there are fewer space constraints on the package.
Very hypothetically, there could even be some smart allocation of IF connections: 6 per quadrant on the IO die, 2 to each CCD on Genoa and 3 for Bergamo, to allow a different CCD IO layout.
The cores could strip things out like AVX execution capability (the PlayStation 5 already showed this is possible), and moving cache off the core silicon is another potential.

Zen generations have been fairly stringent in their CCX layouts, and with Zen 4 being a Zen 3 derivative, is it still limited to an 8-core ring layout?
- Bergamo CCDs could be dual-CCX, a la Zen 2
- or (it sickens me to think this) I like the WCCF leak layout, because it would be the same ring layout with two cores at each stop

That wouldn't be cheaper at all, and what's more I don't know why you're fixated on removing the L3 cache from the base die. There's no point to that. You're trying to fit the exact same size cores in the exact same amount of space without realising that
Why do you think Bergamo is a cost-sensitive solution?
It looks more like a way to get as many cores with enough performance into a socket as possible than a value- or efficiency-specific play.
Hyperscalers are already buying the biggest CPUs available even if they get discounts mere mortals would kill for. They are a big enough market for AMD to target a CPU line towards, and their demands will be met, be it "more cores and less junk like extra cache and AVX".
A stacked-only L3 cache is probably there just to keep some semblance of performance for a 16-core chiplet that would otherwise have only L2, and in the future it could even be a performance differentiator with more L3 layers for certain customers, although that's probably more in the wheelhouse of a Genoa-X product.
Unless you're assuming 3D stacking to be particularly expensive, but it looks like it's not: although driven by competition as well, the RRP of Epyc-X and Ryzen 3D hasn't increased asking prices at all, so the extra L3 silicon and packaging process seems to be a fairly insignificant extra, and Bergamo isn't likely to be a product that can't eat up those costs.
 

uzzi38

Platinum Member
Oct 16, 2019
2,746
6,653
146
Why do you think Bergamo is a cost-sensitive solution?
It looks more like a way to get as many cores with enough performance into a socket as possible than a value- or efficiency-specific play.
Hyperscalers are already buying the biggest CPUs available even if they get discounts mere mortals would kill for. They are a big enough market for AMD to target a CPU line towards, and their demands will be met, be it "more cores and less junk like extra cache and AVX".
A stacked-only L3 cache is probably there just to keep some semblance of performance for a 16-core chiplet that would otherwise have only L2, and in the future it could even be a performance differentiator with more L3 layers for certain customers, although that's probably more in the wheelhouse of a Genoa-X product.
Unless you're assuming 3D stacking to be particularly expensive, but it looks like it's not: although driven by competition as well, the RRP of Epyc-X and Ryzen 3D hasn't increased asking prices at all, so the extra L3 silicon and packaging process seems to be a fairly insignificant extra, and Bergamo isn't likely to be a product that can't eat up those costs.

For what workloads? Don't mistake Azure using Milan-X for HPC instances as representative of the majority of the hyperscaler market.

Also, let me phrase things a little differently. What is AMD competing against in terms of cloud-specific products? You have Graviton, which sure has a strong focus on perf/W, but there's also the fact that because Amazon are designing it themselves, they save some cost on it vs competing Intel/AMD parts. You have Sierra Forest (eventually), which is also focusing on compute density by shipping Intel's Atom cores. As is Ampere's Altra; in fact, in the latter's case they released the 128c model with half the L3 cache specifically for this purpose.

Now let's be real: regardless of whether the cache sits on top of the die or in it, that power consumption argument doesn't change. And spoiler alert: Bergamo is not a low-power part, nor does it have a lower base TDP than Genoa. So you tell me: what else is there for AMD to focus on?
 

nicalandia

Diamond Member
Jan 10, 2019
3,331
5,282
136
Genoa will have 12 CCDs per package; Bergamo will most certainly not, as 128 is not easily divided by 12. Hence, there is an obvious likelihood that Bergamo's CCDs could be either smaller or larger than Genoa's CCDs. (The IOD may potentially differ too.)

It's a bit early to break out MS Paint to piece a Zen 4c die shot together.

If you had to guess based on the known information we have so far (that the SP5 CPU is made up of four quadrants, and that you just can't place a CCD in the middle where the quadrants meet)...

Here is an X-ray of the SP5 CPU. I've been scratching my head over how they are going to go about it. We know that the TSMC SRAM libraries used in Milan-X are twice as dense as its high-performance 7nm libraries, and I suspect that will also be the case at 5nm. But that is SRAM density; they can't build Zen 4c on the same libraries and expect the same density on the logic side (FP libraries, IO). So with that in mind, how do they go about it?

[attached image: X-ray of the SP5 package]


Orange lines divide the quadrants, and the red one is what I believe they will do: just make a larger die with super-dense SRAM (it will not be twice as large as Zen 4, since that would not fit on the package).
 
Last edited:

MadRat

Lifer
Oct 14, 1999
11,967
281
126
I'd think shrinks are getting to the point where the proximity of each core, per die and per chiplet, to shared cache and physical memory will be critical. If you have multiple dies on multiple chiplets, then your selection of the active core probably resembles a March Madness bracket when it comes to workload priorities. And to some extent it is probably virtualized in hardware: the OS may rank cores, yet hardware and software ranks would likely differ. Software in that scenario would never know which core it has truly been assigned; when it needs another core, it simply activates the next one by rank. That would also allow AMD to creatively customize core counts, performance, or both by using virtualized core management at the hardware level. (Your 8-core chip might have 24 physical cores that it could address, move the locations of virtual cores dynamically to load-balance thermals, and you would never actually see it because you in fact never use more than 8 simultaneously.) You could allow software to signal different ranking strategies, but you would still want hardware doing the execution, and to allow for exceptions that bypass defective dies/chiplet cores or communication pathways.
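(A purely hypothetical sketch of that idea, with invented names and numbers rather than any real AMD mechanism: a hardware-maintained rank table maps the handful of cores the OS sees onto a larger physical pool, skipping defective cores and re-ranking as thermals change.)

Code:
# Hypothetical illustration only: rank-based virtual-to-physical core mapping.
# Everything here (class name, numbers) is invented for the example.

class VirtualCoreManager:
    def __init__(self, physical_cores, exposed_count, defective=()):
        # rank[i] = physical core id; lower index = currently preferred core
        self.rank = [c for c in range(physical_cores) if c not in set(defective)]
        self.exposed_count = exposed_count      # what the OS is allowed to see, e.g. 8

    def rerank(self, temperatures):
        # Prefer the coolest physical cores to load-balance thermals.
        self.rank.sort(key=lambda c: temperatures.get(c, 0.0))

    def assign(self, virtual_core_id):
        # The OS asks for "virtual core N"; hardware silently hands back the
        # Nth-ranked physical core. Software never learns which one it really got.
        if virtual_core_id >= self.exposed_count:
            raise ValueError("only %d cores are exposed" % self.exposed_count)
        return self.rank[virtual_core_id]

mgr = VirtualCoreManager(physical_cores=24, exposed_count=8, defective=(5, 17))
mgr.rerank({c: 40.0 + (c % 7) for c in range(24)})     # fake per-core temperatures
print([mgr.assign(v) for v in range(8)])               # physical ids behind virtual cores 0-7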
 
Last edited:

jamescox

Senior member
Nov 11, 2009
644
1,105
136
If you had to guess based on the known information we have so far (that the SP5 CPU is made up of four quadrants, and that you just can't place a CCD in the middle where the quadrants meet)...

Here is an X-ray of the SP5 CPU. I've been scratching my head over how they are going to go about it. We know that the TSMC SRAM libraries used in Milan-X are twice as dense as its high-performance 7nm libraries, and I suspect that will also be the case at 5nm. But that is SRAM density; they can't build Zen 4c on the same libraries and expect the same density on the logic side (FP libraries, IO). So with that in mind, how do they go about it?

[attached image: X-ray of the SP5 package]


Orange lines divide the quadrants, and the red one is what I believe they will do: just make a larger die with super-dense SRAM (it will not be twice as large as Zen 4, since that would not fit on the package).
I still think it might be stacked using bridge chips. Apple seems to be using a stacked device with bridge chips for the M1 Ultra. Nvidia will have Grace, which also seems to be a device using bridge chips. Intel will have Sapphire Rapids with EMIB, I guess, if it actually comes out this year. AMD will be a bit behind if they don't have this in 2023 with Bergamo.

A stacked device using bridge chips would probably not use SoIC, so the connectivity would likely not be dense enough for L3 cache. It could have L4 cache, though. For stacking, the die would need to be placed directly adjacent to the IO die, so that would favor a rectangular shape rather than a near-square one. The short dimension of the CPU die times 4 would need to be close to the long dimension of the IO die. It is already close: a Zen 4 die is 6.75 mm on the short side (x4 = 27 mm) and the IO die is 24.79 mm on the long side. Shrink the CPU chiplet to 6.2 mm on the short side and it would almost exactly match. Not sure how the IO die would be different; it is plausible that it is the same IO die with both interfaces.

It makes sense for it to be more of a rectangle with 2 CCXs. It would still need L3 cache in this scenario. We do not know the die size, but the L3 cache has been rumored to be 16 MB. That would make a lot of sense, since this would basically be two of the 8-core mobile CCXs on a dense, low-power process. Also, if the Infinity Cache chips are cheap enough to use on GPUs, then it seems like they would be cheap enough for Bergamo.
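A quick sanity check of that geometry (6.75 mm and 24.79 mm are the figures quoted above; 6.2 mm is the hypothetical shrink, not a known number):

Code:
# Rough check of the "4 chiplets per long edge of the IO die" idea above.
zen4_short_mm = 6.75      # Zen 4 CCD short side, from the post
iod_long_mm = 24.79       # IO die long side, from the post
shrunk_short_mm = 6.2     # hypothetical shrunk chiplet

print(f"4 x {zen4_short_mm} = {4 * zen4_short_mm} mm vs IO die {iod_long_mm} mm")    # 27.0 mm: overhangs
print(f"4 x {shrunk_short_mm} = {4 * shrunk_short_mm} mm vs IO die {iod_long_mm} mm")  # 24.8 mm: near-exact fit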
 

Mopetar

Diamond Member
Jan 31, 2011
8,436
7,631
136
Does Zen 3 have cache control instructions to prevent certain required data from being evicted again and again due to cache pressure?

There are a lot of different cache replacement policy algorithms, but most use something close to or approximating an LRU (least recently used) algorithm. As long as something is using it frequently enough it won't be evicted.

If it's used that often, though, it's likely to be in either the L2 or even the L1 cache of the cores that need it, and they won't even need to go to L3 for it, so the added V-Cache doesn't help here.
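As a toy illustration of that LRU behaviour (a sketch of the general idea, not how Zen 3's actual L3 replacement policy is implemented):

Code:
# Toy LRU cache: a line that keeps getting touched never ages out, which is
# why frequently used data rarely needs explicit pinning.
from collections import OrderedDict

class ToyLRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()          # key -> data, oldest first

    def access(self, key, data=None):
        if key in self.lines:
            self.lines.move_to_end(key)     # touch: mark as most recently used
            return self.lines[key]
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used line
        self.lines[key] = data
        return data

cache = ToyLRUCache(capacity=4)
for addr in (0, 1, 2, 0, 3, 0, 4, 0, 5):    # address 0 is reused constantly
    cache.access(addr, data=f"line {addr}")
print(list(cache.lines))                    # 0 survives; cold lines were evicted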
 

tomatosummit

Member
Mar 21, 2019
184
177
116
For what workloads? Don't mistake Azure using Milan-X for HPC instances as representative of the majority of the hyperscaler market.

Also, let me phrase things a little differently. What is AMD competing against in terms of cloud-specific products? You have Graviton, which sure has a strong focus on perf/W, but there's also the fact that because Amazon are designing it themselves, they save some cost on it vs competing Intel/AMD parts. You have Sierra Forest (eventually), which is also focusing on compute density by shipping Intel's Atom cores. As is Ampere's Altra; in fact, in the latter's case they released the 128c model with half the L3 cache specifically for this purpose.

Now let's be real: regardless of whether the cache sits on top of the die or in it, that power consumption argument doesn't change. And spoiler alert: Bergamo is not a low-power part, nor does it have a lower base TDP than Genoa. So you tell me: what else is there for AMD to focus on?
I think I got my wires crossed somewhere. I noticed you posted a couple of times about cost regarding Bergamo, but I think I misunderstood how you were replying to others.
I agree with everything else. STH is constantly touting 350W for future sockets, and not only for Intel desktop CPUs.
Even if 350W is a large power draw, across 128 cores it's fairly tepid, although I would bet the perf/W would still be far above the competition, even if you include the large IO die and the potential cache-stack power draw.
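Just to spell that division out (350 W and 128 cores are the numbers above; the IO die's share would come out of the same budget):

Code:
# Rough per-core budget at the rumoured socket power.
socket_watts = 350
cores = 128
print(f"{socket_watts / cores:.1f} W per core")  # ~2.7 W, before subtracting the IO die's share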
 
  • Like
Reactions: Tlh97 and lobz

tomatosummit

Member
Mar 21, 2019
184
177
116
I still think it might be stacked using bridge chips. Apple seems to be using a stacked device with bridge chips for the M1 Ultra. Nvidia will have Grace, which also seems to be a device using bridge chips. Intel will have Sapphire Rapids with EMIB, I guess, if it actually comes out this year. AMD will be a bit behind if they don't have this in 2023 with Bergamo.

A stacked device using bridge chips would probably not use SoIC, so the connectivity would likely not be dense enough for L3 cache. It could have L4 cache, though. For stacking, the die would need to be placed directly adjacent to the IO die, so that would favor a rectangular shape rather than a near-square one. The short dimension of the CPU die times 4 would need to be close to the long dimension of the IO die. It is already close: a Zen 4 die is 6.75 mm on the short side (x4 = 27 mm) and the IO die is 24.79 mm on the long side. Shrink the CPU chiplet to 6.2 mm on the short side and it would almost exactly match. Not sure how the IO die would be different; it is plausible that it is the same IO die with both interfaces.

It makes sense for it to be more of a rectangle with 2 CCXs. It would still need L3 cache in this scenario. We do not know the die size, but the L3 cache has been rumored to be 16 MB. That would make a lot of sense, since this would basically be two of the 8-core mobile CCXs on a dense, low-power process. Also, if the Infinity Cache chips are cheap enough to use on GPUs, then it seems like they would be cheap enough for Bergamo.
Unless there's a more exotic cache solution, be it active bridges or stacked L4 on die, I don't see a need for silicon bridges for Bergamo.
The IO die is going to be using 56G SerDes with PCIe 5, so using the same for the IF links doubles the throughput before whatever else AMD can do.
If the Charlie leak is real, then performance might not be much above 2x anyway.
Your examples, SPR especially, are going for better cache coherency, while AMD's Zen approach so far has been disaggregated and probably doesn't require bridging just yet.
 

MadRat

Lifer
Oct 14, 1999
11,967
281
126
When information is collected in the caches do not all caches fill at the same time from a memory transfer? I was under the impression that the inner most level of cache has the most specific information from the memory transfer, and all of the caches are related in scope. It's when one needs to read that it cuts access time because the information will be someplace that is closer to the core. But the initial loading process is the same speed as any other access to RAM. Maybe I understood the explanation wrong.
 

jamescox

Senior member
Nov 11, 2009
644
1,105
136
Unless there's a more exotic cache solution, be it active bridges or stacked L4 on die, I don't see a need for silicon bridges for Bergamo.
The IO die is going to be using 56G SerDes with PCIe 5, so using the same for the IF links doubles the throughput before whatever else AMD can do.
If the Charlie leak is real, then performance might not be much above 2x anyway.
Your examples, SPR especially, are going for better cache coherency, while AMD's Zen approach so far has been disaggregated and probably doesn't require bridging just yet.
I am talking about an active bridge with possibly up to 512 MB of cache, so I guess that is an “exotic” solution. Possibly the same infinity cache chips used to connect RDNA3 gpu chips together, rumored to be 512 MB or 384 MB. Bergamo might use 2 of them though, one on each side to attach 4 cpu die each.

There are lots of reasons to use stacking. One is massive bandwidth; HBM levels of interconnect should be possible, so they could easily have a 1024-bit link or more at low clock for each chiplet. The other is much lower power, both for the interconnect and the cache. One of the reasons that HBM was invented is that the high speed interfaces required for gpu memory were actually taking large amounts of die area and power. HBM-style interfaces take very little die area and are significantly lower power. The cache can be made on a cache optimized process, like what is used for v-cache, so it can be very dense and power efficient. It would still be a rather large die at 512 MB; perhaps it is more than one bridge chip. I was thinking of possibly four 128 MB die for maximum flexibility and modularity. That might be similar in size to a cpu chiplet.
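To put a rough number on the sort of link being described (the 1024-bit width is the figure from the paragraph above; the transfer rate is purely an assumed value for illustration):

Code:
# Back-of-envelope bandwidth for a hypothetical wide, low-clock stacked link.
link_width_bits = 1024        # width floated above
transfer_rate_gtps = 2.0      # assumed GT/s, not a known spec

bytes_per_transfer = link_width_bits / 8
print(f"{bytes_per_transfer * transfer_rate_gtps:.0f} GB/s per chiplet link")  # 256 GB/s at these assumptions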

The speed increase to pci-express 5 level speeds is not going to be free. I can’t find any power consumption numbers, but I assume it will be a bit more in Genoa to drive serdes at that speed vs Milan. Genoa also has more cpu links and more memory channels, so it is likely overdue for a stacked solution to reduce the interconnect power. Bergamo will only have 8 chiplets, but still the same IO and memory as Genoa.

I suspect that Zen 4c might be a bit more widely applicable than just “cloud”. With a denser process and smaller L3, the cores may not actually be cut down much at all. They could actually be full Zen 4 cores. I saw a labeled die photo that was supposed to be cezanne; an 8 core CCX with 16 MB L3 only taking up about 50 mm2 (about 6.2 x 8.3). That would be about right for fitting 4 along each side of the IO die, although that is Zen 3 on 7 nm compared to a 12 nm GF IO die. It would be different for Bergamo at 5 nm and whatever the IO die is made on, but everything might scale such that it is still a match. I am wondering if Genoa IO die will still be GF while Bergamo IO die is made at TSMC. The cpu die, with two 8-core CCX might be a long narrow die, perhaps similar to the aspect ratio of Zen 1.
 

tomatosummit

Member
Mar 21, 2019
184
177
116
I am talking about an active bridge with possibly up to 512 MB of cache, so I guess that is an “exotic” solution. Possibly the same infinity cache chips used to connect RDNA3 gpu chips together, rumored to be 512 MB or 384 MB. Bergamo might use 2 of them though, one on each side to attach 4 cpu die each.

There are lots of reasons to use stacking. One is massive bandwidth; HBM levels of interconnect should be possible, so they could easily have a 1024-bit link or more at low clock for each chiplet. The other is much lower power, both for the interconnect and the cache. One of the reasons that HBM was invented is that the high speed interfaces required for gpu memory were actually taking large amounts of die area and power. HBM-style interfaces take very little die area and are significantly lower power. The cache can be made on a cache optimized process, like what is used for v-cache, so it can be very dense and power efficient. It would still be a rather large die at 512 MB; perhaps it is more than one bridge chip. I was thinking of possibly four 128 MB die for maximum flexibility and modularity. That might be similar in size to a cpu chiplet.

The speed increase to pci-express 5 level speeds is not going to be free. I can’t find any power consumption numbers, but I assume it will be a bit more in Genoa to drive serdes at that speed vs Milan. Genoa also has more cpu links and more memory channels, so it is likely overdue for a stacked solution to reduce the interconnect power. Bergamo will only have 8 chiplets, but still the same IO and memory as Genoa.

I suspect that Zen 4c might be a bit more widely applicable than just “cloud”. With a denser process and smaller L3, the cores may not actually be cut down much at all. They could actually be full Zen 4 cores. I saw a labeled die photo that was supposed to be cezanne; an 8 core CCX with 16 MB L3 only taking up about 50 mm2 (about 6.2 x 8.3). That would be about right for fitting 4 along each side of the IO die, although that is Zen 3 on 7 nm compared to a 12 nm GF IO die. It would be different for Bergamo at 5 nm and whatever the IO die is made on, but everything might scale such that it is still a match. I am wondering if Genoa IO die will still be GF while Bergamo IO die is made at TSMC. The cpu die, with two 8-core CCX might be a long narrow die, perhaps similar to the aspect ratio of Zen 1.
It's undeniable that stacking is better for performance all around, but it's not a free lunch. Although 3D has come out cheaper than expected, there are avenues AMD could take that aren't just throwing more silicon cache or bridges at the problem if the design doesn't explicitly require it.
N6 will be a decent power drop even for IO, and as shown with Rembrandt, AMD's uncore design teams aren't idle and have made really good power improvements that should land in the Zen 4 IO dies as well.
Back to silicon cost (again assuming Charlie's Bergamo leak is correct), keeping the same IO design between Genoa and Bergamo is another saver.

More complex 3D designs are certainly the future, but I don't think that's in Zen 4 beyond the 3D cache.
And I wouldn't really say AMD is falling behind in that regard, even if SPR and Ponte Vecchio are chock full of the stuff; they're more like throwing silicon at the problems, while AMD is still being more budget-focused. RDNA3 is the obvious coming example for AMD's 3D stuff.
 

MadRat

Lifer
Oct 14, 1999
11,967
281
126
If you do a narrow die then you probably need your infinity fabric to talk through both ends of it. Otherwise your traffic through the chip would probably suffer at the end furthest from the connection.
 

jamescox

Senior member
Nov 11, 2009
644
1,105
136
If you do a narrow die then you probably need your infinity fabric to talk through both ends of it. Otherwise your traffic through the chip would probably suffer at the end furthest from the connection.
That depends on how far the cache die / bridge chip underlaps the CPU die and IO die. They might be able to have a connection at the edge of each CCX towards the IO die, with the bridge die directly under it. You can't do too long of a run; I believe an AMD representative already talked about this at some point, but I don't remember where.

When I looked at the die sizes before, I came to the conclusion that Bergamo might actually be doable in a single or 1.5x reticle size. You have the 5 nm shrink, possibly a much more dense process, a lot of the serdes PHYs removed, and the on die cache reduced. They may have cut other stuff like FP units, but that seems less likely. This means that it could actually be a rather low risk use of stacking if it is a single reticle size. Apple seems to already be using some form of TSMC silicon bridge tech for M1 Ultra. I don’t see why it is so hard to believe that AMD would have such a product next year. Nvidia will have the Grace chip next year (likely late next year) likely using some form of bridge chip also.
 

jamescox

Senior member
Nov 11, 2009
644
1,105
136
It's undeniable that stacking is better for performance all around, but it's not a free lunch. Although 3D has come out cheaper than expected, there are avenues AMD could take that aren't just throwing more silicon cache or bridges at the problem if the design doesn't explicitly require it.
N6 will be a decent power drop even for IO, and as shown with Rembrandt, AMD's uncore design teams aren't idle and have made really good power improvements that should land in the Zen 4 IO dies as well.
Back to silicon cost (again assuming Charlie's Bergamo leak is correct), keeping the same IO design between Genoa and Bergamo is another saver.

More complex 3D designs are certainly the future, but I don't think that's in Zen 4 beyond the 3D cache.
And I wouldn't really say AMD is falling behind in that regard, even if SPR and Ponte Vecchio are chock full of the stuff; they're more like throwing silicon at the problems, while AMD is still being more budget-focused. RDNA3 is the obvious coming example for AMD's 3D stuff.
So 400 W server / HPC CPUs are both fine and dandy at the same time? No reason to reduce interconnect power consumption? Apple is already selling an APU with what looks like an EFB silicon bridge. Nvidia will have Grace in 2023 with likely silicon bridge. AMD may have GPUs with EFB bridges this year or early next year; not sure what the current rumors are on AMD RDNA3.

Do you think it might be interesting if the GPUs connect to the same bridge chip used in Bergamo? If they are modular and interchangeable? I believe AMD CDNA GPUs already have a large number of infinity fabric links. I haven’t read up too much on that though. Would a device with 64 Zen4c cores on one side of an IO die and a gpu on the other be interesting? That would only be around 500 GB/s of DDR5 bandwidth directly accessible from a GPU and up to 12 TB DDR5 capacity. Who would want something like that? I guess it would be a lot like Nvidia’s Grace-Hopper “superchip” but the Zen4c cpus would at least perform a lot better and it would have access to more memory.

Sarcasm aside, I think it would be more surprising if AMD doesn't have an EFB-based device in 2023. It might be later in 2023, but they will be a year or two behind if they wait for Zen 5. The Grace-Hopper device is probably not an option for many use cases since the cpu would be too weak to handle input processing fast enough to keep the gpu utilized. That might annoy some people who are locked in by CUDA code. Most will have to go with a lower density pci-express connected solution with much lower bandwidth between the cpu and gpu. They can get around that to some extent by using many smaller GPUs with separate links.
 

tomatosummit

Member
Mar 21, 2019
184
177
116
So 400 W server / HPC CPUs are both fine and dandy at the same time? No reason to reduce interconnect power consumption? Apple is already selling an APU with what looks like an EFB silicon bridge. Nvidia will have Grace in 2023 with likely silicon bridge. AMD may have GPUs with EFB bridges this year or early next year; not sure what the current rumors are on AMD RDNA3.

Do you think it might be interesting if the GPUs connect to the same bridge chip used in Bergamo? If they are modular and interchangeable? I believe AMD CDNA GPUs already have a large number of infinity fabric links. I haven’t read up too much on that though. Would a device with 64 Zen4c cores on one side of an IO die and a gpu on the other be interesting? That would only be around 500 GB/s of DDR5 bandwidth directly accessible from a GPU and up to 12 TB DDR5 capacity. Who would want something like that? I guess it would be a lot like Nvidia’s Grace-Hopper “superchip” but the Zen4c cpus would at least perform a lot better and it would have access to more memory.

Sarcasm aside, I think it would be more surprising if AMD doesn't have an EFB-based device in 2023. It might be later in 2023, but they will be a year or two behind if they wait for Zen 5. The Grace-Hopper device is probably not an option for many use cases since the cpu would be too weak to handle input processing fast enough to keep the gpu utilized. That might annoy some people who are locked in by CUDA code. Most will have to go with a lower density pci-express connected solution with much lower bandwidth between the cpu and gpu. They can get around that to some extent by using many smaller GPUs with separate links.
Yeah, rising power draw is no joke. Before Zen 2, server CPUs used 150-180W and we're going into a 300+W world. In the accelerator world, server PCIe slots tapped out at 300W, so OAM3 supports up to 500W with CDNA2 and SXM is now up to 700W with Hopper; CPU sockets are in the same systems, so they can use more. And, to be slightly on topic, stacked cache has shown increased power draw anyway.

So in a situation where CCDs with a large local stacked cache actually reduce the required interconnect bandwidth, the chances of seeing bridge chips are reduced.
No one's going to pass on what are probably going to be the most efficient and most performant CPUs on the market for at least one more generation just because the IO die is using an extra 50W, which is just a trade-off for reduced costs from using less silicon on bridge chips and less advanced packaging.
The IO die does actually hurt the low-end servers that Xeon Bronze competes in, but there are rumours of a smaller Zen 4 Epyc socket to address that.

Zen 5 is rumoured for 2023 and it's a new design family, which opens up the book for new designs, so maybe it's active bridges/interposers then.

What is using silicon bridges are devices that require large aggregate bandwidth, namely GPUs and APUs: Nvidia's XX100 parts all use HBM, CDNA2 uses fan-out for HBM and interconnect, Sapphire Rapids uses EMIB for shared L3 and shared HBM bandwidth (Apple's APU is similar), and Ponte Vecchio uses everything you can imagine.
Grace CPUs look more traditional. The images we have been shown look like discrete packages, and all the material says lots of NVLink, NVLink, NVLink, which has no problem running over copper currently.

There was a good presentation at last Hot Chips on 3D-stacking chips and the design considerations around it that make reusable silicon, especially something intrinsic like a cache bridge chip, really unlikely. Even with chiplets we haven't seen anything reused other than the CCD across desktop and server products that hardly differ in intended use; we didn't even see a tiny GPU chiplet that could be used in a low-end discrete GPU or attached to an IO die, or even a different IO+GPU die, nothing fun at all.
There was a leak in ~2016 of a huge AMD 16- or 32-core APU with HBM and a stacked-everything design that hasn't seen the light of day yet; Ponte Vecchio looks most similar to that minus the CPU chiplets, and the M1 Max is doing a big APU but not in a particularly interesting design.
 

jamescox

Senior member
Nov 11, 2009
644
1,105
136
Yeah, rising power draw is no joke. Before Zen 2, server CPUs used 150-180W and we're going into a 300+W world. In the accelerator world, server PCIe slots tapped out at 300W, so OAM3 supports up to 500W with CDNA2 and SXM is now up to 700W with Hopper; CPU sockets are in the same systems, so they can use more. And, to be slightly on topic, stacked cache has shown increased power draw anyway.

So in a situation where CCDs with a large local stacked cache actually reduce the required interconnect bandwidth, the chances of seeing bridge chips are reduced.
No one's going to pass on what are probably going to be the most efficient and most performant CPUs on the market for at least one more generation just because the IO die is using an extra 50W, which is just a trade-off for reduced costs from using less silicon on bridge chips and less advanced packaging.
The IO die does actually hurt the low-end servers that Xeon Bronze competes in, but there are rumours of a smaller Zen 4 Epyc socket to address that.

Zen 5 is rumoured for 2023 and it's a new design family, which opens up the book for new designs, so maybe it's active bridges/interposers then.

What is using silicon bridges are devices that require large aggregate bandwidth, namely GPUs and APUs: Nvidia's XX100 parts all use HBM, CDNA2 uses fan-out for HBM and interconnect, Sapphire Rapids uses EMIB for shared L3 and shared HBM bandwidth (Apple's APU is similar), and Ponte Vecchio uses everything you can imagine.
Grace CPUs look more traditional. The images we have been shown look like discrete packages, and all the material says lots of NVLink, NVLink, NVLink, which has no problem running over copper currently.

There was a good presentation at last Hot Chips on 3D-stacking chips and the design considerations around it that make reusable silicon, especially something intrinsic like a cache bridge chip, really unlikely. Even with chiplets we haven't seen anything reused other than the CCD across desktop and server products that hardly differ in intended use; we didn't even see a tiny GPU chiplet that could be used in a low-end discrete GPU or attached to an IO die, or even a different IO+GPU die, nothing fun at all.
There was a leak in ~2016 of a huge AMD 16- or 32-core APU with HBM and a stacked-everything design that hasn't seen the light of day yet; Ponte Vecchio looks most similar to that minus the CPU chiplets, and the M1 Max is doing a big APU but not in a particularly interesting design.
I think that the Apple M1 Ultra (with a silicon bridge connecting 2 M1 Max chips) is an interesting design. I don’t think we know specifically if it is the same EFB that AMD would use, but it seems to be a silicon bridge of some kind, so likely the same tech from TSMC will be used for infinity cache GPUs. We don’t know how the Grace cpu is connected together either. The images are almost certainly just a rendering. If the CPUs are directly adjacent, it would make a lot more sense to use a silicon bridge for the kind of bandwidth they are talking about. The gpu would then be connected by NVlink. It isn’t going to be available for a long time, which makes me wonder if Nvidia just wanted to talk about it first such that they look like they are leading the technology. If AMD announces a similar device later, it looks a little more like they are the follower, even if their device ends up being available first.

For Bergamo, with 128 cores, a lot of bandwidth will be required. Zen 4 may have significantly increased floating-point compute over Zen 3, and Bergamo may not have any of that cut out. AMD has dealt with bandwidth requirements in their GPUs by using Infinity Cache. I believe AMD has talked about using Infinity Cache across multiple products. It would be a good way to reduce interconnect power, increase bandwidth, and add a lot of cache back. Bergamo does not seem to be using V-Cache, but it might not be necessary if it has Infinity Cache.

Charlie at SemiAccurate has called Bergamo a monster. It doesn't sound like one if it is just two 8-core CCXs with 16 MB of L3 per die connected by SerDes. It is a lot of cores, and they should be more power efficient, meaning they may actually be able to sustain good clock speeds, so perhaps the performance could be massive without extra cache. Extra cache can help a lot though, as Milan-X has demonstrated.

The memory interface for current Epyc is very wide, at 512-bit. That is, in fact, wider than most current GPUs. That is GDDR6 vs DDR4, but with DDR5 it goes up to 768-bit (64 x 12, or perhaps more accurately 32 x 24) for SP5, which is near 500 GB/s per Epyc socket. That will be close to the bandwidth of an Nvidia P100 GPU; current Nvidia A10 GPUs are only 600 GB/s. Saying that Bergamo isn't a "high aggregate bandwidth" device doesn't seem correct. We also may get a significant DDR5 speed increase before Bergamo comes out next year. I assume the 460 GB/s number I have seen is actually with rather low-clocked DDR5; I will need to look that up.
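For reference, the arithmetic behind those numbers (the DDR5 speed grades are assumptions used only to show the range; the post itself only cites the ~460 GB/s figure):

Code:
# Peak theoretical DDR5 bandwidth for a 12-channel (768-bit) SP5 socket.
channels = 12
bits_per_channel = 64                         # 2 x 32-bit DDR5 sub-channels
bus_bytes = channels * bits_per_channel / 8   # 96 bytes per transfer

for mtps in (4800, 5200):                     # assumed DDR5 speed grades (MT/s)
    print(f"DDR5-{mtps}: {bus_bytes * mtps / 1000:.0f} GB/s")
# DDR5-4800: 461 GB/s, DDR5-5200: 499 GB/s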

There could be some other secret sauce to Bergamo or some Genoa derivative. I don't know how well it would compete with CPUs using integrated HBM. Infinity Cache would be one way to compete without going full (and expensive) HBM, just like AMD did with their GPUs. The Infinity Cache would just be 1 or 2 embedded dies rather than something like 4 or 8 stacks of HBM. AMD integrating HBM on a CPU, unless it is included with a GPU, seems unlikely. HBM can obviously supply huge bandwidth, but it still has DRAM latency. It is plausible that AMD will start making parts with multiple levels of V-Cache rather than using Infinity Cache chips.

I am just speculating here. We very well might not see an EFB part until Zen 5, but that is a long way off. It could be that Grace/Hopper is out in a similar time frame as Zen 5 using EFB. I could see AMD using Bergamo as a bit of a test bed for Zen 5 though. That is, the Bergamo IO die might be the same as what is used with Zen 5. Using infinity cache doesn’t fit with some of the rumors, but it makes a lot of sense to me. It also could be the case that it is just a bunch of serdes connected chips with small caches. Rather boring, but entirely possible.
 
  • Like
Reactions: Drazick and Tlh97

nicalandia

Diamond Member
Jan 10, 2019
3,331
5,282
136
Not 3D V-Cache, but it's being released at the same time, so it's relevant...

New AMD Ryzen 7 5700X benchmarks showcase stellar performance

 
  • Like
Reactions: Tlh97 and scannall
Nov 26, 2005
15,189
401
126
That'll kill the resale value of the 5800X. The L1 cache is the same as well? 512KB?

Odd that they released the 3700X and the 3800X around the same time, while for Vermeer they released the 5800X on 11/05/20 and are now releasing the 5700X on 04/04/22. Again they're proving the 5800X a bad buy. Dang, got suckered again, lol
 

gdansk

Diamond Member
Feb 8, 2011
4,209
7,070
136
They were so close to an opportune launch with X3D. Only 5 months late or so.

But the 5700X? Seems like it should have launched in 2020 with the rest. But they wanted that ASP.
 
Last edited: