Info: 64 MB V-Cache on 5XXX Zen 3, Average +15% in Games


Kedas

Senior member
Dec 6, 2018
355
339
136
Well, we now know how they will bridge the long wait until Zen 4 on AM5 in Q4 2022.
Production start for V-Cache is the end of this year, which is too early for Zen 4, so this is certainly coming to AM4.
The +15%, Lisa said, is "like an entire architectural generation".
 
Last edited:
  • Like
Reactions: Tlh97 and Gideon

StefanR5R

Elite Member
Dec 10, 2016
6,551
10,293
136
Genoa will have 12 CCDs per package; Bergamo will most certainly not, as 128 is not easily divided by 12. Hence, there is an obvious likelihood that Bergamo's CCDs could be either smaller or larger than Genoa's CCDs. (The IOD may potentially differ too.)

It's a bit early to break out MS Paint to piece a Zen 4c die shot together.
 

uzzi38

Platinum Member
Oct 16, 2019
2,746
6,653
146
Without a chiplet resize I just don't see where they are going to fit 128 regular-sized Zen 4 cores... hence Zen 4c, which is said to be dense (as in SRAM density, not logic, since TSMC isn't there yet, especially at 5nm)
Oh right, that's what I was saying too. Assuming that Zen 4c and Zen 4 cores are the same physical size (including caches) is a terrible idea. Assuming that the chiplets are the exact same physical size is also a terrible idea. They have more space to work with (8 chiplets vs 12 on Genoa), and Zen 4c doesn't need to be a particularly high-clocking design.
 
  • Like
Reactions: Tlh97 and Saylick

tomatosummit

Member
Mar 21, 2019
184
177
116
Without a chiplet resize I just don't see where they are going to fit 128 regular-sized Zen 4 cores... hence Zen 4c, which is said to be dense (as in SRAM density, not logic, since TSMC isn't there yet, especially at 5nm)
The CCD probably doesn't have to be the same size, but a small change won't make that much difference. SP5 is looking huge right now, and if there are fewer CCDs than Genoa there are fewer space constraints on the package.
Very hypothetically, there could even be some smart allocation of IF connections: 6 per quadrant on the IO die, 2 to each CCD on Genoa and 3 for Bergamo, to allow a different CCD IO layout.
The cores could strip things out like AVX execution capability (the PlayStation 5 already showed this is possible), and moving cache off the core silicon is another potential.

Zen generations have been fairly stringent in their CCX layouts, and with Zen 4 being a Zen 3 derivative, is it still limited to an 8-core ring layout?
- Bergamo CCDs could be dual-CCX, a la Zen 2
- or (it sickens me to think this) I like the WCCF leak layout, because it would be the same ring layout with two cores at each stop

That wouldn't be cheaper at all, and what's more I don't know why you're fixated on removing the L3 cache from the base die. There's no point to that. You're trying to fit the exact same size cores in the exact same amount of space without realising that
Why do you think Bergamo is a cost-sensitive solution?
It looks more like a way to get as many cores with enough performance into a socket as possible than a value- or efficiency-specific play.
Hyperscalers are already buying the biggest CPUs available even if they get discounts mere mortals would kill for. They are a big enough market for AMD to target a CPU line towards, and their demands will be met, be it "more cores and less junk like extra cache and AVX".
A stacked-only L3 cache is probably there just to keep some semblance of performance for a 16-core chiplet that would otherwise have only L2, and in the future it could even be a performance differentiator with more L3 layers for certain customers, although that's probably more in the wheelhouse of a Genoa-X product.
Unless you're assuming 3D stacking to be particularly expensive, but it looks like it's not: although driven by competition as well, the RRP of Epyc-X and Ryzen 3D hasn't increased asking prices at all, so the extra L3 silicon and packaging process seems to be a fairly insignificant extra, and Bergamo isn't likely to be a product that can't eat up those costs.
 

uzzi38

Platinum Member
Oct 16, 2019
2,746
6,653
146
Why do you think Bergamo is a cost-sensitive solution?
It looks more like a way to get as many cores with enough performance into a socket as possible than a value- or efficiency-specific play.
Hyperscalers are already buying the biggest CPUs available even if they get discounts mere mortals would kill for. They are a big enough market for AMD to target a CPU line towards, and their demands will be met, be it "more cores and less junk like extra cache and AVX".
A stacked-only L3 cache is probably there just to keep some semblance of performance for a 16-core chiplet that would otherwise have only L2, and in the future it could even be a performance differentiator with more L3 layers for certain customers, although that's probably more in the wheelhouse of a Genoa-X product.
Unless you're assuming 3D stacking to be particularly expensive, but it looks like it's not: although driven by competition as well, the RRP of Epyc-X and Ryzen 3D hasn't increased asking prices at all, so the extra L3 silicon and packaging process seems to be a fairly insignificant extra, and Bergamo isn't likely to be a product that can't eat up those costs.

For what workloads? Don't mistake Azure using Milan-X for HPC instances as representative of the majority of the hyperscaler market.

Also, let me phrase things a little differently. What is AMD competing against in terms of cloud-specific products? You have Graviton, which sure has a strong focus on perf/W, but there's also the fact that because Amazon are designing it themselves, they save some cost on it vs competing Intel/AMD parts. You have Sierra Forest (eventually), which is also focusing on compute density by shipping Intel's Atom cores. As is Ampere's Altra; in fact, in the latter's case they released the 128c model with half the L3 cache specifically for this purpose.

Now let's be real: regardless of whether the cache sits on top of the die or in it, that power consumption argument doesn't change. And spoiler alert: Bergamo is not a low-power part, nor does it have a lower base TDP than Genoa. So you tell me: what else is there for AMD to focus on?
 

nicalandia

Diamond Member
Jan 10, 2019
3,331
5,282
136
Genoa will have 12 CCDs per package; Bergamo will most certainly not, as 128 is not easily divided by 12. Hence, there is an obvious likelihood that Bergamo's CCDs could be either smaller or larger than Genoa's CCDs. (The IOD may potentially differ too.)

It's a bit early to break out MS Paint to piece a Zen 4c die shot together.

If you had to guess based on the known information we have so far (that the SP5 CPU is made up of four quadrants, and that you just can't place a CCD in the middle where the quadrants meet)...

Here is an X-ray of the SP5 CPU. I've been scratching my head over how they are going to go about it. We know that the TSMC SRAM libraries used in Milan-X are twice as dense as its high-performance 7nm libraries, and I suspect that will also be the case at 5nm. But that is SRAM density; they can't build Zen 4c on the same libraries and expect the same density on the logic side (FP libraries, IO). So with that in mind, how do they go about it?

[attached image: X-ray of the SP5 package]


Orange lines divide the quadrants, and the red one is what I believe they will do: just make a larger die with super-dense SRAM (it will not be twice as large as Zen 4, since that would not fit on the package).
 
Last edited:

MadRat

Lifer
Oct 14, 1999
11,967
281
126
I'd think shrinks are getting to the point where the proximity of each core, per die and per chiplet, to shared cache and physical memory will be critical. If you have multiple dies on multiple chiplets, then your selection of the active core probably resembles a March Madness bracket when it comes to workload priorities. And to some extent it is probably virtualized in hardware: the OS may rank cores, yet hardware and software ranks would likely differ. Software in that scenario would never know which core it has truly been assigned; when it needs another core, it simply activates the next one by rank. That would also allow AMD to creatively customize core counts, performance, or both by using virtualized core management at the hardware level. (Your 8-core chip might have 24 physical cores that it could address, move the locations of virtual cores dynamically to load-balance thermals, and you would never actually see it because you in fact never use more than 8 simultaneously.) You could allow software to signal different ranking strategies, but you would still want hardware doing the execution, and to allow for exceptions that bypass defective dies/chiplet cores or communication pathways.
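(A purely hypothetical sketch of that idea, with invented names and numbers rather than any real AMD mechanism: a hardware-maintained rank table maps the handful of cores the OS sees onto a larger physical pool, skipping defective cores and re-ranking as thermals change.)

Code:
# Hypothetical illustration only: rank-based virtual-to-physical core mapping.
# Everything here (class name, numbers) is invented for the example.

class VirtualCoreManager:
    def __init__(self, physical_cores, exposed_count, defective=()):
        # rank[i] = physical core id; lower index = currently preferred core
        self.rank = [c for c in range(physical_cores) if c not in set(defective)]
        self.exposed_count = exposed_count      # what the OS is allowed to see, e.g. 8

    def rerank(self, temperatures):
        # Prefer the coolest physical cores to load-balance thermals.
        self.rank.sort(key=lambda c: temperatures.get(c, 0.0))

    def assign(self, virtual_core_id):
        # The OS asks for "virtual core N"; hardware silently hands back the
        # Nth-ranked physical core. Software never learns which one it really got.
        if virtual_core_id >= self.exposed_count:
            raise ValueError("only %d cores are exposed" % self.exposed_count)
        return self.rank[virtual_core_id]

mgr = VirtualCoreManager(physical_cores=24, exposed_count=8, defective=(5, 17))
mgr.rerank({c: 40.0 + (c % 7) for c in range(24)})     # fake per-core temperatures
print([mgr.assign(v) for v in range(8)])               # physical ids behind virtual cores 0-7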
 
Last edited:

jamescox

Senior member
Nov 11, 2009
644
1,105
136
If you had to guess based on the known information we have so far (that the SP5 CPU is made up of four quadrants, and that you just can't place a CCD in the middle where the quadrants meet)...

Here is an X-ray of the SP5 CPU. I've been scratching my head over how they are going to go about it. We know that the TSMC SRAM libraries used in Milan-X are twice as dense as its high-performance 7nm libraries, and I suspect that will also be the case at 5nm. But that is SRAM density; they can't build Zen 4c on the same libraries and expect the same density on the logic side (FP libraries, IO). So with that in mind, how do they go about it?

[attached image: X-ray of the SP5 package]


Orange lines divide the quadrants, and the red one is what I believe they will do: just make a larger die with super-dense SRAM (it will not be twice as large as Zen 4, since that would not fit on the package).
I still think it might be stacked using bridge chips. Apple seems to be using a stacked device with bridge chips for the M1 Ultra. Nvidia will have Grace, which also seems to be a device using bridge chips. Intel will have Sapphire Rapids with EMIB, I guess, if it actually comes out this year. AMD will be a bit behind if they don't have this in 2023 with Bergamo.

A stacked device using bridge chips would probably not use SoIC, so the connectivity would likely not be dense enough for L3 cache. It could have L4 cache, though. For stacking, the die would need to be placed directly adjacent to the IO die, so that would favor a rectangular shape rather than a near-square one. The short dimension of the CPU die times 4 would need to be close to the long dimension of the IO die. It is already close: a Zen 4 die is 6.75 mm on the short side (x4 = 27 mm) and the IO die is 24.79 mm on the long side. Shrink the CPU chiplet to 6.2 mm on the short side and it would almost exactly match. Not sure how the IO die would be different; it is plausible that it is the same IO die with both interfaces.

It makes sense for it to be more of a rectangle with 2 CCXs. It would still need L3 cache in this scenario. We do not know the die size, but the L3 cache has been rumored to be 16 MB. That would make a lot of sense, since this would basically be two of the 8-core mobile CCXs on a dense, low-power process. Also, if the Infinity Cache chips are cheap enough to use on GPUs, then it seems like they would be cheap enough for Bergamo.
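A quick sanity check of that geometry (6.75 mm and 24.79 mm are the figures quoted above; 6.2 mm is the hypothetical shrink, not a known number):

Code:
# Rough check of the "4 chiplets per long edge of the IO die" idea above.
zen4_short_mm = 6.75      # Zen 4 CCD short side, from the post
iod_long_mm = 24.79       # IO die long side, from the post
shrunk_short_mm = 6.2     # hypothetical shrunk chiplet

print(f"4 x {zen4_short_mm} = {4 * zen4_short_mm} mm vs IO die {iod_long_mm} mm")    # 27.0 mm: overhangs
print(f"4 x {shrunk_short_mm} = {4 * shrunk_short_mm} mm vs IO die {iod_long_mm} mm")  # 24.8 mm: near-exact fit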
 

Mopetar

Diamond Member
Jan 31, 2011
8,436
7,631
136
Does Zen 3 have cache control instructions to prevent certain required data from being evicted again and again due to cache pressure?

There are a lot of different cache replacement policy algorithms, but most use something close to or approximating an LRU (least recently used) algorithm. As long as something is using it frequently enough it won't be evicted.

If it's used that often, though, it's likely to be in either the L2 or even the L1 cache of the cores that need it, and they won't even need to go to L3 for it, so the added V-Cache doesn't help here.
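As a toy illustration of that LRU behaviour (a sketch of the general idea, not how Zen 3's actual L3 replacement policy is implemented):

Code:
# Toy LRU cache: a line that keeps getting touched never ages out, which is
# why frequently used data rarely needs explicit pinning.
from collections import OrderedDict

class ToyLRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()          # key -> data, oldest first

    def access(self, key, data=None):
        if key in self.lines:
            self.lines.move_to_end(key)     # touch: mark as most recently used
            return self.lines[key]
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used line
        self.lines[key] = data
        return data

cache = ToyLRUCache(capacity=4)
for addr in (0, 1, 2, 0, 3, 0, 4, 0, 5):    # address 0 is reused constantly
    cache.access(addr, data=f"line {addr}")
print(list(cache.lines))                    # 0 survives; cold lines were evicted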
 

tomatosummit

Member
Mar 21, 2019
184
177
116
For what workloads? Don't mistake Azure using Milan-X for HPC instances as representative of the majority of the hyperscaler market.

Also, let me phrase things a little differently. What is AMD competing against in terms of cloud-specific products? You have Graviton, which sure has a strong focus on perf/W, but there's also the fact that because Amazon are designing it themselves, they save some cost on it vs competing Intel/AMD parts. You have Sierra Forest (eventually), which is also focusing on compute density by shipping Intel's Atom cores. As is Ampere's Altra; in fact, in the latter's case they released the 128c model with half the L3 cache specifically for this purpose.

Now let's be real: regardless of whether the cache sits on top of the die or in it, that power consumption argument doesn't change. And spoiler alert: Bergamo is not a low-power part, nor does it have a lower base TDP than Genoa. So you tell me: what else is there for AMD to focus on?
I think I got my wires crossed somewhere. I noticed you posted a couple of times about cost regarding Bergamo, but I think I misunderstood how you were replying to others.
I agree with everything else. STH is constantly touting 350W for future sockets, and not only for Intel desktop CPUs.
Even if 350W is a large power draw, across 128 cores it's fairly tepid, although I would bet the perf/W would still be far above the competition, even if you include the large IO die and the potential cache-stack power draw.
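Just to spell that division out (350 W and 128 cores are the numbers above; the IO die's share would come out of the same budget):

Code:
# Rough per-core budget at the rumoured socket power.
socket_watts = 350
cores = 128
print(f"{socket_watts / cores:.1f} W per core")  # ~2.7 W, before subtracting the IO die's share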
 
  • Like
Reactions: Tlh97 and lobz

tomatosummit

Member
Mar 21, 2019
184
177
116
I still think it might be stacked using bridge chips. Apple seems to be using a stacked device with bridge chips for the M1 Ultra. Nvidia will have Grace, which also seems to be a device using bridge chips. Intel will have Sapphire Rapids with EMIB, I guess, if it actually comes out this year. AMD will be a bit behind if they don't have this in 2023 with Bergamo.

A stacked device using bridge chips would probably not use SoIC, so the connectivity would likely not be dense enough for L3 cache. It could have L4 cache, though. For stacking, the die would need to be placed directly adjacent to the IO die, so that would favor a rectangular shape rather than a near-square one. The short dimension of the CPU die times 4 would need to be close to the long dimension of the IO die. It is already close: a Zen 4 die is 6.75 mm on the short side (x4 = 27 mm) and the IO die is 24.79 mm on the long side. Shrink the CPU chiplet to 6.2 mm on the short side and it would almost exactly match. Not sure how the IO die would be different; it is plausible that it is the same IO die with both interfaces.

It makes sense for it to be more of a rectangle with 2 CCXs. It would still need L3 cache in this scenario. We do not know the die size, but the L3 cache has been rumored to be 16 MB. That would make a lot of sense, since this would basically be two of the 8-core mobile CCXs on a dense, low-power process. Also, if the Infinity Cache chips are cheap enough to use on GPUs, then it seems like they would be cheap enough for Bergamo.
Unless there's a more exotic cache solution, be it active bridges or stacked L4 on die, I don't see a need for silicon bridges for Bergamo.
The IO die is going to be using 56G SerDes with PCIe 5, so using the same for the IF links doubles the throughput before whatever else AMD can do.
If the Charlie leak is real, then performance might not be much above 2x anyway.
Your examples, SPR especially, are going for better cache coherency, while AMD's Zen approach so far has been disaggregated and probably doesn't require bridging just yet.
 

MadRat

Lifer
Oct 14, 1999
11,967
281
126
When information is collected in the caches do not all caches fill at the same time from a memory transfer? I was under the impression that the inner most level of cache has the most specific information from the memory transfer, and all of the caches are related in scope. It's when one needs to read that it cuts access time because the information will be someplace that is closer to the core. But the initial loading process is the same speed as any other access to RAM. Maybe I understood the explanation wrong.
 

jamescox

Senior member
Nov 11, 2009
644
1,105
136
Unless there's a more exotic cache solution, be it active bridges or stacked L4 on die, I don't see a need for silicon bridges for Bergamo.
The IO die is going to be using 56G SerDes with PCIe 5, so using the same for the IF links doubles the throughput before whatever else AMD can do.
If the Charlie leak is real, then performance might not be much above 2x anyway.
Your examples, SPR especially, are going for better cache coherency, while AMD's Zen approach so far has been disaggregated and probably doesn't require bridging just yet.
I am talking about an active bridge with possibly up to 512 MB of cache, so I guess that is an “exotic” solution. Possibly the same infinity cache chips used to connect RDNA3 gpu chips together, rumored to be 512 MB or 384 MB. Bergamo might use 2 of them though, one on each side to attach 4 cpu die each.

There are lots of reasons to use stacking. One is massive bandwidth; HBM levels of interconnect should be possible, so they could easily have a 1024-bit link or more at low clock for each chiplet. The other is much lower power, both for the interconnect and the cache. One of the reasons that HBM was invented is that the high speed interfaces required for gpu memory were actually taking large amounts of die area and power. HBM-style interfaces take very little die area and are significantly lower power. The cache can be made on a cache optimized process, like what is used for v-cache, so it can be very dense and power efficient. It would still be a rather large die at 512 MB; perhaps it is more than one bridge chip. I was thinking of possibly four 128 MB die for maximum flexibility and modularity. That might be similar in size to a cpu chiplet.
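To put a rough number on the sort of link being described (the 1024-bit width is the figure from the paragraph above; the transfer rate is purely an assumed value for illustration):

Code:
# Back-of-envelope bandwidth for a hypothetical wide, low-clock stacked link.
link_width_bits = 1024        # width floated above
transfer_rate_gtps = 2.0      # assumed GT/s, not a known spec

bytes_per_transfer = link_width_bits / 8
print(f"{bytes_per_transfer * transfer_rate_gtps:.0f} GB/s per chiplet link")  # 256 GB/s at these assumptions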

The speed increase to pci-express 5 level speeds is not going to be free. I can’t find any power consumption numbers, but I assume it will be a bit more in Genoa to drive serdes at that speed vs Milan. Genoa also has more cpu links and more memory channels, so it is likely overdue for a stacked solution to reduce the interconnect power. Bergamo will only have 8 chiplets, but still the same IO and memory as Genoa.

I suspect that Zen 4c might be a bit more widely applicable than just “cloud”. With a denser process and smaller L3, the cores may not actually be cut down much at all. They could actually be full Zen 4 cores. I saw a labeled die photo that was supposed to be cezanne; an 8 core CCX with 16 MB L3 only taking up about 50 mm2 (about 6.2 x 8.3). That would be about right for fitting 4 along each side of the IO die, although that is Zen 3 on 7 nm compared to a 12 nm GF IO die. It would be different for Bergamo at 5 nm and whatever the IO die is made on, but everything might scale such that it is still a match. I am wondering if Genoa IO die will still be GF while Bergamo IO die is made at TSMC. The cpu die, with two 8-core CCX might be a long narrow die, perhaps similar to the aspect ratio of Zen 1.
 

tomatosummit

Member
Mar 21, 2019
184
177
116
I am talking about an active bridge with possibly up to 512 MB of cache, so I guess that is an “exotic” solution. Possibly the same infinity cache chips used to connect RDNA3 gpu chips together, rumored to be 512 MB or 384 MB. Bergamo might use 2 of them though, one on each side to attach 4 cpu die each.

There are lots of reasons to use stacking. One is massive bandwidth; HBM levels of interconnect should be possible, so they could easily have a 1024-bit link or more at low clock for each chiplet. The other is much lower power, both for the interconnect and the cache. One of the reasons that HBM was invented is that the high speed interfaces required for gpu memory were actually taking large amounts of die area and power. HBM-style interfaces take very little die area and are significantly lower power. The cache can be made on a cache optimized process, like what is used for v-cache, so it can be very dense and power efficient. It would still be a rather large die at 512 MB; perhaps it is more than one bridge chip. I was thinking of possibly four 128 MB die for maximum flexibility and modularity. That might be similar in size to a cpu chiplet.

The speed increase to pci-express 5 level speeds is not going to be free. I can’t find any power consumption numbers, but I assume it will be a bit more in Genoa to drive serdes at that speed vs Milan. Genoa also has more cpu links and more memory channels, so it is likely overdue for a stacked solution to reduce the interconnect power. Bergamo will only have 8 chiplets, but still the same IO and memory as Genoa.

I suspect that Zen 4c might be a bit more widely applicable than just “cloud”. With a denser process and smaller L3, the cores may not actually be cut down much at all. They could actually be full Zen 4 cores. I saw a labeled die photo that was supposed to be cezanne; an 8 core CCX with 16 MB L3 only taking up about 50 mm2 (about 6.2 x 8.3). That would be about right for fitting 4 along each side of the IO die, although that is Zen 3 on 7 nm compared to a 12 nm GF IO die. It would be different for Bergamo at 5 nm and whatever the IO die is made on, but everything might scale such that it is still a match. I am wondering if Genoa IO die will still be GF while Bergamo IO die is made at TSMC. The cpu die, with two 8-core CCX might be a long narrow die, perhaps similar to the aspect ratio of Zen 1.
It's undeniable that stacking is better for performance all around, but it's not a free lunch. Although 3D has come out cheaper than expected, there are avenues AMD could take that aren't just throwing more silicon cache or bridges at the problem if the design doesn't explicitly require it.
N6 will be a decent power drop even for IO, and as shown with Rembrandt, AMD's uncore design teams aren't idle and have made really good power improvements that should land in the Zen 4 IO dies as well.
Back to silicon cost (again assuming Charlie's Bergamo leak is correct), keeping the same IO design between Genoa and Bergamo is another saver.

More complex 3D designs are certainly the future, but I don't think that's in Zen 4 beyond the 3D cache.
And I wouldn't really say AMD is falling behind in that regard, even if SPR and Ponte Vecchio are chock full of the stuff; they're more like throwing silicon at the problems, while AMD is still being more budget-focused. RDNA3 is the obvious coming example for AMD's 3D stuff.
 

MadRat

Lifer
Oct 14, 1999
11,967
281
126
If you do a narrow die then you probably need your infinity fabric to talk through both ends of it. Otherwise your traffic through the chip would probably suffer at the end furthest from the connection.
 

jamescox

Senior member
Nov 11, 2009
644
1,105
136
If you do a narrow die then you probably need your infinity fabric to talk through both ends of it. Otherwise your traffic through the chip would probably suffer at the end furthest from the connection.
That depends on how far the cache die / bridge chip underlaps the CPU die and IO die. They might be able to have a connection at the edge of each CCX towards the IO die, with the bridge die directly under it. You can't do too long of a run; I believe an AMD representative already talked about this at some point, but I don't remember where.

When I looked at the die sizes before, I came to the conclusion that Bergamo might actually be doable in a single or 1.5x reticle size. You have the 5 nm shrink, possibly a much more dense process, a lot of the serdes PHYs removed, and the on die cache reduced. They may have cut other stuff like FP units, but that seems less likely. This means that it could actually be a rather low risk use of stacking if it is a single reticle size. Apple seems to already be using some form of TSMC silicon bridge tech for M1 Ultra. I don’t see why it is so hard to believe that AMD would have such a product next year. Nvidia will have the Grace chip next year (likely late next year) likely using some form of bridge chip also.
 

jamescox

Senior member
Nov 11, 2009
644
1,105
136
It's undeniable that stacking is better for performance all around, but it's not a free lunch. Although 3D has come out cheaper than expected, there are avenues AMD could take that aren't just throwing more silicon cache or bridges at the problem if the design doesn't explicitly require it.
N6 will be a decent power drop even for IO, and as shown with Rembrandt, AMD's uncore design teams aren't idle and have made really good power improvements that should land in the Zen 4 IO dies as well.
Back to silicon cost (again assuming Charlie's Bergamo leak is correct), keeping the same IO design between Genoa and Bergamo is another saver.

More complex 3D designs are certainly the future, but I don't think that's in Zen 4 beyond the 3D cache.
And I wouldn't really say AMD is falling behind in that regard, even if SPR and Ponte Vecchio are chock full of the stuff; they're more like throwing silicon at the problems, while AMD is still being more budget-focused. RDNA3 is the obvious coming example for AMD's 3D stuff.
So 400 W server / HPC CPUs are both fine and dandy at the same time? No reason to reduce interconnect power consumption? Apple is already selling an APU with what looks like an EFB silicon bridge. Nvidia will have Grace in 2023 with likely silicon bridge. AMD may have GPUs with EFB bridges this year or early next year; not sure what the current rumors are on AMD RDNA3.

Do you think it might be interesting if the GPUs connect to the same bridge chip used in Bergamo? If they are modular and interchangeable? I believe AMD CDNA GPUs already have a large number of infinity fabric links. I haven’t read up too much on that though. Would a device with 64 Zen4c cores on one side of an IO die and a gpu on the other be interesting? That would only be around 500 GB/s of DDR5 bandwidth directly accessible from a GPU and up to 12 TB DDR5 capacity. Who would want something like that? I guess it would be a lot like Nvidia’s Grace-Hopper “superchip” but the Zen4c cpus would at least perform a lot better and it would have access to more memory.

Sarcasm aside, I think it would be more surprising if AMD doesn't have an EFB-based device in 2023. It might be later in 2023, but they will be a year or two behind if they wait for Zen 5. The Grace-Hopper device is probably not an option for many use cases since the cpu would be too weak to handle input processing fast enough to keep the gpu utilized. That might annoy some people who are locked in by CUDA code. Most will have to go with a lower density pci-express connected solution with much lower bandwidth between the cpu and gpu. They can get around that to some extent by using many smaller GPUs with separate links.
 

tomatosummit

Member
Mar 21, 2019
184
177
116
So 400 W server / HPC CPUs are both fine and dandy at the same time? No reason to reduce interconnect power consumption? Apple is already selling an APU with what looks like an EFB silicon bridge. Nvidia will have Grace in 2023 with likely silicon bridge. AMD may have GPUs with EFB bridges this year or early next year; not sure what the current rumors are on AMD RDNA3.

Do you think it might be interesting if the GPUs connect to the same bridge chip used in Bergamo? If they are modular and interchangeable? I believe AMD CDNA GPUs already have a large number of infinity fabric links. I haven’t read up too much on that though. Would a device with 64 Zen4c cores on one side of an IO die and a gpu on the other be interesting? That would only be around 500 GB/s of DDR5 bandwidth directly accessible from a GPU and up to 12 TB DDR5 capacity. Who would want something like that? I guess it would be a lot like Nvidia’s Grace-Hopper “superchip” but the Zen4c cpus would at least perform a lot better and it would have access to more memory.

Sarcasm aside, I think it would be more surprising if AMD doesn't have an EFB-based device in 2023. It might be later in 2023, but they will be a year or two behind if they wait for Zen 5. The Grace-Hopper device is probably not an option for many use cases since the cpu would be too weak to handle input processing fast enough to keep the gpu utilized. That might annoy some people who are locked in by CUDA code. Most will have to go with a lower density pci-express connected solution with much lower bandwidth between the cpu and gpu. They can get around that to some extent by using many smaller GPUs with separate links.
Yeah, rising power draw is no joke. Before Zen 2, server CPUs used 150-180W and we're going into a 300+W world. In the accelerator world, server PCIe slots tapped out at 300W, so OAM3 supports up to 500W with CDNA2 and SXM is now up to 700W with Hopper; CPU sockets are in the same systems, so they can use more. And, to be slightly on topic, stacked cache has shown increased power draw anyway.

So in a situation where CCDs with a large local stacked cache actually reduce the required interconnect bandwidth, the chances of seeing bridge chips are reduced.
No one's going to pass on what are probably going to be the most efficient and most performant CPUs on the market for at least one more generation just because the IO die is using an extra 50W, which is just a trade-off for reduced costs from using less silicon on bridge chips and less advanced packaging.
The IO die does actually hurt the low-end servers that Xeon Bronze competes in, but there are rumours of a smaller Zen 4 Epyc socket to address that.

Zen 5 is rumoured for 2023 and it's a new design family, which opens up the book for new designs, so maybe it's active bridges/interposers then.

What is using silicon bridges are devices that require large aggregate bandwidth, namely GPUs and APUs: Nvidia's XX100 parts all use HBM, CDNA2 uses fan-out for HBM and interconnect, Sapphire Rapids uses EMIB for shared L3 and shared HBM bandwidth (Apple's APU is similar), and Ponte Vecchio uses everything you can imagine.
Grace CPUs look more traditional. The images we have been shown look like discrete packages, and all the material says lots of NVLink, NVLink, NVLink, which has no problem running over copper currently.

There was a good presentation at last Hot Chips on 3D-stacking chips and the design considerations around it that make reusable silicon, especially something intrinsic like a cache bridge chip, really unlikely. Even with chiplets we haven't seen anything reused other than the CCD across desktop and server products that hardly differ in intended use; we didn't even see a tiny GPU chiplet that could be used in a low-end discrete GPU or attached to an IO die, or even a different IO+GPU die, nothing fun at all.
There was a leak in ~2016 of a huge AMD 16- or 32-core APU with HBM and a stacked-everything design that hasn't seen the light of day yet; Ponte Vecchio looks most similar to that minus the CPU chiplets, and the M1 Max is doing a big APU but not in a particularly interesting design.
 

jamescox

Senior member
Nov 11, 2009
644
1,105
136
Yeah, rising power draw is no joke. Before Zen 2, server CPUs used 150-180W and we're going into a 300+W world. In the accelerator world, server PCIe slots tapped out at 300W, so OAM3 supports up to 500W with CDNA2 and SXM is now up to 700W with Hopper; CPU sockets are in the same systems, so they can use more. And, to be slightly on topic, stacked cache has shown increased power draw anyway.

So in a situation where CCDs with a large local stacked cache actually reduce the required interconnect bandwidth, the chances of seeing bridge chips are reduced.
No one's going to pass on what are probably going to be the most efficient and most performant CPUs on the market for at least one more generation just because the IO die is using an extra 50W, which is just a trade-off for reduced costs from using less silicon on bridge chips and less advanced packaging.
The IO die does actually hurt the low-end servers that Xeon Bronze competes in, but there are rumours of a smaller Zen 4 Epyc socket to address that.

Zen 5 is rumoured for 2023 and it's a new design family, which opens up the book for new designs, so maybe it's active bridges/interposers then.

What is using silicon bridges are devices that require large aggregate bandwidth, namely GPUs and APUs: Nvidia's XX100 parts all use HBM, CDNA2 uses fan-out for HBM and interconnect, Sapphire Rapids uses EMIB for shared L3 and shared HBM bandwidth (Apple's APU is similar), and Ponte Vecchio uses everything you can imagine.
Grace CPUs look more traditional. The images we have been shown look like discrete packages, and all the material says lots of NVLink, NVLink, NVLink, which has no problem running over copper currently.

There was a good presentation at last Hot Chips on 3D-stacking chips and the design considerations around it that make reusable silicon, especially something intrinsic like a cache bridge chip, really unlikely. Even with chiplets we haven't seen anything reused other than the CCD across desktop and server products that hardly differ in intended use; we didn't even see a tiny GPU chiplet that could be used in a low-end discrete GPU or attached to an IO die, or even a different IO+GPU die, nothing fun at all.
There was a leak in ~2016 of a huge AMD 16- or 32-core APU with HBM and a stacked-everything design that hasn't seen the light of day yet; Ponte Vecchio looks most similar to that minus the CPU chiplets, and the M1 Max is doing a big APU but not in a particularly interesting design.
I think that the Apple M1 Ultra (with a silicon bridge connecting 2 M1 Max chips) is an interesting design. I don’t think we know specifically if it is the same EFB that AMD would use, but it seems to be a silicon bridge of some kind, so likely the same tech from TSMC will be used for infinity cache GPUs. We don’t know how the Grace cpu is connected together either. The images are almost certainly just a rendering. If the CPUs are directly adjacent, it would make a lot more sense to use a silicon bridge for the kind of bandwidth they are talking about. The gpu would then be connected by NVlink. It isn’t going to be available for a long time, which makes me wonder if Nvidia just wanted to talk about it first such that they look like they are leading the technology. If AMD announces a similar device later, it looks a little more like they are the follower, even if their device ends up being available first.

For Bergamo, with 128 cores, a lot of bandwidth will be required. Zen 4 may have significantly increased floating-point compute over Zen 3, and Bergamo may not have any of that cut out. AMD has dealt with bandwidth requirements in their GPUs by using Infinity Cache. I believe AMD has talked about using Infinity Cache across multiple products. It would be a good way to reduce interconnect power, increase bandwidth, and add a lot of cache back. Bergamo does not seem to be using V-Cache, but it might not be necessary if it has Infinity Cache.

Charlie at SemiAccurate has called Bergamo a monster. It doesn't sound like one if it is just two 8-core CCXs with 16 MB of L3 per die connected by SerDes. It is a lot of cores, and they should be more power efficient, meaning they may actually be able to sustain good clock speeds, so perhaps the performance could be massive without extra cache. Extra cache can help a lot though, as Milan-X has demonstrated.

The memory interface for current Epyc is very wide, at 512-bit. That is, in fact, wider than most current GPUs. That is GDDR6 vs DDR4, but with DDR5 it goes up to 768-bit (64 x 12, or perhaps more accurately 32 x 24) for SP5, which is near 500 GB/s per Epyc socket. That will be close to the bandwidth of an Nvidia P100 GPU; current Nvidia A10 GPUs are only 600 GB/s. Saying that Bergamo isn't a "high aggregate bandwidth" device doesn't seem correct. We also may get a significant DDR5 speed increase before Bergamo comes out next year. I assume the 460 GB/s number I have seen is actually with rather low-clocked DDR5; I will need to look that up.
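For reference, the arithmetic behind those numbers (the DDR5 speed grades are assumptions used only to show the range; the post itself only cites the ~460 GB/s figure):

Code:
# Peak theoretical DDR5 bandwidth for a 12-channel (768-bit) SP5 socket.
channels = 12
bits_per_channel = 64                         # 2 x 32-bit DDR5 sub-channels
bus_bytes = channels * bits_per_channel / 8   # 96 bytes per transfer

for mtps in (4800, 5200):                     # assumed DDR5 speed grades (MT/s)
    print(f"DDR5-{mtps}: {bus_bytes * mtps / 1000:.0f} GB/s")
# DDR5-4800: 461 GB/s, DDR5-5200: 499 GB/s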

There could be some other secret sauce to Bergamo or some Genoa derivative. I don't know how well it would compete with CPUs using integrated HBM. Infinity Cache would be one way to compete without going full (and expensive) HBM, just like AMD did with their GPUs. The Infinity Cache would just be 1 or 2 embedded dies rather than something like 4 or 8 stacks of HBM. AMD integrating HBM on a CPU, unless it is included with a GPU, seems unlikely. HBM can obviously supply huge bandwidth, but it still has DRAM latency. It is plausible that AMD will start making parts with multiple levels of V-Cache rather than using Infinity Cache chips.

I am just speculating here. We very well might not see an EFB part until Zen 5, but that is a long way off. It could be that Grace/Hopper is out in a similar time frame as Zen 5 using EFB. I could see AMD using Bergamo as a bit of a test bed for Zen 5 though. That is, the Bergamo IO die might be the same as what is used with Zen 5. Using infinity cache doesn’t fit with some of the rumors, but it makes a lot of sense to me. It also could be the case that it is just a bunch of serdes connected chips with small caches. Rather boring, but entirely possible.
 
  • Like
Reactions: Drazick and Tlh97

nicalandia

Diamond Member
Jan 10, 2019
3,331
5,282
136
Not 3D V-Cache, but it's being released at the same time, so it's relevant...

New AMD Ryzen 7 5700X benchmarks showcase stellar performance

 
  • Like
Reactions: Tlh97 and scannall
Nov 26, 2005
15,189
401
126
That'll kill the resale value of the 5800X. The L1 cache is the same as well? 512KB?

Odd that they released the 3700X and the 3800X around the same time, while for Vermeer they released the 5800X on 11/05/20 and are now releasing the 5700X on 04/04/22. Again they're proving the 5800X a bad buy. Dang, got suckered again, lol
 

gdansk

Diamond Member
Feb 8, 2011
4,209
7,070
136
They were so close to an opportune launch with X3D. Only 5 months late or so.

But the 5700X? Seems like it should have launched in 2020 with the rest. But they wanted that ASP.
 
Last edited: