Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
821
1,457
136
Except for the details of the microarchitectural improvements, we now know pretty well what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5) and new memory support (likely DDR5).



What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 

jamescox

Senior member
Nov 11, 2009
644
1,105
136
We have found the best strategy with THP is to use as new a kernel as possible during validation (and then "freezing" it, obv.), keeping THP on the madvise default setting, so it does not do crazy stuff on its own.
And then explicitly using THP + up-front allocation in our custom programs and JVM instances. When it works it is amazing, especially with medium-sized JVM heaps (think tens of GBs) that are busy with allocation and garbage collection. Even better when bound to a NUMA node with numactl.

-XX:+AlwaysPreTouch -XX:+UseTransparentHugePages -XmxNNg -XmsNNg (where NN is the desired heap size in GB)

This tells a newish JVM (version 11 or so onward) to advise the kernel to use THP for the heap, uses a fixed heap size, and immediately allocates NN GB of virtual memory and touches all of it, so the physical memory for the whole heap is allocated up front in 2 MB THP pages.

Some good memory savings in page tables, and a performance gain from fewer TLB misses and the reduced cache footprint of the page tables backing that memory, to be had for free.
We just disabled the defrag, so if a THP is available it will use it; otherwise it immediately fills the allocation with 4K pages. We ran into the issue that the systems would occasionally be pushed into swapping before the load was adjusted, and swapping will fragment 2 MB pages into 4K pages.
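A minimal C sketch of what the "explicit THP + up-front allocation" pattern described above looks like at the mmap/madvise level (the 1 GiB size and the thin error handling are illustrative only; the JVM flags quoted earlier do the equivalent for the Java heap):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t sz = 1UL << 30;                    /* 1 GiB region, example size only */

    void *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* With THP in "madvise" mode, only regions that ask for huge pages get them. */
    if (madvise(p, sz, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");     /* non-fatal: region falls back to 4K pages */

    /* Pre-touch everything so physical memory (ideally 2 MB THPs) is
     * allocated up front, the same idea as -XX:+AlwaysPreTouch. */
    memset(p, 0, sz);

    puts("allocated and pre-touched; check AnonHugePages in /proc/self/smaps");
    munmap(p, sz);
    return 0;
}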
 

jamescox

Senior member
Nov 11, 2009
644
1,105
136
The crucial thing here is "page migration" from one memory to the other.

Preferably by a special purpose hardwired co-processor on the IO-die.

It does not make much sense to include HBM2e memory without this. The co-processor monitors the access bandwidth to individual pages and determines whether those pages should be moved from main memory on DDR5 DIMMs to main memory in HBM2e, or back.

The OS would set a number of general control parameters but not be involved in any real time migration actions, that would be a nightmare.


For those who talk about an HBM2e cache: realize that you need about 12% extra memory for the cache tags. So a 32 GByte HBM2e cache needs ~4 GByte of tags.
How large is the cache line or page size (not sure what terminology they are using) for the HBM cache? I don’t think I have seen this anywhere. If they are building systems with tens or hundreds of GB of HBM, then I was assuming the system would use something like 4K pages.
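To put rough numbers on that question: the tag-store overhead of a memory-side cache scales with the number of lines, so the granularity is exactly what decides whether the ~12% figure applies. A small sketch; the 64 bits of tag/state/ECC per line is an assumption chosen to reproduce the ~4 GB / 12% quoted above:

#include <stdio.h>

int main(void)
{
    const double cache_bytes = 32.0 * (1UL << 30);      /* 32 GiB HBM2e cache */
    const double tag_state_bits = 64.0;                 /* assumed tag+state+ECC per line */
    const long granularity[] = { 64, 4096, 2L << 20 };  /* 64 B line, 4K page, 2 MB page */

    for (int i = 0; i < 3; i++) {
        double lines = cache_bytes / granularity[i];
        double tag_bytes = lines * tag_state_bits / 8.0;
        printf("granularity %8ld B: tag store ~%8.4f GiB (%.4f%% of data)\n",
               granularity[i], tag_bytes / (1UL << 30),
               100.0 * tag_bytes / cache_bytes);
    }
    return 0;
}

With a 4K-page granularity the tag store shrinks from gigabytes to tens of megabytes, which is why the line/page size question matters so much here.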
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
We just disabled the defrag, so if a THP is available it will use it; otherwise it immediately fills the allocation with 4K pages. We ran into the issue that the systems would occasionally be pushed into swapping before the load was adjusted, and swapping will fragment 2 MB pages into 4K pages.

Yeah, early THP was nasty; it lost my trust. To this day I would not dare to run it fully enabled on a production system :)
Without defrag enabled, there was very little software that was aware of THP (you pretty much need to use madvise when asking for VA space), so it was fairly pointless until explicit support arrived much later. And lately we have been getting nice hacks like mimalloc, which automagically transforms heap allocations into THP-capable ones for relevant programs using LD_PRELOAD techniques. It needs testing and validation, but in my experience it just works and is easier than modifying program code.

On the topic of THP and large pages: I think Linux is seeing work being done to let the file page cache use a sort of THP / large pages. It would win a percent or two of performance with some workloads, and saving memory on page tables is always a win in my book.
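For reference, the two knobs being discussed here (the "enabled" mode and the defrag behaviour) are ordinary sysfs files, so checking what a box is actually running is trivial. A tiny sketch; the kernel prints the active value in [brackets]:

#include <stdio.h>

static void show(const char *path)
{
    char buf[256];
    FILE *f = fopen(path, "r");
    if (!f) { perror(path); return; }
    if (fgets(buf, sizeof buf, f))
        printf("%-50s %s", path, buf);   /* active value shown in [brackets] */
    fclose(f);
}

int main(void)
{
    /* Standard Linux THP knobs; "madvise" under enabled is the setting recommended above. */
    show("/sys/kernel/mm/transparent_hugepage/enabled");
    show("/sys/kernel/mm/transparent_hugepage/defrag");
    return 0;
}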
 

Doug S

Diamond Member
Feb 8, 2020
3,574
6,311
136
If they could have multiple page sizes and have everything play nice, then that is obviously the best solution. That doesn’t seem to work in the current implementation. I thought that Apple went to a 64K default page size, which might be a good middle ground. In researching the THP issues (long stalls on allocation, up to 90 seconds due to defrag or compact operations), a lot of people seem to recommend just disabling THP completely. I wouldn’t say that I know enough about the workings of the system to have an informed opinion; I just note that they don’t really seem to work for our application.


Apple's page size is 16K, which is probably as large as you want to go before it begins to cost too much in terms of overhead, especially on personal devices like phones and laptops.

Now in theory you could go larger on servers, but the fact that you have tons of memory is balanced by the fact that you may have a lot more files mapped (e.g. a file server) or a lot more processes, so setting even a 64K default would be costly for certain uses.
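Since the 4K/16K/64K numbers above are the base page size of the whole system, portable code should query it rather than hard-code 4 KB. A one-line sketch; Apple Silicon machines report 16384 here, while typical x86 Linux reports 4096:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Base (non-huge) page size chosen by the OS/architecture. */
    long page = sysconf(_SC_PAGESIZE);
    printf("base page size: %ld bytes\n", page);
    return 0;
}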
 

scineram

Senior member
Nov 1, 2020
376
295
136
I specifically mentioned those two because they make the most sense for modern systems. 16KB for consumer devices, 64KB for servers and big memory systems.
 

lobz

Platinum Member
Feb 10, 2017
2,057
2,856
136
The crucial thing here is "page migration" from one memory to the other.

Preferably by a special purpose hardwired co-processor on the IO-die.

It does not make much sense to include HBM2e memory without this. The co-processor monitors the access bandwidth to individual pages and determines whether those pages should be moved from main memory on DDR5 DIMMs to main memory in HBM2e, or back.

The OS would set a number of general control parameters but not be involved in any real time migration actions, that would be a nightmare.


For those who talk about an HBM2e cache: realize that you need about 12% extra memory for the cache tags. So a 32 GByte HBM2e cache needs ~4 GByte of tags.
long time no see! great post :)
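No such migration co-processor exists today, but if the HBM2e pool were exposed as its own NUMA node, the movement the quoted post describes can already be done in software with the move_pages(2) syscall. A rough sketch under that assumption; node id 1 is purely a hypothetical id for the fast memory, and a real policy would pick candidate pages from access/bandwidth statistics (build with -lnuma):

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    char *buf = aligned_alloc(page, page);   /* one page we pretend is "hot" */
    if (!buf) return 1;
    memset(buf, 1, page);                    /* fault it in on the default node */

    void *pages[1]  = { buf };
    int   target[1] = { 1 };                 /* assumed node id of the HBM/fast pool */
    int   status[1] = { -1 };

    /* Ask the kernel to migrate this page of the calling process (pid 0 = self). */
    if (move_pages(0, 1, pages, target, status, MPOL_MF_MOVE) != 0)
        perror("move_pages");
    else
        printf("page migrated, now on node %d\n", status[0]);

    free(buf);
    return 0;
}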
 

jamescox

Senior member
Nov 11, 2009
644
1,105
136
Apple's page size is 16K, which is probably as large as you want to go before it begins to cost too much in terms of overhead, especially on personal devices like phones and laptops.

Now in theory you could go larger on servers, but the fact that you have tons of memory is balanced by the fact that you may have a lot more files mapped (e.g. a file server) or a lot more processes, so setting even a 64K default would be costly for certain uses.
It seems like the software needs a lot of work, fast. We might get HPC systems with tens to hundreds of GB of HBM cache soon. We may also get systems with CXL memory extenders providing many TB in even a relatively small system. This opens up a lot of new possibilities, if CXL latency is actually reasonable. It seems like it should be roughly the same as accessing memory on a remote NUMA socket, perhaps with a bit more overhead. I haven’t had much time to look at it yet, but I have already seen a CXL memory card announcement with DDR5 and a PCI Express 5 physical layer.
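That "roughly like a remote NUMA socket" framing is also how software would most likely see it: a CXL memory extender would show up as a (typically CPU-less) NUMA node, so today's libnuma placement calls would work unchanged. A sketch under that assumption; node id 2 is hypothetical (build with -lnuma):

#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    printf("highest node id: %d\n", numa_max_node());

    size_t sz = 64UL << 20;                  /* 64 MiB */
    void *p = numa_alloc_onnode(sz, 2);      /* hypothetical CXL-backed node */
    if (!p) {
        fprintf(stderr, "allocation on node 2 failed\n");
        return 1;
    }

    /* ... touch/use the buffer; access latency should resemble a far socket ... */

    numa_free(p, sz);
    return 0;
}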
 

jamescox

Senior member
Nov 11, 2009
644
1,105
136
Charlie at SemiAccurate apparently confirming 128-core "Bergamo":

"More than just a 128c monster — AMD has an upcoming 128 core CPU called Bergamo and what it signifies is more important than that it is. SemiAccurate thinks Bergamo is the first of a new class of CPUs with more to follow."

What is AMD's Bergamo CPU? - SemiAccurate
I can’t read the article. Makes me wonder if it is a stacked device with at least 2 layers of CPU die. The 96-core thing is still a bit odd, but didn’t they talk about a 48-core Rome before revealing that it could go up to 64? Some 48-core versions of Rome / Milan are asymmetric since they use 6 CCDs rather than 8 CCDs with cores disabled. The 96-core part might be 16 dies x 6 cores using multi-layer stacks rather than an asymmetric configuration. It would be interesting to know whether there is some sales data on how well the 48-core Rome / Milan parts are selling.
 

Doug S

Diamond Member
Feb 8, 2020
3,574
6,311
136
Some interesting tests by TSMC about in-chip water cooling.



I'm sure this is something the big cloud providers would love to see. They'll just have to figure out how to go from delivering 10-20 kW per rack to delivering 100-200 kW per rack (if they go from ~260 W to 2.6 kW per socket).
 

Vattila

Senior member
Oct 22, 2004
821
1,457
136
Some interesting tests by TSMC about in-chip water cooling.

Interesting indeed. Underfox's recent patent tweet included a couple of AMD patents on cooling and thermal management, so there is intense R&D in this field, it seems. I guess we will see more ingenious in-package thermal management and cooling solutions in the near/medium future, improving on the brute force heat sink cooling used today.
 

Kepler_L2

Senior member
Sep 6, 2020
998
4,262
136
Interesting indeed. Underfox's recent patent tweet included a couple of AMD patents on cooling and thermal management, so there is intense R&D in this field, it seems. I guess we will see more ingenious in-package thermal management and cooling solutions in the near/medium future, improving on the brute force heat sink cooling used today.
Not surprising, it's pretty much a requirement for logic-on-logic 3D stacking.
 

Ajay

Lifer
Jan 8, 2001
16,094
8,114
136
Some interesting tests by TSMC about in-chip water cooling.


While worth a shot, I think using copper** conductors would be a better choice. While it may be more complex to achieve mechanically (bonding, dealing with differing coefficients of expansion, etc.), that approach would be more effective at getting heat out of three-dimensional silicon structures. The issue with current discussions about using 'vias' to conduct heat out of silicon dice is that the conversation revolves around the current stepped inverted-cone shape. The very small final discs in the metal layers just above the transistor layer cannot conduct much heat. I think that constant-radius cylinders (or rectangular prisms) will need to be used, though the area under such structures will be a no-go for xtors. Getting those 'fins' spread out and bonded to redesigned heat spreaders is another set of very challenging problems. Just my 2 cents.

** probably some alloy rather than pure copper.
 

maddie

Diamond Member
Jul 18, 2010
5,156
5,544
136
While worth a shot, I think using copper** conductors would be a better choice. While it may be more complex to achieve mechanically (bonding, dealing with differing coefficients of expansion, etc.), that approach would be more effective at getting heat out of three-dimensional silicon structures. The issue with current discussions about using 'vias' to conduct heat out of silicon dice is that the conversation revolves around the current stepped inverted-cone shape. The very small final discs in the metal layers just above the transistor layer cannot conduct much heat. I think that constant-radius cylinders (or rectangular prisms) will need to be used, though the area under such structures will be a no-go for xtors. Getting those 'fins' spread out and bonded to redesigned heat spreaders is another set of very challenging problems. Just my 2 cents.

** probably some alloy rather than pure copper.
Won't you get capacitance effects in that copper layer so close to the electron flows? If it isn't close then it's the same as the heatsink.
 

jamescox

Senior member
Nov 11, 2009
644
1,105
136
Won't you get capacitance effects in that copper layer so close to the electron flows? If it isn't close then it's the same as the heatsink.
They already have to take parasitic capacitance into account. Modern chips are a bottom layer of transistors with many metal interconnect layers on top. All of the interconnect traces have to be assigned some parasitic capacitance and resistance in circuit simulations. Each trace is also coupled to the charge state of the traces above and below it, across around 10 to 12 layers of interconnect. This blows up into a big compute problem quickly.

I don’t know about the manufacturing process for this though. TSVs are made by etching deep into the wafer and then exposed by polishing down the wafer from the other side. This doesn’t seem workable for placing directly below devices since the devices must have a layer of silicon below.

I guess we are going to see some possibly exotic solutions, but probably not for a while. Although I may have said that about AMD in the past and been wrong. I think they might be able to do at least 2 layers just by using well-binned devices at lower clocks. If they can keep it down to 4- or maybe 6-die stacks, then they could use LSI or other stacking tech for connecting CCD stacks to the IO die. I was expecting the initial version of Genoa to use serial connections, but that is not very power efficient with another doubling of speed. Using LSI would make sense, but it does require that the chips are adjacent, so it limits the number of CCDs. The 6-stack solution is still asymmetric with respect to the 4 IO-die quadrants, but they already make such devices. Perhaps the 128-core device comes a bit later and uses 4-high stacks with some exotic cooling.
 

soresu

Diamond Member
Dec 19, 2014
4,105
3,565
136
While worth a shot, I think using copper** conductors would be a better choice. While it may be more complex to achieve mechanically (bonding, dealing with differing coefficients of expansion, etc.), that approach would be more effective at getting heat out of three-dimensional silicon structures. The issue with current discussions about using 'vias' to conduct heat out of silicon dice is that the conversation revolves around the current stepped inverted-cone shape. The very small final discs in the metal layers just above the transistor layer cannot conduct much heat. I think that constant-radius cylinders (or rectangular prisms) will need to be used, though the area under such structures will be a no-go for xtors. Getting those 'fins' spread out and bonded to redesigned heat spreaders is another set of very challenging problems. Just my 2 cents.

** probably some alloy rather than pure copper.
The basic research behind this technology is already done and dusted - theory tested and proven.

Check out DARPA's ICEcool project from the last decade:



This news from TSMC is basically them figuring out how to take that research and implement it in a mass manufactured process node, which is much more viable due to the 3D tooling they have expanded into for newer generations with AMD et al.
 

Ajay

Lifer
Jan 8, 2001
16,094
8,114
136
The basic research behind this technology is already done and dusted - theory tested and proven.

Check out DARPA's ICEcool project from the last decade:



This news from TSMC is basically them figuring out how to take that research and implement it in a mass manufactured process node, which is much more viable due to the 3D tooling they have expanded into for newer generations with AMD et al.
That's refrigerant based cooling.
 

jpiniero

Lifer
Oct 1, 2010
16,804
7,251
136
I can’t read the article. Makes me wonder if it is a stacked device with at least 2 layers of CPU die. The 96-core thing is still a bit odd

I'd say that Genoa looks like Rome, except it has 12 dies instead of 8. 96 was probably chosen because of power consumption and perhaps space. Have to read that article to see what Charlie thinks but Bergamo could be stacked dies or stacked on top of the IO die. Either way the power consumption is going to be crazy.
 

Timorous

Golden Member
Oct 27, 2008
1,978
3,864
136
I'd say that Genoa looks like Rome, except it has 12 dies instead of 8. 96 was probably chosen because of power consumption and perhaps space. Have to read that article to see what Charlie thinks but Bergamo could be stacked dies or stacked on top of the IO die. Either way the power consumption is going to be crazy.

Stacked on the IO die would mean the IO die and chiplets are on the same node, right? TSMC don't do cross-node stacking, or is it that they just don't do it yet?
 

Abwx

Lifer
Apr 2, 2011
11,885
4,873
136

Ajay

Lifer
Jan 8, 2001
16,094
8,114
136
Some news on Zen 4 for DT.





Patrick Schur
@patrickschur_

The exact TDP numbers for Raphael are 65, 95, 105, 120 and 170 W.
12:19 PM · July 14, 2021


Hmm, wonder if it's just the max socket TDP for AM5. Could be AMD is giving itself some extra headroom.
 

Abwx

Lifer
Apr 2, 2011
11,885
4,873
136
That's half old news, apart from the TDP list; according to a May 27 French article, 170W is the short-term boost.
