Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
821
1,457
136
Except for the details of the microarchitectural improvements, we now know pretty well what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5) and new memory support (likely DDR5).



What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 

jamescox

Senior member
Nov 11, 2009
644
1,105
136
We have found the best strategy with THP is to use as new a kernel as possible during validation (and then "freezing" it, obv.), keeping THP on the madvise default setting, so it does not do crazy stuff on its own.
And then explicitly using THP + up-front allocation in our custom programs and JVM instances. When it works it is amazing, especially with medium-sized JVM heaps (think tens of GBs) that are busy with allocation and garbage collection. Even better when bound to a NUMA node with numactl.

-XX:+AlwaysPreTouch -XX:+UseTransparentHugePages -XmxNNg -XmsNNg (where NN is the desired heap size in GB)

This tells a newish JVM (version 11 or so onward) to advise the kernel to use THP for the heap, uses a fixed heap size, and immediately allocates NN GB of virtual memory and touches all of it, so the physical memory for the whole heap is allocated up front in 2 MB THP pages.

Some good memory savings in page tables, and a performance gain from fewer TLB misses and the reduced cache footprint of the page tables backing that memory, to be had for free.
We just disabled the defrag, so if a THP is available it will use it; otherwise it immediately fills the allocation with 4K pages. We ran into the issue that the systems would occasionally be pushed into swapping before the load was adjusted, and swapping will fragment 2 MB pages into 4K pages.
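A minimal C sketch of what the "explicit THP + up-front allocation" pattern described above looks like at the mmap/madvise level (the 1 GiB size and the thin error handling are illustrative only; the JVM flags quoted earlier do the equivalent for the Java heap):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t sz = 1UL << 30;                    /* 1 GiB region, example size only */

    void *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* With THP in "madvise" mode, only regions that ask for huge pages get them. */
    if (madvise(p, sz, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");     /* non-fatal: region falls back to 4K pages */

    /* Pre-touch everything so physical memory (ideally 2 MB THPs) is
     * allocated up front, the same idea as -XX:+AlwaysPreTouch. */
    memset(p, 0, sz);

    puts("allocated and pre-touched; check AnonHugePages in /proc/self/smaps");
    munmap(p, sz);
    return 0;
}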
 

jamescox

Senior member
Nov 11, 2009
644
1,105
136
The crucial thing here is "page migration" from one memory to the other.

Preferably by a special purpose hardwired co-processor on the IO-die.

It does not make much sense to include HBM2e memory without this. The co-processor monitors the access bandwidth to individual pages and determines whether those pages should be moved from main memory on DDR5 DIMMs to main memory in HBM2e, or back.

The OS would set a number of general control parameters but not be involved in any real time migration actions, that would be a nightmare.


For those who talk about an HBM2e cache: realize that you need about 12% extra memory for the cache tags. So a 32 GByte HBM2e cache needs ~4 GByte of tags.
How large is the cache line or page size (not sure what terminology they are using) for the HBM cache? I don’t think I have seen this anywhere. If they are building systems with tens or hundreds of GB of HBM, then I was assuming the system would use something like 4K pages.
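To put rough numbers on that question: the tag-store overhead of a memory-side cache scales with the number of lines, so the granularity is exactly what decides whether the ~12% figure applies. A small sketch; the 64 bits of tag/state/ECC per line is an assumption chosen to reproduce the ~4 GB / 12% quoted above:

#include <stdio.h>

int main(void)
{
    const double cache_bytes = 32.0 * (1UL << 30);      /* 32 GiB HBM2e cache */
    const double tag_state_bits = 64.0;                 /* assumed tag+state+ECC per line */
    const long granularity[] = { 64, 4096, 2L << 20 };  /* 64 B line, 4K page, 2 MB page */

    for (int i = 0; i < 3; i++) {
        double lines = cache_bytes / granularity[i];
        double tag_bytes = lines * tag_state_bits / 8.0;
        printf("granularity %8ld B: tag store ~%8.4f GiB (%.4f%% of data)\n",
               granularity[i], tag_bytes / (1UL << 30),
               100.0 * tag_bytes / cache_bytes);
    }
    return 0;
}

With a 4K-page granularity the tag store shrinks from gigabytes to tens of megabytes, which is why the line/page size question matters so much here.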
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
We just disabled the defrag, so if a THP is available it will use it; otherwise it immediately fills the allocation with 4K pages. We ran into the issue that the systems would occasionally be pushed into swapping before the load was adjusted, and swapping will fragment 2 MB pages into 4K pages.

Yeah, early THP was nasty; it lost my trust. To this day I would not dare to run it fully enabled on a production system :)
Without defrag enabled, there was very little software that was aware of THP (you pretty much need to use madvise when asking for VA space), so it was fairly pointless until explicit support arrived much later. And lately we have been getting nice hacks like mimalloc, which automagically transforms heap allocations into THP-capable ones for relevant programs using LD_PRELOAD techniques. It needs testing and validation, but in my experience it just works and is easier than modifying program code.

On the topic of THP and large pages: I think Linux is seeing work being done to let the file page cache use a sort of THP / large pages. It would win a percent or two of performance with some workloads, and saving memory on page tables is always a win in my book.
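For reference, the two knobs being discussed here (the "enabled" mode and the defrag behaviour) are ordinary sysfs files, so checking what a box is actually running is trivial. A tiny sketch; the kernel prints the active value in [brackets]:

#include <stdio.h>

static void show(const char *path)
{
    char buf[256];
    FILE *f = fopen(path, "r");
    if (!f) { perror(path); return; }
    if (fgets(buf, sizeof buf, f))
        printf("%-50s %s", path, buf);   /* active value shown in [brackets] */
    fclose(f);
}

int main(void)
{
    /* Standard Linux THP knobs; "madvise" under enabled is the setting recommended above. */
    show("/sys/kernel/mm/transparent_hugepage/enabled");
    show("/sys/kernel/mm/transparent_hugepage/defrag");
    return 0;
}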
 

Doug S

Diamond Member
Feb 8, 2020
3,574
6,311
136
If they could have multiple page sizes and have everything play nice, then that is obviously the best solution. That doesn’t seem to work in the current implementation. I thought that Apple went to a 64K default page size, which might be a good middle ground. In researching the THP issues (long stalls on allocation, up to 90 seconds due to defrag or compact operations), a lot of people seem to recommend just disabling THP completely. I wouldn’t say that I know enough about the workings of the system to have an informed opinion; I just note that they don’t really seem to work for our application.


Apple's page size is 16K, which is probably as large as you want to go before it begins to cost too much in terms of overhead, especially on personal devices like phones and laptops.

Now in theory you could go larger on servers, but the fact that you have tons of memory is balanced by the fact that you may have a lot more files mapped (e.g. a file server) or a lot more processes, so setting even a 64K default would be costly for certain uses.
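Since the 4K/16K/64K numbers above are the base page size of the whole system, portable code should query it rather than hard-code 4 KB. A one-line sketch; Apple Silicon machines report 16384 here, while typical x86 Linux reports 4096:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Base (non-huge) page size chosen by the OS/architecture. */
    long page = sysconf(_SC_PAGESIZE);
    printf("base page size: %ld bytes\n", page);
    return 0;
}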
 

scineram

Senior member
Nov 1, 2020
376
295
136
I specifically mentioned those two because they make the most sense for modern systems. 16KB for consumer devices, 64KB for servers and big memory systems.
 

lobz

Platinum Member
Feb 10, 2017
2,057
2,856
136
The crucial thing here is "page migration" from one memory to the other.

Preferably by a special purpose hardwired co-processor on the IO-die.

It does not make much sense to include HBM2e memory without this. The co-processor monitors the access bandwidth to individual pages and determines whether those pages should be moved from main memory on DDR5 DIMMs to main memory in HBM2e, or back.

The OS would set a number of general control parameters but not be involved in any real time migration actions, that would be a nightmare.


For those who talk about an HBM2e cache: realize that you need about 12% extra memory for the cache tags. So a 32 GByte HBM2e cache needs ~4 GByte of tags.
long time no see! great post :)
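No such migration co-processor exists today, but if the HBM2e pool were exposed as its own NUMA node, the movement the quoted post describes can already be done in software with the move_pages(2) syscall. A rough sketch under that assumption; node id 1 is purely a hypothetical id for the fast memory, and a real policy would pick candidate pages from access/bandwidth statistics (build with -lnuma):

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    char *buf = aligned_alloc(page, page);   /* one page we pretend is "hot" */
    if (!buf) return 1;
    memset(buf, 1, page);                    /* fault it in on the default node */

    void *pages[1]  = { buf };
    int   target[1] = { 1 };                 /* assumed node id of the HBM/fast pool */
    int   status[1] = { -1 };

    /* Ask the kernel to migrate this page of the calling process (pid 0 = self). */
    if (move_pages(0, 1, pages, target, status, MPOL_MF_MOVE) != 0)
        perror("move_pages");
    else
        printf("page migrated, now on node %d\n", status[0]);

    free(buf);
    return 0;
}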
 

jamescox

Senior member
Nov 11, 2009
644
1,105
136
Apple's page size is 16K, which is probably as large as you want to go before it begins to cost too much in terms of overhead, especially on personal devices like phones and laptops.

Now in theory you could go larger on servers, but the fact that you have tons of memory is balanced by the fact that you may have a lot more files mapped (e.g. a file server) or a lot more processes, so setting even a 64K default would be costly for certain uses.
It seems like the software needs a lot of work, fast. We might get HPC systems with tens to hundreds of GB of HBM cache soon. We may also get systems with CXL memory extenders providing many TB in even a relatively small system. This opens up a lot of new possibilities, if CXL latency is actually reasonable. It seems like it should be roughly the same as accessing memory on a remote NUMA socket, perhaps with a bit more overhead. I haven’t had much time to look at it yet, but I have already seen a CXL memory card announcement with DDR5 and a PCI Express 5 physical layer.
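That "roughly like a remote NUMA socket" framing is also how software would most likely see it: a CXL memory extender would show up as a (typically CPU-less) NUMA node, so today's libnuma placement calls would work unchanged. A sketch under that assumption; node id 2 is hypothetical (build with -lnuma):

#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    printf("highest node id: %d\n", numa_max_node());

    size_t sz = 64UL << 20;                  /* 64 MiB */
    void *p = numa_alloc_onnode(sz, 2);      /* hypothetical CXL-backed node */
    if (!p) {
        fprintf(stderr, "allocation on node 2 failed\n");
        return 1;
    }

    /* ... touch/use the buffer; access latency should resemble a far socket ... */

    numa_free(p, sz);
    return 0;
}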
 

jamescox

Senior member
Nov 11, 2009
644
1,105
136
Charlie at SemiAccurate apparently confirming 128-core "Bergamo":

"More than just a 128c monster — AMD has an upcoming 128 core CPU called Bergamo and what it signifies is more important than that it is. SemiAccurate thinks Bergamo is the first of a new class of CPUs with more to follow."

What is AMD's Bergamo CPU? - SemiAccurate
I can’t read the article. Makes me wonder if it is a stacked device with at least 2 layers of CPU die. The 96-core thing is still a bit odd, but didn’t they talk about a 48-core Rome before revealing that it could go up to 64? Some 48-core versions of Rome / Milan are asymmetric since they use 6 CCDs rather than 8 CCDs with cores disabled. The 96-core part might be 16 dies x 6 cores using multi-layer stacks rather than an asymmetric configuration. It would be interesting to know whether there is some sales data on how well the 48-core Rome / Milan parts are selling.
 

Doug S

Diamond Member
Feb 8, 2020
3,574
6,311
136
Some interesting tests by TSMC about in-chip water cooling.



I'm sure this is something the big cloud providers would love to see. They'll just have to figure out how to go from delivering 10-20 kW per rack to delivering 100-200 kW per rack (if they go from ~260 W to 2.6 kW per socket).
 

Vattila

Senior member
Oct 22, 2004
821
1,457
136
Some interesting tests by TSMC about in-chip water cooling.

Interesting indeed. Underfox's recent patent tweet included a couple of AMD patents on cooling and thermal management, so there is intense R&D in this field, it seems. I guess we will see more ingenious in-package thermal management and cooling solutions in the near/medium future, improving on the brute force heat sink cooling used today.
 

Kepler_L2

Senior member
Sep 6, 2020
998
4,262
136
Interesting indeed. Underfox's recent patent tweet included a couple of AMD patents on cooling and thermal management, so there is intense R&D in this field, it seems. I guess we will see more ingenious in-package thermal management and cooling solutions in the near/medium future, improving on the brute force heat sink cooling used today.
Not surprising, it's pretty much a requirement for logic-on-logic 3D stacking.
 

Ajay

Lifer
Jan 8, 2001
16,094
8,114
136
Some interesting tests by TSMC about in-chip water cooling.


While worth a shot, I think using copper** conductors would be a better choice. While it may be more complex to achieve mechanically (bonding, dealing with differing coefficients of expansion, etc.), that approach would be more effective at getting heat out of three-dimensional silicon structures. The issue with current discussions about using 'vias' to conduct heat out of silicon dice is that the conversation revolves around the current stepped inverted-cone shape. The very small final discs in the metal layers just above the transistor layer cannot conduct much heat. I think that constant-radius cylinders (or rectangular prisms) will need to be used, though the area under such structures will be a no-go for xtors. Getting those 'fins' spread out and bonded to redesigned heat spreaders is another set of very challenging problems. Just my 2 cents.

** probably some alloy rather than pure copper.
 

maddie

Diamond Member
Jul 18, 2010
5,156
5,544
136
While worth a shot, I think using copper** conductors would be a better choice. While it may be more complex to achieve mechanically (bonding, dealing with differing coefficients of expansion, etc.), that approach would be more effective at getting heat out of three-dimensional silicon structures. The issue with current discussions about using 'vias' to conduct heat out of silicon dice is that the conversation revolves around the current stepped inverted-cone shape. The very small final discs in the metal layers just above the transistor layer cannot conduct much heat. I think that constant-radius cylinders (or rectangular prisms) will need to be used, though the area under such structures will be a no-go for xtors. Getting those 'fins' spread out and bonded to redesigned heat spreaders is another set of very challenging problems. Just my 2 cents.

** probably some alloy rather than pure copper.
Won't you get capacitance effects in that copper layer so close to the electron flows? If it isn't close then it's the same as the heatsink.
 

jamescox

Senior member
Nov 11, 2009
644
1,105
136
Won't you get capacitance effects in that copper layer so close to the electron flows? If it isn't close then it's the same as the heatsink.
They already have to take parasitic capacitance into account. Modern chips are a bottom layer of transistors with many metal interconnect layers on top. All of the interconnect traces have to be assigned some parasitic capacitance and resistance in circuit simulations. Each trace is also coupled to the charge state of the traces above and below it, across around 10 to 12 layers of interconnect. This blows up into a big compute problem quickly.

I don’t know about the manufacturing process for this though. TSVs are made by etching deep into the wafer and then exposed by polishing down the wafer from the other side. This doesn’t seem workable for placing directly below devices since the devices must have a layer of silicon below.

I guess we are going to see some possibly exotic solutions, but probably not for a while. Although I may have said that about AMD in the past and been wrong. I think they might be able to do at least 2 layers just by using well-binned devices at lower clocks. If they can keep it down to 4- or maybe 6-die stacks, then they could use LSI or other stacking tech for connecting CCD stacks to the IO die. I was expecting the initial version of Genoa to use serial connections, but that is not very power efficient with another doubling of speed. Using LSI would make sense, but it does require that the chips are adjacent, so it limits the number of CCDs. The 6-stack solution is still asymmetric with respect to the 4 IO-die quadrants, but they already make such devices. Perhaps the 128-core device comes a bit later and uses 4-high stacks with some exotic cooling.
 

soresu

Diamond Member
Dec 19, 2014
4,105
3,565
136
While worth a shot, I think using copper** conductors would be a better choice. While it may be more complex to achieve mechanically (bonding, dealing with differing coefficients of expansion, etc.), that approach would be more effective at getting heat out of three-dimensional silicon structures. The issue with current discussions about using 'vias' to conduct heat out of silicon dice is that the conversation revolves around the current stepped inverted-cone shape. The very small final discs in the metal layers just above the transistor layer cannot conduct much heat. I think that constant-radius cylinders (or rectangular prisms) will need to be used, though the area under such structures will be a no-go for xtors. Getting those 'fins' spread out and bonded to redesigned heat spreaders is another set of very challenging problems. Just my 2 cents.

** probably some alloy rather than pure copper.
The basic research behind this technology is already done and dusted - theory tested and proven.

Check out DARPA's ICEcool project from the last decade:



This news from TSMC is basically them figuring out how to take that research and implement it in a mass manufactured process node, which is much more viable due to the 3D tooling they have expanded into for newer generations with AMD et al.
 

Ajay

Lifer
Jan 8, 2001
16,094
8,114
136
The basic research behind this technology is already done and dusted - theory tested and proven.

Check out DARPA's ICEcool project from the last decade:



This news from TSMC is basically them figuring out how to take that research and implement it in a mass manufactured process node, which is much more viable due to the 3D tooling they have expanded into for newer generations with AMD et al.
That's refrigerant based cooling.
 

jpiniero

Lifer
Oct 1, 2010
16,804
7,251
136
I can’t read the article. Makes me wonder if it is a stacked device with at least 2 layers of CPU die. The 96-core thing is still a bit odd

I'd say that Genoa looks like Rome, except it has 12 dies instead of 8. 96 was probably chosen because of power consumption and perhaps space. Have to read that article to see what Charlie thinks but Bergamo could be stacked dies or stacked on top of the IO die. Either way the power consumption is going to be crazy.
 

Timorous

Golden Member
Oct 27, 2008
1,978
3,864
136
I'd say that Genoa looks like Rome, except it has 12 dies instead of 8. 96 was probably chosen because of power consumption and perhaps space. Have to read that article to see what Charlie thinks but Bergamo could be stacked dies or stacked on top of the IO die. Either way the power consumption is going to be crazy.

Stacked on the IO die would mean the IO die and chiplets are on the same node, right? TSMC don't do cross-node stacking, or is it that they just don't do it yet?
 

Abwx

Lifer
Apr 2, 2011
11,885
4,873
136

Ajay

Lifer
Jan 8, 2001
16,094
8,114
136
Some news on Zen 4 for DT.





Patrick Schur
@patrickschur_

The exact TDP numbers for Raphael are 65, 95, 105, 120 and 170 W.
12:19 PM · July 14, 2021


Hmm, wonder if it's just the max socket TDP for AM5. Could be AMD is giving itself some extra headroom.
 

Abwx

Lifer
Apr 2, 2011
11,885
4,873
136
That's half old news, apart from the TDP list; according to a May 27 French article, 170W is the short-term boost.
