Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
799
1,351
136
Aside from the details of the microarchitectural improvements, we now have a pretty good idea of what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", which I think is likely to double to 64 MB).

Hilgeman's slides also show that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5) and new memory support (likely DDR5).



What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 
Last edited:
  • Like
Reactions: richardllewis_01

lobz

Platinum Member
Feb 10, 2017
2,057
2,856
136
What is the nonsense about RDNA3? TSMC 7nm EUV is 10% more efficient than N7. Whatever.
Which nonsense? AMD officially claimed another 50% efficiency improvement for RDNA3 over RDNA2. That will of course be on a better process.
Ah, and is it 10% more efficient than N7P? Whatever.
 

soresu

Platinum Member
Dec 19, 2014
2,657
1,858
136
Which nonsense? AMD officially claimed another 50% efficiency improvement for RDNA3 over RDNA2. That will of course be on a better process.
Ah, and is it 10% more efficient than N7P? Whatever.
Wait, wut?

When did they say anything about RDNA3 other than Advanced Process?
 

soresu

Platinum Member
Dec 19, 2014
2,657
1,858
136
It seems to me that the X3D packaging that AMD is talking about could be TSMC's CoWoS with SoIC.
2.5D HBM and 3D SoC?



Also AMD registered some new patent applications for chiplet IVR.
20200066677
A data processor is implemented as an integrated circuit. The data processor includes a processor die. The processor die is connected to an integrated voltage regulator die using die-to-die bonding. The integrated voltage regulator die provides a regulated voltage to the processor die, and the processor die operates in response to the regulated voltage.
This likely lets them keep the IVR integrated while optimising its fabrication process separately, on a node that may suit voltage regulation better than one that favours logic.

Who knows, they might even use SiC (Silicon Carbide) for higher efficiency.
 

uzzi38

Platinum Member
Oct 16, 2019
2,625
5,895
146
What is the nonsense about RDNA3? TSMC 7nm EUV is 10% more efficient than N7. Whatever.
RDNA isn't N7, it's N7P. AMD confirmed it at ISSCC. That's my point: the bump in efficiency going from N7P to N7+ is pretty much margin-of-error territory. Given that, AMD's +50% perf/W can only be attained through pure optimisations and tweaks.

And the +50% perf/W was claimed by AMD as a generational uplift, one that will extend to RDNA3 and beyond. It was said verbally; I don't believe they put it on the slides.

We've become accustomed to believing in the power of Su, now it's also time to believe in the power of Wang. It was time as soon as we got that breakdown of the Series X. Damn thing pulls obscenely low power for 2080 Super to 2080 Ti performance (somewhere in between, anyway).
 

Ajay

Lifer
Jan 8, 2001
15,431
7,849
136
RDNA isn't N7, it's N7P. AMD confirmed it at ISSCC. That's my point: the bump in efficiency going from N7P to N7+ is pretty much margin-of-error territory. Given that, AMD's +50% perf/W can only be attained through pure optimisations and tweaks.

And the +50% perf/W was claimed by AMD as a generational uplift, one that will extend to RDNA3 and beyond. It was said verbally; I don't believe they put it on the slides.

We've become accustomed to believing in the power of Su, now it's also time to believe in the power of Wang. It was time as soon as we got that breakdown of the Series X. Damn thing pulls obscenely low power for 2080 Super to 2080 Ti performance (somewhere in between, anyway).
Thanks for the explanation. Sorry, I’m a bit pissy from staying home and my wife doing the same. Should be better when we get her moved into her own office :p. Can’t imagine what it is like for those with children at home as well.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,601
5,780
136
TSMC's N5 can theoretically hit 4.1 GHz at 0.85 V, compared to ~1.05 V on N7.




Base clocks of 2.8+ GHz for low-power and server chips would be very possible. A theoretical chip could hit 2.9 GHz at 700 mV.
If AMD is using the same N5 HD chiplet for desktop and server, then we could expect desktop base clocks for high-end SKUs to be pushed to 4+ GHz.

However, if N5 HPC is used for desktop, I would expect base clocks for high-end SKUs to start from 4.2+ GHz.


Zen4 will bring real pain to the competition.

Bigger chip surface area due to the move to AM5, increased density, and probably 3D stacking - I wonder what they will put on the chip. The 12nm IOD has to go; something else needs to be there. This thing is deadweight. See Renoir: even with IO it is denser than the desktop chiplets themselves.
More cores or wider cores?
 
Last edited:

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
TSMC's N5 can theoretically hit 4.1 GHz at 0.85 V, compared to ~1.05 V on N7.




Base clocks of 2.8+ GHz for low-power and server chips would be very possible. A theoretical chip could hit 2.9 GHz at 700 mV.
If AMD is using the same N5 HD chiplet for desktop and server, then we could expect desktop base clocks for high-end SKUs to be pushed to 4+ GHz.

However, if N5 HPC is used for desktop, I would expect base clocks for high-end SKUs to start from 4.2+ GHz.


Zen4 will bring real pain to the competition.

Bigger chip surface area due to the move to AM5, increased density, and probably 3D stacking - I wonder what they will put on the chip. The 12nm IOD has to go; something else needs to be there. This thing is deadweight. See Renoir: even with IO it is denser than the desktop chiplets themselves.
More cores or wider cores?
I was wondering about the bigger chip surface area for AM5 (if it materializes) - I don't know if this is feasible, but here are two points I would speculate on:

1) A larger AM5 surface area would permit wider spacing between chiplets for heat dissipation.
2) The higher density (1.84x) and higher speed (1.15x) would allow design decisions to electively make the cores less dense (perhaps only a 1.5x density increase), which, if I understand correctly, would still allow decent clocks while permitting roughly a 50% increase in transistors. IPC is nice, but IPC x clock is what real performance is, and for desktop/HEDT (and even for efficiency reasons), at some point adding transistors hurts your clocks enough that total performance drops, correct?
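To put rough numbers on that tradeoff, here's a toy back-of-the-envelope script. The 1.84x density and 1.15x speed figures are TSMC's headline N5-vs-N7 claims; the "IPC per transistor" factor and the reduced clock gain for the max-density option are pure assumptions on my part, just to illustrate the shape of the tradeoff.

Code:
# Toy numbers only -- a rough illustration of "performance ~ IPC x clock", not real design data.

def toy_perf(density_used, clock_gain, ipc_per_transistor=0.3):
    # Assumption for illustration only: extra transistors buy IPC sub-linearly.
    transistor_gain = density_used            # same die area, so transistor count scales with density
    ipc_gain = 1 + ipc_per_transistor * (transistor_gain - 1)
    return ipc_gain * clock_gain

# Option A: take the full 1.84x density, assume the denser layout costs some clock (only 1.05x)
# Option B: relax to ~1.5x density, keep the full 1.15x speed gain
print("dense, lower clock :", round(toy_perf(1.84, 1.05), 2))   # ~1.31x
print("relaxed, full clock:", round(toy_perf(1.50, 1.15), 2))   # ~1.32x

With these made-up inputs you land in roughly the same place either way, which is really the point: density and clocks trade off against each other, and the optimum depends on the workload.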
 
  • Like
Reactions: lightmanek

DisEnchantment

Golden Member
Mar 3, 2017
1,601
5,780
136
1) A larger AM5 surface area would permit wider spacing between chiplets for heat dissipation.
They can also manage the heat density to some extent by operating the chip at a sweeter spot on the voltage/frequency curve.
But that rapid thermal spike is indeed a real problem, and I am not sure spacing the chiplets apart will help, because inside a single chiplet the heat builds up faster than the cooler can remove it. Frequency throttling is the only way they can bring the temperature down.


2) The higher density (1.84x) and higher speed (1.15x) would allow design decisions to electively make the cores less dense (perhaps only a 1.5x density increase), which, if I understand correctly, would still allow decent clocks while permitting roughly a 50% increase in transistors. IPC is nice, but IPC x clock is what real performance is, and for desktop/HEDT (and even for efficiency reasons), at some point adding transistors hurts your clocks enough that total performance drops, correct?
The "issue" is that AMD tries to use same chiplet for server and Desktop, at least for Zen2.
Zen2 chiplet is N7 HD. Had it been using 7.5T HP cell libs, it would have been able to hit 4.7+ GHz consistently but it would also have meant that Rome would not be able to fit 64 cores and more importantly within a reasonable power envelope.

If AMD uses all the scaling boosters on N5, they can most likely hit 5 GHz without getting toasty.
N5 HD ~15% over N7 HD --> N5 HP ~10% over N5 HD --> N5 HP + eLVT ~15-25% over N5 HP, with a loss of density.
This does not mean they can hit 6 GHz; it means the voltage/frequency curve only starts going exponential at a much higher frequency.
 

Gideon

Golden Member
Nov 27, 2007
1,625
3,650
136
The "issue" is that AMD tries to use same chiplet for server and Desktop, at least for Zen2.

Hopefully by Zen 4 AMD has the resources to build both a low-power (N5 HD) and a high-power (N5 HPC) version of the same chiplet. There would be markets for both. On top of high-end desktop, some server workloads (HPC) and HEDT would also benefit greatly from the higher-clocked chiplet. On the other hand, 45-65W TDP desktop SKUs and low-power Threadripper versions would benefit from the HD chiplets.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
Performance CCD (Die 1) => 1-Hi CCD, built for best thermal dissipation. 1x8x8 => 64 cores, but highest GHz (arbitrary).
Dense CCD (Die 2) => 2-Hi CCD, the middle ground. 2x8x8 => 128 cores, but average GHz (arbitrary).
Extra Dense CCD (Die 2) => 4-Hi CCD, built for best core count. 4x8x8 => 256 cores, but lowest GHz (arbitrary).

Zen3 is a power-efficiency overhaul (same TDP = higher performance [small], lower TDP [big] = same performance); Zen4 will push performance back up (all TDP zones = increased performance). ¯\_(ツ)_/¯

[small] => small increase
[big] => big decrease
 
Last edited:

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
They can also manage the heat density to some extent by operating the chip at a sweeter spot on the voltage/frequency curve.
But that rapid thermal spike is indeed a real problem, and I am not sure spacing the chiplets apart will help, because inside a single chiplet the heat builds up faster than the cooler can remove it. Frequency throttling is the only way they can bring the temperature down.
Makes sense, absolutely. You can only dissipate heat so quickly. But if you make the core physically larger, use a denser process, and space things out (within the core, that is), wouldn't that help reduce side-to-side heat buildup and instead give it time to dissipate vertically?

The "issue" is that AMD tries to use same chiplet for server and Desktop, at least for Zen2.
Zen2 chiplet is N7 HD. Had it been using 7.5T HP cell libs, it would have been able to hit 4.7+ GHz consistently but it would also have meant that Rome would not be able to fit 64 cores and more importantly within a reasonable power envelope.
This is a great point we often forget: AMD chose to prioritize efficiency/density a bit. But would 7.5T have run even warmer than the 6T they're using now, preventing boost clocks from going that high due to thermal throttling? I'm not sure whether that would even be an issue - perhaps you know more.

If AMD uses all the scaling boosters on N5, they can most likely hit 5 GHz without getting toasty.
N5 HD ~15% over N7 HD --> N5 HP ~10% over N5 HD --> N5 HP + eLVT ~15-25% over N5 HP, with a loss of density.
This does not mean they can hit 6 GHz; it means the voltage/frequency curve only starts going exponential at a much higher frequency.
If those numbers are true:
N7 HD -> N5 HD = 1 x 1.15 = 1.15
N5 HD -> N5 HP = 1.15 * 1.10 = 1.265
N5 HP + eLVT = 1.265 * 1.15 = 1.455
A 45% boost would definitely permit some wiggle room on voltage, heat, die size, etc. That's a lot of flexibility - but will these scaling boosters increase cost or complexity, or decrease yields, to the extent that it wouldn't be viable?
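For what it's worth, here is the same compounding written out as a quick script, taking both ends of the quoted 15-25% eLVT range (these are DisEnchantment's estimates, nothing official):

Code:
# Compounding the quoted per-step frequency gains (estimates, not official figures).
n7_hd = 1.00
n5_hd = n7_hd * 1.15            # N5 HD ~15% over N7 HD
n5_hp = n5_hd * 1.10            # N5 HP ~10% over N5 HD
n5_hp_elvt_lo = n5_hp * 1.15    # eLVT adds another 15%...
n5_hp_elvt_hi = n5_hp * 1.25    # ...to 25%, at a density cost

print(f"N5 HD        : {n5_hd:.3f}x")                                  # 1.150x
print(f"N5 HP        : {n5_hp:.3f}x")                                  # 1.265x
print(f"N5 HP + eLVT : {n5_hp_elvt_lo:.2f}x - {n5_hp_elvt_hi:.2f}x")   # 1.45x - 1.58x

So at the top end of the range it would be closer to ~1.58x over N7 HD, with the stated density penalty.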
 

soresu

Platinum Member
Dec 19, 2014
2,657
1,858
136
Performance CCD (Die 1) => 1-Hi CCD, built for best thermal dissipation. 1x8x8 => 64 cores, but highest GHz (arbitrary).
Dense CCD (Die 2) => 2-Hi CCD, the middle ground. 2x8x8 => 128 cores, but average GHz (arbitrary).
Extra Dense CCD (Die 2) => 4-Hi CCD, built for best core count. 4x8x8 => 256 cores, but lowest GHz (arbitrary).

Zen3 is a power-efficiency overhaul (same TDP = higher performance [small], lower TDP [big] = same performance); Zen4 will push performance back up (all TDP zones = increased performance). ¯\_(ツ)_/¯

[small] => small increase
[big] => big decrease
There's no earthly way they are making a 64C CCD at all, let alone for stacking - that defeats the entire point of chiplets.

Far more likely is that they will just stack 8C CCDs in a 2- or 4-high configuration, in a similar pattern to the current setup.

They might switch to 12- or 16-core CCDs with Zen 4 or 5, but it would be a real stretch for Zen 3, and economically it becomes harder to maximise profits from a larger core-count CCD anyway - once yields mature, you would end up gimping 16C chips to fill out the market.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
There's no earthly way they are making a 64C CCD die at all, let alone for stacking - that defeats the entire point of chiplets.

Far more likely is that they will just stack 8C CCDs in a 2- or 4-high configuration, in a similar pattern to the current setup.
Number of dies per stack * cores per die * number of stacks:
Perf die, 1-Hi => 1 die per stack * 8 cores * 8 stacks => 64 cores in EPYC, but will probably be Threadripper-only.
Dense die, 2-Hi => 2 dies per stack * 8 cores * 8 stacks => 128 cores in an EPYC CPU.
Dense die, 4-Hi => 4 dies per stack * 8 cores * 8 stacks => 256 cores in an EPYC CPU.
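Spelled out as a quick script (same arithmetic as above, labels arbitrary):

Code:
# Dies per stack * cores per die * stacks per package.
CORES_PER_DIE = 8
STACKS_PER_PACKAGE = 8

for label, dies_per_stack in [("Perf (1-Hi)", 1), ("Dense (2-Hi)", 2), ("Extra Dense (4-Hi)", 4)]:
    total = dies_per_stack * CORES_PER_DIE * STACKS_PER_PACKAGE
    print(f"{label}: {total} cores")    # 64 / 128 / 256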
 

eek2121

Platinum Member
Aug 2, 2005
2,930
4,026
136
I would love to be a fly on the wall at AMD.

I strongly suspect AMD plans to converge all of their chips at some point. They will do this by focusing on power/efficiency first and foremost. I bet we will see some enthusiast-grade 45-watt parts, possibly with Zen 3. I also strongly suspect an integrated APU will wiggle its way in. Why? Even if the APU isn't used for graphics, it can be used for compute, and it could also be used to accelerate existing instruction sets. I suspect in the future we will see shared floating-point ops, AES acceleration, etc. all handled by the GPU logic. Let that sink in for a bit. I believe we are entering an age of GPU/CPU convergence.

The 4900HS is a good example of these early efforts.
 
  • Like
Reactions: lightmanek

soresu

Platinum Member
Dec 19, 2014
2,657
1,858
136
Number of dies per stack * cores per die * number of stacks:
Perf die, 1-Hi => 1 die per stack * 8 cores * 8 stacks => 64 cores in EPYC, but will probably be Threadripper-only.
Dense die, 2-Hi => 2 dies per stack * 8 cores * 8 stacks => 128 cores in an EPYC CPU.
Dense die, 4-Hi => 4 dies per stack * 8 cores * 8 stacks => 256 cores in an EPYC CPU.
Ah yeah, possible, but only at severely restricted clock speeds/voltages for the 4-Hi setup.

Unless this integrated heat-spreading interposer thing in the patent is extremely effective - or perhaps they can sandwich some sort of micro heat pipes between the CCD stack layers to shunt heat to the sides of the stack.

The ideal solution would be to use ICECool: somehow shrink down a microfluidic pump and reservoir in order to use the IHS as a microfluidic radiator. It would be chunky, but done properly you would have pretty much no hot spots on the IHS, with fairly even thermal dissipation across it.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,601
5,780
136
Well, well, well ... it looks like AMD is trying to implement a Network-on-Chip CPU.

From this set of patents it looks like AMD is implementing a full-blown NoC wherein various coherent agents (e.g. CCXs) communicate with each other over the fabric.
Besides just the CCXs on the NoC, a description of how coherency is maintained across the CCXs is presented.
From what we can see in the patents:
- CCXs are coherent at the L3 level, maintained via the Coherency Probe Network
- APICs reside outside the cores, are also coherent bus masters, and deliver interrupts no longer via wires but through IF messages
- All these blocks are connected by a crossbar, and a mechanism to route messages from one block to another is described
- The memory controller is another block on the IF fabric
- A mechanism to optimize concurrent access to the memory subsystem by many clients is described
- Lots of dedicated trace paths have been removed - that kind of routing no longer scales with the increasing number of cores - and replaced with a true network broadcast mechanism to deliver the critical signals a modern processor needs

Technically all of these blocks can be fabricated using different processes and integrated with a crossbar.



Additionally, there are lots of patents around the topics below, which we have known about before:
- 3D die stacking and TSV tech
- Differential data signalling across the blocks

Patents in USPTO
20200065275
PROBE INTERRUPT DELIVERY
Abstract
Systems, apparatuses, and methods for routing interrupts on a coherency probe network are disclosed. A computing system includes a plurality of processing nodes, a coherency probe network, and one or more control units. The coherency probe network carries coherency probe messages between coherent agents. Interrupts that are detected by a control unit are converted into messages that are compatible with coherency probe messages and then routed to a target destination via the coherency probe network. Interrupts are generated with a first encoding while coherency probe messages have a second encoding. Cache subsystems determine whether a message received via the coherency probe network is an interrupt message or a coherency probe message based on an encoding embedded in the received message. Interrupt messages are routed to interrupt controller(s) while coherency probe messages are processed in accordance with a coherence probe action field embedded in the message.

20200099993
MULTICAST IN THE PROBE CHANNEL
Abstract
Systems, apparatuses, and methods for processing multi-cast messages are disclosed. A system includes at least one or more processing units, one or more memory controllers, and a communication fabric coupled to the processing unit(s) and the memory controller(s). The communication fabric includes a plurality of crossbars which connect various agents within the system. When a multi-cast message is received by a crossbar, the crossbar extracts a message type indicator and a recipient type indicator from the message. The crossbar uses the message type indicator to determine which set of masks to lookup using the recipient type indicator. Then, the crossbar determines which one or more masks to extract from the selected set of masks based on values of the recipient type indicator. The crossbar combines the one or more masks with a multi-cast route to create a port vector for determining on which ports to forward the multi-cast message.

[0002] Generally speaking, the fabric facilitates communication by routing messages between a plurality of components on an integrated circuit (i.e., chip) or multi-chip module. Examples of messages communicated over a fabric include memory access requests, status updates, data transfers, coherency probes, coherency probe responses, system messages, and the like. The system messages can include messages indicating when different types of events occur within the system. These events include agents entering or leaving a low-power state, shutdown events, commitment of transactions to long-term storage, thermal events, bus locking events, translation lookaside buffer (TLB) shootdowns, and so on. With a wide variety of messages to process and with increasing numbers of clients on modern system on chips (SoCs) and integrated circuits (ICs), determining how to route the messages through the fabric can be challenging.

20190391764
DYNAMIC MEMORY TRAFFIC OPTIMIZATION IN MULTI-CLIENT SYSTEMS
Abstract
Systems, apparatuses, and methods for dynamically optimizing memory traffic in multi-client systems are disclosed. A system includes a plurality of client devices, a memory subsystem, and a communication fabric coupled to the client devices and the memory subsystem. The system includes a first client which generates memory access requests targeting the memory subsystem. Prior to sending a given memory access request to the fabric, the first client analyzes metadata associated with data targeted by the given memory access request. If the metadata indicates the targeted data is the same as or is able to be derived from previously retrieved data, the first client prevents the request from being sent out on the fabric on the data path to memory subsystem. This helps to reduce memory bandwidth consumption and allows the fabric and the memory subsystem to stay in a low-power state for longer periods of time

20200089550
BROADCAST COMMAND AND RESPONSE
Abstract
Systems, apparatuses, and methods for implementing a broadcast read response protocol are disclosed. A computing system includes a plurality of processing engines coupled to a memory subsystem. A first processing engine executes a read and broadcast response command, wherein the read and broadcast response command targets first data at a first address in the memory subsystem. One or more other processing engines execute a wait command to wait to receive the first data requested by the first processing engine. After receiving the first data from the memory subsystem, the plurality of processing engines process the first data as part of completing a first operation. In one implementation, the first operation is implementing a given layer of a machine learning model. In one implementation, the given layer is a convolutional layer of a neural network.

20190199617
SELF IDENTIFYING INTERCONNECT TOPOLOGY
Abstract
A system for automatically discovering fabric topology includes at least one or more processing units, one or more memory devices, a security processor, and a communication fabric with an unknown topology coupled to the processing unit(s), memory device(s), and security processor. The security processor queries each component of the fabric to retrieve various attributes associated with the component. The security processor utilizes the retrieved attributes to create a network graph of the topology of the components within the fabric. The security processor generates routing tables from the network graph and programs the routing tables into the fabric components. Then, the fabric components utilize the routing tables to determine how to route incoming packets.

20190108861
DYNAMIC CONTROL OF MULTI-REGION FABRIC
Abstract
Systems, apparatuses, and methods for implementing dynamic control of a multi-region fabric are disclosed. A system includes at least one or more processing units, one or more memory devices, and a communication fabric coupled to the processing unit(s) and memory device(s). The system partitions the fabric into multiple regions based on different traffic types and/or periodicities of the clients connected to the regions. For example, the system partitions the fabric into a stutter region for predictable, periodic clients and a non-stutter region for unpredictable, non-periodic clients. The system power-gates the entirety of the fabric in response to detecting a low activity condition. After power-gating the entirety of the fabric, the system periodically wakes up one or more stutter regions while keeping the other non-stutter regions in power-gated mode. Each stutter region monitors stutter client(s) for activity and processes any requests before going back into power-gated mode.
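To make the multicast one (20200099993) a bit more concrete, here is a rough toy model of the mask-lookup / port-vector idea as I read the abstract. This is only my own sketch - not AMD's actual logic - and the mask values and port layout are invented for illustration.

Code:
# Toy model of the routing in 20200099993: the crossbar picks a mask set from the message type,
# looks up mask(s) by recipient type, then combines them with the multicast route to form a
# port vector. Mask values and port layout are made up for illustration.

MASK_SETS = {
    # message type -> {recipient type -> port mask (one bit per crossbar port)}
    "system_message": {"ccx": 0b00001111, "mem_ctrl": 0b00110000, "all": 0b00111111},
    "coherency_probe": {"ccx": 0b00001111},
}

def port_vector(message_type, recipient_type, multicast_route):
    # Only forward on ports that are both eligible (mask) and part of the multicast route.
    mask = MASK_SETS[message_type][recipient_type]
    return mask & multicast_route

# Example: a system message (say, a TLB shootdown) aimed at the CCX agents,
# with a multicast route that covers ports 0-5.
vec = port_vector("system_message", "ccx", 0b00111111)
print(f"forward on ports: {vec:06b}")   # -> 001111, i.e. the four CCX-facing ports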
 
Last edited:

Saylick

Diamond Member
Sep 10, 2012
3,127
6,304
136
Excuse my ignorance, and perhaps you can dumb down what this patent means, but to my understanding, does this imply they are going with a mesh-type interconnect but with multiple dies? So currently Rome is 1 IOD with 8 independent CCDs but I presume a network based system would allow all CCDs to communicate directly to each other?
 

DisEnchantment

Golden Member
Mar 3, 2017
1,601
5,780
136
Excuse my ignorance, and perhaps you can dumb down what this patent means, but to my understanding, does this imply they are going with a mesh-type interconnect but with multiple dies? So currently Rome is 1 IOD with 8 independent CCDs but I presume a network based system would allow all CCDs to communicate directly to each other?
I believe so too; this bus would be more like a mesh.

If we recall, the SDF already moves data between the CCDs and the IOD, while the SCF handles the control and handshaking between the SMUs, CCXs, IO and others.
All communication from the CCDs, whether via SCF or SDF, goes to the IOD, which takes care of routing it elsewhere. So in the case of Rome, all CCDs connect to the IOD.

Now we have a bus. All coherent agents are connected to the bus, and CCDs can "address" another CCD or the memory controller directly.
This bus is actually a crossbar - an active component that routes messages from one agent on the bus to another.
These "messages" being routed are, as you can imagine, the traditional SCF and SDF data plus a new kind of message: probe messages.
These probe messages abstract and carry all those traditional signals which in the past would have had to be delivered via dedicated traces.

I believe these messages are not "messages" in the sense of what a NIC does today, like processing the EtherType of an L2 PDU or something. They are probably register-level, meaning there is hardwired logic operating on the signal levels transferred across the traces running around the probe network.

There are lots of patents in this regard describing innovations for minimizing the energy cost of data movement.
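As a crude illustration of that point - interrupts and probes sharing the probe network and being told apart only by an embedded encoding, per 20200065275 - here is a toy sketch. It is my own simplification; the real thing is hardwired logic, not software, and the encodings and fields here are invented.

Code:
# Toy sketch of the dispatch described in 20200065275: interrupts ride the coherency probe
# network as ordinary messages, distinguished from probes only by an embedded encoding.

from dataclasses import dataclass

PROBE_ENCODING = 0x0
INTERRUPT_ENCODING = 0x1

@dataclass
class FabricMessage:
    encoding: int     # tells the receiver whether this is an interrupt or a coherency probe
    payload: dict     # probe action field, or interrupt vector, etc.

def cache_subsystem_receive(msg: FabricMessage) -> str:
    if msg.encoding == INTERRUPT_ENCODING:
        # Route to the interrupt controller (the APIC now sits on the fabric as a bus master).
        return f"APIC: deliver interrupt vector {msg.payload['vector']:#x}"
    # Otherwise act on the coherence probe action field embedded in the message.
    return f"L3: perform probe action '{msg.payload['action']}'"

print(cache_subsystem_receive(FabricMessage(INTERRUPT_ENCODING, {"vector": 0x30})))
print(cache_subsystem_receive(FabricMessage(PROBE_ENCODING, {"action": "invalidate"})))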
 

uzzi38

Platinum Member
Oct 16, 2019
2,625
5,895
146
There are lots of patents in this regard describing innovations for minimizing the energy cost of data movement.
Well, you know, when your current-generation server product burns half its power budget on moving data around, and things are set to get worse in the future without significant work in that area...
 

moinmoin

Diamond Member
Jun 1, 2017
4,944
7,656
136
I'm pretty sure that 2020 is the last full year of the WSA. It's finally winding down.
AMD recently (well, actually across the last two months for different parts) published its annual report on Form 10-K for 2019.

Page 39 (45 in the PDF) agrees with scannall: "Purchase obligations" are $1,677 million this year, $592 million next year, and after that only $21 million is left, which could well be mostly unrelated to the WSA.

Btw, the above report has a lot of further information, like past payments to GloFo on page 56 (62 in the PDF): 2019 (through May 15, 2019) $0.5 billion, 2018 $1.6 billion, 2017 $1.1 billion.

To me it sounds like AMD perfectly timed the use of IODs produced at GloFo. Products based on Zen 3 and later likely won't contain any GloFo dies anymore.