Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
799
1,351
136
Except for the details about the improvements in the microarchitecture, we now know pretty well what to expect with Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5) and new memory support (likely DDR5).



What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 
Last edited:
  • Like
Reactions: richardllewis_01

NTMBK

Lifer
Nov 14, 2011
10,232
5,013
136
I never said any such thing.

I said HBM is a niche product and, because economies of scale favor GDDR, progress on HBM is relatively slow. HBM technology is good for controlling thermal footprints, but it did not solve the bandwidth competition for AMD. I didn't blame HBM for the Fury X's shortcomings, which in reality came from AMD's small-core architecture preference. NVidia chose a large-core preference which doesn't share AMD's problems. And I pointed out GDDR6 had 50% better bandwidth than HBM2, which is an advantage given NVidia's architectural preference. It's not a perfect explanation, but it is the general gist.

Radeon VII with HBM2 had memory bandwidth of 1024 GB/s - in 2019. The latest and greatest GeForce RTX 3090 with GDDR6X has 936 GB/s, despite launching over 18 months later.

GDDR6 does not have better bandwidth.
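For anyone who wants to sanity-check those numbers: peak memory bandwidth is just interface width times per-pin data rate. A minimal sketch in C (the per-pin rates are the published figures for each card; treat the exact values as assumptions):

#include <stdio.h>

/* Peak bandwidth in GB/s = bus width in bits * per-pin rate in Gbps / 8 */
static double peak_gbs(int bus_bits, double gbps_per_pin)
{
    return bus_bits * gbps_per_pin / 8.0;
}

int main(void)
{
    /* Radeon VII: 4 HBM2 stacks, 4096-bit bus at 2.0 Gbps per pin */
    printf("Radeon VII (HBM2):   %4.0f GB/s\n", peak_gbs(4096, 2.0));
    /* RTX 3090: 384-bit GDDR6X bus at 19.5 Gbps per pin */
    printf("RTX 3090   (GDDR6X): %4.0f GB/s\n", peak_gbs(384, 19.5));
    return 0;
}

The wide-but-slow HBM2 bus still comes out ahead, 1024 vs 936 GB/s.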
 

MadRat

Lifer
Oct 14, 1999
11,910
238
106
Radeon VII with HBM2 had memory bandwidth of 1024 GB/s - in 2019. The latest and greatest GeForce RTX 3090 with GDDR6X has 936 GB/s, despite launching over 18 months later.

GDDR6 does not have better bandwidth.
That's like taking an old SDR card with a 256-bit memory path, comparing it to a DDR card limited to a 64-bit memory path, and concluding that SDR is faster than DDR.
 
  • Haha
Reactions: lobz
Mar 11, 2004
23,073
5,554
146
That's like taking an old SDR card with a 256-bit memory path, comparing it to a DDR card limited to a 64-bit memory path, and concluding that SDR is faster than DDR.

You do realize people are laughing at you, right? So your complaint is that HBM, which was explicitly designed for a wider bus, has a wider bus than GDDR, so it's unfair to compare them, whilst you tried comparing HBM (or HBM2) to GDDR6, which did not exist at the time? Do you understand why people are laughing at you and your pointlessly silly arguments?
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,346
1,525
136
Well if you were looking for maximum density at the lowest voltage, HBM has always made sense. You are talking about the antithesis of a consumer product. What was the point you're trying to make?

It also provides the maximum throughput.

And I pointed out GDDR6 had 50% better bandwidth than HBM2,

This is not true. Right now, the highest-throughput memory interfaces on any "GPU-like" device are the 5120-bit HBM2 interfaces on the A100, which provide ~70% more aggregate throughput than any GDDR interface on any card. HBM is currently how you get the most throughput. The only reason companies go for GDDR instead in the consumer space is cost reduction.

That's like taking an old SDR card with a 256-bit memory path, comparing it to a DDR card limited to a 64-bit memory path, and concluding that SDR is faster than DDR.

I have literally no idea what point you are trying to make here. Can you expand on it? Of course HBM has wider interfaces; that's the whole point of it. The DRAM memory cells in both are identical, and also identical to all other DRAM. The only difference is the interface -- GDDR goes for a high-speed, narrow interface over traces on a PCB, while HBM saves power and increases throughput with a somewhat lower-speed but much wider interface on a silicon substrate. HBM lets you be faster, but it also costs more.
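To put rough numbers on that narrow-and-fast versus wide-and-slow tradeoff, here is a small sketch in C comparing the A100's HBM2 interface with the RTX 3090's GDDR6X (the per-pin rates are the published figures for those parts; treat them as assumptions):

#include <stdio.h>

int main(void)
{
    /* wide and slow: A100 (40 GB), 5120-bit HBM2 at ~2.43 Gbps per pin */
    double a100 = 5120 * 2.43 / 8.0;      /* ~1555 GB/s */
    /* narrow and fast: RTX 3090, 384-bit GDDR6X at 19.5 Gbps per pin */
    double rtx3090 = 384 * 19.5 / 8.0;    /* 936 GB/s */

    printf("A100 HBM2:       %.0f GB/s\n", a100);
    printf("RTX 3090 GDDR6X: %.0f GB/s\n", rtx3090);
    printf("HBM2 advantage:  +%.0f%%\n", (a100 / rtx3090 - 1.0) * 100.0);
    return 0;
}

The GDDR6X pins run roughly 8x faster, but the HBM2 interface is more than 13x wider, which is how it ends up roughly two thirds ahead in aggregate at these assumed rates.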
 

MadRat

Lifer
Oct 14, 1999
11,910
238
106
Haven't been around here much for ten years. I see the general tone of the forum has changed.

I see people keep bringing up bits and pieces that make it more apparent they are not talking about HBM2 but rather the extended HBM2 standard that was first pushed by Samsung. HBM2E > HBM2. I should have realized you were talking about the extended HBM2 standard by looking up the pathways in the architecture. As with comparisons between SDR and DDR, unless all of the parameters are similar, any comparison is apples versus oranges.
 

MadRat

Lifer
Oct 14, 1999
11,910
238
106
You do realize people are laughing at you, right? So your complaint is that HBM, which was explicitly designed for a wider bus, has a wider bus than GDDR, so it's unfair to compare them, whilst you tried comparing HBM (or HBM2) to GDDR6, which did not exist at the time? Do you understand why people are laughing at you and your pointlessly silly arguments?
Everyone needs a laugh so that's not a bad thing.

Now show me the part where I compared HBM straight up to GDDR6 in any direct comparison. Also, you may want to check your facts on when GDDR6 was demonstrated and when the JEDEC standard was finalized. Your historical accuracy is questionable.
 

MadRat

Lifer
Oct 14, 1999
11,910
238
106
The discussion never really strayed from the original topic. The whole HBM tangent spawned from the discussion of chiplets and design tradeoffs. This is a speculation thread, so it's all very much on topic.

 
Last edited:

jamescox

Senior member
Nov 11, 2009
637
1,103
136
The discussion never really strayed from the original topic. The whole HBM tangent spawned from the discussion of chiplets and design tradeoffs. This is a speculation thread, so it's all very much on topic.


Graphics-related things are definitely on topic, since there is likely to be a lot of reuse between chiplet-based GPUs and CPUs from AMD. The SRAM cache chip is likely to be used on GPUs also as "Infinity Cache". Ray-tracing performance should be spectacular with that much cache. Some graphics tech will likely end up in the CPU as well. I have wondered if they will have some unified GPU compute unit that can be reused on the CPU for vector extensions, or something weird like that.

I am still using an old Dell XPS 435T/9000 as a Linux box with an i7-920, 24 GB of memory (it was used as a workstation at my old job), and (I think) a 256 MB Nvidia 8600 GT. The latest Linux kernel doesn't support the Nvidia driver for that card and the open-source one crashes, so I had to stick with 20.04 with the Nvidia driver rather than 21.04 (Lubuntu). It would be funny if I could build a system with a graphics card with 256 MB of SRAM cache. AMD is already at 128 MB on GPUs without any stacking. It wouldn't be hard to double that with just 2 cache chips, but the stacking will probably happen with RDNA3 GPU chiplets.

The X3D images are one of the things that make me wonder if they are going to do a Ryzen-desktop-sized interposer with cores stacked on top of an IO die. That would amount to splitting the EPYC IO die up into separate interposers, with 4 or more combined for EPYC. I would expect that to be a Zen 4 Genoa thing though, not Milan-X.

I expect that HBM will only be on the very high-end HPC products, where large interposers are not too expensive, unless they do some other, cheaper stacking tech. Does it make sense for Milan-X to use CPU chiplets on an interposer though? Milan-X is presumably still Zen 3 based, so placing it on an interposer would be a rather big change. I would be surprised if we get more than 4-layer stacked L3 cache with Milan-X, but it might be a pipe cleaner for stacking tech used in Zen 4. Perhaps Warhol actually is Zen 3 designed for stacking? It may be that the CPU chiplets are not on interposer(s) with Milan-X, and the possible HBM version is a specialized IO die with HBM stacked as an L4 cache on top of the IO die / active interposer. I'm not sure if that would be reusable in Zen 4 products though. It would provide a lot of extra bandwidth for the large L3 caches (aggressive prefetch and such), but the latency of HBM is not that great.
 
  • Like
Reactions: Tlh97

DrMrLordX

Lifer
Apr 27, 2000
21,620
10,829
136
The discussion never really strayed from the original topic. The whole HBM tangent spawned from the discussion of chiplets and design tradeoffs. This is a speculation thread, so it's all very much on topic.

Not when discussing non-CPU implementations of HBM that have no bearing on Milan-X whatsoever.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,601
5,780
136
A very long post.

A more consequential patent is actually this one, which I also posted some pages back.

The US patent has been awarded; it was filed on Oct 27, 2017.

But what is interesting is that AMD applied for another similar patent. It seems likely that they found additional things to do with it and applied again; that one was filed June 25, 2020.
It is not yet awarded, still in the application state, but it will surely be awarded as it is just a continuation of the awarded patent.

Similar abstract but 20 additional claims.

You can read both, they are awesome.




Like I wrote in my earlier post, there are shared registers between the big and small cores.
In low-power mode the small cores run only a small subset of low-powered instructions; when a complex instruction is encountered, a trap occurs and the big core takes over seamlessly.
The OS is not even aware that any of this is happening. It is basically the same core to the OS.
Because of the shared registers, L2, etc., the number of small cores is exactly the same as the number of big cores, and at any time either the big or the small core is active, not both.
In my opinion this could be done fairly cheaply in terms of die real estate: they could power gate selected blocks in the current core and add special execution ports, L/S ports and other tidbits, and that's all.
I imagine in low power the first target would be to power gate the complex decoder block, leaving just the simple decode blocks; power gate all the execution ports except the most simple and power-efficient one; power gate a chunk of the register file; and so on.
More innovative than big.LITTLE imo.

I suppose this opens up new possibilities for what the CPU can do.
It could execute multiple ISAs with the same core :blush: . An illegal-opcode trap in one core switches the CPU to the other, which continues execution. And the OS is not aware.
Dreams...
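To make the trap idea concrete, here is a toy model in C of the dispatch decision: the little core executes a whitelisted subset, and anything else raises the trap that hands execution to the big core. Every name is invented for illustration; this is a sketch of the idea, not of AMD's actual logic:

#include <stdbool.h>
#include <stdio.h>

typedef enum { OP_MOV, OP_ADD, OP_CMP, OP_AVX_FMA, OP_DIV } opcode_t;

/* the small core only implements the simple, power-efficient subset */
static bool little_core_supports(opcode_t op)
{
    return op == OP_MOV || op == OP_ADD || op == OP_CMP;
}

static void execute(opcode_t op)
{
    if (little_core_supports(op)) {
        printf("little core executes op %d\n", op);
    } else {
        /* illegal-opcode trap: shared state is handed to the big core,
         * which resumes at the faulting instruction; the OS never notices */
        printf("trap -> big core takes over for op %d\n", op);
    }
}

int main(void)
{
    opcode_t program[] = { OP_MOV, OP_ADD, OP_AVX_FMA, OP_CMP };
    for (unsigned i = 0; i < sizeof(program) / sizeof(program[0]); i++)
        execute(program[i]);
    return 0;
}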
A very interesting patent has come up which adds even more clarity to the one above that we already knew about. AMD's solution for big.LITTLE as described above is very clever and looks nothing like what ARM has in place today.

As some of you may know, on Android, vendor implementations of the API below return the high-performance cores:
public static final int[] getExclusiveCores ()
Apps use this to set affinity to the big cores for maximum performance.
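At the Linux syscall level, "set affinity to the big cores" boils down to sched_setaffinity(). A minimal sketch, assuming cores 4-7 are the big cluster (a made-up layout; on real Android hardware the IDs would come from getExclusiveCores()):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);

    /* hypothetical big-core IDs; query them at runtime on real hardware */
    for (int cpu = 4; cpu <= 7; cpu++)
        CPU_SET(cpu, &set);

    /* pid 0 means the calling thread */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    puts("pinned to big cores");
    return 0;
}

The point of the AMD patents discussed in this post is that this kind of userspace/OS plumbing becomes unnecessary: the hardware itself decides which physical core runs the thread.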

On native Linux the thread migration for big.LITTLE is done in software with one of the more recent patches below.

On asymmetric cpu capacity systems load intensive tasks can end up on
cpus that don't suit their compute demand. In this scenarios 'misfit'
tasks should be migrated to cpus with higher compute capacity to ensure
better throughput. group_misfit_task indicates this scenario, but tweaks
to the load-balance code are needed to make the migrations happen.

Misfit balancing only makes sense between a source group of lower
per-cpu capacity and destination group of higher compute capacity.
Otherwise, misfit balancing is ignored. group_misfit_task has lowest
priority so any imbalance due to overload is dealt with first.

The modifications are:

1. Only pick a group containing misfit tasks as the busiest group if the
destination group has higher capacity and has spare capacity.
2. When the busiest group is a 'misfit' group, skip the usual average
load and group capacity checks.
3. Set the imbalance for 'misfit' balancing sufficiently high for a task
to be pulled ignoring average load.
4. Pick the first cpu with the rq->misfit flag raised as the source cpu.
5. If the misfit task is alone on the source cpu, go for active
balancing.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>

Also, from an ARM presentation some time back:


The kernel tracks the load and tries to rebalance scheduling to make sure the processes that are consuming CPU slices get moved to the performance cores.

A novel idea from AMD, which first appeared in the quoted post above, is to make the big and small cores transparent to the OS.
The current patent further elaborates on this by describing how the thread migration is done in hardware.

The hardware indeed has multiple blocks for the cores, as the patent figures show.


But what is interesting is that this patent elaborates on using the built-in perf counters in the CPU to actually move the thread without the OS knowing.


[0037] In the event that the input 310 cannot be serviced by the first filter stage 320, the input 310 is passed to a subsequent filter stage, such as a second filter stage 330 as depicted in FIG. 3. In one example, the second filter stage is a little or tiny processor. In this example, the little or tiny processor uses an x86 instruction set. This little or tiny processor, for example, can service interrupt service routine (ISR) tasks that require x86 instructions, can execute restore tasks such as restoration of an architecture state associated with device configuration registers, restoration of a micro-architectural state required for a device to resume its execution, or operating system execution, and can execute general purpose low instructions per cycle (IPC) tasks. In another example, the little or tiny processor can warm up a last level cache. In this example, the little or tiny processor fetches code and/or data into a shared cache between the little or tiny processor and the big processors so that when execution switches to the big processor, demand misses are avoided. On the condition that the ISR is passed to the little or tiny processor, the GPIO stage is placed into an idle, stalled, or powered down state. The little or tiny processor is a less-powerful processor than, for example, a more-powerful processor, e.g. a big core, from the highest power complex 340. In one example, the operating system or kernel is unaware of the little or tiny processor. For example, similar to that described above with respect to first filter stage 320, any subsequent filter stages and the highest power complex 340 remain in a low power or powered off state, thus reducing power consumption and improving performance per unit of power used.

[0059] The method 800 further includes, at step 820, saving an architecture state of the first processor in a first memory location. In one example, the architecture state is a combination of one or more registers and one or more flags. The first memory location, in some examples, is associated with the first processor. In another example, method 800 includes starting step 815 at a time such that it overlaps with step 810 and finishes as step 820 also finishes to avoid any delays associated with completing step 815.

[0046] The one or more metrics include, for example, a core utilization metric of the relatively less-powerful processor. In one example, the core utilization metric is a measure of how much the relatively less-powerful and/or relatively less-power consuming processor is running at a maximal speed. This measure can, for example, indicate a percentage of time over some period that the relatively less-powerful and/or relatively less-power consuming processor operates at or near the maximal speed. In another example, the core utilization metric is a percentage of time over a time interval that the core residency of the relatively less-powerful and/or less-power consuming processor is in an active state. The one or more metrics can also include, for example, a memory utilization metric. In one example, the memory utilization metric is a measure of how much the memory is used by the relatively less-powerful processor. This measure, in one example, indicates a percentage of time over some period that the memory is operating in a maximal performance state, sometimes referred to as a p-state. The one or more metrics can also include, for example, a direct memory access (DMA) progress indication. In one example, the DMA progress indication is a data rate over some period of time. In yet another example, the one or more metrics can include an interrupt arrival rate and/or a count of pending interrupts. In this example, a large number of each indicates urgency to switch from smaller or fewer intermediate processors to bigger and/or more numerous highest power complexes.

[0060] The method 800 further includes, at step 830, copying the architecture state from the first memory address to a second memory address. The second memory address, in some examples, is associated with the second processor. In some examples, the architecture state is adjusted for the second processor. Optionally, at step 840, this adjustment is performed so that the adjusted architecture state is applied to the second processor. At step 850, the method further includes restoring the architecture state on the second processor from the second memory address. In another example, the memory used for copying the architecture state as in step 830 and restoring the architecture state as in step 850 is dedicated static random access memory (SRAM). In yet another example, in lieu of use of memory in steps 830 and 850, register buses may be bridged between the first processor and the second processor so that the architecture state is moved directly between the processors. At step 860, an incoming interrupt is redirected to the second processor. Although step 860 is depicted in FIG. 8 as following step 850, any incoming interrupt that is received at any point prior to completion of step 850 is stalled, such that at step 860, the interrupt is redirected to the second processor. At step 870, the ISR address of the incoming interrupt is fetched by the second processor and the interrupt is serviced. Following completion of servicing the interrupt, at step 880, normal execution is resumed on the second processor.
The patent is quite detailed, all the way down to handling cache probes, ISR migration, etc.

So what happens is that when core utilization on the little core is very high, the register state is migrated and the big core takes over; the OS is not aware of any of this, and operation is seamless.
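As a toy model, the migration flow from paragraphs [0059]-[0060] looks roughly like the sketch below. Every name here is invented for illustration; it is a software caricature of a hardware mechanism, not AMD's design:

#include <stdio.h>

typedef struct { unsigned long regs[16]; unsigned long flags; } arch_state_t;
typedef struct { const char *name; arch_state_t state; } core_t;

/* dedicated SRAM slot used for the state hand-off (steps 820/830) */
static arch_state_t sram_slot;

static void hw_migrate(core_t *from, core_t *to)
{
    sram_slot = from->state;  /* save architecture state (step 820)     */
    to->state = sram_slot;    /* copy/restore on target (steps 830-850) */
    /* steps 860-880: redirect pending interrupts and resume execution
     * on the target core; the OS never observes the switch */
    printf("migrated from %s core to %s core\n", from->name, to->name);
}

int main(void)
{
    core_t little = { "little", {{0}, 0} };
    core_t big    = { "big",    {{0}, 0} };

    /* hypothetical trigger: perf counters report the little core pinned
     * at near-maximal utilization over some sampling interval */
    double little_utilization = 0.97;
    if (little_utilization > 0.90)
        hw_migrate(&little, &big);
    return 0;
}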


TL;DR:
With these patents AMD is solving two things, which is a very ingenious approach:
1. A big.LITTLE architecture that is virtually indistinguishable to the OS scheduler, with thread migration done in HW (in contrast to the ARM approach)
  • Using the perf monitor to migrate CPU registers, thread state and execution to the big core in HW itself, without OS knowledge
  • The issue here is that only the big or the small core is available at any time, not both; but that's not really an issue on desktop if you already have 16 cores to begin with
2. Big and small cores with different levels of ISA support, i.e. the small cores cannot support AVX, for example, while the big cores can (in contrast to the Intel ADL approach)
 
Last edited:

Bigos

Member
Jun 2, 2019
128
282
136
This sounds terrible from the OS kernel's point of view. The physical CPU suddenly getting faster or slower without OS intervention is exactly what you don't want. This already happens with hardware-controlled turbo, but this will make things even more complicated (previously you could estimate performance by comparing clock speeds; now you have no idea how the little core stacks up against the big core).

If the OS is at the very least notified when the migration happens, it might be made to work, but otherwise it will be a scheduling nightmare. It will still be terrible, as the CPU budget will vary. Consider a big CPU running, with the kernel scheduler allocating tasks to it based on its capacity. Then a core gets underutilized for some reason (many branch mispredictions? cache misses?) and the hardware decides to migrate to a little core. Suddenly the CPU is not fast enough, and the kernel scheduler needs to compensate by migrating. This all adds latency and reduces overall performance, while at the same time being a much more complicated model to support.

The first ARM big.LITTLE implementations only allowed you to use either the big or the little cores. This restriction was terrible for the Linux scheduler, so it was changed to allow firing up both big and little cores at the same time, letting the scheduler migrate workloads in a granular fashion. The approach you are describing sounds like a step backward.

No wonder AMD is hiring kernel scheduler developers...
 

maddie

Diamond Member
Jul 18, 2010
4,738
4,667
136
I'm not quite sure about that.
By definition, in this case, the scheduler is NOT aware of movements between little or big cores. It does NOT need to know.
There's only one core seen by the OS.
It seems that this eliminates the scheduling issues due to different cores. The hardware will up-core and down-core as needed by the thread; I assume this goes both ways.
 
  • Like
Reactions: Tlh97 and Thibsie

Hitman928

Diamond Member
Apr 15, 2012
5,244
7,793
136
I'm not quite sure about that.
By definition, in this case, the scheduler is NOT aware of movements between little or big cores. It does NOT need to know.
There's only one core seen by the OS.

Yes, this is my understanding as well. There won't be any thread migration issues (assuming the hardware works as intended to begin with) because the OS won't be aware of the setup. The OS will just assign threads as it does now for a Zen 3 CPU, but under the new scheme the CPU essentially has a hardware scheduler that determines whether the low-power core can handle a thread or whether it needs to be assigned to the big core. From what I understand, the 'little' cores in this configuration are really little, and are really just there to handle the lightest of tasks so as to let the big cores remain in deep C-states.

You would never have a situation where threads are being migrated from big to little, or where the little cores and big cores all run in parallel for maximum multi-threaded performance like is currently done with ARM/Intel big.LITTLE configurations. It's essentially a pure efficiency play to not use the big cores for things where their performance and power consumption are complete overkill.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,601
5,780
136
This sounds terrible from the OS kernel's point of view. The physical CPU suddenly getting faster or slower without OS intervention is exactly what you don't want. This already happens with hardware-controlled turbo, but this will make things even more complicated (previously you could estimate performance by comparing clock speeds; now you have no idea how the little core stacks up against the big core).

If the OS is at the very least notified when the migration happens, it might be made to work, but otherwise it will be a scheduling nightmare. It will still be terrible, as the CPU budget will vary. Consider a big CPU running, with the kernel scheduler allocating tasks to it based on its capacity. Then a core gets underutilized for some reason (many branch mispredictions? cache misses?) and the hardware decides to migrate to a little core. Suddenly the CPU is not fast enough, and the kernel scheduler needs to compensate by migrating. This all adds latency and reduces overall performance, while at the same time being a much more complicated model to support.

The first ARM big.LITTLE implementations only allowed you to use either the big or the little cores. This restriction was terrible for the Linux scheduler, so it was changed to allow firing up both big and little cores at the same time, letting the scheduler migrate workloads in a granular fashion. The approach you are describing sounds like a step backward.

No wonder AMD is hiring kernel scheduler developers...
I think if such a thing were to materialize, it should be possible to disable this low-power feature via the BIOS, especially for server-type SKUs.
But in any case, this is only one of many things that SW will have to deal with in the upcoming CPU generation.
There are patent applications for offloading decoded x86 uops to special logic blocks, programmable execution pipes, clock-gateable cache blocks (V-Cache can actually be clock gated) and many more.
Linux (and SW in general) will evolve to handle much more complex kinds of DVFS mechanisms.
 

maddie

Diamond Member
Jul 18, 2010
4,738
4,667
136
Yes, this is my understanding as well. There won't be any thread migration issues (assuming the hardware works as intended to begin with) because the OS won't be aware of the setup. The OS will just assign threads as it does now for a Zen 3 CPU, but under the new scheme the CPU essentially has a hardware scheduler that determines whether the low-power core can handle a thread or whether it needs to be assigned to the big core. From what I understand, the 'little' cores in this configuration are really little, and are really just there to handle the lightest of tasks so as to let the big cores remain in deep C-states.

You would never have a situation where threads are being migrated from big to little, or where the little cores and big cores all run in parallel for maximum multi-threaded performance like is currently done with ARM/Intel big.LITTLE configurations. It's essentially a pure efficiency play to not use the big cores for things where their performance and power consumption are complete overkill.
Yep, what we essentially have here is a further advance in Zen scalability. This was often mentioned by AMD for the Zen 1 design: the ability to scale the efficient power operating range to a larger degree than normal, from low-power mobile to high performance with the same core.

This expands that high-level property to a much greater degree than before.
 
  • Like
Reactions: Tlh97

MadRat

Lifer
Oct 14, 1999
11,910
238
106
Sounds like seamless core switching would improve power efficiency overall by using the little cores and keeping the big cores idle. Keeping excess big cores in use rather than smaller cores is probably less than ideal when there is low demand for working threads.

Does this help with licensing costs in any way?
 
Last edited:
  • Like
Reactions: Vattila

Thibsie

Senior member
Apr 25, 2017
747
798
136
Sounds like seamless core switching would improve power efficiency overall by using the little cores and keeping the big cores idle. Keeping excess big cores in use rather than smaller cores is probably less than ideal when there is low demand for working threads.

Does this help with licensing costs in any way?

If there are only 8 cores seen by the OS while there are really 8+8, it might.
 

Gideon

Golden Member
Nov 27, 2007
1,625
3,650
136
This sounds terrible from the OS kernel's point of view. The physical CPU suddenly getting faster or slower without OS intervention is exactly what you don't want. This already happens with hardware-controlled turbo, but this will make things even more complicated (previously you could estimate performance by comparing clock speeds; now you have no idea how the little core stacks up against the big core).

If the OS is at the very least notified when the migration happens, it might be made to work, but otherwise it will be a scheduling nightmare. It will still be terrible, as the CPU budget will vary. Consider a big CPU running, with the kernel scheduler allocating tasks to it based on its capacity. Then a core gets underutilized for some reason (many branch mispredictions? cache misses?) and the hardware decides to migrate to a little core. Suddenly the CPU is not fast enough, and the kernel scheduler needs to compensate by migrating. This all adds latency and reduces overall performance, while at the same time being a much more complicated model to support.
The kernel will need to adapt. Hardware scheduling would be at least an order of magnitude faster than doing it in software. It just makes no sense for the OS to constantly micromanage migrations between cores for every single process when such a HW capability exists.

The OS scheduler will just need to work at a higher abstraction level (as with hardware-controlled turbo). It should obviously be able to pin immovable tasks to certain cores and inform the CPU of what it thinks should run on a small core and what should not (maybe even with a priority level, where only the highest one is binding), but for most processes the HW should be free to decide, as it is much more aware of its own capabilities (vs. the OS, which needs to generalize across all CPUs) and can act a hell of a lot faster.