Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
799
1,351
136
Details of the microarchitectural improvements aside, we now know pretty well what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5) and new memory support (likely DDR5).

[Attached image: Untitled2.png, slide from the leaked presentation]


What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 
Last edited:
  • Like
Reactions: richardllewis_01

blckgrffn

Diamond Member
May 1, 2003
9,110
3,029
136
www.teamjuchems.com
That's completely wrong. You think posting to AnandTech doesn't use SIMD instructions? Check out whatever is responsible in your OS kernel for zeroing pages when a new page is needed; it probably uses AVX2 in some circumstances, and that's the tip of the iceberg. You think floating point isn't needed? Sorry, all math in JavaScript is done in floating point; there's no way to avoid it if you are running a browser.

I doubt there's anything you can do with a modern PC or smartphone that would allow any worthwhile reduction of instruction set coverage. Not even running an "idle loop" (which is a halt instruction these days), because there are always background/housekeeping processes running, so the scheduler, I/O dispatch, filesystem, and other parts of the kernel will remain active.

I don't think you can usefully cut out any instructions from a small core other than 1) AVX512 (and that's only true on x86 because Intel didn't provide for variable SIMD width capability like SVE2) and 2) virtualization. Anything else you cut out will mean almost every thread will be forced onto big cores before long.

I can read this whole page of the thread, which takes minutes. If tiny cores allowed the PC to stay in a non-sleep state while I read the static content, the big cores could drop into some super-deep sleep state, and the little cores would keep the PC seemingly interactive and awake. The clock keeps ticking :)

Many PCs probably spend the vast majority of their time displaying static content. Not saying there aren't clicks every few seconds, but if the deep sleep states are transparent to the end user 🤷‍♂️

How much idle power does it need to save to be worth the investment? I don't know.
 

Doug S

Platinum Member
Feb 8, 2020
2,202
3,405
136
I can read this whole page of the thread, which takes minutes. If tiny cores allowed the PC to stay in a non-sleep state while I read the static content, the big cores could drop into some super-deep sleep state, and the little cores would keep the PC seemingly interactive and awake. The clock keeps ticking :)

Many PCs probably spend the vast majority of their time displaying static content. Not saying there aren't clicks every few seconds, but if the deep sleep states are transparent to the end user 🤷‍♂️

How much idle power does it need to save to be worth the investment? I don't know.

You think Javascript isn't being executed when static content is being displayed? That hasn't been true for nearly 20 years.

You'll need to support the full instruction set in your little cores. But that's fine, you'll still get benefit from them even if they aren't as small as you might wish they could be in an ideal world.
 
  • Like
Reactions: Tlh97 and scineram

eek2121

Platinum Member
Aug 2, 2005
2,904
3,906
136
That's completely wrong. You think posting to AnandTech doesn't use SIMD instructions? Check out whatever is responsible in your OS kernel for zeroing pages when a new page is needed; it probably uses AVX2 in some circumstances, and that's the tip of the iceberg. You think floating point isn't needed? Sorry, all math in JavaScript is done in floating point; there's no way to avoid it if you are running a browser.

I doubt there's anything you can do with a modern PC or smartphone that would allow any worthwhile reduction of instruction set coverage. Not even running an "idle loop" (which is a halt instruction these days), because there are always background/housekeeping processes running, so the scheduler, I/O dispatch, filesystem, and other parts of the kernel will remain active.

I don't think you can usefully cut out any instructions from a small core other than 1) AVX512 (and that's only true on x86 because Intel didn't provide for variable SIMD width capability like SVE2) and 2) virtualization. Anything else you cut out will mean almost every thread will be forced onto big cores before long.
You think Javascript isn't being executed when static content is being displayed? That hasn't been true for nearly 20 years.

You'll need to support the full instruction set in your little cores. But that's fine, you'll still get benefit from them even if they aren't as small as you might wish they could be in an ideal world.

…and if I am running with JavaScript disabled? What if I am writing code in vim? What if the machine is a simple file-sharing machine? There are plenty of opportunities to use a small core over a big one. Even something as basic as tracking a mouse pointer doesn't need to use a big core.
 
  • Like
Reactions: Mopetar

Doug S

Platinum Member
Feb 8, 2020
2,202
3,405
136
…and if I am running with JavaScript disabled? What if I am writing code in vim? What if the machine is a simple file-sharing machine? There are plenty of opportunities to use a small core over a big one. Even something as basic as tracking a mouse pointer doesn't need to use a big core.

OK, sure, if you are one of the niche cases of people who disable JavaScript or run CLI stuff in console mode, fine, I'll grant you that. The overwhelming majority of PC/smartphone users don't do stuff like that.

Tracking a mouse pointer doesn't need the performance of a big core, but it will almost certainly exercise your whole instruction set. Do you have any idea of the size of the hot code footprint tracking a mouse pointer on an otherwise idle system these days? A modern GUI is multiple layers of libraries.

A typical person who will leave Javascript enabled will exercise floating point if that mouse cursor moves in any browser window. When the pointer moves between windows, window expose events will exercise stuff like bcopy/memset that uses AVX2, and so on.
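
To put a face on it, here's a toy sketch (mine, not from any actual kernel) of the kind of AVX2 loop a page-zeroing or memset path can boil down to. The function name and the 4 KiB constant are just illustrative; real kernels use hand-tuned asm, but the point stands that even "idle" housekeeping lights up the vector unit:

Code:
#include <immintrin.h>
#include <stddef.h>

#define PAGE_SIZE 4096  /* illustrative: x86 base page size */

/* Hypothetical page-zeroing loop: 32-byte AVX2 stores, 128 bytes per
 * iteration. Assumes `page` is 32-byte aligned. Compile with -mavx2. */
static void zero_page_avx2(void *page)
{
    __m256i zero = _mm256_setzero_si256();
    char *p = (char *)page;
    for (size_t i = 0; i < PAGE_SIZE; i += 128) {
        _mm256_store_si256((__m256i *)(p + i),      zero);
        _mm256_store_si256((__m256i *)(p + i + 32), zero);
        _mm256_store_si256((__m256i *)(p + i + 64), zero);
        _mm256_store_si256((__m256i *)(p + i + 96), zero);
    }
}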
 

eek2121

Platinum Member
Aug 2, 2005
2,904
3,906
136
OK, sure, if you are one of the niche cases of people who disable JavaScript or run CLI stuff in console mode, fine, I'll grant you that. The overwhelming majority of PC/smartphone users don't do stuff like that.

Tracking a mouse pointer doesn't need the performance of a big core, but it will almost certainly exercise your whole instruction set. Do you have any idea of the size of the hot code footprint tracking a mouse pointer on an otherwise idle system these days? A modern GUI is multiple layers of libraries.

A typical person who will leave Javascript enabled will exercise floating point if that mouse cursor moves in any browser window. When the pointer moves between windows, window expose events will exercise stuff like bcopy/memset that uses AVX2, and so on.
Considering I’ve done operating system development in the past, yes, I am aware of how mouse tracking works. Modern operating systems have many threads, and not all of them use all instruction sets. By placing those threads on smaller cores, you save power. You seem to believe it is an all-or-nothing type of deal. It isn’t.
 

Saylick

Diamond Member
Sep 10, 2012
3,084
6,184
136
Pardon my ignorance, but can someone enlighten me on whether or not execution units are gated off if an instruction doesn't use them in a given cycle? Or is the entire core powered on regardless of what instructions come in? For example, if I have heavy integer code, are the FPUs powered down for the most part? What about decoders? Can a portion of them be powered down in a given clock cycle if the core were able to know that the instructions coming down did not require a "complex" decoder?

The reason why I ask is to understand if a theoretical "low featured core" can be a subset of the "full featured core", or if it has to be a separate core with duplicated blocks, execution units, etc.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,744
3,077
136
Pardon my ignorance, but can someone enlighten me on whether or not execution units are gated off if an instruction doesn't use them in a given cycle? Or is the entire core powered on regardless of what instructions come in? For example, if I have heavy integer code, are the FPUs powered down for the most part? What about decoders? Can a portion of them be powered down in a given clock cycle if the core were able to know that the instructions coming down did not require a "complex" decoder?

The reason why I ask is to understand if a theoretical "low featured core" can be a subset of the "full featured core", or if it has to be a separate core with duplicated blocks, execution units, etc.
So there is power gating and clock gating; also remember that things are pipelined, and many core functions and operations take multiple cycles/stages.

Yes, things get gated when they can be. Power gating is much harder/slower than clock gating.
 

moinmoin

Diamond Member
Jun 1, 2017
4,934
7,619
136
Pardon my ignorance, but can someone enlighten me on whether or not execution units are gated off if an instruction doesn't use them in a given cycle? Or is the entire core powered on regardless of what instructions come in? For example, if I have heavy integer code, are the FPUs powered down for the most part? What about decoders? Can a portion of them be powered down in a given clock cycle if the core were able to know that the instructions coming down did not require a "complex" decoder?

The reason why I ask is to understand if a theoretical "low featured core" can be a subset of the "full featured core", or if it has to be a separate core with duplicated blocks, execution units, etc.
A big problem is all the interdependence. Stuff like OoO, branch prediction, SMT etc. is there to ensure all resources are kept in use for more performance.

Unfortunately AMD stopped publishing details of newer chips, but there are a couple of slides for Raven Ridge:

Of particular interest:
[Slide: raven_ridge_power_regions.png]

"Power gating on Raven Ridge is split into two regions:
  • Region A - the interface between the CPU, GPU, and I/O Hub
  • Region B - the memory controller, multimedia engine, and display interface
The two regions can be independently power gated depending on the workload. For example, during a typical movie playback, Region B is mostly active while Region A is mostly power gated, only becoming briefly active when necessary."
 

Saylick

Diamond Member
Sep 10, 2012
3,084
6,184
136
So there is power gating and clock gating; also remember that things are pipelined, and many core functions and operations take multiple cycles/stages.

Yes, things get gated when they can be. Power gating is much harder/slower than clock gating.

A big problem is all the interdependence. Stuff like OoO, branch prediction, SMT etc. is there to ensure all resources are kept in use for more performance.

Unfortunately AMD stopped publishing details of newer chips, but there are a couple of slides for Raven Ridge:

Of particular interest:
[Slide: raven_ridge_power_regions.png]

"Power gating on Raven Ridge is split into two regions:
  • Region A - the interface between the CPU, GPU, and I/O Hub
  • Region B - the memory controller, multimedia engine, and display interface
The two regions can be independently power gated depending on the workload. For example, during a typical movie playback, Region B is mostly active while Region A is mostly power gated, only becoming briefly active when necessary."

Gotcha, so it sounds like, for the sake of avoiding engineering headaches, it's cleaner to just duplicate hardware resources so that each core, "low featured" and "full featured" alike, can be power gated independently. The patent says that the registers and caches can be shared, which makes sense in my mind, as the "low featured" core needs to access the same data as the "full featured" core for AMD's strategy to work anyway. But when it comes to the pipelined blocks of the core, those essentially operate as a tightly-knit machine and are harder to split up without unbalancing the remainder of the core. Is this more or less correct?
 
  • Like
Reactions: jrdls

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
The critical parts of the core like branch prediction, ALUs, and FPUs are too latency sensitive to be power gated. Remember, nothing is instant, so it takes time to go to sleep and time to wake up. If it's too slow, it'll cost more power trying to put portions of the core to sleep than it saves.

Even if it sounds great, in real-world testing it'll show all sorts of problems. Remember a few years ago when AMD had problems with that because they were gating cores independently? It took them a few years just to figure that part out.

You can power gate parts of the SoC, but within the cores? Not gonna happen.
 

MadRat

Lifer
Oct 14, 1999
11,909
229
106
I'm pretty positive the secret sauce in CPU engineering will always be described in simple terms for the public. I've never liked abstraction of how things work, but I'm not naive enough to think someone wants to give away their bread & butter just by posting the secret ingredients. No matter the information we're provided, there was always a technical-level discussion that arrived at that decision, to which we are not privy.
 
  • Like
Reactions: Tlh97 and ryan20fun

Ajay

Lifer
Jan 8, 2001
15,332
7,792
136
Maybe read my post again? The low-feature cores are the things that support instruction execution of low-priority processes such as operating system (OS) maintenance, timer support, and various monitor functions, so that the big fat full cores don't need to be fired up for such.
I did read it, but that still doesn't explain your use of the word 'toaster'. Is this some meme?
Yes, I was born in the last year of the Boomer generation...
 

soresu

Platinum Member
Dec 19, 2014
2,617
1,812
136
So the Genoa packages are gonna have to be considerably larger. I assume the not-14nm-anymore I/O die will save some space though, but who knows what exactly those packages will contain, and how :)
Not necessarily. With the move to N5/N5P, even if they double the FP/SIMD per core again from Zen 2, the CCD area will still drop enough to allow for several more dies even without the IOD area shrinking.

The real question is exactly how much the IOD area will change with the move to PCIe 5 and the other IO changes likely coming for Genoa.

If the IOD is actually an active interposer that the CCDs are mounted on, then they could easily fit twice the cores on the same package size, in my (loosely informed and somewhat ignorant) opinion.

Of course, the whole V-cache thing also adds an entirely new wrinkle to the stacking equation, which makes IOD/interposer -> CCD -> V-cache a bit more problematic, unless every die is at the same height.

Then again, even if the package height alters slightly for a 3-high stack, it won't matter if they simply have new socket cooler specs to account for it.
 
  • Like
Reactions: Tlh97

jamescox

Senior member
Nov 11, 2009
637
1,103
136
OK, sure, if you are one of the niche cases of people who disable JavaScript or run CLI stuff in console mode, fine, I'll grant you that. The overwhelming majority of PC/smartphone users don't do stuff like that.

Tracking a mouse pointer doesn't need the performance of a big core, but it will almost certainly exercise your whole instruction set. Do you have any idea of the size of the hot code footprint tracking a mouse pointer on an otherwise idle system these days? A modern GUI is multiple layers of libraries.

A typical person who will leave Javascript enabled will exercise floating point if that mouse cursor moves in any browser window. When the pointer moves between windows, window expose events will exercise stuff like bcopy/memset that uses AVX2, and so on.
It seems like you are possibly massively over-estimating the amount of floating point used in a JavaScript execution thread vs. the amount of all integer used in actually displaying the GUI. The amount of work to display the GUI is often huge compared to what is actually running in the GUI. Also, I don’t know if anyone is talking about having a core with absolutely no FP resources. If you have a separate small core with a scalar FP unit or even just a small narrow 128-bit unit, that could be used to handle any floating point instructions. Technically they could emulate any vector instructions with a scalar unit; it would just be slow. You could actually emulate floating point units with integer units, but that would be excruciatingly slow.

How often do you think some JavaScript code results in vectorized AVX instructions that would actually make use of a wide unit? I would suspect that it is mostly scalar operations. The AVX registers may get used for memory copy operations and such, but the actual floating point resources needed in the CPU core for something like web browsing are probably tiny. If AMD’s implementation allows the small core to share a register file, then most of this is irrelevant anyway; I don’t know if I would consider such a setup as even separate cores. It still remains the case that integer execution units, even those supporting a lot of instructions, are tiny die-size-wise compared to floating point units (and that is multiplied for vector floating point units), so not including them, sharing them, and/or keeping them powered down is desirable.
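
To illustrate that point, a minimal sketch (my own, nothing to do with any actual AMD design): the same packed-double add can be retired by one wide AVX instruction or serviced lane by lane, the way a narrow or scalar FP unit would have to do it. Identical architectural result, just roughly 4x the operations:

Code:
#include <immintrin.h>
#include <stddef.h>

/* One 256-bit AVX instruction: four double adds at once. */
static __m256d add_wide(__m256d a, __m256d b)
{
    return _mm256_add_pd(a, b);
}

/* The same operation emulated lane by lane, as a narrow/scalar
 * FP unit would execute it: same result, about 4x the work. */
static __m256d add_scalar(__m256d a, __m256d b)
{
    double va[4], vb[4], vr[4];
    _mm256_storeu_pd(va, a);
    _mm256_storeu_pd(vb, b);
    for (size_t i = 0; i < 4; i++)
        vr[i] = va[i] + vb[i];
    return _mm256_loadu_pd(vr);
}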

Perhaps we are going Excavator-style, with multiple small cores sharing the vector execution resources. Each cluster could have a small scalar FPU that handles everything except actual vectorized instructions, such that the big vector units stay powered down. Intel has certainly had power issues with processors supporting AVX-512, so it would make sense to handle very light loads with a small unit, even if it takes multiple clocks, instead of taking the time to wake up the big unit.
 
  • Like
Reactions: Tlh97

CHADBOGA

Platinum Member
Mar 31, 2009
2,135
832
136
I did read it, but that still doesn't explain your use of the word 'toaster'. Is this some meme?
Yes, I was born in the last year of the Boomer generation...
I'd guess it is a way of describing the sort of cores that would be suited to powering devices like toasters, when people were overhyping the Internet of Things.
 

Saylick

Diamond Member
Sep 10, 2012
3,084
6,184
136
I did read it, but that still doesn't explain your use of the word 'toaster'. Is this some meme?
Yes, I was born in the last year of the Boomer generation...
I think it's just a cutesy word used to describe the size of the cores, a toaster being a rather small and simple device, and small even as a kitchen appliance relative to other, larger kitchen devices, e.g. a refrigerator.
 
  • Like
Reactions: Tlh97

soresu

Platinum Member
Dec 19, 2014
2,617
1,812
136
How often do you think some JavaScript code results in vectorized AVX instructions that would actually make use of a wide unit? I would suspect that it is mostly scalar operations.
Until asm.js (now essentially defunct) and the just recently stabilized WASM SIMD, there was, as far as I know, basically no use of SIMD in JavaScript VMs.

Now that WASM SIMD is basically done, though, we may see a lot more use of it, but more likely in independent Electron-style apps (a la Discord) rather than directly in a browser with the full browser GUI around it.

I could see a resurgence in game console emulators and game engines on the net following this too, especially with the coming WebGPU standard finally bringing us Vulkan/DX12 style low level gfx in a browser engine.

It may also see some use in conjunction with a future iteration of the 'Web Audio' standard, which is only now lurching towards its v1 at Editor's Draft status.
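
For the curious, this is roughly what targeting WASM SIMD looks like from C via clang's wasm_simd128.h intrinsics header; a toy 4-wide float add, with the function name and the multiple-of-4 assumption being mine:

Code:
#include <wasm_simd128.h>

/* Toy example: one WASM SIMD instruction adds four floats at a time.
 * Assumes n is a multiple of 4 for brevity.
 * Build with: clang --target=wasm32 -msimd128 -O2 */
void add_f32(float *out, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i += 4) {
        v128_t va = wasm_v128_load(&a[i]);
        v128_t vb = wasm_v128_load(&b[i]);
        wasm_v128_store(&out[i], wasm_f32x4_add(va, vb));
    }
}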
 

soresu

Platinum Member
Dec 19, 2014
2,617
1,812
136
I think it's just a cutesy word used to describe the size of the cores, being that a toaster is a rather small and simple device, let alone its size as a kitchen appliance relative to other, larger kitchen devices, e.g. refrigerator.
I associate it with the Cylons of Battlestar Galactica since I no longer eat breakfast and have no use for a toaster.

But that's just me 🤣🤣.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
Not necessarily. With the move to N5/N5P, even if they double the FP/SIMD per core again from Zen 2, the CCD area will still drop enough to allow for several more dies even without the IOD area shrinking.

The real question is exactly how much the IOD area will change with the move to PCIe 5 and the other IO changes likely coming for Genoa.

If the IOD is actually an active interposer that the CCDs are mounted on, then they could easily fit twice the cores on the same package size, in my (loosely informed and somewhat ignorant) opinion.

Of course, the whole V-cache thing also adds an entirely new wrinkle to the stacking equation, which makes IOD/interposer -> CCD -> V-cache a bit more problematic, unless every die is at the same height.

Then again, even if the package height alters slightly for a 3-high stack, it won't matter if they simply have new socket cooler specs to account for it.
I would think that they would set it up to all be the same height. Attempting to deal with it in the lid doesn’t seem like a good idea. They have much more precise control of the height of the silicon so it is best to just make the lid flat. For 1, 2, or 4 high cache stacks, I would assume that they will need to polish down the base die further to match heights.

I doubt that they would need to increase the package size to fit the chips. The IO die will be on a new process. It may be partially stacked or distributed, so I would expect it to take less area even if it doesn’t use much stacking. Although, the massive number of IO pins may require a certain package size independent of what is stacked on top.

With the way the rumors have been, I am wondering if they are planning on making some 2-layer CPU stacks for 128 cores. With lower clocks, they may be able to stack multiple CPU layers, so it could be a special super-high-core-count part. Perhaps it will go up to 96 cores with a single layer, or 128 with lower-clocked or perhaps even lower-power cores. AMD has a lot more R&D money now, so I don’t know if multiple types of chiplets are out of the question. Perhaps a small-core version with reduced FP or other resources, but with 16 cores per chiplet.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
That should have made it easier for Apple to get the hardware migration working, so while you can't go by that fact, it certainly provides nothing encouraging for AMD. The main caution I'd have about extrapolating from Apple is that yes, they tried it for one year and went to software control, but it was also their first big/little implementation. Maybe they had intended to do software migration but couldn't get the software to work right at first, so we got a year of having the little cores be invisible to the software.

While it is true you can migrate threads more quickly if the hardware is doing it on its own, there is little gain in having migrations happen more quickly. The overall latency of such moves will be overwhelmingly dominated by giant time sinks like refilling the L1 and TLB. It's like a faster plane making a one-hour flight take 50 minutes, while ignoring the two hours it takes to get from home to the airport, park, get through security, wait for boarding, and taxi on the runway, plus another couple of hours at the destination.

The software also "knows" a lot of things the hardware doesn't, like the priority that may have been assigned to a thread, how often it is blocking on I/O, and so forth, which figure into a good scheduler's decisions. Anyone who paid attention to the several times over the past couple decades that the Linux kernel's scheduler was completely revamped from scratch, and saw all the issues that go into getting it right, should be extremely wary of allowing hardware to decide on its own whether something should run on a big core or little core.
The hardware knows a lot of things that the software doesn’t. That was the fallacy that led to IA-64 (software scheduling vs. OoO speculative execution). I did some experiments with Intel Performance Counter Monitor a few years ago. It could access a lot of low-level information from the processor's internal counters. With the amount of information available, I suspect that the hardware could judge quite well whether a process would benefit from being bumped up to a big core or pushed down to a little core. Something like blocking on IO isn’t really relevant at this level; blocking on IO is a very long-term event when viewed from the hardware. This type of hardware scheduling wouldn’t even be visible to the OS scheduler anyway, so it would just continue working as always. This isn’t really even the same job as the OS scheduler. Such a hardware system, if I understand it correctly, only has to decide whether to move the running process between big and little cores. If it is achieving an IPC of 0.25 or something like that, then it is mostly waiting on memory and there is no reason to use the big core.
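
As a userspace stand-in for what that on-die logic might do, here is a toy Linux sketch (entirely my own; real hardware would do this below the OS, and the 0.25 IPC cutoff is just the number from above) that samples retired instructions vs. cycles via perf counters:

Code:
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

/* Open one hardware counter for the calling thread on any CPU. */
static int open_counter(__u64 config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
    int cyc = open_counter(PERF_COUNT_HW_CPU_CYCLES);
    int ins = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
    if (cyc < 0 || ins < 0) { perror("perf_event_open"); return 1; }

    /* Stand-in for the workload being judged. */
    volatile double x = 0;
    for (long i = 0; i < 10000000; i++) x += i;

    long long cycles = 0, instrs = 0;
    read(cyc, &cycles, sizeof(cycles));
    read(ins, &instrs, sizeof(instrs));

    double ipc = cycles ? (double)instrs / (double)cycles : 0.0;
    printf("IPC %.2f -> %s-core candidate\n",
           ipc, ipc < 0.25 ? "little" : "big");
    return 0;
}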
 

Gideon

Golden Member
Nov 27, 2007
1,608
3,573
136
The hardware knows a lot of things that the software doesn’t. That was the fallacy that led to IA-64 (software scheduling vs. OoO speculative execution). I did some experiments with Intel Performance Counter Monitor a few years ago. It could access a lot of low-level information from the processor's internal counters. With the amount of information available, I suspect that the hardware could judge quite well whether a process would benefit from being bumped up to a big core or pushed down to a little core. Something like blocking on IO isn’t really relevant at this level; blocking on IO is a very long-term event when viewed from the hardware. This type of hardware scheduling wouldn’t even be visible to the OS scheduler anyway, so it would just continue working as always. This isn’t really even the same job as the OS scheduler. Such a hardware system, if I understand it correctly, only has to decide whether to move the running process between big and little cores. If it is achieving an IPC of 0.25 or something like that, then it is mostly waiting on memory and there is no reason to use the big core.

I think the A510 is an interesting example, with its Bulldozer-ish shared-FP-unit design.

It seems a nightmare to detect that one small core is hogging the FP resources and slowing things down for the other core without lots of model-specific code in the OS scheduler. At the very least the scheduler needs to know not to schedule FP loads onto adjacent small cores (in some implementations of the A510, mind you, not all!).

If AMD were to use a similar design with their big-little approach, it would be relatively trivial for the hardware to detect and fire up one of the big cores instead. It only needs to account for one architecture and already has access to loads more low-level CPU-specific data to make that decision.


[Image: CPU_37_575px.png]
 
Last edited:
  • Like
Reactions: Tlh97

eek2121

Platinum Member
Aug 2, 2005
2,904
3,906
136
I think the A510 is an interesting example, with its Bulldozer-ish shared-FP-unit design.

It seems a nightmare to detect that one small core is hogging the FP resources and slowing things down for the other core without lots of model-specific code in the OS scheduler. At the very least the scheduler needs to know not to schedule FP loads onto adjacent small cores (in some implementations of the A510, mind you, not all!).

If AMD were to use a similar design with their big-little approach, it would be relatively trivial for the hardware to detect and fire up one of the big cores instead as it only needs to account for one architecture and already has access to loads more low-level CPU-specific data to make that decision.


[Image: CPU_37_575px.png]

Part of me wonders if the cores end up getting completely virtualized in the future, with hardware or firmware taking over the duties of the Windows/Linux scheduler.