Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)

Vattila · Oct 6, 2019

Except for the details about the improvements in the microarchitecture, we now know pretty well what to expect with Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging looks to be the same, with the same 9-die chiplet design and the same maximum core and thread-count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides did also show that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5), with new memory support (likely DDR5).

What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts!

blckgrffn · Jun 14, 2021

Doug S said:
That's completely wrong. You think posting to Anandtech doesn't use SIMD instructions? Check out whatever is responsible in your OS kernel for zeroing pages when a new page is needed, it probably uses AVX2 in some circumstances - and that's the tip of the iceberg. You think floating point isn't needed? Sorry, all math in Javascript is done in floating point, there's no way to avoid it if you are running a browser.

I doubt there's anything you can do with a modern PC or smartphone that would allow any worthwhile reduction of instruction set coverage. Not even running an "idle loop" (which is a halt instruction these days) because there are always background/housekeeping processes running at times so the scheduler, I/O dispatch, filesystem, and other parts of the kernel will remain active.

I don't think you can usefully cut out any instructions from a small core other than 1) AVX512 (and that's only true on x86 because Intel didn't provide for variable SIMD width capability like SVE2) and 2) virtualization. Anything else you cut out will mean almost every thread will be forced onto big cores before long.

I can read this whole page of the thread which takes minutes, but if tiny cores allowed the PC to stay in a non-sleep state while I read the static content and the big cores go to some super deep sleep state... and these little cores keep the PC seemingly interactive and awake, presumably? The clock keeps ticking

Many PCs probably spend the vast majority of their times displaying static content. Not saying there aren't clicks every few seconds, but if the deep sleep states are transparent to the end user 🤷‍♂️

How much idle power does it need to save to be worth the investment? I don't know.

Doug S · Jun 14, 2021

blckgrffn said:
I can read this whole page of the thread which takes minutes, but if tiny cores allowed the PC to stay in a non-sleep state while I read the static content and the big cores go to some super deep sleep state... and these little cores keep the PC seemingly interactive and awake, presumably? The clock keeps ticking

Many PCs probably spend the vast majority of their times displaying static content. Not saying there aren't clicks every few seconds, but if the deep sleep states are transparent to the end user 🤷‍♂️

How much idle power does it need to save to be worth the investment? I don't know.

You think Javascript isn't being executed when static content is being displayed? That hasn't been true for nearly 20 years.

You'll need to support the full instruction set in your little cores. But that's fine, you'll still get benefit from them even if they aren't as small as you might wish they could be in an ideal world.

eek2121 · Jun 14, 2021

Doug S said:
That's completely wrong. You think posting to Anandtech doesn't use SIMD instructions? Check out whatever is responsible in your OS kernel for zeroing pages when a new page is needed, it probably uses AVX2 in some circumstances - and that's the tip of the iceberg. You think floating point isn't needed? Sorry, all math in Javascript is done in floating point, there's no way to avoid it if you are running a browser.

I doubt there's anything you can do with a modern PC or smartphone that would allow any worthwhile reduction of instruction set coverage. Not even running an "idle loop" (which is a halt instruction these days) because there are always background/housekeeping processes running at times so the scheduler, I/O dispatch, filesystem, and other parts of the kernel will remain active.

I don't think you can usefully cut out any instructions from a small core other than 1) AVX512 (and that's only true on x86 because Intel didn't provide for variable SIMD width capability like SVE2) and 2) virtualization. Anything else you cut out will mean almost every thread will be forced onto big cores before long.

Doug S said:
You think Javascript isn't being executed when static content is being displayed? That hasn't been true for nearly 20 years.

You'll need to support the full instruction set in your little cores. But that's fine, you'll still get benefit from them even if they aren't as small as you might wish they could be in an ideal world.

…and if I am running with javascript disabled? what if i am writing code in vim? What if the machine is a simple file sharing machine? There are plenty of opportunities to use a small core over a big one. Even something as basic as tracking a mouse pointer doesn’t need to use a big core.

Doug S · Jun 14, 2021

eek2121 said:
…and if I am running with javascript disabled? what if i am writing code in vim? What if the machine is a simple file sharing machine? There are plenty of opportunities to use a small core over a big one. Even something as basic as tracking a mouse pointer doesn’t need to use a big core.

OK sure if you are one of the niche cases of people who disable Javascript or run CLI stuff in console mode, fine I'll grant you that. The overwhelming majority of PC/smartphone users don't do stuff like that.

Tracking a mouse pointer doesn't need the performance of a big core, but it will almost certainly exercise your whole instruction set. Do you have any idea of the size of the hot code footprint tracking a mouse pointer on an otherwise idle system these days? A modern GUI is multiple layers of libraries.

A typical person who will leave Javascript enabled will exercise floating point if that mouse cursor moves in any browser window. When the pointer moves between windows, window expose events will exercise stuff like bcopy/memset that uses AVX2, and so on.

eek2121 · Jun 14, 2021

Doug S said:
OK sure if you are one of the niche cases of people who disable Javascript or run CLI stuff in console mode, fine I'll grant you that. The overwhelming majority of PC/smartphone users don't do stuff like that.

Tracking a mouse pointer doesn't need the performance of a big core, but it will almost certainly exercise your whole instruction set. Do you have any idea of the size of the hot code footprint tracking a mouse pointer on an otherwise idle system these days? A modern GUI is multiple layers of libraries.

A typical person who will leave Javascript enabled will exercise floating point if that mouse cursor moves in any browser window. When the pointer moves between windows, window expose events will exercise stuff like bcopy/memset that uses AVX2, and so on.

Considering I’ve done operating system development in the past. Yes I am aware of how mouse tracking works. Modern operating systems have many threads, and not all of them use all instruction sets. By placing those threads on smaller cores, you save power. You seem to believe it is an all or nothing type of deal. It isn’t.

Saylick · Jun 14, 2021

Pardon my ignorance, but can someone enlighten me on whether or not execution units are gated off if an instruction doesn't use them in a given cycle? Or is the entire core powered on regardless of what instructions come in? For example, if I have heavy integer code, are the FPUs powered down for the most part? What about decoders? Can a portion of them be powered down in a given clock cycle if the core were able to know that the instructions coming down did not require a "complex" decoder?

The reason why I ask is to understand if a theoretical "low featured core" can be a subset of the "full featured core", or if it has to be a separate core with duplicated blocks, execution units, etc.

itsmydamnation · Jun 14, 2021

Saylick said:
Pardon my ignorance, but can someone enlighten me on whether or not execution units are gated off if an instruction doesn't use them in a given cycle? Or is the entire core powered on regardless of what instructions come in? For example, if I have heavy integer code, are the FPUs powered down for the most part? What about decoders? Can a portion of them be powered down in a given clock cycle if the core were able to know that the instructions coming down did not require a "complex" decoder?

The reason why I ask is to understand if a theoretical "low featured core" can be a subset of the "full featured core", or if it has to be a separate core with duplicated blocks, execution units, etc.

So there is power gating and clock gating , also remember things are pipelined and many core functions and operations are multiple cycles/stages.

Yes things get gating when they can be, power gating is much harder/slower then clock gating.

moinmoin · Jun 14, 2021

Saylick said:
Pardon my ignorance, but can someone enlighten me on whether or not execution units are gated off if an instruction doesn't use them in a given cycle? Or is the entire core powered on regardless of what instructions come in? For example, if I have heavy integer code, are the FPUs powered down for the most part? What about decoders? Can a portion of them be powered down in a given clock cycle if the core were able to know that the instructions coming down did not require a "complex" decoder?

The reason why I ask is to understand if a theoretical "low featured core" can be a subset of the "full featured core", or if it has to be a separate core with duplicated blocks, execution units, etc.

A big problem is all the interdependence. Stuff like ooo, branch prediction, SMT etc. is for ensuring all resources are kept in use for more performance.

Unfortunately AMD stopped publishing details of newer chips, but there are a couple slides for Raven Ridge:

Zen - Microarchitectures - AMD - WikiChip

Zen (family 17h) is the microarchitecture developed by AMD as a successor to both Excavator and Puma. Zen is an entirely new design, built from the ground up for optimal balance of performance and power capable of covering the entire computing spectrum from fanless notebooks to high-performance...

en.wikichip.org

Of particular interest:

"Power gating on Raven Ridge is split into two regions:

Region A - the interface between the CPU, GPU, and I/O Hub
Region B - the memory controller, multimedia engine, and display interface

The two regions can be independently power gated depending on the workload. For example, during a typical movie playback, Region B is mostly active while Region A is mostly power gates only become briefly active when necessary."

Saylick · Jun 14, 2021

itsmydamnation said:
So there is power gating and clock gating , also remember things are pipelined and many core functions and operations are multiple cycles/stages.

Yes things get gating when they can be, power gating is much harder/slower then clock gating.

moinmoin said:
A big problem is all the interdependence. Stuff like ooo, branch prediction, SMT etc. is for ensuring all resources are kept in use for more performance.

Unfortunately AMD stopped publishing details of newer chips, but there are a couple slides for Raven Ridge:

Zen - Microarchitectures - AMD - WikiChip

Zen (family 17h) is the microarchitecture developed by AMD as a successor to both Excavator and Puma. Zen is an entirely new design, built from the ground up for optimal balance of performance and power capable of covering the entire computing spectrum from fanless notebooks to high-performance...

en.wikichip.org

Of particular interest:

"Power gating on Raven Ridge is split into two regions:

Region A - the interface between the CPU, GPU, and I/O Hub

Region B - the memory controller, multimedia engine, and display interface

The two regions can be independently power gated depending on the workload. For example, during a typical movie playback, Region B is mostly active while Region A is mostly power gates only become briefly active when necessary."

Gotcha, so it sounds like for the sake of avoiding engineering headaches it's cleaner to just duplicate hardware resources so that each core, the "low featured" and "full featured" cores, can be power gated independently. The patent says that the registers and caches can be shared, which makes sense in my mind as the "low featured" core needs to access the same data as the "full featured" core for AMD's strategy to work anyways, but at least when it comes to the pipelined blocks of the core, those essentially operate as a tightly-knit machine and are harder to split up without unbalancing the remainder of the core. Is this more or less correct?

IntelUser2000 · Jun 14, 2021

The critical parts of the core like branch prediction, ALUs, and FPUs are too latency sensitive to be power gated. Remember, nothing is instant so it takes time to sleep and time to wake up. If it's too slow, then it'll take more power trying to get portions of the core sleeping.

Even if it sounds great, in real world testing it'll show all sorts of problems. Remember few years ago AMD having problems with that because they were gating cores independently? It took them few years to just figure that part out.

You can power gate parts of the SoC, but within the cores? Not gonna happen.

MadRat · Jun 14, 2021

I'm pretty positive the secret sauce in CPU engineering will always be described in simple terms for the public. I've never like abstraction of how things work but I'm not naive to think someone wants to give away their bread & butter just to post the secret ingredients. No matter the information we're provided, there's always a discussion on a technical level that arrived at that decision which we are not privy.

Ajay · Jun 14, 2021

moinmoin said:
Maybe read my post again? The low-feature cores, the things that support instruction execution of low priority processes such as operating system (OS) maintenance, timer support and various monitor functions. So that the big fat full cores don't need to be fired up for such.

I did read it, but that still doesn't explain your use of the word 'toaster'. Is this some meme?
Yes, I was born in the last year of the Boomer generation...

soresu · Jun 14, 2021

lobz said:
So the Genoa packages are gonna have to be considerably larger. I assume the not-14nm-anymore I/O die will save some space though, but who knows what and how exactly will those packages contain

Not necessarily, with the move to N5/N5P even if they double the FP/SIMD per core again from Zen2 the CCD area will still drop enough to allow for several more dies even without the IOD area shrinking.

The real question is exactly how much will the IOD area change with the move to PCIe 5 to account for and likely other IO changes coming for Genoa.

If the IOD is actually an active interposer that the CCDs are mounted on then they could easily fit twice the cores on the same package size in my (loosely informed and somewhat ignorant) opinion.

Of course then again the whole V cache thing also adds an entirely new wrinkle to the stacking equation which makes IOD/interposer -> CCD -> V cache a bit more problematic, unless every die is all at the same height of course.

Then again even if the package height alters slightly for a 3 high stack it won't matter if they simply have new socket cooler specs to account for it.

jamescox · Jun 14, 2021

Doug S said:
OK sure if you are one of the niche cases of people who disable Javascript or run CLI stuff in console mode, fine I'll grant you that. The overwhelming majority of PC/smartphone users don't do stuff like that.

Tracking a mouse pointer doesn't need the performance of a big core, but it will almost certainly exercise your whole instruction set. Do you have any idea of the size of the hot code footprint tracking a mouse pointer on an otherwise idle system these days? A modern GUI is multiple layers of libraries.

A typical person who will leave Javascript enabled will exercise floating point if that mouse cursor moves in any browser window. When the pointer moves between windows, window expose events will exercise stuff like bcopy/memset that uses AVX2, and so on.

It seems like you are possibly massively over-estimating the amount of floating point used in a JavaScript execution thread vs. the amount of all integer used in actually displaying the GUI. The amount of work to display the GUI is often huge compared to what is actually running in the GUI. Also, I don’t know if anyone is talking about having a core with absolutely no FP resources. If you have a separate small core with a scalar FP unit or even just a small narrow 128-bit unit, that could be used to handle any floating point instructions. Technically they could emulate any vector instructions with a scalar unit; it would just be slow. You could actually emulate floating point units with integer units, but that would be excruciatingly slow.

How often do you think some JavaScript code results in vectorized AVX instructions that would actually make use of a wide unit? I would suspect that it is mostly scalar operations. The AVX registers may get used for memory copy operations and such, but the actual floating point resources needed in the cpu core for something like web browsing is probably tiny. If AMD’s implementation allows the small core to share a register file, then most of this is irrelevant anyway. I don’t know if I would consider such a setup as even separate cores. It still remains the case that integer execution units, even those supporting a lot of instructions, are tiny die size-wise compared to floating point units and that is multiplied for vector floating point units, not including it, sharing it, and/or keeping it powered down is desirable.

Perhaps we are going excavator style with multiple small cores sharing the vector execution resources. It could have a small scalar FPU that handles everything except actual vectorized instructions such that the big vector units stay powered down. Intel has certainly had power issues with processors supporting AVX512, so it would make sense to handle very light loads with a small unit, even if it takes multiple clocks, instead of taking the time to wake up the big unit.

CHADBOGA · Jun 14, 2021

Ajay said:
I did read it, but that still doesn't explain your use of the word 'toaster'. Is this some meme?
Yes, I was born in the last year of the Boomer generation...

I'd guess it is a way of describing the sort of cores that would be suited to powering devices like toasters, when people were overhyping the Internet of Things.

Saylick · Jun 14, 2021

Ajay said:
I did read it, but that still doesn't explain your use of the word 'toaster'. Is this some meme?
Yes, I was born in the last year of the Boomer generation...

I think it's just a cutesy word used to describe the size of the cores, being that a toaster is a rather small and simple device, let alone its size as a kitchen appliance relative to other, larger kitchen devices, e.g. refrigerator.

soresu · Jun 15, 2021

jamescox said:
How often do you think some JavaScript code results in vectorized AVX instructions that would actually make use of a wide unit? I would suspect that it is mostly scalar operations.

Until asm js (now essentially defunct) and the just recently stabilized WASM SIMD there was as far as I know basically no use of SIMD in Javascript VMs.

Now that WASM SIMD is basically done though we may see a lot more use for it, but more likely in independent Electron style apps (ala Discord) rather than directly in a browser with the full browser GUI around it.

I could see a resurgence in game console emulators and game engines on the net following this too, especially with the coming WebGPU standard finally bringing us Vulkan/DX12 style low level gfx in a browser engine.

It may also see some use in conjunction with a future iteration of the 'Web Audio' standard which is only now lurching towards it's v1 at Editor Draft status.

soresu · Jun 15, 2021

Saylick said:
I think it's just a cutesy word used to describe the size of the cores, being that a toaster is a rather small and simple device, let alone its size as a kitchen appliance relative to other, larger kitchen devices, e.g. refrigerator.

I associate it with the Cylons of Battlestar Galactica since I no longer eat breakfast and have no use for a toaster.

But that's just me 🤣🤣.

Gideon · Jun 15, 2021

soresu said:
Until asm js (now essentially defunct) and the just recently stabilized WASM SIMD there was as far as I know basically no use of SIMD in Javascript VMs.

That's true regarding client code, but I'm pretty sure the browser itself uses them under the hood (for bcopy/memset, etc).

DisEnchantment · Jun 15, 2021

If you are sitting on Linux just sending some icmp messages may also uses AVX, ebtables/iptables are now also supporting AVX

jamescox · Jun 15, 2021

Gideon said:
That's true regarding client code, but I'm pretty sure the browser itself uses them under the hood (for bcopy/memset, etc).

I don’t know if using AVX registers requires powering up the execution units. That seems like an obvious optimization.

jamescox · Jun 15, 2021

soresu said:
Not necessarily, with the move to N5/N5P even if they double the FP/SIMD per core again from Zen2 the CCD area will still drop enough to allow for several more dies even without the IOD area shrinking.

The real question is exactly how much will the IOD area change with the move to PCIe 5 to account for and likely other IO changes coming for Genoa.

If the IOD is actually an active interposer that the CCDs are mounted on then they could easily fit twice the cores on the same package size in my (loosely informed and somewhat ignorant) opinion.

Of course then again the whole V cache thing also adds an entirely new wrinkle to the stacking equation which makes IOD/interposer -> CCD -> V cache a bit more problematic, unless every die is all at the same height of course.

Then again even if the package height alters slightly for a 3 high stack it won't matter if they simply have new socket cooler specs to account for it.

I would think that they would set it up to all be the same height. Attempting to deal with it in the lid doesn’t seem like a good idea. They have much more precise control of the height of the silicon so it is best to just make the lid flat. For 1, 2, or 4 high cache stacks, I would assume that they will need to polish down the base die further to match heights.

I doubt that they would need to increase the package size for fitting the chips. The IO die will be on a new process. It may be partially stacked or distributed, so I would expect it to take less area even if it doesn’t use much stacking. Although, the massive number of IO pins may require require a certain package size independent of what is stacked on top.

With the way the rumors have been, I am wondering if they are planning on making some 2 layer cpu stacks for 128-cores. With lower clocks, they may be able to stack multiple cpu layers, so it could be a special super high core count part. Perhaps it will go up to 96 with a single layer or 128 with lower clocked cores or perhaps even lower power cores. AMD has a lot more R&D money now, so I don’t know if multiple types of chiplets is out of the question. Perhaps a small core version with reduced FP or other resources, but with 16 cores per chiplet.

jamescox · Jun 15, 2021

Doug S said:
That should have made it easier for Apple to get the hardware migration working, so while you can't go by that fact it certainly provides nothing encouraging for AMD. The main caution I'd have from extrapolating from Apple is that yes they tried it for one year and went to software control, but it was also their first big/little implementation. Maybe they had intended to do software migration but couldn't get the software to work right at first so we got a year of having the little cores be invisible to the software.

While it is true you can migrate threads more quickly if the hardware is doing it on its own, there is little gain having migrations happen more quickly. The overall latency of such moves will be overwhelmingly dominated by giant time sinks like refilling the L1 and TLB. Its like a faster plane making one hour flight time take 50 minutes, while ignoring the two hours it takes to get from home to the airport, parking, get through security, wait for boarding, taxiing on the runway and then another couple hours at the destination.

The software also "knows" a lot of things the hardware doesn't, like the priority that may have been assigned to a thread, how often it is blocking on I/O, and so forth, which figure into a good scheduler's decisions. Anyone who paid attention to the several times over the past couple decades that the Linux kernel's scheduler was completely revamped from scratch, and saw all the issues that go into getting it right, should be extremely wary of allowing hardware to decide on its own whether something should run on a big core or little core.

The hardware knows a lot of things that the software doesn’t. That was the fallacy that lead to IA-64 (software scheduling vs. OOO speculative execution). I did some experiments with intel performance counter monitor a few years ago. It could access a lot of low level information from the processors internal counters. With the amount of information available, I suspect that the hardware could judge quite well whether a process would benefit from being bumped up to a big core or pushed down to a little core. Something like blocking on IO isn’t really relevant at this level. Blocking on IO is a very long term event when viewed from the hardware. This type of hardware scheduling wouldn’t even be visible to the OS scheduler anyway, so it would just continue working as always. This isn’t really even the same job as the OS scheduler. Such a hardware system, if I understand it correctly, only has to decide whether to move the running process between big and little cores. If it is achieving an IPC of 0.25 or something like that, then it is mostly waiting on memory and there is no reason to use the big core.

Gideon · Jun 15, 2021

jamescox said:
The hardware knows a lot of things that the software doesn’t. That was the fallacy that lead to IA-64 (software scheduling vs. OOO speculative execution). I did some experiments with intel performance counter monitor a few years ago. It could access a lot of low level information from the processors internal counters. With the amount of information available, I suspect that the hardware could judge quite well whether a process would benefit from being bumped up to a big core or pushed down to a little core. Something like blocking on IO isn’t really relevant at this level. Blocking on IO is a very long term event when viewed from the hardware. This type of hardware scheduling wouldn’t even be visible to the OS scheduler anyway, so it would just continue working as always. This isn’t really even the same job as the OS scheduler. Such a hardware system, if I understand it correctly, only has to decide whether to move the running process between big and little cores. If it is achieving an IPC of 0.25 or something like that, then it is mostly waiting on memory and there is no reason to use the big core.

I think A510 is an interesting example with it's Bulldozerish shared-FP unit design.

It seems a nightmare to detect that one small core is hogging the FP resources causing things to slow down for the other core, without lots of model-specific code in the OS Scheduler. At the very least the schduler needs to be aware to not schedule FP loads to adjacent small cores (in some implementations of the A510, mind you, not all!)

If AMD were to use a similar design with their big-little approach, it would be relatively trivial for the hardware to detect and fire up one of the big cores instead. It only needs to account for one architecture and already has access to loads more low-level CPU-specific data to make that decision.

eek2121 · Jun 15, 2021

Gideon said:
I think A510 is an interesting example with it's Bulldozerish shared-FP unit design.

It seems a nightmare to detect that one small core is hogging the FP resources causing things to slow down for the other core, without lots of model-specific code in the OS Scheduler. At the very least the schduler needs to be aware to not schedule FP loads to adjacent small cores (in some implementations of the A510, mind you, not all!)

If AMD were to use a similar design with their big-little approach, it would be relatively trivial for the hardware to detect and fire up one of the big cores instead as it only needs to account for one architecture and already has access to loads more low-level CPU-specific data to make that decision.

Part of me wonders if the cores end up getting completely virtualized in the future, with hardware or firmware taking over the duties of the Windows/Linux scheduler.

Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)

Senior member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Elite Member

Lifer

Lifer

Diamond Member

Senior member

Platinum Member

Diamond Member

Diamond Member

Diamond Member

Platinum Member

Golden Member

Senior member

Senior member

Senior member

Platinum Member

Diamond Member