Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
809
1,412
136
Except for the details of the microarchitectural improvements, we now know pretty well what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least) and will come with a new platform (SP5) with new memory support (likely DDR5).

What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 
  • Like
Reactions: richardllewis_01

MadRat

Lifer
Oct 14, 1999
11,946
265
126
Sounds like seamless core switching would improve power efficiency overall by using the little cores and keeping the big cores idle. Keeping big cores in use instead of smaller ones is probably less than ideal when there is low demand for working threads.

Does this help with licensing costs in any way?
 
  • Like
Reactions: Vattila

Thibsie

Senior member
Apr 25, 2017
865
973
136
Sounds like seamless core switching would improve power efficiency overall by using the little cores and keeping the big cores idle. Keeping big cores in use instead of smaller ones is probably less than ideal when there is low demand for working threads.

Does this help with licensing costs in any way?

If there are only 8 cores seen by the OS while there are really 8+8, it might, since per-core software licences generally count the cores the OS exposes.
 

Gideon

Golden Member
Nov 27, 2007
1,774
4,145
136
This sounds terrible from the OS kernel point of view. The physical CPU suddenly getting faster or slower without OS intervention is exactly what you don't want. This already happens with hardware-controlled turbo, but this will make things even more complicated (previously you could figure out performance by comparing clock speeds; now you have no idea how the little core stacks up against the big core).

If the OS is at the very least notified when the migration happens, it might be made to work, but otherwise it will be a scheduling nightmare. It will still be terrible even then, as the CPU budget will vary. Consider a big CPU running and the kernel scheduler allocating tasks to it based on its capacity. Then a core gets underutilized for some reason (many branch mispredictions? cache misses?) and the hardware decides to migrate to a little core. Suddenly the CPU is not fast enough, and the kernel scheduler needs to compensate by migrating tasks away. This all adds latency and reduces overall performance, while at the same time being a much more complicated model to support.
The kernel will need to adapt. Hardware scheduling would be at least an order of magnitude faster than software. It just makes no sense for the OS to constantly micromanage migrations between cores for every single process when such a HW capability exists.

The OS scheduler will just need to work at a higher abstraction level (as with hardware-controlled turbo). It should obviously be able to pin immovable tasks to certain cores and inform the CPU of what it thinks should run on a small core and what shouldn't (maybe even with a priority level, where only the highest one is binding), but for most processes the HW should be free to decide, as it is much more aware of its own capabilities (vs the OS, which needs to generalize across all CPUs) and can act a hell of a lot faster.
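For illustration, a minimal sketch of that "pin the immovable, hint the rest" split with today's Linux APIs (niceness standing in for a proper small-core hint, which doesn't exist yet; the choice of core 0 is just an assumption):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    /* Pin a (hypothetically) immovable, latency-critical task to core 0
       so neither the kernel nor the hardware migrates it. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");

    /* For everything else, hand the scheduler a coarse hint (lower
       priority here, as a stand-in for a small-core hint) and let the
       kernel or hardware pick the core. */
    if (setpriority(PRIO_PROCESS, 0, 10) != 0)
        perror("setpriority");
    return 0;
}
```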
 

uzzi38

Platinum Member
Oct 16, 2019
2,705
6,427
146
Honestly I'm quite worried about this method of dealing with heterogeneous cores as well, not least because I'm pretty sure AMD aren't the first to try having littles essentially invisible to the OS.

Apple were, with the A10. They didn't keep the design with the A11.

I dunno how the two implementations differed specifically; I haven't seen anything that really explains the exact implementation. It's possible AMD's implementation may not run into whatever issues Apple's did. But... it does worry me that Apple of all companies dropped the idea.
 

Asterox

Golden Member
May 15, 2012
1,039
1,823
136
Honestly I'm quite worried about this method of dealing with heterogeneous cores as well, not least because I'm pretty sure AMD aren't the first to try having littles essentially invisible to the OS.

Apple were, with the A10. They didn't keep the design with the A11.

I dunno how the two implementations differed specifically; I haven't seen anything that really explains the exact implementation. It's possible AMD's implementation may not run into whatever issues Apple's did. But... it does worry me that Apple of all companies dropped the idea.

Important detail, but it's kind of expected; there's no rush for this on desktop.


"it is believed that Ryzen 8000 series “Strix Point” will be AMD’s first to implement heterogeneous architecture with 3nm Zen5 cores combined with Zen4D on a single package. A desktop variant codenamed “Granite Ridge” is not currently rumored to feature big/small core architecture"
 

jamescox

Senior member
Nov 11, 2009
644
1,105
136
Honestly I'm quite worried about this method of dealing with heterogeneous cores as well, not least because I'm pretty sure AMD aren't the first to try having littles essentially invisible to the OS.

Apple were, with the A10. They didn't keep the design with the A11.

I dunno how the two implementations differed specifically; I haven't seen anything that really explains the exact implementation. It's possible AMD's implementation may not run into whatever issues Apple's did. But... it does worry me that Apple of all companies dropped the idea.
We can't really go by what Apple does, since they have full control of the hardware and software with their ARM solutions. AMD and Intel do not. They are going to have to work closely with Microsoft and open-source developers.

It would be nice if they could at least get the page-size issues resolved. I have been dealing with 2 MB transparent huge pages on CentOS 6 and 7. They cause all kinds of issues, since the kernel tries to defrag 4k pages into 2 MB pages on allocation. This can cause huge delays on memory allocation and does not seem very effective at reclaiming huge pages. I don't know if this is much better in CentOS 7; it is still an old kernel. As far as I know, any swapping to disk will quickly fragment 2 MB pages into 4k pages, because the swap system only handles 4k pages. Some applications seem to perform significantly better with 2 MB pages (a massively more effective TLB), but the fragmentation into 4k pages and ineffective defragmentation cause performance to degrade over time. It is significant with the defrag on CentOS 6; it was sometimes causing delays of minutes on memory allocation. If you think about how many pages are required for an application using gigabytes of memory, it seems like we should have abandoned 4k pages a long time ago.

With how long page-size issues have been a problem, I don't have very high expectations for scheduler optimizations, so it is probably very good for AMD to have a hardware solution, even if you can't use all of the cores at once.
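For what it's worth, the usual mitigation on those kernels (a sketch, not CentOS-specific advice) is to set /sys/kernel/mm/transparent_hugepage/enabled to "madvise" and then opt in only the mappings that actually benefit:

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
    /* With system-wide THP set to "madvise", only regions flagged like
       this are backed by 2 MB pages, so ordinary allocations no longer
       stall behind the kernel's defrag of 4k pages. */
    size_t len = 1UL << 30;  /* 1 GB working set (example size) */
    void *buf = NULL;
    if (posix_memalign(&buf, 2UL << 20, len) != 0) {  /* 2 MB alignment helps THP */
        perror("posix_memalign");
        return 1;
    }
    if (madvise(buf, len, MADV_HUGEPAGE) != 0)  /* opt this region in */
        perror("madvise");
    /* ...use buf... */
    free(buf);
    return 0;
}
```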

Since Apple has full control, visibility to the OS isn't an issue. I believe Apple arbitrarily switched to 16k pages throughout; I need to read up on that, though. We don't seem to be able to do that with x86 systems, so it seems scheduling may be a problem. Ideally, you want all cores visible to the OS so that it can make the best use of them. You still want it to work with existing OS versions, so hopefully the hardware solution is only for when OS support isn't available. Hopefully we don't get completely different solutions from Intel and AMD. AMD may actually get more support from Microsoft, since a heterogeneous system may get used in next-generation consoles, where they want the performance but are limited to a low power envelope. For some server applications you would want small cores, but probably not as weak as the cores that might get used in mobile solutions. There is some crossover in the low-power constraint, but servers will probably require a more powerful core meant to be on all of the time.
 

MadRat

Lifer
Oct 14, 1999
11,946
265
126
Previously someone mentioned adding fat vector engines. I know they were talking about adding them on as an extra layer rather than altering the CCD. Would there be any advantage to attaching these vector units to the IOD and addressing them as an external coprocessor? (It's still much closer to the CCDs than a truly external chipset add-on.) Would the equivalent of Nvidia's tensor cores added there be an advantage, or would this overlap with AVX-512? With machine learning all the rage, I'd think this would rank right up there with a secure cryptoprocessor and a random number generator as helpful niches.
 
  • Like
Reactions: Vattila

Doug S

Platinum Member
Feb 8, 2020
2,784
4,744
136
We can't really go by what Apple does, since they have full control of the hardware and software with their ARM solutions. AMD and Intel do not. They are going to have to work closely with Microsoft and open-source developers.


That should have made it easier for Apple to get hardware migration working, so while you can't extrapolate directly from it, it certainly provides nothing encouraging for AMD. The main caution I'd have about extrapolating from Apple is that yes, they tried it for one year and went to software control, but it was also their first big/little implementation. Maybe they had intended to do software migration but couldn't get the software to work right at first, so we got a year of the little cores being invisible to the software.

While it is true you can migrate threads more quickly if the hardware is doing it on its own, there is little gain in having migrations happen more quickly. The overall latency of such moves will be overwhelmingly dominated by giant time sinks like refilling the L1 and TLB. It's like a faster plane making a one-hour flight take 50 minutes, while ignoring the two hours it takes to get from home to the airport, park, get through security, wait for boarding and taxi on the runway, and then another couple of hours at the destination.

The software also "knows" a lot of things the hardware doesn't, like the priority that may have been assigned to a thread, how often it blocks on I/O, and so forth, all of which figure into a good scheduler's decisions. Anyone who paid attention the several times over the past couple of decades that the Linux kernel's scheduler was revamped from scratch, and saw all the issues that go into getting it right, should be extremely wary of letting hardware decide on its own whether something should run on a big core or a little core.
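Some rough back-of-envelope numbers (all assumed: a 32 KB L1D, 64-byte lines, ~80 ns per miss serviced from DRAM, no overlap between misses) make the same point:

```c
#include <stdio.h>

int main(void)
{
    /* Assumed figures: 32 KB L1D, 64 B cache lines, ~80 ns per miss
       that goes all the way to DRAM after a migration. */
    double lines     = 32.0 * 1024 / 64;          /* 512 lines to refill */
    double miss_ns   = 80.0;
    double refill_us = lines * miss_ns / 1000.0;  /* ~41 us for L1D alone */
    printf("L1D refill alone: ~%.0f us\n", refill_us);
    /* A software-driven migration costs on the order of a few
       microseconds, so making the migration itself faster barely
       moves the total. */
    return 0;
}
```

And that's before TLB refill and the L2/L3 footprint, so the plane analogy holds.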
 

soresu

Diamond Member
Dec 19, 2014
3,229
2,515
136
Previously someone mentioned adding fat vector engines. I know they were talking about adding them on as an extra layer rather than altering the CCD. Would there be any advantage to attaching these vector units to the IOD and addressing them as an external coprocessor? (It's still much closer to the CCDs than a truly external chipset add-on.) Would the equivalent of Nvidia's tensor cores added there be an advantage, or would this overlap with AVX-512? With machine learning all the rage, I'd think this would rank right up there with a secure cryptoprocessor and a random number generator as helpful niches.
What you are talking about sounds more like the ML/matrix ops chiplet that is rumoured to come with RDNA4.

Considering that Zen 4 seems to be bringing APUs to the mainstream performance segment with Raphael, as well as to the value end with Phoenix, I think it's possible that such a feature may arrive in the future as part of a multi-step approach to broadening the reach of the AMD ecosystem.
 
  • Like
Reactions: Vattila

Ajay

Lifer
Jan 8, 2001
16,094
8,109
136
2. Big and small cores have different levels of ISA support (i.e. the small cores cannot support AVX, for example, and the big cores can), in contrast to Intel's ADL approach.
Damn, that's brilliant because it's simple. Of course, halting and then executing that instruction stream on a big core represents a performance hit in wasted cycles. The thread-state store and copy to the (normally) private L1$ of the larger core is interesting; I'm curious how that is done, since there is no direct L1$-to-L1$ port. I'll have to read on. Thanks @DisEnchantment - you are the ATF patent search king!
 

moinmoin

Diamond Member
Jun 1, 2017
5,064
8,032
136
2. Big and small cores have different levels of ISA support (i.e. the small cores cannot support AVX, for example, and the big cores can), in contrast to Intel's ADL approach.
With all the discussion of the CPU scheduling threads in hardware potentially being bad due to the lack of transparency to the OS, the above quote needs repeating. This is *not* your usual big.LITTLE approach with heterogeneous cores but normalized, lowest-common-denominator ISA support (which essentially defeats the whole purpose). This is a tiny toaster-level core which supports common low-effort features that shouldn't require firing up the big fat full core, as the patent above describes (read it!):

"For example, the low-feature processors may support instruction execution of low priority processes such as operating system (OS) maintenance, timer support and various monitor functions that are used to allow a device to appear powered on and available through periodic wake ups while most of the time it is in fact powered off. By minimizing the power needed to support these operations, battery life can be greatly extended, thereby improving the efficiency of lower power operations and improving battery life. Accordingly, instructions executed on the low-feature processors offer improvements in power savings relative to other implementations that employ other low power techniques, but continue to execute those instructions on full-feature processors"

AMD engineers apparently looked at what the purpose of big.LITTLE is and at possible ways to achieve it even more efficiently, all while keeping the Zen cores fully suited to the full range of use cases where they actually need to be fired up.
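To make the patent's example concrete, here is a trivial illustration (mine, not from the patent) of the kind of periodic wake-up housekeeping it wants routed to the low-feature cores; nothing in this loop needs AVX or any other big-core feature:

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/timerfd.h>
#include <unistd.h>

int main(void)
{
    /* Wake up once per second to do trivial housekeeping -- the
       "appear powered on and available through periodic wake ups"
       case from the patent text. */
    int tfd = timerfd_create(CLOCK_MONOTONIC, 0);
    if (tfd < 0) { perror("timerfd_create"); return 1; }

    struct itimerspec spec = {
        .it_interval = { .tv_sec = 1 },  /* fire every second */
        .it_value    = { .tv_sec = 1 },  /* first fire after one second */
    };
    timerfd_settime(tfd, 0, &spec, NULL);

    for (;;) {
        uint64_t expirations;
        read(tfd, &expirations, sizeof(expirations));  /* sleep until tick */
        /* ...update a heartbeat, poll a mailbox, check battery state... */
    }
}
```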
 

firewolfsm

Golden Member
Oct 16, 2005
1,848
29
91
It reads more like a Zen-based processor with a small cluster of 2-4 ARM-style cores that can take up small tasks, rather than an actual big.LITTLE processor. But a layout where each Zen core contains a small co-processor might work as well.
 

eek2121

Diamond Member
Aug 2, 2005
3,100
4,398
136
The implications of this are, of course, staggering. AMD could potentially expose, say, 32 "small" cores to the OS and, if a core has high usage or requires an instruction that doesn't exist on the small core, transparently transfer execution to one of 8 big cores. The best part is that they can then move a number of seldom-used instruction sets off the small cores to improve power management while decreasing die size.
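Purely as a user-space analogy (my sketch, not how AMD's patent actually does it, and assuming a hypothetical topology where cores 0-7 implement the full ISA), you can fake that trap-and-migrate behaviour today by catching the illegal-instruction fault and moving the thread before it retries:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>

/* Hypothetical topology: cores 0-7 are "big" and implement the full ISA. */
static void on_sigill(int sig)
{
    (void)sig;
    cpu_set_t big;
    CPU_ZERO(&big);
    for (int c = 0; c < 8; c++)
        CPU_SET(c, &big);
    sched_setaffinity(0, sizeof(big), &big);  /* move this thread to a big core */
    /* Returning from the handler re-executes the faulting instruction,
       which now runs on a core that implements it. */
}

int main(void)
{
    signal(SIGILL, on_sigill);
    /* ...run code that may contain big-core-only instructions... */
    return 0;
}
```

In hardware the state copy would be transparent and far cheaper, but the control flow is the same idea.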
 
  • Like
Reactions: Tlh97

Ajay

Lifer
Jan 8, 2001
16,094
8,109
136
"For example, the low-feature processors may support instruction execution of low priority processes such as operating system (OS) maintenance, timer support and various monitor functions that are used to allow a device to appear powered on and available through periodic wake ups while most of the time it is in fact powered off. By minimizing the power needed to support these operations, battery life can be greatly extended, thereby improving the efficiency of lower power operations and improving battery life. Accordingly, instructions executed on the low-feature processors offer improvements in power savings relative to other implementations that employ other low power techniques, but continue to execute those instructions on full-feature processors"
This statement doesn't make sense. Why would one want to keep the 'low-feature' cores powered off?? That doesn't minimize power; keeping the full-feature cores powered off does.
 

moinmoin

Diamond Member
Jun 1, 2017
5,064
8,032
136
Wasn't there a comment by AMD that there would be a special variant of Genoa for an upcoming supercomputer?
That's Trento, derived from Milan.

This statement doesn't make sense. Why would one want to keep the 'low featured' cores powered off?? That doesn't minimize power, keeping the full featured cores powered off does.
I read it as confirmation that it's either/or, not both kinds of core at the same time. As in, if you actually use the CPU, you usually want to use the Zen core.
 
  • Like
Reactions: Tlh97

Ajay

Lifer
Jan 8, 2001
16,094
8,109
136
This seems like a crazy idea for non-mobile CPUs. Why waste the design time and silicon die space for large cores that are meant to be used as seldom as possible? It makes sense for monolithic APUs for use in Ultrabooks and other lower power form factors. Otherwise - nuts!
 

eek2121

Diamond Member
Aug 2, 2005
3,100
4,398
136
This seems like a crazy idea for non-mobile CPUs. Why waste the design time and silicon die space for large cores that are meant to be used as seldom as possible? It makes sense for monolithic APUs for use in Ultrabooks and other lower power form factors. Otherwise - nuts!

The "big" cores consume more power, but perform faster. The "small" cores consume less power, but don't support all features. Chatting on an internet forum doesn't need most of the instruction sets modern day CPUs provide. Why power all that silicon? Playing a game requires a number of instruction sets that aren't normally used. During that time, the small cores can be put to sleep, giving the big cores more headroom (by way of TDP) to run.

EDIT: I should mention, by "small" I don't necessarily mean "slow". The cores themselves could be every bit as performant, or even more so, by not including certain types of instructions. We probably won't see those types of optimizations with AMD's first implementation, but as we move forward? Definitely.
 

Thibsie

Senior member
Apr 25, 2017
865
973
136
This seems like a crazy idea for non-mobile CPUs. Why waste the design time and silicon die space for large cores that are meant to be used as seldom as possible? It makes sense for monolithic APUs for use in Ultrabooks and other lower power form factors. Otherwise - nuts!

They probably only intend to use it in devices needing lower-power CPUs/APUs.