Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
799
1,351
136
Except for details of the microarchitectural improvements, we now know pretty well what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least) and will come with a new platform (SP5) and new memory support (likely DDR5).

[Attached image: slide from Hilgeman's presentation]


What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 

uzzi38

Platinum Member
Oct 16, 2019
2,622
5,879
146
Honestly I'm quite worried about this method of dealing with heterogeneous cores as well, but because I'm pretty sure AMD aren't the first to try having littles essentially invisible to the OS.

Apple were with the A10. They didn't keep the design with the A11.

I dunno the differences between how the two worked specifically, I haven't seen anything to really explain the exact implementation. It's possible AMD's implementation may not run into whatever issues Apple's did. But... It does worry me that Apple of all companies dropped the idea.
 

Asterox

Golden Member
May 15, 2012
1,026
1,775
136
Honestly I'm quite worried about this method of dealing with heterogeneous cores as well, but because I'm pretty sure AMD aren't the first to try having littles essentially invisible to the OS.

Apple were with the A10. They didn't keep the design with the A11.

I dunno the differences between how the two worked specifically, I haven't seen anything to really explain the exact implementation. It's possible AMD's implementation may not run into whatever issues Apple's did. But... It does worry me that Apple of all companies dropped the idea.

Important detail, but it is kind of expected, and there is no rush on the desktop side anyway.


"it is believed that Ryzen 8000 series “Strix Point” will be AMD’s first to implement heterogeneous architecture with 3nm Zen5 cores combined with Zen4D on a single package. A desktop variant codenamed “Granite Ridge” is not currently rumored to feature big/small core architecture"
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
Honestly I'm quite worried about this method of dealing with heterogeneous cores as well, but because I'm pretty sure AMD aren't the first to try having littles essentially invisible to the OS.

Apple were with the A10. They didn't keep the design with the A11.

I dunno the differences between how the two worked specifically, I haven't seen anything to really explain the exact implementation. It's possible AMD's implementation may not run into whatever issues Apple's did. But... It does worry me that Apple of all companies dropped the idea.
We can't really go by what Apple does, since they have full control of the hardware and software with their ARM solutions. AMD and Intel do not; they are going to have to work closely with Microsoft and open-source developers.

It would be nice if they could at least get the page-size issues resolved. I have been dealing with 2 MB transparent huge pages on CentOS 6 and 7. They cause all kinds of issues, since the kernel tries to defrag 4k pages into 2 MB pages on allocation. This can cause huge delays on memory allocation and does not seem very effective at reclaiming huge pages. I don't know if this is much better in CentOS 7; it is still an old kernel. As far as I know, any swapping to disk will quickly fragment 2 MB pages into 4k pages, because the swap system only handles 4k pages. Some applications perform significantly better with 2 MB pages (a massively more effective TLB), but the fragmentation into 4k pages and ineffective de-fragmentation cause performance to degrade over time. It was significant with the defrag on CentOS 6, sometimes causing delays of minutes on memory allocation.

If you think about how many pages are required for an application using gigabytes of memory, it seems like we should have abandoned 4k pages a long time ago. Given how long page-size issues have been a problem, I don't have very high expectations for scheduler optimizations, so it is probably very good for AMD to have a hardware solution, even if you can't use all of the cores at once.

Since Apple has full control, visibility to the OS isn't an issue. I believe Apple switched to 16k pages throughout; I need to read up on that, though. We don't seem to be able to do that with x86 systems, so scheduling may be a problem. Ideally, you want all cores visible to the OS so that it can make the best use of them. You still want it to work with existing OS versions, though, so hopefully the hardware solution is only for when OS support isn't available.

Hopefully we don't end up with completely different solutions from Intel and AMD. AMD may actually get more support from Microsoft, since a heterogeneous system may get used in next-generation consoles, where they want the performance but are limited to a low power envelope. For some server applications you would want small cores too, but probably not as weak as the cores that might get used in mobile solutions. There is some crossover in the low-power constraint, but servers will probably require a more powerful core meant to be on all of the time.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,601
5,780
136
I dunno the differences between how the two worked specifically, I haven't seen anything to really explain the exact implementation. It's possible AMD's implementation may not run into whatever issues Apple's did. But... It does worry me that Apple of all companies dropped the idea.
It's a patent application; there is no reason to 'worry' :D
We are in a technical speculation thread discussing likely possibilities (albeit not blue-sky fantasy).
If anything, AMD are being very pragmatic in their approach. Also note there is no prior art in the application.

Since I generally do not follow anything Apple, I would only like to comment on ARM big.LITTLE.
The cores are independent, asymmetric SMP cores, coherent at L2.
Disabling a small core when migrating the load to a big core makes little difference if your CPU is not power-constrained at all (e.g. if you are on a cutting-edge node); it might matter on an older node where the power/thermal envelope is an issue.

AMD's application is for cores that have their register files bridged. In this setup they cannot run in parallel (for the obvious reason that they use the same register file, or the register files are bridged).
The patent application goes as far as saying that during migration the small core stalls and execution resumes from the last program counter on the big core.
This is way beyond what the OS can handle; the OS manages TCBs and PCBs, not the instruction level.

The small core could in fact be the same big core with much more granular power/clock-gateable blocks.
And it is most certainly not a separate die for such small cores; it would be just a few mm² without the L3 (see Tremont).
 

MadRat

Lifer
Oct 14, 1999
11,910
238
106
Previously someone mentioned adding fat vector engines. I know they were talking about adding them on as an extra layer rather than altering the CCD. Would there be any advantage to attaching these vector units to the IOC and addressing them as an external coprocessor? (It's still much closer to the CCDs than a truly external chipset add-on.) Would the equivalent of Nvidia tensor cores added there be an advantage, or would this overlap with AVX-512? With machine learning all the rage, I'd think this would rank right up there with a secure cryptoprocessor and a random-number generator as helpful niches.
 

Doug S

Platinum Member
Feb 8, 2020
2,252
3,482
136
We can't really go by what Apple does since they have full control of the hardware and software with their ARM solutions. AMD or Intel does not. They are going to have to work closely with Microsoft and open source developers.


That should have made it easier for Apple to get the hardware migration working, so while you can't extrapolate directly from that, it certainly provides nothing encouraging for AMD. The main caution I'd have about extrapolating from Apple is that yes, they tried it for one year and then went to software control, but it was also their first big/little implementation. Maybe they had intended to do software migration but couldn't get the software to work right at first, so we got a year of having the little cores be invisible to the software.

While it is true you can migrate threads more quickly if the hardware is doing it on its own, there is little gain in having migrations happen more quickly. The overall latency of such moves will be overwhelmingly dominated by giant time sinks like refilling the L1 and TLB. It's like a faster plane making a one-hour flight take 50 minutes, while ignoring the two hours it takes to get from home to the airport, park, get through security, wait for boarding and taxi on the runway, and then another couple of hours at the destination.

The software also "knows" a lot of things the hardware doesn't, like the priority that may have been assigned to a thread, how often it is blocking on I/O, and so forth, which figure into a good scheduler's decisions. Anyone who paid attention to the several times over the past couple decades that the Linux kernel's scheduler was completely revamped from scratch, and saw all the issues that go into getting it right, should be extremely wary of allowing hardware to decide on its own whether something should run on a big core or little core.
 

soresu

Platinum Member
Dec 19, 2014
2,656
1,858
136
Previously someone mentioned adding fat vector engines. I know they were talking about adding it on as an extra layer rather than altering the CCD. Would there be any advantage attaching these vector units to the IOC and address it as an external coprocessor? (It's still much closer to the CCDs than a truly external chipset add-on.) Would the equivalent of NVidia tensor cores added there be an advantage or is this overlap with AVX-512? With machine learning all the rage, I'd think this would rank right up there with a secure cryptoprocessor and a random number generator as helpful niches.
What you are talking about sounds more like the ML/matrix ops chiplet that is rumoured to come with RDNA4.

Considering that Zen 4 seems to be bringing APUs to the mainstream performance segment with Raphael, as well as the value end with Phoenix, I think it's possible that such a feature may appear in the future as part of a multi-step approach to broadening the reach of the AMD ecosystem.
 

Ajay

Lifer
Jan 8, 2001
15,429
7,849
136
2. Big and small cores have different levels of ISA support (i.e. the small cores cannot support AVX, for example, and the big cores can), in contrast to Intel's ADL approach.
Damn, that's brilliant because it's simple. Of course, halting and then executing that instruction stream on a big core represents a performance hit in cycles wasted. The thread-state store and copy to the (normally) private L1$ of the larger core is interesting; I'm curious how that is done, since there is no direct L1$-to-L1$ port. I'll have to read on. Thanks @DisEnchantment - you are the ATF patent-search king!
 

moinmoin

Diamond Member
Jun 1, 2017
4,944
7,656
136
2. Big and small cores have different levels of ISA support (i.e. the small cores cannot support AVX, for example, and the big cores can), in contrast to Intel's ADL approach.
With all the discussion of the CPU scheduling threads in hardware potentially being bad due to the lack of transparency to the OS, the above quote needs repeating. This is *not* your usual big.LITTLE approach with heterogeneous cores but normalized lowest-common-denominator ISA support (which essentially defeats the whole purpose). This is a tiny toaster-level core which supports common low-effort features that shouldn't require firing up the big fat full core, as the patent describes (read it!):

"For example, the low-feature processors may support instruction execution of low priority processes such as operating system (OS) maintenance, timer support and various monitor functions that are used to allow a device to appear powered on and available through periodic wake ups while most of the time it is in fact powered off. By minimizing the power needed to support these operations, battery life can be greatly extended, thereby improving the efficiency of lower power operations and improving battery life. Accordingly, instructions executed on the low-feature processors offer improvements in power savings relative to other implementations that employ other low power techniques, but continue to execute those instructions on full-feature processors"

AMD engineers apparently looked at what the purpose of big.LITTLE is, and at ways that purpose can be achieved even more efficiently, all while keeping the Zen cores fully suited to the full range of use cases where they actually need to be fired up.
 

firewolfsm

Golden Member
Oct 16, 2005
1,848
29
91
It reads more like a Zen-based processor with a small cluster of 2-4 ARM-style cores that can take up small tasks, rather than an actual big.LITTLE processor. But a layout where each Zen core contains a small co-processor might work as well.
 

eek2121

Platinum Member
Aug 2, 2005
2,930
4,025
136
The implications of this are, of course, staggering. AMD could potentially expose, say, 32 "small" cores to the OS and, if a core has high usage or requires an instruction that doesn't exist on the small core, transparently transfer execution to one of 8 big cores. The best part is that they can then move a number of seldom-used instruction sets off the small cores to improve power management while decreasing die size.
 

Ajay

Lifer
Jan 8, 2001
15,429
7,849
136
"For example, the low-feature processors may support instruction execution of low priority processes such as operating system (OS) maintenance, timer support and various monitor functions that are used to allow a device to appear powered on and available through periodic wake ups while most of the time it is in fact powered off. By minimizing the power needed to support these operations, battery life can be greatly extended, thereby improving the efficiency of lower power operations and improving battery life. Accordingly, instructions executed on the low-feature processors offer improvements in power savings relative to other implementations that employ other low power techniques, but continue to execute those instructions on full-feature processors"
This statement doesn't make sense. Why would one want to keep the 'low-feature' cores powered off? That doesn't minimize power; keeping the full-feature cores powered off does.
 

moinmoin

Diamond Member
Jun 1, 2017
4,944
7,656
136
Wasn't there a comment by AMD that there would be a special variant of Genoa for an upcoming supercomputer?
That's Trento, derived from Milan.

This statement doesn't make sense. Why would one want to keep the 'low featured' cores powered off?? That doesn't minimize power, keeping the full featured cores powered off does.
I read it as confirmation that it's either/or, not both kinds of core at the same time. As in, if you actually use the CPU, you usually want to use the Zen core.
 

Ajay

Lifer
Jan 8, 2001
15,429
7,849
136
This seems like a crazy idea for non-mobile CPUs. Why waste the design time and silicon die space for large cores that are meant to be used as seldom as possible? It makes sense for monolithic APUs for use in Ultrabooks and other lower power form factors. Otherwise - nuts!
 

eek2121

Platinum Member
Aug 2, 2005
2,930
4,025
136
This seems like a crazy idea for non-mobile CPUs. Why waste the design time and silicon die space for large cores that are meant to be used as seldom as possible? It makes sense for monolithic APUs for use in Ultrabooks and other lower power form factors. Otherwise - nuts!

The "big" cores consume more power, but perform faster. The "small" cores consume less power, but don't support all features. Chatting on an internet forum doesn't need most of the instruction sets modern day CPUs provide. Why power all that silicon? Playing a game requires a number of instruction sets that aren't normally used. During that time, the small cores can be put to sleep, giving the big cores more headroom (by way of TDP) to run.

EDIT: I should mention, by "small" I don't necessarily mean "slow". The cores themselves could be every bit as performant, or even more so, by not including certain types of instructions. We probably won't see those types of optimizations with AMD's first implementation, but as we move forward? Definitely.
 

Doug S

Platinum Member
Feb 8, 2020
2,252
3,482
136
Chatting on an internet forum doesn't need most of the instruction sets modern day CPUs provide. Why power all that silicon? Playing a game requires a number of instruction sets that aren't normally used. During that time, the small cores can be put to sleep, giving the big cores more headroom (by way of TDP) to run.


That's completely wrong. You think posting to AnandTech doesn't use SIMD instructions? Check out whatever is responsible in your OS kernel for zeroing pages when a new page is needed; it probably uses AVX2 in some circumstances, and that's just the tip of the iceberg. You think floating point isn't needed? Sorry, all math in JavaScript is done in floating point; there's no way to avoid it if you are running a browser.

I doubt there's anything you can do with a modern PC or smartphone that would allow any worthwhile reduction in instruction-set coverage. Not even running an "idle loop" (which is a halt instruction these days), because there are always background/housekeeping processes running, so the scheduler, I/O dispatch, filesystem, and other parts of the kernel will remain active.

I don't think you can usefully cut any instructions from a small core other than 1) AVX-512 (and that's only true on x86 because Intel didn't provide a variable SIMD-width capability like SVE2) and 2) virtualization. Anything else you cut will mean almost every thread gets forced onto the big cores before long.
 

Thibsie

Senior member
Apr 25, 2017
746
798
136
This seems like a crazy idea for non-mobile CPUs. Why waste the design time and silicon die space for large cores that are meant to be used as seldom as possible? It makes sense for monolithic APUs for use in Ultrabooks and other lower power form factors. Otherwise - nuts!

They probably only intend to use it in devices needing lower-power CPUs/APUs.
 

moinmoin

Diamond Member
Jun 1, 2017
4,944
7,656
136
What are these 'toaster' cores? And why are people calling them that?
Maybe read my post again? The low-feature cores: the things that support instruction execution of low-priority processes such as operating system (OS) maintenance, timer support and various monitor functions, so that the big fat full cores don't need to be fired up for those.