Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
799
1,351
136
Apart from details about the microarchitectural improvements, we now know pretty well what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5), with new memory support (likely DDR5).



What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 
Last edited:
  • Like
Reactions: richardllewis_01

blckgrffn

Diamond Member
May 1, 2003
9,128
3,069
136
www.teamjuchems.com
@Doug S I mentally pictured a user recording where the user was moving the mouse around crazily while typing out a reply with one hand, and got a laugh out of it, so thanks for that :D Luckily browser interaction isn't the only use case, given this could work everywhere, all the time, in a software-agnostic fashion.

To me, there seem to be plenty of times when a big core (or even just most of them) could rest in a deep sleep state, because even when humans are interacting with the machine and some code is firing, watching, etc., the gaps between tasks are an eternity for a CPU if it can change states fast enough to be transparent.

If it's worth a few percent of improvement even when targeted at a mobile platform, it does seem like every rock needs to be turned over in pursuit.
 

Doug S

Platinum Member
Feb 8, 2020
2,269
3,522
136
Considering I've done operating system development in the past, yes, I am aware of how mouse tracking works. Modern operating systems have many threads, and not all of them use all instruction sets. By placing those threads on smaller cores, you save power. You seem to believe it is an all or nothing type of deal. It isn't.

It IS an all or nothing deal. If you implement a cut down ISA for the small cores then ONLY those threads that never use unsupported instructions can run on the small cores. Everything else is restricted to running only on big cores, because you'd be stupid to allow a thread that has trapped off the little core to migrate back to it.

What exactly do you think could be cut out of the small cores? What's the benefit, other than saving a tiny amount of area? Whatever static power leakage those extra ungated transistors cost you, you will more than lose from threads that are forced onto more power-hungry big cores because the small core was too handicapped to run them.

It just doesn't make sense, other than, as I said before, for AVX512 and virtualization (and I suppose the 16-bit real-mode stuff, segments and other legacy crap, though IMHO they ought to dump that entirely).
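
To make that concrete, here's a minimal sketch of the one-way policy I'm describing, with invented names and an assumed core numbering (CPUs 0-3 little):

```c
#include <stdint.h>

/* Hypothetical one-way policy: a thread that traps on an instruction
   the little cores lack loses little-core eligibility for good. */
#define LITTLE_MASK 0x0fu          /* assume CPUs 0-3 are the little cores */

struct task {
    uint32_t allowed_cpus;         /* bitmask of cores this thread may use */
};

/* called from the unsupported-instruction trap handler */
void on_unsupported_insn(struct task *t) {
    t->allowed_cpus &= ~LITTLE_MASK;   /* big cores only, from now on */
}
```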
 

Doug S

Platinum Member
Feb 8, 2020
2,269
3,522
136
It seems like you are possibly massively overestimating the amount of floating point used in a JavaScript execution thread vs. the amount of all-integer work used in actually displaying the GUI. The amount of work to display the GUI is often huge compared to what is actually running in the GUI. Also, I don't know if anyone is talking about having a core with absolutely no FP resources. If you have a separate small core with a scalar FP unit, or even just a small narrow 128-bit unit, that could be used to handle any floating point instructions. Technically they could emulate any vector instructions with a scalar unit; it would just be slow. You could actually emulate floating point units with integer units, but that would be excruciatingly slow.

You only need ONE use of floating point to force a thread to trap off a small core that doesn't support it. How much Javascript code do you think is out there that does not use ANY math at all? If it sets a value, compares a value, anything, then it will trap.

Once it has trapped to the big core you aren't going to let it move back to the little core, because once it has trapped off you know it will happen again. Moving back and forth between big and little cores due to lack of instruction support (as opposed to performance need) is going to waste a ton of power, as you keep moving to a core that has a cold D-cache and TLB, needs icache refills, is missing BTB history, etc.

It seems like the goalposts are moving if now you say you're talking about a core with less FP resources. Well OF COURSE the little core will have fewer FP resources, as well as fewer int resources, fewer load/store resources, and less of everything really. That's absolutely not what was being argued at all. The post I originally replied to was talking about cutting instructions / ISA support out of the small core, not having it be narrower or in order.
 
Last edited:
  • Like
Reactions: scineram

Doug S

Platinum Member
Feb 8, 2020
2,269
3,522
136
@Doug S I mentally pictured a user recording where the user was moving the mouse around crazily while typing out a reply with one hand, and got a laugh out of it, so thanks for that :D Luckily browser interaction isn't the only use case, given this could work everywhere, all the time, in a software-agnostic fashion.

To me, there seem to be plenty of times when a big core (or even just most of them) could rest in a deep sleep state, because even when humans are interacting with the machine and some code is firing, watching, etc., the gaps between tasks are an eternity for a CPU if it can change states fast enough to be transparent.

If it's worth a few percent of improvement even when targeted at a mobile platform, it does seem like every rock needs to be turned over in pursuit.


Of course, that's the whole point of a little core. And that's why you want your little cores to be able to execute all the instructions that a big core can (other than, as I've said before, a few potential exceptions like AVX512 and virtualization), so you can allow them to run ALL threads during times when the system isn't doing much.

If small cores can't execute some threads, regardless of how little those threads are doing, because they need to run floating point (like the thread executing this browser tab), then they offer less potential power saving.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,355
1,550
136
Sorry, all math in Javascript is done in floating point, there's no way to avoid it if you are running a browser.

While all numeric types in JS are defined to be floating point, all modern browsers actually store all whole numbers as integers first, and only convert to floating point if you do some operation on them that would lose precision as an integer type. This is a major optimization that makes browsers a lot faster, mostly because it saves a lot of time when adding numbers and avoids having to convert floats back to integers for array indexing. As a programmer, this is one of those techniques that's just amazing to me -- just imagine all the scaffolding they had to build around all mathematical operations to be able to do that. How can that possibly be a win? Apparently it is.
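
To sketch what that scaffolding looks like (types and names made up by me, not taken from any real engine): every arithmetic op gets an integer fast path with an overflow check, and only falls back to doubles when precision would be lost.

```c
#include <stdint.h>

/* Illustrative only: a tagged value with an int32 fast path.
   __builtin_add_overflow is a GCC/Clang builtin. */
typedef struct {
    int is_int;
    union { int32_t i; double d; } u;
} Value;

Value js_add(Value a, Value b) {
    Value r;
    int32_t sum;
    if (a.is_int && b.is_int &&
        !__builtin_add_overflow(a.u.i, b.u.i, &sum)) {
        r.is_int = 1; r.u.i = sum;      /* result stays an integer */
    } else {
        double x = a.is_int ? a.u.i : a.u.d;
        double y = b.is_int ? b.u.i : b.u.d;
        r.is_int = 0; r.u.d = x + y;    /* fall back to floating point */
    }
    return r;
}
```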

(And some browsers also NaN-pack all their values -- that is, store every value inside their JS engine as a 64-bit scalar, with floats being 64-bit IEEE floats, and everything else being packed inside unused NaN values, with both a small type tag and the value stuffed in the 52 bits of mantissa. This means that in some JS engines, there is a double float, which is a NaN that contains a type tag and actually represents an integer, which actually represents a float that happens to be a whole number. Inception?)
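
Roughly, the NaN-packing looks like this (tag layout invented for illustration; real engines differ, and they canonicalize genuine NaN results so these can't collide):

```c
#include <stdint.h>
#include <string.h>

/* A quiet NaN sets all exponent bits plus the top mantissa bit,
   leaving ~51 payload bits for a type tag and value. */
#define QNAN_BITS 0x7ff8000000000000ULL
#define TAG_INT   0x0001000000000000ULL   /* one of several made-up tags */

typedef uint64_t jsval;

static jsval box_double(double d) {
    jsval v; memcpy(&v, &d, sizeof v); return v;   /* doubles are themselves */
}
static jsval box_int(int32_t i) {
    return QNAN_BITS | TAG_INT | (uint32_t)i;      /* int lives in the payload */
}
static int is_int(jsval v) {
    return (v & (QNAN_BITS | TAG_INT)) == (QNAN_BITS | TAG_INT);
}
static int32_t unbox_int(jsval v)    { return (int32_t)v; }
static double  unbox_double(jsval v) { double d; memcpy(&d, &v, sizeof d); return d; }
```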
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,355
1,550
136
Again and again you bring facts into the conversation... STOP THE COUNT!

Mostly interesting anecdotes. I agree with the basic idea that a browser engine would probably do some JS often enough that you can't throw away all FP from the weak cores and expect them to be useful for browsers.

An interesting option is to have an FPU, but just make it very slow. Like, 64-bit wide total, and relatively high latencies. That will be quite cheap. Then trap out processes if they seem to do more than occasional FP.
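
Something like this, say (names and thresholds made up): the trap handler bumps a per-thread counter that decays every scheduler tick, and the thread gets kicked to a big core once FP stops being occasional.

```c
#include <stdbool.h>
#include <stdint.h>

#define FP_TRAP_LIMIT 16        /* "more than occasional" cutoff */

struct thread_stats {
    uint32_t fp_trap_score;     /* decayed count of recent FP traps */
};

/* called whenever the slow/trapped FP path is taken */
void on_fp_trap(struct thread_stats *ts) {
    ts->fp_trap_score += 4;
}

/* called once per scheduler tick; decay, then decide */
bool should_move_to_big_core(struct thread_stats *ts) {
    ts->fp_trap_score -= ts->fp_trap_score / 4;   /* exponential decay */
    return ts->fp_trap_score > FP_TRAP_LIMIT;
}
```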
 
  • Like
Reactions: Tlh97 and lobz

Vattila

Senior member
Oct 22, 2004
799
1,351
136
In the recent AMD paper submitted to the International Symposium on Computer Architecture (ISCA), AMD hints at the challenges of interconnecting chiplets in the package substrate. I'm sure they would have loved to have the interconnect on silicon instead, if silicon interposers had had the "reach" at the time, or if silicon bridge packaging technology had been available. Both of these packaging technologies have since been developed by and in cooperation with TSMC, and the roadmap for availability and capacity of TSMC's SoIC packaging technology is conspicuously aligned with Lisa Su's goal to have V-Cache-enabled chips in production by year end, as well as with the roadmap for "Zen 4" next year. Hopefully, we'll see exciting developments in packaging.

"Examples of engineering challenges such as the required silicon-package co-design for under-CCD routing are reminders that one does not simply take disparate pieces of silicon and “glue” them into a complete system. Significant thought, planning, collaboration, engineering, and creativity are needed to successfully bring all the pieces together."

Pioneering Chiplet Technology and Design for the AMD EPYC™ and Ryzen™ Processor Families: Industrial Product (computer.org)
 
  • Like
Reactions: Tlh97 and moinmoin

Doug S

Platinum Member
Feb 8, 2020
2,269
3,522
136
An interesting option is to have an FPU, but just make it very slow. Like, 64-bit wide total, and relatively high latencies. That will be quite cheap. Then trap out processes if they seem to do more than occasional FP.

Isn't that what they're already doing? If you look at Anandtech's reviews of Apple's SoCs, the little cores have fewer FP units and somewhat longer latencies.

Optimizing for low power means going back in time, in a sense: the sort of very wide, high-IPC cores we see today that maximize performance burn a lot of power to allow for the less common case where you are able to issue or retire a half dozen instructions in a single cycle. If you drop that down to issuing or retiring 2 or 3 instructions per cycle, you GREATLY reduce the complexity of your decoder, rename engine, register file and so forth. That, combined with reducing the clock rate to the point before your power/performance curve starts to bend, seems to be the recipe for a low power core that Apple has adopted.

PS - thanks for the info on modern JS engines handling values as scalars, I wasn't aware they were doing that but it's a clever optimization!
 

maddie

Diamond Member
Jul 18, 2010
4,749
4,691
136
It IS an all or nothing deal. If you implement a cut down ISA for the small cores then ONLY those threads that never use unsupported instructions can run on the small cores. Everything else is restricted to running only on big cores, because you'd be stupid to allow a thread that has trapped off the little core to migrate back to it.

What exactly do you think could be cut out of the small cores? What's the benefit, other than saving a tiny amount of area? Whatever static power leakage those extra ungated transistors cost you, you will more than lose from threads that are forced onto more power-hungry big cores because the small core was too handicapped to run them.

It just doesn't make sense, other than, as I said before, for AVX512 and virtualization (and I suppose the 16-bit real-mode stuff, segments and other legacy crap, though IMHO they ought to dump that entirely).
Maybe I am stupid.

"Everything else is restricted to running only on big cores, because you'd be stupid to allow a thread that has trapped off the little core to migrate back to it."

If migration is pretty seamless AND only big or little can be active at any one time, why is this stupid?

In fact, if it really is an either/or situation regarding the big/little cores, then you have effectively told AMD that they're wasting their time with this.

According to your reasoning, all threads migrate to the big cores and it would be stupid to move them back after that; then we might as well ignore small cores and only have big ones, as we'll end up there permanently in any case.
 
  • Like
Reactions: Tlh97 and Thibsie

jamescox

Senior member
Nov 11, 2009
637
1,103
136
Part of me wonders if the cores end up getting completely virtualized in the future, with hardware or firmware taking over the duties of the Windows/Linux scheduler.
There are a lot of things that the hardware might not know, and vice versa. Putting the scheduler in the hardware is less flexible. I suppose you could add ISA elements to control scheduling, but you run into the issue that fully specifying the hardware in the ISA isn't good. As things change, you can't easily change the underlying implementation. The IA-64 ISA had that issue: the software scheduling could not do the same things as the OOO execution engines in other CPU architectures. They could have tried changing the underlying implementation, but IA-64 made that very difficult, and the whole thing ended up being abandoned. The ISA should be somewhat general to allow the underlying implementation to change.

We have hit some of this before with GPUs. I guess ATI/AMD leaned more toward hardware scheduling while Nvidia leaned more toward software scheduling. It probably made Nvidia look better, by using more CPU power with less required of the GPU. I don't know where they are on hardware vs. software scheduling these days.

I would expect that the hardware will be increasingly more virtualized as the underlying implementation evolves, but I think the scheduler will mostly remain in software. We may get to a point where it is kind of unclear how many “cores” a computing device has due to shared resources. We may have other things like an embedded FPGA, which blurs the line between software and hardware even more. It would be funny if we end up with AMD64 cores with an embedded FPGA which contains ARM cores.
 

soresu

Platinum Member
Dec 19, 2014
2,667
1,866
136
With the way the rumors have been, I am wondering if they are planning on making some 2-layer CPU stacks for 128 cores. With lower clocks, they may be able to stack multiple CPU layers, so it could be a special super-high-core-count part.
Oh for sure with lower clocks they could manage it.

But the question is how much lower?

If Zen4 adds significantly more FP/SIMD ALU resources, as I expect it will, then the clocks may need to be quite a bit lower to work, especially for sustained computing loads.

I'm still wondering with all AMD's push for chip stacking if and when they will announce some kind of thermal vias to cool them from inside, or at the very least from under the stacks by going inside the interposer too.
 

MadRat

Lifer
Oct 14, 1999
11,910
239
106
I like the theory but it may be hard to find a case where small cores get utilized much. Maybe when we run MAME the small core can run the game but a big core will run the emulator. Wait a minute...

Maybe when you run cmd.exe the small core can get a workout.
 

eek2121

Platinum Member
Aug 2, 2005
2,930
4,026
136
I like the theory but it may be hard to find a case where small cores get utilized much. Maybe when we run MAME the small core can run the game but a big core will run the emulator. Wait a minute...

Maybe when you run cmd.exe the small core can get a workout.

Once again you are applying “on” vs “off”. This isn’t an “on” vs “off” type of situation. Only a minority of the threads running on your machine right now use AVX instructions, and those that do use them infrequently. I think the part you misunderstand is power management. Intel and AMD are pretty good at power management, but they can never completely shut off silicon when it isn’t being used.

Remember, with modern CPUs, the CPU has no idea if an instruction set will be used or not. It relies on the OS to figure that out…

If you implement a trap like AMD is doing, you still don’t get to predict the future, but you do quickly realize what is going on. Right now that code bombs out. When they implement the tech mentioned in the patent, the code will seamlessly switch from efficient cores to performant cores and back again when the demanding instructions are finished.
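
As a rough user-space analogue (just a sketch; the patent's mechanism would live in hardware/firmware, and the big-core CPU numbers here are made up): catch the illegal-instruction trap, restrict affinity to the big cores, and let the faulting instruction retry there.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>

/* Assume, hypothetically, that CPUs 4-7 are the big cores. */
static void on_sigill(int sig) {
    (void)sig;
    cpu_set_t big;
    CPU_ZERO(&big);
    for (int c = 4; c <= 7; c++) CPU_SET(c, &big);
    sched_setaffinity(0, sizeof big, &big);  /* kernel migrates the thread */
    /* returning re-executes the faulting instruction, now on a big core */
}

int main(void) {
    struct sigaction sa = { .sa_handler = on_sigill };
    sigaction(SIGILL, &sa, NULL);
    /* ... run code that may contain instructions the little cores lack ... */
    return 0;
}
```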

EDIT: to be clear, this is the type of situation that happens hundreds of times a second. AMD does an excellent job of this now, but future architectures could see even better thanks to this approach.
 
Last edited:

jamescox

Senior member
Nov 11, 2009
637
1,103
136
I think the A510 is an interesting example with its Bulldozer-ish shared-FP-unit design.

It seems a nightmare to detect that one small core is hogging the FP resources, causing things to slow down for the other core, without lots of model-specific code in the OS scheduler. At the very least the scheduler needs to be aware not to schedule FP loads on adjacent small cores (in some implementations of the A510, mind you, not all!)

If AMD were to use a similar design with their big-little approach, it would be relatively trivial for the hardware to detect and fire up one of the big cores instead. It only needs to account for one architecture and already has access to loads more low-level CPU-specific data to make that decision.


I don't think Excavator cores performing poorly had anything to do with the shared FP unit. It is a reasonable way to share hardware; SMT is sharing hardware between threads.
You only need ONE use of floating point to force a thread to trap off a small core that doesn't support it. How much Javascript code do you think is out there that does not use ANY math at all? If it sets a value, compares a value, anything, then it will trap.

Once it has trapped to the big core you aren't going to let it move back to the little core, because once it has trapped off you know it will happen again. Moving back and forth between big and little cores due to lack of instruction support (as opposed to performance need) is going to waste a ton of power, as you keep moving to a core that has a cold D-cache and TLB, needs icache refills, is missing BTB history, etc.

It seems like the goalposts are moving if now you say you're talking about a core with less FP resources. Well OF COURSE the little core will have fewer FP resources, as well as fewer int resources, fewer load/store resources, and less of everything really. That's absolutely not what was being argued at all. The post I originally replied to was talking about cutting instructions / ISA support out of the small core, not having it be narrower or in order.
I don’t think I ever said anything about a core with no FP resources. I also don’t really pay too much attention to who said what. It is too difficult to keep track of. I don’t have that much time.

I may have said something about not supporting wide vector units. Actual vectorized AVX instructions are probably very uncommon to non-existent in a lot of code, so to me it would make some sense to just not support it on small cores. Just bump the thread to the large core if you hit an unsupported instruction. A lot of code isn’t even compiled with AVX support. The place I work does not compile with AVX at all. We still have a lot of older Xeons in use that do not support it. We do most of the heavy compute on the gpu, so there isn’t much reason to muck about with AVX anyway. I guess I might agree with the original poster on possibly having different instructions supported.

The big and small cores may share caches and such, so many cache effects would not be an issue. A lot of this depends on implementation. Apple has all cores visible to the OS, so they probably support the same instructions. If it is a very low power core and a big core that share caches and can move threads completely in hardware, then having the same instruction support doesn't seem necessary. The "core" that would be visible to the OS does support all instructions. There will be some penalty for moving a thread, but that isn't an issue if it isn't done that frequently. Many operating systems bounce threads between cores on a regular basis anyway. I ran into what seems to be a bug with that on Centos 6.9: if there was only one heavy thread executing, the OS would constantly bounce it to a core that was down-clocked, clock that core up, and then bounce the thread to another one again. The migration process was taking a bunch of CPU time. I ended up locking the machine in performance mode rather than ondemand mode to work around the issue.
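
For reference, the workaround amounted to forcing the performance governor on every core via the standard Linux cpufreq sysfs interface, something like this (naive sketch, needs root):

```c
#include <stdio.h>

/* Pin every core's cpufreq governor to "performance". */
int main(void) {
    char path[128];
    for (int cpu = 0; ; cpu++) {
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor", cpu);
        FILE *f = fopen(path, "w");
        if (!f) break;               /* no such CPU: we're done */
        fputs("performance", f);
        fclose(f);
    }
    return 0;
}
```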

The JavaScript example doesn't seem like a good one. Web browsing is something well suited to small cores, as long as you aren't going to pages trying to mine cryptocurrency in the background or something. A web browser will be running a huge number of threads, and I have trouble believing that most JavaScript execution threads are doing a sufficient amount of number crunching to require a fat core. My phone might get a little warm when watching video, but that should require very little actual CPU power since it is almost certainly using a hardware decoder. I have never observed it to be even warm from web browsing. It is an iPhone, though. There is probably a bunch of JavaScript on this forum page. I wouldn't be surprised if it has been running all on little cores while typing this. Under what circumstances would there actually be enough going on to require a fat core?

For a low power core that can quickly swap a thread to a big core, I don't think it would be that much of a problem to not support some instructions. A lot of the stuff that would be cut out would be under-the-hood stuff though (speculative execution and such), so I would agree on that. It sounds like AMD's implementation may be closer to some form of threading anyway, rather than ARM or Apple style big.LITTLE. That has a different set of constraints.
 
  • Like
Reactions: Tlh97

jamescox

Senior member
Nov 11, 2009
637
1,103
136
Oh for sure with lower clocks they could manage it.

But the question is how much lower?

If Zen4 adds significantly more FP/SIMD ALU resources, as I expect it will, then the clocks may need to be quite a bit lower to work, especially for sustained computing loads.

I'm still wondering with all AMD's push for chip stacking if and when they will announce some kind of thermal vias to cool them from inside, or at the very least from under the stacks by going inside the interposer too.
Intel had to clock down significantly when AVX512 units were in use. That was on 14 nm though. AMD stayed with 256-bit units on 7 nm and is probably going to 512 at 5 nm. I suspect that they will be able to sustain high clocks much better than intel did if only due to the large difference in process tech. They have presumably done a lot of work on power efficiency for their GPUs, so some of that tech could be used in the cpu also. They have seen large efficiency gains in RDNA.
 
  • Like
Reactions: Tlh97

Thala

Golden Member
Nov 12, 2014
1,355
653
136
It seems a nightmare to detect that one small core is hogging the FP resources, causing things to slow down for the other core, without lots of model-specific code in the OS scheduler. At the very least the scheduler needs to be aware not to schedule FP loads on adjacent small cores (in some implementations of the A510, mind you, not all!)

That's not how it works, and it is not a nightmare either. A scheduler works more by observing what is going on than by trying to predict what would happen by means of a sophisticated model. In your particular example, the scheduler will quickly see that the thread in question imposes a high load on the small core and will migrate the thread to a big core. Likewise, if the scheduler observes a low load for a particular thread on the big cores, that thread becomes a candidate for migration to the small cores.
There are other instances where the scheduler can make migration decisions. For instance, a low-performance thread running on the small cores may hold a mutex or critical section that a high-performance thread on the big cores is trying to obtain/enter. In this case the low-performance thread is also migrated to the large cores. This works very similarly to priority inheritance.

In general, the biggest problem with your kind of thinking is that the scheduler a priori has no idea about the instruction distribution of a particular thread. And even that is not static: a thread can suddenly start to execute a buttload of FP instructions in a certain window of its execution time. If this window is large enough to matter, the OS will observe this and make a migration decision to the big cores.
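
Put as a sketch (thresholds and names invented; real schedulers are far more involved), the observation-driven policy boils down to something like:

```c
#include <stdbool.h>

#define UP_THRESHOLD   80   /* % of a little core's capacity */
#define DOWN_THRESHOLD 20   /* % of a big core's capacity */

enum core_kind { LITTLE, BIG };

/* Decide placement from observed load only; no prediction involved. */
enum core_kind place_thread(enum core_kind cur, int recent_load_pct,
                            bool blocks_big_core_thread) {
    /* priority-inheritance-style boost: a little-core thread holding a
       lock that a big-core thread wants is promoted immediately */
    if (blocks_big_core_thread) return BIG;
    if (cur == LITTLE && recent_load_pct > UP_THRESHOLD)   return BIG;
    if (cur == BIG    && recent_load_pct < DOWN_THRESHOLD) return LITTLE;
    return cur;  /* otherwise stay put */
}
```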
 

uzzi38

Platinum Member
Oct 16, 2019
2,637
5,989
146
Naming is not important; Genoa or some other name, it is obvious that a 128-core Zen 4 CPU is a reality.

That's what I said in the first sentence of what you quoted.

"Worth noting Zen 4 based server processors in this case =/= Genoa in this case. "

I did not rule out the possibility of a 128c Zen 4 server CPU there in the slightest.
 

soresu

Platinum Member
Dec 19, 2014
2,667
1,866
136
So is the small core going to be like a K6 with super cache support? Maybe more like the 5x86, before MMX.
Eh?

AMD's last 'small' core, Jaguar, was better than the K6 and supported up to the AVX instruction set, albeit requiring 2 cycles to run an AVX instruction.

I would not expect any new small core to be less than that, and more likely considerably more.
 
  • Like
Reactions: Tlh97 and scineram

DrMrLordX

Lifer
Apr 27, 2000
21,643
10,862
136
I wouldn't be shocked if it was some sort of hybrid between a Jaguar-derived core and the construction-core shared-FPU layout.

Why would they do that? AMD's most power-efficient SoCs in terms of raw perf/watt are from the post-Zen era. The amount of effort it would take to update cat cores and/or CON cores to be competitive in terms of perf/watt or perf/area with even Raven Ridge/Dali would not be worth it. Plus anything from that era onward is at least capable of handling AVX2 instructions without bombing out, making it easy to assign threads promiscuously at thread creation and then reassign them according to process priority. AVX-512 is another issue; however, astute observers will note that Intel is busy splitting up the AVX-512 standard(s) into ISA extensions compatible with AVX/AVX2 (see: AVX-VNNI vs. AVX512-VNNI).

It should also be noted that the ARM world is moving in a direction that does all that (and more!) when it comes to giving schedulers the option to assign threads to any available core regardless of which SIMD instructions are in use. Because SVE2. Only in x86 world would anyone even need to contemplate the possibility of "small" cores that are incompatible with prevailing 512b SIMD instructions. Yes, ARM isn't there yet, but it's on ARM's roadmap. Allegedly RISC-V will be going in the same direction.
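
For illustration, SVE/SVE2 code is written vector-length-agnostic (a sketch using the ACLE intrinsics): the same binary asks whatever core it lands on how wide its vectors are, which is what would let a scheduler place threads on any core regardless of SIMD width.

```c
#include <arm_sve.h>
#include <stdint.h>

/* Scale an array by k without ever hard-coding the vector width:
   svcntw() reports how many 32-bit lanes the current core has,
   whether its vectors are 128 or 512 bits wide. */
void scale(float *dst, const float *src, float k, int64_t n) {
    for (int64_t i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32(i, n);            /* mask off the tail */
        svfloat32_t v = svld1_f32(pg, src + i);       /* predicated load   */
        svst1_f32(pg, dst + i, svmul_n_f32_x(pg, v, k));
    }
}
```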
 
Last edited: