Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
799
1,351
136
Apart from the details of the microarchitectural improvements, we now know pretty well what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5) and new memory support (likely DDR5).



What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 

HurleyBird

Platinum Member
Apr 22, 2003
2,684
1,267
136
Huge hindrance is a real stretch.

Anyone serious enough about gaming for this to bother them is not doing so on a mobile APU platform.

We expect Rocket Lake to outperform Zen 3 in gaming, right?

Which is faster for general tasks, Zen 3 or Willow Cove? Zen 3, right?

Cezanne and Rocket Lake have half the L3, but they also have a massively superior memory controller vs. Vermeer's last-gen IOD.

Taken together, I think it's reasonable to assume that Cezanne will do better in games than Vermeer clock-for-clock.

A high clocked desktop Cezanne could be a great counter to Rocket Lake. Vermeer beats it in general computing while Cezanne holds down the fort for gaming.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,601
5,780
136
X3D described in new patent.
20200409859
GPU CHIPLETS USING HIGH BANDWIDTH CROSSLINKS

A chiplet system includes a central processing unit (CPU) communicably coupled to a first GPU chiplet of a GPU chiplet array. The GPU chiplet array includes the first GPU chiplet communicably coupled to the CPU via a bus and a second GPU chiplet communicably coupled to the first GPU chiplet via a passive crosslink. The passive crosslink is a passive interposer die dedicated for inter-chiplet communications and partitions systems-on-a-chip (SoC) functionality into smaller functional chiplet groupings.

A number of chiplets are arranged to form one giant GPU, connected internally using the HBX crosslink interconnect.
You will notice the primary GPU chiplet uses the SDF to connect to the CPU.

I believe this is basically the cache-coherent unified memory architecture between CPU and GPU that is supposed to arrive with Genoa and next-gen Instinct, to be deployed on El Capitan.
The stacked dies on the sides in the X3D illustration seem to be HBM.
So it seems AMD is going for GPU chiplets and cache coherency between CPU and GPU in one shot.
Also, the connectivity between CPU and GPU is probably not barebones PCIe, but a newly patented mechanism described in
20200412848
LOW OVERHEAD HIGH BANDWIDTH DATA TRANSFER PROTOCOL

X3D


GPU Chiplets
 

uzzi38

Platinum Member
Oct 16, 2019
2,622
5,880
146
Also this is my favourite snippet:

For example, in some embodiments, the GPU chiplets may be constructed as pentagon-shaped dies such that five GPU chiplets may be coupled together in a chiplet array.
 

dr1337

Senior member
May 25, 2020
331
559
106
That sounds like a pain to cut the wafer up for vs standard die grids.
I mean, I'm pretty sure it's all CNC. And once they have the final chip dimensions outlined, it should be trivial for them to lay the dies out in a nice pattern and then have the machine laser them out. Using pentagons sounds really strange though, because it's impossible to tile regular pentagons without wasting space; it would have to be some special pentagonal shape. I guess it still shouldn't be that much harder than doing a square grid, but I wonder how much of a difference alternate tile shapes would actually make (if any at all).
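For what it's worth, the reason regular pentagons can't tile the plane is simple vertex-angle arithmetic, which a few lines of Python can check (purely illustrative):

```python
# A convex shape can only tile around a vertex if its interior angle
# divides 360 degrees evenly. Interior angle of a regular n-gon: (n-2)*180/n.
def interior_angle(n):
    return (n - 2) * 180 / n

for n in (3, 4, 5, 6):
    angle = interior_angle(n)
    print(n, angle, 360 % angle == 0)
# Triangles (60), squares (90) and hexagons (120) divide 360 evenly;
# regular pentagons (108) don't, leaving 36-degree gaps at every vertex.
# So a tileable "special pentagonal shape" (non-regular) would be needed.
```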
 

Thibsie

Senior member
Apr 25, 2017
747
798
136
How easy is it to program/reprogram an FPGA?
Would it be possible to decode an instruction (or a pack of instructions), deduce that an X or Y type of unit is needed, and reprogram the FPGA before sending the instruction to it for processing?
I guess it would be too slow, but...
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,346
1,525
136
How easy is it to program/reprogram an FPGA?
Would it be possible to decode an instruction (or a pack of instructions), deduce that an X or Y type of unit is needed, and reprogram the FPGA before sending the instruction to it for processing?
I guess it would be too slow, but...

It would be way too slow, by orders of magnitude. The LUTs are optimized for the actual execution, at the expense of the speed it takes to program them. You could make them faster to reprogram, but this would come at a cost to their runtime speed and/or die area per unit, which would make them worse for actual use.

(Also, just as a reasonable baseline for speed, the kind of FPGAs we are talking about here probably run at a max of about 1GHz. So if your CPU is running at 5GHz, you can run the simplest possible operation on the FPGA every 5 cycles or so. Actual useful things would be even slower than that. Having a small FPGA like that could still be a massive win, because there are things where even a tiny FPGA running at 500MHz can beat the fastest desktop CPU by orders of magnitude.)
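To put rough numbers on that argument (purely illustrative figures, not from any datasheet):

```python
# Back-of-the-envelope check of the clock-ratio argument above.
# Assumed numbers for illustration: 5 GHz CPU, 1 GHz FPGA fabric.
cpu_hz = 5e9
fpga_hz = 1e9

# One FPGA fabric cycle costs this many CPU cycles:
cycles_per_fpga_op = cpu_hz / fpga_hz
print(cycles_per_fpga_op)  # 5.0

# Yet a tiny FPGA can still win on throughput if it does enough work per
# cycle. E.g. a 500 MHz fabric producing 64 results/cycle vs a CPU doing
# 4 results/cycle at 5 GHz (both figures assumed):
fpga_throughput = 500e6 * 64   # results per second
cpu_throughput = 5e9 * 4
print(fpga_throughput / cpu_throughput)  # 1.6x
```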

If the PEUs are small, it might be reasonable to have them dump their old code and replace it on process switches. This would make them a lot more practically usable than if you had to fix the program in them, which would effectively require dedicating the core to the task.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,601
5,780
136
How easy is it to program/reprogram an FPGA ?
By programming an FPGA I suppose you mean changing the logic of the FPGA? It is fairly easy and quick; in some FPGAs the LUTs are held in SRAM. But it is not remotely fast enough for on-the-fly loading of the LUTs.
They basically write the LUT with the desired output for every combination of inputs.
There are many blocks that are fixed, however, for I/O and such.
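As a toy illustration of what "writing the LUT" means: a k-input LUT is just a 2^k-entry truth table sitting in SRAM, and reprogramming is rewriting that table (a sketch, not any vendor's actual bitstream format):

```python
# Toy model of an SRAM-based LUT: the "logic" is one table lookup,
# and reprogramming the FPGA means loading a different table.

def make_lut(truth_table):
    """truth_table: list of 2^k output bits, indexed by the packed inputs."""
    def lut(*inputs):
        index = 0
        for bit in inputs:          # pack the input bits into an address
            index = (index << 1) | bit
        return truth_table[index]   # one SRAM read = the configured logic
    return lut

# "Program" a 2-input LUT as XOR: outputs for inputs 00, 01, 10, 11
xor = make_lut([0, 1, 1, 0])
print(xor(1, 0))  # 1

# "Reprogramming" is just writing a different table, e.g. AND:
and_gate = make_lut([0, 0, 0, 1])
print(and_gate(1, 1))  # 1
```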

Coding the logic in HDL is a different matter. FPGAs can be very unpredictable in terms of timing and such, but they are very useful when you need to simulate things before you tape out silicon.
There are still many things you need to figure out, though, like race conditions, clock propagation, delays, signal integrity and a myriad other things, which you can only do with real silicon.

In the context of AMD's patent, I don't think it is an FPGA in the same sense as a Spartan or even a Versal, because here the programmable execution unit is part of the CPU core itself (unlike the Versal, for example, with dedicated boundaries between the two).
But the possibilities are many.
If they ever manage to work out a way to have a small cache for the LUTs that can be loaded by the PSP from the BIOS, offering acceleration of certain workloads that can be optimized over time, that would already be a big thing.
Another way to look at this is loading a file before executing a program. I'm not sure AMD would allow this, as it would mean some tinkerers could basically alter their processor. But I can see a use case here as well.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
By programming an FPGA I suppose you mean changing the logic of the FPGA? It is fairly easy and quick; in some FPGAs the LUTs are held in SRAM. But it is not remotely fast enough for on-the-fly loading of the LUTs.
They basically write the LUT with the desired output for every combination of inputs.
There are many blocks that are fixed, however, for I/O and such.

Coding the logic in HDL is a different matter. FPGAs can be very unpredictable in terms of timing and such, but they are very useful when you need to simulate things before you tape out silicon.
There are still many things you need to figure out, though, like race conditions, clock propagation, delays, signal integrity and a myriad other things, which you can only do with real silicon.

In the context of AMD's patent, I don't think it is an FPGA in the same sense as a Spartan or even a Versal, because here the programmable execution unit is part of the CPU core itself (unlike the Versal, for example, with dedicated boundaries between the two).
But the possibilities are many.
If they ever manage to work out a way to have a small cache for the LUTs that can be loaded by the PSP from the BIOS, offering acceleration of certain workloads that can be optimized over time, that would already be a big thing.
Another way to look at this is loading a file before executing a program. I'm not sure AMD would allow this, as it would mean some tinkerers could basically alter their processor. But I can see a use case here as well.
There has been some work done on FPGAs that can store multiple configuration states. I was familiar with a start-up company called Tabula that was using unique FPGA technology to map large designs across multiple clock cycles. Basically, the device could fit a design up to 8x the single-clock size by mapping it across multiple clocks, though the "user clock" would halve with each doubling of the size: HDL that needed 2x the size of the chip would run at half the hardware clock. It is a tough software problem to map HDL to such a device, and they ran out of funding before their second-generation device could be released, so they shut down. They had some press due to being one of the few companies that was going to use Intel as a fab at 22 nm.

I have wondered if one of the other FPGA companies bought the IP. When I first read about it, it sounded like a spectacular idea since it could reduce interconnect overhead and replace it with local memory. If you had an 8 stage pipeline, for example, you could actually map it out in the same hardware spread out over 8 clocks. There would be no sending data through longer interconnect to other units. You would just change the function of the local units each clock. The data stays local. It does take extra memory to store the configuration and you need some memory to store the state between clocks, so it would have been a very memory heavy design. This seems like it could have made them much more power efficient than conventional FPGAs. Interconnect power is a big portion of modern chip power consumption. It is usually good to trade memory for data movement.
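A rough software sketch of the time-multiplexing idea as I understand it from the description above (the stage functions and clock figure are made up for illustration):

```python
# Tabula-style time multiplexing, modeled in software: one physical unit
# loads a different configuration on each fast hardware subcycle, so an
# 8-stage design fits in the area of one stage, while the effective
# "user clock" is the hardware clock divided by 8. State stays in local
# memory between subcycles instead of moving through long interconnect.

stages = [                      # the 8 "configurations", one per subcycle
    lambda x: x + 1,
    lambda x: x * 2,
    lambda x: x - 3,
    lambda x: x * x,
    lambda x: x + 7,
    lambda x: x // 2,
    lambda x: x * 3,
    lambda x: x - 1,
]

def run_time_multiplexed(value):
    state = value               # local memory holds state between subcycles
    for configure in stages:    # reconfigure the same unit each subcycle
        state = configure(state)
    return state

print(run_time_multiplexed(2))     # runs all 8 "stages" on one unit

hw_clock_mhz = 1600                # assumed hardware clock, illustrative
print(hw_clock_mhz / len(stages))  # effective user clock: 200.0 MHz
```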

Having an FPGA device that can store multiple states and being able to switch each clock is overkill for things that can fit on a single FPGA device. The complexity of mapping HDL across multiple clocks of such a device is significant also. However, an FPGA that can just cache multiple gpu style kernels may not be too difficult. I haven’t read up much on the current state of the FPGA market, but FPGAs can accelerate some software significantly, so people have been expecting them to be a big thing for HPC for a while. I don’t know if that has taken off yet? It seems like they mostly went more towards using gpu compute rather than FPGAs. Going from cpu to gpu is a huge jump, although algorithms sometimes need to be changed significantly to work on a gpu. It sometimes takes quite a bit of work and a person knowledgeable in gpu programming. That isn’t always available. I don’t know how much of an improvement an FPGA would offer vs. an optimized gpu-based solution.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
I had been wondering if AMD would leverage their gpu tech in their cpus somehow. There seems to be quite a bit of cross over between Zen and RDNA, at least as far as cache technology. A full gpu compute unit has a lot of graphics specific, fixed function hardware though. That wouldn’t be needed directly in a cpu core. An FPGA type unit could be very parallel such that it could probably be made to execute any vector extensions that they want to support. I wonder if AVX512 and future extensions could be implemented as just a pre-loaded configuration (microcode essentially) for this FPGA-like block.

I still don’t know if we are getting AVX512 in Zen 4. Zen 3 is the new architecture, so it seems like it would have been in Zen 3 if they are going to do it. It is still niche and not really necessary in a lot of markets. A lot of servers don’t really use FP processing at all and it takes significant die area. Consumer applications don’t really need it either. A lot of things that could use AVX512 can be done on a gpu instead, if you can get the software people to support it. Since AMD and now Intel both make GPUs, very wide vector extensions in the cpu seem like they may not be as useful.

I was wondering if they would start making more specialized cores for a wider set of markets. Storage servers and database servers probably don’t need much FP; they need a lot of cores and perhaps big caches. Maybe more encryption acceleration. I suspect they will solve the cache size by taking advantage of the MCM architecture. Epyc with a 128 MB infinity cache could be spectacular. An integrated FPGA block may allow them to continue to make one, modular chiplet that can cover a wide range of the market by reconfiguring this unit to whatever instruction extension is most useful and also allowing custom extensions. It is still going to take a lot of die area though; ARM processors can still fit a lot more cores in the same amount of silicon, so it still seems like the AMD64 tax could be an issue.

It would be funny if the FPGA unit could be configured to execute ARM instructions. Given Apple's switch to custom ARM processors, I am wondering if Microsoft is paying AMD to revive their custom ARM core. I thought I saw some rumors about the return of AMD ARM processors; perhaps I missed some things though. I suspect AMD Zen 3 based laptops will compare reasonably well to Apple ARM based laptops performance-wise, but they probably aren't going to be able to match the battery life. The vertical integration Apple has will be tough to beat on battery life. Also, Apple is paying to monopolize the latest and greatest TSMC process, so it may be 7 or 6 nm Zen 3 against 5 nm Apple CPUs. Intel is just too far behind both due to its process tech issues. I don't think I want a MacBook Pro though. I still use a lot of the old ports, like DisplayPort and network ports, since I still use an old 30-inch pro-level display. I would need all kinds of adapters and/or dongles. I am waiting for that high-end Zen 3 based laptop.
 

Gideon

Golden Member
Nov 27, 2007
1,625
3,650
136
How easy is it to program/reprogram an FPGA?
Would it be possible to decode an instruction (or a pack of instructions), deduce that an X or Y type of unit is needed, and reprogram the FPGA before sending the instruction to it for processing?
I guess it would be too slow, but...
Here is more info from the patent on how AMD imagines it will be used.

  • Processor includes one or more reprogrammable execution units which can be programmed to execute different types of customized instructions
  • When a processor loads a program, it also loads a bitfile associated with the program which programs the PEU to execute the customized instruction
  • Decode and dispatch unit of the CPU automatically dispatches the specialized instructions to the proper PEUs
  • PEU shares registers with the FP and Int EUs.
  • PEU can accelerate Int or FP workloads as well if speedup is desired
  • PEU can be virtualized while still using system security features
  • Each PEU can be programmed differently from other PEUs in the system
  • PEUs can operate on data formats that are not typical FP32/FP64 (e.g. Bfloat16, FP16, Sparse FP16, whatever else they want to come up with) to accelerate machine learning, without needing to wait for new silicon to be made to process those data types.
  • PEUs can be reprogrammed on-the-fly (during runtime)
  • PEUs can be tuned to maximize performance based on the workload
  • PEUs can massively increase IPC by doing more complex work in a single cycle
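As a purely hypothetical sketch of that dispatch flow (all names, opcodes, and the "bitfile" format here are made up; the patent doesn't specify an API):

```python
# Hypothetical model of the decode/dispatch flow the patent describes:
# standard opcodes go to fixed units, customized ones to a reprogrammable
# PEU whose "bitfile" is loaded alongside the program. Illustrative only.

class PEU:
    """A reprogrammable execution unit holding one loaded 'bitfile'."""
    def __init__(self):
        self.ops = {}               # custom opcode -> behaviour

    def load_bitfile(self, bitfile):
        self.ops = dict(bitfile)    # loaded when the program loads

    def execute(self, opcode, a, b):
        return self.ops[opcode](a, b)

def dispatch(opcode, a, b, int_alu, peu):
    # The decode/dispatch unit routes standard opcodes to fixed units
    # and automatically sends specialized ones to the proper PEU.
    if opcode in int_alu:
        return int_alu[opcode](a, b)
    return peu.execute(opcode, a, b)

int_alu = {"add": lambda a, b: a + b}       # fixed integer unit
peu = PEU()
# Made-up custom instruction: a 2-element dot product
peu.load_bitfile({"dot2": lambda a, b: a[0] * b[0] + a[1] * b[1]})

print(dispatch("add", 2, 3, int_alu, peu))              # 5
print(dispatch("dot2", (1, 2), (3, 4), int_alu, peu))   # 11
```

The "massively increase IPC" bullet corresponds to the last line: work that would take several standard instructions retires as one custom op.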
 

moinmoin

Diamond Member
Jun 1, 2017
4,944
7,656
136
In the past I've referred to Jim Keller's influence extending through Zen 3, which is where we are now. The part that I neglected is that Zen 3 and RDNA2/CDNA were not the goal in themselves but the necessary ingredients for a bigger goal: heterogeneous computing at exascale.



The goal was set in 2016 at the latest and already included CPU chiplets, GPU chiplets, active interposers, HBM stacking on GPU chiplets and so on. The Frontier supercomputer will be the result of this half-decade of development. Underfox (who reports many interesting patents on his Twitter feed and apparently plans to write more articles this year) wrote about this nearly a year ago:

Of course AMD won't stop there; the above results will go into Zen 4/RDNA3/CDNA2, also for use in the directly following El Capitan supercomputer. Underfox suggests that for that, AMD will revive heterogeneous computing in a big way: using the x86 ISA as a superset that is split up to be more efficient internally (the most recent FPGA patent also points to this), variable-width SIMD units, and multi-tasking at every pipeline stage (described for GPUs, but likely of interest for CPUs as well):
 

DisEnchantment

Golden Member
Mar 3, 2017
1,601
5,780
136
In the past I've referred to Jim Keller's influence extending through Zen 3, which is where we are now. The part that I neglected is that Zen 3 and RDNA2/CDNA were not the goal in themselves but the necessary ingredients for a bigger goal: heterogeneous computing at exascale.



The goal was set in 2016 at the latest and already included CPU chiplets, GPU chiplets, active interposers, HBM stacking on GPU chiplets and so on. The Frontier supercomputer will be the result of this half-decade of development. Underfox (who reports many interesting patents on his Twitter feed and apparently plans to write more articles this year) wrote about this nearly a year ago:

Of course AMD won't stop there; the above results will go into Zen 4/RDNA3/CDNA2, also for use in the directly following El Capitan supercomputer. Underfox suggests that for that, AMD will revive heterogeneous computing in a big way: using the x86 ISA as a superset that is split up to be more efficient internally (the most recent FPGA patent also points to this), variable-width SIMD units, and multi-tasking at every pipeline stage (described for GPUs, but likely of interest for CPUs as well):
Fast Forward 1/2 and Path Forward are actually US Government-sponsored exascale research programs going all the way back to 2012.
AMD, Intel and a few other companies were funded by the US Govt. through multiple labs: LLNL, ORNL, the NNSA and others.


There is supposed to be a new round of sponsorship from the same agencies, and there is a new DARPA initiative as well for a bunch of new tech; chiplets, new semiconductor materials, heterogeneous computing, security etc. are some of the highlights.
I have seen that AMD is part of many of these initiatives.
It is not very new, actually.

When you read AMD's patents, you will find a disclaimer in many of them stating that the patent is part of research sponsored by the Federal Government and that the Government has certain rights in the invention.
 

moinmoin

Diamond Member
Jun 1, 2017
4,944
7,656
136
It is not very new actually.

Yeah, it's not new at all, and it wasn't a secret either. The whole heterogeneous system architecture as AMD's approach to supercomputers goes back to 2012 too, likely as part of DARPA's Fast Forward initiative as well. I just thought it nice to bring this overarching development to attention again, since as part of that whole picture the focus on improvements in the individual Zen generations was kind of misleading. As part of that picture, even Epyc is an offshoot product secondary to the primary goal of building the above-mentioned APU package for exascale computing.
 

soresu

Platinum Member
Dec 19, 2014
2,657
1,858
136
I still don’t know if we are getting AVX512 in Zen 4. Zen 3 is the new architecture, so it seems like it would have been in Zen 3 if they are going to do it.
I think AMD have already stated that Zen4 will be a fairly significant uArch change by itself, so we can't rule anything out.

AVX512 itself is just instructions (fragmented though it is); the main thing is the actual FP/SIMD unit that executes them after decode. As Zen 2 went from 128-bit to 256-bit units despite being a "minor uArch update", anything goes.

Albeit, given AMD's return to competitiveness, it isn't impossible that we could see an entirely different SIMD solution for 512-bit. Unlike with XOP, where the non-competitive state of Bulldozer meant support for those extensions was doomed to very niche applications; I'm not sure any commercial apps ever supported it at all.

What we may also see is AVX512 instruction support, albeit with fused 256-bit units; then just doubling those units should give the far more widespread AVX2 code a real boost, as it did for Zen 2.
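For illustration, that "fused units" idea looks something like this in software: one wide op is cracked into two narrower micro-ops, the way Zen 1 ran 256-bit AVX2 on 128-bit units (a toy model, not AMD's actual design):

```python
# Toy model of running 512-bit instructions on 256-bit execution units:
# a 512-bit vector is 16 x 32-bit lanes, a 256-bit unit handles 8 lanes,
# so one 512-bit add becomes two 256-bit micro-ops.

def add_256(a, b):
    # stands in for a native 256-bit SIMD add (8 x 32-bit lanes)
    return [x + y for x, y in zip(a, b)]

def add_512_double_pumped(a, b):
    # crack the 512-bit op into low and high 256-bit halves
    lo = add_256(a[:8], b[:8])
    hi = add_256(a[8:], b[8:])
    return lo + hi

a = list(range(16))
b = [10] * 16
print(add_512_double_pumped(a, b))  # 16 lanes: [10, 11, ..., 25]
```

Same instruction-set support, half the per-op throughput; doubling the 256-bit units would then lift AVX2 and double-pumped AVX512 alike.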