Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
799
1,351
136
Apart from the details of the microarchitectural improvements, we now know pretty well what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5) and new memory support (likely DDR5).



What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 

HurleyBird

Platinum Member
Apr 22, 2003
2,684
1,267
136
Huge hindrance is a real stretch.

Anyone serious enough about gaming for this to bother them is not doing so on a mobile APU platform.

We expect Rocket Lake to outperform Zen 3 in gaming, right?

Which is faster for general tasks, Zen 3 or Willow Cove? Zen 3, right?

Cezanne and Rocket Lake have half the L3, but they also have a massively superior memory controller vs. Vermeer's last-gen IOD.

Taken together, I think it's reasonable to assume that Cezanne will do better in games than Vermeer clock-for-clock.

A high clocked desktop Cezanne could be a great counter to Rocket Lake. Vermeer beats it in general computing while Cezanne holds down the fort for gaming.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,601
5,780
136
X3D described in new patent.
20200409859
GPU CHIPLETS USING HIGH BANDWIDTH CROSSLINKS

A chiplet system includes a central processing unit (CPU) communicably coupled to a first GPU chiplet of a GPU chiplet array. The GPU chiplet array includes the first GPU chiplet communicably coupled to the CPU via a bus and a second GPU chiplet communicably coupled to the first GPU chiplet via a passive crosslink. The passive crosslink is a passive interposer die dedicated for inter-chiplet communications and partitions systems-on-a-chip (SoC) functionality into smaller functional chiplet groupings.

A number of chiplets are arranged to form one giant GPU, connected internally using the HBX crosslink interconnect.
You will notice the primary GPU chiplet uses the SDF to connect to the CPU.

I believe this is basically the cache-coherent unified memory architecture between CPU and GPU that is supposed to arrive with Genoa and next-gen Instinct, to be deployed on El Capitan.
The stacked dies on the sides in the X3D illustration seem to be HBM.
So it seems AMD is going for GPU chiplets and cache coherency between CPU and GPU in one shot.
Also, the connectivity between CPU and GPU is probably not barebones PCIe, but a newly patented mechanism described in
20200412848
LOW OVERHEAD HIGH BANDWIDTH DATA TRANSFER PROTOCOL

X3D


GPU Chiplets
 

uzzi38

Platinum Member
Oct 16, 2019
2,622
5,880
146
Also this is my favourite snippet:

For example, in some embodiments, the GPU chiplets may be constructed as pentagon-shaped dies such that five GPU chiplets may be coupled together in a chiplet array.
 

dr1337

Senior member
May 25, 2020
331
559
106
That sounds like a pain to cut the wafer up for vs standard die grids.
I mean, I'm pretty sure it's all CNC. And once they have the final chip dimensions outlined, it should be trivial for them to lay the dies out in a nice pattern and then have the machine laser them out. Using pentagons sounds really strange though, because it's impossible to tile regular pentagons without wasting space; it would have to be some special pentagonal shape. I guess it still shouldn't be that much harder than doing a square grid, but I wonder how much of a difference alternate tile shapes would actually make (if any at all).
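For what it's worth, the reason regular pentagons can't tile the plane is simple vertex-angle arithmetic, which a few lines of Python can check (purely illustrative):

```python
# A convex shape can only tile around a vertex if its interior angle
# divides 360 degrees evenly. Interior angle of a regular n-gon: (n-2)*180/n.
def interior_angle(n):
    return (n - 2) * 180 / n

for n in (3, 4, 5, 6):
    angle = interior_angle(n)
    print(n, angle, 360 % angle == 0)
# Triangles (60), squares (90) and hexagons (120) divide 360 evenly;
# regular pentagons (108) don't, leaving 36-degree gaps at every vertex.
# So a tileable "special pentagonal shape" (non-regular) would be needed.
```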
 

Thibsie

Senior member
Apr 25, 2017
747
798
136
How easy is it to program/reprogram an FPGA?
Would it be possible to decode an instruction (or a pack of instructions), deduce that an X or Y type of unit is needed, and reprogram the FPGA before sending the instruction to it for processing?
I guess it would be too slow, but...
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,346
1,525
136
How easy is it to program/reprogram an FPGA?
Would it be possible to decode an instruction (or a pack of instructions), deduce that an X or Y type of unit is needed, and reprogram the FPGA before sending the instruction to it for processing?
I guess it would be too slow, but...

It would be way too slow, by orders of magnitude. The LUTs are optimized for the actual execution, at the expense of the speed it takes to program them. You could make them faster to reprogram, but this would come at a cost to their runtime speed and/or die area per unit, which would make them worse for actual use.

(Also, just as a reasonable baseline for speed, the kind of FPGAs we are talking about here probably run at a max of about 1GHz. So if your CPU is running at 5GHz, you can run the simplest possible operation on the FPGA every 5 cycles or so. Actual useful things would be even slower than that. Having a small FPGA like that could still be a massive win, because there are things where even a tiny FPGA running at 500MHz can beat the fastest desktop CPU by orders of magnitude.)
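To put rough numbers on that argument (purely illustrative figures, not from any datasheet):

```python
# Back-of-the-envelope check of the clock-ratio argument above.
# Assumed numbers for illustration: 5 GHz CPU, 1 GHz FPGA fabric.
cpu_hz = 5e9
fpga_hz = 1e9

# One FPGA fabric cycle costs this many CPU cycles:
cycles_per_fpga_op = cpu_hz / fpga_hz
print(cycles_per_fpga_op)  # 5.0

# Yet a tiny FPGA can still win on throughput if it does enough work per
# cycle. E.g. a 500 MHz fabric producing 64 results/cycle vs a CPU doing
# 4 results/cycle at 5 GHz (both figures assumed):
fpga_throughput = 500e6 * 64   # results per second
cpu_throughput = 5e9 * 4
print(fpga_throughput / cpu_throughput)  # 1.6x
```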

If the PEUs are small, it might be reasonable to have them dump their old code and replace it on process switches. This would make them a lot more practically usable than if you had to fix the program in them, which would effectively require dedicating the core to the task.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,601
5,780
136
How easy is it to program/reprogram an FPGA ?
By programming an FPGA I suppose you mean changing the logic of the FPGA? It is fairly easy and quick; in some FPGAs the LUTs are held in SRAM. But it is not remotely fast enough for on-the-fly loading of the LUTs.
They basically write the LUT with the desired output for every combination of inputs.
There are many blocks that are fixed, however, for I/O and such.
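As a toy illustration of what "writing the LUT" means: a k-input LUT is just a 2^k-entry truth table sitting in SRAM, and reprogramming is rewriting that table (a sketch, not any vendor's actual bitstream format):

```python
# Toy model of an SRAM-based LUT: the "logic" is one table lookup,
# and reprogramming the FPGA means loading a different table.

def make_lut(truth_table):
    """truth_table: list of 2^k output bits, indexed by the packed inputs."""
    def lut(*inputs):
        index = 0
        for bit in inputs:          # pack the input bits into an address
            index = (index << 1) | bit
        return truth_table[index]   # one SRAM read = the configured logic
    return lut

# "Program" a 2-input LUT as XOR: outputs for inputs 00, 01, 10, 11
xor = make_lut([0, 1, 1, 0])
print(xor(1, 0))  # 1

# "Reprogramming" is just writing a different table, e.g. AND:
and_gate = make_lut([0, 0, 0, 1])
print(and_gate(1, 1))  # 1
```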

Coding the logic in HDL is a different matter. FPGAs can be very unpredictable in terms of timing and such, but they are very useful when you need to simulate things before you tape out silicon.
There are still many things you need to figure out, though, like race conditions, clock propagation, delays, signal integrity and a myriad other things, which you can only do with real silicon.

In the context of AMD's patent, I don't think it is an FPGA in the same sense as a Spartan or even a Versal, because here the programmable execution unit is part of the CPU core itself (unlike the Versal, for example, with dedicated boundaries between the two).
But the possibilities are many.
If they ever manage to work out a way to have a small cache for the LUTs that can be loaded by the PSP from the BIOS, offering acceleration of certain workloads that can be optimized over time, that would already be a big thing.
Another way to look at this is loading a file before executing a program. I'm not sure AMD would allow this, as it would mean some tinkerers could basically alter their processor. But I can see a use case here as well.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
By programming an FPGA I suppose you mean changing the logic of the FPGA? It is fairly easy and quick; in some FPGAs the LUTs are held in SRAM. But it is not remotely fast enough for on-the-fly loading of the LUTs.
They basically write the LUT with the desired output for every combination of inputs.
There are many blocks that are fixed, however, for I/O and such.

Coding the logic in HDL is a different matter. FPGAs can be very unpredictable in terms of timing and such, but they are very useful when you need to simulate things before you tape out silicon.
There are still many things you need to figure out, though, like race conditions, clock propagation, delays, signal integrity and a myriad other things, which you can only do with real silicon.

In the context of AMD's patent, I don't think it is an FPGA in the same sense as a Spartan or even a Versal, because here the programmable execution unit is part of the CPU core itself (unlike the Versal, for example, with dedicated boundaries between the two).
But the possibilities are many.
If they ever manage to work out a way to have a small cache for the LUTs that can be loaded by the PSP from the BIOS, offering acceleration of certain workloads that can be optimized over time, that would already be a big thing.
Another way to look at this is loading a file before executing a program. I'm not sure AMD would allow this, as it would mean some tinkerers could basically alter their processor. But I can see a use case here as well.
There has been some work done on FPGAs that can store multiple configuration states. I was familiar with a start-up company called Tabula that was using unique FPGA technology to map large designs across multiple clock cycles. Basically, the device could fit a design up to 8x the single-clock size by mapping it across multiple clocks, though the "user clock" would halve with each doubling of the size: HDL that needed 2x the size of the chip would run at half the hardware clock. It is a tough software problem to map HDL to such a device, and they ran out of funding before their second-generation device could be released, so they shut down. They had some press due to being one of the few companies that was going to use Intel as a fab at 22 nm.

I have wondered if one of the other FPGA companies bought the IP. When I first read about it, it sounded like a spectacular idea since it could reduce interconnect overhead and replace it with local memory. If you had an 8 stage pipeline, for example, you could actually map it out in the same hardware spread out over 8 clocks. There would be no sending data through longer interconnect to other units. You would just change the function of the local units each clock. The data stays local. It does take extra memory to store the configuration and you need some memory to store the state between clocks, so it would have been a very memory heavy design. This seems like it could have made them much more power efficient than conventional FPGAs. Interconnect power is a big portion of modern chip power consumption. It is usually good to trade memory for data movement.
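A rough software sketch of the time-multiplexing idea as I understand it from the description above (the stage functions and clock figure are made up for illustration):

```python
# Tabula-style time multiplexing, modeled in software: one physical unit
# loads a different configuration on each fast hardware subcycle, so an
# 8-stage design fits in the area of one stage, while the effective
# "user clock" is the hardware clock divided by 8. State stays in local
# memory between subcycles instead of moving through long interconnect.

stages = [                      # the 8 "configurations", one per subcycle
    lambda x: x + 1,
    lambda x: x * 2,
    lambda x: x - 3,
    lambda x: x * x,
    lambda x: x + 7,
    lambda x: x // 2,
    lambda x: x * 3,
    lambda x: x - 1,
]

def run_time_multiplexed(value):
    state = value               # local memory holds state between subcycles
    for configure in stages:    # reconfigure the same unit each subcycle
        state = configure(state)
    return state

print(run_time_multiplexed(2))     # runs all 8 "stages" on one unit

hw_clock_mhz = 1600                # assumed hardware clock, illustrative
print(hw_clock_mhz / len(stages))  # effective user clock: 200.0 MHz
```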

Having an FPGA device that can store multiple states and being able to switch each clock is overkill for things that can fit on a single FPGA device. The complexity of mapping HDL across multiple clocks of such a device is significant also. However, an FPGA that can just cache multiple gpu style kernels may not be too difficult. I haven’t read up much on the current state of the FPGA market, but FPGAs can accelerate some software significantly, so people have been expecting them to be a big thing for HPC for a while. I don’t know if that has taken off yet? It seems like they mostly went more towards using gpu compute rather than FPGAs. Going from cpu to gpu is a huge jump, although algorithms sometimes need to be changed significantly to work on a gpu. It sometimes takes quite a bit of work and a person knowledgeable in gpu programming. That isn’t always available. I don’t know how much of an improvement an FPGA would offer vs. an optimized gpu-based solution.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
I had been wondering if AMD would leverage their gpu tech in their cpus somehow. There seems to be quite a bit of cross over between Zen and RDNA, at least as far as cache technology. A full gpu compute unit has a lot of graphics specific, fixed function hardware though. That wouldn’t be needed directly in a cpu core. An FPGA type unit could be very parallel such that it could probably be made to execute any vector extensions that they want to support. I wonder if AVX512 and future extensions could be implemented as just a pre-loaded configuration (microcode essentially) for this FPGA-like block.

I still don’t know if we are getting AVX512 in Zen 4. Zen 3 is the new architecture, so it seems like it would have been in Zen 3 if they are going to do it. It is still niche and not really necessary in a lot of markets. A lot of servers don’t really use FP processing at all and it takes significant die area. Consumer applications don’t really need it either. A lot of things that could use AVX512 can be done on a gpu instead, if you can get the software people to support it. Since AMD and now Intel both make GPUs, very wide vector extensions in the cpu seem like they may not be as useful.

I was wondering if they would start making more specialized cores for a wider set of markets. Storage servers and database servers probably don’t need much FP; they need a lot of cores and perhaps big caches. Maybe more encryption acceleration. I suspect they will solve the cache size by taking advantage of the MCM architecture. Epyc with a 128 MB infinity cache could be spectacular. An integrated FPGA block may allow them to continue to make one, modular chiplet that can cover a wide range of the market by reconfiguring this unit to whatever instruction extension is most useful and also allowing custom extensions. It is still going to take a lot of die area though; ARM processors can still fit a lot more cores in the same amount of silicon, so it still seems like the AMD64 tax could be an issue.

It would be funny if the FPGA unit could be configured to execute ARM instructions. Given Apple's switch to custom ARM processors, I am wondering if Microsoft is paying AMD to revive their custom ARM core. I thought I saw some rumors about the return of AMD ARM processors; perhaps I missed some things though. I suspect AMD Zen 3 based laptops will compare reasonably well to Apple ARM based laptops performance-wise, but they probably aren't going to be able to match the battery life. The vertical integration Apple has will be tough to beat on battery life. Also, Apple is paying to monopolize the latest and greatest TSMC process, so it may be 7 or 6 nm Zen 3 against 5 nm Apple CPUs. Intel is just too far behind both due to its process tech issues. I don't think I want a MacBook Pro though. I still use a lot of the old ports, like DisplayPort and network ports, since I still use an old 30-inch pro-level display. I would need all kinds of adapters and/or dongles. I am waiting for that high-end Zen 3 based laptop.
 

Gideon

Golden Member
Nov 27, 2007
1,625
3,650
136
How easy is it to program/reprogram an FPGA?
Would it be possible to decode an instruction (or a pack of instructions), deduce that an X or Y type of unit is needed, and reprogram the FPGA before sending the instruction to it for processing?
I guess it would be too slow, but...
Here is more info from the patent on how AMD imagines it will be used.

  • Processor includes one or more reprogrammable execution units which can be programmed to execute different types of customized instructions
  • When a processor loads a program, it also loads a bitfile associated with the program which programs the PEU to execute the customized instruction
  • Decode and dispatch unit of the CPU automatically dispatches the specialized instructions to the proper PEUs
  • PEU shares registers with the FP and Int EUs.
  • PEU can accelerate Int or FP workloads as well if speedup is desired
  • PEU can be virtualized while still using system security features
  • Each PEU can be programmed differently from other PEUs in the system
  • PEUs can operate on data formats that are not typical FP32/FP64 (e.g. Bfloat16, FP16, Sparse FP16, whatever else they want to come up with) to accelerate machine learning, without needing to wait for new silicon to be made to process those data types.
  • PEUs can be reprogrammed on-the-fly (during runtime)
  • PEUs can be tuned to maximize performance based on the workload
  • PEUs can massively increase IPC by doing more complex work in a single cycle
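As a purely hypothetical sketch of that dispatch flow (all names, opcodes, and the "bitfile" format here are made up; the patent doesn't specify an API):

```python
# Hypothetical model of the decode/dispatch flow the patent describes:
# standard opcodes go to fixed units, customized ones to a reprogrammable
# PEU whose "bitfile" is loaded alongside the program. Illustrative only.

class PEU:
    """A reprogrammable execution unit holding one loaded 'bitfile'."""
    def __init__(self):
        self.ops = {}               # custom opcode -> behaviour

    def load_bitfile(self, bitfile):
        self.ops = dict(bitfile)    # loaded when the program loads

    def execute(self, opcode, a, b):
        return self.ops[opcode](a, b)

def dispatch(opcode, a, b, int_alu, peu):
    # The decode/dispatch unit routes standard opcodes to fixed units
    # and automatically sends specialized ones to the proper PEU.
    if opcode in int_alu:
        return int_alu[opcode](a, b)
    return peu.execute(opcode, a, b)

int_alu = {"add": lambda a, b: a + b}       # fixed integer unit
peu = PEU()
# Made-up custom instruction: a 2-element dot product
peu.load_bitfile({"dot2": lambda a, b: a[0] * b[0] + a[1] * b[1]})

print(dispatch("add", 2, 3, int_alu, peu))              # 5
print(dispatch("dot2", (1, 2), (3, 4), int_alu, peu))   # 11
```

The "massively increase IPC" bullet corresponds to the last line: work that would take several standard instructions retires as one custom op.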
 

moinmoin

Diamond Member
Jun 1, 2017
4,944
7,656
136
In the past I've referred to Jim Keller's influence extending through Zen 3, which is where we are now. The part that I neglected is that Zen 3 and RDNA2/CDNA were not the goal in themselves but the necessary ingredients for a bigger goal: heterogeneous computing at exascale.



The goal was set in 2016 at the latest and already included CPU chiplets, GPU chiplets, active interposers, HBM stacking on GPU chiplets and so on. The Frontier supercomputer will be the result of this half-decade of development. Underfox (who reports many interesting patents on his Twitter feed and apparently plans to write more articles this year) wrote about this nearly a year ago:

Of course AMD won't stop there; the above results will go into Zen 4/RDNA3/CDNA2, also for use in the directly following El Capitan supercomputer. Underfox suggests that for that, AMD will revive heterogeneous computing in a big way: using the x86 ISA as a superset that is split up to be more efficient internally (the most recent FPGA patent also points to this), variable-width SIMD units, and multi-tasking at every pipeline stage (described for GPUs, but likely of interest for CPUs as well):
 

DisEnchantment

Golden Member
Mar 3, 2017
1,601
5,780
136
In the past I've referred to Jim Keller's influence extending through Zen 3, which is where we are now. The part that I neglected is that Zen 3 and RDNA2/CDNA were not the goal in themselves but the necessary ingredients for a bigger goal: heterogeneous computing at exascale.



The goal was set in 2016 at the latest and already included CPU chiplets, GPU chiplets, active interposers, HBM stacking on GPU chiplets and so on. The Frontier supercomputer will be the result of this half-decade of development. Underfox (who reports many interesting patents on his Twitter feed and apparently plans to write more articles this year) wrote about this nearly a year ago:

Of course AMD won't stop there; the above results will go into Zen 4/RDNA3/CDNA2, also for use in the directly following El Capitan supercomputer. Underfox suggests that for that, AMD will revive heterogeneous computing in a big way: using the x86 ISA as a superset that is split up to be more efficient internally (the most recent FPGA patent also points to this), variable-width SIMD units, and multi-tasking at every pipeline stage (described for GPUs, but likely of interest for CPUs as well):
Fast Forward 1/2 and Path Forward are actually US Government-sponsored exascale research programs going all the way back to 2012.
AMD, Intel and a few other companies were funded by the US Govt. through multiple labs: LLNL, ORNL, the NNSA and others.


There is supposed to be a new round of sponsorship from the same agencies, and there is a new DARPA initiative as well for a bunch of new tech; chiplets, new semiconductor materials, heterogeneous computing, security etc. are some of the highlights.
I have seen that AMD is part of many of these initiatives.
It is not very new, actually.

When you read AMD's patents, you will find a disclaimer in many of them stating that the patent is part of research sponsored by the Federal Government and that the Government has certain rights in the invention.
 

moinmoin

Diamond Member
Jun 1, 2017
4,944
7,656
136
It is not very new actually.

Yeah, it's not new at all, and it wasn't a secret either. The whole heterogeneous system architecture as AMD's approach to supercomputers goes back to 2012 too, likely as part of DARPA's Fast Forward initiative as well. I just thought it nice to bring this overarching development to attention again, since as part of that whole picture the focus on improvements in the individual Zen generations was kind of misleading. As part of that picture, even Epyc is an offshoot product secondary to the primary goal of building the above-mentioned APU package for exascale computing.
 

soresu

Platinum Member
Dec 19, 2014
2,657
1,858
136
I still don’t know if we are getting AVX512 in Zen 4. Zen 3 is the new architecture, so it seems like it would have been in Zen 3 if they are going to do it.
I think AMD have already stated that Zen4 will be a fairly significant uArch change by itself, so we can't rule anything out.

AVX512 itself is just instructions (fragmented though it is); the main thing is the actual FP/SIMD unit that executes them after decode. As Zen 2 went from 128-bit to 256-bit units despite being a "minor uArch update", anything goes.

Albeit, given AMD's return to competitiveness, it isn't impossible that we could see an entirely different SIMD solution for 512-bit. Unlike with XOP, where the non-competitive state of Bulldozer meant support for those extensions was doomed to very niche applications; I'm not sure any commercial apps ever supported it at all.

What we may also see is AVX512 instruction support, albeit with fused 256-bit units; then just doubling those units should give the far more widespread AVX2 code a real boost, as it did for Zen 2.
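For illustration, that "fused units" idea looks something like this in software: one wide op is cracked into two narrower micro-ops, the way Zen 1 ran 256-bit AVX2 on 128-bit units (a toy model, not AMD's actual design):

```python
# Toy model of running 512-bit instructions on 256-bit execution units:
# a 512-bit vector is 16 x 32-bit lanes, a 256-bit unit handles 8 lanes,
# so one 512-bit add becomes two 256-bit micro-ops.

def add_256(a, b):
    # stands in for a native 256-bit SIMD add (8 x 32-bit lanes)
    return [x + y for x, y in zip(a, b)]

def add_512_double_pumped(a, b):
    # crack the 512-bit op into low and high 256-bit halves
    lo = add_256(a[:8], b[:8])
    hi = add_256(a[8:], b[8:])
    return lo + hi

a = list(range(16))
b = [10] * 16
print(add_512_double_pumped(a, b))  # 16 lanes: [10, 11, ..., 25]
```

Same instruction-set support, half the per-op throughput; doubling the 256-bit units would then lift AVX2 and double-pumped AVX512 alike.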