Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)

What do you expect with Zen 4?


  • Total voters
    310

Vattila

Senior member
Oct 22, 2004
711
1,024
136
Except for the details about the improvements in the microarchitecture, we now know pretty well what to expect with Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread-count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides did also show that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5), with new memory support (likely DDR5).

Untitled2.png


What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 

Doug S

Golden Member
Feb 8, 2020
1,176
1,716
106
You realise that this information is behind the paywall? That's not fair to Charlie.

If Saylick is a subscriber (especially if his company pays for it not him personally) then yeah I'd agree. I would assume Charlie requires subscribers agree not to publicly repost information from his articles.

If however Saylick found that information repeated elsewhere by someone else then it's fair game, IMHO.

I may be annoyed that all of Charlie's best info is behind a paywall, but he's got a right to make a living.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,484
981
136
That is the embedded roadmap though.

EPYC:

Embedded EPYC:

It isn't saying Milan should have been 2020Q3-2021Q2;

It is saying Embedded Milan should have been 2020Q3-2021Q2;

Also, the Google date for the Embedded website is effectively Aug 4, 2020.
It was only listed on Aug 5, 2020.

Meaning that the Embedded Epyc 7001/7002 only launched then in August 2020.

Jan 19, 2021 upload date
Googling it with "V3000"/"Zen4"/"7004" points it to having the slide.

Identifying the asterisk:
*AMD roadmaps are subject to change without notice or obligations to notify of changes.
Placement of boxes is not intended to represent first year of product shipment.
 

coercitiv

Diamond Member
Jan 24, 2014
4,904
7,224
136
if Nosta is to be believed at all
Name one process & architecture combo "leak" that Nosta talked about in the past and turned out to be true. The amount of fantasy nodes and architectures is staggering, and yet people still eat this crap with a spoon.

You would have better chances at predicting the future of AMD products by tossing a coin. My cat would have better chances of predicting AMD product & node mix, and I don't have a cat. It's still better than what Nosta predicts because me getting a cat and using it to make predictions is still within the realm of possibility in this universe.
 

Hans de Vries

Senior member
May 2, 2008
303
832
136
www.chip-architect.com
What's most surprising to me is the I/O die. I'm really not at all surprised by the CCDs if I'm honest, but the I/O die - new nodes have little to no effect on analog circuit density, and Genoa has a very significant increase to I/O (12ch DDR5, 128 PCIe5 lanes etc etc), yet despite that, the I/O die is actually smaller than Rome's.
The physical I/O of 7nm Cezanne is quite small. The 128 bit bus is just 5% of the 180mm2 die (The top-right rectangle) or 9 mm2.

Cezanne_die.jpg

This makes me think that AM5 may jump over Alderlake's 1700 pin package to 2000+ pins (from 1331 for AM4)
It would make desktop motherboards with 4 memory slots, each with its own channel.

AM5 needs to support 3nm CPUs and APUs. Two channel LPDDR4-4266 is already exhausted by 8 VEGA compute units. (Rembrandt has 12 Navi2 compute units on 6nm). The 5nm Raphael has an unknown amount of Navi3 compute units on AM5 and they could be clocked north of 3GHz.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,484
981
136
raphael.png

Feb 2021+ product bring up.

I can only find GPUs and one other IP bring ups so far with CPUs being less exact:
Vega10 bring up: Jan 2017-April 2017 => August 2017 launch
MI50/MI60 bring up: Jan 2018(start month) => November 2018 launch
Navi10 bring up: July 2018+ => July 2019 launch
PCIe 4.0 bring up in client/server: September 2018+ => July/August 2019 launch
Fiji bring up: Sept 2014 - November 2014 => June 2015 launch
Ontario bring up: 2010(no exact month) => January 2011 launch
Mullins bring up: 2013(no exact month) => April 2014 launch
Kaveri bring up: August 2013 => January~June 2014 launch
Radeon Pro 560x bring up: 2017(no exact month) => July 2018 launch
Radeon Pro Vega bring up: 2018(no exact month) => November 2018 launch
MI100 bring up: July 2019+(no exact start) => November 2020 launch
Zen SoC/Server (Zeppelin) bring up: Before August 2016, After November 2015 => March 2017 launch
MI200 bring up award: December 2020 => not yet launched.
3 launched within the year of first bring-up mention.
9 launched the year after first bring-up mention. Of those, majority of the mentions are during the later half/second half of the year.
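The tally above can be reproduced with a short sketch. The years are transcribed from the list; the one assumption is that Zeppelin's bring-up window (after November 2015, before August 2016) is counted as 2016. The names `BRING_UPS` and `tally` are made up for illustration.

```python
from collections import Counter

# (product, year of first bring-up mention, launch year) from the list above.
# Assumption: Zeppelin's bring-up window is counted as 2016.
BRING_UPS = [
    ("Vega10", 2017, 2017),
    ("MI50/MI60", 2018, 2018),
    ("Navi10", 2018, 2019),
    ("PCIe 4.0 client/server", 2018, 2019),
    ("Fiji", 2014, 2015),
    ("Ontario", 2010, 2011),
    ("Mullins", 2013, 2014),
    ("Kaveri", 2013, 2014),
    ("Radeon Pro 560x", 2017, 2018),
    ("Radeon Pro Vega", 2018, 2018),
    ("MI100", 2019, 2020),
    ("Zeppelin", 2016, 2017),
    ("MI200", 2020, None),  # not yet launched
]


def tally(entries):
    """Count products by the gap in calendar years between bring-up and launch."""
    counts = Counter()
    for _name, bring_up_year, launch_year in entries:
        if launch_year is None:
            counts["pending"] += 1
        else:
            counts[launch_year - bring_up_year] += 1
    return counts


counts = tally(BRING_UPS)
print("same year:", counts[0])
print("year after:", counts[1])
print("not yet launched:", counts["pending"])
```

This only compares calendar years, so it says nothing about whether the mention fell early or late within the year.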

On DDR-side LPDDR5/DDR5 has two spots of bring up: January 2020+ and July 2020+.

So, Raphael launching earlier than expected is more likely. ¯\_(ツ)_/¯
 

dnavas

Senior member
Feb 25, 2017
338
174
116
An 18-month release cycle would put Zen 4 in Q2 2022. However, Zen 4 brings a brand new platform with DDR5/PCIe5. So delays are to be expected.
I'm personally finding it increasingly difficult to justify investing in a pcie4 TR platform purchase, so I hate what this does for Zen4 TR, but it frankly doesn't make sense to ship a DDR5 platform prior to DDR5 being available in quantity. So yes, I'd expect later than sooner.

Unless [I add, somewhat self-servingly] you release on the platform that's already more expensive and might be more willing to absorb the cost. I don't expect AMD is doing this, but it seems like it would make some sense to put the TR chiplets after desktop, so that desktop can bake the design, but put the TR IO on the bleeding edge. You allow TR to benefit from the choice of the best chiplets, and allow the desktop to benefit from the better/cheaper supply chain of parts.
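For reference, the quoted 18-month estimate can be checked with a bit of date arithmetic. This is only a sketch: the Zen 3 launch date used here (November 5, 2020, Ryzen 5000 retail availability) is an assumption that is not stated in the thread, and the helper names are made up.

```python
from datetime import date


def add_months(start, months):
    """Return the first day of the month that lies `months` months after `start`."""
    total = start.year * 12 + (start.month - 1) + months
    return date(total // 12, total % 12 + 1, 1)


def quarter(d):
    """Format a date as a calendar quarter, e.g. 'Q2 2022'."""
    return f"Q{(d.month - 1) // 3 + 1} {d.year}"


# Assumption: Zen 3 (Ryzen 5000) launch date, not taken from the thread.
ZEN3_LAUNCH = date(2020, 11, 5)

projected = add_months(ZEN3_LAUNCH, 18)
print(projected.strftime("%B %Y"), "->", quarter(projected))
```

A strict 18-month cadence from that date lands in May 2022, which is where the Q2 2022 figure comes from. Any platform delay moves it later.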
 

DisEnchantment

Golden Member
Mar 3, 2017
1,193
3,584
136
I'm pretty sure since Zen2/3, AMD has been using the most dense libs available.

Ex:
View attachment 44460

The actual design reasons for AMD going for Hi-Freq, rather than going full on Hi-Inst will probably remain unknown. For all we know it could be marketing-sided rather than engineer-sided. As it is psychologically easier to sell a chip that has increased frequency over the last generation.

Above, extended w/ actual high-end on AMx:
Ryzen 7 1800X = 4.1 GHz boost
Ryzen 7 2700X = 4.3 GHz boost
Ryzen 9 3950X = 4.7 GHz boost
Ryzen 9 5950X = 4.9 GHz boost
Zen3 is using N7 HD indeed. While RDNA is using N7 HP.
For N5 the most optimal range for Zen 4 would be around 3.8-4.2 GHz according to the Shmoo plot; beyond that it would need a considerable jump in voltage to get the frequency to the same level as the 5950X, for example.
One of the advantages of designs which top out at ~3.2 GHz is that they are well below this value resulting in big gains in efficiency.

1621235308312.png

Why doesn't AMD use the high density process? Wouldn't the much higher IPC made possible by many more transistors make up for the lost frequency? Plus, it would be much more energy efficient
My thinking is that when they originally designed Zen1-Zen3 using the CCX concept they intended it to be very small and easily manufacturable. It was supposed to be cheap to produce. The high clocks could help get more performance.
Zen3 Core and up to L2 is quite small in comparison to most contemporary designs. The Zen3 core (without L3) is less than half the MTr of M1.
However, when Zen2 and later Zen3 landed they needed to tack on the big L3 to handle the weakness of the memory hierarchy with the multiple CCXs.
In the end Zen3 got big anyway. Also, making the MTr or the active core too high while operating at high frequency would have increased the TDP by a bit.

For Zen4 with this learning it should be interesting if AMD would raise the transistor count drastically or again stick with smaller core.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,193
3,584
136
PCIe 5 is pretty much irrelevant for consumers right now and will remain that way for many years. All it does is increase motherboard costs. The only thing it possibly could affect is storage, because GPUs aren't going to come close to saturating a PCI4x16 link any time soon.
PCIe 5.0 is used as the physical layer for CXL and is supposed to bring in a new era of cache coherent accelerators.

I hope this is not the case.
Otherwise it means buying some Zen4 based TR Pro to try out some CXL based accelerators.
Only option on AM5 would be to go A+A, which is kind of against AMD's open and inclusive philosophy. The issue with this is that AMD only has GPUs at the moment, so if you want to try out FPGA based CXL capable accelerators like the ones from Xilinx you would be out of luck on AM5.
It makes no sense to not have it, if they already have the PCIe 5.0 IP on Genoa. They could just have the support in the IOD and the chipset and let the Board OEMs take the cost on the high end boards.
Either that or this generation has no CXL support which is going to be an issue with developers wanting to try out CXL based accelerators.

Not happy with this.
 

Gideon

Golden Member
Nov 27, 2007
1,481
3,015
136
I don't know, and to be frank, I don't entirely care. He's said enough bollocks for me to know that he's more than happy to either make stuff up or trust things from absolutely anyone.
100% this. Usually just pure informed speculation leads to much more accurate facts than what these leakers claim.

So many of these leaks go blatantly against common industry facts that it hurts (and this goes against all of MLID, Adored and Coreteks). Things like:
  • Claiming not having working silicon in the labs less than a year before release
  • Claiming things that would require changes to silicon (other than respins) less than a year before release.
  • Claiming something has been designed but might not be released - This happens, but very rarely, as the R&D money has already been spent, it would literally have to be unsellable to get canned. (Things such as designing a 24 core Genoa and not announcing it while releasing a 16 core one makes 0 sense)
  • And the big one: Knowing SKUs and pricing 6+ months pre-release (when these are the last things that get decided. Especially the pricing as it's the only thing that can be changed easily, even hours before release)

But what really grinds my gears is if they get something wrong, they almost never admit that it was (someone's) poor speculation. Near always there is the excuse of "oh it must have been canned/postponed/changed last minute".

I still remember Adored being hell-bent that Navi will release in January Q1 2019, up to late December 2018. And when it didn't happen it was just casually "postponed due to yields". It ended up "being postponed" for 7 months. I'm sure AMD had no idea of the state of their yields a month before release :p
 

DisEnchantment

Golden Member
Mar 3, 2017
1,193
3,584
136
AMD HSA is here


AMD is building a system architecture for the Frontier supercomputer with
a coherent interconnect between CPUs and GPUs. This hardware architecture
allows the CPUs to coherently access GPU device memory. We have hardware
in our labs and we are working with our partner HPE on the BIOS, firmware
and software for delivery to the DOE.
The system BIOS advertises the GPU device memory (aka VRAM) as SPM
(special purpose memory) in the UEFI system address map. The amdgpu driver
looks it up with lookup_resource and registers it with devmap as
MEMORY_DEVICE_GENERIC using devm_memremap_pages.
Now we're trying to migrate data to and from that memory using the
migrate_vma_* helpers so we can support page-based migration in our
unified memory allocations, while also supporting CPU access to those
pages.
 

Gideon

Golden Member
Nov 27, 2007
1,481
3,015
136
Having something like 250 GB/s of IO bandwidth with 128 pci-express 4.0 links seems like it would have been the deciding factor.
That wasn't really true for this case.

Aurora was supposed to be ready earlier, was won by Intel and is being built with Sapphire Rapids chiplets that have PCIe 5.0, "Rambo cache" chiplets, HBM2 on package if needed (and it looks like similar unified-memory-space software). The problem is it's using micro-bumps for stacking (well, it's also very late, but that wasn't certain when Frontier was announced). So if anything Intel had the I/O advantage.
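As a rough sanity check on the numbers being thrown around, the "something like 250 GB/s" figure for 128 PCIe 4.0 lanes, and what PCIe 5.0 would give for the same lane count, can be sketched as below. It assumes the nominal per-lane rates (16 GT/s for PCIe 4.0, 32 GT/s for PCIe 5.0) and 128b/130b line encoding, counts one direction only, and ignores protocol overhead. The function name is made up.

```python
def pcie_gb_per_s(lanes, gt_per_s, encoding=128 / 130):
    """Raw one-direction bandwidth in GB/s for a number of PCIe lanes.

    gt_per_s is the per-lane transfer rate; multiplying by the encoding
    efficiency gives usable Gbit/s, and dividing by 8 converts to GB/s.
    """
    return lanes * gt_per_s * encoding / 8


gen4 = pcie_gb_per_s(128, 16)  # assumed nominal PCIe 4.0 rate
gen5 = pcie_gb_per_s(128, 32)  # assumed nominal PCIe 5.0 rate

print(f"128 lanes PCIe 4.0: {gen4:.0f} GB/s per direction")
print(f"128 lanes PCIe 5.0: {gen5:.0f} GB/s per direction")
```

So the quoted figure is consistent with 128 lanes of PCIe 4.0 in one direction, and a PCIe 5.0 part with the same lane count would have roughly double that on paper.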

There had to be some secret sauce in AMD's offerings to win Frontier like they did. This is certainly one key differentiator. Bear in mind the V-cache solution actually most likely has two layers (as it sits on top of 32MB L3 and is exactly as big on the same process). There is nothing stopping AMD from adding more layers for some server CPUs and I'm convinced now CDNA2 has this stacking as well.

And while all of this is only possible because of AMD's engineering prowess, keep in mind that this is also TSMC's win as much as it's AMD's. They're the only foundry that has anything like that ready in this time-frame. The hoops TSMC had to go through to make this work (and be producible at scale) are also enormous.

All in all, ever since Zen 2 it looks like it's the trifecta of execution (Synopsys + AMD + TSMC) that is to be congratulated. AMD couldn't just do it alone.
 

Doug S

Golden Member
Feb 8, 2020
1,176
1,716
106
Chatting on an internet forum doesn't need most of the instruction sets modern day CPUs provide. Why power all that silicon? Playing a game requires a number of instruction sets that aren't normally used. During that time, the small cores can be put to sleep, giving the big cores more headroom (by way of TDP) to run.

That's completely wrong. You think posting to Anandtech doesn't use SIMD instructions? Check out whatever is responsible in your OS kernel for zeroing pages when a new page is needed, it probably uses AVX2 in some circumstances - and that's the tip of the iceberg. You think floating point isn't needed? Sorry, all math in Javascript is done in floating point, there's no way to avoid it if you are running a browser.

I doubt there's anything you can do with a modern PC or smartphone that would allow any worthwhile reduction of instruction set coverage. Not even running an "idle loop" (which is a halt instruction these days) because there are always background/housekeeping processes running at times so the scheduler, I/O dispatch, filesystem, and other parts of the kernel will remain active.

I don't think you can usefully cut out any instructions from a small core other than 1) AVX512 (and that's only true on x86 because Intel didn't provide for variable SIMD width capability like SVE2) and 2) virtualization. Anything else you cut out will mean almost every thread will be forced onto big cores before long.
 

Doug S

Golden Member
Feb 8, 2020
1,176
1,716
106
…and if I am running with javascript disabled? what if i am writing code in vim? What if the machine is a simple file sharing machine? There are plenty of opportunities to use a small core over a big one. Even something as basic as tracking a mouse pointer doesn’t need to use a big core.
OK sure if you are one of the niche cases of people who disable Javascript or run CLI stuff in console mode, fine I'll grant you that. The overwhelming majority of PC/smartphone users don't do stuff like that.

Tracking a mouse pointer doesn't need the performance of a big core, but it will almost certainly exercise your whole instruction set. Do you have any idea of the size of the hot code footprint tracking a mouse pointer on an otherwise idle system these days? A modern GUI is multiple layers of libraries.

A typical person who will leave Javascript enabled will exercise floating point if that mouse cursor moves in any browser window. When the pointer moves between windows, window expose events will exercise stuff like bcopy/memset that uses AVX2, and so on.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,193
3,584
136
If we go a step back, Ryzen 5000/Zen3 is a tradeoff across so many things

Zen 3 MTr/mm2 is around ~51, MI100 is around ~66, 30% higher. Zen 3 had to trade off 30% density to achieve the high clocks. (RDNA2 as well had similar MTr/mm2 like Zen3, because GPU team learnt from CPU team according to Suzanne Plummer :grinning: )
Why the high clocks, in my opinion:
Because the Core +L2 (around ~204MTr) is not that wide and much smaller than Intel's for example (Sunny Cove is ~283MTr) and Firestorm (~502MTr).
(But they needed to improve the efficiency by making it small to run at such high clocks, so it is kind of a vicious cycle)
Because original design of Zen (1,2,3 at least) is to make the die small for cost, defects, yield etc because AMD cannot charge whatever they want.
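The ratios behind these figures can be worked out in a few lines. This is just arithmetic on the approximate numbers given in the post (51 and 66 MTr/mm2; 204, 283 and 502 MTr for Core + L2), and the variable names are made up. Note that the density gap reads differently depending on which chip is taken as the baseline.

```python
# Approximate figures from the post (all values are the poster's estimates).
ZEN3_DENSITY = 51.0   # MTr/mm2
MI100_DENSITY = 66.0  # MTr/mm2

CORE_PLUS_L2_MTR = {
    "Zen 3": 204.0,
    "Sunny Cove": 283.0,
    "Firestorm": 502.0,
}

# MI100 relative to Zen 3: how much denser the GPU is.
mi100_over_zen3 = MI100_DENSITY / ZEN3_DENSITY - 1.0

# Zen 3 relative to MI100: how much density Zen 3 gives up.
zen3_below_mi100 = 1.0 - ZEN3_DENSITY / MI100_DENSITY

print(f"MI100 is {mi100_over_zen3:.0%} denser than Zen 3")
print(f"Zen 3 is {zen3_below_mi100:.0%} less dense than MI100")

# Core + L2 transistor budget of the other designs relative to Zen 3.
for name, mtr in CORE_PLUS_L2_MTR.items():
    ratio = mtr / CORE_PLUS_L2_MTR["Zen 3"]
    print(f"{name}: {ratio:.2f}x the Zen 3 Core + L2 transistor count")
```

With these inputs MI100 comes out a bit under 30% denser, while the density Zen 3 gives up relative to MI100 is closer to 23%.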

When Zen2 was introduced, they needed to add the GAMECACHE, because they were getting hammered by Intel in a key workload in the Windows world, gaming, but in my opinion it was an improvisation and not what was envisioned during the architectural work 4+ years ago.
What is good though, is that there is not going to be an increase of L3 in Genoa
Increasing L3 size can cause regression in IPC if the increase comes with more cycles and of course there is power involved. V Cache comes with "minimal cost of latency" as per AMD, this means it will cause a minor regression in some workloads. But hitrate is massively increased for workloads like gaming. Thankfully the V Cache can be power gated.
In the end Zen3 Core + L2 + L3 turned out to be big, to address the gaming load. There are other benefits as well in the HPC space, but the effect is profoundly highlighted in the Windows world

Operating range at the very extreme of the Shmoo plot is not exactly going to make the chip efficient

For Zen4
I did mention before that I would prefer AMD don't scale up the frequencies again, otherwise again this same cycle would take effect, but in the PR some days ago Hallock alluded to increasing clocks again soooo :expressionless:

On N5P, there is a lot more room to maneuver if they don't go for the absolute frequency.
The process inherently offers a lot more speed (20% over N7 at same power); with HD cells they could make small adjustments to hit clock targets, assuming their frequency targets are not so high.
This can allow them to minimize the tradeoff of density for speed, which means they can pack more transistors per mm2. This means more logic.
It also means they are not operating at the very extreme of the Shmoo plot and can greatly control the efficiency.

If AMD only take minor speed improvements, say 5%, they can put all the gain into efficiency plus cram in more transistors because there is no need to go for the absolute tradeoff for frequency.
Putting more logic, in the end, can increase "IPC" because you can have more logic blocks, register file, ROB, etc., improving the perf/watt

1634721415965.png
As per TSMC ~4.1GHz is the best range to run the CPU, and probably around 4.3GHz for N5P which AMD will use
So there is a lot of opportunities made available by the process, but it is very interesting indeed what choices AMD will make this time again.

What is known at this point is the die size, 72mm2, at this size, keeping L3 same, the Core+L2 for Zen4 is going to be quite small, slightly higher MTr than Sunny Cove at best.

When you think about this, Sony in PS5 SoC still want to remove blocks from the Core, smh.
 

leoneazzurro

Senior member
Jul 26, 2016
621
886
136
That's a lot more xtors than needed just for the AVX-512 registers, pipelines, etc. Now I'm really curious what's going on. I think those who said this will be like the Zen1 to Zen2 improvement may be correct. There's the usual suspects like op cache, retire buffer, TLBs, etc. But, how about a larger L2? I wish the Zen3 Wikichip page had as much detail as he had for Zen1.
I think the Gigabyte leaks on AM5 mainboards already revealed that Zen4 will have 1Mbyte of L2 cache.

 

moinmoin

Diamond Member
Jun 1, 2017
3,442
4,820
136
If one looks at the changes in the cores from Zen to Zen 2 and the ones from Zen 2 to 3 one can notice that the latter makes mostly architectural changes while the former does mostly size changes (wider, larger, more, needing more die area). I'm expecting Zen 4 to follow the pattern of the former.

The rhythm seems to be:
- Ground up re-design, same node optimization. (~Zen, Zen 3)
- Same design optimized and extended to make good use of the additional area afforded by new smaller node. (Zen 2, Zen 4?)

That'd make Mike Clark's excitement about Zen 5 understandable as well considering that's the next ground up re-design in the queue, the first with AMD being the healthy company it is nowadays.

Btw.
Mike Clark said:
So every three years, we're pretty much redesigning it all.
New Zen gen only every 18 months confirmed. @DrMrLordX vindicated ;)
(The interview is actually a little fuzzy on that since later on they talk about another three years later being Zen 8, not 7. But that's by Ian and Clark just seems to play along without really confirming or denying it.)
 

nicalandia

Golden Member
Jan 10, 2019
1,358
1,565
106
You seem to have taken my point as a lack of faith in AMD's drive to succeed.

All I did was make a purely logical deduction about the necessary time to recoup R&D costs on Zen4 and likely availability of fab capacity on N3 nodes.

Zen4 is unlikely to land before late Q3 2022 - more likely Q4 to prevent it cannibalising Zen3D sales.

Therefore the likelihood of any Zen5 chip launching in 2023 seems low to me.

I could be wrong of course - especially if Zen5 is in fact using some advanced variant of TSMC's 5nm based processes.
Plus Alder Lake and Raptor Lake are not the Core2Duo comeback that some were expecting. Zen3D will even things out until Zen 4 shows up to dominate. Zen4 will dominate gaming and general tasks.
 

SteinFG

Member
Dec 29, 2021
38
52
51
View attachment 55740

Very interesting pic there. I hope Dylan is right on the packaging tech. Either that or I just saved myself from another subscription. (But if he is right I will sub him and pay)
Because, I cannot see any hint of any fancy packaging tech in use there, granted the grey structure obscured everything else. I cannot even see the LGA pattern.
My current theory is that amd is using fan out package just on the IO die in order to decrease the cost of its manufacturing.
Looking at die shots of a 12nm server IO die, most of it is taken up by connectors, about 1/3 is logic, and a little bit of dead space. The dead space is probably a giveaway that the IO die is at the limit.
Moving the IO to 7nm decreases the bump pitch from 150 to 130 micron, which gives about 33% increase in IO density. So, connectors can be 33% smaller. And we actually see this when looking at die shots of Raven Ridge(12nm) vs Renoir(7nm).
But this is not enough: While IO area decreases by 33%, logic is decreasing by over 55%. This will introduce even more dead space.
Moving to fan out will decrease the bump pitch to 40 micron (I'm using TSMC info), which will shrink the connectors by up to 93%.
Optimistically, the IO die will shrink by 70-80%, and what we see at the center of this xray is a rectangular fan out package with a small die at the center of it. But because of bad quality it's impossible to see the die itself.
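The bump pitch arithmetic can be sketched as follows, assuming connector area scales with the square of the bump pitch (a simplification: real pad rings are also constrained by routing, ESD structures and analog circuitry). The pitches are the ones quoted in the post and the function names are made up.

```python
def area_scale(old_pitch_um, new_pitch_um):
    """Relative area needed for the same bump count, assuming area ~ pitch squared."""
    return (new_pitch_um / old_pitch_um) ** 2


def density_gain(old_pitch_um, new_pitch_um):
    """Relative increase in bumps per unit area under the same assumption."""
    return 1.0 / area_scale(old_pitch_um, new_pitch_um) - 1.0


# 12nm -> 7nm: 150 um -> 130 um bump pitch
print(f"150 -> 130 um: density +{density_gain(150, 130):.0%}, "
      f"area -{1 - area_scale(150, 130):.0%}")

# 12nm flip-chip -> fan-out: 150 um -> 40 um bump pitch
print(f"150 -> 40 um:  density +{density_gain(150, 40):.0%}, "
      f"area -{1 - area_scale(150, 40):.0%}")
```

Under this assumption the move from 150 to 130 micron gives roughly 33% more bumps per unit area, which corresponds to about 25% less area for the same bump count, and the move from 150 to 40 micron gives the roughly 93% area reduction.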

edit: I haven't thought about how those RDL wires will carry the signal through the narrow fan out plane, but assuming they are 2/2 micron thick, it's solvable probably
 
