Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
799
1,351
136
Apart from the details of the microarchitectural improvements, we now know pretty well what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", which I think will likely double to 64 MB).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5) and new memory support (likely DDR5).



What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 
Last edited:
  • Like
Reactions: richardllewis_01

turtile

Senior member
Aug 19, 2014
614
294
136
The leak is definitely fake, judging by the way it's written. The statement about the pricing of the 5xxx series is a dead giveaway.

I still believe AMD will be using GF 12nm+ since it's easier to make than 7nm and it won't eat into TSMC capacity. It will get pretty close to 7nm performance and energy use, but with less area reduction. Of course, the I/O die can't shrink as easily as other types of logic, so it makes perfect sense. It will also work with HBM.
 

moinmoin

Diamond Member
Jun 1, 2017
4,933
7,619
136
Personally I'd like to think the likelihood of the IOD moving to a process node more optimized for I/O is higher than using the already capacity-starved usual TSMC nodes. After all, if the I/O logic is already separated out, make the most of it by optimizing it further. Maybe using GloFo's 12FDX?
 
  • Like
Reactions: Tlh97

scannall

Golden Member
Jan 1, 2012
1,944
1,638
136
Personally I'd like to think the likelihood of the IOD moving to a process node more optimized for I/O is higher than using the already capacity-starved usual TSMC nodes. After all, if the I/O logic is already separated out, make the most of it by optimizing it further. Maybe using GloFo's 12FDX?
IO components don't scale down very well anyway, so I don't see a reason to use the bleeding edge on them. TSMC 10nm has plenty of capacity available, is cheap now, and is very good on energy performance compared to GF's 14 and 12 nm processes.
 

jrdls

Junior Member
Aug 19, 2020
12
12
51
IO components don't scale down very well anyway
I've heard this before, but I never understood what people mean when they say it (full disclosure: I'm a complete noob regarding semiconductors). Is it because the die area dedicated to IO doesn't go down as much? Is it because the power consumption of the circuitry dedicated to IO doesn't go down? Or is it something else?
 

DisEnchantment

Golden Member
Mar 3, 2017
1,590
5,722
136
IO components don't scale down very well anyway, so I don't see a reason to use the bleeding edge on them. TSMC 10nm has plenty of capacity available, is cheap now, and is very good on energy performance compared to GF's 14 and 12 nm processes.
Cache also does not scale that much anymore. 1.2x from N7 --> N5
Despite TSMC’s claims of a 1.35x shrink on SRAM from N7 to N5, Apple’s 16MB system cache has only shrunk 1.19x
Logic scales better, at 1.4x - 1.5x from N7 --> N5. Far from the 1.8x scaling advertised by TSMC.

AMD's bet on interconnects and multi-die designs sounds like a reasonable hedge against all these odds.
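To put those density factors in area terms, here is a quick back-of-the-envelope conversion in Python (the factors are the ones quoted above; the snippet itself is purely illustrative):

```python
# Back-of-the-envelope: convert a density scaling factor (new density / old density)
# into the relative area the same block would occupy after the shrink.
def relative_area(density_gain: float) -> float:
    """Area on the new node as a fraction of the old area."""
    return 1.0 / density_gain

# Figures discussed above for N7 -> N5 (observed vs. advertised):
cases = {
    "SRAM, observed (Apple 16MB system cache)": 1.19,
    "SRAM, advertised by TSMC": 1.35,
    "Logic, observed in real products (midpoint of 1.4x-1.5x)": 1.45,
    "Logic, advertised by TSMC": 1.80,
}

for name, gain in cases.items():
    print(f"{name}: {relative_area(gain):.2f}x of the N7 area")
```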
 

Ajay

Lifer
Jan 8, 2001
15,332
7,792
136
Cache also does not scale that much anymore. 1.2x from N7 --> N5

Logic scales better, at 1.4x - 1.5x from N7 --> N5. Far from the 1.8x scaling advertised by TSMC.

AMD's bet on interconnects and multi-die designs sounds like a reasonable hedge against all these odds.
Geez, 'full node' shrinks just aren't what they used to be.
 

scannall

Golden Member
Jan 1, 2012
1,944
1,638
136
I've heard this before, but I never understood what people mean when they say it (full disclosure: I'm a complete noob regarding semiconductors). Is it because the die area dedicated to IO doesn't go down as much? Is it because the power consumption of the circuitry dedicated to IO doesn't go down? Or is it something else?
The features or components don't shrink as well as, say, logic does. And if your IO is on a different die anyway, just pick a smallish, energy-efficient node and be done with it.
 
  • Like
Reactions: Tlh97 and jrdls

DisEnchantment

Golden Member
Mar 3, 2017
1,590
5,722
136
That doesn't smell right.

Cache is supposed to be one of the most easily shrunk items on the die.

Maybe something else is going on... spacing for heat?

I'd like to see a few more comparisons across different design houses before drawing that conclusion.
Right now only Apple has N5 products in the field, so no comparison is possible. But it's real.
Also, for N3 TSMC is a bit more candid, but is still touting their 1.7x density gain, which most likely will not hold true.
Compared to its N5 node, N3 promises to improve performance by 10-15% at the same power levels, or reduce power by 25-30% at the same transistor speeds. Furthermore, TSMC promises a logic area density improvement of 1.7x, meaning that we’ll see a 0.58x scaling factor between N5 and N3 logic. This aggressive shrink doesn’t directly translate to all structures, as SRAM density is disclosed as only getting a 20% improvement, which would mean a 0.8x scaling factor, and analog structures scaling even worse at 1.1x the density.
But real products show that even logic does not scale that much either.
I have read somewhere that analog structures might regress in density, but I will update if I find the source later.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,590
5,722
136
Well, Huawei's (probably short-lived) Mate 40 series is also 5nm.
If someone from SemiAnalysis or IC Insights gets their hands on it, they will analyse it.
But the Exynos 1080 is definitely something they will get their hands on, and we should get some results from them for sure on how well 5LPE compares to N5. This would be a more interesting comparison.
TechInsights did analyse the Exynos 990 and found it to be as good as, if not better than, plain N7: density similar to N7 HD and performance similar to N7 (3x3 structures).
Some excerpts:
This shows the increased density achieved with EUV lithography implementation compared to TSMC’s N7 7.5T 3/3-fin layout having a 300nm standard cell height, and 6T 2-fin layout at 240nm standard cell height.
Intel’s 10nm process has a similar 272nm standard cell height, but achieved this with a 2/3 fin layout. In addition to pitch scaling, in a surprise implementation, Samsung has introduced a SA-DB (Self-Aligned Diffusion Break) that likely reduces performance variation caused by the local layout effect (LLE) of the PMOS transistor. This is the first time we have found this in a device. We also started working on the Snapdragon 765G, fabbed on the same Samsung 7LPP as the Exynos 990; it turned out to have new features: a 243nm standard cell height with a 2/2 fin layout and a 54nm gate pitch. It should be comparable with TSMC N7 or N7P 2/2-fin-layout high-density cells.
 
Last edited:

moinmoin

Diamond Member
Jun 1, 2017
4,933
7,619
136
Cache also does not scale that much anymore. 1.2x from N7 --> N5

Logic scales better, at 1.4x - 1.5x from N7 --> N5. Far from the 1.8x scaling advertised by TSMC.
Right now only Apple has N5 products in the field, so no comparison is possible. But it's real.
Also, for N3 TSMC is a bit more candid, but is still touting their 1.7x density gain, which most likely will not hold true.

But real products show that even logic does not scale that much either.
I have read somewhere that analog structures might regress in density, but I will update if I find the source later.
I wonder if those gains are real but only achievable in specific circumstances that don't apply in the use cases the node is then used for? L1 cache, for example, needs to reach the frequencies the whole CPU is targeted to reach. While cache scales down well, this may only apply at lower frequencies; to then reach higher frequencies, the density has to be loosened again. This may be the case for many of those worse-than-expected scaling results.

Slightly OT: Does this page really not mention anywhere when the article was written? I feel uncomfortable about the complete lack of any timestamp.
 
  • Like
Reactions: Tlh97 and scannall

DisEnchantment

Golden Member
Mar 3, 2017
1,590
5,722
136
Slightly OT: Does this page really not mention anywhere when the article was written? I feel uncomfortable about the complete lack of any timestamp.
Unfortunately, they provide professional services only; what is given there is only a teaser. If you think SemiAccurate's 1K/year is too much ....
But that is expected if you want real data for your competitive business analysis.
@kokhua probably buys a bunch of these papers.

I wonder if those gains are real but only achievable in specific circumstances that don't apply in the use cases the node is then used for? L1 cache, for example, needs to reach the frequencies the whole CPU is targeted to reach. While cache scales down well, this may only apply at lower frequencies; to then reach higher frequencies, the density has to be loosened again. This may be the case for many of those worse-than-expected scaling results.
It is a best-case scenario. Even TSMC itself says 35-40%:
At IEDM, Geoffrey Yeap gave a little more color to that density by reporting that for a typical mobile SoC which consists of 60% logic, 30% SRAM, and 10% analog/IO, their 5 nm technology scaling was projected to reduce chip size by 35%-40%.
Reality is more like the lower end of that, which is unsurprisingly within reach of 5LPE if Samsung is being a little more honest.
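As a rough sketch of that arithmetic in Python (the 60/30/10 mix and the logic/SRAM density factors are the numbers quoted above; the analog/IO factor is only my assumption for illustration):

```python
# Rough sketch of the "35-40% chip size reduction" arithmetic for a typical mobile
# SoC (60% logic, 30% SRAM, 10% analog/IO, per the quote above). Logic (1.8x) and
# SRAM (1.35x) are TSMC's advertised N7 -> N5 density gains; the 1.2x analog/IO
# factor is an assumption for illustration, not an official figure.
mix = {           # fraction of old die area, density gain on the new node
    "logic":  (0.60, 1.80),
    "sram":   (0.30, 1.35),
    "analog": (0.10, 1.20),   # assumed, not an official TSMC number
}

new_area = sum(frac / gain for frac, gain in mix.values())
print(f"New die area: {new_area:.2f}x of the old area "
      f"(~{(1 - new_area) * 100:.0f}% reduction)")
# -> roughly a 36% reduction, i.e. the lower end of the quoted 35-40% range.
```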
 
Last edited:

LightningZ71

Golden Member
Mar 10, 2017
1,627
1,898
136
As was explained to me a few years back, the limited scaling of SRAM cells was actually expected. In general, an SRAM cell has certain structural requirements that make it consume a certain amount of area above and beyond a logic circuit. There is a need for extra space to contain the required transistor structure, and that just doesn't shrink well past a certain point. Unless a completely different structure or technique is chosen, SRAM is expected to almost completely stop scaling down within a couple of nodes.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,590
5,722
136

AMD is using an EPYC Milan-powered MS Azure hybrid cloud solution for their EDA workflows.
Work from home takes on a new meaning.
Can't find a better spot to put this, so I'm just pasting it here.
 
  • Wow
Reactions: lightmanek

Ajay

Lifer
Jan 8, 2001
15,332
7,792
136
TSMC is going with FinFET all the way down to N3. Maybe part of what we are seeing is the limitations of FinFET coming into play. The current densities in the fins must be getting pretty high, and of course the fins are closer together. There may also be significant issues with quantum effects. Part of Intel's shift to 22nm FinFET was about limiting the quantum effects that were starting to cause problems in planar technology. And it seems to be getting even harder to get uniformly sized cells and clean lines, even with EUV.

By the look of these numbers, AMD has a quandary. Zen 3 still has many areas that could be improved, like de-conflicting port availability to increase execution throughput. There is also the question of going wider vs. adding more cores (though, without ballooning the chiplet size, it seems like the max would be 12 cores per chiplet). Adding more chiplets using a larger package size is an option, though the I/O units would become even more complex and power-hungry, and I/O won't scale down as well as SRAM or logic. Hmm, may you live in interesting times.
 
Last edited:
  • Like
Reactions: Tlh97

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
That doesn't smell right.

Cache is supposed to be one of the most easily shrunk items on the die.

Maybe something else is going on... spacing for heat?

I'd like to see a few more comparisons across different design houses before drawing that conclusion.
Cache runs hot, which explains not just the relaxed density but also the shape change. A circle is the worst shape for heat dissipation, a square the next worst. Elongating is one way to handle heat dissipation better, as it provides more "fence" area. You could get fancy with an H or X shape, but elongation works just as well and isn't nearly as spatially problematic. Elongation plus relaxed density is probably just a reasonable middle ground.
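As a quick back-of-the-envelope illustration of the "fence" argument (the dimensions are made up, purely for illustration):

```python
# For the same area, an elongated rectangle has more perimeter (edge in contact
# with cooler neighbouring blocks) than a square. The 100 mm^2 figure is arbitrary.
def edge_per_area(width_mm: float, height_mm: float) -> float:
    """Perimeter-to-area ratio of a rectangular block, in mm of edge per mm^2."""
    return 2 * (width_mm + height_mm) / (width_mm * height_mm)

square    = edge_per_area(10.0, 10.0)   # 10 x 10 = 100 mm^2
elongated = edge_per_area(20.0, 5.0)    # 20 x 5  = 100 mm^2

print(f"Square:    {square:.2f} mm of edge per mm^2")
print(f"Elongated: {elongated:.2f} mm of edge per mm^2 "
      f"({(elongated / square - 1) * 100:.0f}% more 'fence')")
```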
 
Last edited:

jamescox

Senior member
Nov 11, 2009
637
1,103
136
I've heard this before, but I never understood what people mean when they say it (full disclosure: I'm a complete noob regarding semiconductors). Is it because the die area dedicated to IO doesn't go down as much? Is it because the power consumption of the circuitry dedicated to IO doesn't go down? Or is it something else?
I may be wrong here, but AFAIK IO doesn't scale well since the transistors that drive external interfaces need to be larger to provide the amount of power necessary. The power required to drive the external interface doesn't change with a process shrink, so those transistors probably remain very large. The transistors in combinational logic just need to power a (usually) very short set of wires to drive a limited number of other transistors (the fan-out). They can be very, very small, and a shrink that shrinks one transistor probably also shrinks the wires and the fan-out transistors.

Something to note, though, is that the Epyc IO die appears to have the physical interfaces around the edge. The stuff in the middle may be logic that can be shrunk. For things like the memory controllers, they have a unified memory controller (UMC) that contains everything except the physical interface and is independent of it, such that the physical interface can change (like from DDR4 to DDR5). If they go the interposer route, then they may have some chips stacked on top that are IO rather than CPU dies. This would allow them to make chips with the UMC and L4 cache on 5 or 7 nm and then stack them on an interposer with the physical interfaces.
 
  • Like
Reactions: jrdls

Ajay

Lifer
Jan 8, 2001
15,332
7,792
136
Cache runs hot, which explains not just the relaxed density but also the shape change. A circle is the worst shape for heat dissipation, a square the next worst. Elongating is one way to handle heat dissipation better, as it provides more "fence" area. You could get fancy with an H or X shape, but elongation works just as well and isn't nearly as spatially problematic. Elongation plus relaxed density is probably just a reasonable middle ground.
Uhm, I don't think this is true. Cache reduces the energy used by a CPU. Just take note of what direction Apple is moving in with its SoCs. Logic is typically energy-intensive, as is I/O. I don't understand why the 8 transistors per bit would use more energy than any other 8 transistors. I'm playing devil's advocate a bit here, so a link would be helpful.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
Just examples I am giving: the coherency engine is in the L3 of each CCX, and the IOD has the PHY for going off socket. For a single 8-core CCX chip (mobile, for example), this can be stripped, because the L3 only deals with maintaining the directory for its own L2 clients.
This is one example. I am just wondering how many more of these things there are.
I kind of doubt that there really is that much to cut from the CCD between desktop and server. They probably don't really cut anything from the core going to monolithic mobile parts. I expect it would be limited to the L3 level, which has been running at half size; SRAM takes power. If they have a single CCX, then they do not need the hardware in the L3 to track cache coherency across multiple CCXs. I don't know if there is anything else in the L2 or core that they could cut out though, so it seems like it is just limited to the L3, and that probably isn't much of a die area savings. Using the exact same CPU die across the whole product stack (desktop/server/HPC) from 4 cores to 64 cores is a big win.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
I expect the number of cores per CCX will improve. If the 96 core rumour is correct, then I wouldn't be surprised if Zen 4 CCDs have 12 cores and 48MB L3.

The main thing is, I'm betting that when AMD refactored the CCX, they built in some flexibility. Why refactor in a way that only allows eight cores if you're likely to need to refactor again for more? Just do it right the first time. Looking at the die shot of Zen 3 going around kind of supports this. Cores are no longer "mirrored" relative to their closest neighbours, and the interfaces between L3 blocks running east-to-west are no longer there. The area between the east and west L3 regions looks like a giant crossbar, and perhaps this can scale arbitrarily, albeit at the expense of latency.
I really doubt that they will go to a larger CCX in the next generation. I think it will stay at 8 cores for a while. Zen 3 is the new architecture; Zen 4 should mostly be improvements on Zen 3. Most past history indicates that scaling becomes an issue for a monolithic cache with more than 8 cores. The L3 cache may stay at 32MB for a while also, but they may add an L4. They could do things with stacking to provide a larger L3 cache, but keep in mind that a larger cache pretty much always means a slower one, due to the physical laws of the universe. They may move to a 16-core die with 2 CCXs per die, though. That would not be that large at 5 nm.

Also, I expect "Infinity Cache" is going to be used in more products. I kind of hope that Milan gets the 128 MB Infinity Cache in the IO die, but we may not get that until Genoa. Zen 3 is the updated core, and Ryzen based on Zen 3 uses the exact same IO die as Zen 2. We probably have to wait for Zen 4 for the massive IO updates (DDR5, PCI-E 5, maybe Infinity Cache). If they use stacked dies in Genoa, then the core count could be massive. Some TSMC tech may allow for stacking multiple CPU dies, so they could have a 32-core with 1 layer, 64 with 2, 96 with 3, 128 with 4, etc. TSMC has apparently demonstrated 12-high stacked dies without using micro-solder bumps, which makes the whole stack very thin. It has much better thermal performance than tech that uses bumps: there is no space between dies, and the whole stack is very thin, allowing better heat transfer. Power consumption would be an issue, so the clock speed would need to be reduced with each layer. We already have that with Rome anyway, with lower-core-count devices clocking higher due to larger thermal headroom. We also have some weird products like the Epyc 7F52 16-core (1 active core per CCX; a lot of thermal headroom).

Going to PCI-Express 5.0-like speeds for the IFOP (on-package serdes) is probably too much power, so Zen 4 Epyc will probably be stacked with the IO in some manner, even if it is a full interposer. Some TSMC tech embeds a piece of silicon in the package under other chips (similar to Intel's EMIB). They call that one Local Silicon Interconnect (LSI). That should be much cheaper than using a giant interposer under everything. For something like HBM, it would only be partially under the HBM and the GPU or other die. That may be how they do the desktop parts cheaply if they only make one CPU die that is designed for stacking. They would just have a tiny embedded chip with the CPU die and IO die overlapping it. I guess they may also be able to make a die with both types of connections. Zen 1 had pads for 4 IFOP links that were not used at all in the single-die parts. Chip stacking will allow HBM-type connections between chiplets: 1024-bit-wide, low-clocked interfaces rather than 32-bit serdes at ridiculously high clock speeds. They may widen the internal interfaces in Zen 4 to take advantage of the wider pathways. I am also wondering if they are going to find a way to leverage their GPU technology. Even a single, small GPU chiplet could be very powerful; they just need a good way to access it at low latency.
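As a rough sketch of the wide-and-slow vs. narrow-and-fast trade-off (all widths and rates below are made-up round numbers, not actual IFOP or HBM specs):

```python
# Illustrative comparison: a wide, low-clocked HBM-style link vs. a narrow,
# very fast serdes-style link. The rates are round numbers for illustration only.
def bandwidth_gb_s(width_bits: int, rate_gt_s: float) -> float:
    """Raw link bandwidth in GB/s for a given width and per-pin transfer rate."""
    return width_bits * rate_gt_s / 8

wide_slow   = bandwidth_gb_s(1024, 2.0)   # 1024 bits at 2 GT/s (assumed)
narrow_fast = bandwidth_gb_s(32, 64.0)    # 32 bits at 64 GT/s (assumed)

print(f"Wide/slow:   {wide_slow:.0f} GB/s")
print(f"Narrow/fast: {narrow_fast:.0f} GB/s")
# Same raw bandwidth, but the narrow link needs 32x the per-pin signalling rate,
# and driving pins that fast is where much of the interconnect power goes.
```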

See this article (already posted several times) for an overview of TSMC chip stacking tech:

 
  • Like
Reactions: Tlh97