Speculation: Ryzen 4000 series/Zen 3


Panino Manino

Senior member
Jan 28, 2017
821
1,022
136
I read that Keller was chief architect of a "K8 prototype" which was much more powerful than the public K8 (assuming that the ex-DEC engineers were inspired by the very powerful DEC Alpha EV8 with its 4-way SMT). However, it was canceled in favor of a much simpler K8, based on K7 with a 64-bit instruction set. Keller left AMD in 1999, and the K8 design was led by chief architect Fred Weber, who had developed K7 earlier. This is what I read.

Sounds impressive but AMD did the right thing at the time.
With Intel trying to push Itanium AMD had a huge opportunity, plus K7 was still more than good enough to compete.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,774
3,153
136
Assuming Zen3 is a wider (6xALU, SMT4) core... then the 8-core Zen3 chiplet area will be similar to the 12c Zen2 area. I think they can still keep the well-proven quad-core CCX.
I wouldn't assume that. You need to clock this thing: how many read/write ports are you going to have on your PRF? How are you going to handle load and store bandwidth and cache configuration? How are you going to get enough decode bandwidth? There are limited use cases where SMT4 makes sense; otherwise you're just dividing your resources by 4, and in physics things don't scale linearly, so you will lose clocks big time if you scale out all those structures, especially anything that needs to read and write.

Zen3 8c/32t chiplet might be +50% bigger in area/transistors/power consumption, +70% in overall performance. This could be tight to fit 64c under the heat-spreader.
Doubt it. 7nm+ gives a ~20% density improvement; use that 20% to drive perf per clock up 10-20%, increase CCDs to ~100mm2, and you get a 96-core, 192-thread EPYC 3. That will be much better for almost all markets (especially the high-volume server markets) than a 64C/256T chip.
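
To put a rough number on that trade-off, here is a minimal sketch. The core and thread counts are the ones from the two posts above; the function name and the assumption that total throughput is simply cores times per-core throughput (perfect scaling, no memory or power limits) are mine, purely for the arithmetic.

```python
# Back-of-the-envelope comparison of the two hypothetical EPYC 3 layouts.
# Assumption (not from any AMD data): total throughput = cores * per-core throughput.

def break_even_per_core_ratio(cores_a: int, cores_b: int) -> float:
    """Per-core throughput each core of chip B must deliver, relative to a
    core of chip A, for both chips to reach the same total throughput."""
    return cores_a / cores_b

smt2_cores, smt2_threads_per_core = 96, 2   # 96C/192T layout
smt4_cores, smt4_threads_per_core = 64, 4   # 64C/256T layout

print(smt2_cores * smt2_threads_per_core)   # 192
print(smt4_cores * smt4_threads_per_core)   # 256

ratio = break_even_per_core_ratio(smt2_cores, smt4_cores)
print(f"{ratio:.2f}x")                      # 1.50x
```

Under that simplification, each core of the 64C/SMT4 part would have to deliver about 1.5x the throughput of a core in the 96C/SMT2 part just to break even.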
 

Hitman928

Diamond Member
Apr 15, 2012
5,315
7,983
136
I want to say it was our friendly neighborhood cat prophet, but I don't know for sure.

Edit: Yep, I was right. It seems to originate over 2 years ago, starting with this post. However, he was talking about his mythical low-power follow-up to AMD's cat line of CPUs. Obviously that didn't pan out, but it got the talk going about SMT4-capable AMD CPUs.

Later it is mentioned in passing here, in speculation on what Zen2 would look like. The responses basically said it would be interesting to see but was doubtful. A small group of posters continued to speculate about SMT4 being in Zen2. Since that didn't pan out either, the speculation about SMT4 being designed into an AMD architecture shifted to Zen3, starting with this post here.
 

Thunder 57

Platinum Member
Aug 19, 2007
2,675
3,801
136
Friendly neighborhood prophet cat... lol

Nice research. There will be no SMT4 just like there will be no FDSOI or Con core follow up.
 

Saylick

Diamond Member
Sep 10, 2012
3,170
6,404
136
I wouldn't assume that. You need to clock this thing: how many read/write ports are you going to have on your PRF? How are you going to handle load and store bandwidth and cache configuration? How are you going to get enough decode bandwidth? There are limited use cases where SMT4 makes sense; otherwise you're just dividing your resources by 4, and in physics things don't scale linearly, so you will lose clocks big time if you scale out all those structures, especially anything that needs to read and write.

Doubt it. 7nm+ gives a ~20% density improvement; use that 20% to drive perf per clock up 10-20%, increase CCDs to ~100mm2, and you get a 96-core, 192-thread EPYC 3. That will be much better for almost all markets (especially the high-volume server markets) than a 64C/256T chip.

There's a rumor out there that EPYC 3 will use 14+1 chiplets vs. the current 8+1. Assuming more chiplets can't fit onto the current package without shrinking each chiplet, that implies that the transistor count for each CCD can't increase too much over EPYC 2 or else the chiplet size doesn't decrease. If the transistor budget can't increase, I'm not sure how likely it is to extract a whole lot of IPC gains in Zen 3, but this is just a theory.
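
To make the area argument concrete, here is a rough sketch. It assumes the total silicon area available for compute chiplets on the package stays fixed (my simplification; it ignores IOD size changes and spacing), takes the ~20% density figure from the quoted post as given, and uses made-up function names.

```python
# Rough per-chiplet budget if 14 chiplets must share the area 8 use today.
# Assumptions: fixed total CCD area on the package; ~20% density gain on 7nm+.

def max_relative_chiplet_area(current_chiplets: int, rumored_chiplets: int) -> float:
    """Largest area each new chiplet can have, relative to a current one."""
    return current_chiplets / rumored_chiplets

def relative_transistor_budget(relative_area: float, density_gain: float) -> float:
    """Transistors per chiplet relative to today, given area and density scaling."""
    return relative_area * density_gain

area = max_relative_chiplet_area(8, 14)
budget = relative_transistor_budget(area, 1.20)

print(f"{area:.1%}")    # 57.1%
print(f"{budget:.1%}")  # 68.6%
```

Under the strictest reading (no extra room on the package at all), each chiplet would get roughly 57% of the current area, and even with the density gain less than 70% of the current transistor count. Any slack in the package layout relaxes that.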

I personally think some IPC gains have to be introduced with Zen 3. In Zen 2, the uop cache can deliver 8 uops per cycle vs. Zen 1's 6, while the decoders were unchanged. The dispatch width, however, was kept at 6 uops/cycle and the retire rate at 8 uops/cycle. Per AT's reporting, the retire rate was intentionally set higher than the dispatch rate in order to allow the front end to catch up if there's a flush in the pipeline. The number of execution units is the same in Zen 2 as in Zen 1, with the exception of an additional store AGU. I can easily imagine AMD widening the dispatch to 8 uops/cycle and the retire to 10 uops/cycle. This would, in theory, allow for up to a 33% increase in IPC assuming nothing else is a bottleneck (i.e. 6 uops/cycle -> 8 uops/cycle dispatch). To ensure that the dispatch rate of 8 uops is consistently hit, I wouldn't be surprised if AMD adds another AGU to allow for 2 stores and 2 loads.
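
The 33% figure is just the ratio of the widths. A minimal sketch of that arithmetic, assuming dispatch width is the only bottleneck (which it never is in practice) and with a function name of my own choosing:

```python
# Upper bound on IPC gain from widening a pipeline stage, assuming that stage
# is the sole bottleneck. Widths are the ones discussed in the post.

def ideal_gain(old_width: int, new_width: int) -> float:
    """Fractional throughput gain if throughput scales linearly with width."""
    return new_width / old_width - 1.0

print(f"dispatch 6 -> 8: {ideal_gain(6, 8):.0%}")   # dispatch 6 -> 8: 33%
print(f"retire 8 -> 10: {ideal_gain(8, 10):.0%}")   # retire 8 -> 10: 25%
```

Real code rarely sustains the full dispatch width, so this is a ceiling rather than an estimate.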

Sunny Cove can send 6 uops/cycle to the re-order buffer, which itself can dispatch 10 uops/cycle. Zen 2 doesn't have a single re-order buffer for both integer and FP. It has dedicated buffers for each path. Regardless, the total number of uops that can be dispatched across both buffers is 6 uops/cycle, with up to 6 uops/cycle into the integer side and up to 4 uops/cycle into the FP side. I'd really like to see AMD increase the total to 8 uops/cycle, allowing up to 6 uops/cycle for the integer side and 4 uops/cycle for the FP side. Additionally, the re-order buffer sizes should be increased. I'm not sure how this would affect transistor count and power consumption, but it could probably increase IPC by another 15%.

Zen 2:
[Images: Mike Clark, Next Horizon Gaming, CPU Architecture slides, pages 3, 7 and 8]


The decoders in Zen 2 stay the same, we still have access to four complex decoders (compared to Intel’s 1 complex + 4 simple decoders), and decoded instructions are cached into the micro-op cache as well as dispatched into the micro-op queue.

AMD has also stated that it has improved its micro-op fusion algorithm, although did not go into detail as to how this affects performance. Current micro-op fusion conversion is already pretty good, so it would be interesting to see what AMD have done here. Compared to Zen and Zen+, based on the support for AVX2, it does mean that the decoder doesn’t need to crack an AVX2 instruction into two micro-ops: AVX2 is now a single micro-op through the pipeline.

Going beyond the decoders, the micro-op queue and dispatch can feed six micro-ops per cycle into the schedulers. This is slightly imbalanced however, as AMD has independent integer and floating point schedulers: the integer scheduler can accept six micro-ops per cycle, whereas the floating point scheduler can only accept four. The dispatch can simultaneously send micro-ops to both at the same time however.

Sunny Cove:
[Images: Sunny Cove front-end and back-end diagrams]


The micro-op cache gets an update here, from 1.5k entries to 2.25k entries. This is the first time that Intel has increased the micro-op cache size since Haswell, but it should be noted that the competition also has micro-op caches (ARM has 1.5k, AMD has 2k for Zen, 4k for Zen 2), and so refinement in this area is going to be critical. The micro-op cache can supply six micro-ops to the queue per cycle.

Overall, six micro-ops can be fed between the decoders/cache/direct micro-code per cycle. That is split between up to six per cycle from the cache, up to 5 from the decoders, and up to 4 from direct microcode (which gets fed through the complex decoder).

In total, the number of execution ports has increased from 8 in Skylake to 10 in Sunny Cove. This allows for 10 micro-ops per cycle to be dispatched from the reorder buffer, a 25% increase. The two new ports lie in different areas: Skylake had 3 AGUs, supporting two loads and one store per cycle, but Sunny Cove now has 4 AGUs, for two loads and two stores per cycle. The other new port is a store data port. With these changes, the L1 data cache can now support two stores per cycle, effectively doubling the L1 store bandwidth.
 

Veradun

Senior member
Jul 29, 2016
564
780
136
There's a rumor out there that EPYC 3 will use 14+1 chiplets vs. the current 8+1. Assuming more chiplets can't fit onto the current package without shrinking each chiplet, that implies that the transistor count for each CCD can't increase too much over EPYC 2 or else the chiplet size doesn't decrease. If the transistor budget can't increase, I'm not sure how likely it is to extract a whole lot of IPC gains in Zen 3, but this is just a theory.

Unless they choose to go DDR5, since the decoupled memory controller allows them to differentiate between desktop and server, and have a new socket for server only.
 

krumme

Diamond Member
Oct 9, 2009
5,952
1,585
136
I remember Mike Clark saying something to the effect that there was a lot of stuff they wanted to design for Zen 2 that they had to leave out due to time constraints.

It sounds to me as if there is still a lot of lowish-hanging fruit to be had for the Zen 3 core?

I will also presume they will keep their existing IO for both server and desktop and just change the cores, and then wait for DDR5 with Zen 4/5 to make the new IO. What's your take on that?

So for the core part I expect another solid IPC gain next year. And then, in approx. 2 years and on 5nm, another solid gain, with the IO changed as well.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,774
3,153
136
Unless they choose to go DDR5, since the decoupled memory controller allows them to differentiate between desktop and server, and have a new socket for server only.
It's already been confirmed by AMD that Zen3 is DDR4.

https://www.anandtech.com/show/14568/an-interview-with-amds-forrest-norrod-naples-rome-milan-genoa
Forrest Norrod: DDR5 is a different design. It will be on a different socket. We've already said Milan is a mid-2020 platform, and we've already said that's socket SP3, so DDR4 will still be used for Milan.
 

Gideon

Golden Member
Nov 27, 2007
1,644
3,705
136
Yes and maybe no. There were industry rumors that AMD is looking to beat Intel to DDR5, so that statement might be intentionally misleading.

There is still the chance that Zen3 with a different I/O die (but the same core chiplets) will also support DDR5. Perhaps it will come a bit later, and maybe it won't be named Milan, as that might be reserved for the DDR4 version.

Therefore Milan not supporting DDR5 might not be the same as Zen3 not supporting it. I agree I'm reaching a bit, but the possibility is not entirely outlandish.
 

Richie Rich

Senior member
Jul 28, 2019
470
229
76
What might be a Zen3?
Answer is: what is the inevitable future of CPU cores?

The real power is in the back-end and its ALUs, AGUs and FPUs. As a mech engineer I see these as the cylinders in the engine.
The front-end is just feeding them as efficiently as possible, same as an intake manifold feeds an engine. That's all.

The evolution of back-end ALUs was:
- 1995 ... 2xALU Intel P6 uarch, PentiumPro, PII...
- 1997 ... 2xALU AMD/Nexgen K6
- 1999 ... 3xALU AMD K7, Intel PIII
- 2008 ... 4xALU Intel Haswell
- 2012 ... 4xALU AMD Zen
- 2017 ... 6xALU Apple A11 ... most powerful core today (int IPC +76% over Skylake)

x86 CPUs must move to 6xALUs. Since Apple did it, Intel and AMD must do it too. Sure, it will be a hard move, as the move from 3xALU -> 4xALU was; it will need a core re-design from scratch, as Nehalem and Zen were. You don't need to be a genius to predict that the inevitable next step is an 8xALU core design. Or do you think x86 CPUs will sit at a 4xALU design for the next 50 years? No. Apple moved from a weak 4-cylinder engine to their powerful V6. However, I think we deserve V8s.

What is the evolution of SMT?
- 1999 ... introduced by DEC, to be implemented in the EV8 CPU (SMT4) in 2003 (cancelled in 2001 by Compaq in favor of Itanium)
- 2002 ... Intel P4 SMT2
- 2004 ... IBM Power5 SMT2
- 2010 ... IBM Power7 SMT4 dynamic
- 2014 ... IBM Power8 SMT8 dynamic
- 2017 ... AMD Zen SMT2
- 2050 ... x86 still stuck at SMT2?

A 6xALU core might still be fine with SMT2. For high-thread-count server applications SMT4 makes sense even for this core.
An 8xALU core will struggle with just SMT2 from an efficiency standpoint. You do not need to be a genius to predict that SMT4 is an efficient move for this core. SMT4 and SMT8 with a dynamically changing number of threads/priority is actual IBM technology, not sci-fi. Again, you do not need to be a genius to predict that the next step is SMT16 (for a very wide core and some specific server markets). Does SMT4 still look crazy for Zen3?

And don't forget, guys, what Kennedy said: "We choose to go to the Moon in this decade and do the other things, not because they are easy, but because they are hard."
 

Atari2600

Golden Member
Nov 22, 2016
1,409
1,655
136
Therefore Milan not supporting DDR5 might not be the same as Zen3 not supporting it. I agree I'm reaching a bit, but the possibility is not entirely outlandish.

Indeed.

The decoupling of core from uncore* allows AMD to make two I/O dies: one for DDR4, the other for DDR5.

Two design teams, Infinity fabric as the defined interface and you can asynchronously update platform and compute core.

*for the purposes of this, everything in the CCX is considered core, whereas traditionally things like decoders would be considered uncore.


You might even see a point down the line where the CCXs are not homogeneous, but instead there are different CCX designs, one focused more on being optimal for server, the other for HPC. Might not happen, but it's a possibility.
 

krumme

Diamond Member
Oct 9, 2009
5,952
1,585
136
Well, you don't just dump a 6-cyl engine into a platform made for 4 cyl. You need to beef up the rest.

As desktop enthusiasts we want wider stuff. Is there enough mm2 for a 6xALU design in the current socket on 7nm+?
 

Saylick

Diamond Member
Sep 10, 2012
3,170
6,404
136
I think there are still some utilization efficiency gains to be had with the current number of pipelines. To use your car analogy, it's better to extract more HP out of the engine's available displacement than to simply throw more displacement at it. I am familiar with the adage "no replacement for displacement", but in the world of uarch design, perf/W is the name of the game, especially as it pertains to AMD's long-term goals. In our power-constrained environments, better perf/W directly translates to better performance.
 

Richie Rich

Senior member
Jul 28, 2019
470
229
76
Well, you don't just dump a 6-cyl engine into a platform made for 4 cyl. You need to beef up the rest.

As desktop enthusiasts we want wider stuff. Is there enough mm2 for a 6xALU design in the current socket on 7nm+?
That's why I wrote that a wider 6-8xALU core will need a re-design from scratch: back-end, front-end, everything. Same as they did with Zen, so what's the problem? Will their brains hurt during development of this? Yes. Will it take longer to develop than just refurbishing the 4xALU core? Yes. But don't forget you cannot develop a 4xALU design forever, because the fruit gets higher and higher to reach. There is a limit beyond which it's easier to pick the 6xALU fruit. It looks like Apple's engineers discovered it first.

In terms of transistor count, it will cost a lot. A12 is also huge compared to other ARM cores, likely double or triple the transistor count/area. However, instead of doubling the L3 cache, which brings a few % of performance and wastes a lot of transistors, they can use the budget more efficiently. Anybody remember the Athlon K7 with 256kB L2 cache and the Duron with just a tiny 64kB cache? The Duron was very close to the Athlon in performance while saving a lot of die space/transistors.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,608
5,816
136
However, instead of doubling the L3 cache, which brings a few % of performance and wastes a lot of transistors, they can use the budget more efficiently.
They need that big L3 to hide latencies between the CCD and IOD.

The transition from N7 to N7+ won't bring much in terms of clock speed (I would suppose 150 to 250 MHz tops if at all), so AMD has to improve in other areas to bump up performance.
Since it is going to be a major architecture update, unlike Ryzen 2K aka Zen+, we should expect some decent generational gains similar to the transition from Zen+ to Zen2.

Same thoughts to ponder as before:
- How will they spend the transistor budget? The small density increase afforded by N7+ could come in handy to improve the core itself.

- What about the cores? Will there be more Int/FP units? Also front-end improvements?

- Will there be some memory stacked on the dies this time? It has been so consistently depicted in most of AMD's patents.

- There are patents to make the L3 directory visible across all CCXs. Would this make it to Zen 3?

- How will they solve the temperature hotspots? Will the thermoelectric cooler patent make it into the product? Although that one is for 3D-stacked chiplets.

- How will the IO chiplet evolve? Still on 12nm? A reduction in the size of the IOD could afford some more wiggle room for bigger core improvements.

- Will the dies be glued together using an interposer this time, as described in most of the patents?

- How will the dies be packaged? Full 3D will come in Zen4, I believe. What would 2.5D packaging look like?

- IOD to CCD latency, IF improvements? I would suppose they will need to concentrate on this instead of only relying on that massive L3 to hide the latency. Inter-core latency is already quite low on Zen2.


Interestingly, TSMC announced N7P. I wonder if this could go in some of the later Ryzen 3k SKUs? Or is this already what is being used in the Ryzen 3k?
N7+ will be a long node for AMD, all the consoles will also be on it.
 

krumme

Diamond Member
Oct 9, 2009
5,952
1,585
136
Using the same packaging and IO as Zen 2, what is your estimate, guys, of how many extra transistors they could cram in if they go for max use of socket space?
 

Ajay

Lifer
Jan 8, 2001
15,461
7,862
136
Interestingly, TSMC announced N7P

Hmm, maybe Renoir? Not that I've heard anything about that:

Wikichip said:
N7P

TSMC has started rolling out an optimized version of their N7 process called N7 Performance-enhanced version (N7P). This process goes by various other names such as “2nd generation 7 nm” and “7 nm year 2”. This process should not be confused with N7+. N7P is an optimized, DUV-based, process which uses the same design rules and is fully IP-compatible with N7. N7P introduces FEOL and MOL optimizations which are said to translate to either 7% performance improvement at iso-power or up to 10% lower power at iso-speed.
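
For what it's worth, the two quoted figures translate into slightly different perf/W gains. A minimal sketch of that arithmetic, using only the 7% and 10% numbers from the quote above (the function name is mine, and real designs land somewhere between the two corner cases):

```python
# Perf/W implied by the two N7P corner cases quoted from Wikichip.

def perf_per_watt_gain(perf_scale: float, power_scale: float) -> float:
    """Fractional perf/W improvement given relative performance and power."""
    return perf_scale / power_scale - 1.0

iso_power = perf_per_watt_gain(1.07, 1.00)  # 7% more performance, same power
iso_speed = perf_per_watt_gain(1.00, 0.90)  # same performance, 10% less power

print(f"{iso_power:.1%}")  # 7.0%
print(f"{iso_speed:.1%}")  # 11.1%
```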
 

Vattila

Senior member
Oct 22, 2004
799
1,351
136
Interestingly, TSMC announced N7P. I wonder if this could go in some of the later Ryzen 3k SKUs? Or is this already what is being used in the Ryzen 3k?
N7+ will be a long node for AMD, all the consoles will also be on it.

I had a discussion about N7P elsewhere, and it was argued that N7P may not apply to N7 HPC, which is different to the density-focused plain N7. What is the view here on that?

Is the HPC variant just an alternative standard cell library with bigger cells (7.5T vs 6T)? If so, and N7P is mainly cell implementation and transistor improvements, would the N7P improvements apply equally to the HPC library, i.e. for an N7P HPC variant? On the other hand, someone argued that the N7P improvements may only apply to the plain N7 process, and that the improvements may actually just be adoption of improvements already made for N7 HPC.