Speculation: Ryzen 4000 series/Zen 3


NostaSeronx

Diamond Member
Sep 18, 2011
I don't think AMD catches lightning in a bottle twice in such rapid succession.
However, I do think they can technically do it. They have two paths, non-CMT and CMT; 6 ALUs is probably the best choice for the non-CMT path.

[Attachment: cmt1998.png (CMT concept diagram)]
The CMT path follows Keller's CMT work from before the K8 days fairly closely. I rushed the diagram, so some of the behavior isn't properly shown.
CMP1 => All resources can be used by a single thread.
CMT2 => Resources are virtually-divided as if there are two cores.
SMT4 => Resources are further virtually-divided per cluster for two threads each.
The mode is set dynamically based on behavior observed at the prediction/prefetch stage.

Beside the retire queue there is a discrete op-cache; all 4 threads share a single L0i, which might be banked per thread or per cluster. Each SLAQ should coincide with an L0d cache 1/4 to 1/8 the size of the L1d.
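To make the mode split concrete, here is a rough sketch (my own illustration, not anything disclosed by AMD) of how a fixed pool of execution resources might be partitioned per thread in each mode; the resource counts are placeholders, not real Zen 3 numbers:

Code:
#include <stdio.h>

/* Hypothetical front-end modes from the diagram above; the value is the
 * number of active threads the resources are divided between. */
enum mode { CMP1 = 1, CMT2 = 2, SMT4 = 4 };

/* Placeholder per-core resource pool (NOT real Zen 3 numbers). */
struct resources { int alus; int agus; int retire_entries; };

/* CMP1: one thread owns everything.
 * CMT2: resources are split in half, as if there were two cores.
 * SMT4: each half is shared again by two threads. */
static struct resources per_thread_share(struct resources core, enum mode m)
{
    struct resources t = { core.alus / (int)m,
                           core.agus / (int)m,
                           core.retire_entries / (int)m };
    return t;
}

int main(void)
{
    struct resources core = { 6, 4, 224 };   /* placeholder widths */
    enum mode modes[] = { CMP1, CMT2, SMT4 };
    for (int i = 0; i < 3; i++) {
        struct resources t = per_thread_share(core, modes[i]);
        printf("%d thread(s): %d ALUs, %d AGUs, %d retire entries per thread\n",
               (int)modes[i], t.alus, t.agus, t.retire_entries);
    }
    return 0;
}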

Going from 7.5T to 6T cells costs about 13% in performance, with the density impact being an unknown. However, those figures are for the 5nm node: "This study shows that 6-track cells (192nm high) and smart routing results in up to 60% lower area than 7.5-track cells in N5 technology. Standard cells have been created for 7.5T and 6T cells in N5 technology (poly pitch 42nm, metal pitch 32nm)." and for the 12FFC node: "However, there is also a new 6T standard cell library, that pushes density up 1.2X vs the 7.5T library on 16FFC."

Going from DUV 7nm to EUV 7nm+ gives up to a 20% increase in performance. EUV also improves routeability enough to allow roughly another 1.2x density. On top of that, the 7.5T-to-6T route can mean more than 44% higher logic density. Family 19h (Zen 3) isn't Family 17h (Zen 2), so there is a lot more potential for an IPC boost and area compression.
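A back-of-the-envelope reading of where a ~44% figure can come from, assuming the ~1.2x EUV routeability gain and the ~1.2x 6T-library gain simply compound:

Code:
#include <stdio.h>

int main(void)
{
    /* Assumed, per the figures quoted above: ~1.2x density from EUV
     * routeability and ~1.2x from the 6T vs 7.5T library. */
    double euv_routing = 1.2;
    double six_track   = 1.2;
    double combined    = euv_routing * six_track;          /* 1.44x */
    printf("combined logic density gain: %.2fx (+%.0f%%)\n",
           combined, (combined - 1.0) * 100.0);            /* ~ +44% */
    return 0;
}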
 

Thunder 57

Platinum Member
Aug 19, 2007
TSMC's 7+ node allows for a ~10% decrease in power and a ~15% decrease in area for the same implementation, due solely to different transistor characteristics (which may have only been achievable via EUV, not sure). Anyway, more execution resources (at peak utilization), larger cache sizes and such eat more power. I think on some Intel designs, features could only be added if they gave a 2% performance boost for the cost of 1% more power use. That could be what AMD have done - and it surely wasn't easy.

I don't think they'll mess with the caches again so soon. The next move I think they will make is to 1MB L2s, but probably not until 5nm. Intel did have that rule; I believe it came about with Nehalem, which is why we saw SMT make a comeback. I don't know if they still abide by it - I would think probably not. I imagine such a rule would be hard to follow with how boost works these days.
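For what it's worth, a minimal sketch of how such a gate could be applied, assuming the rule is simply "accept a feature only if its performance gain is at least twice its power cost" (the exact criterion has never been published, so the numbers are illustrative):

Code:
#include <stdio.h>

/* Hypothetical feature-acceptance gate: take a feature only if its
 * performance gain is at least twice its power cost. */
static int feature_accepted(double perf_gain_pct, double power_cost_pct)
{
    return perf_gain_pct >= 2.0 * power_cost_pct;
}

int main(void)
{
    printf("+2%% perf for +1%% power: %s\n",
           feature_accepted(2.0, 1.0) ? "accept" : "reject");
    printf("+3%% perf for +2%% power: %s\n",
           feature_accepted(3.0, 2.0) ? "accept" : "reject");
    return 0;
}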

However, I do think they can technically do it. They have two paths, non-CMT and CMT; 6 ALUs is probably the best choice for the non-CMT path.

[Attachment: cmt1998.png (CMT concept diagram)]
The CMT path follows Keller's CMT work from before the K8 days fairly closely. I rushed the diagram, so some of the behavior isn't properly shown.
CMP1 => All resources can be used by a single thread.
CMT2 => Resources are virtually-divided as if there are two cores.
SMT4 => Resources are further virtually-divided per cluster for two threads each.
The mode is set dynamically based on behavior observed at the prediction/prefetch stage.

Beside the retire queue there is a discrete op-cache; all 4 threads share a single L0i, which might be banked per thread or per cluster. Each SLAQ should coincide with an L0d cache 1/4 to 1/8 the size of the L1d.

Going from 7.5T to 6T cells costs about 13% in performance. However, those figures are for the 5nm node: "This study shows that 6-track cells (192nm high) and smart routing results in up to 60% lower area than 7.5-track cells in N5 technology. Standard cells have been created for 7.5T and 6T cells in N5 technology (poly pitch 42nm, metal pitch 32nm)." and for the 12FFC node: "However, there is also a new 6T standard cell library, that pushes density up 1.2X vs the 7.5T library on 16FFC."

Going from DUV 7nm to EUV 7nm+ gives up to a 20% increase in performance. EUV also improves routeability enough to allow roughly another 1.2x density. On top of that, the 7.5T-to-6T route can mean more than 44% higher logic density. Family 19h (Zen 3) isn't Family 17h (Zen 2), so there is a lot more potential for an IPC boost and area compression.

I would agree with the IPC boost except I believe it was Papermaster who said they were focusing on energy efficiency. FWIW, I wouldn't be surprised to see a 6 ALU / 4 AGU design. Nothing to back that up, just seems to make sense.
 

NostaSeronx

Diamond Member
Sep 18, 2011
I believe it was Papermaster who said they were focusing on energy efficiency.
"AMD’s CTO, Mark Papermaster, has said in an interview that the 7nm+ process node will be utilised to maximise efficiency within its Zen 3 CPUs, and will offer only “modest device performance opportunities”. " -> "Looking ahead, a 7-nm-plus node using extreme ultraviolet lithography (EUV) will “primarily leverage efficiency with some modest device performance opportunities,” he said in the interview."
Modest device performance (frequency-related) and maximised efficiency relate to the 7nm+ node, not to the architecture itself. 7nm to 7nm+ is similar enough to the Steamroller-to-Excavator transition. EUV basically increases effective routing capability, just like the 9-track cells and the M7/M8 BEOL getting 1x pitch did in Excavator. The major thing is that Family 19h isn't Family 17h, unlike Family 15h to Family 15h, so higher IPC is more probable. If they can't go faster, they can only get wider, in proportion to the density/power shrink.

The CMT path works better: 16 Zen 2 cores still operate like 16 Zen cores when running 128-bit code, whereas 16 Zen 3 cores (CMT path) would operate like 32 Zen 1 cores on the same 128-bit code. Used this way, CMT runs more efficiently than a 6-wide ALU design.
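To put rough numbers on that claim - my own illustration, assuming Zen 1 has four 128-bit FP pipes, Zen 2 has four 256-bit pipes that each still handle only one 128-bit op per cycle on SSE code, and a hypothetical CMT Zen 3 core has two clusters that each look like a Zen 1 FPU:

Code:
#include <stdio.h>

int main(void)
{
    /* Assumed peak 128-bit FP uops per cycle, per core, on 128-bit (SSE) code. */
    int zen1_per_core     = 4;       /* four 128-bit FP pipes                   */
    int zen2_per_core     = 4;       /* four 256-bit pipes, each only half used */
    int zen3_cmt_per_core = 2 * 4;   /* hypothetical: two Zen 1-like clusters   */

    int cores = 16;
    printf("16 Zen 1 cores:       %d uops/cycle\n", cores * zen1_per_core);
    printf("16 Zen 2 cores:       %d uops/cycle\n", cores * zen2_per_core);
    printf("16 CMT 'Zen 3' cores: %d uops/cycle (like 32 Zen 1 cores)\n",
           cores * zen3_cmt_per_core);
    return 0;
}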
 
Mar 11, 2004
If you're mounting HBM on top of the IO die, then it essentially is an active interposer and would need to be manufactured like an interposer, with through-silicon vias and such. For a higher end laptop, my preference would be an 8 core APU connected to a small, discrete, HBM based GPU for when more performance is necessary. That seems to make the most sense.

I'm not sure it would (how do they manage DRAM stacking in mobile?). I'm not talking about a whole stack, I'm talking about a single-high stack, which should remove the need for TSVs, as you wouldn't be routing through the HBM (which is what the TSVs are there for). Plus there's the possibility that you could implement the HBM in the die itself, and they could segment easily based on the viable amount. I don't believe that the I/O die gains a lot from being shrunk, and to me HBM3 using the same process provides an opportunity that I think would be very beneficial to take advantage of.

I'm talking about doing this in an APU itself. There are quite a few companies that don't want to bother with an extra chip (GPU), but they're fairly constrained by memory bandwidth with regard to GPU performance in current APUs. As they move to chiplets the distinction becomes a bit semantic, but for the OEM it would be a single-chip solution, and it's something they could do without needing to overhaul the work they did on the substrate for Zen 2 - work they talked up quite a bit.

It seems that Keller was brought in to really up Intel's game on server-grade CPUs. I suspect that the server group will, finally, be driving core design and the client group will mainly be tweaking the design for desktop/laptop thermal envelopes. Keller, at this point, is really a systems guy (design, prototyping (FPGA), implementation, and QA). I'm guessing that he really wants to get back to his roots in design.

I would be surprised if Zen3 goes wider. It appears that AMD's focus for Zen3 is power efficiency - going wider would blow that up, I think.

By Keller's own words he's there to develop next-gen interconnect (which I believe he talked about as one that could scale from intra-chip to inter-chip, and then even to the system level - i.e. unified memory/storage that leverages different tiers while trying to make that transparent to the system - and to the network/datacenter; that to me sounds a lot like the talk about moving to fiber optics, where he's likely weighing whether it's time to start that transition or whether they can push the limits of metal first). The way he talked, he doesn't seem to have anything to do with the core designs (architecture, etc.). It seems he's there to get the various chips communicating in an efficient and fast manner (which will be needed with the move to chiplet designs, co-processing and other things).

I think that's what he was working on at Tesla: figuring out how to get all the various components (sensors, processing) communicating while trying to cut down the wiring (for weight, complexity, and cost reasons), yet pushing latency down and throughput up.

And I think there was talk that that was actually kind of his focus with Zen (basically Infinity Fabric and designing the chips to utilize it). I might be very wrong, but I do know he himself said he's at Intel to develop interconnect.
 

soresu

Platinum Member
Dec 19, 2014
I'm talking about a single-high stack which should remove the need for TSVs as you wouldn't be routing through the HBM
Nope, as far as I am aware all HBM uses a logic die at the bottom of each stack.

This means TSVs are always needed.
 

Veradun

Senior member
Jul 29, 2016
Nope, as far as I am aware all HBM uses a logic die at the bottom of each stack.

This means TSVs are always needed.
Yup. The fabled HBM-LC promised getting rid of the base logic die (and being compatible with organic interposers). We heard about it at Hot Chips 28 from Samsung, IIRC, and never heard about this marvel again. Dead.
 

Richie Rich

Senior member
Jul 28, 2019
It seems that Keller was brought in to really up Intel's game on server-grade CPUs. I suspect that the server group will, finally, be driving core design and the client group will mainly be tweaking the design for desktop/laptop thermal envelopes. Keller, at this point, is really a systems guy (design, prototyping (FPGA), implementation, and QA). I'm guessing that he really wants to get back to his roots in design.

I would be surprised if Zen3 goes wider. It appears that AMD's focus for Zen3 is power efficiency - going wider would blow that up, I think.
Zen 3 with 6 ALUs and SMT4 (6/4 = 1.5 ALUs per thread) is going to be more efficient than Zen 2 (4 ALUs / SMT2 = 2 ALUs per thread). The efficiency is there. IMHO the whole SMT4 extension is a by-product of a wider core.
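Spelling out the per-thread arithmetic (the Zen 2 figures are known; the 6-ALU/SMT4 Zen 3 configuration is just the rumor being discussed here):

Code:
#include <stdio.h>

int main(void)
{
    /* Zen 2: 4 integer ALUs shared by 2 SMT threads (known).            */
    /* "Zen 3": 6 ALUs shared by 4 SMT threads (the rumor, unconfirmed). */
    double zen2_alus_per_thread = 4.0 / 2.0;   /* 2.0 */
    double zen3_alus_per_thread = 6.0 / 4.0;   /* 1.5 */
    printf("Zen 2: %.1f ALUs/thread, rumored Zen 3: %.2f ALUs/thread\n",
           zen2_alus_per_thread, zen3_alus_per_thread);
    return 0;
}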

Today everybody knows that we have hit a 5 GHz frequency wall. I'm afraid that, due to increasing heat density, the frequency wall will drop to 4 GHz at 5nm for today's chip designs. Knowing this, how will engineers raise the performance of future CPUs?
  1. More cores - since the chiplet design, AMD has been filling the socket area with the maximum number of cores it can take, so there is not much left to develop here.
  2. Smart improvements that fight the heat-density frequency wall - less energy means lower temperatures and therefore higher clocks.
  3. A wider core - good old brute force is now probably the lowest-hanging fruit (no wonder the fruit company discovered that first).
IMHO Zen 3 just brings what is inevitable sooner or later - a wider core, a new uarch. It's hard for me to imagine that Keller came to AMD and designed a uarch that only matches Intel's performance without picking the lowest-hanging fruit the way Apple did.
They have a lot of room for improvement in a new 6-ALU uarch - Zen 4 and Zen 5 will be based on this 6-ALU core.
 

maddie

Diamond Member
Jul 18, 2010
Nope, as far as I am aware all HBM uses a logic die at the bottom of each stack.

This means TSVs are always needed.
Not accurate. The spec allows the logic to be integrated with the memory cells. It's just that the separate logic die was chosen in the first and current implementation.
 

DrMrLordX

Lifer
Apr 27, 2000
This, by the way, has nothing to do with the FP units themselves. The limiting factor for FP-heavy code on BD is practically always store throughput.

I think you have a point, but even if store throughput were improved such that it were no longer a bottleneck, you'd still see problems with CMT vs SMT. People could go back and compare Piledriver to Haswell running SSE2 or raw x87 if they really want to.

In any case, CMT is never coming back. Corporations don't always do the rational thing. Even if CMT was somehow an actually good decision going forwards, AMD will never use it again because the people involved who championed it are gone, and no-one wants to tie their name to failure by picking up the torch for a design element that is now seen as a really bad idea.

Agreed. Their SMT implementations have been too successful for them to want to try CMT again.
 

Ajay

Lifer
Jan 8, 2001
By Keller's own words he's there to develop next-gen interconnect (which I believe he talked about as one that could scale from intra-chip to inter-chip, and then even to the system level - i.e. unified memory/storage that leverages different tiers while trying to make that transparent to the system - and to the network/datacenter; that to me sounds a lot like the talk about moving to fiber optics, where he's likely weighing whether it's time to start that transition or whether they can push the limits of metal first). The way he talked, he doesn't seem to have anything to do with the core designs (architecture, etc.). It seems he's there to get the various chips communicating in an efficient and fast manner (which will be needed with the move to chiplet designs, co-processing and other things).

Guess I didn't hear that talk. There was a video circulating recently, but I haven't seen it yet. I guess I'd better - seems like I'm behind the times.
 

soresu

Platinum Member
Dec 19, 2014
Yup. The fabled HBM-LC promised getting rid of the base logic die (and being compatible with organic interposers). We heard about them at Hot Chips 28 from Samsung iirc and never heard again about this marvel. Dead.
I'm not sure that demand for HBM had ramped up back then (it's more than just GPUs now); the whole low-cost thing was about drumming up more business without completely going back to the drawing board.

They proposed a 50% improvement in bandwidth per pin (1.6 -> 2.4 Gbps) to offset a lower number of TSVs, which would also lower costs - but as I said, sheer market demand made HBM LC redundant; customers wanted full-fat HBM, and at this point HBM2 specs have already surpassed 3 Gbps, so clearly that went out the window.
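In rough numbers (the per-pin rates are from the figures above; the halved I/O width for the low-cost variant is purely my assumption, for illustration of the trade-off):

Code:
#include <stdio.h>

int main(void)
{
    /* Standard HBM2 stack vs the proposed low-cost variant.
     * Per-pin rates as quoted above; the halved I/O width for HBM-LC
     * is only an assumption used for illustration. */
    double hbm2_gbps  = 1.6, hbm2_pins  = 1024.0;
    double hbmlc_gbps = 2.4, hbmlc_pins = 512.0;

    printf("per-pin uplift: +%.0f%%\n", (hbmlc_gbps / hbm2_gbps - 1.0) * 100.0);
    printf("HBM2 stack:   %.1f GB/s\n", hbm2_gbps  * hbm2_pins  / 8.0);  /* 204.8 */
    printf("HBM-LC stack: %.1f GB/s (assumed width)\n",
           hbmlc_gbps * hbmlc_pins / 8.0);                               /* 153.6 */
    return 0;
}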

The main thing I wonder about is: why all these improvements to HBM2?

What happened to HBM3?

They keep saying it's coming, but I believe it's undergoing a significant overhaul to pick up business dropped by the failure of HMC, possibly some cost/production speed optimisations too.

Already you could replace a whole bunch of widely spaced GDDR6 chips with a single 16GB stack, dramatically reducing the necessary PCB real estate; some further cost/production optimisation could make HBM the obvious option regardless of market segment.
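For scale, a rough comparison of what one stack replaces, using assumed typical figures (14 Gbps GDDR6 across eight x32 chips, i.e. a 256-bit bus, versus a single HBM2E stack at 3.2 Gbps on its 1024-bit interface) rather than any specific product:

Code:
#include <stdio.h>

int main(void)
{
    /* Assumed typical configurations, not any specific product:
     * 14 Gbps GDDR6 on a 256-bit bus (8 x32 chips) vs one HBM2E stack
     * at 3.2 Gbps on a 1024-bit interface. */
    double gddr6_gbps = 14.0; int gddr6_chips = 8; int gddr6_bits_per_chip = 32;
    double hbm2e_gbps = 3.2;  int hbm_bus_bits = 1024;

    double gddr6_gbs = gddr6_gbps * gddr6_chips * gddr6_bits_per_chip / 8.0; /* 448  */
    double hbm_gbs   = hbm2e_gbps * hbm_bus_bits / 8.0;                      /* ~410 */

    printf("8x GDDR6 chips spread around the package: %.0f GB/s\n", gddr6_gbs);
    printf("1x HBM2E stack in a single footprint:     %.0f GB/s\n", hbm_gbs);
    return 0;
}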
 

soresu

Platinum Member
Dec 19, 2014
Not accurate. The spec allows the logic to be integrated with the memory cells. It's just that the separate logic die was chosen in the first and current implementation.
Still, the limited density of DRAM means that a stack is necessary to achieve density parity with GDDR, for GPUs at least (Nantero/Fujitsu NRAM offers an interesting peek into a possible high-density, non-volatile future for HBM; they have a roadmap going up to 256 Gbit 4-layer dies on 7nm).

OTOH, perhaps an interesting path would be to integrate the HBM logic with an active interposer.

Though apparently a significant advantage of the logic-die configuration is being able to test a completed stack BEFORE integrating it with the interposer; that way they avoid bonding stacks that are already DOA and compromising the final product.
 

moinmoin

Diamond Member
Jun 1, 2017
I suspect that the server group will, finally, be driving core design
So all the AVX upgrades (which were the biggest changes to the cores since Sandy Bridge) weren't primarily done for the server market?
 

Ajay

Lifer
Jan 8, 2001
So all the AVX upgrades (which were the biggest changes to the cores since Sandy Bridge) weren't primarily done for the server market?
Intel has talked about a "server first" CPU development strategy for a while. It hasn't come to fruition yet. Server and client cores are differentiated (AVX512 and mesh interconnect on server CPUs). But, AFAIK, client teams have still been driving the initial development of a given architecture. Maybe that will finally change with Sapphire Rapids - we will see.
 

Tuna-Fish

Golden Member
Mar 4, 2011
So all the AVX upgrades (which were the biggest changes to the cores since Sandy Bridge) weren't primarily done for the server market?

Most of the server market generally doesn't give a damn about SIMD. It matters a lot to HPC, is somewhat useful to consumers, and is fairly rarely used in servers.

Intel has tried marketing it a lot to the server market, which might give the impression that it matters there, but it just doesn't.
 

moinmoin

Diamond Member
Jun 1, 2017
Ok, so AVX is now a consumer feature (about as wasteful as iGPUs). Intel's core design "now" becoming server (not HPC) first means Intel is going to double ALUs or something along that line?
 

soresu

Platinum Member
Dec 19, 2014
Most of the server market generally doesn't give a damn about SIMD
Depends on the customer, and their priority use cases.

Given the heavy use of SIMD in software video encoders, you can bet that Google, Netflix, Twitch, Vimeo and anyone else heavily reliant on video platforms will eye any increases in SIMD performance very closely.
 

Ajay

Lifer
Jan 8, 2001
Ok, so AVX is now a consumer feature (about as wasteful as iGPUs). Intel's core design "now" becoming server (not HPC) first means Intel is going to double ALUs or something along that line?
AVX512 - which is completely pointless on the desktop, IMHO. All I can think is that it's a marketing differentiator and one that those who aren't in the know will want just because it must be better.
 

Ajay

Lifer
Jan 8, 2001
Depends on the customer, and their priority use cases.

Given the heavy use of SIMD in software video encoders, you can bet that Google, Netflix, Twitch, Vimeo and anyone else heavily reliant on video platforms will eye any increases in SIMD performance very closely.
Seems like AVX2 should handle all video transcoding just fine. Don't need AVX512 for that.
 

soresu

Platinum Member
Dec 19, 2014
Seems like AVX2 should handle all video transcoding just fine. Don't need AVX512 for that.
Certainly AVX512-related throttling is problematic, but from what I've gathered from regular visits to the Doom9 forum, it otherwise seems like a solid boost, much as I hate to admit it as an AMD lifer since 2006.

The 'just fine' might work for now, but all the industry bigwigs keep saying that video is consuming a greater and greater percentage of net traffic, even as total net traffic increases - which presumably includes uploading to sites like YouTube and Vimeo, which is where the video encoding comes in.

Though I guess an argument can be made that focusing on more cores and more efficient cores, rather than longer SIMD vectors, would be the better path.

Oddly, the discussion around AVX1 and 2 was always mostly about vector length rather than the number of operands, which I believe AVX also increased to 3.

I wonder if an increase in operands per instruction might be on the cards....
 

soresu

Platinum Member
Dec 19, 2014
That was the old AMD FMA3. AVX doesn't work the same way, IIRC.
"AVX introduces a three-operand SIMD instruction format, where the destination register is distinct from the two source operands. For example, an SSE instruction using the conventional two-operand form a = a + b can now use a non-destructive three-operand form c = a + b, preserving both source operands. AVX's three-operand format is limited to the instructions with SIMD operands (YMM), and does not include instructions with general purpose registers (e.g. EAX). Such support will first appear in AVX2 "

From the AVX Wikipedia page.
 

Ajay

Lifer
Jan 8, 2001
"AVX introduces a three-operand SIMD instruction format, where the destination register is distinct from the two source operands. For example, an SSE instruction using the conventional two-operand form a = a + b can now use a non-destructive three-operand form c = a + b, preserving both source operands. AVX's three-operand format is limited to the instructions with SIMD operands (YMM), and does not include instructions with general purpose registers (e.g. EAX). Such support will first appear in AVX2 "

From the AVX Wikipedia page.

Okay. I thought FMA was a = (b*c)+d;
A product plus an offset. Back to school for me I guess; I need to look at some code. :oops:
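For what it's worth, a small illustration of the difference being discussed - destructive two-operand SSE encodings, non-destructive three-operand AVX encodings, and FMA (which is indeed a product plus an addend, just folded into one instruction rather than being a new operand format). Standard intrinsics, nothing exotic:

Code:
#include <stdio.h>
#include <immintrin.h>

/* Build with e.g.: gcc -mavx -mfma example.c */
int main(void)
{
    __m128 a = _mm_set1_ps(1.0f), b = _mm_set1_ps(2.0f);

    /* SSE encoding is two-operand: at the ISA level "addps a, b" computes
     * a = a + b, overwriting the first source. */
    __m128 sse_sum = _mm_add_ps(a, b);

    /* AVX (VEX) encoding is three-operand: "vaddps c, x, y" writes a separate
     * destination and leaves both sources intact. */
    __m256 x = _mm256_set1_ps(1.0f), y = _mm256_set1_ps(2.0f);
    __m256 avx_sum = _mm256_add_ps(x, y);

    /* FMA: x*y + z in a single instruction. FMA3 reuses one of the source
     * registers as the destination rather than adding a fourth operand. */
    __m256 z = _mm256_set1_ps(3.0f);
    __m256 fma = _mm256_fmadd_ps(x, y, z);

    float s[4], v[8], f[8];
    _mm_storeu_ps(s, sse_sum);
    _mm256_storeu_ps(v, avx_sum);
    _mm256_storeu_ps(f, fma);
    printf("sse: %.1f  avx: %.1f  fma: %.1f\n", s[0], v[0], f[0]);
    return 0;
}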
 

DisEnchantment

Golden Member
Mar 3, 2017
Video might get taken down


[Attachments: Untitled.png, Untitled2.png - screenshots of the presentation slides]




Zen 3 Milan highlights [AMD, Martin Hilgeman]
- Unified L3 32+ MB per CCD
- Sampling already
- 7nm
- Same core count as Rome
- 2x SMT
- Planned for Q3 2020
- DDR4/SP3

What is Zen3's special sauce gonna be?
- Bigger cache most likely (32MB+)
- Improved IF
- ...
 