Speculation: Ryzen 4000 series/Zen 3


amd6502

Senior member
Apr 21, 2017
634
152
86
Cache requires 8 transistors for a single bitcell, plus 1 for the access path. (more for tags, but let's just ignore them for now.) So you need at least 9*8*4*1024*1024 ~= 300M transistors for 4MB of cache.
That would be too much I think; was doing a ballpark area guess late at night.

So 9*4*1M= 36M

Compares better than order of magnitude close, as 40M equals difference between (X ± 20M).

On another semi related topic, right now I'm looking at the RWT history of the EV8 tragedy linked by nicelandia.

Alpha EV8 (Part 1): Simultaneous Multi-Threat
https://www.realworldtech.com/alpha-ev8-wider/

The IPC gained from going out-of-order is actually quite small; they kept the transistor growth relatively low, from EV5's 9M to EV6's 15M. Adjusting for the doubled clock speed, we see the IPC increase is not that big:

At 300MHz, int/FP for EV5 is: 8/13
For EV6 it would be: 15/24

EV7's transistor budget ballooned, and it clocked to 1.5GHz. At 300MHz the projected IPC (neglecting gains from lower cycle RAM benefits) would be:

EV7: 18/32
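For anyone who wants the ratios spelled out, here is a minimal Python sketch that takes only the 300MHz-equivalent int/FP scores quoted above and prints each generation relative to EV5. The names `SCORES_AT_300MHZ` and `gain_over` are made up for illustration, and the scores are simply the ones from this post, not independently verified.

```python
# 300MHz-equivalent (int, FP) scores exactly as quoted in the post above.
SCORES_AT_300MHZ = {
    "EV5": (8, 13),
    "EV6": (15, 24),
    "EV7": (18, 32),
}


def gain_over(baseline: str, target: str) -> tuple[float, float]:
    """Return (int_ratio, fp_ratio) of target relative to baseline."""
    base_int, base_fp = SCORES_AT_300MHZ[baseline]
    tgt_int, tgt_fp = SCORES_AT_300MHZ[target]
    return tgt_int / base_int, tgt_fp / base_fp


for chip in ("EV6", "EV7"):
    int_ratio, fp_ratio = gain_over("EV5", chip)
    print(f"{chip} vs EV5 at equal clock: int x{int_ratio:.2f}, FP x{fp_ratio:.2f}")
```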

So it looks like they turned this into a long core speed demon; something that Bulldozer likely inherited from.

More reading for anyone interested: 1. http://alasir.com/articles/alpha_history/alpha_21364_21464.html 2. http://alasir.com/articles/alpha_history/alpha_21264.html

It would also be interesting to know how many transistors were added during the POWER SMT4 jump.
 

moinmoin

Senior member
Jun 1, 2017
988
755
106
Just to clarify, Banded Kestrel was the embedded name for Stoney Bridge right (XV + GCN3).
No, Banded Kestrel was an early internal name for the 2c4t counterpart to the 4c8t Raven Ridge (originally Great Horned Owl), so Zen 1 with Vega 10 plus VCN 1.0. It ended up only launching very late for some reasons (likely the Raven Ridge dies were cheap enough to be used for cut down chips like the Athlon APUs) so the only product officially using it right now is the R1000 embedded series.

Old slides from early 2016:

 

Tuna-Fish

Senior member
Mar 4, 2011
984
404
136
Huh, I thought Intel used to use 6T cells (back, oh, 20+ years ago). Are 8T cells more stable at higher frequencies (or lower drive currents)?
They used to, but everyone's moved back to 8T cells with smaller processes. I don't know the precise reason why. L3 moved to 8T cells last, the inner caches switched first, so that might suggest speed as the reason?

That would be too much I think; was doing a ballpark area guess late at night.

So 9*4*1M= 36M
Did you mean to say L2? Because that's 4M bits, or 512 kilobytes. Your original post said L3.
 

amd6502

Senior member
Apr 21, 2017
634
152
86
Did you mean to say L2? Because that's 4M bits, or 512 kilobytes. Your original post said L3.
No, I mean L3. 300M estimate sounds large. But I'll look closer at the die shots of the CCX again and get back later tonight. It might also be the case that the density of transistors is far from uniform and doesn't relate well to area.
 

Tuna-Fish

Senior member
Mar 4, 2011
984
404
136
No, I mean L3. 300M estimate sounds large.
300M is not an estimate, it's a hard lower bound. They can't fit bits into wishes, they need 8 transistors per cell. (8 bits per byte) * 9 transistors per bit * 4 million ~= 300M. On top of that they need more transistors for redundancy, tags, comparators, etc.

But I'll look closer at the die shots of the CCX again and get back later tonight. It might also be the case that the density of transistors is far from uniform and doesn't relate well to area.
That is definitely true. SRAM is generally the densest kind of structure, and cache is packed to be as dense as possible because signal delay is a substantial part of latency, so density is not just cost, it's also speed.
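The lower-bound arithmetic can be written out as a small Python sketch. It assumes 9 transistors per bit (the 8T cell plus 1 for the access path, as stated in the quoted post) and ignores tags, redundancy and comparators; `sram_transistors` is a hypothetical helper name, not anything from a real tool.

```python
# Lower bound on SRAM transistor count, following the arithmetic in the thread.
# Assumption: 9 transistors per bit (8T cell + 1 for the access path);
# tags, redundancy and comparators are deliberately left out.
MIB = 1024 * 1024


def sram_transistors(capacity_bytes: int, transistors_per_bit: int = 9) -> int:
    """Return the minimum transistor count for a cache of the given size."""
    return capacity_bytes * 8 * transistors_per_bit


four_mbyte = sram_transistors(4 * MIB)       # 4 MB of cache
four_mbit = sram_transistors(4 * MIB // 8)   # 4 Mbit = 512 KB

print(f"4 MB cache:      {four_mbyte / 1e6:.0f}M transistors")
print(f"4 Mbit (512 KB): {four_mbit / 1e6:.0f}M transistors")
```

The second line is the case the "9*4*1M = 36M" figure corresponds to: it counts 4M bits rather than 4M bytes, which is where the factor of 8 went missing.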
 

amd6502

Senior member
Apr 21, 2017
634
152
86
300M is not an estimate, it's a hard lower bound. They can't fit bits into wishes, they need 8 transistors per cell. (8 bits per byte) * 9 transistors per bit * 4 million ~= 300M. On top of that they need more transistors for redundancy, tags, comparators, etc.



That is definitely true. SRAM is generally the densest kind of structure, and cache is packed to be as dense as possible because signal delay is a substantial part of latency, so density is not just cost, it's also speed.
So looking at the die shot helps. This is the 44mm2 Summit CCX with 8MB L3:


So it's roughly 30%+ for a pair of cores (L2 inc'd) and 40%- for the 8MB L3.

If we are to take the average density of the die, approx: 5000 Mt / 200 mm2 = 25Mt/mm2

Then we would get for 4MB:

20% * 44mm2 *25 Mt/mm2 which equals 5*44 Mt, or 220M transistors.

Now that number is a whole lot better than my wild guess of 40M (for which I admit I was thinking, not very well, of the CCX vaguely from memory).

At least 220M is ballpark close to 300M. As for the difference it seems there likely is a big error (almost 50%) from assuming the average density is close to the cache's density.

Hopefully the cores are about average density. Then the area estimate gives us about 170M transistors each (with the L2). So by this estimate, Zen1 is really very area efficient, with a transistor count right between EV7's 130M and the 210M for a Dozer module (170M is 40M more than EV7 and 40M less than a Dozer module).
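The area-based estimate above can be reproduced with a few lines of Python. This is only a sketch of the same back-of-the-envelope method: every input is a rough figure from this post, the die-average density is assumed to hold everywhere (which the thread already notes is wrong for SRAM), and the names are hypothetical.

```python
# All inputs are the rough figures from the post, not measured values.
DIE_TRANSISTORS_M = 5000   # ~5000 Mt for the whole die
DIE_AREA_MM2 = 200         # ~200 mm2 die
CCX_AREA_MM2 = 44          # Summit Ridge CCX with 8MB L3
SRAM_LOWER_BOUND_M = 300   # ~300M lower bound for 4MB quoted earlier in the thread


def block_transistors_m(fraction_of_ccx: float) -> float:
    """Estimate Mt in a block, assuming die-average density everywhere."""
    density = DIE_TRANSISTORS_M / DIE_AREA_MM2  # Mt per mm2
    return fraction_of_ccx * CCX_AREA_MM2 * density


l3_4mb_m = block_transistors_m(0.20)     # half of the ~40% taken by the 8MB L3
core_pair_m = block_transistors_m(0.30)  # ~30% for a pair of cores incl. L2
per_core_m = core_pair_m / 2

# How far the area estimate falls short of the SRAM lower bound.
shortfall = 1 - l3_4mb_m / SRAM_LOWER_BOUND_M

print(f"4MB L3 by area: {l3_4mb_m:.0f} Mt")
print(f"core pair:      {core_pair_m:.0f} Mt ({per_core_m:.0f} Mt per core)")
print(f"shortfall vs lower bound: {shortfall:.0%}")
```

Since the pair of cores is "30%+" of the CCX, the per-core result is a floor, which is consistent with rounding it up to about 170M.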

**Next topic, back to Power7/8**

By a much cruder estimate, Power7 has around 75 Mt per core, assuming roughly half of the 1.2 Bt die (45nm) is occupied by its 8 cores. https://en.wikipedia.org/wiki/POWER7 https://en.wikichip.org/wiki/ibm/microarchitectures/power7 I'd say that number is a lowball estimate since the cores seem to take more than half the die:

Each core is capable of four-way simultaneous multithreading (SMT). The POWER7 has approximately 1.2 billion transistors and is 567 mm2 large fabricated on a 45 nm process. A notable difference from POWER6 is that the POWER7 executes instructions out-of-order instead of in-order.
Power8 went SMT8 and a large die was made on 22nm: "12-core version consists of 4.2 billion transistors and is 650 mm2 large" Applying the same ballpark estimate: 2.1B/12 shows the cores are on the order of 175 Mt, which again would be very impressive, and a little encouraging for those hoping to see x86 go SMT4.
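The same crude per-core estimate for both chips, as a Python sketch. The 50% core share of the die is this post's assumption (and likely a lowball for Power7, as noted above); `per_core_mt` is a hypothetical helper.

```python
def per_core_mt(die_transistors_m: float, cores: int, core_share: float = 0.5) -> float:
    """Crude transistors-per-core estimate in Mt.

    core_share is the assumed fraction of the die's transistors that sits
    in the cores; 0.5 is the guess used in the post, not a measured value.
    """
    return die_transistors_m * core_share / cores


power7 = per_core_mt(1200, cores=8)    # 1.2B transistors, 8 cores, SMT4
power8 = per_core_mt(4200, cores=12)   # 4.2B transistors, 12 cores, SMT8

print(f"Power7: ~{power7:.0f} Mt per core")
print(f"Power8: ~{power8:.0f} Mt per core")
```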
 

DrMrLordX

Lifer
Apr 27, 2000
13,040
2,705
136
WTH, who is talking about copying Power or Intel? Are you even following this discussion?
AMD's move to SMT2 was obviously their admission that Intel's choice of SMT2 (HT) was correct while AMD's choice of CMT was incorrect. AMD had every opportunity to roll out yet more CMT chips ("This time for sure!") but they didn't. Zen better met the expectations of buyers by behaving more like an Intel CPU and less like Piledriver. They were copying Intel. Actually what they did was beat Intel at their own game since AMD's implementation of SMT is better than HT. Moving to SMT4, however . . . whom do we know that did the same?

And I don't think you correctly interpret the market if you really think people are happy to suddenly decrease the number of cores again just because SMT2 were to be replaced by SMT4.
On the contrary, I think people would be unhappy with that eventuality. Yet in moving to 7nm+ on Zen3, I don't think AMD will be able to offer the same core density if they move from 4-wide SMT2 cores to 6-wide SMT4 cores. So bigger chiplets or fewer cores . . . pick one. A theoretical 20% increase in transistor density isn't going to go far enough to let them cram all those extra transistors in the same space.

You are just repeating me.
Negative, I do not consider HT to have been a "stagnant" premium. It made a real difference in the longevity of chips that had it. See 2500k vs 2600k, for example.

There is zero indication AMD would ever sell a 4c/16t CPU as their new 8c/16t chips.
There is zero indication that AMD will embrace SMT4 for Zen3 at all. I'm speculating (read thread title) that if AMD were to embrace SMT4, it would necessitate changes in the way they marketed their chips, since it would likely be impossible for them to sell an 8c/32t chip at the same price point as a 3700x or 3800x.

"Nobody did it that way before so it won't happen."
IBM did it before. So did Sun. Look where it got them.

Let's agree to disagree.
Alrighty then.

Possibly, but there is still Dali which was meant to be the value APU option, which would imply monolithic/small to me.

I think Renoir will be chiplet still, it's more a question of how that will be configured.

Of course they could both be monolithic but that seems doubtful to me.
Gonna agree with @moinmoin that they'll make Dali and Renoir monolithic. Should be cheaper to implement overall, and it's consistent with what they've done with Raven Ridge/Picasso and Banded Kestrel.
 

Richie Rich

Member
Jul 28, 2019
37
28
46
Nice discussion here guys.

Let's look at SMT4 from another point of view.
Here are the known facts:
  1. ILP is a real limitation. A CPU can process on average 2 instructions/cycle for most code.
  2. But the average IPC number is too simplified. For more detail, look at the Gaussian distribution curve of IPC. For example, it might show that 2 IPC can be found in 60% of code, 3 IPC in 35%, 4 IPC in 20%, 6 IPC in 10%, and 8 IPC in 5% of code. Every algorithm has a slightly different distribution, but let's assume an average distribution across typical code.
  3. Fact is that 2xALUs per thread is kind of a sweet spot in terms of utilization. This is the idea behind the Bulldozer uarch and why they made this "smart" move to a 2xALU design from the 3xALU K10.
  4. Fact is that Intel kept the same utilization at 2xALUs/thread via a different strategy: adopting SMT2 for 4xALUs.
  5. Fact is that 4ALUs + SMT2 is much more powerful in single thread, by covering a larger share of the use cases in the Gaussian distribution. Bulldozer designers ignored this important factor and it was a bad mistake. The result: a smart move turned into a "smart" move.
Now if you are an AMD CPU architect you can simply do the same successful trick:
  • A) move from 2ALU+2ALU Bulldozer cores to 4ALU + SMT2, creating the Zen core. Same sweet spot of 2 ALU/thread, same number of ALU units but much more single-thread power and better utilization. It was a fusion/merge of 2 cores into one wider core while taking ILP into account.
  • B) do the core fusion again, merging 2 Zen cores into one wider Zen3 core (from 4ALU + 4ALU SMT2 to 8 ALU + SMT4). This would gain more single-core performance from the Gaussian distribution of code ILP (much more with compiler optimization in the future) and even more efficiency than SMT2.
  • C) 6ALU + SMT4 also makes sense since we know the A12 Vortex core has 6 ALUs already. It's a more conservative approach. Keep in mind that Jim Keller started the Zen design at approximately the same time as Apple started their A11. Engineers talk to each other, especially when Jim is an ex-Apple engineer.

I would bet on option C) because it is easier to develop. However both B and C are feasible.
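To make the "ALUs per thread" ratio behind options A to C concrete, here is a small Python sketch. The ALU and thread counts are only the ones named in this post; the dictionary and function names are hypothetical, and the ratio is a simple division, not a performance model.

```python
# (ALU count, hardware threads) for each configuration named in the post.
CONFIGS = {
    "Bulldozer core (2 ALU, 1 thread)": (2, 1),
    "4 ALU + SMT2 (Intel, Zen / option A)": (4, 2),
    "Option B: 8 ALU + SMT4": (8, 4),
    "Option C: 6 ALU + SMT4": (6, 4),
}


def alus_per_thread(alus: int, threads: int) -> float:
    """ALUs available per hardware thread when all threads are active."""
    return alus / threads


for name, (alus, threads) in CONFIGS.items():
    ratio = alus_per_thread(alus, threads)
    # With a single active thread, that thread can use every ALU in the core.
    print(f"{name}: {ratio:.1f} ALU/thread, up to {alus} ALUs for a lone thread")
```

Option B keeps the 2 ALU/thread ratio with all four threads busy, while option C drops to 1.5; in both cases a lone thread sees a wider core than it does today.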
 

moinmoin

Senior member
Jun 1, 2017
988
755
106
@DrMrLordX
Not interested in continuing the previous discussion in general, just this one point:

You appear to have the fixation on SMT (and extending it from 2 to 4 threads) costing a lot of transistors and as such being expensive (to the point of the amount of cores needing to be reduced as a result which sounds insane to me). Where did you get that idea?

Intel states that their HT implementation adds only 5% more die area. I don't recall AMD mentioning how much their SMT implementation adds. Though the fact that resources are mostly dynamically shared already means that it's a matter of available resources how many threads could theoretically be fed concurrently. The only parts statically shared and as such always needed to be expanded along with supporting more threads are most of the queues, specifically µOP Queue, Retire Queue, and Store Queue. What increases the amount of transistors is the addition and expansion of resources, like what was done between Zen and Zen 2 with native AVX2 support and the widening of all related resources. And that increase in resources puts Zen 2 into the neighborhood of Power 7 which uses those resources to offer SMT4.

So my points are twofold:
- New resources (= more transistors) are being added anyway, regardless of any kind of SMT.
- The version of SMT used is not the goal but a reflection of how many of those resources are lying dormant with lighter instructions, and how to best make them available for additional concurrent processing in such scenarios.

AMD's implementation of SMT appears to be mostly agnostic to the actual number of threads supported and as such can be comparatively easily extended. So my belief is that AMD will keep SMT2 as long as it does the job of using otherwise dormant resources. Once the amount, size and kind of resources reach the point that 2 threads are no longer sufficient to make good use of them, and more threads offer a significant performance boost in some scenarios, I fully expect AMD to increase the number of concurrently processable threads, thus introducing SMT4.
 

naukkis

Senior member
Jun 5, 2002
232
81
101
They used to, but everyone's moved back to 8T cells with smaller processes. I don't know the precise reason why. L3 moved to 8T cells last, the inner caches switched first, so that might suggest speed as the reason?
8T cells can operate at lower voltage. Using 6T cells will increase vmin.
 

DrMrLordX

Lifer
Apr 27, 2000
13,040
2,705
136
@DrMrLordX
Not interested in continuing the previous discussion in general, just this one point:

You appear to have the fixation on SMT (and extending it from 2 to 4 threads) costing a lot of transistors and as such being expensive (to the point of the amount of cores needing to be reduced as a result which sounds insane to me). Where did you get that idea?
Moving from a 4-wide to 6-wide core and moving from SMT2 to SMT4 will require a significant increase in transistor count per core. Just moving to SMT4, not so much.
 

NostaSeronx

Platinum Member
Sep 18, 2011
2,520
341
126
I can't help thinking of Soft Machines and their tech. Some of the concepts seem quite promising to both increase single thread performance and reduce power/computation. AMD was one of the main investors until Intel bought the company and it went dark.

Zen 3 or Zen 4?

https://www.anandtech.com/print/10025/examining-soft-machines-architecture-visc-ipc
You won't be wrong if it does pop up down the line...

From Softmachines:
Member of the CPU Logic Design team, worked on the uarch and RTL design of a next-generation VISC CPU core. -> Part of Intel’s Big Core team, worked on next-generation, high-performance x86-based CPU cores. -> Working on x86-based high-performance ”Zen” processor cores focusing on major blocks such as Branch Prediction and Out-of-Order Scheduler units.

Some of the key patents for AMD's CMT which is required in VISC-esque architecture will expire in November 2019.
 

amd6502

Senior member
Apr 21, 2017
634
152
86
Nice discussion here guys.

Let's look at SMT4 from another point of view.
Here are the known facts:
  1. ILP is a real limitation. A CPU can process on average 2 instructions/cycle for most code.
  2. But the average IPC number is too simplified. For more detail, look at the Gaussian distribution curve of IPC. For example, it might show that 2 IPC can be found in 60% of code, 3 IPC in 35%, 4 IPC in 20%, 6 IPC in 10%, and 8 IPC in 5% of code. Every algorithm has a slightly different distribution, but let's assume an average distribution across typical code.
  3. Fact is that 2xALUs per thread is kind of a sweet spot in terms of utilization. This is the idea behind the Bulldozer uarch and why they made this "smart" move to a 2xALU design from the 3xALU K10.
  4. Fact is that Intel kept the same utilization at 2xALUs/thread via a different strategy: adopting SMT2 for 4xALUs.
  5. Fact is that 4ALUs + SMT2 is much more powerful in single thread, by covering a larger share of the use cases in the Gaussian distribution. Bulldozer designers ignored this important factor and it was a bad mistake. The result: a smart move turned into a "smart" move.
Now if you are an AMD CPU architect you can simply do the same successful trick:
  • A) move from 2ALU+2ALU Bulldozer cores to 4ALU + SMT2, creating the Zen core. Same sweet spot of 2 ALU/thread, same number of ALU units but much more single-thread power and better utilization. It was a fusion/merge of 2 cores into one wider core while taking ILP into account.
  • B) do the core fusion again, merging 2 Zen cores into one wider Zen3 core (from 4ALU + 4ALU SMT2 to 8 ALU + SMT4). This would gain more single-core performance from the Gaussian distribution of code ILP (much more with compiler optimization in the future) and even more efficiency than SMT2.
  • C) 6ALU + SMT4 also makes sense since we know the A12 Vortex core has 6 ALUs already. It's a more conservative approach. Keep in mind that Jim Keller started the Zen design at approximately the same time as Apple started their A11. Engineers talk to each other, especially when Jim is an ex-Apple engineer.

I would bet on option C) because it is easier to develop. However both B and C are feasible.

That is really insightful.

My thinking is why not shoot for the very lowest hanging fruit of between 1 and 2 IPC (closer towards 1) with a pair of light duty non-speculative threads. 6 ALU with current Zen2's 3 AGU seems a good fit with that, and then do SMT4 for Zen4, either with the same pipes or going even wider to 8 (option B).
 

Richie Rich

Member
Jul 28, 2019
37
28
46
I can't help thinking of Soft Machines and their tech. Some of the concepts seem quite promising to both increase single thread performance and reduce power/computation. AMD was one of the main investors until Intel bought the company and it went dark.

Zen 3 or Zen 4?

https://www.anandtech.com/print/10025/examining-soft-machines-architecture-visc-ipc
VISC is very interesting. But. These ideas are already built as OoO+SMT2 in modern RISC x86 CPUs. No need to duplicate it in a more complicated way like VISC - that's why it was never adopted or licensed IMHO. We are just stuck with SMT2 now. Hopefully not for a long time...

That is really insightful.

My thinking is why not shoot for the very lowest hanging fruit of between 1 and 2 IPC (closer towards 1) with a pair of light duty non-speculative threads. 6 ALU with current Zen2's 3 AGU seems a good fit with that, and then do SMT4 for Zen4, either with the same pipes or going even wider to 8 (option B).
It's possible, I agree. But. Partly because AMD is not developing its own compiler - Intel does optimization for its own 4ALU+SMT2 design. AMD wasn't able to push optimizations for their Bulldozer arch with more cores (it's more a programmers' issue than a compiler one) and the same will happen when they aim for 8ALU+SMT2. Even 6ALU+SMT2 is very thin ice for them as most code is Intel friendly. SMT4 is not so complicated to implement and eliminates every risk for wide cores (via keeping the ratio of 2ALU/thread). And it can be disabled in BIOS or controlled by the OS. 90% of the work is to develop a very complex wide core. SMT4 will increase demand for front-end throughput. It's just much easier to develop such a complex core from scratch. So Zen3 is the first good chance to see such a wide SMT4 core - just because of the quite good combination of Jim Keller and development from scratch since 2012. Maybe it was canceled together with the wide server ARM core. Who knows.
 

NostaSeronx

Platinum Member
Sep 18, 2011
2,520
341
126
No need to duplicate it in a more complicated way like VISC - that's why it was never adopted or licensed IMHO.
Always be careful with that;

Architect of the Denver CPU Floating Point Unit(Computer Architect-Nvidia) + Architect of the Tegra CPU Complex(Sr Computer Architect-Nvidia) -> {Architect of the Scheduler and Execution cluster of the VISC CPU(Staff CPU Architect-Soft Machines) -> Lead architect of Out-of-order and Execution cluster of the next generation CPU core(Senior Staff CPU Architect-Intel Corporation)}
{June 2013 to November 2018}

[Microarchitecture and RTL design of VISC-architecture-based Core(Design Engineer-Soft Machines) -> Microarchitecture and RTL(Hardware Engineer-Intel corporation)]
[June 2010 to present]

Intel's only comment on it was the branch predictor was behind. Meaning performance would be higher with Intel IP. This is with a dual-core at 400 MHz(SoftShasta) running faster than a single-core/single-thread at 1 GHz(Haswell).
https://pbs.twimg.com/media/CsartpdUsAAlhuH?format=jpg&name=small
 

Ajay

Diamond Member
Jan 8, 2001
5,467
1,534
136
SMT4 is not so complicated to implement and eliminates every risk for wide cores (via keeping the ratio of 2ALU/thread). And it can be disabled in BIOS or controlled by the OS. 90% of the work is to develop a very complex wide core. SMT4 will increase demand for front-end throughput. It's just much easier to develop such a complex core from scratch. So Zen3 is the first good chance to see such a wide SMT4 core - just because of the quite good combination of Jim Keller and development from scratch since 2012. Maybe it was canceled together with the wide server ARM core. Who knows.
1. Sure SMT4 would be complicated. Adding more hardware threads is just like widening the core - it's a case of diminishing returns. Adding more SMT threading without making architectural improvements and allocating more resources (e.g., more decode, registers and cache) will result in a significant fall-off in SMT yields at higher thread counts. (I can't find the damn graph from the 90's, but using simulations on early Alpha CPU architectures (21064 or 21164), performance scaling was hyperbolic, with an optimal xtor cost per thread being reached at 4 (SMT4).) AMD certainly can use Dr. Joel Emer's approach and keep the resource use to a minimum as a reasonable trade-off, using few resources (xtors) to keep die size, and hence cost, down.

In order to keep performance up while adding more threading, proportionally more redundant resources will need to be added than was needed for SMT2. This will need to be done while maintaining high clocks and staying within the target design's power budget. Verification (simulation) and validation (silicon) will increase design time just like widening the core. Actually, more so, because the added registers, circuit blocks, widened resources, etc. span throughout the CPU to maintain good performance across more threads. See the figure below from the Anandtech Zen1 Deep Dive.

2. I would be shocked to see SMT4 in Zen3. I really expect more cores are the most likely outcome (96?). I just don't see the need for more hardware threading at this point in Zen's development. POWER and SPARC went with more hardware SMT to keep die size and power in check (relative to prior designs), rather than scaling core counts. With AMD's chiplet strategy, that need simply doesn't exist yet.

HC28.AMD.Mike Clark.final-page-015.jpg
 

Richie Rich

Member
Jul 28, 2019
37
28
46
AMD in August released the Software Optimization Guide for Family 17h Models 30h and Greater. This indicates Zen2 is the last 17h Family model: https://www.amd.com/en/support/tech-docs/software-optimization-guide-for-amd-family-17h-models-30h-and-greater-processors

This means:
  • Zen3 will have a new Family number (19h probably? from AIDA64 leaks)
  • no Zen2+ model
  • new family number means much bigger changes than Zen2.
  • 19h Family will likely consist of Zen3, 4 and 5.
  • because Keller left AMD in 2015, there is a high probability of his influence on the 19h Family concept.
Quite good news.
 

Richie Rich

Member
Jul 28, 2019
37
28
46
1. Sure SMT4 would be complicated. Adding more hardware threads is just like widening the core - it's a case of diminishing returns. Adding more SMT threading without making architectural improvements and allocating more resources (e.g., more decode, registers and cache) will result in a significant fall-off in SMT yields at higher thread counts. (I can't find the damn graph from the 90's, but using simulations on early Alpha CPU architectures (21064 or 21164), performance scaling was hyperbolic, with an optimal xtor cost per thread being reached at 4 (SMT4).) AMD certainly can use Dr. Joel Emer's approach and keep the resource use to a minimum as a reasonable trade-off, using few resources (xtors) to keep die size, and hence cost, down.

In order to keep performance up while adding more threading, proportionally more redundant resources will need to be added than was needed for SMT2. This will need to be done while maintaining high clocks and staying within the target design's power budget. Verification (simulation) and validation (silicon) will increase design time just like widening the core. Actually, more so, because the added registers, circuit blocks, widened resources, etc. span throughout the CPU to maintain good performance across more threads. See the figure below from the Anandtech Zen1 Deep Dive.

2. I would be shocked to see SMT4 in Zen3. I really expect more cores are the most likely outcome (96?). I just don't see the need for more hardware threading at this point in Zen's development. POWER and SPARC went with more hardware SMT to keep die size and power in check (relative to prior designs), rather than scaling core counts. With AMD's chiplet strategy, that need simply doesn't exist yet.

View attachment 10178
I think you are too pessimistic. There are two separate things: SMT vs. wide core
  • you can implement 4-way SMT for the Zen2 core sharing 4xALUs and 2FPU. Intel stated HT cost a 5% transistor increase. It's not so difficult to implement..... however you will get very little gain (better utilization by 2-5%?) and performance/thread will decrease to half (well, it can be switched back to SMT2 or switched off). Overall gain is very small compared to the effort.
  • to make SMT4 reasonable you need a wider core. It depends what AMD is aiming for. AMD would need to at least double the FPU from 2xFPU up to 4xFPU. This is the minimal configuration for SMT4 IMHO.
  • when AMD is gonna touch all parts of the CPU due to the SMT4 implementation, it is more effective to redesign it completely to increase throughput everywhere (front-end, back-end) - and create a new arch for the next decade. AMD had enough time to do it since Keller started in 2012. IMHO Zen1+2 was just fast food to fill the time gap (Zen has the shared FPU, front-end and AGU from Bulldozer; only the ALU back-end is completely new, moving from 2+2ALU to 4ALU+SMT2; they needed something fast to avoid bankruptcy) before the big dinner (new architecture from scratch for the next decade?). I can see there was a time window and real reasons to plan so. Maybe I'm too optimistic :)
 

Yotsugi

Senior member
Oct 16, 2017
919
412
96
I think you are too pessimistic. There are two separate things: SMT vs. wide core
  • you can implement 4-way SMT for the Zen2 core sharing 4xALUs and 2FPU. Intel stated HT cost a 5% transistor increase. It's not so difficult to implement..... however you will get very little gain (better utilization by 2-5%?) and performance/thread will decrease to half (well, it can be switched back to SMT2 or switched off). Overall gain is very small compared to the effort.
  • to make SMT4 reasonable you need a wider core. It depends what AMD is aiming for. AMD would need to at least double the FPU from 2xFPU up to 4xFPU. This is the minimal configuration for SMT4 IMHO.
  • when AMD is gonna touch all parts of the CPU due to the SMT4 implementation, it is more effective to redesign it completely to increase throughput everywhere (front-end, back-end) - and create a new arch for the next decade. AMD had enough time to do it since Keller started in 2012. IMHO Zen1+2 was just fast food to fill the time gap (Zen has the shared FPU, front-end and AGU from Bulldozer; only the ALU back-end is completely new, moving from 2+2ALU to 4ALU+SMT2; they needed something fast to avoid bankruptcy) before the big dinner (new architecture from scratch for the next decade?). I can see there was a time window and real reasons to plan so. Maybe I'm too optimistic :)
There's no reason to go wider SMT unless you want to game per-C licensing like you're IBM.
Just throw more cores.
Oh, and every core xtor ever should go towards more ST, period.
 

Richie Rich

Member
Jul 28, 2019
37
28
46
There's no reason to go wider SMT unless you want to game per-C licensing like you're IBM.
Just throw more cores.
Oh, and every core xtor ever should go towards more ST, period.
Throwing more cores is inefficient. Imagine a situation where one front-end decoder is free because the uop cache is being used, and a second is 100% utilized (being the bottleneck). When you have a separate front-end for each core, you cannot use the free resources of the other core. You can save a lot of transistors and increase performance via a shared front-end like Bulldozer did.


19h model number means big changes. Minimal change from AMD history is like Bobcat 14h -> 16h Jaguar (bringing AVX and doubling FPU perf). So Zen3 will deliver at least this.

Available AMD technology for Zen3:
  • front end sharing for 2 cores (developed for Bulldozer) - a Zen3 CCX can consist of 2 modules with 2 cores each. This would save some transistors, decrease power and eliminate bottlenecks, thus increasing performance.
  • Zen2 has 2x256-bit FPU (in four pipes) - it would be strange if AMD developed a brand new FPU just for Zen2 and just for 1 year on the market. Such a waste of workforce just doesn't make sense. Zen3 will likely use double the number of Zen2 FPUs, modified to cooperate to bring AVX-512 support (as BD supported AVX only in coop too). 4x 256-bit Zen2 FPUs in Zen3 would be capable of running full-speed AVX-512 for two threads (SMT2) and AVX2 for 4 threads (ideal for SMT4). This configuration screams for SMT4.
  • 4-way SMT is 20 years old, dating from EV8
  • wider core to 6 or 8 ALUs.

Minimal Zen3 config> Zen2 + Doubled Zen2 FPU to AVX-512 + SMT4 (+20% transistors -> +20% performance)
Maximal Zen3 config> New wide uarch 8xALU + Doubled Zen2 FPU to AVX-512 + SMT4 + shared front-end for whole CCX (this would be huge so probably 1xCCX per chiplet, +100% more transistors, +150% perf)

There are so many possible combinations inbetween.
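The FPU-width argument above boils down to a division, sketched here in Python. This is an idealized model that follows the post's own simplification (each thread issues one vector op of the given width per cycle, and 256-bit units can pair up for 512-bit ops); `full_rate_threads` is a hypothetical name, and the 4x256-bit Zen3 configuration is pure speculation from this post.

```python
def full_rate_threads(units: int, unit_bits: int, vector_bits: int) -> int:
    """Number of threads that can each issue one vector op per cycle.

    Idealized: ignores scheduling, port conflicts and the FMA/FADD split.
    """
    return (units * unit_bits) // vector_bits


# Zen2 as described in the post: 2 x 256-bit.
print("Zen2, AVX2:", full_rate_threads(2, 256, 256), "threads")

# Speculated Zen3 from the post: 4 x 256-bit.
print("4x256, AVX-512:", full_rate_threads(4, 256, 512), "threads")
print("4x256, AVX2:   ", full_rate_threads(4, 256, 256), "threads")
```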
 

Ajay

Diamond Member
Jan 8, 2001
5,467
1,534
136
Throwing more cores is inefficient. Imagine situation when one front-end decoder is free because of using uop cache, and second is 100% utilized (being bottleneck). When you have a separated fron-end for each core and you cannot use free resources of other core.You can save a lot of transistors and increase performance via shared front-end like Bulldozer did.
Huh? Yeah, cloud providers will just love this - not. Running massive numbers of VM instances is a clear target for EPYC systems and more cores is the way to go.
 

moinmoin

Senior member
Jun 1, 2017
988
755
106
Personally I think an increase in the number of cores won't happen with Zen 3, simply because that step will lack the density increase (TSMC's 7nm+ won't offer much there; the next big increase comes with the 5nm process Zen 4 will likely use), lack the memory bandwidth (Zen 3 will still use DDR4; Zen 4 will use DDR5, requiring a new platform), and the current topology makes it tricky to add more cores without adding bottlenecks (currently with 64 cores each 8-core CCD is linked to one RAM channel, and each pair of CCDs to one dual-channel IMC on the IOD).

I believe Zen 3 will focus on internal changes in core and uncore that will then allow Zen 4 to massively scale out with a different package topology, similar to how it happened with Zen to Zen 2.
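A rough Python sketch of the memory-bandwidth side of this argument. It uses the topology described above (64 cores, 8-core CCDs, one RAM channel per CCD) and, purely as a hypothetical, the 96-core figure floated earlier in the thread on an unchanged channel count. The names are made up for illustration.

```python
CORES_PER_CCD = 8


def cores_per_channel(total_cores: int, ram_channels: int) -> float:
    """Average number of cores sharing one RAM channel."""
    return total_cores / ram_channels


# Current topology per the post: one channel per 8-core CCD.
current_cores = 64
channels = current_cores // CORES_PER_CCD
current = cores_per_channel(current_cores, channels)

# Hypothetical: 96 cores on the same DDR4 platform (same channel count).
hypothetical = cores_per_channel(96, channels)

# Per-core share of bandwidth relative to today, all else being equal.
relative_share = current / hypothetical

print(f"{channels} channels: {current:.0f} cores/channel today, "
      f"{hypothetical:.0f} with 96 cores ({relative_share:.0%} of today's per-core share)")
```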
 
