Speculation: Ryzen 4000 series/Zen 3

AMDK11

Member
Jul 15, 2019
44
33
61
What might Zen3 be?
The answer depends on another question: what is the inevitable future of CPU cores?

The real power is in the back-end and its ALUs, AGUs and FPUs. As a mech engineer I see these as the cylinders in an engine.
The front-end just feeds them as efficiently as possible, the same way an intake manifold feeds the engine. That's all.

The evolution of back-end ALUs was:
- 1995 ... 2xALU Intel P6 uarch, PentiumPro, PII...
- 1997 ... 2xALU AMD/Nexgen K6
- 1999 ... 3xALU AMD K7, Intel PIII
- 2008 ... 4xALU Intel Nehalem
- 2012 ... 4xALU AMD Zen
- 2017 ... 6xALU Apple A11 ... most powerful core today (int IPC +76% over Skylake)

x86 CPUs must move to 6xALU. If Apple did it, Intel and AMD must do it too. Sure, it will be a hard move, just as the move from 3xALU to 4xALU was; it will need a core redesign from scratch, as Nehalem and Zen were. You don't need to be a genius to predict that the inevitable next step is an 8xALU core design. Or do you think x86 CPUs will sit at a 4xALU design for the next 50 years? No. Apple moved from a weak 4-cylinder engine to their powerful V6. However, I think we deserve V8s.

What is the evolution of SMT?
- 1999 ... introduced by DEC, planned for the EV8 CPU (SMT4) in 2003 (cancelled in 2001 by Compaq in favor of Itanium)
- 2002 ... Intel P4 SMT2
- 2004 ... IBM Power5 SMT2
- 2010 ... IBM Power7 SMT4 dynamic
- 2014 ... IBM Power8 SMT8 dynamic
- 2017 ... AMD Zen SMT2
- 2050 ... x86 still stuck at SMT2?

A 6xALU core might still be fine with SMT2. For highly threaded server applications SMT4 makes sense even for this core.
An 8xALU core will struggle with just SMT2 from an efficiency standpoint. You do not need to be a genius to predict that SMT4 is an efficient move for this core. SMT4 and SMT8 with a dynamically changing number of threads/priorities is actual IBM technology, not sci-fi. Again, you do not need to be a genius to predict that the next step is SMT16 (for very wide cores and some specific server markets). Does SMT4 still look crazy for Zen3?

And don't forget, guys, what Kennedy said: "We choose to go to the Moon in this decade and do the other things, not because they are easy, but because they are hard."
Nehalem does not have 4xALU but 3xALU!

Nehalem 3xALU, 2xAGU
SandyBridge 3xALU, 2xAGU
Haswell 4xALU, 3xAGU
Skylake 4xALU, 3xAGU
SunnyCove 4xALU, 4xAGU

Intel announced some time ago that it has been working on the groundbreaking micro-architecture of NGC (Next Generation Core) since around 2017. It is to be the basis for future generations for the next decade.

My guesses / wishes:
L1-I 48KB 12-Way
8 Decode x86 - 2x complex and 6 simple
8xALU / 6xFPU
6xAGU + 3-4xSD
L1-D 48KB 12-Way

New x86-64! - 64-256 64bit registers!

Intel / AMD x86-64 - 16x64bit registers
IBM POWER - 32x64bit registers
Intel Itanium - 128x64bit registers
 

AMDK11

Member
Jul 15, 2019
44
33
61
@UP
We will see :)

Back when Intel was promoting NetBurst (Pentium 4), it also had a mobile core, Banias (Pentium M), a development of the Pentium III with IPC equal to or higher than K7 (Athlon / Athlon XP). Intel then expanded that core into Yonah (Core), which had IPC at the K8 (Athlon 64) level but timing problems. Only the extension of Yonah into Conroe (Core 2) gave a safe advantage: IPC 25% higher than K8, plus a higher clock.

I suspect that for Intel SunnyCove is not enough and they are preparing something that will give them an indisputable advantage not only over AMD but also Apple.

For now they will make do with SunnyCove and WillowCove in mobile, maybe even GoldenCove, and then release NGC on the PC.

It will certainly be interesting to see what Intel and AMD come up with over the next 3-4 years.
 

Richie Rich

Senior member
Jul 28, 2019
470
228
76
And then you find me code with enough ILP to saturate a machine this wide.
How can Apple engineers utilize so many ALUs in the A12 chip (6xALU) with no SMT? The A12 is twice as fast (IPC) as a 3xALU Cortex core and 76% faster than Skylake. And we are talking about a mobile core with very limited energy resources. I would have expected the first 6xALU design in some power-hungry HPC architecture such as IBM Power, where power consumption is secondary.

Certainly there is a way to utilize 6xALU with no SMT needed (a ratio of 6 ALU/thread). Apple engineers found it.
Theoretically a super wide 12xALU + SMT2 or an insane 24xALU + SMT4 should be possible, all with a ratio of 6 ALU/thread.

- 8xALU with SMT2 should be very easy: a ratio of 4 ALU/thread, still much lower than Apple's ratio of 6.
- 8xALU with SMT4 is super conservative at 2 ALU/thread. You can have both high IPC and high throughput efficiency.
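
The ALU/thread ratios quoted above are plain division, which a short Python sketch can reproduce. This is only back-of-envelope arithmetic: the function name and the list of configurations are made up for illustration, and the ratio treats ALUs as if they were split evenly between threads, which is a simplification rather than a model of how an SMT core actually schedules work.

```python
from fractions import Fraction


def alus_per_thread(alus: int, smt_threads: int = 1) -> Fraction:
    """Naive ratio of integer ALUs to hardware threads (even-split view)."""
    if alus <= 0 or smt_threads <= 0:
        raise ValueError("alus and smt_threads must be positive")
    return Fraction(alus, smt_threads)


# (label, ALU count, SMT threads) for the configurations named in the post.
configs = [
    ("6xALU, no SMT (A12-style)", 6, 1),
    ("12xALU + SMT2", 12, 2),
    ("24xALU + SMT4", 24, 4),
    ("8xALU + SMT2", 8, 2),
    ("8xALU + SMT4", 8, 4),
]

for label, alus, threads in configs:
    print(f"{label}: {alus_per_thread(alus, threads)} ALU/thread")
```

The ratio says nothing about whether real code has enough instruction-level parallelism to keep that many ALUs busy; it only restates the numbers used in the argument.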

And this might be Keller's big Zen... project, developed for 8 years since 2012 to be launched as Zen3. Theoretically it is possible. They had Keller and enough time. I don't want to start hype about Zen3. However, I believe the world deserves something much better than refurbishing 4xALU cores for the next 20 years. I want a 6xALU core in my desktop, not only in an iPhone.
 

Richie Rich

Senior member
Jul 28, 2019
470
228
76
Nehalem does not have 4xALU but 3xALU!

Nehalem 3xALU, 2xAGU
SandyBridge 3xALU, 2xAGU
Haswell 4xALU, 3xAGU
Skylake 4xALU, 3xAGU
SunnyCove 4xALU, 4xAGU

Intel announced some time ago that it has been working on the groundbreaking micro-architecture of NGC (Next Generation Core) since around 2017. It is to be the basis for future generations for the next decade.

My guesses / wishes:
L1-I 48KB 12-Way
8 Decode x86 - 2x complex and 6 simple
8xALU / 6xFPU
6xAGU + 3-4xSD
L1-D 48KB 12-Way

New x86-64! - 64-256 64bit registers!

Intel / AMD x86-64 - 16x64bit registers
IBM POWER - 32x64bit registers
Intel Itanium - 128x64bit registers
You are right, Nehalem had 3xALU.
I agree that 6xALU or 8xALU designs are the next step, as the lowest-hanging fruit.
 

Yotsugi

Golden Member
Oct 16, 2017
1,029
487
106
Certainly there is a way to utilize 6xALU with no SMT needed (a ratio of 6 ALU/thread). Apple engineers found it.
Yeah by capping the clock ceiling and doing stuff.
Not exactly valid for cores like Zen or *Cove stuff, they target varying ranges of clocks.
8xALU with SMT2 should be very easy: a ratio of 4 ALU/thread, still much lower than Apple's ratio of 6.
That's not how it works and ALUs are not statically partitioned.
 

Ajay

Diamond Member
Jan 8, 2001
8,941
3,624
136
Yeah by capping the clock ceiling and doing stuff.
Not exactly valid for cores like Zen or *Cove stuff, they target varying ranges of clocks.
No 'legacy' crap either. I think Xcode must have an excellent optimizing compiler. I'd love to see the actual profiled IPC for Apple's A12. Intel runs around 2.4 instructions per clock on good code, IIRC.
 

Richie Rich

Senior member
Jul 28, 2019
470
228
76
Apple has the advantage of controlling the entire software stack for their hardware.
This was tested in the SPEC2006 benchmark... so what "entire SW stack control" are you talking about? The algorithm is the same on all platforms. Maybe the compiler is Apple's advantage. If Apple is using an advanced compiler, x86 vendors can develop such a compiler too. If this is the way, then Intel and AMD should follow Apple's lead.

Yeah by capping the clock ceiling and doing stuff.
Not exactly valid for cores like Zen or *Cove stuff, they target varying ranges of clocks.
And what's the problem with a low clock? IPC is independent of clock. Apple can make it run much faster if they need that for the desktop. It is not so difficult to optimize memory buffers to keep up throughput at high frequencies. They need to jump from 2.5GHz to 3.5GHz and it will reach the performance of Skylake at 6GHz.
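
The 3.5GHz vs. 6GHz comparison above follows from performance being roughly IPC times clock. Here is a minimal Python sketch of that arithmetic, assuming the +76% integer IPC figure quoted in this thread and assuming IPC stays constant as the clock rises (memory latency measured in cycles grows with frequency, so that second assumption is the optimistic part). The function and constant names are hypothetical.

```python
def equivalent_clock_ghz(clock_ghz: float, ipc_ratio: float) -> float:
    """Clock a reference core would need to match a core that runs at
    `clock_ghz` with `ipc_ratio` times the reference core's IPC."""
    return clock_ghz * ipc_ratio


# Assumption: the "+76% over Skylake" integer IPC claim from this thread.
IPC_RATIO_VS_SKYLAKE = 1.76

for clock in (2.5, 3.5):
    eq = equivalent_clock_ghz(clock, IPC_RATIO_VS_SKYLAKE)
    print(f"{clock} GHz at 1.76x IPC ~ Skylake at {eq:.2f} GHz")
```

Under those assumptions 3.5GHz works out to a bit over 6GHz of Skylake-equivalent clock, which is where the figure in the post comes from.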
That's not how it works and ALUs are not statically partitioned.
SMT is even better than statically partitioned ALUs. When the first thread is waiting, the second thread can use all 8 ALUs. That's why the A12's 6xALU is much harder to keep busy than a theoretical 12xALU + SMT2. In the same way, Bulldozer's dual 2xALU was not as good as Zen's single 4xALU + SMT2, although both have the same number of ALUs.
 

Thala

Golden Member
Nov 12, 2014
1,256
569
136
Yeah by capping the clock ceiling and doing stuff.
Not exactly valid for cores like Zen or *Cove stuff, they target varying ranges of clocks.
They are not capping the clock ceiling whatsoever. They are capping power, which caps voltage, which caps frequency.
What you need to understand is that they still need a very high-clocking architecture to even reach 2.5GHz at low voltages.
 

naukkis

Senior member
Jun 5, 2002
450
307
136
They are not capping the clock ceiling whatsoever. They are capping power, which caps voltage, which caps frequency.
What you need to understand is that they still need a very high-clocking architecture to even reach 2.5GHz at low voltages.
Yep. Here's the Cortex A72 voltage vs. frequency curve at 7nm:

https://fuse.wikichip.org/news/2446/tsmc-demonstrates-a-7nm-arm-based-chiplet-design-for-hpc/

The A12 could probably also get very close to 4GHz if pushed. The A12 core would be extremely competitive on the desktop too if Apple decided to use it there.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,068
674
136
New x86-64! - 64-256 64bit registers!
This is not useful. Modern desktop CPUs already have hundreds of registers, which they can use to increase execution width thanks to register renaming. What you are proposing to add is several hundred register names, which have both questionable utility and very real costs in a renaming OoOE cpu.
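
To make the distinction between register names and physical registers concrete, here is a toy Python sketch of register renaming. It is a simplified illustration, not how any particular CPU implements it: the class name, the 16/64 sizes and the `r0`..`r15` naming are assumptions, and freeing registers at retirement is left out entirely.

```python
from typing import Dict, List, Tuple


class RenameTable:
    """Toy renamer: few architectural names, many physical registers."""

    def __init__(self, arch_regs: int = 16, phys_regs: int = 64) -> None:
        if phys_regs < arch_regs:
            raise ValueError("need at least one physical register per name")
        # Each architectural name starts mapped to its own physical register.
        self.mapping: Dict[str, int] = {f"r{i}": i for i in range(arch_regs)}
        self.free: List[int] = list(range(arch_regs, phys_regs))

    def rename(self, dest: str, srcs: List[str]) -> Tuple[int, List[int]]:
        """Return (physical dest, physical sources) for one instruction."""
        phys_srcs = [self.mapping[s] for s in srcs]  # read current mappings
        if not self.free:
            # A real core would stall here until retirement frees a register.
            raise RuntimeError("out of physical registers")
        phys_dest = self.free.pop(0)  # fresh register for every new write
        self.mapping[dest] = phys_dest
        return phys_dest, phys_srcs


rt = RenameTable()
# Two independent computations that both write the name r1.
print(rt.rename("r1", ["r2", "r3"]))  # r1 = r2 + r3
print(rt.rename("r4", ["r1"]))        # r4 = f(r1), reads the first r1
print(rt.rename("r1", ["r5", "r6"]))  # r1 = r5 + r6, gets a different register
```

Because the second write to `r1` lands in a different physical register, the two computations no longer conflict and can execute in parallel, even though the program only ever used 16 names.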

When AMD was designing x64, they ran a lot of simulations of normal x86 code with more or fewer registers. Going from 8->16 was a no-brainer, but they also simulated 32 registers and found that it only gave a few percent more speed in most code. In contrast, increasing the number of register names has a major cost on context switches, as they all have to be stored to and reloaded from memory, so 32 registers turned out to give a net performance deficit. (There would have been no implementation cost for going to 32 names, so AMD could simply pick the best choice, which on normal x86 code was 16 names.)

Itanium had to have that many registers because it was in-order, with no renaming. This means that for execution width purposes, register names = registers on Itanium.
Most ARM workloads do fewer context switches than x86 ones because in embedded there's less varied software running at the same time, so for their workloads they chose 32, probably also because they intend to push AArch64 to very low power targets eventually, where register renaming and OoO are not a given.

(edit)
8 Decode x86 - 2x complex and 6 simple
Also, this would be ridiculously expensive in power, and the 2 complex decoders are not useful at all. Because of its highly variable-width instructions, increasing x86 decode width has an exponential power cost: every extra instruction you add costs more power than the decode you already have. This is the one place where ARM has a genuine advantage over x86: increasing decode width on ARM has linear costs, so they can just go wild and decode as much as they feel like. For x86, it's better to use a uop cache to catch hot loops; it's not like you are ever going to be decode-limited at 5 insns per clock in straight-line code anyway.
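
The root of the decode problem can be shown with a small Python sketch: with variable-length instructions, where each instruction starts depends on the lengths of all the instructions before it, while with a fixed width every start offset can be computed independently. This only illustrates the serial dependency; it does not model power, and the instruction lengths below are made-up example values.

```python
from typing import List


def variable_length_starts(lengths: List[int]) -> List[int]:
    """Start offsets when each length is only known after decoding the
    instruction itself: a serial running sum."""
    starts, offset = [], 0
    for n in lengths:
        starts.append(offset)
        offset += n  # the next start depends on this instruction's length
    return starts


def fixed_width_starts(count: int, width: int = 4) -> List[int]:
    """Start offsets for a fixed-width ISA: each one is independent."""
    return [i * width for i in range(count)]


# Hypothetical instruction lengths in bytes (x86 allows 1 to 15).
lengths = [1, 3, 7, 2, 5]
print(variable_length_starts(lengths))   # [0, 1, 4, 11, 13]
print(fixed_width_starts(len(lengths)))  # [0, 4, 8, 12, 16]
```

A wide x86 decoder has to break that serial chain somehow, for example by guessing or pre-marking instruction boundaries, and that extra work grows with every decode slot added.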

As for complex instructions, they basically only exist either to maintain backwards compatibility for things no real code actually uses anymore, for "janitorial tasks" like managing CPU state (which is completely irrelevant for performance), or for string instructions, where decoding more than one per clock is useless anyway. >1 per clock is completely pointless.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,217
1,822
136
They are not capping the clock ceiling whatsoever. They are capping power, which caps voltage, which caps frequency.
What you need to understand is that they still need a very high-clocking architecture to even reach 2.5GHz at low voltages.
You say this all the time, but why then did things like Bobcat/Jaguar have a very brutal clock limit with the same number of pipeline stages as Bulldozer? Why did Northwood have a ~20 stage pipeline, Prescott a 30 stage pipeline, and why was Tejas going to have a pipeline length of ~40? None of these things should have mattered/happened if architecture doesn't limit clocks.

So if all those happened what makes apple so special that no one else can do what you say? Sorry not buying it. Even a rudimentary thing like FO4 says that architecture/design/layout matter a lot for clocks.
 

naukkis

Senior member
Jun 5, 2002
450
307
136
So if all those happened what makes apple so special that no one else can do what you say? Sorry not buying it. Even a rudimentary thing like FO4 says that architecture/design/layout matter a lot for clocks.
AMD was making the same kind of very wide CPU, K12, but they cancelled it and focused on its sister core Zen instead, which is limited by its instruction set architecture. It's much easier to design wide cores for ARMv8 than for x86.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,217
1,822
136
AMD was making the same kind of CPU, K12, but they cancelled it and focused on its sister core Zen instead, which is limited by its instruction set architecture. It's much easier to design wide cores for ARMv8 than for x86.
This is just throwaway rubbish that ignores the physics of actually making something happen within a time frame. Inside a core it's all uops. Updating the state machine (retirement) is all in order. The only thing you can argue is decode. Even weak vs. strong memory model is just horse trading depending on the workload.
 

naukkis

Senior member
Jun 5, 2002
450
307
136
This is just throwaway rubbish that ignores the physics of actually making something happen within a time frame. Inside a core it's all uops. Updating the state machine (retirement) is all in order. The only thing you can argue is decode. Even weak vs. strong memory model is just horse trading depending on the workload.
There's more than just decode. Instruction retirement is also a much bigger potential bottleneck on x86 because of total store order. With loosely ordered instruction sets they can easily reuse store buffer entries and so on, which can't be done on x86. There's no point in feeding the core more if the retirement phase can't keep up.

And this is from Jim Keller's mouth: in some interview he described how K12 has a bigger engine because a more relaxed instruction set makes it practical.
 

Thala

Golden Member
Nov 12, 2014
1,256
569
136
You say this all the time, but why then did things like Bobcat/Jaguar have a very brutal clock limit with the same number of pipeline stages as Bulldozer? Why did Northwood have a ~20 stage pipeline, Prescott a 30 stage pipeline, and why was Tejas going to have a pipeline length of ~40? None of these things should have mattered/happened if architecture doesn't limit clocks.
I cannot comment on Prescott etc. because I am not the architect of those cores and cannot reason about the push to such long pipelines.

Anyway, did you take a look at the frequency/voltage curve for Cortex A72 on TSMC N7? They did sign-offs from 2.8GHz@0.775V up to 4.2GHz@1.375V. The A12 achieves 2.5GHz at what I assume is close to the nominal process voltage of 0.75V, so it will scale very similarly to the Cortex A72 above.
Corollary: if they had a critical path in the design that prevented reaching 4.2GHz@1.375V, the same critical path would prevent the design from reaching 2.5GHz@0.75V. What do you think Skylake or Zen2 can be clocked @0.75V?
Case closed.
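
The two sign-off points quoted above also give a feel for why power is the practical cap in a phone. A rough Python sketch, assuming the classic dynamic power relation P ~ C * V^2 * f with capacitance and switching activity held equal, and ignoring leakage entirely (both simplifications; the function name is hypothetical):

```python
def relative_dynamic_power(v1: float, f1: float, v2: float, f2: float) -> float:
    """Ratio P2/P1 under P ~ C * V^2 * f, with C and activity unchanged."""
    return (v2 / v1) ** 2 * (f2 / f1)


# (voltage in V, clock in GHz) for the two Cortex A72 sign-off points
# quoted in this post.
low = (0.775, 2.8)
high = (1.375, 4.2)

ratio = relative_dynamic_power(*low, *high)
print(f"clock x{high[1] / low[1]:.2f}, dynamic power x{ratio:.2f}")
```

Under those assumptions, 1.5x the clock costs roughly 4.7x the dynamic power, so a core that could in principle clock higher would still be held back by a mobile power budget.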
 

DrMrLordX

Lifer
Apr 27, 2000
17,470
6,476
136
What do you think . . . Zen2 can be clocked @0.75V?
I suppose I could find out for you if you'd like. I might have to exploit p-state OC to find out though, which is a major headache.

edit: so apparently clock stretching makes it very difficult to figure out exactly how high Zen2 can be clocked at .75V. I actually shot for .7V and got 2500 MHz stable, but when the chip keeps spitting out the same benchmark results when you move clockspeed around +/- 100 MHz (or more) from that point, it makes you think that maybe you're getting false clocks.

Anything below about 3000 MHz on this chip is goofy in terms of the performance numbers.
 

Thunder 57

Golden Member
Aug 19, 2007
1,659
1,700
136
I cannot comment on Prescott etc. because I am not the architect of those cores and cannot reason about the push to such long pipelines.

Anyway, did you take a look at the frequency/voltage curve for Cortex A72 on TSMC N7? They did sign-offs from 2.8GHz@0.775V up to 4.2GHz@1.375V. The A12 achieves 2.5GHz at what I assume is close to the nominal process voltage of 0.75V, so it will scale very similarly to the Cortex A72 above.
Corollary: if they had a critical path in the design that prevented reaching 4.2GHz@1.375V, the same critical path would prevent the design from reaching 2.5GHz@0.75V. What do you think Skylake or Zen2 can be clocked @0.75V?
Case closed.
You're repeating a common opinion that if Apple (or anyone really) wanted to, they could scale up an ARM design and it would destroy everything. It's not that simple. Others can surely explain it better than I, but you can't just say that because it runs at x GHz and q voltage, it can run at y GHz at p voltage. So many different things affect frequency.

If there were massive gains to be had, someone would be doing it. Software is a problem, but less so every day. Apple surely has the money, but not all of the necessary technology. To answer your question, Zen+ at its lowest p-state of 2.2GHz only needs about 0.75V on my 2600X. I think it's actually 0.775V, but it bounces around so much because of background tasks that it never stays there long.

I wish I could give you a better answer about high performance ARM and hopefully someone else can. You may find this article to be informative, though.
 

DrMrLordX

Lifer
Apr 27, 2000
17,470
6,476
136
@Thunder 57

If Apple follows through on replacing their mobile CPU lineup with Axx-derived SoCs then eventually we may find out just how high they can scale their custom ARM designs. Until then it's all speculation.
 

Kedas

Senior member
Dec 6, 2018
275
238
86
There were 1.5 years between "design complete" and the release of Zen2.
The Zen3 design is complete now, so Zen3 in Q1 2021?
(Based on Intel's speed/delays they could actually do that)

Zen3 will probably come sooner though, like the end of 2020; Zen2 was a big change on all design levels.
 

DrMrLordX

Lifer
Apr 27, 2000
17,470
6,476
136
@Kedas

Zen2 was actually shipping months before retail release (early shipment of Rome through ODM channels). Release on the desktop was delayed thanks to various AGESA issues (apparently). Zen3 is still slated for 2020. We'll see if AMD can deliver, or if it's late again. AMD may also choose to slow their cadence if Intel keeps gimping along. I think they'd be nuts to do that, but some beancounter may encourage them to slow things down a bit.
 

Thunder 57

Golden Member
Aug 19, 2007
1,659
1,700
136
@Kedas Zen2 was actually shipping months before retail release (early shipment of Rome through ODM channels). Release on the desktop was delayed thanks to various AGESA issues (apparently)...
Let's not forget the GloFo issues. AMD wisely didn't commit to them; by the looks of it they were able to switch to TSMC. That could have bitten them hard, but I think it only cost them 1-2 months. Who really knows, though? I expect to see Zen 3 in late August at best to mid October at worst.
 
