Speculation: Ryzen 4000 series/Zen 3

AMDK11 · Aug 11, 2019

Richie Rich said:
What might be a Zen3?
Answer is: what is the inevitable future of CPU cores?

The real power is in the back-end and its ALUs, AGUs and FPUs. As a mech engineer I see these as the cylinders in an engine.
Front-end is just feeding them as efficiently as possible. Same as intake manifold is feeding engine. That's all.

The evolution of back-end ALUs was:
- 1995 ... 2xALU Intel P6 uarch, PentiumPro, PII...
- 1997 ... 2xALU AMD/Nexgen K6
- 1999 ... 3xALU AMD K7, Intel PIII
- 2008 ... 4xALU Intel Nehalem
- 2012 ... 4xALU AMD Zen
- 2017 ... 6xALU Apple A11 ... most powerful core today (int IPC +76% over Skylake)

x86 CPUs must move to 6xALUs. Since Apple did it, Intel and AMD must do it too. Sure, it will be a hard move, as the move from 3xALU -> 4xALU was; it will need a core re-design from scratch, same as Nehalem and Zen were. You don't need to be a genius to predict that the inevitable next step is an 8xALU core design. Or do you think x86 CPUs will sit at a 4xALU design for the next 50 years? No. Apple moved from a weak 4-cylinder engine to their powerful V6. However, I think we deserve V8s.

What is the evolution of SMT?
- 1999 ... introduced by DEC, planned as SMT4 for the EV8 CPU in 2003 (cancelled in 2001 by Compaq in favor of Itanium)
- 2002 ... Intel P4 SMT2
- 2004 ... IBM Power5 SMT2
- 2010 ... IBM Power7 SMT4 dynamical
- 2014 ... IBM Power8 SMT8 dynamical
- 2017 ... AMD Zen SMT2
- 2050 ... x86 still stuck at SMT2?

6xALU core still might be fine with SMT2. For high thread server application SMT4 makes sense even for this core.
8xALU core will struggle with just SMT2 from efficiency point. You do not need to be genius to predict that SMT4 for this core is efficient move. SMT4 and SMT8 with dynamical changing number of threads/priority is actual IBM technology, not a sci-fi. Again, you do not need to be genius to predict that next step is SMT-16 (for very wide core and some specific server markets). Does SMT4 still look crazy for Zen3?

And don't forget guys what Kennedy said: "We choose to go to the moon because it is hard, not because it is easy."

Nehalem does not have 4xALU but 3xALU!

Nehalem 3xALU, 2xAGU
SandyBridge 3xALU, 2xAGU
Haswell 4xALU, 3xAGU
Skylake 4xALU, 3xAGU
SunnyCove 4xALU, 4xAGU

Intel announced some time ago that it has been working on the groundbreaking micro-architecture of NGC (Next Generation Core) since around 2017. It is to be the basis for future generations for the next decade.

My guesses / wishes:
L1-I 48KB 12-Way
8 Decode x86 - 2x complex and 6 simple
8xALU / 6xFPU
6xAGU + 3-4xSD
L1-D 48KB 12-Way

New x86-64! - 64-256 64bit registers!

Intel / AMD x86-64 - 16x64bit registers
IBM POWER - 32x64bit registers
Intel Itanium - 128x64bit registers

Yotsugi · Aug 11, 2019

AMDK11 said:
My guesses / wishes:
L1-I 48KB 12-Way
8 Decode x86 - 2x complex and 6 simple
8xALU / 6xFPU
6xAGU + 3-4xSD

And then you find me code with enough ILP to saturate a machine this wide.
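To put a number on "enough ILP", here is a minimal sketch under simple assumptions: every instruction takes one cycle, the machine is infinitely wide, and a program is just a list of dependencies on earlier instructions. The function name `ilp` and the three example programs are made up for illustration and do not come from any real workload.

```python
# Toy model: each instruction takes one cycle; deps[i] lists the indices of
# earlier instructions whose results instruction i needs. Names are illustrative.

def ilp(deps):
    """Return instruction count divided by critical-path length."""
    depth = []  # depth[i] = earliest cycle (1-based) in which instruction i can complete
    for d in deps:
        depth.append(1 + max((depth[j] for j in d), default=0))
    return len(deps) / max(depth)

# A serial chain: every instruction needs the previous result.
chain = [[]] + [[i - 1] for i in range(1, 8)]
# Eight fully independent instructions.
independent = [[] for _ in range(8)]
# Pairwise tree reduction: 8 loads, then 4 + 2 + 1 adds.
tree = [[] for _ in range(8)] + [
    [0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11], [12, 13],
]

print(ilp(chain))        # 1.0
print(ilp(independent))  # 8.0
print(ilp(tree))         # 3.75
```

In this model a wide back-end only helps the second and third cases; the serial chain keeps one ALU busy no matter how many are available.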

AMDK11 said:
New x86-64! - 64-256 64bit registers!

No no please no let's not do that.

AMDK11 · Aug 11, 2019

@UP
We will see

While Intel was promoting NetBurst (Pentium 4), it also had an advanced mobile core, Banias (Pentium M), a development of the Pentium III with IPC equal to or higher than K7 (Athlon / Athlon XP). Intel then expanded that core into Yonah (Core), which had IPC at the K8 (Athlon 64) level but had timing problems. Only the extension of Yonah into Conroe (Core 2) gave a safe advantage: IPC about 25% higher than K8, plus a higher clock.

I suspect that for Intel SunnyCove is not enough and they are preparing something that will give them an indisputable advantage not only over AMD but also Apple.

For now, they wait with the mobile SunnyCove and WillowCove and maybe even GoldenCove and will release NGC on the PC.

It will certainly be interesting to see what Intel and AMD will come up with for the next 3-4 years.

Richie Rich · Aug 11, 2019

Yotsugi said:
And then you find me code with enough ILP to saturate a machine this wide.

How can Apple engineers utilize so many ALUs in the A12 chip (6xALU) with no SMT? A12 is twice as fast (IPC) as a 3xALU Cortex core and +76% faster than Skylake. And we are speaking about a mobile core with very limited energy resources. I would have expected the first 6xALU design to come from some hungry HPC architecture such as IBM Power, where power consumption is secondary.

Certainly there is a way how to utilize 6xALUs with no SMT needed (ratio of 6 ALU/thread). Apple engineers found the way.
Theoretically, a super-wide 12xALU + SMT2 or an insane 24xALU + SMT4 should be possible, all with a ratio of 6 ALU/thread.

- 8xALU with SMT2 should be very easy - ratio 4 ALU/thread. Still much lower than Apple's ratio of 6.
- 8xALU with SMT4 is super conservative in terms of 2 ALU/thread. You can have both, high IPC and high throughput efficiency.
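The ratios quoted above are plain division, sketched below. The configuration labels are taken from this post; the helper name `alus_per_thread` is made up. Note that this is only a static average: SMT shares the ALUs dynamically between threads, so the number says nothing about how many ALUs a thread actually gets in a given cycle.

```python
def alus_per_thread(alus, smt_threads=1):
    """Static average of ALUs per hardware thread (not a scheduling model)."""
    return alus / smt_threads

configs = [
    ("6xALU, no SMT (A12-style)", 6, 1),
    ("12xALU + SMT2", 12, 2),
    ("24xALU + SMT4", 24, 4),
    ("8xALU + SMT2", 8, 2),
    ("8xALU + SMT4", 8, 4),
]

for name, alus, threads in configs:
    print(f"{name}: {alus_per_thread(alus, threads):g} ALU/thread")
```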

And this might be Keller's big Zen... project, developed for 8 years since 2012 to be launched as Zen3. Theoretically it is possible. They had Keller and enough time. I don't want to start a hype about Zen3. However, I believe the world deserves something much better than refurbishing 4xALUs for the next 20 years. I want a 6xALU core in my desktop, not only in an iPhone.

Richie Rich · Aug 11, 2019

AMDK11 said:
Nehalem does not have 4xALU but 3xALU!

Nehalem 3xALU, 2xAGU
SandyBridge 3xALU, 2xAGU
Haswell 4xALU, 3xAGU
Skylake 4xALU, 3xAGU
SunnyCove 4xALU, 4xAGU

Intel announced some time ago that it has been working on the groundbreaking micro-architecture of NGC (Next Generation Core) since around 2017. It is to be the basis for future generations for the next decade.

My guesses / wishes:
L1-I 48KB 12-Way
8 Decode x86 - 2x complex and 6 simple
8xALU / 6xFPU
6xAGU + 3-4xSD
L1-D 48KB 12-Way

New x86-64! - 64-256 64bit registers!

Intel / AMD x86-64 - 16x64bit registers
IBM POWER - 32x64bit registers
Intel Itanium - 128x64bit registers

You are right, Nehalem had 3xALU.
I agree that 6xALU or 8xALU design are the next step as the lowest hanging fruit.

Yotsugi · Aug 11, 2019

Richie Rich said:
Certainly there is a way how to utilize 6xALUs with no SMT needed (ratio of 6 ALU/thread). Apple engineers found the way.

Yeah by capping the clock ceiling and doing stuff.
Not exactly valid for cores like Zen or *Cove stuff, they target varying ranges of clocks.

Richie Rich said:
8xALU with SMT2 should be very easy - ratio 4 ALU/thread. Still much lower than Apple's ratio of 6.

That's not how it works and ALUs are not statically partitioned.

Ajay · Aug 11, 2019

Yotsugi said:
Yeah by capping the clock ceiling and doing stuff.
Not exactly valid for cores like Zen or *Cove stuff, they target varying ranges of clocks.

No 'legacy' crap either. I think Xcode must have an excellent optimizing compiler. I'd love to see the actual profiled IPC for Apple's A12. Intel runs around 2.4 instructions per clock on good code, IIRC.

Yotsugi · Aug 11, 2019

Ajay said:
No 'legacy' crap either.

I don't think legacy cruft impacts x86 in any way tbh.

Ajay said:
I'd love to see the actual profiled IPC for Apple's A12

Now that's a real problem, ye.

DrMrLordX · Aug 11, 2019

Apple has the advantage of controlling the entire software stack for their hardware.

Richie Rich · Aug 12, 2019

DrMrLordX said:
Apple has the advantage of controlling the entire software stack for their hardware.

This was tested in the SPEC2006 benchmark, so what "entire SW stack control" are you talking about? The algorithm is the same on all platforms. Maybe the compiler is Apple's advantage. If Apple is using an advanced compiler, x86 vendors can develop such a compiler too. If this is the way, then Intel and AMD should follow Apple's lead.

Yotsugi said:
Yeah by capping the clock ceiling and doing stuff.
Not exactly valid for cores like Zen or *Cove stuff, they target varying ranges of clocks.

And what's the problem with a low clock? IPC is independent of clock. Apple can make it run much faster if they need that for the desktop. It is not so difficult to optimize memory buffers to keep throughput at high frequencies. They need to jump from 2.5GHz to 3.5GHz and it will reach the performance of Skylake at 6GHz.
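The arithmetic behind that 6GHz figure, as a rough sketch. It takes the "+76% over Skylake" integer IPC number from earlier in the thread as given and assumes IPC stays constant as the clock rises, which is exactly the assumption under dispute here. The function name is made up.

```python
IPC_RATIO = 1.76  # the thread's "+76% over Skylake" figure, taken as given

def skylake_equivalent_ghz(a12_ghz, ipc_ratio=IPC_RATIO):
    """Clock a Skylake core would need for equal performance,
    assuming performance = IPC x clock and IPC independent of clock."""
    return a12_ghz * ipc_ratio

for ghz in (2.5, 3.5):
    print(f"A12 at {ghz} GHz ~ Skylake at {skylake_equivalent_ghz(ghz):.2f} GHz")
```

Under those assumptions 2.5GHz maps to about 4.4GHz and 3.5GHz to about 6.2GHz.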

Yotsugi said:
That's not how it works and ALUs are not statically partitioned.

SMT is even better than statically partitioned ALUs. When first thread is waiting then second thread can use all 8xALU. That's why A12's 6xALUs is much harder to keep busy than theoretical 12xALU + SMT2. The same as Bulldozer's dual 2xALU was not as good as one Zen's 4xALU + SMT2 although both have same number of ALUs.
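A toy comparison of the two arrangements, as a sketch only: two threads each have a made-up number of ready ALU ops per cycle, and ops that cannot issue in a cycle are simply dropped rather than carried over. Function names and demand numbers are illustrative, not measurements.

```python
def static_partition(demand_a, demand_b, alus_per_thread):
    """Each thread owns a fixed slice of the ALUs (two separate narrow cores)."""
    return sum(
        min(a, alus_per_thread) + min(b, alus_per_thread)
        for a, b in zip(demand_a, demand_b)
    )

def shared_pool(demand_a, demand_b, total_alus):
    """Both threads draw from one pool (SMT on a wide core)."""
    return sum(min(a + b, total_alus) for a, b in zip(demand_a, demand_b))

# Ready ALU ops per cycle; a 0 means the thread is stalled in that cycle.
demand_a = [6, 0, 0, 7, 2, 0]
demand_b = [1, 8, 3, 0, 5, 8]

print(static_partition(demand_a, demand_b, 4))  # 26 ops issued with 2 x 4 ALUs
print(shared_pool(demand_a, demand_b, 8))       # 40 ops issued with 1 x 8 ALUs
```

The shared pool wins in this example because one thread can use the whole width whenever the other is stalled.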

Thala · Aug 12, 2019

Yotsugi said:
Yeah by capping the clock ceiling and doing stuff.
Not exactly valid for cores like Zen or *Cove stuff, they target varying ranges of clocks.

They are not capping the clock ceiling whatsoever. They are capping power, which caps voltage, which caps frequency.
What you need to understand is, that they still need a very high clocking architecture to even reach 2.5GHz at low voltages.

naukkis · Aug 12, 2019

Thala said:
They are not capping the clock ceiling whatsoever. They are capping power, which caps voltage, which caps frequency.
What you need to understand is, that they still need a very high clocking architecture to even reach 2.5GHz at low voltages.

Yep. There's Cortex A72 voltage vs freq curve at 7nm :

https://fuse.wikichip.org/news/2446/tsmc-demonstrates-a-7nm-arm-based-chiplet-design-for-hpc/

A12 probably could also reach very close to 4ghz frequencies if pushed. A12 core would be extremely competitive in desktop too if Apple decides to use it.

Tuna-Fish · Aug 12, 2019

AMDK11 said:
New x86-64! - 64-256 64bit registers!

This is not useful. Modern desktop CPUs already have hundreds of registers, which they can use to increase execution width thanks to register renaming. What you are proposing to add is several hundred register names, which have both questionable utility and very real costs in a renaming OoOE cpu.
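A minimal sketch of what renaming does, under heavy simplification: there is an unlimited supply of physical registers, nothing is ever freed, and instructions are plain (dst, src1, src2) tuples. All names are made up for illustration.

```python
def rename(program, num_arch_regs=16):
    """Map architectural register names to fresh physical registers.

    program: list of (dst, src1, src2) architectural register numbers.
    Returns the same instructions expressed in physical register numbers.
    """
    # Initially architectural register rN lives in physical register pN.
    mapping = {r: r for r in range(num_arch_regs)}
    next_free = num_arch_regs
    renamed = []
    for dst, src1, src2 in program:
        p1, p2 = mapping[src1], mapping[src2]  # read sources via the current map
        mapping[dst] = next_free               # every write gets a fresh register
        renamed.append((next_free, p1, p2))
        next_free += 1
    return renamed

program = [
    (0, 1, 2),  # r0 = r1 op r2
    (3, 0, 0),  # r3 = r0 op r0  (true dependency on the line above)
    (0, 4, 5),  # r0 = r4 op r5  (reuses the name r0 but is independent)
    (6, 0, 0),  # r6 = r0 op r0  (depends only on the third instruction)
]

for inst in rename(program):
    print(inst)
```

After renaming, the two writes to r0 land in different physical registers, so the second pair of instructions can run alongside the first even though only 16 names exist.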

When AMD was designing x64, they ran a lot of simulations on running normal x86 code with more or fewer registers. Going from 8->16 was a no-brainer, but they also simulated 32 registers, and found that it only gave a few percent more speed in most code. In contrast, increasing the amount of register names has a major cost on context switches, as they all have to be stored and reloaded from the memory, so 32 registers turned out to give a net performance deficit. (There would have been no implementation cost for going to 32 names, so AMD could just purely pick the best choice, which on normal x86 code was 16 names.)
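For a sense of scale on the context-switch side, a rough sketch that counts only the 64-bit integer registers (no FP/vector state, flags or control registers). The helper name is made up.

```python
def gpr_state_bytes(names, bits=64):
    """Bytes of integer register state to save on a context switch."""
    return names * bits // 8

for names in (16, 32, 128, 256):
    size = gpr_state_bytes(names)
    # The state is stored on the way out and reloaded on the way in.
    print(f"{names:3d} names: {size:4d} bytes saved, {2 * size:4d} bytes moved per switch")
```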

Itanium had to have that many registers because it was in-order, with no renaming. This means that for execution width purposes, register names = registers on Itanium.
Most ARM loads do fewer context switches than x86 because in embedded there's less varied software running at the same time, so for their loads they chose 32, probably because they also intend to push ARMv8 to very low power targets eventually, where register renaming and OoO are not a given.

(edit)

AMDK11 said:
8 Decode x86 - 2x complex and 6 simple

Also, this would be ridiculously expensive in power, and the 2 complex decode is not useful at all. Because of its very variable-width instructions, increasing x86 decode width has an exponential power cost: every extra instruction you add costs more power than the already existing decode. This is the one place where ARM has a genuine advantage over x86: increasing decode width in ARM has linear costs, so they can just go wild and decode as much as they feel like. For x86, it's better to use a uop cache for hot loops; it's not like you are ever going to be decode-limited on 5 insns per clock in straight-line code anyway.
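A toy illustration of where the difficulty comes from. It does not model power, and it does not reproduce the "exponential" figure. It only shows that with fixed-width instructions every start offset is known up front, while with variable-width ones each start depends on every earlier length. The 1 to 15 byte range is the x86 instruction length limit; the function names and example lengths are made up.

```python
def fixed_width_starts(count, width=4):
    """Start offsets are known without looking at any instruction bytes."""
    return [i * width for i in range(count)]

def variable_width_starts(lengths):
    """Each start offset depends on the lengths of all earlier instructions."""
    starts, offset = [], 0
    for n in lengths:
        starts.append(offset)
        offset += n
    return starts

def candidate_start_count(index, min_len=1, max_len=15):
    """Number of byte offsets at which instruction `index` (0-based) could
    begin if nothing is known about the earlier instruction lengths."""
    return index * (max_len - min_len) + 1

print(fixed_width_starts(4))                 # [0, 4, 8, 12]
print(variable_width_starts([1, 3, 7, 2]))   # [0, 1, 4, 11]
print([candidate_start_count(i) for i in range(5)])  # [1, 15, 29, 43, 57]
```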

As for complex instructions, they basically only exist to maintain backwards compatibility for things no real code actually uses anymore, for "janitorial tasks" like managing CPU state (which is completely irrelevant for performance), or for string instructions, where decoding more than one per clock is useless anyway. More than one complex decode per clock is completely pointless.

itsmydamnation · Aug 12, 2019

Thala said:
They are not capping the clock ceiling whatsoever. They are capping power, which caps voltage, which caps frequency.
What you need to understand is, that they still need a very high clocking architecture to even reach 2.5GHz at low voltages.

You say this all the time, but why then did things like Bobcat/Jaguar have a very brutal clock limit but the same number of pipeline stages as Bulldozer? Why did Northwood have a ~20-stage pipeline and Prescott a 30-stage pipeline, and why was Tejas going to have a pipeline length of ~40? None of these things should have mattered/happened if architecture doesn't limit clocks.

So if all those happened what makes apple so special that no one else can do what you say? Sorry not buying it. Even a rudimentary thing like FO4 says that architecture/design/layout matter a lot for clocks.

naukkis · Aug 12, 2019

itsmydamnation said:
So if all those happened what makes apple so special that no one else can do what you say? Sorry not buying it. Even a rudimentary thing like FO4 says that architecture/design/layout matter a lot for clocks.

AMD was making the same kind of very wide CPU, K12, but they cancelled it and focused on its sister core Zen instead, which is limited by its instruction architecture. It's much easier to design wide cores for Armv8 than for x86.

itsmydamnation · Aug 12, 2019

naukkis said:
AMD was making the same kind of very wide CPU, K12, but they cancelled it and focused on its sister core Zen instead, which is limited by its instruction architecture. It's much easier to design wide cores for Armv8 than for x86.

This is just throwaway rubbish that ignores the physics of actually making something happen within a time frame. Inside a core it's all uops. Updating the state machine (retirement) is all in order. The only thing you can argue is decode. Even weak vs. strong memory model is just horse trading depending on the workload.

naukkis · Aug 12, 2019

itsmydamnation said:
This is just throwaway rubbish that ignores the physics of actually making something happen within a time frame. Inside a core it's all uops. Updating the state machine (retirement) is all in order. The only thing you can argue is decode. Even weak vs. strong memory model is just horse trading depending on the workload.

There's more than just decode. Instruction retirement is also a much bigger potential bottleneck on x86 because of total store order. Loosely ordered instruction sets can easily reuse store buffer entries and so on, which can't be done on x86. There's no point in feeding the core more if the retirement phase can't keep up.

And this is from Jim Keller's mouth: in some interview he described how K12 has a bigger engine because the more relaxed instruction set makes it practical.

Thala · Aug 12, 2019

itsmydamnation said:
You say this all the time, but why then did things like Bobcat/Jaguar have a very brutal clock limit but the same number of pipeline stages as Bulldozer? Why did Northwood have a ~20-stage pipeline and Prescott a 30-stage pipeline, and why was Tejas going to have a pipeline length of ~40? None of these things should have mattered/happened if architecture doesn't limit clocks.

I cannot comment on prescott etc. because i am not the architect of those cores and cannot reason about the push to such long pipelines.

Anyway did you take a look at the frequency voltage curve for Cortex A72 on TSMC N7? They did sign-offs for 2.8GHz@0.775V until 4.2GHz@1.375V. A12 achieves 2.5GHz at what i assume is close to nominal process voltage of 0.75V. So it will scale very similar to the Cortex A72 from above.
Corollary: If they had a critical path in the design, which would prevent reaching 4.2GHz@1.375V - the same critical path would prevent the design from reaching 2.5GHz@0.75V. What do you think Skylake or Zen2 can be clocked @0.75V?
Case closed.
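For scale, the two Cortex A72 sign-off points quoted above can be put through the usual first-order CMOS relation for dynamic power, P ~ C * V^2 * f. This is only a sketch: it ignores leakage, assumes the switched capacitance stays the same, and says nothing about whether a different core reaches the same points. The function name is made up.

```python
def dynamic_power_ratio(v_lo, f_lo, v_hi, f_hi):
    """Ratio of dynamic power between two operating points, P ~ C * V^2 * f."""
    return (v_hi / v_lo) ** 2 * (f_hi / f_lo)

# Sign-off points quoted in the post: 2.8GHz@0.775V and 4.2GHz@1.375V.
ratio = dynamic_power_ratio(0.775, 2.8, 1.375, 4.2)
print(f"frequency: x{4.2 / 2.8:.2f}, dynamic power: x{ratio:.2f}")
```

By this estimate the top sign-off point buys 1.5x the frequency for roughly 4.7x the dynamic power.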

DrMrLordX · Aug 12, 2019

Thala said:
What do you think . . . Zen2 can be clocked @0.75V?

I suppose I could find out for you if you'd like. I might have to exploit p-state OC to find out though, which is a major headache.

edit: so apparently clock stretching makes it very difficult to figure out exactly how high Zen2 can be clocked at .75V. I actually shot for .7V and got 2500 MHz stable, but when the chip keeps spitting out the same benchmark results when you move clockspeed around +/- 100 MHz (or more) from that point, it makes you think that maybe you're getting false clocks.

Anything below about 3000 MHz on this chip is goofy in terms of the performance numbers.

Thunder 57 · Aug 12, 2019

Thala said:
I cannot comment on prescott etc. because i am not the architect of those cores and cannot reason about the push to such long pipelines.

Anyway did you take a look at the frequency voltage curve for Cortex A72 on TSMC N7? They did sign-offs for 2.8GHz@0.775V until 4.2GHz@1.375V. A12 achieves 2.5GHz at what i assume is close to nominal process voltage of 0.75V. So it will scale very similar to the Cortex A72 from above.
Corollary: If they had a critical path in the design, which would prevent reaching 4.2GHz@1.375V - the same critical path would prevent the design from reaching 2.5GHz@0.75V. What do you think Skylake or Zen2 can be clocked @0.75V?
Case closed.

You're repeating a common opinion that if Apple (or anyone really) wanted to, they could scale up an ARM design and it would destroy everything. It's not that simple. Others can surely explain it better than I, but you can't just say that because it runs at x GHz and q voltage, it can run at y GHz at p voltage. So many different things affect frequency.

If there were massive gains to be had, someone would be doing it. Software is a problem, but less so every day. Apple surely has the money, but not all of the necessary technology. To answer your question, Zen+ at its lowest p-state of 2.2GHz only needs about 0.75v on my 2600X. I think it's actually 0.775v but it bounces around so much because of background tasks it never stays there long.

I wish I could give you a better answer about high performance ARM and hopefully someone else can. You may find this article to be informative, though.

DrMrLordX · Aug 12, 2019

@Thunder 57

If Apple follows through on replacing their mobile CPU lineup with Axx-derived SoCs then eventually we may find out just how high they can scale their custom ARM designs. Until then it's all speculation.

Kedas · Aug 12, 2019

There was 1.5 year between "design complete" and release of zen2.
Design Zen3 is complete now, so Zen3 in Q1 2021 ?
(Based on Intel's speed/delays they could actually do that)

Although Zen3 comes probably sooner like end 2020, Zen2 was a big change on all design levels.

DrMrLordX · Aug 12, 2019

@Kedas

Zen2 was actually shipping months before retail release (early shipment of Rome through ODM channels). Release on the desktop was delayed thanks to various AGESA issues (apparently). Zen3 is still slated for 2020. We'll see if AMD can deliver, or if it's late again. AMD may also choose to slow their cadence if Intel keeps gimping along. I think they'd be nuts to do that, but some beancounter may encourage them to slow things down a bit.

Thunder 57 · Aug 12, 2019

DrMrLordX said:
@Kedas Zen2 was actually shipping months before retail release (early shipment of Rome through ODM channels). Release on the desktop was delayed thanks to various AGESA issues (apparently)...

Let's not forget the GloFo issues. AMD wisely didn't commit to them by the looks they were able to switch to TSMC. That could have bit them hard but I think it only cost them 1-2 months. Who really knows, though? I expect to see Zen 3 in late August at best to mid October at worst.

darkswordsman17 · Aug 12, 2019

Thunder 57 said:
You're repeating a common opinion that if Apple (or anyone really) wanted to, they could scale up an ARM design and it would destroy everything. It's not that simple. Others can surely explain it better than I, but you can't just say that because it runs at x GHz and q voltage, it can run at y GHz at p voltage. So many different things affect frequency.

If there were massive gains to be had, someone would be doing it. Software is a problem, but less so every day. Apple surely has the money, but not all of the necessary technology. To answer your question, Zen+ at its lowest p-state of 2.2GHz only needs about 0.75v on my 2600X. I think it's actually 0.775v but it bounces around so much because of background tasks it never stays there long.

I wish I could give you a better answer about high performance ARM and hopefully someone else can. You may find this article to be informative, though.

I think Apple would develop a separate core (entirely separate SoC really) for such a move, so I don't think it'd be simply them pushing the clock speeds up. They like to take things slow though, and they've been working on kinda merging/porting some of their software (not unlike what they started doing leading up to them moving from Power to Intel/x86). I think it'd be a relatively slow transition though, where Apple will start by shifting the SoC that goes in the iPad Pros into the Macbook Air and Macbook (non Pros). The iPhone and regular iPads would share a SoC. Then probably a few years later we see them move the Macbook Pros and iMac to a third SoC.

Thunder 57 said:
Let's not forget the GloFo issues. AMD wisely didn't commit to them by the looks they were able to switch to TSMC. That could have bit them hard but I think it only cost them 1-2 months. Who really knows, though? I expect to see Zen 3 in late August at best to mid October at worst.

AMD shouldn't have a similar delay, and I'd guess they won't have as many issues related to I/O stuff. I don't see why Zen 3 shouldn't be ready for Computex. As far as I know, TSMC's 7+ is on track and already in volume production (I think Apple's SoC that is in the new iPhones is using it).

I could see AMD stretching things out more. Zen 2 Threadripper at CES (where it becomes the top gaming CPU by having the largest L3 cache, highest clock speeds, and most power/thermal room), with it launching shortly after. Probably Zen 2 APUs (monolithic ones for laptops for instance; not sure on chiplet ones, if we get those they probably won't show up til Computex, but we might see them do something like skip right to a Zen 3 CPU chiplet). Possibly a Navi 20 announcement with it launching in the spring (or maybe around E3). At E3 we get new console announcements. At Computex we get Zen 3 Ryzen, and then Zen 3 EPYC launches fully at the end of summer (same time as now). Then in the fall we get new console launches. Not sure on Arcturus; it's supposed to be out next year, but not sure in what respect. But if Navi 20 were to be out early, they could have Arcturus out for a fall launch. I'm not sure if there'd be any concern about upstaging the consoles, so I could see them possibly waiting for CES 2021 to announce Arcturus. They could possibly put it into production at the end of 2020 and have it ready for launch. It could also be part of them changing to a new product stack setup (i.e. splitting consumer and pro GPUs), where Arcturus would be a top-down consumer launch (and probably feature GDDR6). Or maybe the pro stuff just goes to chiplets with an I/O die on an interposer with HBM.
