The Epyc 2/Zen 2 architecture with its centralized I/O die was obviously everybody's focus at AMD's Next Horizon event. But there were also a few mentions of changes to the Zen 2 core design, which will likely have a larger impact on later Ryzen consumer products. I'd like to contrast them with Agner Fog's discussion of bottlenecks in the Zen 1 design.
Reading material:
- "The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers" from Agner Fog's Software optimization resources
- Executive Presentation by Mark Papermaster from AMD Investor Overview
Agner Fog's Microarchitecture, page 221, said:
20.21 Bottlenecks in AMD Ryzen
The throughput of each core in the Ryzen is higher than on any previous AMD or Intel x86 processor, except for 256-bit vector instructions. Loops that fit into the μop cache can have a throughput of five instructions or six μops per clock cycle. Code that does not fit into the μop cache can have a throughput of four instructions or six μops or approximately 16 bytes of code per clock cycle, whichever is smaller. The 16 bytes fetch rate is a likely bottleneck for CPU intensive code with large loops.
Most instructions are supported by two, three, or four execution units so that it is possible to execute multiple instructions of the same kind simultaneously. Instruction latencies are generally low.
The 256-bit vector instructions are split into two μops each. A piece of code that contains many 256-bit vector instructions may therefore be limited by the number of execution units. The maximum throughput here is four vector-μops per clock cycle, or six μops per clock cycle in total if at least a third of the μops use general purpose registers.
The throughput per thread is half of the above when two threads are running in the same core. But the capacity of each core is higher than what a single-threaded application is likely to need. Therefore, the Ryzen gets more advantage out of simultaneous multithreading than similar Intel processors do. Inter-thread communication should be kept within the same 4-core CPU complex if possible.
The very high throughput of the Ryzen core places an extra burden on the programmer and the compiler if you want optimal performance. Obviously, you cannot execute two instructions simultaneously if the second instruction depends on the output of the first one. It is important to avoid long dependency chains if you want to even get close to the maximum throughput of five instructions per clock cycle.
The caches are fairly big. This is a significant advantage because cache and memory access is the most likely bottleneck in most cases. The cache bandwidth is 32 bytes per clock which is less than competing Intel processors have.


Loops that fit into the μop cache can have a throughput of five instructions or six μops per clock cycle. Code that does not fit into the μop cache can have a throughput of four instructions or six μops or approximately 16 bytes of code per clock cycle, whichever is smaller. The 16 bytes fetch rate is a likely bottleneck for CPU intensive code with large loops.
"Better Instruction Pre-Fetching", "Re-Optimized Instruction Cache", "Larger Op Cache"
-> All three changes should ensure that more code fits the now larger μop cache.
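To put the 16-byte fetch limit into perspective, a rough back-of-the-envelope calculation (the 5-byte instruction length is just an illustrative average for AVX-heavy code): 16 bytes per clock ÷ ~5 bytes per instruction ≈ 3 instructions per clock, well below the five instructions per clock the core can otherwise sustain. Every hot loop that the larger μop cache can now hold escapes this limit entirely.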
The 256-bit vector instructions are split into two μops each. A piece of code that contains many 256-bit vector instructions may therefore be limited by the number of execution units.
"Doubled Floating Point Width to 256-bit"
-> Either 256-bit vector instructions now decode into a single μop, or the number of vector execution units has been doubled.
The maximum throughput here is four vector-μops per clock cycle, or six μops per clock cycle in total if at least a third of the μops use general purpose registers.
-> Either an unchanged four (now 256-bit) or eight (still 128-bit) vector μops per clock cycle.
-> Probably the former, going by the wording of "Doubled Floating Point Width to 256-bit".
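As a concrete illustration, here is a minimal C sketch using AVX intrinsics (the function name and the assumption that n is a multiple of 8 are mine):

#include <immintrin.h>

/* Elementwise add of two float arrays using 256-bit AVX instructions.
   n is assumed to be a multiple of 8 for brevity. */
void add_arrays(float *dst, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);  /* 256-bit load  */
        __m256 vb = _mm256_loadu_ps(b + i);  /* 256-bit load  */
        __m256 vr = _mm256_add_ps(va, vb);   /* 256-bit add   */
        _mm256_storeu_ps(dst + i, vr);       /* 256-bit store */
    }
}

On Zen 1 the four 256-bit instructions per iteration split into eight 128-bit vector μops, so at four vector μops per clock an iteration takes at least two cycles. If Zen 2 executes each 256-bit instruction as a single μop, one iteration per clock becomes conceivable, at least as far as the execution units are concerned.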
Inter-thread communication should be kept within the same 4-core CPU complex if possible.
-> The CCX may have grown to 8 cores, and/or inter-CCX latency may have been improved.
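For what that advice means in practice today, a minimal Linux sketch (the helper name is mine, and which CPU ids share a CCX is system-dependent, so the 0-3 mapping below is an assumption):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin a thread to one logical CPU so that two communicating threads
   can be placed on the same CCX and share its L3 cache. */
static int pin_to_cpu(pthread_t thread, int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(thread, sizeof(set), &set);
}

/* e.g. pin_to_cpu(producer, 0); pin_to_cpu(consumer, 1);
   assuming CPUs 0-3 belong to the same 4-core CCX. */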
Obviously, you cannot execute two instructions simultaneously if the second instruction depends on the output of the first one. It is important to avoid long dependency chains if you want to even get close to the maximum throughput of five instructions per clock cycle.
"Improved Branch Predictor"
-> Better branch prediction should help the core break more dependency chains.
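The classic software-side workaround, independent of Zen 2, is to split one long chain into several independent ones; a minimal sketch (function names are mine, and the reassociated sum may differ slightly in floating point):

#include <stddef.h>

/* Naive sum: every addition depends on the previous one, so the loop
   runs at the latency of a single add, not at the core's throughput. */
double sum_chained(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent accumulators: up to four adds can be in flight at
   once, letting the wide core actually use its execution units. */
double sum_split(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++) /* remainder */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}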
The caches are fairly big. This is a significant advantage because cache and memory access is the most likely bottleneck in most cases. The cache bandwidth is 32 bytes per clock which is less than competing Intel processors have.
"Re-Optimized Instruction Cache", "Larger Op Cache", "Doubled Load / Store Bandwidth", "Increased Dispatch / Retire Bandwidth"
-> For the wider 256-bit vector instructions, these improvements may be necessary to avoid new bottlenecks.
"Maintained High Throughput for All Modes"
-> This implies that integer and sub-256-bit vector instructions may also benefit from these improvements, possibly significantly so.
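A rough calculation with the add_arrays sketch from above illustrates why: each iteration moves two 32-byte loads plus one 32-byte store, i.e. 96 bytes of L1 traffic. At Zen 1's quoted 32 bytes per clock, that alone caps the loop at one iteration every three cycles, no matter how many μops the execution units could handle; doubling the load/store bandwidth directly halves that cap.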
Any other takes at reading these tea leaves, or rumors that go further?