Design changes in Zen 3 (CPU/core/chiplet only)

moinmoin

Diamond Member
Jun 1, 2017
3,295
4,531
136
It's a little less than 2 years since the thread on design changes in Zen 2. It's unfortunate that even a month before the public launch of the first Zen 3 chips we still don't have any meaty information, but with the event we at least got some rough outlines of which areas were changed and what their impact is. I hope AMD will fill in the interested public in due time.

The 19% IPC improvement broken down into the different areas:

Doc used his pixel counting skill to come up with these numbers:
  • +2.7% Cache Prefetching
  • +3.3% Execution Engine
  • +1.3% Branch Predictor
  • +2.7% Micro-op Cache
  • +4.6% Front End
  • +4.6% Load/Store

The first and essentially only Zen 3 leak, unified L3 cache per CCD, was confirmed:


  • Advanced Load/Store Performance and Flexibility
  • Wider Issue in Float and Int Engines
  • "Zero Bubble" Branch Prediction


More technical details to come, hopefully soon.

+2.7% Cache Prefetching

+3.3% Execution Engine
  • "most likely an additional int unit and fp unit, taking it to 5x INT, 3x AGU, 3x FP" via #3
+1.3% Branch Predictor

+2.7% Micro-op Cache

+4.6% Front End

+4.6% Load/Store
  • higher load/store rate (Zen 2 was 32B/cycle load and 16B/cycle store before, while Intel Skylake featured double each) via #9
 

uzzi38

Platinum Member
Oct 16, 2019
2,112
4,290
116
It's a little less than 2 years since the thread on design changes in Zen 2. It's unfortunate that even a month before the public launch of the first Zen 3 chips we still don't have any meaty information, but with the event we at least got some rough outlines of which areas were changed and what their impact is. I hope AMD will fill in the interested public in due time.

The 19% IPC improvement broken down into the different areas:

Doc used his pixel counting skill to come up with these numbers:
  • +2.7% Cache Prefetching
  • +3.3% Execution Engine
  • +1.3% Branch Predictor
  • +2.7% Micro-op Cache
  • +4.6% Front End
  • +4.6% Load/Store

The first and essentially only Zen 3 leak, unified L3 cache per CCD, was confirmed:


  • Advanced Load/Store Performance and Flexibility
  • Wider Issue in Float and Int Engines
  • "Zero Bubble" Branch Prediction


More technical details to come, hopefully soon.
I posted these in the actual Zen 3 thread, but they're possibly worth noting here as well. So compared to the XT chips performance gains made via node should be nil, and also the IPC figure is with SMT enabled.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,157
3,430
136
By wider issue in the INT and FP engine, I understood Papermaster meant the execution backend, in which case most likely an additional int unit and fp unit, taking it to 5x INT, 3x AGU, 3x FP.
So while not minuscule, the improvement is far from as radical as people would have us believe.
Rather it's the improved frontend that brought about more gains, unsurprisingly.
 

cherullo

Member
May 19, 2019
26
45
61
The following AMD patent describes a zero bubble branch predictor; it's probably closely related to the one in Zen 3, but I still haven't found the time to read it:

High performance zero bubble conditional branch prediction using micro branch target buffer

The paper below describes some new techniques to improve uop-cache utilization. Some of these may be employed on Zen3 and account for the "Micro-op Cache" contribution on the first slide on the OP.
It's an easy yet enlightening text about how the uop cache itself works:

Improving the Utilization of Micro-operation Caches in x86 Processors

Hope you enjoy.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,157
3,430
136
High performance zero bubble conditional branch prediction using micro branch target buffer
The patent was awarded to Samsung; not sure if AMD has a patent from the last year that isn't public yet.

 

moinmoin

Diamond Member
Jun 1, 2017
3,295
4,531
136
Great post @cherullo! Cross-linking patents and papers is a great way to get info that may apply to Zen 3 indeed.

By wider issue in the INT and FP engine, I understood Papermaster meant the execution backend, in which case most likely an additional int unit and fp unit, taking it to 5x INT, 3x AGU, 3x FP.
Sounds sensible.

I decided to add links and discussion to the OP by area, so please keep them coming. :blush:

Btw. Agner Fog is still finding new stuff in Zen 2 while the rest of us have already moved on. :grinning:

The patent was awarded to Samsung, not sure if AMD has a patent in the last year which was not public yet.
By Samsung's now defunct Austin CPU design team no less. But the term "zero bubble" can't be that widespread for this to be a coincidence, can it?
 

DisEnchantment

Golden Member
Mar 3, 2017
1,157
3,430
136
But the term "zero bubble" can't be that widespread for this to be a coincidence, can it?
It is related to OoO. Zen 2 also has it. So the Samsung patent actually describes how to achieve it using a micro BTB and is probably not related to Zen 3.
Zen 2 has zero bubble prediction in its first level BTB.

Software Optimization Guide for AMD Family 17h Models 30h and Greater Processors
2.8.1.2 Branch Target Buffer

Each level of BTB holds an increasing number of entries, and prediction from the larger BTBs have higher latencies. When possible, keep the critical working set of branches in the code as small as possible (see Software Optimization Guide for AMD Family 15h, Section 7.6). L0BTB holds 8 forward taken branches and 8 backward taken branches, and predicts with zero bubbles. L1BTB has 512 entries and creates one bubble if prediction differs from L0BTB. L2BTB has 7168 entries and creates four bubbles if its prediction differs from L1BTB.
 

Schmide

Diamond Member
Mar 7, 2002
5,426
379
126
Throughout the years Intel for the most part had a store rate greater than AMD's (double it). It always seemed intuitive to me that this was what made the gap in gaming: you do work on the CPU, retire to a buffer, then send that off to the GPU. The rate at which you fill that buffer directly relates to how fast you can queue transfers.

When the actual specs come out, I predict parity on this metric.
 

moinmoin

Diamond Member
Jun 1, 2017
3,295
4,531
136
One of the changes between Zen 1 and 2 was the halving of the L1 instruction cache from 64 KB back to 32 KB, with the other 32 KB being used for the micro-op cache instead. Would it be a useful change to increase the L1 instruction cache to the Zen 1 size of 64 KB again and also increase the size of the micro-op cache (which in Zen 2 is also essentially 64 KB) while at it?
 

Thunder 57

Golden Member
Aug 19, 2007
1,777
1,895
136
One of the changes between Zen 1 and 2 was the halving of the L1 instruction cache from 64 KB back to 32 KB, with the other 32 KB being used for the micro-op cache instead. Would it be a useful change to increase the L1 instruction cache to the Zen 1 size of 64 KB again and also increase the size of the micro-op cache (which in Zen 2 is also essentially 64 KB) while at it?
I honestly don't know. My guess is they left the L1 and uop caches the same, and that the next move they make regarding cache is 1MB L2s on 5nm.
 

moinmoin

Diamond Member
Jun 1, 2017
3,295
4,531
136
For the AT deep dive Andrei mostly repeated what can be garnered from above slides or we talked about before (like the paper on Improving the Utilization of Micro-operation Caches). But there are some additional interesting points:

"REP MOVS instructions have seen improvements in terms of its efficiencies for shorter buffer sizes. This means that in contrast to past microarchitectures which might have seen better throughput with other copy algorithms, on Zen3 REP MOVS will see optimal performance no matter how big or small the buffer size being copied is."

----

In the other thread I was guessing AMD would combine the IOD link bandwidth of two Zen 2 CCXs for the new bigger Zen 3 CCX. That doesn't seem to be the case, instead essentially halving the bandwidth available per core:



Andrei writes:
"One thing that AMD wasn’t able to scale up with the new L3 cache is cache bandwidth – here the new L3 actually features the same interface widths as on Zen2, and total aggregate bandwidth across all the cores peaks out at the same number as on the previous generation. The thing is now, the cache serves double the cores, so it means that the per-core bandwidth has halved this generation. AMD explains that also scaling up the bandwidth would have incurred further compromises, particularly on the power side of things. In effect this means that the aggregate L3 bandwidth on a CCD, disregarding clock speed improvements, will be half of that of a Zen2/Ryzen 3000 CCD with two CCX’s (essentially two separate L3’s)."

This is a really odd decision that I can only explain by two possibilities: prefetching is working well enough to make the reduced bandwidth matter less, and this is in preparation for another doubling of the core count. With this the number of CCDs on Epyc packages could even be doubled without changing the existing IOD, since the number of links is the same. But there isn't enough space for 16 CCDs on the package, is there?

----

Zen 3 appears to have a more aggressive prefetcher for the L1$ and some proactive cache data management happening in the background, both of which were alluded to in AMD patents @DisEnchantment mentioned in the past.
"Starting off with the most basic access pattern, a simple linear chain within the address space, we’re seeing access latencies improve from an average of 5.33 cycles on Zen2 to +-4.25 cycles on Zen3, meaning that this generation’s adjacent-line prefetchers are much more aggressive in pulling data into the L1D. This is actually now even more aggressive than Intel’s cores, which have an average access latency of 5.11 cycles for the same pattern within their L2 region."

"The fact that this test now behaves completely different throughout the L2 to L3 and DRAM compared to Zen2 means that AMD is now employing a very different cache line replacement policy on Zen3. The test’s curve in the L3 no longer actually matching the cache’s size means that AMD is now optimising the replacement policy to reorder/move around cache lines within the sets to reduce unneeded replacements within the cache hierarchies. In this case it’s a very interesting behaviour that we hadn’t seen to this degree in any microarchitecture and basically breaks our TLB+CLR test which we previously relied on for estimating the physical structural latencies of the designs."

"Overall, although Zen3 doesn’t change dramatically in its cache structure beyond the doubled up and slightly slower L3, the actual cache behaviour between microarchitecture generations has changed quite a lot for AMD. The new Zen3 design seems to make much smarter use of prefetching as well as cache line handling – some of whose performance effects could easily overshadow just the L3 increase."

----

In New and Improved Instructions Ian essentially did Agner Fog's work by detailing the bandwidth and latency improvements and regressions at an instruction level. Well worth the read if that's your thing. As Ian writes: "Compared to some of the recent CPU launches, this is a lot of changes!"

Also a comparison with Intel's approach:
"AMD, unlike Intel, does accelerated SHA so being able to reduce multiple instructions to a single instruction to help increase throughput and utilization should push them even further ahead. Rather than going for hardware accelerated SHA256, Intel instead prefers to use its AVX-512 unit, which unfortunately is a lot more power hungry and less efficient."
 

amd6502

Senior member
Apr 21, 2017
971
358
136
That core widening to 8 pipes is interesting. On the int side the engineers are saying they are confident they can just about always keep at least one branch unit busy, and that they are also mostly good with making do with 4 arithmetic pipes.

Am I right in understanding that issue width is now 10 on int side plus 6 on fpu side, for a total of 16 (up from 10/11)?

On the FPU diagram what does F2I stand for (floats to int?) ? TIA
 

moinmoin

Diamond Member
Jun 1, 2017
3,295
4,531
136
On the FPU diagram what does F2I stand for (floats to int?) ?
Yeah, float to int conversion. As Andrei writes on that part:
"AMD has opted to disaggregate some of the pipelines capabilities, such as moving the floating point store and floating-point-to-integer conversion units into their own dedicated ports and units, so that the main execution pipelines are able to see higher utilisation with actual compute instructions."
So it's not wider in actual performance but widened to make the heavy units work more consistently on the stuff they are good at.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,317
2,081
136
That core widening to 8 pipes is interesting. On the int side the engineers are saying they are confident they can just about always keep at least one branch unit busy, and that they are also mostly good with making do with 4 arithmetic pipes.

Am I right in understanding that issue width is now 10 on int side plus 6 on fpu side, for a total of 16 (up from 10/11)?

On the FPU diagram what does F2I stand for (floats to int?) ? TIA
So it's funny:
AMD have made "width" meaningless.
If the Anandtech article is correct then there are the exact same number of read/write ports to the PRF as Zen 2. So it has more "targets" but the same number of ALUs and the same amount of bandwidth to the registers, along with what looks like the same amount of issue and retirement.
 

inf64

Diamond Member
Mar 11, 2011
3,167
2,221
136
They made it wider, only they didn't :D. All jokes aside, they made much better use of the stuff that was in Zen 2; it's miraculous they managed to extract that much IPC. It can basically be called a new core, and we may even see visual differences once someone takes a picture of the bare die.
 

Abwx

Diamond Member
Apr 2, 2011
9,460
1,452
126
So it's funny:
AMD have made "width" meaningless.
If the Anandtech article is correct then there are the exact same number of read/write ports to the PRF as Zen 2. So it has more "targets" but the same number of ALUs and the same amount of bandwidth to the registers, along with what looks like the same amount of issue and retirement.

There's more data per cycle getting to the PRF. If I read correctly the LSU can load 8 x 64b and store 4 x 64b per cycle, so the effective width of this unit is 512b loads and 256b stores, which is logical since two operands are combined to output one through a MUL or an ADD.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,317
2,081
136
There's more data per cycle getting to the PRF. If I read correctly the LSU can load 8 x 64b and store 4 x 64b per cycle, so the effective width of this unit is 512b loads and 256b stores, which is logical since two operands are combined to output one through a MUL or an ADD.
I don't mean L/S bandwidth in/out of the PRF, which has gotten wider for scalar data (but not SIMD data); I mean the PRF read/write ports and the bypass network that connects the PRF to the execution units.

Has anyone found out anything about what situations the 2nd store port can be used in?
 

amd6502

Senior member
Apr 21, 2017
971
358
136
They made it wider only they didn't :D.
Well, it is a little wider. One extra branch unit, which is like a downsized specialized ALU.

The number of instructions that can be issued is a big upgrade (10 int instructions per cycle, up from 6 and 7) . It might be a hint of what's to come in Zen 4's int core (6+1+3 ?).
 

moinmoin

Diamond Member
Jun 1, 2017
3,295
4,531
136
Two interesting points made in the other thread:
Zen 3 benefits more because individual cores have massively improved available memory parallelism. For example the page table walkers were expanded; guess what, when a TLB miss happens it usually means that the L3 cache will miss as well, so Zen 3 will have more parallel requests to memory happening sooner.
A total of 192 L3 misses can be outstanding, compared to 96 on Zen 2. Plus there are cascading events: taking fewer L3 misses means that memory misses that were separated by more time happen with less separation now.

More items on average in the memory controller queue => Zen 3 is more sensitive to memory latency and parallelism. Having more ranks (either from DR or from more 1R DIMMs on the channel) increases parallelism.
This is also an ingenious way of making better use of the IOD without actually changing it.

----

Shorter pipeline:
Software Optimisation Guide is out and it states in here:
The branch misprediction penalty is in the range from 11 to 18 cycles, depending on the type of mispredicted branch and whether or not the instructions are being fed from the op cache. The common case penalty is 13 cycles.
EDIT: Just to add some context, Zen 2 is 12-18 cycles with 16 typical.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,157
3,430
136
Two interesting points made in the other thread:

This is also an ingenious way of making better use of the IOD without actually changing it.

----

Shorter pipeline:
Not sure about the shorter pipeline.
A new instruction is injected into the pipeline without waiting for all the stages of the current instruction to complete, so they shaved off some cycles of penalty in the case of a misprediction; it's not that the pipeline is shorter.
That is basically what they mean by zero bubble, at least that is how I understand it: no waiting for the pipeline to be fully flushed, i.e. zero bubbles.
A shorter pipeline would be a really radical change for x86, and it impacts how fast they can clock the architecture; x86's higher clock speeds can be partially attributed to the longer pipeline.

Example of bubbles during pipeline execution
 

moinmoin

Diamond Member
Jun 1, 2017
3,295
4,531
136
Thanks @DisEnchantment.

We finally have some die porn now:
Fritz took the shots, Locuza did the annotations.



It's beautiful.



Too bad that it broke when delidding...
For comparison, the cores from Zen 1 and Zen 2:
Sashleycat combined the recently revealed Zen 2 core annotation with Fritzchens Fritz's hi-res die shots, and added the Zen 1 core as well for comparison:

Zen 2 core:


Zen 1 core:
Notable is that with Zen 2 the aspect ratio of the core area didn't change, while with Zen 3 it did. (Also the debug/wafer test area seems huge to me, but that's uncore stuff on the CCD so off topic here.)

Somebody willing to do some pixel counting comparing the core sizes?
 
