
Design changes in Zen 3 (CPU/core/chiplet only)

moinmoin

Golden Member
Jun 1, 2017
1,915
2,143
106
It's a little less than 2 years since the thread on design changes in Zen 2. It's unfortunate that even a month before the public launch of the first Zen 3 chips we still don't have any meaty information, but with the event we at least got some rough outlines of which areas were changed and what their impact is. I hope AMD will fill in the interested public in due time.

The 19% IPC improvement broken down into the different areas:

Doc used his pixel-counting skills to come up with these numbers:
  • +2.7% Cache Prefetching
  • +3.3% Execution Engine
  • +1.3% Branch Predictor
  • +2.7% Micro-op Cache
  • +4.6% Front End
  • +4.6% Load/Store

The first and essentially only Zen 3 leak, unified L3 cache per CCD, was confirmed:


  • Advanced Load/Store Performance and Flexibility
  • Wider Issue in Float and Int Engines
  • "Zero Bubble" Branch Prediction


More technical details to come, hopefully soon.

+2.7% Cache Prefetching

+3.3% Execution Engine
  • "most likely an additional int unit and fp unit, taking it to 5x INT, 3x AGU, 3x FP" via #3
+1.3% Branch Predictor

+2.7% Micro-op Cache

+4.6% Front End

+4.6% Load/Store
  • higher load/store rate (previously Zen managed 32B/cycle loads and 16B/cycle stores, while Intel Skylake featured double each) via #9
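Those peak L1 bandwidth figures can be lined up in a quick back-of-the-envelope comparison; the "Zen 3" row below is purely hypothetical, assuming it merely catches up to Skylake:

```python
# Peak L1D bandwidth per core, bytes per cycle.
# The Zen and Skylake figures are from the post above; the
# "Zen 3 (assumed)" row is a hypothetical parity scenario.
cores = {
    "Zen/Zen 2":       {"load": 32, "store": 16},
    "Skylake":         {"load": 64, "store": 32},
    "Zen 3 (assumed)": {"load": 64, "store": 32},
}

for name, bw in cores.items():
    total = bw["load"] + bw["store"]
    print(f"{name:16s} load {bw['load']:3d} B/cy, "
          f"store {bw['store']:3d} B/cy, combined {total:3d} B/cy")
```

If the parity assumption holds, combined L1 bandwidth would double over the previous generation.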
 

uzzi38

Golden Member
Oct 16, 2019
1,151
1,920
96
It's a little less than 2 years since the thread on design changes in Zen 2. [...] More technical details to come, hopefully soon.
I posted these in the actual Zen 3 thread, but they're possibly worth noting here as well. So compared to the XT chips, performance gains from the node should be nil, and the IPC figure is also with SMT enabled.
 

DisEnchantment

Senior member
Mar 3, 2017
648
1,427
106
By wider issue in the INT and FP engines, I understood Papermaster to mean the execution backend, in which case most likely an additional int unit and fp unit, taking it to 5x INT, 3x AGU, 3x FP.
So while not minuscule, the improvement is far from as radical as some would have us believe.
Rather, improving the front end brought about more gains, unsurprisingly.
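As a sanity check on that guess: one extra ALU raises the theoretical peak integer issue width far more than the slide's +3.3% "Execution Engine" contribution, which suggests the units are rarely the bottleneck. A toy calculation (the Zen 3 unit count is speculation, not an AMD figure):

```python
# DisEnchantment's guess: one extra integer ALU (4 -> 5).
# The Zen 3 count is speculation, not confirmed by AMD.
zen2_alus, zen3_alus = 4, 5

peak_gain = (zen3_alus - zen2_alus) / zen2_alus * 100
print(f"peak INT issue width: +{peak_gain:.0f}%")  # +25%

slide_contribution = 3.3  # % IPC attributed to "Execution Engine"
print(f"slide attributes only +{slide_contribution}%, "
      f"so real code is mostly limited elsewhere (e.g. the front end)")
```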
 

cherullo

Junior Member
May 19, 2019
19
27
51
The following AMD patent describes a zero bubble branch predictor; it's probably closely related to the one in Zen 3, but I haven't found the time to read it yet:

High performance zero bubble conditional branch prediction using micro branch target buffer

The paper below describes some new techniques for improving uop-cache utilization. Some of these may be employed in Zen 3 and account for the "Micro-op Cache" contribution on the first slide in the OP.
It's an easy yet enlightening text on how the uop cache itself works:

Improving the Utilization of Micro-operation Caches in x86 Processors

Hope you enjoy.
 

DisEnchantment

Senior member
Mar 3, 2017
648
1,427
106
High performance zero bubble conditional branch prediction using micro branch target buffer
The patent was awarded to Samsung; I'm not sure if AMD has a patent from the last year that isn't public yet.

 

moinmoin

Golden Member
Jun 1, 2017
1,915
2,143
106
Great post @cherullo! Cross-linking patents and papers is a great way to get info that may apply to Zen 3.

By wider issue in the INT and FP engines, I understood Papermaster to mean the execution backend, in which case most likely an additional int unit and fp unit, taking it to 5x INT, 3x AGU, 3x FP.
Sounds sensible.

I decided to add links and discussion to the OP by area, so please keep them coming. :blush:

Btw. Agner Fog is finding new stuff in Zen 2 while we've all already moved on. :grinning:

The patent was awarded to Samsung; I'm not sure if AMD has a patent from the last year that isn't public yet.
By Samsung's now defunct Austin CPU design team no less. But the term "zero bubble" can't be that widespread for this to be a coincidence, can it?
 

DisEnchantment

Senior member
Mar 3, 2017
648
1,427
106
But the term "zero bubble" can't be that widespread for this to be a coincidence, can it?
It is related to OoO execution. Zen 2 also has it, so the Samsung patent actually describes how to achieve it using a micro BTB and is probably not related to Zen 3.
Zen 2 has zero bubble prediction in its first-level BTB.

Software Optimization Guide for AMD Family 17h Models 30h and Greater Processors
2.8.1.2 Branch Target Buffer

Each level of BTB holds an increasing number of entries, and prediction from the larger BTBs have higher latencies. When possible, keep the critical working set of branches in the code as small as possible (see Software Optimization Guide for AMD Family 15h, Section 7.6). L0BTB holds 8 forward taken branches and 8 backward taken branches, and predicts with zero bubbles. L1BTB has 512 entries and creates one bubble if prediction differs from L0BTB. L2BTB has 7168 entries and creates four bubbles if its prediction differs from L1BTB.
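Plugging the bubble costs from that excerpt into a toy model shows why keeping the hot branches in the smaller BTB levels matters; the hit-rate split below is made up purely for illustration:

```python
# Bubble costs per BTB level are from the Family 17h optimization
# guide excerpt above; the hit-rate fractions are a made-up example.
levels = [
    ("L0BTB", 0.70, 0),  # 8+8 entries, zero bubbles
    ("L1BTB", 0.25, 1),  # 512 entries, one bubble
    ("L2BTB", 0.05, 4),  # 7168 entries, four bubbles
]

avg_bubbles = sum(frac * cost for _, frac, cost in levels)
print(f"average bubbles per predicted-taken branch: {avg_bubbles:.2f}")
```

Even a small fraction of L2BTB redirects noticeably raises the average penalty, which is why shrinking the critical branch working set pays off.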
 

Schmide

Diamond Member
Mar 7, 2002
5,329
219
106
Throughout the years Intel has for the most part had a store rate greater than AMD's (double it). It always seemed intuitive to me that this was what made the gap in gaming: you do work on the CPU, retire it to a buffer, then send that off to the GPU. The rate at which you can fill that buffer directly relates to how fast you can queue transfers.

When the actual specs come out, I predict parity on this metric.
 

moinmoin

Golden Member
Jun 1, 2017
1,915
2,143
106
One of the changes between Zen 1 and Zen 2 was the halving of the L1 instruction cache from 64 KB back to 32 KB, with the other 32 KB being used for the micro-op cache instead. Would it be a useful change to increase the L1 instruction cache to the Zen 1 size of 64 KB again and also increase the size of the micro-op cache (which in Zen 2 is also essentially 64 KB) while at it?
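For scale, the "essentially 64 KB" equivalence follows if one combines the commonly quoted 4K-op capacity of Zen 2's op cache with a rough storage estimate per op; the bytes-per-op figure below is an assumption, since AMD doesn't publish the entry format:

```python
# Rough sizing behind the "essentially 64 KB" op-cache claim above.
# 4096 ops is the commonly quoted Zen 2 op-cache capacity; the
# 16 bytes/op storage estimate is an assumption, not an AMD figure.
op_capacity = 4096
bytes_per_op = 16  # assumed
print(f"op cache ~ {op_capacity * bytes_per_op // 1024} KB")

l1i_zen1_kb, l1i_zen2_kb = 64, 32
print(f"L1I: Zen 1 {l1i_zen1_kb} KB -> Zen 2 {l1i_zen2_kb} KB")
```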
 

Thunder 57

Golden Member
Aug 19, 2007
1,546
1,452
136
One of the changes between Zen 1 and Zen 2 was the halving of the L1 instruction cache from 64 KB back to 32 KB, with the other 32 KB being used for the micro-op cache instead. Would it be a useful change to increase the L1 instruction cache to the Zen 1 size of 64 KB again and also increase the size of the micro-op cache (which in Zen 2 is also essentially 64 KB) while at it?
I honestly don't know. My guess is they left the L1 and uop caches the same, and that their next move regarding cache is 1 MB L2s on 5nm.
 
