Design changes in Zen 3 (CPU/core/chiplet only)

moinmoin

Diamond Member
Jun 1, 2017
4,944
7,656
136
It's a little less than 2 years since the thread on design changes in Zen 2. It's unfortunate that even a month before the public launch of the first Zen 3 chips we still don't have any meaty information, but with the event we at least got a rough outline of which areas were changed and what their impact is. I hope AMD will fill in the interested public in due time.

The 19% IPC improvement broken down into the different areas:
[Image: Slide 6, IPC uplift breakdown by area]

Doc used his pixel counting skills to come up with these numbers:
  • +2.7% Cache Prefetching
  • +3.3% Execution Engine
  • +1.3% Branch Predictor
  • +2.7% Micro-op Cache
  • +4.6% Front End
  • +4.6% Load/Store
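As a quick sanity check, those contributions do line up with the headline figure whether you treat them as simply additive or as compounding multiplicatively (AMD didn't say which, so both readings below are my assumption). A minimal sketch:

```c
#include <stdio.h>

int main(void)
{
    /* Doc's pixel-counted per-area IPC contributions, in percent */
    const double parts[] = {2.7, 3.3, 1.3, 2.7, 4.6, 4.6};
    const int n = sizeof parts / sizeof parts[0];

    double sum = 0.0, product = 1.0;
    for (int i = 0; i < n; i++) {
        sum += parts[i];
        product *= 1.0 + parts[i] / 100.0;
    }

    printf("additive:       %.1f%%\n", sum);                     /* 19.2% */
    printf("multiplicative: %.1f%%\n", (product - 1.0) * 100.0); /* 20.8% */
    return 0;
}
```

Either way it lands in the neighborhood of the quoted 19%.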

The first and essentially only Zen 3 leak, the unified L3 cache per CCD, was confirmed:
[Image: Slide 7, unified L3 cache per CCD]


  • Advanced Load/Store Performance and Flexibility
  • Wider Issue in Float and Int Engines
  • "Zero Bubble" Branch Prediction


More technical details to come, hopefully soon.

+2.7% Cache Prefetching

+3.3% Execution Engine
  • "most likely an additional int unit and fp unit, taking it to 5x INT, 3x AGU, 3x FP" via #3
+1.3% Branch Predictor

+2.7% Micro-op Cache

+4.6% Front End

+4.6% Load/Store
  • higher load/store rate (Zen did 32B/cycle loads and 16B/cycle stores before, while Intel's Skylake featured double each) via #9
 

uzzi38

Platinum Member
Oct 16, 2019
2,622
5,879
146
I posted these in the actual Zen 3 thread, but they're possibly worth noting here as well. Compared to the XT chips, performance gains from the node should be nil, and the IPC figure is measured with SMT enabled.
[Slides: AMD footnotes on the IPC measurement, same-node comparison and SMT-enabled testing]
 

DisEnchantment

Golden Member
Mar 3, 2017
1,601
5,780
136
By wider issue in the INT and FP engine, I understood Papermaster meant the execution backend, in which case most likely an additional int unit and fp unit, taking it to 5x INT, 3x AGU, 3x FP.
So while not minuscule, the improvement is far from as radical as people would have us believe.
Rather, improving the front end brought about more gains, unsurprisingly.
 

cherullo

Member
May 19, 2019
41
85
91
The following AMD patent describes a Zero Bubble branch predictor; it's probably closely related to the one in Zen3, but I still haven't found the time to read it:

High performance zero bubble conditional branch prediction using micro branch target buffer

The paper below describes some new techniques to improve uop-cache utilization. Some of these may be employed in Zen3 and account for the "Micro-op Cache" contribution on the first slide in the OP.
It's an easy yet enlightening text about how the uop cache itself works:

Improving the Utilization of Micro-operation Caches in x86 Processors

Hope you enjoy.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,601
5,780
136
High performance zero bubble conditional branch prediction using micro branch target buffer
The patent was awarded to Samsung; not sure whether AMD has a patent from the last year that isn't public yet.

 

moinmoin

Diamond Member
Jun 1, 2017
4,944
7,656
136
Great post @cherullo! Cross-linking patents and papers is a great way for us to get info that may indeed apply to Zen 3.

By wider issue in the INT and FP engine, I understood Papermaster meant the execution backend, in which case most likely an additional int unit and fp unit, taking it to 5x INT, 3x AGU, 3x FP.
Sounds sensible.

I decided to add links and discussion to the OP by area, so please keep them coming. :blush:

Btw. Agner Fog is still finding new stuff in Zen 2 while we've all already moved on. :grinning:

The patent was awarded to Samsung; not sure whether AMD has a patent from the last year that isn't public yet.
By Samsung's now defunct Austin CPU design team no less. But the term "zero bubble" can't be that widespread for this to be a coincidence, can it?
 

DisEnchantment

Golden Member
Mar 3, 2017
1,601
5,780
136
But the term "zero bubble" can't be that widespread for this to be a coincidence, can it?
It is related to OoO; Zen2 also has it. The Samsung patent describes one way to achieve it using a micro BTB, and is probably not related to Zen3.
Zen2 has zero bubble prediction in its first level BTB.

Software Optimization Guide for AMD Family 17h Models 30h and Greater Processors
2.8.1.2 Branch Target Buffer

Each level of BTB holds an increasing number of entries, and prediction from the larger BTBs have higher latencies. When possible, keep the critical working set of branches in the code as small as possible (see Software Optimization Guide for AMD Family 15h, Section 7.6). L0BTB holds 8 forward taken branches and 8 backward taken branches, and predicts with zero bubbles. L1BTB has 512 entries and creates one bubble if prediction differs from L0BTB. L2BTB has 7168 entries and creates four bubbles if its prediction differs from L1BTB.
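To put those bubble counts in perspective, the average redirect cost per predicted-taken branch is just the hit distribution across the three levels weighted by the bubble costs. A toy model (the hit rates are made-up placeholders; only the 0/1/4 bubble costs come from the guide):

```c
#include <stdio.h>

int main(void)
{
    /* Bubble costs per BTB level, from the Family 17h optimization guide */
    const double bubbles[3] = {0.0, 1.0, 4.0};   /* L0BTB, L1BTB, L2BTB */

    /* Hypothetical fraction of taken branches predicted by each level;
       purely illustrative, real workloads vary wildly */
    const double hits[3] = {0.70, 0.25, 0.05};

    double avg = 0.0;
    for (int i = 0; i < 3; i++)
        avg += hits[i] * bubbles[i];

    printf("average bubbles per taken branch: %.2f\n", avg); /* 0.45 */
    return 0;
}
```

Which is why the guide tells you to keep the critical branch working set small enough to live in the lower levels.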
 

Schmide

Diamond Member
Mar 7, 2002
5,586
718
126
Throughout the years Intel for the most part had a store rate greater than AMD's (double it, in fact). It always seemed intuitive to me that this was what made the gap in gaming: you do work on the CPU, retire it to a buffer, then send that off to the GPU. The rate at which you fill that buffer directly relates to how fast you can queue transfers.

When the actual specs come out, I predict parity on this metric.
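For a rough sense of scale, converting the per-cycle store rates quoted earlier in the thread into bytes per second (the 4.5 GHz clock is just an example figure I picked):

```c
#include <stdio.h>

int main(void)
{
    /* example clock in GHz; bytes/cycle * GHz = GB/s */
    const double ghz = 4.5;

    printf("Zen (16B/cycle stores):     %.0f GB/s\n", 16.0 * ghz); /* 72  */
    printf("Skylake (32B/cycle stores): %.0f GB/s\n", 32.0 * ghz); /* 144 */
    return 0;
}
```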
 

moinmoin

Diamond Member
Jun 1, 2017
4,944
7,656
136
One of the changes between Zen 1 and 2 was the halving of the L1 instruction cache from 64KB back to 32KB, with the other 32KB being used for the micro-op cache instead. Would it be a useful change to increase the L1 instruction cache to the Zen 1 size of 64KB again, and to also increase the size of the micro-op cache (which in Zen 2 is also essentially 64KB) while at it?
 

Thunder 57

Platinum Member
Aug 19, 2007
2,674
3,796
136
One of the changes between Zen 1 and 2 was the halving of the L1 instruction cache from 64KB back to 32KB, with the other 32KB being used for the micro-op cache instead. Would it be a useful change to increase the L1 instruction cache to the Zen 1 size of 64KB again, and to also increase the size of the micro-op cache (which in Zen 2 is also essentially 64KB) while at it?

I honestly don't know. My guess is they left the L1 and uop caches the same, and that the next move they make regarding cache will be 1MB L2s on 5nm.
 

moinmoin

Diamond Member
Jun 1, 2017
4,944
7,656
136
For the AT deep dive Andrei mostly repeated what can be garnered from the slides above or what we talked about before (like the paper on Improving the Utilization of Micro-operation Caches). But there are some additional interesting points:

"REP MOVS instructions have seen improvements in terms of its efficiencies for shorter buffer sizes. This means that in contrast to past microarchitectures which might have seen better throughput with other copy algorithms, on Zen3 REP MOVS will see optimal performance no matter how big or small the buffer size being copied is."

----

In the other thread I was guessing AMD would combine the IOD link bandwidth of two Zen 2 CCXs for the new bigger Zen 3 CCX. That doesn't seem to be the case; instead the bandwidth available per core is essentially halved:

[Slide: Zen 3 unified L3 cache topology]


Andrei writes:
"One thing that AMD wasn’t able to scale up with the new L3 cache is cache bandwidth – here the new L3 actually features the same interface widths as on Zen2, and total aggregate bandwidth across all the cores peaks out at the same number as on the previous generation. The thing is now, the cache serves double the cores, so it means that the per-core bandwidth has halved this generation. AMD explains is that also scaling up the bandwidth would have incurred further compromises, particularly on the power side of things. In effect this means that the aggregate L3 bandwidth on a CCD, disregarding clock speed improvements, will be half of that of that of a Zen2/Ryzen 3000 CCD with two CCX’s (Essentially two separate L3’s)."

This is a really odd decision that I can only explain by two possibilities: prefetching is working well enough to make the reduced bandwidth matter less, and this is in preparation for another doubling of the core count. With this the number of CCDs on Epyc packages could even be doubled without changing the existing IOD, since the number of links is the same. But there isn't enough space for 16 CCDs on the package, is there?

----

Zen 3 appears to have a more aggressive prefetcher for the L1$ and some proactive cache data management happening in the background, both of which were alluded to in AMD patents @DisEnchantment mentioned in the past.
"Starting off with the most basic access pattern, a simple linear chain within the address space, we’re seeing access latencies improve from an average of 5.33 cycles on Zen2 to +-4.25 cycles on Zen3, meaning that this generation’s adjacent-line prefetchers are much more aggressive in pulling data into the L1D. This is actually now even more aggressive than Intel’s cores, which have an average access latency of 5.11 cycles for the same pattern within their L2 region."

"The fact that this test now behaves completely different throughout the L2 to L3 and DRAM compared to Zen2 means that AMD is now employing a very different cache line replacement policy on Zen3. The test’s curve in the L3 no longer actually matching the cache’s size means that AMD is now optimising the replacement policy to reorder/move around cache lines within the sets to reduce unneeded replacements within the cache hierarchies. In this case it’s a very interesting behaviour that we hadn’t seen to this degree in any microarchitecture and basically breaks our TLB+CLR test which we previously relied on for estimating the physical structural latencies of the designs."

"Overall, although Zen3 doesn’t change dramatically in its cache structure beyond the doubled up and slightly slower L3, the actual cache behaviour between microarchitecture generations has changed quite a lot for AMD. The new Zen3 design seems to make much smarter use of prefetching as well as cache line handling – some of whose performance effects could easily overshadow just the L3 increase."

----

In New and Improved Instructions Ian essentially did Agner Fog's work by detailing the bandwidth and latency improvements and regressions at an instruction level. Well worth the read if that's your thing. As Ian writes: "Compared to some of the recent CPU launches, this is a lot of changes!"

Also a comparison with Intel's approach:
"AMD, unlike Intel, does accelerated SHA so being able to reduce multiple instructions to a single instruction to help increase throughput and utilization should push them even further ahead. Rather than going for hardware accelerated SHA256, Intel instead prefers to use its AVX-512 unit, which unfortunately is a lot more power hungry and less efficient."
 

amd6502

Senior member
Apr 21, 2017
971
360
136
That core widening to 8 pipes is interesting. On the int side the engineers are saying they are confident they can just about always keep at least one branch unit busy, and that they are also mostly good with making do with 4 arithmetic pipes.

Am I right in understanding that issue width is now 10 on int side plus 6 on fpu side, for a total of 16 (up from 10/11)?

On the FPU diagram, what does F2I stand for (float to int?)? TIA
 

moinmoin

Diamond Member
Jun 1, 2017
4,944
7,656
136
On the FPU diagram, what does F2I stand for (float to int?)?
Yeah, float to int conversion. As Andrei writes on that part:
"AMD has opted to disaggregate some of the pipelines capabilities, such as moving the floating point store and floating-point-to-integer conversion units into their own dedicated ports and units, so that the main execution pipelines are able to see higher utilisation with actual compute instructions."
So it's not wider in raw compute, but widened so that the heavy units can work more consistently on the stuff they are good at.
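For a concrete picture of the work those F2I units do: every float-to-int cast in source code ends up as a CVTTSS2SI-style instruction that moves a value from the FP register file over to the integer side. A trivial illustration (my example, not from the article):

```c
/* The cast compiles to cvttss2si on x86-64: a mulss in the FP domain
   followed by exactly the FP-to-integer handoff that Zen 3's dedicated
   F2I port now handles instead of a main FP pipe. */
int quantize(float x, float scale)
{
    return (int)(x * scale);
}
```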
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,764
3,131
136
That core widening to 8 pipes is interesting. On the int side the engineers are saying they are confident they can just about always keep at least one branch unit busy, and that they are also mostly good with making do with 4 arithmetic pipes.

Am I right in understanding that issue width is now 10 on int side plus 6 on fpu side, for a total of 16 (up from 10/11)?

On the FPU diagram, what does F2I stand for (float to int?)? TIA
So it's funny:
AMD have made "width" meaningless.
If the AnandTech article is correct then there is the exact same number of read/write ports to the PRF as in Zen2. So it has more "targets" but the same number of ALUs and the same amount of bandwidth to the registers, along with what looks like the same amount of issue and retirement.
 

inf64

Diamond Member
Mar 11, 2011
3,697
4,015
136
They made it wider, only they didn't :D. All jokes aside, they made much better use of the stuff that was already in Zen2; it's miraculous they managed to extract that much IPC. It can basically be called a new core, and we will see even visual differences once someone takes a picture of the bare die.
 

Abwx

Lifer
Apr 2, 2011
10,937
3,440
136
So it's funny:
AMD have made "width" meaningless.
If the AnandTech article is correct then there is the exact same number of read/write ports to the PRF as in Zen2. So it has more "targets" but the same number of ALUs and the same amount of bandwidth to the registers, along with what looks like the same amount of issue and retirement.


There's more data per cycle getting to the PRF. If I read correctly the LSU can load 8 x 64b and store 4 x 64b per cycle, so the effective width of this unit is 512b loads and 256b stores, which is logical since two operands are combined into one output through a MUL or an ADD.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,764
3,131
136
There's more data per cycle getting to the PRF. If I read correctly the LSU can load 8 x 64b and store 4 x 64b per cycle, so the effective width of this unit is 512b loads and 256b stores, which is logical since two operands are combined into one output through a MUL or an ADD.
I don't mean L/S bandwidth in/out of the PRF; that has gotten wider for scalar data (but not SIMD data). I mean the PRF read/write ports and the bypass network that connects the PRF to the execution units.

Has anyone found out anything about the situations in which the 2nd store port can be used?
 

amd6502

Senior member
Apr 21, 2017
971
360
136
They made it wider, only they didn't :D.

Well, it is a little wider. One extra branch unit, which is like a downsized specialized ALU.

The number of instructions that can be issued is a big upgrade (10 int instructions per cycle, up from 6 and 7). It might be a hint of what's to come in Zen 4's int core (6+1+3?).
 

moinmoin

Diamond Member
Jun 1, 2017
4,944
7,656
136
Two interesting points made in the other thread:
Zen3 benefits more because individual cores have massively improved available memory parallelism. For example the page table walkers were expanded, and guess what: when a TLB miss happens, it usually means the L3 cache will miss as well, so ZEN3 will have more parallel requests to memory happening sooner.
A total of 192 L3 misses can be outstanding, compared to 96 on ZEN2. Plus there are cascading events: taking fewer L3 misses means that memory misses that were separated by more time now happen with less separation.

More items on average in the memory controller queue => ZEN3 is more sensitive to memory latency and parallelism. Having more ranks (either from DR or from more 1R DIMMs on the channel) increases parallelism.
This is also an ingenious way of making better use of the IOD without actually changing it.
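Little's law makes the effect of the doubled miss budget concrete: sustainable bandwidth equals in-flight bytes divided by memory latency. A sketch using the miss counts from the quote (the 80 ns round-trip latency is an assumed figure):

```c
#include <stdio.h>

int main(void)
{
    const double line_bytes = 64.0;   /* cache line size */
    const double latency_ns = 80.0;   /* assumed DRAM round trip */

    /* Little's law: bandwidth = outstanding bytes / latency.
       Bytes per nanosecond conveniently equals GB/s. */
    printf("Zen 2 ( 96 misses): %.1f GB/s\n",  96.0 * line_bytes / latency_ns);
    printf("Zen 3 (192 misses): %.1f GB/s\n", 192.0 * line_bytes / latency_ns);
    return 0;
}
```

Same IOD and latency, but in this simple model double the sustainable demand bandwidth per CCD.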

----

Shorter pipeline:
The Software Optimisation Guide is out and it states:
The branch misprediction penalty is in the range from 11 to 18 cycles, depending on the type of mispredicted branch and whether or not the instructions are being fed from the op cache. The common case penalty is 13 cycles.

EDIT: Just to add some context, Zen 2 is 12-18 cycles with 16 typical.
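Those few cycles add up quickly. A toy estimate of cycles lost to mispredicts per 1000 instructions (branch density and mispredict rate are invented round numbers; only the 16 vs 13 cycle typical penalties come from the guides):

```c
#include <stdio.h>

int main(void)
{
    const double branches_per_kinstr = 200.0; /* ~1 branch per 5 instructions, assumed */
    const double mispredict_rate     = 0.03;  /* 3%, assumed */
    const double mispredicts = branches_per_kinstr * mispredict_rate;

    printf("Zen 2 (16-cycle typical): %.0f cycles lost / 1k instructions\n",
           mispredicts * 16.0); /* 96 */
    printf("Zen 3 (13-cycle typical): %.0f cycles lost / 1k instructions\n",
           mispredicts * 13.0); /* 78 */
    return 0;
}
```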
 

DisEnchantment

Golden Member
Mar 3, 2017
1,601
5,780
136
Two interesting points made in the other thread:

This is also an ingenious way of making better use of the IOD without actually changing it.

----

Shorter pipeline:

Not sure about the shorter pipeline.
A new instruction is injected into the pipeline without waiting for all the stages of the current instruction to complete, so they shaved off some cycles of penalty in the case of a misprediction; it's not that the pipeline is shorter.
That is basically what they mean by zero bubble, at least that is how I understand it: no waiting for the pipeline to be fully flushed, i.e. zero bubbles.
A shorter pipeline would be a really radical change for x86, and it would impact how fast they can clock the architecture. x86's higher clock speeds can be partially attributed to the longer pipeline.

[Image: example of bubbles during pipeline execution]
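Roughly what the diagram shows, as a textbook five-stage sketch (IF/ID/EX/MEM/WB, not Zen's actual pipeline):

```c
/* cycle:         1    2    3    4    5    6    7
 * branch:        IF   ID   EX   MEM  WB             <- resolved in EX (cycle 3)
 * wrong path:         IF   ID   --   --             <- squashed -> bubble
 * wrong path:              IF   --   --             <- squashed -> bubble
 * correct path:                 IF   ID   EX   ...  <- fetch restarts cycle 4
 *
 * "Zero bubble" prediction means the predictor supplies the correct target
 * in time for the very next fetch cycle, so the squashed slots never occur.
 */
```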
 

moinmoin

Diamond Member
Jun 1, 2017
4,944
7,656
136
Thanks @DisEnchantment.

We finally have some die porn now:
Fritz took the shots, Locuza did the annotations.

[Die shot: Zen 3 CCD, shots by Fritzchens Fritz, annotations by Locuza]


It's beautiful.



Too bad that it broke during delidding...

For comparison, the cores from Zen 1 and Zen 2:
Sashleycat combined the recently revealed Zen 2 core annotation with Fritzchens Fritz's hi-res die shots, and added the Zen 1 core as well for comparison:

Zen 2 core:
[Die shot: Zen 2 core, annotated]


Zen 1 core:
[Die shot: Zen 1 core, annotated]
Notably, with Zen 2 the aspect ratio of the core area didn't change, while with Zen 3 it did. (Also, to me the debug/wafer test area seems huge, but that's uncore stuff of the CCD, so off topic here.)

Somebody willing to do some pixel counting comparing the core sizes?