Speculation: Ryzen 4000 series/Zen 3

Page 209 - Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

dnavas

Senior member
Feb 25, 2017
355
190
116
Is that a good thing?
Well, you could read the linked article, but tl;dr:
VAES is vectored AES -- I don't know if this covers encryption, decryption, or both, but presumably both.
PCLMULQDQ is Carry-Less MULtiply QuaDword

I think these are both encryption-related?
[ed: ninja'd by Nosta]
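For the curious, a carry-less multiply is just polynomial multiplication over GF(2): the shifted partial products are XORed instead of added. A rough pure-Python sketch of the math one PCLMULQDQ lane performs (illustrative only -- the real instruction selects 64-bit halves of 128-bit registers via an immediate operand):

```python
def clmul(a: int, b: int) -> int:
    """Carry-less (GF(2) polynomial) multiply: XOR the shifted
    partial products instead of adding them."""
    result = 0
    for i in range(b.bit_length()):
        if (b >> i) & 1:
            result ^= a << i
    return result

# (x + 1) * (x + 1) = x^2 + 1 over GF(2), i.e. 0b11 clmul 0b11 = 0b101
assert clmul(0b11, 0b11) == 0b101
```

Because there are no carries, clmul distributes over XOR, which is what makes the folding tricks used for CRC acceleration possible.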
 

moinmoin

Diamond Member
Jun 1, 2017
5,064
8,032
136
znver3 does indeed support VEX-encoded versions of VAES and VPCLMULQDQ.
If those are indeed the AVX-512 versions as Phoronix states, then those are EVEX-encoded, aren't they? Or is EVEX only necessary for 512-bit support, and VEX encoding sufficient for 256-bit (which should be all they want to support in Zen 3)? Interesting that AMD joins the AVX-512 piecemealing. It will be interesting to see how AVX-512 coverage develops over time.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,706
1,233
136
If those are indeed the AVX-512 versions as Phoronix states, then those are EVEX-encoded, aren't they? Or is EVEX only necessary for 512-bit support, and VEX encoding sufficient for 256-bit (which should be all they want to support in Zen 3)? Interesting that AMD joins the AVX-512 piecemealing. It will be interesting to see how AVX-512 coverage develops over time.
They are VEX-encoded only.


Note there is no zmm in the documentation, nor any mention of an EVEX prefix.
 

Gideon

Golden Member
Nov 27, 2007
1,771
4,132
136
If those are indeed the AVX-512 versions as Phoronix states, then those are EVEX-encoded, aren't they? Or is EVEX only necessary for 512-bit support, and VEX encoding sufficient for 256-bit (which should be all they want to support in Zen 3)? Interesting that AMD joins the AVX-512 piecemealing. It will be interesting to see how AVX-512 coverage develops over time.

No, both VPCLMULQDQ and VAES also have VEX-encoded versions. They seem to be the only two new AVX-512 extensions that do.
 
  • Like
Reactions: Tlh97 and moinmoin

lobz

Platinum Member
Feb 10, 2017
2,057
2,856
136
It should help out in encryption related workloads.
For example 7-zip:
The speed of zip AES encryption and 7z/zip/rar AES decryption was increased with the following improvements: 7-Zip now can use new x86/x64 VAES (AVX Vector AES) instructions, supported by Intel Ice Lake CPU.

ipsec:
New AES-GCM, AES-CBC and AES-CTR implementations for VAES and VPCLMULQDQ extensions

On the side, music encoding/decoding can use VPCLMULQDQ for faster CRC, I think.
net:
VPCLMULQDQ support for CRC32-Ethernet and CRC16-CCITT.

So, Vermeer and Cezanne should be able to use that.
The funny thing is, AMD is waaay ahead in decryption, and in encryption it was OK / roughly on par with Intel. Curious to see if this brings anything, but of course that would already be in the X workloads that made up the 19% average IPC increase.
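To make the CRC link concrete: a CRC32 is the remainder of a carry-less (GF(2)) division, which is exactly why VPCLMULQDQ accelerates it. Here is a bit-at-a-time reference in Python for the reflected CRC-32 used by zip and Ethernet, checked against zlib -- purely illustrative, since real VPCLMULQDQ code folds whole 128/256-bit chunks per instruction:

```python
import zlib

def crc32_bitwise(data: bytes) -> int:
    """Reflected CRC-32 (polynomial 0xEDB88320), one bit at a time.
    VPCLMULQDQ implementations compute the same remainder by folding
    large chunks with carry-less multiplies instead of looping per bit."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0xEDB88320 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

assert crc32_bitwise(b"Vermeer") == zlib.crc32(b"Vermeer")
```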
 

uzzi38

Platinum Member
Oct 16, 2019
2,705
6,427
146

Zen 3 on GB5.

Noticeable points. Compared to the fastest 1185G7 on Windows here (unfortunately, there are no TGL-U benches with 5.1.1 like the 5900X here, so this will have to do for now):


The 5900X falls 5 points below in the averaged-out single-threaded score whilst clocking between 4.775GHz and 4.95GHz. Looking at the score breakdowns, the 5900X loses heavily in crypto (2757 vs 4095), the two effectively tie in the integer workloads (1409 vs 1405), and the 5900X takes a noticeable lead in FP workloads (1837 vs 1640).

The 5950X run is using 5.2.3, but overall my talking points remain the same for the most part. The 5950X loses some points, scoring 2707 in crypto, 1400 in integer and 1764 in floating point, but the comparisons vs the 1185G7 otherwise remain the same: heavy loss in crypto, virtually the same score in integer, and a lead in floating point.
 

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136

Zen 3 on GB5.

Noticeable points. Compared to the fastest 1185G7 on Windows here (unfortunately, there are no TGL-U benches with 5.1.1 like the 5900X here, so this will have to do for now):


The 5900X falls 5 points below in the averaged-out single-threaded score whilst clocking between 4.775GHz and 4.95GHz. Looking at the score breakdowns, the 5900X loses heavily in crypto (2757 vs 4095), the two effectively tie in the integer workloads (1409 vs 1405), and the 5900X takes a noticeable lead in FP workloads (1837 vs 1640).

The 5950X run is using 5.2.3, but overall my talking points remain the same for the most part. The 5950X loses some points, scoring 2707 in crypto, 1400 in integer and 1764 in floating point, but the comparisons vs the 1185G7 otherwise remain the same: heavy loss in crypto, virtually the same score in integer, and a lead in floating point.
My 3600 gets 1216 single-core and 6969 multi-core on a 3.6 / 4.2 auto setup.
With double the cores on a 3.7 / 4.9 5900X, with faster memory to boot, I expected a bit more of a hammering, especially in multi-core workloads.
I know the multi-core test doesn't scale well (especially crypto, for whatever reason) but results still are a bit under what I expected, and I imagine it's for a variety of reasons such as it being an ES, etc.
 

Gideon

Golden Member
Nov 27, 2007
1,771
4,132
136
Noticed one more major difference - read the .gb5 on the memory choices.

The 5900X uses DDR4-3600cl18. The 5950X is using DDR4-3866cl28
Yeah, those memory timings look quite loose (pretty sure these are @ XMP).

Here is the 5900X vs my 3700X @ stock, with custom timings (nothing major, 3600MHz CL16, Ryzen Timings Calculator SAFE preset).

In Multi Core results I get almost double the AES-XTS score and 50% more in Navigation and Machine Learning.

TL;DR: Better memory should improve at least the multi-core results significantly
 

Hans de Vries

Senior member
May 2, 2008
324
1,047
136
www.chip-architect.com

Zen 3 on GB5.


The 5900X loses heavily in crypto (2757 vs 4095)

The AES crypto test for Comet Lake is ~1700, while Tiger Lake scores 4095 using the VAES instructions.

It's just a matter of time before OpenSSL is updated with Zen 3's new VAES instructions, and we'll see the result in Geekbench as a significant uplift of the overall score.
 
Last edited:

leoneazzurro

Golden Member
Jul 26, 2016
1,051
1,711
136

Zen 3 on GB5.

Noticeable points. Compared to the fastest 1185G7 on Windows here (unfortunately, there are no TGL-U benches with 5.1.1 like the 5900X here, so this will have to do for now):


The 5900X falls 5 points below in the averaged-out single-threaded score whilst clocking between 4.775GHz and 4.95GHz. Looking at the score breakdowns, the 5900X loses heavily in crypto (2757 vs 4095), the two effectively tie in the integer workloads (1409 vs 1405), and the 5900X takes a noticeable lead in FP workloads (1837 vs 1640).

The 5950X run is using 5.2.3, but overall my talking points remain the same for the most part. The 5950X loses some points, scoring 2707 in crypto, 1400 in integer and 1764 in floating point, but the comparisons vs the 1185G7 otherwise remain the same: heavy loss in crypto, virtually the same score in integer, and a lead in floating point.

Considering that GB on Windows is really bad for Zen architectures in general, it seems quite good.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Impressive results for both chips. Like @Gideon mentioned, they are done on untuned memory, so they will rise quite a bit higher.

Must be sad for Intel, when their upcoming Rocket Lake is getting beaten before it even sees action:


Looking at the results and clocks, there is no chance in hell AMD will not be ~10% ahead in ST in a highly tuned system (think DDR4-4000 CL16 with hand-tuned secondaries/tertiaries)

It's just a matter of time before OpenSSL is updated with Zen 3's new VAES instructions, and we'll see the result in Geekbench as a significant uplift of the overall score.

Let me fix this for you -> what will happen is that people will continue to use old OpenSSL for years on Debian/Ubuntu distros. Just check which OpenSSL/nginx versions Debian 10, for example, runs currently.
But the benchmark will get statically linked with the latest OpenSSL and will show speeds irrelevant to reality on Linux for 99% of users.
 
Last edited:

mikk

Diamond Member
May 15, 2012
4,245
2,299
136
Noticed one more major difference - read the .gb5 on the memory choices.

The 5900X uses DDR4-3600cl18. The 5950X is using DDR4-3866cl28


Different GB5 version as well. The ST score looks similar to the i7-1185G7; the difference is that Zen 3 runs 200MHz higher than Tiger Lake. Integer and AES look better on Tiger Lake, floating point slightly better on Zen 3.
 

uzzi38

Platinum Member
Oct 16, 2019
2,705
6,427
146
Different GB5 version as well. The ST score looks similar to the i7-1185G7; the difference is that Zen 3 runs 200MHz higher than Tiger Lake. Integer and AES look better on Tiger Lake, floating point slightly better on Zen 3.
I did mention that one in the original post.

Anyway, this is still using znver2 as @Hans de Vries has pointed out. Official Zen 3 support - alongside VAES support, which will boost crypto scores especially - will come with znver3. It's possible there may be other changes too. Treat this as a sneak peek ;)
 

Hans de Vries

Senior member
May 2, 2008
324
1,047
136
www.chip-architect.com
I did mention that one in the original post.

Anyway, this is still using znver2 as @Hans de Vries has pointed out. Official Zen 3 support - alongside VAES support, which will boost crypto scores especially - will come with znver3. It's possible there may be other changes too. Treat this as a sneak peek ;)


Zen 3's Geekbench 5 ST score can go from 1605 to 1700++ just by using the new VAES and VPCLMULQDQ instructions for the cryptography test.

The AES test counts for 5% of the overall result.

The use of these instructions for Tiger Lake added 110 points to the overall ST result.

https://eprint.iacr.org/2018/392.pdf

https://www.geekbench.com/doc/geekbench5-cpu-workloads.pdf
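As a rough sanity check on that 5% figure: per the GB5 workloads doc, the single-core score combines the subsections as a weighted geometric mean (roughly 5% crypto, 65% integer, 30% FP -- weights assumed from that doc, normalization is Geekbench internals). Plugging in the 5900X subscores from the run above:

```python
def gb5_overall(crypto, integer, fp):
    """Weighted geometric mean using the assumed GB5 subsection weights
    (crypto 5%, integer 65%, floating point 30%)."""
    return crypto ** 0.05 * integer ** 0.65 * fp ** 0.30

# 5900X subscores from the leaked run
base = gb5_overall(2757, 1409, 1837)
# Hypothetical: crypto lifted to Tiger Lake's 4095 by a VAES-enabled build
boosted = gb5_overall(4095, 1409, 1837)
print(round(base), round(boosted))  # crypto alone moves the total by a few tens of points
```

Treat the deltas as back-of-envelope only; the exact scaling constants aren't public, so this just shows how a 5% weight damps even a large crypto uplift.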
 
Last edited:

DisEnchantment

Golden Member
Mar 3, 2017
1,747
6,598
136
Zen 3's Geekbench 5 ST score can go from 1605 to 1700++ just by using the new VAES and VPCLMULQDQ instructions for the cryptography test.

The AES test counts for 5% of the overall result.

The use of these instructions for Tiger Lake added 110 points to the overall ST result.

https://eprint.iacr.org/2018/392.pdf

https://www.geekbench.com/doc/geekbench5-cpu-workloads.pdf
Yeah, those have been mentioned in the manuals since April/May of this year. (See my old post below)
In addition to the VAES changes, 256-bit vector CLMUL is very useful for (de)compression.
Also, virtualization performance should improve slightly with the PCID changes, due to the possibility of discriminating TLB flushes by process-context ID.

Just a small aggregation of Zen 3 tidbits

1. SEV-SNP instructions added. One more step toward complete VM isolation from the host.
Kernel patches are ongoing: SEV support is complete, but SNP is still in progress. MS recently implemented Autarky for Azure on Intel hosts, but I think AMD's SEV-SNP is a much more comprehensive solution.
IMO, once the live migration process for encrypted VMs is streamlined, it should be easy to deploy widely. Another open item is pinning of pages; an improvement would be for the HW to allow paging encrypted pages in and out without too much perf loss.

2. MPK/PKE support added to Programming Manual and kernel patches submitted
Another feature for Memory page protection.

3. PCID support patches submitted
Smaller hits from all those TLB flushes due to security issues.

4. 256 bit CLMUL and AES instructions
One of the things that servers do ALL THE TIME is bulk encryption and bulk compression, at which, not coincidentally, Zen2 is strong. Content compression is when the server sends your browser compressed data and your browser decodes it on the fly; it is one of the reasons massive web content is not choking the internet. Encryption is, as you know, HTTPS traffic, which needs no introduction. Any infrastructure guy worth his salt is not going to use SPEC to judge system performance.
256-bit CLMUL and AES operations are going to give decent boosts for servers in bulk encryption and content compression.

Another thing servers do a lot is load balancing and isolating the DMZ, that is, routing IP packets from outside to multiple instances of worker nodes. You would be surprised how much time a packet spends traversing the networking stack, passing through the various chains of the kernel. I know other stacks handle it differently, but on Linux it will go through the various netfilter chains: the iptables chains (input, output, forward, nat) and the ebtables chains (input, output, bridge), etc.
AVX instructions are used to speed this up by a factor of several over what is possible with scalar instructions. Topic for another day though (like in-memory DBs etc).

Outside of these, there were several new additions and instructions involving TLBs and cache handling to address security issues.
AMD's programming manual got a lot of changes recently! All in all there are more changes for Zen3 than there were for Zen2.
 

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
5900X GB5 = 1605 according to the GB5 Browser result. 3900X GB5 = 1280 per the GB5 benchmark chart. (Presumably we can get 100+ more with VAES/other updates, but let's just use these numbers.)

Part 1

Let's assume that SPECint2006 1T scales along with GB5 single-core. Known GB5/SPECint2006 results are in bold, projected results in italics.

A14:.......GB5 1586, SPECint2006 63.13 @ 3.0 GHz, SPEC/GHz = 21.04
A13:.......GB5 1327, SPECint2006 52.82 @ 2.66 GHz, SPEC/GHz = 19.86
5900X:...GB5 1605, SPECint2006 62.72 @ 4.95 GHz, SPEC/GHz = 12.67
1065G7:.GB5 ?????, SPECint2006 47.40 @ 3.9 GHz, SPEC/GHz = 12.15
3900X:...GB5 1280, SPECint2006 50.02 @ 4.6 GHz, SPEC/GHz = 10.87
9900K:...GB5 1334, SPECint2006 54.28 @ 5.0 GHz, SPEC/GHz = 10.86
10900K:.GB5 1412, SPECint2006 57.45 @ 5.3 GHz, SPEC/GHz = 10.84
* I scaled by architecture - A14 inferred SPEC score = (A13 SPEC score * A14 GB5 score) / A13 GB5 score. The 9900K/10900K are grouped similarly, as are the 3900X/5900X.
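The projection in the table is a straight ratio: scale a known SPECint2006 score by the GB5 single-core ratio within the same architecture family. Reproducing the two inferred rows:

```python
def project_spec(known_spec, known_gb5, target_gb5):
    """Scale a measured SPECint2006 score by the GB5 single-core ratio
    (only meaningful within the same architecture family)."""
    return known_spec * target_gb5 / known_gb5

a14 = project_spec(52.82, 1327, 1586)     # A14 projected from A13
r5900x = project_spec(50.02, 1280, 1605)  # 5900X projected from 3900X
print(round(a14, 2), round(r5900x, 2))                # 63.13 62.72
print(round(a14 / 3.0, 2), round(r5900x / 4.95, 2))   # 21.04 12.67
```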

First, perusing the top CPU charts for raw single-core GB5 results, you see the 3800XT at 1354 (4.7 GHz) and the 3800X at 1291 (4.5 GHz). This works out to a minuscule 0.4% GB5-per-GHz gain for the XT chips over the X chips, my inference being that this reflects binning rather than any actual process performance/efficiency gains. This is important because Zen3 uses the same process, hence we can infer that any GB5-per-GHz gain from Zen2/2+ to Zen3 would come from core/CCX redesign rather than process improvements. In single-core GB5, the 5900X shows a 25.4% gain in raw score and a 16.5% gain in score per GHz. This is lower than the claimed 19% IPC and likely reflects the algorithm- and benchmark-specific issues discussed by Hans de Vries and others above.

What's remarkable here is not just the 25.4% raw gain and 16.5% IPC gain from Zen2 to Zen3. What's remarkable is that as generationally amazing as Apple's chips have been, A13 to A14 gains only 19.5% raw and 6.0% IPC between generations.

What AMD have done here is exceeding even Apple's advancement. And remember - some of Apple's IPC gain is coming from process change. AMD has NO process change. So AMD is all design, Apple is part design part process.

Part 2

Since scaling with GB5 is an inference, why not just use manufacturer claims? Even if they might be slightly (or wildly) different from our results...

Now, if we simply assume (and give lots of leeway to AMD and Apple on their claims) that:

- Zen3 will gain 19% in IPC over Zen2
- A14 will gain 17% in raw performance over A13*
* Apple said A14 is +40% over A12, and that A13 was +20% over A12, so from their marketing we can assume A14 is +17% over A13

Then the chart looks like this (A13, A14, 3900X, and 5900X only):

A14:.......SPECint2006 61.80 @ 3.0 GHz, SPEC/GHz = 20.60
A13:.......SPECint2006 52.82 @ 2.66 GHz, SPEC/GHz = 19.86
5900X:...SPECint2006 64.05 @ 4.95 GHz, SPEC/GHz = 12.94
3900X:...SPECint2006 50.02 @ 4.6 GHz, SPEC/GHz = 10.87

SPEC/GHz (IPC) generational gains:
A13 -> A14 = +3.7%
Zen2 -> Zen3 = +19%
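For anyone who wants to check the claims-based rows, the same arithmetic using the +17% and +19% figures assumed above:

```python
# A14 over A13, derived from Apple's marketing: +40% over A12 vs A13's +20% over A12
print(round(1.40 / 1.20, 3))           # ~1.167, rounded to +17% above

a14_spec = 52.82 * 1.17                # A13 SPECint2006 scaled by +17%
zen3_per_ghz = (50.02 / 4.6) * 1.19    # 3900X SPEC/GHz scaled by AMD's +19% IPC claim
print(round(a14_spec, 2))              # 61.8
print(round(zen3_per_ghz, 2))          # 12.94
print(round(zen3_per_ghz * 4.95, 2))   # 64.05
```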

I'd like to point out that Apple's actual performance gains have held up: with clock speed increases and process improvements, they've been able to squeeze out a 19.8% raw performance increase from A12 -> A13 and, it appears, a 19.5% raw performance increase from A13 -> A14, for a total actual raw performance increase of 43.1%. This matches or exceeds their performance claims.

Part 3

What Apple has done with performance efficiency is remarkable, but it appears AMD is making even better generational leaps than they have, if it plays out this way...

From A12 -> A14, half of Apple's 43.1% performance gains have come solely from clock speed increases (a 20.5% increase in clocks from the 2.49 GHz A12 -> 3.0 GHz A14). That means about 18.8% came from design changes or process improvements. N7 (A12) -> N5 (A14) is +15% speed at isopower, hence we can infer that possibly as little as 3.8% of the raw performance increase came from actual core/uarch changes over the last two years.

Well, that's a low-end estimate. I'm being lazy by not taking into account a possible reduction in performance gains in order to reduce power draw since they also increased clocks, but I don't feel qualified to closely examine these things, nor do I know how to scale performance vs power on node changes to figure out exactly how much performance and how much power savings from the N7 -> N5 change Apple decided to use -- particularly since we have no detailed review out yet for the A14.

However, we do have the history that from A12 -> A13 Apple increased power usage on the Lightning core over the Vortex core by 28.4% and increased clock speeds 6.8% for a 16.5% SPECint gain. Since N7 -> N7P provides 7% isopower performance gains, we can infer that Apple took the performance side of the process change rather than the efficiency side, pushing power consumption up by 28.4% for that 16.5% performance gain. Clocks and process improvements account for ~14% of the 16.5% SPECint2006 gain, leaving little room for actual uarch/core improvements; I wouldn't expect things to be much different for A13 -> A14.

Meanwhile, in a single generation from Zen2 -> Zen3, AMD's performance gains in single-core GB5 (totaling 25.4%) are 7.6% from clock speed increases (4.6 GHz -> 4.95 GHz) and approximately 16.5% from design changes or process improvements. Since the process on Zen2 and Zen3 is the same, that means the 16.5% is all from core/uarch changes.

Food for thought, huh?
 
Last edited:

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
@amrnuke

That analysis isn't going to make all the locals happy. What's interesting to me is that AMD is managing that IPC increase within a much higher clockspeed range than is Apple . . .
I'm actually quite stunned at the results. If the data bear out, Zen3 represents a remarkable improvement in pure core/CCX design over a single generation.

We see the 17% IPC improvement from 2700X -> 3900X and think that this 19% IPC improvement with Zen3 is expected. But we forget that Zen+ -> Zen2 was also a GF 12nm -> TSMC 7nm jump, for which AMD claimed a 25% performance gain at iso-power, or a 50% power improvement at the same performance. Ultimately they took a blended approach and saw a +15% IPC gain while drawing less power. But it's clear from Anandtech's deep dive of the Zen2 uarch that Zen2 was not intended to be a full core redesign. In fact, they note "Zen 2 looks a lot like Zen" despite the new branch predictor, larger uop cache, larger L3$, increased integer and load/store resources, and AVX2 support. I get the sense that we won't get a similar statement (Zen3 looks a lot like Zen2) when we get a new deep dive.

I'm still quite fond of Apple's work. They are in a great sweet spot of power and int/fp speed. But their performance gains seem to be heavily dependent on process and clocks, and I wonder if their innovations are more on the end of ensuring they are able to continue to increase clock speeds while managing power all while leveraging process improvements, rather than advancing uarch and core design.

If Zen4 leverages a pure 15% speed@isopower improvement and 1.8x logic density from N5, while letting Apple sort out the kinks of defect density, AMD are going to be in a very good place. I'd even love to see a Zen3+ on N5 some time in July next year.
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
That analysis isn't going to make all the locals happy. What's interesting to me is that AMD is managing that IPC increase within a much higher clockspeed range than is Apple . .

Yes clearly x86-64 still has plenty of juice left in the glass, despite what ARM fans might say. It's going to be awesome watching an invigorated AMD (and Intel as well if they sort out their manufacturing woes) duke it out with ARM designs in the next couple of years in HPC, servers etcetera...
 

soresu

Diamond Member
Dec 19, 2014
3,214
2,490
136
Great analysis @amrnuke !

Very interesting indeed. I wonder what the values would be at equal SPEC performance?

It should be interesting to see what gains N6 brings to the party if that is indeed the process that Rembrandt will be fabbed on.
 
  • Like
Reactions: Tlh97