
Speculation: Ryzen 4000 series/Zen 3


Richie Rich

Senior member
Jul 28, 2019
470
228
76
That 4 Ghz example was a test chip, I saw nothing to indicate any advantageous power consumption in the news surrounding it.
It's pointless comparing A72 to the Axx cores, they are completely different uArch designs by different engineering teams.
And this test chip proved that the A72 can run at 4 GHz if it is manufactured on an HP process and the voltage is raised. Apple's A13 could run at 4 GHz too if manufactured the same way. This comparison is very useful once you know how CPU pipeline stages work. The number of stages matters less (useful, but it can be misleading) than how many transistors sit in the critical path of each stage. How fast a signal propagates along that critical path is a known transistor characteristic (the frequency/voltage curve). Thanks to this we can indirectly compare how short (fast) the pipeline stages are. For example, if the A72 runs at 2.5 GHz @ 0.7V and the A12 needs the same 0.7V for 2.5 GHz, their stage lengths are very similar (no matter how many stages the core has in total). Similar stage length means similar scaling to 4 GHz for Apple CPUs as well. That's physics.

Again, it's about power - it's not "low hanging fruit", it's sheer power consumption.
Zen and Core derivatives perform far more efficiently below 3 Ghz - even 12nm could get you a 45W 8C Zen+ CPU at 2.8 Ghz (2700E).
I'd be interested to see someone do some tests on the 8C Zen2 SKU's underclocked to that 2700E range and see how many watts it pulls on average loads.
It looks like you don't want to understand. If the TDP limit for a desktop CPU is typically 65-105W, then Intel and AMD have very different design constraints than ARM chips with a 4W TDP. And very different low-hanging fruit too. With such a high TDP, higher frequency (by raising voltage to 1.3V) is a very simple way to raise performance (+77% for the 3950X). Frequency is the low-hanging fruit for x86 desktop. Apple's mobile chips cannot waste that much energy, so they have to go the hardest way: increasing IPC. However, the hardest way always pays dividends (same in real life with training and learning), and Apple ends up with the most advanced CPU core in the world, with a massive +82% IPC advantage over Zen 2.

Apple's A13 is beating a very finely binned Ryzen 9 3950X @ 4.7 GHz. If you set Zen 2 to a clock every chip is capable of (like the A13 is), it would be somewhere around 4.2 GHz... and the A13 would win over Zen 2 by a higher margin. The 64-core EPYC with a 280W TDP (4.4W/core at 2.5 GHz) looks very competitive with the A13 (5W/core @ 2.6 GHz). Until you realize the A13 has a massive +82% IPC advantage. EPYC would need more than 1000W TDP to be able to run at 4 GHz and would still be slower than the A13 @ 2.6 GHz. I tell you guys, a Nuvia server chip with IPC like the A13 would be a total killer for most of the x86 server world today. AMD needs a +25% IPC jump every year to stay competitive against Nuvia in 2024. That's why Zen 3 must be something very good. Much better than the leaks suggest. If the +12% INT IPC gain is correct for Zen 3, then AMD will fall into mediocrity again (same as the K10.5/Barcelona/Thuban age).
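The power figures in this post can be sanity-checked with a few lines of arithmetic. This is only a rough sketch: it assumes the whole 280W package budget is split evenly across cores (ignoring the I/O die and uncore) and that dynamic power scales as frequency times voltage squared. The function names are made up for illustration. Rather than guess at voltages, it asks how much the voltage would have to rise for the package to reach 1000W at 4 GHz.

```python
import math


def per_core_watts(package_watts: float, cores: int) -> float:
    """Naive per-core share of package TDP (ignores I/O die and uncore)."""
    return package_watts / cores


def voltage_ratio_for_power(p_base: float, p_target: float,
                            f_base: float, f_target: float) -> float:
    """Voltage ratio V_target / V_base implied by the assumption P ~ f * V^2."""
    return math.sqrt((p_target / p_base) / (f_target / f_base))


if __name__ == "__main__":
    # 64-core EPYC, 280 W package: roughly the 4.4 W/core quoted in the post.
    print(f"{per_core_watts(280, 64):.2f} W per core")
    # Voltage increase needed to go from 280 W @ 2.5 GHz to 1000 W @ 4.0 GHz.
    ratio = voltage_ratio_for_power(280, 1000, 2.5, 4.0)
    print(f"{ratio:.2f}x voltage")
```

Under these assumptions the 1000W figure corresponds to a voltage increase of about 1.49x between 2.5 GHz and 4 GHz. Whether real silicon needs that much is a separate question that the arithmetic cannot answer.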

Because you need every part of the core to be synchronous. The faster the clock frequency, the harder it is to reach same perf/clock, because the same circuits need to be driven harder. Not only that, sometimes you need whole different circuitry that's more complex just to do the same thing at lower frequencies.

Intel's Atom-based Goldmont Plus has a L1 cache latency of 3 cycles. Skylake and Icelake is 5 cycles(Skylake is 4 in some corner case scenarios). Goldmont Plus aims for 3GHz, and SKL/ICL for 4-5GHz.

You see, 5 cycles @ 5GHz is same as 3 cycles @ 3GHz.
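The cycles-versus-clock point is easy to verify: latency in nanoseconds is just cycles divided by the clock in GHz. A minimal sketch (the function name is made up for illustration, and the cycle counts are the ones quoted in the post above):

```python
def latency_ns(cycles: int, clock_ghz: float) -> float:
    """Absolute latency in nanoseconds for a given cycle count and clock."""
    return cycles / clock_ghz


if __name__ == "__main__":
    # L1 latencies quoted in the post: Goldmont Plus 3 cycles, Skylake 5 cycles.
    print("Goldmont Plus @ 3 GHz:", latency_ns(3, 3.0), "ns")
    print("Skylake       @ 5 GHz:", latency_ns(5, 5.0), "ns")
    print("Skylake       @ 4 GHz:", latency_ns(5, 4.0), "ns")
```

Both designs land at 1 ns at their target clocks. The higher-clocked core only pays for its extra cycles when it runs below the frequency it was designed for.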
Nice argument and I agree with that. However, when you take Apple's A13 instead of Atom, it delivers the same level of absolute performance as SKL, just at a much lower clock. This means the A13 core has to process the same number of instructions per unit of time as SKL @ 4.5 GHz, so both cores (A13 and SKL) face the same latency issues. So the memory and cache system is probably very similar for both cores.

But the main question was whether there is any hard obstacle to designing a very wide 6xALU core operating at clocks as high as SKL's. There isn't. Yes, it would need a good cache system, and yes, it would still suffer on some types of code. But look at how Apple handled it. Their first 6xALU core, A11 Monsoon, was faster than their last 4xALU core, A10 Hurricane, but not by as much as the +50% ALU increase would suggest. Apple kept working hard on the cache and memory system with the A12 and A13 and finally got great IPC. The good old recipe for success is also my wish for everybody here for the new year of 2020: much success through hard work, no excuses.
 

soresu

Golden Member
Dec 19, 2014
1,672
854
136
Apple's A13 is beating very fine binned Ryzen 9 3950X @ 4.7 GHz. If you would set Zen2 at clock every chip is capable of (like A13 is) then it would be somewhere around 4.2 Ghz.... and A13 would win over Zen 2 with higher margin. 64-core EPYC with TDP 280W (4.4W/core at 2.5 GHz) looks very competitive with A13 (5W/core @2.6 Ghz). Until you realize A13 has a massive +82% IPC advantage. EPYC would need more than 1000W TDP to be able to run at 4Ghz and be still slower than A13@2.6 GHz. I tell you guys Nuvia server chip with IPC like A13 would be total killer for most x86 server world today. AMD needs +25% IPC jump every year to stay competitive against Nuvia on 2024. That's why Zen3 must be something very good. Much better than leaks suggests. If +12% INT IPC gain is correct for Zen3 then AMD will fall into mediocrity again (same as K10.5/Barcelona/Thuban age).
AMD will not fall into mediocrity unless they manage to fall far behind Intel when they should be in the lead.

The ARM server/desktop market may certainly grow in the future, but it's not going to displace x86 significantly anytime soon.

Besides which, by 2024 Nuvia will inevitably have competition from ARM themselves; A78 is due in May, and we will probably still see a new core every year from ARM.

It's easier to just write K10 age (or Stars using the desktop codenames).

You somehow managed to use 3 different variants of codenames in differing generations of it there.

K10.5 is the second gen core, Barcelona is the first gen server SKU (Agena is the desktop variant also affected by TLB errata), Thuban is the desktop SKU for K10.6/third gen core/6 core monolithic.

I get testy about the K10 comment because I had a Phenom II X4, it was a great chip for much better value than the Intel product on the market.

Bulldozer was mediocrity, K10 was just second best of its generation, but a good second best well worth buying at the time.
Apple ends up with the most advanced CPU core on the world with massive +82% IPC advantage over Zen 2.
The most advanced CPU core? Arguable to say the least with the likes of A64FX knocking around now with 512 bit SVE - there's more to being advanced than great scalar IPC.

Remember that for all that vaunted scalar oomph, Axx cores are currently crippled in the SIMD arena when comparing to x86 Zen and Core competitors.

Apple also lacks the software on any iOS platform that compares to x86 systems using MacOS, Windows or Linux - all that awesome IPC is as useful as a paperweight for most serious work outside the new Photoshop iOS release.
 

A///

Senior member
Feb 24, 2017
922
643
136
I want to casually point out that 5nm, PCIe5, DDR5, AM5 would fit nicely on a 5/5/2021 launch date. Note that 2021->2+0+2+1=5 :D
I suppose that works. I only brought it up because AMD launched Zen2 on 7/7 this year. I haven't bought AMD in like two decades, so I don't know if these numbers games are something they do all the time.
 

IntelUser2000

Elite Member
Oct 14, 2003
7,545
2,355
136
As far as this table goes, it's a bit unfair to the 4 and 5 ghz contenders because scaling is a very crude approximation. You should run the 9900k and 3950X at 2.5GHz. Otherwise it's as if you handicapped these competitors with very high latency memory (about 2x latency of whatever RAM the A12 was using).
SpecCPU2006 has a scaling factor of ~85% for most CPUs. Just by that it would bring advantage against Zen 2 to 70%.
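One way to read the 85% scaling factor that reproduces the ~70% figure is a simple linear model: each 1% of extra clock buys 0.85% of extra performance. That model is an assumption, not something the post spells out, and the function names below are made up for illustration. The clocks (4.7 GHz for the 3950X, 2.6 GHz for the A13) and the +82% per-clock lead come from earlier in the thread.

```python
def perf_ratio(freq_ratio: float, scaling: float) -> float:
    """Performance gain for a given clock gain under a linear scaling model."""
    return 1.0 + scaling * (freq_ratio - 1.0)


def adjusted_ipc_lead(naive_lead: float, f_high_ghz: float,
                      f_low_ghz: float, scaling: float) -> float:
    """Per-clock lead once the high-clocked chip is judged at the lower clock.

    naive_lead is the lead computed by dividing scores by clock, which
    implicitly assumes perfect (100%) frequency scaling.
    """
    freq_ratio = f_high_ghz / f_low_ghz
    # Per-clock performance the fast chip regains when slowed down.
    ipc_uplift = freq_ratio / perf_ratio(freq_ratio, scaling)
    return (1.0 + naive_lead) / ipc_uplift - 1.0


if __name__ == "__main__":
    lead = adjusted_ipc_lead(0.82, 4.7, 2.6, 0.85)
    print(f"adjusted lead: {lead:.1%}")
```

With these inputs the lead drops from 82% to just under 70%, which lines up with the estimate in the post. A lower scaling factor shrinks it further.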

But, SpecCPU is not the lightest workload. The 9900K and 3950X are both at the edge of what's possible in terms of frequency. Because the boost can be easily dislodged, it further disadvantages them.

I would say realistically it's 60-65%. That's still an astonishing difference.

There's also a matter of x86 vendors standing still. What if it was going against Golden Cove derivatives instead of Skylake? Even if you consider a lower gain of 15% for Sunny Cove, and 15% for Golden Cove, you would reduce the difference to 25%. Add 8% for Willow Cove? It becomes 15%.
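The generational estimates above compound multiplicatively, so the remaining gap is the current lead divided by the product of the gains. A small sketch, taking the upper end of the 60-65% range and the per-generation gains quoted in the post (the function name is made up for illustration):

```python
from math import prod  # available since Python 3.8


def remaining_lead(current_lead: float, gains: list[float]) -> float:
    """Lead left after the trailing design compounds the given IPC gains."""
    return (1.0 + current_lead) / prod(1.0 + g for g in gains) - 1.0


if __name__ == "__main__":
    # 65% lead today, then +15% (Sunny Cove) and +15% (Golden Cove).
    print(f"{remaining_lead(0.65, [0.15, 0.15]):.1%}")
    # Add a further +8% (Willow Cove).
    print(f"{remaining_lead(0.65, [0.15, 0.15, 0.08]):.1%}")
```

This gives roughly 25% and 15-16%, matching the figures in the post. Starting from 60% instead of 65% gives about 21% and 12%.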

Let's say we had 3.5GHz Golden Cove instead of 5GHz Skylake. Would look much better against ARM competition even though the overall result may not be a huge advance on the desktop side.
 

amd6502

Senior member
Apr 21, 2017
848
296
136
SpecCPU2006 has a scaling factor of ~85% for most CPUs. Just by that it would bring advantage against Zen 2 to 70%.

But, SpecCPU is not the lightest workload. The 9900K and 3950X are both at the edge of what's possible in terms of frequency. Because the boost can be easily dislodged, it further disadvantages them.

I would say realistically its 60-65%. That's still an astonishing difference.
Can anyone point to how to run a SpecCPU on linux?

It should be trivial to limit the frequency and test at the correct frequency where there is minimal scaling approximation error.

Just do a
for c in $(seq 0 23); do sudo cpufreq-set -c "$c" -u 2.5GHz; done
and this should get you to a very close frequency. Then look up where it would be running with a cpufreq-info query. Then run your benchmark.
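To confirm the cap actually took effect before benchmarking, the per-core clocks can also be read straight from the Linux cpufreq sysfs files, where `scaling_cur_freq` reports kHz. A minimal sketch: the function names and the 50 MHz tolerance are arbitrary choices for illustration, and the sysfs root is a parameter so the logic can be tried on a machine without cpufreq support.

```python
from pathlib import Path


def read_cur_freqs_mhz(sysfs_root: str = "/sys/devices/system/cpu") -> dict[str, float]:
    """Return {cpuN: current frequency in MHz} from cpufreq sysfs files."""
    freqs = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_cur_freq"
    for path in sorted(Path(sysfs_root).glob(pattern)):
        cpu = path.parent.parent.name            # e.g. "cpu0"
        khz = int(path.read_text().strip())      # sysfs reports kHz
        freqs[cpu] = khz / 1000.0
    return freqs


def cpus_above_cap(freqs_mhz: dict[str, float], cap_mhz: float,
                   tolerance_mhz: float = 50.0) -> list[str]:
    """List the CPUs currently running above the intended cap."""
    return [cpu for cpu, mhz in freqs_mhz.items()
            if mhz > cap_mhz + tolerance_mhz]


if __name__ == "__main__":
    current = read_cur_freqs_mhz()
    for cpu, mhz in current.items():
        print(f"{cpu}: {mhz:.0f} MHz")
    offenders = cpus_above_cap(current, cap_mhz=2500.0)
    if offenders:
        print("Above the 2.5 GHz cap:", ", ".join(offenders))
```

Sampling this a few times while the benchmark runs is a cheap way to catch a core that ignores the limit.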

If this bench is easy to get then pretty much anybody with a 9900k-gen core or 3600/Matisse system could get us more accurate IPC comparison numbers.

For benchmarks that fit easily within the cache you're right the scaling should be very high (close to 100%). How long the peak boost is sustained is a good point, and this will also depend much on case cooling and cpu cooler--- so, it's a very bad idea to run at boost frequency if you want to measure IPC.

I haven't looked into it at all but I'm very skeptical of Apple's acorn core achieving a 60% IPC lead over newest gen x86.
 

IntelUser2000

Elite Member
Oct 14, 2003
7,545
2,355
136
If this bench is easy to get then pretty much anybody with a 9900k-gen core or 3600/Matisse system could get us more accurate IPC comparison numbers.

For benchmarks that fit easily within the cache you're right the scaling should be very high (close to 100%). How long the peak boost is sustained is a good point, and this will also depend much on case cooling and cpu cooler--- so, it's a very bad idea to run at boost frequency if you want to measure IPC.
Even with good cooling, I found some errors in calculations when allowing the CPU to boost to Turbo. The clock frequency has to be fixed.

I assume it should be easy to set low frequencies? Just go to BIOS and set the multiplier to 27x for 2.7GHz. Right?

I haven't looked into it at all but I'm very skeptical of Apple's acorn core achieving a 60% IPC lead over newest gen x86.
I can believe this. AMD is just catching up to Skylake, and Intel has stood in a spot for years. ~15% for Sunny/Golden cove, plus 7-8% for Willow Cove gets us 40-45%.

That's still behind A13, but it has the advantage of having a top to bottom vertical stack plus running at much lower frequencies.

For benchmarks that fit easily within the cache you're right the scaling should be very high (close to 100%).
Even 85% is a high figure for some workloads. Some may end up even lower at 60%, and this is true for many floating point workloads because they gobble up bandwidth. I don't like playing with SpecFP numbers for the same reason. It's harder to isolate uarch differences. SpecInt is much better.

Dhrystone scales 100% with clocks, and Geekbench probably at 99% or even higher.
 

DisEnchantment

Senior member
Mar 3, 2017
844
2,121
136
Another week, another couple of AMD patent applications related to stacked memory.

This one is a very novel idea. Quite interesting to read.
20190333876
METHOD AND APPARATUS FOR POWER DELIVERY TO A DIE STACK VIA A HEAT SPREADER
Various chip stack power delivery circuits are disclosed. In one aspect, an apparatus is provided that includes a stack of semiconductor chips that has an uppermost semiconductor chip and a lowermost semiconductor chip. A heat spreader is positioned on the uppermost semiconductor chip. A power transfer circuit is configured to transfer electric power from the heat spreader to the uppermost semiconductor chip.
View attachment 12775


To me this patent could be related to the integrated thermo-electric cooler patent (20180358080/20190122704). They could deliver power to the integrated thermo-electric device to extract the heat. They talk about the many ways of delivering power to different layers in the stack device from the heat spreader. Each layer could be the SRAM or could be a heat transfer layer which also serves as a protection layer.


20190333876
CONFIGURATION OF MULTI-DIE MODULES WITH THROUGH-SILICON VIAS
A data processing system includes a processing unit that forms a base die and has a group of through-silicon vias (TSVs), and is connected to a memory system. The memory system includes a die stack that includes a first die and a second die. The first die has a first surface that includes a group of micro-bump landing pads and a group of TSV landing pads. The group of micro-bump landing pads are connected to the group of TSVs of the processing unit using a corresponding group of micro-bumps. The first die has a group of memory die TSVs. The subsequent die has a first surface that includes a group of micro-bump landing pads and a group of TSV landing pads connected to the group of TSVs of the first die. The first die communicates with the processing unit using first cycle timing, and with the subsequent die using second cycle timing.

View attachment 12778


This is also very interesting: a quite detailed description of what the layout is going to look like, data transfer mechanisms, clocking, synchronization, etc. Looks like development of this is quite advanced.

Zen 4 stuffs I would say.
There are lots of novel patents around PIM as well but probably GPU and FPGA related.
Quoting myself for reference, this new patent is a lot more interesting and seems more practical than the thermoelectric stacked die patent that they made some years ago.
The theme around stacked dies is so recurring in all these patent applications you can bet it is happening real soon (Zen4 latest I would bet) ( similar to how I kept quoting patents for Zen 2 and they did happen for real )

This time, there are other chips stacked on top of the IOD.
It seems highly probable that the IOD will be based on the same process node.
The stacked chips are manufactured differently.
Everything will be wrapped together by a molding material as a single chip.

20190393124 ARRANGEMENT AND THERMAL MANAGEMENT OF 3D STACKED DIES

Untitled1.png


What is described here is that
  • The cache and the IO will be located on the center die which is a low heat producing block. This seems to reconcile with their patents of a big unified L3.
  • The processor cores/compute blocks are on the periphery, with a dummy die stacked on top of them to take the heat out to the IHS. These are thermal hotspots which need a good thermal path to the IHS.
  • Fully 3D integrated. This allows a lot more room for what can be integrated on a single chip for a specific socket.
  • The desire to route power via IHS comes from the fact that they want to stack the high heat producing blocks like cores on the top of the stack close to the IHS with the low heat producing blocks like the memory below on top of the substrate. Also applies to routing power to other dies stacks.
  • Multiple dies will be stacked on top of the center IOD/Cache Die. These could be memory or other SFUs.
  • Stacked dies are connected via TSV, bumps, conductive pillars and others (see quoted patents).
This patent is a specialization of the one they filed in 2017. The patent seems to indicate that they have a more mature idea much closer to production based off patents they filed two years ago.

Their 3D stacked chiplet architecture coming full circle.

I think AMD has a good understanding of the thermal issues that crop up with highly dense processes in the move to 7nm for HPC applications, and this is reflected in a lot of their recent patents.

There are lots of patents around load/store improvements and Fabric efficiency which are very interesting as well but perhaps for another post.
 

Andrei.

Senior member
Jan 26, 2015
316
385
136
The most advanced CPU core? Arguable to say the least with the likes of A64FX knocking around now with 512 bit SVE - there's more to being advanced than great scalar IPC.

Remember that for all that vaunted scalar oomph, Axx cores are currently crippled in the SIMD arena when comparing to x86 Zen and Core competitors.

Apple also lacks the software on any iOS platform that compares to x86 systems using MacOS, Windows or Linux - all that awesome IPC is as useful as a paperweight for most serious work outside the new Photoshop iOS release.
Scalar IPC is the hardest design characteristic to achieve, the A64FX doesn't look that great other than it being an SIMD and bandwidth monster.

In regards to Apple's freq scaling, people forget Apple still has to cater to energy efficiency. Simply using fatter transistors, going above 3GHz to around 3.5GHz shouldn't be very hard if actually designed for it. We'll see macs with their CPUs soon enough so I hope that'll finally shut people up.
 

Olikan

Platinum Member
Sep 23, 2011
2,008
242
106
There are lots of patents around load/store improvements and Fabric efficiency which are very interesting as well but perhaps for another post.
Also, many patents for cpu's front end latency, decode, uop and L1 cache latency
 

Veradun

Senior member
Jul 29, 2016
564
780
136
Quoting myself for reference, this new patent is a lot more interesting and seems more practical than the thermoelectric stacked die patent that they made some years ago.
The theme around stacked dies is so recurring in all these patent applications you can bet it is happening real soon (Zen4 latest I would bet) ( similar to how I kept quoting patents for Zen 2 and they did happen for real )

This time, there are other chips stacked on top of the IOD.
It seems highly probable that the IOD will be based on the same process node.
The stacked chips are manufactured differently.
Everything will be wrapped together by a molding material as a single chip.

20190393124 ARRANGEMENT AND THERMAL MANAGEMENT OF 3D STACKED DIES

View attachment 15171


What is described here is that
  • The cache and the IO will be located on the center die which is a low heat producing block. This seems to reconcile with their patents of a big unified L3.
  • The processor cores/compute blocks are on the periphery, with a dummy die stacked on top of them to take the heat out to the IHS. These are thermal hotspots which need a good thermal path to the IHS.
  • Fully 3D integrated. This allows a lot more room for what can be integrated on a single chip for a specific socket.
  • The desire to route power via IHS comes from the fact that they want to stack the high heat producing blocks like cores on the top of the stack close to the IHS with the low heat producing blocks like the memory below on top of the substrate. Also applies to routing power to other dies stacks.
  • Multiple dies will be stacked on top of the center IOD/Cache Die. These could be memory or other SFUs.
  • Stacked dies are connected via TSV, bumps, conductive pillars and others (see quoted patents).
This patent is a specialization of the one they filed in 2017. The patent seems to indicate that they have a more mature idea much closer to production based off patents they filed two years ago.

Their 3D stacked chiplet architecture coming full circle.

I think AMD has a good understanding of thermal issues associated with highly dense processes cropping from the move to 7nm for HPC applications and are reflected in a lot of their recent patents.

There are lots of patents around load/store improvements and Fabric efficiency which are very interesting as well but perhaps for another post.
In one of those slides from the leaked event that sparked a long discussion on SMT2 here, I remember the cache was still inside the compute chiplet for Milan. Do you think this patent would apply to an "L4 cache" if used for Milan?
 

naukkis

Senior member
Jun 5, 2002
447
302
136
But, SpecCPU is not the lightest workload. The 9900K and 3950X are both at the edge of what's possible in terms of frequency. Because the boost can be easily dislodged, it further disadvantages them.

I would say realistically its 60-65%. That's still an astonishing difference.
You are comparing a phone SoC to a desktop CPU. It's the phone SoC that is boosted to its max and throttling to keep temperatures under control. Give the A13 a proper heatsink and more power headroom and it will sustain clocks better - there's absolutely no contest about which one is more disadvantaged in the comparison. So Apple's core is much, much better than benchmarks run on a phone would suggest.
 

amd6502

Senior member
Apr 21, 2017
848
296
136
You are comparing a phone SoC to a desktop CPU. It's the phone SoC that is boosted to its max and throttling to keep temperatures under control. Give the A13 a proper heatsink and more power headroom and it will sustain clocks better - there's absolutely no contest about which one is more disadvantaged in the comparison. So Apple's core is much, much better than benchmarks run on a phone would suggest.
not if you put the telephone in the freezer. (not sure if this will make the battery explode though, so don't try without a firetruck nearby).
 

soresu

Golden Member
Dec 19, 2014
1,672
854
136
there's absolutely no contest about which one is more disadvantaged in the comparison.
Yes there is, it's not even arguable the difference in software - the uphill battle of Windows on ARM shows that whatever other commenters on here have said to the contrary, there is clearly much more than a mere re-compile needed to port a lot of the software on x86 platforms.

I'm not saying this because I'm any big proponent of x86, quite the opposite as I wish there were far more AAA game ports on Android (#cough#KOTOR2#cough#), but it simply isn't happening - what little effort was previously made seems to have slowed to a crawl now, even with Windows on ARM to bolster the potential market (and Switch on the same ISA for that matter).

Are there even any significant AAA games playing natively on WARM? If so the news certainly doesn't seem to be circulating as well as might be expected from a properly invested platform vendor.

Coupled with the ridiculous amount of time they are taking with x64 binary translation support, it does seem as if MS never really had much faith in the platform at all.
 

Carfax83

Diamond Member
Nov 1, 2010
6,064
868
126
Scalar IPC is the hardest design characteristic to achieve, the A64FX doesn't look that great other than it being an SIMD and bandwidth monster.
Perhaps scalar IPC is harder to achieve, but that doesn't negate the increasing importance of SIMD. So many workloads these days can be accelerated by SIMD, and Intel (with AMD following) are both hellbent on wider vectors. It's easy for non industry professionals like myself to not realize the importance, until I see the ridiculous performance gains you get when an application is optimized for it.

We'll see macs with their CPUs soon enough so I hope that'll finally shut people up.
I hope to see this one day myself. I'd like to see how Macs with a beefed up A series CPU deal with heavier workloads.
 

soresu

Golden Member
Dec 19, 2014
1,672
854
136
I won't lie. When I first read that sentence, I thought you were referring to the Athlon 64 FX, which makes no sense whatsoever! :laughing::D
Yes it is a bit of a funny name likely to cause such confusion, but it's not meant for wider consumption so I guess it never went through a more exhaustive PR effort.
 

soresu

Golden Member
Dec 19, 2014
1,672
854
136
until I see the ridiculous performance gains you get when an application is optimized for it.
Absolutely, it's the difference between usable and unusable for the moment with AV1 decoding at higher resolutions, even with dav1d's superior SW engineering.
 

amd6502

Senior member
Apr 21, 2017
848
296
136
Hey that's not a bad idea . . . gets crazy ideas.
I would not do that with a fully charged warm battery. I would guess (I am really not sure) that it might be safe to put a telephone in the freezer for benchmarking if there is say a 25%-65% charge.

Are there any apple apps to measure boosting behaviour and frequencies? Something equivalent to cpufreq-aperf or more elaborate
 

soresu

Golden Member
Dec 19, 2014
1,672
854
136
I would not do that with a fully charged warm battery. I would guess (really not sure) that it might be safe to put a telephone in the freezer for benchmarking if there is say a 25%-65% charge.

Are there any apple apps to measure boosting behaviour and frequencies? Something equivalent to cpufreq-aperf or more elaborate
My guess would be a condensation fueled short would be the main danger, though pure condensed water is not nearly as conductive.
 

amd6502

Senior member
Apr 21, 2017
848
296
136
My guess would be a condensation fueled short would be the main danger, though pure condensed water is not nearly as conductive.
I lost my phone in the snow for an evening, and it was single digits Fahrenheit at the time. No damage. But the charge was near empty. I somewhat worry that a fully 100% charged battery could get shocked when quickly going from 40-45C (which it might easily reach coming fresh off a fast charger) to -20C. The energy that a battery can hold at room temperature is significantly more than what it can hold at -20C. So what happens to the energy that it suddenly cannot hold any more?
 
