Speculation: Ryzen 4000 series/Zen 3

DisEnchantment · Apr 16, 2020

From TSMC's earnings report.

N5 in volume production good yield. Full node jump +80% logic density, +20% speed. Extensive EUV, expect fast/smooth ramp in 2H20 driven by Mobile+HPC. Reiterate N5 will be 10% of wafer revenue in 2020

https://twitter.com/x/status/1250693659927830530

Bodes well for AMD. If they can get a pipe cleaner asap they will be fine with the node migration.

Thunder 57 · Apr 16, 2020

uzzi38 said:
...I've been on mobile all morning, I'd completely forgotten the forums do what they do with Reddit. I'll fix it up now

I could've been nicer but it was early and I was tired. Thank you though, certainly cleaner now.

Richie Rich said:
15% IPC uplift for Zen3 is way too small for brand new uarch.
Alder Lake in H1 2021 with 40%IPC jump (Golden Cove) can bring rough ride for AMD.

Who let you back in? It's not a brand new uarch. It is very much Zen based. But who cares, an iphone will beat it anyway, right?

Hans Gruber · Apr 16, 2020

I will write it one more time. We can come back in 4 or 5 months and see if the leaks were correct. The enhanced 7nm process didn't give AMD the boost they were hoping for, like going from 12nm to 7nm did. But Zen 3 is a new architecture over Zen 2. 15% IPC gain is on the low end of predictions. I heard 20%. I am guessing a 100 MHz increase on boost clocks vs. Zen 2. I think that will be due to production improvements on a mature 7nm node rather than the 7nm+ they were hyping. Think about Intel on the 14nm node. They got really good at enhancing 14nm to take massive amounts of voltage for improvements.

Something people have not talked about here: the Infinity Fabric on the memory controller for Zen 3. As most know, 3800 MHz on the Zen 2 platform is sketchy at best no matter the motherboard. If Zen 3 scales up to 4000 MHz or beyond with coupled memory and fabric clocks, that will provide a significant improvement in performance.

Veradun · Apr 16, 2020

awesomedeluxe said:
Holy heck. AMD already gearing up for N5??? The only N5 part on their roadmap is Zen 4... this would put them way ahead of their roadmap, right?

I thought for sure AMD would wait out the iPhone launch to begin N5. Maybe Huawei cutting orders opened the door for them to be a little more aggressive?

EDIT: Here's this for everyone else who doesn't speak Chinese. I don't know if their translation is accurate either. I think they're implying AMD has a special N5 process this year, different from N5P next year, that Apple is also using.

What if AMD is using the same process as Apple because it is the same silicon we are talking about? :>

darkswordsman17 said:
. (Also I learned that Arcturus apparently is not an architecture, like Navi/Vega/Polaris, but rather is just a specific chip?)

What I heard is Arcturus is a SKU. May be wrong ofc.

inf64 · Apr 16, 2020

Richie Rich said:
15% IPC uplift for Zen3 is way too small for brand new uarch.
Alder Lake in H1 2021 with 40%IPC jump (Golden Cove) can bring rough ride for AMD.

First of all you are missing the + sign.
Second of all, good luck seeing AL in 2021. Maybe H2 2022 if all goes well.
Third , 40%(???) IPC jump is a pipe dream. If it is 10-15% faster than Willow Cove then it is awesome, but let's see about that.

Richie Rich · Apr 16, 2020

Hans Gruber said:
I will write it one more time. We can come back in 4 or 5 months and see if the leaks were correct. The enhanced 7nm process didn't give AMD the boost they were hoping for, like going from 12nm to 7nm did. But Zen 3 is a new architecture over Zen 2. 15% IPC gain is on the low end of predictions. I heard 20%. I am guessing a 100 MHz increase on boost clocks vs. Zen 2. I think that will be due to production improvements on a mature 7nm node rather than the 7nm+ they were hyping. Think about Intel on the 14nm node. They got really good at enhancing 14nm to take massive amounts of voltage for improvements.

I agree, a completely new uarch like Zen3 is expected to gain 20%+ over Zen2. However, for this reason there won't be any higher ST boost clocks than on Zen2. On the other hand, we may see higher clocks in MT thanks to the higher efficiency of the new uarch and the better N7P node. Overall performance will increase, but desktop clock hunters might be disappointed by Zen3.

AMD focuses more on the server market, where no CPU clocks over 4GHz. An additional 300MHz in all-core boost for EPYC will bring much more profit for AMD than reaching the magical 5GHz for one desktop SKU.

I see nobody is interested in the 3x larger microcode space for Zen3. This could mean that Zen3 is a really big change in architecture, something like Keller's wider K12 reworked for x86.

soresu · Apr 17, 2020

Veradun said:
What I heard is Arcturus is a SKU. May be wrong ofc.

Arcturus is almost certainly CDNA1 - whatever SKU variants for yield and CU cutdown there may be (and let's face it that is how they split the market and maximise return on design costs) we haven't heard anything to imply more than one specific die design.

soresu · Apr 17, 2020

NostaSeronx said:
N2 which is supposed to be N3 with nanosheets is supposedly 2023. Which will get its own fab in Hsinchu.

I heard TSMC were doing both FinFETs and nanosheets/MBCFETs at N3, à la N7/N7+ with the non-EUV and EUV variants.

NostaSeronx said:
Also, hopefully 52 CU -> 36 CU -> 8 CU with 5nm FinFETs will lead to >2 GHz base GPU clock rate and ~3 GHz boost GPU clock rate.

3 GHz even for boost clock is a pipe dream. It took them over a decade to go from 1 to 2 GHz, and you think they will make it to 3 GHz in a year or two?!

Certainly not at either N5 or N5P processes.

Also what's the 52>36>8 CU thing about?

DisEnchantment said:
But, it would be not surprising if Arcturus turns out to be CDNA at some point in the near future. This chip has so many new things compared to Vega 20 with over 8 months of commit history in LLVM and amdgpu. Just a reminder, ROCm compiler is forked from LLVM.

Definitely Arcturus is the CDNA chip, it's the only thing it could be at this point.

DisEnchantment said:
For the single Frontier machine, assuming CDNA is over 20 TF/Card and using Vega 20 as reference, at 5nm with 0.1 defects/mm2, if 90% of the compute power comes from GPU that is like ~750 wafers plus another few hundreds for the CPUs. So it is going to be much lower than the total 12k wpm they requested.

Remember they made a point of saying that the gutted rasterization logic was being replaced with tensor/ML acceleration logic - so I would expect TOPS to be at least as important, if not more so than TFLOPS for CDNA.
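For anyone who wants to play with the wafer estimate quoted above, here is a rough sketch of that kind of arithmetic. Everything numeric is an assumption on my part, not something from the quoted post beyond the 20 TF/card and 90% GPU share: a 1.5 exaflop system target, a Vega 20 sized die (331 mm2), 300 mm wafers, a simple Poisson yield model, and the defect density read as 0.1 per cm2 (at 0.1 per mm2 a die this size would yield essentially nothing).

```python
import math

def gross_dies_per_wafer(die_area_mm2, wafer_diameter_mm=300.0):
    """Common approximation: wafer area / die area, minus an edge-loss term."""
    radius = wafer_diameter_mm / 2.0
    return math.floor(
        math.pi * radius ** 2 / die_area_mm2
        - math.pi * wafer_diameter_mm / math.sqrt(2.0 * die_area_mm2)
    )

def poisson_yield(die_area_mm2, defects_per_cm2):
    """Poisson yield model: fraction of dies with zero defects."""
    area_cm2 = die_area_mm2 / 100.0
    return math.exp(-defects_per_cm2 * area_cm2)

def wafers_needed(system_flops, gpu_share, flops_per_card,
                  die_area_mm2, defects_per_cm2):
    """Wafers required to supply enough good GPU dies (one die per card)."""
    cards = math.ceil(system_flops * gpu_share / flops_per_card)
    good_dies = (gross_dies_per_wafer(die_area_mm2)
                 * poisson_yield(die_area_mm2, defects_per_cm2))
    return math.ceil(cards / good_dies)

# All inputs below are assumptions for illustration only.
estimate = wafers_needed(
    system_flops=1.5e18,    # assumed system target
    gpu_share=0.90,         # 90% of compute from GPUs, as in the quoted post
    flops_per_card=20e12,   # 20 TF per card, as in the quoted post
    die_area_mm2=331.0,     # assumed: Vega 20 sized die
    defects_per_cm2=0.1,    # assumed reading of the quoted defect density
)
print(estimate)
```

With these inputs it comes out at a few hundred wafers, the same ballpark as the quoted figure. A larger die or worse defect density pushes it up quickly.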

uzzi38 · Apr 17, 2020

CDNA does debut with Arcturus.

It's called CDNA because Arcturus does a lot more stuff than GCN could. That being said, I won't claim to know specifics (excluding tensors and bfloat16), just that it's a big die that doesn't line up with how large it should actually be given the CU count. Even if you add to the rumoured die size for tensor cores, it's still too large to be honest... plus then the TOPs figure doesn't make sense either.

Arcturus and CDNA are a bit mysterious, to say the least.

DisEnchantment · Apr 17, 2020

uzzi38 said:
CDNA does debut with Arcturus.

It's called CDNA because Arcturus does a lot more stuff than GCN could. That being said, I won't claim to know specifics (excluding tensors and bfloat16), just that it's a big die that doesn't line up with how large it should actually be given the CU count. Even if you add to the rumoured die size for tensor cores, it's still too large to be honest... plus then the TOPs figure doesn't make sense either.

Arcturus and CDNA are a bit mysterious, to say the least.

Similar to Navi12
- Mixed precision unsigned/int 4/8/16/32 bit operations
- Mixed precision float 16/32 bit operations

Unique to Arcturus
- ECC
- 1/2 DPFP
- XGMI networked GPU workshare
- VCN 2.5 (instead of UVD/VCE)
- Additional RAS features
- Fused operations with matrix operations using the AGPRs (MFMA), something like [ A ] x [ B ] + [ C ]. A whole lot of GPRs were added to support this.

This is what can be found so far from LLVM and amdgpu/amdkfd. Some additional things might pop up, but I have some doubt.
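To make the [ A ] x [ B ] + [ C ] shape concrete, here is a minimal plain-Python sketch of a fused matrix multiply-accumulate. It only shows the arithmetic; the function name is made up for illustration, and nothing here reflects how the hardware tiles the work, which registers hold what, or which precisions are mixed.

```python
def mfma(a, b, c):
    """Return a x b + c for small row-major matrices (lists of lists).

    The accumulator starts from c, so the multiply and the add happen in
    one pass instead of a separate matmul followed by an add.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [row[:] for row in c]  # copy c so the input is not modified
    for i in range(rows):
        for j in range(cols):
            acc = out[i][j]
            for k in range(inner):
                acc += a[i][k] * b[k][j]
            out[i][j] = acc
    return out

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[1, 1], [1, 1]]
print(mfma(A, B, C))  # [[20, 23], [44, 51]]
```

Chaining is then just feeding the result back in as the next C, which is where a large accumulator register file would pay off.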

soresu · Apr 17, 2020

uzzi38 said:
CDNA does debut with Arcturus.

It's called CDNA because Arcturus does a lot more stuff than GCN could. That being said, I won't claim to know specifics (excluding tensors and bfloat16), just that it's a big die that doesn't line up with how large it should actually be given the CU count. Even if you add to the rumoured die size for tensor cores, it's still too large to be honest... plus then the TOPs figure doesn't make sense either.

Arcturus and CDNA are a bit mysterious, to say the least.

Remember TOPS can be a product of any possible calculation width variant - ie INT2, INT4, INT8, etc......

So the figure could be a lot higher than you might expect compared to the FP32 FLOPS spec, especially considering that with added tensor units it might not even be an exact multiple of the FLOPS count as it has been on AMD cards so far.

uzzi38 · Apr 17, 2020

soresu said:
Remember TOPS can be a product of any possible calculation width variant - ie INT2, INT4, INT8, etc......

So the figure could be a lot higher than you might expect compared to the FP32 FLOPS spec, especially considering that with added tensor units it might not even be an exact multiple of the FLOPS count as it has been on AMD cards so far.

If anything, the number is lower than I expected. MI60 was called MI60 because it could output just under 60TOPs in INT8.

Arcturus is MI100, so unless they changed the naming scheme, it should be capable of getting 100TOPs in INT8 just off shaders alone. If they actually do have tensors, they should be able to go further than that. Unless they either actually don't have tensor cores or aren't including tensor cores in the name for some weird reason.
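For reference, the "just under 60" figure falls out of simple peak-rate arithmetic. A quick sketch, assuming the MI60 numbers as I recall them (64 CUs, 64 lanes per CU, 1.8 GHz peak clock) and assuming packed INT8 runs at 4x the FP32 FMA rate; treat all of those as assumptions rather than confirmed specs.

```python
def peak_tera_ops(compute_units, lanes_per_cu, clock_ghz, ops_per_lane_per_clock):
    """Peak throughput in tera-ops/s from shader count, clock and per-lane rate."""
    giga_ops = compute_units * lanes_per_cu * clock_ghz * ops_per_lane_per_clock
    return giga_ops / 1000.0

# Assumed MI60 configuration (not taken from this thread).
MI60_CUS = 64
LANES_PER_CU = 64
MI60_CLOCK_GHZ = 1.8

# FP32: one fused multiply-add per lane per clock = 2 ops.
mi60_fp32 = peak_tera_ops(MI60_CUS, LANES_PER_CU, MI60_CLOCK_GHZ, 2)
# INT8: assumed four packed values per lane, each a multiply-add = 8 ops.
mi60_int8 = peak_tera_ops(MI60_CUS, LANES_PER_CU, MI60_CLOCK_GHZ, 8)

# How much more shader throughput a "100" part would need at the same rate.
scale_for_100 = 100.0 / mi60_int8

print(round(mi60_fp32, 1), round(mi60_int8, 1), round(scale_for_100, 2))
```

So under the same naming scheme, 100 TOPs from shaders alone needs roughly 1.7x the CU count times clock of MI60.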

soresu · Apr 17, 2020

uzzi38 said:
If anything, the number is lower than I expected. MI60 was called MI60 because it could output just under 60TOPs in INT8.

Arcturus is MI100, so unless they changed the naming scheme, it should be capable of getting 100TOPs in INT8 just off shaders alone. If they actually do have tensors, they should be able to go further than that. Unless they either actually don't have tensor cores or aren't including tensor cores in the name for some weird reason.

I'd be inclined to say they called it MI100 for marketing continuity rather than TOPS count - the MIxxx nomenclature is still fairly fresh at this point, so they probably don't want to go and do a completely new one, especially given enterprise customers are more averse to change than regular consumers.

Though a 1.66x increase in TOPS on near the same node is nothing to sniff at, like RDNA1 it could just be the pipe cleaner for the new uArch - after all the financial analyst day seemed to make much more of CDNA2 than CDNA1, an odd move even given CDNA2's placement in a coming supercomputer.

edit:

uzzi38 said:
or aren't including tensor cores in the name for some weird reason

Definitely a valid reason if they don't want big green knowing the number ahead of time - like the intersections/second count on RDNA2, it's good to keep some surprises until you are more prepared, at the moment AMD are still playing a perpetual catchup game with their ROCm software stack, which still lacks even Windows support after all this time.

We should probably take this elsewhere if further discussion is warranted.

uzzi38 · Apr 17, 2020

soresu said:
I'd be inclined to say they called it MI100 for marketing continuity rather than TOPS count - the MIxxx nomenclature is still fairly fresh at this point, so they probably don't want to go and do a completely new one, especially given enterprise customers are more averse to change than regular consumers.

Though a 1.66x increase in TOPS on near the same node is nothing to sniff at, like RDNA1 it could just be the pipe cleaner for the new uArch - after all the financial analyst day seemed to make much more of CDNA2 than CDNA1, an odd move even given CDNA2's placement in a coming supercomputer.

Because CDNA2 isn't far away at all, whereas CDNA1 is late and probably hasn't garnered much interest. Also, I imagine they realise they'll probably have the compute ecosystem in a better place next year as well, and if they want people to hop on, it'll be then.

Good point on the first half though. Also, CDNA1 is most certainly a pipe cleaner, as CDNA2 is MI200:

https://twitter.com/x/status/1210183490005655553

Cardyak · Apr 17, 2020

Hans Gruber said:
I will write it one more time. We can come back in 4 or 5 months and see if the leaks were correct. The enhanced 7nm process didn't give AMD the boost they were hoping for, like going from 12nm to 7nm did. But Zen 3 is a new architecture over Zen 2. 15% IPC gain is on the low end of predictions. I heard 20%. I am guessing a 100 MHz increase on boost clocks vs. Zen 2. I think that will be due to production improvements on a mature 7nm node rather than the 7nm+ they were hyping. Think about Intel on the 14nm node. They got really good at enhancing 14nm to take massive amounts of voltage for improvements.

Something people have not talked about here: the Infinity Fabric on the memory controller for Zen 3. As most know, 3800 MHz on the Zen 2 platform is sketchy at best no matter the motherboard. If Zen 3 scales up to 4000 MHz or beyond with coupled memory and fabric clocks, that will provide a significant improvement in performance.

From information I've garnered IPC increases for Zen 3 could be both within the region of 15% and 20% at the same time.

Single threaded IPC could increase by around 15%, but multi-threaded closer to ~20%

This can happen for a variety of reasons:

- Merged L3 cache allows greater cross communication between cores
- Infinity Fabric improvements
- Having a wider core may result in more SMT utilisation

These changes could result in a greater uplift in IPC when there are multiple threads vying for memory access and competing for execution units in the back end.

soresu · Apr 17, 2020

Cardyak said:
From information I've garnered IPC increases for Zen 3 could be both within the region of 15% and 20% at the same time.

Single threaded IPC could increase by around 15%, but multi-threaded closer to ~20%

This can happen for a variety of reasons:

- Merged L3 cache allows greater cross communication between cores
- Infinity Fabric improvements
- Having a wider core may result in more SMT utilisation

These changes could result in a greater uplift in IPC when there are multiple threads vying for memory access and competing for execution units in the back end.

I would also expect packaging improvements allowing for more connections between CCDs, I/O dies and the interposer - this will likely help on the IF side of things, allowing greater width without ramping up frequency too much.

DisEnchantment · Apr 17, 2020

[Sorry for the bit OT now on the Zen3 thread]

Keep in mind that the key customers LLNL/ORNL are not so interested in TOPS, they want DPFP performance.

“Our workloads are primarily not deep learning models, although we are exploring something we call cognitive simulation, which brings deep learning and other AI models to bear on our workloads by evaluating how they can accelerate our simulations and how they can also improve their accuracy and find where they actually work,” explained de Supinski.

El Capitan, for example, is targeted to have 2 exaflops of DPFP.

The El Capitan system will have in excess of 2 exaflops of peak double precision performance

Source: "Lawrence Livermore To Surpass 2 Exaflops With AMD Compute", www.nextplatform.com

AMD is lucky to have won the two contracts for Frontier and El Capitan. It allows them a lot of flexibility in designing CDNA. They have a captive market to deliver these products to, with the development paid for and the software development paid for to some extent. On top of that, scientists at the participating US establishments (LLNL, ORNL, etc.) will contribute actively to ROCm (stated on AMD's own page for Frontier/El Capitan).
The government researchers have made a complete roadmap for the replacement of CUDA with elements proposed by AMD but mainly centering around OpenMP. (But for the life of me I cannot find the link again)

That said...
I wouldn't assume that CDNA1 is going to be a trivial upgrade over MI60. I doubt that just scaling up TOPS would be such a big challenge for AMD.
Just doing packed int4 will make MI60 go above 100TOPS without doing anything, then consider more CUs. With MFMA they can chain multiple matrix operations in a single wave. If they can pack mixed precision in there too, the gain is really incredible.
The main kernel work for Arcturus has been centered around networking GPUs to achieve the first step in workload sharing, data coherency between the GPUs.
So I think that was always the main focus. 2nd Gen Infinity Architecture.
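On the packed int4 point: the gain comes from fitting eight 4-bit values into one 32-bit register instead of four 8-bit ones, so one dot-product instruction covers twice as many multiply-adds. A small sketch of that packing and an 8-wide dot product with accumulate follows. The function names are made up for illustration, and this is not a model of any specific ISA instruction.

```python
def pack_int4x8(values):
    """Pack eight signed 4-bit integers (-8..7) into one 32-bit word."""
    if len(values) != 8:
        raise ValueError("expected exactly 8 values")
    word = 0
    for i, v in enumerate(values):
        if not -8 <= v <= 7:
            raise ValueError("value out of int4 range")
        word |= (v & 0xF) << (4 * i)  # two's-complement nibble
    return word

def unpack_int4x8(word):
    """Inverse of pack_int4x8: recover the eight signed 4-bit integers."""
    out = []
    for i in range(8):
        nibble = (word >> (4 * i)) & 0xF
        out.append(nibble - 16 if nibble >= 8 else nibble)
    return out

def dot8_int4(word_a, word_b, acc=0):
    """Eight multiply-adds on packed operands, added to an accumulator."""
    a = unpack_int4x8(word_a)
    b = unpack_int4x8(word_b)
    return acc + sum(x * y for x, y in zip(a, b))

a = pack_int4x8([1, -2, 3, -4, 5, -6, 7, -8])
b = pack_int4x8([1, 1, 1, 1, 1, 1, 1, 1])
print(dot8_int4(a, b, acc=10))  # 6
```

Same register width and same instruction count, but eight products per lane instead of four, hence double the INT8 rate on paper.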

Veradun · Apr 17, 2020

soresu said:
I would also expect packaging improvements allowing for more connections between CCDs, I/O dies and the interposer - this will likely help on the IF side of things, allowing greater width without ramping up frequency too much.

You mean wider IF?

eek2121 · Apr 17, 2020

With AMD’s current trends, I can pretty much guarantee you that there will be significant clock gains across the board. The 15-20% IPC increase has no basis in reality from what I understand, but it is a safe assumption.

Regarding clocks: I don’t know if we’ll see the magical 5GHz number. However, many Ryzen 3000 chips could hit 4.3-4.4GHz when overclocked. The top 0.1% could go even higher. I imagine that number will be pushed to around 4.6GHz on Zen 3. Non-overclocked base clocks should see a similar 200MHz jump, and single/low core boost should see a significant jump upwards.

An extra 200 MHz with a 15% IPC boost means Zen 3 will be extremely potent.

The L3 cache change alone will provide drastic performance improvements. I would be surprised if the difference is only “+15%”.
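To put a number on the clock plus IPC combination: performance scales roughly with IPC times frequency, so the two gains multiply rather than add. A quick sketch, assuming a 4.4GHz baseline moving to 4.6GHz (my reading of the overclocking figures above) and ignoring memory and thermal effects entirely.

```python
def relative_performance(ipc_gain, old_clock_ghz, new_clock_ghz):
    """First-order model: performance is proportional to IPC x clock."""
    return (1.0 + ipc_gain) * (new_clock_ghz / old_clock_ghz)

# Assumed inputs: +15% IPC, 4.4 GHz -> 4.6 GHz.
uplift = relative_performance(0.15, 4.4, 4.6) - 1.0
print(f"{uplift:.1%}")
```

That works out to roughly a 20% uplift in this simple model, before any cache effects are counted.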

uzzi38 · Apr 17, 2020

eek2121 said:
With AMD’s current trends, I can pretty much guarantee you that there will be significant clock gains across the board. The 15-20% IPC increase has no basis in reality from what I understand, but it is a safe assumption.

Regarding clocks: I don’t know if we’ll see the magical 5GHz number. However, many Ryzen 3000 chips could hit 4.3-4.4GHz when overclocked. The top 0.1% could go even higher. I imagine that number will be pushed to around 4.6GHz on Zen 3. Non-overclocked base clocks should see a similar 200MHz jump, and single/low core boost should see a significant jump upwards.

An extra 200 MHz with a 15% IPC boost means Zen 3 will be extremely potent.

The L3 cache change alone will provide drastic performance improvements. I would be surprised if the difference is only “+15%”.

No, yes, no, probably, no, depends, probably in that order would be my reply to that.

eek2121 · Apr 17, 2020

uzzi38 said:
No, yes, no, probably, no, depends, probably in that order would be my reply to that.

This page will be the one to use for reference as Zen 3 leaks start to roll out: https://www.anandtech.com/show/1570...k-business-with-the-ryzen-9-4900hs-a-review/2

Regarding L3 cache, Zen is particularly sensitive to cache increases due to high latency. Moving to a unified cache means that cache per core count is effectively doubled, depending on how the cache is configured.

Regarding clocks, there is a definite increase in perf/watt between Renoir and Zen 2 parts. We can clearly see this by comparing Zen 2 based parts and Renoir, though the real juicy comparison won't come until U-series parts land. I could dig a bit deeper into some other, more interesting things that'd popped up, but I have to get back to work.

It's all speculation for now, at any rate. Zen 3 is going to be a pleasant surprise for a lot of people.

Glo. · Apr 17, 2020

IMO, Zen3 will look like this: higher base clocks in the same thermal envelopes alongside higher IPC, around 15%. Maximum boost clocks higher by at best 200 MHz in the same thermal envelopes.

DisEnchantment · Apr 17, 2020

Merged L3 can decrease latency for what was before inter CCX latency but can bring in higher overall access latency to L3 for what was before intra CCX L3 latency if they doubled it. They need to address this somehow.

Bigger issue with IF is power usage, which affects EPYC a lot. I suppose if they improve IF energy usage they could run EPYC at much higher frequencies than what is possible today, which is a shame because EPYC is running within the sweet spot of the N7 HD VF curve. They need to address this as there is a lost opportunity here. Going faster or wider for IF is not always better.

Zen execution unit is already very good, the higher SMT utilization was because the executing thread is stalling due to misprediction or data not present in the cache leading to a different thread being executed instead.

My hope is that besides the Uncore/L3 and IF improvements they can put some effort in other places too. I think the merged L3 per CCD was probably a requirement to the X3D chiplet NoC architecture because the crossbar will be connected to the uncore (from the patents at least) in the future.

soresu · Apr 17, 2020

Veradun said:
You mean wider IF?

I meant wider IF without a significant bump in power yes.

Gideon · Apr 17, 2020

DisEnchantment said:
Merged L3 can decrease latency for what was before inter CCX latency but can bring in higher overall access latency to L3 for what was before intra CCX L3 latency if they doubled it. They need to address this somehow

Which is why I believe they will increase the L2 cache to 1MB.
