Speculation: Ryzen 4000 series/Zen 3

Page 214

soresu

Platinum Member
Dec 19, 2014
2,612
1,812
136
I had an interesting thought about Van Gogh thanks to the RDNA2 presentation. The reason why Van Gogh is only Zen2 is that it has only 4 cores (1 CCX), but why does it have only 4 cores? :cool: It's because Van Gogh will have Infinity Cache. :D Naturally not the whole 128MB, but at least 16MB should be realistic.
Interesting thought.

You get a twinky.
 

lobz

Platinum Member
Feb 10, 2017
2,057
2,856
136
Sorry, my bad - conflated SMT yield with MT throughput for some dumb reason.
Yeah I know the feeling... I'm actually really curious how this pans out with regards to SMT, because this rearchitecting looks like it balanced a lot of weaknesses out. I mean AMD didn't lack in MT perf at all, so I'm not sure anyone should be bothered by the not very impactful MT gains. But all the underlying little things will provide a lot to chew through in the coming months :)
 

Ajay

Lifer
Jan 8, 2001
15,332
7,792
136
Yeah I know the feeling... I'm actually really curious how this pans out with regards to SMT, because this rearchitecting looks like it balanced a lot of weaknesses out. I mean AMD didn't lack in MT perf at all, so I'm not sure anyone should be bothered by the not very impactful MT gains. But all the underlying little things will provide a lot to chew through in the coming months :)
Yeah, when we get deep dives on Zen3 and reviews of Ryzen 5XXX, things will get interesting. Plus Navi21 stuff - oh my.
 

LightningZ71

Golden Member
Mar 10, 2017
1,627
1,898
136
While I certainly don’t expect Van Gogh to have 128 MB of IfCache, if they target it at a 1080p frame buffer, they can get good mileage out of 32MB. If they include ray tracing, maybe 48MB for the scratch data. That may be a large die, but it’s not unmanageable.
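
For a rough sense of scale, here is a back-of-the-envelope sketch of what a 1080p frame buffer occupies. The 4 bytes per pixel, the 32-bit depth buffer, and the buffer counts are assumptions for illustration only, not anything AMD has stated; a real working set (G-buffers, MSAA, textures) would be larger.

```python
MIB = 1024 * 1024

def framebuffer_bytes(width, height, bytes_per_pixel=4):
    """Size of one render target, assuming a packed 32-bit format by default."""
    return width * height * bytes_per_pixel

def working_set_bytes(width, height, color_buffers=2, depth_buffers=1):
    """Color plus depth buffers, all assumed to be 4 bytes per pixel."""
    return (color_buffers + depth_buffers) * framebuffer_bytes(width, height)

one_target = framebuffer_bytes(1920, 1080)
double_buffered = working_set_bytes(1920, 1080)
triple_buffered = working_set_bytes(1920, 1080, color_buffers=3)

for label, size in [
    ("one 1080p target", one_target),
    ("2x color + depth", double_buffered),
    ("3x color + depth", triple_buffered),
]:
    print(f"{label}: {size / MIB:.2f} MiB")
```

Under those assumptions a single target is about 7.9 MiB, and even triple-buffered color plus depth stays just under 32 MiB, which is roughly where the 32MB figure comes from.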
 

Gideon

Golden Member
Nov 27, 2007
1,608
3,570
136
While I certainly don’t expect Van Gogh to have 128 MB of IfCache, if they target it at a 1080p frame buffer, they can get good mileage out of 32MB. If they include ray tracing, maybe 48MB for the scratch data. That may be a large die, but it’s not unmanageable.
It would be even better if this could be shared with the CPU (even as an L4?) as it would benefit tasks that don't use GPU as well.
 

soresu

Platinum Member
Dec 19, 2014
2,612
1,812
136
While I certainly don’t expect Van Gogh to have 128 MB of IfCache, if they target it at a 1080p frame buffer, they can get good mileage out of 32MB. If they include ray tracing, maybe 48MB for the scratch data. That may be a large die, but it’s not unmanageable.
That sounds like as good a job as any for some MRAM goodness finally - lots of bits at far lower mm2 footprint than SRAM.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,683
1,218
136
That sounds like as good a job as any for some MRAM goodness finally - lots of bits at far lower mm2 footprint than SRAM.
MRAM in existing nodes still has endurance/latency issues.

The best embedded RAM in the future is most likely going to be NRAM: low latency and >10^13 endurance, with actual ideal SRAM<->eDRAM performance given <7nm. Lower mm^2 with an easier 3D-crosspoint means more NRAM in a given mm^3 cube/rectangular prism. You only need 8 layers of Gb-class NRAM to get a gigabyte LLC.
 

maddie

Diamond Member
Jul 18, 2010
4,722
4,625
136
The transition to N5P should allow an 8 Zen 3 core, 16MB CCX, with the equivalent to 8CU of RDNA2 and 32MB to exist in roughly the same die size as existing Renoir. That should be plenty for a mobile chip.
If infinity cache allows ~ 3X effective bandwidth, then why not 16-20 CU?
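
To put a number on the ~3X figure, here is a toy model. It assumes DRAM is the only bottleneck and that every cache hit costs no DRAM traffic at all, which is a simplification of mine and not how AMD derived its number. The function names are made up for illustration.

```python
def dram_amplification(hit_rate):
    """Effective bandwidth multiplier if only cache misses reach DRAM."""
    if not 0.0 <= hit_rate < 1.0:
        raise ValueError("hit_rate must be in [0, 1)")
    return 1.0 / (1.0 - hit_rate)

def hit_rate_for(amplification):
    """Hit rate needed to reach a given multiplier under the same model."""
    if amplification < 1.0:
        raise ValueError("amplification must be >= 1")
    return 1.0 - 1.0 / amplification

needed = hit_rate_for(3.0)
print(f"hit rate needed for 3x: {needed:.1%}")
print(f"amplification at 50% hits: {dram_amplification(0.5):.1f}x")
```

In this model 3X needs roughly two thirds of the traffic to hit in cache, so how far it stretches to 16-20 CU depends on whether a smaller cache can hold that hit rate.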
 

LightningZ71

Golden Member
Mar 10, 2017
1,627
1,898
136
Die size? If you're targeting 1080p, you likely don't need more than 10CU of RDNA2 for playable frame rates. Clocking them 10% above the 6xxx parts should give the needed TFLOPS for it. The trick is having the full frame buffer in cache and also enough for the RT scratch space. No matter what, it's going to be RAM bandwidth limited without it.
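
A quick sketch of the TFLOPS math, for reference. It uses 64 FP32 lanes per RDNA2 CU and 2 FLOPs per lane per clock (one FMA). The 2.25 GHz reference clock is an assumption picked to stand in for "the 6xxx parts"; actual clocks vary by SKU.

```python
def fp32_tflops(cus, clock_ghz, lanes_per_cu=64, flops_per_lane_per_clock=2):
    """Peak FP32 throughput: CUs x lanes x FLOPs per clock x clock."""
    return cus * lanes_per_cu * flops_per_lane_per_clock * clock_ghz / 1000.0

ASSUMED_6XXX_CLOCK_GHZ = 2.25  # assumption, not a spec for any specific APU
apu_clock_ghz = ASSUMED_6XXX_CLOCK_GHZ * 1.10  # "clocked up by 10%"

apu_tflops = fp32_tflops(10, apu_clock_ghz)
print(f"10 CU @ {apu_clock_ghz:.3f} GHz: {apu_tflops:.2f} TFLOPS")
```

With those inputs a 10CU part lands a little above 3 TFLOPS of peak FP32, before any bandwidth limits come into play.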
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
Die size? If you're targeting 1080p, you likely don't need more than 10CU of RDNA2 for playable frame rates. Clocking them 10% above the 6xxx parts should give the needed TFLOPS for it. The trick is having the full frame buffer in cache and also enough for the RT scratch space. No matter what, it's going to be RAM bandwidth limited without it.

Ray tracing isn't happening on "free" iGPUs for 4-5 years at least. There just isn't enough performance for it. You have tons of better things to up the quality before you need RT, such as setting all options to high, a minimum of 1080p resolution, or having everything playable at 60 fps or more.

Maybe if "integrated" means something like Kaby-G, which used tons of power and was just as expensive as dGPU parts, sure.

On iGPUs, they can simply up the size of the CPU LLC and share it like they do on Intel chips. Sharing the 8MB cache on Sandy Bridge resulted in a 10-15% improvement. Tiger Lake has a 12MB L3 that it shares between the CPU and the GPU.

AMD talked about roughly a third of the 54% perf/watt improvement coming from the L3 cache. That's actually not too far from the LLC sharing gains Sandy Bridge got.

N5P should be a step forward but it seems that N5 is not as great as it was expected.

Not surprised. It's slightly more than a half node by historical definitions. I'd call it 0.6 node.

Density gain sounds great but perf/watt is the main issue.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
no one posted this ?


latency looks alright

Wow, L3 bandwidth got cut in half, I guess that explains the lack of MT scaling in benchmarks. Not exactly unexpected given the move from 2 L3 domains to 1, and the end result is that 8C can access ~same L3 BW as the Intel Skylake 8-10C ring.

Memory latency looks like it was tightened by some 10ns tho, great news for peak performance, looking forward to getting my claws on this gem.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,683
1,218
136
N5P should be a step forward but it seems that N5 is not as great as it was expected.
Unlike 7nm, where it was unknown.

N5 -> N5P -> N4 are all RTO'able, while for a brief time it was not known which of N7P (July 2019)/N6 (April 2019) was RTO/RTO-capable, leaving only N7+, which is NTO.

A design of N5 can coast... better density, perf, watt, yield, cost over 7nm.
Better overall parametric for N5P.
Better cost and parametric for N4.

Server ARM HVM is Q2'21 for Alchip/Marvell/etc. custom ASICs. Plenty are >64-core, and there are non-Nvidia/AMD/Intel GPUs coming out on the N5 family as well.

On time to leading edge:
N5 => 2020 / N7 => 2018
N5P => 2021 / N7P => 2019
N4 => 2022 / N7e => 2020
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,743
3,071
136
Wow, L3 bandwidth got cut in half, I guess that explains the lack of MT scaling in benchmarks. Not exactly unexpected given the move from 2 L3 domains to 1, and the end result is that 8C can access ~same L3 BW as the Intel Skylake 8-10C ring.

Memory latency looks like it was tightened by some 10ns tho, great news for peak performance, looking forward to getting my claws on this gem.
probably a worthwhile trade off. how many workloads are > 512kb < 16mb and use ~2gb/s of bandwidth, probably not many :)
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
probably a worthwhile trade off. how many workloads are > 512kb < 16mb and use ~2gb/s of bandwidth, probably not many :)

Probably varies by workload. Cinebench, the darling child of the Zen architecture, sure does not care about it. Most games do not either.
Only workloads like those in the MT suites of GB5 and SPEC will not scale as well as expected from the ST gains. I guess that is what we were already seeing; the common explanation was just that memory was suboptimal, etc.
 

LightningZ71

Golden Member
Mar 10, 2017
1,627
1,898
136
I would suggest that having the iGPU share the L3 of the CCX in an APU wouldn’t be ideal. Maybe having a combined L4 might help, but it would need to be larger than the L3 to be a significant gain. If AMD wants a full 32MB in the APU CCX, that’s going to require the L4 to be quite large to be useful.
 

moinmoin

Diamond Member
Jun 1, 2017
4,933
7,619
136
Is it because of the process or the design?
Probably both, the process doesn't improve power usage sufficiently and may have a worse frequency/power curve, while the design clearly is not optimized for that node. AMD warned before that getting high frequencies is harder on smaller nodes. N7 seems to handle it well, N5 may take more time to reach a similar level.

no one posted this ?


latency looks alright
Hm, looking around for similar CPU/RAM frequencies, memory read is essentially the same as Zen 2, with latency down ~5ns. L1 and L2 cache bandwidths are hugely improved; they look like nearly 3 times those of Zen 2, which points to Zen 3 cores having their load/store units redesigned to match (and surpass?) what Intel was offering all along. L3 cache bandwidth seems to be that of a single-chiplet Zen 2 chip, but improved by between ~6% (read) and ~26% (write).