Speculation: Ryzen 4000 series/Zen 3

Ajay · Oct 28, 2020

lobz said:
Yeah I know the feeling... I'm actually really curious how this pans out with regards to SMT, because this rearchitecting looks like it balanced a lot of weaknesses out. I mean AMD didn't lack in MT perf at all, so I'm not sure anyone should be bothered by the not very impactful MT gains. But all the underlying little things will provide a lot to chew through in the coming months 🙂

Yeah, when we get deep dives on Zen3 and reviews of Ryzen 5XXX, things will get interesting. Plus Navi21 stuff - oh my.

LightningZ71 · Oct 28, 2020

While I certainly don’t expect Van Gogh to have 128 MB of IfCache, if they target it at a 1080p frame buffer, they can get good mileage out of 32MB. If they include ray tracing, maybe 48MB for the scratch data. That may be a large die, but it’s not unmanageable.
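For a rough sense of scale on the 1080p frame buffer point, here is a minimal back-of-the-envelope sketch. The pixel formats (4 bytes per pixel for colour and for depth) and the double-buffering are assumptions for illustration, not something stated in the post, and real render pipelines keep more intermediate targets than this.

```python
# Back-of-the-envelope render target sizes at 1080p.
# Pixel formats and buffer counts below are assumptions for illustration only.

MIB = 1024 * 1024


def framebuffer_bytes(width: int, height: int, bytes_per_pixel: int) -> int:
    """Size of one uncompressed render target in bytes."""
    return width * height * bytes_per_pixel


color_1080p = framebuffer_bytes(1920, 1080, 4)  # assumed 32-bit colour
depth_1080p = framebuffer_bytes(1920, 1080, 4)  # assumed 32-bit depth/stencil

# Assumed working set: two colour buffers (double buffering) plus one depth buffer.
working_set = 2 * color_1080p + depth_1080p

print(f"one 1080p colour target: {color_1080p / MIB:.2f} MiB")
print(f"2x colour + depth:       {working_set / MIB:.2f} MiB")
print(f"fits in 32 MiB:          {working_set <= 32 * MIB}")
```

Under these assumptions the basic targets take roughly 24 MiB, which is consistent with 32MB being a workable size for 1080p and with extra space being needed for anything beyond that.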

itsmydamnation · Oct 28, 2020

no one posted this ?

https://twitter.com/x/status/1321442583348916224

latency looks alright

Gideon · Oct 28, 2020

LightningZ71 said:
While I certainly don’t expect Van Gogh to have 128 MB of IfCache, if they target it at a 1080p frame buffer, they can get good mileage out of 32MB. If they include ray tracing, maybe 48MB for the scratch data. That may be a large die, but it’s not unmanageable.

It would be even better if this could be shared with the CPU (even as an L4?) as it would benefit tasks that don't use GPU as well.

LightningZ71 · Oct 28, 2020

Hmm, where have I seen that before... something something iris pro...

soresu · Oct 28, 2020

LightningZ71 said:
While I certainly don’t expect Van Gogh to have 128 MB of IfCache, if they target it at a 1080p frame buffer, they can get good mileage out of 32MB. If they include ray tracing, maybe 48MB for the scratch data. That may be a large die, but it’s not unmanageable.

That sounds like as good a job as any for some MRAM goodness finally - lots of bits at far lower mm2 footprint than SRAM.

soresu · Oct 28, 2020

LightningZ71 said:
Hmm, where have I seen that before... something something iris pro...

I saw it earlier than that.

Something something Fusion/HSA shared memory pools!

NostaSeronx · Oct 28, 2020

soresu said:
That sounds like as good a job as any for some MRAM goodness finally - lots of bits at far lower mm2 footprint than SRAM.

MRAM in existing nodes still has endurance/latency issues.

The best embedded RAM in the future is most likely going to be NRAM; low latency and >10^13 endurance. With actual ideal SRAM<->eDRAM performance given <7nm. Lower mm^2 with an easier 3D-crosspoint means more NRAM in a given mm^3 cube/rectangle prism. Only need 8 layers of Gb-class NRAM to get a gigabyte LLC.
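The "8 layers of Gb-class" figure is just a bits-to-bytes conversion. A minimal sketch, assuming "Gb-class" means exactly 1 Gibit per layer and ignoring any ECC or redundancy overhead (both are assumptions, the post gives no per-layer figure):

```python
# Capacity of a stacked memory, assuming 1 Gibit per layer and no overhead.
# The per-layer density is an assumption for illustration.

BITS_PER_BYTE = 8
GIBIT = 2**30  # bits
GIB = 2**30    # bytes


def stacked_capacity_bytes(layers: int, bits_per_layer: int) -> int:
    """Total capacity in bytes of `layers` stacked layers."""
    return layers * bits_per_layer // BITS_PER_BYTE


capacity = stacked_capacity_bytes(layers=8, bits_per_layer=GIBIT)
print(f"{capacity / GIB:.1f} GiB")
```

Eight layers at one gigabit each is eight gigabits, i.e. one gigabyte, which matches the claim.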

moinmoin · Oct 28, 2020

soresu said:
I saw it earlier than that.

Something something Fusion/HSA shared memory pools!

I saw it on PS2 and GC first, take that!

LightningZ71 · Oct 28, 2020

The transition to N5P should allow an 8 Zen 3 core, 16MB CCX, with the equivalent to 8CU of RDNA2 and 32MB to exist in roughly the same die size as existing Renoir. That should be plenty for a mobile chip.

maddie · Oct 28, 2020

LightningZ71 said:
The transition to N5P should allow an 8 Zen 3 core, 16MB CCX, with the equivalent to 8CU of RDNA2 and 32MB to exist in roughly the same die size as existing Renoir. That should be plenty for a mobile chip.

If infinity cache allows ~ 3X effective bandwidth, then why not 16-20 CU?
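To put a number on what "~3X effective bandwidth" would require, here is a minimal sketch using a deliberately simple model: DRAM is the only bottleneck, and every cache hit removes that access from DRAM traffic entirely. That model, and the 128-bit LPDDR4X-4266 memory configuration, are assumptions for illustration and not AMD's stated methodology.

```python
# Simple DRAM-bound model: a cache hit rate h leaves (1 - h) of the traffic
# for DRAM, so the DRAM-limited effective bandwidth is dram_bw / (1 - h).
# The model and the memory configuration are assumptions for illustration.


def required_hit_rate(multiplier: float) -> float:
    """Hit rate needed for a given effective-bandwidth multiplier."""
    if multiplier < 1.0:
        raise ValueError("multiplier must be >= 1")
    return 1.0 - 1.0 / multiplier


def effective_bandwidth(dram_gb_per_s: float, hit_rate: float) -> float:
    """Effective bandwidth in GB/s under the DRAM-bound model."""
    if not 0.0 <= hit_rate < 1.0:
        raise ValueError("hit_rate must be in [0, 1)")
    return dram_gb_per_s / (1.0 - hit_rate)


# Assumed APU memory: 128-bit LPDDR4X at 4266 MT/s.
dram_bw = 4266e6 * (128 / 8) / 1e9  # GB/s

hit = required_hit_rate(3.0)
print(f"hit rate needed for 3x: {hit:.1%}")
print(f"DRAM: {dram_bw:.1f} GB/s, effective: {effective_bandwidth(dram_bw, hit):.1f} GB/s")
```

In this model a 3X multiplier needs about two thirds of accesses to hit in the cache, so whether 16-20 CU could be fed depends on whether a 32MB cache reaches that hit rate at the target resolution.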

VirtualLarry · Oct 28, 2020

maddie said:
If infinity cache allows ~ 3X effective bandwidth, then why not 16-20 CU?

Thermals?

maddie · Oct 28, 2020

VirtualLarry said:
Thermals?

If correct, then why waste space with it on APUs? You certainly don't need it for 8CU products.

LightningZ71 · Oct 29, 2020

Die size? If you're targeting 1080p, you likely don't need more than 10CU of RDNA2 for playable frame rates. Clocked up over the 6xxx parts by 10% should give the needed Tflops for it. The trick is having the full frame buffer in cache and also enough for the RT scratch space. No matter what, it's going to be RAM bandwidth limited without it.
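As a rough check on the Tflops figure, peak FP32 throughput can be estimated as CUs x lanes per CU x 2 FLOPs per clock (one fused multiply-add) x clock. The sketch below assumes 64 FP32 lanes per RDNA2 CU and takes 2.25 GHz as the "6xxx" reference clock; the reference clock is an assumption, since the post does not say which part or which clock it means.

```python
# Peak FP32 throughput estimate for a hypothetical 10 CU RDNA2 iGPU.
# The reference clock is an assumed value for illustration.


def fp32_tflops(compute_units: int, clock_ghz: float,
                lanes_per_cu: int = 64, flops_per_lane_per_clock: int = 2) -> float:
    """Peak FP32 TFLOPS: CUs x lanes x FLOPs per clock x clock (GHz) / 1000."""
    return compute_units * lanes_per_cu * flops_per_lane_per_clock * clock_ghz / 1000.0


reference_clock_ghz = 2.25                      # assumed "6xxx" reference clock
apu_clock_ghz = reference_clock_ghz * 1.10      # "clocked up ... by 10%"

print(f"10 CU @ {apu_clock_ghz:.3f} GHz: {fp32_tflops(10, apu_clock_ghz):.2f} TFLOPS")
```

With these inputs the estimate comes out a little above 3 TFLOPS peak; sustained clocks inside a mobile power budget would likely be lower.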

Antey · Oct 29, 2020

LightningZ71 said:
The transition to N5P should allow an 8 Zen 3 core, 16MB CCX, with the equivalent to 8CU of RDNA2 and 32MB to exist in roughly the same die size as existing Renoir. That should be plenty for a mobile chip.

N5P should be a step forward but it seems that N5 is not as great as it was expected.

https://twitter.com/x/status/1321555923350016004

IntelUser2000 · Oct 29, 2020

LightningZ71 said:
Die size? If you're targeting 1080p, you likely don't need more than 10CU of RDNA2 for playable frame rates. Clocked up over the 6xxx parts by 10% should give the needed Tflops for it. The trick is having the full frame buffer in cache and also enough for the RT scratch space. No matter what, it's going to be RAM bandwidth limited without it.

Ray Tracing isn't happening on "free" iGPUs for 4-5 years at least. There just isn't enough performance for it. You have tons of better things to up the quality before you need RT, such as setting all options high, and minimum of 1080p resolution, or having everything playable at 60 fps or more.

Maybe if integrated means something like Kaby-G, where it uses tons of power and was just as expensive as dGPU parts sure.

On iGPUs, they can simply up the cache size of the CPU LLC and just share it like they do on Intel chips. Sharing the 8MB cache on Sandy Bridge resulted in 10-15% improvement. Tigerlake has 12MB L3 it shares between the CPU and the GPU.

AMD talked about roughly one third of the 54% perf/watt improvement coming from the L3 cache. That's actually not too far from the LLC sharing gains Sandy Bridge got.
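"One third of 54%" can be read two ways, and a quick sketch shows both land in the same neighbourhood as the 10-15% Sandy Bridge figure. Which reading AMD intended is not stated here, so both are shown as assumptions.

```python
# Two readings of "one third of a 54% perf/watt improvement".
# Which one AMD meant is an assumption either way.

total_gain = 0.54

# Additive reading: the cache contributes a third of the 54 percentage points.
additive_pct = total_gain / 3 * 100

# Multiplicative reading: three equal factors multiply to 1.54x.
geometric_pct = ((1 + total_gain) ** (1 / 3) - 1) * 100

print(f"additive share:       {additive_pct:.1f}%")
print(f"multiplicative share: {geometric_pct:.1f}%")
```

The additive reading gives 18 points and the multiplicative one a bit under 16%, so either way the cache contribution sits slightly above the quoted Sandy Bridge range.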

Antey said:
N5P should be a step forward but it seems that N5 is not as great as it was expected.

Not surprised. It's slightly more than a half node by historical definitions. I'd call it 0.6 node.

Density gain sounds great but perf/watt is the main issue.

JoeRambo · Oct 29, 2020

itsmydamnation said:
no one posted this ?

https://twitter.com/x/status/1321442583348916224

latency looks alright

Wow, L3 bandwidth got cut in half; I guess that explains the lack of MT scaling in benchmarks. Not exactly unexpected given the move from 2 L3 domains to 1, and the end result is that 8C can access roughly the same L3 BW as an Intel Skylake 8-10C ring.

Memory latency looks like it was tightened by some 10ns tho, great news for peak performance, looking forward to getting my claws on this gem.

NostaSeronx · Oct 29, 2020

Antey said:
N5P should be a step forward but it seems that N5 is not as great as it was expected.

Unlike 7nm, where it was unknown.

N5 -> N5P -> N4 are all RTO'able. By contrast, for a brief time it was not known whether N7P (July 2019) and N6 (April 2019) were RTO/RTO-capable, leaving only N7+, which is NTO.

A design of N5 can coast... better density, perf, watt, yield, cost over 7nm.
Better overall parametric for N5P.
Better cost and parametric for N4.

Server ARM HVM is Q2'21 for Alchip/Marvell/etc. custom ASICs. Plenty are >64-core and there are non-Nvidia/AMD/Intel GPUs coming out as well on the N5 family.

On time to leading edge:
N5 => 2020 / N7 => 2018
N5P => 2021 / N7P => 2019
N4 => 2022 / N7e => 2020

DisEnchantment · Oct 29, 2020

Antey said:
N5P should be a step forward but it seems that N5 is not as great as it was expected.

https://twitter.com/x/status/1321555923350016004

Is it because of the process or the design?

itsmydamnation · Oct 29, 2020

JoeRambo said:
Wow, L3 bandwidth got cut in half; I guess that explains the lack of MT scaling in benchmarks. Not exactly unexpected given the move from 2 L3 domains to 1, and the end result is that 8C can access roughly the same L3 BW as an Intel Skylake 8-10C ring.

Memory latency looks like it was tightened by some 10ns tho, great news for peak performance, looking forward to getting my claws on this gem.

probably a worthwhile trade off. how many workloads are > 512kb < 16mb and use ~2gb/s of bandwidth, probably not many 🙂

JoeRambo · Oct 29, 2020

itsmydamnation said:
probably a worthwhile trade off. how many workloads are > 512kb < 16mb and use ~2gb/s of bandwidth, probably not many 🙂

Probably varies by workload. The darling child of the Zen architecture, Cinebench, sure does not care about it. Most games do not either.
Only the workloads represented in the MT suites of GB5 and SPEC will not scale as well as expected from the ST gains. I guess that is what we were seeing already; the common explanation was just that memory was suboptimal etc.

LightningZ71 · Oct 29, 2020

I would suggest that having the iGPU share the L3 of the ccx in an APU wouldn’t be ideal. Maybe having a combined L4 might help, but it would need to be larger than the L3 to be a significant gain. If AMD wants a full 32MB in the APU ccx, that’s going to require the L4 to be quite large to be useful.

moinmoin · Oct 29, 2020

DisEnchantment said:
Is it because of the process or the design?

Probably both, the process doesn't improve power usage sufficiently and may have a worse frequency/power curve, while the design clearly is not optimized for that node. AMD warned before that getting high frequencies is harder on smaller nodes. N7 seems to handle it well, N5 may take more time to reach a similar level.

itsmydamnation said:
no one posted this ?

https://twitter.com/x/status/1321442583348916224

latency looks alright

Hm, looking around for similar CPU/RAM frequencies, memory read is essentially the same as Zen 2, with latency down ~5ns. L1 and L2 cache bandwidths are hugely improved, looking like nearly 3 times that of Zen 2, which points to Zen 3 cores having their Load/Store redesigned to match (and surpass?) what Intel was offering all along. L3 cache bandwidth seems to be that of a single-chiplet Zen 2 chip, but improved by between ~6% (read) and ~26% (write).

IntelUser2000 · Oct 29, 2020

moinmoin said:
L1 and L2 cache bandwidths are hugely improved, looking like nearly 3 times that of Zen 2, which points to Zen 3 cores having their Load/Store redesigned to match (and surpass?) what Intel was offering all along. L3 cache bandwidth seems to be that of a single-chiplet Zen 2 chip, but improved by between ~6% (read) and ~26% (write).

Not according to that screenshot. The throughput is exactly the same as Ryzen 3000 series and also Skylake.

Icelake is higher, and the highest is Skylake-SP.

moinmoin · Oct 29, 2020

IntelUser2000 said:
Not according to that screenshot. The throughput is exactly the same as Ryzen 3000 series and also Skylake.

Icelake is higher, and the highest is Skylake-SP.

Maybe we are looking at different screenshots. I was comparing the tweeted screenshot mainly with this one shared on TPU's forums, matching memory clock (though better subtimings):
