Question AMD Rembrandt/Zen 3+ APU Speculation and Discussion

izaic3 · Apr 2, 2021

Alright, so we've had some leaks so far. I don't know if any of it's been confirmed yet, as it's pretty early, but here is what I've surmised so far (massive grain of salt of course):

If it turns out to have RDNA 2 and 12 CUs, I could see iGPU performance potentially almost doubling over Cezanne.

If I've made any mistakes or gotten anything wrong, please let me know. I'd also love to hear more knowledgeable people weigh in on their expectations.

TESKATLIPOKA · Apr 12, 2021

IntelUser2000 said:
I was actually comparing it to the 6800/6800XT.

My point is they still had to cut down the CUs to get that area. Vega 20 is at 330mm2, but that grew to 520mm2 with Navi 21. 1.5x the SPs of Vega 20 would push it to the 600mm2 range.

Going from Vega 8 to Navi 12 will double the iGPU size as well.

A 90CU Navi (1.5x the SPs of Vega 20) wouldn't be in the 600mm2 range; 10 extra CUs don't take up an extra 80mm2.

Navi 21 has 128MB of IC and a 256-bit GDDR6 interface, which take a lot more space than HBM2 does. That's a big part of why Navi is so much bigger, though of course not the only reason.
RDNA1 CU is ~2.1 mm2 and I don't expect RDNA2 CU to be more than 2.5 mm2 in the worst case.
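
To put rough numbers on that, here is a minimal sketch of the CU-only arithmetic. It uses the per-CU figures quoted above (the 2.5 mm2 value is a worst-case guess, not a measurement) and the 80 CUs of a full Navi 21. The helper name `cu_area_mm2` is made up for illustration, and the result ignores caches, memory PHYs and everything else on the die.

```python
# Per-CU area figures quoted in the post above (mm^2).
RDNA1_CU_MM2 = 2.1
RDNA2_CU_MM2_WORST = 2.5  # worst-case guess from the post, not a measurement


def cu_area_mm2(cu_count: int, mm2_per_cu: float) -> float:
    """Area taken by the CUs alone (no cache, memory PHYs, or other blocks)."""
    return cu_count * mm2_per_cu


NAVI21_CUS = 80        # CU count of a full Navi 21
HYPOTHETICAL_CUS = 90  # the 90CU part discussed above

extra_cu_area = cu_area_mm2(HYPOTHETICAL_CUS - NAVI21_CUS, RDNA2_CU_MM2_WORST)
igpu_cu_area = cu_area_mm2(12, RDNA2_CU_MM2_WORST)

print(f"10 extra CUs: ~{extra_cu_area:.0f} mm2")          # ~25 mm2
print(f"12CU iGPU, CUs only: ~{igpu_cu_area:.0f} mm2")    # ~30 mm2
```

Even with the pessimistic per-CU figure, 10 extra CUs land at roughly 25 mm2, well short of 80 mm2.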

It's hard to tell if the Rembrandt IGP will really be 2x bigger than the one in Cezanne. I haven't even seen anyone measure just the IGP in Renoir or Cezanne to begin with.

BTW, does it even matter if it's really 2x bigger? 12CU RDNA2 would be better than, for example, 16CU Vega, which would need lower clocks to fit into the same TDP as 8CU Vega. RDNA2 is way more power efficient than Vega.

Mopetar · Apr 12, 2021

IntelUser2000 said:
Going from Vega 8 to Navi 12 will double the iGPU size as well.

It really depends on how they adapt the technology for mobile. I suspect any hardware that accelerates ray tracing gets thrown out to save space. There's no point in adding it to an APU. Similarly, the infinity cache found on RDNA2 chips may or may not show up, and that represents an even larger number of transistors for something that might get cut.

AMD may not even go to 12 CUs because unless there's enough bandwidth to feed all of them, there's no point in making a wider design outside of redundancy.

scineram · Apr 12, 2021

Well, it is 12 CU.

Shivansps · Apr 12, 2021

Mopetar said:
AMD may not even go to 12 CUs because unless there's enough bandwidth to feed all of them, there's no point in making a wider design outside of redundancy.

If AMD chooses to go 12CU, it is because they can feed them. Vega 11 runs well at DDR4-3200, but it becomes heavily memory limited when OCed to something like 1700MHz.
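
A rough way to see the effect is to look at peak memory bandwidth per CU per clock. This is only a sketch under stated assumptions: dual-channel DDR4 with 64-bit channels, a stock Vega 11 clock of 1250 MHz (an assumed figure, as on the Ryzen 5 2400G), and the whole bandwidth going to the GPU, which it never does on an APU since the CPU shares it. The function names are made up for illustration.

```python
def peak_bandwidth_gbs(mt_per_s: int, channels: int = 2, bytes_per_transfer: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (assumes 64-bit channels)."""
    return mt_per_s * channels * bytes_per_transfer / 1000


def bytes_per_cu_cycle(bandwidth_gbs: float, cus: int, clock_mhz: float) -> float:
    """Peak bytes available per CU per GPU clock, if the GPU got all the bandwidth."""
    return bandwidth_gbs * 1e9 / (cus * clock_mhz * 1e6)


ddr4_3200 = peak_bandwidth_gbs(3200)  # dual-channel DDR4-3200

stock = bytes_per_cu_cycle(ddr4_3200, cus=11, clock_mhz=1250)  # assumed stock clock
oced = bytes_per_cu_cycle(ddr4_3200, cus=11, clock_mhz=1700)   # the OC mentioned above

print(f"DDR4-3200 dual channel: {ddr4_3200:.1f} GB/s")
print(f"Vega 11 stock: {stock:.2f} B/CU/clk, OCed: {oced:.2f} B/CU/clk")
```

The overclock raises compute without raising bandwidth, so the bytes available per CU per clock drop by about a quarter. A 12CU part at similar clocks would need faster memory or better bandwidth efficiency to avoid the same squeeze.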

You can also see why Renoir never went to DIY there: try to sell that at $160 and reviews would have backfired badly on AMD. Renoir is just not up to the task. This is why I don't believe it was due to supply.

It really pisses me off to think that we could have had RX560 perf in an IGP right now without that downgrade.

Anyway, I'm confident that if it is 12CU, it is because they expect it to work well with the available bandwidth.

TESKATLIPOKA · Apr 13, 2021

They wouldn't sacrifice die space and transistors for a bigger IGP if the performance improvement would end up subpar because they can't properly feed it.

DisEnchantment · Apr 13, 2021

I read somewhere that the truly next gen APUs will have SLC instead of IC. The SLC is common between CPU/GPU and big enough to provide the BW amplification for the small number of CUs

uzzi38 · Apr 13, 2021

DisEnchantment said:
I read somewhere that the truly next gen APUs will have SLC instead of IC. The SLC is common between CPU/GPU and big enough to provide the BW amplification for the small number of CUs

Hmm, how would that work? A separate cache that hangs off the SDF on its own, with the CPU and GPU cores accessing it via the SDF?

moinmoin · Apr 13, 2021

DisEnchantment said:
I read somewhere that the truly next gen APUs will have SLC instead of IC. The SLC is common between CPU/GPU and big enough to provide the BW amplification for the small number of CUs

I can imagine the Zen cores' L3$ and RDNA2's IC amalgamating into something like an SLC situated on interposers. Though I'm not sure if the packaging cost is economical for relatively low-budget mainstream chips yet.

DisEnchantment · Apr 13, 2021

uzzi38 said:
Hmm, how would that work? A separate cache that hangs off the SDF on its own, with the CPU and GPU cores accessing it via the SDF?

moinmoin said:
I can imagine the Zen cores' L3$ and RDNA2's IC amalgamating into something like an SLC situated on interposers. Though I'm not sure if the packaging cost is economical for relatively low-budget mainstream chips yet.

Could also be more mundane, like an L4 where the CPU L3 and GPU L2 can probe the SLC. Zen3 L3 can probe other L3s, like in the 5950X for example, and the RDNA2 IC is basically an LLC.
The SLC sits before the IMC. They need to bring the two together now since the pieces are there.
Using SDF in a mobile part might not be so great from a power perspective, so for this maybe a monolithic part without the SerDes.
Also, Trento with HBM is using it as an LLC I suppose, backed up by main memory. So this is something they already have in some form or another.

moinmoin · Apr 13, 2021

DisEnchantment said:
Using SDF in a mobile part might not be so great from a power perspective, so for this maybe a monolithic part without the SerDes.

I'm pretty sure SDF and SCF are used even on monolithic APUs and don't depend on the package being an MCM or some such. Power usage is usually an issue with high bandwidth low latency interfaces as well as connection wires, so IMC and SerDes etc.

DisEnchantment · Apr 13, 2021

moinmoin said:
I'm pretty sure SDF and SCF are used even on monolithic APUs and don't depend on the package being an MCM or some such. Power usage is usually an issue with high bandwidth low latency interfaces as well as connection wires, so IMC and SerDes etc.

Within the CCD, IFOP/SDF is not used for probing caches as far as I can tell, at least for Zen3. In the 5950X, IFOP is used for inter-L3 probes across dies. But I am curious to know whether this is documented anywhere.

uzzi38 · Apr 13, 2021

DisEnchantment said:
Within the CCD, IFOP/SDF is not used for probing caches as far as I can tell, at least for Zen3. In the 5950X, IFOP is used for inter-L3 probes across dies. But I am curious to know whether this is documented anywhere.

Here's the diagram for Raven Ridge:

And here's the one for Ryzen 1k and 2k.

Inter-CCX communications all go through the SDF, regardless of chiplet layout. I can't see a way of an SLC working unless it also hangs off the SDF, but it would come at a significant cost to latency.

DisEnchantment · Apr 13, 2021

uzzi38 said:
Inter-CCX communications all go through the SDF, regardless of chiplet layout. I can't see a way of an SLC working unless it also hangs off the SDF, but it would come at a significant cost to latency.

I guess we have to wait and see this confirmed

NostaSeronx · Apr 13, 2021

System Level Infinity Cache is technically possible. It has been theorized before with AMD's older L3 directory caches.

Heterogeneous System Coherence for Integrated CPU-GPU Systems

http://pages.cs.wisc.edu/~powerjg/files/micro13-hsc-poster.pdf

Software Assisted Hardware Cache Coherence for Heterogeneous Processors; AMD Research - Advanced Micro Devices, Inc.

https://www.csa.iisc.ac.in/~arkapravab/papers/software_assisted_hardware_coherence.pdf

The CPU and GPU would have to operate within the same CCM, which basically means the CPU L3 and GPU L3 would have to be within the same area, with the current CCX evolving from 8 cores to 8 cores plus GPU. That would basically match the previous research papers, and it would be a large jump from what we currently have.

1 core => 2 MB L3
3 compute units => 2 MB L3
8+4 set => 16 MB CPU-side CCX + 8 MB GPU-side CCX. // Lopsided integration of L3.
or
3 compute units => 2x2 MB L3 (added bandwidth and capacity)
8+8 set => 16 MB CPU-side CCX + 16 MB GPU-side CCX. // Equal integration of L3.
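
The slice arithmetic above can be sketched as follows. This assumes a 12CU iGPU, so "8+4" reads as 8 cores plus 4 groups of 3 CUs, which is my reading of the list rather than anything confirmed. The function `l3_split_mb` and its parameters are hypothetical names.

```python
def l3_split_mb(cores: int, cus: int, mb_per_core: int = 2,
                slices_per_cu_group: int = 1, mb_per_slice: int = 2,
                cus_per_group: int = 3) -> tuple[int, int]:
    """Return (CPU-side L3, GPU-side L3) in MB for the ratios listed above."""
    cpu_side = cores * mb_per_core
    gpu_side = (cus // cus_per_group) * slices_per_cu_group * mb_per_slice
    return cpu_side, gpu_side


# "8+4 set": one 2 MB slice per 3 CUs, giving a lopsided split.
lopsided = l3_split_mb(cores=8, cus=12)
# "8+8 set": two 2 MB slices per 3 CUs, giving an equal split.
equal = l3_split_mb(cores=8, cus=12, slices_per_cu_group=2)

print(lopsided)  # (16, 8)
print(equal)     # (16, 16)
```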

Just to show the awkward-ness of such a design:

More accurate-style:

Forgot the dual-CU-ness of the WGP... UGH!!!:

This one is easier to swallow. In design, it might work out?! It at least matches the design aesthetic of Renoir/Cezanne, as well as sharing L3 with the CPU cores like the research papers above.

Looking through the patents, the above would require two coherency managers: "Exclusive" for the CPU and "Main" for CPU+GPU, where main is the original manager and exclusive exists to separate exclusive CPU coherency from system GPU coherency. This allows CPU coherency to maintain its bandwidth even with the extra GPU part in the main CCM.

Exclusive CCM => 8x2 MB cache coherency (Data processed only on CPUs)
Main CCM => 16x2 MB cache coherency (Data processed on both)

The above is also needed for CPU+GPU chiplets within the same package. CPUs have main and exclusive cache coherency managers, while GPUs only have main cache coherency managers. This is also a requirement for the 3rd generation Infinity Fabric architecture, e.g. 2x EPYC CPUs (Exclusive (CPU) and Main (CPU+GPU)) + 8x RI GPUs (Main-only).

Shivansps · Apr 13, 2021

My personal opinion is that AMD is going the GPU chiplet way, which likely means some IC on the chiplet, as it would be the GPU "L3". They may end up doing monolithic APUs only for notebooks, embedded and the low end.

Things will become more clear as we get more info on RDNA3.

Mopetar · Apr 13, 2021

It really depends on the process and packaging techniques. If the APU is small enough, it makes sense to stay with a monolithic die. There are plenty of ways to bin them and a single die can cover the entire product stack.

The chiplet approach gets expensive if you can't use the same chiplets that other products get, or if connecting it all using an interposer or some other technique is expensive. Doing something like that for a server chip is easy to justify, but not so much when it's for a low-end laptop part. Even more so if a lot of the cost is largely fixed, because the server parts are all going to use multiple chiplets, but the APU probably just uses one of each type.

LightningZ71 · Apr 14, 2021

There is a hole in AMD's roadmap for a quad-core, 4 CU, small desktop CPU. A well-clocked 4/8 Zen2 CPU CCX with 4 MB L3 and 4 Vega or RDNA2 CUs would do just fine on the market while giving excellent volumes of dies per wafer.
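
For a feel of how die size drives volume, here is a sketch using the common gross-dies-per-wafer approximation (wafer area over die area, minus an edge-loss term). It ignores yield, scribe lines and edge exclusion. The 156 mm2 figure is the commonly reported Renoir die size, taken here as an assumption, and the 100 mm2 small die is purely hypothetical since nobody has numbers for such a part.

```python
import math


def gross_dies_per_wafer(die_area_mm2: float, wafer_diameter_mm: float = 300.0) -> int:
    """Approximate gross dies per wafer; ignores yield and scribe lines."""
    wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return math.floor(wafer_area / die_area_mm2 - edge_loss)


renoir_like = gross_dies_per_wafer(156.0)  # assumed: commonly reported Renoir size
small_apu = gross_dies_per_wafer(100.0)    # hypothetical quad-core, 4 CU die

print(f"~156 mm2 die: {renoir_like} gross dies per 300 mm wafer")
print(f"~100 mm2 die: {small_apu} gross dies per 300 mm wafer")
print(f"ratio: {small_apu / renoir_like:.2f}x")
```

The smaller die gains more than the plain area ratio suggests, because less of the wafer edge is wasted.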

soresu · Apr 14, 2021

LightningZ71 said:
There is a hole in AMD's roadmap for a quad-core, 4 CU, small desktop CPU. A well-clocked 4/8 Zen2 CPU CCX with 4 MB L3 and 4 Vega or RDNA2 CUs would do just fine on the market while giving excellent volumes of dies per wafer.

I feel like the Samsung RDNA partnership might precede a return to lower power chips.

Perhaps not Cat/*mont style smol cores, but as you say an SoC with a lower core count and a small GPU fit for embedded/fanless environs.

We still don't know much about Van Gogh, it could be targeted at this niche though I doubt it.

scineram · Apr 15, 2021

LightningZ71 said:
There is a hole in AMD's roadmap for a quad-core, 4 CU, small desktop CPU. A well-clocked 4/8 Zen2 CPU CCX with 4 MB L3 and 4 Vega or RDNA2 CUs would do just fine on the market while giving excellent volumes of dies per wafer.

Van Gogh is kind of that, but with 4 WGP.

Asterox · Apr 15, 2021

scineram said:
Van Gogh is kind of that, but with 4 WGP.

Yes, small die size in combination with an RDNA iGPU. For mobile it will be a very good and cheap 4/8 APU model, no doubt.

Zepp · Apr 15, 2021

If that is accurate for Van Gogh, perhaps it is intended to be the successor to Dali/Pollock.

dr1337 · Apr 15, 2021

Zepp said:
If that is accurate for Van Gogh, perhaps it is intended to be the successor to Dali/Pollock.

That's been my assumption. AMD has had a gaping hole in their lineup for almost 3 years now. SFF and NUCs are over 25% of all PC market share, and while Renoir and Cezanne are good, they're really overkill for most SFF applications/uses. And if MILD is right, and it's only quad-core Zen 2, the dies should be tiny. Something like Van Gogh would finally mean top-to-bottom coverage in the market, and that'd be awesome, especially from an investor POV.

zir_blazer · Apr 15, 2021

dr1337 said:
That's been my assumption. AMD has had a gaping hole in their lineup for almost 3 years now. SFF and NUCs are over 25% of all PC market share, and while Renoir and Cezanne are good, they're really overkill for most SFF applications/uses. And if MILD is right, and it's only quad-core Zen 2, the dies should be tiny. Something like Van Gogh would finally mean top-to-bottom coverage in the market, and that'd be awesome, especially from an investor POV.

I see absolutely no reason for Ryzen Embedded to not fit into that segment. Actually, many of the APU dies have more features than what remains when their I/O is castrated to fit AM4, so they have a better feature set in their embedded form.

dr1337 · Apr 15, 2021

zir_blazer said:
I see absolutely no reason for Ryzen Embedded to not fit into that segment. Actually, many of the APU dies have more features than what remains when their I/O is castrated to fit AM4, so they have a better feature set in their embedded form.

V2000 would be nice if they had any volume. It's just Renoir rebranded for a different segment, and availability is very low. Van Gogh should be much smaller, especially if it's designed for embedded applications and not just a repurposed laptop chip. I really want to be buying AMD NUCs and thin clients, but they're all more expensive or slower than their Intel counterparts.

izaic3 · Apr 18, 2021

RDNA2's frequency/power curve

I know it's mostly focused on higher power usage, but perhaps it can help guide our expectations for mobile.
