Speculation: Ryzen 4000 series/Zen 3

jpiniero · Oct 4, 2019

Thunder 57 said:
This should hopefully kill off those nonsense SMT4 rumors for Zen 3. I like what they are showing with the L3.

Does that mean they would be using one 8 core CCX?

Topweasel · Oct 4, 2019

jpiniero said:
Does that mean they would be using one 8 core CCX?

I doubt AMD would do that. A 4c CCX's mesh is far less complicated than an 8c one, and they seemingly straightened out all out-of-CCX communication, so I don't see why they would add complexity there. What I think that suggests is they remove the L3 from the module design (simplifying that) and probably pool it between CCXs, so both CCXs have direct access. It would slightly increase latency from a single core to the cache it was directly attached to, but it would give everyone much quicker access to the entire stash. Honestly, it might help the IF connections a bunch: centralize the IF connections out to the IO die from the cache pool, so that each CCX only has to connect to the cache and the cache connects out, instead of each CCX connecting directly to the IO die. That would probably save on power as well.

Thunder 57 · Oct 4, 2019

jpiniero said:
Does that mean they would be using one 8 core CCX?

Good question. It would seem so. I mean, they could move the L3 to the I/O die (doubtful). They could add a link between multiple CCX's, also doubtful.

Topweasel said:
I doubt AMD would do that. A 4c CCX's mesh is far less complicated than an 8c one, and they seemingly straightened out all out-of-CCX communication, so I don't see why they would add complexity there. What I think that suggests is they remove the L3 from the module design (simplifying that) and probably pool it between CCXs, so both CCXs have direct access. It would slightly increase latency from a single core to the cache it was directly attached to, but it would give everyone much quicker access to the entire stash. Honestly, it might help the IF connections a bunch: centralize the IF connections out to the IO die from the cache pool, so that each CCX only has to connect to the cache and the cache connects out, instead of each CCX connecting directly to the IO die. That would probably save on power as well.

I see how that works with two CCD's, but what about EPYC with many more?

soresu · Oct 4, 2019

Seems like the unified L3 in Zen3 effectively makes a CCD a single CCX, rather than 2.

I wonder how this works out for APU's under Zen3...

JoeRambo · Oct 4, 2019

Glad that SMT4 stupidity has died down and AMD instead seems to continue to work on the weak spots of the Zen architecture. That 32+MB of L3 shared by 8 cores is amazing. I think it was at the top of the wishlist for Zen2 already; glad AMD realised that having more L3 means less traffic between I/O and CCX, and that a unified L3 means zero traffic between CCXs on the same chiplet. And that "+"? Could mean 48MB.

It seems like Intel's server CPU efforts are receiving a coup de grâce in 2020; hard to imagine them having something competitive in core/uncore/memory subsystem performance. Ironically, their situation is hopeless due to the monolithic chip having an anemic L3 size and abysmal (by Rome/Milan standards) cumulative bandwidth available.
Stroke of genius by AMD, nonetheless. First they rock the x86 world by releasing Athlon 64 with IMC, now they decoupled it to rule the server/workstation world.

itsmydamnation · Oct 4, 2019

Topweasel said:
I doubt AMD would do that. A 4c CCX's mesh is far less complicated than an 8c one, and they seemingly straightened out all out-of-CCX communication, so I don't see why they would add complexity there. What I think that suggests is they remove the L3 from the module design (simplifying that) and probably pool it between CCXs, so both CCXs have direct access. It would slightly increase latency from a single core to the cache it was directly attached to, but it would give everyone much quicker access to the entire stash. Honestly, it might help the IF connections a bunch: centralize the IF connections out to the IO die from the cache pool, so that each CCX only has to connect to the cache and the cache connects out, instead of each CCX connecting directly to the IO die. That would probably save on power as well.

Removing the L3 from the CCX doesn't simplify the design, it makes it way, way, way harder. How are you going to handle coherency? If I were to guess, it's an 8 core CCX using rings.

soresu · Oct 4, 2019

JoeRambo said:
First they rock the x86 world by releasing Athlon 64 with IMC, now they decoupled it to rule the server/workstation world

Decoupled yes, but still part of the socketed package - wasn't it part of the MB chipset(s) before Athlon 64?

Topweasel · Oct 4, 2019

itsmydamnation said:
Removing the L3 from the CCX doesn't simplify the design, it makes it way, way, way harder. How are you going to handle coherency? If I were to guess, it's an 8 core CCX using rings.

I meant for the CCX. One of the big issues with the APUs and some of their other designs is that they still basically have to redesign the CCX for other implementations. So the framework is there, but they have to redesign that. What that means for the rest of the design I can't be sure. But once you have that, you have an even easier to adapt CCX design for other dies. Still going to be tons better than trying to cross-connect 8 cores.

Thunder 57 · Oct 4, 2019

soresu said:
Decoupled yes, but still part of the socketed package - wasn't it part of the MB chipset(s) before Athlon 64?

It was on the north bridge for the longest time. Problem was, back then AMD was largely reliant on 3rd party chipsets.

soresu · Oct 4, 2019

Thunder 57 said:
It was on the north bridge for the longest time. Problem was, back then AMD was largely reliant on 3rd party chipsets.

Still are, X570 is their first internal chipset for a while I think - ASMedia being their main supplier up until recently.

Thunder 57 · Oct 4, 2019

soresu said:
Still are, X570 is their first internal chipset for a while I think - ASMedia being their main supplier up until recently.

Not quite. With Ryzen, yes, it was outsourced, but by then so much was integrated into the CPU that it didn't matter much. Before that, it was largely AMD chipsets. They had some in the A64/Opteron days, but then the nForce ones were more popular until AMD absorbed ATI and apparently ATI started making some first party chipsets.

Ajay · Oct 4, 2019

Topweasel said:
I meant for the CCX. One of the big issues with the APUs and some of their other designs is that they still basically have to redesign the CCX for other implementations. So the framework is there, but they have to redesign that. What that means for the rest of the design I can't be sure. But once you have that, you have an even easier to adapt CCX design for other dies. Still going to be tons better than trying to cross-connect 8 cores.

28 point to point core links is high, but not absurdly so. Anyway, if AMD went with an 8 core CCX they'd use a mesh or ring. Or they can just put 4 CCXs on one CCD. Seems like they'll have the xtor budget to do it at 5nm.

edit: thinking about it some more, 28 p2p links is too expensive, power wise, for one 8 core CCX.
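
For anyone wondering where the 28 comes from: it is just the number of core pairs in a fully connected 8-core complex, n(n-1)/2. The snippet below is a minimal sketch comparing that with a simple ring, assuming one dedicated bidirectional link per connected pair. The function names are made up for illustration, and nothing here reflects how AMD actually wires a CCX.

```python
def full_mesh_links(cores: int) -> int:
    """Links needed if every core pair gets its own point-to-point link."""
    return cores * (cores - 1) // 2


def ring_links(cores: int) -> int:
    """Links needed for a single bidirectional ring (assumed topology)."""
    if cores < 3:
        # With 1 or 2 nodes a "ring" degenerates to 0 or 1 links.
        return max(cores - 1, 0)
    return cores


for n in (4, 8):
    print(f"{n} cores: point-to-point = {full_mesh_links(n)}, ring = {ring_links(n)}")
```

A 4-core CCX needs only 6 pairwise links, so going to 8 cores more than quadruples the count if you stay fully connected, while a ring grows linearly at the cost of extra hops.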

Tuna-Fish · Oct 5, 2019

Thunder 57 said:
IIRC FMA4 was supposed to be the standard but then Intel pulled a fast one on AMD and switched to FMA3.

It was a bit more complex than that. At roughly the same time, Intel published the FMA4 extension and AMD published the FMA3 extension. Then, both manufacturers decided that they'd rather just use what the other one made to reduce fragmentation, and Intel switched to FMA3, and AMD switched to FMA4. And neither talked to the other one about it. Then, a few years later when the chips shipped, they both realized they had done an oops and they were again incompatible. Since Intel sold a lot more cpus than AMD, FMA3 became the standard, and AMD re-added support for it for future chips and dropped FMA4.

Thunder 57 said:
I mean, they could move the L3 to the I/O die (doubtful).

That's only possible if they make L2 a lot bigger. The traffic between the L2 and L3 is high enough that moving it offchip would use more than the entire power budget for the socket.

The video makes it very clear that Milan will no longer have 4-core CCX architecture.

From 20:41 in the video said:
What will come in Milan is that we will get rid of that dual level three where compute complex (right here) have 4 cores sharing level 3, with Milan we will do one L3 for all of the cores in a single chiplet.

Ajay · Oct 5, 2019

DisEnchantment said:
Video might get taken down
View attachment 11605

Zen 3 Milan highlights [AMD, Martin Hilgeman ]
- Unified L3 32+ MB per CCD
- Sampling already
- 7nm
- Same core count as Rome
- 2x SMT
- Planned for Q3 2020
- DDR4/SP3

What is Zen3's special sauce gonna be?
- Bigger cache most likely (32MB+)
- Improved IF
- ...

Ugh, somehow missed this 😵. Finally an 8 core CCX - opens up better options for Genoa. As far as “secret sauce”, AMD has made it clear that efficiency is job number one for Milan. Speeding up or widening the IF would be more costly power wise, as would using more cache (which looks set at 32MB from the screen grab). I wouldn’t expect much in terms of higher clocks or higher IPC, since both would suck power. However, vis-a-vis beefing up the execution pipeline, it might be doable if AMD adopts an old Intel rule of requiring that a 2% performance boost can only be used if it adds only 1% to the power budget. Genoa isn’t here yet and I’m really looking forward

IIRC, 7N+ EUV cuts power 15-20% at ISO frequency. The screen grab shows Milan operating in the same power range as Rome. So, perhaps there is more room for increased performance than I make it out to be. I think I just painted myself into a corner 😛

Lastly, I would expect consumer Zen3 products to hit higher clocks than their Zen2 equivalents. At least in boost modes. Efficiency will be less important for desktop enthusiast class cpus than for servers, obviously.

Vattila · Oct 5, 2019

Ajay said:
Speeding up or widening the IF would be more costly power wise, as would using more cache (which looks set at 32MB from the screen grab).

Speeding up the IF at the same or lower power may be possible by going 2.5D, i.e. using a silicon interposer for the interconnect (full or bridge/EMIB-style), rather than having the interconnect implemented in the coarser package substrate, as it is now.

Regarding cache size, the slide says 32+ MB. I guess they will use the slight density increase and power-efficiency improvement of the N7+ process to double it to 64 MB, primarily to compensate for an increase in L3 latency in the 8-core complex.

If they use the 2x4 topology I propose (in the CCX thread), they may be able to offer two BIOS configurations for the chiplet, i.e. two separate 4-core CCXs (NUMA), each with a private 32 MB of low-latency L3 cache, or an 8-core CCX (UMA), with 64 MB of higher-latency L3 cache.

Richie Rich · Oct 5, 2019

DisEnchantment said:
What is Zen3's special sauce gonna be?
- Bigger cache most likely (32MB+)
- Improved IF
- ...

New uarch Zen 3, new 19h Family number..... and the main improvement is just unified L3 cache? It looks like fake.
Zen 2 was still 17h Family and brought pretty big changes, new L1 cache, adding 1 store unit, doubling FPU performance.

I can't believe a new uarch with a new 19h Family number will bring smaller improvements than Zen 2. This doesn't make sense performance wise. How much can that unified L3 cache bring? 2-4% on average? Is it worth the effort to port the Zen2 CPU to a completely different EUV process node for such a tiny change? For this small change I would expect them to stay at the N7 non-EUV process, or go to N7P or N6.

I would expect from 19h Family at least new instruction set support (AVX512), so new FPUs, effectively doubling the FPU performance as Zen 2 did. The Zen3 team is led by the Zen1 architect, so ....

Vattila · Oct 5, 2019

Richie Rich said:
It looks like fake.

It is not. It is taken from a presentation by an AMD engineer. The video has been taken down now.

inf64 · Oct 5, 2019

I doubt anyone actually believes that Milan will stop at 64 cores. Just saying.

Tuna-Fish · Oct 5, 2019

Richie Rich said:
I can't believe a new uarch with a new 19h Family number will bring smaller improvements than Zen 2.

There's no reason to expect that the changes featured in the presentation were the only things that will change in Zen3. Because of the audience, the video was heavily about topology, and the unified L3 is a major topology change, so that was featured.

Vattila · Oct 5, 2019

Here is my topology sketch, updated for Milan with the 8-core CCX. Note that the L4 blocks may just be cache-coherency directory slices, but could conceivably include last-level cache. Note that cores/L3 slices are interconnected using the same topology as the L4 slices.

Milan IF topology, central L4 (speculation).png

Here is this topology illustrated on the current package design used for Rome:

AMD EPYC Milan (topology speculation).png

Vattila · Oct 5, 2019

inf64 said:
I doubt anyone actually believes that Milan will stop at 64 cores. Just saying.

Considering that the presentation stated that Milan is a 9-die design, I am now inclined to believe exactly that, i.e. Milan will be an optimised version of Rome with the same core-count. We might have to wait for Genoa for bigger changes and perhaps core-count increase.

jpiniero · Oct 5, 2019

inf64 said:
I doubt anyone actually believes that Milan will stop at 64 cores. Just saying.

It would make sense since the power savings from 7+ aren't that much. Use the density gain to increase IPC and perhaps add more cache, and use the power savings for higher frequency.

joesiv · Oct 5, 2019

soresu said:
Seems like the unified L3 in Zen3 effectively makes a CCD a single CCX, rather than 2.

I wonder how this works out for APU's under Zen3...

8 core APUs!

soresu · Oct 5, 2019

joesiv said:
8 core APUs!

Zen+ on 12nm had a Ryzen 7 2700E 8 core at 45W, stands to reason that at 7nm+ you would be able to get an 8 core and decent GPU for 35W.

jamescox · Oct 5, 2019

darkswordsman17 said:
I'm not sure it would (how do they manage DRAM stacking in mobile?). I'm not talking about a whole stack, I'm talking about a single high stack which should remove the need for TSVs as you wouldn't be routing through the HBM (which is what the TSVs are there for). Plus there's possibility that you could implement the HBM in the die itself, and they could segment easily based on the viable amount. I don't believe that the I/O die gains a lot from being shrunk, and to me HBM3 using the same process provides an opportunity that I think would be very beneficial to take advantage of.

I'm talking about in an APU itself. There's quite a few companies that don't want to bother with an extra chip (GPU), but they're fairly constrained by memory bandwidth with regards to GPU performance in current APUs. Which as they move to chiplets the distinction there becomes a bit semantic, but for the OEM it would be a single chip solution, and its something they could do without needing to overhaul the work they did on the substrate for Zen 2 - which they talked up how much work they did there.

By Keller's own words he's there to develop next gen interconnect (which I believe he talked about one that could scale up from intrachip to interchip, and then even system - i.e. unified memory/storage that leverages different tiers trying to make that transparent to the system - and network/datacenter; that to me sounds a lot like the talk about moving to fiber optic, which he's likely looking at is it time to start that transition or can they push the limits of metal first). The way he talked he doesn't seem to have anything to do with the core designs (architecture, etc). Seems that he's there to get the various chips communicating in an efficient and fast manner (which will be needed with move to chiplet designs and co-processing and other things).

Which, I think that's what he was working on at Tesla, is figuring out how to get all the various components (sensors, processing) communicating, while trying to cut down the wiring (for weight, complexity, and cost reasons), but push latency down and throughput up.

And I think there was talk that actually was kinda his focus with Zen (basically InfinityFabric and designing chips to utilize that). I might be very wrong though, but I do know he himself said he's at Intel for developing interconnect.

So, how exactly would the HBM connect to the IO die it is stacked on without having TSVs in the IO die, making it equivalent to an active interposer? For low performance stuff (like mobile or NAND flash die), they sometimes do stacking by just offsetting the chips and using essentially little wires attached to the edge of the chips (similar to old style wire bonding). That might be fine for a DDR channel or NAND interface, but not for a 1024-bit HBM bus.
