Speculation: Ryzen 4000 series/Zen 3


Thunder 57

Platinum Member
Aug 19, 2007
2,647
3,706
136
Okay. I thought FMA was a = (b*c)+d;
A product plus an offset. Back to school for me I guess; I need to look at some code. :oops:

Don't take out student loans just yet. What you described is FMA4; there is also FMA3, which is destructive. IIRC FMA4 was supposed to be the standard but then Intel pulled a fast one on AMD and switched to FMA3.
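For anyone who wants to see the difference in code, here is a minimal C sketch using the standard intrinsics (the wrapper names are mine, purely illustrative; build the FMA3 half with -mfma and the FMA4 half with -mfma4 on GCC/Clang). Both compute a = (b*c)+d; the difference is only in the instruction encoding.

Code:
#include <immintrin.h>   // FMA3 intrinsics (vfmadd132/213/231ps)
#include <x86intrin.h>   // FMA4 intrinsics (vfmaddps) on GCC/Clang with -mfma4

/* FMA3: 3-operand, "destructive" encoding. The destination is always
   one of the three source registers, so one input gets clobbered. */
static __m256 madd_fma3(__m256 b, __m256 c, __m256 d)
{
    return _mm256_fmadd_ps(b, c, d);   /* b*c + d */
}

/* FMA4: 4-operand, non-destructive encoding. vfmaddps writes to a
   separate destination register, so all three inputs survive. */
static __m256 madd_fma4(__m256 b, __m256 c, __m256 d)
{
    return _mm256_macc_ps(b, c, d);    /* b*c + d */
}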
 

Topweasel

Diamond Member
Oct 19, 2000
5,436
1,654
136
Does that mean they would be using one 8 core CCX?
I doubt AMD would do that. A 4c CCX's mesh is a million times less complicated compared to an 8c one, and they seemingly straightened out all of the out-of-CCX communication, so I don't see why they would add complexity there. What I think that suggests is that they remove the L3 from the module design (simplifying that) and probably pool it between CCXs, so both CCXs have direct access. It would slightly increase the latency from a single core to the cache it was directly attached to, but it would give everyone much quicker access to the entire stash. Honestly, it might help IF connections a bunch: centralize the IF connections out to the IO die from the cache pool, so that each CCX only has to connect to the cache and the cache connects out, instead of each CCX connecting directly to the IO die. That would probably save on power as well.
 
  • Like
Reactions: spursindonesia

Thunder 57

Platinum Member
Aug 19, 2007
2,647
3,706
136
Does that mean they would be using one 8 core CCX?

Good question. It would seem so. I mean, they could move the L3 to the I/O die (doubtful). They could add a link between multiple CCXs, also doubtful.

I doubt AMD would do that. A 4c CCX's mesh is a million times less complicated compared to an 8c one, and they seemingly straightened out all of the out-of-CCX communication, so I don't see why they would add complexity there. What I think that suggests is that they remove the L3 from the module design (simplifying that) and probably pool it between CCXs, so both CCXs have direct access. It would slightly increase the latency from a single core to the cache it was directly attached to, but it would give everyone much quicker access to the entire stash. Honestly, it might help IF connections a bunch: centralize the IF connections out to the IO die from the cache pool, so that each CCX only has to connect to the cache and the cache connects out, instead of each CCX connecting directly to the IO die. That would probably save on power as well.

I see how that works with two CCDs, but what about EPYC with many more?
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Glad that SMT4 stupidity has died down and AMD instead seems to continue to work on the weak spots of the Zen architecture. That 32+ MB of L3 shared by 8 cores is amazing. I think it was at the top of the wishlist for Zen 2 already; glad AMD realised that having more L3 means less traffic between the I/O die and the CCX, and a unified L3 means zero traffic between same-chiplet CCXs. And that "+"? Could mean 48 MB.

It seems like Intel's server CPU efforts are receiving a coup de grâce in 2020; it's hard to imagine them having something competitive in core/uncore/memory subsystem performance. Ironically, their situation is hopeless due to the monolithic chip having an anemic L3 size and abysmal (by Rome/Milan standards) cumulative bandwidth.
Stroke of genius by AMD, nonetheless. First they rocked the x86 world by releasing the Athlon 64 with an IMC, now they have decoupled it to rule the server/workstation world.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,744
3,080
136
I doubt AMD would do that. A 4c CCX's mesh is a million times less complicated compared to an 8c one, and they seemingly straightened out all of the out-of-CCX communication, so I don't see why they would add complexity there. What I think that suggests is that they remove the L3 from the module design (simplifying that) and probably pool it between CCXs, so both CCXs have direct access. It would slightly increase the latency from a single core to the cache it was directly attached to, but it would give everyone much quicker access to the entire stash. Honestly, it might help IF connections a bunch: centralize the IF connections out to the IO die from the cache pool, so that each CCX only has to connect to the cache and the cache connects out, instead of each CCX connecting directly to the IO die. That would probably save on power as well.
Removing the L3 from the CCX doesn't simplify the design, it makes it way, way, way harder: how are you going to handle coherency? If I was to guess, it's an 8-core CCX using rings.
 

soresu

Platinum Member
Dec 19, 2014
2,617
1,812
136
First they rocked the x86 world by releasing the Athlon 64 with an IMC, now they have decoupled it to rule the server/workstation world
Decoupled yes, but still part of the socketed package - wasn't it part of the MB chipset(s) before Athlon 64?
 

Topweasel

Diamond Member
Oct 19, 2000
5,436
1,654
136
Removing the L3 from the CCX doesn't simplify the design, it makes it way, way, way harder: how are you going to handle coherency? If I was to guess, it's an 8-core CCX using rings.

I meant for the CCX. One of the big issues with the APUs and some of their other designs is that they still basically have to redesign the CCX for other implementations. So the framework is there, but they have to redesign it. What that means for the rest of the design I can't be sure. But once you have that, you have an even easier-to-adapt CCX design for other dies. Still going to be tons better than trying to cross-connect 8 cores.
 

Thunder 57

Platinum Member
Aug 19, 2007
2,647
3,706
136
Decoupled yes, but still part of the socketed package - wasn't it part of the MB chipset(s) before Athlon 64?

It was on the north bridge for the longest time. Problem was, back then AMD was largely reliant on 3rd party chipsets.
 

soresu

Platinum Member
Dec 19, 2014
2,617
1,812
136
It was on the north bridge for the longest time. Problem was, back then AMD was largely reliant on 3rd party chipsets.
Still are; X570 is their first in-house chipset in a while, I think, with ASMedia being their main supplier up until recently.
 

Thunder 57

Platinum Member
Aug 19, 2007
2,647
3,706
136
Still are; X570 is their first in-house chipset in a while, I think, with ASMedia being their main supplier up until recently.

Not quite. With Ryzen, yes, it was outsourced, but by then so much was integrated into the CPU that it didn't matter much. Before that, it was largely AMD chipsets. They had some in the A64/Opteron days, but then the nForce ones were more popular until AMD absorbed ATI, and apparently ATI started making some first-party chipsets.
 
  • Like
Reactions: soresu

Ajay

Lifer
Jan 8, 2001
15,332
7,792
136
I meant for the CCX. One of the big issues with the APUs and some of their other designs is that they still basically have to redesign the CCX for other implementations. So the framework is there, but they have to redesign it. What that means for the rest of the design I can't be sure. But once you have that, you have an even easier-to-adapt CCX design for other dies. Still going to be tons better than trying to cross-connect 8 cores.
28 point-to-point core links is high, but not absurdly so. Anyway, IF AMD went with an 8-core CCX they'd use a mesh or ring. Or they can just put 4 CCXs on one CCD. Seems like they'll have the xtor budget to do it at 5nm.

edit: thinking about it some more, 28 p2p links is too expensive, power-wise, for one 8-core CCX.
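For reference, the 28 figure is just the number of unique core pairs in a fully connected complex:

n(n-1)/2 = 8 x 7 / 2 = 28 links for an 8-core CCX, versus 4 x 3 / 2 = 6 links inside today's 4-core CCX.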
 
Last edited:

Tuna-Fish

Golden Member
Mar 4, 2011
1,324
1,462
136
IIRC FMA4 was supposed to be the standard but then Intel pulled a fast one on AMD and switched to FMA3.

It was a bit more complex than that. At roughly the same time, Intel published the FMA4 extension and AMD published the FMA3 extension. Then both manufacturers decided that they'd rather just use what the other one had made, to reduce fragmentation: Intel switched to FMA3 and AMD switched to FMA4, and neither talked to the other about it. A few years later, when the chips shipped, they both realized they had made an oops and they were again incompatible. Since Intel sold a lot more CPUs than AMD, FMA3 became the standard, and AMD added support for it in future chips and dropped FMA4.

I mean, they could move the L3 to the I/O die (doubtful).

That's only possible if they make the L2 a lot bigger. The traffic between the L2 and L3 is high enough that moving the L3 off-chip would use more than the entire power budget for the socket.
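To put rough numbers on that (all figures are assumed, round-number estimates for illustration): each Zen 2 core moves roughly 32 B/cycle across its L2/L3 interface, so at ~4 GHz an 8-core chiplet can generate on the order of 1 TB/s (about 8 Tb/s) of L2-L3 traffic. On-die wires cost a small fraction of a picojoule per bit, but an off-die link like IFOP is on the order of 2 pJ/bit, so:

8 Tb/s x 2 pJ/bit ≈ 16 W per chiplet, i.e. over 100 W across 8 chiplets, spent purely on L3 traffic.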

The video makes it very clear that Milan will no longer have 4-core CCX architecture.

From 20:41 in the video:
What will come in Milan is that we will get rid of that dual level three where compute complex (right here) have 4 cores sharing level 3, with Milan we will do one L3 for all of the cores in a single chiplet.
 
  • Like
Reactions: Ajay

Ajay

Lifer
Jan 8, 2001
15,332
7,792
136
Video might get taken down


Zen 3 Milan highlights [AMD, Martin Hilgeman ]
- Unified L3 32+ MB per CCD
- Sampling already
- 7nm
- Same core count as Rome
- 2x SMT
- Planned for Q3 2020
- DDR4/SP3

What is Zen3's special sauce gonna be?
- Bigger cache most likely (32MB+)
- Improved IF
- ...

Ugh, somehow missed this o_O. Finally an 8-core CCX; it opens up better options for Genoa. As far as "secret sauce" goes, AMD has made it clear that efficiency is job number one for Milan. Speeding up or widening the IF would be more costly power-wise, as would using more cache (which looks set at 32 MB from the screen grab). I wouldn't expect much in terms of higher clocks or higher IPC, since both would suck power. However, beefing up the execution pipeline might be doable if AMD adopts an old Intel rule that a feature is only accepted if it delivers at least 2% more performance for every 1% it adds to the power budget. Genoa isn't here yet and I'm really looking forward to it.
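As a quick worked illustration of that 2:1 rule (my arithmetic, not an AMD or Intel figure): a feature that adds 2% performance for 1% power improves perf/W by a factor of 1.02 / 1.01 ≈ 1.01, so every accepted feature buys roughly 1% efficiency; stack ten such features and you get roughly +22% performance (1.02^10) for about +10% power (1.01^10).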

IIRC, N7+ EUV cuts power 15-20% at iso-frequency. The screen grab shows Milan operating in the same power range as Rome. So perhaps there is more room for increased performance than I made it out to be. I think I just painted myself into a corner :p

Lastly, I would expect consumer Zen 3 products to hit higher clocks than their Zen 2 equivalents, at least in boost modes. Efficiency will be less important for desktop enthusiast-class CPUs than for servers, obviously.
 
  • Like
Reactions: lightmanek

Vattila

Senior member
Oct 22, 2004
799
1,351
136
Speeding up or widening the IF would be more costly power-wise, as would using more cache (which looks set at 32 MB from the screen grab).

Speeding up the IF at the same or lower power may be possible by going 2.5D, i.e. using a silicon interposer for the interconnect (full or bridge/EMIB-style), rather than having the interconnect implemented in the coarser package substrate, as it is now.

Regarding cache size, the slide says 32+ MB. I guess they will use the slight density increase and power-efficiency improvement of the N7+ process to double it to 64 MB, primarily to compensate for an increase in L3 latency in the 8-core complex.

If they use the 2x4 topology I propose (in the CCX thread), they may be able to offer two BIOS configurations for the chiplet, i.e. two separate 4-core CCXs (NUMA), each with a private 32 MB of low-latency L3 cache, or an 8-core CCX (UMA), with 64 MB of higher-latency L3 cache.
 
Last edited:

Richie Rich

Senior member
Jul 28, 2019
470
229
76
What is Zen3's special sauce gonna be?
- Bigger cache most likely (32MB+)
- Improved IF
- ...
A new Zen 3 uarch, a new family 19h number... and the main improvement is just a unified L3 cache? It looks like a fake.
Zen 2 was still family 17h and brought pretty big changes: a new L1 cache, an added store unit, and doubled FPU performance.

I can't believe a new uarch with a new 19h family number will bring smaller improvements than Zen 2. This doesn't make sense performance-wise. How much can that unified L3 cache bring? 2-4% on average? Is it worth the effort of porting a Zen 2 CPU to a completely different EUV process node for such a tiny change? For a change that small I would expect them to stay on the non-EUV N7 process, or go to N7P or N6.

From the 19h family I would expect at least new instruction set support (AVX-512), so new FPUs, effectively doubling FPU performance as Zen 2 did. The Zen 3 team is led by the Zen 1 architect, so...
 
Last edited:

inf64

Diamond Member
Mar 11, 2011
3,686
3,959
136
I doubt anyone actually believes that Milan will stop at 64 cores. Just saying.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,324
1,462
136
I can't believe a new uarch with a new 19h family number will bring smaller improvements than Zen 2.

There's no reason to expect that the changes featured in the presentation were the only things that will change in Zen3. Because of the audience, the video was heavily about topology, and the unified L3 is a major topology change, so that was featured.
 

Vattila

Senior member
Oct 22, 2004
799
1,351
136
Here is my topology sketch, updated for Milan with the 8-core CCX. Note that the L4 blocks may just be cache-coherency directory slices, but could conceivably include last-level cache. Note that the cores/L3 slices are interconnected using the same topology as the L4 slices.

[Image: Milan IF topology, central L4 (speculation).png]

Here is this topology illustrated on the current package design used for Rome:

[Image: AMD EPYC Milan (topology speculation).png]
 
Last edited:
  • Like
Reactions: Elfear

Vattila

Senior member
Oct 22, 2004
799
1,351
136
I doubt anyone actually believes that Milan will stop at 64 cores. Just saying.

Considering that the presentation stated that Milan is a 9-die design, I am now inclined to believe exactly that, i.e. Milan will be an optimised version of Rome with the same core count. We might have to wait for Genoa for bigger changes and perhaps a core-count increase.
 

jpiniero

Lifer
Oct 1, 2010
14,511
5,159
136
I doubt anyone actually believes that Milan will stop at 64 cores. Just saying.

It would make sense, since the power savings from N7+ aren't that much. Use the density gain to increase IPC and perhaps add more cache, and use the power savings for higher frequency.