Speculation: Ryzen 4000 series/Zen 3


DrMrLordX

Lifer
Apr 27, 2000
how is it possible for current x86 models (Zen 3000, CFL, or the upcoming Ice Lake, which is better) to reach the IPC of the A13 while maintaining ~4 GHz frequencies?

I would assume that 256-bit vector processing is already much faster on x86 hardware, since that really isn't a thing on mobile hardware. There are also scenarios where AMD's implementation of SMT in particular makes Zen 2 much more attractive. For example, I can easily clear an MT score of 14000 in Geekbench 5 on a 3900X with clockspeeds sitting in, I dunno, the 4.2 GHz range? An A13 with 2 Lightning (2.66 GHz) and 4 Thunder (??? GHz) cores manages a measly GB5 MT score of 3400-3500 (it varies). I have twice the cores and . . . I guess ~57% (or more) higher clockspeed than an A13, but better than 400% of the MT performance. Take two A13s, jack up their clockspeeds by +57%, and you get an MT score of around 11k (hypothetically). Yeah, my 3900X sucks power, but big deal. Let's see Apple scale that A13 up to a 95W TDP (or higher).

That ST score is scary, and the MT score may be more a result of throttling than anything else. So the A13 deserves a lot of credit. Just not all the credit.
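A quick sanity check of the scaling arithmetic above, assuming the same perfectly linear scaling with cores and clock that the post does (real silicon never scales this cleanly):

```python
# Hypothetical back-of-envelope scaling of A13 Geekbench 5 MT scores, as in
# the post above. Assumes perfectly linear scaling with core count and clock,
# which real silicon never achieves.
a13_mt = 3450              # midpoint of the 3400-3500 GB5 MT range
clock_ratio = 4.2 / 2.66   # 3900X boost vs A13 Lightning clock, ~1.58x

scaled = a13_mt * 2 * clock_ratio  # "two A13s", clocked up ~58%
print(round(scaled))               # ~10895, i.e. "around 11k"
```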
 

Richie Rich

Senior member
Jul 28, 2019
I would assume that 256-bit vector processing is much faster on x86 hardware already since that really isn't a thing on mobile hardware...
Rumor says the MacBook will have 8c. So the right math is to multiply the A13 score by 4, which gives you a score of 13800 @ 2.66 GHz. That matches the 12c Ryzen 3900X. And if they clock it +30% higher, at 3.5 GHz, the score will rise +25%. And we're talking about the old A13; the MacBook will have the new A14 IMO.
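The multiply-out behind that rumor, sketched with the same linear-scaling assumption the post makes (purely hypothetical arithmetic, not a measurement):

```python
# The post's scaling assumption: an 8-big-core chip scores 4x the A13's
# 2-big-core GB5 MT result, and a +30% clock bump yields +25% more score.
# Purely linear, hypothetical arithmetic.
a13_mt = 3450                   # representative A13 GB5 MT score
eight_core = a13_mt * 4         # 13800 @ 2.66 GHz
clocked_up = eight_core * 1.25  # +25% from clocking up to ~3.5 GHz
print(eight_core, int(clocked_up))  # 13800 17250
```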
 

dnavas

Senior member
Feb 25, 2017
Rumor says Macbook will have 8c. So right math is to multiply A13 score by 4...

Only if the Thunder cores are not counted for the MacBook and/or disabled for the base number. Is the rumor that the MacBook will include 16 Thunder cores?
 

TheGiant

Senior member
Jun 12, 2017
Rumor says Macbook will have 8c. So right math is to multiply A13 score by 4...
Dream on. By dumping x86 they will lose so much.
They will improve the iPad Pro as a second computing machine but keep the MacBook within the x86 world.
Ice Lake and Tiger Lake don't look bad at all.
 

DrMrLordX

Lifer
Apr 27, 2000
Rumor says Macbook will have 8c.

About that . . .

Only if the Thunder cores are not counted for Macbook and/or disabled for the base number. Is the rumor that the Macbook will include 16 thunder cores?

Exactly. If it's 3 Lightning + 5 Thunder or 4 Lightning + 4 Thunder, then you aren't getting into 13800 territory on GB5. And even if you did, my 3900X would still be faster in that bench, which is admittedly very friendly to mobile hardware. So ha!

@TheGiant

4c Tiger Lake looks pretty good actually.
 

amd6502

Senior member
Apr 21, 2017
Why are you mixing 1.5 CISC ILP with RISC execution units? Keller said Ice Lake is executing 3-6 instructions at once. Maybe you can explain why Apple moved from the A7 (4xALU, 2xLSU) to the wider A11/12/13 (6xALU with still only 2xLSU). I think they had a pretty good reason to do that (especially when we know there is a massive +58% IPC gain over Skylake).

The interesting thing is that AVT saw exactly the same slides (graphics), just with SMT4 on them. This is the point. They put it there either for identifying leakers or because Zen 3 is SMT4 capable. Could be both.

Cache, cache, cache. I feel like I'm in the Tron movie, surrounded by programs caught in an endless cycle. No offense, but it's funny how many people want to increase code execution without increasing execution units. The leaked Zen 3 IPC gain of >8% (others say >10%) cannot be achieved by just the L3 cache.

BTW, a comparison of the evolution of Apple/Intel cores:
  • 2012 - Intel Ivy Bridge (3xALU)... Apple A6 (2xALU) .... Apple is way behind
  • 2013 - Intel Haswell (4xALU)... Apple A7 (4xALU) .... Apple is on par with Intel
  • 2017 - Intel Coffee Lake (4xALU)... Apple A11 (6xALU) .... Apple became the tech leader
Isn't this interesting?

I think they may go wider, but 6 ALU (with 4 AGU?) would be a little overkill. I think 5 ALU, with one of the ALUs (or some other mix) being a simple unit, would be quite enough.

I also think that if Zen 3 were SMT4, it would have come out in the HPC presentation/leak. However, if it were some other 4-way MT that in performance is closer to SMT2 (I call it SMT2+ or aSMT4), then for all practical purposes of an HPC-oriented talk it would be called SMT2. Supercomputing (numerical modeling) couldn't care less if you have extra background threads. Conversely, the odds of SMT2+ are also not high; under 10%. But the odds of SMT2+ for Zen 4 go up quite a bit. This would be a feature very useful in servers and also somewhat useful in mobile.
 

Saylick

Diamond Member
Sep 10, 2012
I'm still thinking they'll beef up the core with an additional L/S Unit (so 2 Load/2 Store), widen dispatch to 8 ops/cycle, widen retire to 10 ops/cycle, and shared L3 across all 8 cores on each CCD. This is on top of the usual enlarging of registers and buffers.
 

soresu

Platinum Member
Dec 19, 2014
shared L3 across all 8 cores on each CCD
That much was confirmed by the new presentation slide; the exact quantity of L3 is still up in the air, though, beyond 32MB+.

Given how much area L3 takes up, though, I'm inclined to think anything more than 40MB (+25%) might be too big for the 7nm+ move.
 

Gideon

Golden Member
Nov 27, 2007
That much was confirmed by the new presentation slide, exact quantity of the L3 is still up in the air though beyond 32MB+...
My guess is that it will be a fully inclusive cache and will therefore grow to fit the L2 of all cores in the CCD (in order not to effectively shrink in size compared to Zen 2). This means an extra 4MB, provided the L2 remains unchanged - so 36MB of L3 per chiplet in total.

As the L3 latency in Zen 2 is already measurably slower than Zen+ (11ns vs 9ns), and unifying the cache will probably make it a tad worse still, I wouldn't rule out the L2 being enlarged to compensate, so 1MB L2 per core + 40MB L3 per chiplet is also a (less likely) possibility.

If one is to believe the ~15% IPC gain rumors, I think the entire cache hierarchy will be redesigned, as the unification of the L3 is a major redesign anyway. My (somewhat wild and wishful) predictions for the Milan memory hierarchy in that case are:
  • Memory compression for chiplet-to-chiplet communication, at least on server (probably configurable in the BIOS). They filed patents for it a while ago, and it would save a considerable amount of power (in EPYC and Threadripper) that could be used in the core chiplets instead of being wasted transporting data.
  • 40MB of fully inclusive L3 cache per CCD (36MB if the L2 stays the same)
  • 1MB of L2
  • 48KB of 12-way L1 data cache, "Ice Lake style" (this is the least likely prediction IMO)
  • Improvements to the uop cache, so that it is competitively shared between SMT threads rather than statically partitioned (effectively doubling it for lightly threaded loads).
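The sizing logic behind the inclusive-L3 guess, as a quick sketch. Core count and per-core L2 size are the Zen 2 figures assumed in the post (8 cores, 512KB L2):

```python
def inclusive_l3_mb(effective_mb, cores, l2_kb_per_core):
    """Size an inclusive L3 must reach so its *effective* capacity matches
    an exclusive design: it grows by the total L2 it has to duplicate."""
    return effective_mb + cores * l2_kb_per_core / 1024

print(inclusive_l3_mb(32, 8, 512))   # 36.0 -> the "36MB per chiplet" case
print(inclusive_l3_mb(32, 8, 1024))  # 40.0 -> if L2 doubles to 1MB per core
```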
 

Thunder 57

Platinum Member
Aug 19, 2007
My guess is, that it will be an all inclusive cache and will therefore grow, to fit the L2 of all cores in CCD...

Have to admit, an inclusive L3 cache would be interesting. It is certainly more realistic than some other ideas being floated around here.
 

Cardyak

Member
Sep 12, 2018
improvements to the uop cache, so that is competitively shared between SMT threads, rather than statically partitioned (effectively doubling it for lightly threaded loads).

That’s an interesting notion. Similar to Ivy Bridge where improvements were made to the ROB and other sections to reduce static partitioning.

This alone could offer a modest IPC increase.
 

amd6502

Senior member
Apr 21, 2017
Inclusive means L1/L2 data is copied to the L3.

L2 doubling to 1MB seems like a very good bet. And there's a semi-decent chance of the L1 growing too.

Seems they took the "all roads lead to Rome" theme further in Zen 3 by doing this at the scale of the CCX.

Does Zen 2 follow Zen 1 in the L3 being a victim cache?

For Zen 3, the L3 cache unit may be a complex compound of its own, and may now also have the role of acting as a hub. It may be a smart hybrid thing that is not one or the other; internally it may dedicate a good fraction, ~30% of capacity, as an L4-ish victim cache (with L3 = hub + L3 + L4).

Inclusive would help in its role as a hub when there are shared memory addresses being updated by several cores (more energy efficient and lower latency than exclusive, but you have a slightly smaller total cache footprint: 32MB vs 36 or 40MB). Something more flexible (non-exclusive or partially inclusive), based on whether addresses are shared between cores, would have the best of both worlds.
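The footprint trade-off mentioned above, spelled out for an assumed 8-core CCD with 512KB L2 per core and 32MB L3 (the Zen 2 figures):

```python
# Unique data a CCD can hold (MB) under the two policies discussed above.
# Assumed sizes: 8 cores x 512 KB L2, 32 MB L3 (the Zen 2 figures).
l2_total = 8 * 512 / 1024          # 4.0 MB of L2 across the CCD
l3 = 32

exclusive_unique = l3 + l2_total   # victim L3 holds only lines absent from L2
inclusive_unique = l3              # inclusive L3 duplicates every L2 line
print(exclusive_unique, inclusive_unique)  # 36.0 32
```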
 

Tarkin77

Member
Mar 10, 2018
Direct quote from Dr. Lisa Su

Going forward, we are not relying on process technology as the main driver. We think process technology is necessary. It’s necessary to be sort of at the leading edge of process technology. And so, today, 7-nanometer is a great node, and we’re getting a lot of benefit from it. We will transition to the 5-nanometer node at the appropriate time and get great benefit from that as well. But we’re doing a lot in architecture. And I would say, that the architecture is where we believe the highest leverage is for our product portfolio going forward.

from the Q3 conference call this week.
source: https://www.overclock3d.net/news/cp...tecture_not_process_tech_says_amd_s_lisa_su/1
 

Arzachel

Senior member
Apr 7, 2011
I'll be shocked if Zen3 has this much IPC gain. And I'll be dismayed, slightly, if I build a Ryzen 3000 series system.
I think a 15% overall performance gain would be impressive, and would finally leave Intel behind in gaming.

It would be impressive but also absolutely necessary to match Intel's pace. Ice Lake might be stuck on a dud node but clock for clock it's still the fastest x86 cpu.
 

Ajay

Lifer
Jan 8, 2001
It would be impressive but also absolutely necessary to match Intel's pace. Ice Lake might be stuck on a dud node but clock for clock it's still the fastest x86 cpu.
AMD doesn’t really have anything to worry about on the desktop till 2022! They will definitively take not just the multithreaded crown but also the single-threaded crown in 2020. So yes, they do have to keep up the pace; I believe Intel will finally bring their A game in 2022 (unless something is horribly wrong with 7nm EUV).
 

Abwx

Lifer
Apr 2, 2011
It would be impressive but also absolutely necessary to match Intel's pace. Ice Lake might be stuck on a dud node but clock for clock it's still the fastest x86 cpu.

Clock for clock, in Cinebench R15, a Zen 2 is....2% faster in single thread than Ice Lake, so that doesn't bode well for Intel's alleged 18% IPC improvement...
 

Topweasel

Diamond Member
Oct 19, 2000
Clock for clock, and in Cinebench R15, a Zen 2 is....2% faster in single thread than Icelake, so that doesnt bode well for Intel s alleged 18% IPC improvement...
Depends on the tests; I have seen anywhere from -10% to -6% to +2%. I think IF speed can be key here, at least in these kinds of benches. But even at 10% better than Zen 2, it's a wash with their current clocks. If they can get another 5-7% with Zen 3 and Zen 4 while maintaining clocks, then I think we start to really see AMD pull away. Everything Intel is trying, AMD basically has a smoother version going. I mean, look at IF/Zen 2 and their packaging vs. where Intel is with EMIB. Intel's solution seems like it would be more elegant. But does that matter if AMD is selling twice as many cores for half the price of Intel's best chips, at higher clocks?
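The "wash" arithmetic, sketched with illustrative numbers (the clock figures here are assumptions for the sake of the example, not measured Ice Lake specs):

```python
# Performance ~ IPC x clock: a +10% IPC lead can be a wash if clocks are
# ~10% lower. Clock figures here are illustrative, not measured.
zen2_perf = 1.00 * 4.2   # relative IPC x boost clock (GHz)
icl_perf = 1.10 * 3.8    # +10% IPC at lower clocks
print(round(zen2_perf, 2), round(icl_perf, 2))  # 4.2 4.18
```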
 

Abwx

Lifer
Apr 2, 2011
Depends on the tests, I have seen -10% to a - 6% to a +2%...


Dunno what those "other" tests you're talking about are; so far I've seen no exhaustive review with fixed or known clocks, apart from the CB R15 ST score. All other tests are done at a given power, which doesn't say what the clock rate is..