Speculation: Ryzen 4000 series/Zen 3


Abwx

Lifer
Apr 2, 2011
10,847
3,297
136
OT, but Phoronix managed to obtain ~18% IPC uplift compared to Kaby. This was done using their standard suite.

His not-so-standard suite shows the 9900K as being better than a 3900X in floating point...


For his ICL comparison he uses an 8550U with DDR4-2133 RAM even though he has an 8650U that uses DDR4-2400; this inflates the difference, since ICL uses much faster memory. Second, he doesn't use Blender or Povray, instead reaching for memory-bound benches, for obvious reasons.

So far the only relevant numbers are CB R15 ST and the Geekbench LZMA subscore, which is indicative of integer perf but comes with the advantage of fast RAM; hence the improvement per clock is 10% (apparently...) while it's only 5% in Cinebench.
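The "improvement per clock" figures being argued over come from normalizing a benchmark score by clock speed. A minimal sketch of that arithmetic, with made-up numbers (these are not actual ICL/KBL scores):

```python
# Hypothetical sketch: deriving a per-clock ("IPC") uplift from raw scores.
# All numbers are invented for illustration, not real measurements.

def ipc_uplift(score_new, freq_new_ghz, score_old, freq_old_ghz):
    """Relative per-clock improvement between two chips."""
    per_clock_new = score_new / freq_new_ghz
    per_clock_old = score_old / freq_old_ghz
    return per_clock_new / per_clock_old - 1

# Example: a chip scoring 5% higher while clocked 10% lower
# actually shows a ~16.7% per-clock gain.
uplift = ipc_uplift(score_new=105, freq_new_ghz=3.6,
                    score_old=100, freq_old_ghz=4.0)
print(f"{uplift:.1%}")  # ~16.7%
```

This is why a raw-score comparison between chips at different clocks understates or overstates the architectural gain.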

 

NostaSeronx

Diamond Member
Sep 18, 2011
3,683
1,218
136
What?? They just doubled it with Zen 2 at the expense of L1-I cache. Pretty sure that stays put for Zen 3.
As consumers we always need more.

An L0i before the pick stage.
A decoded micro-op cache at the dispatch stage.
An in-flight micro-op cache at the retire/mapper stage (for >1K micro-instructions in OoO flight).
Execution loop caches at the scheduler stage.

Might as well add register caches, L0d, L0.5d, L1.5d, L2.5d, and L3.5d caches, and lest we forget, the L4 cache.
 
  • Haha
Reactions: Thunder 57

TheGiant

Senior member
Jun 12, 2017
748
353
106
His not-so-standard suite shows the 9900K as being better than a 3900X in floating point...


For his ICL comparison he uses an 8550U with DDR4-2133 RAM even though he has an 8650U that uses DDR4-2400; this inflates the difference, since ICL uses much faster memory. Second, he doesn't use Blender or Povray, instead reaching for memory-bound benches, for obvious reasons.

So far the only relevant numbers are CB R15 ST and the Geekbench LZMA subscore, which is indicative of integer perf but comes with the advantage of fast RAM; hence the improvement per clock is 10% (apparently...) while it's only 5% in Cinebench.

Shows it's in line with Intel's claim of 18%.
 

yuri69

Senior member
Jul 16, 2013
373
573
136
For his ICL comparison he uses an 8550U with DDR4-2133 RAM even though he has an 8650U that uses DDR4-2400; this inflates the difference, since ICL uses much faster memory. Second, he doesn't use Blender or Povray, instead reaching for memory-bound benches, for obvious reasons.

So far the only relevant numbers are CB R15 ST and the Geekbench LZMA subscore, which is indicative of integer perf but comes with the advantage of fast RAM; hence the improvement per clock is 10% (apparently...) while it's only 5% in Cinebench.
Sorry but it sounds like you are in denial.
 

Abwx

Lifer
Apr 2, 2011
10,847
3,297
136
Sorry but it sounds like you are in denial.


Phoronix does only one single-threaded bench that shows a 17% improvement, and his conclusion is that those 17% generalize to all ST even though we know it's only 5% in CB R15. Dunno who is in denial when a single test is enough for you while discarding CB; moreover it's a test based on PHP, which is certainly not the best way to measure IPC improvements.

Besides, you didn't notice that power consumption is exactly the same as KBL despite a lower frequency. Methinks there will be big surprises; for the time being I believe in nothing as long as CB R20, Povray, Blender, Corona, 7zip, x264/x265, Stockfish and such apps go untested. It's likely not by chance that those benches are missing. Hey, why wouldn't Intel disclose the improvement in such famed apps?
 

cherullo

Member
May 19, 2019
40
84
91
Inclusive means L1/L2 data is copied to L3.

L2 doubling to 1MB seems like a very good bet. And a semi-decent chance of L1 growing too.

Seems they took the "all roads lead to Rome" theme further in Zen 3 by doing this at the scale of the CCX.

Does Zen 2 follow Zen 1 in the L3 being a victim cache?

For Zen 3 the L3 cache unit may be a complex compound of its own, and may also now act as a hub. It may be a smart hybrid that is neither one nor the other; internally it may dedicate a good fraction (~30%) of capacity as an L4-ish victim cache (with L3 = hub + L3 + L4).

Inclusive would help in its role as hub when shared memory addresses are being updated by several cores (more energy efficient and lower latency than exclusive, but you get a slightly smaller total cache footprint: 32 MB vs 36 or 40 MB). Something more flexible (non-exclusive or partially inclusive), based on whether addresses are shared between cores, would have the best of both worlds.


The CCX already acts like this. Except that since it's not strictly inclusive, it doesn't have to keep all the lines up to date; the L2 is only probed for data on hits from external requests. This saves a lot of power, because otherwise the L1/L2 would have to be write-through.

Check this: https://fuse.wikichip.org/news/1177/amds-zen-cpu-complex-cache-and-smu/2/

The L3 only keeps copies of all the current CCX L2/L1 tags.
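The shadow-tag arrangement described in the WikiChip article can be sketched roughly as follows. This is a toy model with invented names and no real cache geometry: the L3 holds copies of each core's L2 tags, so an external probe only disturbs a core's private L2 when the shadow tags report a possible hit.

```python
# Toy model of shadow-tag probe filtering: the L3 mirrors each L2's tags,
# so external probes that miss in the shadow tags never touch the core.
# Illustrative only; real tag arrays are set-associative hardware.

class CoreL2:
    def __init__(self):
        self.lines = set()       # addresses currently cached in this L2
        self.probes_seen = 0     # how many times this L2 was actually probed

    def fill(self, addr, shadow_tags):
        self.lines.add(addr)
        shadow_tags.add(addr)    # L3 mirrors the L2 tag on every fill

    def probe(self, addr):
        self.probes_seen += 1
        return addr in self.lines

def external_probe(addr, core, shadow_tags):
    """Forward the probe to the core only if the shadow tags might hit."""
    if addr not in shadow_tags:
        return False             # filtered: the core's L2 is never disturbed
    return core.probe(addr)

core, shadow = CoreL2(), set()
core.fill(0x1000, shadow)
assert external_probe(0x1000, core, shadow)      # real hit, L2 probed once
assert not external_probe(0x2000, core, shadow)  # filtered, L2 untouched
assert core.probes_seen == 1
```

The power saving falls out of the last assertion: the miss never reached the private cache at all.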
 
Last edited:

Ajay

Lifer
Jan 8, 2001
15,332
7,792
136
What?? They just doubled it with Zen 2 at the expense of L1-I cache. Pretty sure that stays put for Zen 3.
I read somewhere that there were still a number of front-end bottlenecks, including the uop cache. If there is no room for improvement on 7nm+, then we'll have to wait for 5nm.
 

Hans Gruber

Platinum Member
Dec 23, 2006
2,092
1,065
136
I think people should consider that there could be some upside potential in core clocks. If they can get close to 5 GHz, that would be impressive.
 

rainy

Senior member
Jul 17, 2013
505
424
136
I think people should consider that there could be some upside potential in core clocks. If they can get close to 5 GHz, that would be impressive.

Honestly, I do not understand the fascination (obsession?) with 5 GHz. If AMD can deliver about 10 percent higher IPC with Zen 3, plus a bit higher clocks and lower latency compared to Zen 2, they could end Intel's reign even in games.
 

DrMrLordX

Lifer
Apr 27, 2000
21,582
10,785
136
I think people should consider that there could be some upside potential in core clocks. If they can get close to 5 GHz, that would be impressive.

AMD needs to work on more consistent clockspeeds more than anything else. And they need new packaging techniques to deal with hotspots.
 

moinmoin

Diamond Member
Jun 1, 2017
4,933
7,619
136
AMD needs to work on more consistent clockspeeds more than anything else.
I personally don't expect fixed high stock clocks as we knew them to ever come back past the 14nm/12nm nodes.
Different instructions draw different amounts of power and thus can't all scale up in frequency equally; consequently, hotspots are here to stay. Intel essentially forced the issue early with AVX and its offset; with Zen 2, AMD made that dynamic, showing that few instructions are equal anyway. In the future we may see IPW (instructions per watt) instead of IPC, calculated per test case or per (group of) instructions.
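The AVX-offset dynamic described above can be illustrated with a toy power model: under a fixed package power budget, an instruction class that burns more power per GHz sustains a lower clock. The per-class power figures below are invented for the example, not measured offsets.

```python
# Toy model of per-instruction-class frequency limits under a power budget.
# The watts-per-GHz numbers are made up, not real AVX offset data.

POWER_BUDGET_W = 95.0

power_per_ghz = {          # watts drawn per GHz of sustained throughput
    "scalar_int": 15.0,
    "avx2":       22.0,
    "avx512":     30.0,
}

def sustainable_ghz(instr_class, budget_w=POWER_BUDGET_W, fmax_ghz=5.0):
    """Highest clock the budget allows for this class, capped at Fmax."""
    return min(fmax_ghz, budget_w / power_per_ghz[instr_class])

for cls in power_per_ghz:
    print(f"{cls:10s} -> {sustainable_ghz(cls):.2f} GHz")
```

Scalar code hits the Fmax cap, while the wide-vector classes are power-limited to lower clocks, which is exactly the static-offset behavior that Zen 2 replaced with a fully dynamic limit.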
 

DisEnchantment

Golden Member
Mar 3, 2017
1,590
5,722
136
Another week, another couple of AMD patent applications related to stacked memory.

This one is a very novel idea. Quite interesting to read.
20190333876
METHOD AND APPARATUS FOR POWER DELIVERY TO A DIE STACK VIA A HEAT SPREADER
Various chip stack power delivery circuits are disclosed. In one aspect, an apparatus is provided that includes a stack of semiconductor chips that has an uppermost semiconductor chip and a lowermost semiconductor chip. A heat spreader is positioned on the uppermost semiconductor chip. A power transfer circuit is configured to transfer electric power from the heat spreader to the uppermost semiconductor chip.


To me this patent could be related to the integrated thermoelectric cooler patent (20180358080/20190122704). They could deliver power to the integrated thermoelectric device to extract the heat. They talk about the many ways of delivering power from the heat spreader to different layers in the stacked device. Each layer could be SRAM or could be a heat transfer layer that also serves as a protection layer.


20190333876
CONFIGURATION OF MULTI-DIE MODULES WITH THROUGH-SILICON VIAS
A data processing system includes a processing unit that forms a base die and has a group of through-silicon vias (TSVs), and is connected to a memory system. The memory system includes a die stack that includes a first die and a second die. The first die has a first surface that includes a group of micro-bump landing pads and a group of TSV landing pads. The group of micro-bump landing pads are connected to the group of TSVs of the processing unit using a corresponding group of micro-bumps. The first die has a group of memory die TSVs. The subsequent die has a first surface that includes a group of micro-bump landing pads and a group of TSV landing pads connected to the group of TSVs of the first die. The first die communicates with the processing unit using first cycle timing, and with the subsequent die using second cycle timing.



This is also very interesting: a quite detailed description of how the layout is going to look, the data transfer mechanisms, clocking, synchronization, etc. Looks like development of this is quite advanced.

Zen 4 stuff, I would say.
There are lots of novel patents around PIM as well, but those are probably GPU- and FPGA-related.
 
  • Like
Reactions: Tlh97

Tuna-Fish

Golden Member
Mar 4, 2011
1,324
1,462
136
Shared L2 cache??? (In the patent image)

The internals of the CPU in the second patent application are not the subject of the application (which is about stacked-die TSV communication). It is very normal for these kinds of "ephemeral" details to not be accurate to what they are doing now. I'd guess that the CPU detailed there is a BD. It doesn't mean they are doing a shared L2 again in the future.
 
  • Like
Reactions: Olikan

Richie Rich

Senior member
Jul 28, 2019
470
229
76
If the Apple A12 Vortex core is 2.07 mm2 (without L2 cache), then we can assume a redesigned Zen 3 core with 6x ALUs would cost approximately +0.5 mm2. For an 8-core CCD that's +4 mm2 of die size. That's nothing compared to the L3 cache size. IMHO it's very feasible.
 
  • Like
Reactions: amd6502

NostaSeronx

Diamond Member
Sep 18, 2011
3,683
1,218
136
If the Apple A12 Vortex core is 2.07 mm2 (without L2 cache), then we can assume a redesigned Zen 3 core with 6x ALUs would cost approximately +0.5 mm2. For an 8-core CCD that's +4 mm2 of die size. That's nothing compared to the L3 cache size. IMHO it's very feasible.
Apple is using low-power/high-density mobile libs. Apple can get away with a low-power short pipeline, while AMD has to optimize for a high-performance long pipeline.

Bunch of stuff:
- It is easier to exploit high ILP/MLP at lower clocks than at higher clocks. Hence, Apple is in the Goldilocks zone: 2.49 GHz~2.65 GHz (thic) and >1.58 GHz (smol).
- To achieve the same MLP/ILP at high clocks you need to adopt more OoO or SMT.
- A low-clock/high-IPC design has a significantly lower design cost than a high-clock/high-IPC design.

Zen 2 is around 2.8 mm2 with less IPC and less cache compared to Apple's cores.

Basically, AMD needs to drop high frequency and HPC (FP256/FP512) if they want to compete with Apple.

Zen 3 is unlikely to be a refactored core because it is Family 17h rather than 19h. With Renoir being noncompetitive with Tigerlake, it is unlikely AMD will recover this era.
 
Last edited:

Thunder 57

Platinum Member
Aug 19, 2007
2,647
3,706
136
Apple is using low-power/high-density mobile libs. Apple can get away with a low-power short pipeline, while AMD has to optimize for a high-performance long pipeline.

Bunch of stuff:
- It is easier to exploit high ILP/MLP at lower clocks than at higher clocks. Hence, Apple is in the Goldilocks zone: 2.49 GHz~2.65 GHz (thic) and >1.58 GHz (smol).
- To achieve the same MLP/ILP at high clocks you need to adopt more OoO or SMT.
- A low-clock/high-IPC design has a significantly lower design cost than a high-clock/high-IPC design.

Zen 2 is around 2.8 mm2 with less IPC and less cache compared to Apple's cores.

Basically, AMD needs to drop high frequency and HPC (FP256/FP512) if they want to compete with Apple.

Zen 3 is unlikely to be a refactored core because it is Family 17h rather than 19h. With Renoir being noncompetitive with Tigerlake, it is unlikely AMD will recover this era.

I like everything up until the last line. How do we know that Renoir won't be competitive with whatever Intel has out at the time? Isn't Renoir more of a competitor to Ice Lake?
 
  • Like
Reactions: Tlh97

NostaSeronx

Diamond Member
Sep 18, 2011
3,683
1,218
136
I like everything up until the last line. How do we know that Renoir won't be competitive with whatever Intel has out at the time? Isn't Renoir more of a competitor to Ice Lake?
Renoir's Zen2/Zen3 won't compete with Tigerlake's Willow Cove. Obfuscated rumor: Willow Cove will be deploying something, but it is only targeting a single, while Golden Cove will deploy the double.

On the GPU side, I have been getting mixed messages. Rumor: TGL-S/TGL-H (8-core WLC desktop/mobile) won't be launching with the 32EU iGPU, which is actually GT0. It will instead launch with the small DG1/HBM2e dGPU on package.

Basically, both AMD's 7nm and 5nm APUs will lose out to Intel's ultra-wide power-efficient core / discrete-on-package projects.

Launch dates:
Renoir -> January 2020, if it isn't delayed like the rumors state.
Tigerlake-U -> June 2020 (quad-core / 96 EU)
Tigerlake-H -> August 2020 (octo-core / 128 EU) <== There will be a PoC console running a modified Clear Linux OS for gaming.
The desktop -S model arrives in either September or October (the -H SKU is prioritized).
 
Last edited:

Enigma-

Junior Member
Feb 1, 2017
10
19
51
If the Apple A12 Vortex core is 2.07 mm2 (without L2 cache), then we can assume a redesigned Zen 3 core with 6x ALUs would cost approximately +0.5 mm2. For an 8-core CCD that's +4 mm2 of die size. That's nothing compared to the L3 cache size. IMHO it's very feasible.

4-way SMT is as unrealistic as 6 ALUs in Zen 3. Stop obsessing over the number of units on paper when you ignore the more important factors of what they can do, and why, and how. Just registered to tell you this. Real-world utilization of the current 4 ALUs is nowhere near peak, which is basic knowledge of current x86 CPUs. The back-end of Zen is still the most powerful (highest throughput) of any x86 ever constructed.

It would be far more realistic to look at WHAT each ALU can do, and dependent on what. For instance, 3 symmetrical AGUs for load/store. Another is extended ALUs for more/faster math capability like DIV, MUL, shuffle, etc., and their latencies.

Still, I repeat: the front-end is where the fruit is. You will see. Don't worry, Zen 3 will be very powerful indeed. No question about it. This is where AMD is going full force, and IPC will again blow you away.
 
Last edited:

Richie Rich

Senior member
Jul 28, 2019
470
229
76
Apple is using low-power/high-density mobile libs. Apple can get away with a low-power short pipeline, while AMD has to optimize for a high-performance long pipeline.

Bunch of stuff:
- It is easier to exploit high ILP/MLP at lower clocks than at higher clocks. Hence, Apple is in the Goldilocks zone: 2.49 GHz~2.65 GHz (thic) and >1.58 GHz (smol).
- To achieve the same MLP/ILP at high clocks you need to adopt more OoO or SMT.
- A low-clock/high-IPC design has a significantly lower design cost than a high-clock/high-IPC design.

Zen 2 is around 2.8 mm2 with less IPC and less cache compared to Apple's cores.

Basically, AMD needs to drop high frequency and HPC (FP256/FP512) if they want to compete with Apple.

Zen 3 is unlikely to be a refactored core because it is Family 17h rather than 19h. With Renoir being noncompetitive with Tigerlake, it is unlikely AMD will recover this era.
  • - I agree Apple uses different libs (trading high clock for high density).
  • - However, I don't agree about pipeline length. The A12 is around 16-18 stages and Zen 2 is 19. That's almost equal. Obviously the A12's frequency is limited by the higher-density libs, but below that clock the two chips should scale pretty similarly IMHO.
  • - I don't think there is a problem with exploiting ILP/MLP at high frequency. IMHO it's frequency-independent and depends on OoO window size. The only thing that can hurt is memory latency on a misprediction. That's why hyper-threading was developed for the P4, and maybe why SMT4 might be a good move for a wider core. I'm afraid only AMD/Intel engineers could tell us, based on CPU simulation of typical code.
  • - Why do you assume Zen 3 is Family 17h? From the Zen 2 optimization document it seems clear that Zen 2 is the last 17h chip IMHO. Did I miss some new info?


Let's calculate the Zen 3 chiplet die size:
  • 32MB of L3 cache is about 34 mm2 of the Zen 2 CCD die (78 mm2 total).
  • A 48MB unified L3 cache would be 51 mm2 (+17 mm2). That's not bad (95 mm2 total).
  • A new, wider core at +4 mm2 brings the total die size to 99 mm2.
  • The 7nm+ EUV process allows an ~18% density gain, so the 99 mm2 die shrinks down to 84 mm2. That's a minimal area increase, easy to fit 8x chiplets into an EPYC socket. Very feasible.
  • A 64MB unified L3 cache would total 98 mm2. Also feasible.
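As a sanity check on the arithmetic above, here is a quick sketch. The 34 mm2 / 78 mm2 Zen 2 figures come from the post itself; treating the 18% as a density gain (area divided by 1.18) reproduces the 84 and 98 mm2 totals:

```python
# Back-of-the-envelope Zen 3 CCD die-size math, using the figures from the
# post: 34 mm^2 for 32 MB of L3 on a 78 mm^2 Zen 2 CCD, +4 mm^2 for a wider
# core, and an ~18% density gain for 7nm+ EUV (area divided by 1.18).

ZEN2_CCD_MM2 = 78.0
L3_32MB_MM2  = 34.0
MM2_PER_MB   = L3_32MB_MM2 / 32          # ~1.06 mm^2 per MB of L3

def zen3_die(l3_mb, extra_core_mm2=4.0, density_gain=1.18):
    base = ZEN2_CCD_MM2 - L3_32MB_MM2    # everything that isn't L3
    n7_area = base + l3_mb * MM2_PER_MB + extra_core_mm2
    return n7_area / density_gain        # shrink for 7nm+ EUV

print(f"48 MB L3: {zen3_die(48):.0f} mm^2")  # -> 84
print(f"64 MB L3: {zen3_die(64):.0f} mm^2")  # -> 98
```

Note the 84 mm2 result only falls out if the 18% is a density gain (divide by 1.18); an 18% area *shrink* (multiply by 0.82) would give ~81 mm2 instead.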