Speculation: Ryzen 4000 series/Zen 3

Abwx · Nov 1, 2019

yuri69 said:
OT, but Phoronix managed to obtain ~18% IPC uplift compared to Kaby. This was done using their standard suite.

His not-so-standard suite shows the 9900K as being better than a 3900X in floating point...

AMD Ryzen 9 3900X vs. Intel Core i9 9900K Performance In 400+ Benchmarks - Phoronix

www.phoronix.com

For his ICL comparison he takes an 8550U using 2133 RAM even though he has an 8650U that uses DDR2400; this increases the difference, since ICL uses much faster memory. Second, he doesn't use Blender or POV-Ray; rather, he goes for memory-bound benchmarks, for obvious reasons.

So far the only relevant numbers are CB R15 ST and the Geekbench LZMA subscore, which is indicative of integer perf but benefits from fast RAM; hence the improvement per clock is 10% (apparently...) while it's only 5% in Cinebench.
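
For anyone who wants to redo the per-clock comparison themselves, here is a minimal sketch of the normalization. The scores and clocks in the usage example are made-up placeholders, not measurements from Phoronix, Cinebench or Geekbench, and `per_clock_uplift` is just an illustrative helper name.

```python
def per_clock_uplift(score_new, clock_new_ghz, score_old, clock_old_ghz):
    """Fractional per-clock gain of the new chip over the old one.

    Each score is divided by the clock it was obtained at, so a chip that
    only wins because it clocks higher shows no per-clock gain.
    """
    new_per_ghz = score_new / clock_new_ghz
    old_per_ghz = score_old / clock_old_ghz
    return new_per_ghz / old_per_ghz - 1.0


# Hypothetical single-thread scores and sustained clocks (placeholders only).
uplift = per_clock_uplift(score_new=180.0, clock_new_ghz=3.9,
                          score_old=176.0, clock_old_ghz=4.0)
print(f"per-clock uplift: {uplift:.1%}")
```

This only works if the clocks are the ones actually sustained during the run, which is hard to know on a laptop, and memory speed differences are not corrected for at all.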


NostaSeronx · Nov 1, 2019

Thunder 57 said:
What?? They just doubled it with Zen 2 at the expense of L1-I cache. Pretty sure that stays put for Zen 3.

As consumers we always need more.

L0i before the Pick stage.
Decoded micro-op cache with the dispatch stage.
In-flight micro-op cache with the retire/mapper stage. (For >1K micro-instructions in OoO flight)
Execution loop caches with the scheduler stage.

Might as well add Register caches, L0ds, L0.5ds, L1.5ds and L2.5ds, L3.5d, and lest not forget thee L4 cache.

TheGiant · Nov 2, 2019

Abwx said:
His not-so-standard suite shows the 9900K as being better than a 3900X in floating point...

AMD Ryzen 9 3900X vs. Intel Core i9 9900K Performance In 400+ Benchmarks - Phoronix

www.phoronix.com

For his ICL comparison he takes an 8550U using 2133 RAM even though he has an 8650U that uses DDR2400; this increases the difference, since ICL uses much faster memory. Second, he doesn't use Blender or POV-Ray; rather, he goes for memory-bound benchmarks, for obvious reasons.

So far the only relevant numbers are CB R15 ST and the Geekbench LZMA subscore, which is indicative of integer perf but benefits from fast RAM; hence the improvement per clock is 10% (apparently...) while it's only 5% in Cinebench.


AnandTech Forums: Technology, Hardware, Software, and Deals

Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

www.anandtech.com

shows it's in line with Intel's claim of 18%

yuri69 · Nov 2, 2019

Abwx said:
For his ICL comparison he takes an 8550U using 2133 RAM even though he has an 8650U that uses DDR2400; this increases the difference, since ICL uses much faster memory. Second, he doesn't use Blender or POV-Ray; rather, he goes for memory-bound benchmarks, for obvious reasons.

So far the only relevant numbers are CB R15 ST and the Geekbench LZMA subscore, which is indicative of integer perf but benefits from fast RAM; hence the improvement per clock is 10% (apparently...) while it's only 5% in Cinebench.

Sorry but it sounds like you are in denial.

Abwx · Nov 2, 2019

yuri69 said:
Sorry but it sounds like you are in denial.

Phoronix runs only one single-threaded benchmark that shows a 17% improvement, and then his conclusion is that those 17% hold for ST in general, even though we know it's only 5% in CB R15. Dunno who is in denial when a single test is enough for you while discarding CB; moreover, it's a test based on PHP, which is certainly not the best way to measure IPC improvements.

Besides, you didn't notice that power consumption is exactly the same as KBL despite a lower frequency. Methinks there will be big surprises; for the time being I believe nothing until CB R20, POV-Ray, Blender, Corona, 7-Zip, x264/x265, Stockfish and such apps are tested. If those benchmarks are not tested, it's likely not by chance. Hey, why wouldn't Intel disclose the improvement in such famed apps?

cherullo · Nov 2, 2019

Double post, sorry.

cherullo · Nov 2, 2019

amd6502 said:
inclusive means L1/L2 data is copied to L3.

L2 doubling to 1MB seems like a very good bet. And semi-decent chance L1 growing too.

Seems they took the all roads lead to Rome theme further in Zen3 by doing this at the scale of the CCX.

Does Zen2 follow Zen1 in L3 being victim cache?

For Zen3 the L3 cache unit may be a complex compound of its own and also now may have the role of acting as hub. It may be a smart hybrid thing that is not one or the other; internally it may dedicate a good fraction ~30% capacity as L4-ish victim cache (with L3 = hub+L3+L4).

Inclusive would help in its role as hub when there are shared memory addresses being updated by several cores (more energy efficient, less latency than exclusive, but you have a slightly smaller total cache footprint: 32 MB vs 36 or 40 MB). Something more flexible (non-exclusive or partially inclusive), based on whether addresses are shared between cores, would have the best of both worlds.

The CCX already acts like this, except that since it's not strictly inclusive, it doesn't have to keep all the lines up to date; the L2 is only probed for data on hits from external requests. This saves a lot of power, because otherwise the L1/L2 would have to be write-through.

Check this: https://fuse.wikichip.org/news/1177/amds-zen-cpu-complex-cache-and-smu/2/

The L3 only keeps all current CCX L2/L1 tags.
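
To make the probe-filtering idea concrete, here is a toy sketch. It is not how the real CCX is implemented; `ToyCCX` and all of its fields are made-up names, and the model ignores capacity, associativity and coherence states. It only shows that with a copy of the L2 tags held at the L3, an external request has to wake up an L2 only when those tags say the line is there.

```python
class ToyCCX:
    """Toy model: per-core L2 contents, L3 victim data, and L2 shadow tags."""

    def __init__(self, cores):
        self.l2 = [set() for _ in range(cores)]  # line addresses held per core
        self.l3_data = set()                     # lines evicted from some L2
        self.shadow_tags = {}                    # line address -> cores holding it
        self.l2_probes = 0                       # how often an L2 was disturbed

    def fill_l2(self, core, addr):
        # The line lands in the core's L2 and the L3-side tag copy records it.
        self.l2[core].add(addr)
        self.shadow_tags.setdefault(addr, set()).add(core)

    def evict_l2(self, core, addr):
        # The victim line moves to the L3 data array; the shadow tag is dropped.
        self.l2[core].discard(addr)
        owners = self.shadow_tags.get(addr, set())
        owners.discard(core)
        if not owners:
            self.shadow_tags.pop(addr, None)
        self.l3_data.add(addr)

    def external_probe(self, addr):
        # Only a shadow-tag hit causes a real L2 probe.
        owners = self.shadow_tags.get(addr)
        if owners:
            self.l2_probes += len(owners)
            return "l2"
        return "l3" if addr in self.l3_data else "miss"


ccx = ToyCCX(cores=4)
ccx.fill_l2(0, 0x40)
print(ccx.external_probe(0x80), ccx.l2_probes)  # not in any L2: no probe sent
print(ccx.external_probe(0x40), ccx.l2_probes)  # shadow-tag hit: core 0 probed
```

The point of the sketch is the counter: requests for lines that no L2 holds never touch the cores, which is where the power saving would come from.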

Ajay · Nov 2, 2019

Thunder 57 said:
What?? They just doubled it with Zen 2 at the expense of L1-I cache. Pretty sure that stays put for Zen 3.

I read, somewhere, that there were still a number of front-end bottlenecks, including the uop cache. If there is no room for improvement on 7nm+, then we’ll have to wait for 5nm.

Hans Gruber · Nov 2, 2019

I think people should consider there could be some upside potential in core clocks. If they can get close to 5ghz that would be impressive.

rainy · Nov 2, 2019

Hans Gruber said:
I think people should consider there could be some upside potential in core clocks. If they can get close to 5ghz that would be impressive.

Honestly, I do not understand that fascination (obsession?) with 5 GHz. If AMD is able to deliver about 10 percent higher IPC with Zen 3, plus slightly higher clocks and lower latency compared to Zen 2, they could end Intel's reign even in games.

soresu · Nov 2, 2019

Abwx said:
CB R20, Povray, Blender, Corona

Are not those all likely to post similar improvements given they are all renderers?

*assuming of course you meant Cycles for Blender.

DrMrLordX · Nov 2, 2019

Hans Gruber said:
I think people should consider there could be some upside potential in core clocks. If they can get close to 5ghz that would be impressive.

AMD needs to work on more consistency of clockspeed than anything else. And they need new packaging techniques to deal with hotspots.

moinmoin · Nov 3, 2019

DrMrLordX said:
AMD needs to work on more consistency of clockspeed than anything else.

I personally don't expect fixed high stock clocks as we knew them to ever come back past the 14nm/12nm nodes.
Different instructions draw different amounts of power and thus can't all scale up in frequency equally. Consequently, hotspots are here to stay. Intel essentially forced that issue early with AVX and its offset; with Zen 2, AMD made that dynamic, showing that few instructions are equal anyway. In the future we may see IPW (instructions per watt) instead of IPC, calculated per test case or per (group of) instructions.
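
As a rough sketch of how such a per-test-case metric could sit next to IPC: the counter values below are invented placeholders, and I am reading "instructions per watt" as instructions retired per joule (instructions per second divided by average watts), which is my interpretation rather than any established definition.

```python
def ipc(instructions, cycles):
    """Instructions retired per core clock cycle."""
    return instructions / cycles


def instructions_per_joule(instructions, avg_power_w, seconds):
    """Instructions retired per joule of energy (watts x seconds)."""
    return instructions / (avg_power_w * seconds)


# Hypothetical counters for two test cases on the same core (placeholders).
cases = {
    "scalar integer": dict(instructions=8.0e9, cycles=4.0e9, avg_power_w=12.0, seconds=1.0),
    "wide vector FP": dict(instructions=6.0e9, cycles=3.6e9, avg_power_w=20.0, seconds=1.0),
}
for name, c in cases.items():
    print(name,
          f"IPC={ipc(c['instructions'], c['cycles']):.2f}",
          f"instr/J={instructions_per_joule(c['instructions'], c['avg_power_w'], c['seconds']):.2e}")
```

With numbers like these, two workloads with similar IPC could still rank very differently once power is in the denominator, which is the point of quoting it per test case.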

DisEnchantment · Nov 3, 2019

Another week, another couple of AMD patent applications related to stacked memory.

This one is a very novel idea. Quite interesting to read.
20190333876
METHOD AND APPARATUS FOR POWER DELIVERY TO A DIE STACK VIA A HEAT SPREADER
Various chip stack power delivery circuits are disclosed. In one aspect, an apparatus is provided that includes a stack of semiconductor chips that has an uppermost semiconductor chip and a lowermost semiconductor chip. A heat spreader is positioned on the uppermost semiconductor chip. A power transfer circuit is configured to transfer electric power from the heat spreader to the uppermost semiconductor chip.

To me this patent could be related to the integrated thermo-electric cooler patent (20180358080/20190122704). They could deliver power to the integrated thermo-electric device to extract the heat. They talk about the many ways of delivering power to different layers in the stack device from the heat spreader. Each layer could be the SRAM or could be a heat transfer layer which also serves as a protection layer.

20190333876
CONFIGURATION OF MULTI-DIE MODULES WITH THROUGH-SILICON VIAS
A data processing system includes a processing unit that forms a base die and has a group of through-silicon vias (TSVs), and is connected to a memory system. The memory system includes a die stack that includes a first die and a second die. The first die has a first surface that includes a group of micro-bump landing pads and a group of TSV landing pads. The group of micro-bump landing pads are connected to the group of TSVs of the processing unit using a corresponding group of micro-bumps. The first die has a group of memory die TSVs. The subsequent die has a first surface that includes a group of micro-bump landing pads and a group of TSV landing pads connected to the group of TSVs of the first die. The first die communicates with the processing unit using first cycle timing, and with the subsequent die using second cycle timing.

This is also very interesting: a quite detailed description of what the layout is going to look like, data transfer mechanisms, clocking, synchronization, etc. It looks like development of this is quite advanced.

Zen 4 stuffs I would say.
There are lots of novel patents around PIM as well but probably GPU and FPGA related.

Olikan · Nov 3, 2019

Shared L2 cache??? (On the patent image)

Tuna-Fish · Nov 3, 2019

Olikan said:
Shared L2 cache??? (On the patent image)

The internals of the cpu in the second patent application are not about the subject of the patent application (which was about the stacked-die TSV communication). It is very normal for these kinds of "ephemeral" details to not be accurate to what they are doing now. I'd guess that the CPU detailed there is a BD. Doesn't mean they are doing shared L2 again in the future.

Richie Rich · Nov 4, 2019

If the Apple A12 Vortex core is 2.07 mm2 (without L2 cache), then we can assume that a re-designed Zen 3 core w/ 6xALUs would cost approximately +0.5 mm2. For an 8-core CCD it would cost +4 mm2 of die size. That's nothing compared to the L3 cache size. IMHO it's very feasible.

NostaSeronx · Nov 4, 2019

Richie Rich said:
If the Apple A12 Vortex core is 2.07 mm2 (without L2 cache), then we can assume that a re-designed Zen 3 core w/ 6xALUs would cost approximately +0.5 mm2. For an 8-core CCD it would cost +4 mm2 of die size. That's nothing compared to the L3 cache size. IMHO it's very feasible.

Apple is using low-power/high-density/mobile libs. While Apple can get away with a low-power short pipeline, AMD has to optimize for a high-performance long pipeline.

Bunch of stuff:
- It is easier to exploit high ILP/MLP at lower clocks than at higher clocks. Hence, Apple is in the Goldilocks zone: 2.49 GHz~2.65 GHz(thic)& >1.58GHz(smol)
- To achieve the same MLP/ILP at high clocks, a design needs to adopt more OoO or SMT.
- A low-clock/high-IPC design has a significantly lower design cost than a high-clock/high-IPC design.

Zen2 is around 2.8 mm2 with less IPC and less cache compared to Apple cores.

Basically, AMD needs to drop high-frequency and HPC(FP256/FP512) if they want to compete with Apple.

Zen3 is unlikely to be a refactored core because it is F17 rather than F19. With Renoir being noncompetitive to Tigerlake it is unlikely AMD will recover this era.

Thunder 57 · Nov 4, 2019

NostaSeronx said:
Apple is using low-power/high-density/mobile libs. While Apple can get away with a low-power short pipeline, AMD has to optimize for a high-performance long pipeline.

Bunch of stuff:
- It is easier to exploit high ILP/MLP at lower clocks than at higher clocks. Hence, Apple is in the Goldilocks zone: 2.49 GHz~2.65 GHz(thic)& >1.58GHz(smol)
- To achieve the same MLP/ILP at high clocks, a design needs to adopt more OoO or SMT.
- A low-clock/high-IPC design has a significantly lower design cost than a high-clock/high-IPC design.

Zen2 is around 2.8 mm2 with less IPC and less cache compared to Apple cores.

Basically, AMD needs to drop high-frequency and HPC(FP256/FP512) if they want to compete with Apple.

Zen3 is unlikely to be a refactored core because it is F17 rather than F19. With Renoir being noncompetitive to Tigerlake it is unlikely AMD will recover this era.

I like everything up until the last line. How do we know that Renoir won't be competitive with whatever Intel has out at the time? Isn't Renoir more of a competitor to Ice Lake?

NostaSeronx · Nov 5, 2019

Thunder 57 said:
I like everything up until the last line. How do we know that Renoir won't be competitive with whatever Intel has out at the time? Isn't Renoir more of a competitor to Ice Lake?

Renoir's Zen2/Zen3 won't compete with Tigerlake's Willowcove. Obfuscated rumor: Willowcove will be deploying a something but it is only targeting a single, while Goldencove will deploy the double.

On the GPU side, I have been getting mixed messages. Rumor: TGL-S/TGL-H(8-core WLC Desktop/Mobile) won't be launching with the 32EU igpu, it is actually GT0. It instead will be launching with the small DG1/HBM2e dgpu on package.

Basically, both AMD's 7nm and 5nm APUs will lose out to Intel's ultra-wide power-efficient core/discrete on package projects.

Launch dates:
Renoir -> January 2020 if it isn't delayed like the rumors state.
Tigerlake-U -> June 2020 (quad-core/96 eu)
Tigerlake-H -> August 2020 (octo-core/128 eu) <== There will be a PoC console running a modified clear linux os for gaming.
The desktop -S model being either September or October. (The -H SKU is prioritized)

Enigma- · Nov 5, 2019

Richie Rich said:
If the Apple A12 Vortex core is 2.07 mm2 (without L2 cache), then we can assume that a re-designed Zen 3 core w/ 6xALUs would cost approximately +0.5 mm2. For an 8-core CCD it would cost +4 mm2 of die size. That's nothing compared to the L3 cache size. IMHO it's very feasible.

4-way SMT is as unrealistic as 6 ALUs in Zen3. Stop obsessing about the number of units on paper while ignoring the more important factors: what they can do, why, and how. Just registered to tell you this. Real-world usage is nowhere near peak utilization of the current 4 ALUs, which is basic knowledge about current x86 CPUs. The back-end of Zen is still the most powerful (highest TP) of any x86 ever constructed.

It would be way more realistic to look at WHAT each ALU can do, and what it depends on. For instance, 3 symmetrical AGUs for L/S. Another is extended ALUs for more/faster math capability like DIV, MUL, shuffle etc. and their latency.

Still, I repeat: the front-end is where the fruit is. You will see. Don't worry, Zen3 will be very powerful indeed. No question about it. This is where AMD is going full force, and IPC will again blow you away.

Richie Rich · Nov 5, 2019

NostaSeronx said:
Apple is using low-power/high-density/mobile libs. While Apple can get away with a low-power short pipeline, AMD has to optimize for a high-performance long pipeline.

Bunch of stuff:
- It is easier to exploit high ILP/MLP at lower clocks than at higher clocks. Hence, Apple is in the Goldilocks zone: 2.49 GHz~2.65 GHz(thic)& >1.58GHz(smol)
- To achieve the same MLP/ILP at high clocks, a design needs to adopt more OoO or SMT.
- A low-clock/high-IPC design has a significantly lower design cost than a high-clock/high-IPC design.

Zen2 is around 2.8 mm2 with less IPC and less cache compared to Apple cores.

Basically, AMD needs to drop high-frequency and HPC(FP256/FP512) if they want to compete with Apple.

Zen3 is unlikely to be a refactored core because it is F17 rather than F19. With Renoir being noncompetitive to Tigerlake it is unlikely AMD will recover this era.

- I agree Apple uses different libs (trading high clock for high density).
- However, I don't agree about pipeline length. A12 is around 16/18 stages and Zen2 is 19. That's almost equal. Obviously A12's frequency is limited by higher density (libs); however, below that clock those two chips will scale pretty similarly IMHO.
- I don't think there is a problem with exploiting ILP/MLP at high frequency. IMHO it's frequency independent and depends on OoO window size. The only thing that can hurt is memory latency in case of a misprediction. That's why hyper-threading was developed for P4. Maybe that's why SMT4 might be a good move for a wider core. I'm afraid only AMD/Intel engineers could tell us, based on CPU SW simulation of typical code.
- Why do you assume Zen3 is Family 17h? From the Zen2 optimization document it's clear that Zen2 is the last 17h chip IMHO. Did I miss some new info?

Let's calculate Zen 3 chiplet die size:

32MB L3 cache takes about 34 mm2 of the Zen 2 CCD die (total 78 mm2).
A 48MB unified L3 cache would be 51 mm2 (+17 mm2). That's not bad (total 95 mm2).
A new wider core adds +4 mm2, for a total die size of 99 mm2.
The 7nm+ EUV process allows about 18% higher density, so 99 mm2 shrinks to about 84 mm2 (99 / 1.18). That's a minimal area increase, easy to fit 8x chiplets into an EPYC socket. Very feasible.
A 64MB unified L3 cache would be 98 mm2 total. Also feasible.
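
The same estimate as a small script, so the inputs are easy to change. Every constant is a figure from this post, not a measured value, and the function and constant names are made up for illustration. One assumption is worth flagging: the 84 and 98 mm2 results only come out if the 18% is applied as a density gain (area divided by 1.18); taking 18% off the area instead (area times 0.82) would give roughly 81 and 95 mm2.

```python
ZEN2_CCD_MM2 = 78.0     # total Zen 2 CCD area used in the post
ZEN2_L3_MM2 = 34.0      # area the post attributes to the 32 MB L3
ZEN2_L3_MB = 32
WIDER_CORES_MM2 = 4.0   # +0.5 mm2 per core x 8 cores, the post's guess
DENSITY_GAIN = 1.18     # assumption: "18%" read as a density gain on 7nm+


def zen3_ccd_estimate(l3_mb, density_gain=DENSITY_GAIN):
    """Return (area on 7nm, area on 7nm+) in mm2 for a given L3 size."""
    mm2_per_mb = ZEN2_L3_MM2 / ZEN2_L3_MB      # L3 assumed to scale linearly
    non_l3 = ZEN2_CCD_MM2 - ZEN2_L3_MM2        # cores, L2, fabric, I/O
    on_7nm = non_l3 + WIDER_CORES_MM2 + l3_mb * mm2_per_mb
    return on_7nm, on_7nm / density_gain


for l3_mb in (48, 64):
    before, after = zen3_ccd_estimate(l3_mb)
    print(f"{l3_mb} MB L3: {before:.0f} mm2 on 7nm, ~{after:.0f} mm2 on 7nm+")
```

The linear L3 scaling and the uniform shrink across logic, SRAM and I/O are both simplifications; SRAM and analog blocks usually shrink less than logic, so treat the results as a lower bound.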

krumme · Nov 5, 2019

I am not sure you can compare pipeline length like that.
It's not like counting seats in the bus, but more like counting the good chicks in the bar.

uzzi38 · Nov 6, 2019

NostaSeronx said:
Renoir's Zen2/Zen3 won't compete with Tigerlake's Willowcove. Obfuscated rumor: Willowcove will be deploying a something but it is only targeting a single, while Goldencove will deploy the double.

On the GPU side, I have been getting mixed messages. Rumor: TGL-S/TGL-H(8-core WLC Desktop/Mobile) won't be launching with the 32EU igpu, it is actually GT0. It instead will be launching with the small DG1/HBM2e dgpu on package.

Basically, both AMD's 7nm and 5nm APUs will lose out to Intel's ultra-wide power-efficient core/discrete on package projects.

Launch dates:
Renoir -> January 2020 if it isn't delayed like the rumors state.
Tigerlake-U -> June 2020 (quad-core/96 eu)
Tigerlake-H -> August 2020 (octo-core/128 eu) <== There will be a PoC console running a modified clear linux os for gaming.
The desktop -S model being either September or October. (The -H SKU is prioritized)

TigerLake-H is nonexistent. Or perhaps dead is the better term. -H and -S will only come with Alder Lake, if at all.

DisEnchantment · Nov 9, 2019

TSMC's presentation from several months ago on fully 3D stacked ICs. b and d are the new stuffs.

TSMC-SoIC® - Taiwan Semiconductor Manufacturing Company Limited

www.tsmc.com

3D Multi-chip Integration with System on Integrated Chips (SoIC™)

The electrical characterization of System on Integrated Chips (SoIC™), an innovative 3D heterogeneous integration technology manufactured in front-end of line with known-good-die is reported. Chiplets integration of devices including foundry leading edge 7nm FinFET technology with SoIC™...

ieeexplore.ieee.org

System on Integrated Chips (SoIC(TM) for 3D Heterogeneous Integration

A brand new 3D integrated circuit (3DIC) solution, System on Integrated Chips (SoIC™), has been successfully developed to integrate active and passive chips into a new integrated SoC system to meet ever-increasing market demands on higher computing efficiency, wilder data bandwidth, higher...

ieeexplore.ieee.org
