Speculation: Ryzen 4000 series/Zen 3


DrMrLordX

Lifer
Apr 27, 2000
21,637
10,855
136
Adding a bunch of hardware to do AI badly on the CPU seems like a silly idea, when anyone doing serious amounts of AI work on servers will have accelerators better suited to the task.

Agreed, but . . .

You could say the same thing of ARM, who have independent ML accelerator cores yet still accelerate ML functions on their CPU cores - arguably they know what they are doing.

Different target markets. AMD isn't in cell phones/tablets at all, so adding ML instructions to their CPUs for the reasons the mobile/tablet SoC designers add them would be ludicrous. AMD has their dGPUs for ML/AI, arguably at better perf/watt and definitely at better overall performance than anything from the mobile SoC sector.

Then there's VIA's bizarre decision to include ML in their CPUs. Their market rationale is truly odd and out of place. I would think AMD following suit would be seen in a similar fashion. Intel's decision to include bfloat16 seems really weird when they should be upselling Loihi instead.
 

soresu

Platinum Member
Dec 19, 2014
2,662
1,862
136
Agreed, but . . .



Different target markets. AMD isn't in cell phones/tablets at all, so adding ML instructions to their CPUs for the reasons the mobile/tablet SoC designers add them would be ludicrous. AMD has their dGPUs for ML/AI, arguably at better perf/watt and definitely at better overall performance than anything from the mobile SoC sector.

Then there's VIA's bizarre decision to include ML in their CPUs. Their market rationale is truly odd and out of place. I would think AMD following suit would be seen in a similar fashion. Intel's decision to include bfloat16 seems really weird when they should be upselling Loihi instead.
It only sounds bizarre because most people still haven't grasped how pervasive ML workloads are becoming.

It will soon become as important a workload as any, and that means supporting it as well as possible across all system compute hardware, regardless of what may do it better in the ideal scenario.

Not every AMD CPU system will have an AMD GPU, and even those that do may well be mismatched with an older GPU.
 
  • Like
Reactions: lightmanek

amd6502

Senior member
Apr 21, 2017
971
360
136
Another trick AMD could pull is a shared FPU à la Bulldozer. Two cores would share 12 pipes (6x 256-bit FPUs) instead of having 8+8 pipes (8 FPUs each). Such a configuration would save a lot of transistors while producing similar performance. The cost is a radical uarch change (it cannot be done as a Zen 2 evolution). But Zen 3 is a new uarch, and AMD has experience from Bulldozer, so who knows. Such a shared FPU (and a shared front end) has one nice advantage: the 4-core CCX becomes an 8-core CCX (with the L2$ shared by two cores). This configuration is less probable IMHO.

That's a very interesting idea. Dozer was actually kind of a hybrid: it was CMT on the integer side and SMT on the FPU side.

Now Zen could mimic this pattern and remain 2x SMT2 on the integer side while going SMT4 on the FPU side. This could be a huge boost for threads running very FPU-heavy code.


IPC calculations from SPECint2006 scores:
  • 9900K: 54.28 / 5.00 GHz = 10.86 pts/GHz
  • 3950X: 50.02 / 4.60 GHz = 10.87 pts/GHz
  • A76: 26.65 / 2.84 GHz = 9.38 pts/GHz
  • A77: 33.32 / 2.84 GHz = 11.73 pts/GHz (+8% IPC over 9900K)
  • A11: 36.80 / 2.39 GHz = 15.40 pts/GHz (+42% IPC over 9900K)
  • A12: 45.32 / 2.53 GHz = 17.91 pts/GHz (+65% IPC over 9900K)
  • A13: 52.82 / 2.65 GHz = 19.93 pts/GHz (+83% IPC over 9900K)

As far as this table goes, it's a bit unfair to the 4 and 5 GHz contenders, because the scaling is a very crude approximation. You should run the 9900K and 3950X at 2.5 GHz. Otherwise it's as if you handicapped these competitors with very high-latency memory (about 2x the latency of whatever RAM the A12 was using).
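For anyone who wants to replay the arithmetic behind that table: a minimal C sketch, with the scores and clocks copied from the quote above and everything else hypothetical, reproduces the pts/GHz figures and the percentages relative to the 9900K.

```c
/* Replaying the table's arithmetic: IPC proxy = SPECint2006 score / clock,
 * and relative IPC vs. the 9900K baseline. */
#include <stdio.h>

struct chip { const char *name; double score, ghz; };

int main(void) {
    const struct chip chips[] = {
        { "9900K", 54.28, 5.00 }, { "3950X", 50.02, 4.60 },
        { "A76",   26.65, 2.84 }, { "A77",   33.32, 2.84 },
        { "A11",   36.80, 2.39 }, { "A12",   45.32, 2.53 },
        { "A13",   52.82, 2.65 },
    };
    const double baseline = chips[0].score / chips[0].ghz;  /* 10.86 pts/GHz */
    for (size_t i = 0; i < sizeof chips / sizeof chips[0]; i++) {
        double ipc = chips[i].score / chips[i].ghz;
        printf("%-6s %6.2f pts/GHz  %+4.0f%% vs 9900K\n",
               chips[i].name, ipc, (ipc / baseline - 1.0) * 100.0);
    }
    return 0;
}
```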


View attachment 14993

Allowing FP128 instructions to also execute on FP4-7 is the easier-to-implement option, with instant IPC growth for FP128/legacy SSE2+ workloads.

I don't recognize Zen2 core looking like this. Source?

Looks interesting, but what is FP4-7?
 

Thunder 57

Platinum Member
Aug 19, 2007
2,675
3,801
136
I don't recognize Zen2 core looking like this. Source?

Looks interesting, but what is FP4-7?

The source is himself; otherwise he would've linked to an image from a reputable website. Instead, he just made his own image and uploaded it, passing fan fiction off as fact. I was going to comment on this earlier but figured I'd let it be. Don't ever expect to get a source, though. At best you will get words like "easily found".
 

lobz

Platinum Member
Feb 10, 2017
2,057
2,856
136
As far as this table goes, it's a bit unfair to the 4 and 5 GHz contenders, because the scaling is a very crude approximation. You should run the 9900K and 3950X at 2.5 GHz. Otherwise it's as if you handicapped these competitors with very high-latency memory (about 2x the latency of whatever RAM the A12 was using).
There is no realistic setup for a test like this right now. But you will never convince the ARMada that they're comparing bananas to oranges every day, saying that these bananas are much tastier bananas than those oranges. Well, they should be! They are bananas, for God's sake. But they are not very good as oranges, and vice versa.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
I don't recognize Zen2 core looking like this. Source?
It is the FPU; the source is every high-res Matisse die shot ever.
Looks interesting, but what is FP4-7?
Floating Point Pipe 4: Replicated Floating Point Pipe 0 for bits 128-255 (FP256: upper 4x32-bit / upper 2x64-bit)
Floating Point Pipe 5: Replicated Floating Point Pipe 1 for bits 128-255
Floating Point Pipe 6: Replicated Floating Point Pipe 2 for bits 128-255
Floating Point Pipe 7: Replicated Floating Point Pipe 3 for bits 128-255

Evolving the design from just the upper 128 bits to the lower 128 bits as well is low-hanging fruit. The PRF design is literally copy-pasted between the two sections. So it isn't hampered like the previous designs, Greyhound/Bulldozer, which had separate PRF designs for the lower bits 0-63 and the upper bits 64-127. The benefit of Zen2's FPU is that both sections have control bits, whereas GH/BD had only one PRF with control bits.
 
Last edited:
  • Like
Reactions: amd6502

Tuna-Fish

Golden Member
Mar 4, 2011
1,355
1,547
136
NostaSeronx is right that the 256-bit FPU just consists of "copy-pasting" (and mirroring) the existing FPU for the upper halves. The die-shot is real.

What he's not right about is that it would be easy to make the upper halves act as additional pipes for 128-bit SSE. There is a reason why the EUs of processor cores are all bunched up like they are in that shot: it would cost multiple clock cycles to cross the distance between the upper-half and lower-half parts. AVX allows splitting the EUs into upper and lower halves like that because it's kind of rare for any information to cross the 128-bit boundary. When it does, you can clearly see how long it takes to move the data, as any such instruction has many cycles of extra latency over an instruction of similar complexity that does not need to cross the distance. So if you just allowed 128-bit operations to cross the boundary, you would gain throughput but insert ~3 cycles of latency in front of any instruction that consumes a result produced by the other half of the FPU.

There IS one intriguing possibility though: the FPU instructions of one thread on the core never need data from the other one. If they change the FPU from mirroring almost everything to full mirroring, they could use FPU cluster 0 as the lower half for thread 0 and the upper half for thread 1, and FPU cluster 1 as the upper half for thread 0 and the lower half for thread 1. This would allow 128-bit SSE/AVX instructions to be issued by both threads at the same time without ever conflicting, with minimal extra transistors needed.

I don't think this is likely, but it is possible.
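That lane-crossing penalty can be probed from software. A hypothetical sketch (file name and iteration count made up; absolute numbers vary by microarchitecture): time a dependent chain of in-lane shuffles against a dependent chain of lane-crossing permutes, and the per-op difference approximates the cost of crossing the 128-bit boundary.

```c
/* A rough probe of the lane-crossing penalty described above: a dependent
 * chain of in-lane shuffles vs. a dependent chain of lane-crossing permutes.
 * Hypothetical sketch; absolute numbers vary by microarchitecture.
 * Build: gcc -O2 -mavx2 lanes.c */
#include <immintrin.h>
#include <stdio.h>
#include <time.h>

static double chain_ns(int crossing, long iters) {
    __m256 v = _mm256_set1_ps(1.0f);
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) {
        if (crossing)
            v = _mm256_permute2f128_ps(v, v, 0x01); /* swaps the 128-bit halves */
        else
            v = _mm256_shuffle_ps(v, v, 0x1B);      /* stays within each lane */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    volatile float sink = _mm256_cvtss_f32(v);      /* keep the chain alive */
    (void)sink;
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec))
           / (double)iters;
}

int main(void) {
    const long iters = 200000000L;
    printf("in-lane shuffle : %.2f ns/op\n", chain_ns(0, iters));
    printf("lane-crossing   : %.2f ns/op\n", chain_ns(1, iters));
    return 0;
}
```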
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,774
3,152
136
Adding a bunch of hardware to do AI badly on the CPU seems like a silly idea, when anyone doing serious amounts of AI work on servers will have accelerators better suited to the task.
It's not about doing AI badly. AMD has GPUs to do matrix multiplies for workloads that suit higher latency. The point of a matrix-multiply unit on a CPU would be workloads that require low latency / have serial dependencies. It would also be transistors that could easily be power-gated off when not in use.

The question is whether there are enough meaningful workloads to make an implementation like that worthwhile.
 
  • Like
Reactions: Drazick

Richie Rich

Senior member
Jul 28, 2019
470
229
76
I am counting it right.
View attachment 14993

It isn't native FP256 like Intel's implementation. It is literally 8x 128-bit datapaths, with FP0-3 being the low 128 bits and FP4-7 the high 128 bits. It would be relatively simple to switch from 4x 128-bit / 4x 256-bit to 8x 128-bit / 4x 256-bit. Increasing FPU availability across the rest of the 128-bit datapaths absolutely will give higher IPC than AVX512. AVX512 requires a new ISA and requires using full-width 512-bit instructions to max out usage, much like Zen2, where the instructions must be 256-bit AVX to use all the datapaths.

Allowing FP128 instructions to also execute on FP4-7 is the easier-to-implement option, with instant IPC growth for FP128/legacy SSE2+ workloads.
If there are 8 pipes of 128 bits, it must show up in performance too: is 128-bit code significantly faster than 256-bit?
Adding functionality is not the same as adding a whole pipe (including a new scheduler pipe, changes to the ROB, etc.). That's the main difference between an evolution like Zen 2 and the completely new uarch that Zen 3 will be. One possibility is that Zen 3 doubles the pipes to 8x but at 128 bits (widening to 256 bits with Zen 4, the same Zen 1 -> Zen 2 trick).


That's a very interesting idea. Dozer was actually kind of a hybrid: it was CMT on the integer side and SMT on the FPU side.

Now Zen could mimic this pattern and remain 2x SMT2 on the integer side while going SMT4 on the FPU side. This could be a huge boost for threads running very FPU-heavy code.
Exactly, this was one of the good things about the otherwise terrible Dozer. And that's one possible explanation for the high FP IPC increase we see in the Zen 3 leaks.


As far as this table goes, it's a bit unfair to the 4 and 5 GHz contenders, because the scaling is a very crude approximation. You should run the 9900K and 3950X at 2.5 GHz. Otherwise it's as if you handicapped these competitors with very high-latency memory (about 2x the latency of whatever RAM the A12 was using).
Unfair as an iso-clock IPC comparison, but fair as a peak-performance IPC comparison at the clocks these CPUs were designed for. The 9900K and 3950X have pipeline depths and memory subsystems (caches) designed for such high clocks. When we look at the 64-core EPYC2 running at 2.5 GHz (2.25 GHz base), there is no significant IPC benefit at the lower clock. So the majority of IPC comes from the uarch design. And a +82% IPC advantage is almost twice as fast. Funny that people cannot believe a +17% IPC gain for Zen 3 when there is actually a huge +82% IPC deficit/potential out there. Should we be happy to get just 17% out of 82%?
 
  • Like
Reactions: .vodka

soresu

Platinum Member
Dec 19, 2014
2,662
1,862
136
When we look at the 64-core EPYC2 running at 2.5 GHz (2.25 GHz base), there is no significant IPC benefit at the lower clock. So the majority of IPC comes from the uarch design.
A ridiculous comparison - once you bring huge many-core, many-die server chips into the equation, you are talking total MT performance more than ST IPC, and the majority of that performance comes from the 64 cores!
 
  • Like
Reactions: xpea and Thunder 57

mtcn77

Member
Feb 25, 2017
105
22
91
Unfair as an iso-clock IPC comparison, but fair as a peak-performance IPC comparison at the clocks these CPUs were designed for. The 9900K and 3950X have pipeline depths and memory subsystems (caches) designed for such high clocks. When we look at the 64-core EPYC2 running at 2.5 GHz (2.25 GHz base), there is no significant IPC benefit at the lower clock. So the majority of IPC comes from the uarch design. And a +82% IPC advantage is almost twice as fast. Funny that people cannot believe a +17% IPC gain for Zen 3 when there is actually a huge +82% IPC deficit/potential out there. Should we be happy to get just 17% out of 82%?
When discussing performance at the designated TDP, I think a major portion of the emphasis is on task power relative to stock TDP. This is an area in which the major contenders take a heavy blow from the underdogs (Apple vs. Qualcomm, and Intel vs. AMD). Intel, for instance, cuts the frequency bins short when running AVX512. This is good.
The alternative is the tendency towards a higher operating LLC (load-line calibration) setting, which works against the intended task-power goal. Saving not just on power but on operating temperature is better for an adaptable clock profile. Let me explain this further as a matter of Vdroop.
The motherboard's power is finite. Any amount of current comes at the cost of voltage deviation that the VRM phases need to rectify. Twice the power at comparable voltage consistency requires double the number of phases. In that case, it becomes relatively easier to just let the voltage sag and adapt the clock frequency to the target power budget rather than to a target performance level. It is always cooler for the power circuitry to run the CPU at a lower LLC setting, because a high-current LLC level that accounts for overshoot will spike temperatures throughout the range before it is absolutely necessary, and will thus lead to a higher temperature profile. Better to save on power and temperature than to vaingloriously sustain higher clocks.
The part involving AMD is that the caches need heavier voltage regulation to operate on AVX512. That is why playing fast follower to Intel is more beneficial than challenging AVX512 performance head-on.
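To put rough numbers on the Vdroop argument: the standard VRM load-line relation is V_load = VID - I x R_LL, so a flatter load line (more aggressive LLC) holds voltage, and therefore power and temperature, higher under load. A sketch with made-up values (setpoint, current, and load-line resistances are all hypothetical):

```c
/* Worked load-line (Vdroop) arithmetic for the point above. Numbers are
 * hypothetical: a 1.25 V setpoint, 100 A load, and two LLC settings
 * expressed as different effective load-line resistances. */
#include <stdio.h>

int main(void) {
    const double vid  = 1.25;          /* requested voltage (V) */
    const double amps = 100.0;         /* package current under load (A) */
    const double rll_relaxed = 0.0010; /* 1.0 mohm: let it sag (low LLC) */
    const double rll_flat    = 0.0003; /* 0.3 mohm: aggressive LLC */

    /* V_load = VID - I * R_loadline; flatter line = higher V and power */
    double v_relaxed = vid - amps * rll_relaxed;
    double v_flat    = vid - amps * rll_flat;
    printf("low LLC : %.3f V -> %.1f W\n", v_relaxed, v_relaxed * amps);
    printf("high LLC: %.3f V -> %.1f W\n", v_flat, v_flat * amps);
    return 0;
}
```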
 

moinmoin

Diamond Member
Jun 1, 2017
4,952
7,666
136
I doubt ML workloads will creep into the desktop/workstation sector very quickly. But we'll see.
Microsoft is pushing Windows Hello, especially on their Surface products, and they now use AMD APUs there as well. So some ML integration coming at some point really is not that far-fetched, unless it takes too much silicon area.
 
  • Like
Reactions: soresu

Richie Rich

Senior member
Jul 28, 2019
470
229
76
That is why playing fast follower to Intel is more beneficial than challenging AVX512 performance head-on.
I think AMD has a different problem with AVX512: there are as many as 20 instruction subsets. That's a huge number compared to SSE4 (two subsets, SSE4.1 and SSE4.2) and AVX1/AVX2 with one subset each. Even the newest Ice Lake supports only 14 out of 20. https://en.wikipedia.org/wiki/AVX-512
When AMD supports AVX512, will they support just the basic Foundation subset, the customer-demanded subsets (like Intel does), or all of them? Just imagine Zen 3 supporting all the subsets - the first CPU in the world to do so, beating Intel in its own yard. Since 2013, when AVX512 was introduced, AMD has had a lot of time to develop it (in the background, in parallel, as part of Zen 3's completely new uarch; it could also be introduced later in Zen 4). The funny thing is that AMD doesn't need any magic (disruptive tech) here - just to concentrate on the right technologies already available (SMT4, AVX512, 6x ALUs, etc.).
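That fragmentation is visible from software, too: each subset has its own CPUID feature bit, so code has to probe them one at a time. A minimal GCC/Clang sketch checking a few of the well-known leaf-7 bits (only a handful of the roughly 20 subsets shown):

```c
/* Probing a few AVX-512 subset feature bits via CPUID leaf 7, subleaf 0.
 * Sketch for GCC/Clang on x86; only some of the ~20 subsets are listed. */
#include <cpuid.h>
#include <stdio.h>

int main(void) {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 1;                      /* CPUID leaf 7 not available */
    printf("AVX512F  (Foundation): %d\n", !!(ebx & (1u << 16)));
    printf("AVX512DQ             : %d\n", !!(ebx & (1u << 17)));
    printf("AVX512CD             : %d\n", !!(ebx & (1u << 28)));
    printf("AVX512BW             : %d\n", !!(ebx & (1u << 30)));
    printf("AVX512VL             : %d\n", !!(ebx & (1u << 31)));
    printf("AVX512_VNNI          : %d\n", !!(ecx & (1u << 11)));
    return 0;
}
```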


A ridiculous comparison - once you bring huge many-core, many-die server chips into the equation, you are talking total MT performance more than ST IPC, and the majority of that performance comes from the 64 cores!
You are too pessimistic here. It would be even worse for x86 if we compared two 64-core chips. A hypothetical 64-core A13 would be much faster, with much lower TDP on top of that (the slow A76 in Neoverse N1 already gives the x86 world plenty of headaches in the 64-core Graviton2 and the 80-core Ampere eMAG). IMHO that's exactly the source of motivation for all those engineers who founded Nuvia.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
Is 128-bit code significantly faster than 256-bit?
128-bit SIMD is more prevalent in general-purpose code. If the average consumer is running SIMD at all, it will mostly be 128-bit.

There is definitely more VFMADDPS xmm, xmm, xmm/m128 than VFMADDPS ymm, ymm, ymm/m256 in existence across codebases, and the same can be said for the separate MUL/ADD versions.
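For reference, the two widths look like this in intrinsics code. A small sketch, assuming GCC/Clang with -mfma -mavx2; modern compilers emit the three-operand FMA3 encodings (VFMADD213PS and friends) for these, while the plain VFMADDPS mnemonic above is the older four-operand FMA4 form, but the xmm/ymm split is the same idea:

```c
/* The same fused multiply-add as a 128-bit (xmm) and a 256-bit (ymm) op.
 * Build: gcc -O2 -mfma -mavx2 fma.c */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128 a4 = _mm_set1_ps(2.0f), b4 = _mm_set1_ps(3.0f), c4 = _mm_set1_ps(1.0f);
    __m256 a8 = _mm256_set1_ps(2.0f), b8 = _mm256_set1_ps(3.0f), c8 = _mm256_set1_ps(1.0f);

    __m128 r4 = _mm_fmadd_ps(a4, b4, c4);    /* 4 lanes: a*b + c -> xmm */
    __m256 r8 = _mm256_fmadd_ps(a8, b8, c8); /* 8 lanes: a*b + c -> ymm */

    printf("xmm lane0 = %f, ymm lane0 = %f\n",
           _mm_cvtss_f32(r4), _mm256_cvtss_f32(r8));
    return 0;
}
```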
 

eek2121

Platinum Member
Aug 2, 2005
2,930
4,026
136
There are a few really noisy elephants in the room that everyone is forgetting about:

  1. TSMC N5 hits volume production Q1 of 2020 and has a dedicated 'HPC' path that mobile chips won't use.
  2. 7nm EUV offers very little over 7nm except increased margins for AMD and up to 10% higher clock speeds.
  3. Intel is rumored (and the rumors have come from a reliable source) to be upping their core count as well as their clock speed, with base/boost numbers for Core i3s, i5s, and i7s far exceeding AMD chips. These chips are rumored for release around April 2019.
  4. TSMC's guidance appears to state that 7nm customers should transition to 6nm or 5nm rather than to N7+. The N6 node is ramping up more slowly than N5.
  5. AMD definitely won't be releasing Zen 3 before July of next year, as July 7th will mark the one-year anniversary of Zen 2.
Based on the above, I'm calling it (all in good fun; I could very likely be wrong): Zen 3 will be a unified chiplet design with the CCD on 5nm and the IO die on 7nm EUV. Expected clock speeds will reach or exceed 5 GHz. I also do not believe that AMD will offer any additional features beyond a 10-15% IPC increase.

CHOOCHOO!
 

lobz

Platinum Member
Feb 10, 2017
2,057
2,856
136
There are a few really noisy elephants in the room that everyone is forgetting about:

  1. TSMC N5 hits volume production Q1 of 2020 and has a dedicated 'HPC' path that mobile chips won't use.
  2. 7nm EUV offers very little over 7nm except increased margins for AMD and up to 10% higher clock speeds.
  3. Intel is rumored (and the rumors have come from a reliable source) to be upping their core count as well as their clock speed, with base/boost numbers for Core i3s, i5s, and i7s far exceeding AMD chips. These chips are rumored for release around April 2019.
  4. TSMC's guidance appears to state that 7nm customers should transition to 6nm or 5nm rather than to N7+. The N6 node is ramping up more slowly than N5.
  5. AMD definitely won't be releasing Zen 3 before July of next year, as July 7th will mark the one-year anniversary of Zen 2.
Based on the above, I'm calling it (all in good fun; I could very likely be wrong): Zen 3 will be a unified chiplet design with the CCD on 5nm and the IO die on 7nm EUV. Expected clock speeds will reach or exceed 5 GHz. I also do not believe that AMD will offer any additional features beyond a 10-15% IPC increase.

CHOOCHOO!
I guess you wanted to write April 2020. Other than that, Zen 3 = 7nm EUV. It's set in stone. It was designed for that node.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
It's set in stone. It was designed for that node.
If it is set in stone, it will also have SMT4 and AVX512. Family K18.2 = Zen3, before it was used for Dhyana.
Arden/Nedra/Anaconda X-series => Zen3 AVX512 + RDNA2, yadda yadda. However, it is using DUV 7nm. <== Also used the 18h/24 family in the early Microsoft APUs.

Re-tapeouts (RTO) are only possible between N7 -> N7P -> N6. It is more likely we see a tapeout on 7nm and a partial NTO w/ 6T N6 (higher density than N7+ 6T). Similar to Excavator's evolution from 13T-28SHP (Steamroller) to 9T-28A (XV).

HPC and Mobile share the same track height in N5; however, the HPC fins use more extensive Ge stressors.
 
Last edited:

lobz

Platinum Member
Feb 10, 2017
2,057
2,856
136
If it is set in stone, it will also have SMT4 and AVX512. Family K18.2 = Zen3, before it was used for Dhyana.
Arden/Nedra/Anaconda X-series => Zen3 AVX512 + RDNA2, yadda yadda. However, it is using DUV 7nm. <== Also used the 18h/24 family in the early Microsoft APUs.

Re-tapeouts (RTO) are only possible between N7 -> N7P -> N6. It is more likely we see a tapeout on 7nm and a partial NTO w/ 6T N6 (higher density than N7+ 6T). Similar to Excavator's evolution from 13T-28SHP (Steamroller) to 9T-28A (XV).

HPC and Mobile share the same track height in N5; however, the HPC fins use more extensive Ge stressors.
Man, half your comments are fairy tales; I don't always know what to take seriously. But if you say so, why not.
 
  • Haha
Reactions: CHADBOGA

DrMrLordX

Lifer
Apr 27, 2000
21,637
10,855
136
Microsoft is pushing Windows Hello, especially on their Surface products, and they now use AMD APUs there as well. So some ML integration coming at some point really is not that far-fetched, unless it takes too much silicon area.

MS can push all manner of things. Windows Hello doesn't require ML instructions anyway. And, mind you, that's for Surface. For desktop/workstation systems, where you don't even know whether there will be a camera, a lot of the "consumer" ML stuff falls flat on its face. That's why I expect a lot of the consumer ML stuff to stay on mobile computing platforms.

There are a few really noisy elephants in the room that everyone is forgetting about:

Intel is rumored (and the rumors have come from a reliable source) to be upping their core count as well as their clock speed, with base/boost numbers for Core i3s, i5s, and i7s far exceeding AMD chips. These chips are rumored for release around April 2019.

If you mean Comet Lake . . . AMD ain't skeered. Those "higher core count" CPUs will be 10c Comet Lake-S. Intel already has higher boost clocks than AMD. Comet Lake-S will generally be slower than Matisse.

TSMC's guidance appears to state that 7nm customers should transition to 6nm or 5nm rather than to N7+. The N6 node is ramping up more slowly than N5.

N6 is nothing but a "cheap" node for 7nm customers that don't want to switch to the 7nm+ design rules.

Based on the above, I'm calling it (all in good fun; I could very likely be wrong): Zen 3 will be a unified chiplet design with the CCD on 5nm and the IO die on 7nm EUV. Expected clock speeds will reach or exceed 5 GHz. I also do not believe that AMD will offer any additional features beyond a 10-15% IPC increase.

CHOOCHOO!

Nah. Zen3 (Milan) already started sampling earlier this year. N5 wasn't ready at that point . . . 7nm+ was. Milan is 7nm+, and in keeping with AMD's strategy for previous Zen versions, Vermeer will likely use dice in common with Milan (in this case, chiplets). So that means Vermeer has to be 7nm+ as well.
 
  • Like
Reactions: Olikan

Richie Rich

Senior member
Jul 28, 2019
470
229
76
128-bit SIMD is more prevalent in general-purpose code. If the average consumer is running SIMD at all, it will mostly be 128-bit.
We were discussing whether the Zen 2 FPU has 8 pipes or 4 pipes. Sorry, I didn't phrase that question clearly enough. So, the corrected question: does Zen 2 run 128-bit code significantly faster than 256-bit code?
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,355
1,547
136
We were discussing whether the Zen 2 FPU has 8 pipes or 4 pipes. Sorry, I didn't phrase that question clearly enough. So, the corrected question: does Zen 2 run 128-bit code significantly faster than 256-bit code?

It doesn't, and it shouldn't, because while there are 8 128-bit ports, only 4 of them are attached to the low-order bits of the registers. The other 4 only read from the second RF, which contains bits 128..255 of each AVX register.
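One way to sanity-check that from user space: a hypothetical throughput microbenchmark (file name and iteration count made up) running independent FMA dependency chains at both widths. If the answer above is right, Zen 2 should sustain roughly the same FMA ops per cycle at either width, with the 256-bit version simply doing twice the FLOPs per op.

```c
/* Rough throughput probe: independent 128-bit vs 256-bit FMA chains.
 * Hypothetical sketch. Build: gcc -O2 -mfma -mavx2 width.c */
#include <immintrin.h>
#include <stdio.h>
#include <time.h>

#define ITERS 50000000L

static double secs(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) * 1e-9;
}

static double run128(void) {
    __m128 a0, a1, a2, a3, x = _mm_set1_ps(1.0001f), y = _mm_set1_ps(0.9999f);
    a0 = a1 = a2 = a3 = _mm_setzero_ps();
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ITERS; i++) {   /* 4 independent chains hide latency */
        a0 = _mm_fmadd_ps(x, y, a0); a1 = _mm_fmadd_ps(x, y, a1);
        a2 = _mm_fmadd_ps(x, y, a2); a3 = _mm_fmadd_ps(x, y, a3);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    volatile float sink = _mm_cvtss_f32(_mm_add_ps(_mm_add_ps(a0, a1),
                                                   _mm_add_ps(a2, a3)));
    (void)sink;
    return 4.0 * ITERS / secs(t0, t1);   /* 128-bit FMA ops per second */
}

static double run256(void) {
    __m256 a0, a1, a2, a3, x = _mm256_set1_ps(1.0001f), y = _mm256_set1_ps(0.9999f);
    a0 = a1 = a2 = a3 = _mm256_setzero_ps();
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ITERS; i++) {
        a0 = _mm256_fmadd_ps(x, y, a0); a1 = _mm256_fmadd_ps(x, y, a1);
        a2 = _mm256_fmadd_ps(x, y, a2); a3 = _mm256_fmadd_ps(x, y, a3);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    volatile float sink = _mm256_cvtss_f32(_mm256_add_ps(_mm256_add_ps(a0, a1),
                                                         _mm256_add_ps(a2, a3)));
    (void)sink;
    return 4.0 * ITERS / secs(t0, t1);   /* 256-bit FMA ops per second */
}

int main(void) {
    printf("128-bit: %.2e FMA ops/s\n", run128());
    printf("256-bit: %.2e FMA ops/s (2x the FLOPs per op)\n", run256());
    return 0;
}
```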
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
It doesn't, and it shouldn't, because while there are 8 128-bit ports, only 4 of them are attached to the low-order bits of the registers. The other 4 only read from the second RF, which contains bits 128..255 of each AVX register.
This, but there is more.

The PRF design probably has the capability of doing 255:0 on the lower half and 511:256 on the upper half.
Zen can use two registers as 255:0; this capability probably wasn't lost in Zen2's PRF design.
The upper half can be made contiguous, adding 511:256 and thus giving a 512-bit VPRF with 4 registers (2 in the lower half and 2 in the upper half).
The 160-entry 256-bit PRF (1 low + 1 high), with that potential unlocked, can become an 80-entry 512-bit PRF (2 low + 2 high).

If it is an exact clone:
4x 128-bit FMUL => 1x 512-bit FMUL instruction
4x 128-bit FADD => 1x 512-bit FADD instruction
6x 128-bit VADD => 1x 512-bit + 1x 256-bit PADD instruction
2x 128-bit VMUL => 1x 256-bit PMUL
---
N6 is nothing but a "cheap" node for 7nm customers that don't want to switch to the 7nm+ design rules.
N7+ requires a new from-scratch design, as the AMS (SerDes/IO) is incompatible, the SRAM (+PRF/CAM/etc.) is incompatible, and the logic is incompatible.

N6 operates like GlobalFoundries' 12LP node, allowing for a re-tapeout on EUV w/o the hassle of starting from scratch.
N7 + N7 EUV => 150 million (Zen2) + AMD design costs (Zen2) + 200 million (Zen3) + AMD shrunk-design costs (Zen3).
Just a Nostaestimate => 150 + 80 + 200 + 100 => ~530 million overall cost.
That would be N7 7.5T to N7+ 7.5T, and it would get a density increase.

Then, N7 to N6 EUV => 150 million (Zen2) + AMD design costs (Zen2) + N6 EUV masks (Zen3) + AMD shrunk-logic design costs (Zen3).
Just another Nostaestimate => 150 + 80 + ~0.5 + 80 => ~311 million overall cost.
That would be N7 7.5T to N6 6T, getting the same density increase as above. ~220 million saved for Zen4 or something.

CC Wei => N7+ isn't as advantageous as N6. N7+ also has poor demand volume. Everyone sticking to N7 can go to EUV N6 via RTO w/o the effort of EUV N7+.
Snapdragon 865 => N7P
Apple A13 => N7P
Kirin 990 5G (Huawei) => ~70% of all N7+ wafers. Remember, they were estimated to be just 10% of N7.
 
Last edited:
  • Like
Reactions: amd6502