Speculation: Ryzen 4000 series/Zen 3

DrMrLordX · Sep 13, 2019

Ajay said:
How about not doubling the FMACs and just using op fusion for one AVX512 stream (just for compatibility)?
AVX512 is such a power sink.

Doubling the # of FMACs would improve AVX2 performance. Yes, it's a power sink, but the extra performance would be worth it for people that need it. I don't think AMD is going to just stop at allowing one AVX512 stream; why switch to 512-bit FMACs when they can go for groupings of 2x256-bit instead? Yes, AMD would be slower in AVX512 thanks to op fusion overhead, but they'd wind up faster in AVX2.
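
A rough way to compare the two layouts argued over here is to count peak FMA throughput per cycle. The sketch below is only a toy model: the baseline of two 256-bit FMA units, the name `peak_flops_per_cycle`, and the rule that a 512-bit op simply occupies two 256-bit units are all assumptions made for illustration. It ignores fusion overhead, clocks, power, and load/store bandwidth.

```python
def peak_flops_per_cycle(fma_units, unit_width_bits, vector_bits, elem_bits=32):
    """Peak FP operations per cycle for code that uses `vector_bits`-wide vectors.

    Assumption: a vector wider than one unit is split across several units
    (the op fusion case), while a narrower vector still occupies a whole unit.
    """
    lanes_per_op = vector_bits // elem_bits
    units_per_op = max(1, vector_bits // unit_width_bits)
    ops_per_cycle = fma_units // units_per_op
    return ops_per_cycle * lanes_per_op * 2  # one FMA counts as 2 FLOPs per lane


# Hypothetical layouts: (label, number of FMA units, width of each unit in bits)
LAYOUTS = [
    ("2 x 256-bit (assumed baseline)", 2, 256),
    ("4 x 256-bit (doubled FMACs)", 4, 256),
    ("2 x 512-bit (native AVX512)", 2, 512),
]

if __name__ == "__main__":
    for label, units, width in LAYOUTS:
        avx2 = peak_flops_per_cycle(units, width, vector_bits=256)
        avx512 = peak_flops_per_cycle(units, width, vector_bits=512)
        print(f"{label}: {avx2} FLOPs/cycle on 256-bit code, {avx512} on 512-bit code")
```

Under these assumptions, four 256-bit units match two 512-bit units on 512-bit code and double them on 256-bit code, which is the shape of the argument above. Real designs would deviate once fusion overhead and data-path limits are counted.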

itsmydamnation · Sep 13, 2019

You still need to load and store all this data. Simply doubling the FMACs without significantly larger data paths/PRF/retire isn't going to be worth it, and increasing all of those is going to make a very big core (let's not forget how big Sunny Cove is).

DrMrLordX · Sep 14, 2019

itsmydamnation said:
You still need to load and store all this data. Simply doubling the FMACs without significantly larger data paths/PRF/retire isn't going to be worth it, and increasing all of those is going to make a very big core (let's not forget how big Sunny Cove is).

That's correct. AVX512 is coming though, like it or not.

I guess they could just use op fusion to support AVX512 and call it a day, which would be fine from my point-of-view. I just don't think they'll go that route. They're getting a small density increase with 7nm+.

itsmydamnation · Sep 14, 2019

DrMrLordX said:
That's correct. AVX512 is coming though, like it or not.

I don't really think it is. If your code can scale to 512-bit vectors linearly, it can most likely scale to more cores as well. Having more high-clocking, high-IPC, narrower cores rather than fewer lower-clocking, lower-IPC, 512-bit-wide cores will work better for far more workloads/environments.

DrMrLordX · Sep 14, 2019

itsmydamnation said:
I don't really think it is. If your code can scale to 512-bit vectors linearly, it can most likely scale to more cores as well.

Sometimes yes, sometimes no. It didn't stop Intel from pushing it though. That is the #1 reason why AMD will adopt it, at least for a time. Hopefully AVX512 can be unseated by something like SVE2 in the future.

soresu · Sep 14, 2019

DrMrLordX said:
All of that is possible, and somewhat academic at this point unless some version of XOP emerges in the future. Personally I'd like to see AMD throw out AVX altogether in favor of SVE2 but . . . that's unlikely to happen. Instead we're probably going to see AVX512 support in Zen3, which is not thrilling. But Intel has moved the market in that direction, so I guess AMD needs to follow.

It's not impossible, and certainly preferable to have SVE2 style variable vectors, unless there are specific restrictions in the x64 ISA preventing it.

RISC-V also has a length-agnostic vector instruction set in development. It seems the natural evolution for vector computing, if perhaps not so easy to implement.

Though I might add that I believe SVE was based on a research paper detailing an instruction set called ARGON - this paper implied diminishing returns past a point, so unless they addressed that problem in subsequent research, there won't be a great amount of mileage beyond 512 bit length vectors.

soresu · Sep 14, 2019

DrMrLordX said:
Instead we're probably going to see AVX512 support in Zen3 which is not thrilling

Unlikely after a mere 20% area reduction unless their SIMD unit design is incredibly area efficient - they just doubled FP with Zen2 as it is, adding AVX-512 without that increase would be similar to having AVX2 before Zen2.

Perhaps we may see it with Zen3 if a significant core redesign opens up some space, but more likely with Zen 4 at 5nm I think.

Not to mention their current core per socket advantage does offset AVX 512 some, as shown by their stellar SVT-AV1 encoding results with EPYC 7742.

512-bit code isn't nearly as prevalent yet either. Even Intel had to add 256-bit code to SVT-VP9, which had the side effect of boosting EPYC too.

moinmoin · Sep 14, 2019

Richie Rich said:
What if AMD creates a shared front end for the whole CCX? This would bring some of the advantages of CMT while still using SMT for the back end.
1) This could save some transistors and increase throughput.
2) It allows HW control over threads within the CCX. It can eliminate the crazy Windows scheduler shuffle.

I myself was previously toying with the thought of moving some of the front end's functionality onto the IOD even. Problem is always data locality, you don't want to move critical data too far away from where it's actually needed to keep latency down.

Maybe the decoder could be situated before the core-specific front end, so that all instructions hitting L2/3$ are already in the optimized internal uop format. Branch prediction, specifically TAGE, which relies on long histories to work ideally, could also profit from being handled centrally. But aside from latency, to make efficient use of such a topology the task scheduler would need to be moved from the OS into hardware, and that's something which for AMD's RTG repeatedly turned out to be a hindrance instead of an advantage compared to Nvidia's driver-controlled scheduling. On the other hand, centralized hardware scheduling would allow for clean separation of INT and FP units as well as making HSA more feasible again. But as of now all of that is not feasible and won't happen.

soresu said:
Unlikely after a mere 20% area reduction unless their SIMD unit design is incredibly area efficient - they just doubled FP with Zen2 as it is, adding AVX-512 without that increase would be similar to having AVX2 before Zen2.

512-bit FP can still happen by way of combining the two 256-bit FMACs already there. The issue with AVX-512 is all the additional instructions that are then also usable with 128-bit and 256-bit FP and likely need quite some area as well.
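
To make the idea of combining two 256-bit FMACs concrete, here is a minimal sketch in plain Python. It only illustrates the lane split; `fma_256` and `fma_512_split` are hypothetical names, real hardware would do this with internal micro-ops rather than software, and Python floats do not model single-precision rounding.

```python
LANES_256 = 8    # 8 x 32-bit elements fit in a 256-bit vector
LANES_512 = 16   # 16 x 32-bit elements fit in a 512-bit vector


def fma_256(a, b, c):
    """Stand-in for one 256-bit FMA unit: a * b + c on each of 8 lanes.

    Note: Python rounds after the multiply, so this is not a true fused op.
    """
    assert len(a) == len(b) == len(c) == LANES_256
    return [x * y + z for x, y, z in zip(a, b, c)]


def fma_512_split(a, b, c):
    """Serve one 512-bit FMA by sending each half to a 256-bit unit."""
    assert len(a) == len(b) == len(c) == LANES_512
    half = LANES_256
    low = fma_256(a[:half], b[:half], c[:half])    # lanes 0..7
    high = fma_256(a[half:], b[half:], c[half:])   # lanes 8..15
    return low + high                              # concatenate the halves


if __name__ == "__main__":
    a = [float(i) for i in range(LANES_512)]
    b = [2.0] * LANES_512
    c = [1.0] * LANES_512
    print(fma_512_split(a, b, c))
```

The arithmetic itself splits cleanly because FMA lanes are independent. The instructions that move data across lanes are the harder part, which fits the point above that the extra AVX-512 instructions are where the area goes.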

Zen 4 on 5nm with up to 50% higher density (compared to 7nm+, up to 80% compared to 7nm) could allow a bigger increase in transistors again, allowing another doubling of the FP unit with accordingly widened data paths/loads/stores etc.

It was mentioned before that many changes to the Zen 2 core were initially planned for Zen 3, so I'm expecting the Zen 3 core to be a much more polished, coherent implementation of many parts that premiered in Zen 2 (aside from FP, the newly introduced TAGE branch predictor is a primary candidate for such).

moinmoin · Sep 14, 2019

rainy said:
She was definitely an important part (team leader), however chief architect was Mike Clark.

It looks like different people are leading the different Zen gen efforts as chief architect. Zen 1 was Mike Clark. Zen 2 was David Suggs. And according to his LinkedIn profile (via wccftech, sorry) Suggs apparently also handles Zen 5.

NTMBK · Sep 14, 2019

DrMrLordX said:
Sometimes yes, sometimes no. It didn't stop Intel from pushing it though. That is the #1 reason why AMD will adopt it, at least for a time. Hopefully AVX512 can be unseated by something like SVE2 in the future.

Intel pushed it as a way to boost their Larrabee derivatives. The theory was that if mainstream CPUs shared an ISA with the Phi, they could get a critical mass of compatible software.

Of course it never really panned out, and now Phi is dead. I don't really see the benefit to huge vectors which are so power hungry that they tank the performance of the rest of your CPU. I mean great, you boosted peak FLOPs, but now all the logic you need to get data to feed those vector units is running at a crippled clock speed.

AVX-512 as an ISA has some really nice features like mask registers and full masking of pretty much every instruction, and scatter to match the gather instructions from AVX2. It makes it a much better ISA to vectorize for. But I'd be happy with it on half width vector units.
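
For readers who have not used mask registers, the following is a small emulation in plain Python of what per-lane masking means. It is a sketch of the semantics only: `masked_add` is a hypothetical helper, not an intrinsic, and the 4-lane vectors are shortened for readability. Real AVX-512 instructions take the mask from a k register and operate on fixed-width lanes.

```python
def masked_add(a, b, dst, mask, zeroing=False):
    """Add a and b lane by lane under control of a bit mask.

    Lane i receives a[i] + b[i] when bit i of `mask` is set.
    Otherwise it keeps dst[i] (merge-masking) or becomes 0 (zero-masking).
    """
    out = []
    for i, (x, y, old) in enumerate(zip(a, b, dst)):
        if (mask >> i) & 1:
            out.append(x + y)
        else:
            out.append(0 if zeroing else old)
    return out


if __name__ == "__main__":
    a = [1, 2, 3, 4]
    b = [10, 20, 30, 40]
    dst = [-1, -1, -1, -1]
    print(masked_add(a, b, dst, mask=0b0101))                 # lanes 0 and 2 updated
    print(masked_add(a, b, dst, mask=0b0101, zeroing=True))   # other lanes zeroed
```

Masking like this lets a loop tail or an `if` inside a loop be vectorized without a separate scalar path, which is part of why the ISA is easier to vectorize for regardless of how wide the execution units are.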

amd6502 · Sep 15, 2019

Richie Rich said:
You are right, Nehalem had 3xALU.
I agree that 6xALU or 8xALU design are the next step as the lowest hanging fruit.

I think 5 or 6 ALU is likelier in Zen3. I think we could see 6 ALU in Zen4, and not ruling out we might see 8 ALU.

And again I'm pushing my hope that SMT will make symmetry optional (with an aSMT mode). Linux is already ready for asymmetric cores, although it seems the application for it so far is mostly for telephones.

https://www.reddit.com/r/linux/comments/d4rpkx

Linux_5.3 - Linux Kernel Newbies (kernelnewbies.org): list of changes and new features merged in the Linux kernel during the 5.3 development cycle

utilization clamping support as an extension of their work on the Energy Aware Scheduling framework in order to boost some workloads while capping background workloads.

(from https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.3-Scheduler-Clamping ).

Low-IPC, non-speculative logical cores would be the lowest-energy and most secure way to process threads.

Richie Rich · Sep 16, 2019

amd6502 said:
I think 5 or 6 ALU is likelier in Zen3. I think we could see 6 ALU in Zen4, and not ruling out we might see 8 ALU.

And again I'm pushing my hope that SMT will make symmetry optional (with an aSMT mode). Linux is already ready for asymmetric cores, although it seems the application for it so far is mostly for telephones.

https://www.reddit.com/r/linux/comments/d4rpkx

Linux_5.3 - Linux Kernel Newbies (kernelnewbies.org): list of changes and new features merged in the Linux kernel during the 5.3 development cycle

(from https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.3-Scheduler-Clamping ).

Low-IPC, non-speculative logical cores would be the lowest-energy and most secure way to process threads.

IMHO AMD will keep the ALU count fixed for a whole family, the same way Family 17h (Zen 1 and 2) kept a fixed 4xALU core design. They also kept 4 pipes for the FPU, just doubling the width of the FPUs. On the other hand, they added one store unit, but I consider this a minor back-end change, probably eliminating an AGU bottleneck with no impact on the front end.

IMHO Family 19h will keep a fixed 6xALU count too, at least for Zen3, Zen4 and Zen5.

There is a rumour about 15 chiplets for the server Zen3 Epyc Milan. I don't think it is possible to fit 14 CPU chiplets with 8 cores each in there. I think this is more indirect evidence that the Zen3 core will be a big beast of a core. If the Zen3 core is, as I estimate, a 6xALU core with 8 FPU pipes and 4xAGU + SMT4, it will consist of approximately 50% more transistors, so a quad-core CCX will be much larger than in Zen2 (chiplet die area 80 -> 120mm2).

AMD can create a 1xCCX chiplet, which would cut the die area down to 60mm2 and give better yields and binning.

- Zen2 ROME: 8 chiplets x 8 core CCD x 2 SMT => 64c/128t (perf 100% ST, 100% MT)
- Zen3 MILAN: 14 chiplets x 4 core CCX x 4 SMT => 56c/224t (perf 150% ST, 175% MT)

Die areas also look good:

- Zen2 Rome: 8 x 80mm2 = 640mm2
- Zen3 Milan: 14 x 60mm2 = 840mm2 (x 0.9 EUV factor = 756mm2)

Pretty similar.
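
The core, thread, and area totals in this post can be re-derived with a few lines. This is only a check of the arithmetic: every input (chiplet counts, 4-core chiplets, SMT4, 60mm2 dies, the 0.9 EUV factor) is the poster's speculation, and the `Package` class is a hypothetical helper written for this sketch.

```python
class Package:
    """A CPU package described by its chiplet layout (all inputs speculative)."""

    def __init__(self, name, chiplets, cores_per_chiplet, smt_ways, chiplet_area_mm2):
        self.name = name
        self.chiplets = chiplets
        self.cores_per_chiplet = cores_per_chiplet
        self.smt_ways = smt_ways
        self.chiplet_area_mm2 = chiplet_area_mm2

    @property
    def cores(self):
        return self.chiplets * self.cores_per_chiplet

    @property
    def threads(self):
        return self.cores * self.smt_ways

    @property
    def cpu_silicon_mm2(self):
        # Sum of CPU chiplet areas only; the IO die is not included.
        return self.chiplets * self.chiplet_area_mm2


rome = Package("Zen2 Rome", chiplets=8, cores_per_chiplet=8, smt_ways=2, chiplet_area_mm2=80.0)
milan_guess = Package("Zen3 Milan (speculative)", chiplets=14, cores_per_chiplet=4,
                      smt_ways=4, chiplet_area_mm2=60.0)
EUV_AREA_FACTOR = 0.9  # the post's assumed area scaling for the EUV node

if __name__ == "__main__":
    for pkg in (rome, milan_guess):
        print(f"{pkg.name}: {pkg.cores}c/{pkg.threads}t, {pkg.cpu_silicon_mm2:.0f} mm2 of CPU chiplets")
    scaled = milan_guess.cpu_silicon_mm2 * EUV_AREA_FACTOR
    print(f"Milan guess after EUV factor: {scaled:.0f} mm2 ({scaled / rome.cpu_silicon_mm2:.2f}x Rome)")
```

The totals match the ones listed (64c/128t and 640mm2 against 56c/224t and 756mm2), with the speculative Milan layout coming out at roughly 1.18 times Rome's CPU silicon.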

This would allow using the big 14nm IO die as an interposer with HBM on it. Oh god, if this turns out to be true, it is game over for Intel.

Do you remember Lisa Su's words? "Just for the record, zero truth to this rumor (leaving AMD). I love AMD and the best is yet to come!" IMHO she was talking about the new uarch, Family 19h Zen3.

Richie Rich · Sep 16, 2019

moinmoin said:
I myself was previously toying with the thought of moving some of the front end's functionality onto the IOD even. Problem is always data locality, you don't want to move critical data too far away from where it's actually needed to keep latency down.

Maybe the decoder could be situated before the core-specific front end, so that all instructions hitting L2/3$ are already in the optimized internal uop format. Branch prediction, specifically TAGE, which relies on long histories to work ideally, could also profit from being handled centrally. But aside from latency, to make efficient use of such a topology the task scheduler would need to be moved from the OS into hardware, and that's something which for AMD's RTG repeatedly turned out to be a hindrance instead of an advantage compared to Nvidia's driver-controlled scheduling. On the other hand, centralized hardware scheduling would allow for clean separation of INT and FP units as well as making HSA more feasible again. But as of now all of that is not feasible and won't happen.

Your ideas are brilliant but too futuristic. I'm talking about just sharing the front end, Bulldozer style. Zen4 could have a CCX front end capable of handling 16 threads, with 4 cores of 6xALU each internally, and a shared 4 x 8 = 32 pipes of 256-bit FPU. Bulldozer had a very weak back end; the FPU especially was very weak. However, it was not so bad thanks to sharing. If they shared a strong back end, it would become even stronger.

moinmoin said:
512-bit FP can still happen by way of combining the two 256-bit FMACs already there. The issue with AVX-512 is all the additional instructions that are then also usable with 128-bit and 256-bit FP and likely need quite some area as well.

Zen 4 on 5nm with up to 50% higher density (compared to 7nm+, up to 80% compared to 7nm) could allow a bigger increase in transistors again, allowing another doubling of the FP unit with accordingly widened data paths/loads/stores etc.

It was mentioned before that many changes to the Zen 2 core were initially planned for Zen 3, so I'm expecting the Zen 3 core to be a much more polished, coherent implementation of many parts that premiered in Zen 2 (aside from FP, the newly introduced TAGE branch predictor is a primary candidate for such).

Could I ask where you got that Zen4 will be 5nm? I assume Zen4 will be the equivalent of Zen1+ (plus a shared front end), so probably a shrink to 6nm EUV. I would expect a new process (5nm) for Zen5 (Zen2-like).

moinmoin · Sep 16, 2019

Richie Rich said:
Your ideas are brilliant but too futuristic. I'm talking about just sharing the front end, Bulldozer style. Zen4 could have a CCX front end capable of handling 16 threads, with 4 cores of 6xALU each internally, and a shared 4 x 8 = 32 pipes of 256-bit FPU. Bulldozer had a very weak back end; the FPU especially was very weak. However, it was not so bad thanks to sharing. If they shared a strong back end, it would become even stronger.

The problem I see with a combined front end for every CCX is that the approach may make disabling cores awkward. So far AMD has been pretty adamant about keeping behavior and latency the same regardless of the internal topology, even when a CCX contains fewer than 4 cores due to disabled cores. But it's certainly a more feasible approach than what I had in mind. Let's see if they do anything in that direction of combining resources across cores; the TAGE branch predictor would profit a lot from that.

Richie Rich said:
Could I ask where you got that Zen4 will be 5nm? I assume Zen4 will be the equivalent of Zen1+ (plus a shared front end), so probably a shrink to 6nm EUV. I would expect a new process (5nm) for Zen5 (Zen2-like).

TSMC offers different upgrade paths for its customers where much of the design rules are kept to keep the cost down. 6nm is the direct upgrade from 7nm (which Zen 2 uses). 5nm is the direct upgrade from 7nm+ (which Zen 3 is announced to use). A switch back to the 7nm -> 6nm upgrade path makes no sense when the most recent design (then Zen 3) already uses 7nm+ which prepares for the vastly superior 5nm node (which will be the next real deal, like the jump from GloFo's 12nm to TSMC's 7nm).

DrMrLordX · Sep 16, 2019

Wouldn't 6nm be inferior to 7nm+ for high-performance parts? I see no reason why Zen4 would move backwards to a less-performant node.

soresu · Sep 16, 2019

moinmoin said:
which will be the next real deal, like the jump from GloFo's 12nm to TSMC's 7nm

Unless my math is way off, the 7nm+ to 5nm move is only an 11-11.2% power reduction, and about a 33.5-34% area reduction.

So fairly good for area, but closer to 7nm to 7nm+ for power.

soresu · Sep 16, 2019

moinmoin said:
Let's see if they do anything in that direction of combining resources across cores

I haven't heard anything substantial about speculative multi-threading (SpecMT) for some time, it's a shame no-one ever managed it in practice.

Though transactional memory is supposed to be a significant part of that problem - maybe ARM's TME might be laying the groundwork for SpecMT in the future.

moinmoin · Sep 17, 2019

soresu said:
Unless my math is way off, the 7nm+ to 5nm move is only an 11-11.2% power reduction, and about a 33.5-34% area reduction.
View attachment 10869
So fairly good for area, but closer to 7nm to 7nm+ for power.

Where is that table from? I based my math (50% higher density 7nm+ -> 5nm) on WikiChip's reporting from end of July:
"Compared to their N7 process, N7+ is said to deliver around 1.2x density improvement."
"Compared to N7, N5 is said to deliver 1.8x routed logic density."

Power and performance are then dependent on how that density is used; high-performance areas usually use significantly lower density.
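
The density and area figures in this exchange can be put on the same scale with a little arithmetic. A minimal sketch, assuming the quoted 1.2x and 1.8x figures are both relative to N7 and can simply be divided; the helper names are made up for illustration.

```python
def area_reduction_from_density(density_gain):
    """Fraction of area saved when the same logic is packed `density_gain` times denser."""
    return 1.0 - 1.0 / density_gain


def density_from_area_reduction(area_reduction):
    """Density gain implied by shrinking the same logic by `area_reduction` (a fraction)."""
    return 1.0 / (1.0 - area_reduction)


N7_TO_N7P_DENSITY = 1.2   # quoted: N7+ vs N7
N7_TO_N5_DENSITY = 1.8    # quoted: N5 vs N7

# Assumption: both ratios share the N7 baseline, so N7+ -> N5 is their quotient.
n7p_to_n5_density = N7_TO_N5_DENSITY / N7_TO_N7P_DENSITY

if __name__ == "__main__":
    print(f"N7+ -> N5 density gain: {n7p_to_n5_density:.2f}x")
    print(f"Equivalent area reduction: {area_reduction_from_density(n7p_to_n5_density):.1%}")
    for reduction in (0.335, 0.34):
        gain = density_from_area_reduction(reduction)
        print(f"{reduction:.1%} area reduction corresponds to a {gain:.2f}x density gain")
```

Read this way, a 1.5x density gain and a roughly 33-34% area reduction describe about the same shrink, so the two sets of figures may disagree less on area than the raw numbers suggest. The calculation says nothing about the power figures.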

darkswordsman17 · Sep 17, 2019

moinmoin said:
Where is that table from? I based my math (50% higher density 7nm+ -> 5nm) on WikiChip's reporting from end of July:
"Compared to their N7 process, N7+ is said to deliver around 1.2x density improvement."
"Compared to N7, N5 is said to deliver 1.8x routed logic density."

Power and performance are then dependent on how that density is used; high-performance areas usually use significantly lower density.

It's from one of the Anandtech articles (I think the one where TSMC announced the 6nm process). I think those numbers were supposedly from TSMC itself too.

I think they've since had other articles with different figures too though. Frankly, I'm not sure if they even know yet.

It's possible there was a typo or something too, and the 5FF was supposed to be against 7FF+, not 7FF.

DrMrLordX said:
Wouldn't 6nm be inferior to 7nm+ for high-performance parts? I see no reason why Zen4 would move backwards to a less-performant node.

I'm not sure about that, but I believe 6nm is mostly about providing an easy and less expensive node for customers that care more about price than max-performance metrics. It provides some minor benefit over 7, and I think it's supposed to be easier to transition to (compared to 7) since it uses EUV, so if you were a company that hangs behind a node (and you're still at 16/14nm, or maybe 12 or 10nm), you might skip 7 and go right to 6.

NostaSeronx · Sep 17, 2019

6nm is for those who want to port from 7nm DUV, but don't want to completely redesign.

RTO => Retapeout re-uses GDS, but now has EUV. Same die, no changes other than EUV adopted. <== Extra yields
NTO => Completely new tapeout, but re-uses 7nm AMS/SRAM cells, but has its own 6nm Logic cells. <== Extra logic density and yields

The only people going to 6nm are those on 7nm DUV, as 7nm+ in general is better in every case for 2020+. There is always room for an N6+ for N7+ compatibility as well.

soresu · Sep 17, 2019

moinmoin said:
Where is that table from? I based my math (50% higher density 7nm+ -> 5nm) on WikiChip's reporting from end of July:
"Compared to their N7 process, N7+ is said to deliver around 1.2x density improvement."
"Compared to N7, N5 is said to deliver 1.8x routed logic density."

Power and performance are then dependent on how that density is used; high-performance areas usually use significantly lower density.

From an Anandtech article back in April (link). As darkswordsman17 says, the figures were supposed to be directly from TSMC. There is also a second 5nm process called N5P, with higher performance, announced in late July (link): "The latter will also be offered in a performance-enhanced version called N5P. This technology will also feature FEOL and MOL optimizations in order to make the chips run 7% faster at the same power, or reduce consumption by 15% at the same clocks."

Just scanned the WC article, not sure how Anand's and WikiChip's figures differ so much.

tomatosummit · Sep 17, 2019

NostaSeronx said:
6nm is for those who want to port from 7nm DUV, but don't want to completely redesign.

RTO => Retapeout re-uses GDS, but now has EUV. Same die, no changes other than EUV adopted. <== Extra yields
NTO => Completely new tapeout, but re-uses 7nm AMS/SRAM cells, but has its own 6nm Logic cells. <== Extra logic density and yields

The only people going to 6nm are those on 7nm DUV, as 7nm+ in general is better in every case for 2020+. There is always room for an N6+ for N7+ compatibility as well.

I see N6 as analogous to what 12FF was to 14FF: a node refresh that lets customers reuse masks. AMD could use it for a low-end Navi refresh in a year or so, and for any upcoming IO dies.
Zen3 designs will probably go ahead with N7+, as they'll be substantially new designs wanting cutting-edge performance.

Isn't a lot of this conjecture, though? AMD is reported to be using "7nmHP", which isn't on TSMC's public slides, so we don't know how it ports to N6, whether N7+ is or isn't a high-performance node, or if there's a variant.

moinmoin · Sep 17, 2019

tomatosummit said:
Isn't a lot of this conjecture, though? AMD is reported to be using "7nmHP", which isn't on TSMC's public slides, so we don't know how it ports to N6, whether N7+ is or isn't a high-performance node, or if there's a variant.

In slides AMD has been using "7nm+" for Zen 3 for ages now. N6 was only unveiled in April. And N7 used for Zen 2 technically wasn't a "HP" node either.

Saylick · Sep 17, 2019

soresu said:
Unless my math is way off, the 7nm+ to 5nm move is only an 11-11.2% power reduction, and about a 33.5-34% area reduction.
View attachment 10869
So fairly good for area, but closer to 7nm to 7nm+ for power.

According to WikiChip's slide of TSMC's nodes, I came to a similar conclusion as well. Note the performance gains at iso-power from N7 to N7+ and from N7 to N5, as shown below:

N7 ---> N7+ (+10% perf @ iso-power)
N7 ---> N5 (+15% perf @ iso-power)

Doesn't that imply only a 4.5% perf gain @ iso-power when we move from 7nm+ to 5nm? That's not a big jump at all, and considering that most of the time, the perf @ iso-power gains are given at the perf/W sweet-spot (i.e. not at Fmax), this really highlights Forrest Norrod's point that future node shrinks will not push the max clocks any higher with the more likely outcome being a clock regression.
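
The 4.5% figure comes from dividing the two gains rather than subtracting them. A minimal sketch, assuming both slide figures are relative to the same N7 baseline and compose multiplicatively (`relative_gain` is an illustrative name, not anything from the slides):

```python
def relative_gain(gain_a_vs_base, gain_b_vs_base):
    """Gain of node B over node A when both gains are quoted against one baseline.

    Gains are fractions, e.g. 0.10 for +10%.
    """
    return (1.0 + gain_b_vs_base) / (1.0 + gain_a_vs_base) - 1.0


N7P_VS_N7 = 0.10   # N7 -> N7+ : +10% perf at iso-power (from the post)
N5_VS_N7 = 0.15    # N7 -> N5  : +15% perf at iso-power (from the post)

if __name__ == "__main__":
    gain = relative_gain(N7P_VS_N7, N5_VS_N7)
    print(f"N7+ -> N5 at iso-power: {gain:.1%}")            # ratio of the two gains
    print(f"Naive subtraction:      {N5_VS_N7 - N7P_VS_N7:.1%}")
```

The ratio gives about 4.5%, a little under the 5% that plain subtraction would suggest, because the N7+ gain is already part of the base that N5 is compared against.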

For this reason, I reckon any meaningful performance gains moving forward in the next 5 years, assuming we don't switch to graphene or another semiconductor material with drastically better properties, will largely be the result of architectural improvements, likely a combination of larger cores and/or fixed function hardware.

LightningZ71 · Sep 18, 2019

As long as they can continue to scale density, they can continue to increase cache sizes, buffer sizes, and continue to optimize the cores by replacing the few remaining microcode paths with fixed function units to continue to improve worst case situations. While ideal case performance may not improve a whole lot, overall performance will still continue to scale. They can also just continue to throw more cores at problems.

In general though, ideal case performance hasn’t been improving by leaps and bounds for years. AMD fixed an architectural flaw in their cores with the change from the construction cores to the Zen core, but that was a fix to catch up to the market. I just don’t see core computing being a major performance hindrance in most things these days. Things tend to be IO bound more often, and keeping the cores fed with data is usually the bigger problem. There is still a lot of potential for improvements in IO to the cores.
