Speculation: Ryzen 4000 series/Zen 3

soresu · Nov 30, 2019

A/// said:
50x transistor rate spread over 20 years?

Even after 20 years, 50x for a single core would be ridiculous.

You might as well just increase the core count - if they haven't solved performant auto parallelisation, or speculative threading for code by then it's a wash anyways IMHO.

A/// · Nov 30, 2019

soresu said:
Even after 20 years, 50x for a single core would be ridiculous.

You might as well just increase the core count - if they haven't solved performant auto parallelisation, or speculative threading for code by then it's a wash anyways IMHO.

Yeah, I'll wait to see what Richie meant. There's always the possibility of other platforms in 20 years, but they need to close performance gaps showing up right now even with high end parts and then exceed performance of what's offered now. Nothing's ever permanent.

DrMrLordX · Dec 1, 2019

soresu said:
Though I do believe it could be dangerous to AMD to bring in AVX512, unless it can do the same issue per clock as AVX2 currently can - it could lead to programs with both AVX2 and AVX512 codepaths being executed on AVX512 and delivering inferior performance to AVX2?

Dunno how many people tried it, but AVX2 codepaths on various AMD chips going back to XV had interesting results. On XV, it was a disaster, as was AVX. XOP was the only thing worth running on that chip (ditto for AVX on any of the CON cores that supported it, which I think was all of them). On Summit Ridge, AVX and AVX2 performed almost the same, assuming you were using applications that had been coded with Intel CPUs in mind. Stuff like y-cruncher has its own codepaths for Zen which perform better than "generic" AVX/AVX2 applications that had been created for Intel CPUs.

In any case, if AMD chooses to enable a limited amount of the massive AVX512 ISA on Zen3 or Zen4 by way of op fusion, I would expect application performance of AVX2 and AVX512 to be within 5% of each other, assuming both codepaths had been created with Intel targets in the first place. As AMD's installed base grows, I fully expect more "serious" application coders to do as the y-cruncher author did and provide separate optimized code paths for AMD hardware.

A/// · Dec 1, 2019

And given that, expect a certain former Intel employee to go on another four month hate tirade despite it being a fairly niche instruction set and Ryzen eating up Intel in other benchmarks which rely on key instruction sets. If you can't figure out who it is, keyword: web browsing benchmarks.

Veradun · Dec 1, 2019

A/// said:
And given that, expect a certain former Intel employee to go on another four month hate tirade despite it being a fairly niche instruction set and Ryzen eating up Intel in other benchmarks which rely on key instruction sets. If you can't figure out who it is, keyword: web browsing benchmarks.

"Clear advantage"

Richie Rich · Dec 1, 2019

A/// said:
Yeah, I'll wait to see what Richie meant. There's always the possibility of other platforms in 20 years, but they need to close performance gaps showing up right now even with high end parts and then exceed performance of what's offered now. Nothing's ever permanent.

K6-2 (1999) had 9.3 mil transistors
Zen2 chiplet (2019) has 3,900 mil transistors ... 419x more /8 cores .... 52x more/core in 20 years
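
For anyone who wants to check that arithmetic, here is a minimal Python sketch. The transistor counts are simply the figures quoted in this post (not independently verified), and the variable names are made up for illustration. Note that it divides the whole chiplet, cache included, by eight cores.

```python
# Figures as quoted in the post, in millions of transistors (assumed, not verified).
K6_2_TRANSISTORS_M = 9.3
ZEN2_CHIPLET_TRANSISTORS_M = 3900.0
ZEN2_CORES_PER_CHIPLET = 8

# Whole-chiplet ratio, then the same ratio spread evenly over the cores.
total_ratio = ZEN2_CHIPLET_TRANSISTORS_M / K6_2_TRANSISTORS_M
per_core_ratio = total_ratio / ZEN2_CORES_PER_CHIPLET

print(f"chiplet vs K6-2: {total_ratio:.0f}x")
print(f"per core:        {per_core_ratio:.0f}x")
```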

I bet in 1999 there were a lot of people like you... saying 3-6 instructions per clock is impossible and here we are. So I would be very careful about predictions in sense of something is not possible. I'm gonna believe CPU architects (Jim Keller) more than anybody else here.

Theoretically, if the Zen 3 uarch did the same core fusion as BD -> Zen (2xALU no SMT -> 4xALU +SMT2), Zen 3 would be 8xALU+SMT4. All variants have 2xALUs/thread (great for actual generic code), but you can get very high IPC if you disable SMT or when priority thread control is possible (Asymmetric MT?). Maybe this core fusion is what Keller meant when he mentioned "linear performance scaling", as opposed to IceLake (+38% transistors produce only +18% IPC).
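
The ALUs-per-thread point can be laid out as a small table. This is only a sketch using the counts from the paragraph above. The 8xALU+SMT4 row is pure speculation, and the labels are invented for illustration.

```python
# (ALUs, hardware threads) per core, using the counts given in the post.
# The last entry is the speculated design, not a known product.
configs = {
    "Bulldozer int core": (2, 1),
    "Zen": (4, 2),
    "hypothetical wide core": (8, 4),
}

# ALUs available per thread when every SMT thread is active.
alus_per_thread = {
    name: alus / threads for name, (alus, threads) in configs.items()
}

for name, ratio in alus_per_thread.items():
    alus, threads = configs[name]
    print(f"{name}: {alus} ALUs / {threads} threads = {ratio:.0f} per thread, "
          f"{alus} for a single thread with SMT off")
```

Every row works out to 2 ALUs per thread with all threads active. The difference only shows up when a single thread gets the whole core.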

A/// · Dec 1, 2019

Richie Rich said:
K6-2 (1999) had 9.3 mil transistors
Zen2 chiplet (2019) has 3,900 mil transistors ... 419x more /8 cores .... 52x more/core in 20 years

I bet in 1999 there were a lot of people like you... saying 3-6 instructions per clock is impossible and here we are. So I would be very careful about predictions in sense of something is not possible. I'm gonna believe CPU architects (Jim Keller) more than anybody else here.

Theoretically, if the Zen 3 uarch did the same core fusion as BD -> Zen (2xALU no SMT -> 4xALU +SMT2), Zen 3 would be 8xALU+SMT4. All variants have 2xALUs/thread (great for actual generic code), but you can get very high IPC if you disable SMT or when priority thread control is possible (Asymmetric MT?). Maybe this core fusion is what Keller meant when he mentioned "linear performance scaling", as opposed to IceLake (+38% transistors produce only +18% IPC).

You would be wrong about your bet. But relying on past progression as a measure for future progression isn't always ideal. But I take your point. I think the semiconductor fabs who research this stuff are going to have to seek new materials as node shrinks are expected to get tough after 5 nm, and this is something that's been in research for quite a long time. We're in a processor revolution right now, and I haven't felt this way in a long time. That goes for all of us. I don't see SMT4 anytime soon, but it may be something to consider in the future. I could see it being something on Zen5 or 6. While I compared such threading to Power, the problem is you can't directly compare these together. Though it wouldn't surprise me if AMD and Intel have test labs where SMT4/4T HT chips exist.

amd6502 · Dec 1, 2019

Richie Rich said:
+38% transistors produce only +18%IPC

That is actually quite good (probably refers to MT (or combined) not pure ST performance though). If 18% is ST number then it sounds like it beats Pollack's rule.

I think increasing ST becomes more and more challenging as we see Moore's law die and as IPC has reached an already large number.

Thunder 57 · Dec 1, 2019

Richie Rich said:
K6-2 (1999) had 9.3 mil transistors
Zen2 chiplet (2019) has 3,900 mil transistors ... 419x more /8 cores .... 52x more/core in 20 years

I bet in 1999 there were a lot of people like you... saying 3-6 instructions per clock is impossible and here we are. So I would be very careful about predictions in sense of something is not possible. I'm gonna believe CPU architects (Jim Keller) more than anybody else here.

Theoretically, if the Zen 3 uarch did the same core fusion as BD -> Zen (2xALU no SMT -> 4xALU +SMT2), Zen 3 would be 8xALU+SMT4. All variants have 2xALUs/thread (great for actual generic code), but you can get very high IPC if you disable SMT or when priority thread control is possible (Asymmetric MT?). Maybe this core fusion is what Keller meant when he mentioned "linear performance scaling", as opposed to IceLake (+38% transistors produce only +18% IPC).

You know, I've heard so much about SMT4 that I found through a quick search you mention it in over 1/3 of your posts. We get it, we just all (unless I am missing someone) disagree. Not in Zen 3 at least.

Also, Jim Keller is not some deity and there are other brilliant minds out there. I doubt any of them are bold enough to make claims of 50x more transistors per core in 20 years either. Manufacturing is quickly becoming a problem here. I'm not saying it can't happen, but it certainly looks more difficult now than it did 20/30/40 years ago.

soresu · Dec 1, 2019

Richie Rich said:
I bet in 1999 there were a lot of people like you... saying 3-6 instructions per clock is impossible

I suggest you learn a little history first, before you start to embarrass yourself.

According to this article, the Alpha EV6 was a 6 wide OoO superscalar uArch introduced in 1996 (dated from Wikipedia).

It's not about whether it is simply possible - it's about whether it can be done while maintaining legacy x86 cruft, and the high multi-GHz clock speed necessary for both market-competitive performance and software compatibility.

I might also add that the planned, but unproduced EV8 was targeted for 8 wide, cancelled in favor of the Itanic disaster.

itsmydamnation · Dec 1, 2019

Richie Rich said:
I bet in 1999 there were a lot of people like you... saying 3-6 instructions per clock is impossible and here we are. So I would be very careful about predictions in sense of something is not possible. I'm gonna believe CPU architects (Jim Keller) more than anybody else here.

You do realize that K7, K8 and K10 are all actually 6 wide (3 ALU, 3 AGU)? Now look at how much faster Zen is while also being 6 wide (4 ALU + 2 AGU), and Zen2 is faster again while being 7 wide. But what you're doing, like you always seem to do (probably because of Dunning-Kruger), is ignoring how things actually work. Those 6 pipes in K7 are organised as 3 clusters of 2; in Zen it's 1 cluster of 6. It is so much harder and so much more complex to have 6x1 vs 2x3, but by your dumb counting logic they should perform the same, right.....

I also see you completely ignored trying to justify how Zen is ALU limited when they chose to add an AGU in Zen2, which is harder than adding an ALU because not only does it need to connect to the PRF, it also has to connect to the memory subsystem.

Richie Rich said:
Theoretically, if the Zen 3 uarch did the same core fusion as BD -> Zen (2xALU no SMT -> 4xALU +SMT2), Zen 3 would be 8xALU+SMT4. All variants have 2xALUs/thread (great for actual generic code), but you can get very high IPC if you disable SMT or when priority thread control is possible (Asymmetric MT?). Maybe this core fusion is what Keller meant when he mentioned "linear performance scaling", as opposed to IceLake (+38% transistors produce only +18% IPC).

How do you plan to feed these 8 ALUs? How many load + store ports, how much cache bandwidth, what size, how is it tagged/indexed, how big and what type of TLBs, page walkers, etc. etc.? Get this through your thick skull: execution of data is easy, movement of data is hard. It's why flock-of-chickens CPUs always lose and why the massive investment in transistors per core can be justified for big OoOE cores.

NostaSeronx · Dec 1, 2019

8 ALUs and 8 FPUs can be fed with 2x64B(R)/1x64B(W). Data movement is easy; it is in fact the easiest thing to target. Hence, the semiconductor world wraps around MLP over ILP every day.

The purpose of SMT4 is to increase MLP, thus increase ILP. Get it through your thick skull, scrub.

Two big things to watch out for:
*Store Queue Fusion: ALU0/ALU1[M0]+(ALUx/ALUy){Mx} can (each) store 16B simultaneous into the store queue.
*Load Queue Fusion: ALU0/ALU1[M0]+(ALUx/ALUy){MX} can (each) load 16B simultaneous from the load queue.

IntelUser2000 · Dec 1, 2019

amd6502 said:
That is actually quite good (probably refers to MT (or combined) not pure ST performance though). If 18% is ST number then it sounds like it beats Pollack's rule.

Actually 18% improvement with 38% more transistors nearly exactly follows Pollack's rule.

This is why we went from having passively cooled desktop chips in the 486 era to water-cooled desktops in 2019. Lot of the increase came not just from new process, but by increasing TDP.

Optimism drives people to continue, even though in reality it may fall short of such lofty goals. Take any such claims with a heaping of salt, because otherwise disappointment will set in.
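
The Pollack's rule comparison above can be checked in a few lines. Pollack's rule is the rule of thumb that single-thread performance grows roughly with the square root of the added complexity (transistors or area). This is a minimal sketch under that assumption; `pollack_speedup` is a name chosen here for illustration, and the +38%/+18% inputs are the figures quoted in the thread.

```python
import math


def pollack_speedup(transistor_ratio: float) -> float:
    """Expected performance ratio under Pollack's rule: perf ~ sqrt(complexity)."""
    return math.sqrt(transistor_ratio)


transistor_ratio = 1.38    # +38% transistors, as quoted in the thread
observed_ipc_ratio = 1.18  # +18% IPC, as quoted in the thread

expected = pollack_speedup(transistor_ratio)
print(f"expected by Pollack's rule: +{(expected - 1) * 100:.1f}%")
print(f"quoted IPC gain:            +{(observed_ipc_ratio - 1) * 100:.1f}%")
```

The square root of 1.38 is about 1.175, so the rule predicts roughly +17.5% against the quoted +18%.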

Thunder 57 · Dec 1, 2019

NostaSeronx said:
8 ALUs and 8 FPUs can be fed with 2x64B(R)/1x64B(W). Data movement is easy; it is in fact the easiest thing to target. Hence, the semiconductor world wraps around MLP over ILP every day.

The purpose of SMT4 is to increase MLP, thus increase ILP. Get it through your thick skull, scrub.

Two big things to watch out for:
*Store Queue Fusion: ALU0/ALU1[M0]+(ALUx/ALUy){Mx} can (each) store 16B simultaneous into the store queue.
*Load Queue Fusion: ALU0/ALU1[M0]+(ALUx/ALUy){MX} can (each) load 16B simultaneous from the load queue.

There is no SMT4 in Zen 3. Do we have to continue this crap until it is released?

NostaSeronx · Dec 1, 2019

Thunder 57 said:
There is no SMT4 in Zen 3. Do we have to continue this crap until it is released?

Zen3, imho, isn't going to be much different from Zen2. It won't even be on 7nm+, but rather on N7P. AMD got banned from N7+. The only node beyond 7nm that AMD is allowed to go to is 5nm.

Milan is Family 17h.
Vermeer and Genoa/Genesis is Family 19h.

Thunder 57 · Dec 2, 2019

NostaSeronx said:
Zen3, imho, isn't going to be much different from Zen2. It won't even be on 7nm+, but rather on N7P. AMD got banned from N7+. The only node beyond 7nm that AMD is allowed to go to is 5nm.

Milan is Family 17h.
Vermeer and Genoa/Genesis is Family 19h.

What??? AMD has been saying 7nm+ for a long time now. Banned from it? For selling just too many damn CPUs? I'm sure they have contracts in place. And for once, just once, I would like to see you back something up with facts. Otherwise, some people will continue to think that, well, I have to stop myself here. Just let me put it this way: I do not hold you or what you have to say in high regard.

soresu · Dec 2, 2019

Thunder 57 said:
What??? AMD has been saying 7nm+ for a long time now. Banned from it? For selling just too many damn CPUs? I'm sure they have contracts in place. And for once, just once, I would like to see you back something up with facts. Otherwise, some people will continue to think that, well, I have to stop myself here. Just let me put it this way: I do not hold you or what you have to say in high regard.

The repeated FDX/FD SOI stuff kinda leans that way already - makes any talk against Zen3 7nm+ a bit far fetched, considering AMD have been completely unambiguous about it publicly.

Zen4 on the other hand is still up in the air it seems given no process on the most recent roadmap - methinks either 5nm is not quite floating AMD's boat, or perhaps Samsung are persuading them to wait for their 3nm MBCFET process, though this does seem a stretch given it's still a ways out yet.

NostaSeronx · Dec 2, 2019

Thunder 57 said:
What???

L3 SRAM libs for 5nm are done at AMD. There aren't really enough leaks to get the full picture of how far along AMD is on 5nm right now.

However:
"that it will ramp up much in terms of revenue, be much faster than 7 nm." => "N5 chips in high volume starting Q2 2020."
AMD is slated to be with Apple/hisilicon group. ~1cm2 Elite Group.

TSMC is more optimistic that it can support AMD on 5nm better than 7nm.
Lower lead times, higher volumes, etc are the promises that I have received. There is also the custom EDA part that is getting way more support on 5nm than on 7nm.

It also follows the trend...
14LPP/12LP (576nm Height) -> N7 (300nm Height) rather than slow down with N7+, go straight to 5nm -> N5 (168nm Height) ¯\_(ツ)_/¯
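
To put numbers on that trend, here is a small sketch of the per-step scaling. The cell heights are the ones quoted above (not independently verified), and the dictionary and variable names are made up for illustration.

```python
# Standard-cell heights in nm, as quoted in the post (assumed, not verified).
cell_height_nm = {"14LPP/12LP": 576, "N7": 300, "N5": 168}

# Ratio of each node's cell height to the previous node's.
nodes = list(cell_height_nm)
step_ratios = {}
for prev, nxt in zip(nodes, nodes[1:]):
    step_ratios[f"{prev} -> {nxt}"] = cell_height_nm[nxt] / cell_height_nm[prev]

for step, ratio in step_ratios.items():
    print(f"{step}: cell height x{ratio:.2f}")
```

Both steps come out near a halving of cell height, which is the trend being pointed at.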

DrMrLordX · Dec 2, 2019

soresu said:
methinks either 5nm is not quite floating AMD's boat, or perhaps Samsung are persuading them to wait for their 3nm MBCFET process

Request for clarification: is TSMC 5nm going to be more akin to TSMC 7nm+ or 6nm?

NostaSeronx · Dec 2, 2019

DrMrLordX said:
Request for clarification: is TSMC 5nm going to be more akin to TSMC 7nm+ or 6nm?

TSMC 5nm is 14 layers of EUV and isn't cell-compatible with 7nm. ~59+ mask layers
TSMC 6nm is 5 layers of EUV and is cell-compatible with 7nm. ~65+ mask layers ((TSMC 6nm can re-tapeout 7nm(N7/N7P(also, N7plus)) on EUV, while TSMC 7nm+ can't.))
TSMC 7nm+ is 4 layers of EUV and isn't cell-compatible with 7nm. ~65+ mask layers
TSMC 7nm, no EUV. ~75+ mask layers. (7LP/13ML on GloFo, it is 88 mask layers)

TSMC 7nm/6nm is ~40nm metal pitch and TSMC 5nm is ~28nm metal pitch. 5nm also has a lower TCO than 7nm even without a shrink, which makes its shrink very cost-effective.

77 mm2 on 7nm => expensive
77 mm2 on 5nm(no shrink, no advancement, just a new process) => 15% lower cost

Relative to the other nodes, N5 is planned to be the longest-lived. N5/2020-HVM -> N5P/2021-HVM -> N5P+(Ge+)/2022-HVM -> N5P+ Low Vdd(Ge+&LV)/2023-HVM.

DisEnchantment · Dec 2, 2019

NostaSeronx said:
Data movement is easy, it is in fact the easiest thing to target

I think everyone keeps saying it is the opposite. Moving data causes delay and uses energy. From Lisa's presentation at the DARPA event, she says more time is spent moving data than computing.
Hence so many concepts of PIM floating around.

moinmoin · Dec 2, 2019

DisEnchantment said:
I think everyone keeps saying it is the opposite. Moving data causes delay and uses energy. From Lisa's presentation at the DARPA event, she says more time is spent moving data than computing.
Hence so many concepts of PIM floating around.

Furthermore, data movement is the single biggest bottleneck preventing compute from running at 100% at all times. Which, as an aside, is also why having some form of predictor at every stage is so effective: on a hit it both reduces latency and prevents congestion.

Richie Rich · Dec 2, 2019

itsmydamnation said:
You do realize that K7, K8 and K10 are all actually 6 wide (3 ALU, 3 AGU)? Now look at how much faster Zen is while also being 6 wide (4 ALU + 2 AGU), and Zen2 is faster again while being 7 wide. But what you're doing, like you always seem to do (probably because of Dunning-Kruger), is ignoring how things actually work. Those 6 pipes in K7 are organised as 3 clusters of 2; in Zen it's 1 cluster of 6. It is so much harder and so much more complex to have 6x1 vs 2x3, but by your dumb counting logic they should perform the same, right.....

Funny how you think others are dumb and you are smart.
This is exactly what I am saying. The on-paper stronger K8, with ALU+AGU tied together (3xALU+3xAGU), was much slower than the theoretically weaker Core2Duo with decoupled 3xALU+2xAGU. C2D had the speculative load feature and some other new stuff which was possible in that big cluster, and K8 was missing all that. That's why fusing two cores into one big Zen 3 core with 8xALU + 4xAGU + SMT4 could provide enough room to implement some new advanced logic to extract more ILP/IPC, which is not possible in a narrow 4xALU+SMT2 core. Especially for the next iterations in Zen4 and Zen5. It looks like you argued in favor of my dumb wide Zen3+SMT4 core, thanks

itsmydamnation said:
I also see you completely ignored trying to justify how Zen is ALU limited when they chose to add an AGU in Zen2, which is harder than adding an ALU because not only does it need to connect to the PRF, it also has to connect to the memory subsystem.

You are wrong about that. Zen 2 did not get a new full AGU but only a store unit, which is much, much simpler than a load unit with all its speculative loading and load predictors. Lowest-hanging fruit; it was the easiest way. Maybe you noticed that Intel has been using a dedicated store unit for a while too.

itsmydamnation said:
How do you plan to feed these 8 ALUs? How many load + store ports, how much cache bandwidth, what size, how is it tagged/indexed, how big and what type of TLBs, page walkers, etc. etc.? Get this through your thick skull: execution of data is easy, movement of data is hard. It's why flock-of-chickens CPUs always lose and why the massive investment in transistors per core can be justified for big OoOE cores.

How is Apple feeding those 6xALUs in the Vortex core? They can do that with just 2xAGUs. How do they gain +58% INT IPC over Skylake? Maybe Apple hired some black magic voodoo shaman, or maybe they know what they are doing. And unfortunately Apple's engineers forgot to ask you, so they never learned it's not possible

itsmydamnation · Dec 3, 2019

Richie Rich said:
Funny how you think others are dumb and you are smart.
This is exactly what I am saying. The on-paper stronger K8, with ALU+AGU tied together (3xALU+3xAGU), was much slower than the theoretically weaker Core2Duo with decoupled 3xALU+2xAGU. C2D had the speculative load feature and some other new stuff which was possible in that big cluster, and K8 was missing all that. That's why fusing two cores into one big Zen 3 core with 8xALU + 4xAGU + SMT4 could provide enough room to implement some new advanced logic to extract more ILP/IPC, which is not possible in a narrow 4xALU+SMT2 core. Especially for the next iterations in Zen4 and Zen5. It looks like you argued in favor of my dumb wide Zen3+SMT4 core, thanks

I don't think I'm smart, but you have done nothing other than shout "teh APPLE!@!#@!#@!#". So now that you have said 4x AGU: how much load and store bandwidth/ports to cache, and what's the cache configuration? In multi-ported caches you are wire limited. Let's not even talk about getting enough decode/dispatch for the mythical 4 threads.

Richie Rich said:
You are wrong about that. Zen 2 did not get a new full AGU but only a store unit, which is much, much simpler than a load unit with all its speculative loading and load predictors. Lowest-hanging fruit; it was the easiest way. Maybe you noticed that Intel has been using a dedicated store unit for a while too.

No, I'm right and you're wrong (see, I provided as much evidence as you do).
Maybe you should go read the patent on how it actually works (yes, it's published). It is one unified queue from which it picks 3 addresses to generate and load/store. It wasn't simple and can't be done in a single cycle. There is no point adding the 3rd AGU to the load side of the equation because there are only 2 load ports to cache. But the AGUs have nothing to do with prefetch/predict, so I don't know why you're trying to conflate that. Store, though, has to deal with store-to-load forwarding/memory disambiguation, and it still needs to connect to the PRF.

Richie Rich said:
How is Apple feeding those 6xALUs in the Vortex core? They can do that with just 2xAGUs. How do they gain +58% INT IPC over Skylake? Maybe Apple hired some black magic voodoo shaman, or maybe they know what they are doing. And unfortunately Apple's engineers forgot to ask you, so they never learned it's not possible

So first, they don't tell us how any of their cores work at all; for all you know it could be two clusters of 3 ALU + branch + AGU with a split PRF (just like z15). You have no idea how their prefetch/predict/L2/stream, page walkers etc. work, and you have no idea what kind of memory disambiguation they are doing (ARM has a weaker memory model). The only thing you know is they have 6 ALUs, so that MUST be it. Just ignore that Hurricane has 4 ALUs and would still beat Skylake in your metric quite handily, and that Apple has massively improved their cache and memory subsystems from A10 to A12, as can be seen in the AnandTech reviews, along with dispatch and all the prefetch/predict/etc. improvements you would expect. Also, ARM has load/store pair instructions and Apple has complete control of their ecosystem/compilers, so those 2 load/store units can be loading/storing 4 "bits" of data a cycle.

See, I've never said we won't see more ALUs on Zen. Unlike you, I don't see more ALUs being the "killer feature"; the killer feature is all the other microarchitectural improvements that allow you to get enough ILP to be worth having more ALUs.

Richie Rich · Dec 3, 2019

itsmydamnation said:
No, I'm right and you're wrong (see, I provided as much evidence as you do).
Maybe you should go read the patent on how it actually works (yes, it's published). It is one unified queue from which it picks 3 addresses to generate and load/store. It wasn't simple and can't be done in a single cycle. There is no point adding the 3rd AGU to the load side of the equation because there are only 2 load ports to cache. But the AGUs have nothing to do with prefetch/predict, so I don't know why you're trying to conflate that. Store, though, has to deal with store-to-load forwarding/memory disambiguation, and it still needs to connect to the PRF.

No, you are wrong. You are lying or you have poor knowledge. There has been a predictor in the load unit since the Intel Core uarch. It's used to let a load instruction speculatively pass ahead of a store instruction (whose address is not calculated yet), theoretically delivering a 30-40% performance boost. You should educate yourself before spreading your misinformation: https://www.anandtech.com/show/1998/5

itsmydamnation said:
See, I've never said we won't see more ALUs on Zen. Unlike you, I don't see more ALUs being the "killer feature"; the killer feature is all the other microarchitectural improvements that allow you to get enough ILP to be worth having more ALUs.

I never said there are no other uarch improvements. Do not put your lies into my mouth, please. Actually, I have always put strong emphasis on Apple's advanced uarch allowing it to utilize those 6xALUs in an incredible way (+50% more ALUs provide +58% IPC over Skylake, as mentioned here multiple times).

And again, you didn't get my point at all. I'm not talking about more ALUs only. Regarding Zen 3 (and what Keller meant by "linear scaling IPC" in that video), I'm consistently talking about a high number of ALUs together with SMT4. The symbiotic combination of these two features could become the killer feature. Shared resources bring more efficiency and performance. In the same way, AMD leapfrogged in performance by merging the two narrow cores (2+2 ALU) of BD into one wider 4xALU+SMT2 core in Zen. It was an effective move once, so it could be effective again with an even wider core + SMT4.

Another option for Zen 3 is a shared front-end with a shared FPU, Bulldozer style: a front-end capable of handling 4 threads, a back-end consisting of 2x Zen3+SMT2 int cores, and a powerful shared FPU (12 pipes shared by 4 threads). AMD has experience with BD, it's simpler to do (than widening the entire core + SMT4), and it allows a great FPU boost (leaks indicate +40-50% FPU performance). However, it's kind of a sub-optimal solution IMHO.
