Speculation: Ryzen 4000 series/Zen 3

Page 56

soresu

Platinum Member
Dec 19, 2014
2,657
1,858
136
50x larger CPU? 50x as large as now or 50 variations? When and where was this said? This is the first time I'm hearing of it.
I can only assume he means 50x transistors total, rather than per core transistor count.

Because a core with 50x the transistor budget would be insane even at <1nm.
 
  • Like
Reactions: A///

A///

Diamond Member
Feb 24, 2017
4,352
3,154
136
I can only assume he means 50x transistors total, rather than per core transistor count.

Because a core with 50x the transistor budget would be insane even at <1nm.
50x transistor rate spread over 20 years?
 

soresu

Platinum Member
Dec 19, 2014
2,657
1,858
136
50x transistor rate spread over 20 years?
Even after 20 years, 50x for a single core would be ridiculous.

You might as well just increase the core count - if they haven't solved performant auto parallelisation, or speculative threading for code by then it's a wash anyways IMHO.
 
  • Like
Reactions: A///

A///

Diamond Member
Feb 24, 2017
4,352
3,154
136
Even after 20 years, 50x for a single core would be ridiculous.

You might as well just increase the core count - if they haven't solved performant auto parallelisation, or speculative threading for code by then it's a wash anyways IMHO.
Yeah, I'll wait to see what Richie meant. There's always the possibility of other platforms in 20 years, but they need to close performance gaps showing up right now even with high end parts and then exceed performance of what's offered now. Nothing's ever permanent.
 

DrMrLordX

Lifer
Apr 27, 2000
21,620
10,829
136
Though I do believe it could be dangerous to AMD to bring in AVX512, unless it can do the same issue per clock as AVX2 currently can - it could lead to programs with both AVX2 and AVX512 codepaths being executed on AVX512 and delivering inferior performance to AVX2?

Dunno how many people tried it, but AVX2 codepaths on various AMD chips going back to XV had interesting results. On XV, it was a disaster, as was AVX. XOP was the only thing worth running on that chip (ditto for AVX on any of the CON cores that supported it, which I think was all of them). On Summit Ridge, AVX and AVX2 performed almost the same, assuming you were using applications that had been coded with Intel CPUs in mind. Stuff like y-cruncher has its own codepaths for Zen which perform better than "generic" AVX/AVX2 applications that had been created for Intel CPUs.

In any case, if AMD chooses to enable a limited amount of the massive AVX512 ISA on Zen3 or Zen4 by way of op fusion, I would expect application performance of AVX2 and AVX512 to be within 5% of each other, assuming both codepaths had been created with Intel targets in the first place. As AMD's installed base grows, I fully expect more "serious" application coders to do as the y-cruncher author did and provide separate optimized code paths for AMD hardware.
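To make the codepath question concrete, here is a minimal Python sketch of how an application might pick between code paths at startup. Everything in it is hypothetical: the function name `choose_codepath`, the path labels, and the `prefer_avx2_on_amd` switch are illustrative assumptions, not how y-cruncher or any real application does it. The flag names follow the lowercase spelling Linux uses in /proc/cpuinfo (`avx`, `avx2`, `avx512f`).

```python
def choose_codepath(flags, vendor, prefer_avx2_on_amd=True):
    """Pick a SIMD code path from reported CPU features (illustrative only).

    flags  -- set of lowercase feature flags, e.g. {"avx", "avx2"}
    vendor -- CPUID vendor string, e.g. "AuthenticAMD" or "GenuineIntel"
    prefer_avx2_on_amd -- hypothetical policy knob: fall back to AVX2 on AMD
                          even when AVX512 is reported, in case the AVX512
                          path is no faster there
    """
    has_avx512 = "avx512f" in flags
    has_avx2 = "avx2" in flags
    has_avx = "avx" in flags

    if has_avx512:
        # The worry raised in the thread: a naive "widest wins" rule could
        # select an AVX512 path that underperforms AVX2 on some hardware.
        if vendor == "AuthenticAMD" and prefer_avx2_on_amd and has_avx2:
            return "avx2"
        return "avx512"
    if has_avx2:
        return "avx2"
    if has_avx:
        return "avx"
    return "scalar"


if __name__ == "__main__":
    print(choose_codepath({"avx", "avx2", "avx512f"}, "GenuineIntel"))
    print(choose_codepath({"avx", "avx2", "avx512f"}, "AuthenticAMD"))
```

A real application would decide based on measured performance per microarchitecture rather than on the vendor string alone; the vendor check here only stands in for that idea.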
 

A///

Diamond Member
Feb 24, 2017
4,352
3,154
136
And given that, expect a certain former Intel employee to go on another four month hate tirade despite it being a fairly niche instruction set and Ryzen eating up Intel in other benchmarks which rely on key instruction sets. If you can't figure out who it is, keyword: web browsing benchmarks.
 

Veradun

Senior member
Jul 29, 2016
564
780
136
And given that, expect a certain former Intel employee to go on another four month hate tirade despite it being a fairly niche instruction set and Ryzen eating up Intel in other benchmarks which rely on key instruction sets. If you can't figure out who it is, keyword: web browsing benchmarks.
"Clear advantage"
 
  • Haha
Reactions: A///

Richie Rich

Senior member
Jul 28, 2019
470
229
76
Yeah, I'll wait to see what Richie meant. There's always the possibility of other platforms in 20 years, but they need to close performance gaps showing up right now even with high end parts and then exceed performance of what's offered now. Nothing's ever permanent.
K6-2 (1999) had 9.3 mil transistors
Zen2 chiplet (2019) has 3,900 mil transistors ... 419x more /8 cores .... 52x more/core in 20 years
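
For anyone who wants to check the arithmetic, a small Python sketch using only the figures quoted above. The function names are made up for illustration, and the per-core number simply divides the whole chiplet (L3 cache included) by 8 cores, so it overstates what the core logic alone got.

```python
K6_2_TRANSISTORS = 9.3e6          # K6-2, figure quoted in the post
ZEN2_CHIPLET_TRANSISTORS = 3.9e9  # Zen 2 chiplet, figure quoted in the post
ZEN2_CORES_PER_CHIPLET = 8


def growth_factors(old, new, cores):
    """Return (total growth factor, growth factor per core)."""
    total = new / old
    return total, total / cores


def annual_rate(factor, years):
    """Compound annual growth rate implied by an overall growth factor."""
    return factor ** (1.0 / years) - 1.0


total, per_core = growth_factors(
    K6_2_TRANSISTORS, ZEN2_CHIPLET_TRANSISTORS, ZEN2_CORES_PER_CHIPLET
)
print(f"total: {total:.0f}x, per core: {per_core:.0f}x")
print(f"implied per-core growth per year: {annual_rate(per_core, 20):.1%}")
```

The quoted 419x and 52x check out; spread over 20 years, that per-core figure works out to a little over 20% compound growth per year.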

I bet in 1999 there were a lot of people like you... saying 3-6 instructions per clock is impossible and here we are. So I would be very careful about predicting that something is not possible. I'm gonna believe CPU architects (Jim Keller) more than anybody else here.

Theoretically, if the Zen 3 uarch did the same core fusion as BD -> Zen (2xALU no SMT -> 4xALU + SMT2), then Zen 3 would be 8xALU + SMT4. All variants have 2xALUs/thread (great for actual generic code); however, you could get very high IPC if you disable SMT or when priority thread control is possible (Asymmetric MT?). Maybe this core fusion is what Keller meant when he mentioned "linear performance scaling", as opposed to IceLake (+38% transistors produce only +18% IPC).
 
Last edited:

A///

Diamond Member
Feb 24, 2017
4,352
3,154
136
K6-2 (1999) had 9.3 mil transistors
Zen2 chiplet (2019) has 3,900 mil transistors ... 419x more /8 cores .... 52x more/core in 20 years

I bet in 1999 there were a lot of people like you... saying 3-6 instructions per clock is impossible and here we are. So I would be very careful about predicting that something is not possible. I'm gonna believe CPU architects (Jim Keller) more than anybody else here.

Theoretically, if the Zen 3 uarch did the same core fusion as BD -> Zen (2xALU no SMT -> 4xALU + SMT2), then Zen 3 would be 8xALU + SMT4. All variants have 2xALUs/thread (great for actual generic code); however, you could get very high IPC if you disable SMT or when priority thread control is possible (Asymmetric MT?). Maybe this core fusion is what Keller meant when he mentioned "linear performance scaling", as opposed to IceLake (+38% transistors produce only +18% IPC).
You would be wrong about your bet. But relying on past progression as a measure for future progression isn't always ideal. But I take your point. I think the semiconductor fabs who research this stuff are going to have to seek new materials as node shrinks are expected to get tough after 5 nm, and this is something that's been in research for quite a long time. We're in a processor revolution right now, and I haven't felt this way in a long time. That goes for all of us. I don't see SMT4 anytime soon, but it may be something to consider in the future. I could see it being something on Zen5 or 6. While I compared such threading to Power, the problem is you can't directly compare these together. Though it wouldn't surprise me if AMD and Intel have test labs where SMT4/4T HT chips exist.
 
  • Like
Reactions: Thunder 57

amd6502

Senior member
Apr 21, 2017
971
360
136
+38% transistors produce only +18%IPC

That is actually quite good (probably refers to MT (or combined) not pure ST performance though). If 18% is ST number then it sounds like it beats Pollack's rule.

I think increasing ST becomes more and more challenging as we see Moore's law die and as IPC has reached an already large number.
 

Thunder 57

Platinum Member
Aug 19, 2007
2,674
3,796
136
K6-2 (1999) had 9.3 mil transistors
Zen2 chiplet (2019) has 3,900 mil transistors ... 419x more /8 cores .... 52x more/core in 20 years

I bet in 1999 there were a lot of people like you... saying 3-6 instructions per clock is impossible and here we are. So I would be very careful about predicting that something is not possible. I'm gonna believe CPU architects (Jim Keller) more than anybody else here.

Theoretically, if the Zen 3 uarch did the same core fusion as BD -> Zen (2xALU no SMT -> 4xALU + SMT2), then Zen 3 would be 8xALU + SMT4. All variants have 2xALUs/thread (great for actual generic code); however, you could get very high IPC if you disable SMT or when priority thread control is possible (Asymmetric MT?). Maybe this core fusion is what Keller meant when he mentioned "linear performance scaling", as opposed to IceLake (+38% transistors produce only +18% IPC).

You know, I've heard so much about SMT4 that I found through a quick search you mention it in over 1/3 of your posts. We get it, we just all (unless I am missing someone) disagree. Not in Zen 3 at least.

Also, Jim Keller is not some deity and there are other brilliant minds out there. I doubt any of them are bold enough to make claims of 50x more transistors per core in 20 years either. Manufacturing is quickly becoming a problem here. I'm not saying it can't happen, but it certainly looks more difficult now than it did 20/30/40 years ago.
 

soresu

Platinum Member
Dec 19, 2014
2,657
1,858
136
I bet in 1999 there were a lot of people like you... saying 3-6 instructions per clock is impossible
I suggest you learn a little history first, before you start to embarrass yourself.

According to this article, the Alpha EV6 was a 6 wide OoO superscalar uArch introduced in 1996 (dated from Wikipedia).

It's not about whether it is simply possible - it's about whether it can be done while maintaining legacy x86 cruft, and the high multi Ghz clock speed necessary for both market competitive performance and software compatibility.

I might also add that the planned, but unproduced EV8 was targeted for 8 wide, cancelled in favor of the Itanic disaster.
 
Last edited:

itsmydamnation

Platinum Member
Feb 6, 2011
2,764
3,131
136
I bet in 1999 there were a lot of people like you... saying 3-6 instructions per clock is impossible and here we are. So I would be very careful about predicting that something is not possible. I'm gonna believe CPU architects (Jim Keller) more than anybody else here.
So you realize that K7, K8, K10 are all actually 6 wide (3 ALU, 3 AGU); now look at how much faster Zen is while also being 6 wide (4 ALU + 2 AGU), and Zen2 is even faster again while being 7 wide. But what you're doing, like you always seem to do (probably because of Dunning-Kruger), is ignoring how things actually work: those 6 pipes in K7 are organised as 3 clusters of 2, while in Zen it's 1 cluster of 6. It is so much harder and so much more complex to have 6x1 vs 2x3, but by your dumb counting logic they should perform the same, right.....

I also see you completely ignored trying to justify how Zen is ALU limited when they chose to increase AGUs in Zen2, which is harder than adding an ALU because not only does it need to connect to the PRF, it also has to connect to the memory subsystem.

Theoretically, if the Zen 3 uarch did the same core fusion as BD -> Zen (2xALU no SMT -> 4xALU + SMT2), then Zen 3 would be 8xALU + SMT4. All variants have 2xALUs/thread (great for actual generic code); however, you could get very high IPC if you disable SMT or when priority thread control is possible (Asymmetric MT?). Maybe this core fusion is what Keller meant when he mentioned "linear performance scaling", as opposed to IceLake (+38% transistors produce only +18% IPC).
How do you plan to feed these 8x ALUs: how many load + store ports, how much cache bandwidth, what size, how is it tagged/indexed, how big and what type of TLBs, page walkers, etc.? Get this through your thick skull: execution of data is easy, movement of data is hard. It's why flock-of-chickens CPUs always lose and why the massive investment in transistors per core can be justified for big OoOE cores.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
8 ALUs and 8 FPUs can be fed with 2x64B(R)/1x64B(W). Data movement is easy, it is in fact the easiest thing to target. Hence, the semiconductor world wraps around MLP over ILP everyday.

The purpose of SMT4 is to increase MLP, thus increase ILP. Get it through your thick skull scrub.

Two big things to watch out for:
*Store Queue Fusion: ALU0/ALU1[M0]+(ALUx/ALUy){Mx} can (each) store 16B simultaneous into the store queue.
*Load Queue Fusion: ALU0/ALU1[M0]+(ALUx/ALUy){MX} can (each) load 16B simultaneous from the load queue.
 
Last edited:
  • Like
Reactions: amd6502

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
That is actually quite good (probably refers to MT (or combined) not pure ST performance though). If 18% is ST number then it sounds like it beats Pollack's rule.

Actually 18% improvement with 38% more transistors nearly exactly follows Pollack's rule.
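
For reference, Pollack's rule says single-thread performance grows roughly with the square root of the increase in core complexity (area or transistor count). A quick Python check of the numbers quoted above; the function name is just for illustration, and the rule itself is an empirical rule of thumb, not a law.

```python
import math


def pollack_speedup(transistor_ratio):
    """Performance ratio predicted by Pollack's rule: sqrt of complexity ratio."""
    return math.sqrt(transistor_ratio)


predicted = pollack_speedup(1.38)  # +38% transistors, as quoted
print(f"Pollack's rule predicts about +{(predicted - 1) * 100:.1f}% performance")
# prints: Pollack's rule predicts about +17.5% performance
```

So +38% transistors for +18% IPC lands almost exactly on the rule's prediction rather than beating it.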

This is why we went from having passively cooled desktop chips in the 486 era to water-cooled desktops in 2019. Lot of the increase came not just from new process, but by increasing TDP.

Optimism drives people to continue, even though in reality it may fall short of such lofty goals. Take any such claims with a heaping of salt, because otherwise disappointment will set in.
 

Thunder 57

Platinum Member
Aug 19, 2007
2,674
3,796
136
8 ALUs and 8 FPUs can be fed with 2x64B(R)/1x64B(W). Data movement is easy, it is in fact the easiest thing to target. Hence, the semiconductor world wraps around MLP over ILP everyday.

The purpose of SMT4 is to increase MLP, thus increase ILP. Get it through your thick skull scrub.

Two big things to watch out for:
*Store Queue Fusion: ALU0/ALU1[M0]+(ALUx/ALUy){Mx} can (each) store 16B simultaneous into the store queue.
*Load Queue Fusion: ALU0/ALU1[M0]+(ALUx/ALUy){MX} can (each) load 16B simultaneous from the load queue.

There is no SMT4 in Zen 3. Do we have to continue this crap until it is released?
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
There is no SMT4 in Zen 3. Do we have to continue this crap until it is released?
Zen3 imho, isn't going to be much different from Zen2. It won't even be on 7nm+, but rather be on N7P. AMD got banned from N7+. The only node beyond 7nm that AMD is allowed to go to is 5nm.

Milan is Family 17h.
Vermeer and Genoa/Genesis is Family 19h.
 
  • Wow
Reactions: amd6502

Thunder 57

Platinum Member
Aug 19, 2007
2,674
3,796
136
Zen3 imho, isn't going to be much different from Zen2. It won't even be on 7nm+, but rather be on N7P. AMD got banned from N7+. The only node beyond 7nm that AMD is allowed to go is 5nm.

Milan is Family 17h.
Vermeer and Genoa/Genesis is Family 19h.

What??? AMD has been saying 7nm+ for a long time now. Banned from it? For selling just too many damn CPUs? I'm sure they have contracts in place. And for once, just once, I would like to see you back something up with facts. Otherwise, some people will continue to think that, well, I have to stop myself here. Just let me put it this way: I do not hold you or what you have to say in high regard.
 

soresu

Platinum Member
Dec 19, 2014
2,657
1,858
136
What??? AMD has been saying 7nm+ for a long time now. Banned from it? For selling just too many damn CPUs? I'm sure they have contracts in place. And for once, just once, I would like to see you back something up with facts. Otherwise, some people will continue to think that, well, I have to stop myself here. Just let me put it this way: I do not hold you or what you have to say in high regard.
The repeated FDX/FD SOI stuff kinda leans that way already - makes any talk against Zen3 7nm+ a bit far fetched, considering AMD have been completely unambiguous about it publicly.

Zen4 on the other hand is still up in the air it seems given no process on the most recent roadmap - methinks either 5nm is not quite floating AMD's boat, or perhaps Samsung are persuading them to wait for their 3nm MBCFET process, though this does seem a stretch given it's still a ways out yet.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
L3 SRAM libs for 5nm are done at AMD. There aren't really enough leaks to get the full picture of how far along AMD is on 5nm right now.

However:
"that it will ramp up much in terms of revenue, be much faster than 7 nm." => "N5 chips in high volume starting Q2 2020."
AMD is slated to be with Apple/hisilicon group. ~1cm2 Elite Group.

TSMC is more optimistic that it can support AMD on 5nm better than 7nm.
Lower lead times, higher volumes, etc are the promises that I have received. There is also the custom EDA part that is getting way more support on 5nm than on 7nm.

It also follows the trend...
14LPP/12LP (576nm Height) -> N7 (300nm Height) rather than slow down with N7+, go straight to 5nm -> N5 (168nm Height) ¯\_(ツ)_/¯
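
To put numbers on that trend, a short Python sketch that computes the step-to-step cell height ratio. The heights are the ones quoted in this post and have not been independently verified; the names are purely illustrative.

```python
# Cell heights in nm as quoted in the post above (not independently verified).
CELL_HEIGHT_NM = [
    ("14LPP/12LP", 576),
    ("N7", 300),
    ("N5", 168),
]


def step_ratios(heights):
    """Return (from_node, to_node, ratio) for each consecutive pair of nodes."""
    out = []
    for (a, ha), (b, hb) in zip(heights, heights[1:]):
        out.append((a, b, hb / ha))
    return out


for a, b, r in step_ratios(CELL_HEIGHT_NM):
    print(f"{a} -> {b}: cell height x{r:.2f}")
```

Both steps come out a bit above a halving of cell height (about 0.52x and 0.56x), which is the "follows the trend" point being made.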
 
Last edited:

DrMrLordX

Lifer
Apr 27, 2000
21,620
10,829
136
methinks either 5nm is not quite floating AMD's boat, or perhaps Samsung are persuading them to wait for their 3nm MBCFET process

Request for clarification: is TSMC 5nm going to be more akin to TSMC 7nm+ or 6nm?
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
Request for clarification: is TSMC 5nm going to be more akin to TSMC 7nm+ or 6nm?
TSMC 5nm is 14 layers of EUV and isn't cell-compatible with 7nm. ~59+ mask layers
TSMC 6nm is 5 layers of EUV and is cell-compatible with 7nm. ~65+ mask layers ((TSMC 6nm can re-tapeout 7nm(N7/N7P(also, N7plus)) on EUV, while TSMC 7nm+ can't.))
TSMC 7nm+ is 4 layers of EUV and isn't cell-compatible with 7nm. ~65+ mask layers
TSMC 7nm, no EUV. ~75+ mask layers. (7LP/13ML on GloFo, it is 88 mask layers)

TSMC 7nm/6nm is ~40nm metal pitch and TSMC 5nm is ~28nm metal pitch. 5nm also has a lower TCO than 7nm even without a shrink, making its shrink very cost-effective.

77 mm2 on 7nm => expensive
77 mm2 on 5nm(no shrink, no advancement, just a new process) => 15% lower cost

Relative to the other nodes, N5 is planned to be the longest. N5/2020-HVM -> N5P/2021-HVM -> N5P+(Ge+)/2022-HVM -> N5P+ Low Vdd(Ge+&LV)/2023-HVM.
 
Last edited:
  • Like
Reactions: amd6502

DisEnchantment

Golden Member
Mar 3, 2017
1,601
5,780
136
Data movement is easy, it is in fact the easiest thing to target

I think everyone keeps saying it is the opposite. Moving data causes delay and uses energy. From Lisa's presentation at the DARPA event, she says more time is spent moving data than computing.
Hence so many concepts of PIM floating around.
 
  • Like
Reactions: Olikan

moinmoin

Diamond Member
Jun 1, 2017
4,944
7,656
136
I think everyone keeps saying it is the opposite. Moving data causes delay and uses energy. From Lisa's presentation at the DARPA event, she says more time is spent moving data than computing.
Hence so many concepts of PIM floating around.
Furthermore, data movement as an action is the single big bottleneck preventing compute from running at 100% at all times. Which, as an aside, is also why having some form of predictors at all stages is so effective: with hits, it both reduces latency and prevents congestion.
 

Richie Rich

Senior member
Jul 28, 2019
470
229
76
So you realize that K7, K8, K10 are all actually 6 wide (3 ALU, 3 AGU); now look at how much faster Zen is while also being 6 wide (4 ALU + 2 AGU), and Zen2 is even faster again while being 7 wide. But what you're doing, like you always seem to do (probably because of Dunning-Kruger), is ignoring how things actually work: those 6 pipes in K7 are organised as 3 clusters of 2, while in Zen it's 1 cluster of 6. It is so much harder and so much more complex to have 6x1 vs 2x3, but by your dumb counting logic they should perform the same, right.....
Funny how you think others are dumb and you are smart.
This is exactly what I'm saying. The on-paper stronger K8, with ALU+AGU tied together (3xALU+3xAGU), was much slower than the theoretically weaker Core2Duo with decoupled 3xALU+2xAGU. C2D had a speculative load feature and some other new stuff which was possible with that big cluster, and K8 was missing all that. That's why fusing two cores together into one big Zen 3 core with 8xALU + 4xAGU + SMT4 could provide enough room to implement some new advanced logic to extract more ILP/IPC, which is not possible in a narrow 4xALU+SMT2 core. Especially for the next iterations in Zen4 and Zen5. It looks like you argued in favor of my dumb wide Zen3+SMT4 core, thanks :)


I also see you completely ignored trying to justify how Zen is ALU limited when they chose to increase AGUs in Zen2, which is harder than adding an ALU because not only does it need to connect to the PRF, it also has to connect to the memory subsystem.
You are wrong about that. Zen 2 does not have a new full AGU but only a store unit, which is much, much simpler than a load unit with all that speculative loading and those load predictors. Lowest-hanging fruit; it was the easiest way. Maybe you noticed that Intel has been using a dedicated store unit for a while too.


How do you plan to feed these 8x ALUs: how many load + store ports, how much cache bandwidth, what size, how is it tagged/indexed, how big and what type of TLBs, page walkers, etc.? Get this through your thick skull: execution of data is easy, movement of data is hard. It's why flock-of-chickens CPUs always lose and why the massive investment in transistors per core can be justified for big OoOE cores.
How is Apple feeding those 6xALUs in the Vortex core? They can do that with just 2xAGUs. How do they gain +58% INT IPC over Skylake? Maybe Apple hired some black magic voodoo shaman, or maybe they know what they are doing. And unfortunately Apple engineers forgot to ask you, so nobody told them it's not possible :)