Speculation: Ryzen 4000 series/Zen 3


itsmydamnation

Platinum Member
Feb 6, 2011
2,743
3,075
136
Funny how you think others are dumb and you are smart.
This is exactly what I'm saying. On paper, the stronger K8 with ALU+AGU tied together (3xALU+3xAGU) was much slower than the theoretically weaker Core 2 Duo with a decoupled 3xALU+2xAGU. C2D had a speculative load feature and other new tricks that were possible with that big cluster, and K8 was missing all of that. That's why fusing two cores together into one big Zen 3 core with 8xALU + 4xAGU + SMT4 could provide enough room to implement some new advanced logic to extract more ILP/IPC, which is not possible in a narrow 4xALU+SMT2 core. Especially for the next iterations in Zen 4 and Zen 5. It looks like you argued in favor of my dumb wide Zen 3 + SMT4 core, thanks :)
I don't think I'm smart, but you have done nothing other than shout "teh APPLE!". So now that you have said 4x AGU: how much load and store bandwidth, how many ports to the cache, and what's the cache configuration? Multi-ported caches are wire limited. Let's not even talk about getting enough decode/dispatch for the mythical 4 threads.


You are wrong about that. Zen 2 did not add a new full AGU, only a store unit, which is much, much simpler than a load unit with all that speculative loading and those load predictors. Lowest-hanging fruit, it was the easiest way. Maybe you noticed that Intel has been using a dedicated store unit for a while too.
No, I'm right and you're wrong (see, I provided as much evidence as you do).
Maybe you should go read the patent on how it actually works (yes, it's published). It is one unified queue from which it picks 3 addresses to generate for load/store; it wasn't simple and it can't be done in a single cycle. There is no point adding a 3rd AGU on the load side of the equation, because there are only 2 load ports to the cache. The AGUs have nothing to do with prefetch/predict, so I don't know why you're trying to conflate that. But the store side has to deal with store-to-load forwarding/memory disambiguation, and it still needs to connect to the PRF.

How is Apple feeding those 6xALUs in the Vortex core? They can do that with just 2xAGUs. How do they gain +58% INT IPC over Skylake? Maybe Apple hired some black-magic voodoo shaman, or maybe they know what they are doing. And unfortunately Apple's engineers forgot to ask you whether it's possible :)
So first, they don't tell us how any of their cores work at all; for all you know it could be two clusters of 3 ALUs + branch + AGU with a split PRF (just like z15). You have no idea how their prefetch/predict, L2, stream prefetchers, page walkers, etc. work, and you have no idea what kind of memory disambiguation they are doing (ARM has a weaker memory model). The only thing you know is that they have 6 ALUs, so that MUST be it. Just ignore that Hurricane has 4 ALUs and would still beat Skylake in your metric quite handily, and that Apple massively improved their cache and memory subsystems from A10 to A12, as can be seen in the AnandTech reviews, along with dispatch and all the prefetch/predict improvements you would expect. Also, ARM has load/store pair instructions and Apple has complete control of their ecosystem/compilers, so those 2 load/store units can be loading/storing 4 pieces of data a cycle.
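To make the load/store-pair point concrete, here's just a sketch (the function and names are made up for illustration, not taken from any real compiler output): an AArch64 compiler can merge two adjacent loads into a single LDP (load-pair) instruction, so one load port returns two registers' worth of data per access.

```c
/* Sketch only: two adjacent loads that an AArch64 compiler may fuse
 * into a single LDP (load-pair) instruction, letting one load port
 * return two registers' worth of data in one access. */
long sum_adjacent(const long *p) {
    long a = p[0];  /* adjacent in memory ...                */
    long b = p[1];  /* ... so a candidate for one LDP access */
    return a + b;
}
```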

See, I've never said we won't see more ALUs on Zen. Unlike you, I don't see more ALUs being the "killer feature"; the killer feature is all the other microarchitectural improvements that allow you to extract enough ILP to make more ALUs worth having.
 

Richie Rich

Senior member
Jul 28, 2019
470
229
76
No, I'm right and you're wrong (see, I provided as much evidence as you do).
Maybe you should go read the patent on how it actually works (yes, it's published). It is one unified queue from which it picks 3 addresses to generate for load/store; it wasn't simple and it can't be done in a single cycle. There is no point adding a 3rd AGU on the load side of the equation, because there are only 2 load ports to the cache. The AGUs have nothing to do with prefetch/predict, so I don't know why you're trying to conflate that. But the store side has to deal with store-to-load forwarding/memory disambiguation, and it still needs to connect to the PRF.
No, you are wrong. You are either lying or you have poor knowledge. There has been a predictor in the load unit since the Intel Core uarch. It's used to let a load instruction speculatively pass ahead of a store instruction (whose address is not calculated yet), theoretically delivering a 30-40% performance boost. You should educate yourself before spreading your misinformation: https://www.anandtech.com/show/1998/5
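As a rough sketch of the hazard that predictor speculates past (hypothetical code, not from the article): when an older store's address isn't resolved yet, a younger load can only issue early if the hardware predicts the two won't alias.

```c
/* Minimal sketch of the store-then-load case a memory-dependence
 * predictor speculates past; names and values are illustrative. */
int scale_first(int *dst, const int *src, long idx) {
    dst[idx] = 42;       /* older store: address dst+idx may resolve late */
    int x = src[0] * 2;  /* younger load: may or may not alias dst+idx    */
    return x;            /* without speculation the load waits for the
                            store address; a correct "no alias" prediction
                            lets it issue early, a wrong one forces replay */
}
```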

See, I've never said we won't see more ALUs on Zen. Unlike you, I don't see more ALUs being the "killer feature"; the killer feature is all the other microarchitectural improvements that allow you to extract enough ILP to make more ALUs worth having.
I never said there are no other uarch improvements. Do not put your lies into my mouth, please. Actually, I have always put a strong emphasis on Apple's advanced uarch, which allows it to utilize those 6xALUs in an incredible way (+50% more ALUs provides +58% IPC over Skylake, as I've mentioned here multiple times).

And again, you didn't get my point at all. I'm not talking about more ALUs only. Regarding Zen 3 (and what Keller meant by "linear IPC scaling" in that video), I'm consistently talking about a high number of ALUs together with SMT4. The symbiotic combination of these two features could become the killer feature. Shared resources bring more efficiency and performance. In the same way, AMD leapfrogged performance by merging the two narrow cores of the BD design (2+2 ALU) into one wider 4xALU+SMT2 core in Zen. It was an effective move once, so it could be effective again with an even wider core + SMT4.

Another option for Zen 3 is a shared front-end with a shared FPU, Bulldozer style: a front-end capable of handling 4 threads, a back-end consisting of 2x Zen 3+SMT2 integer cores, and a shared, powerful FPU (12 pipes shared by 4 threads). AMD has experience with BD, it's simpler to do (than an entirely wider core + SMT4), and it allows a great FPU boost (leaks indicate +40-50% FPU performance). However, it's kind of a sub-optimal solution IMHO.

 

soresu

Platinum Member
Dec 19, 2014
2,617
1,812
136
Another option for Zen 3 is a shared front-end with a shared FPU, Bulldozer style: a front-end capable of handling 4 threads, a back-end consisting of 2x Zen 3+SMT2 integer cores, and a shared, powerful FPU (12 pipes shared by 4 threads). AMD has experience with BD, it's simpler to do (than an entirely wider core + SMT4), and it allows a great FPU boost (leaks indicate +40-50% FPU performance). However, it's kind of a sub-optimal solution IMHO.
Won't happen, end of. Anything that even looks like it will spook AMD customers and stock investors after the BD fustercluck - optics are extremely important when you are riding this high and yet are not the mountainous business force that is Intel (financially, I mean).

I'll be happy to eat my words when the Zen3 details come out, but I strongly doubt that will be the case.

It's my opinion that someone convinced them during BD's concept stage that CMT was simply better than SMT, at the very least in terms of power efficiency - that plan may have been leaned into in the hope that a different MT strategy could make them unique in the market. Unfortunately it caused them to fall spectacularly on their face amidst broken, burning glass.
 

Topweasel

Diamond Member
Oct 19, 2000
5,436
1,654
136
Won't happen, end of. Anything that even looks like it will spook AMD customers and stock investors after the BD fustercluck - optics are extremely important when you are riding this high and yet are not the mountainous business force that is Intel (financially, I mean).

I'll be happy to eat my words when the Zen3 details come out, but I strongly doubt that will be the case.

It's my opinion that someone convinced them during BD's concept stage that CMT was simply better than SMT, at the very least in terms of power efficiency - that plan may have been leaned into in the hope that a different MT strategy could make them unique in the market. Unfortunately it caused them to fall spectacularly on their face amidst broken, burning glass.
I think GCN and BD put AMD in motion for their current moves. Everything was great with GCN for a release or two, but they thought they would get more developers to adopt Mantle or move to Mantle-like APIs. Instead, as soon as DX12 was announced, everyone just stopped using Mantle. That was years before DX12 and Vulkan actually got used. It also wasn't very cohesive: Mantle and direct GPU scheduling reduced CPU requirements, whereas BD required programs to be written for it to utilize the extra resources per module. Both needed heavier work from developers, in competing ways.

Zen was, and RDNA is, the start of them going back to basics on GPUs. Make your stuff work best on what is already out there. Do what they started with the old HD GPU lineup: make sure it's fast for what is out there, then add functionality. Don't require the added functionality to make the product work correctly (or in some cases be worthwhile; the GeForce 256 didn't sell because of T&L, and the RTX series doesn't sell because of ray tracing, they sell because they run everything else fast, though the RTX series is such a lateral move that RT is the selling point, which I think was a mistake). Doubtful they make the same mistake so soon.
 

soresu

Platinum Member
Dec 19, 2014
2,617
1,812
136
Mantle and direct GPU scheduling reduced CPU requirements, whereas BD required programs to be written for it to utilize the extra resources per module. Both needed heavier work from developers, in competing ways.
Yes, unfortunately I think AMD has erred more than once on the point of leaving it to outside developers to do the work, especially when Intel and nVidia can afford to splurge on teams of software engineers to largely do it for them.

Strange that the console dominance AMD held didn't translate into more developers using AMD's hardware on principle though - I guess the differences in software platforms were too significant for the experience to translate directly.

I think that AMD could do with investing in their own optimised Linux build (like Clear Linux for Intel), but more for their GPU endeavors - the Linux community seem to be covering Zen support pretty well so far.
 

Ajay

Lifer
Jan 8, 2001
15,332
7,792
136
Request for clarification: is TSMC 5nm going to be more akin to TSMC 7nm+ or 6nm?
7nm+. TSMC 6nm FF is an area reduction node based off 7nm FF. 5nm EUV isn’t just an optical shrink and will require more substantial design effort (which is probably why TSMC developed 6nm FF).
 

soresu

Platinum Member
Dec 19, 2014
2,617
1,812
136
That was years before DX12 and Vulkan actually got used.
Arguable - "widely used" would be a better argument.

DOOM 2016 had a Vulkan back end the same year Vulkan was announced, and Rise of the Tomb Raider had a DX12 update less than a year after W10 launched with DX12 officially.

It was a slow start, sure, but far better than earlier APIs like DX11, which took ages to get significant support.
 

Thunder 57

Platinum Member
Aug 19, 2007
2,647
3,706
136
Arguable - "widely used" would be a better argument.

DOOM 2016 had a Vulkan back end the same year Vulkan was announced, and Rise of the Tomb Raider had a DX12 update less than a year after W10 launched with DX12 officially.

It was a slow start, sure, but far better than earlier APIs like DX11, which took ages to get significant support.

Vulkan Doom was great. OpenGL Doom at launch, not so much. It would just crash or stop responding at times. As time went on, that went away. I'm not sure if that was due to game fixes or Vulkan (which I turned on as soon as it was available) but it got better. I wish more developers would use Vulkan.

id/Bethesda/whoever has always gone with OpenGL, or anything other than DirectX. Typically with good results.
 
  • Like
Reactions: soresu

Topweasel

Diamond Member
Oct 19, 2000
5,436
1,654
136
Arguable - "widely used" would be a better argument.

DOOM 2016 had a Vulkan back end the same year Vulkan was announced, and Rise of the Tomb Raider had a DX12 update less than a year after W10 launched with DX12 officially.

It was a slow start, sure, but far better than earlier APIs like DX11, which took ages to get significant support.
Not sure what's off about what I said.

GCN launched in 2012, Mantle launched with BF4 in 2013, MS announced DX12 in 2014, and Doom launched two years later in 2016. I did mean widely used, but the facts check out as is. DX12 wasn't consistently used until late '17/early '18, 5-6 years after AMD developed a uarch meant for on-device scheduling to be used with low-level APIs.
 

soresu

Platinum Member
Dec 19, 2014
2,617
1,812
136
5-6 years after AMD developed a uarch meant for on-device scheduling to be used with low-level APIs.
I think it was designed less for Mantle/low-level APIs than for compute in general - its problems lay in extracting gaming FPS from all those lovely TFLOPS, something that nVidia's less versatile uArch was far better suited to.

I'd say low-level APIs just mitigated GCN's problems on the gaming side - I can only assume the consoles made a better job of it, singular platform targets being easier to optimise for.
 
  • Like
Reactions: Olikan

Topweasel

Diamond Member
Oct 19, 2000
5,436
1,654
136
I think it was designed less for Mantle/low-level APIs than for compute in general - its problems lay in extracting gaming FPS from all those lovely TFLOPS, something that nVidia's less versatile uArch was far better suited to.

I'd say low-level APIs just mitigated GCN's problems on the gaming side - I can only assume the consoles made a better job of it, singular platform targets being easier to optimise for.
They certainly had an eye towards general compute capability rather than adding specialized compute functionality, at least at the start. But I think that was the plan: write to Mantle or other low-level APIs and boom, these basically-compute units would do great graphics as well. It helped them that the 7900 series and 200 series were generally pretty competitive, at least at launch, but efficiency killed them. The big problem is that AMD thought they could change the market, maybe not to their will like Intel, but towards what they thought was simply a better solution. Like I said, they lost their vision when it came to making sure, first and foremost, that it was really good for what people were doing now. The 7970 and 290X were really fast, but way too big and power hungry to keep up with Nvidia in the long run, because the uarchs themselves were not particularly good at what was already out there.
 

Olikan

Platinum Member
Sep 23, 2011
2,023
275
126
I think it was designed less for Mantle/low-level APIs than for compute in general - its problems lay in extracting gaming FPS from all those lovely TFLOPS, something that nVidia's less versatile uArch was far better suited to.

I'd say low-level APIs just mitigated GCN's problems on the gaming side - I can only assume the consoles made a better job of it, singular platform targets being easier to optimise for.
Yeah, no low-level API can fix GCN's memory subsystem (huge bandwidth requirements).
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,743
3,075
136
Yeah, no low-level API can fix GCN's memory subsystem (huge bandwidth requirements).
When GCN was released its memory subsystem was fine. The actual problem was AMD having no money to turn R&D (see the patents) into products; as a result, GCN 2/3/4/5 really didn't add a lot, while at the same time NV made big uarch changes almost every generation, if not every second one. But now AMD has money again, and RDNA is the first product to come to market with that money. I expect each new generation to have far more uarch changes than ever existed within GCN.
 

soresu

Platinum Member
Dec 19, 2014
2,617
1,812
136
When GCN was released its memory subsystem was fine. The actual problem was AMD having no money to turn R&D (see the patents) into products; as a result, GCN 2/3/4/5 really didn't add a lot, while at the same time NV made big uarch changes almost every generation, if not every second one. But now AMD has money again, and RDNA is the first product to come to market with that money. I expect each new generation to have far more uarch changes than ever existed within GCN.
True to an extent - I also believe that nVidia had to make more changes due to its narrower focus on fps, rather than a more forward-thinking and compute-versatile design like GCN.

The problem with that versatility in GCN was that, like a big boat, it ate a lot of fuel and didn't turn as quickly (or cheaply) as a small one.
 

moinmoin

Diamond Member
Jun 1, 2017
4,933
7,619
136
When GCN was released its memory subsystem was fine. The actual problem was AMD having no money to turn R&D (see the patents) into products; as a result, GCN 2/3/4/5 really didn't add a lot, while at the same time NV made big uarch changes almost every generation, if not every second one. But now AMD has money again, and RDNA is the first product to come to market with that money. I expect each new generation to have far more uarch changes than ever existed within GCN.
Development on RDNA definitely started way before Zen brought in the new money. Previously its progress was just majorly tied to demand from the game console manufacturers, who are only now getting into the final stage of finalizing their next-gen products.
 
  • Like
Reactions: soresu

Olikan

Platinum Member
Sep 23, 2011
2,023
275
126
When GCN was released its memory subsystem was fine. The actual problem was AMD having no money to turn R&D (see the patents) into products; as a result, GCN 2/3/4/5 really didn't add a lot, while at the same time NV made big uarch changes almost every generation, if not every second one. But now AMD has money again, and RDNA is the first product to come to market with that money. I expect each new generation to have far more uarch changes than ever existed within GCN.
While that's true about R&D... AMD also just went down a different path than Nvidia: AMD co-developed HBM with SK Hynix, while Nvidia went for color compression, tiled rendering and caches...
 

soresu

Platinum Member
Dec 19, 2014
2,617
1,812
136
Development on RDNA definitely started way before Zen brought in the new money. Previously its progress was just majorly tied to demand from the game console manufacturers, who are only now getting into the final stage of finalizing their next-gen products.
The Wave32 uArch's compatibility with Wave64 GCN code shows clear console backwards-compatibility thinking in RDNA's development - it means that, together with Zen, it should play any XB1/PS4 game well, and probably anything else coded for them too, like virtual console emulators.

For Sony and Microsoft this represents the first generation where they have anything like the backwards compatibility from GameCube to Wii U that Nintendo had - coupled with a serious boost to performance, it's a win-win for them.
 

uzzi38

Platinum Member
Oct 16, 2019
2,565
5,575
146
God, who wrote that article, that's absolutely horrible.

For starters, Renoir is still using Vega. This has been an established fact for literal months now; I don't get why it's still up for debate. It appears somebody needs to teach the writer the difference between CU counts and codenames: the GPUs in APUs don't get codenames. You don't go and say, 'Oh, there's gonna be a Navi 12, must be what the APUs are getting'.

Secondly, Renoir shouldn't be changing much in the way of CU count, and that device has 8 CUs (512 shaders / 64 = 8). So I have no clue where the assumption that it's the Ryzen 3 4200U comes from.
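Spelling that arithmetic out, assuming the usual 64 shaders per GCN/Vega CU (just a quick sanity-check sketch):

```c
/* Sanity check of the CU count above, assuming 64 shaders per CU. */
#include <stdio.h>

int main(void) {
    int shaders = 512;
    int shaders_per_cu = 64;                            /* GCN/Vega CU width */
    printf("%d CUs\n", shaders / shaders_per_cu);       /* prints "8 CUs"    */
    return 0;
}
```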
 

soresu

Platinum Member
Dec 19, 2014
2,617
1,812
136
Secondly, Renoir shouldn't be changing much in the way of CU count, and that device has 8 CUs (512 shaders / 64 = 8). So I have no clue where the assumption that it's the Ryzen 3 4200U comes from.
Yeah, I'd say 16 maximum, I heard 15 from one quarter.