Speculation: Ryzen 4000 series/Zen 3


maddie

Diamond Member
Jul 18, 2010
4,738
4,667
136
Why the assumption of fixed SMT values? I thought IBM has long used a 'scale on the fly' SMT variation, so we know it works. A core that scales up to SMT4 'as needed' might be the better solution.
 
  • Like
Reactions: DarthKyrie

Ajay

Lifer
Jan 8, 2001
15,431
7,849
136
It's a half node, but I think there's a significant gain in density compared to the 7nm that Zen 2 is using. It's not double, but there seems to be room for AMD to increase the size of the chiplets (especially on EPYC and Threadripper). Plus, AMD will likely be changing packaging (i.e. socket) somewhere in there as well as shrinking the size of the I/O die, so they could increase the chiplet sizes themselves. And there are other aspects they can likely play with (the shape of the chiplets, for instance).

Well, I checked an article on Wikichip and was shocked to see that, according to TSMC's estimate, density will increase by approximately 85%!
This is probably for SRAM, so logic may not scale quite as well.

[Image: tsmc-7nm-density-q2-2019.png - TSMC 7nm density figures, Q2 2019]


This gives AMD a lot of xtors to play with for Zen 4. A 20% bump in chiplet size would double the number of CCXs per CCD (rough math below), if AMD wants to go that way.
AMD may want to go for a wider core and increased cache instead. Lots of decisions to be made (well, AMD has already made them!).
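A quick back-of-the-envelope check of that CCX-doubling claim, using the ~1.85x density figure quoted above and a hypothetical 20% larger chiplet (both numbers come from this thread, not from AMD):

```cpp
#include <cstdio>

// Back-of-the-envelope transistor budget for a hypothetical 5nm Zen 4 chiplet.
// Inputs are the ~1.85x density gain quoted from TSMC above and the 20% larger
// die floated in the post - illustrative assumptions, not AMD figures.
int main() {
    const double density_gain = 1.85; // quoted TSMC N5-vs-N7 density improvement
    const double area_growth  = 1.20; // hypothetical 20% larger chiplet

    // Transistor count scales roughly with density * area.
    const double budget = density_gain * area_growth;

    std::printf("Relative transistor budget vs a Zen 2 CCD: ~%.2fx\n", budget);
    // ~2.2x - roughly enough to double the CCX count per CCD with some margin,
    // or to spend on wider cores and more cache instead.
    return 0;
}
```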

Something I've been curious about (since I've been pushing for it a lot): given the recent GF announcement about their new version of 12nm, including a new interposer for HBM (I think HBM3 is supposed to be made on 12nm, so I'm still left wondering what would happen if they integrated the HBM into/onto the I/O die itself, skipping the interposer entirely), would pairing some HBM with the I/O die let them reduce cache sizes? That would let them increase core counts and/or the size of the cores. Yes, it would be higher latency than L1, L2, and maybe L3 cache (although I think it wouldn't be that far off the latter), but that could possibly be mitigated by the throughput (bandwidth) and by pooling the cache (some smart sharing of resources might hide the latency issues). Maybe they could keep L1 and L2 small, replace L3 with a larger pool of HBM that doubles as a buffer to system RAM and maybe even NAND, and so increase throughput significantly (more than making up for the slightly higher latency).

HBM is way too slow compared to L3 cache, for example. The latency difference is huge (don't have the numbers on me, but HBM is at least an order of magnitude slower, and it would have to route through the I/O chip to service all CCXs).
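For a rough sense of that gap, here is the order-of-magnitude claim worked through with ballpark latency figures commonly cited for these memory tiers (my own assumed numbers, not measurements from this thread):

```cpp
#include <cstdio>

// Ballpark access latencies in nanoseconds. These are rough, commonly cited
// figures used purely for illustration - not measurements of any specific part.
int main() {
    const double l3_ns  = 10.0;  // typical L3 hit, ~30-40 cycles at ~4 GHz
    const double hbm_ns = 100.0; // HBM/DRAM-class access, before adding an
                                 // extra hop through an I/O die

    std::printf("HBM is roughly %.0fx slower than L3 per access\n",
                hbm_ns / l3_ns);
    return 0;
}
```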
 
Last edited:

soresu

Platinum Member
Dec 19, 2014
2,657
1,858
136
I'm more interested in whether they can squeeze more performance out of SMT2. Currently, AMD's implementation of SMT2 is good for maybe a 25-30% increase in throughput. In theory they should be able to get more out of it on a wider core.
Slightly off topic, but I'm actually wondering if ARM's Cortex A78/Hercules might be SMT2. With a 6-wide core it would certainly be interesting to see how much they could get out of it, considering what they already managed on a 3-wide core (A65/E1/Helios).

I found it odd that they announced their first SMT design on a little core (even if it is 3-wide OoO), and then announced no follow-up big core this year.
 

soresu

Platinum Member
Dec 19, 2014
2,657
1,858
136
No SW GPGPU needed
Eh? All GPGPU code is software, technically all graphics GPU code is software too - the driver shader compiler just obfuscates that from the user.

Unless you mean for the shader compiler to be entirely hosted by the hardware, like the change to hardware scheduling?

It might be sort of like GPU microcode.

I'm probably talking completely out of my ass here of course, feel free to say as much!
 

soresu

Platinum Member
Dec 19, 2014
2,657
1,858
136
HBM is way too slow compared to L3 cache, for example. The latency difference is huge (don't have the numbers on me, but HBM is at least an order of magnitude slower, and it would have to route through the I/O chip to service all CCXs).
Slower than L3 for sure, but a damn sight lower latency than DDR mounted on the motherboard, not to mention lower power draw too, since the distance to travel is so much shorter - if they ever mount HBM on the CCX chiplets it would make a killer L4.
 

Andrei.

Senior member
Jan 26, 2015
316
386
136
Slightly off topic, but I'm actually wondering if ARM's Cortex A78/Hercules might be SMT2. With a 6-wide core it would certainly be interesting to see how much they could get out of it, considering what they already managed on a 3-wide core (A65/E1/Helios).

I found it odd that they announced their first SMT design on a little core (even if it is 3-wide OoO), and then announced no follow-up big core this year.
Arm pretty much stated that they have absolutely no plans to ever adopt SMT in any IP going into battery-powered devices.
 
  • Like
Reactions: Lodix

DrMrLordX

Lifer
Apr 27, 2000
21,620
10,829
136
And AMD can implement HW bypass for calculation on GPU.

There are many reasons why they haven't done that already. Latency is an issue when dealing with a PCIe-connected GPU. Also, calculations that aren't structured very carefully and don't involve relatively large data sets may not be any faster on a GPU.
 

Richie Rich

Senior member
Jul 28, 2019
470
229
76
Eh? All GPGPU code is software, technically all graphics GPU code is software too - the driver shader compiler just obfuscates that from the user.

Unless you mean for the shader compiler to be entirely hosted by the hardware, like the change to hardware scheduling?

It might be sort of like GPU microcode.

I'm probably talking completely out of my ass here of course, feel free to say as much!
I meant to bypass the compilers and go directly into hardware - something like a GPU front-end that would be able to decode AVX-512 and calculate it. There would be some latency penalties, but for large vectors it could be much faster.

Shame if so, seems like a waste with that 6-wide core.
The purpose of mobile big cores is to deliver peak IPC. If you adopt SMT2 you gain maybe 30% more throughput (130% total), but your per-thread performance drops to 65% (130% divided by 2 threads) - see the quick arithmetic below. That doesn't make sense when you already have LITTLE cores for efficiency. It's very interesting that mobile ARM designs never implemented SMT - it looks like SMT has more disadvantages than benefits for the mobile market.
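A minimal sketch of that trade-off, using the ~30% SMT uplift figure quoted in this thread (illustrative numbers, not measurements):

```cpp
#include <cstdio>

// Worked version of the SMT2 trade-off described above: a ~30% throughput
// gain from a second thread means each individual thread runs slower.
// The 30% uplift is the rough figure quoted in this thread, not a measurement.
int main() {
    const double smt_uplift = 0.30;              // +30% total throughput
    const double total      = 1.0 + smt_uplift;  // 1.30x of a single thread
    const double per_thread = total / 2.0;       // split across two threads

    std::printf("Total throughput with SMT2: %.0f%%\n", total * 100.0);
    std::printf("Per-thread performance:     %.0f%%\n", per_thread * 100.0);
    // Prints 130% and 65% - a fine trade for throughput-bound servers, but a
    // poor one for a mobile big core whose job is peak single-thread speed.
    return 0;
}
```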
 

soresu

Platinum Member
Dec 19, 2014
2,657
1,858
136
It's very interesting that mobile ARM designs never implemented SMT - it looks like SMT has more disadvantages than benefits for the mobile market
To be fair, A65/E1/Helios with SMT only got announced last year - I wouldn't take ARM's word as gospel that we will never get an SMT mobile core.
 

soresu

Platinum Member
Dec 19, 2014
2,657
1,858
136
There are many reasons why they haven't done that already. Latency is an issue when dealing with a PCIe-connected GPU.
I'm pretty sure he meant an APU, so basically Infinity Fabric - it would be no worse than going from chiplet to chiplet, perhaps better if they implement it directly on the CCX chiplet itself, though I admit to a healthy skepticism about this ever happening, however tempting the speculation sounds.

Though before Bulldozer came out, there was a PR image released about future APU design, in 3 parts.

Part 1 showed entirely separate CPU and GPU, the pre Llano days.

Part 2 showed 2 big chunks of CPU and GPU next to each other in a way that basically echoes all APU's today.

Part 3 showed a distributed mix of GPU bits in amongst the CPU bits, in what I would assume to be a far off future design.

Now obviously the utter failure of Bulldozer will have derailed any far off designs based on it in favor of fixing bugs and clawing back perf/watt, but that doesn't necessarily mean that the meat of that mixed design may not still materialise in the future (assuming it wasn't merely a graphics wingding by an unusually hopeful soul in the PR department).
 
  • Like
Reactions: Gideon

yuri69

Senior member
Jul 16, 2013
387
617
136
Though before Bulldozer came out, there was a PR image released about future APU design, in 3 parts.

Part 1 showed entirely separate CPU and GPU, the pre Llano days.

Part 2 showed 2 big chunks of CPU and GPU next to each other in a way that basically echoes all APU's today.

Part 3 showed a distributed mix of GPU bits in amongst the CPU bits, in what I would assume to be a far off future design.

Now obviously the utter failure of Bulldozer will have derailed any far off designs based on it in favor of fixing bugs and clawing back perf/watt, but that doesn't necessarily mean that the meat of that mixed design may not still materialise in the future (assuming it wasn't merely a graphics wingding by an unusually hopeful soul in the PR department).
This was a whole era at AMD. It started with the ATi merger in 2006. We almost immediately got the mighty Fusion branding (The Future is Fusion!), followed by M-SPACE in 2007.

The first Fusion iteration was to be the Bulldozer-based Falcon in late 2008/early 2009. It got cancelled. Yeah, in 2011 we got the K10-based Llano instead.

M-SPACE? The brand got killed almost immediately. It evolved to "IPs" used internally in today's SoCs (GMC, GFX, etc.).

Fusion, when announced, was really murky. Many assumed AMD was going to create a new ISA merging x86 and the GPU (they did x86-64, right?!). There was also the idea of integrating the GPU in a fashion similar to IBM Cell's SPEs, interacting directly with a different ISA.

The images of "integration evolution" you mentioned are probably from one of those: the AMD CTO summit in 2007. There were numerous variations.

Everything was Fusion; even the auto-overclock utility was Fusion in 2009! In 2011 Fusion got marginalized and transformed into FSA, only to be rebranded as today's HSA a year later...

The GPU integration brought better perf/W and wins in specific workloads - workloads that had to benefit massively from the tight GPU/CPU latency, to the degree that it compensated for the lowish FLOPs compared to a full-blown discrete GPU.

13 years after the Fusion announcement, the average end user is lucky if they use a single OpenCL-backed app, and it won't treat the integrated GPU in any special way. Don't trust the marketing hype. Such is life.
 

soresu

Platinum Member
Dec 19, 2014
2,657
1,858
136
just to rebrand it to today's HSA year later
Today it's basically non-existent terminology; the Boltzmann Initiative provided the HIP compiler and the genesis of ROCm, and at that point HSA basically became a runtime/SDK for that - and it's still not even available on Windows, to my astonishment.
 

soresu

Platinum Member
Dec 19, 2014
2,657
1,858
136
13 years after the Fusion announcement, the average end user is lucky if they use a single OpenCL-backed app, and it won't treat the integrated GPU in any special way. Don't trust the marketing hype. Such is life.
That can be attributed to AMD having to all but abandon it due to Bulldozer's repeated problems and the consequent Zen fast-track soaking up the R&D budget - almost certainly also the reason there was no post-Piledriver server chip before Epyc landed.

Similarly, GCN is a truly sorry mess of lost opportunities, because AMD lacked the budget freedom to pursue a more aggressive software development initiative to make it as attractive as CUDA solutions for professionals.

From what I can gather, GCN was at least as competitive as Fermi derivatives in sheer compute power per watt over its reign from Southern Islands to Vega - with the exception of AI uses, starting with the introduction of tensor cores on Volta.

As far as the average user and OpenCL go, that depends on whether they ever use Adobe Premiere for video editing - its Mercury engine was until recently OpenCL-based, I think, and was recently ported to Vulkan using the CLSPV project to allow it to run on mobile OSes.

nVidia's support of OpenCL has historically been poor, not only in feature set (still 1.2) but in performance relative to its CUDA framework - this seems intentional in my opinion; nothing says your proprietary solution is better quite like keeping the competing open solution worse/underdeveloped.
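For reference, the version a vendor's OpenCL platform reports can be checked through the standard OpenCL API - a minimal sketch (plain OpenCL platform-query calls; error handling omitted):

```cpp
#include <CL/cl.h>
#include <cstdio>

// List each installed OpenCL platform and the version string it reports.
// On NVIDIA drivers of this era the output was typically "OpenCL 1.2 CUDA ...".
// Error handling omitted for brevity.
int main() {
    cl_uint count = 0;
    clGetPlatformIDs(0, nullptr, &count);
    if (count == 0) {
        std::printf("No OpenCL platforms found\n");
        return 0;
    }
    if (count > 8) count = 8;

    cl_platform_id platforms[8];
    clGetPlatformIDs(count, platforms, nullptr);

    for (cl_uint i = 0; i < count; ++i) {
        char name[256]    = {0};
        char version[256] = {0};
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME,
                          sizeof(name), name, nullptr);
        clGetPlatformInfo(platforms[i], CL_PLATFORM_VERSION,
                          sizeof(version), version, nullptr);
        std::printf("%s: %s\n", name, version);
    }
    return 0;
}
```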
 

yuri69

Senior member
Jul 16, 2013
387
617
136
Sure, budget problems are ever-present at AMD. However, there is another PoV on the "Fusion claims".

Look at the backwards compatibility/ecosystem/inertia. The IBM Cell approach failed miserably - it was hard to program for. It was a new paradigm. AMD would have needed to pull off a "successful Cell" on a global scale. That would have been an enormous effort.
 
Mar 11, 2004
23,073
5,554
146
Well, I checked an article on Wikichip and was shocked to see that, according to TSMC's estimate, density will increase by approximately 85%!
This is probably for SRAM, so logic may not scale quite as well.

[Image: tsmc-7nm-density-q2-2019.png - TSMC 7nm density figures, Q2 2019]


This gives AMD a lot of xtors to play with for Zen 4. A 20% bump in chiplet size would double the number of CCXs per CCD, if AMD wants to go that way.
AMD may want to go for a wider core and increased cache instead. Lots of decisions to be made (well, AMD has already made them!).



HBM is way too slow compared to L3 cache, for example. The latency difference is huge (don't have the numbers on me, but HBM is at least an order of magnitude slower, and it would have to route through the I/O chip to service all CCXs).

I believe they've said something like a 50% density improvement over N7 (it was in one of the comparisons in, I think, the AT articles), so it should be quite a bit. They seem to have focused a lot on density, as power and performance don't go up nearly as much. I'm assuming there's a reason for that.

From what I recall seeing, it isn't. (I don't remember where I saw it, but while HBM is definitely slower than L1 and L2, it wasn't that far off the higher-level caches - although maybe it was being compared to an L4, and I'm sure the picture changes some if it has to route through the I/O die.) Add to that HUGE bandwidth and much larger capacity compared to cache, which if properly exploited I think would more than make up for the latency issues. I think there's a movement to decouple the cache somewhat as well. Plus, sitting near the I/O die likely has benefits (say for mixing CPU and GPU, or for serving as a buffer to storage). It could possibly let them limit cache size increases (caches are bigger than the cores themselves these days), enabling them to cram more cores in.
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
Similarly GCN is a truly sorry mess of lost opportunities, because AMD lacked the budget freedom to pursue a more aggressive software development initiative to make it as attractive as CUDA solutions for professionals.

They now have the HIP API as a real alternative to the CUDA API. It offers modern C++ features and a single-source programming model (OpenCL and all PC gfx APIs are separate-source models), and it's a true low-level API, since HIP kernels are compiled directly into native GCN hardware bytecode with no intermediate representations/languages in between - unlike HLSL shaders -> DXBC/DXIL, Metal shaders -> Apple IR (intermediate representation), OpenCL kernels -> SPIR/SPIR-V, Vulkan shaders -> SPIR-V, or even CUDA kernels -> PTX ...
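To make "single-source" concrete: host and device code live in the same C++ file and are compiled together by hipcc. A minimal vector-add sketch under that assumption (standard HIP runtime calls; error checks omitted, not taken from this thread):

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

// Device kernel and host code in one C++ source file - the "single-source"
// model described above. Built with hipcc; error checks omitted for brevity.
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<float> ha(n, 1.0f), hb(n, 2.0f), hc(n);

    float *da, *db, *dc;
    hipMalloc(reinterpret_cast<void**>(&da), n * sizeof(float));
    hipMalloc(reinterpret_cast<void**>(&db), n * sizeof(float));
    hipMalloc(reinterpret_cast<void**>(&dc), n * sizeof(float));
    hipMemcpy(da, ha.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(db, hb.data(), n * sizeof(float), hipMemcpyHostToDevice);

    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;
    hipLaunchKernelGGL(vec_add, dim3(blocks), dim3(threads), 0, 0,
                       da, db, dc, n);

    hipMemcpy(hc.data(), dc, n * sizeof(float), hipMemcpyDeviceToHost);
    std::printf("c[0] = %.1f\n", hc[0]);  // expect 3.0

    hipFree(da);
    hipFree(db);
    hipFree(dc);
    return 0;
}
```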
 

Yotsugi

Golden Member
Oct 16, 2017
1,029
487
106
From what I can gather, GCN was at least as competitive as Fermi derivatives in sheer compute power per watt over its reign from Southern Islands to Vega
That's not how that works, not at all.
Peak FP != compute.
 

NTMBK

Lifer
Nov 14, 2011
10,232
5,013
136
They now have the HIP API as a real alternative to the CUDA API. It offers modern C++ features and a single-source programming model (OpenCL and all PC gfx APIs are separate-source models), and it's a true low-level API, since HIP kernels are compiled directly into native GCN hardware bytecode with no intermediate representations/languages in between - unlike HLSL shaders -> DXBC/DXIL, Metal shaders -> Apple IR (intermediate representation), OpenCL kernels -> SPIR/SPIR-V, Vulkan shaders -> SPIR-V, or even CUDA kernels -> PTX ...

Compiling to a bytecode is a feature, not a bug. It gives you portability and forwards compatibility.
 
  • Like
Reactions: ryan20fun and Ajay

moinmoin

Diamond Member
Jun 1, 2017
4,944
7,656
136
They seemed to have focused a lot on density as power and performance don't go up nearly as much. I'm assuming there's a reason for that.
That's because density is the primary quality of any node. Density and performance are two sides of the same coin: you start with high density, then trade density back in specific areas to improve performance - that's how it's usually achieved. Intel's various 14nm(++) optimizations are just that.
 

soresu

Platinum Member
Dec 19, 2014
2,657
1,858
136
That's not how that works, not at all.
Peak FP != compute.
Yes, and?

As far as I can remember, precisely the problem for gamers with GCN was nominally superior FLOPS that did not translate into equivalent FPS when compared to an nVidia card with similar FLOPS.

Which dials into what I was talking about: superior compute power that was largely a missed opportunity, due to the slow delivery of a competitive software platform and inferior graphics FPS per FLOP.

The whole point of RDNA is largely to address that FPS-per-FLOP gap while retaining or improving compute efficiency; time will tell how well they execute on that.

Arcturus will seemingly push pure compute efficiency to the max, it will be interesting to see how that works out too.
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
Compiling to a bytecode is a feature, not a bug. It gives you portability and forwards compatibility.

I never claimed that it wasn't, but compiling to an intermediate representation has its drawbacks as well, such as lower potential performance (Nvidia has to keep releasing new versions of PTX to better match the native ISAs of their new GPUs), more potential for compiler bugs to arise due to mismatches in hardware behaviour, arguably harder development, etc ...

Even CUDA isn't perfect in terms of compatibility: implicit warp-synchronous programming, a deprecated practice, can be potentially hazardous(!) on Volta/Turing GPUs or anything of the PTX 6.x family, and Nvidia instead recommends that developers now use explicit warp synchronization with cooperative groups ...
 

soresu

Platinum Member
Dec 19, 2014
2,657
1,858
136
Peak FP != compute.
Hammer that into your brain already.
You keep making an obvious point about something I was not arguing against in the first place.

Of course FLOPS is compute, I did not say otherwise - I think you interpreted my words to mean something else entirely here.

I was talking about hardware AND software - i.e. it didn't matter if AMD had greater peak FP/FLOPS if no one was actually writing code for their cards, because CUDA had essentially created a software monopoly in the GPGPU market.

Hence why nVidia never really supported their OpenCL implementation fully - why bother, if it only helps your competitor in the end?