Speculation: Ryzen 4000 series/Zen 3


Abwx

Lifer
Apr 2, 2011
10,939
3,440
136
I don't think we will be seeing unified DT & server IO chips anytime soon. Is this a stated goal by AMD?

Unified libraries and layouts for the common parts, for instance the IMC circuitry and other IFs, USB and so on.
 

Richie Rich

Senior member
Jul 28, 2019
470
229
76
Next year 5nm EUV production starts at TSMC for small mobile ARM chips. So 5nm will be available for Zen4 in 2021. This basically means doubling the CPU cores per package: desktop 32c/128t, Epyc & TR 112c/448t. :eek:
 
  • Like
Reactions: DarthKyrie

Ajay

Lifer
Jan 8, 2001
15,430
7,849
136
Next year 5nm EUV production starts at TSMC for small mobile ARM chips. So 5nm will be available for Zen4 in 2021. This basically means doubling the CPU cores per package: desktop 32c/128t, Epyc & TR 112c/448t. :eek:
That's a half node, so there won't be a doubling. AMD could just go for more execution resources and larger caches (more throughput per core).
 

soresu

Platinum Member
Dec 19, 2014
2,657
1,858
136
Next year 5nm EUV production starts at TSMC for small mobile ARM chips. So 5nm will be available for Zen4 in 2021. This basically means doubling the CPU cores per package: desktop 32c/128t, Epyc & TR 112c/448t. :eek:
Unless Zen3 and Zen4 bring radical uArch efficiency improvements, that simply is not possible: 5nm's benefits are great for area but only middling for power scaling.

We will probably have to wait for a 3nm node for cores to double again, likely using MBCFET.

We could see a 50% bump in cores though, unless some radical additions to the core bulk it out.
 
Last edited:

Thunder 57

Platinum Member
Aug 19, 2007
2,674
3,796
136
That's a half node, so there won't be a doubling. AMD could just go for more execution resources and larger caches (more throughput per core).

They would almost certainly use the die space to bulk up the core - AVX-512 as well.
 

Thunder 57

Platinum Member
Aug 19, 2007
2,674
3,796
136
Meh, I'd rather AMD run Quad AVX2 and do op-fusion for AVX-512 instructions. I wouldn't follow Intel down the rabbit hole of huge FMA units (plus other functions).

Sure, they could do that. It might be a good idea for them. It depends on where they want to spend their die area.
 

VirtualLarry

No Lifer
Aug 25, 2001
56,327
10,034
126
Meh, I'd rather AMD run Quad AVX2 and do op-fusion for AVX-512 instructions. I wouldn't follow Intel down the rabbit hole of huge FMA units (plus other functions).
Not everything benefits from AVX-512, but more (most?) things benefit from MOAR CORES.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
Not everything benefits from AVX-512, but more (most?) things benefit from MOAR CORES.
I think @Ajay is driving more towards something like... Super-SIMD: 4x AVX2 operations or 2x AVX3 operations, rather than 2x AVX512, 2x AVX256, 2x AVX128.

Or, even better, 8-wide AVX128, 4-wide AVX256, 2-wide AVX512, with no penalties in workloads that mix vector lengths: 2x AVX128 can feed into 1x AVX256, or 4x AVX128 into 2x AVX256, which then folds into 1x AVX512.

Support wider OoO on SIMD while also supporting wider-DLP instructions, and thus get the best of the past/present (legacy/modern workloads) and the future (future workloads).
 
Mar 11, 2004
23,073
5,554
146
That's a half node, so there won't be a doubling. AMD could just go for more execution resources and larger caches (more throughput per core).

It's a half node, but I think there's a significant gain in density compared to the 7nm that Zen 2 is using. It's not double, but there seems to be room for AMD to increase the size of the chiplets (especially on EPYC and Threadripper). Plus, AMD will likely be changing packaging (i.e. socket) somewhere in there, as well as shrinking the I/O die, so they could increase the chiplet sizes themselves. And there are other aspects they can likely play with (the shape of the chiplets, for instance).

Something I've been curious about (since I've been pushing for it a lot): given GF's recent announcement of their new version of 12nm, including a new interposer for HBM (I think HBM3 is supposed to be made on 12nm, so I'm still left wondering what would happen if they integrated the HBM into/onto the I/O die itself, skipping the interposer entirely), would pairing some of that with the I/O die let them reduce cache sizes? That would let them increase core counts and/or the size of the cores. Yes, it would be higher latency than L1, L2 and maybe L3 cache (although I think it wouldn't be that far off from the latter), but that could possibly be mitigated by the throughput (bandwidth) and by pooling the cache (some smart sharing of resources might offset the latency issues). Maybe they could keep L1 and L2 small, replace L3 with a larger pool of HBM that doubles as a buffer to system RAM and maybe even NAND, and so increase throughput significantly, more than making up for the slightly worse latency.
 

DrMrLordX

Lifer
Apr 27, 2000
21,620
10,829
136
Not everything benefits from AVX-512, but more (most?) things benefit from MOAR CORES.

There are some workloads that respond well to SIMD but resist thread-level parallelism. It's also a matter of how you want to spend your power and area budget. One core with 4x256b FMACs will probably take less area and less power to operate than two cores with 2x256b FMACs each. But if you're going to go wider, you have to rethink load/store and suchlike. I would assume that going wider would (overall) use less power than expanding core count.
 

Gideon

Golden Member
Nov 27, 2007
1,625
3,649
136
There are some workloads that respond well to SIMD but resist thread-level parallelism. It's also a matter of how you want to spend your power and area budget. One core with 4x256b FMACs will probably take less area and less power to operate than two cores with 2x256b FMACs each. But if you're going to go wider, you have to rethink load/store and suchlike. I would assume that going wider would (overall) use less power than expanding core count.

My hope is that if AMD indeed goes for SMT-4 (which is far from given), they'll also double the FP side to 4x256b and beef up the integer side as well (say by 25-50%). We have reached the clock-speed wall (at least until the 3nm GAA stuff), and the only way forward is to go wider. Apple could do it, and we know Intel is doing it. AMD must go the same route, and sooner rather than later, to stay competitive once Intel finally manages to get Sapphire Rapids out.

Another wish is that instead of supporting most of AVX-512 extensions AMD would go for something more akin to ARM's SVE (Scalable Vector Extension) which scales from 128 bit to 2048 bit (no matter how wide the underlying hardware itself is).

It would be just insane to introduce another AVX extension set for 1024b, considering how different AVX2 and AVX-512 are (and what a monstrosity the latter is). A scalable extension would at least keep the CPU front end and compiler development stable while the hardware is widened transparently.
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
It would be just insane to introduce another AVX extension set for 1024b, considering how different AVX-2 and AVX-512 are (and what a monstrosity the latter is).

Wow, did something change these past few months? Seems I've been reading a lot of negativity on these forums surrounding AVX-512 lately. I thought AVX-512 was SIMD done right, as there was a lot of fanfare when it first became available.

Now though, there is more criticism than not it seems. And that's not to say that criticism is unwarranted mind you. But my, the times have changed I guess.

If pursuing wider vectors is a mistake, then why does Intel seem hellbent on doing so?
 

soresu

Platinum Member
Dec 19, 2014
2,657
1,858
136
I would assume that going wider would (overall) use less power than expanding on core count
To a point, yes, but just look at AVX-512 and the throttling it causes - CPUs were not meant to be super-wide vector crunchers.

GPUs are more suited to this by design, but sadly APUs are still mostly targeted at gaming rather than actual compute.

Perhaps with Intel getting more serious in that area with oneAPI and Xe, they may push this angle in the future.
 

Gideon

Golden Member
Nov 27, 2007
1,625
3,649
136
If pursuing wider vectors is a mistake, then why does Intel seem hellbent on doing so?

Wider vectors themselves aren't the problem; inventing new, incompatible extensions every time you go wider is. If Intel had done something like SVE (which scales from 128 bit to 2048 bit) back on, say, Sandy Bridge, Ice Lake would still use the exact same instructions, just run them up to 4x faster (by executing vectors up to 512 bit natively).

And what I meant by AVX-512 being a "monstrosity" ...

just look at the number of new instructions in:
AVX and AVX2
and compare that to new instructions in:
AVX-512 (don't forget to scroll down ... and down ... and down).

And finally, look at the table of CPU compatibility at the bottom:
[image: table of AVX-512 subset support across CPU generations]


Doesn't seem a bit much?
 

Gideon

Golden Member
Nov 27, 2007
1,625
3,649
136
And just to make clearer what SVE means for ARM, here is a quote from the SVE article I linked above:

Rather than specifying a specific vector length, SVE allows CPU designers to choose the most appropriate vector length for their application and market, from 128 bits up to 2048 bits per vector register. SVE also supports a vector-length agnostic (VLA) programming model that can adapt to the available vector length. Adoption of the VLA paradigm allows you to compile or hand-code your program for SVE once, and then run it at different implementation performance points, while avoiding the need to recompile or rewrite it when longer vectors appear in the future. This reduces deployment costs over the lifetime of the architecture; a program just works and executes wider and faster.

So developers can compile their program once today, and in 2035 the very same binary might run on 2048-bit-wide vector units, no problem, no recompilation needed. Versus the reality today, where you have to feature-detect the CPU to decide whether to run the SSE, AVX, AVX2 or AVX-512 code path (and bloat the binary to support all of them), not to mention reprogram your code and ship new binaries every time a new extension appears (e.g. AVX-1024).
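To make the vector-length agnostic style concrete, here is a minimal sketch (my illustration, not code from the thread) using ARM's ACLE intrinsics from arm_sve.h; the same compiled loop runs unchanged whether the hardware vectors are 128 or 2048 bits wide, with the predicate handling the tail instead of a scalar cleanup loop:

```c
#include <arm_sve.h>
#include <stddef.h>
#include <stdint.h>

/* y[i] += a * x[i], written without any assumption about the hardware
   vector width. svcntw() reports how many 32-bit lanes this CPU has.
   Build with something like -march=armv8-a+sve. */
void saxpy_sve(float a, const float *x, float *y, size_t n)
{
    for (uint64_t i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32_u64(i, (uint64_t)n); /* masks the tail  */
        svfloat32_t vx = svld1_f32(pg, &x[i]);
        svfloat32_t vy = svld1_f32(pg, &y[i]);
        vy = svmla_n_f32_x(pg, vy, vx, a);               /* vy += vx * a    */
        svst1_f32(pg, &y[i], vy);
    }
}
```

Contrast that with the x86 status quo, where an SSE build, an AVX2 build and an AVX-512 build each need their own code path plus runtime dispatch.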
 

LightningZ71

Golden Member
Mar 10, 2017
1,627
1,898
136
Getting back to 7nm+, 6nm and 5nm for Zen 3/4: it seems to me that AMD could leverage the improved processes to move their higher-end desktop processors to a three-CCD arrangement, with the fourth die being an I/O die done on the lowest-cost 7nm node available. It looks like there would be ample space for a square arrangement of the dies on the AM4 socket. Going to 6nm or 5nm would allow them to make the individual cores slightly larger to accommodate going wider, and if they went with a relaxed-density 7nm process for the I/O die (a lot of it won't scale well with the shrink anyway), their main limiting factor will likely be how densely they can pack the I/O pins on the I/O die. That's three CCDs' worth of IF links to connect, as well as all the board I/O and RAM channel connections it will need to make. Also, that's going to be a whole lot of heat to get from those dies to the heatsink, suggesting that such a solution would be rather clock-limited for both power and thermal reasons.

For Epyc, it'll likely be the same situation. They may be able to squeeze another four CCDs onto the package with a shrink to 6 or 5nm, but how do you get all those links into the I/O die? A shrink of the I/O die to relaxed 7nm may help with its power draw, but it's still going to need to be large to handle all those connections.
 

soresu

Platinum Member
Dec 19, 2014
2,657
1,858
136
Intel does not have a competitive GPU solution
Also, pouring software dev effort into GPU/CPU vector interplay before they had a platform competitive with AMD's would only have ended up strengthening their competitor's hand.

AMD tried this with HSA/ROCm, but their market share at its inception was too low to get others fully on board - that may change now, though, with their phoenix-like rise since Zen in 2017, and now with RDNA and Arcturus to push them forward.

Even so, Intel's oneAPI could end up muddying the waters, so who knows how it will play out.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,346
1,525
136
Wow, did something change these past few months? Seems I've been reading a lot of negativity on these forums surrounding AVX-512 lately. I thought AVX-512 was SIMD done right, as there was a lot of fanfare when it first became available.

Now though, there is more criticism than not it seems. And that's not to say that criticism is unwarranted mind you. But my, the times have changed I guess.

If pursuing wider vectors is a mistake, then why does Intel seem hellbent on doing so?

As a counterpoint to Gideon's reply above: as a programmer, I think that AVX-512 is the best-designed SIMD extension to x86 ever, and the sooner we get it in every new x86 CPU so that I can actually start using it, the better. A lot of those new instructions are in there because Intel finally took its time and did it right, with a nice, complete and orthogonal set.

However, there are very few situations where the vectors *really* need to be that long, and a lot of situations where having less SIMD width in the CPU is beneficial. My dream solution, what I'd really like AMD to do, is to keep the current vector ALU widths but widen the registers to 512 bits and fully implement the AVX-512 instruction set, splitting full-width vectors across two ALUs, like they used to do with AVX on Zen 1.

A lot of the flak AVX-512 has gotten recently is because the down-clocking scheme used by the early Intel CPUs that had it was really -Redacted-. There is nothing fundamentally wrong with clocking a CPU down a bit to run wider vectors; clocking down even 30% to double the vector width is still a substantial throughput win for vector-heavy code! The problem was that Intel eagerly clocked down when it encountered almost any AVX-512 ops and took a long time clocking back up, which means that code with a few of them sprinkled around lots of scalar code runs all that scalar code at the lower clock, which really sucks.

But anyway, I would caution against looking at the vector units alone and assuming that widening them, or adding more of them, is automatically good. Most code that matters is still scalar, and will probably stay that way. Making the vector units wider means the proportion of time spent in vector code goes down, so every doubling of width gives less real gain than the last. And widening the execution units themselves doesn't even help all that much; they need to be kept fed, and a memory/cache interface optimized to keep a very wide vector machine happy is quite different from one meant to satisfy a fast scalar machine. So optimizing too far for vectors can be a pessimization for scalar code.
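As a rough illustration of the per-lane masking being praised here (my sketch, not code from the post; it assumes only AVX-512F and <immintrin.h>), the same kind of loop written with mask registers, where the tail is handled by a mask rather than a scalar epilogue:

```c
#include <immintrin.h>
#include <stddef.h>

/* y[i] += a * x[i], using AVX-512 mask registers so the loop tail needs no
   separate scalar cleanup. Build with -mavx512f. */
void saxpy_avx512(float a, const float *x, float *y, size_t n)
{
    __m512 va = _mm512_set1_ps(a);
    for (size_t i = 0; i < n; i += 16) {
        size_t left = n - i;
        __mmask16 m = (left >= 16) ? (__mmask16)0xFFFF
                                   : (__mmask16)((1u << left) - 1);
        __m512 vx = _mm512_maskz_loadu_ps(m, &x[i]);
        __m512 vy = _mm512_maskz_loadu_ps(m, &y[i]);
        vy = _mm512_mask3_fmadd_ps(vx, va, vy, m); /* active lanes: vx*a + vy */
        _mm512_mask_storeu_ps(&y[i], m, vy);
    }
}
```

Whether this executes at full width or gets split across two 256-bit ALUs, as suggested above for AMD, is then purely an implementation choice invisible to the instruction set.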

Profanity in the technical forums is not allowed.

Thanks,
Daveybrat
AT Moderator
 
Last edited by a moderator:

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
Wider vectors themselves aren't the problem. Inventing new incompatible extensions, every time you go wider, is. If Intel had done something like SVE (which scales from 128bit to 2048 bit) say, on Sandy Bridge, Ice-lake would still use the exact same instructions, just be 4x faster (by executing vectors up to 512 bit natively).

I'm not a programmer or industry professional by any means, but from my limited understanding, I thought that AVX-512 already had a variable-length extension (VL). Isn't that what VL does?

Also from that chart you posted, it seems there is a lot of product segmentation when it comes to the AVX-512 instruction set.

I also found this interesting technical thread on the Intel developer forums from way back, discussing AVX-512 and its impact and potential future.
 

DrMrLordX

Lifer
Apr 27, 2000
21,620
10,829
136
My hope is that if AMD indeed goes for SMT-4 (which is far from given)

I'm more interested in whether they can squeeze more performance out of SMT2. Currently, AMD's implementation of SMT2 is good for maybe a 25-30% increase in throughput. In theory they should be able to get more out of it on a wider core.

Another wish is that instead of supporting most of AVX-512 extensions AMD would go for something more akin to ARM's SVE (Scalable Vector Extension) which scales from 128 bit to 2048 bit (no matter how wide the underlying hardware itself is).

I concur, but I've spoken about that already.

- Intel does not have a competitive GPU solution.

Yet. Xe is coming. Will it be competitive? We don't know. Intel has attempted all manner of computing doodads intended to compete with dGPUs in functions currently dominated by enterprise GPUs. Xe might replace a lot of their own projects, or consolidate them into a tighter family of products.

I thought that AVX-512 already had a variable length extension (VL). Isn't that what VL does?


That's what VL(E) does. All it really does is let a developer use some AVX-512 features at shorter vector lengths. SVE2 seems to be more flexible than that.
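For what it's worth, a tiny sketch (my example, assuming a CPU with AVX-512F and AVX-512VL) of what VL buys you: the mask registers and AVX-512-style instruction forms become usable on 256-bit (and 128-bit) registers, but the code is still written for one fixed width, unlike SVE's length-agnostic model:

```c
#include <immintrin.h>

/* Per-lane predicated add on a 256-bit register, which AVX-512VL permits.
   Inactive lanes keep the value from 'a'. Build with -mavx512f -mavx512vl. */
__m256 masked_add_256(__mmask8 k, __m256 a, __m256 b)
{
    return _mm256_mask_add_ps(a, k, a, b);
}
```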
 

Richie Rich

Senior member
Jul 28, 2019
470
229
76
Also, 5nm is a FULL node (0.7x linear scaling, 0.5x area, theoretically double the transistors). The half node is 6nm.
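For reference, the arithmetic behind that rule of thumb (my illustration, not from the post):

```latex
\frac{A_{\text{new}}}{A_{\text{old}}} = 0.7^{2} = 0.49 \approx 0.5
\quad\Longrightarrow\quad
\text{transistor density} \approx \frac{1}{0.49} \approx 2\times
```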

Instruction sets like AVX-512 are part of the "dirty game". Intel worked on AVX-512 and its HW FPU for four years, but only introduced it after development was finished, which means your competitors are out of the battle for the next four years. A SIMD extension like ARM's SVE would be great for x86 too, but for obvious reasons it's unpopular. That's why x86 development is held back and ARM gains momentum. No wonder the first 6xALU CPU core is an ARM chip.


Maybe AMD's Family 19h Zen 3 will bring a new instruction set based on SVE, developed by AMD. Imagine an x86 SVE which scales from 128 to 8192 bits. And AMD could implement a HW bypass to run the calculation on the GPU: no SW GPGPU needed, no special programming knowledge, just a pure HW solution based on the available HW resources. That would be a killer feature.
 
  • Like
Reactions: DarthKyrie