Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)

Vattila · Oct 6, 2019

Except for the details about the improvements in the microarchitecture, we now know pretty well what to expect with Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging looks to be the same, with the same 9-die chiplet design and the same maximum core and thread-count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides did also show that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5), with new memory support (likely DDR5).

What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts!

RnR_au · Jul 8, 2022

MadRat said:
I'm starting to think AVX512 should be a co-processor.

Ryzen 7600Xsx... hmm

DrMrLordX · Jul 9, 2022

Moving SIMD to a co-processor would break existing SIMD programming models. Terrible idea.

igor_kavinski · Jul 9, 2022

DrMrLordX said:
Terrible idea.

Intel could do it by making their Thread Director a bit more intelligent and move AVX-512 instruction execution to the co-processor when required. There might be some latency disadvantage in doing that but the compute cores won't have to be downclocked whenever AVX-512 pipelines start churning. Or you can say, "Even more terrible of an idea!"

DrMrLordX · Jul 9, 2022

igor_kavinski said:
Intel could do it by making their Thread Director a bit more intelligent and move AVX-512 instruction execution to the co-processor when required. There might be some latency disadvantage in doing that but the compute cores won't have to be downclocked whenever AVX-512 pipelines start churning. Or you can say, "Even more terrible of an idea!"

I don't even think that would work. Intel has gone well out of their way to disable AVX512 on Alder Lake consumer products. Even if you haven't gotten a microcode update disabling AVX512, it requires the e-cores to be disabled.

In any case, something like SVE2 would make the argument mostly moot. Pity the x86 world hasn't licensed it yet.

Mopetar · Jul 9, 2022

Intel only did that because the e-cores don't support AVX512 and they didn't have a way to ensure that programs trying to use those instructions wouldn't end up running on those cores.

Maybe there is a solution that they can eventually come up with in time, but in the short term it was just easier to disable the functionality. It's not as though it's widely used at this point.

Really though they should have left users the option to enable it at the expense of disabling the e-cores. One of the few benchmarks where Intel was able to dominate AMD was AVX512 and when Zen 4 launches it's going to be the opposite.

coercitiv · Jul 9, 2022

Mopetar said:
Really though they should have left users the option to enable it at the expense of disabling the e-cores.

It's kinda' still there, in the sense that mobo makers probably use workarounds to make it work.

My UEFI isn't the latest but it ain't that old either, released in March just before the 12900KS compatibility UEFI update. Enabling AVX512 requires an extra step in UEFI configuration than it previously did on Alder Lake launch.

igor_kavinski · Jul 9, 2022

coercitiv said:
It's kinda' still there

Not for later steppings with the AVX-512 fused off (or so Intel says).

igor_kavinski · Jul 9, 2022

coercitiv said:
Enabling AVX512 requires an extra step in UEFI configuration than it previously did on Alder Lake launch.

Is it easy to do? What are the exact steps in your case?

moinmoin · Jul 9, 2022

That AVX512 mess is also flabbergasting. How was that originally even supposed to work to make the area spent for enabling AVX512 in the cores worth it? I mean Intel obviously knew from the beginning that the ISA support would divert between P and E cores. They introduce E cores essentially solely for being area efficient, but keep fused off ability in the oh so area inefficient P cores?

igor_kavinski · Jul 9, 2022

They probably had some solution in mind but there wasn't enough time to implement it properly. They wanted to rush out ADL due to Zen 3 eating their lunch.

jpiniero · Jul 9, 2022

moinmoin said:
That AVX512 mess is also flabbergasting. How was that originally even supposed to work to make the area spent for enabling AVX512 in the cores worth it? I mean Intel obviously knew from the beginning that the ISA support would divert between P and E cores. They introduce E cores essentially solely for being area efficient, but keep fused off ability in the oh so area inefficient P cores?

Originally Intel intended to make it work with a software solution. They gave up on it a long time ago.

igor_kavinski · Jul 9, 2022

jpiniero said:
Originally Intel intended to make it work with a software solution.

Source?

DrMrLordX · Jul 9, 2022

Mopetar said:
Intel only did that because the e-cores don't support AVX512

Remember @MadRat was discussing the possibility of removing SIMD functionality altogether and moving it to a coprocessor. To the best of my recollection, x86 CPUs haven't done that since the 286/386SX days when x87 wasn't even supported without a math coprocessor. Not sure but compilers would probably have to be redone to support a coprocessor. And if AMD doesn't go that route (which they won't) then it would be a bad look for Intel.

Elsewhere in the industry we have "big" APUs emerging on AMD's roadmap along with Fujitsu's A64FX. AVX512 coprocessors wouldn't make a whole lot of sense.

NostaSeronx · Jul 9, 2022

FPU is already a co-processor. => "The floating-point unit (FPU) utilizes a coprocessor model for all operations"

The only architectures with integrated SIMD/FPU functionality is K5/K6. All architectures past it have it closely attached but not fully integrated.

K5 = Integrated
K6 = Integrated
K7 = Co-processor
K8 = Co-processor
K9/Greyhound = Co-processor
K10/Bulldozer = Co-processor
Steamroller = Co-processor
Zen = Co-processor
Zen3 = Co-processor

Co-processor allows them to have a different FPU in different Models within the same Family.
FPU change in BD->SR = Different model, same family
FPU change in Zn->Zn2 = Different model, same family
FPU change in Zn2->Zn2-lite = Different model, same family
FPU change in Zn3->Zn4 = Different model, same family

Fully Integrated FPU = K5/K6
Integrated Co-processor = K7+ (design has to share space with actual core)
Discrete Co-processor = not yet (separate microarchitecture, no shared space with FE/LSU/Core/etc)

Of which, Discrete Co-processor => As like the Alpha's Tarantula or RISC-V's Hybrid-Superscalar-Vector Modified-Ara unit.

As well as discrete co-processor doesn't necessarily need to be shared. Where a discrete co-processor can be stacked over the cores. Which can be done at CPU(SIMD/FPU) on CPU(General Purpose Int) point of 3d-stacking roadmap.

General Purpose CPU-layer = AMD64 decoders/microcode for Baseline(x86->x86-64)
Vector(FPU/SIMD) CPU-layer = AMD64 decoders/microcode for Extensions(x87->EVEX)

>6 one-set decoders+extra area for wider GP ALUs/AGUs (GP-layer) + >4 another-set decoders+extra area for extra AVX512 units (Vec-layer). Is preferred over ~same decoders/units just that the aggregate FPU is shunted down to a fifth core on same layer.

Zen family is HPC so any reduction in speed is negative. Performance is absolute, W/mm2/$ is constrained. Zen's performance gains are faster than W/mm2/$ growth.
Only for ULP (NY+etc ULP Cores team) is W/mm2/$ is absolute, and performance is constrained. => https://patents.google.com/patent/US6944744B2/en (ULP Grid Arch, Fam 24h has been pushed to an earlier timeline: 2H2025/1H2026 tapeout/prod -> 2H2023/1H2024 tapeout/prod:: 2022 ULP Arch has been greenlit for GloFo 12FDX-NY)

A shared discrete co-processor is a big no for Zen. Only performance increases are in the outlook for Zen.

coercitiv · Jul 9, 2022

moinmoin said:
They introduce E cores essentially solely for being area efficient, but keep fused off ability in the oh so area inefficient P cores?

Remember they did the same with Lakefield. In that case they even went on record declaring the AVX-512 unit was removed from the Sunny Cove core, only for it to be identified later in die shots. To me the path AMD took with ZenC makes a lot more sense. The density jump may not be that impressive but the results are predictable, AMD can iterate on the design and later on may deviate the core architecture as well. Last but not least, the mission of the design is clear, solution works from day one and has good impact in the market.

To borrow an idea presented initially by Wendell from Level1Techs, AMD seems to behave like a company that goes out into the "wild", asks it's customers what would make them happier, then executes towards that goal. Both the 3D cache and ZenC variants seem to have been born this way, and what's very interesting about these changes is that even though they serve opposite ends of the business market through different technological solutions, they do share one common trait: performance gains are instant and customers don't have to lift a finger to integrate these new products in their workflow.

On the opposite side we have Intel and their hybrid approach. Lots of potential for compute density and a promise of best lightly threaded performance. Looks like a game changer. The only problem is customers need to adapt to the solution. Server simply doesn't work, workstation feels wonky, and consumer products don't really feel the benefits of the change from day one. True impact of this change will come multiple generations later, giving the competition ample time to respond.

It boggles the mind to realize that Intel had E core IP in development for so many years, including server chips based on this IP, and yet it is AMD who's selling "smaller" cores to (big server) customers first. I threw an error just as I was writing the last paragraph.

eek2121 · Jul 9, 2022

Intel and AMD both just need to find a way to have the AVX sections of the chip clock independently of the rest of the chip. Shoot, for all we know AMD may be doing this already. We actually know very little about how Zen 4 handles this.

nicalandia · Jul 9, 2022

Amd Genoa source: Yuuki_AnS @ yuuki_ans

Schmide · Jul 9, 2022

I'm just going to say you can't decouple an extension to an instruction set.

The simd registers and lanes exist in the same space. If you do an SSE operation it executes in the lower lane of the 2 lane AVX register as AVX occupies the lower 2 lanes of the AVX512 registers. You or the compiler must take care to preserve the upper lanes as they execute concurrently when less wide instructions execute. This is the reason you often pay a penalty for mixing SSE/AVX/AVX512 in the same code sequence. Moreover, since all modern 64 bit processors use the SSE registers for basic operations, the same relationship holds true for operations within the first SSE lane.

nicalandia · Jul 9, 2022

Coming up this week... Sapphire Rapids vs Genoa courtesy of Yuuki_AnS @ yuuki_ans , this is going to be a QS sample SPR

deasd · Jul 10, 2022

He just had a modification of the Genoa list, claimed the clocks are conservative all-core turbo given by OEM materials, so....... it looks to be just speculation. And reputation of that guy seems doubtful......

Timmah! · Jul 10, 2022

Regarding the v-cache chips, do you:

- think they solved the clocks issue with it and there wont be such a big difference in clocks between the regular and 3d version, as is the case of 5800x/58003d? Since the high clocks and resulting performance uplift seem to be the main point of improvement over Zen3, it would be rather disappointing to get 3d version and end up with zen3 clocks…

- if that happens though, does anyone here now, between 5800x/58003d, which one provides better viewport performance in CAD apps (specifically autocad, 3dsmax)? I know v-cache suits better to games, in general, but my interest and the reason to upgrade is the performance in these apps. Especially 3dsmax can slow down significantly, when dealing with bigger models…

igor_kavinski · Jul 10, 2022

Cache-Rich: AMD Ryzen 7 5800X3D Workstation Performance Review

First seen in its server-bound Milan-X EPYC, AMD's brought its 3D V-Cache technology to consumers with the new Ryzen 7 5800XD. With triple the L3 cache vs. the original 5800X, the right workloads could exhibit a notable performance-boost. For our first look at the 5800X3D, we're tackling our...

techgage.com

In 3dsmax rendering, it offers no benefits.

Cache-Rich: AMD Ryzen 7 5800X3D Workstation Performance Review

First seen in its server-bound Milan-X EPYC, AMD's brought its 3D V-Cache technology to consumers with the new Ryzen 7 5800XD. With triple the L3 cache vs. the original 5800X, the right workloads could exhibit a notable performance-boost. For our first look at the 5800X3D, we're tackling our...

techgage.com

There are bandwidth benefits in choosing the 5800X3D for multicore workloads. But the 5950X seems to be the best in this regard. Might be why it excels in pure parallel computations. 5900X3D might end up being better than 5900X if they can keep the clocks same.

Also, notice how the 12900K suffers miserably in some of the cache bandwidth tests. It's being held back.

Timmah! · Jul 10, 2022

igor_kavinski said:
Cache-Rich: AMD Ryzen 7 5800X3D Workstation Performance Review

First seen in its server-bound Milan-X EPYC, AMD's brought its 3D V-Cache technology to consumers with the new Ryzen 7 5800XD. With triple the L3 cache vs. the original 5800X, the right workloads could exhibit a notable performance-boost. For our first look at the 5800X3D, we're tackling our...

techgage.com

In 3dsmax rendering, it offers no benefits.

Cache-Rich: AMD Ryzen 7 5800X3D Workstation Performance Review

First seen in its server-bound Milan-X EPYC, AMD's brought its 3D V-Cache technology to consumers with the new Ryzen 7 5800XD. With triple the L3 cache vs. the original 5800X, the right workloads could exhibit a notable performance-boost. For our first look at the 5800X3D, we're tackling our...

techgage.com

View attachment 64281
There are bandwidth benefits in choosing the 5800X3D for multicore workloads. But the 5950X seems to be the best in this regard. Might be why it excels in pure parallel computations. 5900X3D might end up being better than 5900X if they can keep the clocks same.

Also, notice how the 12900K suffers miserably in some of the cache bandwidth tests. It's being held back.

thank you, will study the links. Just wanted to point out i was not asking about rendering performance, or multicore performance for that matter, as i do gpu rendering anyway. i meant viewport performance of the app, when you build the 3d model before rendering and need to zoom in/out, rotate and pan constantly. This, as far i know, is a single core/ threaded task, and it can get pretty choppy/ stuttery at times.

igor_kavinski · Jul 10, 2022

Timmah! said:
This, as far i know, is a single core/ threaded task

SPECviewperf 13: A Look At Viewport Performance In SolidWorks...

Continuing our workstation performance analysis, we're tossing our usual fleet of GPUs against eight separate tests, all made possible through SPECgpc's SPECviewperf 13. With this testing, we can show current state of viewport performance in SolidWorks, CATIA, Maya, 3ds Max, Creo, and Siemens NX...

techgage.com

These guys say it needs a Pro level card for better performance.

Also,

https://www.thesourcecad.com/improve-autocad-performance/

AutoCAD can use that extra processor to improve the speed of operations such as zoom which redraws or regenerates the drawing. There will be a slight acceleration when you are working with large drawings if you set this variable to 3.

More CPU cores may help. Check the best answer here: https://community.spiceworks.com/topic/540277-slow-autocad-screen-refresh

Faster storage (RAID 0 NVMe SSD) may also help.

Timmah! · Jul 10, 2022

igor_kavinski said:
SPECviewperf 13: A Look At Viewport Performance In SolidWorks...

Continuing our workstation performance analysis, we're tossing our usual fleet of GPUs against eight separate tests, all made possible through SPECgpc's SPECviewperf 13. With this testing, we can show current state of viewport performance in SolidWorks, CATIA, Maya, 3ds Max, Creo, and Siemens NX...

techgage.com

These guys say it needs a Pro level card for better performance.

Also,

https://www.thesourcecad.com/improve-autocad-performance/

More CPU cores may help. Check the best answer here: https://community.spiceworks.com/topic/540277-slow-autocad-screen-refresh

Faster storage (RAID 0 NVMe SSD) may also help.

Thanks for the links.

Checked the Techgage article, and despite the Pro level card being recommended, if you look here:

https://techgage.com/wp-content/uploads/2018/07/SPECviewperf-13-AMD-vs-NVIDIA-3ds-Max-Performance.png

Geforce has more or less the same performance as Quadro, at least in 3Ds max. Anyway, Quadro is outside of my budget and i need GPUs primarily for the rendering, so i am looking for the highest-end geforces, which provide the same performance, but for significantly less money.

On topic of more CPU cores being helpful, this i did not know, it surprises me. Though i doubt this would scale over 16 cores, which would be my baseline.

Regarding storage - this would no doubt help, but IMO moreso with scene saving/loading times, rather than viewport performance during actual work, when the scene is entirely loaded inside RAM. At least i presume that.
This is actually one of the things that slow down my workflow the most, because i keep my work data on regular HDD (WD Caviar Black from 2011). I was thinking to keep the most recent stuff on the M2 disk, to improve on this, and only move it away to said HDD storage, once its done and not being worked on anymore, but there is issue with path to external files like textures, xrefs and whatnot - trying to load such scene from different disk is a pain, as its missing that stuff and it needs to be linked manually and whatnot...

BTW do we know when the PCI-E 5.0 m2 drives are going to be up for purchase? And what pricing to expect? I looked at current offering and Samsing 980 Pro 1TB is 159 EUROs with VAT around here. Which is acceptable, i guess. But if the PCI-E 5.0 replacement of this wont be same price, but significantly more, that will be indeed very disappointing.

Last thing, i looked more into this CPU affecting viewport performance myself and found this article:

UI and Viewport performance lags or seems slow on fast multicore CPUs in 3ds Max | 3ds Max | Autodesk Knowledge Network

3ds Max lags or shows a delay in performance, even on fast CPUs. This includes, but is not limited to Intel i9 and AMD 'Ryzen Threadrippers'. This includes clicking any area of the user interface or moving objects, animation playback (FPS, or frames per second) in the viewport. When trying to...

knowledge.autodesk.com

This part:

Under certain PC hardware configurations, the multicore CPU's cache memory access may become bottle-necked when performing certain calculations. The CPUs may run fast as long as they can hit the data they need directly (in the cache), but can become stalled when hitting a "cache miss.".

makes it sound that more cache may indeed matter, so perhaps i should be looking at the v-cache version, even if it has slower clocks.

Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)

Senior member

Platinum Member

Lifer

Lifer

Lifer

Diamond Member

Diamond Member

Lifer

Lifer

Diamond Member

Lifer

Lifer

Lifer

Lifer

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Senior member

Golden Member

Lifer

Golden Member

Lifer

Golden Member