Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
799
1,351
136
Apart from the details of the microarchitectural improvements, we now know pretty well what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least) and will come with a new platform (SP5) and new memory support (likely DDR5).

[Attachment: Untitled2.png]


What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 
Last edited:
  • Like
Reactions: richardllewis_01
Jul 27, 2020
16,165
10,240
106
Terrible idea.
Intel could do it by making their Thread Director a bit more intelligent and moving AVX-512 instruction execution to the co-processor when required. There might be some latency disadvantage in doing that, but the compute cores won't have to be downclocked whenever the AVX-512 pipelines start churning. Or you can say, "Even more terrible of an idea!" :D
 

DrMrLordX

Lifer
Apr 27, 2000
21,620
10,830
136
Intel could do it by making their Thread Director a bit more intelligent and moving AVX-512 instruction execution to the co-processor when required. There might be some latency disadvantage in doing that, but the compute cores won't have to be downclocked whenever the AVX-512 pipelines start churning. Or you can say, "Even more terrible of an idea!" :D

I don't even think that would work. Intel has gone well out of their way to disable AVX512 on Alder Lake consumer products. Even if you haven't gotten a microcode update disabling AVX512, using it requires the E-cores to be disabled.

In any case, something like SVE2 would make the argument mostly moot. Pity the x86 world hasn't licensed it yet.
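
Coming back to the fuse-off point above: it's mostly invisible to well-behaved software anyway, since AVX512 code is supposed to be gated behind a runtime feature check. A rough sketch of that check in C (GCC/Clang style; a hypothetical illustration under those assumptions, not how any particular application does it):

```c
/* Hypothetical sketch: the usual runtime gate for AVX-512 code on
 * GCC/Clang x86-64. On a part where the feature is fused off, or where
 * the OS hasn't enabled the ZMM state, this reports "no" and the
 * program is expected to fall back to AVX2/SSE paths. */
#include <cpuid.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t xcr0(void)
{
    uint32_t lo, hi;
    __asm__ volatile ("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return ((uint64_t)hi << 32) | lo;
}

static int avx512f_usable(void)
{
    unsigned int a, b, c, d;

    /* CPUID.1:ECX bit 27 = OSXSAVE; without it XGETBV can't be trusted. */
    if (!__get_cpuid(1, &a, &b, &c, &d) || !(c & (1u << 27)))
        return 0;

    /* XCR0 bits 1,2 (XMM/YMM) and 5-7 (opmask, ZMM_Hi256, Hi16_ZMM)
     * must all be enabled by the OS. */
    if ((xcr0() & 0xE6) != 0xE6)
        return 0;

    /* CPUID.(7,0):EBX bit 16 = AVX-512 Foundation. */
    if (!__get_cpuid_count(7, 0, &a, &b, &c, &d))
        return 0;
    return (b & (1u << 16)) != 0;
}

int main(void)
{
    printf("AVX-512F usable: %s\n", avx512f_usable() ? "yes" : "no");
    return 0;
}
```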
 

Mopetar

Diamond Member
Jan 31, 2011
7,835
5,981
136
Intel only did that because the e-cores don't support AVX512 and they didn't have a way to ensure that programs trying to use those instructions wouldn't end up running on those cores.

Maybe there is a solution they can eventually come up with, but in the short term it was just easier to disable the functionality. It's not as though it's widely used at this point.

Really though they should have left users the option to enable it at the expense of disabling the e-cores. One of the few benchmarks where Intel was able to dominate AMD was AVX512, and when Zen 4 launches it's going to be the opposite.
 

coercitiv

Diamond Member
Jan 24, 2014
6,187
11,859
136
Really though they should have left users the option to enable it at the expense of disabling the e-cores.
It's kinda' still there, in the sense that mobo makers probably use workarounds to make it work.

[Attachment: avx512.png]

My UEFI isn't the latest, but it ain't that old either; it was released in March, just before the 12900KS compatibility UEFI update. Enabling AVX512 now takes one more step in the UEFI configuration than it did at Alder Lake launch.
 
  • Like
Reactions: Tlh97 and Mopetar

moinmoin

Diamond Member
Jun 1, 2017
4,944
7,656
136
That AVX512 mess is also flabbergasting. How was that originally even supposed to work, to make the area spent on enabling AVX512 in the cores worth it? I mean, Intel obviously knew from the beginning that ISA support would diverge between P and E cores. They introduced E cores essentially solely for area efficiency, yet kept the (fused-off) capability in the oh-so-area-inefficient P cores?
 
  • Like
Reactions: Tlh97 and coercitiv

jpiniero

Lifer
Oct 1, 2010
14,585
5,209
136
That AVX512 mess is also flabbergasting. How was that originally even supposed to work, to make the area spent on enabling AVX512 in the cores worth it? I mean, Intel obviously knew from the beginning that ISA support would diverge between P and E cores. They introduced E cores essentially solely for area efficiency, yet kept the (fused-off) capability in the oh-so-area-inefficient P cores?

Originally Intel intended to make it work with a software solution. They gave up on it a long time ago.
 

DrMrLordX

Lifer
Apr 27, 2000
21,620
10,830
136
Intel only did that because the e-cores don't support AVX512

Remember, @MadRat was discussing the possibility of removing SIMD functionality altogether and moving it to a coprocessor. To the best of my recollection, x86 CPUs haven't done that since the 286/386SX days, when x87 wasn't even supported without a math coprocessor. Not sure, but compilers would probably have to be redone to support a coprocessor. And if AMD doesn't go that route (which they won't), then it would be a bad look for Intel.

Elsewhere in the industry we have "big" APUs emerging on AMD's roadmap along with Fujitsu's A64FX. AVX512 coprocessors wouldn't make a whole lot of sense.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
FPU is already a co-processor. => "The floating-point unit (FPU) utilizes a coprocessor model for all operations"

The only architectures with fully integrated SIMD/FPU functionality are K5/K6. All architectures after them have it closely attached but not fully integrated.

K5 = Integrated
K6 = Integrated
K7 = Co-processor
K8 = Co-processor
K9/Greyhound = Co-processor
K10/Bulldozer = Co-processor
Steamroller = Co-processor
Zen = Co-processor
Zen3 = Co-processor

Co-processor allows them to have a different FPU in different Models within the same Family.
FPU change in BD->SR = Different model, same family
FPU change in Zn->Zn2 = Different model, same family
FPU change in Zn2->Zn2-lite = Different model, same family
FPU change in Zn3->Zn4 = Different model, same family

Fully Integrated FPU = K5/K6
Integrated Co-processor = K7+ (design has to share space with actual core)
Discrete Co-processor = not yet (separate microarchitecture, no shared space with FE/LSU/Core/etc)

Of these, a discrete co-processor would be something like Alpha's Tarantula or RISC-V's hybrid superscalar-vector modified Ara unit.

A discrete co-processor also doesn't necessarily need to be shared: it could be stacked over the cores, i.e. CPU (SIMD/FPU) stacked on CPU (general-purpose integer) at that point of the 3D-stacking roadmap.

General Purpose CPU-layer = AMD64 decoders/microcode for Baseline(x86->x86-64)
Vector(FPU/SIMD) CPU-layer = AMD64 decoders/microcode for Extensions(x87->EVEX)

>6 decoders of one set plus extra area for wider GP ALUs/AGUs (GP layer), plus >4 decoders of another set plus extra area for extra AVX512 units (Vec layer), is preferred over keeping roughly the same decoders/units and merely shunting the aggregate FPU down to a fifth core on the same layer.

The Zen family targets HPC, so any reduction in speed is negative. Performance is absolute; W/mm2/$ is constrained. Zen's performance gains grow faster than W/mm2/$.
Only for ULP (the NY etc. ULP cores team) is W/mm2/$ absolute and performance constrained. => https://patents.google.com/patent/US6944744B2/en (ULP grid arch; Fam 24h has been moved to an earlier timeline: 2H2025/1H2026 tapeout/prod -> 2H2023/1H2024 tapeout/prod; the 2022 ULP arch has been greenlit for GloFo 12FDX-NY)

A shared discrete co-processor is a big no for Zen. Only performance increases are in the outlook for Zen.
 
Last edited:

coercitiv

Diamond Member
Jan 24, 2014
6,187
11,859
136
They introduced E cores essentially solely for area efficiency, yet kept the (fused-off) capability in the oh-so-area-inefficient P cores?
Remember, they did the same with Lakefield. In that case they even went on record declaring the AVX-512 unit had been removed from the Sunny Cove core, only for it to be identified later in die shots. To me the path AMD took with ZenC makes a lot more sense. The density jump may not be that impressive, but the results are predictable, AMD can iterate on the design, and later on the core architecture may deviate as well. Last but not least, the mission of the design is clear: the solution works from day one and has a tangible impact in the market.

To borrow an idea presented initially by Wendell from Level1Techs, AMD seems to behave like a company that goes out into the "wild", asks its customers what would make them happier, then executes towards that goal. Both the 3D cache and ZenC variants seem to have been born this way, and what's very interesting about these changes is that even though they serve opposite ends of the business market through different technological solutions, they share one common trait: performance gains are instant, and customers don't have to lift a finger to integrate these new products into their workflow.

On the opposite side we have Intel and their hybrid approach. Lots of potential for compute density and a promise of the best lightly threaded performance. Looks like a game changer. The only problem is that customers need to adapt to the solution. Server simply doesn't work, workstation feels wonky, and consumer products don't really feel the benefits of the change from day one. The true impact of this change will come multiple generations later, giving the competition ample time to respond.

It boggles the mind to realize that Intel had E core IP in development for so many years, including server chips based on this IP, and yet it is AMD who's selling "smaller" cores to (big server) customers first.
 

eek2121

Platinum Member
Aug 2, 2005
2,930
4,026
136
Intel and AMD both just need to find a way to have the AVX sections of the chip clock independently of the rest of the chip. Shoot, for all we know AMD may be doing this already. We actually know very little about how Zen 4 handles this.
 

Schmide

Diamond Member
Mar 7, 2002
5,586
718
126
I'm just going to say you can't decouple an extension to an instruction set.

The SIMD registers and lanes exist in the same space. An SSE operation executes in the lower lane of the two-lane AVX register, and AVX in turn occupies the lower two lanes of the AVX512 registers. You or the compiler must take care to preserve the upper lanes, which are carried along whenever narrower instructions execute. This is the reason you often pay a penalty for mixing SSE/AVX/AVX512 in the same code sequence. Moreover, since all modern 64-bit processors use the SSE registers for basic floating-point operations, the same relationship holds true for operations within the first SSE lane.
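
To illustrate that aliasing, here's a tiny C sketch using Intel intrinsics (a hypothetical example, assuming something like -mavx512f on a capable CPU): the XMM view is just the low 128 bits of the YMM view, which in turn is the low 256 bits of the ZMM register, so narrow and wide operations touch the same physical register.

```c
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    /* Fill all eight 64-bit lanes of a ZMM register. */
    __m512d z = _mm512_set1_pd(1.0);

    /* Casts emit no instructions: they just reinterpret the same register. */
    __m256d y = _mm512_castpd512_pd256(z);   /* low 256 bits (YMM view) */
    __m128d x = _mm256_castpd256_pd128(y);   /* low 128 bits (XMM view) */

    /* A 128-bit add touches only the lowest lanes; with legacy SSE
     * encodings the CPU must preserve the "dirty" upper bits, which is
     * where the SSE/AVX mixing penalty comes from. */
    x = _mm_add_pd(x, _mm_set1_pd(2.0));

    /* Compilers emit VZEROUPPER when leaving AVX code so that following
     * legacy SSE code doesn't carry a false dependency on the upper bits. */
    _mm256_zeroupper();

    double out[2];
    _mm_storeu_pd(out, x);
    printf("%f %f\n", out[0], out[1]);       /* prints 3.000000 3.000000 */
    return 0;
}
```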
 
Last edited:
  • Like
Reactions: Mopetar and Thibsie

deasd

Senior member
Dec 31, 2013
516
746
136
He just had a modified version of the Genoa list and claimed the clocks were conservative all-core turbo figures taken from OEM materials, so... it looks to be just speculation. And that guy's reputation seems doubtful...
 
  • Like
Reactions: Tlh97 and ftt

Timmah!

Golden Member
Jul 24, 2010
1,418
630
136
Regarding the v-cache chips, do you:

- think they solved the clocks issue with it, so there won't be such a big difference in clocks between the regular and 3D versions as there is with the 5800X/5800X3D? Since the high clocks and resulting performance uplift seem to be the main point of improvement over Zen 3, it would be rather disappointing to get the 3D version and end up with Zen 3 clocks…

- if that happens though, does anyone here know which of the 5800X/5800X3D provides better viewport performance in CAD apps (specifically AutoCAD and 3ds Max)? I know V-Cache suits games better in general, but my interest and the reason to upgrade is performance in these apps. 3ds Max especially can slow down significantly when dealing with bigger models…
 
Jul 27, 2020
16,165
10,240
106

In 3dsmax rendering, it offers no benefits.


[Attachment: 1657455945093.png]
There are bandwidth benefits in choosing the 5800X3D for multicore workloads. But the 5950X seems to be the best in this regard, which might be why it excels in pure parallel computations. A 5900X3D might end up being better than the 5900X if they can keep the clocks the same.

Also, notice how the 12900K suffers miserably in some of the cache bandwidth tests. It's being held back.
 

Timmah!

Golden Member
Jul 24, 2010
1,418
630
136

In 3dsmax rendering, it offers no benefits.


View attachment 64281
There are bandwidth benefits in choosing the 5800X3D for multicore workloads. But the 5950X seems to be the best in this regard, which might be why it excels in pure parallel computations. A 5900X3D might end up being better than the 5900X if they can keep the clocks the same.

Also, notice how the 12900K suffers miserably in some of the cache bandwidth tests. It's being held back.

Thank you, I will study the links. Just wanted to point out I was not asking about rendering performance, or multicore performance for that matter, as I do GPU rendering anyway. I meant viewport performance of the app, when you build the 3D model before rendering and need to zoom in/out, rotate and pan constantly. This, as far as I know, is a single-core/threaded task, and it can get pretty choppy/stuttery at times.
 
Jul 27, 2020
16,165
10,240
106
This, as far as I know, is a single-core/threaded task

These guys say it needs a Pro level card for better performance.

Also,


AutoCAD can use that extra processor to improve the speed of operations such as zoom which redraws or regenerates the drawing. There will be a slight acceleration when you are working with large drawings if you set this variable to 3.

More CPU cores may help. Check the best answer here: https://community.spiceworks.com/topic/540277-slow-autocad-screen-refresh

Faster storage (RAID 0 NVMe SSD) may also help.
 

Timmah!

Golden Member
Jul 24, 2010
1,418
630
136

These guys say it needs a Pro level card for better performance.

Also,




More CPU cores may help. Check the best answer here: https://community.spiceworks.com/topic/540277-slow-autocad-screen-refresh

Faster storage (RAID 0 NVMe SSD) may also help.


Thanks for the links.

Checked the Techgage article, and despite the Pro level card being recommended, if you look here:



GeForce has more or less the same performance as Quadro, at least in 3ds Max. Anyway, Quadro is outside of my budget and I need GPUs primarily for rendering, so I am looking at the highest-end GeForces, which provide the same performance but for significantly less money.

On the topic of more CPU cores being helpful: this I did not know, and it surprises me. Though I doubt this would scale beyond 16 cores, which would be my baseline.

Regarding storage - this would no doubt help, but IMO more so with scene saving/loading times than with viewport performance during actual work, when the scene is entirely loaded in RAM. At least I presume that.
This is actually one of the things that slows down my workflow the most, because I keep my work data on a regular HDD (WD Caviar Black from 2011). I was thinking of keeping the most recent stuff on the M.2 disk to improve on this, and only moving it to said HDD storage once it's done and not being worked on anymore, but there is an issue with paths to external files like textures, xrefs and whatnot - trying to load such a scene from a different disk is a pain, as it's missing that stuff and everything needs to be relinked manually...

BTW, do we know when the PCI-E 5.0 M.2 drives are going to be up for purchase? And what pricing to expect? I looked at the current offerings and the Samsung 980 Pro 1TB is 159 euros with VAT around here, which is acceptable, I guess. But if the PCI-E 5.0 replacement for it won't be the same price, but significantly more, that will indeed be very disappointing.

Last thing: I looked more into how the CPU affects viewport performance myself and found this article:


This part:

  • Under certain PC hardware configurations, the multicore CPU's cache memory access may become bottlenecked when performing certain calculations. The CPUs may run fast as long as they can hit the data they need directly (in the cache), but can become stalled when hitting a "cache miss".

makes it sound like more cache may indeed matter, so perhaps I should be looking at the V-Cache version, even if it has lower clocks.
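
Purely as an illustration of that quoted point (a hypothetical microbenchmark, not tied to any CAD app): the C sketch below walks the same array once sequentially and once as a dependent pointer chain. Once the working set exceeds the L3, the chained walk stalls on a cache miss at nearly every step, which is exactly the kind of access pattern extra cache (such as V-Cache) can help with.

```c
/* Rough sketch: sequential walk vs. pointer chase over the same data.
 * Results are machine-dependent; the point is the large gap once the
 * array no longer fits in cache. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1u << 23)   /* 8M entries * 8 bytes = 64 MB, larger than most L3 caches */

int main(void)
{
    size_t *next = malloc((size_t)N * sizeof *next);
    if (!next) return 1;

    /* Sattolo's algorithm: build a single random cycle so the pointer
     * chase visits every element exactly once. */
    for (size_t i = 0; i < N; i++) next[i] = i;
    srand(42);
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    clock_t t0 = clock();
    size_t sum = 0;
    for (size_t i = 0; i < N; i++) sum += next[i];   /* prefetcher-friendly */
    clock_t t1 = clock();

    size_t p = 0;
    for (size_t i = 0; i < N; i++) p = next[p];      /* every load depends on the previous one */
    clock_t t2 = clock();

    printf("sequential: %.2fs  pointer-chase: %.2fs  (sum=%zu, end=%zu)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, sum, p);
    free(next);
    return 0;
}
```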