Intel already tried that with Knights Landing, a big compute CPU made from Atom-class cores. It was both unpopular and not particularly performant.
That was Intel’s kludge to make a CPU behave like a GPU; they gave up on that and made a real GPU. I am talking about the opposite. Knights Landing had super simple cores with big vector FP units. Atom cores are very weak, but you can fit a lot more ARM cores per unit area than big AMD64 cores, so a smaller core is necessary even with the die size advantages from MCMs. Wide vector FP units take a lot of die area, and the interconnect to support them burns a lot of power. A lot of server applications do not really need much floating point performance; a tiny scalar FPU or a very limited vector unit would be sufficient.

I don’t know how well higher levels of threading (like 4 or 8 threads per core) would compare in power and die size to just a bunch of smaller cores. There have been some specialized throughput processors that supported a lot of threads per core, with limited success: Cray MTA, the SPARC T and M series, IBM POWER processors. It is a way of sharing hardware, though. They may want to keep support for the same instruction set, but they could emulate wider FP units with less hardware, or swap the thread to a bigger core when it hits an unsupported instruction. I don’t know if they would essentially try to share an FPU the way Excavator cores did.
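As a rough way to frame the threads-versus-cores question, here is a back-of-envelope sketch in Python. All of the area and per-thread performance numbers are made-up illustrative assumptions, not measured figures for any real core; the point is just how the throughput-per-area comparison works.

```python
# Back-of-envelope throughput-per-area sketch. Every number here is a
# hypothetical assumption for illustration, not data for any real core.

configs = {
    # name: (area_mm2, threads_per_core, relative_perf_per_thread)
    "big core, 2-way SMT": (4.0, 2, 1.00),
    "big core, 8-way SMT": (4.6, 8, 0.45),  # assumes ~15% extra area for thread state
    "small core, no SMT":  (1.0, 1, 0.40),
}

budget_mm2 = 64.0  # hypothetical CCD-sized core-area budget

for name, (area, threads, perf) in configs.items():
    cores = int(budget_mm2 // area)          # how many cores fit in the budget
    total = cores * threads * perf           # aggregate relative throughput
    print(f"{name}: {cores} cores, {cores * threads} threads, "
          f"relative throughput {total:.1f}")
```

With these assumed numbers the heavily threaded big core comes out ahead on aggregate throughput per area, but a different area or per-thread performance guess flips the result, which is exactly why the power and die size comparison is hard to call without real figures.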
There are a lot of different options for supporting throughput-oriented computing, but AMD is almost certainly working on small, low-power versions of their cores for mobile and other applications anyway. It is a requirement to stay competitive. Both AMD and Intel are going to have a hard time competing with Apple in the mobile space. Apple is probably at least a generation ahead on CPU design, although some of their performance comes from their control over the whole system. I have an old Apple laptop, but I don’t want to be pushed into the Apple walled garden for my laptop, so I am hoping to see some much better mobile solutions from AMD. Apple is definitely trying to push more users into the iPhone/iPad space; the locked-down iPad Pro type devices are deliberately much more attractive than the lightweight laptops. I am tempted by the iPad Pro mini-LED display, though.

Other ARM server CPU designs are already close to AMD processors and a bit ahead of Intel. AMD64 makers would be in even more trouble if Apple actually made servers. Apple is supposedly going to make a 40-core Mac Pro, so they could make their own servers easily; they just don’t seem to want to be in that market even though they would probably have a significant performance-per-watt advantage. Although, once you support massive amounts of IO, the core power probably becomes less significant.
An AMD low power version probably isn’t going to have multiple AVX-512 units, although a version with such units might be useful for some applications, like game consoles and perhaps some laptops or mobile gaming devices. It is generally better to use the GPU cores when possible rather than try to make such streaming applications perform well on the CPU. The cores used in heterogeneous compute solutions will certainly be more powerful than Atom cores.

They could make different versions with different big:little core ratios: almost all big cores for HPC, some mix for general use, and a high core count version that is mostly little cores. I don’t know if they would do that via different versions of the base die and/or different stacked die. I could see them eventually stacking two layers of CPU cores, which would allow for a range of products. Perhaps three different die: one all or mostly big cores, one all little cores, and one mixed. You could make a lot of different products if you could stack just two of those. That would also allow them to use different process tech for different cores, like a high-performance process for the big cores and a density/power-optimized process for the small cores. I wonder how many small cores would fit in a CCD-sized area. They are working on cooling solutions to deal with stacked chips, but that is probably very expensive. A layer of ultra-low-power cores might be doable without exotic cooling though, especially for server applications with lower clocks and higher parallelism.
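To make the “just two of those” point concrete, counting the distinct two-die stacks is simple combinatorics. A minimal sketch, where the three die labels are just hypothetical names for the variants described above and I assume order within the stack does not matter:

```python
from itertools import combinations_with_replacement

# Hypothetical base die variants from the discussion above.
die_types = ["mostly_big", "all_little", "mixed"]

# Distinct products from stacking exactly two die, ignoring stack order.
stacks = list(combinations_with_replacement(die_types, 2))
for s in stacks:
    print(s)
print(len(stacks), "distinct two-die products")
```

That gives six distinct stacked products from only three base die, plus the three single-die options if unstacked parts were also sold.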