With Zen 5, we keep getting bombarded with information from all sides, and there's too much to keep up with.
I'll use this post to gather all the info that we can and edit it eventually as more pertinent info comes in.
Write down below what you think should be added and I'll edit as we go (if it's worth adding ofc).
Physical
Spec | Zen 5 | Zen 4 | Zen 3 | Zen 2 |
Die Size | 70.6mm² | 71mm² | 80.7mm² | 74mm² |
Node | N4P | N5 | N7 | N7 |
Density (vs predecessor) | 1.06x | 1.8x | 1x | Someone cares about Zen 1? Anyone? |
Density (absolute) | 117.78MTr/mm² | 92.9MTr/mm² | 52.7MTr/mm² | 52.7MTr/mm² |
Area (vs predecessor) | 0.94x | 0.55x | 1x | Anyone? Going out for free? |
Power Draw (vs predecessor) | -22% | -30% | 0% | Really? Ok. |
Transistor Count | 8.315B | 6.5B | 4.15B | 3.9B |
Transistor Count per core | ~1040M | ~821M | ~518M | ~475M |
Source: Anandtech
Source: Techpowerup
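For anyone who wants to redo the arithmetic behind the density and per-core rows, here is a minimal Python sketch. It only uses the die sizes and transistor counts from the table, and it assumes 8 cores per CCD for every generation listed. The function names are mine, not from any source. Recomputing from these rounded inputs lands within about 3% of the table's density and per-core rows, so those rows were probably built from slightly different source figures.

```python
# Figures copied from the table above.
# ASSUMPTION: every CCD listed here holds 8 cores.
CORES_PER_CCD = 8

ccds = {
    # name: (die area in mm², transistor count in billions)
    "Zen 5": (70.6, 8.315),
    "Zen 4": (71.0, 6.5),
    "Zen 3": (80.7, 4.15),
    "Zen 2": (74.0, 3.9),
}


def density_mtr_per_mm2(area_mm2: float, transistors_b: float) -> float:
    """Transistor density in millions of transistors per mm²."""
    return transistors_b * 1000 / area_mm2


def transistors_per_core_m(transistors_b: float, cores: int = CORES_PER_CCD) -> float:
    """Millions of transistors per core, counting the whole die (L3 included)."""
    return transistors_b * 1000 / cores


for name, (area, count) in ccds.items():
    print(f"{name}: {density_mtr_per_mm2(area, count):.1f} MTr/mm², "
          f"{transistors_per_core_m(count):.0f}M per core")
```

Note that the per-core figure divides the whole die by the core count, so it includes L3 and everything else on the CCD, not just the core itself.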
Architecture and Design
The first and biggest change is the look-ahead frontend redesign. Instead of the single 4-wide decoder of Zen 4 and Zen 3 (the 6-wide figure usually quoted for those is dispatch, not decode), we have a completely new dual 4-wide decode.
The look-ahead part is essentially loading up one half of the dual decode while preemptively loading the other half, to lower decode latency. Since most of the current problems of Zen 5 seem to come from decode bottlenecks or memory latency, I think we can assume that the new system is new enough to not work as well as intended in its first iteration.
This is the first thing to notice about Zen 5: its performance increase is not at all uniform across workloads like it more or less was from Zen 1 to Zen 4.
Zen 3 already had fairly unequal growth, but it wasn't anything like this.
Zen 5 also massively grows its SIMD and FP capabilities and goes from a "smart" (i.e. cheap) solution of double-pumping AVX-512 instructions through Zen 4's 256-bit pipelines to just making full-size 512-bit pipelines.
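To put a number on what double-pumping costs, here is a toy throughput model in Python. It is only a sketch: `issue_cycles` is a made-up helper, and it ignores latency, scheduling, the real number of pipes, and memory. All it shows is that a 512-bit op needs two passes through a 256-bit datapath and one pass through a 512-bit one.

```python
import math


def issue_cycles(num_ops: int, op_bits: int, datapath_bits: int, pipes: int = 1) -> int:
    """Cycles needed to push num_ops vector ops through `pipes` identical pipes.

    Each op occupies a pipe for ceil(op_bits / datapath_bits) passes.
    Toy model: no latency, no dependencies, no memory stalls.
    """
    passes_per_op = math.ceil(op_bits / datapath_bits)
    return math.ceil(num_ops * passes_per_op / pipes)


double_pumped = issue_cycles(1000, op_bits=512, datapath_bits=256)  # Zen 4 style
full_width = issue_cycles(1000, op_bits=512, datapath_bits=512)     # Zen 5 style
print(double_pumped, full_width)
```

In this model the full-width pipe needs half the cycles for the same stream of 512-bit ops, which is the best case; real code will see less once memory gets involved.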
From Y-cruncher's Alexander Yee, we also know that the AVX 512 implementation is very solid: http://www.numberworld.org/blogs/2024_8_7_zen5_avx512_teardown/
tl;dr a lot of typically high-latency instructions on the Intel side are done in fewer cycles on Zen 5. To the point that "The regular benchmarks wouldn't do Zen5 (and future processors) any justice. At least until someone can figure out how to get DDR5-20000 on AM5..."
This has led a lot of people to claim that Zen 5 is only a server-oriented/SIMD-oriented architecture. However, the INT ALU count also jumped a whopping 50% (4 to 6), and the new INT ALUs are "fat" pipelines too.
From the aforementioned numberworld article:
"Section 2.10 in the Zen4 Optimization Guide shows the architecture as:
- ALU0: add/logic, divide, branch
- ALU1: add/logic, multiply, CRC, PDEP/PEXT (all the 3-cycle instructions)
- ALU2: add/logic
- ALU3: add/logic
The two extra ALUs that Zen5 adds are not simple ALUs like Zen4's ALU2/3. They are actually closer to ALU1 in capability. In other words, they support all the "expensive" 3-cycle latency instructions - multiply, CRC, PDEP/PEXT."
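The quoted port layout can be written down as a small Python table to count how many ports can take each kind of op. This is a sketch under a loud assumption: I'm modelling Zen 5 as the four Zen 4 ports unchanged plus two ALU1-like ports, which is only what the quote suggests. The names `ALU4`/`ALU5` and the helper `ports_for` are hypothetical.

```python
# Zen 4 port capabilities, as listed in the quote above.
ZEN4_ALUS = {
    "ALU0": {"add/logic", "divide", "branch"},
    "ALU1": {"add/logic", "multiply", "CRC", "PDEP/PEXT"},
    "ALU2": {"add/logic"},
    "ALU3": {"add/logic"},
}

# ASSUMPTION: Zen 5 = the same four ports plus two ALU1-like ports.
# The port names ALU4/ALU5 are hypothetical, not AMD's.
ZEN5_ALUS = {
    **ZEN4_ALUS,
    "ALU4": set(ZEN4_ALUS["ALU1"]),
    "ALU5": set(ZEN4_ALUS["ALU1"]),
}


def ports_for(op: str, alus: dict) -> list:
    """Names of the ports able to execute `op`."""
    return sorted(name for name, ops in alus.items() if op in ops)


for op in ("add/logic", "multiply", "PDEP/PEXT"):
    print(f"{op}: Zen 4 -> {len(ports_for(op, ZEN4_ALUS))} ports, "
          f"Zen 5 -> {len(ports_for(op, ZEN5_ALUS))} ports")
```

Under that assumption, the "expensive" 3-cycle ops go from one port to three, which is a much bigger jump than the headline 4-to-6 ALU count.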
So INT workloads actually do not seem to be ignored or treated as a second-class citizen vs FP/SIMD in Zen 5.
In his testing, numberworld never found the 6 ALUs to be maxed out. Rather than the backend being insufficient for INT, it seems like something is lacking in the general design that prevents INT from being fully served.
This is not a problem for FP/SIMD, but scalar ops in particular seem to gain very little performance vs Zen 4. Some have been floating the notion that the problem is memory latency, not the design itself.
So when you have very large amounts of data or complex ops that take a lot of cycles, Zen 5 is limited by memory bandwidth or by the speed of its backend. This is where it shines vs Zen 4.
But for low-complexity or scalar operations, the backend is oversized. This seems to be the case for Integer, where the pipelines are never fully filled (no case was found where all 6 ALUs actually fired simultaneously), and where latency from the rest of the arch seems to be the problem.
To be clear, I'm not sure that Zen 5 is "imbalanced" as some have said. Rather, it seems like FP/SIMD is being fed fully and Zen 5 really trumps Zen 4 there, while INT is not fully fed, and so the improvement vs Zen 4 is marginal.
This is highly speculative on my part, so I'm hoping to get some conversation going around that point and see if we can clarify things. And of course, if it is the case, identify where/how Zen 5's INT is always left wanting.
Source: Chips and Cheese (9950x article)
Improvements
AMD claimed a 16% IPC improvement.
(TechSpot, Steve Walton aka Aussie Steve)
So.
Yeah.
(Hairy Steve)
Mmmmmm-mmh.
So yeah so...
Aha.
(Linus Tech Tips)
NAH. It's not 16%.
And I could probably post about 10 more images of the same kind.
It's not even close to 16%. Zen 5% turned out to be real. Even Zen 3% sometimes.
So what happened? Is Zen 5 just a massive transistor count increase, on a better node, providing a non-improvement?
Well...no. Because we have Phoronix to the rescue.
This is where there really is a LOT to unpack about Zen 5.
Because the Phoronix tests are extensive, I won't be posting all of them. It's 400 benchmarks, each run several times, with a lot of them run again on both Windows 11 and Ubuntu 24.04, so there is too much to cover.
But I can at least summarise a few things:
1/ Zen 5 is indeed very unequal in its results. While the geometric mean is a flat 21% improvement at 12 cores (7900X vs 9900X) and 18% at 16 cores (7950X vs 9950X), it is nowhere near that in a lot of workloads.
2/ Zen 5 clearly has performance inequality between Windows and Linux, to the tune of several dozen percent in the worst case scenarios.
3/ Zen 5 does not yield a massive improvement in all fields, if anything it feels more like a flat 10 or 12% in most, but the outliers are massive, with Memcached yielding an absurd 49% improvement in Ops/sec and a way more absurd 92% improvement in Set to Get Ratio 1:100.
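A quick Python sketch of how a geometric mean around 20% can coexist with "flat 10 or 12% in most". The speedup ratios below are HYPOTHETICAL numbers I made up to mimic the shape of the results (mostly ~10%, two big outliers). They are not Phoronix data.

```python
from statistics import geometric_mean, median

# HYPOTHETICAL speedup ratios (new / old), NOT actual benchmark results:
# six "ordinary" workloads plus two Memcached-style outliers.
speedups = [1.10, 1.12, 1.08, 1.11, 1.10, 1.09, 1.49, 1.92]

gm = geometric_mean(speedups)
med = median(speedups)

print(f"geometric mean: {gm:.3f}x")
print(f"median:         {med:.3f}x")
```

With numbers shaped like this, the geometric mean sits well above the median, so the headline average says more about the outliers than about the typical workload.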
We have to also talk about where Zen 5 really shines. Because the list is short and sweet:
- Python, JavaScript interpreters and Java JMH/JVM
- NumPy
- Webserver (Apache, but I'm willing to bet others will also see a huge uplift)
- PostgreSQL
- simdjson
- Memcached (again, willing to bet Redis or Dragonfly will also blow up)
In one line: everything a webserver demands. If there's a million things to take in about Zen 5 itself, there's just one thing to take in about Zen 5 server: it's a monster arch. The only thing that I couldn't find is load balancer performance tests.
All of these see an improvement of anywhere between 20% and 50%. Zen 5 is a massive leap in server/Cloud stuff. Probably as big a leap as Zen 2 was with its lego-like chiplet design back in the day.
The gaming improvement is an absolute nothingburger. Worse, it is a nothingburger that comes out after Zen 5 proved one of the most difficult generations for AMD to get out the door: 22.5 months, pretty much 2 entire years, while everyone's favorite baby, Zen 3, took only 16. They never really respected the 18-month target, but this is the worst of the worst. So gheymers are understandably annoyed and screamy. But Zen 5 being a gaming disappointment (with the notable exception of Assetto Corsa Competizione, for some reason) doesn't make it a general disappointment.
Or more to the point, being a gaming disappointment and a general "meh" in productivity doesn't make it all bad.
Speculative Expectations/FineWineing
There's also another thing I want to mention.
Mike Clark, the Zen Dad, went around and talked to a few people, including Chips & Cheese and Ian Cutress.
In the Ian Cutress interview, he's asked "why only 16% despite all those changes?"
"Because we need software to see those changes, so it can really leverage it. When you're stuck on that kind of 6 wide dispatch 4 ALU..."
In the Chips & Cheese one, he basically said that:
"Zen 6 and later will profit off the work made with Zen 5".
There is a real likelihood that Zen 5 will see performance improvements from "FineWineing", i.e. from later software and compiler optimisations for its new design. When? Where? Don't know.
Conclusion
I think several conclusions can be drawn from the present information.
1/ Zen 5 is NOT an "efficiency improvement generation". I don't know how long AMD's marketing breathed through their whisky for that one; I'm guessing they panicked when faced with the lack of gaming performance and the somewhat poor general productivity improvement.
On the contrary, it is a bona fide massive architectural overhaul with huge long-term consequences. In many ways, with the completely redesigned frontend, full-width AVX-512 and backend rework (I skipped a lot of details; read the Chips and Cheese and numberworld articles, and David Huang's at https://blog.hjc.im, for them), Zen 5 shouldn't even be called Zen. Zen 1 led to 2, Zen 3 reworked 2 and Zen 4 shrunk and tweaked 3. Zen 5 doesn't have much of anything to do with 4. If Zen 3 was a big change, Zen 5 is a revolution. A costly, transistor-heavy revolution that doesn't target efficiency.
2/ Zen 5 is highly capable in certain workloads, and where it clearly shines is server-oriented workloads. All the Python/JS/Java users will greatly enjoy it. Browsers should obviously run a lot faster too. I'm expecting the Cloud providers and heavily online corpos in general to dive into the Zen 5 chip bin like Scrooge McDuck into his money bin. There is no reason whatsoever not to buy, or at least there shouldn't be when Turin comes out. It's a full-blown, all-around improvement in perf, plus the power draw gains from the node, at roughly the same production cost with a die of the same area. It's a knockout for server and I really feel like Intel will need a tsunami of a result with Granite Rapids to counter this.
3/ Zen 5's general 5%-ness is undeniable, and even though you can find some amazing results, you can also find a ton of "yeah, why didn't you just shrink Zen 4 and pump more frequency" results. That look-ahead decode is very, very new, and it shows. A 22% better node with "just Zen 4" on it would've probably been more appreciated by gamers than Zen 5 as is. The notion that software (by which I expect Mike Clark meant compilers) needs to improve its support for the new width and new look-ahead system is surely true, but he didn't give any dates there. It's very much possible that next-gen GCC will start leveraging it in a few months. Or only improve at a snail's pace until Zen 6 replaces Zen 5. Or maybe even later. No promises were made, so for now, for every unimpressive workload performance leap, we can safely assume that Zen 5 is a firm "meh". A 22-months-in-the-making "meh".
4/ Zen 5 does not seem to have an "imbalance" in how it is meant to serve SIMD/server and client. It seems like the architecture either has a plain design flaw that prevents its INT backend from being well fed and used, or it needs the decode to run faster and feed the backend better. Whichever way, I don't think the criticism of "It's only for server, client gets dumped on" is true.
5/ Zen 5 has not even come close to the efficiency of Apple's stuff. It's another subject entirely, but since Zen 5's consequences may be with us for at least as long as Zen 1's (so 7 years, more likely 9-10 years), we may see wider, less data-oriented cores gain more and more value in client, while classic x64 designs accept some inefficiency in order to please server workloads more. Qualcomm has a shot at doing something impressive before Zen 6 comes out, at least.
And a few questions still hang in the air.
1/ Why is the INT backend so underused? This is clearly why games, which are extremely branchy, do not seem to get any real improvement and are sometimes even slower. Is this a memory latency problem? Is there a cache design or size problem, or a lack of speed? Did they find some kind of hard limit with Zen 4 that Zen 5 cannot crack? Why have 50% more ALUs if you can't even leverage them? They already added more L1 cache, and yet the improvement is basically nothing, it's 3%. The lack of performance improvement there, and even mild regression, is jarring; it's like Zen 5 can't push any further.
2/ Numberworld tested AVX 512 perf extensively and concluded that overall, Zen 5 has excellent latency with it, unlike Intel's implementation way back with Rocket Lake. So does that mean that we'll see a large growth of AVX 512 optimisations in the future? Are we going to see a lot more of it? Or is it only going to be server/heavy data oriented and not that useful for client applications and games?
3/ The memory bandwidth problem with Zen 5 is glaring. Some data-heavy tests came out with a 9700X having about the same performance as a 9950X. Twice the cores, same bandwidth = performance death. AVX-512 instructions make that even worse. So are we facing an architecture that is somehow "so good" at compute-heavy workloads that there is basically nothing to expect from the next generation unless memory bandwidth gets better? DDR5-8000 was even tested vs DDR5-6000 and it's somehow worse! That Zen 2 era I/O Die seems like it should've died with Zen 4. And AM5 will probably be a good deal of suffering for Zen 6 if it can't get the memory to go as fast as it wants.
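To make the "twice the cores, same bandwidth" point concrete, here is the back-of-envelope arithmetic as a Python sketch. It assumes dual-channel DDR5-6000 with 64-bit (8-byte) channels and computes the theoretical DRAM peak only. It ignores Infinity Fabric limits, the memory controller, and real-world efficiency, so actual numbers will be lower. The function name is mine.

```python
def peak_dram_bandwidth_gbs(mt_per_s: int, channels: int = 2,
                            bytes_per_transfer: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s.

    ASSUMPTION: 64-bit (8-byte) channels, dual channel by default.
    Fabric and controller limits are ignored.
    """
    return mt_per_s * bytes_per_transfer * channels / 1000


total = peak_dram_bandwidth_gbs(6000)  # dual-channel DDR5-6000

# Core counts: 9700X has 8 cores, 9950X has 16.
for name, cores in {"9700X": 8, "9950X": 16}.items():
    print(f"{name}: {total:.0f} GB/s total, {total / cores:.1f} GB/s per core")
```

Same platform, same ceiling: each core on the 16-core part gets half the theoretical bandwidth of a core on the 8-core part, which is consistent with bandwidth-bound tests not scaling with core count.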
This thread isn't meant to supersede the Zen 5 Architecture or Zen 5 speculation threads. Think of it as an archive or summary that I'll correct based on the comments below. We simply have 700+ posts in one, at least 100 of them since Zen 5 came out, and another 12 in the architecture one. This is to try and collect the pertinent information in one place.