Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)


Cardyak

Member
Sep 12, 2018
72
159
106
Found an interesting patent from AMD for increasing IPC by concurrently executing both sides of a branch instruction.

ALTERNATE PATH FOR BRANCH PREDICTION REDIRECT

They'd need a good bump in the size of the register file and the other OoO resources to pull this off.

There has been research/work on this since at least the 90s, and while I believe a few CPUs may do this on an extremely limited basis (I've seen claims that Apple's big cores can run both paths in certain cases, though that may simply be to allow progress before the branch predictor has its result ready), no one has gone all-in on it, because branch predictors are so good these days that you won't get much out of it.

Sure, there are some branches that are essentially impossible to predict where it would be of benefit (so long as they aren't quickly followed by more such branches) but then you are paying a price in terms of additional transistors, power to operate them, and verification time for something that doesn't help you very often.

Indeed, this has been an active area of research for a long time, and if you're feeling brave there is a ~170 page PhD Thesis that delves into this concept here - Maintaining High Performance in the Presence of Impossible-to-Predict Branches

This idea is fascinatingly complex, and there are a lot of processes to unpack:

Firstly: As the AMD patent indicates, you'd have to build a confidence predictor that tracks all of the branches and marks the hard-to-predict (H2P) ones so you can detect them in the future. This alone is already a difficult prospect; then you would have to determine when a branch is deemed difficult enough to mark as an H2P branch, and also constantly tune how aggressive this confidence predictor is. If you end up marking too many branches as H2P, then these predication processes can actually cause a regression in performance, because you are spending a lot of resources attempting clever tricks and manoeuvres as opposed to simply predicting the branch as per normal.
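
For what it's worth, here's a minimal software sketch of that first piece, assuming a per-branch saturating counter with an asymmetric update; none of the widths or thresholds below come from the patent, they're just for illustration. The update asymmetry is effectively the "aggression" knob: it decides how quickly a branch earns the H2P label and how easily it loses it again.

```python
# Toy model of an H2P (hard-to-predict) confidence filter. Counter width,
# thresholds and the asymmetric update are illustrative assumptions, not
# details taken from the AMD patent.
import random

class H2PFilter:
    def __init__(self, ctr_bits=4, flag_threshold=12):
        self.max_ctr = (1 << ctr_bits) - 1    # saturating ceiling (15 for 4 bits)
        self.flag_threshold = flag_threshold  # value at which a branch counts as H2P
        self.table = {}                       # branch PC -> confidence counter

    def update(self, pc, mispredicted):
        ctr = self.table.get(pc, 0)
        if mispredicted:
            ctr = min(ctr + 2, self.max_ctr)  # punish mispredictions quickly...
        else:
            ctr = max(ctr - 1, 0)             # ...but let good behaviour slowly forgive
        self.table[pc] = ctr

    def is_h2p(self, pc):
        return self.table.get(pc, 0) >= self.flag_threshold

# A coin-flip branch gets flagged, a 98%-predictable one does not.
f = H2PFilter()
for _ in range(1000):
    f.update(0x400123, mispredicted=random.random() < 0.45)
    f.update(0x400456, mispredicted=random.random() < 0.02)
print(f.is_h2p(0x400123), f.is_h2p(0x400456))   # typically: True False
```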

Secondly, as the paper I've linked describes, what do you do with these hard-to-predict branches? There are a few options, some of which the paper covers:
  1. Dynamic Predication - Simply decode and execute both sides of the branch, and then discard the wrong side after the branch has been resolved. The great thing about this approach is that it's guaranteed to avoid a misprediction; the downside is that you are also guaranteed to take a performance hit, because your processor is now filled with a stream of instructions that are eventually going to be discarded, so it's also inherently wasteful. This is essentially an insurance policy - "I'd rather incur a small hit in performance by having redundant code take up resources in the ROB/scheduler, etc. than predict a dangerous branch, get it wrong, and then have to flush the entire pipeline." If going down this route, the aggression factor becomes very critical. You'd have to use this sparingly; we're talking only the hardest 0.1-1% of branches that are nearly impossible to predict (a rough break-even model is sketched after this list). If this is used too frequently on too many branches, it will result in the tracking structures of the processor being clogged up with tonnes of redundant code and cause enormous performance drops. However, if it is implemented with restraint it could be extremely beneficial, especially with branches that involve random number generation and are essentially impossible to predict. For what it's worth, Intel also published a paper on this solution back in 2020 titled Auto-Predication of Critical Branches
  2. Delayed Fetch & Merge Point Prediction - Skip over the hard branches for now, detect when the code merges in the future, and start fetching out of order from there. This seems like it offers the most performance gain, but also the most complexity. Determining the reconvergence point of the code is difficult, not to mention data dependencies will have changed within the branch that was skipped, so you can't simply ignore the branch completely and zip into the future at the merge point. You would have to do some clever clean-up when the branch has been determined and resolved: go back and "fill in the gap" from the code that you skipped, then error-check the speculative instructions and their data dependencies. This seems really cool, but I couldn't even begin to fathom the engineering complexity of it; it seems like a pipe dream for the moment at least.
  3. Branch Runahead - There is a paper detailing this idea here. Essentially the idea is to spawn a small thread which runs on ahead, scans the instruction stream, and executes the code (not fully, otherwise you'd be executing all of the same code twice; it just chases branch dependencies and attempts to resolve them). This information is then relayed back to the "proper" core, which learns from the small thread that ran on ahead and uses its data and feedback to hopefully not make the same mistakes. This feels like the CPU version of using a thread as cannon fodder, forcing it to jump on ahead and act as a scout performing reconnaissance ("you run on ahead and trip over all the H2P branches, and I'll run along behind executing the code properly and learn from your mistakes"). Numerous questions remain about the implementation. How far does the 2nd, smaller thread run on ahead? How does it skip irrelevant code and only execute and gather information regarding the H2P branches? Where does the 2nd thread execute? A small core sitting alongside the bigger core? Another SMT thread? Really interesting concept, but a little ambitious.
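
Here's the rough break-even model mentioned under option 1, with purely illustrative cycle counts: dual-path execution is a fixed tax paid on every occurrence, while normal prediction only pays the flush penalty when it's wrong, so dual-path only wins on the very worst branches.

```python
# Back-of-envelope comparison of "predict as usual" vs "execute both paths"
# for a single branch. All cycle counts are illustrative, not real Zen figures.

def expected_cost_predict(mispredict_rate, flush_penalty=15):
    # Normal prediction: you only pay the pipeline flush when you get it wrong.
    return mispredict_rate * flush_penalty

def expected_cost_dual_path(overhead=6):
    # Dynamic predication: never a flush, but both paths occupy fetch/ROB/
    # scheduler slots every single time, so there is a fixed throughput tax.
    return overhead

for rate in (0.02, 0.10, 0.30, 0.50):
    p, d = expected_cost_predict(rate), expected_cost_dual_path()
    winner = "dual-path" if d < p else "predict"
    print(f"mispredict rate {rate:4.0%}: predict ~{p:4.1f} cyc, dual-path ~{d:.1f} cyc -> {winner}")
```

With these made-up numbers the crossover only arrives around a ~40% misprediction rate, which is exactly why you'd only unleash this on the tiny fraction of branches the confidence predictor has already given up on.
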
There are also a lot of other ideas about mitigating the damage from branch mispredictions, such as selective pipeline flushing. But that's a conversation for another time! There really are a lot of ambitious ideas out there, and I'd be amazed if we don't see at least one of these implemented over the next decade or so.
 

Timmah!

Golden Member
Jul 24, 2010
1,428
650
136
Watched the RedGamingTech video about actual Meteor/Arrow Lake and Zen 5 rumors, and he claimed that Zen 5 will still be just 8 cores per chip, because more cores would be starved by the Infinity Fabric connection to the IOD. Do you reckon that to be true? I thought that when Infinity Fabric is an issue, because of the penalty it incurs on cross-chip communication, then having more cores per chip is actually a good thing, because the chance of needing additional cores housed on that second CCD is lower. Or do they mean that there would not be enough physical space to connect, say, 12 cores instead of 8 to the IOD?
 

DisEnchantment

Golden Member
Mar 3, 2017
1,608
5,816
136
Watched the RedGamingTech video about actual Meteor/Arrow Lake and Zen 5 rumors, and he claimed that Zen 5 will still be just 8 cores per chip, because more cores would be starved by the Infinity Fabric connection to the IOD. Do you reckon that to be true? I thought that when Infinity Fabric is an issue, because of the penalty it incurs on cross-chip communication, then having more cores per chip is actually a good thing, because the chance of needing additional cores housed on that second CCD is lower. Or do they mean that there would not be enough physical space to connect, say, 12 cores instead of 8 to the IOD?
Likely not the Infinity Fabric.
By the way, GMI2 is 25 Gbps, GMI3 is 32-36 Gbps, and each Zen 4 CCM interface can run two GMI3 links in wide mode. Some of the 32C Genoa models come with dual GMI3 links per CCM interface.
GMI4, per AMD employee profiles on LinkedIn, is 64 Gbps.
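
Translating those per-lane rates into per-CCD bandwidth is just lanes x rate / 8. Only the Gbps figures come from the numbers above; the x16 lane count in this sketch is a placeholder I'm assuming for illustration, since the real GMI link widths aren't given here.

```python
# Convert per-lane signalling rate into raw link bandwidth per direction.
# The x16 lane count is an assumed placeholder, not a confirmed GMI width.

def link_gb_per_s(gbps_per_lane, lanes=16):
    return gbps_per_lane * lanes / 8   # GB/s per direction, before encoding overhead

gmi3_single = link_gb_per_s(36)     # one GMI3 link at the upper quoted rate
gmi3_wide   = 2 * gmi3_single       # two links per CCM in wide mode
gmi4_single = link_gb_per_s(64)     # rumored GMI4 rate

print(f"GMI3, one link  (assumed x16): ~{gmi3_single:.0f} GB/s per direction")
print(f"GMI3, wide mode (two links):   ~{gmi3_wide:.0f} GB/s per direction")
print(f"GMI4, one link  (assumed x16): ~{gmi4_single:.0f} GB/s per direction")
```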

Mike Clark said they need to scale the CCX from mobile to server, and that is their main consideration for sticking to 8 cores.
But regardless, they have already hinted at increasing cores per CCX; if he was not talking in 2021 about Zen 7 all the way out in 2026+ (the next ground-up rework), it could be Zen 5.

We do see core counts growing, and we will continue to increase the number of cores in our core complex that are shared under an L3. As you point out, communicating through that has both latency problems, and coherency problems, but though that's what architecture is, and that's what we signed up for. It’s what we live for - solving those problems. So I'll just say that the team is already looking at what it takes to grow to a complex far beyond where we are today, and how to deliver that in the future.
 

Doug S

Platinum Member
Feb 8, 2020
2,269
3,522
136
Exactly this.

That idea is sometimes called eager execution. Eager execution doing twice the work and throwing half of that burned power away is quite bad given the overall accuracy of current predictors. Turning eager execution on and off based on the prediction history doesn't sound easy. There is an RWT thread about this topic.


Another issue with it is that all the Spectre/Meltdown-type bugs from speculative execution would gain a new class of speculation to attack, and you'd provide more room for other types of side-channel attacks.

So the verification I mentioned wouldn't simply be ensuring that the results were correct, but that they didn't open up new avenues of exploit. Even if your verification engineers give it a clean bill of health, you might ship hundreds of millions of CPUs only to learn some guy in Bulgaria found something they missed, and every single CPU you shipped with this feature over the last few years is now vulnerable.

If there was any lesson the Spectre/Meltdown stuff should have taught us, it is that anything involving speculative execution that is later rolled back/discarded has to be treated with a suspicious eye. If you end up having to deliver a patch or firmware "fix" that disables eager execution as the easiest/only way to mitigate such exploits, then all the effort you put into developing and marketing it has been wasted, and customers feel cheated that they are no longer getting the performance they were promised.
 

BorisTheBlade82

Senior member
May 1, 2020
664
1,015
106
Watched the RedGamingTech video about actual Meteor/Arrow Lake and Zen 5 rumors, and he claimed that Zen 5 will still be just 8 cores per chip, because more cores would be starved by the Infinity Fabric connection to the IOD. Do you reckon that to be true? I thought that when Infinity Fabric is an issue, because of the penalty it incurs on cross-chip communication, then having more cores per chip is actually a good thing, because the chance of needing additional cores housed on that second CCD is lower. Or do they mean that there would not be enough physical space to connect, say, 12 cores instead of 8 to the IOD?
Honestly, this is a Bulls**t argument by RGT. Yes, some SKUs are already a bit starved by the IFoP, but AMD made that decision knowingly.
If AMD identified the Infinity Fabric CCX link to be a significant bottleneck, they are entirely free to widen it as much as they like and their target costing allows. @DisEnchantment already mentioned several indications. As this is an on-package topic, they don't even need to change socket, chipset or anything else that would break AM5 compatibility. There are options such as InFo-R and EFB for a physical implementation beyond plain old IFoP.
And I am still speculating a bit that Zen4c might already be in for a surprise.
 

Geddagod

Golden Member
Dec 28, 2021
1,157
1,021
106
Honestly, this is a Bulls**t argument by RGT. Yes, some SKUs are already a bit starved by the IFoP, but AMD made that decision knowingly.
If AMD identified the Infinity Fabric CCX link to be a significant bottleneck, they are entirely free to widen it as much as they like and their target costing allows. @DisEnchantment already mentioned several indications. As this is an on-package topic, they don't even need to change socket, chipset or anything else that would break AM5 compatibility. There are options such as InFo-R and EFB for a physical implementation beyond plain old IFoP.
And I am still speculating a bit that Zen4c might already be in for a surprise.
I think a much more likely culprit for why the CCD might stay at 8 cores is the inter-CCD core connect and poor node scaling. SRAM is essentially staying the same size, and the core size of Zen 5 should increase (compared to Zen 4, I mean, if they stayed on the same node) because they are making the core wider.
Maybe I could imagine a 12-core CCD, or two 8-core CCXs in one CCD connected with Infinity Fabric like Zen 2 was, but even that I doubt.
Some people claim that Zen 5 defaults to L3 stacked onto the CCD, but I don't see that happening, since AMD already confirmed there are separate V-cache models for Zen 5.
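
To make the area side of that argument concrete, here's the napkin-math version; every number below (baseline CCD area, logic/SRAM split, scaling factors, the growth from a wider core) is a made-up placeholder, not a measured Zen 4 or Zen 5 figure.

```python
# Napkin math: why a wider core plus stagnant SRAM scaling pressures core count.
# Every number here is an illustrative placeholder, not a real Zen 4/5 figure.

ccd_area_zen4  = 70.0   # mm^2, placeholder for today's 8c CCD
logic_fraction = 0.55   # assumed logic share of the CCD
sram_fraction  = 1.0 - logic_fraction

logic_scale = 0.94      # assumed logic shrink from the node step
sram_scale  = 1.00      # SRAM essentially not shrinking
core_growth = 1.20      # assumed transistor growth from widening the core

area_8c  = ccd_area_zen4 * (logic_fraction * core_growth * logic_scale
                            + sram_fraction * sram_scale)
area_12c = area_8c * 12 / 8   # cores, L2 and L3 slices all scale with core count

print(f"hypothetical 8c Zen 5 CCD:  ~{area_8c:.0f} mm^2 (vs ~{ccd_area_zen4:.0f} mm^2 today)")
print(f"hypothetical 12c Zen 5 CCD: ~{area_12c:.0f} mm^2")
```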
 
  • Like
Reactions: scineram

maddie

Diamond Member
Jul 18, 2010
4,749
4,691
136
I think a much more likely culprit for why the CCD might stay at 8 cores is the inter-CCD core connect and poor node scaling. SRAM is essentially staying the same size, and the core size of Zen 5 should increase (compared to Zen 4, I mean, if they stayed on the same node) because they are making the core wider.
Maybe I could imagine a 12-core CCD, or two 8-core CCXs in one CCD connected with Infinity Fabric like Zen 2 was, but even that I doubt.
Some people claim that Zen 5 defaults to L3 stacked onto the CCD, but I don't see that happening, since AMD already confirmed there are separate V-cache models for Zen 5.
Maybe V-cache Zen 5 is a 2 Hi stack.
 
  • Like
Reactions: Kaluan and Geddagod

Timmah!

Golden Member
Jul 24, 2010
1,428
650
136
Honestly, this is a Bulls**t argument by RGT.
That's exactly what I thought, hence why I posted about it in the first place. I do not question that they will stick to 8 cores, but the reasoning for it was suspect there.

I think a much more likely culprit for why the CCD might stay at 8 cores is the inter-CCD core connect and poor node scaling. SRAM is essentially staying the same size, and the core size of Zen 5 should increase (compared to Zen 4, I mean, if they stayed on the same node) because they are making the core wider.
Maybe I could imagine a 12-core CCD, or two 8-core CCXs in one CCD connected with Infinity Fabric like Zen 2 was, but even that I doubt.
Some people claim that Zen 5 defaults to L3 stacked onto the CCD, but I don't see that happening, since AMD already confirmed there are separate V-cache models for Zen 5.

Yeah, I guess if they are increasing the size of the cores but not using a smaller node, then it's probably 8 cores again and 16 max on the client platform. That means I will very likely hold onto my 7950X longer, because I don't think I will swap it for another 16-core part. It would really need to be something special for that to happen.
Regarding V-cache, stacking the L3 purely onto the CCD would have to be preceded by solving the clock penalty. I mean, they are stacking cache on top of cache, not on the cores themselves, and still have to clock the chips lower. Removing the cache from the die completely would mean stacking it on top of the cores - I don't see that happening by next year. Probably not by 2025 either.

Maybe V-cache Zen 5 is a 2 Hi stack.

That's a good point. But, as said above, I kinda don't believe it, seeing the pace of innovation so far.
 

turtile

Senior member
Aug 19, 2014
614
294
136
Honestly, this is a Bulls**t argument by RGT. Yes, some SKUs are already a bit starved by the IFoP, but AMD made that decision knowingly.
If AMD identified the Infinity Fabric CCX link to be a significant bottleneck, they are entirely free to widen it as much as they like and their target costing allows. @DisEnchantment already mentioned several indications. As this is an on-package topic, they don't even need to change socket, chipset or anything else that would break AM5 compatibility. There are options such as InFo-R and EFB for a physical implementation beyond plain old IFoP.
And I am still speculating a bit that Zen4c might already be in for a surprise.

Especially when the real reason is so obvious, there is no need to speculate. The whole purpose of chiplets is cost efficiency. AMD is widening its cores to increase IPC. Going from 5N to 4N doesn't provide significant density improvements, and AMD will probably choose a blend for the best performance/efficiency. More than 8 cores just doesn't make sense economically.
 

HurleyBird

Platinum Member
Apr 22, 2003
2,684
1,268
136
Well, AMD has long been rumored to substantially increase core counts with Turin, and they probably need to in order to maintain a comfortable lead over Intel in servers. I don't think AMD would plan to stay at 96 cores for servers, because servers are their #1 priority and they aren't dumb.

I don't see how that happens with 8-Core CCDs unless you can stack them on top of one another.

It might also be possible that there are two different CCDs with different core counts. If AMD does make a 16-core CCD, that would be pretty overkill for a lot of the market.

But RGT is such a garbage tier leaker that it's not even worth putting much speculation into anything he says. Lots of people here hate on MLID, but at least he has some decent sources. RGT doesn't even seem to have that, and his speculation is always much worse. RDNA3 triple RDNA2 performance, anyone?
 

Exist50

Platinum Member
Aug 18, 2016
2,445
3,043
136
Regarding V-cache, stacking the L3 purely onto the CCD would have to be preceded by solving the clock penalty. I mean, they are stacking cache on top of cache, not on the cores themselves, and still have to clock the chips lower. Removing the cache from the die completely would mean stacking it on top of the cores - I don't see that happening by next year. Probably not by 2025 either.
They could alternatively stack it underneath. Would have its own challenges, but might be better for thermals. Though we never did get a deep dive on what the limiting factor actually is for clocks w/ v-cache.
Well, AMD has long been rumored to substantially increase core counts with Turin, and they probably need to in order to maintain a comfortable lead over Intel in servers. I don't think AMD would plan to stay at 96 cores for servers, because servers are their #1 priority and they aren't dumb.
I think they could still get away with 96c reasonably comfortably if they had to, but if not, they could wait for a 3nm Zen 5 version (refresh?) and then do a sort of mid-cycle upgrade. Or they could just make a new socket with a 128c, 16 channel config, but that's probably not an ideal outcome.
 

soresu

Platinum Member
Dec 19, 2014
2,667
1,866
136
Though we never did get a deep dive on what the limiting factor actually is for clocks w/ v-cache
I strongly suspect that die thickness plays a significant role, especially thickness of the die on top so that there is a shorter path of thermal dissipation between the power heavy bottom die and the IHS.
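
For a rough feel of why the thickness above the hot layer matters, 1D conduction resistance is just R = t / (k * A). The die area and thicknesses below are illustrative; only the bulk conductivities are standard textbook values, and in reality the bond interfaces and TIM dominate the absolute numbers.

```python
# 1D conduction estimate, R = t / (k * A). Die area and thicknesses are
# illustrative placeholders; the conductivities are standard bulk values.

K_SI = 150.0   # W/(m*K), bulk silicon
K_CU = 400.0   # W/(m*K), copper
AREA = 70e-6   # m^2, i.e. ~70 mm^2 of die, placeholder

def r_thermal(thickness_um, k):
    return (thickness_um * 1e-6) / (k * AREA)   # kelvin per watt

for t_um in (750, 100, 20):   # full-thickness wafer vs progressively thinned die
    print(f"{t_um:4d} um of Si above the hot layer: ~{r_thermal(t_um, K_SI) * 1000:.1f} mK/W")

# For comparison, a copper path of the same 20 um is ~2.7x better per unit area,
# which is the intuition behind TSVs also helping heat get out.
print(f"  20 um of Cu:                    ~{r_thermal(20, K_CU) * 1000:.1f} mK/W")
```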
 
  • Like
Reactions: Tlh97 and Kaluan

BorisTheBlade82

Senior member
May 1, 2020
664
1,015
106
I strongly suspect that die thickness plays a significant role, especially thickness of the die on top so that there is a shorter path of thermal dissipation between the power heavy bottom die and the IHS.
I could imagine that the contact patches for the TSVs introduce quite some resistance, producing heat, consuming more power, causing more electromagnetic trouble, and so on.
 
  • Like
Reactions: Tlh97 and soresu

CakeMonster

Golden Member
Nov 22, 2012
1,392
501
136
But RGT is such a garbage tier leaker that it's not even worth putting much speculation into anything he says. Lots of people here hate on MLID, but at least he has some decent sources. RGT doesn't even seem to have that, and his speculation is always much worse. RDNA3 triple RDNA2 performance, anyone?
Agreed, except about the other guy; neither of them has brought any positives to the hardware community overall. I love the AT forums because we usually discuss the core technologies instead of the frauds and clickbaiters that unfortunately dominate other hardware communities.
 

CakeMonster

Golden Member
Nov 22, 2012
1,392
501
136
I don't see how that happens with 8-Core CCDs unless you can stack them on top of one another.

It might also be possible that there are two different CCDs with different core counts. If AMD does make a 16-core CCD, that would be pretty overkill for a lot of the market.
Based on all the official info so far, I'm very much inclined to bet on Z5 still being an 8c main CCD for the consumer lineup. The question is whether they feel the pressure to go for a secondary 16c CCD to keep up with Intel's core counts. Maybe the Z4 X3D lineup with two different CCDs is a sign that they're moving to that, with all the scheduling that entails. If the node advance is only minor, then they will need those transistors to achieve the main 8c CCD IPC improvements that Mike Clark hinted at in Ian's interview 18 months ago.
 

inf64

Diamond Member
Mar 11, 2011
3,703
4,034
136
RGT claims only 16C/32T for the top Zen 5 desktop AM5 part. That seems kinda low, as I was expecting AMD to do Zen 5 + Zen 5C (or whatever they'll call it). They could easily do 8C Zen 5 + 16C Zen 5C for a total of 24 Zen 5 cores. The ISA would be the same; Zen 5C might clock ~15-20% lower, but that's fine, as they would still get a ~30% boost versus 16C Zen 5 in MT workloads.
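
That ~30% figure checks out as straight arithmetic if you treat MT throughput as cores x clock and ignore IPC differences, memory bandwidth and power limits:

```python
# Sanity check of the ~30% MT claim, treating throughput as cores * clock and
# ignoring IPC differences, memory bandwidth and power limits.

zen5_clock  = 1.00     # normalise the big-core clock to 1.0
zen5c_clock = 0.825    # ~15-20% lower, take the midpoint

homogeneous_16c  = 16 * zen5_clock
hybrid_8_plus_16 = 8 * zen5_clock + 16 * zen5c_clock

gain = hybrid_8_plus_16 / homogeneous_16c - 1
print(f"8x Zen 5 + 16x Zen 5C vs 16x Zen 5: ~{gain:.0%} more MT throughput")   # ~33%
```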
 

yuri69

Senior member
Jul 16, 2013
389
624
136
Turin with 128c is doable as 16 * 8c CCDs. The IFOP was said to have a ~20mm reach. Now Genoa features 3-deep stacks of CCDs. Turin might rearrange them to "quads" since a straight line of 4-deep CCDs would likely be out of the IFOP reach.

Zen 5c could have 12 * dual-CCX CCDs = 192c.
 

Kepler_L2

Senior member
Sep 6, 2020
340
1,219
106
RGT claims only 16C/32T for the top Zen 5 desktop AM5 part. That seems kinda low, as I was expecting AMD to do Zen 5 + Zen 5C (or whatever they'll call it). They could easily do 8C Zen 5 + 16C Zen 5C for a total of 24 Zen 5 cores. The ISA would be the same; Zen 5C might clock ~15-20% lower, but that's fine, as they would still get a ~30% boost versus 16C Zen 5 in MT workloads.
Zen4c is coming out almost a year after Zen4; how would Zen5c be ready at the same time as Zen5 for the GNR launch?
 
  • Like
Reactions: Tlh97 and scineram

jamescox

Senior member
Nov 11, 2009
637
1,103
136
Turin with 128c is doable as 16 * 8c CCDs. The IFOP was said to have a ~20mm reach. Now Genoa features 3-deep stacks of CCDs. Turin might rearrange them to "quads" since a straight line of 4-deep CCDs would likely be out of the IFOP reach.

Zen 5c could have 12 * dual-CCX CCDs = 192c.
I would expect Zen 5 to start using stacked dies in some manner, so it may look more like MI300 than Genoa, which is partially why I am so interested in exactly what is in MI300. If they can economically use the same interconnect used for the RDNA3 MCDs, then that would reduce power consumption significantly. Going up to PCIe 5 or 6 speeds for SerDes-based GMI has to cost a lot of power. Speculating on how stacked dies are going to be used or arranged is very difficult without more information.
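
The power point is easy to put rough numbers on: link power is roughly bandwidth x energy per bit. The pJ/bit values below are ballpark assumptions for illustration (long-reach SerDes links tend to cost a few pJ/bit, short-reach fanout links well under one), not AMD-published figures for these specific parts.

```python
# Rough link-power comparison: power = bandwidth * energy per bit.
# The pJ/bit values are ballpark assumptions, not AMD-published figures.

def link_power_w(gbytes_per_s, pj_per_bit):
    bits_per_s = gbytes_per_s * 8e9
    return bits_per_s * pj_per_bit * 1e-12

BW = 64   # GB/s per direction per CCD, placeholder

for name, pj in [("SerDes-style GMI (IFOP)", 2.0), ("InFO/EFB-style fanout", 0.4)]:
    print(f"{name:24s}: ~{link_power_w(BW, pj):.1f} W for {BW} GB/s per direction")

# With these assumed numbers, two directions times a dozen CCDs works out to
# roughly a 20 W difference in package I/O power.
```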
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
I could imagine that the contact patches for the TSVs introduce quite some resistance, producing heat, consuming more power, causing more electromagnetic trouble, and so on.
I thought I saw something about using TSVs to help transfer heat since they are copper rather than silicon. Keeping the thermal expansion differences in check would be difficult though.