Zen 5 Discussion (EPYC Turin and Strix Point/Granite Ridge - Ryzen 8000)


DisEnchantment

Golden Member
Mar 3, 2017
Well, since many folks have already got their hands on Zen 4 CPUs (or are at least about to), it's time to discuss Zen 5 (Zen 4 is already old news :D)

We already have roadmaps and key technologies like AIE:
[roadmap and technology slides attached]
Some things we already knew
  • Dr. Lisa Su and Forrest Norrod already mentioned during the Q&A at FAD 2022 on May 9th that Zen 5 will come in N3 and N4/N5 variants, so it will be on multiple nodes.
  • Mark Papermaster highlighted that it will be a ground-up architecture; also mentioned in the last paragraph here.
  • Mike Clark mentioned that they already started work on Zen 5 in 2018. This means that by the time it launches, Zen 5 will have been in conception, planning, and development for much longer than the original Zen program.
For a CPU architecture launching in early 2024 in the form of Strix Point for the OEM notebook refresh, tape-out should be happening within the next few months.
Share your thoughts


"I just wanted to close my eyes, go to sleep, and then wake up and buy this thing. I want to be in the future, this thing is awesome and it's going be so great - I can't wait for it." - Mike Clark
 

BorisTheBlade82

Senior member
May 1, 2020
MLID said it is a giant interposer.

Also, I don't know if this is true, but AMD may have dumped EFB for now.


OTOH, AMD's Papermaster mentioned that there may be a future EFB with hybrid bond.
But not a giant silicon based monolithic interposer. Maybe a giant RDL interposer.
Mark my words: Although it is only my humble opinion, I am dead sure about this.
 

DisEnchantment

Golden Member
Mar 3, 2017
Also, I don't know if this is true, but AMD may have dumped EFB for now.
I find Dylan makes a mountain out of every molehill. But MLID is in a different league: not only is he barely literate on such matters, he just makes things up.
But not a giant silicon based monolithic interposer. Maybe a giant RDL interposer.
Mark my words: Although it is only my humble opinion, I am dead sure about this.
Maybe the specifics are missing. If we are talking about 2.5D packaging, then an RDL fan-out package makes more sense if the trace counts are low and no active routing logic is needed between the dies. A giant Si interposer makes less sense when EFB is available. For 3D stacking, the base die can basically be an Si interposer with functional logic. But I bet MLID would say "I told you so" even if he had said Si interposer from the beginning.
 

Joe NYC

Golden Member
Jun 26, 2021
I find Dylan makes a mountain out of every molehill. But MLID is in a different league: not only is he barely literate on such matters, he just makes things up.

On Mi300, MLID has been very good.

Also, another thing that MLID was the first to say is that V-Cache on Zen 4 will be N6, which the guy from Tom's Hardware let slip yesterday.

Anyway, this would mean that TSMC can stack N5 on N6 (Mi300) and N6 on N5 (Zen 4)
 

maddie

Diamond Member
Jul 18, 2010
On Mi300, MLID has been very good.

Also, another thing that MLID was the first to say is that V-Cache on Zen 4 will be N6, which the guy from Tom's Hardware let slip yesterday.

Anyway, this would mean that TSMC can stack N5 on N6 (Mi300) and N6 on N5 (Zen 4)
I always found the assumed issue with stacking different nodes strange. The interface, where the bond actually happens, is where they must match. Why would it matter what the interior logic size was, when it is not directly involved? Validation would take time, but stacking different nodes appears perfectly normal.
 

DisEnchantment

Golden Member
Mar 3, 2017
I always found the assumed issue with stacking different nodes strange. The interface, where the bond actually happens, is where they must match. Why would it matter what the interior logic size was, when it is not directly involved? Validation would take time, but stacking different nodes appears perfectly normal.
Drive currents, voltage ranges, CTEs, etc. differ, so they need to take care of such things.
 

BorisTheBlade82

Senior member
May 1, 2020
On Mi300, MLID has been very good.

Also, another thing that MLID was the first to say is that V-Cache on Zen 4 will be N6, which the guy from Tom's Hardware let slip yesterday.

Anyway, this would mean that TSMC can stack N5 on N6 (Mi300) and N6 on N5 (Zen 4)
I couldn't find that on THW - do you have a link? Nevertheless, it is quite likely that even THW are sometimes echoing things MLID mentions.
 

jamescox

Senior member
Nov 11, 2009
MLID said it is a giant interposer.

Also, I don't know if this is true, but AMD may have dumped EFB for now.


OTOH, AMD's Papermaster mentioned that there may be a future EFB with hybrid bond.
A lot of different stuff has been invented to avoid using a giant silicon interposer so I don't know if I believe that it is a single piece of silicon, if that is what they are trying to say. I would definitely doubt that it is a single, monolithic, silicon interposer under all 4 groups of chiplets. It is much more plausible that it is a separate interposer under each set of 2 gpu chiplets (4 silicon interposers total), but even that seems like more than necessary.

Note that much of the advanced packaging technology has reticle size limits. Even if they are talking about a reticle size limit, that doesn't mean that it is a single, monolithic piece of silicon. It seems more likely to be some form of LSI (local silicon interconnect) and RDL, which can use very thin pieces of silicon embedded under the die. TSMC has a bunch of different forms of this which all have reticle size limits, although they are likely at 3 or 4x now. I believe the thin pieces of silicon can be just passive interconnect or active chiplets, so it seems plausible that they could use an MCD type chiplet under the compute die.

This old link is still a good overview: https://www.anandtech.com/show/16051/3dfabric-the-home-for-tsmc-2-5d-and-3d-stacking-roadmap

None of these use a full silicon interposer. I don't know if the infinity fabric fan-out that they are using for RDNA3 with MCD matches any of these, so that may be something new. I thought they indicated that it was not embedded silicon. I believe they said something about it being derived from tech originally meant for mobile use.

The last slide from the link above looks a lot like the "EFB" that AMD has talked about. It appears to have copper pillars (TIV) that elevate the main chip and allow other chiplets to be embedded underneath. It also shows an SoIC stacked die (like an MCD with v-cache) under other chiplets.


[slide from the TSMC 3DFabric article attached]
 

BorisTheBlade82

Senior member
May 1, 2020
A lot of different stuff has been invented to avoid using a giant silicon interposer so I don't know if I believe that it is a single piece of silicon, if that is what they are trying to say. I would definitely doubt that it is a single, monolithic, silicon interposer under all 4 groups of chiplets. It is much more plausible that it is a separate interposer under each set of 2 gpu chiplets (4 silicon interposers total), but even that seems like more than necessary.

Note that much of the advanced packaging technology has reticle size limits. Even if they are talking about a reticle size limit, that doesn't mean that it is a single, monolithic piece of silicon. It seems more likely to be some form of LSI (local silicon interconnect) and RDL, which can use very thin pieces of silicon embedded under the die. TSMC has a bunch of different forms of this which all have reticle size limits, although they are likely at 3 or 4x now. I believe the thin pieces of silicon can be just passive interconnect or active chiplets, so it seems plausible that they could use an MCD type chiplet under the compute die.

This old link is still a good overview: https://www.anandtech.com/show/16051/3dfabric-the-home-for-tsmc-2-5d-and-3d-stacking-roadmap

None of these use a full silicon interposer. I don't know if the infinity fabric fan-out that they are using for RDNA3 with MCD matches any of these, so that may be something new. I thought they indicated that it was not embedded silicon. I believe they said something about it being derived from tech originally meant for mobile use.

The last slide from the link above looks a lot like the "EFB" that AMD has talked about. It appears to have copper pillars (TIV) that elevate the main chip and allow other chiplets to be embedded underneath. It also shows an SoIC stacked die (like an MCD with v-cache) under other chiplets.


View attachment 76032
I agree with you that this is all highly confusing. AFAIK the fan-out used on N31 is more or less identical to InFO-R(DL). The redistribution layer is not silicon-based and has no real reticle limit, but it might be bound to the reticle limit in the sense that no single connection can exceed it. The same goes for EFB: as per my understanding, the reticle limit there applies to each single bridge, while the composition as a whole is more or less unlimited. Of course I might be terribly wrong.
 

Joe NYC

Golden Member
Jun 26, 2021
A lot of different stuff has been invented to avoid using a giant silicon interposer so I don't know if I believe that it is a single piece of silicon, if that is what they are trying to say. I would definitely doubt that it is a single, monolithic, silicon interposer under all 4 groups of chiplets. It is much more plausible that it is a separate interposer under each set of 2 gpu chiplets (4 silicon interposers total), but even that seems like more than necessary.

No, that is the "base die" - an N6-based die that will have I/O, memory controllers, and SRAM. There are 4 of them, and each one is 300-350mm2.

Here is a picture from MLID that makes it clearer:
[MLID mock-up of the MI300 package attached]

The silicon interposer is underneath all of these stacked dies, these being:
- 8x stacks of HBM
- 4x of base die with compute stacked on top of these.

Note that much of the advanced packaging technology has reticle size limits. Even if they are talking about a reticle size limit, that doesn't mean that it is a single, monolithic piece of silicon. It seems more likely to be some form of LSI (local silicon interconnect) and RDL, which can use very thin pieces of silicon embedded under the die. TSMC has a bunch of different forms of this which all have reticle size limits, although they are likely at 3 or 4x now. I believe the thin pieces of silicon can be just passive interconnect or active chiplets, so it seems plausible that they could use an MCD type chiplet under the compute die.

This old link is still a good overview: https://www.anandtech.com/show/16051/3dfabric-the-home-for-tsmc-2-5d-and-3d-stacking-roadmap

None of these use a full silicon interposer. I don't know if the infinity fabric fan-out that they are using for RDNA3 with MCD matches any of these, so that may be something new. I thought they indicated that it was not embedded silicon. I believe they said something about it being derived from tech originally meant for mobile use.

The last slide from the link above looks a lot like the "EFB" that AMD has talked about. It appears to have copper pillars (TIV) that elevate the main chip and allow other chiplets to be embedded underneath. It also shows an SoIC stacked die (like an MCD with v-cache) under other chiplets.


View attachment 76032

As far as the connections in the picture above, each of the 4 pairs of HBM stacks most likely only needs to talk to its adjacent base die.

So, one way to save on the size of the silicon interposer would be to have those 4 connections use a different technology.

But the 4 base dies need a high-bandwidth, low-latency interconnect, so possibly the silicon interposer would only be under those 4 base dies.

Intel is using EMIB in SPR to connect the "tiles", but I think the bandwidth requirements of a disaggregated GPGPU are an order of magnitude (or more) higher than what SPR requires.

BTW, this may mean nothing, but there is a rumor out there that AMD had yield issues with the EFB on the MI250.
 

BorisTheBlade82

Senior member
May 1, 2020
No, that is the "base die" - an N6-based die that will have I/O, memory controllers, and SRAM. There are 4 of them, and each one is 300-350mm2.

Here is a picture from MLID that makes it clearer:
View attachment 76039

The silicon interposer is underneath all of these stacked dies, these being:
- 8x stacks of HBM
- 4x of base die with compute stacked on top of these.



As far as the connections in the picture above, each of the 4 pairs of HBM stacks most likely only needs to talk to its adjacent base die.

So, one way to save on the size of the silicon interposer would be to have those 4 connections use a different technology.

But the 4 base dies need a high-bandwidth, low-latency interconnect, so possibly the silicon interposer would only be under those 4 base dies.

Intel is using EMIB in SPR to connect the "tiles", but I think the bandwidth requirements of a disaggregated GPGPU are an order of magnitude (or more) higher than what SPR requires.

BTW, this may mean nothing, but there is a rumor out there that AMD had yield issues with the EFB on the MI250.
Not for a single moment do I believe that AMD might produce a silicon interposer north of 1600mm2 - I do not even know if someone in the world is able to do this. And more importantly: There is absolutely no need.
The HBM stacks have the same bandwidth demand as the MCDs of N31, where InFO-R is sufficient. Apple has shown that you can produce chiplet GPUs by connecting them via a silicon bridge.
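Rough napkin math to back that up (figures from memory of public materials, so treat them as approximate): AMD quoted about 5.3 TB/s aggregate for N31's Infinity Fanout links, i.e. roughly 5.3 / 6 ≈ 0.9 TB/s per MCD, while an HBM3 stack tops out at around 1024 bit × 6.4 Gbps ≈ 0.8 TB/s. Per link, those are in the same bandwidth class, so the same kind of packaging should be able to feed a stack.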
 

Joe NYC

Golden Member
Jun 26, 2021
Not for a single moment do I believe that AMD might produce a silicon interposer north of 1600mm2 - I do not even know if someone in the world is able to do this. And more importantly: There is absolutely no need.

I have seen a number of TSMC presentations saying that they can produce the interposers of this size.
If anything is ever going to use a big interposer, what better kind of product than a $10,000 - $30,000 AI / HPC processor?

$100 is not going to break the bank for this type of product.

The HBM stacks have the same bandwidth demand as the MCDs of N31, where InFO-R is sufficient. Apple has shown that you can produce chiplet GPUs by connecting them via a silicon bridge.

Less expensive technologies can be used for point-to-point connections between adjacent chips.

But Mi300 may have a mesh, and it may not be a mesh of just 4 base dies, but 8-9 compute dies.
 

DisEnchantment

Golden Member
Mar 3, 2017
Found an interesting patent from AMD for increasing IPC by concurrently executing both sides of a branch instruction.

ALTERNATE PATH FOR BRANCH PREDICTION REDIRECT

They would need a good bump in the size of the register file and the other OoO resources to pull this off.
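To make this concrete, here is a rough sketch (my own illustration, not anything from the patent) of the kind of data-dependent branch where fetching and executing both paths could pay off, because the outcome is effectively a coin flip and the predictor can't do much with it:

```c
#include <stdio.h>
#include <stdlib.h>

/* A branch whose outcome is ~50/50 on random data, i.e. close to
 * unpredictable. A core that speculatively runs BOTH the if- and
 * else-path could simply retire the correct one once the compare
 * resolves, instead of eating a full pipeline flush on roughly
 * every other iteration. */
long sum_threshold(const int *data, int n, int threshold)
{
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (data[i] < threshold)   /* hard-to-predict branch */
            sum += data[i];        /* path A */
        else
            sum -= data[i];        /* path B */
    }
    return sum;
}

int main(void)
{
    enum { N = 1 << 20 };
    int *data = malloc(N * sizeof *data);
    if (!data)
        return 1;
    for (int i = 0; i < N; i++)
        data[i] = rand() & 0xFF;   /* random values defeat the predictor */
    printf("%ld\n", sum_threshold(data, N, 128));
    free(data);
    return 0;
}
```

Of course, well-predicted branches (the vast majority) get nothing out of this, which is where the extra register file and OoO resources would have to earn their keep.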

Reposting earlier patents for increasing decode width and multiple op cache pipelines which did not make it to Zen 4.
Not sure what "Re-pipelined front end and wide issue" is going to be, or whether it will include such patents at all, but interesting regardless.

PROCESSOR WITH MULTIPLE FETCH AND DECODE PIPELINES
PROCESSOR WITH MULTIPLE OP CACHE PIPELINES

Compressing Micro-Operations in Scheduler Entries in a Processor
 

BorisTheBlade82

Senior member
May 1, 2020
Found an interesting patent from AMD for increasing IPC by concurrently executing both sides of a branch instruction.

ALTERNATE PATH FOR BRANCH PREDICTION REDIRECT
View attachment 76053

They would need a good bump in the size of the register file and the other OoO resources to pull this off.

Reposting earlier patents for increasing decode width and multiple op cache pipelines which did not make it to Zen 4.
Not sure what "Re-pipelined front end and wide issue" is going to be, or whether it will include such patents at all, but interesting regardless.

PROCESSOR WITH MULTIPLE FETCH AND DECODE PIPELINES
View attachment 76051
PROCESSOR WITH MULTIPLE OP CACHE PIPELINES
View attachment 76052

Compressing Micro-Operations in Scheduler Entries in a Processor
View attachment 76056
Thanks for posting. I'd say at least the first one is the kind of patent they file by the thousands. The trade-offs involved make it possible that we might never see it in a product.
 

DisEnchantment

Golden Member
Mar 3, 2017
Thanks for posting. I'd say at least the first one is the kind of patent they file by the thousands. The trade-offs involved make it possible that we might never see it in a product.
Actually, in one embodiment they said to use the resources which would otherwise have been used by SMT, so it is something they might try, but indeed it is just a patent, and a one-off at that. Still, disabling SMT in exchange for IPC gains sounds acceptable.
For example, a processor (or processor core) that implements simultaneous multithreading executes a software thread along the main path using a first logical or physical pipeline (or first hardware thread) and the alternate path using a second logical or physical pipeline (or second hardware thread).

However, I noticed AMD has been working a lot on PIM; here is just a handful of what I found:
PROVIDING ATOMICITY FOR COMPLEX OPERATIONS USING NEAR-MEMORY COMPUTING
From <https://www.freepatentsonline.com/y2022/0413849.html>
APPROACH FOR REDUCING SIDE EFFECTS OF COMPUTATION OFFLOAD TO MEMORY
From <https://www.freepatentsonline.com/y2023/0004491.html>
ERROR CHECKING DATA USED IN OFFLOADED OPERATIONS
From <https://www.freepatentsonline.com/y2022/0318089.html>
DETECTING EXECUTION HAZARDS IN OFFLOADED OPERATIONS
From <https://www.freepatentsonline.com/y2022/0318085.html>
Processing-in-memory concurrent processing system and method
From <https://www.freepatentsonline.com/11468001.html>
OFFLOADING COMPUTATIONS FROM A PROCESSOR TO REMOTE EXECUTION LOGIC
From <https://www.freepatentsonline.com/y2022/0206855.html>
MEMORY ALLOCATION FOR PROCESSING-IN-MEMORY OPERATIONS
From <https://www.freepatentsonline.com/y2021/0303355.html>
Command throughput in PIM-enabled memory using available data bus bandwidth
From <https://www.freepatentsonline.com/11262949.html>
HARDWARE-SOFTWARE COLLABORATIVE ADDRESS MAPPING SCHEME FOR EFFICIENT PROCESSING-IN-MEMORY SYSTEMS
From <https://www.freepatentsonline.com/y2022/0066662.html>
PROCESSOR-GUIDED EXECUTION OF OFFLOADED INSTRUCTIONS USING FIXED FUNCTION OPERATIONS
From <https://www.freepatentsonline.com/y2022/0188117.html>
REUSING REMOTE REGISTERS IN PROCESSING IN MEMORY
From <https://www.freepatentsonline.com/y2022/0206685.html>
PRESERVING MEMORY ORDERING BETWEEN OFFLOADED INSTRUCTIONS AND NON-OFFLOADED INSTRUCTIONS
From <https://www.freepatentsonline.com/y2022/0206817.html>
Providing host-based error detection capabilities in a remote execution device
From <https://www.freepatentsonline.com/11409608.html>
I am wondering if the PIM feature they had with the Xilinx Virtex UltraScale+ and Samsung Aquabolt-XL will make it to Zen 5 DC parts with HBM (MI300-type parts); usually recurring patents and provisional patents are good candidates for making it into a product.
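To illustrate the general idea behind these offload patents (purely my own toy sketch; the command structure and names below are hypothetical, not AMD's actual interface), the host builds a small command descriptor and lets compute logic sitting next to the DRAM do the work, instead of streaming the whole array through the core's cache hierarchy:

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical PIM-style offload, simulated entirely on the host.
 * In real hardware the "near-memory unit" would sit in or next to
 * the DRAM dies and be driven by special offload instructions. */

typedef struct {
    const uint64_t *base;   /* operand location in memory    */
    size_t          count;  /* number of elements            */
    int             op;     /* 0 = sum (only op in this toy) */
} pim_cmd_t;

/* Stand-in for the compute logic next to the DRAM arrays. */
static uint64_t pim_execute(const pim_cmd_t *cmd)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < cmd->count; i++)
        acc += cmd->base[i];    /* data never crosses the real memory bus */
    return acc;
}

int main(void)
{
    static uint64_t data[1024];
    for (size_t i = 0; i < 1024; i++)
        data[i] = i;

    /* Host side: describe the work instead of touching the data. */
    pim_cmd_t cmd = { .base = data, .count = 1024, .op = 0 };
    printf("near-memory sum = %llu\n",
           (unsigned long long)pim_execute(&cmd));
    return 0;
}
```

Most of the patents above then deal with the messy parts this toy ignores: atomicity, error checking, execution hazards, ordering against normal loads/stores, and address mapping.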
 

Bigos

Member
Jun 2, 2019
Compressing Micro-Operations in Scheduler Entries in a Processor
View attachment 76056

Some kind of compression is already being done in Zen 4, though only for NOP instructions. This has some utility when branch targets are aligned using more than one NOP instruction, but I am not sure how often that happens (the compiler/assembler can emit one pretty long NOP instruction instead).

Looking forward to having this used in more cases.

 

Doug S

Platinum Member
Feb 8, 2020
Found an interesting patent from AMD for increasing IPC by concurrently executing both sides of a branch instruction.


There has been research/work on this since at least the 90s, and while I believe a few CPUs may do this on a very limited basis (I've seen claims that Apple's big cores can run both paths in certain cases, though that may simply be to allow progress before the branch predictor has its result ready), no one has gone all-in on it because branch predictors are so good these days that you won't get much out of it.

Sure, there are some branches that are essentially impossible to predict where it would be of benefit (so long as they aren't quickly followed by more such branches), but then you are paying a price in terms of additional transistors, power to operate them, and verification time for something that doesn't help you very often.
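Some napkin math on why the win is small (the accuracy and penalty below are assumed round numbers for illustration, not measurements of any real core):

```c
#include <stdio.h>

/* Back-of-envelope: with an assumed ~97% predictor hit rate and an
 * assumed ~15-cycle flush penalty, the average cost of mispredicts
 * is already well under a cycle per branch, so dual-path execution
 * has a very low ceiling on what it can recover. */
int main(void)
{
    double accuracy = 0.97;  /* assumed predictor hit rate   */
    double penalty  = 15.0;  /* assumed flush cost in cycles */

    double avg_loss = (1.0 - accuracy) * penalty;
    printf("expected mispredict cost: %.2f cycles per branch\n", avg_loss);
    /* ~0.45 cycles/branch is the most dual-path execution could give
     * back, before paying for the wasted fetch/decode/execute work. */
    return 0;
}
```

And that is before counting the wrong-path work that has to be thrown away every single time.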
 

Mopetar

Diamond Member
Jan 31, 2011
I don't think there's as much performance to be had from such a scheme as most people might assume. It assumes that taking a branch isn't going to result in a cache miss that would delay execution long enough for the result of the condition to become available (and further that the extra memory access we didn't need to make isn't polluting or thrashing the cache and degrading performance down the road), and it likely doesn't handle situations where there are nested branches several layers deep and you have a large fan-out of possible paths.

There are a lot of other cases, like large iteration count loops, where executing the alternative path is pointless 99.999999% of the time and only saves you a few cycles on that last iteration. Granted, they can incorporate some branch prediction into this to avoid those cases, and there are a lot of other considerations for something like this, but ultimately it's only as good as your branch prediction isn't. It almost warrants a specific instruction that could be used when dealing with what's essentially a random outcome or something that will only give the branch predictor fits.
 

yuri69

Senior member
Jul 16, 2013
There has been research/work on this since at least the 90s, and while I believe a few CPUs may do this on a very limited basis (I've seen claims that Apple's big cores can run both paths in certain cases, though that may simply be to allow progress before the branch predictor has its result ready), no one has gone all-in on it because branch predictors are so good these days that you won't get much out of it.

Sure, there are some branches that are essentially impossible to predict where it would be of benefit (so long as they aren't quickly followed by more such branches), but then you are paying a price in terms of additional transistors, power to operate them, and verification time for something that doesn't help you very often.
Exactly this.

That idea is sometimes called eager execution. Eager execution does twice the work and throws half of that burned power away, which is quite bad given the overall accuracy of current predictors. Turning eager execution on and off based on the prediction history doesn't sound easy. There is an RWT thread about this topic.
 

Mopetar

Diamond Member
Jan 31, 2011
It's certainly nothing new, but any company would be foolish to put something in a product that they don't have a patent on. There are too many patent trolls out there who are eager for an opportunity for a potential payout.
 

jamescox

Senior member
Nov 11, 2009
No, that is the "base die" - an N6-based die that will have I/O, memory controllers, and SRAM. There are 4 of them, and each one is 300-350mm2.

Here is a picture from MLID that makes it clearer:
View attachment 76039

The silicon interposer is underneath all of these stacked dies, these being:
- 8x stacks of HBM
- 4x of base die with compute stacked on top of these.



As far as the connections in the picture above, each of the 4 pairs of HBM stacks most likely only needs to talk to its adjacent base die.

So, one way to save on the size of the silicon interposer would be to have those 4 connections use a different technology.

But the 4 base dies need a high-bandwidth, low-latency interconnect, so possibly the silicon interposer would only be under those 4 base dies.

Intel is using EMIB in SPR to connect the "tiles", but I think the bandwidth requirements of a disaggregated GPGPU are an order of magnitude (or more) higher than what SPR requires.

BTW, this may mean nothing, but there is a rumor out there that AMD had yield issues with the EFB on the MI250.
Yeah, MLID's mock-up is what I meant by "a separate interposer under each set of 2 gpu chiplets (4 silicon interposers total)". That is plausible, and much more plausible than a single giant silicon interposer. You would still need to connect them together somehow, with RDL or embedded bridge chips. The HBM would likely need embedded bridge chips, so you are up to possibly 3 layers of silicon die instead of 2. Four of the "base die" would probably be over 1200 mm2 of silicon, with each one not too far from an entire Epyc IO die in size.

They could fit a ridiculous amount of cache on each one, but that large a cache has not proved to be that useful for GPUs. The infinity cache on RDNA3 is only 96 MB. MI300 has 8 stacks of HBM3, so it doesn't need the bandwidth boost from infinity cache; it already has ridiculously high bandwidth. If it is stacked with SoIC rather than other stacking tech, then that could be a very different beast. That could allow compute units to have massive local caches rather than a monolithic but much more distant L3 cache. All chiplets used would need to be designed with that in mind, though. The Zen 4 chiplets likely would not be able to use it. In fact it is unclear how stacked Zen 4 chiplets will work on a base die anyway. One of AMD's slides did show what looked like two 8-core cpu chiplets on each end of the "base die" and something else in the middle. I don't know how they get 24 cores from that, but it could be that the cpu chiplets are mounted over the "IO area" and the square in the middle that appears to be something else is mounted over the "cache area".

Given AMD's modular approach, something like an embedded MCD (off-package memory controllers + cache), an embedded IO die, bridge chips (LSI), etc. seems like it may make more sense: chiplets that can be used across many different products rather than just MI300. I am not sure if we have any info on what the MI300 will have for off-package connectivity in the SH5 socket. Will it have the same IO as SP5? HPC often needs TBs of memory, so it can't just be the 128 GB of HBM. Also, sending signals across these giant interposers may be problematic. They don't daisy-chain or run silicon under cpu chiplets in Epyc; it is better to just go the IFOP route with a separate connection than to have to route across multiple chips in silicon. The Epyc IO die already has a number of switches and repeaters internally that add latency. Due to the scaling differences between IO, cache, and logic, I am still thinking that the "base die" may be something made out of a number of different pieces of silicon, made on slightly different processes.
 

scineram

Senior member
Nov 1, 2020
It's certainly nothing new, but any company would be foolish to put something in a product that they don't have a patent on. There are too many patent trolls out there who are eager for an opportunity for a potential payout.
If it's nothing new, the patent is invalid and completely wasted.