On every modern CPU, instruction timing is wildly variable. Your 6502 example had memory running at zero latency, with an extra one-cycle penalty for more-than-8-bit addressing. CPUs now usually have three levels of cache, with instruction and data split at the first level; memory is divided into many differently timed pages; and code usually runs under translated virtual memory, so access latency at every cache level also varies with translation-cache hits and misses and the resulting page walks. The result is that a single instruction's execution can vary from one cycle to a thousand cycles. And CPUs can reorder instructions to hide that, across windows of several hundred instructions. So studying how code is actually executing needs special tools to diagnose it - Intel, for example, provides great tools for that.
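To make that variability concrete, here is a minimal sketch of the kind of measurement being described - not taken from any of the tools mentioned, just a pointer chase timed with the x86 TSC, comparing a buffer that fits in L1 with one that spills far past the last-level cache. The buffer sizes and the GCC/Clang x86-64 intrinsics are my own assumptions for illustration.

```c
/* latency_sketch.c - illustrative only: how much the cost of a single load varies.
 * Build: gcc -O2 latency_sketch.c -o latency_sketch  (x86-64, GCC/Clang assumed).
 * Note: __rdtsc() counts TSC ticks (a constant reference clock), not core cycles,
 * but it is enough to show the spread between cache-resident and DRAM-bound loads. */
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>          /* __rdtsc, _mm_lfence */

/* Build a random cyclic pointer chain over n slots so the hardware prefetcher
 * cannot hide the latency of each dependent load. */
static size_t *make_chain(size_t n)
{
    size_t *buf = malloc(n * sizeof *buf);
    size_t *idx = malloc(n * sizeof *idx);
    for (size_t i = 0; i < n; i++) idx[i] = i;
    for (size_t i = n - 1; i > 0; i--) {          /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        buf[idx[i]] = idx[(i + 1) % n];
    free(idx);
    return buf;
}

/* Chase the chain for `iters` dependent loads; return average ticks per load. */
static double chase(size_t *buf, size_t iters)
{
    volatile size_t cur = 0;
    _mm_lfence();
    unsigned long long t0 = __rdtsc();
    for (size_t i = 0; i < iters; i++)
        cur = buf[cur];
    _mm_lfence();
    unsigned long long t1 = __rdtsc();
    return (double)(t1 - t0) / (double)iters;
}

int main(void)
{
    size_t small = 4096 / sizeof(size_t);           /* ~4 KiB: stays in L1D        */
    size_t big   = (256u << 20) / sizeof(size_t);   /* ~256 MiB: DRAM + TLB misses */
    size_t *a = make_chain(small);
    size_t *b = make_chain(big);
    printf("L1-resident chase : %6.1f ticks/load\n", chase(a, 10000000));
    printf("DRAM-sized chase  : %6.1f ticks/load\n", chase(b, 10000000));
    free(a);
    free(b);
    return 0;
}
```

On a typical desktop part the first chase costs a handful of ticks per load while the second can cost well over a hundred; tools like perf, VTune or uProf will attribute that variation to specific cache and TLB events far more precisely.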
Has anyone here postulated that Zen 5 being on N3 and N4 could mean that the single-CCD SKUs may use N4 and the dual-CCD ones may use N3? It's also possible that the E-core CCD may use N3 for minimal energy usage while the P-core CCD will benefit from the maturity of the N4 node family.
I think that for the Zen 5 generation, AMD will have native 16-core chiplets, not dual CCD. On N3.
That certainly would be exciting. So Zen 5 is expected to go up to 64 threads?
That depends. I'd expect that for the consumer market there will always be at least one big 8c CCD with Zen 5. The second CCD could be a 16c Zen 5c CCD or another 8c one - giving you up to 48 hybrid threads (8 + 16 = 24 cores with SMT).
Oh, it would. Even if it was just a single 16C chiplet for Zen 5 and we had to wait for a dual-chiplet version until Zen 6. I don't intend to upgrade to Zen 5 anyway.
Yeah, having to cut down a 16-core CCD to 4-, 6- and 8-core CPUs would be pretty wasteful.
Not only that: IMHO for the client market, competitive ST performance will remain a significant factor even in the long run. The server market OTOH is already on the verge.
I expect some interesting things from Zen 5, considering AMD has not been developing cores on a shoestring budget for a couple of years now.
Zen 3 was developed pretty much during the years of austerity at AMD. Zen 4 slightly less so, and Zen 5 should see the first fruits of R&D done under better days.
But more interesting for me is indeed packaging and SoC architecture. MI300 is almost here (next week?) to give us a glimpse of next-gen packaging.
Curious to see whether InFO-R will replace the substrate-based PHY for 2.5D packaging on the Zen 5 family. Bergamo seems to have demonstrated the limits of routing with the substrate-based interconnects, and a likely way forward is fanout-based RDLs at a minimum, if not active bridges.
Besides the issue of there being practically no more space for traces coming out from the IOD to the CCDs, there is also the problem of the next-gen IF, which as per employee LinkedIn profiles can hit up to 64 Gbps, compared to the current 36 Gbps.
I think InFO-3D could be a wildcard to enable lower-cost 3D packaging. InFO-3D fits nicely here, enabling a lower interconnect density than FE packaging like SoIC, but dense enough for SoC-level interconnects when stacking on top of the IOD. There is a big concern at the moment with F15 and F14 being underutilized, and TSMC is pushing customers from 16FF and older nodes to the N7 family while ramping down those fabs (commodity process nodes, you might say). Having any customer generously making use of N7/N6 besides the leading node would be a win-win.
Regarding the core perf gains, they have more transistors and a more efficient process to work with, so at the very least just throwing more transistors at the problem should bring decent gains, if their ~6 years (2018-2023) of 'ground-up design' of Zen 5 are to be worthwhile. Zen 4 is behind its key contemporaries in capacity in almost all the key resources of a typical OoO machine. Pretty good (though not surprising given other factors) that it even keeps up.
Nevertheless, a few AMD patents regarding core architecture that I have been reading strike me as intriguing, and I wonder if they will make it into Zen 5 in some form.
Not coincidentally, all these patents are about increasing resources without drastically increasing transistor usage.
- Dual fetch/decode and op-cache pipelines
  - This seems like something that would be very interesting for mobile: power-gate the second pipeline during less demanding loads
  - Remove the secondary decode pipeline for a Zen 5c variant? Let's say 2x 4-wide decode for Zen 5 and 4-wide for Zen 5c
- Retire queue compression
- Op-cache compression
- Cache compression (a toy sketch of the general idea follows after this list)
- Master-Shadow PRF
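Only the patent titles are public knowledge here, so purely to illustrate the general idea behind something like "cache compression" - squeezing more effective capacity out of the same SRAM/transistor budget - here is a toy base-plus-delta compressor for one 64-byte line in C. The one-base/1-byte-delta format is my own simplification for illustration, not anything from AMD's patents.

```c
/* toy_line_compression.c - a toy "base + delta" compressor for one 64-byte cache line.
 * Purely illustrative of the general idea (more effective capacity per bit of SRAM);
 * real proposals (BDI, FPC, etc.) and AMD's patented scheme differ in the details. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define WORDS 8                  /* 8 x 8-byte words = one 64-byte line            */
#define PACKED_BYTES (8 + WORDS) /* one 8-byte base + eight 1-byte deltas = 16 B   */

/* Try to encode the line as base + small signed deltas.
 * Returns 1 on success (line fits in 16 bytes), 0 if it must stay uncompressed. */
static int compress_line(const uint64_t line[WORDS], uint8_t out[PACKED_BYTES])
{
    uint64_t base = line[0];
    int8_t deltas[WORDS];
    for (int i = 0; i < WORDS; i++) {
        int64_t d = (int64_t)(line[i] - base);   /* wrapping difference, range-checked */
        if (d < INT8_MIN || d > INT8_MAX)
            return 0;
        deltas[i] = (int8_t)d;
    }
    memcpy(out, &base, 8);
    memcpy(out + 8, deltas, WORDS);
    return 1;
}

static void decompress_line(const uint8_t in[PACKED_BYTES], uint64_t line[WORDS])
{
    uint64_t base;
    memcpy(&base, in, 8);
    for (int i = 0; i < WORDS; i++)
        line[i] = base + (uint64_t)(int64_t)(int8_t)in[8 + i];  /* sign-extend delta */
}

int main(void)
{
    /* Pointer-like values that differ only in their low bits compress well. */
    uint64_t line[WORDS] = {0x7fff1000, 0x7fff1008, 0x7fff1010, 0x7fff1018,
                            0x7fff1020, 0x7fff1028, 0x7fff1030, 0x7fff1038};
    uint8_t packed[PACKED_BYTES];
    if (compress_line(line, packed)) {
        uint64_t restored[WORDS];
        decompress_line(packed, restored);
        printf("64 bytes -> %d bytes, roundtrip %s\n", PACKED_BYTES,
               memcmp(line, restored, sizeof line) == 0 ? "ok" : "FAILED");
    } else {
        printf("line stored uncompressed\n");
    }
    return 0;
}
```

Lines full of pointers or small integers compress to a fraction of their size under schemes like this, while incompressible lines simply fall back to the uncompressed path; the hardware trade-off is the extra (de)compression latency and tag bookkeeping.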
But those 16 cores would only be the dense cores, as far as I understand. They would not be too exciting for desktop customers, since the ST performance might be weak.
We have seen some changes to the caches with different Zen generations, but I think Zen 5 is going to bring changes to the whole cache hierarchy. These may be significantly more radical than the Zen 2 to Zen 3 changes. I don't know whether that will result in big improvements though. Pushing single-thread performance is obviously getting harder and harder, so I am keeping my expectations low. They can much more easily push FP performance, so I am expecting a significant increase there.
It's not only about the improvements directly achieved but also the new technologies introduced (which can then be refined) and the future improvements enabled by the changes (the usual even Zen gen).
Also the excitement may be not only about the Zen cores but also the package layout with CCDs and one IOD, which with Zen 4 was still essentially unchanged since Zen 2.
I had initially thought that we would get almost everything stacked in the Zen 5 generation, but that just doesn't seem to be the case. The stacked silicon packages are more expensive and they add some limitations. They generally require that chips be directly adjacent. The Infinity Fabric fan-out used for the MCD/GCD connection seems to be an in-between tech (almost 900 GB/s), but it also has the adjacency limitation. You can't easily build something like Genoa with either tech, since Genoa does not place the chips adjacent to each other. I thought about daisy-chaining chiplets, but routing high-speed signals that distance is also a problem. They might be able to place 4 along each edge and 2 on the top and bottom, but that would require a lot of IO die design work and the chips may be too large.
Also, CPUs just do not really need that much bandwidth. Stacked devices would be lower power, but that seems to be one of the few advantages of stacked silicon for CPUs. I suspect that MCMs with Infinity Fabric connected chips (technically not chiplets) are going to stay with us for quite a while yet. They can continue to make them very cheaply, since the same chiplet is used for a huge number of products. Intel, with an expensive stacked silicon package, will likely have trouble competing with this. Intel has their own fabs, so I guess they don't take as big of a hit from having everything on the same tile/chiplet and made on the most advanced process. AMD has been splitting everything out to allow them to make IO, cache, and logic all on different processes, which should allow them to better compete on price. This is in addition to having a cheaper MCM package.
For GPUs, 2.5D or 3D stacked silicon or interconnect makes sense due to the bandwidth requirements, but AMD isn't even using stacking for consumer-level GPU devices. They are using Infinity Fabric fan-out to connect the MCDs to the GCD. The Infinity Cache also allows the use of cheaper memory, rather than HBM. Stacking seems to be reserved for the very high end like MI300. Since they probably have to use EFB to connect to the HBM, I suspect that the base dies are connected together with EFB, which is also a cost-saving packaging tech used in place of a full interposer. It would be great to be able to get HBM in consumer products though. An APU with a single HBM stack for mobile would be a powerful device. This also leads me to wonder what AMD could possibly still be making at GlobalFoundries. It looks like GF has made HBM in the past, so I was actually wondering if it is plausible that AMD would make a specialized version of HBM at GF using Infinity Fabric fan-out links rather than 2.5D connections. The PC market is going to need something to compete with Apple's M-series chips. This may require some add-on accelerator chips for video editing and such. Perhaps such things could be connected with Infinity Fabric fan-out.
I share your feelings. I was also more positive about the adoption of advanced packaging in the CPU space by AMD.
Yesn't.
AMD has been all about cheap-to-make products using a sensible tech.
The MI300 is a premium product with a price tag surely sitting well above the 128c server chips or the previous accelerators.
3D V-cache isn't exactly cheap either, considering the cache chiplets are only one node behind the main processor die.
They can get an extra $100 for it, so the cost makes sense there. The big question is how the incremental cost of more advanced packaging compares to its product benefits. And more likely than not, that tradeoff will change over time.
Yep and "isn't cheap" is not an issue isThey can get an extra $100 for it, so the cost makes sense there. The big question is how the incremental cost of more advanced packaging compares to its product benefits. And more likely than not, that tradeoff will change over time.
Is the 36 Gbps IFOP from the Zen 2 generation?
It depends on FCLK. Earlier Zen generations had slightly lower FCLK: Zen 2 was stable at 1600 MHz FCLK, while Zen 4 is around 2000 MHz.
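As a rough illustration of why FCLK is the knob here: assuming the commonly reported 32-byte-per-FCLK read and 16-byte-per-FCLK write width of an IFOP link (an assumption on my part, not a figure stated in this thread), the per-CCD link bandwidth works out as below. The Gbps figures quoted above describe the link's signalling rate rather than this aggregate, so the two are not directly comparable.

```c
/* ifop_bw.c - rough per-link IFOP bandwidth as a function of FCLK.
 * Assumes a 32 B/FCLK read and 16 B/FCLK write link width (illustrative assumption). */
#include <stdio.h>

int main(void)
{
    const double read_bytes  = 32.0;               /* bytes per FCLK, read path       */
    const double write_bytes = 16.0;               /* bytes per FCLK, write path      */
    const double fclk_mhz[]  = {1600.0, 2000.0};   /* roughly Zen 2 vs Zen 4 FCLK     */

    for (int i = 0; i < 2; i++) {
        double rd_gbs = read_bytes  * fclk_mhz[i] / 1000.0;  /* B/cycle * MHz / 1000 = GB/s */
        double wr_gbs = write_bytes * fclk_mhz[i] / 1000.0;
        printf("FCLK %4.0f MHz: ~%5.1f GB/s read, ~%4.1f GB/s write per CCD link\n",
               fclk_mhz[i], rd_gbs, wr_gbs);
    }
    return 0;
}
```

Under that assumption, the jump from ~1600 MHz to ~2000 MHz FCLK moves a link from roughly 51 GB/s to 64 GB/s of read bandwidth.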
Zen 5 won't change anything in this regard, but AFAIK Zen 6 and onwards will use MI300/MI400-style packaging.
When Zen 2 came out in 2019, it was a radical change when it comes to CPU packaging as well as core scaling with the SCF/SDF.
5 years later, in 2024, we would hope something new will come up to address the shortcomings of this technology for the next-gen CPUs:
- Latency is a known issue with AMD's IFOP, and for addressing that a few LinkedIn posts put the next-gen IF at up to 64 Gbps. This is a big jump and could have a major efficiency impact if the same IFOP is used going forward.
- Bandwidth was not an issue with earlier cores, but Zen 4 showed signs of bandwidth deficiency in many workloads with the 36 Gbps IFOP.
- Power: well, 0.4 pJ/bit for the MCD links vs ~2 pJ/bit speaks for itself. GLink is being advertised at 0.25 pJ/bit. (A quick back-of-envelope on these numbers is sketched below.)
- Die area consumed on Zen 4 for the IFOP is ~7 mm² of a 66 mm² chip (excluding the SDF/SCF that is part of L3) - that is ~10% of expensive N4P and soon N3E silicon. The GCD-MCD links have demonstrated a smaller beachfront for higher bandwidth density; GUC's GLink, for instance, needs 3 mm of beachfront to provide 7.5 Tbps of BW.
- Trace density from the IOD to the CCDs and IO: it seems a limit has already been reached on how much space is available to route the signals from the IOD to the CCDs, considering space is also needed for IO/memory/etc. traces.
AMD will have to address the above problems with a new interconnect; even their competitor is using much more exotic BE packaging in current-gen products. But I wouldn't hold my breath if next-gen CPUs are stuck on the same tech.
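To put the pJ/bit and beachfront numbers from the list above into perspective, here is a quick back-of-envelope; the 100 GB/s operating point is just an example I picked, not a figure from the thread.

```c
/* link_energy.c - what the pJ/bit and Tbps/mm figures above mean at one operating point.
 * The 100 GB/s traffic level is an arbitrary example chosen for illustration. */
#include <stdio.h>

int main(void)
{
    const double gbytes_per_s = 100.0;                  /* example per-link traffic      */
    const double bits_per_s   = gbytes_per_s * 8e9;     /* 100 GB/s = 8e11 bit/s         */
    const double pj_per_bit[] = {2.0, 0.4, 0.25};       /* IFOP-class, MCD fanout, GLink */
    const char  *label[]      = {"~2 pJ/bit  ", "0.4 pJ/bit ", "0.25 pJ/bit"};

    for (int i = 0; i < 3; i++)
        printf("%s at 100 GB/s -> %.2f W of link power\n",
               label[i], pj_per_bit[i] * 1e-12 * bits_per_s);

    /* Beachfront density: the quoted GLink figure of 7.5 Tbps over 3 mm of die edge. */
    printf("7.5 Tbps over 3 mm = %.1f Tbps per mm of die edge\n", 7.5 / 3.0);
    return 0;
}
```

So at the same traffic level, a ~2 pJ/bit substrate link burns roughly five times the power of a 0.4 pJ/bit fanout link (about 1.6 W vs 0.32 W at 100 GB/s), and the quoted GLink figure works out to about 2.5 Tbps per millimetre of die edge.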
As for the costs, AMD is doing fanout links on a 750 USD GPU product whose actual chip should be sold at half of that, if not less - and with 6 of these links. And SoIC parts like the 5800X3D sell for less than 300 USD.
I noticed one patent which attempted to address this issue by using a normal fanout when the CCDs are a single column on each side of the IOD, and adding bridges in the fanout when there are multiple columns of CCDs. So basically the same CCD/IOD can be used but packaged differently for different configs.
https://www.freepatentsonline.com/11469183.html
Rumors of an MI300C have been floating around; let's see if this is real in a couple of days. It could be a precursor.
That seems like the most logical progression.
It seems that, with the MI300 approach about to be released this year across GPU, APU and CPU, I don't think AMD is going to expend any money or effort on any half measure between the current Zen 4 (Genoa) and Zen 5 (Turin) on the SP5 socket and the "nirvana" of MI300/MI400.