IMO, it's not overkill. This is not a bus to connect a single-purpose chiplet to common memory.
This is a bus to let the M1 Ultra act as if it were one big chip containing all the parts of both SoCs. All the GPU units work together, the caches work together, the NPUs work together, the external I/O works together, the media encoders work together, without any NUMA or SLI issues. When connected it really is like one big chip, and that needs a connection to everything at high speed and low latency, just as if it were all on the same die.
It's pretty freaking amazing.
You're assuming the only SoC to LPDDR layout they can use is the one they use now. The iPhone stacks the LPDDR on top of the SoC; if the memory chips weren't in the way that's not a problem, though I'm not sure about the limits of that stacking.

Not only does that displace the memory chips, which are positioned adjacent to the appropriate chip, but it also implies the only bandwidth that matters is memory bandwidth.
For the chips to appear to software as a monolithic die, the throughput from every core to every other core needs to be uniform, or at least within margins. Considering cache can run in the TB/s range, you can't split that bus across more than 2 chips without the cache taking a severe bandwidth hit.
I significantly doubt more than two chips on a package is in the cards for this design.
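To put rough numbers on that (a toy model of my own, not anything Apple has published): if cache lines are interleaved uniformly across n chips, the fraction of traffic that has to cross the die-to-die links grows quickly with n.

```python
# Toy model (my assumption, not Apple's data): with addresses interleaved
# uniformly across n chips, a fraction (n - 1) / n of each chip's cache
# traffic lands on a remote chip and must cross the die-to-die links.
def remote_fraction(n_chips: int) -> float:
    return (n_chips - 1) / n_chips

def cross_chip_load_tb_s(cache_bw_tb_s: float, n_chips: int) -> float:
    """Cross-chip traffic a single chip generates, in TB/s."""
    return cache_bw_tb_s * remote_fraction(n_chips)

# Assuming ~2 TB/s of aggregate cache bandwidth per chip (illustrative):
for n in (2, 3, 4):
    print(f"{n} chips: {cross_chip_load_tb_s(2.0, n):.2f} TB/s off-die per chip")
```

With two chips half the traffic stays local; at four chips, three quarters of it has to go off-die, which is why the interconnect would have to scale far beyond what two-chip packaging needs.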
Isn't there an additional functional unit on the M1 series though? The NPU isn't exactly ignoring memory. If they have a way of clustering the two GPU segments, perhaps they are also clustering the NPU. I also suspect that Apple is attempting to keep the local cache on the CPU clusters coherent across the two clusters so that they can avoid NUMA issues. I can't imagine that Apple would throw silicon and money at intentionally over-provisioning memory bandwidth to the processor to such a degree. They could use slower, lower power memory chips if they had no use for the bandwidth. There's definitely something that's using the capacity.

I never said it was only to connect memory, but when cores "work together" the bandwidth they can consume talking to each other is bounded by the memory bandwidth at the level where they meet: L2/L3 for designs where cores in a cluster share an L2, or the top-level cache for designs where they share a cache. In the M1 we know the SLC bandwidth is unfortunately less than the DRAM bandwidth, because neither the CPU cluster (topping out at around 210 GB/sec IIRC) nor the GPU cluster (topping out at around 330 GB/sec) can match the 400 GB/sec the M1 Max DRAM is capable of delivering. We know it isn't overhead related, since a single CPU core can consume almost exactly the 100 GB/sec provided on the regular M1 - the SLC is no bottleneck there. 2.5 TB/sec is so far above the capability of at least the M1 generation that it is massive overkill for the M1 Ultra.
Whether it is overkill for the M2 depends on what improvements are made to eliminate the bottlenecks that limit the M1 Max's ability to exploit the full bandwidth. It isn't like they couldn't go higher than 10,000 I/Os and/or clock it higher if they need more bandwidth. But there's no way the M1 Ultra needs anything like that much bandwidth between chips. The cores don't have the performance necessary to consume data at that rate even in a worst case scenario where all data is located on the wrong chip.
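The mismatch is easy to see if you line up the numbers quoted in this thread (2.5 TB/s = 2,500 GB/s; the consumer figures are the approximate ones cited above):

```python
# Bandwidth figures as quoted in this thread (approximate).
ULTRAFUSION_GB_S = 2500              # claimed M1 Ultra die-to-die bandwidth
consumers_gb_s = {
    "CPU cluster (SLC-limited)": 210,
    "GPU cluster (SLC-limited)": 330,
    "M1 Max DRAM": 400,
}
# How many times over the interconnect could feed each consumer:
for name, bw in consumers_gb_s.items():
    print(f"{name}: {ULTRAFUSION_GB_S / bw:.1f}x headroom")
```

Even against the DRAM controllers, the fastest consumer listed, the interconnect has better than 6x headroom, which is the "overkill" point in a nutshell.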
Predicted by whom? If that's CPU Monkey, they are well known to just make up numbers and enter them as placeholders.
Hmm... While 2000 single-core for M2 does seem quite optimistic, I wouldn't put too much stock into Macworld's guesstimate either.
Isn't there an additional functional unit on the M1 series though? The NPU isn't exactly ignoring memory. If they have a way of clustering the two GPU segments, perhaps they are also clustering the NPU. I also suspect that Apple is attempting to keep the local cache on the CPU clusters coherent across the two clusters so that they can avoid NUMA issues. I can't imagine that Apple would throw silicon and money at intentionally over-provisioning memory bandwidth to the processor to such a degree. They could use slower, lower power memory chips if they had no use for the bandwidth. There's definitely something that's using the capacity.

I'm not sure whether they are clustering the NPU, or whether that's even particularly useful (i.e. since NPU tasks are already so highly parallel, I wonder if it makes much difference to have 2 NPUs of power 'n' vs one NPU of power '2n'), but I guess we'll see once people start taking delivery of M1 Ultra systems.
They are assuming M2 will be based on the A15. I believe it will be the A16. The AnX chips, before the M chips came along, skipped one A generation. 2000+ is quite reasonable if so.
The most similar one I can think of would be AMD's new MI200 GPGPU.

Likewise, it was hard to find any data on GPUs, other than Nvidia's DGX-2, which has 300 GB/sec of bidirectional bandwidth between individual GPUs - and Nvidia thinks that's good enough to sell a 16 GPU system for $400K! Chiplets for GPUs seem to be in their infancy, but maybe someone has seen some data there that can be compared?
https://www.anandtech.com/show/17054/amd-announces-instinct-mi200-accelerator-family-cdna2-exacale-servers

The most similar one I can think of would be AMD's new MI200 GPGPU.
But I can't find any reliable numbers on its interconnect. The only place that actually lists something is Tom's, but the paragraph is talking about the 8x 100 GB/s Infinity Fabric links, which should be IFOP and not inter-chiplet.
But the interconnect on that would have to take into account the 1.6 TB/s of memory bandwidth on each chiplet, so my guess would be that it's within that region - and, if you were marketing inclined, you'd claim double the number as bidirectional bandwidth.
If I'm interpreting the article correctly, it's 4 links between GCDs and each link can do 50 GB/s in each direction. That's 200 GB/s total in each direction, which is much lower than the HBM bandwidth. The GPU itself isn't really a chiplet design, and the fact that the OS recognizes each GCD as a separate GPU is consistent with that interpretation.

With the additional IF links exposed by the OAM form factor, AMD has given each GCD 8 Infinity Fabric 3.0 links. As previously mentioned, 4 of these links are used to couple the two GCDs within an MI200, which leaves 4 IF links per GCD (8 total) free for linking up to hosts and other accelerators.
All of these IF links are 16 bits wide and operate at 25Gbps/pin in a dual simplex fashion. This means there’s 50GB/second of bandwidth up and another 50GB/second of bandwidth down along each link. Or, as AMD likes to put it, each IF link is 100GB/second of bi-directional bandwidth, for a total aggregate bandwidth of 800GB/second. Notably, this gives the two GCDs within an MI250(X) 200GB/second of bandwidth in each direction to communicate among themselves. This is an immense amount of bandwidth, but for remote memory accesses it’s still going to be a fraction of the 1.6TB/second available to each GCD from its own HBM2E memory pool.
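The quoted figures are easy to sanity-check; this quick sketch uses only the numbers from the article above:

```python
# Sanity check of the Infinity Fabric 3.0 numbers quoted above.
bits_per_link = 16
gbps_per_pin = 25
links_between_gcds = 4

# 16 pins * 25 Gb/s = 400 Gb/s = 50 GB/s per direction per link
gb_s_per_link_per_dir = bits_per_link * gbps_per_pin / 8
# 4 links between GCDs -> GCD-to-GCD bandwidth in each direction
gcd_to_gcd_per_dir = gb_s_per_link_per_dir * links_between_gcds

print(gb_s_per_link_per_dir)       # GB/s per link, each direction
print(gcd_to_gcd_per_dir)          # GB/s between GCDs, each direction
print(gcd_to_gcd_per_dir / 1600)   # fraction of each GCD's 1.6 TB/s HBM2E
```

So the inter-GCD link is only 12.5% of a GCD's local HBM2E bandwidth, which supports the point that the MI200 doesn't try to look like one monolithic GPU the way the M1 Ultra does.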
2000 points is on a 17% improvement over the M1. It seems reasonable to conclude that Apple could reach that through some combination of IPC and clock speed improvements.
I don't know what their usual performance improvements have been, but even if they're a bit under that on average it still doesn't put 17% outside of the realm of possibility.
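For what it's worth, 17% doesn't have to come from one place; modest IPC and clock bumps multiply together (the splits below are purely illustrative, not predictions):

```python
# Illustrative only: small IPC and clock gains compound multiplicatively.
def combined_gain(ipc_gain: float, clock_gain: float) -> float:
    """Total single-thread speedup from independent IPC and clock gains."""
    return (1 + ipc_gain) * (1 + clock_gain) - 1

print(round(combined_gain(0.08, 0.08), 3))   # ~16.6% from 8% + 8%
print(round(combined_gain(0.10, 0.07), 3))   # ~17.8% from 10% + 7%
```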
If so, then tripling the interconnect in the M2 Max to support four-SoC configurations is even easier. It would also be more likely they do the "sandwich" approach I previously suggested, with two M2 Max dies on either face of the interposer.

By the way, I absolutely LOVE this "sandwich" idea. Apple is very likely to do just that. It will require an advanced thermal solution on both sides of the motherboard. If implemented inside a Mac Studio-like enclosure, it will need to be at least twice as tall (maybe a perfect 7.7" cube!!) with double the power supply (800W?), and maybe beefier intake air vents.
Ignore multicore numbers for a moment, let's talk single threaded performance.
Here are iPad 12.9” GB5 single and multi core scores over the years:
1st Gen (Sep 2015) with A9X: 637, 1194
2nd Gen (June 2017) with A10X Fusion: 833 (+30.8%), 2279 (+90.9%)
3rd Gen (Oct 2018) with A12X Bionic: 1145 (+37.5%), 4774 (+109%)
4th Gen (March 2020) with A12Z Bionic: 1121 (-2.1%), 4665 (-2.3%)
5th Gen (April 2021) with M1: 1706 (+52.2%), 7219 (+54.7%)
Average Gen-over-Gen performance is +30%, +63%.
When M2 comes out later this year, roughly 2 years after M1, greater than 17% single thread improvement is very plausible.
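The averages above can be reproduced directly from the listed scores:

```python
# Gen-over-gen changes computed from the iPad Pro 12.9" GB5 scores above.
scores = {                       # (single-core, multi-core)
    "A9X":  (637, 1194),
    "A10X": (833, 2279),
    "A12X": (1145, 4774),
    "A12Z": (1121, 4665),
    "M1":   (1706, 7219),
}
vals = list(scores.values())
deltas = [(b[0] / a[0] - 1, b[1] / a[1] - 1) for a, b in zip(vals, vals[1:])]
avg_sc = sum(d[0] for d in deltas) / len(deltas)
avg_mc = sum(d[1] for d in deltas) / len(deltas)
print(f"avg single-core: {avg_sc:+.0%}, avg multi-core: {avg_mc:+.0%}")
```

Running it gives roughly +30% single-core and +63% multi-core per generation, matching the figures above.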
The A12X was built on N7, the A9X on N16, and the A10X on N10.

TSMC's original 7-nanometer N7 process was introduced in April 2018. Compared to its own 16-nanometer technology, TSMC claims its 7 nm node provides around a 35-40% speed improvement or 65% lower power. Compared to the half-node 10 nm process, N7 is said to provide a ~20% speed improvement or ~40% power reduction. In terms of density, N7 is said to deliver 1.6x and 3.3x improvements compared to N10 and N16 respectively. N7 largely builds on all the prior FinFET processes the company has had; it is a fourth-generation FinFET, fifth-generation HKMG, gate-last, dual gate oxide process.
The M1 is built on N5.

At a high level, TSMC N5 is a high-density, high-performance FinFET process designed for mobile SoCs and HPC applications. Fabrication makes extensive use of EUV at Fab 18, the company's new 12-inch GigaFab located at the Southern Taiwan Science Park. TSMC says that its 5-nanometer process is 1.84x denser than its 7-nanometer node. TSMC also optimized analog devices, where roughly 1.2x scaling has been achieved. For a typical mobile SoC consisting of 60% logic, 30% SRAM, and 10% analog/IO, TSMC projected its 5 nm technology to reduce chip size by 35%-40%.
So TSMC N3 is the next big jump from N5. I believe that Apple's next big design overhaul will start with TSMC N3; therefore, you won't see it this year. You might see some minor to moderate improvements in the meantime, but nothing big. Multithreaded performance is different, because they can throw more cores at the problem, just like Intel is doing.

TSMC's N3 technology will provide full node scaling compared to N5, so its adopters will get all the performance (10% - 15%), power (-25% ~ -30%), and area (1.7x higher density for logic) enhancements that they have come to expect from a new node in this day and age. But these advantages will come at a cost. The fabrication process will rely extensively on extreme ultraviolet (EUV) lithography, and while the exact number of EUV layers is unknown, it will be greater than the 14 used in N5. The extreme complexity of the technology will further add to the number of process steps - bringing it to well over 1,000 - which will further increase cycle times.
Ignore multicore numbers for a moment, let's talk single threaded performance.

Very nice analysis! We will see, indeed!
The A12X was built on N7, the A9X on N16, and the A10X on N10.
More from wikichip:
The M1 is built on N5.
The original TSMC N4 process is a very small shrink from N5 (~6%; there are sources, do some Googling).
N4P adds another 6% on top of N4.
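Compounding the two quoted shrinks shows how little headroom that leaves before N3:

```python
# Compounding the quoted shrinks: N5 -> N4 (~6%) and N4 -> N4P (~6%).
n4_gain = 0.06
n4p_gain = 0.06
total = (1 + n4_gain) * (1 + n4p_gain) - 1
print(round(total, 3))   # ~12% cumulative, well short of a full node jump
```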
That tells me IPC or perf/watt increases will be modest at best. In order to up the performance gain, Apple will need to do one of the following:
- Increase core size, and therefore power consumption.
- Drastically improve the cores themselves. Note that MOST of their gains come from the process. If all 3 chips were ported to 5nm and compared at peak perf/watt today, the gains would be much more modest.
- Move to N3.
Option 1 is the most likely. Drop another M1 Ultra chip (for a total of 2x M1 Ultras) on a Mac Pro and profit. Option 2, though possible, is unlikely. Apple has already refined their current cores significantly. Let's look at option 3:
So TSMC N3 is the next big jump from N5. I believe that Apple's next big design overhaul will start with TSMC N3; therefore, you won't see it this year. You might see some minor to moderate improvements in the meantime, but nothing big. Multithreaded performance is different, because they can throw more cores at the problem, just like Intel is doing.
We will see.
CoWoS-L looks like a much better fit for what Apple is using for M1 Ultra. Chip-last, because the dies are pretty big to begin with and chip-first probably wouldn't be economically feasible. Local silicon interconnect bridge to link the two dies, because the I/O for that is all concentrated into a small area, so there's no need for a full-size silicon interposer. Plus a redistribution layer to handle any additional fanout.

I would love to see details, but it appears that Apple has their own packaging process. There was some debate over whether M1 Ultra was one die or two - it appears the answer is that both are correct.
Source: www.newsdirectory3.com
M1 Ultra uses Chip-on-Wafer-on-Substrate with a Si interposer (CoWoS-S), as also suggested by @Doug S above.
“In UltraFusion technology, die stitching is used to splice 4 masks together to expand the area of the interposer. In this method, the 4 masks are exposed simultaneously and four stitched “edges” are generated in a single chip.
UltraFusion architecture interconnection technology (single-layer and multi-layer; refer to patents US 20220013504A1/US 20210217702A1).
Special optimization of six major technologies:
1) Low RC interconnect
2) Interconnect power consumption control
3) Optimize TSV
4) Capacitors integrated in the interposer (iCAP)
5) New thermal interface materials
6) Effectively improve packaging yield and reduce costs through Die-Stitching technology”
I’m now trying to find those Apple patents. Please post links if you find them before I do.
Here is one possible implementation of the “sandwich” idea from Apple patent.
View attachment 58568