The M1 Pro has a whopping 200 GB/s and the M1 Max an insane 400 GB/s of memory bandwidth. How did Apple achieve this?
They don't use GDDR or HBM RAM, but regular LPDDR (!), which makes it all the more puzzling.
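The trick is sheer interface width rather than exotic memory: lots of ordinary LPDDR5-6400 channels side by side. A back-of-the-envelope sketch (the 256-bit and 512-bit totals match Apple's published configurations; 1 GB = 10^9 bytes):

    # Peak DRAM bandwidth = transfer rate * bus width in bytes
    def peak_bw_gbs(mt_per_s, bus_width_bits):
        return mt_per_s * 1e6 * (bus_width_bits / 8) / 1e9

    print(peak_bw_gbs(6400, 256))  # M1 Pro: 204.8 GB/s (the "200 GB/s")
    print(peak_bw_gbs(6400, 512))  # M1 Max: 409.6 GB/s (the "400 GB/s")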
> Isn't LPDDR5X already shipping with the MediaTek Dimensity 9000? IIRC 7500 MHz.

Uh, Apple is already using LPDDR5-6400. The only memory designed to clock higher than that at the moment is GDDR. Also, the only memory controllers on the die are LPDDR (assuming they'll reuse the Jade C-Die as the basis for larger desktop chips).
Yep, a system designer can wring a lot more efficiency and/or performance out of memory when they get to 'set/fix' the organization, interface, and topology rather than following a standard (e.g. JEDEC). IIRC, ASUS tried to market a motherboard years ago that had some 'custom' (star?) DRAM topology, but it was panned because everything was soldered. It was kind of gimmicky, like AOpen's 'audiophile' motherboard that used a damn tube amp.

Apple is also using bespoke 64- and 128-Gbit x128 packages with at least 8 dies in them. Normally LPDDR is implemented as PoP or memory down on the logic board. Those configurations would likely present significant challenges with that many channels, which is another reason why Apple places the SDRAM on the package substrate.
Oh, right. I totally forgot about LPDDR5X.
One thing I noticed about Andrei's testing was that a single core could pull 102.36 GB/s across the system fabric from main memory. The theoretical bandwidth to a single memory package with a 128-bit LPDDR5-6400 interface is 102.4 GB/s. That may not be entirely coincidental.
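Checking that arithmetic (nothing here beyond the numbers already in the comment):

    # One package: 128-bit LPDDR5-6400 interface
    peak = 6400e6 * 128 / 8 / 1e9   # 102.4 GB/s theoretical
    measured = 102.36               # GB/s, Andrei's single-core figure
    print(measured / peak)          # ~0.9996 -- one core saturating one package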
I agree about the number being suspiciously close, but it was a purely synthetic test that Andrei designed specifically to probe the memory subsystem, so who knows.
My best guess for the inter-chip fabric is four PCIe Gen5 x16 links on each die. The 2-chip version would use two x16 links from each die for CXL, with the remaining four for PCIe slots in the Mac Pro. The 4-chip version would use three x16 links from each die for all-way CXL, with the remaining four for PCIe slots. After accounting for protocol and encoding overhead, a PCIe Gen5 x16 link is good for around 53.2 GB/s. So the two links on the 2-chip version would provide bandwidth equivalent to a 128-bit LPDDR5-6400 interface in each direction, and the all-way setup on the 4-chip version would have a cross-sectional bandwidth of over 400 GB/s.
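Roughing out that guess in numbers (the four-x16-links-per-die fabric and the ~53.2 GB/s usable figure are the speculation above; only the PCIe Gen5 line rate and 128b/130b encoding come from the spec):

    # PCIe Gen5: 32 GT/s per lane, 128b/130b encoding
    raw = 32 * 16 * (128 / 130) / 8   # ~63.0 GB/s per x16 link, per direction
    link = 53.2                       # GB/s usable after protocol overhead (~84%)

    # 2-chip version: two x16 links joining the dies
    print(2 * link)                   # 106.4 GB/s each way, ~ one 128-bit LPDDR5-6400 interface (102.4)

    # 4-chip all-to-all: bisect into 2+2, and four links cross the cut
    print(4 * link * 2)               # 425.6 GB/s bidirectional cross-section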
> Why in the world would Apple need so many PCIe slots? It is pretty clear they are not supporting any third-party GPUs, so there is nothing to plug into an x16 slot. At most they will have a couple of x4 slots for SSDs, 100Gb Ethernet, or a Fibre Channel port for an external array (though TB4 is probably good enough there).

OK, skip the slots. Three links per die then. That probably makes more sense. Each die already has three PCIe Gen4 x4 ports.
I'm skeptical Apple would use CXL. Standards only matter if you need to interface with something conforming to that standard; Apple does not, so they could probably do better rolling their own. And 400 GB/s of cross-sectional bandwidth is pretty poor on a system that will support 2 TB/s of memory bandwidth (assuming they upgrade to LPDDR5X by the time the 4-SoC Mac Pro comes out in 2023).
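For scale, here is where a ~2 TB/s figure would come from (pure speculation: LPDDR5X-8533 at the same 512-bit width per die):

    dies = 4
    per_die = 8533e6 * 512 / 8 / 1e9   # ~546 GB/s per die with LPDDR5X-8533
    total = dies * per_die             # ~2184 GB/s, i.e. ~2.2 TB/s
    print(400 / total)                 # cross-section would be under 20% of memory bandwidth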