Question: How did Apple achieve such high memory bandwidth on the M1 Pro and Max?

FlameTail

Platinum Member
Dec 15, 2021
2,197
1,194
106
The M1 Pro has a whopping 200 GB/s and the M1 Max an insane 400 GB/s of memory bandwidth. How did Apple achieve this?

They don't use GDDR or HBM RAM, but regular LPDDR (!), which makes it all the more puzzling.
 

uzzi38

Platinum Member
Oct 16, 2019
2,613
5,853
146
The M1 Pro has a whopping 200 GB/s and the M1 Max an insane 400 GB/s of memory bandwidth. How did Apple achieve this?

They don't use GDDR or HBM RAM, but regular LPDDR (!), which makes it all the more puzzling.

What we refer to as dual channel on the desktop is essentially a 128-bit memory bus; we just don't describe it that way like we do with GPUs. The same goes for your standard mobile SoC. Without going into the specifics of LPDDR5 and channel widths, all you need to know is that the M1 Pro effectively utilises a 256-bit memory bus. The M1 Max? 512-bit.

That's how they get so much memory bandwidth.
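
For anyone who wants the napkin math behind those bus widths, here's a quick sketch (assuming LPDDR5-6400, i.e. 6400 MT/s per pin; Apple's 200/400 GB/s figures are just these rounded down):

    # Peak bandwidth = bus width (bits) x transfer rate (MT/s) / 8 bits per byte / 1000
    def peak_gbs(bus_bits, mts):
        return bus_bits * mts / 8 / 1000  # GB/s

    print(peak_gbs(128, 6400))  # 102.4 GB/s -- a typical "dual channel" desktop setup
    print(peak_gbs(256, 6400))  # 204.8 GB/s -- M1 Pro
    print(peak_gbs(512, 6400))  # 409.6 GB/s -- M1 Max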
 

repoman27

Senior member
Dec 17, 2018
342
488
136
Like uzzi38 said, 256 and 512-bit wide memory busses. I think LPDDR5 is usually a 16-bit channel width, so that's 16-channel x16 for the M1 Pro and 32-channel x16 for the M1 Max. Apple is also using bespoke 64 and 128-Gbit x128 packages with at least 8 dies in them.

Normally LPDDR is implemented as PoP or memory down on the logic board. Those configurations would likely present significant challenges with that many channels, which is another reason why Apple places the SDRAM on the package substrate. HBM stacks use 1024-bit interfaces but require silicon interposers or EMIB due to the trace density with that many I/Os. 128-bit packages aren't nearly as bad, but they're still way denser than the more common memory technologies.
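
Rough sketch of how those channel and package counts fall out, assuming 16-bit LPDDR5 channels and 128-bit-wide packages as described above:

    bus_widths = {"M1 Pro": 256, "M1 Max": 512}   # bits
    for chip, bits in bus_widths.items():
        channels = bits // 16      # 16-bit LPDDR5 channels
        packages = bits // 128     # x128 packages on the SoC substrate
        print(chip, channels, "channels,", packages, "packages")
    # M1 Pro: 16 channels, 2 packages; M1 Max: 32 channels, 4 packages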
 
  • Like
Reactions: Tlh97 and Mopetar

Mopetar

Diamond Member
Jan 31, 2011
7,830
5,977
136
Yeah, they've got a memory bus as wide as a high-end GPU.

The real question is what the memory controller is capable of handling, because the current chips just use LPDDR5, and the desktop chips could use something quite a bit faster and essentially double the bandwidth.

I think the AT review said they couldn't find anything that could really saturate the available bandwidth anyway, so there may not be a need for it. Of course, the flip side is that faster memory would let them use a smaller bus if they design separate desktop chips.
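
To illustrate that trade-off (the higher data rates here are purely hypothetical, not anything Apple has announced):

    target = 400  # GB/s, roughly the M1 Max headline figure

    # Required bus width (bits) = bandwidth (GB/s) * 8 * 1000 / data rate (MT/s)
    for mts in (6400, 8533, 12800):  # LPDDR5, LPDDR5X, and a made-up faster part
        print(mts, round(target * 8 * 1000 / mts), "bits")
    # 6400 -> 500 bits, 8533 -> 375 bits, 12800 -> 250 bits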
 

repoman27

Senior member
Dec 17, 2018
342
488
136
Uh, Apple is already using LPDDR5-6400. The only memory designed to clock higher than that at the moment is GDDR. Also, the only memory controllers on the die are LPDDR (assuming they'll reuse the Jade C-Die as the basis for larger desktop chips).
 

uzzi38

Platinum Member
Oct 16, 2019
2,613
5,853
146
Uh, Apple is already using LPDDR5-6400. The only memory designed to clock higher than that at the moment is GDDR. Also, the only memory controllers on the die are LPDDR (assuming they'll reuse the Jade C-Die as the basis for larger desktop chips).
Isn't LPDDR5X already shipping with the MediaTek 9000? IIRC at 7500 MT/s.
 

Doug S

Platinum Member
Feb 8, 2020
2,248
3,478
136
Isn't LPDDR5X already shipping with the MediaTek 9000? IIRC at 7500 MT/s.

Sure, but how many phones containing the MediaTek 9000 will ship over the next year versus iPhones or even Macs? Apple has to be assured of supply, and probably wants multiple sources to be certain. It would cost them billions if a memory shortage (say, an earthquake taking a key fab offline) forced them to reduce shipments.

Besides, as Mopetar said, there aren't a lot of use cases that can fully exploit the bandwidth they have now. Even the SoC itself appears to have some limitations that will need to be addressed before going beyond LPDDR5 makes sense. The benchmarks I've seen show two cores of an M1 Max able to pull nearly 200 GB/sec, but the figure doesn't move far beyond that no matter how many additional cores are brought to bear. The GPU maxes out at around 330 GB/sec.

Until they address those limitations, what's the point of going LPDDR5X other than the power savings?
 
  • Like
Reactions: Tlh97 and blckgrffn

repoman27

Senior member
Dec 17, 2018
342
488
136
Oh, right. I totally forgot about LPDDR5X.

One thing I noticed about Andrei's testing was that a single core could pull 102.36 GB/s across the system fabric from main memory. The theoretical bandwidth to a single memory package with a 128-bit LPDDR5-6400 interface is 102.4 GB/s. That may not be entirely coincidental.
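
For reference, the math behind that theoretical figure (a 128-bit interface at 6400 MT/s):

    peak = 128 * 6400 / 8 / 1000   # 102.4 GB/s per package
    print(peak, 102.36 / peak)     # 102.4, ~0.9996 -- i.e. essentially 100% of theoretical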
 

tcsenter

Lifer
Sep 7, 2001
18,349
259
126
...Apple is also using bespoke 64 and 128-Gbit x128 packages with at least 8 dies in them.

...Normally LPDDR is implemented as PoP or memory down on the logic board. Those configurations would likely present significant challenges with that many channels, which is another reason why Apple places the SDRAM on the package substrate.
Yep, a system designer can wring out a lot more efficiency and/or performance when they get to fix the organization, interface, and topology rather than sticking to a standard (e.g. JEDEC). IIRC, ASUS tried to market a motherboard years ago that had some custom (star?) DRAM topology, but it was panned because everything was soldered. It was kind of gimmicky, like AOpen's 'audiophile' motherboard that used a damn tube amp.
 

Doug S

Platinum Member
Feb 8, 2020
2,248
3,478
136
Oh, right. I totally forgot about LPDDR5X.

One thing I noticed about Andrei's testing was that a single core could pull 102.36 GB/s across the system fabric from main memory. The theoretical bandwidth to a single memory package with a 128-bit LPDDR5-6400 interface is 102.4 GB/s. That may not be entirely coincidental.

That number is so good I'm still a bit suspicious that there may have been some issue with the test showing what it was supposed to show. I've never seen a system able to deliver basically 100% of its theoretical memory bandwidth, but maybe the more recent DDR standards have improved upon the areas where inefficiencies used to show up? What's the best number observed on Intel or AMD systems?

Regardless of whether that number is accurate or a bit inflated, it is clear Apple's memory subsystem is extremely efficient, so the fact that it can't get much past 200 GB/sec even with all the cores of an M1 Max is not due to inefficiencies or overhead but to limitations in the design. Some paths will need to be wider and/or faster just to fully exploit LPDDR5, and another 33% beyond that to handle the fastest currently available LPDDR5X.
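
Rough numbers behind that "another 33%", assuming LPDDR5X tops out at 8533 MT/s per pin:

    lpddr5, lpddr5x = 6400, 8533    # MT/s per pin
    print(lpddr5x / lpddr5 - 1)     # ~0.33 -> about 33% more raw bandwidth per pin
    print(400 * lpddr5x / lpddr5)   # ~533 GB/s, scaling the M1 Max's 400 GB/s headline number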

I am really curious to see how they handle the fabric for high-end Mac Pros with four M* Max dies. Will there be enough bandwidth between SoCs to carry all 400 GB/sec (or 533 GB/sec if LPDDR5X is used) from each SoC's DRAM? I'm assuming it will be fully connected since there are only four SoCs. That's a lot of very high-speed wires!

Since we now know that LPDDR5X will be available in up to 64 GB modules, I can at least stop worrying about whether Apple will need to support DIMMs for larger configurations :)
 

repoman27

Senior member
Dec 17, 2018
342
488
136
I agree about the number being suspiciously close, but it was a purely synthetic test that Andrei designed specifically to probe the memory subsystem, so who knows.

My best guess for the inter-chip fabric is 4x PCIe Gen5 x16 links on each die. The 2-chip version will use two x16 links from each die for CXL and the remaining four will be for PCIe slots in the Mac Pro. The 4-chip version will use three x16 links from each die for all-way CXL with the remaining four for PCIe slots in the Mac Pro. After accounting for protocol and encoding overhead, a PCIe Gen5 x16 link is good for around 53.2 GB/s. So the two links on the 2-chip version would provide bandwidth equivalent to a 128-bit LPDDR5-6400 interface in each direction. The all-way setup for the 4-chip version would have a cross-sectional bandwidth of over 400 GB/s.
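
Sketch of where those link numbers come from. The raw rate is 32 GT/s per lane with 128b/130b encoding; the ~15% haircut for packet/protocol overhead is just my assumption to land on the 53.2 GB/s figure, and the cross-sectional number counts both directions across the bisection of the fully connected 4-die setup:

    lane = 32 * 128 / 130 / 8 * 0.845   # ~3.33 GB/s per lane per direction after overhead
    x16 = 16 * lane                     # ~53.2 GB/s per x16 link per direction
    print(x16, 2 * x16)                 # ~53.2 per link, ~106 for two (vs 102.4 for a 128-bit LPDDR5-6400 package)
    print(2 * 4 * x16)                  # ~426 GB/s: 4 links cross the bisection, counted in both directions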
 

Doug S

Platinum Member
Feb 8, 2020
2,248
3,478
136
I agree about the number being suspiciously close, but it was a purely synthetic test that Andrei designed specifically to probe the memory subsystem, so who knows.

My best guess for the inter-chip fabric is 4x PCIe Gen5 x16 links on each die. The 2-chip version will use two x16 links from each die for CXL and the remaining four will be for PCIe slots in the Mac Pro. The 4-chip version will use three x16 links from each die for all-way CXL with the remaining four for PCIe slots in the Mac Pro. After accounting for protocol and encoding overhead, a PCIe Gen5 x16 link is good for around 53.2 GB/s. So the two links on the 2-chip version would provide bandwidth equivalent to a 128-bit LPDDR5-6400 interface in each direction. The all-way setup for the 4-chip version would have a cross-sectional bandwidth of over 400 GB/s.

Why in the world would Apple need so many PCIe slots? It is pretty clear they are not supporting any third-party GPUs, so there is nothing to plug into an x16 slot. At most they will have a couple of x4 slots for SSDs, 100Gb Ethernet, or a Fibre Channel port for an external array (though TB4 is probably good enough there).

I'm skeptical Apple would use CXL. Standards only matter if you need to interface with something conforming to that standard. Apple does not, so they could probably do better rolling their own. A 400 GB/sec cross-sectional bandwidth is pretty poor on a system that will support 2 TB/sec of memory bandwidth (assuming they upgrade to LPDDR5X by the time the 4-SoC Mac Pro comes out in 2023).
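
Where the 2 TB/sec figure comes from (four SoCs, each with local LPDDR5X at the ~533 GB/s mentioned above):

    per_soc = 533              # GB/s of local LPDDR5X bandwidth per SoC
    print(4 * per_soc)         # ~2133 GB/s aggregate, i.e. a bit over 2 TB/s
    print(4 * per_soc / 400)   # ~5.3x the proposed 400 GB/s cross-sectional bandwidth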
 
  • Like
Reactions: Tlh97 and scannall

repoman27

Senior member
Dec 17, 2018
342
488
136
Why in the world would Apple need so many PCIe slots? It is pretty clear they are not supporting any third-party GPUs, so there is nothing to plug into an x16 slot. At most they will have a couple of x4 slots for SSDs, 100Gb Ethernet, or a Fibre Channel port for an external array (though TB4 is probably good enough there).

I'm skeptical Apple would use CXL. Standards only matter if you need to interface with something conforming to that standard. Apple does not, so they could probably do better rolling their own. A 400 GB/sec cross-sectional bandwidth is pretty poor on a system that will support 2 TB/sec of memory bandwidth (assuming they upgrade to LPDDR5X by the time the 4-SoC Mac Pro comes out in 2023).
OK, skip the slots. Three links per die then; that probably makes more sense. Each die already has three PCIe Gen4 x4 ports.

My reasoning behind CXL is that there's no compelling reason for Apple to reinvent that wheel when they can just buy off-the-shelf IP and be done with it. These are inherently low-volume parts, and the engineering resources could be better spent elsewhere. I also think (although I'm probably being overly optimistic) that we'll see the 2- and 4-die M1 variants by June / WWDC 2022. These are just M1 Max dies with additional I/O for the interconnect fabric, which also means there probably won't be any changes to the memory interfaces.
 

Doug S

Platinum Member
Feb 8, 2020
2,248
3,478
136
CXL is simply way too slow, nearly an order of magnitude too slow if Apple upgrades to LPDDR5X by the time the Mac Pro comes along. They might as well drop LPDDR and go with DIMM slots if they were going to use such a weak interconnect. The speed is not really surprising, since it is basically PCIe 5.0, which is designed for board-level I/O rather than the short-reach die-to-die connectivity Apple needs. They could overcome PCIe 5.0's slow speed with enough links, but as a jack of all trades, PCIe isn't the best in any category for connecting M1 Max packages.

Luckily they have some better options. On the "minimize the number of I/Os" side, 112G serdes is mainstream now, nearly 4x the speed of PCIe 5.0's 32G serdes; 224G serdes might even be an option. Based on 112G they'd need about 32 lanes to each of the other SoCs (40 for LPDDR5X), so a total of 96 or 120 I/Os per SoC for a fully connected four-way fabric.
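
Lane-count sketch for the 112G option, assuming each SoC needs a link to each of the other three that can carry its full local DRAM bandwidth:

    import math

    def lanes(gbs, lane_gbps=112):
        return math.ceil(gbs * 8 / lane_gbps)   # GB/s -> Gb/s -> number of 112G lanes

    print(lanes(400), lanes(533))   # 29 and 39 -> round up to the 32 / 40 lanes per link quoted above
    print(3 * 32, 3 * 40)           # 96 or 120 lanes per SoC for the fully connected 4-way fabric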

Alternatively they could use TSMC's LIPINCON, which is extremely power efficient at 0.56 pJ/bit (2-3x more efficient than the 112G serdes), but with only 8 Gbps per pin they'd need around 400 or 500 I/Os to each of the other SoCs (for LPDDR5 or LPDDR5X respectively), or as many as 1500 per SoC in total. The worst-case scenario of one SoC streaming DRAM from the other three SoCs at the LPDDR5X max of 533 GB/sec each would require only about 7 watts of overhead.
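
And the LIPINCON version of the math (0.56 pJ/bit, 8 Gbps per pin); the 7 W figure works out if the worst case is one SoC pulling 533 GB/s from each of the other three at once:

    pj_per_bit, gbps_per_pin = 0.56, 8

    def pins(gbs):
        return gbs * 8 / gbps_per_pin   # GB/s -> Gb/s -> pins at 8 Gbps each

    print(pins(400), pins(533))         # ~400 and ~533 pins per SoC-to-SoC link (LPDDR5 / LPDDR5X)

    worst_case_gbs = 3 * 533                            # one SoC streaming from the other three
    print(worst_case_gbs * 8e9 * pj_per_bit * 1e-12)    # ~7.2 W of interconnect power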