Discussion Apple Silicon SoC thread

Eug · Nov 10, 2020

M1
5 nm
Unified memory architecture - LP-DDR4
16 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 12 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache
(Apple claims the 4 high-effiency cores alone perform like a dual-core Intel MacBook Air)

8-core iGPU (but there is a 7-core variant, likely with one inactive core)
128 execution units
Up to 24576 concurrent threads
2.6 Teraflops
82 Gigatexels/s
41 gigapixels/s

16-core neural engine
Secure Enclave
USB 4

Products:
$999 ($899 edu) 13" MacBook Air (fanless) - 18 hour video playback battery life
$699 Mac mini (with fan)
$1299 ($1199 edu) 13" MacBook Pro (with fan) - 20 hour video playback battery life

Memory options 8 GB and 16 GB. No 32 GB option (unless you go Intel).

It should be noted that the M1 chip in these three Macs is the same (aside from GPU core number). Basically, Apple is taking the same approach which these chips as they do the iPhones and iPads. Just one SKU (excluding the X variants), which is the same across all iDevices (aside from maybe slight clock speed differences occasionally).

EDIT:

M1 Pro 8-core CPU (6+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 16-core GPU
M1 Max 10-core CPU (8+2), 24-core GPU
M1 Max 10-core CPU (8+2), 32-core GPU

M1 Pro and M1 Max discussion here:

Page 78 - Discussion - Apple Silicon SoC thread

Page 78 - Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

forums.anandtech.com

M1 Ultra discussion here:

Page 109 - Discussion - Apple Silicon SoC thread

Page 109 - Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

forums.anandtech.com

M2 discussion here:

Page 127 - Discussion - Apple Silicon SoC thread

Page 127 - Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

forums.anandtech.com

Second Generation 5 nm
Unified memory architecture - LPDDR5, up to 24 GB and 100 GB/s
20 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 16 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache

10-core iGPU (but there is an 8-core variant)
3.6 Teraflops

16-core neural engine
Secure Enclave
USB 4

Hardware acceleration for 8K h.264, h.264, ProRes

M3 Family discussion here:

Page 215 - Discussion - Apple Silicon SoC thread

Page 215 - Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

forums.anandtech.com

M4 Family discussion here:

Page 263 - Discussion - Apple Silicon SoC thread

Page 263 - Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

forums.anandtech.com

Doug S · Jul 21, 2024

johnsonwax said:
As Apple has improved their device longevity, their OS support hasn't kept pace. These M1 devices are going to last a decade and still be as fast as new Windows machines being sold. Users should have means to keep those device in use and in support. And if that's linux, that's reasonable. Apple can hand off support to them. Apple can also afford to keep some greybeards on staff to keep security patches for 30 years of OS releases.

Apple already has those Linux options, thanks to the Asahi Linux team. That's not an option for an iPad or iPhone but it is theoretically possible for any device that can be jailbroken. Just don't expect there to be enough interest in that for someone else to do all the legwork for you making it as easy to install as Asahi Linux is on a Mac.

name99 · Jul 21, 2024

The Hardcard said:
This is inaccurate. running high parameter neural networks are bandwidth limited on local devices. Matrix operations are efficient, however, when for instance in transformers, every weight needs to be moved through the GPU for each token, it ends up being a lot of data.

The math is simple. Take Llama 3 70B. At 4-bit quantization, the model weights occupy 35 GBs of RAM. In a Max chip (all generations are equally fast since it is memory bandwidth dependent) itis straightforward. 400 GB/s memory bandwidth divided by 35 GBs of model weights puts you at a theoretical maximum of 11.4 tokens per second - or about 9 words per second. in real world use, people are getting about 8 tokens or 6.5 words per second which is tolerable, but not ideal.

The next Ultra will fit 4-bit quantized Llama 3 400B, but 800 GB/s memory bandwidth divided by 200 GBs of model weights gives you less than 4 tokens a second. Very slow. Memory bandwidth is critical to practical usefulness. Remember, people are trying to chat with these models. 3.5 words per second makes for a slow chat bot that will make many listeners or readers impatient.

While token generations speeds could be improved (LPDDR5x 10700 giving the Max 668 GB/s or better yet, LPDDR6) the concern I am raising is the biggest weakness for using Macs on large language models (probably other aspects of generative AI as well.) They do horribly on large prompt processing and handling large context windows. These are compute limited - for simple prompts not a big deal. But for huge 3000 token+ prompts an extended back-and-forth sessions that require the model to keep track of the entire chat, this aspect causes Mac performance to plummet down to abysmal response times.

The metric here is time to first token. And for a 14,000 token context window, Nvidia cards remain a chatty (less than 10 seconds) time to first token while an M2 Ultra may take as long as 20 or more *minutes* to start responding. Unacceptably poor for chatting. This currently the main roadblock for Nvidia users otherwise tempted by the siren song of the top RAM Mac’s ability to run huge parameter language models. Having to wait half an hour for every response is rejected by many.

That is the basis for my posts about increasing matrix multiply compute on Macs. I didn’t bring up the location of the ANE because I think it’s using the L2 cache. Instead, again, my question is how much memory bandwidth Is available in that area of the chip given that the CPU clusters don’t use the full SOC bandwidth.

The point is that boosting the TOPS of the ANE higher than what the GPU is capable of (and I don’t know what the low precision TOPS of the M3 Max GPU is) won’t help the next Max chip run Llama 3 70B satisfactorily if moving the model weights through the compute block to generate tokens does not have access to the full 400 GB/s SOC bandwidth. The number of tokens generated per second will drop by the same percentage that the ANE memory bandwidth is relative to total SOC bandwidth.

If Apple uses the ANE as the path to increase matmul TOPS then the token generation will be limited by how much memory bandwidth it has access to. That’s why I brought that up.

Hopefully now, you can see why it is important. Given that, do you know how much memory bandwidth the ANE has access to? I am contending that it is important for Apple to increase matrix compute in a location on the die where it can cycle the model weights through at full SOC memory bus speed. If the ANE has access to all 800 GB/s in an Ultra, then a great place to jack up the TOPS, otherwise…

Well, Apple would also need allow it to be directly programmed. I don’t think the Accelerate framework is acceptable to the state of the art machine learning community.

Once again you are confusing a whole lot of things.
I did not say that LLMs were not bandwidth limited; I said that MATRIX MULTIPLY is not bandwidth limited. This is simply a mathematical fact: O(N^3) vs O(N^2). If it's not obvious to you, I don't know what to say.
What IS more problematic (though still not bandwidth worst case) is matrix VECTOR multiply.

You're also assuming that an LLM HAS TO touch the entire set of weights to generate every token. This may be how current models do it, but it's far from clear that it's essential. Both mixture of experts and RAG are alternatives to this model.

Finally you are casually throwing out the term "Nvidia cards" as though they're all the same. But the very high end of nVidia sells at a very different price, to a very different market, from an M Max or even an M Ultra...

You're doing this strange thing of assuming that FUTURE machines (Apple is not shipping a local chatbot this year) will be running the same sort of SW (ie algorithms and data footprints as today), as though the past three months just didn't happen; and as if Apple is directly competing with Blackwell.
But that's not how it works. Apple is selling you a butler not a PhD:

https://twitter.com/x/status/1801268635438391310

And yes, this year the butler has a 70 IQ. Next year it will have an 85 IQ. Year after that it will have a 98 IQ. And the PhD running on BlackwellNext++ will have a 150 IQ. BUT the world is a big place... Apple has a niche selling butlers, nVidia/OpenAI has a niche selling PhDs.

Google does not compete with Spotlight. Both have their uses. It would have been (and remains) crazy to say that "Apple can never compete with Google because a Mac's SSD only has <xyz> GB of storage". That's true, but IT IS NOT RELEVANT. Apple sells a version of search that both
- meets the LOCAL needs of customers and
- matches the hardware they own.
If that's not what you want, use Google. Great thing about Apple is that both Spotlight AND Google are easily available to you.

Likewise Apple Intelligence will do things (I assume) like
- provide substantially better local search (based on English semantics not just keywords)
- be able to fill in web pages (address, phone, etc) much more robustly than the current keyword-based scheme, including things like "set passenger 2 to Cynthia" for say booking a flight

Apple Intelligence will not attempt to replace Google, nor will it attempt to replace OpenAI. Its performance will ramp up each year (like GPU performance ramps up each year) always lagging behind some silly metric (some sort of best case performance for one particular use case, on code that's so lousy no mass market is actually using it, on hardware more expensive than is relevant to the mass market) and people will insist, in spite of 50 years of evidence, that this time this means Apple is DOOMED!!!

Finally no-one programs AI via Accelerate. They do so via various Python frameworks that are translated/compiled down to appropriate primitives running on some combination of CPU, GPU, and ANE. This is the way it's done everywhere, and there doesn't seem to be any particular reason to say that Apple's version of this is much worse than anyone else's.

soresu · Jul 22, 2024

name99 said:
some sort of best case performance for one particular use case

The AI industry as a whole is already moving away from these kind of models in favor of multi modal ones better at generalising use cases from fewer inputs/training data.

Unless I missed something ChatGPT is not yet focusing on multi modality.

If they continue down the "moar variables" rabbit hole to brute force AI they will only end up left in the dust as the descendants of DeepMind's Gato model improve and overtake them in not only efficiency, but performance too.

johnsonwax · Jul 22, 2024

name99 said:
And yes, this year the butler has a 70 IQ. Next year it will have an 85 IQ. Year after that it will have a 98 IQ. And the PhD running on BlackwellNext++ will have a 150 IQ. BUT the world is a big place... Apple has a niche selling butlers, nVidia/OpenAI has a niche selling PhDs.

Yeah. I've been arguing that ChatGPT is going to have trouble in the market because it's overserving. It costs a lot of money to build a thing that does way more than people actually need. Apple's view isn't just that the smaller model is better serving, but by putting it on device - they don't have to pay much to build it - the device owner pays for that.

amosliu137 · Jul 22, 2024

The Hardcard said:
My vision of future computing has nothing to do with hardware form factors, iPad or otherwise. It is the software that fully determines what that vision is.

The iPad is just something that Apple is doing currently on its journey to that goal. There’s nothing about iPad hardware that dictates whether it will or will not be around at the end of that goal.

And Apple is extremely deliberate and intentional about every aspect of the products they provide software and hardware. I am an old head who got into computing before the Macintosh and Windows existed. I heavily use an M1 iPad Pro, I could easily and joyfully use macOS on it. but it is clear to me that they don’t intend to put macOS on the iPad and that it is pointless to wait for it to happen.

“We don’t want you to do that” is not an Apple intent. they are trying to provide a secure and stable environment that limits what a lot of enthusiasts want to do with their computing devices. there are no arbitrary restrictions.

Do you have evidence to prove that "And for a 14,000 token context window, M2 Ultra may take as long as 20 or more *minutes* to start responding. " 70b model int4 quantization?

The Hardcard · Jul 22, 2024

Error in posting

The Hardcard · Jul 22, 2024

name99 said:
Once again you are confusing a whole lot of things.
I did not say that LLMs were not bandwidth limited; I said that MATRIX MULTIPLY is not bandwidth limited. This is simply a mathematical fact: O(N^3) vs O(N^2). If it's not obvious to you, I don't know what to say.
What IS more problematic (though still not bandwidth worst case) is matrix VECTOR multiply.

You're also assuming that an LLM HAS TO touch the entire set of weights to generate every token. This may be how current models do it, but it's far from clear that it's essential. Both mixture of experts and RAG are alternatives to this model.

Finally you are casually throwing out the term "Nvidia cards" as though they're all the same. But the very high end of nVidia sells at a very different price, to a very different market, from an M Max or even an M Ultra...

You're doing this strange thing of assuming that FUTURE machines (Apple is not shipping a local chatbot this year) will be running the same sort of SW (ie algorithms and data footprints as today), as though the past three months just didn't happen; and as if Apple is directly competing with Blackwell.
But that's not how it works. Apple is selling you a butler not a PhD:

https://twitter.com/x/status/1801268635438391310
And yes, this year the butler has a 70 IQ. Next year it will have an 85 IQ. Year after that it will have a 98 IQ. And the PhD running on BlackwellNext++ will have a 150 IQ. BUT the world is a big place... Apple has a niche selling butlers, nVidia/OpenAI has a niche selling PhDs.

Google does not compete with Spotlight. Both have their uses. It would have been (and remains) crazy to say that "Apple can never compete with Google because a Mac's SSD only has <xyz> GB of storage". That's true, but IT IS NOT RELEVANT. Apple sells a version of search that both
- meets the LOCAL needs of customers and
- matches the hardware they own.
If that's not what you want, use Google. Great thing about Apple is that both Spotlight AND Google are easily available to you.

Likewise Apple Intelligence will do things (I assume) like
- provide substantially better local search (based on English semantics not just keywords)
- be able to fill in web pages (address, phone, etc) much more robustly than the current keyword-based scheme, including things like "set passenger 2 to Cynthia" for say booking a flight

Apple Intelligence will not attempt to replace Google, nor will it attempt to replace OpenAI. Its performance will ramp up each year (like GPU performance ramps up each year) always lagging behind some silly metric (some sort of best case performance for one particular use case, on code that's so lousy no mass market is actually using it, on hardware more expensive than is relevant to the mass market) and people will insist, in spite of 50 years of evidence, that this time this means Apple is DOOMED!!!

Finally no-one programs AI via Accelerate. They do so via various Python frameworks that are translated/compiled down to appropriate primitives running on some combination of CPU, GPU, and ANE. This is the way it's done everywhere, and there doesn't seem to be any particular reason to say that Apple's version of this is much worse than anyone else's.

Yes, there is confusion. You are not talking about what I am talking about. So your responses are about something different than what I am talking about.

All of my posts concerning matrix multiply in this forum are in relation to large parameter neural networks such as large language models with high parameters. That is my interest in machine learning currently and it is why I bring up memory bandwidth constraints. I have never brought up matrix multiply in any other context. So if you were talking about matrix multiply outside of running LLMs then that has nothing to do with what I’m talking about.

My interest is in running 50 GBs of RAM or more for each token so even the Ultras are memory bandwidth limited in what I am talking about. If you want to respond to me, the limits of memory bandwidth are characteristic to take into account.

I am not talking about what Apple is doing with on device Apple intelligence, not that I wouldn’t find some use for certain features, but my discussion here is solely about running high parameter AI using large amounts of RAM.

LLMs do need to run their full model weights through the compute blocks for each token generated. It is not an assumption, it is reality. Mixture of Experts are one of many compromises made to run large models when faced with limited hardware resources. It is a clever compromise, hopefully one of many clever compromises that make it easier to access more power with less hardware. But it is a compromise, nevertheless, not an alternative to people who have the resources to run large models.

And, in fact, the most powerful mixture of experts are also themselves memory bandwidth constrained consumer hardware. There are a number of people using small models and compromises like mixture of experts, because they have to. They lack enough memory attached to compute acceleration, or compute itself.

RAG refers to the ability to retrieve information from local documents that are not part of the model training in this no impact, and reducing the compute and bandwidth needs of any given model

That is not likely to change anytime soon. I am interested in the upcoming breakthroughs in neural network algorithms that move toward being able to take over large, major tasks - not small, usually cute, and sometimes useful tasks that all the companies are raving about now. I am looking toward a future where AI models can do large heavily lifting for users. These capabilities will nearly always emerge through models that take up sizable amount of workstation compute accelerated RAM. I expect memory bandwidth to be a central constraint for years to come.

I bring up my concerns about the Apple Neural Engine because it is now a common discussion to boost AI capabilities of processors through larger NPUs. However, given that I only am interested in running models that need more than the full SOC bandwidth of Apple Silicon Max or Ultra SOCs, I continue to pose the question - still with no answer - about whether the ANE has access to 400 GB/s or 800 GB/s of the top chips.

Llama 3 400B is dropping this week. The 4-bit quantized will need 200 GBs of RAM. Even with the fastest LPDDR5x, the next Ultra will not have enough memory bandwidth to get above 10 tokens per second. however, I am hoping that they make advances in the compute hardware to significantly reduce prompt in context delays.

If you want to talk about that, it would be great. Talking about compute that does not make available full SOC bandwidth does not concern me nearly as much now.

The Hardcard · Jul 22, 2024

amosliu137 said:
Do you have evidence to prove that "And for a 14,000 token context window, M2 Ultra may take as long as 20 or more *minutes* to start responding. " 70b model int4 quantization?

It wasn’t speaking about the 70B model in that example though I didn’t make that clear. I’m talking about using large models in general and 70B was just about the memory bandwidth calculations.

There were some posts on Reddit about Falcon 180B Getting less than 9 tokens per second for prompt evaluation, but I’m not pulling those up in my current searches.

It does appear to have improved with a new attention algorithm to about 50 tokens per second currently on a 155B model. Still not as fast as Nvidia if you use enough cards to hold the model.

The Hardcard · Jul 24, 2024

This is what I am talking about. Llama 3 405B dropped yesterday. It benchmarks to be on the level of ChatGPT 4o. And already up and running on 2 M3 Max Macbook Pros at 4-bit quantization. Full precision size is 820 GB, plus it has a huge context window, so around 850 GB of RAM needed.

Here, at 4-bit, the model is 210 GB. The context window is probably close to filling the 256 GB of RAM from these Macbooks. he hasn’t yet that I’ve seen revealed token, generation, or prompt processing speeds.

You would need 10 4090s plus a huge system with upwards of 100 PCIe lanes to run this. Or five professional Nvidia cards. One way or the other $20K to $40K of hardware. No Windows AI PC is touching this.

<blockquote class="twitter-tweet"><p lang="en" dir="ltr">2 MacBooks is all you need.<br><br>Llama 3.1 405B running distributed across 2 MacBooks using <a href="https://twitter.com/exolabs_?ref_src=twsrc^tfw">@exolabs_</a> home AI cluster <a href="https://t.co/MLm47UR0B7">pic.twitter.com/MLm47UR0B7</a></p>— Alex Cheema - e/acc (@ac_crypto) <a href="

https://twitter.com/x/status/1815969489990869369

">July 24, 2024</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

name99 · Jul 24, 2024

The Hardcard said:
This is what I am talking about. Llama 3 405B dropped yesterday. It benchmarks to be on the level of ChatGPT 4o. And already up and running on 2 M3 Max Macbook Pros at 4-bit quantization. Full precision size is 820 GB, plus it has a huge context window, so around 850 GB of RAM needed.

Here, at 4-bit, the model is 210 GB. The context window is probably close to filling the 256 GB of RAM from these Macbooks. he hasn’t yet that I’ve seen revealed token, generation, or prompt processing speeds.

You would need 10 4090s plus a huge system with upwards of 100 PCIe lanes to run this. Or five professional Nvidia cards. One way or the other $20K to $40K of hardware. No Windows AI PC is touching this.

<blockquote class="twitter-tweet"><p lang="en" dir="ltr">2 MacBooks is all you need.<br><br>Llama 3.1 405B running distributed across 2 MacBooks using <a href="https://twitter.com/exolabs_?ref_src=twsrc^tfw">@exolabs_</a> home AI cluster <a href="https://t.co/MLm47UR0B7">pic.twitter.com/MLm47UR0B7</a></p>— Alex Cheema - e/acc (@ac_crypto) <a href="
https://twitter.com/x/status/1815969489990869369
">July 24, 2024</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

So what is your complaint?
Two posts ago you were complaining about how Apple was doing it wrong in multiple different ways wrt handling large LLMs, now you're telling us that Apple does it better than anyone else!

The Hardcard · Jul 24, 2024

name99 said:
So what is your complaint?
Two posts ago you were complaining about how Apple was doing it wrong in multiple different ways wrt handling large LLMs, now you're telling us that Apple does it better than anyone else!

I have made no complaints. Just the opposite of saying Apple was doing it wrong, my point was the Apple was doing it correctly. I have been speculating about the possibility of them taking the next step in the journey. My perspective is, depending upon the moves they make with upcoming devices they can significantly increase their overall revenue, and Mac revenue in particular.

My posts have been simply about what those steps could be on a hardware level to make Apple Silicon a central part the upcoming new era of computing.

johnsonwax · Jul 25, 2024

The Hardcard said:
I have made no complaints. Just the opposite of saying Apple was doing it wrong, my point was the Apple was doing it correctly. I have been speculating about the possibility of them taking the next step in the journey. My perspective is, depending upon the moves they make with upcoming devices they can significantly increase their overall revenue, and Mac revenue in particular.

My posts have been simply about what those steps could be on a hardware level to make Apple Silicon a central part the upcoming new era of computing.

Well, given that the compute behind these things is still changing rapidly - ternary weight models being explored now - and the fact that the latency in rolling out silicon to optimize to those compute approaches is going to lag by 3 years at least, the big advantage Apple has is the unified memory model that if they can hang their heterogeneous compute units off of that, it offers significant flexibility in how you allocate that compute over PCs while the rollout of silicon lags.

The downside to the unified memory model is the manner in which it's packaged in the device - with no ability right now to expand memory, and with a ceiling which must be a SKU that Apple sells, where offering a partially populated board provides the ability to offer a very high potential ceiling without having to sell it fully populated as a SKU.

But that's really what we're seeing here in this video - without Apple doing anything to provide dedicated silicon to solve this use case, there's enough compute on board with enough flexibility due to that memory model that they can solve these problems in software - either MLX to abstract the compute need to the different kinds of compute silicon on board, or exolabs to cluster hardware together. That's the challenge with the PC approach - a very limited amount of fast RAM on high-compute GPUs with a limited ability through software to stitch all of those resources into a single compute system of adequate size.

poke01 · Jul 29, 2024

Finally putting that ANE to use for general tasks.

https://twitter.com/x/status/1818010590474027028

poke01 · Jul 29, 2024

This is good.

https://twitter.com/x/status/1818016706717172229

johnsonwax · Jul 31, 2024

poke01 said:
Finally putting that ANE to use for general tasks.

"Also, it did a better job at re-writing this document than Gemini did lol"

I was kind of wondering how the multiple narrower use, but smaller models would fare against the larger monolithic do-everything models. Anecdotal, but encouraging.

Doug S · Jul 31, 2024

johnsonwax said:
I was kind of wondering how the multiple narrower use, but smaller models would fare against the larger monolithic do-everything models. Anecdotal, but encouraging.

I think Apple wants to limit its AI to doing stuff they feel it can do well, rather than let it do whatever the user wants and do a bad (perhaps virally reputationally damagingly bad, like Microsoft's "Tay" episode) job on a lot of it. People will whine "oh Siri can't do nearly as much as ChatGPT" but ignore all the stuff ChatGPT is terrible at, or all the ways you can trick it into doing stuff it shouldn't be.

mikegg · Aug 1, 2024

johnsonwax said:
Yeah. I've been arguing that ChatGPT is going to have trouble in the market because it's overserving. It costs a lot of money to build a thing that does way more than people actually need. Apple's view isn't just that the smaller model is better serving, but by putting it on device - they don't have to pay much to build it - the device owner pays for that.

Highly disagree on this.

Yes, your average intelligence butler who knows everything about you is valuable. That's what Apple can do because local inference will always run dumber models due to lower compute.

But the PhD level scientist has a ton of value. For one, corporations want to use PhD level AIs. Having a million butlers does nothing for most corporations. But having a highly intelligent assistant to help you discover new drugs? Design a new bridge? Give insights from all your data? It's very valuable.

If OpenAI can build that PhD level AI, they can easily sell it to non-consumers.

FlameTail · Aug 1, 2024

Okay so something interesting about the M3 Pro and M3 Max is that Apple increased the number of P-cores in a cluster from 4 to 6, but the L2 cache remained the same at 16 MB. I previously wrongly assumed that the L2 size had also increased by 50% to 24 MB.

This means the L2 per core has been reduced! Have the effects of this been investigated?

name99 · Aug 1, 2024

johnsonwax said:
Well, given that the compute behind these things is still changing rapidly - ternary weight models being explored now - and the fact that the latency in rolling out silicon to optimize to those compute approaches is going to lag by 3 years at least, the big advantage Apple has is the unified memory model that if they can hang their heterogeneous compute units off of that, it offers significant flexibility in how you allocate that compute over PCs while the rollout of silicon lags.

The downside to the unified memory model is the manner in which it's packaged in the device - with no ability right now to expand memory, and with a ceiling which must be a SKU that Apple sells, where offering a partially populated board provides the ability to offer a very high potential ceiling without having to sell it fully populated as a SKU.

But that's really what we're seeing here in this video - without Apple doing anything to provide dedicated silicon to solve this use case, there's enough compute on board with enough flexibility due to that memory model that they can solve these problems in software - either MLX to abstract the compute need to the different kinds of compute silicon on board, or exolabs to cluster hardware together. That's the challenge with the PC approach - a very limited amount of fast RAM on high-compute GPUs with a limited ability through software to stitch all of those resources into a single compute system of adequate size.

Just as a technical point, the very recent Apple Foundations Model paper states that for their on-device models, the "more critical" layers layers are 4b quantized, the less critical layers are 2b quantized.
Which is basically equivalent to ternary weight models, unless you're willing to build an entire infrastructure of storage and compute based on trits rather than bits...

As usual, the world is insisting that "Apple needs to chase <shiny thing X> RIGHT NOW" while Apple has been diligently investigating (but not loudly talking about) <shiny thing X> for five years.

name99 · Aug 1, 2024

mikegg said:
Highly disagree on this.

Yes, your average intelligence butler who knows everything about you is valuable. That's what Apple can do because local inference will always run dumber models due to lower compute.

But the PhD level scientist has a ton of value. For one, corporations want to use PhD level AIs. Having a million butlers does nothing for most corporations. But having a highly intelligent assistant to help you discover new drugs? Design a new bridge? Give insights from all your data? It's very valuable.

If OpenAI can build that PhD level AI, they can easily sell it to non-consumers.

Fine, but that's not what Apple sells...
Your point is legit in a discussion of how OpenAI might (who knows?) make money, but this discussion is essentially about how Apple will use AI.

For the PhD case, it's unclear that the LLM track is the way to get there. Maybe, but other tracks seem more likely. For example we have the recent Google results of IOM, but that was done via manual translation of the problems into LEAN, then having AI operate on the LEAN representation. We've seen the same sort of specialization in the AI's doing something useful in the domains of chemistry, or materials science, or place&route.
PERHAPS the LLM will play a role in translating a somewhat vague human-level specification into something that can be manipulated by these more specialized AI's? But that's very much a guess right now, not a sure thing.

johnsonwax · Aug 1, 2024

name99 said:
Just as a technical point, the very recent Apple Foundations Model paper states that for their on-device models, the "more critical" layers layers are 4b quantized, the less critical layers are 2b quantized.
Which is basically equivalent to ternary weight models, unless you're willing to build an entire infrastructure of storage and compute based on trits rather than bits...

As usual, the world is insisting that "Apple needs to chase <shiny thing X> RIGHT NOW" while Apple has been diligently investigating (but not loudly talking about) <shiny thing X> for five years.

Yeah, that wasn't really my focus there. In my work, we had a lot of situations where people thought they could see the future (we had a lot of PhDs from my previous comment), and someone (often me) had to say 'whoa, we can't see the future - we have to recognize that any of these paths may be correct including ones we don't yet know about, how do we best build infrastructure in anticipation of any of these being correct so that when the correct path reveals itself, we've already got more of the foundation laid than others and can implement faster'.

If you guess right, you win. If you guess wrong, you go out of business. And survivors bias means you will probably learn the wrong lesson here. Alternatively you can not rely on a guess, you can plan for all outcomes, and maybe be a bit behind the winner, but still be in the game. Apple seems to be closer to the plan for all outcomes approach on the hardware side.

mikegg · Aug 3, 2024

name99 said:
Fine, but that's not what Apple sells...
Your point is legit in a discussion of how OpenAI might (who knows?) make money, but this discussion is essentially about how Apple will use AI.

For the PhD case, it's unclear that the LLM track is the way to get there. Maybe, but other tracks seem more likely. For example we have the recent Google results of IOM, but that was done via manual translation of the problems into LEAN, then having AI operate on the LEAN representation. We've seen the same sort of specialization in the AI's doing something useful in the domains of chemistry, or materials science, or place&route.
PERHAPS the LLM will play a role in translating a somewhat vague human-level specification into something that can be manipulated by these more specialized AI's? But that's very much a guess right now, not a sure thing.

I'm responding to someone who questions the value of GPT4 (and future OpenAI models) market.

Quite frankly, there may come one day that an LLM model is so good and so useful, it renders the traditional iPhone obsolete. But that's quite distant in the future.

johnsonwax · Aug 3, 2024

mikegg said:
I'm responding to someone who questions the value of GPT4 (and future OpenAI models) market.

Quite frankly, there may come one day that an LLM model is so good and so useful, it renders the traditional iPhone obsolete. But that's quite distant in the future.

I'm not sure what happened to my post in response to you - it's disappeared.

I'll summarize. I think the GPT4 market is overserving even the case you think it serves.

The largest market here for these things is going to be consumers (there is more money in iPhones than in servers) and consumers don't need the PhD AI. They need an AI that interacts with their data. Maybe you're jumping ahead to the AGI future where it fits on an iPhone - I don't think that future exists, and I don't think there's any point discussing something that far out anyway. The biggest market is going to be butler AIs that act locally because they aren't overserving.

There is a market for the PhD AIs, but OpenAI doesn't sell that either, nor is there any indication they are heading that way. The value in a PhD in your org is that they are inside your org, working on your data, which you retain, and whose output you legally own. You slap IP protections around their work and you seek rent off of it. That's how you extract value from a PhD. That suggests you want a tall and narrow AI that is, again, local - at least local to your enterprise control. That's an argument for open source models that you can tailor to your specific needs. That's not what OpenAI is making. That's not their market. If I want an AI designing new pharmaceuticals, having it write poetry or pass the bar isn't useful. It's both a waste of resources (because I only care about new pharmaceuticals) and it's a distraction from the thing that I want it to do. I don't want it to be hallucinating other tasks that Sam Altman thought it would be cool to do, I want it to be unaware that there is anything in this universe other than pharmaceuticals.

OpenAI is trying to make an AGI. Butler on-device AIs don't help toward that goal. Narrow specialty AIs don't help toward the goal. These big overserving AIs do (at least they think they do) and they're trying to monetize their work by forcing it into a range of use cases for which it's not particularly well suited. It's too dumb for PhD work because it's too broad. It's too disconnected for butler work because it's too large to fit on consumer devices.

I'm not arguing that OpenAIs technology is bad. I'm arguing that their business model doesn't align with where the market finds value, and value governs what the market is wiling to spend money on. It's interesting for a certain type of person who values its ability to write real estate listings and cheat on their homework, but that's not a sufficiently large market to carry OpenAI. That's my argument.

mikegg · Aug 5, 2024

johnsonwax said:
There is a market for the PhD AIs, but OpenAI doesn't sell that either, nor is there any indication they are heading that way. The value in a PhD in your org is that they are inside your org, working on your data, which you retain, and whose output you legally own.

Why do you say they're not selling that?

They're selling enterprise tools and APIs to their best models. Eventually, they could sell the entire model and companies can use the model on-premise - just like the cloud did.

We don't even need to talk about OpenAI. Meta's Llama 3.1 is a foundational model at the level of GPT4. You can deploy it on premise TODAY.

Mixtral also ships an open model and it has enterprise licensing.

So even today, "PhD AIs" are being deployed and used.

darkswordsman17 · Aug 5, 2024

mikegg said:
Highly disagree on this.

Yes, your average intelligence butler who knows everything about you is valuable. That's what Apple can do because local inference will always run dumber models due to lower compute.

But the PhD level scientist has a ton of value. For one, corporations want to use PhD level AIs. Having a million butlers does nothing for most corporations. But having a highly intelligent assistant to help you discover new drugs? Design a new bridge? Give insights from all your data? It's very valuable.

If OpenAI can build that PhD level AI, they can easily sell it to non-consumers.

To get there they alone will need probably $100 trillion, any clue how they'll get that money considering their models aren't profitable and they can't afford the hardware, have no skin in the development of said hardware (meaning others could leverage their profitability to get priority access to said hardware), and literally do not have a functional (aka one that makes more money than it costs to run it) business model?

Right now they're lucky as they have Microsoft willing to throw their data centers behind it, but it costs a lot of money to build and run data centers (just look at how Amazon and everyone is like "oh it costs us money to run voice assistants and we don't really get money from it"; AI is that but even more ridiculous disparity in the resources needed for the outputs) and they're starting to face pushback. At some point Microsoft is gonna have to justify the expenses to their stock holders. They're basically building towards a nuclear future, by which I mean Microsoft is building towards a future where new data center will be scrutinized as much as a nuclear reactor, as people are starting to turn against such (because they don't employ nearly as much people as claimed, especially long term, they steal resources like water which are gong to become even more significant, and their power needs are gonna require nuclear reactors to operate).

Sam Altman can say ChatGPT will "solve physics" all he wants, but that's nonsense and people need to start pointing this out. People act like there's a simple A->B route, but there simply isn't. The current AI is not designed to do such things, and to get to the next levels of even this form of AI requires orders of magnitude increase in processing capability, that is beyond the grasp of the combined current tech industry. To get "PhD AI" we're probably looking at orders of magnitude investment of the entire worlds' GDPs. How that can be justified when there's thus far been no tangible benefits to AI, and its quickly being used to do harm.

There are ways to get improvements without that processing, but they're fundamentally opposed to how AI is being marketed, and I don't know of any company actually even attempting to use it to augment human workers, but rather to replace them (and usually they then have to go and pay slave wages in India to human workers that do the actual work, as seen with Amazon's purchase recognition "AI" system). The only groups likely to be doing that are cybrcriminals who will be using AI to make botnets, worms, and ransomware even more prevalent. The damage that will cause will make the Crowdstrike fiasco seem pale in comparison.

mikegg said:
I'm responding to someone who questions the value of GPT4 (and future OpenAI models) market.

Quite frankly, there may come one day that an LLM model is so good and so useful, it renders the traditional iPhone obsolete. But that's quite distant in the future.

This makes no sense, unless you genuinely believe biological transistors will happen in that timeframe. My point being, iPhone is hardware interface. LLMs are software (and ones using basic inputs because they're not really capable of much more yet). The attempts at making dedicated AI devices have been so absurdly atrocious that they likely singlehandedly are responsible for people starting to question AI hype. No matter the AI you're gonna need an interface unless you functionally build a brain or biological processor that can be added to our brains. You really seem to be listening to people like Sam Altman who have a nonsensical vision of what reality is. By the time that happens the traditional iPhone will be obsolete, because the timeline for when that happens is decades out. We'll be looking at major interface changes sooner than that.

mikegg said:
Why do you say they're not selling that?

They're selling enterprise tools and APIs to their best models. Eventually, they could sell the entire model and companies can use the model on-premise - just like the cloud did.

We don't even need to talk about OpenAI. Meta's Llama 3.1 is a foundational model at the level of GPT4. You can deploy it on premise TODAY.

Mixtral also ships an open model and it has enterprise licensing.

So even today, "PhD AIs" are being deployed and used.

Because such a thing doesn't exist and nothing they've built so far is even attempting to actually do that?

And? Those models are not capable of what you're talking about as they require precision and accuracy, which their models are fundamentally incapable of (because they're designed completely oppositionally to that, where they've been accepting lower and lower precision and accuracy in order to try and push out AI using current processing).

Ironically, that is one of the few legitimate useful things AI is good for, assisting human PhDs who can manipulate it to perform straightforward but computationally intensive tasks that would otherwise take them half a lifetime's worth of work. So they can then take the output and do something actually useful with it, because the AI can't because its not even built to be able to do that and they don't even know how to develop such.

Your argument is, again, nonsensical as they're selling AI to anyone willing to pay them money, that doesn't magically make it smarter or whatever argument you're trying to make there. Much like how companies tried jumping on blockchain (at direction of know-nothing executives that don't understand the technology, let alone how their company would actually use it), even more companies are jumping on the AI bandwagon with again, no real idea of what it actually does, what they could use it for, or anything else. Most of them are just making chatbots that are there basically just to frustrate their customers so they can layoff their human customer service. And also so they can micromanage their workforce so that they can lay off as many of them as they can to cut short term costs to boost stock prices.

Discussion Apple Silicon SoC thread

Lifer

Diamond Member

Senior member

Diamond Member

Member

Member

Senior member

Senior member

Senior member

Senior member

Senior member

Senior member

Member

Diamond Member

Diamond Member

Member

Diamond Member

Golden Member

Diamond Member

Senior member

Senior member

Member

Golden Member

Member

Golden Member

Lifer