Discussion Apple Silicon SoC thread


Eug

Lifer
Mar 11, 2000
23,583
996
126
M1
5 nm
Unified memory architecture - LPDDR4X
16 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 12 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache
(Apple claims the 4 high-efficiency cores alone perform like a dual-core Intel MacBook Air)

8-core iGPU (but there is a 7-core variant, likely with one inactive core)
128 execution units
Up to 24576 concurrent threads
2.6 Teraflops
82 Gigatexels/s
41 gigapixels/s
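(For reference, the 2.6 TFLOPS figure works out from the published specs: 128 EUs × 8 ALUs × 2 FP32 FLOPs per clock (FMA) ≈ 2,048 FLOPs per cycle, which at the commonly reported ~1.28 GHz GPU clock gives ≈ 2.6 TFLOPS. The clock is an estimate, not an Apple-published number.)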

16-core neural engine
Secure Enclave
USB 4

Products:
$999 ($899 edu) 13" MacBook Air (fanless) - 18 hour video playback battery life
$699 Mac mini (with fan)
$1299 ($1199 edu) 13" MacBook Pro (with fan) - 20 hour video playback battery life

Memory options 8 GB and 16 GB. No 32 GB option (unless you go Intel).

It should be noted that the M1 chip in these three Macs is the same (aside from the GPU core count). Basically, Apple is taking the same approach with these chips as it does with the iPhones and iPads: just one SKU (excluding the X variants), which is the same across all iDevices (aside from occasional slight clock speed differences).

EDIT:

[Screenshot: M1 Pro / M1 Max lineup]

M1 Pro 8-core CPU (6+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 16-core GPU
M1 Max 10-core CPU (8+2), 24-core GPU
M1 Max 10-core CPU (8+2), 32-core GPU

M1 Pro and M1 Max discussion here:


M1 Ultra discussion here:


M2 discussion here:


M2
Second-generation 5 nm
Unified memory architecture - LPDDR5, up to 24 GB and 100 GB/s
20 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 16 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache

10-core iGPU (but there is an 8-core variant)
3.6 Teraflops

16-core neural engine
Secure Enclave
USB 4

Hardware acceleration for 8K H.264, H.265 (HEVC), and ProRes

M3 Family discussion here:

 
Last edited:

Doug S

Platinum Member
Feb 8, 2020
2,202
3,405
136
build2 compilation times:

Keep in mind that this isn't solely a test of CPU performance; you are also testing the OS. We've been discussing this difference over on RWT - Linux is much better optimized for multicore performance in things like process creation/scheduling, the filesystem, etc., since it has been used in systems with 100+ (and in some cases thousands of) CPUs for the better part of two decades, while until the latest Mac Pro, Apple never shipped a system with even a double-digit number of cores.

Windows falls somewhere between the two.
 
  • Like
Reactions: bigggggggg

jeanlain

Member
Oct 26, 2020
149
122
86
OK, but previous Photoshop versions for macOS couldn't use such a DSP or whatever, because Intel processors didn't have one. Even if Rosetta can automatically use the GPU, it can't automatically redirect work from a DSP that didn't exist on Intel CPUs to the M1's DSP.
Sure it can. The app makes a system call, whether it's Metal or another API, and the system leverages whatever hardware is available if Apple programmed it this way. For instance, the Accelerate framework most probably uses the matrix math hardware (a specialised module added to the A13) for certain computations. Any app using the relevant Accelerate functions will automagically use the matrix math hardware on the M1, even if this app was written before the M1 even existed.
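To make that concrete, here is a minimal sketch in Swift (matrix size and values are made up). Nothing in it mentions the matrix coprocessor; the app just calls Accelerate's BLAS, and the framework dispatches to whatever backend the OS considers best on that machine - AVX on an Intel Mac, the matrix hardware on an M1 - so the same binary benefits on new silicon without a rewrite:

```swift
import Accelerate

// Multiply two 512 x 512 single-precision matrices with Accelerate's BLAS.
// The call is identical on Intel and Apple Silicon Macs; the framework
// decides which hardware actually does the work.
let n = 512
let a = [Float](repeating: 1.0, count: n * n)
let b = [Float](repeating: 2.0, count: n * n)
var c = [Float](repeating: 0.0, count: n * n)

cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
            Int32(n), Int32(n), Int32(n),
            1.0, a, Int32(n),
            b, Int32(n),
            0.0, &c, Int32(n))

print("c[0] = \(c[0])")   // 512 * (1.0 * 2.0) = 1024.0
```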
 

bigggggggg

Junior Member
Nov 27, 2020
18
12
41
Sure it can. The app makes a system call, whether it's Metal or another API, and the system leverages whatever hardware is available if Apple programmed it this way. For instance, the Accelerate framework most probably uses the matrix math hardware (a specialised module added to the A13) for certain computations. Any app using the relevant Accelerate functions will automagically use the matrix math hardware on the M1, even if this app was written before the M1 even existed.
Only if we are talking about matrix math. I don't think that test uses any algorithm involving matrices or whatever. He only says "exporting 50 photos, etc." (no deep-learning stuff, no filters...). Anyway, Adobe has stated for years that Lightroom uses CUDA hardware acceleration, and in that case the 4900HS + 2060S would have blown the M1 away. In fact, in the DaVinci Resolve test the 4900HS + 2060S is twice as fast as the M1.
It's strange to think that macOS Lightroom is accelerated and CUDA isn't.

Anyway, I have no way to resolve the matter.
 
  • Like
Reactions: Tlh97 and lobz

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
Browsers have been multi-threaded; web browsing has not. JavaScript is still a fundamentally single-threaded language.

If there is no parallelism to be exploited in web browsing, I have to wonder why every browser has "hardware acceleration" as a feature.

I can only say you must be using Netscape Navigator or something LOL! :D

Independent modules of the same code base can be compiled in parallel, which does not mean that compiling is a multithreaded workload, even if it can benefit from multithreading on certain occasions. You can google this in more detail, as I have grossly simplified it.

This I can somewhat agree with. From what I gather, compiling code is sequential by nature, but because the vast majority of programmers are compiling large numbers of files, a strong multicore CPU can be an asset, as each core/thread can be assigned to a particular file until it finishes.
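As a toy sketch of that pattern (the file names and the per-file "work" are invented): each file is still handled sequentially on its own, but because the files are independent they can be farmed out across however many cores are available, which is exactly what make -j style builds exploit.

```swift
import Foundation

// Hypothetical list of independent translation units.
let files = (1...32).map { "module\($0).c" }

// Each file is "compiled" sequentially on its own...
func compile(_ file: String) {
    Thread.sleep(forTimeInterval: 0.01)   // stand-in for the real per-file work
}

// ...but independent files can be handed to different cores.
DispatchQueue.concurrentPerform(iterations: files.count) { i in
    compile(files[i])
}
print("Built \(files.count) files")
```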

On the topic of games, the bottleneck is still single thread performance in most cases when it comes to CPU.

It's not as simple as that, unfortunately. First off, the majority of games are bottlenecked by the GPU, not the CPU. Games that are CPU-bottlenecked are comparatively rare, and mostly so because they are simulation-heavy (especially in terms of AI) or use legacy APIs like DX9/DX10/DX11 or OpenGL. And as I mentioned before, many of the best game engines these days use task-based parallelism, so workloads that used to occupy an entire core/thread by themselves years ago in the PS3/Xbox 360 era are now broken up across multiple threads. Then you have to factor in that many of the newest games use DX12/Vulkan, which are much more geared towards parallelism.

A good example of this is the id Tech 7 engine used in Doom Eternal. The engine has no lead rendering thread; rendering is dispersed across any available thread, courtesy of its expert Vulkan implementation.
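A rough sketch of what task-based parallelism looks like in general (job names invented; this is not how id Tech 7 is actually written): instead of one thread owning "rendering", the frame is chopped into independent jobs that any worker thread can pick up, and the frame completes when they all do.

```swift
import Foundation

// Invented per-frame jobs; in a task-based engine none of these is pinned to
// a dedicated "render thread" - any worker can pick any job up.
let frameJobs: [(name: String, work: () -> Void)] = [
    ("animation",       { /* skin meshes */ }),
    ("physics",         { /* integrate rigid bodies */ }),
    ("visibility",      { /* cull objects against the camera */ }),
    ("command buffers", { /* record GPU command lists */ }),
]

let group = DispatchGroup()
for job in frameJobs {
    DispatchQueue.global(qos: .userInteractive).async(group: group) {
        job.work()
        print("\(job.name) done")
    }
}
group.wait()   // present the frame once every job has finished
```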

I don't understand what is hard to get about the idea that single-thread code will never be obsolete: if your code requires the result of a previous operation to proceed, it can't be parallel.

I don't understand why you keep implying that I am saying single threaded code/performance is irrelevant or obsolete. I have never said that, and I don't think that.

Single threaded performance has an important place in computing, but due to the massive progression we've had over the years towards parallelism, practically all performance sensitive apps are accelerated by multithreading/multicore.
 

jeanlain

Member
Oct 26, 2020
149
122
86
Anyway i have no way to resolve the matter.
Even without CUDA, the AMD CPU should have beaten the M1 running under Rosetta. Perhaps CUDA wasn't working and the test only used the AMD CPU, while the M1's GPU was being used. But if this test relied heavily on the GPU, the 16" MBP with its AMD dGPU should have beaten the M1. It did not (that's in another video).

What I said about Accelerate and matrices was just an example. Adobe could be using other macOS APIs that can now use various pieces of specialised hardware. And as I said, there's probably a macOS effect. Maybe they use some macOS APIs that have no equivalent on Windows, like the RAW format decoders. I don't think Windows ships with anything of the sort, so Adobe would have to come up with their own, or use another implementation, which could be slower.
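As an illustration of the RAW-decoder point, on macOS reading a RAW file through the system codecs is just a couple of Image I/O calls, so any app that goes this route automatically gets whatever decoders (and decode paths) Apple ships, without bundling its own. A sketch, with a hypothetical file path:

```swift
import Foundation
import CoreGraphics
import ImageIO

// macOS ships RAW decoders as part of Image I/O, so an app gets support for
// new cameras without bundling its own codecs. The path below is hypothetical.
let url = URL(fileURLWithPath: "/Users/me/Pictures/sample.cr3") as CFURL

if let source = CGImageSourceCreateWithURL(url, nil),
   let image = CGImageSourceCreateImageAtIndex(source, 0, nil) {
    print("Decoded \(image.width) x \(image.height) pixels with the system RAW codec")
} else {
    print("Could not decode the file")
}
```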
 

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
That Xeon Line is linear up to 24 threads. If the CPU can deliver, it will scale linearly.
It also scales linearly from 1 thread to 2 threads! The Xeon 3175X has 56 threads, and starts to scale non-linearly when not even half are saturated.
There are ZERO cases I'm aware of where Cinebench scales linearly through the entire capacity of any CPU with more than 2 threads available.
If you'd care to provide some evidence to back your claim that Cinebench scales linearly, I'm happy to review and accept new evidence and change my stance on this issue.
 
Last edited:

Heartbreaker

Diamond Member
Apr 3, 2006
4,222
5,224
136
It also scales linearly from 1 thread to 2 threads! How many cores does the 3175X have again? I don't think you bothered to check. 28. It has 56 threads. Not 24 threads, not even 24 cores.
There are ZERO cases I'm aware of where Cinebench scales linearly through the entire capacity of any CPU with more than 2 threads available.
If you'd care to provide some evidence to back your claim that Cinebench scales linearly, I'm happy to review and accept new evidence and change my stance on this issue.

I didn't say it scaled over the full range. But I'm not sure why you are saying it only scales to 2 threads. There is a straight line from 1 to 24 threads on the Xeon.

Straight line on graph = linear.
 
  • Like
Reactions: Zucker2k

teejee

Senior member
Jul 4, 2013
361
199
116
Adding SMT does not affect the ST performance at all. Do you think Intel would have added SMT if it caused a significant ST performance regression? Heck, just look back at the performance of their CPUs before/after they added SMT, do you see a regression? How about for AMD? Or are you going to argue that somehow only Apple would see a performance regression, but Intel and AMD were somehow immune?

The question of adding SMT has nothing to do with "difficulty" (if Apple can design a CPU that's competitive with the best x86 in ST using a fraction of the power, SMT would present little challenge to them); it is a question of what your market is and whether it makes sense. Apple's market is overwhelmingly mobile, where there's no point to SMT since people don't run workstation/server levels of threading on a phone, and I would argue the little cores greatly reduce the utility of SMT.

Plus you introduce potential security headaches, as more attacks against CPUs with SMT enabled are discovered all the time.

SMT might make sense on say the Mac Pro, but that's too niche of a market for them to bother with all that effort when that's the only place it is used. If they start using their CPUs for their own servers, then I could see it being worth it - they'd just leave it permanently disabled on the phone/tablet and potential laptop cores.

There are no examples of Intel adding SMT to a high-IPC core afterwards. The Core architecture had it from day one.
SMT requires quite a lot of transistors that could be used for other things if a core is designed without SMT. Remember that development of a CPU core has strict requirements on die/transistor budget and power draw, so anything you add basically means something else is left out.

And the benefit from certain improvements to the prefetchers, the memory system and other things becomes quite different if, for example, a memory stall means a complete halt of the core versus just more time for the other thread. So with or without SMT, you end up choosing different paths in many of the design decisions made during the evolution of a core.

How much this affects performance is something we can only speculate about, but I believe the impact is significant.

I agree that Apple has less benefit from SMT than AMD/Intel, partly due to targeting different products, but that does not contradict my arguments above (rather the opposite). Same for the security headaches: not having to deal with them leaves room for other solutions in Apple's cores.
 

nxre

Member
Nov 19, 2020
60
103
66
Single threaded performance has an important place in computing, but due to the massive progression we've had over the years towards parallelism, practically all performance sensitive apps are accelerated by multithreading/multicore.
I think we differ in how we are defining parallelism. A parallel application, for me, would be one that can scale to however many threads you give it: if you have a 4-thread processor it uses 4 threads, and if you have a 64-thread processor it uses 64 threads. This is the ideal parallelism, because you always gain performance by adding more cores, and it is the sort of parallelism GPUs exploit.
Many of the examples you are using are not like this; in fact, almost nothing is like this. What you are describing is situations where a programmer specifically identifies parts of the code that can be independent and makes them run in their own thread (as in web browsers and games). While this has enabled processors to benefit from multiple cores, it is not perfect parallelism: at a certain point the code has no more opportunities for parallelism - this can be at 4 threads, 8 threads, whatever - and after that, only improving the performance of those individual threads gains you anything.
Point being, applications can be optimized to take advantage of multiple threads while not being truly parallel, and in that case, once they saturate the number of threads they can take advantage of, single-thread performance is the only way to improve. The limit of this optimization is also not some big number that is far from being a problem: most applications can't even make use of more than 4 threads.
As you can see, single-threaded performance is always relevant, even in apps optimized to take advantage of multiple threads. While I understand you are not saying single-thread is irrelevant, I think you perceive it as a niche benchmark because of the world of applications that take advantage of multithreading, when it's far from that.
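That limit is essentially Amdahl's law: if only a fraction p of the work can run in parallel, the speedup on n threads is capped at 1 / ((1 - p) + p/n), so once the parallel portion is saturated, only single-thread gains move the needle. A quick sketch (the 90%-parallel figure is just an assumed example):

```swift
import Foundation

// Amdahl's law: if a fraction p of the work can run in parallel,
// the speedup on n threads is 1 / ((1 - p) + p / n).
func amdahlSpeedup(parallelFraction p: Double, threads n: Double) -> Double {
    1.0 / ((1.0 - p) + p / n)
}

// Assume 90% of an app's work can be spread across threads (made-up figure).
for n in [1.0, 2.0, 4.0, 8.0, 16.0, 64.0] {
    let s = amdahlSpeedup(parallelFraction: 0.9, threads: n)
    print("\(Int(n)) threads -> \(String(format: "%.2f", s))x speedup")
}
// The speedup tops out below 10x no matter how many cores you add; past that
// point only faster single-thread performance helps.
```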
 

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
I didn't say it scaled over the full range. But I'm not sure why you are saying it only scales to 2 threads. There is a straight line from 1 to 24 threads on the Xeon.

Straight line on graph = linear.
Nah. You claimed CB has linear scaling but have failed to provide evidence where there's linear scaling on any CPU, unless you cut out the portions of the data you don't like. Yes, I know most CPUs clock down with increasing core usage - doesn't matter. You still have failed to prove that it's linear scaling. Run CB23 on 1...n cores on a CPU with locked all-core speed and let's see if it scales linearly! Until you prove it, well, it's just a claim that's not validated. All the data we have shows that it scales non-linearly.
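For what it's worth, that check only needs a toy harness along these lines (a sketch, not a calibrated benchmark: the workload is synthetic and you would still need to pin clocks): run the same embarrassingly parallel work at 1, 2, 4 and 8 threads and compare the measured speedup with the ideal.

```swift
import Foundation

// Toy "embarrassingly parallel" kernel: every iteration is independent.
func busyWork(_ iterations: Int) -> Double {
    var x = 1.0
    for i in 1...iterations { x += sin(Double(i)) }
    return x
}

// Split the same total amount of work across `threads` threads and time it.
// A benchmark that "scales linearly" should roughly halve the wall time each
// time the thread count doubles (until clocks, memory or fabric get in the way).
func timedRun(threads: Int, totalWork: Int = 80_000_000) -> TimeInterval {
    let perThread = totalWork / threads
    let group = DispatchGroup()
    let start = Date()
    for _ in 0..<threads {
        DispatchQueue.global().async(group: group) { _ = busyWork(perThread) }
    }
    group.wait()
    return Date().timeIntervalSince(start)
}

let baseline = timedRun(threads: 1)
for t in [1, 2, 4, 8] {
    let speedup = baseline / timedRun(threads: t)
    print("\(t) threads: \(String(format: "%.2f", speedup))x (ideal \(t)x)")
}
```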

Also, your original claim is "If you run a fully MT Embarrassingly Parallel benchmark, you can figure out the scaling you get from SMT, without any need to run a separate 2 thread benchmark." That's untrue. By only running an embarrassingly parallel benchmark multi-threaded, you cannot figure out the scaling you get from SMT, cores, or anything else. You just get a single MT score.

Anyway, that's not the original point.

The original point - and I'm not sure why you keep bringing up things that are so tangential - is that evaluating who has the fastest single-threaded performance, who has the fastest single-core performance, who has the fastest 4T performance and who has the fastest 8T performance are all valid questions, and in particular, evaluating who has the core that can do a given workload the fastest is an interesting thought experiment. Most browsers can utilize 4+ cores, and most workloads users run utilize a lot more cores than you seem to think are important. Evaluating ST performance is important because many workloads are heavy on one core compared to the others, but they still do utilize other cores - otherwise Apple wouldn't have put 2 Firestorm and 4 Icestorm cores in their iPhone SoC.

Anyway, the point is that if we evaluate a single Firestorm core and a single Tiger Lake core, the Tiger Lake core can do more CB23 work and more GB5 work than the Firestorm core. That's interesting to me given the die area, process, power consumption, etc. What's the power consumption of a single TGL core compared to a single Firestorm core accomplishing the same task? Would be fun to examine more in-depth.
 
Last edited:
  • Like
Reactions: Tlh97

Heartbreaker

Diamond Member
Apr 3, 2006
4,222
5,224
136
Nah. You claimed CB has linear scaling but have failed to provide evidence where there's linear scaling on any CPU,

It's not going to be linear for every CPU because the CPUs themselves will have issues, not the benchmark.

The benchmark is embarrassingly parallel, and as long as the CPU doesn't throttle, hit weird memory issues, run into fabric issues, etc... then it can scale linearly.

The Xeon scaling linearly for the first 24 threads demonstrates that. Other CPUs failing to do that is those CPUs having issues, just like the Xeon has issues after 24 threads but is linear before that point.

I actually didn't expect it to go wonky so early on the Threadrippers. It's likely the inter-chip and inter-CCX latencies that throw it off.

So I acknowledge it won't scale well on Threadrippers, and potentially on other CPUs with non-uniform memory access.
 

Gideon

Golden Member
Nov 27, 2007
1,608
3,573
136
The graph I posted with 4-8 simultaneous thread occupancy on an Android phone includes every workload needed to render a page, hence there's much more work to do on the CPU than just one JavaScript thread.

Web browsing is multi-threaded.

Yeah, since this touches a subject I'm a bit more familiar with, I'll add my 2 cents.

Modern browsers are actually quite multi-threaded.

JavaScript on a single page in a single tab is indeed single-threaded. But only if:
  • That page doesn't use Web Workers - with which you can run things in other threads
  • Doesn't use WebAssembly libraries (WebAssembly now supports multi-threading)
  • Doesn't have nested iframes from other pages on different domains (which in both Chromium and Firefox run in a separate process, let alone thread)

And JavaScript is just a small part of the list of things that browsers do to actually deliver your page.
I'll take Firefox as an example (as they have nice blog-posts explaining stuff I'll link below), but many things from here apply to Chrome as well:
  • Composition is done off main-thread and can be quite taxing
    • (and in fact nowadays GPU is used quite extensively as well, see Firefox Webrender for instance)
  • CSS layout can be very parallel if designed well
  • HTML Parsing can be parallel (Though this is a different engine not used in FF currently, the Gecko engine also supports that)
  • And yet again, Chrome and Firefox use separate processes for every domain you have open (for Firefox this is very recent; see Project Fission)
And all of that doesn't even account for the fact that JavaScript itself is not just interpreted but also JIT (just-in-time) compiled - meaning the hot parts of your code get compiled to native code to speed them up.
That happens on the fly and is also done off the main thread.

Browsers aren't parallel in a sense that say Cinebench is (they don't tax all your cores to 100% with embarrassingly parallel workloads) but they absolutely do use multiple threads.

Just try running your browser in a virtual machine with a single vCPU; it won't be a pleasant exercise even with a fast processor and a hypervisor (i.e. minimal overhead).
In fact, just going from a 2C/4T CPU to a 4C/8T CPU is often a noticeable speedup while browsing (if you also have other stuff open). Less so with more cores, but it's still there.

The reason for that misconception is simple - Benchmarks.

Old benchmarks (Octane, Kraken, etc.) are the worst, testing fringe, completely atypical JavaScript functionality (which sometimes gets special paths in browsers just to look good). For instance, Firefox totally redesigned their JS engine (Warp) and got real-life site-loading improvements of 12-20% on JS-heavy sites (Google Docs, Reddit, Netflix), while these benchmarks showed significant regressions (despite the actual browsing experience improving greatly).

Speedometer 2.0 is much better, because it at least renders an actual JavaScript SPA (in different JS frameworks), but it's still very simplistic and doesn't do the stuff most real websites do: no embedded Twitter/Facebook frames, no Google Analytics, no ads (separate domains) nor adblock (which can be quite taxing on ad-heavy sites), no SVG or huge image rendering, no worker threads reading/writing from IndexedDB (separate thread).

All of these things benefit from more cores (at least more than 1-2) in the real world, but aren't shown in any benchmarks.
 
Last edited:

nxre

Member
Nov 19, 2020
60
103
66
Qualcomm Execs Say Apple M1 Further Validates Windows On Arm, And They’re Right

So it seems Qualcomm is invested in making laptop chips. Let's hope they actually try to compete. A 4x X1 + 4x A55, or even a 6x X1 + 4x A55, configuration running at higher clock speeds could be extremely competitive.
Based on performance estimates and leaked benchmarks, I think a 4 GHz X1 could trade blows with an M1 Firestorm core; I have no idea about the efficiency figures. But even if it doesn't match the M1, it doesn't matter: they only need to match the best Intel or AMD has to offer in the ultrabook/laptop market at much lower power to become viable. This, of course, assumes Microsoft gets Windows on ARM optimized and actually usable.
 

Bam360

Member
Jan 10, 2019
30
58
61
Qualcomm Execs Say Apple M1 Further Validates Windows On Arm, And They’re Right

So it seems Qualcomm is invested in making laptop chips. Let's hope they actually try to compete. A 4x X1 + 4x A55, or even a 6x X1 + 4x A55, configuration running at higher clock speeds could be extremely competitive.
Based on performance estimates and leaked benchmarks, I think a 4 GHz X1 could trade blows with an M1 Firestorm core; I have no idea about the efficiency figures. But even if it doesn't match the M1, it doesn't matter: they only need to match the best Intel or AMD has to offer in the ultrabook/laptop market at much lower power to become viable. This, of course, assumes Microsoft gets Windows on ARM optimized and actually usable.

There is no way a Cortex-X1 can reach 4 GHz; we've seen slides showing that clocks are expected to be fairly similar to the Cortex-A77 on a similar process node. Maybe if it uses high-performance libraries, and even then it may not be possible, or if it is, it would need a very high voltage. Or maybe TSMC 5 nm is much better than expected, but based on the Kirin 9000, I'm not seeing anything special.
 

nxre

Member
Nov 19, 2020
60
103
66
There is no way a Cortex-X1 can reach 4 GHz; we've seen slides showing that clocks are expected to be fairly similar to the Cortex-A77 on a similar process node. Maybe if it uses high-performance libraries, and even then it may not be possible, or if it is, it would need a very high voltage. Or maybe TSMC 5 nm is much better than expected, but based on the Kirin 9000, I'm not seeing anything special.
I doubt it too. The X1 was meant to reach 3 GHz in phones, but Qualcomm caps it at 2.84 GHz for reasons I can't guess. It should be easier for the X1 to clock higher, given it's much narrower than Firestorm, but we can't know until actual products reach the market.
 

Qwertilot

Golden Member
Nov 28, 2013
1,604
257
126
I'm at least as intrigued as to what - for instance - Samsung do going forward. They've got fabs, SOC's and so on.

Even if they can't push a totally top end chip, there must be some incentive for people like them to push out some cheap but quite effective laptop chips. It has never seemed to quite take off even for Chromebooks though, so maybe not.
 

jpiniero

Lifer
Oct 1, 2010
14,510
5,159
136
I'm at least as intrigued as to what - for instance - Samsung do going forward. They've got fabs, SOC's and so on.

Even if they can't push a totally top end chip, there must be some incentive for people like them to push out some cheap but quite effective laptop chips. It has never seemed to quite take off even for Chromebooks though, so maybe not.

Price is a big issue. Most of the ARM Chromebooks are using 4-5 year old processors that aren't even on a semi-modern node. The MediaTek one that is popular is fabbed on TSMC 28 nm. Intel can compete pretty well with the (also aging) 14 nm Atoms if they are also willing to be cheap.
 

Heartbreaker

Diamond Member
Apr 3, 2006
4,222
5,224
136
I'm at least as intrigued as to what - for instance - Samsung do going forward. They've got fabs, SOC's and so on.

Even if they can't push a totally top end chip, there must be some incentive for people like them to push out some cheap but quite effective laptop chips. It has never seemed to quite take off even for Chromebooks though, so maybe not.


Samsung would be a great option, since they licensed AMD RDNA GPUs for mobile; 4+ X1 cores plus a decent-sized RDNA GPU would make a nice laptop part.

But the problem right now for anyone but Apple building an ARM laptop chip is volume. The 8cx was likely partially financed by Microsoft, and we still got essentially the same lackluster chip for two "generations". The second-generation part was the exact same core config, just with a new modem (probably the same HW with a SW change).

So I wouldn't count on anyone, not partnered with Microsoft, building a laptop part until ARM-Windows shows more significant signs of life.

I think the most likely option is that the Microsoft-Qualcomm partnership continues, spawning an X1 based laptop part.

But even that potential future part seems likely to trail the M1 distantly.

It will be interesting to see how Microsoft reacts to the M1. Will they license ARM-Windows for virtualization on the M1, or will they refuse, hoping no one notices how Windows on the M1 embarrasses their custom Qualcomm HW?
 

nxre

Member
Nov 19, 2020
60
103
66
It will be interesting to see how Microsoft reacts to the M1. Will they license ARM-Windows for virtualization on the M1, or will they refuse, hoping no one notices how Windows on the M1 embarrasses their custom Qualcomm HW?
I'm curious about that too. Apple said it was up to Microsoft whether they want to port Windows to the M1. To me this sounds as if Apple isn't open to providing support and documentation for Microsoft to port Windows to the M1, but I may be reading too much into it.
The question becomes: does Windows on the Mac benefit Microsoft, or does it benefit Apple? If it only benefits Apple, they won't do it. If it benefits them, they might. Do you think it benefits Microsoft to do so?
 

scannall

Golden Member
Jan 1, 2012
1,944
1,638
136
I'm curious about that too. Apple said it was up to Microsoft whether they want to port Windows to the M1. To me this sounds as if Apple isn't open to providing support and documentation for Microsoft to port Windows to the M1, but I may be reading too much into it.
The question becomes: does Windows on the Mac benefit Microsoft, or does it benefit Apple? If it only benefits Apple, they won't do it. If it benefits them, they might. Do you think it benefits Microsoft to do so?
I would think that it would benefit both of them.

As for it being up to Microsoft, the problem is that they currently license Windows on ARM to OEMs for devices only, not at retail.
 
  • Like
Reactions: nxre

Eug

Lifer
Mar 11, 2000
23,583
996
126
Qualcomm and Apple are such fast friends now, it's cute. ;) Just two years ago they were mortal enemies. It's amazing what $4.5 billion can do. :p

Browsers have been multi-threaded; web browsing has not. JavaScript is still a fundamentally single-threaded language.
I guess you never run multiple tabs with background loading, never access websites with multimedia content, and never run another app at the same time as a browser?

While part of what you said may be true, all you have to do is take a machine, vary the number of active cores, and compare the experience; you'll see how much of a difference it makes.
 
  • Like
Reactions: Tlh97

Heartbreaker

Diamond Member
Apr 3, 2006
4,222
5,224
136
Qualcomm and Apple are such fast friends now, it's cute. ;) Just two years ago they were mortal enemies. It's amazing what $4.5 billion can do. :p

Qualcomm isn't supporting Apple here; they were just doing PR and damage control for themselves. They were specifically asked about the M1, and they answered that Qualcomm led the way on the transition from x86 to ARM by partnering with Microsoft.