Discussion Apple Silicon SoC thread


Eug

Lifer
Mar 11, 2000
23,583
996
126
M1
5 nm
Unified memory architecture - LPDDR4X
16 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 12 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache
(Apple claims the 4 high-efficiency cores alone perform like a dual-core Intel MacBook Air)

8-core iGPU (but there is a 7-core variant, likely with one inactive core)
128 execution units
Up to 24576 concurrent threads
2.6 Teraflops (sanity-checked in the sketch below)
82 Gigatexels/s
41 gigapixels/s

16-core neural engine
Secure Enclave
USB 4

Products:
$999 ($899 edu) 13" MacBook Air (fanless) - 18 hour video playback battery life
$699 Mac mini (with fan)
$1299 ($1199 edu) 13" MacBook Pro (with fan) - 20 hour video playback battery life

Memory options 8 GB and 16 GB. No 32 GB option (unless you go Intel).

It should be noted that the M1 chip in these three Macs is the same (aside from GPU core count). Basically, Apple is taking the same approach with these chips as it does with the iPhones and iPads: just one SKU (excluding the X variants), the same across all iDevices (aside from maybe slight clock speed differences occasionally).
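(A back-of-envelope check on the teraflops figures here and in the M2 specs further down; the per-core ALU count and the ~1.278/~1.398 GHz GPU clocks are commonly reported estimates, not from this post:)

```python
# FP32 throughput = ALUs x 2 FLOPs (FMA) x clock.
# Assumed: 128 ALUs per GPU core (8 cores x 128 = 1024 for the M1),
# and commonly reported clocks of ~1.278 GHz (M1) / ~1.398 GHz (M2).
def gpu_tflops(cores: int, ghz: float, alus_per_core: int = 128) -> float:
    return cores * alus_per_core * 2 * ghz / 1000  # GFLOPs -> TFLOPs

print(gpu_tflops(8, 1.278))   # ~2.62, matching the 2.6 TFLOPs above
print(gpu_tflops(10, 1.398))  # ~3.58, matching the M2's 3.6 TFLOPs below
```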

EDIT:


M1 Pro 8-core CPU (6+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 16-core GPU
M1 Max 10-core CPU (8+2), 24-core GPU
M1 Max 10-core CPU (8+2), 32-core GPU

M1 Pro and M1 Max discussion here:


M1 Ultra discussion here:


M2 discussion here:


Second Generation 5 nm
Unified memory architecture - LPDDR5, up to 24 GB and 100 GB/s
20 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 16 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache

10-core iGPU (but there is an 8-core variant)
3.6 Teraflops

16-core neural engine
Secure Enclave
USB 4

Hardware acceleration for 8K h.264, h.265 (HEVC), and ProRes

M3 Family discussion here:

 

Heartbreaker

Diamond Member
Apr 3, 2006
4,222
5,224
136
Agree that the ballooning size of the chips in these predictions is pretty sus, but it's not impossible. The M1 is only 120mm2 and less than a quarter of that is the four Firestorm cores and their memory. Remove the GPU cores, move to N5P, and you can probably make a 16-Firestorm-core part that is 150mm2 or less. Yield is probably kind of sucky at this size on a new process, but if you're planning on disabling as many as half the CPU cores anyway and your die is mostly CPU cores, maybe that doesn't matter.

Well, sure if you are ditching the SoC design, and going discrete for laptops, but that is also dubious.
 

LightningZ71

Golden Member
Mar 10, 2017
1,627
1,898
136
If they are going to maintain the NPU at its current size, and they are willing to eat the die-size cost of going to 16 cores, I don't see where it's insurmountable. The resulting chip wouldn't be too massive, and using die salvage, they could get a few different SKUs out of it. Maybe up to six, with 8, 12, or 16 cores and a choice of a half-enabled GPU with half the memory channels, or a full-house GPU with all of them.
 

nxre

Member
Nov 19, 2020
60
103
66
Doing some quick rough estimates: 4 big cores + 12 MB cache is around 15% of the M1 die size, so roughly 18mm2. Assuming 16 big cores + 48 MB cache (which I doubt; 32 MB seems more likely), we would be adding 54mm2 to the die, for around 170mm2, and all the added I/O should put it around 200mm2. Not impossible, but expensive. Even if we subtract the iGPU, assuming there won't be one for the high-end parts, it would still be around 170mm2.
A 200mm2 die on a bleeding-edge node. No wonder they are binning this thing to hell. Good thing they at least don't have to worry about hitting 5 GHz clock speeds on these.
For the GPUs, an 8-core part is around 30mm2. A 64-core part would be around 250mm2, which is still safe in the GPU space. A 128-core part would be 500mm2. For comparison, AMD has 500mm2 dies and NVIDIA has 600mm2 ones, so not impossible; but unless they plan on selling a lot of these, I think it probably is less expensive to join two 64-core dies than to produce a 128-core monolithic one. EDIT: I forgot to count the SLC cache, which is used by the GPU, so these would probably be bigger.
But extremely curious nonetheless to see how they proceed.
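(The die-area arithmetic above as a minimal sketch; the ~120mm2 M1 die size is the commonly cited figure, and the I/O allowance is a guess:)

```python
# Scale the M1's big-core cluster up to 16 cores, per the estimate above.
M1_DIE = 120.0                  # mm^2, commonly cited M1 die size
big_cluster = 0.15 * M1_DIE     # 4 big cores + 12 MB cache ~= 18 mm^2

die = M1_DIE + 3 * big_cluster  # three more clusters => ~54 mm^2 extra
print(die)                      # ~174 mm^2 before extra I/O
print(die + 30)                 # ~200 mm^2 with a guessed I/O allowance
```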
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
What do you do when your CPU handily beats the competition in IPC and power consumption? You outclass them. Like releasing the X6800 when a lowly E6300 is enough to match competitors: an extra 2MB of L2 and 1 GHz of clock to senselessly beat the opposition in the minds of media and enthusiasts.

That is what Apple is going to do - release chips with a ton of cores to nuke laptop and desktop from outer space and capture the prestige of being No. 1. A few percent better is not enough for them, and the cost to produce such a chip is a secondary concern to a company like Apple.
 

awesomedeluxe

Member
Feb 12, 2020
69
23
41
Well, 16+4 would be fine in an iMac. The other thing is that you're assuming a 16+4 part would be running at the same clock speed as their 4+4 part.
I think you misread - I said "clocked at iPhone speeds." To be more specific, 16 Firestorm cores clocked around 2.89GHz would use around 40W. You could get to 3GHz on N5P, though.

The only reason I was thinking about thermals is because this is a pretty ambitious proposition for the MBP16. Agree that there's no thermal constraint in the iMac.
Well, sure if you are ditching the SoC design, and going discrete for laptops, but that is also dubious.
Yeah, this isn't really the direction I would expect. But looking at the M1, I can sort of envision a GPU being glued to the other side of 4x LPDDR5-6200 modules. That's a solid 200+GB/s of bandwidth. And something something unified memory.
 

nxre

Member
Nov 19, 2020
60
103
66
I don't think Apple is moving into the server market as a chip seller, but I definitely think they will move all of their operations to servers running on 32-core or larger parts. It is cheaper, the power usage is significantly reduced, and ~vertical integration~. They may consider offering something similar to AWS and Azure, which is a highly lucrative market that would completely justify the price of developing these massive chips, but I doubt that part.
 

awesomedeluxe

Member
Feb 12, 2020
69
23
41
Having gobs of GPU cores is one thing... But how are they feeding that beast? If they maintain their current setup, those chips will require packages surrounded by 8 or more LPDDRX stacks, or a couple of HBM stacks. They're going to look more like consoles and less like traditional desktops...
Eh, for a Macbook 16 at least - with presumably 32 GPU cores - I think 4x LPDDR5-6400 is enough. 200GB/s bandwidth, and you are scaling the bandwidth almost linearly with the number of additional GPU cores relative to the M1. Performance will still be well above the 5600M.

No idea what solution they would use for a 64 core GPU, but since that's a desktop part they have a lot more options.
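(The 200GB/s figure checks out as arithmetic, assuming four 64-bit LPDDR5 packages forming a 256-bit bus - the widths are an assumption, not stated above:)

```python
# Peak bandwidth = transfer rate (MT/s) x bus width (bytes).
def mem_bw_gbs(mts: int, bus_bits: int) -> float:
    return mts * (bus_bits / 8) / 1000

print(mem_bw_gbs(4266, 128))  # ~68.3 GB/s: the M1's LPDDR4X-4266 setup
print(mem_bw_gbs(6400, 256))  # ~204.8 GB/s: 4x LPDDR5-6400 as proposed
```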
Doing some quick rough estimates: 4 big cores + 12 MB cache is around 15% of the M1 die size, so roughly 18mm2. Assuming 16 big cores + 48 MB cache (which I doubt; 32 MB seems more likely), we would be adding 54mm2 to the die, for around 170mm2, and all the added I/O should put it around 200mm2. Not impossible, but expensive. Even if we subtract the iGPU, assuming there won't be one for the high-end parts, it would still be around 170mm2.
A 200mm2 die on a bleeding-edge node. No wonder they are binning this thing to hell. Good thing they at least don't have to worry about hitting 5 GHz clock speeds on these.
For the GPUs, an 8-core part is around 30mm2. A 64-core part would be around 250mm2, which is still safe in the GPU space. A 128-core part would be 500mm2. For comparison, AMD has 500mm2 dies and NVIDIA has 600mm2 ones, so not impossible; but unless they plan on selling a lot of these, I think it probably is less expensive to join two 64-core dies than to produce a 128-core monolithic one. EDIT: I forgot to count the SLC cache, which is used by the GPU, so these would probably be bigger.
But extremely curious nonetheless to see how they proceed.
Nice estimates. I forgot about the space needed for the extra I/O. I do feel like "16 perf core part and 32 core GPU" implies strongly that this would not be an APU, so ~170mm2 seems right.

Yeah, if the report is accurate, they are prepared for terrible yield on these CPU parts. It's a really weird strategy and I wonder how they are avoiding fatal defects. I would just assume any die which has enough defects that half the CPU cores have to be disabled has a fatal defect somewhere too.
 

Doug S

Platinum Member
Feb 8, 2020
2,201
3,405
136
Ok. It's a smaller Mac Pro. Not a Pro version of the Mini. Even half the size of the current Mac Pro will be a sizable computer, and it won't be called Mini.

Not an in-between model, but the newer, smaller Mac Pro, likely with less expansion capability. But more expansion than the Trashcan Macs.


There's no reason they couldn't make the new Mac Pro a LOT smaller though. It doesn't need to be as upgradeable as a traditional workstation - if Apple is going to be the only GPU supplier, there's no need for PCIe slots; the GPU will be soldered on (whether part of the SoC module via chiplets, or discrete). You have a bunch of DIMM slots (probably four DDR5 channels) for RAM, some NVMe slots - what else do you need for internal expansion?

Give it a couple of network ports (at least one of them 10GbE), a half dozen or more USB/TB ports, and multiple HDMI/DP, and you plug in the rest of what you need like external storage arrays, displays and so forth; that takes care of everything you might plug into it. It would probably be nice if they had four USB-A ports for stuff like keyboards and USB sticks, but knowing Apple you'll probably need a USB-C to USB-A hub for that.

If you don't have PCIe slots or support for internal SATA drives, the form factor is mostly limited by the area required for DIMM slots and the space needed for efficient cooling. While I'm not predicting they will, I will point out that having the CPU/GPU soldered to the board would make it very easy for Apple to build in phase-change liquid cooling to keep it nice and quiet.

Not sure why they would make a "Mac Pro Mini" - I wonder if the rumor got its wires crossed and it is a "Mac Mini Pro". Now THAT I could see, once some of the higher-end chips are out. The current Mini was designed to cool a 65W TDP, so clearly it could handle a much more powerful M1 successor down the road.
 

Doug S

Platinum Member
Feb 8, 2020
2,201
3,405
136
Eh, for a Macbook 16 at least - with presumably 32 GPU cores - I think 4x LPDDR5-6400 is enough. 200GB/s bandwidth, and you are scaling the bandwidth almost linearly with the number of additional GPU cores relative to the M1. Performance will still be well above the 5600M.

No idea what solution they would use for a 64 core GPU, but since that's a desktop part they have a lot more options.

I don't think they can share the memory bus between the CPU and GPU when they scale much higher. If you're going to add another controller to the SoC, it makes more sense for the CPU and GPU to each have their own. Having four DDR5 controllers in a laptop just to allow sharing the CPU and GPU RAM sounds a little crazy to me.

So if you have an 8+4 chip for the Mac Pro with 32 GPU cores, as I've previously suggested, you'd have a 'regular' DDR5 memory controller for the CPU that interfaced with SODIMMs, and a GDDR6/6x controller interfacing to chips on the package like the current M1. That could also potentially work as a chiplet to build the big stuff if they don't go discrete like that recent rumor suggests.
 

Heartbreaker

Diamond Member
Apr 3, 2006
4,222
5,224
136
So if you have an 8+4 chip for the Mac Pro with 32 GPU cores, as I've previously suggested, you'd have a 'regular' DDR5 memory controller for the CPU that interfaced with SODIMMs, and a GDDR6/6x controller interfacing to chips on the package like the current M1. That could also potentially work as a chiplet to build the big stuff if they don't go discrete like that recent rumor suggests.

That's a waste/overlap of pin/pad and memory controller space. If you are connecting that many memory channels to the chip, you should go unified, with a single bus type.

I don't believe there is any chance of the M2 chip in top-end MBPs and iMacs having a 32-core iGPU. If there is a 32-core discrete chip, then yes, a separate memory bus and a different type of memory might make a lot more sense.
 

Doug S

Platinum Member
Feb 8, 2020
2,201
3,405
136
That's a waste/overlap of pin/pad and memory controller space. If you are connecting that many memory channels to the chip, you should go unified and the same bus type.

I'm not sure what you're trying to say here. Are you claiming pins or memory controller blocks can somehow be shared if you have more than one of the same controller type?

The memory would still be unified - both would be connected to the SLC, and the CPU would be capable of accessing the GPU's memory space and vice versa when necessary. Using different controllers would make sense because a CPU performs best with RAM optimized for minimum latency and a GPU performs best with RAM optimized for maximum bandwidth. GDDR was created for this reason. Use the tool designed for the job, rather than using whatever you have in your hand as a hammer.
 

Heartbreaker

Diamond Member
Apr 3, 2006
4,222
5,224
136
I'm not sure what you're trying to say here. Are you claiming pins or memory controller blocks can somehow be shared if you have more than one of the same controller type?

The memory would still be unified - both would be connected to the SLC, and the CPU would be capable of accessing the GPU's memory space and vice versa when necessary. Using different controllers would make sense because a CPU performs best with RAM optimized for minimum latency and a GPU performs best with RAM optimized for maximum bandwidth. GDDR was created for this reason. Use the tool designed for the job, rather than using whatever you have in your hand as a hammer.

If you are putting DDR (let's say 128 bits wide) in there for the CPU and GDDR (128 bits wide) for the GPU, then it really isn't unified.

If you only load GPU data from the GDDR, then you lose half your bus width, missing out on the bandwidth benefit.

Makes much more sense to have a true unified 256-bit bus for everyone.

There have been lots of opportunities to follow your suggestion and no one has done this, because it makes no sense.
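(In numbers - assuming the same 6400 MT/s transfer rate on both halves purely to isolate the topology difference:)

```python
# GPU-heavy work on a split bus sees only its own 128-bit half;
# a unified 256-bit bus can hand the whole width to whoever needs it.
def bw(mts: int, bits: int) -> float:
    return mts * bits / 8 / 1000  # GB/s

print(bw(6400, 128))  # 102.4 GB/s: GPU confined to its GDDR half
print(bw(6400, 256))  # 204.8 GB/s: one shared 256-bit bus
```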
 

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
It's not going to be linear for every CPU because the CPUs themselves will have issues, not the benchmark.

The benchmark is embarrassingly parallel, and as long as the CPU doesn't throttle, hit weird memory issues, run into fabric issues, etc., then it can scale linearly.

The Xeon scaling linearly for the first 24 threads demonstrates that. Other CPUs failing to do that is those CPUs having issues, just like the Xeon has issues after 24 threads, but is linear before that point.

I actually didn't expect it to go wonky so early on the Threadrippers. It's likely the inter-chip and inter-CCX latencies that throw it off.

So, I acknowledge it won't scale well on Threadrippers, and potentially other CPUs with non-uniform memory access.
Just wanted to circle back around here, because as Apple's SoC designs start to expand into double-digit core ranges, the validity of Cinebench as a parallel-processing comparison tool will certainly come up time and time again. Your dogmatic claim that it is embarrassingly parallel may have some theoretical basis, but it is not useful or true for any practical purpose.

Two points about Cinebench (and we'll use R20 results, though R15 and it appears R23 show similar tendencies):
1) It does not scale well with cores or threads, which is not unusual. In terms of MT scaling, it's middle-of-the-road. There are definitely better tests, and worse ones.
2) It does not scale linearly with cores or threads. In fact, uniquely, it scales negatively and logarithmically with increasing thread counts.

Point 1 - scaling

Test | 5600X 1T | 5600X 12T | AT MT scaling score | 5950X 1T | 5950X 32T | AT MT scaling score | Scaling score decrease
CB20 | 600 | 4517 | 62.7% | 644 | 10096 | 49.0% | 21.9%
3DPM v1 | 111 | 1057 | 79.4% | 120 | 2783 | 72.5% | 8.7%
OpenSSL sha | 2220 | 25381 | 95.3% | 2361 | 49335 | 87.1% | 12.4%
OpenSSL md5 | 943 | 9996 | 88.3% | 1022 | 25536 | 78.1% | 11.6%
SPEC2017 int | 7.24 | 42.47 | 49.2% | 7.65 | 82.98 | 33.9% | 31.0%
SPEC2017 fp | 11.47 | 42.96 | 31.2% | 12.19 | 64.39 | 16.5% | 47.1%
GB5 | 1576 | 8705 | 46.0% | 1655 | 15726 | 29.7% | 35.5%
* AT MT scaling score = [ ( nT score / 1T score ) / # threads ]
* scaling score decrease = [ ( 5600X score - 5950X score ) / 5600X score ]
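(The footnote formula spelled out with the CB20 row, as a minimal sketch:)

```python
# AT MT scaling score = (nT score / 1T score) / thread count.
def at_mt_scaling(score_1t: float, score_nt: float, threads: int) -> float:
    return (score_nt / score_1t) / threads

s5600x = at_mt_scaling(600, 4517, 12)   # ~0.627 -> 62.7%
s5950x = at_mt_scaling(644, 10096, 32)  # ~0.490 -> 49.0%
print((s5600x - s5950x) / s5600x)       # ~0.219 -> the 21.9% decrease
```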

The key takeaway here is that CB20 is an odd lodestar for embarrassing parallelism in the consumer desktop market, especially compared to other benchmarks which show much better scaling. CB20 also exhibits poor utilization of SMT compared to other MT tests (link, cf second table). Admittedly, it's not the worst. It's also clearly not embarrassingly parallel for the CPUs we are concerned about. SHA hash on the other hand? Certainly seems like a much better candidate.

Point 2 - linearity or non-linearity of scaling

When you plot the AT MT scaling scores (which are adjusted for different 1T scores) of 5600X, 5800X, 5900X, and 5950X comparing the 7 above benchmarks, CB20 is the only one that exhibits significant logarithmic degradation (with the other tests showing exponential degradation or linear degradation). That is, the CB20 result is the only one where a reflected/shifted logarithmic curve (which exhibits progressively larger degradations of performance with increasing thread counts) was the best fit line.

[Chart: AT MT scaling scores by thread count]
12 = 5600X
16 = 5800X
24 = 5900X
32 = 5950X

We need more data, obviously. It would be nice to have separate graphs for each benchmark, and separate curves for each chip with varying thread counts enabled per-chip on the x axis, and more chips than these.

In any case, this is interesting enough. There are certainly any number of explanations for the above results. None of them offer us any logical reason to label CB20 as "embarrassingly parallel" in this context, especially with this kind of poor MT scaling and its non-linear regression in performance with increasing thread counts, and especially not with other more valid candidates for the label. I'm not sure there's a more extreme label for parallel performance than "embarrassingly parallel" and I'd contend that label should be used for something that actually, you know, scales fairly well with increasing thread count.

So as we move on to potential Apple SoC products with 8+4, 16+8, even 24, 32, 48 performance cores -- let's let the data guide us, not the dogma. If we are looking for a pure "embarrassingly parallel" test, it seems that among those tests above, OpenSSL hash rates offer a better place to start than Cinebench scores.
 

Antey

Member
Jul 4, 2019
105
153
116
For their 128-core GPU I'm expecting something like a 4-tile GPU with 32 cores per tile (512 execution units), just like Intel Xe HPC with 2 and 4 tiles. Not good for gaming, but great for everything else - just what iMac Pros are made for. And then 1 tile for the iMac, just like Intel Xe HPG.

[Image: Intel Xe tile configurations]
 

insertcarehere

Senior member
Jan 17, 2013
639
607
136
Just wanted to circle back around here, because as Apple's SoC designs start to expand into double-digit core ranges, the validity of Cinebench as a parallel-processing comparison tool will certainly come up time and time again. […]

Overlaying 5600X vs 5950X scaling scores to show that "Cinebench does not scale well with threads" brings a bunch of confounding variables to the table, much of which has nothing to do with the software and shouldn't affect multi-core Apple SoCs.
- The 5600X/5800X use one chiplet vs the 5900X/5950X's two, so inter-chiplet non-uniform memory access is one factor behind the scaling drop-off.
- Those benchmarks do not control for clock speeds, or rather, clock-speed differences between ST and MT. (Link) A 5950X can drop from >5 GHz single-core to ~3.8 GHz fully loaded with 32 threads (depends on the benchmark, obviously), while a 5600X only drops from 4.65 GHz at 1T to 4.45 GHz at 12T; that will explain a lot of the difference in scaling as well.

Now that's not to say Cinebench is the ideal benchmark for scaling with cores/threads, but there's little to differentiate between the benchmark not scaling well and Zen 3's own potential issues with scaling.
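(A quick sketch of how much the clock deltas alone cap the scaling score, using the rough frequencies above; this is arithmetic, not measurement:)

```python
# With IPC held constant, the best possible scaling score is just
# (all-core clock / 1T clock). Frequencies are the rough examples above.
print(4.45 / 4.65)  # ~0.96: the 5600X gives up almost no clock at 12T
print(3.8 / 5.0)    # ~0.76: clocks alone cap the 5950X's score near 76%
```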
 

Heartbreaker

Diamond Member
Apr 3, 2006
4,222
5,224
136
Just wanted to circle back around here, because as Apple's SoC designs start to expand into double-digit core ranges, the validity of Cinebench as a parallel-processing comparison tool will certainly come up time and time again. […]

I can't believe you went through all of this analysis from such a fundamentally flawed basis.

1) Ignoring throttling. These chips all run higher clock speeds when running single-core. Higher core count parts limit speed more to keep power in check.

2) Ignoring non-linear memory access in the CPUs. Inter-CCX, and Inter chiplet memory access vagaries.

3) Ignoring that SMT does not scale linearly. SMT only improves utilization of unused units, and only adds a fraction of the first thread's throughput on the core.

4) Assuming that a higher SMT benefit means a more parallel benchmark. It doesn't. A thread that does negligible work will leave units more underutilized, allowing a secondary thread higher gains when sharing those units.

5) Ignoring the impact of the scheduler across the other confounding factors.

On top of that, you only looked at the endpoints of 1 thread and max threads, paying no attention to the data points in between to observe where linearity breaks down.

You linked this image earlier in the thread as evidence that it doesn't scale linearly:

But I contend this is evidence that the benchmark itself is embarrassingly parallel, at least up to 24 threads. Unlike the data you used for your analysis of only endpoints, this shows 1 to 64 threads, inclusive, with all points in between.

What it shows is that the monolithic Xeon server CPU, meant to deliver consistent performance, scales textbook linearly, like one would expect from an embarrassingly parallel benchmark, up to 24 threads where it starts to deviate.

We can speculate on whether that is the benchmark or the CPU factors after that point, but my speculation would be it's the CPU hitting memory/scheduling issues, and that a better designed, higher core count CPU would scale further than this given what we see up until this point.

The Threadrippers lose linearity much sooner, and how it breaks down is governed by their scheduling model. Given that, and how the Xeon holds up, this is extremely strong evidence that the non-linear behavior is the result of CPU issues, not benchmark issues.

Here's a graph of Amdahl's law outcomes on theoretically perfect processors, running up to a theoretically perfect 100% parallel load:


[Figure: Amdahl's law - speedup vs. number of processes for varying parallel fractions]



Here is a highlighted crop from the previous CB image roughly indicating the first 24 threads on the Xeon server processor. It should be fairly obvious that until the Xeon stumbles at thread 25, it's behaving like a textbook embarrassingly parallel benchmark. That doesn't mean it has to be 100% parallel, which is more of a theoretical case, but it is clear that it is well above 95% parallel.

[Figure: Cinebench thread scaling, first 24 Xeon threads highlighted]


This is EXTREMELY strong evidence that all the variations you are seeing below 24 threads are CPU issues (e.g. clock speed, memory access, scheduling), not benchmark issues.
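(For reference, a minimal sketch of the Amdahl's-law arithmetic both posters lean on; the parallel fractions are illustrative, not measured:)

```python
# Amdahl's law: speedup on n threads with parallel fraction p.
def amdahl_speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

for p in (0.95, 0.99, 0.999, 1.0):
    print(p, round(amdahl_speedup(p, 24), 1))
# 0.95 -> 11.2, 0.99 -> 19.5, 0.999 -> 23.5, 1.0 -> 24.0:
# only p very close to 1 still looks linear out to 24 threads.
```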
 

Bam360

Member
Jan 10, 2019
30
58
61
MT scaling needs to be done at fixed frequency, because on Ryzen it is well known that each CPU uses a power limit that lowers the frequency the higher the core count goes. It is one of the reasons why efficiency looks a bit worse than it really is on Intel desktop (it is still much lower, of course). A fixed all-core turbo is not a good idea either; some workloads consume more power than others.
 

Doug S

Platinum Member
Feb 8, 2020
2,201
3,405
136
Just wanted to circle back around here, because as Apple's SoC designs start to expand into double-digit core ranges, the validity of Cinebench as a parallel-processing comparison tool will certainly come up time and time again. […]


How much memory does CB20 read/write in a run? A benchmark without much in the way of data dependencies tends to be limited primarily by memory bandwidth.

Do you have memory bandwidth benchmarks for 1 and 12 cores of a 5600X and 1 and 32 cores of a 5950X as tested in the above benchmark to see how that scales? If we had those figures, along with a measurement of the amount of memory traffic in the CB20 benchmark we could determine how much memory bandwidth is responsible for limiting scaling. Somewhere I have a link to that information for SPEC2017, I'll have to see if I can dig that up.
 

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
I can't believe you went through all of this analysis from such a fundamentally flawed basis. […]
1, 2, 5: Correct, which is why at the end I mentioned that we really need more tests.

3, 4: Correct. I was incorrect in my statement, and updated my post - benefit with SMT means the task isn't embarrassingly parallel. As a result, the (true) point you make here means that CB isn't embarrassingly parallel. So does the fact that CB isn't synchronous (some buckets will still be running while many cores/threads sit idle at the end of the test; some cores will complete more tasks than other cores; and so on).

Now that you've established that CB isn't embarrassingly parallel, we can move on to whether CB results will exhibit linear improvement in performance by virtue of the test itself. Because you have established that CB is not embarrassingly parallel (that is, it is not 100% parallel), performance gains with added threads must follow a non-linear curve.

[Chart: Amdahl's law speedup curves for varying parallel fractions]

Except for a parallel portion of 0% or 100%, all lines are non-linear.

In the range of 1-24 threads on the Xeon in question, then, per Amdahl's law, the performance gain from CB20 alone can only seem linear; it cannot actually be linear. Therefore, if it looks linear, since you have already established that the test is not embarrassingly parallel, the only explanation for the apparent linearity is 1) optical illusion, 2) noise in the data, 3) CPU/system issues/effects, or 4) some other problem not related to CB20.
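(Plugging illustrative numbers into Amdahl's law, with n = 24 as in the Xeon example:)

```latex
S(n) = \frac{1}{(1-p) + p/n}, \qquad
S(24)\big|_{p=0.95} = \frac{1}{0.05 + 0.95/24} \approx 11.2, \qquad
S(24)\big|_{p=0.999} \approx 23.5
```

At exactly 95% parallel, the 24-thread speedup is only ~11x, so a curve that visually tracks the linear ideal out to 24 threads implies a parallel fraction well above 99% - yet, per the argument above, it still cannot be exactly linear.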
 

Heartbreaker

Diamond Member
Apr 3, 2006
4,222
5,224
136
As I said before, a benchmark coded on an embarrassingly parallel problem need not be 100% parallel. There will always be some small amount of overhead preventing real-world algorithms from reaching 100%.

We can see from the results that it is clearly above 95% parallel.
 

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
As I said before, a benchmark coded on an embarrassingly parallel problem, need not be 100% parallel. There will always be some small amount of overhead preventing real world algorithms from reaching 100%.

We can see from the results that it is clearly above 95% parallel.
And because it is not 100% parallel, at no point on the graph could the gains be linear as a function of the workload in question, CB20. It is not embarrassingly parallel; it is nearly embarrassingly parallel (which is a clearly defined term). Thus, according to Amdahl's law, the relationship is non-linear.
 

Doug S

Platinum Member
Feb 8, 2020
2,201
3,405
136
Happened upon a wafer cost table at sparrownews.com that they claimed came from a "semiconductor industry insider" somewhere on Twitter. I can't vouch for its accuracy, but plugging in 100 mm^2 outputs ~600 chips per wafer. The $17K cost per TSMC N5 wafer then comes out to a little under $30 per chip (just ignore the sale price per chip shown; that's obviously a figure for a very large chip).

Obviously yield isn't 100%, but with the A14's die size at 88 mm^2, it seems like the estimates that peg the SoC price around $30 are on the money unless yield is below 80%. That would be the "what Apple pays TSMC per chip" price and would not include dicing, packaging, testing, or fixed design/mask costs (though those fixed costs are pretty low when Apple will need about 200 million over the next few years; that cost is a lot higher for those with more modest unit volume).

The M1 is 119 mm^2, or a bit under 500 chips per wafer - if yields are in the mid-80s or better, that works out to a per-chip cost of about $40. That shows the $100+ numbers some were bandying about are crazy, unless you believe TSMC N5 yields are truly awful. The per-unit fixed cost would be higher here, though if it shares a die with the A14X, as I believe it does, it isn't that bad. The higher-end "8+4" or even bigger parts obviously have a larger and larger fixed-cost component as the volume gets smaller.
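(A minimal sketch of the same arithmetic; the dies-per-wafer formula is the standard 300mm-wafer approximation, and 85% yield is an assumption:)

```python
import math

# Standard dies-per-wafer estimate for a 300 mm wafer.
def dies_per_wafer(die_mm2: float, wafer_d_mm: float = 300.0) -> int:
    r = wafer_d_mm / 2
    return int(math.pi * r * r / die_mm2
               - math.pi * wafer_d_mm / math.sqrt(2 * die_mm2))

def cost_per_good_die(die_mm2: float, wafer_cost: float = 17000,
                      yield_frac: float = 0.85) -> float:
    return wafer_cost / (dies_per_wafer(die_mm2) * yield_frac)

print(dies_per_wafer(88), cost_per_good_die(88))    # ~732 dies, ~$27 (A14)
print(dies_per_wafer(119), cost_per_good_die(119))  # ~533 dies, ~$38 (M1)
```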

[Image: wafer cost table]
 

name99

Senior member
Sep 11, 2010
404
303
136
Happened upon a wafer cost table at sparrownews.com that they claimed came from a "semiconductor industry insider" somewhere on Twitter. […]
Yes, $40 for A14 (just the SoC, not including the DRAM also on the package) matches other iPhone BOM estimates.
 

Doug S

Platinum Member
Feb 8, 2020
2,201
3,405
136
I don't recall which thread, but I remember seeing some speculation that Apple's GPU in the M1 not only had double the cores but also double the ALU width. Based on this, that seems unlikely: the GPU in the M1 is only 2.1x larger despite having twice the cores. Given that the NPU is listed as 0.9x the size in the M1, I'm guessing there is a margin of error in their measurement, so the GPU might be exactly 2.0x larger.

https://www.techinsights.com/blog/two-new-apple-socs-two-market-events-apple-a14-and-m1