Question Bloomberg: Apple testing SoCs with 16 and 32 high performance cores


mikegg

Golden Member
Jan 30, 2010
1,755
411
136
The current M1 chip has four high-performance processing cores and four power-saving cores. For its next generation chip targeting MacBook Pro and iMac models, Apple is said to be working on designs with as many as 16 power cores and four efficiency cores.

Apple is also reportedly testing a chip design with as many as 32 high-performance cores for higher-end desktop computers planned for later in 2021, as well as a new half-sized Mac Pro planned to launch by 2022.

 

Mopetar

Diamond Member
Jan 31, 2011
7,831
5,980
136
I'm still not sold on the "wider is better" argument.

Pontiac was telling us this over two decades ago. Get with the times, man!


More seriously though, it's obvious you could push the concept too far (see Netburst or Bulldozer with clock speeds) and not see the kind of performance returns that Apple/AMD have gotten from making their own architectures a little bit wider than previously, but part of it is going to be creating a balanced design that keeps utilization of all of your hardware as high as possible.

It isn't just that "wider is better" in some kind of isolation, but likely as a result of improvements to other parts of the chip that make it able to keep more ALUs, etc. fed with instructions and data. If you don't go wider, you're just leaving potential performance on the table and holding back the other parts of the chip that can perform better, but offer no real benefit if they're being bottlenecked by something else.

Start looking at all of the improvements in things like branch prediction over the years, along with the much larger caches that most chips have, and tell me that those wouldn't make a wider design a good idea. I'm sure if you had tried to do that before all of those improvements occurred, the wider design would have been pointless without the ability to feed the additional execution units. Wider just happens to create a more balanced CPU, and that's what makes wider better.
 
  • Like
Reactions: Tlh97 and Carfax83

naukkis

Senior member
Jun 5, 2002
705
576
136
The reason Intel was so dominant in gaming is not higher clocks; it was almost exclusively memory latency. Memory performance is part of the IPC equation, and Intel won against Zen2 even at the same clocks; there were some tests using fixed clocks.
So what I was saying is simply that IPC will be lower in workloads that put 256-bit to good use. It is not because of lower clocks, it is because the architecture doesn't feature modern 256-bit registers; when a future architecture does, it will have that IPC advantage like it does in other workloads.

256-bit vectors aren't so useful. Gaming won't benefit from them at all - 3D calculations don't scale beyond 128-bit vectors. But for a hardware implementation it's much easier to double the vector length than to double the execution and load/store units. Apple has chosen the harder way to expose vector-calculation optimizations - and because of that their performance versus theoretical FLOPS scales much better than their x86 rivals'.
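To make the 128-bit point concrete, here is a minimal sketch (illustrative only, not code from any real engine): a single four-component position update fills exactly one 128-bit SSE register, so a naive per-vertex loop leaves a 256-bit unit half empty unless the data is repacked.

```c
/* Minimal sketch: one xyzw position fits exactly in one 128-bit register,
 * so a straightforward per-vertex update gains nothing from wider vectors
 * unless the data layout is changed. Names are purely illustrative. */
#include <xmmintrin.h>  /* SSE */

/* Advance one position (x, y, z, w) by velocity * dt. */
static void step_vertex(float pos[4], const float vel[4], float dt)
{
    __m128 p = _mm_loadu_ps(pos);           /* load 4 floats = 128 bits  */
    __m128 v = _mm_loadu_ps(vel);
    __m128 d = _mm_set1_ps(dt);             /* broadcast dt to all lanes */
    p = _mm_add_ps(p, _mm_mul_ps(v, d));    /* pos += vel * dt           */
    _mm_storeu_ps(pos, p);                  /* store 4 floats back       */
}
```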
 

Bam360

Member
Jan 10, 2019
30
58
61
256-bit vectors aren't so useful. Gaming won't benefit from them at all - 3D calculations don't scale beyond 128-bit vectors. But for a hardware implementation it's much easier to double the vector length than to double the execution and load/store units. Apple has chosen the harder way to expose vector-calculation optimizations - and because of that their performance versus theoretical FLOPS scales much better than their x86 rivals'.

The vast majority of workloads don't benefit from 256-bit vectors, agreed, but HEVC is one of the workloads that do, and that is what I was discussing, not whether it's a good or bad idea for Apple to use 256-bit vectors. In fact, one could argue it is a better idea to use GPU accelerators for video encoding, but we are discussing pure CPU performance here and trying to explain some of the odd results we are seeing and why they happen, aside from bad optimization or hand-tuned code.
 

Hulk

Diamond Member
Oct 9, 1999
4,214
2,006
136
This is great. I don't use Apple products. I go back to DOS so the Apple tech confounds me. I can't stand it. But I know people love it and they make great stuff.

The "great" thing about this is Apple, like AMD and the other semi-conductor manufacturers that have become large players over the last 10 or 15 years aren't doling out tiny tech upgrades every 3 or 4 years like we are used to from Intel when they ruled the market. Now there is real competition!

Intel held back higher core count desktop CPUs for years. AMD and Apple are like, BAM! You want 16 cores on the desktop? Here you go. How about 24 or 32 cores? Now Apple comes along with the possibility of 8 or even 16 core laptop parts. Intel must be thinking "They're ruining our humongous profit margins! They're ruining it for all of us! Why? Why? Why can't they eat cake (dual and quad cores)?"

Love this.
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
The reason Intel was so dominant in gaming is not higher clocks; it was almost exclusively memory latency. Memory performance is part of the IPC equation, and Intel won against Zen2 even at the same clocks; there were some tests using fixed clocks.
So what I was saying is simply that IPC will be lower in workloads that put 256-bit to good use. It is not because of lower clocks, it is because the architecture doesn't feature modern 256-bit registers; when a future architecture does, it will have that IPC advantage like it does in other workloads.

Memory latency is very important for games, but it's just one of a quadrafecta: clock speed, memory latency, intercore latency, and big caches. All four are important. If memory latency is the most important, almost to the point of exclusivity as you claim, how do you explain Zen 3, which has higher memory latency than the Core i9? Or how do you explain Intel's original Core series, which had higher memory latency than the Athlon 64?

As far as 256-bit computing goes, I'm pretty sure that the AVX/AVX2 units have their own registers which can accommodate 256-bit instructions, although the only use for such instructions in gaming is in physics.
 

naukkis

Senior member
Jun 5, 2002
705
576
136
Memory latency is very important for games, but it's just one of a quadrafecta: clock speed, memory latency, intercore latency, and big caches. All four are important. If memory latency is the most important, almost to the point of exclusivity as you claim, how do you explain Zen 3, which has higher memory latency than the Core i9? Or how do you explain Intel's original Core series, which had higher memory latency than the Athlon 64?

Effective memory latency isn't memory latency itself but the average latency of all memory operations. Bigger caches, better prefetchers and so on will reduce effective memory latency. Both Zen3 and Intel Core had way more cache than their rivals at the time.
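As a rough back-of-the-envelope version of that point (the hit rates and latencies below are invented for illustration, not measurements of any real chip):

```latex
% Effective (average) latency over all memory operations, for a single
% cache level with hit rate h. Illustrative numbers: t_cache = 12 ns,
% t_DRAM = 80 ns.
\[
  t_{\mathrm{eff}} = h \, t_{\mathrm{cache}} + (1 - h) \, t_{\mathrm{DRAM}}
\]
\[
  h = 0.90:\quad 0.90 \cdot 12 + 0.10 \cdot 80 = 18.8\ \mathrm{ns}
  \qquad
  h = 0.95:\quad 0.95 \cdot 12 + 0.05 \cdot 80 = 15.4\ \mathrm{ns}
\]
```

A bigger cache or a better prefetcher raises the hit rate, which lowers the average even though the raw DRAM latency hasn't changed.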
 
  • Like
Reactions: Tlh97 and Carfax83

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
More seriously though, it's obvious you could push the concept too far (see Netburst or Bulldozer with clock speeds) and not see the kind of performance returns that Apple/AMD have gotten from making their own architectures a little bit wider than previously, but part of it is going to be creating a balanced design that keeps utilization of all of your hardware as high as possible.

I agree, a balanced architecture is optimal, and I personally don't view the M1 as a balanced architecture. It is extremely wide, and is probably going to suffer in low-IPC workloads like gaming as a result if Apple ever decides to scale it to more cores and put it in Mac Pro desktops. It's one thing to utilize an architecture like that when Apple controls everything from top to bottom in a walled-off environment, and another when you have Intel and AMD, which operate on much more open platforms.

Also, modern x86-64 CPUs are wider than many people think. They use micro-op caches to increase the rate of execution without adding more decoders. In this regard, Zen 3's op cache can deliver 8 macro-ops per cycle even though dispatch remains 6 wide, according to AnandTech:

Bypassing the decode stage through a structure such as the Op-cache is nowadays the preferred method to solve this issue, with the first-generation Zen microarchitecture being the first AMD design to implement such a block. However, such a design also brings problems, such as one set of instructions residing in the instruction cache, and its target residing in the OP-cache, again whose target might again be found in the instruction cache. AMD found this to be a quite large inefficiency in Zen2, and thus evolved the design to better handle instruction flows from both the I-cache and the OP-cache and to deliver them into the µOP-queue. AMD’s researchers seem to have published a more in-depth paper addressing the improvements.
On the dispatch side, Zen3 remains a 6-wide machine, emitting up to 6-Macro-Ops per cycle to the execution units, meaning that the maximum IPC of the core remains at 6. The Op-cache being able to deliver 8 Macro-Ops into the µOp-queue would serve as a mechanism to further reduce pipeline bubbles in the front-end – as the full 8-wide width of that structure wouldn’t be hit at all times.
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
256-bit vectors aren't so useful. Gaming won't benefit from them at all - 3D calculations don't scale beyond 128-bit vectors. But for a hardware implementation it's much easier to double the vector length than to double the execution and load/store units. Apple has chosen the harder way to expose vector-calculation optimizations - and because of that their performance versus theoretical FLOPS scales much better than their x86 rivals'.

Actually, modern games use 256-bit vectors quite often, mostly for physics and destruction effects.
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
Effective memory latency isn't memory latency itself but the average latency of all memory operations. Bigger caches, better prefetchers and so on will reduce effective memory latency. Both Zen3 and Intel Core had way more cache than their rivals at the time.

No disagreement here. But when you look at the overall latency to RAM, Intel's CPUs have about 10-15ns lower latency than Zen 3 depending on memory frequency and sub timings, which is huge. In practice though, Zen 3's massive L3 cache really does wonders to lower effective memory latency in a big way.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,764
3,131
136
256-bit vectors aren't so useful. Gaming won't benefit from them at all - 3D calculations don't scale beyond 128-bit vectors. But for a hardware implementation it's much easier to double the vector length than to double the execution and load/store units. Apple has chosen the harder way to expose vector-calculation optimizations - and because of that their performance versus theoretical FLOPS scales much better than their x86 rivals'.
This is kind of right, but also all kinds of wrong... lol

First, as far as I'm aware Firestorm has 4 L/S ports and Zen 3 has 4 L/S ports.
Second, you can do xyza vectors etc. with 256-bit ops, you just have to write your code to do it: bundle two together as xyzaxyza, or bundle 8 together and do xxxxxxxx yyyyyyyy zzzzzzzz aaaaaaaa (sketched below).
Third, Zen3 has 6 FP pipelines (really more like 4) and Apple has 4; what's different is the configuration of each of those pipes.

Don't be surprised if a lot of Firestorm's performance advantage comes down to L1I size, L1D size, and ROB size.
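As a rough sketch of the second point (illustrative only, not code from any real engine; it assumes AVX with FMA support): the same "position += velocity * dt" update written for 256-bit registers, once with two xyza bundles per register (AoS) and once with eight of a single component per register (SoA).

```c
/* Sketch of packing a vec4 workload into 256-bit registers.
 * Names and layout are illustrative; remainder elements are not handled. */
#include <immintrin.h>   /* AVX + FMA */

/* AoS: each 256-bit register holds two xyza bundles -> xyzaxyza. */
static void step_aos(float *pos, const float *vel, float dt, int n_vec4)
{
    __m256 d = _mm256_set1_ps(dt);
    for (int i = 0; i + 2 <= n_vec4; i += 2) {        /* 2 vec4s per iteration */
        __m256 p = _mm256_loadu_ps(pos + 4 * i);
        __m256 v = _mm256_loadu_ps(vel + 4 * i);
        _mm256_storeu_ps(pos + 4 * i, _mm256_fmadd_ps(v, d, p));  /* p + v*dt */
    }
}

/* SoA: separate component arrays -> xxxxxxxx, eight lanes at a time. */
static void step_soa(float *x, const float *vx, float dt, int n)
{
    __m256 d = _mm256_set1_ps(dt);
    for (int i = 0; i + 8 <= n; i += 8)               /* 8 elements per iteration */
        _mm256_storeu_ps(x + i, _mm256_fmadd_ps(_mm256_loadu_ps(vx + i), d,
                                                _mm256_loadu_ps(x + i)));
}
```

In the SoA form you would repeat the same loop for the y, z, and a arrays; that repacking is the "you just have to write your code to do it" part.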
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
^^ Regardless of whether software encoding is niche or not, if Apple really does have plans to scale the M1 (and its successors) architecture up to 16 and 32 cores, that's no longer "regular consumer" territory. Those products will be marketed and sold to prosumers and power users of all stripes. So the argument is valid nonetheless.

Also, software encoding is a heavy workload for a CPU and a good criterion of how powerful it is, so I don't see why it wouldn't be worthwhile to discuss it.
 

Entropyq3

Junior Member
Jan 24, 2005
22
22
81
So what? That's never stopped us before. It's an enthusiast forum. We're part of the reason why PC reviewers include software encode benchmarks when reviewing new CPUs.
It's also why drawing conclusions from such benchmarks about the general market is simply wrong.

And their presence in reviews is a problem, because the largest issue in benchmarking is relevance. To what extent can I use the benchmark results to predict my own results as a user? When a review contains benchmarks that are irrelevant to the overwhelming majority of users, it creates a false image of what to expect from new hardware.

Which, let's be honest, is often the point: the consumer-oriented tech press mostly acts as advertising for the industry it covers.
 

naukkis

Senior member
Jun 5, 2002
705
576
136
This is kind of right, but also all kinds of wrong... lol

First, as far as I'm aware Firestorm has 4 L/S ports and Zen 3 has 4 L/S ports.

Apple's M1 can do 4 128-bit loads/stores per cycle, of which at most two can be stores. AMD's Zen3 can do 3 64-bit integer loads/stores, of which two can be stores. For 256-bit registers it can do 2 load/store ops, of which one can be a store. So bandwidth per clock is just equal to the M1's; the M1 can achieve the same throughput with half-length registers. Zen3's capability with 128-bit registers is a bit unknown; some sources say it could do the same as integer, others that it's limited to the same as 256-bit ops.
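Plugging in the per-cycle figures above (taking them as stated in the post rather than as verified specs):

```latex
% Peak load/store bandwidth per clock, using the figures quoted above.
\[
  \text{M1:}\quad 4 \times 128\ \text{bit} = 512\ \text{bit/cycle}
  \qquad
  \text{Zen 3 (256-bit regs):}\quad 2 \times 256\ \text{bit} = 512\ \text{bit/cycle}
\]
```

Same peak bits per clock, but the M1 reaches it with half-width registers.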
 

Doug S

Platinum Member
Feb 8, 2020
2,252
3,483
136
^^ Regardless of whether software encoding is niche or not, if Apple really does have plans to scale the M1 (and its successors) architecture up to 16 and 32 cores, that's no longer "regular consumer" territory. Those products will be marketed and sold to prosumers and power users of all stripes. So the argument is valid nonetheless.

Also, software encoding is a heavy workload for a CPU and a good criterion of how powerful it is, so I don't see why it wouldn't be worthwhile to discuss it.

Well Apple obviously has plans to scale to that territory, because they have to for the Mac Pro / iMac Pro. The only unknown is whether it is done via chiplets or monolithic designs.

As you say, the Mac Pro is well above consumer territory, but comparing what the M1 does for encoding today, when only low-end ARM Macs have been released (and only a month ago), to what x86 does when running code that has been tuned for HEVC encoding for years is pointless. Until the sort of Macs that people who need to do a lot of encoding actually buy are available with an Apple Silicon SoC, there is little incentive for developers to bother optimizing for it.

It also isn't clear that using the CPU for HEVC encoding will even make sense on a Mac Pro. If it can do it faster with its GPU, that's how it'll get done. But I expect people would whine and claim that's cheating, and that a "fair test" would require it only be done on a CPU (but those same people will also claim hand-tuned SIMD assembly using instructions not available across the whole x86 line is fair game).
 
  • Like
Reactions: insertcarehere

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
As you say, the Mac Pro is well above consumer territory, but comparing what the M1 does for encoding today, when only low-end ARM Macs have been released (and only a month ago), to what x86 does when running code that has been tuned for HEVC encoding for years is pointless. Until the sort of Macs that people who need to do a lot of encoding actually buy are available with an Apple Silicon SoC, there is little incentive for developers to bother optimizing for it.

I can't say I disagree that it's not a fair comparison, but based on what @jeanlain said in post #38, the optimizations that Apple and another contributor included in HandBrake 1.4 already increase performance substantially for the M1, yet it's still a significant way off from Zen 3.

Makes me wonder how much performance is left on the table, and whether that gap could ever be bridged by the M1, at least in its current form. Perhaps a successor with SVE/SVE2 would do much better.

It also isn't clear that using the CPU for HEVC encoding will even make sense on a Mac Pro. If it can do it faster with its GPU, that's how it'll get done.

Using the GPU itself for encoding is a bad idea. Both Nvidia and AMD use fixed-function hardware to accelerate encoding. Neither of them uses the GPU proper, because the GPU produces terrible quality.

Also, the CPU will always have its uses for encoding (especially offline) because it can produce the highest quality per bitrate. It's similar in a way to big rendering farms, the vast majority of which use CPUs and not GPUs because the former allow for higher quality, more options, more flexibility and way more memory than the latter, even though they're slower.

But I expect people would whine and claim that's cheating, and that a "fair test" would require it only be done on a CPU (but those same people will also claim hand-tuned SIMD assembly using instructions not available across the whole x86 line is fair game).

Well I personally wouldn't call it cheating, because the ASIC is fulfilling its purpose. But if I am debating CPU performance, then there's no reason for me to look at the ASIC performance.
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
"SPEC isn't real, use Cinebench" has clearly morphed into "Cinebench isn't real, only x265 counts."

Cool cool.

Personally I don't care for Cinebench, as it doesn't really make heavy use of SIMD from what I have read. I remember seeing a comment from Andrei F. that having AVX/AVX2 turned on or off really didn't affect the score much. But with Blender, AVX optimization is much more powerful and makes a huge difference.

Generally speaking though, I like to see real-world comparisons rather than canned benchmarks, because they better represent the experience a user will have, and that's why I liked that Linus used real-world tests when he was able. When Blender becomes optimized for ARM64, we should expect to see comparisons there as well.

As far as SPEC goes, if you frequent the RWT forums much, you'll see that it gets plenty of criticism. A couple of good examples: apparently the "Blender" test in SPECfp 2017 uses a very small working set which can run almost entirely out of the L1 cache.

Source

Now you have to wonder whether that sub-test is representative of an actual real-world workload using Blender. I'm not an expert by any means, but I'm going to say it probably isn't, because a real-world workload is going to be much larger. Also, another RWT user said that the x264 test in SPEC 2017 does not use SIMD assembly, which in his opinion makes it completely worthless and not representative of the x264 codec. And he makes a good point, because x264 absolutely relies on integer SIMD to perform well, and autovectorization from the compiler is apparently nowhere near as good as hand-written assembly.

Source

So like I said, SPEC has plenty of critics among the professionals.
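For anyone wondering why hand-written integer SIMD matters so much for x264-class code, here is a rough sketch (my own illustration, not x264's actual source): the core motion-estimation primitive, a sum of absolute differences over a row of pixels, maps to a single SSE2 instruction per 16 pixels, while the scalar form does one pixel per iteration and has historically been hard for compilers to autovectorize this well.

```c
/* Illustrative only: SAD over a 16-byte row, the kind of inner loop that
 * motion estimation hammers millions of times per frame. */
#include <emmintrin.h>   /* SSE2 */
#include <stdint.h>
#include <stdlib.h>

static unsigned sad16_scalar(const uint8_t *a, const uint8_t *b)
{
    unsigned sum = 0;
    for (int i = 0; i < 16; i++)                 /* one pixel per iteration */
        sum += (unsigned)abs((int)a[i] - (int)b[i]);
    return sum;
}

static unsigned sad16_sse2(const uint8_t *a, const uint8_t *b)
{
    __m128i va  = _mm_loadu_si128((const __m128i *)a);
    __m128i vb  = _mm_loadu_si128((const __m128i *)b);
    __m128i sad = _mm_sad_epu8(va, vb);          /* 16 pixels, one instruction;
                                                    two partial sums, one per 8 bytes */
    return (unsigned)(_mm_cvtsi128_si32(sad) +
                      _mm_cvtsi128_si32(_mm_srli_si128(sad, 8)));
}
```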
 

Nothingness

Platinum Member
Jul 3, 2013
2,402
733
136
As far as SPEC goes, if you frequent the RWT forums much, you'll see that it gets plenty of criticism. A couple of good examples: apparently the "Blender" test in SPECfp 2017 uses a very small working set which can run almost entirely out of the L1 cache.

Source

Now you have to wonder whether that sub-test is representative of an actual real-world workload using Blender. I'm not an expert by any means, but I'm going to say it probably isn't, because a real-world workload is going to be much larger. Also, another RWT user said that the x264 test in SPEC 2017 does not use SIMD assembly, which in his opinion makes it completely worthless and not representative of the x264 codec. And he makes a good point, because x264 absolutely relies on integer SIMD to perform well, and autovectorization from the compiler is apparently nowhere near as good as hand-written assembly.

Source

So like I said, SPEC has plenty of critics among the professionals.
None of the guys who criticized SPEC have access to it or know a lot about it. As an example, the link you give proves Chester doesn't know what he is talking about:
I also looked up what SPECfp 2017 does with Blender. They render a 'reduced version' of a data set at 320x200.
He thinks the training input is what is being run. The mistake was spotted in the message just after the one you linked.
 
  • Like
Reactions: Leftillustrator

DrMrLordX

Lifer
Apr 27, 2000
21,620
10,829
136
"SPEC isn't real, use Cinebench" has clearly morphed into "Cinebench isn't real, only x265 counts."

Since when? Cinebench (and POVray, and c-ray, and Blender, and any number of other rendering workloads) are perfectly valid and interesting benchmarks. I'm not seeing many arguments here against Cinebench specifically . . .
 
  • Like
Reactions: Tlh97

Bam360

Member
Jan 10, 2019
30
58
61
The argument against Cinebench is about using it as a general CPU performance test. Many people do this: they compare Cinebench scores alone to conclude "this CPU has X% more IPC than the other one" or "this CPU has Y% more single-threaded performance", and then when other real-world tests come along and performance is more disappointing, like much worse gaming performance in comparison, they don't understand what is happening.
Cinebench is perfectly fine for assessing overall rendering performance, but not much more, really.
 

Mopetar

Diamond Member
Jan 31, 2011
7,831
5,980
136
The vast majority of workloads don't benefit from 256-bit vectors, agreed, but HEVC is one of the workloads that do, and that is what I was discussing, not whether it's a good or bad idea for Apple to use 256-bit vectors. In fact, one could argue it is a better idea to use GPU accelerators for video encoding, but we are discussing pure CPU performance here and trying to explain some of the odd results we are seeing and why they happen, aside from bad optimization or hand-tuned code.

Why add support for 256-bit vectors in the general CPU cores when it's an SoC and can contain a dedicated bit of silicon to do encoding more efficiently than regular CPU cores could hope to do it?

You either spend space on the dedicated hardware or on adding the capabilities to the CPU cores. In the end the dedicated hardware performs better and likely doesn't cost much, particularly if you focus on energy savings as opposed to strictly the amount of transistors required.
 

Doug S

Platinum Member
Feb 8, 2020
2,252
3,483
136
Why add support for 256-bit vectors in the general CPU cores when it's an SoC and can contain a dedicated bit of silicon to do encoding more efficiently than regular CPU cores could hope to do it?

You either spend space on the dedicated hardware or on adding the capabilities to the CPU cores. In the end the dedicated hardware performs better and likely doesn't cost much, particularly if you focus on energy savings as opposed to strictly the amount of transistors required.

And we know the A14 and therefore M1 has dedicated blocks able to very efficiently perform realtime HEVC encoding of 4K video, but no faster because that's all it needs to do. If someone has a need to bulk encode a few hours of 4K video, do they necessarily have a reason to care whether that takes "a few hours" or say 10 minutes?

HEVC encoding may be something a certain segment considers important, but how many of them need it immediately rather than being able to do something else while it crunches? Put another way, how many people's usage model requires HEVC encoding of more than 24 hours of video per day? Those people will need something beefier, but the casual user could use fixed function hardware and kick off the encoding of their 'home movies' before they go to bed and not care how long it takes.

There is obviously a case for wider vectors at the high end like the Mac Pro, though we're all just assuming that the A14/M1 doesn't already have any such capability. Apple talked about something they called AMX (their matrix extensions) when they announced the A13, but we still aren't exactly sure what it is: whether it is part of the CPU (and if so, whether it's just their own name for SVE2), a renaming of a capability of another block like the NPU, or some new block that lets Apple do their own thing rather than what ARM defines with SVE2.

Anyway, assuming AMX doesn't help with run of the mill FP number crunching, and SVE2 is not already hiding in there somewhere, they could easily add it to the A15 or A16 cores that will be used in bigger Macs. Perhaps even with different widths, with the mobile CPUs supporting only 128 bit vectors and the Mac supporting wider either by having more hardware or by having some of it left unused in an iPhone to save power. There was no need for wider vectors in the A14 cores, because they were never going to be used in high end Macs, so we shouldn't be surprised at their absence.
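If Apple ever did expose SVE2, the "different widths" idea would be fairly natural, because SVE code is written to be vector-length agnostic. A minimal sketch using the standard ARM C Language Extensions for SVE follows; nothing in it is specific to any Apple product, and whether Apple ships SVE2 at all is pure speculation.

```c
/* Hypothetical sketch: the same loop runs unchanged whether the hardware
 * vector is 128 bits or wider, because the width is queried at run time. */
#include <arm_sve.h>

static void scale_array(float *x, float s, int64_t n)
{
    for (int64_t i = 0; i < n; i += svcntw()) {        /* svcntw() = floats per vector */
        svbool_t pg = svwhilelt_b32_s64(i, n);          /* predicate covers the tail   */
        svfloat32_t v = svld1_f32(pg, x + i);
        svst1_f32(pg, x + i, svmul_f32_x(pg, v, svdup_n_f32(s)));
    }
}
```

The same binary would use 128-bit vectors on a phone-class core and wider vectors on a hypothetical Mac Pro core without recompiling.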
 

wlee15

Senior member
Jan 7, 2009
313
31
91
And we know the A14 and therefore M1 has dedicated blocks able to very efficiently perform realtime HEVC encoding of 4K video, but no faster because that's all it needs to do. If someone has a need to bulk encode a few hours of 4K video, do they necessarily have a reason to care whether that takes "a few hours" or say 10 minutes?

HEVC encoding may be something a certain segment considers important, but how many of them need it immediately rather than being able to do something else while it crunches? Put another way, how many people's usage model requires HEVC encoding of more than 24 hours of video per day? Those people will need something beefier, but the casual user could use fixed function hardware and kick off the encoding of their 'home movies' before they go to bed and not care how long it takes.

It seems to be the consensus that the M1 hardware encoders produce mediocre output compared to software, which is pretty much consistent with most non-professional hardware encoders. The casual user who is going to encode their videos isn't going to get much value going from their camera/phone's hardware encoder to the M1 hardware encoder.
 
  • Like
Reactions: Tlh97 and lobz