Indeed, architecture is loosely reflected at the RTL level, while design is reflected at the gate/cell level.
In my example above, we are using the very same architecture with a different design (e.g. a different cell mix, drivers, netlist), in conjunction with low-power memories for the caches etc.
The achievable frequency range spans at least a factor of 2 between ultra low power (at or below nominal voltage) and high performance (at overdrive voltage) for the very same architecture.
I do not agree that it is overstated; based on your comments, it is understated. It's not just the uOp cache itself, it's the fact that there is a uOp cache at all, with a large miss penalty (e.g. on a pre-decode miss). It's the fact that with variable instruction lengths you cannot start decoding an instruction before knowing the length of the previous one. It's the fact that there are few 3-operand instructions, which requires additional moves (either register-to-register or register-to-memory). It's the fact that you only have 16 registers, which increases the number of memory accesses. It's the fact that memory references are hard to prove independent and cannot be subject to register renaming during OoO scheduling. It's the fact that the return address is stored on the stack instead of in a register. It's the fact that far calls use segment descriptors to access the GDT or LDT... I could go on and on. Most of these "features" were much less of a problem in the 20th century, but today they both limit efficiency and set an upper bound on the realistically achievable IPC.
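To make the variable-length point concrete, here is a toy sketch. The length encoding below is invented for illustration (it is not real x86, where length depends on prefixes, opcode, ModRM, etc.), but it shows the structural issue: each instruction boundary depends on decoding the previous instruction, so boundary-finding is a serial chain, whereas a fixed-width ISA knows every boundary up front.

```python
# Toy encoding (an assumption for illustration, NOT real x86):
#   first byte 0x00-0x7F -> 1-byte instruction
#   first byte 0x80-0xBF -> 2-byte instruction
#   first byte 0xC0-0xFF -> 4-byte instruction
def insn_length(first_byte: int) -> int:
    if first_byte < 0x80:
        return 1
    if first_byte < 0xC0:
        return 2
    return 4

def decode_variable(code: bytes) -> list[tuple[int, int]]:
    """Find (offset, length) of each instruction sequentially.

    The start of instruction i+1 is unknown until instruction i has been
    (at least partially) decoded -- this serial dependency is what
    pre-decode hardware and the uOp cache exist to hide.
    """
    boundaries, pc = [], 0
    while pc < len(code):
        n = insn_length(code[pc])
        boundaries.append((pc, n))
        pc += n
    return boundaries

def decode_fixed(code: bytes, width: int = 4) -> list[tuple[int, int]]:
    """With a fixed width, all boundaries are known immediately, so every
    instruction in the fetch block can go to a decoder in parallel."""
    return [(pc, width) for pc in range(0, len(code), width)]
```

Real decoders speculate on likely boundaries and keep pre-decode bits in the instruction cache, but the fundamental serial dependency shown in `decode_variable` is what they are working around.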