Apple A12 benchmarks


Charlie22911

Senior member
Mar 19, 2005
614
228
116
Obviously Apple would almost certainly use off-package RAM in these scenarios, as they have with other X-class CPUs, but I wouldn't count the A12 as competitive with recent x86 CPUs when it can't sustain equivalent workloads without throttling.

Well, nobody in their right mind would suggest Apple would use the exact same design from phones up to desktops.

I agree with you. I'm not good at multitasking, though, so admittedly my responses are poorly constructed. Since this thread was about the A12 (not the A12X), my response was aimed at serious comparisons between the A12 and Intel ***Lake cores.
 

oak8292

Member
Sep 14, 2016
82
67
91
This is a fine point, but the PoP packaging on the A12 uses a process developed by TSMC called InFO with TIV (through-insulator via) to reduce wire lengths and improve thermals. I was really disappointed when AnandTech did not do a full-blown analysis when Apple first utilized InFO packaging. According to TSMC, the A12 should throttle less than with a traditional wire-based PoP package. It is a compromise 3D packaging technique with lower cost than full-blown TSV stacking.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
We already know that the first post in this thread pointed to a fake result, since Apple claims a 15% perf improvement :)

Not really?

The original source said that while it was 25% faster, they were dealing with significantly increased power usage. So 25% may have been possible, but they reduced clocks to bring it down to 15% and cut power use to an acceptable level.
 

Nothingness

Platinum Member
Jul 3, 2013
2,409
739
136
Not really?

The original source said that while it was 25% faster, they were dealing with significantly increased power usage. So 25% may have been possible, but they reduced clocks to bring it down to 15% and cut power use to an acceptable level.
Fair point :)

I guess I'm too suspicious about those leaks in general...
 

tynopik

Diamond Member
Aug 10, 2004
5,245
500
126
It's been years, but I was under the impression that Geekbench put a large weight on AES performance and Apple had a dedicated AES coprocessor that 'artificially' boosted performance while Intel didn't.

Maybe the situation has changed since then, but that was my recollection of how a tiny low-power processor was seemingly matching monster desktop CPUs.
 

jpiniero

Lifer
Oct 1, 2010
14,591
5,214
136
It's been years, but I was under the impression that Geekbench put a large weight on AES performance and Apple had a dedicated AES coprocessor that 'artificially' boosted performance while Intel didn't.

Maybe the situation has changed since then, but that was my recollection of how a tiny low-power processor was seemingly matching monster desktop CPUs.

https://browser.geekbench.com/v4/cpu/compare/9790044?baseline=9836174

It can be decently competitive with the average desktop in a good number of the tests, except for SGEMM and SFFT on Core. A highly overclocked desktop it won't catch, of course.
 

Nothingness

Platinum Member
Jul 3, 2013
2,409
739
136
It's been years, but I was under the impression that Geekbench put a large weight on AES performance and Apple had a dedicated AES coprocessor that 'artificially' boosted performance while Intel didn't.

Maybe the situation has changed since then, but that was my recollection of how a tiny low-power processor was seemingly matching monster desktop CPUs.
It's not a dedicated coprocessor: AES is part of the ARM instruction set... and now also part of the x86 instruction set.
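(A rough sketch of my own, not from the thread: both ISAs now do an AES round as a single instruction, which is what the Geekbench AES subtest ends up exercising. On x86 it looks like this with intrinsics, assuming an AES-NI capable CPU and compiling with -maes; on ARMv8 the equivalents are AESE/AESMC, reachable through arm_neon.h.)

Code:
// One hardware AES encryption round instead of a software S-box loop.
#include <wmmintrin.h>   // AES-NI intrinsics
#include <stdio.h>

int main(void) {
    __m128i block    = _mm_set1_epi32(0x01234567);   // 128-bit data block (dummy value)
    __m128i roundkey = _mm_set1_epi32(0x0fedcba9);   // one expanded round key (dummy value)
    // SubBytes + ShiftRows + MixColumns + AddRoundKey in a single instruction.
    __m128i out = _mm_aesenc_si128(block, roundkey);
    printf("%016llx\n", (unsigned long long)_mm_cvtsi128_si64(out));
    return 0;
}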
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Apple has brutal L1 caches that help big time in a lot of workloads. 256KB of total code and data cached in L1 is insane and comparable to the L2 size on Intel client CPUs. x86 CPUs cannot make L1 caches larger than 32-64KB because the page size is 4KB, so there is a huge advantage to Apple having full control of the OS.
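(Quick back-of-the-envelope sketch of that page-size constraint; the 8-way associativity here is my assumption, not a confirmed Apple figure. A virtually indexed L1 stays alias-free only while cache size ≤ page size × associativity.)

Code:
#include <stdio.h>

// Largest alias-free virtually indexed L1: page_size * ways.
static long max_alias_free_bytes(long page_size, long ways) {
    return page_size * ways;
}

int main(void) {
    printf("4KB pages,  8-way: %ld KB\n", max_alias_free_bytes(4096, 8) / 1024);   // 32 KB, the typical x86 L1D
    printf("16KB pages, 8-way: %ld KB\n", max_alias_free_bytes(16384, 8) / 1024);  // 128 KB, what iOS's 16KB pages permit
    return 0;
}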

I assume this score is legit, and it is very comparable to what was leaked in the first post; Apple probably had to back down on clocks, but the 4828 ST score would need maybe 5-10% extra clock to reach the 5200 in the leak. Clearly it is the same CPU, as the 11488 MT score is within the same 5-10%.
 
  • Like
Reactions: krumme

Nothingness

Platinum Member
Jul 3, 2013
2,409
739
136
Apple has brutal L1 caches that help big time in a lot of workloads. 256KB of total code and data cached in L1 is insane and comparable to the L2 size on Intel client CPUs. x86 CPUs cannot make L1 caches larger than 32-64KB because the page size is 4KB, so there is a huge advantage to Apple having full control of the OS.
I'm not sure it's a huge advantage. Large L1 caches increase latency, and I'm not convinced that a 128KB Dcache matters that much (Icache is another thing). They'd be better off spending their budget on improving their hardware data prefetchers.

But I guess they found that it was good for their performance.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
The L1 information on Geekbench shouldn't be taken as the final word.

Yeah, but that is exactly the size to "exploit" with 16KB pages; why go through the OS pain and then leave performance on the table?

I'm not sure it's a huge advantage. Large L1 caches increase latency, and I'm not convinced that a 128KB Dcache matters that much (Icache is another thing).

What exactly is the "increase of latency" mechanism here? If we put physical layout "sizing" constraints aside (and on 7nm I suspect they are irrelevant anyway, as the array is probably smaller than AMD's 64KB L1 on 14nm), we are left with the latency constraints of a VIVT design, which has been 3-4 clocks for ages on all designs (things like TLB lookup and way selection happen in parallel, etc.). Except the maximum Intel/AMD can cache in a sane design is 32KB for the data cache, due to 4KB pages limiting the "virtual" part of the VIVT design.
 

Nothingness

Platinum Member
Jul 3, 2013
2,409
739
136
Yeah, but that is exactly the size to "exploit" with 16KB pages; why go through the OS pain and then leave performance on the table?
Yep. I wonder if they went for 8-way caches or used some form of HW alias resolution with replay. 8-way would have a non-negligible power impact.

What exactly is the "increase of latency" mechanism here? If we put physical layout "sizing" constraints aside (and on 7nm I suspect they are irrelevant anyway, as the array is probably smaller than AMD's 64KB L1 on 14nm)
I'd expect going from the A11's 32KB on 14nm to 128KB on 7nm to require one extra cycle of load-to-use latency.

Interestingly, the Apple A9 and A10 had a 64KB Dcache. I wonder why they went to 32KB on the A11.

we are left with the latency constraints of a VIVT design, which has been 3-4 clocks for ages on all designs (things like TLB lookup and way selection happen in parallel, etc.). Except the maximum Intel/AMD can cache in a sane design is 32KB for the data cache, due to 4KB pages limiting the "virtual" part of the VIVT design.
The K8/K10 had a 2-way 64KB Dcache, and IMHO they were sane designs. Mechanisms to handle aliases in HW exist that make the cache look like a VIPT cache.
 

Greyguy1948

Member
Nov 29, 2008
156
16
91
https://m.weibo.cn/status/4236380060065313

~25% improvement in both single-thread and multi-thread benchmarks.

Given the uniform improvement in both, this leads me to believe the A12 may simply be a ~25% higher-clocked (max clock 3GHz) 7nm A11. We don't know if there are any changes in other parts of the SoC, such as the GPU.

I have long wondered when Apple will stop moving from node to node and delivering YoY IPC improvements.

Not so much on GB4:
https://browser.geekbench.com/v4/cpu/compare/7205880?baseline=9816947
The L1 caches have changed from 32KB I + 32KB D to 128KB I + 128KB D, and this is both a plus and a minus...
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Oh, I certainly agree :D But the K8 was still a good chip (my last AMD CPU...).

Yeah, no doubt about it. It reigned supreme, but I feel that in the days of SMT, 2-way is more of a liability than it was during the K8's reign. And of course AMD was helped by the Netburst perversion with its 8KB of L1 (even if the latency was on steroids).
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Latency must be higher (2 clocks?)

Apple's OS has 16KB pages, so a 128KB L1 gives Apple the same "load to use" latency situation as 32KB VIVT caches do for Intel/AMD with 4KB pages.
They can theoretically get away with zero latency increase, and I think the GB4 bench results concur with this.
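(Sketching the bit arithmetic behind that, with 64-byte lines and 8 ways as my own assumptions rather than published figures: a 128KB cache's set index ends at address bit 13, the top of a 16KB page offset, just as a 32KB cache's index ends at bit 11, the top of a 4KB page offset, so in both cases indexing can start in parallel with the TLB lookup.)

Code:
#include <stdio.h>

// Highest address bit used by the set index, assuming 64-byte lines.
static int index_top_bit(long cache_bytes, long ways, long line_bytes) {
    long sets = cache_bytes / (ways * line_bytes);
    int bit = 6;                                  // bits [0..5] are the line offset
    while ((1L << (bit - 6)) < sets) bit++;
    return bit - 1;
}

int main(void) {
    printf("32KB, 8-way : index ends at bit %d (4KB page offset ends at bit 11)\n",
           index_top_bit(32 * 1024, 8, 64));
    printf("128KB, 8-way: index ends at bit %d (16KB page offset ends at bit 13)\n",
           index_top_bit(128 * 1024, 8, 64));
    return 0;
}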
 

krumme

Diamond Member
Oct 9, 2009
5,952
1,585
136
Anyway, aren't there good ways to hide the effects of higher L1 latency in modern archs?
 

Nothingness

Platinum Member
Jul 3, 2013
2,409
739
136
Apple's OS has 16KB pages, so a 128KB L1 gives Apple the same "load to use" latency situation as 32KB VIVT caches do for Intel/AMD with 4KB pages.
They can theoretically get away with zero latency increase, and I think the GB4 bench results concur with this.
SRAM latency increases with area (and the number of ports), and here we are talking about a 4x larger RAM, while TSMC's 7nm is not 4x denser than their 14nm for SRAM cells. As I wrote, I guess Apple lost (at least) a cycle in load-to-use latency. And I think that's what @Greyguy1948 is talking about: Apple vs Apple ;)
 

Nothingness

Platinum Member
Jul 3, 2013
2,409
739
136
Anyway, aren't there good ways to hide the effects of higher L1 latency in modern archs?
Yes, out-of-order execution. But sometimes you have flows of dependent instructions, and in those cases you'll be limited by the L1 load-to-use latency.
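(A tiny illustration of such a flow, my own example: a pointer chase where each load's address comes from the previous load, so out-of-order execution can't overlap them and the loop runs at roughly one L1 load-to-use latency per iteration.)

Code:
#include <stdio.h>
#include <stdlib.h>

#define N 4096

int main(void) {
    size_t *next = malloc(N * sizeof *next);
    for (size_t i = 0; i < N; i++)
        next[i] = (i + 1) % N;            // simple cyclic chain through the array

    size_t idx = 0;
    for (long iter = 0; iter < 10000000; iter++)
        idx = next[idx];                  // each load depends on the previous one

    printf("%zu\n", idx);                 // keeps the chain from being optimized away
    free(next);
    return 0;
}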
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
SRAM latency increases with area (and the number of ports), and here we are talking about a 4x larger RAM, while TSMC's 7nm is not 4x denser than their 14nm for SRAM cells

Sure, but the real question is whether they were ever limited by "physical" latency. AMD is fine running a 64KB (4-way) instruction L1 because code has great locality and comes from the same pages, plus they have some sort of L0 iTLB to speed up address translation, and they come out ahead compared to the 32KB 8-way Icache Intel has.
 

krumme

Diamond Member
Oct 9, 2009
5,952
1,585
136
Yes, out-of-order execution. But sometimes you have flows of dependent instructions, and in those cases you'll be limited by the L1 load-to-use latency.
OK. What about the quality of the branch predictor? This is way over my head, but with a large L1 and an effective, intelligent branch predictor, doesn't that help offset some of the latency cost?
 

Nothingness

Platinum Member
Jul 3, 2013
2,409
739
136
OK. What about the quality of the branch predictor? This is way over my head, but with a large L1 and an effective, intelligent branch predictor, doesn't that help offset some of the latency cost?
That doesn't help hide the latency of data accesses in the case of long computational dependency chains. In fact, it would make the increased latency even more apparent, as you'd never fail at guessing branch direction and destination :)
 
  • Like
Reactions: krumme

krumme

Diamond Member
Oct 9, 2009
5,952
1,585
136
That doesn't help hide the latency of data accesses in the case of long computational dependency chains. In fact, it would make the increased latency even more apparent, as you'd never fail at guessing branch direction and destination :)
Lol. Yeah, I get it now.