Apple A12 benchmarks


Charlie22911

Senior member
Mar 19, 2005
614
228
116
Obviously Apple would almost certainly use off-package RAM in these scenarios, as they have with other X-class CPUs, but I wouldn't count the A12 as competitive with recent x86 CPUs when it can't sustain equivalent workloads without throttling.

Well, nobody in their right mind would suggest Apple would use the exact same design from phones up to desktops.

I agree with you. I'm not good at multitasking, though, so admittedly my responses are poorly constructed. Since this thread was about the A12 (not the A12X), my response was aimed at serious comparisons between the A12 and Intel ***Lake cores.
 

oak8292

Member
Sep 14, 2016
82
67
91
This is a fine point, but the PoP packaging on the A12 uses a process developed by TSMC called InFO with TIV (through-insulator via) to reduce wire lengths and improve thermals. I was really disappointed when AnandTech did not do a full-blown analysis when Apple first utilized InFO packaging. According to TSMC, the A12 should throttle less than with a traditional wire-based PoP package. It is a compromise 3D packaging technique with lower cost than full-blown TSV stacking.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
We already know that the first post in this thread pointed to a fake result, since Apple claims a 15% perf improvement :)

Not really?

The original source said that while it was 25% faster, they were dealing with significantly increased power usage. So 25% may have been possible, but they reduced clocks to bring it down to 15% and cut power use to an acceptable level.
 

Nothingness

Platinum Member
Jul 3, 2013
2,409
739
136
Not really?

The original source said that while it was 25% faster, they were dealing with significantly increased power usage. So 25% may have been possible, but they reduced clocks to bring it down to 15% and cut power use to an acceptable level.
Fair point :)

I guess I'm too suspicious about those leaks in general...
 

tynopik

Diamond Member
Aug 10, 2004
5,245
500
126
It's been years, but I was under the impression that Geekbench put a large weight on AES performance and Apple had a dedicated AES coprocessor that 'artificially' boosted performance while Intel didn't.

Maybe the situation has changed since then, but that was my recollection of how a tiny low-power processor was seemingly matching monster desktop CPUs.
 

jpiniero

Lifer
Oct 1, 2010
14,591
5,214
136
It's been years, but I was under the impression that Geekbench put a large weight on AES performance and Apple had a dedicated AES coprocessor that 'artificially' boosted performance while Intel didn't.

Maybe the situation has changed since then, but that was my recollection of how a tiny low-power processor was seemingly matching monster desktop CPUs.

https://browser.geekbench.com/v4/cpu/compare/9790044?baseline=9836174

It can be decently competitive with the average desktop in a good number of the tests, except for SGEMM and SFFT on Core. A highly overclocked desktop it won't catch, of course.
 

Nothingness

Platinum Member
Jul 3, 2013
2,409
739
136
It's been years, but I was under the impression that Geekbench put a large weight on AES performance and Apple had a dedicated AES coprocessor that 'artificially' boosted performance while Intel didn't.

Maybe the situation has changed since then, but that was my recollection of how a tiny low-power processor was seemingly matching monster desktop CPUs.
It's not a dedicated coprocessor: AES is part of the ARM instruction set... and now also part of the x86 instruction set.
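(A rough sketch of my own, not from the thread: both ISAs now do an AES round as a single instruction, which is what the Geekbench AES subtest ends up exercising. On x86 it looks like this with intrinsics, assuming an AES-NI capable CPU and compiling with -maes; on ARMv8 the equivalents are AESE/AESMC, reachable through arm_neon.h.)

Code:
// One hardware AES encryption round instead of a software S-box loop.
#include <wmmintrin.h>   // AES-NI intrinsics
#include <stdio.h>

int main(void) {
    __m128i block    = _mm_set1_epi32(0x01234567);   // 128-bit data block (dummy value)
    __m128i roundkey = _mm_set1_epi32(0x0fedcba9);   // one expanded round key (dummy value)
    // SubBytes + ShiftRows + MixColumns + AddRoundKey in a single instruction.
    __m128i out = _mm_aesenc_si128(block, roundkey);
    printf("%016llx\n", (unsigned long long)_mm_cvtsi128_si64(out));
    return 0;
}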
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Apple has brutal L1 caches that help big time in a lot of workloads. 256KB of total code and data cached in L1 is insane and comparable to the L2 size on Intel client CPUs. x86 CPUs cannot make L1 caches larger than 32-64KB because the page size is 4KB, so there is a huge advantage to Apple having full control of the OS.
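(Quick back-of-the-envelope sketch of that page-size constraint; the 8-way associativity here is my assumption, not a confirmed Apple figure. A virtually indexed L1 stays alias-free only while cache size ≤ page size × associativity.)

Code:
#include <stdio.h>

// Largest alias-free virtually indexed L1: page_size * ways.
static long max_alias_free_bytes(long page_size, long ways) {
    return page_size * ways;
}

int main(void) {
    printf("4KB pages,  8-way: %ld KB\n", max_alias_free_bytes(4096, 8) / 1024);   // 32 KB, the typical x86 L1D
    printf("16KB pages, 8-way: %ld KB\n", max_alias_free_bytes(16384, 8) / 1024);  // 128 KB, what iOS's 16KB pages permit
    return 0;
}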

I assume this score is legit, and it is very comparable to what was leaked in the first post; Apple probably had to back down on clocks, but the 4828 ST score would need maybe 5-10% extra clock to reach the 5200 in the leak. Clearly it is the same CPU, as the 11488 MT score is within the same 5-10%.
 
  • Like
Reactions: krumme

Nothingness

Platinum Member
Jul 3, 2013
2,409
739
136
Apple has brutal L1 caches that help big time in a lot of workloads. 256KB of total code and data cached in L1 is insane and comparable to the L2 size on Intel client CPUs. x86 CPUs cannot make L1 caches larger than 32-64KB because the page size is 4KB, so there is a huge advantage to Apple having full control of the OS.
I'm not sure it's a huge advantage. Large L1 caches increase latency, and I'm not convinced that a 128KB Dcache matters that much (Icache is another thing). They'd be better off spending their budget on improving their hardware data prefetchers.

But I guess they found that it was good for their performance.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
The L1 information on Geekbench shouldn't be taken as the final word.

Yeah, but that is exactly the size to "exploit" with 16KB pages; why go through the OS pain and then leave performance on the table?

I'm not sure it's a huge advantage. Large L1 caches increase latency, and I'm not convinced that a 128KB Dcache matters that much (Icache is another thing).

What exactly is the "increase of latency" mechanism here? If we put physical layout "sizing" constraints aside (and on 7nm I suspect they are irrelevant anyway, as the array is probably smaller than AMD's 64KB L1 on 14nm), we are left with the latency constraints of a VIVT design, which has been 3-4 clocks for ages on all designs (things like TLB lookup and way selection happen in parallel, etc.). Except the maximum Intel/AMD can cache in a sane design is 32KB for the data cache, due to 4KB pages limiting the "virtual" part of the VIVT design.
 

Nothingness

Platinum Member
Jul 3, 2013
2,409
739
136
Yeah, but that is exactly the size to "exploit" with 16KB pages; why go through the OS pain and then leave performance on the table?
Yep. I wonder if they went for 8-way caches or used some form of HW alias resolution with replay. 8-way would have a non-negligible power impact.

What exactly is the "increase of latency" mechanism here? If we put physical layout "sizing" constraints aside (and on 7nm I suspect they are irrelevant anyway, as the array is probably smaller than AMD's 64KB L1 on 14nm)
I'd expect going from the A11's 32KB on 14nm to 128KB on 7nm to require one extra cycle of load-to-use latency.

Interestingly, the Apple A9 and A10 had a 64KB Dcache. I wonder why they went to 32KB on the A11.

we are left with the latency constraints of a VIVT design, which has been 3-4 clocks for ages on all designs (things like TLB lookup and way selection happen in parallel, etc.). Except the maximum Intel/AMD can cache in a sane design is 32KB for the data cache, due to 4KB pages limiting the "virtual" part of the VIVT design.
The K8/K10 had a 2-way 64KB Dcache, and IMHO they were sane designs. Mechanisms to handle aliases in HW exist that make the cache look like a VIPT cache.
 

Greyguy1948

Member
Nov 29, 2008
156
16
91
https://m.weibo.cn/status/4236380060065313

~25% improvement in both single-thread and multi-thread benchmarks.

Given the uniform improvement in both, this leads me to believe the A12 may simply be a ~25% higher-clocked (max clock 3GHz) 7nm A11. We don't know if there are any changes in other parts of the SoC, such as the GPU.

I have long wondered when Apple will stop moving from node to node and delivering YoY IPC improvements.

Not so much on GB4:
https://browser.geekbench.com/v4/cpu/compare/7205880?baseline=9816947
The L1 caches have changed from 32KB I + 32KB D to 128KB I + 128KB D, and this is both a plus and a minus...
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Oh, I certainly agree :D But the K8 was still a good chip (my last AMD CPU...).

Yeah, no doubt about it. It reigned supreme, but I feel that in the days of SMT, 2-way is more of a liability than it was during the K8's reign. And of course AMD was helped by the Netburst perversion with its 8KB of L1 (even if the latency was on steroids).
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Latency must be higher (2 clocks?)

Apple's OS has 16KB pages, so a 128KB L1 gives Apple the same "load to use" latency situation as 32KB VIVT caches do for Intel/AMD with 4KB pages.
They can theoretically get away with zero latency increase, and I think the GB4 bench results concur with this.
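(Sketching the bit arithmetic behind that, with 64-byte lines and 8 ways as my own assumptions rather than published figures: a 128KB cache's set index ends at address bit 13, the top of a 16KB page offset, just as a 32KB cache's index ends at bit 11, the top of a 4KB page offset, so in both cases indexing can start in parallel with the TLB lookup.)

Code:
#include <stdio.h>

// Highest address bit used by the set index, assuming 64-byte lines.
static int index_top_bit(long cache_bytes, long ways, long line_bytes) {
    long sets = cache_bytes / (ways * line_bytes);
    int bit = 6;                                  // bits [0..5] are the line offset
    while ((1L << (bit - 6)) < sets) bit++;
    return bit - 1;
}

int main(void) {
    printf("32KB, 8-way : index ends at bit %d (4KB page offset ends at bit 11)\n",
           index_top_bit(32 * 1024, 8, 64));
    printf("128KB, 8-way: index ends at bit %d (16KB page offset ends at bit 13)\n",
           index_top_bit(128 * 1024, 8, 64));
    return 0;
}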
 

krumme

Diamond Member
Oct 9, 2009
5,952
1,585
136
Anyway, aren't there good ways to hide the effects of higher L1 latency in modern archs?
 

Nothingness

Platinum Member
Jul 3, 2013
2,409
739
136
Apple's OS has 16KB pages, so a 128KB L1 gives Apple the same "load to use" latency situation as 32KB VIVT caches do for Intel/AMD with 4KB pages.
They can theoretically get away with zero latency increase, and I think the GB4 bench results concur with this.
SRAM latency increases with area (and the number of ports), and here we are talking about a 4x larger RAM, while TSMC's 7nm is not 4x denser than their 14nm for SRAM cells. As I wrote, I guess Apple lost (at least) a cycle in load-to-use latency. And I think that's what @Greyguy1948 is talking about: Apple vs Apple ;)
 

Nothingness

Platinum Member
Jul 3, 2013
2,409
739
136
Anyway, aren't there good ways to hide the effects of higher L1 latency in modern archs?
Yes, out-of-order execution. But sometimes you have flows of dependent instructions, and in those cases you'll be limited by the L1 load-to-use latency.
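(A tiny illustration of such a flow, my own example: a pointer chase where each load's address comes from the previous load, so out-of-order execution can't overlap them and the loop runs at roughly one L1 load-to-use latency per iteration.)

Code:
#include <stdio.h>
#include <stdlib.h>

#define N 4096

int main(void) {
    size_t *next = malloc(N * sizeof *next);
    for (size_t i = 0; i < N; i++)
        next[i] = (i + 1) % N;            // simple cyclic chain through the array

    size_t idx = 0;
    for (long iter = 0; iter < 10000000; iter++)
        idx = next[idx];                  // each load depends on the previous one

    printf("%zu\n", idx);                 // keeps the chain from being optimized away
    free(next);
    return 0;
}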
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
SRAM latency increases with area (and the number of ports), and here we are talking about a 4x larger RAM, while TSMC's 7nm is not 4x denser than their 14nm for SRAM cells

Sure, but the real question is whether they were ever limited by "physical" latency. AMD is fine running a 64KB (4-way) instruction L1 because code has great locality and comes from the same pages, plus they have some sort of L0 iTLB to speed up address translation, and they come out ahead compared to the 32KB 8-way Icache Intel has.
 

krumme

Diamond Member
Oct 9, 2009
5,952
1,585
136
Yes, out-of-order execution. But sometimes you have flows of dependent instructions, and in those cases you'll be limited by the L1 load-to-use latency.
OK. What about the quality of the branch predictor? This is way over my head, but with a large L1 and an effective, intelligent branch predictor, doesn't that help offset some of the latency cost?
 

Nothingness

Platinum Member
Jul 3, 2013
2,409
739
136
OK. What about the quality of the branch predictor? This is way over my head, but with a large L1 and an effective, intelligent branch predictor, doesn't that help offset some of the latency cost?
That doesn't help hide the latency of data accesses in the case of long computational dependency chains. In fact, it would make the increased latency even more apparent, as you'd never fail at guessing branch direction and destination :)
 
  • Like
Reactions: krumme

krumme

Diamond Member
Oct 9, 2009
5,952
1,585
136
That doesn't help hide the latency of data accesses in the case of long computational dependency chains. In fact, it would make the increased latency even more apparent, as you'd never fail at guessing branch direction and destination :)
Lol. Yeah, I get it now.