CPU Integer Performance: Applied Micro ARMv8 vs AMD Bulldozer "Interlagos"

cbn · Nov 14, 2011

http://www.anandtech.com/show/5098/applied-micros-xgene-the-first-armv8-soc

APM's performance estimates put a 3GHz X-Gene at roughly half the integer performance of a 2.4GHz Sandy Bridge.

vs. the following set of data

http://www.amdzone.com/phpbb3/viewtopic.php?f=532&t=138927#p214019

BD 6282 SE 2.6GHz @ 2s/32c = 526 SPECint_rate2006
Xeon E7-4870 2.4GHz @ 2s/20c = 553 SPECint_rate2006
Xeon 5690 3.46GHz @ 2s/12c = 393 SPECint_rate2006
Opteron 6180 SE 2.5GHz @ 2s/24c = 430 SPECint_rate2006

SB 2600K 3.4GHz @ 1S/4C = 156 SPECint_rate2006

BD = 16.4 SPECint_rate per core @2.6 GHz
K10 = 17.9 SPECint_rate per core @2.5 GHz
Westmere = 27.7 SPECint_rate per core @2.4 GHz
Westmere = 32.7 SPECint_rate per core @ 3.46 GHz
Sandy Bridge = 39 SPECint_rate per core @ 3.4 GHz

Therefore, if Sandy Bridge has a 39 SpecInt_rate per core @ 3.4 GHz, that would equal 27.5 SpecInt_rate per core @ 2.4 GHz (assuming performance scales linearly with clock).

If Applied Micro ARMv8 @ 3 GHz has half the integer performance of a 2.4 GHz Sandy Bridge core (27.5 SpecInt_rate per core), that would put it at 13.75 SpecInt_rate per core.

Looking at Bulldozer @ 2.6 Ghz it scores 16.4 SpecInt_rate per core.

Therefore ARMv8 @ 3Ghz (13.75 SpecInt_rate per core) is roughly equal to a 2.18 Ghz Bulldozer core.

Am I interpreting this correctly?

If so, I have to say I am impressed with how close ARM is getting to x86. (with respect to Integer)
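The arithmetic in this post can be checked with a short script. This is a minimal sketch: the helper names `per_core` and `scale_to_clock` are made up for illustration, and both the divide-by-cores step and the linear scaling with clock are the post's simplifying assumptions, not measured behavior.

```python
# Scores quoted in the post: (SPECint_rate2006, total cores, clock in GHz).
systems = {
    "Bulldozer 6282 SE": (526, 32, 2.6),
    "Opteron 6180 SE": (430, 24, 2.5),
    "Xeon E7-4870": (553, 20, 2.4),
    "Xeon 5690": (393, 12, 3.46),
    "Sandy Bridge 2600K": (156, 4, 3.4),
}

def per_core(score, cores):
    """Naive per-core score: total rate divided by core count."""
    return score / cores

def scale_to_clock(score, from_ghz, to_ghz):
    """Assume performance scales linearly with clock (a simplification)."""
    return score * to_ghz / from_ghz

for name, (score, cores, ghz) in systems.items():
    print(f"{name}: {per_core(score, cores):.1f} per core @ {ghz} GHz")

sb_34 = per_core(156, 4)                 # Sandy Bridge per core at 3.4 GHz
sb_24 = scale_to_clock(sb_34, 3.4, 2.4)  # same core scaled down to 2.4 GHz
arm_30 = sb_24 / 2                       # APM's "half of a 2.4 GHz SB" estimate
bd_26 = per_core(526, 32)                # Bulldozer per core at 2.6 GHz
bd_equiv_ghz = 2.6 * arm_30 / bd_26      # Bulldozer clock giving the same score

print(f"ARMv8 @ 3 GHz: {arm_30:.2f} per core")
print(f"Equivalent Bulldozer clock: {bd_equiv_ghz:.2f} GHz")
```

Carrying the unrounded 2.4 GHz figure through gives an ARMv8 value slightly above 13.75, but the equivalent Bulldozer clock still lands at about 2.18 GHz.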

GammaLaser · Nov 14, 2011

I don't understand that half integer claim with respect to the graph, which shows 32C X-Gene with 1.5X improvement vs. 4C/8T sandy bridge at the same power consumption?

cbn · Nov 14, 2011

GammaLaser said:
I don't understand that half integer claim with respect to the graph, which shows 32C X-Gene with 1.5X improvement vs. 4C/8T sandy bridge at the same power consumption?

In the graph there is also:

1. Claim of 8 core ARMv8 @ 3ghz being 3x faster than a dual core Sandy Bridge@ 2.2 Ghz
2. Claim of 16 core ARMv8 @ 3Ghz being 1.5x faster than a quad core Sandy Bridge @ 2.4 Ghz

P.S. What makes this comparison even more impressive is that Specint_rate test includes hyperthreading (which benefits the Intel processor).

http://en.wikipedia.org/wiki/SPECint

SPECint tests are carried out on a wide range of hardware, with results typically published for the full range of system-level implementations employing the latest CPUs. For SPECint2006, the CPUs include Intel and AMD x86 & x86-64 processors, Sun SPARC CPUs, IBM POWER CPUs, and IA-64 CPUs. This range of capabilities, specifically in this case the number of CPUs, means that the SPECint benchmark is usually run on only a single CPU, even if the system has many CPUs. If a single CPU has multiple cores, only a single core is used; hyper-threading is also typically disabled.

A more complete system-level benchmark that allows all CPUs to be used is known as SPECint_rate2006, also called "CINT2006 Rate".

cbn · Nov 14, 2011

Now if only we could get a SPECint2006 result on ARMv8 vs Intel Sandy Bridge.

My guess is that (with Intel hyperthreading excluded from the test) ARMv8 @ 3Ghz would probably come out to be half as fast as a 2.7 to 3Ghz Sandy Bridge....rather than half as fast as a 2.4 Ghz Sandy Bridge.

If that ended up being true, then ARMv8 would move closer to Bulldozer clock for clock in integer performance.

SickBeast · Nov 14, 2011

What about floating point performance?

IntelUser2000 · Nov 14, 2011

Computer Bottleneck said:
Looking at Bulldozer @ 2.6 Ghz it scores 16.4 SpecInt_rate per core.

Therefore ARMv8 @ 3Ghz (13.75 SpecInt_rate per core) is roughly equal to a 2.18 Ghz Bulldozer core.

I'd like to make adjustments to that list.

Applied Micro's charts are based on actual SpecInt_Rate 2006 numbers. So a four-core 3GHz chip gets a score of 55.

Hyperthreading's effect on Nehalem for SpecInt_Rate2006 is 13%. I assume it's similar on Sandy Bridge. That's still a very good 34.5.

Bulldozer: On simple divide-numbers-by-cores analysis, remember real world applications do not scale linearly. That would put the number in favor of Bulldozer, somewhat mitigated by Bulldozer being able to run at 3.1GHz TurboCore on 6282SE for all cores.

You can also see the chart's numbers for the dual core E3-1220L are off by a significant amount. On the chart, it gets ~40 points. On Spec.org, it gets 60.

E3-1220L: 30 @ 2.2Ghz. If scaled linearly to 3.4GHz it would result in 46/core.
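The two adjustments above are easy to reproduce. A minimal sketch, assuming the 13% Hyper-Threading gain quoted for Nehalem also applies to Sandy Bridge (the post's own assumption) and that score scales linearly with clock; `without_ht` and `scale_to_clock` are hypothetical helper names.

```python
HT_UPLIFT = 0.13  # Hyper-Threading gain on Nehalem, per the post; assumed to carry over

def without_ht(score, uplift=HT_UPLIFT):
    """Strip an assumed Hyper-Threading gain from a rate score."""
    return score / (1 + uplift)

def scale_to_clock(score, from_ghz, to_ghz):
    """Linear clock scaling, which is an optimistic simplification."""
    return score * to_ghz / from_ghz

sb_2600k_per_core = 156 / 4                       # 39 per core at 3.4 GHz
sb_no_ht = without_ht(sb_2600k_per_core)          # roughly 34.5

e3_per_core = 60 / 2                              # Spec.org score over two cores
e3_at_34 = scale_to_clock(e3_per_core, 2.2, 3.4)  # roughly 46

print(f"2600K per core without HT: {sb_no_ht:.1f}")
print(f"E3-1220L per core scaled to 3.4 GHz: {e3_at_34:.1f}")
```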

cbn · Nov 14, 2011

IntelUser2000 said:
Bulldozer: On simple divide-numbers-by-cores analysis, remember real world applications do not scale linearly. That would put the number in favor of Bulldozer, somewhat mitigated by Bulldozer being able to run at 3.1GHz TurboCore on 6282SE for all cores.

You make a good point about real world applications.

How does the SpecInt_rate benchmark scale with increasing cores? Almost linearly? Or does it also suffer from a good amount of drop off?

EDIT: Nvm, just looking at slide #1 the ARMv8s are dropping off in performance as core counts increase (eg, moving from 16 core to 32 core) in the SPECint_rate2006 benchmark.

cbn · Nov 14, 2011

GammaLaser said:
I don't understand that half integer claim with respect to the graph, which shows 32C X-Gene with 1.5X improvement vs. 4C/8T sandy bridge at the same power consumption?

The points on the graph (for the different processor sizes and speeds) don't line up on the x axis.

My guess is that this "half" claim is derived from some sort of extrapolation of the data.

soccerballtux · Nov 14, 2011

if it's roughly half a sb at 2.4ghz that in itself is quite a feat.

GammaLaser · Nov 14, 2011

Computer Bottleneck said:
The points on the graph (for the different processor sizes and speeds) don't line up on the x axis.

My guess is that this "half" claim is derived from some sort of extrapolation of the data.

I looked again and understand now; the 32C is not scaling well at all in the benchmark and thus the numbers look much better for ARM at the lower-power end of the graph.

cbn · Nov 14, 2011

Maybe looking at the score for 32 3.0 Ghz ARMv8 cores (~240 according to the above slide) and comparing it to the score for 32 2.6 Ghz Bulldozer Cores (526, shown in post #1) is a better indication of each core's integer performance?

In this case we are looking at each 3 Ghz ARMv8 core having 45% of the integer score of a 2.6 Ghz Bulldozer core.

Another way to look at it would be to say each 3Ghz ARMv8 core is equivalent to a 1.17 Ghz Bulldozer core.
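A quick check of the 32-core comparison, as a sketch only: the ~240 figure is read off the slide, and converting a score ratio into an "equivalent clock" assumes Bulldozer performance scales linearly with frequency.

```python
arm_total, arm_cores, arm_ghz = 240, 32, 3.0  # ~240 is an estimate read off the slide
bd_total, bd_cores, bd_ghz = 526, 32, 2.6     # 2-socket 6282 SE result from post #1

arm_per_core = arm_total / arm_cores          # 7.5 per ARMv8 core
bd_per_core = bd_total / bd_cores             # about 16.4 per Bulldozer core

ratio = arm_per_core / bd_per_core            # ARMv8 core as a fraction of a BD core
bd_equiv_ghz = ratio * bd_ghz                 # BD clock with the same per-core score

print(f"ARMv8 core vs Bulldozer core: {ratio:.1%}")
print(f"Equivalent Bulldozer clock: {bd_equiv_ghz:.2f} GHz")
```

Without rounding, the ratio is closer to 46% and the equivalent clock closer to 1.19 GHz; the 1.17 GHz figure follows from rounding the ratio down to 45% first.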

hooflung · Nov 14, 2011

Servers? Sir this is going into game consoles.

SickBeast · Nov 14, 2011

hooflung said:
Servers? Sir this is going into game consoles.

We don't know that yet for sure, and really, they're only half way there in terms of delivering proper performance.

It's exciting to watch the development of ARM, but they're not there yet. I'll bet the Cell CPU in the Playstation 3 would outperform even a quad core ARM at 3ghz, which is pretty wishful thinking at this point to begin with.

podspi · Nov 14, 2011

hooflung said:
Servers? Sir this is going into game consoles.

I am having trouble believing that, but if it turns out to be the case, at least people won't have to complain that games can't use tons of threads anymore.

And AMD will be vindicated, kinda. 😀

cbn · Nov 14, 2011

GammaLaser said:
I looked again and understand now; the 32C is not scaling well at all in the benchmark and thus the numbers look much better for ARM at the lower-power end of the graph.

I just noticed that the 8 core 3Ghz ARMv8 is the only one that actually does line up with an Intel processor (dual core Sandy Bridge @ 2.2 Ghz) on the x-axis. (clicking on the slide in post #11 and zooming makes this easier to see)

cbn · Nov 14, 2011

SickBeast said:
It's exciting to watch the development of ARM, but they're not there yet.

We also have to remember this ARMv8 will have to compete with future Bulldozers and Intel processors.

cbn · Nov 14, 2011

IntelUser2000 said:
Applied Micro's charts are based on actual SpecInt_Rate 2006 numbers. So a four-core 3GHz chip gets a score of 55.

You can also see the chart's numbers for the dual core E3-1220L are off by a significant amount. On the chart, it gets ~40 points. On Spec.org, it gets 60.

E3-1220L: 30 @ 2.2Ghz. If scaled linearly to 3.4GHz it would result in 46/core.

So if we took away hyperthreading, we would be looking at approximately 53 points for the Intel dual core @ 2.2 Ghz.

That compares to 55 points for the 4 core ARMv8 at 3Ghz.

Based on this.....we are looking at 3GHz ARMv8 being roughly 50% of a 2.2 to 2.3 Ghz Sandy Bridge for single core "integer"

EDIT: If we compare the results of post 11 to this one I come up with:

3 Ghz ARMv8 is equivalent to a 1.17 Ghz Bulldozer or 1.1 Ghz Sandy Bridge for single core "integer".
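The Sandy Bridge side of that estimate can be sketched the same way. The 13% Hyper-Threading factor and the linear clock scaling are assumptions carried over from earlier posts, and the function name `armv8_vs_sandy_bridge` is made up for this example.

```python
def armv8_vs_sandy_bridge(arm_score=55, arm_cores=4,
                          sb_score=60, sb_cores=2, sb_ghz=2.2,
                          ht_uplift=0.13):
    """Compare per-core rate scores after removing an assumed HT gain."""
    sb_no_ht = sb_score / (1 + ht_uplift)        # dual-core score without HT
    sb_per_core = sb_no_ht / sb_cores
    arm_per_core = arm_score / arm_cores
    ratio = arm_per_core / sb_per_core           # ARMv8 core relative to SB core
    sb_equiv_ghz = ratio * sb_ghz                # SB clock with the same score
    return sb_no_ht, ratio, sb_equiv_ghz

sb_no_ht, ratio, sb_equiv_ghz = armv8_vs_sandy_bridge()
print(f"E3-1220L without HT: {sb_no_ht:.0f} points")
print(f"3 GHz ARMv8 core vs 2.2 GHz SB core: {ratio:.0%}")
print(f"Equivalent Sandy Bridge clock: {sb_equiv_ghz:.2f} GHz")
```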

Tuna-Fish · Nov 15, 2011

SickBeast said:
What about floating point performance?

It'll be pretty horrible compared to x86. However, most server workloads don't use FP at all.

IntelUser2000 said:
Bulldozer: On simple divide-numbers-by-cores analysis, remember real world applications do not scale linearly.

The target application is pretty obviously web serving, which does scale linearly. (Each core is running a different request.)

Tuna-Fish · Nov 15, 2011

SickBeast said:
It's exciting to watch the development of ARM, but they're not there yet. I'll bet the Cell CPU in the Playstation 3 would outperform even a quad core ARM at 3ghz, which is pretty wishful thinking at this point to begin with.

Depends on what you measure. In FP throughput, not a contest. Cell would still beat most present-day CPUs in that. In running complex integer scripts with a lot of memory ops, Cell would lose to most modern cellphone CPUs.

IntelUser2000 · Nov 15, 2011

Computer Bottleneck said:
You make a good point about real world applications.

How does the SpecInt_rate benchmark scale with increasing cores? Almost linearly? Or does it also suffer from a good amount of drop off?

My reference to Bulldozer was specific, because the module-based approach incurs some penalty on multi-threaded performance. If we take AMD's claims each module is like 80% of a core, Bulldozer 16 core is more like having 14.4 cores. That makes it look artificially worse for single thread than it already is if you simply divide by number of cores.

With regard to scaling, I typically take ~85% for Spec (integer only) and Cinebench, but it can go from 70-95% too. So that's too much of a margin.
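To see how much these corrections move the per-core number, here is a rough sketch. It takes the 14.4 "effective cores" per 16-core chip from the post as given, and it treats scaling efficiency as rate score divided by (cores x single-copy score), which is one possible reading of the 70-95% range rather than a definition from SPEC.

```python
bd_total, bd_cores, sockets = 526, 32, 2   # 2-socket 6282 SE result from the first post

naive_per_core = bd_total / bd_cores       # simple divide-by-cores figure

effective_cores_per_chip = 14.4            # figure quoted in the post, taken as given
effective_cores = sockets * effective_cores_per_chip
per_effective_core = bd_total / effective_cores

print(f"Naive per core: {naive_per_core:.1f}")
print(f"Per effective core: {per_effective_core:.1f}")

# Implied single-copy score under different assumed scaling efficiencies.
implied = {}
for efficiency in (0.70, 0.85, 0.95):
    implied[efficiency] = naive_per_core / efficiency
    print(f"Scaling {efficiency:.0%}: implied single-copy score {implied[efficiency]:.1f}")
```

The spread between the 70% and 95% cases is wide enough that the choice of scaling factor matters more than the module correction.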

Cerb · Nov 15, 2011

hooflung said:
Servers? Sir this is going into game consoles.

Quite a few companies that make, sell, and support servers disagree, including the makers of the SoC(s) in question.

SickBeast said:
What about floating point performance?

Probably not too fast, but hopefully not slow. NEON should be perfectly capable of scaling right on up, but even the fastest implementations have been targeted to phones at the upper end of performance, and ARM is just getting around to targeting low-performance servers, high-performance tablets, and non-PC notebooks. VFP and NEON from ARM have consistently had gimped implementations, since their selling point has been small cheap processors, so I would expect it to be quite impressive compared to typical ARM FP, but not necessarily impressive compared to desktop FP.

soccerballtux · Nov 15, 2011

SickBeast said:
I'll bet the Cell CPU in the Playstation 3 would outperform even a quad core ARM at 3ghz, which is pretty wishful thinking at this point to begin with.

This is highly unlikely. The Cell's central processor is in-order only and as such wastes a lot of resources waiting around for data from RAM. It doesn't even have hyper threading like Atom does to keep the chip busy on a 2nd thread during a stall on the 1st thread. This is why games tend to stutter more on the PS3 than on the Xbox360 (in my opinion)-- vs Xbox360, where they've got 3 cores / 6 threads that they can use for extra processing.
As for integer resources, yes of course it would win-- but games need brain-thinking more than they need the gobs of math capabilities that the Cell has. It was meant for decoding hidef blu-ray content anyways-- that developers can sometimes cook up ways to utilize the SPEs in certain scenarios is good, but not the purpose (from Sony's perspective).

Tuna-Fish · Nov 15, 2011

Cerb said:
(re FP)
Probably not too fast, but hopefully not slow. NEON should be perfectly capable of scaling right on up, but even the fastest implementations have been targeted to phones at the upper end of performance, and ARM is just getting around to targeting low-performance servers, high-performance tablets, and non-PC notebooks. VFP and NEON from ARM have consistently had gimped implementations, since their selling point has been small cheap processors, so I would expect it to be quite impressive compared to typical ARM FP, but not necessarily impressive compared to desktop FP.

All ARM FP implementations I know of put the entire FP pipeline after the integer one. This means that code like:

Code:

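float a, b;
if (a < b) somefunction();  // branch decided by a floating-point compare
// or
float a, b;
int c;
if (a < b) c = c + 1;       // integer update gated by a floating-point compare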

is always pretty slow. Basically, you suffer the whole pipeline length every time you bring results from a floating point context back into the integer one.

So while most ARM solutions are actually pretty fast when doing computations on floating point, on real code that wants to use results from there it's just slow. I honestly don't think that APM has fixed this -- most of the loads they are interested in don't ever use any FP, so why spend expensive design resources on it?

Cerb · Nov 15, 2011

Tuna-Fish said:
So while most arm solutions are actually pretty fast when doing computations on floating point, on real code that wants to use results from there it's just slow.

I knew it was generally slow, the A8 being especially D: (VFPlite), but did not realize why it all tended to be that way, nor put time in to find it out. Good to know.

I honestly don't think that APM has fixed this -- most of the loads they are interested in don't ever use any FP, so why spend expensive design resources on it?

If their customers don't much care, their best bet would probably just be to beef up the stock NEON implementations. However, as general performance improves, performance of code not optimized by humans will matter more and more, and mixed int/FP is woefully common for desktop/office applications. If that switching latency is removed, thus improving CPI, then a narrow FP unit would be fine. Chasing high FLOPS is likely not worth it, but high CPI for any halfway common loop with a few FP ops can and will eat those int performance gains.
