[AT] Apple's A7: It's Better Than I Thought

NTMBK

Lifer
Nov 14, 2011
10,450
5,833
136
I had heard rumors that Cyclone was substantially wider than its predecessor, but I didn't really have any proof beyond hearsay, so I left it out of the article. Instead I surmised in the 5s review that the A7 was likely an evolved Swift core rather than a brand new design; after all, what sense would it make to design a new CPU core and then do it all over again for the next one? It turns out I was quite wrong.

...

With Cyclone, Apple is in a completely different league. As far as I can tell, peak issue width of Cyclone is 6 instructions. That's at least 2x the width of Swift and Krait, and at best more than 3x the width depending on instruction mix. Limitations on co-issuing FP and integer math have also been lifted, as you can run up to four integer adds and two FP adds in parallel. You can also perform up to two loads or stores per clock.

http://www.anandtech.com/show/7460/apple-ipad-air-review/2

Two top-notch, totally new CPU designs out of Apple in two years. They're improving their CPU design capability at a scary rate.
 

Nothingness

Diamond Member
Jul 3, 2013
3,307
2,380
136
As I wrote in another thread, I don't believe Cyclone decodes 6 instructions per cycle; even Haswell can only decode 4 instructions/cycle (though x86 instructions are obviously more complex to decode, and that bandwidth comes after a pre-decoding stage; see figure 1).
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,672
2,546
136
As I wrote in another thread, I don't believe Cyclone decodes 6 instructions per cycle; even Haswell can only decode 4 instructions/cycle (though x86 instructions are obviously more complex to decode, and that bandwidth comes after a pre-decoding stage; see figure 1).

Having more issue than decode capability in an OoO design makes sense, because whenever you hit a data stall, decode can keep going while issue stalls. A single L2 hit typically fills up the entire instruction window of any OoO CPU, meaning that when you do get the data, you have tens of instructions queued and can operate at full issue throughput until you stall on data again.

Also, decoding fixed-width instructions is very easy. The only reason the CPU doesn't decode 6 or 10 or 20 of them per cycle is that doing so would raise power use linearly and would make no sense given the average ILP of the workload. x86 is effectively limited by its variable-width instructions; many CPUs with fixed-width instructions that predate Haswell, built on much lower transistor and design budgets, have had decode widths of 8 or more.
 

Nothingness

Diamond Member
Jul 3, 2013
3,307
2,380
136
Having more issue than decode capability in an OoO design makes sense, because whenever you hit a data stall, decode can keep going while issue stalls. A single L2 hit typically fills up the entire instruction window of any OoO CPU, meaning that when you do get the data, you have tens of instructions queued and can operate at full issue throughput until you stall on data again.
Agreed, but you can't measure that effect easily unless you do micro-measurements, which as far as I understand isn't what Anand is doing (and might not even be possible on Apple hardware; are PMU events / the cycle counter user-accessible?).

Also, decoding fixed-width instructions is very easy. The only reason the CPU doesn't decode 6 or 10 or 20 of them per cycle is that doing so would raise power use linearly and would make no sense given the average ILP of the workload. x86 is effectively limited by its variable-width instructions; many CPUs with fixed-width instructions that predate Haswell, built on much lower transistor and design budgets, have had decode widths of 8 or more.
Think about this: Cyclone has 3 different instruction encodings to support, one of them using variable length instructions (16/32-bit)!

And show me a non-VLIW/EPIC CPU with 8 decoders, preferably one that made it to the non-HPC market :biggrin:

Well you might be right and Anand too. But I'm skeptical :)
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
The problem I'm having here is that there's no standardization on what the term issue means. Some, like Intel, use it to mean the in-order part of the front-end before renaming, which is usually coupled with decode width, and use the term dispatch for the instructions that enter the execution ports. Others use dispatch to mean the former and issue to mean the latter, exactly the opposite of Intel.

I have no idea what Anand is referring to as issue here. If he means decode width then 6 is extremely high and like Nothingness I'm really skeptical. If he means execution width then he's wrong about it being wider than Cortex-A15, which is 8-wide in this regard. It'd help if Anand actually explained where he got these numbers.

As far as decode width goes, it isn't impossible that A7's full width is enabled in 64-bit mode only. The typical performance difference between the two is larger than I expected it to be from arch improvements alone, but who knows.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,672
2,546
136
Agreed, but you can't measure that effect easily unless you do micro-measurements, which as far as I understand isn't what Anand is doing (and might not even be possible on Apple hardware; are PMU events / the cycle counter user-accessible?).

I think it could be measured by intentionally consistently hitting L2.

Think about this: Cyclone has 3 different instruction encodings to support, one of them using variable length instructions (16/32-bit)!

Most ARMv8 CPUs will probably decode the old instructions more slowly. A7 is noticeably slower on 32-bit instruction streams, so it might already do this.

And show me a non-VLIW/EPIC CPU with 8 decoders, preferably one that made it to the non-HPC market :biggrin:

Non-HPC rather limits it, and you know it. :biggrin: There are practically 3 markets: GP, HPC and embedded. Embedded doesn't want wide decode, and GP wants backwards compatibility. That doesn't leave a lot of room for wide designs. :biggrin: I think the last Alpha actually had 16-wide decode; not because it needed it, but because it's cheap, and they had 16-wide fetch, so why not.

However, as I said, I don't think it can decode 6 per clock, simply because it makes no sense for the domain. But after hearing the assertion of 6-wide issue, I have no reason to contest it, notably because the team behind the CPU has used multiple separate schedulers in the past, and in such a design, given an optimal instruction mix, there's no reason for the design to *not* have very wide issue.
 

Nothingness

Diamond Member
Jul 3, 2013
3,307
2,380
136
I think it could be measured by intentionally consistently hitting L2.
That wouldn't work and it's not that difficult to see why ;) I'll try to explain but I'm not good at that...

Imagine you have an instruction hitting in L2. You have 3 decoders and L2 load-to-use latency is 10 cycles. This means you can decode 30 instructions while the load is outstanding. No matter how wide your issue is, these will require 10+epsilon cycles to run, so your IPC is still limited to 3.

Another way to look at it is to consider that no matter what you do, you'll be constrained by the narrowest part of your system (a variant of Amdahl's law, if you want :)). And it's that part I'm interested in, not the issue width.

Most ARMv8 CPUs will probably decode the old instructions more slowly. A7 is noticeably slower on 32-bit instruction streams, so it might already do this.
Do you really think they want to reduce legacy app speed? That'd be crazy at that point in time.

Are you sure A7 is slower on ARM 32-bit? It's impossible to know since the 32 and 64-bit instruction sets are very different.

Non-HPC rather limits it, and you know it. :biggrin: There are practically 3 markets: GP, HPC and embedded. Embedded doesn't want wide decode, and GP wants backwards compatibility. That doesn't leave a lot of room for wide designs. :biggrin: I think the last Alpha actually had 16-wide decode; not because it needed it, but because it's cheap, and they had 16-wide fetch, so why not.
Indeed :biggrin:

However, as I said, I don't think it can decode 6 per clock, simply because it makes no sense for the domain. But after hearing the assertion of 6-wide issue, I have no reason to contest it, notably because the team behind the CPU has used multiple separate schedulers in the past, and in such a design, given an optimal instruction mix, there's no reason for the design to *not* have very wide issue.
In that case, we agree! But I fail to see how Anand measured issue width...
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
That wouldn't work and it's not that difficult to see why ;) I'll try to explain but I'm not good at that...

Imagine you have an instruction hitting in L2. You have 3 decoders and L2 load-to-use latency is 10 cycles. This means you can decode 30 instructions while the load is outstanding. No matter how wide your issue is, these will require 10+epsilon cycles to run, so your IPC is still limited to 3.

Another way to look at it is to consider that no matter what you do, you'll be constrained by the narrowest part of your system (a variant of Amdahl's law, if you want :)). And it's that part I'm interested in, not the issue width.

If you're sustaining maximum throughput at the narrowest part of the pipeline, eventually the prefetch buffer will fill up, and it'll only request an L1 icache transaction when it stops being full enough. But L1 icache accesses are still a really bad proxy for instruction throughput, since typically a lot of the instructions in the accessed cache lines won't be candidates for execution. And this is true before you even get to prefetching.

But he said L2, and I have no idea how L2 hits are supposed to be representative of anything instruction-throughput related. I guess by "intentionally hitting" he means using a big stream that blows out the L1 icache. I'd be really alarmed if the core can sustain whatever the maximum fetch rate is while streaming from L2 cache.

Here's what I think Anand is doing: interpreting compiler source code that describes the scheduling (which also tends to be really easy to misinterpret). And I think that's why he suddenly has more insight into Swift too. Does anyone know if Apple has actually released their Clang revisions?
 

Roland00Address

Platinum Member
Dec 17, 2008
2,196
260
126
For a beginner, what are the downsides of making a CPU wider? Is it harder to increase clock speed? Without the proper software design, will the extra-wide CPU go underutilized and thus be wasted die space? Any other problems?
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
146
106
For a beginner, what are the downsides of making a CPU wider? Is it harder to increase clock speed? Without the proper software design, will the extra-wide CPU go underutilized and thus be wasted die space? Any other problems?

Utilization. I think Intel said making the Core 4-issue wide only gained around 5%.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,672
2,546
136
That wouldn't work and it's not that difficult to see why ;) I'll try to explain but I'm not good at that...

But he said L2 and I have no idea how L2 hits are supposed to be representative of anything instruction throughput related. I guess by "intentionally hitting" he's saying to have a big stream that blows up L1 icache. I'd be really alarmed if the core can sustain whatever the maximum fetch rate is while streaming from L2 cache..

The idea is to avoid L1d, not L1i.

You can measure issue width by intentionally letting your decode run ahead by repeatedly missing L1d.

Basically, produce a large block of pointers, where each of them points to a random other entry within the block, with the block sized to fit in L2, and pointer-chase through all of them like:
a = start;
for i in count {
a = *a;
}
Measure the time taken; this is the time it takes to just pointer-chase.

Then add arithmetic to each iteration, so that all the operations in the iteration depend on the previous memory load and the next load depends on the arithmetic. Measure the time taken, subtract the time spent just pointer-chasing, and you have a reasonably close measurement of the time it took to do the arithmetic, all of it running from the ROB post-decode.

The point is not to get average IPC above decode width; it's to make sure average IPC is pinned at 0 for a known amount of time, so you can deduce the peak IPC from the average.

For a beginner, what are the downsides of making a CPU wider? Is it harder to increase clock speed? Without the proper software design, will the extra-wide CPU go underutilized and thus be wasted die space? Any other problems?

Getting results from one unit to another gets more complex and expensive the more of them there are. Having more units means you need to spread them out further from your register file to fit them. Both of these cost clock speed and area. If you have variable-width instructions, like x86, fast decode gets really expensive really quickly with added width. And all software has some ILP limit, which varies a lot, so adding more width gives diminishing utility.

Note that these penalties don't apply in some situations, specifically when you let the FP units and the integer units issue in parallel: since they don't typically forward data between each other, you don't need fast paths between them, and they operate on independent register files.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
I still wish people would be clearer about what they mean when they say "issue", instead of just assuming everyone knows and that the term isn't actually ambiguous :| Even when it's heavily implied by context.

The idea is to avoid L1d, not L1i.

You can measure issue width by intentionally letting your decode run ahead by repeatedly missing L1d.

Basically, produce a large block of pointers, where each of them points to a random other entry within the block, with the block sized to fit in L2, and pointer-chase through all of them like:
a = start;
for i in count {
a = *a;
}
Measure the time taken; this is the time it takes to just pointer-chase.

Then add arithmetic to each iteration, so that all the operations in the iteration depend on the previous memory load and the next load depends on the arithmetic. Measure the time taken, subtract the time spent just pointer-chasing, and you have a reasonably close measurement of the time it took to do the arithmetic, all of it running from the ROB post-decode.

The point is not to get average IPC above decode width; it's to make sure average IPC is pinned at 0 for a known amount of time, so you can deduce the peak IPC from the average.

What does peak IPC mean? The widest part of the pipeline? If all you're doing is arithmetic, then you won't saturate the execution width, since you usually only achieve that with a diverse mix of instruction types. You'd have to figure out the exact mix of operations that gives you the best throughput in this loop, and they'd all have to be dependent on the load but independent of each other. There could be operations that you can't even make directly dependent on the load (branches, FP and SIMD operations are likely candidates), so you wouldn't even be able to account for those.

What Anand says later in the article is a stronger hint at what he's really looking at:

Limitations on co-issuing FP and integer math have also been lifted as you can run up to four integer adds and two FP adds in parallel. You can also perform up to two loads or stores per clock.
Has he really measured execution of all of these things in parallel? I doubt it. Maybe he did separate tight loops with 4 adds, 2 FPU operations, and 2 loads and/or stores, and is assuming they can all be executed simultaneously since that's often the case. Or maybe he's reading all of that from compiler source.
 

Nothingness

Diamond Member
Jul 3, 2013
3,307
2,380
136
The point is not to get average IPC above decode, it's to make sure average IPC is pinned at 0 for a known amount of time, so you can deduce the peak ipc from the average.
OK, got what you meant :)

BTW I was thinking they could still reach 6 instructions per cycle even under L1D-hit traffic if they have some form of loop buffer holding uops that are simpler to decode than instructions.

Want to know more :biggrin:
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,672
2,546
136
OK, got what you meant :)

BTW I was thinking they could still reach 6 instructions per cycle even under L1D-hit traffic if they have some form of loop buffer holding uops that are simpler to decode than instructions.

The entire point of a regularly-encoded RISC ISA like A64 is that uops *are not* any simpler to decode than the instructions; the instructions are simple enough already. If there is a loop buffer, it's going to store instructions.
 

Nothingness

Diamond Member
Jul 3, 2013
3,307
2,380
136
The entire point of a regularly-encoded RISC ISA like A64 is that uops *are not* any simpler to decode than the instructions; the instructions are simple enough already. If there is a loop buffer, it's going to store instructions.
Take a look at AArch64 SIMD instruction encoding, then let's talk about simple encoding ;) Even for integer ops there is room to simplify decoding, at the cost of the number of bits needed to represent each instruction.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
Take a look at AArch64 SIMD instruction encoding, then let's talk about simple encoding ;) Even for integer ops there is room to simplify decoding, at the cost of the number of bits needed to represent each instruction.

Like the immediate format for logical instructions. A lot of people are commenting on AArch64 as if it were MIPS I.
 

cbn

Lifer
Mar 27, 2009
12,968
221
106
Utilization. I think Intel said making the Core 4-issue wide only gained around 5%.

Do you have a link for that info?

P.S. Conroe was the first Intel x86 CPU with a 4-issue front end, and its performance was quite a bit better than the NetBurst parts before it. (Granted, there were other changes as well that went along with the wider front end.)
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
146
106
Do you have a link for that info?

P.S. Conroe was the first Intel x86 CPU with a 4-issue front end, and its performance was quite a bit better than the NetBurst parts before it. (Granted, there were other changes as well that went along with the wider front end.)

Nope, but I can try to see if I can find it. It's from several years ago, back in the Core 2 days, when there was an interview with an Intel engineer about going 4-issue wide. It was said there that it only gave around 5%. If I recall right, it also came up because of questions about Itanium, since Itanium was much wider but didn't suffer from the real-time scheduling.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
P.S. Conroe was the first Intel x86 CPU with a 4-issue front end, and its performance was quite a bit better than the NetBurst parts before it. (Granted, there were other changes as well that went along with the wider front end.)

AMD's K5 was actually quad-issue, and that's using the same terminology Intel uses (it could decode up to 4 instructions every cycle). Going by the other definition of issue, it'd be six-wide, also like Conroe. The execution resource balance was very different, though.