Apple A7 is now 64-bit

Khato · Sep 21, 2013

Exophase said:
And Cortex-A7 was revealed October 2011 and the first products were out early 2013. The gap of time between announcement and first product has been shrinking for ARM

That's an assumption based on looking at only the announcement dates, not the details of the announcement. Anand's coverage of the A7 announcement specifically stated "ARM expects that we will see some 40nm A7 designs as early as the end of next year for use in low end smartphones (~$100)" - clear indication that it was further along in the development process when announced than either A15 or A57/A53 were. (From the A15 piece, "Architectural details are light, and ARM is stating that first silicon will ship in 2012 at 32/28nm.", while from the A57/A53 piece, "Completed Cortex A57 and A53 core designs will be delivered to partners (including AMD and Samsung) by the middle of next year. Silicon based on these cores should be ready by late 2013/early 2014, with production following 6 - 12 months after that.")

Note that ARM provided the A57/A53 POP IP for TSMC's 28nm HPM process near the beginning of April this year. Which then needs to be integrated into the remainder of the SoC design and then fabricated. Once those first chips are back silicon validation needs to be done before going to production... So yeah, Q2 2014 is definitely possible, but it'd be a very optimistic schedule. (Especially for Samsung's own design assuming that they're using their own fabs as that means they need to do their own structural design instead of using ARM's POP IP.)

krumme · Sep 22, 2013

Hans de Vries said:
Dual core Cyclone + 1 MB LB is 5.0 x 3.6 = 18 mm2

That's probably bigger than anybody expected.
Just like the performance
I wonder how the A57 will end up.

Hans.

Damn! Its upside down world.

When Intel presented the bt design going for a lean 64b fpu just seemed very sensitive to me.
But margins is on highend phones and when everybody and his brother is perhaps going 128b fpu armv8 fat boys it just seems i was wrong. At least this ecofriendly apple elephant indicates it.

Eug · Sep 22, 2013

^^^ What do you mean by "sensitive"?

---

BTW, the iPhone 5S encodes video twice as fast as the iPhone 5. They didn't give times though, so not a precise test. However, what we do know is by the time iPhone 5S finishes rendering the iMovie project, the iPhone 5 is only halfway done.

http://macsfuture.com/post/61891960935/iphone-5s-head-to-head-speed-test-with-iphone-5

Nothingness · Sep 22, 2013

I guess he meant sensible

krumme · Sep 22, 2013

Nothingness said:
I guess he meant sensible

That would be it - lol

StrangerGuy · Sep 23, 2013

Eug said:
^^^ What do you mean by "sensitive"?

---

BTW, the iPhone 5S encodes video twice as fast as the iPhone 5. They didn't give times though, so not a precise test. However, what we do know is by the time iPhone 5S finishes rendering the iMovie project, the iPhone 5 is only halfway done.

http://macsfuture.com/post/61891960935/iphone-5s-head-to-head-speed-test-with-iphone-5

By the time HEVC is released it's a sure bet encode/decode would be entirely hardware accelerated. Yet another CPU intensive task gone to the wayside.

Exophase · Sep 23, 2013

krumme said:
Damn! Its upside down world.

When Intel presented the bt design going for a lean 64b fpu just seemed very sensitive to me.

Don't know why people keep calling it 64-bit FPU. It's 128-bit FADD, 64-bit FMUL, and it can run both in parallel. It looks like they kept a lot of it the same from Saltwell.

krumme · Sep 23, 2013

Exophase said:
Don't know why people keep calling it 64-bit FPU. It's 128-bit FADD, 64-bit FMUL, and it can run both in parallel. It looks like they kept a lot of it the same from Saltwell.

You are right. It doesnt say much about performance anyway i guess.

As i understand David Kanter some instructions on the fp side of thing is not executed in single cycle througput to save power.
http://www.realworldtech.com/silvermont/5/

But apple a7 get it fp power from something...
I think the point stands that if armv8 and a57 derivatives can get same sort of performance boost Intel needs to do something about the fp performance. And it means an entire overhaul of the fp part.

Exophase · Sep 23, 2013

krumme said:
You are right. It doesnt say much about performance anyway i guess.

As i understand David Kanter some instructions on the fp side of thing is not executed in single cycle througput to save power.
http://www.realworldtech.com/silvermont/5/

Yeah, probably referring to FMUL. But I wonder if this really actually improves perf/W. Especially if there are already 128-bit datapaths elsewhere to facilitate 128-bit FADDs.

krumme said:
But apple a7 get it fp power from something...
I think the point stands that if armv8 and a57 derivatives can get same sort of performance boost Intel needs to do something about the fp performance. And it means an entire overhaul of the fp part.

Frankly, I don't think the floating point performance matters in very much at all in this space, outside of winning benchmarks. Which unfortunately is pretty important :/

Your comment did make me think of something obvious, which for some reason I never really thought about earlier.. Apple is showing a decent performance boost from a number of tests in moving to ARMv8 (way more than I expected, not counting the ones where there's new acceleration like AES). Apple's using totally open source compilers for this. So when other ARM CPUs move to 64-bit they could also receive a performance boost. Will have to wait and see how much GCC shows it, since I don't see the NDK moving to Clang, although you could probably substitute it in if you really want.

Khato · Sep 23, 2013

Exophase said:
Your comment did make me think of something obvious, which for some reason I never really thought about earlier.. Apple is showing a decent performance boost from a number of tests in moving to ARMv8 (way more than I expected, not counting the ones where there's new acceleration like AES). Apple's using totally open source compilers for this. So when other ARM CPUs move to 64-bit they could also receive a performance boost. Will have to wait and see how much GCC shows it, since I don't see the NDK moving to Clang, although you could probably substitute it in if you really want.

Will definitely have to wait and see, but I'd be somewhat surprised if we don't see similar gains for other ARMv8 cores between 64 and 32 bit executables. I can understand why the enthusiast community would find this peculiar given our experience with the transition to 64 bit some 10 years ago now where some compute-heavy applications showed decent gains while the majority of programs either showed little improvement or had marked penalties. But that was due to the combination of poor software adoption and the fact that Microsoft was all to let AMD stick us with a band-aid 64 bit x86 implementation. Whereas in this case ARMv8 is a good example of an instruction set evolution and it's coupled with Apple's typical software polish.

Exophase · Sep 23, 2013

Khato said:
Will definitely have to wait and see, but I'd be somewhat surprised if we don't see similar gains for other ARMv8 cores between 64 and 32 bit executables. I can understand why the enthusiast community would find this peculiar given our experience with the transition to 64 bit some 10 years ago now where some compute-heavy applications showed decent gains while the majority of programs either showed little improvement or had marked penalties. But that was due to the combination of poor software adoption and the fact that Microsoft was all to let AMD stick us with a band-aid 64 bit x86 implementation. Whereas in this case ARMv8 is a good example of an instruction set evolution and it's coupled with Apple's typical software polish.

I'm surprised because I really just didn't think that ARMv8 offered a lot that'd give a performance improvement. People have been saying for a long time that there's not that much performance to be exploited in differences between reasonable "sane" ISAs (x86 vs ARM being the common point of discussion). ARMv8 has some nice features but it isn't that exotic, much of it is paring back ARM's more esoteric features to more limited versions that are easier to implement in hardware.

The only real big performance enhancing feature would be the move from 15 to 31 GPRs. AMD did studies years ago that showed that going beyond 16 didn't give you more than a few percent improvement. But it's possible that those studies weren't that universal (for instance, slanted by x86's load+op instructions). What I'm most suspicious of is that this study reveals the ARMv7 had subpar register allocation.

Khato · Sep 23, 2013

Exophase said:
What I'm most suspicious of is that this study reveals the ARMv7 had subpar register allocation.

Definitely sounds like a reasonable theory. Especially when you consider that the intended performance target for designs using the ARM ISA has increased dramatically since ARMv7 was defined back in 2005.

Exophase · Sep 23, 2013

Khato said:
Definitely sounds like a reasonable theory. Especially when you consider that the intended performance target for designs using the ARM ISA has increased dramatically since ARMv7 was defined back in 2005.

My original sentence was totally broken >_> What I meant to say was that compilers targeting ARMv7 had poor register allocation, not the ISA itself. That's not going to be an ISA problem, and v7 didn't change much vs v6 outside of adding NEON.

It's also possible that auto-vectorization is happening more now, but I don't really see why it would, outside of double-precision float stuff where it wasn't supported at all until now.

Nothingness · Sep 23, 2013

Exophase said:
Your comment did make me think of something obvious, which for some reason I never really thought about earlier.. Apple is showing a decent performance boost from a number of tests in moving to ARMv8 (way more than I expected, not counting the ones where there's new acceleration like AES). Apple's using totally open source compilers for this. So when other ARM CPUs move to 64-bit they could also receive a performance boost. Will have to wait and see how much GCC shows it, since I don't see the NDK moving to Clang, although you could probably substitute it in if you really want.

Two things: Apple has not released source code and they don't have to (though they said they will); the NDK comes with both gcc and clang.

mavere · Sep 23, 2013

Not sure if this has been posted, but Futuremark has indicated they'll be looking into the 5s and its surprising CPU/Physics score.

Khato · Sep 23, 2013

Exophase said:
My original sentence was totally broken >_> What I meant to say was that compilers targeting ARMv7 had poor register allocation, not the ISA itself. That's not going to be an ISA problem, and v7 didn't change much vs v6 outside of adding NEON.

Whereas I took what you said as intending to mean that ARMv7 (and technically all that came before it) made trade-offs favoring power rather than performance. Misinterpretations all around, haha.

mavere said:
Not sure if this has been posted, but Futuremark has indicated they'll be looking into the 5s and its surprising CPU/Physics score.

Maybe they'll turn up something interesting. I have my theory about the reason for the lack of performance improvement, but it's all just speculation based on the information currently available. Recall that the LG G2's 3dmark physics score showed the largest decrease in performance compared to the MDP and all indications are that such was the result of reduced frequencies to keep power output in check - the same thing may very well be happening with the iPhone 5s. Either in the form of going below the supposed 1.3 GHz frequency or dropping back down to that from a 'turbo' state. As said, just speculation that I'm not at all confident in... but it's still fun to share

Exophase · Sep 23, 2013

Nothingness said:
Two things: Apple has not released source code and they don't have to (though they said they will); the NDK comes with both gcc and clang.

Wow, I'm amazed to find Clang and LLVM are under BSD. I had no idea. Hope Apple follows through on that, then. Good tip on NDK, thanks (I've never heard of anyone using Clang with it - do you have any report on how it does?)

jfpoole · Sep 24, 2013

Exophase said:
Wow, I'm amazed to find Clang and LLVM are under BSD. I had no idea. Hope Apple follows through on that, then. Good tip on NDK, thanks (I've never heard of anyone using Clang with it - do you have any report on how it does?)

Using Clang with the NDK is pretty straightforward, and I'm not aware of any major Clang-specific NDK issues. I believe it took us less than an hour to switch our Android build from GCC to Clang, and everything Just Worked after the switch.

That said, we did see a performance drop after switching to Clang. I can't remember what the magnitude of the drop was, though. If there's interest I can see if I can dig up some approximate numbers.

Ajay · Sep 24, 2013

Exophase said:
I'm surprised because I really just didn't think that ARMv8 offered a lot that'd give a performance improvement. People have been saying for a long time that there's not that much performance to be exploited in differences between reasonable "sane" ISAs (x86 vs ARM being the common point of discussion). ARMv8 has some nice features but it isn't that exotic, much of it is paring back ARM's more esoteric features to more limited versions that are easier to implement in hardware.

The only real big performance enhancing feature would be the move from 15 to 31 GPRs. AMD did studies years ago that showed that going beyond 16 didn't give you more than a few percent improvement. But it's possible that those studies weren't that universal (for instance, slanted by x86's load+op instructions). What I'm most suspicious of is that this study reveals the ARMv7 had subpar register allocation.

Well, I think it's pretty intuitive that a simple recompile with a half decent compiler on a load/store architecture would see significant improvements in performance with more GPRs simply by the reduction of loads and stores with a higher data locality (fewer cache read/writes). This should be true up to least up to some limit, but that limit may change based on changes of some architectural or internal performance metric. I would imagine that that was the case with whatever tests AMD did. Within the parameters of a certain limited number of variables, the external performance of the processor didn't improve beyond having 16 registers. It's like DEC's old research showing that the external performance didn't improve beyond having 4 hardware threads per core for the Alpha architecture. Now the newer Sparc processors have what, 8-16 threads per core?

What's unknown to me is that internally, modern x86 became more RISC like, executing micro-ops, etc. as you know; but what does the internal register architecture of a core look like? Was that streamlined as the x86 RISC core was, or was it left as is to avoid excessively complicated register remapping.

Sadly I didn't really kept up on this stuff for more than a few years after leaving embedded development circa 1999 and my copy of "Patterson and Hennessy" is probably the second edition. That and I haven't done any debugging on assembly level x86-64 code, so I can't really add any modern observations to the question at hand. The does make me want to buy the latest edition of "Computer Architecture: A Quantitative Approach".

Exophase · Sep 24, 2013

Ajay said:
Well, I think it's pretty intuitive that a simple recompile with a half decent compiler on a load/store architecture would see significant improvements in performance with more GPRs simply by the reduction of loads and stores with a higher data locality (fewer cache read/writes). This should be true up to least up to some limit, but that limit may change based on changes of some architectural or internal performance metric. I would imagine that that was the case with whatever tests AMD did. Within the parameters of a certain limited number of variables, the external performance of the processor didn't improve beyond having 16 registers. It's like DEC's old research showing that the external performance didn't improve beyond having 4 hardware threads per core for the Alpha architecture. Now the newer Sparc processors have what, 8-16 threads per core?

What's unknown to me is that internally, modern x86 became more RISC like, executing micro-ops, etc. as you know; but what does the internal register architecture of a core look like? Was that streamlined as the x86 RISC core was, or was it left as is to avoid excessively complicated register remapping.

Sadly I didn't really kept up on this stuff for more than a few years after leaving embedded development circa 1999 and my copy of "Patterson and Hennessy" is probably the second edition. That and I haven't done any debugging on assembly level x86-64 code, so I can't really add any modern observations to the question at hand. The does make me want to buy the latest edition of "Computer Architecture: A Quantitative Approach".

While I could see load/store ISAs benefiting from more registers more than x86, the improvements would still at least track. x86 is far from immune to the effects of spills.

I think I've seen 8 threads per core on SPARC.. but that's a good point. Research like this will be limited by some assumptions made in the hardware, limitations of what they can actually simulate, and quality of the software. With that in mind, I could see these influences playing a role when AMD did the study vs now:

1) Compilers had fewer things eligible to allocate in registers because intra-function analysis was inferior to today's
2) The compilers they used actually had better local register allocation for x86 than Clang had for ARM today, resulting in fewer spills
3) The stuff where Geekbench wins just happens to have more complex local working sets needing more registers, than whatever AMD tested

I guess I've been inundated with this a lot over the past few years, a lot of CPU designers and others very technically familiar with the topic have commented on 32 registers being overkill for modern OoO processors. It's been popular opinion that it's there for better performance on A53 (that needs more architectural registers to rename in software for scheduling purposes). But results with Apple are painting a very different story.

Ajay · Sep 25, 2013

Exophase said:
While I could see load/store ISAs benefiting from more registers more than x86, the improvements would still at least track. x86 is far from immune to the effects of spills.

I think I've seen 8 threads per core on SPARC.. but that's a good point. Research like this will be limited by some assumptions made in the hardware, limitations of what they can actually simulate, and quality of the software. With that in mind, I could see these influences playing a role when AMD did the study vs now:

1) Compilers had fewer things eligible to allocate in registers because intra-function analysis was inferior to today's
2) The compilers they used actually had better local register allocation for x86 than Clang had for ARM today, resulting in fewer spills
3) The stuff where Geekbench wins just happens to have more complex local working sets needing more registers, than whatever AMD tested

I guess I've been inundated with this a lot over the past few years, a lot of CPU designers and others very technically familiar with the topic have commented on 32 registers being overkill for modern OoO processors. It's been popular opinion that it's there for better performance on A53 (that needs more architectural registers to rename in software for scheduling purposes). But results with Apple are painting a very different story.

Thanks for the interesting comments. Personally, I don't see how OoOE would reduce the number of GPRs required**. The spill rate might be lower with OoOE since that will actively take better advantage of available local resources, minimizing delays waiting for registers to be filled or written out (as would be the case with In Order), but I would think that well optimized compilers would anticipate that and create a more complex local working set to optimize performance (for application code where this sort of optimization would work, hence the great variation in result on the A7).

That said, I don't know what the research is on this, but the A7 indicates that 32 GPRs and OoOE works fine. I would imagine Apple/ARM has the data to back up the use of 32 GPRs, otherwise why would they have gone with the added complexity in silicon?

** This could be a totally different story for x86, and I imagine most of the conversation regarding OoOE and register count would be concerning x86, even though it is no longer the dominant architecture numerically - it still garners a lot on conversation since it is the dominate high end CPU.

Exophase · Sep 25, 2013

Ajay said:
Thanks for the interesting comments. Personally, I don't see how OoOE would reduce the number of GPRs required**. The spill rate might be lower with OoOE since that will actively take better advantage of available local resources, minimizing delays waiting for registers to be filled or written out (as would be the case with In Order), but I would think that well optimized compilers would anticipate that and create a more complex local working set to optimize performance (for application code where this sort of optimization would work, hence the great variation in result on the A7).

That said, I don't know what the research is on this, but the A7 indicates that 32 GPRs and OoOE works fine. I would imagine Apple/ARM has the data to back up the use of 32 GPRs, otherwise why would they have gone with the added complexity in silicon?

** This could be a totally different story for x86, and I imagine most of the conversation regarding OoOE and register count would be concerning x86, even though it is no longer the dominant architecture numerically - it still garners a lot on conversation since it is the dominate high end CPU.

Okay, think about how a modern OoOE processor has several more physical registers than architectural registers in order to facilitate register renaming. An in-order processor without register renaming needs to have renaming done in software. Renaming is still needed in order to avoid false dependencies to improve opportunities for instruction-level parallelism - even if this parallelism is obtained by compile-time scheduling.

This isn't just about reordering instructions that are near each other - if you unroll or software pipeline loops in order to increase opportunities for parallelism register pressure can go up dramatically. Take a very simplistic example - you have a tight loop where every instruction is dependent on the previous one, and let's say they're single cycle latency and the CPU could dual issue all of these instructions (and the loop iterations are independent). You'd need to unroll the loop x2 and interleave each instruction to exploit ILP, and the register pressure could as much as double if all of the registers were used as temporaries local to the loop body.

There isn't really a question of why Apple would support 32 registers, because they're supporting AArch64 which uses 32 registers. They could have gone with their own custom ISA but that's disadvantageous for a variety of reasons (having to go through the design effort itself, losing validation and documentation support from ARM, not being able to benefit from other people developing compilers for it..) So the question is really more why ARM decided 32 GPRs than why Apple did, and since ARM has broader market interests than Apple does they're likely to have broader justifications for this. This extends to providing good performance on in-order processors like Cortex-A53.

But the performance advantages with AArch64 on A7 are real, and they're probably due to the increase in GPRs more than anything, so it seems like a win for them either way.

Ajay · Sep 27, 2013

Posted earlier, but an "expired token" took my post with it.

Thanks for the very good example on loop unrolling and it's implicit demand for more registers and why Apple implemented 32 GPRs in silicon, because it was the most sensible thing to do when implementing AArch64. It's unlikely that Apple has a large enough team, or the inclination to build a larger team, to doing any research on the optimal number of registers or any other CPU component when ARM is already doing that.

Nec_V20 · Sep 27, 2013

bullzz said:
this is unbelievable. 1 major step towards integrating mac os with ios

There is no such thing as an " Apple A7", Apple just gets others to make something for them to their specs. They have no fabs and no expertise whatsoever in designing processors.

Ajay · Sep 27, 2013

Nec_V20 said:
There is no such thing as an " Apple A7", Apple just gets others to make something for them to their specs. They have no fabs and no expertise whatsoever in designing processors.

Uhm, you forgot the j/k that or, what planet are you from?

Apple A7 is now 64-bit

Golden Member

Diamond Member

Lifer

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Golden Member

Diamond Member

Golden Member

Diamond Member

Diamond Member

Member

Golden Member

Diamond Member

Member

Lifer

Diamond Member

Lifer

Diamond Member

Lifer

Senior member

Lifer